Skip to content

possible change to pd.cut instead of pd.qcut to match Figure 7.2 #42

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 14 additions & 13 deletions Ch07-nonlin-lab.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -300,29 +300,30 @@ value do not cover each other up. This type of plot is often called a
*rug plot*.

In order to fit a step function, as discussed in
Section~\ref{Ch7:sec:scolstep-function}, we first use the `pd.qcut()`
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
Section~\ref{Ch7:sec:scolstep-function}, we first use the `pd.cut()`
function to discretize `age` into bins of equal width. Then we use `pd.get_dummies()` to create the
columns of the model matrix for this categorical variable. Note that this function will
include *all* columns for a given categorical, rather than the usual approach which drops one
of the levels.

```{python}
cut_age = pd.qcut(age, 4)
cut_age = pd.cut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())

```


Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
The function `pd.qcut()` (and `pd.cut()`) returns an ordered categorical variable.
The regression model then creates a set of
dummy variables for use in the regression. Since `age` is the only variable in the model, the value $94,158.40 is the average salary for those under 33.75 years of
age, and the other coefficients are the average
salary for those in the other age groups. We can produce
predictions and plots just as we did in the case of the polynomial
fit.
Here `pd.cut()` automatically picked the bins to be of equal
length. We could also have specified our own bins directly
instead of the argument `4`. For cuts based on quantiles we would
use the `pd.qcut()` function. The function `pd.cut()` (and
`pd.qcut()`) returns an ordered categorical variable. The regression
model then creates a set of dummy variables for use in the
regression. Since `age` is the only variable in the model, the value
$94,158.40 is the average salary for those under 33.75 years of age,
and the other coefficients are the average salary for those in the
other age groups. We can produce predictions and plots just as we did
in the case of the polynomial fit.


## Splines
Expand Down