Assignment #4 (demo). Exploring OLS, Lasso and Random Forest in a regression task. Solution#
Author: Yury Kashnitsky. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.
Same assignment as a Kaggle Notebook + solution.
Fill in the missing code and choose answers in this web form.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.preprocessing import StandardScaler
We are working with the UCI Wine Quality dataset (no need to download it – it is already available in the course repo and as a Kaggle Dataset).
# for the Jupyter-book version, we read data directly from GitHub;
# alternatively, specify the data/ folder from the root of your cloned
# https://github.com/Yorko/mlcourse.ai repo to save Internet traffic
DATA_PATH = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
data = pd.read_csv(DATA_PATH + "winequality-white.csv", sep=";")
data.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 4898 non-null float64
1 volatile acidity 4898 non-null float64
2 citric acid 4898 non-null float64
3 residual sugar 4898 non-null float64
4 chlorides 4898 non-null float64
5 free sulfur dioxide 4898 non-null float64
6 total sulfur dioxide 4898 non-null float64
7 density 4898 non-null float64
8 pH 4898 non-null float64
9 sulphates 4898 non-null float64
10 alcohol 4898 non-null float64
11 quality 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
Separate the target feature, split the data in a 7:3 proportion (30% forms the holdout set, use random_state=17), and preprocess it with StandardScaler.
y = data["quality"]
X = data.drop("quality", axis=1)
X_train, X_holdout, y_train, y_holdout = train_test_split(
X, y, test_size=0.3, random_state=17
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_holdout_scaled = scaler.transform(X_holdout)
Linear regression#
Train a simple linear regression model (Ordinary Least Squares).
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train);
Question 1: What are the mean squared errors of model predictions on the train and holdout sets?
print(
"Mean squared error (train): %.3f"
% mean_squared_error(y_train, linreg.predict(X_train_scaled))
)
print(
"Mean squared error (test): %.3f"
% mean_squared_error(y_holdout, linreg.predict(X_holdout_scaled))
)
Mean squared error (train): 0.558
Mean squared error (test): 0.584
Sort the features by their influence on the target (wine quality). Keep in mind that both large positive and large negative coefficients imply a strong influence on the target. It's handy to use a pandas.DataFrame here.
Question 2: Which feature does this linear regression model treat as the most influential on wine quality?
linreg_coef = pd.DataFrame(
{"coef": linreg.coef_, "coef_abs": np.abs(linreg.coef_)},
index=data.columns.drop("quality"),
)
linreg_coef.sort_values(by="coef_abs", ascending=False)
| | coef | coef_abs |
|---|---|---|
density | -0.665720 | 0.665720 |
residual sugar | 0.538164 | 0.538164 |
volatile acidity | -0.192260 | 0.192260 |
pH | 0.150036 | 0.150036 |
alcohol | 0.129533 | 0.129533 |
fixed acidity | 0.097822 | 0.097822 |
sulphates | 0.062053 | 0.062053 |
free sulfur dioxide | 0.042180 | 0.042180 |
total sulfur dioxide | 0.014304 | 0.014304 |
chlorides | 0.008127 | 0.008127 |
citric acid | -0.000183 | 0.000183 |
Lasso regression#
Train a LASSO model with \(\alpha = 0.01\) (weak regularization) on the scaled data. Again, set random_state=17.
lasso1 = Lasso(alpha=0.01, random_state=17)
lasso1.fit(X_train_scaled, y_train)
Lasso(alpha=0.01, random_state=17)
Which feature is the least informative in predicting wine quality, according to this LASSO model?
lasso1_coef = pd.DataFrame(
{"coef": lasso1.coef_, "coef_abs": np.abs(lasso1.coef_)},
index=data.columns.drop("quality"),
)
lasso1_coef.sort_values(by="coef_abs", ascending=False)
| | coef | coef_abs |
|---|---|---|
alcohol | 0.322425 | 0.322425 |
residual sugar | 0.256363 | 0.256363 |
density | -0.235492 | 0.235492 |
volatile acidity | -0.188479 | 0.188479 |
pH | 0.067277 | 0.067277 |
free sulfur dioxide | 0.043088 | 0.043088 |
sulphates | 0.029722 | 0.029722 |
chlorides | -0.002747 | 0.002747 |
fixed acidity | -0.000000 | 0.000000 |
citric acid | -0.000000 | 0.000000 |
total sulfur dioxide | -0.000000 | 0.000000 |
Train LassoCV with random_state=17 to choose the best value of \(\alpha\) in 5-fold cross-validation.
alphas = np.logspace(-6, 2, 200)
lasso_cv = LassoCV(random_state=17, cv=5, alphas=alphas)
lasso_cv.fit(X_train_scaled, y_train)
LassoCV(alphas=array([1.00000000e-06, 1.09698580e-06, 1.20337784e-06, 1.32008840e-06, 1.44811823e-06, 1.58856513e-06, 1.74263339e-06, 1.91164408e-06, 2.09704640e-06, 2.30043012e-06, 2.52353917e-06, 2.76828663e-06, 3.03677112e-06, 3.33129479e-06, 3.65438307e-06, 4.00880633e-06, 4.39760361e-06, 4.82410870e-06, 5.29197874e-06, 5.80522552e-06, 6.36824994e-06, 6.98587975e-0... 1.18953407e+01, 1.30490198e+01, 1.43145894e+01, 1.57029012e+01, 1.72258597e+01, 1.88965234e+01, 2.07292178e+01, 2.27396575e+01, 2.49450814e+01, 2.73644000e+01, 3.00183581e+01, 3.29297126e+01, 3.61234270e+01, 3.96268864e+01, 4.34701316e+01, 4.76861170e+01, 5.23109931e+01, 5.73844165e+01, 6.29498899e+01, 6.90551352e+01, 7.57525026e+01, 8.30994195e+01, 9.11588830e+01, 1.00000000e+02]), cv=5, random_state=17)
lasso_cv.alpha_
np.float64(0.0002833096101839324)
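To see how the cross-validation error varies with \(\alpha\), here is a minimal sketch (not part of the original assignment) that relies on the fitted LassoCV attributes alphas_ and mse_path_:
# alphas_ holds the alpha grid used for fitting, mse_path_ the MSE for each
# (alpha, fold) pair; averaging over folds gives the cross-validation curve
cv_curve = pd.DataFrame(
    {"alpha": lasso_cv.alphas_, "mean_cv_mse": lasso_cv.mse_path_.mean(axis=1)}
)
cv_curve.sort_values(by="mean_cv_mse").head()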
Question 3: Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?
lasso_cv_coef = pd.DataFrame(
{"coef": lasso_cv.coef_, "coef_abs": np.abs(lasso_cv.coef_)},
index=data.columns.drop("quality"),
)
lasso_cv_coef.sort_values(by="coef_abs", ascending=False)
| | coef | coef_abs |
|---|---|---|
density | -0.648161 | 0.648161 |
residual sugar | 0.526883 | 0.526883 |
volatile acidity | -0.192049 | 0.192049 |
pH | 0.146549 | 0.146549 |
alcohol | 0.137115 | 0.137115 |
fixed acidity | 0.093295 | 0.093295 |
sulphates | 0.060939 | 0.060939 |
free sulfur dioxide | 0.042698 | 0.042698 |
total sulfur dioxide | 0.012969 | 0.012969 |
chlorides | 0.006933 | 0.006933 |
citric acid | -0.000000 | 0.000000 |
Question 4: What are the mean squared errors of the tuned LASSO predictions on the train and holdout sets?
print(
"Mean squared error (train): %.3f"
% mean_squared_error(y_train, lasso_cv.predict(X_train_scaled))
)
print(
"Mean squared error (test): %.3f"
% mean_squared_error(y_holdout, lasso_cv.predict(X_holdout_scaled))
)
Mean squared error (train): 0.558
Mean squared error (test): 0.583
Random Forest#
Train a Random Forest with out-of-the-box parameters, setting only random_state=17.
forest = RandomForestRegressor(random_state=17)
forest.fit(X_train_scaled, y_train)
RandomForestRegressor(random_state=17)
Question 5: What are the mean squared errors of the RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values), and on the holdout set?
print(
"Mean squared error (train): %.3f"
% mean_squared_error(y_train, forest.predict(X_train_scaled))
)
print(
"Mean squared error (cv): %.3f"
% np.mean(
np.abs(
cross_val_score(
forest, X_train_scaled, y_train, scoring="neg_mean_squared_error"
)
)
)
)
print(
"Mean squared error (test): %.3f"
% mean_squared_error(y_holdout, forest.predict(X_holdout_scaled))
)
Mean squared error (train): 0.053
Mean squared error (cv): 0.414
Mean squared error (test): 0.372
Tune the max_features and max_depth hyperparameters with GridSearchCV and again check the mean cross-validation MSE and the MSE on the holdout set.
forest_params = {"max_depth": list(range(10, 25)), "max_features": list(range(6, 12))}
locally_best_forest = GridSearchCV(
RandomForestRegressor(n_jobs=-1, random_state=17),
forest_params,
scoring="neg_mean_squared_error",
n_jobs=-1,
cv=5,
verbose=True,
)
locally_best_forest.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 90 candidates, totalling 450 fits
GridSearchCV(cv=5, estimator=RandomForestRegressor(n_jobs=-1, random_state=17), n_jobs=-1, param_grid={'max_depth': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], 'max_features': [6, 7, 8, 9, 10, 11]}, scoring='neg_mean_squared_error', verbose=True)
locally_best_forest.best_params_, locally_best_forest.best_score_
({'max_depth': 21, 'max_features': 6}, np.float64(-0.39773288191505934))
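Note that best_score_ is the negated MSE, since scoring='neg_mean_squared_error' was used. As a quick sanity check (a small sketch, not part of the original assignment):
# best_score_ is the negative MSE, so flip the sign to recover the CV MSE
print("Best CV MSE: %.3f" % -locally_best_forest.best_score_)
This agrees with the cross-validation estimate computed below.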
Question 6: What are the mean squared errors of the tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values) and on the holdout set?
print(
"Mean squared error (cv): %.3f"
% np.mean(
np.abs(
cross_val_score(
locally_best_forest.best_estimator_,
X_train_scaled,
y_train,
scoring="neg_mean_squared_error",
)
)
)
)
print(
"Mean squared error (test): %.3f"
% mean_squared_error(y_holdout, locally_best_forest.predict(X_holdout_scaled))
)
Mean squared error (cv): 0.398
Mean squared error (test): 0.366
Output the Random Forest's feature importances. Again, it's convenient to present them as a DataFrame.
Question 7: What is the most important feature, according to the Random Forest model?
rf_importance = pd.DataFrame(
locally_best_forest.best_estimator_.feature_importances_,
columns=["coef"],
index=data.columns[:-1],
)
rf_importance.sort_values(by="coef", ascending=False)
| | coef |
|---|---|
alcohol | 0.206056 |
volatile acidity | 0.117578 |
free sulfur dioxide | 0.111556 |
density | 0.088549 |
pH | 0.073659 |
total sulfur dioxide | 0.073640 |
chlorides | 0.073366 |
residual sugar | 0.072072 |
citric acid | 0.062601 |
fixed acidity | 0.061813 |
sulphates | 0.059111 |
Draw conclusions about the performance of the three explored models in this particular prediction task.
The dependency of wine quality on the other features at hand is presumably non-linear, so Random Forest performs better in this task.
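For reference, a small recap sketch of the holdout MSEs reported above (values copied from the cell outputs):
# holdout MSEs taken from the outputs above, collected for a side-by-side view
pd.DataFrame(
    {"holdout MSE": [0.584, 0.583, 0.366]},
    index=["Linear regression", "Tuned LASSO", "Tuned Random Forest"],
)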