Assignment #4 (demo). Exploring OLS, Lasso and Random Forest in a regression task

Author: Yury Kashnitsky. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.

Same assignment as a Kaggle Notebook + solution.

[Image: wine_quality.jpg]

Fill in the missing code and choose answers in this web form.

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.preprocessing import StandardScaler

We are working with the UCI Wine Quality dataset (no need to download it: it is already available in the course repo and as a Kaggle Dataset).

# For the Jupyter Book, we read the data from GitHub.
# When running locally, you can instead point DATA_PATH at the data/ folder
# in the root of your cloned https://github.com/Yorko/mlcourse.ai repo
# to save Internet traffic.
DATA_PATH = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
data = pd.read_csv(DATA_PATH + "winequality-white.csv", sep=";")
data.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6
1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6
2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6
3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB

Separate the target feature (quality), split the data in a 7:3 proportion (30% forms the holdout set; use random_state=17), and scale the features with StandardScaler.

# y = None

# X_train, X_holdout, y_train, y_holdout = train_test_split
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform
# X_holdout_scaled = scaler.transform
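As an illustration of the split-then-scale order described above, here is a minimal sketch on synthetic data. It is not the assignment solution: `X_demo`, `y_demo`, and the other names are hypothetical. The point it shows is that the scaler is fitted on the training part only and then reused on the holdout part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and a target (hypothetical data).
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(size=200)

# Hold out 30% of the rows; a fixed random_state makes the split reproducible.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)

# Fit the scaler on the training part only, then reuse its statistics
# for the holdout part so no holdout information leaks into training.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_ho_scaled = scaler_demo.transform(X_ho)

print(X_tr_scaled.shape, X_ho_scaled.shape)
print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per column
```

The holdout columns will generally not have exactly zero mean and unit variance, because they are transformed with the training statistics.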

Linear regression

Train a simple linear regression model (Ordinary Least Squares).

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# linreg =
# linreg.fit

Question 1: What are the mean squared errors of the model's predictions on the train and holdout sets?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# print("Mean squared error (train): %.3f"
# print("Mean squared error (test): %.3f" %
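For reference, a minimal sketch of fitting OLS and comparing train and holdout MSE. It uses synthetic data and hypothetical names (`ols_demo`, `mse_train`, `mse_holdout`) and a plain slice instead of `train_test_split`, so it illustrates the mechanics only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression problem (hypothetical data, noise std = 0.5).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Simple 140/60 split by position, for brevity.
X_tr, X_ho = X_demo[:140], X_demo[140:]
y_tr, y_ho = y_demo[:140], y_demo[140:]

ols_demo = LinearRegression().fit(X_tr, y_tr)

# mean_squared_error takes (y_true, y_pred).
mse_train = mean_squared_error(y_tr, ols_demo.predict(X_tr))
mse_holdout = mean_squared_error(y_ho, ols_demo.predict(X_ho))
print("MSE (train): %.3f" % mse_train)
print("MSE (holdout): %.3f" % mse_holdout)
```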

Sort the features by their influence on the target feature (wine quality). Note that both large positive and large negative coefficients indicate a strong influence on the target, so rank them by absolute value. A pandas.DataFrame is handy here.

Question 2: Which feature does this linear regression model treat as the most influential on wine quality?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# linreg_coef = pd.DataFrame
# linreg_coef.sort_values
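One possible way to rank coefficients by magnitude with pandas is sketched below. The coefficient values and feature names (`feat_a` and so on) are made up for illustration and are not taken from the wine data.

```python
import pandas as pd

# Hypothetical coefficients of a model fitted on standardized features.
coef_demo = pd.DataFrame(
    {"coef": [0.40, -0.95, 0.02, -0.10]},
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
)

# Rank by absolute value so that large negative coefficients count too.
coef_demo["abs_coef"] = coef_demo["coef"].abs()
ranked = coef_demo.sort_values(by="abs_coef", ascending=False)
print(ranked)
```

Comparing coefficient magnitudes like this is only meaningful when the features share a common scale, which is why the model is trained on scaled data.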

Lasso regression

Train a LASSO model with \(\alpha = 0.01\) (weak regularization) on the scaled data. Again, set random_state=17.

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# lasso1 = Lasso
# lasso1.fit

Which feature is the least informative in predicting wine quality, according to this LASSO model?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# lasso1_coef = pd.DataFrame
# lasso1_coef.sort_values
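To see how the regularization strength affects LASSO coefficients, the following sketch fits two models on synthetic data where only the first two features carry signal. The data and the names `weak` and `strong` are assumptions for illustration; the second value, `alpha=1.0`, is chosen arbitrarily as a "strong" setting.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: features 0 and 1 matter, features 2-4 are pure noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

weak = Lasso(alpha=0.01, random_state=17).fit(X_demo, y_demo)
strong = Lasso(alpha=1.0, random_state=17).fit(X_demo, y_demo)

print("alpha=0.01:", weak.coef_.round(3))
print("alpha=1.0: ", strong.coef_.round(3))
print("zeroed at alpha=0.01:", int(np.sum(weak.coef_ == 0)))
print("zeroed at alpha=1.0: ", int(np.sum(strong.coef_ == 0)))
```

A feature whose coefficient is driven to exactly zero is effectively dropped by the model, which is one way to read "least informative" here.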

Train LassoCV with random_state=17 to choose the best value of \(\alpha\) in 5-fold cross-validation.

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# alphas = np.logspace(-6, 2, 200)
# lasso_cv = LassoCV
# lasso_cv.fit
# lasso_cv.alpha_
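A minimal sketch of selecting \(\alpha\) with `LassoCV` on synthetic data follows. The grid `alphas_demo` is deliberately smaller than the one in the cell above, and all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (hypothetical): two informative features, three noise features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Candidate regularization strengths on a log scale.
alphas_demo = np.logspace(-4, 1, 30)

# 5-fold cross-validation over the candidate alphas.
lasso_cv_demo = LassoCV(alphas=alphas_demo, cv=5, random_state=17)
lasso_cv_demo.fit(X_demo, y_demo)

print("chosen alpha:", lasso_cv_demo.alpha_)
# mse_path_ holds the per-fold MSE for every candidate alpha.
print("mse_path_ shape:", lasso_cv_demo.mse_path_.shape)
```

After fitting, the estimator is refit on the full training data with the chosen `alpha_`, so `coef_` and `predict` can be used directly.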

Question 3: Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# lasso_cv_coef = pd.DataFrame
# lasso_cv_coef.sort_values

Question 4: What are the mean squared errors of the tuned LASSO model's predictions on the train and holdout sets?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# print("Mean squared error (train): %.3f"
# print("Mean squared error (test): %.3f" %

Random Forest

Train a Random Forest with out-of-the-box parameters, setting only random_state to be 17.

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# forest = RandomForestRegressor
# forest.fit

Question 5: What are the mean squared errors of the RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values), and on the holdout set?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# print("Mean squared error (train): %.3f" %
# print("Mean squared error (cv): %.3f" %
# print("Mean squared error (test): %.3f" %
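Because `scoring='neg_mean_squared_error'` returns negated values (scikit-learn scorers follow a "higher is better" convention), the sign has to be flipped to obtain an MSE. A small sketch on synthetic data, with hypothetical names and a reduced `n_estimators` to keep it fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic non-linear regression problem (hypothetical data).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=150)

forest_demo = RandomForestRegressor(n_estimators=20, random_state=17)

# One negated MSE per fold; cv is left at its default value.
neg_mse = cross_val_score(
    forest_demo, X_demo, y_demo, scoring="neg_mean_squared_error"
)
cv_mse = -neg_mse.mean()

# MSE on the data the forest was trained on, for comparison.
forest_demo.fit(X_demo, y_demo)
train_mse = mean_squared_error(y_demo, forest_demo.predict(X_demo))

print("MSE (train): %.3f" % train_mse)
print("MSE (cv): %.3f" % cv_mse)
```

A training MSE far below the cross-validation MSE is typical for a random forest with fully grown trees and is worth keeping in mind when interpreting the three numbers.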

Tune the max_depth, min_samples_leaf, and max_features hyperparameters with GridSearchCV, and again check the mean cross-validation MSE and the MSE on the holdout set.

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# forest_params = {'max_depth': list(range(10, 25)),
#                  'min_samples_leaf': list(range(1, 8)),
#                  'max_features': list(range(6,12))}

# locally_best_forest = GridSearchCV
# locally_best_forest.fit
# locally_best_forest.best_params_, locally_best_forest.best_score_
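The sketch below shows the general shape of a grid search over random forest hyperparameters. It uses synthetic data from `make_regression`, a tiny grid, and 3 folds so that it runs quickly; these settings and the names `grid_demo` and `search_demo` are assumptions for illustration, not the assignment's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (hypothetical data).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=6, noise=5.0, random_state=0
)

# Deliberately tiny grid: 2 x 2 = 4 candidate settings.
grid_demo = {"max_depth": [3, 6], "max_features": [2, 4]}

search_demo = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=17),
    grid_demo,
    scoring="neg_mean_squared_error",
    cv=3,
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
# best_score_ is a negated MSE, so flip the sign.
print("best CV MSE: %.3f" % -search_demo.best_score_)
```

The grid in the cell above spans 15 × 7 × 6 = 630 combinations, each fitted once per fold, so passing `n_jobs=-1` to `GridSearchCV` may be worthwhile to use all available cores.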

Question 6: What are the mean squared errors of the tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values) and on the holdout set?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# print("Mean squared error (cv): %.3f" %
# print("Mean squared error (test): %.3f" %

Output the RF's feature importances. Again, it's nice to present them as a DataFrame.

Question 7: What is the most important feature, according to the Random Forest model?

# (read-only in a JupyterBook, pls run jupyter-notebook to edit)
# rf_importance = pd.DataFrame  
# rf_importance.sort_values  
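As a sketch of how feature importances can be tabulated, the example below trains a small forest on synthetic data in which one column dominates the target by construction. The column names (`strong`, `weak`, `noise`) and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: "strong" drives the target, "weak" contributes a little,
# "noise" is unrelated to it.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"]
)
y_demo = (
    5.0 * X_demo["strong"]
    + 0.5 * X_demo["weak"]
    + rng.normal(scale=0.1, size=300)
)

forest_demo = RandomForestRegressor(n_estimators=30, random_state=17)
forest_demo.fit(X_demo, y_demo)

# feature_importances_ is ordered like the training columns and sums to 1.
importance_demo = pd.DataFrame(
    {"importance": forest_demo.feature_importances_}, index=X_demo.columns
).sort_values(by="importance", ascending=False)
print(importance_demo)
```

Unlike linear coefficients, these impurity-based importances are non-negative and carry no sign, so they indicate how much a feature is used, not the direction of its effect.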

Draw conclusions about the performance of the three explored models in this particular prediction task.