mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitskiy (@yorko). Edited by Anna Tarelina (@feuerengel). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #3. Fall 2018

Decision trees for classification and regression

In this assignment, we will find out how a decision tree works in a regression task, then build and tune classification decision trees for identifying heart diseases. Fill in the missing code in the cells marked "You code here" and answer the questions in the web form.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

1. A simple example of regression using decision trees

Let's consider the following one-dimensional regression problem. We need to build a function $a(x)$ that approximates the original dependency $y = f(x)$ by minimizing the mean squared error: $\min \sum_i {(a(x_i) - f(x_i))}^2$.

In [2]:
X = np.linspace(-2, 2, 7)
y = X ** 3

plt.scatter(X, y)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');

Let's build the decision tree in several steps, choosing symmetric thresholds of 0, 1.5, and -1.5 for partitioning. In a regression task, a leaf outputs the mean answer over all observations in that leaf.

Let's start with a tree of depth 0 that contains all training observations. What will the predictions of this tree look like for $x \in [-2, 2]$? Create the appropriate plot using pen, paper, and Python if needed (without using sklearn).

In [3]:
# You code here
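A minimal sketch of one possible solution (not necessarily the intended one): a depth-0 tree has a single leaf, so it predicts the mean of y for every x.

# a depth-0 tree predicts the mean of y over all training observations
xx = np.linspace(-2, 2, 100)
prediction = np.mean(y) * np.ones_like(xx)

plt.scatter(X, y)
plt.plot(xx, prediction, 'g')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');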

Let's split the data according to the condition $[x < 0]$. This gives us a tree of depth 1 with two leaves. Create a similar plot for the predictions of this tree.

In [4]:
# You code here
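Again, a possible sketch: each of the two leaves predicts the mean of y over its own observations.

xx = np.linspace(-2, 2, 200)
left_mean, right_mean = np.mean(y[X < 0]), np.mean(y[X >= 0])
prediction = np.where(xx < 0, left_mean, right_mean)

plt.scatter(X, y)
plt.plot(xx, prediction, 'g')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');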

In the decision tree algorithm, the feature and the threshold for splitting are chosen according to some criterion. The commonly used criterion for regression is based on variance: $$\large Q(X, y, j, t) = D(X, y) - \dfrac{|X_l|}{|X|} D(X_l, y_l) - \dfrac{|X_r|}{|X|} D(X_r, y_r),$$ where $\large X$ and $\large y$ are the feature matrix and the target vector (correspondingly) for training instances in the current node, $\large X_l, y_l$ and $\large X_r, y_r$ are the two parts of the sample $\large X, y$ obtained by the split $\large [x_j < t]$ (by the $\large j$-th feature and threshold $\large t$), $\large |X|$, $\large |X_l|$, $\large |X_r|$ (or, equivalently, $\large |y|$, $\large |y_l|$, $\large |y_r|$) are the sizes of the corresponding samples, and $\large D(X, y)$ is the variance of the answers $\large y$ over all instances in $\large X$: $$\large D(X, y) = \dfrac{1}{|X|} \sum_{j=1}^{|X|}\left(y_j - \dfrac{1}{|X|}\sum_{i = 1}^{|X|}y_i\right)^2$$ Here $\large y_i = y(x_i)$ is the answer for the instance $\large x_i$. The feature index $\large j$ and threshold $\large t$ are chosen to maximize the value of the criterion $\large Q(X, y, j, t)$ for each split.

In our 1D case, there's only one feature, so $\large Q$ depends only on the threshold $\large t$ and the training data $\large X$ and $\large y$. Let's denote it $\large Q_{1d}(X, y, t)$, meaning that the criterion no longer depends on the feature index $\large j$, i.e. in the 1D case $\large j = 0$.

Create the plot of criterion $\large Q_{1d}(X, y, t)$ as a function of threshold value $t$ on the interval $[-1.9, 1.9]$.

In [5]:
def regression_var_criterion(X, y, t):
    pass
    # You code here
In [6]:
# You code here
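A sketch of one possible implementation of both cells above, directly following the formula for $Q$ (np.var computes the population variance, matching $D$):

def regression_var_criterion(X, y, t):
    # split the sample by the threshold t and compute the variance reduction
    y_left, y_right = y[X < t], y[X >= t]
    return np.var(y) - len(y_left) / len(y) * np.var(y_left) \
                     - len(y_right) / len(y) * np.var(y_right)

thresholds = np.linspace(-1.9, 1.9, 100)
crit_values = [regression_var_criterion(X, y, t) for t in thresholds]

plt.plot(thresholds, crit_values)
plt.xlabel('Threshold $t$')
plt.ylabel('Variance criterion $Q_{1d}$');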

Question 1. Is the threshold value $t = 0$ optimal according to the variance criterion?

  • Yes
  • No

Now let's make a split in each of the leaf nodes: in the left branch (where the previous split was $x < 0$) use the criterion $[x < -1.5]$, and in the right branch (where the previous split was $x \geqslant 0$) use the criterion $[x < 1.5]$. This gives us a tree of depth 2 with 7 nodes and 4 leaves. Create the plot of this tree's predictions for $x \in [-2, 2]$.

In [7]:
# You code here
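A possible sketch, again averaging $y$ within each of the four leaves:

xx = np.linspace(-2, 2, 200)

def depth2_prediction(x):
    # route x through the splits [x < 0], [x < -1.5], [x < 1.5]
    if x < 0:
        mask = (X < -1.5) if x < -1.5 else (X >= -1.5) & (X < 0)
    else:
        mask = (X >= 0) & (X < 1.5) if x < 1.5 else (X >= 1.5)
    return np.mean(y[mask])

prediction = np.array([depth2_prediction(x) for x in xx])

plt.scatter(X, y)
plt.plot(xx, prediction, 'g')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');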

Question 2. How many segments are there on the plot of the tree's predictions in the interval $[-2, 2]$ (count only horizontal segments)?

  • 2
  • 3
  • 4
  • 5

2. Building a decision tree for predicting heart diseases

Let's read the data on heart diseases. The dataset can be downloaded from the course repo from here by clicking Download and then selecting the Save As option.

Problem

Predict presence or absence of cardiovascular disease (CVD) using the patient examination results.

Data description

There are 3 types of input features:

  • Objective: factual information;
  • Examination: results of medical examination;
  • Subjective: information given by the patient.
Feature | Variable Type | Variable | Value Type
Age | Objective Feature | age | int (days)
Height | Objective Feature | height | int (cm)
Weight | Objective Feature | weight | float (kg)
Gender | Objective Feature | gender | categorical code
Systolic blood pressure | Examination Feature | ap_hi | int
Diastolic blood pressure | Examination Feature | ap_lo | int
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal
Smoking | Subjective Feature | smoke | binary
Alcohol intake | Subjective Feature | alco | binary
Physical activity | Subjective Feature | active | binary
Presence or absence of cardiovascular disease | Target Variable | cardio | binary

All of the dataset values were collected at the moment of medical examination.

In [8]:
df = pd.read_csv('../../data/mlbootcamp5_train.csv', 
                 index_col='id', sep=';')
In [9]:
df.head()
Out[9]:
age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
id
0 18393 2 168 62.0 110 80 1 1 0 0 1 0
1 20228 1 156 85.0 140 90 3 1 0 0 1 1
2 18857 1 165 64.0 130 70 3 1 0 0 0 1
3 17623 2 169 82.0 150 100 1 1 0 0 1 1
4 17474 1 156 56.0 100 60 1 1 0 0 0 0

Transform the features: create "age in years" (full years), and also create 3 binary features based on cholesterol and 3 more based on gluc, indicating whether the value equals 1, 2, or 3. This method is called dummy encoding or One Hot Encoding (OHE). It is convenient to use pandas.get_dummies. There is no need to keep the original cholesterol and gluc features after encoding.

In [10]:
# You code here
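A possible sketch (the age_years column name is my own choice, and the exact year length used for the conversion is an assumption):

# age is given in days; convert it to full years
df['age_years'] = (df['age'] / 365.25).astype('int')

# dummy-encode cholesterol and gluc, dropping the original columns
df = pd.get_dummies(df, columns=['cholesterol', 'gluc'])
df.head()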

Split the data into train and holdout parts in the proportion 7:3 using sklearn.model_selection.train_test_split with random_state=17.

In [11]:
# You code here
# X_train, X_valid, y_train, y_valid = ...
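A possible completion of this cell, assuming the dummy-encoded DataFrame from the previous step:

X_train, X_valid, y_train, y_valid = train_test_split(
    df.drop('cardio', axis=1), df['cardio'],
    test_size=0.3, random_state=17)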

Train a decision tree on the dataset (X_train, y_train) with max depth equal to 3 and random_state=17. Plot this tree with sklearn.tree.export_graphviz, dot, and pydot. You don't need to use quotes in the file names to make it work in a Jupyter notebook. Commands starting with an exclamation mark are terminal commands that are usually run in the terminal/command line.

In [12]:
# You code here
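A possible sketch; the file names tree.dot / tree.png are arbitrary choices:

tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)

export_graphviz(tree, out_file='tree.dot',
                feature_names=X_train.columns, filled=True)
# convert the .dot file to an image, e.g. with the dot utility:
# !dot -Tpng tree.dot -o tree.png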

Question 3. What 3 features are used to make predictions in the created decision tree?

  • weight, height, gluc=3
  • smoke, age, gluc=3
  • age, weight, chol=3
  • age, ap_hi, chol=3

Make predictions for the holdout data (X_valid, y_valid) with the trained decision tree and calculate accuracy.

In [13]:
# You code here
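A possible sketch using the tree trained above:

tree_pred = tree.predict(X_valid)
acc1 = accuracy_score(y_valid, tree_pred)
acc1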

Tune the depth of the tree using cross-validation on the dataset (X_train, y_train) to improve model quality. Use GridSearchCV with 5 folds. Fix random_state=17 and vary max_depth from 2 to 10.

In [14]:
tree_params = {'max_depth': list(range(2, 11))}

tree_grid = GridSearchCV # You code here
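A possible completion of the cell above, using standard GridSearchCV arguments:

tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=17),
                         tree_params, cv=5, scoring='accuracy', n_jobs=-1)
tree_grid.fit(X_train, y_train)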

Draw a plot showing how mean cross-validation accuracy changes with the max_depth value.

In [15]:
# You code here
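One way to draw it from the fitted grid search object:

plt.plot(tree_params['max_depth'],
         tree_grid.cv_results_['mean_test_score'])
plt.xlabel('max_depth')
plt.ylabel('Mean CV accuracy');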

Print the best value of max_depth, i.e. the one at which the mean cross-validation metric reaches its maximum. Also compute accuracy on the holdout data. Both can be obtained from the trained GridSearchCV instance.

In [16]:
# You code here
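A possible sketch; with refit=True (the default), the fitted grid search predicts with the best found estimator directly:

print(tree_grid.best_params_, tree_grid.best_score_)

acc2 = accuracy_score(y_valid, tree_grid.predict(X_valid))
acc2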

Question 4. Is there a local maximum of accuracy on the validation curve you built? Did GridSearchCV help tune max_depth so that holdout accuracy changed by at least 1%? (Check the expression (acc2 - acc1) / acc1 * 100%, where acc1 and acc2 are the holdout accuracies before and after tuning max_depth with GridSearchCV, respectively.)

  • yes, yes
  • yes, no
  • no, yes
  • no, no

Take a look at the SCORE chart used to estimate the ten-year risk of fatal cardiovascular disease in Europe. Source paper.

Create binary features according to this picture:

  • $age \in [40,50), \ age \in [50,55), \ age \in [55,60), \ age \in [60,65)$ (4 features)
  • systolic blood pressure: $ap\_hi \in [120,140), \ ap\_hi \in [140,160), \ ap\_hi \in [160,180)$ (3 features)

If the value of age or blood pressure doesn't fall into any of the intervals, then all the corresponding binary features are equal to zero. Then we create a decision tree with these features plus the additional smoke, cholesterol, and gender features. Transform cholesterol into 3 binary features according to its 3 unique values (cholesterol=1, cholesterol=2, and cholesterol=3). This method is called dummy encoding or One Hot Encoding (OHE). Transform gender from 1 and 2 into 0 and 1; it is better to rename it to male (0 for women, 1 for men). In general, this is typically done with sklearn.preprocessing.LabelEncoder, but here, with only 2 unique values, it's not necessary.

Finally, the decision tree is built using 12 binary features (without the original features).

Create a decision tree with max_depth=3 and train it on the whole training set. Use the DecisionTreeClassifier class with random_state=17 fixed; leave all other arguments (except max_depth and random_state) at their defaults.

Question 5. What binary feature is the most important for heart disease detection (it is placed in the root of the tree)?

  • Systolic blood pressure from 160 to 180 (mmHg)
  • Gender male / female
  • Systolic blood pressure from 140 to 160 (mmHg)
  • Age from 50 to 55 (years)
  • Smokes / doesn't smoke
  • Age from 60 to 65 (years)
In [17]:
# You code here
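A possible end-to-end sketch for this last part. It assumes the age_years and cholesterol_1..3 columns created earlier; the feature names, the output file name, and the assumption that gender=2 corresponds to men are my own (the gender mapping should be checked against the data, e.g. by comparing average heights).

# 12 binary features: 4 age intervals, 3 systolic pressure intervals,
# 3 cholesterol dummies, smoke and male
features = pd.DataFrame()
features['age_40_50'] = ((df['age_years'] >= 40) & (df['age_years'] < 50)).astype('int')
features['age_50_55'] = ((df['age_years'] >= 50) & (df['age_years'] < 55)).astype('int')
features['age_55_60'] = ((df['age_years'] >= 55) & (df['age_years'] < 60)).astype('int')
features['age_60_65'] = ((df['age_years'] >= 60) & (df['age_years'] < 65)).astype('int')
features['ap_hi_120_140'] = ((df['ap_hi'] >= 120) & (df['ap_hi'] < 140)).astype('int')
features['ap_hi_140_160'] = ((df['ap_hi'] >= 140) & (df['ap_hi'] < 160)).astype('int')
features['ap_hi_160_180'] = ((df['ap_hi'] >= 160) & (df['ap_hi'] < 180)).astype('int')
features['chol_1'] = df['cholesterol_1'].astype('int')
features['chol_2'] = df['cholesterol_2'].astype('int')
features['chol_3'] = df['cholesterol_3'].astype('int')
features['smoke'] = df['smoke']
features['male'] = (df['gender'] == 2).astype('int')  # assumption: 2 = man

tree_score = DecisionTreeClassifier(max_depth=3, random_state=17)
tree_score.fit(features, df['cardio'])

# the root split can be inspected the same way as before, e.g.:
export_graphviz(tree_score, out_file='tree_score.dot',
                feature_names=features.columns, filled=True)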