Author: Yury Kashnitskiy (@yorko). Edited by Anna Tarelina (@feuerengel). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

**In this assignment, we will find out how a decision tree works in a regression task, and then build and tune classification decision trees to identify heart disease.
Fill in the missing code in the cells marked "You code here" and answer the questions in the web form.**

In [1]:

```
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
```

Let's consider the following one-dimensional regression problem. We need to build a function $a(x)$ that approximates the original dependency $y = f(x)$ by minimizing the mean squared error: $\min \sum_i {(a(x_i) - f(x_i))}^2$.

In [2]:

```
X = np.linspace(-2, 2, 7)
y = X ** 3
plt.scatter(X, y)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');
```

Let's build the decision tree in several steps. We will use the symmetric thresholds 0, 1.5 and -1.5 for partitioning. In a regression task, a leaf outputs the mean answer over all observations in that leaf.
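A short derivation of why the mean is the natural leaf output under squared error. The symbols $c$ (the constant predicted by a leaf) and $y_1, \ldots, y_n$ (the targets that fall into it) are introduced here only for illustration:

$$\large \frac{d}{dc}\sum_{i=1}^{n}(c - y_i)^2 = 2\sum_{i=1}^{n}(c - y_i) = 0 \quad \Longrightarrow \quad c = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Since the second derivative $2n$ is positive, this stationary point is the minimum.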

Let's start with a tree of depth 0 that contains all training observations. What will the predictions of this tree look like for $x \in [-2, 2]$? Create the appropriate plot using pen and paper, or Python if needed (without using `sklearn`).

In [3]:

```
# You code here
```
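One possible way to compute the depth-0 prediction is sketched below. The names `x_grid` and `depth0_pred` are illustrative and not part of the assignment; plotting is left as a comment.

```python
import numpy as np

# Training data, same as in the cell that defines X and y.
X = np.linspace(-2, 2, 7)
y = X ** 3

# A depth-0 tree is a single leaf, so it predicts the mean target everywhere.
x_grid = np.linspace(-2, 2, 200)              # points at which to draw the prediction
depth0_pred = np.full_like(x_grid, y.mean())  # the same constant at every point

# To draw it: plt.scatter(X, y); plt.plot(x_grid, depth0_pred)
```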

Let's split the data according to the condition $[x < 0]$. This gives us a tree of depth 1 with two leaves. Let's create a similar plot for the predictions of this tree.

In [4]:

```
# You code here
```
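A minimal sketch of the depth-1 prediction, assuming each leaf outputs the mean of its training targets. The variable names are illustrative.

```python
import numpy as np

X = np.linspace(-2, 2, 7)
y = X ** 3

# Leaf values: mean target on each side of the split [x < 0].
left_mean = y[X < 0].mean()
right_mean = y[X >= 0].mean()

# Piecewise-constant prediction on a dense grid.
x_grid = np.linspace(-2, 2, 200)
depth1_pred = np.where(x_grid < 0, left_mean, right_mean)

# To draw it: plt.scatter(X, y); plt.plot(x_grid, depth1_pred)
```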

In the decision tree algorithm, the feature and the threshold for splitting are chosen according to some criterion. The commonly used criterion for regression is based on variance: $$\large Q(X, y, j, t) = D(X, y) - \dfrac{|X_l|}{|X|} D(X_l, y_l) - \dfrac{|X_r|}{|X|} D(X_r, y_r),$$ where $\large X$ and $\large y$ are the feature matrix and the target vector, respectively, for the training instances in the current node, $\large X_l, y_l$ and $\large X_r, y_r$ are the splits of the samples $\large X, y$ into two parts with respect to $\large [x_j < t]$ (by the $\large j$-th feature and threshold $\large t$), $\large |X|$, $\large |X_l|$, $\large |X_r|$ (or, equivalently, $\large |y|$, $\large |y_l|$, $\large |y_r|$) are the sizes of the corresponding samples, and $\large D(X, y)$ is the variance of the answers $\large y$ over all instances in $\large X$: $$\large D(X, y) = \dfrac{1}{|X|} \sum_{k=1}^{|X|}\Big(y_k - \dfrac{1}{|X|}\sum_{i = 1}^{|X|}y_i\Big)^2$$ Here $\large y_i = y(x_i)$ is the answer for the instance $\large x_i$. The feature index $\large j$ and the threshold $\large t$ are chosen to maximize the value of the criterion $\large Q(X, y, j, t)$ for each split.

In our 1D case, there's only one feature, so $\large Q$ depends only on the threshold $\large t$ and the training data $\large X$ and $\large y$. Let's denote it $\large Q_{1d}(X, y, t)$, meaning that the criterion no longer depends on the feature index $\large j$, i.e. in the 1D case $\large j = 0$.

Plot the criterion $\large Q_{1d}(X, y, t)$ as a function of the threshold value $t$ on the interval $[-1.9, 1.9]$.

In [5]:

```
def regression_var_criterion(X, y, t):
    pass
    # You code here
```

In [6]:

```
# You code here
```
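The variance criterion above can be sketched as follows. This is one possible reading of the formula, not a reference solution: the helper names `variance` and `variance_criterion_1d` are hypothetical, and treating an empty leaf as contributing zero variance is an assumption.

```python
import numpy as np

def variance(values):
    """Population variance D; an empty leaf contributes 0 (an assumption)."""
    return float(np.var(values)) if len(values) else 0.0

def variance_criterion_1d(X, y, t):
    """Variance reduction Q_1d(X, y, t) for the split [x < t]."""
    left, right = y[X < t], y[X >= t]
    return (variance(y)
            - len(left) / len(y) * variance(left)
            - len(right) / len(y) * variance(right))

X = np.linspace(-2, 2, 7)
y = X ** 3

# Evaluate the criterion on a grid of candidate thresholds.
thresholds = np.linspace(-1.9, 1.9, 100)
q_values = np.array([variance_criterion_1d(X, y, t) for t in thresholds])

# To draw it: plt.plot(thresholds, q_values)
```

`np.var` uses the population form (division by the sample size) by default, which matches the definition of $D(X, y)$ given above.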

**Question 1. Is the threshold value $t = 0$ optimal according to the variance criterion?**

- Yes
- No

Next, let's split each of the leaf nodes. In the left branch (where the previous split was $x < 0$), use the criterion $[x < -1.5]$; in the right branch (where the previous split was $x \geqslant 0$), use the criterion $[x < 1.5]$. This gives us a tree of depth 2 with 7 nodes and 4 leaves. Plot the predictions of this tree for $x \in [-2, 2]$.

In [7]:

```
# You code here
```
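For several thresholds, the leaf lookup can be written compactly with `np.digitize`. The sketch below assumes the three thresholds from the text and mean-valued leaves; the variable names are illustrative.

```python
import numpy as np

X = np.linspace(-2, 2, 7)
y = X ** 3
thresholds = np.array([-1.5, 0.0, 1.5])  # sorted split points from the text

# np.digitize puts x into leaf k when thresholds[k-1] <= x < thresholds[k],
# which matches the [x < t] convention used for the splits.
train_leaf = np.digitize(X, thresholds)
leaf_means = np.array(
    [y[train_leaf == k].mean() for k in range(len(thresholds) + 1)]
)

# Look up the leaf mean for every point of a dense grid.
x_grid = np.linspace(-2, 2, 400)
depth2_pred = leaf_means[np.digitize(x_grid, thresholds)]

# To draw it: plt.scatter(X, y); plt.plot(x_grid, depth2_pred)
```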

**Question 2. How many segments are there on the plot of tree predictions in the interval [-2, 2] (count only horizontal lines)?**

- 2
- 3
- 4
- 5

Let's read the data on heart disease. The dataset can be downloaded from the course repo by clicking `Download` and then selecting the `Save As` option.

**Problem**

Predict presence or absence of cardiovascular disease (CVD) using the patient examination results.

**Data description**

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

| Feature | Variable Type | Variable | Value Type |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

All of the dataset values were collected at the moment of medical examination.

In [8]:

```
df = pd.read_csv('../../data/mlbootcamp5_train.csv',
                 index_col='id', sep=';')
```

In [9]:

```
df.head()
```

Transform the features: create "age in years" (full age), and also create 3 binary features based on `cholesterol` and 3 more based on `gluc`, each indicating whether the feature equals 1, 2 or 3. This method is called dummy encoding or One Hot Encoding (OHE). It is most convenient to use `pandas.get_dummies`. There is no need to keep the original features `cholesterol` and `gluc` after encoding.

In [10]:

```
# You code here
```
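A minimal sketch of this transformation on a tiny made-up frame. The values, the column name `age_years`, the `chol`/`gluc` prefixes and the 365.25-days-per-year conversion are all assumptions made for illustration; only the source column names come from the data description.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset.
toy = pd.DataFrame({
    "age": [18393, 20228, 23713],   # age in days
    "cholesterol": [1, 3, 2],
    "gluc": [1, 2, 3],
})

# Assumption: one year is 365.25 days; truncation gives completed (full) years.
toy["age_years"] = (toy["age"] / 365.25).astype(int)

# One indicator column per category value; the original columns are dropped
# automatically when they are listed in `columns`.
toy = pd.get_dummies(
    toy, columns=["cholesterol", "gluc"], prefix=["chol", "gluc"], dtype=int
)
```

After this, `toy` holds `age`, `age_years`, and the indicator columns `chol_1`, `chol_2`, `chol_3`, `gluc_1`, `gluc_2`, `gluc_3`.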

Split the data into train and holdout parts in the proportion of 7/3 using `sklearn.model_selection.train_test_split` with `random_state=17`.

In [11]:

```
# You code here
# X_train, X_valid, y_train, y_valid = ...
```
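The split call can be sketched on synthetic data as below. `X_demo` and `y_demo` are random stand-ins, not the heart disease dataset; `test_size=0.3` is how the 7/3 proportion is expressed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# 30% of the rows go to the holdout part, the remaining 70% to training.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)
```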

Train a decision tree on the dataset `(X_train, y_train)` with max depth equal to 3 and `random_state=17`. Plot this tree with `sklearn.tree.export_graphviz`, `dot` and `pydot`. You don't need to use quotes in the file names to make it work in a Jupyter notebook. Commands that start with an exclamation mark are terminal commands, usually run in a terminal/command line.

In [12]:

```
# You code here
```
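A hedged sketch of fitting and exporting a tree, using synthetic data and hypothetical feature names (`f0`, `f1`, `f2`). It returns the DOT source as a string instead of writing a file, so it does not require Graphviz to be installed; rendering the string with `dot` is a separate step.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
feature_names = ["f0", "f1", "f2"]  # hypothetical names

tree = DecisionTreeClassifier(max_depth=3, random_state=17).fit(X_demo, y_demo)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_source = export_graphviz(
    tree, out_file=None, feature_names=feature_names, filled=True
)

# Features used in at least one split (leaves carry negative feature indices).
used_features = sorted({feature_names[i] for i in tree.tree_.feature if i >= 0})
```

Reading `used_features` is one way to check which features a fitted tree actually splits on without inspecting the picture.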

**Question 3. What 3 features are used to make predictions in the created decision tree?**

- weight, height, gluc=3
- smoke, age, gluc=3
- age, weight, chol=3
- age, ap_hi, chol=3

Make predictions for the holdout data `(X_valid, y_valid)` with the trained decision tree. Calculate accuracy.

In [13]:

```
# You code here
```
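As a reminder of what `accuracy_score` computes, here is a small example with made-up labels and predictions (not model output): accuracy is the share of predictions that match the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])  # made-up holdout labels
y_pred = np.array([1, 0, 0, 1, 0])  # made-up predictions

acc = accuracy_score(y_true, y_pred)   # 4 of 5 predictions match
manual = np.mean(y_true == y_pred)     # same quantity computed by hand
```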

Tune the depth of the tree using cross-validation on the dataset `(X_train, y_train)` to improve the quality of the model. Use `GridSearchCV` with 5 folds. Fix `random_state=17` and vary `max_depth` from 2 to 10.

In [14]:

```
tree_params = {'max_depth': list(range(2, 11))}
tree_grid = GridSearchCV # You code here
```

Plot how the mean cross-validation accuracy changes with the value of `max_depth`.

In [15]:

```
# You code here
```

Print the value of `max_depth` at which the mean cross-validation quality metric reaches its maximum. Also compute accuracy on the holdout data. All of these computations can be done with the trained instance of the `GridSearchCV` class.

In [16]:

```
# You code here
```
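The cross-validation steps above can be sketched end to end on synthetic data. The names `demo_params` and `demo_grid` and the data itself are illustrative assumptions; the attributes `cv_results_`, `best_params_` and the method `score` are part of the `GridSearchCV` interface.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)

demo_params = {"max_depth": list(range(2, 11))}
demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=17), demo_params, cv=5, scoring="accuracy"
)
demo_grid.fit(X_tr, y_tr)

# One mean accuracy per candidate depth, in the order of the grid.
depths = demo_params["max_depth"]
mean_cv_accuracy = demo_grid.cv_results_["mean_test_score"]

best_depth = demo_grid.best_params_["max_depth"]
# score() uses the best estimator refitted on the whole training part.
holdout_accuracy = demo_grid.score(X_val, y_val)

# To draw the validation curve: plt.plot(depths, mean_cv_accuracy)
```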

**Question 4. Is there a local maximum of accuracy on the built validation curve? Did GridSearchCV help to tune max_depth so that there's been at least 1% change in holdout accuracy?**
(check out the expression (acc2 - acc1) / acc1 * 100%, where acc1 and acc2 are the accuracies on holdout data before and after tuning `max_depth` with `GridSearchCV`, respectively)

- yes, yes
- yes, no
- no, yes
- no, no
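The expression from Question 4 can be wrapped in a small helper. The function name is hypothetical and the accuracies below are made-up numbers chosen only to show the arithmetic, not results from the dataset.

```python
def relative_change_percent(acc1, acc2):
    """Relative change from acc1 to acc2, in percent: (acc2 - acc1) / acc1 * 100."""
    return (acc2 - acc1) / acc1 * 100

# Made-up accuracies before and after tuning.
example = relative_change_percent(0.70, 0.721)  # about 3 percent
```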

Take a look at the SCORE table to estimate ten-year risk of fatal cardiovascular disease in Europe. Source paper.

Create binary features according to this picture:

- age: $age \in [40,50), \ldots, age \in [60,65)$ (4 features)
- systolic blood pressure: $ap\_hi \in [120,140), ap\_hi \in [140,160), ap\_hi \in [160,180)$ (3 features)

If the values of age or blood pressure don't fall into any of the intervals, then all the corresponding binary features will be equal to zero. Then we create a decision tree with these features and the additional features `smoke`, `cholesterol` and `gender`. Transform `cholesterol` into 3 binary features according to its 3 unique values (`cholesterol`=1, `cholesterol`=2 and `cholesterol`=3). This method is called dummy encoding or One Hot Encoding (OHE). Transform `gender` from 1 and 2 into 0 and 1. It is better to rename it to `male` (0: woman, 1: man). In general, this is typically done with `sklearn.preprocessing.LabelEncoder`, but here, with only 2 unique values, it's not necessary.

Finally, the decision tree is built using 12 binary features (without the original features).
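A sketch of how the 12 binary features might be assembled, on a tiny made-up frame. Several things here are assumptions: the column `age_years`, the generated column names, the two middle age intervals `[50, 55)` and `[55, 60)` (the text only gives the first and last interval and the count of 4), and the reading that gender code 2 corresponds to men.

```python
import pandas as pd

# Made-up rows; column names follow the data description, values are illustrative.
toy = pd.DataFrame({
    "age_years": [38, 47, 52, 58, 63],
    "ap_hi": [110, 125, 150, 170, 185],
    "smoke": [0, 1, 0, 0, 1],
    "cholesterol": [1, 2, 3, 1, 2],
    "gender": [1, 2, 1, 2, 2],
})

age_bins = [(40, 50), (50, 55), (55, 60), (60, 65)]  # middle intervals are assumed
ap_hi_bins = [(120, 140), (140, 160), (160, 180)]

features = pd.DataFrame(index=toy.index)
# Half-open intervals [lo, hi): 1 if the value falls inside, else 0.
for lo, hi in age_bins:
    features[f"age_{lo}_{hi}"] = ((toy["age_years"] >= lo) & (toy["age_years"] < hi)).astype(int)
for lo, hi in ap_hi_bins:
    features[f"ap_hi_{lo}_{hi}"] = ((toy["ap_hi"] >= lo) & (toy["ap_hi"] < hi)).astype(int)

features["smoke"] = toy["smoke"]
features["male"] = toy["gender"].map({1: 0, 2: 1})  # assumes code 2 means man

# Three indicator columns for cholesterol = 1, 2, 3.
chol = pd.get_dummies(toy["cholesterol"], prefix="chol", dtype=int)
features = pd.concat([features, chol], axis=1)
```

The first made-up row (age 38, pressure 110) falls outside every interval, so all seven of its interval indicators are zero, as the text describes.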

Create a decision tree with the limitation `max_depth=3` and train it on the whole training data. Use the `DecisionTreeClassifier` class with fixed `random_state=17`; all other arguments (except for `max_depth` and `random_state`) should be left at their defaults.

**Question 5. What binary feature is the most important for heart disease detection (it is placed in the root of the tree)?**

- Systolic blood pressure from 160 to 180 (mmHg)
- Gender male / female
- Systolic blood pressure from 140 to 160 (mmHg)
- Age from 50 to 55 (years)
- Smokes / doesn't smoke
- Age from 60 to 65 (years)

In [17]:

```
# You code here
```