mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitskiy (@yorko). Edited by Anna Tarelina (@feuerengel). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #3. Optional part

Implementation of the decision tree algorithm

In [1]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification, make_regression, load_digits, load_boston
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error

Let's fix random_state (a.k.a. random seed) beforehand.

In [2]:
RANDOM_STATE = 17

Implement the class DecisionTree. Specification:

  • the class inherits from sklearn.base.BaseEstimator;
  • the class constructor has the following parameters:
      • max_depth - the maximum depth of the tree (numpy.inf by default);
      • min_samples_split - the minimum number of instances in a node for a split to be performed (2 by default);
      • criterion - the split criterion ('gini' or 'entropy' for classification, 'variance' or 'mad_median' for regression; 'gini' by default);

    A functional to be maximized to find the optimal partition at a given node has the form $$Q(X, j, t) = F(X) - \dfrac{|X_l|}{|X|} F(X_l) - \dfrac{|X_r|}{|X|} F(X_r),$$ where $X$ is the set of samples at the given node, $X_l$ and $X_r$ are the two parts obtained by splitting $X$ on the condition $[x_j < t]$, and $F(X)$ is the partition criterion.

    For classification: let $p_i$ be the fraction of the instances of the $i$-th class in the dataset $X$.

    'gini': Gini impurity $F(X) = 1 -\sum_{i = 1}^K p_i^2$.

    'entropy': Entropy $F(X) = -\sum_{i = 1}^K p_i \log_2(p_i)$.

    For regression: $y_j = y(x_j)$ is the target for an instance $x_j$, and $y = (y_1, \dots, y_{|X|})$ is the target vector.

    'variance': Variance (mean squared deviation from the mean) $F(X) = \dfrac{1}{|X|} \sum_{x_j \in X}(y_j - \dfrac{1}{|X|}\sum_{x_i \in X}y_i)^2$

    'mad_median': Mean absolute deviation from the median $F(X) = \dfrac{1}{|X|} \sum_{x_j \in X}|y_j - \mathrm{med}(y)|$

  • the class has several methods: fit, predict and predict_proba;

  • the fit method takes the matrix of instances X and a target vector y (numpy.ndarray objects) and returns an instance of the class DecisionTree representing the decision tree trained on the dataset (X, y) according to the parameters set in the constructor;
  • the predict_proba method takes the matrix of instances X and returns the matrix P of size X.shape[0] x K, where K is the number of classes and $p_{ij}$ is the probability that the instance in the $i$-th row of X belongs to class $j \in \{1, \dots, K\}$;
  • the predict method takes the matrix of instances X and returns a prediction vector; in case of classification, the prediction for an instance $x_i$ falling into leaf $L$ is the most common class among the instances in $L$; in case of regression, it is the mean value of the targets over all instances in leaf $L$.
In [3]:
def entropy(y):    
    pass

def gini(y):
    pass

def variance(y):
    pass

def mad_median(y):
    pass
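
A possible implementation of these criteria, transcribed directly from the formulas above (the assignment expects you to fill in the stubs yourself, so treat this as a reference sketch):

def entropy(y):
    # class frequencies p_i; np.unique guarantees p > 0, so log2 is safe
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def variance(y):
    # mean squared deviation from the mean
    return np.var(y)

def mad_median(y):
    # mean absolute deviation from the median
    return np.mean(np.abs(y - np.median(y)))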

The Node class implements a node in the decision tree.

In [4]:
class Node:

    def __init__(self, feature_idx=0, threshold=0, labels=None, left=None, right=None):
        self.feature_idx = feature_idx  # index j of the feature to split on
        self.threshold = threshold      # threshold t in the condition [x_j < t]
        self.labels = labels            # targets of the instances that reached this node
        self.left = left                # left child (instances with x_j < t)
        self.right = right              # right child (instances with x_j >= t)

Let's define the function for computing the prediction in a leaf: for regression, the mean of all target values in the leaf; for classification, the most common class in the leaf.
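
A minimal sketch of both leaf predictions, under the hypothetical names classification_leaf and regression_leaf (these names are not part of the specification):

def classification_leaf(y):
    # the most frequent class among the leaf instances;
    # assumes non-negative integer labels, as in load_digits
    return np.bincount(y).argmax()

def regression_leaf(y):
    # the mean target over the leaf instances
    return np.mean(y)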

In [5]:
class DecisionTree(BaseEstimator):
    
    def __init__(self, max_depth=np.inf, min_samples_split=2, 
                 criterion='gini', debug=False):
        pass
    
    def fit(self, X, y):
        pass
        
    def predict(self, X):
        pass
        
    def predict_proba(self, X):
        pass
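
To connect the pieces, here is a hypothetical helper (not required by the specification) showing how the functional $Q(X, j, t)$ could be evaluated for one candidate split; criterion is any of the four functions defined earlier:

def split_quality(X, y, feature_idx, threshold, criterion):
    # the condition [x_j < t] from the specification
    mask = X[:, feature_idx] < threshold
    y_left, y_right = y[mask], y[~mask]
    # assumes both parts are non-empty
    return (criterion(y)
            - len(y_left) / len(y) * criterion(y_left)
            - len(y_right) / len(y) * criterion(y_right))

The fit method would maximize this quantity over all features $j$ and thresholds $t$ at every node, stopping once max_depth is reached or a node holds fewer than min_samples_split instances.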

Testing the implemented algorithm

Classification

Load the digits dataset using the load_digits method. Split the data into train and test with train_test_split, using test_size=0.2 and random_state=17. Try to train shallow decision trees and make sure that the gini and entropy criteria return different results.

In [6]:
# Your code here
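
A sketch of this step, assuming the DecisionTree class above has been implemented (max_depth=3 is an arbitrary "shallow" choice):

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=RANDOM_STATE)

# the two criteria should give different accuracies on shallow trees
for crit in ['gini', 'entropy']:
    clf = DecisionTree(max_depth=3, criterion=crit)
    clf.fit(X_train, y_train)
    print(crit, accuracy_score(y_test, clf.predict(X_test)))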

Using 5-fold cross-validation (GridSearchCV), find the optimal values of the max_depth and criterion parameters. For max_depth use range(3, 11), for criterion use {'gini', 'entropy'}. The quality measure is scoring='accuracy'.

In [7]:
# Your code here
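
A sketch, assuming the DecisionTree constructor stores its parameters as same-named attributes (BaseEstimator needs this so that GridSearchCV can clone the estimator):

tree_params = {'max_depth': list(range(3, 11)),
               'criterion': ['gini', 'entropy']}
tree_grid = GridSearchCV(DecisionTree(), tree_params,
                         cv=5, scoring='accuracy')
tree_grid.fit(X_train, y_train)
print(tree_grid.best_params_, tree_grid.best_score_)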

Plot the mean accuracy for the gini and entropy criteria as a function of max_depth.

In [8]:
# Your code here
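
A sketch of the plot, reading the mean scores out of cv_results_ (entries with a fixed criterion are ordered by ascending max_depth):

results = tree_grid.cv_results_
depths = list(range(3, 11))
for crit in ['gini', 'entropy']:
    mask = np.array(results['param_criterion']) == crit
    plt.plot(depths, np.array(results['mean_test_score'])[mask], label=crit)
plt.xlabel('max_depth')
plt.ylabel('mean CV accuracy')
plt.legend()
plt.show()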

1. Choose all correct statements:

  1. The optimal value of the max_depth parameter is on the interval [4, 9] for both criteria.
  2. The created plots have no intersection on the interval [3, 10].
  3. The created plots intersect each other only once on the interval [3, 10].
  4. The best quality for max_depth on the interval [3, 10] is reached using the gini criterion.
  5. Accuracy is strictly increasing for at least one of the criteria as max_depth increases on the interval [3, 10].

2. What are the optimal values for max_depth and criterion parameters?

  1. max_depth = 7, criterion = 'gini';
  2. max_depth = 7, criterion = 'entropy';
  3. max_depth = 10, criterion = 'entropy';
  4. max_depth = 10, criterion = 'gini';
  5. max_depth = 9, criterion = 'entropy';
  6. max_depth = 9, criterion = 'gini';

Train a decision tree on (X_train, y_train) using the optimal values of max_depth and criterion. Compute the class probabilities for X_test.

In [9]:
# Your code here
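
A sketch, reusing the parameters found by the grid search:

best_tree = DecisionTree(**tree_grid.best_params_)
best_tree.fit(X_train, y_train)
probs = best_tree.predict_proba(X_test)  # shape (X_test.shape[0], K)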

Using the resulting matrix, compute the mean class probabilities over all instances in X_test.

In [10]:
# Your code here
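
A sketch: averaging the matrix over rows gives one mean probability per class.

mean_probs = probs.mean(axis=0)
print(mean_probs.max())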

3. What is the maximum probability in the resulting vector?

  1. 0.127
  2. 0.118
  3. 1.0
  4. 0.09

Regression

Load the boston dataset using the load_boston method. Split the data into train and test with train_test_split, using test_size=0.2 and random_state=17. Try to train shallow regression decision trees and make sure that the variance and mad_median criteria return different results.

In [11]:
# Your code here
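
A sketch of this step (load_boston was removed in scikit-learn 1.2, so this part requires an older version):

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=RANDOM_STATE)

# the two regression criteria should give different errors on shallow trees
for crit in ['variance', 'mad_median']:
    reg = DecisionTree(max_depth=3, criterion=crit)
    reg.fit(X_train, y_train)
    print(crit, mean_squared_error(y_test, reg.predict(X_test)))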

Using 5-fold cross-validation (GridSearchCV), find the optimal values of the max_depth and criterion parameters. For max_depth use range(2, 9), for criterion use {'variance', 'mad_median'}. The quality measure is scoring='neg_mean_squared_error'.

In [12]:
# Your code here
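
A sketch, analogous to the classification grid search:

reg_params = {'max_depth': list(range(2, 9)),
              'criterion': ['variance', 'mad_median']}
reg_grid = GridSearchCV(DecisionTree(criterion='variance'), reg_params,
                        cv=5, scoring='neg_mean_squared_error')
reg_grid.fit(X_train, y_train)
print(reg_grid.best_params_, reg_grid.best_score_)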

Plot the mean neg_mean_squared_error for the variance and mad_median criteria as a function of max_depth.

In [13]:
# Your code here
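
A sketch of the plot, mirroring the classification one:

reg_results = reg_grid.cv_results_
reg_depths = list(range(2, 9))
for crit in ['variance', 'mad_median']:
    mask = np.array(reg_results['param_criterion']) == crit
    plt.plot(reg_depths, np.array(reg_results['mean_test_score'])[mask],
             label=crit)
plt.xlabel('max_depth')
plt.ylabel('mean CV neg_mean_squared_error')
plt.legend()
plt.show()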

4. Choose all correct statements:

  1. The created plots have no intersection on the interval [2, 8].
  2. The created plots intersect each other only once on the interval [2, 8].
  3. The optimal value of max_depth for each of the criteria is on the border of the interval [2, 8].
  4. The best quality at max_depth on the interval [2, 8] is reached using the mad_median criterion.

5. What are the optimal values for max_depth and criterion parameters?

  1. max_depth = 9, criterion = 'variance';
  2. max_depth = 5, criterion = 'mad_median';
  3. max_depth = 4, criterion = 'variance';
  4. max_depth = 2, criterion = 'mad_median';
  5. max_depth = 4, criterion = 'mad_median';
  6. max_depth = 5, criterion = 'variance'.