mlcourse.ai – Open Machine Learning Course

Authors: Vitaly Radchenko (@vradchenko), Yury Kashnitskiy (@yorko). Edited by Sergey Volkov (@sevaspb). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #5. Fall 2018

RandomForest and Logistic Regression in credit scoring and movie reviews classification

Here we will develop and tune models for credit scoring and movie review sentiment prediction. Fill in the code where needed ("# Your code is here") and answer the questions in the web form.

For the warm-up, solve the first task.

<font color = 'red'> Task 1: </font> There are 7 jurors in the courtroom. Each of them can individually determine whether the defendant is guilty or not with 80% probability. How likely is it that the jury as a whole will reach the correct verdict if the decision is made by majority voting?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q1

<font color = 'red'> Answer options: </font>

  • 20.97%
  • 80.00%
  • 83.70%
  • 96.66%
In [1]:
# Your code is here
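If helpful, here is a minimal sketch of one way to check the arithmetic (it assumes scipy is available, which the notebook does not import above): with independent jurors, the verdict is correct when at least 4 of the 7 jurors are correct, which is a binomial tail sum.

from scipy.stats import binom

n_jurors, p_correct = 7, 0.8
# P(at least 4 of 7 jurors are correct) = P(X >= 4), where X ~ Binomial(7, 0.8)
p_majority = sum(binom.pmf(k, n_jurors, p_correct) for k in range(4, n_jurors + 1))
print(round(p_majority, 4))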

Now let's move directly to machine learning.

The dataset looks like this:

Target variable
  • SeriousDlqin2yrs - the person had long delays in payments during the last 2 years; binary target variable
Features
  • age - Age of the loan borrower (number of full years); type - integer
  • NumberOfTime30-59DaysPastDueNotWorse - the number of times a person has been 30-59 days past due (but no worse) on other loans during the last two years; type - integer
  • DebtRatio - monthly payments (loans, alimony, etc.) divided by aggregate monthly income, percentage; float type
  • MonthlyIncome - monthly income in dollars; float type
  • NumberOfTimes90DaysLate - the number of times a person has been more than 90 days past due on other loans; type - integer
  • NumberOfTime60-89DaysPastDueNotWorse - the number of times a person has been 60-89 days past due (but no worse) on other loans during the last two years; type - integer
  • NumberOfDependents - number of people in the family of the borrower; type - integer
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let us implement a function that replaces the NaN values in each column of the table with that column's median.

In [3]:
def impute_nan_with_median(table):
    # Replace NaN values in each column with that column's median
    for col in table.columns:
        table[col] = table[col].fillna(table[col].median())
    return table

Reading the data:

In [4]:
data = pd.read_csv('../../data/credit_scoring_sample.csv', sep=";")
data.head()
Out[4]:
SeriousDlqin2yrs age NumberOfTime30-59DaysPastDueNotWorse DebtRatio NumberOfTimes90DaysLate NumberOfTime60-89DaysPastDueNotWorse MonthlyIncome NumberOfDependents
0 0 64 0 0.249908 0 0 8158.0 0.0
1 0 58 0 3870.000000 0 0 NaN 0.0
2 0 41 0 0.456127 0 0 6666.0 0.0
3 0 43 0 0.000190 0 0 10500.0 2.0
4 1 49 0 0.271820 0 0 400.0 0.0

View data types of the features:

In [5]:
data.dtypes
Out[5]:
SeriousDlqin2yrs                          int64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
NumberOfTimes90DaysLate                   int64
NumberOfTime60-89DaysPastDueNotWorse      int64
MonthlyIncome                           float64
NumberOfDependents                      float64
dtype: object

Look at the distribution of classes in target:

In [6]:
ax = data['SeriousDlqin2yrs'].hist(orientation='horizontal', color='red')
ax.set_xlabel("number_of_observations")
ax.set_ylabel("unique_value")
ax.set_title("Target distribution")

print('Distribution of target:')
data['SeriousDlqin2yrs'].value_counts() / data.shape[0]
Distribution of target:
Out[6]:
0    0.777511
1    0.222489
Name: SeriousDlqin2yrs, dtype: float64

Select all the features and drop the target:

In [7]:
independent_columns_names = [x for x in data if x != 'SeriousDlqin2yrs']
independent_columns_names
independent_columns_names
Out[7]:
['age',
 'NumberOfTime30-59DaysPastDueNotWorse',
 'DebtRatio',
 'NumberOfTimes90DaysLate',
 'NumberOfTime60-89DaysPastDueNotWorse',
 'MonthlyIncome',
 'NumberOfDependents']

We apply the function that replaces all NaN values with the median value of the corresponding column.

In [8]:
table = impute_nan_with_median(data)

Separate the target and the features - now we have a training sample.

In [9]:
X = table[independent_columns_names]
y = table['SeriousDlqin2yrs']

Bootstrap

<font color = 'red'> Task 2. </font> Using the bootstrap, make interval estimates of the average income (MonthlyIncome) of customers who had overdue loan payments and of those who paid on time; build 90% confidence intervals for both. Then find the difference between the lower limit of the interval for those who paid on time and the upper limit of the interval for those who were overdue. In other words, build 90% intervals for the income of "good" customers $ [good\_income\_lower, good\_income\_upper] $ and of "bad" customers $ [bad\_income\_lower, bad\_income\_upper] $, and find the difference $ good\_income\_lower - bad\_income\_upper $.

Use the example from the article. Set np.random.seed(17). Round the answer to the nearest integer.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q2

Answer options:

  • 344
  • 424
  • 584
  • 654
In [10]:
# Your code is here
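A possible sketch is below, following the bootstrap helpers from the course article (get_bootstrap_samples and stat_intervals are re-implemented here; the exact interval also depends on the number of bootstrap samples and on where exactly you set the seed, so treat this only as an outline).

def get_bootstrap_samples(data, n_samples):
    # Generate n_samples bootstrap samples by resampling rows with replacement
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    return data[indices]

def stat_intervals(stat, alpha):
    # Percentile-based (1 - alpha) confidence interval
    return np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])

good_income = data[data['SeriousDlqin2yrs'] == 0]['MonthlyIncome'].values
bad_income = data[data['SeriousDlqin2yrs'] == 1]['MonthlyIncome'].values

np.random.seed(17)
good_means = [s.mean() for s in get_bootstrap_samples(good_income, 1000)]
bad_means = [s.mean() for s in get_bootstrap_samples(bad_income, 1000)]

good_interval = stat_intervals(good_means, 0.10)  # 90% interval
bad_interval = stat_intervals(bad_means, 0.10)
print(round(good_interval[0] - bad_interval[1]))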

Decision tree, hyperparameter tuning

One of the main performance metrics of a model is the area under the ROC curve (ROC AUC). ROC AUC values lie between 0 and 1; the closer the value is to 1, the better the classification.

Using GridSearchCV, find the values of the DecisionTreeClassifier hyperparameters that maximize the area under the ROC curve.

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Use the DecisionTreeClassifier class to create a decision tree. Due to the class imbalance in the target, we add the class-balancing parameter. We also set random_state=17 for reproducibility of the results.

In [12]:
dt = DecisionTreeClassifier(random_state=17, class_weight='balanced')

We will search over the following hyperparameter values:

In [13]:
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]
tree_params = {'max_depth': max_depth_values,
               'max_features': max_features_values}

Fix the cross-validation parameters: stratified, 5 folds with shuffling, and a fixed random_state.

In [14]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

Task 3. Run a grid search with the ROC AUC metric using the hyperparameters from the tree_params dictionary. What is the maximum ROC AUC value (rounded to 2 decimal places)? We call cross-validation stable if the standard deviation of the metric on cross-validation is less than 1%. Was the cross-validation stable for the optimal combination of hyperparameters (i.e., the one providing the maximum mean ROC AUC value on cross-validation)?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q3

Answer options:

  • 0.82, no
  • 0.84, no
  • 0.82, yes
  • 0.84, yes
In [15]:
# Your code is here
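A minimal sketch of the grid search (the variable name dt_grid is just an example):

dt_grid = GridSearchCV(dt, tree_params, scoring='roc_auc', cv=skf, n_jobs=-1)
dt_grid.fit(X, y)
print(dt_grid.best_params_, round(dt_grid.best_score_, 2))
# Stability check: the per-fold standard deviation for the best combination is in
# dt_grid.cv_results_['std_test_score'][dt_grid.best_index_]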

Simple RandomForest implementation

Task 4. Implement your own random forest using DecisionTreeClassifier with the best parameters from the previous task. The forest will consist of 10 trees, whose predicted probabilities you need to average.

Brief specification:

  • Use the base code below
  • In the fit method in the loop (i from 0 to n_estimators-1), fix the seed equal to (random_state + i). The idea is that at each iteration there's a new value of random seed to add more "randomness", but at the same time results are reproducible
  • After fixing the seed, select max_features features without replacement, save the list of selected feature ids in self.feat_ids_by_tree
  • Also make a bootstrap sample (i.e. sampling with replacement) of training instances. For that, resort to np.random.choice and its argument replace
  • Train a decision tree with specified (in a constructor) arguments max_depth, max_features and random_state (do not specify class_weight) on a corresponding subset of training data.
  • The fit method returns the current instance of the class RandomForestClassifierCustom, that is self
  • In the predict_proba method, we need to loop through all the trees. For each prediction, obviously, we need to take only those features which we used for training the corresponding tree. The method returns predicted probabilities (predict_proba), averaged for all trees

Perform cross-validation. What is the average ROC AUC for cross-validation? Select the closest value.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q4

Answer options:

  • 0.823
  • 0.833
  • 0.843
  • 0.853
In [16]:
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class RandomForestClassifierCustom(BaseEstimator):
    def __init__(self, n_estimators=10, max_depth=10, max_features=10, 
                 random_state=17):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state
        
        self.trees = []
        self.feat_ids_by_tree = []
        
    def fit(self, X, y):
        
        # Your code is here
        pass
    
    def predict_proba(self, X):
        
        # Your code is here
        pass
In [17]:
# Your code is here
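One possible way to fill in the methods above is sketched below. It follows the spec but is not necessarily the reference solution; the classes_ attribute is added only so that sklearn's ROC AUC scorer can locate the positive class, and the cross-validation call at the end assumes you substitute your best max_depth and max_features from Task 3.

class RandomForestClassifierCustom(BaseEstimator):
    def __init__(self, n_estimators=10, max_depth=10, max_features=10,
                 random_state=17):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state

        self.trees = []
        self.feat_ids_by_tree = []

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.trees, self.feat_ids_by_tree = [], []
        for i in range(self.n_estimators):
            # new seed at each iteration, but reproducible overall
            np.random.seed(self.random_state + i)
            # select max_features feature ids without replacement
            feat_ids = np.random.choice(X.shape[1], size=self.max_features,
                                        replace=False)
            self.feat_ids_by_tree.append(feat_ids)
            # bootstrap sample of training instances (with replacement)
            sample_ids = np.random.choice(X.shape[0], size=X.shape[0],
                                          replace=True)
            tree = DecisionTreeClassifier(max_depth=self.max_depth,
                                          max_features=self.max_features,
                                          random_state=self.random_state)
            tree.fit(X[sample_ids][:, feat_ids], y[sample_ids])
            self.trees.append(tree)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        # each tree sees only its own feature subset; average the probabilities
        probas = [tree.predict_proba(X[:, feat_ids])
                  for tree, feat_ids in zip(self.trees, self.feat_ids_by_tree)]
        return np.mean(probas, axis=0)

# Example cross-validation call (substitute your best values from Task 3):
# cross_val_score(RandomForestClassifierCustom(max_depth=best_depth,
#                                              max_features=best_features),
#                 X, y, scoring='roc_auc', cv=skf).mean()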

Task 5. Let us compare our own implementation of a random forest with the sklearn version. To do this, use RandomForestClassifier(class_weight='balanced', random_state=17) and specify the same values for max_depth and max_features as before. What average ROC AUC on cross-validation do we get this time? Select the closest value.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q5

Answer options:

  • 0.823
  • 0.833
  • 0.843
  • 0.853
In [18]:
from sklearn.ensemble import RandomForestClassifier
In [19]:
# Your code is here
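A sketch of the comparison, reusing the best tree hyperparameters found by the grid search in Task 3 (here assumed to be stored in dt_grid.best_params_, as in the sketch above) and 10 trees to match our custom forest, which is an assumption rather than part of the task statement:

rf = RandomForestClassifier(n_estimators=10, class_weight='balanced',
                            random_state=17, **dt_grid.best_params_)
print(cross_val_score(rf, X, y, scoring='roc_auc', cv=skf).mean())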

sklearn RandomForest, hyperparameter tuning

Task 6. In the third task, we found the optimal hyperparameters for a single tree. However, these parameters may not be optimal for an ensemble. Let's check this assumption with GridSearchCV(RandomForestClassifier(class_weight='balanced', random_state=17)). This time we extend max_depth up to 15, because trees in a forest need to be deeper (you should know this from the article). What are the best values of the hyperparameters now?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q6

Answer options:

  • max_depth=8, max_features=4
  • max_depth=9, max_features=5
  • max_depth=10, max_features=6
  • max_depth=11, max_features=7
In [20]:
max_depth_values = range(5, 15)
max_features_values = [4, 5, 6, 7]
forest_params = {'max_depth': max_depth_values,
                'max_features': max_features_values}
In [21]:
# Your code is here
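A minimal sketch of the forest grid search (this can take a while; n_estimators is left at its default here, which is an assumption, not part of the task):

rf = RandomForestClassifier(class_weight='balanced', random_state=17, n_jobs=-1)
rf_grid = GridSearchCV(rf, forest_params, scoring='roc_auc', cv=skf, n_jobs=-1)
rf_grid.fit(X, y)
print(rf_grid.best_params_, rf_grid.best_score_)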

Logistic regression, hyperparameter tuning

Task 7. Now let's compare our results with logistic regression (we specify class_weight='balanced' and random_state=17). Do a full grid search over the parameter C using a wide range of values, np.logspace(-8, 8, 17). This time we will build a pipeline: first apply scaling, then train the model.

Learn about pipelines and perform cross-validation. What is the best average ROC AUC? Select the closest value.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q7

Answer options:

  • 0.778
  • 0.788
  • 0.798
  • 0.808
In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
logit = LogisticRegression(random_state=17, class_weight='balanced')

logit_pipe = Pipeline([('scaler', scaler), ('logit', logit)])
logit_pipe_params = {'logit__C': np.logspace(-8, 8, 17)}
In [23]:
# Your code is here
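A minimal sketch: grid search over C inside the scaling + logistic regression pipeline (logit_pipe_grid is just an example name):

logit_pipe_grid = GridSearchCV(logit_pipe, logit_pipe_params,
                               scoring='roc_auc', cv=skf, n_jobs=-1)
logit_pipe_grid.fit(X, y)
print(logit_pipe_grid.best_params_, logit_pipe_grid.best_score_)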

Logistic regression and RandomForest on sparse features

With a small number of features, random forest proved to be better than logistic regression. However, one of the main disadvantages of trees is how they handle sparse data, for example, texts. Let's compare logistic regression and random forest on a new task. Download the dataset with movie reviews here.

In [24]:
# Load the data
df = pd.read_csv("../../data/movie_reviews_train.csv", nrows=50000)

# Separate the features (review text) and the target (label)
X_text = df["text"]
y_text = df["label"]

# Class counts
df.label.value_counts()
Out[24]:
1    32492
0    17508
Name: label, dtype: int64
In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Stratified split into 3 folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# In the pipeline, we vectorize the text and then train logistic regression
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', LogisticRegression(random_state=17))])

Task 8. For logistic regression: iterate over the parameter C with values from the list [0.1, 1, 10, 100] and find the best ROC AUC on cross-validation. Select the closest answer.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q8

Answer options:

  • 0.74
  • 0.75
  • 0.84
  • 0.85
In [26]:
# Your code is here
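A possible sketch: loop over the C values, reusing the text pipeline defined above (this can be slow, since each fit vectorizes 50,000 reviews):

for C in [0.1, 1, 10, 100]:
    classifier.set_params(clf__C=C)
    aucs = cross_val_score(classifier, X_text, y_text, scoring='roc_auc', cv=skf)
    print(C, aucs.mean())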

Task 9. Now try to perform the same operation with a random forest. Similarly, search over all the parameter values below and get the maximum ROC AUC. Select the closest value.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q9

Answer options:

  • 0.74
  • 0.75
  • 0.84
  • 0.85
In [27]:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(random_state=17, n_jobs=-1))])

min_samples_leaf = [1, 2, 3]
max_features = [0.3, 0.5, 0.7]
max_depth = [None]
In [28]:
# Your code is here
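A possible sketch: grid search over the forest parameters defined above, plugged into the text pipeline (rf_text_grid is just an example name):

rf_text_params = {'clf__min_samples_leaf': min_samples_leaf,
                  'clf__max_features': max_features,
                  'clf__max_depth': max_depth}

rf_text_grid = GridSearchCV(classifier, rf_text_params,
                            scoring='roc_auc', cv=skf, n_jobs=-1)
rf_text_grid.fit(X_text, y_text)
print(rf_text_grid.best_params_, rf_text_grid.best_score_)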