Authors: Vitaly Radchenko (@vradchenko), Yury Kashnitskiy (@yorko). Edited by Sergey Volkov (@sevaspb). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

For the warm-up, solve the first task.

<font color = 'red'> **Task 1:** </font> There are 7 jurors in the courtroom. Each of them individually can correctly determine whether the defendant is guilty or not with 80% probability. What is the probability that the jury reaches the correct verdict if the decision is made by majority voting?

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q1*

<font color = 'red'> **Answer options:** </font>

- 20.97%
- 80.00%
- 83.70%
- 96.66%

In [1]:

```
# Your code is here
```
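As a hint for Task 1, here is a minimal sketch of the majority-vote calculation, assuming the voters decide independently and the number of voters is odd. The function name `majority_vote_accuracy` and the toy numbers (3 voters, 60% individual accuracy) are illustrative assumptions, deliberately different from the task's values.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct."""
    min_votes = n_voters // 2 + 1  # smallest strict majority
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(min_votes, n_voters + 1)
    )


# Toy check: 3 voters, each correct 60% of the time (not the task's numbers)
toy_accuracy = majority_vote_accuracy(3, 0.6)
print(round(toy_accuracy, 3))  # 0.648
```

The sum runs over every outcome in which at least `n_voters // 2 + 1` voters are right, which is the binomial tail probability.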

Now let's move directly to machine learning.

- SeriousDlqin2yrs - the person had long delays in payments during 2 years; binary variable

- age - age of the loan borrower (number of full years); type - integer
- NumberOfTime30-59DaysPastDueNotWorse - the number of times the person has been 30-59 days late (but no more) in repaying other loans during the last two years; type - integer
- DebtRatio - monthly payments (loans, alimony, etc.) divided by aggregate monthly income, percentage; type - float
- MonthlyIncome - monthly income in dollars; type - float
- NumberOfTimes90DaysLate - the number of times the person has been more than 90 days late in repaying other loans; type - integer
- NumberOfTime60-89DaysPastDueNotWorse - the number of times the person has been 60-89 days late (but no more) in repaying other loans during the last two years; type - integer
- NumberOfDependents - number of people in the borrower's family; type - integer

In [2]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

Let us implement a function that replaces the NaN values in each column of the table with that column's median.

In [3]:

```
def impute_nan_with_median(table):
    for col in table.columns:
        table[col] = table[col].fillna(table[col].median())
    return table
```
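To see what the function does, here is a small sketch on a made-up table. The `toy` DataFrame and its column names are assumptions for illustration only. Note that the function assigns into the table it receives, so the sketch passes a copy to keep the original intact.

```python
import numpy as np
import pandas as pd


def impute_nan_with_median(table):
    for col in table.columns:
        table[col] = table[col].fillna(table[col].median())
    return table


# Made-up data: one missing value per column
toy = pd.DataFrame({
    "income": [1000.0, np.nan, 3000.0, 5000.0],
    "dependents": [0.0, 2.0, np.nan, 1.0],
})

# Pass a copy, because the function modifies its argument in place
filled = impute_nan_with_median(toy.copy())
print(filled)
```

The missing `income` is filled with 3000.0 (the median of 1000, 3000 and 5000) and the missing `dependents` with 1.0.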

Reading the data:

In [4]:

```
data = pd.read_csv('../../data/credit_scoring_sample.csv', sep=";")
data.head()
```

View data types of the features:

In [5]:

```
data.dtypes
```

Look at the distribution of classes in target:

In [6]:

```
ax = data['SeriousDlqin2yrs'].hist(orientation='horizontal', color='red')
ax.set_xlabel("number_of_observations")
ax.set_ylabel("unique_value")
ax.set_title("Target distribution")
print('Distribution of target:')
data['SeriousDlqin2yrs'].value_counts() / data.shape[0]
```

Select all the features and drop the target:

In [7]:

```
# All columns except the target
independent_columns_names = [x for x in data.columns if x != 'SeriousDlqin2yrs']
independent_columns_names
```

We apply the function that replaces all NaN values with the median of the corresponding column.

In [8]:

```
table = impute_nan_with_median(data)
```

Separate the target from the features - this gives us the training sample.

In [9]:

```
X = table[independent_columns_names]
y = table['SeriousDlqin2yrs']
```

**<font color = 'red'> Task 2. </font>** Using the bootstrap, build 90% confidence intervals for the average income (MonthlyIncome) of customers who had overdue loan payments and of those who paid on time. Find the difference between the lower limit of the interval for those who paid on time and the upper limit for those who were overdue.
So, you are asked to build 90% intervals for the income of "good" customers $ [good\_income\_lower, good\_income\_upper] $ and for "bad" - $ [bad\_income\_lower, bad\_income\_upper] $ and find the difference $ good\_income\_lower - bad\_income\_upper $.

Use the example from the article. Set `np.random.seed(17)`. Round the answer to the nearest integer.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q2*

**Answer options:**

- 344
- 424
- 584
- 654

In [10]:

```
# Your code is here
```
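The general shape of a percentile bootstrap interval for a mean can be sketched as follows. This is not the article's code: the helper names `bootstrap_samples` and `percentile_interval`, the synthetic `toy_income` array, and the local random generator are all assumptions, so the sketch will not reproduce the numbers asked for in Task 2 (which requires `np.random.seed(17)` and the real data).

```python
import numpy as np


def bootstrap_samples(values, n_samples, rng):
    """Draw n_samples resamples with replacement, each as long as values."""
    idx = rng.integers(0, len(values), size=(n_samples, len(values)))
    return values[idx]


def percentile_interval(stats, alpha):
    """Two-sided (1 - alpha) percentile interval of the bootstrap statistics."""
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])


rng = np.random.default_rng(0)  # local generator, an assumption of this sketch
toy_income = rng.lognormal(mean=8.0, sigma=0.5, size=500)  # synthetic incomes

# Mean of each resample, then the 5th and 95th percentiles of those means
boot_means = bootstrap_samples(toy_income, 1000, rng).mean(axis=1)
lower, upper = percentile_interval(boot_means, alpha=0.1)
print(f"90% interval for the mean: [{lower:.1f}, {upper:.1f}]")
```

The indexing `values[idx]` expects a NumPy array, so a pandas column would first be converted, for example with `.values`.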

One of the main performance metrics of a model is the area under the ROC curve. ROC AUC values lie between 0 and 1; a value of 0.5 corresponds to random ranking, and the closer the value is to 1, the better the classifier separates the classes.
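A tiny sketch of how the metric behaves, using `sklearn.metrics.roc_auc_score` on hand-made labels and scores (the toy values are assumptions for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# One positive (0.35) is ranked below one negative (0.4): 3 of 4 pairs are ordered correctly
auc_partial = roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8])

# Every positive is scored above every negative
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Every positive is scored below every negative
auc_reversed = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

print(auc_partial, auc_perfect, auc_reversed)  # 0.75 1.0 0.0
```

ROC AUC depends only on how the scores rank the objects, which is why it is computed from predicted probabilities rather than from hard class labels.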

Find the values of the `DecisionTreeClassifier` hyperparameters that maximize the area under the ROC curve, using `GridSearchCV`.

In [11]:

```
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
```

Use the `DecisionTreeClassifier` class to create a decision tree. Due to the imbalance of the classes in the target, we add the balancing parameter. We also use the parameter `random_state = 17` for the reproducibility of the results.

In [12]:

```
dt = DecisionTreeClassifier(random_state=17, class_weight='balanced')
```

We will search over the following hyperparameter values:

In [13]:

```
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]
tree_params = {'max_depth': max_depth_values,
               'max_features': max_features_values}
```

Fix the cross-validation parameters: stratified, 5 folds with shuffling, and a fixed `random_state`.

In [14]:

```
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
```
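What stratification means can be checked on a small imbalanced target. In the sketch below, `y_toy` and `X_toy` are made-up arrays (80 negatives, 20 positives); every test fold should keep the same 20% share of positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # the features do not matter for the split itself

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# Share of the positive class in each test fold
fold_pos_rates = [y_toy[test_idx].mean() for _, test_idx in skf_toy.split(X_toy, y_toy)]
print(fold_pos_rates)
```

With a plain, non-stratified split the share of the rare class could drift from fold to fold, which makes per-fold ROC AUC values noisier.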

**Task 3.**
Run GridSearch with the ROC AUC metric using the hyperparameters from the `tree_params` dictionary. What is the maximum ROC AUC value (rounded to 2 decimal places)? We call cross-validation stable if the standard deviation of the metric on the cross-validation is less than 1%. Was cross-validation stable under the optimal combination of hyperparameters (i.e., the one providing the maximum mean ROC AUC value on cross-validation)?

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q3*

**Answer options:**

- 0.82, no
- 0.84, no
- 0.82, yes
- 0.84, yes

In [15]:

```
# Your code is here
```
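The mechanics of such a search, including where the per-fold standard deviation lives, can be sketched on synthetic data. The data from `make_classification` and the reduced parameter grid are assumptions of this sketch, so its scores say nothing about the answer to Task 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the credit scoring sample
X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   weights=[0.8, 0.2], random_state=17)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=17, class_weight="balanced"),
    param_grid={"max_depth": [3, 5], "max_features": [4, 8]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=17),
)
grid.fit(X_toy, y_toy)

# Standard deviation of the fold scores for the best parameter combination
best_std = grid.cv_results_["std_test_score"][grid.best_index_]
print(grid.best_params_, round(grid.best_score_, 3), round(best_std, 3))
```

`best_score_` is the mean test score of the best combination, and `cv_results_["std_test_score"]` at `best_index_` is the spread to compare against the 1% threshold.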

**Task 4.**
Implement your own random forest using `DecisionTreeClassifier` with the best parameters from the previous task. There will be 10 trees, the predicted probabilities of which you need to average.

Brief specification:

- Use the base code below
- In the `fit` method, in the loop (`i` from 0 to `n_estimators-1`), fix the seed equal to (`random_state + i`). The idea is that at each iteration there's a new value of the random seed to add more "randomness", but at the same time the results are reproducible
- After fixing the seed, select `max_features` features **without replacement**, save the list of selected feature ids in `self.feat_ids_by_tree`
- Also make a bootstrap sample (i.e. **sampling with replacement**) of training instances. For that, resort to `np.random.choice` and its argument `replace`
- Train a decision tree with the arguments specified in the constructor (`max_depth`, `max_features` and `random_state`; do not specify `class_weight`) on the corresponding subset of training data
- The `fit` method returns the current instance of the class `RandomForestClassifierCustom`, that is `self`
- In the `predict_proba` method, we need to loop through all the trees. For each prediction, obviously, we need to take only those features which we used for training the corresponding tree. The method returns predicted probabilities (`predict_proba`), averaged for all trees
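The two sampling steps from the specification can be sketched for a single tree (one value of `i`) as below. This is only an illustration on synthetic NumPy data, not the full class: `X_toy`, `y_toy` and the chosen parameter values are assumptions, and with a pandas DataFrame the positional indexing would go through `.values` or `.iloc` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and illustrative parameter values (assumptions of this sketch)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
random_state, i, max_features, max_depth = 17, 0, 4, 3

np.random.seed(random_state + i)  # a new, reproducible seed for tree number i

# Feature subset: sampled without replacement, so no feature repeats
feat_ids = np.random.choice(X_toy.shape[1], size=max_features, replace=False)

# Bootstrap sample of rows: sampled with replacement, same size as the data
row_ids = np.random.choice(X_toy.shape[0], size=X_toy.shape[0], replace=True)

tree = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features,
                              random_state=random_state)
tree.fit(X_toy[row_ids][:, feat_ids], y_toy[row_ids])

# At prediction time the tree must see the same feature subset it was trained on
proba = tree.predict_proba(X_toy[:, feat_ids])
print(sorted(feat_ids), proba.shape)
```

In the full forest, `feat_ids` would be stored per tree so that `predict_proba` can select the matching columns before averaging the trees' probabilities.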

Perform cross-validation. What is the average ROC AUC on cross-validation? Select the closest value.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q4*

**Answer options:**

- 0.823
- 0.833
- 0.843
- 0.853

In [16]:

```
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score


class RandomForestClassifierCustom(BaseEstimator):
    def __init__(self, n_estimators=10, max_depth=10, max_features=10,
                 random_state=17):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state
        self.trees = []
        self.feat_ids_by_tree = []

    def fit(self, X, y):
        # Your code is here
        pass

    def predict_proba(self, X):
        # Your code is here
        pass
```

In [17]:

```
# Your code is here
```

**Task 5.**
Let us compare our own implementation of a random forest with the `sklearn` version. To do this, use `RandomForestClassifier(class_weight='balanced', random_state=17)` and specify the same values for `max_depth` and `max_features` as before. What average ROC AUC do we get on cross-validation? Select the closest value.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q5*

**Answer options:**

- 0.823
- 0.833
- 0.843
- 0.853

In [18]:

```
from sklearn.ensemble import RandomForestClassifier
```

In [19]:

```
# Your code is here
```
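Scoring a forest with ROC AUC on cross-validation can be sketched as follows. The synthetic data and the hyperparameter values here (`n_estimators=20`, `max_depth=5`, `max_features=4`) are assumptions chosen to keep the sketch fast, not the values the task asks for.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the credit scoring sample
X_toy, y_toy = make_classification(n_samples=500, n_features=8, n_informative=4,
                                   weights=[0.8, 0.2], random_state=17)

forest = RandomForestClassifier(n_estimators=20, max_depth=5, max_features=4,
                                class_weight="balanced", random_state=17)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# One ROC AUC value per fold
scores = cross_val_score(forest, X_toy, y_toy, scoring="roc_auc", cv=cv)
print(round(scores.mean(), 3), round(scores.std(), 3))
```

Passing the same `cv` object to every model keeps the folds identical, so the averaged scores of different models are directly comparable.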

`sklearn` RandomForest, hyperparameter tuning

**Task 6.**
In the third task, we found the optimal hyperparameters for one tree. However, these parameters may not be optimal for an ensemble. Let's check this assumption with `GridSearchCV` (`RandomForestClassifier(class_weight='balanced', random_state=17)`). Now we extend the range of `max_depth` up to 15, because the trees need to be deeper in the forest (you should be aware of it from the article). What are the best values of hyperparameters now?

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q6*

**Answer options:**

- `max_depth=8, max_features=4`
- `max_depth=9, max_features=5`
- `max_depth=10, max_features=6`
- `max_depth=11, max_features=7`

In [20]:

```
max_depth_values = range(5, 15)
max_features_values = [4, 5, 6, 7]
forest_params = {'max_depth': max_depth_values,
                 'max_features': max_features_values}
```

In [21]:

```
# Your code is here
```

**Task 7.** Now let's compare our results with logistic regression (we specify `class_weight='balanced'` and `random_state = 17`). Do a full search over the parameter `C` from a wide range of values, `np.logspace(-8, 8, 17)`.
Now we will build a pipeline - first apply scaling, then train the model.

Learn about pipelines and run cross-validation. What is the best average ROC AUC? Select the closest value.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q7*

**Answer options:**

- 0.778
- 0.788
- 0.798
- 0.808

In [22]:

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
scaler = StandardScaler()
logit = LogisticRegression(random_state=17, class_weight='balanced')
logit_pipe = Pipeline([('scaler', scaler), ('logit', logit)])
logit_pipe_params = {'logit__C': np.logspace(-8, 8, 17)}
```

In [23]:

```
# Your code is here
```
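How a pipeline plugs into a grid search can be sketched on synthetic data. The data and the narrower range of `C` are assumptions of this sketch; the point is the `step__parameter` naming (`logit__C`) and the fact that the scaler is refit on the training part of every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the credit scoring sample
X_toy, y_toy = make_classification(n_samples=500, n_features=8, n_informative=4,
                                   random_state=17)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(random_state=17, class_weight="balanced")),
])

grid = GridSearchCV(
    pipe,
    param_grid={"logit__C": np.logspace(-3, 3, 7)},  # narrower than in the task
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=17),
)
grid.fit(X_toy, y_toy)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because scaling happens inside the pipeline, the held-out part of each fold never influences the scaler's mean and variance.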

With a small number of features, random forest proved to be better than logistic regression. However, one of the main disadvantages of trees is how they work with sparse data, for example, with texts. Let's compare logistic regression and random forest on a new task. Download the dataset with movie reviews here.
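To make "sparse data" concrete, the sketch below vectorizes three made-up review snippets with `CountVectorizer` and measures how many entries of the resulting matrix are non-zero. The snippets in `toy_reviews` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up review snippets
toy_reviews = ["a great movie", "a terrible movie", "great acting, great plot"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_counts = vectorizer.fit_transform(toy_reviews)  # SciPy sparse matrix

# Share of non-zero entries in the document-term matrix
density = X_counts.nnz / (X_counts.shape[0] * X_counts.shape[1])
print(X_counts.shape, f"density={density:.2f}")
```

Each document uses only a small part of the vocabulary, so most columns are zero for any given row, and the share of zeros typically grows with the size of the corpus and the n-gram range.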

In [24]:

```
# Load the data (first 50,000 rows)
df = pd.read_csv("../../data/movie_reviews_train.csv", nrows=50000)
# Separate the review texts from the labels
X_text = df["text"]
y_text = df["label"]
# Class counts
df.label.value_counts()
```

In [25]:

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# Split into 3 folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
# The pipeline vectorizes the text, then trains logistic regression
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', LogisticRegression(random_state=17))])
```

**Task 8.** For logistic regression: iterate over the values [0.1, 1, 10, 100] of the parameter `C` and find the best ROC AUC on cross-validation. Select the closest answer.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q8*

**Answer options:**

- 0.74
- 0.75
- 0.84
- 0.85

In [26]:

```
# Your code is here
```

**Task 9.** Now try to perform the same operation with random forest. Similarly, search over all the values below and get the maximum ROC AUC. Select the closest value.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a5_q9*

**Answer options:**

- 0.74
- 0.75
- 0.84
- 0.85

In [27]:

```
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(random_state=17, n_jobs=-1))])
min_samples_leaf = [1, 2, 3]
max_features = [0.3, 0.5, 0.7]
max_depth = [None]
```

In [28]:

```
# Your code is here
```