Authors: Vitaliy Radchenko and Yury Kashnitskiy. Translated and edited by Christina Butsko, Egor Polusmak, Anastasia Manokhina, Anna Shirshova, and Yuanyuan Pao. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0. Free use is permitted for any non-commercial purpose.

*This is a static version of a Jupyter notebook. You can also check out the latest version in the course repository, the corresponding interactive web-based Kaggle Kernel or a video lecture.*

Quite often, you would like to know the exact reasons why an algorithm outputs a particular answer. Or, if you cannot understand the algorithm completely, you would at least like to find out which input features contributed most to the result. With Random Forest, you can obtain such information quite easily.

From the picture below, it is intuitively clear that, in our credit scoring problem, *Age* is much more important than *Income*. This can be formally explained using the concept of *information gain*.

In the case of many decision trees or a random forest, the closer the mean position of a feature over all the trees is to the root, the more significant it is for a given classification or regression problem. Gains in the splitting criterion, such as the *Gini impurity*, obtained at each optimal split in every tree are a measure of importance that is directly associated with the splitting feature. The value of this score is distinct for each feature and accumulates over all the trees.

Let's go a little deeper into the details.

The average reduction in accuracy caused by a variable is determined during the calculation of the out-of-bag error. The greater the reduction in accuracy due to an exclusion or permutation of the variable, the higher its *importance score*. For this reason, variables with a greater average reduction in accuracy are generally more significant for classification.
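The permutation part of this procedure can be sketched directly: shuffle one feature column of a held-out set and measure the resulting drop in accuracy. The synthetic dataset and model settings below are made up for illustration; the actual out-of-bag computation used inside Random Forest differs in that each tree is evaluated only on the samples it did not see during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data just for illustration
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
base_acc = accuracy_score(y_test, clf.predict(X_test))

rng = np.random.RandomState(0)
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, j])  # destroy the j-th feature's signal
    drop = base_acc - accuracy_score(y_test, clf.predict(X_perm))
    print(f"feature {j}: accuracy drop {drop:.3f}")
```

Features that carry real signal show a noticeable accuracy drop when permuted, while uninformative ones stay near zero.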

The average reduction in the Gini impurity – or *MSE* for regression – represents the contribution of each variable to the homogeneity of nodes and leaves in the resulting Random Forest model. Every time a selected variable is used for splitting, the Gini impurity of the child nodes is calculated and compared with that of the original node.

The Gini impurity is a measure of homogeneity ranging from $0$ (homogeneous) to $1$ (heterogeneous). The changes in the value of the splitting criterion are accumulated for each variable and normalized at the end of the calculation. A larger reduction in the Gini impurity signals that splitting by this variable yields nodes with higher purity.
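To make the criterion concrete, here is a minimal Gini impurity computation for a single binary split; the labels and split are made up for illustration, not taken from the notebook's data:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1])             # impurity 0.5
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])

# Weighted decrease in impurity achieved by this split
n = len(parent)
decrease = gini(parent) - (len(left) / n * gini(left)
                           + len(right) / n * gini(right))
print(decrease)  # 0.5 -- a perfect split removes all impurity
```

This `decrease` is exactly the quantity that gets accumulated per feature over all splits and trees.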

Now, let's represent this method in analytic form:

$$\large VI^{T} = \frac{\sum_{i \in \mathfrak{B}^T}I \Big(y_i=\hat{y}_i^{T}\Big)}{\Big |\mathfrak{B}^T\Big |} - \frac{\sum_{i \in \mathfrak{B}^T}I \Big(y_i=\hat{y}_{i,\pi_j}^{T}\Big)}{\Big |\mathfrak{B}^T\Big |}$$

where

- $\large \hat{y}_i^{T} = f^{T}(x_i)$ is the class prediction before the feature permutation or exclusion;
- $\large \hat{y}_{i,\pi_j}^{T} = f^{T}(x_{i,\pi_j})$ is the class prediction after the feature permutation or exclusion;
- and $\large x_{i,\pi_j} = (x_{i,1}, \dots , x_{i,j-1}, \quad x_{\pi_j(i),j}, \quad x_{i,j+1}, \dots , x_{i,p})$.

Note that $\large VI^{T}(x_j) = 0$ if $\large x_j$ is not in the tree $\large T$.

Now, we can give the feature importance calculation for ensembles:

- not normalized:

$$\large VI(x_j) = \frac{\sum_{T=1}^{N}VI^{T}(x_j)}{N}$$

- normalized by the standard deviation of the differences:

$$\large z_j = \frac{VI(x_j)}{\frac{\hat{\sigma}}{\sqrt{N}}}$$
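In scikit-learn, this permutation-based score is available out of the box as `sklearn.inspection.permutation_importance` (whereas the `feature_importances_` attribute used later in this notebook is the impurity-based variant). A sketch on synthetic regression data, made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data just for illustration
X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
# importances_mean is the average score drop over repeats;
# importances_std is its spread (used for the normalized z-score above)
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```

Repeating the permutation (`n_repeats`) is what allows estimating the standard deviation that the normalized score divides by.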

Let's consider the results of a survey given to visitors of hostels listed on Booking.com and TripAdvisor.com. Our features here are the average ratings for different categories, including service quality, room condition, and value for money. Our target variable is the hostel's overall rating on the website.

In [1]:

```
%matplotlib inline
from matplotlib import pyplot as plt
# Graphics in SVG format are more sharp and legible
%config InlineBackend.figure_format = 'svg'
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
```

In [2]:

```
hostel_data = pd.read_csv("../input/hostel_factors.csv")
features = {"f1": u"Staff",
            "f2": u"Hostel booking",
            "f3": u"Check-in and check-out",
            "f4": u"Room condition",
            "f5": u"Shared kitchen condition",
            "f6": u"Shared space condition",
            "f7": u"Extra services",
            "f8": u"General conditions & conveniences",
            "f9": u"Value for money",
            "f10": u"Customer Co-creation"}

forest = RandomForestRegressor(n_estimators=1000, max_features=10,
                               random_state=0)
forest.fit(hostel_data.drop(['hostel', 'rating'], axis=1),
           hostel_data['rating'])

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot the feature importances of the forest
num_to_plot = 10
feature_indices = [ind + 1 for ind in indices[:num_to_plot]]

# Print the feature ranking
print("Feature ranking:")
for f in range(num_to_plot):
    print("%d. %s %f " % (f + 1,
                          features["f" + str(feature_indices[f])],
                          importances[indices[f]]))

plt.figure(figsize=(12, 6))
plt.title(u"Feature Importance")
bars = plt.bar(range(num_to_plot),
               importances[indices[:num_to_plot]],
               color=[str(i / float(num_to_plot + 1))
                      for i in range(num_to_plot)],
               align="center")
ticks = plt.xticks(range(num_to_plot), feature_indices)
plt.xlim([-1, num_to_plot])
plt.legend(bars, [u''.join(features["f" + str(i)])
                  for i in feature_indices]);
```

The picture above shows that, more often than not, customers pay the most attention to staff and the price-to-quality ratio. These two factors affect the resulting overall rating the most. However, the difference between these two features and the others is not very large, so we can conclude that excluding any of these features would reduce model accuracy. Based on our analysis, we can recommend that hostel owners focus primarily on staff training and the price-to-quality ratio.

To see the pros and cons of linear models and Random Forests, you can complete this assignment. The assignment is just for practice and comes with a solution.

- The same notebook as an interactive web-based Kaggle Kernel
- Medium "story" based on this notebook
- Main course site, course repo, and YouTube channel
- Course materials as a Kaggle Dataset
- If you read Russian: an article on Habrahabr with roughly the same material, and a lecture on YouTube
- Chapter 15 of the book "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman.
- More about practical applications of random forests and other algorithms can be found in the official documentation of `scikit-learn`.
- For a more in-depth discussion of variance and decorrelation of random forests, see the original paper.