## mlcourse.ai – Open Machine Learning Course¶

Authors: Pavel Nesterov (@mephistopheies), Yury Kashnitskiy (@yorko), and Daniel Potapov (@sharthZ23). Edited by Anastasia Manokhina (@manokhina). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

# Assignment #8. Fall 2018

## StackOverflow questions tagging with logistic regression

You need to derive formulas where it's asked (yes, with pen and paper), fill in the code in the cells and select answers in the web form.

## 0. Problem description¶

In this assignment, we will code a model for predicting tags based on a multilabel logistic regression. Unlike the usual setting of a multiclass problem, in this case one example can belong to several classes. We will implement an online version of the multilabel classification algorithm.

We will use a small sample of 70k questions extracted from StackOverflow (about 23 MB if zipped, download from here).

Actually, such implementations are used in real life (though not implemented in Python). For example, in online Click-Through Rate prediction models, the user is shown a banner, then, depending on the presence of a click, the model parameters are updated. In real applications, the number of model parameters can reach hundreds of millions, while one user usually has only a hundred or a thousand non-zero parameters out of this hundred of millions, therefore it is not effective to vectorize such a model. Usually, all user data is stored in huge clusters in in-memory databases, and user processing is distributed.

To perfectly grasp this folk wisdom, we'll dive into the math of multiclass & multilabel logistic regression.

## 1. Multiclass & multilabel logistic regression¶

### 1.1. Softmax classifier (multiclass logistic regression)¶

Let's see how logistic regression is derived for two classes $\left\{0, 1\right\}$: the probability that an instance belongs to class $1$ is derived from the Bayes theorem:

$$\large \begin{array}{rcl} p\left(c = 1 \mid \vec{x}\right) &=& \dfrac{p\left(\vec{x} \mid c = 1\right)p\left(c = 1\right)}{p\left(\vec{x}\right)} \\ &=& \dfrac{p\left(\vec{x} \mid c = 1\right)p\left(c = 1\right)}{p\left(\vec{x} \mid c = 1\right)p\left(c = 1\right) + p\left(\vec{x} \mid c = 0\right)p\left(c = 0\right)} \\ &=& \dfrac{1}{1 + e^{-a}} = \sigma\left(a\right) \end{array}$$ where:

• $\vec{x}$ – is a feature vector
• $\sigma$ – stands for the sigmoid function of a scalar argument
• $a = \log \frac{p\left(\vec{x} \mid c = 1\right)p\left(c = 1\right)}{p\left(\vec{x} \mid c = 0\right)p\left(c = 0\right)} = \sum_{i=0}^{d} w_i x_i$ – this relation is modeled by a linear function of features and model parameters

This expression can easily be generalized to a set of $K$ classes, only the denominator in the Bayes formula changes. Let us write down the probability that an instance belongs to class $k$: $$\large \begin{array}{rcl} p\left(c = k \mid \vec{x}\right) &=& \dfrac{p\left(\vec{x} \mid c = k\right)p\left(c = k\right)}{\sum_{i=1}^K p\left(\vec{x} \mid c = i\right)p\left(c = i\right)} = \dfrac{e^{z_k}}{\sum_{i=1}^{K}e^{z_i}} = \sigma_k\left(\vec{z}\right) \end{array}$$ where

• $\sigma_k$ – stands for a softmax function of a vector argument
• $z_k = \log p\left(\vec{x} \mid c = k\right)p\left(c = k\right) = \sum_{i=0}^{d} w_{ki} x_i$ – this relation is modeled by a linear function of features and model parameters for the class $k$

To model the full likelihood of one example (look, it's not the same as the likelihood of the whole training set), we use the categorical distribution, or, to be more precise, its logarithm (for convenience):

$$\Large \mathcal{L} = \log p\left({\vec{x}}\right) = \log \prod_{i=1}^K \sigma_i\left(\vec{z}\right)^{y_i} = \sum_{i=1}^K y_i \log \sigma_i\left(\vec{z}\right),$$

where

• $K$ is the number of classes
• $y_i$ is either 0 or 1, depending on the true class label of the example $\vec{x}$

It turns out to be a famous cross entropy function (if multiplied by $-1$). Likelihood needs to be maximized, and, accordingly, cross entropy should be minimized. By differentiating with respect to the parameters of the model, we will obtain the rules for updating the weights for gradient descent, do this derivation on your own, you will need to understand this for further fulfillment of the task:

$$\large \begin{array}{rcl} \frac{\partial \mathcal{L}}{\partial w_{km}} &=& x_m \left(y_k - \sigma_k\left(\vec{z}\right)\right) \end{array}$$

Softmax classifier is very well explained in Stanford's course cs231n "Convolutional Neural Networks for Visual Recognition".

### 1.2. Multilabel logistic regression¶

It turns out that the softmax classifier tends to predict a high probability for some class and low probabilities for all other classes. That's due to an exponent in the formula of softmax. Also, in the previous formulation, it turns out that the vector $\left(\sigma_1, \sigma_2, \ldots, \sigma_K\right)$ forms a discrete probability distribution, i.e. $\sum_{i=1}^K \sigma_i = 1$. But in our problem statement, each example can have several tags or can simultaneously belong to several classes. To take it into account we will slightly change the model:

• We assume that all tags are independent of each other, i.e. each outcome is a logistic regression on two classes (either there is a tag or not), then the probability that an example has a tag will be written as following (each tag/class has its own set of parameters as in the case of a softmax classifier): $$\large p\left(\text{tag}_k \mid \vec{x}\right) = \sigma\left(z_k\right) = \sigma\left(\sum_{i=0}^d w_{ki} x_i \right)$$
• The presence of each tag will be modeled using Bernoulli distribution

Your first task is to write a simplified expression for the negative log-likelihood (NLL) of one training example. As a rule, many optimization algorithms have an interface for minimizing the function, and we follow the same tradition and multiply the resulting expression for log-likelihood by $-1$ to get NLL $-\mathcal{L}$. In the second part, we derive formulas to minimize the resulting expression.

Question 1: What's the correct formula for negative log-likelihood of one training example?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q1

1. $\large -\mathcal{L} = -\sum_{i=1}^d y_i \log \sigma\left(z_i\right) + \left(1 - y_i\right) \log \left(1 - \sigma\left(z_i\right)\right)$

2. $\large -\mathcal{L} = -\sum_{i=1}^d z_i \log \sigma\left(y_i\right) + \left(1 - z_i\right) \log \left(1 - \sigma\left(y_i\right)\right)$

3. $\large -\mathcal{L} = -\sum_{i=1}^K z_i \log \sigma\left(y_i\right) + \left(1 - z_i\right) \log \left(1 - \sigma\left(y_i\right)\right)$

4. $\large -\mathcal{L} = -\sum_{i=1}^K y_i \log \sigma\left(z_i\right) + \left(1 - y_i\right) \log \left(1 - \sigma\left(z_i\right)\right)$

## 2. Deriving the formula for weight updates¶

In the second task, you need to derive the formula for the partial derivative of $-\mathcal{L}$ w.r.t weights.

Question 2: What's the correct formula for the derivative of negative log-likelihood w.r.t. to weights?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q2

1. $\large -\frac{\partial \mathcal{L}}{\partial w_{km}} = -x_m \left(y_k - \sigma\left(z_k\right)\right)$
2. $\large -\frac{\partial \mathcal{L}}{\partial w_{km}} = -x_m \left(\sigma\left(z_k\right) - y_k\right)$
3. $\large -\frac{\partial \mathcal{L}}{\partial w_{km}} = \left(y_k - \sigma\left(z_k\right)x_m\right)$
4. $\large -\frac{\partial \mathcal{L}}{\partial w_{km}} = \left(\sigma\left(z_k\right)x_m - y_k\right)$

## 3. Basic model implementation¶

First, let's check the configuration

In [1]:
#!pip install watermark

In [ ]:
%watermark -v -m -p numpy,scipy,pandas,matplotlib,sklearn -g


Docker and author's laptop configuration:

CPython 3.5.2
IPython 7.0.1

numpy 1.15.2
scipy 1.1.0
pandas 0.23.4
matplotlib 3.0.0
sklearn 0.20.0

compiler : GCC 5.4.0 20160609
system : Linux
release : 4.17.14-041714-generic
machine : x86_64
processor : x86_64
CPU cores : 12
interpreter: 64bit

In [2]:
from collections import defaultdict
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
from IPython.display import clear_output

pd.set_option('display.float_format', lambda x: '{:g}'.format(x))
np.set_printoptions(suppress=True)
sns.set_style("dark")
plt.rcParams['figure.figsize'] = 16, 12


Let's load and preprocess our dataset. Change paths to data files if needed.

In [3]:
df = pd.read_csv('../../data/stackoverflow_sample_70k.csv.zip')

In [4]:
df.head()

Out[4]:
question tags
0 i ve got some code in window scroll that check... javascript jquery
1 i have a custom adapter for a list view it has... android
2 in my form panel i added a checkbox setting st... javascript
3 i have the two dates variables startwork and e... c#
4 i might have been using the wrong search tags ... android

Top tags on StackOverflow for these 70k questions are the following:

In [5]:
top_tags = ['python', 'ios', 'html', 'android', 'c++', 'jquery', 'java', 'php', 'c#', 'javascript']


question and tags are strings, so we need to preprocess them.

Preprocessing steps will be as follows:

• convert to lowercase
• strip whitespaces
• split by whitespaces to form a list of words
In [6]:
%%time
df['tags'] = df['tags'].str.lower()\
.str.strip()\
.str.split(' ')
df['question'] = df['question'].str.lower()\
.str.strip()\
.str.split(' ')

CPU times: user 1.07 s, sys: 227 ms, total: 1.3 s
Wall time: 1.3 s

In [7]:
df.head()

Out[7]:
question tags
0 [i, ve, got, some, code, in, window, scroll, t... [javascript, jquery]
1 [i, have, a, custom, adapter, for, a, list, vi... [android]
2 [in, my, form, panel, i, added, a, checkbox, s... [javascript]
3 [i, have, the, two, dates, variables, startwor... [c#]
4 [i, might, have, been, using, the, wrong, sear... [android]
In [8]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 2 columns):
question    70000 non-null object
tags        70000 non-null object
dtypes: object(2)
memory usage: 128.2 MB


You are given a template of the class LogRegressor. Analyze it carefully paying attention to all comments. Then fill in the blanks, run the resulting model and answer the test question.

As you might notice, when you update the weight of $w_ {km}$, the value of the sign $x_m$ is used, which is $0$ if the word with the index $m$ is not in the sentence, and is greater than zero if there is such a word. Accordingly, when calculating a linear combination $z$ of model weights and sample features, only non-zero attributes of the object should be considered.

Hint:

• If you implement the calculation of the sigmoid in the same way as in the formula, then for a large negative value $z$ the calculation of $e ^ {- z}$ turns into a very large number that will go beyond permissible limits
• at the same time $e ^ {- z}$ from a large positive $z$ will be zero
• use properties of the sigmoid function $\sigma$ to fix this.
In [ ]:
class LogRegressor():
def __init__(self, tags):
"""LogRegressor class constructor

Parameters
----------
tags: list of string
"""
self.__version__ = 'v0.3'
# set will drop duplicated tags
self._tags = set(tags)

# A dictionary that contains the mapping of sentence words and tags into indexes (to save memory)
# example: self._vocab ['exception'] = 17 means that the word "exception" has an index of 17
self._vocab = {} #defaultdict(lambda: len(self._vocab))

# parameters of the model: weights
# for each class/tag we need to store its own vector of weights
# By default, all weights will be zero
# we do not know in advance how many scales we will need
# so for each class we create a dictionary of a variable size with a default value of 0
# example: self._w['java'][self._vocab['exception']] contains a weight for word exception and tag java
self._w = dict([(t, defaultdict(int)) for t in tags])

# parameters of the model: bias term or w_0 weight
self._b = dict([(t, 0) for t in tags])

def update_vocab(self, words_list):
"""Update vocab with new words from words_list

Parameters
----------
words_list: list of strings
"""
for word in words_list:
# every new word will get index=len(self._vocab)
# so at the end of training all words will be numbered from 0 to len(self._vocab)
if word not in self._vocab:
self._vocab[word] = len(self._vocab)

def generate_vocab(self, df, column_name):
"""Build vocabulary from a dataframe column which contain lists of words.

Parameters
----------
df: pandas.Dataframe

column_name: string
"""
if column_name not in df.columns:
raise ValueError("DataFrame doesnt have '{}' column!")
df[column_name].map(self.update_vocab)

def fit_sample(self, sample):
"""Fit single sample

Parameters
----------
sample: pandas.Series
a dict-like object which contains a question and its tags

Returns
-------
pandas.Series object with metrics for sample
"""
# sample.name is value from df.index aka row number
sample_id = sample.name
question = sample['question']
tags = set(sample['tags'])

sample_loss = 0

# derive the gradients for each tag
for tag in self._tags:
# target is 1 if this sample has the tag
y = int(tag in tags)
# calculate a linear combination of weights and features
# z = ...

for word in question:
is_word_unknown = word not in self._vocab
# in the test mode, ignore the words that are not in the vocabulary
if sample_id >= self.top_n_train and is_word_unknown:
continue
# z += ...

# calculate the probability of tag
# sigma = ...

# update the value of the loss function for this sample
# sample_loss += ...

# If the sample belongs to the training part, update the model weights
if sample_id < self.top_n_train:
# compute the log-likelihood derivative by weight
# dLdw = ...

# We minimize negative log-likelihood (second minus sign)
# so we go to the opposite direction of the gradient to minimize it (the first minus sign)
delta = self.learning_rate * dLdw
for word in question:
self._w[tag][self._vocab[word]] -= -delta
self._b[tag] -= -delta
if sample_id % self.show_period == 0:
n = sample_id + self.show_period
clear_output(wait=True)
print('LogRegressor {} | {} ({:.2f}%) samples fitted.'.format(
self.__version__,
n,
100 * n / self.total_len))
return pd.Series({'loss': sample_loss})

def fit_dataframe(self,
df,
top_n_train=60000,
learning_rate=0.1,
tolerance=1e-16):
"""One run through dataframe

Parameters
----------
df : pandas.DataFrame
pandas DataFrame with question and tags data

top_n_train : int
first top_n_train samples will be used for training, the remaining samples are for the test
default=60000

learning_rate : float
default=0.1

tolerance : float
used for bounding the values of logarithm argument
default=1e-16

Returns
-------
pandas.DataFrame with metrics for each sample
"""
self.total_len = df.shape[0]
self.top_n_train = top_n_train
self.learning_rate = learning_rate
self.tolerance = tolerance

if self.top_n_train > self.total_len:
print("Warning! 'top_n_train' more than dataframe rows count!\n"
"Set default 'top_n_train'=60000")
self.top_n_train = 60000

# generating self._vocab
self.generate_vocab(df, column_name='question')
# Show progress every self.show_period sample, 1% by default
self.show_period = self.total_len // 100
# apply self.fit_sample to each row (sample) of dataframe
self.metrics = df.apply(self.fit_sample, axis=1)
return self.metrics

In [ ]:
%%time
model = LogRegressor(tags=top_tags)
# by default, we will train on first 60k samples, and test on last 10k
metrics = model.fit_dataframe(df)

In [ ]:
metrics.head()


Let's check if the value of negative logarithmic likelihood has actually decreased. Since we are using stochastic gradient descent, we should not expect a smooth fall of the loss function. We will use a moving average with a window of 10,000 examples to smooth the graph.

In [ ]:
plot = plt.plot(pd.Series(metrics['loss'][:-10000]).rolling(10000).mean())


Spolier to save your time: if you get such a graph of the loss function, it's OK.

In [ ]:
last_10k_train_loss = np.mean(metrics['loss'][-20000:-10000])
print('Mean of the loss function on the last 10k TRAIN samples: {:.2f}'.format(last_10k_train_loss))


Question 3: What's the average value of the cost function for the last 10000 examples of the training set?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q3

1. 18.31
2. 19.86
3. 21.74
4. 26.43

## 4. Model testing¶

In the base model, the first 60k rows are used for training, and the remaining ones are used as a test set. As you can see, the value of negative log-likelihood is not very informative, although it allows you to compare different models. In the fourth task, you need to modify the base model so that the fit_dataframe method calculates the value of accuracy on the test portion of the dataset for every sample.

The accuracy is defined as following:

• consider that the question has a tag if the predicted probability of the tag is greater than 0.9
• the accuracy of one example is calculated as Jaccard coefficient between the set of real tags and tags predicted by the model
• for example, if the sample has real tags ['html', 'jquery'], and according to the model they are ['ios', 'html', 'java'], then the Jaccard coefficient will be |['html', 'jquery'] $\cap$ ['ios', 'html', 'java']| / |['html', 'jquery'] $\cup$ ['ios', 'html', 'java']| = |['html']| / |['jquery', 'ios', 'html', 'java']| = 1/4
• fit_dataframe method returns pd.DataFrame with column Jaccard
• For answer you need to calculate average (mean) accuracy on Jaccard column on the test set

Modified class:

In [ ]:
class LogRegressor():
def __init__(self, tags):
self.__version__ = 'v0.4'
self._tags = set(tags)
self._vocab = {}
self._w = dict([(t, defaultdict(int)) for t in tags])
self._b = dict([(t, 0) for t in tags])

def update_vocab(self, words_list):
for word in words_list:
if word not in self._vocab:
self._vocab[word] = len(self._vocab)

def generate_vocab(self, df, column_name):
if column_name not in df.columns:
raise ValueError("DataFrame doesnt have '{}' column!")
df[column_name].map(self.update_vocab)

def fit_sample(self, sample):
sample_id = sample.name
question = sample['question']
tags = set(sample['tags'])
sample_loss = 0
predicted_tags = None

for tag in self._tags:
y = int(tag in tags)
# z = ...

for word in question:
is_word_unknown = word not in self._vocab
if sample_id >= self.top_n_train and is_word_unknown:
continue
# z += ...

# sigma = ...

# sample_loss += ...

if sample_id < self.top_n_train:
# dLdw = ...

delta = self.learning_rate*dLdw
for word in question:
self._w[tag][self._vocab[word]] -= -delta
self._b[tag] -= -delta
else:
if predicted_tags is None:
predicted_tags = []
# if sigma... :
#     predicted_tags...

if sample_id % self.show_period == 0:
n = sample_id + self.show_period
clear_output(wait=True)
print('LogRegressor {} | {} ({:.2f}%) samples fitted.'.format(
self.__version__,
n,
100 * n / self.total_len))
if predicted_tags is not None:
# Jaccard = ...
return pd.Series({'loss': sample_loss, 'Jaccard': Jaccard})
else:
return pd.Series({'loss': sample_loss, 'Jaccard': np.NaN})

def fit_dataframe(self,
df,
top_n_train=60000,
learning_rate=0.1,
tolerance=1e-16,
accuracy_level=0.9):
self.total_len = df.shape[0]
self.top_n_train = top_n_train
self.learning_rate = learning_rate
self.tolerance = tolerance
self.accuracy_level = accuracy_level

if self.top_n_train > self.total_len:
print("Warning! 'top_n_train' more than dataframe rows count!\n"
"Set default 'top_n_train'=60000")
self.top_n_train = 60000

self.generate_vocab(df, column_name='question')
self.show_period = self.total_len // 100
self.metrics = df.apply(self.fit_sample, axis=1)
return self.metrics

In [ ]:
%%time
model = LogRegressor(tags=top_tags)
metrics = model.fit_dataframe(df)

In [ ]:
metrics.head()

In [ ]:
metrics.tail()

In [ ]:
# HERE'S YOUR CODE
# accuracy = ...
print('Mean Jaccard accuracy: {:.2f}'.format(accuracy))


Question 4: What is the mean Jaccard accuracy for the test set?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q4

1. 0.31
2. 0.41
3. 0.51
4. 0.61

## 5. $L_2$-regularization¶

In the fifth task, you need to add $L_2$-regularization to the LogRegressor class. In the fit_sample method, the lambda_ = 0.01 parameter with the default value should appear (we call this argument lambda_ (with underscore) because lambda (without underscore) is a reserved Python name). Taking into account regularization, the new cost function takes the form:

$$\large \begin{array}{rcl} L &=& -\mathcal{L} + \frac{\lambda}{2} R\left(W\right) \\ &=& -\mathcal{L} + \frac{\lambda}{2} \sum_{k=1}^K\sum_{i=1}^d w_{ki}^2 \end{array}$$

We have already derived the gradient of the first term of the sum, and for the second one it looks like:

$$\large \begin{array}{rcl} \frac{\partial}{\partial w_{ki}} \frac{\lambda}{2} R\left(W\right) &=& \lambda w_{ki} \end{array}$$

If we make an explicit update of all weights on each example, then the process will be very slow, because we have to run through every word of the vocabulary at each iteration. At the expense of the theoretical accuracy, we use a dirty trick: we regularize only those words that are present in the current sentence. Do not forget that the bias term is not regularized. sample_loss should also remain unchanged.

Modified class:

In [ ]:
class LogRegressor():
def __init__(self, tags):
self.__version__ = 'v0.5'
self._tags = set(tags)
self._vocab = {}
self._w = dict([(t, defaultdict(int)) for t in tags])
self._b = dict([(t, 0) for t in tags])

def update_vocab(self, words_list):
for word in words_list:
if word not in self._vocab:
self._vocab[word] = len(self._vocab)

def generate_vocab(self, df, column_name):
if column_name not in df.columns:
raise ValueError("DataFrame doesnt have '{}' column!")
df[column_name].map(self.update_vocab)

def fit_sample(self, sample):
sample_id = sample.name
question = sample['question']
tags = set(sample['tags'])
sample_loss = 0
predicted_tags = None

for tag in self._tags:
y = int(tag in tags)
# z = ...

for word in question:
is_word_unknown = word not in self._vocab
if sample_id >= self.top_n_train and is_word_unknown:
continue
# z += ...

# sigma = ...

# sample_loss += ...

if sample_id < self.top_n_train:
# dLdw = ...

delta = self.learning_rate*dLdw
for word in question:
# self._w[tag][self._vocab[word]] -= (- delta...
self._b[tag] -= -delta
else:
if predicted_tags is None:
predicted_tags = []
# if sigma... :
#     predicted_tags...

if sample_id % self.show_period == 0:
n = sample_id + self.show_period
clear_output(wait=True)
print('LogRegressor {} | {} ({:.2f}%) samples fitted.'.format(
self.__version__,
n,
100 * n / self.total_len))
if predicted_tags is not None:
# Jaccard = ...
return pd.Series({'loss': sample_loss, 'Jaccard': Jaccard})
else:
return pd.Series({'loss': sample_loss, 'Jaccard': np.NaN})

def fit_dataframe(self,
df,
top_n_train=60000,
learning_rate=0.1,
tolerance=1e-16,
accuracy_level=0.9,
lambda_=0.01):
self.total_len = df.shape[0]
self.top_n_train = top_n_train
self.learning_rate = learning_rate
self.tolerance = tolerance
self.accuracy_level = accuracy_level
self.lambda_ = lambda_

if self.top_n_train > self.total_len:
print("Warning! 'top_n_train' more than dataframe rows count!\n"
"Set default 'top_n_train'=60000")
self.top_n_train = 60000

self.generate_vocab(df, column_name='question')
self.show_period = self.total_len // 100
self.metrics = df.apply(self.fit_sample, axis=1)
return self.metrics

In [ ]:
%%time
model = LogRegressor(tags=top_tags)
metrics = model.fit_dataframe(df)

In [ ]:
# HERE'S YOUR CODE
# accuracy = ...
print('{:.2f}'.format(accuracy))
plot = plt.plot(pd.Series(metrics['loss'][:-10000]).rolling(10000).mean())


Question 5: What's the average value of Jaccard accuracy in case of $L_2$-regularization?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q5

1. 0.32
2. 0.38
3. 0.48
4. 0.52

## 6. ElasticNet regularization, derivation¶

In addition to $L_2$ regularization, $L_1$ regularization is often used.

$$\large \begin{array}{rcl} L &=& -\mathcal{L} + \lambda R\left(W\right) \\ &=& -\mathcal{L} + \lambda \sum_{k=1}^K\sum_{i=1}^d \left|w_{ki}\right| \end{array}$$

If we linearly combine $L_1$ and $L_2$ regularization, then the resulting regularization type is called ElasticNet:

$$\large \begin{array}{rcl} L &=& -\mathcal{L} + \lambda R\left(W\right) \\ &=& -\mathcal{L} + \lambda \left(\gamma \sum_{k=1}^K\sum_{i=1}^d w_{ki}^2 + \left(1 - \gamma\right) \sum_{k=1}^K\sum_{i=1}^d \left|w_{ki}\right| \right) \end{array}$$

• where $\gamma \in \left[0, 1\right]$

Here we omitted the multiplier $\frac{1}{2}$ that we used for $L_2$-regularization.

Question 6: What's the correct formula for the gradient of the ElasticNet regularization term?

1. $\large \frac{\partial}{\partial w_{ki}} \lambda R\left(W\right) = \lambda \left(\gamma w_{ki} + \left(1 - \gamma\right) \text{sign}\left(w_{ki}\right)\right)$
2. $\large \frac{\partial}{\partial w_{ki}} \lambda R\left(W\right) = \lambda \left(2 \gamma w_{ki} + \left(1 - \gamma\right) w_{ki}\right)$
3. $\large \frac{\partial}{\partial w_{ki}} \lambda R\left(W\right) = \lambda \left(2 \gamma w_{ki} + \left(1 - \gamma\right) \text{sign}\left(w_{ki}\right)\right)$
4. $\large \frac{\partial}{\partial w_{ki}} \lambda R\left(W\right) = \lambda \left(2 \gamma \left|w_{ki}\right| + \left(1 - \gamma\right) \text{sign}\left(w_{ki}\right)\right)$

## 7. ElasticNet regularization, implementation¶

In the seventh task you are supposed to change the class LogRegressor so that thefit_dataframe method takes two parameters with default values lambda = 0.001 and gamma = 0.1. Do one pass through the dataset with ElasticNet regularization and default parameter values and answer the question.

Modified class:

In [ ]:
class LogRegressor():
def __init__(self, tags):
self.__version__ = 'v0.7'
self._tags = set(tags)
self._vocab = {}
self._w = dict([(t, defaultdict(int)) for t in tags])
self._b = dict([(t, 0) for t in tags])


In [ ]:
%%time
model = LogRegressor(tags=top_tags)
metrics = model.fit_dataframe(df)

In [ ]:
# HERE'S YOUR CODE
# accuracy = ...
plot = plt.plot(pd.Series(metrics['loss'][:-10000]).rolling(10000).mean())


Question 7: What's the average value of Jaccard accuracy in case of ElasticNet regularization?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q7

1. 0.51
2. 0.61
3. 0.71
4. 0.81

## 8. The most important words for a tag¶

The beauty of linear models is that they are somewhat interpretable. You are supposed to calculate which words contribute the most to the probability of each of the tags. And then answer the question.

In [ ]:
model._vocab_inv = dict([(v, k) for (k, v) in model._vocab.items()])
top = 5
for tag in model._tags:
# top5_words = ...
print(tag, ':', ', '.join(top5_words))


For many tags, the presence of the tag itself in the sentence is an important signal, and for many, the tag itself is the strongest signal, which is not surprising.

Question 8: For which of the tags the tag name itself is not included in the top 5 most important words?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q8

1. android
2. javascript
3. jquery
4. c#

## 9. Reducing the size of the vocabulary¶

Now the number of words in the vocabulary is too big. If it was a sample of 10 million questions from the StackOverflow website, then the vocabulary size would've been ~ 10 million as well. You can regularize the model not only mathematically, but also simply limiting the size of the vocabulary. You are supposed to make the following changes in the class LogRegressor:

• add self._word_stats = defaultdict(int) to __init__ to calculate word frequencies
• add one more argument to the fit_dataframe method with the default value freeze_vocab = False
• when freeze_vocab = False allow adding words to the vocabulary and word_stats
• when freeze_vocab = True ignore words which are not in the vocabulary and don't update word_stats
• add the class method filter_vocab (n = 10000), which will leave only top-n most popular words in the dictionary

For first fit_dataframe call use learning_rate=0.2.

Modified class:

In [ ]:
class LogRegressor():
def __init__(self, tags):
self.__version__ = 'v0.9'
self._tags = set(tags)
self._vocab = {}
self._w = dict([(t, defaultdict(int)) for t in tags])
self._b = dict([(t, 0) for t in tags])
self._word_stats = defaultdict(int)


In [ ]:
%%time
model = LogRegressor(tags=top_tags)
# metrics = model.fit_dataframe(df, ...
# accuracy = ...
print('Mean Jaccard accuracy: {:.2f}'.format(accuracy))
plot = plt.plot(pd.Series(metrics['loss'][:-10000]).rolling(10000).mean())


We leave only 10 000 words.

In [ ]:
model.filter_vocab(top_n=10000)


Do one more iteration through the dataset, reducing learning rate 20 times and L2-regularization 5 times with freezed vocab:

In [ ]:
%%time
# metrics = model.fit_dataframe(df, ...
# accuracy = ...
print('Mean Jaccard accuracy: {:.2f}'.format(accuracy))
plot = plt.plot(pd.Series(metrics['loss'][:-10000]).rolling(10000).mean())


Question 9: What's the average value of Jaccard accuracy in case of reducing the vocabulary size?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q9

1. 0.66
2. 0.69
3. 0.72
4. 0.75

## 10. Predicting tags for new questions¶

At the end of this assignment, you are supposed to implement the method predict_proba, which takes a model and a string containing the question and returns a list of predicted question tags with their probabilities.

In [ ]:
def predict_proba(model, sentence):
p = {}
return p

In [ ]:
sentence = ("I want to improve my coding skills, so I have planned write " +
"a Mobile Application.need to choose between Apple's iOS or Google's Android." +
" my background: I have done basic programming in .Net,C/C++,Python and PHP " +
"in college, so got OOP concepts covered. about my skill level, I just know " +
"concepts and basic syntax. But can't write complex applications, if asked :(" +
" So decided to hone my skills, And I wanted to know which is easier to " +
"learn for a programming n00b. A) iOS which uses Objective C B) Android " +
"which uses Java. I want to decide based on difficulty level")


Preprocessing of the question (sentence) will only include converting it to lower case and deleting commas.

In [ ]:
pred = predict_proba(model, sentence.lower().replace(',', ''))

In [ ]:
tag_preds = sorted(pred.items(), key=lambda t: t[1], reverse=True)

In [ ]:
list(filter(lambda t: t[1] > 0.9, tag_preds))


Question 10: Which tag or tags are associated with this question if the acceptance threshold is $0.9$?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a8_q10