Author: Yury Kashnitsky. Translated and edited by Serge Oreshkov, and Yuanyuan Pao. This material is subject to the terms and conditions of the license Creative Commons CC BY-NC-SA 4.0. Free use is permitted for any non-comercial purpose.

This week, we’ll cover two reasons for Vowpal Wabbit’s exceptional training speed, namely, online learning and hashing trick, in both theory and practice. We will try it out with news, movie reviews, and StackOverflow questions.

In [1]:

```
import warnings
warnings.filterwarnings('ignore')
import os
import re
import numpy as np
import pandas as pd
from pathlib2 import Path
from tqdm import tqdm_notebook
from sklearn.datasets import fetch_20newsgroups, load_files
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, log_loss
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%config InlineBackend.figure_format = 'retina'
```

Despite the fact that gradient descent is one of the first things learned in machine learning and optimization courses, it is hard to overrate one of it's modifications, namely, Stochastic Gradient Descent (SGD).

Lets recap that the very idea of gradient descent is to minimize some function by making small steps in the direction of fastest function decreasing. The method was named due to the following fact from calculus: vector $\nabla f = (\frac{\partial f}{\partial x_1}, \ldots \frac{\partial f}{\partial x_n})^T$ of partial derivatives of the function $f(x) = f(x_1, \ldots x_n)$ points to the direction of the fastest function growth. It means that by moving in the opposite direction (antigradient) it is possible to decrease the function value with the fastest rate.

Here is a snowboarder (me) in Sheregesh, Russian most popular winter resort. I highly recommended it if you like skiing or snowboarding. We place this picture not only for a good view but also for picturing the idea of gradient descent. If you have an aim to ride as fast as possible, you need to choose the way with steepest descent (as long as you stay alive). Calculating antigradient can be seen as evaluating the slope in each particular point.

**Example**

The paired regression problem can be solved with gradient descent. Let us predict one variable with another, say, height with weight and assume that these variables are linearly dependent. Here we are using the SOCR dataset.

In [2]:

```
PATH_TO_ALL_DATA = Path('../../data/')
data_demo = pd.read_csv(PATH_TO_ALL_DATA / 'weights_heights.csv')
```

In [3]:

```
plt.scatter(data_demo['Weight'], data_demo['Height']);
plt.xlabel('Weight in lb')
plt.ylabel('Height in inches');
```

Here we have a vector $x$ of dimension $\ell$ (weight of every person, i.e. training sample) and $y$, a vector which stores heights of every person in the dataset.

The task is the following: find weights $w_0$ and $w_1$ such that predicting height with weight as $y_i = w_0 + w_1 x_i$ (where $y_i$ is $i$-th height value, $x_i$ is $i$-th weight value) minimizes the squared error (as well as mean squared error as $\frac{1}{\ell}$ doesn't make any difference ): $$\Large SE(w_0, w_1) = \frac{1}{2}\sum_{i=1}^\ell(y_i - (w_0 + w_1x_{i}))^2 \rightarrow min_{w_0,w_1}$$

We will use gradient descent, utilizing partial derivatives of $SE(w_0, w_1)$ over weights $w_0$ and $w_1$. An iterative training procedure is then defined by simple weight update formulas (we change model weights in small steps, proportionally to a small constant $\eta$, towards antigradient of the function $SE(w_0, w_1)$):

$$\Large \begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} -\eta \frac{\partial SE}{\partial w_0} |_{t} \\ w_1^{(t+1)} = w_1^{(t)} -\eta \frac{\partial SE}{\partial w_1} |_{t} \end{array}$$Diving in into analytical computing of partial derivatives, we get the following:

$$\Large \begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} + \eta \sum_{i=1}^{\ell}(y_i - w_0^{(t)} - w_1^{(t)}x_i) \\ w_1^{(t+1)} = w_1^{(t)} + \eta \sum_{i=1}^{\ell}(y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i \end{array}$$This math works quite well until the amount of data isn't very huge (we will not discuss issues with local minima, saddle points, choosing the value of learning rate (gradient descent step), moments and other stuff – lots these topics are well presented in chapter "Numeric Computation" in "Deep Learning" book). There is an issue with this approach (batch gradient descent) - gradient evaluation requires the summation of some values for every object from training set. In other words, the algorithm requires a lot of iterations, at every iteration we have to recompute weights with formula which have a sum $\sum_{i=1}^\ell$ over the whole training set. What if we have billions of training samples?

Here goes stochastic gradient descent! Simply, we throw away the summation sign and update the weights only over single training samples (or some dozen of them). Considering our case:

$$\Large \begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} + \eta (y_i - w_0^{(t)} - w_1^{(t)}x_i) \\ w_1^{(t+1)} = w_1^{(t)} + \eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i \end{array}$$With this approach there is no guarantee that we will move in best possible direction at every iteration and we might need many more iterations, but we get virtually instant weight updates.

Andrew Ng has a good illustration in his machine learning course, lets take a look at it.

Here are contour plots of some function and we are looking for the global minimum of this function. The red curve shows weight changes (in this picture $\theta_0$ and $\theta_1$ correspond to $w_0$ and $w_1$ in our example). According to the properties of a gradient, the direction of change in every point is orthogonal to contour plots. With stochastic gradient descent weights are changing in a less predictable manner and it even may seem that some steps are wrong – leading away from minima, but both procedures converge to the same solution.

Stochastic gradient descent, being one of optimization methods, gives us practical guidance for training both classifiers and regressors with huge amounts of data, up to hundreds of GiBs (depending on the computational resources on one's disposal).

Considering the case of paired regression, we can store the training data set $(X,y)$ in HDD without loading it into RAM (it simply won't fit), read objects one by one and update the weights of our model:

$$\Large \begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} + \eta (y_i - w_0^{(t)} - w_1^{(t)}x_i) \\ w_1^{(t+1)} = w_1^{(t)} + \eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i \end{array}$$After working out the whole training dataset, the loss function we are trying to minimize (quadratic squared root error in case of regression or, for example, logistic loss for classification task), will decrease, but usually it takes dozens of passes over the training set to make the loss small enough.

This approach to learning is usually called **online learning**, and this name emerged even before machine learning MOOC-s turned mainstream.

We do not discuss many specific things about SGD and if you want do dive into theory, a good choice is "Convex Optimization" book by Boyd). Instead, we will introduce the Vowpal Wabbit library, which is good for training simple models with huge data sets thanks to stochastic optimization and another trick – feature hashing, which we are introducing in the next part of the tutorial.

In Scikit-learn classifiers and regressors, trained with SGD, are named `SGDClassifier`

and `SGDRegressor`

from `sklearn.linear_model`

. These are nice implementations of SGD but we'll focus on VW as being superior to sklearn's SGD models in many aspects including computational performance.

Many classification and regression algorithms operate in euclidean or metric spaces, implying that data is represented with vectors of real numbers. However, in real data we often meet categorical features with discrete values range, like yes/no or January/February/.../December. We will see how to process this kind of data, particularly with linear models and how to deal with lots of categorical features even when these features have lots of unique values.

Here we explore the UCI bank marketing dataset, where most of features are categorical.

In [4]:

```
df = pd.read_csv(PATH_TO_ALL_DATA / 'bank_train.csv')
labels = pd.read_csv(PATH_TO_ALL_DATA / 'bank_train_target.csv', header=None).values
df.head()
```

Out[4]:

We see, most of features are not represented by numbers. It poses a problem: we cannot use most of machine learning methods (at least those implemented in Scikit-learn) out-of-the-box.

Let's explore the feature "education".

In [5]:

```
df['education'].value_counts().plot.barh();
```

The most straightforward solution would be to map each value of this feature into a unique number. For example we can map `university.degree`

to 0, `basic.9y`

to 1 and so on. One can use `sklearn.preprocessing.LabelEncoder`

to perform this mapping.

In [6]:

```
label_encoder = LabelEncoder()
```

The `fit`

method of this class finds all unique values and builds the actual mapping between categories and numbers, and the `transform`

method converts categories into numbers. After `fit`

is executed, `label_encoder`

will have the `classes_`

attribute, with all unique values of the feature. Let us count them to make sure the transformation was correct.

In [7]:

```
mapped_education = pd.Series(label_encoder.fit_transform(df['education']))
mapped_education.value_counts().plot.barh()
print(dict(enumerate(label_encoder.classes_)))
```

In [8]:

```
df['education'] = mapped_education
df.head()
```

Out[8]:

Let's apply such transformation to all columns with type `object`

.

In [9]:

```
categorical_columns = df.columns[df.dtypes == 'object'].union(['education'])
for column in categorical_columns:
df[column] = label_encoder.fit_transform(df[column])
df.head()
```

Out[9]:

The main issue with this approach is that we've introduced some order while there might haven't been such.

For example, we implicitly introduced algebra over the values of the job feature, now we can substract the job of client #2 from the job of client #1 :

In [10]:

```
df.loc[1].job - df.loc[2].job
```

Out[10]:

Does this operation make sense? Maybe not. Let's try to train logistic regression with this feature transformation.

In [11]:

```
def logistic_regression_accuracy_on(dataframe, labels):
features = dataframe.as_matrix()
train_features, test_features, train_labels, test_labels = \
train_test_split(features, labels)
logit = LogisticRegression()
logit.fit(train_features, train_labels)
return classification_report(test_labels, logit.predict(test_features))
print(logistic_regression_accuracy_on(df[categorical_columns],
labels.ravel()))
```

We can see that logistic regression hasn't ever predicted class 1. In order to use linear models with categorical features we will take a different approach - One-Hot Encoding.

Suppose that some feature can have one of 10 unique values. In this case one-hot encoding creates 10 new features corresponding to these unique values, all of them *except one* are zeros.

In [12]:

```
one_hot_example = pd.DataFrame([{i: 0 for i in range(10)}])
one_hot_example.loc[0, 6] = 1
one_hot_example
```

Out[12]:

This idea is implemented in the `OneHotEncoder`

class from `sklearn.preprocessing`

. By default `OneHotEncoder`

transforms data into a sparse matrix for the sake of saving memory space - most of the values are zeroes and we don't want to spare extra RAM. However, in this particular example we don't run into such problems, so we are going to use "dense" matrix representation.

In [13]:

```
onehot_encoder = OneHotEncoder(sparse=False)
```

In [14]:

```
encoded_categorical_columns = pd.DataFrame(onehot_encoder.fit_transform(df[categorical_columns]))
encoded_categorical_columns.head()
```

Out[14]:

We've got 53 columns that correspond to the number of unique values of categorical features in our data set. When transformed with One-Hot Encoding, this data can be used with linear models:

In [15]:

```
print(logistic_regression_accuracy_on(encoded_categorical_columns,
labels.ravel()))
```

Real data can be even more volatile, we cannot guarantee that new values of categorical features will not occur. This issue hampers the using of the trained model with some new data. Besides that, `LabelEncoder`

assumes preliminary analysis of the whole dataset and storage of constructed mappings in memory, which makes it difficult to work with big amounts of data.

There is a simple approach to vectorization of categorical data, it is based on hashing, and is known as, surprisingly, the hashing trick.

Hash functions can help us in finding unique codes for different feature values, for example:

In [16]:

```
for s in ('university.degree', 'high.school', 'illiterate'):
print(s, '->', hash(s))
```

We don't use negative values or values of such high magnitude. We restrict the range of values for the hash function:

In [17]:

```
hash_space = 25
for s in ('university.degree', 'high.school', 'illiterate'):
print(s, '->', hash(s) % hash_space)
```

Imagine that our data set contains a single (not married yet) student, who received a call on Monday. His feature vectors will be created similarly to the case of One-Hot Encoding, but in the united space with fixed range for all features:

In [18]:

```
hashing_example = pd.DataFrame([{i: 0.0 for i in range(hash_space)}])
for s in ('job=student', 'marital=single', 'day_of_week=mon'):
print(s, '->', hash(s) % hash_space)
hashing_example.loc[0, hash(s) % hash_space] = 1
hashing_example
```

Out[18]:

We want to point out: we hash not only feature values, but pairs of **feature name + feature value**. It's important to distinguish the same values of different features.

In [19]:

```
assert hash('no') == hash('no')
assert hash('housing=no') != hash('loan=no')
```

Is it possible that we have a collision (when hash codes, computed for two different combinations of feature names and values, coincide)? Sure, it is. But it can be seen that it is a rare case, with big enough hashing spaces. And even if the collision occurs, regression or classification metrics will not suffer much. Surprisingly, hash collisions in this case work as a sort of regularization.

Perhaps, you ask "WTF?", hashing seems counterintuitive. Maybe, but sometimes this heuristics is in fact the only plausible approach to work with categorical data. Moreover, this technique has proved to just work.

A good analysis of hash collisions, their dependency on feature space and hashing space dimensions and affecting classification/regression performance is done in this article by Booking.com.

Vowpal Wabbit (VW) is one of the most wide-spread machine learning libraries in industry. It is prominent for high training speed and support of many training modes. Especially, we are interested in online learning with big and high-dimentional data. This is one of major merits of the library. Also, with hashing trick implemented, Vowpal Wabbit is a perfect choice for working with text data.

Shell is the main interface for VW. See installation instructions. Alternatively, you can use Python package vowpalwabbit, here's a usage example.

In [20]:

```
!vw --help
```

Vowpal Wabbit reads data from files or from standard input stream (stdin) assuming the following format:

`[Label] [Importance] [Tag]|Namespace Features |Namespace Features ... |Namespace Features`

`Namespace=String[:Value]`

`Features=(String[:Value] )*`

here [] denotes non-mandatory elements, and (...)* means some repeats.

**Label**is a number. In case of classification it is usually 1 and -1, in case of regression it is some real float value**Importance**is a number, it denotes sample weight during training. Setting this helps when working with imbalanced data.**Tag**is some string without spaces - it is a "name" of sample, VW saves it upon prediction. In order to separate Tag and Importance it is better to start Tag with the ' character.**Namespace**is for creation of separate feature spaces.**Features**are object features inside**Namespace**. Features have weight 1.0 by default, but it can be changed, for example - feature:0.1.

The following string matches the VW format:

`1 1.0 |Subject WHAT car is this |Organization University of Maryland:0.5 College Park`

Let's check it and run VW with this training sample:

In [21]:

```
!echo '1 1.0 |Subject WHAT car is this |Organization University of Maryland:0.5 College Park' | vw
```

VW is a wonderful tool for working with text data. We'll illustrate it with the 20newsgroups dataset, containing letters from 20 different news letters.

In [22]:

```
# load data with sklearn's function
newsgroups = fetch_20newsgroups(PATH_TO_ALL_DATA)
```

In [23]:

```
newsgroups['target_names']
```

Out[23]:

Lets look at the first document from this collection:

In [24]:

```
text = newsgroups['data'][0]
target = newsgroups['target_names'][newsgroups['target'][0]]
print('-----')
print(target)
print('-----')
print(text.strip())
print('----')
```

Now we convert the data into something Vowpal Wabbit can understand, and we throw away words shorter than of 3 symbols. Here we skip many important NLP stages (stemming and lemmatization, for example), however, we will later see that VW solves the problem even without these steps.

In [25]:

```
def to_vw_format(document, label=None):
return str(label or '') + ' |text ' + ' '.join(re.findall('\w{3,}', document.lower())) + '\n'
to_vw_format(text, 1 if target == 'rec.autos' else -1)
```

Out[25]:

We split the dataset into train and test and write these to files. We will consider document as positive, if it corresponds to **rec.autos**. Thus, we construct the model which distinguishes letters about cars from others:

In [26]:

```
all_documents = newsgroups['data']
all_targets = [1 if newsgroups['target_names'][target] == 'rec.autos'
else -1 for target in newsgroups['target']]
```

In [27]:

```
train_documents, test_documents, train_labels, test_labels = \
train_test_split(all_documents, all_targets, random_state=7)
with open(os.path.join(PATH_TO_ALL_DATA, '20news_train.vw'), 'w') as vw_train_data:
for text, target in zip(train_documents, train_labels):
vw_train_data.write(to_vw_format(text, target))
with open(os.path.join(PATH_TO_ALL_DATA, '20news_test.vw'), 'w') as vw_test_data:
for text in test_documents:
vw_test_data.write(to_vw_format(text))
```

And now we pass the created file to Vowpal Wabbit. We solve the classification problem with hinge loss function (linear SVM). The trained model will be saved in the `20news_model.vw`

file:

In [28]:

```
!vw -d $PATH_TO_ALL_DATA/20news_train.vw \
--loss_function hinge -f $PATH_TO_ALL_DATA/20news_model.vw
```

Training is finished. VW prints a lot of interesting info while training (one can suppress it with the `--quiet`

parameter). More on this diagnostic output can be seen on GitHub. Note how average loss drops while training. For loss computation VW uses samples it has never seen before, so this measure is usually correct. Now we apply trained model to the test set, saving predictions into a file with the `-p`

flag:

In [29]:

```
!vw -i $PATH_TO_ALL_DATA/20news_model.vw -t -d $PATH_TO_ALL_DATA/20news_test.vw \
-p $PATH_TO_ALL_DATA/20news_test_predictions.txt
```

Now we load predictions, compute AUC and plot the ROC curve:

In [30]:

```
with open(PATH_TO_ALL_DATA / '20news_test_predictions.txt') as pred_file:
test_prediction = [float(label)
for label in pred_file.readlines()]
auc = roc_auc_score(test_labels, test_prediction)
a_roc_curve = roc_curve(test_labels, test_prediction)
with plt.xkcd():
plt.plot(a_roc_curve[0], a_roc_curve[1]);
plt.plot([0,1], [0,1])
plt.xlabel('FPR'); plt.ylabel('TPR')
plt.title('test AUC = %f' % (auc))
plt.axis([-0.05,1.05,-0.05,1.05]);
```

AUC value we get states that here we've achieved high classification quality.

We use same news dataset, but this time we solve multiclass classification problem. `Vowpal Wabbit`

is a little picky – it loves labels starting from 1 till K, where K – is a number of classes in classification task (20 in our case). So we will use LabelEncoder and add 1 afterwards (`LabelEncoder`

maps labels into range from 0 to K-1).

In [31]:

```
all_documents = newsgroups['data']
topic_encoder = LabelEncoder()
all_targets_mult = topic_encoder.fit_transform(newsgroups['target']) + 1
```

**Data is the same, but we change labels, train_labels_mult and test_labels_mult – label vectors from 1 to 20.**

In [32]:

```
train_documents, test_documents, train_labels_mult, test_labels_mult = \
train_test_split(all_documents, all_targets_mult, random_state=7)
with open(os.path.join(PATH_TO_ALL_DATA, '20news_train_mult.vw'), 'w') as vw_train_data:
for text, target in zip(train_documents, train_labels_mult):
vw_train_data.write(to_vw_format(text, target))
with open(os.path.join(PATH_TO_ALL_DATA, '20news_test_mult.vw'), 'w') as vw_test_data:
for text in test_documents:
vw_test_data.write(to_vw_format(text))
```

We train Vowpal Wabbit in multiclass classification mode, passing `oaa`

parameter("one against all"), with number of classes. Also, lets see parameters we can tune and model quality can be very depended on them (more info – in the official Vowpal Wabbit tutorial):

- learning rate (-l, 0.5 default) – rate of weight change on every step
- learning rate decay (--power_t, 0.5 default) – it is proven by practice, that if learning rate drops along with further steps of stochastic gradient descent, we approach loss minimum better
- loss function (--loss_function) – entire training algorithm depends on it. Docs about loss functions
Regularization (-l1) – note, that VW calculates regularization for every object, that why we usually set regularization values small, about $10^{-20}.$

Additionally we can try automatic Vowpal Wabbit parameters tuning with Hyperopt.

In [33]:

```
%%time
!vw --oaa 20 $PATH_TO_ALL_DATA/20news_train_mult.vw -f $PATH_TO_ALL_DATA/20news_model_mult.vw \
--loss_function=hinge
```

In [34]:

```
%%time
!vw -i $PATH_TO_ALL_DATA/20news_model_mult.vw -t -d $PATH_TO_ALL_DATA/20news_test_mult.vw \
-p $PATH_TO_ALL_DATA/20news_test_predictions_mult.txt
```

In [35]:

```
with open(os.path.join(PATH_TO_ALL_DATA, '20news_test_predictions_mult.txt')) as pred_file:
test_prediction_mult = [float(label) for label in pred_file.readlines()]
```

In [36]:

```
accuracy_score(test_labels_mult, test_prediction_mult)
```

Out[36]:

Here is how often the model misclassifies atheism with other topics.

In [37]:

```
M = confusion_matrix(test_labels_mult, test_prediction_mult)
for i in np.where(M[0,:] > 0)[0][1:]:
print(newsgroups['target_names'][i], M[0,i])
```

In this part we will do binary classification of IMDB (International Movie DataBase) movie reviews. We will see how fast Vowpal Wabbit is.

Using the `load_files`

function from `sklearn.datasets`

we load movie reviews from here. If you want to reproduce the results, please download the archive, unzip it and set the path to `imdb_reviews`

(it will contain *train* and *test* subdirectories). Unpacking can take several minutes as there are 100k of files. Both train and test sets hold 12500 good and bad movie reviews. First we split texts and labels.

In [39]:

```
PATH_TO_MOVIE_REVIEWS = PATH_TO_ALL_DATA / 'imdb_reviews'
reviews_train = load_files(PATH_TO_MOVIE_REVIEWS / 'train')
text_train, y_train = reviews_train.data, reviews_train.target
```

In [40]:

```
print("Number of documents in training data: %d" % len(text_train))
print(np.bincount(y_train))
```

The same for the test set.

In [41]:

```
reviews_test = load_files(PATH_TO_MOVIE_REVIEWS / 'test')
text_test, y_test = reviews_test.data, reviews_train.target
print("Number of documents in test data: %d" % len(text_test))
print(np.bincount(y_test))
```

Examples of some reviews and corresponding labels.

In [42]:

```
text_train[0]
```

Out[42]:

In [43]:

```
y_train[0] # good review
```

Out[43]:

In [44]:

```
text_train[1]
```

Out[44]:

In [45]:

```
y_train[1] # bad review
```

Out[45]:

In [46]:

```
to_vw_format(str(text_train[1]), 1 if y_train[0] == 1 else -1)
```

Out[46]:

Now we prepare training (`movie_reviews_train.vw`

), validation (`movie_reviews_valid.vw`

) and test (`movie_reviews_test.vw`

) sets for Vowpal Wabbit. 70% of training set we hold for training process, 30% – for hold-out set.

In [47]:

```
train_share = int(0.7 * len(text_train))
train, valid = text_train[:train_share], text_train[train_share:]
train_labels, valid_labels = y_train[:train_share], y_train[train_share:]
```

In [48]:

```
len(train_labels), len(valid_labels)
```

Out[48]:

In [49]:

```
with open(PATH_TO_MOVIE_REVIEWS / 'movie_reviews_train.vw', 'w') as vw_train_data:
for text, target in zip(train, train_labels):
vw_train_data.write(to_vw_format(str(text), 1 if target == 1 else -1))
with open(PATH_TO_MOVIE_REVIEWS / 'movie_reviews_valid.vw', 'w') as vw_train_data:
for text, target in zip(valid, valid_labels):
vw_train_data.write(to_vw_format(str(text), 1 if target == 1 else -1))
with open(PATH_TO_MOVIE_REVIEWS / 'movie_reviews_test.vw', 'w') as vw_test_data:
for text in text_test:
vw_test_data.write(to_vw_format(str(text)))
```

In [50]:

```
!head -2 $PATH_TO_MOVIE_REVIEWS/movie_reviews_train.vw
```

In [51]:

```
!head -2 $PATH_TO_MOVIE_REVIEWS/movie_reviews_valid.vw
```

In [52]:

```
!head -2 $PATH_TO_MOVIE_REVIEWS/movie_reviews_test.vw
```

**Now we launch Vowpal Wabbit with the following arguments:**

`-d`

, path to training set (corresponding .vw file)`--loss_function`

– hinge (feel free to experiment here)`-f`

– path to the output file (which can also be in the .vw format)

In [53]:

```
!vw -d $PATH_TO_MOVIE_REVIEWS/movie_reviews_train.vw --loss_function hinge \
-f $PATH_TO_MOVIE_REVIEWS/movie_reviews_model.vw --quiet
```

Next, make the hold-out prediction with the following VW arguments:

`-i`

–path to the trained model (.vw file)`-d`

– path to the hold-out set (.vw file)`-p`

– path to a txt-file where the predictions will be stored`-t`

- tells VW to ignore labels

In [54]:

```
!vw -i $PATH_TO_MOVIE_REVIEWS/movie_reviews_model.vw -t \
-d $PATH_TO_MOVIE_REVIEWS/movie_reviews_valid.vw \
-p $PATH_TO_MOVIE_REVIEWS/movie_valid_pred.txt --quiet
```

Now we read predictions from file and estimate accuracy and ROC AUC. Note that VW prints probability estimates of +1 class. These estimates are distributed from -1 to 1, so we convert these into binary answers (0 or 1) assuming that positive values belong to class 1.

In [55]:

```
with open(os.path.join(PATH_TO_MOVIE_REVIEWS, 'movie_valid_pred.txt')) as pred_file:
valid_prediction = [float(label)
for label in pred_file.readlines()]
print("Accuracy: {}".format(round(accuracy_score(valid_labels,
[int(pred_prob > 0) for pred_prob in valid_prediction]), 3)))
print("AUC: {}".format(round(roc_auc_score(valid_labels, valid_prediction), 3)))
```

And the same for the test set.

In [56]:

```
!vw -i $PATH_TO_MOVIE_REVIEWS/movie_reviews_model.vw -t \
-d $PATH_TO_MOVIE_REVIEWS/movie_reviews_test.vw \
-p $PATH_TO_MOVIE_REVIEWS/movie_test_pred.txt --quiet
```

In [57]:

```
with open(os.path.join(PATH_TO_MOVIE_REVIEWS, 'movie_test_pred.txt')) as pred_file:
test_prediction = [float(label)
for label in pred_file.readlines()]
print("Accuracy: {}".format(round(accuracy_score(y_test,
[int(pred_prob > 0) for pred_prob in test_prediction]), 3)))
print("AUC: {}".format(round(roc_auc_score(y_test, test_prediction), 3)))
```

Now we try to achieve higher accuracy by incorporating bigrams.

In [58]:

```
!vw -d $PATH_TO_MOVIE_REVIEWS/movie_reviews_train.vw \
--loss_function hinge --ngram 2 -f $PATH_TO_MOVIE_REVIEWS/movie_reviews_model2.vw --quiet
```

In [59]:

```
!vw -i$PATH_TO_MOVIE_REVIEWS/movie_reviews_model2.vw -t \
-d $PATH_TO_MOVIE_REVIEWS/movie_reviews_valid.vw \
-p $PATH_TO_MOVIE_REVIEWS/movie_valid_pred2.txt --quiet
```

In [60]:

```
with open(os.path.join(PATH_TO_MOVIE_REVIEWS, 'movie_valid_pred2.txt')) as pred_file:
valid_prediction = [float(label)
for label in pred_file.readlines()]
print("Accuracy: {}".format(round(accuracy_score(valid_labels,
[int(pred_prob > 0) for pred_prob in valid_prediction]), 3)))
print("AUC: {}".format(round(roc_auc_score(valid_labels, valid_prediction), 3)))
```

In [61]:

```
!vw -i $PATH_TO_MOVIE_REVIEWS/movie_reviews_model2.vw -t \
-d $PATH_TO_MOVIE_REVIEWS/movie_reviews_test.vw \
-p $PATH_TO_MOVIE_REVIEWS/movie_test_pred2.txt --quiet
```

In [62]:

```
with open(os.path.join(PATH_TO_MOVIE_REVIEWS, 'movie_test_pred2.txt')) as pred_file:
test_prediction2 = [float(label)
for label in pred_file.readlines()]
print("Accuracy: {}".format(round(accuracy_score(y_test,
[int(pred_prob > 0) for pred_prob in test_prediction2]), 3)))
print("AUC: {}".format(round(roc_auc_score(y_test, test_prediction2), 3)))
```

Bigrams really bettered our model!

Now it is time to actually work with large datasets and Vowpal Wabbit. There is a 10 GiB dataset of StackOverflow questions, we have a processed version here. The original dataset is compised of 10 million questions, every question can have several tags. Data is quite clean, so don't call it "Big Data" even in a pub :)

We chose only 10 tags (namely, 'javascript', 'java', 'python', 'ruby', 'php', 'c++', 'c#', 'go', 'scala' and 'swift'), and solve 10-class classification problem: we want to predict a tag corresponding to one of 10 popular programming languages, given only the texto f this question.

In [63]:

```
# change the path to data
PATH_TO_STACKOVERFLOW_DATA = PATH_TO_ALL_DATA / 'stackoverflow'
```

These are first 3 lines from a sample of the dataset.

In [64]:

```
!head -3 $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_10mln.vw
```

After selecting 10 tags we have 4.7G set which is divided into train, and test parts.

In [65]:

```
!du -hs $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_*.vw
```

We will process training part of the dataset (3.1 GiB) with Vowpal Wabbit and the following arguments:

- -oaa 10 – mark for multiclass classification with 10 classes
- -d – path to data
- -f – path to output file of the trained model
- -b 28 – we will use 28 bits for hashing resulting in the $2^{28}$-sized feature space
- fix random seed for reproducibility

In [66]:

```
%%time
!vw --oaa 10 -d $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_train.vw \
-f $PATH_TO_STACKOVERFLOW_DATA/vw_model1_10mln.vw -b 28 --random_seed 17 --quiet
```

In [67]:

```
%%time
!vw -t -i $PATH_TO_STACKOVERFLOW_DATA/vw_model1_10mln.vw \
-d $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_test.vw \
-p $PATH_TO_STACKOVERFLOW_DATA/vw_test_pred.csv --random_seed 17 --quiet
```

In [68]:

```
vw_pred = np.loadtxt(PATH_TO_STACKOVERFLOW_DATA / 'vw_test_pred.csv')
test_labels = np.loadtxt(PATH_TO_STACKOVERFLOW_DATA / 'stackoverflow_test_labels.txt')
accuracy_score(test_labels, vw_pred)
```

Out[68]:

The model has been trained and made a prediction in less than a minute (check it, the results are reported for MacBook Pro Mid 2015, 2,2 GHz Intel Core i7, 16 Gib RAM) and got almost 92% accuracy. Without any Hadoop cluster! :) Impressing, isn't is?

To better understand stochastic learning, you can complete this assignment where you'll be asked to implement a stochastic gradient regressor from scratch. The assignment is just for you to practice, and goes with a solution.

- The same notebook as am interactive web-based Kaggle Kernel
- "Training while reading" - an example of the Python wrapper usage
- Main course site, course repo, and YouTube channel
- Course materials as a Kaggle Dataset
- Official VW documentation on Github
- "Awesome Vowpal Wabbit" Wiki
- Don’t be tricked by the Hashing Trick - analysis of hash collisions, their dependency on feature space and hashing space dimensions and affecting classification/regression performance
- "Numeric Computation" Chapter of the Deep Learning book
- "Convex Optimization" by Stephen Boyd
- "Command-line Tools can be 235x Faster than your Hadoop Cluster" post
- Benchmarking various ML algorithms on Criteo 1TB dataset on GitHub
- VW on FastML.com