## mlcourse.ai – Open Machine Learning Course¶

Author: Yury Kashnitskiy (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

## Assignment #6

### Beating baselines in "How good is your Medium article?"

Competition. The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared "primitive" baseline - you'll find something valuable there.

1. "Freeride". Come up with good features to beat the baseline "A6 baseline" (for now, public LB is only considered)
2. You need to name your team (out of 1 person) in full accordance with the course rating. You can think of it as a part of the assignment. 16 credits for beating the mentioned baseline and correct team naming.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a6

In [ ]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge


The following code will help to throw away all HTML tags from an article content.

In [ ]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
def __init__(self):
self.reset()
self.strict = False
self.convert_charrefs= True
self.fed = []
def handle_data(self, d):
self.fed.append(d)
def get_data(self):
return ''.join(self.fed)

def strip_tags(html):
s = MLStripper()
s.feed(html)
return s.get_data()


Supplementary function to read a JSON line without crashing on escape characters.

In [ ]:
def read_json_line(line=None):
result = None
try:
except Exception as e:
# Find the offending character index:
idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))
# Remove the offending character:
new_line = list(line)
new_line[idx_to_replace] = ' '
new_line = ''.join(new_line)
return result


Extract features content, published, title and author, write them to separate files for train and test sets.

In [ ]:
def extract_features_and_write(path_to_data,
inp_filename, is_train=True):

features = ['content', 'published', 'title', 'author']
prefix = 'train' if is_train else 'test'
feature_files = [open(os.path.join(path_to_data,
'{}_{}.txt'.format(prefix, feat)),
'w', encoding='utf-8')
for feat in features]

with open(os.path.join(path_to_data, inp_filename),
encoding='utf-8') as inp_json_file:

for line in tqdm_notebook(inp_json_file):

# You code here

In [ ]:
PATH_TO_DATA = '../../data/kaggle_medium' # modify this if you need to

In [ ]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [ ]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)


Add the following groups of features:

- Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
- Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)
In [ ]:
# You code here


Join all sparse matrices.

In [ ]:
X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
X_train_author_sparse,
X_train_time_features_sparse]).tocsr()

In [ ]:
X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
X_test_author_sparse,
X_test_time_features_sparse]).tocsr()


Read train target and split data for validation.

In [ ]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'),
index_col='id')
y_train = train_target['log_recommends'].values

In [ ]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]


Train a simple Ridge model and check MAE on the validation set.

In [ ]:
# You code here


Train the same Ridge with all available data, make predictions for the test set and form a submission file.

In [ ]:
# You code here

In [ ]:
def write_submission_file(prediction, filename,
path_to_sample=os.path.join(PATH_TO_DATA,
'sample_submission.csv')):

submission['log_recommends'] = prediction
submission.to_csv(filename)

In [ ]:
write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA,
'assignment6_medium_submission.csv'))


Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?

In [ ]:
write_submission_file(np.zeros_like(ridge_test_pred),
os.path.join(PATH_TO_DATA,
'medium_all_zeros_submission.csv'))


Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.

In [ ]:
ridge_test_pred_modif = ridge_test_pred # You code here

In [ ]:
write_submission_file(ridge_test_pred_modif,
os.path.join(PATH_TO_DATA,
'assignment6_medium_submission_with_hack.csv'))


That's it for the assignment. Much more credits will be given to the winners in this competition, check course roadmap. Do not spoil the assignment and the competition - don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

• Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
• You may not ignore HTML and extract some features from there
• You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
• Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
• Try various NLP techniques like stemming and lemmatization
• Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter, this can be changed
• SGD and Vowpal Wabbit will learn much faster
• Play around with blending and/or stacking. An intro is given in this Kernel by @yorko
• In our course, we don't cover neural nets. But it's not obliged to use GRUs/LSTMs/whatever in this competition.

Good luck!