Roadmap

This roadmap guides you through the self-paced mlcourse.ai. First, take a look at the course prerequisites. Be ready to spend around 3 months on the course, some 4-10 hours/week, though this heavily depends on how deep you are willing to dive into the Kaggle competitions that we offer during this course (they are time-consuming but very rewarding in terms of skills). To get help, join OpenDataScience; discussions are held in the #mlcourse_ai Slack channel. For a general picture of the role Machine Learning plays in your would-be Data Science career, check the “Jump into Data Science” video. It walks you through the preparation process for your first DS position once basic ML and Python are covered (slides). In case you are already in DS, this course will be a good ML refresher.

Below we outline the 10 topics covered in the course, and give specific instructions on articles to read, lectures to watch, and assignments to crack. Some extra bonus assignments are also listed – you get those with the Bonus Assignments tier. Good luck!

Week 1. Exploratory Data Analysis

You definitely want to start with Machine Learning right away and see math in action. But 70-80% of the time spent on a real project goes to fussing with data, and here Pandas is very good; I use it in my work almost every day. This article describes the basic Pandas methods for preliminary data analysis. Then we analyze a data set on the churn of telecom customers and try to predict the churn without any training, simply relying on common sense. By no means should you underestimate such an approach.

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture
  3. Complete demo assignment 1 where you’ll be exploring demographic data (UCI “Adult”), and (opt.) check out the solution
  4. Bonus Assignment: here you’ll be analyzing the history of the Olympic Games with Pandas.
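
The kind of preliminary analysis described above can be sketched in a few lines of Pandas. This is only an illustration on a tiny made-up table: the column names (`international_plan`, `service_calls`, `churn`), the values, and the threshold in the hand-written rule are all assumptions, not the course dataset.

```python
import pandas as pd

# Toy data with hypothetical column names; not the real telecom churn dataset.
df = pd.DataFrame({
    "international_plan": ["no", "yes", "no", "yes", "no", "no", "yes", "no"],
    "service_calls":      [1, 4, 0, 5, 2, 6, 4, 3],
    "churn":              [0, 1, 0, 1, 0, 1, 0, 0],
})

# Overall share of churned customers and the same share per group.
churn_rate = df["churn"].mean()
rate_by_plan = df.groupby("international_plan")["churn"].mean()

# Cross-tabulate a candidate feature against the target.
table = pd.crosstab(df["service_calls"] > 3, df["churn"])

# A "common sense" rule instead of a trained model:
# predict churn for customers with many service calls.
rule_pred = (df["service_calls"] > 3).astype(int)
accuracy = (rule_pred == df["churn"]).mean()

print(rate_by_plan)
print(table)
print(f"churn rate: {churn_rate:.3f}, rule accuracy: {accuracy:.3f}")
```

On real data such a rule would be checked on a held-out part of the dataset, but even this simple baseline gives a reference point for any model trained later.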

Week 2. Visual Data Analysis

The role of visual data analysis is hard to overestimate: this is how new insights are found in data and how features are engineered. Here we discuss the main data visualization techniques and how they are applied in practice. We also take a sneak peek into multidimensional feature space using the t-SNE algorithm, which is sometimes useful but mostly just draws pictures resembling Christmas tree decorations.

  1. Read two articles:
  2. (opt.) watch a video lecture
  3. Complete demo assignment 2 where you’ll be analyzing cardiovascular disease data, and (opt.) check out the solution
  4. Bonus Assignment: here you’ll be performing EDA of a much larger dataset of US flights, occasionally paying attention to the performance of basic operations.
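
As a rough illustration of the t-SNE "sneak peek" mentioned above, the sketch below projects a multidimensional dataset onto two dimensions with scikit-learn. The choice of the bundled digits dataset, the 300-sample subset, and the `perplexity` value are assumptions made to keep the example small; the course articles may use different data and settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened into 64 features; a subset keeps the run short.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# t-SNE works with distances, so features are brought to a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# Project 64-dimensional points onto a 2D plane.
embedding = TSNE(n_components=2, perplexity=30, random_state=17).fit_transform(X_scaled)

print(embedding.shape)
```

The two columns of `embedding` can then be passed to any scatter plot, colored by `y`, to see whether the classes form visible groups.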

Week 3. Classification, Decision Trees, and k Nearest Neighbors

Here we delve into machine learning and discuss two simple approaches to solving the classification problem. In a real project, you’d better start with something simple, and often you’d try out decision trees or nearest neighbors (as well as linear models, the next topic) right after even simpler heuristics. We discuss the pros and cons of trees and nearest neighbors. Also, we touch upon the important topic of assessing the quality of model predictions and performing cross-validation. The article is long, but decision trees, in particular, deserve it – they form the foundation for Random Forest and Gradient Boosting, two algorithms that you’ll likely be using in practice most often.

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
    • the theory behind decision trees, an intuitive explanation
    • practice with Sklearn decision trees
  3. Complete demo assignment 3 on decision trees, and (opt.) check out the solution
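
To make the idea of comparing the two approaches with cross-validation more concrete, here is a minimal sketch with scikit-learn. The dataset (the bundled breast cancer data) and the hyperparameters (`max_depth=3`, `n_neighbors=10`) are assumptions chosen for illustration, not the settings used in the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The same 5 stratified folds are used for both models, so scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

models = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=17),
    # kNN relies on distances, so features are scaled inside the pipeline.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10)),
}

# Mean accuracy over the folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}
print(scores)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.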
Bonus Assignment 3. Decision trees for classification and regression

In this assignment (accessible via Patreon), we’ll go through the math and code behind decision trees applied to the regression problem; some toy examples will help with that. It is good to understand this because the regression tree is the key component of the gradient boosting algorithm, which we cover at the end of the course.

Left: Building a regression tree, step 1. Right: Building a regression tree, step 3.

Further, we apply classification decision trees to cardiovascular disease data.

Left: Risk of fatal cardiovascular disease. Right: A decision tree fit to cardiovascular disease data.

In one more bonus assignment, a more challenging one, you’ll be guided through an implementation of a decision tree from scratch. You’ll be given a template for a general DecisionTree class that will work both for classification and regression problems, and then you’ll be testing the implementation with a couple of toy and real classification and regression tasks.
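
The core step of building a regression tree, picking the split that makes the target as homogeneous as possible on both sides, can be sketched as follows. This is a simplified illustration for a single feature using the weighted variance (MSE) criterion; the function name `best_split` and the toy data are hypothetical and are not taken from the assignment template.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best single split on feature x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between two equal values
        left, right = y_sorted[:i], y_sorted[i:]
        # Variance of each side weighted by its size: the MSE of predicting
        # the mean target in each of the two leaves.
        score = (left.var() * len(left) + right.var() * len(right)) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data: the target jumps from about 1 to about 5 between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

threshold, score = best_split(x, y)
print(threshold, score)
```

A full tree applies the same search recursively to each side of the split and over all features until a stopping rule (for example, maximum depth) is met.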

Week 4. Linear Classification and Regression

The following 5 articles may form a small brochure, and that’s for a good reason: linear models are the most widely used family of predictive algorithms. These articles represent our course in miniature: a lot of theory, a lot of practice. We discuss the theoretical basis of the Ordinary Least Squares method and logistic regression, as well as their merits in terms of practical applications. Also, crucial concepts like regularization and learning curves are introduced. In the practical part, we apply logistic regression to the task of user identification on the Internet. It’s a Kaggle Inclass competition (a.k.a. “Alice”; we go on with this competition in Week 6).

It’s fair to admit that this week’s material will more likely take you 2-3 weeks to digest and practice, and that’s fine. But for consistency with article numbering, we stick to “Week 4” for this topic.

  1. Read 5 articles:
  2. Watch a video lecture on logistic regression coming in 2 parts:
  3. Watch a video lecture on regression and regularization coming in 2 parts:
    • the theory behind linear models, an intuitive explanation
    • business case, where we discuss a real regression task – predicting customer Life-Time Value
  4. Complete demo assignment 4 on sarcasm detection, and (opt.) check out the solution
  5. Complete demo assignment 6 (sorry for misleading numbering here) on OLS, Lasso and Random Forest in a regression task, and (opt.) check out the solution
  6. Bonus Assignment: here you’ll be guided through working with sparse data, feature engineering, model validation, and the process of competing on Kaggle. The task will be to beat baselines in that “Alice” Kaggle competition. That’s a very useful assignment for anyone starting to practice with Machine Learning, regardless of the desire to compete on Kaggle.
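
To illustrate how Ordinary Least Squares and L2 regularization relate, the sketch below solves both in closed form with NumPy: OLS as w = (XᵀX)⁻¹Xᵀy and ridge regression as w = (XᵀX + λI)⁻¹Xᵀy. The synthetic data, the helper name `fit_linear`, and the value λ = 10 are assumptions for illustration only; the intercept is omitted for brevity.

```python
import numpy as np

# Synthetic regression task with known weights and a little noise.
rng = np.random.default_rng(17)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def fit_linear(X, y, l2=0.0):
    """Solve (X^T X + l2 * I) w = X^T y; l2=0 gives plain OLS."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(n_features), X.T @ y)

w_ols = fit_linear(X, y)
w_ridge = fit_linear(X, y, l2=10.0)

print("OLS:  ", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))
```

The ridge weights are pulled toward zero compared with the OLS ones, which is the shrinkage effect that regularization is meant to provide.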

Week 5. Bagging and Random Forest

Yet again, both theory and practice are exciting. We discuss why the “wisdom of the crowd” works for machine learning models, and why an ensemble of several models works better than any single one of its members. In practice, we try out Random Forest (an ensemble of many decision trees) – a “default algorithm” in many tasks. We discuss in detail the numerous advantages of the Random Forest algorithm and its applications. No silver bullet though: in some cases, linear models still work better and faster.

  1. Read 3 articles:
  2. Watch a video lecture coming in 3 parts:
    • part 1 on Random Forest
    • part 2 on classification metrics
    • business case, where we discuss a real classification task – predicting customer payment
  3. Complete demo assignment 5 where you compare logistic regression and Random Forest in the credit scoring problem, and (opt.) check out the solution
  4. Bonus Assignment: here you’ll be applying logistic regression and Random Forest in two different tasks, which will be great for your understanding of application scenarios of these two extremely popular algorithms.
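
One way to see why a "crowd" of models can beat its members is to compute the accuracy of a majority vote. The sketch below does this under a strong simplifying assumption: all models are independent and each is right with the same probability `p`. Real ensemble members are correlated, so this is an optimistic illustration rather than a description of Random Forest itself; the function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_models, p):
    """Probability that more than half of n independent models are right.

    Assumes an odd number of models, each correct with probability p.
    """
    k_min = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With `p = 0.6`, a single model is right 60% of the time, while the vote of 11 such models is right about 75% of the time, and the figure keeps growing with more models.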

Week 6. Feature Engineering and Feature Selection

Feature engineering is one of the most interesting processes in the whole of ML. It’s an art, or at least a craft, and is therefore not yet well-automated. The article describes the ways of working with heterogeneous features in various ML tasks with texts, images, geodata, etc. Practice with the “Alice” competition is going to convince you how powerful feature engineering can be. And that it’s a lot of fun as well!

  1. Read the article (also available as a Kaggle Notebook)
  2. Kaggle: Now that you’ve beaten simple baselines in the “Alice” competition (see Topic 4), check out a bit more advanced Notebooks:

    Go on with feature engineering and try to achieve ~0.955 (or higher) ROC AUC on the Public Leaderboard. Alternatively, if a better solution is already shared by the time you join the competition, try to improve the best publicly shared solution by at least 0.5%. However, please do not share high-performing solutions: it ruins the competitive spirit of the competition and also hurts some other courses which have this competition in their syllabus.

  3. Bonus Assignment: here we walk you through beating a baseline in a competition where the task is to predict the popularity of an article published on Medium. The basic solution uses the text only. But on the go, you’ll create a lot of additional features to improve the model. Also, in this assignment, you’ll learn some dirty Kaggle hacks.
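
As a small taste of feature engineering, the sketch below derives a few features from timestamps with Pandas. The column name `start_time`, the three example timestamps, and the particular features (hour, day of week, weekend and morning flags) are assumptions made for illustration; they are not the features from the competition notebooks.

```python
import pandas as pd

# Hypothetical session start times.
sessions = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2014-02-20 10:02:45",
        "2014-02-22 21:15:03",
        "2014-03-01 08:30:10",
    ])
})

hour = sessions["start_time"].dt.hour
day_of_week = sessions["start_time"].dt.dayofweek  # Monday=0, Sunday=6

features = pd.DataFrame({
    "hour": hour,
    "day_of_week": day_of_week,
    "is_weekend": (day_of_week >= 5).astype(int),
    "is_morning": hour.between(7, 11).astype(int),  # 7:00-11:59
})
print(features)
```

Whether such features help is an empirical question, so each new feature is usually kept only if it improves the cross-validation score.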

Week 7. Unsupervised Learning: Principal Component Analysis and Clustering

Here we turn to the vast topic of unsupervised learning. It’s about the cases when we have data but it is unlabeled, with no target feature to predict as in classification/regression tasks. Most of the data out there is unlabeled, and we need to be able to make use of it. We discuss only 2 types of unsupervised learning tasks – clustering and dimensionality reduction.

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
  3. Complete demo assignment 7 where you analyze data coming from mobile phone accelerometers and gyroscopes to cluster people into different types of physical activities, and (opt.) check out the solution
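
The two task types named above, dimensionality reduction and clustering, are often combined. Below is a minimal sketch with scikit-learn, assuming the bundled digits dataset as a stand-in for real data; the 90% variance threshold and the number of clusters are illustrative choices. The labels are used only at the end, to check how well the clusters match the true classes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)

# Keep as many principal components as needed to explain 90% of the variance.
pca = PCA(n_components=0.9, random_state=17)
X_reduced = pca.fit_transform(X)

# Cluster the reduced data; the labels y are not used for fitting.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=17)
cluster_labels = kmeans.fit_predict(X_reduced)

# Compare the found clusters with the true digit classes.
ari = adjusted_rand_score(y, cluster_labels)
print(X.shape[1], "->", X_reduced.shape[1], "features; ARI:", round(ari, 3))
```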

Bonus Assignment 7. Principal Component Analysis and Clustering

In this assignment (accessible via Patreon), we walk you through Sklearn built-in implementations of dimensionality reduction and clustering methods and apply these techniques to the popular “faces” dataset.


Faces: principal components

Left: Clustering faces: pairwise similarities. Right: Clustering faces: a dendrogram

Week 8. Vowpal Wabbit: Learning with Gigabytes of Data

The theoretical part here covers the analysis of Stochastic Gradient Descent; it was this optimization method that made it possible to successfully train both neural networks and linear models on really large training sets. Here we also discuss what can be done in cases of millions of features in a supervised learning task (“hashing trick”) and move on to Vowpal Wabbit, a utility that allows you to train a model with gigabytes of data in a matter of minutes, and sometimes of acceptable quality. We consider several cases including StackOverflow questions tagging with a training set of several gigabytes.

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
  3. Complete demo assignment 8 “Implementing online regressor” which walks you through implementation from scratch, very good for the intuitive understanding of the algorithm. Optionally, check out the solution
  4. Bonus Assignment: Here we go through the math and implement two algorithms – a regressor and a classifier – driven by stochastic gradient descent (SGD).
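
The two ideas from this topic, the hashing trick and stochastic gradient descent, can be combined in a short pure-Python sketch of an online logistic regression. Everything here is an assumption made for illustration: the bucket count, the use of `zlib.crc32` as the hash function, the learning rate, and the toy samples. Vowpal Wabbit's own implementation differs in many details.

```python
import math
import zlib

N_BUCKETS = 2 ** 18  # size of the hashed feature space

def hash_features(tokens):
    """Hashing trick: map each token to a bucket index, no vocabulary needed."""
    return [zlib.crc32(tok.encode("utf-8")) % N_BUCKETS for tok in tokens]

def predict_proba(w, tokens):
    """Logistic model on binary hashed features."""
    z = sum(w[i] for i in hash_features(tokens))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(samples, n_epochs=20, lr=0.5):
    """Update weights one sample at a time with the log-loss gradient."""
    w = [0.0] * N_BUCKETS
    for _ in range(n_epochs):
        for tokens, label in samples:
            grad = predict_proba(w, tokens) - label  # d(log-loss)/dz
            for i in hash_features(tokens):
                w[i] -= lr * grad
    return w

# Toy "question tagging" data: 1 = Python question, 0 = Java question.
samples = [
    (["python", "pandas", "dataframe"], 1),
    (["python", "numpy", "array"], 1),
    (["java", "spring", "bean"], 0),
    (["java", "jvm", "gc"], 0),
]
w = sgd_logistic(samples)
print(predict_proba(w, ["python", "pandas"]), predict_proba(w, ["java", "jvm"]))
```

Because each update touches only the weights of the tokens present in one sample, the same loop can stream over a file that does not fit into memory.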

Week 9. Time Series Analysis with Python

Here we discuss various approaches to working with time series: what data preparation is necessary, and how to get short-term and long-term predictions. We walk through various types of time series models, from simple moving average to gradient boosting. We also take a look at the ways to search for anomalies in time series and discuss the pros and cons of these methods.

  1. Read the following two articles:
  2. (opt.) watch a video lecture
  3. Complete demo assignment 9 where you’ll get your hands dirty with ARIMA and Prophet, and (opt.) check out the solution
  4. Bonus Assignment: In this assignment, we’ll engineer some features and apply an ML model to a time series prediction task
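
The simplest model mentioned above, the moving average, and a basic anomaly check built on it can be sketched with Pandas. The series values, the window size, and the anomaly threshold are made up for illustration; real series would call for a data-driven threshold, for example one based on the spread of past errors.

```python
import pandas as pd

# Toy series with one obvious spike at position 5.
series = pd.Series([10, 12, 11, 13, 12, 40, 13, 14, 15, 14, 16], dtype=float)
window = 3

# Moving-average forecast for the next step: mean of the last `window` values.
forecast = series.tail(window).mean()

# For anomaly detection, compare each point with the mean of the
# `window` values that precede it (shift(1) excludes the point itself).
baseline = series.shift(1).rolling(window=window).mean()
deviation = (series - baseline).abs()

threshold = 10.0  # hypothetical fixed threshold
anomalies = series[deviation > threshold].index.tolist()

print("forecast:", forecast)
print("anomalies at positions:", anomalies)
```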

Week 10. Gradient Boosting

Gradient boosting is one of the most prominent Machine Learning algorithms, and it has found many industrial applications. For instance, the Yandex search engine is a big and complex system with gradient boosting (MatrixNet) somewhere deep inside. Many recommender systems are also built on boosting. It is a very versatile approach applicable to classification, regression, and ranking. Therefore, here we cover both the theoretical basics of gradient boosting and the specifics of the most widespread implementations – XGBoost, LightGBM, and CatBoost.

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
    • part 1, fundamental ideas behind gradient boosting
    • part 2, key ideas behind major implementations: XGBoost, LightGBM, and CatBoost
  3. Kaggle: Take a look at the “Flight delays” competition and a starter with CatBoost. Start analyzing data and building features to improve your solution. Try to improve the best publicly shared solution by at least 0.5%. But still, please do not share high-performing solutions, as it ruins the competitive spirit.
Bonus Assignment 10. Implementation of the gradient boosting algorithm

In this assignment (accessible via Patreon), we go through the math and implement the general gradient boosting algorithm: the same class will implement a binary classifier that minimizes the logistic loss function and two regressors that minimize the mean squared error (MSE) and the root mean squared logarithmic error (RMSLE). This way, we will see that we can optimize arbitrary differentiable functions using gradient boosting and how this technique adapts to different contexts.
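
For the MSE case, the idea fits in a few lines: each new tree is fit to the residuals of the current ensemble, which coincide with the negative gradient of the squared error. The sketch below is a simplified illustration, not the assignment's class; the synthetic data, the learning rate, the tree depth, and the number of trees are assumptions, and scikit-learn's `DecisionTreeRegressor` is used as the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression task: a noisy sine wave.
rng = np.random.default_rng(17)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate, n_trees = 0.1, 100

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
trees, mse_history = [], []

for _ in range(n_trees):
    # For the loss 1/2 * (y - prediction)^2 the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=17).fit(X, residuals)
    # Take a small step in the direction predicted by the new tree.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("train MSE after 1 tree:", round(mse_history[0], 4))
print("train MSE after", n_trees, "trees:", round(mse_history[-1], 4))
```

Swapping the residual line for the negative gradient of another differentiable loss (for example, the logistic loss) gives the corresponding variant of the algorithm; the rest of the loop stays the same.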


Residuals at each gradient boosting iteration and the corresponding tree prediction: