Roadmap

This roadmap guides you through the self-paced mlcourse.ai. First, take a look at the course prerequisites. Be ready to spend around 3 months on the course, some 4-10 hours/week, though much depends on how deeply you dive into the Kaggle competitions offered along the way (they are time-consuming but very rewarding in terms of skills). To get help, join OpenDataScience; discussions are held in the #mlcourse_ai Slack channel. For a general picture of the role Machine Learning plays in a would-be Data Science career, check out the “Jump into Data Science” video, which walks you through the preparation process for your first DS position once basic ML and Python are covered (slides). If you are already in DS, this course will be a good ML refresher. Good luck!

Week 1. Exploratory Data Analysis

You surely want to jump straight into Machine Learning and see math in action. But 70-80% of the time working on a real project is fussing with data, and here Pandas shines; I use it in my work almost every day. This article describes the basic Pandas methods for preliminary data analysis. Then we analyze a dataset on telecom customer churn and try to predict churn without any training, simply relying on common sense. By no means should you underestimate such an approach.
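
To make this concrete, here is a minimal Pandas sketch in the spirit of the churn example; the file name and column names (“telecom_churn.csv”, “churn”, “total day minutes”) are assumptions for illustration, not the exact course dataset.

```python
# A quick Pandas look at a dataset, in the spirit of the telecom churn example.
# File and column names here are assumptions for illustration.
import pandas as pd

df = pd.read_csv("telecom_churn.csv")

print(df.shape)        # number of rows and columns
df.info()              # column types and missing values
print(df.describe())   # basic statistics for numerical features

# Class balance, plus a "common sense" check:
# do churned customers differ in daytime usage?
print(df["churn"].value_counts(normalize=True))
print(df.groupby("churn")["total day minutes"].mean())
```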

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture
  3. Complete assignment 1 where you’ll be exploring demographic data (UCI “Adult”), and (opt.) check out the solution

Week 2. Visual Data Analysis

The role of visual data analysis is hard to overestimate: this is how new insights are found in data and how features are engineered. Here we discuss the main data visualization techniques and how they are applied in practice. We also take a sneak peek into multidimensional feature space with the t-SNE algorithm, which is sometimes useful but mostly just draws Christmas-tree-like decorations.
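
As a small illustration of peeking into a multidimensional feature space, here is a minimal t-SNE sketch with scikit-learn; the digits dataset is an arbitrary stand-in, not the course data.

```python
# A minimal t-SNE sketch: project a 64-dimensional dataset onto a 2D plane.
# The digits dataset is used purely for illustration.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

X_2d = TSNE(n_components=2, random_state=17).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
plt.title("t-SNE projection of the digits dataset")
plt.show()
```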

  1. Read two articles:
  2. (opt.) watch a video lecture
  3. Complete assignment 2 where you’ll be analyzing cardiovascular disease data, and (opt.) check out the solution

Week 3. Classification, Decision Trees and k Nearest Neighbors

Here we delve into machine learning and discuss two simple approaches to the classification problem. In a real project, you’d better start with something simple, and you’d often try out decision trees or nearest neighbors (as well as linear models, the next topic) right after even simpler heuristics. We discuss the pros and cons of trees and nearest neighbors. We also touch upon the important topics of assessing the quality of model predictions and performing cross-validation. The article is long, but decision trees in particular deserve it: they form the foundation for Random Forest and Gradient Boosting, the two algorithms you’ll most likely be using in practice.
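
A minimal sketch of the two approaches plus cross-validation, on synthetic data (the dataset and hyperparameters are illustrative assumptions):

```python
# Compare a decision tree and k nearest neighbors with 5-fold cross-validation.
# Synthetic data and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=17)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=17),
    "kNN": KNeighborsClassifier(n_neighbors=10),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```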

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
    • theory behind decision trees, an intuitive explanation
    • practice with Sklearn decision trees
  3. Complete assignment 3 on decision trees, and (opt.) check out the solution

Week 4. Linear Classification and Regression

The following 5 articles could form a small brochure, and for a good reason: linear models are the most widely used family of predictive algorithms. These articles represent our course in miniature: a lot of theory, a lot of practice. We discuss the theoretical basis of the Ordinary Least Squares method and logistic regression, as well as their merits in practical applications. Crucial concepts like regularization and learning curves are also introduced. In the practical part, we apply logistic regression to the task of user identification on the Internet in a Kaggle Inclass competition (a.k.a. “Alice”; we go on with this competition in Week 6).
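
As a small illustration of logistic regression with regularization, here is a sketch that tunes the inverse regularization strength C via cross-validation; the synthetic data and parameter grid are assumptions, not the course task.

```python
# Logistic regression with L2 regularization: tune the inverse regularization
# strength C (smaller C = stronger regularization) by cross-validation.
# Synthetic data and the grid of C values are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=30, random_state=17)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```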

Frankly speaking, this week’s material is more likely to take you 2-3 weeks to digest and practice, and that’s fine. But for consistency with article numbering, we stick to “Week 4” for this topic.

  1. Read 5 articles:
  2. Watch a video lecture on logistic regression coming in 2 parts:
  3. Watch a video lecture on regression and regularization coming in 2 parts:
    • theory behind linear models, an intuitive explanation
    • business case, where we discuss a real regression task - predicting customer Life-Time Value
  4. Complete assignment 4 on sarcasm detection, and (opt.) check out the solution
  5. Complete assignment 6 (sorry for misleading numbering here) on OLS, Lasso and Random Forest in a regression task, and (opt.) check out the solution

Week 5. Bagging and Random Forest

Yet again, both the theory and practice are exciting. We discuss why the “wisdom of the crowd” works for machine learning models: an ensemble of several models works better than any single one of its members. In practice, we try out Random Forest (an ensemble of many decision trees), a “default algorithm” in many tasks. We discuss in detail the numerous advantages of the Random Forest algorithm and its applications. No silver bullet, though: in some cases linear models still work better and faster.
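
A minimal sketch of the “wisdom of the crowd” effect: a single decision tree versus a Random Forest on the same data (the synthetic dataset and parameters are illustrative assumptions).

```python
# A single decision tree vs. a Random Forest (an ensemble of trees trained on
# bootstrap samples). Synthetic data; the point is only the comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=17)

models = {
    "single tree": DecisionTreeClassifier(random_state=17),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=17),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC {scores.mean():.3f}")
```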

  1. Read 3 articles:
  2. Watch a video lecture coming in 3 parts:
    • part 1 on Random Forest
    • part 2 on classification metrics
    • business case, where we discuss a real classification task - predicting customer payment
  3. Complete assignment 5 where you compare logistic regression and Random Forest in the credit scoring problem, and (opt.) check out the solution

Week 6. Feature Engineering and Feature Selection

Feature engineering is one of the most interesting parts of the whole ML process. It’s an art, or at least a craft, and is therefore not yet well automated. The article describes ways of working with heterogeneous features in various ML tasks involving texts, images, geodata, etc. Practicing with the “Alice” competition will convince you how powerful feature engineering can be, and that it’s a lot of fun as well!
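
In the spirit of the “Alice” competition, here is a minimal sketch of combining heterogeneous features: a sparse bag-of-sites representation plus a hand-crafted time feature. The toy data and column names (“sites”, “hour”) are hypothetical.

```python
# Combine a bag-of-sites text representation with a simple time-based feature
# into one sparse matrix. Toy data and column names are hypothetical.
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

sessions = pd.DataFrame({
    "sites": ["google.com mail.google.com google.com", "vk.com youtube.com vk.com"],
    "hour": [9, 23],
})

vectorizer = CountVectorizer(token_pattern=r"\S+")       # keep full site names as tokens
X_sites = vectorizer.fit_transform(sessions["sites"])    # sparse bag of sites

is_evening = (sessions["hour"] >= 18).astype(int).values.reshape(-1, 1)
X = hstack([X_sites, csr_matrix(is_evening)])            # combined sparse feature matrix
print(X.shape)
```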

  1. Read the article (same in a form of a Kaggle Notebook)
  2. Kaggle: Now that you’ve beaten simple baselines in the “Alice” competition (see Topic 4), check out some more advanced Notebooks:

    Go on with feature engineering and try to achieve ~0.955 (or higher) ROC AUC on the Public Leaderboard. Alternatively, if a better solution is already shared by the time you join the competition, try to improve the best publicly shared solution by at least 0.5%. However, please do not share high-performing solutions: doing so ruins the competitive spirit and also hurts other courses that include this competition in their syllabus.

Week 7. Unsupervised Learning: Principal Component Analysis and Clustering

Here we turn to the vast topic of unsupervised learning: the cases when we have data but it is unlabeled, with no target feature to predict as in classification/regression tasks. In fact, most of the data out there is unlabeled, and we need to be able to make use of it. We discuss only two types of unsupervised learning tasks: clustering and dimensionality reduction.
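
A minimal sketch of the two tasks: reduce dimensionality with PCA, then cluster with k-means. The digits dataset and the number of clusters are illustrative assumptions.

```python
# Dimensionality reduction with PCA followed by k-means clustering.
# The digits dataset and k=10 are illustrative choices only.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # labels are ignored: unsupervised setting

X_reduced = PCA(n_components=2, random_state=17).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=17).fit_predict(X_reduced)
print(labels[:20])
```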

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
  3. Complete assignment 7 where you analyze data coming from mobile phone accelerometers and gyroscopes to cluster people into different types of physical activities, and (opt.) check out the solution

Week 8. Vowpal Wabbit: Learning with Gigabytes of Data

The theoretical part here covers the analysis of Stochastic Gradient Descent; it was this optimization method that made it possible to successfully train both neural networks and linear models on really large training sets. We also discuss what can be done when a supervised learning task has millions of features (the “hashing trick”) and then move on to Vowpal Wabbit, a utility that allows you to train a model on gigabytes of data in a matter of minutes, sometimes with acceptable quality. We consider several cases, including tagging StackOverflow questions with a training set of several gigabytes.
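
Vowpal Wabbit itself is a command-line tool, but the two key ideas (the hashing trick and SGD-trained linear models) can be sketched in plain scikit-learn; the toy texts and labels below are made up.

```python
# The hashing trick (a fixed-size feature space, no vocabulary to store) plus a
# linear model trained with stochastic gradient descent. Plain scikit-learn,
# not Vowpal Wabbit itself; texts and labels are made up for illustration.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

texts = ["how to sort a list in python", "segmentation fault in a c program"]
labels = ["python", "c"]

vectorizer = HashingVectorizer(n_features=2 ** 20)   # hash features into 2^20 buckets
X = vectorizer.transform(texts)

clf = SGDClassifier(loss="log_loss", random_state=17)
clf.partial_fit(X, labels, classes=["python", "c"])  # online update, batch by batch
print(clf.predict(vectorizer.transform(["python list comprehension"])))
```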

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
  3. Complete assignment 8, “Implementing online regressor”, which walks you through an implementation from scratch and is very good for an intuitive understanding of the algorithm. Optionally, check out the solution

Week 9. Time Series Analysis with Python

Here we discuss various approaches to working with time series: what data preparation is necessary and how to get short-term and long-term predictions. We walk through various types of time series models, from a simple moving average to gradient boosting. We also take a look at ways to search for anomalies in time series and discuss the pros and cons of these methods.
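
A minimal sketch of the simplest model mentioned here, a moving average: smooth the series and forecast the next point as the mean of the last k observations. The synthetic series and window size are illustrative assumptions.

```python
# Moving average: smooth a series and use the mean of the last `window` points
# as a naive one-step-ahead forecast. The synthetic series is illustrative only.
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=60, freq="D")
series = pd.Series(np.sin(np.arange(60) / 5) + np.random.normal(0, 0.1, 60), index=idx)

window = 7
rolling_mean = series.rolling(window=window).mean()   # smoothed series
forecast = series.iloc[-window:].mean()               # naive next-point forecast
print(round(forecast, 3))
```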

  1. Read the following two articles:
  2. (opt.) watch a video lecture
  3. Complete assignment 9 where you’ll get your hands dirty with ARIMA and Prophet, and (opt.) check out the solution

Week 10. Gradient Boosting

Gradient boosting is one of the most prominent Machine Learning algorithms; it finds a lot of industrial applications. For instance, the Yandex search engine is a big and complex system with gradient boosting (MatrixNet) somewhere deep inside. Many recommender systems are also built on boosting. It is a very versatile approach, applicable to classification, regression, and ranking. Therefore, here we cover both the theoretical basics of gradient boosting and the specifics of the most widespread implementations: XGBoost, LightGBM, and CatBoost.
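
A minimal sketch of training a gradient boosting classifier with CatBoost (matching the starter mentioned below); the synthetic data and parameters are illustrative, and the same idea carries over to XGBoost and LightGBM with their own APIs.

```python
# Gradient boosting with CatBoost on synthetic data; parameters are illustrative.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=17)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=17)

model = CatBoostClassifier(iterations=300, learning_rate=0.1, verbose=False)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
print("validation accuracy:", round(model.score(X_valid, y_valid), 3))
```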

  1. Read the article (also available as a Kaggle Notebook)
  2. (opt.) watch a video lecture coming in 2 parts:
    • part 1, key ideas behind major implementations: XGBoost, LightGBM, and CatBoost
    • part 2, practice
  3. Kaggle: Take a look at the “Flight delays” competition and a starter with CatBoost. Start analyzing the data and building features to improve your solution. Try to improve the best publicly shared solution by at least 0.5%. But still, please do not share high-performing solutions: it ruins the competitive spirit.