Prerequisites#

../../_images/ods_stickers.jpg

Software requirements#

Here we cover:

  • Basics like git and bash

  • Setting the environment

  • Jupyter Notebooks

  • Jupyter Book

Git, bash, and all#

Apart from installing the environment, it’s highly recommended that you familiarize yourself with git, GitHub and bash. Learn git branching and GitHowTo are nice interactive tutorials to grasp the basics of git.

As for bash, it’s just very rewarding to be familiar with UNIX OS and command-line utils like wc, cat, sed, sort, etc. These utilities have been constantly optimized throughout several decades of UNIX existence, and many basic operations can be done with these bash utils very efficiently: counting the number of lines in a file, replacing an expression with another one for all files in a folder, etc.

Setting the environment#

You’ve got several alternatives to set up your learning environment:

  • Kaggle Notebooks or Azure ML, i.e. avoid local configurations and just use the browser

  • Pip & Anaconda or Poetry

  • Docker

Kaggle Notebooks or Azure ML#

The easiest way to start working with course materials (no local software installations needed) is to visit Kaggle Dataset mlcourse.ai and fork some Notebooks (better to keep them private). All your Jupyter notebooks with Anaconda are live and running in your browser. Almost all needed datasets are there as well. However, uploading other datasets might be tiresome.

Pip & Anaconda or Poetry#

Most python packages like NumPy, Pandas, or Sklearn can be installed manually with pip – Python installer, e.g. pip install numpy. Additionally, you’ll need Xgboost, Vowpal Wabbit, and (maybe) LightGBM and CatBoost for competitions.

However, to manage package dependencies, it’s better to use either Anaconda or Poetry.

Anaconda#

The Anaconda 3 distribution is one of the best options as it already contains the latest Python with NumPy, Pandas, Sklearn, Jupyter, and lots of other libraries. However, some other packages are also used in our course – Xgboost and/or LightGBM and/or CatBoost and Vowpal Wabbit to name a few. In addition, the Graphviz library must be installed. Installing some of them on Windows might be painful.

Poetry#

Poetry is an alternative Python packages and dependency manager.

Installing Poetry:

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -

Installing dependencies from the pyproject.toml file.

poetry install

This will install the required packages. For the rest, please refer to Poetry docs.

Docker#

This part is pretty lengthy, so we moved it to a separate next page.

Note: Using Docker is optional, you can set up your environment with Poetry or Anaconda as well. It’s hard to say which option is more challenging.

Jupyter Notebooks#

The recommended way of working with course materials is running Jupyter notebooks. If new to this, take a look at jupyter.org. In a nutshell, this is a way of mixing code, graphics, MarkDown, latex, etc. in a single development environment. Perfect for sharing your work/ideas, for prototyping and for working with educative materials.

To start working with the course materials (i.e. Jupyter notebooks):

  • install jupyter, this depends on how you set up the environment in the previous step

  • download/clone the course repo repo

  • run jupyter-notebook from the downloaded directory mlcourse.ai.

  • this opens http://localhost:8888/tree (8888 is the default port) in your browser, from there you can run Jupyter notebooks in the jupyter_english folder (NB: the most up-to-date version of course materials is in the mlcourse_ai_jupyter_book folder, see below)

  • check Jupyter docs and the interactive demo (“try classic notebook”) to get hands dirty with Jupyter

../../_images/intro_running_jupyter.png

Jupyter Book#

Note: not to be confused with Jupyter Notebooks

The mlcourse.ai website now renders a Jupyter book. A strong advantage of this type of content is that it’s actually a book with executable content meaning that the pages that you see are not just static but they are updated with each build of the book by running all Python code. This also guarantees (well, if the book is frequently re-built, say, through a CI/CD process) that the book actually shows working Python code.

To reproduce all the code that you see on the current website (lectures, assignments, solutions, etc. for all topic), clone the course repo, navigate to the mlcourse.ai directory, and run

jupyter-book build mlcourse_ai_jupyter_book

Note: this may take a long time, about an hour, to play around with a toy example, check how a template JupyterBook is created.

Then, open the HTML file located at mlcourse_ai_jupyter_book/_build/html/index.html.

You can also download any mlcourse.ai page as a Jupyter Notebook and run it yourself:

../../_images/download_as_jupyter.png