Author: [Egor Polusmak](https://www.linkedin.com/in/egor-polusmak/). Translated and edited by [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

Time series forecasting finds wide application in data analytics. These are only some of the conceivable predictions of future trends that might be useful:
- The number of servers that an online service will need next year.
- The demand for a grocery product at a supermarket on a given day.
- Tomorrow's closing price of a tradable financial asset.

As another example, we can predict a team's performance and then use the prediction as a baseline: first to set goals for the team, and then to measure the actual team performance relative to the baseline.

There are quite a few different methods to predict future trends, for example, [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average), [ARCH](https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity), [autoregressive models](https://en.wikipedia.org/wiki/Autoregressive_model), and [neural networks](https://medium.com/machine-learning-world/neural-networks-for-algorithmic-trading-1-2-correct-time-series-forecasting-backtesting-9776bfd9e589).

In this article, we will look at [Prophet](https://facebook.github.io/prophet/), a library for time series forecasting released by Facebook and open-sourced on February 23, 2017. We will also try it out on the problem of predicting the daily number of posts published on Medium.

## Article outline

1. Introduction
2. [The Prophet Forecasting Model](#the-prophet-forecasting-model)
3. [Practice with Prophet](#practice-with-facebook-prophet)
    * 3.1 Installation in Python
    * 3.2 Dataset
    * 3.3 Exploratory visual analysis
    * 3.4 Making a forecast
    * 3.5 Forecast quality evaluation
    * 3.6 Visualization
4. [Box-Cox Transformation](#box-cox-transformation)
5. [Summary](#summary)
6. [References](#references)

## 1. Introduction

According to the [article](https://research.fb.com/prophet-forecasting-at-scale/) on Facebook Research, Prophet was initially developed for the purpose of creating high-quality business forecasts. This library tries to address the following difficulties common to many business time series:
- Seasonal effects caused by human behavior: weekly, monthly and yearly cycles, dips and peaks on public holidays.
- Changes in trend due to new products and market events.
- Outliers.

The authors claim that, even with the default settings, in many cases their library produces forecasts as accurate as those delivered by experienced analysts.

Moreover, Prophet has a number of intuitive and easily interpretable customizations that allow the quality of the forecasting model to be improved gradually. Importantly, these parameters are comprehensible even for non-experts in time series analysis, a field of data science that requires a certain amount of skill and experience.

By the way, the original article is called "Forecasting at Scale", but it is not about scale in the "usual" sense, that is, the computational and infrastructure problems of running a large number of programs. According to the authors, Prophet should scale well in the following 3 areas:
- Accessibility to a wide audience of analysts, possibly without profound expertise in time series.
- Applicability to a wide range of distinct forecasting problems.
- Automated performance estimation of a large number of forecasts, including flagging of potential problems for subsequent inspection by the analyst.

## 2. The Prophet Forecasting Model

Now, let's take a closer look at how Prophet works. In essence, this library uses the [additive regression model](https://en.wikipedia.org/wiki/Additive_model) $y(t)$ comprising the following components:

$$y(t) = g(t) + s(t) + h(t) + \epsilon_{t},$$

where:
* Trend $g(t)$ models non-periodic changes.
* Seasonality $s(t)$ represents periodic changes.
* Holidays component $h(t)$ contributes information about holidays and events.

Below, we will consider some important properties of these model components.

### Trend

The Prophet library implements two possible trend models for $g(t)$.

The first one is called *Nonlinear, Saturating Growth*. It is represented in the form of the [logistic growth model](https://en.wikipedia.org/wiki/Logistic_function):

$$g(t) = \frac{C}{1+e^{-k(t - m)}},$$

where:
* $C$ is the carrying capacity (that is, the curve's maximum value).
* $k$ is the growth rate (which represents "the steepness" of the curve).
* $m$ is an offset parameter.

This logistic equation allows modelling non-linear growth with saturation, that is, growth whose rate decreases as the value approaches the carrying capacity. A typical example is the growth of the audience of an application or a website.

Actually, $C$ and $k$ are not necessarily constants and may vary over time. Prophet supports both automatic and manual tuning of their variability. The library can choose optimal points of trend changes itself by fitting the supplied historical data.

Also, Prophet allows analysts to manually set changepoints of the growth rate and capacity values at different points in time. For instance, analysts may have insights about dates of past releases that prominently influenced some key product indicators.

The second trend model is a simple *Piecewise Linear Model* with a constant rate of growth between changepoints. It is best suited for problems without saturating growth.

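To make the two trend shapes more concrete, here is a minimal NumPy sketch of both. It is an illustration only, not Prophet's actual implementation: the function names `logistic_trend` and `piecewise_linear_trend` are hypothetical, $C$ and $k$ are held constant, and the piecewise variant assumes that the slope changes by `deltas[j]` at `changepoints[j]` while the offset is adjusted to keep the curve continuous.

```python
import numpy as np


def logistic_trend(t, C, k, m):
    """Saturating growth g(t) = C / (1 + exp(-k * (t - m))) with constant C and k."""
    t = np.asarray(t, dtype=float)
    return C / (1.0 + np.exp(-k * (t - m)))


def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Linear trend with base slope k and offset m (hypothetical helper).

    The slope changes by deltas[j] at changepoints[j]; the offset is
    corrected by -changepoints[j] * deltas[j] so the curve has no jumps.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    changepoints = np.asarray(changepoints, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # a[i, j] = 1 if changepoint j has already occurred at time t[i], else 0
    a = (t[:, None] >= changepoints[None, :]).astype(float)
    slope = k + a @ deltas
    offset = m + a @ (-changepoints * deltas)
    return slope * t + offset


t = np.arange(0, 10)
# Saturating curve that reaches half of its capacity at t = m
print(logistic_trend(t, C=100.0, k=1.0, m=5.0))
# Linear growth whose slope drops from 1.0 to 0.5 at t = 5
print(piecewise_linear_trend(t, k=1.0, m=0.0, changepoints=[5.0], deltas=[-0.5]))
```

In the logistic curve, the value at $t = m$ is exactly $C/2$, which is one way to read the offset parameter $m$.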
### Seasonality

The seasonal component $s(t)$ provides a flexible model of periodic changes due to weekly and yearly seasonality.

Weekly seasonal data is modeled with dummy variables. Six new variables are added: `monday`, `tuesday`, `wednesday`, `thursday`, `friday`, `saturday`, which take values 0 or 1 depending on the day of the week. The feature `sunday` is not added because it would be a linear combination of the other days of the week, and this would have an adverse effect on the model.

The yearly seasonality model in Prophet relies on Fourier series.

Since [version 0.2](https://github.com/facebook/prophet) you can also use *sub-daily time series* and make *sub-daily forecasts* as well as employ the new *daily seasonality* feature.

### Holidays and Events

The component $h(t)$ represents predictable abnormal days of the year, including those on irregular schedules, e.g., Black Fridays.

To use this feature, the analyst needs to provide a custom list of events.

### Error

The error term $\epsilon_{t}$ represents information that was not reflected in the model. Usually it is modeled as normally distributed noise.

### Prophet Benchmarking

For a detailed description of the model and algorithms behind Prophet, refer to the paper ["Forecasting at scale"](https://peerj.com/preprints/3190/) by Sean J. Taylor and Benjamin Letham.

The authors also compared their library with several other methods for time series forecasting. They used [Mean Absolute Percentage Error (MAPE)](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) as a measure of prediction accuracy. In this comparison, Prophet showed substantially lower forecasting error than the other models.

Let's look closer at how the forecasting quality was measured in the article. To do this, we will need the formula of Mean Absolute Percentage Error.

Let $y_{i}$ be the *actual (historical) value* and $\hat{y}_{i}$ be the *forecast value* given by our model.
Then $e_{i} = y_{i} - \hat{y}_{i}$ is the *forecast error* and $p_{i} =\frac{\displaystyle e_{i}}{\displaystyle y_{i}}$ is the *relative forecast error*.

We define

$$MAPE = mean\big(\left |p_{i} \right |\big)$$

MAPE is widely used as a measure of prediction accuracy because it expresses error as a percentage and thus can be used to compare model evaluations across different datasets.

In addition, when evaluating a forecasting algorithm, it may prove useful to calculate [MAE (Mean Absolute Error)](https://en.wikipedia.org/wiki/Mean_absolute_error) in order to have a picture of errors in absolute numbers. Using the previously defined components, its equation is

$$MAE = mean\big(\left |e_{i}\right |\big)$$

A few words about the algorithms that Prophet was compared with. Most of them are quite simple and are often used as a baseline for other models:
* `naive` is a simplistic forecasting approach where we predict all future values relying solely on the observation at the last available point in time.
* `snaive` (seasonal naive) is a model that makes constant predictions taking into account information about seasonality. For instance, in the case of weekly seasonal data, for each future Monday we would predict the value from the last Monday, for all future Tuesdays we would use the value from the last Tuesday, and so on.
* `mean` uses the average value of the data as a forecast.
* `arima` stands for *Autoregressive Integrated Moving Average*, see [Wikipedia](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) for details.
* `ets` stands for *Exponential Smoothing*, see [Wikipedia](https://en.wikipedia.org/wiki/Exponential_smoothing) for more.

## 3. Practice with Facebook Prophet

### 3.1 Installation in Python

First, you need to install the library. Prophet is available for Python and R. The choice will depend on your personal preferences and project requirements. In the rest of this article, we will use Python.
In Python you can install Prophet using PyPI:

```
$ pip install fbprophet
```

Note that since version 1.0 the package has been published on PyPI under the name `prophet`; `fbprophet` is the name used by earlier releases.

In R you can find the corresponding CRAN package. Refer to the [documentation](https://facebookincubator.github.io/prophet/docs/installation.html) for details.

Let's import the modules that we will need, and initialize our environment:

```{code-cell} ipython3
import os
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

#%config InlineBackend.figure_format = 'retina'
```

### 3.2 Dataset

We will predict the daily number of posts published on [Medium](https://medium.com/).

First, we load our dataset.

```{code-cell} ipython3
def download_file_from_gdrive(file_url, filename, out_path: Path, overwrite=False):
    """
    Downloads a file from GDrive given a URL
    :param file_url: a string formatted as https://drive.google.com/uc?id=