Assignment #9 (demo). Time series analysis. Solution#

mlcourse.ai – Open Machine Learning Course

Author: Mariya Mansurova, Analyst & developer in Yandex.Metrica team. Translated by Ivan Zakharov, ML enthusiast.
This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Same assignment as a Kaggle Notebook + solution.

In this assignment, we are using Prophet and ARIMA to analyze the number of views for a Wikipedia page on Machine Learning.

Fill cells marked with “Your code here” and submit your answers to the questions through the web form.

import warnings

warnings.filterwarnings("ignore")
import os

import numpy as np
import pandas as pd
import requests
from plotly import __version__
from plotly import graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
from IPython.display import display, IFrame

print(__version__)  # need 1.9.0 or greater
init_notebook_mode(connected=True)
5.12.0
def plotly_df(df, title="", width=800, height=500):
    """Visualize all the dataframe columns as line plots."""
    common_kw = dict(x=df.index, mode="lines")
    data = [go.Scatter(y=df[c], name=c, **common_kw) for c in df.columns]
    layout = dict(title=title)
    fig = dict(data=data, layout=layout)

    # in a Jupyter Notebook, the following should work
    #iplot(fig, show_link=False)

    # in a Jupyter Book, we save a plot offline and then render it with IFrame
    plot_path = f"../../_static/plotly_htmls/{title}.html".replace(" ", "_")
    plot(fig, filename=plot_path, show_link=False, auto_open=False);
    display(IFrame(plot_path, width=width, height=height))

Data preparation#

# for the Jupyter Book version, we read the data directly from GitHub;
# alternatively, point DATA_PATH to the data/ folder from the root of your cloned
# https://github.com/Yorko/mlcourse.ai repo to save Internet traffic
DATA_PATH = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
df = pd.read_csv(DATA_PATH + "wiki_machine_learning.csv", sep=" ")
df = df[df["count"] != 0]
df.head()
date count lang page rank month title
81 2015-01-01 1414 en Machine_learning 8708 201501 Machine_learning
80 2015-01-02 1920 en Machine_learning 8708 201501 Machine_learning
79 2015-01-03 1338 en Machine_learning 8708 201501 Machine_learning
78 2015-01-04 1404 en Machine_learning 8708 201501 Machine_learning
77 2015-01-05 2264 en Machine_learning 8708 201501 Machine_learning
df.shape
(383, 7)

Predicting with FB Prophet#

We will train on all but the last 30 days and predict the number of page views for that final month.

df.date = pd.to_datetime(df.date)
plotly_df(df.set_index("date")[["count"]])
from prophet import Prophet
predictions = 30

df = df[["date", "count"]]
df.columns = ["ds", "y"]
df.tail()
ds y
382 2016-01-16 1644
381 2016-01-17 1836
376 2016-01-18 2983
375 2016-01-19 3389
372 2016-01-20 3559
train_df = df[:-predictions].copy()
m = Prophet()
m.fit(train_df);
16:01:20 - cmdstanpy - INFO - Chain [1] start processing
16:01:20 - cmdstanpy - INFO - Chain [1] done processing
future = m.make_future_dataframe(periods=predictions)
future.tail()
ds
378 2016-01-16
379 2016-01-17
380 2016-01-18
381 2016-01-19
382 2016-01-20
forecast = m.predict(future)
forecast.tail()
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper weekly weekly_lower weekly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
378 2016-01-16 2980.502288 1703.304147 2532.638180 2958.717842 2999.271292 -861.703627 -861.703627 -861.703627 -861.703627 -861.703627 -861.703627 0.0 0.0 0.0 2118.798661
379 2016-01-17 2985.922335 1875.176136 2675.011637 2962.826974 3005.897115 -720.722254 -720.722254 -720.722254 -720.722254 -720.722254 -720.722254 0.0 0.0 0.0 2265.200080
380 2016-01-18 2991.342382 2830.135344 3665.476780 2966.830439 3013.327733 281.347221 281.347221 281.347221 281.347221 281.347221 281.347221 0.0 0.0 0.0 3272.689603
381 2016-01-19 2996.762429 3112.889453 3958.332832 2970.610664 3019.922072 541.446780 541.446780 541.446780 541.446780 541.446780 541.446780 0.0 0.0 0.0 3538.209209
382 2016-01-20 3002.182476 3016.240515 3834.277697 2974.537645 3026.671679 425.562221 425.562221 425.562221 425.562221 425.562221 425.562221 0.0 0.0 0.0 3427.744697

Question 1: What is the prediction of the number of views of the wiki page on January 20? Round to the nearest integer.

  • 4947

  • 3426 [+]

  • 5229

  • 2744
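To read the answer off the forecast dataframe programmatically, one can select the point forecast for the last day of the horizon (a minimal sketch):

# select the yhat point forecast for January 20, 2016
q1_pred = forecast.loc[forecast["ds"] == "2016-01-20", "yhat"]
print(q1_pred.round().astype(int))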

m.plot(forecast)
../../_images/db7ebafc7d24006b9a72fd7727fb7c5ee793a2d2d9a611208e2c672f12826154.png
m.plot_components(forecast)
../../_images/29eaf881a7cb18c3a42eac06f633d304eb55f5d32f066ee10609b979306cced7.png
cmp_df = forecast.set_index("ds")[["yhat", "yhat_lower", "yhat_upper"]].join(
    df.set_index("ds")
)
cmp_df["e"] = cmp_df["y"] - cmp_df["yhat"]
cmp_df["p"] = 100 * cmp_df["e"] / cmp_df["y"]
print("MAPE = ", round(np.mean(abs(cmp_df[-predictions:]["p"])), 2))
print("MAE = ", round(np.mean(abs(cmp_df[-predictions:]["e"])), 2))
MAPE =  34.55
MAE =  600.81

Estimate the quality of the prediction using the last 30 points.
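For reference, these are the standard definitions of the two metrics computed above:

$$\text{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|, \qquad
\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|,$$

where $y_t$ are the observed values, $\hat{y}_t$ are the predictions, and $n = 30$ is the length of the hold-out period.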

Question 2: What is MAPE equal to?

  • 34.5 [+]

  • 42.42

  • 5.39

  • 65.91

Question 3: What is MAE equal to?

  • 355

  • 4007

  • 600 [+]

  • 903

Predicting with ARIMA#

%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

plt.rcParams["figure.figsize"] = (15, 10)

Question 4: Let’s verify the stationarity of the series using the Dickey-Fuller test. Is the series stationary? What is the p-value?

  • Series is stationary, p_value = 0.107

  • Series is not stationary, p_value = 0.107 [+]

  • Series is stationary, p_value = 0.001

  • Series is not stationary, p_value = 0.001

sm.tsa.seasonal_decompose(train_df["y"].values, period=7).plot()
print("Dickey-Fuller test: p=%f" % sm.tsa.stattools.adfuller(train_df["y"])[1])
Dickey-Fuller test: p=0.107392
../../_images/ae878b54a1f22c9bf349019559efc283e2dc74fefea7de98d69bf5b7d9cb46aa.png

The null hypothesis of the Dickey-Fuller test is the presence of a unit root; with p ≈ 0.107 we cannot reject it, so the original series is treated as non-stationary. The seasonally differenced series, however, is stationary.

train_df.set_index("ds", inplace=True)
train_df["y_diff"] = train_df.y - train_df.y.shift(7)
sm.tsa.seasonal_decompose(train_df.y_diff[7:].values, period=7).plot()
print("Dickey-Fuller test: p=%f" % sm.tsa.stattools.adfuller(train_df.y_diff[8:])[1])
Dickey-Fuller test: p=0.000000
../../_images/c8f7d64e45f0fb5dd7f1823099bff42edd6257225235101f99db1d83acdd6654.png
ax = plt.subplot(211)
sm.graphics.tsa.plot_acf(train_df.y_diff[13:].values.squeeze(), lags=48, ax=ax)

ax = plt.subplot(212)
sm.graphics.tsa.plot_pacf(train_df.y_diff[13:].values.squeeze(), lags=48, ax=ax)
../../_images/bbef59ee1f5992fd483e5bc35c118134bfa3daee7deeaefafb4b7432e3b4c15e.png

Initial approximations of the model orders, read off the ACF and PACF plots above:

  • Q = 1

  • q = 3

  • P = 3

  • p = 1
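These are the orders of a multiplicative seasonal ARIMA model, $\text{SARIMA}(p, d, q)\times(P, D, Q)_s$ with weekly seasonality $s = 7$:

$$\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\, y_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t,$$

where $B$ is the lag operator, $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, and $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts. Below we grid-search over these orders around the initial guesses.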

ps = range(0, 2)
ds = range(0, 2)
qs = range(0, 4)
Ps = range(0, 4)
Ds = range(0, 3)
Qs = range(0, 2)
from itertools import product

parameters = product(ps, ds, qs, Ps, Ds, Qs)
parameters_list = list(parameters)
len(parameters_list)
384
%%time
import warnings

from tqdm.notebook import tqdm

results1 = []
best_aic = float("inf")
warnings.filterwarnings("ignore")

for param in tqdm(parameters_list):
    # try/except is necessary because the model cannot be trained on some parameter sets
    try:
        model = sm.tsa.statespace.SARIMAX(
            train_df["y"],
            order=(param[0], param[1], param[2]),
            seasonal_order=(param[3], param[4], param[5], 7),
            # to fit the model as is even if that leads to a non-stationary / non-invertible
            # model, one could pass enforce_stationarity=False, enforce_invertibility=False
            # (as in the Box-Cox run below); see
            # https://github.com/statsmodels/statsmodels/issues/6225 for details
        ).fit(disp=-1)

    except (ValueError, np.linalg.LinAlgError):
        continue

    aic = model.aic
    # save the best model, aic, parameters
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results1.append([param, model.aic])
CPU times: user 26min 34s, sys: 5min 40s, total: 32min 14s
Wall time: 5min 41s
result_table1 = pd.DataFrame(results1)
result_table1.columns = ["parameters", "aic"]
print(result_table1.sort_values(by="aic", ascending=True).head())
             parameters          aic
165  (0, 1, 2, 3, 2, 1)  4961.632628
332  (1, 1, 1, 3, 2, 1)  4962.829123
380  (1, 1, 3, 3, 2, 1)  4965.688340
189  (0, 1, 3, 3, 2, 1)  4969.534607
213  (1, 0, 0, 3, 2, 1)  4973.212241

If we consider only the variants proposed in the web form:

result_table1[
    result_table1["parameters"].isin(
        [(1, 0, 2, 3, 1, 0), (1, 1, 2, 3, 2, 1), (1, 1, 2, 3, 1, 1), (1, 0, 2, 3, 0, 0)]
    )
].sort_values(by="aic")
             parameters          aic
356  (1, 1, 2, 3, 2, 1)  4988.977318
354  (1, 1, 2, 3, 1, 1)  5019.555903
257  (1, 0, 2, 3, 1, 0)  5022.312524
255  (1, 0, 2, 3, 0, 0)  5183.824614

Now let's do the same, but for the Box-Cox-transformed series.
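For reference, scipy.stats.boxcox applies the transformation

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\ \ln y, & \lambda = 0, \end{cases}$$

with $\lambda$ chosen by maximum likelihood; we will need to invert it later to bring predictions back to the original scale.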

import scipy.stats

train_df["y_box"], lmbda = scipy.stats.boxcox(train_df["y"])
print("The optimal Box-Cox transformation parameter: %f" % lmbda)
The optimal Box-Cox transformation parameter: 0.732841
results2 = []
best_aic = float("inf")

for param in tqdm(parameters_list):
    # try/except is necessary because the model cannot be trained on some parameter sets
    try:
        model = sm.tsa.statespace.SARIMAX(
            train_df["y_box"],
            order=(param[0], param[1], param[2]),
            seasonal_order=(param[3], param[4], param[5], 7),
            # fit the model as is even if that leads to a non-stationary / non-invertible
            # model; see https://github.com/statsmodels/statsmodels/issues/6225 for details
            enforce_stationarity=False,
            enforce_invertibility=False,
        ).fit(disp=-1)

    except (ValueError, np.linalg.LinAlgError):
        continue

    aic = model.aic
    # save the best model, aic, parameters
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results2.append([param, model.aic])

warnings.filterwarnings("default")
result_table2 = pd.DataFrame(results2)
result_table2.columns = ["parameters", "aic"]
print(result_table2.sort_values(by="aic", ascending=True).head())
             parameters          aic
65   (0, 0, 2, 3, 0, 0)    12.000000
66   (0, 0, 2, 3, 0, 1)    14.000000
261  (1, 0, 2, 3, 2, 1)  3528.568461
285  (1, 0, 3, 3, 2, 1)  3529.820329
213  (1, 0, 0, 3, 2, 1)  3530.231534

If we consider only the variants proposed in the web form:

result_table2[
    result_table2["parameters"].isin(
        [(1, 0, 2, 3, 1, 0), (1, 1, 2, 3, 2, 1), (1, 1, 2, 3, 1, 1), (1, 0, 2, 3, 0, 0)]
    )
].sort_values(by="aic")
             parameters          aic
357  (1, 1, 2, 3, 2, 1)  3544.416192
258  (1, 0, 2, 3, 1, 0)  3556.880030
355  (1, 1, 2, 3, 1, 1)  3557.894203
256  (1, 0, 2, 3, 0, 0)  3687.195117

Next, we turn to the construction of the SARIMAX model (sm.tsa.statespace.SARIMAX).
Question 5: What parameters are the best for the model according to the AIC criterion?

  • D = 1, d = 0, Q = 0, q = 2, P = 3, p = 1

  • D = 2, d = 1, Q = 1, q = 2, P = 3, p = 1 [+]

  • D = 1, d = 1, Q = 1, q = 2, P = 3, p = 1

  • D = 0, d = 0, Q = 0, q = 2, P = 3, p = 1

Let’s look at the forecast of the best AIC model.

Note: any AIC below 3000 is suspicious here and is probably caused by non-convergence of the MLE optimization, so we'll pick the 3rd-best model in terms of AIC to visualize predictions.
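One way to sift out such degenerate fits is to drop implausibly small AIC values before ranking (a minimal sketch; the 3000 cutoff is ad hoc):

# drop fits whose AIC is implausibly small (likely failed optimizations)
plausible = result_table2[result_table2["aic"] > 3000]
print(plausible.sort_values(by="aic").head(3))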

best_model = sm.tsa.statespace.SARIMAX(
    train_df["y_box"],
    order=(1, 0, 2),
    seasonal_order=(3, 2, 1, 7),
    enforce_stationarity=False,
    enforce_invertibility=False,
).fit(disp=-1)
/Users/kashnitskiyy/Library/Caches/pypoetry/virtualenvs/mlcourse-ai-hdQs52Zw-py3.10/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning:

A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.

/Users/kashnitskiyy/Library/Caches/pypoetry/virtualenvs/mlcourse-ai-hdQs52Zw-py3.10/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals
print(best_model.summary())
                                      SARIMAX Results                                      
===========================================================================================
Dep. Variable:                               y_box   No. Observations:                  353
Model:             SARIMAX(1, 0, 2)x(3, 2, [1], 7)   Log Likelihood               -1756.284
Date:                             Wed, 17 May 2023   AIC                           3528.568
Time:                                     16:12:28   BIC                           3559.176
Sample:                                          0   HQIC                          3540.766
                                             - 353                                         
Covariance Type:                               opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.8412      0.110      7.660      0.000       0.626       1.056
ma.L1         -0.3666      0.117     -3.143      0.002      -0.595      -0.138
ma.L2         -0.2172      0.086     -2.540      0.011      -0.385      -0.050
ar.S.L7       -0.6579      0.040    -16.366      0.000      -0.737      -0.579
ar.S.L14      -0.4445      0.059     -7.489      0.000      -0.561      -0.328
ar.S.L21      -0.2794      0.043     -6.475      0.000      -0.364      -0.195
ma.S.L7       -0.9650      0.051    -19.055      0.000      -1.064      -0.866
sigma2      1687.2577     93.876     17.973      0.000    1503.264    1871.251
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):               530.32
Prob(Q):                              0.98   Prob(JB):                         0.00
Heteroskedasticity (H):               0.74   Skew:                             0.95
Prob(H) (two-sided):                  0.11   Kurtosis:                         8.83
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
plt.subplot(211)
best_model.resid[13:].plot()
plt.ylabel(u"Residuals")

ax = plt.subplot(212)
sm.graphics.tsa.plot_acf(best_model.resid[13:].values.squeeze(), lags=48, ax=ax)

print("Student's test: p=%f" % stats.ttest_1samp(best_model.resid[13:], 0)[1])
print("Dickey-Fuller test: p=%f" % sm.tsa.stattools.adfuller(best_model.resid[13:])[1])
Student's test: p=0.118890
Dickey-Fuller test: p=0.000000
../../_images/cc7742cebce2b41ed0be3562069cbf7a519bab556eef38b2bbec3265c7561b9d.png
def invboxcox(y, lmbda):
    # reverse Box Cox transformation
    if lmbda == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lmbda * y + 1) / lmbda)
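A quick sanity check (a minimal sketch with a synthetic positive sample) that invboxcox really inverts scipy.stats.boxcox:

# transform a positive sample and map it back; expect True
check = np.random.RandomState(17).lognormal(size=100)
check_box, check_lmbda = scipy.stats.boxcox(check)
print(np.allclose(invboxcox(check_box, check_lmbda), check))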
train_df["arima_model"] = invboxcox(best_model.fittedvalues, lmbda)

train_df.y.tail(200).plot()
train_df.arima_model[13:].tail(200).plot(color="r")
plt.ylabel("wiki pageviews");
../../_images/0a636f4b99ddce602163818de6e3cca9f7b8ad76260349dd8c04ec355b1a8fbe.png
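For completeness, here is a minimal sketch of how one could produce the 30-day out-of-sample forecast on the original scale with this model, using the get_forecast method of the fitted statsmodels results object and the invboxcox helper above:

# forecast the held-out horizon on the Box-Cox scale...
pred_box = best_model.get_forecast(steps=predictions).predicted_mean
# ...and back-transform the point forecasts to pageview counts
arima_forecast = invboxcox(pred_box, lmbda)
print(arima_forecast.tail())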