mlcourse.ai – Open Machine Learning Course

Author: Dmitry Sergeyev, Zeptolab. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #9. Fall 2018

Time series analysis

Fill in the missing code in the cells and choose answers in the web form.

Importing necessary libraries

In [1]:
import random
random.seed(42)

import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline


def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

We will take real time series data: the total number of ads watched per hour in one of our games.

In [2]:
df = pd.read_csv('../../data/ads_hour.csv', index_col=['Date'], parse_dates=['Date'])
In [3]:
with plt.style.context('bmh'):    
    plt.figure(figsize=(15, 8))
    plt.title('Ads watched (hour ticks)')
    plt.plot(df.ads);

We have everything you could ask for - trend, seasonality, even some outliers.

In this assignment we will concentrate on methods that have proven to work in practice and can provide quality comparable to that of ARIMA models: feature engineering, feature selection, and machine learning.

But before digging into practice, a tiny bit of theory on how to create even more features. In this homework we will be working with linear models. That means we assume there is a linear dependence between our target variable and the feature set, and we try to model that dependence. Now remember the hour variable we created in the lecture? That variable showed at which hour a given observation arrived. Naturally, we expect hour to play a huge role, since at night the number of active users who can watch ads drops significantly, while during the day it rises to its peak. However, if we use a single feature with the hour number, linear models will not be able to model that dependence correctly. Why? Look at the feature itself: it takes values from $0$ to $23$, and even though we know that 0:00 is closer to 23:00 than, say, 20:00 is to 23:00, the model will only see the numbers, not the logic behind them. And, of course, the number 0 is much further from 23 than 20 is from 23.

How will we solve that problem?

  • First possible solution - get dummies and make 24 new columns out of one. Clear disadvantages - we explode the dimensionality of our data and lose any trace of the cyclical nature of hours.

  • Second solution - a sine/cosine transformation. To fully understand the approach, read this short article. In short, we want to encode the hour feature with two new columns, which are the sine and cosine transformations of "hours from midnight".

We will use the already familiar prepareData function with some modifications - a sine/cosine transformation for the hour feature and dummy encoding for the weekday feature.
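
As a quick illustration of why this encoding respects the cyclical structure of the clock (a tiny standalone check, not part of the assignment): in the (sin, cos) plane, 0:00 lands close to 23:00, while 20:00 stays noticeably further from 23:00.

In [ ]:
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.sin(angle), np.cos(angle)

for a, b in [(0, 23), (20, 23)]:
    dist = np.linalg.norm(np.subtract(encode_hour(a), encode_hour(b)))
    print("distance between {}:00 and {}:00 = {:.3f}".format(a, b, dist))
# distance between 0:00 and 23:00 = 0.261
# distance between 20:00 and 23:00 = 0.765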

In [4]:
def prepareData(data, lag_start=5, lag_end=14, test_size=0.15):
    """
    series: pd.DataFrame
        dataframe with timeseries

    lag_start: int
        initial step back in time to slice target variable 
        example - lag_start = 1 means that the model 
                  will see yesterday's values to predict today

    lag_end: int
        final step back in time to slice target variable
        example - lag_end = 4 means that the model 
                  will see up to 4 days back in time to predict today

    test_size: float
        size of the test dataset after train/test split as percentage of dataset

    """
    data = pd.DataFrame(data.copy())
    data.columns = ["y"]
    
    # calculate test index start position to split data on train test
    test_index = int(len(data) * (1 - test_size))
    
    # adding lags of original time series data as features
    for i in range(lag_start, lag_end):
        data["lag_{}".format(i)] = data.y.shift(i)
        
    # transforming df index to datetime and creating new variables
    data.index = pd.to_datetime(data.index)
    data["hour"] = data.index.hour
    data["weekday"] = data.index.weekday
        
    # since we will be using only linear models we need to get dummies from weekdays 
    # to avoid imposing weird algebraic rules on day numbers
    data = pd.concat([
        data.drop("weekday", axis=1), 
        pd.get_dummies(data['weekday'], prefix='weekday')
    ], axis=1)
        
    # encode hour with sin/cos transformation
    # credits - https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/
    data['sin_hour'] = np.sin(2*np.pi*data['hour']/24)
    data['cos_hour'] = np.cos(2*np.pi*data['hour']/24)
    data.drop(["hour"], axis=1, inplace=True)
        

    data = data.dropna()
    data = data.reset_index(drop=True)
    
    
    # splitting whole dataset on train and test
    X_train = data.loc[:test_index].drop(["y"], axis=1)
    y_train = data.loc[:test_index]["y"]
    X_test = data.loc[test_index:].drop(["y"], axis=1)
    y_test = data.loc[test_index:]["y"]
    
    return X_train, X_test, y_train, y_test

Use the function on the original df to prepare the datasets needed for model training. Reserve 30% of the data for testing, and use initial lag 12 and final lag 48. This way the model will be able to make forecasts twelve steps ahead, having observed data from half a day up to two days back in time.

Scale the resulting datasets with StandardScaler and create new variables: X_train_scaled and X_test_scaled. Don't forget that the scaler should be fitted on the train set only, to prevent information about the future from leaking.

In [5]:
# Your code here
# X_train, X_test, y_train, y_test = 
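
One possible way to fill in the cell above (a sketch following the assumptions stated in the task: 30% test share, lags 12 to 48, scaler fitted on the train set only):

In [ ]:
X_train, X_test, y_train, y_test = prepareData(df, lag_start=12, lag_end=48, test_size=0.3)

# fit the scaler on the training data only, then reuse its statistics for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)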

Now train a simple linear regression on the scaled data:

In [6]:
# Your code here
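
For example (a minimal sketch, assuming X_train_scaled and y_train from the previous cell):

In [ ]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)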

Check the quality of the model on the training set via cross-validation. To do so, create a generator object of time series CV splits with the help of TimeSeriesSplit. Set the number of folds equal to 5. Then use cross_val_score, feeding its cv parameter with the created generator. The quality metric should be neg_mean_absolute_error.

Don't forget to take an average of the result and multiply it by -1.

In [7]:
# Your code here
# tscv = 
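
A possible sketch (lr here is the regression trained in the previous cell; any variable name works):

In [ ]:
tscv = TimeSeriesSplit(n_splits=5)
cv_score = cross_val_score(lr, X_train_scaled, y_train,
                           cv=tscv, scoring="neg_mean_absolute_error")
# average over the folds and flip the sign to get MAE
print(-1 * cv_score.mean())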

Task 1. What is the value of MAE on cross-validation?

  • 4876
  • 41454725
  • 4490
  • 0.712

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a9_q1

Now let's have a look at the forecast itself. We'll need the plotModelResults function.

In [ ]:
def plotModelResults(model, df_train, df_test, y_train, y_test, plot_intervals=False, scale=1.96, cv=tscv):
    """
    Plots modelled vs fact values
    
    model: fitted model 
    
    df_train, df_test: split feature sets
    
    y_train, y_test: targets
    
    plot_intervals: bool, if True, plot prediction intervals
    
    scale: float, sets the width of the intervals
    
    cv: cross validation method, needed for intervals
    
    """
    # making predictions for test
    prediction = model.predict(df_test)
    
    plt.figure(figsize=(20, 7))
    plt.plot(prediction, "g", label="prediction", linewidth=2.0)
    plt.plot(y_test.values, label="actual", linewidth=2.0)
    
    if plot_intervals:
        # calculate cv scores
        cv_score = cross_val_score(
            model,
            df_train,
            y_train,
            cv=cv,
            scoring="neg_mean_squared_error"
        )

        # calculate cv error deviation
        deviation = np.sqrt(cv_score.std())
        
        # calculate lower and upper intervals
        lower = prediction - (scale * deviation)
        upper = prediction + (scale * deviation)
        
        plt.plot(lower, "r--", label="upper bound / lower bound", alpha=0.5)
        plt.plot(upper, "r--", alpha=0.5)
        
    # calculate overall quality on the test set
    mae  = mean_absolute_error(y_test, prediction)
    mape = mean_absolute_percentage_error(y_test, prediction)
    plt.title("MAE {}, MAPE {}%".format(round(mae), round(mape, 2)))
    plt.legend(loc="best")
    plt.grid(True);

For model coefficients visualization use the following functions:

In [9]:
def getCoefficients(model):
    """Returns sorted coefficient values of the model"""
    coefs = pd.DataFrame(model.coef_, X_train.columns)
    coefs.columns = ["coef"]
    coefs["abs"] = coefs.coef.apply(np.abs)
    return coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)    
    

def plotCoefficients(model):
    """Plots sorted coefficient values of the model"""
    coefs = getCoefficients(model)
    
    plt.figure(figsize=(20, 7))
    coefs.coef.plot(kind='bar')
    plt.grid(True, axis='y')
    plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed')
    plt.show()

Check the model results and plot the coefficients

In [10]:
# Your code here
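
For example (assuming the lr model and tscv splitter defined earlier):

In [ ]:
plotModelResults(lr, X_train_scaled, X_test_scaled, y_train, y_test, plot_intervals=True)
plotCoefficients(lr)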

Wonderful: the intervals are plotted, the forecast quality is great, and everything seems to be fine, but... we might have too many variables in our model. Possibly some of them are not that important and can be dropped, reducing the dimensionality of the data and the chance of overfitting. To check whether we have redundant features, plot a heatmap of the X_train feature correlations with the help of the heatmap function from the seaborn library:

In [11]:
# Your code here
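
One way to draw the heatmap mentioned above (the figure size is just a convenient choice):

In [ ]:
plt.figure(figsize=(12, 10))
sns.heatmap(X_train.corr());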

Indeed, the features are highly correlated, and you can even observe a kind of "seasonality" in those correlations at every 24th lag. Let's add regularization to our model and remove some features.

Train Lasso regression with cross-validation (LassoCV), again feeding the cv parameter with the created generator object.

Plot the forecast of the model and notice that the error on the test dataset has not increased significantly.

In [12]:
# Your code here
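
A sketch of this step (reusing tscv and the scaled datasets from above):

In [ ]:
lasso = LassoCV(cv=tscv)
lasso.fit(X_train_scaled, y_train)
plotModelResults(lasso, X_train_scaled, X_test_scaled, y_train, y_test, plot_intervals=True)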

Perfect: we keep practically the same model quality while using fewer features. Use the getCoefficients function to find how many features have been dropped (their coefficients equal zero).

In [13]:
# Your code here
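
For example, rounding to 6 digits as the question suggests (lasso is the model from the previous cell):

In [ ]:
coefs = getCoefficients(lasso)
print((coefs.coef.round(6) == 0).sum())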

Task 2. How many coefficients of the Lasso model equal zero (with 6-digit precision)?

  • 11
  • 12
  • 15
  • 17

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a9_q2

Alright, we have dropped some features. But what if we want to go further and transform our linearly dependent features into a more compact representation? To do so, we'll use principal component analysis.

In [14]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

def plotPCA(pca):
    """
    Plots accumulated percentage of explained variance by component
    
    pca: fitted PCA object
    """
    components = range(1, pca.n_components_ + 1)
    variance = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
    plt.figure(figsize=(20, 10))
    plt.bar(components, variance)
    
    # additionally mark the level of 95% of explained variance 
    plt.hlines(y = 95, xmin=0, xmax=len(components), linestyles='dashed', colors='red')
    
    plt.xlabel('PCA components')
    plt.ylabel('variance')
    plt.xticks(components)
    plt.show()
In [15]:
# Create PCA object: pca


# Train PCA on scaled data


# plot explained variance
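
One possible way to fill in the cell (fit PCA with all components first, then look at the cumulative explained variance):

In [ ]:
pca = PCA()
pca.fit(X_train_scaled)
plotPCA(pca)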

Task 3. What is the minimal number of components needed to explain at least 95% of variance of the train dataset?

  • 5
  • 7
  • 9
  • 12

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a9_q3

Create the pca object once again, this time setting the optimal number of components (the minimal number explaining at least 95% of the variance). After that, create two new variables, pca_features_train and pca_features_test, assigning to them the PCA-transformed scaled datasets.

In [16]:
# Your code here
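
A sketch of this step; N_COMPONENTS below is a deliberate placeholder - substitute the number of components you found in Task 3:

In [ ]:
N_COMPONENTS = 10  # placeholder, replace with the answer to Task 3
pca = PCA(n_components=N_COMPONENTS)
pca_features_train = pca.fit_transform(X_train_scaled)
pca_features_test = pca.transform(X_test_scaled)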

Now train a linear regression on the PCA features and plot its forecast.

In [17]:
# Your code here
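
For instance (assuming pca_features_train and pca_features_test from the previous cell):

In [ ]:
lr_pca = LinearRegression()
lr_pca.fit(pca_features_train, y_train)
plotModelResults(lr_pca, pca_features_train, pca_features_test, y_train, y_test)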

Task 4. What is the MAE of the linear regression trained on PCA-transformed features?

  • 5140
  • 4917
  • 6719
  • 4663

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a9_q4