Open Machine Learning Course

Author: Yury Kashnitsky, Data Scientist at Group
Translated by Anna Larionova, DS @ Picturer, Data4, BNTouch
All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.

Assignment #4 (demo)

Linear Regression as an optimization problem

(no solution shared, part of this Coursera specialization)

Part 1. Basic data analysis with Pandas

In this task we will use SOCR data containing the heights and weights of 25 thousand teenagers.

[1]. If you haven't installed the Seaborn library yet, run conda install seaborn in the terminal. (Seaborn may not be included in your Python installation; it provides convenient high-level functionality for data visualization.)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Read the height and weight data into a Pandas DataFrame:

In [2]:
data = pd.read_csv('../../data/weights_heights.csv', index_col='Index')

The first thing to do after reading the data is to look at the first few records. This helps catch reading errors (for example, getting 1 column instead of 10, with 9 semicolons in its name). It also gives a closer look at the features and their nature (numerical, categorical, etc.).
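
As a minimal sketch of the kind of reading error described above, the snippet below parses a small semicolon-separated text with the default comma separator. The text in `raw` is a made-up stand-in, not the course dataset; only `pd.read_csv`, `head`, `shape` and `dtypes` are real Pandas API.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated contents (illustration only).
raw = "Index;Height;Weight\n1;65.8;113.0\n2;71.5;136.5\n3;69.4;153.0\n"

# With the default comma separator everything lands in a single column
# whose name still contains the semicolons.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)
print(wrong.head())

# Passing the right separator restores the expected columns.
right = pd.read_csv(io.StringIO(raw), sep=';', index_col='Index')
print(right.head())
print(right.dtypes)
```

Looking at `shape`, `head()` and `dtypes` right after reading is usually enough to spot this kind of problem.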

Then we should plot histograms of the feature distributions. They help us understand the nature of each feature (power-law, normal, or some other distribution). A histogram can also reveal values that differ strongly from the rest: outliers. It is convenient to plot histograms using the plot method of a Pandas DataFrame with the option kind='hist'.

Example. Let's plot the histogram of the teenagers' height distribution. We use the plot method of the DataFrame data with the option y='Height' (the feature whose distribution we want to plot).

In [3]:
data.plot(y='Height', kind='hist', 
           color='red',  title='Height (inch.) distribution');


  • y='Height' - the feature whose distribution we want to plot
  • kind='hist' - the plot type is a histogram
  • color='red' - sets the color

[2]. Look at the first 5 rows using the head method of the Pandas DataFrame. Plot the histogram of the weight distribution using the plot method of the Pandas DataFrame. Make the histogram green and add a title.

In [4]:
# Your code here
In [5]:
# Your code here

One of the most effective methods of basic data analysis is plotting pairwise dependencies between features. We make $m \times m$ plots ($m$ is the number of features), with histograms of the feature distributions on the diagonal and scatter plots of feature pairs off the diagonal. We can do this using the scatter_matrix function from pandas.plotting or the pairplot function of the Seaborn library.
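
A minimal sketch of the Pandas variant, assuming synthetic data: the DataFrame `toy` and its columns `a`, `b`, `c` are invented for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in data (assumption): 'b' depends on 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.5, size=200),
    'c': rng.normal(size=200),
})

# m x m grid of axes: histograms on the diagonal, scatter plots elsewhere.
axes = scatter_matrix(toy, diagonal='hist', figsize=(6, 6))
print(axes.shape)
```

The returned array of axes has one row and one column per feature, so individual panels can be adjusted afterwards if needed.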

To illustrate this method, we add a third feature. Let's create the body mass index (BMI). To do this, we use the apply method of the Pandas DataFrame and Python's lambda functions.

In [6]:
def make_bmi(height_inch, weight_pound):
    METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462
    return (weight_pound / KILO_TO_POUND) / \
           (height_inch / METER_TO_INCH) ** 2
In [7]:
data['BMI'] = data.apply(lambda row: make_bmi(row['Height'], 
                                              row['Weight']), axis=1)
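
As a side note, the same function can also be applied to whole columns at once, because it only uses arithmetic that Pandas Series support. The sketch below compares both variants on a few made-up rows (the DataFrame `toy` is an assumption for illustration).

```python
import numpy as np
import pandas as pd

METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462

def make_bmi(height_inch, weight_pound):
    # BMI = weight in kilograms divided by squared height in meters.
    return (weight_pound / KILO_TO_POUND) / (height_inch / METER_TO_INCH) ** 2

# Toy rows (assumption), in inches and pounds like the dataset.
toy = pd.DataFrame({'Height': [65.0, 70.0, 68.5],
                    'Weight': [120.0, 150.0, 135.0]})

# Row-wise version, as in the cell above.
bmi_apply = toy.apply(lambda row: make_bmi(row['Height'], row['Weight']), axis=1)

# Column-wise version: the function receives two Series.
bmi_vectorized = make_bmi(toy['Height'], toy['Weight'])

print(np.allclose(bmi_apply, bmi_vectorized))
```

On large DataFrames the column-wise form is typically much faster than apply with axis=1, while giving the same values.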

[3]. Create a figure showing the pairwise dependencies of the features 'Height', 'Weight' and 'BMI'. Use the pairplot function of the Seaborn library.

In [8]:
# Your code here

During basic analysis you often have to investigate how a numerical feature depends on a categorical one (for example, the dependency between salary and employee sex). In this case we can use box plots from the Seaborn library. A box plot is a compact way to show summary statistics of a real-valued feature (median and quartiles) for each value of a categorical feature. It also helps find outliers: observations whose values differ strongly from the others.
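
To make the statistics behind a box plot concrete, here is a small sketch that computes them directly with Pandas. The column names `group` and `salary` and all values are hypothetical. The 1.5 * IQR rule used for outliers is the default whisker convention in Matplotlib and Seaborn, and only the upper side is checked here for brevity.

```python
import pandas as pd

# Hypothetical salaries in two groups (illustration only).
toy = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'salary': [30, 32, 35, 38, 90, 50, 52, 55, 58, 60],
})

# Quartiles per category: the box spans Q1..Q3, the line inside is the median.
stats = toy.groupby('group')['salary'].quantile([0.25, 0.5, 0.75]).unstack()
stats['IQR'] = stats[0.75] - stats[0.25]

# Points more than 1.5 * IQR above the box are drawn as outliers.
upper = stats[0.75] + 1.5 * stats['IQR']
outliers = toy[toy['salary'] > toy['group'].map(upper)]

print(stats)
print(outliers)
```

In this toy data, group A contains one salary far above the rest, and it is the only row flagged.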

[4]. Create a new feature weight_cat in the DataFrame data with 3 values: 1 if the weight is less than 120 pounds, 3 if the weight is greater than or equal to 150 pounds, and 2 otherwise. Create a box plot showing the dependency between height and weight category. Use the boxplot function of the Seaborn library and the apply method of the Pandas DataFrame. Label the y axis "Height" and the x axis "Weight category".

In [9]:
def weight_category(weight):
    # Your code here

data['weight_cat'] = data['Weight'].apply(weight_category)
# Your code here

[5]. Create a scatter plot of the dependency between height and weight using the plot method of the Pandas DataFrame with the option kind='scatter'. Add a title to the figure.

In [10]:
# Your code here

Part 2. Squared Error Minimization

In the basic case, the task of predicting a real value from other features (a regression task) can be solved by minimizing the squared error.

[6]. Write a function that computes the squared error of approximating the dependency of height $y$ on weight $x$ with the straight line $y = w_0 + w_1 * x$, as a function of the two parameters $w_0$ and $w_1$: $$error(w_0, w_1) = \sum_{i=1}^n {(y_i - (w_0 + w_1 * x_i))}^2 $$ where $n$ is the number of observations in the dataset, and $y_i$ and $x_i$ are the height and weight of the $i$-th person in the dataset.

In [11]:
# Your code here
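
One possible way to sanity-check such a function is to evaluate it on a few points whose answer is known in advance. The sketch below is only an illustration on toy arrays: the name `squared_error` and its signature (with `x` and `y` passed explicitly) are assumptions, not the required interface.

```python
import numpy as np

def squared_error(w0, w1, x, y):
    """Sum of squared residuals of the line y = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# Toy points lying exactly on y = 1 + 2x (assumption, not the course data).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

print(squared_error(1.0, 2.0, x, y))  # perfect fit: 0.0
print(squared_error(0.0, 2.0, x, y))  # every residual equals 1: 4.0
```

The true parameters give zero error, and shifting the intercept by 1 adds a residual of 1 for each of the 4 points.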

So we are solving the problem of drawing a straight line through the cloud of points that correspond to the observations in our dataset, in the space of the features "Height" and "Weight", so as to minimize the function from [6]. Let's start by drawing some lines and checking how well they capture the dependency of height on weight.

[7]. On the plot from task [5] (Part 1), draw two straight lines corresponding to the parameter values $(w_0, w_1) = (60, 0.05)$ and $(w_0, w_1) = (50, 0.16)$. Use the plot function from matplotlib.pyplot and the linspace function from the NumPy library. Add titles to the axes and the plot.

In [12]:
# Your code here

Minimizing the squared error function is a relatively easy task because the function is convex. There are many optimization methods for this problem. Let's look at how the error function depends on one parameter (the slope of the straight line) when the other parameter (the intercept) is fixed.
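
With the intercept fixed, the error is a quadratic function of the slope: setting its derivative with respect to $w_1$ to zero gives $w_1 = \sum_i x_i (y_i - w_0) / \sum_i x_i^2$. The sketch below checks this on synthetic data (the generated `x` and `y` are assumptions that only loosely imitate weights and heights) by comparing the formula with a brute-force search over a grid of slopes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(100, 160, size=500)                  # synthetic "weights"
y = 50 + 0.14 * x + rng.normal(scale=2, size=500)    # synthetic "heights"

w0 = 50.0
w1_grid = np.linspace(-5, 5, 2001)                   # grid step 0.005

# Error for every candidate slope, computed by broadcasting over the grid.
errors = ((y[None, :] - (w0 + w1_grid[:, None] * x[None, :])) ** 2).sum(axis=1)

w1_best_on_grid = w1_grid[np.argmin(errors)]
# Closed-form minimizer for fixed w0.
w1_closed_form = np.sum(x * (y - w0)) / np.sum(x ** 2)

print(w1_best_on_grid, w1_closed_form)
# Positive second differences indicate a convex curve along w1.
print(np.all(np.diff(errors, 2) > 0))
```

Both estimates agree up to the grid resolution, and the curve has a single minimum, which is what makes one-dimensional optimization reliable here.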

[8]. Plot the dependency of the error function from [6] on the parameter $w_1$ when $w_0 = 50$. Add titles to the axes and the plot.

In [13]:
# Your code here

Now we can use an optimization method to find the slope of the straight line approximating the dependency of height on weight when the intercept is fixed at $w_0 = 50$.
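
For orientation, here is a minimal sketch of one-dimensional bounded minimization with scipy.optimize.minimize_scalar. The function `toy_objective` is a hypothetical convex function chosen so that the answer is known, not the error function of this assignment.

```python
from scipy.optimize import minimize_scalar

def toy_objective(w):
    # Hypothetical convex function with its minimum at w = 2.
    return (w - 2.0) ** 2 + 1.0

# method='bounded' restricts the search to the given interval.
result = minimize_scalar(toy_objective, bounds=(-5, 5), method='bounded')
print(result.x, result.fun)
```

The returned object stores the minimizer in `x` and the minimum value in `fun`.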

[9]. Using the minimize_scalar function from scipy.optimize, find the minimum of the function from [6] for values of the parameter $w_1$ in the range [-5, 5]. On the plot from task [5] (Part 1), draw the straight line corresponding to the parameter values $(w_0, w_1) = (50, w_1\_opt)$, where $w_1\_opt$ is the optimal value of the parameter $w_1$ found by this minimization.

In [14]:
# Your code here
In [15]:
# Your code here

When you analyze multidimensional data, you often want to get an intuitive understanding of its nature through visualization. It is impossible to plot the data directly when you have more than 3 features. In that case it is better to extract 2 or 3 principal components from the data and plot them in a plane or in 3D space.
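
A rough sketch of the idea, assuming synthetic 5-feature data: the principal components are obtained here from the singular value decomposition of the centered data matrix using NumPy only. The data-generating setup (`latent`, `mixing`) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data (assumption): 300 observations, 5 features,
# mostly driven by 2 hidden factors plus a little noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Center the data; the rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coordinates of each observation along the first two components.
X_2d = X_centered @ Vt[:2].T
# Share of total variance carried by each component.
explained = (S ** 2) / np.sum(S ** 2)

print(X_2d.shape, explained[:2].sum())
```

The two columns of `X_2d` can then be shown on an ordinary scatter plot, and the explained variance share indicates how much information the 2D picture keeps.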

Let's see how Python can draw 3D figures, using the function $z(x,y) = \sin(\sqrt{x^2+y^2})$ as an example, for values of $x$ and $y$ from the interval [-5, 5] with step 0.25.

In [16]:
from mpl_toolkits.mplot3d import Axes3D

Create objects of type matplotlib.figure.Figure (picture) and matplotlib.axes._subplots.Axes3DSubplot (axes).

In [17]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d') # create 3D axes

# Create NumPy arrays with the data points on the X and Y axes.
# The meshgrid function builds coordinate matrices
# from the coordinate vectors. Then define the function Z(x, y).
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Finally, use the *plot_surface* method of the
# Axes3DSubplot object. Add titles to the axes.
surf = ax.plot_surface(X, Y, Z)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

[10]. Create a 3D plot of the dependency of the error function from [6] on the parameters $w_0$ and $w_1$. Label the $x$ axis "Intercept", the $y$ axis "Slope", and the $z$ axis "Error".

In [18]:
# Your code here
In [19]:
# Your code here
In [20]:
# Your code here

[11]. Find the minimum of the function from [6] using the minimize function from scipy.optimize for values of the parameter $w_0$ in the range [-100, 100] and $w_1$ in the range [-5, 5]. The starting point is $(w_0, w_1) = (0, 0)$. Use the L-BFGS-B optimization method (the method argument of minimize). On the plot from task [5] (Part 1), draw the straight line corresponding to the optimal values of the parameters $w_0$ and $w_1$ that you found. Add titles to the axes and the plot.

In [21]:
# Your code here
In [22]:
# Your code here
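
For reference, the call pattern of scipy.optimize.minimize with box constraints can be sketched on a simple function whose minimum is known. The function `toy_objective` below is a hypothetical two-parameter bowl, not the error function from [6]; the bounds and starting point mirror the ones named in the task.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(w):
    # Hypothetical convex bowl with its minimum at (3, -1).
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

result = minimize(toy_objective, x0=np.array([0.0, 0.0]),
                  method='L-BFGS-B',
                  bounds=[(-100, 100), (-5, 5)])

print(result.x, result.fun, result.success)
```

Here `bounds` takes one (min, max) pair per parameter, in the same order as the entries of `x0`.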