## mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitskiy (@yorko). Translated and edited by Egor Polusmak, Anastasia Manokhina, Eugene Mashkin, and Yuanyuan Pao. This material is subject to the terms and conditions of the license Creative Commons CC BY-NC-SA 4.0. Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

# Assignment #7. Fall 2018

## Principal Component Analysis and Clustering

In this assignment, we are going to walk through the built-in scikit-learn implementations of dimensionality reduction and clustering methods. Answers should be submitted via this web form.

## 1. Principal Component Analysis

First import all required modules:

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns; sns.set(style='white')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn import datasets


Use the given toy data set:

In [2]:
X = np.array([[2., 13.], [1., 3.], [6., 19.],
              [7., 18.], [5., 17.], [4., 9.],
              [5., 22.], [6., 11.], [8., 25.]])

In [3]:
plt.scatter(X[:,0], X[:, 1])
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$');


#### Question 1. What is the angle between the $x_1$ axis and the vector corresponding to the first principal component for this data (don't forget to scale the data using StandardScaler)?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q1

• 30 degrees
• 45 degrees
• 60 degrees
• 75 degrees
In [ ]:
# Your code here

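One way to approach this (a minimal sketch, not the official solution): fit PCA on the standardized toy data and read the angle of the first principal component off its two coordinates with `np.arctan2`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[2., 13.], [1., 3.], [6., 19.],
              [7., 18.], [5., 17.], [4., 9.],
              [5., 22.], [6., 11.], [8., 25.]])

# Standardize the features, then fit PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# The first principal component is a unit vector; its angle with the x1 axis
# follows from its two coordinates. Take the result modulo 180 degrees,
# since the sign of an eigenvector is arbitrary.
pc1 = pca.components_[0]
angle = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180
print('Angle with the x1 axis: %.1f degrees' % angle)
```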

#### Question 2. What are the eigenvalues of the $X^{\text{T}}X$ matrix, given $X$, a scaled matrix from the previous question?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q2

• 4 and 1.42
• 16.2 and 2702.8
• 4.02 and 51.99
• 15.97 and 2.03
In [ ]:
# Your code here

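As a sketch of one possible check: since $X^{\text{T}}X$ is symmetric, its eigenvalues can be computed directly with `np.linalg.eigvalsh`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[2., 13.], [1., 3.], [6., 19.],
              [7., 18.], [5., 17.], [4., 9.],
              [5., 22.], [6., 11.], [8., 25.]])
X_scaled = StandardScaler().fit_transform(X)

# X^T X is symmetric, so eigvalsh applies. Note that for standardized data
# its trace equals n_samples * n_features, so the eigenvalues must sum to 18 here.
eigenvalues = np.linalg.eigvalsh(X_scaled.T @ X_scaled)
print(eigenvalues)
```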

#### Question 3. What is the meaning of the two numbers from the previous question?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q3

• their squares tell what part of the initial data's variance is explained by principal components
• they define a rotation angle between the first principal component and the initial axis
• those numbers tell what part of the initial data's variance is explained by principal components
• the square roots of those numbers define a rotation angle between the first principal component and the initial axis

Let's load a dataset of people's faces and print their names. (This step requires a stable, fast internet connection.)

In [4]:
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=50,
                                       resize=0.4, data_home='../../data/faces')

print('%d objects, %d features, %d classes' % (lfw_people.data.shape[0],
                                               lfw_people.data.shape[1],
                                               len(lfw_people.target_names)))
print('\nPersons:')
for name in lfw_people.target_names:
    print(name)

1560 objects, 1850 features, 12 classes

Persons:
Ariel Sharon
Colin Powell
Donald Rumsfeld
George W Bush
Gerhard Schroeder
Hugo Chavez
Jacques Chirac
Jean Chretien
John Ashcroft
Junichiro Koizumi
Serena Williams
Tony Blair


Let's look at some faces. All images are stored in a handy lfw_people.images array.

In [5]:
fig = plt.figure(figsize=(8, 6))

for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap='gray')


#### Question 4. What is the minimal number of principal components needed to explain 90% of the variance in the data (scaled using StandardScaler)?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q4

• 75
• 76
• 77
• 78

For this task, use the svd_solver='randomized' parameter: it is an approximate PCA, but it significantly speeds up computation on large datasets. Use a fixed random_state=1 so your results are comparable.

In [ ]:
# Your code here

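The counting step can be sketched as a small helper: accumulate `explained_variance_ratio_` and find the first index where the cumulative sum crosses the threshold. The demo below runs on synthetic data as a stand-in; on the assignment you would call it with the StandardScaler-transformed `lfw_people.data`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for_variance(data, threshold=0.9):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    pca = PCA(n_components=min(data.shape), svd_solver='randomized',
              random_state=1).fit(data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic stand-in for the scaled faces matrix
rng = np.random.RandomState(1)
demo_scaled = StandardScaler().fit_transform(rng.randn(200, 50))
k = n_components_for_variance(demo_scaled, 0.9)
print(k)
```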

Plot an image showing the first 30 principal components (don't be scared when you see the result). To build it, take 30 vectors from pca.components_, reshape each back to the original image size (50 x 37), and display them. Specify cmap='binary'.

In [ ]:
# Your code here

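A plotting sketch for this grid of "eigenfaces", using random vectors as a placeholder for `pca.components_[:30]` (on the real task, substitute the components of the PCA fitted on the scaled faces):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
from matplotlib import pyplot as plt

# Placeholder standing in for pca.components_[:30]
rng = np.random.RandomState(1)
components = rng.randn(30, 50 * 37)

fig = plt.figure(figsize=(10, 10))
for i, component in enumerate(components):
    ax = fig.add_subplot(5, 6, i + 1, xticks=[], yticks=[])
    # Each component is a vector of 1850 = 50 * 37 pixel weights
    ax.imshow(component.reshape(50, 37), cmap='binary')
```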

#### Question 5. Within the first 30 principal components, which one brightens the left side of the face? More specifically, which principal component corresponds to a linear combination of the initial features (pixel intensities) that, when displayed as an image, looks like a photo lit from the right side (which is the same as a face highlighted on its left side)?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q5

• 1
• 2
• 4
• 5

Now let's create a projection of faces onto the space of the first two principal components.

#### Question 6. Who looks the least similar to the other people in the dataset if we only consider the first two principal components?

To answer this question, project the scaled data onto the first two principal components, then compute the mean value of these two components for each person over all of their images (again, use svd_solver='randomized' and random_state=1). Then, among the resulting 12 two-dimensional points, find the one farthest from the rest (by Euclidean distance). You can do this either precisely or approximately with sklearn.metrics.euclidean_distances and seaborn.heatmap.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q6

• Colin Powell
• George W Bush
• Jacques Chirac
• Serena Williams
In [ ]:
# Your code here

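The last step can be sketched as follows. One reasonable reading of "farthest from the rest" is the point with the largest total Euclidean distance to all others; the tiny three-point example below is a stand-in for the 12 mean projections.

```python
import numpy as np
from sklearn.metrics import euclidean_distances

def most_isolated(points, names):
    """Name of the point with the largest total Euclidean distance
    to all the other points."""
    distances = euclidean_distances(points)
    return names[int(distances.sum(axis=1).argmax())]

# Toy stand-in for the 12 per-person mean projections
points = np.array([[0., 0.], [1., 0.], [10., 10.]])
names = ['a', 'b', 'c']
print(most_isolated(points, names))  # the outlying point 'c'
```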

## 2. Clustering

For the next question, load the housing prices dataset:

In [ ]:
boston = datasets.load_boston()
X = boston.data


Using the elbow method (see article 7 of the course), find the optimal number of clusters to set as a hyperparameter for the k-means algorithm.

#### Question 7. What is the optimal number of clusters to use on the housing prices dataset according to the elbow method? Use random_state=1 for k-means, and don't scale the data.

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q7

• 2
• 3
• 4
• 5

In this case, we are looking for the most pronounced bend (the "elbow") in the plot of the sum of distances to centroids vs. the number of clusters. Consider cluster counts from 2 to 10. Use random_state=1 when initializing k-means.

In [ ]:
# Your code here

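A minimal elbow-curve sketch, run here on synthetic blobs as a stand-in (on the assignment, substitute boston.data):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the housing data: the loop itself is what matters
data, _ = make_blobs(n_samples=300, centers=4, random_state=1)

inertias = []
cluster_range = range(2, 11)
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=1).fit(data)
    # inertia_ is the sum of squared distances to the closest centroid
    inertias.append(kmeans.inertia_)

plt.plot(list(cluster_range), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances to centroids');
```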

Go back to the faces dataset (it is already scaled). Imagine that we do not know who is in each photo, but we do know that there are 12 different people. Let's compare the clustering results of four algorithms: k-means, agglomerative clustering, Affinity Propagation, and spectral clustering. Use the same parameters as at the end of this article, changing only the number of clusters to 12.

In [ ]:
# Your code here

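The comparison loop can be sketched like this on synthetic blobs (a stand-in for the PCA-transformed faces, where you would pass n_clusters=12 and compute the full set of metrics from the article; here only the Adjusted Rand Index is shown):

```python
from sklearn.cluster import (KMeans, AgglomerativeClustering,
                             AffinityPropagation, SpectralClustering)
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

data, labels_true = make_blobs(n_samples=200, centers=3, random_state=1)

algorithms = {
    'K-means': KMeans(n_clusters=3, random_state=1),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
    # Affinity Propagation chooses the number of clusters itself
    'Affinity Propagation': AffinityPropagation(),
    'Spectral': SpectralClustering(n_clusters=3, random_state=1),
}

ari = {}
for name, algo in algorithms.items():
    labels_pred = algo.fit_predict(data)
    ari[name] = adjusted_rand_score(labels_true, labels_pred)
    print('%s: ARI = %.3f' % (name, ari[name]))
```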

#### Question 8. Select all of the correct statements:

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q8

• Agglomerative clustering worked better than others by all metrics
• Clustering results are disappointing - there isn't a metric that exceeds 0.35
• Affinity Propagation worked better than Spectral clustering by all metrics
• Considering only 2 clusters (whether it is Serena Williams or not) and comparing clustering results with a binary vector, we can see that clustering algorithms work better, with some metrics exceeding 66%

Use the coordinates of the 12 "average" people's images that you got before. Draw a dendrogram for them. Use scipy.cluster.hierarchy and scipy.spatial.distance.pdist; take the parameter values from the corresponding example in the article.

In [ ]:
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Your code here

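The dendrogram step can be sketched as below, with random 2-D points and placeholder labels standing in for the 12 mean face projections and names:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Synthetic stand-in for the 12 mean 2-D face projections
rng = np.random.RandomState(1)
points = rng.randn(12, 2)
names = ['person_%d' % i for i in range(12)]  # placeholder labels

# Condensed pairwise distances -> Ward linkage -> dendrogram
distance_mat = pdist(points)
Z = hierarchy.linkage(distance_mat, method='ward')
plt.figure(figsize=(10, 5))
dn = hierarchy.dendrogram(Z, labels=names, color_threshold=0.5)
```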

#### Question 9. Look at the dendrogram and consider the step when just two clusters are left: Serena Williams vs. all. Who was the last person added to the "big" cluster?

For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q9

• Gerhard Schroeder
• Jean Chretien
• John Ashcroft
• Junichiro Koizumi