Author: Yury Kashnitskiy (@yorko). Translated and edited by Egor Polusmak, Anastasia Manokhina, Eugene Mashkin, and Yuanyuan Pao. This material is subject to the terms and conditions of the license Creative Commons CC BY-NC-SA 4.0. Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

In this assignment, we are going to walk through `sklearn` built-in implementations of dimensionality reduction and clustering methods. Answers should be submitted using this web-form.

First import all required modules:

In [1]:

```
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns; sns.set(style='white')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn import datasets
```

Use the given toy data set:

In [2]:

```
X = np.array([[2., 13.], [1., 3.], [6., 19.],
              [7., 18.], [5., 17.], [4., 9.],
              [5., 22.], [6., 11.], [8., 25.]])
```

In [3]:

```
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$');
```

Scale the data using `StandardScaler`.

**Question 1.** What is the angle (in degrees) between the $x_1$ axis and the first principal component of the scaled data? (A sketch of the mechanics follows the answer options below.)

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q1*

- 30 degrees
- 45 degrees
- 60 degrees
- 75 degrees
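If you are unsure where to start, here is a minimal sketch of the mechanics (the variable names are ours, and it is not a complete solution): scale the data, fit a one-component PCA, and read the rotation angle off the first principal axis.

In [ ]:

```
# A sketch: X is the toy dataset defined above
X_scaled = StandardScaler().fit_transform(X)

pca_toy = PCA(n_components=1)
pca_toy.fit(X_scaled)

# pca_toy.components_[0] is the direction vector of the first principal
# component; arctan2 gives its angle with the x1 axis in radians
angle = np.degrees(np.arctan2(pca_toy.components_[0, 1],
                              pca_toy.components_[0, 0]))
print('Angle with the x1 axis: %.0f degrees' % abs(angle))
```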

In [ ]:

```
# Your code here
```

In [ ]:

```
# Your code here
```

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q3*

- their squares tell what part of the initial data's variance is explained by principal components
- they define a rotation angle between the first principal component and the initial axis
- those numbers tell what part of the initial data's variance is explained by principal components
- the square roots of those numbers define a rotation angle between the first principal component and the initial axis

Let's load a dataset of people's faces and output their names (this step requires a stable, fast internet connection).

In [4]:

```
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=50,
                                       resize=0.4,
                                       data_home='../../data/faces')
print('%d objects, %d features, %d classes' % (lfw_people.data.shape[0],
                                               lfw_people.data.shape[1],
                                               len(lfw_people.target_names)))
print('\nPersons:')
for name in lfw_people.target_names:
    print(name)
```

Let's look at some faces. All images are stored in a handy `lfw_people.images` array.

In [5]:

```
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap='gray')
```

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q4*

- 75
- 76
- 77
- 78

For this task, you should use the `svd_solver='randomized'` parameter: it computes an approximate PCA but significantly speeds things up on large data sets. Use a fixed `random_state=1` for comparable results.
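As a hint, here is a minimal sketch of how the cumulative explained variance can be inspected; scaling the faces with `StandardScaler`, capping the fit at 200 components, and the 0.9 threshold are all assumptions on our part.

In [ ]:

```
# A sketch: scale the face data and fit a randomized PCA
X_faces = StandardScaler().fit_transform(lfw_people.data)
pca_faces = PCA(n_components=200, svd_solver='randomized', random_state=1)
pca_faces.fit(X_faces)

# Cumulative share of variance explained by the first k components
cum_var = np.cumsum(pca_faces.explained_variance_ratio_)
threshold = 0.9  # an assumed threshold; substitute the one you need
print(np.argmax(cum_var >= threshold) + 1)
```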

In [ ]:

```
# Your code here
```

Print a picture showing the first 30 principal components (don't be scared when you see the results). In order to create it, take 30 vectors from `pca.components_`, reshape each one to its initial size (50 x 37), and display them. Specify `cmap='binary'`.
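A minimal sketch of such a plot, assuming the `pca_faces` object fitted in the sketch above:

In [ ]:

```
# A sketch: reshape each of the first 30 components back to the
# 50 x 37 image grid and display them on a 5 x 6 panel
fig = plt.figure(figsize=(10, 8))
for i in range(30):
    ax = fig.add_subplot(5, 6, i + 1, xticks=[], yticks=[])
    ax.imshow(pca_faces.components_[i].reshape(50, 37), cmap='binary')
```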

In [ ]:

```
# Your code here
```

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q5*

- 1
- 2
- 4
- 5

Now let's create a projection of the faces onto the space of the first two principal components.

To answer this question, take the first two principal components of the scaled data and compute the mean values of these two components for each person over all of their images in the dataset (again, use both `svd_solver='randomized'` and `random_state=1`). Then, among the resulting 12 two-dimensional points, find the one that is farthest from the others (by Euclidean distance). You can do this either precisely or approximately, using `sklearn.metrics.euclidean_distances` and `seaborn.heatmap`.
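A minimal sketch of this computation, assuming `X_faces` from the sketch above and reading "largest distance from the others" as the largest sum of pairwise distances (our interpretation):

In [ ]:

```
# A sketch: project the faces onto the first two principal components
pca2 = PCA(n_components=2, svd_solver='randomized', random_state=1)
faces_2d = pca2.fit_transform(X_faces)

# Mean 2D coordinates per person, in the order of target_names
mean_coords = np.array([faces_2d[lfw_people.target == i].mean(axis=0)
                        for i in range(len(lfw_people.target_names))])

# Pairwise Euclidean distances between the 12 "average" people
dists = metrics.euclidean_distances(mean_coords)
sns.heatmap(dists, xticklabels=lfw_people.target_names,
            yticklabels=lfw_people.target_names)
print(lfw_people.target_names[dists.sum(axis=1).argmax()])
```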

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q6*

- Colin Powell
- George W Bush
- Jacques Chirac
- Serena Williams

In [ ]:

```
# Your code here
```

For the next question, load the housing prices dataset (note that `load_boston` was removed in scikit-learn 1.2, so this step requires an older scikit-learn version):

In [ ]:

```
boston = datasets.load_boston()
X = boston.data
```

Using the elbow method (see article 7 of the course), find the optimal number of clusters to set as a hyperparameter for the k-means algorithm. Use `random_state=1` in the k-means method, and don't scale the data.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q7*

- 2
- 3
- 4
- 5

In this case, we are looking for the most pronounced kink in the `Cluster number vs Centroid distances` curve. Consider the number of clusters from 2 to 10. Use `random_state=1` for the k-means algorithm initialization.
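A minimal sketch of the elbow plot, using k-means inertia (the sum of squared distances to the closest centroid) as one reading of "centroid distances":

In [ ]:

```
# A sketch: k-means inertia for 2 to 10 clusters on the unscaled data
inertias = []
cluster_range = range(2, 11)
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=1).fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(list(cluster_range), inertias, marker='o')
plt.xlabel('Cluster number')
plt.ylabel('Centroid distances (inertia)');
```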

In [ ]:

```
# Your code here
```

Go back to the faces dataset (which is already scaled). Imagine that we did not know who was in each photo, only that there were 12 different people. Let's compare the clustering results from 4 algorithms: k-means, Agglomerative clustering, Affinity Propagation, and Spectral clustering. Use the same respective parameters as at the end of this article, changing only the number of clusters to 12.
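A minimal sketch of such a comparison, scored with Adjusted Rand Index and Adjusted Mutual Information; the parameter choices below are our assumptions, so check the article for the exact ones.

In [ ]:

```
# A sketch: fit each algorithm on the scaled faces and score the labels
from sklearn.cluster import AffinityPropagation, SpectralClustering

algorithms = {
    'KMeans': KMeans(n_clusters=12, random_state=1),
    'Agglomerative': AgglomerativeClustering(n_clusters=12),
    'AffinityPropagation': AffinityPropagation(),
    'Spectral': SpectralClustering(n_clusters=12, random_state=1)
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X_faces)
    print('%s: ARI=%.3f, AMI=%.3f' % (
        name,
        metrics.adjusted_rand_score(lfw_people.target, labels),
        metrics.adjusted_mutual_info_score(lfw_people.target, labels)))
```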

In [ ]:

```
# Your code here
```

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q8*

- Agglomerative clustering worked better than others by all metrics
- Clustering results are disappointing - there isn't a metric that exceeds 0.35
- Affinity Propagation worked better than Spectral clustering by all metrics
- Considering only 2 clusters (whether it is Serena Williams or not) and comparing clustering results with a binary vector, we can see that clustering algorithms work better, with some metrics exceeding 66%

Use the coordinates of the 12 "average" people's images you got before and draw a dendrogram for them. Use `scipy.cluster.hierarchy` and `scipy.spatial.distance.pdist`; take the parameter values from the appropriate example in the article.

In [ ]:

```
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
# Your code here
```
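A minimal sketch, assuming `mean_coords` holds the 12 averaged points from the PCA step above; the `'single'` linkage is our guess at the article's example, so double-check the parameters there.

In [ ]:

```
# A sketch: single-linkage hierarchical clustering of the 12 points
distance_mat = pdist(mean_coords)
Z = hierarchy.linkage(distance_mat, 'single')
plt.figure(figsize=(10, 5))
dn = hierarchy.dendrogram(Z, labels=list(lfw_people.target_names))
```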

*For discussions, please stick to ODS Slack, channel #mlcourse_ai, pinned thread #a7_q9*

- Gerhard Schroeder
- Jean Chretien
- John Ashcroft
- Junichiro Koizumi