Author: Egor Polusmak. Translated and edited by Alena Sharlo, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

*This is a static version of a Jupyter notebook. You can also check out the latest version in the course repository, the corresponding interactive web-based Kaggle Notebook or a video lecture.*

First, we will set up our environment by importing all necessary libraries. We will also change the display settings to better show plots.

In [1]:

```
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings("ignore")
# Matplotlib forms basis for visualization in Python
import matplotlib.pyplot as plt
# We will use the Seaborn library
import seaborn as sns
sns.set()
# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'
# Increase the default plot size and set the color scheme
plt.rcParams["figure.figsize"] = 8, 5
plt.rcParams["image.cmap"] = "viridis"
import pandas as pd
```

Now, let's load the dataset that we will be using into a `DataFrame`. I have picked a dataset on video game sales and ratings from Kaggle Datasets. Some of the games in this dataset lack ratings, so let's keep only the examples that have all of their values present.

In [2]:

```
df = pd.read_csv("../input/video_games_sales.csv").dropna()
print(df.shape)
```
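
To make the effect of `dropna()` concrete, here is a minimal sketch on a made-up three-row table. The game names and values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Game B" is missing its critic score
toy = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C"],
        "Critic_Score": [76.0, np.nan, 82.0],
        "User_Score": [8.0, 8.3, 8.0],
    }
)

# With default arguments, dropna() drops every row that has at least one NaN
complete = toy.dropna()
print(toy.shape, "->", complete.shape)
```

Only rows are removed here; the columns stay the same.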

Next, print the summary of the `DataFrame` to check data types and to verify everything is non-null.

In [3]:

```
df.info()
```

We see that `pandas` has loaded some of the numerical features as `object` type. We will explicitly convert those columns into `float` and `int`.

In [4]:

```
df["User_Score"] = df["User_Score"].astype("float64")
df["Year_of_Release"] = df["Year_of_Release"].astype("int64")
df["User_Count"] = df["User_Count"].astype("int64")
df["Critic_Count"] = df["Critic_Count"].astype("int64")
```
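
The conversion can be tried in isolation. The sketch below uses a hypothetical column of numeric strings, forced to the `object` dtype to mimic what `df.info()` reports; it is an illustration, not a copy of the real data.

```python
import pandas as pd

# Hypothetical column: numbers stored as text, explicitly kept as `object`
scores = pd.Series(["8.0", "8.3", "6.6"], dtype="object", name="User_Score")
print(scores.dtype)

# astype() parses every value and raises a ValueError if one cannot be converted
scores_float = scores.astype("float64")
print(scores_float.dtype)
```

Once the column is numeric, arithmetic and aggregation such as `scores_float.mean()` work as expected.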

The resulting `DataFrame` contains 6825 examples and 16 columns. Let's look at the first few entries with the `head()` method to check that everything has been parsed correctly. To make it more convenient, I have listed only the variables that we will use in this notebook.

In [5]:

```
useful_cols = [
    "Name",
    "Platform",
    "Year_of_Release",
    "Genre",
    "Global_Sales",
    "Critic_Score",
    "Critic_Count",
    "User_Score",
    "User_Count",
    "Rating",
]
df[useful_cols].head()
```

Before we turn to Seaborn and Plotly, let's discuss the simplest and often most convenient way to visualize data from a `DataFrame`: using its own `plot()` method.

As an example, we will create a plot of video game sales by country and year. First, we will keep only the columns we need. Then, we will calculate the total sales by year and call the `plot()` method on the resulting `DataFrame`.

In [6]:

```
df[[x for x in df.columns if "Sales" in x] + ["Year_of_Release"]].groupby(
    "Year_of_Release"
).sum().plot();
```
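
The one-liner above chains three steps. A minimal sketch that spells them out on hypothetical toy data (the `Region_A_Sales` column and all values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; only the "*_Sales" naming pattern mirrors the dataset
toy = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C"],
        "Year_of_Release": [2006, 2006, 2008],
        "Region_A_Sales": [1.0, 2.0, 4.0],
        "Global_Sales": [3.0, 5.0, 9.0],
    }
)

# Step 1: keep every column whose name contains "Sales", plus the year
sales_cols = [col for col in toy.columns if "Sales" in col]
sales = toy[sales_cols + ["Year_of_Release"]]

# Step 2: sum each sales column within each release year
sales_by_year = sales.groupby("Year_of_Release").sum()
print(sales_by_year)

# Step 3 (in a notebook): sales_by_year.plot() draws one line per sales column
```

The grouped table has the year as its index, which is why `plot()` puts the year on the x-axis.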

Note that the implementation of the `plot()` method in `pandas` is based on `matplotlib`.

Using the `kind` parameter, you can change the type of the plot to, for example, a *bar chart*. `matplotlib` is generally quite flexible for customizing plots. You can change almost everything in the chart, but you may need to dig into the documentation to find the corresponding parameters. For example, the parameter `rot` is responsible for the rotation angle of the tick labels on the x-axis (for vertical plots):

In [7]:

```
df[[x for x in df.columns if "Sales" in x] + ["Year_of_Release"]].groupby(
    "Year_of_Release"
).sum().plot(kind="bar", rot=45);
```
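
Because `plot()` returns a `matplotlib` `Axes` object, settings that have no dedicated keyword can be adjusted afterwards. A small sketch, assuming hypothetical yearly totals in place of the real grouped table; the label and title text are made up:

```python
import pandas as pd

# Hypothetical yearly totals, standing in for the grouped sales table
totals = pd.DataFrame(
    {"Global_Sales": [8.0, 9.0, 5.5]},
    index=pd.Index([2006, 2007, 2008], name="Year_of_Release"),
)

# kind="bar" draws one bar per year; rot=45 tilts the x-axis tick labels
ax = totals.plot(kind="bar", rot=45)

# Anything else can be set on the returned Axes object
ax.set_ylabel("Total sales")
ax.set_title("Sales by release year")
```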

Now, let's move on to the `Seaborn` library. `seaborn` is essentially a higher-level API based on the `matplotlib` library. Among other things, it differs from the latter in that it comes with more sensible default settings for plotting. Simply adding `import seaborn as sns; sns.set()` to your code makes your plots look much nicer. This library also contains a set of complex visualization tools that would otherwise (i.e., when using bare `matplotlib`) require quite a large amount of code.

Let's take a look at the first of such complex plots, a *pairwise relationships plot*, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output.

In [8]:

```
# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(
    df[["Global_Sales", "Critic_Score", "Critic_Count", "User_Score", "User_Count"]]
);
```

As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.

It is also possible to plot a distribution of observations with `seaborn`'s `distplot()`. For example, let's look at the distribution of critics' ratings: `Critic_Score`. By default, the plot displays a histogram and the kernel density estimate. Note that `distplot()` has been deprecated since `seaborn` 0.11 in favor of `histplot()` and `displot()`.

In [9]:

```
%config InlineBackend.figure_format = 'retina'
sns.distplot(df["Critic_Score"]);
```

To look more closely at the relationship between two numerical variables, you can use a *joint plot*, which is a cross between a scatter plot and a histogram. Let's see how the `Critic_Score` and `User_Score` features are related.

In [10]:

```
sns.jointplot(x="Critic_Score", y="User_Score", data=df, kind="scatter");
```

Another useful type of plot is a *box plot*. Let's compare critics' ratings for the top 5 biggest gaming platforms.

In [11]:

```
top_platforms = (
    df["Platform"].value_counts().sort_values(ascending=False).head(5).index.values
)
sns.boxplot(
    y="Platform",
    x="Critic_Score",
    data=df[df["Platform"].isin(top_platforms)],
    orient="h",
);
```
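
The platform selection in the cell above combines `value_counts()` with `isin()`. A minimal sketch of that pattern on hypothetical labels, picking the top 2 instead of the top 5:

```python
import pandas as pd

# Hypothetical platform labels
platforms = pd.Series(["A", "B", "A", "C", "A", "B"], name="Platform")

# value_counts() sorts by descending frequency, so head(n) gives the n most common
top_two = platforms.value_counts().head(2).index

# isin() builds a boolean mask that keeps only rows from those platforms
mask = platforms.isin(top_two)
print(platforms[mask].tolist())
```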

It is worth spending a bit more time to discuss how to interpret a box plot. Its components are a *box* (obviously, this is why it is called a *box plot*), the so-called *whiskers*, and a number of individual points (*outliers*).

The box by itself illustrates the interquartile spread of the distribution; its length is determined by the 25th ($\text{Q1}$) and 75th ($\text{Q3}$) percentiles. The vertical line inside the box marks the median ($50\%$) of the distribution.

The whiskers are the lines extending from the box. They show the spread of the bulk of the data: each whisker ends at the most extreme data point that still falls within the interval $(\text{Q1} - 1.5 \cdot \text{IQR}, \text{Q3} + 1.5 \cdot \text{IQR})$, where $\text{IQR} = \text{Q3} - \text{Q1}$ is the interquartile range.

Outliers that fall out of the range bounded by the whiskers are plotted individually.
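
The quantities described above can be computed by hand. A minimal sketch with `numpy`, assuming a hypothetical set of scores in which one value (15) sits far below the rest:

```python
import numpy as np

# Hypothetical scores; 15 is deliberately far below the others
scores = np.array([15, 60, 62, 65, 68, 70, 71, 73, 75, 78, 80, 85])

# Edges of the box and the line inside it
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Interval outside of which a point is drawn as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_inside = (scores > lower_bound) & (scores < upper_bound)
outliers = scores[~is_inside]

# Each whisker ends at the most extreme point that is still inside the interval
whisker_low = scores[is_inside].min()
whisker_high = scores[is_inside].max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Note that the whiskers stop at actual data points (60 and 85 here), not at the interval bounds themselves.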

The last type of plot that we will cover here is a *heat map*. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let's visualize the total sales of games by genre and gaming platform.

In [12]:

```
platform_genre_sales = (
    df.pivot_table(
        index="Platform", columns="Genre", values="Global_Sales", aggfunc="sum"
    )
    .fillna(0)
    .astype(float)
)
sns.heatmap(platform_genre_sales, annot=True, fmt=".1f", linewidths=0.5);
```
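
The table passed to `heatmap()` is easier to understand on a small example. The sketch below builds the same kind of pivot table from hypothetical data in which platform "B" has no "Sports" games:

```python
import pandas as pd

# Hypothetical toy data: there is no "Sports" game for platform "B"
toy = pd.DataFrame(
    {
        "Platform": ["A", "A", "A", "B"],
        "Genre": ["Action", "Action", "Sports", "Action"],
        "Global_Sales": [1.0, 2.5, 4.0, 3.0],
    }
)

# One row per platform, one column per genre; each cell holds the summed sales
table = toy.pivot_table(
    index="Platform", columns="Genre", values="Global_Sales", aggfunc="sum"
)

# Combinations that never occur come out as NaN; fillna(0) turns them into zeros
table = table.fillna(0)
print(table)
```

Each cell of this table becomes one colored square in the heat map, which is why the missing combinations need to be filled in before plotting annotated values.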