Visual data analysis in Python. Part 2. Overview of Seaborn, Matplotlib and Plotly libraries

Visual data analysis in Python. Part 2. Overview of Seaborn, Matplotlib and Plotly libraries #

mlcourse.ai – Open Machine Learning Course

Author: Egor Polusmak. Translated and edited by Alena Sharlo, Yury Kashnitsky, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

1. Dataset #

First, we will set up our environment by importing all necessary libraries. We will also change the display settings to better show plots.

# Matplotlib forms basis for visualization in Python
import matplotlib.pyplot as plt

# We will use the Seaborn library
import seaborn as sns

sns.set()

# Graphics in SVG format are more sharp and legible
%config InlineBackend.figure_format = 'svg'

# Increase the default plot size and set the color scheme
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["image.cmap"] = "viridis"
import pandas as pd

Now, let’s load the dataset that we will be using into a DataFrame. I have picked a dataset on video game sales and ratings from Kaggle Datasets. Some of the games in this dataset lack ratings; so, let’s filter for only those examples that have all of their values present.

# for Jupyter-book, we copy data from GitHub, locally, to save Internet traffic,
# you can specify the data/ folder from the root of your cloned
# https://github.com/Yorko/mlcourse.ai repo, to save Internet traffic
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"

df = pd.read_csv(DATA_URL + "video_games_sales.csv").dropna()
print(df.shape)

(6825, 16)

Next, print the summary of the DataFrame to check data types and to verify everything is non-null.

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6825 entries, 0 to 16706
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             6825 non-null   object 
 1   Platform         6825 non-null   object 
 2   Year_of_Release  6825 non-null   float64
 3   Genre            6825 non-null   object 
 4   Publisher        6825 non-null   object 
 5   NA_Sales         6825 non-null   float64
 6   EU_Sales         6825 non-null   float64
 7   JP_Sales         6825 non-null   float64
 8   Other_Sales      6825 non-null   float64
 9   Global_Sales     6825 non-null   float64
 10  Critic_Score     6825 non-null   float64
 11  Critic_Count     6825 non-null   float64
 12  User_Score       6825 non-null   object 
 13  User_Count       6825 non-null   float64
 14  Developer        6825 non-null   object 
 15  Rating           6825 non-null   object 
dtypes: float64(9), object(7)
memory usage: 906.4+ KB

We see that pandas has loaded some of the numerical features as object type. We will explicitly convert those columns into float and int.

df["User_Score"] = df["User_Score"].astype("float64")
df["Year_of_Release"] = df["Year_of_Release"].astype("int64")
df["User_Count"] = df["User_Count"].astype("int64")
df["Critic_Count"] = df["Critic_Count"].astype("int64")

The resulting DataFrame contains 6825 examples and 16 columns. Let’s look at the first few entries with the head() method to check that everything has been parsed correctly. To make it more convenient, I have listed only the variables that we will use in this notebook.

useful_cols = [
    "Name",
    "Platform",
    "Year_of_Release",
    "Genre",
    "Global_Sales",
    "Critic_Score",
    "Critic_Count",
    "User_Score",
    "User_Count",
    "Rating",
]
df[useful_cols].head()

	Name	Platform	Year_of_Release	Genre	Global_Sales	Critic_Score	Critic_Count	User_Score	User_Count	Rating
0	Wii Sports	Wii	2006	Sports	82.53	76.0	51	8.0	322	E
2	Mario Kart Wii	Wii	2008	Racing	35.52	82.0	73	8.3	709	E
3	Wii Sports Resort	Wii	2009	Sports	32.77	80.0	73	8.0	192	E
6	New Super Mario Bros.	DS	2006	Platform	29.80	89.0	65	8.5	431	E
7	Wii Play	Wii	2006	Misc	28.92	58.0	41	6.6	129	E

2. DataFrame.plot()#

Before we turn to Seaborn and Plotly, let’s discuss the simplest and often most convenient way to visualize data from a DataFrame: using its own plot() method.

As an example, we will create a plot of video game sales by country and year. First, let’s keep only the columns we need. Then, we will calculate the total sales by year and call the plot() method on the resulting DataFrame.

df[[x for x in df.columns if "Sales" in x] + ["Year_of_Release"]].groupby(
    "Year_of_Release"
).sum().plot();

../../_images/fc8d7af338dd6166d9ccd8031f2f72c3e24ced2101271e0fe806717884172d7c.svg

Note that the implementation of the plot() method in pandas is based on matplotlib.

Using the kind parameter, you can change the type of the plot to, for example, a bar chart. matplotlib is generally quite flexible for customizing plots. You can change almost everything in the chart, but you may need to dig into the documentation to find the corresponding parameters. For example, the parameter rot is responsible for the rotation angle of ticks on the x-axis (for vertical plots):

df[[x for x in df.columns if "Sales" in x] + ["Year_of_Release"]].groupby(
    "Year_of_Release"
).sum().plot(kind="bar", rot=45);

../../_images/b71da059a89d4bfa2a1eb6faa18b4287500e7e88c5579df60f9524ca275ec7cd.svg

3. Seaborn #

Now, let’s move on to the Seaborn library. seaborn is essentially a higher-level API based on the matplotlib library. Among other things, it differs from the latter in that it contains more adequate default settings for plotting. By adding import seaborn as sns; sns.set() in your code, the images of your plots will become much nicer. Also, this library contains a set of complex tools for visualization that would otherwise (i.e. when using bare matplotlib) require quite a large amount of code.

pairplot()#

Let’s take a look at the first of such complex plots, a pairwise relationships plot, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output.

# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(
    df[["Global_Sales", "Critic_Score", "Critic_Count", "User_Score", "User_Count"]]
);

../../_images/cf958387ff69eaa3b8b3f1c328feee36712cf5e5a1a68f89f11f0b06066ff45c.png

As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.

histplot()#

It is also possible to plot a distribution of observations with seaborn’s histplot(). For example, let’s look at the distribution of critics’ ratings: Critic_Score.

%config InlineBackend.figure_format = 'svg'
sns.histplot(df["Critic_Score"], kde=True, stat="density");

../../_images/93a90038f065317a10ca6469dfa27b4316e9b718c89e77d84d040d510e759722.svg

jointplot()#

To look more closely at the relationship between two numerical variables, you can use joint plot, which is a cross between a scatter plot and histogram. Let’s see how the Critic_Score and User_Score features are related.

sns.jointplot(x="Critic_Score", y="User_Score", data=df, kind="scatter");

../../_images/091dc0c9b9c7dc49e7af166d3af105b07787b8c40a67ecbe8147975572f074eb.svg

boxplot()#

Another useful type of plot is a box plot. Let’s compare critics’ ratings for the top 5 biggest gaming platforms.

top_platforms = (
    df["Platform"].value_counts().sort_values(ascending=False).head(5).index.values
)
sns.boxplot(
    y="Platform",
    x="Critic_Score",
    data=df[df["Platform"].isin(top_platforms)],
    orient="h",
);

../../_images/a40721b3c51fb33c7cfa3b87fb54a5f8c447214eea36eb45156f53450eb06328.svg

It is worth spending a bit more time to discuss how to interpret a box plot. Its components are a box (obviously, this is why it is called a box plot), the so-called whiskers, and a number of individual points (outliers).

The box by itself illustrates the interquartile spread of the distribution; its length determined by the \(25\% \, (\text{Q1})\) and \(75\% \, (\text{Q3})\) percentiles. The vertical line inside the box marks the median (\(50\%\)) of the distribution.

The whiskers are the lines extending from the box. They represent the entire scatter of data points, specifically the points that fall within the interval \((\text{Q1} - 1.5 \cdot \text{IQR}, \text{Q3} + 1.5 \cdot \text{IQR})\), where \(\text{IQR} = \text{Q3} - \text{Q1}\) is the interquartile range.

Outliers that fall out of the range bounded by the whiskers are plotted individually.

heatmap()#

The last type of plot that we will cover here is a heat map. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let’s visualize the total sales of games by genre and gaming platform.

platform_genre_sales = (
    df.pivot_table(
        index="Platform", columns="Genre", values="Global_Sales", aggfunc="sum"
    )
    .fillna(0)
    .map(float)
)
sns.heatmap(platform_genre_sales, annot=True, fmt=".1f", linewidths=0.5);

../../_images/3fe4435f9dbb1200aa577426214d41013d66f669b19c7f10afaddf6b08261c1b.svg

4. Plotly #

We have examined some visualization tools based on the matplotlib library. However, this is not the only option for plotting in Python. Let’s take a look at the plotly library. Plotly is an open-source library that allows creation of interactive plots within a Jupyter notebook without having to use Javascript.

The real beauty of interactive plots is that they provide a user interface for detailed data exploration. For example, you can see exact numerical values by mousing over points, hide uninteresting series from the visualization, zoom in onto a specific part of the plot, etc.

Before we start, let’s import all the necessary modules and initialize plotly by calling the init_notebook_mode() function.

import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
from IPython.display import display, IFrame

init_notebook_mode(connected=True)

def plotly_depict_figure_as_iframe(fig, title="", width=800, height=500,
  plot_path='../../_static/plotly_htmls/'):
  """
  This is a helper method to visualizae PLotly plots as Iframes in a Jupyter book.
  If you are running `jupyter-notebook`, you can just use iplot(fig).
  """

  # in a Jupyter Notebook, the following should work
  #iplot(fig, show_link=False)

  # in a Jupyter Book, we save a plot offline and then render it with IFrame
  fig_path_path = f"{plot_path}/{title}.html"
  plot(fig, filename=fig_path_path, show_link=False, auto_open=False);
  display(IFrame(fig_path_path, width=width, height=height))

Line plot #

First of all, let’s build a line plot showing the number of games released and their sales by year.

years_df = (
    df.groupby("Year_of_Release")[["Global_Sales"]]
    .sum()
    .join(df.groupby("Year_of_Release")[["Name"]].count())
)
years_df.columns = ["Global_Sales", "Number_of_Games"]

Figure is the main class and a work horse of visualization in plotly. It consists of the data (an array of lines called traces in this library) and the style (represented by the layout object). In the simplest case, you may call the iplot function to return only traces.

The show_link parameter toggles the visibility of the links leading to the online platform plot.ly in your charts. Most of the time, this functionality is not needed, so you may want to turn it off by passing show_link=False to prevent accidental clicks on those links.

# Create a line (trace) for the global sales
trace0 = go.Scatter(x=years_df.index, y=years_df["Global_Sales"], name="Global Sales")

# Create a line (trace) for the number of games released
trace1 = go.Scatter(
    x=years_df.index, y=years_df["Number_of_Games"], name="Number of games released"
)

# Define the data array
data = [trace0, trace1]

# Set the title
layout = {"title": "Statistics for video games"}

# Create a Figure and plot it
fig = go.Figure(data=data, layout=layout)

# in a Jupyter Notebook, the following should work
#iplot(fig, show_link=False)

# in a Jupyter Book, we save a plot offline and then render it with IFrame
plotly_depict_figure_as_iframe(fig, title="topic2_part2_plot1")

As an option, you can save the plot in an html file:

# commented out as it produces a large in size file
#plotly.offline.plot(fig, filename="years_stats.html", show_link=False, auto_open=False);

Bar chart #

Let’s use a bar chart to compare the market share of different gaming platforms broken down by the number of new releases and by total revenue.

# Do calculations and prepare the dataset
platforms_df = (
    df.groupby("Platform")[["Global_Sales"]]
    .sum()
    .join(df.groupby("Platform")[["Name"]].count())
)
platforms_df.columns = ["Global_Sales", "Number_of_Games"]
platforms_df.sort_values("Global_Sales", ascending=False, inplace=True)

# Create a bar for the global sales
trace0 = go.Bar(
    x=platforms_df.index, y=platforms_df["Global_Sales"], name="Global Sales"
)

# Create a bar for the number of games released
trace1 = go.Bar(
    x=platforms_df.index,
    y=platforms_df["Number_of_Games"],
    name="Number of games released",
)

# Get together the data and style objects
data = [trace0, trace1]
layout = {"title": "Market share by gaming platform"}

# Create a `Figure` and plot it
fig = go.Figure(data=data, layout=layout)
# in a Jupyter Notebook, the following should work
#iplot(fig, show_link=False)

# in a Jupyter Book, we save a plot offline and then render it with IFrame
plotly_depict_figure_as_iframe(fig, title="topic2_part2_plot2")

Box plot #

plotly also supports box plots. Let’s consider the distribution of critics’ ratings by the genre of the game.

data = []

# Create a box trace for each genre in our dataset
for genre in df.Genre.unique():
    data.append(go.Box(y=df[df.Genre == genre].Critic_Score, name=genre))

# Visualize
# in a Jupyter Notebook, the following should work
#iplot(data, show_link=False)

# in a Jupyter Book, we save a plot offline and then render it with IFrame
plotly_depict_figure_as_iframe(data, title="topic2_part2_plot3")

Using plotly, you can also create other types of visualization. Even with default settings, the plots look quite nice. Additionally, the library makes it easy to modify various parameters: colors, fonts, captions, annotations, and so on.

5. Useful resources #

The same notebook as an interactive web-based Kaggle Kernel
“Plotly for interactive plots” - a tutorial by Alexander Kovalev within mlcourse.ai (full list of tutorials is here)
“Bring your plots to life with Matplotlib animations” - a tutorial by Kyriacos Kyriacou within mlcourse.ai
“Some details on Matplotlib” - a tutorial by Ivan Pisarev within mlcourse.ai
Main course site, course repo, and YouTube channel
Medium “story” based on this notebook
Course materials as a Kaggle Dataset
If you read Russian: an article on Habrahabr with ~ the same material. And a lecture on YouTube
Here is the official documentation for the libraries we used: matplotlib, seaborn and pandas.
The gallery of sample charts created with seaborn is a very good resource.
Also, see the documentation on Manifold Learning in scikit-learn.
Efficient t-SNE implementation Multi-core-TSNE.
“How to Use t-SNE Effectively”, Distill.pub.