mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitskiy. Translated and edited by Maxim Keremet, Artem Trunov, and Aditya Soni. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #2. Fall 2018
Exploratory Data Analysis (EDA) of US flights
(using Pandas, Matplotlib & Seaborn)

Prior to working on the assignment, you'd better check out the corresponding course material:

Your task is to:

  • write code and perform computations in the cells below
  • choose answers in the webform
  • submit answers with the very same email and name as in assignment 1. This is a part of the assignment, if you don't manage to do so, you won't get credits. If in doubt, you can re-submit A1 form till the deadline for A1, no problem

Deadline for A2: 2018 October 21, 20:59 CET

In [1]:
import numpy as np
import pandas as pd
# pip install seaborn 
import seaborn as sns
import matplotlib.pyplot as plt
  • Download the data archive (Archived ~ 114 Mb, unzipped - ~ 690 Mb). No need to unzip - pandas can unbzip on the fly.
  • Place it in the "../../data" folder, or change the path below according to your location.
  • The dataset has information about carriers and flights between US airports during the year 2008.
  • Column description is available here. Visit this site to find ex. meaning of flight cancellation codes.

Reading data into memory and creating a Pandas DataFrame object

(This may take a while, be patient)

We are not going to read in the whole dataset. In order to reduce memory footprint, we instead load only needed columns and cast them suitable data types.

In [2]:
dtype = {'DayOfWeek': np.uint8, 'DayofMonth': np.uint8, 'Month': np.uint8 , 'Cancelled': np.uint8, 
         'Year': np.uint16, 'FlightNum': np.uint16 , 'Distance': np.uint16, 
         'UniqueCarrier': str, 'CancellationCode': str, 'Origin': str, 'Dest': str,
         'ArrDelay': np.float16, 'DepDelay': np.float16, 'CarrierDelay': np.float16,
         'WeatherDelay': np.float16, 'NASDelay': np.float16, 'SecurityDelay': np.float16,
         'LateAircraftDelay': np.float16, 'DepTime': np.float16}
In [3]:
%%time
# change the path if needed
path = '../../data/2008.csv.bz2'
flights_df = pd.read_csv(path, usecols=dtype.keys(), dtype=dtype)
CPU times: user 26.6 s, sys: 311 ms, total: 26.9 s
Wall time: 25.6 s

Check the number of rows and columns and print column names.

In [4]:
print(flights_df.shape)
print(flights_df.columns)
(7009728, 19)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier',
       'FlightNum', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
       'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')

Print first 5 rows of the dataset.

In [5]:
flights_df.head()
Out[5]:
Year Month DayofMonth DayOfWeek DepTime UniqueCarrier FlightNum ArrDelay DepDelay Origin Dest Distance Cancelled CancellationCode CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
0 2008 1 3 4 2003.0 WN 335 -14.0 8.0 IAD TPA 810 0 NaN NaN NaN NaN NaN NaN
1 2008 1 3 4 754.0 WN 3231 2.0 19.0 IAD TPA 810 0 NaN NaN NaN NaN NaN NaN
2 2008 1 3 4 628.0 WN 448 14.0 8.0 IND BWI 515 0 NaN NaN NaN NaN NaN NaN
3 2008 1 3 4 926.0 WN 1746 -6.0 -4.0 IND BWI 515 0 NaN NaN NaN NaN NaN NaN
4 2008 1 3 4 1829.0 WN 3920 34.0 34.0 IND BWI 515 0 NaN 2.0 0.0 0.0 0.0 32.0

Transpose the frame to see all features at once.

In [6]:
flights_df.head().T
Out[6]:
0 1 2 3 4
Year 2008 2008 2008 2008 2008
Month 1 1 1 1 1
DayofMonth 3 3 3 3 3
DayOfWeek 4 4 4 4 4
DepTime 2003 754 628 926 1829
UniqueCarrier WN WN WN WN WN
FlightNum 335 3231 448 1746 3920
ArrDelay -14 2 14 -6 34
DepDelay 8 19 8 -4 34
Origin IAD IAD IND IND IND
Dest TPA TPA BWI BWI BWI
Distance 810 810 515 515 515
Cancelled 0 0 0 0 0
CancellationCode NaN NaN NaN NaN NaN
CarrierDelay NaN NaN NaN NaN 2
WeatherDelay NaN NaN NaN NaN 0
NASDelay NaN NaN NaN NaN 0
SecurityDelay NaN NaN NaN NaN 0
LateAircraftDelay NaN NaN NaN NaN 32

Examine data types of all features and total dataframe size in memory.

In [7]:
flights_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009728 entries, 0 to 7009727
Data columns (total 19 columns):
Year                 uint16
Month                uint8
DayofMonth           uint8
DayOfWeek            uint8
DepTime              float16
UniqueCarrier        object
FlightNum            uint16
ArrDelay             float16
DepDelay             float16
Origin               object
Dest                 object
Distance             uint16
Cancelled            uint8
CancellationCode     object
CarrierDelay         float16
WeatherDelay         float16
NASDelay             float16
SecurityDelay        float16
LateAircraftDelay    float16
dtypes: float16(8), object(4), uint16(3), uint8(4)
memory usage: 387.7+ MB

Get basic statistics of each feature.

In [8]:
flights_df.describe().T
Out[8]:
count mean std min 25% 50% 75% max
Year 7009728.0 2008.000000 0.000000 2008.0 2008.0 2008.0 2008.0 2008.0
Month 7009728.0 6.375130 3.406737 1.0 3.0 6.0 9.0 12.0
DayofMonth 7009728.0 15.728015 8.797068 1.0 8.0 16.0 23.0 31.0
DayOfWeek 7009728.0 3.924182 1.988259 1.0 2.0 4.0 6.0 7.0
DepTime 6873482.0 NaN NaN 1.0 928.0 1325.0 1728.0 2400.0
FlightNum 7009728.0 2224.200105 1961.715999 1.0 622.0 1571.0 3518.0 9743.0
ArrDelay 6855029.0 NaN NaN -519.0 -10.0 -2.0 12.0 2460.0
DepDelay 6873482.0 NaN NaN -534.0 -4.0 -1.0 8.0 2468.0
Distance 7009728.0 726.387029 562.101803 11.0 325.0 581.0 954.0 4962.0
Cancelled 7009728.0 0.019606 0.138643 0.0 0.0 0.0 0.0 1.0
CarrierDelay 1524735.0 NaN NaN 0.0 0.0 0.0 16.0 2436.0
WeatherDelay 1524735.0 NaN NaN 0.0 0.0 0.0 0.0 1352.0
NASDelay 1524735.0 NaN NaN 0.0 0.0 6.0 21.0 1357.0
SecurityDelay 1524735.0 NaN NaN 0.0 0.0 0.0 0.0 392.0
LateAircraftDelay 1524735.0 NaN NaN 0.0 0.0 0.0 26.0 1316.0

Count unique Carriers and plot their relative share of flights:

In [9]:
flights_df['UniqueCarrier'].nunique()
Out[9]:
20
In [10]:
flights_df.groupby('UniqueCarrier').size().plot(kind='bar');

We can also group by category/categories in order to calculate different aggregated statistics.

For example, finding top-3 flight codes, that have the largest total distance travelled in year 2008.

In [11]:
flights_df.groupby(['UniqueCarrier','FlightNum'])['Distance'].sum().sort_values(ascending=False).iloc[:3]
Out[11]:
UniqueCarrier  FlightNum
CO             15           1796244.0
               14           1796244.0
UA             52           1789722.0
Name: Distance, dtype: float64

Another way:

In [12]:
flights_df.groupby(['UniqueCarrier','FlightNum'])\
  .agg({'Distance': [np.mean, np.sum, 'count'],
        'Cancelled': np.sum})\
  .sort_values(('Distance', 'sum'), ascending=False)\
  .iloc[0:3]
Out[12]:
Distance Cancelled
mean sum count sum
UniqueCarrier FlightNum
CO 15 4962.000000 1796244.0 362 0
14 4962.000000 1796244.0 362 0
UA 52 2465.181818 1789722.0 726 8

Number of flights by days of week and months:

In [13]:
pd.crosstab(flights_df.Month, flights_df.DayOfWeek)
Out[13]:
DayOfWeek 1 2 3 4 5 6 7
Month
1 80807 97298 100080 102043 81940 67178 76419
2 81504 79700 80587 82158 102726 66462 76099
3 103210 81159 82307 82831 82936 86153 97494
4 82463 100785 102586 82799 82964 68304 78225
5 80626 79884 81264 102572 102878 84493 74576
6 104168 82160 82902 83617 83930 72322 99566
7 84095 103429 103315 105035 79349 72219 80489
8 82983 80895 81773 82625 103878 86155 93970
9 94300 91533 74057 75589 75881 58343 71205
10 75131 72195 91900 94123 93894 58168 70794
11 74214 72443 73653 68071 70484 76031 88376
12 92700 90568 85241 70761 74306 61708 69674

It can also be handy to color such tables in order to easily notice outliers:

In [14]:
plt.imshow(pd.crosstab(flights_df.Month, flights_df.DayOfWeek),
           cmap='seismic', interpolation='none');

Flight distance histogram:

In [15]:
flights_df.hist('Distance', bins=20);

Making a histogram of flight frequency by date.

In [16]:
flights_df['Date'] = pd.to_datetime(flights_df.rename(columns={'DayofMonth': 'Day'})[['Year', 'Month', 'Day']])
In [17]:
num_flights_by_date = flights_df.groupby('Date').size()
In [18]:
num_flights_by_date.plot();

Do you see a weekly pattern above? And below?

In [19]:
num_flights_by_date.rolling(window=7).mean().plot();

1. Find top-10 carriers in terms of the number of completed flights (UniqueCarrier column)?

Which of the listed below is not in your top-10 list?

  • DL
  • AA
  • OO
  • EV
In [16]:
# You code here

2. Plot distributions of flight cancellation reasons (CancellationCode).

What is the most frequent reason for flight cancellation? (Use this link to translate codes into reasons)

  • carrier
  • weather conditions
  • National Air System
  • security reasons
In [17]:
# You code here

3. Which route is the most frequent, in terms of the number of flights?

(Take a look at 'Origin' and 'Dest' features. Consider A->B and B->A directions as different routes)

  • New-York – Washington
  • San-Francisco – Los-Angeles
  • San-Jose – Dallas
  • New-York – San-Francisco
In [18]:
# You code here

4. Find top-5 delayed routes (count how many times they were delayed on departure). From all flights on these 5 routes, count all flights with weather conditions contributing to a delay.

  • 449
  • 539
  • 549
  • 668
In [15]:
# You code here

5. Examine the hourly distribution of departure times. For that, create a new series from DepTime, removing missing values.

Choose all correct statements:

  • Flights are normally distributed within time interval [0-23] (Search for: Normal distribution, bell curve).
  • Flights are uniformly distributed within time interval [0-23].
  • In the period from 0 am to 4 am there are considerably less flights than from 7 pm to 8 pm.
In [19]:
# You code here

6. Show how the number of flights changes through time (on the daily/weekly/monthly basis) and interpret the findings.

Choose all correct statements:

  • The number of flights during weekends is less than during weekdays (working days).
  • The lowest number of flights is on Sunday.
  • There are less flights during winter than during summer.
In [20]:
# You code here

7. Examine the distribution of cancellation reasons with time. Make a bar plot of cancellation reasons aggregated by months.

Choose all correct statements:

  • December has the highest rate of cancellations due to weather.
  • The highest rate of cancellations in September is due to Security reasons.
  • April's top cancellation reason is carriers.
  • Flights cancellations due to National Air System are more frequent than those due to carriers.
In [21]:
# You code here

8. Which month has the greatest number of cancellations due to Carrier?

  • May
  • January
  • September
  • April
In [22]:
# You code here

9. Identify the carrier with the greatest number of cancellations due to carrier in the corresponding month from the previous question.

  • 9E
  • EV
  • HA
  • AA
In [23]:
# You code here

10. Examine median arrival and departure delays (in time) by carrier. Which carrier has the lowest median delay time for both arrivals and departures? Leave only non-negative values of delay times ('ArrDelay', 'DepDelay'). Boxplots can be helpful in this exercise, as well as it might be a good idea to remove outliers in order to build nice graphs. You can exclude delay time values higher than a corresponding .95 percentile.

  • EV
  • OO
  • AA
  • AQ
In [25]:
# You code here