Assignment #1 (demo). Exploratory data analysis with Pandas. Solution#
Author: Yury Kashnitsky. Translated and edited by Sergey Isaev, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.
Same assignment as a Kaggle Kernel + solution.
In this task you should use Pandas to answer a few questions about the Adult dataset. (You don’t have to download the data – it’s already in the repository). Choose the answers in the web-form.
Unique values of features (for more information please see the link above):
age: continuous;workclass:Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked;fnlwgt: continuous;education:Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool;education-num: continuous;marital-status:Married-civ-spouse,Divorced,Never-married,Separated,Widowed,Married-spouse-absent,Married-AF-spouse,occupation:Tech-support,Craft-repair,Other-service,Sales,Exec-managerial,Prof-specialty,Handlers-cleaners,Machine-op-inspct,Adm-clerical,Farming-fishing,Transport-moving,Priv-house-serv,Protective-serv,Armed-Forces;relationship:Wife,Own-child,Husband,Not-in-family,Other-relative,Unmarried;race:White,Asian-Pac-Islander,Amer-Indian-Eskimo,Other,Black;sex:Female,Male;capital-gain: continuous.capital-loss: continuous.hours-per-week: continuous.native-country:United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands;salary:>50K,<=50K.
import numpy as np
import pandas as pd
pd.set_option("display.max.columns", 100)
# to draw pictures in jupyter notebook
%matplotlib inline
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings("ignore")
# for Jupyter-book, we copy data from GitHub, locally, to save Internet traffic,
# you can specify the data/ folder from the root of your cloned
# https://github.com/Yorko/mlcourse.ai repo, to save Internet traffic
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
data = pd.read_csv(DATA_URL + "adult.data.csv")
data.head()
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
1. How many men and women (sex feature) are represented in this dataset?
data["sex"].value_counts()
sex
Male 21790
Female 10771
Name: count, dtype: int64
2. What is the average age (age feature) of women?
data[data["sex"] == "Female"]["age"].mean()
np.float64(36.85823043357163)
3. What is the proportion of German citizens (native-country feature)?
float((data["native-country"] == "Germany").sum()) / data.shape[0]
0.004207487485028101
**4-5. What are the mean value and standard deviation of the age of those who receive more than 50K per year (salary feature) and those who receive less than 50K per year? **
ages1 = data[data["salary"] == ">50K"]["age"]
ages2 = data[data["salary"] == "<=50K"]["age"]
print(
"The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.".format(
round(ages1.mean()),
round(ages1.std(), 1),
round(ages2.mean()),
round(ages2.std(), 1),
)
)
The average age of the rich: 44 +- 10.5 years, poor - 37 +- 14.0 years.
6. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)
data[data["salary"] == ">50K"]["education"].unique() # No
array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
'10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.
for (race, sex), sub_df in data.groupby(["race", "sex"]):
print("Race: {0}, sex: {1}".format(race, sex))
print(sub_df["age"].describe())
Race: Amer-Indian-Eskimo, sex: Female
count 119.000000
mean 37.117647
std 13.114991
min 17.000000
25% 27.000000
50% 36.000000
75% 46.000000
max 80.000000
Name: age, dtype: float64
Race: Amer-Indian-Eskimo, sex: Male
count 192.000000
mean 37.208333
std 12.049563
min 17.000000
25% 28.000000
50% 35.000000
75% 45.000000
max 82.000000
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Female
count 346.000000
mean 35.089595
std 12.300845
min 17.000000
25% 25.000000
50% 33.000000
75% 43.750000
max 75.000000
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Male
count 693.000000
mean 39.073593
std 12.883944
min 18.000000
25% 29.000000
50% 37.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: Black, sex: Female
count 1555.000000
mean 37.854019
std 12.637197
min 17.000000
25% 28.000000
50% 37.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: Black, sex: Male
count 1569.000000
mean 37.682600
std 12.882612
min 17.000000
25% 27.000000
50% 36.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: Other, sex: Female
count 109.000000
mean 31.678899
std 11.631599
min 17.000000
25% 23.000000
50% 29.000000
75% 39.000000
max 74.000000
Name: age, dtype: float64
Race: Other, sex: Male
count 162.000000
mean 34.654321
std 11.355531
min 17.000000
25% 26.000000
50% 32.000000
75% 42.000000
max 77.000000
Name: age, dtype: float64
Race: White, sex: Female
count 8642.000000
mean 36.811618
std 14.329093
min 17.000000
25% 25.000000
50% 35.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: White, sex: Male
count 19174.000000
mean 39.652498
std 13.436029
min 17.000000
25% 29.000000
50% 38.000000
75% 49.000000
max 90.000000
Name: age, dtype: float64
8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.
# married men
data[(data["sex"] == "Male")
& (data["marital-status"].str.startswith("Married"))][
"salary"
].value_counts(normalize=True)
salary
<=50K 0.559486
>50K 0.440514
Name: proportion, dtype: float64
# single men
data[
(data["sex"] == "Male")
& ~(data["marital-status"].str.startswith("Married"))
]["salary"].value_counts(normalize=True)
salary
<=50K 0.915505
>50K 0.084495
Name: proportion, dtype: float64
It’s good to be married :)
9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours and what is the percentage of those who earn a lot among them?
max_load = data["hours-per-week"].max()
print("Max time - {0} hours./week.".format(max_load))
num_workaholics = data[data["hours-per-week"] == max_load].shape[0]
print("Total number of such hard workers {0}".format(num_workaholics))
rich_share = (
float(
data[(data["hours-per-week"] == max_load) & (data["salary"] == ">50K")].shape[0]
)
/ num_workaholics
)
print("Percentage of rich among them {0}%".format(int(100 * rich_share)))
Max time - 99 hours./week.
Total number of such hard workers 85
Percentage of rich among them 29%
10. Count the average time of work (hours-per-week) those who earn a little and a lot (salary) for each country (native-country).
Simple method:
for (country, salary), sub_df in data.groupby(["native-country", "salary"]):
print(country, salary, round(sub_df["hours-per-week"].mean(), 2))
? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5
Elegant method:
pd.crosstab(
data["native-country"],
data["salary"],
values=data["hours-per-week"],
aggfunc=np.mean,
).T
| native-country | ? | Cambodia | Canada | China | Columbia | Cuba | Dominican-Republic | Ecuador | El-Salvador | England | France | Germany | Greece | Guatemala | Haiti | Holand-Netherlands | Honduras | Hong | Hungary | India | Iran | Ireland | Italy | Jamaica | Japan | Laos | Mexico | Nicaragua | Outlying-US(Guam-USVI-etc) | Peru | Philippines | Poland | Portugal | Puerto-Rico | Scotland | South | Taiwan | Thailand | Trinadad&Tobago | United-States | Vietnam | Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| salary | ||||||||||||||||||||||||||||||||||||||||||
| <=50K | 40.164760 | 41.416667 | 37.914634 | 37.381818 | 38.684211 | 37.985714 | 42.338235 | 38.041667 | 36.030928 | 40.483333 | 41.058824 | 39.139785 | 41.809524 | 39.360656 | 36.325 | 40.0 | 34.333333 | 39.142857 | 31.3 | 38.233333 | 41.44 | 40.947368 | 39.625 | 38.239437 | 41.000000 | 40.375 | 40.003279 | 36.09375 | 41.857143 | 35.068966 | 38.065693 | 38.166667 | 41.939394 | 38.470588 | 39.444444 | 40.15625 | 33.774194 | 42.866667 | 37.058824 | 38.799127 | 37.193548 | 41.6 |
| >50K | 45.547945 | 40.000000 | 45.641026 | 38.900000 | 50.000000 | 42.440000 | 47.000000 | 48.750000 | 45.000000 | 44.533333 | 50.750000 | 44.977273 | 50.625000 | 36.666667 | 42.750 | NaN | 60.000000 | 45.000000 | 50.0 | 46.475000 | 47.50 | 48.000000 | 45.400 | 41.100000 | 47.958333 | 40.000 | 46.575758 | 37.50000 | NaN | 40.000000 | 43.032787 | 39.000000 | 41.500000 | 39.416667 | 46.666667 | 51.43750 | 46.800000 | 58.333333 | 40.000000 | 45.505369 | 39.200000 | 49.5 |