Assignment #1 (demo). Exploratory data analysis with Pandas. Solution#
Author: Yury Kashnitsky. Translated and edited by Sergey Isaev, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.
Same assignment as a Kaggle Kernel + solution.
In this task you should use Pandas to answer a few questions about the Adult dataset. (You don’t have to download the data – it’s already in the repository). Choose the answers in the web-form.
Unique values of features (for more information please see the link above):
age
: continuous;workclass
:Private
,Self-emp-not-inc
,Self-emp-inc
,Federal-gov
,Local-gov
,State-gov
,Without-pay
,Never-worked
;fnlwgt
: continuous;education
:Bachelors
,Some-college
,11th
,HS-grad
,Prof-school
,Assoc-acdm
,Assoc-voc
,9th
,7th-8th
,12th
,Masters
,1st-4th
,10th
,Doctorate
,5th-6th
,Preschool
;education-num
: continuous;marital-status
:Married-civ-spouse
,Divorced
,Never-married
,Separated
,Widowed
,Married-spouse-absent
,Married-AF-spouse
,occupation
:Tech-support
,Craft-repair
,Other-service
,Sales
,Exec-managerial
,Prof-specialty
,Handlers-cleaners
,Machine-op-inspct
,Adm-clerical
,Farming-fishing
,Transport-moving
,Priv-house-serv
,Protective-serv
,Armed-Forces
;relationship
:Wife
,Own-child
,Husband
,Not-in-family
,Other-relative
,Unmarried
;race
:White
,Asian-Pac-Islander
,Amer-Indian-Eskimo
,Other
,Black
;sex
:Female
,Male
;capital-gain
: continuous.capital-loss
: continuous.hours-per-week
: continuous.native-country
:United-States
,Cambodia
,England
,Puerto-Rico
,Canada
,Germany
,Outlying-US(Guam-USVI-etc)
,India
,Japan
,Greece
,South
,China
,Cuba
,Iran
,Honduras
,Philippines
,Italy
,Poland
,Jamaica
,Vietnam
,Mexico
,Portugal
,Ireland
,France
,Dominican-Republic
,Laos
,Ecuador
,Taiwan
,Haiti
,Columbia
,Hungary
,Guatemala
,Nicaragua
,Scotland
,Thailand
,Yugoslavia
,El-Salvador
,Trinadad&Tobago
,Peru
,Hong
,Holand-Netherlands
;salary
:>50K
,<=50K
.
import numpy as np
import pandas as pd
pd.set_option("display.max.columns", 100)
# to draw pictures in jupyter notebook
%matplotlib inline
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings("ignore")
# for Jupyter-book, we copy data from GitHub, locally, to save Internet traffic,
# you can specify the data/ folder from the root of your cloned
# https://github.com/Yorko/mlcourse.ai repo, to save Internet traffic
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
data = pd.read_csv(DATA_URL + "adult.data.csv")
data.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
1. How many men and women (sex feature) are represented in this dataset?
data["sex"].value_counts()
sex
Male 21790
Female 10771
Name: count, dtype: int64
2. What is the average age (age feature) of women?
data[data["sex"] == "Female"]["age"].mean()
np.float64(36.85823043357163)
3. What is the proportion of German citizens (native-country feature)?
float((data["native-country"] == "Germany").sum()) / data.shape[0]
0.004207487485028101
**4-5. What are mean value and standard deviation of the age of those who receive more than 50K per year (salary feature) and those who receive less than 50K per year? **
ages1 = data[data["salary"] == ">50K"]["age"]
ages2 = data[data["salary"] == "<=50K"]["age"]
print(
"The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.".format(
round(ages1.mean()),
round(ages1.std(), 1),
round(ages2.mean()),
round(ages2.std(), 1),
)
)
The average age of the rich: 44 +- 10.5 years, poor - 37 +- 14.0 years.
6. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors
, Prof-school
, Assoc-acdm
, Assoc-voc
, Masters
or Doctorate
feature)
data[data["salary"] == ">50K"]["education"].unique() # No
array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
'10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo
race.
for (race, sex), sub_df in data.groupby(["race", "sex"]):
print("Race: {0}, sex: {1}".format(race, sex))
print(sub_df["age"].describe())
Race: Amer-Indian-Eskimo, sex: Female
count 119.000000
mean 37.117647
std 13.114991
min 17.000000
25% 27.000000
50% 36.000000
75% 46.000000
max 80.000000
Name: age, dtype: float64
Race: Amer-Indian-Eskimo, sex: Male
count 192.000000
mean 37.208333
std 12.049563
min 17.000000
25% 28.000000
50% 35.000000
75% 45.000000
max 82.000000
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Female
count 346.000000
mean 35.089595
std 12.300845
min 17.000000
25% 25.000000
50% 33.000000
75% 43.750000
max 75.000000
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Male
count 693.000000
mean 39.073593
std 12.883944
min 18.000000
25% 29.000000
50% 37.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: Black, sex: Female
count 1555.000000
mean 37.854019
std 12.637197
min 17.000000
25% 28.000000
50% 37.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: Black, sex: Male
count 1569.000000
mean 37.682600
std 12.882612
min 17.000000
25% 27.000000
50% 36.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: Other, sex: Female
count 109.000000
mean 31.678899
std 11.631599
min 17.000000
25% 23.000000
50% 29.000000
75% 39.000000
max 74.000000
Name: age, dtype: float64
Race: Other, sex: Male
count 162.000000
mean 34.654321
std 11.355531
min 17.000000
25% 26.000000
50% 32.000000
75% 42.000000
max 77.000000
Name: age, dtype: float64
Race: White, sex: Female
count 8642.000000
mean 36.811618
std 14.329093
min 17.000000
25% 25.000000
50% 35.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Race: White, sex: Male
count 19174.000000
mean 39.652498
std 13.436029
min 17.000000
25% 29.000000
50% 38.000000
75% 49.000000
max 90.000000
Name: age, dtype: float64
8. Among whom is the proportion of those who earn a lot (>50K
) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse
, Married-spouse-absent
or Married-AF-spouse
), the rest are considered bachelors.
# married men
data[(data["sex"] == "Male")
& (data["marital-status"].str.startswith("Married"))][
"salary"
].value_counts(normalize=True)
salary
<=50K 0.559486
>50K 0.440514
Name: proportion, dtype: float64
# single men
data[
(data["sex"] == "Male")
& ~(data["marital-status"].str.startswith("Married"))
]["salary"].value_counts(normalize=True)
salary
<=50K 0.915505
>50K 0.084495
Name: proportion, dtype: float64
It’s good to be married :)
9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours and what is the percentage of those who earn a lot among them?
max_load = data["hours-per-week"].max()
print("Max time - {0} hours./week.".format(max_load))
num_workaholics = data[data["hours-per-week"] == max_load].shape[0]
print("Total number of such hard workers {0}".format(num_workaholics))
rich_share = (
float(
data[(data["hours-per-week"] == max_load) & (data["salary"] == ">50K")].shape[0]
)
/ num_workaholics
)
print("Percentage of rich among them {0}%".format(int(100 * rich_share)))
Max time - 99 hours./week.
Total number of such hard workers 85
Percentage of rich among them 29%
10. Count the average time of work (hours-per-week) those who earning a little and a lot (salary) for each country (native-country).
Simple method:
for (country, salary), sub_df in data.groupby(["native-country", "salary"]):
print(country, salary, round(sub_df["hours-per-week"].mean(), 2))
? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5
Elegant method:
pd.crosstab(
data["native-country"],
data["salary"],
values=data["hours-per-week"],
aggfunc=np.mean,
).T
native-country | ? | Cambodia | Canada | China | Columbia | Cuba | Dominican-Republic | Ecuador | El-Salvador | England | France | Germany | Greece | Guatemala | Haiti | Holand-Netherlands | Honduras | Hong | Hungary | India | Iran | Ireland | Italy | Jamaica | Japan | Laos | Mexico | Nicaragua | Outlying-US(Guam-USVI-etc) | Peru | Philippines | Poland | Portugal | Puerto-Rico | Scotland | South | Taiwan | Thailand | Trinadad&Tobago | United-States | Vietnam | Yugoslavia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
salary | ||||||||||||||||||||||||||||||||||||||||||
<=50K | 40.164760 | 41.416667 | 37.914634 | 37.381818 | 38.684211 | 37.985714 | 42.338235 | 38.041667 | 36.030928 | 40.483333 | 41.058824 | 39.139785 | 41.809524 | 39.360656 | 36.325 | 40.0 | 34.333333 | 39.142857 | 31.3 | 38.233333 | 41.44 | 40.947368 | 39.625 | 38.239437 | 41.000000 | 40.375 | 40.003279 | 36.09375 | 41.857143 | 35.068966 | 38.065693 | 38.166667 | 41.939394 | 38.470588 | 39.444444 | 40.15625 | 33.774194 | 42.866667 | 37.058824 | 38.799127 | 37.193548 | 41.6 |
>50K | 45.547945 | 40.000000 | 45.641026 | 38.900000 | 50.000000 | 42.440000 | 47.000000 | 48.750000 | 45.000000 | 44.533333 | 50.750000 | 44.977273 | 50.625000 | 36.666667 | 42.750 | NaN | 60.000000 | 45.000000 | 50.0 | 46.475000 | 47.50 | 48.000000 | 45.400 | 41.100000 | 47.958333 | 40.000 | 46.575758 | 37.50000 | NaN | 40.000000 | 43.032787 | 39.000000 | 41.500000 | 39.416667 | 46.666667 | 51.43750 | 46.800000 | 58.333333 | 40.000000 | 45.505369 | 39.200000 | 49.5 |