Assignment #1 (demo). Exploratory data analysis with Pandas
Author: Yury Kashnitsky. Translated and edited by Sergey Isaev, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.
Same assignment as a Kaggle Kernel + solution.
In this task, use Pandas to answer a few questions about the Adult dataset. (You don’t have to download the data: it’s already in the repository.) Choose your answers in the web form.
Unique values of features (for more information please see the link above):
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- salary: >50K, <=50K.
import numpy as np
import pandas as pd
# show up to 100 columns when displaying a DataFrame
pd.set_option("display.max_columns", 100)
# to draw pictures in jupyter notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment out the following 2 lines if you'd like to see them
import warnings
warnings.filterwarnings("ignore")
# for Jupyter-book, we read the data from GitHub;
# locally, to save Internet traffic, you can point DATA_URL to the data/ folder
# at the root of your cloned https://github.com/Yorko/mlcourse.ai repo
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
data = pd.read_csv(DATA_URL + "adult.data.csv")
data.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
1. How many men and women (sex feature) are represented in this dataset?
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
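As a hint at the kind of call that fits here, the sketch below counts category values on a tiny made-up frame. The `toy` DataFrame and its contents are hypothetical and are not taken from the Adult data.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

# value_counts() returns the number of rows for each distinct value
counts = toy["sex"].value_counts()
print(counts)
```

The same call on `data["sex"]` should give the counts asked for in the question.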
2. What is the average age (age feature) of women?
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
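One common way to average a column over a subset of rows is a boolean mask combined with `.loc`. The following is a minimal sketch on invented numbers; `toy` and `mean_age_female` are illustrative names only.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "age": [40, 30, 50, 36],
    }
)

# Select the rows where sex is "Female", keep the age column, and average it
mean_age_female = toy.loc[toy["sex"] == "Female", "age"].mean()
print(mean_age_female)
```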
3. What is the percentage of German citizens (native-country feature)?
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
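A share of rows can be computed as the mean of a boolean Series, since `True` counts as 1 and `False` as 0. The sketch below assumes a small invented `toy` frame and is only one possible approach.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "native-country": [
            "United-States", "Germany", "United-States", "Cuba",
            "Germany", "United-States", "India", "United-States",
        ]
    }
)

# Fraction of rows equal to "Germany", converted to a percentage
pct_germany = 100 * (toy["native-country"] == "Germany").mean()

# Equivalent route: relative frequencies of all values at once
shares = toy["native-country"].value_counts(normalize=True)
print(pct_germany, 100 * shares["Germany"])
```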
4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and for those who earn 50K or less per year?
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
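Per-group statistics like these are usually obtained with `groupby()` followed by `agg()`. Here is a hedged sketch on made-up values; note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K"],
        "age": [50, 30, 40, 20, 40],
    }
)

# One row per salary group, one column per statistic;
# std is the sample standard deviation (ddof=1)
age_by_salary = toy.groupby("salary")["age"].agg(["mean", "std"])
print(age_by_salary)
```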
6. Is it true that people who earn more than 50K have at least a high school education? (education feature: Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate)
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
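A claim of the form "every row in this subset has one of these values" can be checked with `isin()` and `all()`, or simply by inspecting `unique()`. The sketch below uses an invented `toy` frame; `listed_levels` copies the education levels named in the question.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "salary": [">50K", ">50K", "<=50K", ">50K"],
        "education": ["Masters", "9th", "HS-grad", "Bachelors"],
    }
)

# Education levels listed in the question
listed_levels = [
    "Bachelors", "Prof-school", "Assoc-acdm",
    "Assoc-voc", "Masters", "Doctorate",
]

# Education of the high earners only
rich_education = toy.loc[toy["salary"] == ">50K", "education"]

# True only if every high earner has one of the listed levels
all_in_listed = rich_education.isin(listed_levels).all()
print(rich_education.unique(), all_in_listed)
```

In this toy frame one high earner has `9th`, so the check comes out false; the real data may of course behave differently.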
7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
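To illustrate how `groupby()` and `describe()` combine, the sketch below groups a small invented frame by two keys and then reads one cell from the resulting table. The race values and ages are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "race": ["White", "White", "Black", "Black", "White", "Black"],
        "sex": ["Male", "Female", "Male", "Male", "Male", "Female"],
        "age": [25, 40, 33, 61, 52, 47],
    }
)

# describe() yields count, mean, std, min, quartiles and max per (race, sex) group
age_stats = toy.groupby(["race", "sex"])["age"].describe()
print(age_stats)

# A single statistic is addressed by the group tuple and the column name
max_age_black_men = age_stats.loc[("Black", "Male"), "max"]
print(max_age_black_men)
```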
8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
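The "starts with Married" rule maps naturally onto `str.startswith()`. The sketch below, on invented rows, derives a married/single label for men and compares the share of high earners in each group; the labels `"married"` and `"single"` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "sex": ["Male"] * 6 + ["Female"] * 2,
        "marital-status": [
            "Married-civ-spouse", "Never-married", "Married-AF-spouse",
            "Divorced", "Married-civ-spouse", "Widowed",
            "Married-civ-spouse", "Divorced",
        ],
        "salary": [
            ">50K", "<=50K", ">50K", "<=50K",
            "<=50K", ">50K", ">50K", "<=50K",
        ],
    }
)

men = toy[toy["sex"] == "Male"]

# Label each man by whether his marital-status starts with "Married"
status = (
    men["marital-status"]
    .str.startswith("Married")
    .map({True: "married", False: "single"})
)

# Mean of a boolean Series = share of True values within each group
share_rich = (men["salary"] == ">50K").groupby(status).mean()
print(share_rich)
```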
9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
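This question chains three steps: find a maximum, filter the rows that reach it, and compute a share within that subset. A minimal sketch on made-up numbers, with illustrative variable names:

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "hours-per-week": [40, 99, 60, 99, 99, 20, 99],
        "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    }
)

# Step 1: the largest value in the column
max_hours = toy["hours-per-week"].max()

# Step 2: the rows that reach this maximum, and how many there are
hardest_workers = toy[toy["hours-per-week"] == max_hours]
n_hardest = len(hardest_workers)

# Step 3: percentage of high earners among those rows
pct_rich = 100 * (hardest_workers["salary"] == ">50K").mean()
print(max_hours, n_hardest, pct_rich)
```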
10. Compute the average number of hours worked per week (hours-per-week) for those who earn a little and a lot (salary) in each country (native-country). What are these values for Japan?
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
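A two-way breakdown like this can be sketched with a `groupby()` on two keys followed by `unstack()`, which turns the second key into columns. The frame below is invented; `pd.crosstab` with `values` and `aggfunc` would be another reasonable route.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the Adult dataset)
toy = pd.DataFrame(
    {
        "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba", "Japan"],
        "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
        "hours-per-week": [40, 50, 44, 35, 45, 46],
    }
)

# Rows: countries; columns: salary groups; cells: mean hours per week
mean_hours = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack()
)
print(mean_hours)

# One country's row holds both values asked for
print(mean_hours.loc["Japan"])
```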