import numpy as np # linear algebra
import pandas as pd # tabular data processing
import matplotlib.pyplot as plt # data plotting
import seaborn as sns # data visualisation and plotting
# Set a style
plt.style.use('ggplot')
# set the custom size for my graphs
sns.set(rc={'figure.figsize':(8.7,6.27)})
# suppress warnings due to different versions of packages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# set plots to show inline
%matplotlib inline
Chapter 1- Exploratory Data Analysis
1 Introduction
Exploratory data analysis is a fluid process and there is no single approach. It can be thought of as a process of hypothesis generation, data exploration, and formal statistical testing. It comes after the stage of importing and tidying your data.
In this section we will walk through organising your data, getting to know your data structure, and understanding variation and covariation within and between variables.
For these exercises we will use a simulated dataset of departmental salaries, which consists of the salaries of professors in 5 departments: english, informatics, statistics, biology, and sociology.
1.0.1 Packages for this adventure
1.0.2 Tidying your data
To conduct regression analyses in python it is important to have your data in a format that is easy to work with. A good approach is to organise your data in a “tidy” format.
Tidy data is a set of values, where each value is placed in its own “cell”, each variable in its own column, and each observation in its own row. Although this course is on Statistics in Python, the book R for Data Science by Garrett Grolemund and Hadley Wickham is a great resource when thinking about tidying data. Here, we use some of the definitions they set out in the book to describe tidy data.
Some definitions for tidy data:
- A variable is a quantity, quality, or property that you can measure.
- A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
- An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable.
In Python, the pandas package makes it easy to format your data in a tabular format.
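As a rough sketch of what tidying can look like in pandas, the example below reshapes a small "wide" table into a tidy one with melt(). The table, its column names (salary_2021, salary_2022), and its numbers are hypothetical and are not part of the salary dataset used in this chapter.

```python
import pandas as pd

# Hypothetical "wide" table: one row per professor, one column per year.
# The column names and numbers are made up for illustration only.
wide = pd.DataFrame({
    "ids": [1, 2],
    "salary_2021": [45000.0, 56000.0],
    "salary_2022": [47000.0, 56500.0],
})

# melt() gathers the year columns so that each row is one observation
# (one professor in one year) and each column is one variable.
tidy = wide.melt(id_vars="ids", var_name="year", value_name="salary")

# Strip the "salary_" prefix and store the year as an integer.
tidy["year"] = tidy["year"].str.replace("salary_", "", regex=False).astype(int)

print(tidy)
```

In the reshaped table, year and salary are separate variables, and each cell holds a single value.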
Reading in the data
# read in the salary data
salaries = pd.read_csv("../data/faculty-data.csv")
# looking at the head of the dataframe
salaries.head()
 | ids | department | bases | experience | raises | salary |
---|---|---|---|---|---|---|
0 | 1 | sociology | 39012.062997 | 3 | 2122.325646 | 45379.039935 |
1 | 2 | biology | 51872.123941 | 9 | 541.643975 | 56746.919719 |
2 | 3 | english | 64341.126468 | 3 | 543.178641 | 65970.662390 |
3 | 4 | informatics | 68975.266754 | 2 | 1736.946839 | 72449.160433 |
4 | 5 | statistics | 78262.278702 | 9 | 469.943148 | 82491.767032 |
Exercise
Our salary data consists of the salaries of individual professors from 5 departments within the University: english, informatics, statistics, biology, and sociology. The data contains the following variables:
- ids = individual id
- department = university department
- bases = the starting salary
- experience = years of experience
- raises = raise per year
- salary = current salary
We’re interested in exploring the relationship between years of experience and salary, but first we need to get to know our data a bit better.
- Looking at the definitions for value, observation, and variable, give an example of each term from the salary data.
1.1 Getting to know your data
To get started, let’s explore the following questions for our dataset.
What is the structure of the data?
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
1.1.1 Data Structure and Data Summaries
One of the things we will want to know about our variables is whether they are continuous or categorical.
Image Credit: @AllisonHorst.
- Continuous variable: a variable that can take on an unlimited number of values between the lowest and highest points of measurement.
  - e.g. speed, distance, height
- Categorical variable: a variable that can take one of a limited subset of values. For example, if you have a dataset about a household then you will typically find variables like employment sector, marriage status, and country.
  - In Python, categorical variables are usually stored as character strings or integers (e.g. ‘Industry’ and ‘Academia’ for types of employment sector).
  - Categorical variables are nominal if they have no order (e.g. ‘Ghana’ and ‘Uruguay’).
  - Categorical variables are ordinal if there is an order associated with them (e.g. ‘low’, ‘medium’, and ‘high’ referring to economic status).
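The distinction between nominal and ordinal variables can be made explicit in pandas with pd.Categorical. The sketch below is one possible way to do this; the status values are hypothetical and do not come from the salary data.

```python
import pandas as pd

# Hypothetical ordinal variable (not part of the salary data).
status = pd.Series(["low", "high", "medium", "low"])

# An ordered categorical records that low < medium < high.
ordered = pd.Categorical(
    status, categories=["low", "medium", "high"], ordered=True
)

# A nominal variable has categories but no order between them.
countries = pd.Categorical(["Ghana", "Uruguay", "Ghana"])

# Sorting an ordered categorical follows the category order,
# not alphabetical order.
sorted_status = pd.Series(ordered).sort_values().tolist()

print(ordered.ordered, countries.ordered)
print(sorted_status)
```

Storing the order with the data means that comparisons and sorting respect it, instead of falling back on alphabetical order.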
As we start to think about our data and the types of variables we are working with, we recommend reading Chapter 4: “What Gets Counted Counts” of Data Feminism by Catherine D’Ignazio and Lauren Klein. The book encourages us to rethink binaries and examine classification systems in data science.
Exercises
What are the dimensions of the dataframe?
# the shape attribute returns the dimensions of the dataframe
salaries.shape
(100, 6)
Exercises
What are the first and last values of salary?
# the head method shows us the first 5 rows
salaries.head()
 | ids | department | bases | experience | raises | salary |
---|---|---|---|---|---|---|
0 | 1 | sociology | 39012.062997 | 3 | 2122.325646 | 45379.039935 |
1 | 2 | biology | 51872.123941 | 9 | 541.643975 | 56746.919719 |
2 | 3 | english | 64341.126468 | 3 | 543.178641 | 65970.662390 |
3 | 4 | informatics | 68975.266754 | 2 | 1736.946839 | 72449.160433 |
4 | 5 | statistics | 78262.278702 | 9 | 469.943148 | 82491.767032 |
# the tail method shows us the last 5 rows
salaries.tail()
 | ids | department | bases | experience | raises | salary |
---|---|---|---|---|---|---|
95 | 96 | sociology | 42107.716434 | 8 | 1957.880086 | 57770.757125 |
96 | 97 | biology | 49774.735037 | 5 | 470.338273 | 52126.426402 |
97 | 98 | english | 60919.523254 | 3 | 502.197549 | 62426.115902 |
98 | 99 | informatics | 63809.889673 | 7 | 1641.660044 | 75301.509985 |
99 | 100 | statistics | 84420.476008 | 3 | 537.931883 | 86034.271658 |
Exercises
Which variables are categorical or continuous variables? What are these data types called in Python?
# the info method will tell us about our data frame, including how many observations per column and the type of data.
salaries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ids 100 non-null int64
1 department 100 non-null object
2 bases 100 non-null float64
3 experience 100 non-null int64
4 raises 100 non-null float64
5 salary 100 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 4.8+ KB
We often see type: object in pandas; the package user guide explains why this is.
Exercises
Using the data summary, what are the minimum and maximum salaries?
# the describe method gives us a summary of the numerical data and the spread of values
salaries.describe()
 | ids | bases | experience | raises | salary |
---|---|---|---|---|---|
count | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
mean | 50.500000 | 60226.029444 | 4.820000 | 1033.461045 | 65448.393841 |
std | 29.011492 | 14240.030985 | 2.889829 | 671.385251 | 13219.178125 |
min | 1.000000 | 37246.936522 | 0.000000 | 450.600025 | 42827.156913 |
25% | 25.750000 | 48071.305546 | 3.000000 | 481.551932 | 54745.711028 |
50% | 50.500000 | 59431.965403 | 5.000000 | 540.378028 | 62375.126164 |
75% | 75.250000 | 72671.300340 | 7.250000 | 1741.022583 | 78631.780911 |
max | 100.000000 | 87009.876061 | 9.000000 | 2177.423176 | 91342.489460 |
Exercises
What are the names of the columns?
# the columns attribute returns the column names
salaries.columns
Index(['ids', 'department', 'bases', 'experience', 'raises', 'salary'], dtype='object')
Exercises
Do we have any missing data? How many missing values do we have per variable?
# the isna() method flags missing values and the sum() method counts them per column.
salaries.isna().sum()
ids 0
department 0
bases 0
experience 0
raises 0
salary 0
dtype: int64
We can only find missing values when we have told the dataframe what to count as missing data. Sometimes our data may contain unrecognised or other custom values that signify missing values. If this happens we need to overwrite these values as missing values.
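As a minimal sketch of recoding custom markers as missing values, the example below assumes a hypothetical table in which -999 and "unknown" were used to signal missing data. Neither marker appears in the salary dataset; they are placeholders for whatever convention your own data uses.

```python
import numpy as np
import pandas as pd

# Hypothetical table where -999 and "unknown" stand in for missing values.
raw = pd.DataFrame({
    "experience": [3, -999, 9],
    "department": ["sociology", "unknown", "english"],
})

# Before recoding, pandas does not treat these markers as missing.
missing_before = raw.isna().sum()

# Recode each marker as NaN so that isna() can count it.
cleaned = raw.copy()
cleaned["experience"] = cleaned["experience"].replace(-999, np.nan)
cleaned["department"] = cleaned["department"].replace("unknown", np.nan)

missing_after = cleaned.isna().sum()

print(missing_before)
print(missing_after)
```

If the markers are known before loading, pd.read_csv also accepts an na_values argument that performs the same recoding while the file is read.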
Exercises
What are the unique departments in our dataset?
# the unique() method returns the distinct values in a column.
salaries['department'].unique()
array(['sociology', 'biology', 'english', 'informatics', 'statistics'],
dtype=object)
Exercises
Do we have any duplicates in our data?
# the duplicated() method flags duplicate rows, the sum() method counts them for us.
salaries.duplicated().sum()
0
Note that if we wanted to filter out duplicates we could use .drop_duplicates(), although in this dataset we don’t have any duplicates.
# you can filter out the duplicates using the drop_duplicates method.
salaries.drop_duplicates()
 | ids | department | bases | experience | raises | salary |
---|---|---|---|---|---|---|
0 | 1 | sociology | 39012.062997 | 3 | 2122.325646 | 45379.039935 |
1 | 2 | biology | 51872.123941 | 9 | 541.643975 | 56746.919719 |
2 | 3 | english | 64341.126468 | 3 | 543.178641 | 65970.662390 |
3 | 4 | informatics | 68975.266754 | 2 | 1736.946839 | 72449.160433 |
4 | 5 | statistics | 78262.278702 | 9 | 469.943148 | 82491.767032 |
... | ... | ... | ... | ... | ... | ... |
95 | 96 | sociology | 42107.716434 | 8 | 1957.880086 | 57770.757125 |
96 | 97 | biology | 49774.735037 | 5 | 470.338273 | 52126.426402 |
97 | 98 | english | 60919.523254 | 3 | 502.197549 | 62426.115902 |
98 | 99 | informatics | 63809.889673 | 7 | 1641.660044 | 75301.509985 |
99 | 100 | statistics | 84420.476008 | 3 | 537.931883 | 86034.271658 |
100 rows × 6 columns
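Because the salary data has no duplicates, the calls above leave it unchanged. The sketch below uses a small hypothetical table with one repeated row to show what duplicated() and drop_duplicates() do when duplicates are present.

```python
import pandas as pd

# Hypothetical table in which the second row is repeated.
demo = pd.DataFrame({
    "ids": [1, 2, 2, 3],
    "department": ["sociology", "biology", "biology", "english"],
})

# duplicated() marks each row that repeats an earlier row.
n_duplicates = demo.duplicated().sum()

# drop_duplicates() returns a new dataframe; demo itself is unchanged.
deduplicated = demo.drop_duplicates()

print(n_duplicates, demo.shape, deduplicated.shape)
```

Note that drop_duplicates() does not modify the original dataframe, so the result needs to be assigned to a name if you want to keep it.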
2 Variation
Now that we know more about the structure of our data, we can explore the variation and covariation in the variables. Knowing the variation within variables and the covariation between them helps us to understand the spread of the data and potential relationships that may give insight into modelling.
- variation: the tendency of values of a variable to change from measurement to measurement. Variation can come in several forms:
  - measurement error: you may measure the same thing twice and get slightly different values.
  - natural variation: the term I use to refer to variation that is inherent in a population or sample (e.g. as humans we all have different heights; the way these values vary reflects the variation in the sample or population).
- covariation: the tendency of values of a variable to change with the values of another variable.
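As a first, rough way to put numbers on these two ideas, the sketch below computes a standard deviation (variation within one variable) and a Pearson correlation (covariation between two variables). It uses only the five rows shown in the head() output earlier, rounded to two decimal places, so the results illustrate the calls and should not be read as findings about the full dataset.

```python
import pandas as pd

# First five rows of the salary data, copied from the head() output
# (values rounded to two decimal places).
sample = pd.DataFrame({
    "experience": [3, 9, 3, 2, 9],
    "salary": [45379.04, 56746.92, 65970.66, 72449.16, 82491.77],
})

# Variation within one variable: spread of salary around its mean.
salary_std = sample["salary"].std()

# Covariation between two variables: Pearson correlation coefficient.
exp_salary_corr = sample["experience"].corr(sample["salary"])

print(salary_std, exp_salary_corr)
```

With the full dataframe loaded, the same calls could be applied to salaries["salary"] and salaries["experience"].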