Chapter 1 - Exploratory Data Analysis

1 Introduction

Exploratory data analysis is a fluid process and there is no single approach. It can be thought of as a process of hypothesis generation, data exploration, and formal statistical testing. It comes after the stage of importing and tidying your data.

In this section we will walk through organising your data, getting to know your data structure, and understanding variation and covariation within and between variables.

For these exercises we will use a simulated dataset of departmental salaries, which consists of the salaries of professors in 5 departments: english, informatics, statistics, biology, and sociology.

1.0.1 Packages for this adventure

import numpy as np # linear algebra
import pandas as pd # tabular data processing
import matplotlib.pyplot as plt # data plotting
import seaborn as sns # data visualisation and plotting

# Set a style
plt.style.use('ggplot')

# set the custom size for my graphs
sns.set(rc={'figure.figsize':(8.7,6.27)})

# suppress warnings due to different versions of packages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# set plots to show inline
%matplotlib inline

1.0.2 Tidying your data

To conduct regression analyses in Python it is important to have your data in a format that is easy to work with. A good approach is to organise your data in a “tidy” format.

Tidy data is a set of values, where each value is placed in its own “cell”, each variable in its own column, and each observation in its own row. Although this course is on Statistics in Python, the book R for Data Science by Garrett Grolemund and Hadley Wickham is a great resource when thinking about tidying data. Here, we use some of the definitions they set out in the book to describe tidy data.

Some definitions for tidy data:

A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable.
In Python, the pandas package makes it easy to store and work with your data in a tabular format.
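The definitions above can be made concrete with a small reshaping example. This is a minimal sketch, not part of the salary data: the `wide` table and its `salary_2020` and `salary_2021` columns are hypothetical, and `melt` is only one of several ways to reach a tidy layout.

```python
import pandas as pd

# Hypothetical "wide" table: one row per professor, one column per year.
wide = pd.DataFrame({
    "ids": [1, 2],
    "salary_2020": [45000.0, 56000.0],
    "salary_2021": [47000.0, 56500.0],
})

# melt() stacks the year columns so that each row is one observation
# (one professor in one year) and each variable has its own column.
tidy = wide.melt(id_vars="ids", var_name="year", value_name="salary")

# Strip the "salary_" prefix so the year column holds only the year.
tidy["year"] = tidy["year"].str.replace("salary_", "", regex=False).astype(int)

print(tidy)
```

In the reshaped table, ids, year, and salary are the variables, each row is an observation, and each cell holds a single value.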

Reading in the data

# read in the salary data
salaries = pd.read_csv("../data/faculty-data.csv")

# looking at the head of the dataframe
salaries.head()

	ids	department	bases	experience	raises	salary
0	1	sociology	39012.062997	3	2122.325646	45379.039935
1	2	biology	51872.123941	9	541.643975	56746.919719
2	3	english	64341.126468	3	543.178641	65970.662390
3	4	informatics	68975.266754	2	1736.946839	72449.160433
4	5	statistics	78262.278702	9	469.943148	82491.767032

Exercise

Our salary data consists of the salaries of individual professors from 5 departments within the University: english, informatics, statistics, biology, and sociology. The data contains the following variables:

ids = individual id
department = university department
bases = the starting salary
experience = years of experience
raises = raise per year
salary = current salary

We’re interested in exploring the relationship between years of experience and salary, but first we need to get to know our data a bit better.

Looking at the definitions for value, observation, and variable, give an example of each term from the salary data.

1.1 Getting to know your data

To get started, let’s explore the following questions for our dataset.

What is the structure of the data?
What type of variation occurs within my variables?
What type of covariation occurs between my variables?

1.1.1 Data Structure and Data Summaries

One of the things we will want to know about our variables is whether they are continuous or categorical.

Image Credit: @AllisonHorst.

Continuous variable: a variable that can take on an unlimited number of values between the lowest and highest points of measurements.
- e.g. speed, distance, height
Categorical variable: a variable that can take one of a limited set of values. For example, if you have a dataset about a household then you will typically find variables like employment sector, marriage status, and country.
- In Python, categorical variables are usually stored as character strings or integers (e.g. ‘Industry’ and ‘Academia’ for types of employment sector).
- Categorical variables are nominal if they have no order (e.g. ‘Ghana’ and ‘Uruguay’)
- Categorical variables are ordinal if there is an order associated with them (e.g. ‘low’, ‘medium’, and ‘high’ referring to economic status).
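The difference between nominal and ordinal variables can be sketched with the pandas categorical type. The values below are hypothetical and are not taken from the salary data. The sketch assumes you want pandas to know about the ordering, which is optional.

```python
import pandas as pd

# Hypothetical values, not taken from the salary data.
country = pd.Series(["Ghana", "Uruguay", "Ghana"], dtype="category")
status = pd.Series(["low", "high", "medium", "low"])

# Nominal: the categories have no order.
print(country.cat.ordered)  # False

# Ordinal: state the order explicitly so sorting and comparisons respect it.
status_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
status = status.astype(status_type)

print(status.min(), status.max())
print((status > "low").tolist())
```

Without the explicit order, text values would sort alphabetically, which would place ‘high’ before ‘low’.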

As we start to think about our data and the types of variables we are working with, we recommend reading Chapter 4: “What Gets Counted Counts” of Data Feminism by Catherine D’Ignazio and Lauren Klein. The book encourages us to rethink binaries and examine classification systems in data science.

What are the dimensions of the dataframe?

# the shape attribute returns the dimensions of the dataframe
salaries.shape

(100, 6)

Exercises

Question 2
Answer

What are the first and last values of salary?

# the head method shows us the first 5 rows

salaries.head()

	ids	department	bases	experience	raises	salary
0	1	sociology	39012.062997	3	2122.325646	45379.039935
1	2	biology	51872.123941	9	541.643975	56746.919719
2	3	english	64341.126468	3	543.178641	65970.662390
3	4	informatics	68975.266754	2	1736.946839	72449.160433
4	5	statistics	78262.278702	9	469.943148	82491.767032

# the tail method shows us the last 5 rows

salaries.tail()

	ids	department	bases	experience	raises	salary
95	96	sociology	42107.716434	8	1957.880086	57770.757125
96	97	biology	49774.735037	5	470.338273	52126.426402
97	98	english	60919.523254	3	502.197549	62426.115902
98	99	informatics	63809.889673	7	1641.660044	75301.509985
99	100	statistics	84420.476008	3	537.931883	86034.271658

Exercises

Question 3
Answer

Which variables are categorical or continuous variables? What are these data types called in Python?

# the info method summarises the dataframe, including the number of non-null values in each column and each column's data type.

salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ids         100 non-null    int64  
 1   department  100 non-null    object 
 2   bases       100 non-null    float64
 3   experience  100 non-null    int64  
 4   raises      100 non-null    float64
 5   salary      100 non-null    float64
dtypes: float64(3), int64(2), object(1)
memory usage: 4.8+ KB

We often see the dtype object in pandas. It is typically used for columns that hold text or a mix of types, and the pandas user guide explains why this is.
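One way to split columns by data type is `select_dtypes`. The sketch below uses a small `sample` dataframe typed in by hand from the first three rows shown earlier (salaries rounded), so it runs without the CSV file. The names `numeric_cols` and `other_cols` are illustrative.

```python
import pandas as pd

# Three rows copied by hand from the head of the salary data (values rounded).
sample = pd.DataFrame({
    "ids": [1, 2, 3],
    "department": ["sociology", "biology", "english"],
    "experience": [3, 9, 3],
    "salary": [45379.04, 56746.92, 65970.66],
})

# Columns stored as integers or floats.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else, here the text column.
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

A numeric dtype does not by itself make a variable continuous: ids is stored as an integer but is an identifier, so the data type is a starting point for the question rather than the answer.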

Exercises

Question 4
Answer

Using the data summary, what are the minimum and maximum salaries?

# the describe method gives us a summary of the numerical data and the spread of values

salaries.describe()

	ids	bases	experience	raises	salary
count	100.000000	100.000000	100.000000	100.000000	100.000000
mean	50.500000	60226.029444	4.820000	1033.461045	65448.393841
std	29.011492	14240.030985	2.889829	671.385251	13219.178125
min	1.000000	37246.936522	0.000000	450.600025	42827.156913
25%	25.750000	48071.305546	3.000000	481.551932	54745.711028
50%	50.500000	59431.965403	5.000000	540.378028	62375.126164
75%	75.250000	72671.300340	7.250000	1741.022583	78631.780911
max	100.000000	87009.876061	9.000000	2177.423176	91342.489460

Exercises

Question 5
Answer

What are the names of the columns?

# the columns attribute returns the column names

salaries.columns

Index(['ids', 'department', 'bases', 'experience', 'raises', 'salary'], dtype='object')

Exercises

Question 6
Answer

Do we have any missing data? How many missing values do we have per variable?

# the isna() method marks each missing value as True and the sum() method counts them per column.

salaries.isna().sum()

ids           0
department    0
bases         0
experience    0
raises        0
salary        0
dtype: int64

We can only find missing values when we have told the dataframe what to count as missing data. Sometimes our data may contain unrecognised or other custom values that signify missing values. If this happens we need to overwrite these values as missing values.
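As a minimal sketch of overwriting custom codes, suppose a dataset used -999 and ‘unknown’ as placeholders for missing values. Both codes and the `raw` dataframe are hypothetical, since the salary data has no such values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: -999 and "unknown" stand in for missing values.
raw = pd.DataFrame({
    "department": ["biology", "unknown", "english"],
    "experience": [9, -999, 3],
})

# Before recoding, pandas sees no missing values at all.
print(raw.isna().sum())

# Overwrite the placeholder codes with NaN, column by column,
# so that isna() can detect them.
cleaned = raw.replace({
    "department": {"unknown": np.nan},
    "experience": {-999: np.nan},
})

print(cleaned.isna().sum())
```

If the codes are known before loading, the `na_values` argument of `pd.read_csv` can apply the same recoding while the file is read.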

Exercises

Question 7
Answer

What are the unique departments in our dataset?

# the unique() method returns the distinct values in a column.

salaries['department'].unique()

array(['sociology', 'biology', 'english', 'informatics', 'statistics'],
      dtype=object)
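Alongside the distinct values, it is often useful to know how many there are and how often each occurs. The sketch below uses a short hypothetical `departments` series rather than the full salary data.

```python
import pandas as pd

# Hypothetical department labels, not the full salary data.
departments = pd.Series(
    ["sociology", "biology", "english", "biology", "sociology", "biology"]
)

# Number of distinct values.
n_unique = departments.nunique()

# Number of rows per distinct value, most frequent first.
counts = departments.value_counts()

print(n_unique)
print(counts)
```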

Exercises

Question 8
Answer

Do we have any duplicates in our data?

# the duplicated() method marks each row that repeats an earlier row as True, and the sum() method counts them for us.

salaries.duplicated().sum()

Note that if we wanted to filter out duplicates we could use .drop_duplicates(), although in this dataset we don’t have any duplicates.

# you can filter out the duplicates using the drop_duplicates method.

salaries.drop_duplicates()

	ids	department	bases	experience	raises	salary
0	1	sociology	39012.062997	3	2122.325646	45379.039935
1	2	biology	51872.123941	9	541.643975	56746.919719
2	3	english	64341.126468	3	543.178641	65970.662390
3	4	informatics	68975.266754	2	1736.946839	72449.160433
4	5	statistics	78262.278702	9	469.943148	82491.767032
...	...	...	...	...	...	...
95	96	sociology	42107.716434	8	1957.880086	57770.757125
96	97	biology	49774.735037	5	470.338273	52126.426402
97	98	english	60919.523254	3	502.197549	62426.115902
98	99	informatics	63809.889673	7	1641.660044	75301.509985
99	100	statistics	84420.476008	3	537.931883	86034.271658

100 rows × 6 columns
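Because the salary data has no duplicates, the effect of these methods is easier to see on a small example. In the hypothetical dataframe `df` below, the third row repeats the first.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first.
df = pd.DataFrame({
    "ids": [1, 2, 1],
    "department": ["sociology", "biology", "sociology"],
})

# duplicated() marks each row that repeats an earlier row.
mask = df.duplicated()
print(mask.tolist())  # [False, False, True]
print(mask.sum())     # 1

# drop_duplicates() returns a new dataframe; df itself is unchanged.
deduped = df.drop_duplicates()
print(len(df), len(deduped))
```

Note that drop_duplicates() does not modify the dataframe in place by default, so the result needs to be assigned to a name if you want to keep it.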

2 Variation

Now that we know more about the structure of our data, we can explore the variation and covariation in the variables. Knowing the variation within variables and the covariation between them can help us to understand the spread of the data and potential relationships in the data that may give insight into modelling.

variation: the tendency of the values of a variable to change from measurement to measurement. Variation can come in several forms:
- measurement error: you may measure the same thing twice and get slightly different values.
- natural variation: the term I use for variation that is inherent in a population or sample (e.g. as humans we all have different heights, and the way these values vary reflects the variation in the sample or population).
covariation: the tendency of the values of one variable to change with the values of another variable.
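Both ideas can be summarised numerically. The sketch below uses the experience and salary values from the first five rows shown earlier (salaries rounded to two decimals), typed in by hand so that it runs without the CSV file. Five rows from five different departments are far too few to interpret, so treat the numbers as an illustration of the method calls only.

```python
import pandas as pd

# First five rows of experience and salary, copied by hand (salaries rounded).
sample = pd.DataFrame({
    "experience": [3, 9, 3, 2, 9],
    "salary": [45379.04, 56746.92, 65970.66, 72449.16, 82491.77],
})

# Variation: the spread of each variable on its own.
spread = sample.std()

# Covariation: how the two variables change together.
cov = sample["experience"].cov(sample["salary"])
corr = sample["experience"].corr(sample["salary"])

print(spread)
print(cov, corr)
```

The covariance is expressed in the units of both variables (years times salary), whereas the correlation is rescaled to lie between -1 and 1, which makes it easier to compare across pairs of variables.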
