Chapter 6 - Summary Statistics and Aggregation
1 Chapter Overview and Learning Objectives
Packages and Data
- Packages
- Data
Overall Descriptive Statistics
Range
- min
- max
- quantiles
Averages
- Mean
- Median
- Mode
Spread
- Standard Deviation
- Variance
Counting Values
- Counts
- Null Value counts
- Value Counts
Other Summary Statistics
Creating Size Bands
Aggregation
- Single Aggregation
- Multiple Aggregation
2 Packages and Datasets
2.1 Packages
As a reminder, we should always import our packages at the top of our script. In this session we will use the following:
- pandas, and give it the nickname “pd”
- numpy and give it the nickname “np”
2.1.1 Exercise
Import pandas and give it the nickname pd. Import numpy and give it the nickname np.

import pandas as pd
import numpy as np
2.2 Datasets
Good practice also dictates that we read in our datasets at the top of our script too.
In this session we’ll be using:
- animals - animals.csv
- titanic - titanic.xlsx
2.2.1 Exercise
Load in the datasets listed above.
You can check your variables are loaded by using “%whos” in Jupyter. In Spyder or other IDEs they should appear in your variable explorer.
animals = pd.read_csv("../data/animals.csv")
titanic = pd.read_excel("../data/titanic.xlsx")
If you struggle with this section, feel free to review Chapter 3, which covered this in depth. Practicing the commands we repeat over and over is a great way to retain the good practice guidelines.
3 Overall Descriptive Statistics
Pandas has an inbuilt method to get basic descriptive statistics across our dataset: the “.describe()” method.
3.1 Example
titanic.describe()
pclass | survived | age | sibsp | parch | fare | body | |
---|---|---|---|---|---|---|---|
count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
We can also get summary statistics on a specific column.
titanic["fare"].describe()
count 1308.000000
mean 33.295479
std 51.758668
min 0.000000
25% 7.895800
50% 14.454200
75% 31.275000
max 512.329200
Name: fare, dtype: float64
These statistics are explained in more detail below:
Summary Statistic | Description |
---|---|
count | the number (count) of non missing entries in the given column. |
mean | the average (arithmetic mean) data value in the given column. |
std | the standard deviation (spread) of values in the given column. |
min | the smallest value in the given column. |
25% | the value of the data at the lower quartile (i.e. after the first 25% of data, ordered from smallest to largest). |
50% | the middle value of the data (a.k.a. the median); half the values are larger than this value, and half smaller. |
75% | the value of the data at the upper quartile (i.e. after the first 75% of data, ordered from smallest to largest). |
max | the maximum data value recorded. |
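Each of these rows can be reproduced with an individual method, which we meet later in the chapter. A minimal sketch on a small hand-made Series (illustrative values, not the titanic data):

```python
import pandas as pd

# A small hand-made Series to check .describe() against the individual methods
s = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0])

stats = s.describe()

# Each row of .describe() matches the corresponding individual method
print(stats["count"] == s.count())  # True: number of non-missing entries
print(stats["mean"] == s.mean())    # True: arithmetic mean (25 / 5 = 5.0)
print(stats["50%"] == s.median())   # True: the median is the 50% quantile
print(stats["max"] == s.max())      # True: largest value
```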
3.2 Example 2
Describe works across all numeric columns by default. We can display descriptive information for other data types by using the parameter “include=” .
This parameter takes a list as the input; even if we’re just including one kind of data. Here we’re specifying that we want to include “object”, our text/string columns.
titanic.describe(include=["object"])
name | sex | ticket | cabin | embarked | home.dest | boat | |
---|---|---|---|---|---|---|---|
count | 1309 | 1309 | 1309 | 295 | 1307 | 745 | 486 |
unique | 1307 | 2 | 929 | 186 | 3 | 369 | 28 |
top | Connolly, Miss. Kate | male | CA. 2343 | C23 C25 C27 | S | New York, NY | 13 |
freq | 2 | 843 | 11 | 6 | 914 | 64 | 39 |
Summary Statistic | Description |
---|---|
count | The number (count) of non missing entries in the given column. |
unique | The number of unique values in the column. |
top | The most frequently occurring value. |
freq | The frequency of the “top” value (how often it appears). |
If two or more values are tied for most frequently occurring (i.e. they share the same frequency), pandas will arbitrarily choose one of them to report as the top value (a known pandas issue).
In this data there are two women and two men who share the same name. Pandas will choose one of them to display.
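A minimal sketch of this tie-breaking behaviour on a small made-up Series, where “a” and “b” both appear twice:

```python
import pandas as pd

# A column where "a" and "b" are tied as most frequent
tied = pd.Series(["a", "a", "b", "b", "c"])

summary = tied.describe()
print(summary["freq"])  # 2, but "top" could be reported as either "a" or "b"
```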
4 Range
We can also access these summary statistics individually. In most cases the name of the method is the same as the summary statistic.
4.1 min
We can use “.min()” to return the minimum value in a column.
4.1.1 Example
titanic["fare"].min()
0.0
This also works for object (text) columns.
titanic["name"].min()
'Abbing, Mr. Anthony'
4.2 max
We can use “.max()” to return the maximum value in a column.
4.2.1 Example
titanic["fare"].max()
512.3292
This also works for object (text) columns.
titanic["name"].max()
'van Melkebeke, Mr. Philemon'
Something important to note here is that pandas compares strings character by character using each character’s underlying code, which orders A-Z before a-z. So a lower case “a” is treated as coming after a capital “Z” in Python.
This is why van Melkebeke, Mr. Philemon is the maximum value in our Titanic Dataframe rather than Zimmerman, Mr Leo. We can handle this issue by ensuring our data is either all lower case or all upper case before finding the “.min()” or “.max()” values.
4.2.2 Example
Here I have handled the issue by chaining the methods “.str.lower()” and “.max()” together. If I wanted to do more work with this column in future I may consider overwriting it with a lower case version (which would form part of the cleaning routine).
titanic["name"].str.lower().max()
'zimmerman, mr. leo'
The chaining of methods here applies left to right, just as it is read: the column is selected, then the “.lower()” method is applied via the “str” accessor, and then “.max()” is applied to the lower case column.
4.3 Quantiles
We can use “.quantile()” to find information at different points of our data.
4.3.1 Example
Our parameter is “q=” and then a decimal number. If we don’t specify this the default behaviour is “0.5”, which returns the median (as it is the 50% quantile).
titanic["fare"].quantile(q=0.25)
7.8958
If we wish to specify more than one, we pass a list to the parameter “q=”
titanic["fare"].quantile(q=[0, 0.25, 0.5, 0.75, 1])
0.00 0.0000
0.25 7.8958
0.50 14.4542
0.75 31.2750
1.00 512.3292
Name: fare, dtype: float64
This is a great time to mention that you don’t necessarily need to type out a full list of numbers this way; the numbers we want could belong to an enormous list, which is unreasonable to type by hand. This is where the range functions come in.
There are two of these functions:
range(start, stop, step) is the inbuilt function, where start specifies where to begin, stop specifies where to stop (which is exclusive! Not included) and step is how we jump from number to number. All three must be integers, and note that the inbuilt “range()” does not accept keyword arguments.
np.arange(start=, stop=, step=) comes from the numpy package (“np”) and creates a range from the start to the stop values (where the stop is exclusive). However, this function allows us to step in increments of any size, including floats (or decimal numbers).
Let’s see two quick examples before using this with the “quantile()” function.
4.3.2 Example
# Creating a range object
x = range(0, 10, 2)  # Jump from 0-10 in increments of 2

# Using np.arange() to obtain an array of values
y = np.arange(0, 10, 0.5)  # Jump from 0-10 in increments of 0.5

print(x, y)
range(0, 10, 2) [0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. 5.5 6. 6.5 7. 7.5 8. 8.5
9. 9.5]
You can see here that the “range()” function gives a range object, rather than a list-like structure of the values themselves, which restricts its use further as it is less readable. It is recommended that “np.arange()” is used in most circumstances. Whilst the second output looks like a list, it is actually an array, a numpy-specific object that is beyond the scope of this course.
For more information consult this Numpy for Beginners tutorial.
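If you do want to see the individual values a range object represents, you can wrap it in “list()”. A quick sketch:

```python
import numpy as np

# range() gives a lazy range object; wrap it in list() to see the values
print(list(range(0, 10, 2)))  # [0, 2, 4, 6, 8]

# np.arange() returns the values directly and allows float steps
print(np.arange(0, 1, 0.25))  # 0, 0.25, 0.5 and 0.75
```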
4.3.3 Example 2
Let’s see how this function helps when finding quantiles.
# Note that .arange() comes from the numpy package (np)
titanic["fare"].quantile(q=np.arange(start=0.0,
                                     stop=1.1,  # Remember this is exclusive!
                                     step=0.1))
0.0 0.00000
0.1 7.56750
0.2 7.85420
0.3 8.05000
0.4 10.50000
0.5 14.45420
0.6 21.67920
0.7 26.95500
0.8 41.57920
0.9 78.05082
1.0 512.32920
Name: fare, dtype: float64
4.3.4 Exercise
(a) How old are the youngest and oldest passengers in the titanic DataFrame?
(b) Print out the 20th, 40th, 60th and 80th percentiles of the fare column in the titanic DataFrame.
# (a)
max_age = titanic["age"].max()
min_age = titanic["age"].min()

print("The oldest person on the titanic was", max_age, "and the youngest person was", str(min_age) + ".")
The oldest person on the titanic was 80.0 and the youngest person was 0.1667.
# (b)
percentiles = titanic["fare"].quantile(q=np.arange(start=0.2,
                                                   stop=1,
                                                   step=0.2))
percentiles
0.2 7.8542
0.4 10.5000
0.6 21.6792
0.8 41.5792
Name: fare, dtype: float64
5 Averages
Now we will look at Measures of Central Tendency, often referred to as Averages.
5.1 Mean
Seen briefly in the previous chapter when discussing filling in missing values, the “.mean()” method computes the arithmetic mean of a column: the sum of all the values divided by the number of values.
5.1.1 Example
# Compute mean of the fare column
titanic["fare"].mean()
33.29547928134557
5.1.2 Exercise
Confirm that the value given by “.mean()” for the fare column is the same as computing it manually.
Hint: You will need the “.sum()” function to compute the total of the column and the “.count()” function to compute the number of values (we will see this function in some more detail later). Remember that to compare if two objects are equal, we need to use “==”.
mean_way_one = titanic["fare"].mean()
mean_way_two = titanic["fare"].sum() / titanic["fare"].count()

# Remember to compare two objects we use the == logical operator to return True or False
mean_way_one == mean_way_two
True
An interesting observation: using “len()” in place of “.count()” would give an answer that is slightly out. This is because “len()” counts all values (including nulls), whereas “.count()” only counts the non-null values (and is what “.mean()” uses in the background).
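A minimal sketch of the difference, using a small made-up Series with one null:

```python
import pandas as pd
import numpy as np

# A small Series with one missing value (illustrative, not the titanic data)
vals = pd.Series([10.0, 20.0, np.nan, 30.0])

print(len(vals))      # 4: counts every row, including the NaN
print(vals.count())   # 3: counts only the non-null values

# .mean() divides the sum by .count(), so the NaN is ignored
print(vals.mean())    # 20.0, i.e. 60 / 3
```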
5.2 Median
The “.median()” method returns the middle value (50th percentile) when the numbers are listed in order.
5.2.1 Example
titanic["fare"].median()
14.4542
5.3 Mode
The “.mode()” method returns the value that occurs most frequently in the column.
5.3.1 Example 1
titanic["fare"].mode()
0 8.05
Name: fare, dtype: float64
5.3.2 Example 2
We can also find the mode in object based columns. Here the “name” column has two modes, since both “Connolly, Miss. Kate” and “Kelly, Mr. James” appear twice in the Data Frame.
Generally, when these cases appear the following terminology is used:
- One mode - Unimodal
- Two modes - Bimodal
- Three modes - Trimodal
As such, the name column is bimodal in this example. If you’re wondering how we know the frequency (the number of times they each appear in the DataFrame), we will see a very useful method for this later in the chapter.
titanic["name"].mode()
0 Connolly, Miss. Kate
1 Kelly, Mr. James
Name: name, dtype: object
Interestingly, these rows aren’t duplicates! There really were two separate individuals with each of those names!
titanic[(titanic["name"] == "Connolly, Miss. Kate") | (titanic["name"] == "Kelly, Mr. James")]
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
725 | 3 | 1 | Connolly, Miss. Kate | female | 22.0 | 0 | 0 | 370373 | 7.7500 | NaN | Q | Ireland | NaN | 13 |
726 | 3 | 0 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q | Ireland | NaN | NaN |
924 | 3 | 0 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | NaN | 70.0 | NaN |
925 | 3 | 0 | Kelly, Mr. James | male | 44.0 | 0 | 0 | 363592 | 8.0500 | NaN | S | NaN | NaN | NaN |
6 Spread
Another element of the data we often want to investigate is how spread out it is; we use Measures of Spread to do so.
6.1 Standard Deviation
The standard deviation measures the spread of the data about the mean value. It shows you how much your data distribution is spread out around the mean or average.
6.1.1 Example
We can calculate the standard deviation using the “.std()” method.
titanic["fare"].std()
51.758668239174135
6.2 Variance
Variance measures how spread out the values of a variable are in general.
6.2.1 Example
We can calculate the variance using the “.var()” method.
titanic["fare"].var()
2678.9597378928934
6.2.2 Exercise
(a)(i) Find the variance and standard deviation of the “age” column in the titanic dataset.
(a)(ii) Round these two values to 2 decimal places and output the values in a sentence.
(b) There is a link between these two metrics, namely that the standard deviation is the square root of the variance. Confirm this relationship. Hint: To use square root, we need the “np.sqrt()” function from numpy (“import numpy as np”).
# (a)(i) & (a)(ii)
var_age = round(titanic["age"].var(), ndigits=2)
std_age = round(titanic["age"].std(), ndigits=2)

print("The variance of the ages of titanic passengers was", var_age, "and the standard deviation was", str(std_age) + ".")
The variance of the ages of titanic passengers was 207.75 and the standard deviation was 14.41.
# (b)
var_age = titanic["age"].var()
std_age = titanic["age"].std()

np.sqrt(var_age) == std_age
True
7 Counting Values
7.1 Counts
As we saw in the earlier exercise, we can find the number of non null values in a column using the “.count()” method. By non null values we mean values with data in them, not the missing values.
As a reminder we can use either “.shape” or “len()” to find the number of rows, which for titanic is 1309 rows (this includes rows that have missing values).
7.1.1 Example
titanic["embarked"].count()
1307
As you can see, “embarked” returns 1307, so there are 2 missing values.
7.2 Null Value counts
We can find how many null values we have by using the “.isna()” method. This returns a Boolean series of “True” and “False” values indicating whether or not each value is null.
7.2.1 Example 1
titanic["age"].isna().tail()
1304 False
1305 True
1306 False
1307 False
1308 False
Name: age, dtype: bool
As these have numeric values behind them (“True” is 1, “False” is 0) we can use “.sum()” to total them and obtain the number of null values in a column.
titanic["age"].isna().sum()
263
The method “.isnull()” is an alias of “.isna()” and returns the same output; you may see it used in older code.
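A quick sketch confirming the two methods agree, on a small made-up Series:

```python
import pandas as pd
import numpy as np

vals = pd.Series([1.0, np.nan, 3.0])

# .isna() and .isnull() are aliases, so their outputs are identical
print(vals.isna().equals(vals.isnull()))  # True
print(vals.isnull().sum())                # 1
```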
7.2.2 Exercise
Compute the proportion of null values (a decimal between 0 and 1, equivalent to a percentage; for example 0.85 means 85%) in the “age” column. Use the “.isna()” method to compute the number of null values this time.
Hint: To compute this, we first find the number of null values and divide by the number of values overall in the column, including the nulls!
# Proportion of null values
(titanic["age"].isna().sum()) / len(titanic["age"])
0.20091673032849502
7.3 Value Counts
We can find the frequencies of each unique value in a column, by using “.value_counts()”. This is particularly useful when applied to object columns to observe what values appear the most and least frequently.
titanic["sex"].value_counts()
sex
male 843
female 466
Name: count, dtype: int64
7.3.1 Exercise
(a) How many passengers were in each class?
(b) Look in the help for “pd.Series.value_counts()” to see how you can return the values as a proportion instead.
# (a)
titanic["pclass"].value_counts()
pclass
3 709
1 323
2 277
Name: count, dtype: int64
# (b)
# Use the parameter and argument normalize = True to return proportions
# Show the help documentation
#help(pd.Series.value_counts)
titanic["pclass"].value_counts(normalize=True)
pclass
3 0.541635
1 0.246753
2 0.211612
Name: proportion, dtype: float64
8 Other Summary Statistics
8.1 Sum
We can use “.sum()” to add up columns of numeric data, which we saw when counting the null values.
8.1.1 Example 1
titanic["fare"].sum()
43550.4869
The “.sum()” method comes from pandas; there is also an inbuilt function “sum()”. However, if there are null values in the column, the inbuilt function will return “nan”.
8.1.2 Example 2
sum(titanic["fare"])
nan
8.2 Unique
We can use “.unique()” to find values that are unique in a column. This is returned as an array.
8.2.1 Example
titanic["boat"].unique()
array([2, '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
'7', 'C', '14', '2', '5 9', '13', '1', '15', '5 7', '8 10', '12',
'16', '13 15 B', 'C D', '15 16', '13 15'], dtype=object)
8.3 Nunique
“.nunique()” can be used to find the number of unique values in a column.
8.3.1 Example
titanic["boat"].nunique()
28
9 Creating Size Bands
9.1 pd.cut()
We can use the function “pd.cut()” to cut or “bin” our data into groups or categories. This is commonly done when creating size bands, like age bands, to create a categorical column out of a numeric one.
“pd.cut()” takes a column of data and groups it into “bins” or categories. The resulting column will have the data type of “category”. There’s some more information about the “category” data type in the pandas documentation.
We often assign the output of “pd.cut()” to a new column. This is because we don’t want to overwrite and change the data type of the existing column.
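A minimal sketch of the data type change, using a small made-up Series rather than the titanic data:

```python
import pandas as pd

# A tiny stand-in numeric column (made-up values, not the titanic data)
ages = pd.Series([3, 18, 45, 71])

banded = pd.cut(ages, bins=2)

# The original column is numeric; the binned column is "category"
print(ages.dtype)    # int64
print(banded.dtype)  # category
```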
9.1.1 Example
titanic["binned_ages"] = pd.cut(titanic["age"],
                                bins=10)

titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | binned_ages | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | (24.117, 32.1] |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | (0.0869, 8.15] |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (0.0869, 8.15] |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | (24.117, 32.1] |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (24.117, 32.1] |
We set the parameter “bins =” to specify the number of categories that we want. By passing an integer to this, Pandas takes the smallest value and the largest value in the column and creates the number of categories defined.
We can look at our bins. Note the “(” denotes exclusion of that number and the “]” denotes inclusion of that number.
9.1.2 Example 1
Now that these are “categories”, we can see there is an ordering relationship between each category.
titanic["binned_ages"].unique()
[(24.117, 32.1], (0.0869, 8.15], (40.083, 48.067], (56.05, 64.033], (32.1, 40.083], ..., (64.033, 72.017], (16.133, 24.117], (72.017, 80.0], NaN, (8.15, 16.133]]
Length: 11
Categories (10, interval[float64, right]): [(0.0869, 8.15] < (8.15, 16.133] < (16.133, 24.117] < (24.117, 32.1] ... (48.067, 56.05] < (56.05, 64.033] < (64.033, 72.017] < (72.017, 80.0]]
We can also pass our own values to determine where the edges of the categories are, rather than allowing Python to compute these approximate ones. This could be as a list of values or a range object that we discussed before.
9.1.3 Example 2
Note that here I have to use the numpy function “np.arange()” as there will be decimals.
titanic["binned_ages2"] = pd.cut(titanic["age"],  # Data to cut
                                 bins=np.arange(start=0,
                                                stop=(titanic["age"].max() + 1),  # Remember stop is exclusive!
                                                step=10))

titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | binned_ages | binned_ages2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | (24.117, 32.1] | (20.0, 30.0] |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | (0.0869, 8.15] | (0.0, 10.0] |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (0.0869, 8.15] | (0.0, 10.0] |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | (24.117, 32.1] | (20.0, 30.0] |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (24.117, 32.1] | (20.0, 30.0] |
This means I don’t have to know the maximum value for the column before I write this piece of code. You can see how we can combine methods quite easily to make our life easier when performing such tasks.
It is important to note that in this part of the code:
- stop=(titanic["age"].max() + 1)
I use brackets to enforce the order of operations. If I didn’t add 1 to the stop, the entry at my maximum value would read “NaN”, which I want to avoid.
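To see why the + 1 matters, here is a small sketch on made-up data where the maximum age is exactly 80:

```python
import pandas as pd
import numpy as np

# Made-up ages where the maximum is exactly 80
ages = pd.Series([5.0, 25.0, 80.0])

# Without the + 1, np.arange stops before 80, so the top bin edge is 70
# and the maximum value falls outside every bin, becoming NaN
too_short = pd.cut(ages, bins=np.arange(start=0, stop=ages.max(), step=10))

# With the + 1, the edges run up to 80 and the maximum lands in (70, 80]
just_right = pd.cut(ages, bins=np.arange(start=0, stop=ages.max() + 1, step=10))

print(too_short.iloc[-1])   # NaN: 80 was lost
print(just_right.iloc[-1])  # (70.0, 80.0]: 80 is binned
```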
9.1.4 Example 3
We can also add labels to our categories. This time rather than displaying the bin edges it will display the text strings we specify.
This is passed as a list to the parameter “labels=” to the “pd.cut()” function.
titanic["binned_ages3"] = pd.cut(titanic["age"],  # Data to cut
                                 bins=np.arange(start=0,
                                                stop=(titanic["age"].max() + 1),
                                                step=10),
                                 labels=["0 – 10", "11-20", "21 – 30",
                                         "31 – 40", "41 – 50", "51 – 60",
                                         "61 – 70", "71 - 80"])
titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | binned_ages | binned_ages2 | binned_ages3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | (24.117, 32.1] | (20.0, 30.0] | 21 – 30 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | (0.0869, 8.15] | (0.0, 10.0] | 0 – 10 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (0.0869, 8.15] | (0.0, 10.0] | 0 – 10 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | (24.117, 32.1] | (20.0, 30.0] | 21 – 30 |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (24.117, 32.1] | (20.0, 30.0] | 21 – 30 |
Note these bands are approximate; e.g. someone with an age of 20.2 will go into the band labelled “21-30”. Integers were chosen as they’re an easier read, and most ages after 1 are whole numbers.
There are additional parameters we can set here; check the help function if there’s anything specific you need to do. You will get some more practice with this in the Case Study in Chapter 7.
9.2 pd.qcut()
“pd.qcut()” is described in the documentation as a “Quantile-based discretization function”. This means that “pd.qcut()” tries to divide the data into bins of equal sizes, rather than using the numeric edges of the data to create the bins.
9.2.1 Example
In the cell below I’m using “pd.qcut()” to cut the “age” column into 10 bins. This uses the same data and the same number of segments as the “pd.cut()” we did at the start.
# Divide age into 10 equally sized groups
titanic["age_qcut"] = pd.qcut(titanic["age"],
                              q=10)  # q for quantiles

# View the data
titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | binned_ages | binned_ages2 | binned_ages3 | age_qcut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | (24.117, 32.1] | (20.0, 30.0] | 21 – 30 | (28.0, 31.0] |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | (0.0869, 8.15] | (0.0, 10.0] | 0 – 10 | (0.166, 14.0] |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (0.0869, 8.15] | (0.0, 10.0] | 0 – 10 | (0.166, 14.0] |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | (24.117, 32.1] | (20.0, 30.0] | 21 – 30 | (28.0, 31.0] |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | (24.117, 32.1] | (20.0, 30.0] | 21 – 30 | (22.0, 25.0] |
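The difference between the two is easiest to see on skewed data: “pd.cut()” fixes the bin widths, while “pd.qcut()” fixes the bin counts. A sketch on made-up exponential values (the seed and scale here are arbitrary):

```python
import pandas as pd
import numpy as np

# Skewed made-up data: most values small, a few large (not the titanic data)
rng = np.random.default_rng(42)
values = pd.Series(rng.exponential(scale=20, size=1000))

equal_width = pd.cut(values, bins=4)   # equal-width bins
equal_count = pd.qcut(values, q=4)     # equal-count (quantile) bins

# pd.cut: the bin widths are equal but the counts are very uneven
print(equal_width.value_counts().sort_index())

# pd.qcut: the counts are equal (250 each) but the bin widths are uneven
print(equal_count.value_counts().sort_index())
```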
10 Aggregation
Aggregation means grouping data together by a particular grouping variable (of our choice) and producing a summary of one or more columns for that grouping variable. For example, what if I want to find out if the average fare paid is different for each passenger class? I could use this technique to find out.
We have 3 passenger classes; 1st, 2nd and 3rd, which we can check using the “.unique()” method as we saw earlier.
titanic["pclass"].unique()
array([1, 2, 3], dtype=int64)
We’ll use the “.groupby()” method in this tutorial. Pandas also provides the “pd.crosstab()” and “pd.pivot_table()” functions. You can find these in the Reference Material as they are a little more niche in their application.
This method can be really useful, especially when your data are disaggregated, e.g. data about individual units of people or things, rather than by category.
The method “.groupby()” allows us to aggregate by a categorical variable and summarise numerical data into a new DataFrame. It works on a very interesting principle known as split-apply-combine:
Split is where a DataFrame is divided into a set of smaller DataFrames based on the grouping variable. If there are n levels of a categorical variable, there are n smaller dataframes.
Apply is where an aggregation is applied to each of these smaller dataframes to obtain a single valued output (for example a mean).
Combine is where we bring together the aggregated DataFrame rows into a final new DataFrame.
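The three steps can be sketched by hand on a tiny made-up DataFrame, and compared with what “.groupby()” produces in one go:

```python
import pandas as pd

# A tiny illustrative DataFrame (made-up column names and values)
df = pd.DataFrame({"group": ["a", "a", "b", "b", "b"],
                   "value": [10, 20, 3, 6, 9]})

# Split: one smaller DataFrame per level of the grouping variable
split_a = df[df["group"] == "a"]
split_b = df[df["group"] == "b"]

# Apply: the aggregation is applied to each smaller DataFrame
mean_a = split_a["value"].mean()  # 15.0
mean_b = split_b["value"].mean()  # 6.0

# Combine: .groupby() performs all three steps and stitches the results together
combined = df.groupby(by="group")["value"].mean()
print(combined)
```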
10.1 Examples
10.1.1 Example 1
Let’s see an example of this in action. It sounds complex (and it is, with what goes on behind the scenes), but the final output is much more readable.
titanic_class_fare = titanic.groupby(by="pclass")["fare"].mean()

titanic_class_fare
pclass
1 87.508992
2 21.179196
3 13.302889
Name: fare, dtype: float64
I wanted to find out if the “.mean()” value of the “fare” column was different for each passenger class. Let’s break the code down:
In the “.groupby()” method to the “by=” parameter I pass the column I wish to group by. The column “pclass” has three values, “1” , “2” and “3”.
The “.groupby()” behaviour will effectively split the original titanic DataFrame into three new DataFrames. One with the values of “1”, one for “2” and one with the values of “3”. This is the split step.
From these new “.groupby()” DataFrames I select the column “fare” and apply the summary statistic “.mean()” to it. This is the apply step.
The results are returned in “titanic_class_fare”; this is the combine step.
10.1.2 Example 2
We can also use more complicated groupings using more than one variable, here we group first by “pclass” then “embarked”. As with many methods we discussed in earlier chapters, we must specify the column names in a list, where order is important.
titanic.groupby(by=["pclass", "embarked"])["fare"].mean()
pclass embarked
1 C 106.845330
Q 90.000000
S 72.148094
2 C 23.300593
Q 11.735114
S 21.206921
3 C 11.021624
Q 10.390820
S 14.435422
Name: fare, dtype: float64
To show off why order is important, see the following:
titanic.groupby(by=["embarked", "pclass"])["fare"].mean()
embarked pclass
C 1 106.845330
2 23.300593
3 11.021624
Q 1 90.000000
2 11.735114
3 10.390820
S 1 72.148094
2 21.206921
3 14.435422
Name: fare, dtype: float64
10.1.3 Example 3
You can also use other summary statistics here, for example “.count()” to return the number of values. The following shows that 141 passengers embarked in Cherbourg (embarked = C) and were pclass 1.
titanic.groupby(by=["embarked", "pclass"])["fare"].count()
embarked pclass
C 1 141
2 28
3 101
Q 1 3
2 7
3 113
S 1 177
2 242
3 494
Name: fare, dtype: int64
Now let’s do some exercises!
10.1.4 Exercises
(a) Group animals by the column “AnimalClass” and find the “.sum()” of the “IncidentNominalCost(£)” column.
(b) Group animals by the columns “Borough” and “AnimalClass” and find the “.mean()” of the “PumpHoursTotal” column.
(c) Reverse the order of the grouping in (b) and observe the differences.
# (a)
animalclass_sum = animals.groupby(by="AnimalClass")['IncidentNominalCost(£)'].sum()

animalclass_sum
AnimalClass
Bird 350487.0
Fish 1560.0
Mammal 1597513.0
Reptile 3730.0
Name: IncidentNominalCost(£), dtype: float64
# (b)
pump_hours_mean = animals.groupby(by=["Borough", "AnimalClass"])['PumpHoursTotal'].mean()

pump_hours_mean.head()
Borough AnimalClass
Barking And Dagenham Bird 1.035714
Mammal 1.141667
Barnet Bird 1.066667
Mammal 1.164021
Reptile 1.000000
Name: PumpHoursTotal, dtype: float64
# (c)
pump_hours_mean = animals.groupby(by=["AnimalClass", "Borough"])['PumpHoursTotal'].mean()

pump_hours_mean.head()
AnimalClass Borough
Bird Barking And Dagenham 1.035714
Barnet 1.066667
Bexley 1.058824
Brent 1.131579
Bromley 1.200000
Name: PumpHoursTotal, dtype: float64
10.2 Multiple Aggregation
If we want to return more than one aggregation at once to the columns we’ve selected, there is the “.agg()” method. This takes a dictionary where the key is our column and the value is the aggregation method we wish to apply as a string.
These aggfuncs are slightly different to what we’ve seen, but are straightforward like “sum”, ”count”, ”mean” etc. Let’s first see it with one aggregation before building up to two.
10.2.1 Example 1
animalclass_sum = animals.groupby(by="AnimalClass").agg({"IncidentNominalCost(£)": "sum",
                                                         "PumpHoursTotal": "mean"})
animalclass_sum
IncidentNominalCost(£) | PumpHoursTotal | |
---|---|---|
AnimalClass | ||
Bird | 350487.0 | 1.106912 |
Fish | 1560.0 | 3.000000 |
Mammal | 1597513.0 | 1.203162 |
Reptile | 3730.0 | 1.083333 |
If we want to apply more than one aggregation to a column we can pass a list as the dictionary value. Historically this was done with the numpy functions “np.sum”, “np.mean” etc.; recent versions of pandas prefer the equivalent strings (“sum”, “mean”) and emit a FutureWarning for the numpy callables.
10.2.2 Example 2
Here we will output the sum and the mean for “IncidentNominalCost(£)” and the mean for “PumpHoursTotal”. Notice the FutureWarnings in the output: passing the strings “sum” and “mean” in the list, instead of the numpy functions, would avoid them.
animalclass_sum = animals.groupby(by="AnimalClass").agg({"IncidentNominalCost(£)": [np.sum, np.mean],
                                                         "PumpHoursTotal": "mean"})
animalclass_sum
C:\Users\tbalb\AppData\Local\Temp\ipykernel_6440\2031256029.py:1: FutureWarning:
The provided callable <function sum at 0x000002B20CE8C680> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
C:\Users\tbalb\AppData\Local\Temp\ipykernel_6440\2031256029.py:1: FutureWarning:
The provided callable <function mean at 0x000002B20CE8D760> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
IncidentNominalCost(£) | PumpHoursTotal | ||
---|---|---|---|
sum | mean | mean | |
AnimalClass | |||
Bird | 350487.0 | 323.029493 | 1.106912 |
Fish | 1560.0 | 780.000000 | 3.000000 |
Mammal | 1597513.0 | 346.006714 | 1.203162 |
Reptile | 3730.0 | 310.833333 | 1.083333 |
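When a column receives more than one aggregation like this, the result has a two-level MultiIndex on its columns (e.g. ("IncidentNominalCost(£)", "sum")), which can be awkward for further work. A common trick is to join each tuple into a flat name. Here is a minimal sketch using a hypothetical mini-DataFrame (the column names only mirror the animals data):

```python
import pandas as pd

# Hypothetical mini-DataFrame to illustrate flattening MultiIndex columns
df = pd.DataFrame({
    "AnimalClass": ["Bird", "Bird", "Fish"],
    "Cost": [100.0, 200.0, 50.0],
    "PumpHoursTotal": [1.0, 2.0, 3.0],
})

multi = df.groupby(by="AnimalClass").agg({"Cost": ["sum", "mean"],
                                          "PumpHoursTotal": "mean"})

# The columns are a two-level MultiIndex, e.g. ("Cost", "sum").
# Joining each tuple with "_" gives flat, single-level names.
multi.columns = ["_".join(col) for col in multi.columns]
print(list(multi.columns))  # ['Cost_sum', 'Cost_mean', 'PumpHoursTotal_mean']
```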
10.2.3 Exercise
Group the animals dataset by the variable “Borough” and obtain the following summary statistics:
The variance and standard deviation of the “IncidentNominalCost(£)” column.
The mean of the “PumpCount” column.
The sum of the “PumpHoursTotal” column.
animals_aggregates = animals.groupby(by="Borough").agg({"IncidentNominalCost(£)": ["var", "std"],
                                                        "PumpCount": "mean",
                                                        "PumpHoursTotal": "sum"})
animals_aggregates
IncidentNominalCost(£) | PumpCount | PumpHoursTotal | ||
---|---|---|---|---|
var | std | mean | sum | |
Borough | ||||
Barking And Dagenham | 17478.446543 | 132.206076 | 1.006757 | 166.0 |
Barnet | 18924.149664 | 137.565074 | 1.017021 | 269.0 |
Bexley | 101288.188050 | 318.258053 | 1.050000 | 232.0 |
Brent | 14903.635693 | 122.080448 | 1.012903 | 170.0 |
Brentwood | NaN | NaN | 1.000000 | 1.0 |
Bromley | 96915.276777 | 311.312185 | 1.037915 | 288.0 |
Broxbourne | NaN | NaN | 1.000000 | 1.0 |
Camden | 101271.452381 | 318.231759 | 1.035503 | 207.0 |
City Of London | 5970.090909 | 77.266363 | 1.000000 | 12.0 |
Croydon | 12917.137020 | 113.653583 | 1.000000 | 273.0 |
Ealing | 12108.624975 | 110.039197 | 1.019900 | 222.0 |
Enfield | 52486.690464 | 229.099739 | 1.072874 | 324.0 |
Epping Forest | 101180.333333 | 318.088562 | 1.333333 | 7.0 |
Greenwich | 27466.981031 | 165.731654 | 1.016575 | 212.0 |
Hackney | 11048.169623 | 105.110274 | 1.000000 | 223.0 |
Hammersmith And Fulham | 24555.948524 | 156.703378 | 1.020833 | 161.0 |
Haringey | 19360.601034 | 139.142377 | 1.014778 | 226.0 |
Harrow | 14084.828954 | 118.679522 | 1.009346 | 122.0 |
Havering | 43320.559880 | 208.135917 | 1.035714 | 218.0 |
Hillingdon | 81438.478868 | 285.374279 | 1.048128 | 252.0 |
Hounslow | 61567.012384 | 248.127009 | 1.032680 | 188.0 |
Islington | 27034.293488 | 164.421086 | 1.005025 | 218.0 |
Kensington And Chelsea | 9777.240258 | 98.879928 | 1.008000 | 135.0 |
Kingston Upon Thames | 33175.594343 | 182.141688 | 1.008850 | 133.0 |
Lambeth | 10143.245118 | 100.713679 | 1.018018 | 245.0 |
Lewisham | 10361.723148 | 101.792550 | 1.000000 | 205.0 |
Merton | 13145.226087 | 114.652632 | 1.000000 | 126.0 |
Newham | 22322.972614 | 149.408743 | 1.005128 | 222.0 |
Redbridge | 29134.820504 | 170.689251 | 1.016304 | 223.0 |
Richmond Upon Thames | 21911.281480 | 148.024598 | 1.012739 | 183.0 |
Southwark | 19417.532703 | 139.346807 | 1.000000 | 253.0 |
Sutton | 11911.604924 | 109.140299 | 1.015748 | 139.0 |
Tandridge | NaN | NaN | 1.000000 | 1.0 |
Tower Hamlets | 64299.889716 | 253.574229 | 1.051643 | 267.0 |
Waltham Forest | 111310.958494 | 333.632970 | 1.049261 | 266.0 |
Wandsworth | 34465.758256 | 185.649558 | 1.027778 | 215.0 |
Westminster | 29891.372414 | 172.891216 | 1.034483 | 164.0 |
In later versions of Pandas (from 0.25.0 onwards), named aggregation also exists, which lets you choose the names of the output columns directly. The help guide will help you implement this.
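As a brief sketch of the named-aggregation syntax, using a hypothetical mini-DataFrame (not the real animals data): each keyword argument to “.agg()” becomes an output column name, and its value is a (column, aggfunc) pair.

```python
import pandas as pd

# Hypothetical mini-DataFrame standing in for the animals data
df = pd.DataFrame({
    "AnimalClass": ["Bird", "Bird", "Mammal"],
    "Cost": [100.0, 300.0, 50.0],
})

# Named aggregation: the keyword is the output column name,
# the value is a (column, aggfunc) pair
named = df.groupby(by="AnimalClass").agg(
    total_cost=("Cost", "sum"),
    mean_cost=("Cost", "mean"),
)
print(named)
```

This avoids the MultiIndex columns that the dictionary-of-lists form produces.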
Note that these DataFrames displayed when you run the code look a little different to the ones I have displayed here. You will see that in yours, the index is our grouping categories. We can use our “.reset_index()” method here too, or use “as_index = False”. There are a few reasons we might do this, including making visualisation easier.
group_by_dropped_index = titanic.groupby(by=["embarked", "pclass"], as_index=False)["fare"].count()
group_by_dropped_index
embarked | pclass | fare | |
---|---|---|---|
0 | C | 1 | 141 |
1 | C | 2 | 28 |
2 | C | 3 | 101 |
3 | Q | 1 | 3 |
4 | Q | 2 | 7 |
5 | Q | 3 | 113 |
6 | S | 1 | 177 |
7 | S | 2 | 242 |
8 | S | 3 | 494 |
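The two approaches give the same result: “as_index=False” keeps the grouping columns as ordinary columns from the start, while “.reset_index()” converts them back afterwards. A quick check on a small made-up stand-in for the titanic data (hypothetical values):

```python
import pandas as pd

# Small hypothetical stand-in for the titanic data
df = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "S"],
    "pclass": [1, 2, 1, 1, 2],
    "fare": [70.0, 30.0, 50.0, 60.0, 20.0],
})

# as_index=False keeps the grouping columns as ordinary columns...
flat = df.groupby(by=["embarked", "pclass"], as_index=False)["fare"].count()

# ...which is equivalent to grouping normally and resetting the index
reset = df.groupby(by=["embarked", "pclass"])["fare"].count().reset_index()

print(flat.equals(reset))  # True
```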
11 Chapter Summary
Excellent job, you have completed Chapter 6 of the Introduction to Python course. There is only one chapter left before you complete Core Part 1 of the course! Absolutely amazing aggregating!
Chapter 7 will feature a structured end-of-course exercise that tests you on everything you have learned from Chapter 3 to Chapter 6, with a new dataset we have yet to explore: the schools dataset.