Chapter 6 - Summary Statistics and Aggregation

1 Chapter Overview and Learning Objectives

  • Packages and Data

    • Packages
    • Data
  • Overall Descriptive Statistics

  • Range

    • min
    • max
    • quantiles
  • Averages

    • Mean
    • Median
    • Mode
  • Spread

    • Standard Deviation
    • Variance
  • Counting Values

    • Counts
    • Null Value counts
    • Value Counts
  • Other Summary Statistics

  • Creating Size Bands

  • Aggregation

    • Single Aggregation
    • Multiple Aggregation

2 Packages and Datasets

2.1 Packages

As a reminder, we should always import our packages at the top of our script. In this session we will use the following:

  • pandas, and give it the nickname “pd”
  • numpy and give it the nickname “np”

2.1.1 Exercise

Import Pandas and give it the nickname pd. Import Numpy and give it the nickname np.

import pandas as pd
import numpy as np

2.2 Datasets

Good practice also dictates that we read in our datasets at the top of our script too.

In this session we’ll be using:

  • animals - animals.csv
  • titanic - titanic.xlsx

2.2.1 Exercise

Load in the datasets listed above.

You can check your variables are loaded by using “%whos” in Jupyter. In Spyder or other IDEs they should appear in your variable explorer.

animals = pd.read_csv("../data/animals.csv")
titanic = pd.read_excel("../data/titanic.xlsx")

If you struggle with this section, feel free to review the content of Chapter 3 that covered this in depth. Practicing these commands that we repeat over and over is a great way to retain the good practice guidelines.

3 Overall Descriptive Statistics

Pandas has an inbuilt method to get basic descriptive statistics across our dataset: the “.describe()” method.

3.1 Example

titanic.describe()
pclass survived age sibsp parch fare body
count 1309.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000 121.000000
mean 2.294882 0.381971 29.881135 0.498854 0.385027 33.295479 160.809917
std 0.837836 0.486055 14.413500 1.041658 0.865560 51.758668 97.696922
min 1.000000 0.000000 0.166700 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000 7.895800 72.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000 14.454200 155.000000
75% 3.000000 1.000000 39.000000 1.000000 0.000000 31.275000 256.000000
max 3.000000 1.000000 80.000000 8.000000 9.000000 512.329200 328.000000

We can also get summary statistics on a specific column.

titanic["fare"].describe()
count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: fare, dtype: float64

These statistics are explained in more detail below:

Summary Statistic Description
count the number (count) of non-missing entries in the given column.
mean the average (arithmetic mean) data value in the given column.
std the standard deviation (spread) of values in the given column.
min the smallest value in the given column.
25% the value of the data at the lower quartile
(i.e. after the first 25% of data, ordered from smallest to largest).
50% the middle value of the data (aka the median).
half the values are larger than this value, and half smaller.
75% the value of the data at the upper quartile
(i.e. after the first 75% of data, ordered from smallest to largest).
max the maximum data value recorded.

3.2 Example 2

Describe works across all numeric columns by default. We can display descriptive information for other data types by using the parameter “include=”.

This parameter takes a list as the input; even if we’re just including one kind of data. Here we’re specifying that we want to include “object”, our text/string columns.

titanic.describe(include=["object"])
name sex ticket cabin embarked home.dest boat
count 1309 1309 1309 295 1307 745 486
unique 1307 2 929 186 3 369 28
top Connolly, Miss. Kate male CA. 2343 C23 C25 C27 S New York, NY 13
freq 2 843 11 6 914 64 39
Summary Statistic Description
count The number (count) of non-missing entries in the given column.
unique The number of unique variables in a column.
top The most frequently occurring value.
freq The frequency of the “top” value (how often it appears).

If there are two or more “top” values, i.e. multiple values that share the same highest frequency, pandas will arbitrarily choose one of them to report as the top value (this is a known pandas quirk).

In this data there are two women who share one name and two men who share another. Pandas will choose one of these names to display.
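A quick toy illustration of this tie-breaking behaviour, using a made-up Series rather than the course data:

```python
import pandas as pd

# Two values ("a" and "b") are tied for most frequent
s = pd.Series(["a", "a", "b", "b", "c"])

summary = s.describe()
print(summary["freq"])  # 2 - the shared highest frequency
print(summary["top"])   # pandas arbitrarily reports one of "a" or "b"
```

Whichever value ends up as “top”, the “freq” row still tells you how often it appears.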

4 Range

We can also access these summary statistics individually. In most cases the name of the method is the same as the summary statistic.

4.1 min

We can use “.min()” to return the minimum value in a column.

4.1.1 Example

titanic["fare"].min()
0.0

This also works for object (text) columns.

titanic["name"].min()
'Abbing, Mr. Anthony'

4.2 max

We can use “.max()” to return the maximum value in a column.

4.2.1 Example

titanic["fare"].max()
512.3292

This also works for object (text) columns.

titanic["name"].max()
'van Melkebeke, Mr. Philemon'

Something important to note here is that pandas effectively assigns a value to each letter, ordering capitals A–Z before lower case a–z. So a lower case “a” is treated as coming after a capital “Z” in Python.

This is why van Melkebeke, Mr. Philemon is the maximum value in our Titanic DataFrame rather than Zimmerman, Mr. Leo. We can handle this issue by ensuring our data is either all lower case or all upper case before finding the “.min()” or “.max()” values.

4.2.2 Example

Here I have handled the issue by chaining the methods “.str.lower()” and “.max()” together. If I wanted to do more work with this column in future I may consider overwriting it with a lower case version (which would form part of the cleaning routine).

titanic["name"].str.lower().max()
'zimmerman, mr. leo'

The chaining of methods here applies left to right just as it is read, the column is selected, then the “.lower()” method is used from “str” methods, and then the “.max()” method is applied to the lower case column.

4.3 Quantiles

We can use “.quantile()” to find information at different points of our data.

4.3.1 Example

Our parameter is “q=” and then a decimal number. If we don’t specify this the default behaviour is “0.5”, which returns the median (as it is the 50% quantile).

titanic["fare"].quantile(q=0.25)
7.8958

If we wish to specify more than one, we pass a list to the parameter “q=”

titanic["fare"].quantile(q=[0, 0.25, 0.5, 0.75, 1])
0.00      0.0000
0.25      7.8958
0.50     14.4542
0.75     31.2750
1.00    512.3292
Name: fare, dtype: float64

This is a great time to mention that you don’t necessarily need to type a full list of numbers this way; the numbers we want could belong to an enormous list that is unreasonable to type by hand. This is where the range functions come in.

There are two of these functions:

range(start, stop, step) — where start specifies where to start, stop specifies where to stop (which is exclusive! Not included) and step is how we jump from number to number; all three must be integers. Note that the built-in “range()” only accepts these as positional arguments, not keywords.

np.arange(start=, stop=, step=) comes from the numpy package (“np”) and creates a range from the start to the stop values (where the stop is exclusive). However, this function allows us to step in increments of any size, including floats (or decimal numbers).

Let’s see two quick examples before using this with the “quantile()” function.

4.3.2 Example

# Creating a range object 

x = range(0, 10, 2) # Jump from 0-10 in increments of 2

# Using np.arange() to obtain an array of values

y = np.arange(0, 10, 0.5) # Jump from 0-10 in increments of 0.5

print(x, y)
range(0, 10, 2) [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5 6.  6.5 7.  7.5 8.  8.5
 9.  9.5]

You see here that the “range()” function gives a range object, rather than a list-like structure of the values themselves, which makes it less readable and restricts its use. It is recommended that “np.arange()” is used in most circumstances. Whilst the second output looks like a list, it is actually an array: a numpy-specific object that is beyond the scope of this course.

For more information consult this Numpy for Beginners tutorial.
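If you do want to see the values inside a range object, you can convert it with the built-in “list()” function:

```python
# Converting a range object to a list makes its values visible
x = range(0, 10, 2)
print(list(x))  # [0, 2, 4, 6, 8]
```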

4.3.3 Example 2

Let’s see how this function helps when finding quantiles.

# Note that .arange() comes from the numpy package (np)


titanic["fare"].quantile(q=np.arange(start=0.0, 
                                     stop=1.1,  # Remember this is exclusive!
                                     step=0.1)) 
0.0      0.00000
0.1      7.56750
0.2      7.85420
0.3      8.05000
0.4     10.50000
0.5     14.45420
0.6     21.67920
0.7     26.95500
0.8     41.57920
0.9     78.05082
1.0    512.32920
Name: fare, dtype: float64

4.3.4 Exercise

  1. How old are the youngest and oldest passengers in the titanic DataFrame?

  2. Print out the 20th, 40th, 60th and 80th percentiles of the fare column in the titanic DataFrame.

# (a)

max_age = titanic["age"].max()

min_age = titanic["age"].min()

print("The oldest person on the titanic was", max_age, "and the youngest person was", str(min_age)+".")
The oldest person on the titanic was 80.0 and the youngest person was 0.1667.
# (b)

percentiles = titanic["fare"].quantile(q = np.arange(start = 0.2,
                                                     stop = 1,
                                                     step = 0.2))
percentiles
0.2     7.8542
0.4    10.5000
0.6    21.6792
0.8    41.5792
Name: fare, dtype: float64

5 Averages

Now we will look at Measures of Central Tendency, often referred to as Averages.

5.1 Mean

Seen briefly in the previous chapter when discussing filling in missing values, the “.mean()” method (this is the arithmetic mean, which corresponds to summing all values and then dividing by the number of values there were) computes the arithmetic mean of a column.

5.1.1 Example

# Compute mean of the fare column

titanic["fare"].mean()
33.29547928134557

5.1.2 Exercise

Confirm that the value given by “.mean()” for the fare column is the same as computing it manually.

Hint: You will need the “.sum()” function to compute the total of the column and the “.count()” function to compute the number of values (we will see this function in some more detail later). Remember that to compare if two objects are equal, we need to use “==”.

mean_way_one = titanic["fare"].mean()

mean_way_two = titanic["fare"].sum()/titanic["fare"].count()

# Remember to compare two objects we use the == logical operator to return True or False

mean_way_one == mean_way_two
True

An interesting observation here is that using “len()” instead of “.count()” would give an answer that is slightly out. This is because “len()” counts all values (including nulls), whereas “.count()” only counts the non-null values (and is what “.mean()” uses in the background).
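We can see the difference on a small toy Series (not one of the course datasets):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, np.nan, 4.0])

print(len(s))      # 4 - every row, nulls included
print(s.count())   # 3 - non-null values only

# Dividing by len() gives a slightly different answer to .mean()
print(s.sum() / len(s))      # 1.75
print(s.sum() / s.count())   # matches s.mean()
```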

5.2 Median

The “.median()” method returns the middle value (50th percentile) when the numbers are listed in order.

5.2.1 Example

titanic["fare"].median()
14.4542

5.3 Mode

The “.mode()” method returns the value (or values) that occur most frequently in the column.

5.3.1 Example 1

titanic["fare"].mode()
0    8.05
Name: fare, dtype: float64

5.3.2 Example 2

We can also find the mode in object-based columns. Here the “name” column has two modes, since both “Connolly, Miss. Kate” and “Kelly, Mr. James” appear twice in the DataFrame.

Generally, when these cases appear the following terminology is used:

  • One mode - Unimodal
  • Two modes - Bimodal
  • Three modes - Trimodal

As such, the name column is bimodal in this example. If you’re wondering how we know the frequency (the number of times they each appear in the DataFrame), we will see a very useful method for this later in the chapter.

titanic["name"].mode()
0    Connolly, Miss. Kate
1        Kelly, Mr. James
Name: name, dtype: object

Interestingly, these rows aren’t duplicates! There really were two separate individuals with each of those names!

titanic[(titanic["name"] == "Connolly, Miss. Kate") | (titanic["name"] == "Kelly, Mr. James")]
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
725 3 1 Connolly, Miss. Kate female 22.0 0 0 370373 7.7500 NaN Q Ireland NaN 13
726 3 0 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q Ireland NaN NaN
924 3 0 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q NaN 70.0 NaN
925 3 0 Kelly, Mr. James male 44.0 0 0 363592 8.0500 NaN S NaN NaN NaN

6 Spread

Another element of the data we often want to investigate is how spread out it is; we use Measures of Spread to do so.

6.1 Standard Deviation

The standard deviation measures how spread out the data distribution is around the mean (average) value.

6.1.1 Example

We can calculate the standard deviation using the “.std()” method.

titanic["fare"].std()
51.758668239174135
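To make the formula concrete: “.std()” computes the sample standard deviation, dividing the summed squared deviations from the mean by n − 1 before taking the square root. A quick check on a toy Series (not the course data):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0])
n = s.count()

# Sample standard deviation: sqrt of summed squared deviations over (n - 1)
manual = np.sqrt(((s - s.mean()) ** 2).sum() / (n - 1))

print(manual)             # 2.0
print(manual == s.std())  # True
```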

6.2 Variance

Variance measures how spread out the values of a variable are in general.

6.2.1 Example

We can calculate the variance using the “.var()” method.

titanic["fare"].var()
2678.9597378928934

6.2.2 Exercise

(a)(i) Find the variance and standard deviation of the “age” column in the titanic dataset.

(a)(ii) Round these two values to 2 decimal places and output the values in a sentence.

(b) There is a link between these two metrics, namely that the Standard Deviation is the square root of the Variance. Confirm this relationship. Hint: To use the square root, we need the “np.sqrt()” function from numpy (“import numpy as np”).
# (a)(i) & (a)(ii)

var_age = round(titanic["age"].var(), ndigits = 2)

std_age = round(titanic["age"].std(), ndigits = 2)

print("The variance of the ages of titanic passengers was", var_age, "and the standard deviation was", str(std_age) + ".")
The variance of the ages of titanic passengers was 207.75 and the standard deviation was 14.41.
# (b) 

var_age = titanic["age"].var()

std_age = titanic["age"].std()

np.sqrt(var_age) == std_age
True

7 Counting Values

7.1 Counts

As we saw in the earlier exercise, we can find the number of non-null values in a column using the “.count()” method. By non-null values we mean values with data in them, not the missing values.

As a reminder we can use either “.shape” or “len()” to find the number of rows, which for titanic is 1309 rows (this includes rows that have missing values).
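As a quick illustration of the difference between these row counts and “.count()”, here is a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})

print(df.shape)         # (3, 1) - rows and columns, nulls included
print(len(df))          # 3 - number of rows
print(df["a"].count())  # 2 - non-null values only
```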

7.1.1 Example

titanic["embarked"].count()
1307

As you can see, “embarked” returns 1307, so there are 2 missing values.

7.2 Null Value counts

We can find how many null values we have by using the “.isna()” method. This returns a Boolean series of “True” and “False” values indicating whether or not each value is null.

7.2.1 Example 1

titanic["age"].isna().tail()
1304    False
1305     True
1306    False
1307    False
1308    False
Name: age, dtype: bool

As these have numeric values behind them (“True” is 1, “False” is 0 ) we can use “.sum()” to total them and obtain the number of null values in a column.

titanic["age"].isna().sum()
263

In older versions of pandas, the method “.isnull()” might be the only one available, but it returns the same output.
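A quick check on a toy Series confirms the two methods agree:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# .isnull() is an alias of .isna(); both flag the same entries
print(s.isna().equals(s.isnull()))  # True
```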

7.2.2 Exercise

Compute the proportion of null values (a decimal between 0 and 1, equivalent to a percentage; for example 0.85 means 85%) in the “age” column. Use the “.isna()” method to compute the number of null values this time.

Hint: To compute this, we first need the number of null values, then divide by the overall number of values in the column, including the nulls!

# Proportion of null values

(titanic["age"].isna().sum())/len(titanic["age"])
0.20091673032849502

7.3 Value Counts

We can find the frequencies of each unique value in a column by using “.value_counts()”. This is particularly useful when applied to object columns to observe what values appear the most and least frequently.

titanic["sex"].value_counts()
sex
male      843
female    466
Name: count, dtype: int64

7.3.1 Exercise

  1. How many passengers were in each class?

  2. Look in the help for “pd.Series.value_counts()” to see how you can return the values as a proportion instead.

# (a)

titanic["pclass"].value_counts()
pclass
3    709
1    323
2    277
Name: count, dtype: int64
# (b)

# Use the parameter and argument  normalize = True to return proportions

# Show the help documentation

#help(pd.Series.value_counts)

titanic["pclass"].value_counts(normalize=True)
pclass
3    0.541635
1    0.246753
2    0.211612
Name: proportion, dtype: float64

8 Other Summary Statistics

8.1 Sum

We can use “.sum()” to add up columns of numeric data, which we saw when counting the null values.

8.1.1 Example 1

titanic["fare"].sum()
43550.4869

The “.sum()” method comes from pandas; there is also an inbuilt function “sum()”. However, if we have null values in the column, the inbuilt function will return “nan”.

8.1.2 Example 2

sum(titanic["fare"])
nan

8.2 Unique

We can use “.unique()” to find values that are unique in a column. This is returned as an array.

8.2.1 Example

titanic["boat"].unique()
array([2, '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
       '7', 'C', '14', '2', '5 9', '13', '1', '15', '5 7', '8 10', '12',
       '16', '13 15 B', 'C D', '15 16', '13 15'], dtype=object)

8.3 Nunique

“.nunique()” can be used to find the number of unique values in a column.

8.3.1 Example

titanic["boat"].nunique()
28

9 Creating Size Bands

9.1 pd.cut()

We can use the function “pd.cut()” to cut or “bin” our data into groups or categories. This is commonly done when creating size bands, like age bands, to create a categorical column out of a numeric one.

“pd.cut()” takes a column of data and groups it into “bins” or categories. The resulting column will have the data type of “category”. There’s some more information about the “category” data type in the pandas documentation.

We often assign the output of “pd.cut()” to a new column. This is because we don’t want to overwrite and change the data type of the existing column.

9.1.1 Example

titanic["binned_ages"] = pd.cut(titanic["age"],
                                bins=10)

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1]
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15]
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15]
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1]
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1]

We set the parameter “bins =” to specify the number of categories that we want. By passing an integer to this, Pandas takes the smallest value and the largest value in the column and creates the number of categories defined.

We can look at our bins. Note the “(” denotes exclusion of that number and the “]” denotes inclusion of that number.

9.1.2 Example 1

Now these are “categories” we can see there is a relationship between each category.

titanic["binned_ages"].unique()
[(24.117, 32.1], (0.0869, 8.15], (40.083, 48.067], (56.05, 64.033], (32.1, 40.083], ..., (64.033, 72.017], (16.133, 24.117], (72.017, 80.0], NaN, (8.15, 16.133]]
Length: 11
Categories (10, interval[float64, right]): [(0.0869, 8.15] < (8.15, 16.133] < (16.133, 24.117] < (24.117, 32.1] ... (48.067, 56.05] < (56.05, 64.033] < (64.033, 72.017] < (72.017, 80.0]]

We can also pass our own values to determine where the edges of the categories are, rather than letting pandas compute these approximate ones. The edges can be given as a list of values or as one of the range-style objects we discussed before.

9.1.3 Example 2

Note that here I use the numpy function “np.arange()”, as the stop value (the maximum age plus one) is a decimal.

titanic["binned_ages2"]  = pd.cut(titanic["age"],  # Data to cut
                                  bins=np.arange(start=0,
                                                 stop=(titanic["age"].max() + 1) , # Remember stop is exclusive!
                                                 step=10))

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages binned_ages2
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1] (20.0, 30.0]
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15] (0.0, 10.0]
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15] (0.0, 10.0]
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1] (20.0, 30.0]
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1] (20.0, 30.0]

This means I don’t have to know the maximum value for the column before I write this piece of code. You can see how we can combine methods quite easily to make our life easier when performing such tasks.

It is important to note that in this part of the code :

  • stop=(titanic[“age”].max() + 1)

I use brackets to enforce the order of operations. If I didn’t add 1 to the stop, the entry at my maximum value would read “NaN”, which I want to avoid.
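A toy sketch of why the “+ 1” matters, using a made-up Series rather than the titanic data:

```python
import pandas as pd
import numpy as np

ages = pd.Series([5.0, 25.0, 80.0])

# Without the + 1 the last bin edge is 70, so 80 falls outside every bin
too_short = pd.cut(ages, bins=np.arange(start=0, stop=80, step=10))
# With the + 1 the last edge is 80, so the maximum value is included
just_right = pd.cut(ages, bins=np.arange(start=0, stop=81, step=10))

print(too_short.isna().iloc[-1])   # True - 80 became NaN
print(just_right.isna().iloc[-1])  # False - 80 sits in (70, 80]
```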

9.1.4 Example 3

We can also add labels to our categories. This time rather than displaying the bin edges it will display the text strings we specify.

This is passed as a list to the parameter “labels=” to the “pd.cut()” function.

titanic["binned_ages3"] = pd.cut(titanic["age"],  # Data to cut
                                 bins=np.arange(start=0,
                                                stop=(titanic["age"].max() + 1),
                                                step=10),
                                 labels=["0 – 10", "11 – 20", "21 – 30",
                                         "31 – 40", "41 – 50", "51 – 60",
                                         "61 – 70", "71 – 80"])

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages binned_ages2 binned_ages3
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1] (20.0, 30.0] 21 – 30
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15] (0.0, 10.0] 0 – 10
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15] (0.0, 10.0] 0 – 10
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1] (20.0, 30.0] 21 – 30
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1] (20.0, 30.0] 21 – 30

Note these bands are approximate; e.g. someone with an age of 20.2 will go into the band labelled “21 – 30”. Integer labels were chosen as they are an easier read, and most ages after 1 are whole numbers.

There’s additional parameters we can set here; check the help function if there’s anything specific you need to do. You will get some more practice with this in the Case Study in Chapter 7.

9.2 pd.qcut()

“pd.qcut()” is described in the documentation as a “Quantile-based discretization function”. This means that “pd.qcut()” tries to put a roughly equal number of observations into each bin, rather than creating bins of equal numeric width as “pd.cut()” does.

9.2.1 Example

In the cell below I’m using “pd.qcut()” to cut the “age” column into 10 groups. This uses the same data and the same number of segments as the “pd.cut()” we did at the start.

# Divide age into 10 equally sized groups.
titanic["age_qcut"] = pd.qcut(titanic["age"],
                              q=10)  # q for quantiles

#View the data
titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages binned_ages2 binned_ages3 age_qcut
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1] (20.0, 30.0] 21 – 30 (28.0, 31.0]
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15] (0.0, 10.0] 0 – 10 (0.166, 14.0]
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15] (0.0, 10.0] 0 – 10 (0.166, 14.0]
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1] (20.0, 30.0] 21 – 30 (28.0, 31.0]
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1] (20.0, 30.0] 21 – 30 (22.0, 25.0]
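The contrast between “pd.cut()” and “pd.qcut()” is easiest to see on skewed data. A small sketch with a made-up Series:

```python
import pandas as pd

# Skewed data: most values are small, a couple are very large
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 50, 100])

# Equal-width bins: nearly everything lands in the first bin
cut_counts = pd.cut(s, bins=2).value_counts()
print(cut_counts)

# Quantile-based bins: counts come out roughly equal
qcut_counts = pd.qcut(s, q=2).value_counts()
print(qcut_counts)
```

With “pd.cut()” the bin edges split the numeric range in half, so the outliers drag one bin far away from the bulk of the data; “pd.qcut()” instead places the edge at the median.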

10 Aggregation

Aggregation means grouping data together by a particular grouping variable (of our choice) and producing a summary of one or more columns for that grouping variable. For example, what if I want to find out if the average fare paid is different for each passenger class? I could use this technique to find out.

We have 3 passenger classes (1st, 2nd and 3rd), which we can check using the “.unique()” method as we saw earlier.

titanic["pclass"].unique()
array([1, 2, 3], dtype=int64)

We’ll use the “.groupby()” method in this tutorial. Pandas also provides the “pd.crosstab()” and “pd.pivot_table()” functions. You can find these in the Reference Material as they are a little more niche in their application.

Grouping can be really useful, especially when your data are disaggregated, e.g. data about individual units of people or things, rather than already summarised by category.

The method “.groupby()” allows us to aggregate by a categorical variable and summarise numerical data into a new DataFrame. It works on a very interesting principle known as split-apply-combine:

  • Split is where a DataFrame is divided into a set of smaller DataFrames based on the grouping variable. If there are n levels of a categorical variable, there are n smaller dataframes.

  • Apply is where an aggregation is applied to each of these smaller dataframes to obtain a single valued output (for example a mean).

  • Combine is where we bring together the aggregated DataFrame rows into a final new DataFrame.
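To demystify the three steps, here is a hand-rolled sketch on a toy DataFrame (the real “.groupby()” does all of this for us in one call):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"],
                   "val": [1.0, 3.0, 10.0]})

# Split: one smaller DataFrame per level of the grouping variable
pieces = {level: sub for level, sub in df.groupby("grp")}

# Apply: a single-valued aggregation (here the mean) on each piece
means = {level: sub["val"].mean() for level, sub in pieces.items()}

# Combine: bring the per-group results back into one object
result = pd.Series(means, name="val")
print(result)  # the same answer as df.groupby("grp")["val"].mean()
```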

10.1 Examples

10.1.1 Example 1

Let’s see an example of this in action, it sounds complex (and it is with what goes on behind the scenes), but the final output is much more readable.

titanic_class_fare = titanic.groupby(by="pclass")["fare"].mean() 

titanic_class_fare
pclass
1    87.508992
2    21.179196
3    13.302889
Name: fare, dtype: float64

I want to find whether the “.mean()” value of the “fare” column differs depending on which class a passenger travelled in. Let’s break the code down:

  • In the “.groupby()” method to the “by=” parameter I pass the column I wish to group by. The column “pclass” has three values, “1” , “2” and “3”.

  • The “.groupby()” behaviour will effectively split the original titanic DataFrame into three new DataFrames. One with the values of “1”, one for “2” and one with the values of “3”. This is the split step.

  • From these new “.groupby()” DataFrames I select the column ["fare"] and apply the summary statistic “.mean()” to it. This is the apply step.

  • The aggregated results are brought back together and returned as “titanic_class_fare”. This is the combine step.

10.1.2 Example 2

We can also create more complicated groupings using more than one variable; here we group first by “pclass” then “embarked”. As with many methods we discussed in earlier chapters, we must specify the column names in a list, where order is important.

titanic.groupby(by=["pclass", "embarked"])["fare"].mean()
pclass  embarked
1       C           106.845330
        Q            90.000000
        S            72.148094
2       C            23.300593
        Q            11.735114
        S            21.206921
3       C            11.021624
        Q            10.390820
        S            14.435422
Name: fare, dtype: float64

To show off why order is important, see the following:

titanic.groupby(by=["embarked", "pclass"])["fare"].mean()
embarked  pclass
C         1         106.845330
          2          23.300593
          3          11.021624
Q         1          90.000000
          2          11.735114
          3          10.390820
S         1          72.148094
          2          21.206921
          3          14.435422
Name: fare, dtype: float64

10.1.3 Example 3

You can also use other summary statistics here, for example “.count()” to return the number of values. The following shows that 141 passengers embarked in Cherbourg (embarked = C) and were pclass 1.

titanic.groupby(by=["embarked", "pclass"])["fare"].count()
embarked  pclass
C         1         141
          2          28
          3         101
Q         1           3
          2           7
          3         113
S         1         177
          2         242
          3         494
Name: fare, dtype: int64

Now let’s do some exercises!

10.1.4 Exercises

  1. Group animals by the column “AnimalClass” and find the “.sum()” of the “IncidentNominalCost(£)” column.

  2. Group animals by the column “Borough” and “AnimalClass” and find the “.mean()” of the “PumpHoursTotal” column.

  3. Reverse the order of the grouping in (b) and observe the differences.

# (a)

animalclass_sum = animals.groupby(by="AnimalClass")['IncidentNominalCost(£)'].sum()

animalclass_sum
AnimalClass
Bird        350487.0
Fish          1560.0
Mammal     1597513.0
Reptile       3730.0
Name: IncidentNominalCost(£), dtype: float64
# (b) 

pump_hours_mean = animals.groupby(by=["Borough", "AnimalClass"])['PumpHoursTotal'].mean()

pump_hours_mean.head()
Borough               AnimalClass
Barking And Dagenham  Bird           1.035714
                      Mammal         1.141667
Barnet                Bird           1.066667
                      Mammal         1.164021
                      Reptile        1.000000
Name: PumpHoursTotal, dtype: float64
# (c) 

pump_hours_mean = animals.groupby(by=["AnimalClass", "Borough"])['PumpHoursTotal'].mean()

pump_hours_mean.head()
AnimalClass  Borough             
Bird         Barking And Dagenham    1.035714
             Barnet                  1.066667
             Bexley                  1.058824
             Brent                   1.131579
             Bromley                 1.200000
Name: PumpHoursTotal, dtype: float64

10.2 Multiple Aggregation

If we want to apply more than one aggregation at once to the columns we’ve selected, there is the “.agg()” method. This takes a dictionary where the key is our column and the value is the aggregation we wish to apply, given as a string.

These aggregation names are slightly different to the methods we’ve seen, but are straightforward strings like “sum”, “count”, “mean” etc. Let’s first see it with one aggregation per column before building up to two.

10.2.1 Example 1

animalclass_sum = animals.groupby(by="AnimalClass").agg({"IncidentNominalCost(£)": "sum",
                                                         "PumpHoursTotal": "mean"})

animalclass_sum
IncidentNominalCost(£) PumpHoursTotal
AnimalClass
Bird 350487.0 1.106912
Fish 1560.0 3.000000
Mammal 1597513.0 1.203162
Reptile 3730.0 1.083333

If we want to apply more than one aggregation to a single column we can pass a list as the dictionary value. Here we do this with the numpy functions, which behave very much like the methods we have used previously, so “np.sum”, “np.mean” etc.

10.2.2 Example 2

Here we will output the sum and the mean for “IncidentNominalCost(£)” and the mean for “PumpHoursTotal”. Notice that since there is only one method applied in the second case, we can just use the standard mean method, it is only when we use lists that numpy comes into play.

animalclass_sum = animals.groupby(by="AnimalClass").agg({"IncidentNominalCost(£)": [np.sum, np.mean],
                                                         "PumpHoursTotal": "mean"})

animalclass_sum
C:\Users\tbalb\AppData\Local\Temp\ipykernel_6440\2031256029.py:1: FutureWarning:

The provided callable <function sum at 0x000002B20CE8C680> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.

C:\Users\tbalb\AppData\Local\Temp\ipykernel_6440\2031256029.py:1: FutureWarning:

The provided callable <function mean at 0x000002B20CE8D760> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
IncidentNominalCost(£) PumpHoursTotal
sum mean mean
AnimalClass
Bird 350487.0 323.029493 1.106912
Fish 1560.0 780.000000 3.000000
Mammal 1597513.0 346.006714 1.203162
Reptile 3730.0 310.833333 1.083333
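Notice that applying a list of aggregations leaves the result with a two-level MultiIndex on the columns: the original column name on top and the aggregation underneath. A common follow-up step is to flatten these into single-level names. Here is a minimal sketch using a small made-up DataFrame (not the animals data):

```python
import pandas as pd

# Small illustrative DataFrame, standing in for the animals data
df = pd.DataFrame({
    "AnimalClass": ["Bird", "Bird", "Mammal"],
    "Cost": [100.0, 200.0, 300.0],
})

agg = df.groupby(by="AnimalClass").agg({"Cost": ["sum", "mean"]})

# The columns are now a two-level MultiIndex: ("Cost", "sum"), ("Cost", "mean").
# Joining each tuple with an underscore gives flat, single-level names.
agg.columns = ["_".join(col) for col in agg.columns]

print(agg)
```

After flattening, the columns are plain strings like “Cost_sum” and “Cost_mean”, which are easier to select and to pass to plotting functions.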

10.2.3 Exercise

Group the animals dataset by the variable “Borough” and obtain the following summary statistics:

  • The variance and standard deviation of the “IncidentNominalCost(£)” column.

  • The mean of the “PumpCount” column.

  • The sum of the “PumpHoursTotal” column.

animals_aggregates = animals.groupby(by="Borough").agg({"IncidentNominalCost(£)": ["var", "std"],
                "PumpCount": "mean",
                "PumpHoursTotal": "sum"})

animals_aggregates
IncidentNominalCost(£) PumpCount PumpHoursTotal
var std mean sum
Borough
Barking And Dagenham 17478.446543 132.206076 1.006757 166.0
Barnet 18924.149664 137.565074 1.017021 269.0
Bexley 101288.188050 318.258053 1.050000 232.0
Brent 14903.635693 122.080448 1.012903 170.0
Brentwood NaN NaN 1.000000 1.0
Bromley 96915.276777 311.312185 1.037915 288.0
Broxbourne NaN NaN 1.000000 1.0
Camden 101271.452381 318.231759 1.035503 207.0
City Of London 5970.090909 77.266363 1.000000 12.0
Croydon 12917.137020 113.653583 1.000000 273.0
Ealing 12108.624975 110.039197 1.019900 222.0
Enfield 52486.690464 229.099739 1.072874 324.0
Epping Forest 101180.333333 318.088562 1.333333 7.0
Greenwich 27466.981031 165.731654 1.016575 212.0
Hackney 11048.169623 105.110274 1.000000 223.0
Hammersmith And Fulham 24555.948524 156.703378 1.020833 161.0
Haringey 19360.601034 139.142377 1.014778 226.0
Harrow 14084.828954 118.679522 1.009346 122.0
Havering 43320.559880 208.135917 1.035714 218.0
Hillingdon 81438.478868 285.374279 1.048128 252.0
Hounslow 61567.012384 248.127009 1.032680 188.0
Islington 27034.293488 164.421086 1.005025 218.0
Kensington And Chelsea 9777.240258 98.879928 1.008000 135.0
Kingston Upon Thames 33175.594343 182.141688 1.008850 133.0
Lambeth 10143.245118 100.713679 1.018018 245.0
Lewisham 10361.723148 101.792550 1.000000 205.0
Merton 13145.226087 114.652632 1.000000 126.0
Newham 22322.972614 149.408743 1.005128 222.0
Redbridge 29134.820504 170.689251 1.016304 223.0
Richmond Upon Thames 21911.281480 148.024598 1.012739 183.0
Southwark 19417.532703 139.346807 1.000000 253.0
Sutton 11911.604924 109.140299 1.015748 139.0
Tandridge NaN NaN 1.000000 1.0
Tower Hamlets 64299.889716 253.574229 1.051643 267.0
Waltham Forest 111310.958494 333.632970 1.049261 266.0
Wandsworth 34465.758256 185.649558 1.027778 215.0
Westminster 29891.372414 172.891216 1.034483 164.0

In later versions of pandas (from 0.25.0) named aggregation also exists, which lets you name the output columns directly. The help guide will help you implement this.

Note that the DataFrames displayed when you run the code look a little different from the ones I have displayed here. You will see that in yours, the index is made up of our grouping categories. We can use our “.reset_index()” method here too, or pass “as_index=False” to “.groupby()”. There are a few reasons we might do this, including making visualisation easier.

group_by_dropped_index = titanic.groupby(by=["embarked", "pclass"], as_index=False)["fare"].count()
group_by_dropped_index
embarked pclass fare
0 C 1 141
1 C 2 28
2 C 3 101
3 Q 1 3
4 Q 2 7
5 Q 3 113
6 S 1 177
7 S 2 242
8 S 3 494
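The same flat result can be produced by chaining “.reset_index()” onto a normal grouped result instead of passing “as_index=False”. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

# Small illustrative DataFrame, standing in for the titanic data
df = pd.DataFrame({
    "embarked": ["C", "C", "S"],
    "fare": [10.0, 20.0, 30.0],
})

# Option 1: group normally, then move the group labels back into columns
counts = df.groupby(by="embarked")["fare"].count().reset_index()

# Option 2: ask groupby not to use the group labels as the index at all
counts_alt = df.groupby(by="embarked", as_index=False)["fare"].count()

print(counts)
```

Both approaches give the same DataFrame with a default integer index, which is usually the more convenient shape for plotting or further merging.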

11 Chapter Summary

Excellent job, you have completed Chapter 6 of the Introduction to Python course. There is only one chapter left before you complete Core Part 1 of the course! Absolutely amazing aggregating!

Chapter 7 will feature a structured end-of-course exercise that tests you on everything you have learned from Chapter 3 to Chapter 6, with a new dataset we have yet to explore, the schools dataset.