Chapter 6 - Summary Statistics and Aggregation

1 Chapter Overview and Learning Objectives

  • Packages and Data

    • Packages
    • Data
  • Overall Descriptive Statistics

  • Range

    • min
    • max
    • quantiles
  • Averages

    • Mean
    • Median
    • Mode
  • Spread

    • Standard Deviation
    • Variance
  • Counting Values

    • Counts
    • Null Value counts
    • Value Counts
  • Other Summary Statistics

  • Creating Size Bands

  • Aggregation

    • Single Aggregation
    • Multiple Aggregation

2 Packages and Datasets

2.1 Packages

As a reminder, we should always import our packages at the top of our script. In this session we will use the following:

  • pandas, and give it the nickname “pd”
  • numpy and give it the nickname “np”

2.1.1 Exercise

Import Pandas and give it the nickname pd. Import Numpy and give it the nickname np.

import pandas as pd
import numpy as np

2.2 Datasets

Good practice also dictates that we read in our datasets at the top of our script too.

In this session we’ll be using:

  • animals - animals.csv
  • titanic - titanic.xlsx

2.2.1 Exercise

Load in the datasets listed above.

You can check your variables are loaded by using “%whos” in Jupyter. In Spyder or other IDEs they should appear in your variable explorer.

animals = pd.read_csv("../data/animals.csv")
titanic = pd.read_excel("../data/titanic.xlsx")

If you struggle with this section, feel free to review the content of Chapter 3 that covered this in depth. Practicing these commands that we repeat over and over is a great way to retain the good practice guidelines.

3 Overall Descriptive Statistics

Pandas has an inbuilt method to get basic descriptive statistics across our dataset: the “.describe()” method.

3.1 Example

titanic.describe()
pclass survived age sibsp parch fare body
count 1309.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000 121.000000
mean 2.294882 0.381971 29.881135 0.498854 0.385027 33.295479 160.809917
std 0.837836 0.486055 14.413500 1.041658 0.865560 51.758668 97.696922
min 1.000000 0.000000 0.166700 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000 7.895800 72.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000 14.454200 155.000000
75% 3.000000 1.000000 39.000000 1.000000 0.000000 31.275000 256.000000
max 3.000000 1.000000 80.000000 8.000000 9.000000 512.329200 328.000000

We can also get summary statistics on a specific column.

titanic["fare"].describe()
count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: fare, dtype: float64

These statistics are explained in more detail below:

Summary Statistic Description
count the number (count) of non-missing entries in the given column.
mean the average (arithmetic mean) data value in the given column.
std the standard deviation (spread) of values in the given column.
min the smallest value in the given column.
25% the value of the data at the lower quartile
(i.e. after the first 25% of data, ordered from smallest to largest).
50% the middle value of the data (aka the median).
half the values are larger than this value, and half smaller.
75% the value of the data at the upper quartile
(i.e. after the first 75% of data, ordered from smallest to largest).
max the maximum data value recorded.

3.2 Example 2

Describe works across all numeric columns by default. We can display descriptive information for other data types by using the parameter “include=”.

This parameter takes a list as the input; even if we’re just including one kind of data. Here we’re specifying that we want to include “object”, our text/string columns.

titanic.describe(include=["object"])
name sex ticket cabin embarked home.dest boat
count 1309 1309 1309 295 1307 745 486
unique 1307 2 929 186 3 369 28
top Connolly, Miss. Kate male CA. 2343 C23 C25 C27 S New York, NY 13
freq 2 843 11 6 914 64 39
Summary Statistic Description
count The number (count) of non-missing entries in the given column.
unique The number of unique variables in a column.
top The most frequently occurring value.
freq The frequency of the “top” value (how often it appears).

If there are two or more “top” values, i.e. multiple values that share the same highest frequency, pandas will arbitrarily choose one of them to report as the top value (this is a known pandas quirk).

In this data there are two women who share one name and two men who share another. Pandas will choose one of these names to display.
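A quick toy illustration of this tie-breaking behaviour, using a made-up Series rather than the course data:

```python
import pandas as pd

# Two values ("a" and "b") are tied for most frequent
s = pd.Series(["a", "a", "b", "b", "c"])

summary = s.describe()
print(summary["freq"])  # 2 - the shared highest frequency
print(summary["top"])   # pandas arbitrarily reports one of "a" or "b"
```

Whichever value ends up as “top”, the “freq” row still tells you how often it appears.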

4 Range

We can also access these summary statistics individually. In most cases the name of the method is the same as the summary statistic.

4.1 min

We can use “.min()” to return the minimum value in a column.

4.1.1 Example

titanic["fare"].min()
0.0

This also works for object (text) columns.

titanic["name"].min()
'Abbing, Mr. Anthony'

4.2 max

We can use “.max()” to return the maximum value in a column.

4.2.1 Example

titanic["fare"].max()
512.3292

This also works for object (text) columns.

titanic["name"].max()
'van Melkebeke, Mr. Philemon'

Something important to note here is that pandas effectively assigns a value to each letter, ordering capitals A–Z before lower case a–z. So a lower case “a” is treated as coming after a capital “Z” in Python.

This is why van Melkebeke, Mr. Philemon is the maximum value in our Titanic DataFrame rather than Zimmerman, Mr. Leo. We can handle this issue by ensuring our data is either all lower case or all upper case before finding the “.min()” or “.max()” values.

4.2.2 Example

Here I have handled the issue by chaining the methods “.str.lower()” and “.max()” together. If I wanted to do more work with this column in future I may consider overwriting it with a lower case version (which would form part of the cleaning routine).

titanic["name"].str.lower().max()
'zimmerman, mr. leo'

The chaining of methods here applies left to right just as it is read, the column is selected, then the “.lower()” method is used from “str” methods, and then the “.max()” method is applied to the lower case column.

4.3 Quantiles

We can use “.quantile()” to find information at different points of our data.

4.3.1 Example

Our parameter is “q=” and then a decimal number. If we don’t specify this the default behaviour is “0.5”, which returns the median (as it is the 50% quantile).

titanic["fare"].quantile(q=0.25)
7.8958

If we wish to specify more than one, we pass a list to the parameter “q=”

titanic["fare"].quantile(q=[0, 0.25, 0.5, 0.75, 1])
0.00      0.0000
0.25      7.8958
0.50     14.4542
0.75     31.2750
1.00    512.3292
Name: fare, dtype: float64

This is a great time to mention that you don’t necessarily need to type a full list of numbers this way; the numbers we want could belong to an enormous list that is unreasonable to type by hand. This is where the range functions come in.

There are two of these functions:

range(start, stop, step) — where start specifies where to start, stop specifies where to stop (which is exclusive! Not included) and step is how we jump from number to number; all three must be integers. Note that the built-in “range()” only accepts these as positional arguments, not keywords.

np.arange(start=, stop=, step=) comes from the numpy package (“np”) and creates a range from the start to the stop values (where the stop is exclusive). However, this function allows us to step in increments of any size, including floats (or decimal numbers).

Let’s see two quick examples before using this with the “quantile()” function.

4.3.2 Example

# Creating a range object 

x = range(0, 10, 2) # Jump from 0-10 in increments of 2

# Using np.arange() to obtain an array of values

y = np.arange(0, 10, 0.5) # Jump from 0-10 in increments of 0.5

print(x, y)
range(0, 10, 2) [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5 6.  6.5 7.  7.5 8.  8.5
 9.  9.5]

You see here that the “range()” function gives a range object, rather than a list-like structure of the values themselves, which makes it less readable and restricts its use. It is recommended that “np.arange()” is used in most circumstances. Whilst the second output looks like a list, it is actually an array: a numpy-specific object that is beyond the scope of this course.

For more information consult this Numpy for Beginners tutorial.
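If you do want to see the values inside a range object, you can convert it with the built-in “list()” function:

```python
# Converting a range object to a list makes its values visible
x = range(0, 10, 2)
print(list(x))  # [0, 2, 4, 6, 8]
```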

4.3.3 Example 2

Let’s see how this function helps when finding quantiles.

# Note that .arange() comes from the numpy package (np)


titanic["fare"].quantile(q=np.arange(start=0.0, 
                                     stop=1.1,  # Remember this is exclusive!
                                     step=0.1)) 
0.0      0.00000
0.1      7.56750
0.2      7.85420
0.3      8.05000
0.4     10.50000
0.5     14.45420
0.6     21.67920
0.7     26.95500
0.8     41.57920
0.9     78.05082
1.0    512.32920
Name: fare, dtype: float64

4.3.4 Exercise

  1. How old are the youngest and oldest passengers in the titanic DataFrame?

  2. Print out the 20th, 40th, 60th and 80th percentiles of the fare column in the titanic DataFrame.

# (a)

max_age = titanic["age"].max()

min_age = titanic["age"].min()

print("The oldest person on the titanic was", max_age, "and the youngest person was", str(min_age)+".")
The oldest person on the titanic was 80.0 and the youngest person was 0.1667.
# (b)

percentiles = titanic["fare"].quantile(q = np.arange(start = 0.2,
                                                     stop = 1,
                                                     step = 0.2))
percentiles
0.2     7.8542
0.4    10.5000
0.6    21.6792
0.8    41.5792
Name: fare, dtype: float64

5 Averages

Now we will look at Measures of Central Tendency, often referred to as Averages.

5.1 Mean

Seen briefly in the previous chapter when discussing filling in missing values, the “.mean()” method (this is the arithmetic mean, which corresponds to summing all values and then dividing by the number of values there were) computes the arithmetic mean of a column.

5.1.1 Example

# Compute mean of the fare column

titanic["fare"].mean()
33.29547928134557

5.1.2 Exercise

Confirm that the value given by “.mean()” for the fare column is the same as computing it manually.

Hint: You will need the “.sum()” function to compute the total of the column and the “.count()” function to compute the number of values (we will see this function in some more detail later). Remember that to compare if two objects are equal, we need to use “==”.

mean_way_one = titanic["fare"].mean()

mean_way_two = titanic["fare"].sum()/titanic["fare"].count()

# Remember to compare two objects we use the == logical operator to return True or False

mean_way_one == mean_way_two
True

An interesting observation here is that using “len()” instead of “.count()” would give an answer that is slightly out. This is because “len()” counts all values (including nulls), whereas “.count()” only counts the non-null values (and is what “.mean()” uses in the background).
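We can see the difference on a small toy Series (not one of the course datasets):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, np.nan, 4.0])

print(len(s))      # 4 - every row, nulls included
print(s.count())   # 3 - non-null values only

# Dividing by len() gives a slightly different answer to .mean()
print(s.sum() / len(s))      # 1.75
print(s.sum() / s.count())   # matches s.mean()
```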

5.2 Median

The “.median()” method returns the middle value (50th percentile) when the numbers are listed in order.

5.2.1 Example

titanic["fare"].median()
14.4542

5.3 Mode

The “.mode()” method returns the value (or values) that occur most frequently in the column.

5.3.1 Example 1

titanic["fare"].mode()
0    8.05
Name: fare, dtype: float64

5.3.2 Example 2

We can also find the mode in object-based columns. Here the “name” column has two modes, since both “Connolly, Miss. Kate” and “Kelly, Mr. James” appear twice in the DataFrame.

Generally, when these cases appear the following terminology is used:

  • One mode - Unimodal
  • Two modes - Bimodal
  • Three modes - Trimodal

As such, the name column is bimodal in this example. If you’re wondering how we know the frequency (the number of times they each appear in the DataFrame), we will see a very useful method for this later in the chapter.

titanic["name"].mode()
0    Connolly, Miss. Kate
1        Kelly, Mr. James
Name: name, dtype: object

Interestingly, these rows aren’t duplicates! There really were two separate individuals with each of those names!

titanic[(titanic["name"] == "Connolly, Miss. Kate") | (titanic["name"] == "Kelly, Mr. James")]
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
725 3 1 Connolly, Miss. Kate female 22.0 0 0 370373 7.7500 NaN Q Ireland NaN 13
726 3 0 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q Ireland NaN NaN
924 3 0 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q NaN 70.0 NaN
925 3 0 Kelly, Mr. James male 44.0 0 0 363592 8.0500 NaN S NaN NaN NaN

6 Spread

Another element of the data we often want to investigate is how spread out it is; we use Measures of Spread to do so.

6.1 Standard Deviation

The standard deviation measures how spread out the data distribution is around the mean (average) value.

6.1.1 Example

We can calculate the standard deviation using the “.std()” method.

titanic["fare"].std()
51.758668239174135
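To make the formula concrete: “.std()” computes the sample standard deviation, dividing the summed squared deviations from the mean by n − 1 before taking the square root. A quick check on a toy Series (not the course data):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0])
n = s.count()

# Sample standard deviation: sqrt of summed squared deviations over (n - 1)
manual = np.sqrt(((s - s.mean()) ** 2).sum() / (n - 1))

print(manual)             # 2.0
print(manual == s.std())  # True
```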

6.2 Variance

Variance measures how spread out the values of a variable are in general.

6.2.1 Example

We can calculate the variance using the “.var()” method.

titanic["fare"].var()
2678.9597378928934

6.2.2 Exercise

(a)(i) Find the variance and standard deviation of the “age” column in the titanic dataset.

(a)(ii) Round these two values to 2 decimal places and output the values in a sentence.

(b) There is a link between these two metrics, namely that the Standard Deviation is the square root of the Variance. Confirm this relationship. Hint: To use the square root, we need the “np.sqrt()” function from numpy (“import numpy as np”).
# (a)(i) & (a)(ii)

var_age = round(titanic["age"].var(), ndigits = 2)

std_age = round(titanic["age"].std(), ndigits = 2)

print("The variance of the ages of titanic passengers was", var_age, "and the standard deviation was", str(std_age) + ".")
The variance of the ages of titanic passengers was 207.75 and the standard deviation was 14.41.
# (b) 

var_age = titanic["age"].var()

std_age = titanic["age"].std()

np.sqrt(var_age) == std_age
True

7 Counting Values

7.1 Counts

As we saw in the earlier exercise, we can find the number of non-null values in a column using the “.count()” method. By non-null values we mean values with data in them, not the missing values.

As a reminder we can use either “.shape” or “len()” to find the number of rows, which for titanic is 1309 rows (this includes rows that have missing values).
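As a quick illustration of the difference between these row counts and “.count()”, here is a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})

print(df.shape)         # (3, 1) - rows and columns, nulls included
print(len(df))          # 3 - number of rows
print(df["a"].count())  # 2 - non-null values only
```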

7.1.1 Example

titanic["embarked"].count()
1307

As you can see, “embarked” returns 1307, so there are 2 missing values.

7.2 Null Value counts

We can find how many null values we have by using the “.isna()” method. This returns a Boolean series of “True” and “False” values indicating whether or not each value is null.

7.2.1 Example 1

titanic["age"].isna().tail()
1304    False
1305     True
1306    False
1307    False
1308    False
Name: age, dtype: bool

As these have numeric values behind them (“True” is 1, “False” is 0 ) we can use “.sum()” to total them and obtain the number of null values in a column.

titanic["age"].isna().sum()
263

In older versions of pandas, the method “.isnull()” might be the only one available, but it returns the same output.
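A quick check on a toy Series confirms the two methods agree:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# .isnull() is an alias of .isna(); both flag the same entries
print(s.isna().equals(s.isnull()))  # True
```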

7.2.2 Exercise

Compute the proportion of null values (a decimal between 0 and 1, equivalent to a percentage; for example 0.85 means 85%) in the “age” column. Use the “.isna()” method to compute the number of null values this time.

Hint: To compute this, we first need the number of null values, then divide by the overall number of values in the column, including the nulls!

# Proportion of null values

(titanic["age"].isna().sum())/len(titanic["age"])
0.20091673032849502

7.3 Value Counts

We can find the frequencies of each unique value in a column by using “.value_counts()”. This is particularly useful when applied to object columns to observe what values appear the most and least frequently.

titanic["sex"].value_counts()
sex
male      843
female    466
Name: count, dtype: int64

7.3.1 Exercise

  1. How many passengers were in each class?

  2. Look in the help for “pd.Series.value_counts()” to see how you can return the values as a proportion instead.

# (a)

titanic["pclass"].value_counts()
pclass
3    709
1    323
2    277
Name: count, dtype: int64
# (b)

# Use the parameter and argument  normalize = True to return proportions

# Show the help documentation

#help(pd.Series.value_counts)

titanic["pclass"].value_counts(normalize=True)
pclass
3    0.541635
1    0.246753
2    0.211612
Name: proportion, dtype: float64

8 Other Summary Statistics

8.1 Sum

We can use “.sum()” to add up columns of numeric data, which we saw when counting the null values.

8.1.1 Example 1

titanic["fare"].sum()
43550.4869

The “.sum()” method comes from pandas; there is also an inbuilt function “sum()”. However, if we have null values in the column, the inbuilt function will return “nan”.

8.1.2 Example 2

sum(titanic["fare"])
nan

8.2 Unique

We can use “.unique()” to find values that are unique in a column. This is returned as an array.

8.2.1 Example

titanic["boat"].unique()
array([2, '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
       '7', 'C', '14', '2', '5 9', '13', '1', '15', '5 7', '8 10', '12',
       '16', '13 15 B', 'C D', '15 16', '13 15'], dtype=object)

8.3 Nunique

“.nunique()” can be used to find the number of unique values in a column.

8.3.1 Example

titanic["boat"].nunique()
28

9 Creating Size Bands

9.1 pd.cut()

We can use the function “pd.cut()” to cut or “bin” our data into groups or categories. This is commonly done when creating size bands, like age bands, to create a categorical column out of a numeric one.

“pd.cut()” takes a column of data and groups it into “bins” or categories. The resulting column will have the data type of “category”. There’s some more information about the “category” data type in the pandas documentation.

We often assign the output of “pd.cut()” to a new column. This is because we don’t want to overwrite and change the data type of the existing column.

9.1.1 Example

titanic["binned_ages"] = pd.cut(titanic["age"],
                                bins=10)

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1]
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15]
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15]
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1]
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1]

We set the parameter “bins =” to specify the number of categories that we want. By passing an integer to this, Pandas takes the smallest value and the largest value in the column and creates the number of categories defined.

We can look at our bins. Note the “(” denotes exclusion of that number and the “]” denotes inclusion of that number.

9.1.2 Example 1

Now these are “categories” we can see there is a relationship between each category.

titanic["binned_ages"].unique()
[(24.117, 32.1], (0.0869, 8.15], (40.083, 48.067], (56.05, 64.033], (32.1, 40.083], ..., (64.033, 72.017], (16.133, 24.117], (72.017, 80.0], NaN, (8.15, 16.133]]
Length: 11
Categories (10, interval[float64, right]): [(0.0869, 8.15] < (8.15, 16.133] < (16.133, 24.117] < (24.117, 32.1] ... (48.067, 56.05] < (56.05, 64.033] < (64.033, 72.017] < (72.017, 80.0]]

We can also pass our own values to determine where the edges of the categories are, rather than letting pandas compute these approximate ones. The edges can be given as a list of values or as one of the range-style objects we discussed before.

9.1.3 Example 2

Note that here I use the numpy function “np.arange()”, as the stop value (the maximum age plus one) is a decimal.

titanic["binned_ages2"]  = pd.cut(titanic["age"],  # Data to cut
                                  bins=np.arange(start=0,
                                                 stop=(titanic["age"].max() + 1) , # Remember stop is exclusive!
                                                 step=10))

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages binned_ages2
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1] (20.0, 30.0]
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15] (0.0, 10.0]
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15] (0.0, 10.0]
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1] (20.0, 30.0]
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1] (20.0, 30.0]

This means I don’t have to know the maximum value for the column before I write this piece of code. You can see how we can combine methods quite easily to make our life easier when performing such tasks.

It is important to note that in this part of the code :

  • stop=(titanic[“age”].max() + 1)

I use brackets to enforce the order of operations. If I didn’t add 1 to the stop, the entry at my maximum value would read “NaN”, which I want to avoid.
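A toy sketch of why the “+ 1” matters, using a made-up Series rather than the titanic data:

```python
import pandas as pd
import numpy as np

ages = pd.Series([5.0, 25.0, 80.0])

# Without the + 1 the last bin edge is 70, so 80 falls outside every bin
too_short = pd.cut(ages, bins=np.arange(start=0, stop=80, step=10))
# With the + 1 the last edge is 80, so the maximum value is included
just_right = pd.cut(ages, bins=np.arange(start=0, stop=81, step=10))

print(too_short.isna().iloc[-1])   # True - 80 became NaN
print(just_right.isna().iloc[-1])  # False - 80 sits in (70, 80]
```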

9.1.4 Example 3

We can also add labels to our categories. This time rather than displaying the bin edges it will display the text strings we specify.

This is passed as a list to the parameter “labels=” to the “pd.cut()” function.

titanic["binned_ages3"] = pd.cut(titanic["age"],  # Data to cut
                                 bins=np.arange(start=0,
                                                stop=(titanic["age"].max() + 1),
                                                step=10),
                                 labels=["0 – 10", "11 – 20", "21 – 30",
                                         "31 – 40", "41 – 50", "51 – 60",
                                         "61 – 70", "71 – 80"])

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages binned_ages2 binned_ages3
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1] (20.0, 30.0] 21 – 30
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15] (0.0, 10.0] 0 – 10
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15] (0.0, 10.0] 0 – 10
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1] (20.0, 30.0] 21 – 30
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1] (20.0, 30.0] 21 – 30

Note these bands are approximate; e.g. someone with an age of 20.2 will go into the band labelled “21 – 30”. Integer labels were chosen as they are an easier read, and most ages after 1 are whole numbers.

There’s additional parameters we can set here; check the help function if there’s anything specific you need to do. You will get some more practice with this in the Case Study in Chapter 7.

9.2 pd.qcut()

“pd.qcut()” is described in the documentation as a “Quantile-based discretization function”. This means that “pd.qcut()” tries to put a roughly equal number of observations into each bin, rather than creating bins of equal numeric width as “pd.cut()” does.

9.2.1 Example

In the cell below I’m using “pd.qcut()” to cut the “age” column into 10 groups. This uses the same data and the same number of segments as the “pd.cut()” we did at the start.

# Divide age into 10 equally sized groups.
titanic["age_qcut"] = pd.qcut(titanic["age"],
                              q=10)  # q for quantiles

#View the data
titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat binned_ages binned_ages2 binned_ages3 age_qcut
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 (24.117, 32.1] (20.0, 30.0] 21 – 30 (28.0, 31.0]
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 (0.0869, 8.15] (0.0, 10.0] 0 – 10 (0.166, 14.0]
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (0.0869, 8.15] (0.0, 10.0] 0 – 10 (0.166, 14.0]
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN (24.117, 32.1] (20.0, 30.0] 21 – 30 (28.0, 31.0]
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN (24.117, 32.1] (20.0, 30.0] 21 – 30 (22.0, 25.0]
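The contrast between “pd.cut()” and “pd.qcut()” is easiest to see on skewed data. A small sketch with a made-up Series:

```python
import pandas as pd

# Skewed data: most values are small, a couple are very large
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 50, 100])

# Equal-width bins: nearly everything lands in the first bin
cut_counts = pd.cut(s, bins=2).value_counts()
print(cut_counts)

# Quantile-based bins: counts come out roughly equal
qcut_counts = pd.qcut(s, q=2).value_counts()
print(qcut_counts)
```

With “pd.cut()” the bin edges split the numeric range in half, so the outliers drag one bin far away from the bulk of the data; “pd.qcut()” instead places the edge at the median.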

10 Aggregation

Aggregation means grouping data together by a particular grouping variable (of our choice) and producing a summary of one or more columns for that grouping variable. For example, what if I want to find out if the average fare paid is different for each passenger class? I could use this technique to find out.

We have 3 passenger classes (1st, 2nd and 3rd), which we can check using the “.unique()” method as we saw earlier.

titanic["pclass"].unique()
array([1, 2, 3], dtype=int64)

We’ll use the “.groupby()” method in this tutorial. Pandas also provides the “pd.crosstab()” and “pd.pivot_table()” functions. You can find these in the Reference Material as they are a little more niche in their application.

Grouping can be really useful, especially when your data are disaggregated, e.g. data about individual units of people or things, rather than already summarised by category.

The method “.groupby()” allows us to aggregate by a categorical variable and summarise numerical data into a new DataFrame. It works on a very interesting principle known as split-apply-combine:

  • Split is where a DataFrame is divided into a set of smaller DataFrames based on the grouping variable. If there are n levels of a categorical variable, there are n smaller dataframes.

  • Apply is where an aggregation is applied to each of these smaller dataframes to obtain a single valued output (for example a mean).

  • Combine is where we bring together the aggregated DataFrame rows into a final new DataFrame.
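To demystify the three steps, here is a hand-rolled sketch on a toy DataFrame (the real “.groupby()” does all of this for us in one call):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"],
                   "val": [1.0, 3.0, 10.0]})

# Split: one smaller DataFrame per level of the grouping variable
pieces = {level: sub for level, sub in df.groupby("grp")}

# Apply: a single-valued aggregation (here the mean) on each piece
means = {level: sub["val"].mean() for level, sub in pieces.items()}

# Combine: bring the per-group results back into one object
result = pd.Series(means, name="val")
print(result)  # the same answer as df.groupby("grp")["val"].mean()
```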

10.1 Examples

10.1.1 Example 1

Let’s see an example of this in action, it sounds complex (and it is with what goes on behind the scenes), but the final output is much more readable.

titanic_class_fare = titanic.groupby(by="pclass")["fare"].mean() 

titanic_class_fare
pclass
1    87.508992
2    21.179196
3    13.302889
Name: fare, dtype: float64

I want to find whether the “.mean()” value of the “fare” column differs depending on which class a passenger travelled in. Let’s break the code down:

  • In the “.groupby()” method to the “by=” parameter I pass the column I wish to group by. The column “pclass” has three values, “1” , “2” and “3”.

  • The “.groupby()” behaviour will effectively split the original titanic DataFrame into three new DataFrames. One with the values of “1”, one for “2” and one with the values of “3”. This is the split step.

  • From these new “.groupby()” DataFrames I select the column ["fare"] and apply the summary statistic “.mean()” to it. This is the apply step.

  • The aggregated results are brought back together and returned as “titanic_class_fare”. This is the combine step.

10.1.2 Example 2

We can also create more complicated groupings using more than one variable; here we group first by “pclass” then “embarked”. As with many methods we discussed in earlier chapters, we must specify the column names in a list, where order is important.

titanic.groupby(by=["pclass", "embarked"])["fare"].mean()
pclass  embarked
1       C           106.845330
        Q            90.000000
        S            72.148094
2       C            23.300593
        Q            11.735114
        S            21.206921
3       C            11.021624
        Q            10.390820
        S            14.435422
Name: fare, dtype: float64

To show off why order is important, see the following:

titanic.groupby(by=["embarked", "pclass"])["fare"].mean()
embarked  pclass
C         1         106.845330
          2          23.300593
          3          11.021624
Q         1          90.000000
          2          11.735114
          3          10.390820
S         1          72.148094
          2          21.206921
          3          14.435422
Name: fare, dtype: float64

10.1.3 Example 3

You can also use other summary statistics here, for example “.count()” to return the number of values. The following shows that 141 passengers embarked in Cherbourg (embarked = C) and were pclass 1.

titanic.groupby(by=["embarked", "pclass"])["fare"].count()
embarked  pclass
C         1         141
          2          28
          3         101
Q         1           3
          2           7
          3         113
S         1         177
          2         242
          3         494
Name: fare, dtype: int64

Now let’s do some exercises!

10.1.4 Exercises

  1. Group animals by the column “AnimalClass” and find the “.sum()” of the “IncidentNominalCost(£)” column.

  2. Group animals by the column “Borough” and “AnimalClass” and find the “.mean()” of the “PumpHoursTotal” column.

  3. Reverse the order of the grouping in (b) and observe the differences.

# (a)

animalclass_sum = animals.groupby(by="AnimalClass")['IncidentNominalCost(£)'].sum()

animalclass_sum
AnimalClass
Bird        350487.0
Fish          1560.0
Mammal     1597513.0
Reptile       3730.0
Name: IncidentNominalCost(£), dtype: float64
# (b) 

pump_hours_mean = animals.groupby(by=["Borough", "AnimalClass"])['PumpHoursTotal'].mean()

pump_hours_mean.head()
Borough               AnimalClass
Barking And Dagenham  Bird           1.035714
                      Mammal         1.141667
Barnet                Bird           1.066667
                      Mammal         1.164021
                      Reptile        1.000000
Name: PumpHoursTotal, dtype: float64
# (c) 

pump_hours_mean = animals.groupby(by=["AnimalClass", "Borough"])['PumpHoursTotal'].mean()

pump_hours_mean.head()
AnimalClass  Borough             
Bird         Barking And Dagenham    1.035714
             Barnet                  1.066667
             Bexley                  1.058824
             Brent                   1.131579
             Bromley                 1.200000
Name: PumpHoursTotal, dtype: float64

10.2 Multiple Aggregation

If we want to apply more than one aggregation at once to the columns we’ve selected, there is the “.agg()” method. This takes a dictionary where the key is our column and the value is the aggregation we wish to apply, given as a string.

These aggregation names are slightly different to the methods we’ve seen, but are straightforward strings like “sum”, “count”, “mean” etc. Let’s first see it with one aggregation per column before building up to two.

10.2.1 Example 1

animalclass_sum = animals.groupby(by="AnimalClass").agg({"IncidentNominalCost(£)": "sum",
                                                         "PumpHoursTotal": "mean"})

animalclass_sum
IncidentNominalCost(£) PumpHoursTotal
AnimalClass
Bird 350487.0 1.106912
Fish 1560.0 3.000000
Mammal 1597513.0 1.203162
Reptile 3730.0 1.083333

If we want to apply more than one aggregation to a single column we can pass a list as the dictionary value. Here we do this with the numpy functions, which behave very much like the methods we have used previously, so “np.sum”, “np.mean” etc.

10.2.2 Example 2

Here we will output the sum and the mean for “IncidentNominalCost(£)” and the mean for “PumpHoursTotal”. Notice that since there is only one method applied in the second case, we can just use the standard mean method, it is only when we use lists that numpy comes into play.

animalclass_sum = animals.groupby(by="AnimalClass").agg({"IncidentNominalCost(£)": [np.sum, np.mean],
                                                         "PumpHoursTotal": "mean"})

animalclass_sum
C:\Users\tbalb\AppData\Local\Temp\ipykernel_6440\2031256029.py:1: FutureWarning:

The provided callable <function sum at 0x000002B20CE8C680> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.

C:\Users\tbalb\AppData\Local\Temp\ipykernel_6440\2031256029.py:1: FutureWarning:

The provided callable <function mean at 0x000002B20CE8D760> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
IncidentNominalCost(£) PumpHoursTotal
sum mean mean
AnimalClass
Bird 350487.0 323.029493 1.106912
Fish 1560.0 780.000000 3.000000
Mammal 1597513.0 346.006714 1.203162
Reptile 3730.0 310.833333 1.083333
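Notice that applying a list of aggregations leaves the result with a two-level MultiIndex on the columns: the original column name on top and the aggregation underneath. A common follow-up step is to flatten these into single-level names. Here is a minimal sketch using a small made-up DataFrame (not the animals data):

```python
import pandas as pd

# Small illustrative DataFrame, standing in for the animals data
df = pd.DataFrame({
    "AnimalClass": ["Bird", "Bird", "Mammal"],
    "Cost": [100.0, 200.0, 300.0],
})

agg = df.groupby(by="AnimalClass").agg({"Cost": ["sum", "mean"]})

# The columns are now a two-level MultiIndex: ("Cost", "sum"), ("Cost", "mean").
# Joining each tuple with an underscore gives flat, single-level names.
agg.columns = ["_".join(col) for col in agg.columns]

print(agg)
```

After flattening, the columns are plain strings like “Cost_sum” and “Cost_mean”, which are easier to select and to pass to plotting functions.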

10.2.3 Exercise

Group the animals dataset by the variable “Borough” and obtain the following summary statistics:

  • The variance and standard deviation of the “IncidentNominalCost(£)” column.

  • The mean of the “PumpCount” column.

  • The sum of the “PumpHoursTotal” column.

animals_aggregates = animals.groupby(by="Borough").agg({"IncidentNominalCost(£)": ["var", "std"],
                "PumpCount": "mean",
                "PumpHoursTotal": "sum"})

animals_aggregates
IncidentNominalCost(£) PumpCount PumpHoursTotal
var std mean sum
Borough
Barking And Dagenham 17478.446543 132.206076 1.006757 166.0
Barnet 18924.149664 137.565074 1.017021 269.0
Bexley 101288.188050 318.258053 1.050000 232.0
Brent 14903.635693 122.080448 1.012903 170.0
Brentwood NaN NaN 1.000000 1.0
Bromley 96915.276777 311.312185 1.037915 288.0
Broxbourne NaN NaN 1.000000 1.0
Camden 101271.452381 318.231759 1.035503 207.0
City Of London 5970.090909 77.266363 1.000000 12.0
Croydon 12917.137020 113.653583 1.000000 273.0
Ealing 12108.624975 110.039197 1.019900 222.0
Enfield 52486.690464 229.099739 1.072874 324.0
Epping Forest 101180.333333 318.088562 1.333333 7.0
Greenwich 27466.981031 165.731654 1.016575 212.0
Hackney 11048.169623 105.110274 1.000000 223.0
Hammersmith And Fulham 24555.948524 156.703378 1.020833 161.0
Haringey 19360.601034 139.142377 1.014778 226.0
Harrow 14084.828954 118.679522 1.009346 122.0
Havering 43320.559880 208.135917 1.035714 218.0
Hillingdon 81438.478868 285.374279 1.048128 252.0
Hounslow 61567.012384 248.127009 1.032680 188.0
Islington 27034.293488 164.421086 1.005025 218.0
Kensington And Chelsea 9777.240258 98.879928 1.008000 135.0
Kingston Upon Thames 33175.594343 182.141688 1.008850 133.0
Lambeth 10143.245118 100.713679 1.018018 245.0
Lewisham 10361.723148 101.792550 1.000000 205.0
Merton 13145.226087 114.652632 1.000000 126.0
Newham 22322.972614 149.408743 1.005128 222.0
Redbridge 29134.820504 170.689251 1.016304 223.0
Richmond Upon Thames 21911.281480 148.024598 1.012739 183.0
Southwark 19417.532703 139.346807 1.000000 253.0
Sutton 11911.604924 109.140299 1.015748 139.0
Tandridge NaN NaN 1.000000 1.0
Tower Hamlets 64299.889716 253.574229 1.051643 267.0
Waltham Forest 111310.958494 333.632970 1.049261 266.0
Wandsworth 34465.758256 185.649558 1.027778 215.0
Westminster 29891.372414 172.891216 1.034483 164.0

In later versions of pandas (from 0.25.0) named aggregation also exists, which lets you name the output columns directly. The help guide will help you implement this.

Note that the DataFrames displayed when you run the code look a little different from the ones I have displayed here. You will see that in yours, the index is made up of our grouping categories. We can use our “.reset_index()” method here too, or pass “as_index=False” to “.groupby()”. There are a few reasons we might do this, including making visualisation easier.

group_by_dropped_index = titanic.groupby(by=["embarked", "pclass"], as_index=False)["fare"].count()
group_by_dropped_index
embarked pclass fare
0 C 1 141
1 C 2 28
2 C 3 101
3 Q 1 3
4 Q 2 7
5 Q 3 113
6 S 1 177
7 S 2 242
8 S 3 494
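The same flat result can be produced by chaining “.reset_index()” onto a normal grouped result instead of passing “as_index=False”. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

# Small illustrative DataFrame, standing in for the titanic data
df = pd.DataFrame({
    "embarked": ["C", "C", "S"],
    "fare": [10.0, 20.0, 30.0],
})

# Option 1: group normally, then move the group labels back into columns
counts = df.groupby(by="embarked")["fare"].count().reset_index()

# Option 2: ask groupby not to use the group labels as the index at all
counts_alt = df.groupby(by="embarked", as_index=False)["fare"].count()

print(counts)
```

Both approaches give the same DataFrame with a default integer index, which is usually the more convenient shape for plotting or further merging.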

11 Chapter Summary

Excellent job, you have completed Chapter 6 of the Introduction to Python course. There is only one chapter left before you complete Core Part 1 of the course! Absolutely amazing aggregating!

Chapter 7 will feature a structured end-of-course exercise that tests you on everything you have learned from Chapter 3 to Chapter 6, with a new dataset we have yet to explore, the schools dataset.