Chapter 4 - Working with DataFrames

1 Chapter Overview and Learning Objectives

  • Packages and Datasets

  • Exploring Datasets

    • Quick Previews
    • Size
    • Data Types
    • Column Names
  • Sorting Data

  • Subsetting Data

    • Selecting Single Columns
    • Selecting Multiple Columns
  • Filtering

    • Single Conditional Filtering
    • Multiple Conditional Filtering
  • Deriving New Columns

    • Constant Values
    • Numeric Values
    • Boolean and Binary Values
    • Mapping new values
    • Deleting columns
  • Merging Data

    • About Merging
    • Merging using Pandas
    • Union joins

2 Packages and Datasets

2.1 Packages

As a reminder, we should always import our packages at the top of our script. In this chapter (and for the rest of the course) we will use pandas, and give it the alias “pd”.

2.1.1 Exercise

Import Pandas and give it the alias pd

import pandas as pd

2.2 Datasets

Good practice also dictates that we import our datasets at the top of our script, following the import of packages.

In this session we’ll be using:

  • animals - animals.csv
  • titanic - titanic.xlsx
  • joining_data1 - joining_data1.csv
  • joining_data2 - joining_data2.csv
  • union_join_data - union_data.csv
  • marvel_left - joining_exercise1.csv
  • marvel_right - joining_exercise2.csv

These are all straightforward data imports with no additional parameters required. This is a good test of the techniques covered in Chapter 3.

2.2.1 Exercise

Load in the datasets listed above.

You can check your variables are loaded by using “%whos” in Jupyter. In Spyder or other IDEs they should appear in your variable explorer.

animals = pd.read_csv("../data/animals.csv")
titanic = pd.read_excel("../data/titanic.xlsx")
joining_data1 = pd.read_csv("../data/joining_data1.csv")
joining_data2 = pd.read_csv("../data/joining_data2.csv")
union_join_data = pd.read_csv("../data/union_data.csv")
marvel_left = pd.read_csv("../data/joining_exercise1.csv")
marvel_right = pd.read_csv("../data/joining_exercise2.csv")

3 Exploring Datasets

Before you start working with a dataset it is important to examine it. For example:

  • Do the values look how you would expect?
  • Does the data have the right number of rows and columns?
  • Do the columns have your expected data types?
  • Is your data clean?

In this section we’ll look at various ways of exploring our data, with the aim of answering the above questions.

In chapter 5 we will look at how we can clean our data. When working with data, cleaning is normally a step we’d do first. Those concepts are a little more complicated, so we like to establish a good base knowledge of how Python works first. For these sessions we’ll be using data that has already been cleaned, to make it easy to handle.

3.1 Quick Previews

In the last section we looked at three methods of quickly inspecting our dataframes.

  • .head() - Top 5 rows (can be altered with the parameter n=)
  • .tail() - Bottom 5 rows (can be altered with the parameter n=)
  • .sample() - 1 randomly sampled row (also accepts n=).

We use these because we do not always want Jupyter Notebooks to print out large datasets; we just want to check that certain operations have worked on a smaller amount of data.

To review, we can modify how many rows are returned by putting a number within the round brackets. The number in the round brackets is known as an argument. “.head()” and “.tail()” have a default argument of 5; if we don’t pass an argument that default behaviour is used.

You may recall that we mentioned this parameter is “n=”, but we need not specify the name, just the argument. Why is this? Because “.head()” and “.tail()” take only this one parameter, Python knows the integer we pass must be the argument for “n=”.

However, don’t always rely on this: positional order comes into play when functions or methods have a large number of parameters (particularly several of the same type, such as integers). As such, it is good practice to name the parameter in most cases.
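For instance, a quick sketch (row counts chosen arbitrarily) naming the parameter for the other two preview methods:

animals.tail(n=3)      # bottom 3 rows
animals.sample(n=2)    # 2 randomly sampled rows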

3.2 Example

animals.head(3)

# Best Practice
# animals.head(n = 3)
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
0 139091 01/01/2009 03:01 2009 2008/09 Special Service 1.0 2.0 255 510.0 DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 Dog House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Croydon Norbury Mammal 00AH Outer London
1 275091 01/01/2009 08:51 2009 2008/09 Special Service 1.0 1.0 255 255.0 ASSIST RSPCA WITH FOX TRAPPED,B15 Fox Railings Other animal assistance Animal assistance involving livestock - Other ... Croydon Woodside Mammal 00AH Outer London
2 2075091 04/01/2009 10:07 2009 2008/09 Special Service 1.0 1.0 255 255.0 DOG CAUGHT IN DRAIN,B15 Dog Pipe or drain Animal rescue from below ground Animal rescue from below ground - Domestic pet Sutton Wallington Mammal 00BF Outer London

If you run just “animals”, you may notice that Jupyter does not display all of the rows in our DataFrame.

The first 30 rows are shown, then an ellipsis, then the last 30 rows.

Instead it represents the omitted rows with an ellipsis (“…”). You may also see this with columns. This often happens with large datasets and helps to save computational power. This is similar to the max print option in Spyder and other IDEs.

While there are various ways of changing this setting within Jupyter (Stack Overflow has several answers), if you regularly use large DataFrames and wish to inspect all rows, you may prefer VSCode or another IDE, since many of these have built-in data viewers (which let us view the whole dataset).
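One approach (a minimal sketch; the row limit here is arbitrary) is to change pandas’ own display option rather than a Jupyter setting:

# Raise the number of rows pandas prints before truncating with "..."
pd.set_option("display.max_rows", 100)

# Check the current value of the setting
pd.get_option("display.max_rows")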

We can also use “.info()”, which gives us information about:

  • The class of the object – a “pandas” DataFrame
  • The range (or length of our index), a.k.a the number of rows
  • How many columns we have
  • The column names
  • The number of non-null entries (a.k.a without missing values)
  • The data type of the column
  • The number of columns of each data type.

3.3 Example

animals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5752 entries, 0 to 5751
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   IncidentNumber              5752 non-null   object 
 1   DateTimeOfCall              5752 non-null   object 
 2   CalYear                     5752 non-null   int64  
 3   FinYear                     5752 non-null   object 
 4   TypeOfIncident              5752 non-null   object 
 5   PumpCount                   5716 non-null   float64
 6   PumpHoursTotal              5716 non-null   float64
 7   HourlyNominalCost(£)        5752 non-null   int64  
 8   IncidentNominalCost(£)      5716 non-null   float64
 9   FinalDescription            5749 non-null   object 
 10  AnimalGroupParent           5752 non-null   object 
 11  PropertyType                5752 non-null   object 
 12  SpecialServiceTypeCategory  5752 non-null   object 
 13  SpecialServiceType          5752 non-null   object 
 14  Borough                     5748 non-null   object 
 15  StnGroundName               5752 non-null   object 
 16  AnimalClass                 5752 non-null   object 
 17  Code                        5742 non-null   object 
 18  London                      5742 non-null   object 
dtypes: float64(3), int64(2), object(14)
memory usage: 853.9+ KB

3.4 Size

Knowing the size of our DataFrame (i.e. the number of rows and columns) is also useful, separately from what “.info()” gives us. Often we want to save these values to variables, which is difficult with that method, so instead we use some attributes.

3.4.1 Example

“.shape” is an attribute (note the lack of brackets!) that provides us a tuple with the dimensions of the object, in the form (number of rows, number of columns).

animals.shape
(5752, 19)

Notice this is the first example of an attribute of the DataFrame rather than a method, so doesn’t require the round brackets at the end of its call. We talk a bit more about the similarities and differences between functions, methods and attributes in chapter 2. Feel free to review that section if you are stuck.

We can use our indexing brackets (also discussed at length in chapter 2) to return an item in this tuple, for example the number of rows at index 0.

animals.shape[0]
5752

And the number of columns at index 1.

animals.shape[1]
19

As an aside:

Because “.shape” returns us a tuple we can assign each of these values to its own variable. To do this we give our variable names in the order of the tuple output, separated by a comma, before our assignment operator (=).

3.4.2 Example

Our tuple outputs rows first then columns, so we give our variables as “nrow” and “ncol”. This is also a great chance to show off multi-variable assignment introduced in chapter 2.

nrow, ncol = animals.shape

print("There are", nrow, "rows, and", ncol, "columns in the animals DataFrame.")
There are 5752 rows, and 19 columns in the animals DataFrame.

Of course, you can always do this separately without multi-variable assignment, as below, but multi-variable assignment is certainly the more efficient way.

nrow = animals.shape[0]
ncol = animals.shape[1]

We can also use the inbuilt “len()” function to find the length. This will return us the number of rows.

len(animals)
5752

We can also modify this to return us the number of columns as follows.

len(animals.columns)
19

Note that the .columns attribute is also very useful, returning something called an Index object containing the ordered column names of the pandas object. Taking the length of this object gives us the number of values (i.e. the number of columns).

More info on the index datatype is available, but I would suggest returning to this later as reference material.
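As a quick illustration (a small sketch only), the Index behaves much like a list, so we can index into it:

animals.columns[0]     # the first column name, 'IncidentNumber'
animals.columns[-1]    # the last column name, 'London'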

3.4.3 Exercise

Look at the dimensions of the “titanic” DataFrame.

  1. Use shape to return the tuple of rows and columns.

  2. Use multi-variable assignment to obtain the rows and columns as variables.

  3. Print the results from (b) in a suitable sentence.

# (a) 

titanic.shape
(1309, 14)
# (b)

nrow, ncol = titanic.shape

print(nrow, ncol)
1309 14
# (c)

print("There are", nrow, "rows and", ncol, "columns in the titanic DataFrame.")
There are 1309 rows and 14 columns in the titanic DataFrame.

4 Data Types

It is also important to check the data types when we bring in data. If you’ve worked with other analytical software before you’ll know that sometimes data doesn’t come in as we would expect.

We do this using “.dtypes”, another very useful attribute (delivering on that promise that we will use attributes as well as methods!).

4.1 Example

animals.dtypes
IncidentNumber                 object
DateTimeOfCall                 object
CalYear                         int64
FinYear                        object
TypeOfIncident                 object
PumpCount                     float64
PumpHoursTotal                float64
HourlyNominalCost(£)            int64
IncidentNominalCost(£)        float64
FinalDescription               object
AnimalGroupParent              object
PropertyType                   object
SpecialServiceTypeCategory     object
SpecialServiceType             object
Borough                        object
StnGroundName                  object
AnimalClass                    object
Code                           object
London                         object
dtype: object

We have some of the data types we’ve seen before

  • int represents integers; our whole numbers
    • CalYear (Calendar Year) is a column of whole numbers
  • float represents ‘floating point’ numbers or decimals
    • IncidentNominalCost(£) (The financial cost of each incident) is a decimal number.
  • object or O here represents what we’ve been calling string data so far, in that they are text values
    • AnimalClass (The kind of animal) is a text based value.

We may also see

  • bool - representing Boolean values; our True and False
  • datetime - date and time values
  • category - a special “pandas” datatype for categorical or factor variables (see Chapter 7 for more info, this is reference material)

You may notice that some columns have not come in quite how we would expect.

4.1.1 Example

animals['DateTimeOfCall'].dtypes

# This is how we select a single column; we'll cover this later
dtype('O')

“DateTimeOfCall” has come in as an “object” (text) value, when it should be “datetime”. This is extremely common with dates and times in a variety of software, not just Python!

Python will often choose the object data type for a column if there is any ambiguity over its data type. This is because objects retain all the characteristics of the data and allow us to make the choice about how to process it.

For example, the date 01/11/19 could be interpreted as either:

  • 1st November 2019
  • 11th January 2019
  • 19th November 2001

depending on region (which makes the most sense to you?).

Since we don’t look at handling date and time data in this course, we’ll leave these columns alone for now. This is a perfect time to mention the Dates and Times in Python course on the Learning Hub, a recommended follow-up to the Introduction to Python course.

For an overview that is much less in-depth than that course, the following resource is available:

Dates and Times in Pandas is a tutorial for handling dates and times using pandas.
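If you want a taste now, a minimal sketch (assuming the strings are day-first, as “DateTimeOfCall” appears above) might look like this; we won’t use it elsewhere in this course:

# Parse the text column into proper datetime values (the format string is an
# assumption based on values like "01/01/2009 03:01")
pd.to_datetime(animals["DateTimeOfCall"], format="%d/%m/%Y %H:%M")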

4.1.2 Exercise

Look at the data types of the titanic DataFrame.

Are there any values you didn’t expect to see?

You can check the Data Dictionary (an explanation of each column)

# Look at the data types of the titanic DataFrame.

titanic.dtypes
pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
home.dest     object
body         float64
boat          object
dtype: object
  • We may expect “survived” to be a Boolean (since it is a binary choice, yes or no) but it is numeric. To come in as Boolean it would have needed to contain “TRUE”/“FALSE” or “True”/“False” values (see the sketch after this list).

  • We may expect “boat” to be a number, but some boats are referred to by letters.
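A minimal sketch of that conversion, purely for illustration:

# Convert the 0/1 integers in "survived" to True/False Boolean values
titanic["survived"].astype(bool).head()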

4.2 Column Names

It is important to know what your columns are called because in Python we commonly select our columns by name. It is also key to know how they are constructed (lower case, snake case etc) as Python is case sensitive.

4.2.1 Example

First we give the DataFrame name then the name of our column, in quotes, inside square brackets.

animals['AnimalClass']
0       Mammal
1       Mammal
2       Mammal
3       Mammal
4       Mammal
         ...  
5747    Mammal
5748    Mammal
5749    Mammal
5750    Mammal
5751    Mammal
Name: AnimalClass, Length: 5752, dtype: object

We can then add a method to the end, for example “.head()”.

animals['AnimalClass'].head()
0    Mammal
1    Mammal
2    Mammal
3    Mammal
4    Mammal
Name: AnimalClass, dtype: object

While Python can handle column names with spaces, symbols and irregular capitalization, it is often easier to clean them. We’ll have a look at how to do this in Chapter 5.
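As a small preview only (Chapter 5 covers cleaning properly), the “.str” methods on the columns Index can tidy names, for example:

# Lower-cased versions of the column names (this does not change the DataFrame itself)
animals.columns.str.lower()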

4.2.2 Example

As a reminder, we can access our columns by using the “.columns” attribute to return an Index datatype filled with the column names.

animals.columns
Index(['IncidentNumber', 'DateTimeOfCall', 'CalYear', 'FinYear',
       'TypeOfIncident', 'PumpCount', 'PumpHoursTotal', 'HourlyNominalCost(£)',
       'IncidentNominalCost(£)', 'FinalDescription', 'AnimalGroupParent',
       'PropertyType', 'SpecialServiceTypeCategory', 'SpecialServiceType',
       'Borough', 'StnGroundName', 'AnimalClass', 'Code', 'London'],
      dtype='object')

It would be nice to display them as a list to make the object easier to work with, so we can add (or chain) the method “.tolist()” on the end.

animals.columns.tolist()
['IncidentNumber',
 'DateTimeOfCall',
 'CalYear',
 'FinYear',
 'TypeOfIncident',
 'PumpCount',
 'PumpHoursTotal',
 'HourlyNominalCost(£)',
 'IncidentNominalCost(£)',
 'FinalDescription',
 'AnimalGroupParent',
 'PropertyType',
 'SpecialServiceTypeCategory',
 'SpecialServiceType',
 'Borough',
 'StnGroundName',
 'AnimalClass',
 'Code',
 'London']

4.2.3 Exercise

  1. Look at the column names of the titanic DataFrame and convert them to a list, giving it an appropriate variable name.

  2. Use indexing to return the column name ‘embarked’.

  3. Use negative indexing to return the list without ‘boat’.

# (a)
titanic_cols = titanic.columns.tolist()

titanic_cols
['pclass',
 'survived',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'home.dest',
 'body',
 'boat']
# (b)
titanic_cols[10]
'embarked'
# (c)
titanic_cols[:-1]
['pclass',
 'survived',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'home.dest',
 'body']

We can also use Jupyter’s autocomplete feature to help us when accessing the names of our columns. Copy and uncomment the cell below, place your cursor after the H, and hit Tab.

#animals['H']

You’ll see that the auto complete option gives us the name of our “HourlyNominalCost(£)” column. This can be a really useful shortcut when we’re writing our code, circumventing the need to type out full column names.

5 Sorting Data

Our data is displayed in the same order as the source data, so we may want to sort it based on specific columns.

5.1 Example

To do this we use the “.sort_values()” method, passing the column name as a string to the “by=” parameter. When transforming data like this, we want to assign the result to a new object so we don’t have to re-type this process each time.

# animals_sorted = animals.sort_values(by="IncidentNominalCost(£)")
# animals_sorted.head()

There are additional parameters we can set. For example, by default the values are sorted in ascending order; by passing “ascending=False” we can sort in descending order instead.

animals_sorted = animals.sort_values(by="IncidentNominalCost(£)", ascending=False)
animals_sorted.head()
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
4450 098141-28072016 28/07/2016 13:49 2016 2016/17 Special Service 3.0 12.0 326 3912.0 CAT STUCK WITHIN WALL SPACE RSPCA IN ATTENDANCE Cat Purpose Built Flats/Maisonettes - 4 to 9 storeys Other animal assistance Animal assistance involving domestic animal - ... Camden Kentish Town Mammal 00AG Inner London
2542 48360131 19/04/2013 07:01 2013 2013/14 Special Service 2.0 12.0 290 3480.0 FOAL IN RIVER WATER LEVEL ONE IMPLEMENTED Horse River/canal Other animal assistance Animal assistance - Lift heavy livestock animal Waltham Forest Chingford Mammal 00BH Outer London
3757 62700151 22/05/2015 11:34 2015 2015/16 Special Service 2.0 10.0 298 2980.0 HORSE FALLEN INTO SWIMMING POOL Horse Other outdoor structures Animal rescue from water Animal rescue from water - Farm animal Bromley Biggin Hill Mammal 00AF Outer London
5568 092389-09072018 09/07/2018 08:35 2018 2018/19 Special Service 2.0 8.0 333 2664.0 HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... Horse Animal harm outdoors Other animal assistance Animal assistance - Lift heavy domestic animal Hillingdon Hillingdon Mammal 00AS Outer London
3143 49076141 22/04/2014 09:54 2014 2014/15 Special Service 2.0 9.0 295 2655.0 KITTEN TRAPPED IN CHIMNEY Cat Purpose Built Flats/Maisonettes - Up to 3 stor... Other animal assistance Assist trapped domestic animal Hounslow Feltham Mammal 00AT Outer London

We can also sort by more than one column, but we can’t just pass the column names to “.sort_values()” one after another. We must provide them as a list instead, where order is important.

5.2 Example

Here we are sorting by “IncidentNominalCost(£)” and then by “AnimalClass”. Notice that object (text/string) columns are sorted in alphabetical order. Order matters because the second column only breaks ties: rows are sorted by the first column, and rows with equal values there are then ordered by the second.

animals_sorted = animals.sort_values(by=["IncidentNominalCost(£)", "AnimalClass"])
animals_sorted.head()
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
27 7051091 11/01/2009 12:32 2009 2008/09 Special Service 1.0 1.0 255 255.0 SWAN STUCK ON ROOF,B15 Bird Domestic garden (vegetation not equipment) Animal rescue from height Animal rescue from height - Bird Havering Hornchurch Bird 00AR Outer London
31 9252091 15/01/2009 12:38 2009 2008/09 Special Service 1.0 1.0 255 255.0 ASSIST RSPCA WITH BIRD TRAPPED IN TREE BY FOOT... Bird Wasteland Animal rescue from height Animal rescue from height - Bird Lambeth Clapham Bird 00AY Inner London
40 13245091 22/01/2009 21:35 2009 2008/09 Special Service 1.0 1.0 255 255.0 BIRD IN CHIMNEY,B15 Bird House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Harrow Stanmore Bird 00AQ Outer London
63 21760091 07/02/2009 11:12 2009 2008/09 Special Service 1.0 1.0 255 255.0 PIGEON IN PRECARIOUS POSITION, B15 Bird Tree scrub Animal rescue from below ground Animal rescue from below ground - Bird Westminster Soho Bird 00BK Outer London
68 26715091 15/02/2009 15:53 2009 2008/09 Special Service 1.0 1.0 255 255.0 SWAN IN DISTRESS, J15 Bird River/canal Animal rescue from water Animal rescue from water - Bird Tower Hamlets Shadwell Bird 00BG Outer London

When sorting by more than one column, it’s best to specify “ascending=” as a list as well, to improve the readability and accessibility of our code. This list is in the same order as our columns, so “.sort_values()” applies it element-wise: the first Boolean to the first column, the second to the second, and so on.

Some versions of pandas will sort both columns in descending order if we just supply “ascending=False”; however, some won’t, so it’s better to be explicit about what we want. Using a list also gives finer control: I could have one column in ascending order and the other in descending order if I wished.

5.3 Example

animals_sorted_desc = animals.sort_values(by=["IncidentNominalCost(£)", "AnimalClass"], 
                                          ascending=[False, False])

animals_sorted_desc.head()
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
4450 098141-28072016 28/07/2016 13:49 2016 2016/17 Special Service 3.0 12.0 326 3912.0 CAT STUCK WITHIN WALL SPACE RSPCA IN ATTENDANCE Cat Purpose Built Flats/Maisonettes - 4 to 9 storeys Other animal assistance Animal assistance involving domestic animal - ... Camden Kentish Town Mammal 00AG Inner London
2542 48360131 19/04/2013 07:01 2013 2013/14 Special Service 2.0 12.0 290 3480.0 FOAL IN RIVER WATER LEVEL ONE IMPLEMENTED Horse River/canal Other animal assistance Animal assistance - Lift heavy livestock animal Waltham Forest Chingford Mammal 00BH Outer London
3757 62700151 22/05/2015 11:34 2015 2015/16 Special Service 2.0 10.0 298 2980.0 HORSE FALLEN INTO SWIMMING POOL Horse Other outdoor structures Animal rescue from water Animal rescue from water - Farm animal Bromley Biggin Hill Mammal 00AF Outer London
5568 092389-09072018 09/07/2018 08:35 2018 2018/19 Special Service 2.0 8.0 333 2664.0 HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... Horse Animal harm outdoors Other animal assistance Animal assistance - Lift heavy domestic animal Hillingdon Hillingdon Mammal 00AS Outer London
3143 49076141 22/04/2014 09:54 2014 2014/15 Special Service 2.0 9.0 295 2655.0 KITTEN TRAPPED IN CHIMNEY Cat Purpose Built Flats/Maisonettes - Up to 3 stor... Other animal assistance Assist trapped domestic animal Hounslow Feltham Mammal 00AT Outer London

5.4 Exercise

  1. Sort the titanic DataFrame by the “age” column in descending order, name it “titanic_sorted”.

  2. Sort the titanic DataFrame by the “age” then “sex” columns, both in ascending order.

  3. Sort the titanic DataFrame by ascending “fare” then descending “age”, naming it appropriately.

# (a)
titanic_sorted = titanic.sort_values(by="age", ascending=False)
titanic_sorted.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
14 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S Hessle, Yorks NaN B
61 1 1 Cavendish, Mrs. Tyrell William (Julia Florence... female 76.0 1 0 19877 78.8500 C46 S Little Onn Hall, Staffs NaN 6
1235 3 0 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S NaN NaN NaN
135 1 0 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C New York, NY NaN NaN
9 1 0 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C Montevideo, Uruguay 22.0 NaN
# (b)

titanic_age_sex = titanic.sort_values(by = ['age', 'sex'], ascending = [True, True])

# This also works but is not good practice

# titanic_age_sex = titanic.sort_values(by = ['age', 'sex'])
# (c)

titanic_fare_age = titanic.sort_values(by = ['fare', 'age'], ascending = [True, False])

6 Revisiting parameters and arguments

I mentioned earlier that leaving out parameter names can cause problems with larger functions, where Python relies on the order in which arguments are given. Now we will see this in practice.

6.1 Important Example

Parameters are keywords that tell Python what the following argument relates to. We have two in the code below:

  • “by=” is a parameter and the string representing the column “IncidentNominalCost(£)” is the argument.

  • “ascending=” is a parameter and the Boolean value “False” is the argument

If we don’t specify parameters Python will assume our arguments are in the order specified in the Signature (press “shift + tab” with your cursor in “.sort_values()” to see this), and because of this the code below won’t work.

# This code will give us an error!

# After you've tried it, you may want to comment it out!

# animals_sorted = animals.sort_values("IncidentNominalCost(£)", False)

# animals_sorted.head()

Our first argument “IncidentNominalCost(£)” maps directly to the “by=” parameter as expected. However, Python assumes the second argument “False” relates to the “axis=” parameter, which in some pandas versions isn’t a valid argument here, and in others is treated effectively as “axis=0”.

It is personal preference whether you name your parameters when passing arguments, but it is considered good practice and makes your code more readable, especially to new coders. We would strongly advise doing so.

As a bonus, when you name your parameters you can pass the arguments in any order, as Python no longer relies on their position.

animals_sorted = animals.sort_values(ascending=False, by="IncidentNominalCost(£)")
animals_sorted.head()
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
4450 098141-28072016 28/07/2016 13:49 2016 2016/17 Special Service 3.0 12.0 326 3912.0 CAT STUCK WITHIN WALL SPACE RSPCA IN ATTENDANCE Cat Purpose Built Flats/Maisonettes - 4 to 9 storeys Other animal assistance Animal assistance involving domestic animal - ... Camden Kentish Town Mammal 00AG Inner London
2542 48360131 19/04/2013 07:01 2013 2013/14 Special Service 2.0 12.0 290 3480.0 FOAL IN RIVER WATER LEVEL ONE IMPLEMENTED Horse River/canal Other animal assistance Animal assistance - Lift heavy livestock animal Waltham Forest Chingford Mammal 00BH Outer London
3757 62700151 22/05/2015 11:34 2015 2015/16 Special Service 2.0 10.0 298 2980.0 HORSE FALLEN INTO SWIMMING POOL Horse Other outdoor structures Animal rescue from water Animal rescue from water - Farm animal Bromley Biggin Hill Mammal 00AF Outer London
5568 092389-09072018 09/07/2018 08:35 2018 2018/19 Special Service 2.0 8.0 333 2664.0 HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... Horse Animal harm outdoors Other animal assistance Animal assistance - Lift heavy domestic animal Hillingdon Hillingdon Mammal 00AS Outer London
3143 49076141 22/04/2014 09:54 2014 2014/15 Special Service 2.0 9.0 295 2655.0 KITTEN TRAPPED IN CHIMNEY Cat Purpose Built Flats/Maisonettes - Up to 3 stor... Other animal assistance Assist trapped domestic animal Hounslow Feltham Mammal 00AT Outer London

7 Subsetting

7.1 Selecting Single Columns

Sometimes we will want to work with smaller “cut down” DataFrames that contain fewer columns. As we saw earlier the simplest way to select a column from a DataFrame is to use the name of the column in the indexing brackets “[]”.

7.1.1 Example

Here I am creating a new Series called ‘animal_grp_parent’ using the ‘AnimalGroupParent’ column from the animals DataFrame.

animal_grp_parent = animals['AnimalGroupParent']

animal_grp_parent.head()
0       Dog
1       Fox
2       Dog
3     Horse
4    Rabbit
Name: AnimalGroupParent, dtype: object

7.1.2 Exercise

  1. Select the “fare” column from the “titanic” DataFrame and name it “titanic_fare”.

  2. Randomly Sample 12 rows from the “titanic_fare” object you created.

# (a)

titanic_fare = titanic["fare"]
titanic_fare.head()
0    211.3375
1    151.5500
2    151.5500
3    151.5500
4    151.5500
Name: fare, dtype: float64
# (b) 

titanic_fare.sample(n = 12)
69       55.0000
192     146.5208
1087      7.8542
66      262.3750
1043     15.5000
694       6.7500
540      26.0000
16      247.5208
759      17.4000
1026     12.4750
1281      9.0000
1082      3.1708
Name: fare, dtype: float64

7.2 Selecting Multiple Columns

7.2.1 Example

As with sorting by multiple columns, we can select multiple columns by providing them in a list.

# Create a list of columns
my_list_of_columns = [ "DateTimeOfCall", "AnimalGroupParent", "IncidentNominalCost(£)"]

# Pass that list to my indexing brackets, creating a dataframe
animal_small_df = animals[my_list_of_columns]

We can, however, do this in one step by using two sets of square brackets next to each other. It’s important to realise that they are performing two separate roles:

  • Our first set is selecting or indexing from the animals DataFrame.
  • Our second set is creating a list of our columns we wish to select.
# This gives the same result as the cell above.
animal_small_df = animals[[ "DateTimeOfCall", "AnimalGroupParent", "IncidentNominalCost(£)"]]
animal_small_df.sample(6)
DateTimeOfCall AnimalGroupParent IncidentNominalCost(£)
202 16/05/2009 14:28 Dog 260.0
3317 14/07/2014 16:25 Cat 295.0
3955 26/08/2015 09:38 Cat 298.0
2604 20/05/2013 12:33 Bird 290.0
4004 01/10/2015 06:34 Cat 298.0
4551 16/09/2016 17:42 Bird 326.0

Two important things to note are the following (a quick sketch follows this list):

  • When we return one column we get a Series object (as DataFrames are collections of Series objects)

  • When we return more than one column we get a DataFrame.
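A quick sketch confirming this with the built-in “type()” function:

type(animals['AnimalGroupParent'])      # single brackets -> pandas Series
type(animals[['AnimalGroupParent']])    # double brackets (a list) -> pandas DataFrame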

7.2.2 Exercise

  1. Select the “name”, “sex”, “age” and “survived” columns from the “titanic” DataFrame, naming the result accordingly.

  2. Create two sub DataFrames from your answer to part (a), one containing “sex” and “age” and another containing “name” and “survived”, naming them accordingly.

# (a)

titanic_select = titanic[["name","sex","age", "survived"]]
titanic_select.head()
name sex age survived
0 Allen, Miss. Elisabeth Walton female 29.0000 1
1 Allison, Master. Hudson Trevor male 0.9167 1
2 Allison, Miss. Helen Loraine female 2.0000 0
3 Allison, Mr. Hudson Joshua Creighton male 30.0000 0
4 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 0
# (b)

titanic_ages_sexes = titanic_select[["age", "sex"]]

titanic_names_survived = titanic_select[["name", "survived"]]

Note that you may see people selecting columns using “dataframe.column_name”, which is a legal practice, but certainly not a recommended one!

# This will work - but is not good practice!

titanic.sex.head()
0    female
1      male
2    female
3      male
4    female
Name: sex, dtype: object

We don’t recommend this because it looks too much like an attribute or method. It also breaks down if you have a column named something like “mean”: Python cannot tell whether you mean the column “mean” or the method “.mean()”, and “dataframe.mean” will give you the method rather than the column.

There’s no ambiguity with

  • dataframe[“column_name”]

which is why we will always recommend it over “dataframe.column_name”.
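To illustrate with a small, purely hypothetical DataFrame (built only for this point):

# A column whose name clashes with a DataFrame method
demo = pd.DataFrame({"mean": [1, 2, 3]})

demo["mean"]   # unambiguous: returns the column
demo.mean      # returns the DataFrame's .mean() method itself, not the column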

8 Selecting on data type

We can also use the method “.select_dtypes()” if we want to include or exclude specific kinds of data from our dataframe.

# Select just the object columns
animal_object_cols = animals.select_dtypes(include=["object"])

# Return the top of the animal columns

animal_object_cols.head()
IncidentNumber DateTimeOfCall FinYear TypeOfIncident FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
0 139091 01/01/2009 03:01 2008/09 Special Service DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 Dog House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Croydon Norbury Mammal 00AH Outer London
1 275091 01/01/2009 08:51 2008/09 Special Service ASSIST RSPCA WITH FOX TRAPPED,B15 Fox Railings Other animal assistance Animal assistance involving livestock - Other ... Croydon Woodside Mammal 00AH Outer London
2 2075091 04/01/2009 10:07 2008/09 Special Service DOG CAUGHT IN DRAIN,B15 Dog Pipe or drain Animal rescue from below ground Animal rescue from below ground - Domestic pet Sutton Wallington Mammal 00BF Outer London
3 2872091 05/01/2009 12:27 2008/09 Special Service HORSE TRAPPED IN LAKE,J17 Horse Intensive Farming Sheds (chickens, pigs etc) Animal rescue from water Animal rescue from water - Farm animal Hillingdon Ruislip Mammal 00AS Outer London
4 3553091 06/01/2009 15:23 2008/09 Special Service RABBIT TRAPPED UNDER SOFA,B15 Rabbit House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Havering Harold Hill Mammal 00AR Outer London

We must specify the data types to include as a list here, even if we’re only giving one value. That’s just the way this method was written (the help documentation will tell you this).

  • “object” includes our string columns; I could also use “O” here.
  • “float64” and “int64” will select floating point numbers and integer numbers respectively.
  • If you have the numpy package loaded (“import numpy as np”) you can use np.number (no quotes) to return numerical data.

You can even use the parameter “exclude=”, which may be easier if you wish to omit just one data type, as in the sketch below.
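For example, a minimal sketch (assuming numpy has also been imported) that keeps everything except the numeric columns:

import numpy as np

# Keep only the non-numeric columns
animal_text_cols = animals.select_dtypes(exclude=[np.number])
animal_text_cols.head()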

8.1 Exercise

Create a new DataFrame “titanic_int_no_float” where we include all integer columns, but exclude all float columns.

titanic_int_no_float = titanic.select_dtypes(include = ['int64'], exclude = ['float64'])

9 Filtering

We can achieve really simple filtering by passing a range to the DataFrame indexer.

9.1 Example

animals[0:5]
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
0 139091 01/01/2009 03:01 2009 2008/09 Special Service 1.0 2.0 255 510.0 DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 Dog House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Croydon Norbury Mammal 00AH Outer London
1 275091 01/01/2009 08:51 2009 2008/09 Special Service 1.0 1.0 255 255.0 ASSIST RSPCA WITH FOX TRAPPED,B15 Fox Railings Other animal assistance Animal assistance involving livestock - Other ... Croydon Woodside Mammal 00AH Outer London
2 2075091 04/01/2009 10:07 2009 2008/09 Special Service 1.0 1.0 255 255.0 DOG CAUGHT IN DRAIN,B15 Dog Pipe or drain Animal rescue from below ground Animal rescue from below ground - Domestic pet Sutton Wallington Mammal 00BF Outer London
3 2872091 05/01/2009 12:27 2009 2008/09 Special Service 1.0 1.0 255 255.0 HORSE TRAPPED IN LAKE,J17 Horse Intensive Farming Sheds (chickens, pigs etc) Animal rescue from water Animal rescue from water - Farm animal Hillingdon Ruislip Mammal 00AS Outer London
4 3553091 06/01/2009 15:23 2009 2008/09 Special Service 1.0 1.0 255 255.0 RABBIT TRAPPED UNDER SOFA,B15 Rabbit House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Havering Harold Hill Mammal 00AR Outer London

This returns the same rows as “.head()”. We can, however, modify it to return any range of rows we like - for example, from the middle of our DataFrame.

animals[200:210]
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
200 81770091 15/05/2009 16:50 2009 2009/10 Special Service 1.0 1.0 260 260.0 CAT STUCK ON FENCE,B15 Cat Bridge Animal rescue from height Animal rescue from height - Domestic pet Tower Hamlets Bethnal Green Mammal 00BG Outer London
201 82208091 16/05/2009 13:20 2009 2009/10 Special Service 1.0 1.0 260 260.0 DOG TRAPPED UNDER CAR,B15 Dog Car Other animal assistance Animal assistance - Lift heavy livestock animal Brent West Hampstead Mammal 00AE Outer London
202 82239091 16/05/2009 14:28 2009 2009/10 Special Service 1.0 1.0 260 260.0 ROAD TRAFFIC ACCIDENT,B1 Dog Car Other animal assistance Animal assistance involving livestock - Other ... Bromley Orpington Mammal 00AF Outer London
203 82314091 16/05/2009 17:09 2009 2009/10 Special Service 1.0 1.0 260 260.0 DOG STUCK IN FOX HOLE,B15 Dog Domestic garden (vegetation not equipment) Animal rescue from below ground Animal rescue from below ground - Domestic pet Redbridge Hainault Mammal 00BC Outer London
204 82769091 17/05/2009 09:54 2009 2009/10 Special Service 1.0 1.0 260 260.0 CAT STUCK BEHIND TOILET,B15 Cat Purpose Built Flats/Maisonettes - 4 to 9 storeys Animal rescue from below ground Animal rescue from below ground - Domestic pet Croydon Addington Mammal 00AH Outer London
205 83410091 18/05/2009 13:44 2009 2009/10 Special Service 1.0 4.0 260 1040.0 ASSIST RSPCA WITH PIGEON TRAPPED IN CHIMNEY,B15 Bird House - single occupancy Animal rescue from height Animal rescue from height - Bird Redbridge Ilford Bird 00BC Outer London
206 83903091 19/05/2009 10:38 2009 2009/10 Special Service 1.0 1.0 260 260.0 ASSIST RSPCA WITH 2 DUCKLINGS TRAPPED BEWTEEN... Bird Multi-Storey car park Animal rescue from below ground Animal rescue from below ground - Bird Hillingdon Hillingdon Bird 00AS Outer London
207 83922091 19/05/2009 11:35 2009 2009/10 Special Service 1.0 1.0 260 260.0 ASSIST RSPCA, B15 Bird Bridge Animal rescue from height Animal rescue from height - Bird Enfield Enfield Bird 00AK Outer London
208 84125091 19/05/2009 17:46 2009 2009/10 Special Service 1.0 1.0 260 260.0 ASSIST RSPCA WITH CAT UP A TREE,B15 Cat Domestic garden (vegetation not equipment) Animal rescue from height Animal rescue from height - Domestic pet Ealing Ealing Mammal 00AJ Outer London
209 84176091 19/05/2009 19:07 2009 2009/10 Special Service 1.0 1.0 260 260.0 G561 RUNNING CALL TO PIGEON TRAPPED IN NETTING... Bird Vehicle Repair Workshop Animal rescue from height Animal rescue from height - Bird Hillingdon Heathrow Bird 00AS Inner London

However, we can do more involved, conditional filtering using pandas, which is one of the key processes in any data analysis task.

9.2 Single Conditional Filtering

Pandas filters data in a two-step process:

  1. Create a “mask” that specifies inclusion or exclusion for each row in the DataFrame.
  2. Apply the mask to the DataFrame to return the subset of rows that are included.

In the code below I will create a few simple DataFrames to illustrate the point. Don’t worry too much about how these bits of code work; as previously mentioned, you’ll often be working with much larger DataFrames that you’ll load in using one of the “pd.read_” functions.

# Let's create a simple DataFrame to look at this in practice
# Again, you don't need to worry about how this works

cat_dog = pd.DataFrame({"animal": ["Cat", "Cat", "Dog", "Cat"] , 
                        "name": ["Catalie Portman","Pico de Gato", "Chewbarka", "Sir Isaac Mewton"]})

# Print out the DataFrame

cat_dog
animal name
0 Cat Catalie Portman
1 Cat Pico de Gato
2 Dog Chewbarka
3 Cat Sir Isaac Mewton

9.2.1 Example

I want to filter so I just have ‘Cat’ rows.

My condition can be written like this:

mask = cat_dog['animal'] == 'Cat'
  • I am assigning my output to the variable “mask” with the single = sign
  • The column I want to look for values in is “animal” from the cat_dog DataFrame - “cat_dog['animal']”
  • From that column I want to find values that meet my condition – I state this with the double equals “==”
  • Finally I am telling Pandas what value I want to find; a string containing “Cat”

If we run the cell we can see the output is a Boolean Series, our True and False values.

# Run me to see the Series of Boolean values
mask
0     True
1     True
2    False
3     True
Name: animal, dtype: bool

Just to make things clearer, here I’ve added this Series as a new column, “included”, on our “cat_dog” DataFrame to demonstrate.

cat_dog
animal name included
0 Cat Catalie Portman True
1 Cat Pico de Gato True
2 Dog Chewbarka False
3 Cat Sir Isaac Mewton True

We can see that where the value in the “animal” column was “Cat”, the “included” column is “True”. Where the value was not “Cat” (e.g. “Dog”) the “included” column is “False”. However, I want to return just the rows where the mask was “True”.

To do this we apply the “mask” variable to the “cat_dog” DataFrame using our indexing or selection brackets.

just_cats = cat_dog[mask]
just_cats
animal name
0 Cat Catalie Portman
1 Cat Pico de Gato
3 Cat Sir Isaac Mewton

This returns only the rows where our conditional filter evaluated to “True”.

This seems a little long-winded but thankfully, rather than creating and applying a mask separately, we can do the whole filter operation in one step. We teach it this way to break the process down into its component parts before wrapping it up in one line.

9.2.2 Example

just_cats = cat_dog[cat_dog['animal'] == "Cat"]
just_cats
animal name
0 Cat Catalie Portman
1 Cat Pico de Gato
3 Cat Sir Isaac Mewton

You’ll notice the row with the “Dog” value has been dropped.

You’ll also notice that our index has not changed. You will sometimes want to reset the index after filtering, restoring it to sequential integers starting at 0, as a non-sequential index can cause problems in certain circumstances.

9.2.3 Example

You can do this like so.

just_cats.reset_index(drop=True, inplace=True)
just_cats
animal name
0 Cat Catalie Portman
1 Cat Pico de Gato
2 Cat Sir Isaac Mewton
  • By default the previous index is kept as a new column. By using “drop=True” we can remove this behaviour.

  • “inplace=True” updates the DataFrame in place and is an alternative to overwriting the data with “just_cats = just_cats.reset_index(drop=True)”

9.3 Other Comparison Operators

We have seen the equivalence sign “==” thus far, but there are other very useful operators we can utilise when conditional filtering:

Symbol Meaning
== Is equivalent to
!= Does not equal
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
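As a quick sketch using one of the other operators (output not shown here), we could keep every row that is not a mammal:

# "!=" keeps rows where AnimalClass is anything other than "Mammal"
not_mammals = animals[animals['AnimalClass'] != "Mammal"]
not_mammals.head()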

9.3.1 Example

Let’s use our animals DataFrame to find the rows where “IncidentNominalCost(£)” is over £1000.

# Note here we're using the animals DataFrame

cost_over_1k = animals[animals['IncidentNominalCost(£)'] > 1000]
cost_over_1k.sample(6)
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
5568 092389-09072018 09/07/2018 08:35 2018 2018/19 Special Service 2.0 8.0 333 2664.0 HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... Horse Animal harm outdoors Other animal assistance Animal assistance - Lift heavy domestic animal Hillingdon Hillingdon Mammal 00AS Outer London
572 3215101 06/01/2010 10:09 2010 2009/10 Special Service 2.0 5.0 260 1300.0 PUPPY STUCK ON ICED LAKE - WATER RESCUE LEVEL ... Dog Park Other animal assistance Animal assistance involving livestock - Other ... Richmond Upon Thames Kingston Mammal 00BD Outer London
2084 18627122 21/06/2012 05:05 2012 2012/13 Special Service 1.0 5.0 260 1300.0 HORSE STUCK ON SOAK AWAY Horse Grassland, pasture, grazing etc Other animal assistance Animal assistance - Lift heavy domestic animal Enfield Enfield Mammal 00AK Outer London
425 158794091 02/09/2009 10:57 2009 2009/10 Special Service 2.0 5.0 260 1300.0 HORSE STUCK IN MUD Horse Wasteland Animal rescue from water Animal rescue from water - Farm animal Bexley Erith Mammal 00AD Inner London
431 161584091 06/09/2009 13:59 2009 2009/10 Special Service 1.0 4.0 260 1040.0 CAT STUCK BETWEEN TWO WALLS Cat Purpose Built Flats/Maisonettes - Up to 3 stor... Animal rescue from below ground Animal rescue from below ground - Domestic pet Islington Holloway Mammal 00AU Outer London
1358 68681111 05/05/2011 18:00 2011 2011/12 Special Service 1.0 4.0 260 1040.0 CAT STUCK IN THE CHIMNEY Cat Tree scrub Animal rescue from height Animal rescue from height - Domestic pet Newham Stratford Mammal 00BB Inner London

Here we have done the following:

  • Assigned a new name “cost_over_1k”, which will hold our output
  • Given our DataFrame name and our indexing or selection brackets: “animals[]”
  • Chosen the column we want to look for values in, “IncidentNominalCost(£)”, from the animals DataFrame: “animals['IncidentNominalCost(£)']”
  • Set the condition “> 1000”, greater than a thousand; as this is a numeric column we can just specify the number
  • Printed out the result, using the “.sample(6)” method to display 6 rows of the new DataFrame.

9.3.2 Exercise

  1. Filter the titanic DataFrame by “sex”, returning only the rows where this is “female”. Name this something appropriate

  2. Filter the titanic DataFrame to return the rows where the passenger age is less than 40 (it is useful to practice these without being told the exact column to filter by).

  3. Filter the titanic DataFrame, returning only the rows where the fare paid is at least £30.50 (Hint: which operator would ‘at least’ correspond to?)

# (a)

titanic_female = titanic[titanic['sex'] ==  "female"]
titanic_female.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
0 1 1 Allen, Miss. Elisabeth Walton female 29.0 0 0 24160 211.3375 B5 S St Louis, MO NaN 2
2 1 0 Allison, Miss. Helen Loraine female 2.0 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN
6 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S Hudson, NY NaN 10
8 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.0 2 0 11769 51.4792 C101 S Bayside, Queens, NY NaN D
# (b)

titanic_younger = titanic[titanic['age'] < 40]

titanic_younger.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN
# (c) "At least" means greater than or equal to that value

titanic_fares = titanic[titanic['fare'] >= 30.50]

titanic_fares.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN

9.4 Multiple Conditional Filtering

Just as we can filter by one condition, we can also filter by multiple conditions. Let’s add a new column representing age to the cat_dog DataFrame.

# Let's create a simple DataFrame to look at this in practice
# Again, we don't need to worry about how we do this!

cat_dog = pd.DataFrame({"animal": ["Cat", "Cat", "Dog", "Cat", "Dog"], 
                        "name": ["Catalie Portman","Pico de Gato", "Chewbarka", "Sir Isaac Mewton", "K9"],
                        "age": [3, 5, 1, 7, 11]}, 
                       columns=["name", "animal", "age"])


# Print out the DataFrame

cat_dog
name animal age
0 Catalie Portman Cat 3
1 Pico de Gato Cat 5
2 Chewbarka Dog 1
3 Sir Isaac Mewton Cat 7
4 K9 Dog 11

When we want to filter on more than one condition, we have to consider how those conditions relate to each other in terms of what is included and what is excluded from the final filtering.

  • AND relationships use the & symbol (Shift and 7 on the keyboard) and require both or all conditions to evaluate to “True”.

  • OR relationships use the | symbol (the vertical bar or “pipe”, usually Shift and the backslash key) and require at least one of the conditions to evaluate to “True”.

Below is a “Truth table” that can be handy as a look up.

Condition 1 Condition 2 & (AND) Equates to | (OR) Equates to
True True True True
True False False True
False True False True
False False False False

9.4.1 Example

Now that we have an age column, we can select all cats that are older than 4. Again, in this example we will create a mask so we can see the Boolean values behind the scenes, but in the exercises we will do this in one line.

# Create my mask
mask = (cat_dog['animal'] == "Cat") & (cat_dog['age'] > 4)

# have a look at the mask
mask
0    False
1     True
2    False
3     True
4    False
dtype: bool

This can be quite overwhelming on first viewing, so let’s break it down a little!

  • I am assigning my output to the variable “mask” with a single = sign.

  • My first conditional statement:

    • Has round brackets surrounding it “(”
    • The first column I want to look for values in is “animal” from the cat_dog DataFrame - “cat_dog['animal']”
    • From that column I want to find values that meet my condition, which I state with the double equals “==”
    • Finally I am telling pandas what value I want to find; a string containing “Cat”
    • My first condition complete I close my round brackets “)”
  • I use a symbol to represent the relationship between my conditions. Here an ampersand - “&” (Shift and 7) is used to represent “AND”

  • My second conditional statement:

    • Has round brackets surrounding it “(”
    • The second column I want to look for values in is “age” from the cat_dog DataFrame - “cat_dog['age']”
    • From that column I want to find values that are greater than my condition, which I state with the “>”
    • Finally I am telling Pandas what value I want to find; an integer, 4
    • My second condition complete I close my round brackets “)”

Let’s look at the “mask” values alongside the DataFrame:

cat_dog
animal name age included
0 Cat Catalie Portman 3 False
1 Cat Pico de Gato 5 True
2 Dog Chewbarka 1 False
3 Cat Sir Isaac Mewton 7 True
4 Dog K9 11 False

And when we apply the mask to the DataFrame we can see the filtered DataFrame - with only cats over 4.

mask = (cat_dog['animal'] == "Cat") & (cat_dog['age'] > 4)

cats_over_4 = cat_dog[mask]
cats_over_4
name animal age
1 Pico de Gato Cat 5
3 Sir Isaac Mewton Cat 7

To do this in one line, we reference the DataFrame we are selecting from, and then place our filter condition within the indexing or selection brackets.

cats_over_4 = cat_dog[(cat_dog["animal"] == "Cat") & (cat_dog["age"] > 4)]

Just as we can use the AND condition (&), we can also use the OR condition, where only one of the conditions must be met to evaluate to “True”.

The OR condition is represented by the vertical bar symbol (“|”) - usually Shift and the backslash key.

Let’s add a condition where the animal is a cat OR the age is greater than 4.

# Create my mask
mask = (cat_dog["animal"] == "Cat") | (cat_dog["age"] > 4)

# View mask

mask
0     True
1     True
2    False
3     True
4     True
dtype: bool

Let’s look at the “mask” values alongside the DataFrame:

cat_dog
animal name age included
0 Cat Catalie Portman 3 True
1 Cat Pico de Gato 5 True
2 Dog Chewbarka 1 False
3 Cat Sir Isaac Mewton 7 True
4 Dog K9 11 True

Catalie Portman now evaluates to “True”. While she is not older than 4, she is a cat, and by meeting one of the conditions she is included.

K9 also evaluates to “True”. He is not a cat, but meets the age condition by being older than 4, and so is included.

We can of course apply this mask to filter the DataFrame.

cats_or_over4 = cat_dog[mask]

Or do the whole action in one line of code.

cats_or_over4 = cat_dog[(cat_dog["animal"] == "Cat") | (cat_dog["age"] > 4)]

cats_or_over4.head()
name animal age
0 Catalie Portman Cat 3
1 Pico de Gato Cat 5
3 Sir Isaac Mewton Cat 7
4 K9 Dog 11

Here are some more examples using the “titanic” DataFrame.

Here we are filtering for survivors (value 1) who were female, using an AND (&) condition.

female_survivors = titanic[(titanic["survived"] == 1) & (titanic["sex"] == "female")]
female_survivors.tail()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
1260 3 1 Turja, Miss. Anna Sofia female 18.0 0 0 4138 9.8417 NaN S NaN NaN 15
1261 3 1 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S NaN NaN 15
1286 3 1 Whabee, Mrs. George Joseph (Shawneene Abi-Saab) female 38.0 0 0 2688 7.2292 NaN C NaN NaN C
1290 3 1 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S NaN NaN NaN
1300 3 1 Yasbeck, Mrs. Antoni (Selini Alexander) female 15.0 1 0 2659 14.4542 NaN C NaN NaN NaN

And here we are filtering for those with more than 3 siblings/spouses (“sibsp”) on board, or who paid more than £400 for their fare.

siblings_or_expensive = titanic[(titanic['sibsp'] > 3) | (titanic['fare'] > 400)]
siblings_or_expensive.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
49 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C Austria-Hungary / Germantown, Philadelphia, PA NaN 3
50 1 1 Cardeza, Mrs. James Warburton Martinez (Charlo... female 58.0 0 1 PC 17755 512.3292 B51 B53 B55 C Germantown, Philadelphia, PA NaN 3
183 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C NaN NaN 3
302 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C NaN NaN 3
622 3 0 Andersson, Master. Sigvard Harald Elias male 4.0 4 2 347082 31.2750 NaN S Sweden Winnipeg, MN NaN NaN

9.4.2 Exercise

  1. Filter the animals DataFrame where the IncidentNominalCost(£) column was less than £400 and where the Borough column value is Croydon.

  2. Filter the animals DataFrame where the ‘PropertyType’ was “River/canal” or the “PumpHoursTotal” was at most 8.

# (a)

cost_u400_croydon = animals[(animals['IncidentNominalCost(£)'] < 400) & (animals['Borough'] == "Croydon")]

cost_u400_croydon.tail()
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London
5595 099586-19072018 19/07/2018 07:09 2018 2018/19 Special Service 1.0 1.0 333 333.0 FOX TRAPPED IN NETTING IN SCHOOL FIELD Fox Infant/Primary school Other animal assistance Assist trapped wild animal Croydon Norbury Mammal 00AH Inner London
5612 103324-24072018 24/07/2018 12:25 2018 2018/19 Special Service 1.0 1.0 333 333.0 CAT TRAPPED IN ROOM Cat House - single occupancy Animal rescue from height Animal rescue from height - Domestic pet Croydon West Norwood Mammal 00AH Inner London
5654 113548-07082018 07/08/2018 16:54 2018 2018/19 Special Service 1.0 1.0 333 333.0 BIRD IN CHIMNEY CALLER DIABLED - UNABLE TO WA... Bird House - single occupancy Other animal assistance Assist trapped wild animal Croydon Purley Bird 00AH Outer London
5657 114584-09082018 09/08/2018 13:30 2018 2018/19 Special Service 1.0 1.0 333 333.0 CAT TRAPPED IN GAP BETWEEN TWO BUILDINGS - OW... Cat House - single occupancy Other animal assistance Animal assistance involving domestic animal - ... Croydon Woodside Mammal 00AH Outer London
5664 116258-13082018 13/08/2018 10:27 2018 2018/19 Special Service 1.0 1.0 333 333.0 KITTEN TRAPPED BEHIND KITCHEN SINK Cat Purpose Built Flats/Maisonettes - 4 to 9 storeys Other animal assistance Assist trapped domestic animal Croydon Croydon Mammal 00AH Inner London
# (b)

river_pumphours = animals[(animals['PropertyType'] == "River/canal") | (animals['PumpHoursTotal'] > 8)]

9.5 Some Filter Shortcuts

We can also use some other filters to make life easier.

9.5.1 Example 1

With the method “.isin()” we can provide a list of values to filter on; here the boat identities ["2", "C", "15"].

titanic[titanic["boat"].isin(["2", "C", "15"])].head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
56 1 1 Carter, Mr. William Ernest male 36.0 1 2 113760 120.000 B96 B98 S Bryn Mawr, PA NaN C
79 1 1 Cornell, Mrs. Robert Clifford (Malvina Helen L... female 55.0 2 0 11770 25.700 C101 S New York, NY NaN 2
98 1 1 Douglas, Mrs. Walter Donald (Mahala Dutton) female 48.0 1 0 PC 17761 106.425 C86 C Deephaven, MN / Cedar Rapids, IA NaN 2
164 1 1 Homer, Mr. Harry ("Mr E Haven") male 35.0 0 0 111426 26.550 NaN C Indianapolis, IN NaN 15
170 1 1 Ismay, Mr. Joseph Bruce male 49.0 0 0 112058 0.000 B52 B54 B56 S Liverpool NaN C

9.5.2 Example 2

If dealing with numeric values we can use “.between()” and specify a lower and an upper bound (both inclusive by default). Here we’re looking for rows where the “age” column is between 0 and 12.

titanic[titanic["age"].between(0, 12)].head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN
54 1 1 Carter, Master. William Thornton II male 11.0000 1 2 113760 120.0000 B96 B98 S Bryn Mawr, PA NaN 4
94 1 1 Dodge, Master. Washington male 4.0000 0 2 33638 81.8583 A34 S San Francisco, CA NaN 5
273 1 1 Spedden, Master. Robert Douglas male 6.0000 0 2 16966 134.5000 E34 C Tuxedo Park, NY NaN 3

9.5.3 Example 3

We also have the tilde symbol “~” to invert an expression; this works similarly to “!=” for “does not equal”. For example, I can invert the filter for female passengers to return just the male passengers.

Note that I have to wrap my conditional statement in round brackets to ensure this works.

titanic[~(titanic["sex"] == "female")].head()  # Inverts the results of == Female; effectively giving == Male
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN
5 1 1 Anderson, Mr. Harry male 48.0000 0 0 19952 26.5500 E12 S New York, NY NaN 3
7 1 0 Andrews, Mr. Thomas Jr male 39.0000 0 0 112050 0.0000 A36 S Belfast, NI NaN NaN
9 1 0 Artagaveytia, Mr. Ramon male 71.0000 0 0 PC 17609 49.5042 NaN C Montevideo, Uruguay 22.0 NaN

The tilde “~” can be useful in other scenarios too. There are a variety of “.is” methods, for example, that do not have an “is not” equivalent, but we can invert the “.is” method instead. We’ll see a useful example of that in chapter 5.
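
For instance (a small sketch), we could invert the “.isin()” filter from Example 1 to exclude those boats instead:

# Keep only the passengers whose "boat" value is NOT "2", "C" or "15"
titanic[~titanic["boat"].isin(["2", "C", "15"])].head()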

9.5.4 Filtering Text

So far when we’ve filtered text we have done so on exact matches. However, as Python is case sensitive, filtering for “cat” will not show the values “CAT” or “Cat” in our cat_dog DataFrame, should they exist (which they often will in real world data).

# Lets create our cat_dog DataFrame again with some variation in the capitalisation of "cat"

cat_dog_wrong_spelling = pd.DataFrame({"animal": ["cat", "Cat", "Dog", "cat", "Dog", "Rat", "Bat"] , 
                                       "name": ["Catalie Portman","Pico de Gato", "Chewbarka", "Sir Isaac Mewton", "K9", "Roland", "Vlad"],
                                       "age": [3, 5, 1, 7, 11, 1, 3]} , 
                                      columns = ["name", "animal", "age"])

cat_dog_wrong_spelling
name animal age
0 Catalie Portman cat 3
1 Pico de Gato Cat 5
2 Chewbarka Dog 1
3 Sir Isaac Mewton cat 7
4 K9 Dog 11
5 Roland Rat 1
6 Vlad Bat 3

Here the filter condition we previously used (== "Cat") will now only return “Pico de Gato”.

cat_only_spelling = cat_dog_wrong_spelling[cat_dog_wrong_spelling["animal"] == "Cat"]
cat_only_spelling
name animal age
1 Pico de Gato Cat 5

Which is not the result we wanted.

9.5.5 Example

For simple instances we can use the string method “.str.contains()”, as in the example below. Here, in our DataFrame, some entries in the “animal” column have “Cat” and others have “cat”. By using “.str.contains("at")” we can bring in both values, because both contain the partial string “at”.

# Filter to find instances where the string in "animal" contains "at"
cat_only_spelling = cat_dog_wrong_spelling[cat_dog_wrong_spelling["animal"].str.contains("at")]
cat_only_spelling
name animal age
0 Catalie Portman cat 3
1 Pico de Gato Cat 5
3 Sir Isaac Mewton cat 7
5 Roland Rat 1
6 Vlad Bat 3

This is clearly not foolproof however, as both “Rat” and “Bat” also have “at” in them, and as such are displayed in the output.
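
A related option (a small sketch, not required for the course) is the “case” parameter of “.str.contains()”, which makes the match case-insensitive. Searching for “cat” rather than “at” then avoids picking up “Rat” and “Bat”:

# Case-insensitive match: returns the "cat" and "Cat" rows, but not "Rat" or "Bat"
cat_only_case_insensitive = cat_dog_wrong_spelling[cat_dog_wrong_spelling["animal"].str.contains("cat", case=False)]
cat_only_case_insensitive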

In chapter 5 we’ll look at changing the column values and column names to lowercase letters, which would allow a straightforward filter to work in this case.

9.6 Aside - Regular Expressions

For more complex cases we can use what is known as Regular Expressions or RegEx in Python to search for partial strings with specified conditions (such as placing where the letters are located, starting with, ending with etc).

We do not cover these in this introduction course; but here are some more resources you may wish to explore. If you have used Regular Expressions in other languages you will find this very familiar.

  • RegEx How to
  • Python Regular Expressions
  • Datacamp RegEx tutorial

9.7 More Filtering Methods

We’ve looked here at the base Python way of filtering, but as with many things in Python there are other alternatives.

9.7.1 Example 1

titanic.query('fare > 50 & sex == "male"').head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN
10 1 0 Astor, Col. John Jacob male 47.0000 1 0 PC 17757 227.5250 C62 C64 C New York, NY 124.0 NaN
16 1 0 Baxter, Mr. Quigg Edmond male 24.0000 0 1 PC 17558 247.5208 B58 B60 C Montreal, PQ NaN NaN
19 1 0 Beattie, Mr. Thomson male 36.0000 0 0 13050 75.2417 C6 C Winnipeg, MN NaN A

Here we use “.query()”, which allows us to write a more SQL-style query in words and symbols without using our filtering brackets.
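
As a further sketch (the variable name “min_fare” is just for illustration), “.query()” can also reference ordinary Python variables by prefixing them with “@”:

# "min_fare" is a normal Python variable, referenced inside the query string with "@"
min_fare = 50
titanic.query('fare > @min_fare & sex == "male"').head()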

9.7.2 Example 2

We also have “.filter()”, which works on the index labels rather than the contents! Here “like = "1"” will give me the rows whose index labels contain the character “1”, such as “1”, “10”, “11”, “12” and “13”. Note that “axis = 0” means apply to the rows, whereas “axis = 1” applies to the columns.

So far we’ve worked with indexes that are numerical, from 0 to n, but they can also be text based, and the pandas “.filter()” method is particularly useful there.

titanic.filter(like = "1", axis = 0).head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.550 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11
10 1 0 Astor, Col. John Jacob male 47.0000 1 0 PC 17757 227.525 C62 C64 C New York, NY 124.0 NaN
11 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.0000 1 0 PC 17757 227.525 C62 C64 C New York, NY NaN 4
12 1 1 Aubart, Mme. Leontine Pauline female 24.0000 0 0 PC 17477 69.300 B35 C Paris, France NaN 9
13 1 1 Barber, Miss. Ellen "Nellie" female 26.0000 0 0 19877 78.850 NaN S NaN NaN 6

You may find even more ways of filtering your data when researching, there’s lots out there!

10 Deriving New Columns

We create new columns in a very similar way to how we select them:

titanic["new_column"] = 1

In the example above we are creating a new column in the titanic DataFrame called “new_column”, which will have the constant value of 1 in each row.

  • To create a new column we first give the DataFrame name and use square brackets to specify our column name as a string, here “new_column”.
    • If the column name does not exist in our DataFrame (as in the example) the contents will be placed in a new column, at the end of the DataFrame.
    • If the column already exists in our DataFrame the contents of that column will be overwritten.

We then use the assignment operator “=”, and whatever follows is what we fill the column with; in this example it was the constant integer value 1.

We’ll now look at various different ways we can fill these new columns.

10.1 Other Constant Values

Some more examples are in the cell below, filling with floats as well as objects/strings. The syntax remains the same: subset the column and assign a value to it.

10.1.1 Example

As a reminder, our cat_dog dataframe looks as follows:

cat_dog.head()
name animal age
0 Catalie Portman Cat 3
1 Pico de Gato Cat 5
2 Chewbarka Dog 1
3 Sir Isaac Mewton Cat 7
4 K9 Dog 11

Now let’s add some more constant columns!

cat_dog["constant_string"] = "mammal"   # adds a string/object constant
cat_dog["constant_float"] = 3.2 # adds a floating point constant
cat_dog["constant_int"] = 2   # adds an interger constant

cat_dog
name animal age constant_string constant_float constant_int
0 Catalie Portman Cat 3 mammal 3.2 2
1 Pico de Gato Cat 5 mammal 3.2 2
2 Chewbarka Dog 1 mammal 3.2 2
3 Sir Isaac Mewton Cat 7 mammal 3.2 2
4 K9 Dog 11 mammal 3.2 2

10.1.2 Exercise

Add a new constant column to the animals DataFrame called “attended” and set the value to be “True”.

animals['attended'] = True

# View the dataframe

animals.head()
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription AnimalGroupParent PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London attended
0 139091 01/01/2009 03:01 2009 2008/09 Special Service 1.0 2.0 255 510.0 DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 Dog House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Croydon Norbury Mammal 00AH Outer London True
1 275091 01/01/2009 08:51 2009 2008/09 Special Service 1.0 1.0 255 255.0 ASSIST RSPCA WITH FOX TRAPPED,B15 Fox Railings Other animal assistance Animal assistance involving livestock - Other ... Croydon Woodside Mammal 00AH Outer London True
2 2075091 04/01/2009 10:07 2009 2008/09 Special Service 1.0 1.0 255 255.0 DOG CAUGHT IN DRAIN,B15 Dog Pipe or drain Animal rescue from below ground Animal rescue from below ground - Domestic pet Sutton Wallington Mammal 00BF Outer London True
3 2872091 05/01/2009 12:27 2009 2008/09 Special Service 1.0 1.0 255 255.0 HORSE TRAPPED IN LAKE,J17 Horse Intensive Farming Sheds (chickens, pigs etc) Animal rescue from water Animal rescue from water - Farm animal Hillingdon Ruislip Mammal 00AS Outer London True
4 3553091 06/01/2009 15:23 2009 2008/09 Special Service 1.0 1.0 255 255.0 RABBIT TRAPPED UNDER SOFA,B15 Rabbit House - single occupancy Other animal assistance Animal assistance involving livestock - Other ... Havering Harold Hill Mammal 00AR Outer London True

10.2 Numeric Values

We can create numeric columns as well. These are often referred to as ‘calculated columns’ as they can be calculated by referencing other numeric columns.

10.2.1 Example

We can do any form of operations we like here, for example creating the family size of the person. This is the number of siblings or spouses on board (“sibsp”), the number of parents or children on board (“parch”), plus 1 (the person themselves).

titanic["family_size"] = titanic["sibsp"] + titanic["parch"] + 1

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 1
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 4
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN 4
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN 4
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN 4
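
As a further illustration (the column name “fare_per_person” is just for this sketch, not part of the course materials), a calculated column can combine the new “family_size” column with an existing one:

# A rough per-person fare: divide the recorded fare by the size of the travelling group
titanic["fare_per_person"] = titanic["fare"] / titanic["family_size"]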

10.3 Boolean and Binary Values

Using the same conditions as we do when filtering we can also create Boolean (True/False) columns. This is very common practice when we want to create binary columns (which can be very useful for visualisation and analysis purposes).

10.3.1 Example

In the example below I am assigning “True” to all rows where the value in the “age” column is under 18. My new column is called “under_18_bool”.

titanic["under_18_bool"] = titanic["age"] < 18
titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size under_18_bool
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 1 False
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 4 True
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN 4 True
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN 4 False
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN 4 False

We can also change these values from the Boolean “True” and “False” to the binary values “1” and “0”, which is the key change we make when it comes to building models (it is often required that all values are numeric for example).

We can convert columns by using the code “.astype("int64")”.

titanic["under_18_binary"] = titanic["under_18_bool"].astype("int64")

titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size under_18_bool under_18_binary
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S St Louis, MO NaN 2 1 False 0
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN 11 4 True 1
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN 4 True 1
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON 135.0 NaN 4 False 0
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON NaN NaN 4 False 0

We can also do this in one step when we create the column. We wrap the first part of the code in round brackets; like the order of operations in mathematics (BIDMAS, BODMAS, whatever you referred to it as), this tells Python which part to do first.

In the cell below we add the new column and change the type in a single line.

# Add the new column and change the type in one line

titanic["under_18_binary"] = (titanic["age"] < 18).astype("int64")

It’s important to note that where we have missing values, it will mark those as “False” (or 0). We’ll look at one way of fixing this after the exercises, but for a full look into missing values, stay tuned for Chapter 5!

10.3.2 Example

The following code uses the “.isnull()” method to single out the missing values in the “age” column. This is wrapped up in the indexing brackets [], so we end up with all rows where “age” is missing.

titanic[titanic['age'].isnull()].head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size under_18_bool under_18_binary
15 1 0 Baumann, Mr. John D male NaN 0 0 PC 17318 25.9250 NaN S New York, NY NaN NaN 1 False 0
37 1 1 Bradley, Mr. George ("George Arthur Brayton") male NaN 0 0 111427 26.5500 NaN S Los Angeles, CA NaN 9 1 False 0
40 1 0 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C Philadelphia, PA NaN NaN 1 False 0
46 1 0 Cairns, Mr. Alexander male NaN 0 0 113798 31.0000 NaN S NaN NaN NaN 1 False 0
59 1 1 Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... female NaN 0 0 17770 27.7208 NaN C New York, NY NaN 5 1 False 0

10.3.3 Exercise

  1. Create a new Boolean column in the “animals” DataFrame called “fish_bool”. Use the “AnimalClass” column to find the values that are equivalent to “Fish”.

  2. Convert “fish_bool” to a numeric column.

# (a)

animals["fish_bool"] = animals["AnimalClass"] == "Fish"

# Test if it's worked with a filter
animals[animals['AnimalClass'] == "Fish"] 
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription ... PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London attended fish_bool
962 129048101 30/07/2010 21:44 2010 2010/11 Special Service 1.0 1.0 260 260.0 FISH IN DANGER OF DYING IN POND ... Lake/pond/reservoir Other animal assistance Animal assistance involving livestock - Other ... Sutton Sutton Fish 00BF Inner London True True
964 129261101 31/07/2010 06:15 2010 2010/11 Special Service 2.0 5.0 260 1300.0 FISH IN DISTRESS, HVPU REQUIRED ... Lake/pond/reservoir Other animal assistance Animal assistance involving livestock - Other ... Enfield Enfield Fish 00AK Inner London True True

2 rows × 21 columns

# (b)

animals["fish_bool"] = animals["fish_bool"].astype("int64")

# Check with a filter

animals[animals["AnimalClass"] == "Fish"]
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNominalCost(£) IncidentNominalCost(£) FinalDescription ... PropertyType SpecialServiceTypeCategory SpecialServiceType Borough StnGroundName AnimalClass Code London attended fish_bool
962 129048101 30/07/2010 21:44 2010 2010/11 Special Service 1.0 1.0 260 260.0 FISH IN DANGER OF DYING IN POND ... Lake/pond/reservoir Other animal assistance Animal assistance involving livestock - Other ... Sutton Sutton Fish 00BF Inner London True 1
964 129261101 31/07/2010 06:15 2010 2010/11 Special Service 2.0 5.0 260 1300.0 FISH IN DISTRESS, HVPU REQUIRED ... Lake/pond/reservoir Other animal assistance Animal assistance involving livestock - Other ... Enfield Enfield Fish 00AK Inner London True 1

2 rows × 21 columns

10.3.4 More on the issue of Booleans and Missing Values

Earlier we declared a Boolean column that has “True” for under 18 and “False” for 18 and over. One issue with this method is that if an entry in the “age” column is a missing value (“NaN”) then it is classified as “False” here.

While Pandas has done exactly as we asked (the row does not meet the criteria so is “False”), we’re artificially inflating the number of adults in our two new columns (“under_18_bool” and “under_18_binary”).

Statistically this isn’t right. There are 263 missing values in the “age” column; as we don’t know those passengers’ ages, we should preserve the missing values rather than assume they’re 18 or over (as the columns currently do!).

10.3.5 Example

To illustrate, the cell below filters the titanic DataFrame, showing the rows where the “age” column is null (using “.isnull()”).

You can see that they show “False” in the “under_18_bool” column.

# Create the "under_18_bool" column again - in case it's not been run yet!
titanic["under_18_bool"] = titanic["age"] < 18

# Look at missing values in the age column
# Note that in the "under_18_bool" column - they are False!

titanic[titanic['age'].isnull()].head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size under_18_bool under_18_binary
15 1 0 Baumann, Mr. John D male NaN 0 0 PC 17318 25.9250 NaN S New York, NY NaN NaN 1 False 0
37 1 1 Bradley, Mr. George ("George Arthur Brayton") male NaN 0 0 111427 26.5500 NaN S Los Angeles, CA NaN 9 1 False 0
40 1 0 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C Philadelphia, PA NaN NaN 1 False 0
46 1 0 Cairns, Mr. Alexander male NaN 0 0 113798 31.0000 NaN S NaN NaN NaN 1 False 0
59 1 1 Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... female NaN 0 0 17770 27.7208 NaN C New York, NY NaN 5 1 False 0

10.4 Example - Filling in missing values with a constant

We’ll look at more ways to handle missing values in chapter 5; but for now we can use “.loc[]” to fill in these values.

We’ll also talk about “.loc[]” in chapter 5 and in the extension chapter 7. Briefly, “.loc[]” can look up rows and columns based on either their labels or, as we’re doing here, a logical statement like a filter condition.

Here we’re using “.loc[]” to find rows where the “age” column has a null value, and updating our new column “under_18_bool” to show null (“NaN”) too, using the keyword “None”. Note that “.loc[]” takes two inputs: the row condition (here, whether “age” is null), then the column to apply it to.

titanic.loc[titanic['age'].isnull(), 'under_18_bool'] = None
FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.

Pandas warns here because we are writing a missing value into a column whose dtype is plain bool, which cannot hold missing values.
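
One way to avoid that warning (a minimal sketch, assuming pandas 1.0 or newer) is to build the column using pandas’ nullable “boolean” dtype, which can hold True, False and missing values side by side:

# Nullable "boolean" dtype version of the same column
titanic["under_18_bool"] = (titanic["age"] < 18).astype("boolean")

# pd.NA is pandas' missing-value marker for nullable dtypes
titanic.loc[titanic["age"].isnull(), "under_18_bool"] = pd.NA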

Now the “NaN” values in “age” are also “NaN” values in “under_18_bool”, as we can see if we run the cell below.

titanic[titanic['age'].isnull()].head()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size under_18_bool under_18_binary
15 1 0 Baumann, Mr. John D male NaN 0 0 PC 17318 25.9250 NaN S New York, NY NaN NaN 1 NaN 0
37 1 1 Bradley, Mr. George ("George Arthur Brayton") male NaN 0 0 111427 26.5500 NaN S Los Angeles, CA NaN 9 1 NaN 0
40 1 0 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C Philadelphia, PA NaN NaN 1 NaN 0
46 1 0 Cairns, Mr. Alexander male NaN 0 0 113798 31.0000 NaN S NaN NaN NaN 1 NaN 0
59 1 1 Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... female NaN 0 0 17770 27.7208 NaN C New York, NY NaN 5 1 NaN 0

While we still have missing data in this column that we may want to deal with later (see chapter 5), at least our “under_18_bool” column is now more accurate. It’s really important that we think about missing values whenever we perform operations on our data.

Even recoding values that are not technically missing (placeholders and codes, for example) is an important step when analysing and cleaning our data, as it informs how we treat the column.

10.5 Mapping New Values

We can also use “.map()” to map a dictionary to a column.

This can be useful when we want to explain coded variables for other users; e.g. the column “embarked” has the values “S”, “C” and “Q”. We can use “.map()” to give the full names of these ports.

titanic["port"] = titanic["embarked"].map({"S":"Southampton",
                                           "C":"Cherbourg", 
                                           "Q": "Queenstown (Cobh)"})

titanic.tail()
pclass survived name sex age sibsp parch ticket fare cabin embarked home.dest body boat family_size under_18_bool under_18_binary port
1304 3 0 Zabour, Miss. Hileni female 14.5 1 0 2665 14.4542 NaN C NaN 328.0 NaN 2 True 1 Cherbourg
1305 3 0 Zabour, Miss. Thamine female NaN 1 0 2665 14.4542 NaN C NaN NaN NaN 2 NaN 0 Cherbourg
1306 3 0 Zakarian, Mr. Mapriededer male 26.5 0 0 2656 7.2250 NaN C NaN 304.0 NaN 1 False 0 Cherbourg
1307 3 0 Zakarian, Mr. Ortin male 27.0 0 0 2670 7.2250 NaN C NaN NaN NaN 1 False 0 Cherbourg
1308 3 0 Zimmerman, Mr. Leo male 29.0 0 0 315082 7.8750 NaN S NaN NaN NaN 1 False 0 Southampton

The usefulness of this process cannot be overstated. Many times we will be handed real-world data full of confusing acronyms and initials; recoding them into something more descriptive and easier to work with can be a lifesaver in certain situations!
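
One thing to keep in mind is that any value not present in the dictionary is mapped to a missing value (“NaN”). A quick sanity check after mapping (a small sketch) is to look for rows that had an “embarked” code but received no “port”:

# Rows with an "embarked" value but a missing "port" would indicate an unmapped code
titanic[titanic["port"].isnull() & titanic["embarked"].notnull()]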

10.6 Deleting columns

There are a few ways to remove or drop columns.

Let’s look at cat_dog again

cat_dog
name animal age constant_string constant_float constant_int
0 Catalie Portman Cat 3 mammal 3.2 2
1 Pico de Gato Cat 5 mammal 3.2 2
2 Chewbarka Dog 1 mammal 3.2 2
3 Sir Isaac Mewton Cat 7 mammal 3.2 2
4 K9 Dog 11 mammal 3.2 2

10.6.1 Example 1

We have the built-in Python statement “del”, which we can follow with a selected column to permanently delete it.

del cat_dog["constant_float"]
cat_dog
name animal age constant_string constant_int
0 Catalie Portman Cat 3 mammal 2
1 Pico de Gato Cat 5 mammal 2
2 Chewbarka Dog 1 mammal 2
3 Sir Isaac Mewton Cat 7 mammal 2
4 K9 Dog 11 mammal 2

10.6.2 Example 2

We can also use the “.drop()” method.

For Pandas versions 0.21.0 and newer:

cat_dog.drop(columns = ["constant_string", "constant_int"], inplace = True)

# If this cell doesn't work check your version using pd.__version__
# If lower than 0.21.0 use the code in the text below.

Here note:

  • I’m passing my column names as a list, as there’s more than one
  • “inplace=True” updates the original Dataframe

This is a very good example of how packages can evolve over time. For versions of Pandas lower than 0.21.0 we have to pass “axis = 1” instead of “columns =”, as the older “.drop()” cannot otherwise tell whether you mean columns or rows; a sketch of that style is shown below.
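
For completeness, here is a sketch of the older style (it still works in current pandas, and is equivalent to the cell above, so only one of the two needs to be run):

# Pre-0.21.0 style: pass the labels and tell pandas they are columns with axis = 1
cat_dog.drop(["constant_string", "constant_int"], axis = 1, inplace = True)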

11 Merging Data

Let’s have a look at the two “joining_data” DataFrames we loaded in at the start of the chapter:

joining_data1
name animal age
0 Catalie Portman cat 3
1 Pico de Gato cat 5
2 Chewbarka dog 1
3 Sir Isaac Mewton cat 7
4 K9 cat 11
5 Arf Vader dog 6
joining_data2
name vaccinated
0 Catalie Portman True
1 Pico de Gato True
2 Chewbarka False
3 Sir Isaac Mewton True
4 K9 False
5 Spiderpig True
  • In “joining_data1” we have the columns “name”, “animal” and “age”.
  • In “joining_data2” we have the columns “name” and “vaccinated”.

We see that “Arf Vader” only appears in “joining_data1” and “Spiderpig” only appears in “joining_data2”. The remainder of the animals appear in both DataFrames.

11.1 Types of Join

These two DataFrames have a column in common that we can use to join them on: “name”. Without a shared column (or columns) we could not join the datasets together, as this is the requirement for performing a merge/join.

We can create a variety of joins. These are often best demonstrated with a Venn diagram. From here, I’m going to say that “joining_data1” is my “left” DataFrame, and that “joining_data2” is my “right” DataFrame.

For this I’m going to highlight the position of 3 animal names. As we mentioned earlier:

  • “Arf Vader” is only in “joining_data1”, our “left” DataFrame.
  • “Spiderpig” is only in “joining_data2”, our “right” DataFrame
  • “Catalie Portman” is in both “joining_data1” and “joining_data2” and will represent values in BOTH DataFrames.

Venn diagrams showing the types of merge: inner, outer, left and right.

As you can see, “left” and “right” merges take all the data from that DataFrame plus any matching rows from the opposite one.

This means that in our left join:

  • “Arf Vader” is included because he has data in the left DataFrame.
  • “Arf Vader” has missing values for “vaccinated”, as he isn’t in the right DataFrame.
  • “Catalie Portman” has data in all columns, as she appears in both the left and the right.
  • “Spiderpig” doesn’t have any data. We’re only taking data from the left, and he’s only in the right.

11.1.1 Example

Let’s look at how we do this left merge in code. We’re going to use “pd.merge()” and do a left join.

left_merge = pd.merge(left = joining_data1,
                      right=joining_data2, 
                                 how="left",
                                 on="name",
                                 indicator=True)

left_merge
name animal age vaccinated _merge
0 Catalie Portman cat 3 True both
1 Pico de Gato cat 5 True both
2 Chewbarka dog 1 False both
3 Sir Isaac Mewton cat 7 True both
4 K9 cat 11 False both
5 Arf Vader dog 6 NaN left_only

This is the most layered method we have seen so far, with numerous arguments we must utilise to perform the joins we want.

  • “left =” is our left hand DataFrame
  • “right =” is our right hand DataFrame
  • “how =” is how we want to perform the join, other options are
    • “right”
    • “outer”
    • “inner” (default behaviour)
  • “on =” is the name of the column both DataFrames have in common. If we’re joining on more than one column we can specify a list here.
  • “indicator” set to “True” adds an additional column showing if the data was in the left_only, both or right_only as applicable, which is excellent for checking.
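
For comparison, an inner join (the default behaviour of “how =”) keeps only the rows that match in both DataFrames; a small sketch:

# Only the five animals present in both DataFrames are kept;
# "Arf Vader" and "Spiderpig" are dropped
inner_merge = pd.merge(left = joining_data1, right = joining_data2, how = "inner", on = "name")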

11.1.2 Exercise

  1. Using “marvel_left” and “marvel_right” that we loaded in earlier, perform an outer join using the column “name”. Include the column that specifies whether they came from the left or right DataFrame.

  2. Using “marvel_left” and “marvel_right” perform a left join on the column containing the years. Note that the names of these columns may not be the same in both DataFrames! (Hint: Use the help function to observe optional arguments to help with this).

# (a)

marvel_outer = marvel_left.merge(right=marvel_right, 
                                 how="outer",
                                 on="name", 
                                 indicator=True)
marvel_outer.head()
name year month worldwide_box_office release_year score _merge
0 Ant Man 2015.0 May 5.193120e+08 2015.0 82.0 both
1 Ant Man and the Wasp 2018.0 July 6.226481e+08 2018.0 88.0 both
2 Avengers 2012.0 May 1.518813e+09 2012.0 92.0 both
3 Avengers: Age of Ultron 2015.0 July 1.405404e+09 2015.0 75.0 both
4 Avengers: Endgame NaN NaN NaN 2019.0 94.0 right_only
# (b) Using left_on and right_on since the column names are not the same in each DataFrame

marvel_left_merge = marvel_left.merge(right=marvel_right , 
                                      how="left",
                                      left_on="year",
                                      right_on="release_year",
                                      indicator=True)
marvel_left_merge.head()
name_x year month worldwide_box_office name_y release_year score _merge
0 Iron Man 2008 May 585174222 Iron Man 2008 93 both
1 Iron Man 2008 May 585174222 The Incredible Hulk 2008 67 both
2 The Incredible Hulk 2008 June 263427551 Iron Man 2008 93 both
3 The Incredible Hulk 2008 June 263427551 The Incredible Hulk 2008 67 both
4 Iron Man 2 2010 May 623933331 Iron Man 2 2010 73 both

Notice here we use the parameter “left_on” to give the name of the joining column in our left DataFrame and “right_on” for the name of the joining column in our right.

We see here that the year is probably not the best joining column! We’re getting multiple matches between movies with different names, because only the “year” values need to agree.

The solution to this is to join on more than one column, which you will do next!

11.1.3 Extension Exercise

Modify the join from (b) in the previous Exercise to:

  • Join on BOTH the column containing names and the column containing years. Remember that the names may differ across the dataframes!

Hint: How did we specify multiple columns in previous chapters? Remember order is important.

marvel_left_merge = marvel_left.merge(right=marvel_right , 
                                      how="left" ,
                                      left_on=["name", "year"],
                                      right_on=["name", "release_year"],
                                      indicator=True)
marvel_left_merge.head()
name year month worldwide_box_office release_year score _merge
0 Iron Man 2008 May 585174222 2008.0 93.0 both
1 The Incredible Hulk 2008 June 263427551 2008.0 67.0 both
2 Iron Man 2 2010 May 623933331 2010.0 73.0 both
3 Thor 2011 May 449326618 NaN NaN left_only
4 Captain America: The First Avenger 2011 July 370569774 2011.0 79.0 both

Notice here we again use the parameters “left_on” and “right_on”; as we’re joining on multiple columns, each is given as a list of column names. This approach would also serve us well on a larger dataset.

For example, in 1998 there was a movie called “The Avengers”. It’s not a Marvel movie and does not star Iron Man; instead it stars Uma Thurman and Ralph Fiennes as spies! By using both the name and the year we ensure a higher level of accuracy; being as specific as you can is good.

11.2 Union Merge

We can also do what’s known as a “union” merge, which is implemented using “pd.concat()”. This is where we add data on to the bottom of an existing DataFrame (like an appending motion).

It is important that the DataFrames share the same column names and compatible data types (remember, columns can only have one datatype!). “pd.concat()” aligns columns by name, so any column that appears in only one of the DataFrames will be filled with missing values for the other DataFrame’s rows.

11.2.1 Example

For example, we have some new animals to add on to “joining_data1”; these are contained in the “union_join_data” DataFrame we loaded in earlier.

union_joined = pd.concat(objs=[joining_data1, union_join_data], ignore_index=True)
union_joined
name animal age
0 Catalie Portman cat 3
1 Pico de Gato cat 5
2 Chewbarka dog 1
3 Sir Isaac Mewton cat 7
4 K9 cat 11
5 Arf Vader dog 6
6 Andy Warhowl dog 12
7 Voldetort tortoise 90
8 Repecka chicken 1

We can see our new additions, namely “Andy Warhowl”, “Voldetort” and “Repecka” here now.

Our parameters and arguments in “pd.concat()” are:

  • “objs =” - the objects we wish to concatenate, given here as a list.
  • “ignore_index =” - when set to “True” the new DataFrame gets its own index starting from 0. If set to “False” we would end up with two rows sharing an index of 0 (and so on), which will cause frustration later!

This is a very rich topic (as are most processes in Pandas!) and other methods exist that allow us to combine data in different ways; “.join()” for example, and in older versions of Pandas “.append()” (which has since been removed in favour of “pd.concat()”). Here’s a link with a lot more detail about joining data together: Merge, Join, Concatenate and Compare.

12 Chapter Summary

Amazing work! Congratulations on completing Chapter 4 of the Introduction to Python Course, you are now a DataFrame Wizard!

In Chapter 5 we will focus on applying the various methods we covered here to clean our data, preparing it ready for the tasks that follow such as Aggregation and Summary (Chapter 6) and Visualisation (Future courses on the Learning Hub). We will cover:

  • Copies and Views
  • Updating Values
  • Changing Column Names
  • Missing Values