import pandas as pd
Chapter 4 - Working with DataFrames
1 Chapter Overview and Learning Objectives
Packages and Datasets
Exploring Datasets
- Quick Previews
- Size
- Data Types
- Column Names
Sorting Data
Subsetting Data
- Selecting Single Columns
- Selecting Multiple Columns
Filtering
- Single Conditional Filtering
- Multiple Conditional Filtering
Deriving New Columns
- Constant Values
- Numeric Values
- Boolean and Binary Values
- Mapping new values
- Deleting columns
Merging Data
- About Merging
- Merging using Pandas
- Union joins
2 Packages and Datasets
2.1 Packages
As a reminder, we should always import our packages at the top of our script. In this chapter (and for the rest of the course) we will use pandas, and give it the alias “pd”.
2.1.1 Exercise
Import Pandas and give it the alias pd
2.2 Datasets
Good practice also dictates that we import our datasets at the top of our script too, following the import of packages.
In this session we’ll be using:
- animals - animals.csv
- titanic - titanic.xlsx
- joining_data1 - joining_data1.csv
- joining_data2 - joining_data2.csv
- union_join_data - union_data.csv
- marvel_left - joining_exercise1.csv
- marvel_right - joining_exercise2.csv
These are all straight forward data imports with no additional parameters required. This is a good test of the techniques covered in Chapter 3.
2.2.1 Exercise
Load in these datasets listed above
You can check your variables are loaded by using “%whos” in Jupyter. In Spyder or other IDE’s they should appear in your variable explorer.
= pd.read_csv("../data/animals.csv")
animals = pd.read_excel("../data/titanic.xlsx")
titanic = pd.read_csv("../data/joining_data1.csv")
joining_data1 = pd.read_csv("../data/joining_data2.csv")
joining_data2 = pd.read_csv("../data/union_data.csv")
union_join_data = pd.read_csv("../data/joining_exercise1.csv")
marvel_left = pd.read_csv("../data/joining_exercise2.csv") marvel_right
3 Exploring Datasets
Before you start working with a dataset it is important to examine it. For example:
- Do the values look how you would expect?
- Does the data have the right number of rows and columns?
- Do the columns have your expected data types?
- Is your data clean?
In this section we’ll look at various ways of exploring our data, with the aim of answering the above questions.
In chapter 5 we will look at how we can clean our data. When working with data, cleaning is normally a step we’d do first. Those concepts are a little more complicated; and we like to establish a good base knowledge of how Python works first. For these sessions we’ll be using data that has already been cleaned to make it easy to handle.
3.1 Quick Previews
In the last section we looked at three methods of quickly inspecting our dataframes.
- .head() - Top 5 rows (can be altered with parameter n =)
- .tail() - Bottom 5 Rows (can be altered with paramter n =)
- .sample() - 1 Randomly sampled row.
We use these because we do not always want Jupyter Notebooks to print out large datasets; we just want to check that certain operations have worked on a smaller amount of data.
To review, we can modify how many rows are returned by putting a number within the round brackets. The number in the round brackets is known as an argument. “.head()” and “.tail()” have a default argument of 5; if we don’t pass an argument that default behaviour is used.
You may recall that we mentioned this parameter is ‘n =’, but we need not specify the name, just the argument. Why is this? This is because Python knows what parameter we are giving an argument for when specifying the integer to “.head()” or “.tail()”.
However, don’t always rely on this, as order comes into play when functions/methods have a large number of parameters (and particularly multiple ones of the same type, integers etc). As such, it is good practice to use the name of the parameter in most cases.
3.2 Example
3)
animals.head(
# Best Practice
# animals.head(n = 3)
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 01/01/2009 03:01 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 | Dog | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Norbury | Mammal | 00AH | Outer London |
1 | 275091 | 01/01/2009 08:51 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | ASSIST RSPCA WITH FOX TRAPPED,B15 | Fox | Railings | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Woodside | Mammal | 00AH | Outer London |
2 | 2075091 | 04/01/2009 10:07 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | DOG CAUGHT IN DRAIN,B15 | Dog | Pipe or drain | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Sutton | Wallington | Mammal | 00BF | Outer London |
If you run just “animals”, you may notice that Jupyter does not display all of the rows in our DataFrame.
Instead it represents these rows using three dots, “…” (ellipsis) symbol. You may also see this with columns. This often happens in large datasets; and helps to save computational power. This is similar to the max print option in Spyder and other IDEs.
While there are various ways of changing this setting with Jupyter (Stack Overflow has several answers), if you are regularly using large DataFrames, and wish to inspect all rows you may wish to use VSCode or another IDE, since many of these have in built data viewers (which allow us to view the whole dataset).
We can also use “.info()” this gives us information about:
- The class of the object – a “pandas” DataFrame
- The range (or length of our index), a.k.a the number of rows
- How many columns we have
- The column names
- The number of non-null entries (a.k.a without missing values)
- The data type of the column
- The number of columns of each data type.
3.3 Example
animals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5752 entries, 0 to 5751
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 IncidentNumber 5752 non-null object
1 DateTimeOfCall 5752 non-null object
2 CalYear 5752 non-null int64
3 FinYear 5752 non-null object
4 TypeOfIncident 5752 non-null object
5 PumpCount 5716 non-null float64
6 PumpHoursTotal 5716 non-null float64
7 HourlyNominalCost(£) 5752 non-null int64
8 IncidentNominalCost(£) 5716 non-null float64
9 FinalDescription 5749 non-null object
10 AnimalGroupParent 5752 non-null object
11 PropertyType 5752 non-null object
12 SpecialServiceTypeCategory 5752 non-null object
13 SpecialServiceType 5752 non-null object
14 Borough 5748 non-null object
15 StnGroundName 5752 non-null object
16 AnimalClass 5752 non-null object
17 Code 5742 non-null object
18 London 5742 non-null object
dtypes: float64(3), int64(2), object(14)
memory usage: 853.9+ KB
3.4 Size
Knowing the size of our DataFrame is also useful (a.k.a the number of rows and columns), separately to what “.info()” gives us. Often we want to save these elements to variables and this is difficult with that method and as such, we use some attributes
3.4.1 Example
“.shape” is an attribute (note the lack of brackets!) that provides us a tuple with the dimensions of the object, in the form (number of rows, number of columns).
animals.shape
(5752, 19)
Notice this is the first example of an attribute of the DataFrame rather than a method, so doesn’t require the round brackets at the end of its call. We talk a bit more about the similarities and differences between functions, methods and attributes in chapter 2. Feel free to review that section if you are stuck.
We can use our indexing brackets (also discussed at length in chapter 2) to return an item in this tuple, for example the number rows at index 0.
0] animals.shape[
5752
And the number of columns at index 1.
1] animals.shape[
19
As an aside:
Because “.shape” returns us a tuple we can assign each of these to it’s own variable. To do this we give our variable names in the order of the tuple output, separated by a comma before our assignment operator (=).
3.4.2 Example
Our tuple outputs rows first then columns, so we give our variables as “nrow” and “ncol”. This is also a great chance to show off multi-variable assignment introduced in chapter 2.
= animals.shape
nrow, ncol
print("There are", nrow, "rows, and", ncol, "columns in the animals DataFrame.")
There are 5752 rows, and 19 columns in the animals DataFrame.
Of course, you can always do this separately without multi-variable assignment as below, but that way is certainly the most efficient.
= animals.shape[0]
nrow = animals.shape[1] ncol
We can also use the inbuilt “len()” function to find the length. This will return us the number of rows.
len(animals)
5752
We can also modify this to return us the number of columns as follows.
len(animals.columns)
19
Note that the .columns attribute is also very useful, returning something called an Index object containing the ordered column names of the pandas object. Returning the length of such an object gives us the number of values (a.k.a the number of columns).
More info on the index datatype is available, but I would suggest returning to this later as reference material.
3.4.3 Exercise
Look at the dimensions of the “titanic” DataFrame.
Use shape to return the tuple of rows and columns.
Use multi-variable assignment to obtain the rows and columns as variables.
Print the results from (b) in a suitable sentence.
# (a)
titanic.shape
(1309, 14)
# (b)
= titanic.shape
nrow, ncol
print(nrow, ncol)
1309 14
# (c)
print("There are", nrow, "rows and", ncol, "columns in the titanic DataFrame.")
There are 1309 rows and 14 columns in the titanic DataFrame.
4 Data Types
It is also important to check the data types when we bring in data. If you’ve worked with other analytical software before you’ll know that sometimes data doesn’t come in as we would expect.
We do this using “.dtypes”, another very useful attribute (delivering on that promise that we will use attributes as well as methods!).
4.1 Example
animals.dtypes
IncidentNumber object
DateTimeOfCall object
CalYear int64
FinYear object
TypeOfIncident object
PumpCount float64
PumpHoursTotal float64
HourlyNominalCost(£) int64
IncidentNominalCost(£) float64
FinalDescription object
AnimalGroupParent object
PropertyType object
SpecialServiceTypeCategory object
SpecialServiceType object
Borough object
StnGroundName object
AnimalClass object
Code object
London object
dtype: object
We have some of the data types we’re seen before
- int represents integers; our whole numbers
- CalYear (Calendar Year) is column of whole numbers
- float represents ‘floating point’ numbers or decimals
- IncidentNominalCost(£) (The financial cost of each incident) is a decimal number.
- object or O here represents what we’ve been calling string data so far, in that they are text values
- AnimalClass (The kind of animal) is a text based value.
We may also see
- bool - representing Boolean values; our True and False
- datetime - date and time values
- category - a special “pandas” datatype for categorical or factor variables (see Chapter 7 for more info, this is reference material)
You may notice that some categories have not come in quite how we would expect.
4.1.1 Example
'DateTimeOfCall'].dtypes
animals[
# This is how we select a single column; we'll cover this later
dtype('O')
“DateTimeOfCall” has come in as an “object” or text value; when it should be “datetime”. This is extremely common with dates and times in a variety of software; not just Python!
Python will often choose an object data type for columns if there’s any ambiguity over the data type. This is because objects retain all the characteristics of the data and allows us to make the choice about how to process it.
For example, the date 01/11/19 could be interpreted as either:
- 1st November 2019
- 11th January 2019
- 19th November 2001
depending on region (which makes the most sense to you?).
Since we don’t look at handling date and time data in this course; we’ll leave these columns alone for now. This is a perfect time to reference the Dates and Times in Python course on the Learning Hub! This is a recommended follow up to the Introduction to Python course.
For an overview that is much less in-depth than that course, the following resource is available:
Dates and Times in Pandas is a tutorial for handling dates and times using pandas.
4.1.2 Exercise
Look at the data types of the titanic DataFrame.
Are there any values you didn’t expect to see?
You can check the Data Dictionary (an explanation of each column)
# Look at the data types of the titanic DataFrame.
titanic.dtypes
pclass int64
survived int64
name object
sex object
age float64
sibsp int64
parch int64
ticket object
fare float64
cabin object
embarked object
home.dest object
body float64
boat object
dtype: object
We may expect “survived” to be a boolean (since it is a binary choice, yes or no) but it’s a numeric. To come over as boolean it would’ve needed to be ‘TRUE’, ‘FALSE’ or ‘True’, ‘False’.
We may expect “boat” to be a number, but some boats are referred to by letters.
4.2 Column Names
It is important to know what your columns are called because in Python we commonly select our columns by name. It is also key to know how they are constructed (lower case, snake case etc) as Python is case sensitive.
4.2.1 Example
First we give the DataFrame name then the name of our column, in quotes, inside square brackets.
'AnimalClass'] animals[
0 Mammal
1 Mammal
2 Mammal
3 Mammal
4 Mammal
...
5747 Mammal
5748 Mammal
5749 Mammal
5750 Mammal
5751 Mammal
Name: AnimalClass, Length: 5752, dtype: object
We can then add a method to the end, for example “.head()”.
'AnimalClass'].head() animals[
0 Mammal
1 Mammal
2 Mammal
3 Mammal
4 Mammal
Name: AnimalClass, dtype: object
While Python can handle column names with spaces, symbols and irregular capitalization; it is often easier to clean them. We’ll have a look at how to do this in Chapter 5.
4.2.2 Example
As a reminder, We can access our columns by using the “.columns” attribute to return an Index datatype filled with the column names.
animals.columns
Index(['IncidentNumber', 'DateTimeOfCall', 'CalYear', 'FinYear',
'TypeOfIncident', 'PumpCount', 'PumpHoursTotal', 'HourlyNominalCost(£)',
'IncidentNominalCost(£)', 'FinalDescription', 'AnimalGroupParent',
'PropertyType', 'SpecialServiceTypeCategory', 'SpecialServiceType',
'Borough', 'StnGroundName', 'AnimalClass', 'Code', 'London'],
dtype='object')
It would be nice to display them as a list to make the object easier to work with, so we can add (or chain) the method “.tolist()” on the end.
animals.columns.tolist()
['IncidentNumber',
'DateTimeOfCall',
'CalYear',
'FinYear',
'TypeOfIncident',
'PumpCount',
'PumpHoursTotal',
'HourlyNominalCost(£)',
'IncidentNominalCost(£)',
'FinalDescription',
'AnimalGroupParent',
'PropertyType',
'SpecialServiceTypeCategory',
'SpecialServiceType',
'Borough',
'StnGroundName',
'AnimalClass',
'Code',
'London']
4.2.3 Exercise
Look at the column names of the titanic DataFrame and convert them to a list, giving it an appropriate variable name.
Use indexing to return the column name ‘embarked’.
Use negative indexing to return the list without ‘boat’.
# (a)
= titanic.columns.tolist()
titanic_cols
titanic_cols
['pclass',
'survived',
'name',
'sex',
'age',
'sibsp',
'parch',
'ticket',
'fare',
'cabin',
'embarked',
'home.dest',
'body',
'boat']
# (b)
10] titanic_cols[
'embarked'
# (c)
-1] titanic_cols[:
['pclass',
'survived',
'name',
'sex',
'age',
'sibsp',
'parch',
'ticket',
'fare',
'cabin',
'embarked',
'home.dest',
'body']
We can also use Jupyter’s auto complete feature to help us when accessing the names of our columns. Copy and uncomment the cell below place your cursor after the H and hit tab.
#animals['H']
You’ll see that the auto complete option gives us the name of our “HourlyNominalCost(£)” column. This can be a really useful shortcut when we’re writing our code, circumventing the need to type out full column names.
5 Sorting Data
Our data is displayed in the same order as the source data, which means that we may want to sort our data, based on specific columns.
5.1 Example
To do this we use the method “.sort_values(by=”column name”)“. When transforming data like this, we want to assign the new object as we don’t want to re-type this process each time.
# animals_sorted = animals.sort_values(by="IncidentNominalCost(£)")
# animals_sorted.head()
There are additional arguments we can set, for example, by default the values are sorted in ascending order, by changing “ascending=” to “ascending=False” we can sort in descending order instead.
= animals.sort_values(by="IncidentNominalCost(£)", ascending=False)
animals_sorted animals_sorted.head()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4450 | 098141-28072016 | 28/07/2016 13:49 | 2016 | 2016/17 | Special Service | 3.0 | 12.0 | 326 | 3912.0 | CAT STUCK WITHIN WALL SPACE RSPCA IN ATTENDANCE | Cat | Purpose Built Flats/Maisonettes - 4 to 9 storeys | Other animal assistance | Animal assistance involving domestic animal - ... | Camden | Kentish Town | Mammal | 00AG | Inner London |
2542 | 48360131 | 19/04/2013 07:01 | 2013 | 2013/14 | Special Service | 2.0 | 12.0 | 290 | 3480.0 | FOAL IN RIVER WATER LEVEL ONE IMPLEMENTED | Horse | River/canal | Other animal assistance | Animal assistance - Lift heavy livestock animal | Waltham Forest | Chingford | Mammal | 00BH | Outer London |
3757 | 62700151 | 22/05/2015 11:34 | 2015 | 2015/16 | Special Service | 2.0 | 10.0 | 298 | 2980.0 | HORSE FALLEN INTO SWIMMING POOL | Horse | Other outdoor structures | Animal rescue from water | Animal rescue from water - Farm animal | Bromley | Biggin Hill | Mammal | 00AF | Outer London |
5568 | 092389-09072018 | 09/07/2018 08:35 | 2018 | 2018/19 | Special Service | 2.0 | 8.0 | 333 | 2664.0 | HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... | Horse | Animal harm outdoors | Other animal assistance | Animal assistance - Lift heavy domestic animal | Hillingdon | Hillingdon | Mammal | 00AS | Outer London |
3143 | 49076141 | 22/04/2014 09:54 | 2014 | 2014/15 | Special Service | 2.0 | 9.0 | 295 | 2655.0 | KITTEN TRAPPED IN CHIMNEY | Cat | Purpose Built Flats/Maisonettes - Up to 3 stor... | Other animal assistance | Assist trapped domestic animal | Hounslow | Feltham | Mammal | 00AT | Outer London |
We can also sort by more than one column, but we can’t just type out the column names in succession to the “.sort_values()” method. We must include them in a list instead, where order is important.
5.2 Example
Here we are sorting by “IncidentNominalCost(£)”; and then by “AnimalClass”. Notice object (text/string) columns are sorted by alphabetical order. The reason order is important is because it creates a dependence, namely that sorting by the second column is entirely dependent on the sorting of the first column.
= animals.sort_values(by=["IncidentNominalCost(£)", "AnimalClass"])
animals_sorted animals_sorted.head()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | 7051091 | 11/01/2009 12:32 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | SWAN STUCK ON ROOF,B15 | Bird | Domestic garden (vegetation not equipment) | Animal rescue from height | Animal rescue from height - Bird | Havering | Hornchurch | Bird | 00AR | Outer London |
31 | 9252091 | 15/01/2009 12:38 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | ASSIST RSPCA WITH BIRD TRAPPED IN TREE BY FOOT... | Bird | Wasteland | Animal rescue from height | Animal rescue from height - Bird | Lambeth | Clapham | Bird | 00AY | Inner London |
40 | 13245091 | 22/01/2009 21:35 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | BIRD IN CHIMNEY,B15 | Bird | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Harrow | Stanmore | Bird | 00AQ | Outer London |
63 | 21760091 | 07/02/2009 11:12 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | PIGEON IN PRECARIOUS POSITION, B15 | Bird | Tree scrub | Animal rescue from below ground | Animal rescue from below ground - Bird | Westminster | Soho | Bird | 00BK | Outer London |
68 | 26715091 | 15/02/2009 15:53 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | SWAN IN DISTRESS, J15 | Bird | River/canal | Animal rescue from water | Animal rescue from water - Bird | Tower Hamlets | Shadwell | Bird | 00BG | Outer London |
When sorting more than one column, it’s best to specify “ascending=” as a list as well, to improve the readability and accessibility of our code. This list is in the same order as our columns and as such “.sort_values()” will apply column wise (first column, ascending, second column, descending etc).
Some versions of Pandas will sort both columns as descending if we just supply “ascending = False” ; however some won’t, so it’s better to be clear about what we want to do. By using a list I also have finer control in that I could have one in ascending order, and one in descending order if I wished.
5.3 Example
= animals.sort_values(by=["IncidentNominalCost(£)", "AnimalClass"],
animals_sorted_desc =[False, False])
ascending
animals_sorted_desc.head()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4450 | 098141-28072016 | 28/07/2016 13:49 | 2016 | 2016/17 | Special Service | 3.0 | 12.0 | 326 | 3912.0 | CAT STUCK WITHIN WALL SPACE RSPCA IN ATTENDANCE | Cat | Purpose Built Flats/Maisonettes - 4 to 9 storeys | Other animal assistance | Animal assistance involving domestic animal - ... | Camden | Kentish Town | Mammal | 00AG | Inner London |
2542 | 48360131 | 19/04/2013 07:01 | 2013 | 2013/14 | Special Service | 2.0 | 12.0 | 290 | 3480.0 | FOAL IN RIVER WATER LEVEL ONE IMPLEMENTED | Horse | River/canal | Other animal assistance | Animal assistance - Lift heavy livestock animal | Waltham Forest | Chingford | Mammal | 00BH | Outer London |
3757 | 62700151 | 22/05/2015 11:34 | 2015 | 2015/16 | Special Service | 2.0 | 10.0 | 298 | 2980.0 | HORSE FALLEN INTO SWIMMING POOL | Horse | Other outdoor structures | Animal rescue from water | Animal rescue from water - Farm animal | Bromley | Biggin Hill | Mammal | 00AF | Outer London |
5568 | 092389-09072018 | 09/07/2018 08:35 | 2018 | 2018/19 | Special Service | 2.0 | 8.0 | 333 | 2664.0 | HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... | Horse | Animal harm outdoors | Other animal assistance | Animal assistance - Lift heavy domestic animal | Hillingdon | Hillingdon | Mammal | 00AS | Outer London |
3143 | 49076141 | 22/04/2014 09:54 | 2014 | 2014/15 | Special Service | 2.0 | 9.0 | 295 | 2655.0 | KITTEN TRAPPED IN CHIMNEY | Cat | Purpose Built Flats/Maisonettes - Up to 3 stor... | Other animal assistance | Assist trapped domestic animal | Hounslow | Feltham | Mammal | 00AT | Outer London |
5.4 Exercise
Sort the titanic DataFrame by the “age” column in descending order, name it “titanic_sorted”.
Sort the titanic DataFrame by the “age” then “sex” columns, both in ascending order.
Sort the titanic DataFrame by ascending “fare” then descending “age”, naming it appropriately.
# (a)
= titanic.sort_values(by="age", ascending=False)
titanic_sorted titanic_sorted.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S | Hessle, Yorks | NaN | B |
61 | 1 | 1 | Cavendish, Mrs. Tyrell William (Julia Florence... | female | 76.0 | 1 | 0 | 19877 | 78.8500 | C46 | S | Little Onn Hall, Staffs | NaN | 6 |
1235 | 3 | 0 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S | NaN | NaN | NaN |
135 | 1 | 0 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | New York, NY | NaN | NaN |
9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | Montevideo, Uruguay | 22.0 | NaN |
# (b)
= titanic.sort_values(by = ['age', 'sex'], ascending = [True, True])
titanic_age_sex
# This also works but is not good practice
# titanic_age_sex = titanic.sort_values(by = ['age', 'sex'])
# (d)
= titanic.sort_values(by = ['fare', 'age'], ascending = [True, False]) titanic_fare_age
6 Revisiting parameters and arguments
I mentioned earlier that not including parameter names can sometimes confuse Python when it comes to the order in which they can be specified for larger functions. Now we will see this in practice.
6.1 Important Example
Parameters are key words that tell python what the following argument relates to. We have two in the code below:
“by=” is a parameter and the string representing the column “IncidentNominalCost(£)” is the argument.
“ascending=” is a parameter and the Boolean value “False” is the argument
If we don’t specify parameters Python will assume our arguments are in the order specified in the Signature (press “shift + tab” with your cursor in “.sort_values()” to see this), and because of this the code below won’t work.
# This code will give us an error!
# After you've tried it, you may want to comment it out!
# animals_sorted = animals.sort_values("IncidentNominalCost(£)", False)
# animals_sorted.head()
Our first argument “IncidentNominalCost(£)” maps directly to the “by=” parameter as expected. However, Python assumes the second argument “False” relates to the “axis =” parameter; which in some versions isn’t a valid argument, and in some pandas versions is treated effectively as “axis =0”.
It is personal preference if you use parameters before your arguments. It is considered good practice, and can make your code more readable, especially to new coders. We would strongly advise doing so.
As a bonus when you use parameters with your arguments you can pass them in any order as Python no longer relies on it being a positional based argument.
= animals.sort_values(ascending=False, by="IncidentNominalCost(£)")
animals_sorted animals_sorted.head()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4450 | 098141-28072016 | 28/07/2016 13:49 | 2016 | 2016/17 | Special Service | 3.0 | 12.0 | 326 | 3912.0 | CAT STUCK WITHIN WALL SPACE RSPCA IN ATTENDANCE | Cat | Purpose Built Flats/Maisonettes - 4 to 9 storeys | Other animal assistance | Animal assistance involving domestic animal - ... | Camden | Kentish Town | Mammal | 00AG | Inner London |
2542 | 48360131 | 19/04/2013 07:01 | 2013 | 2013/14 | Special Service | 2.0 | 12.0 | 290 | 3480.0 | FOAL IN RIVER WATER LEVEL ONE IMPLEMENTED | Horse | River/canal | Other animal assistance | Animal assistance - Lift heavy livestock animal | Waltham Forest | Chingford | Mammal | 00BH | Outer London |
3757 | 62700151 | 22/05/2015 11:34 | 2015 | 2015/16 | Special Service | 2.0 | 10.0 | 298 | 2980.0 | HORSE FALLEN INTO SWIMMING POOL | Horse | Other outdoor structures | Animal rescue from water | Animal rescue from water - Farm animal | Bromley | Biggin Hill | Mammal | 00AF | Outer London |
5568 | 092389-09072018 | 09/07/2018 08:35 | 2018 | 2018/19 | Special Service | 2.0 | 8.0 | 333 | 2664.0 | HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... | Horse | Animal harm outdoors | Other animal assistance | Animal assistance - Lift heavy domestic animal | Hillingdon | Hillingdon | Mammal | 00AS | Outer London |
3143 | 49076141 | 22/04/2014 09:54 | 2014 | 2014/15 | Special Service | 2.0 | 9.0 | 295 | 2655.0 | KITTEN TRAPPED IN CHIMNEY | Cat | Purpose Built Flats/Maisonettes - Up to 3 stor... | Other animal assistance | Assist trapped domestic animal | Hounslow | Feltham | Mammal | 00AT | Outer London |
7 Subsetting
7.1 Selecting Single Columns
Sometimes we will want to work with smaller “cut down” DataFrames that contain fewer columns. As we saw earlier the simplest way to select a column from a DataFrame is to use the name of the column in the indexing brackets “[]”.
7.1.1 Example
Here I am creating a new Series called ‘animal_grp_parent’ using the ‘AnimalGroupParent’ column from the animals DataFrame.
= animals['AnimalGroupParent']
animal_grp_parent
animal_grp_parent.head()
0 Dog
1 Fox
2 Dog
3 Horse
4 Rabbit
Name: AnimalGroupParent, dtype: object
7.1.2 Exercise
Select the “fare” column from the “titanic” DataFrame and name it “titanic_fare”.
Randomly Sample 12 rows from the “titanic_fare” object you created.
# (a)
= titanic["fare"]
titanic_fare titanic_fare.head()
0 211.3375
1 151.5500
2 151.5500
3 151.5500
4 151.5500
Name: fare, dtype: float64
# (b)
= 12) titanic_fare.sample(n
69 55.0000
192 146.5208
1087 7.8542
66 262.3750
1043 15.5000
694 6.7500
540 26.0000
16 247.5208
759 17.4000
1026 12.4750
1281 9.0000
1082 3.1708
Name: fare, dtype: float64
7.2 Selecting Multiple Columns
7.2.1 Example
Similarly as with sorting by multiple columns, we can select multiple columns by providing them in a list.
# Create a list of columns
= [ "DateTimeOfCall", "AnimalGroupParent", "IncidentNominalCost(£)"]
my_list_of_columns
# Pass that list to my indexing brackets, creating a dataframe
= animals[my_list_of_columns] animal_small_df
We can, however do this in one step by using two sets of square brackets next to each other. It’s important to realise that they are doing two separate roles:
- Our first set is selecting or indexing from the animals DataFrame.
- Our second set is creating a list of our columns we wish to select.
# This gives the same result as the cell above.
= animals[[ "DateTimeOfCall", "AnimalGroupParent", "IncidentNominalCost(£)"]]
animal_small_df 6) animal_small_df.sample(
DateTimeOfCall | AnimalGroupParent | IncidentNominalCost(£) | |
---|---|---|---|
202 | 16/05/2009 14:28 | Dog | 260.0 |
3317 | 14/07/2014 16:25 | Cat | 295.0 |
3955 | 26/08/2015 09:38 | Cat | 298.0 |
2604 | 20/05/2013 12:33 | Bird | 290.0 |
4004 | 01/10/2015 06:34 | Cat | 298.0 |
4551 | 16/09/2016 17:42 | Bird | 326.0 |
Two important things to note are the following:
When we return one column we get a Series object (as DataFrames are collections of Series objects)
When we return more than one column we get a DataFrame.
7.2.2 Exercise
Select the “name, sex, age” and “survived” column from the “titanic” DataFrame, naming it accordingly.
Create two sub DataFrames from your answer to part (a), one containing “sex” and “age” and another containing “name” and “survived”, naming them accordingly.
# (a)
= titanic[["name","sex","age", "survived"]]
titanic_select titanic_select.head()
name | sex | age | survived | |
---|---|---|---|---|
0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 1 |
1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 |
2 | Allison, Miss. Helen Loraine | female | 2.0000 | 0 |
3 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 0 |
4 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 0 |
# (b)
= titanic_select[["age", "sex"]]
titanic_ages_sexes
= titanic_select[["name", "survived"]] titanic_names_survived
Note that you may see people selecting columns using “dataframe.column_name”, which is a legal practice, but certainly not a recommended one!
# This will work - but is not good practice!
titanic.sex.head()
0 female
1 male
2 female
3 male
4 female
Name: sex, dtype: object
We don’t recommend this because it looks too much like an attribute or method. This also won’t work if you had a column that was something like “mean” as Python can’t determine if you mean the column “mean” or the method “.mean()”.
There’s no ambiguity with
- dataframe[“column_name”]
which is why we will always recommend it over “dataframe.column_name”.
8 Selecting on data type
We can also use the method “.select_dtypes()” if we want to include or exclude specific kinds of data from our dataframe.
# Select just the object columns
= animals.select_dtypes(include=["object"])
animal_object_cols
# Return the top of the animal columns
animal_object_cols.head()
IncidentNumber | DateTimeOfCall | FinYear | TypeOfIncident | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 01/01/2009 03:01 | 2008/09 | Special Service | DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 | Dog | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Norbury | Mammal | 00AH | Outer London |
1 | 275091 | 01/01/2009 08:51 | 2008/09 | Special Service | ASSIST RSPCA WITH FOX TRAPPED,B15 | Fox | Railings | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Woodside | Mammal | 00AH | Outer London |
2 | 2075091 | 04/01/2009 10:07 | 2008/09 | Special Service | DOG CAUGHT IN DRAIN,B15 | Dog | Pipe or drain | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Sutton | Wallington | Mammal | 00BF | Outer London |
3 | 2872091 | 05/01/2009 12:27 | 2008/09 | Special Service | HORSE TRAPPED IN LAKE,J17 | Horse | Intensive Farming Sheds (chickens, pigs etc) | Animal rescue from water | Animal rescue from water - Farm animal | Hillingdon | Ruislip | Mammal | 00AS | Outer London |
4 | 3553091 | 06/01/2009 15:23 | 2008/09 | Special Service | RABBIT TRAPPED UNDER SOFA,B15 | Rabbit | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Havering | Harold Hill | Mammal | 00AR | Outer London |
We must specify our include data types as a list here, even if we’re just specifying one value. It’s just the way this function was written (it will tell you this in the help documentation).
- “object” includes our string columns, I could also use “O” here.
- “float64” and “int64” will select floating point numbers and integer numbers respectively.
- If you have the numpy package loaded (“import numpy as np”) you can use np.number (no quotes) to return numerical data.
You can even use the parameter “exclude=” which may be easier if you wish to omit just one data type.
8.1 Exercise
Create a new DataFrame “titanic_int_no_float” where we include all integer columns, but exclude all float columns.
= titanic.select_dtypes(include = ['int64'], exclude = ['float64']) titanic_int_no_float
9 Filtering
We can achieve really simple filtering by passing a range to the DataFrame indexer.
9.1 Example
0:5] animals[
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 01/01/2009 03:01 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 | Dog | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Norbury | Mammal | 00AH | Outer London |
1 | 275091 | 01/01/2009 08:51 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | ASSIST RSPCA WITH FOX TRAPPED,B15 | Fox | Railings | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Woodside | Mammal | 00AH | Outer London |
2 | 2075091 | 04/01/2009 10:07 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | DOG CAUGHT IN DRAIN,B15 | Dog | Pipe or drain | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Sutton | Wallington | Mammal | 00BF | Outer London |
3 | 2872091 | 05/01/2009 12:27 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | HORSE TRAPPED IN LAKE,J17 | Horse | Intensive Farming Sheds (chickens, pigs etc) | Animal rescue from water | Animal rescue from water - Farm animal | Hillingdon | Ruislip | Mammal | 00AS | Outer London |
4 | 3553091 | 06/01/2009 15:23 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | RABBIT TRAPPED UNDER SOFA,B15 | Rabbit | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Havering | Harold Hill | Mammal | 00AR | Outer London |
Which returns us the same rows as “.head()”. We can however modify this to return any range of rows within the DataFrame we like - for example the middle of our DataFrame.
200:210] animals[
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 81770091 | 15/05/2009 16:50 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | CAT STUCK ON FENCE,B15 | Cat | Bridge | Animal rescue from height | Animal rescue from height - Domestic pet | Tower Hamlets | Bethnal Green | Mammal | 00BG | Outer London |
201 | 82208091 | 16/05/2009 13:20 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | DOG TRAPPED UNDER CAR,B15 | Dog | Car | Other animal assistance | Animal assistance - Lift heavy livestock animal | Brent | West Hampstead | Mammal | 00AE | Outer London |
202 | 82239091 | 16/05/2009 14:28 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | ROAD TRAFFIC ACCIDENT,B1 | Dog | Car | Other animal assistance | Animal assistance involving livestock - Other ... | Bromley | Orpington | Mammal | 00AF | Outer London |
203 | 82314091 | 16/05/2009 17:09 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | DOG STUCK IN FOX HOLE,B15 | Dog | Domestic garden (vegetation not equipment) | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Redbridge | Hainault | Mammal | 00BC | Outer London |
204 | 82769091 | 17/05/2009 09:54 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | CAT STUCK BEHIND TOILET,B15 | Cat | Purpose Built Flats/Maisonettes - 4 to 9 storeys | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Croydon | Addington | Mammal | 00AH | Outer London |
205 | 83410091 | 18/05/2009 13:44 | 2009 | 2009/10 | Special Service | 1.0 | 4.0 | 260 | 1040.0 | ASSIST RSPCA WITH PIGEON TRAPPED IN CHIMNEY,B15 | Bird | House - single occupancy | Animal rescue from height | Animal rescue from height - Bird | Redbridge | Ilford | Bird | 00BC | Outer London |
206 | 83903091 | 19/05/2009 10:38 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | ASSIST RSPCA WITH 2 DUCKLINGS TRAPPED BEWTEEN... | Bird | Multi-Storey car park | Animal rescue from below ground | Animal rescue from below ground - Bird | Hillingdon | Hillingdon | Bird | 00AS | Outer London |
207 | 83922091 | 19/05/2009 11:35 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | ASSIST RSPCA, B15 | Bird | Bridge | Animal rescue from height | Animal rescue from height - Bird | Enfield | Enfield | Bird | 00AK | Outer London |
208 | 84125091 | 19/05/2009 17:46 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | ASSIST RSPCA WITH CAT UP A TREE,B15 | Cat | Domestic garden (vegetation not equipment) | Animal rescue from height | Animal rescue from height - Domestic pet | Ealing | Ealing | Mammal | 00AJ | Outer London |
209 | 84176091 | 19/05/2009 19:07 | 2009 | 2009/10 | Special Service | 1.0 | 1.0 | 260 | 260.0 | G561 RUNNING CALL TO PIGEON TRAPPED IN NETTING... | Bird | Vehicle Repair Workshop | Animal rescue from height | Animal rescue from height - Bird | Hillingdon | Heathrow | Bird | 00AS | Inner London |
However, we can do more involved, conditional filtering using pandas, which is one of the key processes for any Data Analysis task.
9.2 Single Conditional Filtering
Pandas filters data in a two step process.
- It creates a “mask” that specifies inclusion or exclusion for each row in the DataFrame.
- Apply the mask on to the DataFrame to return the subset of rows that are included.
In the code below I will create a few simple DataFrames to illustrate my point. Don’t worry too much about how these bits of code works; as previously mentioned you’ll often be working with much larger DataFrames that you’ll load in using one of the “pd.read_” commands.
# Let's create a simple DataFrame to look at this in practice
# Again, you don't need to worry about how this works
= pd.DataFrame({"animal": ["Cat", "Cat", "Dog", "Cat"] ,
cat_dog "name": ["Catalie Portman","Pico de Gato", "Chewbarka", "Sir Isaac Mewton"]})
# Print out the DataFrame
cat_dog
animal | name | |
---|---|---|
0 | Cat | Catalie Portman |
1 | Cat | Pico de Gato |
2 | Dog | Chewbarka |
3 | Cat | Sir Isaac Mewton |
9.2.1 Example
I want to filter so I just have ‘Cat’ Rows.
My condition can be written like this:
= cat_dog['animal'] == 'Cat' mask
- I am assigning my output to the variable “mask” with the single = sign
- The column I want to look for values in is “animal” from the catdog Dataframe - “catdog[”animal”]”
- From that column I want to find values that meet my condition – I state this with the double equals “==”
- Finally I am telling Pandas what value I want to find; a string containing “Cat”
If we run the cell we can see the output is a Boolean Series, our True and False values.
# Run me to see the Series of Boolean values
mask
0 True
1 True
2 False
3 True
Name: animal, dtype: bool
Just to make things clearer, Here I’ve added this series as a new column on our “cat_dog” DataFrame to demonstrate.
cat_dog | animal | name | included |
---|---|---|---|
0 | Cat | Catalie Portman | True |
1 | Cat | Pico de Gato | True |
2 | Dog | Chewbarka | False |
3 | Cat | Sir Isaac Mewton | True |
We can see where the value in the “animal” column was “Cat”; the “mask” column is “True”. Where the value was not cat (e.g “Dog”) the “mask” column is “False”. However I want to return just the rows where the value in “mask” was “True”.
To do this we apply the “mask” variable on to the “cat_dog” Dataframe using our indexing or selection brackets.
= cat_dog[mask]
just_cats just_cats
animal | name | |
---|---|---|
0 | Cat | Catalie Portman |
1 | Cat | Pico de Gato |
3 | Cat | Sir Isaac Mewton |
This will only return the rows where our conditional filter equated to “True”.
This seems a little long but thankfully, rather than create and apply a mask separately we can do the whole filter operation in one step. We teach it this way to break down the process into component parts, before wrapping it up in a one line bow.
9.2.2 Example
= cat_dog[cat_dog['animal'] == "Cat"]
just_cats just_cats
animal | name | |
---|---|---|
0 | Cat | Catalie Portman |
1 | Cat | Pico de Gato |
3 | Cat | Sir Isaac Mewton |
You’ll notice the cell with the “Dog” value has been dropped.
You’ll also notice that our index has not changed. You will sometimes want to reset the index after filtering to restore it to sequential integers starting at 0 as this can cause some problems in certain circumstances.
9.2.3 Example
You can do this like so.
=True, inplace=True)
just_cats.reset_index(drop just_cats
animal | name | |
---|---|---|
0 | Cat | Catalie Portman |
1 | Cat | Pico de Gato |
2 | Cat | Sir Isaac Mewton |
By default the previous column index is put in as a new column. By using “drop=True” we can remove this behaviour.
“inplace=True” updates the DataFrame in place and is an alternative to overwriting the data with “just_cats = just_cats.reset_index(drop=True)”
9.3 Other Comparison Operators
We have seen the equivalence sign “==” thus far, but there are other very useful operators we can utilise when conditional filtering:
Symbol | Meaning |
---|---|
== | Is equivalent to |
!= | Does not equal |
> | Greater than |
>= | Greater than or equivalent too |
< | Less than |
<= | Less than or equivalent too |
9.3.1 Example
Let’s use our animals DataFrame to find the rows where “IncidentNominalCost(£)” is over £1000.
# Note here we're using the animals DataFrame
= animals[animals['IncidentNominalCost(£)'] > 1000]
cost_over_1k 6) cost_over_1k.sample(
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5568 | 092389-09072018 | 09/07/2018 08:35 | 2018 | 2018/19 | Special Service | 2.0 | 8.0 | 333 | 2664.0 | HORSE TRAPPED IN DOORWAY ADDITIONAL FRU REQUES... | Horse | Animal harm outdoors | Other animal assistance | Animal assistance - Lift heavy domestic animal | Hillingdon | Hillingdon | Mammal | 00AS | Outer London |
572 | 3215101 | 06/01/2010 10:09 | 2010 | 2009/10 | Special Service | 2.0 | 5.0 | 260 | 1300.0 | PUPPY STUCK ON ICED LAKE - WATER RESCUE LEVEL ... | Dog | Park | Other animal assistance | Animal assistance involving livestock - Other ... | Richmond Upon Thames | Kingston | Mammal | 00BD | Outer London |
2084 | 18627122 | 21/06/2012 05:05 | 2012 | 2012/13 | Special Service | 1.0 | 5.0 | 260 | 1300.0 | HORSE STUCK ON SOAK AWAY | Horse | Grassland, pasture, grazing etc | Other animal assistance | Animal assistance - Lift heavy domestic animal | Enfield | Enfield | Mammal | 00AK | Outer London |
425 | 158794091 | 02/09/2009 10:57 | 2009 | 2009/10 | Special Service | 2.0 | 5.0 | 260 | 1300.0 | HORSE STUCK IN MUD | Horse | Wasteland | Animal rescue from water | Animal rescue from water - Farm animal | Bexley | Erith | Mammal | 00AD | Inner London |
431 | 161584091 | 06/09/2009 13:59 | 2009 | 2009/10 | Special Service | 1.0 | 4.0 | 260 | 1040.0 | CAT STUCK BETWEEN TWO WALLS | Cat | Purpose Built Flats/Maisonettes - Up to 3 stor... | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Islington | Holloway | Mammal | 00AU | Outer London |
1358 | 68681111 | 05/05/2011 18:00 | 2011 | 2011/12 | Special Service | 1.0 | 4.0 | 260 | 1040.0 | CAT STUCK IN THE CHIMNEY | Cat | Tree scrub | Animal rescue from height | Animal rescue from height - Domestic pet | Newham | Stratford | Mammal | 00BB | Inner London |
Here we have done the following:
- Assigned a new name “cost_over_1k” which will hold our output
- Give our DataFrame name, and our indexing or selection brackets: “animals[]”
- The column we want to look for values in is ”IncidentNominalCost” from the Animals Dataframe “animals[‘IncidentNominalCost(£)’]”
- The condition I want is “> 1000” , greater than a thousand. As this is a numeric column I can just specify the number.
- Printing out the variable, using “.sample(6)” method to display 6 rows of the new DataFrame.
9.3.2 Exercise
Filter the titanic DataFrame by “sex”, returning only the rows where this is “female”. Name this something appropriate
Filter the titanic DataFrame to return the rows where the passenger age is smaller than 40 (It is useful to practice these without the exact columns to filter by given).
Filter the titanic DataFrame, returning only the rows where the fare paid is at least £30.50 (Hint: which operator would ‘at least’ correspond to?)
# (a)
= titanic[titanic['sex'] == "female"]
titanic_female titanic_female.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S | Hudson, NY | NaN | 10 |
8 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0 | 2 | 0 | 11769 | 51.4792 | C101 | S | Bayside, Queens, NY | NaN | D |
# (b)
= titanic[titanic['age'] < 40]
titanic_younger
titanic_younger.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
# (c) At least means bigger than or equal to that value
= titanic[titanic['fare'] >= 30.50]
titanic_fares
titanic_fares.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
9.4 Multiple Conditional Filtering
Just as we can filter by one condition we can also filter by multiple conditions. Let’s add in a new column, representing age to the cat_dog DataFrame.
# Let's create a simple DataFrame to look at this in practice
# Again, we don't need to worry about how we do this!
= pd.DataFrame({"animal": ["Cat", "Cat", "Dog", "Cat", "Dog"],
cat_dog "name": ["Catalie Portman","Pico de Gato", "Chewbarka", "Sir Isaac Mewton", "K9"],
"age": [3, 5, 1, 7, 11]},
=["name", "animal", "age"])
columns
# Print out the DataFrame
cat_dog
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | Cat | 3 |
1 | Pico de Gato | Cat | 5 |
2 | Chewbarka | Dog | 1 |
3 | Sir Isaac Mewton | Cat | 7 |
4 | K9 | Dog | 11 |
When we want to filter on more than one condition, we have to consider how those conditions relate to each other in terms of what is included and what is excluded from the final filtering.
AND relationships use the & symbol (Shift and 7 on the keyboard) and require both or all conditions to evaluate to “True”.
OR relationships use the | symbol (Shift and &124;) and require one of the conditions to evaluate to “True”.
Below is a “Truth table” that can be handy as a look up.
Condition 1 | Condition 2 | & (AND) Equates to | | (OR) Equates to |
---|---|---|---|
True | True | True | True |
True | False | False | True |
False | True | False | True |
False | False | False | False |
9.4.1 Example
Now that we have an age column, we can now select all Cats that are older than 4. Again, in this example we will create a mask so we can see what Boolean values are happening behind the scenes, but in the exercises we will do this in one line.
# Create my mask
= (cat_dog['animal'] == "Cat") & (cat_dog['age'] > 4)
mask
# have a look at the mask
mask
0 False
1 True
2 False
3 True
4 False
dtype: bool
This is quite overwhelming upon first viewing, let’s break it down a little bit!
I am assigning my output to the variable “mask” with a single = sign.
My first conditional statement:
- Has round brackets surrounding it “(”
- The first column I want to look for values in is “animal” from the catdog Dataframe - “catdog[“animal”]”
- From that column I want to find values that meet my condition, which I state with the double equals “==”
- Finally I am telling pandas what value I want to find; a string containing “Cat”
- My first condition complete I close my round brackets “)”
I use a symbol to represent the relationship between my conditions. Here an ampersand - “&” (Shift and 7) is used to represent “AND”
My second conditional statement:
- Has round brackets surrounding it “(”
- The first column I want to look for values in is “age” from the catdog Dataframe - “catdog[“age”]”
- From that column I want to find values that are greater than my condition, which I state with the “>”
- Finally I am telling Pandas what value I want to find; an int 4
- My second condition complete I close my round brackets “)”
Let’s look at the “mask” values alongside the DataFrame:
cat_dog | animal | name | age | included |
---|---|---|---|---|
0 | Cat | Catalie Portman | 3 | False |
1 | Cat | Pico de Gato | 5 | True |
2 | Dog | Chewbarka | 1 | False |
3 | Cat | Sir Isaac Mewton | 7 | True |
4 | Dog | K9 | 11 | False |
And when we apply the mask to the DataFrame we can see the filtered Dataframe - with only Cats over 4.
= (cat_dog['animal'] == "Cat") & (cat_dog['age'] > 4)
mask
= cat_dog[mask]
cats_over_4 cats_over_4
name | animal | age | |
---|---|---|---|
1 | Pico de Gato | Cat | 5 |
3 | Sir Isaac Mewton | Cat | 7 |
To do this in one line, we reference the DataFrame we are selecting from, and then place our filter condition within the indexing or selection brackets.
= cat_dog[(cat_dog["animal"] == "Cat") & (cat_dog["age"] > 4)] cats_over_4
If we can use the AND condition (&) we can also use the OR condition where only one of the conditions must be met to evaluate to “True”.
The OR condition is represented by a vertical bar symbol ( “|” ) – press shift and “".
Lets add a condition where the animal is a cat OR the age is greater than 4.
# Create my mask
= (cat_dog["animal"] == "Cat") | (cat_dog["age"] > 4)
mask
# View mask
mask
0 True
1 True
2 False
3 True
4 True
dtype: bool
Let’s look at the “mask” values alongside the DataFrame:
cat_dog | animal | name | age | included |
---|---|---|---|---|
0 | Cat | Catalie Portman | 3 | True |
1 | Cat | Pico de Gato | 5 | True |
2 | Dog | Chewbarka | 1 | False |
3 | Cat | Sir Isaac Mewton | 7 | True |
4 | Dog | K9 | 11 | True |
Catalie Portman now evaluates to True. While she is younger than 3 she is a Cat, and by meeting one of the conditions, she is included.
K9 also evaluates to “True”. He is not a cat; but meets the age condition by being older than 4 and so is included.
We can of course apply this mask to filter the DataFrame.
= cat_dog[mask] cats_or_over4
Or do the whole action in one line of code.
= cat_dog[(cat_dog["animal"] == "Cat") | (cat_dog["age"] > 4)]
cats_or_over4
cats_or_over4.head()
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | Cat | 3 |
1 | Pico de Gato | Cat | 5 |
3 | Sir Isaac Mewton | Cat | 7 |
4 | K9 | Dog | 11 |
Here are some more examples using the “titanic” DataFrame.
Here we are filtering for survivors (value 1), who were female. Using an AND (&) condition
= titanic[(titanic["survived"] == 1) & (titanic["sex"] == "female")]
female_survivors female_survivors.tail()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1260 | 3 | 1 | Turja, Miss. Anna Sofia | female | 18.0 | 0 | 0 | 4138 | 9.8417 | NaN | S | NaN | NaN | 15 |
1261 | 3 | 1 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S | NaN | NaN | 15 |
1286 | 3 | 1 | Whabee, Mrs. George Joseph (Shawneene Abi-Saab) | female | 38.0 | 0 | 0 | 2688 | 7.2292 | NaN | C | NaN | NaN | C |
1290 | 3 | 1 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | NaN | NaN | NaN |
1300 | 3 | 1 | Yasbeck, Mrs. Antoni (Selini Alexander) | female | 15.0 | 1 | 0 | 2659 | 14.4542 | NaN | C | NaN | NaN | NaN |
And here we are filtering for those with more than 3 sibling/spouses (“sibsp”) on board or who paid more than £400 for their fare.
= titanic[(titanic['sibsp'] > 3) | (titanic['fare'] > 400)]
siblings_or_expensive siblings_or_expensive.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | Austria-Hungary / Germantown, Philadelphia, PA | NaN | 3 |
50 | 1 | 1 | Cardeza, Mrs. James Warburton Martinez (Charlo... | female | 58.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | Germantown, Philadelphia, PA | NaN | 3 |
183 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C | NaN | NaN | 3 |
302 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C | NaN | NaN | 3 |
622 | 3 | 0 | Andersson, Master. Sigvard Harald Elias | male | 4.0 | 4 | 2 | 347082 | 31.2750 | NaN | S | Sweden Winnipeg, MN | NaN | NaN |
9.4.2 Exercise
Filter the animals DataFrame where the IncidentNominalCost(£) Column was less than £400 and where the Borough column value is Croydon.
Filter the animals DataFrame where the ‘PropertyType’ was “River/canal” or the “PumpHoursTotal” was at most 8.
# (a)
= animals[(animals['IncidentNominalCost(£)'] < 400) & (animals['Borough'] == "Croydon")]
cost_u400_croydon
cost_u400_croydon.tail()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5595 | 099586-19072018 | 19/07/2018 07:09 | 2018 | 2018/19 | Special Service | 1.0 | 1.0 | 333 | 333.0 | FOX TRAPPED IN NETTING IN SCHOOL FIELD | Fox | Infant/Primary school | Other animal assistance | Assist trapped wild animal | Croydon | Norbury | Mammal | 00AH | Inner London |
5612 | 103324-24072018 | 24/07/2018 12:25 | 2018 | 2018/19 | Special Service | 1.0 | 1.0 | 333 | 333.0 | CAT TRAPPED IN ROOM | Cat | House - single occupancy | Animal rescue from height | Animal rescue from height - Domestic pet | Croydon | West Norwood | Mammal | 00AH | Inner London |
5654 | 113548-07082018 | 07/08/2018 16:54 | 2018 | 2018/19 | Special Service | 1.0 | 1.0 | 333 | 333.0 | BIRD IN CHIMNEY CALLER DIABLED - UNABLE TO WA... | Bird | House - single occupancy | Other animal assistance | Assist trapped wild animal | Croydon | Purley | Bird | 00AH | Outer London |
5657 | 114584-09082018 | 09/08/2018 13:30 | 2018 | 2018/19 | Special Service | 1.0 | 1.0 | 333 | 333.0 | CAT TRAPPED IN GAP BETWEEN TWO BUILDINGS - OW... | Cat | House - single occupancy | Other animal assistance | Animal assistance involving domestic animal - ... | Croydon | Woodside | Mammal | 00AH | Outer London |
5664 | 116258-13082018 | 13/08/2018 10:27 | 2018 | 2018/19 | Special Service | 1.0 | 1.0 | 333 | 333.0 | KITTEN TRAPPED BEHIND KITCHEN SINK | Cat | Purpose Built Flats/Maisonettes - 4 to 9 storeys | Other animal assistance | Assist trapped domestic animal | Croydon | Croydon | Mammal | 00AH | Inner London |
# (b)
= animals[(animals['PropertyType'] == "River/canal") | (animals['PumpHoursTotal'] > 8)] river_pumphours
9.5 Some Filter short cuts.
We can also use some other filters to make life easier.
9.5.1 Example 1
With the method “.isin()” we can provide a list of things to filter, here the boat identities “[“2”, “C”, “15”]”.
"boat"].isin(["2", "C", "15"])].head() titanic[titanic[
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
56 | 1 | 1 | Carter, Mr. William Ernest | male | 36.0 | 1 | 2 | 113760 | 120.000 | B96 B98 | S | Bryn Mawr, PA | NaN | C |
79 | 1 | 1 | Cornell, Mrs. Robert Clifford (Malvina Helen L... | female | 55.0 | 2 | 0 | 11770 | 25.700 | C101 | S | New York, NY | NaN | 2 |
98 | 1 | 1 | Douglas, Mrs. Walter Donald (Mahala Dutton) | female | 48.0 | 1 | 0 | PC 17761 | 106.425 | C86 | C | Deephaven, MN / Cedar Rapids, IA | NaN | 2 |
164 | 1 | 1 | Homer, Mr. Harry ("Mr E Haven") | male | 35.0 | 0 | 0 | 111426 | 26.550 | NaN | C | Indianapolis, IN | NaN | 15 |
170 | 1 | 1 | Ismay, Mr. Joseph Bruce | male | 49.0 | 0 | 0 | 112058 | 0.000 | B52 B54 B56 | S | Liverpool | NaN | C |
9.5.2 Example 2
If dealing with numeric values we can use “.between()” and specify an upper and lower bound. Here we’re looking where the “age” column was between 0 and 12.
"age"].between(0, 12)].head() titanic[titanic[
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN |
54 | 1 | 1 | Carter, Master. William Thornton II | male | 11.0000 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S | Bryn Mawr, PA | NaN | 4 |
94 | 1 | 1 | Dodge, Master. Washington | male | 4.0000 | 0 | 2 | 33638 | 81.8583 | A34 | S | San Francisco, CA | NaN | 5 |
273 | 1 | 1 | Spedden, Master. Robert Douglas | male | 6.0000 | 0 | 2 | 16966 | 134.5000 | E34 | C | Tuxedo Park, NY | NaN | 3 |
9.5.3 Example 3
We also have the “~” tilde symbol to invert an expression, this works similarly to the “!=” for does not equal. For example, I can invert the filter for female passengers to return just male passengers.
Note that I have to wrap my conditional statement in round brackets to ensure this works.
~(titanic["sex"] == "female")].head() # Inverts the results of == Female; effectively giving == Male titanic[
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN |
5 | 1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | New York, NY | NaN | 3 |
7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39.0000 | 0 | 0 | 112050 | 0.0000 | A36 | S | Belfast, NI | NaN | NaN |
9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71.0000 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | Montevideo, Uruguay | 22.0 | NaN |
The tilde “~” can be useful for other scenarios, there’s a variety of “.is” methods for example that do not have an is not equivalent; but we can invert the “.is” method instead. We’ll see a useful example of that in chapter 5.
9.5.4 Filtering Text
So far when we’ve talked about filtering text we do so on exact matches. However, as Python is case sensitive, filtering for “cat” will not show the values “CAT” or “Cat” in our cat_dog DataFrame, should they exist (which they often will in real world data).
# Lets create our cat_dog DataFrame again with some variation in the capitalisation of "cat"
= pd.DataFrame({"animal": ["cat", "Cat", "Dog", "cat", "Dog", "Rat", "Bat"] ,
cat_dog_wrong_spelling "name": ["Catalie Portman","Pico de Gato", "Chewbarka", "Sir Isaac Mewton", "K9", "Roland", "Vlad"],
"age": [3, 5, 1, 7, 11, 1, 3]} ,
= ["name", "animal", "age"])
columns
cat_dog_wrong_spelling
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | cat | 3 |
1 | Pico de Gato | Cat | 5 |
2 | Chewbarka | Dog | 1 |
3 | Sir Isaac Mewton | cat | 7 |
4 | K9 | Dog | 11 |
5 | Roland | Rat | 1 |
6 | Vlad | Bat | 3 |
Here the filter condition we previously used (== Cat) will only now only return us “Pico De Gato”.
= cat_dog_wrong_spelling[cat_dog_wrong_spelling["animal"] == "Cat"]
cat_only_spelling cat_only_spelling
name | animal | age | |
---|---|---|---|
1 | Pico de Gato | Cat | 5 |
Which is not the result we wanted.
9.5.5 Example
For simple instances we can use the string method “str.contains()” like in the example below. Here, in our DataFrame, some entries in the “Animal” column have “Cat” and others have “cat”. By using “.str.contains(“at”)“, we can bring in both values because both contain the partial string”at”.
# Filter to find instances where the string in "animal" contains "at"
= cat_dog_wrong_spelling[cat_dog_wrong_spelling["animal"].str.contains("at")]
cat_only_spelling cat_only_spelling
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | cat | 3 |
1 | Pico de Gato | Cat | 5 |
3 | Sir Isaac Mewton | cat | 7 |
5 | Roland | Rat | 1 |
6 | Vlad | Bat | 3 |
This is clearly not foolproof however, as both “Rat” and “Bat” also have “at” in them, and as such are displayed in the output.
In chapter 5 we’ll look at changing the column values and column names to lowercase letters, which would allow a straight forward filter to work in this case.
9.6 Aside - Regular Expressions
For more complex cases we can use what is known as Regular Expressions or RegEx in Python to search for partial strings with specified conditions (such as placing where the letters are located, starting with, ending with etc).
We do not cover these in this introduction course; but here are some more resources you may wish to explore. If you have used Regular Expressions in other languages you will find this very familiar.
9.7 More Filtering Methods
We’ve looked here at the base Python way of filtering, but as with many things in Python there are other alternatives.
9.7.1 Example 1
'fare > 50 & sex == "male"').head() titanic.query(
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN |
10 | 1 | 0 | Astor, Col. John Jacob | male | 47.0000 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | New York, NY | 124.0 | NaN |
16 | 1 | 0 | Baxter, Mr. Quigg Edmond | male | 24.0000 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | Montreal, PQ | NaN | NaN |
19 | 1 | 0 | Beattie, Mr. Thomson | male | 36.0000 | 0 | 0 | 13050 | 75.2417 | C6 | C | Winnipeg, MN | NaN | A |
Here we use “.query()” which allows us to type a more SQL style query in words and symbols without utilising our filtering brackets.
9.7.2 Example 2
We also have “.filter()”, which works on the index, rather then the contents! Here I’m looking for “like = “1”” which will give me index “1” , “11” , “12” as they contain the number 1. Note that “axis = 0” means apply to the rows, whereas “axis = 1” applies to the columns.
So far we’ve worked with indexes that are numerical, from 0 to n; but they can also be text based; and using the pandas “.filter()” would be useful here.
filter(like = "1", axis = 0).head() titanic.
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.550 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 |
10 | 1 | 0 | Astor, Col. John Jacob | male | 47.0000 | 1 | 0 | PC 17757 | 227.525 | C62 C64 | C | New York, NY | 124.0 | NaN |
11 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0000 | 1 | 0 | PC 17757 | 227.525 | C62 C64 | C | New York, NY | NaN | 4 |
12 | 1 | 1 | Aubart, Mme. Leontine Pauline | female | 24.0000 | 0 | 0 | PC 17477 | 69.300 | B35 | C | Paris, France | NaN | 9 |
13 | 1 | 1 | Barber, Miss. Ellen "Nellie" | female | 26.0000 | 0 | 0 | 19877 | 78.850 | NaN | S | NaN | NaN | 6 |
You may find even more ways of filtering your data when researching, there’s lots out there!
10 Deriving New Columns
We create new columns in very similar to filtering them:
In the example above we are creating a new column in the titanic DataFrame called “new_column”, which will have the constant value of 1 in each row.
- To create a new column we first give the DataFrame name and use square brackets to specify, as a string our column name, here this is “new_column”.
- If the column name does not exist in our Dataframe (like in the example) the contents will be placed in a new column, at the end of the DataFrame.
- If the column already exists in our DataFrame the contents of that column will be overwritten.
We then use the assignment operator “=” and after this is what we fill our column with, in this example it was a constant value of the integer 1.
We’ll now look at various different ways we can fill these new columns.
10.1 Other Constant Values
Some more examples are in the cell below, filling with both floats as well as objects/strings. The syntax remains the same, subsetting the column and assigning it to the value.
10.1.1 Example
As a reminder, our cat_dog dataframe looks as follows:
cat_dog.head()
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | Cat | 3 |
1 | Pico de Gato | Cat | 5 |
2 | Chewbarka | Dog | 1 |
3 | Sir Isaac Mewton | Cat | 7 |
4 | K9 | Dog | 11 |
Now let’s add some more constant columns!
"constant_string"] = "mammal" # adds a string/object constant
cat_dog["constant_float"] = 3.2 # adds a floating point constant
cat_dog["constant_int"] = 2 # adds an interger constant
cat_dog[
cat_dog
name | animal | age | constant_string | constant_float | constant_int | |
---|---|---|---|---|---|---|
0 | Catalie Portman | Cat | 3 | mammal | 3.2 | 2 |
1 | Pico de Gato | Cat | 5 | mammal | 3.2 | 2 |
2 | Chewbarka | Dog | 1 | mammal | 3.2 | 2 |
3 | Sir Isaac Mewton | Cat | 7 | mammal | 3.2 | 2 |
4 | K9 | Dog | 11 | mammal | 3.2 | 2 |
10.1.2 Exercise
Add a new constant column to the animals DataFrame called “attended” and set the value to be “True”.
'attended'] = True
animals[
# View the dataframe
animals.head()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | attended | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 01/01/2009 03:01 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 | Dog | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Norbury | Mammal | 00AH | Outer London | True |
1 | 275091 | 01/01/2009 08:51 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | ASSIST RSPCA WITH FOX TRAPPED,B15 | Fox | Railings | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Woodside | Mammal | 00AH | Outer London | True |
2 | 2075091 | 04/01/2009 10:07 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | DOG CAUGHT IN DRAIN,B15 | Dog | Pipe or drain | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Sutton | Wallington | Mammal | 00BF | Outer London | True |
3 | 2872091 | 05/01/2009 12:27 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | HORSE TRAPPED IN LAKE,J17 | Horse | Intensive Farming Sheds (chickens, pigs etc) | Animal rescue from water | Animal rescue from water - Farm animal | Hillingdon | Ruislip | Mammal | 00AS | Outer London | True |
4 | 3553091 | 06/01/2009 15:23 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | RABBIT TRAPPED UNDER SOFA,B15 | Rabbit | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Havering | Harold Hill | Mammal | 00AR | Outer London | True |
10.2 Numeric Values
We can create numeric columns as well, which are often referred to as ‘calculated columns’ as they can be calculated referencing many other numeric columns.
10.2.1 Example
We can do any form of operations we like here, for example creating the family size of the person. This is the number of siblings or spouses on board (“sibsp”), the number of parents or children on board (“parch”), plus 1 (the person themselves).
"family_size"] = titanic["sibsp"] + titanic["parch"] + 1
titanic[
titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | 1 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | 4 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | 4 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | 4 |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | 4 |
10.3 Boolean and Binary Values
Using the same conditions as we do when filtering we can also create Boolean (True/False) columns. This is very common practice when we want to create binary columns (which can be very useful for visualisation and analysis purposes).
10.3.1 Example
In the example below I am assigning “True” to all values in the “age” column that are under 18. My new column is called “under_18”.
"under_18_bool"] = titanic["age"] < 18
titanic[ titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | under_18_bool | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | 1 | False |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | 4 | True |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | 4 | True |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | 4 | False |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | 4 | False |
We can also change these values from the Boolean “True” and “False” to the binary values “1” and “0”, which is the key change we make when it comes to building models (it is often required that all values are numeric for example).
We can convert columns by using the code “.astype(“int64”)“.
"under_18_binary"] = titanic["under_18_bool"].astype("int64")
titanic[
titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | under_18_bool | under_18_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO | NaN | 2 | 1 | False | 0 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | 11 | 4 | True | 1 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | 4 | True | 1 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | 135.0 | NaN | 4 | False | 0 |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | Montreal, PQ / Chesterville, ON | NaN | NaN | 4 | False | 0 |
We can also do this in one step when we create the column. We can wrap the first part of the code in brackets; like our order of operations in mathematics (BIDMAS, BODMAS whatever you referred to it as) this tells Python which bit to do first.
In the cell below we add the new column in one line, and then sort out the missing values in the next.
# Add the new column and change the type in one line
"under_18_binary"] = (titanic["age"] < 18).astype("int64") titanic[
It’s important to note that where we have missing values, it will mark those as “False” (or 0). We’ll look at one way of fixing this after the exercises, but for a full look into missing values, stay tuned for Chapter 5!
10.3.2 Example
The following code uses the “.isnull()” method to single out where there are missing values. This is wrapped up in the indexing brackets [], so we end up with all rows containing missing values.
'age'].isnull()].head() titanic[titanic[
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | under_18_bool | under_18_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 1 | 0 | Baumann, Mr. John D | male | NaN | 0 | 0 | PC 17318 | 25.9250 | NaN | S | New York, NY | NaN | NaN | 1 | False | 0 |
37 | 1 | 1 | Bradley, Mr. George ("George Arthur Brayton") | male | NaN | 0 | 0 | 111427 | 26.5500 | NaN | S | Los Angeles, CA | NaN | 9 | 1 | False | 0 |
40 | 1 | 0 | Brewe, Dr. Arthur Jackson | male | NaN | 0 | 0 | 112379 | 39.6000 | NaN | C | Philadelphia, PA | NaN | NaN | 1 | False | 0 |
46 | 1 | 0 | Cairns, Mr. Alexander | male | NaN | 0 | 0 | 113798 | 31.0000 | NaN | S | NaN | NaN | NaN | 1 | False | 0 |
59 | 1 | 1 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | female | NaN | 0 | 0 | 17770 | 27.7208 | NaN | C | New York, NY | NaN | 5 | 1 | False | 0 |
10.3.3 Exercise
Create a new boolean column in the “animals” DataFrame called “fish_bool”. Use the “AnimalClass” column to find the values that are eqivalant to “Fish”.
Convert “fish_bool” to a numeric column.
# (a)
"fish_bool"] = animals["AnimalClass"] == "Fish"
animals[
# Test if it's worked with a filter
'AnimalClass'] == "Fish"] animals[animals[
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | ... | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | attended | fish_bool | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
962 | 129048101 | 30/07/2010 21:44 | 2010 | 2010/11 | Special Service | 1.0 | 1.0 | 260 | 260.0 | FISH IN DANGER OF DYING IN POND | ... | Lake/pond/reservoir | Other animal assistance | Animal assistance involving livestock - Other ... | Sutton | Sutton | Fish | 00BF | Inner London | True | True |
964 | 129261101 | 31/07/2010 06:15 | 2010 | 2010/11 | Special Service | 2.0 | 5.0 | 260 | 1300.0 | FISH IN DISTRESS, HVPU REQUIRED | ... | Lake/pond/reservoir | Other animal assistance | Animal assistance involving livestock - Other ... | Enfield | Enfield | Fish | 00AK | Inner London | True | True |
2 rows × 21 columns
# (b)
"fish_bool"] = animals["fish_bool"].astype("int64")
animals[
# Check with a filter
"AnimalClass"] == "Fish"] animals[animals[
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | ... | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | attended | fish_bool | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
962 | 129048101 | 30/07/2010 21:44 | 2010 | 2010/11 | Special Service | 1.0 | 1.0 | 260 | 260.0 | FISH IN DANGER OF DYING IN POND | ... | Lake/pond/reservoir | Other animal assistance | Animal assistance involving livestock - Other ... | Sutton | Sutton | Fish | 00BF | Inner London | True | 1 |
964 | 129261101 | 31/07/2010 06:15 | 2010 | 2010/11 | Special Service | 2.0 | 5.0 | 260 | 1300.0 | FISH IN DISTRESS, HVPU REQUIRED | ... | Lake/pond/reservoir | Other animal assistance | Animal assistance involving livestock - Other ... | Enfield | Enfield | Fish | 00AK | Inner London | True | 1 |
2 rows × 21 columns
10.3.4 More on the issue of Booleans and Missing Values
Earlier we declared a boolean column that has “True” for under 18, and “False” for over 18. One issue with this method is that if an entry in the “age” column is a missing value (“NaN”) then it is classified as “False” here.
While Pandas has done exactly as we asked (it does not meet the criteria so is “False”) we’re artificially inflating the number of adults in our two end columns.
Statistically this isn’t right. There’s 263 missing values in the “age” column, as we don’t know the passengers age; so we should preserve the missing data types rather than assume they’re over 18 (as it currently does!)
10.3.5 Example
To illustrate, the cell below is filtering the titanic dataframe, showing the values that are null (using “.isnull()”) in the “age” column.
You can see that they show “False” in the “under_18” column.
# Create the "under_18" column again - in case it's not been run yet!
"under_18_bool"] = titanic["age"] < 18
titanic[
# Look at missing values in the age column
# Note that in the "under_18" column - they are True!
'age'].isnull()].head() titanic[titanic[
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | under_18_bool | under_18_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 1 | 0 | Baumann, Mr. John D | male | NaN | 0 | 0 | PC 17318 | 25.9250 | NaN | S | New York, NY | NaN | NaN | 1 | False | 0 |
37 | 1 | 1 | Bradley, Mr. George ("George Arthur Brayton") | male | NaN | 0 | 0 | 111427 | 26.5500 | NaN | S | Los Angeles, CA | NaN | 9 | 1 | False | 0 |
40 | 1 | 0 | Brewe, Dr. Arthur Jackson | male | NaN | 0 | 0 | 112379 | 39.6000 | NaN | C | Philadelphia, PA | NaN | NaN | 1 | False | 0 |
46 | 1 | 0 | Cairns, Mr. Alexander | male | NaN | 0 | 0 | 113798 | 31.0000 | NaN | S | NaN | NaN | NaN | 1 | False | 0 |
59 | 1 | 1 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | female | NaN | 0 | 0 | 17770 | 27.7208 | NaN | C | New York, NY | NaN | 5 | 1 | False | 0 |
10.4 Example - Filling in missing values with a constant
We’ll look at more ways to action missing values in chapter 6; but for now we can use “.loc[]” to fill in these values.
We’ll also talk about “.loc[]” in chapter 5 and in the extension chapter 7. Basically, “.loc[]” can look up rows or columns based on either their index; or as we’re doing here, a logical statement like a filter statement.
Here we’re using “.loc[]” to find rows where the “age” column has a null value; and updating our new column “under_18” to show null (“NaN”) too using the keyword “None”. See that “loc[]” takes two inputs, the condition (here that is whether it is null), then the column to apply that too.
'age'].isnull(), 'under_18_bool'] = None titanic.loc[titanic[
C:\Users\tbalb\AppData\Local\Temp\ipykernel_3976\1030760974.py:1: FutureWarning:
Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
Now our “NaN” in “age” are also “NaN” values in “under_18”, as we can see if we run the cell below.
'age'].isnull()].head() titanic[titanic[
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | under_18_bool | under_18_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 1 | 0 | Baumann, Mr. John D | male | NaN | 0 | 0 | PC 17318 | 25.9250 | NaN | S | New York, NY | NaN | NaN | 1 | NaN | 0 |
37 | 1 | 1 | Bradley, Mr. George ("George Arthur Brayton") | male | NaN | 0 | 0 | 111427 | 26.5500 | NaN | S | Los Angeles, CA | NaN | 9 | 1 | NaN | 0 |
40 | 1 | 0 | Brewe, Dr. Arthur Jackson | male | NaN | 0 | 0 | 112379 | 39.6000 | NaN | C | Philadelphia, PA | NaN | NaN | 1 | NaN | 0 |
46 | 1 | 0 | Cairns, Mr. Alexander | male | NaN | 0 | 0 | 113798 | 31.0000 | NaN | S | NaN | NaN | NaN | 1 | NaN | 0 |
59 | 1 | 1 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | female | NaN | 0 | 0 | 17770 | 27.7208 | NaN | C | New York, NY | NaN | 5 | 1 | NaN | 0 |
While we still have missing data in this column that we may want to deal with later (see chapter 6); at least our “under_18” column is now more accurate. It’s really important we think about missing values whenever we perform operations using our data.
Even just recoding values not considered missing is a very important step when analysing and cleaning our data, as this informs us on how to treat the column.
10.5 Mapping New Values
We can also use “.map()” to map a dictionary to a column.
This can be useful when we want to explain coded variables for other users; e.g. the column “embarked” has the values “S”, “C” and “Q”. We can use “.map()” to give the full names of these ports.
"port"] = titanic["embarked"].map({"S":"Southampton",
titanic["C":"Cherbourg",
"Q": "Queenstown (Cobh)"})
titanic.tail()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | family_size | under_18_bool | under_18_binary | port | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1304 | 3 | 0 | Zabour, Miss. Hileni | female | 14.5 | 1 | 0 | 2665 | 14.4542 | NaN | C | NaN | 328.0 | NaN | 2 | True | 1 | Cherbourg |
1305 | 3 | 0 | Zabour, Miss. Thamine | female | NaN | 1 | 0 | 2665 | 14.4542 | NaN | C | NaN | NaN | NaN | 2 | NaN | 0 | Cherbourg |
1306 | 3 | 0 | Zakarian, Mr. Mapriededer | male | 26.5 | 0 | 0 | 2656 | 7.2250 | NaN | C | NaN | 304.0 | NaN | 1 | False | 0 | Cherbourg |
1307 | 3 | 0 | Zakarian, Mr. Ortin | male | 27.0 | 0 | 0 | 2670 | 7.2250 | NaN | C | NaN | NaN | NaN | 1 | False | 0 | Cherbourg |
1308 | 3 | 0 | Zimmerman, Mr. Leo | male | 29.0 | 0 | 0 | 315082 | 7.8750 | NaN | S | NaN | NaN | NaN | 1 | False | 0 | Southampton |
The usefulness of this process cannot be overstated, many times we will be handed data in the real world that has many of these very confusing acronyms, initials etc. Recoding them to be more descriptive and easier to work with can be a lifesaver in certain situations!
10.6 Deleting columns
There are a few ways to remove or drop columns.
Let’s look at cat_dog again
cat_dog
name | animal | age | constant_string | constant_float | constant_int | |
---|---|---|---|---|---|---|
0 | Catalie Portman | Cat | 3 | mammal | 3.2 | 2 |
1 | Pico de Gato | Cat | 5 | mammal | 3.2 | 2 |
2 | Chewbarka | Dog | 1 | mammal | 3.2 | 2 |
3 | Sir Isaac Mewton | Cat | 7 | mammal | 3.2 | 2 |
4 | K9 | Dog | 11 | mammal | 3.2 | 2 |
10.6.1 Example 1
We have the inbuilt python statement del, which we can follow with a selected column to permanently delete it.
del cat_dog["constant_float"]
cat_dog
name | animal | age | constant_string | constant_int | |
---|---|---|---|---|---|
0 | Catalie Portman | Cat | 3 | mammal | 2 |
1 | Pico de Gato | Cat | 5 | mammal | 2 |
2 | Chewbarka | Dog | 1 | mammal | 2 |
3 | Sir Isaac Mewton | Cat | 7 | mammal | 2 |
4 | K9 | Dog | 11 | mammal | 2 |
10.6.2 Example 2
We can also use the “.drop()” method.
For older Pandas versions:
= ["constant_string", "constant_int"], inplace = True)
cat_dog.drop(columns
# If this cell doesn't work check your version using pd.__version__
# If lower than 0.21.0 use the code in the text below.
Here note:
- I’m passing my column names as a list, as there’s more than one
- “inplace=True” updates the original Dataframe
This is a very good example of how packages can evolve over time. For versions lower than 0.21.0 of Pandas we have to provide “axis = 1”, as it can’t detect if you’re dealing with columns or rows.
11 Merging Data
Let’s have a look at the two “joining_data” DataFrames we loaded in at the start of the chapter:
joining_data1
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | cat | 3 |
1 | Pico de Gato | cat | 5 |
2 | Chewbarka | dog | 1 |
3 | Sir Isaac Mewton | cat | 7 |
4 | K9 | cat | 11 |
5 | Arf Vader | dog | 6 |
joining_data2
name | vaccinated | |
---|---|---|
0 | Catalie Portman | True |
1 | Pico de Gato | True |
2 | Chewbarka | False |
3 | Sir Isaac Mewton | True |
4 | K9 | False |
5 | Spiderpig | True |
- In “joining_data1” we have the columns “name”, “animal” and “age”.
- In “joining_data2” we have the columns “name” and “vaccinated”.
We see that “Arf Vader” only appears in “joining_data1” and “Spiderpig” only appears in “joining_data2”. The remainder of the animals appear in both DataFrames.
11.1 Types of Join
These two DataFrames have a column in common that we could use to join them on, which is “name”. Without this we could not join the datasets together as this is the stipulation to perform a merge/join.
We can create a variety of joins. These are often best demonstrated with a Venn diagram. From here, I’m going to say that “joining_data1” is my “left” DataFrame, and that “joining_data2” is my “right” DataFrame.
For this I’m going to highlight the position of 3 animal names. As we mentioned earlier:
- “Arf Vader” is only in “joining_data1”, our “left” DataFrame.
- “Spiderpig” is only in “joining_data2”, our “right” DataFrame
- “Catalie Portman” is in both “joining_data1” and “joining_data2” and will represent values in BOTH DataFrames.
As you can see “left” and “right” merges take the data from that DataFrame and any matches from the opposite one.
This means that in our “left”:
- “Arf Vader” is included because he has data in the left DataFrame.
- “Arf Vader” has missing values for “vaccinated”, as he isn’t in that DataFame
- “Catalie Portman” has data in all columns, as she appears in both the left and right
- “Spiderpig” doesn’t have any data. We’re only taking data from the left, and he’s only in the right.
11.1.1 Example
Let’s look at how we do this left merge in code. We’re going to use “pd.merge()” and do a left join.
= pd.merge(left = joining_data1,
left_merge =joining_data2,
right="left",
how="name",
on=True)
indicator
left_merge
name | animal | age | vaccinated | _merge | |
---|---|---|---|---|---|
0 | Catalie Portman | cat | 3 | True | both |
1 | Pico de Gato | cat | 5 | True | both |
2 | Chewbarka | dog | 1 | False | both |
3 | Sir Isaac Mewton | cat | 7 | True | both |
4 | K9 | cat | 11 | False | both |
5 | Arf Vader | dog | 6 | NaN | left_only |
This is the most layered method we have seen so far, with numerous arguments we must utilise to perform the joins we want.
- “left =” is our left hand DataFrame
- “right =” is our right hand DataFrame
- “how =” is how we want to perform the join, other options are
- “right”
- “outer”
- “inner” (default behaviour)
- “on =” is the name of the column both DataFrames have in common. If we’re joining on more than one column we can specify a list here.
- “indicator” set to “True” adds an additional column showing if the data was in the left_only, both or right_only as applicable, which is excellent for checking.
11.1.2 Exercise
Using “marvel_left” and “marvel_right” that we loaded in earlier, perform an outer join using the column “name”. Include the column that specifies whether they came from the left or right DataFrame.
Using “marvel_left” and “marvel_right” perform a left join on the column containing the years. Note that the names of these columns may not be the same in both DataFrames! (Hint: Use the help function to observe optional arguments to help with this).
# (a)
= marvel_left.merge(right=marvel_right,
marvel_outer ="outer",
how="name",
on=True)
indicator marvel_outer.head()
name | year | month | worldwide_box_office | release_year | score | _merge | |
---|---|---|---|---|---|---|---|
0 | Ant Man | 2015.0 | May | 5.193120e+08 | 2015.0 | 82.0 | both |
1 | Ant Man and the Wasp | 2018.0 | July | 6.226481e+08 | 2018.0 | 88.0 | both |
2 | Avengers | 2012.0 | May | 1.518813e+09 | 2012.0 | 92.0 | both |
3 | Avengers: Age of Ultron | 2015.0 | July | 1.405404e+09 | 2015.0 | 75.0 | both |
4 | Avengers: Endgame | NaN | NaN | NaN | 2019.0 | 94.0 | right_only |
# (b) Using left_on and right_on since the column names are not the same in each DataFrame
= marvel_left.merge(right=marvel_right ,
marvel_left_merge ="left",
how="year",
left_on="release_year",
right_on=True)
indicator marvel_left_merge.head()
name_x | year | month | worldwide_box_office | name_y | release_year | score | _merge | |
---|---|---|---|---|---|---|---|---|
0 | Iron Man | 2008 | May | 585174222 | Iron Man | 2008 | 93 | both |
1 | Iron Man | 2008 | May | 585174222 | The Incredible Hulk | 2008 | 67 | both |
2 | The Incredible Hulk | 2008 | June | 263427551 | Iron Man | 2008 | 93 | both |
3 | The Incredible Hulk | 2008 | June | 263427551 | The Incredible Hulk | 2008 | 67 | both |
4 | Iron Man 2 | 2010 | May | 623933331 | Iron Man 2 | 2010 | 73 | both |
Notice here we use the parameter “left_on” to list the names of the columns in our left dataframe and “right_on” for the names of the columns in our right.
We see here that maybe the year is not the best joining column! We’re getting multiple joins for movies that have different names because we only need that “year” data to match.
The solution to this is to join on more than one column, which you will do next!
11.1.3 Extension Exercise
Modify the join from (b) in the previous Exercise to:
- Join on BOTH the column containing names and the column containing years. Remember that the names may differ across the dataframes!
Hint: How did we specify multiple columns in previous chapters? Remember order is important.
= marvel_left.merge(right=marvel_right ,
marvel_left_merge ="left" ,
how=["name", "year"],
left_on=["name", "release_year"],
right_on=True)
indicator marvel_left_merge.head()
name | year | month | worldwide_box_office | release_year | score | _merge | |
---|---|---|---|---|---|---|---|
0 | Iron Man | 2008 | May | 585174222 | 2008.0 | 93.0 | both |
1 | The Incredible Hulk | 2008 | June | 263427551 | 2008.0 | 67.0 | both |
2 | Iron Man 2 | 2010 | May | 623933331 | 2010.0 | 73.0 | both |
3 | Thor | 2011 | May | 449326618 | NaN | NaN | left_only |
4 | Captain America: The First Avenger | 2011 | July | 370569774 | 2011.0 | 79.0 | both |
Notice here we use the parameter left_on to list the names of the columns in our left dataframe and right_on for the names of the columns in our right and as we’re using multiple columns these are in lists. This would be a great solution if we had a larger dataset to work with.
For example in 1998 there was a movie called “The Avengers”. It’s not a marvel movie, and does not star Iron Man, but instead Uma Thurman and Ralph Fiennes as spys! By using the name and year we’d ensure a higher level of accuracy, being as specific as you can is good.
11.2 Union Merge
We can also do what’s known as “union” merges, which is implemented using “pd.concat()”.This is where we can add data on to the bottom of an existing dataFrame (like an appending motion).
It is so important that the columns of each DataFrame are in the same order, otherwise the union operation will fail (remember, columns can only have one datatype!).
11.2.1 Example
For example we have some new animals to add on to “joining_data1”, these are contained in the “union_joined” DataFrame we loaded in earlier.
= pd.concat(objs=[joining_data1, union_join_data], ignore_index=True)
union_joined union_joined
name | animal | age | |
---|---|---|---|
0 | Catalie Portman | cat | 3 |
1 | Pico de Gato | cat | 5 |
2 | Chewbarka | dog | 1 |
3 | Sir Isaac Mewton | cat | 7 |
4 | K9 | cat | 11 |
5 | Arf Vader | dog | 6 |
6 | Andy Warhowl | dog | 12 |
7 | Voldetort | tortoise | 90 |
8 | Repecka | chicken | 1 |
We can see our new additions, namely “Andy Warhowl”, “Voldetort” and “Repecka” here now.
Our parameters and arguments in “pd.conact()” are: * “objs=” Our objects that we wish to concatenate, here as a list * “ignore_index=” My new DataFrame will have it’s “own” index starting from 0. If I set this to “False” then I will have two things with an index of 0 and I don’t want that as this will cause frustration later!
This is a very rich topic (as are most processes in Pandas!) and other interesting methods exist that allow us to combine in different ways. For example, “.append()” and “.join()” also exist, here’s a link with a lot more detail about joining together data. Merge, Join, Concatenate and Compare
12 Chapter Summary
Amazing work! Congratulations on completing Chapter 4 of the Introduction to Python Course, you are now a DataFrame Wizard!
In Chapter 5 we will focus on applying the various methods we covered here to clean our data, preparing it ready for the tasks that follow such as Aggregation and Summary (Chapter 6) and Visualisation (Future courses on the Learning Hub). We will cover:
- Copies and Views
- Updating Values
- Changing Column Names
- Missing Values