import pandas as pd
Chapter 3 - Reading in and Exporting Data
Chapter Overview and Learning Objectives:
Script Structure and Packages
- Script Structure
- Packages in this session
Project Structure
- Setting Working Directories
- Relative vs Absolute file Paths
- A note on file paths
Reading Flat Files
Reading Excel Files
Reading in other file types
Exporting Data
1 Script Structure and Packages
1.1 Script Structure
Scripts should be read like books, from top to bottom and left to right. As such, packages and data should be loaded at the top of the script, which ensures that they’re available to use right from the start of your work.
Your code should also include explanatory comments, as well as following the naming conventions we talked about in chapter 2. You may understand your code now, but will you still understand it 6 months down the line, or will your co-worker understand it when you leave?
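As a minimal sketch of these ideas (the filenames and comments here are placeholders, not part of the course data), the top of a well-structured script might look like this:

```python
# --- Packages: loaded at the very top, available from the start ---
import pandas as pd

# --- Data: read in straight after the imports ---
# animals = pd.read_csv("../data/animals.csv")  # path is illustrative

# --- Analysis: the rest of the script follows, with comments on the "why" ---
```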
1.2 Packages
In this session we will use pandas, and give it the nickname “pd”.
1.2.1 Exercise
Import Pandas into your session.
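One possible answer, using the conventional alias:

```python
# Import pandas and give it the standard nickname "pd"
import pandas as pd
```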
2 Project Structure
There are many important reasons to structure our projects.
You may have noticed the clean folder structure of the zip folder you downloaded. As our projects grow we can end up with a large number of data files, notebook files and outputs. Keeping them structured is a really important aspect of managing your data. Imagine if you sent your colleague a folder that looked like this:
Finding the right files within here is tricky, and this is just the content we’ve explored so far on the course!
Having a folder structure like the following (sometimes known as a tree-like structure) helps to keep our data organised. Of course, if there are guidelines at your organisation, ensure you organise your files according to those.
2.1 Setting Working Directories
The working directory is the place where Python automatically looks for our files; you can think of it as where we are currently located within the computer’s file system.
2.2 Example
You can check what your working directory is by using a magic command in Jupyter.
The commands here are commented out so as not to hard-code filepaths and directories.
#%pwd
This stands for “Print Working Directory”.
If you’re using another IDE (such as Spyder or VSCode scripts) you can find your working directory using:
#import os
#os.getcwd()
This stands for “Get Current Working Directory”
Note that the output here of “os.getcwd()” will be different to yours! Yours is likely to be:
‘C:\Users\username\Intro_to_Python\notebooks’
No matter what the first part is, you will always have “Intro_to_Python\notebooks” as the end part of the folder path. We’ll see why this happens in a second.
What if we want to change our current working directory?
2.3 Example
You can change your working directory by using the command (in Jupyter)
“%cd ‘filepath_of_new_directory’”
e.g.
“%cd //fa1rvwapxx333/Intro_to_Python/Notebooks”
This stands for “Change Directory” and changes our current directory to the one we have specified. You may want to change your directory if you are working with large data files; or working on a group project from a shared network drive. It is important to note that shared network drives have two forward slashes at the start.
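Outside Jupyter, the script equivalent of “%cd” is “os.chdir()” from the standard library. A minimal sketch (the path in the comment is purely illustrative, so the call is left commented out):

```python
import os

# Where are we now?
print(os.getcwd())

# Move somewhere else (uncomment and substitute a real path on your machine)
# os.chdir("C:/Users/username/Intro_to_Python/notebooks")
```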
2.4 Relative vs Absolute file Paths
If I want to give you a location of a file, I can use the absolute file path. Let’s say, for example, that I have saved the “Intro_to_Python” folder in my C Drive and I want to access the file “animals.csv”.
The full or absolute location of this file is:
- “C:/Users/username/Intro_to_Python/Data/animals.csv”
This is clear and explicit about where the data is stored. However, if you were to use this path you would need to change elements of it; your username, for example, is not “username”.
Because my working directory is automatically set to “C:/Users/marshj1/Intro_to_Python/notebooks”, I can use what’s called a relative path.
A relative path is the location relative to the working directory, i.e., we specify the filepath starting from where we currently are in the folder structure.
For example, I can load the same file as above using the path
- “../data/animals.csv”
This will work for any user, as long as their working directory is set to the “notebooks” folder in the “Intro_to_Python” parent folder. You may notice here I’ve used two full stops, which is something we have yet to see. This refers to moving back one level in the folder structure.
Since “animals.csv” is within the data folder, not “notebooks” where we currently are, we must back out to the parent folder “Intro_to_Python” before we can specify the path to the dataset (which is within the data folder!).
Note that stringing these together as “../..” moves back two folders in the structure and so on, allowing us total control over relative file paths.
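We can see how the double-dot notation collapses with a quick sketch using the standard library (the folder names are illustrative, and we use forward-slash POSIX-style paths for simplicity):

```python
import posixpath

# Pretend this is our working directory
cwd = "/home/username/Intro_to_Python/notebooks"

# ".." backs out one level before heading into the data folder
relative = "../data/animals.csv"

# Joining and normalising shows the absolute path the relative one points at
full = posixpath.normpath(posixpath.join(cwd, relative))
print(full)  # /home/username/Intro_to_Python/data/animals.csv
```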
Here’s a very handy diagram that explains the principle of relative filepaths.
The black arrow leads from the starting location (the working directory, “notebooks”) and moves to the parent “Intro_to_Python” folder using the double-dot notation. The blue arrow moves forwards into the required folder, “/data”, and the purple arrow selects our file name, “/animals.csv”.
2.5 A note on file paths
We highly recommend using forward slashes “/” within file paths. However, when copying a file path from Windows Explorer it will often have the backslash character “\” instead of a forward slash.
This causes issues in a few ways. Firstly, this is a Windows-exclusive convention, as Mac and Linux operating systems use the forward slash “/”. Secondly, the backslash is used as an escape character within Python. Thirdly, although Python will often accept backslashes, other commonly used languages, like the statistical programming language “R”, will not. It’s worth getting into the good practice of using forward slashes.
Lastly, if you absolutely must use backslashes you should preface the string with the letter “r”, to ensure it’s treated as a raw string, so the backslashes are not interpreted as escape characters. This is commented out below due to conflicts with the software these notes are written in, but will work for you if you have followed thus far!
2.6 Example
Taking our previous example into account, the relative filepath to the schools_data.csv file would be:
- r"..\data\schools_data.csv"
3 Flat Files
We use pandas to read in our files in this course. There are ways of reading in files without using this package; however, using pandas will bring in our data as the specialised data type we introduced in the last chapter, a DataFrame.
Flat files are files like .csv (Comma Separated Value), .txt files, etc. The data is stored in a single table, with a specific delimiter (the comma in CSVs) used to separate values into columns.
To import a .csv file we need to give the data an identifier (a variable name) so we can assign the object we are creating. To start, let’s import the animals data, give it the variable name “animals” and use an equals sign to assign it.
I’m using one of the read functions from pandas. So I use the “pd” abbreviation before calling my function. This family of functions starts “.read_” and there are a large variety of them as part of the “pandas” package, a few of which we will see here.
3.1 Example
To read in a file, we must choose the appropriate read method, in this example we are using a flat file (csv) from the “data” folder:
animals = pd.read_csv("../data/animals.csv")
The “pd.read_” function we want is “pd.read_csv()”, and in the brackets the only parameter we must specify (it’s mandatory!) is the filepath to the location of the file we wish to read in as a DataFrame. Filepaths are strings and, as such, should go in quotation marks. We also need the file extension (.csv, .xlsx etc.), as we are leading the read function to the exact file location and so need the exact file name.
When we import data in Jupyter notebooks we get no printed confirmation that the code has worked; only the number within the square brackets increases, indicating the cell has run. Whilst this may seem like a red flag, it’s quite the opposite when it comes to reading in data!
After importing data it may be useful to check that it appears as a variable using the magic command “%whos” introduced earlier. In IDEs (such as Spyder, VSCode etc) that have a variable explorer, you will see the object we have created appear there with some basic information (like its dimensions).
We will explore how to investigate and analyse this dataset in depth in the next chapter; for now we will just view the object we have created. For small datasets we can print the data by inputting its name, but for large datasets this is not practical.
3.2 Example
Let’s get a bird’s-eye view of our dataset! To do so, give Python the DataFrame name and then call the “.head()” method. This returns the top 5 rows by default.
animals.head()
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNominalCost(£) | IncidentNominalCost(£) | FinalDescription | AnimalGroupParent | PropertyType | SpecialServiceTypeCategory | SpecialServiceType | Borough | StnGroundName | AnimalClass | Code | London | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 01/01/2009 03:01 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | DOG WITH JAW TRAPPED IN MAGAZINE RACK,B15 | Dog | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Norbury | Mammal | 00AH | Outer London |
1 | 275091 | 01/01/2009 08:51 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | ASSIST RSPCA WITH FOX TRAPPED,B15 | Fox | Railings | Other animal assistance | Animal assistance involving livestock - Other ... | Croydon | Woodside | Mammal | 00AH | Outer London |
2 | 2075091 | 04/01/2009 10:07 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | DOG CAUGHT IN DRAIN,B15 | Dog | Pipe or drain | Animal rescue from below ground | Animal rescue from below ground - Domestic pet | Sutton | Wallington | Mammal | 00BF | Outer London |
3 | 2872091 | 05/01/2009 12:27 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | HORSE TRAPPED IN LAKE,J17 | Horse | Intensive Farming Sheds (chickens, pigs etc) | Animal rescue from water | Animal rescue from water - Farm animal | Hillingdon | Ruislip | Mammal | 00AS | Outer London |
4 | 3553091 | 06/01/2009 15:23 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | RABBIT TRAPPED UNDER SOFA,B15 | Rabbit | House - single occupancy | Other animal assistance | Animal assistance involving livestock - Other ... | Havering | Harold Hill | Mammal | 00AR | Outer London |
3.2.1 Exercise
Import the data “schools_data.csv” from the data folder with an appropriate variable name.
Use the optional parameter “n =” (which specifies the number of rows to return) in the “.head()” method to view the top 12 rows of the dataset.
Look at the bottom of the data using the “.tail()” method. All of our missing data is at the bottom, so you may see lots of “NaN” values, which stands for “Not a Number”.
# (a)
school = pd.read_csv("../data/schools_data.csv")
# (b)
school.head(n = 12)
# OR school.head(12), you don't need to specify the argument name in this case!
URN | School name | Ofsted region | Local authority | Type of education | Phase of education | Total number of pupils | Deprivation index | Inspection start date | Inspection end date | Inspection type | Overall effectiveness | Category of concern | Outcomes for pupils | Quality of teaching, learning and assessment | Effectiveness of leadership and management | Personal development, behaviour and welfare | Early years provision | 16 - 19 study programmes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100000 | Sir John Cass's Foundation Primary School | London | City of London | Voluntary Aided School | Primary | 250.0 | 4.0 | 18/04/2013 | 19/04/2013 | S5 Inspection | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 9.0 | 9.0 | 9.0 |
1 | 100005 | Thomas Coram Centre | London | Camden | LA Nursery School | Nursery | 114.0 | NaN | 05/06/2014 | 06/06/2014 | S5 Inspection | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 9.0 | 9.0 | 9.0 |
2 | 100006 | CCfL Key Stage 4 PRU | London | Camden | Pupil Referral Unit | PRU | 27.0 | NaN | 24/04/2014 | 25/04/2014 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 9.0 | 9.0 |
3 | 100007 | Camden Primary Pupil Referral Unit | London | Camden | Pupil Referral Unit | PRU | 20.0 | NaN | 26/03/2015 | 27/03/2015 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 9.0 | 9.0 |
4 | 100008 | Argyle Primary School | London | Camden | Community School | Primary | 417.0 | 5.0 | 28/01/2015 | 29/01/2015 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 2.0 | 9.0 |
5 | 100009 | Beckford Primary School | London | Camden | Community School | Primary | 456.0 | 4.0 | 10/07/2012 | 11/07/2012 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 8.0 | 9.0 |
6 | 100010 | Brecknock Primary School | London | Camden | Community School | Primary | 406.0 | 5.0 | 19/11/2013 | 20/11/2013 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 9.0 | 9.0 |
7 | 100011 | Brookfield Primary School | London | Camden | Community School | Primary | 455.0 | 4.0 | 05/06/2014 | 06/06/2014 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 9.0 | 9.0 |
8 | 100012 | Carlton Primary School | London | Camden | Community School | Primary | 371.0 | 5.0 | 15/01/2015 | 16/01/2015 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 2.0 | 9.0 |
9 | 100013 | Edith Neville Primary School | London | Camden | Community School | Primary | 221.0 | 5.0 | 23/01/2014 | 24/01/2014 | S8 No Formal Designation Visit | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 9.0 | 9.0 |
10 | 100014 | Fleet Primary School | London | Camden | Community School | Primary | 225.0 | 4.0 | 02/02/2012 | 03/02/2012 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 8.0 | 9.0 |
11 | 100015 | Hawley Primary School | London | Camden | Community School | Primary | 111.0 | 5.0 | 14/05/2014 | 15/05/2014 | S5 Inspection | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 9.0 | 9.0 | 9.0 |
# (c)
school.tail()
URN | School name | Ofsted region | Local authority | Type of education | Phase of education | Total number of pupils | Deprivation index | Inspection start date | Inspection end date | Inspection type | Overall effectiveness | Category of concern | Outcomes for pupils | Quality of teaching, learning and assessment | Effectiveness of leadership and management | Personal development, behaviour and welfare | Early years provision | 16 - 19 study programmes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21960 | 143927 | Royal Greenwich Trust School | London | Greenwich | Free School | Secondary | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21961 | 143938 | Outwood Academy Danum | North East, Yorkshire and the Humber | Doncaster | Academy Sponsor Led | Secondary | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21962 | 143939 | Outwood Primary Academy Littleworth Grange | North East, Yorkshire and the Humber | Barnsley | Academy Sponsor Led | Primary | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21963 | 143940 | Outwood Primary Academy Darfield | North East, Yorkshire and the Humber | Barnsley | Academy Sponsor Led | Primary | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21964 | 143946 | The Albany School | London | Havering | Academy Converter | Secondary | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
For this initial read-in we only used the mandatory parameter of the filepath, but when we look at the docstring we can see that there are many other arguments we can specify. These include:
- Changing the delimiter (an example would be .tsv, tab separated value)
- Skipping Rows
- Parsing date columns (breaking them down into component parts)
- Manually specifying additional missing values
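As a sketch of a few of these optional parameters working together, here is a hedged example: the data is made up and read from an in-memory string (via “io.StringIO”) standing in for a file on disk:

```python
import io
import pandas as pd

# Stand-in for a semicolon-delimited file with a junk first line
# and -999 used as a missing-value code
raw = io.StringIO(
    "this line is junk\n"
    "date;value\n"
    "2009-01-01;10\n"
    "2009-01-02;-999\n"
)

df = pd.read_csv(
    raw,
    sep=";",               # change the delimiter from the default comma
    skiprows=1,            # skip the junk line at the top
    parse_dates=["date"],  # parse the date column as datetimes
    na_values=[-999],      # treat -999 as an additional missing value
)
print(df)
```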
Recall that to look at the docstring for “read_csv()”, we use “help(pd.read_csv)”. We must include the call to pandas, as the read_csv function does not exist in Python out of the box.
We have seen parameters and arguments a few times throughout the course, but have yet to formally define them. A parameter is the name or label; the argument is the value we pass for it.
As we’ll see later, some parameters are optional, and Python can identify what they relate to by the order we pass them to our method. However, it’s less ambiguous to both Python (and whoever reads your code) if you include the parameter (the name) as well.
It’s worth noting here that most of these parameters have default arguments. A good example of this is the “sep=” parameter of the read function: it tells Python that the argument that follows is the separator/delimiter used in the file. The comma “,” is the default argument, so without specifying it Python assumes “sep = ','”, which we can change if we desire.
3.3 Example
If my separator was a semicolon I could write my “pd.read_csv()” statement like this:
school = pd.read_csv("../data/schools_data.csv", sep=";")
4 Excel Files
As mentioned earlier, we can also use the “pd.read_” functions to read in Excel files. These have the file extension “.xlsx” and differ slightly from comma separated value files.
The function for reading in these files is “pd.read_excel()”, which also takes a minimum of one argument: the location of the file, including the file extension.
4.1 Example
Here we will read in a file called “titanic.xlsx” from the usual data folder, calling it the object “titanic”. Let’s also bring back “.sample()” to display just one random row of the data.
titanic = pd.read_excel("../data/titanic.xlsx")
titanic.sample()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest | body | boat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
538 | 2 | 1 | Portaluppi, Mr. Emilio Ilario Giuseppe | male | 30.0 | 0 | 0 | C.A. 34644 | 12.7375 | NaN | C | Milford, NH | NaN | 14 |
Like with “pd.read_csv()”, “pd.read_excel()” has additional parameters we can set. Have a look at the docstring using “help(pd.read_excel)” and note the interesting ways Python can interact with Excel workbooks.
4.1.1 Exercise
Import the data “police_data.xlsx”. We specifically want the second sheet, which has the name “Table P1”. You will need to specify an additional parameter. Look in the help documentation to see which one you should specify.
Hint - If referencing the sheet by index position; remember that Python starts counting at 0!
# Option 1
police = pd.read_excel("../data/police_data.xlsx", sheet_name=1)
police.head()
# Option 2
# police = pd.read_excel("../data/police_data.xlsx", sheet_name = "Table P1")
# Note if you're using an older version of Pandas ( < 0.21.0 ) the parameter is sheetname = with no space.
Area Codes | Area Name | Region | Country | Total recorded crime - including fraud | Total recorded crime - excluding fraud | Violence against the person | Homicide | Violence with injury | Violence without injury | ... | Theft from the person | Bicycle theft | Shoplifting | All other theft offences | Criminal damage and arson | Drug offences | Possession of weapons offences | Public order offences | Miscellaneous crimes against society | Fraud | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E23000013 | Cleveland | North East | ENGLAND | 39867 | 39770 | 6307 | 3 | 3659 | 2645 | ... | 394 | 1127 | 5333 | 5712 | 8058 | 2073 | 258 | 1376 | 390 | 97 |
1 | E23000008 | Durham | North East | ENGLAND | 30799 | 30751 | 4452 | 5 | 2485 | 1962 | ... | 153 | 639 | 2863 | 4501 | 6757 | 1251 | 244 | 1182 | 516 | 48 |
2 | E23000007 | Northumbria | North East | ENGLAND | 70082 | 69935 | 9547 | 21 | 6148 | 3378 | ... | 993 | 1995 | 8723 | 8947 | 14050 | 4723 | 644 | 2865 | 1136 | 147 |
3 | E23000006 | Cheshire | North West | ENGLAND | 57127 | 56780 | 10568 | 5 | 4910 | 5653 | ... | 1088 | 1753 | 6729 | 7368 | 9645 | 2744 | 256 | 1991 | 568 | 347 |
4 | E23000002 | Cumbria | North West | ENGLAND | 24079 | 23952 | 4817 | 6 | 2426 | 2385 | ... | 139 | 441 | 2595 | 3100 | 5345 | 1575 | 154 | 1061 | 357 | 127 |
5 rows × 27 columns
5 Reading in Other File types
We can also import a wide variety of other file types using the “pd.read_” family of functions. For example we have:
- json
- sas
- sql
- stata
- pickles
and many more. While we don’t cover these in this course, you can explore the docstrings for each of these file types.
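For instance, “pd.read_json()” follows the same pattern as the readers above. A minimal sketch, with made-up in-memory data standing in for a file on disk (in practice you would pass a path such as one in the data folder):

```python
import io
import pandas as pd

# Tiny illustrative JSON payload wrapped as a file-like object
payload = io.StringIO(
    '[{"animal": "Dog", "count": 2}, {"animal": "Fox", "count": 1}]'
)

df = pd.read_json(payload)
print(df)
```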
6 Exporting Data
As well as reading in data we can also export data from Python.
When we load data into Python using pandas we create a temporary copy of it within our session. Any changes we make to the data won’t be reflected in the saved file. For example, if you deleted columns and closed your Python script without exporting the data, your data would still have all of its columns when you import it again (reload).
6.1 Example
To save our files we use the “.to_csv()” method on our DataFrame object, with the mandatory parameter being the filepath (or location) where the new file we are creating should be saved.
titanic.to_csv("../outputs/my_titanic.csv")
Again, nothing seems to have happened within Jupyter, but the number in the square brackets has increased so we know the code has run. If you check the outputs folder in File Explorer you will see the new file there. As we talked about earlier, having the outputs in a different folder to the raw data is a good step for keeping our projects organised, and it also prevents us from accidentally overwriting or removing the wrong files.
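One optional parameter worth knowing about is “index=”. By default “.to_csv()” also writes the row index as an unnamed first column, which is often unwanted. A small sketch using made-up data, and returning the CSV as a string rather than writing a file (which “.to_csv()” does when given no path):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["Dog", "Fox"], "count": [2, 1]})

# With no path, .to_csv() returns the CSV text instead of writing a file
with_index = df.to_csv()                # index written as an extra first column
without_index = df.to_csv(index=False)  # index dropped

print(without_index)
```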
If you type the DataFrame name followed by “.to_”, place your cursor after the underscore and hit tab, you will see that there are many other file formats we can export to, just as there are many that we can import. Each of these has additional parameters you can specify to get the exact exported file you want.
6.1.1 Exercise
Export the “animals.csv” dataset as an excel file, remember that the file extension is “.xlsx”. Ensure that you save it in the Outputs folder.
# Commented out
#animals.to_excel("../outputs/animals.xlsx")
7 Chapter Summary
Fantastic job, you have completed chapter 3 of the Introduction to Python course!
In Chapter 4 you can look forward to a deep dive into DataFrames, covering all aspects of working with them, such as:
- Exploring Data
- Sorting Data
- Subsetting Data
- Filtering Data
- Merging Datasets