Chapter 7 - Case Study

1 Learning Outcomes

By the end of this chapter you should have more confidence with manipulating data and using techniques covered in the first six chapters of the course. This is an excellent opportunity for you to have more data to play with and reinforce learning from the previous chapters.

These data sets and question are designed to be an initial springboard for you to continue your learning. Answers are provided; but these may only show one or two ways of solving the issue. While you should always aim to follow clean code standards and best practice, answers may be different to those provided here and as long as the same outputs are achieved, this is fine.

It is worth noting that you attempt to understand the differences in the solutions though, this can help you learn something new or reinforce another concept by appreciating an alternate way of answering a question.

2 Task Format

Questions will be presented in tabs:

Tab 1 will contain the question itself.
Tab 2 will contain some hints for the solution.
Tab 3 will contain the solution in Python.

An example is below:

2.1 Example

This is an example question.

To hint or not to hint?

print("This tab contains some Python code")

This tab contains some Python code

3 Questions

Question 1

Load the following packages into your session.

pandas
numpy

Read in the following two files and assign to the given variables.

netflix - nextflix_data.csv imdb_scores - imdb_scores.csv

Display the first 10 rows of the netflix data.
Display the last 5 rows of the imdb_scores data.

Note - The data is sourced from Tidy Tuesday and directly from IMDB. Some data has been altered to suit the difficulty level of this course. This data should be treated as a training set, and so shouldn’t be relied upon for 100% accuracy.

Remember that we use the keywords “import” and “as” when performing this task.
Remember that we use the read_csv() function from pandas to accomplish this.
.head()
Opposite of (c) in words as well as process!

# (a)

import pandas as pd
import numpy as np

# (b)

netflix = pd.read_csv("../data/netflix_data.csv")
imdb_scores = pd.read_csv("../data/imdb_scores.csv")

# (c) 

netflix.head(10)

	show_id	type	title	director	cast	primary_country	country_2	country_3	country_4	countries_5_up	date_added	release_year	rating	duration	listed_in	description	num_cast
0	s1	TV Show	3%	NaN	João Miguel, Bianca Comparato, Michel Gomes, R...	Brazil	NaN	NaN	NaN	NaN	August 14, 2020	2020	TV-MA	4 Seasons	International TV Shows, TV Dramas, TV Sci-Fi &...	In a future where the elite inhabit an island ...	11.0
1	s2	Movie	07:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, ...	Mexico	NaN	NaN	NaN	NaN	December 23, 2016	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico Cit...	6.0
2	s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence ...	Singapore	NaN	NaN	NaN	NaN	December 20, 2018	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow...	9.0
3	s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly...	United States	NaN	NaN	NaN	NaN	November 16, 2017	2009	PG-13	80 min	Action & Adventure, Independent Movies, Sci-Fi...	In a postapocalyptic world, rag-doll robots hi...	9.0
4	s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...	United States	NaN	NaN	NaN	NaN	January 1, 2020	2008	PG-13	123 min	Dramas	A brilliant group of students become card-coun...	12.0
5	s6	TV Show	46	Serdar Akar	Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...	Turkey	NaN	NaN	NaN	NaN	July 1, 2017	2016	TV-MA	1 Season	International TV Shows, TV Dramas, TV Mysteries	A genetics professor experiments with a treatm...	10.0
6	s7	Movie	122	Yasir Al Yasiri	Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...	Egypt	NaN	NaN	NaN	NaN	June 1, 2020	2019	TV-MA	95 min	Horror Movies, International Movies	After an awful accident, a couple admitted to ...	8.0
7	s8	Movie	187	Kevin Reynolds	Samuel L. Jackson, John Heard, Kelly Rowan, Cl...	United States	NaN	NaN	NaN	NaN	November 1, 2019	1997	R	119 min	Dramas	After one of his high school students attacks ...	5.0
8	s9	Movie	706	Shravan Kumar	Divya Dutta, Atul Kulkarni, Mohan Agashe, Anup...	India	NaN	NaN	NaN	NaN	April 1, 2019	2019	TV-14	118 min	Horror Movies, International Movies	When a doctor goes missing, his psychiatrist w...	9.0
9	s10	Movie	1920	Vikram Bhatt	Rajneesh Duggal, Adah Sharma, Indraneil Sengup...	India	NaN	NaN	NaN	NaN	December 15, 2017	2008	TV-MA	143 min	Horror Movies, International Movies, Thrillers	An architect and his wife move into a castle t...	8.0

# (d)

imdb_scores.tail()

	title	average Rating	num Votes
7782	Zoo	NaN	NaN
7783	Zoom	NaN	NaN
7784	Zulu Man in Japan	NaN	NaN
7785	Zumbo's Just Desserts	NaN	NaN
7786	ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS	NaN	NaN

Question 2

Clean up the column names of imdb_scores.

Remember that we need to use .str methods for this, and we must consider replacing empty spaces. Python is case sensitive!

imdb_scores.columns = imdb_scores.columns.str.replace(pat = " ", repl = "_") # Replace spaces with underscores

imdb_scores.columns = imdb_scores.columns.str.lower() # Convert to lowercase

imdb_scores.columns

Index(['title', 'average_rating', 'num_votes'], dtype='object')

Question 3

What are the dimensions of the Netflix data? Display the result in a sentence.

Remember that we can use multi-variable assignment to assign both elements (rows and columns) to their own variables.

# Way 1 - Good practice

rows, columns = netflix.shape

print("There are", rows, "rows and", columns, "columns in the Netflix DataFrame.")


# Way 2

# rows = netflix.shape[0]
# colums = netflix.shape[1]

# print("There are", rows, "rows and", columns, "columns in the Netflix DataFrame.")


# Way 3

# rows = len(netflix) 
# columns = len(netflix.columns) 

# print("There are", rows, "rows and", columns, "columns in the Netflix DataFrame.")

There are 7787 rows and 17 columns in the Netflix DataFrame.

Question 4

What data types does Netflix have?

We use the attribute dtypes.

# Way 1

netflix.dtypes

# Way 2

# netflix.info()

show_id             object
type                object
title               object
director            object
cast                object
primary_country     object
country_2           object
country_3           object
country_4           object
countries_5_up      object
date_added          object
release_year         int64
rating              object
duration            object
listed_in           object
description         object
num_cast           float64
dtype: object

Question 5

How many missing values do we have in the columns of Netflix?
How many missing values do we have in the columns of imdb_scores?
Can you find the total missing values in general?

For (a) & (b) We need to use the .isna() method and one of the summary statistics functions chained together to compute the missing values by column.

Chain a second summary statistic to your answer to (a) and observe!

# (a)

netflix.isna().sum()

show_id               0
type                  0
title                 0
director           2389
cast                718
primary_country     507
country_2          6637
country_3          7397
country_4          7637
countries_5_up     7733
date_added           10
release_year          0
rating                7
duration              0
listed_in             0
description           0
num_cast            718
dtype: int64

# (b)

imdb_scores.isna().sum()

title                0
average_rating    4773
num_votes         4773
dtype: int64

# (c)

netflix.isna().sum().sum()

Question 6

Output a new DataFrame from imdb_scores, where the average rating is greater than 5, name it ‘above_avg_movies’.
Output a new DataFrame from Netflix, where the films were directed by ‘Marcus Raboy’ and the rating was one of “TV-MA, R, PG-13”. Name it ‘raboy_mature_films’.

Remember that we use the indexing brackets [] and specify our condition inside.
Remember that when specifying equivalence, we use ‘==’ and we use the .isin() function for a list of strings.

# (a)
above_avg_movies = imdb_scores[imdb_scores['average_rating'] > 5.0]

above_avg_movies.head()

	title	average_rating	num_votes
351	#Roxy	5.1	284.0
352	10 Days in Sun City	5.1	59.0
353	13 Cameras	5.1	5093.0
354	A Witches' Ball	5.1	347.0
355	All the Devil's Men	5.1	2686.0

# (b) 
raboy_mature_films = netflix[(netflix['director'] == 'Marcus Raboy') & (netflix['rating'].isin(['TV-MA', 'PG-13', 'R']))]

raboy_mature_films.head()

	show_id	type	title	director	cast	primary_country	country_2	country_3	country_4	countries_5_up	date_added	release_year	rating	duration	listed_in	description	num_cast
550	s551	Movie	Anthony Jeselnik: Fire in the Maternity Ward	Marcus Raboy	Anthony Jeselnik	United States	NaN	NaN	NaN	NaN	April 30, 2019	2019	TV-MA	64 min	Stand-Up Comedy	Forging his own comedic boundaries, Anthony Je...	1.0
1543	s1544	Movie	Cristela Alonzo: Lower Classy	Marcus Raboy	Cristela Alonzo	United States	NaN	NaN	NaN	NaN	January 24, 2017	2017	TV-MA	67 min	Stand-Up Comedy	The irrepressible Alonzo skewers Latino stereo...	1.0
1585	s1586	Movie	Dana Carvey: Straight White Male, 60	Marcus Raboy	Dana Carvey	United States	NaN	NaN	NaN	NaN	November 4, 2016	2016	TV-MA	64 min	Stand-Up Comedy	Emmy-winning comedian Dana Carvey blends pitch...	1.0
1705	s1706	Movie	DeRay Davis: How to Act Black	Marcus Raboy	DeRay Davis	United States	NaN	NaN	NaN	NaN	November 14, 2017	2017	TV-MA	66 min	Stand-Up Comedy	Dynamic comic DeRay Davis hits the stage like ...	1.0
3231	s3232	Movie	Judd Apatow: The Return	Marcus Raboy	Judd Apatow	United States	NaN	NaN	NaN	NaN	December 12, 2017	2017	TV-MA	70 min	Stand-Up Comedy	Judd Apatow returns to stand-up comedy after 2...	1.0

Question 7

How many unique countries are in the column primary_country?

Remember that .unique() returns an array, so how do we find out the number of values within it?

len(netflix["primary_country"].unique())

# Note that nunique() will give 81 as it skips the missing values

Question 8

Create a new Dataframe that just contains “Movie” data. Ensure that this is a copy so we avoid the “SettingWithCopy” warning. Look at the column called type; this contains if something is a movie or a tv series.
Within the movies DataFrame created in part (a), create a new column called runtime, where duration has “min” removed, make sure it’s numeric!
Use the new runtime column to create a binned column containing labels for different lengths of runtime, grouping the films into “0-60”, “61-120”, “121-180”, “181-240”, and “241 or more”. Name this ‘runtime_group’. It is recommended that you use “pd.cut()” for this.

Remember that .copy() can be placed on the end of the filtering statement.
For Python you can use .str.replace() to remove some text. Recall that the type to change to is int64.
Remember that we can use np.arange(start =, stop =, step =) for this, and it is a good idea to set the stop as the maximum of the column + 1.

# (a)

movies = netflix[netflix["type"] == "Movie"].copy()

movies.head()

	show_id	type	title	director	cast	primary_country	country_2	country_3	country_4	countries_5_up	date_added	release_year	rating	duration	listed_in	description	num_cast
1	s2	Movie	07:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, ...	Mexico	NaN	NaN	NaN	NaN	December 23, 2016	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico Cit...	6.0
2	s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence ...	Singapore	NaN	NaN	NaN	NaN	December 20, 2018	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow...	9.0
3	s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly...	United States	NaN	NaN	NaN	NaN	November 16, 2017	2009	PG-13	80 min	Action & Adventure, Independent Movies, Sci-Fi...	In a postapocalyptic world, rag-doll robots hi...	9.0
4	s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...	United States	NaN	NaN	NaN	NaN	January 1, 2020	2008	PG-13	123 min	Dramas	A brilliant group of students become card-coun...	12.0
6	s7	Movie	122	Yasir Al Yasiri	Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...	Egypt	NaN	NaN	NaN	NaN	June 1, 2020	2019	TV-MA	95 min	Horror Movies, International Movies	After an awful accident, a couple admitted to ...	8.0

# (b)

movies["runtime"] = movies["duration"].str.replace(pat= " min", repl= "")

movies["runtime"] = movies["runtime"].astype("int64")

movies["runtime"].dtype # Check if it's worked

dtype('int64')

# (c)

movies['runtime_group'] = pd.cut(movies['runtime'], 
                                  bins = np.arange(start = 0,
                                                   stop = (movies['runtime'].max() + 1),
                                                   step = 60),
                                  labels = ['0-60', '61-120', '121-180', '181-240', '241 or more'])

movies['runtime_group']

1        61-120
2        61-120
3        61-120
4       121-180
6        61-120
         ...   
7781     61-120
7782     61-120
7783     61-120
7784       0-60
7786     61-120
Name: runtime_group, Length: 5377, dtype: category
Categories (5, object): ['0-60' < '61-120' < '121-180' < '181-240' < '241 or more']

How many missing values are there in the ‘num_cast’ column of the movies DataFrame?
Fill in the missing values of the column with the mean number of cast numbers from the column overall. Ensure that this is rounded to the correct number of decimal places.

Hint

.isna()
Remember that we use .fillna() for filling in values and round() for rounding.

# (a)

num_cast_missing = movies['num_cast'].isna().sum()

print("There are", num_cast_missing, "missing values in the num_cast column.")

There are 426 missing values in the num_cast column.

# (b)

movies['num_cast'].fillna(value = round((movies['num_cast'].mean()), 1), 
                          inplace = True)

# Check

movies['num_cast'].isna().sum()

C:\Users\tbalb\AppData\Local\Temp\ipykernel_9720\1793540225.py:3: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

Question 10

What’s the median runtime of the movie data?
What’s the mean runtime of the movie data?
What’s the variance of the runtime of the movie data? Use it to find the standard deviation (without the “.std()” function!).

The function for this is the same as the word.
The function for this is the same as the word.
Remember that the standard deviation is the square root of the variance (np.sqrt()).

# (a)

movies["runtime"].median()

98.0

# (b)

movies["runtime"].mean()

99.30797842663195

# (c)

var_runtime = movies["runtime"].var()

std_runtime = np.sqrt(var_runtime)

print(var_runtime, std_runtime)

814.0111584310814 28.53088078610756

Question 11

Sort the movies DataFrame to find the longest movie and return just the columns title and runtime for this data.
Sort the movies DataFrame by release_year and runtime to find the oldest movie with the longest runtime, returning its director, title, runtime, release year and rating. Reset the index so that we have a new one starting at 0.

Remember that we must use a list within the index statement to select multiple columns, df[[my_cols]].
Remember that order is important so it will order by year first and then runtime. We can use .reset_index() and the optional parameter ‘drop’ to accomplish this.

# (a)

sorted_movies = movies[["title", "runtime"]].sort_values("runtime", ascending = False)

sorted_movies.head()

	title	runtime
957	Black Mirror: Bandersnatch	312
6850	The School of Mischief	253
4490	No Longer kids	237
3694	Lock Your Girls In	233
5108	Raya and Sakina	230

# (b)

short_movie_high_ratings = movies[["title", "director", "runtime", "rating", "release_year"]].sort_values(["release_year", "runtime"], ascending = [True, False]).reset_index(drop = True)

short_movie_high_ratings.head()

	title	director	runtime	rating	release_year
0	Prelude to War	Frank Capra	52	TV-14	1942
1	The Battle of Midway	John Ford	18	TV-14	1942
2	Why We Fight: The Battle of Russia	Frank Capra, Anatole Litvak	82	TV-PG	1943
3	Undercover: How to Operate Behind Enemy Lines	John Ford	61	TV-PG	1943
4	WWII: Report from the Aleutians	John Huston	45	TV-PG	1943

Question 12

Group the Netflix data by the primary country and find the mean number of cast listed.

Remember that we first .groupby(), followed with the column name [] and then the summary statistic.

mean_cast_per_country = netflix.groupby("primary_country")["num_cast"].mean()

mean_cast_per_country.head()

primary_country
Argentina     7.000000
Australia     7.600000
Austria       7.166667
Bangladesh    7.500000
Belarus       2.000000
Name: num_cast, dtype: float64

Question 13

Group the Netflix data by type and rating and obtain the overall counts of each type and the average cast numbers for each group.

Remember that grouping by multiple columns requires a list to be specified and we must use the .agg({key:value}) function for multiple aggregations.

rating_by_type = netflix.groupby(["type", "rating"]).agg({"type": "count", "num_cast":"mean"})

rating_by_type

		type	num_cast
type	rating
Movie	G	39	10.631579
	NC-17	3	9.666667
	NR	79	7.112903
	PG	247	9.954357
	PG-13	386	10.066138
	R	663	9.441221
	TV-14	1272	7.350043
	TV-G	111	6.422222
	TV-MA	1845	6.723304
	TV-PG	505	6.854801
	TV-Y	117	7.805825
	TV-Y7	95	8.471910
	TV-Y7-FV	5	6.333333
	UR	5	11.200000
TV Show	NR	5	10.400000
	R	2	12.000000
	TV-14	659	7.945092
	TV-G	83	5.391304
	TV-MA	1018	9.623303
	TV-PG	301	6.305221
	TV-Y	163	5.683099
	TV-Y7	176	9.196319
	TV-Y7-FV	1	8.000000

Question 14

Join the imdb data to the Netflix data to add rating scores to the overall dataset. In your join you should include the column that specifies whether the value came from the left or right DataFrame.

Remember that there are two ways to do this, left joining from A to B or right joining from B to A. The column we should join on is the same in both DataFrames.

netflix_imdb = netflix.merge(imdb_scores, on = "title", how = "left", indicator = True)

# OR 

# netflix_imdb = pd.merge(left = netflix, right = imdb_scores, on = "title", how = "left", indicator = True)

# netflix_imdb = pd.merge(left = imdb_scores, right = netflix, on = "title", how = "right", indicator = True)

netflix_imdb

	show_id	type	title	director	cast	primary_country	country_2	country_3	country_4	countries_5_up	date_added	release_year	rating	duration	listed_in	description	num_cast	average_rating	num_votes	_merge
0	s1	TV Show	3%	NaN	João Miguel, Bianca Comparato, Michel Gomes, R...	Brazil	NaN	NaN	NaN	NaN	August 14, 2020	2020	TV-MA	4 Seasons	International TV Shows, TV Dramas, TV Sci-Fi &...	In a future where the elite inhabit an island ...	11.0	NaN	NaN	both
1	s2	Movie	07:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, ...	Mexico	NaN	NaN	NaN	NaN	December 23, 2016	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico Cit...	6.0	5.9	611.0	both
2	s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence ...	Singapore	NaN	NaN	NaN	NaN	December 20, 2018	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow...	9.0	4.6	915.0	both
3	s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly...	United States	NaN	NaN	NaN	NaN	November 16, 2017	2009	PG-13	80 min	Action & Adventure, Independent Movies, Sci-Fi...	In a postapocalyptic world, rag-doll robots hi...	9.0	NaN	NaN	both
4	s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...	United States	NaN	NaN	NaN	NaN	January 1, 2020	2008	PG-13	123 min	Dramas	A brilliant group of students become card-coun...	12.0	NaN	NaN	both
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7782	s7783	Movie	Zozo	Josef Fares	Imad Creidi, Antoinette Turk, Elias Gergi, Car...	Sweden	Czech Republic	United Kingdom	Denmark	Netherlands	October 19, 2020	2005	TV-MA	99 min	Dramas, International Movies	When Lebanon's Civil War deprives Zozo of his ...	7.0	6.6	3135.0	both
7783	s7784	Movie	Zubaan	Mozez Singh	Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...	India	NaN	NaN	NaN	NaN	March 2, 2019	2015	TV-14	111 min	Dramas, International Movies, Music & Musicals	A scrappy but poor boy worms his way into a ty...	8.0	6.1	424.0	both
7784	s7785	Movie	Zulu Man in Japan	NaN	Nasty C	NaN	NaN	NaN	NaN	NaN	September 25, 2020	2019	TV-MA	44 min	Documentaries, International Movies, Music & M...	In this documentary, South African rapper Nast...	1.0	NaN	NaN	both
7785	s7786	TV Show	Zumbo's Just Desserts	NaN	Adriano Zumbo, Rachel Khoo	Australia	NaN	NaN	NaN	NaN	October 31, 2020	2019	TV-PG	1 Season	International TV Shows, Reality TV	Dessert wizard Adriano Zumbo looks for the nex...	2.0	NaN	NaN	both
7786	s7787	Movie	ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS	Sam Dunn	NaN	United Kingdom	Canada	United States	NaN	NaN	March 1, 2020	2019	TV-MA	90 min	Documentaries, Music & Musicals	This documentary delves into the mystique behi...	NaN	NaN	NaN	both

7787 rows × 20 columns

4 Summary

In this case study you have had the opportunity to apply some of the techniques you have learned to some new data. This is not exhaustive however; have a look at the data and experiment with other techniques you can use.

A good start would be to continue cleaning the movies dataset, analysing the other columns’ missing values and seeing what can be done with them, as there are some that are more complex to consider!

For additional datasets we recommend exploring:

5 What’s Next?

Congratulations! You have reached the end of the compulsory materials for Core Part 1 of Introduction to Python. You have consolidated the variety of skills shown throughout the course in a large structured exercise and you should feel very accomplished!

You have two options on where to go next, you can either take a look at the Part 1 Reference Materials that build on some more advanced concepts at this level, or you can move on to Core Part 2 that introduces more programmatic concepts in Python that are key to developing your technical skills!

Whatever you chose to do, thank-you so much for your dedication in completing these materials! Please do give your honest feedback in the post course survey as we are always looking to improve the quality of our training offering.