Chapter 7 - Case Study

1 Learning Outcomes

By the end of this chapter you should have more confidence with manipulating data and using techniques covered in the first six chapters of the course. This is an excellent opportunity for you to have more data to play with and reinforce learning from the previous chapters.

These data sets and question are designed to be an initial springboard for you to continue your learning. Answers are provided; but these may only show one or two ways of solving the issue. While you should always aim to follow clean code standards and best practice, answers may be different to those provided here and as long as the same outputs are achieved, this is fine.

It is worth noting that you attempt to understand the differences in the solutions though, this can help you learn something new or reinforce another concept by appreciating an alternate way of answering a question.

2 Task Format

Questions will be presented in tabs:

  • Tab 1 will contain the question itself.
  • Tab 2 will contain some hints for the solution.
  • Tab 3 will contain the solution in Python.

An example is below:

2.1 Example

This is an example question.

To hint or not to hint?

print("This tab contains some Python code")
This tab contains some Python code

3 Questions

Question 1

  1. Load the following packages into your session.
  • pandas
  • numpy
  1. Read in the following two files and assign to the given variables.

netflix - nextflix_data.csv imdb_scores - imdb_scores.csv

  1. Display the first 10 rows of the netflix data.

  2. Display the last 5 rows of the imdb_scores data.

Note - The data is sourced from Tidy Tuesday and directly from IMDB. Some data has been altered to suit the difficulty level of this course. This data should be treated as a training set, and so shouldn’t be relied upon for 100% accuracy.

  1. Remember that we use the keywords “import” and “as” when performing this task.

  2. Remember that we use the read_csv() function from pandas to accomplish this.

  3. .head()

  4. Opposite of (c) in words as well as process!

# (a)

import pandas as pd
import numpy as np

# (b)

netflix = pd.read_csv("../data/netflix_data.csv")
imdb_scores = pd.read_csv("../data/imdb_scores.csv")
# (c) 

netflix.head(10)
show_id type title director cast primary_country country_2 country_3 country_4 countries_5_up date_added release_year rating duration listed_in description num_cast
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R... Brazil NaN NaN NaN NaN August 14, 2020 2020 TV-MA 4 Seasons International TV Shows, TV Dramas, TV Sci-Fi &... In a future where the elite inhabit an island ... 11.0
1 s2 Movie 07:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico NaN NaN NaN NaN December 23, 2016 2016 TV-MA 93 min Dramas, International Movies After a devastating earthquake hits Mexico Cit... 6.0
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore NaN NaN NaN NaN December 20, 2018 2011 R 78 min Horror Movies, International Movies When an army recruit is found dead, his fellow... 9.0
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly... United States NaN NaN NaN NaN November 16, 2017 2009 PG-13 80 min Action & Adventure, Independent Movies, Sci-Fi... In a postapocalyptic world, rag-doll robots hi... 9.0
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States NaN NaN NaN NaN January 1, 2020 2008 PG-13 123 min Dramas A brilliant group of students become card-coun... 12.0
5 s6 TV Show 46 Serdar Akar Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan... Turkey NaN NaN NaN NaN July 1, 2017 2016 TV-MA 1 Season International TV Shows, TV Dramas, TV Mysteries A genetics professor experiments with a treatm... 10.0
6 s7 Movie 122 Yasir Al Yasiri Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed... Egypt NaN NaN NaN NaN June 1, 2020 2019 TV-MA 95 min Horror Movies, International Movies After an awful accident, a couple admitted to ... 8.0
7 s8 Movie 187 Kevin Reynolds Samuel L. Jackson, John Heard, Kelly Rowan, Cl... United States NaN NaN NaN NaN November 1, 2019 1997 R 119 min Dramas After one of his high school students attacks ... 5.0
8 s9 Movie 706 Shravan Kumar Divya Dutta, Atul Kulkarni, Mohan Agashe, Anup... India NaN NaN NaN NaN April 1, 2019 2019 TV-14 118 min Horror Movies, International Movies When a doctor goes missing, his psychiatrist w... 9.0
9 s10 Movie 1920 Vikram Bhatt Rajneesh Duggal, Adah Sharma, Indraneil Sengup... India NaN NaN NaN NaN December 15, 2017 2008 TV-MA 143 min Horror Movies, International Movies, Thrillers An architect and his wife move into a castle t... 8.0
# (d)

imdb_scores.tail()
title average Rating num Votes
7782 Zoo NaN NaN
7783 Zoom NaN NaN
7784 Zulu Man in Japan NaN NaN
7785 Zumbo's Just Desserts NaN NaN
7786 ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS NaN NaN

Question 2

Clean up the column names of imdb_scores.

Remember that we need to use .str methods for this, and we must consider replacing empty spaces. Python is case sensitive!

imdb_scores.columns = imdb_scores.columns.str.replace(pat = " ", repl = "_") # Replace spaces with underscores

imdb_scores.columns = imdb_scores.columns.str.lower() # Convert to lowercase

imdb_scores.columns
Index(['title', 'average_rating', 'num_votes'], dtype='object')

Question 3

What are the dimensions of the Netflix data? Display the result in a sentence.

Remember that we can use multi-variable assignment to assign both elements (rows and columns) to their own variables.

# Way 1 - Good practice

rows, columns = netflix.shape

print("There are", rows, "rows and", columns, "columns in the Netflix DataFrame.")


# Way 2

# rows = netflix.shape[0]
# colums = netflix.shape[1]

# print("There are", rows, "rows and", columns, "columns in the Netflix DataFrame.")


# Way 3

# rows = len(netflix) 
# columns = len(netflix.columns) 

# print("There are", rows, "rows and", columns, "columns in the Netflix DataFrame.")
There are 7787 rows and 17 columns in the Netflix DataFrame.

Question 4

What data types does Netflix have?

We use the attribute dtypes.

# Way 1

netflix.dtypes

# Way 2

# netflix.info()
show_id             object
type                object
title               object
director            object
cast                object
primary_country     object
country_2           object
country_3           object
country_4           object
countries_5_up      object
date_added          object
release_year         int64
rating              object
duration            object
listed_in           object
description         object
num_cast           float64
dtype: object

Question 5

  1. How many missing values do we have in the columns of Netflix?

  2. How many missing values do we have in the columns of imdb_scores?

  3. Can you find the total missing values in general?

For (a) & (b) We need to use the .isna() method and one of the summary statistics functions chained together to compute the missing values by column.

  1. Chain a second summary statistic to your answer to (a) and observe!
# (a)

netflix.isna().sum()
show_id               0
type                  0
title                 0
director           2389
cast                718
primary_country     507
country_2          6637
country_3          7397
country_4          7637
countries_5_up     7733
date_added           10
release_year          0
rating                7
duration              0
listed_in             0
description           0
num_cast            718
dtype: int64
# (b)

imdb_scores.isna().sum()
title                0
average_rating    4773
num_votes         4773
dtype: int64
# (c)

netflix.isna().sum().sum()
33753

Question 6

  1. Output a new DataFrame from imdb_scores, where the average rating is greater than 5, name it ‘above_avg_movies’.

  2. Output a new DataFrame from Netflix, where the films were directed by ‘Marcus Raboy’ and the rating was one of “TV-MA, R, PG-13”. Name it ‘raboy_mature_films’.

  1. Remember that we use the indexing brackets [] and specify our condition inside.

  2. Remember that when specifying equivalence, we use ‘==’ and we use the .isin() function for a list of strings.

# (a)
above_avg_movies = imdb_scores[imdb_scores['average_rating'] > 5.0]

above_avg_movies.head()
title average_rating num_votes
351 #Roxy 5.1 284.0
352 10 Days in Sun City 5.1 59.0
353 13 Cameras 5.1 5093.0
354 A Witches' Ball 5.1 347.0
355 All the Devil's Men 5.1 2686.0
# (b) 
raboy_mature_films = netflix[(netflix['director'] == 'Marcus Raboy') & (netflix['rating'].isin(['TV-MA', 'PG-13', 'R']))]

raboy_mature_films.head()
show_id type title director cast primary_country country_2 country_3 country_4 countries_5_up date_added release_year rating duration listed_in description num_cast
550 s551 Movie Anthony Jeselnik: Fire in the Maternity Ward Marcus Raboy Anthony Jeselnik United States NaN NaN NaN NaN April 30, 2019 2019 TV-MA 64 min Stand-Up Comedy Forging his own comedic boundaries, Anthony Je... 1.0
1543 s1544 Movie Cristela Alonzo: Lower Classy Marcus Raboy Cristela Alonzo United States NaN NaN NaN NaN January 24, 2017 2017 TV-MA 67 min Stand-Up Comedy The irrepressible Alonzo skewers Latino stereo... 1.0
1585 s1586 Movie Dana Carvey: Straight White Male, 60 Marcus Raboy Dana Carvey United States NaN NaN NaN NaN November 4, 2016 2016 TV-MA 64 min Stand-Up Comedy Emmy-winning comedian Dana Carvey blends pitch... 1.0
1705 s1706 Movie DeRay Davis: How to Act Black Marcus Raboy DeRay Davis United States NaN NaN NaN NaN November 14, 2017 2017 TV-MA 66 min Stand-Up Comedy Dynamic comic DeRay Davis hits the stage like ... 1.0
3231 s3232 Movie Judd Apatow: The Return Marcus Raboy Judd Apatow United States NaN NaN NaN NaN December 12, 2017 2017 TV-MA 70 min Stand-Up Comedy Judd Apatow returns to stand-up comedy after 2... 1.0

Question 7

How many unique countries are in the column primary_country?

Remember that .unique() returns an array, so how do we find out the number of values within it?

len(netflix["primary_country"].unique())

# Note that nunique() will give 81 as it skips the missing values
82

Question 8

  1. Create a new Dataframe that just contains “Movie” data. Ensure that this is a copy so we avoid the “SettingWithCopy” warning. Look at the column called type; this contains if something is a movie or a tv series.

  2. Within the movies DataFrame created in part (a), create a new column called runtime, where duration has “min” removed, make sure it’s numeric!

  3. Use the new runtime column to create a binned column containing labels for different lengths of runtime, grouping the films into “0-60”, “61-120”, “121-180”, “181-240”, and “241 or more”. Name this ‘runtime_group’. It is recommended that you use “pd.cut()” for this.

  1. Remember that .copy() can be placed on the end of the filtering statement.

  2. For Python you can use .str.replace() to remove some text. Recall that the type to change to is int64.

  3. Remember that we can use np.arange(start =, stop =, step =) for this, and it is a good idea to set the stop as the maximum of the column + 1.

# (a)

movies = netflix[netflix["type"] == "Movie"].copy()

movies.head()
show_id type title director cast primary_country country_2 country_3 country_4 countries_5_up date_added release_year rating duration listed_in description num_cast
1 s2 Movie 07:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico NaN NaN NaN NaN December 23, 2016 2016 TV-MA 93 min Dramas, International Movies After a devastating earthquake hits Mexico Cit... 6.0
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore NaN NaN NaN NaN December 20, 2018 2011 R 78 min Horror Movies, International Movies When an army recruit is found dead, his fellow... 9.0
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly... United States NaN NaN NaN NaN November 16, 2017 2009 PG-13 80 min Action & Adventure, Independent Movies, Sci-Fi... In a postapocalyptic world, rag-doll robots hi... 9.0
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States NaN NaN NaN NaN January 1, 2020 2008 PG-13 123 min Dramas A brilliant group of students become card-coun... 12.0
6 s7 Movie 122 Yasir Al Yasiri Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed... Egypt NaN NaN NaN NaN June 1, 2020 2019 TV-MA 95 min Horror Movies, International Movies After an awful accident, a couple admitted to ... 8.0
# (b)

movies["runtime"] = movies["duration"].str.replace(pat= " min", repl= "")

movies["runtime"] = movies["runtime"].astype("int64")

movies["runtime"].dtype # Check if it's worked
dtype('int64')
# (c)

movies['runtime_group'] = pd.cut(movies['runtime'], 
                                  bins = np.arange(start = 0,
                                                   stop = (movies['runtime'].max() + 1),
                                                   step = 60),
                                  labels = ['0-60', '61-120', '121-180', '181-240', '241 or more'])

movies['runtime_group']
1        61-120
2        61-120
3        61-120
4       121-180
6        61-120
         ...   
7781     61-120
7782     61-120
7783     61-120
7784       0-60
7786     61-120
Name: runtime_group, Length: 5377, dtype: category
Categories (5, object): ['0-60' < '61-120' < '121-180' < '181-240' < '241 or more']

Question 9

  1. How many missing values are there in the ‘num_cast’ column of the movies DataFrame?

  2. Fill in the missing values of the column with the mean number of cast numbers from the column overall. Ensure that this is rounded to the correct number of decimal places.

Hint

  1. .isna()

  2. Remember that we use .fillna() for filling in values and round() for rounding.

# (a)

num_cast_missing = movies['num_cast'].isna().sum()

print("There are", num_cast_missing, "missing values in the num_cast column.")
There are 426 missing values in the num_cast column.
# (b)

movies['num_cast'].fillna(value = round((movies['num_cast'].mean()), 1), 
                          inplace = True)

# Check

movies['num_cast'].isna().sum()
C:\Users\tbalb\AppData\Local\Temp\ipykernel_9720\1793540225.py:3: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


0

Question 10

  1. What’s the median runtime of the movie data?

  2. What’s the mean runtime of the movie data?

  3. What’s the variance of the runtime of the movie data? Use it to find the standard deviation (without the “.std()” function!).

  1. The function for this is the same as the word.

  2. The function for this is the same as the word.

  3. Remember that the standard deviation is the square root of the variance (np.sqrt()).

# (a)

movies["runtime"].median()
98.0
# (b)

movies["runtime"].mean()
99.30797842663195
# (c)

var_runtime = movies["runtime"].var()

std_runtime = np.sqrt(var_runtime)

print(var_runtime, std_runtime)
814.0111584310814 28.53088078610756

Question 11

  1. Sort the movies DataFrame to find the longest movie and return just the columns title and runtime for this data.

  2. Sort the movies DataFrame by release_year and runtime to find the oldest movie with the longest runtime, returning its director, title, runtime, release year and rating. Reset the index so that we have a new one starting at 0.

  1. Remember that we must use a list within the index statement to select multiple columns, df[[my_cols]].

  2. Remember that order is important so it will order by year first and then runtime. We can use .reset_index() and the optional parameter ‘drop’ to accomplish this.

# (a)

sorted_movies = movies[["title", "runtime"]].sort_values("runtime", ascending = False)

sorted_movies.head()
title runtime
957 Black Mirror: Bandersnatch 312
6850 The School of Mischief 253
4490 No Longer kids 237
3694 Lock Your Girls In 233
5108 Raya and Sakina 230
# (b)

short_movie_high_ratings = movies[["title", "director", "runtime", "rating", "release_year"]].sort_values(["release_year", "runtime"], ascending = [True, False]).reset_index(drop = True)

short_movie_high_ratings.head()
title director runtime rating release_year
0 Prelude to War Frank Capra 52 TV-14 1942
1 The Battle of Midway John Ford 18 TV-14 1942
2 Why We Fight: The Battle of Russia Frank Capra, Anatole Litvak 82 TV-PG 1943
3 Undercover: How to Operate Behind Enemy Lines John Ford 61 TV-PG 1943
4 WWII: Report from the Aleutians John Huston 45 TV-PG 1943

Question 12

Group the Netflix data by the primary country and find the mean number of cast listed.

Remember that we first .groupby(), followed with the column name [] and then the summary statistic.

mean_cast_per_country = netflix.groupby("primary_country")["num_cast"].mean()

mean_cast_per_country.head()
primary_country
Argentina     7.000000
Australia     7.600000
Austria       7.166667
Bangladesh    7.500000
Belarus       2.000000
Name: num_cast, dtype: float64

Question 13

Group the Netflix data by type and rating and obtain the overall counts of each type and the average cast numbers for each group.

Remember that grouping by multiple columns requires a list to be specified and we must use the .agg({key:value}) function for multiple aggregations.

rating_by_type = netflix.groupby(["type", "rating"]).agg({"type": "count", "num_cast":"mean"})

rating_by_type
type num_cast
type rating
Movie G 39 10.631579
NC-17 3 9.666667
NR 79 7.112903
PG 247 9.954357
PG-13 386 10.066138
R 663 9.441221
TV-14 1272 7.350043
TV-G 111 6.422222
TV-MA 1845 6.723304
TV-PG 505 6.854801
TV-Y 117 7.805825
TV-Y7 95 8.471910
TV-Y7-FV 5 6.333333
UR 5 11.200000
TV Show NR 5 10.400000
R 2 12.000000
TV-14 659 7.945092
TV-G 83 5.391304
TV-MA 1018 9.623303
TV-PG 301 6.305221
TV-Y 163 5.683099
TV-Y7 176 9.196319
TV-Y7-FV 1 8.000000

Question 14

Join the imdb data to the Netflix data to add rating scores to the overall dataset. In your join you should include the column that specifies whether the value came from the left or right DataFrame.

Remember that there are two ways to do this, left joining from A to B or right joining from B to A. The column we should join on is the same in both DataFrames.

netflix_imdb = netflix.merge(imdb_scores, on = "title", how = "left", indicator = True)

# OR 

# netflix_imdb = pd.merge(left = netflix, right = imdb_scores, on = "title", how = "left", indicator = True)

# netflix_imdb = pd.merge(left = imdb_scores, right = netflix, on = "title", how = "right", indicator = True)

netflix_imdb
show_id type title director cast primary_country country_2 country_3 country_4 countries_5_up date_added release_year rating duration listed_in description num_cast average_rating num_votes _merge
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R... Brazil NaN NaN NaN NaN August 14, 2020 2020 TV-MA 4 Seasons International TV Shows, TV Dramas, TV Sci-Fi &... In a future where the elite inhabit an island ... 11.0 NaN NaN both
1 s2 Movie 07:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico NaN NaN NaN NaN December 23, 2016 2016 TV-MA 93 min Dramas, International Movies After a devastating earthquake hits Mexico Cit... 6.0 5.9 611.0 both
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore NaN NaN NaN NaN December 20, 2018 2011 R 78 min Horror Movies, International Movies When an army recruit is found dead, his fellow... 9.0 4.6 915.0 both
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly... United States NaN NaN NaN NaN November 16, 2017 2009 PG-13 80 min Action & Adventure, Independent Movies, Sci-Fi... In a postapocalyptic world, rag-doll robots hi... 9.0 NaN NaN both
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States NaN NaN NaN NaN January 1, 2020 2008 PG-13 123 min Dramas A brilliant group of students become card-coun... 12.0 NaN NaN both
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7782 s7783 Movie Zozo Josef Fares Imad Creidi, Antoinette Turk, Elias Gergi, Car... Sweden Czech Republic United Kingdom Denmark Netherlands October 19, 2020 2005 TV-MA 99 min Dramas, International Movies When Lebanon's Civil War deprives Zozo of his ... 7.0 6.6 3135.0 both
7783 s7784 Movie Zubaan Mozez Singh Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... India NaN NaN NaN NaN March 2, 2019 2015 TV-14 111 min Dramas, International Movies, Music & Musicals A scrappy but poor boy worms his way into a ty... 8.0 6.1 424.0 both
7784 s7785 Movie Zulu Man in Japan NaN Nasty C NaN NaN NaN NaN NaN September 25, 2020 2019 TV-MA 44 min Documentaries, International Movies, Music & M... In this documentary, South African rapper Nast... 1.0 NaN NaN both
7785 s7786 TV Show Zumbo's Just Desserts NaN Adriano Zumbo, Rachel Khoo Australia NaN NaN NaN NaN October 31, 2020 2019 TV-PG 1 Season International TV Shows, Reality TV Dessert wizard Adriano Zumbo looks for the nex... 2.0 NaN NaN both
7786 s7787 Movie ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS Sam Dunn NaN United Kingdom Canada United States NaN NaN March 1, 2020 2019 TV-MA 90 min Documentaries, Music & Musicals This documentary delves into the mystique behi... NaN NaN NaN both

7787 rows × 20 columns

4 Summary

In this case study you have had the opportunity to apply some of the techniques you have learned to some new data. This is not exhaustive however; have a look at the data and experiment with other techniques you can use.

A good start would be to continue cleaning the movies dataset, analysing the other columns’ missing values and seeing what can be done with them, as there are some that are more complex to consider!

For additional datasets we recommend exploring:

5 What’s Next?

Congratulations! You have reached the end of the compulsory materials for Core Part 1 of Introduction to Python. You have consolidated the variety of skills shown throughout the course in a large structured exercise and you should feel very accomplished!

You have two options on where to go next, you can either take a look at the Part 1 Reference Materials that build on some more advanced concepts at this level, or you can move on to Core Part 2 that introduces more programmatic concepts in Python that are key to developing your technical skills!

Whatever you chose to do, thank-you so much for your dedication in completing these materials! Please do give your honest feedback in the post course survey as we are always looking to improve the quality of our training offering.