Data Visualisation Case Study

1 Introduction

You will use material from chapters 1-4 of the Data Visualisation in Python course to complete structured exercises on a fresh dataset.

The case study will guide you to practice using the following visualisations:

Scatterplots
Bar Plots
Box Plots

Of course there are many more types of visualisation and you are encouraged to experiment should you think there are better ways to look at the data.

2 Course Format

This chapter will present each question as a section, with each part a subsection.

For question 1, the format will be:

Tab 1 will contain the question itself
Tab 2 will contain a hint for the question
Tab 3 will contain the solution with code

For question 2, the format will be the same except that there will be no solution, only a question and a hint.

Important note - The solutions in this case study use a combination of Matplotlib and Seaborn to create visualisations. We’ve tried to reflect reality where you’d need to balance finding creative solutions with accessibility and time constraints. You may want to take your solutions in a different direction and that’s ok, as long as you take away the central message to use what works best for you, your team and your organisation.

2.1 Example

This is an example question (or is it)?

This is a hint!

Insert code here.

We ask you to reflect on the plots used throughout, and on whether you would make the same choices when approaching the dataset we use in the case study.

Data Visualisation is considered both an art and a science, so there are always numerous viewpoints on how best to display information.

3 Packages

Let’s start, as always, by loading the required packages and data.

Packages that may be needed:

Numpy Version: 1.24.4
Pandas Version: 1.5.3
Matplotlib Version: 3.5.3
Seaborn Version: 0.13.0

We may need to use the keyword import

# Load packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams 
import seaborn as sns

Always follow the standard conventions for aliases (pd, np, plt and sns).

4 Fonts

Add in a sans-serif font such as

Arial
Tahoma
Helvetica

for rendering throughout the exercises, and ensure it is used automatically.

We will use Arial for the solutions.

We can set these using rcParams.

rcParams["font.family"] = "sans-serif"
rcParams["font.sans-serif"] = ["Arial"]
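
If Arial is not installed (which is common on Linux machines), Matplotlib typically warns and falls back to its default font. The following is a minimal sketch, assuming you want to check which of the suggested fonts are available before setting rcParams. The names `preferred` and `installed` are illustrative, and DejaVu Sans (bundled with Matplotlib) is assumed as the final fallback.

```python
from matplotlib import font_manager, rcParams

# Names of every font Matplotlib has discovered on this machine
available = {font.name for font in font_manager.fontManager.ttflist}

# Candidate fonts from the list above, in order of preference
preferred = ["Arial", "Tahoma", "Helvetica"]
installed = [name for name in preferred if name in available]

# Use the installed candidates first, then fall back to DejaVu Sans,
# which ships with Matplotlib (assumed fallback, not part of the course)
rcParams["font.family"] = "sans-serif"
rcParams["font.sans-serif"] = installed + ["DejaVu Sans"]
```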

5 Ames Housing Data

Throughout this chapter we will be visualising the Ames Housing Dataset, an incredibly popular choice for visualisation, statistical modelling, machine learning and exploratory data analysis.

It is larger than other training datasets used which gives it a feel closer to that of real-world data.

Figure: a street with houses and cars, representing the Ames Housing Dataset.

6 Task 1 - Relationship with Sale Price

This chapter centers on the Ames housing dataset, specifically examining the sale price as the main variable of interest. You will explore the impact of other variables, analyse the distribution, and assess variations across categorical variables.

Task 1 builds up a detailed scatterplot, ending with facets to reduce clutter. The focus is on visualisation, so there is limited emphasis on the exploratory analysis already covered in Modules 3 and 5.

6.1 Task 1a

Let’s see the data first:

Read in the Ames housing dataset and clean its column names.
Look at the result.

There are packages that will do this for you, such as pyjanitor. You also learnt more programmatic methods to do this in your introductory modules.

ames_data = pd.read_csv("../data/ames.csv")

# Replace spaces with underscores (literal match), then lower-case the names
ames_data.columns = ames_data.columns.str.replace(pat=" ", repl="_", regex=False)
ames_data.columns = ames_data.columns.str.lower()

ames_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   order            2930 non-null   int64  
 1   ms_subclass      2930 non-null   int64  
 2   ms_zoning        2930 non-null   object 
 3   lot_frontage     2440 non-null   float64
 4   lot_area         2930 non-null   int64  
 5   street           2930 non-null   object 
 6   alley            198 non-null    object 
 7   lot_shape        2930 non-null   object 
 8   land_contour     2930 non-null   object 
 9   utilities        2930 non-null   object 
 10  lot_config       2930 non-null   object 
 11  land_slope       2930 non-null   object 
 12  neighborhood     2930 non-null   object 
 13  condition_1      2930 non-null   object 
 14  condition_2      2930 non-null   object 
 15  bldg_type        2930 non-null   object 
 16  house_style      2930 non-null   object 
 17  overall_qual     2930 non-null   int64  
 18  overall_cond     2930 non-null   int64  
 19  year_built       2930 non-null   int64  
 20  year_remod/add   2930 non-null   int64  
 21  roof_style       2930 non-null   object 
 22  roof_matl        2930 non-null   object 
 23  exterior_1st     2930 non-null   object 
 24  exterior_2nd     2930 non-null   object 
 25  mas_vnr_type     1155 non-null   object 
 26  mas_vnr_area     2907 non-null   float64
 27  exter_qual       2930 non-null   object 
 28  exter_cond       2930 non-null   object 
 29  foundation       2930 non-null   object 
 30  bsmt_qual        2850 non-null   object 
 31  bsmt_cond        2850 non-null   object 
 32  bsmt_exposure    2847 non-null   object 
 33  bsmtfin_type_1   2850 non-null   object 
 34  bsmtfin_sf_1     2929 non-null   float64
 35  bsmtfin_type_2   2849 non-null   object 
 36  bsmtfin_sf_2     2929 non-null   float64
 37  bsmt_unf_sf      2929 non-null   float64
 38  total_bsmt_sf    2929 non-null   float64
 39  heating          2930 non-null   object 
 40  heating_qc       2930 non-null   object 
 41  central_air      2930 non-null   object 
 42  electrical       2929 non-null   object 
 43  1st_flr_sf       2930 non-null   int64  
 44  2nd_flr_sf       2930 non-null   int64  
 45  low_qual_fin_sf  2930 non-null   int64  
 46  gr_liv_area      2930 non-null   int64  
 47  bsmt_full_bath   2928 non-null   float64
 48  bsmt_half_bath   2928 non-null   float64
 49  full_bath        2930 non-null   int64  
 50  half_bath        2930 non-null   int64  
 51  bedroom_abvgr    2930 non-null   int64  
 52  kitchen_abvgr    2930 non-null   int64  
 53  kitchen_qual     2930 non-null   object 
 54  totrms_abvgrd    2930 non-null   int64  
 55  functional       2930 non-null   object 
 56  fireplaces       2930 non-null   int64  
 57  fireplace_qu     1508 non-null   object 
 58  garage_type      2773 non-null   object 
 59  garage_yr_blt    2771 non-null   float64
 60  garage_finish    2771 non-null   object 
 61  garage_cars      2929 non-null   float64
 62  garage_area      2929 non-null   float64
 63  garage_qual      2771 non-null   object 
 64  garage_cond      2771 non-null   object 
 65  paved_drive      2930 non-null   object 
 66  wood_deck_sf     2930 non-null   int64  
 67  open_porch_sf    2930 non-null   int64  
 68  enclosed_porch   2930 non-null   int64  
 69  3ssn_porch       2930 non-null   int64  
 70  screen_porch     2930 non-null   int64  
 71  pool_area        2930 non-null   int64  
 72  pool_qc          13 non-null     object 
 73  fence            572 non-null    object 
 74  misc_feature     106 non-null    object 
 75  misc_val         2930 non-null   int64  
 76  mo_sold          2930 non-null   int64  
 77  yr_sold          2930 non-null   int64  
 78  sale_type        2930 non-null   object 
 79  sale_condition   2930 non-null   object 
 80  saleprice        2930 non-null   int64  
dtypes: float64(11), int64(27), object(43)
memory usage: 1.8+ MB
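
The output above shows that year_remod/add still contains a slash, because only spaces were replaced. As a sketch of a more programmatic approach (one possibility, not the course's solution), a regular expression can collapse every run of non-alphanumeric characters into a single underscore. The `clean_names` helper and the toy headers are illustrative assumptions.

```python
import pandas as pd


def clean_names(columns: pd.Index) -> pd.Index:
    """Lower-case names and replace runs of non-alphanumerics with '_'."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )


# Toy frame whose headers mimic the style of the raw column names
toy = pd.DataFrame(columns=["MS SubClass", "Year Remod/Add", "SalePrice"])
toy.columns = clean_names(toy.columns)
print(list(toy.columns))
```

Applied to the real data, this would rename year_remod/add to year_remod_add. None of the solutions in this case study use that column.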

6.2 Task 1b

Consider factors influencing residential housing prices. While some patterns may be elusive, we can explore logical directions. For instance, examine the above-ground living area as a key element in addition to the overall lot area.

Create a scatter plot using sale price as the x axis and above-ground living area as the y axis.
What are your initial impressions from the plot?

You can choose either matplotlib or a seaborn scatterplot for your solution.

# Create our figure and our axes

figure, axes = plt.subplots()

axes.scatter(x=ames_data["saleprice"], y=ames_data["gr_liv_area"])
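
The solution above uses Matplotlib directly. A Seaborn equivalent might look like the sketch below, which uses a small made-up DataFrame in place of ames_data so that it runs on its own. The values are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for the Ames columns
toy = pd.DataFrame({
    "saleprice": [105000, 172000, 244000, 189000],
    "gr_liv_area": [896, 1329, 2110, 1629],
})

figure, axes = plt.subplots()

# Seaborn takes the DataFrame plus column names and draws on the given axes
sns.scatterplot(data=toy, x="saleprice", y="gr_liv_area", ax=axes)
```

Seaborn labels the axes with the column names automatically, whereas the Matplotlib version leaves them blank until you set them.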

6.3 Task 1c

Practice playing with colors and group the plot based on the presence of central air conditioning. Explore trends across this categorical variable, adding another dimension to the plot.

Create two plots and display them side by side:

For the first, display the points in blue
For the second, use the central air conditioning variable as the colour source

You’ll probably need to think about using a loop to apply unique values to each plot.

air_conditioning = ames_data["central_air"].unique().tolist() # Get a list with the unique values of the central_air column.
air_conditioning.sort() # Sorting this list to alphabetical order

# Create our figure and our axes
figure, (axes1, axes2) = plt.subplots(1, 2, figsize=(10, 4))


axes1.scatter(x=ames_data["saleprice"], y=ames_data["gr_liv_area"], color = "blue")

# NEW - Loop through each unique value in turn
for air_con_present in air_conditioning:
    # NEW - Get the subset of our data corresponding to the presence of air conditioning or not.
    air_con_rows = ames_data[ames_data["central_air"] == air_con_present] 
    
    axes2.scatter(x=air_con_rows["saleprice"], 
                 y=air_con_rows["gr_liv_area"],
                 label=air_con_present) # NEW -  Gives each value a label for our legend

# Show the legend on the second plot
axes2.legend();

6.4 Task 1d

You can tweak the colours used in line with accessibility guidelines

Start by copying over the code where colour was mapped to the central air variable.

Use the Palette table in the Analysis Function Guidance on Colours to add the following colours to your plot via the HEX CODES.

Dark blue
Orange

Think about creating a palette variable using the colours you want and then applying it later on.

air_conditioning = ames_data["central_air"].unique().tolist() 
air_conditioning.sort() 
my_air_con_palette = ["#12436D","#F46A25"]

air_con_palette_zip = zip(air_conditioning, my_air_con_palette) 

figure, axes = plt.subplots(figsize=(10, 4))

for air_con_present, colour in air_con_palette_zip:

    air_con_rows = ames_data[ames_data["central_air"] == air_con_present] 
    
    axes.scatter(x=air_con_rows["saleprice"], 
                 y=air_con_rows["gr_liv_area"],
                 c=colour,
                 label=air_con_present)


# Show the Legend
plt.legend();

6.5 Task 1e

Add some lines of best fit.

Add a smooth line of best fit (linear) through the points in your scatter plot.

Have a look at how this can be done in matplotlib and seaborn. Choose the simplest solution.

Here we’ve switched to Seaborn because the Matplotlib solution to this is... well, google “matplotlib scatter line of best fit” and see for yourself.

This is a noisy plot; however, it suggests that the relationship between above-ground living area and price may differ depending on whether the property has central air conditioning.

It seems that we have a much steeper increase when there isn’t, compared to when there is.

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

# Create a seaborn scatter with a custom palette.
plot = sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  palette = sns.color_palette(my_air_con_palette))
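
For comparison, a Matplotlib-only line of best fit is usually built by fitting the line yourself. The sketch below is one common approach, using np.polyfit on made-up points. It is not the course's solution, and unlike lmplot it draws no confidence band and would need repeating for each group.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up points that lie exactly on y = 2x, so the fit is easy to check
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# A degree-1 polynomial fit returns the slope and intercept of the line
slope, intercept = np.polyfit(x, y, deg=1)

figure, axes = plt.subplots()
axes.scatter(x, y)

# Draw the fitted line over the same x values
axes.plot(x, slope * x + intercept, color="#F46A25")
```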

6.6 Task 1f

What if you just want to see the overall trend, as opposed to the trend by group?

Modify the code from part e to produce a single line of best fit through the points, as opposed to one grouped by central air conditioning.
Comment on the plausibility of a correlation here.

Seaborn offers a solution: draw an additional regplot on the axes of the previous plot.

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

# Create a seaborn scatter with a custom palette
plot = sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = False,  
                  hue = "central_air",
                  palette = sns.color_palette(my_air_con_palette))

sns.regplot(x = "saleprice", y = "gr_liv_area", data = ames_data, scatter=False, ax=plot.axes[0,0])
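
To support a comment on correlation, you could also compute a correlation coefficient alongside the plot. This is a minimal sketch on made-up numbers, using the pandas Series.corr method (Pearson by default). The values are illustrative only.

```python
import pandas as pd

# Made-up values; the column names mirror the ones used in the plot
toy = pd.DataFrame({
    "saleprice": [1, 2, 3, 4, 5],
    "gr_liv_area": [2, 4, 5, 4, 5],
})

# Pearson correlation coefficient between the two columns
r = toy["saleprice"].corr(toy["gr_liv_area"])
print(round(r, 3))
```

A coefficient near 1 or -1 indicates a strong linear association, but it says nothing about causation and can be distorted by outliers, so read it together with the plot.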

6.7 Task 1g

Notice the plot has many overlapping points. Split it to analyse the two groups with less clutter.

Use facets to subset the plot by the central air variable.

There are several ways to create multiple plots using either matplotlib or seaborn. Use the code in Chapter 1 as a guide.

There is a much wider distribution of sale prices among properties that have central air conditioning. The properties without air conditioning show a weaker positive relationship than those with it.

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

# Create a seaborn scatter with a custom palette
plot = sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  col = 'central_air',
                  col_wrap= 2,
                  palette = sns.color_palette(my_air_con_palette))

6.8 Task 1h

Further dimensions can be added to your plot, for example by grouping based on whether the property has a paved driveway.

Expand on part g by producing a plot for each combination of paved_drive and central_air.

You need to produce a grid of subplots with one row per central_air value and one column per paved_drive value (2 rows by 3 columns).

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

# Create a seaborn scatter with a custom palette
plot = sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  col = 'paved_drive',
                  row='central_air',
                  palette = sns.color_palette(my_air_con_palette))

6.9 Task 1i

While part h is a useful exercise to show you some of the tools available in matplotlib and seaborn, it is very cluttered and there are some issues with the axis scales and labels.

This makes drawing conclusions more complicated.

Let’s go back and tweak some of the formatting.

Add a suitable title and axes labels to your Scatter Plot.

You’ll need to bring in some matplotlib if you’ve used seaborn to access the underlying methods for setting titles and labels.

As long as your labels are appropriate, then any combination of them is fine.

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

# Create a seaborn scatter with a custom palette
plot = (sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  col = 'central_air',
                  col_wrap= 2,
                  #facet_kws = dict(sharex=False, sharey=False),
                  palette = sns.color_palette(my_air_con_palette))
                  .set_titles("{col_name}", loc = "left" ) # Left-aligned title showing the category value
                  .set_titles("") # Clear the default centred title
                  )

plot.fig.suptitle("Price vs. above ground living area split by existence of air conditioning", 
            size=16, y = 1.05, x = 0.31)

axes = plot.axes

for ax in axes:
    ax.set_xlabel("Sale price ($)")
    ax.set_ylabel("Living Area (sq. ft.) ")

6.10 Task 1j

Let’s investigate the x axis.

What issues are there with the x axis at present?
Modify the x axis to include the upper limits of the sale prices.

Think about the upper and lower limits, do they make sense and are they informative?

The x axis has a couple of key problems for best practice.

It doesn’t quite start from 0
The upper limit on properties is not marked on the x axis, meaning it’s difficult to interpret what the largest data points refer to.

Let’s adjust the right and left limits of the x axis

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

#Find max price for upper x limit
max_price = max(ames_data['saleprice'])

# Create a seaborn scatter with a custom palette
plot = (sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  col = 'central_air',
                  col_wrap= 2,
                  #facet_kws = dict(sharex=False, sharey=False),
                  palette = sns.color_palette(my_air_con_palette))
                  .set_titles("{col_name}", loc = "left" )
                  .set_titles("")
                            )

plot.fig.suptitle("Price vs. above ground living area split by existence of air conditioning", 
            size=16, y = 1.05, x = 0.31)


axes = plot.axes

for ax in axes:
    ax.set_xlim(right=round(max_price,-5),left=0) # round() with -5 rounds the maximum price to the nearest 100,000 for the right limit
    ax.set_xlabel("Sale price ($)")
    ax.set_ylabel("Living Area (sq. ft.) ")
    ax.grid(visible=True) # Pop in some gridlines for good measure

# This introduces some issues with the distance between the subplots so let's fix that
plt.subplots_adjust(wspace=0.1)
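
One caveat with round(): it rounds to the nearest multiple, so it can round down and leave the largest points beyond the axis limit. A hedged alternative is to always round up. The `round_up` helper and the example maximum below are hypothetical and not part of the course solution.

```python
import math


def round_up(value: float, step: int) -> int:
    """Round value up to the next multiple of step."""
    return math.ceil(value / step) * step


# Hypothetical maximum price, chosen to show the difference
example_max = 740_000

nearest = round(example_max, -5)        # nearest 100,000, which falls below the maximum
upper = round_up(example_max, 100_000)  # next 100,000 at or above the maximum
print(nearest, upper)
```

With `round_up`, the axis limit can never fall below the largest value, whatever the data.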

6.11 Task 1k

Let’s investigate the y axis this time.

What issues are there with the y axis at present?
Modify the y axis to include proper upper and lower limits.

What about the top and bottom of the axis? Similar to the previous question but using a different method.

The y axis has similar problems to the x axis.

Let’s adjust the top and bottom limits of the y axis.

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

#Find max price and max living area for the upper axis limits
max_price = max(ames_data['saleprice'])
max_floor_space = max(ames_data['gr_liv_area'])

# Create a seaborn scatter with a custom palette
plot = (sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  col = 'central_air',
                  col_wrap= 2,
                  #facet_kws = dict(sharex=False, sharey=False),
                  palette = sns.color_palette(my_air_con_palette))
                  .set_titles("{col_name}", loc = "left" )
                  .set_titles("")
                            )

plot.fig.suptitle("Price vs. above ground living area split by existence of air conditioning", 
            size=16, y = 1.05, x = 0.31)


axes = plot.axes

for ax in axes:
    ax.set_xlim(right=round(max_price,-5),left=0) # round() with -5 rounds the maximum price to the nearest 100,000 for the right limit
    ax.set_ylim(top=round(max_floor_space,-3),bottom=0) # round() with -3 rounds the maximum living area to the nearest 1,000 for the top limit
    ax.set_xlabel("Sale price ($)")
    ax.set_ylabel("Living Area (sq. ft.) ")
    ax.grid(visible=True) # pop in some gridlines for good measure


sns.despine(left=True, bottom=True)

# This introduces some issues with the distance between the subplots so let's fix that
plt.subplots_adjust(wspace=0.1)

6.12 Task 1l

Your final tweak is to turn off the x axis gridlines.

Remove the x axis gridlines

Add further parameters to the ax.grid() method.

There. We’ve sufficiently tidied up our plot so that it is easier to interpret and visually more appealing, following some of the ONS guidance as much as we can.

This is just the start. If you want your visualisations to be production ready then you’ll need to work collaboratively with colleagues and follow the guidelines of your organisation to determine minimum standards.

#Create custom palette
my_air_con_palette = ["#12436D","#F46A25"]

#Find max price and max living area for the upper axis limits
max_price = max(ames_data['saleprice'])
max_floor_space = max(ames_data['gr_liv_area'])

# Create a seaborn scatter with a custom palette
plot = (sns.lmplot(x = "saleprice", y = "gr_liv_area", data = ames_data,
                  fit_reg = True,  
                  hue = "central_air",
                  col = 'central_air',
                  col_wrap= 2,
                  #facet_kws = dict(sharex=False, sharey=False),
                  palette = sns.color_palette(my_air_con_palette))
                  .set_titles("{col_name}", loc = "left" )
                  .set_titles("")
                            )

plot.fig.suptitle("Price vs. above ground living area split by existence of air conditioning", 
            size=16, y = 1.05, x = 0.31)


axes = plot.axes

for ax in axes:
    ax.set_xlim(right=round(max_price,-5),left=0) # round() with -5 rounds the maximum price to the nearest 100,000 for the right limit
    ax.set_ylim(top=round(max_floor_space,-3),bottom=0) # round() with -3 rounds the maximum living area to the nearest 1,000 for the top limit
    ax.set_xlabel("Sale price ($)")
    ax.set_ylabel("Living Area (sq. ft.) ")
    ax.grid(visible=True , which = "major", axis="y", color = (0.745, 0.745, 0.745)) # Light grey gridlines on the y axis only, so no x axis gridlines are drawn
    ax.set_axisbelow(True)


sns.despine(left=True, bottom=True)

# This introduces some issues with the distance between the subplots so let's fix that
plt.subplots_adjust(wspace=0.1)

7 Task 2 - Exploring continuous and categorical variables

In this section, we’ll delve into exploring various dataset variables—an essential first step in Exploratory Data Analysis (EDA). EDA involves the initial investigation of our data, guiding subsequent cleaning and analysis.

Further insights into EDA were covered in Module 3 - Statistics. Given the significance of data visualisation in this process, practical exercises are beneficial here.

Specifically for this question, our focus is on:

Generating additional geometric objects.
Formatting them while addressing unique characteristics.
Adhering to best practice guidelines.

Crucially, this section does not have code solutions, only hints. You are on your own with this, and it is your job to investigate the nature of the dataset using the skills you’ve developed.

7.1 Task 2a

Let’s start with a categorical variable that would likely contribute to house prices, such as garage type. This has 7 levels:

2Types - More than one type of garage.
Attchd - Attached to the home.
Basment - Basement garage.
BuiltIn - Built-in garage.
CarPort - Car port garage.
Detchd - Detached garage.
NA - No garage.

Count the number of properties for each garage type.
Break this down further by outputting the number of properties per garage type by the foundational material used to build the property (foundation variable), which has six levels:

BrkTil - Brick and Tile
CBlock - Cinder Block
PConc - Poured Concrete
Slab - Slab
Stone - Stone
Wood - Wood

We can solve both of these problems with groupby
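
As a reminder of the groupby pattern, here is a minimal sketch on a made-up DataFrame. The column names are placeholders rather than Ames variables, so adapting it to the task is left to you.

```python
import pandas as pd

# Made-up data; the column names are placeholders, not Ames variables
toy = pd.DataFrame({
    "shape": ["circle", "square", "circle", "circle", "square"],
    "colour": ["red", "red", "blue", "red", "blue"],
})

# Number of rows for each level of one column
per_shape = toy.groupby("shape").size()

# Number of rows for each combination of two columns
per_shape_colour = toy.groupby(["shape", "colour"]).size()

print(per_shape)
print(per_shape_colour)
```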

7.2 Task 2b

It is always important to have summary tables from part a, especially for accessibility purposes (if you can’t provide useful alternative text, then the data used to create the plot is the bare minimum).

It would, of course, be easier to visualise the results in a bar chart, where we can draw initial conclusions at a glance.

Create a bar chart to count the number of properties per garage type, ensuring to:

Colour with the best practice blue #12436D
Rotate the plot 90 degrees to the right.

Both matplotlib and seaborn have barplots with the ability to reorient the bars to be horizontal
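
For instance, Matplotlib provides barh for horizontal bars. The sketch below assumes a made-up Series of counts and is only an illustration of the call, not a solution to the task.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts indexed by category
counts = pd.Series({"circle": 3, "square": 2, "triangle": 1})

figure, axes = plt.subplots()

# barh draws one horizontal bar per category; width is the bar length
axes.barh(y=counts.index.tolist(), width=counts.values, color="#12436D")
```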

7.3 Task 2c

It would be easier to observe the results if we ordered the bars in descending order.

Sort the y axis to be in descending order.

Comment on what the plot shows us.

Use pandas to group the data, sort values and then matplotlib/seaborn to plot the results

7.4 Task 2d

Investigate the dataset to see if there are any missing values in this category. We don’t want to remove them as they are an intended level of the garage_type variable. How could we identify and deal with them?

Go back and investigate to see whether the original dataset has any missing values and how these are coded. Reproduce the grouped data to include these and see how this impacts the chart.

Think about how pandas deals with missing values when importing datasets. You might want to interrogate the data outside python to see what unique values this field has and compare with what has been read in.

The groupby method has the functionality to drop/include missing values.
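
Both hints can be illustrated on a tiny made-up CSV in which the text "NA" is a genuine category. This is a sketch with assumed toy data. The read_csv parameter keep_default_na and the groupby parameter dropna are real pandas options, but whether either is the right choice for the Ames data is for you to decide.

```python
import io

import pandas as pd

# Made-up CSV in which the text "NA" is a genuine category
csv_text = "id,kind\n1,A\n2,NA\n3,B\n4,A\n"

# By default, read_csv converts the text "NA" to a missing value (NaN)
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False leaves "NA" as an ordinary string
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# groupby drops NaN keys unless dropna=False is passed
without_na = default.groupby("kind").size()
with_na = default.groupby("kind", dropna=False).size()

print(without_na)
print(with_na)
```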

7.5 Task 2e

Let’s ensure our plot adheres to best practice guidelines.

Modify the y axis ticks to ensure we have more gridlines.
Apply as many of the principles you used in the previous section as you can to the plot.
Amend the plot to remove the y axis gridlines and horizontally adjust the axis title by 1.
Add a title, subtitle and axes labels to the plot.

Much of the syntax is the same as when you created scatterplots. The only real differences are with the barplot specific methods and the results of changing certain elements of the axes.

7.6 Task 2f

You now have a great plot at the surface level, but what if you want to break this down by the material used in its foundation?

This lets you see if the attached and detached garage properties were also produced with high quality materials.

Count the number of garage types per foundation material.
Create a stacked bar chart.
Tweak the title in line with what the plot now shows.
What does this plot show us?

Use groupby in pandas to include an additional variable.

Stacked bar charts can be created in either matplotlib or seaborn. You’ve already learnt how to edit the different elements of a plot.
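
One possible route, sketched here on made-up data with placeholder column names, is to count each combination with groupby, pivot one variable into columns with unstack, and let the pandas plot method stack the bars. Other approaches in Matplotlib or Seaborn are equally valid.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; the column names are placeholders, not Ames variables
toy = pd.DataFrame({
    "shape": ["circle", "circle", "square", "square", "square"],
    "colour": ["red", "blue", "red", "red", "blue"],
})

# Count each combination, then pivot the second variable into columns
table = toy.groupby(["shape", "colour"]).size().unstack(fill_value=0)

figure, axes = plt.subplots()

# pandas draws one bar per row and stacks the columns within it
table.plot(kind="barh", stacked=True, ax=axes)
```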

7.7 Task 2g

Let’s look at a continuous variable this time, the best being the sale price itself!

A great place to start with any dataset (particularly if there is a variable of interest to predict, observe relationships with etc) is to investigate the distribution of its continuous variables.

These allow us to confirm whether the data is skewed by outliers (values that lie far from the bulk of the data) and thus whether clipping variables would be a useful cleaning step.

Produce a histogram to observe the distribution of the sale price variable.
What are your initial thoughts upon observing this plot?

As with other types of plot, investigate how to create a histogram in either matplotlib or seaborn.

7.8 Task 2h

Let’s play around with different bin sizes.

Create three plots, one with the default bins, one with 20 bins and another with 40. Arrange them as best you see fit.

Does changing the number of bins impact our ability to interpret the distribution of this specific variable?

Think back to previous solutions to creating multiple plots in either matplotlib or seaborn
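
As a reminder of the mechanics, the sketch below draws histograms of synthetic, right-skewed values with three bin settings side by side. The data is randomly generated for illustration and the layout is just one option.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed values, seeded so the result is repeatable
rng = np.random.default_rng(seed=0)
values = rng.lognormal(mean=12, sigma=0.4, size=500)

figure, axes_row = plt.subplots(1, 3, figsize=(12, 3))

# bins=None lets Matplotlib use its default number of bins
for ax, bins in zip(axes_row, [None, 20, 40]):
    ax.hist(values, bins=bins)
    ax.set_title("default bins" if bins is None else f"{bins} bins")
```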

7.9 Task 2i

Let’s apply some best practice tweaks to our histogram.

Adjust the width of the bins to 1, set the outline of the bars to white, the bar colour to “#12436D” and transparency to 0.8.
Modify the continuous y scale to produce more gridlines.
Modify the continuous x scale to include an upper limit.
Set appropriate labels for the plot.
Style it how you like and turn off the x axis gridlines.

Use previous parts of the case study and the training materials to apply these tweaks

7.10 Task 2j

Histograms tend to be used for single continuous variables. What if we wanted to compare the distribution across the groups of a categorical variable?

Let’s return to our thoughts earlier on whether properties with specific garage types have a price spread or a more common pattern.

What two methods could we use to observe the sale price distribution across the different garage types?

Think about how to create a boxplot, with the sale price distribution split between garage types. This gives summary statistics at a glance, averages and quartiles, as well as explicit markers where outliers are likely to be.

7.11 Task 2k

Let’s see how the distribution of sale price differs by garage type.

Create a boxplot to compare the distribution of the sale price across the different garage types. Again you can choose matplotlib or seaborn
Apply any tweaks you think are appropriate
Comment on the results.

Think about features specific to a boxplot. Should you remove/modify outliers? Is the boxplot size proportional to the sample size?

At the very least just get the boxplot working so you understand what it’s showing you.

7.12 Task 2l

You probably can’t produce a publication ready visualisation in this case study, however you can improve the accessibility of the plot so that colleagues can review it against organisational guidelines.

Apply the following tweaks to your boxplot:

Remove the outlier shape and colour changes, it is recommended we use filled black circles. Reduce the size to 2.5.
Use all of the colours from the Analysis Function guidance to fill the boxplot.
Modify the y axis if necessary.
Add appropriate labels to the plot.
Turn off the legend if present and remove x axis gridlines.

All of the above has either been covered previously in the case study, training materials or can be solved by looking at the matplotlib or seaborn documentation.
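
As a pointer into the documentation, Matplotlib's boxplot accepts a flierprops dictionary for styling outliers and, with patch_artist=True, returns boxes that can be filled. The sketch below uses synthetic data and only two palette colours, so treat it as an illustration of the calls and not as a finished answer.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic groups, seeded so the result is repeatable
rng = np.random.default_rng(seed=1)
groups = {
    "circle": rng.normal(loc=50, scale=5, size=100),
    "square": rng.normal(loc=60, scale=10, size=100),
}

# Outlier markers as small filled black circles
flier_style = dict(marker="o", markerfacecolor="black",
                   markeredgecolor="black", markersize=2.5)

figure, axes = plt.subplots()

# patch_artist=True makes the boxes fillable
result = axes.boxplot(list(groups.values()), flierprops=flier_style,
                      patch_artist=True)
axes.set_xticklabels(list(groups.keys()))

# Fill each box with a colour from a palette
for box, colour in zip(result["boxes"], ["#12436D", "#F46A25"]):
    box.set_facecolor(colour)

# Horizontal gridlines only, drawn behind the boxes
axes.grid(visible=True, axis="y")
axes.set_axisbelow(True)
```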

8 Summary

Congratulations on finishing the case study exercises. By working through them, you should feel more comfortable using matplotlib and seaborn to create visualisations and customise them to align with Analysis Function guidelines.

You have a few options from here:

Proceed to investigate the optional reference material.
Read deeper into the topic using online references at https://matplotlib.org/ and https://seaborn.pydata.org/index.html.
Experiment with alternative visualisation packages such as plotly (https://plotly.com/python/)