Data Visualisation Case Study


1 Purpose of Chapter

Chapters 1-4 covered the following:

  • Grammar of graphics in ggplot2
  • Analysis Function guidelines for publication quality plots
  • Plotting geometries for various variable types.

What we have yet to do is provide a structured exercise that allows you to practise the techniques covered throughout the course.

Note that some solutions are adapted from Hadley Wickham's R for Data Science (2nd Edition), chapters 1 and 9.

2 Pre-requisite Information

This case study requires the use of techniques covered throughout chapters 1-4 of this Data Visualisation course. You should be comfortable with:

  • Creating a plot with ggplot and aesthetic mappings.
  • Adding features such as colour and labels.
  • Mapping features to variables such as colour and shape.
  • Setting theme elements.
  • Grouping plots with facets.
  • Using showtext to render fonts.
  • Creating scatter plots, histograms, bar charts and boxplots.

3 Course Format

This chapter will present questions as tabsets:

  • Tab 1 will contain the question itself.
  • Tab 2 will contain a hint for the question.
  • Tab 3 will contain the solution with code.

3.1 Example

This is an example question.

This is a hint.

Insert code here.

Throughout, we ask you to:

  • Reflect on the plots created throughout.
  • Consider alternative plots you would make when approaching the case study.

Data Visualisation is considered both an art and a science, so there are always numerous viewpoints on how best to display information.

4 Packages

Import the following packages:

  • tidyverse
  • janitor
  • scales
  • patchwork
  • showtext

Where do you go to take out books?

# Import necessary packages

library(tidyverse) # Data manipulation & ggplot2
library(janitor) # Cleaning column names
library(scales) # Automatic axes labels
library(patchwork) # Combine plots side-by-side
library(showtext) # Streamlined font rendering

5 Fonts

Add a sans-serif font such as:

  • Arial
  • Tahoma
  • Helvetica

Ensure that it is used automatically throughout.

Solutions will use Arial font.

You add fonts using name and path.

Showtext functions decide how to render.

# Set the Arial font for the exercises

font_add(family = "Arial", regular = "C:\\Windows\\Fonts\\Arial.ttf")

showtext_auto()

6 Ames Housing Data

Throughout this chapter you will be investigating the Ames Housing Dataset, a popular choice for visualisation and machine learning.

Houses of all shapes, sizes and colours in the city of Ames, Iowa.

7 Question 1 - A relationship with sale price

The main variable of interest in the Ames housing dataset is the sale price of a house. You are interested in:

  • Its relationships with other variables
  • Its distribution and how this varies across categorical variables.

In this question you will:

  • Build a complex scatterplot from the ground up.
  • Use facets to create less busy subplots.
  • Add accessibility tweaks and best practice elements.

7.1 Part a

Let’s see the data first:

  1. Read in the Ames housing dataset and clean its column names with janitor.

  2. Observe the data.

  1. There is a function for this, clean_<..>

  2. Take a glimpse.

# Glimpse the ames housing data

ames <- read_csv("./data/ames.csv") |> 
    clean_names()
Rows: 2930 Columns: 81
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (44): MS SubClass, MS Zoning, Street, Alley, Lot Shape, Land Contour, Ut...
dbl (37): Order, Lot Frontage, Lot Area, Overall Qual, Overall Cond, Year Bu...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ames |> 
    glimpse()
Rows: 2,930
Columns: 81
$ order           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ ms_sub_class    <chr> "020", "020", "020", "020", "060", "060", "120", "120"…
$ ms_zoning       <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL", "RL", …
$ lot_frontage    <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, NA, 63, 8…
$ lot_area        <dbl> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5…
$ street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"…
$ alley           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ lot_shape       <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg", "IR1"…
$ land_contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "HLS"…
$ utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "All…
$ lot_config      <chr> "Corner", "Inside", "Corner", "Corner", "Inside", "Ins…
$ land_slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl"…
$ neighborhood    <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert", "Gilber…
$ condition_1     <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm…
$ condition_2     <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"…
$ bldg_type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "Twnhs…
$ house_style     <chr> "1Story", "1Story", "1Story", "1Story", "2Story", "2St…
$ overall_qual    <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, …
$ overall_cond    <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, …
$ year_built      <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ year_remod_add  <dbl> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996, …
$ roof_style      <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable", "Gable…
$ roof_matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg",…
$ exterior_1st    <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
$ exterior_2nd    <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
$ mas_vnr_type    <chr> "Stone", "None", "BrkFace", "None", "None", "BrkFace",…
$ mas_vnr_area    <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603,…
$ exter_qual      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd", "Gd", …
$ exter_cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ foundation      <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc", "PCon…
$ bsmt_qual       <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd", "Gd", …
$ bsmt_cond       <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ bsmt_exposure   <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No", "No", …
$ bsmt_fin_type_1 <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ"…
$ bsmt_fin_sf_1   <dbl> 639, 468, 923, 1065, 791, 602, 616, 263, 1180, 0, 0, 9…
$ bsmt_fin_type_2 <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf"…
$ bsmt_fin_sf_2   <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0…
$ bsmt_unf_sf     <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 76…
$ total_bsmt_sf   <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994…
$ heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"…
$ heating_qc      <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", …
$ central_air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", …
$ x1st_flr_sf     <dbl> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 102…
$ x2nd_flr_sf     <dbl> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0,…
$ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gr_liv_area     <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1…
$ bsmt_full_bath  <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
$ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ full_bath       <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1, …
$ half_bath       <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, …
$ bedroom_abv_gr  <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, …
$ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ kitchen_qual    <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd", "Gd", …
$ tot_rms_abv_grd <dbl> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8, 8,…
$ functional      <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ"…
$ fireplaces      <dbl> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, …
$ fireplace_qu    <chr> "Gd", NA, NA, "TA", "TA", "Gd", NA, NA, "TA", "TA", "T…
$ garage_type     <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "Att…
$ garage_yr_blt   <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ garage_finish   <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin", "RFn"…
$ garage_cars     <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
$ garage_area     <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440,…
$ garage_qual     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ garage_cond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ paved_drive     <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ wood_deck_sf    <dbl> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, …
$ open_porch_sf   <dbl> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 5…
$ enclosed_porch  <dbl> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ screen_porch    <dbl> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, 210…
$ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pool_qc         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fence           <chr> NA, "MnPrv", NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, …
$ misc_feature    <chr> NA, NA, "Gar2", NA, NA, NA, NA, NA, NA, NA, NA, "Shed"…
$ misc_val        <dbl> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, …
$ mo_sold         <dbl> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, 6, …
$ yr_sold         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
$ sale_type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", …
$ sale_condition  <chr> "Normal", "Normal", "Normal", "Normal", "Normal", "Nor…
$ sale_price      <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500…

The Data Dictionary is in this American Statistical Association text file for clarity.

7.2 Part b

Let’s start thinking about what would likely impact the housing price in a residential area.

For example, the area of the property. The above ground living area would be a good start.

  1. Create a scatter plot using sale price as the x axis and above ground living area as the y axis.

  2. What are your initial impressions from the plot?

Remember your aes() mappings and that each layer added to a ggplot uses the + operator, not the pipe.

The Scatter plot geom is not called scatter.

# Baseline scatter plot

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point()

  • There is a semblance of a positive linear relationship, in that as above ground living area increases, so too does the sale price.

  • There is an incredibly dense grouping in the low area, low price space, so these are likely the most common properties in the area.

This is hardly a shocking revelation, but we should never assume this relationship; the data should show it.

7.3 Part c

Let’s add another dimension to the plot by playing with the colour aesthetic.

Create two plots and display them side by side:

  1. Firstly, display the points in the colour “blue”.

  2. Secondly, map the central air conditioning variable to the colour aesthetic.

  3. Comment on the results.

You will need to use patchwork and mathematical operators to display plots in this way.

  1. Local layer colour is needed.

  2. You can do this in the global layer.

# Displaying colour variations side by side

single_colour_scatter <- ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(colour = "blue")

grouped_colour_scatter <- ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area,
                         colour = central_air)) + 
    geom_point()

single_colour_scatter + grouped_colour_scatter

  • Despite the overlay of colours, it seems that the properties without central air have lower sale prices.

  • Properties with central air are dispersed across the full range of sale prices.

7.4 Part d

You should always consider whether the plot can be made more accessible. Colours are a good place to start.

Start by copying over the code from part c.

Use the Palette table in the Analysis Function Guidance on Colours to add the following colours to your plot via the HEX CODES.

  • Dark blue
  • Orange

You will need to add a scale layer to the plot to manually add your own colours.

# Tweak colours in line with guidance

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area,
                         colour = central_air)) +
    geom_point() +
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25"))

The overlay still makes it difficult to determine one group from another.

7.5 Part e

Let’s consider adding some lines of best fit.

Add a smooth line of best fit (linear) through the points in your scatter plot.

Remember that you need to use the “lm” method for this.

# Add lines of best fit through points

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area,
                         colour = central_air)) +
    geom_point() +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25"))
`geom_smooth()` using formula = 'y ~ x'

This is a very noisy plot, but does show that there may be a difference in how much the overall area of the property impacts the price, based on whether there is central air conditioning or not.

There is a steeper increase when there isn’t, compared to when there is.

7.6 Part f

What if you just want to see the overall trend, as opposed to the trend by group?

  1. Modify the code from part e to produce a single line of best fit through the points.

  2. Comment on the plausibility of a correlation here.

  1. Remember the difference between mapping a variable globally in ggplot(), vs. mapping at the specific geometric layer.

  2. Be specific with your terms: what type of correlation is it, and how strong?

With this example, you want your x and y axes to be set globally as they never change. However, if you group the colour by central_air globally, then every subsequent layer will group in this way.

As such, you need to place colour within the layer that requires it and only that layer.

# Single line of best fit through points

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25"))
`geom_smooth()` using formula = 'y ~ x'

  • It is plausible that there is a moderate to strong positive correlation here.
  • This implies that the sale price of the house increases with the above ground living area regardless of air conditioning.

Further correlation analysis would be needed to determine the relationship statistically.
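The effect of global versus local colour mapping on the number of fitted lines can be checked on a small example. The sketch below uses a made-up data frame (toy_houses and its columns are hypothetical, not part of the Ames data) and ggplot2's layer_data() to count how many separate lines geom_smooth() fits in each case.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Ames columns
toy_houses <- data.frame(
    price = c(100, 150, 200, 250, 300, 350),
    area  = c(900, 1100, 1500, 1400, 2000, 2300),
    air   = c("N", "N", "N", "Y", "Y", "Y")
)

# Colour mapped globally: every layer, including the smooth, is grouped by air
global_plot <- toy_houses |>
    ggplot(mapping = aes(x = price, y = area, colour = air)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Colour mapped locally: only the points are grouped by air
local_plot <- toy_houses |>
    ggplot(mapping = aes(x = price, y = area)) +
    geom_point(mapping = aes(colour = air)) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; count the distinct groups it was fitted to
n_lines_global <- length(unique(layer_data(global_plot, 2)$group))
n_lines_local <- length(unique(layer_data(local_plot, 2)$group))
```

With this toy data, the global version fits one line per group (two lines), while the local version fits a single line through all of the points.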

7.7 Part g

The plot is very noisy, so it could be beneficial to create subplots by grouping.

Use facets to subset the plot by the central air variable.

To subset by one variable, you want to wrap.

# Facet by central_air

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air)
`geom_smooth()` using formula = 'y ~ x'

  • As noticed before, there is a much wider distribution of sale prices (as well as a larger number of properties in general) among properties that have central air conditioning.

  • The properties without air conditioning have a less strong positive relationship than those with it.

7.8 Part h

You could add another dimension to your plot by further grouping by whether the property has a paved driveway.

Expand on part g by producing a plot for each combination of paved_drive and central_air.

You need to produce a 3 * 2 grid of subplots.

# Subset by paved_drive and central_air

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(paved_drive ~ central_air)
`geom_smooth()` using formula = 'y ~ x'

This plot is confusing due to both the central air and paved drive variables having the levels “Y” and “N” within them, making drawing conclusions difficult.

  • At a glance we see that properties with central air and a paved drive are the most common and spread across the entire sale price range.

  • We see that any other combination of property features lands in the lower price ranges overall, be they paved, partially paved and so on.

7.9 Part i

Part h is useful, but in practice, has some problems:

  • The variables the plots are grouped by have similar labels (N, Y, etc.), which makes the plot hard to read.
  • The other grouped variable isn’t mentioned in the legend or key.

As such, you can disregard it and move forward with the approach in Part g.

Using the plot from part g:

Add a suitable title, subtitle and axes labels to your Scatter Plot.

You need to add a labels layer to our plot.

As long as your labels are appropriate, then any combination of them is fine.

# Adding labels to the plot

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air) +
    labs(title = "Price vs. above ground living area split by existence of air conditioning.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale price ($)",
         y = "Living Area (sq. ft.)")
`geom_smooth()` using formula = 'y ~ x'

7.10 Part j

You could further distinguish groups in the plot by mapping shapes to the air conditioning groups.

  1. Map shapes to the two air conditioning groups, ensuring they appear in the legend.

  2. Capitalize the legend label, ensuring only one legend exists.

  1. You need to use the shape parameter locally.

  2. You need to add a label using the parameter that led to the creation of the legend in the first place, which is colour.

# Map central_air to shape and tweak legend

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air,
                             shape = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air) +
    labs(title = "Price vs. above ground living area split by whether the property has air conditioning.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale price ($)",
         y = "Living Area (sq ft)",
         colour = "Central Air",
         shape = "Central Air")
`geom_smooth()` using formula = 'y ~ x'

7.11 Part k

Let’s investigate the x axis.

  1. What issues are there with the x axis at present?

  2. Modify the x axis to:

  • Include the upper limit of sale price on the axis.
  • Remove scientific notation, ensuring the numbers have commas.

  1. Should there be e’s?

Throughout (ii), you need the continuous x scaler:

  • You can use seq() to create a new set of axes breaks.
  • You can set the limits with a separate parameter.
  • You can use the labels parameter for commas.

  1. The x axis does start from 0, but has two key problems for best practice:

  • Currently in scientific notation, where 200,000 is written as 2e+05, which means 2 * 10^5.

  • An upper limit (properties > 600,000USD) is not included, so points go past the current axes breaks.

Since finding the max price is a separate task, let’s do that first.

# Find max sale price of ames houses

ames |> 
    summarise(max_price = 
                  max(sale_price, na.rm = TRUE))
# A tibble: 1 × 1
  max_price
      <dbl>
1    755000

Let’s think about new axes breaks:

  • You could rescale the axis entirely in multiples of $50,000, but this would likely expand the axis too much.

  • You could rescale in multiples of 150,000 USD, but that would end up overshooting substantially.

A sensible choice would be to take the existing breaks and expand up to 800,000 USD.

# Use x scaler to modify the axis

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air,
                             shape = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air) +
    scale_x_continuous(breaks = seq(from = 0,
                                  to = 8e+05,
                                  by = 2e+05), 
                       limits = c(0, 8e+05),
                       labels = scales::comma_format()) + 
    labs(title = "Price vs. above ground living area split by whether the property has air conditioning.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale price ($)",
         y = "Living Area (sq ft)",
         colour = "Central Air",
         shape = "Central Air")
`geom_smooth()` using formula = 'y ~ x'

First, set up the axes breaks with the seq() function.

However, changing the breaks doesn’t ensure they are marked on the axis. Here, there aren’t any $800,000 properties, so you need to set the limits to ensure it’s marked on the graph.

Lastly, use the comma_format() from the excellent scales package to convert the scientific notation.

Aside: You can prevent repeated tweaking by turning off scientific notation:

  • options(scipen = 999)
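As a quick check of the axis formatting discussed in this part, the sketch below builds the same breaks with seq() and compares R's default text for a large number with comma-formatted labels. It assumes a recent version of scales, where label_comma() is the newer name for the labeller that comma_format() provides. The object names are illustrative.

```r
library(scales)

# Breaks from 0 to 800,000 in steps of 200,000
price_breaks <- seq(from = 0, to = 8e+05, by = 2e+05)

# With default options, R switches to scientific notation for 200,000
default_label <- format(2e+05)

# label_comma() returns a function that formats numbers with commas
comma_labels <- label_comma()(price_breaks)
```

Printing comma_labels shows the text that would appear at each break on the x axis.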

7.12 Part l

Let’s investigate the y axis this time.

  1. What issues are there with the y axis at present?

  2. Modify the y axis to include proper limits.

  1. What about the top and bottom of the axis? Number of breaks?

  2. You will need the continuous y scaler.

At present, there are a few problems:

  • It is difficult to pinpoint the actual above ground living areas of the lowest and highest priced properties due to dense groupings.

  • For accessibility purposes you want 8-10 gridlines, which the present y axis will not allow for.

An aside is that the audience should also be considered given that square feet (a non-universal metric) is used.

# Maximum gr_liv_area

ames |> 
    summarise(max_area = 
                  max(gr_liv_area, na.rm = TRUE))
# A tibble: 1 × 1
  max_area
     <dbl>
1     5642

For the gridlines:

  • You could change the breaks to every 500 square feet, but this would produce many gridlines.

  • Perhaps a better choice would be to break every 750 square feet, producing 9 gridlines.

You need to set the limits as well, since there are no properties that quite reach 6000 square feet.
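The gridline counts mentioned above can be verified by counting the breaks each step size produces between 0 and 6000. This is a small base R sketch. The object names are illustrative.

```r
# Candidate y axis breaks from 0 to 6000 square feet
breaks_500 <- seq(from = 0, to = 6000, by = 500)
breaks_750 <- seq(from = 0, to = 6000, by = 750)

# Number of gridlines each choice would draw
n_breaks_500 <- length(breaks_500)
n_breaks_750 <- length(breaks_750)
```

A step of 500 gives 13 breaks, while a step of 750 gives 9, which falls within the 8-10 range.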

# Rescale the y axis

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air,
                             shape = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air) +
    scale_x_continuous(breaks = seq(from = 0,
                                  to = 8e+05,
                                  by = 2e+05), 
                       limits = c(0, 8e+05),
                       labels = scales::comma_format()) + 
    scale_y_continuous(breaks = seq(from = 0,
                                    to = 6000,
                                    by = 750),
                       limits = c(0, 6000)) + 
    labs(title = "Price vs. above ground living area split by whether the property has air conditioning.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale price ($)",
         y = "Living Area (sq ft)",
         colour = "Central Air",
         shape = "Central Air")
`geom_smooth()` using formula = 'y ~ x'

7.13 Part m

Now that the base plot is complete, theme elements are required for best practice customisation.

Create a new object called “af_theme” and use the ggplot2 theme documentation to help you modify:

  • The title text to bold, size 16.
  • The subtitle text to size 14.
  • The axis title to size 12.
  • The axis text to black, size 10.
  • The axis ticks (x and y) to the colour #D9D9D9.

For the titles you use the plot.<…> family and axis.<…> family for axes elements.

# Start the custom theme

af_theme <- theme(
    plot.title = element_text(family = "Arial",
                              size = 16,
                              face = "bold"),
    plot.subtitle = element_text(family = "Arial",
                                 size = 14),
    axis.title = element_text(family = "Arial",
                              size = 12),
    axis.text = element_text(family = "Arial",
                             size = 10,
                             colour = "black"),
    # Axis ticks are lines, so they take element_line()
    axis.ticks.x = element_line(colour = "#D9D9D9"),
    axis.ticks.y = element_line(colour = "#D9D9D9")
)

7.14 Part n

Let’s consider gridlines.

  • Guidance notes that you need one set of gridlines dependent on the orientation of the plot. Here it is vertically oriented, so y gridlines are preferred.

  • Our reproducible theme object should keep both sets of major gridlines on, so that either set can be turned off manually as needed.

Tweak your custom theme to:

  • Remove the background panel.
  • Add major gridlines with the colour “#D9D9D9”.
  • Remove the legend background, title and key.
  • Remove the underlying x axis line.
  • Remove the underlying y axis line.

Use your theme on the Scatter Plot and observe the results.

  • For backgrounds and gridlines, use the panel.<…> family.
  • For axes lines, use the axis.<…> family.
  • For legends, use the legend.<…> family.

# Add to the custom theme

af_theme <- theme(
    plot.title = element_text(family = "Arial",
                              size = 16,
                              face = "bold"),
    plot.subtitle = element_text(family = "Arial",
                                 size = 14),
    axis.title = element_text(family = "Arial",
                              size = 12),
    axis.text = element_text(family = "Arial",
                             size = 10,
                             colour = "black"),
    axis.ticks = element_line(colour = "#D9D9D9"),
    panel.background = element_blank(),
    panel.grid.major = element_line(colour = "#D9D9D9"),
    legend.key = element_blank(),
    legend.title = element_blank(),
    legend.background = element_blank(),
    axis.line.y = element_blank(),
    axis.line.x = element_blank()
    
)

Now to use on the scatter plot:

# Add new theme to the plot for accessibility tweaks

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air,
                             shape = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air) +
    scale_x_continuous(breaks = seq(from = 0,
                                  to = 8e+05,
                                  by = 2e+05), 
                       limits = c(0, 8e+05),
                       labels = scales::comma_format()) + 
    scale_y_continuous(breaks = seq(from = 0,
                                    to = 6000,
                                    by = 750),
                       limits = c(0, 6000)) + 
    labs(title = "Price vs. above ground living area split by central air.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale price ($)",
         y = "Living Area (sq ft)",
         colour = "Central Air",
         shape = "Central Air") +
    af_theme
`geom_smooth()` using formula = 'y ~ x'

7.15 Part o

The final tweaks to make are manual:

  • Turn off the x axis grid lines.
  • Tweak the placement of the legend.

Guidance states that the legend should be within the body of the plot.

Within a manual theme layer:

  • Remove the major x axis gridlines.

  • Specify x and y co-ordinates for the legend placement.

Why shouldn’t these be placed within the af_theme object?

Use the panel.<…> family as well as the legend.<…> family.

Think of reproducibility, we want to use af_theme again.

# Manually remove the x gridlines and position the legend

ames |> 
    ggplot(mapping = aes(x = sale_price,
                         y = gr_liv_area)) +
    geom_point(mapping = aes(colour = central_air,
                             shape = central_air)) +
    geom_smooth(method = "lm") + 
    scale_colour_manual(values = c("N" = "#12436D",
                                   "Y" = "#F46A25")) +
    facet_wrap(~central_air) +
    scale_x_continuous(breaks = seq(from = 0,
                                  to = 8e+05,
                                  by = 2e+05), 
                       limits = c(0, 8e+05),
                       labels = scales::comma_format()) + 
    scale_y_continuous(breaks = seq(from = 0,
                                    to = 6000,
                                    by = 750),
                       limits = c(0, 6000)) + 
    labs(title = "Price vs. above ground living area split by central air.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale price ($)",
         y = "Living Area (sq ft)",
         colour = "Central Air",
         shape = "Central Air") +
    theme(panel.grid.major.x = element_blank(),
          # From ggplot2 3.5.0, numeric positions go in legend.position.inside
          legend.position = "inside",
          legend.position.inside = c(0.94, 0.4)) +
    af_theme
`geom_smooth()` using formula = 'y ~ x'

The legend position and gridline tweaks should be changed manually as they will differ dependent on the orientation of the plot and position of the points.
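One way to picture this separation is to keep the reusable theme in one object and add plot-specific tweaks on top of it for each plot. The sketch below uses a cut-down stand-in for af_theme (base_theme, vertical_tweaks and horizontal_tweaks are hypothetical names) to show that theme objects can be combined with +, so the shared object never needs to change.

```r
library(ggplot2)

# Cut-down stand-in for the reusable af_theme object
base_theme <- theme(
    panel.background = element_blank(),
    panel.grid.major = element_line(colour = "#D9D9D9")
)

# Plot-specific tweak for a vertically oriented plot: drop x gridlines
vertical_tweaks <- theme(panel.grid.major.x = element_blank())

# Plot-specific tweak for a horizontally oriented plot: drop y gridlines
horizontal_tweaks <- theme(panel.grid.major.y = element_blank())

# Each plot combines the shared theme with its own tweaks
vertical_theme <- base_theme + vertical_tweaks
horizontal_theme <- base_theme + horizontal_tweaks
```

Either combined theme can then be added to a plot in the usual way, while base_theme stays unchanged for the next plot.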

8 Question 2 - Exploring continuous and categorical variables

Here you will perform the first step of an Exploratory Data Analysis (EDA) process, which is the initial investigation of our data as a whole. You will learn more about this process in Module 3 - Statistics.

For this question the focus is on:

  • Creating other geometric objects.
  • Formatting them and considering individual quirks.
  • Adhering to best practice guidelines.

8.1 Part a

Let’s start with a categorical variable that would likely contribute to house prices, such as garage type. This has 7 levels:

  • 2Types - More than one type of garage.
  • Attchd - Attached to the home.
  • Basment - Basement garage.
  • BuiltIn - Built-in garage.
  • CarPort - Car port garage.
  • Detchd - Detached garage.
  • NA - No garage.

  1. Count the number of properties for each garage type.

  2. Count the number of properties per garage type by the foundation material used to build the property, which has six levels:

  • BrkTil - Brick and Tile
  • CBlock - Cinder Block
  • Poured Concrete
  • Slab - Slab
  • Stone - Stone
  • Wood - Wood

You can solve both of these problems with count().

# Number of properties per garage type

ames |> 
    count(garage_type)
# A tibble: 7 × 2
  garage_type     n
  <chr>       <int>
1 2Types         23
2 Attchd       1731
3 Basment        36
4 BuiltIn       186
5 CarPort        15
6 Detchd        782
7 <NA>          157
# Number of properties per garage type by foundation material

ames |> 
    count(garage_type, foundation)
# A tibble: 31 × 3
   garage_type foundation     n
   <chr>       <chr>      <int>
 1 2Types      BrkTil         4
 2 2Types      CBlock        15
 3 2Types      PConc          3
 4 2Types      Slab           1
 5 Attchd      BrkTil        28
 6 Attchd      CBlock       730
 7 Attchd      PConc        947
 8 Attchd      Slab          20
 9 Attchd      Stone          3
10 Attchd      Wood           3
# ℹ 21 more rows

8.2 Part b

It is important to have summary tables like those in part a for accessibility purposes (if you can’t provide alternative text, then providing the data used is the bare minimum).

It would be easier to visualise the results in a bar chart, where initial conclusions can be drawn.

Create a bar chart to count the number of properties per garage type, ensuring to:

  • Colour with the best practice blue “#12436D”
  • Rotate the plot 90 degrees to the right.

You need to use the bar geometry and flip the coordinates.

# Horizontal bar chart of garage types

ames |> 
    ggplot(mapping = aes(x = garage_type)) +
    geom_bar(fill = "#12436D") +
    coord_flip()

8.3 Part c

It would be easier to observe the results if you ordered the bars.

  1. Use a combination of count(), reorder() and geom_bar’s “stat” parameter to order the bars in descending order.

  2. Comment on what the plot shows you.

  3. What is strange about the result?

To accomplish this, start by using count() to determine n for each garage type.

After computing n, use reorder() within aes() to order garage_type by n. This, in turn, means that a y mapping is also needed.

Identities come into play with the stat parameter.

# Order bars by count in descending order

ames |> 
    count(garage_type) |> 
    ggplot(mapping = aes(x = reorder(garage_type, n),
                         y = n)) +
    geom_bar(fill = "#12436D",
             stat = "identity") +
    coord_flip()

  2. It seems that:
  • Attached garages are the most common amongst the properties, with detached garages the second most common.

  • Interestingly, even properties with built-in garages are more common than those without garages, which could be a reason for the large spread in property prices seen in Question 1.

  3. The missing values (which refer to properties without garages) are not included in the ordering. This is default behaviour: reorder() does not treat NA as a level, so ggplot2 places the missing group at the end of the axis instead of ranking it by its count.
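
One way to see why the missing values escape the ordering is to inspect the factor that reorder() builds. The sketch below uses only base R and a small made-up vector (the names garage and n are illustrative, not the ames data):

```r
# Toy data: three garage types, one of them missing (hypothetical values)
garage <- c("Attchd", "Detchd", NA)
n <- c(3, 2, 1)

# reorder() builds a factor whose levels are sorted by n
ordered_garage <- reorder(garage, n)

# NA is not a factor level, so it takes no part in the ordering
levels(ordered_garage)

# The missing entry is still present as a value, just unordered
is.na(ordered_garage)
```

Because NA never becomes a level, the missing group cannot be ranked by its count until it is recoded as an ordinary string.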

8.4 Part d

Let’s consider this missing value issue. The missing values are an intended level of the garage type variable, after all.

Modify your code from part c to:

  • Recode the NAs without permanently changing them.

  • Hence produce the properly ordered bar chart.

You will need to use the replace_na() function for this.

# Order bars including the "no garage" properties.
ames |> 
    replace_na(replace = 
                   list(garage_type = "NA")) |> # Not permanently altering them
    count(garage_type) |> 
    ggplot(mapping = aes(x = reorder(garage_type, n),
                         y = n)) +
    geom_bar(fill = "#12436D",
             stat = "identity") +
    coord_flip()
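
To check that the recoding is not permanent, here is a minimal sketch on a made-up data frame (toy is a hypothetical stand-in for ames). It assumes tidyr is installed, which it is as part of the tidyverse:

```r
library(tidyr)

# Hypothetical stand-in for the ames data, with one missing garage type
toy <- data.frame(garage_type = c("Attchd", NA, "Detchd"))

# replace_na() returns a modified copy; nothing is assigned back to toy
recoded <- replace_na(toy, replace = list(garage_type = "NA"))

anyNA(toy$garage_type)      # the original still contains the missing value
anyNA(recoded$garage_type)  # the copy has the string "NA" instead
```

As long as the result is piped onwards rather than assigned back to the original name, the source data keeps its true missing values.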

8.5 Part e

Let’s ensure the chart adheres to best practice guidelines.

  1. Modify the y axis ticks to produce more gridlines.

  2. Apply af_theme as a layer.

  3. Manually remove the y axis gridlines and horizontally adjust the axis title by 1.

  4. Add a title, subtitle and axes labels to the plot. Consider whether they are all useful.

  1. Use the continuous y scaler.

  2. The plus sign will come in handy.

  3. Use the panel.<..> and axis.<..> families for this.

  4. Use the labels layer. Look at the x label to determine usefulness.

Let’s break the axes a little more to have more than a few gridlines. Increasing in increments of 250 would allow this.

# Customize plot to best practice guidelines

ames |> 
    replace_na(replace = 
                   list(garage_type = "NA")) |>
    count(garage_type) |> 
    ggplot(mapping = aes(x = reorder(garage_type, n),
                         y = n)) +
    geom_bar(fill = "#12436D",
             stat = "identity") +
    scale_y_continuous(breaks = seq(from = 0,
                                  to = 1750,
                                  by = 250),
                       limits = c(0, 1750)) +
    coord_flip() +
    af_theme +
    theme(panel.grid.major.y = element_blank(),
          axis.title = element_text(hjust = 1)) +
    labs(title = "Number of properties per garage type.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "",
         y = "Number of properties")

Setting the x axis label to an empty string removes the vertical axis title, which solves the accessibility issue that text in plots should be horizontal.

8.6 Part f

You now have a great plot at the surface level, but what if you want to break this down by the material used in its foundation?

This lets you see if the attached and detached garage properties were also produced with high quality materials.

  1. Count the number of garage types per foundation material.

  2. Create a stacked bar chart using reorder(), geom_col() and its “fill” aesthetic.

  3. Tweak the title in line with this new plot.

  4. What does this plot show you?

  1. Two inputs to count this time.

  2. geom_col() requires a y aesthetic and you should map foundation to its fill argument.

  3. Say what you see. Comment on highs and lows, what could be explored further from here?

# Number of properties per garage type and foundation material

ames |> 
    count(garage_type, foundation)
# A tibble: 31 × 3
   garage_type foundation     n
   <chr>       <chr>      <int>
 1 2Types      BrkTil         4
 2 2Types      CBlock        15
 3 2Types      PConc          3
 4 2Types      Slab           1
 5 Attchd      BrkTil        28
 6 Attchd      CBlock       730
 7 Attchd      PConc        947
 8 Attchd      Slab          20
 9 Attchd      Stone          3
10 Attchd      Wood           3
# ℹ 21 more rows
# Partially ordered stacked bar chart

ames |> 
    replace_na(replace = 
                   list(garage_type = "NA")) |> # Optional
    count(garage_type, foundation) |> 
    ggplot(mapping = aes(x = reorder(garage_type, n),
                         y = n,
                         fill = foundation)) +
    geom_col(position = position_stack(reverse = TRUE)) +
    scale_y_continuous(breaks = seq(from = 0,
                                  to = 1750,
                                  by = 250),
                       limits = c(0, 1750)) +
    coord_flip() +
    af_theme +
    theme(panel.grid.major.y = element_blank(),
          axis.title = element_text(hjust = 1)) +
    labs(title = "Number of properties per garage type by foundation material.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "",
         y = "Number of properties")

It seems that:

  • Properties in Ames with an attached or built-in garage are commonly built on cinder block and poured concrete foundations, each being more expensive than brick and tile.

  • The split is less even for detached garages (which likely require complex building work), with more properties built on cinder block as well as brick and tile, which is a less expensive but reliable material.

  • It would be useful to observe how these relationships relate to price via the scatter plot made in Question 1.
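
The stacked chart is only partially ordered because reorder() summarises n with the mean by default, and each garage type now has several rows. A minimal sketch of one possible fix, assuming you want bars ranked by their total height, passes FUN = sum. The counts below are made up for illustration:

```r
# Hypothetical counts: several foundation rows per garage type
counts <- data.frame(
  garage_type = c("A", "A", "B", "C", "C", "C"),
  n           = c(10, 10, 15, 1, 1, 1)
)

# Default: levels ordered by the mean of n within each garage type
mean_order <- levels(reorder(counts$garage_type, counts$n))
mean_order

# Alternative: levels ordered by the total n, matching stacked bar height
sum_order <- levels(reorder(counts$garage_type, counts$n, FUN = sum))
sum_order
```

Inside aes() this would read x = reorder(garage_type, n, FUN = sum).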

8.7 Part g

Let’s consider if alternative plot types would work here.

  1. Would a grouped bar chart be appropriate instead of the stacked plot in Part f?

  2. Is there another bar chart you could consider in place of part f? Would there be any difficulties associated with this?

  1. Check out chapter 4.3.4 for examples of this type.

  2. Check out chapter 4.3.3 for examples of a possible type.

  1. A grouped bar chart would not be applicable here as it would result in an incredibly busy plot. At worst, there would be 42 bars, which is not fit for this purpose.

  2. A segmented bar chart could be considered with proportions to display the differences in foundation material across the garage types. This would make the reordering redundant (all bars sum to 1) which reduces code, but does require the computation of sums and proportions.
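
The sums and proportions mentioned above could be computed along these lines. This is a sketch with made-up counts (not taken from the ames data), assuming dplyr is loaded:

```r
library(dplyr)

# Hypothetical counts per garage type and foundation material
counts <- data.frame(
  garage_type = c("Attchd", "Attchd", "Detchd", "Detchd"),
  foundation  = c("CBlock", "PConc", "CBlock", "BrkTil"),
  n           = c(30, 70, 10, 40)
)

# Proportion of each foundation material within its garage type
props <- counts |>
  group_by(garage_type) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

props
```

Mapping prop to y in geom_col() then gives bars that each sum to 1. Alternatively, geom_col(position = "fill") applied to the raw counts lets ggplot2 do the rescaling itself.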

8.8 Part h

Let’s look at a continuous variable this time, the best being the sale price itself.

A great place to start with any dataset is to investigate the distribution of its continuous variables.

These allow us to confirm whether the data is skewed due to outliers (values that lie unusually far from the rest of the data) and thus inform cleaning steps.

  1. Produce a histogram to observe the distribution of the sale price variable.

  2. What are your initial thoughts upon observing this plot?

You need to use the histogram geom to accomplish this.

# Histogram to show distribution of sale_price

ames |> 
    ggplot(mapping = aes(x = sale_price)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This shows a peaked distribution that has a single mode (most frequently occurring point), but one that is heavily right skewed due to incredibly large sale prices at the tail end of the distribution.

The next step would be to investigate whether transforming this variable would bring it closer to a Normal (Gaussian) distribution.
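
As a rough illustration of that next step, the sketch below compares a simple skew indicator before and after a log transformation. The prices are simulated (an assumption standing in for sale_price), and mean_median_gap() is a hypothetical helper, not a function from any package:

```r
set.seed(42)

# Simulated right-skewed prices (an assumption standing in for sale_price)
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Rough skew indicator: how far the mean sits above the median, in SD units
mean_median_gap <- function(x) (mean(x) - median(x)) / sd(x)

gap_raw <- mean_median_gap(prices)         # positive for a right skew
gap_log <- mean_median_gap(log10(prices))  # closer to zero after transforming

c(raw = gap_raw, log10 = gap_log)
```

In a plot, scale_x_log10() applies the same transformation to the axis, so the histogram can be inspected without altering the data.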

8.9 Part i

Let’s play around with different bin sizes.

Create three plots:

  • One with the default bins.
  • One with 20 bins.
  • One with 40 bins.

Use patchwork to lay them out as one larger plot on the left, and two smaller plots vertically on the right.

Does changing the number of bins impact our ability to discuss the distribution of this specific variable?

You’ll need to create three variables first to combine with patchwork.

You will need to use “|” to split the plots into columns, and “/” to layer them vertically.

# Changing bin sizes

first_histogram <- ames |> 
    ggplot(mapping = aes(x = sale_price)) +
    geom_histogram() # Default bins set to 30

second_histogram <- ames |> 
    ggplot(mapping = aes(x = sale_price)) +
    geom_histogram(bins = 20) # Fewer bins

third_histogram <- ames |> 
    ggplot(mapping = aes(x = sale_price)) +
    geom_histogram(bins = 40) # More bins

# Lay out the plots with patchwork

first_histogram | (second_histogram / third_histogram)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • In general, there is a good sense of peaks and possible outliers in the distribution with the standard 30 bins approach.

  • With a decreased number of bins, the plot smooths over the actual shape and you lose the ability to pinpoint smaller peaks. At the same time, it is easier to visualise the line you could draw to identify the distribution (the density).

  • With an increased number of bins, it becomes harder to identify the overall shape as more breaks are introduced and the plot becomes more abstract as a result.

Going forward, the 30 bins approach seems the most sensible, as more information is not gained with the alternative approaches.

8.10 Part j

Let’s apply some best practice tweaks to your histogram.

  1. Keep the number of bins at 30, set the bar outline to white, the bar fill to “#12436D” and transparency to 0.8.

  2. Modify the continuous y scale to produce more gridlines.

  3. Modify the continuous x scale to include an upper limit and change the scientific notation to commas.

  4. Set appropriate labels for the plot.

  5. Add af_theme and turn off the x axis gridlines.

  1. These are not aesthetics.

  2. seq() will come in handy.

  3. You did this in question 1.

  4. Use a labels layer.

  5. You need the panel.<..> family for this.

Let’s go with increments of 100 on the y-axis to ensure there are 7 gridlines, as opposed to 14 with the 50 increment approach.

# Best practice histogram

ames |> 
    ggplot(mapping = aes(x = sale_price)) +
    geom_histogram(bins = 30,
                   colour = "white",
                   fill = "#12436D",
                   alpha = 0.8) +
    scale_x_continuous(breaks = seq(from = 0e+00,
                                    to = 8e+05,
                                    by = 2e+05),
                       limits = c(0e+00, 8e+05),
                       labels = scales::comma_format()) +
    scale_y_continuous(breaks = seq(from = 0,
                                    to = 700,
                                    by = 100)) +
    labs(title = "Sale price distribution.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Sale Price ($)",
         y = "Frequency of properties") +
    af_theme +
    theme(panel.grid.major.x = element_blank()) # Turn off x gridlines
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

It is worth noting here that we received a warning about removed rows. This is because of the “limits” parameter we used in scale_x_continuous().

In essence, anything that does not fall within the scale limits is treated as missing and removed. For a histogram this is most likely the two outermost bars, whose edges extend past the limits, rather than missing values in sale_price itself. As such, it is a good idea to check for missing or out-of-range values prior to visualisation.
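
A small sketch of what scale limits do to values that fall outside them. It relies on scales::censor(), which is the default out-of-bounds handler for ggplot2's continuous scales. The prices are made up:

```r
library(scales)

# Hypothetical prices; the last one exceeds the upper limit of 800,000
prices <- c(50000, 250000, 790000, 810000)

# Values outside the range are replaced with NA, which ggplot2 then drops
censored <- censor(prices, range = c(0, 8e5))
censored

# Number of values that would be removed with a warning
sum(is.na(censored))
```

If the aim is only to zoom in, coord_cartesian(xlim = c(0, 8e5)) changes the visible window without dropping anything.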

8.11 Part k

Histograms tend to be used for single continuous variables. What if you intended to compare the distribution across the groups of a categorical variable?

Let’s return to the thought process from earlier on the price spread of specific garage types.

What two methods could be used to observe the sale price distribution across the different garage types?

Consider wrapping, and present boxes.

The first method (and likely the best one) is to create a boxplot, with the sale price distribution split between garage types. This gives summary statistics at a glance, namely the median and quartiles, as well as explicit markers where outliers are likely to be.

The second method (which would likely overcomplicate things) is to create a histogram as normal, and then use facet_wrap() to produce subplots, one for each garage type. Due to the wide nature of histograms, they would likely need to be rotated in the output to ensure they fit in the plot pane.
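
For comparison, the second method might look something like the sketch below. It uses simulated prices for three garage types (the toy data frame is hypothetical, not the ames data) and stacks the subplots in a single column so the distributions line up:

```r
library(ggplot2)

set.seed(1)

# Simulated prices for three garage types (hypothetical, not the ames data)
toy <- data.frame(
  garage_type = rep(c("Attchd", "BuiltIn", "Detchd"), each = 100),
  sale_price  = c(rlnorm(100, 12.2, 0.3),
                  rlnorm(100, 12.4, 0.3),
                  rlnorm(100, 11.8, 0.3))
)

# One histogram per garage type, stacked in a single column for comparison
faceted_histograms <- ggplot(toy, mapping = aes(x = sale_price)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ garage_type, ncol = 1)

faceted_histograms
```

With seven garage types rather than three, the column of subplots grows quickly, which is part of why the boxplot is the more compact choice.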

8.12 Part l

Let’s see how the distribution of sale price differs by garage type.

  1. Create a boxplot to compare the distribution of the sale price across the different garage types.

  2. Change the outlier marks to blue crosses of size 5.

  3. Ensure the width of each box reflects the sample size.

  4. Comment on the results.

  1. You’ll need the boxplot geom and an x and y aesthetic set for this.

  2. You need to use the outlier.<..> family within the boxplot geom.

  3. There’s a special argument “varwidth” for this.

  4. Are there outlier points? Is one garage type more common among the higher/lower priced properties?

# First boxplot

ames |> 
    ggplot(mapping = aes(x = garage_type, 
                         y = sale_price)) +
    geom_boxplot(outlier.colour = "blue",
                 outlier.shape = 4,
                 outlier.size = 5,
                 varwidth = TRUE)

Outliers are clearly very common here, and when considering the most frequent garage types:

  • For properties with attached garages, there is a massive number of highly priced properties that create a prominent right skew.

  • Interestingly, for properties with detached garages this right skew is still the overarching trend, but less so due to some incredibly low priced properties with the feature.

It would be a mistake to remove all outliers here, as this would result in a loss of a large proportion of the dataset. A better approach could be that of transformation, to scale the data appropriately.
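
To make the outlier markers less mysterious: a boxplot conventionally flags points lying more than 1.5 times the interquartile range beyond the box. A minimal base R sketch with made-up prices, using boxplot.stats() (whose hinges can differ slightly from the quartiles that geom_boxplot() computes):

```r
# Hypothetical sale prices with one very expensive property
prices <- c(100, 120, 130, 140, 150, 160, 900) * 1000

stats <- boxplot.stats(prices)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points beyond 1.5 * IQR from the hinges, drawn as outliers
```

Points flagged this way are not necessarily errors, which is one reason transforming the variable can be preferable to deleting them.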

As an aside:

Car ports (freestanding roofed shelters for cars, much less expensive to construct) are very uncommon across the dataset, which makes sense with many new builds.

# Filter to garage type = CarPort

ames |> 
    filter(garage_type == "CarPort")
# A tibble: 15 × 81
   order ms_sub_class ms_zoning lot_frontage lot_area street alley lot_shape
   <dbl> <chr>        <chr>            <dbl>    <dbl> <chr>  <chr> <chr>    
 1   206 090          RM                 110     8472 Grvl   <NA>  IR2      
 2   553 020          RH                  60     8400 Pave   <NA>  Reg      
 3   647 020          RL                  75    14559 Pave   <NA>  Reg      
 4   651 190          RL                 107    10615 Pave   <NA>  IR1      
 5   762 020          RL                  50     4280 Pave   <NA>  IR1      
 6  1265 020          RL                  NA    12929 Pave   <NA>  IR1      
 7  1386 090          RL                  60    12900 Pave   <NA>  Reg      
 8  1387 190          RL                  60    12900 Pave   <NA>  Reg      
 9  1880 090          RL                  NA    10899 Pave   <NA>  IR1      
10  1899 090          RL                  70     9842 Pave   <NA>  Reg      
11  1916 060          RL                  NA    18450 Pave   <NA>  IR1      
12  2707 090          RL                  72    10791 Pave   <NA>  Reg      
13  2879 050          C (all)             60    11040 Pave   <NA>  Reg      
14  2918 160          RM                  21     1533 Pave   <NA>  Reg      
15  2921 160          RM                  21     1894 Pave   <NA>  Reg      
# ℹ 73 more variables: land_contour <chr>, utilities <chr>, lot_config <chr>,
#   land_slope <chr>, neighborhood <chr>, condition_1 <chr>, condition_2 <chr>,
#   bldg_type <chr>, house_style <chr>, overall_qual <dbl>, overall_cond <dbl>,
#   year_built <dbl>, year_remod_add <dbl>, roof_style <chr>, roof_matl <chr>,
#   exterior_1st <chr>, exterior_2nd <chr>, mas_vnr_type <chr>,
#   mas_vnr_area <dbl>, exter_qual <chr>, exter_cond <chr>, foundation <chr>,
#   bsmt_qual <chr>, bsmt_cond <chr>, bsmt_exposure <chr>, …

There are only 15 properties with car ports, which should be considered if drawing conclusions on this particular set of properties, as it is more difficult to generalise.

8.13 Part m

Let’s tweak the plot in line with accessibility guidelines.

Apply the following tweaks to your boxplot:

  1. Change the outlier markers to filled black circles of size 2.5.

  2. Use all of the colours from the Analysis Function guidance to fill the boxplot.

  3. Modify the y axis to remove scientific notation.

  4. Add appropriate labels to the plot.

  5. Add the af_theme object as a layer, turn off the legend that results from step 2 and remove x axis gridlines.

  2. You need to colour by the same variable creating the boxes in the first place. A scaling layer is needed for the chosen colours.

  3. You’ve done this a few times, no shame in copy and pasting. However, you do need some more gridlines this time.

  4. Use a labels layer.

  5. Use the legend.<..> family and panel.<..> family in a separate theme layer.

# Publication quality boxplot

ames |> 
    ggplot(mapping = aes(x = garage_type, 
                         y = sale_price,
                         fill = garage_type)) +
    geom_boxplot(outlier.size = 2.5,
                 varwidth = TRUE) +
    scale_fill_manual(values = c("#12436D",
                                 "#28A197",
                                 "#801650",
                                 "#F46A25",
                                 "#3D3D3D",
                                 "#A285D1")) +
    scale_y_continuous(breaks = seq(from = 0e+00,
                                    to = 8e+05,
                                    by = 1e+05),
                       limits = c(0e+00, 8e+05),
                       labels = scales::comma_format()) +
    labs(title = "Sale price distribution by garage type.",
         subtitle = "Data from the Ames housing project (2006 - 2010).",
         x = "Garage Type",
         y = "Sale Price ($)") +
    af_theme +
    theme(legend.position = "none",
          panel.grid.major.x = element_blank())

9 Summary

Amazing job working your way through the exercises. Hopefully this allowed you to get into the rhythm of using ggplot2 to create visualisations, and stressed the importance of accessibility.

You have a few options from here:

  • Investigate the optional reference material.
  • Read deeper into the topic with Hadley Wickham’s R for Data Science 2nd Edition chapters 1 and 9.
  • Proceed directly to the summary chapter 7 for a wider view of next steps.

Reuse

Open Government Licence 3.0