# Import necessary packages
library(tidyverse) # Data manipulation & ggplot2
library(janitor) # Cleaning column names
library(scales) # Automatic axes labels
library(patchwork) # Combine plots side-by-side
library(showtext) # Streamlined font rendering
Data Visualisation Case Study
1 Purpose of Chapter
Chapters 1-4 covered the following:
- Grammar of graphics in ggplot2
- Analysis Function guidelines for publication quality plots
- Plotting geometries for various variable types.
What we have yet to do is provide a structured exercise that allows you to practice the techniques covered throughout the course.
Note that some solutions are adapted from Hadley Wickham's R for Data Science (2nd edition), chapters 1 and 9.
2 Pre-requisite Information
This case study requires the use of techniques covered throughout chapters 1-4 of this Data Visualisation course. You should be comfortable with:
- Creating a plot with ggplot and aesthetic mappings.
- Adding features such as colour and labels.
- Mapping features to variables such as colour and shape.
- Setting theme elements.
- Grouping plots with facets.
- Using showtext to render fonts.
- Creating scatter plots, histograms, bar charts and boxplots.
3 Course Format
This chapter will present questions as tabsets:
- Tab 1 will contain the question itself.
- Tab 2 will contain a hint for the question.
- Tab 3 will contain the solution with code.
3.1 Example
This is an example question.
This is a hint.
Insert code here.
Throughout, we ask you to:
- Reflect on the plots created throughout.
- Consider alternative plots you would make when approaching the case study.
Data Visualisation is considered both an art and a science, so there are always numerous viewpoints on how best to display information.
4 Packages
Import the following packages:
- tidyverse
- janitor
- scales
- patchwork
- showtext
Where do you go to take out books?
5 Fonts
Add a sans-serif font such as:
- Arial
- Tahoma
- Helvetica
Ensure it is used automatically throughout.
Solutions will use Arial font.
You add fonts using name and path.
Showtext functions decide how to render.
# Set the Arial font for the exercises
font_add(family = "Arial", regular = "C:\\Windows\\Fonts\\Arial.ttf")
showtext_auto()
6 Ames Housing Data
Throughout this chapter you will be investigating the Ames Housing Dataset, a popular choice for visualisation and machine learning.
7 Question 1 - A relationship with sale price
The main variable of interest in the Ames housing dataset is the sale price of a house. You are interested in:
- Its relationships with other variables
- Its distribution and how this varies across categorical variables.
In this question you will:
- Build a complex scatterplot from the ground up.
- Use facets to create less busy subplots.
- Add accessibility tweaks and best practice elements.
7.1 Part a
Let’s see the data first:
Read in the Ames housing dataset and clean its column names with janitor.
Observe the data.
There is a function for this, clean_<..>
Take a glimpse.
# Read in the Ames housing data and clean its column names
ames <- read_csv("./data/ames.csv") |>
  clean_names()
Rows: 2930 Columns: 81
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (44): MS SubClass, MS Zoning, Street, Alley, Lot Shape, Land Contour, Ut...
dbl (37): Order, Lot Frontage, Lot Area, Overall Qual, Overall Cond, Year Bu...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ames |>
  glimpse()
Rows: 2,930
Columns: 81
$ order <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ ms_sub_class <chr> "020", "020", "020", "020", "060", "060", "120", "120"…
$ ms_zoning <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL", "RL", …
$ lot_frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, NA, 63, 8…
$ lot_area <dbl> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5…
$ street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"…
$ alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ lot_shape <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg", "IR1"…
$ land_contour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "HLS"…
$ utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "All…
$ lot_config <chr> "Corner", "Inside", "Corner", "Corner", "Inside", "Ins…
$ land_slope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl"…
$ neighborhood <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert", "Gilber…
$ condition_1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm…
$ condition_2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"…
$ bldg_type <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "Twnhs…
$ house_style <chr> "1Story", "1Story", "1Story", "1Story", "2Story", "2St…
$ overall_qual <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, …
$ overall_cond <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, …
$ year_built <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ year_remod_add <dbl> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996, …
$ roof_style <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable", "Gable…
$ roof_matl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg",…
$ exterior_1st <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
$ exterior_2nd <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
$ mas_vnr_type <chr> "Stone", "None", "BrkFace", "None", "None", "BrkFace",…
$ mas_vnr_area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603,…
$ exter_qual <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd", "Gd", …
$ exter_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ foundation <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc", "PCon…
$ bsmt_qual <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd", "Gd", …
$ bsmt_cond <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ bsmt_exposure <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No", "No", …
$ bsmt_fin_type_1 <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ"…
$ bsmt_fin_sf_1 <dbl> 639, 468, 923, 1065, 791, 602, 616, 263, 1180, 0, 0, 9…
$ bsmt_fin_type_2 <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf"…
$ bsmt_fin_sf_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0…
$ bsmt_unf_sf <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 76…
$ total_bsmt_sf <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994…
$ heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"…
$ heating_qc <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", …
$ central_air <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", …
$ x1st_flr_sf <dbl> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 102…
$ x2nd_flr_sf <dbl> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0,…
$ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gr_liv_area <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1…
$ bsmt_full_bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
$ bsmt_half_bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ full_bath <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1, …
$ half_bath <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, …
$ bedroom_abv_gr <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, …
$ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ kitchen_qual <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd", "Gd", …
$ tot_rms_abv_grd <dbl> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8, 8,…
$ functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ"…
$ fireplaces <dbl> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, …
$ fireplace_qu <chr> "Gd", NA, NA, "TA", "TA", "Gd", NA, NA, "TA", "TA", "T…
$ garage_type <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "Att…
$ garage_yr_blt <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ garage_finish <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin", "RFn"…
$ garage_cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
$ garage_area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440,…
$ garage_qual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ garage_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ paved_drive <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ wood_deck_sf <dbl> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, …
$ open_porch_sf <dbl> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 5…
$ enclosed_porch <dbl> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ x3ssn_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ screen_porch <dbl> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, 210…
$ pool_area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pool_qc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fence <chr> NA, "MnPrv", NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, …
$ misc_feature <chr> NA, NA, "Gar2", NA, NA, NA, NA, NA, NA, NA, NA, "Shed"…
$ misc_val <dbl> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, …
$ mo_sold <dbl> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, 6, …
$ yr_sold <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
$ sale_type <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", …
$ sale_condition <chr> "Normal", "Normal", "Normal", "Normal", "Normal", "Nor…
$ sale_price <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500…
The Data Dictionary is in this American Statistical Association text file for clarity.
7.2 Part b
Let’s start thinking about what would likely impact the housing price in a residential area.
For example, the area of the property. The above ground living area would be a good start.
Create a scatter plot using sale price as the x axis and above ground living area (gr_liv_area) as the y axis.
What are your initial impressions from the plot?
Remember your aes() mappings and that each layer added to a ggplot uses the + operator, not the pipe.
The Scatter plot geom is not called scatter.
# Baseline scatter plot
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point()
There is a semblance of a positive linear relationship, in that as above ground living area increases, so too does the sale price.
There is an incredibly dense grouping in the low-area, low-price region, so these are likely the most common properties in the area.
This is hardly a shocking revelation, but we should never assume this relationship; the data should show it.
7.3 Part c
Let’s add another dimension to the plot by playing with the colour aesthetic.
Create two plots and display them side by side:
Firstly, display the points in the colour “blue”.
Secondly, map the central air conditioning variable to the colour aesthetic.
Comment on the results.
You will need to use patchwork and mathematical operators to display plots in this way.
Local layer colour is needed.
You can do this in the global layer.
# Displaying colour variations side by side
single_colour_scatter <- ames |>
  ggplot(mapping = aes(x = sale_price,
                       y = gr_liv_area)) +
  geom_point(colour = "blue")
grouped_colour_scatter <- ames |>
  ggplot(mapping = aes(x = sale_price,
                       y = gr_liv_area,
                       colour = central_air)) +
  geom_point()
# patchwork's + operator places the two plots side by side
single_colour_scatter + grouped_colour_scatter
Despite the overlay of colours, it seems that the properties without central air have lower sale prices.
Properties with central air are dispersed across the full range of sale prices.
7.4 Part d
You should always consider whether the plot can be made more accessible. Colours are a good place to start.
Start by copying over the code from part c.
Use the Palette table in the Analysis Function Guidance on Colours to add the following colours to your plot via the HEX CODES.
- Dark blue
- Orange
You will need to add a scale layer to the plot to manually add your own colours.
# Tweak colours in line with guidance
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area,
colour = central_air)) +
geom_point() +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25"))
The overlay still makes it difficult to determine one group from another.
7.5 Part e
Let’s consider adding some lines of best fit.
Add a smooth line of best fit (linear) through the points in your scatter plot.
Remember that you need to use the “lm” method for this.
# Add lines of best fit through points
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area,
colour = central_air)) +
geom_point() +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25"))
`geom_smooth()` using formula = 'y ~ x'
This is a very noisy plot, but does show that there may be a difference in how much the overall area of the property impacts the price, based on whether there is central air conditioning or not.
There is a steeper increase when there isn’t, compared to when there is.
7.6 Part f
What if you just want to see the overall trend, as opposed to the trend by group?
Modify the code from part e to produce a single line of best fit through the points.
Comment on the plausibility of a correlation here.
Remember the difference between mapping a variable globally in ggplot(), vs. mapping at the specific geometric layer.
Be specific with your terms: what type of correlation is it, and how strong?
With this example, you want your x and y axes to be set globally as they never change. However, if you group the colour by central_air globally, then every subsequent layer will group in this way.
As such, you need to place colour within the layer that requires it and only that layer.
# Single line of best fit through points
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25"))
`geom_smooth()` using formula = 'y ~ x'
- It is plausible that there is a moderate to strong positive correlation here.
- This implies that the sale price of the house increases with the above ground living area regardless of air conditioning.
Further correlation analysis would be needed to determine the relationship statistically.
7.7 Part g
The plot is very noisy, so it could be beneficial to create subplots by grouping.
Use facets to subset the plot by the central air variable.
To subset by one variable, you want to wrap.
# Facet by central_air
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air)
`geom_smooth()` using formula = 'y ~ x'
As noticed before, there is a much wider distribution of sale prices (as well as a greater number of properties in general) among properties that have central air conditioning.
The properties without air conditioning have a less strong positive relationship than those with it.
7.8 Part h
You could add another dimension to your plot by further grouping by whether the property has a paved driveway.
Expand on part g by producing a plot for each combination of paved_drive and central_air.
You need to produce a 3 * 2 grid of subplots.
# Facet by paved_drive and central_air
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(paved_drive ~ central_air)
`geom_smooth()` using formula = 'y ~ x'
This plot is confusing due to both the central air and paved drive variables having the levels “Y” and “N” within them, making drawing conclusions difficult.
At a glance we see that properties with central air and a paved drive are the most common and spread across the entire sale price range.
We see that any other combination of property features lands in the lower price ranges overall, be they paved, partially paved and so on.
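One possible way to reduce this confusion, offered here as a sketch and not as part of the original solution, is to ask the facets to print the variable name alongside each level. ggplot2's label_both labeller does this. The small data frame below is invented purely for illustration and stands in for the ames data.

```r
library(ggplot2)

# Hypothetical toy data standing in for ames (values are invented)
toy_homes <- data.frame(
  sale_price  = c(100000, 150000, 200000, 250000, 300000, 350000),
  gr_liv_area = c(900, 1100, 1500, 1700, 2100, 2400),
  central_air = c("N", "Y", "N", "Y", "N", "Y"),
  paved_drive = c("N", "N", "P", "P", "Y", "Y")
)

# label_both prints "variable: value" in each facet strip,
# so "Y" for paved_drive can be told apart from "Y" for central_air
labelled_facets <- ggplot(toy_homes,
                          mapping = aes(x = sale_price,
                                        y = gr_liv_area)) +
  geom_point() +
  facet_wrap(paved_drive ~ central_air,
             labeller = label_both)

labelled_facets
```

This does not address the legend only describing one of the two variables, so it is a partial remedy at best.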
7.9 Part i
Part h is useful, but in practice, has some problems:
- The variables that the plots are grouped by have similar labels (N, Y, etc.), which makes them hard to read.
- The other grouped variable isn’t mentioned in the legend or key.
As such, you can disregard it and move forward with the approach in Part g.
Using the plot from part g:
Add a suitable title, subtitle and axes labels to your Scatter Plot.
You need to add a labels layer to your plot.
As long as your labels are appropriate, then any combination of them is fine.
# Adding labels to the plot
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air) +
labs(title = "Price vs. above ground living area split by existence of air conditioning.",
subtitle = "Data from the Ames housing project (2006 - 2010).",
x = "Sale price ($)",
y = "Living Area (sq. ft.)")
`geom_smooth()` using formula = 'y ~ x'
7.10 Part j
You could further distinguish groups in the plot by mapping shapes to the air conditioning groups.
Map shapes to the two air conditioning groups, ensuring they appear in the legend.
Capitalize the legend label, ensuring only one legend exists.
You need to use the shape parameter locally.
You need to add a label using the parameter that led to the creation of the legend in the first place, which is colour.
# Map central_air to shape and tweak legend
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air,
shape = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air) +
labs(title = "Price vs. above ground living area split by whether the property has air conditioning.",
subtitle = "Data from the Ames housing project (2006 - 2010).",
x = "Sale price ($)",
y = "Living Area (sq ft)",
colour = "Central Air",
shape = "Central Air")
`geom_smooth()` using formula = 'y ~ x'
7.11 Part k
Let’s investigate the x axis.
What issues are there with the x axis at present?
Modify the x axis to:
- Include the upper limit of sale price on the axis.
- Remove scientific notation, ensuring the numbers have commas.
- Should there be e’s?
Throughout (ii), you need the continuous x scaler:
- You can use seq() to create a new set of axes breaks.
- You can set the limits with a separate parameter.
- You can use the labels parameter for commas.
- The x axis does start from 0, but has two key problems for best practice:
  - It is currently in scientific notation, where 200,000 is written as 2e+05, which means 2 * 10^5.
  - An upper limit is not included, so points (properties above 600,000 USD) go past the current axis breaks.
Since finding the max price is a separate task, let’s do that first.
# Find max sale price of ames houses
ames |>
  summarise(max_price =
max(sale_price, na.rm = TRUE))
# A tibble: 1 × 1
max_price
<dbl>
1 755000
Let’s think about new axes breaks:
You could rescale the axis entirely in multiples of 50,000 USD, but this would likely crowd the axis with too many breaks.
You could rescale in multiples of 150,000 USD, but that would end up overshooting substantially.
A sensible choice would be to take the existing breaks and expand up to 800,000 USD.
# Use x scaler to modify the axis
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air,
shape = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air) +
scale_x_continuous(breaks = seq(from = 0,
to = 8e+05,
by = 2e+05),
limits = c(0, 8e+05),
labels = scales::comma_format()) +
labs(title = "Price vs. above ground living area split by whether the property has air conditioning.",
subtitle = "Data from the Ames housing project (2006 - 2010).",
x = "Sale price ($)",
y = "Living Area (sq ft)",
colour = "Central Air",
shape = "Central Air")
`geom_smooth()` using formula = 'y ~ x'
First, set up the axes breaks with the seq() function.
However, changing the breaks doesn’t ensure they are marked on the axis. Here, there aren’t any $800,000 properties, so you need to set the limits to ensure it’s marked on the graph.
Lastly, use the comma_format() from the excellent scales package to convert the scientific notation.
Aside: You can prevent repeated tweaking by turning off scientific notation:
- options(scipen = 999)
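As a quick check of what the comma labeller does, it can be applied directly to a few break values outside of a plot. This sketch uses label_comma(), which the scales documentation gives as the newer name for comma_format(); the break values are the ones chosen for the x axis in this part, and the object names are illustrative.

```r
library(scales)

# The x axis breaks used in the solution
price_breaks <- seq(from = 0, to = 8e+05, by = 2e+05)

# label_comma() returns a function that formats numbers with commas
comma_labels <- label_comma()(price_breaks)

comma_labels
# "0" "200,000" "400,000" "600,000" "800,000"
```

Passing labels = scales::label_comma() to scale_x_continuous() should therefore behave in the same way as the comma_format() call in the solution.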
7.12 Part l
Let’s investigate the y axis this time.
What issues are there with the y axis at present?
Modify the y axis to include proper limits.
What about the top and bottom of the axis? Number of breaks?
You will need the continuous y scaler.
At present, there are a few problems:
It is difficult to pinpoint the actual above ground living areas of the lowest and highest priced properties due to dense groupings.
For accessibility purposes you want 8-10 gridlines, which the present y axis will not allow for.
An aside is that the audience should also be considered given that square feet (a non-universal metric) is used.
# Maximum gr_liv_area
ames |>
  summarise(max_area =
max(gr_liv_area, na.rm = TRUE))
# A tibble: 1 × 1
max_area
<dbl>
1 5642
For the gridlines:
You could change the breaks to every 500 square feet, but this would produce many gridlines.
Perhaps a better choice would be to break every 750 square feet, producing 9 gridlines.
You need to set the limits as well, since there are no properties that quite reach 6000 square feet.
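The gridline counts discussed here can be checked with seq() before committing to a scale. This is a small sketch in base R; the object names are illustrative only.

```r
# Candidate y axis breaks from 0 to 6,000 square feet
breaks_500 <- seq(from = 0, to = 6000, by = 500)
breaks_750 <- seq(from = 0, to = 6000, by = 750)

# Number of gridlines each choice would produce
length(breaks_500) # 13
length(breaks_750) # 9
```

Breaking every 750 square feet lands within the 8-10 gridlines suggested for accessibility, whereas every 500 square feet does not.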
# Rescale the y axis
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air,
shape = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air) +
scale_x_continuous(breaks = seq(from = 0,
to = 8e+05,
by = 2e+05),
limits = c(0, 8e+05),
labels = scales::comma_format()) +
scale_y_continuous(breaks = seq(from = 0,
to = 6000,
by = 750),
limits = c(0, 6000)) +
labs(title = "Price vs. above ground living area split by whether the property has air conditioning.",
subtitle = "Data from the Ames housing project (2006 - 2010).",
x = "Sale price ($)",
y = "Living Area (sq ft)",
colour = "Central Air",
shape = "Central Air")
`geom_smooth()` using formula = 'y ~ x'
7.13 Part m
Now that the base plot is complete, theme elements are required for best practice customisation.
Create a new object called “af_theme” and use the ggplot2 theme documentation to help you modify:
- The title text to bold, size 16.
- The subtitle text to size 14.
- The axis title to size 12.
- The axis text to black, size 10.
- The axis ticks (x and y) to the colour #D9D9D9.
For the titles you use the plot.<…> family and axis.<…> family for axes elements.
# Start the custom theme
af_theme <- theme(
  plot.title = element_text(family = "Arial",
                            size = 16,
                            face = "bold"),
  plot.subtitle = element_text(family = "Arial",
                               size = 14),
  axis.title = element_text(family = "Arial",
                            size = 12),
  axis.text = element_text(family = "Arial",
                           size = 10,
                           colour = "black"),
  # Axis ticks (x and y) use the grey requested in the question
  axis.ticks = element_line(colour = "#D9D9D9")
)
7.14 Part n
Let’s consider gridlines.
Guidance notes that you need one set of gridlines dependent on the orientation of the plot. Here it is vertically oriented, so y gridlines are preferred.
Our reproducible theme object should keep both sets of major gridlines on, to turn off at will manually.
Tweak your custom theme to:
- Remove the background panel.
- Add major gridlines with the colour “#D9D9D9”.
- Remove the legend background, title and key.
- Remove the underlying x axis line.
- Remove the underlying y axis line.
Use your theme on the Scatter Plot and observe the results.
- For backgrounds and gridlines, use the panel.<…> family.
- For axes lines, use the axis.<…> family.
- For legends, use the legend.<…> family.
# Add to the custom theme
af_theme <- theme(
  plot.title = element_text(family = "Arial",
size = 16,
face = "bold"),
plot.subtitle = element_text(family = "Arial",
size = 14),
axis.title = element_text(family = "Arial",
size = 12),
axis.text = element_text(family = "Arial",
size = 10,
colour = "black"),
axis.ticks = element_line(colour = "#D9D9D9"),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "#D9D9D9"),
legend.key = element_blank(),
legend.title = element_blank(),
legend.background = element_blank(),
axis.line.y = element_blank(),
axis.line.x = element_blank()
)
Now to use on the scatter plot:
# Add new theme to the plot for accessibility tweaks
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air,
shape = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air) +
scale_x_continuous(breaks = seq(from = 0,
to = 8e+05,
by = 2e+05),
limits = c(0, 8e+05),
labels = scales::comma_format()) +
scale_y_continuous(breaks = seq(from = 0,
to = 6000,
by = 750),
limits = c(0, 6000)) +
labs(title = "Price vs. above ground living area split by central air.",
subtitle = "Data from the Ames housing project (2006 - 2010).",
x = "Sale price ($)",
y = "Living Area (sq ft)",
colour = "Central Air",
shape = "Central Air") +
af_theme
`geom_smooth()` using formula = 'y ~ x'
7.15 Part o
The final tweaks to make are manual:
- Turn off the x axis grid lines.
- Tweak the placement of the legend.
Guidance states that the legend should be within the body of the plot.
Within a manual theme layer:
Remove the major x axis gridlines.
Specify x and y co-ordinates for the legend placement.
Why shouldn’t these be placed within the af_theme object?
Use the panel.<…> family as well as the legend.<…> family.
Think of reproducibility: we want to use af_theme again.
# Manual tweaks: remove x gridlines and place the legend inside the plot
ames |>
  ggplot(mapping = aes(x = sale_price,
y = gr_liv_area)) +
geom_point(mapping = aes(colour = central_air,
shape = central_air)) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("N" = "#12436D",
"Y" = "#F46A25")) +
facet_wrap(~central_air) +
scale_x_continuous(breaks = seq(from = 0,
to = 8e+05,
by = 2e+05),
limits = c(0, 8e+05),
labels = scales::comma_format()) +
scale_y_continuous(breaks = seq(from = 0,
to = 6000,
by = 750),
limits = c(0, 6000)) +
labs(title = "Price vs. above ground living area split by central air.",
subtitle = "Data from the Ames housing project (2006 - 2010).",
x = "Sale price ($)",
y = "Living Area (sq ft)",
colour = "Central Air",
shape = "Central Air") +
theme(panel.grid.major.x = element_blank(),
legend.position = c(x = 0.94,
y = 0.4)) +
af_theme
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
`geom_smooth()` using formula = 'y ~ x'
The legend position and gridline tweaks should be changed manually as they will differ dependent on the orientation of the plot and position of the points.
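The deprecation warning in the output of this part suggests a replacement. As a sketch, assuming ggplot2 3.5.0 or later is installed, the legend can be placed inside the panel by setting legend.position to "inside" and supplying the coordinates through legend.position.inside. The toy data frame and object names are invented for illustration; with the ames plot, the same theme() call would take the place of the one in the solution.

```r
library(ggplot2)

# Hypothetical toy data standing in for ames (values are invented)
toy_homes <- data.frame(
  sale_price  = c(100000, 200000, 300000, 400000),
  gr_liv_area = c(900, 1500, 2100, 2600),
  central_air = c("N", "Y", "N", "Y")
)

# Manual tweaks kept separate from any reusable theme object
manual_tweaks <- theme(panel.grid.major.x = element_blank(),
                       legend.position = "inside",
                       legend.position.inside = c(0.94, 0.4))

inside_legend_plot <- ggplot(toy_homes,
                             mapping = aes(x = sale_price,
                                           y = gr_liv_area,
                                           colour = central_air)) +
  geom_point() +
  manual_tweaks

inside_legend_plot
```

Keeping these tweaks in their own object mirrors the reasoning in this part: the coordinates suit one plot and would need adjusting for another.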
8 Question 2 - Exploring continuous and categorical variables
Here you will perform the first step of an Exploratory Data Analysis (EDA) process, which is the initial investigation of our data as a whole. You will learn more about this process in Module 3 - Statistics.
For this question the focus is on:
- Creating other geometric objects.
- Formatting them and considering individual quirks.
- Adhering to best practice guidelines.
8.1 Part a
Let’s start with a categorical variable that would likely contribute to house prices, such as garage type. This has 7 levels:
- 2Types - More than one type of garage.
- Attchd - Attached to the home.
- Basment - Basement garage.
- BuiltIn - Built-in garage.
- CarPort - Car port garage.
- Detchd - Detached garage.
- NA - No garage.
Count the number of properties for each garage type.
Count the number of properties per garage type by the foundational material used to build the property, which has six levels:
- BrkTil - Brick and Tile
- CBlock - Cinder Block
- PConc - Poured Concrete
- Slab - Slab
- Stone - Stone
- Wood - Wood
You can solve both of these problems with count()ing.
# Number of properties per garage type
ames |>
  count(garage_type)
# A tibble: 7 × 2
garage_type n
<chr> <int>
1 2Types 23
2 Attchd 1731
3 Basment 36
4 BuiltIn 186
5 CarPort 15
6 Detchd 782
7 <NA> 157
# Number of properties per garage type by foundation
ames |>
  count(garage_type, foundation)
# A tibble: 31 × 3
garage_type foundation n
<chr> <chr> <int>
1 2Types BrkTil 4
2 2Types CBlock 15
3 2Types PConc 3
4 2Types Slab 1
5 Attchd BrkTil 28
6 Attchd CBlock 730
7 Attchd PConc 947
8 Attchd Slab 20
9 Attchd Stone 3
10 Attchd Wood 3
# ℹ 21 more rows
8.2 Part b
It is important to have summary tables like those in part a for accessibility purposes (if you can’t provide alternative text, then providing the data used is the bare minimum).
It would be easier to visualise the results in a bar chart, where initial conclusions can be drawn.
Create a bar chart to count the number of properties per garage type, ensuring to:
- Colour with the best practice blue “#12436D”
- Rotate the plot 90 degrees to the right.
You need to use the bar geometry and flip the coordinates.
# Horizontal bar chart of garage types
ames |>
  ggplot(mapping = aes(x = garage_type)) +
geom_bar(fill = "#12436D") +
coord_flip()
8.3 Part c
It would be easier to observe the results if you ordered the bars.
Use a combination of count(), reorder() and geom_bar’s “stat” parameter to order the bars in descending order.
Comment on what the plot shows you.
What is strange about the result?
To accomplish this, start by using count() to determine n for each garage type.
After computing n, use reorder() within aes() to order garage type by n. This, in turn, means that a y mapping is also needed.
Identities come into play with the stat parameter.
# Order bars by count in descending order
ames |>
  count(garage_type) |>
ggplot(mapping = aes(x = reorder(garage_type, n),
y = n)) +
geom_bar(fill = "#12436D",
stat = "identity") +
coord_flip()
- It seems that:
  - Attached garages are the most common amongst the properties, with detached garages the second most common.
  - Interestingly, even properties with built-in garages are more common than those without garages, which could be a reason for the large spread in property prices seen in Question 1.
- The missing values (which refer to properties without garages) are not included in the ordering, which is the default behaviour when ggplot2 encounters missing values.
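To see why the missing category escapes the ordering, it can help to look at what reorder() returns. The sketch below rebuilds the count table from the count(garage_type) output in part a and inspects the factor levels; the object names are illustrative.

```r
# Counts copied from the count(garage_type) output in part a
garage_counts <- data.frame(
  garage_type = c("2Types", "Attchd", "Basment", "BuiltIn",
                  "CarPort", "Detchd", NA),
  n = c(23, 1731, 36, 186, 15, 782, 157)
)

# reorder() returns a factor whose levels are sorted by n;
# NA is not one of those levels
ordered_type <- reorder(garage_counts$garage_type, garage_counts$n)

levels(ordered_type)
anyNA(levels(ordered_type)) # FALSE
```

Because NA is not a level, the bar for missing values is left out of the ordering and is drawn at the end of the axis.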
8.4 Part d
Let’s consider this missing value issue; the missing values are an intended level of the garage type variable after all.
Modify your code from part c to:
Recode the NA’s without permanently changing them.
Hence produce the properly ordered bar chart.
You will need to use the replace_na() function for this.
# Order bars including the "no garage" properties.
ames |>
  replace_na(replace =
list(garage_type = "NA")) |> # Not permanently altering them
count(garage_type) |>
ggplot(mapping = aes(x = reorder(garage_type, n),
y = n)) +
geom_bar(fill = "#12436D",
stat = "identity") +
coord_flip()
8.5 Part e
Let’s ensure the chart adheres to best practice guidelines.
Modify the y axis ticks to produce more gridlines.
Apply af_theme as a layer.
Manually remove the y axis gridlines and horizontally adjust the axis title by 1.
Add a title, subtitle and axes labels to the plot. Consider whether they are all useful.
Use the continuous y scaler.
The plus sign will come in handy.
Use the panel.<..> and axis.<..> families for this.
Use the labels layer. Look at the x label to determine usefulness.
Let’s break the axes a little more to have more than a few gridlines. Increasing in increments of 250 would allow this.
# Customize plot to best practice guidelines
ames |>
  replace_na(replace =
               list(garage_type = "NA")) |>
  count(garage_type) |>
  ggplot(mapping = aes(x = reorder(garage_type, n),
                       y = n)) +
  geom_bar(fill = "#12436D",
           stat = "identity") +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 1750,
                                  by = 250),
                     limits = c(0, 1750)) +
  coord_flip() +
  af_theme +
  theme(panel.grid.major.y = element_blank(),
        axis.title = element_text(hjust = 1)) +
  labs(title = "Number of properties per garage type.",
       subtitle = "Data from the Ames housing project (2006 - 2010).",
       x = "",
       y = "Number of properties")
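Setting the x axis label to an empty string removes an axis title that would otherwise be rotated after coord_flip(), which addresses the accessibility requirement that text in plots be horizontal. The garage type labels already identify the axis.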
8.6 Part f
You now have a great plot at the surface level, but what if you want to break this down by the material used in its foundation?
This lets you see if the attached and detached garage properties were also produced with high quality materials.
Count the number of garage types per foundation material.
Create a stacked bar chart using reorder(), geom_col() and its “fill” aesthetic.
Tweak the title in line with this new plot.
What does this plot show you?
Two inputs to count this time.
geom_col() requires a y aesthetic and you should map foundation to its fill argument.
Say what you see. Comment on highs and lows, what could be explored further from here?
# Number of garage types per foundation material
ames |>
  count(garage_type, foundation)
# A tibble: 31 × 3
garage_type foundation n
<chr> <chr> <int>
1 2Types BrkTil 4
2 2Types CBlock 15
3 2Types PConc 3
4 2Types Slab 1
5 Attchd BrkTil 28
6 Attchd CBlock 730
7 Attchd PConc 947
8 Attchd Slab 20
9 Attchd Stone 3
10 Attchd Wood 3
# ℹ 21 more rows
# Partially ordered stacked bar chart
ames |>
  replace_na(replace =
               list(garage_type = "NA")) |> # Optional
  count(garage_type, foundation) |>
  ggplot(mapping = aes(x = reorder(garage_type, n),
                       y = n,
                       fill = foundation)) +
  geom_col(position = position_stack(reverse = TRUE)) +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 1750,
                                  by = 250),
                     limits = c(0, 1750)) +
  coord_flip() +
  af_theme +
  theme(panel.grid.major.y = element_blank(),
        axis.title = element_text(hjust = 1)) +
  labs(title = "Number of properties per garage type and foundation material.",
       subtitle = "Data from the Ames housing project (2006 - 2010).",
       x = "",
       y = "Number of properties")
It seems that:
Properties in Ames with an attached or built-in garage are commonly made from cinder blocks and poured concrete, each being more expensive than brick and tile.
There is less of an even split with detached garages (which likely require complex building work), with more being built with cinder blocks as well as brick and tile, which is a less expensive but reliable material.
It would be useful to observe how these relationships relate to price via the scatter plot made in Question 1.
8.7 Part g
Let’s consider if alternative plot types would work here.
Would a grouped bar chart be appropriate instead of the stacked plot in Part f?
Is there another bar chart you could consider in place of part f? Would there be any difficulties associated with this?
Check out chapter 4.3.4 for examples of this type.
Check out chapter 4.3.3 for examples of a possible type.
A grouped bar chart would not be applicable here as it would result in an incredibly busy plot. At worst, there would be 42 bars, which is not fit for this purpose.
A segmented bar chart could be considered with proportions to display the differences in foundation material across the garage types. This would make the reordering redundant (all bars sum to 1) which reduces code, but does require the computation of sums and proportions.
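The sums and proportions mentioned above could be computed as in the sketch below. It rebuilds only the ten rows printed in Part f (so it covers two garage types, not the full ames data), and the object names shown_counts, proportions and segmented_plot are illustrative.

```r
library(dplyr)
library(ggplot2)

# The first ten rows of the count output shown in Part f
shown_counts <- tibble(
  garage_type = c(rep("2Types", 4), rep("Attchd", 6)),
  foundation = c("BrkTil", "CBlock", "PConc", "Slab",
                 "BrkTil", "CBlock", "PConc", "Slab", "Stone", "Wood"),
  n = c(4, 15, 3, 1, 28, 730, 947, 20, 3, 3)
)

# Proportion of each foundation material within its garage type
proportions <- shown_counts |>
  group_by(garage_type) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Segmented bar chart: every bar sums to 1, so no reordering is needed
segmented_plot <- proportions |>
  ggplot(mapping = aes(x = garage_type,
                       y = prop,
                       fill = foundation)) +
  geom_col(position = position_stack(reverse = TRUE)) +
  coord_flip()
```

An alternative that avoids the manual computation is geom_col(position = "fill") on the raw counts, at the cost of not having the proportions available as a column.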
8.8 Part h
Let’s look at a continuous variable this time, the best being the sale price itself.
A great place to start with any dataset is to investigate the distribution of its continuous variables.
These allow us to confirm if the data is skewed due to outliers (values far from the bulk of the data) and thus inform cleaning steps.
Produce a histogram to observe the distribution of the sale price variable.
What are your initial thoughts upon observing this plot?
You need to use the histogram geom to accomplish this.
# Histogram to show distribution of sale_price
ames |>
  ggplot(mapping = aes(x = sale_price)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This shows a peaked distribution that has a single mode (most frequently occurring point), but one that is heavily right skewed due to incredibly large sale prices at the tail end of the distribution.
The next step would be to investigate whether transforming this variable would bring it closer to a Normal (Gaussian) distribution.
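As a rough illustration of that next step, the sketch below draws the same histogram on a raw and a log10 x scale. The data are simulated (rlnorm with arbitrary, assumed parameters), not the ames sale prices, and a log transformation is only one of several candidates.

```r
library(ggplot2)

set.seed(2024)

# Simulated right-skewed prices (hypothetical values, NOT the ames data)
simulated <- data.frame(
  sale_price = rlnorm(n = 1000, meanlog = 12, sdlog = 0.4)
)

# Histogram on the raw scale: long right tail
raw_hist <- ggplot(data = simulated,
                   mapping = aes(x = sale_price)) +
  geom_histogram(bins = 30)

# Same histogram with a log10 x axis: closer to symmetric
log_hist <- ggplot(data = simulated,
                   mapping = aes(x = sale_price)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# Simple numeric check of skew: the mean sits above the median on the raw scale
raw_gap <- mean(simulated$sale_price) - median(simulated$sale_price)
log_gap <- mean(log10(simulated$sale_price)) - median(log10(simulated$sale_price))
```

Comparing raw_hist with log_hist (and raw_gap with log_gap) gives a quick sense of whether the transformation is worth pursuing on the real variable.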
8.9 Part i
Let’s play around with different bin sizes.
Create three plots:
- One with the default bins.
- One with 20 bins.
- One with 40 bins.
Use patchwork to lay them out as one larger plot on the left, and two smaller plots vertically on the right.
Does changing the number of bins impact our ability to discuss the distribution of this specific variable?
You’ll need to create three variables first to combine with patchwork.
You will need to use “|” to split the plots into columns, and “/” to layer them vertically.
# Changing bin sizes
first_histogram <- ames |>
  ggplot(mapping = aes(x = sale_price)) +
  geom_histogram() # Default bins set to 30
second_histogram <- ames |>
  ggplot(mapping = aes(x = sale_price)) +
  geom_histogram(bins = 20) # Fewer bins
third_histogram <- ames |>
  ggplot(mapping = aes(x = sale_price)) +
  geom_histogram(bins = 40) # More bins
# Lay out the plots with patchwork
first_histogram | (second_histogram / third_histogram)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In general, there is a good sense of peaks and possible outliers in the distribution with the standard 30 bins approach.
With a decreased number of bins, the histogram smooths over the actual shape and you lose the ability to pinpoint smaller peaks. At the same time, it is easier to visualise the line you could draw to identify the distribution (the density).
With an increased number of bins, it becomes harder to identify the overall shape as more breaks are introduced and the plot becomes more abstract as a result.
Going forward, the 30 bins approach seems the most sensible, as more information is not gained with the alternative approaches.
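The patchwork layout from this part can be tried on any data. The sketch below swaps in the built-in faithful dataset (an assumption made purely so the code runs without ames) and uses illustrative object names.

```r
library(ggplot2)
library(patchwork)

# Shared base plot on the built-in faithful data (stand-in for ames)
base_plot <- ggplot(data = faithful,
                    mapping = aes(x = waiting))

p_default <- base_plot + geom_histogram(bins = 30) # Default number of bins
p_fewer <- base_plot + geom_histogram(bins = 20)   # Fewer bins
p_more <- base_plot + geom_histogram(bins = 40)    # More bins

# "|" places plots side by side, "/" stacks them vertically;
# the brackets keep the stacked pair together in the right-hand column
combined <- p_default | (p_fewer / p_more)
```

Without the brackets, operator precedence would group the plots differently, so it is worth keeping them explicit.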
8.10 Part j
Let’s apply some best practice tweaks to your histogram.
Set the number of bins to 30 explicitly, set the bar outline to white, the bar colour to “#12436D” and transparency to 0.8.
Modify the continuous y scale to produce more gridlines.
Modify the continuous x scale to include an upper limit and change the scientific notation to commas.
Set appropriate labels for the plot.
Add af_theme and turn off the x axis gridlines.
These are not aesthetics.
seq() will come in handy.
You did this in question 1.
Use a labels layer.
You need the panel.<..> family for this.
Let’s go with increments of 100 on the y-axis to ensure there are 7 gridlines, as opposed to 14 with the 50 increment approach.
# Best practice histogram
ames |>
  ggplot(mapping = aes(x = sale_price)) +
  geom_histogram(bins = 30,
                 colour = "white",
                 fill = "#12436D",
                 alpha = 0.8) +
  scale_x_continuous(breaks = seq(from = 0e+00,
                                  to = 8e+05,
                                  by = 2e+05),
                     limits = c(0e+00, 8e+05),
                     labels = scales::comma_format()) +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 700,
                                  by = 100)) +
  labs(title = "Sale price distribution.",
       subtitle = "Data from the Ames housing project (2006 - 2010).",
       x = "Sale Price ($)",
       y = "Frequency of properties") +
  af_theme +
  theme(panel.grid.major.x = element_blank()) # Turn off x gridlines
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
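It is worth noting here that we received a warning about missing values or values outside the scale range. This is because of the “limits” parameter we used in scale_x_continuous().
In essence, anything that does not fall within the limits is treated as missing and removed. The warning names geom_bar(), which suggests the two rows are bars at the edges of the scale (whose widths extend past the limits) rather than raw observations. It is still a good idea to check for and deal with missing values prior to visualisation.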
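Both behaviours can be checked in isolation with the scales package. The sketch below formats the x axis breaks with commas and then applies censor(), which is the default out-of-bounds handling for continuous scales; the three test prices are hypothetical.

```r
library(scales)

# The x axis breaks used in the histogram
x_breaks <- seq(from = 0e+00, to = 8e+05, by = 2e+05)

# Comma labels in place of scientific notation
comma_labels <- label_comma()(x_breaks)
comma_labels

# Hypothetical values: one below, one inside and one above the limits
test_prices <- c(-50000, 250000, 900000)

# Values outside the limits are replaced with NA, hence the warning
censored <- censor(test_prices, range = c(0e+00, 8e+05))
censored
```

label_comma() is the current name for the labeller; comma_format(), as used in the solution, is the older equivalent.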
8.11 Part k
Histograms tend to be used for single continuous variables. What if you intended to compare the distribution across the groups of a categorical variable?
Let’s return to the thought process from earlier on the price spread of specific garage types.
What two methods could be used to observe the sale price distribution across the different garage types?
Consider wrapping, and present boxes.
The first method (and likely the best one) is to create a boxplot, with the sale price distribution split between garage types. This gives summary statistics at a glance, such as the median and quartiles, as well as explicit markers where outliers are likely to be.
The second method (which would likely overcomplicate things) is to create a histogram as normal, and then use facet_wrap() to produce subplots, one for each garage type. Due to the wide nature of histograms, they would likely be rotated in the output to ensure they would fit in the plot pane.
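Since the boxplot is built in the next part, here is a minimal sketch of the second method only. It uses the mpg dataset that ships with ggplot2 as a stand-in (hwy for the continuous variable and class for the grouping variable are assumptions for illustration, not the ames columns).

```r
library(ggplot2)

# One histogram panel per group, using mpg as a stand-in for ames
faceted <- ggplot(data = mpg,
                  mapping = aes(x = hwy)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ class)
```

With seven groups the panels are already small, which illustrates why this approach can overcomplicate the comparison.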
8.12 Part l
Let’s see how the distribution of sale price differs by garage type.
Create a boxplot to compare the distribution of the sale price across the different garage types.
Change the outlier marks to blue crosses of size 5.
Ensure the boxplot size is proportional to the sample size.
Comment on the results.
You’ll need the boxplot geom and an x and y aesthetic set for this.
You need to use the outlier.<..> family within the boxplot geom.
There’s a special argument “varwidth” for this.
Are there outlier points? Is one garage type more common among the higher/lower priced properties?
# First boxplot
ames |>
  ggplot(mapping = aes(x = garage_type,
                       y = sale_price)) +
  geom_boxplot(outlier.colour = "blue",
               outlier.shape = 4,
               outlier.size = 5,
               varwidth = TRUE)
Outliers are clearly very common here, and when considering the most frequent garage types:
With the attached properties there is a massive number of highly priced properties that create a prominent right skew.
Interestingly with detached properties, this right skew is still the overarching trend, but less so due to some incredibly low priced properties with the feature.
It would be a mistake to remove all outliers here, as this would result in a loss of a large proportion of the dataset. A better approach could be that of transformation, to scale the data appropriately.
As an aside:
Car ports (freestanding roofed shelters for cars, much less expensive to construct) are very uncommon across the dataset, which makes sense with many new builds.
# Filter to garage type = CarPort
ames |>
  filter(garage_type == "CarPort")
# A tibble: 15 × 81
order ms_sub_class ms_zoning lot_frontage lot_area street alley lot_shape
<dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 206 090 RM 110 8472 Grvl <NA> IR2
2 553 020 RH 60 8400 Pave <NA> Reg
3 647 020 RL 75 14559 Pave <NA> Reg
4 651 190 RL 107 10615 Pave <NA> IR1
5 762 020 RL 50 4280 Pave <NA> IR1
6 1265 020 RL NA 12929 Pave <NA> IR1
7 1386 090 RL 60 12900 Pave <NA> Reg
8 1387 190 RL 60 12900 Pave <NA> Reg
9 1880 090 RL NA 10899 Pave <NA> IR1
10 1899 090 RL 70 9842 Pave <NA> Reg
11 1916 060 RL NA 18450 Pave <NA> IR1
12 2707 090 RL 72 10791 Pave <NA> Reg
13 2879 050 C (all) 60 11040 Pave <NA> Reg
14 2918 160 RM 21 1533 Pave <NA> Reg
15 2921 160 RM 21 1894 Pave <NA> Reg
# ℹ 73 more variables: land_contour <chr>, utilities <chr>, lot_config <chr>,
# land_slope <chr>, neighborhood <chr>, condition_1 <chr>, condition_2 <chr>,
# bldg_type <chr>, house_style <chr>, overall_qual <dbl>, overall_cond <dbl>,
# year_built <dbl>, year_remod_add <dbl>, roof_style <chr>, roof_matl <chr>,
# exterior_1st <chr>, exterior_2nd <chr>, mas_vnr_type <chr>,
# mas_vnr_area <dbl>, exter_qual <chr>, exter_cond <chr>, foundation <chr>,
# bsmt_qual <chr>, bsmt_cond <chr>, bsmt_exposure <chr>, …
There are only 15 properties with car ports, which should be considered if drawing conclusions on this particular set of properties, as it is more difficult to generalise.
8.13 Part m
Let’s tweak the plot in line with accessibility guidelines.
Apply the following tweaks to your boxplot:
Change the outlier markers to filled black circles of size 2.5.
Use all of the colours from the Analysis Function guidance to fill the boxplot.
Modify the y axis to remove scientific notation.
Add appropriate labels to the plot.
Add the af_theme object as a layer, turn off the legend that results from the fill colours and remove x axis gridlines.
You need to colour by the same variable creating the boxes in the first place. A scaling layer is needed for the chosen colours.
You’ve done this a few times, no shame in copy and pasting. However, you do need some more gridlines this time.
Use a labels layer.
Use the legend.<…> family and panel.<..> family in a separate theme layer.
# Publication quality boxplot
ames |>
  ggplot(mapping = aes(x = garage_type,
                       y = sale_price,
                       fill = garage_type)) +
  geom_boxplot(outlier.size = 2.5,
               varwidth = TRUE) +
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#3D3D3D",
                               "#A285D1")) +
  scale_y_continuous(breaks = seq(from = 0e+00,
                                  to = 8e+05,
                                  by = 1e+05),
                     limits = c(0e+00, 8e+05),
                     labels = scales::comma_format()) +
  labs(title = "Sale price distribution by garage type.",
       subtitle = "Data from the Ames housing project (2006 - 2010).",
       x = "Garage Type",
       y = "Sale Price ($)") +
  af_theme +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank())
9 Summary
Amazing job working your way through the exercises. Hopefully this allowed you to get into the rhythm of using ggplot2 to create visualisations, and stressed the importance of accessibility.
You have a few options from here:
- Investigate the optional reference material.
- Read deeper into the topic with Hadley Wickham’s R for Data Science 2nd Edition chapters 1 and 9.
- Proceed directly to the summary chapter 7 for a wider view of next steps.