library(tidyverse) # Data manip and visualisation
library(janitor) # To clean column names
library(scales) # Auto determination of axes and legend labels
library(patchwork) # Combining plots together
library(ggrepel) # Repel labels away from graphs
library(showtext) # Text rendering
# Set the font
font_add(family = "Arial", regular = "C:\\Windows\\Fonts\\Arial.ttf")
showtext_auto()
Chapter 4 - Creating a Variety of Plot Types
1 Introduction
Up to now we have built up the scatterplot from the ground up, showing how to add many elements and layers to the plot, how to customise it with aesthetics and themes, as well as customising those to ensure accessibility. What we haven’t seen is the massive number of other geometric layers at our disposal and how best to utilise them. We will be covering a fair few in this chapter that you see all the time in government publications.
There are also some plot types in the Reference Materials Chapter 5, but these are less recommended plot types unless absolutely necessary for the message you are trying to convey.
2 Preliminaries
2.1 Packages & Fonts
As always, the first port of call is to bring in the packages we need for this session.
2.2 Data
Next is to read in the dataset that we will use, which is the same as we have used in previous chapters, the gapminder data. We will clean column names with janitor and drop the NA values as we have done previously.
# Importing data using the read_csv function
gapminder <- read_csv("./data/gapminder.csv")
# Cleaning column names using the clean_names function
# Dropping missing values using the drop_na function
gapminder <- clean_names(gapminder) |>
  drop_na()
2.3 Theme
We also need to redefine our custom_theme object we created previously, as it will be used throughout for the best practice guidance in this chapter.
custom_theme <- theme(
# Text elements
plot.title = element_text(family = "Arial", # Set font family
size = 16, # Set font size
face = "bold", # Bold typeface
hjust = 0, # Horizontal adjustment, Left align
vjust = 1), # Vertical Adjustment, raises it slightly
# Subtitle
plot.subtitle = element_text(family = "Arial", # Font family
size = 14, # Font size
hjust = 0), # Horizontal adjustment, this will left align
# Caption
plot.caption = element_text(family = "Arial", # Font family
size = 9), # Font size
# Axis titles
axis.title = element_text(family = "Arial", # Font family
size = 12), # Font size
# Axis text
axis.text = element_text(family = "Arial", # Axis family
size = 10,
colour = "black"),
panel.background = element_blank(), # removes background
legend.key = element_blank(), # make the legend background blank
panel.grid.major = element_line(color = "#D9D9D9"), # Adding major grid lines with colour specified
axis.line.x = element_blank(),
axis.line.y = element_blank()
)
3 Geometries for Plotting One Variable
Throughout this chapter we will make the distinction on whether data is continuous or discrete as this changes the recommended plot type dramatically, but we are aware that not everyone is familiar with this terminology and the details, so we will reinforce this here.
Discrete data can only take a finite or countable set of values (such as age in whole years), whereas continuous data can take any value within a range and is measured rather than counted (such as height and weight). You may also hear discrete data with a small number of distinct values referred to as categorical data, as each unique value acts as a category of that variable.
3.1 Continuous Data
3.1.1 Histogram
A histogram is a visual representation of the distribution of a dataset. It groups continuous data into ranges and plots this data as bars. The height of each bar shows the number of observations within each range. It shows the most common observations as well as the shape of the data. We can use histograms to answer the following questions:
What are the smallest and largest values?
What is the “center” or “most typical” value?
How do the values spread out?
What are frequent and infrequent values?
In a histogram, the total range of the data set (i.e. from the minimum value to the maximum value) is divided into equal parts. These equal parts are known as bins.
We first cut up the x-axis into a series of bins, where each bin represents a range of values.
For each bin, we count the number of observations that fall in the range corresponding to that bin.
Then for each bin, we draw a bar whose height marks the corresponding count.
For example, if you were creating a histogram of age, the bins might be 0-5, 6-10, 11-15, and so on up to 85.
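The three steps above can be sketched in base R without any plotting, which may help to show what a histogram is actually counting. The ages vector and the 10-year bin edges below are made-up values for illustration only; they are not taken from the gapminder data.

```r
# Hypothetical ages, invented purely for illustration
ages <- c(3, 7, 12, 15, 21, 24, 28, 33, 35, 36, 41, 58)

# Step 1: cut the axis into equal-width bins (an edge every 10 years)
bin_edges <- seq(from = 0, to = 60, by = 10)

# Step 2: assign each observation to a bin and count per bin
# right = FALSE makes each bin include its lower edge: [0,10), [10,20), ...
binned <- cut(ages, breaks = bin_edges, right = FALSE)
bin_counts <- table(binned)

# Step 3: these counts are the bar heights a histogram would draw
print(bin_counts)
```

Every observation lands in exactly one bin, so the counts always sum to the number of observations.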
If we want to create a histogram with the ggplot2 package, we need to use the geom_histogram() function.
3.1.1.1 Examples
By default, geom_histogram() will divide your data into 30 equal bins or intervals. A warning pops up that “stat_bin() using bins = 30” to inform you of this. Unfortunately, this default value isn’t always appropriate for the message we are trying to convey, so we can use the optional arguments:
binwidth - If this is too small, the histogram becomes overly peaky and visually busy, obscuring the trends in the data. If this is too large, we lose major trends in the data.
bins - The number of bins can also heavily impact this.
It is recommended to use the default to start and tweak it from there to obtain the visualisation you want. There are also Analysis Function (AF) guidelines around this, which we will discuss later.
# We use geom_histogram
filter(gapminder, year == 1992) |>
ggplot() +
geom_histogram(mapping = aes(x = life_exp)) # Plotting a histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can get a clearer picture of the relative densities of the distribution when using different binwidths.
Changing the number of bins can highlight more trends in the data that are lost when summarising to a smaller number of bins.
# Here I am changing the number of bins to 40, using the bins argument
filter(gapminder, year == 1992) |>
ggplot() +
geom_histogram(mapping = aes(x = life_exp),
bins = 40)
We can also get a horizontal histogram by flipping the axes using the incredibly useful “coord_flip()” function. This is added as a layer to our ggplot code and impacts many of the geometric layers we can use.
# Example of a flipped histogram
filter(gapminder, year == 1992) |>
ggplot() +
geom_histogram(mapping = aes(x = life_exp),
bins = 15) +
coord_flip() # This flips our graph
It is also difficult at times to distinguish the different bars that make up a histogram. Since histograms represent continuous numerical variables, there are no gaps between the bars as with a bar chart (which represents discrete or categorical variable counts).
To adjust the colours of your histogram, add the arguments colour (or col) and fill, together with the desired colour:
colour here refers to the lines around the bars
fill is the colour inside the bars.
In the example below, I have used black as the line around the bars and hex value “#f85100” as the colour of my bars.
# Draw with black outline and orange fill
filter(gapminder, year == 1992) |>
ggplot() +
geom_histogram(mapping = aes(x = life_exp),
bins = 40,
colour = "black", # Colour of outlines of bars
fill = "#f85100") # Colour of the bars
3.1.1.2 Best Practice
Use small gaps between the bars to emphasise the profile of the data.
To avoid repeating all the best practice code from the previous sections, we will use the theme we created earlier, which greatly reduces code repetition.
# We can now add our theme to our plot as a layer
filter(gapminder, year == 1992) |>
  ggplot() +
  geom_histogram(mapping = aes(x = life_exp),
                 binwidth = 1,
                 colour = "white",
                 fill = "#12436D", # Dark blue recommended by AF
                 alpha = 0.8) +
  scale_y_continuous(breaks = pretty_breaks()) +
  scale_x_continuous(breaks = seq(from = 0, # seq function to modify breaks
                                  to = 80,
                                  by = 10)) +
  labs(title = "Life expectancy distribution, 1992",
       caption = "Source: www.gapminder.org",
       x = "Age (years)",
       y = "") +
  custom_theme # theme added as a layer
3.2 Categorical Data
3.2.1 Bar/Frequency Chart
Bar charts are one of the most commonly used data visualizations. They show the distribution of categorical variables and can have either horizontal or vertical bars. There’s an important distinction you should be aware of when making bar graphs: sometimes the bar heights represent counts of cases in the data set, and sometimes they represent values in the data set.
If you simply want to show a count of the values, use the geom_bar() function. If you have a value for the x axis and the y axis (to show the values themselves), use geom_col(), which makes the heights of the bars represent values in the data. For more information refer to the Bar Charts section of the tidyverse documentation.
In previous versions of ggplot2, the recommended way to create a bar graph of values was to use geom_bar(stat = "identity"). As of ggplot2 2.2.0 and beyond, there is a geom_col() function which does the same thing.
Throughout this section we are only plotting one variable and as such don’t have a y value for the height, so we will just use ‘geom_bar()’ here.
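As a minimal sketch of the distinction, the two functions can be compared on a small toy data frame. The `toy` data and its `region` column are invented for illustration, and `layer_data()` is used only to inspect the bar heights ggplot2 computes.

```r
library(ggplot2)

# Hypothetical toy data: one row per country, invented for illustration
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  region  = c("North", "North", "South", "South", "South")
)

# geom_bar() counts the rows in each category for us
p_bar <- ggplot(toy) +
  geom_bar(mapping = aes(x = region))

# geom_col() expects the bar heights to already be in the data
region_counts <- as.data.frame(table(region = toy$region))
p_col <- ggplot(region_counts) +
  geom_col(mapping = aes(x = region, y = Freq))

# Both plots draw bars of the same height
bar_heights <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y
```

In this sketch geom_bar() does the counting itself, while geom_col() plots the pre-computed `Freq` column as given.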
3.2.1.1 Examples
Note that we create a boolean column here so that we have categorical data to plot.
# Example of basic bar chart
filter(gapminder, year == 1997) |>
mutate(above_mean_life_exp = life_exp >= mean(life_exp)) |> # creates new column in our tibble, a boolean
ggplot() +
geom_bar(mapping = aes(x = above_mean_life_exp)) # plotting a bar graph
As another example, we can plot a count of how many countries are in each continent.
# Example of basic bar chart - counts
filter(gapminder, year == 1997) |>
ggplot() +
geom_bar(mapping = aes(x = continent)) # plotting a bar graph
The width argument of the geom_bar() function allows us to control the bar width. It ranges between 0 and 1, with 1 being full width. The default value is 0.9; larger values make the bars wider, and smaller values make the bars narrower.
# We set the width argument outside the aes()
filter(gapminder, year == 1997) |>
ggplot() +
geom_bar(mapping = aes(x = continent),
width = 0.4) # we can modify the bar width
It often makes sense to turn your bar plot horizontal. Indeed, it makes the group labels much easier to read in line with the guidance from the Analysis Function. We’re just as good at comparing the sizes in vertical and horizontal bar charts, but horizontal bar charts are good if we have long category names, or have lots of categories which don’t fit on a standard portrait-oriented page.
To do this in practice, we make use of ‘coord_flip()’ as we did with Histograms.
# Example of a flipped bar chart
filter(gapminder, year == 1997) |>
ggplot() +
geom_bar(mapping = aes(x = continent)) +
coord_flip() # This flips our graph
By default, bar graphs use a dark grey for the bars and there is no outline around the fill. Just like with histograms, to adjust the colours of your bar graph, add the arguments colour (or col) and fill, together with the desired colour.
colour here refers to the lines around the bars
fill is the colour inside the bars.
Here I am using the hex colour code for ONS green with a black outline.
# Draw with black outline and green fill
filter(gapminder, year == 1997) |>
ggplot() +
geom_bar(mapping = aes(x = continent),
colour = "black",
fill = "#a8bd3a")
3.2.1.2 Best Practice Examples
The y-axis on a bar chart should always start at zero. The length of bars relative to the axis gives an instant understanding of the relative sizes of categories.
Use bar charts to show data that are ranked, in either ascending or descending order. Horizontal bars should be used.
A bar chart should always be ranked by value, unless there is a natural order to the data (for example, age or time).
# We can now add our theme to our plot as a layer
filter(gapminder, year == 1997) |>
ggplot() +
geom_bar(mapping = aes(x = continent),
fill = "#12436D") +
coord_flip() + # flip graph
custom_theme + # best practice theme
theme(
legend.position = "none", # amending theme
axis.title = element_text(hjust = 1), # align axis text
panel.grid.major.y = element_blank() # remove major gridlines on y axis
) +
labs(title = "Number of Countries In Each Continent", # labels
caption = "Source: www.gapminder.org",
x = "",
y = "Number of Countries")
To highlight the highest values, the largest value should be at the top of the chart. There are lots of ways to do this; here I have created a new column as a factor, with its levels in a specified order.
# Creating a new column and creating factor levels
example_data <- gapminder |>
filter(year == 1997) |>
group_by(continent) |>
arrange(continent) |>
mutate(level_continent = factor(continent,
levels = c("Oceania", "Americas", "Asia", "Europe", "Africa")))
example_data |> ggplot() +
geom_bar(mapping = aes(x = level_continent),
fill = "#12436D") + # plotting level continent
coord_flip() +
custom_theme +
theme(legend.position = "none",
axis.title = element_text(hjust = 1),
panel.grid.major.y = element_blank()) +
labs(title = "Number of Countries In Each Continent",
caption = "Source: www.gapminder.org",
x = "",
y = "Number of Countries")
Of course, we can reverse the order of the factor levels if we want to show highest to lowest value instead.
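One way to reverse the level order is sketched below in base R. The `continent` vector here is a small stand-in for the `level_continent` column, built by hand for illustration rather than taken from the data.

```r
# Hypothetical factor using the same continent order as the example above
continent <- factor(
  c("Africa", "Asia", "Europe", "Americas", "Oceania"),
  levels = c("Oceania", "Americas", "Asia", "Europe", "Africa")
)

# rev() flips the level order; the underlying values are unchanged
continent_reversed <- factor(continent, levels = rev(levels(continent)))

levels(continent_reversed)
```

The forcats function fct_rev(), loaded as part of the tidyverse, should give the same result if you prefer to stay within tidyverse functions.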
If you find yourself labeling individual bar values, consider using a table instead. If you do add bar values, make sure the labels are aligned at the bottom of the bars. This will enable easy comparison.
To add labels to a bar graph of counts instead of values, we use geom_bar(), which adds bars whose height is proportional to the number of rows, and then add geom_label() as another layer showing the counts.
For the label itself, we can use the after_stat() functionality from ggplot to specify counting the number of observations in each category.
# Labelling the values on bars
example_data <- gapminder |>
filter(year == 1997) |>
group_by(continent) |>
arrange(continent) |>
mutate(level_continent = factor(continent,
levels = c("Oceania", "Americas", "Asia", "Europe", "Africa")))
example_data |> ggplot() +
geom_bar(mapping = aes(x = level_continent),
fill = "#12436D") +
coord_flip() +
custom_theme +
theme(legend.position = "none",
axis.title = element_text(hjust = 1),
panel.grid.major.y = element_blank()) +
labs(title = "Number of Countries In Each Continent",
caption = "Source: www.gapminder.org",
x = "",
y = "Number of Countries") +
geom_label(mapping = aes(x = level_continent,
label = after_stat(count)),
y = 0,
colour = "white",
fill = NA,
label.size = NA,
size = 5,
hjust = 0,
stat = "count")
There are other methods for doing this and adjusting positions. One example is geom_text(), which can be a little more intuitive.
We can also prepare the data differently, adding a count column and pointing to that for labelling purposes. You can do this by creating a new column using the n() function in mutate().
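A minimal sketch of the count-column approach is shown below. The `toy` tibble and the column name `n_countries` are invented for illustration; with the real data you would start from the filtered gapminder tibble instead.

```r
library(dplyr)

# Hypothetical toy data: one row per country, invented for illustration
toy <- tibble(
  country   = c("A", "B", "C", "D", "E"),
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe")
)

# group_by() followed by mutate(n()) adds the group size to every row
toy_counted <- toy |>
  group_by(continent) |>
  mutate(n_countries = n()) |>
  ungroup()

# One row per continent, ready to use as label data
label_data <- distinct(toy_counted, continent, n_countries)
```

A table like `label_data` could then be supplied to a geom_text() layer through its data argument, with the label aesthetic mapped to `n_countries`.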
4 Plotting with Two Variables
4.1 Continuous X, Continuous Y
4.1.1 Scatter Plots
The simplest display of two quantitative variables is a scatter plot, with each variable represented on an axis. We have seen how to build one of these in Chapter 2, but what we didn’t cover was some more niche best practice guidelines, which we will reinforce here.
4.1.1.1 Best Practice
Use a dot (or other symbol) plot to make comparisons between values. The size of the visual element representing the data (dot position) is representative of the data value itself.
# Plotting a scatterplot
filter(gapminder, year == 2007) |>
ggplot() +
geom_point(mapping = aes(x = gdp_percap,
y = life_exp),
colour = "#28A197") +
scale_x_continuous(labels = scales::dollar) +
custom_theme +
theme(panel.grid.major.x = element_blank()) +
labs(title = "Relationship between life expectancy and income, 2007",
caption = "Source: www.gapminder.org",
x = "GDP per capita",
y = "Age (Years)")
4.2 Continuous Function
A graph of a continuous function differs from that of a continuous variable in general, in that we see how one continuous variable changes in relation to another continuous variable (how something changes over time, for example). In this case we can join the points together to visualise the journey the data takes.
4.2.1 Line Plots
We often use line graphs when representing this type of data. For example, line graphs work well for presenting time series.
4.2.1.1 Examples
Use ggplot() with geom_line(), and specify which variables you mapped to x and y.
# We use geom_line
gapminder |>
filter(country %in% c("Colombia")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap)) # Plotting a line graph
We can amend the colour, width and type of the line by specifying the arguments:
- colour
- linewidth
- linetype
These need to be specified outside the aes() function.
# Amend colour, size and linetype
gapminder |>
filter(country %in% c("Colombia")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap),
colour = "orange", # Changing the colours
linewidth = 2, # amending the size
linetype = "dashed") # amending the type of the line
Other line types are presented in this document for use.
We can also plot multiple lines to visualise how continuous functions change but for different levels of a categorical variable. We do this by mapping colour to a variable just like we did in Chapter 2.
# Plotting multiple line charts with colour argument
gapminder |>
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap,
colour = country)) # colour mapped to country
Mapping a variable to further properties of the line makes each line look more distinct. Here I am also mapping country to linetype, so each country will have a different colour as well as a different line type.
# Plotting multiple line charts with colour and linetype varying
gapminder |>
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap,
colour = country,
linetype = country)) # linetype mapped to country
We can also specify our own colours for each line using scale_colour_manual that we met in chapter 2.
# Using scale colour manual to specify my own colours
gapminder |>
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap,
colour = country)) +
scale_colour_manual(values = c("Colombia" = "#bb4430", # specify own colours
"Chile" = "#7ebdc2",
"Argentina" = "#bcb3b5",
"Brazil" = "#e5ba38",
"Peru" = "#a8bd3a",
"Ecuador" = "#003d59")
)
Sometimes it is useful to indicate each data point on a line graph. This is helpful when the density of observations is low, or when the observations do not happen at regular intervals. We do this by adding a geom_point() layer to the plot, specifying the same aesthetics as the geom_line() layer (these could also go inside the ggplot() for global aesthetics).
# Here I am adding points on top of the lines
gapminder |>
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap,
colour = country)) +
geom_point(mapping = aes(x = year,
y = gdp_percap,
colour = country)) + # add a layer of points
scale_colour_manual(values = c("Colombia" = "#bb4430",
"Chile" = "#7ebdc2",
"Argentina" = "#bcb3b5",
"Brazil" = "#e5ba38",
"Peru" = "#a8bd3a",
"Ecuador" = "#003d59")
)
The default shape for points is a solid circle, the default size is 1.5, and the default colour is black.
The fill colour is relevant only for some point shapes (numbered 21–25), which have separate outline and fill colours (see section 2.2.3, setting shapes, for more information on the shapes).
# Altering shape and fill
gapminder |>
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
ggplot() +
geom_line(mapping = aes(x = year,
y = gdp_percap,
colour = country)) +
geom_point(mapping = aes(x = year,
y = gdp_percap,
colour = country),
shape = 21, # adding circles
fill = "white") + # this fills the circles white
scale_colour_manual(values = c("Colombia" = "#bb4430",
"Chile" = "#7ebdc2",
"Argentina" = "#bcb3b5",
"Brazil" = "#e5ba38",
"Peru" = "#a8bd3a",
"Ecuador" = "#003d59")
)
TIP - If the points and lines have different colors, you should specify the points after the lines, so that they are drawn on top. Otherwise, the lines will be drawn on top of the points.
4.2.1.2 Best Practice
A line chart will emphasise the overall pattern of the data and highlight trends. Use them when you have lots of data points or just a few. Time Series should always be represented using this method.
There is no definitive thickness measure for lines, but ensure they don’t appear too thin in the final chart.
When you can, label lines directly rather than using a legend. If a legend is unavoidable, place it near the lines within the body of the graph.
Use a dot plot with a line when there are lots of data points or the interval between data points is not equal. Show if data are irregular. Overuse of these makes charts cluttered, particularly if there are multiple lines.
Similarly, using textured lines such as dotted or dashed lines can clutter a line chart, so they should be avoided if possible.
Time series charts do not have to begin at zero. If a chart does not start at zero, this must be indicated by breaking the y-axis in an obvious way.
It can be difficult to differentiate lines on the line graph if you have lots of categories. Textures, colour and marks are all options, but it is essential that the graph is not cluttered and is still accessible. Additionally, users may assign meaning to different colours or textures, which can be lost to those with low vision, or misconstrued with those that can view them.
# Plotting a line chart
# following best practice
# filtering the data
line_graph_data <- gapminder |>
filter(country == "Malawi")
# Plotting the data
ggplot(data = line_graph_data) +
geom_line(mapping = aes(x = year,
y = life_exp),
colour = "#12436D",
linewidth = 1) +
geom_point(mapping = aes(x = year,
y = life_exp),
colour = "#12436D",
size = 2) +
scale_y_continuous(limits = c(0, max(line_graph_data$life_exp))) + # axis starts from 0
custom_theme + # adding our theme
theme(panel.grid.major.x = element_blank()) + # removing gridlines
labs(title = "Life expectancy in Malawi, 1962 - 2007",
caption = "Source: www.gapminder.org",
x = "",
y = "Age (Years)")
We can also add a legend to our graph if absolutely necessary.
# Adding a legend
# filtering the data
line_graph_data <- gapminder |>
  filter(country %in% c("Malawi", "Zambia"))
# Plotting the data
ggplot(data = line_graph_data) +
  geom_line(mapping = aes(x = year,
                          y = life_exp,
                          colour = country),
            linewidth = 1) + # linewidth replaces size for lines (ggplot2 3.4.0+)
  geom_point(mapping = aes(x = year,
                           y = life_exp,
                           colour = country),
             size = 2) +
  scale_colour_manual(values = c("Malawi" = "#12436D", # adding colours
                                 "Zambia" = "#F46A25")) +
  scale_y_continuous(limits = c(0, NA)) + # axis starts at 0
  custom_theme + # adding a theme
  theme(panel.grid.major.x = element_blank(), # removing grid lines
        legend.position = "inside", # adding a legend inside the graph
        legend.position.inside = c(0.9, 0.5), # legend coordinates (ggplot2 3.5.0+)
        legend.title = element_blank(), # removing the legend title
        legend.background = element_blank(), # removing the legend background
        legend.key = element_blank()) + # removing the legend key
  labs(title = "Life expectancy in Malawi and Zambia, 1962 - 2007",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (Years)")
We can add more lines to our graph.
# Adding more lines
# If the title is long, it can be split into multiple lines with \n
# filtering the data
line_graph_data <- gapminder |>
  filter(country %in% c("Malawi", "Zambia", "Tanzania", "Zimbabwe"))
# Plotting the data
ggplot(data = line_graph_data) +
  geom_line(mapping = aes(x = year,
                          y = life_exp,
                          colour = country),
            linewidth = 1) + # linewidth replaces size for lines (ggplot2 3.4.0+)
  geom_point(mapping = aes(x = year,
                           y = life_exp,
                           colour = country),
             size = 2) +
  scale_colour_manual(values = c("Malawi" = "#12436D", # adding colours
                                 "Zambia" = "#28A197",
                                 "Tanzania" = "#801650",
                                 "Zimbabwe" = "#F46A25")) +
  scale_y_continuous(limits = c(0, NA)) + # axis starts at 0
  custom_theme + # adding a theme
  theme(panel.grid.major.x = element_blank(), # removing gridlines
        legend.position = "inside", # adding a legend inside the graph
        legend.position.inside = c(0.9, 0.3), # legend coordinates (ggplot2 3.5.0+)
        legend.title = element_blank(), # removing the legend title
        legend.background = element_blank(), # removing the legend background
        legend.key = element_blank()) + # removing the legend key
  labs(title = "Life expectancy in Malawi, Zambia, Tanzania and Zimbabwe,\n1962 - 2007",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (Years)")
We can add labels on our lines using geom_text_repel(), but this takes a little setting up. We need a new label column that holds the country name only on the two rows we want to label, so we start off with a column of NA values.
We then fill in the label on the rows where year equals its maximum value, copying the country name across (this way only two rows carry a label and the rest stay NA). The which() function returns the positions of those rows, which is why it is used here.
The geom_text_repel() function repels the labels away from the data points and from each other, whereas labels placed with geom_text() or geom_label() can overlap the data, which is much harder to correct afterwards. It takes a few arguments:
The same aesthetic mapping as the prior layers if we aren’t using global aesthetics. There is also a label argument to add with this function however, which we assign to our label column.
nudge_x (or nudge_y) allows us to nudge the label slightly in case the lines are close together (like they are here, which will cause the labels to hit the line regardless).
na.rm = TRUE silently drops the rows with missing (NA) labels, so the NA values in our label column do not trigger a warning.
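The label set-up described above can also be written in a single step with ifelse(). This is only a sketch of an alternative, not the approach used in the chapter's own code: `toy_lines` and its life expectancy values are invented for illustration and simply mimic the shape of the filtered data.

```r
# Hypothetical data in the same shape as the filtered line data (values invented)
toy_lines <- data.frame(
  country  = rep(c("Malawi", "Zambia"), each = 3),
  year     = rep(c(1997, 2002, 2007), times = 2),
  life_exp = c(47, 45, 48, 40, 39, 42),
  stringsAsFactors = FALSE
)

# Keep the country name only on rows for the latest year; NA elsewhere
toy_lines$label <- ifelse(
  toy_lines$year == max(toy_lines$year),
  toy_lines$country,
  NA_character_
)

toy_lines$label
```

Only the final point of each line carries a label, which is what lets na.rm = TRUE discard the rest quietly.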
# Adding labels on lines
# filtering the data
line_graph_data <- gapminder |>
filter(country %in% c("Malawi", "Zambia"))
# Set up labels
line_graph_data$label <- NA
line_graph_data$label[which(line_graph_data$year == max(line_graph_data$year))] <- line_graph_data$country[which(line_graph_data$year == max(line_graph_data$year))]
# Plotting the data
ggplot(data = line_graph_data) +
geom_line(mapping = aes(x = year,
y = life_exp,
colour = country),
linewidth = 1) +
geom_point(mapping = aes(x = year,
y = life_exp,
colour = country),
size = 2) +
geom_text_repel(mapping = aes(x = year,
y = life_exp,
color = country,
label = label),
nudge_x = 1,
na.rm = TRUE) +
scale_colour_manual(values = c("Malawi" = "#801650",
"Zambia" = "#F46A25")) + # adding colours
scale_y_continuous(limits = c(0, NA)) + # axis starts at 0
custom_theme + # adding a theme
theme(panel.grid.major.x = element_blank(), # remove gridlines
legend.position = "none") + # remove legend
labs(title = "Life expectancy in Malawi and Zambia, 1962 - 2007",
caption = "Source: www.gapminder.org",
x = "",
y = "Age (Years)")
4.3 Categorical X, Continuous Y
4.3.1 Bar Charts
The intuition behind bar charts (frequency charts) remains the same as in the Plotting One Variable subchapter, however, now that we have two variables to plot, we will show off the use of geom_col() as opposed to geom_bar(). Remember, geom_col() allows us to specify the x value categories as well as a y value for the heights of the bars.
# Example of geom_col
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp)) # plotting a bar graph
By default, bar graphs use a dark grey for the bars. Similar to before, to change the colour of the bars we use the colour and fill arguments. Best practice for these sections is consistent with the previous section (just using the new arguments in this section instead) so we won’t repeat the code needlessly here. We show off colours here but the best practice palettes etc have been provided throughout.
# Here I am specifying colour
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp),
colour = "white",
fill = "#12436D")
We can also specify our own colours.
# Use scale_fill_manual for colours
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp,
fill = continent)) +
scale_fill_manual(values = c("#12436D",
"#28A197",
"#801650",
"#F46A25",
"#A285D1")) # adding colours
We can label our graphs by adding geom_text() to the graph. It requires a mapping for x, y, and the text itself. By setting vjust (the vertical justification), it is possible to move the text above or below the tops of the bars.
# Bar plot with labels outside the bars
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE)) # Compute the mean which we will label
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp,
fill = continent)) +
geom_text(mapping = aes(x = continent,
y = mean_life_exp,
label = round(mean_life_exp), # adding a label
vjust = -0.3), # position the label
size = 4) + # size of the label
scale_fill_manual(values = c("#12436D",
"#28A197",
"#801650",
"#F46A25",
"#A285D1")) # adding colours
We can also position the labels inside the bars.
# Bar plot with labels inside the bars
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp,
fill = continent)) +
geom_text(mapping = aes(x = continent,
y = mean_life_exp,
label = round(mean_life_exp), # adding a label
vjust = 1.6), # position the label
size = 5) + # size of the label
scale_fill_manual(values = c("#12436D",
"#28A197",
"#801650",
"#F46A25",
"#A285D1")) # adding colours
It often makes sense to turn your bar plot horizontal as it makes the group labels much easier to read. As before, we will utilise coord_flip() for this.
We also introduce the guides() function as a new layer, which flips the legend and its labels to match the new flipped chart, which is especially useful if a legend is essential for your visualisation.
# We use coord flip
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp,
fill = continent)) +
geom_text(mapping = aes(x = continent,
y = mean_life_exp,
label = round(mean_life_exp),
vjust = 0.5,
hjust = "left"),
size = 5) +
coord_flip() + # flips bar chart
scale_fill_manual(values = c("#12436D",
"#28A197",
"#801650",
"#F46A25",
"#A285D1")) + # adding colours
guides(fill = guide_legend(reverse = TRUE)) # flips the legend labels to match the graph
To make the bars narrower or wider, set width in geom_col(), just like we did with geom_bar(). The default value is 0.9; larger values make the bars wider, and smaller values make the bars narrower.
# Modifying the width of the bars
# Manipulating the data
example_data <- gapminder |>
filter(year == 1992) |>
group_by(continent) |>
summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))
# Plotting the data
ggplot(data = example_data) +
geom_col(mapping = aes(x = continent,
y = mean_life_exp,
fill = continent),
width = 0.4) + # modify the width
scale_fill_manual(values = c("#12436D",
"#28A197",
"#801650",
"#F46A25",
"#A285D1")
# adding colours )
4.3.2 Stacked Bar Plot
We can also create stacked bar charts, which are particularly useful when you want to compare multiple part-to-whole relationships in one chart. Each bar (which represents one level of a categorical/factor variable) is split into sections based on another categorical variable, allowing you to see the most and least common categories within it.
By default, the stacking order of the bars is the same as the order of items in the legend. Note that you must convert the variable on the x axis (which you want bars for) to a factor variable if it is a year or something else that R treats as numeric, otherwise we will have a numeric scale which won’t make sense.
To create a stacked bar chart, we supply a categorical variable to the fill aesthetic, which partitions each bar into multiple colours, with the height of each section depending on its value. Here we will stack each bar by continent.
# Year has been treated as numeric here
# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = year,
                         y = my_sum,
                         fill = continent)) +
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) # adding colours
You see here that the x axis has a numeric scale which doesn’t line up with the four bars we have in our chart. This requires tweaking using the as.factor() function we mentioned earlier.
# Converting year to a factor first
# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year), # year converted to factor
                         y = my_sum,
                         fill = continent)) +
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) # adding colours
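A quick way to see why the conversion matters is to inspect the type of the column before and after. The sketch below uses a hand-typed vector of the four years kept by the filter (an illustrative stand-in, not the gapminder column itself); the names years and year_factor are chosen here for the example.

```r
# The four years kept by filter(year >= 1992), typed by hand for illustration
years <- c(1992, 1997, 2002, 2007)

# A numeric vector is given a continuous x axis by ggplot2
is.numeric(years)

# After conversion each distinct year becomes one discrete level
year_factor <- as.factor(years)
is.factor(year_factor)

# The levels are the labels that appear under the bars
levels(year_factor)
```

Because a factor has one discrete position per level, the axis shows exactly one tick per bar instead of a continuous range of numbers.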
We can also reverse the order of the bars by using the position parameter inside geom_col() (note that this is not an aesthetic, so it remains outside of aes()) and setting it to position_stack(reverse = TRUE). In this case you will need to use guides(), as we did earlier, to reverse the order of the legend so that the two match up.
# Reversing the order of a stacked bar chart
# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_stack(reverse = TRUE)) + # reverse the order of the bars
  guides(fill = guide_legend(reverse = TRUE)) + # Reverse order of legend
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) # adding colours
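The summarise() message printed in the chunks above is informational: after summarising over two grouping variables, the result is still grouped by the first one. If you would rather have an ungrouped result and no message, one option is the .groups argument that the message itself points to. The sketch below uses a small invented tibble (toy and toy_summary are names made up for this example, not the gapminder data).

```r
library(dplyr)

# Toy values invented for illustration; not the gapminder data
toy <- tibble(
  year      = c(1992, 1992, 2007, 2007),
  continent = c("A", "B", "A", "B"),
  pop       = c(10, 20, 30, 40)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
toy_summary <- toy |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop")

# FALSE: the result no longer carries any grouping
is_grouped_df(toy_summary)
```

For plotting purposes the grouped and ungrouped results behave the same, so this is a matter of tidiness rather than correctness.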
4.3.2.1 Best Practice Example
# Best practice stacked bar chart
# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_stack(reverse = TRUE)) + # reverse the order of the bars
  guides(fill = guide_legend(reverse = TRUE)) + # Reverse order of legend
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) + # adding colours
  custom_theme +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.key = element_blank(),
        panel.grid.major.x = element_blank()) +
  labs(title = "Population values by Continent (1992-2007)", # matches filter(year >= 1992)
       caption = "Source: www.gapminder.org",
       x = "Year",
       y = "Population")
What we have done in this section thus far adheres to best practices in place for stacked bar charts, but we want to formalise some information here for reference.
Stacked bar charts can be used for data that is not part-to-whole, but care should be taken in this case as they can become cluttered quickly.
If each stacked bar has more than 4 categories, consider grouping categories together before using this technique. The example above has 5 continents, but Oceania is so small that the chart remains readable; if we had, say, 8 categories, it would be much harder to work with.
If you are visualising survey questions on the Likert scale (1 - Strongly Agree, 2 - Agree and so on), this can easily exceed 8 categories, so it is recommended that you aggregate or group some responses together into another category; there is guidance here.
If you aren’t using ordered variables like years or age ranges, you could consider applying an ordering to the factor variable in the stack to get the message across more clearly (after all, information at either end of the bar is easier to interpret than in the middle).
Avoid using stacked bar charts for data that contains negative values. These become cumbersome to interpret: the legend is no longer in a consistent order throughout, and the contrast between stacks changes as well. This is a nightmare for accessibility, so you should separate the data out into multiple charts using facet_wrap() instead.
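The advice above about grouping Likert responses before stacking can be sketched with fct_collapse() from forcats (loaded as part of the tidyverse). The responses and the three-way grouping below are invented purely for illustration; which categories it makes sense to merge depends on your survey and on the guidance you are following.

```r
library(forcats)

# Hypothetical survey responses, invented for illustration
responses <- factor(
  c("Strongly agree", "Agree", "Neutral", "Disagree",
    "Strongly disagree", "Agree", "Disagree"),
  levels = c("Strongly agree", "Agree", "Neutral",
             "Disagree", "Strongly disagree")
)

# Merge the five original levels into three broader ones before plotting
grouped <- fct_collapse(
  responses,
  Agree    = c("Strongly agree", "Agree"),
  Neutral  = "Neutral",
  Disagree = c("Disagree", "Strongly disagree")
)

# Counts per collapsed category
table(grouped)
```

The collapsed factor can then be supplied to the fill aesthetic in the same way as continent was in the examples above.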
4.3.3 Segmented Bar Chart
A segmented bar plot (also called a 100% stacked bar graph) is a stacked bar plot where each bar represents 100 percent. This means every bar has the same height, as the y axis runs from 0 to 1 (a proportion). You can also display the y axis as percentages if you prefer. It is, however, recommended that you use proportions rather than percentages.
This type of plot is particularly useful if the goal is to compare the percentage of a category in one variable across each level of another variable.
Use position = position_fill() (or the shorthand position = "fill") in geom_col() to create a graph with proportions on the y axis.
# Segmented bar chart
# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_fill()) + # creates a segmented plot
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) # adding colours
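To see what the fill position does to the values, it can help to compute the proportions by hand: each value is divided by the total for its bar. The sketch below does this with dplyr on a small invented tibble (toy, group, value and share are names made up for this example, not the gapminder data).

```r
library(dplyr)

# Toy values invented for illustration; not the gapminder data
toy <- tibble(
  year  = c(1992, 1992, 2007, 2007),
  group = c("A", "B", "A", "B"),
  value = c(30, 70, 40, 160)
)

# Share of each group within its year: value divided by that year's total
shares <- toy |>
  group_by(year) |>
  mutate(share = value / sum(value)) |>
  ungroup()

shares
```

Within each year the shares sum to 1, which is why every bar in a segmented chart reaches the same height.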
To print the labels as percentages, use scale_y_continuous(labels = scales::percent); this is why we load the scales package.
# Segmented bar chart - percentage
# plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(
    mapping = aes(x = as.factor(year),
                  y = my_sum,
                  fill = continent),
    position = position_fill()
  ) +
  scale_y_continuous(labels = scales::percent) + # shows y axis in percentages
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) # adding colours
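The labelling function can also be tried on its own, outside of a plot, to check what the axis labels will look like. This is a minimal sketch on a few hand-typed proportions; props, as_percent and rounded are names chosen for the example.

```r
library(scales)

# A few proportions typed by hand for illustration
props <- c(0, 0.25, 0.5, 1)

# percent() turns proportions into percentage strings
as_percent <- percent(props)
as_percent

# label_percent() builds a reusable formatter; accuracy = 1 rounds to whole percentages
rounded <- label_percent(accuracy = 1)(0.256)
rounded
```

A formatter built with label_percent() can be passed to the labels argument of scale_y_continuous() in the same way as scales::percent.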
4.3.3.1 Best Practice
# Segmented bar chart
# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_fill()) + # creates a segmented plot
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) + # adding colours
  custom_theme +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.key = element_blank(),
        panel.grid.major.x = element_blank()) +
  labs(title = "Population Proportions by continent (1992-2007)",
       caption = "Source: www.gapminder.org",
       x = "Year",
       y = "Proportion of Population")
4.3.4 Grouped Bar Plot
Grouped bar graphs are similar to stacked bar graphs; the only difference is that the grouped bar graph provides a bar for each category instead of stacking them into the same bar.
4.3.4.1 Example - Basic Plot
You have to set position = position_dodge() in the geom_col() function to go from one to the other. This tells the bars to ‘dodge’ each other horizontally; if you don’t do this, you’ll end up with a stacked plot like the one in section 4.3.2.
# Grouped bar chart
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_dodge(1)) + # creates a grouped bar plot
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) # adding colours
For grouped bars, there is no space between the bars within each group by default. You can change this by passing a numeric dodge width to position_dodge(), where:
- a value smaller than the width of the columns makes the bars within a group overlap
- a value larger than the width of the columns pushes the bars away from each other
To add space, therefore, ensure that the dodge value is larger than the width of the columns, for example by reducing width in geom_col() while keeping position_dodge(1).
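The relationship between the dodge width and the column width can be checked by looking at the bar edges ggplot2 computes. The sketch below uses a small invented data frame (toy, p_gap and bars are names made up for this example) and layer_data(), which returns the data a layer is drawn from, including the xmin and xmax of each bar.

```r
library(ggplot2)

# Toy values invented for illustration; not the gapminder data
toy <- data.frame(
  year  = factor(c("1992", "1992", "2007", "2007")),
  group = c("A", "B", "A", "B"),
  value = c(3, 5, 4, 6)
)

# Dodge width (1) larger than column width (0.7): a gap appears within each group
p_gap <- ggplot(toy) +
  geom_col(aes(x = year, y = value, fill = group),
           width = 0.7,
           position = position_dodge(width = 1))

# Inspect the computed left and right edges of each bar
bars <- layer_data(p_gap)
bars <- bars[order(bars$xmin), ]
bars[, c("x", "xmin", "xmax")]
```

If the right edge (xmax) of one bar is smaller than the left edge (xmin) of the next bar in the same group, there is a visible gap between them; swapping the two widths would make the bars overlap instead.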
4.3.4.2 Best Practice
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           width = 0.7,
           position = position_dodge(1)) + # Light gap between bars
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) + # adding colours
  custom_theme +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.key = element_blank(),
        panel.grid.major.x = element_blank()) +
  labs(title = "Population Segmented by continent",
       caption = "Source: www.gapminder.org",
       x = "Year",
       y = "Population")
4.3.5 Box Plots
Box plots are some of the most commonly used chart types for comparing distributions across several groups of a categorical variable (remember that histograms are for single continuous variables).
They are also a great way to visually present quartiles in your data. They are often referred to as box-and-whisker plots, as they usually have lines extending from the box (known as the whiskers), which denote additional variability outside of the quartiles. The top edge of the box is the upper quartile (75th percentile), the lower edge is the lower quartile (25th percentile) and the bold line in the centre is the median (50th percentile).
The points outside of the box and whiskers are usually flagged as outliers or anomalous data points and as such this is a great visual way to check for this before statistically determining it with calculations/tests (or make for a great piece of supporting evidence!).
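The quantities a box plot draws can be computed by hand, which makes the picture easier to read. The sketch below uses eight invented values with one deliberately extreme point, and assumes the common convention (the one ggplot2 documents for its default whiskers) that points lying more than 1.5 times the interquartile range beyond the box are shown as outliers. The variable names are chosen for this example.

```r
# Eight values invented for illustration, with one deliberately extreme point
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

# Lower quartile, median and upper quartile: the three lines of the box
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the height of the box
iqr <- q[["75%"]] - q[["25%"]]

# Fences at 1.5 * IQR beyond the box edges
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Values beyond the fences would be drawn as individual outlier points
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Here the quartiles are 4.75, 6.5 and 8.25, so the fences sit at -0.5 and 13.5 and only the value 30 falls outside them.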
We use geom_boxplot() where you must specify a continuous variable for the y axis, and a categorical variable for the x axis.
# Example of basic box plot
# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992)
# Plotting the data
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)) +
  geom_boxplot() # creates a box plot
We can amend the outlier colour, shape and size by specifying the following arguments within geom_boxplot().
# Change outlier colour, shape and size
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)) +
  geom_boxplot(outlier.colour = "red", # outlier colour
               outlier.shape = 4, # outlier shape
               outlier.size = 4) # outlier size
We can make the width of each box reflect the sample size of its group. To do this we set the varwidth argument (a boolean, TRUE or FALSE) within geom_boxplot(); the boxes are then drawn with widths proportional to the square root of the number of observations in each group.
# Using varwidth argument
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)) +
  geom_boxplot(varwidth = TRUE) # box widths reflect the sample size
It is worth noting that whilst this is helpful, it can make smaller groups much harder to observe, because the boxes for groups with few observations become very narrow.
We can also add points to our box plot. This gives us a better idea of the number of measurements and of their distribution. We do this by adding a new layer geom_jitter() to our boxplot, which overlays the points to the boxplot.
# We can add points on top of our boxplot
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)) +
  geom_boxplot() +
  geom_jitter() # add a layer of points on top of the box plots
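One thing to watch when overlaying points is that geom_boxplot() already draws its outliers, so those observations appear twice once the jittered points are added. A possible refinement, sketched below on invented data (toy and p_jitter are names made up for this example), is to hide the box plot's own outlier points and to limit the jitter to the horizontal direction so that the y values stay exact.

```r
library(ggplot2)

set.seed(42) # fix the seed so the toy data and the jitter are reproducible

# Toy values invented for illustration; not the gapminder data
toy <- data.frame(
  group = rep(c("A", "B"), each = 20),
  value = c(rnorm(20, mean = 60, sd = 5), rnorm(20, mean = 70, sd = 8))
)

p_jitter <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) + # hide outliers so they are not drawn twice
  geom_jitter(width = 0.15, # spread the points sideways only
              height = 0, # leave the y values untouched
              alpha = 0.5) # transparency to reduce overplotting

p_jitter
```

The width and alpha values here are starting points rather than recommendations; adjust them to suit the number of points in your data.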
We can add colour to our box plot by specifying the fill argument within geom_boxplot().
# Add colour to boxplot
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)) +
  geom_boxplot(fill = "#f37735") # add a colour
We can also map the colour to a variable so that we get a different colour for each box plot. Remember that this means we specify fill within aes(), as opposed to outside it when we want a single colour for all boxes.
# We can also map the colour to a variable
ggplot(data = example_data, aes(x = continent,
                                y = life_exp,
                                fill = continent)) + # map fill to continent
  geom_boxplot()
We can also get horizontal box plots by flipping the axes using coord_flip() that we saw earlier with bar charts.
# Example of a flipped boxplot
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp,
                                          fill = continent)) +
  geom_boxplot() +
  coord_flip() # this flips our graph
We can also create grouped box plots so that for each level of the categorical variable, we can then apply a grouping and output multiple boxplots for each category, effectively allowing us to use two categorical variables and gain more insight. However, this can become unruly fast and hard to read, so always keep this in mind if applying this technique.
Remember that the variable we split by to get the grouped boxplots needs to be categorical, so in the case that we use year like we have here (which is interpreted as a numeric), you need to use as.factor() when specifying it to convert it first.
# Creating grouped box plots
# Manipulating the data
example_data <- gapminder |>
  filter(year %in% c(1992, 2002, 2007))
# Plotting the data
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp,
                                          fill = factor(year))) + # grouped box plot
  geom_boxplot()
4.3.5.1 Best Practice
There are no specific guidelines for this plot type, so applying our best practice theming and colours should suffice here.
# data manipulation
example_data <- gapminder |>
  filter(year == 1997)
# plotting the data
ggplot(data = example_data, aes(x = continent,
                                y = life_exp,
                                fill = continent)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#12436D",
                               "#28A197",
                               "#801650",
                               "#F46A25",
                               "#A285D1")) + # adding colours
  custom_theme +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank()) +
  labs(title = "Life expectancy distribution (1997)",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (years)")
5 Summary
Wonderful work on the plentiful plots presented to you in this chapter! This is a great reference piece you can return to whenever you need to consider how to visualise your data and which options are available. There are of course a few plot types that we didn’t cover (you may notice the absence of pie charts); these are instead contained in the Reference Materials Chapter 6, as they are considered last-resort plot types that are not recommended unless absolutely necessary.