Chapter 4 - Creating a Variety of Plot Types


1 Introduction

Up to now we have built up the scatterplot from the ground up, showing how to add many elements and layers to the plot, how to customise it with aesthetics and themes, as well as customising those to ensure accessibility. What we haven’t seen is the massive number of other geometric layers at our disposal and how best to utilise them. We will be covering a fair few in this chapter that you see all the time in government publications.

Further plot types are covered in the Reference Materials Chapter 5, but these are less recommended and should only be used when absolutely necessary for the message you are trying to convey.

2 Preliminaries

2.1 Packages & Fonts

As always, the first port of call is to bring in the packages we need for this session.

library(tidyverse) # Data manip and visualisation
library(janitor) # To clean column names
library(scales) # Auto determination of axes and legend labels
library(patchwork) # Combining plots together
library(ggrepel) # Repel labels away from graphs
library(showtext) # Text rendering

# Set the font

font_add(family = "Arial", regular = "C:\\Windows\\Fonts\\Arial.ttf")

showtext_auto()

2.2 Data

Next is to read in the dataset that we will use, which is the same as we have used in previous chapters, the gapminder data. We will clean column names with janitor and drop the NA values as we have done previously.

# Importing data using the read_csv function 

gapminder <- read_csv("./data/gapminder.csv")

# Cleaning column names using the clean_names function

# Dropping missing values using the drop_na function 

gapminder <- clean_names(gapminder) |>
             drop_na()

2.3 Theme

We also need to redefine our custom_theme object we created previously, as it will be used throughout for the best practice guidance in this chapter.

custom_theme <- theme(
 # Text elements
 plot.title = element_text(family = "Arial", # Set font family
                            size = 16, # Set font size
                            face = "bold", # Bold typeface
                            hjust = 0, # Horizontal adjustment, Left align
                            vjust = 1), # Vertical Adjustment, raises it slightly

 # Subtitle
 plot.subtitle = element_text(family = "Arial", # Font family
                               size = 14, # Font size
                               hjust = 0), # Horizontal adjustment, this will left align
 # Caption
 plot.caption = element_text(family = "Arial", # Font family
                              size = 9), # Font size
 # Axis titles
 axis.title = element_text(family = "Arial", # Font family
                            size = 12), # Font size
 # Axis text
 axis.text = element_text(family = "Arial", # Font family
                          size = 10,
                          colour = "black"),
 panel.background = element_blank(), # removes background
 
 legend.key = element_blank(), # make the legend background blank
 
 panel.grid.major = element_line(color = "#D9D9D9"), # Adding major grid lines with colour specified
 axis.line.x = element_blank(),
 axis.line.y = element_blank()
 )

3 Geometries for Plotting One Variable

Throughout this chapter we will make the distinction on whether data is continuous or discrete as this changes the recommended plot type dramatically, but we are aware that not everyone is familiar with this terminology and the details, so we will reinforce this here.

Discrete data can take only a countable set of distinct values (such as age in whole years), whereas continuous data can take any value within a range and is measured rather than counted (such as height and weight). You may also hear discrete data treated as categorical data, as each unique value can act as a category of that variable.

3.1 Continuous Data

3.1.1 Histogram

A histogram is a visual representation of the distribution of a dataset. It groups continuous data into ranges and plots this data as bars. The height of each bar shows the number of observations within each range. It will show the most common observations and also show the shape of the data. We can utilise these to answer the following questions:

  • What are the smallest and largest values?

  • What is the “center” or “most typical” value?

  • How do the values spread out?

  • What are frequent and infrequent values?

In a histogram, the total range of the data set (i.e. from minimum value to maximum value) is divided into equal parts. These equal parts are known as bins.

  • We first cut up the x-axis into a series of bins, where each bin represents a range of values.

  • For each bin, we count the number of observations that fall in the range corresponding to that bin.

  • Then for each bin, we draw a bar whose height marks the corresponding count.

For example, if you were creating a histogram of age, the bins might be 0-4, 5-9, 10-14, and so on up to 80-84, so that every bin has the same width.
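The binning steps above can be sketched in base R. The ages below are invented purely for illustration, and cut() with right = FALSE is only one way of forming the bins; geom_histogram() carries out its own binning internally.

```r
# Hypothetical ages, invented for illustration only
ages <- c(2, 3, 7, 8, 8, 12, 14, 21, 23, 24)

# Step 1: cut the range into equal bins, each 5 years wide
# right = FALSE means each bin includes its lower edge but not its upper edge
bin_edges <- seq(from = 0, to = 25, by = 5)
age_bins <- cut(ages, breaks = bin_edges, right = FALSE)

# Step 2: count the observations that fall into each bin
bin_counts <- table(age_bins)

# Step 3: these counts are the bar heights a histogram would draw
print(bin_counts)
```

Every observation lands in exactly one bin, so the counts always add up to the number of observations.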

If we want to create a histogram with the ggplot2 package, we need to use the geom_histogram() function.

3.1.1.1 Examples

By default, geom_histogram() will divide your data into 30 equal bins or intervals. A warning pops up that “stat_bin() using bins = 30” to inform you of this. Unfortunately, this default value isn’t always appropriate for the message we are trying to convey, so we can use the optional arguments:

  • binwidth - If this is too small, the histogram becomes overly peaky and visually busy, obscuring the trends in the data. If this is too large, we lose major trends in the data.

  • bins - The number of bins can also heavily impact this.

It is recommended to start with the default and tweak it from there to obtain the visualisation you want. There are also AF guidelines around this, which we will discuss later.
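To see how the two arguments relate, here is a minimal sketch. It assumes faithful, a dataset that ships with R, as a stand-in for gapminder, and uses layer_data() from ggplot2 to inspect the bins that were computed. The values 10 and 2 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# faithful ships with R; it stands in for gapminder in this sketch
# bins fixes how many bars we get
p_bins <- ggplot(faithful) +
  geom_histogram(mapping = aes(x = waiting),
                 bins = 10)

# binwidth fixes how wide each bar is, in the units of the x variable
p_width <- ggplot(faithful) +
  geom_histogram(mapping = aes(x = waiting),
                 binwidth = 2)

# layer_data() returns the binned data ggplot2 computed for each plot
n_bars_bins <- nrow(layer_data(p_bins))
n_bars_width <- nrow(layer_data(p_width))
```

Both plots count exactly the same observations; only the way they are grouped changes, so a narrower binwidth simply produces more bars.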

# We use geom_histogram

filter(gapminder, year == 1992) |>
  ggplot() +
  geom_histogram(mapping = aes(x = life_exp)) # Plotting a histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can get a clearer picture of the relative densities of the distribution when using different binwidths.

Changing the number of bins can highlight more trends in the data that are lost when summarising to a smaller number of bins.

# Here I am changing the number of bins to 40, using the bins argument

filter(gapminder, year == 1992) |>
  ggplot() +
  geom_histogram(mapping = aes(x = life_exp),
                 bins = 40)

We can also get a horizontal histogram by flipping the axes using the incredibly useful “coord_flip()” function. This is added as a layer to our ggplot code and impacts many of the geometric layers we can use.

# Example of a flipped histogram

filter(gapminder, year == 1992) |>
  ggplot() +
  geom_histogram(mapping = aes(x = life_exp),
                 bins = 15) +
  coord_flip() # This flips our graph

It is also difficult at times to distinguish the different bars that make up a histogram. Since histograms represent continuous numerical variables, there are no gaps between the bars, unlike a bar chart (which represents discrete or categorical variable counts).

To adjust the colours of your histogram, add the arguments colour and fill, together with the desired colour:

  • colour here refers to the lines around the bars

  • fill is the colour inside the bars.

In the example below, I have used black as the line around the bars and hex value “#f85100” as the colour of my bars.

# Draw with black outline and orange fill

filter(gapminder, year == 1992) |>
  ggplot() +
  geom_histogram(mapping = aes(x = life_exp),
                 bins = 40,
                 colour = "black", # Colour of outlines of bars
                 fill = "#f85100") # Colour of the bars

3.1.1.2 Best Practice

Use small gaps between the bars to emphasise the profile of the data.

To avoid writing out all of the best practice settings we looked at in previous sections, we will use the theme we created earlier. This greatly reduces code repetition.

# We can now add our theme to our plot as a layer

filter(gapminder, year == 1992) |>
  ggplot() +
  geom_histogram(mapping = aes(x = life_exp),
                 binwidth = 1,
                 colour = "white",
                 fill = "#12436D", # Dark blue recommended by AF
                 alpha = 0.8) + 
  scale_y_continuous(breaks = pretty_breaks()) +
  scale_x_continuous(breaks = seq(from = 0, # seq function to modify breaks
                                  to = 80,
                                  by = 10)
                     ) +
  labs(title = "Life expectancy distribution, 1992",
       caption = "Source: www.gapminder.org",
       x = "Age (years)",
       y = "") +
  custom_theme # theme added as a layer

3.2 Categorical Data

3.2.1 Bar/Frequency Chart

Bar charts are one of the most commonly used data visualizations. They show the distribution of categorical variables and can have either horizontal or vertical bars. There’s an important distinction you should be aware of when making bar graphs: sometimes the bar heights represent counts of cases in the data set, and sometimes they represent values in the data set.

If you simply want to show a count of the values, use the geom_bar() function. If you have a value for both the x axis and the y axis (to show the values themselves), use geom_col(), which makes the heights of the bars represent values in the data. For more information refer to the Bar Charts section of the tidyverse documentation.

In previous versions of ggplot2, the recommended way to create a bar graph of values was to use geom_bar(stat = “identity”). As of ggplot2 2.2.0 and beyond, there is a geom_col() function which does the same thing.

Throughout this section we are only plotting one variable and as such don’t have a y value for the height, so we will just use ‘geom_bar()’ here.
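The distinction between the two functions can be sketched as follows. The toy_countries data is hypothetical and invented for illustration, and layer_data() is used only to inspect the bar heights that ggplot2 computes.

```r
library(ggplot2)
library(dplyr)

# Hypothetical data, invented for illustration: one row per country
toy_countries <- tibble(
  country   = c("A", "B", "C", "D", "E", "F"),
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Africa")
)

# geom_bar() counts the rows in each category for us
p_bar <- ggplot(toy_countries) +
  geom_bar(mapping = aes(x = continent))

# geom_col() expects the heights to already be a column in the data,
# so we count first; count() adds a column called n
toy_counts <- count(toy_countries, continent)

p_col <- ggplot(toy_counts) +
  geom_col(mapping = aes(x = continent,
                         y = n))

# The bar heights are the same in both plots
layer_data(p_bar)$y
layer_data(p_col)$y
```

In other words, geom_bar() is the convenient choice when the data has one row per observation, and geom_col() when the data has already been summarised.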

3.2.1.1 Examples

Note that we create a boolean column here so that we have categorical data to plot.

# Example of basic bar chart

filter(gapminder, year == 1997) |>
  mutate(above_mean_life_exp = life_exp >= mean(life_exp)) |> # creates new column in our tibble, a boolean
  ggplot() +
  geom_bar(mapping = aes(x = above_mean_life_exp)
           ) # plotting a bar graph

As another example, we can plot a count of how many countries are in each continent.

# Example of basic bar chart - counts

filter(gapminder, year == 1997) |>
  ggplot() +
  geom_bar(mapping = aes(x = continent)
           ) # plotting a bar graph

The width argument of the geom_bar() function allows us to control the bar width. It ranges between 0 and 1, with 1 being full width. The default value is 0.9; larger values make the bars wider, and smaller values make the bars narrower.

# We set the width argument outside the aes()

filter(gapminder, year == 1997) |>
  ggplot() +
  geom_bar(mapping = aes(x = continent),
           width = 0.4) # we can modify the bar width

It often makes sense to turn your bar plot horizontal. Indeed, it makes the group labels much easier to read in line with the guidance from the Analysis Function. We’re just as good at comparing the sizes in vertical and horizontal bar charts, but horizontal bar charts are good if we have long category names, or have lots of categories which don’t fit on a standard portrait-oriented page.

To do this in practice, we make use of ‘coord_flip()’ as we did with Histograms.

# Example of a flipped bar chart

filter(gapminder, year == 1997) |>
  ggplot() +
  geom_bar(mapping = aes(x = continent)
           ) +
  coord_flip() # This flips our graph

By default, bar graphs use a dark grey for the bars and there is no outline around the fill. Just like with histograms, to adjust the colours of your bar graph, add the arguments colour and fill, together with the desired colour.

  • colour here refers to the lines around the bars

  • fill is the colour inside the bars.

Here I am using the hex colour code for ONS green with a black outline.

# Draw with black outline and green fill

filter(gapminder, year == 1997) |>
  ggplot() +
  geom_bar(mapping = aes(x = continent),
           colour = "black",
           fill = "#a8bd3a")

3.2.1.2 Best Practice Examples

  • The y-axis on a bar chart should always start at zero. The length of bars relative to the axis gives an instant understanding of the relative sizes of categories.

  • Use bar charts to show data that are ranked, in either ascending or descending order. Horizontal bars should be used.

  • A bar chart should always be ranked by value, unless there is a natural order to the data (for example, age or time).

# We can now add our theme to our plot as a layer

filter(gapminder, year == 1997) |>
  ggplot() +
  geom_bar(mapping = aes(x = continent), 
           fill = "#12436D") +
  coord_flip() + # flip graph
  custom_theme + # best practice theme
  theme(
    legend.position = "none", # amending theme
    axis.title = element_text(hjust = 1), # align axis titles
    panel.grid.major.y = element_blank() # remove major gridlines on y axis
  ) +
  labs(title = "Number of Countries In Each Continent", # labels
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Number of Countries")

To highlight the highest values, the largest value should be at the top of the chart. There are lots of ways to do this; here I have created a new column holding continent as a factor whose levels are in a specified order.

# Creating a new column and creating factor levels

example_data <- gapminder |>
  filter(year == 1997) |>
  group_by(continent) |>
  arrange(continent) |>
  mutate(level_continent = factor(continent, 
                                  levels = c("Oceania", "Americas", "Asia", "Europe", "Africa")))

example_data |> ggplot() +
  geom_bar(mapping = aes(x = level_continent),
           fill = "#12436D") + # plotting level continent
  coord_flip() +
  custom_theme +
  theme(legend.position = "none",
        axis.title = element_text(hjust = 1),
        panel.grid.major.y = element_blank()
        ) +
  labs(title = "Number of Countries In Each Continent",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Number of Countries")

Of course, we can reverse the order of the factor levels if we want to show highest to lowest value instead.
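Typing the levels out by hand works for a handful of categories, but the order can also be derived from the counts. The sketch below is one possible approach using fct_infreq() and fct_rev() from the forcats package (part of the tidyverse); the continent values are hypothetical and invented for illustration.

```r
library(forcats)

# Hypothetical continent values, invented for illustration
continent <- c("Asia", "Africa", "Africa", "Europe", "Asia", "Africa")

# fct_infreq() orders the levels from most to least frequent
# fct_rev() reverses them so the most frequent level comes last,
# which coord_flip() then draws at the top of the chart
level_continent <- fct_rev(fct_infreq(continent))

levels(level_continent)
```

Dropping fct_rev() gives the opposite order, with the most frequent level first.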

If you find yourself labeling individual bar values, consider using a table instead. If you do add bar values, make sure the labels are aligned at the bottom of the bars. This will enable easy comparison.

To add labels to a bar graph of counts rather than values, we use geom_bar(), which adds bars whose height is proportional to the number of rows, and then add geom_label() as another layer showing the counts.

For the label itself, we can use the after_stat() functionality from ggplot to specify counting the number of observations in each category.

# Labelling the values on bars

example_data <- gapminder |>
  filter(year == 1997) |>
  group_by(continent) |>
  arrange(continent) |>
  mutate(level_continent = factor(continent, 
                                  levels = c("Oceania", "Americas", "Asia", "Europe", "Africa")))


example_data |> ggplot() +
  geom_bar(mapping = aes(x = level_continent), 
           fill = "#12436D") +
  coord_flip() +
  custom_theme +
  theme(legend.position = "none",
        axis.title = element_text(hjust = 1),
        panel.grid.major.y = element_blank()
        ) +
  labs(title = "Number of Countries In Each Continent",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Number of Countries") +
  geom_label(mapping = aes(x = level_continent, 
                          label = after_stat(count)), 
             y = 0, 
             colour = "white", 
             fill = NA, 
             label.size = NA, 
             size = 5, 
             hjust = 0,
             stat = "count") 

There are other methods for doing this and for adjusting positions. One example is geom_text(), which can be a little more intuitive.

We can also prepare the data differently, adding a count column and pointing to that for labelling purposes. You can do this by creating a new column using the n() function in mutate().
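A minimal sketch of the count column approach is shown here. The toy_countries data and the n_countries column name are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical data, invented for illustration: one row per country
toy_countries <- tibble(
  country   = c("A", "B", "C", "D", "E", "F"),
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Africa")
)

# n() gives the number of rows in the current group,
# so every row receives the size of its own continent
toy_labelled <- toy_countries |>
  group_by(continent) |>
  mutate(n_countries = n()) |>
  ungroup()

toy_labelled
```

The n_countries column could then be mapped to the label aesthetic of geom_text() or geom_label() in place of after_stat(count).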

4 Plotting with Two Variables

4.1 Continuous X, Continuous Y

4.1.1 Scatter Plots

The simplest display of two quantitative variables is a scatter plot, with each variable represented on an axis. We have seen how to build one of these in Chapter 2, but what we didn’t cover were some more niche best practice guidelines, which we will reinforce here.

4.1.1.1 Best Practice

Use a dot (or other symbol) plot to make comparisons between values. The position of each dot along the axes represents the data value itself.

# Plotting a scatterplot

filter(gapminder, year == 2007) |>
  ggplot() +
  geom_point(mapping = aes(x = gdp_percap,
                           y = life_exp), 
             colour = "#28A197") +
  scale_x_continuous(labels = scales::dollar) +
  custom_theme +
  theme(panel.grid.major.x = element_blank()
        ) +
  labs(title = "Relationship between life expectancy and income, 2007",
       caption = "Source: www.gapminder.org",
       x = "GDP per capita",
       y = "Age (Years)")

4.2 Continuous Function

A graph of a continuous function differs from that of a continuous variable in general, in that we see how one continuous variable changes in relation to another (how something changes over time, for example). In this case we can join the points together to visualise the journey the data takes.

4.2.1 Line Plots

We often use line graphs to represent this type of data; they work particularly well for presenting time series.

4.2.1.1 Examples

Use ggplot() with geom_line(), and specify which variables are mapped to x and y.

# We use geom_line

gapminder |>
  filter(country %in% c("Colombia")) |>
  ggplot() +
  geom_line(mapping = aes(x = year, 
                          y = gdp_percap)
            ) # Plotting a line graph

We can amend the colour, width and type of the line by specifying the arguments:

  • colour
  • linewidth (called size before ggplot2 3.4.0)
  • linetype

These need to be specified outside the aes() function.

# Amend colour, size and linetype

gapminder |>
  filter(country %in% c("Colombia")) |>
  ggplot() +
  geom_line(mapping = aes(x = year,
                          y = gdp_percap),
            colour = "orange", # Changing the colours
            linewidth = 2, # amending the size
            linetype = "dashed") # amending the type of the line

Other line types are presented in this document for use.

We can also plot multiple lines to visualise how continuous functions change but for different levels of a categorical variable. We do this by mapping colour to a variable just like we did in Chapter 2.

# Plotting multiple line charts with colour argument

gapminder |>
  filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
  ggplot() +
  geom_line(mapping = aes(x = year,
                          y = gdp_percap,
                          colour = country)) # colour mapped to country

Mapping a variable to more than one property makes each line differ in several ways. Here I am also mapping country to linetype, so each country has a different linetype as well as a different colour.

# Plotting multiple line charts with colour and linetype varying

gapminder |>
  filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
  ggplot() +
  geom_line(mapping = aes(x = year,
                          y = gdp_percap,
                          colour = country,
                          linetype = country)
            ) # linetype mapped to country

We can also specify our own colours for each line using scale_colour_manual that we met in chapter 2.

# Using scale colour manual to specify my own colours

gapminder |>
  filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
  ggplot() +
  geom_line(mapping = aes(x = year,
                          y = gdp_percap,
                          colour = country)
            ) +
  scale_colour_manual(values = c("Colombia" = "#bb4430", # specify own colours
                                 "Chile" = "#7ebdc2",
                                 "Argentina" = "#bcb3b5",
                                 "Brazil" = "#e5ba38",
                                 "Peru" = "#a8bd3a",
                                 "Ecuador" = "#003d59")
                      )

Sometimes it is useful to indicate each data point on a line graph. This is helpful when the density of observations is low, or when the observations do not happen at regular intervals. We do this by adding a geom_point() layer to the plot, specifying the same aesthetics as the geom_line() layer (these could also go inside the ggplot() for global aesthetics).

# Here I am adding points on top of the lines

gapminder |>
  filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
  ggplot() +
  geom_line(mapping = aes(x = year,
                          y = gdp_percap, 
                          colour = country)
            ) +
  geom_point(mapping = aes(x = year, 
                           y = gdp_percap, 
                           colour = country)
             ) + # add a layer of points
  scale_colour_manual(values = c( "Colombia" = "#bb4430",
                                  "Chile" = "#7ebdc2",
                                  "Argentina" = "#bcb3b5",
                                  "Brazil" = "#e5ba38",
                                  "Peru" = "#a8bd3a",
                                  "Ecuador" = "#003d59")
                      )

The default shape for points is a solid circle, the default size is 1.5, and the default colour is black.

The fill colour is relevant only for some point shapes (numbered 21–25), which have separate outline and fill colours (see section 2.2.3 setting shapes for more information on the shapes).

# Altering shape and fill

gapminder |>
  filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) |>
  ggplot() +
  geom_line(mapping = aes(x = year, 
                          y = gdp_percap, 
                          colour = country)
            ) +
  geom_point(mapping = aes(x = year,
                           y = gdp_percap,
                           colour = country),
             shape = 21, # adding circles
             fill = "white") + # this fills the circles white
  scale_colour_manual(values = c("Colombia" = "#bb4430",
                                 "Chile" = "#7ebdc2",
                                 "Argentina" = "#bcb3b5",
                                 "Brazil" = "#e5ba38",
                                 "Peru" = "#a8bd3a",
                                 "Ecuador" = "#003d59")
                      )

TIP - If the points and lines have different colors, you should specify the points after the lines, so that they are drawn on top. Otherwise, the lines will be drawn on top of the points.

4.2.1.2 Best Practice

  • A line chart will emphasise the overall pattern of the data and highlight trends. Use them when you have lots of data points or just a few. Time Series should always be represented using this method.

  • There is no definitive thickness measure for lines, but ensure they don’t appear too thin in the final chart.

  • When you can, label lines directly rather than using a legend. If a legend is unavoidable, place it near the lines within the body of the graph.

  • Use a dot plot with a line when there are few data points or the interval between data points is not equal. Show if data are irregular. Overuse of these makes charts cluttered, particularly if there are multiple lines.

  • Similarly, using textured lines like dotted or dashed lines can clutter a line chart, so should be avoided if possible.

  • Time series charts do not have to begin at zero. If a chart does not start at zero, this must be indicated by breaking the y-axis in an obvious way.

It can be difficult to differentiate lines on the line graph if you have lots of categories. Textures, colour and marks are all options, but it is essential that the graph is not cluttered and is still accessible. Additionally, users may assign meaning to different colours or textures, which can be lost on those with low vision, or misconstrued by those who can view them.

# Plotting a line chart
# following best practice

# filtering the data
line_graph_data <- gapminder |>
  filter(country == "Malawi")


# Plotting the data
ggplot(data = line_graph_data) +
  geom_line(mapping = aes(x = year, 
                          y = life_exp),
            colour = "#12436D", 
            linewidth = 1) +
  geom_point(mapping = aes(x = year,
                           y = life_exp), 
             colour = "#12436D", 
             size = 2) +
  scale_y_continuous(limits = c(0, max(line_graph_data$life_exp)
                                )
                     ) + # axis starts from 0
  custom_theme + # adding our theme
  theme(panel.grid.major.x = element_blank()) + # removing gridlines
  labs(title = "Life expectancy in Malawi, 1962 - 2007",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (Years)")

We can also add a legend to our graph if absolutely necessary.

# Adding a legend

# filtering the data
line_graph_data <- gapminder |>
  filter(country %in% c("Malawi", "Zambia"))

# Plotting the data
ggplot(data = line_graph_data) +
  geom_line(mapping = aes(x = year,
                          y = life_exp, 
                          colour = country), 
            linewidth = 1) +
  geom_point(mapping = aes(x = year, 
                           y = life_exp,
                           colour = country), 
             size = 2) +
  scale_colour_manual(values = c("Malawi" = "#12436D", # adding colours
                                 "Zambia" = "#F46A25")
                      ) +
  scale_y_continuous(limits = c(0, NA)
                     ) + # axis starts at 0
  custom_theme + # adding a theme
  theme(panel.grid.major.x = element_blank(), # removing grid lines
        legend.position = "inside", # placing the legend inside the graph (ggplot2 3.5.0 or later)
        legend.position.inside = c(0.9, 0.5), # legend position as x and y proportions
        legend.title = element_blank(), # removing the legend title
        legend.background = element_blank(), # removing the legend background
        legend.key = element_blank() # removing the legend key
        ) +
  labs(title = "Life expectancy in Malawi and Zambia, 1962 - 2007",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (Years)")

We can add more lines to our graph.

# Adding more lines
# If the title is long, it can be split into multiple lines with \n

# filtering the data
line_graph_data <- gapminder |>
  filter(country %in% c("Malawi", "Zambia", "Tanzania", "Zimbabwe"))

# Plotting the data
ggplot(data = line_graph_data) +
  geom_line(mapping = aes(x = year, 
                          y = life_exp, 
                          colour = country),
            linewidth = 1) +
  geom_point(mapping = aes(x = year,
                           y = life_exp,
                           colour = country),
             size = 2) +
  scale_colour_manual(values = c("Malawi" = "#12436D", # adding colours
                                 "Zambia" = "#28A197",
                                 "Tanzania" = "#801650",
                                 "Zimbabwe" = "#F46A25")
                      ) +
  scale_y_continuous(limits = c(0, NA)) + # axis starts at 0
  custom_theme + # adding a theme
  theme(panel.grid.major.x = element_blank(), # removing gridlines
        legend.position = "inside", # placing the legend inside the graph (ggplot2 3.5.0 or later)
        legend.position.inside = c(0.9, 0.3), # legend position as x and y proportions
        legend.title = element_blank(), # removing the legend title
        legend.background = element_blank(), # removing the legend background
        legend.key = element_blank() # removing the legend key
        ) +
  labs(title = "Life expectancy in Malawi, Zambia, Tanzania and Zimbabwe,\n1962 - 2007",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (Years)")

We can add labels to our lines using geom_text_repel() from the ggrepel package, but this takes a little more work. We need a new column holding the label names, but with only one label per line (two in total here), so we start off with a column of missing (NA) values.

We then fill in the label for the rows where the year column is at its maximum, using the country value from those rows (this way we only have two labels and the rest stay as NA). The which() function returns the row indices where the year equals its maximum, which is why it is used here.

The geom_text_repel() function repels the labels away from the data points and from each other, whereas with geom_text() and geom_label() the labels can overlap the lines, which is much harder to correct afterwards. It takes a few arguments:

  • The same aesthetic mapping as the prior layers if we aren’t using global aesthetics. It also needs a label aesthetic, which we map to our label column.

  • nudge_x (or nudge_y) allows us to nudge the label slightly in case the lines are close together (like they are here, which will cause the labels to hit the line regardless).

  • na.rm = TRUE silently drops the rows with missing (NA) values in our label column, rather than warning about them.

# Adding labels on lines

# filtering the data
line_graph_data <- gapminder |>
  filter(country %in% c("Malawi", "Zambia")) 

# Set up labels
line_graph_data$label <- NA

last_year_rows <- which(line_graph_data$year == max(line_graph_data$year)) # rows for the final year

line_graph_data$label[last_year_rows] <- line_graph_data$country[last_year_rows]

# Plotting the data
ggplot(data = line_graph_data) +
  geom_line(mapping = aes(x = year, 
                          y = life_exp, 
                          colour = country),
            linewidth = 1) +
  geom_point(mapping = aes(x = year,
                           y = life_exp,
                           colour = country),
             size = 2) +
  geom_text_repel(mapping = aes(x = year,
                      y = life_exp,
                      color = country,
                      label = label), 
                  nudge_x = 1,
                  na.rm = TRUE) +
  scale_colour_manual(values = c("Malawi" = "#801650", 
                                 "Zambia" = "#F46A25")
                      ) + # adding colours
  scale_y_continuous(limits = c(0, NA)
                     ) + # axis starts at 0
  custom_theme + # adding a theme
  theme(panel.grid.major.x = element_blank(), # remove gridlines
        legend.position = "none") + # remove legend
  labs(title = "Life expectancy in Malawi and Zambia, 1962 - 2007",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (Years)")
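The label column can also be built in a single step with mutate() and if_else() from dplyr, as an alternative to indexing with which(). This is only a sketch under stated assumptions: toy_lines, its country names and its life_exp values are hypothetical and invented for illustration, and country is assumed to be a character column.

```r
library(dplyr)

# Hypothetical data, invented for illustration
toy_lines <- tibble(
  country  = c("Country A", "Country A", "Country B", "Country B"),
  year     = c(2002, 2007, 2002, 2007),
  life_exp = c(45, 48, 39, 42) # made-up values
)

# Keep the country name only on rows for the final year, NA everywhere else
toy_lines <- toy_lines |>
  mutate(label = if_else(year == max(year), country, NA_character_))

toy_lines
```

The resulting label column can be passed to geom_text_repel() exactly as before.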

4.3 Categorical X, Continuous Y

4.3.1 Bar Charts

The intuition behind bar charts (frequency charts) remains the same as in the Plotting One Variable subchapter, however, now that we have two variables to plot, we will show off the use of geom_col() as opposed to geom_bar(). Remember, geom_col() allows us to specify the x value categories as well as a y value for the heights of the bars.

# Example of geom_col

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent, 
                         y = mean_life_exp)
           ) # plotting a bar graph

By default, bar graphs use a dark grey for the bars. Similar to before, to change the colour of the bars we use the colour and fill arguments. Best practice for these sections is consistent with the previous section (just using the new arguments in this section instead) so we won’t repeat the code needlessly here. We show off colours here but the best practice palettes etc have been provided throughout.

# Here I am specifying colour

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent, 
                         y = mean_life_exp),
           colour = "white",
           fill = "#12436D")

We can also specify our own colours.

# Use scale_fill_manual for colours

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent, 
                         y = mean_life_exp, 
                         fill = continent)
           ) +
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    ) # adding colours
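An unnamed vector of colours is matched to the categories in order. If you want a given category to keep the same colour regardless of ordering, one option is to pass a named vector to scale_fill_manual(), as we did for Malawi and Zambia earlier. The sketch below uses a small made-up data frame (toy_data and continent_colours are illustrative names) rather than the real gapminder values.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy_data <- data.frame(continent = c("Africa", "Asia", "Europe"),
                       value = c(3, 5, 4))

# Named vector: each category is tied to a specific colour
continent_colours <- c("Africa" = "#12436D",
                       "Asia"   = "#28A197",
                       "Europe" = "#801650")

p_named <- ggplot(data = toy_data) +
  geom_col(mapping = aes(x = continent,
                         y = value,
                         fill = continent)) +
  scale_fill_manual(values = continent_colours) # colours matched by name
```

With a named vector, filtering out or reordering a continent should not shift the colours of the others.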

We can label our graphs by adding geom_text() to the plot. It requires a mapping for x, y, and the text itself. By setting vjust (the vertical justification), it is possible to move the text above or below the tops of the bars.

# Bar plot with labels outside the bars

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE)) # Compute the mean which we will label

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent, 
                         y = mean_life_exp, 
                         fill = continent)) +
  geom_text(mapping = aes(x = continent, 
                          y = mean_life_exp,
                          label = round(mean_life_exp), # adding a label
                          vjust = -0.3), # position the label
            size = 4) + # size of the label
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    ) # adding colours
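As a variation on the code above, fixed settings such as vjust can also be placed outside aes(), since they are constants rather than values mapped from the data. The sketch below assumes a small made-up summary table (toy_means, with invented values) and rounds the labels to one decimal place.

```r
library(ggplot2)

# Toy summary values (made up for illustration)
toy_means <- data.frame(continent = c("Africa", "Asia"),
                        mean_life_exp = c(50.26, 70.74))

p_labels <- ggplot(data = toy_means,
                   mapping = aes(x = continent,
                                 y = mean_life_exp)) +
  geom_col(fill = "#12436D") +
  geom_text(mapping = aes(label = round(mean_life_exp, 1)), # mapped from data
            vjust = -0.3, # constant, so set outside aes()
            size = 4)     # constant, so set outside aes()
```

Sharing the x and y mappings in ggplot() also avoids repeating them in every layer.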

We can also position the labels inside the bars.

# Bar plot with labels inside the bars

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent, 
                         y = mean_life_exp, 
                         fill = continent)
           ) +
  geom_text(mapping = aes(x = continent, 
                          y = mean_life_exp,
                          label = round(mean_life_exp), # adding a label
                          vjust = 1.6), # position the label
            size = 5) + # size of the label
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    ) # adding colours

It often makes sense to turn your bar plot horizontal as it makes the group labels much easier to read. As before, we will utilise coord_flip() for this.

We also introduce the guides() function as a new layer, which flips the legend and its labels to match the new flipped chart, which is especially useful if a legend is essential for your visualisation.

# We use coord flip

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent,
                         y = mean_life_exp, 
                         fill = continent)
           ) +
  geom_text(mapping = aes(x = continent,
                          y = mean_life_exp,
                          label = round(mean_life_exp),
                          vjust = 0.5,
                          hjust = "left"),
            size = 5) +
  coord_flip() + # flips bar chart
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    ) + # adding colours 
  guides(fill = guide_legend(reverse = TRUE)
         ) # flips the legend labels to match the graph

To make the bars narrower or wider, set width in geom_col(), just like we did with geom_bar(). The default value is 0.9; larger values make the bars wider, and smaller values make the bars narrower.

# Modifying the width of the bars

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992) |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

# Plotting the data
ggplot(data = example_data) +
  geom_col(mapping = aes(x = continent,
                         y = mean_life_exp, 
                         fill = continent),
           width = 0.4) + # modify the width 
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  # adding colours 

4.3.2 Stacked Bar Plot

We can also create stacked bar charts, which are particularly useful when you want to compare multiple part-to-whole relationships in one chart. Each bar represents one level of a categorical (factor) variable and is split into sections based on another categorical variable, which lets you see the most and least common categories within each bar.

By default, the stacking order of the bars is the same as the order of items in the legend. Note that you must convert the variable on the x axis (which you want bars for) to a factor variable if it is a year or something that R treats as a numeric, otherwise we will have a numeric scale which won’t make sense.

To create a stacked bar chart, we supply a categorical variable to the fill aesthetic, which partitions each bar into multiple colours, with the height of each section determined by the y values. Here we will stack each bar by continent.

# Year has been treated as numeric here

# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop") # drop grouping to avoid the summarise() message

# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = year, 
                         y = my_sum, 
                         fill = continent)) + 
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  # adding colours 

You see here that the x axis has a numeric scale which doesn’t line up with the four bars we have in our chart. This requires tweaking using the as.factor() function we mentioned earlier.

# Converting year to a factor first

# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop") # drop grouping to avoid the summarise() message

# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year), # year converted to factor
                         y = my_sum,
                         fill = continent)) +
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  # adding colours 
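To see why the conversion matters, the sketch below compares the x scale ggplot2 chooses for a numeric year against a factor year. It uses a small made-up table (toy_years) and the ggplot2 helper layer_scales() to inspect the scale; treat it as an illustration rather than something you need in your own plotting code.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy_years <- data.frame(year = c(1992, 1997, 2002, 2007),
                        total = c(10, 12, 14, 16))

# Numeric year: ggplot2 builds a continuous x axis
p_numeric <- ggplot(data = toy_years) +
  geom_col(mapping = aes(x = year,
                         y = total))

# Factor year: ggplot2 builds a discrete x axis with one slot per level
p_factor <- ggplot(data = toy_years) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = total))

class(toy_years$year)             # "numeric"
levels(as.factor(toy_years$year)) # "1992" "1997" "2002" "2007"

# Check which kind of scale each plot ended up with
inherits(layer_scales(p_numeric)$x, "ScaleContinuous") # TRUE
inherits(layer_scales(p_factor)$x, "ScaleDiscrete")    # TRUE
```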

We can also reverse the order of the bars by using the position argument inside geom_col() (note that this is not an aesthetic, so it remains outside of aes()) and setting it to position_stack(reverse = TRUE). In this case you will also need to use guides(), as we did earlier, to reverse the order of the legend so that the two match up.

# Reversing the order of a stacked bar chart

# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop") # drop grouping to avoid the summarise() message

# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum, 
                         fill = continent),
           position = position_stack(reverse = TRUE)
           ) + # reverse the order of the bars
  guides(fill = guide_legend(reverse = TRUE)) + # Reverse order of legend
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  # adding colours 

4.3.2.1 Best Practice Example

# Reversing the order of a stacked bar chart

# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop") # drop grouping to avoid the summarise() message

# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum, 
                         fill = continent),
           position = position_stack(reverse = TRUE)
           ) + # reverse the order of the bars
  guides(fill = guide_legend(reverse = TRUE)) + # Reverse order of legend
    scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  + # adding colours 
  custom_theme +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.key = element_blank(),
        panel.grid.major.x = element_blank()) +
  labs(title = "Population values by Continent (1992-2007)",
       caption = "Source: www.gapminder.org",
       x = "Year",
       y = "Population")

What we have done in this section thus far adheres to best practices in place for stacked bar charts, but we want to formalise some information here for reference.

  • Stacked bar charts can be used for data that is not part-to-whole, but take care in this case as they can become cluttered quickly.

  • If each stacked bar has more than 4 categories, consider grouping categories together before using this technique. In the example above there are 5 continents, but Oceania is so small that the chart remains readable; with, say, 8 categories it would be much harder to work with.

  • If you are visualising survey questions on the Likert Scale (1 - Strongly Agree, 2 - Agree and so on) this can easily exceed 8 categories, so it is recommended that you aggregate or group some responses together into another category. There is guidance here.

  • If you aren’t using ordered variables like years or age ranges, you could consider applying an ordering to the factor variable in the stack, to get across the message more clearly (after all, information on either end of the bar is easier to interpret than in the middle).

  • Avoid using stacked bar charts for data that contains negative values. This becomes cumbersome to interpret: the legend order no longer matches the stacking order throughout, and the contrast between stacks changes as well. It is a nightmare for accessibility purposes, so you should separate the data out into multiple charts using facet_wrap() instead.
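Two of the points above, grouping small categories together and applying an explicit order to the stacking variable, can be sketched as follows. The data frame toy_survey, its categories and its counts are all made up for illustration, and the choice of which categories count as "small" is an assumption you would make for your own data.

```r
library(dplyr)

# Toy data (made up for illustration): 7 categories is too many for one stack
toy_survey <- data.frame(
  response = c("Walk", "Cycle", "Bus", "Train", "Car", "Taxi", "Other"),
  n = c(30, 5, 20, 25, 40, 2, 3)
)

# Categories we have decided to merge (an assumption for this example)
small_categories <- c("Cycle", "Taxi", "Other")

grouped_survey <- toy_survey |>
  # Relabel the small categories as "Other"
  mutate(response = if_else(response %in% small_categories,
                            "Other",
                            response)) |>
  # Add up the counts within each remaining category
  group_by(response) |>
  summarise(n = sum(n), .groups = "drop") |>
  # Apply an explicit order: largest first, with "Other" at the end
  mutate(response = factor(response,
                           levels = c("Car", "Walk", "Train", "Bus", "Other")))

grouped_survey
```

The resulting factor has five levels, and its level order is what a fill aesthetic would use for both the stack and the legend.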

4.3.3 Segmented Bar Chart

A segmented bar plot (also called a 100% stacked bar graph) is a stacked bar plot where each bar represents 100 percent. This means the bars all have the same height, as the y axis runs from 0 to 1 (a proportion). You can also convert this to percentages if you want the y axis to show these. It is, however, recommended that you use proportions instead of percentages.

This type of plot is particularly useful if the goal is to compare the percentage of a category in one variable across each level of another variable.

Use position = “fill” (or, equivalently, position = position_fill(), as in the code below) in geom_col() to create a graph with proportions on the y axis.

# Segmented bar chart

# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop") # drop grouping to avoid the summarise() message

# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year), 
                         y = my_sum, 
                         fill = continent),
           position = position_fill()) + # creates a segmented plot
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  # adding colours 
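It can help to see the arithmetic behind a segmented bar: within each bar, every value is divided by the bar's total. The sketch below does this by hand with dplyr on a small made-up table (toy_pop, with invented continents "A" and "B") and then builds the equivalent plot so the two can be compared.

```r
library(dplyr)
library(ggplot2)

# Toy data (made up for illustration)
toy_pop <- data.frame(year = c(1992, 1992, 1997, 1997),
                      continent = c("A", "B", "A", "B"),
                      my_sum = c(30, 10, 45, 5))

# Proportion of each bar's total, computed by hand
toy_props <- toy_pop |>
  group_by(year) |>
  mutate(proportion = my_sum / sum(my_sum)) |>
  ungroup()

toy_props$proportion # 0.75 0.25 0.90 0.10

# The same data drawn as a segmented bar chart
p_fill <- ggplot(data = toy_pop) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_fill())

# Height of each segment as drawn by ggplot2
segments <- layer_data(p_fill)
segment_heights <- segments$ymax - segments$ymin
```

The segment heights in the plot should match the hand-computed proportions, and the proportions within each year add up to 1.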

To print the labels as percentages, use scale_y_continuous(labels = scales::percent). This is why we loaded the scales package.

# Segmented bar chart - percentage

# plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(
    mapping = aes(x = as.factor(year), 
                  y = my_sum, 
                  fill = continent),
    position = position_fill()
    ) +
  scale_y_continuous(labels = scales::percent) + # shows y axis in percentages
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    )  # adding colours 
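The labelling function can also be tried on its own, outside of a plot, to see what it does to a vector of proportions. The sketch below uses a made-up vector; label_percent() is the function-factory form from the same scales package, shown here as one way to control the number of decimal places.

```r
library(scales)

# Toy proportions (made up for illustration)
proportions <- c(0.1, 0.25, 0.9)

# Converts proportions to percentage labels
percent(proportions) # "10%" "25%" "90%"

# label_percent() returns a labelling function; accuracy sets the rounding
one_decimal <- label_percent(accuracy = 0.1)
one_decimal(proportions) # "10.0%" "25.0%" "90.0%"
```

Either form can be passed to the labels argument of scale_y_continuous().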

4.3.3.1 Best Practice

# Segmented bar chart

# Manipulating the data
pop_continent_after_92 <- gapminder |>
  filter(year >= 1992) |>
  group_by(year, continent) |>
  summarise(my_sum = sum(pop), .groups = "drop") # drop grouping to avoid the summarise() message

# Plotting the data
ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year), 
                         y = my_sum, 
                         fill = continent),
           position = position_fill()) + # creates a segmented plot
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")
                    ) + # adding colours 
 custom_theme +
 theme(legend.position = "bottom",
       legend.title = element_blank(),
       legend.background = element_blank(),
       legend.key = element_blank(),
       panel.grid.major.x = element_blank()) +
  labs(title = "Population Proportions by continent (1992-2007)",
       caption = "Source: www.gapminder.org",
       x = "Year",
       y = "Proportion of Population")

4.3.4 Grouped Bar Plot

Grouped bar graphs are similar to stacked bar graphs; the only difference is that the grouped bar graph provides a bar for each category instead of stacking them into the same bar.

4.3.4.1 Example - Basic Plot

You have to set position=position_dodge(), in the geom_col() function to go from one to the other. This tells the bars to ‘dodge’ each other horizontally; if you don’t do this, you’ll end up with the stacked plot from the last section.

# Grouped bar chart

ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           position = position_dodge(1)) + # creates a grouped bar plot
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1"))  # adding colours 

For grouped bars, there is no space between bars within each group by default. However, you can control the spacing between bars within a group by supplying a numeric dodge width to the position_dodge() function, where a value:

  • smaller than the width of the bars makes the bars overlap
  • larger than the width of the bars pushes the bars away from each other

So, to add a gap, ensure that this value is larger than the width set for the columns (0.9 by default).
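The interaction between the bar width and the dodge width can be checked numerically. The sketch below defines a hypothetical helper, gap_between_bars() (not a ggplot2 function), which draws two dodged bars from a made-up data frame and measures the space between them using layer_data().

```r
library(ggplot2)

# Toy data (made up for illustration): one year, two groups
toy_pairs <- data.frame(year = c("1992", "1992"),
                        continent = c("A", "B"),
                        value = c(3, 5))

# Hypothetical helper: distance between the two bars for given settings
gap_between_bars <- function(bar_width, dodge_width) {
  p <- ggplot(data = toy_pairs) +
    geom_col(mapping = aes(x = year,
                           y = value,
                           fill = continent),
             width = bar_width,
             position = position_dodge(dodge_width))

  bars <- layer_data(p)
  bars <- bars[order(as.numeric(bars$xmin)), ]

  # Left edge of the second bar minus right edge of the first bar
  as.numeric(bars$xmin[2]) - as.numeric(bars$xmax[1])
}

gap_between_bars(bar_width = 0.7, dodge_width = 1)   # positive: a visible gap
gap_between_bars(bar_width = 0.7, dodge_width = 0.4) # negative: the bars overlap
```

A positive result means there is space between the bars, and a negative result means they overlap.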

4.3.4.2 Best Practice

ggplot(data = pop_continent_after_92) +
  geom_col(mapping = aes(x = as.factor(year),
                         y = my_sum,
                         fill = continent),
           width = 0.7,
           position = position_dodge(1)) + # Light gap between bars
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")) +  # adding colours 
  custom_theme +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.key = element_blank(),
        panel.grid.major.x = element_blank()) +
   labs(title = "Population by continent (1992-2007)",
        caption = "Source: www.gapminder.org",
        x = "Year",
        y = "Population")

4.3.5 Box Plots

Box plots are some of the most commonly used chart types to compare distributions across several groups of a categorical variable (remember, histograms are for a single continuous variable).

They are also a great way to visually present quartiles in your data. They are often referred to as box-and-whisker plots, as they often have lines extending from the box (known as the whiskers), which denote additional variability outside of these quartiles. The top edge of the box is the upper quartile (75th percentile), the lower edge is the lower quartile (25th percentile) and the bold line in the centre is the median (50th percentile).

The points outside of the box and whiskers are usually flagged as outliers or anomalous data points and as such this is a great visual way to check for this before statistically determining it with calculations/tests (or make for a great piece of supporting evidence!).

We use geom_boxplot() where you must specify a continuous variable for the y axis, and a categorical variable for the x axis.

# Example of basic box plot

# Manipulating the data
example_data <- gapminder |>
  filter(year == 1992)

# Plotting the data
ggplot(data = example_data, mapping = aes(x = continent, 
                                          y = life_exp)
       ) +
  geom_boxplot() # creates a box plot
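To connect the picture to the numbers, the sketch below computes the quartiles of a small made-up vector by hand and compares them with what geom_boxplot() draws. It also applies the rule geom_boxplot() uses by default for flagging points, namely values more than 1.5 times the interquartile range beyond the box edges. The data frame toy_values is purely illustrative.

```r
library(ggplot2)

# Toy data (made up for illustration): one group with one extreme value
toy_values <- data.frame(group = "A",
                         value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

# Quartiles computed by hand
quartiles <- quantile(toy_values$value, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(toy_values$value)

# Points beyond these fences are drawn individually as outliers
upper_fence <- quartiles[["75%"]] + 1.5 * iqr
lower_fence <- quartiles[["25%"]] - 1.5 * iqr
flagged <- toy_values$value[toy_values$value > upper_fence |
                              toy_values$value < lower_fence]

# The same data as a box plot
p_box <- ggplot(data = toy_values, mapping = aes(x = group,
                                                 y = value)) +
  geom_boxplot()

# Statistics ggplot2 computed for the box: lower, middle, upper, outliers
box_stats <- layer_data(p_box)
```

Here box_stats$lower, box_stats$middle and box_stats$upper should match the hand-computed quartiles, and the extreme value of 30 is the only point flagged.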

We can amend the outlier colour, shape and size by specifying the following arguments within geom_boxplot().

# Change outlier colour, shape and size

ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)
       ) +
  geom_boxplot(outlier.colour = "red", # outlier colour
               outlier.shape = 4, # outlier shape
               outlier.size = 4) # outlier size

We can make the width of each box proportional to the sample size (specifically, to the square root of the number of observations in each group). To do this we use the varwidth argument (a boolean, TRUE or FALSE), which is specified within geom_boxplot().

# Using varwidth argument

ggplot(data = example_data, mapping = aes(x = continent, 
                                          y = life_exp)
       ) +
  geom_boxplot(varwidth = TRUE) # make box widths proportional to the sample size

It is worth noting that whilst this is helpful, it can make smaller groups much harder to observe, as groups with few observations are drawn as very narrow boxes.

We can also add points to our box plot. This gives us a better idea of the number of measurements and of their distribution. We do this by adding a new layer geom_jitter() to our boxplot, which overlays the points to the boxplot.

# We can add points on top of our boxplot

ggplot(data = example_data, mapping = aes(x = continent, 
                                          y = life_exp)
       ) +
  geom_boxplot() +
  geom_jitter() # add a layer of points on top of the box plots
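By default geom_jitter() moves points a little in both directions, and any outliers are drawn twice (once by the box plot and once by the jitter layer). One possible refinement, sketched below on a made-up data frame (toy_groups), is to jitter horizontally only so the y values stay exact, and to hide the box plot's own outlier markers.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy_groups <- data.frame(group = rep(c("A", "B"), each = 5),
                         value = c(1, 2, 3, 4, 20, 5, 6, 7, 8, 9))

p_jitter <- ggplot(data = toy_groups, mapping = aes(x = group,
                                                    y = value)) +
  geom_boxplot(outlier.shape = NA) + # hide outliers drawn by the box plot
  geom_jitter(width = 0.2,  # spread points sideways within each group
              height = 0,   # no vertical noise, so y values stay exact
              alpha = 0.5)  # transparency helps with overlapping points

# Positions of the jittered points as drawn
points_drawn <- layer_data(p_jitter, 2)
```

Because the jitter is random, the horizontal positions will differ slightly each time the plot is drawn.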

We can add colour to our box plot by specifying the fill argument within geom_boxplot().

# Add colour to boxplot

ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp)
       ) +
  geom_boxplot(fill = "#f37735") # add a colour

We can also map the colour to a variable so we get a different colour for each box plot. Remember, this means we specify fill within aes(), as opposed to outside it when we want a single colour.

#  We can also map the colour to a variable

ggplot(data = example_data, aes(x = continent,
                                y = life_exp, 
                                fill = continent)
       ) + # map fill to continent
  geom_boxplot()

We can also get horizontal box plots by flipping the axes using coord_flip() that we saw earlier with bar charts.

# Example of a flipped boxplot

ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp,
                                          fill = continent)
       ) +
  geom_boxplot() +
  coord_flip() # this flips our graph

We can also create grouped box plots, where each level of the categorical variable is split by a second grouping variable to give multiple box plots per category. This effectively lets us use two categorical variables and gain more insight. However, this can become unruly fast and hard to read, so always keep this in mind if applying this technique.

Remember that the variable we split by to get the grouped boxplots needs to be categorical, so in the case that we use year like we have here (which is interpreted as a numeric), you need to use as.factor() (or factor(), as in the code below) when specifying it to convert it first.

# Creating grouped box plots

# Manipulating the data
example_data <- gapminder |>
  filter(year %in% c(1992, 2002, 2007))

# Plotting the data
ggplot(data = example_data, mapping = aes(x = continent,
                                          y = life_exp, 
                                          fill = factor(year)
                                          )) + # grouped box plot
  geom_boxplot()

4.3.5.1 Best Practice

There are no specific guidelines for this plot type, so applying our best practice theming and colours should suffice here.

# data manipulation
example_data <- gapminder |>
  filter(year == 1997)

# plotting the data
ggplot(data = example_data, aes(x = continent, 
                                y = life_exp, 
                                fill = continent)
       ) +
  geom_boxplot() +
  scale_fill_manual(values = c("#12436D", 
                               "#28A197", 
                               "#801650", 
                               "#F46A25", 
                               "#A285D1")) + # adding colours  
  custom_theme +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank()
        ) +
  labs(title = "Life expectancy distribution (1997)",
       caption = "Source: www.gapminder.org",
       x = "",
       y = "Age (years)")

5 Summary

Wonderful work on the plentiful plots presented to you in this chapter! This is a great reference piece you can return to whenever you need to consider how to visualise your data in the future and what ways there are available. There are of course a few that we didn’t cover (you may notice the absence of pie charts), which are instead contained in the Reference Materials Chapter 6, as they are considered last resort plot types, not recommended to be used unless absolutely necessary.

Reuse

Open Government Licence 3.0