Chapter 5 Plotting in R

5.1 Base graphics

We won’t go through all base graphics plotting functions one by one here. We will encounter and learn several of these functions throughout the course and, if necessary, discuss them when questions arise.

5.2 Plotting with ggplot2

A useful ggplot2 cheet sheet is available here. More details are available on the documentation page.

Base graphics uses a canvas model a series of instructions that sequentially fill the plotting canvas. While this model is very useful to build plots bits by bits bottom up, which is useful in some cases, it has some clear drawback:

  • Layout choices have to be made without global overview over what may still be coming.
  • Different functions for different plot types with different interfaces.
  • No standard data input.
  • Many routine tasks require a lot of boilerplate code.
  • No concept of facets/lattices/viewports.
  • Poor default colours.

The ggplot2 package implements a grammar of graphics. Users describe what and how to visualise data and the package then generates the figure. The components of ggplot2’s of graphics are

  1. A tidy dataset
  2. A choice of geometric objects that servers as the visual representation of the data - for instance, points, lines, rectangles, contours.
  3. A description of how the variables in the data are mapped to visual properties (aesthetics) or the geometric objects, and an associated scale (e.g. linear, logarithmic, polar)
  4. A statistical summarisation rule
  5. A coordinate system.
  6. A facet specification, i.e. the use of several plots to look at the same data.

Fist of all, we need to load the ggplot2 package and load the iprg data.

library("ggplot2")
iprg <- read.csv("http://bit.ly/VisBiomedDataIprgCsv")

ggplot graphics are built step by step by adding new elements.

To build a ggplot we need to:

  • bind the plot to a specific data frame using the data argument
ggplot(data = iprg)
  • define aesthetics (aes), by selecting the variables to be plotted and the variables to define the presentation such as plotting size, shape color, etc.
ggplot(data = iprg, aes(x = Run, y = Log2Intensity))
  • add geoms – graphical representation of the data in the plot (points, lines, bars). To add a geom to the plot use + operator
ggplot(data = iprg, aes(x = Run, y = Log2Intensity)) +
  geom_boxplot()

See the documentation page to explore the many available geoms.

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot “templates” and conveniently explore different types of plots, so the above plot can also be generated with code like this:

## Assign plot to a variable
ints_plot <- ggplot(data = iprg, aes(x = Run, y = Log2Intensity))

## Draw the plot
ints_plot + geom_boxplot()

Notes:

  • Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x and y axis you set up in aes().

  • You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the ggplot() function.

  • The + sign used to add layers must be placed at the end of each line containing a layer. If, instead, the + sign is added in the line before the other layer, ggplot2 will not add the new layer and will return an error message.

Challenge

  • Repeat the plot above but displaying the raw intensities.
  • Log-10 transform the raw intensities on the flight when plotting.

ggplot(data = iprg, aes(x = Run, y = Intensity)) + geom_boxplot()

ggplot(data = iprg, aes(x = Run, y = log10(Intensity))) + geom_boxplot()

5.3 Composing plots

First, let’s colour the boxplot based on the condition:

ggplot(data = iprg,
       aes(x = Run, y = Log2Intensity,
           fill = Condition)) +
  geom_boxplot()

Now let’s rename all axis labels and title, and rotate the x-axis labels 90 degrees. We can add those specifications using the labs and theme functions of the ggplot2 package.

ggplot(aes(x = Run, y = Log2Intensity, fill = Condition),
       data = iprg) +
    geom_boxplot() +
    labs(title = 'Log2 transformed intensity distribution per MS run',
         y = 'Log2(Intensity)',
         x = 'MS run') +
    theme(axis.text.x = element_text(angle = 90))

And easily switch from a boxplot to a violin plot representation by changing the geom type.

ggplot(aes(x = Run, y = Log2Intensity, fill = Condition),
       data = iprg) +
    geom_violin() +
    labs(title = 'Log2 transformed intensity distribution per Subject',
         y = 'Log2(Intensity)',
         x = 'MS run') +
    theme(axis.text.x = element_text(angle = 90))

Finally, we can also overlay multiple geoms by simply adding them one after the other.

p <- ggplot(aes(x = Run, y = Log2Intensity, fill = Condition),
            data = iprg)
p + geom_boxplot()

p + geom_boxplot() + geom_jitter() ## not very usefull

p + geom_jitter() + geom_boxplot()

p + geom_jitter(alpha = 0.1) + geom_boxplot()

Challenge

  • Overlay a boxplot goem on top of a jitter geom for the raw or log-10 transformed intensities.
  • Customise the plot as suggested above.

## Note how the log10 transformation is applied to both geoms
ggplot(data = iprg, aes(x = Run, y = log10(Intensity))) +
    geom_jitter(alpha = 0.1) +
    geom_boxplot()

Finally, a very useful feature of ggplot2 is facetting, that defines how to subset the data into different panels (facets).

names(iprg)
## [1] "Protein"       "Log2Intensity" "Run"           "Condition"    
## [5] "BioReplicate"  "Intensity"     "TechReplicate"
ggplot(data = iprg,
       aes(x = TechReplicate, y = Log2Intensity,
           fill = Condition)) +
    geom_boxplot() +
    facet_grid(~ Condition)

5.4 Colour scales

library("viridis")
library("RColorBrewer")
RColorBrewer::display.brewer.all()

p <- ggplot(data = crcdf,
            aes(x = A1AG2, y = AFM, colour = AHSG)) +
    geom_point()

p + scale_color_viridis()

p + scale_color_viridis(option = "A")

p + scale_color_distiller(palette = "Spectral")

p + scale_color_distiller(palette = "Blues")

p + scale_color_distiller(palette = "Purples")

p + scale_color_distiller(palette = "Set1")

p <- ggplot(data = crcdf,
            aes(x = A1AG2, y = AFM, colour = Sub_group)) +
    geom_point()

p ## uses scale_color_discrete()

p + scale_color_grey()

p + scale_color_viridis(discrete = TRUE)

p + scale_color_viridis(discrete = TRUE, option = "A")

5.5 Customising plots

Using labs:

p <- ggplot(data = iprg, aes(x = Run, y = log10(Intensity))) +
    geom_jitter(alpha = 0.1) +
    geom_boxplot()
p + labs(title = "A title, at the top",
         subtitle = "A subtitle, under the title",
         caption = "Comes at the bottom of the plot",
         x = "x axis label",
         y = "y axis label")
p
p + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Setting themes:

p + theme_bw()
p + theme_gray()
p + theme_dark()
p + theme_minimal()
p + theme_light()
p + theme_void()
## ....

See also the ggthemes package.

5.6 Combine ggplot figures

The goal of patchwork patchwork is to make it ridiculously simple to combine separate ggplots into the same graphic. As such it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid but using an API that incites exploration and iteration.

Installation:

## install.packages("devtools")
devtools::install_github("thomasp85/patchwork")
p1 <- ggplot(iprg, aes(x = Condition, y = Log2Intensity)) +
    geom_boxplot(aes(fill = Condition)) +
    theme(legend.position = "none")
p2 <- ggplot(iprg, aes(x = Intensity, y = Log2Intensity)) +
    geom_point()
p3 <- iprg %>%
    group_by(TechReplicate, BioReplicate) %>%
    tally %>%
    ggplot(aes(x = TechReplicate,
               y = n,
               fill = as.factor(BioReplicate))) +
    geom_col() +
    theme(legend.position = "none")
p4 <- ggplot(iprg, aes(x = TechReplicate, y = Log2Intensity)) +
    geom_violin(aes(fill = Condition)) +
    theme(legend.position = "none") +
    coord_flip()
library("patchwork")
p1 + p2
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
p1 + p2 + p3 + plot_layout(ncol = 1)
p1 + p2 - p3 + plot_layout(ncol = 1)
(p1 | p2 | p3) / p4
(p1 + (p2 + p3) + p4 + plot_layout(ncol = 1))

5.7 Saving your figures

You can save plots to a number of different file formats. PDF is by far the most common format because it’s lightweight, cross-platform and scales up well but jpegs, pngs and a number of other file formats are also supported. Let’s redo the last barplot but save it to the file system this time.

Let’s save the boxplot as pdf file.

pdf()
p + geom_jitter(alpha = 0.1) + geom_boxplot()
dev.off()

The default file name is Rplots.pdf. We can customise that file name specifying it by passing the file name, as a character, to the pdf() function.

Challenge

Save a figure of your choice to a pdf file. Read the manual for the png function and save that same image to a png file.

Tip: save your figures in a dedicated directory.

5.8 Exercises

  1. Transform the anscombe data into a long dataset and visualise the relations between the four\((x_i, y_i)\) pairs.

anscombe$point <-  1:11
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4 point
## 1  10 10 10  8  8.04 9.14  7.46  6.58     1
## 2   8  8  8  8  6.95 8.14  6.77  5.76     2
## 3  13 13 13  8  7.58 8.74 12.74  7.71     3
## 4   9  9  9  8  8.81 8.77  7.11  8.84     4
## 5  11 11 11  8  8.33 9.26  7.81  8.47     5
## 6  14 14 14  8  9.96 8.10  8.84  7.04     6
## 7   6  6  6  8  7.24 6.13  6.08  5.25     7
## 8   4  4  4 19  4.26 3.10  5.39 12.50     8
## 9  12 12 12  8 10.84 9.13  8.15  5.56     9
## 10  7  7  7  8  4.82 7.26  6.42  7.91    10
## 11  5  5  5  8  5.68 4.74  5.73  6.89    11
xs <- gather(anscombe[, 1:4], key = coord, value = x)
ys <- gather(anscombe[, 5:9], key = coord, value = y, -point)

(ans <- cbind(xs[, 2, drop = FALSE], ys) %>% as_tibble)
## # A tibble: 44 x 4
##        x point coord     y
##    <dbl> <int> <chr> <dbl>
##  1    10     1 y1     8.04
##  2     8     2 y1     6.95
##  3    13     3 y1     7.58
##  4     9     4 y1     8.81
##  5    11     5 y1     8.33
##  6    14     6 y1     9.96
##  7     6     7 y1     7.24
##  8     4     8 y1     4.26
##  9    12     9 y1    10.8 
## 10     7    10 y1     4.82
## # … with 34 more rows
ggplot(ans, aes(x = x, y = y, colour = coord)) +
    geom_point() +
    facet_wrap(. ~ coord) +
    geom_smooth(method = "lm")

Or, producing 4 plots and patching them up with patchwork:

p1 <- ggplot(anscombe, aes(x=x1, y=y1)) + 
   geom_point() +
   geom_smooth(method = "lm")

p2 <- ggplot(anscombe, aes(x=x2, y=y2)) + 
   geom_point() +
   geom_smooth(method="lm")

p3 <- ggplot(anscombe, aes(x=x3, y=y3)) + 
   geom_point() +
   geom_smooth(method = "lm")

p4 <- ggplot(anscombe, aes(x=x4, y=y4)) + 
   geom_point() +
   geom_smooth(method = "lm")

library("patchwork")
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

Or, in one single pipe (solution by Aaron Storey at May Institute 2019).

data(anscombe) 
anscombe$point <- c(1:11)

as_tibble(anscombe) %>%
  gather(Variable, Value,-point) %>%
  mutate(
    Series_number = str_extract(Variable, "[[:digit:]]"), 
    Variable = str_extract(Variable, "[[:alpha:]]") 
  ) %>%
  spread(Variable, Value) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Series_number)

  1. Count the number of quantified proteins for each group, gender and age and visualise the results (suggestion below).

Tips: For age, you can use cut to bin the ages. You’ll probably want to use dplyr::group_by and dplyr::tally to obtain the values after converting the crc data to a long format.

tbl <- gather(crcdf, key = Protein, value = expression,
              -Sample, -Group, -Age, -Gender, -Cancer_stage,
              -Tumour_location, -Sub_group) %>%
    as_tibble

tbl %>% group_by(age = cut(tbl$Age, 5), Group, Gender) %>%
    tally %>%
        ggplot(aes(x = Gender, y = n, colour = age)) +
        geom_point(size = 5) +
        facet_grid( ~ Group)

  1. Compare the range of experessions in a subset of samples (below I used those starting with "P1F") and group plots by categories using facets.
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

ggplot(data = filter(tbl, grepl("P1F", tbl$Sample)),
       aes(x = Sample, y = expression)) +
    geom_boxplot() +
    facet_grid(Gender ~ Group) +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1))

  1. Plot the expression of all proteins in one sample (for example P1A10) against another one (for example P1A2) and use dataset-wide features such as the log-fold change of all CRC vs. Healthy samples and the grand mean expression intensity.
## Warning: Removed 1 rows containing missing values (geom_point).

x <- MSnbase::ms2df(crc) %>% as_tibble

x$rm <- rowMeans(MSnbase::exprs(crc), na.rm = TRUE)
fc <- rowSums(MSnbase::exprs(crc[, crc$Group == "CRC"]), na.rm = TRUE) /
    rowSums(MSnbase::exprs(crc[, crc$Group != "CRC"]), na.rm = TRUE)
x$lfc <- log2(fc)

ggplot(x, aes(P1A10, P1A2, colour = lfc, size = rm)) +
    geom_point() +
    scale_color_distiller(palette = "Spectral")

5.9 References