Chapter 4 Using dplyr for data manipulation

A useful dplyr cheet sheet is available here.

The following material is based on Data Carpentry’s the Data analisis and visualisation lessons.

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter dplyr. dplyr is a package for making tabular data manipulation easier. It pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

We will need the tidyverse package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as dplyr, ggplot2 (for visualisation), tibble, etc.

library("tidyverse")
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ purrr   0.3.2       ✔ stringr 1.4.0  
## ✔ tibble  2.1.1       ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse()   masks IRanges::collapse()
## ✖ dplyr::combine()    masks MSnbase::combine(), Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::desc()       masks IRanges::desc()
## ✖ purrr::detect()     masks RforProteomics::detect()
## ✖ tidyr::expand()     masks S4Vectors::expand()
## ✖ magrittr::extract() masks tidyr::extract()
## ✖ dplyr::filter()     masks plotly::filter(), stats::filter()
## ✖ dplyr::first()      masks S4Vectors::first()
## ✖ dplyr::ident()      masks msdata::ident()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()     masks IRanges::reduce(), MSnbase::reduce()
## ✖ dplyr::rename()     masks plotly::rename(), S4Vectors::rename()
## ✖ dplyr::select()     masks plotly::select(), AnnotationDbi::select()
## ✖ purrr::set_names()  masks magrittr::set_names()
## ✖ dplyr::slice()      masks plotly::slice(), IRanges::slice()

The tidyverse package tries to address 3 major problems with some of base R functions:

  1. The results from a base R function sometimes depends on the type of data.
  2. Using R expressions in a non standard way, which can be confusing for new learners.
  3. Hidden arguments, having default operations that new learners are not aware of.

To learn more about dplyr and tidyr, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

Let’s start by reading the data using readr::read_csv that will produce a tibble.

library("readr")
iprg <- read_csv("http://bit.ly/VisBiomedDataIprgCsv")
## Parsed with column specification:
## cols(
##   Protein = col_character(),
##   Log2Intensity = col_double(),
##   Run = col_character(),
##   Condition = col_character(),
##   BioReplicate = col_double(),
##   Intensity = col_double(),
##   TechReplicate = col_character()
## )

Tibbles are data frames, but they tweak some of the old behaviors of data frames. The data structure is very similar to a data frame. For our purposes the only differences are that:

  1. In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
  2. Columns of class character are never converted into factors.

4.1 Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame, and the subsequent arguments are the columns to keep.

select(iprg, Protein, Run, Condition)

To choose rows based on a specific criteria, use filter():

filter(iprg, BioReplicate == 1)
## # A tibble: 9,079 x 7
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|D6V…          26.8 JD_0… Conditio…            1    1.18e8
##  2 sp|D6V…          26.6 JD_0… Conditio…            1    1.02e8
##  3 sp|D6V…          26.6 JD_0… Conditio…            1    1.01e8
##  4 sp|O13…          24.7 JD_0… Conditio…            1    2.76e7
##  5 sp|O13…          24.7 JD_0… Conditio…            1    2.68e7
##  6 sp|O13…          24.7 JD_0… Conditio…            1    2.76e7
##  7 sp|O13…          23.4 JD_0… Conditio…            1    1.09e7
##  8 sp|O13…          24.0 JD_0… Conditio…            1    1.69e7
##  9 sp|O13…          23.5 JD_0… Conditio…            1    1.16e7
## 10 sp|O13…          27.5 JD_0… Conditio…            1    1.92e8
## # … with 9,069 more rows, and 1 more variable: TechReplicate <chr>
filter(iprg, Condition == 'Condition2')
## # A tibble: 9,081 x 7
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|D6V…          26.8 JD_0… Conditio…            2    1.20e8
##  2 sp|D6V…          26.8 JD_0… Conditio…            2    1.16e8
##  3 sp|D6V…          26.6 JD_0… Conditio…            2    1.02e8
##  4 sp|O13…          24.5 JD_0… Conditio…            2    2.41e7
##  5 sp|O13…          24.7 JD_0… Conditio…            2    2.68e7
##  6 sp|O13…          24.6 JD_0… Conditio…            2    2.51e7
##  7 sp|O13…          23.2 JD_0… Conditio…            2    9.45e6
##  8 sp|O13…          23.4 JD_0… Conditio…            2    1.13e7
##  9 sp|O13…          23.8 JD_0… Conditio…            2    1.43e7
## 10 sp|O13…          25.9 JD_0… Conditio…            2    6.18e7
## # … with 9,071 more rows, and 1 more variable: TechReplicate <chr>

4.2 Pipes

But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as things are evaluated from the inside out.

The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

iprg %>%
  filter(Intensity > 1e8) %>%
  select(Protein, Condition, Intensity)
## # A tibble: 4,729 x 3
##    Protein              Condition   Intensity
##    <chr>                <chr>           <dbl>
##  1 sp|D6VTK4|STE2_YEAST Condition1 117845016.
##  2 sp|D6VTK4|STE2_YEAST Condition1 102273602.
##  3 sp|D6VTK4|STE2_YEAST Condition1 100526837.
##  4 sp|D6VTK4|STE2_YEAST Condition2 119765106.
##  5 sp|D6VTK4|STE2_YEAST Condition2 116382798.
##  6 sp|D6VTK4|STE2_YEAST Condition2 102328260.
##  7 sp|D6VTK4|STE2_YEAST Condition3 103830944.
##  8 sp|D6VTK4|STE2_YEAST Condition4 102150172.
##  9 sp|D6VTK4|STE2_YEAST Condition4 105724288.
## 10 sp|O13539|THP2_YEAST Condition1 192490784.
## # … with 4,719 more rows

In the above, we use the pipe to send the iprg dataset first through filter() to keep rows where Intensity is greater than 1e8, then through select() to keep only the Protein, Condition, and Intensity columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include it as an argument to the filter() and select() functions anymore.

If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:

iprg_sml <- iprg %>%
    filter(Intensity > 1e8) %>%
    select(Protein, Condition, Intensity)

iprg_sml
## # A tibble: 4,729 x 3
##    Protein              Condition   Intensity
##    <chr>                <chr>           <dbl>
##  1 sp|D6VTK4|STE2_YEAST Condition1 117845016.
##  2 sp|D6VTK4|STE2_YEAST Condition1 102273602.
##  3 sp|D6VTK4|STE2_YEAST Condition1 100526837.
##  4 sp|D6VTK4|STE2_YEAST Condition2 119765106.
##  5 sp|D6VTK4|STE2_YEAST Condition2 116382798.
##  6 sp|D6VTK4|STE2_YEAST Condition2 102328260.
##  7 sp|D6VTK4|STE2_YEAST Condition3 103830944.
##  8 sp|D6VTK4|STE2_YEAST Condition4 102150172.
##  9 sp|D6VTK4|STE2_YEAST Condition4 105724288.
## 10 sp|O13539|THP2_YEAST Condition1 192490784.
## # … with 4,719 more rows

Note that the final data frame is the leftmost part of this expression.

Challenge

Using pipes, subset the iprg data to include Proteins with a log2 intensity greater than 20 and retain only the columns Proteins, and Condition.

## Answer
iprg %>%
    filter(Log2Intensity > 20) %>%
    select(Protein, Condition)

4.3 Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of weight in kg:

iprg %>%
  mutate(Log10Intensity = log10(Intensity))
## # A tibble: 36,321 x 8
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|D6V…          26.8 JD_0… Conditio…            1    1.18e8
##  2 sp|D6V…          26.6 JD_0… Conditio…            1    1.02e8
##  3 sp|D6V…          26.6 JD_0… Conditio…            1    1.01e8
##  4 sp|D6V…          26.8 JD_0… Conditio…            2    1.20e8
##  5 sp|D6V…          26.8 JD_0… Conditio…            2    1.16e8
##  6 sp|D6V…          26.6 JD_0… Conditio…            2    1.02e8
##  7 sp|D6V…          26.6 JD_0… Conditio…            3    1.04e8
##  8 sp|D6V…          26.5 JD_0… Conditio…            3    9.47e7
##  9 sp|D6V…          26.5 JD_0… Conditio…            3    9.69e7
## 10 sp|D6V…          26.6 JD_0… Conditio…            4    1.02e8
## # … with 36,311 more rows, and 2 more variables: TechReplicate <chr>,
## #   Log10Intensity <dbl>

You can also create a second new column based on the first new column within the same call of mutate():

iprg %>%
    mutate(Log10Intensity = log10(Intensity),
           Log10Intensity2 = Log10Intensity * 2)
## # A tibble: 36,321 x 9
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|D6V…          26.8 JD_0… Conditio…            1    1.18e8
##  2 sp|D6V…          26.6 JD_0… Conditio…            1    1.02e8
##  3 sp|D6V…          26.6 JD_0… Conditio…            1    1.01e8
##  4 sp|D6V…          26.8 JD_0… Conditio…            2    1.20e8
##  5 sp|D6V…          26.8 JD_0… Conditio…            2    1.16e8
##  6 sp|D6V…          26.6 JD_0… Conditio…            2    1.02e8
##  7 sp|D6V…          26.6 JD_0… Conditio…            3    1.04e8
##  8 sp|D6V…          26.5 JD_0… Conditio…            3    9.47e7
##  9 sp|D6V…          26.5 JD_0… Conditio…            3    9.69e7
## 10 sp|D6V…          26.6 JD_0… Conditio…            4    1.02e8
## # … with 36,311 more rows, and 3 more variables: TechReplicate <chr>,
## #   Log10Intensity <dbl>, Log10Intensity2 <dbl>

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. (Pipes work with non-dplyr functions, too, as long as the dplyr or magrittr package is loaded).

iprg %>%
  mutate(Log10Intensity = log10(Intensity)) %>%
  head
## # A tibble: 6 x 8
##   Protein Log2Intensity Run   Condition BioReplicate Intensity
##   <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
## 1 sp|D6V…          26.8 JD_0… Conditio…            1    1.18e8
## 2 sp|D6V…          26.6 JD_0… Conditio…            1    1.02e8
## 3 sp|D6V…          26.6 JD_0… Conditio…            1    1.01e8
## 4 sp|D6V…          26.8 JD_0… Conditio…            2    1.20e8
## 5 sp|D6V…          26.8 JD_0… Conditio…            2    1.16e8
## 6 sp|D6V…          26.6 JD_0… Conditio…            2    1.02e8
## # … with 2 more variables: TechReplicate <chr>, Log10Intensity <dbl>

Note that we don’t include parentheses at the end of our call to head() above. When piping into a function with no additional arguments, you can call the function with or without parentheses (e.g. head or head()).

If you want to display more data, you can use the print() function at the end of your chain with the argument n specifying the number of rows to display:

iprg %>%
    mutate(Log10Intensity = log10(Intensity),
               Log10Intensity2 = Log10Intensity * 2) %>%
    print(n = 20)
## # A tibble: 36,321 x 9
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|D6V…          26.8 JD_0… Conditio…            1    1.18e8
##  2 sp|D6V…          26.6 JD_0… Conditio…            1    1.02e8
##  3 sp|D6V…          26.6 JD_0… Conditio…            1    1.01e8
##  4 sp|D6V…          26.8 JD_0… Conditio…            2    1.20e8
##  5 sp|D6V…          26.8 JD_0… Conditio…            2    1.16e8
##  6 sp|D6V…          26.6 JD_0… Conditio…            2    1.02e8
##  7 sp|D6V…          26.6 JD_0… Conditio…            3    1.04e8
##  8 sp|D6V…          26.5 JD_0… Conditio…            3    9.47e7
##  9 sp|D6V…          26.5 JD_0… Conditio…            3    9.69e7
## 10 sp|D6V…          26.6 JD_0… Conditio…            4    1.02e8
## 11 sp|D6V…          26.4 JD_0… Conditio…            4    8.77e7
## 12 sp|D6V…          26.7 JD_0… Conditio…            4    1.06e8
## 13 sp|O13…          24.7 JD_0… Conditio…            1    2.76e7
## 14 sp|O13…          24.7 JD_0… Conditio…            1    2.68e7
## 15 sp|O13…          24.7 JD_0… Conditio…            1    2.76e7
## 16 sp|O13…          24.5 JD_0… Conditio…            2    2.41e7
## 17 sp|O13…          24.7 JD_0… Conditio…            2    2.68e7
## 18 sp|O13…          24.6 JD_0… Conditio…            2    2.51e7
## 19 sp|O13…          24.4 JD_0… Conditio…            3    2.20e7
## 20 sp|O13…          24.6 JD_0… Conditio…            3    2.59e7
## # … with 3.63e+04 more rows, and 3 more variables: TechReplicate <chr>,
## #   Log10Intensity <dbl>, Log10Intensity2 <dbl>

Let’s use a modified iprg data that contains missing values for the next example. It can be loaded with

if (!file.exists("./data/iprgna.rda"))
    download.file("http://bit.ly/VisBiomedDataIprgNA",
                  "./data/iprgna.rda")
load("./data/iprgna.rda")

Challenge

Using the iprgna data repeat the creation of a new Log10Intensisty column.

iprgna %>% mutate(Log10Intensity = log10(Intensity))
## # A tibble: 36,321 x 7
##    Protein Run   Condition BioReplicate Intensity TechReplicate
##    <chr>   <chr> <chr>            <int>     <dbl> <chr>        
##  1 sp|D6V… JD_0… Conditio…            1   NA      B            
##  2 sp|D6V… JD_0… Conditio…            1    1.02e8 C            
##  3 sp|D6V… JD_0… Conditio…            1   NA      A            
##  4 sp|D6V… JD_0… Conditio…            2    1.20e8 A            
##  5 sp|D6V… JD_0… Conditio…            2   NA      B            
##  6 sp|D6V… JD_0… Conditio…            2    1.02e8 C            
##  7 sp|D6V… JD_0… Conditio…            3    1.04e8 A            
##  8 sp|D6V… JD_0… Conditio…            3    9.47e7 B            
##  9 sp|D6V… JD_0… Conditio…            3    9.69e7 C            
## 10 sp|D6V… JD_0… Conditio…            4    1.02e8 B            
## # … with 36,311 more rows, and 1 more variable: Log10Intensity <dbl>

The first few rows of the output are full of NAs, so if we wanted to remove those we could insert a filter() in the chain:

iprgna %>%
    filter(!is.na(Intensity)) %>%
    mutate(Log10Intensity = log10(Intensity))
## # A tibble: 35,318 x 7
##    Protein Run   Condition BioReplicate Intensity TechReplicate
##    <chr>   <chr> <chr>            <int>     <dbl> <chr>        
##  1 sp|D6V… JD_0… Conditio…            1    1.02e8 C            
##  2 sp|D6V… JD_0… Conditio…            2    1.20e8 A            
##  3 sp|D6V… JD_0… Conditio…            2    1.02e8 C            
##  4 sp|D6V… JD_0… Conditio…            3    1.04e8 A            
##  5 sp|D6V… JD_0… Conditio…            3    9.47e7 B            
##  6 sp|D6V… JD_0… Conditio…            3    9.69e7 C            
##  7 sp|D6V… JD_0… Conditio…            4    1.02e8 B            
##  8 sp|D6V… JD_0… Conditio…            4    8.77e7 C            
##  9 sp|D6V… JD_0… Conditio…            4    1.06e8 A            
## 10 sp|O13… JD_0… Conditio…            1    2.76e7 B            
## # … with 35,308 more rows, and 1 more variable: Log10Intensity <dbl>

is.na() is a function that determines whether something is an NA. The ! symbol negates the result, so we’re asking for everything that is not an NA.

4.4 Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to view the mean weight by sex:

iprgna %>%
  group_by(Condition) %>%
  summarize(mean_Intensity = mean(Intensity))
## # A tibble: 4 x 2
##   Condition  mean_Intensity
##   <chr>               <dbl>
## 1 Condition1             NA
## 2 Condition2             NA
## 3 Condition3             NA
## 4 Condition4             NA

Unfortunately, the mean of any vector that contains even a single missing value is NA. We need to remove missing values before calculating the mean, which is done easily with the na.rm argument.

iprgna %>%
  group_by(Condition) %>%
  summarize(mean_Intensity = mean(Intensity, na.rm = TRUE))
## # A tibble: 4 x 2
##   Condition  mean_Intensity
##   <chr>               <dbl>
## 1 Condition1      65144912.
## 2 Condition2      64439756.
## 3 Condition3      62475797.
## 4 Condition4      63616488.

You can also group by multiple columns:

iprgna %>%
  group_by(TechReplicate, BioReplicate) %>%
  summarize(mean_Intensity = mean(Intensity, na.rm = TRUE))
## # A tibble: 12 x 3
## # Groups:   TechReplicate [3]
##    TechReplicate BioReplicate mean_Intensity
##    <chr>                <int>          <dbl>
##  1 A                        1      64891444.
##  2 A                        2      63870255.
##  3 A                        3      61648150.
##  4 A                        4      63662564.
##  5 B                        1      65563938.
##  6 B                        2      65164270.
##  7 B                        3      62758494.
##  8 B                        4      64196979.
##  9 C                        1      64978764.
## 10 C                        2      64283727.
## 11 C                        3      63020774.
## 12 C                        4      62984686.

Tallying

When working with data, it is also common to want to know the number of observations found for each factor or combination of factors. For this, dplyr provides tally().

iprgna %>%
  group_by(Condition) %>%
  tally
## # A tibble: 4 x 2
##   Condition      n
##   <chr>      <int>
## 1 Condition1  9079
## 2 Condition2  9081
## 3 Condition3  9081
## 4 Condition4  9080

Here, tally() is the action applied to the groups created by group_by() and counts the total number of records for each category.

Challenge

  1. How many proteins of each technical replicate are there?

  2. Use group_by() and summarize() to find the mean, min, and max intensity for each condition.

  3. What are the proteins with the highest intensity in each condition?

## Answer 1
iprgna %>%
    group_by(TechReplicate) %>%
    tally
## # A tibble: 3 x 2
##   TechReplicate     n
##   <chr>         <int>
## 1 A             12107
## 2 B             12106
## 3 C             12108
## Answer 2
iprgna %>%
    filter(!is.na(Intensity)) %>%
    group_by(Condition) %>%
    summarize(mean_int = mean(Intensity),
                  min_int = min(Intensity),
                  max_int = max(Intensity))
## # A tibble: 4 x 4
##   Condition   mean_int min_int    max_int
##   <chr>          <dbl>   <dbl>      <dbl>
## 1 Condition1 65144912. 254608. 2841953257
## 2 Condition2 64439756. 259513. 2757471311
## 3 Condition3 62475797.  88409. 2659018724
## 4 Condition4 63616488.  84850. 2881057105
## Answer 3
iprgna %>%
    filter(!is.na(Intensity)) %>%
    group_by(Condition) %>%
    filter(Intensity == max(Intensity)) %>%
    arrange(Intensity)
## # A tibble: 4 x 6
## # Groups:   Condition [4]
##   Protein      Run           Condition BioReplicate Intensity TechReplicate
##   <chr>        <chr>         <chr>            <int>     <dbl> <chr>        
## 1 sp|P48589|R… JD_06232014_… Conditio…            3    2.66e9 B            
## 2 sp|P48589|R… JD_06232014_… Conditio…            2    2.76e9 B            
## 3 sp|P48589|R… JD_06232014_… Conditio…            1    2.84e9 A            
## 4 sp|P48589|R… JD_06232014_… Conditio…            4    2.88e9 A