Chapter 4 Using dplyr
for data manipulation
A useful dplyr
cheet sheet is available here.
The following material is based on Data Carpentry’s the Data analisis and visualisation lessons.
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter dplyr
. dplyr
is a package for making tabular data manipulation easier. It pairs nicely with tidyr
which enables you to swiftly convert between different data formats for plotting and analysis.
We will need the tidyverse
package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as dplyr
, ggplot2
(for visualisation), tibble
, etc.
library("tidyverse")
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ purrr 0.3.2 ✔ stringr 1.4.0
## ✔ tibble 2.1.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks IRanges::collapse()
## ✖ dplyr::combine() masks MSnbase::combine(), Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ purrr::detect() masks RforProteomics::detect()
## ✖ tidyr::expand() masks S4Vectors::expand()
## ✖ magrittr::extract() masks tidyr::extract()
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ dplyr::ident() masks msdata::ident()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks IRanges::reduce(), MSnbase::reduce()
## ✖ dplyr::rename() masks plotly::rename(), S4Vectors::rename()
## ✖ dplyr::select() masks plotly::select(), AnnotationDbi::select()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ dplyr::slice() masks plotly::slice(), IRanges::slice()
The tidyverse
package tries to address 3 major problems with some of base R functions:
- The results from a base R function sometimes depends on the type of data.
- Using R expressions in a non standard way, which can be confusing for new learners.
- Hidden arguments, having default operations that new learners are not aware of.
To learn more about dplyr
and tidyr
, you may want to check out this handy data transformation with dplyr
cheatsheet and this one about tidyr
.
Let’s start by reading the data using readr::read_csv
that will produce a tibble
.
library("readr")
iprg <- read_csv("http://bit.ly/VisBiomedDataIprgCsv")
## Parsed with column specification:
## cols(
## Protein = col_character(),
## Log2Intensity = col_double(),
## Run = col_character(),
## Condition = col_character(),
## BioReplicate = col_double(),
## Intensity = col_double(),
## TechReplicate = col_character()
## )
Tibbles are data frames, but they tweak some of the old behaviors of data frames. The data structure is very similar to a data frame. For our purposes the only differences are that:
- In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
- Columns of class
character
are never converted into factors.
4.1 Selecting columns and filtering rows
We’re going to learn some of the most common dplyr
functions: select()
, filter()
, mutate()
, group_by()
, and summarize()
. To select columns of a data frame, use select()
. The first argument to this function is the data frame, and the subsequent arguments are the columns to keep.
select(iprg, Protein, Run, Condition)
To choose rows based on a specific criteria, use filter()
:
filter(iprg, BioReplicate == 1)
## # A tibble: 9,079 x 7
## Protein Log2Intensity Run Condition BioReplicate Intensity
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 sp|D6V… 26.8 JD_0… Conditio… 1 1.18e8
## 2 sp|D6V… 26.6 JD_0… Conditio… 1 1.02e8
## 3 sp|D6V… 26.6 JD_0… Conditio… 1 1.01e8
## 4 sp|O13… 24.7 JD_0… Conditio… 1 2.76e7
## 5 sp|O13… 24.7 JD_0… Conditio… 1 2.68e7
## 6 sp|O13… 24.7 JD_0… Conditio… 1 2.76e7
## 7 sp|O13… 23.4 JD_0… Conditio… 1 1.09e7
## 8 sp|O13… 24.0 JD_0… Conditio… 1 1.69e7
## 9 sp|O13… 23.5 JD_0… Conditio… 1 1.16e7
## 10 sp|O13… 27.5 JD_0… Conditio… 1 1.92e8
## # … with 9,069 more rows, and 1 more variable: TechReplicate <chr>
filter(iprg, Condition == 'Condition2')
## # A tibble: 9,081 x 7
## Protein Log2Intensity Run Condition BioReplicate Intensity
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 sp|D6V… 26.8 JD_0… Conditio… 2 1.20e8
## 2 sp|D6V… 26.8 JD_0… Conditio… 2 1.16e8
## 3 sp|D6V… 26.6 JD_0… Conditio… 2 1.02e8
## 4 sp|O13… 24.5 JD_0… Conditio… 2 2.41e7
## 5 sp|O13… 24.7 JD_0… Conditio… 2 2.68e7
## 6 sp|O13… 24.6 JD_0… Conditio… 2 2.51e7
## 7 sp|O13… 23.2 JD_0… Conditio… 2 9.45e6
## 8 sp|O13… 23.4 JD_0… Conditio… 2 1.13e7
## 9 sp|O13… 23.8 JD_0… Conditio… 2 1.43e7
## 10 sp|O13… 25.9 JD_0… Conditio… 2 6.18e7
## # … with 9,071 more rows, and 1 more variable: TechReplicate <chr>
4.2 Pipes
But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as things are evaluated from the inside out.
The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>%
and are made available via the magrittr
package, installed automatically with dplyr
. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
iprg %>%
filter(Intensity > 1e8) %>%
select(Protein, Condition, Intensity)
## # A tibble: 4,729 x 3
## Protein Condition Intensity
## <chr> <chr> <dbl>
## 1 sp|D6VTK4|STE2_YEAST Condition1 117845016.
## 2 sp|D6VTK4|STE2_YEAST Condition1 102273602.
## 3 sp|D6VTK4|STE2_YEAST Condition1 100526837.
## 4 sp|D6VTK4|STE2_YEAST Condition2 119765106.
## 5 sp|D6VTK4|STE2_YEAST Condition2 116382798.
## 6 sp|D6VTK4|STE2_YEAST Condition2 102328260.
## 7 sp|D6VTK4|STE2_YEAST Condition3 103830944.
## 8 sp|D6VTK4|STE2_YEAST Condition4 102150172.
## 9 sp|D6VTK4|STE2_YEAST Condition4 105724288.
## 10 sp|O13539|THP2_YEAST Condition1 192490784.
## # … with 4,719 more rows
In the above, we use the pipe to send the iprg
dataset first through filter()
to keep rows where Intensity
is greater than 1e8, then through select()
to keep only the Protein
, Condition
, and Intensity
columns. Since %>%
takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include it as an argument to the filter()
and select()
functions anymore.
If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:
iprg_sml <- iprg %>%
filter(Intensity > 1e8) %>%
select(Protein, Condition, Intensity)
iprg_sml
## # A tibble: 4,729 x 3
## Protein Condition Intensity
## <chr> <chr> <dbl>
## 1 sp|D6VTK4|STE2_YEAST Condition1 117845016.
## 2 sp|D6VTK4|STE2_YEAST Condition1 102273602.
## 3 sp|D6VTK4|STE2_YEAST Condition1 100526837.
## 4 sp|D6VTK4|STE2_YEAST Condition2 119765106.
## 5 sp|D6VTK4|STE2_YEAST Condition2 116382798.
## 6 sp|D6VTK4|STE2_YEAST Condition2 102328260.
## 7 sp|D6VTK4|STE2_YEAST Condition3 103830944.
## 8 sp|D6VTK4|STE2_YEAST Condition4 102150172.
## 9 sp|D6VTK4|STE2_YEAST Condition4 105724288.
## 10 sp|O13539|THP2_YEAST Condition1 192490784.
## # … with 4,719 more rows
Note that the final data frame is the leftmost part of this expression.
Challenge
Using pipes, subset the
iprg
data to include Proteins with a log2 intensity greater than 20 and retain only the columnsProteins
, andCondition
.
## Answer
iprg %>%
filter(Log2Intensity > 20) %>%
select(Protein, Condition)
4.3 Mutate
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or find the ratio of values in two columns. For this we’ll use mutate()
.
To create a new column of weight in kg:
iprg %>%
mutate(Log10Intensity = log10(Intensity))
## # A tibble: 36,321 x 8
## Protein Log2Intensity Run Condition BioReplicate Intensity
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 sp|D6V… 26.8 JD_0… Conditio… 1 1.18e8
## 2 sp|D6V… 26.6 JD_0… Conditio… 1 1.02e8
## 3 sp|D6V… 26.6 JD_0… Conditio… 1 1.01e8
## 4 sp|D6V… 26.8 JD_0… Conditio… 2 1.20e8
## 5 sp|D6V… 26.8 JD_0… Conditio… 2 1.16e8
## 6 sp|D6V… 26.6 JD_0… Conditio… 2 1.02e8
## 7 sp|D6V… 26.6 JD_0… Conditio… 3 1.04e8
## 8 sp|D6V… 26.5 JD_0… Conditio… 3 9.47e7
## 9 sp|D6V… 26.5 JD_0… Conditio… 3 9.69e7
## 10 sp|D6V… 26.6 JD_0… Conditio… 4 1.02e8
## # … with 36,311 more rows, and 2 more variables: TechReplicate <chr>,
## # Log10Intensity <dbl>
You can also create a second new column based on the first new column within the same call of mutate()
:
iprg %>%
mutate(Log10Intensity = log10(Intensity),
Log10Intensity2 = Log10Intensity * 2)
## # A tibble: 36,321 x 9
## Protein Log2Intensity Run Condition BioReplicate Intensity
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 sp|D6V… 26.8 JD_0… Conditio… 1 1.18e8
## 2 sp|D6V… 26.6 JD_0… Conditio… 1 1.02e8
## 3 sp|D6V… 26.6 JD_0… Conditio… 1 1.01e8
## 4 sp|D6V… 26.8 JD_0… Conditio… 2 1.20e8
## 5 sp|D6V… 26.8 JD_0… Conditio… 2 1.16e8
## 6 sp|D6V… 26.6 JD_0… Conditio… 2 1.02e8
## 7 sp|D6V… 26.6 JD_0… Conditio… 3 1.04e8
## 8 sp|D6V… 26.5 JD_0… Conditio… 3 9.47e7
## 9 sp|D6V… 26.5 JD_0… Conditio… 3 9.69e7
## 10 sp|D6V… 26.6 JD_0… Conditio… 4 1.02e8
## # … with 36,311 more rows, and 3 more variables: TechReplicate <chr>,
## # Log10Intensity <dbl>, Log10Intensity2 <dbl>
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head()
of the data. (Pipes work with non-dplyr
functions, too, as long as the dplyr
or magrittr
package is loaded).
iprg %>%
mutate(Log10Intensity = log10(Intensity)) %>%
head
## # A tibble: 6 x 8
## Protein Log2Intensity Run Condition BioReplicate Intensity
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 sp|D6V… 26.8 JD_0… Conditio… 1 1.18e8
## 2 sp|D6V… 26.6 JD_0… Conditio… 1 1.02e8
## 3 sp|D6V… 26.6 JD_0… Conditio… 1 1.01e8
## 4 sp|D6V… 26.8 JD_0… Conditio… 2 1.20e8
## 5 sp|D6V… 26.8 JD_0… Conditio… 2 1.16e8
## 6 sp|D6V… 26.6 JD_0… Conditio… 2 1.02e8
## # … with 2 more variables: TechReplicate <chr>, Log10Intensity <dbl>
Note that we don’t include parentheses at the end of our call to head()
above. When piping into a function with no additional arguments, you can call the function with or without parentheses (e.g. head
or head()
).
If you want to display more data, you can use the print()
function at the end of your chain with the argument n
specifying the number of rows to display:
iprg %>%
mutate(Log10Intensity = log10(Intensity),
Log10Intensity2 = Log10Intensity * 2) %>%
print(n = 20)
## # A tibble: 36,321 x 9
## Protein Log2Intensity Run Condition BioReplicate Intensity
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 sp|D6V… 26.8 JD_0… Conditio… 1 1.18e8
## 2 sp|D6V… 26.6 JD_0… Conditio… 1 1.02e8
## 3 sp|D6V… 26.6 JD_0… Conditio… 1 1.01e8
## 4 sp|D6V… 26.8 JD_0… Conditio… 2 1.20e8
## 5 sp|D6V… 26.8 JD_0… Conditio… 2 1.16e8
## 6 sp|D6V… 26.6 JD_0… Conditio… 2 1.02e8
## 7 sp|D6V… 26.6 JD_0… Conditio… 3 1.04e8
## 8 sp|D6V… 26.5 JD_0… Conditio… 3 9.47e7
## 9 sp|D6V… 26.5 JD_0… Conditio… 3 9.69e7
## 10 sp|D6V… 26.6 JD_0… Conditio… 4 1.02e8
## 11 sp|D6V… 26.4 JD_0… Conditio… 4 8.77e7
## 12 sp|D6V… 26.7 JD_0… Conditio… 4 1.06e8
## 13 sp|O13… 24.7 JD_0… Conditio… 1 2.76e7
## 14 sp|O13… 24.7 JD_0… Conditio… 1 2.68e7
## 15 sp|O13… 24.7 JD_0… Conditio… 1 2.76e7
## 16 sp|O13… 24.5 JD_0… Conditio… 2 2.41e7
## 17 sp|O13… 24.7 JD_0… Conditio… 2 2.68e7
## 18 sp|O13… 24.6 JD_0… Conditio… 2 2.51e7
## 19 sp|O13… 24.4 JD_0… Conditio… 3 2.20e7
## 20 sp|O13… 24.6 JD_0… Conditio… 3 2.59e7
## # … with 3.63e+04 more rows, and 3 more variables: TechReplicate <chr>,
## # Log10Intensity <dbl>, Log10Intensity2 <dbl>
Let’s use a modified iprg
data that contains missing values for the next example. It can be loaded with
if (!file.exists("./data/iprgna.rda"))
download.file("http://bit.ly/VisBiomedDataIprgNA",
"./data/iprgna.rda")
load("./data/iprgna.rda")
Challenge
Using the
iprgna
data repeat the creation of a newLog10Intensisty
column.
iprgna %>% mutate(Log10Intensity = log10(Intensity))
## # A tibble: 36,321 x 7
## Protein Run Condition BioReplicate Intensity TechReplicate
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 sp|D6V… JD_0… Conditio… 1 NA B
## 2 sp|D6V… JD_0… Conditio… 1 1.02e8 C
## 3 sp|D6V… JD_0… Conditio… 1 NA A
## 4 sp|D6V… JD_0… Conditio… 2 1.20e8 A
## 5 sp|D6V… JD_0… Conditio… 2 NA B
## 6 sp|D6V… JD_0… Conditio… 2 1.02e8 C
## 7 sp|D6V… JD_0… Conditio… 3 1.04e8 A
## 8 sp|D6V… JD_0… Conditio… 3 9.47e7 B
## 9 sp|D6V… JD_0… Conditio… 3 9.69e7 C
## 10 sp|D6V… JD_0… Conditio… 4 1.02e8 B
## # … with 36,311 more rows, and 1 more variable: Log10Intensity <dbl>
The first few rows of the output are full of NA
s, so if we wanted to remove those we could insert a filter()
in the chain:
iprgna %>%
filter(!is.na(Intensity)) %>%
mutate(Log10Intensity = log10(Intensity))
## # A tibble: 35,318 x 7
## Protein Run Condition BioReplicate Intensity TechReplicate
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 sp|D6V… JD_0… Conditio… 1 1.02e8 C
## 2 sp|D6V… JD_0… Conditio… 2 1.20e8 A
## 3 sp|D6V… JD_0… Conditio… 2 1.02e8 C
## 4 sp|D6V… JD_0… Conditio… 3 1.04e8 A
## 5 sp|D6V… JD_0… Conditio… 3 9.47e7 B
## 6 sp|D6V… JD_0… Conditio… 3 9.69e7 C
## 7 sp|D6V… JD_0… Conditio… 4 1.02e8 B
## 8 sp|D6V… JD_0… Conditio… 4 8.77e7 C
## 9 sp|D6V… JD_0… Conditio… 4 1.06e8 A
## 10 sp|O13… JD_0… Conditio… 1 2.76e7 B
## # … with 35,308 more rows, and 1 more variable: Log10Intensity <dbl>
is.na()
is a function that determines whether something is an NA
. The !
symbol negates the result, so we’re asking for everything that is not an NA
.
4.4 Split-apply-combine data analysis and the summarize() function
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function.
The summarize()
function
group_by()
is often used together with summarize()
, which collapses each group into a single-row summary of that group. group_by()
takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to view the mean weight
by sex:
iprgna %>%
group_by(Condition) %>%
summarize(mean_Intensity = mean(Intensity))
## # A tibble: 4 x 2
## Condition mean_Intensity
## <chr> <dbl>
## 1 Condition1 NA
## 2 Condition2 NA
## 3 Condition3 NA
## 4 Condition4 NA
Unfortunately, the mean
of any vector that contains even a single missing value is NA
. We need to remove missing values before calculating the mean, which is done easily with the na.rm
argument.
iprgna %>%
group_by(Condition) %>%
summarize(mean_Intensity = mean(Intensity, na.rm = TRUE))
## # A tibble: 4 x 2
## Condition mean_Intensity
## <chr> <dbl>
## 1 Condition1 65144912.
## 2 Condition2 64439756.
## 3 Condition3 62475797.
## 4 Condition4 63616488.
You can also group by multiple columns:
iprgna %>%
group_by(TechReplicate, BioReplicate) %>%
summarize(mean_Intensity = mean(Intensity, na.rm = TRUE))
## # A tibble: 12 x 3
## # Groups: TechReplicate [3]
## TechReplicate BioReplicate mean_Intensity
## <chr> <int> <dbl>
## 1 A 1 64891444.
## 2 A 2 63870255.
## 3 A 3 61648150.
## 4 A 4 63662564.
## 5 B 1 65563938.
## 6 B 2 65164270.
## 7 B 3 62758494.
## 8 B 4 64196979.
## 9 C 1 64978764.
## 10 C 2 64283727.
## 11 C 3 63020774.
## 12 C 4 62984686.
Tallying
When working with data, it is also common to want to know the number of observations found for each factor or combination of factors. For this, dplyr
provides tally()
.
iprgna %>%
group_by(Condition) %>%
tally
## # A tibble: 4 x 2
## Condition n
## <chr> <int>
## 1 Condition1 9079
## 2 Condition2 9081
## 3 Condition3 9081
## 4 Condition4 9080
Here, tally()
is the action applied to the groups created by group_by()
and counts the total number of records for each category.
Challenge
How many proteins of each technical replicate are there?
Use
group_by()
andsummarize()
to find the mean, min, and max intensity for each condition.What are the proteins with the highest intensity in each condition?
## Answer 1
iprgna %>%
group_by(TechReplicate) %>%
tally
## # A tibble: 3 x 2
## TechReplicate n
## <chr> <int>
## 1 A 12107
## 2 B 12106
## 3 C 12108
## Answer 2
iprgna %>%
filter(!is.na(Intensity)) %>%
group_by(Condition) %>%
summarize(mean_int = mean(Intensity),
min_int = min(Intensity),
max_int = max(Intensity))
## # A tibble: 4 x 4
## Condition mean_int min_int max_int
## <chr> <dbl> <dbl> <dbl>
## 1 Condition1 65144912. 254608. 2841953257
## 2 Condition2 64439756. 259513. 2757471311
## 3 Condition3 62475797. 88409. 2659018724
## 4 Condition4 63616488. 84850. 2881057105
## Answer 3
iprgna %>%
filter(!is.na(Intensity)) %>%
group_by(Condition) %>%
filter(Intensity == max(Intensity)) %>%
arrange(Intensity)
## # A tibble: 4 x 6
## # Groups: Condition [4]
## Protein Run Condition BioReplicate Intensity TechReplicate
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 sp|P48589|R… JD_06232014_… Conditio… 3 2.66e9 B
## 2 sp|P48589|R… JD_06232014_… Conditio… 2 2.76e9 B
## 3 sp|P48589|R… JD_06232014_… Conditio… 1 2.84e9 A
## 4 sp|P48589|R… JD_06232014_… Conditio… 4 2.88e9 A