Last update: Wed May 11 20:00:24 2016
This vignette available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
This document provides annotated and reproducible quantitative proteomics data analysis examples for the Quantitative Proteomics And Data Analysis course (intro slides).
To be able to execute the code below, you will need to have a working R installation. I also recommend using the RStudio editor. To install the proteomics add-on packages required for this tutorial, you will need to run the following code:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RforProteomics", dependencies = TRUE)
biocLite("AnnotationHub")
biocLite("genefilter")
biocLite("gplots")
biocLite("qvalue")
For a more thorough introduction to R for proteomics, please read the RforProteomics
vignette (online or off-line with vignette("RforProteomics")
after installing as described above), the visualisation vignette and the corresonding papers [1, 2]
We first need to load the proteomics packages:
library("MSnbase")
library("rpx")
library("mzR")
library("RforProteomics")
library("pRoloc")
library("pRolocdata")
library("msmsTests")
library("AnnotationHub")
library("lattice")
library("gridExtra")
library("gplots")
library("genefilter")
library("qvalue")
AnnotationHub is a cloud resource set up and managed by the Bioconductor project that programmatically disseminates omics data. I am currently working on contributing proteomics data.
Below, we download a raw mass spectrometry dataset with identifier AH49008
and store it in a variable names ms
.
ah <- AnnotationHub()
ms <- ah[["AH49008"]]
ms
## Mass Spectrometry file handle.
## Filename: 55314
## Number of scans: 7534
The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The filename, 55314, is not very descriptive because the data originates from the AnnotationHub
cloud repository. If the data was read from a local file, is would be named as the mzML
(or mzXML
) file.
Later, we will use data that is distributed direclty with package and access them using the data
function. One can also use the rpx package to access and download data from the ProteomeXchange repository.
px1 <- PXDataset("PXD000001")
px1
## Object of class "PXDataset"
## Id: PXD000001 with 10 files
## [1] 'F063721.dat' ... [10] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
mzf <- pxget(px1, 6)
mzf
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
Manual download:
f1 <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/Thermo_Hela_PRTC_1.mzML")
f2 <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/Thermo_Hela_PRTC_2.mzML")
f3 <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/Thermo_Hela_PRTC_3.mzML")
f3
## [1] "./Thermo_Hela_PRTC_3.mzML"
chromatogram(ms)
c1 <- chromatogram(f1)
c2 <- chromatogram(f2, plot = FALSE)
lines(c2, col = "steelblue", lty = "dashed")
c3 <- chromatogram(f3, plot = FALSE)
lines(c3, col = "orange", lty = "dotted")
par(mfrow = c(1, 2))
xic(ms, mz = 636.925, width = 0.01)
x <- xic(ms, mz = 636.925, width = 0.01, rtlim = c(2120, 2200))
We first load a test iTRAQ data called itraqdata
and distributed as part of the MSnbase package using the data
function. This is a pre-packaged data that comes as a dedicated data structure called MSnExp
. We then plot
the 10th spectum using specific code that recognizes what to do with an element of an MSnExp
.
data(itraqdata)
itraqdata
## Object of class "MSnExp"
## Object size in memory: 1.88 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of MS1 acquisitions: 1
## Number of MSn scans: 55
## Number of precursor ions: 55
## 55 unique MZs
## Precursor MZ's: 401.74 - 1236.1
## MSn M/Z range: 100 2069.27
## MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
plot(itraqdata[[10]], reporters = iTRAQ4, full=TRUE)
The ms
data is not pre-packaged as an MSnExp
data. It is a more bare-bone mzRramp object, a pointer to a raw data file (here 55314): we need first to extract a spectrum of interest (here the 3071st spectrum, an MS1 spectrum), and use the generic plot
function to visualise the spectrum.
plot(peaks(ms, 3071), type = "h",
xlab = "M/Z", ylab = "Intensity",
sub = formatRt(hd[3071, "retentionTime"]))
Below, we use data downloaded from ProteomeXchange (see above) to generate additional raw data visualisations. These examples are taken from the RforProteomics visualisation vignette. The code, which is not displayed here, can also be seen in the source document.
The importance of flexible access to specialised data becomes visible in the figure below (taken from the RforProteomics visualisation vignette). Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.
The upper panel represents the chomatogram of the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML raw data file. We concentrate at a specific retention time, 30:1 minutes (1800.6836 seconds) corresponding to the 2807th MS1 spectrum, shown on the second row of figures. On the right, we zoom on the isotopic envelope of one peptide in particular. All vertical lines (red and grey) represent all the ions that were selected for a second round of MS analysis; these are represented in the bottom part of the figure.
Below, we illustrate some additional visualisation and animations of raw MS data, also taken from the RforProteomics visualisation vignette. On the left, we have a heatmap like visualisation of a MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.
## 1
## 1
Below, we have animations build from extracting successive slices as above.
Annotated spectra and comparing spectra.
par(mfrow = c(1, 2))
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE)
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))
The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:
calculateFragments("SIGFEGDSIGR")
## Modifications used: C=57.02146
## mz ion type pos z seq
## 1 88.03931 b1 b 1 1 S
## 2 201.12337 b2 b 2 1 SI
## 3 258.14483 b3 b 3 1 SIG
## 4 405.21324 b4 b 4 1 SIGF
## 5 534.25583 b5 b 5 1 SIGFE
## 6 591.27729 b6 b 6 1 SIGFEG
## 7 706.30423 b7 b 7 1 SIGFEGD
## 8 793.33626 b8 b 8 1 SIGFEGDS
## 9 906.42032 b9 b 9 1 SIGFEGDSI
## 10 963.44178 b10 b 10 1 SIGFEGDSIG
## 11 1119.54289 b11 b 11 1 SIGFEGDSIGR
## 12 175.11895 y1 y 1 1 R
## 13 232.14041 y2 y 2 1 GR
## 14 345.22447 y3 y 3 1 IGR
## 15 432.25650 y4 y 4 1 SIGR
## 16 547.28344 y5 y 5 1 DSIGR
## [ reached getOption("max.print") -- omitted 20 rows ]
Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipluate them and perform computations. The two spectra corresponding to the IMIDLDGTENK
peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra
for details).
There are 2 Bioconductor packages for peptide-spectrum matching directly in R, namely MSGFplus and rTANDEM, replying on MSGF+
and X!TANDEM
respectively.
See also the MSGFgui package for visualisation of identification data.
What does the quantitative data encode: ratios or intenstities? Do not let the software decide for you!
Here’s where the experimental design becomes essential: what are replicates: technical and biological, what variability (technical vs biological vs different conditions) are we exploring.
The data illustrated in the heatmap below is available as the mulvey2015norm
MSnSet
data. In this experiment, Mulvey and colleagues, performed a time course experiment on mouse extra-embryonic endoderm (XEN) stem cells. Extra-embryonic endoderm differentiation can be modelled in vitro by induced expression of GATA transcription factors in mouse embryonic stem cells. They used this GATA-inducible system to quantitatively monitor the dynamics of global proteomic changes during the early stages of this differentiation event (at 0, 16, 24, 48 and 72 hours) and also investigate the fully differentiated phenotype, as represented by embryo-derived XEN cells.
data(mulvey2015norm)
table(pData(mulvey2015norm))
## , , cond = 1
##
## times
## rep 1 2 3 4 5 6
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 3 1 1 1 1 1 1
table(pData(mulvey2015norm)[, -1])
## cond
## times 1
## 1 3
## 2 3
## 3 3
## 4 3
## 5 3
## 6 3
MSnSet
can be created from various formats such as raw data, mzTab
and spreadsheets, when using third-party software for data quantitation and pro-processing. See ?readMSnSet
for details about the latter.
heatmap(exprs(mulvey2015norm))
Note: Often, heatmap illustrating the results of a statistical analysis show very distinctive coloured patterns. While stinking, these patterns only display the obvious, i.e. a set of features that have been selected for differences between experimental conditions. It would be very worrying in no pattern was to be observed and, in case of such a pattern, the figure itself does not enable to assess the validity of the results.
i0 <- image(mulvey2015norm, plot = FALSE)
i1 <- image(mulvey2015norm, facetBy = "rep", plot = FALSE)
i2 <- image(mulvey2015norm, facetBy = "times", plot = FALSE)
grid.arrange(i0, i1, i2, ncol = 3)
i <- c(grep("0hr", sampleNames(mulvey2015norm)),
grep("XEN", sampleNames(mulvey2015norm)))
## plot all pairs with correlations - see ?pairs for details
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor = 1) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
text(0.5, 0.5, txt, cex = cex.cor)
}
pairs(exprs(mulvey2015norm[, i]), lower.panel = panel.cor)
par(mfrow = c(1, 2))
hc <- hclust(dist(exprs(mulvey2015norm)))
plot(hc)
hc2 <- hclust(dist(t(exprs(mulvey2015norm))))
plot(hc2)
mztab <- pxget(px1, "PXD000001_mztab.txt")
qnt <- readMzTabData(mztab, what = "PEP", version = "0.9")
sampleNames(qnt) <- reporterNames(TMT6)
qnt <- filterNA(qnt)
spikes <- c("P02769", "P00924", "P62894", "P00489")
protclasses <- as.character(fData(qnt)$accession)
protclasses[!protclasses %in% spikes] <- "Background"
MAplot(qnt[, 1:2], col = factor(protclasses), pch = 19)
MAplot(qnt, cex = .8)
The RforProteomics visualisation vignette has a dedicated section about MA plots, describing the different R plotting systems and various customisations that can be applied to improve such figures and scatterplots in general.
The shinyMA
interactive application enables to navigate an MA plot an a linked expression plot. It can be tested online (https://lgatto.shinyapps.io/shinyMA/) or offline (as part of the RforProteomics package), as shown below:
shinyMA()
data(naset)
naplot(naset, col = "black")
## features.na
## 0 1 2 3 4 8 9 10
## 301 247 91 13 2 23 10 2
## samples.na
## 34 39 41 42 43 45 47 49 51 52 53 55 56 57 61
## 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
x <- impute(naset, "zero")
exprs(x)[exprs(x) != 0] <- 1
suppressPackageStartupMessages(library("gplots"))
heatmap.2(exprs(x), col = c("lightgray", "black"),
scale = "none", dendrogram = "none",
trace = "none", keysize = 0.5, key = FALSE,
RowSideColors = ifelse(fData(x)$randna, "orange", "brown"),
ColSideColors = rep(c("steelblue", "darkolivegreen"), each = 8))
Identification tranfer between acquisitions (label-free)
One solution is to remove all or part of the features that have missing values (see ?filterNA
) and/or impute missing values (?impute
). The latter is not a straighforward thing, as is likely to dramatically fail when a high proportion of data is missing (10s of %).
Different imputation methods are more appropriate to different classes of missing values (as documented in this paper).
Normalisation: remove unwanted (technical) variation while retaining biological variability.
sf <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/silac.rda")
load(sf)
ns1 <- s1; ns2 <- s2; ns3 <- s3
exprs(ns1) <- exprs(s1) - median(exprs(s1))
exprs(ns2) <- exprs(s2) - median(exprs(s2))
exprs(ns3) <- exprs(s3) - median(exprs(s3))
par(mfrow = c(1, 2))
plot(density(exprs(s1)), col = "red", main = expression(SILAC~log[2]~ratios))
rug(exprs(s1))
lines(density(exprs(s2)), col = "green")
lines(density(exprs(s3)), col = "blue")
grid()
abline(v = 0)
plot(density(exprs(ns1)), col = "red", main = expression(Normalised~SILAC~log[2]~ratios))
rug(exprs(ns1))
lines(density(exprs(ns2)), col = "green")
lines(density(exprs(ns3)), col = "blue")
grid()
abline(v = 0)
Other approaches for ratio normalisation:
loess
polynomial regression that uses the raw intensities.quantile
(as below) for between acquisition normalisation.See the limma package.
par(mfrow = c(1, 4))
data(dunkley2006)
boxplot(exprs(dunkley2006), main = "original")
boxplot(exprs(normalise(dunkley2006, method = "vsn")),
main = "Variance stabilisation normalisation")
boxplot(exprs(normalise(dunkley2006, method = "center.median")),
main = "Median")
boxplot(exprs(normalise(dunkley2006, method = "quantiles")),
main = "Quantile")
plot2D(t(mulvey2015norm), fcol = "times", fpch = "rep")
addLegend(t(mulvey2015norm), where = "bottomright", fcol = "times")
plot2D(dunkley2006)
addLegend(dunkley2006, where = "topleft")
data(hyperLOPIT2015)
setStockcol(paste0(getStockcol(), 60))
plot2D(hyperLOPIT2015,
fcol = "final.assignment",
cex = exp(fData(hyperLOPIT2015)$svm.score) - 1)
set.seed(1)
x <- matrix(rnorm(3000), ncol = 3)
res <- rowttests(x)
par(mfrow = c(1, 2))
hist(res$p.value, xlab = "p-values",
breaks = 100, main = "Random data")
abline(v = 0.05, col = "red", lwd = 2)
matplot(t(x[which(res$p.value < 0.01), ]), type = "b",
ylab = expression(log[2]~fold-change),
main = "Data with a p-value < 0.01")
sum(res$p.value < 0.05)
## [1] 49
See How to interpret a p-value histogram for more explanations.
summary(qvalue(res$p.value)$qvalue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4686 0.8637 0.8679 0.8865 0.9196 0.9353
Here, we have spectral counting data and use a quasi-likelihood GLM regression to compare count data between two treatments taking two batches into account. This code chunk comes from the msmsTests package.
data(msms.dataset)
e <- pp.msms.data(msms.dataset)
null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e), 2, sum)
res <- msms.glm.qlll(e, alt.f, null.f, div = div)
lst <- test.results(res, e, pData(e)$treat, "U600", "U200", div,
alpha = 0.05, minSpC = 2, minLFC = log2(1.8),
method = "BH")
summary(lst[["tres"]]$p.value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.07823 0.29630 0.36620 0.59780 0.98860
hist(lst[["tres"]]$p.value, main = "Raw p-values")
summary(lst[["tres"]]$adjp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000003 0.3107000 0.5896000 0.5536000 0.7959000 0.9886000
hist(lst[["tres"]]$adjp, main = "Adjusted p-values")
res.volcanoplot(lst$tres, max.pval = 0.05,
min.LFC = 1, maxx = 3, maxy = NULL,
ylbls = 4)
Context is essential in biology: we need to be able to pull annotations from GO, Biomart (biomaRt), UniProt (UniProt.ws), and many specialised databases as well as decorate the data (samples, features) with additional values we calculate (such as p-values, classification results, number of missing values, …).
There are different approaches to enrichment analyses. One, based on the hyper-geometric distribution assumes that the universe (all expected features/proteins) are known. But this is often undefined in proteomics experiments: do we take all the proteins in the database, all identified proteins or protein groups, …? Other approaches are based on bootstrap re-sampling, which relies on observed features.
Check your p-values!
The source used to generate this document is available here.
Software used:
## R Under development (unstable) (2016-03-03 r70270)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] qvalue_2.4.0 genefilter_1.54.0 gplots_3.0.1
## [4] gridExtra_2.2.1 lattice_0.20-33 AnnotationHub_2.4.0
## [7] msmsTests_1.10.0 msmsEDA_1.10.0 pRolocdata_1.10.0
## [10] pRoloc_1.12.0 MLInterfaces_1.52.0 cluster_2.0.4
## [13] annotate_1.50.0 XML_3.98-1.4 AnnotationDbi_1.34.0
## [16] IRanges_2.6.0 S4Vectors_0.10.0 RforProteomics_1.10.0
## [19] rpx_1.8.0 MSnbase_1.21.4 ProtGenerics_1.4.0
## [22] BiocParallel_1.6.0 mzR_2.6.0 Rcpp_0.12.4.5
## [25] Biobase_2.32.0 BiocGenerics_0.18.0 BiocStyle_2.0.0
##
## loaded via a namespace (and not attached):
## [1] plyr_1.8.3 GSEABase_1.34.0
## [3] splines_3.3.0 ggvis_0.4.2
## [5] ggplot2_2.1.0 digest_0.6.9
## [7] foreach_1.4.3 BiocInstaller_1.22.1
## [9] htmltools_0.3.5 gdata_2.17.0
## [11] magrittr_1.5 doParallel_1.0.10
## [13] sfsmisc_1.1-0 limma_3.28.2
## [15] rda_1.0.2-2 R.utils_2.3.0
## [17] lpSolve_5.6.13 colorspace_1.2-6
## [19] dplyr_0.4.3 RCurl_1.95-4.8
## [21] jsonlite_0.9.19 graph_1.50.0
## [23] lme4_1.1-12 impute_1.46.0
## [25] survival_2.39-3 iterators_1.0.8
## [27] gtable_0.2.0 zlibbioc_1.18.0
## [29] MatrixModels_0.4-1 car_2.1-2
## [31] kernlab_0.9-24 prabclus_2.2-6
## [33] DEoptimR_1.0-4 SparseM_1.7
## [35] scales_0.4.0 vsn_3.40.0
## [37] mvtnorm_1.0-5 DBI_0.4-1
## [39] edgeR_3.14.0 xtable_1.8-2
## [41] proxy_0.4-15 mclust_5.2
## [43] preprocessCore_1.34.0 htmlwidgets_0.6
## [45] sampling_2.7 httr_1.1.0
## [47] threejs_0.2.2 FNN_1.1
## [49] RColorBrewer_1.1-2 fpc_2.1-10
## [51] modeltools_0.2-21 R.methodsS3_1.7.1
## [53] flexmix_2.3-13 nnet_7.3-12
## [55] RJSONIO_1.3-0 caret_6.0-68
## [57] labeling_0.3 reshape2_1.4.1
## [59] munsell_0.4.3 mlbench_2.1-1
## [61] biocViews_1.40.0 tools_3.3.0
## [63] RSQLite_1.0.0 pls_2.5-0
## [65] evaluate_0.9 stringr_1.0.0
## [67] mzID_1.10.0 knitr_1.13
## [69] robustbase_0.92-5 rgl_0.95.1441
## [71] caTools_1.17.1 randomForest_4.6-12
## [73] RBGL_1.48.0 nlme_3.1-127
## [75] mime_0.4 quantreg_5.21
## [77] formatR_1.4 R.oo_1.20.0
## [79] biomaRt_2.28.0 pbkrtest_0.4-6
## [81] curl_0.9.7 interactiveDisplayBase_1.10.0
## [83] e1071_1.6-7 affyio_1.42.0
## [85] stringi_1.0-1 trimcluster_0.1-2
## [87] Matrix_1.2-6 nloptr_1.0.4
## [89] gbm_2.1.1 RUnit_0.4.31
## [91] MALDIquant_1.14 bitops_1.0-6
## [93] httpuv_1.3.3 R6_2.1.2
## [95] pcaMethods_1.64.0 affy_1.50.0
## [97] hwriter_1.3.2 KernSmooth_2.23-15
## [99] gridSVG_1.5-0 codetools_0.2-14
## [101] MASS_7.3-45 gtools_3.5.0
## [103] assertthat_0.1 interactiveDisplay_1.10.0
## [105] Category_2.38.0 diptest_0.75-7
## [107] mgcv_1.8-12 grid_3.3.0
## [109] rpart_4.1-10 class_7.3-14
## [111] minqa_1.2.4 shiny_0.13.2
## [113] base64enc_0.1-3