2 Getting example data

AnnotationHub is a cloud resource set up and managed by the Bioconductor project that programmatically disseminates omics data. I am currently working on contributing proteomics data.

Below, we download a raw mass spectrometry dataset with identifier AH49008 and store it in a variable names ms.

ah <- AnnotationHub()
ms <- ah[["AH49008"]]
ms

## Mass Spectrometry file handle.
## Filename:  55314 
## Number of scans:  7534

The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The filename, 55314, is not very descriptive because the data originates from the AnnotationHub cloud repository. If the data was read from a local file, is would be named as the mzML (or mzXML) file.

Later, we will use data that is distributed direclty with package and access them using the data function. One can also use the rpx package to access and download data from the ProteomeXchange repository.

px1 <- PXDataset("PXD000001")
px1

## Object of class "PXDataset"
##  Id: PXD000001 with 10 files
##  [1] 'F063721.dat' ... [10] 'erwinia_carotovora.fasta'
##  Use 'pxfiles(.)' to see all files.

mzf <- pxget(px1, 6)
mzf

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"

Manual download:

f1 <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/Thermo_Hela_PRTC_1.mzML")
f2 <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/Thermo_Hela_PRTC_2.mzML")
f3 <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/Thermo_Hela_PRTC_3.mzML")
f3

## [1] "./Thermo_Hela_PRTC_3.mzML"

3 Visualising raw data

3.1 A full chromatogam

chromatogram(ms)

plot of chunk chromato

3.2 Multiple chromatograms

c1 <- chromatogram(f1)
c2 <- chromatogram(f2, plot = FALSE)
lines(c2, col = "steelblue", lty = "dashed")
c3 <- chromatogram(f3, plot = FALSE)
lines(c3, col = "orange", lty = "dotted")

plot of chunk chromato3

3.3 An extracted ion chromatogram

par(mfrow = c(1, 2))
xic(ms, mz = 636.925, width = 0.01)
x <- xic(ms, mz = 636.925, width = 0.01, rtlim = c(2120, 2200))

plot of chunk xic

3.4 Spectra

We first load a test iTRAQ data called itraqdata and distributed as part of the MSnbase package using the data function. This is a pre-packaged data that comes as a dedicated data structure called MSnExp. We then plot the 10th spectum using specific code that recognizes what to do with an element of an MSnExp.

data(itraqdata)
itraqdata

## Object of class "MSnExp"
##  Object size in memory: 1.88 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of MS1 acquisitions: 1 
##  Number of MSn scans: 55 
##  Number of precursor ions: 55 
##  55 unique MZs
##  Precursor MZ's: 401.74 - 1236.1 
##  MSn M/Z range: 100 2069.27 
##  MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
##  MSnbase version: 1.1.22 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: sampleNames sampleNumbers
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ProteinDescription
##     PeptideSequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

plot(itraqdata[[10]], reporters = iTRAQ4, full=TRUE)

plot of chunk itraqdata

The ms data is not pre-packaged as an MSnExp data. It is a more bare-bone mzRramp object, a pointer to a raw data file (here 55314): we need first to extract a spectrum of interest (here the 3071st spectrum, an MS1 spectrum), and use the generic plot function to visualise the spectrum.

plot(peaks(ms, 3071), type = "h",
     xlab = "M/Z", ylab = "Intensity",     
     sub = formatRt(hd[3071, "retentionTime"]))

plot of chunk ms1

Below, we use data downloaded from ProteomeXchange (see above) to generate additional raw data visualisations. These examples are taken from the RforProteomics visualisation vignette. The code, which is not displayed here, can also be seen in the source document.

The importance of flexible access to specialised data becomes visible in the figure below (taken from the RforProteomics visualisation vignette). Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.

The upper panel represents the chomatogram of the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML raw data file. We concentrate at a specific retention time, 30:1 minutes (1800.6836 seconds) corresponding to the 2807th MS1 spectrum, shown on the second row of figures. On the right, we zoom on the isotopic envelope of one peptide in particular. All vertical lines (red and grey) represent all the ions that were selected for a second round of MS analysis; these are represented in the bottom part of the figure.

plot of chunk mslayout

Below, we illustrate some additional visualisation and animations of raw MS data, also taken from the RforProteomics visualisation vignette. On the left, we have a heatmap like visualisation of a MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.

## 1

## 1

plot of chunk msmap1

Below, we have animations build from extracting successive slices as above.

4 Identification data

Annotated spectra and comparing spectra.

par(mfrow = c(1, 2))
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE)
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))

plot of chunk id1

The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:

calculateFragments("SIGFEGDSIGR")

## Modifications used: C=57.02146

##            mz  ion type pos z         seq
## 1    88.03931   b1    b   1 1           S
## 2   201.12337   b2    b   2 1          SI
## 3   258.14483   b3    b   3 1         SIG
## 4   405.21324   b4    b   4 1        SIGF
## 5   534.25583   b5    b   5 1       SIGFE
## 6   591.27729   b6    b   6 1      SIGFEG
## 7   706.30423   b7    b   7 1     SIGFEGD
## 8   793.33626   b8    b   8 1    SIGFEGDS
## 9   906.42032   b9    b   9 1   SIGFEGDSI
## 10  963.44178  b10    b  10 1  SIGFEGDSIG
## 11 1119.54289  b11    b  11 1 SIGFEGDSIGR
## 12  175.11895   y1    y   1 1           R
## 13  232.14041   y2    y   2 1          GR
## 14  345.22447   y3    y   3 1         IGR
## 15  432.25650   y4    y   4 1        SIGR
## 16  547.28344   y5    y   5 1       DSIGR
##  [ reached getOption("max.print") -- omitted 20 rows ]

Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipluate them and perform computations. The two spectra corresponding to the IMIDLDGTENK peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra for details).

There are 2 Bioconductor packages for peptide-spectrum matching directly in R, namely MSGFplus and rTANDEM, replying on MSGF+ and X!TANDEM respectively.

See also the MSGFgui package for visualisation of identification data.

MSGFgui screenshot

5 Quantitation data

What does the quantitative data encode: ratios or intenstities? Do not let the software decide for you!

Here’s where the experimental design becomes essential: what are replicates: technical and biological, what variability (technical vs biological vs different conditions) are we exploring.

The data illustrated in the heatmap below is available as the mulvey2015norm MSnSet data. In this experiment, Mulvey and colleagues, performed a time course experiment on mouse extra-embryonic endoderm (XEN) stem cells. Extra-embryonic endoderm differentiation can be modelled in vitro by induced expression of GATA transcription factors in mouse embryonic stem cells. They used this GATA-inducible system to quantitatively monitor the dynamics of global proteomic changes during the early stages of this differentiation event (at 0, 16, 24, 48 and 72 hours) and also investigate the fully differentiated phenotype, as represented by embryo-derived XEN cells.

data(mulvey2015norm)
table(pData(mulvey2015norm))

## , , cond = 1
## 
##    times
## rep 1 2 3 4 5 6
##   1 1 1 1 1 1 1
##   2 1 1 1 1 1 1
##   3 1 1 1 1 1 1

table(pData(mulvey2015norm)[, -1])

##      cond
## times 1
##     1 3
##     2 3
##     3 3
##     4 3
##     5 3
##     6 3

MSnSet can be created from various formats such as raw data, mzTab and spreadsheets, when using third-party software for data quantitation and pro-processing. See ?readMSnSet for details about the latter.

5.1 Heatmaps

heatmap(exprs(mulvey2015norm))

plot of chunk mulveyhm

Note: Often, heatmap illustrating the results of a statistical analysis show very distinctive coloured patterns. While stinking, these patterns only display the obvious, i.e. a set of features that have been selected for differences between experimental conditions. It would be very worrying in no pattern was to be observed and, in case of such a pattern, the figure itself does not enable to assess the validity of the results.

5.2 Images

i0 <- image(mulvey2015norm, plot = FALSE)
i1 <- image(mulvey2015norm, facetBy = "rep", plot = FALSE)
i2 <- image(mulvey2015norm, facetBy = "times", plot = FALSE)
grid.arrange(i0, i1, i2, ncol = 3)

plot of chunk imacefacet

5.3 Pair plots with all vs all scatterplots

i <- c(grep("0hr", sampleNames(mulvey2015norm)),
       grep("XEN", sampleNames(mulvey2015norm)))

## plot all pairs with correlations - see ?pairs for details
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor = 1) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y)
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    text(0.5, 0.5, txt, cex = cex.cor)
}

pairs(exprs(mulvey2015norm[, i]), lower.panel = panel.cor)

plot of chunk pairplot

5.4 Hierarchical clustering only

par(mfrow = c(1, 2))
hc <- hclust(dist(exprs(mulvey2015norm)))
plot(hc)
hc2 <- hclust(dist(t(exprs(mulvey2015norm))))
plot(hc2)

plot of chunk hclust

5.5 MA plots

mztab <- pxget(px1, "PXD000001_mztab.txt")
qnt <- readMzTabData(mztab, what = "PEP", version = "0.9")
sampleNames(qnt) <- reporterNames(TMT6)
qnt <- filterNA(qnt)
spikes <- c("P02769", "P00924", "P62894", "P00489")
protclasses <- as.character(fData(qnt)$accession)
protclasses[!protclasses %in% spikes] <- "Background"

MAplot(qnt[, 1:2], col = factor(protclasses), pch = 19)

plot of chunk maplots

MAplot(qnt, cex = .8)

plot of chunk maplots

The RforProteomics visualisation vignette has a dedicated section about MA plots, describing the different R plotting systems and various customisations that can be applied to improve such figures and scatterplots in general.

The shinyMA interactive application enables to navigate an MA plot an a linked expression plot. It can be tested online (https://lgatto.shinyapps.io/shinyMA/) or offline (as part of the RforProteomics package), as shown below:

shinyMA()

5.6 Missing values

data(naset)
naplot(naset, col = "black")

plot of chunk na

## features.na
##   0   1   2   3   4   8   9  10 
## 301 247  91  13   2  23  10   2 
## samples.na
## 34 39 41 42 43 45 47 49 51 52 53 55 56 57 61 
##  1  1  1  1  1  2  1  1  1  1  1  1  1  1  1

x <- impute(naset, "zero")
exprs(x)[exprs(x) != 0] <- 1
suppressPackageStartupMessages(library("gplots"))
heatmap.2(exprs(x), col = c("lightgray", "black"),
          scale = "none", dendrogram = "none",
          trace = "none", keysize = 0.5, key = FALSE,
          RowSideColors = ifelse(fData(x)$randna, "orange", "brown"),
          ColSideColors = rep(c("steelblue", "darkolivegreen"), each = 8))

plot of chunk naheatmap

Identification tranfer between acquisitions (label-free)
One solution is to remove all or part of the features that have missing values (see ?filterNA) and/or impute missing values (?impute). The latter is not a straighforward thing, as is likely to dramatically fail when a high proportion of data is missing (10s of %).

Different imputation methods are more appropriate to different classes of missing values (as documented in this paper).

5.7 Normalisation strategies

Normalisation: remove unwanted (technical) variation while retaining biological variability.

5.7.1 Ratios (SILAC)

sf <- downloadData("http://proteome.sysbiol.cam.ac.uk/lgatto/files/silac.rda")
load(sf)
ns1 <- s1; ns2 <- s2; ns3 <- s3
exprs(ns1) <- exprs(s1) - median(exprs(s1))
exprs(ns2) <- exprs(s2) - median(exprs(s2))
exprs(ns3) <- exprs(s3) - median(exprs(s3))

par(mfrow = c(1, 2))

plot(density(exprs(s1)), col = "red", main = expression(SILAC~log[2]~ratios))
rug(exprs(s1))
lines(density(exprs(s2)), col = "green")
lines(density(exprs(s3)), col = "blue")
grid()
abline(v = 0)

plot(density(exprs(ns1)), col = "red", main = expression(Normalised~SILAC~log[2]~ratios))
rug(exprs(ns1))
lines(density(exprs(ns2)), col = "green")
lines(density(exprs(ns3)), col = "blue")
grid()
abline(v = 0)

plot of chunk silacnormplot

Other approaches for ratio normalisation:

loess polynomial regression that uses the raw intensities.
quantile (as below) for between acquisition normalisation.

See the limma package.

5.7.2 Intensities (iTRAQ)

par(mfrow = c(1, 4))
data(dunkley2006)
boxplot(exprs(dunkley2006), main = "original")
boxplot(exprs(normalise(dunkley2006, method = "vsn")),
        main = "Variance stabilisation normalisation")
boxplot(exprs(normalise(dunkley2006, method = "center.median")),
        main = "Median")
boxplot(exprs(normalise(dunkley2006, method = "quantiles")),
        main = "Quantile")

plot of chunk normbxplot

5.8 PCA analysis and plots

plot2D(t(mulvey2015norm), fcol = "times", fpch = "rep")
addLegend(t(mulvey2015norm), where = "bottomright", fcol = "times")

plot of chunk pca1

plot2D(dunkley2006)
addLegend(dunkley2006, where = "topleft")

plot of chunk pca

data(hyperLOPIT2015)
setStockcol(paste0(getStockcol(), 60))
plot2D(hyperLOPIT2015,
       fcol = "final.assignment",
       cex = exp(fData(hyperLOPIT2015)$svm.score) - 1)

plot of chunk pcacex

6 Statistical analyses

6.1 Random data

set.seed(1)
x <- matrix(rnorm(3000), ncol = 3)
res <- rowttests(x)
par(mfrow = c(1, 2))
hist(res$p.value, xlab = "p-values",
     breaks = 100, main = "Random data")
abline(v = 0.05, col = "red", lwd = 2)
matplot(t(x[which(res$p.value < 0.01), ]), type = "b",
        ylab = expression(log[2]~fold-change),
        main = "Data with a p-value < 0.01")

plot of chunk random

sum(res$p.value < 0.05)

## [1] 49

p-value histograms

A: well-behaved p-values!
B: all hypotheses are probably null.
C: Don’t blindly apply false discovery rate control, as the peak close to one violates the assumption of uniform distribution towards one. Ask for help!
D: Something went wrong: wrong distribution, p-values were already adjusted, … Ask for help!
E: Bootstrap with too few iterations? Non-parametric test with too few samples? Ask for help!
F: Ask for help!

See How to interpret a p-value histogram for more explanations.

6.2 Adjustment for multiple testing

summary(qvalue(res$p.value)$qvalue)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4686  0.8637  0.8679  0.8865  0.9196  0.9353

6.3 Real data

Here, we have spectral counting data and use a quasi-likelihood GLM regression to compare count data between two treatments taking two batches into account. This code chunk comes from the msmsTests package.

data(msms.dataset)
e <- pp.msms.data(msms.dataset)
null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e), 2, sum)
res <- msms.glm.qlll(e, alt.f, null.f, div = div)
lst <- test.results(res, e, pData(e)$treat, "U600", "U200", div,
                    alpha = 0.05, minSpC = 2, minLFC = log2(1.8),
                    method = "BH")

6.4 p-values

summary(lst[["tres"]]$p.value)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.07823 0.29630 0.36620 0.59780 0.98860

hist(lst[["tres"]]$p.value, main = "Raw p-values")

plot of chunk pvhist

summary(lst[["tres"]]$adjp)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000003 0.3107000 0.5896000 0.5536000 0.7959000 0.9886000

hist(lst[["tres"]]$adjp, main = "Adjusted p-values")

plot of chunk adjphist

6.5 Volcano plots

res.volcanoplot(lst$tres, max.pval = 0.05,
                min.LFC = 1, maxx = 3, maxy = NULL,
                ylbls = 4)

plot of chunk volc

7 Annotation

Context is essential in biology: we need to be able to pull annotations from GO, Biomart (biomaRt), UniProt (UniProt.ws), and many specialised databases as well as decorate the data (samples, features) with additional values we calculate (such as p-values, classification results, number of missing values, …).

8 Gene-set/pathway enrichment analyses

There are different approaches to enrichment analyses. One, based on the hyper-geometric distribution assumes that the universe (all expected features/proteins) are known. But this is often undefined in proteomics experiments: do we take all the proteins in the database, all identified proteins or protein groups, …? Other approaches are based on bootstrap re-sampling, which relies on observed features.
Check your p-values!

pathway enrichment

9 References and resources

Visualisation of proteomics data using R and Bioconductor
Using R and Bioconductor for proteomics data analysis
RforProteomics: http://bioconductor.org/packages/RforProteomics
R/Bioconductor workflow
Teaching material for R and more
Workshops: Software and Data Carpentry, Data Programmers

10 About this document

The source used to generate this document is available here.

Software used:

## R Under development (unstable) (2016-03-03 r70270)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] qvalue_2.4.0          genefilter_1.54.0     gplots_3.0.1         
##  [4] gridExtra_2.2.1       lattice_0.20-33       AnnotationHub_2.4.0  
##  [7] msmsTests_1.10.0      msmsEDA_1.10.0        pRolocdata_1.10.0    
## [10] pRoloc_1.12.0         MLInterfaces_1.52.0   cluster_2.0.4        
## [13] annotate_1.50.0       XML_3.98-1.4          AnnotationDbi_1.34.0 
## [16] IRanges_2.6.0         S4Vectors_0.10.0      RforProteomics_1.10.0
## [19] rpx_1.8.0             MSnbase_1.21.4        ProtGenerics_1.4.0   
## [22] BiocParallel_1.6.0    mzR_2.6.0             Rcpp_0.12.4.5        
## [25] Biobase_2.32.0        BiocGenerics_0.18.0   BiocStyle_2.0.0      
## 
## loaded via a namespace (and not attached):
##   [1] plyr_1.8.3                    GSEABase_1.34.0              
##   [3] splines_3.3.0                 ggvis_0.4.2                  
##   [5] ggplot2_2.1.0                 digest_0.6.9                 
##   [7] foreach_1.4.3                 BiocInstaller_1.22.1         
##   [9] htmltools_0.3.5               gdata_2.17.0                 
##  [11] magrittr_1.5                  doParallel_1.0.10            
##  [13] sfsmisc_1.1-0                 limma_3.28.2                 
##  [15] rda_1.0.2-2                   R.utils_2.3.0                
##  [17] lpSolve_5.6.13                colorspace_1.2-6             
##  [19] dplyr_0.4.3                   RCurl_1.95-4.8               
##  [21] jsonlite_0.9.19               graph_1.50.0                 
##  [23] lme4_1.1-12                   impute_1.46.0                
##  [25] survival_2.39-3               iterators_1.0.8              
##  [27] gtable_0.2.0                  zlibbioc_1.18.0              
##  [29] MatrixModels_0.4-1            car_2.1-2                    
##  [31] kernlab_0.9-24                prabclus_2.2-6               
##  [33] DEoptimR_1.0-4                SparseM_1.7                  
##  [35] scales_0.4.0                  vsn_3.40.0                   
##  [37] mvtnorm_1.0-5                 DBI_0.4-1                    
##  [39] edgeR_3.14.0                  xtable_1.8-2                 
##  [41] proxy_0.4-15                  mclust_5.2                   
##  [43] preprocessCore_1.34.0         htmlwidgets_0.6              
##  [45] sampling_2.7                  httr_1.1.0                   
##  [47] threejs_0.2.2                 FNN_1.1                      
##  [49] RColorBrewer_1.1-2            fpc_2.1-10                   
##  [51] modeltools_0.2-21             R.methodsS3_1.7.1            
##  [53] flexmix_2.3-13                nnet_7.3-12                  
##  [55] RJSONIO_1.3-0                 caret_6.0-68                 
##  [57] labeling_0.3                  reshape2_1.4.1               
##  [59] munsell_0.4.3                 mlbench_2.1-1                
##  [61] biocViews_1.40.0              tools_3.3.0                  
##  [63] RSQLite_1.0.0                 pls_2.5-0                    
##  [65] evaluate_0.9                  stringr_1.0.0                
##  [67] mzID_1.10.0                   knitr_1.13                   
##  [69] robustbase_0.92-5             rgl_0.95.1441                
##  [71] caTools_1.17.1                randomForest_4.6-12          
##  [73] RBGL_1.48.0                   nlme_3.1-127                 
##  [75] mime_0.4                      quantreg_5.21                
##  [77] formatR_1.4                   R.oo_1.20.0                  
##  [79] biomaRt_2.28.0                pbkrtest_0.4-6               
##  [81] curl_0.9.7                    interactiveDisplayBase_1.10.0
##  [83] e1071_1.6-7                   affyio_1.42.0                
##  [85] stringi_1.0-1                 trimcluster_0.1-2            
##  [87] Matrix_1.2-6                  nloptr_1.0.4                 
##  [89] gbm_2.1.1                     RUnit_0.4.31                 
##  [91] MALDIquant_1.14               bitops_1.0-6                 
##  [93] httpuv_1.3.3                  R6_2.1.2                     
##  [95] pcaMethods_1.64.0             affy_1.50.0                  
##  [97] hwriter_1.3.2                 KernSmooth_2.23-15           
##  [99] gridSVG_1.5-0                 codetools_0.2-14             
## [101] MASS_7.3-45                   gtools_3.5.0                 
## [103] assertthat_0.1                interactiveDisplay_1.10.0    
## [105] Category_2.38.0               diptest_0.75-7               
## [107] mgcv_1.8-12                   grid_3.3.0                   
## [109] rpart_4.1-10                  class_7.3-14                 
## [111] minqa_1.2.4                   shiny_0.13.2                 
## [113] base64enc_0.1-3

Inspection, visualisation and analysis of quantitative proteomics data

Laurent Gatto

Contents

1 Introduction