1 Introduction

This tutorial requires MSnbase and msdata package.

library("MSnbase")
library("msdata")

A longer version of the material is available here.

2 Data files in mass spectrometry

Most community-driven formats described in the table are supported in R. We will see how to read and access these data in the following sections.

Type	Format	Package
raw	mzML, mzXML, netCDF, mzData	MSnbase (read and write in version >= 2.3.13) via mzR
identification	mzIdentML	mzID (read) and MSnbase (read, via mzR)
quantitation	mzQuantML
peak lists	mgf	MSnbase (read)
quant and id	mzTab	MSnbase (read)
protein db	fasta	Biostrings

3 Loading raw data

Raw data files (in any of the above formats) is read into R using MSnbase::readMSData function that will return an list-like object of class MSnExp.

basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11"))

## [1] "MS3TMT11.mzML"

(rw3 <- readMSData(fl3, mode = "onDisk"))

## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.53 Mb
## - - - Spectra data - - -
##  MS level(s): 1 2 3 
##  Number of spectra: 994 
##  MSn retention times: 45:27 - 47:6 minutes
## - - - Processing information - - -
## Data loaded [Fri Jul  5 09:25:59 2019] 
##  MSnbase version: 2.10.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: MS3TMT11.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   MS3TMT11.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
##   fvarLabels: fileIdx spIdx ... spectrum (33 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

An MSnExp object contains

the raw data (spectra), accessible with [ and [[
features data, accessible as a dataframe with fData

We can also access specific data using

msLevel(rw3)
rtime(rw3)
precursorMz(rw3)
centroided(rw3)
…

table(msLevel(rw3), centroided(rw3))

##    
##     FALSE TRUE
##   1    30    0
##   2     0  482
##   3     0  482

4 Loading identification data

basename(idf <- msdata::ident(full.name = TRUE))

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"

iddf <- readMzIdData(idf)

library("ggplot2")
ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) +
    geom_density()

5 Combining raw and identification data

basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzXML$"))

## [1] "dummyiTRAQ.mzXML"

basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                          full.name = TRUE, pattern = "dummyiTRAQ.mzid"))

## [1] "dummyiTRAQ.mzid"

msexp <- readMSData(quantFile)
fvarLabels(msexp)

## [1] "spectrum"

msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)

##  [1] "spectrum"                 "acquisition.number"      
##  [3] "sequence"                 "chargeState"             
##  [5] "rank"                     "passThreshold"           
##  [7] "experimentalMassToCharge" "calculatedMassToCharge"  
##  [9] "modNum"                   "isDecoy"                 
## [11] "post"                     "pre"                     
## [13] "start"                    "end"                     
## [15] "DatabaseAccess"           "DBseqLength"             
## [17] "DatabaseSeq"              "DatabaseDescription"     
## [19] "scan.number.s."           "idFile"                  
## [21] "MS.GF.RawScore"           "MS.GF.DeNovoScore"       
## [23] "MS.GF.SpecEValue"         "MS.GF.EValue"            
## [25] "modName"                  "modMass"                 
## [27] "modLocation"              "subOriginalResidue"      
## [29] "subReplacementResidue"    "subLocation"             
## [31] "nprot"                    "npep.prot"               
## [33] "npsm.prot"                "npsm.pep"

6 Quantitative data

Quantiative data is stored a objects of class MSnSet, which can be

Figure 1: The MSnSet structure

Created from an MSnExp object with the quantify function.

data(itraqdata)
itraqdata

## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 55 
##  MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
##  MSnbase version: 1.1.22 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: sampleNames sampleNumbers
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ProteinDescription
##     PeptideSequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
msnset

## MSnSet (storageMode: lockedEnvironment)
## assayData: 55 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
##   varLabels: mz reporters
##   varMetadata: labelDescription
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ... collision.energy (15
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation 
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Fri Jul  5 09:26:02 2019 
##  MSnbase version: 1.1.22

plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE)

Figure 2: Raw data and quantiative results

exprs(msnset)[1, ]

## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117 
##   1347.616   2247.310   3927.693   7661.146

fData(msnset)[1, ]

##    spectrum ProteinAccession   ProteinDescription PeptideSequence fileIdx
## X1        1              BSA bovine serum albumin          NYQEAK       1
##    retention.time precursor.mz precursor.intensity charge peaks.count
## X1        1149.31     520.7833             3449020      2        1922
##         tic ionCount ms.level acquisition.number collision.energy
## X1 26413754 26413754        2                  2               40

Created from a spreadsheet produced by a third party software using the readMSnSet2 function.

Before reading the spreadsheet, we need to identify which columns contain quantitation data (that will be put into the exprs slot) and the feature data (that will be put into the fData slot).

Download the data here.

f <- "./data/cptac_peptides.txt"
getEcols(f, split = "\t")

##  [1] "Sequence"                 "N-term cleavage window"  
##  [3] "C-term cleavage window"   "Amino acid before"       
##  [5] "First amino acid"         "Second amino acid"       
##  [7] "Second last amino acid"   "Last amino acid"         
##  [9] "Amino acid after"         "A Count"                 
## [11] "R Count"                  "N Count"                 
## [13] "D Count"                  "C Count"                 
## [15] "Q Count"                  "E Count"                 
## [17] "G Count"                  "H Count"                 
## [19] "I Count"                  "L Count"                 
## [21] "K Count"                  "M Count"                 
## [23] "F Count"                  "P Count"                 
## [25] "S Count"                  "T Count"                 
## [27] "W Count"                  "Y Count"                 
## [29] "V Count"                  "U Count"                 
## [31] "Length"                   "Missed cleavages"        
## [33] "Mass"                     "Proteins"                
## [35] "Leading razor protein"    "Start position"          
## [37] "End position"             "Unique (Groups)"         
## [39] "Unique (Proteins)"        "Charges"                 
## [41] "PEP"                      "Score"                   
## [43] "Identification type 6A_7" "Identification type 6A_8"
## [45] "Identification type 6A_9" "Identification type 6B_7"
## [47] "Identification type 6B_8" "Identification type 6B_9"
## [49] "Experiment 6A_7"          "Experiment 6A_8"         
## [51] "Experiment 6A_9"          "Experiment 6B_7"         
## [53] "Experiment 6B_8"          "Experiment 6B_9"         
## [55] "Intensity"                "Intensity 6A_7"          
## [57] "Intensity 6A_8"           "Intensity 6A_9"          
## [59] "Intensity 6B_7"           "Intensity 6B_8"          
## [61] "Intensity 6B_9"           "Reverse"                 
## [63] "Potential contaminant"    "id"                      
## [65] "Protein group IDs"        "Mod. peptide IDs"        
## [67] "Evidence IDs"             "MS/MS IDs"               
## [69] "Best MS/MS"               "Oxidation (M) site IDs"  
## [71] "MS/MS Count"

e <- grepEcols(f, "Intensity ", split = "\t") ## careful at the space!
(cptac <- readMSnSet2(f, ecol = e,
                      fnames = "Sequence",
                      sep = "\t"))

## MSnSet (storageMode: lockedEnvironment)
## assayData: 11466 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (11466 total)
##   fvarLabels: Sequence N.term.cleavage.window ... MS.MS.Count (65
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.10.1

We can access the expression data with exprs and the feature meta-data with fData.

For the sake of simplicity, we can clean up the feature variables and only keep those of interest. It is possible to do this interactively with

cptac <- selectFeatureData(cptac)

or by setting the feature variables of interest.

cptac <- selectFeatureData(cptac,
                           fcol = c("Proteins",
                                    "Potential.contaminant",
                                    "Reverse",
                                    "Sequence"))

Let’s also add sample annotations:

cptac$group <- rep(c("6A", "6B"), each = 3)
cptac$sample <- rep(7:9, 2)
sampleNames(cptac) <- sub("Intensity.", "", sampleNames(cptac))
pData(cptac)

##      group sample
## 6A_7    6A      7
## 6A_8    6A      8
## 6A_9    6A      9
## 6B_7    6B      7
## 6B_8    6B      8
## 6B_9    6B      9

7 Data processing

7.1 Filtering out contaminants

table(sel_conts <- fData(cptac)$Potential.contaminant != "+")

## 
## FALSE  TRUE 
##    81 11385

table(sel_rev <- fData(cptac)$Reverse != "+")

## 
## FALSE  TRUE 
##    30 11436

(cptac <- cptac[sel_conts & sel_rev, ])

## MSnSet (storageMode: lockedEnvironment)
## assayData: 11357 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
##   varLabels: group sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (11357 total)
##   fvarLabels: Proteins Potential.contaminant Reverse Sequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul  5 09:26:04 2019 
##  MSnbase version: 2.10.1

Note how the filtering has been recorded in the object’s processing log.

7.2 Notes on missing values

Unfortunately, some software uses 0 irrespecitve whether the data has intensity zero and when the data haven’t been observer. Below we fix this.

exprs(cptac)[exprs(cptac) == 0] <- NA
table(is.na(exprs(cptac)))

## 
## FALSE  TRUE 
## 37533 30609

napac <- cptac
exprs(napac)[!is.na(exprs(napac))] <- 1
naplot(napac)

Figure 3: Overview of missing data

fData(cptac)$nNA <- apply(exprs(cptac), 1, function(x) sum(is.na(x)))
table(fData(cptac)$nNA)

## 
##    0    1    2    3    4    5    6 
## 4051  989  879  715  912  797 3014

(Note that some peptides aren’t seen at all because these 6 samples are a subset of a larger dataset, and these features are present in the other acquisitions.)

From here on, one could filter data with missing values, which however sacrifices a lot of data.

(cptac <- filterNA(cptac))

## MSnSet (storageMode: lockedEnvironment)
## assayData: 4051 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
##   varLabels: group sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (4051 total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul  5 09:26:04 2019 
## Subset [11357,6][4051,6] Fri Jul  5 09:26:05 2019 
## Removed features with more than 0 NAs: Fri Jul  5 09:26:05 2019 
## Dropped featureData's levels Fri Jul  5 09:26:05 2019 
##  MSnbase version: 2.10.1

Perform imputation, considering the underlying nature of missingness, i.e missing not at random (left-censored) or at random.

Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better.

See Lazar et al. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

The best solution is arguably to handle missing values at the statistical test level (see later).

7.3 Log transformation

(cptac <- log(cptac, base = 2))

## MSnSet (storageMode: lockedEnvironment)
## assayData: 4051 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
##   varLabels: group sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (4051 total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul  5 09:26:04 2019 
## Subset [11357,6][4051,6] Fri Jul  5 09:26:05 2019 
## Removed features with more than 0 NAs: Fri Jul  5 09:26:05 2019 
## Dropped featureData's levels Fri Jul  5 09:26:05 2019 
## Log transformed (base 2) Fri Jul  5 09:26:05 2019 
##  MSnbase version: 2.10.1

7.4 Normalisation

Normalisation is handled by the normalise (or normalize) function.

(cptac <- normalise(cptac, method = "quantiles"))

## MSnSet (storageMode: lockedEnvironment)
## assayData: 4051 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
##   varLabels: group sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (4051 total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul  5 09:26:04 2019 
## Subset [11357,6][4051,6] Fri Jul  5 09:26:05 2019 
## Removed features with more than 0 NAs: Fri Jul  5 09:26:05 2019 
## Dropped featureData's levels Fri Jul  5 09:26:05 2019 
## Log transformed (base 2) Fri Jul  5 09:26:05 2019 
## Normalised (quantiles): Fri Jul  5 09:26:05 2019 
##  MSnbase version: 2.10.1

7.5 Summarisation

So far we have quantitation values at the peptide level, while the level of interest are proteins. We can take all peptides that are associated to a protein (or protein group), as defined by the Proteins feature variable and aggregate them using a preferred operation.

Figure 4: Aggregation between different levels

Figure 5: Examples of aggregations

(cptac <- combineFeatures(cptac, fcol = "Proteins", method = "mean"))

## MSnSet (storageMode: lockedEnvironment)
## assayData: 1125 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
##   varLabels: group sample
##   varMetadata: labelDescription
## featureData
##   featureNames: P00918ups|CAH2_HUMAN_UPS
##     P01008ups|ANT3_HUMAN_UPS;CON__P41361 ... sp|Q99383|HRP1_YEAST
##     (1125 total)
##   fvarLabels: Proteins Potential.contaminant ... CV.6B_9 (11 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul  5 09:26:04 2019 
## Subset [11357,6][4051,6] Fri Jul  5 09:26:05 2019 
## Removed features with more than 0 NAs: Fri Jul  5 09:26:05 2019 
## Dropped featureData's levels Fri Jul  5 09:26:05 2019 
## Log transformed (base 2) Fri Jul  5 09:26:05 2019 
## Normalised (quantiles): Fri Jul  5 09:26:05 2019 
## Combined 4051 features into 1125 using mean: Fri Jul  5 09:26:06 2019 
##  MSnbase version: 2.10.1

As you will see in the next section, there are much better aggregation function than the mean!

Session info

## R version 3.6.0 RC (2019-04-21 r76417)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.2.0       msdata_0.24.0       MSnbase_2.10.1     
##  [4] ProtGenerics_1.16.0 S4Vectors_0.22.0    mzR_2.18.0         
##  [7] Rcpp_1.0.1          Biobase_2.44.0      BiocGenerics_0.30.0
## [10] BiocStyle_2.12.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5      xfun_0.8              purrr_0.3.2          
##  [4] lattice_0.20-38       colorspace_1.4-1      htmltools_0.3.6      
##  [7] yaml_2.2.0            vsn_3.52.0            XML_3.98-1.20        
## [10] rlang_0.4.0           pillar_1.4.2          withr_2.1.2          
## [13] glue_1.3.1            BiocParallel_1.18.0   affy_1.62.0          
## [16] affyio_1.54.0         foreach_1.4.4         plyr_1.8.4           
## [19] mzID_1.22.0           stringr_1.4.0         zlibbioc_1.30.0      
## [22] munsell_0.5.0         pcaMethods_1.76.0     gtable_0.3.0         
## [25] codetools_0.2-16      evaluate_0.14         labeling_0.3         
## [28] knitr_1.23            IRanges_2.18.1        doParallel_1.0.14    
## [31] highr_0.8             preprocessCore_1.46.0 scales_1.0.0         
## [34] BiocManager_1.30.4    limma_3.40.2          impute_1.58.0        
## [37] digest_0.6.19         stringi_1.4.3         bookdown_0.11        
## [40] dplyr_0.8.2           ncdf4_1.16.1          grid_3.6.0           
## [43] tools_3.6.0           magrittr_1.5          lazyeval_0.2.2       
## [46] tibble_2.1.3          crayon_1.3.4          pkgconfig_2.0.2      
## [49] MASS_7.3-51.4         iterators_1.0.10      assertthat_0.2.1     
## [52] rmarkdown_1.13        R6_2.4.0              MALDIquant_1.19.3    
## [55] compiler_3.6.0

Mass spectrometry and proteomics

5 juillet 2019

Contents