This tutorial requires the MSnbase (Gatto and Lilley (2011)), MSqRob (Goeminne et al. (2016)), limma (Ritchie et al. (2015)) and msdata packages, as well as some of the tidyverse packages for data tidying and visualisation.
library("MSnbase")
library("MSqRob")
library("msdata")
library("limma")
library("tidyverse")
Sections 2 to 4 deal with raw mass spectrometry and identification data stored in one of the open community-developed data formats (see below). Section 5 describes how to produce or import quantitative data and the sections thereafter focus on the processing and analysis of differentially expressed proteins.
Most community-driven formats described in the table are supported in R
. We will see how to read and access these data in the following sections.
Type | Format | Package |
---|---|---|
raw | mzML, mzXML, netCDF, mzData | MSnbase (read and write in version >= 2.3.13) via mzR |
identification | mzIdentML | mzID (read) and MSnbase (read, via mzR) |
quantitation | mzQuantML | |
peak lists | mgf | MSnbase (read) |
quantitation (with id) | mzTab, spreadsheets | MSnbase (read, read/write) |
protein db | fasta | Biostrings |
In the next sections, we will use raw data (in mzML
and mzXML
format), identification data (in mzIdentML
format) and quantitative data (from a tab-separated spreadsheet).
Raw data files (in any of the above formats) is read into R using readMSData
function from the MSnbase
package, that will return an list-like object of class MSnExp
. Below, we first extract the full path to the MS3TMT11.mzML
file from the msdata
package1 The proteomics
, ident
and quant
msdata
functions return example files for raw, identification and quantitative data respectively. before reading it in.
basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11"))
## [1] "MS3TMT11.mzML"
(rw3 <- readMSData(fl3, mode = "onDisk"))
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.53 Mb
## - - - Spectra data - - -
## MS level(s): 1 2 3
## Number of spectra: 994
## MSn retention times: 45:27 - 47:6 minutes
## - - - Processing information - - -
## Data loaded [Thu Jul 25 08:43:45 2019]
## MSnbase version: 2.10.1
## - - - Meta data - - -
## phenoData
## rowNames: MS3TMT11.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## MS3TMT11.mzML
## protocolData: none
## featureData
## featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
## fvarLabels: fileIdx spIdx ... spectrum (33 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
Note: above, we specify to use the on disk mode (as opposed to in memory) to avoid loading the whole raw data file(s) into memory. With on disk, the raw data will be accessed on demand. Here, we only read a single MS file into R, but with on disk mode, we could load 100s thereof.
An MSnExp
object contains
[
and [[
. The former returns and subset of the data as a new MSnExp
object, while the former extracts a single spectrum.fData
Exercise: Extract the first, second and tenth spectrum. What are their MS levels, precursor m/z and retention times?
rw3[[1]]
## Object of class "Spectrum1"
## Retention time: 45:27
## MSn level: 1
## Total ion count: 10768
## Polarity: 1
We can also access specific data for the whole experiment using accessors:
msLevel(rw3)
rtime(rw3)
precursorMz(rw3)
centroided(rw3)
See ?MSnExp
for details.
Exercise: Using msLevel
, extract the MS level of the 994 spectra of the rw3
file. What levels are available in these data. Using centroided
, check what spectra and centroided of in profile mode (see figure below). Which are the levels that are centroided or in profile mode.
table(msLevel(rw3), centroided(rw3))
##
## FALSE TRUE
## 1 30 0
## 2 0 482
## 3 0 482
Identification data in mzIdentML
is parsed and loaded as a data.frame
using the readMzIdData
function.
basename(idf <- msdata::ident(full.name = TRUE))
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
iddf <- readMzIdData(idf)
Among the identification variables available, you will find MS.GF.RawScore
, the identification score computated by MSGF+ (the search engine used for the search), and isDecoy
.
names(iddf)
Exercise: How many PSMs from the read and decoy database are recoded in the identification data? Calculate summary statistics for MSGF+ raw scores.
table(iddf$isDecoy)
##
## FALSE TRUE
## 2906 2896
summary(iddf$MS.GF.RawScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -141.00 -31.00 -13.00 -14.24 3.00 214.00
Exercise: Reproduce and interpret the plot showing the identification score distribution for the decoy and real peptides in this file.
ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) +
geom_density()
When working with raw and identification data, these two can be merged by adding the identification results to the raw feature meta-data slot. Below, we use small data that is shipped with the MSnbase
package. The raw data being very small, we can afford to read it into memory without specifying mode = "onDisk"
.
basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$"))
## [1] "dummyiTRAQ.mzXML"
basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "dummyiTRAQ.mzid"))
## [1] "dummyiTRAQ.mzid"
msexp <- readMSData(quantFile)
fvarLabels(msexp)
## [1] "spectrum"
Add this stage, there’s only one spectrum
feature variable with the spectra indices. Below we add the identification data with addIdentificationData
to gain 33 new feature variables.
msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)
## [1] "spectrum" "acquisition.number"
## [3] "sequence" "chargeState"
## [5] "rank" "passThreshold"
## [7] "experimentalMassToCharge" "calculatedMassToCharge"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "scan.number.s." "idFile"
## [21] "MS.GF.RawScore" "MS.GF.DeNovoScore"
## [23] "MS.GF.SpecEValue" "MS.GF.EValue"
## [25] "modName" "modMass"
## [27] "modLocation" "subOriginalResidue"
## [29] "subReplacementResidue" "subLocation"
## [31] "nprot" "npep.prot"
## [33] "npsm.prot" "npsm.pep"
Note that prior to addition, the identification data is filtered as documented in the filterIdentificationDataFrame
function:
Quantitative data is stored a objects of class MSnSet
, that are composed of an expression matrix (accessed with exprs()
), a feature meta-data dataframe (accessed with fData()
) and a meta-data dataframe (accessed with pData()
).
MSnSet
objects can be:
MSnExp
object with the quantify
function.readMSnSet2
function.data(itraqdata)
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 55
## MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
msnset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 55 features, 4 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## varLabels: mz reporters
## varMetadata: labelDescription
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ... collision.energy (15
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jul 25 08:43:50 2019
## MSnbase version: 1.1.22
Below, we plot the first MS2 spectrum from the itraqdata
test data, highlighting the four iTRAQ reporter ions and extract the quantiation values and feature metadata of that same first record.
plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE)