This tutorial requires MSnbase and msdata package.
library("MSnbase")
library("msdata")
A longer version of the material is available here.
Most community-driven formats described in the table are supported in R
. We will see how to read and access these data in the following sections.
Type | Format | Package |
---|---|---|
raw | mzML, mzXML, netCDF, mzData | MSnbase (read and write in version >= 2.3.13) via mzR |
identification | mzIdentML | mzID (read) and MSnbase (read, via mzR) |
quantitation | mzQuantML | |
peak lists | mgf | MSnbase (read) |
quant and id | mzTab | MSnbase (read) |
protein db | fasta | Biostrings |
Raw data files (in any of the above formats) is read into R using MSnbase::readMSData
function that will return an list-like object of class MSnExp
.
basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11"))
## [1] "MS3TMT11.mzML"
(rw3 <- readMSData(fl3, mode = "onDisk"))
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.53 Mb
## - - - Spectra data - - -
## MS level(s): 1 2 3
## Number of spectra: 994
## MSn retention times: 45:27 - 47:6 minutes
## - - - Processing information - - -
## Data loaded [Fri Jul 5 09:25:59 2019]
## MSnbase version: 2.10.1
## - - - Meta data - - -
## phenoData
## rowNames: MS3TMT11.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## MS3TMT11.mzML
## protocolData: none
## featureData
## featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
## fvarLabels: fileIdx spIdx ... spectrum (33 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
An MSnExp
object contains
[
and [[
fData
We can also access specific data using
msLevel(rw3)
rtime(rw3)
precursorMz(rw3)
centroided(rw3)
table(msLevel(rw3), centroided(rw3))
##
## FALSE TRUE
## 1 30 0
## 2 0 482
## 3 0 482
basename(idf <- msdata::ident(full.name = TRUE))
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
iddf <- readMzIdData(idf)
library("ggplot2")
ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) +
geom_density()
basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$"))
## [1] "dummyiTRAQ.mzXML"
basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "dummyiTRAQ.mzid"))
## [1] "dummyiTRAQ.mzid"
msexp <- readMSData(quantFile)
fvarLabels(msexp)
## [1] "spectrum"
msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)
## [1] "spectrum" "acquisition.number"
## [3] "sequence" "chargeState"
## [5] "rank" "passThreshold"
## [7] "experimentalMassToCharge" "calculatedMassToCharge"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "scan.number.s." "idFile"
## [21] "MS.GF.RawScore" "MS.GF.DeNovoScore"
## [23] "MS.GF.SpecEValue" "MS.GF.EValue"
## [25] "modName" "modMass"
## [27] "modLocation" "subOriginalResidue"
## [29] "subReplacementResidue" "subLocation"
## [31] "nprot" "npep.prot"
## [33] "npsm.prot" "npsm.pep"
Quantiative data is stored a objects of class MSnSet
, which can be
MSnExp
object with the quantify
function.data(itraqdata)
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 55
## MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
msnset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 55 features, 4 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## varLabels: mz reporters
## varMetadata: labelDescription
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ... collision.energy (15
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Fri Jul 5 09:26:02 2019
## MSnbase version: 1.1.22
plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE)
exprs(msnset)[1, ]
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## 1347.616 2247.310 3927.693 7661.146
fData(msnset)[1, ]
## spectrum ProteinAccession ProteinDescription PeptideSequence fileIdx
## X1 1 BSA bovine serum albumin NYQEAK 1
## retention.time precursor.mz precursor.intensity charge peaks.count
## X1 1149.31 520.7833 3449020 2 1922
## tic ionCount ms.level acquisition.number collision.energy
## X1 26413754 26413754 2 2 40
readMSnSet2
function.Before reading the spreadsheet, we need to identify which columns contain quantitation data (that will be put into the exprs
slot) and the feature data (that will be put into the fData
slot).
Download the data here.
f <- "./data/cptac_peptides.txt"
getEcols(f, split = "\t")
## [1] "Sequence" "N-term cleavage window"
## [3] "C-term cleavage window" "Amino acid before"
## [5] "First amino acid" "Second amino acid"
## [7] "Second last amino acid" "Last amino acid"
## [9] "Amino acid after" "A Count"
## [11] "R Count" "N Count"
## [13] "D Count" "C Count"
## [15] "Q Count" "E Count"
## [17] "G Count" "H Count"
## [19] "I Count" "L Count"
## [21] "K Count" "M Count"
## [23] "F Count" "P Count"
## [25] "S Count" "T Count"
## [27] "W Count" "Y Count"
## [29] "V Count" "U Count"
## [31] "Length" "Missed cleavages"
## [33] "Mass" "Proteins"
## [35] "Leading razor protein" "Start position"
## [37] "End position" "Unique (Groups)"
## [39] "Unique (Proteins)" "Charges"
## [41] "PEP" "Score"
## [43] "Identification type 6A_7" "Identification type 6A_8"
## [45] "Identification type 6A_9" "Identification type 6B_7"
## [47] "Identification type 6B_8" "Identification type 6B_9"
## [49] "Experiment 6A_7" "Experiment 6A_8"
## [51] "Experiment 6A_9" "Experiment 6B_7"
## [53] "Experiment 6B_8" "Experiment 6B_9"
## [55] "Intensity" "Intensity 6A_7"
## [57] "Intensity 6A_8" "Intensity 6A_9"
## [59] "Intensity 6B_7" "Intensity 6B_8"
## [61] "Intensity 6B_9" "Reverse"
## [63] "Potential contaminant" "id"
## [65] "Protein group IDs" "Mod. peptide IDs"
## [67] "Evidence IDs" "MS/MS IDs"
## [69] "Best MS/MS" "Oxidation (M) site IDs"
## [71] "MS/MS Count"
e <- grepEcols(f, "Intensity ", split = "\t") ## careful at the space!
(cptac <- readMSnSet2(f, ecol = e,
fnames = "Sequence",
sep = "\t"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 11466 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
## (11466 total)
## fvarLabels: Sequence N.term.cleavage.window ... MS.MS.Count (65
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## MSnbase version: 2.10.1
We can access the expression data with exprs
and the feature meta-data with fData
.
For the sake of simplicity, we can clean up the feature variables and only keep those of interest. It is possible to do this interactively with
cptac <- selectFeatureData(cptac)
or by setting the feature variables of interest.
cptac <- selectFeatureData(cptac,
fcol = c("Proteins",
"Potential.contaminant",
"Reverse",
"Sequence"))
Let’s also add sample annotations:
cptac$group <- rep(c("6A", "6B"), each = 3)
cptac$sample <- rep(7:9, 2)
sampleNames(cptac) <- sub("Intensity.", "", sampleNames(cptac))
pData(cptac)
## group sample
## 6A_7 6A 7
## 6A_8 6A 8
## 6A_9 6A 9
## 6B_7 6B 7
## 6B_8 6B 8
## 6B_9 6B 9
table(sel_conts <- fData(cptac)$Potential.contaminant != "+")
##
## FALSE TRUE
## 81 11385
table(sel_rev <- fData(cptac)$Reverse != "+")
##
## FALSE TRUE
## 30 11436
(cptac <- cptac[sel_conts & sel_rev, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 11357 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
## varLabels: group sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
## (11357 total)
## fvarLabels: Proteins Potential.contaminant Reverse Sequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul 5 09:26:04 2019
## MSnbase version: 2.10.1
Note how the filtering has been recorded in the object’s processing log.
Unfortunately, some software uses 0 irrespecitve whether the data has intensity zero and when the data haven’t been observer. Below we fix this.
exprs(cptac)[exprs(cptac) == 0] <- NA
table(is.na(exprs(cptac)))
##
## FALSE TRUE
## 37533 30609
napac <- cptac
exprs(napac)[!is.na(exprs(napac))] <- 1
naplot(napac)
fData(cptac)$nNA <- apply(exprs(cptac), 1, function(x) sum(is.na(x)))
table(fData(cptac)$nNA)
##
## 0 1 2 3 4 5 6
## 4051 989 879 715 912 797 3014
(Note that some peptides aren’t seen at all because these 6 samples are a subset of a larger dataset, and these features are present in the other acquisitions.)
From here on, one could filter data with missing values, which however sacrifices a lot of data.
(cptac <- filterNA(cptac))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4051 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
## varLabels: group sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (4051 total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul 5 09:26:04 2019
## Subset [11357,6][4051,6] Fri Jul 5 09:26:05 2019
## Removed features with more than 0 NAs: Fri Jul 5 09:26:05 2019
## Dropped featureData's levels Fri Jul 5 09:26:05 2019
## MSnbase version: 2.10.1
Perform imputation, considering the underlying nature of missingness, i.e missing not at random (left-censored) or at random.
See Lazar et al. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.
The best solution is arguably to handle missing values at the statistical test level (see later).
(cptac <- log(cptac, base = 2))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4051 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
## varLabels: group sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (4051 total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul 5 09:26:04 2019
## Subset [11357,6][4051,6] Fri Jul 5 09:26:05 2019
## Removed features with more than 0 NAs: Fri Jul 5 09:26:05 2019
## Dropped featureData's levels Fri Jul 5 09:26:05 2019
## Log transformed (base 2) Fri Jul 5 09:26:05 2019
## MSnbase version: 2.10.1
Normalisation is handled by the normalise
(or normalize
) function.
(cptac <- normalise(cptac, method = "quantiles"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4051 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
## varLabels: group sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (4051 total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul 5 09:26:04 2019
## Subset [11357,6][4051,6] Fri Jul 5 09:26:05 2019
## Removed features with more than 0 NAs: Fri Jul 5 09:26:05 2019
## Dropped featureData's levels Fri Jul 5 09:26:05 2019
## Log transformed (base 2) Fri Jul 5 09:26:05 2019
## Normalised (quantiles): Fri Jul 5 09:26:05 2019
## MSnbase version: 2.10.1
So far we have quantitation values at the peptide level, while the level of interest are proteins. We can take all peptides that are associated to a protein (or protein group), as defined by the Proteins
feature variable and aggregate them using a preferred operation.
(cptac <- combineFeatures(cptac, fcol = "Proteins", method = "mean"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1125 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: 6A_7 6A_8 ... 6B_9 (6 total)
## varLabels: group sample
## varMetadata: labelDescription
## featureData
## featureNames: P00918ups|CAH2_HUMAN_UPS
## P01008ups|ANT3_HUMAN_UPS;CON__P41361 ... sp|Q99383|HRP1_YEAST
## (1125 total)
## fvarLabels: Proteins Potential.contaminant ... CV.6B_9 (11 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][11357,6] Fri Jul 5 09:26:04 2019
## Subset [11357,6][4051,6] Fri Jul 5 09:26:05 2019
## Removed features with more than 0 NAs: Fri Jul 5 09:26:05 2019
## Dropped featureData's levels Fri Jul 5 09:26:05 2019
## Log transformed (base 2) Fri Jul 5 09:26:05 2019
## Normalised (quantiles): Fri Jul 5 09:26:05 2019
## Combined 4051 features into 1125 using mean: Fri Jul 5 09:26:06 2019
## MSnbase version: 2.10.1
As you will see in the next section, there are much better aggregation function than the mean!
## R version 3.6.0 RC (2019-04-21 r76417)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.2.0 msdata_0.24.0 MSnbase_2.10.1
## [4] ProtGenerics_1.16.0 S4Vectors_0.22.0 mzR_2.18.0
## [7] Rcpp_1.0.1 Biobase_2.44.0 BiocGenerics_0.30.0
## [10] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.8 purrr_0.3.2
## [4] lattice_0.20-38 colorspace_1.4-1 htmltools_0.3.6
## [7] yaml_2.2.0 vsn_3.52.0 XML_3.98-1.20
## [10] rlang_0.4.0 pillar_1.4.2 withr_2.1.2
## [13] glue_1.3.1 BiocParallel_1.18.0 affy_1.62.0
## [16] affyio_1.54.0 foreach_1.4.4 plyr_1.8.4
## [19] mzID_1.22.0 stringr_1.4.0 zlibbioc_1.30.0
## [22] munsell_0.5.0 pcaMethods_1.76.0 gtable_0.3.0
## [25] codetools_0.2-16 evaluate_0.14 labeling_0.3
## [28] knitr_1.23 IRanges_2.18.1 doParallel_1.0.14
## [31] highr_0.8 preprocessCore_1.46.0 scales_1.0.0
## [34] BiocManager_1.30.4 limma_3.40.2 impute_1.58.0
## [37] digest_0.6.19 stringi_1.4.3 bookdown_0.11
## [40] dplyr_0.8.2 ncdf4_1.16.1 grid_3.6.0
## [43] tools_3.6.0 magrittr_1.5 lazyeval_0.2.2
## [46] tibble_2.1.3 crayon_1.3.4 pkgconfig_2.0.2
## [49] MASS_7.3-51.4 iterators_1.0.10 assertthat_0.2.1
## [52] rmarkdown_1.13 R6_2.4.0 MALDIquant_1.19.3
## [55] compiler_3.6.0