This tutorial requires the MSnbase (Gatto and Lilley (2011)), MSqRob (Goeminne et al. (2016)), limma (Ritchie et al. (2015)) and msdata packages, as well as some of the tidyverse packages for data tidying and visualisation.
library("MSnbase")
library("MSqRob")
library("msdata")
library("limma")
library("tidyverse")
Sections 2 to 4 deal with raw mass spectrometry and identification data stored in one of the open community-developed data formats (see below). Section 5 describes how to produce or import quantitative data and the sections thereafter focus on the processing and analysis of differentially expressed proteins.
Most community-driven formats described in the table are supported in R
. We will see how to read and access these data in the following sections.
Type | Format | Package |
---|---|---|
raw | mzML, mzXML, netCDF, mzData | MSnbase (read and write in version >= 2.3.13) via mzR |
identification | mzIdentML | mzID (read) and MSnbase (read, via mzR) |
quantitation | mzQuantML | |
peak lists | mgf | MSnbase (read) |
quantitation (with id) | mzTab, spreadsheets | MSnbase (read, read/write) |
protein db | fasta | Biostrings |
In the next sections, we will use raw data (in mzML
and mzXML
format), identification data (in mzIdentML
format) and quantitative data (from a tab-separated spreadsheet).
Raw data files (in any of the above formats) is read into R using readMSData
function from the MSnbase
package, that will return an list-like object of class MSnExp
. Below, we first extract the full path to the MS3TMT11.mzML
file from the msdata
package1 The proteomics
, ident
and quant
msdata
functions return example files for raw, identification and quantitative data respectively. before reading it in.
basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11"))
## [1] "MS3TMT11.mzML"
(rw3 <- readMSData(fl3, mode = "onDisk"))
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.53 Mb
## - - - Spectra data - - -
## MS level(s): 1 2 3
## Number of spectra: 994
## MSn retention times: 45:27 - 47:6 minutes
## - - - Processing information - - -
## Data loaded [Thu Jul 25 08:43:45 2019]
## MSnbase version: 2.10.1
## - - - Meta data - - -
## phenoData
## rowNames: MS3TMT11.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## MS3TMT11.mzML
## protocolData: none
## featureData
## featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
## fvarLabels: fileIdx spIdx ... spectrum (33 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
Note: above, we specify to use the on disk mode (as opposed to in memory) to avoid loading the whole raw data file(s) into memory. With on disk, the raw data will be accessed on demand. Here, we only read a single MS file into R, but with on disk mode, we could load 100s thereof.
An MSnExp
object contains
[
and [[
. The former returns and subset of the data as a new MSnExp
object, while the former extracts a single spectrum.fData
Exercise: Extract the first, second and tenth spectrum. What are their MS levels, precursor m/z and retention times?
rw3[[1]]
## Object of class "Spectrum1"
## Retention time: 45:27
## MSn level: 1
## Total ion count: 10768
## Polarity: 1
We can also access specific data for the whole experiment using accessors:
msLevel(rw3)
rtime(rw3)
precursorMz(rw3)
centroided(rw3)
See ?MSnExp
for details.
Exercise: Using msLevel
, extract the MS level of the 994 spectra of the rw3
file. What levels are available in these data. Using centroided
, check what spectra and centroided of in profile mode (see figure below). Which are the levels that are centroided or in profile mode.
table(msLevel(rw3), centroided(rw3))
##
## FALSE TRUE
## 1 30 0
## 2 0 482
## 3 0 482
Identification data in mzIdentML
is parsed and loaded as a data.frame
using the readMzIdData
function.
basename(idf <- msdata::ident(full.name = TRUE))
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
iddf <- readMzIdData(idf)
Among the identification variables available, you will find MS.GF.RawScore
, the identification score computated by MSGF+ (the search engine used for the search), and isDecoy
.
names(iddf)
Exercise: How many PSMs from the read and decoy database are recoded in the identification data? Calculate summary statistics for MSGF+ raw scores.
table(iddf$isDecoy)
##
## FALSE TRUE
## 2906 2896
summary(iddf$MS.GF.RawScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -141.00 -31.00 -13.00 -14.24 3.00 214.00
Exercise: Reproduce and interpret the plot showing the identification score distribution for the decoy and real peptides in this file.
ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) +
geom_density()
When working with raw and identification data, these two can be merged by adding the identification results to the raw feature meta-data slot. Below, we use small data that is shipped with the MSnbase
package. The raw data being very small, we can afford to read it into memory without specifying mode = "onDisk"
.
basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$"))
## [1] "dummyiTRAQ.mzXML"
basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "dummyiTRAQ.mzid"))
## [1] "dummyiTRAQ.mzid"
msexp <- readMSData(quantFile)
fvarLabels(msexp)
## [1] "spectrum"
Add this stage, there’s only one spectrum
feature variable with the spectra indices. Below we add the identification data with addIdentificationData
to gain 33 new feature variables.
msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)
## [1] "spectrum" "acquisition.number"
## [3] "sequence" "chargeState"
## [5] "rank" "passThreshold"
## [7] "experimentalMassToCharge" "calculatedMassToCharge"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "scan.number.s." "idFile"
## [21] "MS.GF.RawScore" "MS.GF.DeNovoScore"
## [23] "MS.GF.SpecEValue" "MS.GF.EValue"
## [25] "modName" "modMass"
## [27] "modLocation" "subOriginalResidue"
## [29] "subReplacementResidue" "subLocation"
## [31] "nprot" "npep.prot"
## [33] "npsm.prot" "npsm.pep"
Note that prior to addition, the identification data is filtered as documented in the filterIdentificationDataFrame
function:
Quantitative data is stored a objects of class MSnSet
, that are composed of an expression matrix (accessed with exprs()
), a feature meta-data dataframe (accessed with fData()
) and a meta-data dataframe (accessed with pData()
).
MSnSet
objects can be:
MSnExp
object with the quantify
function.readMSnSet2
function.data(itraqdata)
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 55
## MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
msnset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 55 features, 4 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## varLabels: mz reporters
## varMetadata: labelDescription
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ... collision.energy (15
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jul 25 08:43:50 2019
## MSnbase version: 1.1.22
Below, we plot the first MS2 spectrum from the itraqdata
test data, highlighting the four iTRAQ reporter ions and extract the quantiation values and feature metadata of that same first record.
plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE)
exprs(msnset)[1, ]
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## 1347.616 2247.310 3927.693 7661.146
fData(msnset)[1, ]
## spectrum ProteinAccession ProteinDescription PeptideSequence fileIdx
## X1 1 BSA bovine serum albumin NYQEAK 1
## retention.time precursor.mz precursor.intensity charge peaks.count
## X1 1149.31 520.7833 3449020 2 1922
## tic ionCount ms.level acquisition.number collision.energy
## X1 26413754 26413754 2 2 40
We are going to use the cptac_a_b_peptides.txt
tab-separated file from the msdata
package. These data are the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/μL Saccharomyces cerevisiae strain BY4741 . Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/μL) and 6B (0.74 fmol UPS1 proteins/μL). We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016). Three replicates are available for each concentration. The study is a spike-in study for which we know the ground truth so we have the ability to evaluate the quality of the fold change estimates and the list of DE genes that we return with a method.
basename(f <- msdata::quant(full.names = TRUE))
## [1] "cptac_a_b_peptides.txt"
Before reading the spreadsheet, we need to identify which columns contain quantitation data (that will be used to populate the exprs
slot) and the feature data (that will be put into the fData
slot).
The getEcols
function lists the column names in the expression data spreadsheet. The quantitative values we want to used are those in the columns starting withIntensity 6A_7
, Intensity 6A_8
, … Intensity 6B_9
, that we refer to with Intensity
.
getEcols(f, split = "\t")
## [1] "Sequence" "N-term cleavage window"
## [3] "C-term cleavage window" "Amino acid before"
## [5] "First amino acid" "Second amino acid"
## [7] "Second last amino acid" "Last amino acid"
## [9] "Amino acid after" "A Count"
## [11] "R Count" "N Count"
## [13] "D Count" "C Count"
## [15] "Q Count" "E Count"
## [17] "G Count" "H Count"
## [19] "I Count" "L Count"
## [21] "K Count" "M Count"
## [23] "F Count" "P Count"
## [25] "S Count" "T Count"
## [27] "W Count" "Y Count"
## [29] "V Count" "U Count"
## [31] "Length" "Missed cleavages"
## [33] "Mass" "Proteins"
## [35] "Leading razor protein" "Start position"
## [37] "End position" "Unique (Groups)"
## [39] "Unique (Proteins)" "Charges"
## [41] "PEP" "Score"
## [43] "Identification type 6A_7" "Identification type 6A_8"
## [45] "Identification type 6A_9" "Identification type 6B_7"
## [47] "Identification type 6B_8" "Identification type 6B_9"
## [49] "Experiment 6A_7" "Experiment 6A_8"
## [51] "Experiment 6A_9" "Experiment 6B_7"
## [53] "Experiment 6B_8" "Experiment 6B_9"
## [55] "Intensity" "Intensity 6A_7"
## [57] "Intensity 6A_8" "Intensity 6A_9"
## [59] "Intensity 6B_7" "Intensity 6B_8"
## [61] "Intensity 6B_9" "Reverse"
## [63] "Potential contaminant" "id"
## [65] "Protein group IDs" "Mod. peptide IDs"
## [67] "Evidence IDs" "MS/MS IDs"
## [69] "Best MS/MS" "Oxidation (M) site IDs"
## [71] "MS/MS Count"
Using a pattern, we can set the columns to be used to populate the quantitation slot.
(e <- grepEcols(f, "Intensity ", split = "\t")) ## careful at the space!
## [1] 56 57 58 59 60 61
(cptac <- readMSnSet2(f, ecol = e,
fnames = "Sequence",
sep = "\t"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 11466 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
## (11466 total)
## fvarLabels: Sequence N.term.cleavage.window ... MS.MS.Count (65
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## MSnbase version: 2.10.1
We can access the peptide-level expression data with exprs
and the feature meta-data with fData
.
head(exprs(cptac))
## Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## AAAAGAGGAGDSGDAVTK 0 0 66760
## AAAALAGGK 2441300 1220000 1337600
## AAAALAGGKK 1029200 668040 638990
## AAADALSDLEIK 515460 670780 712140
## AAADALSDLEIKDSK 331130 420900 365560
## AAAEEFQR 0 0 51558
## Intensity.6B_7 Intensity.6B_8 Intensity.6B_9
## AAAAGAGGAGDSGDAVTK 0 37436 0
## AAAALAGGK 2850900 935580 1606200
## AAAALAGGKK 777030 641270 562840
## AAADALSDLEIK 426580 620510 737780
## AAADALSDLEIKDSK 329250 380820 414490
## AAAEEFQR 0 0 76970
tail(exprs(cptac))
## Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## YYSISSSSLSEK 309270 357640 337930
## YYSIYDLGNNAVGLAK 0 0 0
## YYTFNGPNYNENETIR 88629 0 55584
## YYTITEVATR 298360 0 0
## YYTVFDRDNNR 0 0 0
## YYTVFDRDNNRVGFAEAAR 0 0 0
## Intensity.6B_7 Intensity.6B_8 Intensity.6B_9
## YYSISSSSLSEK 387870 361990 432020
## YYSIYDLGNNAVGLAK 0 0 0
## YYTFNGPNYNENETIR 63071 0 0
## YYTITEVATR 0 366000 0
## YYTVFDRDNNR 0 0 0
## YYTVFDRDNNRVGFAEAAR 0 0 0
fvarLabels(cptac)
## [1] "Sequence" "N.term.cleavage.window"
## [3] "C.term.cleavage.window" "Amino.acid.before"
## [5] "First.amino.acid" "Second.amino.acid"
## [7] "Second.last.amino.acid" "Last.amino.acid"
## [9] "Amino.acid.after" "A.Count"
## [11] "R.Count" "N.Count"
## [13] "D.Count" "C.Count"
## [15] "Q.Count" "E.Count"
## [17] "G.Count" "H.Count"
## [19] "I.Count" "L.Count"
## [21] "K.Count" "M.Count"
## [23] "F.Count" "P.Count"
## [25] "S.Count" "T.Count"
## [27] "W.Count" "Y.Count"
## [29] "V.Count" "U.Count"
## [31] "Length" "Missed.cleavages"
## [33] "Mass" "Proteins"
## [35] "Leading.razor.protein" "Start.position"
## [37] "End.position" "Unique..Groups."
## [39] "Unique..Proteins." "Charges"
## [41] "PEP" "Score"
## [43] "Identification.type.6A_7" "Identification.type.6A_8"
## [45] "Identification.type.6A_9" "Identification.type.6B_7"
## [47] "Identification.type.6B_8" "Identification.type.6B_9"
## [49] "Experiment.6A_7" "Experiment.6A_8"
## [51] "Experiment.6A_9" "Experiment.6B_7"
## [53] "Experiment.6B_8" "Experiment.6B_9"
## [55] "Intensity" "Reverse"
## [57] "Potential.contaminant" "id"
## [59] "Protein.group.IDs" "Mod..peptide.IDs"
## [61] "Evidence.IDs" "MS.MS.IDs"
## [63] "Best.MS.MS" "Oxidation..M..site.IDs"
## [65] "MS.MS.Count"
For the sake of simplicity, we can clean up the feature variables and only keep those of interest. It is possible to do this interactively with
cptac <- selectFeatureData(cptac)
or by setting the feature variables of interest.
cptac <- selectFeatureData(cptac,
fcol = c("Proteins",
"Potential.contaminant",
"Reverse",
"Sequence"))
fvarLabels(cptac)
## [1] "Proteins" "Potential.contaminant" "Reverse"
## [4] "Sequence"
Let’s also add sample annotations:
cptac$condition <- factor(rep(c("A", "B"), each = 3))
cptac$sample <- rep(7:9, 2)
pData(cptac)
## condition sample
## Intensity.6A_7 A 7
## Intensity.6A_8 A 8
## Intensity.6A_9 A 9
## Intensity.6B_7 B 7
## Intensity.6B_8 B 8
## Intensity.6B_9 B 9
This could also be done by reading a spreadsheet into R as a data.frame
, making sure that the rownames match the sample names exactly, and then adding it with pData(cptac) <- myDf
.
The sample names are rather long and contain information on the spike-in concentration and the repeat. We this remove Intensity.6
from the sample names:
sampleNames(cptac) <- sub("Intensity\\.6", "", sampleNames(cptac))
sampleNames(cptac)
## [1] "A_7" "A_8" "A_9" "B_7" "B_8" "B_9"
Note that data in the mzTab
format can be imported using the readMzTabData
function.
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
head(keep <- smallestUniqueGroups(fData(cptac)$Proteins))
## [1] "sp|P38915|SPT8_YEAST" "sp|P09938|RIR2_YEAST" "sp|P53075|SHE10_YEAST"
## [4] "sp|P00360|G3P1_YEAST" "sp|P15180|SYKC_YEAST" "sp|P15705|STI1_YEAST"
length(keep)
## [1] 1546
length(unique(fData(cptac)$Proteins))
## [1] 1722
As shown above, we will keep 1546 protein groups from the 1722 protein groups recorded in the data. This vector of protein names is used to filter the peptide-level data.
(cptac <- cptac[fData(cptac)$Proteins %in% keep, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 10740 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
## (10740 total)
## fvarLabels: Proteins Potential.contaminant Reverse Sequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## MSnbase version: 2.10.1
Exercise: How many peptides have we lost by removing the proteins above?
Below, we create vector of logicals (and count) recording peptides that are assigned to contaminant (such as keratine, trypsine, …) and reverse proteins (from the decoy database). These are annotated with a "+"
in the respective "Potential.contaminant"
and "Reverse"
feature variables.
table(sel_conts <- fData(cptac)$Potential.contaminant != "+")
##
## FALSE TRUE
## 62 10678
table(sel_rev <- fData(cptac)$Reverse != "+")
##
## TRUE
## 10740
(cptac <- cptac[sel_conts & sel_rev, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 10678 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
## (10678 total)
## fvarLabels: Proteins Potential.contaminant Reverse Sequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## MSnbase version: 2.10.1
You can keep track of the data processing steps in the object’s processing log.
Proteins for which all peptides are carrying modifications (PTMs) can be considered as unreliable. We will filter out these proteins. This information is included in the Only.identified.by.site
column of the proteinGroups.txt
MaxQuant file. The code chunk below provides this list of proteins.
ptm_only <- c("REV__CON__Q3T052", "REV__sp|P01120|RAS2_YEAST",
"REV__sp|P32849|RAD5_YEAST",
"REV__sp|Q03723|OST6_YEAST", "sp|P04051|RPC1_YEAST",
"sp|P06367|RS14A_YEAST",
"sp|P0CX73|YP11A_YEAST;sp|P0CX72|YL12A_YEAST;sp|P0CX71|YE11A_YEAST;sp|P0CX70|YD15A_YEAST;sp|Q6Q5H1|YP14A_YEAST;sp|P0C2I8|YL14A_YEAST",
"sp|P19657|PMA2_YEAST", "sp|P32465|HXT1_YEAST",
"sp|P39567|IMDH1_YEAST", "sp|P40527|ATC7_YEAST",
"sp|P40530|PDK1_YEAST", "sp|P40989|FKS2_YEAST",
"sp|P49955|SF3B1_YEAST", "sp|P51401|RL9B_YEAST",
"sp|P53072|TAN1_YEAST", "sp|Q03964|YD17A_YEAST",
"sp|Q04670|YM14B_YEAST;sp|Q12088|YL11B_YEAST;sp|Q03619|YE12B_YEAST",
"sp|Q08649|ESA1_YEAST", "sp|Q12112|YN11B_YEAST",
"sp|Q12479|IRC11_YEAST", "sp|Q3E7B7|YD85C_YEAST")
We now remove the peptides matched to these proteins:
(cptac <- cptac[!fData(cptac)$Proteins %in% ptm_only, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 10659 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
## (10659 total)
## fvarLabels: Proteins Potential.contaminant Reverse Sequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## MSnbase version: 2.10.1
Unfortunately, some software use 0 irrespective whether the data has intensity zero and when the data haven’t been observer. Below we fix this by setting all 0 values to NA
.
exprs(cptac)[exprs(cptac) == 0] <- NA
table(is.na(exprs(cptac)))
##
## FALSE TRUE
## 34637 29317
The following figure shows the distribution of missing values for samples (columns) and rows (peptides). The cells have been reporder to emphasis the presence of missing values in the proteins shown at the top and the samples shown towards the right.
napac <- cptac
exprs(napac)[!is.na(exprs(napac))] <- 1
naplot(napac)
The following figure the proportions of features (peptides in this case) with respect to their completeness (blue) and the percentage of missing data in the full dataset (red).
plotNA(cptac)
Below, we count the number of missing values in each
fData(cptac)$nNA <- apply(exprs(cptac), 1, function(x) sum(is.na(x)))
table(fData(cptac)$nNA)
##
## 0 1 2 3 4 5 6
## 3675 939 830 686 876 762 2891
Note that some peptides aren’t seen at all because these 6 samples are a subset of a larger dataset, and these features are present in the other acquisitions only.
From here on one could:
filterNA(cptac)
## MSnSet (storageMode: lockedEnvironment)
## assayData: 3675 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (3675 total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## Subset [10659,6][3675,6] Thu Jul 25 08:43:54 2019
## Removed features with more than 0 NAs: Thu Jul 25 08:43:54 2019
## Dropped featureData's levels Thu Jul 25 08:43:54 2019
## MSnbase version: 2.10.1
See Lazar et al. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.
We are going to keep peptides that had a least two observations. This can be done with the nNA
variable that we compute above
(cptac <- cptac[fData(cptac)$nNA <= 4, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 7006 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTITEVATR (7006
## total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019
## MSnbase version: 2.10.1
(cptac <- log(cptac, base = 2))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 7006 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTITEVATR (7006
## total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019
## Log transformed (base 2) Thu Jul 25 08:43:54 2019
## MSnbase version: 2.10.1
Normalisation is handled by the normalise
(or normalize
) function.
plotDensities(exprs(cptac))
(cptac <- normalise(cptac, method = "quantiles"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 7006 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTITEVATR (7006
## total)
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019
## Log transformed (base 2) Thu Jul 25 08:43:54 2019
## Normalised (quantiles): Thu Jul 25 08:43:54 2019
## MSnbase version: 2.10.1
plotDensities(exprs(cptac))
We can visualise our peptide-level data using a Multi Dimensional Scaling (MDS) plot, using the plotMDS
function from the limma
package. We use the condition sample variable to colour-code the samples.
plotMDS(exprs(cptac), col = as.numeric(cptac$condition))
The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences in the peptide data seems to be driven by technical variability. Indeed the samples do not seem to be clearly separated according to the spike in condition.
So far we have used quantitation values at the peptide level, while the data of interest are proteins. We can take all peptides that are associated with a protein group, as defined by the Proteins
feature variable, and aggregate them using an summary function of choice.
Below, we combine the peptides into proteins using the median, passing na.rm = TRUE
to account for the missing values in the data.
(cptac_prot <- combineFeatures(cptac, fcol = "Proteins",
method = "median", na.rm = TRUE))
## Your data contains missing values. Please read the relevant section
## in the combineFeatures manual page for details the effects of
## missing values on data aggregation.
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1384 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: O00762ups|UBE2C_HUMAN_UPS P00167ups|CYB5_HUMAN_UPS
## ... sp|Q99385|VCX1_YEAST (1384 total)
## fvarLabels: Proteins Potential.contaminant ... CV.B_9 (11 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019
## Log transformed (base 2) Thu Jul 25 08:43:54 2019
## Normalised (quantiles): Thu Jul 25 08:43:54 2019
## Combined 7006 features into 1384 using median: Thu Jul 25 08:43:57 2019
## MSnbase version: 2.10.1
We obtain 1384 proteins. Note how the processing steps are recorded. Below, we visualise the protein-level data on an MDS plot.
plotMDS(exprs(cptac_prot), col = as.numeric(cptac_prot$condition))
Exercise: Repeat the summarisation using a the robust normalisation by settingmethod = "robust"
, as described in Sticker et al. (2019) and visualise the data on an MDS plot. Which one do you anticipate to provide better results?
cptac_rob <- combineFeatures(cptac, fcol = "Proteins",
method = "robust", na.rm = TRUE)
## Your data contains missing values. Please read the relevant section
## in the combineFeatures manual page for details the effects of
## missing values on data aggregation.
plotMDS(exprs(cptac_rob), col = as.numeric(cptac_rob$condition))
Exercise: In this section, the median-summarised protein data will be used. As an exercice, you are advised to use the best of the median- or robust-summerised data and compare your results with those shown below.
MSqRob
is currently working with a format where we have one dataframe for each protein. This will be changed in the next release to use the MSnSet
directly. Therefore we first have to reorganise the data.
protMSqRob <- MSnSet2protdata(cptac_prot, "Proteins")
Next the models are fitted. This is done using the fit.model
function. We only have to model the data using the factor condition from the sample metadata (the pData
slot) of the protein level MSnSet
. The name of the factor variable is specified in the fixed argument (if multiple predictors have to be incorporated in the model, a vector of variable names has to be provided in this argument). The argument shrinkage is used to specify if ridge regression has to be adopted. For the sake of speed we do not do this in the tutorial. The shrinkage has to be specified for each variable in the fixed effects. We also have to indicate this for the intercept (which we never shrink). So we specify it at c(0, 0)
to indicate that the intercept (first 0) and the parameters for the factor condition (second 0) are not penalized. We set the robust_var
function equal to FALSE
- this functionality will be removed from the package in the next release.
models <- fit.model(protdata = protMSqRob,
response = "quant_value",
fixed = "condition",
shrinkage.fixed = c(0, 0),
robust_var = FALSE)
Often, biologists have problems with the reference coding. In MSqRob we have opted to formulate contrasts using all levels of a factor. Internally, the contrasts are than recasted according to the factor level that is the reference class.
L <- makeContrast("conditionB - conditionA",
levels = c("conditionA", "conditionB"))
res <- test.contrast_adjust(models, L)
head(res)
## estimate se df Tval
## P99999ups|CYC_HUMAN_UPS 2.152716 0.1663700 7.230582 12.939331
## P63279ups|UBC9_HUMAN_UPS 1.822164 0.1531925 7.230582 11.894609
## P08311ups|CATG_HUMAN_UPS 1.316092 0.1817428 7.230582 7.241510
## P06732ups|KCRM_HUMAN_UPS 1.274376 0.1986088 7.230582 6.416512
## sp|P12688|YPK1_YEAST -1.655782 0.2398832 6.230582 -6.902449
## O00762ups|UBE2C_HUMAN_UPS 1.604233 0.2339066 6.230582 6.858435
## pval qval signif
## P99999ups|CYC_HUMAN_UPS 2.914470e-06 0.003419021 TRUE
## P63279ups|UBC9_HUMAN_UPS 5.227860e-06 0.003419021 TRUE
## P08311ups|CATG_HUMAN_UPS 1.463569e-04 0.063811592 FALSE
## P06732ups|KCRM_HUMAN_UPS 3.158129e-04 0.083831332 FALSE
## sp|P12688|YPK1_YEAST 3.879161e-04 0.083831332 FALSE
## O00762ups|UBE2C_HUMAN_UPS 4.020442e-04 0.083831332 FALSE
Below, we put the results of the statistical analysis back into the MSnSet
feature data, to keep the data and their analysis together. We need however to rearrange the proteins in the result dataframe (ordered by adjusted p-value) to match the order in the MSnSet
.
fData(cptac_prot)$res <- res[featureNames(cptac_prot), ]
There are 2 protein groups identified as differentially expressed at a significant effect at the 5% FDR level.
volc <- ggplot(fData(cptac_prot)$res,
aes(x = estimate,
y = -log10(pval),
color = signif)) +
geom_point() +
scale_color_manual(values = c("black", "red"))
volc
## Warning: Removed 76 rows containing missing values (geom_point).
It is easy to generate an interactive graph to explore the results using the plotly
package and passing the ggplot-object to the ggplotly
function:
plotly::ggplotly(volc)
sign <- which(fData(cptac_prot)$res$sig)
heatmap(exprs(cptac_prot)[sign,])
Below, we extract the peptide data matching the differentially expressed proteins.
sign_prots <- featureNames(cptac_prot)[sign]
(cptac_sign <- cptac[fData(cptac)$Proteins %in% sign_prots, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A_7 A_8 ... B_9 (6 total)
## varLabels: condition sample
## varMetadata: labelDescription
## featureData
## featureNames: ADLIAYLK ADLIAYLKK GTPWEGGLFK TGQAPGYSYTAANK
## fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019
## Log transformed (base 2) Thu Jul 25 08:43:54 2019
## Normalised (quantiles): Thu Jul 25 08:43:54 2019
## Subset [7006,6][4,6] Thu Jul 25 08:44:05 2019
## MSnbase version: 2.10.1
In the following code chunk, we convert the peptide-level MSnSet
containing the significant peptides using the ms2df
helper function, convert that wide format dataframe to a long format and visualise the expression distributions in each group.
ms2df(cptac_sign) %>%
tidyr::gather(key = sample, value = expression, 1:6) %>%
ggplot(aes(x = sample, y = expression)) +
geom_boxplot() +
geom_jitter(aes(colour = Sequence)) +
facet_grid(Proteins ~ .)
Because we know the ground truth for the cptac study, i.e. we know that only the spike-in proteins (UPS) are differentially expressed, we can evalute the fold changes. Yeast proteins should be not differentially expressed and their log fold changes should be centered around 0. These of UPS proteins are spiked at differt concentrations and their log2 fold changes should be centered around \(log2(concB/concA)\), i.e \(log2(0.74/0.25) = 1.56\).
fData(cptac_prot)$res$spike <- grepl("UPS", fData(cptac_prot)$Proteins)
ggplot(fData(cptac_prot)$res,
aes(x = spike, y = estimate)) +
geom_boxplot() +
ylab("log2 FC") +
geom_hline(yintercept = c(0, log(0.74/0.25, base = 2)),
color = "red")
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).
Because we know the ground truth for the cptac study, i.e. we know that only the spike-in proteins (UPS) are differentially expressed, we can calculate
\[TPR=\frac{TP}{\text{#actual positives}},\]
here TP are the true positives in the list. The TPR is thus the fraction of ups proteins that we can recall.
\[FPD=\frac{FP}{FP+TP},\]
with FP the false positives. In our case the yeast proteins that are in our list.
Instead of only calculating that for the protein list that is returned for the chosen FDR level, we can do this for all possible FDR cutoffs so that we get an overview of the quality of the ranking of the proteins in the protein list.
fData(cptac_prot)$res %>%
dplyr::arrange(qval) %>%
dplyr::mutate(FDP = cumsum(!spike)/(1:length(spike)),
TPR = cumsum(spike)/sum(spike)) %>%
ggplot(aes(x = FDP, y = TPR)) +
geom_path() +
geom_point() +
geom_vline(xintercept = 0.05, lty = 2)
For additional material on MSnbase
and MS/proteomics data manipualation, see here and the main MSnbase
vignette.
For additional MSqRob
material, please see here
## R version 3.6.0 RC (2019-04-21 r76417)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
## [4] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
## [7] tibble_2.1.3 ggplot2_3.2.0 tidyverse_1.2.1
## [10] limma_3.40.2 msdata_0.25.2 MSqRob_0.7.6
## [13] lme4_1.1-21 Matrix_1.2-17 MSnbase_2.10.1
## [16] ProtGenerics_1.16.0 S4Vectors_0.22.0 mzR_2.18.0
## [19] Rcpp_1.0.1 Biobase_2.44.0 BiocGenerics_0.30.0
## [22] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-140 lubridate_1.7.4 doParallel_1.0.14
## [4] httr_1.4.0 tools_3.6.0 backports_1.1.4
## [7] R6_2.4.0 affyio_1.54.0 lazyeval_0.2.2
## [10] colorspace_1.4-1 withr_2.1.2 tidyselect_0.2.5
## [13] compiler_3.6.0 preprocessCore_1.46.0 cli_1.1.0
## [16] rvest_0.3.4 xml2_1.2.0 labeling_0.3
## [19] bookdown_0.12 scales_1.0.0 affy_1.62.0
## [22] digest_0.6.20 minqa_1.2.4 rmarkdown_1.14
## [25] pkgconfig_2.0.2 htmltools_0.3.6 highr_0.8
## [28] rlang_0.4.0 readxl_1.3.1 rstudioapi_0.10
## [31] impute_1.58.0 generics_0.0.2 jsonlite_1.6
## [34] mzID_1.22.0 BiocParallel_1.18.0 magrittr_1.5
## [37] MALDIquant_1.19.3 munsell_0.5.0 vsn_3.52.0
## [40] stringi_1.4.3 yaml_2.2.0 MASS_7.3-51.4
## [43] zlibbioc_1.30.0 plyr_1.8.4 grid_3.6.0
## [46] crayon_1.3.4 lattice_0.20-38 haven_2.1.1
## [49] splines_3.6.0 hms_0.5.0 zeallot_0.1.0
## [52] knitr_1.23 pillar_1.4.2 boot_1.3-23
## [55] reshape2_1.4.3 codetools_0.2-16 XML_3.98-1.20
## [58] glue_1.3.1 evaluate_0.14 pcaMethods_1.76.0
## [61] BiocManager_1.30.4 modelr_0.1.4 vctrs_0.2.0
## [64] nloptr_1.2.1 foreach_1.4.4 cellranger_1.1.0
## [67] gtable_0.3.0 assertthat_0.2.1 xfun_0.8
## [70] broom_0.5.2 ncdf4_1.16.1 iterators_1.0.10
## [73] IRanges_2.18.1