Contents

1 Introduction

This tutorial requires the MSnbase (Gatto and Lilley (2011)), MSqRob (Goeminne et al. (2016)), limma (Ritchie et al. (2015)) and msdata packages, as well as some of the tidyverse packages for data tidying and visualisation.

library("MSnbase")
library("MSqRob")
library("msdata")
library("limma")
library("tidyverse")

Sections 2 to 4 deal with raw mass spectrometry and identification data stored in one of the open community-developed data formats (see below). Section 5 describes how to produce or import quantitative data and the sections thereafter focus on the processing and analysis of differentially expressed proteins.

Data files in mass spectrometry

Most community-driven formats described in the table are supported in R. We will see how to read and access these data in the following sections.

Type Format Package
raw mzML, mzXML, netCDF, mzData MSnbase (read and write in version >= 2.3.13) via mzR
identification mzIdentML mzID (read) and MSnbase (read, via mzR)
quantitation mzQuantML
peak lists mgf MSnbase (read)
quantitation (with id) mzTab, spreadsheets MSnbase (read, read/write)
protein db fasta Biostrings

In the next sections, we will use raw data (in mzML and mzXML format), identification data (in mzIdentML format) and quantitative data (from a tab-separated spreadsheet).

2 Raw MS data

Raw data files (in any of the above formats) is read into R using readMSData function from the MSnbase package, that will return an list-like object of class MSnExp. Below, we first extract the full path to the MS3TMT11.mzML file from the msdata package1 The proteomics, ident and quant msdata functions return example files for raw, identification and quantitative data respectively. before reading it in.

basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11"))
## [1] "MS3TMT11.mzML"
(rw3 <- readMSData(fl3, mode = "onDisk"))
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.53 Mb
## - - - Spectra data - - -
##  MS level(s): 1 2 3 
##  Number of spectra: 994 
##  MSn retention times: 45:27 - 47:6 minutes
## - - - Processing information - - -
## Data loaded [Thu Jul 25 08:43:45 2019] 
##  MSnbase version: 2.10.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: MS3TMT11.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   MS3TMT11.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
##   fvarLabels: fileIdx spIdx ... spectrum (33 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

Note: above, we specify to use the on disk mode (as opposed to in memory) to avoid loading the whole raw data file(s) into memory. With on disk, the raw data will be accessed on demand. Here, we only read a single MS file into R, but with on disk mode, we could load 100s thereof.

An MSnExp object contains

Exercise: Extract the first, second and tenth spectrum. What are their MS levels, precursor m/z and retention times?

rw3[[1]]
## Object of class "Spectrum1"
##  Retention time: 45:27 
##  MSn level: 1 
##  Total ion count: 10768 
##  Polarity: 1

We can also access specific data for the whole experiment using accessors:

See ?MSnExp for details.

Exercise: Using msLevel, extract the MS level of the 994 spectra of the rw3 file. What levels are available in these data. Using centroided, check what spectra and centroided of in profile mode (see figure below). Which are the levels that are centroided or in profile mode.

Peak picking. Raw profile mode data (left) as measured by the mass spectrometre, and processed, centroided data (right). See also this [`MSnbase` vignette](http://lgatto.github.io/MSnbase/articles/v03-MSnbase-centroiding.html).

Figure 1: Peak picking
Raw profile mode data (left) as measured by the mass spectrometre, and processed, centroided data (right). See also this MSnbase vignette.

table(msLevel(rw3), centroided(rw3))
##    
##     FALSE TRUE
##   1    30    0
##   2     0  482
##   3     0  482

3 Identification results

Identification data in mzIdentML is parsed and loaded as a data.frame using the readMzIdData function.

basename(idf <- msdata::ident(full.name = TRUE))
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
iddf <- readMzIdData(idf)

Among the identification variables available, you will find MS.GF.RawScore, the identification score computated by MSGF+ (the search engine used for the search), and isDecoy.

names(iddf)

Exercise: How many PSMs from the read and decoy database are recoded in the identification data? Calculate summary statistics for MSGF+ raw scores.

table(iddf$isDecoy)
## 
## FALSE  TRUE 
##  2906  2896
summary(iddf$MS.GF.RawScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -141.00  -31.00  -13.00  -14.24    3.00  214.00

Exercise: Reproduce and interpret the plot showing the identification score distribution for the decoy and real peptides in this file.

ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) +
    geom_density()

4 Combining raw and identification data

When working with raw and identification data, these two can be merged by adding the identification results to the raw feature meta-data slot. Below, we use small data that is shipped with the MSnbase package. The raw data being very small, we can afford to read it into memory without specifying mode = "onDisk".

basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzXML$"))
## [1] "dummyiTRAQ.mzXML"
basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                          full.name = TRUE, pattern = "dummyiTRAQ.mzid"))
## [1] "dummyiTRAQ.mzid"
msexp <- readMSData(quantFile)
fvarLabels(msexp)
## [1] "spectrum"

Add this stage, there’s only one spectrum feature variable with the spectra indices. Below we add the identification data with addIdentificationData to gain 33 new feature variables.

msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)
##  [1] "spectrum"                 "acquisition.number"      
##  [3] "sequence"                 "chargeState"             
##  [5] "rank"                     "passThreshold"           
##  [7] "experimentalMassToCharge" "calculatedMassToCharge"  
##  [9] "modNum"                   "isDecoy"                 
## [11] "post"                     "pre"                     
## [13] "start"                    "end"                     
## [15] "DatabaseAccess"           "DBseqLength"             
## [17] "DatabaseSeq"              "DatabaseDescription"     
## [19] "scan.number.s."           "idFile"                  
## [21] "MS.GF.RawScore"           "MS.GF.DeNovoScore"       
## [23] "MS.GF.SpecEValue"         "MS.GF.EValue"            
## [25] "modName"                  "modMass"                 
## [27] "modLocation"              "subOriginalResidue"      
## [29] "subReplacementResidue"    "subLocation"             
## [31] "nprot"                    "npep.prot"               
## [33] "npsm.prot"                "npsm.pep"

Note that prior to addition, the identification data is filtered as documented in the filterIdentificationDataFrame function:

  1. only PSMs matching the regular (non-decoy) database are retained;
  2. PSMs of rank greater than 1 are discarded;
  3. only proteotypic peptides are kept.

5 Quantitative data

Quantitative data is stored a objects of class MSnSet, that are composed of an expression matrix (accessed with exprs()), a feature meta-data dataframe (accessed with fData()) and a meta-data dataframe (accessed with pData()).

The MSnSet structure

Figure 2: The MSnSet structure

MSnSet objects can be:

  1. Created from an MSnExp object with the quantify function.
  2. Created from a spreadsheet produced by a third party software using the readMSnSet2 function.

5.1 Quantifying raw data

data(itraqdata)
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 55 
##  MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
##  MSnbase version: 1.1.22 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: sampleNames sampleNumbers
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ProteinDescription
##     PeptideSequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
msnset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 55 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
##   varLabels: mz reporters
##   varMetadata: labelDescription
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ... collision.energy (15
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation 
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Thu Jul 25 08:43:50 2019 
##  MSnbase version: 1.1.22

Below, we plot the first MS2 spectrum from the itraqdata test data, highlighting the four iTRAQ reporter ions and extract the quantiation values and feature metadata of that same first record.

plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE)
An iTRAQ spectrum. The full spectrum is the composite of the peptide from all samples and is used for identification. The reporter ion peaks are specific to each sample and are used for quantitation.

Figure 3: An iTRAQ spectrum
The full spectrum is the composite of the peptide from all samples and is used for identification. The reporter ion peaks are specific to each sample and are used for quantitation.

exprs(msnset)[1, ]
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117 
##   1347.616   2247.310   3927.693   7661.146
fData(msnset)[1, ]
##    spectrum ProteinAccession   ProteinDescription PeptideSequence fileIdx
## X1        1              BSA bovine serum albumin          NYQEAK       1
##    retention.time precursor.mz precursor.intensity charge peaks.count
## X1        1149.31     520.7833             3449020      2        1922
##         tic ionCount ms.level acquisition.number collision.energy
## X1 26413754 26413754        2                  2               40

5.2 Importing third-party data

We are going to use the cptac_a_b_peptides.txt tab-separated file from the msdata package. These data are the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/μL Saccharomyces cerevisiae strain BY4741 . Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/μL) and 6B (0.74 fmol UPS1 proteins/μL). We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016). Three replicates are available for each concentration. The study is a spike-in study for which we know the ground truth so we have the ability to evaluate the quality of the fold change estimates and the list of DE genes that we return with a method.

basename(f <- msdata::quant(full.names = TRUE))
## [1] "cptac_a_b_peptides.txt"

Before reading the spreadsheet, we need to identify which columns contain quantitation data (that will be used to populate the exprs slot) and the feature data (that will be put into the fData slot).

The getEcols function lists the column names in the expression data spreadsheet. The quantitative values we want to used are those in the columns starting withIntensity 6A_7, Intensity 6A_8, … Intensity 6B_9, that we refer to with Intensity.

getEcols(f, split = "\t")
##  [1] "Sequence"                 "N-term cleavage window"  
##  [3] "C-term cleavage window"   "Amino acid before"       
##  [5] "First amino acid"         "Second amino acid"       
##  [7] "Second last amino acid"   "Last amino acid"         
##  [9] "Amino acid after"         "A Count"                 
## [11] "R Count"                  "N Count"                 
## [13] "D Count"                  "C Count"                 
## [15] "Q Count"                  "E Count"                 
## [17] "G Count"                  "H Count"                 
## [19] "I Count"                  "L Count"                 
## [21] "K Count"                  "M Count"                 
## [23] "F Count"                  "P Count"                 
## [25] "S Count"                  "T Count"                 
## [27] "W Count"                  "Y Count"                 
## [29] "V Count"                  "U Count"                 
## [31] "Length"                   "Missed cleavages"        
## [33] "Mass"                     "Proteins"                
## [35] "Leading razor protein"    "Start position"          
## [37] "End position"             "Unique (Groups)"         
## [39] "Unique (Proteins)"        "Charges"                 
## [41] "PEP"                      "Score"                   
## [43] "Identification type 6A_7" "Identification type 6A_8"
## [45] "Identification type 6A_9" "Identification type 6B_7"
## [47] "Identification type 6B_8" "Identification type 6B_9"
## [49] "Experiment 6A_7"          "Experiment 6A_8"         
## [51] "Experiment 6A_9"          "Experiment 6B_7"         
## [53] "Experiment 6B_8"          "Experiment 6B_9"         
## [55] "Intensity"                "Intensity 6A_7"          
## [57] "Intensity 6A_8"           "Intensity 6A_9"          
## [59] "Intensity 6B_7"           "Intensity 6B_8"          
## [61] "Intensity 6B_9"           "Reverse"                 
## [63] "Potential contaminant"    "id"                      
## [65] "Protein group IDs"        "Mod. peptide IDs"        
## [67] "Evidence IDs"             "MS/MS IDs"               
## [69] "Best MS/MS"               "Oxidation (M) site IDs"  
## [71] "MS/MS Count"

Using a pattern, we can set the columns to be used to populate the quantitation slot.

(e <- grepEcols(f, "Intensity ", split = "\t")) ## careful at the space!
## [1] 56 57 58 59 60 61
(cptac <- readMSnSet2(f, ecol = e,
                      fnames = "Sequence",
                      sep = "\t"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 11466 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (11466 total)
##   fvarLabels: Sequence N.term.cleavage.window ... MS.MS.Count (65
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.10.1

We can access the peptide-level expression data with exprs and the feature meta-data with fData.

head(exprs(cptac))
##                    Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## AAAAGAGGAGDSGDAVTK              0              0          66760
## AAAALAGGK                 2441300        1220000        1337600
## AAAALAGGKK                1029200         668040         638990
## AAADALSDLEIK               515460         670780         712140
## AAADALSDLEIKDSK            331130         420900         365560
## AAAEEFQR                        0              0          51558
##                    Intensity.6B_7 Intensity.6B_8 Intensity.6B_9
## AAAAGAGGAGDSGDAVTK              0          37436              0
## AAAALAGGK                 2850900         935580        1606200
## AAAALAGGKK                 777030         641270         562840
## AAADALSDLEIK               426580         620510         737780
## AAADALSDLEIKDSK            329250         380820         414490
## AAAEEFQR                        0              0          76970
tail(exprs(cptac))
##                     Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## YYSISSSSLSEK                309270         357640         337930
## YYSIYDLGNNAVGLAK                 0              0              0
## YYTFNGPNYNENETIR             88629              0          55584
## YYTITEVATR                  298360              0              0
## YYTVFDRDNNR                      0              0              0
## YYTVFDRDNNRVGFAEAAR              0              0              0
##                     Intensity.6B_7 Intensity.6B_8 Intensity.6B_9
## YYSISSSSLSEK                387870         361990         432020
## YYSIYDLGNNAVGLAK                 0              0              0
## YYTFNGPNYNENETIR             63071              0              0
## YYTITEVATR                       0         366000              0
## YYTVFDRDNNR                      0              0              0
## YYTVFDRDNNRVGFAEAAR              0              0              0
fvarLabels(cptac)
##  [1] "Sequence"                 "N.term.cleavage.window"  
##  [3] "C.term.cleavage.window"   "Amino.acid.before"       
##  [5] "First.amino.acid"         "Second.amino.acid"       
##  [7] "Second.last.amino.acid"   "Last.amino.acid"         
##  [9] "Amino.acid.after"         "A.Count"                 
## [11] "R.Count"                  "N.Count"                 
## [13] "D.Count"                  "C.Count"                 
## [15] "Q.Count"                  "E.Count"                 
## [17] "G.Count"                  "H.Count"                 
## [19] "I.Count"                  "L.Count"                 
## [21] "K.Count"                  "M.Count"                 
## [23] "F.Count"                  "P.Count"                 
## [25] "S.Count"                  "T.Count"                 
## [27] "W.Count"                  "Y.Count"                 
## [29] "V.Count"                  "U.Count"                 
## [31] "Length"                   "Missed.cleavages"        
## [33] "Mass"                     "Proteins"                
## [35] "Leading.razor.protein"    "Start.position"          
## [37] "End.position"             "Unique..Groups."         
## [39] "Unique..Proteins."        "Charges"                 
## [41] "PEP"                      "Score"                   
## [43] "Identification.type.6A_7" "Identification.type.6A_8"
## [45] "Identification.type.6A_9" "Identification.type.6B_7"
## [47] "Identification.type.6B_8" "Identification.type.6B_9"
## [49] "Experiment.6A_7"          "Experiment.6A_8"         
## [51] "Experiment.6A_9"          "Experiment.6B_7"         
## [53] "Experiment.6B_8"          "Experiment.6B_9"         
## [55] "Intensity"                "Reverse"                 
## [57] "Potential.contaminant"    "id"                      
## [59] "Protein.group.IDs"        "Mod..peptide.IDs"        
## [61] "Evidence.IDs"             "MS.MS.IDs"               
## [63] "Best.MS.MS"               "Oxidation..M..site.IDs"  
## [65] "MS.MS.Count"

For the sake of simplicity, we can clean up the feature variables and only keep those of interest. It is possible to do this interactively with

cptac <- selectFeatureData(cptac)

or by setting the feature variables of interest.

cptac <- selectFeatureData(cptac,
                           fcol = c("Proteins",
                                    "Potential.contaminant",
                                    "Reverse",
                                    "Sequence"))
fvarLabels(cptac)
## [1] "Proteins"              "Potential.contaminant" "Reverse"              
## [4] "Sequence"

Let’s also add sample annotations:

cptac$condition <- factor(rep(c("A", "B"), each = 3))
cptac$sample <- rep(7:9, 2)
pData(cptac)
##                condition sample
## Intensity.6A_7         A      7
## Intensity.6A_8         A      8
## Intensity.6A_9         A      9
## Intensity.6B_7         B      7
## Intensity.6B_8         B      8
## Intensity.6B_9         B      9

This could also be done by reading a spreadsheet into R as a data.frame, making sure that the rownames match the sample names exactly, and then adding it with pData(cptac) <- myDf.

The sample names are rather long and contain information on the spike-in concentration and the repeat. We this remove Intensity.6 from the sample names:

sampleNames(cptac) <- sub("Intensity\\.6", "", sampleNames(cptac))
sampleNames(cptac) 
## [1] "A_7" "A_8" "A_9" "B_7" "B_8" "B_9"

Note that data in the mzTab format can be imported using the readMzTabData function.

6 Quantitative data processing

6.1 Handling overlapping protein groups

In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.

head(keep <- smallestUniqueGroups(fData(cptac)$Proteins))
## [1] "sp|P38915|SPT8_YEAST"  "sp|P09938|RIR2_YEAST"  "sp|P53075|SHE10_YEAST"
## [4] "sp|P00360|G3P1_YEAST"  "sp|P15180|SYKC_YEAST"  "sp|P15705|STI1_YEAST"
length(keep)
## [1] 1546
length(unique(fData(cptac)$Proteins))
## [1] 1722

As shown above, we will keep 1546 protein groups from the 1722 protein groups recorded in the data. This vector of protein names is used to filter the peptide-level data.

(cptac <- cptac[fData(cptac)$Proteins %in% keep, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 10740 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (10740 total)
##   fvarLabels: Proteins Potential.contaminant Reverse Sequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
##  MSnbase version: 2.10.1

Exercise: How many peptides have we lost by removing the proteins above?

6.2 Filtering out contaminants and reverse hits

Below, we create vector of logicals (and count) recording peptides that are assigned to contaminant (such as keratine, trypsine, …) and reverse proteins (from the decoy database). These are annotated with a "+" in the respective "Potential.contaminant" and "Reverse" feature variables.

table(sel_conts <- fData(cptac)$Potential.contaminant != "+")
## 
## FALSE  TRUE 
##    62 10678
table(sel_rev <- fData(cptac)$Reverse != "+")
## 
##  TRUE 
## 10740
(cptac <- cptac[sel_conts & sel_rev, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 10678 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (10678 total)
##   fvarLabels: Proteins Potential.contaminant Reverse Sequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
##  MSnbase version: 2.10.1

You can keep track of the data processing steps in the object’s processing log.

6.3 Removing PTM-only proteins

Proteins for which all peptides are carrying modifications (PTMs) can be considered as unreliable. We will filter out these proteins. This information is included in the Only.identified.by.site column of the proteinGroups.txt MaxQuant file. The code chunk below provides this list of proteins.

ptm_only <- c("REV__CON__Q3T052", "REV__sp|P01120|RAS2_YEAST",
              "REV__sp|P32849|RAD5_YEAST",
              "REV__sp|Q03723|OST6_YEAST", "sp|P04051|RPC1_YEAST",
              "sp|P06367|RS14A_YEAST",
              "sp|P0CX73|YP11A_YEAST;sp|P0CX72|YL12A_YEAST;sp|P0CX71|YE11A_YEAST;sp|P0CX70|YD15A_YEAST;sp|Q6Q5H1|YP14A_YEAST;sp|P0C2I8|YL14A_YEAST",
              "sp|P19657|PMA2_YEAST", "sp|P32465|HXT1_YEAST",
              "sp|P39567|IMDH1_YEAST", "sp|P40527|ATC7_YEAST",
              "sp|P40530|PDK1_YEAST", "sp|P40989|FKS2_YEAST",
              "sp|P49955|SF3B1_YEAST", "sp|P51401|RL9B_YEAST",
              "sp|P53072|TAN1_YEAST", "sp|Q03964|YD17A_YEAST",
              "sp|Q04670|YM14B_YEAST;sp|Q12088|YL11B_YEAST;sp|Q03619|YE12B_YEAST",
              "sp|Q08649|ESA1_YEAST", "sp|Q12112|YN11B_YEAST",
              "sp|Q12479|IRC11_YEAST", "sp|Q3E7B7|YD85C_YEAST")

We now remove the peptides matched to these proteins:

(cptac <- cptac[!fData(cptac)$Proteins %in% ptm_only, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 10659 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (10659 total)
##   fvarLabels: Proteins Potential.contaminant Reverse Sequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
##  MSnbase version: 2.10.1

6.4 Notes on missing values

Unfortunately, some software use 0 irrespective whether the data has intensity zero and when the data haven’t been observer. Below we fix this by setting all 0 values to NA.

exprs(cptac)[exprs(cptac) == 0] <- NA
table(is.na(exprs(cptac)))
## 
## FALSE  TRUE 
## 34637 29317

The following figure shows the distribution of missing values for samples (columns) and rows (peptides). The cells have been reporder to emphasis the presence of missing values in the proteins shown at the top and the samples shown towards the right.

napac <- cptac
exprs(napac)[!is.na(exprs(napac))] <- 1
naplot(napac)
Overview of missing data

Figure 4: Overview of missing data

The following figure the proportions of features (peptides in this case) with respect to their completeness (blue) and the percentage of missing data in the full dataset (red).

plotNA(cptac)
Data completeness.

Figure 5: Data completeness

Below, we count the number of missing values in each

fData(cptac)$nNA <- apply(exprs(cptac), 1, function(x) sum(is.na(x)))
table(fData(cptac)$nNA)
## 
##    0    1    2    3    4    5    6 
## 3675  939  830  686  876  762 2891

Note that some peptides aren’t seen at all because these 6 samples are a subset of a larger dataset, and these features are present in the other acquisitions only.

From here on one could:

  • filter data with missing values, which however sacrifices a lot of data.
filterNA(cptac)
## MSnSet (storageMode: lockedEnvironment)
## assayData: 3675 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAALAGGK AAAALAGGKK ... YYSISSSSLSEK (3675 total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
## Subset [10659,6][3675,6] Thu Jul 25 08:43:54 2019 
## Removed features with more than 0 NAs: Thu Jul 25 08:43:54 2019 
## Dropped featureData's levels Thu Jul 25 08:43:54 2019 
##  MSnbase version: 2.10.1
  • perform imputation, considering the underlying nature of missingness, i.e missing not at random (left-censored) or at random.
Effect of the nature of missing values on their imputation. Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better.

Figure 6: Effect of the nature of missing values on their imputation
Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better.

See Lazar et al. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

  • The best solution is arguably to handle missing values at the statistical test level, which is the approach we are going to use.

We are going to keep peptides that had a least two observations. This can be done with the nNA variable that we compute above

(cptac <- cptac[fData(cptac)$nNA <= 4, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 7006 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTITEVATR (7006
##     total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019 
##  MSnbase version: 2.10.1

6.5 Log transformation

(cptac <- log(cptac, base = 2))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 7006 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTITEVATR (7006
##     total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019 
## Log transformed (base 2) Thu Jul 25 08:43:54 2019 
##  MSnbase version: 2.10.1

6.6 Normalisation

Normalisation is handled by the normalise (or normalize) function.

plotDensities(exprs(cptac))

(cptac <- normalise(cptac, method = "quantiles"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 7006 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTITEVATR (7006
##     total)
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019 
## Log transformed (base 2) Thu Jul 25 08:43:54 2019 
## Normalised (quantiles): Thu Jul 25 08:43:54 2019 
##  MSnbase version: 2.10.1
plotDensities(exprs(cptac))

We can visualise our peptide-level data using a Multi Dimensional Scaling (MDS) plot, using the plotMDS function from the limma package. We use the condition sample variable to colour-code the samples.

plotMDS(exprs(cptac), col = as.numeric(cptac$condition))

The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences in the peptide data seems to be driven by technical variability. Indeed the samples do not seem to be clearly separated according to the spike in condition.

6.7 Summarisation

So far we have used quantitation values at the peptide level, while the data of interest are proteins. We can take all peptides that are associated with a protein group, as defined by the Proteins feature variable, and aggregate them using an summary function of choice.

Multiple levels of summarisation. Examples of aggregation from PSMs, peptides to proteins using median aggregation (from the [`Features`](https://rformassspectrometry.github.io/Features/articles/Features.html) package).

Figure 7: Multiple levels of summarisation
Examples of aggregation from PSMs, peptides to proteins using median aggregation (from the Features package).

Below, we combine the peptides into proteins using the median, passing na.rm = TRUE to account for the missing values in the data.

(cptac_prot <- combineFeatures(cptac, fcol = "Proteins",
                               method = "median", na.rm = TRUE))
## Your data contains missing values. Please read the relevant section
## in the combineFeatures manual page for details the effects of
## missing values on data aggregation.
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1384 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: O00762ups|UBE2C_HUMAN_UPS P00167ups|CYB5_HUMAN_UPS
##     ... sp|Q99385|VCX1_YEAST (1384 total)
##   fvarLabels: Proteins Potential.contaminant ... CV.B_9 (11 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019 
## Log transformed (base 2) Thu Jul 25 08:43:54 2019 
## Normalised (quantiles): Thu Jul 25 08:43:54 2019 
## Combined 7006 features into 1384 using median: Thu Jul 25 08:43:57 2019 
##  MSnbase version: 2.10.1

We obtain 1384 proteins. Note how the processing steps are recorded. Below, we visualise the protein-level data on an MDS plot.

plotMDS(exprs(cptac_prot), col = as.numeric(cptac_prot$condition))

Exercise: Repeat the summarisation using a the robust normalisation by settingmethod = "robust", as described in Sticker et al. (2019) and visualise the data on an MDS plot. Which one do you anticipate to provide better results?

cptac_rob <- combineFeatures(cptac, fcol = "Proteins",
                             method = "robust", na.rm = TRUE)
## Your data contains missing values. Please read the relevant section
## in the combineFeatures manual page for details the effects of
## missing values on data aggregation.
plotMDS(exprs(cptac_rob), col = as.numeric(cptac_rob$condition))

7 Differential expression analysis

Exercise: In this section, the median-summarised protein data will be used. As an exercice, you are advised to use the best of the median- or robust-summerised data and compare your results with those shown below.

7.1 Estimation

MSqRob is currently working with a format where we have one dataframe for each protein. This will be changed in the next release to use the MSnSet directly. Therefore we first have to reorganise the data.

protMSqRob <- MSnSet2protdata(cptac_prot, "Proteins")

Next the models are fitted. This is done using the fit.model function. We only have to model the data using the factor condition from the sample metadata (the pData slot) of the protein level MSnSet. The name of the factor variable is specified in the fixed argument (if multiple predictors have to be incorporated in the model, a vector of variable names has to be provided in this argument). The argument shrinkage is used to specify if ridge regression has to be adopted. For the sake of speed we do not do this in the tutorial. The shrinkage has to be specified for each variable in the fixed effects. We also have to indicate this for the intercept (which we never shrink). So we specify it at c(0, 0) to indicate that the intercept (first 0) and the parameters for the factor condition (second 0) are not penalized. We set the robust_var function equal to FALSE - this functionality will be removed from the package in the next release.

models <- fit.model(protdata = protMSqRob,
                    response = "quant_value",
                    fixed = "condition",
                    shrinkage.fixed = c(0, 0),
                    robust_var = FALSE)

7.2 Inference

Often, biologists have problems with the reference coding. In MSqRob we have opted to formulate contrasts using all levels of a factor. Internally, the contrasts are than recasted according to the factor level that is the reference class.

L <- makeContrast("conditionB - conditionA",
                  levels = c("conditionA", "conditionB"))
res <- test.contrast_adjust(models, L)
head(res)
##                            estimate        se       df      Tval
## P99999ups|CYC_HUMAN_UPS    2.152716 0.1663700 7.230582 12.939331
## P63279ups|UBC9_HUMAN_UPS   1.822164 0.1531925 7.230582 11.894609
## P08311ups|CATG_HUMAN_UPS   1.316092 0.1817428 7.230582  7.241510
## P06732ups|KCRM_HUMAN_UPS   1.274376 0.1986088 7.230582  6.416512
## sp|P12688|YPK1_YEAST      -1.655782 0.2398832 6.230582 -6.902449
## O00762ups|UBE2C_HUMAN_UPS  1.604233 0.2339066 6.230582  6.858435
##                                   pval        qval signif
## P99999ups|CYC_HUMAN_UPS   2.914470e-06 0.003419021   TRUE
## P63279ups|UBC9_HUMAN_UPS  5.227860e-06 0.003419021   TRUE
## P08311ups|CATG_HUMAN_UPS  1.463569e-04 0.063811592  FALSE
## P06732ups|KCRM_HUMAN_UPS  3.158129e-04 0.083831332  FALSE
## sp|P12688|YPK1_YEAST      3.879161e-04 0.083831332  FALSE
## O00762ups|UBE2C_HUMAN_UPS 4.020442e-04 0.083831332  FALSE

Below, we put the results of the statistical analysis back into the MSnSet feature data, to keep the data and their analysis together. We need however to rearrange the proteins in the result dataframe (ordered by adjusted p-value) to match the order in the MSnSet.

fData(cptac_prot)$res <- res[featureNames(cptac_prot), ]

There are 2 protein groups identified as differentially expressed at a significant effect at the 5% FDR level.

8 Visualising DE results

8.1 Volcano plot

volc <- ggplot(fData(cptac_prot)$res,
               aes(x = estimate,
                   y = -log10(pval),
                   color = signif)) +
    geom_point() +
    scale_color_manual(values = c("black", "red"))
volc
## Warning: Removed 76 rows containing missing values (geom_point).

It is easy to generate an interactive graph to explore the results using the plotly package and passing the ggplot-object to the ggplotly function:

plotly::ggplotly(volc)

8.2 Heatmap

sign <- which(fData(cptac_prot)$res$sig)
heatmap(exprs(cptac_prot)[sign,])

8.3 Expression data for proteins of interest

Below, we extract the peptide data matching the differentially expressed proteins.

sign_prots <- featureNames(cptac_prot)[sign]
(cptac_sign <- cptac[fData(cptac)$Proteins %in% sign_prots, ])
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A_7 A_8 ... B_9 (6 total)
##   varLabels: condition sample
##   varMetadata: labelDescription
## featureData
##   featureNames: ADLIAYLK ADLIAYLKK GTPWEGGLFK TGQAPGYSYTAANK
##   fvarLabels: Proteins Potential.contaminant ... nNA (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Subset [11466,6][10740,6] Thu Jul 25 08:43:52 2019 
## Subset [10740,6][10678,6] Thu Jul 25 08:43:52 2019 
## Subset [10678,6][10659,6] Thu Jul 25 08:43:52 2019 
## Subset [10659,6][7006,6] Thu Jul 25 08:43:54 2019 
## Log transformed (base 2) Thu Jul 25 08:43:54 2019 
## Normalised (quantiles): Thu Jul 25 08:43:54 2019 
## Subset [7006,6][4,6] Thu Jul 25 08:44:05 2019 
##  MSnbase version: 2.10.1

In the following code chunk, we convert the peptide-level MSnSet containing the significant peptides using the ms2df helper function, convert that wide format dataframe to a long format and visualise the expression distributions in each group.

ms2df(cptac_sign) %>%
    tidyr::gather(key = sample, value = expression, 1:6) %>%
    ggplot(aes(x = sample, y = expression)) +
    geom_boxplot() +
    geom_jitter(aes(colour = Sequence)) +
    facet_grid(Proteins ~ .)

8.4 Global fold-changes

Because we know the ground truth for the cptac study, i.e. we know that only the spike-in proteins (UPS) are differentially expressed, we can evalute the fold changes. Yeast proteins should be not differentially expressed and their log fold changes should be centered around 0. These of UPS proteins are spiked at differt concentrations and their log2 fold changes should be centered around \(log2(concB/concA)\), i.e \(log2(0.74/0.25) = 1.56\).

fData(cptac_prot)$res$spike <- grepl("UPS", fData(cptac_prot)$Proteins)

ggplot(fData(cptac_prot)$res,
       aes(x = spike, y = estimate)) +
    geom_boxplot() +
    ylab("log2 FC") + 
    geom_hline(yintercept = c(0, log(0.74/0.25, base = 2)),
               color = "red")
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).

8.5 Sensitivity FDP plot

Because we know the ground truth for the cptac study, i.e. we know that only the spike-in proteins (UPS) are differentially expressed, we can calculate

  • the sensitivity or true positive rate (TPR), the proportion of actual positives that are correctly identified, in the protein list that we return

\[TPR=\frac{TP}{\text{#actual positives}},\]

here TP are the true positives in the list. The TPR is thus the fraction of ups proteins that we can recall.

  • false discovery proportion (FPD): fraction of false positives in the protein list that we return:

\[FPD=\frac{FP}{FP+TP},\]

with FP the false positives. In our case the yeast proteins that are in our list.

Instead of only calculating that for the protein list that is returned for the chosen FDR level, we can do this for all possible FDR cutoffs so that we get an overview of the quality of the ranking of the proteins in the protein list.

fData(cptac_prot)$res %>%
                    dplyr::arrange(qval) %>%
                    dplyr::mutate(FDP = cumsum(!spike)/(1:length(spike)),
                                  TPR = cumsum(spike)/sum(spike)) %>% 
                    ggplot(aes(x = FDP, y = TPR)) +
                    geom_path() +
                    geom_point() + 
                    geom_vline(xintercept = 0.05, lty = 2)

9 Further reading

Session information

## R version 3.6.0 RC (2019-04-21 r76417)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] forcats_0.4.0       stringr_1.4.0       dplyr_0.8.3        
##  [4] purrr_0.3.2         readr_1.3.1         tidyr_0.8.3        
##  [7] tibble_2.1.3        ggplot2_3.2.0       tidyverse_1.2.1    
## [10] limma_3.40.2        msdata_0.25.2       MSqRob_0.7.6       
## [13] lme4_1.1-21         Matrix_1.2-17       MSnbase_2.10.1     
## [16] ProtGenerics_1.16.0 S4Vectors_0.22.0    mzR_2.18.0         
## [19] Rcpp_1.0.1          Biobase_2.44.0      BiocGenerics_0.30.0
## [22] BiocStyle_2.12.0   
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-140          lubridate_1.7.4       doParallel_1.0.14    
##  [4] httr_1.4.0            tools_3.6.0           backports_1.1.4      
##  [7] R6_2.4.0              affyio_1.54.0         lazyeval_0.2.2       
## [10] colorspace_1.4-1      withr_2.1.2           tidyselect_0.2.5     
## [13] compiler_3.6.0        preprocessCore_1.46.0 cli_1.1.0            
## [16] rvest_0.3.4           xml2_1.2.0            labeling_0.3         
## [19] bookdown_0.12         scales_1.0.0          affy_1.62.0          
## [22] digest_0.6.20         minqa_1.2.4           rmarkdown_1.14       
## [25] pkgconfig_2.0.2       htmltools_0.3.6       highr_0.8            
## [28] rlang_0.4.0           readxl_1.3.1          rstudioapi_0.10      
## [31] impute_1.58.0         generics_0.0.2        jsonlite_1.6         
## [34] mzID_1.22.0           BiocParallel_1.18.0   magrittr_1.5         
## [37] MALDIquant_1.19.3     munsell_0.5.0         vsn_3.52.0           
## [40] stringi_1.4.3         yaml_2.2.0            MASS_7.3-51.4        
## [43] zlibbioc_1.30.0       plyr_1.8.4            grid_3.6.0           
## [46] crayon_1.3.4          lattice_0.20-38       haven_2.1.1          
## [49] splines_3.6.0         hms_0.5.0             zeallot_0.1.0        
## [52] knitr_1.23            pillar_1.4.2          boot_1.3-23          
## [55] reshape2_1.4.3        codetools_0.2-16      XML_3.98-1.20        
## [58] glue_1.3.1            evaluate_0.14         pcaMethods_1.76.0    
## [61] BiocManager_1.30.4    modelr_0.1.4          vctrs_0.2.0          
## [64] nloptr_1.2.1          foreach_1.4.4         cellranger_1.1.0     
## [67] gtable_0.3.0          assertthat_0.2.1      xfun_0.8             
## [70] broom_0.5.2           ncdf4_1.16.1          iterators_1.0.10     
## [73] IRanges_2.18.1