Abstract In this course, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The workflow is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.

This material available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.

If you (re-)use this material, please cite the following reference

Gatto, Laurent. (2019, January). Bioconductor tools for mass spectrometry and proteomics. Zenodo. http://doi.org/10.5281/zenodo.2547971

1 Introduction

Before we start:

If you identify typos, if there are parts that you would like to see expended or clarified, please let me know by telling me directly (during workshops), opening a github issue or by emailing me. Please do also briefly specify your background/familiarity with mass spectrometry and/or proteomics (beginner, intermediate or expert) so that I can update accordingly.

In recent years, there we have seen an increase in the number of packages to analyse mass spectrometry and proteomics data for R and Bioconductor, as well as an increase in total number of downloads. See vignette Proteomics packages in Bioconductor for more details and code underlying these figures.

Number of packages Number of downloads

It is also good to highlight that several of these package have become a group efforts, supported by several developers in the community. This post illustrates the various contributions to MSnbase. mzR has benefited by a similar wide range of successful contributions. Both packages, and in particular mzR, are used by many others, and will be described in some detail in this workflow.

This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:

Exploring available infrastructure
Mass spectrometry data
Getting data from proteomics repositories
Handling raw MS data
Handling identification data
MS/MS database search
Analysing search results
High-level data interface
Quantitative proteomics
Importing third-party quantitation data
Data processing and analysis
Statistical analysis
Machine learning
Annotation
Other relevant packages/pipelines

Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.

Other material

This workflow provides a general introduction to Bioconductor software for mass spectrometry and proteomics. If you are interested in

The application of machine learning to proteomics data, in particular spatial proteomics (i.e. the sub-cellular localisation), follow the tutorial vignette from pRoloc package, accessible with vignette("pRoloc-tutorial", package = "pRoloc") or online.
The analysis of identification data to retain the most reliable PSMs, see the MSnID vignette11 Section Analysing search results below is a summary of that vignette., accessible with vignette("msnid_vignette", package = "MSnID") or online. In addition, the vignettes of the msmsTest package describe how to analyse spectral counting data using packages dedicated for the analysis of high throughput sequencing data.
The analysis of MS^E data independent acquisition (DIA), see the vignettes in the synapter package.
The processing and analysis of MALDI-MS data, read the MALDIquant introduction accessible with vignette("MALDIquant-intro", package = "MALDIquant") and available online.
The processing and analysis of imaging mass spectrometry (IMS), read the Carinal walkthrough vignette accessible with vignette("Cardinal-walkthrough", package = "Cardinal") and online.
…

Setup

The follow packages will be used throughout this documents. R version 3.5 or higher is required to install all the packages using BiocManager::install.

library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
library("ensembldb")

The most convenient way to install most of the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.

if (!require("BiocManager"))
    install.package("BiocManager")
BiocManager::install("RforProteomics", dependencies = TRUE)

Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.

Exploring available infrastructure

On-line

In Bioconductor version 3.6, there are respectively 92 proteomics, 62 mass spectrometry software packages and 17 mass spectrometry experiment packages. These respective packages can be extracted with the proteomicsPackages(), massSpectrometryPackages() and massSpectrometryDataPackages() and explored interactively, or looked at by exploring the respective biocViews on the Bioconductor web page.

library("RforProteomics")
pp <- proteomicsPackages()
DT::datatable(pp)

Exploring proteomics packages

Exercise Explore available proteomics packages using the proteomicsPackages() function above or the Bioconductor software page. What software could you use to analyse mzML files?

Mass spectrometry data

Most community-driven formats described in the table are supported in R. We will see how to read and access these data in the following sections.

Type	Format	Package
raw	mzML, mzXML, netCDF, mzData	MSnbase (read and write in version >= 2.3.13) via mzR
identification	mzIdentML	mzID (read) and MSnbase (read, via mzR)
quantitation	mzQuantML
peak lists	mgf	MSnbase (read)
quant and id	mzTab	MSnbase (read)

2 How does mass spectrometry work?

Mass spectrometry (MS) is a technology that separates charged molecules (ions) based on their mass to charge ratio (M/Z). It is often coupled to chromatography (liquid LC, but can also be gas-based GC). The time an analytes takes to elute from the chromatography column is the retention time.

A chromatogram, illustrating the total amount of analytes over the retention time.

An mass spectrometer is composed of three components:

The source, that ionises the molecules: examples are Matrix-assisted laser desorption/ionisation (MALDI) or electrospray ionisation. (ESI)
The analyser, that separates the ions: Time of flight (TOF) or Orbitrap.
The detector that quantifies the ions.

When using mass spectrometry for proteomics, the proteins are first digested with a protease such as trypsin. In mass shotgun proteomics, the analytes assayed in the mass spectrometer are peptides.

Often, ions are subjected to more than a single MS round. After a first round of separation, the peaks in the spectra, called MS1 spectra, represent peptides. At this stage, the only information we possess about these peptides are their retention time and their mass-to-charge (we can also infer their charge be inspecting their isotopic envelope, i.e the peaks of the individual isotopes, see below), which is not enough to infer their identify (i.e. their sequence).

In MSMS (or MS2), the settings of the mass spectrometer are set automatically to select a certain number of MS1 peaks (for example 20). Once a narrow M/Z range has been selected (corresponding to one high-intensity peak, a peptide, and some background noise), it is fragmented (using for example collision-induced dissociation (CID), higher energy collisional dissociation (HCD) or electron-transfer dissociation (ETD)). The fragment ions are then themselves separated in the analyser to produce a MS2 spectrum. The unique fragment ion pattern can then be used to infer the peptide sequence using de novo sequencing (when the spectrum is of high enough quality) of using a search engine such as, for example Mascot, MSGF+, …, that will match the observed, experimental spectrum to theoratical spectra (see details below).

Schematics of a mass spectrometer and two rounds of MS.

The animation below show how 25 ions different ions (i.e. having different M/Z values) are separated throughout the MS analysis and are eventually detected (i.e. quantified). The final frame shows the hypothetical spectrum.

Separation and detection of ions in a mass spectrometer.

The figures below illustrate the two rounds of MS. The spectrum on the left is an MS1 spectrum acquired after 21 minutes and 3 seconds of elution. 10 peaks, highlited by dotted vertical lines, were selected for MS2 analysis. The peak at M/Z 460.79 (488.8) is highlighted by a red (orange) vertical line on the MS1 spectrum and the fragment spectra are shown on the MS2 spectrum on the top (bottom) right figure.

Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).

The figures below represent the 3 dimensions of MS data: a set of spectra (M/Z and intensity) of retention time, as well as the interleaved nature of MS1 and MS2 (and there could be more levels) data.

MS1 spectra over retention time.

MS2 spectra interleaved between two MS1 spectra.

3 Accessing data

3.1 From the ProteomeXchange database

On-line

MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRoteomics IDEntifications (PRIDE) database at the EBI for mass spectrometry-based experiments (including quantitative data, as opposed as the name suggests), PASSEL at the ISB for Selected Reaction Monitoring (SRM, i.e. targeted) data and the MassIVE resource. These data can be downloaded within R using the rpx package.

library("rpx")
pxannounced()

## 15 new ProteomeXchange annoucements

##     Data.Set    Publication.Data             Message
## 1  PXD011796 2019-02-18 14:53:08                 New
## 2  PXD010078 2019-02-18 14:43:59                 New
## 3  PXD009108 2019-02-18 14:09:37                 New
## 4  PXD008803 2019-02-18 14:00:41                 New
## 5  PXD010314 2019-02-18 13:36:56                 New
## 6  PXD005824 2019-02-18 13:36:17                 New
## 7  PXD009981 2019-02-18 11:54:58 Updated information
## 8  PXD012206 2019-02-18 11:53:16                 New
## 9  PXD011095 2019-02-18 11:49:53                 New
## 10 PXD011411 2019-02-18 11:40:46                 New
## 11 PXD012545 2019-02-18 10:19:50                 New
## 12 PXD010720 2019-02-18 09:17:50 Updated information
## 13 PXD006999 2019-02-18 09:12:41 Updated information
## 14 PXD009675 2019-02-16 13:33:04                 New
## 15 PXD006818 2019-02-16 12:41:32                 New

Using the unique PXD000001 identifier, we can retrieve the relevant metadata that will be stored in a PXDataset object. The names of the files available in this data can be retrieved with the pxfiles accessor function.

px <- PXDataset("PXD000001")
px

## Object of class "PXDataset"
##  Id: PXD000001 with 12 files
##  [1] 'F063721.dat' ... [12] 'generated'
##  Use 'pxfiles(.)' to see all files.

pxfiles(px)

##  [1] "F063721.dat"                                                         
##  [2] "F063721.dat-mztab.txt"                                               
##  [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"                                  
##  [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"                                    
##  [5] "PXD000001_mztab.txt"                                                 
##  [6] "README.txt"                                                          
##  [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML" 
##  [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
##  [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"         
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"           
## [11] "erwinia_carotovora.fasta"                                            
## [12] "generated"

Other metadata for the px data set:

pxtax(px)

## [1] "Erwinia carotovora"

pxurl(px)

## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"

pxref(px)

## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2014 1844(1 pt a):42-51"

Data files can then be downloaded with the pxget function. Below, we retrieve the raw data file. The file is downloaded22 If the file is already available, it is not downloaded a second time. in the working directory and the name of the file is return by the function and stored in the mzf variable for later use 33 This and other files are also availabel in the msdata package, described below.

fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)

## Downloading 1 file

## /home/lgatto/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML already present.

mzf

## [1] "/home/lgatto/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"

3.2 From AnnotationHub

On-line

AnnotationHub is a cloud resource set up and managed by the Bioconductor project that serves various omics datasets. It is possible to contribute and access (albeit currently only a limited number of) proteomics data.

library("AnnotationHub")
ah <- AnnotationHub()

## updating metadata: snapshotDate(): 2018-10-24

query(ah, "proteomics")

## AnnotationHub with 4 records
## # snapshotDate(): 2018-10-24 
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH49006"]]' 
## 
##             title                                                         
##   AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
##   AH49007 | PXD000001: Peptide-level quantitation data                    
##   AH49008 | PXD000001: raw mass spectrometry data                         
##   AH49009 | PXD000001: MS-GF+ identiciation data

ms <- ah[["AH49008"]]
ms

## Mass Spectrometry file handle.
## Filename:  55314 
## Number of scans:  7534

The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The file name, 55314, is not very descriptive because the data originates from the AnnotationHub cloud repository. If the data was read from a local file, is would be named as the mzML (or mzXML) file (see below).

3.3 Data packages

Some data are also distributed through dedicated packages. The msdata, for example, provides some general raw data files relevant for both proteomics and metabolomics.

library("msdata")
## proteomics raw data
proteomics()

## [1] "MRM-standmix-5.mzML.gz"                                                
## [2] "MS3TMT10_01022016_32917-33481.mzML.gz"                                 
## [3] "MS3TMT11.mzML"                                                         
## [4] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
## [5] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"

## proteomics identification data
ident()

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"

More often, such experiment packages distribute processed data; an example of such is the pRolocdata package, that offers quantitative proteomics data.

4 Raw MS data

4.1 High-level data structures

The MSnbase package provides high-level data abstractions for raw MS data through the MSnExp class and containers for quantification data via the MSnSet class (see Quantitative proteomics section). Both store

the actual assay data (spectra or quantitation matrix, see below), accessed with spectra (or the [, [[ operators) or exprs;
sample metadata, accessed as a data.frame with pData;
feature metadata, accessed as a data.frame with fData.

Another useful slot is processingData, accessed with processingData(.), that records all the processing that objects have undergone since their creation (see examples below).

4.2 Raw data: the `MSnExp` class

Read raw data

The readMSData will parse the raw data and construct an MS experiment object of class MSnExp. An important argument to readMSData is the mode, which can be "onDisk" or "inMemory". The former doesn’t load the raw data in memory (which is not advised for MS1 data, or when many files are loaded) and is generally the recommended mode. See the benchmarking vignette44 Open it with vignette("benchmarking", package = "MSnbase") or read it online for details).

library("MSnbase")
## get a small test data
rawFile <- dir(system.file(package = "MSnbase",
                           dir = "extdata"),
               full.name = TRUE,
               pattern = "mzXML$")
basename(rawFile)

## [1] "dummyiTRAQ.mzXML"

msexp <- readMSData(rawFile, msLevel = 2L)
msexp

## MSn experiment data ("MSnExp")
## Object size in memory: 0.18 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 5 
##  MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Mon Feb 18 16:01:12 2019 
##  MSnbase version: 2.9.3 
## - - - Meta data  - - -
## phenoData
##   rowNames: dummyiTRAQ.mzXML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: F1.S1 F1.S2 ... F1.S5 (5 total)
##   fvarLabels: spectrum
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

Spectra can be extracted as a list of Spectrum2 objects with the spectra accessor or as a subset of the original MSnExp data with the [ operator. Individual spectra can be accessed with [[.

length(msexp)

## [1] 5

msexp[1:2]

## MSn experiment data ("MSnExp")
## Object size in memory: 0.07 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 2 
##  MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Mon Feb 18 16:01:12 2019 
## Data [numerically] subsetted 2 spectra: Mon Feb 18 16:01:12 2019 
##  MSnbase version: 2.9.3 
## - - - Meta data  - - -
## phenoData
##   rowNames: dummyiTRAQ.mzXML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: F1.S1 F1.S2
##   fvarLabels: spectrum
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

msexp[[2]]

## Object of class "Spectrum2"
##  Precursor: 546.9586 
##  Retention time: 25:2 
##  Charge: 3 
##  MSn level: 2 
##  Peaks count: 1012 
##  Total ion count: 56758067

Chromatograms

We can also extract the chromatogram for the acquistion(s) in the MSnExp object and visualise it. Here, we use a complete acquisition from the msdata package, and read it with on-disk mode and focus on MS1 data, which is used to generate chromatograms.

f <- msdata::proteomics(pattern = "45stepped_60min_01-20141210", full.names = TRUE)
rw <- readMSData(f, mode = "onDisk", msLevel. = 1L)
chr <- chromatogram(rw)
chr

## Chromatograms with 1 row and 1 column
##      TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz
##                                                              <Chromatogram>
## [1,]                                                           length: 1431
## phenoData with 1 variables
## featureData with 1 variables

plot(chr)

plot of chunk chrom1

Note that here, as we only loaded a single raw data file, we obtain a Chromatograms object with a single chromatogram. When reading multiple raw data files at once (for example with readMSData(c("f1.mzML", "f2.mzML"))), we would get and visualise one chromatogram per file.

Annotating spectra with identification results

The identification results stemming from the same raw data file can then be used to add PSM matches. Here, we use the small msexp test data with 5 MS2 spectra that we read in further up.

## initial feature variable
fData(msexp)

##       spectrum
## F1.S1        1
## F1.S2        2
## F1.S3        3
## F1.S4        4
## F1.S5        5

## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "dummyiTRAQ.mzid")
basename(identFile)

## [1] "dummyiTRAQ.mzid"

msexp <- addIdentificationData(msexp, identFile)
## additional feature variables
fvarLabels(msexp)

##  [1] "spectrum"                 "acquisition.number"      
##  [3] "sequence"                 "chargeState"             
##  [5] "rank"                     "passThreshold"           
##  [7] "experimentalMassToCharge" "calculatedMassToCharge"  
##  [9] "modNum"                   "isDecoy"                 
## [11] "post"                     "pre"                     
## [13] "start"                    "end"                     
## [15] "DatabaseAccess"           "DBseqLength"             
## [17] "DatabaseSeq"              "DatabaseDescription"     
## [19] "scan.number.s."           "idFile"                  
## [21] "MS.GF.RawScore"           "MS.GF.DeNovoScore"       
## [23] "MS.GF.SpecEValue"         "MS.GF.EValue"            
## [25] "modName"                  "modMass"                 
## [27] "modLocation"              "subOriginalResidue"      
## [29] "subReplacementResidue"    "subLocation"             
## [31] "nprot"                    "npep.prot"               
## [33] "npsm.prot"                "npsm.pep"

We see that 3 out of 5 MS2 spectra in the msexp data have been identified; those that haven’t have missing values for the new, id-related feature variables.

fData(msexp)$rank

## [1]  1  1 NA NA  1

fData(msexp)$isDecoy

## [1] FALSE FALSE    NA    NA FALSE

Exercise Load all MS level data from file MS3TMT11.mzML available in the msdata package using readMSData, making sure you set mode = "onDisk", and verify which MS levels (accessible with the msLevel function) are centroided (accessible with the centroided() function). See section Raw data processing for data in profile and centroided (processed) modes.

f <- proteomics(full.names = TRUE, pattern = "MS3TMT11.mzML")
ms <- readMSData(f, mode = "onDisk")
table(centroided(ms), msLevel(ms))

##        
##           1   2   3
##   FALSE  30   0   0
##   TRUE    0 482 482

Plotting MS spectra

Spectra and (parts of) experiments can be extracted and plotted.

msexp[[1]]

## Object of class "Spectrum2"
##  Precursor: 645.3741 
##  Retention time: 25:1 
##  Charge: 3 
##  MSn level: 2 
##  Peaks count: 2921 
##  Total ion count: 668170086

plot(msexp[[1]])

Plotting an object of class Spectrum.

As this data was labeled with iTRAQ4 isobaric tags, we can highlight these four peaks of interest on top of the full spectrum with

plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)

Plotting an object of class Spectrum with reporter ions.

msexp[1:3]

## MSn experiment data ("MSnExp")
## Object size in memory: 0.11 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 3 
##  MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Mon Feb 18 16:01:12 2019 
## Data [numerically] subsetted 3 spectra: Mon Feb 18 16:01:13 2019 
##  MSnbase version: 2.9.3 
## - - - Meta data  - - -
## phenoData
##   rowNames: dummyiTRAQ.mzXML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: F1.S1 F1.S2 F1.S3
##   fvarLabels: spectrum acquisition.number ... npsm.pep (34 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

plot(msexp[1:3], full = TRUE)

Plotting an object of class MSnExp

In the examples above, we only used a single file as input to readMSData, but multiple file can be read into a single MSnExp object. The origin of the spectra can be accessed with the fromFile function:

fromFile(msexp)

## F1.S1 F1.S2 F1.S3 F1.S4 F1.S5 
##     1     1     1     1     1

Exercise Repeat the previous combination of raw and identification data with the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz and TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid files from msdata. Retain only MS 2 level data; this can be done either when reading the data in (see the msLevel argument in ?readMSData) or can be done afterwards by filtering the MS levels with filterMsLevel.

## read raw data
rwf <- proteomics(pattern = "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz",
                  full.names = TRUE)
tmterw <- readMSData(rwf, mode = "onDisk")
## or, only read MS2-leve data
## tmterw <- readMSData(rwf, mode = "onDisk", msLevel = 2L)

## add identification data
idf <- ident(full.names = TRUE)
tmterw <- addIdentificationData(tmterw, idf)

tmterw2 <- filterMsLevel(tmterw, 2L)

## It is also possible to chain operations
library("magrittr")
tmterw2 <- rwf %>%
    readMSData(mode = "onDisk") %>%
    addIdentificationData(idf) %>%
    filterMsLevel(2L)

Exercise Still using the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210 data from the previous exercise, identify the index of the MS2 spectrum with the highest precursor intensity (see the precursorIntensity feature variable) and plot it as illustrated above.

i <- which.max(precursorIntensity(tmterw2))
sp <- tmterw2[[i]]
plot(sp, full = TRUE)

4.3 Relations between scans

As seen in the introduction, scans have a hierarchical structure: MS2 spectra stem form a precursor MS1 scan. This also holds for MS3 spectra, that are the result from an additional analysis round of MS2 spectra. When validating quantitative or identification data by referring back to raw data, it is often useful to be able to navigate this structure.

We will use an experiment with 3 MS levels to do this:

ms3f <- proteomics(pattern = "MS3TMT11", full.names = TRUE)
basename(ms3f)

## [1] "MS3TMT11.mzML"

ms3 <- readMSData(ms3f, mode = "onDisk")

Note that it is important to use on-disk mode here, as we want to retain all MS levels, which isn’t possible with in-memory mode.

Exercise Generate a table showing how many MS1, 2, and 3 level scans are available in this data

table(msLevel(ms3))

## 
##   1   2   3 
##  30 482 482

The filterPrecursorScan function takes on raw data object, it’s acquisition number (get them with acquisitionNum), and returns a new raw data object containing the children of that spectrum.

Exercise Find the acquisition of the first MS1 spectrum and extract all spectra that originate, directly and indirectly, from it.

head(msLevel(ms3))

## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 
##       1       2       2       3       2       2

head(acquisitionNum(ms3))

## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 
##   21945   21946   21947   21948   21949   21950

(from1 <- filterPrecursorScan(ms3, 21945))

## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.05 Mb
## - - - Spectra data - - -
##  MS level(s): 1 2 3 
##  Number of spectra: 35 
##  MSn retention times: 45:27 - 45:30 minutes
## - - - Processing information - - -
## Data loaded [Mon Feb 18 16:01:31 2019] 
## Filter: select parent/children scans for 21945 [Mon Feb 18 16:01:31 2019] 
##  MSnbase version: 2.9.3 
## - - - Meta data  - - -
## phenoData
##   rowNames: MS3TMT11.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   MS3TMT11.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S001 F1.S002 ... F1.S035 (35 total)
##   fvarLabels: fileIdx spIdx ... spectrum (30 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

msLevel(from1)

## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 F1.S007 F1.S008 F1.S009 
##       1       2       2       3       2       2       3       2       2 
## F1.S010 F1.S011 F1.S012 F1.S013 F1.S014 F1.S015 F1.S016 F1.S017 F1.S018 
##       3       3       2       3       2       2       2       3       2 
## F1.S019 F1.S020 F1.S021 F1.S022 F1.S023 F1.S024 F1.S025 F1.S026 F1.S027 
##       2       3       2       2       2       3       2       2       3 
## F1.S028 F1.S029 F1.S030 F1.S031 F1.S032 F1.S033 F1.S034 F1.S035 
##       3       3       3       3       3       3       3       3

4.4 Low-level access to raw data

Devel

This section illustrates the underlying infrastructure from the mzR package, that is used by MSnbase under the hood. It is recommended to use the high level interfaces, as it supports multiple files and does data integrity checks throughout data processing.

The mzR package provides an interface to the proteowizard C/C++ code base to access various raw data files, such as mzML, mzXML, netCDF, and mzData. The data is accessed on-disk, i.e it is not loaded entirely in memory, and only when explicitly requested. The three main functions are openMSfile to create a file handle to a raw data file, header to extract metadata about the spectra contained in the file and peaks to extract one or multiple spectra of interest. Other functions such as instrumentInfo, or runInfo can be used to gather general information about a run.

Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.

library("mzR")
basename(mzf)

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"

ms <- openMSfile(mzf)
ms

## Mass Spectrometry file handle.
## Filename:  TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML 
## Number of scans:  7534

The object loaded from AnnotationHub in the previous section is of the same type, and was also created by the openMSfile function. All operations below can equally be applied to it.

The header function returns the metadata of all available peaks:

hd <- header(ms)
dim(hd)

## [1] 7534   26

names(hd)

##  [1] "seqNum"                   "acquisitionNum"          
##  [3] "msLevel"                  "polarity"                
##  [5] "peaksCount"               "totIonCurrent"           
##  [7] "retentionTime"            "basePeakMZ"              
##  [9] "basePeakIntensity"        "collisionEnergy"         
## [11] "ionisationEnergy"         "lowMZ"                   
## [13] "highMZ"                   "precursorScanNum"        
## [15] "precursorMZ"              "precursorCharge"         
## [17] "precursorIntensity"       "mergedScan"              
## [19] "mergedResultScanNum"      "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum"   "injectionTime"           
## [23] "filterString"             "spectrumId"              
## [25] "centroided"               "ionMobilityDriftTime"

We can extract metadata and scan data for scan 1000 as follows:

hd[1000, ]

##      seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000   1000           1000       2        1        274       1048554
##      retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000      1106.916    136.061            164464              45
##      ionisationEnergy    lowMZ   highMZ precursorScanNum precursorMZ
## 1000                0 104.5467 1370.758              992    683.0817
##      precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000               2           689443.7          0                   0
##      mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000                        0                      0      55.21463
##                                                  filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
##                                         spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000       TRUE
##      ionMobilityDriftTime
## 1000                   NA

head(peaks(ms, 1000))

##          [,1]     [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975

plot(peaks(ms, 1000), type = "h", xlab = "M/Z", ylab = "Intensity")

Manual extraction and plotting of an MS spectrum

4.5 A bit more raw data visualisation

Below, we illustrate some additional visualisation and animations of raw MS data, taken from the RforProteomics visualisation vignette. On the left, we have a heatmap visualisation of a MS map and, in the centre, a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.

msn <- readMSData(mzf, mode = "onDisk")

## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(msLevel(msn) == 1)
rtsel <- rtime(msn)[ms1] / 60 > 30 &
    rtime(msn)[ms1] / 60 < 35

## the MS map
M <- MSmap(msn, scans = ms1[rtsel],
           lowMz = 521, highMz = 523,
           resMz = .005)
## custom colours
ff <- colorRampPalette(c("yellow", "steelblue"))
lattice::trellis.par.set(regions=list(col=ff(100)))
## heatmap
m1 <- plot(M, aspect = 1, allTicks = FALSE)

## set 0s to NA for better visualisation
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)

## MS map with MS1 and MS2 spectra
i <- ms1[which(rtsel)][1] ## 1st MS1
j <- ms1[which(rtsel)][2] ## 2nd MS1
## All MS 1 and 2 spectra between i and j
M2 <- MSmap(msn, i:j, 100, 1000, 1)
m3 <- plot3D(M2)

gridExtra::grid.arrange(m1, m2, m3, ncol = 3)

Plotting MS maps along retention time, MZ range and intensity.

Below, we have animations build from extracting successive slices as above.

MS animation 1 MS animation 2

5 Identification data

5.1 Identification data.frame

Let’s use the identification from from msdata:

idf <- msdata::ident(full.names = TRUE)
basename(idf)

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"

The easiest way to read identification data in mzIdentML (often abbreviated with mzid) into R is to read it with readMzIdData, that will parse it, process it, and return a data.frame:

iddf <- readMzIdData(idf)
head(iddf)

##                 sequence                                    spectrumID
## 1           RQCRTDFLNYLR controllerType=0 controllerNumber=1 scan=2949
## 2 ESVALADQVTCVDWRNRKATKK controllerType=0 controllerNumber=1 scan=6534
## 3           KELLCLAMQIIR controllerType=0 controllerNumber=1 scan=5674
##   chargeState rank passThreshold experimentalMassToCharge
## 1           3    1          TRUE                 548.2856
## 2           2    1          TRUE                1288.1528
## 3           2    1          TRUE                 744.4109
##   calculatedMassToCharge modNum isDecoy post pre start end DatabaseAccess
## 1               547.9474      1   FALSE    V   R   574 585        ECA2006
## 2              1288.1741      1   FALSE    G   R    69  90        ECA1676
## 3               744.4255      1    TRUE    Q   R   131 142    XXX_ECA2855
##   DBseqLength DatabaseSeq                        DatabaseDescription
## 1        1295                         ECA2006 ATP-dependent helicase
## 2         110             ECA1676 putative growth inhibitory protein
## 3         157                                                       
##   scan.number.s. acquisitionNum
## 1           2949           2949
## 2           6534           6534
## 3           5674           5674
##                                                          spectrumFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
##                                                                idFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
##   MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue
## 1             10               101     4.617121e-08    0.1321981
## 2             12               121     7.255875e-08    0.2087481
## 3              8                74     9.341019e-08    0.2674533
##   MS.GF.QValue MS.GF.PepQValue         modName  modMass modLocation
## 1    0.5254237       0.5490196 Carbamidomethyl 57.02146           3
## 2    0.6103896       0.6231884 Carbamidomethyl 57.02146          11
## 3    0.6250000       0.6363636 Carbamidomethyl 57.02146           5
##   subOriginalResidue subReplacementResidue subLocation
## 1               <NA>                  <NA>          NA
## 2               <NA>                  <NA>          NA
## 3               <NA>                  <NA>          NA
##  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

When adding identification data with the addIdentificationData function as shown above, the data is first read with readMzIdData, and is then cleaned up:

only PSMs matching the regular (non-decoy) database are retained;
PSMs or rank greater than 1 are discarded;
only proteotypic peptides are kept, i.e. those that match to a unique peptide.

## at this stage, we still have all the PSMs
table(iddf$isDecoy)

## 
## FALSE  TRUE 
##  2906  2896

table(iddf$rank)

## 
##    1    2    3    4 
## 5487  302   12    1

Exercise This behaviour can be replicates with the filterIdentificationDataFrame function. Try it out for yourself.

iddf2 <- filterIdentificationDataFrame(iddf)
table(iddf2$isDecoy)

## 
## FALSE 
##  2710

table(iddf2$rank)

## 
##    1 
## 2710

Exercise The standard tidyverse tools are fit for data wrangling with identification data. Using the above identification dataframe, calculate the length of each peptide (you can use nchar with the peptide sequence sequence) and the number of peptides for each protein (defined as DatabaseDescription). Plot the length of the proteins against their respective number of peptides. Optionally, stratify the plot by the peptide e-value score (MS.GF.EValue) using for example cut to define bins.

suppressPackageStartupMessages(library("dplyr"))
iddf2 <- as_tibble(iddf2) %>%
    mutate(peplen = nchar(sequence))
npeps <- iddf2 %>%
    group_by(DatabaseDescription) %>%
    tally
iddf2 <- full_join(iddf2, npeps)

## Joining, by = "DatabaseDescription"

library("ggplot2")
ggplot(iddf2, aes(x = n, y = DBseqLength)) + geom_point()

Identifcation data wrangling 1

iddf2$evalBins <- cut(iddf2$MS.GF.EValue, summary(iddf2$MS.GF.EValue))
ggplot(iddf2, aes(x = n, y = DBseqLength, color = peplen)) +
    geom_point() +
    facet_wrap(~ evalBins)

5.2 Low level access to id data

Devel

Along the lines of what is available for raw data, the parsing of this XML-based format comes from mzR. A file handle to mzIdentML files can be created with the openIDfile function. As for raw data, the underlying C/C++ code comes from the proteowizard.

library("mzR")
id1 <- openIDfile(idf)
id1

## Identification file handle.
## Filename:  TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid 
## Number of psms:  5759

Various data can be extracted from the identification object. The peptide spectrum matches (PSMs) and the identification scores can be accessed as a data.frame with psms and score respectively.

softwareInfo(id1)

## [1] "MS-GF+ Beta (v10072) "                      
## [2] "ProteoWizard MzIdentML 3.0.501 ProteoWizard"

enzymes(id1)

##      name nTermGain cTermGain minDistance missedCleavages
## 1 Trypsin                               0            1000

fid1 <- mzR::psms(id1)
head(fid1)

##                                      spectrumID chargeState rank
## 1 controllerType=0 controllerNumber=1 scan=5782           3    1
## 2 controllerType=0 controllerNumber=1 scan=6037           3    1
## 3 controllerType=0 controllerNumber=1 scan=5235           3    1
## 4 controllerType=0 controllerNumber=1 scan=5397           3    1
## 5 controllerType=0 controllerNumber=1 scan=6075           3    1
##   passThreshold experimentalMassToCharge calculatedMassToCharge
## 1          TRUE                1080.2325              1080.2321
## 2          TRUE                1002.2089              1002.2115
## 3          TRUE                1189.2836              1189.2800
## 4          TRUE                 960.5365               960.5365
## 5          TRUE                1264.3409              1264.3419
##                              sequence modNum isDecoy post pre start end
## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR      0   FALSE    S   R    50  84
## 2        TQVLDGLINANDIEVPVALIDGEIDVLR      0   FALSE    R   K   288 315
## 3   TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR      0   FALSE    A   R   192 224
## 4         SQILQQAGTSVLSQANQVPQTVLSLLR      0   FALSE    -   R   264 290
## 5 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR      0   FALSE    F   R   119 153
##   DatabaseAccess DBseqLength DatabaseSeq
## 1        ECA1932         155            
## 2        ECA1147         434            
## 3        ECA0013         295            
## 4        ECA1731         290            
## 5        ECA1443         298            
##                                    DatabaseDescription scan.number.s.
## 1                   ECA1932 outer membrane lipoprotein           5782
## 2                               ECA1147 trigger factor           6037
## 3           ECA0013 ribose-binding periplasmic protein           5235
## 4                                    ECA1731 flagellin           5397
## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase           6075
##   acquisitionNum
## 1           5782
## 2           6037
## 3           5235
## 4           5397
## 5           6075
##  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

sc1 <- mzR::score(id1)
head(sc1)

##                                      spectrumID MS.GF.RawScore
## 1 controllerType=0 controllerNumber=1 scan=5782            147
## 2 controllerType=0 controllerNumber=1 scan=6037            214
## 3 controllerType=0 controllerNumber=1 scan=5235            211
## 4 controllerType=0 controllerNumber=1 scan=5397            154
## 5 controllerType=0 controllerNumber=1 scan=6075            188
## 6 controllerType=0 controllerNumber=1 scan=5761            123
##   MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue MS.GF.QValue
## 1               174     3.764831e-27 1.086033e-20            0
## 2               245     6.902626e-26 1.988774e-19            0
## 3               264     1.778789e-25 5.129649e-19            0
## 4               178     1.792541e-24 5.163566e-18            0
## 5               252     1.510364e-23 4.356914e-17            0
## 6               138     1.618941e-23 4.658952e-17            0
##   MS.GF.PepQValue
## 1               0
## 2               0
## 3               0
## 4               0
## 5               0
## 6               0

The mzID package, has similar functionality to parse identification files, and was the first one to provide such capabilities in R. The main difference with mzR is that is parses the files using the XMLpackage and reads the whole data into memory rather than relying on proteowizard, and is slower.

5.3 MS/MS database search

While searches are generally performed using third-party software independently of R or can be started from R using a system call, the MSGFplus package enables to perform a search using the MSGF+ engine, as illustrated below.

We search the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML file against the fasta file from PXD000001 using MSGFplus.

We first download the fasta files from ProteomeXchange:

On-line

fas <- pxget(px, "erwinia_carotovora.fasta")

## Downloading 1 file

basename(fas)

## [1] "erwinia_carotovora.fasta"

Below, we setup and run the search55 In the runMSGF call, the memory allocated to the java virtual machine is limited to 1GB. In general, there is no need to specify this argument, unless you experience an error regarding the maximum heap size..

library("MSGFplus")
msgfpar <- msgfPar(database = fas,
                   instrument = 'HighRes',
                   tda = TRUE,
                   enzyme = 'Trypsin',
                   protocol = 'iTRAQ')
idres <- runMSGF(msgfpar, mzf, memory=1000)

## '/usr/bin/java' -Xmx1000M -jar '/home/lg390/R/x86_64-pc-linux-gnu-library/3.4/MSGFplus/MSGFPlus/MSGFPlus.jar' -s '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML' -o '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid' -d '/home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta' -tda 1 -inst 1 -e 1 -protocol 2 
## 
## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid... DONE!

idres

## An mzID object
## 
## Software used:   MS-GF+ (version: Beta (v10072))
## 
## Rawfile:         /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 
## Database:        /home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta
## 
## Number of scans: 5343
## Number of PSM's: 5656

## identification file (needed below)
basename(mzID::files(idres)$id)

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"

A graphical interface to perform the search the data and explore the results is also available:

library("MSGFgui")
MSGFgui()

The MSGFgui interface

The rTANDEM package can be used to perform a search with XTandem software.

5.4 Analysing search results

The MSnID package can be used for post-search filtering of MS/MS identifications. One starts with the construction of an MSnID object that is populated with identification results that can be imported from a data.frame or from mzIdenML files. Here, we will use the example identification data provided with the package.

mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
basename(mzids)

## [1] "c_elegans.mzid.gz"

We start by loading the package, initialising the MSnID object, and add the identification result from our mzid file (there could of course be more that one).

library("MSnID")
msnid <- MSnID(".")

## Note, the anticipated/suggested columns in the
## peptide-to-spectrum matching results are:
## -----------------------------------------------
## accession
## calculatedMassToCharge
## chargeState
## experimentalMassToCharge
## isDecoy
## peptide
## spectrumFile
## spectrumID

msnid <- read_mzIDs(msnid, mzids)

## Loaded cached data

show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  1 
## #PSMs: 12263 at 36 % FDR
## #peptides: 9489 at 44 % FDR
## #accessions: 7414 at 76 % FDR

Printing the MSnID object returns some basic information such as

Working directory.
Number of spectrum files used to generate data.
Number of peptide-to-spectrum matches and corresponding FDR.
Number of unique peptide sequences and corresponding FDR.
Number of unique proteins or amino acid sequence accessions and corresponding FDR.

The package then enables to define, optimise and apply filtering based for example on missed cleavages, identification scores, precursor mass errors, etc. and assess PSM, peptide and protein FDR levels. To properly function, it expects to have access to the following data

## [1] "accession"                "calculatedMassToCharge"  
## [3] "chargeState"              "experimentalMassToCharge"
## [5] "isDecoy"                  "peptide"                 
## [7] "spectrumFile"             "spectrumID"

which are indeed present in our data:

names(msnid)

##  [1] "spectrumID"                "scan number(s)"           
##  [3] "acquisitionNum"            "passThreshold"            
##  [5] "rank"                      "calculatedMassToCharge"   
##  [7] "experimentalMassToCharge"  "chargeState"              
##  [9] "MS-GF:DeNovoScore"         "MS-GF:EValue"             
## [11] "MS-GF:PepQValue"           "MS-GF:QValue"             
## [13] "MS-GF:RawScore"            "MS-GF:SpecEValue"         
## [15] "AssumedDissociationMethod" "IsotopeError"             
## [17] "isDecoy"                   "post"                     
## [19] "pre"                       "end"                      
## [21] "start"                     "accession"                
## [23] "length"                    "description"              
## [25] "pepSeq"                    "modified"                 
## [27] "modification"              "idFile"                   
## [29] "spectrumFile"              "databaseFile"             
## [31] "peptide"

Here, we summarise a few steps and redirect the reader to the package’s vignette for more details:

Analysis of peptide sequences

Cleaning irregular cleavages at the termini of the peptides and missing cleavage site within the peptide sequences. The following two function call create the new numMisCleavages and numIrrCleabages columns in the MSnID object

msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")

Trimming the data

Now, we can use the apply_filter function to effectively apply filters. The strings passed to the function represent expressions that will be evaluated, this keeping only PSMs that have 0 irregular cleavages and 2 or less missed cleavages.

msnid <- apply_filter(msnid, "numIrregCleavages == 0")
msnid <- apply_filter(msnid, "numMissCleavages <= 2")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  1 
## #PSMs: 7838 at 17 % FDR
## #peptides: 5598 at 23 % FDR
## #accessions: 3759 at 53 % FDR

Parent ion mass errors

Using "calculatedMassToCharge" and "experimentalMassToCharge", the mass_measurement_error function calculates the parent ion mass measurement error in parts per million.

summary(mass_measurement_error(msnid))

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2184.0640    -0.6992     0.0000    17.6146     0.7512  2012.5178

We then filter any matches that do not fit the +/- 20 ppm tolerance

msnid <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
summary(mass_measurement_error(msnid))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -19.7797  -0.5866   0.0000  -0.2970   0.5713  19.6758

Filtering criteria

Filtering of the identification data will rely on

-log10 transformed MS-GF+ Spectrum E-value, reflecting the goodness of match experimental and theoretical fragmentation patterns

msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)

the absolute mass measurement error (in ppm units) of the parent ion

msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))

MS2 filters are handled by a special MSnIDFilter class objects, where individual filters are set by name (that is present in names(msnid)) and comparison operator (>, <, = , …) defining if we should retain hits with higher or lower given the threshold and finally the threshold value itself.

filtObj <- MSnIDFilter(msnid)
filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
filtObj$msmsScore <- list(comparison=">", threshold=10.0)
show(filtObj)

## MSnIDFilter object
## (absParentMassErrorPPM < 10) & (msmsScore > 10)

We can then evaluate the filter on the identification data object, which return the false discovery rate and number of retained identifications for the filtering criteria at hand.

evaluate_filter(msnid, filtObj)

##           fdr    n
## PSM         0 3807
## peptide     0 2455
## accession   0 1009

Filter optimisation

Rather than setting filtering values by hand, as shown above, these can be set automativally to meet a specific false discovery rate.

filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
                                method="Grid", level="peptide",
                                n.iter=500)
show(filtObj.grid)

## MSnIDFilter object
## (absParentMassErrorPPM < 3) & (msmsScore > 7.4)

evaluate_filter(msnid, filtObj.grid)

##                   fdr    n
## PSM       0.004097561 5146
## peptide   0.006447651 3278
## accession 0.021996616 1208

Filters can eventually be applied (rather than just evaluated) using the apply_filter function.

msnid <- apply_filter(msnid, filtObj.grid)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  1 
## #PSMs: 5146 at 0.41 % FDR
## #peptides: 3278 at 0.64 % FDR
## #accessions: 1208 at 2.2 % FDR

And finally, identifications that matched decoy and contaminant protein sequences are removed

msnid <- apply_filter(msnid, "isDecoy == FALSE")
msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  1 
## #PSMs: 5117 at 0 % FDR
## #peptides: 3251 at 0 % FDR
## #accessions: 1179 at 0 % FDR

The resulting filtered identification data can be exported to a data.frame or to a dedicated MSnSet data structure for quantitative MS data, described below, and further processed and analyses using appropriate statistical tests.

5.5 Visualising identification data

Annotated spectra and comparing spectra.

par(mfrow = c(1, 2))
data(itraqdata)
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE) ## centroiding
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))

Annotating and comparing MS2 spectra.

The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:

calculateFragments("SIGFEGDSIGR")

##           mz ion type pos z        seq
## 1   88.03931  b1    b   1 1          S
## 2  201.12337  b2    b   2 1         SI
## 3  258.14483  b3    b   3 1        SIG
## 4  405.21324  b4    b   4 1       SIGF
## 5  534.25583  b5    b   5 1      SIGFE
## 6  591.27729  b6    b   6 1     SIGFEG
## 7  706.30423  b7    b   7 1    SIGFEGD
## 8  793.33626  b8    b   8 1   SIGFEGDS
## 9  906.42032  b9    b   9 1  SIGFEGDSI
## 10 963.44178 b10    b  10 1 SIGFEGDSIG
## 11 175.11895  y1    y   1 1          R
## 12 232.14041  y2    y   2 1         GR
## 13 345.22447  y3    y   3 1        IGR
## 14 432.25650  y4    y   4 1       SIGR
## 15 547.28344  y5    y   5 1      DSIGR
## 16 604.30490  y6    y   6 1     GDSIGR
##  [ reached 'max' / getOption("max.print") -- omitted 16 rows ]

Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipulate them and perform computations. The two spectra corresponding to the IMIDLDGTENK peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra for details).

Exercise Use the compareSpectra function to compare spectra 25 and 28 plotted above, calculating the metrics mentioned above. Don’t forget to pick peaks from itraqdata first.

data(itraqdata)
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE)
compareSpectra(itraqdata2[[25]], itraqdata2[[28]], fun = "common")

## [1] 22

compareSpectra(itraqdata2[[25]], itraqdata2[[28]], fun = "cor")

## [1] 0.1983378

compareSpectra(itraqdata2[[25]], itraqdata2[[28]], fun = "dotproduct")

## [1] 0.2101533

6 Quantitative proteomics

There are a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending whether the features are labelled prior the MS acquisition and the MS level at which quantitation is inferred, namely MS1 or MS2.

	Label-free	Labelled
MS1	XIC	SILAC, 15N
MS2	Counting	iTRAQ, TMT

In terms of raw data quantitation, most efforts have been devoted to MS2-level quantitation. Label-free XIC quantitation has however been addressed in the frame of metabolomics data processing by the xcms infrastructure.

Below is a list of suggested packages for some common proteomics quantitation technologies:

Isobaric tagging (iTRAQ and TMT): MSnbase and isobar.
Label-free: xcms (metabolomics).
Counting: MSnbase and MSnID for peptide-spectrum matching confidence assessment.
N14N15 for heavy Nitrogen-labelled data.

6.1 The `MSnSet` class for quantitative data

Quantitative data is stored in a dedicated data structure called MSnSet. The figure below gives a schematics of an MSnSet instance and the relation between the assay data and the respective feature and sample metadata, accessible respectively with the exprs, fData and pData functions.

The MSnSet structure

Storing quantitative data in an MSnSet quaranties that the feature (peptides or proteins) and sample annotations are correctly aligned with the quantitative data, i.e.

there is a one-to-one match between the expression data rows and feature meta data rows;
there is a one-to-one match between the expression data columns and sample meta data rows.

This correspondance is also guaranteed during all data processing and manipulation.

6.2 Isobaric tagging

An MSnExp is converted to an MSnSet by the quantitation method. Below, we use the iTRAQ 4-plex isobaric tagging strategy (defined by the iTRAQ4 parameter; other tags are available: see ?ReporterIons) and the max method to calculate the use the maximum of the reporter peak for quantitation.

plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)

MS2 spectrum and it’s iTRAQ4 reporter ions.

msset <- quantify(msexp, method = "max", reporters = iTRAQ4)

Below, we access the quantitative and metadata slots of the newly created MSnSet object.

exprs(msset)

##       iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## F1.S1   706555.7   685055.1   929016.1   668245.2
## F1.S2   260663.7   212745.0   163782.8   239142.7
## F1.S3  2213566.0  2069209.6  2204032.2  2331846.8
## F1.S4   616043.4   705976.6   671828.8   666845.6
## F1.S5  1736128.2  1787622.5  1795311.8  1825523.0

head(fData(msset))

##       spectrum acquisition.number          sequence chargeState rank
## F1.S1        1                  1 VESITARHGEVLQLRPK           3    1
## F1.S2        2                  2     IDGQWVTHQWLKK           3    1
##       passThreshold experimentalMassToCharge calculatedMassToCharge modNum
## F1.S1          TRUE                 645.3741               645.0375      0
## F1.S2          TRUE                 546.9586               546.9633      0
##       isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
## F1.S1   FALSE    A   R   170 186        ECA0984         231            
## F1.S2   FALSE    A   K    50  62        ECA1028         275            
##                                                              DatabaseDescription
## F1.S1                                        ECA0984 DNA mismatch repair protein
## F1.S2 ECA1028 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
##       scan.number.s.          idFile MS.GF.RawScore MS.GF.DeNovoScore
## F1.S1              1 dummyiTRAQ.mzid            -39                77
## F1.S2              2 dummyiTRAQ.mzid            -30                39
##       MS.GF.SpecEValue MS.GF.EValue modName modMass modLocation
## F1.S1     5.527468e-05     79.36958    <NA>      NA          NA
## F1.S2     9.399048e-06     13.46615    <NA>      NA          NA
##       subOriginalResidue subReplacementResidue subLocation nprot npep.prot
## F1.S1               <NA>                  <NA>          NA     1         1
## F1.S2               <NA>                  <NA>          NA     1         1
##       npsm.prot npsm.pep fileIdx retention.time precursor.mz
## F1.S1         1        1       1        1501.35     645.3741
## F1.S2         1        1       1        1501.59     546.9586
##       precursor.intensity charge peaks.count       tic  ionCount ms.level
## F1.S1            47659400      3        2921 182542000 668170086        2
## F1.S2            26356100      3        1012  16488100  56758067        2
##       collision.energy
## F1.S1               40
## F1.S2               40
##  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

pData(msset)

##                  mz reporters
## iTRAQ4.114 114.1112    iTRAQ4
## iTRAQ4.115 115.1083    iTRAQ4
## iTRAQ4.116 116.1116    iTRAQ4
## iTRAQ4.117 117.1150    iTRAQ4

New columns can be added to the metadata slots.

pData(msset)$groups <- rep(c("Treat", "Cond"), each = 2)
pData(msset)

##                  mz reporters groups
## iTRAQ4.114 114.1112    iTRAQ4  Treat
## iTRAQ4.115 115.1083    iTRAQ4  Treat
## iTRAQ4.116 116.1116    iTRAQ4   Cond
## iTRAQ4.117 117.1150    iTRAQ4   Cond

Another useful slot is processingData, accessed with processingData(.), that records all the processing that objects have undergone since their creation.

processingData(msset)

## - - - Processing information - - -
## Data loaded: Mon Feb 18 16:01:12 2019 
## iTRAQ4 quantification by max: Mon Feb 18 16:01:37 2019 
##  MSnbase version: 2.9.3

6.3 Spectral counting

Other MS2 quantitation methods available in quantify include the (normalised) spectral index SI and (normalised) spectral abundance factor SAF or simply a simple count method66 The code below is for illustration only - it doesn’t make much sense to perform any of these quantitations on such a multiplexed data.

exprs(si <- quantify(msexp, method = "SIn"))

##         dummyiTRAQ.mzXML
## ECA0510     0.0006553518
## ECA0984     0.0035384487
## ECA1028     0.0002684726

exprs(saf <- quantify(msexp, method = "NSAF"))

##         dummyiTRAQ.mzXML
## ECA0510        0.4306167
## ECA0984        0.3094475
## ECA1028        0.2599359

Note that spectra that have not been assigned any peptide (NA) or that match non-unique peptides (npsm > 1) are discarded in the counting process.

As shown above, the MSnID package enables to explore and assess the confidence of identification data using mzid files. A subset of all peptide-spectrum matches, that pass a specific false discovery rate threshold can them be converted to an MSnSet, where the number of peptide occurrences are used to populate the assay data.

6.4 Importing third-party quantitation data

From `MzTab` files

On-line

The Proteomics Standard Initiative (PSI) mzTab file format is aimed at providing a simpler (than XML formats) and more accessible file format to the wider community. It is composed of a key-value metadata section and peptide/protein/small molecule tabular sections. These data can be imported with the readMzTabData function77 We specify version 0.9 (which generates the warning) to fit with the version of that file. For recent files, the version argument should be ignored to use the importer for the current file version 1.0..

mztf <- pxget(px, "F063721.dat-mztab.txt")

## Downloading 1 file

(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))

## Warning: Version 0.9 is deprecated. Please see '?readMzTabData' and '?
## MzTab' for details.

## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: sub[1] sub[2] ... sub[6] (6 total)
##   varLabels: abundance
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 1528 (1528 total)
##   fvarLabels: sequence accession ... uri (14 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## mzTab read: Mon Jun 19 23:10:32 2017 
##  MSnbase version: 2.3.6

From spreadsheets

It is also possible to import arbitrary spreadsheets (such as those exported by MaxQuant, ProteomeDiscoverer, …) as MSnSet objects into R with the readMSnSet2 function. The main 2 arguments of the function are (1) a text-based spreadsheet and (2) column names of indices that identify the quantitation data. The latter can be queried with the getEcols function.

csv <- dir(system.file ("extdata" , package = "pRolocdata"),
           full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")

##  [1] "\"Protein ID\""              "\"FBgn\""                   
##  [3] "\"Flybase Symbol\""          "\"No. peptide IDs\""        
##  [5] "\"Mascot score\""            "\"No. peptides quantified\""
##  [7] "\"area 114\""                "\"area 115\""               
##  [9] "\"area 116\""                "\"area 117\""               
## [11] "\"PLS-DA classification\""   "\"Peptide sequence\""       
## [13] "\"Precursor ion mass\""      "\"Precursor ion charge\""   
## [15] "\"pd.2013\""                 "\"pd.markers\""

ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))

##   area.114 area.115 area.116 area.117
## 1 0.379000 0.281000 0.225000 0.114000
## 2 0.420000 0.209667 0.206111 0.163889
## 3 0.187333 0.167333 0.169667 0.476000
## 4 0.247500 0.253000 0.320000 0.179000
## 5 0.216000 0.183000 0.342000 0.259000
## 6 0.072000 0.212333 0.573000 0.142667

head(fData(res))

##   Protein.ID        FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1    CG10060 FBgn0001104    G-ialpha65A               3       179.86
## 2    CG10067 FBgn0000044         Act57B               5       222.40
## 3    CG10077 FBgn0035720        CG10077               5       219.65
## 4    CG10079 FBgn0003731           Egfr               2        86.39
## 5    CG10106 FBgn0029506        Tsp42Ee               1        52.10
## 6    CG10130 FBgn0010638      Sec61beta               2        79.90
##   No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1                       1                    PM                 
## 2                       9                    PM                 
## 3                       3                                       
## 4                       2                    PM                 
## 5                       1                              GGVFDTIQK
## 6                       3              ER/Golgi                 
##   Precursor.ion.mass Precursor.ion.charge     pd.2013 pd.markers
## 1                                                  PM    unknown
## 2                                                  PM    unknown
## 3                                             unknown    unknown
## 4                                                  PM    unknown
## 5            626.887                    2 Phenotype 1    unknown
## 6                                            ER/Golgi         ER

However, as we see below, we do not have any metadata about samples, i.e. about the design of the experiment.

pData(res)

## data frame with 0 columns and 4 rows

This can be done manually, or by importing a csv file containing that design. Below, we define two groups and two operators for the 4 samples of the res object created above:

pData(res)$group <- rep(c("A", "B"), each = 2)
pData(res)$operator <- rep(1:2, 2)
pData(res)

##          group operator
## area.114     A        1
## area.115     A        2
## area.116     B        1
## area.117     B        2

Note that pData(res)$ can be shortened with res$. This is also valid when setting new metadata, as shown above.

pData(res)$group

## [1] "A" "A" "B" "B"

res$group

## [1] "A" "A" "B" "B"

Exercise Using readMSnSet2, load the following file that was part of the supplementary information of a manuscript.

csvfile <- dir(system.file("extdata", package = "pRolocdata"),
               pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv",
               full.names = TRUE)
basename(csvfile)

## [1] "hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"

You’ll first need to identify which columns to use as expression data. In this case however, two rows are used as header, and you’ll need to set n in getEcols to retrieve the appropriate one. There are 20 expresion columns annotated as TMT 10 plex reporter ion M/Z values (if you don’t know these, you can find them out by looking at the TMT10 reporter ion object). You can now use readMSnSet2, remembering to skip 1 line and, optionally, use the first column as feature names (see the fnames argument). What are the number of features and samples in the data?

getEcols(csvfile, split = ",", n = 2)

##  [1] ""                                 
##  [2] ""                                 
##  [3] ""                                 
##  [4] "Experiment 1"                     
##  [5] "Experiment 2"                     
##  [6] "Experiment 1"                     
##  [7] "Experiment 2"                     
##  [8] "126"                              
##  [9] "127N"                             
## [10] "127C"                             
## [11] "128N"                             
## [12] "128C"                             
## [13] "129N"                             
## [14] "129C"                             
## [15] "130N"                             
## [16] "130C"                             
## [17] "131"                              
## [18] "126"                              
## [19] "127N"                             
## [20] "127C"                             
## [21] "128N"                             
## [22] "128C"                             
## [23] "129N"                             
## [24] "129C"                             
## [25] "130N"                             
## [26] "130C"                             
## [27] "131"                              
## [28] "phenoDisco Input"                 
## [29] "phenoDisco Output"                
## [30] "Curated phenoDisco Output"        
## [31] "SVM marker set"                   
## [32] "SVM classification"               
## [33] "SVM score"                        
## [34] "SVM classification (top quartile)"
## [35] "Final Localization Assignment"    
## [36] "First localization evidence?"     
## [37] "Curated Organelles"               
## [38] "Cytoskeletal Components"          
## [39] "Trafficking Proteins"             
## [40] "Protein Complexes"                
## [41] "Signaling Cascades"               
## [42] "Oct4 Interactome"                 
## [43] "Nanog Interactome"                
## [44] "Sox2 Interactome"                 
## [45] "Cell Surface Proteins"

msn <- readMSnSet2(csvfile, ecol = 8:27, fnames = 1, skip = 1)
dim(msn)

## [1] 5032   20

Exercise Add the following experimental design to the MSnSet created above. The 10 first samples originate from batch A, and the 10 following from batch B. Sameple 1 to 5 and 11 to 15 belong to the control group, and the others to the condition group. Even samples are female and odd samples are male.

msn$batch <- rep(c("A", "B"), each = 10)
msn$group <- rep(rep(c("CTRL", "COND"), each = 5), 2)
msn$gender <- rep(c("M", "F"), 10)
pData(msn)

##         batch group gender
## X126        A  CTRL      M
## X127N       A  CTRL      F
## X127C       A  CTRL      M
## X128N       A  CTRL      F
## X128C       A  CTRL      M
## X129N       A  COND      F
## X129C       A  COND      M
## X130N       A  COND      F
## X130C       A  COND      M
## X131        A  COND      F
## X126.1      B  CTRL      M
## X127N.1     B  CTRL      F
## X127C.1     B  CTRL      M
## X128N.1     B  CTRL      F
## X128C.1     B  CTRL      M
## X129N.1     B  COND      F
## X129C.1     B  COND      M
## X130N.1     B  COND      F
## X130C.1     B  COND      M
## X131.1      B  COND      F

7 Data processing and analysis

7.1 Raw data processing

For raw data processing look at MSnbase’s clean, smooth, pickPeaks, removePeaks and trimMz for MSnExp and spectra processing methods.

As an illustration, we show the pickPeaks function on the itraqdata data. Centoiding transforms the distribution of M/Z values measured for an ion (i.e. a set of M/Z and intensities, first figure below) into a single M/Z and intensity pair of values (second figure below).

library("ggplot2") ## for coord_cartesian
data(itraqdata)
plot(itraqdata[[10]], full = TRUE) +
    coord_cartesian(xlim = c(915, 925))

Peak picking: profile mode.

itraqdata2 <- pickPeaks(itraqdata)
plot(itraqdata2[[10]], full = TRUE) +
    coord_cartesian(xlim = c(915, 925))

Peak picking: centroided.

The MALDIquant and xcms packages also features a wide range of raw data processing methods on their own ad hoc data instance types.

7.2 Quantitative data processing and normalisation

Isobaric purity correction

Each different types of quantitative data will require their own pre-processing and normalisation steps. Both isobar and MSnbase allow to correct for isobaric tag impurities normalise the quantitative data.

data(itraqdata)
qnt <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
impurities <- matrix(c(0.929, 0.059, 0.002, 0.000,
                       0.020, 0.923, 0.056, 0.001,
                       0.000, 0.030, 0.924, 0.045,
                       0.000, 0.001, 0.040, 0.923),
                     nrow = 4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt <- purityCorrect(qnt, impurities)
processingData(qnt)

## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Mon Feb 18 16:01:39 2019 
## Purity corrected: Mon Feb 18 16:01:39 2019 
##  MSnbase version: 1.1.22

Normalisation

Various normalisation methods can be applied the MSnSet instances using the normalise method: variance stabilisation (vsn), quantile (quantiles), median or mean centring (center.median or center.mean), …

qnt <- normalise(qnt, "quantiles")
processingData(qnt)

## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Mon Feb 18 16:01:39 2019 
## Purity corrected: Mon Feb 18 16:01:39 2019 
## Normalised (quantiles): Mon Feb 18 16:01:39 2019 
##  MSnbase version: 1.1.22

Combining features

The combineFeatures method combines spectra/peptides quantitation values into protein data. The grouping is defined by the groupBy parameter, which is generally taken from the feature metadata (protein accessions, for example).

prt <- combineFeatures(qnt, fcol = "ProteinDescription", fun = "median")

## Your data contains missing values. Please read the relevant
## section in the combineFeatures manual page for details the effects
## of missing values on data aggregation.

processingData(prt)

## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Mon Feb 18 16:01:39 2019 
## Purity corrected: Mon Feb 18 16:01:39 2019 
## Normalised (quantiles): Mon Feb 18 16:01:39 2019 
## Combined 55 features into 39 using median: Mon Feb 18 16:01:39 2019 
##  MSnbase version: 2.9.3

Missing values

Finally, proteomics data analysis is generally hampered by missing values. Missing data imputation is a sensitive operation whose success will be guided by many factors, such as degree and (non-)random nature of the missingness.

Below, we load an MSnSet with missing values, count the number missing and non-missing values.

data(naset)
table(is.na(naset))

## 
## FALSE  TRUE 
## 10254   770

The naplot figure will reorder cells within the data matrix so that the experiments and features with many missing values will be grouped towards the top and right of the heatmap, and barplots at the top and right summarise the number of missing values in the respective samples (column) and rows (rows).

naplot(naset)

Overview of missing values.

The importance of missing values in a dataset will depend on the quantitation technology employed. Label-free quantitation in particular can suffer from a very high number of missing values.

Missing value in MSnSet instances can be filtered out with the filterNA functions. By default, it removes features that contain at least NA value.

## remove features with missing values
tmp <- filterNA(naset)
processingData(tmp)

## - - - Processing information - - -
## Subset [689,16][301,16] Mon Feb 18 16:01:39 2019 
## Removed features with more than 0 NAs: Mon Feb 18 16:01:39 2019 
## Dropped featureData's levels Mon Feb 18 16:01:39 2019 
##  MSnbase version: 1.15.6

It is of course possible to impute missing values (?impute). This is however not a straightforward thing, as is likely to dramatically fail when a high proportion of data is missing (10s of %)88 Note that when using limma for instance, downstream analyses can handle missing values. Still, it is recommended to explore missingness as part of the exploratory data analysis.. But also, there are two types of mechanisms resulting in missing values in LC/MSMS experiments.

Missing values resulting from absence of detection of a feature, despite ions being present at detectable concentrations. For example in the case of ion suppression or as a result from the stochastic, data-dependent nature of the MS acquisition method. These missing value are expected to be randomly distributed in the data and are defined as missing at random (MAR) or missing completely at random (MCAR).
Biologically relevant missing values, resulting from the absence or the low abundance of ions (below the limit of detection of the instrument). These missing values are not expected to be randomly distributed in the data and are defined as missing not at random (MNAR).

Random and non-random missing values.

Different imputation methods are more appropriate to different classes of missing values (as documented in this paper). Values missing at random, and those missing not at random should be imputed with different methods.

Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better. (See here for details)

Generally, it is recommended to use hot deck methods (nearest neighbour (left), maximum likelihood, …) when data are missing at random.Conversely, MNAR features should ideally be imputed with a left-censor (minimum value (right), but not zero, …) method.

## impute missing values using knn imputation
tmp <- impute(naset, method = "knn")

## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
##  mean imputation used for these rows

processingData(tmp)

## - - - Processing information - - -
## Data imputation using knn Mon Feb 18 16:01:39 2019 
##   Using default parameters 
##  MSnbase version: 1.15.6

There are various methods to perform data imputation, as described in ?impute. The imp4p package contains additional functionality, including some to estimate the randomness of missing data.

Exercise Following the example above, apply a mixed imputation, using knn for data missing at random and the deterministic minumum left-cencored imputation for data missing no at random.

impute(naset, "mixed",
       randna = fData(naset)$randna,
       mar = "knn", mnar = "MinDet")

## MSnSet (storageMode: lockedEnvironment)
## assayData: 689 features, 16 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: M1F1A M1F4A ... M2F11B (16 total)
##   varLabels: nNA
##   varMetadata: labelDescription
## featureData
##   featureNames: AT1G09210 AT1G21750 ... AT4G39080 (689 total)
##   fvarLabels: nNA randna
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Data imputation using mixed Mon Feb 18 16:01:39 2019 
##   Using default parameters 
##  MSnbase version: 1.15.6

Exercise When assessing missing data imputation methods, such as in Lazar et al. (2016), one often replaces values with missing data, imputes these with a method of choice, then quantifies the difference between original (expected) and observed (imputed) values. Here, using the naset data, use this strategy to assess the difference between knn and Bayesian PCA imputation.

imp1 <- impute(naset, method = "knn")

## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
##  mean imputation used for these rows

imp2 <- impute(naset, method = "bpca")
summary(abs(exprs(imp1)[is.na(naset)] - exprs(imp2)[is.na(naset)]))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 5.332e-05 6.594e-03 1.535e-02 2.315e-02 2.855e-02 2.579e-01

summary(as.numeric(na.omit(exprs(naset))))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0170  0.1865  0.2440  0.2500  0.3080  0.6587

Exercise When assessing the impact of missing value imputation on real data, one can’t use the strategy above. Another useful approach is to assess the impact of the imputation method on the distribution of the quantitative data. For instance, here is the intensity distribution of the naset data. Verify the effect of applying knn, zero, MinDet and bpca on this distribution.

plot(density(na.omit(exprs(naset))))

Intensity disctribution of the naset data.

cls <- c("black", "red", "blue", "steelblue", "orange")
plot(density(na.omit(exprs(naset))), col = cls[1])
lines(density(exprs(impute(naset, method = "knn"))), col = cls[2])

## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
##  mean imputation used for these rows

lines(density(exprs(impute(naset, method = "zero"))), col = cls[3])
lines(density(exprs(impute(naset, method = "MinDet"))), col = cls[4])
lines(density(exprs(impute(naset, method = "bpca"))), col = cls[5])
legend("topright", legend = c("orig", "knn", "zero", "MinDet", "bpca"),
       col = cls, lwd = 2, bty = "n")

7.3 Some notes on data analysis and manipulation

The tidyverse syntax has proved to be versatile and very useful for generic data analysis, i.e. analysis of data stored in dataframes. While it is possible to convert dedicated data containers into dataframes, this leads to loosing data integrity checks and access to dedicated omics data processing functions.

The tidies enables to use typical dplyr functions directly on MSnSet data structures. See the vignette at http://lgatto.github.io/tidies/ for details and examples.

library("tidies")
data(msnset)
## Some test sample groups
msnset$group <- c("A", "A", "B", "B")

msnset %>%
    dplyr::select(starts_with("Protein")) %>%
    fvarLabels

## [1] "ProteinAccession"   "ProteinDescription"

msnset %>%
    filter(ProteinAccession == "ENO") %>%
    exprs

##       iTRAQ4.114  iTRAQ4.115  iTRAQ4.116  iTRAQ4.117
## X27 147093.25030 94770.28613 42616.07457 21259.42497
## X46   5369.73246  1148.32171          NA  1313.44599
## X47   7384.83022  3935.30012  2370.02527  1115.75006
## X55     15.11764    15.68074    14.23333    14.03018

msnset %>% group_by(ProteinAccession) %>%
    summarise(median(exprs, na.rm = TRUE)) %>%
    exprs %>%
    head

##         iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## BSA       1347.616   2247.310   3927.693  7661.1463
## ECA0172  17593.548  18545.620  19361.837 18328.2365
## ECA0435   4923.628   5557.818   5775.203  5079.2952
## ECA0452   1524.148   1399.897   1547.218  1563.2299
## ECA0469   1069.945   1035.689   1029.420   999.6957
## ECA0621   1101.062   1124.167   1140.093  1191.8055

msnset %>% group_by(group) %>%
    summarise(mean(exprs, na.rm = TRUE)) %>%
    exprs %>%
    head

##              A          B
## X1   1797.4628  5794.4197
## X10   769.6681   826.6388
## X11 30516.1917 29366.5078
## X12 32763.7954 37451.2196
## X13 27910.6161 28495.8100
## X14  6341.1393  6670.0603

msnset %>%
    group_by(charge) %>%
    summarise(mean(exprs)) %>%
    group_by(group) %>%
    summarise(max(exprs, na.rm = TRUE)) %>%
    exprs

##          A         B
## 2 13880.38 12660.236
## 3  1477.78  1346.071

msnset %>%
    filterNA() %>%
    combineFeatures(method = "median", fcol = "ProteinAccession") %>%
    group_by(group) %>%
    summarise(mean(exprs)) %>%
    normalise(method = "quantiles") %>%
    filter(ProteinAccession %in% c('ENO', 'BSA')) %>%
    exprs

##             A         B
## BSA  1462.295  4355.662
## ENO 41383.731 11272.133

8 Statistical analysis

R in general and Bioconductor in particular are well suited for the statistical analysis of data of quantitative proteomics data. Several packages provide dedicated resources for proteomics data:

MSstats and MSstatsTMT: A set of tools for statistical relative protein significanceanalysis in Data dependent (DDA), SRM, Data independent acquisition (DIA) and TMT experiments.
msmsTests: Statistical tests for label-free LC-MS/MS data by spectral counts, to discover differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. All can be readily applied on MSnSet instances produced, for example by MSnID.
isobar also provides dedicated infrastructure for the statistical analysis of isobaric data.
DEP provides an integrated analysis workflow for the analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment.

Others, while not specfic to proteomics, are also recommended, such as the limma package. When analysing spectral counting data, methods for high throughput sequencing data are applicable. Below, we illustrate how to apply a typical edgeR test to count data using the msms.edgeR function from the msmsTests package.

The data is illustrated below (we will see later how to generate such plots), showing two experimental conditions (red and blue points) processed as two batches (solid and empty points).

plot of chunk msmspca

nWe first pre-process to remove features containing only 0s and entries from the reverse database (ending with ‘-R’) (see also the pp.msms.data function).

library(msmsTests)
data(msms.dataset)
e <- msms.dataset[rowSums(exprs(msms.dataset)) > 0, ]
e <- e[!grepl("-R$", featureNames(e)), ]
pData(e)

##           treat batch
## U2.2502.1  U200  2502
## U2.2502.2  U200  2502
## U2.2502.3  U200  2502
## U2.2502.4  U200  2502
## U6.2502.1  U600  2502
## U6.2502.2  U600  2502
## U6.2502.3  U600  2502
## U6.2502.4  U600  2502
## U2.0302.1  U200  0302
## U2.0302.2  U200  0302
## U2.0302.3  U200  0302
## U6.0302.1  U600  0302
## U6.0302.2  U600  0302
## U6.0302.3  U600  0302

null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e), 2, sum)
res <- msms.edgeR(e, alt.f, null.f, div = div)
head(res)

##               LogFC         LR     p.value
## YJR104C  0.02689713  0.2691959 0.603871680
## YKL060C -0.12645862  5.5833478 0.018132029
## YDR155C -0.18781161 10.2706877 0.001351604
## YGR192C -0.08495735  2.5941276 0.107260484
## YOL086C -0.11853525  5.7568480 0.016424510
## YLR150W -0.09299164  1.3766329 0.240675519

It is best to store the results directly with the quantitative data. Below, we first check that the results rownames match the feature names and then add it to the feature metadata.

identical(rownames(res), featureNames(e))

## [1] TRUE

fData(e) <- cbind(fData(e), res)

And we conclude with a volcano plot of the results of the test.

plot(fData(e)$LogFC, -log10(fData(e)$p.value))

Volcano plot.

9 Machine learning

There are numerous packages for machine learing in R, many with specific omics applications and use cases in mind. An excellent general package is mlr that provides a unified interface to many methods. For a general hands-on introduction, An introduction to machine learning with R, as well as many other freely available documents are available.

The MLInterfaces package provides a unified interface to a wide range of machine learning algorithms. Initially developed for microarray and ExpressionSet instances, the r Biocpkg("pRoloc") package enables application of these algorithms to MSnSet data. We will also demonstrate some specific functions of the pRoloc package.

9.1 Dimensionality reduction

Dimensionality reduction is very frequently used to summarise high-dimensional data. Below we will use principal component analysis (PCA), but other methods can be applied. Below, we will use the plot2D function from the pRoloc package99 While originally developed for the analysis of spatial/organelle proteomics data in mind, it is applicable many use cases., that will extract the expression values in the assay data, perform dimensionality reduction, an produce the scatter plot.

Let’s first use plot2D to visualise the pattern in 20 protein quantitation values (initial 20 dimensional data). Here, we use an example from spatial proteomics, where the quantitative protein profiles reflect the proteins sub-cellular localisation (from Christoforou et al, 2016, see also Breckels et al, 2016 for more data analysis background). We will use the known localisation of some proteins (marker proteins) to annotate the plot using the fcol argument, that indicates which feature variable (i.e. column un the feature meta-data) to use.

library("pRoloc")
library("pRolocdata")
data(hyperLOPIT2015)
plot2D(hyperLOPIT2015, fcol = "markers")
addLegend(hyperLOPIT2015, fcol = "markers", cex = .7)

PCA plot for protein sub-cellular localisation.

Exercise The results of a clasification analysis (see below) are available in svm.classification feature variable. Repeat the PCA plot above, colouring the proints using this variable.

plot2D(hyperLOPIT2015, fcol = "svm.classification")

Plotting all the classification results.

In other cases, we want to visualise the relation of samples. plot2D uses the rows of the data to perform dimensionality reduction. To use the columns, we just need to transpose the MSnSet. By doing so, the pData becomes the fData and vice versa.

Let’s use a time-course experiment on stem cells (Mulvey et al. 2015). Below, we use the times (time points) variable to set colours.

data(mulvey2015)
head(pData(mulvey2015))

##           rep times cond
## rep1_0hr    1     1    1
## rep1_16hr   1     2    1
## rep1_24hr   1     3    1
## rep1_48hr   1     4    1
## rep1_72hr   1     5    1
## rep1_XEN    1     6    1

plot2D(t(mulvey2015),  fcol = "times", cex = 2)
addLegend(t(mulvey2015),  fcol = "times")

PCA plots for sample in a time-course experiment.

Exercise The plot2D function can use two feature variables to set colours with the fcol argument (as above) and point characters with the fpch argument. Use the latter to also highlight the replicate numbers rep.

plot2D(t(mulvey2015),  fcol = "times", fpch = "rep", cex = 2)

Using colours and point characters to annotated the plot with plot2D.

9.2 Classification

Classification is performed in two steps. First, an adequate model and its parameters are learned from labelled training data, then that model is applied on new, unlabelled data.

We are going to apply this strategy to repeat the protein sub-cellular classification analysis of the hyperLOPIT2015 data above using a k nearest neighbour classifier, which is generally used as a baseline method. In the interest of time, we will only repeat the optimisation step 10 times, even though 100 would be recommended. For details on this procedure, please see the [main pRoloc vignette](https://lgatto.github.io/pRoloc/articles/v01-pRoloc-tutorial.html].

p <- knnOptimisation(hyperLOPIT2015, time = 10, verbose = FALSE)
p

## Object of class "GenRegRes"
## Algorithm: knn 
## Hyper-parameters:
##  k: 3 5 7 9 11 13 15
## Design:
##  Replication: 10 x 5-fold X-validation
##  Partitioning: 0.2/0.8 (test/train)
## Results
##  macro F1:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8761  0.8948  0.9232  0.9223  0.9454  0.9841 
##  best k: 5 3   
## Use getWarnings() to see warnings.

We can now apply out best model (here k = 3) to out dataset, which will add feature variables with the classification results (knn) and scores (knn.scores).

hyperLOPIT2015 <- knnClassification(hyperLOPIT2015, p)
fvarLabels(hyperLOPIT2015)

##  [1] "entry.name"                "protein.description"      
##  [3] "peptides.rep1"             "peptides.rep2"            
##  [5] "psms.rep1"                 "psms.rep2"                
##  [7] "phenodisco.input"          "phenodisco.output"        
##  [9] "curated.phenodisco.output" "markers"                  
## [11] "svm.classification"        "svm.score"                
## [13] "svm.top.quartile"          "final.assignment"         
## [15] "first.evidence"            "curated.organelles"       
## [17] "cytoskeletal.components"   "trafficking.proteins"     
## [19] "protein.complexes"         "signalling.cascades"      
## [21] "oct4.interactome"          "nanog.interactome"        
## [23] "sox2.interactome"          "cell.surface.proteins"    
## [25] "markers2015"               "TAGM"                     
## [27] "knn"                       "knn.scores"

Exercise Once you have performed the kNN classification as illustrated above, visualise your results on a PCA plot.

plot2D(hyperLOPIT2015, fcol = "knn")

Below, we show how to use MLInterfaces to perform a classification analysis using k nearest neighbours.

library("MLInterfaces")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans

## MLInterfaces classification output container
## The call was:
## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5), 
##     trainInd = traininds)
## Predicted outcome distribution for test set:
## 
##      ER lumen   ER membrane         Golgi Mitochondrion       Plastid 
##             5           138            66            51            29 
##            PM      Ribosome           TGN       vacuole 
##            91            32             6            10 
## Summary of scores on test set (use testScores() method for details):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  1.0000  1.0000  0.9332  1.0000  1.0000

9.3 Clustering

To illustrate how to apply clustering in the frame of what we have seen so far, we are going to use k-means clustering on the spatial proteomics data above, visually comparing the clusters obtains with the results of the classification results (in the svm.classification feature variable).

We are (1) going to perform k-means setting the number of expected clusters equal to the number of annotations, (2) store clustering results as a new feature variable, and then (3) visualise the results next to each other on two PCA plots.

We can use the kmeans function, passing the quantiative proteomics data and the number of anticipated clusters as input:

## number of sub-cellular niches
n <- length(getMarkerClasses(hyperLOPIT2015))
cl <- kmeans(exprs(hyperLOPIT2015), n)
table(cl$cluster)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 244 452 286 360 450 366 232 352 295 130 610 351 613 291

We can now add these results to our MSnSet:

fData(hyperLOPIT2015)$cluster <- cl$cluster

And now visualise the classification and clustering results side by side:

setStockcol(paste0(getStockcol(), 40))
par(mfrow = c(1, 2))
plot2D(hyperLOPIT2015, fcol = "svm.classification")
plot2D(hyperLOPIT2015, fcol = "cluster")

PCA plot with the classification and clustering results.

A wide range of clustering algorithms are available in MLInterfaces, as described in the ?MLearn documentation page, used below. Below, we show how to use it to perform a k-means clustering.

kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl

## clusteringOutput: partition table
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 56 98 45 49 28 82 55 28 60 80 85 23 
## The call that created this object was:
## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12)

plot(kcl, exprs(dunkley2006))

Kmeans clustering using r Biocpkg('MLInterfaces') with an MSnSet object.

10 Annotation

10.1 The `rols` package

All the Bioconductor annotation infrastructure, such as biomaRt, GO.db, organism specific annotations, … are directly relevant to the analysis of proteomics data. A total of 191 ontologies, including some proteomics-centred annotations such as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications are available through the rols

library("rols")
res <- OlsSearch(q = "ESI", ontology = "MS", exact = TRUE)
res

## Object of class 'OlsSearch':
##   ontolgy: MS 
##   query: ESI 
##   requested: 20 (out of 1)
##   response(s): 0

There is a single exact match (default is to retrieve 20 results), that can be retrieved and coerced to a Terms or data.frame object with

res <- olsSearch(res)
as(res, "Terms")

## Object of class 'Terms' with 1 entries
##  From the MS ontology
## MS:1000073

as(res, "data.frame")

##                                                   id
## 1 ms:class:http://purl.obolibrary.org/obo/MS_1000073
##                                         iri short_form     obo_id
## 1 http://purl.obolibrary.org/obo/MS_1000073 MS_1000073 MS:1000073
##                     label
## 1 electrospray ionization
##                                                                                                                                                                                                                                                                                                                                                                                                                                                  description
## 1 A process in which ionized species in the gas phase are produced from an analyte-containing solution via highly charged fine droplets, by means of spraying the solution from a narrow-bore needle tip at atmospheric pressure in the presence of a high electric field. When a pressurized gas is used to aid in the formation of a stable spray, the term pneumatically assisted electrospray ionization is used. The term ion spray is not recommended.
##   ontology_name ontology_prefix  type is_defining_ontology
## 1            ms              MS class                 TRUE

10.2 The Human Protein Atlas

Data from the Human Protein Atlas is available via the hpar package. Below, we are going to illustrate how to use it with a usec ase retrieving sub-cellular information for a protein, and contrast it with data from the GO.db package.

More HPA data are available, as documented in the package manual at ?hpar.

Let’s compare the subcellular localisation annotation obtained from the HPA subcellular location data set and the information available in the Bioconductor annotation packages.

library("hpar")
id <- "ENSG00000001460"
getHpa(id, "hpaSubcellularLoc")

##              Gene Gene.name Reliability Enhanced Supported    Approved
## 8 ENSG00000001460     STPG1    Approved                    Nucleoplasm
##   Uncertain Single.cell.variation.intensity Single.cell.variation.spatial
## 8                                                                        
##   Cell.cycle.dependency                    GO.id
## 8                       Nucleoplasm (GO:0005654)

Below, we first extract all cellular component GO terms available for id from the org.Hs.eg.db human annotation and then retrieve their term definitions using the GO.db database.

library("org.Hs.eg.db")
library("GO.db")
ans <- AnnotationDbi::select(org.Hs.eg.db, keys = id,
                             columns = c("ENSEMBL", "GO", "ONTOLOGY"),
                             keytype = "ENSEMBL")
ans <- ans[ans$ONTOLOGY == "CC", ]
ans

##           ENSEMBL         GO EVIDENCE ONTOLOGY
## 2 ENSG00000001460 GO:0005622      IMP       CC
## 3 ENSG00000001460 GO:0005634      IEA       CC
## 4 ENSG00000001460 GO:0005739      IEA       CC

sapply(as.list(GOTERM[ans$GO]), slot, "Term")

##      GO:0005622      GO:0005634      GO:0005739 
## "intracellular"       "nucleus" "mitochondrion"

10.3 The `ensembldb` package

Theensembldb allows to query and filter the data from ENSEMBL and integrates smoothly with general Bioconductor infrastructure. Below are a couple of illustrative examples, and more are available in the package vignettes.

Below, we initialise the human database:

library("ensembldb")
library("EnsDb.Hsapiens.v86")
edb <- EnsDb.Hsapiens.v86
edb

## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.0
## |Creation time: Thu May 18 16:32:27 2017
## |ensembl_version: 86
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh38
## |DBSCHEMAVERSION: 2.0
## | No. of genes: 63970.
## | No. of transcripts: 216741.
## |Protein data available.

Here are the tables and columns available for that database:

listTables(edb)

## $gene
## [1] "gene_id"          "gene_name"        "gene_biotype"    
## [4] "gene_seq_start"   "gene_seq_end"     "seq_name"        
## [7] "seq_strand"       "seq_coord_system" "symbol"          
## 
## $tx
## [1] "tx_id"            "tx_biotype"       "tx_seq_start"    
## [4] "tx_seq_end"       "tx_cds_seq_start" "tx_cds_seq_end"  
## [7] "gene_id"          "tx_name"         
## 
## $tx2exon
## [1] "tx_id"    "exon_id"  "exon_idx"
## 
## $exon
## [1] "exon_id"        "exon_seq_start" "exon_seq_end"  
## 
## $chromosome
## [1] "seq_name"    "seq_length"  "is_circular"
## 
## $protein
## [1] "tx_id"            "protein_id"       "protein_sequence"
## 
## $uniprot
## [1] "protein_id"           "uniprot_id"           "uniprot_db"          
## [4] "uniprot_mapping_type"
## 
## $protein_domain
## [1] "protein_id"            "protein_domain_id"     "protein_domain_source"
## [4] "interpro_accession"    "prot_dom_start"        "prot_dom_end"         
## 
## $entrezgene
## [1] "gene_id"  "entrezid"
## 
## $metadata
## [1] "name"  "value"

Fetch protein annotation for genes and transcripts

Protein annotations for (protein coding) transcripts can be retrieved by simply adding the desired annotation columns to the columns parameter.

## Get protein information for ZBTB16 transcripts
txs <- transcripts(edb, filter = GeneNameFilter("ZBTB16"),
                   columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs

## GRanges object with 11 ranges and 5 metadata columns:
##                   seqnames              ranges strand |      protein_id
##                      <Rle>           <IRanges>  <Rle> |     <character>
##   ENST00000335953       11 114059593-114250676      + | ENSP00000338157
##   ENST00000335953       11 114059593-114250676      + | ENSP00000338157
##   ENST00000541602       11 114059725-114189764      + |            <NA>
##   ENST00000544220       11 114059737-114063646      + | ENSP00000437716
##   ENST00000535700       11 114060257-114063744      + | ENSP00000443013
##   ENST00000392996       11 114060507-114250652      + | ENSP00000376721
##   ENST00000392996       11 114060507-114250652      + | ENSP00000376721
##   ENST00000539918       11 114064412-114247344      + | ENSP00000445047
##   ENST00000545851       11 114180766-114247296      + |            <NA>
##   ENST00000535379       11 114237207-114250557      + |            <NA>
##   ENST00000535509       11 114246790-114250476      + |            <NA>
##                    uniprot_id              tx_biotype           tx_id
##                   <character>             <character>     <character>
##   ENST00000335953      Q05516          protein_coding ENST00000335953
##   ENST00000335953  A0A024R3C6          protein_coding ENST00000335953
##   ENST00000541602        <NA>         retained_intron ENST00000541602
##   ENST00000544220      F5H6C3          protein_coding ENST00000544220
##   ENST00000535700      F5H5Y7          protein_coding ENST00000535700
##   ENST00000392996      Q05516          protein_coding ENST00000392996
##   ENST00000392996  A0A024R3C6          protein_coding ENST00000392996
##   ENST00000539918      H0YGW2 nonsense_mediated_decay ENST00000539918
##   ENST00000545851        <NA>    processed_transcript ENST00000545851
##   ENST00000535379        <NA>    processed_transcript ENST00000535379
##   ENST00000535509        <NA>         retained_intron ENST00000535509
##                     gene_name
##                   <character>
##   ENST00000335953      ZBTB16
##   ENST00000335953      ZBTB16
##   ENST00000541602      ZBTB16
##   ENST00000544220      ZBTB16
##   ENST00000535700      ZBTB16
##   ENST00000392996      ZBTB16
##   ENST00000392996      ZBTB16
##   ENST00000539918      ZBTB16
##   ENST00000545851      ZBTB16
##   ENST00000535379      ZBTB16
##   ENST00000535509      ZBTB16
##   -------
##   seqinfo: 1 sequence from GRCh38 genome

Retrieve proteins from the database

Below, we download the protein sequences for the ZBTB16 gene.

prts <- proteins(edb, filter = GeneNameFilter("ZBTB16"),
                 return.type = "AAStringSet")
prts

##   A AAStringSet instance of length 5
##     width seq                                          names               
## [1]   673 MDLTKMGMIQLQNPSHPTGLL...PEEIPPDWRIEKTYLYLCYV ENSP00000338157
## [2]   115 MDLTKMGMIQLQNPSHPTGLL...AEDLDDLLYAAEILEIEYLE ENSP00000437716
## [3]   148 MDLTKMGMIQLQNPSHPTGLL...DDNDTEATMADGGAEEEEDR ENSP00000443013
## [4]   673 MDLTKMGMIQLQNPSHPTGLL...PEEIPPDWRIEKTYLYLCYV ENSP00000376721
## [5]    55 XGGLLPQGFIQRELFSKLGEL...CSVCGVELPDNEAVEQHRVF ENSP00000445047

We can also get the associated meta-data with the mcols function.

mcols(prts)

## DataFrame with 5 rows and 3 columns
##                           tx_id      protein_id   gene_name
##                     <character>     <character> <character>
## ENSP00000338157 ENST00000335953 ENSP00000338157      ZBTB16
## ENSP00000437716 ENST00000544220 ENSP00000437716      ZBTB16
## ENSP00000443013 ENST00000535700 ENSP00000443013      ZBTB16
## ENSP00000376721 ENST00000392996 ENSP00000376721      ZBTB16
## ENSP00000445047 ENST00000539918 ENSP00000445047      ZBTB16

11 More information

References and resources

Other relevant packages/pipelines

Analysis of post translational modification with isobar.
Processing and analysis or isobaric tagging mass spectrometry with isobar and MSnbase.
Analysis of spatial proteomics data and, more generally, supervised, semi-supervised learning, and transfer learning using the pRoloc package.
Analysis of MALDI data with the MALDIquant package.
Access to the Proteomics Standard Initiative Common QUery InterfaCe with the PSICQUIC package.
Cardinal: A mass spectrometry imaging toolbox for statistical analysis.
MSnID: Utilities for Exploration and Assessment of Confidence of LC-MSn Proteomics Identifications.
protViz: Visualising and Analysing Mass Spectrometry Related Data in Proteomics
aLFQ: Estimating Absolute Protein Quantities from Label-Free LC-MS/MS Proteomics Data.
protiq: Protein (identification and) quantification based on peptide evidence.
MSstats: Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

Data Independent Acquisition

Analysis of label-free data from a Synapt G2 (including ion mobility) with synapter.
SWATH2stats: Transform and Filter SWATH Data for Statistical Packages and
specL: Prepare Peptide Spectrum Matches for Use in Targeted Proteomics
SwathXtend: SWATH extended library generation and statistical data analysis

12 Session Information

sessionInfo()

## R version 3.5.2 Patched (2019-01-24 r76018)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Manjaro Linux
## 
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.8.0
## LAPACK: /usr/lib/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v86_2.99.0 GO.db_3.7.0              
##  [3] org.Hs.eg.db_3.7.0        magrittr_1.5             
##  [5] knitr_1.21                msmsTests_1.20.1         
##  [7] msmsEDA_1.20.1            tidies_0.0.3             
##  [9] imputeLCMD_2.0            impute_1.56.0            
## [11] pcaMethods_1.74.0         norm_1.0-9.5             
## [13] tmvtnorm_1.4-10           gmm_1.6-2                
## [15] sandwich_2.5-0            Matrix_1.2-15            
## [17] mvtnorm_1.0-8             ggplot2_3.1.0            
## [19] bindrcpp_0.2.2            dplyr_0.7.8              
## [21] msdata_0.22.0             hpar_1.25.1              
## [23] rols_2.11.0               MSGFplus_1.15.1          
## [25] pRolocdata_1.21.1         rpx_1.18.1               
## [27] MSnID_1.16.1              mzID_1.20.1              
## [29] ensembldb_2.6.5           AnnotationFilter_1.6.0   
## [31] GenomicFeatures_1.34.3    GenomicRanges_1.34.0     
## [33] GenomeInfoDb_1.18.1       AnnotationHub_2.14.3     
## [35] pRoloc_1.23.2             MLInterfaces_1.62.0      
## [37] cluster_2.0.7-1           annotate_1.60.0          
## [39] XML_3.98-1.17             AnnotationDbi_1.44.0     
## [41] IRanges_2.16.0            gridExtra_2.3            
## [43] lattice_0.20-38           RforProteomics_1.20.0    
## [45] BiocInstaller_1.32.1      BiocParallel_1.16.5      
## [47] MSnbase_2.9.3             ProtGenerics_1.14.0      
## [49] S4Vectors_0.20.1          mzR_2.16.1               
## [51] Rcpp_1.0.0                Biobase_2.42.0           
## [53] BiocGenerics_0.28.0       BiocStyle_2.10.0         
## 
## loaded via a namespace (and not attached):
##   [1] rtracklayer_1.42.1            prabclus_2.2-7               
##   [3] ModelMetrics_1.2.2            R.methodsS3_1.7.1            
##   [5] coda_0.19-2                   tidyr_0.8.2                  
##   [7] bit64_0.9-7                   DelayedArray_0.8.0           
##   [9] R.utils_2.7.0                 data.table_1.12.0            
##  [11] rpart_4.1-13                  hwriter_1.3.2                
##  [13] RCurl_1.95-4.11               doParallel_1.0.14            
##  [15] generics_0.0.2                preprocessCore_1.44.0        
##  [17] callr_3.1.1                   RSQLite_2.1.1                
##  [19] proxy_0.4-22                  bit_1.1-14                   
##  [21] webshot_0.5.1                 xml2_1.2.0                   
##  [23] lubridate_1.7.4               httpuv_1.4.5.1               
##  [25] SummarizedExperiment_1.12.0   assertthat_0.2.0             
##  [27] viridis_0.5.1                 gower_0.1.2                  
##  [29] xfun_0.4                      hms_0.4.2                    
##  [31] evaluate_0.12                 promises_1.0.1               
##  [33] DEoptimR_1.0-8                progress_1.2.0               
##  [35] caTools_1.17.1.1              dendextend_1.9.0             
##  [37] igraph_1.2.3                  DBI_1.0.0                    
##  [39] htmlwidgets_1.3               purrr_0.3.0                  
##  [41] crosstalk_1.0.0               rda_1.0.2-2.1                
##  [43] trimcluster_0.1-2.1           biomaRt_2.38.0               
##  [45] caret_6.0-81                  withr_2.1.2                  
##  [47] sfsmisc_1.1-3                 robustbase_0.93-3            
##  [49] GenomicAlignments_1.18.1      prettyunits_1.0.2            
##  [51] mclust_5.4.2                  segmented_0.5-3.0            
##  [53] lazyeval_0.2.1                crayon_1.3.4                 
##  [55] genefilter_1.64.0             edgeR_3.24.3                 
##  [57] recipes_0.1.4                 pkgconfig_2.0.2              
##  [59] labeling_0.3                  nlme_3.1-137                 
##  [61] nnet_7.3-12                   bindr_0.1.1                  
##  [63] rlang_0.3.1                   diptest_0.75-7               
##  [65] pls_2.7-0                     LaplacesDemon_16.1.1         
##  [67] affyio_1.52.0                 randomForest_4.6-14          
##  [69] matrixStats_0.54.0            graph_1.60.0                 
##  [71] zoo_1.8-4                     base64enc_0.1-3              
##  [73] whisker_0.3-2                 processx_3.2.1               
##  [75] viridisLite_0.3.0             bitops_1.0-6                 
##  [77] R.oo_1.22.0                   KernSmooth_2.23-15           
##  [79] Biostrings_2.50.2             blob_1.1.1                   
##  [81] qvalue_2.14.1                 stringr_1.4.0                
##  [83] R.cache_0.13.0                ggvis_0.4.4                  
##  [85] scales_1.0.0                  lpSolve_5.6.13               
##  [87] memoise_1.1.0                 plyr_1.8.4                   
##  [89] hexbin_1.27.2                 gplots_3.0.1.1               
##  [91] gdata_2.18.0                  zlibbioc_1.28.0              
##  [93] threejs_0.3.1                 compiler_3.5.2               
##  [95] RColorBrewer_1.1-2            Rsamtools_1.34.1             
##  [97] affy_1.60.0                   XVector_0.22.0               
##  [99] ps_1.3.0                      MASS_7.3-51.1                
## [101] tidyselect_0.2.5              vsn_3.50.0                   
## [103] stringi_1.2.4                 highr_0.7                    
## [105] yaml_2.2.0                    locfit_1.5-9.1               
## [107] MALDIquant_1.18               biocViews_1.50.10            
## [109] grid_3.5.2                    tools_3.5.2                  
## [111] foreach_1.4.4                 prodlim_2018.04.18           
## [113] digest_0.6.18                 BiocManager_1.30.4           
## [115] FNN_1.1.2.2                   shiny_1.2.0                  
## [117] lava_1.6.4                    fpc_2.1-11.1                 
## [119] later_0.7.5                   ncdf4_1.16                   
## [121] httr_1.4.0                    kernlab_0.9-27               
## [123] colorspace_1.4-0              splines_3.5.2                
## [125] RBGL_1.58.1                   flexmix_2.3-14               
## [127] xtable_1.8-3                  jsonlite_1.6                 
## [129] mlbench_2.1-1                 timeDate_3043.102            
## [131] modeltools_0.2-22             ipred_0.9-8                  
## [133] R6_2.3.0                      RUnit_0.4.32                 
## [135] pillar_1.3.1                  htmltools_0.3.6              
## [137] mime_0.6                      glue_1.3.0                   
## [139] DT_0.5                        class_7.3-15                 
## [141] interactiveDisplayBase_1.20.0 codetools_0.2-16             
## [143] tibble_2.0.1                  mixtools_1.1.0               
## [145] gbm_2.1.5                     curl_3.3                     
## [147] gtools_3.8.1                  survival_2.43-3              
## [149] limma_3.38.3                  rmarkdown_1.11               
## [151] sampling_2.8                  munsell_0.5.0                
## [153] e1071_1.7-0.1                 GenomeInfoDbData_1.2.0       
## [155] iterators_1.0.10              reshape2_1.4.3               
## [157] gtable_0.2.0

Bioconductor tools for mass spectrometry and proteomics

18 February 2019

1 Introduction

Other material

Setup

Exploring available infrastructure

Mass spectrometry data

2 How does mass spectrometry work?

3 Accessing data

3.1 From the ProteomeXchange database

3.2 From AnnotationHub

3.3 Data packages

4 Raw MS data

4.1 High-level data structures

4.2 Raw data: the MSnExp class

Read raw data

Chromatograms

Annotating spectra with identification results

Plotting MS spectra

4.3 Relations between scans

4.4 Low-level access to raw data

4.5 A bit more raw data visualisation

5 Identification data

5.1 Identification data.frame

5.2 Low level access to id data

5.3 MS/MS database search

5.4 Analysing search results

Analysis of peptide sequences

Trimming the data

Parent ion mass errors

Filtering criteria

Filter optimisation

5.5 Visualising identification data

6 Quantitative proteomics

6.1 The MSnSet class for quantitative data

6.2 Isobaric tagging

6.3 Spectral counting

6.4 Importing third-party quantitation data

From MzTab files

From spreadsheets

7 Data processing and analysis

7.1 Raw data processing

7.2 Quantitative data processing and normalisation

Isobaric purity correction

Normalisation

Combining features

Missing values

7.3 Some notes on data analysis and manipulation

8 Statistical analysis

9 Machine learning

9.1 Dimensionality reduction

9.2 Classification

9.3 Clustering

10 Annotation

10.1 The rols package

10.2 The Human Protein Atlas

10.3 The ensembldb package

Fetch protein annotation for genes and transcripts

Retrieve proteins from the database

11 More information

References and resources

Other relevant packages/pipelines

Data Independent Acquisition

More questions

12 Session Information

4.2 Raw data: the `MSnExp` class

6.1 The `MSnSet` class for quantitative data

From `MzTab` files

10.1 The `rols` package

10.3 The `ensembldb` package