Abstract In this workshop, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The participants will gain a general overview of Bioconductor packages for mass spectrometry and proteomics, and learn how to navigate raw data and reconstruct quantitative data. The workshop is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.
This short tutorial is part of the ProteomicsBioc2016Workshop
package (version 0.2.2), available at https://github.com/lgatto/ProteomicsBioc2016Workshop.
This vignette available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
Modified: 2016-06-26 19:13:31
Compiled: Sun Jun 26 19:13:45 2016
To install this package and build the vignette:
If the BiocInstaller is not installted, start with
source("https://bioconductor.org/biocLite.R")
then
library("BiocInstaller")
biocLite("lgatto/ProteomicsBioc2016Workshop",
dependencies = TRUE, build_vignettes=TRUE)
To launch the vignette
library("ProteomicsBioc2016Workshop")
bioc2016()
In Bioconductor version 3.4, there are respectively 83 proteomics and 55 mass spectrometry software packages and 15 mass spectrometry experiment packages. These respective packages can be extracted with the proteomicsPackages()
, massSpectrometryPackages()
and massSpectrometryDataPackages()
and explored interactively.
library("RforProteomics")
pp <- proteomicsPackages("3.3")
display(pp)
AnnotationHub is a cloud resource set up and managed by the Bioconductor project that programmatically disseminates omics data. I am currently working on contributing proteomics data.
library("AnnotationHub")
ah <- AnnotationHub()
## snapshotDate(): 2016-06-06
query(ah, "proteomics")
## AnnotationHub with 4 records
## # snapshotDate(): 2016-06-06
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description,
## # preparerclass, tags, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH49006"]]'
##
## title
## AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
## AH49007 | PXD000001: Peptide-level quantitation data
## AH49008 | PXD000001: raw mass spectrometry data
## AH49009 | PXD000001: MS-GF+ identiciation data
Below, we download a raw mass spectrometry dataset with identifier AH49008
and store it in a variable names ms
.
ms <- ah[["AH49008"]]
ms
## Mass Spectrometry file handle.
## Filename: 55314
## Number of scans: 7534
The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The filename, 55314, is not very descriptive because the data originates from the AnnotationHub
cloud repository. If the data was read from a local file, is would be named as the mzML
(or mzXML
) file.
Contemporary MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRIDE data base at the EBI for MS/MS experiments, PASSEL at the ISB for SRM data and the MassIVE resource. The rpx is an interface to ProteomeXchange and provides a basic and unified access to PX data.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD000275 2016-06-25 07:30:28 Updated information
## 2 PXD001533 2016-06-24 18:34:35 New
## 3 PXD001580 2016-06-24 18:14:28 New
## 4 PXD001585 2016-06-24 17:41:38 New
## 5 PXD001621 2016-06-24 16:55:41 New
## 6 PXD001624 2016-06-24 16:42:12 New
## 7 PXD001631 2016-06-24 16:24:27 New
## 8 PXD001635 2016-06-24 15:49:53 New
## 9 PXD001654 2016-06-24 15:46:02 New
## 10 PXD001664 2016-06-24 15:31:28 New
## 11 PXD003660 2016-06-24 13:52:05 New
## 12 PXD000275 2016-06-24 09:09:19 New
## 13 PXD003718 2016-06-23 11:59:49 New
## 14 PXD004174 2016-06-23 09:15:03 New
## 15 PXD002554 2016-06-23 09:03:25 New
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 10 files
## [1] 'F063721.dat' ... [10] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [10] "erwinia_carotovora.fasta"
Other metadata for the px
dataset:
pxtax(px)
pxurl(px)
pxref(px)
Data files can then be downloaded with the pxget
function as illustrated below.
mzf <- pxget(px, pxfiles(px)[6])
## Downloading 1 file
## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML already present.
mzf
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
The mzR
package provides an interface to the proteowizard code base, the legacy RAMP is a non-sequential parser and other C/C++ code to access various raw data files, such as mzML
, mzXML
, netCDF
, and mzData
. The data is accessed on-disk, i.e it does not get loaded entirely in memory by default. The three main functions are openMSfile
to create a file handle to a raw data file, header
to extract metadata about the spectra contained in the file and peaks
to extract one or multiple spectra of interest. Other functions such as instrumentInfo
, or runInfo
can be used to gather general information about a run.
library("mzR")
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
hd <- header(ms)
dim(hd)
## [1] 7534 21
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum"
head(peaks(ms, 234))
## [,1] [,2]
## [1,] 100.3852 93.86617
## [2,] 101.6572 83.94572
## [3,] 107.3893 91.63007
## [4,] 112.0505 15697.57129
## [5,] 113.0538 106.47455
## [6,] 113.3050 83.92274
str(peaks(ms, 1:5))
## List of 5
## $ : num [1:25800, 1:2] 400 400 400 400 400 ...
## $ : num [1:25934, 1:2] 400 400 400 400 400 ...
## $ : num [1:26148, 1:2] 400 400 400 400 400 ...
## $ : num [1:26330, 1:2] 400 400 400 400 400 ...
## $ : num [1:26463, 1:2] 400 400 400 400 400 ...
Exercise Extract the index of the MS2 spectrum with the highest base peak intensity and plot its spectrum (we will see more about MS data visualisation in the next section). Is the data centroided or in profile mode?
hd2 <- hd[hd$msLevel == 2, ]
i <- which.max(hd2$basePeakIntensity)
hd2[i, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 5404 5404 5404 2 -1 275 2283283712
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 5404 2751.313 859.5032 354288224 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 5404 0 100.5031 1995.63 5403 859.1722
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 5404 3 627820480 0 0
## mergedResultStartScanNum mergedResultEndScanNum
## 5404 0 0
pi <- peaks(ms, hd2[i, 1])
plot(pi, type = "h")
mz <- hd2[i, "basePeakMZ"]
plot(pi, type = "h", xlim = c(mz-0.5, mz+0.5))
Zooming into spectrum 100.
pj <- peaks(ms, 100)
plot(pj, type = "l")
plot(pj, type = "l", xlim = c(536,540))
The ProteomicsBioc2016Workshop
package distributes a small identification result file (see ?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid
) that we load and parse using infrastructure from the mzID package.
library("mzID")
(f <- dir(system.file("extdata", package = "ProteomicsBioc2016Workshop"),
pattern = "mzid", full.names = TRUE))
## [1] "/home/lg390/R/x86_64-pc-linux-gnu-library/3.3rel/ProteomicsBioc2016Workshop/extdata/TMT_Erwinia.mzid.gz"
id <- mzID(f)
## reading TMT_Erwinia.mzid.gz... DONE!
software(id)
## version name id
## 1 Beta (v10072) MS-GF+ ID_software
id
## An mzID object
##
## Software used: MS-GF+ (version: Beta (v10072))
##
## Rawfile: /home/lgatto/dev/00_github/RforProteomics/sandbox/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
##
## Database: /home/lgatto/dev/00_github/RforProteomics/sandbox/erwinia_carotovora.fasta
##
## Number of scans: 5287
## Number of PSM's: 5563
Various data can be extracted from the mzID
object, using one the accessor functions such as database
, sofware
, scans
, peptides
, … The object can also be converted into a data.frame
using the flatten
function.
head(flatten(id))
## spectrumid scan number(s) acquisitionnum passthreshold rank
## 1 scan=5782 5782 5782 TRUE 1
## 2 scan=6037 6037 6037 TRUE 1
## 3 scan=5235 5235 5235 TRUE 1
## calculatedmasstocharge experimentalmasstocharge chargestate
## 1 1080.2321 1080.2325 3
## 2 1002.2115 1002.2089 3
## 3 1189.2800 1189.2836 3
## ms-gf:denovoscore ms-gf:evalue ms-gf:rawscore ms-gf:specevalue
## 1 174 5.430080e-21 147 3.764831e-27
## 2 245 9.943751e-20 214 6.902626e-26
## 3 264 2.564787e-19 211 1.778789e-25
## assumeddissociationmethod isotopeerror isdecoy post pre end start
## 1 HCD 0 FALSE S R 84 50
## 2 HCD 0 FALSE R K 315 288
## 3 HCD 0 FALSE A R 224 192
## accession length description
## 1 ECA1932 155 outer membrane lipoprotein
## 2 ECA1147 434 trigger factor
## 3 ECA0013 295 ribose-binding periplasmic protein
## pepseq modified modification
## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR FALSE <NA>
## 2 TQVLDGLINANDIEVPVALIDGEIDVLR FALSE <NA>
## 3 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR FALSE <NA>
## idFile
## 1 TMT_Erwinia.mzid.gz
## 2 TMT_Erwinia.mzid.gz
## 3 TMT_Erwinia.mzid.gz
## spectrumFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## databaseFile
## 1 erwinia_carotovora.fasta
## 2 erwinia_carotovora.fasta
## 3 erwinia_carotovora.fasta
## [ reached getOption("max.print") -- omitted 3 rows ]
The mzR
interface provides a similar interface. It is however much faster as it does not read all the data into memory and only extracts relevant data on demand. It has also accessor functions such as softwareInfo
, mzidInfo
, … (use showMethods(classes = "mzRident", where = "package:mzR")
) to see all available methods.
library("mzR")
id2 <- openIDfile(f)
id2
## Identification file handle.
## Filename: TMT_Erwinia.mzid.gz
## Number of psms: 5655
softwareInfo(id2)
## [1] "MS-GF+ Beta (v10072) "
## [2] "ProteoWizard MzIdentML 3.0.6239 ProteoWizard"
The identification data can be accessed as a data.frame
with the psms
accessor.
head(psms(id2))
## spectrumID chargeState rank passThreshold experimentalMassToCharge
## 1 scan=5782 3 1 TRUE 1080.2325
## 2 scan=6037 3 1 TRUE 1002.2089
## 3 scan=5235 3 1 TRUE 1189.2836
## 4 scan=5397 3 1 TRUE 960.5365
## 5 scan=6075 3 1 TRUE 1264.3409
## calculatedMassToCharge sequence modNum
## 1 1080.2321 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR 0
## 2 1002.2115 TQVLDGLINANDIEVPVALIDGEIDVLR 0
## 3 1189.2800 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR 0
## 4 960.5365 SQILQQAGTSVLSQANQVPQTVLSLLR 0
## 5 1264.3419 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR 0
## isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
## 1 FALSE S R 50 84 ECA1932 155
## 2 FALSE R K 288 315 ECA1147 434
## 3 FALSE A R 192 224 ECA0013 295
## 4 FALSE - R 264 290 ECA1731 290
## 5 FALSE F R 119 153 ECA1443 298
## DatabaseDescription acquisitionNum
## 1 ECA1932 outer membrane lipoprotein 5782
## 2 ECA1147 trigger factor 6037
## 3 ECA0013 ribose-binding periplasmic protein 5235
## 4 ECA1731 flagellin 5397
## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
## [ reached getOption("max.print") -- omitted 1 row ]
Exercise Is there a relation between the length of a protein and the number of identified peptides, conditioned by the (average) e-value of the identifications?
fid <- flatten(id)
x <- by(fid, fid$accession, function(x)
c(unique(x$length),
length(unique(x$pepseq)),
mean(x$'ms-gf:specevalue')))
x <- data.frame(do.call(rbind, x))
colnames(x) <- c("plength", "npep", "eval")
x$bins <- cut(x$eval, summary(x$eval))
library("lattice")
xyplot(plength ~ npep | bins, data = x)
While searches are generally performed using third-party software independently of R or can be started from R using a system
call, the rTANDEM package allows one to execute such searches using the X!Tandem engine.
library("rTANDEM")
?rtandem
The MSGPplus package provides an interface to the very fast MSGF+
engine.
library("MSGFplus")
parameters <- msgfPar(database = 'milk-proteins.fasta',
tolerance = '20 ppm',
instrument = 'TOF')
runMSGF(parameters, 'msraw.mzML')
The above sections introduced low-level interfaces to raw and identification results. The MSnbase package provides abstractions for raw data through the MSnExp
class and containers for quantification data via the MSnSet
class. Both store the actual assay data (spectra or quantitation matrix) and sample and feature metadata, accessed with spectra
(or the [
, [[
operators) or exprs
, pData
and fData
.
The figure below give a schematics of an MSnSet
instance and the relation between the assay data and the respective feature and sample metadata.
Another useful slot is processingData
, accessed with processingData(.)
, that records all the processing that objects have undergone since their creation (see examples below).
The readMSData
will parse the raw data, extract the MS2 spectra and construct an MS experiment file.
library("MSnbase")
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$")
quantFile
## [1] "/home/lg390/R/x86_64-pc-linux-gnu-library/3.3rel/MSnbase/extdata/dummyiTRAQ.mzXML"
msexp <- readMSData(quantFile, verbose=FALSE)
msexp
## Object of class "MSnExp"
## Object size in memory: 0.2 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of MS1 acquisitions: 1
## Number of MSn scans: 5
## Number of precursor ions: 5
## 4 unique MZs
## Precursor MZ's: 437.8 - 716.34
## MSn M/Z range: 100 2016.66
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Sun Jun 26 19:14:39 2016
## MSnbase version: 1.21.7
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1.1 X2.1 ... X5.1 (5 total)
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
The identification results stemming from the same raw data file can then be used to add PSM matches.
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzid$")
identFile
## [1] "/home/lg390/R/x86_64-pc-linux-gnu-library/3.3rel/MSnbase/extdata/dummyiTRAQ.mzid"
msexp <- addIdentificationData(msexp, identFile)
## reading dummyiTRAQ.mzid... DONE!
fData(msexp)
## spectrum scan number(s) passthreshold rank calculatedmasstocharge
## X1.1 1 1 TRUE 1 645.0375
## X2.1 2 2 TRUE 1 546.9633
## X3.1 3 NA NA NA NA
## experimentalmasstocharge chargestate ms-gf:denovoscore ms-gf:evalue
## X1.1 645.3741 3 77 79.36958
## X2.1 546.9586 3 39 13.46615
## X3.1 NA NA NA NA
## ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod
## X1.1 -39 5.527468e-05 CID
## X2.1 -30 9.399048e-06 CID
## X3.1 NA NA <NA>
## isotopeerror isdecoy post pre end start accession length
## X1.1 1 FALSE A R 186 170 ECA0984;ECA3829 231
## X2.1 0 FALSE A K 62 50 ECA1028 275
## X3.1 <NA> NA <NA> <NA> NA NA <NA> NA
## description
## X1.1 DNA mismatch repair protein;acetolactate synthase isozyme III large subunit
## X2.1 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
## X3.1 <NA>
## pepseq modified modification idFile
## X1.1 VESITARHGEVLQLRPK FALSE NA dummyiTRAQ.mzid
## X2.1 IDGQWVTHQWLKK FALSE NA dummyiTRAQ.mzid
## X3.1 <NA> NA NA <NA>
## databaseFile nprot npep.prot npsm.prot npsm.pep
## X1.1 erwinia_carotovora.fasta 2 1 1 1
## X2.1 erwinia_carotovora.fasta 1 1 1 1
## X3.1 <NA> NA NA NA NA
## [ reached getOption("max.print") -- omitted 2 rows ]
msexp[[1]]
## Object of class "Spectrum2"
## Precursor: 645.3741
## Retention time: 25:1
## Charge: 3
## MSn level: 2
## Peaks count: 2921
## Total ion count: 668170086
as(msexp[[1]], "data.frame")[100:105, ]
## mz i
## 100 141.0990 588594.812
## 101 141.1015 845401.250
## 102 141.1041 791352.125
## 103 141.1066 477623.000
## 104 141.1091 155376.312
## 105 141.1117 4752.541
MSnExp
object load all MS data into memory. This is a viable solution for MS2 data, but does not scale to MS1 data, especially when multiple files are loaded together. With the help of Johannes Rainer, a new MSnExp
class supporting on disk access is being developed.
chromatogram(ms)
It is of course possible to overlay several chromatograms. The code chunks below are not executed dynamically so save time with downloading raw data files.
## manual download
library("RforProteomics")
url <- "http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/"
f1 <- downloadData(file.path(url, "Thermo_Hela_PRTC_1.mzML"))
f2 <- downloadData(file.path(url, "Thermo_Hela_PRTC_2.mzML"))
f3 <- downloadData(file.path(url, "Thermo_Hela_PRTC_3.mzML"))
## plotting
c1 <- chromatogram(f1)
c2 <- chromatogram(f2, plot = FALSE)
lines(c2, col = "steelblue", lty = "dashed", lwd = 2)
c3 <- chromatogram(f3, plot = FALSE)
lines(c2, col = "orange", lty = "dotted")
par(mfrow = c(1, 2))
xic(ms, mz = 636.925, width = 0.01)
x <- xic(ms, mz = 636.925, width = 0.01, rtlim = c(2120, 2200))
We first load a test iTRAQ data called itraqdata
and distributed as part of the MSnbase package using the data
function. This is a pre-packaged data that comes as a dedicated data structure called MSnExp
. We then plot
the 10th spectrum using specific code that recognises what to do with an element of an MSnExp
.
data(itraqdata)
itraqdata
## Object of class "MSnExp"
## Object size in memory: 1.88 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of MS1 acquisitions: 1
## Number of MSn scans: 55
## Number of precursor ions: 55
## 55 unique MZs
## Precursor MZ's: 401.74 - 1236.1
## MSn M/Z range: 100 2069.27
## MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
plot(itraqdata[[10]], reporters = iTRAQ4, full=TRUE)
The ms
data is not pre-packaged as an MSnExp
data. It is a more bare-bone mzRramp object, a pointer to a raw data file (here TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML): we need first to extract a spectrum of interest (here the 3071st spectrum, an MS1 spectrum), and use the generic plot
function to visualise the spectrum.
plot(peaks(ms, 3071), type = "h",
xlab = "M/Z", ylab = "Intensity",
sub = formatRt(hd[3071, "retentionTime"]))
The importance of flexible access to specialised data becomes visible in the figure below (taken from the RforProteomics visualisation vignette). Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.
In this code chunks we start by selecting relevant spectra of interest. We will focus on the first MS1 spectrum acquired after 30 minutes of retention time.
## (1) Open raw data file
ms <- openMSfile(mzf)
## (2) Extract the header information
hd <- header(ms)
## (3) MS1 spectra indices
ms1 <- which(hd$msLevel == 1)
## (4) Select MS1 spectra with retention time between 30 and 35 minutes
rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35
## (5) Indices of the 1st and 2nd MS1 spectra after 30 minutes
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
## (6) Interleaved MS2 spectra
ms2 <- (i+1):(j-1)
Now now extract and plot all relevant information:
chromatogram
.chromatogram(ms)
chromatogram(ms)
abline(v = hd[i, "retentionTime"], col = "red")
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5))
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
length(ms2)
MS2 spectra highlighted above.par(mfrow = c(5, 2), mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Putting it all together:
layout(lout)
par(mar=c(4,2,1,1))
chromatogram(ms)
abline(v = hd[i, "retentionTime"], col = "red")
par(mar = c(3, 2, 1, 0))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
par(mar = c(3, 0.5, 1, 1))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5),
yaxt = "n")
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
par(mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Below, we illustrate some additional visualisation and animations of raw MS data, also taken from the RforProteomics visualisation vignette. On the left, we have a heatmap visualisation of a MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.
## (1) MS space heaptmap
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
## 1
ff <- colorRampPalette(c("yellow", "steelblue"))
trellis.par.set(regions=list(col=ff(100)))
m1 <- plot(M, aspect = 1, allTicks = FALSE)
## (2) Same data as (1), in 3 dimenstion
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)
## (3) The 2 MS1 and 10 interleaved MS2 spectra from above
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
## 1
m3 <- plot3D(M2)
grid.arrange(m1, m2, m3, ncol = 3)
Below, we have animations build from extracting successive slices as above.
Annotated spectra and comparing spectra.
par(mfrow = c(1, 2))
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE)
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))
The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:
calculateFragments("SIGFEGDSIGR")
## Modifications used: C=57.02146
## mz ion type pos z seq
## 1 88.03931 b1 b 1 1 S
## 2 201.12337 b2 b 2 1 SI
## 3 258.14483 b3 b 3 1 SIG
## 4 405.21324 b4 b 4 1 SIGF
## 5 534.25583 b5 b 5 1 SIGFE
## 6 591.27729 b6 b 6 1 SIGFEG
## 7 706.30423 b7 b 7 1 SIGFEGD
## 8 793.33626 b8 b 8 1 SIGFEGDS
## 9 906.42032 b9 b 9 1 SIGFEGDSI
## 10 963.44178 b10 b 10 1 SIGFEGDSIG
## 11 1119.54289 b11 b 11 1 SIGFEGDSIGR
## 12 175.11895 y1 y 1 1 R
## 13 232.14041 y2 y 2 1 GR
## 14 345.22447 y3 y 3 1 IGR
## 15 432.25650 y4 y 4 1 SIGR
## 16 547.28344 y5 y 5 1 DSIGR
## [ reached getOption("max.print") -- omitted 20 rows ]
Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipulate them and perform computations. The two spectra corresponding to the IMIDLDGTENK
peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra
for details).
There are 2 Bioconductor packages for peptide-spectrum matching directly in R, namely MSGFplus and rTANDEM, replying on MSGF+
and X!TANDEM
respectively. See also the MSGFgui package for visualisation of identification data.
There are a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending whether the features are labelled prior the MS acquisition and the MS level at which quantitation is inferred, namely MS1 or MS2.
Label-free | Labelled | |
---|---|---|
MS1 | XIC | SILAC, 15N |
MS2 | Counting | iTRAQ, TMT |
In terms of raw data quantitation, most efforts have been devoted to MS2-level quantitation. Label-free XIC quantitation has however been addressed in the frame of metabolomics data processing by the xcms
infrastructure.
An MSnExp
is converted to an MSnSet
by the quantitation
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy (defined by the iTRAQ4
parameter; other isobaric tags are available).
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose = FALSE)
exprs(msset)
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## X1.1 4483.320 4873.996 6743.441 4601.378
## X2.1 1918.082 1418.040 1117.601 1581.954
## X3.1 15210.979 15296.256 15592.760 16550.502
## X4.1 4133.103 5069.983 4724.845 4694.801
## X5.1 11947.881 13061.875 12809.491 12911.479
processingData(msset)
## - - - Processing information - - -
## Data loaded: Sun Jun 26 19:14:39 2016
## iTRAQ4 quantification by trapezoidation: Sun Jun 26 19:15:07 2016
## MSnbase version: 1.21.7
See also The isobar package supports quantitation from centroided mgf
peak lists or its own tab-separated files that can be generated from Mascot and Phenyx vendor files.
Other MS2 quantitation methods available in quantify
include the (normalised) spectral index SI
and (normalised) spectral abundance factor SAF
or simply a simple count method.
exprs(si <- quantify(msexp, method = "SIn"))
## Combined 2 features into 2 using sum
## dummyiTRAQ.mzXML
## ECA0510 0.003588641
## ECA1028 0.001470129
exprs(saf <- quantify(msexp, method = "NSAF"))
## Combined 2 features into 2 using user-defined function
## dummyiTRAQ.mzXML
## ECA0510 0.6235828
## ECA1028 0.3764172
Note that spectra that have not been assigned any peptide (NA
) or that match non-unique peptides (npsm > 1
) are discarded in the counting process.
The MSnID package provides enables to explore and assess the confidence of identification data using mzid
files. A subset of all peptide-spectrum matches, that pass a specific false discovery rate threshold can them be converted to an MSnSet
, where the number of peptide occurrences are used to populate the assay data.
The PSI mzTab
file format is aimed at providing a simpler (than XML formats) and more accessible file format to the wider community. It is composed of a key-value metadata section and peptide/protein/small molecule tabular sections.
mztf <- pxget(px, pxfiles(px)[2])
## Downloading 1 file
## F063721.dat-mztab.txt already present.
(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))
## Warning: Version 0.9 is deprecated. Please see '?readMzTabData' and '?
## MzTab' for details.
## Detected a metadata section
## Detected a peptide section
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## rowNames: sub[1] sub[2] ... sub[6] (6 total)
## varLabels: abundance
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 1528 (1528 total)
## fvarLabels: sequence accession ... uri (14 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## mzTab read: Sun Jun 26 19:15:14 2016
## MSnbase version: 1.21.7
It is also possible to import arbitrary spreadsheets as MSnSet
objects into R with the readMSnSet2
function. The main 2 arguments of the function are (1) a text-based spreadsheet or a data.frame
and (2) column names of indices that identify the quantitation data.
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
## [1] "\"Protein ID\"" "\"FBgn\""
## [3] "\"Flybase Symbol\"" "\"No. peptide IDs\""
## [5] "\"Mascot score\"" "\"No. peptides quantified\""
## [7] "\"area 114\"" "\"area 115\""
## [9] "\"area 116\"" "\"area 117\""
## [11] "\"PLS-DA classification\"" "\"Peptide sequence\""
## [13] "\"Precursor ion mass\"" "\"Precursor ion charge\""
## [15] "\"pd.2013\"" "\"pd.markers\""
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
## area.114 area.115 area.116 area.117
## 1 0.379000 0.281000 0.225000 0.114000
## 2 0.420000 0.209667 0.206111 0.163889
## 3 0.187333 0.167333 0.169667 0.476000
## 4 0.247500 0.253000 0.320000 0.179000
## 5 0.216000 0.183000 0.342000 0.259000
## 6 0.072000 0.212333 0.573000 0.142667
head(fData(res))
## Protein.ID FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1 CG10060 FBgn0001104 G-ialpha65A 3 179.86
## 2 CG10067 FBgn0000044 Act57B 5 222.40
## 3 CG10077 FBgn0035720 CG10077 5 219.65
## 4 CG10079 FBgn0003731 Egfr 2 86.39
## 5 CG10106 FBgn0029506 Tsp42Ee 1 52.10
## 6 CG10130 FBgn0010638 Sec61beta 2 79.90
## No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1 1 PM
## 2 9 PM
## 3 3
## 4 2 PM
## 5 1 GGVFDTIQK
## 6 3 ER/Golgi
## Precursor.ion.mass Precursor.ion.charge pd.2013 pd.markers
## 1 PM unknown
## 2 PM unknown
## 3 unknown unknown
## 4 PM unknown
## 5 626.887 2 Phenotype 1 unknown
## 6 ER/Golgi ER
Processing quantitative proteomics data is very similar to processing transcriptomics data. When working with MSnSet
instances, a set of methods for data processing and visualisation are readily available to streamline and track the process:
purityCorrect
to correct for isobaric tag impuritiesnormalise
(or normalize
) to, well, normalise data using quantile, vsn, mean, … normalisationfilterNA
to remove feature with (a certain proportion) of missing valuesimpute
to impute missing values using a wide range of methods (more on this below)image
, MAplot
, plot2D
(for dimensionality reduction using PCA, t-SNE, MDS, …), … to visualise datalibrary("msmsTests")
data(msms.dataset) ## an MSnSet
e <- pp.msms.data(msms.dataset)
head(exprs(e))
## U2.2502.1 U2.2502.2 U2.2502.3 U2.2502.4 U6.2502.1 U6.2502.2
## YJR104C 156 176 201 203 194 208
## YKL060C 183 205 202 171 171 171
## YDR155C 180 184 198 185 161 160
## YGR192C 176 172 174 170 170 171
## YOL086C 171 182 137 103 159 164
## YLR150W 97 111 104 100 99 103
## U6.2502.3 U6.2502.4 U2.0302.1 U2.0302.2 U2.0302.3 U6.0302.1
## YJR104C 215 217 246 266 261 249
## YKL060C 149 144 247 249 236 278
## YDR155C 166 168 193 175 155 179
## YGR192C 162 167 243 310 270 270
## YOL086C 116 113 409 384 364 395
## YLR150W 98 103 73 90 96 79
## U6.0302.2 U6.0302.3
## YJR104C 240 265
## YKL060C 252 247
## YDR155C 165 154
## YGR192C 275 259
## YOL086C 358 355
## YLR150W 81 88
pData(e)
## treat batch
## U2.2502.1 U200 2502
## U2.2502.2 U200 2502
## U2.2502.3 U200 2502
## U2.2502.4 U200 2502
## U6.2502.1 U600 2502
## U6.2502.2 U600 2502
## U6.2502.3 U600 2502
## U6.2502.4 U600 2502
## U2.0302.1 U200 0302
## U2.0302.2 U200 0302
## U2.0302.3 U200 0302
## U6.0302.1 U600 0302
## U6.0302.2 U600 0302
## U6.0302.3 U600 0302
null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e),2,sum)
res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat")
head(res)
## LogFC LR p.value
## YJR104C 0.02689909 0.2691922 0.603874157
## YKL060C -0.12646368 5.5829487 0.018136162
## YDR155C -0.18781161 10.2706901 0.001351602
## YGR192C -0.08495735 2.5941286 0.107260419
## YOL086C -0.11853786 5.7558869 0.016433498
## YLR150W -0.09299164 1.3766331 0.240675481
MSnSet
support MLInterfaces out-of-the-box.MSnSet
instances.data(naset)
naplot(naset, col = "black")
## features.na
## 0 1 2 3 4 8 9 10
## 301 247 91 13 2 23 10 2
## samples.na
## 34 39 41 42 43 45 47 49 51 52 53 55 56 57 61
## 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
One solution is to remove all or part of the features that have missing values (see ?filterNA
).
flt <- filterNA(naset)
processingData(flt)
## - - - Processing information - - -
## Subset [689,16][301,16] Sun Jun 26 19:15:15 2016
## Removed features with more than 0 NAs: Sun Jun 26 19:15:15 2016
## Dropped featureData's levels Sun Jun 26 19:15:15 2016
## MSnbase version: 1.15.6
Identification transfer between acquisitions (label-free): if a feature was not acquired in MS2 in one replicate, it is possible to find the ion in MS space based on the M/Z and retention time coordinates of the same ion in a replicate where it was identified. (An example of this is implemented in the synapter package).
It is of course possible to impute missing values (?impute
). This is however not a straightforward thing, as is likely to dramatically fail when a high proportion of data is missing (10s of %). But also, there are two types of mechanisms resulting in missing values in LC/MSMS experiments.
Missing values resulting from absence of detection of a feature, despite ions being present at detectable concentrations. For example in the case of ion suppression or as a result from the stochastic, data-dependent nature of the MS acquisition method. These missing value are expected to be randomly distributed in the data and are defined as missing at random (MAR) or missing completely at random (MCAR).
Biologically relevant missing values, resulting from the absence or the low abundance of ions (below the limit of detection of the instrument). These missing values are not expected to be randomly distributed in the data and are defined as missing not at random (MNAR).
Different imputation methods are more appropriate to different classes of missing values (as documented in this paper). Values missing at random, and those missing not at random should be imputed with different methods.
It is recommended to use hot deck methods (nearest neighbour (left), maximum likelihood, …) when data are missing at random.Conversely, MNAR features should ideally be imputed with a left-censor (minimum value (right), but not zero, …) method.
%
X
The following values are higher bounds, without peptide filtering for about 80000 gene groups
data(cvg)
hist(cvg$coverage, breaks = 100, xlab = "coverage", main = "")
And
the majority of peptides map to a minority of proteins different
peptides within one protein can be differently detectable in MS acquisitions
From Qeli and Ahrens (2010). See also Nesvizhskii and Aebersold (2005).
Often, in proteomics experiments, the features represent single proteins and groups of indistinguishable or non-differentiable proteins identified by shared (non-unique) peptides.
The latest UniProt
data(upens)
data(upenst)
Where upens
and upenst
where created by querying the Ensembl gene and transcript identifiers for all human UniProtKB accession numbers against the UniProt web services. (See ?upens
for details.)
All the Bioconductor annotation infrastructure, such as biomaRt, GO.db, organism specific annotations, .. are directly relevant to the analysis of proteomics data. Some proteomics-centred annotations such as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications are available through the rols. Data from the Human Protein Atlas (hpar) and UniProt (UniProt.ws) are also available.
The source used to generate this document is available here.
## R version 3.3.0 Patched (2016-05-11 r70599)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## attached base packages:
## [1] parallel methods stats graphics grDevices utils datasets
## [8] base
##
## other attached packages:
## [1] rpx_1.9.2 gplots_3.0.1
## [3] msmsTests_1.11.0 msmsEDA_1.11.0
## [5] mzID_1.11.2 gridExtra_2.2.1
## [7] lattice_0.20-33 BiocInstaller_1.23.4
## [9] AnnotationHub_2.5.4 RforProteomics_1.11.2
## [11] MSnbase_1.21.7 ProtGenerics_1.5.0
## [13] BiocParallel_1.7.4 mzR_2.7.3
## [15] Rcpp_0.12.5 Biobase_2.33.0
## [17] BiocGenerics_0.19.1 ProteomicsBioc2016Workshop_0.2.2
## [19] BiocStyle_2.1.11
##
## loaded via a namespace (and not attached):
## [1] httr_1.2.0 edgeR_3.15.0
## [3] vsn_3.41.0 interactiveDisplay_1.11.2
## [5] splines_3.3.0 foreach_1.4.3
## [7] R.utils_2.3.0 gtools_3.5.0
## [9] shiny_0.13.2 interactiveDisplayBase_1.11.3
## [11] highr_0.6 affy_1.51.0
## [13] stats4_3.3.0 RBGL_1.49.1
## [15] Category_2.39.0 impute_1.47.0
## [17] RSQLite_1.0.0 limma_3.29.12
## [19] RUnit_0.4.31 digest_0.6.9
## [21] RColorBrewer_1.1-2 qvalue_2.5.2
## [23] colorspace_1.2-6 htmltools_0.3.5
## [25] httpuv_1.3.3 preprocessCore_1.35.0
## [27] Matrix_1.2-6 R.oo_1.20.0
## [29] plyr_1.8.4 GSEABase_1.35.0
## [31] MALDIquant_1.14 XML_3.98-1.4
## [33] genefilter_1.55.2 zlibbioc_1.19.0
## [35] xtable_1.8-2 scales_0.4.0
## [37] gdata_2.17.0 affyio_1.43.0
## [39] annotate_1.51.0 biocViews_1.41.5
## [41] IRanges_2.7.11 ggplot2_2.1.0
## [43] survival_2.39-4 RJSONIO_1.3-0
## [45] magrittr_1.5 mime_0.4
## [47] evaluate_0.9 R.methodsS3_1.7.1
## [49] doParallel_1.0.10 MASS_7.3-45
## [51] graph_1.51.0 tools_3.3.0
## [53] formatR_1.4 stringr_1.0.0
## [55] S4Vectors_0.11.7 munsell_0.4.3
## [57] AnnotationDbi_1.35.3 pcaMethods_1.65.0
## [59] caTools_1.17.1 grid_3.3.0
## [61] RCurl_1.95-4.8 iterators_1.0.8
## [63] labeling_0.3 bitops_1.0-6
## [65] gtable_0.2.0 codetools_0.2-14
## [67] curl_0.9.7 DBI_0.4-1
## [69] reshape2_1.4.1 R6_2.1.2
## [71] knitr_1.13 KernSmooth_2.23-15
## [73] gridSVG_1.5-0 stringi_1.1.1