Chapter 3 Raw MS data

In this section, we will learn how to read raw data from in one of the commonly used open formats (mzML, mzXML and netCDF) into R.

|Data type  |File format   |Data structure               |Package           |
|:----------|:-------------|:----------------------------|:-----------------|
|Raw        |mzXML or mzML |mzRpwiz or mzRramp           |mzR               |
|Raw        |mzXML or mzML |list of MassSpectrum objects |MALDIquantForeign |
|Raw        |mzXML or mzML |MSnExp                       |MSnbase           |
|Peak lists |mgf           |MSnExp                       |MSnbase           |
|Raw        |several       |Spectra                      |Spectra           |

3.1 What is raw data in R

When we manipulate complex data, we need a way to abstract it.

The need for and abstraction saves us from having to know about all the details of that data and its associated metadata. This allows to rely on a few easy-to-remember conventions to make mundane and repetitive tasks trivial and be able to complete more complex things easily. Abstractions provide a smoother approaches to handle complex data using common patterns.

Figure 3.1: Schematic representation of what is referred to by raw data: a collection of mass spectra and a table containing spectrum-level annotations along the lines. Raw data are imported from one of the many community-maintained open standards formats (mzML, mzXML, mzData or ANDI-MS/netCDF) (Figure taken from (Gatto, Gibb, and Rainer 2020)).

Schematic representation of what is referred to by *raw data*: a collection of mass spectra and a table containing spectrum-level annotations along the lines. Raw data are imported from one of the many community-maintained open standards formats (mzML, mzXML, mzData or ANDI-MS/netCDF) (Figure taken from [@Gatto:2020]).

3.1.1 The `Spectra` class

We are going to use the Spectra package as an abstraction to raw mass spectrometry data.

library(Spectra)

Spectra is part of the R for Mass Spectrometry initiative initiative. It defines the Spectra class that is used as a raw data abstration, to maniputate MS data and metadata. The best way to learn about a data structure is to create one by hand.

Let’s create a DataFrame44 As defined in the Bioconductor S4Vectors package. containing MS levels, retention time, m/z and intensities for 2 spectra:

spd <- DataFrame(msLevel = c(1L, 2L), rtime = c(1.1, 1.2))
spd$mz <- list(c(100, 103.2, 104.3, 106.5), c(45.6, 120.4, 190.2))
spd$intensity <- list(c(200, 400, 34.2, 17), c(12.3, 15.2, 6.8))
spd

## DataFrame with 2 rows and 4 columns
##     msLevel     rtime                    mz             intensity
##   <integer> <numeric>                <list>                <list>
## 1         1       1.1 100.0,103.2,104.3,... 200.0,400.0, 34.2,...
## 2         2       1.2      45.6,120.4,190.2        12.3,15.2, 6.8

And now convert this DataFrame into a Spectra object:

sp0 <- Spectra(spd)
sp0

## MSn data (Spectra) with 2 spectra in a MsBackendDataFrame backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         1       1.1        NA
## 2         2       1.2        NA
##  ... 16 more variables/columns.

Exercise

Explore the newly created object using

spectraVariables to extract all the metadata variables.
spectraData to extract all the metadata.
peaksData to extract a list containing thet raw data.
[ to create subsets.

3.1.2 `Spectra` from mzML files

Let’s now create a new object using the mzML data previously downloaded and available in the mzf file.

mzf

## [1] "~/.cache/rpx/10c1f4e4c6980_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"

sp <- Spectra(mzf)
sp

## MSn data (Spectra) with 7534 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    0.4584         1
## 2            1    0.9725         2
## 3            1    1.8524         3
## 4            1    2.7424         4
## 5            1    3.6124         5
## ...        ...       ...       ...
## 7530         2   3600.47      7530
## 7531         2   3600.83      7531
## 7532         2   3601.18      7532
## 7533         2   3601.57      7533
## 7534         2   3601.98      7534
##  ... 33 more variables/columns.
## 
## file(s):
## 10c1f4e4c6980_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML

Exercise

Repeat the data manipulations above.
Check the number of scans in the object with length().
Note the difference in the first line when showing the object in the console. We will get back to this idea of backend later.

Mass spectrometry data in Spectra objects can be thought of as a list of individual spectra, with each spectrum having a set of variables associated with it. Besides core spectra variables (such as MS level or retention time) an arbitrary number of optional variables can be assigned to a spectrum. The core spectra variables all have their own accessor method and it is guaranteed that a value is returned by it (or NA if the information is not available). The core variables and their data type are (alphabetically ordered):

acquisitionNum integer(1): the index of acquisition of a spectrum during a MS run.
centroided logical(1): whether the spectrum is in profile or centroid mode.
collisionEnergy numeric(1): collision energy used to create an MSn spectrum.
dataOrigin character(1): the origin of the spectrum’s data, e.g. the mzML file from which it was read.
dataStorage character(1): the (current) storage location of the spectrum data. This value depends on the backend used to handle and provide the data. For an in-memory backend like the MsBackendDataFrame this will be "<memory>", for an on-disk backend such as the MsBackendHdf5Peaks it will be the name of the HDF5 file where the spectrum’s peak data is stored.
intensity numeric: intensity values for the spectrum’s peaks.
isolationWindowLowerMz numeric(1): lower m/z for the isolation window in which the (MSn) spectrum was measured.
isolationWindowTargetMz numeric(1): the target m/z for the isolation window in which the (MSn) spectrum was measured.
isolationWindowUpperMz numeric(1): upper m/z for the isolation window in which the (MSn) spectrum was measured.
msLevel integer(1): the MS level of the spectrum.
mz numeric: the m/z values for the spectrum’s peaks.
polarity integer(1): the polarity of the spectrum (0 and 1 representing negative and positive polarity, respectively).
precScanNum integer(1): the scan (acquisition) number of the precursor for an MSn spectrum.
precursorCharge integer(1): the charge of the precursor of an MSn spectrum.
precursorIntensity numeric(1): the intensity of the precursor of an MSn spectrum.
precursorMz numeric(1): the m/z of the precursor of an MSn spectrum.
rtime numeric(1): the retention time of a spectrum.
scanIndex integer(1): the index of a spectrum within a (raw) file.
smoothed logical(1): whether the spectrum was smoothed.

For details on the individual variables and their getter/setter function see the help for Spectra (?Spectra). Also note that these variables are suggested, but not required to characterize a spectrum. Also, some only make sense for MSn, but not for MS1 spectra.

Exercise

Extract a set of spectra variables using the accessor (for example msLevel(.)) or using the $ notation (for example .$msLevel).
How many MS level are there, and how many scans of each level?
Extract the index of the MS2 spectrum with the highest base peak intensity.
Are the data centroided or in profile mode?
Pick a spectrum of each level and visually check whether it is centroided or in profile mode. You can use the plotSpectra() function to visualise peaks and set the m/z range with the xlim arguments.

Exercise

Using the first raw data file starting with MS3TMT10, answer the following questions:

How many spectra are there in that file?
How many MS levels, and how many spectra per MS level?
What is the index of the MS2 spectrum with the highest precursor intensity?
Plot one spectrum of each level. Are they centroided or in profile mode?

These objects and their manipulations are not limited to single files:

(fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE))

## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML"
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML"

sp_sciex <- Spectra(fls)
table(dataOrigin(sp_sciex))

## 
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML 
##                                                                                          931 
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML 
##                                                                                          931

3.1.3 Backends

Backends allow to use different backends to store mass spectrometry data while providing via the Spectra class a unified interface to use that data. The Spectra package defines a set of example backends but any object extending the base MsBackend class could be used instead. The default backends are:

MsBackendMzR: this backend keeps only general spectra variables in memory and relies on the mzR package to read mass peaks (m/z and intensity values) from the original MS files on-demand.

sp_sciex

## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 33 more variables/columns.
## 
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML

MsBackendDataFrame: the mass spectrometry data is stored (in-memory) in a DataFrame. Keeping the data in memory guarantees high performance but has also, depending on the number of mass peaks in each spectrum, a much higher memory footprint.

setBackend(sp_sciex, MsBackendDataFrame())

## MSn data (Spectra) with 1862 spectra in a MsBackendDataFrame backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 33 more variables/columns.
## Processing:
##  Switch backend from MsBackendMzR to MsBackendDataFrame [Wed Mar 17 22:21:57 2021]

MsBackendHdf5Peaks: similar to MsBackendMzR this backend reads peak data only on-demand from disk while all other spectra variables are kept in memory. The peak data are stored in Hdf5 files which guarantees scalability.

sp_hdf5 <- setBackend(sp_sciex, MsBackendHdf5Peaks(), hdf5path = tempdir())
sp_hdf5

## MSn data (Spectra) with 1862 spectra in a MsBackendHdf5Peaks backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 33 more variables/columns.
## 
## file(s):
##  20171016_POOL_POS_1_105-134.h5
##  20171016_POOL_POS_3_105-134.h5
## Processing:
##  Switch backend from MsBackendMzR to MsBackendHdf5Peaks [Wed Mar 17 22:22:06 2021]

table(sp_hdf5$dataOrigin)

## 
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML 
##                                                                                          931 
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML 
##                                                                                          931

table(sp_hdf5$dataStorage)

## 
## /tmp/RtmpdvVy6F/20171016_POOL_POS_1_105-134.h5 
##                                            931 
## /tmp/RtmpdvVy6F/20171016_POOL_POS_3_105-134.h5 
##                                            931

All of the above mentioned backends support changing all of their their spectra variables, except the MsBackendMzR that does not support changing m/z or intensity values for the mass peaks.

With the example below we load the data from a single mzML file and use a MsBackendHdf5Peaks backend for data storage. The hdf5path parameter allows us to specify the storage location of the HDF5 file.

There is also an (under development) SQLite-based backend called MsBackendSqlDb that will store all data, i.e. raw and metadata, on disk.

3.2 Under the hood: `mzR` (optional)

The mzR package in a direct interface to the proteowizard code base. It includes a substantial proportion of pwiz’s C/C++ code for fast and efficient parsing of these large raw data files.

Let’s start by using some raw data files from the msdata package. After loading it, we use the proteomics() function to return the full file names for two raw data files. We will start by focusing on the second one.

f <- msdata::proteomics(full.names = TRUE)
f

## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/MRM-standmix-5.mzML.gz"                                                
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/MS3TMT10_01022016_32917-33481.mzML.gz"                                 
## [3] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/MS3TMT11.mzML"                                                         
## [4] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
## [5] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"

(f2 <- grep("20141210", f, value = TRUE))

## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"

The three main functions of mzR are

openMSfile to create a file handle to a raw data file
header to extract metadata about the spectra contained in the file
peaks to extract one or multiple spectra of interest.

Other functions such as instrumentInfo, or runInfo can be used to gather general information about a run.

library("mzR")
ms <- openMSfile(f2)
ms

## Mass Spectrometry file handle.
## Filename:  TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz 
## Number of scans:  7534

hd <- header(ms)
dim(hd)

## [1] 7534   31

names(hd)

##  [1] "seqNum"                     "acquisitionNum"            
##  [3] "msLevel"                    "polarity"                  
##  [5] "peaksCount"                 "totIonCurrent"             
##  [7] "retentionTime"              "basePeakMZ"                
##  [9] "basePeakIntensity"          "collisionEnergy"           
## [11] "ionisationEnergy"           "lowMZ"                     
## [13] "highMZ"                     "precursorScanNum"          
## [15] "precursorMZ"                "precursorCharge"           
## [17] "precursorIntensity"         "mergedScan"                
## [19] "mergedResultScanNum"        "mergedResultStartScanNum"  
## [21] "mergedResultEndScanNum"     "injectionTime"             
## [23] "filterString"               "spectrumId"                
## [25] "centroided"                 "ionMobilityDriftTime"      
## [27] "isolationWindowTargetMZ"    "isolationWindowLowerOffset"
## [29] "isolationWindowUpperOffset" "scanWindowLowerLimit"      
## [31] "scanWindowUpperLimit"

head(peaks(ms, 117))

##          [,1] [,2]
## [1,] 399.9976    0
## [2,] 399.9991    0
## [3,] 400.0006    0
## [4,] 400.0021    0
## [5,] 400.2955    0
## [6,] 400.2970    0

str(peaks(ms, 1:5))

## List of 5
##  $ : num [1:25800, 1:2] 400 400 400 400 400 ...
##  $ : num [1:25934, 1:2] 400 400 400 400 400 ...
##  $ : num [1:26148, 1:2] 400 400 400 400 400 ...
##  $ : num [1:26330, 1:2] 400 400 400 400 400 ...
##  $ : num [1:26463, 1:2] 400 400 400 400 400 ...

Exercise

Let’s extract the index of the MS2 spectrum with the highest base peak intensity and plot its spectrum. Is the data centroided or in profile mode?

hd2 <- hd[hd$msLevel == 2, ]
i <- which.max(hd2$basePeakIntensity)
hd2[i, ]

##      seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 5404   5404           5404       2        1        275    2283283712
##      retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 5404      2751.313   859.5032         354288224              45
##      ionisationEnergy    lowMZ  highMZ precursorScanNum precursorMZ
## 5404                0 100.5031 1995.63             5403    859.1722
##      precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 5404               3          627820480         NA                  NA
##      mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 5404                       NA                     NA    0.03474091
##                                                  filterString
## 5404 FTMS + p NSI d Full ms2 859.50@hcd45.00 [100.00-2000.00]
##                                         spectrumId centroided
## 5404 controllerType=0 controllerNumber=1 scan=5404       TRUE
##      ionMobilityDriftTime isolationWindowTargetMZ isolationWindowLowerOffset
## 5404                   NA                   859.5                          1
##      isolationWindowUpperOffset scanWindowLowerLimit scanWindowUpperLimit
## 5404                          1                  100                 2000

pi <- peaks(ms, hd2[i, 1])
plot(pi, type = "h")

mz <- hd2[i, "basePeakMZ"]
plot(pi, type = "h", xlim = c(mz - 0.5, mz + 0.5))

Exercise

Pick an MS1 spectrum and visually check whether it is centroided or in profile mode.

## Zooming into spectrum 300 (an MS1 spectrum).
j <- 300
pj <- peaks(ms, j)
plot(pj, type = "l")

mz <- hd[j, "basePeakMZ"]
plot(pj, type = "l", xlim = c(mz - 0.5, mz + 0.5))

3.3 Visualisation of raw MS data

The importance of flexible access to specialised data becomes visible in the figure below (taken from the RforProteomics visualisation vignette). Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.

The figure below show is an illustration of how mass spectrometry works:

The chromatogram at the top display to total ion current along the retention time. The vertical line identifies one scan in particular at retention time 1800.68 seconds (the 2807th scan).
The spectra on the second line represent the full MS1 spectrum marked by the red line. The vertical lines identify the 10 precursor ions that where selected for MS2 analysis. The zoomed in on the right shows one specific precursor peak.
The MS2 spectra displayed along the two rows at the bottom are those resulting from the fragmentation of the 10 precursor peaks identified by the vertical bars above.

We are going to reproduce the figure above trought a set of exercices.

Exercise

The chromatogram can be created by extracting the totIonCurrent and rtime variables for all MS1 spectra. Annotate the spectrum of interest.

with(spectraData(filterMsLevel(sp, 1)),
     plot(rtime, totIonCurrent, type = "l"))
abline(v = rtime(sp)[2807], col = "red")

The filterPrecursorScan() function can be used to retains parent (MS1) and children scans (MS2) of a scan, as defined by its acquisition number. Use it to extract the MS1 scan of interest and all its MS2 children.

ms_2 <- filterPrecursorScan(sp, 2807)
ms_2

## MSn data (Spectra) with 11 spectra in a MsBackendMzR backend:
##      msLevel     rtime scanIndex
##    <integer> <numeric> <integer>
## 1          1   1800.68      2807
## 2          2   1801.26      2808
## 3          2   1801.92      2809
## 4          2   1802.20      2810
## 5          2   1802.48      2811
## 6          2   1802.77      2812
## 7          2   1803.05      2813
## 8          2   1803.34      2814
## 9          2   1803.64      2815
## 10         2   1803.93      2816
## 11         2   1804.21      2817
##  ... 33 more variables/columns.
## 
## file(s):
## 10c1f4e4c6980_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Processing:
##  Filter: select parent/children scans for 2807 [Wed Mar 17 22:22:17 2021]

Plot the MS1 spectrum of interest and highlight all the peaks that will be selected for MS2 analysis.

plotSpectra(sp[2807], xlim = c(400, 1000))
abline(v = precursorMz(ms_2)[-1], col = "grey")
abline(v = precursorMz(ms_2)[2], col = "red")

Zoom in mz values 521.1 and 522.5 to reveal the isotopic envelope of that peak.

plotSpectra(sp[2807], xlim = c(521.2, 522.5), type = "l")
abline(v = precursorMz(ms_2)[2], col = "red")

The plotSpectra() function is used to plot all 10 MS2 spectra in one call.

plotSpectra(ms_2[-1])

It is possible to label the peaks with the plotSpectra() function. The labels argument is either a character of appropriate length (i.e. with a label for each peak) or, as illustrated below, a function that computes the labels.

mzLabel <- function(z) {
    z <- peaksData(z)[[1L]]
    lbls <- format(z[, "mz"], digits = 4)
    lbls[z[, "intensity"] < 1e5] <- ""
    lbls
}

plotSpectra(ms_2[7],
            xlim = c(126, 132),
            labels = mzLabel,
            labelSrt = -30, labelPos = 2,
            labelOffset = 0.1)

Spectra can also be compared either by overlay or mirror plotting using the plotSpectraOverlay() and plotSpectraMirror() functions.

Exercice

Filter MS2 level spectra.
Find any 2 MS2 spectra that have matching precursor peaks based on the precursor m/z values.

sp2 <- filterMsLevel(sp, 2L)
anyDuplicated(precursorMz(filterMsLevel(sp, 2)))

## [1] 37

i <- which(precursorMz(sp2) == precursorMz(sp2)[37])
sp2i <- sp2[i]

Visualise the matching pair using the plotSpectraOverlay() and plotSpectraMirror() functions.

plotSpectraOverlay(sp2i, col = c("red", "steelblue"))

plotSpectraMirror(sp2i[1], sp2i[2])

3.4 Raw data processing and manipulation

Apart from classical subsetting operations such as [ and split, a set of filter functions are defined for Spectra objects (for detailed help please see the ?Spectra help):

filterAcquisitionNum: retain spectra with certain acquisition numbers.
filterDataOrigin: subset to spectra from specific origins.
filterDataStorage: subset to spectra from certain data storage files.
filterEmptySpectra: remove spectra without mass peaks.
filterMzRange: subset spectra keeping only peaks with an m/z within the provided m/z range.
filterMzValues: subset spectra keeping only peaks matching provided m/z value(s).
filterIsolationWindow: keep spectra with the provided mz in their isolation window (m/z range).
filterMsLevel: filter by MS level.
filterPolarity: filter by polarity.
filterPrecursorMz: retain (MSn) spectra with a precursor m/z within the provided m/z range.
filterPrecursorScan: retain (parent and children) scans of an acquisition number.
filterRt: filter based on retention time ranges.

Exercise

Using the sp_sciex data, select all spectra measured in the second mzML file and subsequently filter them to retain spectra measured between 175 and 189 seconds in the measurement run.

fls <- unique(dataOrigin(sp_sciex))
fls

## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML"
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML"

file_2 <- filterDataOrigin(sp_sciex, dataOrigin = fls[2])
length(file_2)

## [1] 931

sps_sub <- filterRt(file_2, rt = c(175, 189))
length(sps_sub)

## [1] 50

library("magrittr")
sp_sciex %>%
    filterDataOrigin(fls[2]) %>%
    filterRt(c(175, 189))

## MSn data (Spectra) with 50 spectra in a MsBackendMzR backend:
##       msLevel     rtime scanIndex
##     <integer> <numeric> <integer>
## 1           1   175.212       628
## 2           1   175.491       629
## 3           1   175.770       630
## 4           1   176.049       631
## 5           1   176.328       632
## ...       ...       ...       ...
## 46          1   187.768       673
## 47          1   188.047       674
## 48          1   188.326       675
## 49          1   188.605       676
## 50          1   188.884       677
##  ... 33 more variables/columns.
## 
## file(s):
## 20171016_POOL_POS_3_105-134.mzML
## Processing:
##  Filter: select data origin(s) /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML [Wed Mar 17 22:22:19 2021]
##  Filter: select retention time [175..189] on MS level(s) 1 [Wed Mar 17 22:22:19 2021]

As an example of data processing, below we use the pickPeaks() function to pick peaks:

plotSpectra(sp[2807], xlim = c(521.2, 522.5))

library("magrittr")
pickPeaks(sp[2807]) %>%
    filterIntensity(1e7) %>%
    plotSpectra(xlim = c(521.2, 522.5))

3.5 A note on efficiency

3.5.1 Backends

Figure 3.2: (a) Reading time (triplicates, in seconds) and (b) data size in memory (in MB) to read/store 1, 5, and 10 files containing 1431 MS1 (on-disk only) and 6103 MS2 (on-disk and in-memory) spectra. (c) Filtering benchmark assessed over 10 interactions on in-memory and on-disk data containing 6103 MS2 spectra. (d) Access time to spectra for the in-memory (left) and on-disk (right) backends for 1, 10, 100 1000, 5000, and all 6103 spectra. Benchmarks were performed on a Dell XPS laptop with an Intel i5-8250U processor 1.60 GHz (4 cores, 8 threads), 7.5 GB RAM running Ubuntu 18.04.4 LTS 64-bit, and an SSD drive. The data used for the benchmarking are a TMT 4-plex experiment acquired on a LTQ Orbitrap Velos (Thermo Fisher Scientific) available in the msdata package . (Figure taken from (Gatto, Gibb, and Rainer 2020).

(a) Reading time (triplicates, in seconds) and (b) data size in memory (in MB) to read/store 1, 5, and 10 files containing 1431 MS1 (on-disk only) and 6103 MS2 (on-disk and in-memory) spectra. (c) Filtering benchmark assessed over 10 interactions on in-memory and on-disk data containing 6103 MS2 spectra. (d) Access time to spectra for the in-memory (left) and on-disk (right) backends for 1, 10, 100 1000, 5000, and all 6103 spectra. Benchmarks were performed on a Dell XPS laptop with an Intel i5-8250U processor 1.60 GHz (4 cores, 8 threads), 7.5 GB RAM running Ubuntu 18.04.4 LTS 64-bit, and an SSD drive. The data used for the benchmarking are a TMT 4-plex experiment acquired on a LTQ Orbitrap Velos (Thermo Fisher Scientific) available in the msdata package . (Figure taken from [@Gatto:2020].

3.5.2 Parallel processing

Most functions on Spectra support (and use) parallel processing out of the box. Peak data access and manipulation methods perform by default parallel processing on a per-file basis (i.e. using the dataStorage variable as splitting factor). Spectra uses BiocParallel for parallel processing and all functions use the default registered parallel processing setup of that package.

3.5.3 Lazy evaluation

Data manipulations on Spectra objects are not immediately applied to the peak data. They are added to a so called processing queue which is applied each time peak data is accessed (with the peaksData, mz or intensity functions). Thanks to this processing queue data manipulation operations are also possible for read-only backends (e.g. mzML-file based backends or database-based backends). The information about the number of such processing steps can be seen below (next to Lazy evaluation queue).

min(intensity(sp_sciex[1]))

## [1] 0

sp_sciex <- filterIntensity(sp_sciex, intensity = c(10, Inf))
sp_sciex ## Note the lazy evaluation queue

## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 33 more variables/columns.
## 
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML
## Lazy evaluation queue: 1 processing step(s)
## Processing:
##  Remove peaks with intensities outside [10, Inf] in spectra of MS level(s) 1. [Wed Mar 17 22:22:19 2021]

min(intensity(sp_sciex[1]))

## [1] 412

sp_sciex@processingQueue

## [[1]]
## Object of class "ProcessingStep"
##  Function: user-provided function
##  Arguments:
##   o intensity = 10Inf
##   o msLevel = 1

sp_sciex <- reset(sp_sciex)
sp_sciex@processingQueue

## list()

min(intensity(sp_sciex[1]))

## [1] 0

References

Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. “MSnbase, Efficient and Elegant R-Based Processing and Visualisation of Raw Mass Spectrometry Data.” J. Proteome Res., September.

Page built: 2021-03-17

Chapter 3 Raw MS data

3.1 What is raw data in R

3.1.1 The Spectra class

Exercise

3.1.2 Spectra from mzML files

Exercise

Exercise

Exercise

3.1.3 Backends

3.2 Under the hood: mzR (optional)

Exercise

Exercise

3.3 Visualisation of raw MS data

Exercise

Exercice

3.4 Raw data processing and manipulation

Exercise

3.5 A note on efficiency

3.5.1 Backends

3.5.2 Parallel processing

3.5.3 Lazy evaluation

References

3.1.1 The `Spectra` class

3.1.2 `Spectra` from mzML files

3.2 Under the hood: `mzR` (optional)