In this section, we will learn how to read raw data from in one of the
commonly used open formats (mzML
, mzXML
and netCDF
) into R.
|Data type |File format |Data structure |Package |
|:----------|:-------------|:----------------------------|:-----------------|
|Raw |mzXML or mzML |mzRpwiz or mzRramp |mzR |
|Raw |mzXML or mzML |list of MassSpectrum objects |MALDIquantForeign |
|Raw |mzXML or mzML |MSnExp |MSnbase |
|Peak lists |mgf |MSnExp |MSnbase |
|Raw |several |Spectra |Spectra |
When we manipulate complex data, we need a way to abstract it.
The need for and abstraction saves us from having to know about all the details of that data and its associated metadata. This allows to rely on a few easy-to-remember conventions to make mundane and repetitive tasks trivial and be able to complete more complex things easily. Abstractions provide a smoother approaches to handle complex data using common patterns.
Spectra
classWe are going to use the
Spectra
package
as an abstraction to raw mass spectrometry data.
Spectra
is part of the R for Mass Spectrometry
initiative initiative. It
defines the Spectra
class that is used as a raw data abstration, to
maniputate MS data and metadata. The best way to learn about a data
structure is to create one by hand.
Let’s create a DataFrame
4 As defined in the Bioconductor S4Vectors
package. containing MS levels, retention time, m/z and intensities
for 2 spectra:
spd <- DataFrame(msLevel = c(1L, 2L), rtime = c(1.1, 1.2))
spd$mz <- list(c(100, 103.2, 104.3, 106.5), c(45.6, 120.4, 190.2))
spd$intensity <- list(c(200, 400, 34.2, 17), c(12.3, 15.2, 6.8))
spd
## DataFrame with 2 rows and 4 columns
## msLevel rtime mz intensity
## <integer> <numeric> <list> <list>
## 1 1 1.1 100.0,103.2,104.3,... 200.0,400.0, 34.2,...
## 2 2 1.2 45.6,120.4,190.2 12.3,15.2, 6.8
And now convert this DataFrame
into a Spectra
object:
## MSn data (Spectra) with 2 spectra in a MsBackendDataFrame backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 1.1 NA
## 2 2 1.2 NA
## ... 16 more variables/columns.
Explore the newly created object using
spectraVariables
to extract all the metadata variables.spectraData
to extract all the metadata.peaksData
to extract a list containing thet raw data.[
to create subsets.Spectra
from mzML filesLet’s now create a new object using the mzML data previously
downloaded and available in the mzf
file.
## [1] "~/.cache/rpx/10c1f4e4c6980_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## MSn data (Spectra) with 7534 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.4584 1
## 2 1 0.9725 2
## 3 1 1.8524 3
## 4 1 2.7424 4
## 5 1 3.6124 5
## ... ... ... ...
## 7530 2 3600.47 7530
## 7531 2 3600.83 7531
## 7532 2 3601.18 7532
## 7533 2 3601.57 7533
## 7534 2 3601.98 7534
## ... 33 more variables/columns.
##
## file(s):
## 10c1f4e4c6980_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
length()
.Mass spectrometry data in Spectra
objects can be thought of as a
list of individual spectra, with each spectrum having a set of
variables associated with it. Besides core spectra variables (such
as MS level or retention time) an arbitrary number of optional
variables can be assigned to a spectrum. The core spectra variables
all have their own accessor method and it is guaranteed that a value
is returned by it (or NA
if the information is not available). The
core variables and their data type are (alphabetically ordered):
integer(1)
: the index of acquisition of a
spectrum during a MS run.logical(1)
: whether the spectrum is in profile or
centroid mode.numeric(1)
: collision energy used to create an
MSn spectrum.character(1)
: the origin of the spectrum’s data,
e.g. the mzML file from which it was read.character(1)
: the (current) storage location of the
spectrum data. This value depends on the backend used to handle and
provide the data. For an in-memory backend like the
MsBackendDataFrame
this will be "<memory>"
, for an on-disk
backend such as the MsBackendHdf5Peaks
it will be the name of the
HDF5 file where the spectrum’s peak data is stored.numeric
: intensity values for the spectrum’s peaks.numeric(1)
: lower m/z for the isolation
window in which the (MSn) spectrum was measured.numeric(1)
: the target m/z for the
isolation window in which the (MSn) spectrum was measured.numeric(1)
: upper m/z for the isolation
window in which the (MSn) spectrum was measured.integer(1)
: the MS level of the spectrum.numeric
: the m/z values for the spectrum’s peaks.integer(1)
: the polarity of the spectrum (0
and 1
representing negative and positive polarity, respectively).integer(1)
: the scan (acquisition) number of the
precursor for an MSn spectrum.integer(1)
: the charge of the precursor of an
MSn spectrum.numeric(1)
: the intensity of the precursor of
an MSn spectrum.numeric(1)
: the m/z of the precursor of an MSn
spectrum.numeric(1)
: the retention time of a spectrum.integer(1)
: the index of a spectrum within a (raw)
file.logical(1)
: whether the spectrum was smoothed.For details on the individual variables and their getter/setter
function see the help for Spectra
(?Spectra
). Also note that these
variables are suggested, but not required to characterize a
spectrum. Also, some only make sense for MSn, but not for MS1 spectra.
msLevel(.)
) or using the $
notation (for example .$msLevel
).plotSpectra()
function to visualise peaks and set the m/z range with the xlim
arguments.Using the first raw data file starting with MS3TMT10
, answer the
following questions:
These objects and their manipulations are not limited to single files:
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML"
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML"
##
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML
## 931
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML
## 931
Backends allow to use different backends to store mass spectrometry data while
providing via the Spectra
class a unified interface to use that data. The
Spectra
package defines a set of example backends but any object extending the
base MsBackend
class could be used instead. The default backends are:
MsBackendMzR
: this backend keeps only general spectra variables in memory
and relies on the mzR package to read mass peaks (m/z and
intensity values) from the original MS files on-demand.## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML
MsBackendDataFrame
: the mass spectrometry data is stored (in-memory) in a
DataFrame
. Keeping the data in memory guarantees high performance but has
also, depending on the number of mass peaks in each spectrum, a much higher
memory footprint.## MSn data (Spectra) with 1862 spectra in a MsBackendDataFrame backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 33 more variables/columns.
## Processing:
## Switch backend from MsBackendMzR to MsBackendDataFrame [Wed Mar 17 22:21:57 2021]
MsBackendHdf5Peaks
: similar to MsBackendMzR
this backend reads peak data
only on-demand from disk while all other spectra variables are kept in
memory. The peak data are stored in Hdf5 files which guarantees scalability.## MSn data (Spectra) with 1862 spectra in a MsBackendHdf5Peaks backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_1_105-134.h5
## 20171016_POOL_POS_3_105-134.h5
## Processing:
## Switch backend from MsBackendMzR to MsBackendHdf5Peaks [Wed Mar 17 22:22:06 2021]
##
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML
## 931
## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML
## 931
##
## /tmp/RtmpdvVy6F/20171016_POOL_POS_1_105-134.h5
## 931
## /tmp/RtmpdvVy6F/20171016_POOL_POS_3_105-134.h5
## 931
All of the above mentioned backends support changing all of their their spectra
variables, except the MsBackendMzR
that does not support changing m/z or
intensity values for the mass peaks.
With the example below we load the data from a single mzML file and use a
MsBackendHdf5Peaks
backend for data storage. The hdf5path
parameter allows
us to specify the storage location of the HDF5 file.
There is also an (under development) SQLite-based backend called
MsBackendSqlDb
that will store all data, i.e. raw and metadata, on disk.
mzR
(optional)The mzR
package in a direct interface to the
proteowizard code base. It
includes a substantial proportion of pwiz’s C/C++ code for fast and
efficient parsing of these large raw data files.
Let’s start by using some raw data files from the msdata
package. After loading it, we use the proteomics()
function to
return the full file names for two raw data files. We will start by
focusing on the second one.
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/MRM-standmix-5.mzML.gz"
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/MS3TMT10_01022016_32917-33481.mzML.gz"
## [3] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/MS3TMT11.mzML"
## [4] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
## [5] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
The three main functions of mzR
are
openMSfile
to create a file handle to a raw data fileheader
to extract metadata about the spectra contained in the filepeaks
to extract one or multiple spectra of interest.Other functions such as instrumentInfo
, or runInfo
can be used to
gather general information about a run.
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz
## Number of scans: 7534
## [1] 7534 31
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
## [23] "filterString" "spectrumId"
## [25] "centroided" "ionMobilityDriftTime"
## [27] "isolationWindowTargetMZ" "isolationWindowLowerOffset"
## [29] "isolationWindowUpperOffset" "scanWindowLowerLimit"
## [31] "scanWindowUpperLimit"
## [,1] [,2]
## [1,] 399.9976 0
## [2,] 399.9991 0
## [3,] 400.0006 0
## [4,] 400.0021 0
## [5,] 400.2955 0
## [6,] 400.2970 0
## List of 5
## $ : num [1:25800, 1:2] 400 400 400 400 400 ...
## $ : num [1:25934, 1:2] 400 400 400 400 400 ...
## $ : num [1:26148, 1:2] 400 400 400 400 400 ...
## $ : num [1:26330, 1:2] 400 400 400 400 400 ...
## $ : num [1:26463, 1:2] 400 400 400 400 400 ...
Let’s extract the index of the MS2 spectrum with the highest base peak intensity and plot its spectrum. Is the data centroided or in profile mode?
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 5404 5404 5404 2 1 275 2283283712
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 5404 2751.313 859.5032 354288224 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 5404 0 100.5031 1995.63 5403 859.1722
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 5404 3 627820480 NA NA
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 5404 NA NA 0.03474091
## filterString
## 5404 FTMS + p NSI d Full ms2 859.50@hcd45.00 [100.00-2000.00]
## spectrumId centroided
## 5404 controllerType=0 controllerNumber=1 scan=5404 TRUE
## ionMobilityDriftTime isolationWindowTargetMZ isolationWindowLowerOffset
## 5404 NA 859.5 1
## isolationWindowUpperOffset scanWindowLowerLimit scanWindowUpperLimit
## 5404 1 100 2000
Pick an MS1 spectrum and visually check whether it is centroided or in profile mode.
The importance of flexible access to specialised data becomes visible
in the figure below (taken from the RforProteomics
visualisation
vignette).
Not only can we access specific data and understand/visualise them,
but we can transverse all the data and extracted/visualise/understand
structured slices of data.
The figure below show is an illustration of how mass spectrometry works:
The chromatogram at the top display to total ion current along the retention time. The vertical line identifies one scan in particular at retention time 1800.68 seconds (the 2807th scan).
The spectra on the second line represent the full MS1 spectrum marked by the red line. The vertical lines identify the 10 precursor ions that where selected for MS2 analysis. The zoomed in on the right shows one specific precursor peak.
The MS2 spectra displayed along the two rows at the bottom are those resulting from the fragmentation of the 10 precursor peaks identified by the vertical bars above.
We are going to reproduce the figure above trought a set of exercices.
totIonCurrent
and rtime
variables for all MS1 spectra. Annotate the spectrum of
interest.filterPrecursorScan()
function can be used to retains parent
(MS1) and children scans (MS2) of a scan, as defined by its
acquisition number. Use it to extract the MS1 scan of interest and
all its MS2 children.## MSn data (Spectra) with 11 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 1800.68 2807
## 2 2 1801.26 2808
## 3 2 1801.92 2809
## 4 2 1802.20 2810
## 5 2 1802.48 2811
## 6 2 1802.77 2812
## 7 2 1803.05 2813
## 8 2 1803.34 2814
## 9 2 1803.64 2815
## 10 2 1803.93 2816
## 11 2 1804.21 2817
## ... 33 more variables/columns.
##
## file(s):
## 10c1f4e4c6980_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Processing:
## Filter: select parent/children scans for 2807 [Wed Mar 17 22:22:17 2021]
plotSpectra()
function is used to plot all 10 MS2 spectra in
one call.It is possible to label the peaks with the plotSpectra()
function. The labels
argument is either a character
of appropriate
length (i.e. with a label for each peak) or, as illustrated below, a
function that computes the labels.
mzLabel <- function(z) {
z <- peaksData(z)[[1L]]
lbls <- format(z[, "mz"], digits = 4)
lbls[z[, "intensity"] < 1e5] <- ""
lbls
}
plotSpectra(ms_2[7],
xlim = c(126, 132),
labels = mzLabel,
labelSrt = -30, labelPos = 2,
labelOffset = 0.1)
Spectra can also be compared either by overlay or mirror plotting
using the plotSpectraOverlay()
and plotSpectraMirror()
functions.
## [1] 37
plotSpectraOverlay()
and
plotSpectraMirror()
functions.Apart from classical subsetting operations such as [
and split
,
a set of filter functions are defined for Spectra
objects (for
detailed help please see the ?Spectra
help):
filterAcquisitionNum
: retain spectra with certain acquisition numbers.filterDataOrigin
: subset to spectra from specific origins.filterDataStorage
: subset to spectra from certain data storage files.filterEmptySpectra
: remove spectra without mass peaks.filterMzRange
: subset spectra keeping only peaks with an m/z within the
provided m/z range.filterMzValues
: subset spectra keeping only peaks matching provided m/z
value(s).filterIsolationWindow
: keep spectra with the provided mz
in their
isolation window (m/z range).filterMsLevel
: filter by MS level.filterPolarity
: filter by polarity.filterPrecursorMz
: retain (MSn) spectra with a precursor m/z within the
provided m/z range.filterPrecursorScan
: retain (parent and children) scans of an acquisition
number.filterRt
: filter based on retention time ranges.Using the sp_sciex
data, select all spectra measured in the second
mzML file and subsequently filter them to retain spectra measured
between 175 and 189 seconds in the measurement run.
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_1_105-134.mzML"
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML"
## [1] 931
## [1] 50
## MSn data (Spectra) with 50 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 175.212 628
## 2 1 175.491 629
## 3 1 175.770 630
## 4 1 176.049 631
## 5 1 176.328 632
## ... ... ... ...
## 46 1 187.768 673
## 47 1 188.047 674
## 48 1 188.326 675
## 49 1 188.605 676
## 50 1 188.884 677
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_3_105-134.mzML
## Processing:
## Filter: select data origin(s) /home/lgatto/R/x86_64-pc-linux-gnu-library/4.0/msdata/sciex/20171016_POOL_POS_3_105-134.mzML [Wed Mar 17 22:22:19 2021]
## Filter: select retention time [175..189] on MS level(s) 1 [Wed Mar 17 22:22:19 2021]
As an example of data processing, below we use the pickPeaks()
function to pick peaks:
Most functions on Spectra
support (and use) parallel processing out
of the box. Peak data access and manipulation methods perform by
default parallel processing on a per-file basis (i.e. using the
dataStorage variable as splitting factor). Spectra uses
BiocParallel
for
parallel processing and all functions use the default registered
parallel processing setup of that package.
Data manipulations on Spectra objects are not immediately applied to
the peak data. They are added to a so called processing queue which is
applied each time peak data is accessed (with the peaksData
, mz
or
intensity
functions). Thanks to this processing queue data
manipulation operations are also possible for read-only backends
(e.g. mzML-file based backends or database-based backends). The
information about the number of such processing steps can be seen
below (next to Lazy evaluation queue).
## [1] 0
sp_sciex <- filterIntensity(sp_sciex, intensity = c(10, Inf))
sp_sciex ## Note the lazy evaluation queue
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML
## Lazy evaluation queue: 1 processing step(s)
## Processing:
## Remove peaks with intensities outside [10, Inf] in spectra of MS level(s) 1. [Wed Mar 17 22:22:19 2021]
## [1] 412
## [[1]]
## Object of class "ProcessingStep"
## Function: user-provided function
## Arguments:
## o intensity = 10Inf
## o msLevel = 1
## list()
## [1] 0
Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. “MSnbase, Efficient and Elegant R-Based Processing and Visualisation of Raw Mass Spectrometry Data.” J. Proteome Res., September.
Page built: 2021-03-17