Abstract
This vignette describes the use of the MSnbase package for centroiding of profile-mode mass spectrometry data.
Mass spectrometers measure data in so-called profile mode, where the signal corresponding to a specific ion is distributed around the ion’s actual m/z value (Smith et al. 2014). The accuracy of that signal depends on the resolution and settings of the instrument. Profile mode data can be processed into centroid data by retaining only a single, representative value, typically the local maximum of the distribution of data points. This centroiding substantially reduces the amount of data without much loss of information. Certain algorithms, such as the centWave method in the xcms package for chromatographic peak detection in LC-MS experiments or proteomics search engines that match MS2 spectra to peptides, require the data to be in centroid mode. In this vignette, we will focus on metabolomics data.
Many manufacturers apply centroiding of the profile data, either directly during acquisition or immediately thereafter, so that the user receives already processed data. Alternatively, third-party software, such as msconvert from the ProteoWizard suite (Chambers et al. 2012), allows the user to apply various centroiding algorithms, including vendor methods. In some cases, however, the software provided by vendors generates centroided data of poor quality.
MSnbase also provides some functionality to perform centroiding of profile MS data. These processed data can then be further quantified or analysed within R or serialised to mzML files, and used as input for other software.
In this vignette we use a subset of metabolomics profile-mode LC-MS data from pooled human serum samples measured on an AB Sciex TripleTOF 5600+ mass spectrometer; the chromatography employed was hydrophilic interaction high-performance liquid chromatography (HILIC HPLC). The mzML file contains profile mode data for an m/z range from 105 to 134 and a retention time from 0 to 240 seconds. For more details on the sample see ?msdata::sciexdata. Below we load the required packages and read the MS data.
##  "20171016_POOL_POS_3_105-134.mzML"
We next extract the profile MS data for the [M+H]+ adduct of serine with the expected m/z of 106.049871. We thus filter the
data_prof object using an m/z range containing the signal for the metabolite and a retention time window from 175 to 187 seconds corresponding to the time when the analyte elutes from the LC.
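The range-based filtering described above can be sketched in base R on a small table of data points. This is only an illustration of the idea, not the MSnbase implementation: the data points are invented, and the m/z window of ± 0.01 around the expected m/z is an assumption (the text only states that the range contains the signal for the metabolite).

```r
## Expected m/z of the [M+H]+ ion of serine (from the text).
serine_mz <- 106.049871
## ASSUMPTION: an m/z window of +/- 0.01 around the expected m/z.
mzr <- c(serine_mz - 0.01, serine_mz + 0.01)
## Retention time window in seconds (from the text).
rtr <- c(175, 187)

## Hypothetical data points; all values are made up for illustration.
pts <- data.frame(
    rt = c(170.2, 176.0, 181.07, 186.5, 190.3),
    mz = c(106.0501, 106.0497, 106.0495, 106.0712, 106.0499),
    intensity = c(50, 400, 900, 80, 30)
)

## Keep only points inside both the retention time and the m/z range.
keep <- pts$rt >= rtr[1] & pts$rt <= rtr[2] &
    pts$mz >= mzr[1] & pts$mz <= mzr[2]
serine_pts <- pts[keep, ]
serine_pts
```

Only points that fall inside both windows are retained; a point with a matching retention time but an m/z outside the window is dropped, and vice versa.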
We can now plot the profile MS data for serine.
The lower panel in the plot above shows all the individual signal intensities measured by the mass spectrometer over the retention time and m/z ranges of interest. The upper panel displays the base peak chromatogram (BPC), which represents the maximum signal (across the range of m/z values) for each discrete retention time. The rows of points in the lower panel indicate the resolution of the mass spectrometer, while the columns of data points (i.e. the data collected for a discrete retention time point) represent the signal for the ion in one spectrum.
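To make the definition of the base peak chromatogram concrete, the following base R sketch computes it for a small, invented set of data points (the table layout and all values are assumptions for illustration, not the serine data): for each retention time, the maximum intensity across all m/z values is kept.

```r
## Hypothetical profile-mode data points: one row per measured
## (retention time, m/z, intensity) triple. All values are made up.
pts <- data.frame(
    rt = rep(c(180.5, 180.8, 181.1), each = 4),
    mz = rep(c(106.046, 106.048, 106.050, 106.052), times = 3),
    intensity = c(120, 480, 510, 150,
                  200, 900, 870, 260,
                  90, 300, 420, 110)
)

## Base peak chromatogram: the maximum intensity per retention time,
## taken across all m/z values in the selected range.
bpc <- tapply(pts$intensity, pts$rt, max)
bpc
```

Each element of `bpc` corresponds to one column of points in the lower panel, that is, to one spectrum.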
Below we plot the signal for one of the 43 spectra containing signal for serine, the one at retention time 181.07 seconds.
The MS instrument recorded a signal along the m/z range in discrete intervals (which depend on the resolution of the instrument). The profile-mode signal of the serine ion at the respective retention time (the mass peak) therefore consists of multiple intensities that approximately follow a Gaussian distribution.
As described in the introduction, centroiding aims to reduce this signal distribution to a single representative intensity, a single data point, for the ion in a spectrum. The simplest approach selects the largest intensity for each mass peak and reports its intensity and m/z value. This can be done using the pickPeaks method with default parameters as shown below.
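The idea behind this simplest approach can be illustrated with a minimal base R sketch on a simulated mass peak. This is not the code used by pickPeaks (which, among other things, also estimates noise); the m/z grid, the peak width and the true m/z below are assumptions chosen for illustration.

```r
## Hypothetical profile-mode mass peak: intensities sampled on a
## discrete m/z grid, roughly Gaussian around an assumed true m/z.
mz <- seq(106.040, 106.060, by = 0.002)
true_mz <- 106.0499
intensity <- 1000 * exp(-(mz - true_mz)^2 / (2 * 0.003^2))

## Local maxima: points larger than both direct neighbours.
n <- length(intensity)
is_max <- c(FALSE,
            intensity[2:(n - 1)] > intensity[1:(n - 2)] &
            intensity[2:(n - 1)] > intensity[3:n],
            FALSE)

## Each local maximum is reported as one centroid (m/z, intensity).
centroids <- data.frame(mz = mz[is_max], intensity = intensity[is_max])
centroids
```

Note that the reported m/z is always one of the sampled grid values, so it can differ from the true m/z by up to half the grid spacing.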
After centroiding, the data consist of a single intensity for each mass peak. In the example above, however, the centroids from consecutive scans do not share the same m/z value but fluctuate between the discrete m/z values defined by the instrument’s resolution. For lower intensity signals this variation can be substantial.
To further illustrate this, we plot below the centroided signal for the [M+H]+ ion of proline.
For proline, the centroids jump between 3 bins of m/z values in consecutive scans and the chromatographic data does not show a nice, regular peak shape. Additional data processing, such as data smoothing prior to centroiding and/or refining the centroid’s m/z can reduce these effects and improve overall data quality as we will see in the next section.
While the simple centroiding using
pickPeaks as described in the previous section might be sufficient for many experiments and setups, MS data smoothing and refinement of the identified centroids’ m/z values can improve data quality.
Raw mass spectrometry data is usually smoothed in m/z dimension by applying e.g. a Savitzky-Golay filter (Savitzky and Golay 1964) which reduces the noise and hence improves data quality. Below we use the
smooth method to apply a Savitzky-Golay filter with a half-window size of 4 to the data within each spectrum (see
?smooth for more details on the parameters).
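For readers unfamiliar with the Savitzky-Golay filter, the sketch below implements it in base R: within a sliding window of 2 × half-window + 1 points, a polynomial is fitted by least squares and the fitted value at the window centre replaces the original value. The function names `sg_coef` and `sg_smooth` are hypothetical helpers, a polynomial order of 3 is assumed, and the noisy test signal is simulated; this illustrates the principle and is not the code that the smooth method runs.

```r
## Savitzky-Golay smoothing coefficients from a least-squares
## polynomial fit over a window of 2 * half_window + 1 points.
sg_coef <- function(half_window, poly_order = 3L) {
    x <- -half_window:half_window
    X <- outer(x, 0:poly_order, `^`)   # design matrix, columns x^0 .. x^p
    ## Row 1 of (X'X)^-1 X' yields the fitted value at the centre (x = 0).
    solve(crossprod(X), t(X))[1, ]
}

## Apply the filter to a vector. The first and last half_window
## values are NA because the window does not fit there.
sg_smooth <- function(y, half_window = 4L, poly_order = 3L) {
    w <- sg_coef(half_window, poly_order)
    as.numeric(stats::filter(y, rev(w), sides = 2))
}

## Simulated noisy mass peak (all values are made up).
set.seed(1)
mz <- seq(106.03, 106.07, by = 0.002)
noisy <- 1000 * exp(-(mz - 106.05)^2 / (2 * 0.004^2)) +
    rnorm(length(mz), sd = 30)
smoothed <- sg_smooth(noisy, half_window = 4L)
```

Because the filter fits a polynomial rather than taking a plain average, it tends to preserve the height and position of a peak better than a simple moving mean of the same width.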
We next apply the simple peak picking on the smoothed data, filter the desired retention time and m/z ranges, and subsequently plot the resulting centroided data for serine.
Smoothing the raw data prior to peak picking improved the quality of the centroided data of serine as well as proline as can be seen below.
The smoothed centroided data for proline still show a systematic deviation of m/z values as well as poor chromatographic data.
In addition to smoothing the signal in m/z dimension, we can also smooth the signal along the retention time dimension using the
combineSpectraMovingWindow function. This function aggregates signal for the same m/z value from neighbouring spectra in a moving window approach, thus smoothing the chromatographic data (by replacing the intensity in the middle spectrum by the average signal of all intensities for the m/z in the spectra within the defined window). To reduce the run-time of the example we apply the smoothing only to the profile-mode data for a retention time window containing proline (in a real data analysis this should be performed on the full data).
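The moving window averaging along the retention time dimension can be sketched in base R as follows. The sketch assumes that the m/z values of neighbouring spectra have already been matched, so that intensities can be stored in a matrix with one row per spectrum and one column per m/z value; the helper name `smooth_rt`, the shrinking of the window at the first and last spectrum, and the intensity values are assumptions for illustration and need not match what combineSpectraMovingWindow does internally.

```r
## Average intensities of the same m/z across neighbouring spectra.
## ints: matrix, rows = consecutive spectra, columns = matched m/z values.
smooth_rt <- function(ints, half_window = 1L) {
    n <- nrow(ints)
    out <- ints
    for (i in seq_len(n)) {
        ## Window of spectra around spectrum i, truncated at the edges.
        idx <- max(1L, i - half_window):min(n, i + half_window)
        out[i, ] <- colMeans(ints[idx, , drop = FALSE])
    }
    out
}

## Hypothetical jagged chromatographic signal for two m/z values.
ints <- cbind(mz_a = c(100, 400, 250, 700, 300, 150),
              mz_b = c(20, 10, 60, 30, 40, 10))
smooth_rt(ints, half_window = 1L)
```

A larger half window gives a smoother chromatographic peak at the cost of broadening it.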
As can be seen above, smoothing in retention time dimension improves the chromatographic peak shape of proline.
Note however that, to combine data from multiple spectra, the combineSpectraMovingWindow function first has to load the full data into memory (i.e. it converts the OnDiskMSnExp object into an MSnExp object) and that it also returns an MSnExp object. In a real use case it is thus advisable to apply
combineSpectraMovingWindow separately on each file of an experiment and to export the results as an mzML file using the
Thus far we applied only a simple peak picking strategy, but the pickPeaks method can also refine the m/z value of an identified centroid by considering the signal from the full mass peak, or parts of it. Currently two methods are implemented, descendPeak and kNeighbors, which can be selected by passing either refineMz = "descendPeak" or refineMz = "kNeighbors" to the pickPeaks method. The m/z value of the reported centroid is then calculated as the intensity-weighted mean of the m/z values of the selected m/z-intensity pairs from the mass peak. This can improve the accuracy of the reported m/z values. The two methods differ only in the way in which the peak area for the final calculation is defined:
kNeighbors takes the k m/z-intensity pairs (default k = 2) to the left and right of the centroid, whereas descendPeak walks down from the centroid on both sides until the signal is below signalPercentage% of the centroid’s intensity (by default 33%), or until the signal increases again. All m/z-intensity pairs within this range are used for the weighted average calculation of the centroid’s m/z value.
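The two ways of defining the peak area can be sketched in base R as below. The functions `weighted_mz`, `refine_k_neighbors` and `refine_descend_peak` are hypothetical helpers written from the description above, not the MSnbase code, and the simulated mass peak (grid, width, true m/z) is an assumption for illustration.

```r
## Intensity-weighted mean m/z over the selected points of a mass peak.
weighted_mz <- function(mz, intensity, idx) {
    sum(mz[idx] * intensity[idx]) / sum(intensity[idx])
}

## kNeighbors-like: use k points to the left and right of the apex.
refine_k_neighbors <- function(mz, intensity, apex, k = 2L) {
    idx <- max(1L, apex - k):min(length(mz), apex + k)
    weighted_mz(mz, intensity, idx)
}

## descendPeak-like: walk down from the apex on both sides while the
## signal keeps decreasing and stays at or above signal_percentage %
## of the apex intensity.
refine_descend_peak <- function(mz, intensity, apex,
                                signal_percentage = 33) {
    threshold <- intensity[apex] * signal_percentage / 100
    left <- apex
    while (left > 1L && intensity[left - 1L] < intensity[left] &&
           intensity[left - 1L] >= threshold)
        left <- left - 1L
    right <- apex
    while (right < length(mz) && intensity[right + 1L] < intensity[right] &&
           intensity[right + 1L] >= threshold)
        right <- right + 1L
    weighted_mz(mz, intensity, left:right)
}

## Simulated mass peak whose true m/z lies between two grid points.
mz <- seq(106.040, 106.060, by = 0.002)
true_mz <- 106.0492
intensity <- 1000 * exp(-(mz - true_mz)^2 / (2 * 0.003^2))
apex <- which.max(intensity)

mz_k <- refine_k_neighbors(mz, intensity, apex, k = 2L)
mz_d <- refine_descend_peak(mz, intensity, apex, signal_percentage = 33)
## Absolute m/z error without and with refinement.
abs(c(apex = mz[apex], kNeighbors = mz_k, descendPeak = mz_d) - true_mz)
```

In this simulated case both refined values lie closer to the assumed true m/z than the apex itself, because the weighted mean is no longer restricted to the sampled m/z grid.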
Below we first extract the data for serine and then plot the smoothed and centroided data without and with m/z refinement.
## Extract the data for serine
serine_sg_cent <- data_sg_cent %>%
    filterRt(rtr) %>%
    filterMz(mzr)
serine_sg_cent_mz <- data_sg_cent_mz %>%
    filterRt(rtr) %>%
    filterMz(mzr)
## Plot the data
layout(matrix(1:4, ncol = 2))
## No m/z refinement
plot(serine_sg_cent, type = "XIC", layout = NULL)
abline(h = serine_mz, col = "red", lty = 2)
abline(v = rtime(serine_sg_cent), col = "red", lty = 3)
## With m/z refinement
plot(serine_sg_cent_mz, type = "XIC", layout = NULL)
abline(h = serine_mz, col = "red", lty = 2)
abline(v = rtime(serine_sg_cent_mz), col = "red", lty = 3)
As shown above (right), m/z refinement improved the accuracy of the centroided data: the difference between the m/z value of the centroid with the largest signal and the theoretical m/z value for the [M+H]+ ion of serine is reduced.
For the simple peak picking on raw data the difference is:
##  -0.0003682412
For this centroid, smoothing alone does not change the difference:
##  -0.0003682412
Refining the m/z value during centroiding, however, improves the accuracy:
##  -0.0001703475
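Mass accuracy is often reported relative to the m/z, in parts per million (ppm). As a small worked example, the absolute differences reported above for serine can be converted as follows (the variable names are chosen here for illustration):

```r
## Theoretical m/z of the [M+H]+ ion of serine (from the text).
serine_mz <- 106.049871

## Differences between centroid m/z and theoretical m/z reported above.
diff_raw <- -0.0003682412       # simple peak picking
diff_refined <- -0.0001703475   # with m/z refinement

## Relative error in parts per million.
ppm_raw <- diff_raw / serine_mz * 1e6
ppm_refined <- diff_refined / serine_mz * 1e6
round(c(raw = ppm_raw, refined = ppm_refined), 2)
```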
Similarly, the m/z refinement also improved the accuracy for proline.
proline_sg_cent <- data_prof %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
    pickPeaks() %>%
    filterRt(prol_rtr) %>%
    filterMz(prol_mzr)
proline_sg_cent_mz <- data_prof %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
    pickPeaks(refineMz = "descendPeak") %>%
    filterRt(prol_rtr) %>%
    filterMz(prol_mzr)
layout(matrix(1:4, ncol = 2))
plot(proline_sg_cent, type = "XIC", layout = NULL)
abline(h = prol_mz, col = "red", lty = 2)
abline(v = rtime(proline_sg_cent_mz), col = "red", lty = 3)
plot(proline_sg_cent_mz, type = "XIC", layout = NULL)
abline(h = prol_mz, col = "red", lty = 2)
abline(v = rtime(proline_sg_cent_mz), col = "red", lty = 3)
The difference between the m/z of the centroid with the largest signal and the theoretical m/z for the [M+H]+ ion of proline is shown below.
##  0.001577028
##  5.75261e-05
##  -3.241556e-05
Summarizing, smoothing raw profile MS data, e.g. by applying a Savitzky-Golay filter, improves quality considerably. Additional smoothing in retention time dimension can be advantageous too, specifically for the chromatographic peak shape. Accuracy can be further improved for smoothed profile MS data by refining the m/z value of the identified centroids.
Chambers, M C, B Maclean, R Burke, D Amodei, D L Ruderman, S Neumann, L Gatto, et al. 2012. “A Cross-Platform Toolkit for Mass Spectrometry and Proteomics.” Nat Biotechnol 30 (10): 918–20. https://doi.org/10.1038/nbt.2377.
Savitzky, Abraham, and M J E Golay. 1964. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry 36 (8): 1627–39.
Smith, Rob, Andrew D Mathis, Dan Ventura, and John T Prince. 2014. “Proteomics, Lipidomics, Metabolomics: A Mass Spectrometry Tutorial from a Computer Scientist’s Point of View.” BMC Bioinformatics 15 Suppl 7 (Suppl 7): S9.