Abstract

In this course, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The workflow is aimed at a beginner to intermediate level, such as seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.


This material is available under a Creative Commons CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.

If you (re-)use this material, please cite the following reference:

Gatto, Laurent. (2019, January). Bioconductor tools for mass spectrometry and proteomics. Zenodo. http://doi.org/10.5281/zenodo.2547971

1 Introduction

Before we start:

If you identify typos, or if there are parts that you would like to see expanded or clarified, please let me know by telling me directly (during workshops), opening a GitHub issue or emailing me. Please also briefly specify your background/familiarity with mass spectrometry and/or proteomics (beginner, intermediate or expert) so that I can update the material accordingly.

In recent years, we have seen an increase in the number of R and Bioconductor packages to analyse mass spectrometry and proteomics data, as well as an increase in their total number of downloads. See the vignette Proteomics packages in Bioconductor for more details and the code underlying these figures.

[Figures: Number of packages; Number of downloads]

It is also worth highlighting that several of these packages have become group efforts, supported by several developers in the community. This post illustrates the various contributions to MSnbase. mzR has benefited from a similarly wide range of successful contributions. Both packages, and in particular mzR, are used by many others, and will be described in some detail in this workflow.

This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:

  • Exploring available infrastructure
  • Mass spectrometry data
  • Getting data from proteomics repositories
  • Handling raw MS data
  • Handling identification data
  • MS/MS database search
  • Analysing search results
  • High-level data interface
  • Quantitative proteomics
  • Importing third-party quantitation data
  • Data processing and analysis
  • Statistical analysis
  • Machine learning
  • Annotation
  • Other relevant packages/pipelines

Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contain relevant material.

Other material

This workflow provides a general introduction to Bioconductor software for mass spectrometry and proteomics. If you are interested in

  • The application of machine learning to proteomics data, in particular spatial proteomics (i.e. the sub-cellular localisation of proteins), follow the tutorial vignette from the pRoloc package, accessible with vignette("pRoloc-tutorial", package = "pRoloc") or online.
  • The analysis of identification data to retain the most reliable PSMs, see the MSnID vignette, accessible with vignette("msnid_vignette", package = "MSnID") or online (the section Analysing search results below is a summary of that vignette). In addition, the vignettes of the msmsTest package describe how to analyse spectral counting data using packages dedicated to the analysis of high throughput sequencing data.
  • The analysis of MSE data-independent acquisition (DIA) data, see the vignettes in the synapter package.
  • The processing and analysis of MALDI-MS data, read the MALDIquant introduction accessible with vignette("MALDIquant-intro", package = "MALDIquant") and available online.
  • The processing and analysis of imaging mass spectrometry (IMS), read the Cardinal walkthrough vignette accessible with vignette("Cardinal-walkthrough", package = "Cardinal") and online.
  • …
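The vignette names given above may change between package releases. As a minimal sketch, the vignettes shipped with an installed package can be listed before opening one. The helper name list_vignettes is an assumption introduced here for illustration; it only relies on base R's vignette() and system.file() functions.

```r
## Hypothetical helper (not part of any package): list the vignettes
## that ship with an installed package
list_vignettes <- function(pkg) {
    ## Return an empty vector if the package is not installed
    if (!nzchar(system.file(package = pkg)))
        return(character(0))
    ## vignette(package = ) returns an object whose 'results' matrix
    ## has one row per vignette and an "Item" column with its name
    unname(vignette(package = pkg)$results[, "Item"])
}

list_vignettes("pRoloc")
## A listed vignette can then be opened by name, for example
## vignette("pRoloc-tutorial", package = "pRoloc")
```

The helper returns an empty character vector rather than an error when the package is missing, so it can be run safely before the setup below has been completed.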

Setup

The following packages will be used throughout this document. R version 3.5 or higher is required to install all the packages using BiocManager::install.

library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
library("ensembldb")

The most convenient way to install most of the tutorial's requirements (and more related content) is to install RforProteomics with all its dependencies.

## Install BiocManager from CRAN if it is not yet available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RforProteomics", dependencies = TRUE)
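
To verify the installation, it can help to check which of the packages loaded above are present in the current library. The following is a minimal sketch using base R only; the variable names is_installed and missing_pkgs are introduced here for illustration.

```r
## Packages loaded at the start of this workflow
pkgs <- c("mzR", "mzID", "MSnID", "MSnbase", "rpx", "MLInterfaces",
          "pRoloc", "pRolocdata", "MSGFplus", "rols", "hpar", "ensembldb")

## TRUE for every package found in the current library paths
## (system.file() returns an empty string for absent packages)
is_installed <- vapply(pkgs,
                       function(p) nzchar(system.file(package = p)),
                       logical(1))

missing_pkgs <- pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
    message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
    ## Uncomment to install only what is missing:
    ## BiocManager::install(missing_pkgs)
} else {
    message("All workflow packages are installed.")
}
```

Unlike library(), this check does not load any package, so it is quick and has no side effects on the session.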

Other packages of interest, such as rTANDEM or MSGFgui, will be described later in the document but are not required to execute the code in this workflow.

Exploring available infrastructure

On-line

In Bioconductor version 3.6, there are 92 proteomics and 62 mass spectrometry software packages, and 17 mass spectrometry experiment packages. These packages can be listed with the proteomicsPackages(), massSpectrometryPackages() and massSpectrometryDataPackages() functions and explored interactively, or browsed through the respective biocViews on the Bioconductor web page.

library("RforProteomics")
pp <- proteomicsPackages()
DT::datatable(pp)
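
As a non-interactive alternative, such a table can be searched with standard data frame operations. The sketch below assumes the table is a data frame with a Title column, which should be checked with colnames(pp) first. The helper find_packages and the toy table with its placeholder rows are hypothetical and only stand in for the real output of proteomicsPackages().

```r
## Hypothetical helper: keep the rows whose 'column' matches 'pattern'
find_packages <- function(tab, pattern, column = "Title") {
    stopifnot(is.data.frame(tab), column %in% colnames(tab))
    keep <- grepl(pattern, tab[[column]], ignore.case = TRUE)
    tab[keep, , drop = FALSE]
}

## Toy stand-in table (placeholder names and titles, not real packages)
toy <- data.frame(
    Package = c("pkgA", "pkgB", "pkgC"),
    Title = c("Import of raw mass spectrometry data",
              "Spatial proteomics classification",
              "Quantitation of labelled peptides"),
    stringsAsFactors = FALSE)

find_packages(toy, "proteomics")
## With the real table, under the assumption stated above:
## find_packages(pp, "quantit")
```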