Standardised and reproducible analysis of mass spectrometry-based single-cell proteomics data

11 minute read

We are excited to anounce the official release of the scp package. scp has been accepted in Bioconductor and published as its first stable release (version 1.0) on the 28 October. We present here a transcript of a recent talk Christophe Vanderaa and Laurent Gatto gave at the SCP2020 conference.

The images in this post were taken from our slides

Introduction

Mass spectrometry-based single cell proteomics is a young and exciting field that comes with many challenges. Great progress has been made in recent years, and this conference is a hallmark of its success and promises. In this presentation, we, that is Christophe Vanderaa and myself, would like to share our efforts in proposing an infrastructure for standardised and reproducible analysis of such data. We will illustrate this infrastructure by presenting a replication of the recent SCoPE2 data by Harrison Specht et al.

The value of replication

We will focus here on computational replication, starting from the low-level (PSM) quantitation data. We will posit that results that can be reproduced are more trustable.

  • Expectation: the processing and analysis of data leads to results and discoveries
  • Reality: many steps of the analysis pipeline can go wrong leading to inaccurate, misleading, wrong results.

A pipeline that can’t be reproduced is a pipeline that is ill defined, that leads, mostly, to false discoveries. Of course, replication isn’t a guarantee for accuracy, but an analysis that can’t be replicated is one that can’t be trusted.

Replication is a first step towards a trustworthy and shareable process. The development of non-trivial software tools isn’t done out of the blue, in isolation. Ideally, it is a collaboration between the developer, the data producer and the user. In other words Replication- or Reproduction-based development is an efficient approach to tool development.

Replication is of course only the first step to define sound data infrastructure and principled analysis.

Why SCoPE2

The goal is not to focus on one type of single cell data exclusively. Indeed, many of the processing steps that will be mentioned are ubiquitous in quantitative data processing.

SCoPE2 offered an ideal reproduction-based development use case: full protocol, data and analysis scripts are available, plus support from the authors.

SCP2020-presentation

Our objectives

  1. Contribute a standardised and principled data and analysis that is broadly applicable, i.e. to any MS-based single cell proteomics data.
  2. Open, transparent and reproducible computational infrastructure that can be leveraged to further improve data analysis and interpretation.
  3. R and Bioconductor offer an ideal environment for this.

SCP2020-presentation

scp package

Data infrastructure

QFeatures

QFeatures (also recently accepted in Bioconductor) is a general infrastructure (i.e. not specifically for MS-based single cell data) to manipulate quantitative data from MS experiments. Proteomics data is multi-level: data is acquired at the (low) spectrum level and protein-level data is progressively built up through multiple processing steps. The goal of QFeatures is to explicitly record the successive steps so as to allow users to navigate up and down these different levels.

SCP2020-presentation

The following example illustrates what this means using a real data example. At the end of an analysis pipeline, it is easy to check any proteins, here STAT1 and STAT3, and verify the summarised expression profiles at the protein level, the expression profiles and the peptides level or at the spectrum level for labelled MS2 methods. Here, we can immediately visualise that the former is composed of a single PSM/peptide, while the other displays a more coherent profile for 10 PSMs/9 peptides.

SCP2020-presentation

SingleCellExperiment

The scp package also relies on the work of colleagues working on single cell RNA sequencing, in particular on the SingleCellExperiment package. The figure here represents how the SingleCellExperiment data class captures single cell RNASeq data, which of course also fit single cell proteomics data.

SCP2020-presentation

Such an experiment is composed of primary and transformed quantitative data, with genes/peptides/proteins along the rows and cells along the columns.

  • A table that captures metadata along the rows.
  • A table that captures metadata along the columns.

An additional slot to store data from various dimensionality reduction techniques, an important and widely used tool in single cell data analysis.

scp

The scp package relies on these two pieces of software, SingleCellExperiment to individually handle PSM, peptide and protein level data, that are all, collectively linked and managed by QFeatures. In addition, and most importantly, scp offers the function to process and analyse the single cell proteomics data.

SCP2020-presentation

Load data

To start the replication, we have formatted the supplementary data from the SCoPE2 preprint using the scp infrastructure that was just introduced. The dataset is stored in a data package called scpdata. So, we load scpdata and fetch the dataset that we called specht2019v2.

SCP2020-presentation

The show method gives a compact overview of the data object. The header indicates the dataset is a QFeatures object and it contains 179 assays. Remember, each assay within the dataset is stored as a SingleCellExperiment object. Assays 1 to 177 contain the quantitative PSM data and metadata for the different SCoPE2 sets. They contain either 11 or 16 columns, depending on whether the TMT-11 or 16 protocol was used. The before last assay contains the peptide data for all sets. The last assay is the protein data assay.

The tabular data generated by MaxQuant or Proteome Discoverer can easily be converted to a QFeatures object using our readSCP function.

Metadata

The sample metadata that is common to all assays are stored in a single table called the colData where each row represents a sample in one of the assays and the columns are the metadata fields. In this example, the Set field gives the name of the SCoPE2 set, Channel in which channel the sample was acquired, SampleType the type of sample, lcbatch the chromatographic batch, etc.

SCP2020-presentation

So, all useful information is contained in a single object, ready for processing.

Analysis workflow

The SCoPE2 analysis is composed of a series of data processing steps and I just mentioned loading the data. We start the workflow with PSM level data acquired over 177 different sets.

These data will go through feature filtering, division of expression channel by the reference channel, aggregating the data to peptides, combine the different sets in one, single-cell filtering, normalization, removal of highly missing peptides, and log-transformation. After these steps, we get the peptide data that is provided in the preprint.

The peptides are then aggregated to proteins, normalized, imputed and batch corrected to get the processed protein data from the article.

SCP2020-presentation

We will not cover all steps but focus on a few representative ones.

scp showcase

Let’s see how it works

PSM filtering

One common functionality is to remove low-confidence features based on the metadata stored along with the dataset.

For example, we want to remove PSMs that are matched to the reversed database that serves as a decoy database. This information was already generated by MaxQuant and is stored in the Reverse field of the metadata. The filterFeatures function takes our dataset and a filtering condition. In this case, the features for which Reverse is not positive. The function then returns the dataset containing only PSM that passed the condition. This is done automatically for all 177 PSM-level assays.

SCP2020-presentation

Data filtering: compute QC metrics

Of course, some QC metrics might be specific to single-cell proteomics and are not computed by MaxQuant. For instance, SCoPE2 computes a sample to carrier ratio to discard samples with intensities higher than it would be expected. It also computes peptide FDR, the expected proportion of features that are wrongly assigned to a given peptide. Another example is to filter single-cells based on the median coefficient of variation that indicates the reliability of the protein quantification.

SCP2020-presentation

We provide the functionality to compute those three metrics. See here, computeMedianCV takes the dataset, the name of the assay from which the median CVs should be computed, here peptides, and the names of the metadata fields that hold the required information for computing the CV.

Data filtering: plot QC metrics

The computed CVs are stored in a new metadata field called medianCV and allows for easy plotting and subsetting of the data. Let’s plot the CVs. First, we retrieve the assay peptides in which the CVs were stored. Then we get the colData where the medianCV is stored and format it to a data.frame for plotting. The remainder of the code is creating the histogram using ggplot2. We can see here that blank samples exhibit much higher CVs than the single-cells and the black line shows the cutoff used in SCoPE2 for filtering cells.

SCP2020-presentation

Feature aggregation

Next, feature aggregation allows combining features into a higher-level structure. For example, multiple peptides can be aggregated to a single protein. This functionality includes 2 steps. First, we combine the quantitative data from multiple peptides to a single protein. Second, we store the relationship from the aggregated peptides to the protein.

SCP2020-presentation

We use the aggregateFeatures for this. It takes the dataset, the assay to aggregate, here the assay called peptides, the name of the newly aggregated assay, here proteins, the metadata field name that contains the protein to which the peptide belongs to, and a function to combine the data with possible associated arguments.

Managing missingness

Single-cell technologies do contain many zeros, and so does single-cell proteomics. Zeros can either be biological zeros or technical zeros. They are better replaced by the missing value NA to avoid confusion in downstream analyses. This is done by zeroIsNA. Again, we supply the dataset and the name of the assay for which the zeros should be replaced.

SCP2020-presentation

SCoPE2 and other single-cell pipelines remove highly-missing features, for instance removing peptides that have over 99% missingness. This is performed using the filterNA function. The pNA argument will control the tolerated proportion of missingness.

Finally, missing data can be imputed using the impute function. This code chunk imputes the assay called proteins using the KNN method with k equals 3. Other methods are also available what facilitates benchmarking of imputation methods.

Data transformation

Another processing step is data transformation. We provide common data transformation function, such as normalization and log-transformation. For instance, we apply a base-2 log-transformation using the logTransform function. Beside the dataset, we need to supply the name of the assay to transform, the log base, and the name of the new assay that will contain the transformed data.

SCP2020-presentation

Custom function

Finally, our framework can easily adapt to any custom function, for example, batch correction using ComBat. It requires a three-step procedure.

SCP2020-presentation

First, we extract the assay to process, here it is the protein data. Then, we apply the custom function, ComBat in this case. ComBat requires a data matrix that we can access with the assay function. The dots represent the required arguments that we omit here for clarity. Once, the assay data is overwritten, we can add the transformed assay back in the dataset using addAssay. Remember that the added assay should be a SingleCellExperiment.

Replication results

We replicated the SCoPE2 analysis using our standardized framework. The results that are shown in the next slides compare the expected SCoPE2 supplementary data with the output of our scp framework.

Selected Features

We first compared the selected features present in the two datasets. The overlap between the selected features is very high, for both the peptide and the protein data. Around 9000 peptides are common to both workflows, with over 99% agreement. Similarly, for proteins, around 2700 proteins are common to both workflows with over 97% agreement.

SCP2020-presentation

Numerical comparison

Next, we looked at the numerical differences. The graphs show the distribution of the difference between the two datasets. In order to do this, we had to subset the data matrices to common features and samples. The distribution sharply peaks around 0 for both protein and peptide data. The protein data is a bit more spread out because the differences seen for the peptide data accumulate during the processing.

SCP2020-presentation

Replicate weighted PCA

Although small differences are seen, the datasets are highly similar. This can be seen from the weighted PCA plots on the protein data. The left figure can be found in the SCoPE2 preprint and the right figure is our attempt to reproduce it. The two PCA plots exhibit the exact same trend, indicating the underlying main patterns are well replicated.

SCP2020-presentation

Note about replication

Let’s finish the result part with a small note about reproducing data analyses. Identifying key processing steps is essential for reproducible work.

SCP2020-presentation

Good software development includes continuous maintenance and improvement. Changes in the implementation of algorithms can impact the reproducibility of an analysis. For example, we noticed, with the help of Harrison Specht, that the batch correction results using CombBat depended on the software versions. These differences are due to a new invariant feature filter that was added recently to the batch correction algorithm. This filtering step in turn highlighted technical artefacts of the data.

Conclusions

Take home messages

Here are our main take home messages with regard to the scp package. We are all excited by MS-based single cell proteomics and are well aware of the many challenges ahead. With scp, we offer a principled and reproducible way to manage, process and analyse such data. While we have used SCoPE2 data as an illustration, the package isn’t limited to MS2 labelled data. The processing functions in scp are perfectly applicable to label free data which are generated using the nanoPOTS method, which we also will highlight in future work. scp uses the same data structure as scRNAseq data, which facilitates, when relevant, consistent data processing and integration of various single cell assay modalities. Most importantly, perhaps, this is only the beginning, and we are already using scp for the next generation of single cell proteomics data modelling.

SCP2020-presentation

Resources and acknowledgements:

The scp package is publicly available at this URL and has been submitted to Bioconductor. We will also release the pre-formatted SCoPE2 data as well as other single cell proteomics datasets that we use for our method development. The QFeatures and SingleCellExperiment packages have been available for quite some time now. Our slides are also available at http://bit.ly/2020SCP.

We would like to thank Harrison, Ed and Nikolai for their keenness to share their data and experience with their data, the FNRS for funding this work and you for your attention.

SCP2020-presentation

References

SCP2020-presentation

SCP2020-presentation