vignettes/v01-QFeatures.Rmd
v01-QFeatures.Rmd
Last modified: NA
Compiled: Thu Aug 12 15:59:13 2021
QFeatures
is a package for the manipulation of mass spectrometry (MS)-based quantitative proteomics data. The quantitative data is generally obtained after running a preprocessing algorithm on raw MS profiles called spectra. The algorithm attempts to match each spectra to a peptide database to infer the peptide sequence. A spectrum for which a corresponding peptide could be found is called peptide to spectrum match (PSM). The intensities recorded by the MS are then used to quantify the PSMs. The quantified PSMs are used as input by QFeatures
and undergo various data wrangling steps to reconstruct the peptide data and then the protein data, usually of interest for elucidating the biological question at hand.
MS-based quantification data can be represented as a matrix of quantitative values for features (PSMs, peptides, proteins) arranged along the rows, measured for a set of samples, arranged along the columns. The matrix format is a common representation for any quantitative data set. We will be using the SummarizedExperiment
(Morgan et al. 2020) class:
colData()
function.rowData()
column.assay()
.assays()
returns a list of matrix-like assays.Note that other classes that inherits from SummarizedExperiment
can also be used. We will see in the other vignette that for single-cell proteomics data, we will use SingleCellExperiment
objects.
While mass spectrometers acquire data for spectra/peptides, the biological entity of interest are the protein. As part of the data processing, we are thus required to aggregate low-level quantitative features into higher level data.
We are going to start to familiarise ourselves with the QFeatures
class implemented in the QFeatures
package. The class is derived from the Bioconductor MultiAssayExperiment
(Ramos et al. 2017) class. Let’s start by loading the QFeatures
package.
Next, we load the feat1
test data, which is composed of single assay of class SummarizedExperiment
composed of 10 rows and 2 columns.
data(feat1)
feat1
## An instance of class QFeatures containing 1 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
Let’s perform some simple operations to familiarise ourselves with the QFeatures
class:
colData()
accessor (like you have previously done with SummarizedExperiment
objects).
colData(feat1)
## DataFrame with 2 rows and 1 column
## Group
## <integer>
## S1 1
## S2 2
QFeaures
data using the [[
operator (as you have done to extract elements of a list) by using the assay’s index or name.
feat1[[1]]
## class: SummarizedExperiment
## dim: 10 2
## metadata(0):
## assays(1): ''
## rownames(10): PSM1 PSM2 ... PSM9 PSM10
## rowData names(5): Sequence Protein Var location pval
## colnames(2): S1 S2
## colData names(0):
feat1[["psms"]]
## class: SummarizedExperiment
## dim: 10 2
## metadata(0):
## assays(1): ''
## rownames(10): PSM1 PSM2 ... PSM9 PSM10
## rowData names(5): Sequence Protein Var location pval
## colnames(2): S1 S2
## colData names(0):
psms
assay’s row data and quantitative values.
assay(feat1[[1]])
## S1 S2
## PSM1 1 11
## PSM2 2 12
## PSM3 3 13
## PSM4 4 14
## PSM5 5 15
## PSM6 6 16
## PSM7 7 17
## PSM8 8 18
## PSM9 9 19
## PSM10 10 20
rowData(feat1[[1]])
## DataFrame with 10 rows and 5 columns
## Sequence Protein Var location pval
## <character> <character> <integer> <character> <numeric>
## PSM1 SYGFNAAR ProtA 1 Mitochondr... 0.084
## PSM2 SYGFNAAR ProtA 2 Mitochondr... 0.077
## PSM3 SYGFNAAR ProtA 3 Mitochondr... 0.063
## PSM4 ELGNDAYK ProtA 4 Mitochondr... 0.073
## PSM5 ELGNDAYK ProtA 5 Mitochondr... 0.012
## PSM6 ELGNDAYK ProtA 6 Mitochondr... 0.011
## PSM7 IAEESNFPFI... ProtB 7 unknown 0.075
## PSM8 IAEESNFPFI... ProtB 8 unknown 0.038
## PSM9 IAEESNFPFI... ProtB 9 unknown 0.028
## PSM10 IAEESNFPFI... ProtB 10 unknown 0.097
The central functionality of the QFeatures
infrastructure is the aggregation of features into higher-level features while retaining the link between the different levels. This can be done with the aggregateFeatures()
function.
The call below will
psms
assay of the feat1
objects;peptides
row data variables;colMeans()
function;peptides
and add it to the feat1
object.
feat1 <- aggregateFeatures(feat1, i = "psms",
fcol = "Sequence",
name = "peptides",
fun = colMeans)
feat1
## An instance of class QFeatures containing 2 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
Note: we do not claim that colMeans
is the best aggregation method. There are other more robust methods for aggregating features, for instance MsCoreUtils::robustSummary
is a good alternative. See the documentation of aggregateFeatures
for more information.
Exercise at home: check that you understand the effect of feature aggregation and repeat the calculations manually. Compare your manual results agains the content of the new assay’s row data.
## Data before aggregation
data.frame(assay(feat1[["psms"]]),
Sequence = rowData(feat1[["psms"]])$Sequence)
## S1 S2 Sequence
## PSM1 1 11 SYGFNAAR
## PSM2 2 12 SYGFNAAR
## PSM3 3 13 SYGFNAAR
## PSM4 4 14 ELGNDAYK
## PSM5 5 15 ELGNDAYK
## PSM6 6 16 ELGNDAYK
## PSM7 7 17 IAEESNFPFIK
## PSM8 8 18 IAEESNFPFIK
## PSM9 9 19 IAEESNFPFIK
## PSM10 10 20 IAEESNFPFIK
## SYGFNAAR
colMeans(assay(feat1[["psms"]])[1:3, ])
## S1 S2
## 2 12
assay(feat1[[2]])["SYGFNAAR", ]
## S1 S2
## 2 12
## ELGNDAYK
colMeans(assay(feat1[["psms"]])[4:6, ])
## S1 S2
## 5 15
assay(feat1[[2]])["ELGNDAYK", ]
## S1 S2
## 5 15
## IAEESNFPFIK
colMeans(assay(feat1[["psms"]])[7:10, ])
## S1 S2
## 8.5 18.5
assay(feat1[[2]])["IAEESNFPFIK", ]
## S1 S2
## 8.5 18.5
rowData(feat1[[2]])
## DataFrame with 3 rows and 4 columns
## Sequence Protein location .n
## <character> <character> <character> <integer>
## ELGNDAYK ELGNDAYK ProtA Mitochondr... 3
## IAEESNFPFIK IAEESNFPFI... ProtB unknown 4
## SYGFNAAR SYGFNAAR ProtA Mitochondr... 3
We can now aggregate the peptide-level data into a new protein-level assay using the colMedians()
aggregation function.
feat1 <- aggregateFeatures(feat1, i = "peptides",
fcol = "Protein",
name = "proteins",
fun = colMedians)
feat1
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
## [3] proteins: SummarizedExperiment with 2 rows and 2 columns
assay(feat1[["proteins"]])
## S1 S2
## ProtA 3.5 13.5
## ProtB 8.5 18.5
You can also get a graphical overview of the QFeatures object thanks to the plot
function. Arrows represent the links between the assays.
plot(feat1)
The subsetting a QFeatures
object can be performed using the 3-index subset operator, [
. The first index will subset the features of interest. If the feature is linked to features from other assays, those features will also
We provide an example of subsetting features that are associated to features from other assays. When we subset for protein A as shown below or using the subsetByFeature()
function, this creates a new instance of class QFeatures
containing assays with the expression data for protein, its peptides and their PSMs.
feat1["ProtA", , ]
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 6 rows and 2 columns
## [2] peptides: SummarizedExperiment with 2 rows and 2 columns
## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
The filterFeatures()
function can be used to filter rows the assays composing a QFeatures
object using the row data variables. We can for example retain rows that have a pval
< 0.05, which would only keep rows in the psms
assay because the pval
is only relevant for that assay.
filterFeatures(feat1, ~ pval < 0.05)
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 4 rows and 2 columns
## [2] peptides: SummarizedExperiment with 0 rows and 2 columns
## [3] proteins: SummarizedExperiment with 0 rows and 2 columns
On the other hand, if we filter assay rows for those that localise to the mitochondrion, we retain the relevant protein, peptides and PSMs.
filterFeatures(feat1, ~ location == "Mitochondrion")
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 6 rows and 2 columns
## [2] peptides: SummarizedExperiment with 2 rows and 2 columns
## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
As an exercise, let’s filter rows that do not localise to the mitochondrion.
filterFeatures(feat1, ~ location != "Mitochondrion")
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 4 rows and 2 columns
## [2] peptides: SummarizedExperiment with 1 rows and 2 columns
## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
Another useful filtering functionality is filterNA
that removes features based on the amount of missing data it contains. To illustrate this, we will load a data set containing missing data
data("ft_na")
ft_na
## An instance of class QFeatures containing 1 assays:
## [1] na: SummarizedExperiment with 4 rows and 3 columns
assay(ft_na[[1]])
## A B C
## a NA 5 9
## b 2 6 10
## c 3 NA 11
## d NA 8 12
We can remove feature that contain more that a given percentage missing values. For instance, let’s remove features with more than 25 % missing data. We expect to keep only 1 features as all the other features contain 33 % missing values.
filterNA(ft_na, i = "na", pNA = 0.25)
## An instance of class QFeatures containing 1 assays:
## [1] na: SummarizedExperiment with 1 rows and 3 columns
The resulting QFeatures
object contains a single feature as we expected.
The Qfeatures
package provides a few utility functions to process the data:
logTransform
creates a new assay in a QFeatures
object that contains the log-transformed quantification of the target assay.
logTransform(feat1, i = "proteins", base = 2, name = "logproteins")
## An instance of class QFeatures containing 4 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
## [3] proteins: SummarizedExperiment with 2 rows and 2 columns
## [4] logproteins: SummarizedExperiment with 2 rows and 2 columns
normalize
creates a new assay in a QFeatures
object that contains the normalized quantification of the target assay. See the ?QFeatures::normalize
documentation for a list of available normalization methods. If the method you are looking for is not available, you could also consider QFeatures::sweep
.
normalize(feat1, i = "proteins", method = "center.median", name = "normproteins")
## An instance of class QFeatures containing 4 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
## [3] proteins: SummarizedExperiment with 2 rows and 2 columns
## [4] normproteins: SummarizedExperiment with 2 rows and 2 columns
imputes
predicts missing values in a new assay. See the ?MsCoreUtils::impute_matrix()
for a list of available imputation methods. In this example, we impute missing values with zero.
assay(ft_na[[1]])
## A B C
## a NA 5 9
## b 2 6 10
## c 3 NA 11
## d NA 8 12
ft_na <- impute(ft_na, i = 1, method = "zero")
assay(ft_na[[1]])
## A B C
## a 0 5 9
## b 2 6 10
## c 3 0 11
## d 0 8 12
You can refer to the Quantitative features for mass spectrometry data vignette and the QFeature
manual page for more details about the class.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] QFeatures_1.3.7 MultiAssayExperiment_1.19.5
## [3] SummarizedExperiment_1.23.1 Biobase_2.53.0
## [5] GenomicRanges_1.45.0 GenomeInfoDb_1.29.3
## [7] IRanges_2.27.0 S4Vectors_0.31.0
## [9] BiocGenerics_0.39.1 MatrixGenerics_1.5.3
## [11] matrixStats_0.60.0 BiocStyle_2.21.3
##
## loaded via a namespace (and not attached):
## [1] xfun_0.25 bslib_0.2.5.1 lattice_0.20-44
## [4] htmltools_0.5.1.1 yaml_2.2.1 rlang_0.4.11
## [7] pkgdown_1.6.1 jquerylib_0.1.4 GenomeInfoDbData_1.2.6
## [10] stringr_1.4.0 zlibbioc_1.39.0 ProtGenerics_1.25.1
## [13] ragg_1.1.3 memoise_2.0.0 evaluate_0.14
## [16] knitr_1.33 fastmap_1.1.0 highr_0.9
## [19] BiocManager_1.30.16 cachem_1.0.5 DelayedArray_0.19.1
## [22] desc_1.3.0 jsonlite_1.7.2 XVector_0.33.0
## [25] systemfonts_1.0.2 fs_1.5.0 MsCoreUtils_1.5.0
## [28] textshaping_0.3.5 digest_0.6.27 stringi_1.7.3
## [31] rprojroot_2.0.2 grid_4.1.0 clue_0.3-59
## [34] tools_4.1.0 bitops_1.0-7 magrittr_2.0.1
## [37] sass_0.4.0 RCurl_1.98-1.3 lazyeval_0.2.2
## [40] cluster_2.1.2 pkgconfig_2.0.3 crayon_1.4.1
## [43] MASS_7.3-54 Matrix_1.3-4 rmarkdown_2.10
## [46] AnnotationFilter_1.17.1 R6_2.5.0 igraph_1.2.6
## [49] compiler_4.1.0