The aim of the R for mass spectrometry RforMassSpectrometry initiative is to provide efficient, thoroughly documented, tested and flexible R software for the analysis and interpretation of high throughput mass spectrometry assays, including proteomics and metabolomics experiments. The project formalises the longtime collaborative development efforts of its core members under the RforMassSpectrometry organisation to facilitate dissemination and accessibility of their work.
Two packages of interest:
This use case features a proximity tagging protein-protein interaction dataset. In this experiment, a bait protein X
was tagged using BioID2. BioID biotinylates proteins in close proximity of the bait X
, in particular, but not only, proteins interacting with X
. These proteins are then be recovered using streptavidin beads and identified/quantified using mass spectrometry. The expectation is that the proteins contain interaction partners of X
. However, given that a substantial number of false positives will also be biotinylated, it is necessary to complement this data with a negative control, where X
is blocked from interacting with its partners. A differential analysis between the wild type and the negative control will reveal likely interacting partners of X
.
Identify the most likely interactors of protein X
.
bioid.csv
file into R.pos
variable to guide you: TRUE
referring to a set of expected true interactors, but there might be more.library("MSnbase")
e <- readMSnSet2("./data/bioid.csv", ecol = 1:6, fnames = "ids")
e$grp <- sub("[1-3]", "", sampleNames(e))
sel1 <- fData(e)$Reverse == ""
sel2 <- fData(e)$Potential.contaminant == ""
e <- e[sel1 & sel2, ]
fData(e) <- fData(e)[, c("pos", "ids")]
limma::plotDensities(log2(exprs(e) + 1))
exprs(e)[exprs(e) == 0] <- NA
(e$nNA <- colSums(is.na(e)))
## WT1 WT2 WT3 NEG1 NEG2 NEG3
## 306 239 306 249 303 286
fData(e)$nNA <- rowSums(is.na(e))
table(fData(e)$nNA)
##
## 0 1 2 3 4 5
## 180 70 75 80 101 165
table(is.na(e))
##
## FALSE TRUE
## 2337 1689
sum(is.na(e))/prod(dim(e))
## [1] 0.4195231
naplot(e, col = c("black", "white"))
e_norm <- e %>%
log(base = 2) %>%
normalise(method = "center.median")
limma::plotMDS(exprs(e_norm))
limma::plotDensities(exprs(e_norm))
proDA
: probabilistic dropout analysisWe will now try the proDA
package for inference of protein differential abundance by probabilistic dropout analysis (Ahlman-Eltze and Anders (2019)) that explicity models missing values as lowly expressed features.
library("proDA")
Below, we first fit the model using the grp
variable.
proDA_fit <- proDA(e_norm,
design = ~ grp,
data_is_log_transformed = TRUE)
## Warning in proDA(e_norm, design = ~grp, data_is_log_transformed = TRUE): The
## data contains a mix of 3 exact zeros and 1689 NA's. Will treat the zeros as
## valid input and not replace them with NA's.
proDA_fit
## Parameters of the probabilistic dropout model
##
## The dataset contains 6 samples and 671 proteins
## 42% of the values are missing
##
## Experimental design: y~grp
## The model has successfully converged.
##
## The inferred parameters are:
## location_prior_mean: -1.39
## location_prior_scale: 11.2
## location_prior_df: 3
## variance_prior_scale: 0.695
## variance_prior_df: 1.59
## dropout_curve_position: -1.45, -2.21, -1.6, -2.05, ...
## dropout_curve_scale: -1.45, -1.42, -1.8, -1.22, ...
Here we extract the contract of interest, setting the alternative to greater to perform a one-sided test (we are only interested in proteins with greater expression in the wild type),
res <- test_diff(proDA_fit,
contrast = "grpWT",
alternative = "greater",
pval_adjust_method = "BH")
res %>%
arrange(adj_pval) %>%
filter(adj_pval < 0.1)
## name pval adj_pval diff t_statistic se df
## 1 P8706942 8.290037e-05 0.02781307 7.173387 13.671231 0.5247067 4
## 2 P6582309 7.634347e-05 0.02781307 5.927278 13.960778 0.4245665 4
## 3 P2936078 2.067392e-04 0.04624067 4.164363 10.822912 0.3847729 4
## 4 P3615987 5.247346e-04 0.07041938 6.303459 8.502347 0.7413787 4
## 5 P8350629 4.505959e-04 0.07041938 5.799505 8.847187 0.6555196 4
## avg_abundance n_approx n_obs
## 1 1.4191144 5.024654 5
## 2 5.9443354 6.000000 6
## 3 4.8914212 6.000000 6
## 4 0.9919089 4.074235 4
## 5 1.9211433 3.811299 4
Below, we integrate the results in the MSnSet
and visualise them using a volcano plot. We observe an enrichment in true positives.
stopifnot(identical(res$name, featureNames(e_norm)))
fData(e_norm) <- cbind(fData(e_norm), res)
with(fData(e_norm),
plot(diff, -log10(adj_pval),
col = as.numeric(pos) + 1))
abline(h = -log10(0.05))
abline(h = -log10(0.1), lty = "dotted")
ms2df(e_norm) %>%
arrange(adj_pval) %>%
filter(adj_pval < 0.1)
## WT1 WT2 WT3 NEG1 NEG2 NEG3 pos
## P8706942 4.809723 4.843454 5.614666 -2.3246799 -1.978230 NA FALSE
## P6582309 8.886493 8.917121 9.096700 2.7715624 3.530842 2.837897 TRUE
## P2936078 6.811807 7.003624 7.264089 2.7718795 2.920089 2.896690 TRUE
## P3615987 3.934746 4.958431 3.888658 -1.9859032 NA NA TRUE
## P8350629 5.046591 5.039421 4.638563 -0.5905618 NA NA TRUE
## ids nNA name pval adj_pval diff t_statistic
## P8706942 P8706942 1 P8706942 8.290037e-05 0.02781307 7.173387 13.671231
## P6582309 P6582309 0 P6582309 7.634347e-05 0.02781307 5.927278 13.960778
## P2936078 P2936078 0 P2936078 2.067392e-04 0.04624067 4.164363 10.822912
## P3615987 P3615987 2 P3615987 5.247346e-04 0.07041938 6.303459 8.502347
## P8350629 P8350629 2 P8350629 4.505959e-04 0.07041938 5.799505 8.847187
## se df avg_abundance n_approx n_obs
## P8706942 0.5247067 4 1.4191144 5.024654 5
## P6582309 0.4245665 4 5.9443354 6.000000 6
## P2936078 0.3847729 4 4.8914212 6.000000 6
## P3615987 0.7413787 4 0.9919089 4.074235 4
## P8350629 0.6555196 4 1.9211433 3.811299 4
table(fData(e_norm)$pos,
fData(e_norm)$nNA)
##
## 0 1 2 3 4 5
## FALSE 176 70 73 78 99 165
## TRUE 4 0 2 2 2 0
Page built: 2021-01-12