class: center, middle, inverse, title-slide # Mass spectrometry and proteomics ## Using R/Bioconductor ###
Laurent Gatto
###
CSAMA
- Bressanone - 25 July 2019 --- class: middle name: cc-by Slides available at: http://bit.ly/20190725csama These slides are available under a **creative common [CC-BY license](http://creativecommons.org/licenses/by/4.0/)**. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially <img height="20px" alt="CC-BY" src="./img/cc1.jpg" />. --- class: middle ## On the menu Morning lecture: 1. Proteomics in R/Bioconductor 2. How does mass spectrometry-based proteomics work? 3. Quantitative proteomics 4. Quantitative proteomics data processing and analysis Afternoon lab: - Manipulating MS data (raw and identification data) - Manipulating quantitative proteomics data - Data processing and DE --- class: middle center ![](./Figures/overview0.png) ??? From a 2011 study that compared the expression profiles of 3 cell lines using RNA-Seq, MS-based proteomics and immunofluorescence (protein-specific antibodies). They observed an overall Spearman correlation of 0.63. **In what ways to these summaries differ?** Using a common gene-centric identifier, but - What do we have along the rows, what are our features? Transcripts on the left. Protein groups on the right. - How are these intensities produced? ## Take-home message 1 These data tables are less similar than they appear. --- class: middle, center ## 1. [Proteomics](http://bioconductor.org/packages/release/BiocViews.html#___Proteomics) and [mass spectrometry](http://bioconductor.org/packages/release/BiocViews.html#___MassSpectrometry) packages, [questions](https://support.bioconductor.org/local/search/page/?q=proteomics) and [workflow](https://rawgit.com/lgatto/bioc-ms-prot/master/lab.html) in Bioconductor. --- class: middle, center ## 2. How does mass spectrometry work? ### (applies to proteomics and metabolomics) --- class: middle ### Overview ![](./Figures/overview1.jpg) --- ### How does MS work? 1. Digestion of proteins into peptides - as will become clear later, the features we measure in shotgun (or bottom-up) *proteomics* are peptides, **not** proteins. -- 2. On-line liquid chromatography (LC-MS) -- 3. Mass spectrometry (MS) is a technology that **separates** charged molecules (ions, peptides) based on their mass to charge ratio (M/Z). --- ### Chromatography MS is generally coupled to chromatography (liquid LC, but can also be gas-based GC). The time an analytes takes to elute from the chromatography column is the **retention time**. ![A chromatogram, illustrating the total amount of analytes over the retention time.](./Figures/chromatogram.png) ??? - This is an acquisition. There can be one per sample (with muliple fractions), of samples can be combined/multiplexed and acquired together. --- class: middle An mass spectrometer is composed of three components: 1. The *source*, that ionises the molecules: examples are Matrix-assisted laser desorption/ionisation (MALDI) or electrospray ionisation (ESI). 2. The *analyser*, that separates the ions: Time of flight (TOF) or Orbitrap. 3. The *detector* that quantifies the ions. Ions typically go through that cylce at least twice (MS2, tandem MS, or MSMS). Before the second cycle, individual *precursor* ions a selected and broken into *fragment* ions. --- class: middle center ![Schematics of a mass spectrometer and two rounds of MS.](./Figures/SchematicMS2.png) ??? Before MS: - Restriction with enzyme, typically trypsine. - Off-line fractionation. An mass spectrometer is composed of three components: 1. The *source*, that ionises the molecules: examples are Matrix-assisted laser desorption/ionisation (MALDI) or electrospray ionisation (ESI). 2. The *analyser*, that separates the ions: Time of flight (TOF) or Orbitrap. 3. The *detector* that quantifies the ions. Ions typically go through that cylce at least twice (MS2, tandem MS or MSMS). Before the second cycle, individual *precursor* ions a selected, broken into fragment ions. --- class: middle center ![Separation and detection of ions in a mass spectrometer.](./Figures/mstut.gif) --- class: middle center ![Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).](./Figures/MS1-MS2-spectra.png) ??? Highlight - Semi-stochastic nature of MS - Data dependent acquisition (DDA) --- class: middle center .left-col-50[ ![MS1 spectra over retention time.](./Figures/F02-3D-MS1-scans-400-1200-lattice.png) ] .right-col-50[ ![MS2 spectra interleaved between two MS1 spectra.](./Figures/F02-3D-MS1-MS2-scans-100-1200-lattice.png) ] ??? - Please keep this MS space and the semi-stochastic nature of MS acquisition in mind , as I will come back to it later. --- ### Identification: fragment ions ![Peptide fragment ions.](./Figures/frag.png) --- ### Identification: Peptide-spectrum matching (PSM) Matching **expected** and *observed* spectra: ``` > MSnbase::calculateFragments("SIGFEGDSIGR") mz ion type pos z seq 1 88.03931 b1 b 1 1 S 2 201.12337 b2 b 2 1 SI 3 258.14483 b3 b 3 1 SIG 4 405.21324 b4 b 4 1 SIGF 5 534.25583 b5 b 5 1 SIGFE 6 591.27729 b6 b 6 1 SIGFEG 7 706.30423 b7 b 7 1 SIGFEGD 8 793.33626 b8 b 8 1 SIGFEGDS 9 906.42032 b9 b 9 1 SIGFEGDSI 10 963.44178 b10 b 10 1 SIGFEGDSIG 11 175.11895 y1 y 1 1 R 12 232.14041 y2 y 2 1 GR 13 345.22447 y3 y 3 1 IGR 14 432.25650 y4 y 4 1 SIGR 15 547.28344 y5 y 5 1 DSIGR 16 604.30490 y6 y 6 1 GDSIGR [ reached getOption("max.print") -- omitted 16 rows ] ``` --- class: middle ### Identification: Peptide-spectrum matching (PSM) Matching *expected* and **observed** spectra: ![Peptide fragment matching](./Figures/annotated-spectrum.png) ??? Performed by **search engines** such as Mascot, MSGF+, Andromeda, ... --- ### Identification: database ![Uniprot human proteome](./Figures/uniprot1.png) ??? ## [Human proteome](https://www.uniprot.org/help/human_proteome) In 2008, a draft of the complete human proteome was released from UniProtKB/Swiss-Prot: the approximately 20,000 putative human protein-coding genes were represented by one UniProtKB/Swiss-Prot entry, tagged with the keyword 'Complete proteome' and later linked to proteome identifier UP000005640. This UniProtKB/Swiss-Prot H. sapiens proteome (manually reviewed) can be considered as complete in the sense that it contains one representative (**canonical**) sequence for each currently known human gene. Close to 40% of these 20,000 entries contain manually annotated alternative isoforms representing over 22,000 additional sequences ## [What is the canonical sequence?](https://www.uniprot.org/help/canonical_and_isoforms) To reduce redundancy, the UniProtKB/Swiss-Prot policy is to describe all the protein products encoded by one gene in a given species in a single entry. We choose for each entry a canonical sequence based on at least one of the following criteria: 1. It is the most prevalent. 2. It is the most similar to orthologous sequences found in other species. 3. By virtue of its length or amino acid composition, it allows the clearest description of domains, isoforms, polymorphisms, post-translational modifications, etc. 4. In the absence of any information, we choose the longest sequence. ## Are all isoforms described in one UniProtKB/Swiss-Prot entry? Whenever possible, all the protein products encoded by one gene in a given species are described in a single UniProtKB/Swiss-Prot entry, including isoforms generated by alternative splicing, alternative promoter usage, and alternative translation initiation (*). However, some alternative splicing isoforms derived from the same gene share only a few exons, if any at all, the same for some 'trans-splicing' events. In these cases, the divergence is obviously too important to merge all protein sequences into a single entry and the isoforms have to be described in separate 'external' entries. ## UniProt [downloads](https://www.uniprot.org/downloads) --- class: middle ### Identification ![Decoy databases and peptide scoring](./Figures/pr-2007-00739d_0004.gif) From Käll *et al.* [Posterior Error Probabilities and False Discovery Rates: Two Sides of the Same Coin](https://pubs.acs.org/doi/abs/10.1021/pr700739d). ??? - (global) FDR = B/(A+B) - (local) fdr = PEP = b/(a+b) --- ### Identification: Protein inference - Keep only reliable peptides - From these peptides, infer proteins - If proteins can't be resolved due to shared peptides, merge them into **protein groups** of indistinguishable or non-differentiable proteins. --- class: middle ![Peptide evidence classes](./Figures/nbt0710-647-F2.gif) From [Qeli and Ahrens (2010)](http://www.ncbi.nlm.nih.gov/pubmed/20622826). --- class: middle, center ## 3. Quantitative proteomics --- class: middle | |Label-free |Labelled | |:---|:----------|:----------| |MS1 |XIC |SILAC, 15N | |MS2 |Counting |iTRAQ, TMT | .left-col-50[ ![MS1 spectra over retention time.](./Figures/F02-3D-MS1-scans-400-1200-lattice.png) ] .right-col-50[ ![MS2 spectra interleaved between two MS1 spectra.](./Figures/F02-3D-MS1-MS2-scans-100-1200-lattice.png) ] --- class: middle center ### Label-free MS2: Spectral counting ![](./Figures/pbase.png) ??? Comments of **spectral counting**: - easy, but *semi-quantitative* - **Note depth/coverage and compare to RNA-seq.** --- class: middle center ### Labelled MS2: Isobaric tagging .left-col-30[ ![](./Figures/itraqchem.png) ] .right-col-70[ ![](./Figures/itraq.png) ] ??? Comments of **isobaric tagging**: - multiplexing, albeit limited - limited problem with missing values - easy to process (fewer parameters to set) --- class: middle center ### Label-free MS1: extracted ion chromatograms .left-col-50[ ![MS1 spectra over retention time.](./Figures/F02-3D-MS1-scans-400-1200-lattice.png) ] .right-col-50[ ![](./Figures/chrom_peaks.png) ] Credit: [Johannes Rainer](https://github.com/jorainer) ??? Comments on **label-free**: - Data processing - Independent runs, NAs --- class: middle center ### Labelled MS1: SILAC ![](./Figures/Silac.gif) Credit: Wikimedia Commons. ??? Comments of **isotope labelling**: - like red/green microarrays - can be extended to 3, but risk of overlap of light, medium and heavy peaks --- class: middle, center ## 4. Quantitative proteomics data processing and analysis You will be use the*[MSnbase](https://bioconductor.org/packages/3.9/MSnbase)* and *[MSqRob](https://github.com/statOmics/MSqRob)* packages during the lab. <img src="https://raw.githubusercontent.com/Bioconductor/BiocStickers/master/MSnbase/MSnbase.png" alt="MSnbase" style="width: 150px; padding: 10px"/> --- class: middle Quantitative proteomics data processing - data processing/cleaning - **missing values** - log transformation and normalisation - **summarisation** - **differential analysis** --- class: middle <img src="./Figures/msnset.png" width="587" /> The `MSnSet` structure: expression (accessed with `exprs`), feature (`fData`) and sample (`pData`) metadata. --- class: middle ### Missing values Can appear because - the feature is missing (due to biology, i.e not at random) - the feature was missed during the acquisition process (i.e at random) - mixture thereof What can one do? - filter out features (or at least those that have *too many* missing values). - imputation - when possible, use a statistical method that accounts for missing values (for example [proDA](https://www.biorxiv.org/content/10.1101/661496v1), [MSqRob](https://www.biorxiv.org/content/10.1101/668863v1), ...) --- class: middle ### Missing values Can appear because - the feature is missing (due to biology, i.e not at random) - for example `impute(, method = "min")` - the feature was missed during the acquisition process (i.e at random) - for example `impute(, method = "MLE")` - mixture thereof - for example `impute(, method = "mixed")` (used in *[DEP](https://bioconductor.org/packages/3.9/DEP)*) What can one do? - filter out features (or at least those that have *too many* missing values). - imputation (`MSnbase::impute` - see above and next slide) - Feature rescuing (identification transfer, matching between runs) - when possible, use a statistical method that accounts for missing values (for example [proDA](https://www.biorxiv.org/content/10.1101/661496v1), [MSqRob](https://www.biorxiv.org/content/10.1101/668863v1), ...) --- class: middle ### Imputation ![](./Figures/imp-sim.png) Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better. See Lazar *et al.* [Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies](http://dx.doi.org/10.1021/acs.jproteome.5b00981). --- class: middle ### Feature rescuing .left-col-50[ ![Retention time alignment](./Figures/pr-2012-00776t_0004.jpeg) ] .right-col-50[ ![Matching between runs](./Figures/idtrans.png) ] .left-col-100[ From Bond *et al.* [Improving Qualitative and Quantitative Performance for MSE-based Label-free Proteomics](https://pubs.acs.org/doi/10.1021/pr300776t). ] --- class: middle ### Missing values Can appear because - the feature is missing (due to biology, i.e not at random) - for example `impute(, method = "min")` - the feature was missed during the acquisition process (i.e at random) - for example `impute(, method = "MLE")` - mixture thereof - for example `impute(, method = "mixed")` (used in *[DEP](https://bioconductor.org/packages/3.9/DEP)*) What can one do? - filter out features (or at least those that have *too many* missing values). - imputation (`MSnbase::impute` - see above and next slide) - Feature rescuing (identification transfer, matching between runs) - when possible, use a statistical method that accounts for missing values (for example [proDA](https://www.biorxiv.org/content/10.1101/661496v1), [MSqRob](https://www.biorxiv.org/content/10.1101/668863v1), ...) --- class: middle ### Summarisation <img src="./Figures/features.png" width="5045" /> Examples of aggregations (from the [Features](https://rformassspectrometry.github.io/Features/articles/Features.html#subsetting) package). --- class: middle ![](./Figures/summarisation.jpg)<!-- --> Summarisation examples: (1) peptide-level data, (2) mean intensity of the *top three* peptides (Proteus), (3) using pair-wise abundance ratios of shared peptides between samples (MaxQuant) and (4) robust summarisation (MSqRob). From [Sticker et al. (2019)](https://www.biorxiv.org/content/10.1101/668863v1.full). --- class: middle ### Differential analysis 1. Aggregate normalised peptide intensities of a protein using robust regression with M-estimation using Huber weights: `\(y_{sp} = \color{blue}{\beta^{sample}_s} + \color{red}{\beta^{pep}_p + \epsilon_{sp}}\)` 2. Protein-level inference: `\(y_{st} = \beta_0 + \color{blue}{\beta^{treatment}_t} + \epsilon_{ts}\)` Sticker *et al.* 2019, *Robust summarization and inference in proteome-wide label-free quantification*, [https://doi.org/10.1101/668863](https://doi.org/10.1101/668863). ??? 1. We aggregate all normalised peptide intensities of a protein using robust regression with M-estimation using Huber weights. We consider the protein-wise linear model in eq (1) with `\(y_{sp}\)` the normalised log2-transformed intensity of peptide `\(p\)` in sample `\(s\)` and `\(\beta^{pep}_p\)` the effect of peptide `\(p\)`. By encoding the peptide effect as a sum contrast, `\(\beta^{sample}_s\)` be interpreted as the mean intensity in sample `\(s\)` for this protein. The error term `\(\epsilon_{sp}\)` is assumed to be normally distributed with mean 0 and variance `\(\sigma^{2}_{pep}\)`. 2. We perform the inference on protein intensities with the reduced model in eq (2) with `\(y_{ts}\)` the summarised log2-transformed protein intensity in sample `\(s\)` of treatment `\(t\)`, `\(\beta_0\)` the intercept, and `\(\beta^{treatment}_t\)`, the effect condition `\(t\)`. Again, the error term `\(\epsilon_{ts}\)` is assumed to be normally distributed with mean 0 and variance `\(\sigma_2\)`. We correct for multiple testing using the Benjamini-Hochberg FDR procedure. --- class: middle ### And also - Data independent acquisition (DIA) - Targets proteomics (SRM, MRM, PRM) - Post-translational modifications (PTMs) - Protein-protein interactions - Sub-cellular localisation - ... --- class: middle name: laurent-gatto .left-col-50[ <img src="./img/lgatto3b.png" width = "180px"/> ### Laurent Gatto <i class="fas fa-flask"></i> [Computational Biology Group](https://lgatto.github.io/cbio-lab/)<br /> <i class="fas fa-map-marker-alt"></i> de Duve Institute, UCLouvain<br /> <i class="fas fa-envelope"></i> laurent.gatto@uclouvain.be<br /> <i class="fas fa-home"></i> https://lgatto.github.io<br /> <i class="fab fa-twitter"></i> [@lgatto](https://twitter.com/lgatt0/)<br /> <i class="fab fa-github"></i> [lgatto](https://github.com/lgatto/)<br /> <img width="20px" align="top" alt="orcid" src="./img/orcid_64x64.png" /> [0000-0002-1520-2268](https://orcid.org/0000-0002-1520-2268)<br /> <img width="20px" align="top" alt="Impact story" src="./img/keybase.png"/> [lgatto](https://keybase.io/lgatto)<br /> <img width="20px" align="top" alt="Google scholar" src="./img/gscholar.png"/> [Google scholar](https://scholar.google.co.uk/citations?user=k5DrB74AAAAJ&hl=en)<br /> <img width="20px" align="top" alt="Impact story" src="./img/impactstory-logo.png"/> [Impact story](https://profiles.impactstory.org/u/0000-0002-1520-2268)<br /> <i class="fas fa-pencil-alt"></i> [dissem.in](https://dissem.in/r/6231/laurent-gatto)<br /> <!-- <i class="fab fa-linkedin"></i> https://www.linkedin.com/in/lgatto/<br /> --> ] .rigth-col-50[ **Acknowledgements** [Sebastian Gibb](https://github.com/sgibb) and [Johannes Rainer](https://github.com/jorainer) (`MSnbase` and *R for Mass Spectrometry*) Open PhD or post-doc position available at the de Duve Institute, UCLouvain, in Brussels. (For international candidates only). **Slides** available at http://bit.ly/20190725csama ## Thank you for your attention ] ??? Explain lab organisation.