• 1 Introduction
    • 1.1 Installation
    • 1.2 Programme
    • 1.3 License and credits
  • 2 Tidy data
    • 2.1 Introduction
    • 2.2 Tidy data
    • 2.3 Common Spreadsheet Errors
    • 2.4 Key points
  • 3 Before we start
    • 3.1 What is R? What is RStudio?
    • 3.2 Why learn R?
      • 3.2.1 R does not involve lots of pointing and clicking, and that’s a good thing
      • 3.2.2 R code is great for reproducibility
      • 3.2.3 R is interdisciplinary and extensible
      • 3.2.4 R works on data of all shapes and sizes
      • 3.2.5 R produces high-quality graphics
      • 3.2.6 R has a large and welcoming community
      • 3.2.7 Not only is R free, but it is also open-source and cross-platform
    • 3.3 Knowing your way around RStudio
    • 3.4 Getting set up
      • 3.4.1 Organizing your working directory
    • 3.5 Interacting with R
    • 3.6 How to learn more after the workshop?
    • 3.7 Seeking help
      • 3.7.1 Use the built-in RStudio help interface to search for more information on R functions
      • 3.7.2 I know the name of the function I want to use, but I’m not sure how to use it
      • 3.7.3 I want to use a function that does X, there must be a function for it but I don’t know which one…
      • 3.7.4 I am stuck… I get an error message that I don’t understand
      • 3.7.5 Asking for help
      • 3.7.6 Where to ask for help?
      • 3.7.7 More resources
  • 4 Introduction to R
    • 4.1 Creating objects in R
      • 4.1.1 Comments
      • 4.1.2 Functions and their arguments
      • 4.1.3 Objects vs. variables
    • 4.2 Vectors and data types
    • 4.3 Subsetting vectors
      • 4.3.1 Conditional subsetting
    • 4.4 Missing data
  • 5 Starting with data
    • 5.1 Presentation of the Survey Data
    • 5.2 What are data frames?
    • 5.3 Inspecting data.frame Objects
    • 5.4 Indexing and subsetting data frames
    • 5.5 Factors
    • 5.6 Formatting Dates
  • 6 Manipulating and analyzing data
    • 6.1 What are dplyr and tidyr?
    • 6.2 Selecting columns and filtering rows
    • 6.3 Pipes
    • 6.4 Mutate
    • 6.5 Split-apply-combine data analysis and the summarize() function
      • 6.5.1 The summarize() function
      • 6.5.2 Tallying
    • 6.6 Reshaping with gather and spread
    • 6.7 Exporting data
  • 7 Data visualization with ggplot2
    • 7.1 Plotting with ggplot2
    • 7.2 Building your plots iteratively
    • 7.3 Boxplot
    • 7.4 Plotting time series data
    • 7.5 Faceting
    • 7.6 ggplot2 themes
    • 7.7 Customization
    • 7.8 Arranging and exporting plots
  • 8 Reproducible research with Rmarkdown
    • 8.1 Introduction
    • 8.2 Additional features
    • 8.3 Challenge
  • 9 R Package Development
    • 9.1 Introduction
      • 9.1.1 References
      • 9.1.2 Terminology
      • 9.1.3 Requirement
      • 9.1.4 Course content
    • 9.2 Preparing R code
    • 9.3 Package layout
    • 9.4 Package developement cycle
    • 9.5 Package metadata
    • 9.6 NAMESPACE
    • 9.7 R code
    • 9.8 Package sub-directories
    • 9.9 Documentation
      • 9.9.1 Manual pages
      • 9.9.2 Using roxygen
      • 9.9.3 Vignettes
    • 9.10 Additional files
    • 9.11 Distributing packages
  • 10 Visualisation of high-dimensional data in R
    • 10.1 Introduction
    • 10.2 Data
    • 10.3 K-means clustering
      • 10.3.1 How does k-means work
      • 10.3.2 Model selection
      • 10.3.3 How to determine the number of clusters
    • 10.4 Hierarchical clustering
      • 10.4.1 How does hierarchical clustering work
      • 10.4.2 Defining clusters
    • 10.5 Principal component analysis (PCA)
      • 10.5.1 Visualisation
  • 11 Interactive visualisation
    • 11.1 Interactivity graphs
      • 11.1.1 Interactivity
      • 11.1.2 Layers
      • 11.1.3 More components
    • 11.2 Interactive apps
      • 11.2.1 Introduction
      • 11.2.2 Creation of our fist shiny app
      • 11.2.3 The shiny infrastructure
      • 11.2.4 Another shiny app
      • 11.2.5 Single-file app
      • 11.2.6 There’s more to shiny
  • 12 Navigating the Bioconductor project
    • 12.1 biocViews
    • 12.2 Workflows
    • 12.3 Learning about specific packages
      • 12.3.1 Package vignettes
      • 12.3.2 Installation
    • 12.4 Versions
    • 12.5 Getting help
    • 12.6 Data infrastructure
    • 12.7 More resources
    • 12.8 Challenges
  • 13 R programming concepts and tools
    • 13.1 Defensive programming
      • Talking to users
    • 13.2 KISS
    • 13.3 Failing fast and well
    • 13.4 Consistency and predictability
    • 13.5 Comparisons
      • Floating point issues to be aware of
      • Floating point: how to compare
      • Exact identity
    • 13.6 Exercise
    • 13.7 Debugging: techniques and tools
      • Tools
      • Exercise
  • Session information

Data analysis and R programming

Chapter 12 Navigating the Bioconductor project

12.1 biocViews

Bioconductor has become a large project proposing many packages (1473 software packages at the time of writing) across many domains of high throughput biology. It continues to grow, at an increasing rate, and it can be difficult to get started.

One way to find packages of interest is to navigate the biocViews hierarchy. Every package is tagged with a set of biocViews labels. The highest level defines 3 types of packages:

  • Software: packages providing a specific functionality.
  • AnnotationData: packages providing annotations, such as various ontologies, species annotations, microarray annotations, …
  • ExperimentData: packages distributing experiments.

The biocViews page is available here

  • https://bioconductor.org/packages/release/BiocViews.html#___Software

It is most easily accessed by clicking on the software packages link on the homepage, under About Bioconductor.

See also this page for additional information.

12.2 Workflows

On the other hand, people generally don’t approach the Bioconductor project to learn the whole project, but are interested by a specific analysis from a Bioconductor package, that they have read in a paper of interest. In my opinion, it is more effective to restrict ones attention to a problem or analysis of interest to first immerse oneself into Bioconductor, then broaden up ones experience to other topics and packages.

To to that, the project offers workflows that provide a general introduction to topics such as sequence analysis, annotation resources, RNA-Seq data analyis, Mass spectrometry and proteomics, CyTOF analysis, ….

  • https://bioconductor.org/help/workflows/

A similar set of resources are published in F1000Research under the Bioconductor gateway

  • https://f1000research.com/gateways/bioconductor

These peer-reviewed papers describe more complete pipelines involving one or several packages.

12.3 Learning about specific packages

Each Bioconductor package has it’s own landing pages that provides all necessary information about a package, including a short summary, its current version, the authors, how the cite the package, installation instructions, and links to all package vignettes.

Any Bioconductor package page can be contructed by appending the package’s name to https://bioconductor.org/packages/ to produce an URL like

  • https://bioconductor.org/packages/packagename

This works for any type of package (software, annotation or data). For example, the pages for packages DESeq2 or MSnbase would be

  • https://bioconductor.org/packages/DESeq2

and

  • https://bioconductor.org/packages/MSnbase

These short URLs are then resolved to their longer form to redirect to the longer package URL leading the user to the current release version of the packge.

12.3.1 Package vignettes

An important quality of every Bioconductor package is the availability of a dedicated vignette. Vignettes are documentations (generally provided as pdf or html files) that provide a generic overview of the package, without necessarily going in detail for every function of the package.

Vignettes are special in that respect as they are produced as part of the package building process. The code in a vignette is executed and its output, whether in the form of simple text, tables and figures, are inserted in the vignette before the final file (in pdf or html) is produced. Hence, all the code and outputs are guaranteed to be correct and reproduced.

Given a vignette, it is this possible to re-generate all the results. To make reproducing a long vignette as easy as possible without copy and pasting all code chunks one by one, it is possible to extract the code into an R script runnung the Stangle (from the utils package - see here for details) or knitr::purl functions on the vignette source document.

12.3.2 Installation

The installation of a Bioconductor package is done using the biocLite function from the BiocInstaller package. The first time, when the package isn’t available yet, you need to run

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")

The above command will install and load BiocInstaller. If it isn’t loaded yet, first load it with library, then install your package(s) of interest with biocLite:

library("BiocInstaller")
biocLite("Biobase")

This page provides more details on package installation and update.

12.4 Versions

It is also useful to know that at any given time, there are two Bioconductor versions - there is always a release (stable) and a development (devel) versions. For example, in October 2017, the release version is 3.6 and the development is 3.7.

The individual packages have a similar scheme. Every package is available for the release and the development versions of Bioconductor. These two versions of the package have also different version numbers, where the last digit is even for the former and off for the later. Currently, the MSnbase has versions 2.4.0 and 2.5.1, respectively.

Finally, every Bioconductor version is tight to an R version. To access the current Bioconductor release 3.6 version, one needs to use the latest R version, which is 3.4.2. Hence, it is important to have an up-to-date R installation to keep up with the latest developments in Bioconductor. More details here.

12.5 Getting help

The best way to get help with regard the a Bioconductor package is to post the question on the Bioconductor support forum at https://support.bioconductor.org/. Package developers generally follow the support site for questions related to their packages. See this page for some details.

To maximise the chances to obtain a answer promptly, it is important to provide details for other to understand the question and, if relevant, reproduce the observed errors. The Bioconductor project has a dedicated posting guide. Here’s another useful guide on how to write a reproducible question.

Packages come with a lot of documentation build in, that users are advised to read to familiarise themselves with the package and how to use it. In addition to the package vignettes are describe above, every function of class in a package is documented in great detail in their respective man page, that can be accessed with ?function.

There is also a dedicated developer mailing list that is dedicated for questions and discussions related to package development.

12.6 Data infrastructure

An essential aspect that is central to Bioconductor and its success is the availability of core data infrastructure that is used across packages. Package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole.

Here are some core classes, taken from the Common Bioconductor Methods and Classes page:

Importing

  • GTF, GFF, BED, BigWig, etc., - rtracklayer::import()
  • VCF – VariantAnnotation::readVcf()
  • SAM / BAM – Rsamtools::scanBam(), GenomicAlignments:readGAlignment*()
  • FASTA – Biostrings::readDNAStringSet()
  • FASTQ – ShortRead::readFastq()
  • Mass spectrometry data (XML-based and mfg formats) – MSnbase::readMSData(), MSnbase::readMgfData()

Common Classes

  • Rectangular feature x sample data – SummarizedExperiment::SummarizedExperiment() (RNAseq count matrix, microarray, …)
  • Genomic coordinates – GenomicRanges::GRanges() (1-based, closed interval)
  • DNA / RNA / AA sequences – Biostrings::*StringSet()
  • Gene sets – GSEABase::GeneSet() GSEABase::GeneSetCollection()
  • Multi-omics data – MultiAssayExperiment::MultiAssayExperiment()
  • Single cell data – SingleCellExperiment::SingleCellExperiment()
  • Mass spectrometry data – MSnbase::MSnExp()

A common design theme that is found throughout many Bioconductor core classes is illustrated below, which is found in microarrays, quantiative proteomics data, RNA-Seq data, … It contains a rectangular feature x sample data matrix and sample and feature-specific annotations.

12.7 More resources

  • Videos: https://www.youtube.com/user/bioconductor
  • Community Contributed Help Resources: https://bioconductor.org/help/community/
  • Tutorials: https://support.bioconductor.org/t/Tutorials/

12.8 Challenges

  1. Install a Bioconductor package of your choice, discover the vignette(s) it offers, open one, and extract the R code of it.

  2. Find a package that allows reading raw mass spectrometry data and identify the specific function. Either use the biocViews tree, look for a possible workflow, or look in the common methods and classes page on the Bioconductor page.