How to succeed in my exam

6 minute read

This short post describes how to succeed and how to fail in answering a typical exam question in one of my bachelor courses. I even believe it is applicable later when tackling more complex data analysis questions in a research project.

The question

The student is provided with an object of class SummarizedExperiment, called se, that contains (part of the) data from The effect of upper-respiratory infection on transcriptomic changes in the CNS by Blackmore et al. (2017):

Purpose: The goal of this study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection.

(The provided data here only contain the cerebellum samples.)

Methods: Gender matched eight week old C57BL/6 mice were inoculated saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.

The student is asked to visualise the gene expression distributions in each sample using violin plots and to colour these based on the infection statues, Influenza A or Non-Infected.

Pre-requisites

There are several ways to produce this visualisation. Here, the students have learnt to use the tidyverse and about SummarizedExperiment objects, and not (yet) with base plotting. The goal of this question is for them to demonstrate they can extract different sub-parts from a SummarizedExperiment, know how to transform and combine them using standard dplyr functions and eventually visualise them using ggplot2.

It’s a trap

The main trap many students fall into is to start writing code before having thought about how to answer the question. They have access to the lecture notes and code as well as past exam questions. They often start by copying code that apparently (at least) partially addresses the question, and try to modify it until “it works”. Student often end up wasting precious time with this random trial and error strategy. This trial-and-error approach equates to randomly taking buses until one brings them to they expected destination by sheer luck… which is extremely unlikely when several buses are needed to reach that destination.

A conceptual solution

My suggestion is to first think about how to answer the question before actually answering. As a way of focusing on the how, let’s first sketch out an plan, starting from the end.

  1. Based on the question, I anticipate to have a figure with samples along the x axis and expression values along the y axis, with violin plots of different colours based on the infection status. I typically recommend students to have a go with pen and paper and draw such a hypothetical figure.

  2. We can already easily write code that produces such a figure, assuming that we have a long data.frame or tibble named x, that contains three columns, namely expression with the expression values, sample containing the sample names and infection, with the ‘Influenza A and Non-Infected sample infection status.

    ggplot(x, aes(x = sample, y = expression, fill = infection)) +
        geom_violin()
    
  3. From the previous step, we know we will need the expression data, the sample names, and their respective infection status. Even without looking at the actual data, we know how to extract these: assay(se) gives us the expression data, with column names referring to the sample, and colData(se)$infection (or simply se$infection) provides the infection status (assuming that it’s encoded by the infection column variable).

  4. We also know that assay(se) returns the data in a wide format, with genes along the rows and samples along the columns. Given that we want to produce the figure sketched out in step 1 using ggplot() in step 2, we need to convert it into a long table, which we can do use with pivot_longer().

    pivot_longer(assay(se),
                 names_to = "sample",
                 values_to = "expression",
                 everything())
    

    The code above will fail because assay(se) returns a matrix, but pivot_longer() expects a data.frame, as mentioned above. We thus need to convert the matrix as shown below:

    pivot_longer(as.data.frame(assay(se)),
                 names_to = "sample",
                 values_to = "expression",
                 everything())
    

    This will generate something quite close to what we need, a table with the expression and sample columns. We are still missing the infection status.

  5. We know from step 3 that the infection status is available from the colData, that we need to merge to our long two-column table from step 4. This can be done with full_join(). Below, I use the pipe operator to continue the chunk written above.

    pivot_longer(as.data.frame(assay(se)),
                 names_to = "sample",
                 values_to = "expression",
                 everything()) %>%
        full_join(colData(se))
    

    There are two errors in the code above. First, we need to convert the DataFrame returned by colData(se) to a data.frame (or tibble). This is easy. The second (likely) error lies in the variable names used to join the tables (full_join()’s by argument). If we have a sample variable in the colData that contains sample names, the join will succeed. Otherwise, we will need to defined the vector of variables to join by using by. We can also create that variable on the fly using tibble::rownames_to_column() knowing that the rownames of colData match the sample names (by definition), and set it to sample.

    pivot_longer(as.data.frame(assay(se)),
                 names_to = "sample",
                 values_to = "expression",
                 everything()) %>%
        full_join(rownames_to_column(as.data.frame(colData(se)),
                                     var = "sample"))
    

Bringing it all together

In the previous section, I already wrote syntactically correct code, ironing out smaller issues such as conversion to data.frame and variable matching. I haven’t run the code, but I’m quite confident it should work with my se object, as long as the infection status is encoded as infection in the colData. A conceptual solution without these details would already earn students a very decent grade. The effort to produce the actual figure and earn full marks is relatively minor, once the path to the answer is mapped out. Referring back to the travelling by bus illustration above, this conceptual solution is similar to having the bus names, their respective stop, and possibly even times mapped out. All that is left is to execute the actual travel plan.

pivot_longer(as.data.frame(assay(se)),
             names_to = "sample",
             values_to = "expression",
             everything()) %>%
    full_join(rownames_to_column(as.data.frame(colData(se)),
                                 var = "sample")) %>%
    ggplot(aes(x = sample, y = expression, fill = infection)) +
    geom_violin()

Conclusions

There is of course a much more direct solution, using boxplot() on the assay matrix and setting the col argument with the infection state vector. This solution is one that student would most likely implement after following the second year of the bioinformatics curriculum. In the first year, we start with general data manipulation and visualisation using the tidyverse. Omics data and SummarizedExperiment only come later, as a preparation for the second and third years. Other examples won’t simplify that nicely though. But the important fact remains that elaborating a solution will always reply on a good understanding of the basic data structure and their manipulation and, using that understanding, conceptually sketch an initial path to the solution.

While I do emphasise the reasoning and concepts behind the code when introducing new material and solving exercises in class, I appreciate that when learning a programming language for the first time, one tends to focus on the syntax (the interpreter isn’t forgiving, after all) and it can be difficult to step back and look at the bigger picture. This is probably something I will need to pay special attention to from now on.

Some context

I came to write this post following several discussions with students during office hours. These discussions led invariably to the same conclusion: those that struggle in class and during the exams struggle to conceptualise the course material. Despite honest efforts, every exercise comes across as a new beast to tame - they fail to visualise the common patterns and start from scratch at every attempt, which is not only very inefficient, but also leads to considerable stress, which itself amplifies perceived difficulty of the task.