How to succeed in my exam
This short post describes how to succeed and how to fail in answering a typical exam question in one of my bachelor courses. I even believe it is applicable later when tackling more complex data analysis questions in a research project.
The question
The student is provided with an object of class
SummarizedExperiment
,
called se
, that contains (part of the) data from The effect of
upper-respiratory infection on transcriptomic changes in the CNS by
Blackmore et al.
(2017):
Purpose: The goal of this study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection.
(The provided data here only contain the cerebellum samples.)
Methods: Gender matched eight week old C57BL/6 mice were inoculated saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.
The student is asked to visualise the gene expression distributions in each sample using violin plots and to colour these based on the infection statues, Influenza A or Non-Infected.
Pre-requisites
There are several ways to produce this visualisation. Here, the
students have learnt to use the
tidyverse and about
SummarizedExperiment
objects, and not (yet) with base plotting. The
goal of this question is for them to demonstrate they can extract
different sub-parts from a SummarizedExperiment
, know how to
transform and combine them using standard dplyr
functions and
eventually visualise them using ggplot2
.
It’s a trap
The main trap many students fall into is to start writing code before having thought about how to answer the question. They have access to the lecture notes and code as well as past exam questions. They often start by copying code that apparently (at least) partially addresses the question, and try to modify it until “it works”. Student often end up wasting precious time with this random trial and error strategy. This trial-and-error approach equates to randomly taking buses until one brings them to they expected destination by sheer luck… which is extremely unlikely when several buses are needed to reach that destination.
A conceptual solution
My suggestion is to first think about how to answer the question before actually answering. As a way of focusing on the how, let’s first sketch out an plan, starting from the end.
-
Based on the question, I anticipate to have a figure with samples along the x axis and expression values along the y axis, with violin plots of different colours based on the infection status. I typically recommend students to have a go with pen and paper and draw such a hypothetical figure.
-
We can already easily write code that produces such a figure, assuming that we have a long
data.frame
ortibble
namedx
, that contains three columns, namelyexpression
with the expression values,sample
containing the sample names andinfection
, with the ‘Influenza A and Non-Infected sample infection status.ggplot(x, aes(x = sample, y = expression, fill = infection)) + geom_violin()
-
From the previous step, we know we will need the expression data, the sample names, and their respective infection status. Even without looking at the actual data, we know how to extract these:
assay(se)
gives us the expression data, with column names referring to the sample, andcolData(se)$infection
(or simplyse$infection
) provides the infection status (assuming that it’s encoded by theinfection
column variable). -
We also know that
assay(se)
returns the data in a wide format, with genes along the rows and samples along the columns. Given that we want to produce the figure sketched out in step 1 usingggplot()
in step 2, we need to convert it into a long table, which we can do use withpivot_longer()
.pivot_longer(assay(se), names_to = "sample", values_to = "expression", everything())
The code above will fail because
assay(se)
returns a matrix, butpivot_longer()
expects adata.frame
, as mentioned above. We thus need to convert the matrix as shown below:pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything())
This will generate something quite close to what we need, a table with the
expression
andsample
columns. We are still missing theinfection
status. -
We know from step 3 that the infection status is available from the
colData
, that we need to merge to our long two-column table from step 4. This can be done withfull_join()
. Below, I use the pipe operator to continue the chunk written above.pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything()) %>% full_join(colData(se))
There are two errors in the code above. First, we need to convert the
DataFrame
returned bycolData(se)
to adata.frame
(ortibble
). This is easy. The second (likely) error lies in the variable names used to join the tables (full_join()
’sby
argument). If we have asample
variable in thecolData
that contains sample names, the join will succeed. Otherwise, we will need to defined the vector of variables to join by usingby
. We can also create that variable on the fly usingtibble::rownames_to_column()
knowing that the rownames ofcolData
match the sample names (by definition), and set it tosample
.pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything()) %>% full_join(rownames_to_column(as.data.frame(colData(se)), var = "sample"))
Bringing it all together
In the previous section, I already wrote syntactically correct code,
ironing out smaller issues such as conversion to data.frame
and
variable matching. I haven’t run the code, but I’m quite confident it
should work with my se
object, as long as the infection status is
encoded as infection
in the colData
. A conceptual solution without
these details would already earn students a very decent grade. The
effort to produce the actual figure and earn full marks is relatively
minor, once the path to the answer is mapped out. Referring back to
the travelling by bus illustration above, this conceptual solution is
similar to having the bus names, their respective stop, and possibly
even times mapped out. All that is left is to execute the actual
travel plan.
pivot_longer(as.data.frame(assay(se)),
names_to = "sample",
values_to = "expression",
everything()) %>%
full_join(rownames_to_column(as.data.frame(colData(se)),
var = "sample")) %>%
ggplot(aes(x = sample, y = expression, fill = infection)) +
geom_violin()
Conclusions
There is of course a much more direct solution, using boxplot()
on
the assay matrix and setting the col
argument with the infection
state vector. This solution is one that student would most likely
implement after following the second year of the bioinformatics
curriculum. In the first year, we start
with general data manipulation and visualisation using the
tidyverse. Omics data and SummarizedExperiment
only come later, as a
preparation for the second and
third years. Other examples won’t simplify
that nicely though. But the important fact remains that elaborating a
solution will always reply on a good understanding of the basic data
structure and their manipulation and, using that understanding,
conceptually sketch an initial path to the solution.
While I do emphasise the reasoning and concepts behind the code when introducing new material and solving exercises in class, I appreciate that when learning a programming language for the first time, one tends to focus on the syntax (the interpreter isn’t forgiving, after all) and it can be difficult to step back and look at the bigger picture. This is probably something I will need to pay special attention to from now on.
Some context
I came to write this post following several discussions with students during office hours. These discussions led invariably to the same conclusion: those that struggle in class and during the exams struggle to conceptualise the course material. Despite honest efforts, every exercise comes across as a new beast to tame - they fail to visualise the common patterns and start from scratch at every attempt, which is not only very inefficient, but also leads to considerable stress, which itself amplifies perceived difficulty of the task.