This short post describes how to succeed and how to fail in answering a typical exam question in one of my bachelor courses. I even believe it is applicable later when tackling more complex data analysis questions in a research project.
The student is provided with an object of class
se, that contains (part of the) data from The effect of
upper-respiratory infection on transcriptomic changes in the CNS by
Blackmore et al.
Purpose: The goal of this study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection.
(The provided data here only contain the cerebellum samples.)
Methods: Gender matched eight week old C57BL/6 mice were inoculated saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.
The student is asked to visualise the gene expression distributions in each sample using violin plots and to colour these based on the infection statues, Influenza A or Non-Infected.
There are several ways to produce this visualisation. Here, the
students have learnt to use the
tidyverse and about
SummarizedExperiment objects, and not (yet) with base plotting. The
goal of this question is for them to demonstrate they can extract
different sub-parts from a
SummarizedExperiment, know how to
transform and combine them using standard
dplyr functions and
eventually visualise them using
It’s a trap
The main trap many students fall into is to start writing code before having thought about how to answer the question. They have access to the lecture notes and code as well as past exam questions. They often start by copying code that apparently (at least) partially addresses the question, and try to modify it until “it works”. Student often end up wasting precious time with this random trial and error strategy. This trial-and-error approach equates to randomly taking buses until one brings them to they expected destination by sheer luck… which is extremely unlikely when several buses are needed to reach that destination.
A conceptual solution
My suggestion is to first think about how to answer the question before actually answering. As a way of focusing on the how, let’s first sketch out an plan, starting from the end.
Based on the question, I anticipate to have a figure with samples along the x axis and expression values along the y axis, with violin plots of different colours based on the infection status. I typically recommend students to have a go with pen and paper and draw such a hypothetical figure.
We can already easily write code that produces such a figure, assuming that we have a long
x, that contains three columns, namely
expressionwith the expression values,
samplecontaining the sample names and
infection, with the ‘Influenza A and Non-Infected sample infection status.
ggplot(x, aes(x = sample, y = expression, fill = infection)) + geom_violin()
From the previous step, we know we will need the expression data, the sample names, and their respective infection status. Even without looking at the actual data, we know how to extract these:
assay(se)gives us the expression data, with column names referring to the sample, and
se$infection) provides the infection status (assuming that it’s encoded by the
We also know that
assay(se)returns the data in a wide format, with genes along the rows and samples along the columns. Given that we want to produce the figure sketched out in step 1 using
ggplot()in step 2, we need to convert it into a long table, which we can do use with
pivot_longer(assay(se), names_to = "sample", values_to = "expression", everything())
The code above will fail because
assay(se)returns a matrix, but
data.frame, as mentioned above. We thus need to convert the matrix as shown below:
pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything())
This will generate something quite close to what we need, a table with the
samplecolumns. We are still missing the
We know from step 3 that the infection status is available from the
colData, that we need to merge to our long two-column table from step 4. This can be done with
full_join(). Below, I use the pipe operator to continue the chunk written above.
pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything()) %>% full_join(colData(se))
There are two errors in the code above. First, we need to convert the
tibble). This is easy. The second (likely) error lies in the variable names used to join the tables (
byargument). If we have a
samplevariable in the
colDatathat contains sample names, the join will succeed. Otherwise, we will need to defined the vector of variables to join by using
by. We can also create that variable on the fly using
tibble::rownames_to_column()knowing that the rownames of
colDatamatch the sample names (by definition), and set it to
pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything()) %>% full_join(rownames_to_column(as.data.frame(colData(se)), var = "sample"))
Bringing it all together
In the previous section, I already wrote syntactically correct code,
ironing out smaller issues such as conversion to
variable matching. I haven’t run the code, but I’m quite confident it
should work with my
se object, as long as the infection status is
infection in the
colData. A conceptual solution without
these details would already earn students a very decent grade. The
effort to produce the actual figure and earn full marks is relatively
minor, once the path to the answer is mapped out. Referring back to
the travelling by bus illustration above, this conceptual solution is
similar to having the bus names, their respective stop, and possibly
even times mapped out. All that is left is to execute the actual
pivot_longer(as.data.frame(assay(se)), names_to = "sample", values_to = "expression", everything()) %>% full_join(rownames_to_column(as.data.frame(colData(se)), var = "sample")) %>% ggplot(aes(x = sample, y = expression, fill = infection)) + geom_violin()
There is of course a much more direct solution, using
the assay matrix and setting the
col argument with the infection
state vector. This solution is one that student would most likely
implement after following the second year of the bioinformatics
curriculum. In the first year, we start
with general data manipulation and visualisation using the
tidyverse. Omics data and
SummarizedExperiment only come later, as a
preparation for the second and
third years. Other examples won’t simplify
that nicely though. But the important fact remains that elaborating a
solution will always reply on a good understanding of the basic data
structure and their manipulation and, using that understanding,
conceptually sketch an initial path to the solution.
While I do emphasise the reasoning and concepts behind the code when introducing new material and solving exercises in class, I appreciate that when learning a programming language for the first time, one tends to focus on the syntax (the interpreter isn’t forgiving, after all) and it can be difficult to step back and look at the bigger picture. This is probably something I will need to pay special attention to from now on.
I came to write this post following several discussions with students during office hours. These discussions led invariably to the same conclusion: those that struggle in class and during the exams struggle to conceptualise the course material. Despite honest efforts, every exercise comes across as a new beast to tame - they fail to visualise the common patterns and start from scratch at every attempt, which is not only very inefficient, but also leads to considerable stress, which itself amplifies perceived difficulty of the task.