Becoming a better scientist with open and reproducible research

This blog post summarises the notes for my talk at the Are you ready for publishing reproducible research? meeting at TU Delft on 16 May 2019. The slides are available here.

This material is available under CC-BY, unless otherwise stated.

The original title of the talk was Being a better scientist with open and reproducible research, but I feel that this is something one constantly aims for, hence the change to becoming one.

Disclaimer: I do not speak from authority [1]. I speak from personal experience. My experience is in computational biology, bioinformatics and high throughput biology data. My experience doesn’t directly translate to other fields or domains (for example when it comes to data privacy) or even to other personalities in the same field.

Note 1: Being open doesn’t automatically make a piece of research good, where good is defined as of high academic quality. Being closed doesn’t make a piece of research bad, where bad is defined as of low academic quality. So openness doesn’t equate to academic quality. But openness provides a desirable property that is independent of academic excellence. Openness leads to trust.

Note 2: Being reproducible doesn’t automatically make a piece of research good, where good is defined as of high academic quality. Being non-reproducible doesn’t make a piece of research bad, where bad is defined as of low academic quality. So reproducibility doesn’t equate to academic quality. But reproducibility provides a desirable property that is independent of academic excellence. Reproducibility leads (among other things) to trust.

Open and reproducible research

Open != reproducible

Open research and reproducible research aren’t the same thing, and one doesn’t imply the other. They are historically also very different.

From a technical point of view:

  • When individual patronage funded scientists, discoveries were kept private or publicised as codes in anagrams or cyphers (Source Wikipedia:Open Science).
  • The concept of open access to scientific data was institutionally established with the formation of the World Data Center system (now the World Data System) in 1957. MEDLINE, later renamed PubMed, was created in 1966. (Source Wikipedia:Open Science Data).

From a philosophical point of view:

The Mertonian norms (1942)

  • Communism: all scientists should have common ownership of scientific goods (intellectual property), to promote collective collaboration; secrecy is the opposite of this norm.

  • Universalism: scientific validity is independent of the sociopolitical status/personal attributes of its participants.

  • Disinterestedness: scientific institutions act for the benefit of a common scientific enterprise, rather than for the personal gain of individuals within them

  • Organised scepticism: scientific claims should be exposed to critical scrutiny before being accepted: both in methodology and institutional codes of conduct.

There isn’t only one type of open science

Open science has seen a continuous evolution since the 17th century, with the advent of dissemination of research in scientific journals and the societal demand to access scientific knowledge at large. Technology and communication have further accelerated this evolution, and put it in the spotlight among researchers and academics (for example through funder mandates) and more widely in the press with the cost of publications (see for example this Guardian long read article Is the staggeringly profitable business of scientific publishing bad for science? or the Paywall movie).

Open science/research is the process of transparent dissemination and access to knowledge, that can be applied to various scientific practices (image below from Wikipedia):

The six principles of open science

As a result

Open science/research can mean different things to different people, in particular when declined along its many technical and philosophical attributes.

Take home message:

Open isn’t binary, it’s a gradient, it’s multidisciplinary, it’s multidimensional.

How to be an open scientist:

Let’s be open and understanding of different situations and constraints.

Why become an open research practitioner

It’s the right thing to do. See the Mertonian norms… Or is it?

Benefits for your academic career: some examples from the Open as a career boost paragraph:

  • Open access articles get more citations.
  • Data availability is associated with citation benefit.
  • Openly available software is more likely to be used. (I don’t have any reference for this, and there are of course many counterexamples.)

Networking opportunities (I’m here thanks to my open research activities with my former colleague Marta Teperek at the University of Cambridge, UK).

See also Why Open Research

But are there any risks?

Does it take more time to work openly?

Isn’t it worth investing time in managing data in a way that others (including one’s future self) can find and understand it? That’s, IMHO, particularly important from a group leader’s perspective, where I want to build a corpus of data/software/research that other lab members can find, mine and re-use.

Are senior academics always supportive?

No.

Is there a risk of being scooped?

There certainly is a benefit in releasing one’s research early!

But, importantly, working with open and reproducible research in mind doesn’t mean releasing everything prematurely, it means

  • managing research in a way one can find data and results at every stage

  • one can reproduce/repeat results, re-run/compare them with new data or different methods/parameters, and

  • one can release data (or parts thereof) when/if appropriate.

So, are there any risks?

The Bullied Into Bad Science campaign is an initiative by early career researchers (ECRs) for early career researchers who aim for a fairer, more open and ethical research and publication environment.

Bullied Into Bad Science

Why reproducibility is important

  • For scientific reasons: think reproducibility crisis.

  • For political reasons: public trust in science, in data, in experts; without (public) trust in science and research, there won’t be any funding anymore.

But what do we mean by reproducibility?

From the blog post But what do we mean by reproducibility?

  • Repeat my experiment, i.e. obtain the same tables/graphs/results using the same setup (data, software, …) in the same lab or on the same computer. That’s basically re-running one of my analyses some time after I originally developed it.

  • Reproduce an experiment (not mine), i.e. obtain the same tables/graphs/results in a different lab or on a different computer, using the same setup (the data would be downloaded from a public repository and the same software would be used, but possibly in a different version or on a different OS). I suppose we should differentiate between using a fresh install and using a virtual machine or Docker image that replicates the original setup.

  • Replicate an experiment, i.e. obtain the same (or similar enough) tables/graphs/results in a different setup. The data could still be downloaded from the public repository, or possibly be re-generated/re-simulated, and the analysis would be re-implemented based on the original description. This requires openness, and one would clearly not be allowed to use a black box approach (VM, Docker image) or to just re-run a script.

  • Finally, re-use the information/knowledge from one experiment to run a different experiment with the aim to confirm results from scratch.
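As a minimal sketch of the first case (repeating an analysis), one could fingerprint the output of a run and compare it with the fingerprint of a later re-run. The function names and the toy results table below are hypothetical, and the sketch assumes that results can be serialised as rows of values; results that differ only by floating-point noise would need a tolerance-based comparison instead.

```python
import hashlib


def fingerprint(rows):
    """Return a SHA-256 digest of a results table given as a list of rows."""
    digest = hashlib.sha256()
    for row in rows:
        # Join cells with a tab and end each row with a newline so that
        # cell and row boundaries are part of what gets hashed.
        line = "\t".join(str(cell) for cell in row) + "\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def same_results(first_run, second_run):
    """True if two runs produced identical tables (same rows, same order)."""
    return fingerprint(first_run) == fingerprint(second_run)


# Toy example (made-up values): the same analysis run twice, then a
# third run in which one value has changed.
original = [("protein", "score"), ("P1", 0.91), ("P2", 0.15)]
rerun = [("protein", "score"), ("P1", 0.91), ("P2", 0.15)]
changed = [("protein", "score"), ("P1", 0.91), ("P2", 0.16)]

print(same_results(original, rerun))    # True
print(same_results(original, changed))  # False
```

Storing such a fingerprint next to the results makes it cheap to notice, months later, that a re-run no longer gives the same output.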

Another view (from a talk by Kirstie Whitaker):

|                | Same Data | Different Data |
| Same Code      | reproduce | replicate      |
| Different Code | robust    | generalisable  |

See also this opinion piece by Jeffrey T. Leek and Roger D. Peng, Reproducible research can still be wrong: Adopting a prevention approach.

From Gabriel Becker, An Imperfect Guide to Imperfect Reproducibility, May Institute for Computational Proteomics, 2019:

(Computational) Reproducibility Is Not The Point

Take home message:

The goal is trust, verification and guarantees:

  • Trust in Reporting - result is accurately reported
  • Trust in Implementation - analysis code successfully implements chosen methods
  • Statistical Trust - data and methods are (still) appropriate
  • Scientific Trust - result convincingly supports claim(s) about underlying systems or truths

Reproducibility As A Trust Scale (copyright Genentech Inc)


Take home message:

Reproducibility isn’t binary, it’s a gradient, it’s multidisciplinary, it’s multidimensional.

Another take home message:

Reproducibility isn’t easy.

Why become a reproducible research practitioner

Florian Markowetz, Five selfish reasons to work reproducibly, Genome Biology 2015, 16:274.

And so, my fellow scientists: ask not what you can do for reproducibility; ask what reproducibility can do for you! Here, I present five reasons why working reproducibly pays off in the long run and is in the self-interest of every ambitious, career-oriented scientist.

  • Reason number 1: reproducibility helps to avoid disaster
  • Reason number 2: reproducibility makes it easier to write papers
  • Reason number 3: reproducibility helps reviewers see it your way
  • Reason number 4: reproducibility enables continuity of your work
  • Reason number 5: reproducibility helps to build your reputation

And career perspectives: Faculty promotion must assess reproducibility.

What can you do to improve trust in (your) research?

  1. Be an open research practitioner
  2. Be a reproducible research practitioner

This includes (but is not limited to):

Preprints are the best!

Read, post, review and cite preprints (see ASAPbio for lots of resources about preprints).

Promoting open research through peer review

This section is based on my talk The role of peer-reviewers in checking supporting information promoting open science.

As an open researcher, I think it is important to apply and promote the importance of data and good data management on a day-to-day basis (see for example Marta Teperek’s 2017 Data Management: Why would I bother? slides), but also to express this ethic in our academic capacity, such as peer review. My responsibility as a reviewer is to

  • Accept sound/valid research and provide constructive comments

and hence

  • Focus firstly on the validity of the research by inspecting the data, software and method. If the methods and/or data fail, the rest is meaningless.

I don’t see novelty, relevance, news-worthiness as my business as a reviewer. These factors are not the prime qualities of thorough research, but rather characteristics of flashy news.

Here are some aspects that are easy enough to check, and that go a long way towards verifying the availability and validity of the data:

  1. Availability: Are the data/software/methods accessible and understandable in a way that would allow an informed researcher in the same or a closely related field to reproduce and/or verify the results underlying the claims? Note that this doesn’t mean that, as a reviewer, I will necessarily try to repeat the whole analysis (that would indeed be too time consuming). But, conversely, a submission without data/software will be reviewed (and rejected or, more appropriately, sent back for completion) in a matter of minutes. Are the data available in a public repository that guarantees that they will remain accessible, such as a subject-specific repository or, if none is available, a generic repository (such as Zenodo or figshare, …), an institutional repository or, less desirably, supplementary information or a personal webpage [2]?

  2. Meta-data: It’s of course not enough to provide a wild dump of the data/software/…; these need to be appropriately documented. Personally, I recommend a README file in every top-level project directory to summarise the project, the data, …

  3. Do numbers match?: The first step when reproducing someone’s analysis is to match the data files to the experimental design. That is one of the first things I check when reviewing a paper. For example, if the experimental design says there are 2 groups, each with 3 replicates, I expect to find 6 (or a multiple thereof) data files or data columns in the data matrix. Along these lines, I also look at the file names (or column names in the data matrix) for a consistent naming convention that allows matching the files (columns) to the experimental groups and replicates.

  4. What data, what format: Are the data readable with standard and open/free software? Are the raw and processed data available, and have the authors described how to get from one to the other?

  5. License: Are the data shared in a way that allows users to re-use them? Under what conditions? Is the research output shared under a valid license?
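The "Do numbers match?" check in point 3 can be sketched in a few lines. This is only an illustration: the `<group>_<replicate>` naming convention (e.g. `ctrl_1`), the function name and the column names are assumptions made for the example, and the sketch expects exactly one column per replicate rather than a multiple thereof.

```python
import re
from collections import Counter


def check_design(column_names, groups, n_replicates):
    """Compare data column names against the stated experimental design.

    Assumes (hypothetically) that columns are named '<group>_<replicate>',
    e.g. 'ctrl_1'. Returns a list of human-readable problems; an empty
    list means the columns match the design.
    """
    problems = []
    pattern = re.compile(r"^(?P<group>.+)_(?P<rep>\d+)$")
    counts = Counter()
    for name in column_names:
        match = pattern.match(name)
        if match is None:
            problems.append(f"'{name}' does not follow <group>_<replicate>")
        elif match.group("group") not in groups:
            problems.append(f"'{name}' refers to an unknown group")
        else:
            counts[match.group("group")] += 1
    for group in groups:
        if counts[group] != n_replicates:
            problems.append(
                f"group '{group}' has {counts[group]} columns, "
                f"expected {n_replicates}"
            )
    return problems


# Design: 2 groups with 3 replicates each, so 6 data columns are expected.
columns = ["ctrl_1", "ctrl_2", "ctrl_3", "treat_1", "treat_2", "sampleX"]
for problem in check_design(columns, groups=["ctrl", "treat"], n_replicates=3):
    print(problem)
```

Here the check flags the column that breaks the naming convention and the group that is one replicate short, which is exactly the kind of mismatch worth raising with the authors.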

Make sure that the data adhere to the FAIR principles:

Findable and Accessible and Interoperable and Reusable

Note that supplementary information (SI) is not FAIR: it is not discoverable, not structured, voluntary, and used to bury stuff. A personal web page is likely to disappear in the near future.

As a quick note, my ideal review system would be one where you

  1. Submit your data to a repository, where it gets checked (by specialists, data scientists, data curators) for quality, annotation, meta-data.

  2. Submit your research with a link to the peer reviewed data. Reviewers first assess the introduction and methods, and only then the results (to avoid positive results bias).

When talking about open research and peer review, one logical extension is open peer review.

While I personally value open peer review and practice it when possible, it can be a difficult issue for ECRs, exposing them unnecessarily when reviewing work from prominent peers. It can also reinforce an already unwelcoming environment for underrepresented minorities. See more about this in the Inclusivity: open research and open research section below.

Registered reports

Define your data collection and analysis protocol in advance. Get it reviewed and, if accepted, get the right to publish once the data have been collected and analysed, irrespective of the (positive or negative) result.

  • Three challenges: restrictions on flexibility (no p-hacking or HARKing), the time cost, and an incentive structure that isn’t in place yet.

  • Three benefits: greater faith in research (no p-hacking or HARKing), new helpful systems (see technical solutions below), and investment in your future.

See https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000246 (2019).

Make allies

This is very important!

  • Other ECRs
  • Librarians
  • Data stewards/champions
  • Research software engineers
  • On/off-line networking

Open research can lead to collaborative research. The development of MSnbase is an example I am very proud of.

Collaborative work and cooperation are certainly important concepts that gravitate around open science/research (see the Mertonian norm of communism), but they are neither necessary nor sufficient for open science.

Just do it!

Build openness at the core of your research

(according to your possibilities)

Open and reproducible research doesn’t work if it’s an afterthought.

Technical solutions

  • Scripting, scripting, scripting (applies to code, data, analyses, manuscripts, …).

  • Avoid manual steps.

  • Document everything, especially manual steps (which you should avoid anyway).

  • Version control, such as git/github, bitbucket, …

  • Literate analyses: reproducible documents with R markdown, Sweave (R with LaTeX), Jupyter notebooks, …

  • Shareable compute environments (docker containers).

  • Document and share all artefacts related to your research (when possible): data, code, protocols, …

See also An Imperfect Guide to Imperfect Reproducibility for further details.
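As one small illustration of scripting and documenting everything, an analysis script can write a provenance record next to its results. The sketch below uses only the Python standard library; the function name, the file name and the fields recorded are choices made for this example rather than any standard.

```python
import hashlib
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(input_files):
    """Describe the inputs and environment of an analysis run as a dict."""
    inputs = {}
    for path in map(Path, input_files):
        # A checksum lets a later reader confirm they have the same data.
        inputs[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command": sys.argv,
        "inputs_sha256": inputs,
    }


# Hypothetical usage: create a tiny input file in a temporary directory,
# then record the run that would analyse it.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example_data.csv"
    data_file.write_text("sample,value\nctrl_1,1.2\n")
    record = provenance_record([data_file])

print(json.dumps(record, indent=2))
```

In a real project the record would be saved alongside the results and kept under version control, so that every table or figure can be traced back to the data and environment that produced it.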

Some examples of my own research

Spatial proteomics software: systematic and high throughput analysis of sub-cellular protein localisation.

Software

  • Software: infrastructure with MSnbase (Gatto and Lilley 2012); dedicated machine learning with pRoloc (Gatto et al. 2014); interactive with pRolocGUI; data with pRolocdata (Gatto et al. 2014).

  • The Bioconductor project (Huber et al. 2015) ecosystem for high throughput biology data analysis and comprehension: open source, with coordinated and collaborative (between and within domains/software) open development, enabling reproducible research, enabling understanding of the data (not a black box) and driving scientific innovation.

pRoloc screenshots

QSep: quantify resolution of a spatial proteomics experiment

QSep is a function within the pRoloc software.

qsep screenshots

Evolution of Gatto et al. (2018)

SpatialMap

SpatialMap is a project aiming at producing a visualisation and data sharing platform for spatial proteomics. I decided to promote and drive it as openly as possible in the frame of the Open Research Pilot Project. The ORPP is a joint project by the Office of Scholarly Communication at the University of Cambridge and the Wellcome Trust Open research team. Here are the reasons why the SpatialMap project is an open project:

  1. The SpatialMap project in itself is about opening up spatial proteomics data by facilitating data sharing and providing tools to further the comprehension of the data. One aim is to allow users to use the SpatialMap web portal to upload, share, explore and discuss their data privately with collaborators in a first instance (few researchers share their data before publication), then make the data available to reviewers, and finally, once reviewed, make it public at the push of a button. The incentive for early utilisation of the platform is to provide interactive data visualisation and integration with other tools and sharing of the data with close collaborators.

  2. The project is developed completely openly in a public GitHub repository. Absolutely all code and contributions are publicly available. Anyone can collaborate, or even fork the project and build their own.

  3. I publicly announced the SpatialMap project in a blog post. The blog post was written as a legitimate grant application (albeit a little bit shorter and sticking to the most important parts).

Note that I do not have any dedicated funding for this project. The progress so far was the result of a master’s student visiting my group, and the project is currently no longer actively developed.

Inclusivity: open research and open research

There is

Open Science as in widely disseminated and openly accessible

and

Open Science as in inclusive and welcoming

On being inclusive - Twitter thread by Cameron Neylon:

As far as I was concerned for a long time (until June 2017 to be accurate - this section is based on this Open science and open science post), the former, more technical definition was always what I was focusing on, and the second, community-level aspect of openness was, somehow, implicit from the former, but that’s clearly not the case.

Even if there are efforts to promote diversity, under-represented minorities (URM) don’t necessarily feel included. When it comes to open science/research, URMs can be further discriminated against through greater exposure, or can’t always afford to be vocal.

  • Not everybody has the privilege to be open.
  • There are different levels in how open one wants to be, or can afford to be.
  • Every voice and support is welcome.

Conclusions

Standing on the shoulders of giants only really makes sense in the context of open and reproducible research.

  • If you are here (or have read this), chances are you are on the path towards open and reproducible research.

  • You are the architect of the kind of research and researcher you want to become. I hope these include openness and reproducibility.

  • It’s a long path, that constantly evolves, depending on constraints, aspirations, environment, …

  • The sky is the limit, be creative: work out the (open and reproducible) research that works for you now …

  • … and that you want to work (for you and others) in the future.

Acknowledgements

One piece of advice I gave was to make allies. I have been lucky to meet wonderful allies and inspiring friends along the path towards open and reproducible research that works for me. Among these, I would like to highlight Corina Logan, Stephen Eglen, Marta Teperek, Kirstie Whitaker, Chris Hartgenink, Naomie Penfold, Yvonne Nobis.

  1. I actually think that authority (or seniority) isn’t doing any favours when it comes to open research and reproducibility. The more senior stakeholders all too often aren’t those that drive research toward more openness and reproducibility. There are, fortunately, notable exceptions.

  2. There is often no perfect solution, and a combination of the above might be desirable.