Exploring and sharing spatial proteomics data

10 minute read

I am very happy to be involved in the Open Research Project project, jointly organised and managed by the Wellcome Trust and the University of Cambridge’s Office of Scholar Communication. Some time ago, I shared my first thoughts about the initiative, where I expressed the need for a more open research landscape, reaching beyond openly releasing research outputs (i.e. papers, data and software). Among these efforts, one that is widely discussed is pre-registration of trials. While this doesn’t really apply to the kind of research I lead, I believe that promoting projects early on has interesting benefits, hence this post.

A couple of years ago, I started with some very preliminary work on an online resource for experimental spatial proteomics data, coined SpatialMap.org, to be populated with interactive apps from the pRolocGUI package. As part of my efforts to promote open research, I decided to write the grant proposal for the SpatialMap.org project, and promote it through this Open Research Project.

It is unlikely I would ever have applied for funding for this project because its main goal is the development of a rather specialised online resource. As far as I know, despite the need for better data sharing and dissemination, it is difficult to obtain funding for such projects, unless they are big and address a very general need. If I believed that this project would have had a high chance to win external funding, I would probably have waited until obtaining an official reply from the funding body before publishing it here. Another view is that this proactive and open project description and it’s initial implementation might lead to new opportunities later on.

The application below follows the standard structure I have used for my BBSRC applications, albeit in a shorter form. Another difference with my grant applications is that there are generally more applicants; in addition to me, the principal investigator (PI), I include one or more co-applicants (co-investigator, co-I) with complementary skills as well as collaborators. Collaborators do not request any funds but express their specific interest by offering access to data or specific support in terms of expertise through a formal letter of support. For this grant, I would have sought involvement of researchers with expertise in spatial proteomics and biological data repositories; Kathryn Lilley and Juan Antonio Vizcaíno, with whom I have published and collaborated in the past, would be likely co-I’s.

`SpatialMap.org`: a modern platform to explore and share mass spectrometry-based spatial proteomics data

Applicants

Laurent Gatto (LG, PI), Computational Proteomics Unit (CPU), Cambridge Centre for Proteomics (CCP), University of Cambridge, author of state-of-the-art computational spatial proteomics software. He’s a fellow of the Software Sustainability Institute (SSI) and a Data Champion and the University of Cambridge.

Suitability of the environment

LG and the Computational Proteomics Unit have a long-standing track record in the development and publication of innovative algorithm and open source software for the analysis and comprehension of mass spectrometry-based spatial proteomics data (Gatto et al., 2014).

Scientific case

Background

Spatial proteomics is the systematic analysis of protein sub-cellular locations. Knowledge of the exact location of a protein is of paramount importance, as its whereabouts will define the biochemical conditions of its environment and the presence of interacting partners and hence, is essential for a protein to execute its intended function.

There exists a wide several experimental procedure to elucidate the sub-cellular location of a protein, that can be classified into single-cell and single-protein imaging methods and high-throughput, proteome-wide mass spectrometry-based methods (Gatto et al., 2010). Here, we will concentrate on the latter. The PI has extensive experience in such methods, in particular the LOPIT (Dunkley et al., 2010) and hyperLOPIT methods (Christoforou et al., 2016, Mulvey et al., 2016). His group has developed numerous innovative analysis methods, ranging from supervised, semi-supervised (Breckels et al., 2013) and transfer learning (Breckels et al., 2016) algorithms, and the current state-of-the-art spatial proteomics computational platform (Gatto et al., 2014a, 2014b, Breckels et al., 2016), including dedicated visualisation tools (Gatto et al., 2015, Breckels et al., 2017) and tens of datasets (Gatto et al., 2014a). Despite the availability of these tools and data, the community would benefit of an online resource to interactively visualise and explore spatial proteomics experiments without the requirement for local software installation and a platform to easily share and disseminate such data.

Some of the groups that published spatial proteomics data recently also provide an online version of their data. Itzhak et al (2016) provide a limited interface to their HeLa spatial proteomics data at www.mapofthecell.org, while Jadot et al. (2017) offer their own data repository, Prolocate. Both single-data resources have simple search interfaces, but none of these enable interactive exploration. We provide an interactive pRolocGUI app to the mouse embryonic stem cell spatial proteome (Mulvey et al., 2016) with similar simple data-specific searches. While useful on their own, these individual efforts hardly meet the need for a consistent, interactive and modern interface across several datasets and don’t support easy dissemination of data.

Objectives

In this project, we propose to develop an online platform that will offer (1) an interface to interactively navigate numerous experimental mass spectrometry-based spatial proteomics data and (2) disseminate (publicly and privately) experiments to facilitate collaboration and wide dissemination of these data.

Preliminary data

We have already collected tens of spatial proteomics experiments collected by multiple laboratories and covering all major experimental designs currently available. These data have all been annotated and prepared as dedicated computational MSnSet object (Gatto et al., 2012). In addition, we have worked on several interactive visualisation applications, currently available as part of the shiny-based software (Chang et al., 2017) pRolocGUI (Breckels et al., 2017). Both available software and data will form the basis for the development the new resource we propose to develop and expand on.

Deliverables

The new resource, named Spatial Map, will be available at www.SpatialMap.org. It will feature an interactive interface allowing to search and navigate multivariate spatial proteomics data, similar to the PCA app from the pRolocGUI software (see figure below), an example of which is available for the recent mouse embryonic stem cell spatial proteomics data (Christoforou et al., 2016).

The PCA application from the `pRolocGUI` package

To assure responsiveness of the application, we will identify the most important features and re-implement them directly in JavaScript using the d3 library for fast and native online visualisation. The interactive data visualisation will features links to external resources such as UniProt and the Human Protein Atlas (Uhlén M et al., 2015).

The data underlying the new resource will be provided by a new backend to support native online visualisation, i.e. not relying on shiny and R. This backend will be based on a modern, web-ready data store, such as Google’s firebase, and will seamlessly integrate with current data available in pRolocdata, enabling submission and retrieval of data from/to R. In particular, we will provide, in the pRolocdata package, an interface to the online data store that will mimic what is currently available for offline data. In addition, the online store will also allow to distribute the annotated spatial proteomics data online directly, in a generic format such as JSON through a RESTful interface, accessible to any programming language (currently, the data are only available to R users and programmers).

As part of the new data backend, we will enable searches across datasets to identify, for example, datasets from a species of interest, datasets that document certain sub-cellular niches, or specific proteins and their location across data and systems. We will then use this new functionality to provide a large scale comparative spatial proteomics study.

Finally, a novel and important feature of our proposed platform will be a modern and unique sharing and dissemination platform. Firstly, the resource will enable anyone to access publicly available data currently in the pRolocdata package, as well as those that will be added in the future. Note that maintenance and synchronisation of the two resource will be supported by a consistent interface between the two resource and automatic build systems assuring identical data availability and data validity across resources. In addition, we will also provide personal logins and support private data upload. These private datasets will be explorable, like their public counterparts, and it will be possible for the users to share these with trusted collaborators. Logins will be based on popular social media accounts such as Google, Facebook, twitter, GitHub, … as well as academic-related accounts such as ORCID, to facilitate to usage of the platform. Upon publication, it will be easy for the data submitters to make their data publicly available and to publicise and share them through SpatialMap.org.

Project management

The project will be managed by LG and the researcher who will implement the proposed resource will be located in the CPU. They will collaborate and interact with computational scientists in the CPU as well as experimentalists within the Cambridge Centre for Proteomics, who generate spatial proteomics data and are the main audience of our resource.

All the work and the discussions around the project advancement will be centralised and recorded on GitHub (see for example the PI’s and CPU accounts for existing and successful utilisation of GitHub) and the PI and researcher will have regular meetings to discuss the project.

We will also include stakeholders to collect feedback and opinions early on, to assure that the resource meets the requirements of its future user base and identify the most important features beyond those that we have already identified.

Data management plan

For a complete data management plan, see Gatto L., 2017.

Justification of resources

In a real application, this section would be much more detailed, specifying exact amounts (the salary would be calculated by the institution’s research office, based on the kind of post that would be needed) and project duration. Here, I just summarise what resources I would need funding for.

We request a salary for the researcher for the duration of the project. While we indent to use freely available infrastructure as much as possible, we also request dedicated funding to support the payment for the hosting of the resource and payment for the domain name for a duration of n years.

Pathways to impact

There are several general areas that this research will have impact on. These beneficiaries include the cell biology and proteomics community, the computational biology community, industry, and education and outreach.

The cell biology and proteomics communities are the first beneficiary of SpatialMap.org, as it will provide instant and user-friendly, interactive access to biologically relevant high-quality spatial proteomics data. These data will also be of benefit to the computational biology community, permitting easier integration of protein localisation data in any computational pipelines. Due to the importance of protein localisation and alteration thereof in drug development, we expect our resource to be relevant for industry. SpatialMap.org will more constitute a useful resource for education and outreach, thanks to the easy and detailed, high-quality protein localisation information.

Finally, we envision that our platform will become a unique and essential resource for reviewers of spatial proteomics publications. Our private access to data will enable to provide restricted access to collaborators as well as reviewers, and we will offer this free service to scientific journals.

Implementation of the project

It turns out that, since the last meeting when I decided to write about this project, Robin Kohze from the Radboud University Nijmegen, is visiting my group in the frame of his master’s internship. He has the right expertise (R and web development) and interests, and has started working on the project. Robin will benefit from student funding from his home institution and the Erasmus programme, but there won’t be any funding for any additional resources.

Final thoughts

By publishing this project, I hope to set a useful example on how to promote better open research. I would also value feedback, comments and suggestion to improve the project, i.e. open and constructive peer review.

I suspect that publicly advertising one’s research, before (or even in absence) of any formal financial support can be seen as risky by many due to potential scooping. I don’t think that this is a genuine risk, at least in my case. And if it where, I doubt that keeping this project a secret would keep me safe from being scooped. In addition, the SpatialMap.org page has been online for a while now, anyway.

The best possible outcome for publishing my research proposal would be that this initiated interest, some sort of support and possibly even new collaboration.

Laurent Gatto

Exploring and sharing spatial proteomics data

Applicants

Suitability of the environment

Scientific case

Background

Objectives

Preliminary data

Deliverables

Project management

Data management plan

Justification of resources

Pathways to impact

Implementation of the project

Final thoughts

You May Also Enjoy

CBIO’s EuroBioc2024 posters and talks

6 Questions We Should Ask Before Adopting a New Technology

On cats, farts and parrots.

Podcasts du LLL: Et si on oubliait les notes?

Laurent Gatto

SpatialMap.org: a modern platform to explore and share mass spectrometry-based spatial proteomics data

Applicants

Suitability of the environment

Scientific case

Background

Objectives

Preliminary data

Deliverables

Project management

Data management plan

Justification of resources

Pathways to impact

Implementation of the project

Final thoughts

You May Also Enjoy

CBIO’s EuroBioc2024 posters and talks

6 Questions We Should Ask Before Adopting a New Technology

On cats, farts and parrots.

Podcasts du LLL: Et si on oubliait les notes?

`SpatialMap.org`: a modern platform to explore and share mass spectrometry-based spatial proteomics data