Journal:Neuroimaging, genetics, and clinical data sharing in Python using the CubicWeb framework

From LIMSWiki
Revision as of 21:45, 19 June 2017 by Shawndouglas (talk | contribs) (Saving and adding more.)
Full article title Neuroimaging, genetics, and clinical data sharing in Python using the CubicWeb framework
Journal Frontiers in Neuroinformatics
Author(s) Grigis, Antoine; Goyard, David; Cherbonnier, Robin; Gareau, Thomas; Papadopoulos Orfanos, Dimitri; Chauvat, Nicolas; Di Mascio, Adrien; Schumann, Gunter; Spooren, Will; Murphy, Declan; Frouin, Vincent
Author affiliation(s) Université Paris-Saclay, Logilab, King’s College London, F. Hoffmann-La Roche Pharmaceuticals
Primary contact Email: antoine dot grigis at cea dot fr
Editors Marcus, Daniel
Year published 2017
Volume and issue 11
Page(s) 18
DOI 10.3389/fninf.2017.00018
ISSN 1662-5196
Distribution license Creative Commons Attribution 4.0 International
Website http://journal.frontiersin.org/article/10.3389/fninf.2017.00018/full
Download http://journal.frontiersin.org/article/10.3389/fninf.2017.00018/pdf (PDF)

Abstract

In neurosciences or psychiatry, the emergence of large multi-center population imaging studies raises numerous technological challenges. From distributed data collection, across different institutions and countries, to final data publication service, one must handle the massive, heterogeneous, and complex data from genetics, imaging, demographics, or clinical scores. These data must be both efficiently obtained and downloadable. We present a Python solution, based on the CubicWeb open-source semantic framework, aimed at building population imaging study repositories. In addition, we focus on the tools developed around this framework to overcome the challenges associated with data sharing and collaborative requirements. We describe a set of three highly adaptive web services that transform the CubicWeb framework into a (1) multi-center upload platform, (2) collaborative quality assessment platform, and (3) publication platform endowed with massive-download capabilities. Two major European projects, IMAGEN and EU-AIMS, are currently supported by the described framework. We also present a Python package that enables end users to remotely query neuroimaging, genetics, and clinical data from scripts.

Keywords: web service, data sharing, database, neuroimaging, genetics, medical informatics, Python

Introduction

Health research strategies using neuroimaging have shifted in recent years: the focus has moved from patient care only to a combination of patient care and prevention. In the case of neurodegenerative and psychiatric diseases, this drives the creation of an increasing number of massive imaging studies, also known as population imaging (PI) surveys.[1][2] It should be noted that PI studies no longer consist of image data only. The recent wide availability of high-throughput genomics has augmented the subject data with genetics, epigenetics, and functional genomics. Likewise, the standardization of personality, demographics, and deficit tests in psychiatry facilitates the acquisition of clinical/behavioral records to enrich the subject data in large population studies. Moreover, PI studies now typically encompass more than a single imaging session per subject and cover heterogeneous experiments at multiple time points. Ultimately, these studies with complex imaging and extended data (PIx) require multi-center acquisitions to build a large target population.

A typical PIx infrastructure has to cover the following three main topics: (1) data collection, (2) quality control (QC) with data processing, and (3) data indexing and publication with controlled data sharing mechanisms. Furthermore, PIx infrastructures must evolve during the life cycle of a population imaging project, and they must also be resilient to extreme evolutions of the data content and management. In the projects we manage, we have experienced several such evolutions. First, the published dataset may change, for example through the addition of a new modality for all subjects, a new time point, or a new subcohort. Second, the amount of data requested grows dramatically as the project consortium expands.[3] Finally, internal ontologies have to evolve constantly in order to match the ongoing initiatives on interoperability.[4][5]

Several existing open-source frameworks support one or several of the described topics, sometimes only for one specific data type. We give a brief overview of these systems below. Some of them have also been reviewed by Nichols and Pohl.[6] IDA[7] is a neuroimaging data repository and management system that supports data collection (topic one) and data sharing (topic three). With this system, the published datasets can be searched using automatically extracted metadata. The XNAT framework[8] is widely used for neuroimaging data and supports all the PIx infrastructure topics, focusing on tools to pipeline and audit the processing of image data (topic two). The LORIS[9] and NiDB[10] frameworks represent a significant effort to account for multimodal data involved in PIx studies. These frameworks, although addressing all the required topics, mainly support neuroimaging data. OpenClinica[11] and REDCap[12] facilitate the collection of electronic data such as eCRF or questionnaires and are recognized in projects of various sizes that support data collection (topic one). Likewise, laboratory information management systems such as SIMBioMS[13] were developed for the collection of genomic measurements. Finally, the COINS framework brings essential tools for multimodal data support and, more interestingly, emphasizes the importance of providing sharing tools (topics one and three).[14]

The two European studies we manage require a tailored PIx infrastructure. Existing frameworks neither completely handle the diversity of our PIx requirements and project life cycle nor provide efficient tools to collect, check the quality of, and publish evolving data. Additional developments were required to build such a complete infrastructure. We based these developments on a more general framework than the dedicated applications described above. In collaboration with Logilab (Logilab SA, Paris, France), we developed three highly adaptive web services, based on the CubicWeb (CW) pure-Python framework, aimed at creating a (1) multi-center upload platform, (2) collaborative quality assessment platform, and (3) publication platform with massive-download features.[15] These developments were originally initiated for the IMAGEN and EU-AIMS projects in order to host their data about mental health in adolescents[16] and autism[17], respectively. The corresponding studies require key features such as uploading and browsing published data from the web, dynamic selection and filtering of displayed data, support for flexible download operations, a high-level request language, multilevel access rights, remote data access, remote user access rights management, collaborative QC, and interoperability.

Materials and methods

The three services described in the introduction were developed separately. The first sub-section presents the capabilities of the CW framework. The second and fourth sub-sections introduce the upload and publication web services, which address the tailored requirements of PIx studies. The third sub-section describes a collaborative rating web service that helps users assess data quality, and the final sub-section describes a Python API that remotely queries these web services.

CubicWeb overview

All the implemented services are based on the CW framework.[15] We chose a high-level, pure-Python framework that bridges web technologies and database engines. This choice was also based on the expertise and experience of people from our laboratory and a close collaboration with Logilab.[18][19] The CW distribution is organized into a core part and a set of basic Python modules, referred to as cubes, which can be used to efficiently generate web applications. The core of the CW framework, developed under the LGPL license, is built from well-established technologies (SQL, Python, and web technologies such as HTML5 and JavaScript). The main characteristics of the CW framework are as follows:

1. CW defines its data model with Python classes and automatically generates the underlying database structure.
2. The queries are expressed with the RQL language, which is similar to W3C’s SPARQL.[20] All the persistent data are retrieved and modified using this language.
3. CW implements a mechanism that exposes information in several ways, referred to as views. This mechanism implements the classical model-view-controller software architecture pattern. Defined in Python, the views are applied to query results and can produce HTML pages and/or trigger external processes. The separation of queries and views offers major advantages: first, the same data selection may have several web representations, and second, retrieved data can be exported in several other formats without modifying the underlying data storage.
4. All the views and triggers are recorded in a registry and are automatically selected depending on the current context, which is inferred from the type of data returned by the RQL query.
5. Thanks to the semantic nature of CW, all developments inherit the possibility to follow existing or emerging ontologies, thereby facilitating sharing, access, and processing.
6. CW has a security system that grants fine-grained access to the data. This system is similar to the row-level security and policies available in PostgreSQL since version 9.5, and links access rights to entities/relations in the schema. Each entity type has a set of attributes and relations, along with permissions that define who can add, read, update, or delete such an entity and its associated relations.
7. CW may run either as a standalone application or behind an Apache front server. We refer to both settings as a data sharing service (DSS) (cf. Figure 1).
8. CW can be configured to run with various database engines. For the best performance, PostgreSQL is recommended.
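
As an illustration of characteristics 3, 4, and 6, the following minimal sketch shows how a registry might pick a view from the type of the returned data, and how permissions attached to an entity type might be checked. It is plain Python and does not use the CubicWeb API: the names ViewRegistry, is_type, allowed, the "Scan" and "Subject" types, and the "users" and "managers" groups are all assumptions made for this example.

```python
"""Toy model of context-selected views and per-type permissions.

NOT CubicWeb code: every class, function, and group name is hypothetical.
"""


class ViewRegistry:
    """Stores candidate views and selects one for a given result set."""

    def __init__(self):
        self._views = []  # list of (view_id, selector, render_function)

    def register(self, view_id, selector):
        """Decorator recording a view under `view_id` with its selector."""
        def decorator(func):
            self._views.append((view_id, selector, func))
            return func
        return decorator

    def select(self, view_id, rows):
        """Return the registered view whose selector scores highest on `rows`."""
        best_score, best_view = 0, None
        for vid, selector, func in self._views:
            score = selector(rows) if vid == view_id else 0
            if score > best_score:
                best_score, best_view = score, func
        if best_view is None:
            raise LookupError(f"no applicable view for {view_id!r}")
        return best_view


def is_type(type_name, score=2):
    """Selector: `score` if all rows are of `type_name`, else 0 (not applicable)."""
    def selector(rows):
        return score if rows and all(r["type"] == type_name for r in rows) else 0
    return selector


registry = ViewRegistry()


@registry.register("list", selector=lambda rows: 1)  # generic fallback
def generic_list(rows):
    return [str(r["id"]) for r in rows]


@registry.register("list", selector=is_type("Scan"))  # more specific, wins on scans
def scan_list(rows):
    return [f"{r['id']} ({r['modality']})" for r in rows]


# Permissions attached to an entity type: action -> groups allowed (hypothetical).
SCAN_PERMISSIONS = {
    "read": {"users", "managers"},
    "add": {"managers"},
    "update": {"managers"},
    "delete": {"managers"},
}


def allowed(permissions, action, user_groups):
    """True when the user belongs to at least one group granted `action`."""
    return bool(permissions.get(action, set()) & set(user_groups))


scans = [{"type": "Scan", "id": "s1", "modality": "T1"}]
subjects = [{"type": "Subject", "id": "sub-01"}]

print(registry.select("list", scans)(scans))        # rendered by scan_list
print(registry.select("list", subjects)(subjects))  # falls back to generic_list
print(allowed(SCAN_PERMISSIONS, "delete", ["users"]))
```

The same "list" request thus yields a different representation depending on what the query returned, without any change to the stored data. A real framework would score selectors on a richer context (user, request, entity state) than the row type used here.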


Figure 1. Architecture of a CubicWeb data sharing service (DSS) integrated in an Apache platform with LDAP. The business logic cubes provide a schema that can be instantiated in the database management system (DBMS: red puzzle piece). The system cubes ensure low-level system interactions (green puzzle piece), and the application cube proposes a web user interface (blue puzzle piece). End users access the database content through a web browser, a Python API scripting the DSS or an FTP solution, where virtual folders (acting as filters on the central repository) are proposed for download.
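
The idea of a schema that is defined in Python and then instantiated in the DBMS can be sketched as follows. This is a simplified illustration rather than CubicWeb's own mechanism: it uses standard-library dataclasses and SQLite instead of CW entity classes and PostgreSQL, and the Subject and Scan models, their fields, and the create_table_sql helper are assumptions made for this example.

```python
"""Toy sketch: derive database tables from Python class definitions.

NOT CubicWeb code; class, table, and column names are hypothetical, and
SQLite stands in for the DBMS only to keep the example self-contained.
"""
import sqlite3
from dataclasses import dataclass, fields

# Mapping from Python field types to SQL column types.
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}


@dataclass
class Subject:
    code: str
    age: int


@dataclass
class Scan:
    label: str
    modality: str
    subject_code: str  # plays the role of a relation to Subject


def create_table_sql(model):
    """Build a CREATE TABLE statement from a dataclass definition."""
    columns = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(model))
    return f"CREATE TABLE {model.__name__.lower()} ({columns})"


conn = sqlite3.connect(":memory:")
for model in (Subject, Scan):
    conn.execute(create_table_sql(model))

conn.execute("INSERT INTO subject VALUES (?, ?)", ("sub-01", 14))
conn.execute("INSERT INTO scan VALUES (?, ?, ?)", ("s1", "T1", "sub-01"))

# Query across the relation: T1 scan labels with the owning subject's age.
rows = conn.execute(
    "SELECT scan.label, subject.age FROM scan "
    "JOIN subject ON scan.subject_code = subject.code "
    "WHERE scan.modality = ?",
    ("T1",),
).fetchall()
print(rows)
```

Because the tables are generated from the class definitions, adding a new modality or attribute amounts to editing the Python model. In CW itself, queries are written in RQL against the schema rather than in SQL against generated tables.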

References

  1. Hurko, O.; Black, S.E.; Doody, R. et al. (2012). "The ADNI Publication Policy: Commensurate recognition of critical contributors who are not authors". NeuroImage 59 (4): 4196–4200. doi:10.1016/j.neuroimage.2011.10.085. PMC PMC3676932. PMID 22100665. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676932. 
  2. Poldrack, R.A.; Gorgolewski, K.J. (2014). "Making big data open: Data sharing in neuroimaging". Nature Neuroscience 17 (11): 1510–7. doi:10.1038/nn.3818. PMID 25349916. 
  3. Gorgolewski, K.J.; Varoquaux, G.; Rivera, G. et al. (2015). "NeuroVault.org: A web-based repository for collecting and sharing unthresholded statistical maps of the human brain". Frontiers in Neuroinformatics 9: 8. doi:10.3389/fninf.2015.00008. PMC PMC4392315. PMID 25914639. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4392315. 
  4. Scheufele, E.; Aronzon, D.; Coopersmith, R. et al. (2014). "tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform". AMIA Joint Summits on Translational Science 2014: 96–101. PMC PMC4333702. PMID 25717408. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4333702. 
  5. Gorgolewski, K.J.; Auer, T.; Calhoun, V.D. et al. (2016). "The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments". Scientific Data 3: 160044. doi:10.1038/sdata.2016.44. PMC PMC4978148. PMID 27326542. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978148. 
  6. Nichols, B.N.; Pohl, K.M. (2015). "Neuroinformatics Software Applications Supporting Electronic Data Capture, Management, and Sharing for the Neuroimaging Community". Neuropsychology Review 25 (3): 356–68. doi:10.1007/s11065-015-9293-x. PMC PMC5400666. PMID 26267019. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5400666. 
  7. Van Horn, J.D.; Toga, A.W. (2009). "Is it time to re-prioritize neuroimaging databases and digital repositories?". NeuroImage 47 (4): 1720–34. doi:10.1016/j.neuroimage.2009.03.086. PMC PMC2754579. PMID 19371790. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2754579. 
  8. Marcus, D.S.; Harms, M.P.; Snyder, A.Z. et al. (2013). "Human Connectome Project informatics: quality control, database services, and data visualization". NeuroImage 80: 202–19. doi:10.1016/j.neuroimage.2013.05.077. PMC PMC3845379. PMID 23707591. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3845379. 
  9. Das, S.; Zijdenbos, A.P.; Harlap, J. et al. (2012). "LORIS: A web-based data management system for multi-center studies". Frontiers in Neuroinformatics 5: 37. doi:10.3389/fninf.2011.00037. PMC PMC3262165. PMID 22319489. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3262165. 
  10. Book, G.A.; Anderson, B.M.; Stevens, M.C. et al. (2013). "Neuroinformatics Database (NiDB) - A modular, portable database for the storage, analysis, and sharing of neuroimaging data". Neuroinformatics 11 (4): 495–505. doi:10.1007/s12021-013-9194-1. PMC PMC3864015. PMID 23912507. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3864015. 
  11. "OpenClinica User Documentation". OpenClinica, LLC. 18 April 2016. https://docs.openclinica.com/. 
  12. Harris, P.A.; Taylor, R.; Thielke, R. et al. (2009). "Research electronic data capture (REDCap) - A metadata-driven methodology and workflow process for providing translational research informatics support". Journal of Biomedical Informatics 42 (2): 377–81. doi:10.1016/j.jbi.2008.08.010. PMC PMC2700030. PMID 18929686. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2700030. 
  13. Krestyaninova, M.; Zarins, A.; Viksna, J. et al. (2009). "A system for information management in biomedical studies – SIMBioMS". Bioinformatics 25 (20): 2768–2769. doi:10.1093/bioinformatics/btp420. PMC PMC2759553. PMID 19633095. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2759553. 
  14. Scott, A.; Courtney, W.; Wood, D. et al. (2011). "COINS: An Innovative Informatics and Neuroimaging Tool Suite Built for Large Heterogeneous Datasets". Frontiers in Neuroinformatics 5: 33. doi:10.3389/fninf.2011.00033. PMC PMC3250631. PMID 22275896. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3250631. 
  15. "CubicWeb - The Semantic Web is a construction game!". Logilab. 2016. https://www.cubicweb.org/. 
  16. Schumann, G.; Loth, E.; Banaschewski, T. et al. (2010). "The IMAGEN study: Reinforcement-related behaviour in normal brain function and psychopathology". Molecular Psychiatry 15 (12): 1128–39. doi:10.1038/mp.2010.4. PMID 21102431. 
  17. Murphy, D.; Spooren, W. (2012). "EU-AIMS: A boost to autism research". Nature Reviews Drug Discovery 11 (11): 815–6. doi:10.1038/nrd3881. PMID 23123927. 
  18. Michel, V.; Schwartz, Y.; Pinel, P. et al. (2013). "Brainomics: A management system for exploring and merging heterogeneous brain mapping data". Proceedings from the 19th Annual Meeting of the Organization for Human Brain Mapping 2013. https://hal.inria.fr/cea-00904768/en. 
  19. Papadopoulos Orfanos, D.; Michel, V.; Schwartz, Y. et al. (2017). "The Brainomics/Localizer database". NeuroImage 144 (Pt B): 309–314. doi:10.1016/j.neuroimage.2015.09.052. PMID 26455807. 
  20. Prud'hommeaux, E.; Seaborne, A., ed. (15 January 2008). "SPARQL Query Language for RDF". World Wide Web Consortium. https://www.w3.org/TR/rdf-sparql-query/. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. References are in order of appearance rather than alphabetical order (as the original was) due to the way this wiki works.