Journal:Development of an informatics system for accelerating biomedical research
Full article title | Development of an informatics system for accelerating biomedical research (Version 2) |
---|---|
Journal | F1000Research |
Author(s) | Navale, Vivek; Ji, Michele; Vovk, Olga; Misquitta, Leonie; Gebremichael, Tsega; Garcia, Alison; Fann, Yang; McAuliffe, Matthew |
Author affiliation(s) | National Institutes of Health; General Dynamics Information Technology, Inc.; Sapient Government Services |
Primary contact | Email: Vivek dot Navale at nih dot gov |
Year published | 2020 |
Volume and issue | 8 |
Article # | 1430 |
DOI | 10.12688/f1000research.19161.2 |
ISSN | 2046-1402 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://f1000research.com/articles/8-1430/v2 |
Download | https://f1000research.com/articles/8-1430/v2/pdf (PDF) |
Abstract
The Biomedical Research Informatics Computing System (BRICS) was developed to support multiple disease-focused research programs. Seven service modules are integrated together to provide a collaborative and extensible web-based environment. The modules—Data Dictionary, Account Management, Query Tool, Protocol and Form Research Management System, Meta Study, Data Repository, and Globally Unique Identifier—facilitate the management of research protocols, including the submission, processing, curation, access, and storage of clinical, imaging, and derived genomics data within the associated data repositories. Multiple instances of BRICS are deployed to support various biomedical research communities focused on accelerating discoveries for rare diseases, traumatic brain injuries, Parkinson’s disease, inherited eye diseases, and symptom science research. No personally identifiable information is stored within the data repositories. Digital object identifiers are associated with the research studies. Reusability of biomedical data is enhanced by common data elements (CDEs), which enable systematic collection, analysis, and sharing of data. The use of CDEs with a service-oriented informatics architecture enabled the development of disease-specific repositories that support hypothesis-based biomedical research.
Keywords: informatics system, biomedical repository, translational research, FAIR
Introduction
Biomedical informatics systems can be used for the management of heterogeneous data, testing of data analysis methods, dissemination of translational research, and the generation of high-throughput hypotheses.[1][2] In the past, many disease-focused research programs have collected data in dissimilar ways, which has made data aggregation and comparative analyses difficult. For example, non-standard methods of data collection in traumatic brain injury (TBI) research have led to many different types of injuries being classified within the same class of injury. To overcome this problem, in October 2007, the National Institute of Neurological Disorders and Stroke (NINDS), National Institute on Disability and Rehabilitation Research (NIDRR), the Defense and Veterans Brain Injury Center, and the Brain Injury Association of America sponsored a workshop to examine barriers to TBI clinical trial effectiveness. The workshop recommendation of improving data discoverability and integration in TBI research resulted in the development and implementation of common data elements (CDEs) and the Federal Interagency Traumatic Brain Injury Research (FITBIR) informatics system.[3]
A CDE is defined as a fixed representation of a variable collected within a specified clinical domain, interpretable unambiguously in human and machine-computable terms.[4] It consists of a precisely defined question with a set of permissible values as responses. Typically, CDE development for biomedical disease programs involves multiple steps: identifying a need for a CDE or group of CDEs, bringing together stakeholders and expert groups for selection, implementing various iterations and updates to initial CDE development based on ongoing input from the broader community, and finally endorsing the CDEs for widespread usage and adoption by the stakeholder community.[5] Use of CDEs enhances data quality and consistency, which facilitates data reuse for clinical and translational research.
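To make the definition concrete, a CDE can be pictured as a small record that pairs one precisely worded question with its permissible values. The sketch below is illustrative only: the class name, field names, and the example element are hypothetical and are not taken from the NINDS CDE catalog or from the BRICS code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """Hypothetical, simplified model of a CDE (not the BRICS schema)."""
    name: str                      # machine-readable variable name
    question: str                  # precisely defined question text
    permissible_values: frozenset  # allowed responses; empty means free-form

    def accepts(self, response: str) -> bool:
        """Return True when the response is allowed for this element."""
        if not self.permissible_values:
            return True
        return response in self.permissible_values


# Illustrative element; the name and value set are invented for this example.
injury_severity = CommonDataElement(
    name="InjurySeverityCat",
    question="What is the clinician-assessed severity of the injury?",
    permissible_values=frozenset({"Mild", "Moderate", "Severe"}),
)
```

Because every study that adopts the element shares the same question and value set, responses collected at different sites can be pooled without recoding.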
CDEs are used in various programs of clinical research, including in neuroscience[6], rare diseases research[7], and management of chronic conditions.[8] For clinical data lifecycle management, the use of CDEs provides a structured data collection process, which enhances the likelihood for data to be pooled and combined for meta-analyses, modelling, and post-hoc construction of synthetic cohorts for exploratory analyses.[9] Investigators working to develop protocols for data collection can also consult the NIH Common Data Element Resource Portal for using established CDEs for disease programs.[10]
In 2010, the Department of Defense and the NINDS initiated the development of FITBIR. The goal was to develop a centralized repository for TBI research, in order to foster collaboration between researchers working in the field. Additionally, the design of FITBIR called for the use of CDEs during TBI data collection.
Prior to the development of FITBIR, the National Database for Autism Research (NDAR) system had demonstrated the use of CDEs for autism research.[11] Certain design features such as the use of a globally unique identifier (GUID) scheme were adopted from NDAR for FITBIR. However, the NDAR model was dedicated for access and submission to federated databases for autism research. FITBIR, on the other hand, required development of a multi-program centralized repository.
The Biomedical Research Informatics Computing System (BRICS) was designed to address the wide-ranging needs of several biomedical research programs. The overall concept was to develop services that could be integrated together and deployed as instances for individual research programs. FITBIR was the first BRICS instance and was leveraged to develop other instances (e.g., the Parkinson’s disease program). A BRICS instance supports electronic data capture and use of data dictionaries for processing and storing data within disease-specific digital repositories.
Data dictionaries comprise data elements, form structures, and electronic forms (eForms). A data element has a name, a precise definition, and, if applicable, clear permissible values. A data element also relates directly to a question on a paper form or eForm, and/or to field(s) in a database record. Form structures serve as the containers for data elements, and eForms are developed using form structures as their foundation. The data dictionary provides defined CDEs, as well as unique data elements (UDEs), for a specific BRICS instance implementation. Reuse of CDEs is strongly encouraged; FITBIR’s data dictionary, for example, incorporates and extends the CDE definitions developed by the NINDS CDE project.[6]
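The relationship between data elements and form structures can be sketched as a container check: a form structure lists the elements it holds, and a submitted record is compared against that list. This is a minimal sketch under stated assumptions; the form structure name, the element names, and the `check_record` helper are hypothetical and do not reflect the actual BRICS data dictionary.

```python
# Hypothetical form structure: a named container of data elements.
# Each element maps to its permissible values (None means any value is allowed).
FORM_STRUCTURE = {
    "name": "DemographicsExample",
    "elements": {
        "GUID": None,                        # subject identifier, free-form here
        "AgeYrs": None,                      # numeric value, range check omitted
        "HandPrefTyp": {"Left", "Right", "Both"},
    },
}


def check_record(form_structure, record):
    """Return a list of human-readable problems found in one submitted record."""
    problems = []
    elements = form_structure["elements"]
    # First pass: every submitted field must be a known element with a valid value.
    for field, value in record.items():
        if field not in elements:
            problems.append(f"unknown element: {field}")
        elif elements[field] is not None and value not in elements[field]:
            problems.append(f"value not permitted for {field}: {value}")
    # Second pass: every element of the form structure must be present.
    for field in elements:
        if field not in record:
            problems.append(f"missing element: {field}")
    return problems
```

In this sketch all elements are treated as required; a real dictionary would presumably distinguish required from optional elements.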
This paper discusses the overall system design and an architecture that supports the various BRICS instances. The functionalities developed to use the CDEs for electronic data submission, processing, validation, and storage within designated repositories are presented. System access is highlighted for searching across research studies within a BRICS instance. An example is provided of BRICS implementation within an area of disease research (Parkinson’s disease). Also shown is the role of individual system components that enable data to be findable, accessible, interoperable, and reusable (FAIR).
BRICS system design and architecture
The system design was predicated on the adoption of a CDE-based data collection method. To satisfy this requirement, an electronic data collection tool (ProFoRMS) was developed to interface with data dictionaries, which enabled deployment of multiple instances of the system to disease area programs. Using CDEs early in the data life cycle facilitated data harmonization and minimized the need for elaborate post-processing and curation work. Services were developed to support the various stages in the data life cycle. De-identification of each patient within a research study is supported by the use of a GUID. A de-identification tool was developed for researchers to use prior to submission of data to a specific BRICS instance. No personally identifiable information is retained in the BRICS repositories.
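The text does not describe how a GUID is derived, so the following is only one plausible way to picture GUID-based de-identification. It assumes that a one-way digest of normalized identifying fields is computed on the researcher's side and that a registry returns a random identifier for each distinct digest. The field choice, the normalization, the `GuidRegistry` class, and the identifier format are all hypothetical.

```python
import hashlib
import secrets


def pii_digest(first_name: str, last_name: str, birth_date: str) -> str:
    """One-way SHA-256 digest of normalized identifying fields.

    The choice of fields and the normalization are assumptions for this sketch.
    """
    normalized = "|".join(
        part.strip().upper() for part in (first_name, last_name, birth_date)
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class GuidRegistry:
    """Hypothetical registry: digest in, random identifier out."""

    def __init__(self):
        self._by_digest = {}

    def get_or_create(self, digest: str) -> str:
        # The identifier is random, so it carries no trace of the inputs.
        if digest not in self._by_digest:
            self._by_digest[digest] = "EX" + secrets.token_hex(6).upper()
        return self._by_digest[digest]
```

Under these assumptions, the same participant enrolled in two studies resolves to the same identifier, while the repository only ever sees the random identifier.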
Since BRICS development started in 2011, the Java Web Start technology has been used for deploying the tools shown in the presentation layer of the architecture (Figure 1, below). Although Java Web Start was deprecated in editions of Java after Oracle Java SE 8, Oracle provides free public updates and auto updates to Java SE 8 until at least the end of December 2020. The GUID and download tools that initially used the Web Start technology have been migrated to a JavaScript client. The Submission tool will also be migrated to a JavaScript client by the end of 2020. During the transition period, users continue to maintain an Oracle Java SE 8 installation on their local computers.
An open-source database, PostgreSQL, was preferred over Oracle database during BRICS development, primarily to minimize individual licensing costs when deploying instances of the system to various biomedical programs. However, three separate PostgreSQL databases were used, one for data dictionaries and the other two for ProFoRMS' data repository and meta-analysis functionalities, respectively. Separate databases were needed because the data dictionary is shared and ProFoRMS was developed as an application that was integrated with the system.
The Virtuoso database uses the World Wide Web Consortium's Resource Description Framework (RDF) for accessing data that comes from data dictionary, data repository, and meta-analysis modules. Virtuoso contains data that are linked together in RDF, to support the query tool. The repository data is linked to metadata (studies and datasets) and the data dictionary, which is processed and stored in Virtuoso for querying. An advantage of using the RDF triple model is its flexibility to adapt to user-driven data requirement changes that can be made in the study repository or query tool. Once the data is added to the RDF graph as triples, regardless of where the data is stored, it can easily be retrieved and processed by the query tool.
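The flexibility of the triple model can be illustrated with a toy in-memory graph. This is a sketch only: it is a stand-in for the idea of subject-predicate-object storage, not Virtuoso, Jena, or SPARQL, and every identifier and predicate name in it is invented (real RDF would use IRIs).

```python
class TripleStore:
    """Tiny in-memory stand-in for an RDF graph (not Virtuoso or Jena)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


graph = TripleStore()
# Links between repository data, study metadata, and the data dictionary.
graph.add("record:1", "partOfDataset", "dataset:A")
graph.add("dataset:A", "belongsToStudy", "study:42")
graph.add("record:1", "usesElement", "element:AgeYrs")
# A new kind of relationship needs no schema change, only another triple.
graph.add("study:42", "hasFundingSource", "funder:Example")
```

Following a record to its study is then a matter of chaining two pattern matches, for example `graph.match("record:1", "partOfDataset")` followed by a `belongsToStudy` lookup on the resulting dataset.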
Since the initial release of the BRICS platform, we have initiated a migration to the MongoDB database to take advantage of schema-free development. Currently, the GUID module has been migrated to use MongoDB. Other BRICS functionalities will eventually be migrated to also use MongoDB, thereby eliminating the need for using PostgreSQL in the BRICS architecture.
An overview of the current informatics system architecture is provided in Figure 1. The architecture is defined by its three layers: the (a) Presentation Layer, (b) Application Layer, and (c) Data Layer.
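Figure 1. Overview of the BRICS informatics system architecture: Presentation Layer, Application Layer, and Data Layer.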
The Presentation Layer provides a secure entry point through the BRICS portal. A login page is used to enter valid credentials with a central authentication system (CAS) to support single sign-on for users to access all the BRICS modules. Role-based access has also been implemented by using Spring Security (a Java/Java Enterprise Edition framework that provides authentication and authorization features) throughout the system to provide an additional level of controlled access to each of the modules. The GUID client, validation/upload, download, and image submission tools are accessible via the BRICS portal.
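The idea of role-based access to modules can be sketched as a lookup from roles to permitted modules. BRICS implements this with Spring Security in Java; the Python below does not use or imitate the Spring Security API, and the role names and module keys are hypothetical.

```python
# Hypothetical mapping of roles to the modules they may open.
ROLE_PERMISSIONS = {
    "data_submitter": {"guid", "data_repository"},
    "data_accessor": {"query_tool", "data_repository"},
    "dictionary_admin": {"data_dictionary"},
}


def can_access(user_roles, module):
    """Return True if any of the user's roles grants access to the module."""
    return any(module in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)
```

Single sign-on answers the question of who the user is once for the whole portal; a check like this one then decides, module by module, what that user may do.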
The Image Submission Package Creation Tool, a plugin to the Medical Image Processing Analysis and Visualization (MIPAV) application[12][13], leverages medical image file readers found in the MIPAV software application (version 8.0.2) to support the semi-automated submission of image data into the data repository. The plugin supports more than 35 file formats commonly used in medical imaging, including DICOM, NIfTI, Analyze, AFNI, and more. The Image Submission Tool extracts available image header metadata from the image and attempts to map that metadata onto the CDEs in the selected imaging form structure. The quality and amount of image header metadata that can be extracted out of an image volume will depend on the medical image file format, the scanner on which the images were acquired, and the de-identification process performed.
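The header-to-CDE mapping step can be sketched as a dictionary lookup that reports which elements could not be filled automatically. This is not the MIPAV plugin's code: the header keys resemble common DICOM attribute names, but the CDE names and the `map_header_to_cdes` helper are invented for illustration.

```python
# Hypothetical mapping from image header keys to CDE names in a form structure.
HEADER_TO_CDE = {
    "Modality": "ImgModltyTyp",
    "SliceThickness": "ImgSliceThicknessVal",
    "Manufacturer": "ImgScannerManufName",
}


def map_header_to_cdes(header):
    """Split header metadata into mapped CDE values and unmapped CDE names."""
    mapped = {}
    missing = []
    for header_key, cde_name in HEADER_TO_CDE.items():
        value = header.get(header_key)
        if value in (None, ""):
            # Absent or blank in the header (e.g. stripped by de-identification);
            # in this sketch the element is left for manual entry.
            missing.append(cde_name)
        else:
            mapped[cde_name] = value
    return mapped, missing
```

Header fields with no entry in the mapping are simply ignored, which is one way the amount of usable metadata comes to depend on the file format and on the de-identification performed.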
The Application Layer is responsible for the logic that determines the capabilities of the BRICS modules and tools. Seven service modules within the Application Layer are integrated together to provide a collaborative and extensible web-based environment: the data dictionary, account management, query tool, Protocol and Form Research Management System (ProFoRMS), GUID, data repository, and meta-analysis modules. To communicate and exchange information between the modules, a representational state transfer (RESTful) interface for web services is used.[14] Additional information about the various service modules is available from the BRICS site.
The Data Layer consists of open-source databases including PostgreSQL, Virtuoso, and MongoDB. Since a typical query use case requires data from the repository, data dictionary, and meta-analysis modules, it is much more efficient to store and access data in a single Virtuoso database. Instead of using resource-intensive joins in PostgreSQL, data can be accessed in Virtuoso by traversing the RDF graph. Having related data linked together in one place allows the query tool to quickly query repository data, an otherwise slow process. RDF is also used to support searching of studies, form structures, and data elements.
Also utilized are open-source libraries such as Hibernate and Apache Jena for storing and retrieving data from databases. Hibernate is an object-relational mapping framework used to map PostgreSQL data into Java objects. Using Hibernate reduces the amount of software code that would otherwise be required to translate tabular data from SQL into Java objects. Jena is a Java framework that enables interaction with semantic web applications; it is the Hibernate equivalent for the semantic web, mapping the Virtuoso data into Java objects. Both of these frameworks support users’ requests for retrieving and storing data. Because no single library was available to support data persistence across both stores, Hibernate was used with PostgreSQL, and Jena was used to support Virtuoso’s RDF structure.
The Data Layer is supported by the physical infrastructure located within the National Institutes of Health (NIH). It is certified to operate at the Federal Information Security Modernization Act's (FISMA) moderate level.[15][16][17] In accordance with the requirements for FISMA moderate systems, the BRICS system adheres to the National Institute of Standards and Technology's (NIST) Special Publication 800-53 and its cybersecurity standards and guidelines. The BRICS system is also certified to 21 CFR Part 11, and as part of its requirements, a stringent audit trail has been implemented within the BRICS system to verify that digital objects have not been altered or corrupted.
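The text does not say how the audit trail verifies integrity. One common approach, assumed here purely for illustration, is to record a cryptographic checksum when an object is stored and to compare it against a freshly computed checksum later. The `AuditTrail` class and its entry fields are hypothetical and are not the BRICS implementation.

```python
import hashlib
from datetime import datetime, timezone


def checksum(data: bytes) -> str:
    """SHA-256 digest of a digital object's bytes."""
    return hashlib.sha256(data).hexdigest()


class AuditTrail:
    """Hypothetical append-only log of object checksums."""

    def __init__(self):
        self._entries = []

    def record(self, object_id: str, data: bytes, actor: str) -> None:
        """Append who stored which object, when, and with what checksum."""
        self._entries.append({
            "object_id": object_id,
            "sha256": checksum(data),
            "actor": actor,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def is_unaltered(self, object_id: str, data: bytes) -> bool:
        """Compare current bytes with the most recently recorded checksum."""
        for entry in reversed(self._entries):
            if entry["object_id"] == object_id:
                return entry["sha256"] == checksum(data)
        return False  # no record exists, so integrity cannot be confirmed
```

Any change to the stored bytes, even a single character, produces a different digest and causes `is_unaltered` to return `False`.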
References
- ↑ Sarkar, I.N. (2010). "Biomedical informatics and translational medicine". Journal of Translational Medicine 8: 22. doi:10.1186/1479-5876-8-22. PMC PMC2837642. PMID 20187952. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2837642.
- ↑ Payne, P.R.O. (2012). "Chapter 1: Biomedical knowledge integration". PLoS Computational Biology 8 (12): e1002826. doi:10.1371/journal.pcbi.1002826. PMC PMC3531314. PMID 23300416. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531314.
- ↑ Thompson, H.J.; Vavilala, M.S.; Rivara, F.P. (2015). "Chapter 1: Common Data Elements and Federal Interagency Traumatic Brain Injury Research Informatics System for TBI Research". Annual Review of Nursing Research 33 (1): 1–11. doi:10.1891/0739-6686.33.1. PMC PMC4704986. PMID 25946381. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704986.
- ↑ Silva, J.; Wittes, R. (1999). "Role of clinical trials informatics in the NCI's cancer informatics infrastructure". Proceedings AMIA Symposium: 950–4. PMC PMC2232686. PMID 10566501. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2232686.
- ↑ Zentzis, B. (15 May 2017). "Common Data Element (CDE)". Clinfowiki. https://clinfowiki.org/wiki/index.php/Common_Data_Element_(CDE). Retrieved 03 April 2018.
- ↑ 6.0 6.1 National Institutes of Health. "NINDS Common Data Elements". National Institutes of Health. https://www.commondataelements.ninds.nih.gov/. Retrieved 03 April 2018.
- ↑ Rubinstein, Y.R.; McInnes, P. (2015). "NIH/NCATS/GRDR Common Data Elements: A leading force for standardized data collection". Contemporary Clinical Trials 42: 78–80. doi:10.1016/j.cct.2015.03.003. PMC PMC4450118. PMID 25797358. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450118.
- ↑ Moore, S.M.; Schiffman, R.; Waldrop-Valverde, D. et al. (2016). "Recommendations of Common Data Elements to Advance the Science of Self-Management of Chronic Conditions". Journal of Nursing Scholarship 48 (5): 437–47. doi:10.1111/jnu.12233. PMC PMC5490657. PMID 27486851. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5490657.
- ↑ Sheehan, J.; Hirschfeld, S.; Foster, E. et al. (2016). "Improving the value of clinical research through the use of Common Data Elements". Clinical Trials 13 (6): 671–76. doi:10.1177/1740774516653238. PMC PMC5133155. PMID 27311638. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5133155.
- ↑ "Common Data Element (CDE) Resource Portal". National Library of Medicine. National Institutes of Health. 3 January 2013. https://www.nlm.nih.gov/cde/glossary.html. Retrieved 03 April 2018.
- ↑ Hall, D.; Huerta, M.F.; McAuliffe, M.J. et al. (2012). "Sharing heterogeneous data: the national database for autism research". Neuroinformatics 10 (4): 331–9. doi:10.1007/s12021-012-9151-4. PMC PMC4219200. PMID 22622767. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4219200.
- ↑ Haak, D.; Page, C.-E.; Deserno, T.M. (2016). "A Survey of DICOM Viewer Software to Integrate Clinical Research and Medical Imaging". Journal of Digital Imaging 29 (2): 206–15. doi:10.1007/s10278-015-9833-1. PMC PMC4788610. PMID 26482912. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4788610.
- ↑ Shah, J.. "MIPAV". NIH Center for Information Technology. https://mipav.cit.nih.gov/. Retrieved 06 November 2017.
- ↑ Fielding, R.T. (2000). "Architectural Styles and the Design of Network-based Software Architectures" (PDF). University of California, Irvine. https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.