Journal:An extract-transform-load process design for the incremental loading of German real-world data based on FHIR and OMOP CDM: Algorithm development and validation
Full article title | An extract-transform-load process design for the incremental loading of German real-world data based on FHIR and OMOP CDM: Algorithm development and validation |
---|---|
Journal | JMIR Medical Informatics |
Author(s) | Henke, Elisa; Peng, Yuan; Reinecke, Ines; Zoch, Michéle; Sedlmayr, Martin; Bathelt, Franziska |
Author affiliation(s) | Technische Universität Dresden |
Primary contact | Email: elisa dot henke at tu dash dresden dot de |
Editors | Lovis, Christian |
Year published | 2023 |
Volume and issue | 11 |
Article # | e47310 |
DOI | 10.2196/47310 |
ISSN | 2291-9694 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://medinform.jmir.org/2023/1/e47310 |
Download | https://medinform.jmir.org/2023/1/e47310/PDF (PDF) |
This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed. |
Abstract
Background: In the Medical Informatics in Research and Care in University Medicine (MIRACUM) consortium, an IT-based clinical trial recruitment support system was developed based on the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). Currently, OMOP CDM is populated with German Fast Healthcare Interoperability Resources (FHIR) data using an extract-transform-load (ETL) process, which was designed as a bulk load. However, the computational effort that comes with an everyday full load is not efficient for daily recruitment.
Objective: The aim of this study is to extend our existing ETL process with the option of incremental loading to efficiently support daily updated data.
Methods: Based on our existing bulk ETL process, we performed an analysis to determine the requirements of incremental loading. Furthermore, a literature review was conducted to identify adaptable approaches. Based on this, we implemented three methods to integrate incremental loading into our ETL process. Lastly, a test suite was defined to evaluate the incremental loading for data correctness and performance compared to bulk loading.
Results: The resulting ETL process supports bulk and incremental loading. Performance tests show that, for one day of changes, the incremental load took 87.5% less execution time than the bulk load (2.12 minutes compared to 17.07 minutes), while no data differences occurred in OMOP CDM.
Conclusions: Since incremental loading is more efficient than a daily bulk load, and both loading options result in the same amount of data, we recommend using bulk load for an initial load and switching to incremental load for daily updates. The resulting incremental ETL logic can be applied internationally since it is not restricted to German FHIR profiles.
Keywords: extract-transform-load, ETL, incremental loading, OMOP CDM, FHIR, interoperability, Observational Medical Outcomes Partnership Common Data Model, Fast Healthcare Interoperability Resources
Introduction
Background and significance
Randomized controlled clinical trials are the gold standard to “measure the effectiveness of a new intervention or treatment.” [1] However, randomized controlled clinical trials are limited regarding the representative number of persons included and, therefore, are restricted in their external generalizability. To gain more unbiased evidence, observational studies focus on real-world data from large heterogeneous populations.
To support observational research, we at the Institute for Medical Informatics and Biometry at Technische Universität Dresden already provide a transferable extract-transform-load (ETL) process [2] to transform German real-world data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [3] provided by Observational Health Data Sciences and Informatics (OHDSI). [4] This transformation opens up possibilities for multicentric and even international studies. Due to the heterogeneity of the structure and content of the data from the data integration centers within the Medical Informatics Initiative Germany (MI-I) [5], the Health Level 7 (HL7) [6] Fast Healthcare Interoperability Resources (FHIR) communication standard was specified for all German university hospitals. Consequently, we used FHIR as the source for our ETL process. The FHIR specification is given by the core data set of the MI-I. [7] FHIR resources can be read from a FHIR Gateway [8] (PostgreSQL database) or a FHIR server (e.g., HAPI [9] or Blaze [10]). As the target of our ETL process, we used OMOP CDM v5.3.1. [11] The ETL process was implemented using the open-source Java framework Spring Batch. [12] It follows the default assumption described in The Book of OHDSI [13], where the OHDSI community defines the ETL process as a full load that transfers data from source to target systems.
This approach is efficient for a dedicated study where data is loaded once without any later update; however, it is inefficient when updated data is needed on a daily basis. The latter is the case for work on improving and supporting the recruitment process for clinical trials, which the Medical Informatics in Research and Care in University Medicine (MIRACUM) [14] consortium, as part of the MI-I funded by the German Federal Ministry of Education and Research, is pursuing. In this context, an IT-based clinical trial recruitment support system (CTRSS) based on OMOP CDM was implemented. [15] The CTRSS consists of a screening list for recruitment teams that provides potential candidates for clinical trials and is updated on a daily basis. To enable the CTRSS to provide recruitment proposals, the data in FHIR format at each of the 10 MIRACUM data integration centers must be transformed into the standardized format of OMOP CDM. The processing of FHIR resources to OMOP CDM through our ETL process has already been successfully tested and integrated at all 10 German university hospitals of the MIRACUM consortium.
So far, our ETL process is restricted to a bulk load of FHIR resources to OMOP CDM, which means that all FHIR resources are read from the source on every run. To enable the CTRSS to provide daily recruitment proposals, our ETL process has to be executed every day as a full load. However, often only a small amount of source data changes between two loads, so a daily full load results in unnecessarily long execution times. Consequently, the computational effort that comes with the daily execution of the bulk load is not efficient in the context of the CTRSS.
Thus, a new approach is needed to only process FHIR resources that were created, updated, or deleted (CUD) since the last execution of the ETL process once an initial load has been executed. This loading option is known as "incremental loading."
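The selection step behind incremental loading can be sketched in a few lines. This is a minimal illustration of the general idea, not the authors' Spring Batch implementation. The `SourceResource` record, its `last_updated` and `deleted` fields, and the `select_changed` function are hypothetical names, and the sketch assumes that every source record carries a last-modified timestamp and that deletions are flagged rather than physically removed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceResource:
    """Hypothetical stand-in for one FHIR resource in the source system."""
    resource_id: str
    last_updated: datetime  # assumed: time of the last create/update/delete
    deleted: bool = False   # assumed: deletions are flagged, not removed


def select_changed(resources, last_run):
    """Split resources changed after `last_run` into upserts and deletions.

    With last_run=None (no previous execution) every resource is selected,
    which corresponds to an initial bulk load.
    """
    upserts, deletions = [], []
    for res in resources:
        if last_run is not None and res.last_updated <= last_run:
            continue  # unchanged since the last ETL execution: skip it
        (deletions if res.deleted else upserts).append(res)
    return upserts, deletions


def utc(day, hour=0):
    """Helper for the example data: a UTC timestamp in January 2023."""
    return datetime(2023, 1, day, hour, tzinfo=timezone.utc)


source = [
    SourceResource("patient-1", utc(1)),
    SourceResource("encounter-7", utc(3, 9)),
    SourceResource("condition-4", utc(3, 11), deleted=True),
]

upserts, deletions = select_changed(source, last_run=utc(2))
print([r.resource_id for r in upserts])    # ['encounter-7']
print([r.resource_id for r in deletions])  # ['condition-4']
```

Passing `last_run=None` selects everything, so the same function covers both the initial bulk load and the later daily increments.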
Objective
To keep the bulk load option for dedicated studies and still handle daily changes in the source data efficiently, a combination of bulk load and incremental load is needed. To avoid the additional effort of implementing a second, independent ETL process for incremental loading, we aim to extend our existing ETL process with the option of incremental loading. During our research, we focused on the following four research questions:
- What requirements need to be considered when integrating incremental loading into our existing ETL process design?
- What approaches already exist for incremental ETL processes?
- How can the identified requirements from research question one be implemented in our existing ETL process design?
- Does incremental loading provide an advantage over daily bulk loading?
Methods
Analysis of the existing ETL FHIR-to-OMOP process
References
Notes
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.