Full article title | Improving data quality in clinical research informatics tools
---|---
Journal | Frontiers in Big Data
Author(s) | AbuHalimeh, Ahmed
Author affiliation(s) | University of Arkansas at Little Rock
Primary contact | Email: aaabuhalime at ualr dot edu
Editors | Ehrlinger, Lisa
Year published | 2022
Volume and issue | 5
Article # | 871897
DOI | 10.3389/fdata.2022.871897
ISSN | 2624-909X
Distribution license | Creative Commons Attribution 4.0 International
Website | https://www.frontiersin.org/articles/10.3389/fdata.2022.871897/full
Download | https://www.frontiersin.org/articles/10.3389/fdata.2022.871897/pdf (PDF)
Abstract
Maintaining data quality is a fundamental requirement for any successful and long-term data management project. Providing high-quality, reliable, and statistically sound data is a primary goal for clinical research informatics. In addition, effective data governance and management are essential to ensuring accurate data counts, reports, and validation. As a crucial step of the clinical research process, it is important to establish and maintain organization-wide standards for data quality management to ensure consistency across all systems designed primarily for cohort identification. Such systems allow users to perform an enterprise-wide search on a clinical research data repository to determine whether a set of patients meeting certain inclusion or exclusion criteria exists. Some of these clinical research tools are referred to as de-identified data tools.
Assessing and improving the quality of data used by clinical research informatics tools are both important and difficult tasks. For an increasing number of users who rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment to preserve the value of the data. In clinical research informatics, better data quality translates into better research results and better patient care. However, achieving high-quality data standards is a major task because of the variety of ways that errors might be introduced in a system and the difficulty of correcting them systematically. Problems with data quality tend to fall into two categories. The first category relates to inconsistencies among data resources, such as format, syntax, and semantic inconsistencies. The second relates to poor extract, transform, and load (ETL) and data mapping processes.
In this paper, we describe a real-life case study on assessing and improving data quality within a healthcare organization. This paper compares the results obtained from two de-identified data systems (TranSMART Foundation's i2b2 and Epic's SlicerDicer), discusses the data quality dimensions specific to the clinical research informatics context, and examines the possible data quality issues between the de-identified systems. This work closes by proposing steps or rules for maintaining data quality among different systems to help data managers, information systems teams, and informaticists at any healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes.
Keywords: clinical research data, data quality, research informatics, informatics, management of clinical data
Introduction
Data is the building block of all research, as research results are only as good as the data upon which the conclusions are formed. However, researchers may receive minimal training on how to use the de-identified data systems and methods common to clinical research today to achieve, assess, or control the quality of research data. (Nahm, 2012; Zozus et al., 2019) De-identified data systems are defined as systems or tools that allow users to drag and drop search terms from a hierarchical ontology into a Venn diagram-like interface. Investigators can then perform an initial analysis on the de-identified cohort. However, de-identified data systems have no features to indicate or assist in identifying the quality of data in the system; these systems only provide counts.
Another issue involves the level of knowledge a clinician has about informatics in general and clinical informatics in particular. Without knowledge of informatics concepts, clinicians may not be able to identify quality issues in informatics systems. This requires some background in informatics, the science of how to use data, information, and knowledge to improve human health and the delivery of healthcare services (American Medical Informatics Association, 2022), as well as in clinical informatics, the application of informatics and information technology to deliver healthcare services. For example, clinicians increasingly need to turn to patient portals, electronic medical records (EMRs), telehealth tools, healthcare apps, and a variety of data reporting tools (American Medical Informatics Association, 2022) as part of achieving higher-quality health outcomes.
The case presented in this paper focuses on the quality of data obtained from two de-identified systems: TranSMART Foundation's i2b2 and Epic's SlicerDicer. The purpose of this paper is to discuss the quality of the data (counts) generated from the two systems, understand the potential causes of data quality issues, and propose steps to improve the quality of, and increase trust in, the generated counts by comparing the accuracy, consistency, validity, and understandability of the outcomes from the two systems. The proposed quality improvement steps are broadly applicable and contribute generic, essential steps toward automating data curation and data governance to tackle various data quality problems. These steps should help data managers, information systems teams, and informaticists at a healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes.
The remainder of this paper is organized as follows. In the following section, we introduce the importance of data quality to clinical research informatics, followed by details of the case study, study method, and materials used. Afterwards, findings are presented and the proposed steps to ensure data quality are discussed. At the end, conclusions are drawn and future work discussed.
Importance of data quality to clinical research informatics
Data quality refers to the degree to which data meets the expectations of data consumers and their intended use of the data. (Pipino et al., 2002; Halimeh, 2011; AbuHalimeh and Tudoreanu, 2014) In clinical research informatics, this depends on the parameters of the study conducted. (Nahm, 2012; Zozus et al., 2019)
The significance of data quality lies in how the data is perceived and used by its consumer. Identifying data quality involves two stages: first, highlighting which characteristics (i.e., dimensions) are important (Figure 1) and second, determining how these dimensions affect the population in question. (Halimeh, 2011; AbuHalimeh and Tudoreanu, 2014)
This paper focuses on a subset of data quality dimensions, which we term "de-identified data quality dimensions" (DDQD). We think these dimensions, described in Table 1, are critical to maintaining data quality in de-identified systems.
The impact of quality data and data management lies in performance and efficiency gains and in the ability to extract new understanding. On the other hand, poor clinical research informatics data quality can cause inefficiencies and other problems throughout an organization. These impacts extend to the quality of research outcomes, healthcare services, and decision-making.
Quality is not a simple scalar measure but can be defined on multiple dimensions, with each dimension yielding different meanings for different information consumers and processes. (Halimeh, 2011; AbuHalimeh and Tudoreanu, 2014) Each dimension can be measured and assessed differently. Data quality assessment involves providing a value for each dimension that clearly indicates how fully that dimension or quality feature is achieved, enabling adequate understanding and management. Data quality and the discipline of informatics are undoubtedly interconnected. Data quality depends on how data are collected, processed, and presented; this is what makes data quality important and sometimes complicated, because data collection and processing vary from one study to another. Clinical informatics data can include different data formats and types and can come from different resources.
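As a rough, non-authoritative sketch of what dimension-by-dimension assessment can look like in practice, the snippet below assigns a score to each of the four dimensions compared later in this study (accuracy, consistency, validity, and understandability) and flags any score that falls below a minimum. The function name, example scores, and 0.9 cut-off are illustrative assumptions, not values from the paper.

```python
def assess_quality(dimension_scores, threshold=0.9):
    """Flag each data quality dimension whose score falls below a minimum.

    dimension_scores maps a dimension name to a score in [0, 1];
    the threshold is a hypothetical cut-off used only for illustration.
    """
    report = {name: {"score": score, "acceptable": score >= threshold}
              for name, score in dimension_scores.items()}
    overall = all(entry["acceptable"] for entry in report.values())
    return report, overall

# Hypothetical scores for one de-identified data source
report, overall = assess_quality({
    "accuracy": 0.97,
    "consistency": 0.85,
    "validity": 0.99,
    "understandability": 0.92,
})
print(overall)                 # False: the consistency score is below the assumed threshold
print(report["consistency"])   # {'score': 0.85, 'acceptable': False}
```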
Case study goals
The primary goal is to compare, identify, and understand discrepancies in patient counts between TranSMART Foundation's i2b2 and Epic's SlicerDicer. (Galaxy, 2021) The secondary goal is to create a data dictionary that clinical researchers can easily understand. For example, if they wanted a count of patients with asthma, they would know which diagnoses were used to identify patients, where these diagnoses were captured, and whether the count matches existing clinical knowledge.
The case described below is from a healthcare organization that wanted to have the ability to ingest other sources of research-specific data, such as genomic information, and their existing products did not have a way to do that. After deliberation, i2b2 (The i2b2 tranSMART Foundation, 2021) was chosen as the data model for their clinical data warehouse. Prior to going live with users, however, it was essential to validate that the data in their clinical data management system (CDMS) was accurate.
Methodology
Participants
The clinical validation process involved a clinical informatician, data analyst, and extract, transform, and load (ETL) developer.
Data
Many healthcare organizations use at least one of the three major Epic databases: Chronicles, Clarity, and Caboodle. The data source used to feed the i2b2 and SlicerDicer tools was the Caboodle database.
Tools
The tools used to perform the study were i2b2 and SlicerDicer:
- i2b2: Informatics for Integrating Biology and the Bedside (i2b2) is an open-source clinical data warehousing and analytics research platform managed by TranSMART Foundation. i2b2 enables sharing, integration, standardization, and analysis of heterogeneous data from healthcare and clinical research sources. (The i2b2 tranSMART Foundation, 2021)
- SlicerDicer: SlicerDicer is a self-service reporting tool included with Epic systems that gives physicians ready access to clinical data, customizable by patient population, for data exploration. SlicerDicer allows the user to choose and search a specific patient population to answer questions about diagnoses, demographics, and procedures performed. (Galaxy, 2021)
Method description
The study was designed to compare, identify, and understand discrepancies in patient counts between i2b2 and SlicerDicer. We achieved this goal by choosing tasks based on the nature of the tools. The first step was to run the same query in both tools to look at patient demographics (e.g., race, ethnicity, gender) and to identify how race and ethnicity were aggregated differently in i2b2 compared with SlicerDicer, which was more granular. For example, Cuban and Puerto Rican values in SlicerDicer were included in the "Other Hispanic" or "Latino" category in i2b2. The discrepancies are shown in Table 2.
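A minimal sketch of the kind of roll-up needed to compare the two tools' demographic counts is shown below. Only the Cuban and Puerto Rican example comes from the text; the mapping table, record layout, and counts are assumptions for illustration, not figures from Table 2.

```python
from collections import Counter

# Hypothetical roll-up from granular SlicerDicer ethnicity values to broader
# i2b2 categories; only the Cuban/Puerto Rican example is taken from the text.
ROLL_UP = {
    "Cuban": "Other Hispanic or Latino",
    "Puerto Rican": "Other Hispanic or Latino",
}

def aggregate_counts(granular_counts):
    """Sum granular SlicerDicer counts into i2b2-style categories."""
    rolled = Counter()
    for value, count in granular_counts.items():
        rolled[ROLL_UP.get(value, value)] += count
    return dict(rolled)

# Illustrative counts only
print(aggregate_counts({"Cuban": 120, "Puerto Rican": 85, "Not Hispanic or Latino": 5000}))
# {'Other Hispanic or Latino': 205, 'Not Hispanic or Latino': 5000}
```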
The second step was to run the same query to explore diagnoses using J45*, the ICD-10 code for asthma, and E10*, the ICD-10 code for type 1 diabetes. The query results are shown in Table 3.
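A wildcard code such as J45* or E10* can be read as an ICD-10 prefix match. The sketch below shows one way such a cohort count can be computed; the record layout and values are assumptions for illustration, not the study's data.

```python
# Illustrative patient-level diagnosis records; the layout is assumed.
records = [
    {"patient_id": 1, "icd10": "J45.20"},  # asthma
    {"patient_id": 2, "icd10": "E10.9"},   # type 1 diabetes
    {"patient_id": 3, "icd10": "I10"},     # unrelated code
]

def count_patients(recs, prefix):
    """Count distinct patients with at least one code starting with the prefix."""
    return len({r["patient_id"] for r in recs if r["icd10"].startswith(prefix)})

print(count_patients(records, "J45"))  # asthma cohort: 1
print(count_patients(records, "E10"))  # type 1 diabetes cohort: 1
```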
A percentage difference calculation was implemented to find the percent difference between i2b2 counts and SlicerDicer counts greater than zero. The percentage difference, as described in the formula below, expresses the difference between two numbers as a percentage of their average and is useful for estimating the quality of the counts coming from the two tools. The threshold for accepted quality in this study was a difference below two percent.

V1 = i2b2 counts and V2 = SlicerDicer counts; these counts are plugged into the formula below:

Percentage difference = |V1 − V2| / [(V1 + V2) / 2] × 100
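The check is simple enough to automate. A minimal sketch is shown below; the function names and example counts are assumptions for illustration.

```python
def percentage_difference(v1, v2):
    """Absolute difference between two counts relative to their average, in percent."""
    return abs(v1 - v2) / ((v1 + v2) / 2) * 100

QUALITY_THRESHOLD = 2.0  # accepted quality in this study: below a 2% difference

def within_threshold(v1, v2, threshold=QUALITY_THRESHOLD):
    """True when the i2b2 and SlicerDicer counts agree within the threshold."""
    return percentage_difference(v1, v2) < threshold

# Made-up example counts
print(within_threshold(10000, 10150))  # True: ~1.5% difference
print(within_threshold(10000, 10500))  # False: ~4.9% difference
```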
A paired t-test is used to investigate the difference between two counts from i2b2 and SlicerDicer for the same query.
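A hedged sketch of running such a paired t-test with SciPy is shown below. The two lists hold a few of the count pairs mentioned later in the text, used purely as example input rather than the study's full data, so the printed statistics will not reproduce the results reported in the next section.

```python
from scipy import stats

# Example paired counts for the same queries run in both tools
# (a small illustrative subset, not the full contents of Tables 2 and 3).
i2b2_counts        = [20429, 14500, 155434]
slicerdicer_counts = [22265, 23958, 1579]

# scipy.stats.ttest_rel performs a paired (related-samples) t-test.
result = stats.ttest_rel(i2b2_counts, slicerdicer_counts)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```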
Findings and hypotheses
All the results obtained from comparing the counts between i2b2 and SlicerDicer are listed in Tables 2 and 3.
However, when diagnoses were explored, larger discrepancies were noted. There are two diagnosis fields in i2b2: one for diagnosis and one for billing diagnosis. Using J45* as the ICD-10 code for asthma resulted in 22,265 patients when using the billing diagnosis code in SlicerDicer but only 20,429 in i2b2. The discrepancy using diagnosis was even larger. Patient count results for type 1 diabetes diagnosis code E10* using both diagnosis and billing diagnosis are also shown in Table 3.
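As a quick illustration (a value computed here from the counts above, not one taken from Table 3), plugging the asthma billing-diagnosis counts into the percentage difference formula gives roughly an 8.6% difference, far above the study's 2% threshold:

```python
v1, v2 = 20429, 22265                        # i2b2 vs. SlicerDicer asthma counts (billing diagnosis)
diff = abs(v1 - v2) / ((v1 + v2) / 2) * 100  # percentage difference formula from above
print(round(diff, 1))                        # 8.6, far above the 2% threshold
```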
The best approach to understanding the reasons for this discrepancy was to look at the diagnosis options in SlicerDicer and build a hypothesis about where it might come from. Next, the SQL code for the Caboodle-to-i2b2 ETL process was examined. From these examinations, the following hypotheses were considered:
H0: There is no discrepancy in the data elements used to pull the data.
H1: There is a discrepancy in the data elements used to pull the data.
A paired sample t-test was implemented on the counts obtained from i2b2 and SlicerDicer using different data points. The p-value was equal to 0 [P(x ≤ −Infinity) = 0]; in all cases, this means that the chance of a type I error (rejecting a correct H0) is small: 0 (0%). The smaller the p-value, the more it supports H1. For example, results of the paired t-test indicated that there is a significant medium difference between i2b2 (M = 14,500, SD = 0) and SlicerDicer (M = 23,958, SD = 0), t(0) = Infinity, p < 0.001; results of the paired t-test also indicated that there is a significant medium difference between i2b2 (M = 155,434, SD = 0) and SlicerDicer (M = 1,579, SD = 0), t(0) = Infinity, p < 0.001.
Since the p-value < α, H0 is rejected and the i2b2 population's average is considered to be not equal to the SlicerDicer population's average. In other words, the difference between the averages of i2b2 and SlicerDicer is large enough to be statistically significant.
The paired t-test results supported the alternative hypothesis and revealed that there is a discrepancy in the data elements used to pull the data.
In addition, the percentage difference results, which were used to estimate the quality of the counts coming from the two tools, showed that a majority of the comparisons exceeded the threshold for accepted quality in this study (below 2%), as shown in Tables 2 and 3. These results provide strong evidence of a crucial quality issue in the counts obtained.
Examination of the SQL code for the Caboodle-to-i2b2 ETL process showed that the code only looked at billing and encounter diagnoses, and everything that was not a billing diagnosis was simply labeled "diagnosis." SlicerDicer, and even Caboodle, include other diagnosis sources, such as "medical history," "hospital problem," and "problem list." This was documented in the data dictionary so that researchers would understand which sources i2b2 was using and that, if they wanted data beyond that, they would have to request data from Caboodle.
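To make the mechanism concrete, the hedged sketch below shows how restricting an extract to billing and encounter diagnoses yields a smaller cohort than one that also draws on the problem list, hospital problem, and medical history sources. The record layout and values are assumptions for illustration; only the source names come from the text.

```python
# Illustrative diagnosis records with a "source" field; the layout is assumed.
records = [
    {"patient_id": 1, "icd10": "J45.20", "source": "Billing"},
    {"patient_id": 2, "icd10": "J45.40", "source": "Encounter"},
    {"patient_id": 3, "icd10": "J45.50", "source": "Problem List"},
    {"patient_id": 4, "icd10": "J45.90", "source": "Medical History"},
]

ETL_SOURCES = {"Billing", "Encounter"}  # roughly what the Caboodle-to-i2b2 ETL kept
ALL_SOURCES = ETL_SOURCES | {"Problem List", "Hospital Problem", "Medical History"}

def cohort_size(recs, sources):
    """Count distinct patients whose diagnoses come from the allowed sources."""
    return len({r["patient_id"] for r in recs if r["source"] in sources})

print(cohort_size(records, ETL_SOURCES))  # i2b2-style count: 2
print(cohort_size(records, ALL_SOURCES))  # SlicerDicer/Caboodle-style count: 4
```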
Discussion
References
Notes
This presentation is faithful to the original, with only a few minor changes to presentation, grammar, and punctuation. In some cases important information was missing from the references, and that information was added. Numerous links that were originally posted inline in the text were turned into full citations for this version, adding significantly to the total citation count.