Journal:Design and refinement of a data quality assessment workflow for a large pediatric research network


Full article title: Design and refinement of a data quality assessment workflow for a large pediatric research network
Journal: eGEMs
Author(s): Khare, Ritu; Utidjian, Levon H.; Razzaghi, Hanieh; Soucek, Victoria; Burrows, Evanette; Eckrich, Daniel; Hoyt, Richard; Weinstein, Harris; Miller, Matthew W.; Soler, David; Tucker, Joshua; Bailey, L. Charles
Author affiliation(s): The Children's Hospital of Philadelphia, Seattle Children’s Hospital, Nemours Children’s Health System, Nationwide Children’s Hospital
Primary contact: Email: kharer at email dot chop dot edu
Year published: 2019
Volume and issue: 7(1)
Page(s): 36
DOI: 10.5334/egems.294
ISSN: 2327-9214
Distribution license: Creative Commons Attribution 4.0 International
Website: https://egems.academyhealth.org/articles/10.5334/egems.294/
Download: https://egems.academyhealth.org/articles/10.5334/egems.294/galley/397/download/ (PDF)

Abstract

Background: Clinical data research networks (CDRNs) aggregate electronic health record (EHR) data from multiple hospitals to enable large-scale research. A critical operation in building a CDRN is conducting continual evaluations to optimize data quality. The key challenges include determining assessment coverage for large datasets, handling data variability over time, and facilitating communication with data teams. This study presents the evolution of a systematic workflow for data quality assessment in CDRNs.

Implementation: Using a specific CDRN as a use case, a workflow was iteratively developed and packaged into a toolkit. The resulting toolkit comprises 685 data quality checks to identify data quality issues, procedures to reconcile them with a history of known issues, and a contemporary GitHub-based reporting mechanism for organized tracking.

Results: During the first two years of network development, the toolkit assisted in discovering over 800 data characteristics and resolving over 1,400 programming errors. Longitudinal analysis indicated that the variability in time to resolution (mean of 15 days, IQR of 24 days) is due to the underlying cause of the issue, the perceived importance of the domain, and the complexity of the assessment.

Conclusions: In the absence of a formalized data quality framework, CDRNs continue to face challenges in data management and query fulfillment. The proposed data quality toolkit was empirically validated on a particular network and is publicly available for other networks. While the toolkit is user-friendly and effective, the usage statistics indicated that the data quality process is very time-intensive, and sufficient resources should be dedicated to investigating problems and optimizing data for research.

Keywords: CDRN, checks, data quality, electronic health records, GitHub, issues

Background

Collaborations across multiple institutions are essential to achieve sufficient cohort sizes in clinical research and strengthen findings in a wide range of scientific studies.[1][2] Clinical data research networks (CDRNs) combine electronic health record (EHR) data from multiple hospital systems to provide integrated access for conducting large-scale research studies. The results of CDRN-based studies, however, come with the caveat that EHR data are directed towards clinical operations rather than clinical research. Suboptimal quality of EHR data and incorrect interpretation of EHR-derived data not only lead to inaccurate study results but also increase the cost of conducting science.[3] Hence, one of the most critical aspects of building a CDRN is conducting continual quality evaluation to ensure that the patient-level clinical datasets are “fit for research use.”[3][4][5] A well-designed data quality (DQ) assessment program helps data developers identify programming and logic errors when deriving secondary datasets from EHRs (e.g., an incorrect mapping of a patient’s race information into controlled vocabularies). It also assists data consumers and scientists in learning the peculiar characteristics of network data (e.g., “acute respiratory tract infections” and “attention deficit hyperactivity disorder” are likely to be among the most frequent diagnoses in a pediatric data resource) and helps assess the readiness of network data for specific research studies.[6]
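The race-mapping example above can be made concrete. The following is a minimal sketch of a value-conformance check of that kind, assuming an OMOP-style schema in which an unmapped value carries concept ID 0; the table and column names are illustrative rather than PEDSnet's actual model.

```python
# Minimal sketch of a value-conformance DQ check: what fraction of person
# records failed to map race into the controlled vocabulary?
# Assumes (hypothetically) an OMOP-style schema, person.race_concept_id,
# with concept ID 0 conventionally denoting an unmapped value.
import sqlite3


def unmapped_race_fraction(conn: sqlite3.Connection) -> float:
    """Return the fraction of person records with an unmapped race concept."""
    total = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
    unmapped = conn.execute(
        "SELECT COUNT(*) FROM person WHERE race_concept_id = 0"
    ).fetchone()[0]
    return unmapped / total if total else 0.0
```

In practice, a check like this would be compared against a tolerance threshold and flagged for review only when the threshold is exceeded.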

This study focuses on the pediatric learning health system (PEDSnet) CDRN, which provides access to pediatric observational data drawn from eight of the nation’s largest children’s hospitals, through an underlying common data model (CDM).[7] PEDSnet is one of the 13 CDRNs supported by the Patient-Centered Outcomes Research Institute (PCORI). PEDSnet has a centralized architecture in which most patient-level datasets from the various hospitals are concatenated together; the two hospitals that participate in a distributed manner are the exception. The ultimate goal of PEDSnet is to provide high-quality data for conducting a variety of pediatric research studies. Therefore, an essential task in building PEDSnet was developing a system to conduct DQ assessments on the network data.

We encountered three critical demands in conducting such assessments in PEDSnet.

1. Assessment coverage: It is important to ensure that the key variables in the CDM are being assessed, that the assessments encompass the important aspects of DQ[5][8], and that they meet the demands of internal and external data consumers. The challenge lies in selecting the variables to be assessed and the types of assessment to be used for the selected variables, given the plethora of possible DQ checks for a single variable or a combination of variables. In a CDRN like PEDSnet, which contains hundreds of variables in the CDM and serves a diverse set of users, identifying appropriate assessment coverage is a vital but resource-intensive task. For example, the field condition_concept_id, which captures the standardized SNOMED diagnosis in PEDSnet, could be assessed in a number of ways: missing or unmapped diagnoses, incorrectly mapped diagnoses, variability in the frequency distribution of a given diagnosis across sites, inconsistency between diagnosis and medication data, and so on.
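To make the last of these concrete, cross-site variability in the frequency of a given condition_concept_id might be flagged by comparing each site's rate against the network-wide median. The data structures and tolerance band below are assumptions for illustration, not the toolkit's actual logic.

```python
# Hypothetical sketch of a cross-site variability check for one diagnosis
# concept: flag sites whose rate falls outside a multiplicative band around
# the network median. The tolerance value is an illustrative assumption.
from statistics import median


def flag_outlier_sites(site_rates: dict, tolerance: float = 3.0) -> dict:
    """site_rates maps site name -> fraction of condition rows carrying the
    concept of interest; returns sites outside [median/tol, median*tol]."""
    m = median(site_rates.values())
    return {
        site: rate
        for site, rate in site_rates.items()
        if m and (rate > tolerance * m or rate < m / tolerance)
    }


# Example: one site records the diagnosis far more often than its peers.
rates = {"site_a": 0.021, "site_b": 0.019, "site_c": 0.094, "site_d": 0.018}
print(flag_outlier_sites(rates))  # -> {'site_c': 0.094}
```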

2. Evolution-friendly: PEDSnet is a continually growing network; the size of the centralized dataset increases as data for new patients and new observations are added to the individual EHRs. In addition, the underlying model evolves based on the changing needs of the data consumers. For instance, the PEDSnet patient population has increased by over 90 percent since the first data submission, and at least six versions of the underlying model have been adopted in the last two years. It is a challenge to conduct DQ assessments on such an evolving dataset while accounting for temporal variability.
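One way to operationalize such a check, sketched below under assumed growth bounds (not PEDSnet's actual rules), is to compare per-table row counts between consecutive data submission cycles and flag tables that shrink, since data should only accumulate, or that grow implausibly fast.

```python
# Minimal sketch of a temporal-variability check: compare per-table row
# counts between two submission cycles. The max_growth bound is an
# illustrative assumption, not an actual PEDSnet threshold.
def flag_temporal_anomalies(prev_counts: dict, curr_counts: dict,
                            max_growth: float = 2.0) -> dict:
    """Return {table: (prev, curr)} for tables whose row count shrank or
    more than doubled between cycles."""
    anomalies = {}
    for table, prev in prev_counts.items():
        curr = curr_counts.get(table, 0)
        if curr < prev or (prev and curr / prev > max_growth):
            anomalies[table] = (prev, curr)
    return anomalies


cycle_1 = {"person": 100_000, "condition_occurrence": 2_500_000}
cycle_2 = {"person": 190_000, "condition_occurrence": 1_200_000}
print(flag_temporal_anomalies(cycle_1, cycle_2))
# -> {'condition_occurrence': (2500000, 1200000)}
```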

3. Communication: The PEDSnet data committee, which is responsible for developing the network data, represents a collaboration of more than 50 programmers, analysts, and scientists spread across the eight participating institutions. It is important to enable effective communication among them and to track all DQ-related interactions.
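The abstract above describes a GitHub-based reporting mechanism for organized tracking. A hypothetical sketch of that pattern follows: when a check fails, file it as a labeled issue through GitHub's REST issues endpoint so the discussion and resolution are tracked in one place. The repository name, labels, and token handling are illustrative assumptions.

```python
# Hypothetical sketch: report a failed DQ check as a GitHub issue via the
# REST API (POST /repos/{owner}/{repo}/issues). The repository and labels
# are placeholders, not PEDSnet's actual tracker.
import os

import requests


def file_dq_issue(site: str, check_name: str, details: str) -> int:
    """Open a GitHub issue for a failed DQ check and return its number."""
    resp = requests.post(
        "https://api.github.com/repos/example-org/dq-issues/issues",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={
            "title": f"[{site}] {check_name} failed",
            "body": details,
            "labels": ["data-quality", site],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["number"]
```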

While much information on DQ assessments is made available by existing data sharing networks[9], there is limited discussion of the coverage and driving factors of the assessments to be encapsulated in a toolkit. There have been recent advances in conducting DQ assessments on an evolving dataset to review temporal variations.[10][11][12][13] However, the communication challenge is largely underexplored in the context of CDRNs. In this study, we describe the design and evolution of a software toolkit that addresses the key challenges outlined above and serves as a systematic DQ assessment workflow for the PEDSnet CDRN.


References

  1. Bailey, L.C.; Milov, D.E.; Kelleher, K. et al. (2013). "Multi-Institutional Sharing of Electronic Health Record Data to Assess Childhood Obesity". PLoS One 8 (6): e66192. doi:10.1371/journal.pone.0066192. PMC PMC3688837. PMID 23823186. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3688837. 
  2. Brown, J.S.; Kahn, M.; Toh, S. (2013). "Data quality assessment for comparative effectiveness research in distributed data networks". Medical Care 51 (8 Suppl. 3): S22–9. doi:10.1097/MLR.0b013e31829b1e2c. PMC PMC4306391. PMID 23793049. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4306391. 
  3. Kahn, M.G.; Raebel, M.A.; Glanz, J.M. et al. (2012). "A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research". Medical Care 50 (Suppl.): S21–9. doi:10.1097/MLR.0b013e318257dd67. PMC PMC3833692. PMID 22692254. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3833692. 
  4. Arts, D.G.; De Keizer, N.F.; Scheffer, G.J. (2002). "Defining and improving data quality in medical registries: A literature review, case study, and generic framework". JAMIA 9 (6): 600–11. doi:10.1197/jamia.m1087. PMC PMC349377. PMID 12386111. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC349377. 
  5. Weiskopf, N.G.; Weng, C. (2013). "Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research". JAMIA 20 (1): 144–51. doi:10.1136/amiajnl-2011-000681. PMC PMC3555312. PMID 22733976. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3555312. 
  6. Holve, E.; Kahn, M.; Nahm, M. et al. (2013). "A comprehensive framework for data quality assessment in CER". AMIA Joint Summits on Translational Science Proceedings 2013: 86–8. PMC PMC3845781. PMID 24303241. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3845781. 
  7. Forrest, C.B.; Margolis, P.A.; Bailey, L.C. et al. (2014). "PEDSnet: A National Pediatric Learning Health System". JAMIA 21 (4): 602–6. doi:10.1136/amiajnl-2014-002743. PMC PMC4078288. PMID 24821737. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4078288. 
  8. Kahn, M.G.; Callahan, T.J.; Barnard, J. et al. (2016). "A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data". EGEMS 4 (1): 1244. doi:10.13063/2327-9214.1244. PMC PMC5051581. PMID 27713905. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051581. 
  9. Callahan, T.J.; Bauck, A.E.; Bertoch, D. et al. (2017). "A Comparison of Data Quality Assessment Checks in Six Data Sharing Networks". EGEMS 5 (1): 8. doi:10.5334/egems.223. PMC PMC5982846. PMID 29881733. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5982846. 
  10. Qualls, L.G.; Phillips, T.A.; Hammill, B.G. et al. (2018). "Evaluating Foundational Data Quality in the National Patient-Centered Clinical Research Network (PCORnet)". EGEMS 6 (1): 3. doi:10.5334/egems.199. PMC PMC5983028. PMID 29881761. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5983028. 
  11. Curtis, L.H.; Weiner, M.G.; Boudreau, D.M. et al. (2012). "Design considerations, architecture, and use of the Mini-Sentinel distributed data system". Pharmacoepidemiology and Drug Safety 21 (Suppl. 1): 23–31. doi:10.1002/pds.2336. PMID 22262590. 
  12. Raebel, M.A.; Haynes, K.; Woodworth, T.S. et al. (2014). "Electronic clinical laboratory test results data tables: lessons from Mini-Sentinel". Pharmacoepidemiology and Drug Safety 23 (6): 609–18. doi:10.1002/pds.3580. PMID 24677577. 
  13. Network HCSR. "Data Resources". 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Grammar was cleaned up for smoother reading. In some cases important information was missing from the references, and that information was added.