Full article title | Big data in the era of health information exchanges: Challenges and opportunities for public health |
---|---|
Journal | Informatics |
Author(s) | Baseman, Janet G.; Revere, Debra; Painter, Ian |
Author affiliation(s) | University of Washington |
Primary contact | Email: jbaseman at uw dot edu |
Editors | Ge, Mouzhi; Dohnal, Vlastislav |
Year published | 2017 |
Volume and issue | 4(4) |
Page(s) | 39 |
DOI | 10.3390/informatics4040039 |
ISSN | 2227-9709 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | http://www.mdpi.com/2227-9709/4/4/39/htm |
Download | http://www.mdpi.com/2227-9709/4/4/39/pdf (PDF) |
Abstract
Public health surveillance of communicable diseases depends on timely, complete, accurate, and useful data that are collected across a number of health care and public health systems. Health information exchanges (HIEs), which support electronic sharing of data and information between health care organizations, are recognized as a source of "big data" in health care and have the potential to provide public health with a single stream of data collated across disparate systems and sources. However, given that these data are not collected specifically to meet public health objectives, it is unknown whether a public health agency’s (PHA’s) secondary use of the data supports or presents additional barriers to meeting disease reporting and surveillance needs. To explore this issue, we conducted an assessment of the big data available to a PHA (laboratory test results and clinician-generated notifiable condition report data) through its participation in an HIE.
Keywords: big data, communicable diseases, data mining, data quality, epidemiology, health information exchange, infectious diseases, population surveillance, public health
Introduction
We evaluated two datasets, one for sexually transmitted infections (STIs) and one for non-STIs, covering January 1, 2012 to September 15, 2013 and used by a PHA that is part of one of the largest and oldest HIE infrastructures in the U.S. The two datasets were independently analyzed for their data quality, utility, and appropriateness for meeting public health surveillance objectives, using three measures: (1) timeliness, defined as the difference between the earliest date of a disease report and the date the report is received at the PHA; (2) volume, defined as the number of disease report cases received by the PHA; and (3) completion, defined as the number of days to close a disease case report.
Our assessment uncovered the following challenges for effective utilization of big data by public health:
- While PHAs almost exclusively rely on secondary use data for surveillance, big data that has been collected for clinical purposes omits data fields of high value for public health.
- Big data is not always smart data, especially when the context within which the data is collected is absent.
- Data collected by disparate, varying systems and sources can introduce uncertainties and limit trustworthiness in the data, which may diminish its value for public health purposes.
- The process by which data is obtained needs to be evident in order for big data to be useful to public health.
- Big data for public health purposes needs to answer both "what" and "why" questions.
Despite these and other issues—such as measurement error and confounding, well-known challenges to both big and small data—strategies traditionally employed by public health epidemiologists and other public health professionals can uncover limitations and contribute to the design of solutions in collection, integration, warehousing, and analysis of big data so its value and utility to public health can be optimized.
In recognition of the 10-year anniversary of the incorporation of the internet search firm Google, the journal Nature issued a special supplement on big data and what the availability of large datasets meant and will mean for scientists and researchers.[1] In particular, the supplement focused on the opportunities that will be possible when issues such as interoperable data infrastructures, security, data standardization, storage and transfer requirements, and data governance are resolved. Now, nearly 10 years later, users of big data, characterized by the 5 Vs (huge volume, high velocity, high variety, low veracity, and high value), still encounter the issues presented in the Nature special supplement.[2] In particular, the primary challenges to utilizing big data center around the diversity of data types (variety), the resources required to handle data collection, storage, and processing (velocity), and the uncertainties inherent in mixing and cleaning data from varied data streams, which generate unpredictability in the data (veracity).[3]
Despite these challenges, big data also promises great opportunities within the health care sector to improve quality of health care delivery, population management, early detection of disease, decision-making, and cost reduction.[4] Major contributors to the explosion of big data are investments in information technology (IT), such as increased adoption of electronic medical record systems[5], and the creation of health information exchanges (HIEs)[6], which facilitate sharing of electronic data and information between health care organizations.[7] While the focus of HIEs has been on sharing patient information between clinics, hospitals, pharmacies, laboratories, and payers, public health agencies (PHAs) are increasingly included in HIEs.[8] PHA participation in an HIE provides a single stream of data collated across disparate systems and sources for public health.
Public health is a data-intensive and data-driven field. Data is a highly valued currency for assessing the health of the community; providing guidance to stakeholders for handling a foodborne illness outbreak; forecasting the burden of seasonal influenza to enable sufficient timing to vaccinate vulnerable populations; and innumerable other efforts that aim to prevent disease, prolong life, promote human health, and mitigate unnecessary suffering.[9] Within the context of big data, public health efforts include linking information technology systems to conduct population-based cancer research and surveillance[10], more effectively identifying behaviors that can build healthier communities[11], and improving targeted and timely epidemiologic surveillance of communicable and infectious disease.[12]
Effective public health surveillance of communicable diseases relies on time-sensitive, complete, accurate, and useful data that are collected across a number of healthcare and public health systems. It could be assumed that PHA participation in an HIE would support and potentially improve surveillance efforts, as data collected within the clinical encounter could be shared with public health more rapidly and be integrated into PHA decision support systems to meet public health practice needs. However, given that these data are not collected specifically to meet public health objectives, it is unknown whether a PHA’s secondary use of the data supports or presents additional barriers to meeting disease reporting and surveillance needs. To explore this issue, we conducted an assessment of the big data available to a PHA (laboratory test results and clinician-generated notifiable condition report data) through its participation in an HIE. We also discuss the extent to which the value of this data affects the rationale for investing in the infrastructure, including workforce training, that is required to collect and interpret it and ultimately inform measurable improvements in the health of public health community stakeholders.
Objective
To explore challenges and opportunities for utilizing public health big data available through PHA participation in an HIE.
Methods
Ethics
This study was approved by the Indiana University Institutional Review Board with cross-institutional and concurrent IRB deferral from the University of Washington.
Data source
Datasets for the time period of January 1, 2012 through September 15, 2013 were pulled from two public health surveillance systems: (1) the Statewide Information Management Surveillance System (SWIMSS), which collects electronic lab reports (ELRs) and communicable disease reports (CDRs) for STIs; and (2) InSight, the county’s core population health data system, which collects ELRs and CDRs of non-STI data for public health surveillance activities. The SWIMSS data pull was limited to the most prevalent and highly reported conditions: chlamydia, gonorrhea, and syphilis. The InSight data pull was limited to acute hepatitis B, chronic hepatitis C, histoplasmosis, and salmonella.
Analysis
The two datasets were independently analyzed for their data quality, utility, and appropriateness for meeting public health surveillance objectives, using three measures: (1) timeliness, defined as the difference between the earliest date of a disease report and the date the report is received at the PHA; (2) volume, defined as the number of disease report cases received by the PHA; and (3) completion, defined as the number of days until a case report is marked as closed by the investigator.
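As a rough illustration of the three measures defined above, the following minimal Python sketch computes timeliness, volume, and completion for a handful of made-up case reports. The record layout, field names, and sample values are hypothetical and are not taken from SWIMSS or InSight. The sketch also assumes that completion is counted from the date of receipt at the PHA, which the definition above does not state.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CaseReport:
    """Hypothetical record layout; field names are illustrative only."""
    disease: str
    earliest_report_date: date    # earliest date associated with the report
    received_date: date           # date the report reached the PHA
    closed_date: Optional[date]   # None while the case is still open


def timeliness_days(report):
    """Days between the earliest report date and receipt at the PHA."""
    return (report.received_date - report.earliest_report_date).days


def completion_days(report):
    """Days from receipt to closure (assumed start point); None if still open."""
    if report.closed_date is None:
        return None
    return (report.closed_date - report.received_date).days


def volume_by_disease(reports):
    """Number of case reports received, grouped by disease."""
    return Counter(r.disease for r in reports)


# Made-up example records, for illustration only.
reports = [
    CaseReport("chlamydia", date(2012, 3, 1), date(2012, 3, 4), date(2012, 3, 10)),
    CaseReport("chlamydia", date(2012, 3, 2), date(2012, 3, 3), None),
    CaseReport("salmonella", date(2012, 7, 9), date(2012, 7, 9), date(2012, 7, 12)),
]

for r in reports:
    print(r.disease, timeliness_days(r), completion_days(r))
print(volume_by_disease(reports))
```

Keeping open cases as `None` rather than dropping them makes it possible to report how many cases never reached closure, which is itself a data quality signal.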
Each dataset was separately reviewed for data quality issues. Duplicate records were removed and missing data rates were tabulated. Patterns of missing data were visualized over time, and change point analysis[13] was used to estimate time points at which underlying process changes may have occurred. Processing times (time to receipt of test results and PHA time to process results) were calculated in calendar days. Metadata on which days the PHA conducted work was not available, so work days were estimated from the data as the days on which any cases were closed; these estimated work days were then used to calculate the number of work days required to close each case. Analyses of factors associated with time to receive and time to process cases were conducted after removal of atypical times. We aggregated case counts by disease and month to examine seasonal patterns of disease counts, and by disease and week to examine possible outbreaks and associations between outbreaks of different disease types. Occurrences of possible outbreaks were examined using a threshold of three standard deviations above a 31-day moving average.
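Two of the steps above lend themselves to a short sketch: estimating work days from case closure dates, and flagging counts that exceed three standard deviations above a 31-day moving average. The Python below is only one possible reading of those steps, not the authors' implementation. It assumes daily counts, a trailing window that excludes the day being tested, and a population standard deviation computed over that same window; none of these details are given in the text. All function names and sample values are hypothetical.

```python
from datetime import date
from statistics import mean, pstdev


def estimate_work_days(closed_dates):
    """Treat any calendar day on which at least one case was closed as a work day."""
    return set(closed_dates)


def work_days_to_close(received, closed, work_days):
    """Count estimated work days after receipt, up to and including the close date."""
    return sum(1 for d in work_days if received < d <= closed)


def flag_possible_outbreaks(daily_counts, window=31, n_sd=3.0):
    """Return indices whose count exceeds mean + n_sd * SD of the preceding window.

    Assumption: the baseline is the `window` days immediately before the day
    being tested; the first `window` days are never flagged.
    """
    flagged = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        threshold = mean(baseline) + n_sd * pstdev(baseline)
        if daily_counts[i] > threshold:
            flagged.append(i)
    return flagged


# Made-up closure dates: cases were closed on a Monday, Tuesday, and Friday.
closures = [date(2013, 1, 7), date(2013, 1, 7), date(2013, 1, 8), date(2013, 1, 11)]
work_days = estimate_work_days(closures)
print(work_days_to_close(date(2013, 1, 4), date(2013, 1, 11), work_days))

# Made-up daily counts: a stable baseline followed by one unusually high day.
counts = [2, 4] * 16 + [20]
print(flag_possible_outbreaks(counts))
```

A trailing window keeps the day under test out of its own baseline, so a single spike cannot inflate the threshold it is compared against. A centered window would be an equally plausible reading of "moving average" and would change which days are flagged.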
Results
The final SWIMSS dataset included chlamydia (n = 28018); gonorrhea (n = 7791); syphilis (n = 810); and syphilis, reactor (n = 3118). The final InSight dataset included acute hepatitis B (n = 563); chronic hepatitis C (n = 2160); histoplasmosis (n = 73); and salmonella (n = 210). Table 1 summarizes data exclusions resulting from the data quality analysis.
References
- ↑ Miller, E. (2008). "Community cleverness required". Nature 455 (1). doi:10.1038/455001a.
- ↑ Kruse, C.S.; Goswamy, R.; Raval, Y.; Marawi, S. (2016). "Challenges and Opportunities of Big Data in Health Care: A Systematic Review". JMIR Medical Informatics 4 (4): e38. doi:10.2196/medinform.5359. PMC PMC5138448. PMID 27872036. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5138448.
- ↑ Jin, X.; Wah, B.W.; Cheng, X.; Wang, Y. (2015). "Significance and Challenges of Big Data Research". Big Data Research 2 (2). doi:10.1016/j.bdr.2015.01.006.
- ↑ Nambiar, R.; Bhardwaj, R.; Sethi, A.; Vargheese, R. (2013). "A look at challenges and opportunities of big data analytics in healthcare". Proceedings from the 2013 IEEE International Conference on Big Data: 17–22. doi:10.1109/BigData.2013.6691753.
- ↑ Joseph, S.; Sow, M.; Furukawa, M.F. et al. (2014). "HITECH spurs EHR vendor competition and innovation, resulting in increased adoption". American Journal of Managed Care 20 (9): 734–40. PMID 25365748.
- ↑ Roski, J.; Bo-Linn, G.W.; Andrews, T.A. (2014). "Creating value in health care through big data: Opportunities and policy implications". Health Affairs 33 (7): 1115–22. doi:10.1377/hlthaff.2014.0147. PMID 25006136.
- ↑ Groves, P.; Kayyali, B.; Knott, D.; Van Kuiken, S. (January 2013). "The 'big data' revolution in healthcare: Accelerating value and innovation". McKinsey & Company. https://www.mckinsey.com/~/media/mckinsey/industries/healthcare%20systems%20and%20services/our%20insights/the%20big%20data%20revolution%20in%20us%20health%20care/the_big_data_revolution_in_healthcare.ashx.
- ↑ Shah, G.H.; Leider, J.P.; Luo, H.; Kaur, R. (2016). "Interoperability of Information Systems Managed and Used by the Local Health Departments". Journal of Public Health Management and Practice 22 (Suppl. 6): S34–S43. doi:10.1097/PHH.0000000000000436. PMC PMC5049946. PMID 27684616. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5049946.
- ↑ "What Is Public Health?". CDC Foundation. https://www.cdcfoundation.org/what-public-health. Retrieved 12 October 2017.
- ↑ Barrett, M.A.; Humblet, O.; Hiatt, R.A.; Adler, N.E. (2013). "Big Data and Disease Prevention: From Quantified Self to Quantified Communities". Big Data 1 (3): 168–75. doi:10.1089/big.2013.0027. PMID 27442198.
- ↑ Meyer, A.M.; Olshan, A.F.; Green, L. et al. (2014). "Big data for population-based cancer research: the integrated cancer information and surveillance system". North Carolina Medical Journal 75 (4): 265–9. PMC PMC4766858. PMID 25046092. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4766858.
- ↑ Salathé, M.; Bengtsson, L.; Bodnar, T.J. et al. (2012). "Digital epidemiology". PLoS Computational Biology 8 (7): e1002616. doi:10.1371/journal.pcbi.1002616. PMC PMC3406005. PMID 22844241. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3406005.
- ↑ Painter, I.; Eaton, J.; Lober, B. (2013). "Using Change Point Detection for Monitoring the Quality of Aggregate Data". Online Journal of Public Health Informatics 5 (1): e186. doi:10.5210/ojphi.v5i1.4597.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Several URLs from the original were dead, and more current URLs were substituted.