The GAAIN Entity Mapper: An active-learning system for medical data mapping

Full article title	The GAAIN Entity Mapper: An active-learning system for medical data mapping
Journal	Frontiers in Neuroinformatics
Author(s)	Ashish, N.; Dewan, P.; Toga, A.W.
Author affiliation(s)	University of Southern California at Los Angeles
Primary contact	Email: nashish@loni.usc.edu
Editors	Van Ooyen, A.
Year published	2016
Volume and issue	9
Page(s)	30
DOI	10.3389/fninf.2015.00030
ISSN	1662-5196
Distribution license	Creative Commons Attribution 4.0 International
Website	http://journal.frontiersin.org/article/10.3389/fninf.2015.00030/full
Download	http://journal.frontiersin.org/article/10.3389/fninf.2015.00030/pdf (PDF)


Abstract

This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90 percent accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal.

Keywords: data mapping, machine learning, active learning, data harmonization, common data model

Background and significance

This paper describes a software solution for biomedical data harmonization. Our work is in the context of the “GAAIN” project in the domain of Alzheimer's disease data; however, the solution is applicable to biomedical or clinical data harmonization in general. GAAIN, the Global Alzheimer's Association Interactive Network, is a federated data-sharing network of Alzheimer's disease datasets from around the globe. The aim of GAAIN is to create a network of Alzheimer's disease data, researchers, analytical tools, and computational resources to better our understanding of this disease. A key capability of this network is to provide investigators with access to harmonized data across multiple, independently created Alzheimer's disease datasets.

Our primary interest is in biomedical data sharing, specifically harmonized data sharing. Harmonized data is data from multiple providers that has been curated to a unified representation after reconciling the different formats, representations, and terminologies from which it was derived.^[1]^[2] The process of data harmonization can be resource-intensive and time-consuming; the present work describes a software solution to significantly automate that process. Data harmonization is fundamentally about data alignment: the establishment of correspondence between related or identical data elements across different datasets. Consider the very simple example of a data element capturing the gender of a subject that is defined as “SEX” in one dataset, “GENDER” in another, and “M/F” in yet another. When harmonizing data, a unified element is needed to capture this gender concept, and the individual elements in the different datasets must be linked (aligned) with this unified element. In this example the unified element is “G.GENDER,” as illustrated in Figure 1.
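The gender example can be sketched as a small lookup table. This is only an illustration of what an element alignment looks like once it has been established, not the GEM system's method for finding it. The dataset names (`dataset_a`, `dataset_b`, `dataset_c`), the `VISIT` element, and the `harmonize` helper are hypothetical; only the element names SEX, GENDER, M/F, and G.GENDER come from the text.

```python
# Alignment table: (source dataset, source element) -> unified element.
# Dataset names are hypothetical; element names follow the example above.
ALIGNMENT = {
    ("dataset_a", "SEX"): "G.GENDER",
    ("dataset_b", "GENDER"): "G.GENDER",
    ("dataset_c", "M/F"): "G.GENDER",
}


def harmonize(dataset, record):
    """Rename a record's source elements to their unified elements.

    Elements with no alignment entry are dropped in this sketch; a real
    pipeline might keep or flag them instead.
    """
    unified = {}
    for element, value in record.items():
        target = ALIGNMENT.get((dataset, element))
        if target is not None:
            unified[target] = value
    return unified


# Three records that name the same concept differently.
rows = [
    harmonize("dataset_a", {"SEX": "F"}),
    harmonize("dataset_b", {"GENDER": "F"}),
    harmonize("dataset_c", {"M/F": "F", "VISIT": 2}),  # VISIT is unaligned
]
```

After alignment, all three records expose the value under the single key `G.GENDER`. The sketch covers only element names; reconciling the values themselves (for example, "F" versus "Female") is a separate step that it does not attempt.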

References

1. Doan, A.; Halevy, A.; Ives, Z. (2012). Principles of Data Integration (1st ed.). Elsevier. pp. 520. ISBN 9780123914798.
2. Ohmann, C.; Kuchinke, W. (2009). "Future developments of medical informatics from the viewpoint of networked clinical research: Interoperability and integration". Methods of Information in Medicine 48 (1): 45–54. doi:10.3414/ME9137. PMID 19151883.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. References are in order of appearance rather than alphabetical order (as the original was). Some grammar, punctuation, and minor wording issues have been corrected.