Journal:Effective information extraction framework for heterogeneous clinical reports using online machine learning and controlled vocabularies
|website      = [http://medinform.jmir.org/2017/2/e12/ http://medinform.jmir.org/2017/2/e12/]
|download    = [http://medinform.jmir.org/2017/2/e12/pdf http://medinform.jmir.org/2017/2/e12/pdf] (PDF)
}}


{| border="0" cellpadding="5" cellspacing="0" width="800px"
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Fig. 2''' Example snippets of different report forms. (a) Semi-structured report; (b) Template-based narration; and (c) Complex narration</blockquote>
  |-
|}
The system can operate in two modes: (1) interactive: through online learning, the system predicts values to be extracted for each report, and the user verifies or corrects the predicted values; and (2) batch: batch predicting for all unprocessed documents once the accrued accuracy is sufficient for users. Whereas interactive mode uses online machine learning to build the learning model incrementally, batch mode runs the same as the prediction phase of batch machine learning.


====System interface and user operations====
====System interface====
IDEAL-X provides a GUI with two main panels, a menu, and navigation buttons (Figure 3). The left panel is for browsing an input report, and the right panel is the output table with predicted values of each data element in the report. The menu provides options for defining the data elements to be extracted, specifying input reports, and other tasks.


|}


====Definition of data elements for extraction====
The system provides a wizard for constructing the metadata of the output form. The user builds the form by specifying a list of data elements and their constraints. An example is the data element "Heart Rate," which is constrained to be a numerical value between 0 and 200. Other constraints include sections of the report that may contain the values. However, except for the names of the data elements, specifying constraints is optional, as they can be learned by the system.
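The element-plus-constraints metadata described above can be sketched in a few lines (a hypothetical illustration; the class and field names are assumptions, not IDEAL-X's actual data model):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataElement:
    """Metadata for one extractable data element; only `name` is required."""
    name: str
    value_type: str = "text"              # e.g., "numeric", "text", "enum"
    min_value: Optional[float] = None     # numeric domain lower bound
    max_value: Optional[float] = None     # numeric domain upper bound
    sections: List[str] = field(default_factory=list)  # report sections likely to hold the value

    def accepts(self, raw: str) -> bool:
        """Check a candidate value against the declared constraints."""
        if self.value_type == "numeric":
            try:
                v = float(raw)
            except ValueError:
                return False
            if self.min_value is not None and v < self.min_value:
                return False
            if self.max_value is not None and v > self.max_value:
                return False
        return True

# The "Heart Rate" example from the text: numeric, between 0 and 200.
heart_rate = DataElement("Heart Rate", value_type="numeric", min_value=0, max_value=200)
```

Unset constraints simply impose no restriction, mirroring the system's ability to learn them later.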


====Data extraction workflow====
The user first selects a collection of input reports to be extracted from a local folder. By default, the system runs in interactive mode, and one report is loaded at a time on the left display panel. The user can make manual annotations by highlighting the correct value in the report text. Clicking the corresponding data field in the table assigns the value to the data element. If the system has prefilled the field of a data element with a predicted value, the user can provide feedback by fixing incorrect values. As the user navigates to the next document, the system compares the prefilled and the final values for the most recently processed document. Values that are unchanged or filled in by users are taken as positive instances, and values that have been revised are taken as negative instances. Both instance types are incorporated into the online learning algorithm to be used by the data extraction for subsequent documents. By iterating through this process, the amount of information that the system is able to correctly prefill grows over time. Note that manual revision in this context is different from traditional human labeling: it is only necessary when there is a wrong prediction, so user effort is significantly reduced. Once the decision model reaches an acceptable level of accuracy, the user has the option to switch to batch mode to complete extraction for the remaining documents. If a patient has multiple reports, the text input panel displays each report in a separate tab. Data extracted from all the reports are aggregated in the output.


====Customization of controlled vocabularies====
IDEAL-X also provides an interface for the user to customize a controlled vocabulary that can be used by the system for data extraction. The controlled vocabulary contains both terminology and structural properties. The terminology includes lists of values and their normalization mappings. For example, disease terminology includes "Diabetes Mellitus" with variations "DM" and "Diabetes." It also defines inductions; for example, taking "Insulin" or "Metformin" indicates having diabetes mellitus. Structural properties provide positive and negative contextual information for given terms. For example, to extract medications taken by patients, the "Allergies" section is a negative context, and medicine names in that section will be skipped. Structural properties may also contain disambiguation terms that may further improve the precision of extraction. A simple example is that "intolerant" is a negative indicator for identifying "statin," as "statin intolerant" refers to a different concept. Controlled vocabularies can be a powerful tool to support data extraction: they can be used to locate sentences and chunks of possible values, and to perform normalization for extracted values, as discussed in the next section.
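A controlled vocabulary of this shape might be represented as nested dictionaries (a minimal sketch; the keys and the sample lookup function are hypothetical, and the entries come from the examples above):

```python
# Hypothetical in-memory layout: terminology (canonical terms, variations,
# inductions) plus structural properties (negative-context sections,
# disambiguation terms).
vocabulary = {
    "terminology": {
        "Diabetes Mellitus": {"variations": ["DM", "Diabetes"],
                              "induced_by": ["Insulin", "Metformin"]},
    },
    "structure": {
        "Medications": {"skip_sections": ["Allergies"]},
        "statin": {"negative_indicators": ["intolerant"]},
    },
}

def normalize(term, vocab):
    """Map a surface form to its canonical term, if listed."""
    for canonical, entry in vocab["terminology"].items():
        if term == canonical or term in entry["variations"]:
            return canonical
    return term
```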


====The data extraction engine====
While the user interacts with the IDEAL-X interface, the data extraction engine works transparently in the background. The engine has three major components: answer prediction, learning, and the learning model that the online learning process continuously updates (Figure 4). The system combines statistical and machine learning–based approaches with controlled vocabularies for effective data extraction.


====Document preprocessing====
When a report is loaded, the text is first parsed into an in-memory hierarchical tree consisting of four layers: section, paragraph, sentence, and token. Apache OpenNLP<ref name="OpenNLP">{{cite web |url=http://opennlp.apache.org/ |title=OpenNLP |publisher=Apache Software Foundation |accessdate=20 April 2017}}</ref> is used to support the parsing with its Sentence Detector, Tokenizer, and Part-of-Speech Tagger. A reverse index of tokens is created to support an efficient keyword-based search. The index is used to find locations (e.g., sections, paragraphs, sentences, and phrases) of a token, as well as its properties such as part of speech and data type. For example, given the token "DM," the system can quickly identify the section (e.g., "History") and the containing sentences. Such token search is frequently performed in answer prediction, and the in-memory index structures enable high efficiency for such operations.
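The reverse index described above can be sketched as follows (a simplified illustration using whitespace tokenization in place of OpenNLP; the function and variable names are assumptions):

```python
from collections import defaultdict

def build_token_index(sections):
    """Build a reverse index: token -> list of (section, sentence_no) positions.
    `sections` maps a section name to its list of sentences, tokenized here
    by a simple whitespace split (the article uses OpenNLP instead)."""
    index = defaultdict(list)
    for section, sentences in sections.items():
        for i, sentence in enumerate(sentences):
            for token in sentence.lower().split():
                index[token].append((section, i))
    return index

report = {"History": ["Patient has DM and hypertension ."]}
idx = build_token_index(report)
# idx["dm"] -> [("History", 0)]
```

A lookup is then a constant-time dictionary access, which is what makes the frequent token searches during answer prediction cheap.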


====Answer prediction====
Predicting the value of each data element involves the following steps: (1) Identifying target sentences that are likely to contain the answer; (2) Identifying candidate chunks in the sentences; (3) Filtering the chunks to generate candidate values; (4) Ranking candidate values to generate (raw) values; (5) Normalizing values; and (6) Aggregating values from multiple reports. The workflow is shown in Figure 4.


|}


====Identifying target sentences====
Through online learning, the system accrues keywords from past answers (answer keywords) along with co-occurring words in the corresponding sentences (contextual words). For example, given the answer keywords "diabetes" and "hypertension" in the sentence "The patient reports history of diabetes and hypertension," contextual words are "patient," "report," and "history." Such answer keywords and contextual words combined with customized vocabularies can be utilized to identify sentences that are likely to contain answers with the following methods:
First, there is a similarity-based search using the vector space model.<ref name="ManningIntro08">{{cite book |title=Introduction to Information Retrieval |author=Manning, C.D.; Raghavan, P.; Schütze, H. |publisher=Cambridge University Press |pages=506 |year=2008 |isbn=9780521865715}}</ref> Given a collection of contextual words and their frequencies, the system computes the similarity against sentences in the document.<ref name="ManningIntro08" /> Sentences with high similarities are selected. For example, most sentences about "disease" contain "diagnosis" and "history." The past contextual keywords and their frequency weights are represented and maintained through a learning model, discussed later in the section on learning.
Second, there is an answer keyword matching search. The answer keywords, combined with relevant user-customized vocabularies, are also used to identify target sentences with keyword matching. For example, to extract diseases, if a sentence contains the disease term "myocardial infarction" defined in the vocabulary, the sentence is selected as a target. In both approaches, sections to be searched or skipped are also considered to narrow the scope of searching.
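The two sentence-selection methods can be sketched together (a simplified illustration; the threshold value and function names are assumptions, and raw term frequencies stand in for the learned weights):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def target_sentences(sentences, contextual_words, answer_keywords, threshold=0.3):
    """Select sentences by (1) similarity to accrued contextual words and
    (2) direct answer-keyword matching."""
    profile = Counter(contextual_words)
    hits = []
    for s in sentences:
        tokens = s.lower().split()
        if cosine(profile, Counter(tokens)) >= threshold or \
           any(k.lower() in tokens for k in answer_keywords):
            hits.append(s)
    return hits
```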
====Identifying candidate chunks====
After target sentences are selected, the system identifies potential phrases in the sentences using two methods: Hidden Markov model (HMM)<ref name="ElliottHidden08">{{cite book |title=Hidden Markov Models: Estimation and Control |author=Elliott, R.J.; Aggoun, L.; Moore, J.B. |publisher=Springer |pages=382 |year=2008 |isbn=9780387943640}}</ref> and keyword-based search. The HMM represents target words and contextual words in a sentence with different states, and it marks values to be extracted based on probability distributions learned from previously collected values and their sentences. The keyword-based search finds candidate chunks using keywords collected from past answers and the controlled vocabulary.
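The keyword-based half of this step can be sketched as follows (the HMM path is omitted here, and the function name is hypothetical):

```python
def candidate_chunks(sentence, known_values):
    """Keyword-based chunk search: return phrases from past answers or the
    controlled vocabulary that occur in the sentence. (The HMM-based method
    described in the article is not shown.)"""
    text = sentence.lower()
    return [v for v in known_values if v.lower() in text]

candidate_chunks("History of myocardial infarction in 2010",
                 ["myocardial infarction", "diabetes mellitus"])
# -> ["myocardial infarction"]
```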
====Filtering chunks====
To filter candidate chunks, the system uses rule induction<ref name="CiravegnaAdaptive01" /><ref name="FürnkranzSeparate99">{{cite journal |title=Separate-and-Conquer Rule Learning |journal=Artificial Intelligence Review |author=Fürnkranz, J. |volume=13 |issue=1 |pages=3–54 |year=1999 |doi=10.1023/A:1006524209794}}</ref> to generate "If-Then" rules based on historical statistics. The following filtering criteria are used: (1) Part of speech (POS): This filters a phrase by its POS tag in the sentence. Simple example phrases are noun and verb phrases. (2) String pattern: This looks for chunks that match special string patterns. For example, the first characters of all tokens are capitalized. (3) Value domain: This eliminates numerical or enumerated values that fall outside a specified range of values. (4) Negation: Based on predefined built-in rules, this removes phrases governed by words that reverse the meaning of the answer.<ref name="HuangANovel07">{{cite journal |title=A novel hybrid approach to automated negation detection in clinical radiology reports |journal=JAMIA |author=Huang, Y.; Lowe, H.J. |volume=14 |issue=3 |pages=304–11 |year=2007 |doi=10.1197/jamia.M2284 |pmid=17329723 |pmc=PMC2244882}}</ref> For example, if a candidate chunk "cancer" is extracted from a sentence "the patient has no history of cancer," "cancer" would not be included. (5) Certainty: Similar to negation filter, this detects and filters uncertain events or situations such as future plans, based on predefined rules. For example, a candidate chunk "radiation therapy" for treatment from a sentence "the patient is planned to take radiation therapy" should not be included. Whereas negation and certainty filtering is based on predefined rules, other filtering relies on real-time data statistics for filtering criteria.
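The value-domain and negation filters can be sketched as simplified If-Then checks (the cue list and rule logic are illustrative only, not the article's induced rules):

```python
import re

NEGATION_CUES = {"no", "not", "denies", "without"}   # illustrative cue words

def passes_filters(chunk, sentence, min_val=None, max_val=None):
    """Apply two simplified filters: value domain and negation."""
    # Value domain: numeric chunks must fall inside [min_val, max_val].
    if re.fullmatch(r"\d+(\.\d+)?", chunk):
        v = float(chunk)
        if min_val is not None and v < min_val:
            return False
        if max_val is not None and v > max_val:
            return False
    # Negation: reject if a negation cue precedes the chunk in the sentence.
    words = sentence.lower().split()
    if chunk.lower() in words:
        pos = words.index(chunk.lower())
        if NEGATION_CUES & set(words[:pos]):
            return False
    return True
```

So "cancer" in "the patient has no history of cancer" is rejected, while the same chunk in an affirmative sentence passes.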
====Ranking candidate values====
The system combines the scores of the selected sentences and chunks for ranking of candidate values. For a single-valued data element (e.g., heart beat), the candidate value with the highest confidence score is selected. For a multi-valued data element (e.g., medication), values with confidence scores above a threshold are selected. Based on this, each candidate value is either accepted or rejected.
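The single- versus multi-valued selection rule can be sketched as follows (a minimal illustration; the threshold value is an assumption):

```python
def select_values(scored, multi_valued=False, threshold=0.5):
    """`scored` is a list of (value, confidence) pairs. Single-valued data
    elements keep only the top candidate; multi-valued elements keep every
    candidate whose confidence clears the threshold."""
    if not scored:
        return []
    if multi_valued:
        return [v for v, c in scored if c >= threshold]
    return [max(scored, key=lambda vc: vc[1])[0]]
```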
====Normalizing values====
This step normalizes extracted values through transformation, generalization, and induction rules given by the controlled vocabulary (Figure 4). For example, "DM" is transformed into "Diabetes Mellitus." "Pindolol" is generalized to its hypernym "beta blocker." The appearance of the medication term "Metformin" (a drug for treating type 2 diabetes) in the text can infer the disease "Diabetes Mellitus."
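The three rule types can be sketched as small lookup tables (the table contents come from the examples above; the structure and function names are hypothetical):

```python
# Illustrative rule tables; real vocabularies would be far larger.
TRANSFORM = {"DM": "Diabetes Mellitus"}            # abbreviation -> canonical term
GENERALIZE = {"Pindolol": "beta blocker",          # term -> hypernym
              "lisinopril": "ACE inhibitor"}
INDUCE = {"Metformin": "Diabetes Mellitus"}        # medication -> implied disease

def normalize_value(value):
    """Apply transformation, then generalization."""
    value = TRANSFORM.get(value, value)
    value = GENERALIZE.get(value, value)
    return value

def induced_findings(values):
    """Diseases implied by extracted medications (induction rules)."""
    return sorted({INDUCE[v] for v in values if v in INDUCE})
```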
====Aggregating results====
Data extracted from multiple reports of a patient will be aggregated into a single table. The aggregation process may normalize values and remove duplicates. For example, "lisinopril" and "captopril" are extracted from a discharge summary and an inpatient report, respectively, and both can be normalized as "ACE inhibitor." If the same data element is extracted from multiple reports, deduplication is performed. The final output is a simple structured table that can be exported conveniently to other applications such as Excel (Microsoft) or a database.
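The aggregation-with-deduplication step can be sketched as follows (a simplified illustration; the normalization hook is an assumption):

```python
def aggregate(per_report_values, normalize=lambda v: v):
    """Merge values extracted from multiple reports of one patient,
    normalizing and removing duplicates while preserving first-seen order."""
    seen, merged = set(), []
    for values in per_report_values:
        for v in values:
            n = normalize(v)
            if n not in seen:
                seen.add(n)
                merged.append(n)
    return merged
```

With a normalization hook that maps both drugs to their class, "lisinopril" and "captopril" from two reports collapse to a single "ACE inhibitor" entry.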
Note that controlled vocabularies can play important roles in the answer prediction process. They are used for identifying target sentences through keyword searching, identifying candidate chunks through keyword matching, and supporting normalization for extracted values.
====Learning====
IDEAL-X takes an online learning–based approach to incrementally build statistical models and make predictions (Figure 5). The three models used in IDEAL-X are all statistically based and can be continuously updated after each iteration.


[[File:Fig5 Zheng JMIRMedInfo2017 5-2.png|800px]]
|}


System-predicted values automatically populate the output table, and the user advances to the next report with or without revision to these values. In both cases, the internal learning and prediction models of IDEAL-X are updated. For each instance, IDEAL-X collects and analyzes the following features: (1) Position: location of the answer in the text hierarchy; (2) Landmark: co-occurring contextual keywords in a sentence; (3) POS: part-of-speech tag; (4) Value: the tokens of the answer; and (5) String patterns: literal features such as capitalization, initials, and special punctuation. These features are then used to update the three models.
In IDEAL-X, each data element, such as the attribute "disease" or "medicine," has its own statistical model, and each new instance of a data element will update the corresponding model. There are three models to be updated: (1) Updating vector space model: This model uses "Landmark" features of positive instances. The system updates frequencies of co-occurring contextual words, which are used as weights of the space vector.<ref name="ManningIntro08" /> (2) Updating HMM: The HMM lists all words in a sentence as a sequence, in which an extracted value is marked as a target value state and other words are recognized as irrelevant contextual states. Based on this sequence, the state transition probabilities and emission probabilities are recalculated.<ref name="ElliottHidden08" /> (3) Updating rule induction model: Filtering rules are induced based on the coverage percentage.<ref name="FürnkranzSeparate99" /> Features such as POS, value domain, and string patterns of both positive and negative instances are analyzed, and their respective coverage percentages are modified. Once the coverage of a rule reaches a predefined threshold, the rule is triggered for filtering.
In interactive mode, these steps repeat for each report, and the learning models are continuously updated and improved.
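The frequency-weight update for the vector space model can be sketched as follows (a minimal illustration; the class is hypothetical and covers only positive instances):

```python
from collections import Counter

class VectorModel:
    """Per-data-element contextual-word weights, updated after each report."""
    def __init__(self):
        self.weights = Counter()

    def update(self, contextual_words):
        """Incorporate co-occurring words from a confirmed (positive) instance."""
        self.weights.update(w.lower() for w in contextual_words)

model = VectorModel()
model.update(["patient", "history", "diabetes"])
model.update(["patient", "reports", "hypertension"])
# model.weights["patient"] -> 2
```

The accumulated counts serve as the weights of the contextual-word vector used by the similarity-based sentence search.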
==Results==
===Experimental setup===
====Datasets====
We used three datasets from 100 patients that were randomly sampled from a collection of about 5000 patients in the Emory Biobank database. Dataset 1 is a set of semi-structured reports and contains 100 cardiac catheterization procedure reports. Dataset 2 is a set of template-based narration and contains 100 coronary angiographic reports. Dataset 3 is a set of complex narration and contains 315 reports, including history and physical report, discharge summary, outpatient clinic notes, outpatient clinic letter, and inpatient discharge medication report.
====Ground truth====
The test datasets are independently hand-annotated by domain expert annotators, including physicians, physician trainees, and students trained by the Emory Clinical Cardiovascular Research Institute for Biobank data reporting. Each record is annotated by two different annotators. The interrater agreement scores (kappa) of these three datasets are .991, .986, and .835, respectively. An arbitrator — an independent cardiovascular disease researcher — reconciles incompatible outputs of the system and the manual annotations to produce the final ground truth.
====Evaluation metrics====
For validation, precision, recall, and F1 scores are used to estimate the effectiveness of extraction by comparing the system-predicted results (before human revision) with the ground truth.
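These metrics can be computed per data element as follows (a standard formulation, treating predicted and ground-truth values as sets):

```python
def prf1(predicted, truth):
    """Precision, recall, and F1 over sets of extracted values."""
    p_set, t_set = set(predicted), set(truth)
    tp = len(p_set & t_set)                       # true positives
    precision = tp / len(p_set) if p_set else 0.0
    recall = tp / len(t_set) if t_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```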
====Experiment settings====
We aimed to evaluate the effectiveness of the system with respect to using online learning and controlled vocabularies and to understand their applicability to different report forms. By analyzing the report styles and vocabularies, we discovered that online learning will be more suitable for semi-structured or template-based narration reports, and controlled vocabulary-guided data extraction would be more effective on complex narration with a finite vocabulary. Thus, we designed three experiments: (1) Online learning–based data extraction, where controlled vocabularies are not provided, based on Dataset 1 (semi-structured) and Dataset 2 (template-based narration); (2) Controlled vocabularies-based data extraction, where online learning is not used, based on Dataset 3 (complex narration); and (3) Controlled vocabularies-guided data extraction combined with online learning, based on Dataset 3.
====Performance evaluation====
''Experiment 1: Online machine learning–based data extraction'': This experiment was based on Datasets 1 and 2. The system starts in an interactive mode with an empty decision model without prior training. The defined data elements are summarized in Multimedia Appendix 1. The user processes one report at a time, and each system-predicted value (including empty values for the first few reports) before user revision was recorded for calculating precision and recall.
Results are summarized in Table 1 for the two datasets, respectively. Both test cases achieved high precision, as semi-structured and template-based text is the easiest to handle. To study the learning rate of online learning, we divided the records into 10 groups and plotted precision and recall for every 10 percent of the records in Datasets 1 and 2. We observed that in both tests, the system maintained high precision during the learning process. Although some variability exists due to new data patterns, the recall in both cases also improved steadily. Not surprisingly, the rate of learning for Dataset 1 is much faster given its semi-structured form.
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|'''Table 1.''' Results of data extraction from semi-structured reports (Dataset 1) and template-based narration (Dataset 2)
|-
  ! style="padding-left:10px; padding-right:10px;"|Dataset
  ! style="padding-left:10px; padding-right:10px;"|Number of data elements
  ! style="padding-left:10px; padding-right:10px;"|Number of values
  ! style="padding-left:10px; padding-right:10px;"|Precision (%)
  ! style="padding-left:10px; padding-right:10px;"|Recall (%)
  ! style="padding-left:10px; padding-right:10px;"|F1 (%)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|1
  | style="background-color:white; padding-left:10px; padding-right:10px;"|19
  | style="background-color:white; padding-left:10px; padding-right:10px;"|1272
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.8
  | style="background-color:white; padding-left:10px; padding-right:10px;"|96.5
  | style="background-color:white; padding-left:10px; padding-right:10px;"|98.1
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|16
  | style="background-color:white; padding-left:10px; padding-right:10px;"|728
  | style="background-color:white; padding-left:10px; padding-right:10px;"|97.2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|93.2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|95.2
|-
|}
|}
''Experiment 2: Controlled vocabularies-guided data extraction'': In this experiment, online learning was disabled and data extraction was performed in batch mode using controlled vocabularies. Diseases and medications were extracted from Dataset 3 (values to be extracted are shown in Multimedia Appendix 1). Customized controlled vocabularies, including terminology and structural properties, had been created independently by physicians by consulting domain knowledge resources and analyzing a separate development dataset of reports from 100 patients, disjoint from Dataset 3. Note that comparisons in this and the following experiments were at the clinical finding level, between system-integrated results and manual-annotation-integrated results.
The results in Table 2 show that controlled vocabularies are highly effective for data extraction over complex narratives. Domain-specific data such as cardiology-related diseases and medications have limited numbers of possible values (or domain values), and a carefully customized controlled vocabulary can achieve high extraction accuracy.
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|'''Table 2.''' Results of controlled vocabularies-guided data extraction from complex narration (Dataset 3)
|-
  ! style="padding-left:10px; padding-right:10px;"|Type of data elements
  ! style="padding-left:10px; padding-right:10px;"|Number of data elements
  ! style="padding-left:10px; padding-right:10px;"|Number of ground truth values
  ! style="padding-left:10px; padding-right:10px;"|Precision (%)
  ! style="padding-left:10px; padding-right:10px;"|Recall (%)
  ! style="padding-left:10px; padding-right:10px;"|F1 (%)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Diseases
  | style="background-color:white; padding-left:10px; padding-right:10px;"|15
  | style="background-color:white; padding-left:10px; padding-right:10px;"|418
  | style="background-color:white; padding-left:10px; padding-right:10px;"|94.5
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.0
  | style="background-color:white; padding-left:10px; padding-right:10px;"|96.7
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Medications
  | style="background-color:white; padding-left:10px; padding-right:10px;"|10
  | style="background-color:white; padding-left:10px; padding-right:10px;"|437
  | style="background-color:white; padding-left:10px; padding-right:10px;"|98.6
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.7
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.2
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|All
  | style="background-color:white; padding-left:10px; padding-right:10px;"|25
  | style="background-color:white; padding-left:10px; padding-right:10px;"|855
  | style="background-color:white; padding-left:10px; padding-right:10px;"|96.5
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.4
  | style="background-color:white; padding-left:10px; padding-right:10px;"|97.9
|-
|}
|}
''Experiment 3: Controlled vocabularies-guided data extraction combined with online machine learning'': In this experiment, we performed two tests to examine how efficiently and effectively the system learns when only terminology is available and structural properties need to be obtained from online learning. Test 1 generated the baseline for comparison, and Test 2 demonstrated the effectiveness of combining online machine learning and controlled vocabularies. Dataset 3 was used to extract all diseases and medications.
For Test 1, terminology was used and online machine learning was disabled, so the test was guided by controlled vocabulary without any structural properties. We note that comprehensive terminology contributes directly to high recall rate, which means that the system seldom misses values to be extracted. However, if structural properties are not included, compared with the result in Experiment 2, the precision is much lower. This highlights the value of positive and negative contexts in an extraction task.
For Test 2, both terminology and online machine learning were used. Online machine learning supports learning structural properties. To show how quickly the system learns, only the 38 reports associated with the first 10 patients were processed with interactive online learning. All remaining reports were processed in batch. Results in Table 3 show an overall precision of 94.9%, which demonstrates that online learning could quickly learn structural properties.
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|'''Table 3.''' Results of controlled vocabularies-guided data extraction combined with online learning
|-
  ! style="padding-left:10px; padding-right:10px;"|Test
  ! style="padding-left:10px; padding-right:10px;"|Controlled vocabulary
  ! style="padding-left:10px; padding-right:10px;"|Online learning
  ! style="padding-left:10px; padding-right:10px;"|Precision (%)
  ! style="padding-left:10px; padding-right:10px;"|Recall (%)
  ! style="padding-left:10px; padding-right:10px;"|F1 (%)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|1
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Terminology only
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N/A
  | style="background-color:white; padding-left:10px; padding-right:10px;"|80.9
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.4
  | style="background-color:white; padding-left:10px; padding-right:10px;"|89.2
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Terminology only
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Applied to first 10 patients
  | style="background-color:white; padding-left:10px; padding-right:10px;"|94.9
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.4
  | style="background-color:white; padding-left:10px; padding-right:10px;"|97.1
|-
|}
|}
Typical errors in these two tests were associated with terminology and contextual information used in complex narrative scenarios. On one hand, the completeness of the terminology list, including terms and their synonyms, influences the recall rate directly. On the other hand, although coverage of terminologies could be maximized by a carefully engineered vocabulary, unwanted extractions arising from searches in the wrong section, undetected negations, and ambiguous use of terms can still lower the overall precision.
==Discussion==
===Principal findings===
IDEAL-X provides a generic data extraction framework that takes advantage of both online learning and controlled vocabularies. The two approaches complement each other and can also be combined. An online learning–based approach is highly effective for reports with underlying structural patterns, such as semi-structured reports or reports in a template-based narration style. Experiments with complex narrative reports indicate that the use of controlled vocabularies is highly effective for supporting extraction constrained by a finite data domain. In addition, structural properties such as section-data associations can play an important role in improving the accuracy of extraction. However, in cases where controlled vocabularies are unavailable — extracting generic named entities, for example — maintaining high accuracy is a challenge. This is an ongoing area of exploration that we will report on in the future.
Machine learning is among the major techniques for identifying candidate chunks. Besides HMM, we have also explored other classifiers such as the Naive Bayes classifier and neural network–based classifiers. An ongoing project includes a systematic study of different classifiers and their combinations (including Conditional Random Field and Support Vector Machine<ref name="JiangAStudy11">{{cite journal |title=A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries |journal=JAMIA |author=Jiang, M.; Chen, Y.; Liu, M. et al. |volume=18 |issue=5 |pages=601-6 |year=2011 |doi=10.1136/amiajnl-2011-000163 |pmid=21508414 |pmc=PMC3168315}}</ref>) for online machine learning–based data extraction.
To make the customization of controlled vocabularies from standard medical terminologies more flexible, an ongoing project is developing a tool that can easily search and import concepts from standard vocabularies such as ICD-9, [[ICD-10]], and SNOMED, either from a local file or through the NCBO BioPortal.
===Conclusions===
Although there are natural language processing tools available for extracting information from clinical reports, the majority lack the capability to support interactive feedback from human users. An interactive, online approach allows the user to coach the system using knowledge specific to the given set of reports, which may include local reporting conventions and structures. Moreover, no advanced linguistics knowledge or programming skills are required of the users; the system maintains the ordinary workflow of manual annotation systems. We performed a systematic study of the effectiveness of the online learning–based method combined with controlled vocabularies for data extraction from reports with various structural patterns, and we conclude that our method is highly effective. The framework is generic, and its applicability is demonstrated with diverse report types. The software will be made freely available online.<ref name="IDEAL-X">{{cite web |url=http://www.idealx.net/home.html |title=IDEAL-X |publisher=Emory University |accessdate=04 May 2017}}</ref>
==Acknowledgments==
The study was funded by the Centers for Disease Control and Prevention, contract #200-2014-M-59415.
==Conflicts of interest==
None declared.
==Multimedia Appendix 1==
[http://medinform.jmir.org/article/downloadSuppFile/7235/51185 Test cases] (PDF file; 32KB)


==References==


Full article title Effective information extraction framework for heterogeneous clinical reports using online machine learning and controlled vocabularies
Journal JMIR Medical Informatics
Author(s) Zheng, Shuai; Lu, James J.; Ghasemzadeh, Nima; Hayek, Salim S.; Quyyumi, Arshed A.; Wang, Fusheng
Author affiliation(s) Emory University, Stony Brook University
Primary contact Email: fusheng dot wang at stonybrook dot edu; Phone: 1 6316327528
Editors Eysenbach, G.
Year published 2017
Volume and issue 5 (2)
Page(s) e12
DOI 10.2196/medinform.7235
ISSN 2291-9694
Distribution license Creative Commons Attribution 2.0
Website http://medinform.jmir.org/2017/2/e12/
Download http://medinform.jmir.org/2017/2/e12/pdf (PDF)

Abstract

Background: Extracting structured data from narrated medical reports is challenged by the complexity of heterogeneous structures and vocabularies and often requires significant manual effort. Traditional machine-based approaches lack the capability to take user feedback for improving the extraction algorithm in real time.

Objective: Our goal was to provide a generic information extraction framework that can support diverse clinical reports and enables a dynamic interaction between a human and a machine that produces highly accurate results.

Methods: A clinical information extraction system IDEAL-X has been built on top of online machine learning. It processes one document at a time, and user interactions are recorded as feedback to update the learning model in real time. The updated model is used to predict values for extraction in subsequent documents. Once prediction accuracy reaches a user-acceptable threshold, the remaining documents may be batch processed. A customizable controlled vocabulary may be used to support extraction.

Results: Three datasets were used for experiments based on report styles: 100 cardiac catheterization procedure reports, 100 coronary angiographic reports, and 100 integrated reports, each combining a history and physical report, discharge summary, outpatient clinic notes, outpatient clinic letter, and inpatient discharge medication report. Data extraction was performed by three methods: online machine learning, controlled vocabularies, and a combination of these. The system delivers results with F1 scores greater than 95%.

Conclusions: IDEAL-X adopts a unique online machine learning–based approach combined with controlled vocabularies to support data extraction for clinical reports. The system can quickly learn and improve, thus it is highly adaptable.

Keywords: information extraction, natural language processing, controlled vocabulary, electronic medical records

Introduction

While immense efforts have been made to enable a structured data model for electronic medical records (EMRs), a large amount of medical data remain in free-form narrative text, and useful data from individual patients are usually distributed across multiple reports of heterogeneous structures and vocabularies. This poses major challenges to traditional information extraction systems, as either costly training datasets or manually crafted rules have to be prepared. These approaches also lack the capability of taking user feedback to adapt and improve the extraction algorithm in real time.

Our goal is to provide a generic information extraction framework that adapts to diverse clinical reports, enables a dynamic interaction between a human and a machine, and produces highly accurate results with minimal human effort. We have developed a system, Information and Data Extraction using Adaptive Online Learning (IDEAL-X), to support adaptive information extraction from diverse clinical reports with heterogeneous structures and vocabularies. The system is built on top of online machine learning and customizable controlled vocabularies. A demo video can be found on YouTube.[1]

IDEAL-X uses an online machine learning–based approach[2][3][4] for information extraction. Traditional machine learning algorithms take a two-stage approach: batch training based on an annotated training dataset, and batch prediction for future datasets based on the model generated from stage one (Figure 1). In contrast, online machine learning algorithms[2][3] take an iterative approach (Figure 1). IDEAL-X learns one document at a time and predicts values to be extracted for the next one. Learning occurs from revisions made by the user, and the updated model is applied to prediction for subsequent documents. Once the model achieves a satisfactory accuracy, the remaining documents may be processed in batches. Online machine learning not only significantly reduces user effort for annotation but also provides the mechanism for collecting feedback from human-machine interaction to improve the system's model continuously.

Besides online machine learning, IDEAL-X allows for customizable controlled vocabularies to support data extraction from clinical reports, where a vocabulary enumerates the possible values that can be extracted for a given attribute. (The X in IDEAL-X represents the controlled vocabulary plug-in.) The use of online machine learning and controlled vocabularies is not mutually exclusive; they are complementary, which provides the user with a variety of modes for working with IDEAL-X.


Fig1 Zheng JMIRMedInfo2017 5-2.png

Fig. 1 Online machine learning versus batch learning. (a) Batch machine learning workflow; (b) Online machine learning workflow

Background

Related work

A number of research efforts have been made in different fields of medical information extraction. Successful systems include caTIES[5], MedEx[6], MedLEE[7], cTAKES[8], MetaMap[9], HITEx[10], and so on. These methods take either a rule-based approach, a traditional machine learning–based approach, or a combination of both.

Different online learning algorithms have been studied and developed for classification tasks[11], but their direct application to information extraction has not been studied. Especially in the clinical environment, the effectiveness of these algorithms is yet to be examined. Several pioneering projects have used learning processes that involve user interaction and certain elements of IDEAL-X. I2E2 is an early rule-based interactive information extraction system.[12] It is limited by its restriction to a predefined feature set. Amilcare[13][14] is adaptable to different domains. Each domain requires an initial training that can be retrained on the basis of the user’s revision. Its algorithm (LP)2 is able to generalize and induce symbolic rules. RapTAT[15] is most similar to IDEAL-X in its goals. It preannotates text interactively to accelerate the annotation process. It uses a multinomial naïve Bayesian algorithm for classification but does not appear to use contextual information beyond previously found values in its search process. This may limit its ability to extract certain value types.

Different from online machine learning but related is active learning[16][17]; it assumes the ability to retrieve labels for the most informative data points while involving the users in the annotation process. DUALIST[18] allows users to select system-populated rules for feature annotation to support information extraction. Other example applications in health care informatics include word sense disambiguation[19] and phenotyping.[20] Active learning usually requires comprehending the entire corpus in order to pick the most useful data point. However, in a clinical environment, data arrive in a streaming fashion over time, which limits our ability to choose data points. Hence, an online learning approach is more suitable.

IDEAL-X adopts the Hidden Markov Model for its compatibility with online learning, and for its efficiency and scalability. We will also describe a broader set of contextual information used by the learning algorithm to facilitate extraction of values of all types.

Heterogeneous clinical reports

A patient’s electronic medical record could come with a variety of medical reports. Data in these reports provide critical information that can be used to improve clinical diagnosis and support biomedical research. For example, the Emory University Cardiovascular Biobank[21] collects records of patients with potential or confirmed coronary artery diseases undergoing cardiac catheterization, and it aims to combine extracted data elements from multiple reports to identify patients for research. Report types include history and physical report, discharge summary, outpatient clinic note, outpatient clinic letter, coronary angiogram report, cardiac catheterization procedure report, echocardiogram report, inpatient report, and discharge medication lists.


Fig2 Zheng JMIRMedInfo2017 5-2.png

Fig. 2 Example snippets of different report forms. (a) Semi-structured report; (b) Template based narration; and (c) Complex narration

We classify clinical reports into three forms: semi-structured data, template-based narration, and complex narration. Semi-structured data represent data elements in the form of attribute and value pairs (Figure 2). Reports in this form have simple structures, making data extraction relatively straightforward. Template-based narration is a very common report form. The narrative style, including sentence patterns and vocabularies, follows consistent templates and expressions (Figure 2). Extracting information from this type of text (e.g., "right posterior descending artery") requires major linguistics expertise, either to formulate extraction rules or to annotate training data. Complex narration is essentially free-form text. It can be irregular, personal, and idiomatic (Figure 2). Most medical reporting systems still allow for (and thus encourage) such a style. It is the most difficult form to interpret and process by natural language processing (NLP) algorithms. Nevertheless, certain types of information such as diseases and medications have a finite vocabulary that could be used to support data extraction.

Methods

Overview

The interface and workflow conform to traditional annotation systems: a user browses an input document from the input document collection and fills out an output form. On loading each document, the system attempts to fill the output form automatically with its data extraction engine. Then, a user can review and revise incorrect answers. The system then updates its data extraction model automatically based on the user’s feedback. Optionally, the user may provide a customized controlled vocabulary to further support data extraction and answer normalization. Pretraining with manually annotated data is not required, as the prediction model behind the data extraction engine can be established incrementally through online learning, customizing controlled vocabularies, or a combination of the two.

The system can operate in two modes: (1) interactive: through online learning, the system predicts values to be extracted for each report, and the user verifies or corrects the predicted values; and (2) batch: batch predicting for all unprocessed documents once the accrued accuracy is sufficient for users. Whereas interactive mode uses online machine learning to build the learning model incrementally, batch mode runs the same as the prediction phase of batch machine learning.
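The interactive and batch modes can be sketched as a simple control loop. The sketch below is a minimal illustration, not IDEAL-X's actual implementation; the `OnlineExtractor` class and its `predict`/`update` interface are hypothetical stand-ins for the system's learning model.

```python
class OnlineExtractor:
    """Toy stand-in for IDEAL-X's learning model (hypothetical interface)."""

    def __init__(self):
        self.seen = {}  # confirmed value -> count of confirmations

    def predict(self, report):
        # Return any previously confirmed value found in the report text.
        return [v for v in self.seen if v in report]

    def update(self, report, confirmed_values):
        # User-verified values become positive training instances.
        for v in confirmed_values:
            self.seen[v] = self.seen.get(v, 0) + 1


def interactive_mode(extractor, report, user_correct):
    """Predict, let the user verify or correct, then learn from the feedback."""
    predicted = extractor.predict(report)
    final = user_correct(predicted)  # user revises the prefilled values
    extractor.update(report, final)
    return final


def batch_mode(extractor, reports):
    """Once accuracy is acceptable, predict without further feedback."""
    return [extractor.predict(r) for r in reports]


extractor = OnlineExtractor()
# First report: the empty model predicts nothing; the user fills in the value.
interactive_mode(extractor, "History of diabetes.", lambda p: ["diabetes"])
# Remaining reports can then be batch processed with the learned model.
results = batch_mode(extractor, ["Diagnosis: diabetes.", "No complaints."])
```

In batch mode the `update` step is simply never called, which mirrors the prediction phase of batch machine learning described above.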

System interface and user operations

System interface

IDEAL-X provides a GUI with two main panels, a menu, and navigation buttons (Figure 3). The left panel is for browsing an input report, and the right panel is the output table with predicted values of each data element in the report. The menu provides options for defining the data elements to be extracted and specifying input reports, among other functions.


Fig3 Zheng JMIRMedInfo2017 5-2.png

Fig. 3 An example screenshot of IDEAL-X’s interface

Definition of data elements for extraction

The system provides a wizard for constructing the metadata of the output form. The user builds the form by specifying a list of data elements and their constraints. An example is the data element "Heart Rate," which is constrained to be a numerical value between 0 and 200. Other constraints include sections of the report that may contain the values. However, except for the names of the data elements, specifying constraints is optional, as these can be learned by the system.

Data extraction workflow

The user will first select a collection of input reports to be extracted from a local folder. By default, the system runs in an interactive mode, and one report will be loaded at a time on the left display panel. The user can make manual annotations by highlighting the correct value in the report text. Clicking the corresponding data field in the table assigns the value to the data element. If the system has prefilled the field of a data element with a predicted value, the user can provide feedback by fixing incorrect values. As the user navigates to the next document, the system compares the prefilled and the final values for the most recently processed document. Values that are unchanged or filled in by users are taken as positive instances, and values that have been revised are taken as negative instances. Both instances are incorporated into the online learning algorithm to be used by the data extraction for subsequent documents. By iterating through this process, the amount of information that the system is able to correctly prefill grows over time. Note that manual revision in this context is different from traditional human labeling: it is only necessary when there is a wrong prediction, which significantly reduces user effort. Once the decision model reaches an acceptable level of accuracy, the user has the option to switch to batch mode to complete extraction for the remaining documents. If a patient has multiple reports, the text input panel displays each report with a separate tab. Data extracted from all the reports are aggregated in the output.

Customization of controlled vocabularies

IDEAL-X also provides an interface for the user to customize a controlled vocabulary that can be used by the system for data extraction. The controlled vocabulary contains both terminology and structural properties. The terminology includes lists of values and their normalization mappings. For example, disease terminology includes "Diabetes Mellitus" with variations "DM" and "Diabetes." It also defines inductions. For example, taking "Insulin" or "Metformin" indicates having diabetes mellitus. Structural properties provide positive and negative contextual information for given terms. For example, to extract medications taken by patients, the "Allergies" section is a negative context, and medicine names in that section will be skipped. Structural properties may also contain disambiguation terms that may further improve the precision of extraction. A simple example is that "intolerant" is a negative indicator for identifying "statin," as "statin intolerant" refers to a different concept. Controlled vocabularies can be a powerful tool to support data extraction: they can be used to locate sentences and chunks of possible values and to perform normalization for extracted values, as discussed in the next section.
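A controlled vocabulary of this shape can be represented as plain data. The sketch below is a minimal illustration built from the examples in this section; the dictionary layout and the `lookup` helper are hypothetical, not IDEAL-X's actual file format.

```python
# Minimal sketch of a controlled vocabulary: terminology (with synonym
# normalization and inductions) plus structural properties (contexts).
vocabulary = {
    "terminology": {
        # surface form -> normalized term
        "DM": "Diabetes Mellitus",
        "Diabetes": "Diabetes Mellitus",
        "Diabetes Mellitus": "Diabetes Mellitus",
    },
    "inductions": {
        # taking these medications implies the disease
        "Insulin": "Diabetes Mellitus",
        "Metformin": "Diabetes Mellitus",
    },
    "negative_sections": ["Allergies"],            # skip matches found here
    "disambiguation": {"statin": ["intolerant"]},  # negative indicators
}


def lookup(token, section, vocab):
    """Return the normalized term for a token, honoring negative contexts."""
    if section in vocab["negative_sections"]:
        return None  # e.g., drug names under "Allergies" are skipped
    if token in vocab["terminology"]:
        return vocab["terminology"][token]
    if token in vocab["inductions"]:
        return vocab["inductions"][token]
    return None
```

For example, `lookup("Metformin", "Allergies", vocabulary)` returns nothing because the section is a negative context, while the same token in a "Medications" section induces "Diabetes Mellitus".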

The data extraction engine

While the user interacts with the IDEAL-X interface, the data extraction engine works transparently in the background. The engine has three major components: answer prediction, learning, and the learning model that the online learning process continuously updates (Figure 4). The system combines statistical and machine learning–based approaches with controlled vocabularies for effective data extraction.

Document preprocessing

When a report is loaded, the text is first parsed into an in-memory hierarchical tree consisting of four layers: section, paragraph, sentence, and token. Apache OpenNLP[22] is used to support the parsing with its Sentence Detector, Tokenizer, and Part-of-Speech Tagger. A reverse index of tokens is created to support an efficient keyword-based search. The index is used to find locations (e.g., sections, paragraphs, sentences, and phrases) of a token, as well as its properties such as part of speech and data type. For example, given the token "DM," the system can quickly identify the section (e.g., "History") and the containing sentences. Such token search is frequently performed in answer prediction, and the in-memory index structures enable high efficiency for such operations.
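The reverse index described above can be sketched as a mapping from token to its locations in the section/sentence hierarchy. This is a simplified stand-in for the OpenNLP-backed parse; the naive whitespace tokenization and two-level hierarchy are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(sections):
    """Map each token to its (section, sentence_no) locations for fast lookup.

    `sections` is {section_name: [sentence, ...]}; tokenization is naive.
    """
    index = defaultdict(list)
    for section, sentences in sections.items():
        for i, sentence in enumerate(sentences):
            for token in sentence.replace(".", "").split():
                index[token].append((section, i))
    return index


report = {
    "History": ["Patient reports DM and hypertension."],
    "Medications": ["Metformin daily."],
}
index = build_index(report)
# index["DM"] locates the token in the "History" section, sentence 0.
```

Keeping such an index in memory is what makes the frequent token searches during answer prediction cheap: finding the section and sentence of "DM" is a dictionary lookup rather than a scan of the document.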

Answer prediction

Predicting the value of each data element involves the following steps: (1) Identifying target sentences that are likely to contain the answer; (2) Identifying candidate chunks in the sentences; (3) Filtering the chunks to generate candidate values; (4) Ranking candidate values to generate (raw) values; (5) Normalizing values; and (6) Aggregating values from multiple reports. The workflow is shown in Figure 4.


Fig4 Zheng JMIRMedInfo2017 5-2.png

Fig. 4 Overview of System Workflow

Identifying target sentences

Through online learning, the system accrues keywords from past answers (answer keywords) along with co-occurring words in the corresponding sentences (contextual words). For example, given the answer keywords "diabetes" and "hypertension" in the sentence "The patient reports history of diabetes and hypertension," contextual words are "patient," "report," and "history." Such answer keywords and contextual words combined with customized vocabularies can be utilized to identify sentences that are likely to contain answers with the following methods:

The first is a similarity-based search using the vector space model.[23] Given a collection of contextual words and their frequencies, the system computes the similarity against sentences in the document.[23] Sentences with high similarities are selected. For example, most sentences about "disease" contain "diagnosis" and "history." The past contextual keywords and their frequency weights are represented and maintained through a learning model, discussed later in the section on learning.

The second is an answer keyword matching search. The answer keywords, combined with relevant user-customized vocabularies, are also used to identify target sentences through keyword matching. For example, to extract diseases, if a sentence contains the disease term "myocardial infarction" defined in the vocabulary, the sentence is selected as a target. In both approaches, sections to be searched or skipped are also considered to narrow the scope of searching.
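The two search methods above can be sketched with bag-of-words vectors and simple keyword matching. The contextual-word weights, threshold, and keyword set below are hypothetical; this is an illustration of the technique, not IDEAL-X's tuned implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def target_sentences(sentences, contextual, answer_keywords, threshold=0.3):
    """Select sentences by contextual-word similarity or answer keyword match."""
    selected = []
    for s in sentences:
        words = Counter(s.lower().replace(".", "").split())
        if cosine(words, contextual) >= threshold or any(
            k in s.lower() for k in answer_keywords
        ):
            selected.append(s)
    return selected


# Hypothetical contextual-word frequencies learned from past answers.
contextual = Counter({"patient": 3, "history": 5, "diagnosis": 2})
sentences = [
    "The patient reports history of diabetes.",
    "Follow up in two weeks.",
]
hits = target_sentences(sentences, contextual, {"myocardial infarction"})
```

The first sentence is selected because it shares "patient" and "history" with the learned context; the second matches neither the context nor any vocabulary term and is skipped.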

Identifying candidate chunks

After target sentences are selected, the system identifies potential phrases in the sentences using two methods: the Hidden Markov model (HMM)[24] and keyword-based search. The HMM represents target words and contextual words in a sentence with different states, and it marks values to be extracted based on probability distributions learned from previously collected values and their sentences. The keyword-based search finds candidate chunks using keywords collected from past answers and the controlled vocabulary.

Filtering chunks

To filter candidate chunks, the system uses rule induction[14][25] to generate "If-Then" rules based on historical statistics. The following filtering criteria are used: (1) Part of speech (POS): This filters a phrase by its POS tag in the sentence. Simple example phrases are noun and verb phrases. (2) String pattern: This looks for chunks that match special string patterns, for example, where the first characters of all tokens are capitalized. (3) Value domain: This eliminates numerical or enumerated values that fall outside a specified range of values. (4) Negation: Based on predefined built-in rules, this removes phrases governed by words that reverse the meaning of the answer.[26] For example, if a candidate chunk "cancer" is extracted from the sentence "the patient has no history of cancer," "cancer" would not be included. (5) Certainty: Similar to the negation filter, this detects and filters uncertain events or situations such as future plans, based on predefined rules. For example, a candidate chunk "radiation therapy" for treatment from the sentence "the patient is planned to take radiation therapy" should not be included. Whereas negation and certainty filtering are based on predefined rules, the other filters rely on real-time data statistics for their filtering criteria.

Ranking candidate values

The system combines the scores of the selected sentences and chunks to rank candidate values. For a single-valued data element (e.g., heart rate), the candidate value with the highest confidence score is selected. For a multi-valued data element (e.g., medication), values with confidence scores above a threshold are selected. Based on this, each candidate value is either accepted or rejected.
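The negation filter and the single- vs. multi-valued ranking rule described above can be sketched as follows. The cue-word list, confidence scores, and threshold are illustrative assumptions, not the rules or statistics IDEAL-X actually induces.

```python
NEGATION_CUES = {"no", "denies", "without"}  # illustrative, not IDEAL-X's rules


def negated(chunk, sentence):
    """Crude negation check: a cue word appears before the chunk."""
    prefix = sentence.lower().split(chunk.lower())[0].split()
    return any(cue in prefix for cue in NEGATION_CUES)


def rank(candidates, multi_valued, threshold=0.5):
    """candidates: [(value, confidence)].

    Single-valued elements keep only the top-scoring candidate;
    multi-valued elements keep every candidate above the threshold.
    """
    if not candidates:
        return []
    if multi_valued:
        return [v for v, c in candidates if c >= threshold]
    return [max(candidates, key=lambda vc: vc[1])[0]]


# "cancer" is governed by "no" and would be filtered out.
assert negated("cancer", "The patient has no history of cancer")
# Medications are multi-valued: keep all candidates above the threshold.
meds = rank([("aspirin", 0.9), ("statin", 0.6), ("foo", 0.2)], multi_valued=True)
```

A real system would combine sentence scores and chunk scores into the confidence values; here they are given directly to keep the accept/reject logic visible.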

Normalizing values

This step normalizes extracted values through transformation, generalization, and induction rules given by the controlled vocabulary (Figure 4). For example, "DM" is transformed into "Diabetes Mellitus," and "Pindolol" is generalized to its hypernym "beta blocker." The appearance of the medication term "Metformin" (a drug for treating type 2 diabetes) in the text implies the disease "Diabetes Mellitus."
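The three kinds of normalization rules can be sketched as successive dictionary lookups. The mappings below mirror the examples in the text; the function names and rule representation are hypothetical.

```python
TRANSFORM = {"DM": "Diabetes Mellitus"}      # abbreviation -> full term
GENERALIZE = {"Pindolol": "beta blocker"}    # term -> hypernym
INDUCE = {"Metformin": "Diabetes Mellitus"}  # medication -> implied disease


def normalize(value):
    """Apply transformation, then generalization, to an extracted value."""
    value = TRANSFORM.get(value, value)
    value = GENERALIZE.get(value, value)
    return value


def induced(value):
    """Return a disease implied by a medication term, if any."""
    return INDUCE.get(value)


assert normalize("DM") == "Diabetes Mellitus"
```

Induction differs from the other two rules in that it does not rewrite the extracted value; it adds a new value (the implied disease) alongside it.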

Aggregating results

Data extracted from multiple reports of a patient will be aggregated into a single table. The aggregation process may normalize values and remove duplicates. For example, "lisinopril" and "captopril" are extracted from the discharge summary and inpatient report, respectively, and both can be normalized as "ACE inhibitor." If the same data element is extracted from multiple reports, deduplication is performed. The final output is in a simple structured table form that can be exported conveniently to other applications such as Excel (Microsoft) or a database.
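Aggregation across a patient's reports then reduces to normalize-and-deduplicate. A minimal sketch, reusing the ACE inhibitor example from the text (the mapping table and function are hypothetical):

```python
GENERALIZE = {"lisinopril": "ACE inhibitor", "captopril": "ACE inhibitor"}


def aggregate(per_report_values):
    """Merge values extracted from multiple reports of one patient,
    normalizing each value and removing duplicates in first-seen order."""
    seen, merged = set(), []
    for values in per_report_values:
        for v in values:
            v = GENERALIZE.get(v, v)  # normalize before comparing
            if v not in seen:
                seen.add(v)
                merged.append(v)
    return merged


# Discharge summary yields lisinopril; inpatient report yields captopril.
row = aggregate([["lisinopril", "aspirin"], ["captopril"]])
```

Both drug names normalize to "ACE inhibitor", so the aggregated row contains it only once, alongside "aspirin".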

Note that controlled vocabularies can play important roles in the answer prediction process. They are used for identifying target sentences through keyword searching, identifying candidate chunks through keyword matching, and supporting normalization for extracted values.

Learning

IDEAL-X takes an online learning–based approach to incrementally build statistical models and make predictions (Figure 5). The three models used in IDEAL-X are all statistically based and can be continuously updated after each iteration.


Fig5 Zheng JMIRMedInfo2017 5-2.png

Fig. 5 Precision and recall changes over processed records

System-predicted values automatically populate the output table, and the user advances to the next report with or without revision to these values. In both cases, the internal learning and prediction models of IDEAL-X are updated. For each instance, IDEAL-X collects and analyzes the following features: (1) Position: location of the answer in the text hierarchy; (2) Landmark: co-occurring contextual keywords in a sentence; (3) POS: parts of speech tag; (4) Value: the tokens of the answer; and (5) String patterns: literal features such as capitalization and initial and special punctuation. These features are then used to update the three models.

In IDEAL-X, each data element, such as the attribute "disease" or "medicine," has its own statistical model, and each new instance of a data element will update the corresponding model. There are three models to be updated: (1) Updating the vector space model: This model uses the "Landmark" features of positive instances. The system updates the frequencies of co-occurring contextual words, which are used as the weights of the vector space model.[23] (2) Updating the HMM: The HMM lists all words in a sentence as a sequence, in which an extracted value is marked as the target value state and other words are recognized as irrelevant contextual states. Based on this sequence, the state transition probabilities and emission probabilities are recalculated.[24] (3) Updating the rule induction model: Filtering rules are induced based on coverage percentages.[25] Features such as POS, value domain, and string patterns of both positive and negative instances are analyzed, and their respective coverage percentages are modified. Once the coverage of a rule reaches a predefined threshold, the rule is triggered for filtering.
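The incremental HMM update can be sketched as count-based re-estimation of transition and emission probabilities. The two-state model below ('T' for target value, 'C' for irrelevant context) is a deliberate simplification of the actual model, and the class interface is hypothetical.

```python
from collections import defaultdict


class IncrementalHMM:
    """Tiny two-state HMM updated one labeled sentence at a time."""

    def __init__(self):
        self.trans = defaultdict(lambda: defaultdict(int))  # state -> state -> n
        self.emit = defaultdict(lambda: defaultdict(int))   # state -> word -> n

    def update(self, words, target):
        """words: tokenized sentence; target: set of answer tokens.

        Marks answer tokens as target states ('T') and everything else as
        contextual states ('C'), then adds to the running counts.
        """
        states = ["T" if w in target else "C" for w in words]
        for w, s in zip(words, states):
            self.emit[s][w] += 1
        for a, b in zip(states, states[1:]):
            self.trans[a][b] += 1

    def emission_prob(self, state, word):
        """Re-estimate P(word | state) from the accumulated counts."""
        total = sum(self.emit[state].values())
        return self.emit[state][word] / total if total else 0.0


hmm = IncrementalHMM()
# One user-confirmed instance: "diabetes" is the extracted value.
hmm.update("history of diabetes".split(), target={"diabetes"})
```

Because the model is just counts, each user correction updates it in constant time, which is what makes per-document online learning cheap.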

In interactive mode, these steps repeat for each report, and the learning models are continuously updated and improved.

Results

Experimental setup

Datasets

We used three datasets from 100 patients that were randomly sampled from a collection of about 5000 patients in the Emory Biobank database. Dataset 1 is a set of semi-structured reports and contains 100 cardiac catheterization procedure reports. Dataset 2 is a set of template-based narration and contains 100 coronary angiographic reports. Dataset 3 is a set of complex narration and contains 315 reports, including history and physical report, discharge summary, outpatient clinic notes, outpatient clinic letter, and inpatient discharge medication report.

Ground truth

The test datasets are independently hand-annotated by domain expert annotators, including physicians, physician trainees, and students trained by the Emory Clinical Cardiovascular Research Institute for Biobank data reporting. Each record is annotated by two different annotators. The interrater agreement scores (kappa) of these three datasets are .991, .986, and .835, respectively. An arbitrator — an independent cardiovascular disease researcher — reconciles incompatible outputs of the system and the manual annotations to produce the final ground truth.
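Interrater agreement of the kind reported here (Cohen's kappa) can be computed as in the sketch below; the five-record annotation vectors are hypothetical, not the study's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected if each annotator labeled at random
    # according to their own category frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for 5 records (1 = value present, 0 = absent).
kappa = cohens_kappa([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])
```

Kappa discounts agreement expected by chance, which is why scores like .835 on complex narration are lower than the near-perfect .991 on semi-structured reports even when raw agreement is high.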

Evaluation metrics

For validation, precision, recall, and F1 scores are used to estimate the effectiveness of extraction by comparing the system's predicted results (before human revision) against the ground truth.
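Treating the predicted and ground-truth values for a data element as sets, these metrics can be computed as follows (a straightforward sketch of the standard definitions):

```python
def precision_recall_f1(predicted, truth):
    """Precision, recall, and F1 over sets of extracted values.
    Precision = TP / predicted; recall = TP / truth; F1 = harmonic mean."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # true positives: correctly extracted values
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1
```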

Experiment settings

We aimed to evaluate the effectiveness of the system with respect to online learning and controlled vocabularies and to understand their applicability to different report forms. By analyzing the report styles and vocabularies, we discovered that online learning is more suitable for semi-structured or template-based narration reports, while controlled vocabulary-guided data extraction is more effective on complex narration with a finite vocabulary. Thus, we designed three experiments: (1) online learning-based data extraction, with no controlled vocabularies provided, on Dataset 1 (semi-structured) and Dataset 2 (template-based narration); (2) controlled vocabulary-guided data extraction, without online learning, on Dataset 3 (complex narration); and (3) controlled vocabulary-guided data extraction combined with online learning, on Dataset 3.

Performance evaluation

Experiment 1: Online machine learning-based data extraction: This experiment was based on Datasets 1 and 2. The system started in interactive mode with an empty decision model and no prior training. The defined data elements are summarized in Multimedia Appendix 1. The user processed one report at a time, and each system-predicted value (including empty values for the first few reports) was recorded before user revision for calculating precision and recall.

Results for the two datasets are summarized in Table 1. Both test cases achieved high precision, as semi-structured and template-based text is the easiest to handle. To study the learning rate of online learning, we divided the records into 10 groups and plotted precision and recall for every 10 percent of the records in Datasets 1 and 2. We observed that in both tests, the system maintained high precision throughout the learning process. Although some variability exists due to new data patterns, the recall in both cases also improved steadily. Not surprisingly, the rate of learning for Dataset 1 is much faster given its semi-structured form.

{| class="wikitable" style="width:800px;"
|+ '''Table 1.''' Results of data extraction from semi-structured reports (Dataset 1) and template-based narration (Dataset 2)
|-
! Dataset !! Number of data elements !! Number of values !! Precision (%) !! Recall (%) !! F1 (%)
|-
| 1 || 19 || 1272 || 99.8 || 96.5 || 98.1
|-
| 2 || 16 || 728 || 97.2 || 93.2 || 95.2
|}
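The learning-rate analysis, splitting the sequentially processed records into 10 groups and scoring each, can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for whatever metric is computed per group (e.g., precision and recall of the pre-revision predictions).

```python
def decile_learning_curve(records, evaluate, groups=10):
    """Split an ordered record list into `groups` roughly equal slices
    and apply `evaluate` to each, yielding a learning curve."""
    n = len(records)
    # Boundary indices that partition the records into `groups` slices.
    bounds = [round(i * n / groups) for i in range(groups + 1)]
    return [evaluate(records[bounds[i]:bounds[i + 1]]) for i in range(groups)]
```

Plotting the returned values against group index shows how quickly recall climbs as the models accumulate confirmed instances.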

Experiment 2: Controlled vocabulary-guided data extraction: In this experiment, online learning was disabled, and data extraction was performed in batch mode using controlled vocabularies. Diseases and medications were extracted from Dataset 3 (the values to be extracted are shown in Multimedia Appendix 1). Customized controlled vocabularies, including terminology and structural properties, were created independently by physicians by referencing domain knowledge resources and by analyzing a separate development dataset of reports from 100 patients, disjoint from Dataset 3. Note that comparisons in this and the following experiments were made at the clinical finding level, between system-integrated results and manually annotated, integrated results.

The results in Table 2 show that controlled vocabularies are highly effective for data extraction over complex narratives. Domain-specific data such as cardiology-related diseases and medications have limited numbers of possible values (or domain values), and a carefully customized controlled vocabulary can achieve high extraction accuracy.
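A simple form of terminology-guided extraction, matching a controlled vocabulary's terms and synonyms against report text, might look like the sketch below. This is an illustration only; IDEAL-X's matching additionally uses structural properties such as section-data associations, which this sketch omits.

```python
import re

def vocabulary_extract(text, vocabulary):
    """Return the set of canonical concepts whose term or any synonym
    appears in `text` (case-insensitive, whole-word matching).
    `vocabulary` maps canonical concept -> list of synonyms."""
    found = set()
    for concept, synonyms in vocabulary.items():
        for term in [concept] + synonyms:
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                found.add(concept)
                break  # one match suffices to report the concept
    return found
```

Mapping synonyms back to a canonical concept is what allows, say, "heart attack" in a note to be counted as the same clinical finding as "myocardial infarction."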

{| class="wikitable" style="width:800px;"
|+ '''Table 2.''' Results of controlled vocabulary-guided data extraction from complex narration (Dataset 3)
|-
! Type of data elements !! Number of data elements !! Number of ground truth values !! Precision (%) !! Recall (%) !! F1 (%)
|-
| Diseases || 15 || 418 || 94.5 || 99.0 || 96.7
|-
| Medications || 10 || 437 || 98.6 || 99.7 || 99.2
|-
| All || 25 || 855 || 96.5 || 99.4 || 97.9
|}

Experiment 3: Controlled vocabulary-guided data extraction combined with online machine learning: In this experiment, we performed two tests to examine how efficiently and effectively the system learns when only terminology is available and structural properties must be obtained through online learning. Test 1 generated the baseline for comparison, and Test 2 demonstrated the effectiveness of combining online machine learning and controlled vocabularies. Dataset 3 was used, and all diseases and medications were extracted.

For Test 1, terminology was used and online machine learning was disabled, so extraction was guided by the controlled vocabulary without any structural properties. We note that comprehensive terminology contributes directly to a high recall rate, meaning that the system seldom misses values to be extracted. However, without structural properties, precision is much lower than in Experiment 2. This highlights the value of positive and negative contexts in an extraction task.

For Test 2, both terminology and online machine learning were used, with online machine learning supporting the learning of structural properties. To show how quickly the system learns, only the 38 reports associated with the first 10 patients were processed with interactive online learning; all remaining reports were processed in batch mode. The results in Table 3 show an overall precision of 94.9%, demonstrating that online learning can quickly learn structural properties.

{| class="wikitable" style="width:800px;"
|+ '''Table 3.''' Results of controlled vocabulary-guided data extraction combined with online learning
|-
! Test !! Controlled vocabulary !! Online learning !! Precision (%) !! Recall (%) !! F1 (%)
|-
| 1 || Terminology only || N/A || 80.9 || 99.4 || 89.2
|-
| 2 || Terminology only || Applied to first 10 patients || 94.9 || 99.4 || 97.1
|}

Typical errors in these two tests were associated with the terminology and contextual information used in complex narrative scenarios. On the one hand, the completeness of the terminology list, including terms and their synonyms, directly influences the recall rate. On the other hand, although terminology coverage can be maximized by a carefully engineered vocabulary, unwanted extractions arising from searches in the wrong section, undetected negations, and ambiguous use of terms can still lower overall precision.
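As a rough illustration of why undetected negations hurt precision, a minimal negation check might look like the following. The cue list and window size are illustrative only; NegEx-style approaches[26] use far richer trigger phrases and scope termination rules.

```python
import re

# Illustrative single-word negation cues; real systems use curated trigger sets.
NEGATION_CUES = {"no", "denies", "without", "negative"}

def is_negated(sentence, term):
    """Very rough check: does a negation cue appear within the few words
    immediately preceding `term`? A sketch only."""
    match = re.search(re.escape(term), sentence, re.IGNORECASE)
    if not match:
        return False  # term absent, nothing to negate
    preceding = sentence[:match.start()].lower().split()
    # Inspect only a short window before the term.
    return any(tok.strip(".,;:") in NEGATION_CUES for tok in preceding[-4:])
```

A vocabulary match on "chest pain" in "Patient denies chest pain" would be a false positive unless a check like this suppresses it.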

Discussion

Principal findings

IDEAL-X provides a generic data extraction framework that takes advantage of both online learning and controlled vocabularies. The two approaches complement each other and can also be combined. An online learning-based approach is highly effective for reports with underlying structural patterns, such as semi-structured or template-based narration styles. Experiments with complex narrative reports indicate that the use of controlled vocabularies is highly effective for supporting extraction constrained by a finite data domain. In addition, structural properties such as section-data associations can play an important role in improving the accuracy of extraction. However, in cases where controlled vocabularies are unavailable (extracting generic named entities, for example), maintaining high accuracy is a challenge. This is an ongoing area of exploration that we will report on in the future.

Machine learning is among the major techniques for identifying candidate chunks. Besides the HMM, we have also explored other classifiers, such as the naive Bayes classifier and neural network-based classifiers. An ongoing project includes a systematic study of different classifiers and their combinations (including conditional random fields and support vector machines[27]) for online machine learning-based data extraction.

To make the use of standard medical terminologies for customizing controlled vocabularies more flexible, an ongoing project is developing a tool that can easily search and import concepts from standard vocabularies such as ICD-9, ICD-10, and SNOMED, either from a local file or through the NCBO BioPortal.

Conclusions

Although natural language processing tools are available for extracting information from clinical reports, the majority lack the capability to support interactive feedback from human users. An interactive, online approach allows the user to coach the system using knowledge specific to the given set of reports, which may include local reporting conventions and structures. Moreover, no advanced linguistics knowledge or programming skills are required of the users; the system maintains the ordinary workflow of manual annotation systems. We performed a systematic study of the effectiveness of the online learning-based method combined with controlled vocabularies for data extraction from reports with various structural patterns, and we conclude that our method is highly effective. The framework is generic, and its applicability is demonstrated with diverse report types. The software will be made freely available online.[28]

Acknowledgments

The study was funded by the Centers for Disease Control and Prevention, contract #200-2014-M-59415.

Conflicts of interest

None declared.

Multimedia Appendix 1

Test cases (PDF file; 32KB)

References

  1. Zheng, S. (7 September 2014). "IDEAL-X: Information and Data Extraction using Adaptive Learning". YouTube. Google, Inc. https://www.youtube.com/watch?v=Q-DrWi31nv0. Retrieved 20 April 2017. 
  2. 2.0 2.1 Shalev-Shwartz, S. (July 2007). "Online Learning: Theory, Algorithms, and Applications" (PDF). University of Chicago. http://ttic.uchicago.edu/~shai/papers/ShalevThesis07.pdf. Retrieved 03 May 2017. 
  3. 3.0 3.1 Shalev-Shwartz, S. (2012). "Online Learning and Online Convex Optimization". Foundations and Trends in Machine Learning 4 (2): 107–194. doi:10.1561/2200000018. 
  4. Smale, S.; Yao, Y. (2006). "Online Learning Algorithms". Foundations of Computational Mathematics 6 (2): 145–170. doi:10.1007/s10208-004-0160-z. 
  5. Crowley, R.S.; Castine, M.; Mitchell, K. et al. (2010). "caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research". JAMIA 17 (3): 253-64. doi:10.1136/jamia.2009.002295. PMC PMC2995710. PMID 20442142. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995710. 
  6. Xu, H.; Stenner, S.P.; Doan, S. et al. (2010). "MedEx: A medication information extraction system for clinical narratives". JAMIA 17 (1): 19–24. doi:10.1197/jamia.M3378. PMC PMC2995636. PMID 20064797. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995636. 
  7. Friedman, C.; Shagina, L.; Lussier, Y.; Hripcsak, G. (2004). "Automated encoding of clinical documents based on natural language processing". JAMIA 11 (5): 392–402. doi:10.1197/jamia.M1552. PMC PMC516246. PMID 15187068. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC516246. 
  8. Savova, G.K.; Masanz, J.J.; Ogren, P.V. et al. (2010). "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications". JAMIA 17 (5): 507–13. doi:10.1136/jamia.2009.001560. PMC PMC2995668. PMID 20819853. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668. 
  9. Aronson, A.R. (2001). "Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program". Proceedings AMIA Symposium 2001: 17–21. PMC PMC2243666. PMID 11825149. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2243666. 
  10. Zeng, Q.T.; Goryachev, S.; Weiss, S. et al. (2006). "Extracting principal diagnosis, co-morbidity and smoking status for asthma research: Evaluation of a natural language processing system". BMC Medical Informatics and Decision Making 6: 30. doi:10.1186/1472-6947-6-30. PMC PMC1553439. PMID 16872495. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1553439. 
  11. Hoi, S.C.H.; Wang, J.; Zhao, P. (2014). "LIBOL: A Library for Online Learning Algorithms". Journal of Machine Learning Research 15 (Feb): 495–499. http://www.jmlr.org/papers/v15/hoi14a.html. 
  12. Cardie, C.; Pierce, D. (6 July 1998). "Proposal for an Interactive Environment for Information Extraction". Cornell University Computer Science Technical Report TR98-1702. Cornell University. 
  13. Ciravegna, F.; Wilks, Y. (2003). "Designing Adaptive Information Extraction for the Semantic Web in Amilcare". In Handschuh, S.; Staab, S.. Annotation for the Semantic Web. 96. IOS Press. pp. 112–127. ISBN 9781586033453. 
  14. 14.0 14.1 Ciravegna, F. (2001). "Adaptive information extraction from text by rule induction and generalisation". Proceedings of the 17th International Joint Conference on Artificial Intelligence 2: 1251–1256. ISBN 9781558608122. 
  15. Gobbel, G.T.; Garvin, J.; Reeves, R. et al. (2014). "Assisted annotation of medical free text using RapTAT". JAMIA 21 (5): 833-41. doi:10.1136/amiajnl-2013-002255. PMC PMC4147611. PMID 24431336. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4147611. 
  16. Settles, B. (January 2009). "Active Learning Literature Survey" (PDF). University of Wisconsin. https://research.cs.wisc.edu/techreports/2009/TR1648.pdf. Retrieved 03 May 2017. 
  17. Olsson, F. (17 April 2009). "A literature survey of active machine learning in the context of natural language processing" (PDF). Swedish Institute of Computer Science. http://soda.swedish-ict.se/3600/1/SICS-T--2009-06--SE.pdf. Retrieved 03 May 2017. 
  18. Settles, B. (2011). "Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances". Proceedings of the Conference on Empirical Methods in Natural Language Processing 2011: 1467–1478. ISBN 9781937284114. 
  19. Chen, Y.; Cao, H.; Mei, Q. et al. (2013). "Applying active learning to supervised word sense disambiguation in MEDLINE". JAMIA 2013: 1001–1006. doi:10.1136/amiajnl-2012-001244. PMC PMC3756255. PMID 23364851. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756255. 
  20. Chen, Y.; Carroll, R.J.; Hinz, E.R. et al. (2013). "Applying active learning to high-throughput phenotyping algorithms for electronic health records data". JAMIA 20 (e2): e253-9. doi:10.1136/amiajnl-2013-001945. PMC PMC3861916. PMID 23851443. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3861916. 
  21. Eapen, D.J.; Manocha, P.; Patel, R.S. et al. (2013). "Aggregate risk score based on markers of inflammation, cell stress, and coagulation is an independent predictor of adverse cardiovascular outcomes". Journal of the American College of Cardiology 62 (4): 329-37. doi:10.1016/j.jacc.2013.03.072. PMC PMC4066955. PMID 23665099. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066955. 
  22. "OpenNLP". Apache Software Foundation. http://opennlp.apache.org/. Retrieved 20 April 2017. 
  23. 23.0 23.1 23.2 Manning, C.D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. pp. 506. ISBN 9780521865715. 
  24. 24.0 24.1 Elliott, R.J.; Aggoun, L.; Moore, J.B. (2008). Hidden Markov Models: Estimation and Control. Springer. pp. 382. ISBN 9780387943640. 
  25. 25.0 25.1 Fürnkranz, J. (1999). "Separate-and-Conquer Rule Learning". Artificial Intelligence Review 13 (1): 3–54. doi:10.1023/A:1006524209794. 
  26. Huang, Y.; Lowe, H.J. (2007). "A novel hybrid approach to automated negation detection in clinical radiology reports". JAMIA 14 (3): 304–11. doi:10.1197/jamia.M2284. PMC PMC2244882. PMID 17329723. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2244882. 
  27. Jiang, M.; Chen, Y.; Liu, M. et al. (2011). "A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries". JAMIA 18 (5): 601-6. doi:10.1136/amiajnl-2011-000163. PMC PMC3168315. PMID 21508414. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168315. 
  28. "IDEAL-X". Emory University. http://www.idealx.net/home.html. Retrieved 04 May 2017. 

Abbreviations

EMR: electronic medical record

HMM: Hidden Markov Model

NLP: natural language processing

POS: part of speech

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In several cases the PubMed ID was missing and was added to make the reference more useful. Grammar and vocabulary were cleaned up to make the article easier to read. In the "Answer prediction" subsection, the original referred to Figure 5, when it most likely should have referred to Figure 4; this was changed for this version.

Per the distribution agreement, the following copyright information is also being added:

©Shuai Zheng, James J Lu, Nima Ghasemzadeh, Salim S Hayek, Arshed A Quyyumi, Fusheng Wang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 09.05.2017.