Difference between revisions of "Journal:Terminology spectrum analysis of natural-language chemical documents: Term-like phrases retrieval routine"

From LIMSWiki
|website      = [http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0136-4 http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0136-4]
|download    = [http://jcheminf.springeropen.com/track/pdf/10.1186/s13321-016-0136-4?site=jcheminf.springeropen.com http://jcheminf.springeropen.com/track/pdf/10.1186/s13321-016-0136-4] (PDF)
}}
{{ombox
| type      = content
| style    =
| text      = This article contains rendered mathematical formulae. You ''may'' require the [https://chrome.google.com/webstore/detail/math-anywhere/gebhifiddmaaeecbaiemfpejghjdjmhc Math Anywhere] plugin for Chrome or the [https://addons.mozilla.org/en-US/firefox/addon/native-mathml/ Native MathML] add-on and [https://developer.mozilla.org/en-US/docs/Mozilla/MathML_Project/Fonts fonts] for Firefox if they don't render properly for you.
}}
}}
{{ombox
{{ombox
|}
|}


===N-grams spectrum retrieval procedure===
As defined earlier in this study, the term "n-gram of length ''n''" denotes a sequence of ''n'' consecutive tokens within the same sentence, with useless tokens omitted (at the moment, only definite and indefinite articles). The n-gram set is obtained by moving a window of ''n'' tokens through an entire sentence, one token at a time. This process is repeated for all sentences in the set of all texts: <math id="M1">T = \left\{ {T_{1},T_{2},\ldots,T_{m}} \right\}</math>.
For a set of texts, each n-gram may be characterized by its textual frequency of occurrence <math id="M2">f_{T}\left( T_{i} \right)</math>, the number of n-gram occurrences within a text <math id="M3">T_{i}</math>, and by its absolute frequency of occurrence <math id="M4">f_{A} = \sum\limits_{i}f_{T}\left( T_{i} \right)</math>, the total number of n-gram occurrences across all texts. As a result, each n-gram may be described by a vector <math id="M5">\mathbf{F}\left( T \right) = \left\{ {f_{T}\left( T_{1} \right),f_{T}\left( T_{2} \right),\ldots,f_{T}\left( T_{m} \right)} \right\}</math> over the set of texts, which enables additional procedures for n-gram filtering and text information analysis.
The full n-gram data set is redundant, which creates difficulties for analysis. Different filtration procedures can be applied for specific purposes. For instance, threshold filtering based on the values of <math id="M6">\text{max}f_{A} = \text{max}\sum_{i}f_{T}\left( T_{i} \right)</math> and <math id="M7">\text{max}f_{T}\left( T_{i} \right)</math> may be used.
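
The window-based retrieval and the frequency vector described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (their system runs on Java); the function names, the article list and the threshold parameters <code>min_abs</code> and <code>min_text</code> are assumptions introduced only for illustration.

```python
from collections import Counter

# Assumption: articles are the only tokens omitted, as stated in the text.
ARTICLES = {"a", "an", "the"}


def sentence_ngrams(tokens, n):
    """Slide a window of n tokens over one sentence, one token at a time."""
    kept = [t for t in tokens if t.lower() not in ARTICLES]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]


def frequency_vectors(texts, n):
    """Map each n-gram to its vector F(T) = [f_T(T_1), ..., f_T(T_m)].

    `texts` is a list of texts; each text is a list of tokenized sentences.
    """
    per_text = [
        Counter(g for sent in text for g in sentence_ngrams(sent, n))
        for text in texts
    ]
    all_ngrams = set().union(*per_text) if per_text else set()
    return {g: [counts[g] for counts in per_text] for g in all_ngrams}


def threshold_filter(vectors, min_abs=2, min_text=1):
    """Keep n-grams whose f_A (sum) and largest f_T reach the thresholds.

    The threshold values are hypothetical; the text does not specify them.
    """
    return {
        g: v for g, v in vectors.items()
        if sum(v) >= min_abs and max(v) >= min_text
    }
```

Here the absolute frequency of an n-gram is simply the sum of its vector, so both filtering quantities can be derived from <math>\mathbf{F}\left( T \right)</math> alone.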
===Module of terminology spectrum building===
The final stage of the analysis is to distinguish, among the many n-grams, the term-like phrases, general chemistry scientific terms, names of chemical entities and useless n-grams. Calculating the textual and absolute frequencies of term occurrence completes the terminology spectrum building.
To select term-like n-grams, sets of accept and reject rules are applied. They are all based on the token tags assigned at previous steps and on the developed dictionaries (Table 1). The purpose of each set of rules is to determine, by analyzing its structure, whether an n-gram of a given length is a term-like phrase. All rules are applied consecutively. As soon as an n-gram conforms to a rule in the sequence, the procedure stops: an accept rule declares the n-gram a term-like phrase, possibly one with a special meaning (e.g., a general chemistry scientific term or a chemical entity), while a reject rule declares it a non-term-like phrase. If no rule is applicable, the n-gram is also considered a term-like phrase. A few general rules can be used for n-grams of any length. There are also tailored sets of rules for 1-grams (Table 5), 2-grams (Table 6) and long n-grams with n > 2 (Table 7).
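
The consecutive application of rules can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: the token representation (a dictionary with <code>text</code>, <code>pos</code> and an optional <code>rubbish</code> flag), the two stand-in dictionaries and the reduced rule list are hypothetical, and only the rule names are taken from Table 5.

```python
from typing import Callable, List, Optional, Tuple

# A rule is (name, kind, predicate); kind is "accept" or "reject".
Rule = Tuple[str, str, Callable[[List[dict]], bool]]


def classify(ngram: List[dict], rules: List[Rule]) -> Tuple[bool, Optional[str]]:
    """Apply rules in order; the first rule whose predicate is true decides."""
    for name, kind, predicate in rules:
        if predicate(ngram):
            return kind == "accept", name
    return True, None  # no rule applied: term-like by default


# Hypothetical stand-ins for the dictionaries of Table 1.
GENERAL_CHEM_TERMS = {"catalysis"}
GENERAL_ENGLISH = {"paint", "pool"}

# A reduced 1-gram rule sequence loosely modelled on Table 5.
UNIGRAM_RULES: List[Rule] = [
    ("GeneralChemTermRule", "accept",
     lambda g: g[0]["text"].lower() in GENERAL_CHEM_TERMS),
    ("StrictFilteringTagRule", "reject",
     lambda g: g[0].get("rubbish", False)),
    ("ShortTokensRule", "reject",
     lambda g: len(g[0]["text"]) < 3),
    ("GeneralEnglishDictRule", "reject",
     lambda g: g[0]["text"].lower() in GENERAL_ENGLISH),
    ("UnigramPOSRule", "reject",
     lambda g: g[0]["pos"] not in {"VBG", "NN", "NNPS", "NNS"}),
]
```

Calling <code>classify([{"text": "leaching", "pos": "VBG"}], UNIGRAM_RULES)</code> falls through every rule and returns the default term-like verdict, whereas a token found in the stand-in English dictionary is stopped by <code>GeneralEnglishDictRule</code>.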
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="60%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''Table 5.''' Accept and reject rules succession for unigrams (1-grams)
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Description
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Examples
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''GeneralChemTermRule (accept rule)'''''<br />&nbsp;<br />True if a 1-gram is a general chemistry scientific term
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''StrictFilteringTagRule (reject rule)'''''<br />&nbsp;<br />True if a 1-gram consists of a token with the strict filtering tag <code>rubbish:true</code>
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''ShortTokensRule (reject rule)'''''<br />&nbsp;<br />True if a 1-gram consists of a short token of length less than three characters; this rule is to exclude noise existing in documents such as axes labels and so on.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''UnitsRule (reject rule)'''''<br />&nbsp;<br />True if a 1-gram contains a string being a measurement unit from the dictionary (Table 1)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''ChemUnigramRule (accept rule)'''''<br />&nbsp;<br />True if a 1-gram is tagged by any OSCAR tag and by one of the following POS tags: <code>FW</code>, <code>NNP</code>, or tagged by tag <code>COMP</code>; selected unigrams are assumed and marked to have a chemical sense
  | style="background-color:white; padding-left:10px; padding-right:10px;"|''Term-like'': barium, phenanthrene, pentanol, xanes
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''GeneralEnglishDictRule (reject rule)'''''<br />&nbsp;<br />True if a 1-gram is in the General English Dictionary (Table 1)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|''Filtered'': topography, paint, plateau, pool, searching, file, addenda, improvement, theme …<br />&nbsp;<br />''Term-like'': hydrocalcite, acetylacetone, cracking, ageing
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''UnigramPOSRule (reject rule)'''''<br />&nbsp;<br />True if a 1-gram is not a noun or a gerund; term-like 1-gram must be tagged with the following POS tags: <code>VBG</code>, <code>NN</code>, <code>NNPS</code>, <code>NNS</code>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|''Filtered'': schematized, suddenly, skeletal, behind<br />&nbsp;<br />''Term-like'': ethylene, hydrocalcite, leaching, 12n-decylhexadecanamide, sulfamethoxazole, anchoring
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''UnigramAddRules (reject rules)'''''<br />&nbsp;<br />Set of regular expressions to filter unigrams denoting various ions, signs, captions, etc.
  | style="background-color:white; padding-left:10px; padding-right:10px;"|''Filtered'': M(O2), GA15.6, PW91, V2.1, G(D), TI(V), PD(I), PT0, P(X), BA2+, CE(3+), cm3, CH3, AA, Cu2+, Mo6+, Et-CP, GC–MS, Zn-Al
|-
|}
|}
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="60%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''Table 6.''' Reject and accept rules consecution for bigrams (2-grams)
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Description
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Examples
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''GeneralChemTermRule (accept rule)'''''<br />&nbsp;<br />Same rule as for 1-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''StrictFilteringTagRule (reject rule)'''''<br />&nbsp;<br />Same rule as for 1-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''ShortTokensRule (reject rule)'''''<br />&nbsp;<br />True if a 2-gram consists only of short tokens of length less than three characters
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''IdenticalTokensRule (reject rule)'''''<br />&nbsp;<br />True if a 2-gram contains at least two identical tokens
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''UnitsRule (reject rule)'''''<br />&nbsp;<br />True if any token in a 2-gram ends with measurement unit string from the dictionary (Table 1); it should be noted that measurement unit may consist of several tokens, for example, the "g/h" consists of three tokens ["g", "/", "h"]
  | style="background-color:white; padding-left:10px; padding-right:10px;"|PPM C7H14, 70ML MIN-1, CM3MIN-1 H2, MIN-1 FLOW, H-1 GAS, PPM N2O/AR, ML G-1MIN-1, MOL-1 HYDROLYSIS, PPM NOX/5%O2/N2
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''BiGramPOSRule (accept rule with exception)'''''<br />&nbsp;<br />True if the first token is tagged with one of the following POS tags: <code>JJ</code>, <code>JJR</code>, <code>FW</code>, <code>VBG</code>, <code>VBD</code>, <code>VBN</code>, <code>NN</code>, <code>NNP</code>, <code>NNPS</code>, <code>NNS</code>; and the second token is tagged with one of: <code>FW</code>, <code>VBG</code>, <code>NN</code>, <code>NNP</code>, <code>NNPS</code>, <code>NNS</code><br />&nbsp;<br />Exception: the combinations <code>VBG, VBG</code>, <code>VBG, FW</code>, and <code>NNP, FW</code> are not allowed
  | style="background-color:white; padding-left:10px; padding-right:10px;"|''Term-like'': Andronov bifurcation, Na2CO3 impregnation, nickel catalyst; supported MgO, anchored lysine, stirred glass; carbonaceous particle, temperature-programmed adsorption, Fischer–Tropsch catalyst; in situ EXAF, UV–VIS spectroscopy, Raman spectroscopy<br />&nbsp;<br />''Filtered due to exception'': involving reforming, reforming minimizing, using in, Shimada etc.
|-
|}
|}
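
As a minimal sketch, the <code>BiGramPOSRule</code> of Table 6 can be expressed as two tag sets plus a set of forbidden pairs. The tag lists are copied from the table; the function name and the representation of a 2-gram as a pair of POS tags are assumptions made for illustration.

```python
# Allowed POS tags for each position, as listed in Table 6.
FIRST_TAGS = {"JJ", "JJR", "FW", "VBG", "VBD", "VBN", "NN", "NNP", "NNPS", "NNS"}
SECOND_TAGS = {"FW", "VBG", "NN", "NNP", "NNPS", "NNS"}

# Tag pairs excluded by the exception clause of the rule.
FORBIDDEN_PAIRS = {("VBG", "VBG"), ("VBG", "FW"), ("NNP", "FW")}


def bigram_pos_rule(first_tag: str, second_tag: str) -> bool:
    """Return True if the tag pair passes the rule as listed in Table 6."""
    if (first_tag, second_tag) in FORBIDDEN_PAIRS:
        return False
    return first_tag in FIRST_TAGS and second_tag in SECOND_TAGS
```

Checking the exception first keeps the two concerns separate: the tag sets describe the general shape of a term-like 2-gram, and the forbidden pairs remove combinations that match that shape but tend to be fragments of longer phrases.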
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="60%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''Table 7.''' Reject and accept rules consecution for n-grams (n ≥ 3)
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Description
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Examples
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''GeneralChemTermRule (accept rule)'''''<br />&nbsp;<br />Same rule as for 2-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''StrictFilteringTagRule (reject rule)'''''<br />&nbsp;<br />Same rule as for 2-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''ShortTokensRule (reject rule)'''''<br />&nbsp;<br />Same rule as for 2-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''IdenticalTokensRule (reject rule)'''''<br />&nbsp;<br />Same rule as for 2-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''''UnitsRule (reject rule)'''''<br />&nbsp;<br />Same rule as for 2-grams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''''ManyGramPOSRule (accept rule with exception)'''''<br />&nbsp;<br />True if the '''first''' token is tagged with one of the following POS tags (noun, gerund, adjective, adverb or participle): <code>NN</code>, <code>NNP</code>, <code>VBG</code>, <code>VBD</code>, <code>VBN</code>, <code>JJ</code>, <code>JJR</code>, <code>RB</code>, <code>RBS</code>, <code>FW</code>; each '''middle''' token, in any position, is tagged with one of the same tags or is a preposition or determiner: <code>NN</code>, <code>NNP</code>, <code>VBG</code>, <code>VBD</code>, <code>VBN</code>, <code>JJ</code>, <code>JJR</code>, <code>RB</code>, <code>RBS</code>, <code>FW</code> + '''<code>IN</code>''', '''<code>DT</code>'''; and the '''last''' token is a gerund or noun: <code>VBG</code>, <code>NN</code>, <code>NNP</code>, <code>NNPS</code>, <code>NNS</code><br />&nbsp;<br />Exception: the following combinations, which describe phrases that appear torn from their context, are not allowed: <code>VBG, NN</code>, <code>VBG, IN</code>, <code>VBN, NN</code>, <code>VBN, JJ</code>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|''Term-like'': X-ray fluorescence spectrometer; Brønsted basic site; Pd(110) surface oscillation; doping CsPW with platinum; catalyzed N2O decomposition; crystalline phase transition; catalyzed oxidation of NO; complete photoreduction of Pd(II); propagating thermosynthesis; reforming of the biomass; drying inside the microscope column<br />&nbsp;<br />''Filtered due to exception'': used during steam reforming; catalyzed by metalloporphyrin; investigated by XRD; using atomic absorption
|-
|}
|}


==References==

Revision as of 16:35, 7 June 2016

Full article title Terminology spectrum analysis of natural-language chemical documents: Term-like phrases retrieval routine
Journal Journal of Cheminformatics
Author(s) Alperin, Boris L.; Kuzmin, Andrey O.; Ilina, Ludmila Y.; Gusev, Vladimir D.; Salomatina, Natalia V.; Parmon, Valentin N.
Author affiliation(s) Boreskov Institute of Catalysis, Sobolev Institute of Mathematics, Novosibirsk State University
Primary contact Email: kuzmin [at] catalysis.ru
Year published 2016
Volume and issue 8
Page(s) 22
DOI 10.1186/s13321-016-0136-4
ISSN 1758-2946
Distribution license Creative Commons Attribution 4.0 International
Website http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0136-4
Download http://jcheminf.springeropen.com/track/pdf/10.1186/s13321-016-0136-4 (PDF)

Abstract

Background: This study seeks to develop, test and assess a methodology for automatic extraction of a complete set of ‘term-like phrases’ and to create a terminology spectrum from a collection of natural language PDF documents in the field of chemistry. The definition of ‘term-like phrases’ is one or more consecutive words and/or alphanumeric string combinations with unchanged spelling which convey specific scientific meanings. A terminology spectrum for a natural language document is an indexed list of tagged entities including: recognized general scientific concepts, terms linked to existing thesauri, names of chemical substances/reactions and term-like phrases. The retrieval routine is based on n-gram textual analysis with a sequential execution of various ‘accept and reject’ rules with taking into account the morphological and structural information.

Results: The retrieval process was assessed quantitatively with precision (P), recall (R) and F1-measure, calculated manually by professional chemical scientists from a limited set of documents (the full set of text abstracts from five EuropaCat events was processed). The assessment confirmed the effectiveness of the developed approach: the term-like phrase parsing efficiency is quantified by precision (P = 0.53), recall (R = 0.71) and F1-measure (F1 = 0.61).

Conclusion: The paper suggests using such terminology spectra to perform various types of textual analysis across document collections. This sort of terminology spectrum may be successfully employed for text information retrieval, for reference database development, to analyze research trends in subject fields of research and to look for the similarity between documents.


Keywords: Terminology spectrum, natural language text analysis, n-Gram analysis, term-like phrases retrieval, text information retrieval

Background

The current situation in chemistry, as in any other field of natural science, can be characterized by a substantial growth of texts in natural languages (research papers, conference proceedings, patents, etc.), still being the most important sources of scientific knowledge and experimental data, information about modern research trends and terminology used in the subject areas of science. It greatly increases the value of such powerful information systems as Scopus®, SciFinder®, and Reaxys® which are capable of handling large text document databases and especially those fitted with advanced text information retrieval capabilities. In fact, both efficiency and productivity of modern scientific research in chemistry depend rigorously on quality and completeness of its information support, which is oriented firstly on advanced and flexible reference search, discovering and analysing of text information to afford the most relevant answers to user questions (substances, reactions, relevant patents or journal articles). The main ideas and developments in the information retrieval methods coupled with techniques of full text analysis are now well described and examined.[1]

In conventional information systems, the majority of text information retrieval and discovery methods are based on specific sets of pre-defined document metadata, e.g., keywords or indexes of terms characterizing the text content. Using an index, user queries are converted into information requests expressed as a combination of Boolean terms, while also bringing into play the vector space and term weights. Probabilistic approaches may also be employed to include in the analysis such features as term distribution, co-occurrence information and term relationships derived from information retrieval thesauri (IRT). Such indexes have primarily been produced and updated manually by trained experts, but the possibilities of automated index development now attract closer attention.

It is assumed that the structural foundation of any scientific text is its terminology, which may be represented, in principle, by advanced IRT. However, it leads to difficulties in applying conventional IRTs in practical information text analysis procedures because of limitations inherent in them. Typically, such thesauri are made manually in a very labor-intensive process and often are constructed to reflect the general terminology only. Terms from thesauri originally represent a formally written description of scientific conceptions and definitions which may not exactly match the real usage and spelling used in scientific texts. Moreover, a thesaurus developed for one type of text may be less efficient or not applicable when used with another. A good example is the IUPAC Gold Book compendium of chemical nomenclature, terminology, units and definition recommendations.[2] Terminology drafted by experts of IUPAC spans a wide range of chemistry but does not describe any field in detail and represents only a well-established upper level of scientific terminology. Summarizing, IRT based text analysis alone is unable to solve the problem of the variability of scientific texts written in natural languages because the accuracy of matching thesaurus terms with real text phrases leaves much to be desired.

It should also be noted that the language of science is evolving faster than that of natural language, especially in chemistry and molecular biology. Thus, the analysis of terminology of subject text collection should be done automatically using both primitive extraction and sophisticated knowledge-based parsing. Only automated data analysis can process and reveal the variety of term-like word combinations in the constantly changing world of scientific publications. Automated parsing and analysis of document collections or isolated documents for term-like phrases can also help to discover various contexts in which the same scientific terminology is used in different publications or even parts of the same publication.

There is nothing new in the idea of automated term retrieval. Typically, the terminology analysis of text content is focused on recognition of chemical entities and automatic keyphrase extraction aimed at providing a limited set of keywords which might characterize and classify the document as a whole. Two main strategies are usually applied: machine self-learning and the use of various dictionaries with automated selection rules (heuristics) coupled with calculated features[3], such as TF-IDF.[4][5] Keyphrase retrieval procedures therefore typically involve the following stages: initial text pre-processing; selecting keyphrase candidates; applying rules to each candidate; and compiling a list of keyphrases.[6] A few existing systems have been analyzed in terms of the precision (P), recall (R) and F1-score attainable on existing keyphrase extraction datasets. For such well-known systems as Wingnus, Sztergak, and KP-Miner, these values are reported as P = 0.34–0.40, R = 0.11–0.14, and F1 = 0.17–0.20.[6] Open-Source Chemistry Analysis Routines (OSCAR4)[7] and ChemicalTagger[8] NLP may also be mentioned as tools for the recognition of named chemical entities and for parsing and tagging the language of text publications in chemistry.

However, the above-mentioned keyphrase extraction approaches have some inherent shortcomings, because in a significant number of cases a limited set of automatically selected top-ranked keyphrases does not describe the document in detail (e.g., a paper may contain the description of a specific catalyst preparation procedure that is not the main subject of the paper). It may also be seen from the aforementioned values of P, R and F1 that in many cases the extracted keyphrases do not adequately match the keyphrases selected by experts. Exact matching of keyphrases is a rather rare event, partially due to the difficulty of taking into account nearly identical phrases, for instance, semantically similar phrases. On the other hand, even though the widely used n-gram analysis can build a full spectrum of the token sequences present in the text, it may also produce a great deal of noise, making the results difficult to use. Some attempts have been made to take into account the semantic similarity of n-grams and to differentiate between rubbish and plausible keyphrase candidates.[9][10]

The problem of automatic recognition of scientific terms in natural language texts has been explored in recent decades.[11] That research has shown that taking into account the linguistic information may improve the terms extraction efficiency. The information about grammatical structure of multi-word scientific terms, their text variants, and the context of their usage may be represented as a set of lexico-syntactic patterns. For instance, values of P, R and F-measure equal to 73.1, 53.6 and 61.8 percent respectively for term extraction from scientific texts (only in Russian) on computer science and physics were obtained.[12]
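
The reported values can be cross-checked with the standard definition of the F-measure as the harmonic mean of precision and recall, F = 2PR / (P + R). The short Python sketch below assumes that this balanced form is the one used; the function name is introduced here only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted above for Russian-language term extraction (in percent).
russian_f = f1_score(73.1, 53.6)

# Values quoted in the abstract for the present approach.
present_f = f1_score(0.53, 0.71)
```

Under that assumption, P = 73.1 and R = 53.6 give about 61.8 percent, and the abstract's P = 0.53 and R = 0.71 give about 0.61, both in agreement with the reported figures.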

A "terminology spectrum" of a natural language publication may be defined as an indexed list of tagged token sequences with calculated weights, such as recognized general scientific notions, terms linked to existing thesauri, names of chemical entities and "term-like phrases." The term-like phrases are not exactly the keyphrases or terms in the usual sense (like published in thesauri). Such term-like phrases are defined here as one or more consecutive tokens (represented by words and/or alphanumeric strings combinations), which convey specific scientific meaning with unchanged spelling and context as in a real text document. For instance, a term-like phrase may look similar to a specific generally used term but with different spelling or word order reflecting the usage of the term in a different context in natural language environment. Consequently, they may describe real text content and the essence of real processes that the scientific research handles, which makes the analysis of such phrases extremely useful. That sort of terminology spectrum of a natural language publication may be considered as some kind of knowledge representation of a text and may be successfully employed in various information retrieval strategies, text analysis and reference systems.[13]

The present work aims to develop and test a methodology for automated retrieval of the full terminology spectrum from any natural language chemical text collection in PDF format, with term-like phrase selection being the central part of the procedure. The retrieval routine is based on n-gram text analysis with sequential execution of a complex grouping of "accept" and "reject" rules while taking into account the morphological and structural information. The term "n-gram" denotes here a text string or a sequence of n consecutive words or tokens present in a text. The efficiency of automated term-like phrase retrieval is assessed numerically in the paper by comparing automatically extracted term-like phrases with those manually selected by experts.

Methods

Text collection used for experiments

Chemical catalysis is a foundation of chemical industry and represents a very complex field of scientific and technological research. It includes chemistry, various subject fields of physics, chemical engineering, material science and a lot more. One of the most representative research conferences in catalysis is the European Congress on Catalysis or EuropaCat, which has been chosen as a source of scientific texts covering the wide range of themes of research. A set of abstracts of EuropaCat conferences of 2013, 2011, 2009, 2007, and 2005 (about 6000 documents from all five Congress events) has been used for textual analysis in the present study. All abstracts are in PDF format.

General description of terminology spectrum retrieval process

The developed system of terminology spectrum analysis consists of the following sequentially running procedures or steps, as depicted in Fig. 1.


Fig. 1 General scheme of the terminology spectrum building process with term-like phrases retrieval

The server side of the terminology spectrum analysis system runs on Java SE 6 platform and the client is a PHP web application to view texts and the results of terminology analysis. To store all data collected in the terminology retrieval process, the cross-platform document-oriented database MongoDB is used.[14] The choice in favor of MongoDB was conditioned by the need to process nested n-gram structures up to level seven.

The main stages and analytic methods involved in the process are discussed in the following sections.

Text materials conversion with PdfTextStream library

The scientific texts are mainly published in PDF format which does not typically contain any information about document structure and therefore is not suitable for immediate text analysis. Thus, at first, a document has to be preprocessed by converting a PDF file into text format and analyzing its structure (highlighting titles, authors, headings, references, etc.) with the aim to make the text suitable for further content information retrieval (see Fig. 2). The following steps are used with PdfTextStream library[15] (stages 1–2 on Fig. 1) to make such a PDF transformation (for a detailed example see Additional File 1):


Fig. 2 An example of PDF-to-text transformation

1. Isolate text blocks which have the same formatting (e.g., bold, underline, etc.).

2. Remove empty blocks and merge blocks located on the same text row.

3. Analyze the document structure by classifying each block as containing information about the publication title, the headings, authors, organizations, e-mails, references and content. To perform such analysis a set of special taggers has been developed which are executed sequentially to analyze and tag each text block. Taggers utilize such features as the position of the first and last rows of text block, its text formatting, the position of a block of text on a page, etc. All developed taggers have been adjusted to handle each conference event individually.

4. Filter text blocks to remove unclassified text blocks, for instance, situated before the publication title, because such blocks typically contain useless and already known information about a conference or journal.

5. Unify special symbols (such as variants of the dash, hyphen, and quote characters), remove space characters placed before brackets in the notation of crystal indexes, etc. Regular expressions are used for this step.
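
Step 5 above can be illustrated with a short Python sketch based on regular expressions. The exact patterns used by the authors are not given in the text, so the character classes and the crystal index pattern below (a space followed by three or four digits in parentheses) are assumptions chosen only to show the idea.

```python
import re

# Assumed set of dash-like characters to be unified into an ASCII hyphen.
DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2212]")
# Assumed sets of typographic quotes to be unified into ASCII quotes.
SINGLE_QUOTES = re.compile("[\u2018\u2019]")
DOUBLE_QUOTES = re.compile("[\u201c\u201d]")
# Assumed shape of a crystal index: whitespace before "(" + 3-4 digits + ")".
SPACE_BEFORE_INDEX = re.compile(r"\s+(?=\(\d{3,4}\))")


def unify_symbols(text: str) -> str:
    """Normalize dash and quote variants and glue crystal indexes to their token."""
    text = DASHES.sub("-", text)
    text = SINGLE_QUOTES.sub("'", text)
    text = DOUBLE_QUOTES.sub('"', text)
    return SPACE_BEFORE_INDEX.sub("", text)
```

With these assumed patterns, <code>unify_symbols("Pd (110) surface")</code> yields <code>Pd(110) surface</code>, while ordinary parenthesized text such as "(Fig. 1)" is left untouched.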

Text pre-processing

The text pre-processing stage (step three in Fig. 1) is used to transform a text document obtained from stages one and two into a unified structured format with markup. During this stage the text is split into individual words and sentences (tokenization) followed by a morphological analysis that includes: highlighting objects such as formulas and chemical entities, removing unnecessary words and meaningless combinations of symbols, and recognizing general English words and tokens with special meaning (units, stable isotopes, acronyms, etc.). The result of this stage is a fully marked structured text to be stored in the database. The following steps are involved in the text pre-processing stage.

Tokenization

A tokenizer from the OSCAR4 library is used for splitting a text into words, phrases and other meaningful elements. The tokenizer has been adapted for better handling of chemical texts.

The present study established that the original OSCAR4 tokenizer had some shortcomings for our purposes. The first issue was the splitting of tokens at the hyphen "-", which often led to mistakes in recognizing compound terms. To overcome this, the parts of the source code responsible for splitting tokens at hyphens were commented out (see Additional File 2). The second issue was that some complex tokens representing chemical compositions were treated by the tokenizer as a sequence of tokens (see Fig. 3). In such cases it was necessary to combine the isolated tokens into a single one. The modified tokenizing procedure now merges adjacent tokens separated by either the "/" or ":" character, provided that they are marked with the OSCAR4 tag CM or contain a chemical element symbol. Additionally, tokens of the form "number %" situated at the beginning of such a phrase describing a chemical composition are merged into the single token as well (see Fig. 3).


Fig. 3 An example of the tokenization process. Frames outline the results of modified OSCAR4 tokenizer, additional outer frames isolate tokens describing a chemical composition (possessing the tag "COMP").

An example of the output of the modified tokenizer is shown in Fig. 3. Blue frames hold the tokens identified by the modified OSCAR4 tokenizer. Additional red frames outline tokens which are combined into single ones. Such tokens are marked with the dedicated tag COMP. This tag is used by the accept rule ChemUnigramRule to identify one-word n-grams describing chemical compositions.
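
The merging of composition tokens can be sketched in Python as shown below. This is a simplified illustration, not the modified OSCAR4 code: the token dictionaries, the small element subset and the way an element symbol is detected are assumptions, and the additional handling of leading "number %" tokens is omitted.

```python
import re

# Illustrative subset of element symbols; a real list would cover all elements.
ELEMENT_SYMBOLS = {"Pt", "Pd", "Al", "O", "Ni", "Mg", "Ce", "Zr"}
SEPARATORS = {"/", ":"}


def is_chemical(token: dict) -> bool:
    """Assumed check: OSCAR4 tag CM, or the text contains an element symbol."""
    if token.get("oscar") == "CM":
        return True
    candidates = re.findall(r"[A-Z][a-z]?", token["text"])
    return any(c in ELEMENT_SYMBOLS for c in candidates)


def merge_compositions(tokens: list) -> list:
    """Merge chemical tokens joined by "/" or ":" into one token tagged COMP."""
    out = []
    i = 0
    while i < len(tokens):
        tok = dict(tokens[i])
        merged = False
        # Absorb "<separator> <chemical token>" pairs while they follow.
        while (i + 2 < len(tokens)
               and tokens[i + 1]["text"] in SEPARATORS
               and is_chemical(tok)
               and is_chemical(tokens[i + 2])):
            tok["text"] += tokens[i + 1]["text"] + tokens[i + 2]["text"]
            i += 2
            merged = True
        if merged:
            tok["tag"] = "COMP"
        out.append(tok)
        i += 1
    return out
```

For a hypothetical token sequence "Pt", "/", "Al2O3", "catalyst", the sketch produces the single token "Pt/Al2O3" carrying the tag COMP, followed by the unchanged token "catalyst".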

Then the position of a token in the text is determined. Splitting the series of tokens into sentences finalizes the tokenization process, which is realized with the help of the WordToSentenceAnnotator routine of Stanford CoreNLP library.[16][17]

Morphological analysis and labeling tokens with their POS tags

Morphological analysis (performed with the Stanford CoreNLP library[18]) maps each word to a set of part-of-speech (POS) tags (the Penn Treebank Tag Set[19] as implemented in Stanford CoreNLP is used). Typical tags used in the research are: NN (plural NNS), noun; VB, verb; JJ, adjective; CD, cardinal number; etc. For full information about the POS tags used by the terminology spectrum building procedure, see Table 4 (later in the paper).

Lemmatization

Lemmatization is the process of grouping together different inflected word forms so they can be treated as a single item. In the present work, however, lemmatization is only used to replace nouns in the plural form with their lemmas. Preliminary experiments demonstrated that further lemmatization is not helpful and leads to a significant loss of meaningful information (for example, "reforming process" is reduced to the lemmas "reform" and "process," losing the name of a very important modern industrial chemical process in refining).
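A minimal sketch of this selective lemmatization is given below. It assumes that plural nouns are those tagged NNS or NNPS and that some lemmatizer is available as a callable; the toy dictionary lookup `toy_lemmatize` is purely hypothetical and stands in for a real lemmatizer.

```python
PLURAL_NOUN_TAGS = {"NNS", "NNPS"}  # assumption: both plural noun tags are lemmatized

def selective_lemma(word, pos, lemmatize):
    """Return the lemma only for plural nouns; leave every other word unchanged."""
    return lemmatize(word) if pos in PLURAL_NOUN_TAGS else word

# Hypothetical stand-in for a real lemmatizer.
TOY_LEMMAS = {"catalysts": "catalyst", "reforming": "reform"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

plural = selective_lemma("catalysts", "NNS", toy_lemmatize)
gerund = selective_lemma("reforming", "VBG", toy_lemmatize)
```

With this restriction, "catalysts" becomes "catalyst" while the gerund "reforming" is kept intact, so a phrase such as "reforming process" keeps its meaning.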

Recognition of names of chemical entities

Meta-information about the names of chemical entities is very important in various strategies for retrieving term-like phrases. The open-source OSCAR4 (Open Source Chemistry Analysis Routines)[7][20] software package is applied for the selection and semantic annotation of chemical entities across a text. Among the variety of tags and attributes utilized by the OSCAR4 routine, only the following are used in the present study:

1. CM — chemical term (chemical name, formula or acronym);

2. RN — reaction (for example, epoxidation, dehydrogenation, hydrolysis, etc.);

3. ONT — ontology term (for example, glass, adsorption, cation, etc.).

When a token is part of a recognized chemical entity, the token gets the same OSCAR4 tag as the whole entity.

Recognition of tokens with special meaning

A significant part of the text pre-processing stage is the selection of individual tokens that are words of general English and the recognition of various meaningful text strings, namely: general scientific terms (recognition is actually performed at the final terminology spectrum building stage but is described here for convenience); tokens denoting chemical elements, stable isotopes and measurement units; and tokens that cannot be part of any term. This work is performed using specially developed dictionaries, described in detail in Table 1.

Table 1. Developed/modified dictionaries used for recognition of general English words, general chemical science terms and tokens with special meaning
Dictionary/Usage for Description Reference Examples
General chemical science terms
 
Selection of general terms (chemical and from related fields of physics, mathematics …)
~7500 General scientific terms in chemistry, physics and mathematics
 
IUPAC Compendium is used
http://goldbook.iupac.org/
 
IUPAC Compendium of Chemical Terminology (Gold Book)
Naphthenes, solvation energy, osmotic pressure, reaction dynamics …
General English words dictionary
 
Selection of general English words
~58,000 general English words. It is based on Corncob Lowercase Dictionary modified by us for stated goals. 566 words were excluded, which are often used in scientific terminology Modified Corncob Lowercase list of more than 57,000 English words http://ru.scribd.com/doc/147594864/
 
Corncob Lowercase (see Additional file 3 for excluded words)
Abbreviate, academic, accelerate …
 
Excluded: Abrasion, absorption, aerosol …
Stop list
 
Filtering tokens which are not part of terms in any way
~2060 tokens. List contains the words, abbreviations and so on, which cannot be incorporated into any term-like phrases Proprietary design (see Additional file 4) e.g., de, ca., fig., al., co-exist, et, etc., i.e., ltd …
Stable isotopes
 
Filtering n-grams containing digits
~250 isotopes. It is based on The Berkeley Laboratory Isotopes Project’s isotopes database Proprietary design, based on The Berkeley Laboratory Isotopes Project’s DB: http://ie.lbl.gov/education/isotopes.htm (see Additional file 5) 1H, 2H, 3He, 4He, 6Li, 7Li …
Chemical elements signs
 
Filtering n-grams containing digits
~126 chemical elements. It is based on periodic table Proprietary design, based on periodic table (see Additional file 6) H, He, Li, Be, B, C, N, O, F …
Measurement units
 
Filtering n-grams containing units of measure
~100 records now, partially based on IUPAC Gold Book Proprietary design, partially based on http://goldbook.iupac.org/ (see Additional file 7) (a.u.), (ev), a.u, °C, ppm, kV, mol, g−1, ml−1, gcat, gcat h …

Some extra explanation needs to be given on the general English dictionary, the stop list dictionary and the procedure of recognition of general scientific terms.

More than 560 words either found in scientific terminology (for instance: "acid", "alcohol", "aldehyde", "alloy", "aniline", etc.) or occurring in composite terms (for example, "abundant" may be part of the term "most abundant reactive intermediates") were excluded from the original version of the Corncob Lowercase Dictionary.

The IUPAC Compendium of Chemical Terminology (the only well-known and time-proven dictionary) is used as a source of general chemistry terms. To find the best way to match an n-gram to a scientific term from the compendium, a number of experiments have been performed which resulted in the following criteria:

1. N-gram is considered a general scientific term if all n-gram tokens are the words of a certain IUPAC Gold Book term, regardless of their order; and

2. If (n − 1) of n-gram tokens coincide with the (n − 1) words of an IUPAC Gold Book term, and the remaining word is among other terms in the dictionary, then the n-gram is considered a general scientific term too.

Some examples may be given. The n-gram "RADIAL CONCENTRATION GRADIENT" is a general scientific term because the phrase "concentration gradient" is in the compendium and the word "radial" is part of the term "radial development." The n-gram "CONTENT CATALYTIC ACTIVITY" is a general term because the term "catalytic activity content" is present in the compendium and differs from the n-gram only by word order. The n-gram "TOLUENE ADSORPTION CAPACITY" is not considered a general term, despite the fact that two words coincide with the term "absorption capacity," because the remaining word "TOLUENE" is special and is not found in the compendium. The n-gram "COBALT ACETATE DECOMPOSITION" is not considered a general term either as only the term "decomposition" may be found.
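The two matching criteria and the examples above can be expressed as a short Python sketch. It is only one possible reading of the criteria, not the implementation used in the study: the function name `is_general_term` and the four-entry glossary are illustrative assumptions, and the real compendium contains roughly 7500 terms.

```python
def is_general_term(ngram, glossary):
    """Check an n-gram against a glossary using the two criteria above."""
    tokens = ngram.lower().split()
    token_set = set(tokens)
    term_words = [set(term.lower().split()) for term in glossary]
    vocabulary = set().union(*term_words) if term_words else set()

    # Criterion 1: all tokens are words of one glossary term, in any order.
    if any(token_set <= words for words in term_words):
        return True

    # Criterion 2: n-1 tokens form a complete glossary term and the
    # remaining token occurs somewhere else in the glossary.
    if len(tokens) > 1:
        for i, leftover in enumerate(tokens):
            rest = set(tokens[:i] + tokens[i + 1:])
            if leftover in vocabulary and any(rest == words for words in term_words):
                return True
    return False

# Toy glossary built from the terms mentioned in the examples.
GLOSSARY = ["concentration gradient", "radial development",
            "catalytic activity content", "decomposition"]

radial = is_general_term("RADIAL CONCENTRATION GRADIENT", GLOSSARY)
content = is_general_term("CONTENT CATALYTIC ACTIVITY", GLOSSARY)
cobalt = is_general_term("COBALT ACETATE DECOMPOSITION", GLOSSARY)
```

With this toy glossary the first two n-grams are accepted (by criterion 2 and criterion 1, respectively) and the third is rejected, in line with the examples in the text.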

The final comment concerns the stop list dictionary, which at first glance may look like a set of arbitrary words. In fact, it is based on a series of observations of term-like phrases that were wrongly identified by an earlier version of the terminology analysis system.

Strict filtering

The last step in the text pre-processing stage is strict filtering, developed to remove unnecessary words and meaningless combinations of symbols. If at least one token of an n-gram is labeled with the strict filtering tag ("rubbish" : "true"), then the n-gram is not considered a term-like phrase. At this stage, certain character sequences are looked for, as described by the filtering rules (Table 2) and not exempted by the list of exceptions (Table 3). These are successive digits, special symbols, measurement units, symbols of chemical elements, brackets and so on. Custom regular expressions and the standard dictionaries described in Table 1 are used for this procedure. A general scheme of strict filtering is illustrated in Fig. 4.

Table 2. Rules for strict filtering procedure
No. Rule Examples
1 SpecialSymbolsRule
 
True if a token contains at least one of the special symbols different from: . -,/: () [] + = @ ®
SIZE(**), SELECTIVITY%, NIMG_650, H2S↔35SCAT, 1AUDAE_AM, ΔGADS, H0 ≦−8.2
2 StopListRule
 
True if a token is in the stop list (Table 1)
LITERATURE, VIEWPOINT, PERCENT, PRESENT, IMPORTANCE, FUNDAMENTAL, CONCLUSION, TYPICALLY, EXAMPLE, INTRODUCTION
Rules of regular expressions:
 
True, if a token satisfies at least one of the regular expressions from the following list...
3 4DigitRule
 
True if a token contains four or more digits in succession
FQM-3994, RYC-2008-03387, 20000H-1, MAT2010-21147, CO(0001)-CARBIDE, CO(111)/CO(0001), RU(0001) ELECTRODE
4 3DigitRule
 
True if a token contains three digits in succession
215KMTA, 220ML, 148H-1, CU2O(111), AU{111}-CEO2{100}, MGO/AG(100)
2DigitRule
 
True if a token begins with one or two digits
12C16O-13C16O, 31P{1H}, 2-PROPANOL, 2-METHYL-1-BUTENE, 3-METHYL-1,3-BUTADIENE, 15 %H3PW12O40/TIO2
5 UnitsRule
 
True if a token ends with a string from the dictionary of measurement units (Table 1)
KJMOL-1, MMOL.MIN-1, KJ.MOL-1, G.GZEOLITE-1.H-1, CM3.MIN-1.G-1
Table 3. Exceptions for strict filtering procedure
No. Exception Examples
1 Facet_Index_4digits
 
Token denotes the substance containing a four-digit facet index. The list of chemical element signs is used (Table 1).
terms: RU(0001); CO(0001)-CARBIDE; α-FE2O3(0001)
 
rubbish: HPG1800B; RYC-2008-03387; 20000H-1
2 Miller_Index_3digits
 
Token denotes the substance containing a three-digit crystallographic Miller index. The list of chemical element signs is used.
terms: CEO2(111); PT(111); AU{111}-CEO2{100}; (NI,AL)(111); AL2O3/NIAL(110)
 
rubbish: R873; 50WX8-100; 270-470OC
3 Substances_3digits
 
Token denotes a chemical containing three digits in succession. The list of chemical element signs and regular expressions such as EL/\{\d{3}\} are used.
terms: 15N218O; H235S; H218O-SSITKA; H216O/H218O
 
rubbish: FA100; TSVET-500; CE-440
4 Isotopes
 
Token denotes an isotope. Stable isotopes and chemical elements signs lists are used (Table 1).
terms: 13C CP-MAS NMR; 12C16O-13C16O MIXTURE; 31P MAS NMR SPECTROSCOPY
 
rubbish: 04,21H; 11H; 11HV; 1 %18O2; -1H-1; 57CO
5 Substances_2digits
 
Token denotes substance, which begins with one or two digits.
terms: 5-PENTANEDIOL; 2-AMINOBENZENE-1,4-DICARBOXYLATE; 5-BROMO-3-(N,N-DIETHYLAMINO-ETHOXY)-2-METHYLINDOLE
 
rubbish: 2R,3S; 2LFH; 5NICZPOL; 1KPM; 4-CP
6 Catalysts
 
Token denotes a catalytic system which is a chemical composition with the "." character.
terms: 1.5AU/C; 1.0CUCOK/ZRO2; CE0.9PR0.1O2; CU0.2CO0.8FE2O4; MG3ZN3.-XFE0.5AL0.5; LAFE0.7NI0.3O3-Δ; CE0.8GD0.2O2-Δ; MN0.8ZR0.2
 
rubbish: VOL. %; (B)2.5 %; DISP.[%]
7 Comp
 
Token denotes the chemical or catalyst composition. Tag COMP is used.
terms: 20 %CU/ZNAL; 0.4 %PD/AL2O3; 4 %PT-4 %RE/TIO2; (5 %)PB(10 %)-SBA15
 
rubbish: 50 %AIR; 1.5 %WT; 0-2.5MOL %; CA.23 %
8 Cryst_hydrates
 
Tokens denote crystalline hydrates. Regular expressions as *[A-Za-z].*H2O$ are used.
terms: AL(NO3)3*6H2O; FE2(SO4)3.9H2O; AUCL4(NH4)7[TI2(O2)2(CIT)(HCIT)]2.12H2O;
 
rubbish: 0.6 %H2O; 0.03 %C3H6; 0.06286*T;
9 SpatialDimension
 
Token denotes the 1-, 2- or 3-dimensional method or pattern.
terms: 2D-SAXS; 2D-GC; 1D-3D COPPER – OXIDE; 1D-STRUCTURE; 1D COPPER – OXIDE
 
rubbish: 12-MR; 1LATTICE; 16ACR; 60HPW
10 Names
 
Token denotes a proper name. A set of regular expressions is used for recognition.
terms: BRØNSTED ACID; BRӦNSTED BASIC SITE; MӦSSBAUER SPECTROSCOPY;
 
rubbish: L’ARGENTIЀRE; PROCESS’S
11 OscarTags
 
True if a token has any OSCAR tag and matches one of the following regular expressions: \-[A-Za-z]{2}, \{, \[*[A-Za-z], etc.
terms: STEM-HAADF; L-CYSTINE; DI-TERT-BUTYLPEROXIDE;[AU(EN)2]2[CU(OX)2]3
 
rubbish: 128°- Y-ROTATED; π- BACKDONATION; CONVERSION(%);CU(1)MN; M1(2); ACTIVITY [2]

EL: designation of any chemical element; IS: designation of any stable isotope.


Fig. 4 General scheme of strict filtering tagging

The following examples may be given to illustrate the decision-making process of defining a token as "valid" or "rubbish" (Fig. 5).


Fig. 5 Examples of strict filtering tagging
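The interplay between filtering rules and exceptions can be sketched in Python for the digit rules alone. The sketch below is a simplification under stated assumptions: it merges the three- and four-digit rules into one check, folds the facet-index and Miller-index exceptions into a single hypothetical pattern, and uses a toy subset of element symbols (`ELEMENTS`). The remaining rules and exceptions of Tables 2 and 3 are not covered.

```python
import re

# Toy subset of upper-cased element symbols (assumption; the study uses a full list).
ELEMENTS = ["RU", "CO", "CE", "PT", "AU", "CU", "O"]

# Rule: three or more digits in succession.
DIGIT_RUN = re.compile(r"\d{3,}")

# Exception: element symbol, optional stoichiometric digits, then a
# three- or four-digit index in round or curly brackets, e.g. PT(111), RU(0001).
INDEX_EXCEPTION = re.compile("(?:" + "|".join(ELEMENTS) + r")\d*[({]\d{3,4}[)}]")

def is_rubbish(token):
    """Return True if the digit rule fires and no index exception applies."""
    token = token.upper()
    if not DIGIT_RUN.search(token):
        return False
    return INDEX_EXCEPTION.search(token) is None

verdicts = {t: is_rubbish(t) for t in
            ["PT(111)", "CEO2(111)", "RU(0001)", "R873", "RYC-2008-03387", "NICKEL"]}
```

In this sketch the surface tokens PT(111), CEO2(111) and RU(0001) survive as valid, while R873 and RYC-2008-03387 are tagged as rubbish, which matches the examples listed in Tables 2 and 3.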

Summary of pre-processing stage

The final result of the text pre-processing stage is marked and structured text with tagged tokens. These tags are then used by various rules for term-like phrase selection. Since not all the tags from OSCAR4 and the Penn Treebank Tag Set are needed, only a few of them are used in the term-like phrase retrieval procedure. Table 4 gives the consolidated list of all tags that may be assigned to tokens at different steps of the text pre-processing stage.

Table 4. The consolidated list of all tags assigned to tokens at different steps of the text pre-processing stage; it is also indicated whether a tag is used in strict filtering or in term-like phrases retrieval procedure with help of POS-based rules.
Group of tags Tag Explanation Strict filtering Morphological pattern
POS JJ Adjective Yes (n-grams n > 1)
JJR Adjective, comparative Yes (n-grams n > 1)
VBG Verb, gerund or present participle Yes (n-grams n ≥ 1)
VBD Verb, past tense includes the conditional form of the verb to be Yes (n-grams n > 1)
VBN Verb, past participle Yes (n-grams n > 1)
NNP Proper Noun, singular Yes (n-grams n > 1)
NN Noun, singular or mass Yes (n-grams n ≥ 1)
NNPS Proper Noun, plural Yes (n-grams n ≥ 1)
NNS Noun, plural Yes (n-grams n ≥ 1)
IN Preposition or subordinating conjunction Yes (n-grams n > 1)
DT Determiner Yes (n-grams n > 1)
RB Adverb Yes (n-grams n > 2)
RBS Adverb, superlative Yes (n-grams n > 2)
FW Foreign word Yes (n-grams n > 1)
OSCAR CM Chemical matter Yes Yes (all n-grams)
ONT Ontological term Yes Yes (all n-grams)
Own tags COMP Chemical composition Yes (all n-grams)
rubbish Token for which strict filtering to be applied Yes Yes (all n-grams)
GCST General Chemistry Scientific Term Yes (all n-grams)

The following example illustrates tag assignment. Figure 6 shows an example sentence in which a few tokens have been tagged. For instance, the token 2.7 %CO/10.0 %H2O/He carries the following tags: (pos = "CD"; lemma = "2.7 %CO/10.0 %H2O/He"; oscar = "CM"; rubbish = "false"; exception = "comp"). Every token has at least two tags: pos (which holds the part-of-speech information) and lemma (which holds the lemma of the token). In addition, some tokens related to chemistry (indicating chemical substances, formulas, reactions, etc.) have the tag oscar, taking the value CM or ONT. Last but not least is the tag rubbish ("true" or "false"), marking tokens for which strict filtering is to be applied.


Fig. 6 An illustration of tags assignment to different tokens

N-grams spectrum retrieval procedure

As defined earlier in our study, the term "n-gram of length n" denotes a sequence of n consecutive tokens situated within the same sentence, with useless tokens omitted (at the moment only definite and indefinite articles). The n-gram set is obtained by moving a window of n tokens through an entire sentence, one token at a time. This process is repeated for all sentences of all texts in the set <math id="M1">T = \left\{ {T_{1},T_{2},\ldots,T_{m}} \right\}</math>.

For a set of texts, each n-gram may be characterized by its textual frequency of occurrence (the total number of occurrences of the n-gram within a given text) and by its absolute frequency of occurrence (the total number of occurrences of the n-gram across all texts). As a result, each n-gram may be described by a vector over the set of texts, which enables us to develop additional procedures for n-gram filtering and text information analysis.

The full n-gram data set is redundant, which creates difficulties for analysis. Different filtration procedures are to be applied for specific purposes. For instance, threshold filtering based on the values of the textual and absolute frequencies may be used.
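The sliding-window procedure and the two frequencies can be sketched in Python as follows. The function names (`ngrams`, `spectrum`, `threshold`), the representation of a text as a list of tokenized sentences and the cut-off value are assumptions made for illustration only.

```python
from collections import Counter

ARTICLES = {"a", "an", "the"}  # tokens omitted before the window is applied

def ngrams(sentence_tokens, n):
    """Slide a window of n tokens over one sentence, token by token."""
    kept = [t for t in sentence_tokens if t.lower() not in ARTICLES]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]

def spectrum(texts, n):
    """texts: list of texts, each a list of sentences (lists of tokens).

    Returns per-text counters (textual frequencies) and one overall
    counter (absolute frequencies).
    """
    textual = [Counter(g for sentence in text for g in ngrams(sentence, n))
               for text in texts]
    absolute = Counter()
    for counts in textual:
        absolute.update(counts)
    return textual, absolute

def threshold(absolute, min_count):
    """Keep only n-grams whose absolute frequency reaches min_count."""
    return {g: c for g, c in absolute.items() if c >= min_count}

texts = [
    [["the", "nickel", "catalyst", "was", "active"]],
    [["a", "nickel", "catalyst"], ["nickel", "catalyst", "deactivation"]],
]
textual, absolute = spectrum(texts, 2)
key = ("nickel", "catalyst")
vector = [counts[key] for counts in textual]
frequent = threshold(absolute, 2)
```

For the bigram ("nickel", "catalyst") the per-text vector is `[1, 2]` and the absolute frequency is 3, so it is the only bigram that passes a threshold of 2.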

Module of terminology spectrum building

The final stage of the analysis is to sort the multitude of n-grams into term-like phrases, general chemistry scientific terms, names of chemical entities and useless n-grams. Calculating the textual and absolute frequencies of term occurrence completes the terminology spectrum building.

To select term-like n-grams, sets of accept and reject rules are applied. They are all based on the token tags assigned at previous steps and on the developed dictionaries (Table 1). The purpose of each set of rules is to determine, by analyzing its structure, whether an n-gram of a given length is a term-like phrase or not. All rules are applied consecutively. As soon as an n-gram conforms to a reject or accept rule in the rule sequence, the procedure stops and the n-gram is declared either a non-term-like phrase or a term-like phrase, the latter possibly having a special meaning (e.g., general chemistry scientific term or chemical entity). If no rule is applicable, the n-gram is considered a term-like phrase too. A few general rules can be used for the analysis of n-grams of any length. There are also tailored sets of rules for 1-grams (Table 5), 2-grams (Table 6) and longer n-grams with n > 2 (Table 7).
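The consecutive application of rules can be sketched in Python as a first-match chain with a default verdict. The representation of an n-gram as a list of token dictionaries, the keys `gcst` and `rubbish`, and the three simplified predicates are assumptions for illustration; they only mimic the first rules of the unigram sequence.

```python
def classify(ngram, rules):
    """Apply rules in order; the first rule whose predicate holds decides.

    rules: list of (name, kind, predicate) with kind "accept" or "reject".
    Returns (verdict, name of the deciding rule or None).
    """
    for name, kind, predicate in rules:
        if predicate(ngram):
            verdict = "term-like" if kind == "accept" else "non-term-like"
            return verdict, name
    # No rule applied: the n-gram is still considered term-like.
    return "term-like", None

# Simplified stand-ins for the first rules of the unigram sequence.
UNIGRAM_RULES = [
    ("GeneralChemTermRule", "accept",
     lambda g: all(t.get("gcst") for t in g)),
    ("StrictFilteringTagRule", "reject",
     lambda g: any(t.get("rubbish") for t in g)),
    ("ShortTokensRule", "reject",
     lambda g: len(g) == 1 and len(g[0]["text"]) < 3),
]

general = classify([{"text": "solvation", "gcst": True}], UNIGRAM_RULES)
noisy = classify([{"text": "R873", "rubbish": True}], UNIGRAM_RULES)
short = classify([{"text": "pH"}], UNIGRAM_RULES)
other = classify([{"text": "zeolite"}], UNIGRAM_RULES)
```

Because the chain stops at the first matching rule, the order of the rules matters: an n-gram recognized as a general chemistry term is accepted before any reject rule is consulted.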

Table 5. Succession of accept and reject rules for unigrams (1-grams)
Description Examples
GeneralChemTermRule (accept rule)
 
True if a 1-gram is a general chemistry scientific term
StrictFilteringTagRule (reject rule)
 
True if a 1-gram consists of a token with the strict filtering tag rubbish:true
ShortTokensRule (reject rule)
 
True if a 1-gram consists of a short token of length less than three characters; this rule is to exclude noise existing in documents such as axes labels and so on.
UnitsRule (reject rule)
 
True if a 1-gram contains a string being a measurement unit from the dictionary (Table 1)
ChemUnigramRule (accept rule)
 
True if a 1-gram is tagged by any OSCAR tag and by one of the following POS tags: FW, NNP, or tagged by tag COMP; selected unigrams are assumed and marked to have a chemical sense
Term-like: barium, phenanthrene, pentanol, xanes
GeneralEnglishDictRule (reject rule)
 
True if a 1-gram is in the General English Dictionary (Table 1)
Filtered: topography, paint, plateau, pool, searching, file, addenda, improvement, theme …
 
Term-like: hydrocalcite, acetylacetone, cracking, ageing
UnigramPOSRule (reject rule)
 
True if a 1-gram is not a noun or a gerund; term-like 1-gram must be tagged with the following POS tags: VBG, NN, NNPS, NNS
Filtered: schematized, suddenly, skeletal, behind
 
Term-like: ethylene, hydrocalcite, leaching, 12n-decylhexadecanamide, sulfamethoxazole, anchoring
UnigramAddRules (reject rules)
 
Set of regular expressions to filter unigrams denoting various ions, signs, captions, etc.
Filtered: M(O2), GA15.6, PW91, V2.1, G(D), TI(V), PD(I), PT0, P(X), BA2+, CE(3+), cm3, CH3, AA, Cu2+, Mo6+, Et-CP, GC–MS, Zn-Al
Table 6. Succession of reject and accept rules for bigrams (2-grams)
Description Examples
GeneralChemTermRule (accept rule)
 
Same rule as for 1-grams
StrictFilteringTagRule (reject rule)
 
Same rule as for 1-grams
ShortTokensRule (reject rule)
 
True if a 2-gram consists of only short tokens of length less than three characters
IdenticalTokensRule (reject rule)
 
True if a 2-gram contains at least two identical tokens
UnitsRule (reject rule)
 
True if any token in a 2-gram ends with measurement unit string from the dictionary (Table 1); it should be noted that measurement unit may consist of several tokens, for example, the "g/h" consists of three tokens ["g", "/", "h"]
PPM C7H14, 70ML MIN-1, CM3MIN-1 H2, MIN-1 FLOW, H-1 GAS, PPM N2O/AR, ML G-1MIN-1, MOL-1 HYDROLYSIS, PPM NOX/5%O2/N2
BiGramPOSRule (accept rule with exception)
 
True if the first token is tagged with one of the following POS tags: JJ, JJR, FW, VBG, VBD, VBN, NN, NNP, NNPS, NNS; and the second token is tagged with one of: FW, VBG, NN, NNP, NNPS, NNS
 
Exception (combinations that are not allowed): VBG, VBG; VBG, FW; and NNP, FW
Term-like: Andronov bifurcation, Na2CO3 impregnation, nickel catalyst; supported MgO, anchored lysine, stirred glass; carbonaceous particle, temperature-programmed adsorption, Fischer–Tropsch catalyst; in situ EXAF, UV–VIS spectroscopy, Raman spectroscopy
 
Filtered due to exception: involving reforming, reforming minimizing, using in, Shimada etc.
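The BiGramPOSRule of Table 6 can be sketched in Python as two tag sets plus a set of forbidden pairs. Reading the exception list as the three pairs (VBG, VBG), (VBG, FW) and (NNP, FW) is an assumption, although it agrees with the filtered examples in the table; the function name `bigram_pos_rule` is hypothetical.

```python
FIRST_TAGS = {"JJ", "JJR", "FW", "VBG", "VBD", "VBN", "NN", "NNP", "NNPS", "NNS"}
SECOND_TAGS = {"FW", "VBG", "NN", "NNP", "NNPS", "NNS"}

# Assumed reading of the exception list as (first tag, second tag) pairs.
FORBIDDEN_PAIRS = {("VBG", "VBG"), ("VBG", "FW"), ("NNP", "FW")}

def bigram_pos_rule(pos1, pos2):
    """Accept a 2-gram when both POS tags are allowed and the pair is not forbidden."""
    return (pos1 in FIRST_TAGS
            and pos2 in SECOND_TAGS
            and (pos1, pos2) not in FORBIDDEN_PAIRS)

noun_noun = bigram_pos_rule("NN", "NN")        # e.g. "nickel catalyst"
participle = bigram_pos_rule("VBN", "NN")      # e.g. "anchored lysine"
two_gerunds = bigram_pos_rule("VBG", "VBG")    # e.g. "involving reforming"
name_foreign = bigram_pos_rule("NNP", "FW")    # e.g. "Shimada etc."
```

The POS tags in the comments are plausible taggings of the table's examples, not tagger output.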
Table 7. Succession of reject and accept rules for n-grams (n ≥ 3)
Description Examples
GeneralChemTermRule (accept rule)
 
Same rule as for 2-grams
StrictFilteringTagRule (reject rule)
 
Same rule as for 2-grams
ShortTokensRule (reject rule)
 
Same rule as for 2-grams
IdenticalTokensRule (reject rule)
 
Same rule as for 2-grams
UnitsRule (reject rule)
 
Same rule as for 2-grams
ManyGramPOSRule (accept rule with exception)
 
True if the first token is tagged with one of the following POS tags (noun, gerund, adjective, adverb or participle): NN, NNP, VBG, VBD, VBN, JJ, JJR, RB, RBS, FW; each middle token, in any position, is tagged with one of: NN, NNP, VBG, VBD, VBN, JJ, JJR, RB, RBS, FW, or additionally IN, DT (preposition or determiner); and the last token is tagged with one of: VBG, NN, NNP, NNPS, NNS (gerund or noun)
 
Exception (combinations that are not allowed, describing phrases that look as if they were torn from their context): VBG, NN, VBG, IN, VBN, NN, VBN, JJ
Term-like: X-ray fluorescence spectrometer; Brønsted basic site; Pd(110) surface oscillation; doping CsPW with platinum; catalyzed N2O decomposition; crystalline phase transition; catalyzed oxidation of NO; complete photoreduction of Pd(II); propagating thermosynthesis; reforming of the biomass; drying inside the microscope column
 
Filtered due to exception: used during steam reforming; catalyzed by metalloporphyrin; investigated by XRD; using atomic absorption

References

  1. Salton, G. (1991). "Developments in Automatic Text Retrieval". Science 253 (5023): 974–980. doi:10.1126/science.253.5023.974. PMID 17775340. 
  2. "IUPAC Gold Book". International Union of Pure and Applied Chemistry. 2014. http://goldbook.iupac.org/. 
  3. Hussey, R.; Williams, S.; Mitchell, R. (2012). "Automatic keyphrase extraction: A comparison of methods". eKNOW, Proceedings of The Fourth International Conference on Information Process, and Knowledge Management: 18–23. ISBN 9781612081816. 
  4. Eltyeb, S.; Salim, N. (2014). "Chemical named entities recognition: a review on approaches and applications". Journal of Cheminformatics 6: 17. doi:10.1186/1758-2946-6-17. PMC 4022577. PMID 24834132. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4022577. 
  5. Gurulingappa, H.; Mudi, A.; Toldo, L.; Hofmann-Apitus, M.; Bhate, J. (2013). "Challenges in mining the literature for chemical information". RSC Advances 2013 (3): 16194-16211. doi:10.1039/C3RA40787J. 
  6. Kim, S.N.; Madelyan, O.; Kan, M.-Y.; Baldwin, T. (2013). "Automatic keyphrase extraction from scientific articles". Language Resources and Evaluation 47 (3): 723–742. doi:10.1007/s10579-012-9210-3. 
  7. Jessop, D.M.; Adams, S.E.; Willighagen, E.L.; Hawizy, L.; Murray-Rust, P. (2011). "OSCAR4: A flexible architecture for chemical text-mining". Journal of Cheminformatics 3: 41. doi:10.1186/1758-2946-3-41. PMC 3205045. PMID 21999457. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3205045. 
  8. Hawizy, L.; Jessop, D.M.; Adams, N.; Murray-Rust, P. (2011). "ChemicalTagger: A tool for semantic text-mining in chemistry". Journal of Cheminformatics 3: 17. doi:10.1186/1758-2946-3-17. PMC 3117806. PMID 21575201. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117806. 
  9. "Re-examining automatic keyphrase extraction approaches in scientific articles". MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications: 9–16. 2009. ISBN 9781932432602. 
  10. "Approximate matching for evaluating keyphrase extraction". RANLP '09: International Conference on Recent Advances in Natural Language Processing: 484–489. 2009. 
  11. Castellvi, M.T.C.; Bagot, R.E.; Palatresi, J.V. (2001). "Automatic term detection: A review of current systems". In Bourigault, D.; Jacquemin, C.; L'Homme, M.-C.. Recent Advances in Computational Terminology. John Benjamins Publishing Company. pp. 53–87. doi:10.1075/nlp.2.04cab. ISBN 9789027298164. 
  12. Bolshakova, E.I.; Efremova, N.E. (2015). "A Heuristic Strategy for Extracting Terms from Scientific Texts". In Khachay, M.Y.; Konstantinova, N.; Panchenko, A.; Ignatov, D.I.; Labunets, V.G.. Analysis of Images, Social Networks and Texts. Springer International Publishing. pp. 297-307. doi:10.1007/978-3-319-26123-2_29. ISBN 9783319261232. 
  13. Salton, G.; Buckley, C. (1991). "Global Text Matching for Information Retrieval". Science 253 (5023): 1012–1015. doi:10.1126/science.253.5023.1012. PMID 17775345. 
  14. Chodorow, K.; Dirolf, M. (2010). MongoDB: The Definitive Guide. O'Reilly Media. ISBN 9781449381561. 
  15. "PDFxStream". Snowtide Informatics Systems, Inc. 2016. https://www.snowtide.com/. 
  16. "Stanford CoreNLP – A suite of core NLP tools". Github. 2016. http://stanfordnlp.github.io/CoreNLP/. 
  17. "The Stanford CoreNLP Natural Language Processing Toolkit". Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations: 55–60. 2014. doi:10.3115/v1/P14-5010. 
  18. Toutanova, K.; Klein, D.; Manning, C.D.; Singer, Y. (2003). "Feature-rich part-of-speech tagging with a cyclic dependency network". NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology 1: 173–180. doi:10.3115/1073445.1073478. 
  19. Taylor, A.; Marcus, M.; Santorini, B. (2003). "The Penn Treebank: An Overview". In Abeillé, A.. Text, Speech and Language Technology. 20. Springer Netherlands. pp. 5–22. doi:10.1007/978-94-010-0201-1_1. ISBN 978-94-010-0201-1. 
  20. "Semantic enrichment of journal articles using chemical named entity recognition". ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions: 45–48. 2007. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Numerous grammar errors were also corrected throughout the entire text.