Journal:A robust, format-agnostic scientific data transfer framework
Full article title | A robust, format-agnostic scientific data transfer framework |
---|---|
Journal | Data Science Journal |
Author(s) | Hester, James |
Author affiliation(s) | Australian Nuclear Science and Technology Organisation |
Primary contact | Email: jxh at ansto dot gov dot au |
Year published | 2016 |
Volume and issue | 15 |
Page(s) | 12 |
DOI | 10.5334/dsj-2016-012 |
ISSN | 1683-1470 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | http://datascience.codata.org/articles/10.5334/dsj-2016-012/ |
Download | http://datascience.codata.org/articles/10.5334/dsj-2016-012/galley/605/download/ (PDF) |
Abstract
The olog approach of Spivak and Kent[1] is applied to the practical development of data transfer frameworks, yielding simple rules for construction and assessment of data transfer standards. The simplicity, extensibility and modularity of such descriptions allow discipline experts unfamiliar with complex ontological constructs or toolsets to synthesize multiple pre-existing standards, potentially including a variety of file formats, into a single overarching ontology. These ontologies nevertheless capture all scientifically relevant prior knowledge, and when expressed in machine-readable form are sufficiently expressive to mediate translation between legacy and modern data formats. A format-independent programming interface informed by this ontology consists of six functions, of which only two handle data. Demonstration software implementing this interface is used to translate between two common diffraction image formats using such an ontology in place of an intermediate format.
Keywords: metadata, ontology, knowledge representation, data formats
Introduction
For most of scientific history, results and data were communicated using words and numbers on paper, with correct interpretation of this information reliant on the informal standards created by scholarly reference works, linguistic background, and educational traditions. Modern scientists increasingly rely on computers to perform such data transfer, and in this context the sender and receiver agree on the meaning of the data via a specification as interpreted by authors of the sending and receiving software. Recent calls to preserve raw data[2][3] and a growing awareness of a need to manage the explosion in the variety and quantity of data produced by modern large-scale experimental facilities (big data) have led to an increase in the number and coverage of these data transfer standards. Overlap in the areas of knowledge covered by each standard is increasingly common, either because the newer standards aim to replace older ad hoc or de facto standards, or because of natural expansion into the territory of ontologically “neighboring” standards. One example of such overlap is found in single-crystal diffraction: the newer NeXus standard for raw data[4] partly covers the same ontological space as the older imgCIF standard[5], and both aim to replace the multiplicity of ad hoc standards for diffraction images.
Authors of scientific software faced with multiple standards generally write custom input or output modules for each standard. For example, the HKL Research, Inc. suite of diffraction image processing programs accepts over 300 different formats.[6] In such software, broadly useful information on equivalences and transformations is crystallized in code that is specific to a programming language and software environment and is therefore difficult for other authors faced with the same problems to reuse, even if code is freely available. Such uniform processing and merging of disparate standards has been extensively studied by the knowledge representation community: it is one outcome of "ontological alignment" or "ontological mapping," which has been the subject of hundreds of publications over the last decade.[7] Despite the availability of ontological mapping tools, Otero-Cerdeira, Rodríguez-Martínez, & Gómez-Rodríguez note that relatively few ontology matching systems are put to practical use (see their section 4.5). One barrier to adoption is likely to be the need for the discipline experts driving standards development to learn ontological concepts and terminology in order to evaluate and use ontological tools: the effort required to master these tools may not be judged to yield commensurate benefits in situations where communities have historically been able to transfer data reliably without such formal approaches. Introduction of ontological ideas into data transfer would therefore stand more chance of success if those ideas are simple to understand and implement, as well as offering tangible benefits over the status quo. Indeed one of the challenges noted by Otero-Cerdeira et al. is to "define good tools that are easy to use for non-experts."
Much of the research listed by Otero-Cerdeira et al. has understandably been predicated on reducing human involvement in the mapping process, although expert human intervention is still currently required. In contrast to the thousands of terms found in ontologies tackled by ontological mapping projects, data files in the experimental sciences usually contain information relating to a few dozen well-defined scientific concepts, and so manual handling of ontologies is feasible. The present paper therefore adopts the practical position that, if involvement of discipline experts is unavoidable, then the method of representing the ontology should be as accessible as possible to those experts. An easily-applied framework for scientist-driven formalization, development and assessment of data transfer standards is presented, aimed at minimizing the complexity of the task, while promoting interoperability and minimizing duplication of programmer and domain expert effort.
After describing the framework in the next section, we demonstrate the utility of these concepts by discussing schemes for standards development and later semiautomatic data file translation.
A conceptual framework for data file standards
The framework described here covers systems for automated transfer and manipulation of scientific data. In other words, following creation of the reading and writing software in consultation with the data standard, no further human intervention is necessary in order to automatically create, ingest, and perform calculations on data from standards-conformant data files. Note that simple transfer of information found in the data file to a human reader — for example, presentation of text or graphics — is of minor significance in this context, as such operations, while useful, do not require any interpretation of the data by the computer and are in essence identical to traditional paper-based transfer of information from writer to reader.
Terminology used in this paper is defined in Table 1. The process of scientific data transfer is described using these terms as follows: in consultation with the ontology, authors of file output software determine the required or possible list of datanames for their particular application, then correlate concepts handled by their code to these datanames, arranging for the appropriate values to be linked to the datanames within the output data format according to the specifications within the format adapter. A file in this format is then transferred or archived. At some point, software written in consultation with the same format adapter and ontology extracts datavalues from the file and processes them correctly.
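The transfer process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two datanames, their definitions, the JSON layout, and the names `ONTOLOGY`, `JSON_ADAPTER`, `write_file` and `read_file` are all hypothetical and are not taken from CIF, NeXus, or the demonstration software. The point is that the writer and the reader share nothing except the ontology (what each dataname means) and the format adapter (where each dataname lives in the file).

```python
import json
from typing import Any, Dict, Iterable, List

# Hypothetical ontology: dataname -> meaning agreed on by writer and reader.
ONTOLOGY: Dict[str, str] = {
    "wavelength": "incident radiation wavelength, in angstroms",
    "exposure_time": "detector exposure time per frame, in seconds",
}

# Hypothetical format adapter: dataname -> key path inside a JSON document.
JSON_ADAPTER: Dict[str, List[str]] = {
    "wavelength": ["instrument", "source", "wavelength"],
    "exposure_time": ["scan", "exposure_time"],
}


def write_file(values: Dict[str, Any], adapter: Dict[str, List[str]]) -> str:
    """Writer side: place each datavalue where the adapter says it belongs."""
    document: Dict[str, Any] = {}
    for dataname, value in values.items():
        if dataname not in ONTOLOGY:
            raise KeyError(f"{dataname!r} is not defined in the ontology")
        *parents, leaf = adapter[dataname]
        node = document
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return json.dumps(document)


def read_file(text: str, adapter: Dict[str, List[str]],
              wanted: Iterable[str]) -> Dict[str, Any]:
    """Reader side: recover datavalues using the same adapter and ontology."""
    document = json.loads(text)
    result: Dict[str, Any] = {}
    for dataname in wanted:
        node: Any = document
        for key in adapter[dataname]:
            node = node[key]
        result[dataname] = node
    return result


# Placeholder values, chosen only to exercise the round trip.
sent = {"wavelength": 1.54, "exposure_time": 10.0}
transferred = write_file(sent, JSON_ADAPTER)
received = read_file(transferred, JSON_ADAPTER, ["wavelength", "exposure_time"])
```

In this sketch, supporting a second file format would mean supplying a different adapter, while the ontology and the calling code stay the same.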
Following Shvaiko & Euzenat[8], the word "ontology" as used in this paper refers to a system of interrelated terms and their meanings, regardless of the way in which those meanings are represented or described. Under this definition, Table 1 is itself an ontology for use solely by the human reader in understanding the present paper. An ontology may be encoded using a language such as OWL[9] to produce a human- and machine-readable document allowing some level of machine verification, deduction and manipulation.
This paper makes frequent reference to two established data transfer standards in the area of experimental science: the Crystallographic Information Framework (CIF)[10] and the NeXus standard.[11]
Constructing the ontology
In general, a complete data transfer ontology for some field would include all of the distinct concepts and relationships used by scientific software authors in the process of constructing software, including scientific, programming, and format-specific terminology. A clear dividing line may be drawn between the scientific components of the ontology and the remainder, by relying on the assertion that scientific concepts and their relationships are dictated by the real world, not by the particular arrangement in which the data appear; that is, a scientific ontology may be completely specified independent of a particular format.
Furthermore, the scheme presented below assumes that the scientific knowledge informing the ontology is already shared by the software authors implementing the standard. These software authors are one of the main consumers of the ontology, so we do not require the level of machine-readability offered by ontology description languages such as OWL; rather we seek the minimum level of sophistication necessary to describe to a human the correct interpretation of the data, while at the same time including properties that allow coherent expansion and curation of the ontology. Such ontologies should be maximally accessible to experts in the scientific field who are not necessarily programmers or familiar with ontological constructs, in order to allow broad-based contribution and review.
A suitably simple but powerful system for expressing ontologies has been presented by Spivak & Kent[1], who propose using category-theoretic box and arrow diagrams which they call ologs (from "ontology logs"). A concept in an ontology is drawn as an arrow ("aspect") between boxes ("types"): the arrow denotes a mapping between elements in the sets represented by the boxes. A simple ontology written using this approach is shown in Figure 1, which might be used to describe a data file containing the values of neutron cross-section. This olog shows that the concept "measured neutron scattering cross-section" maps every atomic element to a value in barns. We can therefore specify "measured neutron scattering cross-section" as a (domain, function, codomain)[a] triple of ({element names}, ‘cross-section measurement’, {(r, “barns”) : r ∈ ℝ}). Each of the datanames in our ontology is associated with such a triple, so that the values that the dataname takes are the results of applying the associated function to each of the elements of the domain. Given this formulation of an ontology, it follows that the scientifically useful content of a datafile consists solely of the values taken by the datanames in their codomains, and the matching domain values. In other words, a datafile documents an instance of the olog.
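As a rough illustration of the triple formulation, the following Python sketch models a dataname as a domain, a codomain membership test, and a check that a datafile's contents form a total mapping from one to the other. The class `Dataname`, the helper `is_barns`, the three-element domain, and the numerical values are all assumptions made for this example; the numbers are placeholders, not measured cross-sections.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class Dataname:
    """A dataname viewed as a (domain, function, codomain) triple.

    The function itself is the measurement; a datafile records its values,
    so only the domain and a codomain membership test are stored here.
    """
    name: str
    domain: FrozenSet[str]
    in_codomain: Callable[[object], bool]

    def check_instance(self, values: Dict[str, object]) -> None:
        """Raise ValueError unless `values` maps every domain element
        (and nothing else) to a member of the codomain."""
        keys = set(values)
        if keys != self.domain:
            raise ValueError(
                f"{self.name}: missing={sorted(self.domain - keys)}, "
                f"unexpected={sorted(keys - self.domain)}"
            )
        for element, value in values.items():
            if not self.in_codomain(value):
                raise ValueError(
                    f"{self.name}: {value!r} for {element!r} is outside the codomain"
                )


def is_barns(value: object) -> bool:
    """Membership test for the codomain {(r, "barns") : r real}."""
    return (
        isinstance(value, tuple)
        and len(value) == 2
        and isinstance(value[0], (int, float))
        and not isinstance(value[0], bool)
        and value[1] == "barns"
    )


# Truncated domain for illustration only; the olog's domain is all element names.
ELEMENTS = frozenset({"H", "C", "O"})

cross_section = Dataname(
    "measured neutron scattering cross-section", ELEMENTS, is_barns
)

# Placeholder numbers, not real measurements.
datafile_contents = {"H": (1.0, "barns"), "C": (2.0, "barns"), "O": (3.0, "barns")}
cross_section.check_instance(datafile_contents)  # passes silently
```

Under these assumptions, `datafile_contents` plays the role of an instance of the olog: it pairs each domain value with the value the dataname takes in its codomain, and nothing else in the file would carry scientific content.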
Notes
- ↑ Note that "domain" and "codomain" are used throughout in the mathematical sense, as the set on which a function operates, and the set of resulting values, respectively.
References
- ↑ 1.0 1.1 Spivak, D.I.; Kent, R.E. (2012). "Ologs: A categorical framework for knowledge representation". PLoS One 7 (1): e24274. doi:10.1371/journal.pone.0024274. PMC PMC3269434. PMID 22303434. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269434.
- ↑ Boulton, G. (2012). "Open your minds and share your results". Nature 486 (7404): 441. doi:10.1038/486441a. PMID 22739274.
- ↑ Kroon-Batenburg, L.M.; Helliwell, J.R. (2014). "Experiences with making diffraction image data available: What metadata do we need to archive?". Acta Crystallographica Section D Biological Crystallography 70 (Pt. 10): 2502–9. doi:10.1107/S1399004713029817. PMC PMC4187998. PMID 25286836. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4187998.
- ↑ "NXmx – Nexus: Manual 3.1 documentation". NeXusformat.org. NIAC. 2015. http://download.nexusformat.org/doc/html/classes/applications/NXmx.html.
- ↑ Bernstein, H.J. (2006). "Classification and use of image data". International Tables for Crystallography G (3.7): 199–205. doi:10.1107/97809553602060000739.
- ↑ "Detectors & Formats recognized by the HKL/HKL-2000/HKL-3000 Software". HKL Research, Inc. 2016. http://www.hkl-xray.com/detectors-formats-recognized-hklhkl-2000hkl-3000-software.
- ↑ Otero-Cerdeira, L.; Rodríguez-Martínez, F.J.; Gómez-Rodríguez, A. (2015). "Ontology matching: A literature review". Expert Systems with Applications 42 (2): 949–971. doi:10.1016/j.eswa.2014.08.032.
- ↑ Shvaiko, P.; Euzenat, J. (2013). "Ontology matching: State of the art and future challenges". IEEE Transactions on Knowledge and Data Engineering 25 (1): 158–176. doi:10.1109/TKDE.2011.253.
- ↑ Hitzler, P.; Krötzsch, M.; Parsia, B. et al. (11 December 2012). "OWL 2 Web Ontology Language Primer (Second Edition)". W3C Recommendations. W3C. https://www.w3.org/TR/2012/REC-owl2-primer-20121211/.
- ↑ Hall, S.R.; McMahon, B., ed. (2005). International Tables for Crystallography Volume G: Definition and exchange of crystallographic data. Springer Netherlands. doi:10.1107/97809553602060000107. ISBN 9781402042904.
- ↑ Könnecke, M.; Akeroyd, F.A.; Bernstein, H.J. et al. (2015). "The NeXus data format". Journal of Applied Crystallography 48 (Pt. 1): 301–305. doi:10.1107/S1600576714027575. PMC PMC4453170. PMID 26089752. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4453170.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version — by design — lists them in order of appearance. Footnotes have been changed from numbers to letters as citations are currently using numbers.