Difference between revisions of "Journal:Mapping hierarchical file structures to semantic data models for efficient data integration into research data management systems"

Although other methods exist to store and [[Information management|manage data]] using modern information technology (IT), file systems remain the standard solution. Therefore, maintaining well-organized file structures and file system layouts can be key to a sustainable [[research]] data management infrastructure. However, file structures alone lack several important capabilities for [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR]] (findable, accessible, interoperable, and reusable) data management, the two most significant being insufficient [[Data visualization|visualization of data]] and inadequate possibilities for searching and for obtaining an overview. Research data management systems (RDMSs) can fill this gap, but many do not support the simultaneous use of the file system and the RDMS. This simultaneous use can have many benefits, but keeping data in an RDMS in synchrony with the file structure is challenging.


Here, we present concepts that allow for keeping file structures and [[semantic data model]]s (found in RDMSs) synchronous. Furthermore, we propose a specification in [[YAML]] format that allows for a structured and extensible declaration and implementation of a [[Data mapping|mapping]] between the file system and data models used in semantic research data management. Implementing these concepts will facilitate the re-use of specifications for multiple use cases. Furthermore, the specification can serve as a machine-readable and, at the same time, human-readable documentation of specific file system structures. We demonstrate our work using the open-source RDMS LinkAhead (previously named “CaosDB”).


'''Keywords''': research data management, FAIR, file structure, file crawler, metadata, semantic data model


==Introduction==
[[Information management|Data management]] for [[research]] is part of an active transformation in science: effective management is required to cope with increasing amounts of complex data. Furthermore, the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR guiding principles]] [1] for scientific data, which call for research objects to be more findable, accessible, interoperable, and reusable, and which are an elementary part of numerous data management plans, funding guidelines, and data management strategies of research organizations [2,3], oblige scientists to review and enhance their established data management [[workflow]]s.


One particular focus of this endeavor is the introduction and expansion of research data management systems (RDMSs). These systems help researchers organize their data during the whole data management life cycle, especially by increasing findability and accessibility. [4] Furthermore, [[Semantics|semantic]] data management approaches [5] can increase the reuse and reproducibility of data that are typically organized in file structures. As has been pointed out by Gray ''et al.'' [4], one major shortcoming of file systems is the lack of rich [[metadata]] features, which additionally limits search options. Typically, RDMSs employ [[database]] management systems (DBMSs) to store data and metadata, but the degree to which data is migrated, linked, or synchronized into these systems can vary substantially.


The import of data into an RDMS typically requires the development of [[data integration]] procedures that are tied to the specific workflows at hand. While very few standard products exist [6], in practice, data integration in scientific environments is mostly handled by custom software written in various programming languages and making use of a wide variety of software packages. There are two main workflows for integrating data into RDMSs: manually inputting data (e.g., using forms [7]) or batch-importing data sets. The automated methods often include data import routines for predefined formats, like tables in Excel or CSV format. [8,9] Some systems include plugin systems to allow for configuration of the data integration process. [10] Sometimes, data files have to be uploaded using a web front-end [11] and are afterwards attached to objects in the RDMS. In general, developing this kind of software can be considered very costly [6], as it is highly dependent on the specific environment. Data import can still be considered one of the major bottlenecks for the adoption of an RDMS.
There are several advantages to using an RDMS over organizing data in classical file hierarchies. An RDMS offers greater flexibility in adding metadata to data sets, a capability that is limited in classical file systems. The standardized representation in an RDMS improves the comparability of data sets that possibly originate from different file formats and data representations. Furthermore, semantic information can be seamlessly integrated, possibly using standards like RDF [12] and OWL. [13] The semantic information allows for advanced querying and searching, e.g., using SPARQL. [14] Linked data [15,16] and FAIR digital objects (FDOs [17]) provide overarching frameworks for achieving more standardized representations within RDMSs and for publication on the web. Specifically, the FDO concept aims at bundling data sets with a persistent identifier (PID) and their metadata into self-contained units. These units are designed to be machine-actionable and interoperable, so that they have the potential to build complex and distributed data processing infrastructures. [17]
===Using file systems and RDMSs simultaneously===
Despite the advantages mentioned above, RDMSs have yet to gain widespread adoption. One of the key problems in the employment of an RDMS in an active research environment is that a full transition to such a system is very difficult, as most digital scientific workflows are in one or multiple ways dependent on classical hierarchical file systems. [4] Examples include data acquisition and measurement devices, data processing and analysis software, and digitized [[laboratory]] notes and material for publications. A complete transition to an RDMS would require developing data integration procedures (e.g., extract, transform, load [ETL] [6,18] processes) for every digital workflow in the lab and providing interfaces for input and output to any other software involved in these workflows.
As files on classical file systems play a crucial role in these workflows, our aim is to develop a robust strategy to use file systems and an RDMS simultaneously. Rather than requiring a full transition to an RDMS, we want to make use of the file system as an interoperability layer between the RDMS and any other file-based workflow in the research environment.
There are two important tasks that need to be solved and that are the main focus of this article:
#There must be a method to keep data and metadata in the RDMS synchronized with data files on the file system. Using that method, the file system can be used as an interoperability layer between the RDMS and other software and workflows. Our approach to solving this issue is discussed in detail in the results section. One key component of the synchronization method is defining the concept of "identity" for data in the RDMS, also discussed in the results section about identifiables.
#The high variety of different data structures found on the file system needs an adaptive and flexible approach for data integration and synchronization into the RDMS. We discuss our solution for this task in the results section concerning [[YAML]], where we present a standardized but highly configurable format for mapping information from files to a [[semantic data model]].
Apart from the main motivation described above, we have identified several additional advantages of using a conventional folder structure alongside an RDMS: standard tools for managing the files can be used for [[backup]] (e.g., rsync), [[Version control|versioning]] (e.g., git), archiving, and file access (e.g., SSH). The functionality of these tools does not need to be re-implemented in the RDMS. Furthermore, the file system can act as a fallback in cases where the RDMS becomes unavailable; this methodology therefore increases robustness. As a third advantage, existing workflows relying on storing files in a file system do not need to be changed, while the data simultaneously become findable within the RDMS.
The concepts described in this article can be used independently of a specific RDMS software. However, as a proof of concept, we implemented the approach as part of the file crawler framework that belongs to the open-source RDMS LinkAhead (recently renamed from CaosDB). [19,20] The crawler framework is released as [[open-source software]] under the AGPLv3 license (see Appendix A).
===Example data set===
We will illustrate the problem of integrating research data using a simplified example that is based on the work of Spreckelsen ''et al.'' [21] This example will be used in the results section to demonstrate our data integration concepts. Examples for more complex data integration, e.g., for data sets found in the neurosciences (BIDS [22] and DICOM [23]) and in the geosciences, can be found online (see Appendix B). Although the concept is not restricted to data stored on file systems, in this example we will assume for simplicity that the research data are stored on a standard file system with a well-defined file structure layout:
<tt>ExperimentalData/<br />
:2020_SpeedOfLight/<br />
::2020-01-01_TimeOfFlight<br />
:::README.md<br />
:::...<br />
::2020-01-02_Cavity<br />
:::README.md<br />
:::...<br />
::2020-01-03<br />
:::README.md<br />
:::...</tt>
The above listing replicates an example with experimental data from Spreckelsen ''et al.'' [21] using a three-level folder structure:
* Level 1 (<tt>ExperimentalData</tt>) stores rough categories for data; in this case, data acquired from experimental measurements.
* Level 2 (<tt>2020_SpeedOfLight</tt>) is the level of project names, grouping data into independent projects.
* Level 3 stores the actual measurement folders, which can also be referred to as “scientific activity” folders in the general case. Each of these folders could have an arbitrary substructure and store the actual experimental data along with a README.md file containing metadata.
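The mapping implied by this three-level layout can be sketched in plain Python (a hypothetical, minimal stand-in for a crawler; the regular expressions and record fields are illustrative, not part of the specification):

```python
import re
from pathlib import Path

# Illustrative patterns for levels 2 and 3 of the folder structure above.
PROJECT_RE = re.compile(r"^(?P<year>\d{4})_(?P<name>.+)$")
ACTIVITY_RE = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})(?:_(?P<identifier>.+))?$")

def crawl(root: Path) -> list[dict]:
    """Walk the three-level structure and emit one record per activity folder."""
    records = []
    for category in root.iterdir():                 # level 1: category folders
        if not category.is_dir():
            continue
        for project in category.iterdir():          # level 2: project folders
            m_proj = PROJECT_RE.match(project.name)
            if not (project.is_dir() and m_proj):
                continue
            for activity in project.iterdir():      # level 3: activity folders
                m_act = ACTIVITY_RE.match(activity.name)
                if not (activity.is_dir() and m_act):
                    continue
                records.append({
                    "category": category.name,
                    "project": m_proj.group("name"),
                    "year": int(m_proj.group("year")),
                    "date": m_act.group("date"),
                    "identifier": m_act.group("identifier"),  # None for bare dates
                    "readme": activity / "README.md",
                })
    return records
```

Applied to the listing above, such a walk would yield three records for the <tt>2020_SpeedOfLight</tt> project, one per measurement folder.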
The generic use case of integrating data from file systems involves the following subtasks:
#Identify the required data for integration into the RDMS. This can involve information contained in the file structure (e.g., file names, path names, or file extensions) or in the contents of the files themselves.
#Define an appropriate (semantic) data model for the desired data.
#Specify the data integration procedure that [[Data mapping|maps data]] found on the file system (including data within the files) to the (semantic) data in the RDMS.
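For the example structure above, such a declarative mapping could look like the following YAML sketch. This is only an illustration of the idea; the exact keys, node types, and substitution syntax of the specification are defined in the results section, and the names used here are assumptions:

```yaml
ExperimentalData:                  # level 1: category folder
  type: Directory
  match: ExperimentalData
  subtree:
    project_dir:                   # level 2: project folder
      type: Directory
      match: (?P<year>\d{4})_(?P<name>.+)
      records:
        Project:
          name: $name
          year: $year
      subtree:
        activity_dir:              # level 3: scientific activity folder
          type: Directory
          match: (?P<date>\d{4}-\d{2}-\d{2})(_(?P<identifier>.+))?
          records:
            Measurement:
              date: $date
              project: $Project    # link to the enclosing Project record
```

Declaring the mapping in this way, rather than programming it, keeps it both machine-readable and human-readable, and it can be re-used as documentation of the file structure itself.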
A concrete example of this procedure, including a semantic data model, is provided in the results section. As previously stated, there are already many use cases that can benefit from the simultaneous use of the file system and RDMS. Therefore, it is important to implement reliable means for identifying and transferring the data not only once, as a single “data import,” but also while allowing for frequent updates of existing or changed data. Such an update might be needed if an error in the raw data has been detected: it can then be corrected on the file system, with the changes needing to be propagated to the RDMS. Another possibility is that data files that are actively worked on have been inserted into the RDMS; third-party software is used to process these files and, consequently, the information taken from the files has to be frequently updated in the RDMS.
We use the term “synchronization” here to refer to the possible insertion of new data sets and to update existing data sets in the same procedure. To avoid confusion, we want to explicitly note here that we are not referring to bi-directional synchronization. Bi-directional synchronization means that information from RDMS that is not present in the file system can be propagated back to the file system, which is not possible in our current implementation. Although ideas exist to implement bi-directional synchronization in the future, in the current work (and also the current software implementation), we focus on the uni-directional synchronization from the file system to the RDMS. The outlook for adding extensions to bi-directional synchronization will be discussed in the discussion section.
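The role of identity in this uni-directional synchronization can be illustrated with a small Python sketch (hypothetical code, not the LinkAhead API): each record type declares which properties establish identity, and the synchronizer decides between insert and update by looking those properties up in the RDMS, so that re-running the crawler updates rather than duplicates existing entities:

```python
# Which properties establish identity, per record type (illustrative choice).
IDENTIFYING_PROPERTIES = {
    "Project": ("name", "year"),
    "Measurement": ("project", "date"),
}

def identity_key(record: dict) -> tuple:
    """Reduce a record to the properties that define its identity."""
    props = IDENTIFYING_PROPERTIES[record["type"]]
    return (record["type"],) + tuple(record.get(p) for p in props)

def synchronize(rdms: dict, crawled: list[dict]) -> dict:
    """Insert new records, update existing ones; never delete (uni-directional).

    `rdms` is an in-memory stand-in for the RDMS; a real system would
    query a server for entities matching the identifying properties.
    """
    stats = {"inserted": 0, "updated": 0}
    for record in crawled:
        key = identity_key(record)
        if key in rdms:
            rdms[key].update(record)   # existing entity: propagate changes
            stats["updated"] += 1
        else:
            rdms[key] = dict(record)   # new entity: insert
            stats["inserted"] += 1
    return stats
```

Running the same batch twice then inserts on the first pass and updates on the second, which is exactly the behavior needed for frequent re-crawls of a changing file structure.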
===About LinkAhead===
LinkAhead was designed as an RDMS mainly targeted at active data analysis. In contrast to [[electronic laboratory notebook]]s (ELNs), which have a stronger focus on data acquisition, and to data repositories, which are used to publish data, data in LinkAhead are assumed to be actively worked on by scientists on a regular basis. Its scope for single instances (which are usually operated on-premises) ranges from small work groups to whole research institutes. Independent of any RDMS, data acquisition typically leads to files stored on a file system; the LinkAhead crawler synchronizes data from the file system into the RDMS. LinkAhead provides multiple interfaces for interacting with the data, such as a web-based graphical user interface (GUI) and an [[application programming interface]] (API) that can be used to interface with the RDMS from multiple programming languages. LinkAhead itself is typically not used as a data repository; rather, the structured and enriched data in LinkAhead serve as a preparation for data publication, and data can be exported from the system and published in data repositories. The semantic data model used by LinkAhead is described in more detail in the next subsection. LinkAhead is open-source software, released under the AGPLv3 license (see Appendix A).
====Data models in LinkAhead====
The LinkAhead data model is essentially an object-oriented representation of data that makes use of four different types of entities: <tt>RecordType</tt>, <tt>Property</tt>, <tt>Record</tt>, and <tt>File</tt>. <tt>RecordTypes</tt> and <tt>Properties</tt> define the data model, which is later used to store concrete data objects, represented by <tt>Records</tt>. In that respect, <tt>RecordTypes</tt> and <tt>Properties</tt> share a lot of similarities with [[Ontology (information science)|ontologies]] but have a restricted set of relations, as described in more detail by Fitschen ''et al.'' [19] <tt>Files</tt> have a special role within LinkAhead: they represent references to actual files on a file system but allow for linking those files to other LinkAhead entities and, e.g., adding custom properties.
<tt>Properties</tt> are individual pieces of information that have a name, a description, and optionally a physical unit, and they can store a value of a well-defined data type. <tt>Properties</tt> are attached to <tt>RecordTypes</tt> and can be marked as “obligatory,” “recommended,” or “suggested.” In the case of obligatory <tt>Properties</tt>, each <tt>Record</tt> of the respective <tt>RecordType</tt> is required to set them. Each <tt>Record</tt> must have at least one <tt>RecordType</tt>, and <tt>RecordTypes</tt> can have other <tt>RecordTypes</tt> as parents; this is known as (multiple) inheritance in object-oriented programming languages.
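As a rough analogy (not LinkAhead's actual implementation; class and attribute names are invented for illustration), the entity types and the obligatory-property check can be sketched in Python:

```python
class Property:
    """An individual piece of information: name, data type, optional unit."""
    def __init__(self, name, datatype, unit=None, description=""):
        self.name = name
        self.datatype = datatype
        self.unit = unit
        self.description = description

class RecordType:
    """A type definition; may inherit obligatory Properties from several parents."""
    def __init__(self, name, parents=(), obligatory=()):
        self.name = name
        self.parents = list(parents)
        self.obligatory = list(obligatory)

    def all_obligatory(self):
        """Obligatory Properties, including those inherited from all parents."""
        props = list(self.obligatory)
        for parent in self.parents:      # (multiple) inheritance
            props.extend(parent.all_obligatory())
        return props

class Record:
    """A concrete data object; creation fails if obligatory Properties are unset."""
    def __init__(self, record_type, **values):
        missing = [p.name for p in record_type.all_obligatory()
                   if p.name not in values]
        if missing:
            raise ValueError(f"obligatory properties not set: {missing}")
        self.record_type = record_type
        self.values = values
```

In this sketch, a <tt>Record</tt> of a child <tt>RecordType</tt> must also satisfy the obligatory <tt>Properties</tt> inherited from its parents, mirroring the enforcement described above.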
==Appendices==
===Appendix A. Supporting software===
The following software projects can be used to implement the workflows described in the article:
* Repository of the LinkAhead open-source project: https://gitlab.com/linkahead
* Repository of the LinkAhead crawler: https://gitlab.com/linkahead/linkahead-crawler
The installation procedures for LinkAhead and the crawler framework are provided in their respective repositories. A Docker container is also available for instant deployment of LinkAhead.
===Appendix B. Example crawlers===
There is a documented example available online (at https://gitlab.com/linkahead/crawler-extensions/documented-crawler-example) that demonstrates the application of the crawler to example data. This can also be used as a template for the development of custom crawlers. We are currently aware of one public instance of LinkAhead that makes use of a complex crawler based on the crawler framework described in this article. It is provided by the Leibniz Centre for Tropical Marine Research (ZMT) and can be accessed online at https://dataportal.leibniz-zmt.de/.
===Appendix C. Community repository for crawler extensions===
In order to foster the re-usability of crawler definitions, we are building a community repository for crawler extensions, which can be found at https://gitlab.com/linkahead/crawler-extensions.
===Appendix D. Software documentation===
The official documentation for LinkAhead, including an installation guide, can be found at https://docs.indiscale.com. The documentation for the crawler framework that is presented in this article can be found at https://docs.indiscale.com/caosdb-crawler/.


==References==


Full article title Mapping hierarchical file structures to semantic data models for efficient data integration into research data management systems
Journal Data
Author(s) tom Wörden, Henrik; Spreckelsen, Florian; Luther, Stefan; Parlitz, Ulrich; Schlemmer, Alexander
Author affiliation(s) Indiscale GmbH, Max Planck Institute for Dynamics and Self-Organization, Georg-August-Universität, German Center for Cardiovascular Research (DZHK) Göttingen, University Medical Center Göttingen
Primary contact Email: alexander dot schlemmer at ds dot mpg dot de
Editors Sedig, Kamran
Year published 2024
Volume and issue 9(2)
Page(s) 24
DOI 10.3390/data9020024
ISSN 2306-5729
Distribution license Creative Commons Attribution 4.0 International
Website https://www.mdpi.com/2306-5729/9/2/24
Download https://www.mdpi.com/2306-5729/9/2/24/pdf (PDF)

Abstract

Although other methods exist to store and manage data using modern information technology (IT), the standard solution is file systems. Therefore, maintaining well-organized file structures and file system layouts can be key to a sustainable research data management infrastructure. However, file structures alone lack several important capabilities for FAIR (findable, accessible, interoperable, and reusable) data management, the two most significant being insufficient visualization of data and inadequate possibilities for searching and obtaining an overview. Research data management systems (RDMSs) can fill this gap, but many do not support the simultaneous use of the file system and RDMS. This simultaneous use can have many benefits, but keeping data in an RDMS in synchrony with the file structure is challenging.

Here, we present concepts that allow for keeping file structures and semantic data models (found in RDMSs) synchronous. Furthermore, we propose a specification in YAML format that allows for a structured and extensible declaration and implementation of a mapping between the file system and data models used in semantic research data management. Implementing these concepts will facilitate the re-use of specifications for multiple use cases. Furthermore, the specification can serve as a machine-readable and, at the same time, human-readable documentation of specific file system structures. We demonstrate our work using the open-source RDMS LinkAhead (previously named “CaosDB”).

Keywords: research data management, FAIR, file structure, file crawler, metadata, semantic data model

Introduction

Data management for research is part of an active transformation in science, with effective management required in order to meet the needs of increasing amounts of complex data. Furthermore, the FAIR guiding principles [1] for scientific data—which are an elementary part of numerous data management plans, funding guidelines, and data management strategies of research organizations [2,3], requiring that research objects be more findable, accessible, interoperable, and reusable—require scientists to review and enhance their established data management workflows.

One particular focus of this endeavor is the introduction and expansion of research data management systems (RDMSs). These systems help researchers organize their data during the whole data management life cycle, especially by increasing findability and accessibility. [4] Furthermore, semantic data management approaches [5] can increase the reuse and reproducibility of data that are typically organized in file structures. As has been pointed out by Gray et al. [4], one major shortcoming of file systems is the lack of rich metadata features, which additionally limits search options. Typically, RDMSs employ database management systems (DBMSs) to store data and metadata, but the degree to which data is migrated, linked, or synchronized into these systems can vary substantially.

The import of data into an RDMS typically requires the development of data integration procedures that are tied to the specific workflows at hand. While very few standard products exist [6], in practice, mostly custom software written in various programming languages and making use of a high variety of different software packages are used for data integration in scientific environments. There are two main workflows for integrating data into RDMSs: manually inputting data (e.g., using forms [7]) or facilitating the batch import of data sets. The automatic methods often include data import routines for predefined formats, like tables in Excel or CSV format. [8,9] Some systems include plugin systems to allow for a configuration of the data integration process. [10] Sometimes, data files have to be uploaded using a web front-end [11] and are afterwards attached to objects in the RDMSs. In general, developing this kind of software can be considered very costly [6], as it is highly dependent on the specific environment. Data import can still be considered one of the major bottlenecks for the adaption of an RDMS.

There are several advantages to using an RDMS over organization of data in classical file hierarchies. There is a higher flexibility in adding metadata to data sets, while these capabilities are limited for classical file systems. The standardized representation in an RDMS improves the comparability of data sets that possibly originate from different file formats and data representations. Furthermore, semantic information can be seamlessly integrated, possibly using standards like RDF [12] and OWL. [13] The semantic information allows for advanced querying and searching, e.g., using SPARQL. [14] Concepts like linked data [15,16] and FAIR digital objects (FDO [17]) provide overarching concepts for achieving more standardized representations within RDMSs and for publication on the web. Specifically, the FDO concept aims at bundling data sets with a persistent digital identifier (PID) and its metadata to self-contained units. These units are designed to be machine-actionable and interoperable, so that they have the potential to build complex and distributed data processing infrastructures. [17]

Using file systems and RDMSs simultaneously

Despite the advantages mentioned above, RDMSs have still failed to gain a widespread adoption. One of the key problems in the employment of an RDMS in an active research environment is that a full transition to such a system is very difficult, as most digital scientific workflows are in one or multiple ways dependent on classical hierarchical file systems. [4] Examples include data acquisition and measurement devices, data processing and analysis software, and digitized laboratory notes and material for publications. The complete transition to an RDMS would require developing data integration procedures (e.g., extract, transform, load [ETL] [6,18] processes) for every digital workflow in the lab and to provide interfaces for input and output to any other software involved in these workflows.

As files on classical file systems play a crucial role in these workflows, our aim is to develop a robust strategy to use file systems and an RDMS simultaneously. Rather than requiring a full transition to an RDMS, we want to make use of the file system as an interoperability layer between the RDMS and any other file-based workflow in the research environment.

There are two important tasks that need to be solved and that are the main focus of this article:

  1. There must be a method to keep data and metadata in the RDMS synchronized with data files on the file system. Using that method, the file system can be used as an interoperability layer between the RDMS and other software and workflows. Our approach to solving this issue is discussed in detail in the results section. One key component of the synchronization method is defining the concept of "identity" for data in the RDMS, also discussed in the results section about identifiables.
  2. The high variety of different data structures found on the file system needs an adaptive and flexible approach for data integration and synchronization into the RDMS. We discuss our solution for this task in the results section concerning YAML, where we present a standardized but highly configurable format for mapping information from files to a semantic data model.

Apart from the main motivation, described above, we have identified several additional advantages of using a conventional folder structure simultaneous to an RDMS: standard tools for managing the files can be used for backup (e.g., rsync), versioning (e.g., git), archiving, and file access (e.g., SSH). Functionality of these tools does not need to be re-implemented in the RDMS. Furthermore, the file system can act as a fallback in cases where the RDMS might become unavailable. This methodology, therefore, increases robustness. As a third advantage, existing workflows relying on storing files in a file system do not need to be changed, while the simultaneous findability within an RDMS is available to users.

The concepts described in this article can be used independent of a specific RDMS software. However, as a proof-of-concept, we implemented the approach as part of the file crawler framework that belongs to the open-source RDMS LinkAhead (recently renamed from CaosDB). [19,20] The crawler framework is released as open-source software under the AGPLv3 license (see Appendix A).

Example data set

We will illustrate the problem of integrating research data using a simplified example that is based on the work of Spreckelsen et al. [21] This example will be used in the results section to demonstrate our data integration concepts. Examples for more complex data integration, e.g., for data sets found in the neurosciences (BIDS [22] and DICOM [23]) and in the geosciences, can be found online (see Appendix B). Although the concept is not restricted to data stored on file systems, in this example we will assume for simplicity that the research data are stored on a standard file system with a well-defined file structure layout:

ExperimentalData/

2020_SpeedOfLight/
2020-01-01_TimeOfFlight
README.md
...
2020-01-02_Cavity
README.md
...
2020-01-03
README.md
...

The above listing replicates an example with experimental data from Spreckelsen et al. [21] using a three-level folder structure:

  • Level 1 (ExperimentalData) stores rough categories for data, in this data acquired from experimental measurements.
  • Level 2 (2020_SpeedOfLight) is the level of project names, grouping data into independent projects.
  • Level 3 stores the actual measurement folders, which can also be referred to as “scientific activity” folders in the general case. Each of these folders could have an arbitrary substructure and store the actual experimental data along with a README.md file, containing meta data.

The generic use case of integrating data from file systems involves the following sub tasks:

  1. Identify the required data for integration into the RDMS. This can possibly involve information contained in the file structure (e.g., file names, path names, or file extensions) or data contained in the contents of the files themselves.
  2. Define an appropriate (semantic) data model for the desired data.
  3. Specify the data integration procedure that maps data found on the file system (including data within the files) to the (semantic) data in the RDMS.

A concrete example for this procedure, including a semantic data model, is provided in the results section. As previously state, there are already many use cases that can benefit from the simultaneous use of the file system and RDMS. Therefore, it is important to implement reliable means for identifying and transferring the data not only once, as a single “data import,” but also while allowing for frequent updates of existing or changed data. Such an update might be needed if an error in the raw data has been detected. It can then be corrected on the file system, with the changes needing to be propagated to the RDMS. Another possibility is that data files that are actively worked on have been inserted into the RDMS. A third-party software is used to process these files and, consequently, the information taken from the files has to be frequently updated in the RDMS.

We use the term “synchronization” here to refer to the possible insertion of new data sets and to update existing data sets in the same procedure. To avoid confusion, we want to explicitly note here that we are not referring to bi-directional synchronization. Bi-directional synchronization means that information from RDMS that is not present in the file system can be propagated back to the file system, which is not possible in our current implementation. Although ideas exist to implement bi-directional synchronization in the future, in the current work (and also the current software implementation), we focus on the uni-directional synchronization from the file system to the RDMS. The outlook for adding extensions to bi-directional synchronization will be discussed in the discussion section.

About LinkAhead

LinkAhead was designed as an RDMS mainly targeted at active data analysis. In contrast to electronic laboratory notebooks (ELNs), which have a stronger focus on data acquisition, and data repositories, which are used to publish data, data in LinkAhead are assumed to be actively worked on by scientists on a regular basis. Its scope for single instances (which are usually operated on-premises) ranges from small work groups to whole research institutes. Independent of any RDMS, data acquisition typically leads to files stored on a file system. The LinkAhead crawler synchronizes data from the file system into the RDMS. LinkAhead provides multiple interfaces for interacting with the data, such as a web-based graphical user interface (GUI) and an application programming interface (API) that can be used to access the RDMS from multiple programming languages. LinkAhead itself is typically not used as a data repository, but the structured and enriched data in LinkAhead serve as a preparation for data publication, and data can be exported from the system and published in data repositories. The semantic data model used by LinkAhead is described in more detail in the next subsection. LinkAhead is open-source software, released under the AGPLv3 license (see Appendix A).

Data models in LinkAhead

The LinkAhead data model is basically an object-oriented representation of data which makes use of four different types of entities: RecordType, Property, Record, and File. RecordTypes and Properties define the data model, which is later used to store concrete data objects, represented by Records. In that respect, RecordTypes and Properties share a lot of similarities with ontologies, but have a restricted set of relations, as described in more detail by Fitschen et al. [19] Files have a special role within LinkAhead as they represent references to actual files on a file system, but allow for linking them to other LinkAhead entities and, e.g., adding custom properties.

Properties are individual pieces of information that have a name, a description, optionally a physical unit, and can store a value of a well-defined data type. Properties are attached to RecordTypes and can be marked as "obligatory," "recommended," or "suggested." In the case of obligatory Properties, each Record of the respective RecordType is required to set the respective Properties. Each Record must have at least one RecordType, and RecordTypes can have other RecordTypes as parents. This is known as (multiple) inheritance in object-oriented programming languages.
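These data-model concepts can be mimicked in a few lines of Python. The classes below are a simplified illustration of the described semantics (obligatory Properties and multiple inheritance of RecordTypes), not LinkAhead's actual implementation or client API.

```python
class RecordType:
    """A type definition carrying obligatory Properties; it may
    inherit further obligatory Properties from parent RecordTypes."""
    def __init__(self, name, parents=(), obligatory=()):
        self.name = name
        self.parents = list(parents)
        self.obligatory = set(obligatory)

    def all_obligatory(self):
        # Multiple inheritance: collect obligatory Properties
        # from this type and, recursively, from all parents.
        props = set(self.obligatory)
        for parent in self.parents:
            props |= parent.all_obligatory()
        return props

class Record:
    """A concrete data object; creating it enforces that all
    obligatory Properties of its RecordType are set."""
    def __init__(self, record_type, **properties):
        missing = record_type.all_obligatory() - properties.keys()
        if missing:
            raise ValueError(f"missing obligatory properties: {missing}")
        self.record_type = record_type
        self.properties = properties

measurement = RecordType("Measurement", obligatory={"date"})
experiment = RecordType("Experiment", parents=[measurement],
                        obligatory={"operator"})
rec = Record(experiment, date="2023-05-01", operator="A. Scientist")
```

Here an "Experiment" Record must set both its own obligatory Property ("operator") and the one inherited from its "Measurement" parent ("date").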

Appendices

Appendix A. Supporting software

The following software projects can be used to implement the workflows described in the article:

The installation procedures for LinkAhead and the crawler framework are provided in their respective repositories. A Docker container is also available for instant deployment of LinkAhead.

Appendix B. Example crawlers

There is a documented example available online at https://gitlab.com/linkahead/crawler-extensions/documented-crawler-example that demonstrates the application of the crawler to example data. This can also be used as a template for the development of custom crawlers. We are currently aware of one public instance of LinkAhead, which makes use of a complex crawler based on the crawler framework described in this article. It is provided by ZMT-Leibniz Centre for Tropical Marine Research and can be accessed online at https://dataportal.leibniz-zmt.de/.

Appendix C. Community repository for crawler extensions

In order to foster the reusability of crawler definitions, we are building a community repository for crawler extensions, which can be found at https://gitlab.com/linkahead/crawler-extensions.

Appendix D. Software documentation

The official documentation for LinkAhead, including an installation guide, can be found at https://docs.indiscale.com. The documentation for the crawler framework that is presented in this article can be found at https://docs.indiscale.com/caosdb-crawler/.

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.