Journal:Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation

From LIMSWiki
Full article title Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation
Journal Journal of Biomedical Semantics
Author(s) Schröder, Max; Staehlke, Susanne; Groth, Paul; Nebe, J. Barbara; Spors, Sascha; Krüger, Frank
Author affiliation(s) University of Rostock, University Medical Center Rostock, University of Amsterdam
Primary contact Email: max dot schroeder at uni-rostock dot de
Year published 2022
Volume and issue 13
Article # 4 (2022)
DOI 10.1186/s13326-021-00257-x
ISSN 2041-1480
Distribution license Creative Commons Attribution 4.0 International
Website https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-021-00257-x
Download https://jbiomedsem.biomedcentral.com/track/pdf/10.1186/s13326-021-00257-x.pdf (PDF)

Abstract

Background: Electronic laboratory notebooks (ELNs) are used to document experiments and investigations in the wet lab. Protocols in ELNs contain a detailed description of the conducted steps, including the necessary information to understand the procedure and the raised research data, as well as to reproduce the research investigation. The purpose of this study is to investigate whether such ELN protocols can be used to create semantic documentation of the provenance of research data by the use of ontologies and linked data methodologies.

Methods: Based on an ELN protocol of a biomedical wet lab experiment, a retrospective provenance model of the raised research data describing the details of the experiment in a machine-interpretable way is manually engineered. Furthermore, an automated approach for knowledge acquisition from ELN protocols is derived from these results. This structure-based approach exploits the structure in the experiment’s description (such as headings, tables, and links) to translate the ELN protocol into a semantic knowledge representation. To satisfy the FAIR guiding principles (making data findable, accessible, interoperable, and reusable), a ready-to-publish bundle is created that contains the research data together with their semantic documentation.

Results: While the manual modelling efforts serve as proof of concept by employing one protocol, the automated structure-based approach demonstrates the potential generalization with seven ELN protocols. For each of those protocols, a ready-to-publish bundle is created and, by employing the SPARQL query language, it is illustrated that questions about the processes and the obtained research data can be answered.

Conclusions: The semantic documentation of research data obtained from the ELN protocols allows for the representation of the retrospective provenance of research data in a machine-interpretable way. Research Object Crate (RO-Crate) bundles including these models enable researchers to easily share the research data, including the corresponding documentation, as well as to search the experiments and relate them to each other.

Keywords: research data, provenance, knowledge acquisition, electronic laboratory notebooks, semantic documentation, RO-Crate, FAIR

Background

Effective reuse of research data requires comprehensive documentation of their provenance. Besides metadata, knowledge about the generating process helps others to understand research data and allows for the reproduction of research investigations. This includes not only sources of input data, such as parameters and assumptions, but also information about instrumentation, devices, and materials. For wet lab experiments, such knowledge is increasingly documented in electronic laboratory notebooks (ELNs). The focus of these tools is on the documentation of laboratory activities that produce research data in so-called "ELN protocols." In addition to this textual description, the FAIR guiding principles [1] provide general guidance on research data documentation in terms of metadata. However, they do not prescribe technical details about the implementation of such documentation. [2]

To foster the realization of the FAIR principles for research data produced in wet lab experiments, we aim for machine-interpretable representations of experimental documentation of the process that is the origin of the data. In other words, the provenance information about the research data (including the activities and involved researchers, resources, and equipment) should be semantically represented. For this purpose, we employ the frequently used [3] PROV W3C recommendation [4], whose ontology, the PROV Ontology (PROV-O), defines entities, activities, and agents, as well as their relations. In particular, according to Belhajjame et al., an entity is defined as a “physical, digital, conceptual, or other kind of thing with some fixed aspects,” [5] an activity as “something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities,” [5] and an agent as “something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity.” [5] With respect to wet lab experiments, all biological and chemical resources, the devices and software, and the research data itself can be seen as entities; researchers conducting the experiment are the agents, and the process of research data creation consists of activities. The semantic representation of this information as a knowledge graph (KG) [6] can be achieved by the use of modern web technologies where the terms and their relations are defined in ontologies such as PROV-O (TBox modelling), the instances are built up in the KG (ABox modelling), and other KGs can be linked in order to create an interconnected graph of semantic knowledge.
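The mapping of wet lab concepts onto PROV-O entities, activities, and agents can be sketched as a handful of statements. The following is only an illustration in plain Python, not the authors' model: the `ex:` identifiers are hypothetical, and only the `prov:` and `rdf:` terms are taken from PROV-O and RDF.

```python
# Minimal sketch: PROV-O style statements as (subject, predicate, object) tuples.
# All "ex:" identifiers are hypothetical placeholders, not taken from the paper.
PROV_TRIPLES = [
    ("ex:researcher_1", "rdf:type", "prov:Agent"),
    ("ex:ca_imaging_run", "rdf:type", "prov:Activity"),
    ("ex:mg63_cells", "rdf:type", "prov:Entity"),
    ("ex:recording.czi", "rdf:type", "prov:Entity"),
    # The researcher is responsible for the activity.
    ("ex:ca_imaging_run", "prov:wasAssociatedWith", "ex:researcher_1"),
    # The activity consumes a resource and produces a data file.
    ("ex:ca_imaging_run", "prov:used", "ex:mg63_cells"),
    ("ex:recording.czi", "prov:wasGeneratedBy", "ex:ca_imaging_run"),
]


def objects(triples, subject, predicate):
    """Return all objects of statements matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def instances_of(triples, cls):
    """Return the sorted identifiers typed with the given class."""
    return sorted(s for s, p, o in triples if p == "rdf:type" and o == cls)


if __name__ == "__main__":
    print(instances_of(PROV_TRIPLES, "prov:Entity"))
    print(objects(PROV_TRIPLES, "ex:recording.czi", "prov:wasGeneratedBy"))
```

In a real setting, an RDF library and full IRIs would replace the tuples and prefixed names used here.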

In this paper, we aim for an automatic extraction of information from ELN protocols in order to transfer them into a semantic representation that documents the produced research data. For this purpose, we employ the documentation of calcium imaging (Ca-imaging) experiments, originally proposed by Staehlke et al. [7], as a running example. In particular, we use ELN protocols that document the execution of Ca-imaging experiments in order to: (i) demonstrate the feasibility of manually transferring an ELN protocol into a semantic representation encoding the provenance of research data, (ii) automate the information extraction and modelling by exploiting the structure of an ELN protocol by means of a structure-based approach, and (iii) evaluate the proposed method by answering provenance questions from the resulting bundle of research data and the corresponding semantic model.

Here, the term "ELN protocol" refers to the actual documentation of the wet lab experiment within an ELN and is different from the term "protocol templates," which are used to encode instructions to be performed in order to conduct particular procedures and which may be published at protocols.io. While those protocol templates do encode a list of abstract instructions, they do not necessarily reflect particular research data, instrumentation, parameters, or other execution-specific information. ELN protocols, in contrast, represent the documentation of the actual experiment, and the contained information is thus necessary to understand how the resulting research data was generated. This includes manufacturer-specific information about resources used in the experiment such as lot numbers.[a] Furthermore, passage numbers of the resources, the times when an activity was conducted, and the parameters used in a device, as well as the research data and the researchers conducting the experiment, are information specific to a particular experiment. Figure 1 illustrates the differences by providing an example for an ELN protocol and a protocol template.



Figure 1. Excerpts of an ELN protocol that represents a particular experiment, including all details such as timestamps, lot numbers, and the research data (left), and of a protocol template containing general instructions for experiments without these details (right).

The work presented here is based on a preliminary investigation regarding the effectiveness of manually modeling ELN protocols by use of ontologies. [8] Here, we extend this preliminary work by discussing the potential of automatic information extraction from ELN protocols by employing structural information and discussing the differences and implications of both approaches. Moreover, while the previous work only sketched the semantic representation of the wet lab experiments, here, we focus on the generation of ready-to-publish research data bundles, including the semantic description of the origin of the research data.

Use case

To demonstrate the feasibility of the proposed approach, a typical wet lab investigation was chosen as a use case. In the following, we introduce the use case and derive questions regarding the provenance of the corresponding research data.

Biomedical wet lab experiments

The objective of the biomedical study was to investigate the intracellular calcium ion (Ca2+) dynamics by calcium imaging (Ca-imaging) under different settings. [7] In particular, two different wet lab experiments were considered: (i) an investigation of the influence of different material surface conditions on Ca2+ mobilization, and (ii) an investigation regarding the Ca2+ dynamics under the influence of electrical stimulation. Both types of experiments involve similar activities of the researchers. In particular, each experiment employs the Ca-imaging method previously established by Staehlke et al. [7] in different settings. The particular conditions, e.g., surface conditions or parameters of the electrical stimulation, were investigated within each experiment, while the order of the different variations was permuted across the experiments. That is, after a preparation phase, where all materials and devices were prepared, the same procedure, i.e., Ca-imaging, was executed for the different conditions. During the experiment, several materials and devices are employed, such as cell line passages, buffer, and microscopes.

For the purpose of this study, we asked the researchers to use an ELN for the documentation of their wet lab activities, resulting in eight ELN protocols: one for the first experiment and seven for the latter, representing different permutations of the sequential execution of Ca-imaging for different electrical stimulation parameters. In particular, eLabFTW (Deltablot, https://www.elabftw.net/, v3.6.7) [10], a domain-independent ELN, was used. Figure 2 shows an excerpt of a protocol from the use case.



Figure 2. ELN protocol about a Ca-imaging experiment in the eLabFTW software. It contains general information (top), the list of activities with their starting time (middle), used inventory items, and uploaded research data (bottom).

ELNs often provide an inventory database that allows the maintenance of materials and other research resources used during the experiments. Typically, each resource belongs to a configurable set of categories, e.g., cell lines, buffer, software, or devices. These entries in the inventory database can be linked from within the protocol when used within the corresponding experiment. Figure 3 illustrates the entry in the inventory database for the MG-63 cell line that is used in the experiments of the use case. Note that this entry is already augmented by information about ontology classes that were added during the manual model engineering process. Here, we use such ontology references, but other resource identifiers, such as Research Resource Identifiers (RRID)[1], could also be used for resource reference. However, these RRIDs do not reflect different versions of the resources, e.g., when describing a software. Thus, they can be used to annotate the inventory database of the ELN similar to the ontology classes, but cannot be used on their own. Research data is attached to the ELN protocol by uploading the files and linking them from within the textual description of the step that describes the generating activity.



Figure 3. Entry in the eLabFTW inventory database for the MG-63 cell line used in the experiments of the use case, augmented with references to ontology classes.

In summary, the execution of an individual experiment took about 4.5 hours, resulting from the preparation and the sequential executions of the Ca-imaging procedure under five different stimulation settings, each consisting of 15 steps. Each protocol referred to 22 inventory items in the database, and between 85 and 110 data files of different types were generated. The different file types include (i) CZI files (developed by ZEISS) containing the microscope settings, recorded images, and raw measurement data; (ii) image files in JPEG format to illustrate particular excerpts from the video recordings; and (iii) raw measurements of the luminescence over time, in the form of XML-encoded tabular data files. The latter two formats are exports from the CZI files. The provenance of all attached files needs to be documented.

Research data provenance

When considering this use case, several questions regarding the provenance of the research data can be raised. To this end, we consider questions based on the W7 provenance model [11], which describes provenance as combinations of What, When, Where, How, Who, Which, and Why. We consider each question individually, encoding the view of a researcher who aims at re-using the research data from our use case. The questions were developed together with the domain experts and resemble actual questions that arise when considering the replication of the documented experiments.

W1 Who participated in the study?
With respect to the provenance of research data, all researchers contributing to the creation are of interest, i.e., we expect to get a list of all researchers and their affiliations involved in an experiment.
W2 Which biological and chemical resources and which equipment were used in the study?
In particular, we are interested in the resources and the equipment used in an experiment, including all details such as the lot number and the passage information.
W3 How was a particular file created?
"What was the sequence of activities that led to the creation of a particular file" is a question that might help other researchers in comprehending the data.
W4 When was an activity conducted?
The date and the time point of a particular activity but also its duration are of interest. This information is useful for the planning of similar experiments, but also with respect to the comprehensibility of the results as the date and time point might influence them, e.g., due to weather or other environmental phenomena.
W5 Why was the experiment done?
Understanding why the research data was created is crucial for their comprehensibility. We take the objective of the experiment as the reason for the creation.
W6 Where was the experiment conducted?
The location, or rather the institution where the experiment was conducted, is of interest as regional characteristics might influence the data.
W7 What was the order of the stimulation parameters in a particular experiment?
The order of the particular approaches influences the results as there might be effects from the timing of the experiments or the duration since their preparation. That means, with respect to the evaluation of the results, we are interested in this order.
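To make questions such as W3 and W4 concrete, the following sketch traces the activities leading to a file and computes the time between two activities. It is a stand-in in plain Python rather than the SPARQL queries the paper refers to; the `ex:` identifiers and timestamps are invented for illustration, and `prov:wasInformedBy` is assumed here as the link between consecutive activities.

```python
from datetime import datetime

# Hypothetical statements; identifiers and times are illustrative only.
TRIPLES = [
    ("ex:staining", "prov:startedAtTime", "2020-01-01T09:00:00"),
    ("ex:imaging", "prov:startedAtTime", "2020-01-01T09:40:00"),
    ("ex:imaging", "prov:wasInformedBy", "ex:staining"),
    ("ex:recording.czi", "prov:wasGeneratedBy", "ex:imaging"),
]


def lookup(subject, predicate):
    """Return all objects for a subject/predicate pair."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def activities_leading_to(file_id):
    """W3: walk back from a file to the first activity, oldest first."""
    chain, seen = [], set()
    current = lookup(file_id, "prov:wasGeneratedBy")
    while current and current[0] not in seen:  # guard against cycles
        seen.add(current[0])
        chain.append(current[0])
        current = lookup(current[0], "prov:wasInformedBy")
    return list(reversed(chain))


def started_at(activity_id):
    """W4: return the recorded start time of an activity."""
    return datetime.fromisoformat(lookup(activity_id, "prov:startedAtTime")[0])


if __name__ == "__main__":
    print(activities_leading_to("ex:recording.czi"))
    print(started_at("ex:imaging") - started_at("ex:staining"))
```

The sketch follows only the first match at each step, which is sufficient for a linear sequence of activities but not for branching workflows.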

Related work

The provenance of research data, including their research investigations, combines several research fields, ranging from general-purpose methods and standards for the documentation of provenance to specifically tailored methods and platforms for the tracking of research and other activities. In the following, we will discuss recent work within those fields and relate it to our method.

Many methods aiming at documenting the provenance of activities have already been proposed. Here, we consider the classification of provenance information following the definition of Herschel et al. [12] and Lim et al. [13]:

  1. prospective provenance describes “an abstract workflow specification as a recipe for future data derivation” [13];
  2. retrospective provenance documents a “past workflow execution and data derivation information, i.e., which tasks were performed and how data artifacts were derived” [13]; and
  3. evolution provenance illustrates “the changes made between two versions of the input” [12], or, in other words, versions of the procedure, the data, or the parameters are reflected by evolution provenance similar to version control such as that implemented by Git for source code.

Applying those definitions to the use case at hand, prospective provenance allows keeping track of changes to laboratory-specific operating procedures in general, while retrospective provenance allows documenting the actually executed sequence of activities that resulted in a particular set of research data. Finally, evolution provenance allows tracking changes made to the actual ELN protocol or the inventory database items.

With respect to the research workflows to be represented by provenance modeling, two different types can be distinguished:

  1. In-silico studies employ computational methods for the analysis of the data. Workflow systems like Taverna [14], Kepler [15], or Galaxy [16], and programming environments like Jupyter Notebook [17] have been successfully augmented to record retrospective provenance.
  2. Wet lab experiments are courses of activities in a laboratory. While several approaches exist that describe prospective provenance [18, 19] by analyzing published protocols, only limited work is done on documenting retrospective provenance for these workflows.

More detailed information about provenance modelling and the employed methods is provided in the literature. [3, 12] Here, we are interested in providing detailed information about the origin of research data. Thus, we aim at providing retrospective provenance documentation of research data from ELN protocols documenting wet lab experiments.

The Smart Tea project [20] similarly aims at the semantic metadata recording for research data from within a customized ELN. The developed ELN provides a structured graphical user interface (GUI) requiring the user to provide information for predefined variables. All information is directly transferred into a linked data representation and persistently archived with a linked data server. While this approach perfectly guides users through the sequence of activities and tracks retrospective provenance at the same time, it fails to keep track of deviations from the predefined plan. Furthermore, as the documentation is directly translated into a semantic representation, additional information that was not considered before can hardly be attached to such protocols, which restricts both the expressivity of the semantic model and the user to previously known information.

Similar to the Smart Tea project, the PROV templating approach [21] suggests the recording of provenance information given a pre-defined provenance model. In other words, the main idea is that applications only store values for placeholders in a particular provenance model, which was shown to be more efficient than the storage of the original provenance models. [21] This solution is very efficient if a very large number of identical provenance structures with some variable information are to be stored. If, however, the application requires more flexibility in terms of the provenance structure, the template approach does not utilize this efficiency advantage. Note that provenance templates encode a semantic representation with variables, whereas protocol templates provide guidelines for experiments.

Curcin et al. [22] use a very similar approach for the provenance modelling in diagnostic decision support systems. A more flexible approach is the use of knowledge graph cells (KGCs), proposed by Vogt et al. [23] They provide a concept for the definition of knowledge structures. In particular, rules including ABox and TBox expressions might be defined that allow the dynamic modification of the KG. Thus, KGCs might be used to specify potential semantic structures of ELN protocols without particular information inside. However, the application of KGCs would require a complete definition of all possible semantic representations of ELN protocols, which is infeasible.

With respect to the vocabulary used to semantically describe the laboratory-specific information, the EXperimental ACTions (EXACT2) ontology, together with a natural language processing (NLP) framework [18], aims at the automatic extraction of knowledge from biomedical protocols for prospective provenance. Similarly, the SeMAntic RepresenTation for Experimental Protocols (SMART Protocols) ontology reuses EXACT2 to represent prospective provenance from published protocols. [19] In contrast to both approaches that represent a plan, we aim at retrospective provenance, i.e., a particular course of activities. Both approaches, however, could be used to describe prospective provenance of the underlying plan of an ELN protocol, to allow the documentation of potential deviations from the original plan. The Reproduce Microscopy Experiments (REPRODUCE-ME) ontology [24] introduces a specific vocabulary to describe retrospective provenance for microscopy experiments. In addition, the domain-independent ontologies PROV-O and its predecessor, the Open Provenance Model (OPM) [25], are frequently employed as upper-level ontologies for provenance documentation. [3] Furthermore, many extensions for specific applications have been proposed. The Provenance, Authoring, and Versioning (PAV) ontology, for example, proposes a mechanism for the versioning and authoring of web resources [26], and CollabPG encodes collaborations within processes. [3] With respect to the application domain of the use case, the Open Biological and Biomedical Ontology (OBO) Foundry is a community initiative aiming at the development and maintenance of ontologies in the biomedical domain. [27] The Basic Formal Ontology (BFO) [28] is the upper-level ontology that is used for each of the OBO ontologies.

For the retrospective provenance documentation of research data from computational workflows, several specifically tailored tools and approaches have been proposed in the literature. ProvBook [17], for instance, tracks provenance in Jupyter notebooks that are used for literate programming. Further examples are Dataprov [29], a wrapper tool producing provenance information from the execution of analysis tools, and noWorkflow [30], which captures provenance information from analysis scripts written, for example, in Python. Aside from these methods, other provenance tracking approaches known as lineage retrieval [31] or lineage tracking and workflow systems exist. [32] In general, in-silico workflow systems not only record provenance information, but at the same time they specify the involved processing steps and enable their execution, possibly on a distributed system. [33] However, as these systems are limited to tackling computational analyses, their usage for the provenance of research data from wet lab experiments is difficult.

Regarding the completeness of the documentation with respect to reproducibility, plenty of standards exist that aim at the definition of the minimum set of information required to comprehend and reproduce the research investigation for different applications. With respect to the use case at hand, the minimum information for electrical cell stimulation [34] and the Minimum Information About a Cellular Assay (MIACA)[2] provide such references for the documentation. Similarly, standard operating procedures (SOPs) or published instructions for experiments encode standards for the documentation of a particular experiment.

When considering the publication or archiving of research data, metadata is important to provide additional context, enabling others (including the future self) to understand the research process and the resulting data. In particular, the FAIR guiding principles provide abstract recommendations for handling research data to enable its re-usability. [1] Together with the implementation suggestions of these guidelines [2], they provide a framework which is also applicable for research data from wet lab experiments. While both guidelines provide generic recommendations regarding research data documentation, different standards exist that provide vocabulary for their support. Several initiatives foster the development of documentation standards for research data, including the Data Documentation Initiative (DDI) that focuses on standardizing metadata for social science datasets. [35] The Dublin Core, instead, is a more general definition of 15 metadata elements for electronic resources. [36, 37] Similarly, Data Catalog Vocabulary (DCAT) provides a common vocabulary for the interoperability of data catalogs [38] and, thus, also defines required metadata for research data. Additionally, domain-specific metadata standards have been developed. With respect to the use case, this includes metadata for microscopy images, such as that proposed by the RDM4mic Initiative.[3] In addition to these metadata, the information inside the data file might also be described. For this purpose, codebooks and data dictionaries are employed. [39, 40] Considering a CSV file as an example, this includes information about each column such as the domain of the values and the unit of the measurements. This information is defined in a separate file that helps comprehend the raw data.
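As an illustration of the data dictionary idea for a CSV file, the sketch below describes two columns by unit, value type, and lower bound, and checks a small table against that description. The column names, units, and sample values are hypothetical and do not come from the use case.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column of the CSV file.
DATA_DICTIONARY = {
    "time_s": {"unit": "s", "type": float, "min": 0.0},
    "intensity": {"unit": "a.u.", "type": float, "min": 0.0},
}

# Illustrative raw data; the second data row violates the value domain.
SAMPLE_CSV = "time_s,intensity\n0.0,101.5\n0.5,-3\n"


def check_rows(csv_text, dictionary):
    """Return (line number, column, problem) for each violated constraint."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        for column, spec in dictionary.items():
            try:
                value = spec["type"](row[column])
            except (KeyError, TypeError, ValueError):
                problems.append((line_no, column, "missing or not parseable"))
                continue
            if value < spec["min"]:
                problems.append((line_no, column, "below minimum"))
    return problems


if __name__ == "__main__":
    for problem in check_rows(SAMPLE_CSV, DATA_DICTIONARY):
        print(problem)
```

Keeping the dictionary in a separate, machine-readable file lets the same description serve both as documentation and as a basis for simple checks.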

For the publication and archiving of this data, including the semantic documentation, several approaches have been proposed. These include bundling formats such as BagIt [41], Oxford Common File Layout (OCFL) [42], and RO-Crate [43], as well as literate programming methods such as using Jupyter Notebook to combine (parts of) research data, their analysis source code, and results, as well as their documentation. RO-Crate [43] is a mechanism that allows the bundling of resources together with their associated metadata, supporting the FAIR publication and archiving of the research data. By re-using existing vocabulary such as schema.org or PROV-O, it implements a linked data approach to enable researchers to provide all information necessary to (re-)use the described research data. This includes basic properties such as author and title of the resource, a license for publication, a description of the files, and a description of the workflow used to create those files in terms of retrospective provenance, including employed software and other equipment. In brief, an RO-Crate bundle consists of the research data files and a metadata file called ro-crate-metadata.json, which contains structured metadata about the files and the entire bundle in JSON-LD format. While the ro-crate-metadata.json contains all information in a machine-interpretable way, it is accompanied by a human-readable HTML representation. RO-Crate has successfully been used for the documentation of retrospective provenance of in-silico studies [44], but can, due to the flexibility of the vocabulary, also be used for retrospective provenance of wet lab experiments.
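A minimal sketch of what such a ro-crate-metadata.json could look like is given below, generated with Python's standard json module. It assumes the layout of RO-Crate version 1.1 (the version used by the authors is not stated here), the file path and texts are hypothetical, and further root properties that the specification asks for, such as a license and a publication date, are omitted for brevity.

```python
import json


def build_ro_crate_metadata(name, description, files):
    """Build an abbreviated RO-Crate 1.1 style metadata document.

    files is a list of (relative path, human-readable title) pairs.
    """
    graph = [
        {   # Descriptor that identifies this file as the crate's metadata.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset describing the bundle as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "hasPart": [{"@id": path} for path, _ in files],
        },
    ]
    # One entity per bundled research data file.
    graph += [{"@id": path, "@type": "File", "name": title} for path, title in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Hypothetical example content.
metadata = build_ro_crate_metadata(
    "Ca-imaging experiment",
    "Research data and provenance documentation of one ELN protocol.",
    [("data/recording_01.czi", "Microscope recording")],
)
metadata_json = json.dumps(metadata, indent=2)

if __name__ == "__main__":
    print(metadata_json)
```

Provenance statements about activities, agents, and equipment would be added as further entities in the same `@graph` array.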

Methods

The objective of the study was to investigate whether it is possible to create semantic documentation of the research process and the resulting research data in terms of provenance. To this end, semantic documentation was manually created by analyzing the ELN protocol. To support potential automation of the semantic model creation, based on the results of this analysis, a protocol template was designed that (i) guides researchers through the process while (ii) requiring them to provide all information necessary to comprehend the origin of the research data. The resulting protocol template was split up into a set of templates that encode steps of an experiment such as the staining or the imaging with a particular set of stimulation parameters. These sub-templates ease the re-use for new experiments, e.g., by combining them in other permutations. Based on this, researchers documented their wet lab experiments, resulting in a set of ELN protocols, each of which contains variations, such as differences in parameters, execution time, or execution order. The different protocols were then automatically analyzed, translated into a semantic model, and finally bundled into self-contained archives. The following provides a detailed description of each step.

Manual model engineering

The manual engineering process for the semantic model of the ELN protocol comprised iterative modelling and reviewing. Domain experts were consulted during this process in order to validate the model. The main objective of this process was to check whether all information for the semantic provenance modelling is available in ELN protocols and whether it can be transferred into a semantic representation by employing existing ontologies. The aim of the resulting model was to document the provenance of the research data.

Protégé [45] was used for model engineering. In particular, the modelling was conducted as follows:

  1. BioPortal[4] and Ontobee[5] are used to identify relevant ontologies for terms from the ELN protocol and the inventory database items.
  2. A set of ontologies is selected from these search results so that the coverage of terms from the ELN in a single ontology is maximized. Ontologies from the OBO Foundry [27], compatible with the BFO [28], were preferred.
  3. Ontology classes representing inventory database items in the ELN (see Fig. 3) are added into the ELN description of the corresponding inventory database item as a reference for the semantic modelling.
  4. The semantic model itself is constructed by ABox statements, i.e., the creation of instances of these classes that represent the particular entities and activities of the protocol and the inventory database. Each instance gets a unique identifier in the local namespace, reflecting the individual entity; for example, MG-63_(P25,_LOT_57840088) is used to encode passage 25 of the MG-63 cells that were delivered with the lot number 57840088 (see also Fig. 5). The specific input and output relations of the activity classes are used in order to connect the particular entities correspondingly.
  5. References to the same entities in other KGs such as Wikidata [46] are included by employing the owl:sameAs relation. This is essential for linked open data according to the five-star deployment scheme proposed by Berners-Lee.[6]
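Steps 4 and 5 can be illustrated with a short sketch that mints a local identifier in the style of the example above and emits the corresponding ABox statements. The namespace, the ontology class, and the external reference are hypothetical placeholders; only the identifier pattern and the owl:sameAs relation are taken from the text.

```python
import re

# Hypothetical local namespace for instances created from one ELN.
LOCAL_NS = "http://example.org/eln/"


def mint_local_id(name, passage, lot):
    """Build an identifier such as MG-63_(P25,_LOT_57840088)."""
    label = f"{name} (P{passage}, LOT {lot})"
    return re.sub(r"\s+", "_", label)


def instance_statements(local_id, ontology_class, same_as=None):
    """Return ABox statements typing the instance and linking another KG."""
    subject = LOCAL_NS + local_id
    statements = [(subject, "rdf:type", ontology_class)]
    if same_as is not None:
        statements.append((subject, "owl:sameAs", same_as))
    return statements


if __name__ == "__main__":
    cell_id = mint_local_id("MG-63", 25, "57840088")
    # "ex:CellLineClass" and "ex:ExternalKgEntry" are placeholders, not real terms.
    for statement in instance_statements(cell_id, "ex:CellLineClass", "ex:ExternalKgEntry"):
        print(statement)
```

Encoding passage and lot number in the identifier keeps different batches of the same cell line apart as separate instances.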

The following three rules were considered during iterative modelling in order to prevent the introduction of a bias from modeller and domain experts:

  • Use ontological classes of the same granularity as the terms in the experiment documentation, e.g., “washing” instead of “material processing.”
  • Avoid the introduction of new classes and attributes whenever possible (e.g., avoid TBox statements) and re-use existing ontologies. [47]
  • Use only information from the ELN protocol, and do not introduce further knowledge apart from the references to other KGs.

Thus, the semantic model serves as demonstrator for the inherent potential of ELN protocols.

Structure-based modelling approach

Manual model engineering reveals the potential of ELN protocols for the semantic documentation of research data. However, in order to use this at large scale, a more automated approach is needed. To approach this target, the structure-based method presented here employs the textual structure in the ELN protocols, as well as basic text analysis, which is introduced in the following sections.

Considering the ELN protocol from the manual model, we observed that the main content is structured by:

  • headings and paragraphs,
  • tables (table headings and body),
  • enumerations and lists, and
  • links to inventory items and research data.

Headings are used to structure the documentation, e.g., the general section about the experimental details or a particular set of activities is preceded by a heading (upper and lower part in Fig. 2, respectively). In the latter case, different sets of activities in a protocol correspond to the templates we extracted, i.e., at each headline a new template was included.

Tables are used here for two different purposes. First, key-value tables encode general information about an experiment or inventory item, e.g., the objective of the investigation or the manufacturer of a resource. The description of inventory items mainly consists of a table of this kind (see Fig. 3). Second, activity tables list the activities in two columns: “Step” and “Starting time.” Each row encodes an atomic activity of the experiment (see Fig. 2). Especially in the activity tables, cells also include enumerations, lists, and paragraphs that further describe the atomic activities and parameters, as well as links to inventory items and the research data. As an example, see the last row in the activity table in Fig. 2. Note that we assume each row defines an atomic activity, which we do not split up at this stage.

Considering our ultimate goal of retrospective research data provenance documentation, we exploited the structure of the ELN protocol as follows:

  1. General information, such as the researcher conducting the experiment and the objective of the investigation, is parsed from the key-value table at the beginning of the protocol. This information is added to the protocol activity using the relation qualifiedAssociation (prov:qualifiedAssociation).
  2. Activities described within the ELN protocol are hierarchically structured to represent different levels of granularity. The top-level activity resembles the entire experiment, while the different main sections are represented by second-level activities. Note that each main section contains an activity table. Finally, the third level represents activities from table rows of those tables.
  3. All activities are augmented by inventory items mentioned in the respective description by the used (prov:used) relation.
  4. For each research data file created during the investigation, a corresponding entity is created. Assuming that the mention of a file inside an activity marks the creation of this file, the activity is linked to the file using the relation wasGeneratedBy (prov:wasGeneratedBy).

As previously described, we do not further split up the third-level activities, i.e., complex structures such as enumerations and lists, including their order inside a step description, are taken as atomic.
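The four translation rules above can be sketched as a single pass over an already parsed protocol. The nested dictionary layout, the identifier scheme, and the function name `translate` are assumptions for illustration; the plain string "hasPart" stands in for whichever part-of relation the model uses.

```python
PROV = "http://www.w3.org/ns/prov#"


def translate(protocol):
    """Turn a parsed protocol (hypothetical dict layout) into triples."""
    triples = []
    root = protocol["id"]  # top-level activity: the entire experiment
    for section in protocol["sections"]:
        section_id = f"{root}/{section['name']}"  # second-level activity
        triples.append((root, "hasPart", section_id))
        for index, row in enumerate(section["rows"], start=1):
            step_id = f"{section_id}/{index}"  # third level: one table row
            triples.append((section_id, "hasPart", step_id))
            for item in row.get("items", []):
                # Inventory items mentioned in the step description.
                triples.append((step_id, PROV + "used", item))
            for path in row.get("files", []):
                # A file mentioned in a step is assumed to be created by it.
                triples.append((path, PROV + "wasGeneratedBy", step_id))
    return triples


example = {
    "id": "experiment_1",
    "sections": [
        {
            "name": "ap_1_with_stimulation",
            "rows": [
                {"items": ["IonOptix 12 well plate chamber"]},
                {
                    "items": ["LSM780"],
                    "files": ["Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi"],
                },
            ],
        }
    ],
}
triples = translate(example)
```

Rows are never split further in this sketch, which mirrors the decision to treat third-level activities as atomic.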

Aside from the use of structural elements in the ELN, which was the base for the manual model, we identified different repeating patterns that can be exploited. For example, from the textual description of activities such as “incubate 5 min in [Device] SANYO CO2 Incubator at 37C” or “wash cells with [Washing solution] PBS without Ca/Mg [..],” we observed the use of verb phrases indicating the activity of the step: “incubate” and “wash,” respectively. Here, we use the head verb of those phrases to assign the corresponding ontological class from a prior mapping. Similarly, information about researchers and institutions, manufacturers, file mime-types, and experiment type is included. For large-scale usage, this information might also be retrieved from an organizational or research information system.
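A minimal sketch of the head-verb lookup is given below. It assumes imperative phrasing, so that the head verb is simply the first word of the step description; a real pipeline would likely need part-of-speech tagging. The mapping of “wash” to washing (OBI_0302888) follows the article, whereas the entry for “incubate” is a hypothetical placeholder and not a real ontology term.

```python
import re

# Prior mapping from head verbs to ontology classes.
# "wash" -> OBI_0302888 is named in the article; the "incubate" entry is a
# hypothetical placeholder used only to make the example complete.
VERB_TO_CLASS = {
    "wash": "obo:OBI_0302888",
    "incubate": "ex:IncubationPlaceholder",
}


def head_verb(step_text):
    """Return the first alphabetic token of a step description, lower-cased."""
    match = re.match(r"\s*([A-Za-z]+)", step_text)
    return match.group(1).lower() if match else None


def activity_class(step_text):
    """Look up the ontology class for a step, or None if the verb is unknown."""
    return VERB_TO_CLASS.get(head_verb(step_text))
```

Returning None for unknown verbs leaves the decision about unmapped steps (skip, flag for review, or fall back to a generic activity type) to the caller.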

Parameters that are used in the textual description are identified by their unit, e.g., “1.5 ml,” “5 min,” and “37C” by employing regular expressions. They are then represented as blank nodes connected to the step using the relation has value specification (OBI_0001938) with the value as the numerical value of the parameter and the unit connected by has measurement unit label (IAO_0000039). We observed that most of the units mentioned in the protocols at hand are defined in the units ontology (UO). [48]
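The unit-driven extraction might look like the following sketch. The unit list, the pattern `PARAM`, and the blank-node naming are assumptions; the predicate named "value" is a stand-in, since the exact property used for the numerical value is not stated here. Only the two identifiers OBI_0001938 and IAO_0000039 are taken from the text.

```python
import re

# Number followed by one of a small, assumed set of unit tokens.
PARAM = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|min|ms|Hz|V|C)\b")


def extract_parameters(text):
    """Return all value/unit pairs found in a step description."""
    return [{"value": float(v), "unit": u} for v, u in PARAM.findall(text)]


def as_blank_nodes(step_id, params):
    """Attach each parameter to the step via its own blank node."""
    triples = []
    for index, param in enumerate(params):
        node = f"_:param{index}"  # blank node label, local to this sketch
        triples.append((step_id, "obo:OBI_0001938", node))  # has value specification
        triples.append((node, "value", param["value"]))  # stand-in predicate
        triples.append((node, "obo:IAO_0000039", param["unit"]))  # unit label
    return triples


params = extract_parameters("incubate 5 min in [Device] SANYO CO2 Incubator at 37C")
triples = as_blank_nodes("step_1", params)
```

The trailing word boundary keeps the digit in “CO2 Incubator” from being read as a parameter, because no unit token follows it.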

Another frequently used pattern observed in the textual description is the mixture of biological and chemical resources, e.g., “89% [Culture Medium] DMEM + 10% [Serum] FCS + 1% [Antibiotic] Gentamicin”. By employing the following regular expression, the contained information is extracted and transferred into a representation of activity of type creating a mixture of molecules in solution (OBI_0000685):

[\.\d]+\s*% <item> ('+' [\.\d]+\s*% <item>)+

Depending on the appearance of attribution notes in the corresponding contexts (e.g., “(Attributed to Susanne Staehlke)”), we create separate activities following the same specification. Figure 7 contains an example activity encoding the creation of the above mixture.
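For illustration, the mixture pattern can be applied as in the sketch below. The concrete Python expressions are one possible reading of the pattern, in which `<item>` is taken to be any text that contains no “+”; the function name `parse_mixture` and the return format are assumptions.

```python
import re

# One component: a percentage followed by an item that contains no "+".
COMPONENT = r"[\.\d]+\s*%\s*[^+]+"
# A mixture needs at least two components joined by "+".
MIXTURE = re.compile(rf"{COMPONENT}(?:\+\s*{COMPONENT})+")
PART = re.compile(r"\s*([\.\d]+)\s*%\s*(.+?)\s*")


def parse_mixture(text):
    """Return [(percentage, item), ...] or None if text is not a mixture."""
    if not MIXTURE.fullmatch(text):
        return None
    components = []
    for part in text.split("+"):
        match = PART.fullmatch(part)
        components.append((float(match.group(1)), match.group(2)))
    return components


recipe = parse_mixture(
    "89% [Culture Medium] DMEM + 10% [Serum] FCS + 1% [Antibiotic] Gentamicin"
)
```

The sketch does not guard against malformed numbers such as “..”, which the character class would accept; a stricter number pattern could be substituted if needed.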

Preparing ELN protocol template

ELN protocols encode instructions (i.e., lists of activities) to (re-)produce the particular research findings. This does not restrict researchers but rather provides a guideline based on earlier experiments. Specifically, they include parameters, timestamps, and the research data. Taking the first experiment of our study, which was documented as an ELN protocol, we derived a protocol template by marking all variable information as placeholders. Together with the domain experts, this generalization has been validated to allow the usage as a basis for new experiments. The main advantage for the researchers conducting experiments in the wet lab is that all parameters that need to be documented during the experiment are highlighted while the overall description of the process is already done. Thus, errors introduced from missing parameters or instructions are reduced. If, however, the documentation needs to be modified during the experimental execution, researchers can adjust the activities and description.

This protocol template might already be used for the documentation of identical experiments (including identical ordering of parameter variations). However, as the researchers in our use case permute the different parts of the experiment (i.e., the stimulation parameters in each experiment), the template was further split up into individual parts. For the use case at hand, we identified the following four parts: (i) Preparation, (ii) Fluo-3 Staining, (iii) Ca-imaging with Stimulation, and (iv) Ca-imaging without Stimulation. Figure 4 illustrates the template for the approach using electrical stimulation. Placeholders that will be replaced with specific parameter values during an experiment are marked with orange background color. These templates can be re-combined and used to encode new experiments. A protocol template, therefore, can be interpreted as a combination of templates which themselves are combinations of activities in a textually structured description. In consequence, an ELN protocol represents a completed protocol template with actual parameters.
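The idea of re-combinable templates with placeholders can be sketched with Python's `string.Template`. The template wording, the placeholder names, and the function `compile_protocol` are hypothetical; in the study, the templates live in the ELN itself rather than in code.

```python
from string import Template

# Hypothetical template texts; placeholders mark the variable parameters.
TEMPLATES = {
    "Ca-imaging with Stimulation": Template(
        "incubate for ${duration_min}min with stimulation "
        "(${voltage_v} V, ${frequency_hz} Hz)"
    ),
    "Ca-imaging without Stimulation": Template(
        "incubate for ${duration_min}min without stimulation"
    ),
}


def compile_protocol(parts):
    """Fill an ordered list of (template name, parameters) pairs.

    substitute() raises KeyError when a placeholder has no value, so a
    forgotten parameter is detected instead of being silently omitted.
    """
    return [TEMPLATES[name].substitute(params) for name, params in parts]


protocol = compile_protocol(
    [
        ("Ca-imaging with Stimulation",
         {"duration_min": 10, "voltage_v": 5, "frequency_hz": 7.9}),
        ("Ca-imaging without Stimulation", {"duration_min": 10}),
    ]
)
```

Permuting the order of the parts, or repeating a part with different parameter values, only changes the list passed to `compile_protocol`.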


Figure 4. Template transferred from an ELN protocol section by highlighting parameters (marked with orange background color). The template contains the preparation and microscoping of a sample with stimulation. Note that this template aims at supporting researchers during their documentation, but the semantic translation approach is more general.

Bundling research data and re-use

The structure-based approach automatically translates the ELN protocol into a semantic representation of the activities and resources involved in the production of the research data. In order to combine this semantic representation (i.e., the documentation) with the research data, we employ the RO-Crate format. The RO-Crate bundle consists of the semantic model in a JSON-LD file ro-crate-metadata.json, the research data files, and a human-readable copy of the original ELN protocol and the inventory item description as HTML files.
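A minimal sketch of the metadata file of such a bundle follows. It assumes the layout of the RO-Crate 1.1 specification (a metadata descriptor, a root Dataset, and File entities); the version used in the study is not stated here, and the HTML file name is a hypothetical example. The provenance statements of the semantic model would be added to the same `@graph`.

```python
import json


def build_metadata(data_files, html_files):
    """Build a minimal ro-crate-metadata.json structure (RO-Crate 1.1 assumed)."""
    all_files = list(data_files) + list(html_files)
    graph = [
        {
            # Metadata descriptor pointing at the root data entity.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root data entity listing every bundled file.
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": path} for path in all_files],
        },
    ]
    graph += [{"@id": path, "@type": "File"} for path in all_files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


metadata = build_metadata(
    ["Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi"],
    ["protocol.html"],  # hypothetical name for the human-readable copy
)
serialized = json.dumps(metadata, indent=2)
```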

By using the resulting RO-Crates for our use case, we answer the raised provenance questions. To this end, we load all semantic representations from the RO-Crates into a linked data server with a SPARQL endpoint. In this study, we use Apache Jena Fuseki (v4.1.0)[7] for this purpose.

An advantage of the semantic representation of the research data documentation is its machine interpretability. This enables the comparison of the experimental processes with respect to similarities and potential differences that may have influenced the final result. This includes the particular execution times, but also omitted or additional steps as well as different parameter combinations. Furthermore, influences of the order of the different parts can easily be investigated (W7).

Results

First, we present the details of the manually engineered semantic representation of the Ca-imaging procedure which served as (i) a proof of concept for the effectiveness of retrospective provenance documentation from ELN protocols, (ii) a basis for analysis of the ELN protocol structure, and (iii) the development of the protocol template for research guidance. Second, details of the structure-based semantic translation for the seven Ca-imaging protocols with stimulation are given. Finally, we present the results of the evaluation of the RO-Crate bundles.

Manually engineered model

The semantic representation of the Ca-imaging procedure is based on the upper-level ontology BFO. In addition, PROV-O [25] is used for retrospective provenance documentation of the experimental results. Table 1 lists the most important ontologies used in the model. For the representation, an artefact-based modelling approach was selected, where artefacts are central to the model and are used to connect activities via their corresponding input and output relations. In total, the protocol as well as the inventory items are represented in about 80 resources of 46 types connected by almost 20 distinct predicates from 13 vocabularies.

Table 1. Ontologies selected for the manually engineered model. Upper rows list general ontologies, while the lower rows list domain-specific ontologies for resources and activities.
Name | Source | Details
BFO | Smith et al.[28] | Basic Formal Ontology
PROV-O | Moreau et al. [25] | PROV Ontology
BTO | Gremse et al. [49] | BRENDA Tissue Ontology
CHEBI | Degtyarenko et al. [50] | Chemical Entities of Biological Interest Ontology
CLO | Sarntivijai et al. [51] | Cell Line Ontology
OBI | Bandrowski et al. [52] | Ontology for Biomedical Investigations
FOAF | Brickley and Miller[8] | People and their web information

All inventory items that were mentioned as resources in the protocol were represented by instances of the corresponding ontology classes (ABox statements), which is exemplified in the following by use of the MG-63 cell line. The manually engineered representation, as well as the corresponding inventory database description, are illustrated in Figs. 5 and 3, respectively.


Figure 5. Graphical representation of the manually engineered semantic model of the MG-63 cell line used in the protocol. (See Schröder et al. [8])

In the ELN protocol, a passage with number 25 of the originally supplied MG-63 cells with lot number 57840088 was used, i.e., “[Cell line] MG-63 P25 LOT 57840088”.[b] This is modelled by using multiple instances of the corresponding class MG-63 cell (CLO_0007699), which are connected with the relation is_passage_of. The passage information is annotated using the attribute passage situation (CLO_0051628). Lot numbers are represented as an instance of lot number (IAO_0000132) and connected to the cell instances using the newly defined relation has_lot_number. The creation of a cell passage is attributed to a researcher using the relation wasAttributedTo (prov:wasAttributedTo). Finally, the supplier is an instance of class Organization (prov:Organization) and related to the cells using has_supplier (OBI_0000647).

The modelling of the ELN protocol can be summarized as the creation of instances of activity classes that require their individual input entities and often produce an output entity which serves as an input for the subsequent activity (artefact-based modelling). Examples of atomic activities and their corresponding activity classes include washing (OBI_0302888), creating a mixture of molecules in solution (OBI_0000685), or cell line cell culturing (CLO_0000000). The relations that are used to connect the entities to the activities are modelled in the corresponding ontology and depend on the actual activity class. Additionally, these processes are also of type Activity (prov:Activity) in order to encode general provenance information.

This modelling approach was employed for the entire ELN protocol. However, the most interesting part when it comes to the provenance documentation of research data is the activity that produces or uses the research data. The upper part in Fig. 6 illustrates the documentation from the ELN protocol relevant for the research data generation: the first two steps describe the creation of the data while the last step contains the details about the actual analysis.


Figure 6. Graphical representation of the semantic model describing the data recording (see also Fig. 5). (See Schröder et al. [8])

Structure-based model

For the structure-based model, an activity-based modelling approach was used to mirror the textual structure of the ELN protocol. For this purpose, the model was built upon the general purpose ontologies RO-Crate, PROV-O, and BFO. In total, for the representation of the seven protocols and their corresponding inventory items, 1935 resources of 18 types connected by 36 distinct predicates from seven vocabularies were used.

The structural hierarchy of the activities was represented by bfo:hasPart, while the sequential order was represented by wasInformedBy (prov:wasInformedBy). Figure 7 illustrates this structure. For each activity, the general types Action, prov:Activity, and bfo:process were used. Further links to external ontologies were added by owl:sameAs, for instance “wash” was augmented by washing (OBI_0302888).
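Because the sequential order is stored as pairwise wasInformedBy links, the execution order of a set of sibling activities can be recovered by following the chain. The sketch below assumes that each activity is informed by exactly one predecessor and that the links form a single linear chain; the function name and the dictionary input are illustrative.

```python
def execution_order(informed_by):
    """Recover the activity sequence from wasInformedBy links.

    informed_by maps each activity to the activity that preceded it,
    i.e., one entry per (activity, prov:wasInformedBy, predecessor) triple.
    """
    successor = {prev: act for act, prev in informed_by.items()}
    # The first activity is the only one that has no predecessor.
    starts = set(successor) - set(informed_by)
    if len(starts) != 1:
        raise ValueError("links do not form a single linear chain")
    order = [starts.pop()]
    while order[-1] in successor:
        order.append(successor[order[-1]])
    return order


links = {
    "ap_1_with_stimulation/15": "ap_1_with_stimulation/14",
    "ap_1_with_stimulation/16": "ap_1_with_stimulation/15",
}
sequence = execution_order(links)
```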


Figure 7. Graphical representation of an excerpt of the semantic model that was created semi-automatically.

The RO-Crate’s root data entity that describes the research data is required to be an entity of type Dataset (schema:Dataset). Thus, research data files are added to this dataset by hasPart (schema:hasPart). The connection between these file entities and the hierarchical structure of the activities is represented by wasGeneratedBy (prov:wasGeneratedBy) (see the right part of Fig. 7) whenever a file is mentioned in an activity’s textual description. This means that all files are included in this root data entity (via hasPart), but files that are not mentioned in any activity description are not linked to an activity.

Following the RO-Crate specification, ELN inventory database items are encoded as the domain-independent type IndividualProduct as they provide contextual information. However, the ontological knowledge about the type of the biological and chemical resource was added using the relation owl:sameAs by the external references from the description in the ELN. The resulting entity is connected to the activities using used (prov:used). Resources with a specific passage or lot number are added as individual entities connected to a general entity encoding the inventory database item using the relation is_instance_of. Furthermore, attributes has_passage_number and has_lot_number are added with their corresponding information.

Several mixtures are used in the ELN protocols. This information is modelled around the activity creating a mixture of molecules in solution (OBI_0000685). All resources that are used in this activity are linked by has_specified_input (OBI_0000293) and the resulting mixture entity by has_specified_output (OBI_0000299). To specify the recipe of this mixture, a material combination objective (OBI_0000686) is created and linked to the activity using achieves_planned_objective (OBI_0000417). If an attribution of this mixture is annotated in the ELN protocol, the corresponding agent is associated with the resulting mixture entity via wasAttributedTo. Note that recipes of a mixture are independent of the actual creation activity, i.e., if multiple researchers create a mixture using the same recipe, the same recipe entity is referenced, but individual activities and mixture entities are created.
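The separation between a shared recipe and individual creation activities can be sketched as follows. The identifier scheme, the `store` dictionary, and the second researcher's name are assumptions for illustration; the OBI identifiers and wasAttributedTo are those named above.

```python
def record_mixture(store, activity_id, components, attributed_to=None):
    """Record one mixture-creation activity and re-use a matching recipe.

    components is a list of (percentage, item) pairs. Recipes are keyed by
    their sorted components, so identical recipes share one entity.
    """
    key = tuple(sorted(components))
    recipes = store["recipes"]
    recipe_id = recipes.setdefault(key, f"recipe_{len(recipes) + 1}")
    mixture_id = f"{activity_id}/mixture"
    triples = store["triples"]
    for _, item in components:
        triples.append((activity_id, "obo:OBI_0000293", item))  # has_specified_input
    triples.append((activity_id, "obo:OBI_0000299", mixture_id))  # has_specified_output
    triples.append((activity_id, "obo:OBI_0000417", recipe_id))  # achieves_planned_objective
    if attributed_to:
        triples.append((mixture_id, "prov:wasAttributedTo", attributed_to))
    return recipe_id, mixture_id


store = {"recipes": {}, "triples": []}
medium = [(89.0, "DMEM"), (10.0, "FCS"), (1.0, "Gentamicin")]
recipe_1, mixture_1 = record_mixture(store, "act_1", medium, "Susanne Staehlke")
# "Researcher B" is a hypothetical second agent.
recipe_2, mixture_2 = record_mixture(store, "act_2", medium, "Researcher B")
```

Both calls return the same recipe identifier but distinct mixture entities, which matches the described behaviour when several researchers prepare a mixture from the same recipe.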

With respect to parameters, we extracted values and units for the following types: (i) time and duration (min and ms), (ii) temperature (Celsius), (iii) frequency (Hz), and (iv) voltage (V), and represented them by their corresponding classes. Specifically, the frequency and the voltage are of interest as they provide the parameters for the stimulation of the cells during the Ca-imaging approach.

ELN protocols and protocol template

By providing templates for the individual parts of the experiment (preparation, Fluo-3 staining, Ca-imaging with and without stimulation), the researchers were able to compile seven ELN protocols with different permutations of the experiment parameters. In comparison to the predefined protocol template, we observed that the researchers further modified the ELN protocol description to reflect the particular course of activities and observations conducted in the wet lab, e.g., the repetition of an experimental setting due to issues in the previous experiment or the documentation of issues during the experiment. That means the model represents such deviations from the original plan (prospective provenance) and allows for tracking the actually documented activity sequence by means of retrospective provenance.

Research data bundles

In summary, seven RO-Crates have been created, one for each ELN protocol of the Ca-imaging experiments with stimulation. The corresponding semantic representation was automatically created using the structure-based approach. All research data that was produced in a particular experiment, together with this semantic representation, was bundled in the RO-Crate. In order to foster readability, a copy of the ELN protocol and the inventory items' description was included in the form of HTML files. Thus, the RO-Crates contain between 110 and 135 files and are between 107 and 185 MB in size. The particular ELN protocols are encoded in models of 2,174 to 2,553 triples with 15,823 triples in total. As some triples (such as those describing researchers, institutions, and resources) are identical across all RO-Crates, the number of unique triples is only 13,490. The number of triples per protocol differs due to deviations in the documentation from the original plan and the number of research files.
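The difference between the total and the unique number of triples follows from merging the individual models into one set. A minimal sketch with toy data (the triples shown are invented for illustration and are not taken from the RO-Crates):

```python
def triple_counts(crates):
    """Return (total, unique) triple counts for a list of triple sets."""
    total = sum(len(crate) for crate in crates)
    unique = len(set().union(*crates))  # shared triples are counted once
    return total, unique


# Toy example: both crates describe the same researcher affiliation.
crate_a = {("researcher_1", "memberOf", "institution_1"), ("exp_1", "used", "LSM780")}
crate_b = {("researcher_1", "memberOf", "institution_1"), ("exp_2", "used", "LSM780")}
total, unique = triple_counts([crate_a, crate_b])
```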

The structure-based approach employs RO-Crate, PROV-O, and BFO as upper level ontologies. Especially RO-Crate and PROV-O are designed to encode provenance information about resources. Provenance information about experimenter, manufacturer, biological and chemical resources, activities, and research data are transferred by this approach into a semantic representation. To illustrate the capabilities of the resulting RO-Crate bundles, we evaluated SPARQL queries for the W7 questions in our use case. Considering the question “How was a particular file created?” (W3), Fig. 8 presents the corresponding SPARQL query for a Ca-imaging approach in a particular experiment. Table 2 illustrates an excerpt of the result of this query, i.e., the sequence of activities from one experiment, providing the result to the question W3. That is, for every atomic activity within the Ca-imaging approach, the description as well as the created research data are listed in the order of the execution. Moreover, all resources and equipment (W2), as well as the parameters, are depicted as a result of the query.


Figure 8. This SPARQL query selects (1) the ontological activity classes, (2) the research data produced, (3) the resources and equipment that are used, and (4) the parameters for each atomic activity, ordered by their execution, in a Ca-imaging approach with stimulation from one of the use case ELN protocols that have been translated using the structure-based modelling approach.

Table 2. An excerpt of the resulting output for the SPARQL query in Fig. 8.
Activity | Text | Act.-Class | Resources | Files | Par.-Units | Par.-Values
[...]
ap_1_with_stimulation/14 | place [Device] IonOptix 12 well plate chamber electrodes on plate | obo:NCIT_C52253 | IonOptix 12 well plate chamber | | |
ap_1_with_stimulation/15 | incubate for 10min with stimulation in LSM hood: [...] | obo:OMIT_0005807, obo:OBI_0001007, obo:OBI_0302893 | LSM780, ZEN 2011 (black edition) | Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi | obo:UO_0000031, obo:UO_0000028, obo:UO_0000218, obo:UO_0000106 | 5, 10, 7.9
[...]

Besides queries for individual experiments, the semantic models enable the comparison of the documentation of multiple experiments. As an example, we consider the question “What was the order of the stimulation parameters in a particular experiment?” (W7) that should be answered for seven experiments. Figure 9 illustrates the query for the comparison of multiple experiments based on the order of their stimulation parameters. The corresponding results are shown in Table 3.



Footnotes

  a. A lot number is an identifier for a particular set of materials produced by one manufacturer. Thus, lot numbers make it possible to track information about the provenance of these material productions.
  b. Note that this is not part of the inventory item description, as this aims at the general cell specification. However, the particular information for a specific experiment is part of the ELN protocol.

References

  1. "RRID Portal". SciCrunch. 2021. https://scicrunch.org/resources. 
  2. MIACA Standards Initiative (2006). "MIACA - Minimum Information About a Cellular Assay". SourceForge. http://miaca.sourceforge.net/. 
  3. Kunis, S. (22 October 2021). "Workgroup RDM4mic - Research data management for microscopy". Zenodo. doi:10.5281/zenodo.5591958. https://zenodo.org/record/5591958. 
  4. National Center for Biomedical Ontology (2021). "BioPortal". Board of Trustees of Leland Stanford Junior University. https://bioportal.bioontology.org/. 
  5. Ong, Edison; Xiang, Zuoshuang; Zhao, Bin; Liu, Yue; Lin, Yu; Zheng, Jie; Mungall, Chris; Courtot, Mélanie et al. (4 January 2017). "Ontobee: A linked ontology data server to support ontology term dereferencing, linkage, query and integration". Nucleic Acids Research 45 (D1): D347–D352. doi:10.1093/nar/gkw918. ISSN 1362-4962. PMC 5210626. PMID 27733503. https://pubmed.ncbi.nlm.nih.gov/27733503. 
  6. Hyland, B.; Atemezing, G.; Pendleton, M. et al., ed. (27 June 2013). "Linked Data Glossary". W3C. https://dvcs.w3.org/hg/gld/raw-file/default/glossary/index.html. 
  7. The Apache Software Foundation (2021). "Apache Jena Fuseki". https://jena.apache.org/documentation/fuseki2/index.html. 
  8. Brickley, D.; Miller, L. (14 January 2014). "FOAF Vocabulary Specification 0.99". xmlns.com. http://xmlns.com/foaf/spec/. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. To more easily differentiate footnotes from references, the original footnotes (which were numbered) were updated to use lowercase letters. Most footnotes referencing web pages were turned into proper citations.