Difference between revisions of "Journal:Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation"

From LIMSWiki
(Saving and adding more.)
(→‎Notes: Cats)
 
(5 intermediate revisions by the same user not shown)
Line 18: Line 18:
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-021-00257-x https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-021-00257-x]
|download    = [https://jbiomedsem.biomedcentral.com/track/pdf/10.1186/s13326-021-00257-x.pdf https://jbiomedsem.biomedcentral.com/track/pdf/10.1186/s13326-021-00257-x.pdf] (PDF)
}}
{{ombox
| type      = notice
| image    = [[Image:Emblem-important-yellow.svg|40px]]
| style    = width: 500px;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.
}}
}}
==Abstract==
Line 37: Line 31:


==Background==
Effective reuse of research data requires comprehensive documentation of their [[wikipedia:Provenance#Data provenance|provenance]]. Besides [[metadata]], knowledge about the generating process helps others to understand research data and allows for the reproduction of research investigations. This includes not only sources of input data, such as parameters and assumptions, but also [[information]] about instrumentation, devices, and materials. For wet [[Laboratory|lab]] experiments, such knowledge is increasingly documented in [[electronic laboratory notebook]]s (ELNs). The focus of these tools is on the documentation of laboratory activities that produce research data in so-called "ELN protocols." In addition to this textual description, the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR guiding principles]] [1] provide general guidance on research data documentation in terms of metadata. However, they do not prescribe technical details about the implementation of such documentation. [2]
Effective reuse of research data requires comprehensive documentation of their [[wikipedia:Provenance#Data provenance|provenance]]. Besides [[metadata]], knowledge about the generating process helps others to understand research data and allows for the reproduction of research investigations. This includes not only sources of input data, such as parameters and assumptions, but also [[information]] about instrumentation, devices, and materials. For wet [[Laboratory|lab]] experiments, such knowledge is increasingly documented in [[electronic laboratory notebook]]s (ELNs). The focus of these tools is on the documentation of laboratory activities that produce research data in so-called "ELN protocols." In addition to this textual description, the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR guiding principles]]<ref name=":0">{{Cite journal |last=Wilkinson |first=Mark D. |last2=Dumontier |first2=Michel |last3=Aalbersberg |first3=IJsbrand Jan |last4=Appleton |first4=Gabrielle |last5=Axton |first5=Myles |last6=Baak |first6=Arie |last7=Blomberg |first7=Niklas |last8=Boiten |first8=Jan-Willem |last9=da Silva Santos |first9=Luiz Bonino |last10=Bourne |first10=Philip E. |last11=Bouwman |first11=Jildau |date=2016-12 |title=The FAIR Guiding Principles for scientific data management and stewardship |url=http://www.nature.com/articles/sdata201618 |journal=Scientific Data |language=en |volume=3 |issue=1 |pages=160018 |doi=10.1038/sdata.2016.18 |issn=2052-4463 |pmc=PMC4792175 |pmid=26978244}}</ref> provide general guidance on research data documentation in terms of metadata. 
However, they do not prescribe technical details about the implementation of such documentation.<ref name=":1">{{Cite journal |last=Jacobsen |first=Annika |last2=de Miranda Azevedo |first2=Ricardo |last3=Juty |first3=Nick |last4=Batista |first4=Dominique |last5=Coles |first5=Simon |last6=Cornet |first6=Ronald |last7=Courtot |first7=Mélanie |last8=Crosas |first8=Mercè |last9=Dumontier |first9=Michel |last10=Evelo |first10=Chris T. |last11=Goble |first11=Carole |date=2020-01 |title=FAIR Principles: Interpretations and Implementation Considerations |url=https://direct.mit.edu/dint/article/2/1-2/10-29/10017 |journal=Data Intelligence |language=en |volume=2 |issue=1-2 |pages=10–29 |doi=10.1162/dint_r_00024 |issn=2641-435X}}</ref>


To foster the realization of the FAIR principles for research data produced in wet lab experiments, we aim for machine-interpretable representations of experimental documentation of the process that is the origin of the data. In other words, the provenance information about the research data—including the activities and involved researchers, resources, and equipment—should be [[wikipedia:Semantics|semantically]] represented. For this purpose, we employ the frequently used [3] PROV W3C recommendation [4], which [[Ontology (information science)|ontologically]], in PROV Ontology (PROV-O), defines entities, activities, and agents, as well as their relations. In particular, according to Belhajjame ''et al.'', an entity is defined as a “physical, digital, conceptual, or other kind of thing with some fixed aspects,” [5] an activity as “something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities,” [5] and an agent as “something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity.” [5] With respect to wet lab experiments, all biological and chemical resources—as well as not only the devices and software but also the research data itself—can be seen as entities; researchers conducting the experiment are the agents, and the process of research data creation consists of activities. The semantic representation of this information as a knowledge graph (KG) [6] can be achieved by the use of modern web technologies where the terms and their relations are defined in ontologies such as PROV-O (TBox modelling), the instances are built up in the KG (ABox modelling), and other KGs can be linked in order to create an interconnected graph of semantic knowledge.
To foster the realization of the FAIR principles for research data produced in wet lab experiments, we aim for machine-interpretable representations of experimental documentation of the process that is the origin of the data. In other words, the provenance information about the research data—including the activities and involved researchers, resources, and equipment—should be [[wikipedia:Semantics|semantically]] represented. For this purpose, we employ the frequently used<ref name=":2">{{Citation |last=Yu |first=Fangyu |last2=Zhou |first2=Beisi |last3=Lu |first3=Tun |last4=Gu |first4=Ning |date=2019 |editor-last=Sun |editor-first=Yuqing |editor2-last=Lu |editor2-first=Tun |editor3-last=Xie |editor3-first=Xiaolan |editor4-last=Gao |editor4-first=Liping |editor5-last=Fan |editor5-first=Hongfei |title=Research on Data Provenance Model for Multidisciplinary Collaboration |url=http://link.springer.com/10.1007/978-981-13-3044-5_3 |work=Computer Supported Cooperative Work and Social Computing |publisher=Springer Singapore |place=Singapore |volume=917 |pages=32–49 |doi=10.1007/978-981-13-3044-5_3 |isbn=978-981-13-3043-8 |accessdate=2022-04-01}}</ref> PROV W3C recommendation<ref>{{Cite journal |last=Moreau |first=Luc |last2=Groth |first2=Paul |date=2013-09-15 |title=Provenance: An Introduction to PROV |url=http://www.morganclaypool.com/doi/abs/10.2200/S00528ED1V01Y201308WBE007 |journal=Synthesis Lectures on the Semantic Web: Theory and Technology |language=en |volume=3 |issue=4 |pages=1–129 |doi=10.2200/S00528ED1V01Y201308WBE007 |issn=2160-4711}}</ref>, which [[Ontology (information science)|ontologically]], in PROV Ontology (PROV-O), defines entities, activities, and agents, as well as their relations. In particular, according to Belhajjame ''et al.'', an entity is defined as a “physical, digital, conceptual, or other kind of thing with some fixed aspects,”<ref name=":3">{{Cite web |last=Belhajjame, K.; B'Far, R.; Cheney, J. et al. 
|date=30 April 2013 |title=PROV-DM: The PROV Data Model |url=https://www.w3.org/TR/2013/REC-prov-dm-20130430/ |publisher=W3C}}</ref> an activity as “something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities,”<ref name=":3" /> and an agent as “something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity.”<ref name=":3" /> With respect to wet lab experiments, all biological and chemical resources—as well as not only the devices and software but also the research data itself—can be seen as entities; researchers conducting the experiment are the agents, and the process of research data creation consists of activities. The semantic representation of this information as a knowledge graph (KG)<ref>{{Cite journal |last=Hogan |first=Aidan |last2=Blomqvist |first2=Eva |last3=Cochez |first3=Michael |last4=D’amato |first4=Claudia |last5=Melo |first5=Gerard De |last6=Gutierrez |first6=Claudio |last7=Kirrane |first7=Sabrina |last8=Gayo |first8=José Emilio Labra |last9=Navigli |first9=Roberto |last10=Neumaier |first10=Sebastian |last11=Ngomo |first11=Axel-Cyrille Ngonga |date=2022-05-31 |title=Knowledge Graphs |url=https://dl.acm.org/doi/10.1145/3447772 |journal=ACM Computing Surveys |language=en |volume=54 |issue=4 |pages=1–37 |doi=10.1145/3447772 |issn=0360-0300}}</ref> can be achieved by the use of modern web technologies where the terms and their relations are defined in ontologies such as PROV-O (TBox modelling), the instances are built up in the KG (ABox modelling), and other KGs can be linked in order to create an interconnected graph of semantic knowledge.
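
The mapping described above can be illustrated with a minimal sketch. It assumes a hypothetical namespace (<code>http://example.org/lab/</code>) and invented identifiers for one researcher, one Ca-imaging run, two resources, and one data set; only the PROV-O class and property IRIs are taken from the W3C vocabulary. It is not the implementation used in the paper, merely one way to write down the ABox statements as plain triples.

```python
from dataclasses import dataclass

PROV = "http://www.w3.org/ns/prov#"                          # W3C PROV-O namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/lab/"                               # hypothetical namespace (assumption)


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def build_abox():
    """Return instance-level (ABox) statements for one invented imaging run."""
    researcher = EX + "researcher-1"      # agent (hypothetical)
    imaging = EX + "ca-imaging-run-1"     # activity (hypothetical)
    buffer = EX + "buffer-lot-42"         # entity: material (hypothetical)
    microscope = EX + "microscope-1"      # entity: device (hypothetical)
    images = EX + "image-series-1"        # entity: research data (hypothetical)
    return [
        # Typing each resource with a PROV-O class (terms defined in the TBox).
        Triple(researcher, RDF_TYPE, PROV + "Agent"),
        Triple(imaging, RDF_TYPE, PROV + "Activity"),
        Triple(buffer, RDF_TYPE, PROV + "Entity"),
        Triple(microscope, RDF_TYPE, PROV + "Entity"),
        Triple(images, RDF_TYPE, PROV + "Entity"),
        # Relations between agent, activity, and entities.
        Triple(imaging, PROV + "wasAssociatedWith", researcher),
        Triple(imaging, PROV + "used", buffer),
        Triple(imaging, PROV + "used", microscope),
        Triple(images, PROV + "wasGeneratedBy", imaging),
    ]


def to_ntriples(triples):
    """Serialize IRI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{t.subject}> <{t.predicate}> <{t.obj}> ." for t in triples)


print(to_ntriples(build_abox()))
```

Keeping the statements as IRI-only triples means the output can be loaded into any RDF store and linked to other KGs; a real model would additionally carry literals such as timestamps and lot numbers.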


In this paper, we aim for an automatic extraction of information from ELN protocols in order to transfer them into a semantic representation that documents the produced research data. For this purpose, we employ the documentation of Calcium imaging (Ca-imaging) experiments, originally proposed by Staehlke ''et al.'' [7], as a running example. In particular, we use ELN protocols that document the conduction of Ca-imaging experiments in order to: (i) demonstrate the feasibility of manually transferring an ELN protocol into a semantic representation encoding the provenance of research data, (ii) automate the information extraction and modelling by exploiting the structure of an ELN protocol by means of a structure-based approach, and (iii) evaluate the proposed method by answering provenance questions from the resulting bundle of research data and the corresponding semantic model.
In this paper, we aim for an automatic extraction of information from ELN protocols in order to transfer them into a semantic representation that documents the produced research data. For this purpose, we employ the documentation of Calcium imaging (Ca-imaging) experiments, originally proposed by Staehlke ''et al.''<ref name=":4">{{Cite journal |last=Staehlke |first=Susanne |last2=Koertge |first2=Andreas |last3=Nebe |first3=Barbara |date=2015-04 |title=Intracellular calcium dynamics dependent on defined microtopographical features of titanium |url=https://linkinghub.elsevier.com/retrieve/pii/S0142961214012666 |journal=Biomaterials |language=en |volume=46 |pages=48–57 |doi=10.1016/j.biomaterials.2014.12.016}}</ref>, as a running example. In particular, we use ELN protocols that document the conduction of Ca-imaging experiments in order to: (i) demonstrate the feasibility of manually transferring an ELN protocol into a semantic representation encoding the provenance of research data, (ii) automate the information extraction and modelling by exploiting the structure of an ELN protocol by means of a structure-based approach, and (iii) evaluate the proposed method by answering provenance questions from the resulting bundle of research data and the corresponding semantic model.


Here, the term "ELN protocol" refers to the actual documentation of the wet lab experiment within an ELN. It is different from the term "protocol template," which denotes an encoding of the instructions to be performed in order to conduct a particular procedure and which may be published at [https://www.protocols.io/ protocols.io]. While those protocol templates do encode a list of abstract instructions, they do not necessarily reflect particular research data, instrumentation, parameters, or other execution-specific information. ELN protocols, in contrast, represent the documentation of the actual experiment, and the contained information is thus necessary to understand how the resulting research data was generated. This includes manufacturer-specific information about resources used in the experiment such as lot numbers.{{Efn|A lot number is an identifier for a particular set of materials produced by one manufacturer. Thus, lot numbers make it possible to track information about the provenance of these materials.}} Furthermore, passage numbers of the resources, the times when an activity was conducted, and the parameters used in a device, as well as the research data and the researchers conducting the experiment, are information specific to a particular experiment. Figure 1 illustrates the differences by providing an example of an ELN protocol and a protocol template.
Line 57: Line 51:
|}
|}


The work presented here is based on a preliminary investigation regarding the effectiveness of manually modeling ELN protocols by use of ontologies. [8] Here, we extend this preliminary work by discussing the potential of automatic information extraction from ELN protocols by employing structural information and discussing the differences and implications of both approaches. Moreover, while the previous work only sketched the semantic representation of the wet lab experiments, here, we focus on the generation of ready-to-publish research data bundles, including the semantic description of the origin of the research data.
The work presented here is based on a preliminary investigation regarding the effectiveness of manually modeling ELN protocols by use of ontologies.<ref name=":5">{{Cite journal |last=Schröder |first=Max |last2=Stählke |first2=Susanne |last3=Nebe |first3=Barbara |last4=Krüger |first4=Frank |date=2020 |title=Towards in-situ knowledge acquisition for research data provenance from electronic lab notebooks |url=https://repository.publisso.de/resource/frl:6423288 |journal=Proceedings of the 1st Workshop on Research Data Management for Linked Open Science (DaMaLOS) Co-located with 19th International Semantic Web Conference |language=en |doi=10.4126/FRL01-006423288}}</ref> Here, we extend this preliminary work by discussing the potential of automatic information extraction from ELN protocols by employing structural information and discussing the differences and implications of both approaches. Moreover, while the previous work only sketched the semantic representation of the wet lab experiments, here, we focus on the generation of ready-to-publish research data bundles, including the semantic description of the origin of the research data.


==Use case==
Line 63: Line 57:


===Biomedical wet lab experiments===
The objective of the biomedical study was to investigate the intracellular calcium ions (Ca<sup>2+</sup>) dynamics by Calcium-imaging (Ca-imaging) under different settings. [7] In particular, two different wet lab experiments were considered: (i) an investigation of the influence of different material surface conditions on Ca<sup>2+</sup> mobilization, and (ii) an investigation regarding the Ca<sup>2+</sup> dynamics under the influence of electrical stimulation. Both types of experiments involve similar activities of the researchers. In particular, each experiment employs the Ca-imaging method previously established by Staehlke ''et al.'' [7] in different settings. The particular conditions, e.g., surface conditions or parameters of the electrical stimulation, are investigated within each experiment, while the order of the different variations was permuted across the experiments. That is, after a preparation phase, where all materials and devices are prepared, the same procedure, i.e., Ca-imaging, was executed for the different conditions. During the experiment, several materials and devices are employed, such as cell line passages, buffer, and microscopes.
The objective of the biomedical study was to investigate the intracellular calcium ions (Ca<sup>2+</sup>) dynamics by Calcium-imaging (Ca-imaging) under different settings.<ref name=":4" /> In particular, two different wet lab experiments were considered: (i) an investigation of the influence of different material surface conditions on Ca<sup>2+</sup> mobilization, and (ii) an investigation regarding the Ca<sup>2+</sup> dynamics under the influence of electrical stimulation. Both types of experiments involve similar activities of the researchers. In particular, each experiment employs the Ca-imaging method previously established by Staehlke ''et al.''<ref name=":4" /> in different settings. The particular conditions, e.g., surface conditions or parameters of the electrical stimulation, are investigated within each experiment, while the order of the different variations was permuted across the experiments. That is, after a preparation phase, where all materials and devices are prepared, the same procedure, i.e., Ca-imaging, was executed for the different conditions. During the experiment, several materials and devices are employed, such as cell line passages, buffer, and microscopes.


For the purpose of this study, we asked the researchers to use an ELN for the documentation of their wet lab activities, resulting in eight ELN protocols: one for the first experiment and seven for the latter, representing different permutations of the sequential execution of Ca-imaging for different electrical stimulation parameters. In particular, [[eLabFTW]] (Deltablot, https://www.elabftw.net/, v3.6.7) [10], a domain-independent ELN, was used. Figure 2 shows an excerpt of a protocol from the use case.
For the purpose of this study, we asked the researchers to use an ELN for the documentation of their wet lab activities, resulting in eight ELN protocols: one for the first experiment and seven for the latter, representing different permutations of the sequential execution of Ca-imaging for different electrical stimulation parameters. In particular, [[eLabFTW]] (Deltablot, https://www.elabftw.net/, v3.6.7)<ref>{{Cite journal |last=CARPi |first=Nicolas |last2=Minges |first2=Alexander |last3=Piel |first3=Matthieu |date=2017-04-14 |title=eLabFTW: An open source laboratory notebook for research labs |url=http://joss.theoj.org/papers/10.21105/joss.00146 |journal=The Journal of Open Source Software |volume=2 |issue=12 |pages=146 |doi=10.21105/joss.00146 |issn=2475-9066}}</ref>, a domain-independent ELN, was used. Figure 2 shows an excerpt of a protocol from the use case.




Line 96: Line 90:


===Research data provenance===
When considering this use case, several questions regarding the provenance of the research data can be raised. To this end, we consider questions based on the W7 provenance model [11], which describes provenance as combinations of What, When, Where, How, Who, Which, and Why. We consider each question individually, encoding the view of a researcher who aims to reuse the research data from our use case. The questions were developed together with the domain experts and resemble actual questions that arise when considering the replication of the documented experiments.
When considering this use case, several questions regarding the provenance of the research data can be raised. To this end, we consider questions based on the W7 provenance model<ref>{{Citation |last=Ram |first=Sudha |last2=Liu |first2=Jun |date=2007 |editor-last=Chen |editor-first=Peter P. |editor2-last=Wong |editor2-first=Leah Y. |title=Understanding the Semantics of Data Provenance to Support Active Conceptual Modeling |url=http://link.springer.com/10.1007/978-3-540-77503-4_3 |work=Active Conceptual Modeling of Learning |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |volume=4512 |pages=17–29 |doi=10.1007/978-3-540-77503-4_3 |isbn=978-3-540-77502-7 |accessdate=2022-04-01}}</ref>, which describes provenance as combinations of What, When, Where, How, Who, Which, and Why. We consider each question individually, encoding the view of a researcher who aims to reuse the research data from our use case. The questions were developed together with the domain experts and resemble actual questions that arise when considering the replication of the documented experiments.


:'''W1''' ''Who'' participated in the study?
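
As an illustration of how a question such as W1 could be answered from a semantic model, the following sketch filters a small set of provenance triples for the PROV-O property <code>prov:wasAssociatedWith</code>. The namespace <code>http://example.org/lab/</code>, the identifiers, and the helper <code>who_participated</code> are hypothetical and chosen only for this example; an actual evaluation would more likely run a query against the published KG.

```python
PROV = "http://www.w3.org/ns/prov#"   # W3C PROV-O namespace
EX = "http://example.org/lab/"        # hypothetical namespace (assumption)

# Invented provenance statements as (subject, predicate, object) tuples.
TRIPLES = [
    (EX + "ca-imaging-run-1", PROV + "wasAssociatedWith", EX + "researcher-1"),
    (EX + "ca-imaging-run-2", PROV + "wasAssociatedWith", EX + "researcher-2"),
    (EX + "ca-imaging-run-2", PROV + "wasAssociatedWith", EX + "researcher-1"),
    (EX + "image-series-1", PROV + "wasGeneratedBy", EX + "ca-imaging-run-1"),
]


def who_participated(triples):
    """Return the distinct agents associated with any activity, sorted by IRI."""
    return sorted(
        {obj for _subject, predicate, obj in triples
         if predicate == PROV + "wasAssociatedWith"}
    )


for agent in who_participated(TRIPLES):
    print(agent)
```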
Line 129: Line 123:
The provenance of research data, including their research investigations, combines several research fields, ranging from general-purpose methods and standards for the documentation of provenance to specifically tailored methods and platforms for the tracking of research and other activities. In the following, we will discuss recent work within those fields and relate it to our method.


Many methods aiming at documenting the provenance of activities have already been proposed. Here, we consider the classification of provenance information following the definition of Herschel ''et al.'' [12] and Lim ''et al.'' [13]:
Many methods aiming at documenting the provenance of activities have already been proposed. Here, we consider the classification of provenance information following the definition of Herschel ''et al.''<ref name=":6">{{Cite journal |last=Herschel |first=Melanie |last2=Diestelkämper |first2=Ralf |last3=Ben Lahmar |first3=Houssem |date=2017-12 |title=A survey on provenance: What for? What form? What from? |url=http://link.springer.com/10.1007/s00778-017-0486-1 |journal=The VLDB Journal |language=en |volume=26 |issue=6 |pages=881–906 |doi=10.1007/s00778-017-0486-1 |issn=1066-8888}}</ref> and Lim ''et al.''<ref name=":7">{{Cite journal |last=Lim |first=Chunhyeok |last2=Lu |first2=Shiyong |last3=Chebotko |first3=Artem |last4=Fotouhi |first4=Farshad |date=2010-07 |title=Prospective and Retrospective Provenance Collection in Scientific Workflow Environments |url=http://ieeexplore.ieee.org/document/5557202/ |journal=2010 IEEE International Conference on Services Computing |publisher=IEEE |place=Miami, FL, USA |pages=449–456 |doi=10.1109/SCC.2010.18 |isbn=978-1-4244-8147-7}}</ref>:


#prospective provenance describes “an abstract workflow specification as a recipe for future data derivation” [13];
#prospective provenance describes “an abstract workflow specification as a recipe for future data derivation”<ref name=":7" />;
#retrospective provenance documents a “past workflow execution and data derivation information, i.e., which tasks were performed and how data artifacts were derived” [13]; and
#retrospective provenance documents a “past workflow execution and data derivation information, i.e., which tasks were performed and how data artifacts were derived”<ref name=":7" />; and
#evolution provenance illustrates “the changes made between two versions of the input” [12], or, in other words, versions of the procedure, the data, or the parameters are reflected by evolution provenance similar to version control such as that implemented by Git for source code.
#evolution provenance illustrates “the changes made between two versions of the input”<ref name=":6" />, or, in other words, versions of the procedure, the data, or the parameters are reflected by evolution provenance similar to version control such as that implemented by Git for source code.


Applying those definitions to the use case at hand, prospective provenance allows changes to laboratory-specific operating procedures in general to be tracked, while retrospective provenance documents the sequence of activities that was actually executed and resulted in a particular set of research data. Finally, evolution provenance allows changes made to the actual ELN protocol or the inventory database items to be tracked.
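
Evolution provenance in the sense used here can be sketched by comparing two versions of a protocol text, much as a version control system would. The example below relies on Python's standard <code>difflib</code> module; the protocol steps and the helper <code>protocol_changes</code> are invented for illustration and do not stem from the use case.

```python
import difflib


def protocol_changes(old_text, new_text):
    """Return the lines removed ('-') or added ('+') between two protocol versions."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="protocol_v1",
        tofile="protocol_v2",
        lineterm="",
    )
    # Keep only changed lines; skip the '---'/'+++' file headers and '@@' hunk markers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical protocol versions (assumption, not taken from the study).
version_1 = "Add buffer\nIncubate 30 min\nRecord images"
version_2 = "Add buffer\nIncubate 45 min\nRecord images"

for change in protocol_changes(version_1, version_2):
    print(change)
```

A line-based comparison of this kind only records that a step changed; relating the change to a parameter or resource would require the structured information discussed in the remainder of this work.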
Line 139: Line 133:
With respect to the research workflows to be represented by provenance modeling, two different types can be distinguished:


#''In-silico'' studies employ computational methods for the analysis of the data. [[Workflow]] systems like Taverna [14], Kepler [15], or [[Galaxy (biomedical software)|Galaxy]] [16], and programming environments like [[Jupyter Notebook]] [17] have been successfully augmented to record retrospective provenance.
#''In-silico'' studies employ computational methods for the analysis of the data. [[Workflow]] systems like Taverna<ref>{{Cite journal |last=Belhajjame |first=Khalid |last2=Wolstencroft |first2=Katy |last3=Corcho |first3=Oscar |last4=Oinn |first4=Tom |last5=Tanoh |first5=Franck |last6=William |first6=Alan |last7=Goble |first7=Carole |date=2008-05 |title=Metadata Management in the Taverna Workflow System |url=http://ieeexplore.ieee.org/document/4534278/ |journal=2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) |publisher=IEEE |place=Lyon, France |pages=651–656 |doi=10.1109/CCGRID.2008.17}}</ref>, Kepler<ref>{{Citation |last=Altintas |first=Ilkay |last2=Barney |first2=Oscar |last3=Jaeger-Frank |first3=Efrat |date=2006 |editor-last=Moreau |editor-first=Luc |editor2-last=Foster |editor2-first=Ian |title=Provenance Collection Support in the Kepler Scientific Workflow System |url=http://link.springer.com/10.1007/11890850_14 |work=Provenance and Annotation of Data |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |volume=4145 |pages=118–132 |doi=10.1007/11890850_14 |isbn=978-3-540-46302-3 |accessdate=2022-04-01}}</ref>, or [[Galaxy (biomedical software)|Galaxy]]<ref>{{Cite journal |last=Goecks |first=Jeremy |last2=Nekrutenko |first2=Anton |last3=Taylor |first3=James |last4=Galaxy Team |first4=The |date=2010 |title=Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences |url=http://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-8-r86 |journal=Genome Biology |language=en |volume=11 |issue=8 |pages=R86 |doi=10.1186/gb-2010-11-8-r86 |issn=1465-6906 |pmc=PMC2945788 |pmid=20738864}}</ref>, and programming environments like [[Jupyter Notebook]]<ref name=":8">{{Cite journal |last=Samuel, S.; König-Ries, B. 
|year=2018 |title=ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility |url=http://ceur-ws.org/Vol-2180/paper-57.pdf |format=Pdf |journal=Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International Semantic Web Conference |volume=2180 |URN=urn:nbn:de:0074-2180-3}}</ref> have been successfully augmented to record retrospective provenance.
#Wet lab experiments are courses of activities in a laboratory. While several approaches exist that describe prospective provenance [18, 19] by analyzing published protocols, only limited work has been done on documenting retrospective provenance for these workflows.
#Wet lab experiments are courses of activities in a laboratory. While several approaches exist that describe prospective provenance<ref name=":9">{{Cite journal |last=Soldatova |first=Larisa N |last2=Nadis |first2=Daniel |last3=King |first3=Ross D |last4=Basu |first4=Piyali S |last5=Haddi |first5=Emma |last6=Baumlé |first6=Véronique |last7=Saunders |first7=Nigel J |last8=Marwan |first8=Wolfgang |last9=Rudkin |first9=Brian B |date=2014-12 |title=EXACT2: the semantics of biomedical protocols |url=https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S14-S5 |journal=BMC Bioinformatics |language=en |volume=15 |issue=S14 |pages=S5 |doi=10.1186/1471-2105-15-S14-S5 |issn=1471-2105 |pmc=PMC4255744 |pmid=25472549}}</ref><ref name=":10">{{Cite book |last=Giraldo Pasmín, O.X., Corcho, O.; Castro, A.G., |year=2014 |title=Linked Science 2014: Proceedings of the 4th Workshop on Linked Science 2014 - Making Sense Out of Data |url=https://oa.upm.es/36778/ |chapter=SMART Protocols: seMAntic represenTation for experimental protocols |publisher= |volume=1282 |pages=36–47 |isbn=1613-0073}}</ref> by analyzing published protocols, only limited work has been done on documenting retrospective provenance for these workflows.


More detailed information about provenance modelling and the employed methods is provided in the literature. [3, 12] Here, we are interested in providing detailed information about the origin of research data. Thus, we aim at providing retrospective provenance documentation of research data from ELN protocols documenting wet lab experiments.
More detailed information about provenance modelling and the employed methods is provided in the literature.<ref name=":2" /><ref name=":6" /> Here, we are interested in providing detailed information about the origin of research data. Thus, we aim at providing retrospective provenance documentation of research data from ELN protocols documenting wet lab experiments.


The Smart Tea project<ref>{{Cite journal |last=Hughes |first=Gareth |last2=Mills |first2=Hugo |last3=De Roure |first3=David |last4=Frey |first4=Jeremy G. |last5=Moreau |first5=Luc |last6=schraefel |first6=m. c. |last7=Smith |first7=Graham |last8=Zaluska |first8=Ed |date=2004 |title=The semantic smart laboratory: a system for supporting the chemical eScientist |url=http://xlink.rsc.org/?DOI=b410075a |journal=Organic & Biomolecular Chemistry |language=en |volume=2 |issue=22 |pages=3284 |doi=10.1039/b410075a |issn=1477-0520}}</ref> similarly aims at recording semantic metadata for research data from within a customized ELN. The developed ELN provides a structured graphical user interface (GUI) requiring the user to provide information for predefined variables. All information is directly transferred into a linked data representation and persistently archived with a linked data server. While this approach perfectly guides users through the sequence of activities and tracks retrospective provenance at the same time, it fails to keep track of deviations from the predefined plan. Furthermore, as the documentation is directly translated into a semantic representation, additional information that was not considered before can hardly be attached to such protocols, which restricts both the expressivity of the semantic model and the user to previously known information.


Similar to the Smart Tea project, the PROV templating approach<ref name=":11">{{Cite journal |last=Moreau |first=Luc |last2=Batlajery |first2=Belfrit Victor |last3=Huynh |first3=Trung Dong |last4=Michaelides |first4=Danius |last5=Packer |first5=Heather |date=2018-02-01 |title=A Templating System to Generate Provenance |url=https://ieeexplore.ieee.org/document/7909036/ |journal=IEEE Transactions on Software Engineering |volume=44 |issue=2 |pages=103–121 |doi=10.1109/TSE.2017.2659745 |issn=0098-5589}}</ref> suggests the recording of provenance information given a pre-defined provenance model. In other words, the main idea is that applications only store values for placeholders in a particular provenance model, which was shown to be more efficient than the storage of the original provenance models.<ref name=":11" /> This solution is very efficient if a very large number of identical provenance structures with some variable information are to be stored. If, however, the application requires more flexibility in terms of the provenance structure, the template approach does not utilize this efficiency advantage. Note that provenance templates encode a semantic representation with variables, whereas protocol templates provide guidelines for experiments.
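
The placeholder idea can be illustrated with a minimal sketch. It is not the cited templating system: the template statements, the <tt>var:</tt> placeholder convention as used here, the <tt>ex:</tt> identifiers, and the function name <tt>expand</tt> are all assumptions made for illustration. The point is only that one shared structure is stored once, while each recorded run stores nothing but its bindings.

```python
# Hypothetical provenance template: statements whose terms may be
# placeholders (marked here with a "var:" prefix) or fixed PROV-O properties.
TEMPLATE = [
    ("var:data", "prov:wasGeneratedBy", "var:activity"),
    ("var:activity", "prov:wasAssociatedWith", "var:agent"),
]


def expand(template, bindings):
    """Return the template with every placeholder replaced by its bound value.

    Raises KeyError if a placeholder has no binding.
    """
    def resolve(term):
        if term.startswith("var:"):
            return bindings[term[len("var:"):]]
        return term

    return [tuple(resolve(term) for term in statement) for statement in template]


# Per run, only these small binding sets need to be stored (illustrative values).
RUNS = [
    {"data": "ex:image1", "activity": "ex:imaging1", "agent": "ex:alice"},
    {"data": "ex:image2", "activity": "ex:imaging2", "agent": "ex:bob"},
]

# The full provenance statements are regenerated on demand.
expanded = [expand(TEMPLATE, bindings) for bindings in RUNS]
```

The sketch also shows the limitation noted above: a run whose structure deviates from <tt>TEMPLATE</tt> cannot be expressed by bindings alone and would need its own template.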


Curcin ''et al.''<ref>{{Cite journal |last=Curcin |first=Vasa |last2=Fairweather |first2=Elliot |last3=Danger |first3=Roxana |last4=Corrigan |first4=Derek |date=2017-01 |title=Templates as a method for implementing data provenance in decision support systems |url=https://linkinghub.elsevier.com/retrieve/pii/S1532046416301599 |journal=Journal of Biomedical Informatics |language=en |volume=65 |pages=1–21 |doi=10.1016/j.jbi.2016.10.022}}</ref> use a very similar approach for the provenance modelling in [[Clinical decision support system|diagnostic decision support systems]]. A more flexible approach is the use of knowledge graph cells (KGCs), proposed by Vogt ''et al.''<ref>{{Cite journal |last=Vogt |first=Lars |last2=D'Souza |first2=Jennifer |last3=Stocker |first3=Markus |last4=Auer |first4=Sören |date=2020-08 |title=Toward Representing Research Contributions in Scholarly Knowledge Graphs Using Knowledge Graph Cells |url=https://dl.acm.org/doi/10.1145/3383583.3398530 |journal=Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 |language=en |publisher=ACM |place=Virtual Event China |pages=107–116 |doi=10.1145/3383583.3398530 |isbn=978-1-4503-7585-6}}</ref> They provide a concept for the definition of knowledge structures. In particular, rules including ABox and TBox expressions might be defined that allow the dynamic modification of the KG. Thus, KGCs might be used to specify potential semantic structures of ELN protocols without particular information inside. The application of KGCs would require a complete definition over all possible semantic representations of ELN protocols, which is infeasible.


With respect to the vocabulary used to semantically describe the laboratory-specific information, the EXperimental ACTions (EXACT2) ontology, together with the Natural Language Processing (NLP) framework<ref name=":9" />, aims at the automatic extraction of knowledge from biomedical protocols for prospective provenance. Similarly, the SeMAntic RepresenTation for Experimental Protocols (SMART Protocols) ontology reuses EXACT2 to represent prospective provenance from published protocols.<ref name=":10" /> In contrast to both approaches that represent a plan, we aim at retrospective provenance, i.e., a particular course of activities. Both approaches, however, could be used to describe prospective provenance of the underlying plan of an ELN protocol, to allow the documentation of potential deviations from the original plan. The Reproduce Microscopy Experiments (REPRODUCE-ME) ontology<ref>{{Citation |last=Samuel |first=Sheeba |last2=König-Ries |first2=Birgitta |date=2017 |editor-last=Blomqvist |editor-first=Eva |editor2-last=Hose |editor2-first=Katja |editor3-last=Paulheim |editor3-first=Heiko |editor4-last=Ławrynowicz |editor4-first=Agnieszka |editor5-last=Ciravegna |editor5-first=Fabio |title=REPRODUCE-ME: Ontology-Based Data Access for Reproducibility of Microscopy Experiments |url=http://link.springer.com/10.1007/978-3-319-70407-4_4 |work=The Semantic Web: ESWC 2017 Satellite Events |publisher=Springer International Publishing |place=Cham |volume=10577 |pages=17–20 |doi=10.1007/978-3-319-70407-4_4 |isbn=978-3-319-70406-7 |accessdate=2022-04-01}}</ref> introduces a specific vocabulary to describe retrospective provenance for microscopy experiments. 
Besides, the domain-independent ontologies, PROV-O and its predecessor Open Provenance Model (OPM)<ref name=":12">{{Cite journal |last=Moreau |first=Luc |last2=Groth |first2=Paul |last3=Cheney |first3=James |last4=Lebo |first4=Timothy |last5=Miles |first5=Simon |date=2015-12 |title=The rationale of PROV |url=https://linkinghub.elsevier.com/retrieve/pii/S1570826815000177 |journal=Journal of Web Semantics |language=en |volume=35 |pages=235–257 |doi=10.1016/j.websem.2015.04.001}}</ref>, are frequently employed as upper-level ontology for provenance documentation.<ref name=":2" /> Furthermore, many extensions for specific applications have been proposed. The Provenance, Authoring, and Versioning (PAV) ontology, for example, proposes a mechanism for the versioning and authoring of web resources<ref>{{Cite journal |last=Ciccarese |first=Paolo |last2=Soiland-Reyes |first2=Stian |last3=Belhajjame |first3=Khalid |last4=Gray |first4=Alasdair JG |last5=Goble |first5=Carole |last6=Clark |first6=Tim |date=2013 |title=PAV ontology: provenance, authoring and versioning |url=http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-37 |journal=Journal of Biomedical Semantics |language=en |volume=4 |issue=1 |pages=37 |doi=10.1186/2041-1480-4-37 |issn=2041-1480 |pmc=PMC4177195 |pmid=24267948}}</ref>, and CollabPG encodes collaborations within processes.<ref name=":2" /> With respect to the application domain of the use case, the Open Biological and Biomedical Ontology (OBO) Foundry is a community initiative aiming at the development and maintenance of ontologies in the biomedical domain.<ref name=":13">{{Cite journal |last=The OBI Consortium |last2=Smith |first2=Barry |last3=Ashburner |first3=Michael |last4=Rosse |first4=Cornelius |last5=Bard |first5=Jonathan |last6=Bug |first6=William |last7=Ceusters |first7=Werner |last8=Goldberg |first8=Louis J |last9=Eilbeck |first9=Karen |last10=Ireland |first10=Amelia |last11=Mungall |first11=Christopher J |date=2007-11 |title=The OBO 
Foundry: coordinated evolution of ontologies to support biomedical data integration |url=http://www.nature.com/articles/nbt1346 |journal=Nature Biotechnology |language=en |volume=25 |issue=11 |pages=1251–1255 |doi=10.1038/nbt1346 |issn=1087-0156 |pmc=PMC2814061 |pmid=17989687}}</ref> The Basic Formal Ontology (BFO)<ref name=":14">{{Cite web |last=Smith, B.; Kumar, A.; Bittner, T. |date=2005 |title=Basic Formal Ontology for Bioinformatics |work=IFOMIS Reports |url=http://ontology.buffalo.edu/smith/articles/BFO_for_bioinformatics.pdf |format=PDF}}</ref> is the upper-level ontology that is used for each of the OBO ontologies.


For the retrospective provenance documentation of research data from computational workflows, several specifically tailored tools and approaches have been proposed in the literature. ProvBook<ref name=":8" />, for instance, tracks provenance in Jupyter Notebooks that are used for literate programming. There's also Dataprov<ref>{{Cite journal |last=Bartusch, F.; Hanussek, M.; Krüger, J. |year=2018 |editor-last=Atkinson, M.; Gesing, S. |title=Automatic generation of provenance metadata during execution of scientific workflows |url=http://ceur-ws.org/Vol-2357/paper8.pdf |format=PDF |journal=Proceedings of the 10th International Workshop on Science Gateways |volume=2357 |pages=1–6 |URN=nbn:de:0074-2357-5}}</ref>, a wrapper tool producing provenance information from the execution of analysis tools, and noWorkflow<ref>{{Citation |last=Murta |first=Leonardo |last2=Braganholo |first2=Vanessa |last3=Chirigati |first3=Fernando |last4=Koop |first4=David |last5=Freire |first5=Juliana |date=2015 |editor-last=Ludäscher |editor-first=Bertram |editor2-last=Plale |editor2-first=Beth |title=noWorkflow: Capturing and Analyzing Provenance of Scripts |url=http://link.springer.com/10.1007/978-3-319-16462-5_6 |work=Provenance and Annotation of Data and Processes |language=en |publisher=Springer International Publishing |place=Cham |volume=8628 |pages=71–83 |doi=10.1007/978-3-319-16462-5_6 |isbn=978-3-319-16461-8 |accessdate=2022-04-01}}</ref>, which captures provenance information from analysis scripts such as for the programming language Python. 
Aside from these methods, other provenance tracking approaches known as lineage retrieval<ref>{{Cite journal |last=Bose |first=Rajendra |last2=Frew |first2=James |date=2005-03 |title=Lineage retrieval for scientific data processing: a survey |url=https://dl.acm.org/doi/10.1145/1057977.1057978 |journal=ACM Computing Surveys |language=en |volume=37 |issue=1 |pages=1–28 |doi=10.1145/1057977.1057978 |issn=0360-0300}}</ref> or lineage tracking and workflow systems exist.<ref>{{Cite journal |last=Davidson |first=Susan B. |last2=Freire |first2=Juliana |date=2008 |title=Provenance and scientific workflows: challenges and opportunities |url=http://portal.acm.org/citation.cfm?doid=1376616.1376772 |journal=Proceedings of the 2008 ACM SIGMOD international conference on Management of data  - SIGMOD '08 |language=en |publisher=ACM Press |place=Vancouver, Canada |pages=1345 |doi=10.1145/1376616.1376772 |isbn=978-1-60558-102-6}}</ref> In general, ''in-silico'' workflow systems not only record provenance information, but at the same time they specify the involved processing steps and enable their execution possibly on a distributed system.<ref>{{Cite journal |last=Deelman |first=Ewa |last2=Gannon |first2=Dennis |last3=Shields |first3=Matthew |last4=Taylor |first4=Ian |date=2009-05 |title=Workflows and e-Science: An overview of workflow system features and capabilities |url=https://linkinghub.elsevier.com/retrieve/pii/S0167739X08000861 |journal=Future Generation Computer Systems |language=en |volume=25 |issue=5 |pages=528–540 |doi=10.1016/j.future.2008.06.012}}</ref> However, as these systems are limited to tackling computational analyses, their usage for the provenance of research data from wet lab experiments is difficult.


Regarding the completeness of the documentation with respect to reproducibility, plenty of standards exist that aim at the definition of the minimum set of information required to comprehend and reproduce the research investigation for different applications. With respect to the use case at hand, the minimum information for electrical cell stimulation<ref>{{Cite journal |last=Budde |first=Kai |last2=Zimmermann |first2=Julius |last3=Neuhaus |first3=Elisa |last4=Schroder |first4=Max |last5=Uhrmacher |first5=Adelinde M. |last6=van Rienen |first6=Ursula |date=2019-07 |title=Requirements for Documenting Electrical Cell Stimulation Experiments for Replicability and Numerical Modeling ∗ |url=https://ieeexplore.ieee.org/document/8856863/ |journal=2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) |publisher=IEEE |place=Berlin, Germany |pages=1082–1088 |doi=10.1109/EMBC.2019.8856863 |isbn=978-1-5386-1311-5}}</ref> and the Minimum Information About a Cellular Assay (MIACA)<ref name="MIACA">{{cite web |url=http://miaca.sourceforge.net/ |title=MIACA - Minimum Information About a Cellular Assay |author=MIACA Standards Initiative |work=SourceForge |date=2006}}</ref> provide such references for the documentation. Similarly, standard operating procedures (SOPs) or published instructions for experiments encode standards for the documentation of a particular experiment.


When considering the publication or archiving of research data, metadata is important to provide additional context, enabling others (including the future self) to understand the research process and the resulting data. In particular, the FAIR guiding principles provide abstract recommendations for handling research data to enable its re-usability.<ref name=":0" /> Together with the implementation suggestions of these guidelines<ref name=":1" />, they provide a framework which is also applicable for research data from wet lab experiments. While both guidelines provide generic recommendations regarding research data documentation, different standards exist that provide vocabulary for their support. Several initiatives foster the development of documentation standards for research data, including the Data Documentation Initiative (DDI) that focuses on standardizing metadata for social science datasets.<ref>{{Cite journal |last=Rasmussen |first=Karsten Boye |last2=Blank |first2=Grant |date=2007-03 |title=The data documentation initiative: a preservation standard for research |url=http://link.springer.com/10.1007/s10502-006-9036-0 |journal=Archival Science |language=en |volume=7 |issue=1 |pages=55–71 |doi=10.1007/s10502-006-9036-0 |issn=1389-0166}}</ref> The Dublin Core, instead, is a more general definition of 15 metadata elements for electronic resources.<ref>{{Cite journal |last=Weibel |first=S. |last2=Kunze |first2=J. |last3=Lagoze |first3=C. |last4=Wolf |first4=M. |date=1998-09 |title=Dublin Core Metadata for Resource Discovery |url=https://www.rfc-editor.org/info/rfc2413 |language=en |pages=RFC2413 |doi=10.17487/rfc2413}}</ref><ref>{{Cite journal |last=Kunze |first=J. |last2=Baker |first2=T. 
|date=2007-08 |title=The Dublin Core Metadata Element Set |url=https://www.rfc-editor.org/info/rfc5013 |language=en |pages=RFC5013 |doi=10.17487/rfc5013}}</ref> Similarly, Data Catalog Vocabulary (DCAT) provides a common vocabulary for the interoperability of data catalogs<ref>{{Cite web |last=Albertoni, R.; Browning, D.; Cox, S. et al. |date=04 February 2020 |title=Data Catalog Vocabulary (DCAT) - Version 2 |url=https://www.w3.org/TR/2020/REC-vocab-dcat-2-20200204/ |publisher=W3C}}</ref> and, thus, also defines required metadata for research data. Additionally, domain-specific metadata standards have been developed. With respect to the use case, this includes metadata for microscopy images, such as that proposed by the RDM4mic Initiative.<ref>{{Cite journal |last=Kunis, S. |date=22 October 2021 |title=Workgroup RDM4mic - Research data management for microscopy |url=https://zenodo.org/record/5591958 |journal=Zenodo |doi=10.5281/zenodo.5591958}}</ref> In addition to these metadata, the information inside the data file might also be described. For this purpose, codebooks and data dictionaries are employed.<ref>{{Cite journal |last=Buchanan |first=Erin M. |last2=Crain |first2=Sarah E. |last3=Cunningham |first3=Ari L. |last4=Johnson |first4=Hannah R. |last5=Stash |first5=Hannah |last6=Papadatou-Pastou |first6=Marietta |last7=Isager |first7=Peder M. |last8=Carlsson |first8=Rickard |last9=Aczel |first9=Balazs |date=2021-01 |title=Getting Started Creating Data Dictionaries: How to Create a Shareable Data Set |url=http://journals.sagepub.com/doi/10.1177/2515245920928007 |journal=Advances in Methods and Practices in Psychological Science |language=en |volume=4 |issue=1 |pages=251524592092800 |doi=10.1177/2515245920928007 |issn=2515-2459}}</ref><ref>{{Cite journal |last=Rashid |first=Sabbir M. |last2=McCusker |first2=James P. |last3=Pinheiro |first3=Paulo |last4=Bax |first4=Marcello P. |last5=Santos |first5=Henrique O. |last6=Stingone |first6=Jeanette A. 
|last7=Das |first7=Amar K. |last8=McGuinness |first8=Deborah L. |date=2020-10 |title=The Semantic Data Dictionary – An Approach for Describing and Annotating Data |url=https://direct.mit.edu/dint/article/2/4/443-486/94892 |journal=Data Intelligence |language=en |volume=2 |issue=4 |pages=443–486 |doi=10.1162/dint_a_00058 |issn=2641-435X |pmc=PMC7583433 |pmid=33103120}}</ref> Considering a CSV file as an example, this includes information about each column such as the domain of the values and the unit of the measurements. This information is defined in a separate file that helps comprehend the raw data.
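
As a minimal sketch of the CSV example above, the following shows a data dictionary that records the type, unit, and value domain of each column and is then used to check the raw rows. The column names, units, ranges, and the helper <tt>check_rows</tt> are hypothetical and not taken from the cited works; in practice the dictionary would be kept in a separate file next to the data.

```python
import csv
import io

# Hypothetical raw data, inlined here so the sketch is self-contained.
RAW_CSV = "well,voltage,viability\nA1,1.5,0.93\nA2,3.0,0.88\n"

# Hypothetical data dictionary: one entry per column with type, unit, and domain.
DATA_DICTIONARY = {
    "well": {"description": "Well identifier on the plate", "type": str, "unit": None},
    "voltage": {"description": "Applied stimulation voltage", "type": float,
                "unit": "V", "min": 0.0, "max": 10.0},
    "viability": {"description": "Fraction of viable cells", "type": float,
                  "unit": None, "min": 0.0, "max": 1.0},
}


def check_rows(raw_csv, dictionary):
    """Return (line number, column, problem) tuples for values that violate the dictionary."""
    problems = []
    reader = csv.DictReader(io.StringIO(raw_csv))
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        for column, spec in dictionary.items():
            try:
                value = spec["type"](row[column])
            except (KeyError, TypeError, ValueError):
                problems.append((line_no, column, "missing or wrong type"))
                continue
            low, high = spec.get("min"), spec.get("max")
            too_low = low is not None and value < low
            too_high = high is not None and value > high
            if too_low or too_high:
                problems.append((line_no, column, "outside domain"))
    return problems


problems = check_rows(RAW_CSV, DATA_DICTIONARY)
```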


For the publication and archiving of this data, including the semantic documentation, several approaches have been proposed. These include bundling formats such as BagIt<ref>{{Cite journal |last=Kunze |first=J. |last2=Littman |first2=J. |last3=Madden |first3=E. |last4=Scancella |first4=J. |last5=Adams |first5=C. |date=2018-10 |title=The BagIt File Packaging Format (V1.0) |url=https://www.rfc-editor.org/info/rfc8493 |language=en |pages=RFC8493 |doi=10.17487/rfc8493}}</ref>, Oxford Common File Layout (OCFL)<ref>{{Cite journal |last=Hankinson |first=Andrew |last2=Brower |first2=Donald |last3=Jefferies |first3=Neil |last4=Metz |first4=Rosalyn |last5=Morley |first5=Julian |last6=Warner |first6=Simeon |last7=Woods |first7=Andrew |date=2019-06-04 |title=The Oxford Common File Layout: A Common Approach to Digital Preservation |url=https://www.mdpi.com/2304-6775/7/2/39 |journal=Publications |language=en |volume=7 |issue=2 |pages=39 |doi=10.3390/publications7020039 |issn=2304-6775}}</ref>, and RO-Crate<ref name=":15">{{Cite journal |last=Carragáin |first=Eoghan Ó |last2=Goble |first2=Carole |last3=Sefton |first3=Peter |last4=Soiland-Reyes |first4=Stian |date=2019-06-20 |title=A lightweight approach to research object data packaging |url=https://zenodo.org/record/3250687 |doi=10.5281/ZENODO.3250687}}</ref>, as well as literate programming methods such as using Jupyter Notebook to combine (parts of) research data, their analysis source code, and results, as well as their documentation. RO-Crate<ref name=":15" /> is a mechanism that allows the bundling of resources together with their associated metadata, supporting the FAIR publication and archiving of the research data. By re-using existing vocabulary such as schema.org or PROV-O, it implements a linked data approach to enable researchers to provide all information necessary to (re-)use the described research data. 
This includes basic properties such as author and title of the resource, a license for publication, a description of the files, and a description of the workflow used to create those files in terms of retrospective provenance, including employed software and other equipment. In brief, a RO-Crate bundle consists of the research data file and a metadata file called <tt>ro-crate-metadata.json</tt>, which contains structured metadata about the files and the entire bundle in a JSON-LD format. While the <tt>ro-crate-metadata.json</tt> contains all information in machine interpretable way, it is accompanied by a human readable HTML representation. RO-Crate has successfully been used for the documentation of retrospective provenance of ''in-silico'' studies<ref>{{Cite journal |last=Chard |first=Kyle |last2=Gaffney |first2=Niall |last3=Jones |first3=Matthew B. |last4=Kowalik |first4=Kacper |last5=Ludascher |first5=Bertram |last6=McPhillips |first6=Timothy |last7=Nabrzyski |first7=Jarek |last8=Stodden |first8=Victoria |last9=Taylor |first9=Ian |last10=Thelen |first10=Thomas |last11=Turk |first11=Matthew J. |date=2019-09 |title=Application of BagIt-Serialized Research Object Bundles for Packaging and Re-Execution of Computational Analyses |url=https://ieeexplore.ieee.org/document/9041738/ |journal=2019 15th International Conference on eScience (eScience) |publisher=IEEE |place=San Diego, CA, USA |pages=514–521 |doi=10.1109/eScience.2019.00068 |isbn=978-1-7281-2451-3}}</ref>, but can, due to the flexibility of the vocabulary, also be used for retrospective provenance of wet lab experiments.
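
To make the structure of <tt>ro-crate-metadata.json</tt> more concrete, the following is a minimal sketch that assembles such a file for a bundle with a single data file. It assumes the entity layout of the RO-Crate 1.1 specification (a metadata descriptor, a root dataset, and contextual entities in one JSON-LD <tt>@graph</tt>), which may differ from the version used by the authors and should be checked against the specification. The function name, file name, author identifier, and all values are hypothetical.

```python
import json


def build_crate_metadata(title, author, license_url, files):
    """Assemble a minimal RO-Crate-style metadata document as a Python dict.

    `author` is a dict with "@id" and "name"; `files` maps relative file
    paths to short descriptions. Layout assumed from RO-Crate 1.1.
    """
    # Descriptor entity: identifies this metadata file and points at the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # Root dataset: describes the bundle as a whole and lists its files.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": title,
        "author": {"@id": author["@id"]},
        "license": {"@id": license_url},
        "hasPart": [{"@id": path} for path in files],
    }
    person = {"@id": author["@id"], "@type": "Person", "name": author["name"]}
    file_entities = [
        {"@id": path, "@type": "File", "description": description}
        for path, description in files.items()
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, person, *file_entities],
    }


# Illustrative values only.
metadata = build_crate_metadata(
    title="Example wet lab measurement",
    author={"@id": "#researcher-1", "name": "Example Researcher"},
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files={"measurements.csv": "Raw measurements exported from the instrument"},
)
metadata_json = json.dumps(metadata, indent=2)
```

Writing <tt>metadata_json</tt> to a file named <tt>ro-crate-metadata.json</tt> next to <tt>measurements.csv</tt> would yield the kind of bundle described above; retrospective provenance of a wet lab experiment would be added as further entities in the same graph.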


==Methods==
The manual engineering process for the semantic model of the ELN protocol comprised iterative modelling and reviewing. Domain experts were consulted during this process in order to validate the model. The main objective of this process was to check whether all information needed for the semantic provenance modelling is available in ELN protocols and whether it can be transferred into a semantic representation by employing existing ontologies. The aim of the resulting model was to document the provenance of the research data.


Protegé<ref>{{Cite journal |last=Musen |first=Mark A. |date=2015-06-16 |title=The protégé project: a look back and a look forward |url=https://dl.acm.org/doi/10.1145/2757001.2757003 |journal=AI Matters |language=en |volume=1 |issue=4 |pages=4–12 |doi=10.1145/2757001.2757003 |issn=2372-3483 |pmc=PMC4883684 |pmid=27239556}}</ref> was used for model engineering. In particular, the modelling was conducted as follows:


#BioPortal<ref name="BioPortal">{{cite web |url=https://bioportal.bioontology.org/ |title=BioPortal |author=National Center for Biomedical Ontology |publisher=Board of Trustees of Leland Stanford Junior University |date=2021}}</ref> and Ontobee<ref>{{Cite journal |last=Ong |first=Edison |last2=Xiang |first2=Zuoshuang |last3=Zhao |first3=Bin |last4=Liu |first4=Yue |last5=Lin |first5=Yu |last6=Zheng |first6=Jie |last7=Mungall |first7=Chris |last8=Courtot |first8=Mélanie |last9=Ruttenberg |first9=Alan |last10=He |first10=Yongqun |date=2017-01-04 |title=Ontobee: A linked ontology data server to support ontology term dereferencing, linkage, query and integration |url=https://pubmed.ncbi.nlm.nih.gov/27733503 |journal=Nucleic Acids Research |volume=45 |issue=D1 |pages=D347–D352 |doi=10.1093/nar/gkw918 |issn=1362-4962 |pmc=5210626 |pmid=27733503}}</ref> are used to identify relevant ontologies for terms from the ELN protocol and the inventory database items.
#A set of ontologies is selected from these search results so that the coverage of terms from the ELN in a single ontology is maximized. Ontologies from the OBO Foundry<ref name=":13" />, compatible with the BFO<ref name=":14" />, were preferred.
#Ontology classes representing inventory database items in the ELN (see Fig. 3) are added into the ELN description of the corresponding inventory database item as a reference for the semantic modelling.
#The semantic model itself is constructed by ABox statements, i.e., the creation of instances of these classes that represent the particular entities and activities of the protocol and the inventory database. Each instance gets a unique identifier in the local namespace, reflecting the individual entity; for example, <tt>MG-63_(P25,_LOT_57840088)</tt> is used to encode passage 25 of the MG-63 cells that were delivered with the lot number 57840088 (see also Fig. 5). The specific input and output relations of the activity classes are used in order to connect the particular entities correspondingly.
#References to the same entities in other KGs such as Wikidata<ref>{{Cite journal |last=Vrandečić |first=Denny |last2=Krötzsch |first2=Markus |date=2014-09-23 |title=Wikidata: a free collaborative knowledgebase |url=https://dl.acm.org/doi/10.1145/2629489 |journal=Communications of the ACM |language=en |volume=57 |issue=10 |pages=78–85 |doi=10.1145/2629489 |issn=0001-0782}}</ref> are included by employing the <tt>owl:sameAs</tt> relation. This is essential for linked open data according to the five-star deployment scheme proposed by Berners-Lee.<ref name="HylandLinked13">{{cite web |url=https://dvcs.w3.org/hg/gld/raw-file/default/glossary/index.html |title=Linked Data Glossary |editor=Hyland, B.; Atemezing, G.; Pendleton, M. et al. |publisher=W3C |date=27 June 2013}}</ref>

The following three rules were applied during iterative modelling in order to prevent the introduction of bias from the modeller and the domain experts:

*Use ontological classes of the same granularity as the terms in the experiment documentation, e.g., “washing” instead of “material processing.”
*Avoid the introduction of new classes and attributes whenever possible (i.e., avoid TBox statements) and re-use existing ontologies.<ref>{{Cite journal |last=Heath |first=Tom |last2=Bizer |first2=Christian |date=2011-02-09 |title=Linked Data: Evolving the Web into a Global Data Space |url=http://www.morganclaypool.com/doi/abs/10.2200/S00334ED1V01Y201102WBE001 |journal=Synthesis Lectures on the Semantic Web: Theory and Technology |language=en |volume=1 |issue=1 |pages=1–136 |doi=10.2200/S00334ED1V01Y201102WBE001 |issn=2160-4711}}</ref>
*Use only information from the ELN protocol, and do not introduce further knowledge beyond the references to other KGs.

Thus, the semantic model serves as a demonstrator for the inherent potential of ELN protocols.

Considering the ELN protocol from the manual model, we observed that the main content is structured by:

*headings and paragraphs,
*tables (table headings and body),
*enumerations and lists, and
*links to inventory items and research data.

Headings are used to structure the documentation, e.g., the general section about the experimental details or a particular set of activities is preceded by a heading (upper and lower part in Fig. 2, respectively). In the latter case, different sets of activities in a protocol correspond to the templates we extracted, i.e., at each heading a new template was included.

Considering our ultimate goal of retrospective research data provenance documentation, we exploited the structure of the ELN protocol as follows:

#General information, such as the researcher conducting the experiment and the objective of the investigation, is parsed from the key-value table at the beginning of the protocol. This information is added to the protocol activity using the relation <tt>qualifiedAssociation</tt> ([https://www.w3.org/ns/prov#qualifiedAssociation prov:qualifiedAssociation]).
#Activities described within the ELN protocol are hierarchically structured to represent different levels of granularity. The top-level activity represents the entire experiment, while the different main sections are represented by second-level activities. Note that each main section contains an activity table. Finally, the third level represents activities from the rows of those tables.
#All activities are augmented with the inventory items mentioned in the respective description via the relation <tt>used</tt> ([https://www.w3.org/ns/prov#used prov:used]).
#For each research data file created during the investigation, a corresponding entity is created. Assuming that the mention of a file inside an activity marks the creation of this file, the activity is linked to the file using the relation <tt>wasGeneratedBy</tt> ([https://www.w3.org/ns/prov#wasGeneratedBy prov:wasGeneratedBy]).
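
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the classes <tt>Step</tt> and <tt>Section</tt>, the function <tt>protocol_to_triples</tt>, and all <tt>ex:</tt> identifiers are hypothetical, triples are plain tuples rather than a real RDF graph, and the hierarchy is expressed with <tt>bfo:hasPart</tt>, the relation the Results section names for this purpose.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Third-level activity: one row of an activity table (treated as atomic)."""
    ident: str
    inventory_items: list = field(default_factory=list)  # items mentioned in the row
    files: list = field(default_factory=list)            # data files mentioned in the row


@dataclass
class Section:
    """Second-level activity: one main section of the protocol."""
    ident: str
    steps: list = field(default_factory=list)


def protocol_to_triples(protocol_id, researcher, sections):
    """Translate the protocol structure into (subject, predicate, object) tuples."""
    triples = []
    # Step 1: general information becomes a qualified association
    # on the top-level protocol activity.
    association = "_:association_1"
    triples.append((protocol_id, "prov:qualifiedAssociation", association))
    triples.append((association, "prov:agent", researcher))
    for section in sections:
        # Step 2: three-level hierarchy (experiment > section > table row).
        triples.append((protocol_id, "bfo:hasPart", section.ident))
        for step in section.steps:
            triples.append((section.ident, "bfo:hasPart", step.ident))
            # Step 3: inventory items mentioned in the step description.
            for item in step.inventory_items:
                triples.append((step.ident, "prov:used", item))
            # Step 4: a file mentioned in a step is assumed to be created by it.
            for data_file in step.files:
                triples.append((data_file, "prov:wasGeneratedBy", step.ident))
    return triples


example = protocol_to_triples(
    "ex:protocol_1",
    "ex:researcher_A",
    [Section("ex:section_staining", [
        Step("ex:step_1", inventory_items=["ex:PBS"]),
        Step("ex:step_2", inventory_items=["ex:incubator"],
             files=["ex:image_001.tif"]),
    ])],
)
for triple in example:
    print(triple)
```

Note the direction of <tt>prov:wasGeneratedBy</tt>: in PROV-O the file entity is the subject and the generating activity is the object.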


As previously described, we do not further split up the third-level activities, i.e., complex structures such as enumerations and lists, including their order inside a step description, are taken as atomic.

Aside from the use of structural elements in the ELN, which was the basis for the manual model, we identified different repeating patterns that can be exploited. For example, from the textual description of activities such as “incubate 5 min in [Device] SANYO CO2 Incubator at 37<sup>∘</sup>C” or “wash cells with [Washing solution] PBS without Ca/Mg [..],” we observed the use of verb phrases indicating the activity of the step: “incubate” and “wash,” respectively. Here, we use the head verb of those phrases to assign the corresponding ontological class from a prior mapping. Similarly, information about researchers and institutions, manufacturers, file MIME types, and experiment type is included. For large-scale usage, this information might also be retrieved from an organizational or research information system.

Parameters that are used in the textual description are identified by their unit, e.g., “1.5 ml,” “5 min,” and “37<sup>∘</sup>C” by employing regular expressions. They are then represented as blank nodes connected to the step using the relation <tt>has value specification</tt> (OBI_0001938) with the <tt>value</tt> as the numerical value of the parameter and the unit connected by <tt>has measurement unit label</tt> (IAO_0000039). We observed that most of the units mentioned in the protocols at hand are defined in the units ontology (UO).<ref>{{Cite journal |last=Gkoutos |first=G. V. |last2=Schofield |first2=P. N. |last3=Hoehndorf |first3=R. |date=2012-10-10 |title=The Units Ontology: a tool for integrating units of measurement in science |url=https://academic.oup.com/database/article-lookup/doi/10.1093/database/bas033 |journal=Database |language=en |volume=2012 |issue=0 |pages=bas033–bas033 |doi=10.1093/database/bas033 |issn=1758-0463 |pmc=PMC3468815 |pmid=23060432}}</ref>
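
As an illustration of this unit-based extraction, the following Python sketch finds values followed by a known unit and emits one blank node per parameter. The pattern <tt>UNIT_PATTERN</tt>, the function <tt>extract_parameters</tt>, and the blank-node naming are assumptions made for this example, not the expressions used by the authors; units are kept as plain strings, and their mapping to UO classes is omitted.

```python
import re

# Hypothetical pattern: a decimal number followed by one of a few unit
# spellings. The lookahead keeps "V" from matching the start of a word.
UNIT_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>ml|min|ms|Hz|V|[°∘]C)(?![A-Za-z])"
)


def extract_parameters(step_id, text):
    """Return triples describing every value/unit pair found in `text`."""
    triples = []
    for index, match in enumerate(UNIT_PATTERN.finditer(text)):
        node = f"_:{step_id}_param{index}"  # one blank node per parameter
        # "has value specification" links the step to the parameter node.
        triples.append((step_id, "OBI_0001938", node))
        triples.append((node, "value", float(match.group("value"))))
        # "has measurement unit label"; both degree spellings are normalised.
        unit = match.group("unit").replace("∘", "°")
        triples.append((node, "IAO_0000039", unit))
    return triples


params = extract_parameters(
    "step_1", "incubate 5 min in [Device] SANYO CO2 Incubator at 37°C"
)
for triple in params:
    print(triple)
```

The digit in “CO2” is not picked up because it is not followed by one of the listed units.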


Another frequently used pattern observed in the textual description is the mixture of biological and chemical resources, e.g., “89% [Culture Medium] DMEM + 10% [Serum] FCS + 1% [Antibiotic] Gentamicin”. By employing the following regular expression, the contained information is extracted and transferred into a representation of activity of type <tt>creating a mixture of molecules in solution</tt> (OBI_0000685):

==Results==
First, we present the details of the manually engineered semantic representation of the Ca-imaging procedure, which served as (i) a proof of concept for the effectiveness of retrospective provenance documentation from ELN protocols, (ii) a basis for the analysis of the ELN protocol structure, and (iii) a basis for the development of the protocol template for research guidance. Second, details of the structure-based semantic translation for the seven Ca-imaging protocols with stimulation are given. Finally, we present the results of the evaluation of the RO-Crate bundles.

===Manually engineered model===
The semantic representation of the Ca-imaging procedure is based on the upper-level ontology BFO. In addition, PROV-O<ref name=":12" /> is used for retrospective provenance documentation of the experimental results. Table 1 lists the most important ontologies used in the model. For the representation, an artifact-based modelling approach was selected, where artifacts are central to the model and are used to connect activities via their corresponding input and output relations. In total, the protocol as well as the inventory items are represented in about 80 resources of 46 types connected by almost 20 distinct predicates from 13 vocabularies.


{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | colspan="3" style="background-color:white; padding-left:10px; padding-right:10px;" |'''Table 1.''' Ontologies selected for the manually engineered model. Upper rows list general ontologies, while the lower rows list domain-specific ontologies for resources and activities.
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Name
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Source
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Details
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |BFO
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Smith ''et al.''<ref name=":14" />
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Basic Formal Ontology
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |PROV-O
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Moreau ''et al.''<ref name=":12" />
  | style="background-color:white; padding-left:10px; padding-right:10px;" |PROV Ontology
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |BTO
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Gremse ''et al.''<ref>{{Cite journal |last=Gremse |first=M. |last2=Chang |first2=A. |last3=Schomburg |first3=I. |last4=Grote |first4=A. |last5=Scheer |first5=M. |last6=Ebeling |first6=C. |last7=Schomburg |first7=D. |date=2011-01-01 |title=The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources |url=https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkq968 |journal=Nucleic Acids Research |language=en |volume=39 |issue=Database |pages=D507–D513 |doi=10.1093/nar/gkq968 |issn=0305-1048 |pmc=PMC3013802 |pmid=21030441}}</ref>
  | style="background-color:white; padding-left:10px; padding-right:10px;" |BRENDA Tissue Ontology
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |CHEBI
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Degtyarenko ''et al.''<ref>{{Cite journal |last=Degtyarenko |first=K. |last2=de Matos |first2=P. |last3=Ennis |first3=M. |last4=Hastings |first4=J. |last5=Zbinden |first5=M. |last6=McNaught |first6=A. |last7=Alcantara |first7=R. |last8=Darsow |first8=M. |last9=Guedj |first9=M. |last10=Ashburner |first10=M. |date=2007-12-23 |title=ChEBI: a database and ontology for chemical entities of biological interest |url=https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkm791 |journal=Nucleic Acids Research |language=en |volume=36 |issue=Database |pages=D344–D350 |doi=10.1093/nar/gkm791 |issn=0305-1048 |pmc=PMC2238832 |pmid=17932057}}</ref>
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Chemical Entities of Biological Interest Ontology
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |CLO
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Sarntivijai ''et al.''<ref>{{Cite journal |last=Sarntivijai |first=Sirarat |last2=Lin |first2=Yu |last3=Xiang |first3=Zuoshuang |last4=Meehan |first4=Terrence F |last5=Diehl |first5=Alexander D |last6=Vempati |first6=Uma D |last7=Schürer |first7=Stephan C |last8=Pang |first8=Chao |last9=Malone |first9=James |last10=Parkinson |first10=Helen |last11=Liu |first11=Yue |date=2014 |title=CLO: The cell line ontology |url=http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-5-37 |journal=Journal of Biomedical Semantics |language=en |volume=5 |issue=1 |pages=37 |doi=10.1186/2041-1480-5-37 |issn=2041-1480 |pmc=PMC4387853 |pmid=25852852}}</ref>
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Cell Line Ontology
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |OBI
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Bandrowski ''et al.''<ref>{{Cite journal |last=Bandrowski |first=Anita |last2=Brinkman |first2=Ryan |last3=Brochhausen |first3=Mathias |last4=Brush |first4=Matthew H. |last5=Bug |first5=Bill |last6=Chibucos |first6=Marcus C. |last7=Clancy |first7=Kevin |last8=Courtot |first8=Mélanie |last9=Derom |first9=Dirk |last10=Dumontier |first10=Michel |last11=Fan |first11=Liju |date=2016-04-29 |editor-last=Xue |editor-first=Yu |title=The Ontology for Biomedical Investigations |url=https://dx.plos.org/10.1371/journal.pone.0154556 |journal=PLOS ONE |language=en |volume=11 |issue=4 |pages=e0154556 |doi=10.1371/journal.pone.0154556 |issn=1932-6203 |pmc=PMC4851331 |pmid=27128319}}</ref>
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ontology for Biomedical Investigations
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |FOAF
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Brickley and Miller<ref name="BrickleyFOAF14">{{cite web |url=http://xmlns.com/foaf/spec/ |title=FOAF Vocabulary Specification 0.99 |author=Brickley, D.; Miller, L. |work=xmlns.com |date=14 January 2014}}</ref>
  | style="background-color:white; padding-left:10px; padding-right:10px;" |People and their web information
|-
|}
|}
All inventory items that were mentioned as resources in the protocol were represented by instances of the corresponding ontology classes (ABox statements), which is exemplified in the following by use of the MG-63 cell line. The manually engineered representation, as well as the corresponding inventory database description, are illustrated in Figs. 5 and 3, respectively.
[[File:Fig5 Schröder JofBioSem22 13.png|900px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="900px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 5.''' Graphical representation of the manually engineered semantic model of the MG-63 cell line used in the protocol. (See Schröder ''et al.''<ref name=":5" />)</blockquote>
|-
|}
|}
In the ELN protocol, a passage with number <tt>25</tt> of the originally supplied <tt>MG-63</tt> cells with lot number <tt>57840088</tt> was used, i.e., “[Cell line] MG-63 P25 LOT 57840088”.{{Efn|Note that this is not part of the inventory item description, as this aims at the general cell specification. However, the particular information for a specific experiment is part of the ELN protocol.}} This is modelled by using multiple instances of the corresponding class ''MG-63 cell'' (CLO_0007699), which are connected with the relation <tt>is_passage_of</tt>. The passage information is annotated using the attribute <tt>passage situation</tt> (CLO_0051628). Lot numbers are represented as instances of <tt>lot number</tt> (IAO_0000132) and connected to the cell instances using the newly defined relation <tt>has_lot_number</tt>. The creation of a cell passage is attributed to a researcher using the relation <tt>wasAttributedTo</tt> ([https://www.w3.org/ns/prov#wasAttributedTo prov:wasAttributedTo]). Finally, the supplier is an instance of class <tt>Organization</tt> ([https://www.w3.org/ns/prov#Organization prov:Organization]) and related to the cells using <tt>has_supplier</tt> (OBI_0000647).

The modelling of the ELN protocol can be summarized as the creation of instances of activity classes that require their individual input entities and often produce an output entity which serves as an input for the subsequent activity (artifact-based modelling). Examples of atomic activities and their corresponding activity classes include <tt>washing</tt> (OBI_0302888), <tt>creating a mixture of molecules in solution</tt> (OBI_0000685), or <tt>cell line cell culturing</tt> (CLO_0000000). The relations that are used to connect the entities to the activities are modelled in the corresponding ontology and depend on the actual activity class. Additionally, these processes are also of type <tt>Activity</tt> ([https://www.w3.org/ns/prov#Activity prov:Activity]) in order to encode general provenance information.

This modelling approach was employed for the entire ELN protocol. However, the most interesting part when it comes to the provenance documentation of research data is the activity that produces or uses the research data. The upper part in Fig. 6 illustrates the documentation from the ELN protocol relevant for the research data generation: the first two steps describe the creation of the data while the last step contains the details about the actual analysis.
[[File:Fig6 Schröder JofBioSem22 13.png|1000px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 6.''' Graphical representation of the semantic model describing the data recording (see also Fig. 5).  (See Schröder ''et al.''<ref name=":5" />)</blockquote>
|-
|}
|}
===Structure-based model===
For the structure-based model, an activity-based modelling approach was used to resemble the textual structure of the ELN protocol. For this purpose, the model was built upon the general purpose ontologies RO-Crate, PROV-O, and BFO. In total, for the representation of the seven protocols and their corresponding inventory items, 1935 resources of 18 types connected by 36 distinct predicates from seven vocabularies were used.

The structural hierarchy of the activities was represented by <tt>bfo:hasPart</tt>, while the sequential order was represented by <tt>wasInformedBy</tt> (prov:wasInformedBy). Figure 7 illustrates this structure. For each activity, the general types <tt>Action</tt>, <tt>prov:Activity</tt>, and <tt>bfo:process</tt> were used. Further links to external ontologies were added by <tt>owl:sameAs</tt>, for instance “wash” was augmented by <tt>washing</tt> (OBI_0302888).
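
To show how the sequential order can be recovered from such a model, the sketch below follows <tt>prov:wasInformedBy</tt> links between activities. It assumes that each later activity points to its immediate predecessor and that the activities form a single linear chain; the function <tt>order_by_execution</tt> and the <tt>ex:</tt> identifiers are hypothetical, and triples are plain tuples rather than a real RDF graph.

```python
def order_by_execution(triples):
    """Return activities in execution order by following prov:wasInformedBy."""
    # Map each later activity to the earlier activity it was informed by.
    informed_by = {s: o for s, p, o in triples if p == "prov:wasInformedBy"}
    # Reverse map: each earlier activity to the activity that follows it.
    followers = {earlier: later for later, earlier in informed_by.items()}
    activities = set(informed_by) | set(followers)
    # The first activity is the only one not informed by another activity.
    starts = [a for a in activities if a not in informed_by]
    if len(starts) != 1:
        raise ValueError("expected exactly one linear chain of activities")
    ordered = [starts[0]]
    while ordered[-1] in followers:
        ordered.append(followers[ordered[-1]])
    return ordered


chain = [
    ("ex:step_3", "prov:wasInformedBy", "ex:step_2"),
    ("ex:step_2", "prov:wasInformedBy", "ex:step_1"),
    ("ex:section_1", "bfo:hasPart", "ex:step_1"),  # ignored: not an ordering link
]
print(order_by_execution(chain))
```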
[[File:Fig7 Schröder JofBioSem22 13.png|1200px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1200px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 7.''' Graphical representation of an excerpt of the semantic model that was created semi-automatically.</blockquote>
|-
|}
|}
The RO-Crate’s root data entity that describes the research data is required to be an entity of type <tt>Dataset</tt> ([https://schema.org/Dataset schema:Dataset]). Thus, research data files are added to this dataset by <tt>hasPart</tt> ([https://schema.org/hasPart schema:hasPart]). The connection of these file entities and the hierarchical structure of the activities is represented by <tt>wasGeneratedBy</tt> ([https://www.w3.org/ns/prov#wasGeneratedBy prov:wasGeneratedBy]) (see the right part of Fig. 7), when mentioned in the activities’ textual description. This means that all files are included in this root data entity (via <tt>hasPart</tt>), but are not necessarily associated to the activities, if they are not mentioned.
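
A minimal sketch of such a metadata file, written as a Python dictionary and serialized to JSON-LD, may make this structure concrete. It assumes the RO-Crate 1.1 context and metadata descriptor; the function <tt>build_crate_metadata</tt>, the file paths, and the activity identifier are hypothetical, the <tt>wasGeneratedBy</tt> term is mapped to PROV-O explicitly as an assumption, and further properties that a complete crate would carry (such as license and publication date) are omitted.

```python
import json


def build_crate_metadata(name, files):
    """Build a minimal RO-Crate metadata document.

    `files` maps each data file path to the identifier of the activity whose
    description mentions it, or to None if no activity mentions the file.
    """
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        # Every file is part of the root data entity ...
        "hasPart": [{"@id": path} for path in files],
    }
    graph = [descriptor, root]
    for path, activity in files.items():
        entity = {"@id": path, "@type": "File"}
        # ... but only files mentioned in an activity are linked to it.
        if activity is not None:
            entity["wasGeneratedBy"] = {"@id": activity}
        graph.append(entity)
    return {
        "@context": [
            "https://w3id.org/ro/crate/1.1/context",
            {"wasGeneratedBy": "http://www.w3.org/ns/prov#wasGeneratedBy"},
        ],
        "@graph": graph,
    }


metadata = build_crate_metadata(
    "Ca-imaging experiment (hypothetical example)",
    {"data/recording_01.tif": "#step-12", "data/notes.txt": None},
)
print(json.dumps(metadata, indent=2))
```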

Following the RO-Crate specification, ELN inventory database items are encoded as the domain-independent type <tt>IndividualProduct</tt> as they provide contextual information. However, the ontological knowledge about the type of the biological and chemical resource was added using the relation <tt>owl:sameAs</tt> by the external references from the description in the ELN. The resulting entity is connected to the activities using <tt>used</tt> ([https://www.w3.org/ns/prov#used prov:used]). Resources with a specific passage or lot number are added as individual entities connected to a general entity encoding the inventory database item using the relation <tt>is_instance_of</tt>. Furthermore, attributes <tt>has_passage_number</tt> and <tt>has_lot_number</tt> are added with their corresponding information.

Several mixtures are used in the ELN protocols. This information is modelled around the activity <tt>creating a mixture of molecules in solution</tt> (OBI_0000685). All resources that are used in this activity are linked by <tt>has_specified_input</tt> (OBI_0000293) and the resulting mixture entity by <tt>has_specified_output</tt> (OBI_0000299). To specify the recipe of this mixture, a <tt>material combination objective</tt> (OBI_0000686) is created and linked to the activity using <tt>achieves_planned_objective</tt> (OBI_0000417). If an attribution of this mixture is annotated in the ELN protocol, the corresponding agent is associated with the resulting mixture entity via <tt>wasAttributedTo</tt>. Note that recipes of a mixture are independent of the actual creation activity, i.e., if multiple researchers create a mixture using the same recipe, the same recipe entity is referenced, but individual activities and mixture entities are created.
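
The mixture modelling can be illustrated with the example string from the Methods section. The sketch below is an assumption-laden illustration: the pattern <tt>COMPONENT</tt> is a hypothetical stand-in and not the regular expression used by the authors, the functions <tt>parse_mixture</tt> and <tt>mixture_triples</tt> and all <tt>ex:</tt> identifiers are invented for this example, and how percentages are attached to the recipe entity is left out because the text does not specify it.

```python
import re

# Hypothetical pattern for one component: "<percent>% [<role>] <name>".
COMPONENT = re.compile(
    r"(?P<percent>\d+(?:\.\d+)?)\s*%\s*\[(?P<role>[^\]]+)\]\s*(?P<name>[^+]+)"
)


def parse_mixture(text):
    """Split a mixture description at '+' and parse every component."""
    components = []
    for part in text.split("+"):
        match = COMPONENT.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"not a mixture component: {part!r}")
        components.append((
            float(match.group("percent")),
            match.group("role").strip(),
            match.group("name").strip(),
        ))
    return components


def mixture_triples(activity, mixture, recipe, components):
    """Model the mixing activity with the OBI terms named in the text."""
    triples = [
        (activity, "rdf:type", "OBI_0000685"),  # creating a mixture of molecules in solution
        (activity, "OBI_0000299", mixture),     # has_specified_output
        (recipe, "rdf:type", "OBI_0000686"),    # material combination objective
        (activity, "OBI_0000417", recipe),      # achieves_planned_objective
    ]
    for _percent, _role, name in components:
        triples.append((activity, "OBI_0000293", f"ex:{name}"))  # has_specified_input
    return triples


components = parse_mixture(
    "89% [Culture Medium] DMEM + 10% [Serum] FCS + 1% [Antibiotic] Gentamicin"
)
triples = mixture_triples("ex:mixing_1", "ex:medium_1", "ex:recipe_1", components)
for triple in triples:
    print(triple)
```

Because the recipe entity is passed in separately from the activity and the mixture, several mixing activities can reference the same recipe, which mirrors the independence of recipes described above. Splitting at “+” would fail for resource names that themselves contain a plus sign.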

With respect to parameters, we extracted values and units for the following types: (i) time and duration (min and ms), (ii) temperature (<sup>∘</sup>C), (iii) frequency (Hz), and (iv) voltage (V), and represented them by their corresponding classes. In particular, the frequency and the voltage are of interest, as they provide the parameters for the stimulation of the cells during the Ca-imaging approach.
===ELN protocols and protocol template===
By providing templates for the individual parts of the experiment (preparation, Fluo-3 staining, Ca-imaging with and without stimulation), the researchers were able to compile seven ELN protocols with different permutations of the experiment parameters. In comparison to the predefined protocol template, we observed that the researchers further modified the ELN protocol description to reflect the particular course of activities and observations conducted in the wet lab, e.g., the repetition of an experimental setting due to issues in the previous experiment or the documentation of issues during the experiment. That means the model represents such deviations from the original plan (prospective provenance) and allows for tracking the actually documented activity sequence by means of retrospective provenance.
===Research data bundles===
In summary, seven RO-Crates have been created, one for each ELN protocol of the Ca-imaging experiments with stimulation. The corresponding semantic representation was automatically created using the structure-based approach. All research data produced in a particular experiment were bundled in the RO-Crate together with this semantic representation. In order to foster readability, a copy of the ELN protocol and the inventory items' description was included in the form of HTML files. As a result, the RO-Crates contain between 110 and 135 files and are between 107 and 185 MB in size. The particular ELN protocols are encoded in models of 2,174 to 2,553 triples, with 15,823 triples in total. As some triples (such as those describing researchers, institutions, and resources) are identical across all RO-Crates, the number of unique triples is only 13,490. The number of triples per protocol differs due to deviations in the documentation from the original plan and the number of research files.

The structure-based approach employs RO-Crate, PROV-O, and BFO as upper-level ontologies. RO-Crate and PROV-O in particular are designed to encode provenance information about resources. Provenance information about experimenter, manufacturer, biological and chemical resources, activities, and research data is transferred by this approach into a semantic representation. To illustrate the capabilities of the resulting RO-Crate bundles, we evaluated SPARQL queries for the W7 questions in our use case. Considering the question “How was a particular file created?” (W3), Fig. 8 presents the corresponding SPARQL query for a Ca-imaging approach in a particular experiment. Table 2 illustrates an excerpt of the result of this query, i.e., the sequence of activities from one experiment, providing the answer to question W3. That is, for every atomic activity within the Ca-imaging approach, the description as well as the created research data are listed in order of execution. Moreover, all resources and equipment (W2), as well as the parameters, are returned by the query.
[[File:Fig8 Schröder JofBioSem22 13.png|800px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="800px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 8.''' This SPARQL query selects (1) the ontological activity classes, (2) the research data produced, (3) the resources and equipment that are used, and (4) the parameters for each atomic activity, ordered by their execution, in a Ca-imaging approach with stimulation from one of the use case ELN protocols that have been translated using the structure-based modelling approach.</blockquote>
|-
|}
|}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | colspan="7" style="background-color:white; padding-left:10px; padding-right:10px;" |'''Table 2.''' An excerpt of the resulting output for the SPARQL query in Fig. 8.
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Activity
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Text
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Act.-Class
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Resources
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Files
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Par.-Units
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Par.-Values
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |[...]
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/14
  | style="background-color:white; padding-left:10px; padding-right:10px;" |place [Device] IonOptix 12 well plate chamber electrodes on plate
  | style="background-color:white; padding-left:10px; padding-right:10px;" |obo:NCIT_C52253
  | style="background-color:white; padding-left:10px; padding-right:10px;" |IonOptix 12 well plate chamber
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/15
  | style="background-color:white; padding-left:10px; padding-right:10px;" |incubate for 10min with stimulation in LSM hood: [...]
  | style="background-color:white; padding-left:10px; padding-right:10px;" |obo:OMIT_0005807, obo:OBI_0001007, obo:OBI_0302893
  | style="background-color:white; padding-left:10px; padding-right:10px;" |LSM780, ZEN 2011 (black edition)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi
  | style="background-color:white; padding-left:10px; padding-right:10px;" |obo:UO_0000031, obo:UO_0000028, obo:UO_0000218, obo:UO_0000106
  | style="background-color:white; padding-left:10px; padding-right:10px;" |5, 10, 7.9
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |[...]
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
|-
|}
|}
Besides queries for individual experiments, the semantic models enable the comparison of the documentation of multiple experiments. As an example, we consider the question “What was the order of the stimulation parameters in a particular experiment?” (W7), which is to be answered for all seven experiments. Figure 9 illustrates the query for the comparison of multiple experiments based on the order of their stimulation parameters. The corresponding results are shown in Table 3.
[[File:Fig9 Schröder JofBioSem22 13.png|800px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="800px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 9.''' This SPARQL query selects all experiments following the Ca-imaging procedure and collects their stimulation parameters in the order that they have been investigated.</blockquote>
|-
|}
|}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | colspan="3" style="background-color:white; padding-left:10px; padding-right:10px;" |'''Table 3.''' The result for the SPARQL query in Fig. 9 illustrating a comparison of multiple experiments based on the order of their stimulation parameters.
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Protocol
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Title
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Stimulation parameters
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1124/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation) 29.01.2021
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1042/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |20Hz, 1V | 7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1021/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |20Hz, 1V | 20Hz, 5V | 7.9Hz, 5V | 7.9Hz, 1V
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1022/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7.9Hz, 5V | 7.9Hz, 1V | 20Hz, 1V | 20Hz, 5V | 7.9Hz, 5V
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1023/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation) Failed (durch ATP Zugabe hat sich der Bildausschnitt verändert)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln942/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7.9Hz, 5V | 7.9Hz, 1V | 20Hz, 1V | 20Hz, 5V
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1071/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging (with stimulation) 22.01.2021
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V | 7.9Hz, 1V
|-
|}
|}
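Once the query result is available, the comparison across experiments can be continued outside the triple store. The following is a minimal sketch that takes the sequences from Table 3 as plain Python data; the helper names <tt>group_by_order</tt> and <tt>repeated_settings</tt> are hypothetical and only serve to illustrate the kind of analysis the structured result permits.

```python
from collections import defaultdict

# Stimulation parameter sequences copied from Table 3 (protocol -> ordered settings).
STIMULATION_ORDER = {
    "eln1124": ["7.9Hz, 1V", "7.9Hz, 5V", "20Hz, 5V", "20Hz, 1V"],
    "eln1042": ["20Hz, 1V", "7.9Hz, 1V", "7.9Hz, 5V", "20Hz, 5V"],
    "eln1021": ["20Hz, 1V", "20Hz, 5V", "7.9Hz, 5V", "7.9Hz, 1V"],
    "eln1022": ["7.9Hz, 5V", "7.9Hz, 1V", "20Hz, 1V", "20Hz, 5V", "7.9Hz, 5V"],
    "eln1023": ["7.9Hz, 1V", "7.9Hz, 5V", "20Hz, 5V", "20Hz, 1V"],
    "eln942": ["7.9Hz, 5V", "7.9Hz, 1V", "20Hz, 1V", "20Hz, 5V"],
    "eln1071": ["7.9Hz, 5V", "20Hz, 5V", "20Hz, 1V", "7.9Hz, 1V"],
}


def group_by_order(orders):
    """Group protocol names by their exact sequence of stimulation settings."""
    groups = defaultdict(list)
    for protocol, sequence in orders.items():
        groups[tuple(sequence)].append(protocol)
    return dict(groups)


def repeated_settings(orders):
    """Return protocols in which at least one setting occurs more than once."""
    return [p for p, seq in orders.items() if len(set(seq)) < len(seq)]
```

For the data in Table 3, grouping shows that eln1124 and eln1023 used the same order of settings, while eln1022 is the only protocol in which a setting (7.9Hz, 5V) occurs twice.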
The remaining W7 questions could also be validated based on similar queries, as shown in the appendix of this paper. Thus, the proposed approach demonstrates the feasibility of research data documentation using ELN protocols.
==Discussion==
The results of the manual modelling show that it is feasible to translate the information of an ELN protocol into a semantic representation for documentation of retrospective provenance of research data. Moreover, it has been shown that creating ready-to-publish bundles containing the research data, the associated metadata, and the retrospective provenance documentation using RO-Crate makes it possible to answer questions about the experimental procedure that produced the research data. The manually engineered model implements an artifact-based modelling approach that uses ontological terms to their full extent. Thus, the resulting representation mainly consists of a sequential list of activities and entities connected via their specific input and output relations. The level of granularity of the model corresponds in most cases to the terms used in the documentation, although existing ontologies do not always provide the same level of detail for all terms. As an example, the terms “relocate,” “transfer,” and “take out” can be subsumed under the moving of materials, but still have some distinct differences. Furthermore, when “take out” is used in the context of a fridge or freezer, an ontological model additionally requires encoding the warming up of the material. Thus, providing ontological definitions for these different situations requires considerable future work in ontological engineering.
In contrast to the manual model, the structure-based approach implements an activity-based modelling mechanism and does not use the specific input and output relations of an activity, although it uses the same activities. As a result, the structure-based approach does not specify the particular role of the used entities. Furthermore, in the manually engineered model, the semantic representation of an entity that results from the sequential execution of activities is difficult without introducing TBox statements. The reason is that this entity needs to reflect the result of the particular activity sequence. In the structure-based approach, these entities need not be defined, as the main part of the model consists of a hierarchy of activities, including the used resources. This allows the model to represent only the information that is actually contained in the textual description of the ELN protocol, without artificially introducing entities with properties that are the direct result of the activities.
Besides the process documentation, the structure-based approach adds metadata about the MIME type, the file size, and the checksum, which allows the integrity of the research data to be validated. This representation of the research data might be extended by additional metadata, which, however, would require the application of file-type-specific extraction methods (e.g., for CZI files) or the researchers themselves to provide the information (e.g., in the form of data dictionaries for tabular data). Moreover, representing the research data itself in the same representation format as the metadata and the retrospective provenance documentation would enable further data integration and thus allow for automatic data analysis approaches.
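How such file metadata can be collected and later used for an integrity check is sketched below. This is an illustration under stated assumptions rather than the implementation used here: the hash algorithm (SHA-256), the dictionary keys, and the helper names <tt>describe_file</tt> and <tt>verify_file</tt> are all hypothetical, since the text does not specify them.

```python
import hashlib
import mimetypes
from pathlib import Path


def describe_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Collect MIME type, size in bytes, and checksum for one research data file."""
    path = Path(path)
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in chunks so that large image series need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return {
        "name": path.name,
        # guess_type returns None for extensions it does not know.
        "mime_type": mime_type or "application/octet-stream",
        "size_bytes": path.stat().st_size,
        "checksum": f"{algorithm}:{digest.hexdigest()}",
    }


def verify_file(path, expected):
    """Return True if the file still matches the recorded size and checksum."""
    algorithm = expected["checksum"].split(":", 1)[0]
    current = describe_file(path, algorithm=algorithm)
    return (current["size_bytes"], current["checksum"]) == (
        expected["size_bytes"],
        expected["checksum"],
    )
```

A recipient of a bundle could call <tt>verify_file</tt> with the recorded metadata for each file to detect truncated or altered data before reuse.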
Employing the structure-based approach at a large scale requires knowledge about the relation of terms from the textual description in the ELN to classes and attributes from ontologies. Here, we implemented this relation by a hard-coded mapping, for instance from verb phrases to ontology classes in the case of activities. This can also be achieved by using a suggestion system for the researchers that proposes ontological classes selected from automated queries of ontological databases. Similarly, the external identifiers might be augmented. The structure-based approach currently integrates the Open Researcher and Contributor ID (ORCID) and Research Organization Registry (ROR) for persons and organizations, respectively, and it also uses references to Wikidata entities. Several initiatives have proposed the use of persistent identifiers for other aspects of wet lab experiments, e.g., RRIDs can be used to reference scientific resources similar to the inventory database of the ELN.<ref name="RRIDPortal" /> While using persistent identifiers, we observed two aspects that are crucial:
#The granularity of the entity referenced by the identifier needs to be on the same level as that needed for the application. As an example, the organization referenced by the Research Organization Registry (ROR)<ref name="RORHome">{{cite web |url=https://ror.org/ |title=Research Organization Registry |author=Conlon, M. |date=2021}}</ref> does not reflect the particular department with which the researchers are affiliated.
#The entity referenced by the identifier needs to reflect evolution, too. Although the identifier should reference a particular version of an entity, the entity behind it might change and, thus, the registry needs to encode these versions and provide a corresponding identifier for each version. To the best of our knowledge, this is currently not supported by, e.g., RRIDs.
A fine-grained solution for referencing researchers, organizations, and research projects on an institutional level might be implemented by organizational information systems.
Another important aspect is privacy protection, for instance with regard to the names of all persons involved in an experiment. While the identity of all involved persons is of interest for archival purposes, publishing all personal details might not be desired with respect to privacy protection. The structured representation of the RO-Crate allows querying for all involved persons (W1) and would thus directly allow for easy implementation of pseudonymization via graph update operations.
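The idea of pseudonymization by graph update can be sketched in plain Python as a stand-in for an update on the triple store. The sketch assumes that persons are typed as <tt>schema:Person</tt> and named via <tt>schema:name</tt>; these terms, the tuple representation of triples, and the helper <tt>pseudonymize</tt> are assumptions made for illustration only.

```python
# Assumed vocabulary terms; the actual model may use different ones.
NAME = "schema:name"
PERSON_TYPE = ("rdf:type", "schema:Person")


def pseudonymize(triples, keep=()):
    """Replace person names with numbered placeholders.

    triples: iterable of (subject, predicate, object) tuples.
    keep: subjects whose names stay untouched, e.g., a named experimentalist.
    """
    triples = list(triples)
    # Find every subject typed as a person, except those explicitly kept.
    persons = sorted(
        {s for s, p, o in triples if (p, o) == PERSON_TYPE} - set(keep)
    )
    # Sorting makes the numbering of placeholders reproducible.
    alias = {s: f"Person{i} Anonymous" for i, s in enumerate(persons, start=1)}
    return [
        (s, p, alias[s]) if p == NAME and s in alias else (s, p, o)
        for s, p, o in triples
    ]
```

Only name literals are rewritten in this sketch. A real deployment would presumably also have to cover other identifying properties, such as the ORCID, and would apply the change to the published copy only, leaving the archived bundle intact.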
With respect to the recent advances in information extraction, we employed basic methods. While this does not extract all information of interest, it sketches the potential benefits of automatic text analysis. By employing more sophisticated information extraction methods, for instance, training on labelled published protocols<ref>{{Cite journal |last=Kulkarni |first=Chaitanya |last2=Xu |first2=Wei |last3=Ritter |first3=Alan |last4=Machiraju |first4=Raghu |date=2018 |title=An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols |url=http://aclweb.org/anthology/N18-2016 |journal=Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) |language=en |publisher=Association for Computational Linguistics |place=New Orleans, Louisiana |pages=97–106 |doi=10.18653/v1/N18-2016}}</ref>, this could further be improved. This is also true for the extraction of parameters and their assignment to activities, as can be seen by recently established NLP challenges such as MeasEval.<ref name="HarperWelcome21">{{cite web |url=https://competitions.codalab.org/competitions/25770 |title=Welcome to MeasEval: Counts and Measurements! |author=Harper, C.; Cox, J.; Daniel, R. et al. |work=CodaLab |date=01 February 2021}}</ref> Moreover, disambiguating detected terms with respect to their context and linking them to the corresponding ontology classes is one of the core challenges in modern NLP.
With respect to the completeness of the documentation of wet lab experiments, minimal information guidelines provide a reference that can potentially be exploited to create protocol templates. In combination with the proposed structure-based approach, this would allow deployment of minimum information checklists using the Minim model<ref name="Soiland-ReyesMinim17">{{cite web |url=https://github.com/wf4ever/ro-manager/blob/master/Minim/Minim-description.md |title=Minim model for defining checklists |author=Soiland-Reyes, S.; Klyne, G. |work=RO-manager on GitHub |date=17 October 2017}}</ref> to enable the validation of the generated documentation.
When ELNs are used over a longer period of time, inventory database items are regularly updated because, e.g., the supplier changes or the software or firmware of a device is updated. Evolution provenance methods can be employed to represent such changes. In order to reflect these versions also for the research data in the RO-Crate, a data storage solution with versioning is needed. Intra-consortia sharing platforms<ref>{{Cite journal |last=Schröder |first=Max |last2=LeBlanc |first2=Hayley |last3=Spors |first3=Sascha |last4=Krüger |first4=Frank |date=2020-02-25 |title=Intra-consortia data sharing platforms for interdisciplinary collaborative research projects |url=https://www.degruyter.com/document/doi/10.1515/itit-2019-0039/html |journal=it - Information Technology |language=en |volume=62 |issue=1 |pages=19–28 |doi=10.1515/itit-2019-0039 |issn=2196-7032}}</ref> can be employed for this purpose.
Overall, we have shown that our approach is able to help generate increasingly FAIR data. The ELN protocols captured, together with the data entries in the RO-Crate format, increase the findability of data produced in wet lab experiments, creating a binding between experiment steps and data. Likewise, the approach increases accessibility by allowing rich SPARQL queries to be formulated that combine the experiment metadata with the data itself. In terms of interoperability and reusability, the use of common ontologies allows for different experiment runs to be easily compared and documentation to be more easily generated. However, as noted by Mons ''et al.''<ref>{{Cite journal |last=Mons |first=Barend |last2=Neylon |first2=Cameron |last3=Velterop |first3=Jan |last4=Dumontier |first4=Michel |last5=da Silva Santos |first5=Luiz Olavo Bonino |last6=Wilkinson |first6=Mark D. |date=2017-03-07 |title=Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/ISU-170824 |journal=Information Services & Use |volume=37 |issue=1 |pages=49–56 |doi=10.3233/ISU-170824}}</ref>, making data FAIR is not an absolute but a spectrum, with trade-offs between the ability to find and reuse data and the effort spent on documentation. Our approach illustrates this by highlighting the differences between automated capture and manual capture. In particular, while automated capture reduces the burden of capturing FAIR data, it also means, for the time being, a decrease in the richness of the associated metadata needed for reusability. The manually captured model thus provides a valuable target for the automated capture of metadata for the data produced in the wet lab.
==Conclusion==
The presented study investigated the feasibility of creating semantic provenance documentation for research data using ELN protocols from wet lab experiments. ELN protocols contain specific information about an experiment, such as the produced research data, but also timestamps, lot and passage numbers, and parameters. This is in contrast to templates, which serve as general guidelines without such information.
The manually engineered model was used as a proof of concept for the translation of ELN protocols using a Ca-imaging experiment. In order to support researchers in the wet lab, we derived four templates encoding parts of this initial protocol that can be used to create new experiment documentation. Based on these results, a structure-based approach was implemented to translate these protocols into a semantic representation. This approach uses the structure in the description, including headings, tables, and links, as well as some basic text analysis. Furthermore, the resulting semantic model is bundled together with the research data. Potential provenance questions from the viewpoint of other researchers using these bundles have been implemented as SPARQL queries in order to evaluate the proposed methodology. We have shown that the structure-based approach, in combination with RO-Crate bundling, can be used to successfully document research data based on the description in the form of ELN protocols. Thus, these RO-Crates enable the sharing, publication, and archiving of the research data in terms of the FAIR principles.<ref name=":0" /><ref name=":1" /> Furthermore, in order to guide researchers while conducting Ca-imaging experiments, the four derived sub-templates can be combined to provide a documentation basis for new experiments.
Integrating the proposed approach, as well as the sketched extensions, into a comprehensive virtual research environment (VRE) would enable the tracking of the entire research process and the research data from the creation of a hypothesis to the publication of the data. In particular, the ELN can be used for the documentation of the wet lab investigation of a research project. The funding information of the research projects, including involved researchers and the consortia, can be stored in a research information system. Furthermore, the semantic representation of the protocol can be automatically synced with a linked data server, and the research data can be stored in an institutional repository. The particular platforms can be connected with a semantic search interface for researchers that enables searching for similar experiments and data, as well as creating reports about experimental activities.
==Appendix==
===Queries and answers for the W7 questions===
Note that for better readability, we shortened URIs in some of the following results, e.g., <nowiki>https://eln-provenance.elaine.uni-rostock.de/942/approach_1_with_stimulation/1</nowiki> has been shortened to <tt>ap_1_with_stimulation/1</tt>, and <nowiki>http://localhost:3030/Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi</nowiki> has been shortened to <tt>02_Zeitserie-Stimulation_5V_7.9Hz.czi</tt>.
'''W1: Who participated in the study?'''
[[File:FigA1 Schröder JofBioSem22 13.png|400px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Experimentalist
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Involved persons
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Susanne Staehlke
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Person1 Anonymous, Person2 Anonymous
|-
|}
|}
Note that before publication, we pseudonymized some researchers with respect to privacy protection. Refer to the Discussion section for more details.
'''W2: Which biological and chemical resources and which equipment were used in the study?'''
[[File:FigA2 Schröder JofBioSem22 13.png|400px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Activity
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Used resources
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |# of previous steps
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/1
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Tube: 10ml
  | style="background-color:white; padding-left:10px; padding-right:10px;" |0
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/2
  | style="background-color:white; padding-left:10px; padding-right:10px;" |PBS without Ca/Mg
  | style="background-color:white; padding-left:10px; padding-right:10px;" |1
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/3
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Eppendorf Centrifuge
  | style="background-color:white; padding-left:10px; padding-right:10px;" |2
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/4
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |3
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/5
  | style="background-color:white; padding-left:10px; padding-right:10px;" |50% HEPES I (isotonic) + 50% HEPES II (hypotonic)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |4
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/6
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Fluo-3/AM
  | style="background-color:white; padding-left:10px; padding-right:10px;" |5
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/7
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Eppendorf Thermomixer C (incubation shaker)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |6
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/8
  | style="background-color:white; padding-left:10px; padding-right:10px;" |LSM780, IonOptix 12 well plate chamber, IonOptix C-Pace EM
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/9
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Eppendorf Centrifuge
  | style="background-color:white; padding-left:10px; padding-right:10px;" |8
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/10
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |9
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/11
  | style="background-color:white; padding-left:10px; padding-right:10px;" |HEPES I (isotonic)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |10
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/12
  | style="background-color:white; padding-left:10px; padding-right:10px;" |12 well plate, PBS without Ca/Mg
  | style="background-color:white; padding-left:10px; padding-right:10px;" |11
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/13
  | style="background-color:white; padding-left:10px; padding-right:10px;" |HEPES I (isotonic)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |12
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/14
  | style="background-color:white; padding-left:10px; padding-right:10px;" |IonOptix 12 well plate chamber
  | style="background-color:white; padding-left:10px; padding-right:10px;" |13
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/15
  | style="background-color:white; padding-left:10px; padding-right:10px;" |LSM780, ZEN 2011 (black edition)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |14
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/16
  | style="background-color:white; padding-left:10px; padding-right:10px;" |IonOptix 12 well plate chamber
  | style="background-color:white; padding-left:10px; padding-right:10px;" |15
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/17
  | style="background-color:white; padding-left:10px; padding-right:10px;" |LSM780, ATP, ZEN 2011 (black edition)
  | style="background-color:white; padding-left:10px; padding-right:10px;" |16
|-
|}
|}
'''W3: How was a particular file created?'''
[[File:FigA3 Schröder JofBioSem22 13.png|401px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |File
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Activity
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Protocol
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln942:ap_1_with_stimulation/15
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln942:protocol
|-
|}
|}
'''W4: When was an activity conducted?'''
[[File:FigA4 Schröder JofBioSem22 13.png|399px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Activity
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Starting time
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |# of previous steps
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/1
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:00:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |0
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/2
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |1
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/3
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:01:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |2
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/4
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:06:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |3
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/5
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |4
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/6
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |5
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/7
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:10:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |6
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/8
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |7
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/9
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:40:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |8
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/10
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:45:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |9
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/11
  | style="background-color:white; padding-left:10px; padding-right:10px;" |
  | style="background-color:white; padding-left:10px; padding-right:10px;" |10
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/12
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |11
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/13
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |12
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/14
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |13
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/15
  | style="background-color:white; padding-left:10px; padding-right:10px;" |09:50:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |14
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/16
  | style="background-color:white; padding-left:10px; padding-right:10px;" |10:00:00
  | style="background-color:white; padding-left:10px; padding-right:10px;" |15
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |ap_1_with_stimulation/17
  | style="background-color:white; padding-left:10px; padding-right:10px;" |immediately afterwards
  | style="background-color:white; padding-left:10px; padding-right:10px;" |16
|-
|}
|}
'''W5: Why was the experiment done?'''
[[File:FigA5 Schröder JofBioSem22 13.png|399px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Template
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Protocol
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Objective
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1023/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1021/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln942/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1071/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1042/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1124/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Ca-imaging
  | style="background-color:white; padding-left:10px; padding-right:10px;" |eln1022/protocol
  | style="background-color:white; padding-left:10px; padding-right:10px;" |Intracellular calcium dynamic caused by electric fields
|-
|}
|}
'''W6: Where was the experiment conducted?'''
[[File:FigA6 Schröder JofBioSem22 13.png|401px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Organization
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |University Medical Center Rostock
|-
|}
|}
Note that the granularity of external identifiers is not always sufficient; e.g., the researchers and the equipment are affiliated with the “Department of Cell Biology,” which is part of the “University Medical Center Rostock.” See the Discussion section for more on this issue.
==Abbreviations==
'''BFO''': Basic Formal Ontology
'''DCAT''': Data Catalog Vocabulary
'''DDI''': Data Documentation Initiative
'''ELN''': Electronic Laboratory Notebook
'''EXACT2''': EXperimental ACTions
'''FAIR''': Findable, Accessible, Interoperable, and Reusable
'''KG''': Knowledge Graph
'''KGC''': Knowledge Graph Cell
'''MIACA''': Minimum Information About a Cellular Assay
'''NLP''': Natural Language Processing
'''OBO''': Open Biological and Biomedical Ontology
'''OCFL''': Oxford Common File Layout
'''OPM''': Open Provenance Model
'''ORCID''': Open Researcher and Contributor ID
'''PAV''': Provenance, Authoring and Versioning
'''PROV-O''': PROV Ontology
'''REPRODUCE-ME''': Reproduce Microscopy Experiments
'''RO-Crate''': Research Object Crate
'''ROR''': Research Organization Registry
'''RRID''': Research Resource Identifiers
'''SMART Protocols''': SeMAntic RepresenTation for Experimental Protocols
'''SOP''': Standard Operating Procedure
'''UO''': Units Ontology
'''VRE''': Virtual Research Environment


==Footnotes==
{{reflist|group=lower-alpha}}
==Acknowledgements==
We thank Tazin Hossain for her help with the prototypical implementation of parts of the structure-based approach.
===Author contributions===
Author contributions according to CRediT: MS: Conceptualization, Data curation, Methodology, Investigation, Software, Writing. SSt: Resources, Data curation. PG: Methodology, Writing. JBN: Resources, Funding acquisition. SSp: Funding acquisition, Supervision. FK: Conceptualization, Methodology, Writing, Funding acquisition, Supervision. All authors read and approved the final manuscript.
===Funding===
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – SFB 1270/1 - 299150580. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. Open Access funding enabled and organized by Projekt DEAL.
===Availability of data and materials===
The manually engineered model is published at [https://github.com/SFB-ELAINE/Semantic-Modelling-CA-Imaging GitHub]. The results of the structure-based modelling approach and the research data are published as [https://github.com/SFB-ELAINE/Ca-imaging-RO-Crate RO-Crates bundles]. The source that builds the model based on the ELN API is available [https://github.com/m6121/Structure-based-ELN2LOD here]. This study is based on the previously published data by Staehlke and Nebe.<ref>{{Citation |last=Staehlke, Susanne |last2=Nebe, J. Barbara |date=2021-06-10 |title=Research data of Calcium Imaging after electrical stimulation |url=https://zenodo.org/record/4923173 |work=Zenodo |language=en |publisher=Zenodo |doi=10.5281/zenodo.4923173 |accessdate=2022-04-01}}</ref>
===Competing interests===
The authors declare that they have no competing interests.


==References==
[[Category:LIMSwiki journal articles (added in 2022)]]
[[Category:LIMSwiki journal articles (all)]]
[[Category:LIMSwiki journal articles on FAIR data principles]]
[[Category:LIMSwiki journal articles on laboratory informatics]]
[[Category:LIMSwiki journal articles on research]]
[[Category:LIMSwiki journal articles on software]]

Latest revision as of 16:29, 29 April 2024

Full article title Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation
Journal Journal of Biomedical Semantics
Author(s) Schröder, Max; Staehlke, Susanne; Groth, Paul; Nebe, J. Barbara; Spors, Sascha; Krüger, Frank
Author affiliation(s) University of Rostock, University Medical Center Rostock, University of Amsterdam
Primary contact Email: max dot schroeder at uni-rostock dot de
Year published 2022
Volume and issue 13
Article # 4 (2022)
DOI 10.1186/s13326-021-00257-x
ISSN 2041-1480
Distribution license Creative Commons Attribution 4.0 International
Website https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-021-00257-x
Download https://jbiomedsem.biomedcentral.com/track/pdf/10.1186/s13326-021-00257-x.pdf (PDF)

Abstract

Background: Electronic laboratory notebooks (ELNs) are used to document experiments and investigations in the wet lab. Protocols in ELNs contain a detailed description of the conducted steps, including the necessary information to understand the procedure and the raised research data, as well as to reproduce the research investigation. The purpose of this study is to investigate whether such ELN protocols can be used to create semantic documentation of the provenance of research data by the use of ontologies and linked data methodologies.

Methods: Based on an ELN protocol of a biomedical wet lab experiment, a retrospective provenance model of the raised research data describing the details of the experiment in a machine-interpretable way is manually engineered. Furthermore, an automated approach for knowledge acquisition from ELN protocols is derived from these results. This structure-based approach exploits the structure in the experiment’s description (such as headings, tables, and links) to translate the ELN protocol into a semantic knowledge representation. To satisfy the FAIR guiding principles (making data findable, accessible, interoperable, and reusable), a ready-to-publish bundle is created that contains the research data together with their semantic documentation.

Results: While the manual modelling efforts serve as proof of concept by employing one protocol, the automated structure-based approach demonstrates the potential generalization with seven ELN protocols. For each of those protocols, a ready-to-publish bundle is created and, by employing the SPARQL query language, it is illustrated that questions about the processes and the obtained research data can be answered.

Conclusions: The semantic documentation of research data obtained from the ELN protocols allows for the representation of the retrospective provenance of research data in a machine-interpretable way. Research Object Crate (RO-Crate) bundles including these models enable researchers to easily share the research data, including the corresponding documentation, as well as to search the experiments and relate them to each other.

Keywords: research data, provenance, knowledge acquisition, electronic laboratory notebooks, semantic documentation, RO-Crate, FAIR

Background

Effective reuse of research data requires comprehensive documentation of their provenance. Besides metadata, knowledge about the generating process helps others to understand research data and allows for the reproduction of research investigations. This includes not only sources of input data, such as parameters and assumptions, but also information about instrumentation, devices, and materials. For wet lab experiments, such knowledge is increasingly documented in electronic laboratory notebooks (ELNs). The focus of these tools is on the documentation of laboratory activities that produce research data in so-called "ELN protocols." In addition to this textual description, the FAIR guiding principles[1] provide general guidance on research data documentation in terms of metadata. However, they do not prescribe technical details about the implementation of such documentation.[2]

To foster the realization of the FAIR principles for research data produced in wet lab experiments, we aim for machine-interpretable representations of the experimental documentation of the process that is the origin of the data. In other words, the provenance information about the research data (including the activities and the involved researchers, resources, and equipment) should be semantically represented. For this purpose, we employ the frequently used[3] PROV W3C recommendation[4], whose PROV Ontology (PROV-O) defines entities, activities, and agents, as well as their relations. In particular, according to Belhajjame et al., an entity is defined as a “physical, digital, conceptual, or other kind of thing with some fixed aspects,”[5] an activity as “something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities,”[5] and an agent as “something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity.”[5] With respect to wet lab experiments, all biological and chemical resources, the devices and software, and the research data itself can be seen as entities; researchers conducting the experiment are the agents, and the process of research data creation consists of activities. The semantic representation of this information as a knowledge graph (KG)[6] can be achieved by the use of modern web technologies where the terms and their relations are defined in ontologies such as PROV-O (TBox modelling), the instances are built up in the KG (ABox modelling), and other KGs can be linked in order to create an interconnected graph of semantic knowledge.
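
As a rough illustration of the entity, activity, and agent roles described above, the following minimal sketch stores a few ABox statements as plain subject-predicate-object triples. The `prov:` terms are taken from PROV-O; every `ex:` identifier and both helper functions are hypothetical and are not part of the model published by the authors.

```python
# Minimal sketch: PROV-O style ABox statements as (subject, predicate, object) triples.
# All "ex:" identifiers are hypothetical placeholders; "prov:" terms come from PROV-O.
ABOX = [
    ("ex:researcher_1", "rdf:type", "prov:Agent"),
    ("ex:ca_imaging_step_1", "rdf:type", "prov:Activity"),
    ("ex:cell_line_passage", "rdf:type", "prov:Entity"),
    ("ex:recording_1", "rdf:type", "prov:Entity"),
    ("ex:ca_imaging_step_1", "prov:wasAssociatedWith", "ex:researcher_1"),
    ("ex:ca_imaging_step_1", "prov:used", "ex:cell_line_passage"),
    ("ex:recording_1", "prov:wasGeneratedBy", "ex:ca_imaging_step_1"),
]


def instances_of(triples, cls):
    """Return all subjects typed with the given class, in insertion order."""
    return [s for s, p, o in triples if p == "rdf:type" and o == cls]


def objects(triples, subject, predicate):
    """Return all objects stated for a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    print(instances_of(ABOX, "prov:Entity"))
    print(objects(ABOX, "ex:recording_1", "prov:wasGeneratedBy"))
```

In a real KG, these statements would be serialized as RDF and the `ex:` terms would be replaced by resolvable identifiers; the list of tuples only mirrors the shape of the data.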

In this paper, we aim for an automatic extraction of information from ELN protocols in order to transfer them into a semantic representation that documents the produced research data. For this purpose, we employ the documentation of Calcium imaging (Ca-imaging) experiments, originally proposed by Staehlke et al.[7], as a running example. In particular, we use ELN protocols that document the conduction of Ca-imaging experiments in order to: (i) demonstrate the feasibility of manually transferring an ELN protocol into a semantic representation encoding the provenance of research data, (ii) automate the information extraction and modelling by exploiting the structure of an ELN protocol by means of a structure-based approach, and (iii) evaluate the proposed method by answering provenance questions from the resulting bundle of research data and the corresponding semantic model.

Here, the term "ELN protocol" refers to the actual documentation of a wet lab experiment within an ELN. It differs from a "protocol template," which encodes the instructions to be performed in order to conduct a particular procedure and which may be published at protocols.io. While protocol templates encode a list of abstract instructions, they do not necessarily reflect particular research data, instrumentation, parameters, or other execution-specific information. ELN protocols, in contrast, represent the documentation of the actual experiment, and the contained information is thus necessary to understand how the resulting research data was generated. This includes manufacturer-specific information about resources used in the experiment such as lot numbers.[a] Furthermore, passage numbers of the resources, the times when an activity was conducted, and the parameters used in a device, as well as the research data and the researchers conducting the experiment, are information specific to a particular experiment. Figure 1 illustrates the differences by providing an example for an ELN protocol and a protocol template.


Fig1 Schröder JofBioSem22 13.png

Figure 1. Excerpts of an ELN protocol that represents a particular experiment including all details such as timestamps, lot numbers as well as the research data (left) and a protocol template containing general instructions of experiments without these details (right, sourced from here.)

The work presented here is based on a preliminary investigation regarding the effectiveness of manually modeling ELN protocols by use of ontologies.[8] Here, we extend this preliminary work by discussing the potential of automatic information extraction from ELN protocols by employing structural information and discussing the differences and implications of both approaches. Moreover, while the previous work only sketched the semantic representation of the wet lab experiments, here, we focus on the generation of ready-to-publish research data bundles, including the semantic description of the origin of the research data.

Use case

To demonstrate the feasibility of the proposed approach, a typical wet lab investigation was chosen as a use case. In the following, we introduce the use case and derive questions regarding the provenance of the corresponding research data.

Biomedical wet lab experiments

The objective of the biomedical study was to investigate the intracellular calcium ions (Ca2+) dynamics by Calcium-imaging (Ca-imaging) under different settings.[7] In particular, two different wet lab experiments were considered: (i) an investigation of the influence of different material surface conditions on Ca2+ mobilization, and (ii) an investigation regarding the Ca2+ dynamics under the influence of electrical stimulation. Both types of experiments involve similar activities of the researchers. In particular, each experiment employs the Ca-imaging method previously established by Staehlke et al.[7] in different settings. The particular conditions, e.g., surface conditions or parameters of the electrical stimulation, are investigated within each experiment, while the order of the different variations was permuted across the experiments. That is, after a preparation phase, where all materials and devices are prepared, the same procedure, i.e., Ca-imaging, was executed for the different conditions. During the experiment, several materials and devices are employed, such as cell line passages, buffer, and microscopes.

For the purpose of this study, we asked the researchers to use an ELN for the documentation of their wet lab activities, resulting in eight ELN protocols: one for the first experiment and seven for the latter, representing different permutations of the sequential execution of Ca-imaging for different electrical stimulation parameters. In particular, eLabFTW (Deltablot, https://www.elabftw.net/, v3.6.7)[9], a domain-independent ELN, was used. Figure 2 shows an excerpt of a protocol from the use case.


Fig2 Schröder JofBioSem22 13.png

Figure 2. ELN protocol about a Ca-imaging experiment in the eLabFTW software. It contains general information (top), the list of activities with their starting time (middle), used inventory items, and uploaded research data (bottom).

ELNs often provide an inventory database that allows the maintenance of materials and other research resources used during the experiments. Typically, each resource belongs to a configurable set of categories, e.g., cell lines, buffer, software, or devices. These entries in the inventory database can be linked from within the protocol when used within the corresponding experiment. Figure 3 illustrates the entry to the inventory database for the MG-63 cell line that is used in the experiments of the use case. Note that this entry is already augmented by information about ontology classes that were added during the manual model engineering process. Here, we use such ontology references, but other resource identifiers, such as Research Resource Identifiers (RRID)[10], could also be used to reference resources. However, these RRIDs do not reflect different versions of the resources, e.g., when describing a software. Thus, they can be used to annotate the inventory database of the ELN similar to the ontology classes, but cannot be used on their own. Research data is attached to the ELN protocol by uploading and linking from within the textual description of the step that describes the generating activity.


Fig3 Schröder JofBioSem22 13.png

Figure 3. Shortened documentation of a Ca-imaging experiment in the eLabFTW ELN software. The upper part contains general information about the investigation, followed by the list of activities with their starting time. Below, used inventory items and uploaded research data are listed.

In summary, the execution of an individual experiment took about 4.5 hours, resulting from the preparation and the sequential executions of the Ca-imaging procedure under five different stimulation settings consisting of 15 steps for each. Each protocol referred to 22 inventory items in the database and between 85 and 110 data files of different types were generated. The different file types include (i) CZI files (developed by ZEISS) containing the microscope settings, recorded images, and raw measurement data; (ii) image files in JPEG format to illustrate particular excerpts from the video recordings; and (iii) raw measurements of the luminescence over time, in the form of XML encoded tabular data files. The latter two formats are exports from the CZI files. The provenance of all attached files needs to be documented.

Research data provenance

When considering this use case, several questions regarding the provenance of the research data can be raised. To this end, we consider questions based on the W7 provenance model[11], which describes provenance as combinations of What, When, Where, How, Who, Which, and Why. We consider each question individually, encoding the view of a researcher that aims at re-using the research data from our use case. The questions were developed together with the domain experts and resemble actual questions that arise when considering the replication of the documented experiments.

W1 Who participated in the study?
With respect to the provenance of research data, all researchers contributing to the creation are of interest, i.e., we expect to get a list of all researchers and their affiliations involved in an experiment.
W2 Which biological and chemical resources and which equipment were used in the study?
In particular, we are interested in the resources and the equipment used in an experiment, including all details such as the lot number and the passage information.
W3 How was a particular file created?
"What was the sequence of activities that led to the creation of a particular file" is a question that might help other researchers in comprehending the data.
W4 When was an activity conducted?
The date and the time point of a particular activity but also its duration are of interest. This information is useful for the planning of similar experiments, but also with respect to the comprehensibility of the results as the date and time point might influence them, e.g., due to weather or other environmental phenomena.
W5 Why was the experiment done?
Understanding why the research data was created is crucial for their comprehensibility. We take the objective of the experiment as the reason for the creation.
W6 Where was the experiment conducted?
The location, or rather the institution where the experiment was conducted, is of interest as regional characteristics might influence the data.
W7 What was the order of the stimulation parameters in a particular experiment?
The order of the particular approaches influences the results as there might be effects from the timing of the experiments or the duration since their preparation. That means, with respect to the evaluation of the results, we are interested in this order.
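
To make questions such as W1 and W3 more concrete, the following minimal sketch answers them over a tiny in-memory set of triples. It assumes, purely for illustration, that a file is linked to its generating activity via `prov:wasGeneratedBy` and that consecutive activities are linked via `prov:wasInformedBy`; the relations used in the actual model and in the SPARQL queries of the study may differ, and all `ex:` identifiers and function names are hypothetical.

```python
# Hypothetical provenance statements for one file and three consecutive steps.
TRIPLES = [
    ("ex:file_1", "prov:wasGeneratedBy", "ex:step_3"),
    ("ex:step_3", "prov:wasInformedBy", "ex:step_2"),
    ("ex:step_2", "prov:wasInformedBy", "ex:step_1"),
    ("ex:step_1", "prov:wasAssociatedWith", "ex:researcher_1"),
    ("ex:step_3", "prov:wasAssociatedWith", "ex:researcher_2"),
]


def one(triples, subject, predicate):
    """Return the first object for a subject and predicate, or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def activities_leading_to(triples, file_id):
    """W3: walk back from a file through its chain of activities (oldest first)."""
    chain, seen = [], set()
    current = one(triples, file_id, "prov:wasGeneratedBy")
    while current is not None and current not in seen:  # 'seen' guards against cycles
        chain.append(current)
        seen.add(current)
        current = one(triples, current, "prov:wasInformedBy")
    return list(reversed(chain))


def participants(triples):
    """W1: all agents associated with any activity, sorted for stable output."""
    return sorted({o for s, p, o in triples if p == "prov:wasAssociatedWith"})


if __name__ == "__main__":
    print(activities_leading_to(TRIPLES, "ex:file_1"))
    print(participants(TRIPLES))
```

The same questions would normally be phrased as SPARQL queries against the KG; the traversal here only shows which links such a query has to follow.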

Related work

The provenance of research data, including their research investigations, combines several research fields, ranging from general-purpose methods and standards for the documentation of provenance to specifically tailored methods and platforms for the tracking of research and other activities. In the following, we will discuss recent work within those fields and relate it to our method.

Many methods aiming at documenting the provenance of activities have already been proposed. Here, we consider the classification of provenance information following the definition of Herschel et al.[12] and Lim et al.[13]:

  1. prospective provenance describes “an abstract workflow specification as a recipe for future data derivation”[13];
  2. retrospective provenance documents a “past workflow execution and data derivation information, i.e., which tasks were performed and how data artifacts were derived”[13]; and
  3. evolution provenance illustrates “the changes made between two versions of the input”[12], or, in other words, versions of the procedure, the data, or the parameters are reflected by evolution provenance similar to version control such as that implemented by Git for source code.

Applying those definitions to the use case at hand, prospective provenance allows keeping track of changes of laboratory-specific operating procedures in general, while retrospective provenance allows documenting the actually executed sequence of activities that resulted in a particular set of research data. Finally, evolution provenance allows tracking changes made to the actual ELN protocol or the inventory database items.

With respect to the research workflows to be represented by provenance modeling, two different types can be distinguished:

  1. In-silico studies employ computational methods for the analysis of the data. Workflow systems like Taverna[14], Kepler[15], or Galaxy[16], and programming environments like Jupyter Notebook[17] have been successfully augmented to record retrospective provenance.
  2. Wet lab experiments are courses of activities in a laboratory. While several approaches exist that describe prospective provenance[18][19] by analyzing published protocols, only limited work is done on documenting retrospective provenance for these workflows.

More detailed information about provenance modelling and the employed methods are provided in the literature.[3][12] Here, we are interested in providing detailed information about the origin of research data. Thus, we aim at providing retrospective provenance documentation of research data from ELN protocols documenting wet lab experiments.

The Smart Tea project[20] similarly aims at the semantic metadata recording for research data from within a customized ELN. The developed ELN provides a structured graphical user interface (GUI) requiring the user to provide information for predefined variables. All information is directly transferred into a linked data representation and persistently archived with a linked data server. While this approach perfectly guides users through the sequence of activities and tracks retrospective provenance at the same time, it fails to keep track of deviations from the predefined plan. Furthermore, as the documentation is directly translated into a semantic representation, additional information that was not considered before can hardly be attached to such protocols, which restricts both the expressivity of the semantic model and the user to previously known information.

Similar to the Smart Tea project, the PROV templating approach[21] suggests the recording of provenance information given a pre-defined provenance model. In other words, the main idea is that applications only store values for placeholders in a particular provenance model, which was shown to be more efficient than the storage of the original provenance models.[21] This solution is very efficient if a very large number of identical provenance structures with some variable information are to be stored. If, however, the application requires more flexibility in terms of the provenance structure, the template approach does not utilize this efficiency advantage. Note that provenance templates encode a semantic representation with variables, whereas protocol templates provide guidelines for experiments.
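
The idea of storing only placeholder values can be sketched as follows. This is a simplified illustration rather than the PROV templating implementation itself: the template is a list of triples in which terms prefixed with `var:` act as placeholders, and each set of bindings is expanded into concrete statements. The `var:` naming, the `ex:` identifiers, and the `expand` function are assumptions made for this example.

```python
# A provenance template: fixed structure, with "var:" terms as placeholders.
TEMPLATE = [
    ("var:file", "prov:wasGeneratedBy", "var:activity"),
    ("var:activity", "prov:wasAssociatedWith", "var:researcher"),
]

# Only these compact bindings need to be stored per recorded event (hypothetical values).
BINDINGS = [
    {"var:file": "ex:file_1", "var:activity": "ex:step_1", "var:researcher": "ex:researcher_1"},
    {"var:file": "ex:file_2", "var:activity": "ex:step_2", "var:researcher": "ex:researcher_1"},
]


def expand(template, bindings):
    """Replace every placeholder in the template with its bound value.

    Raises KeyError if a placeholder has no binding.
    """
    def resolve(term):
        return bindings[term] if term.startswith("var:") else term

    return [tuple(resolve(term) for term in triple) for triple in template]


if __name__ == "__main__":
    for binding in BINDINGS:
        for triple in expand(TEMPLATE, binding):
            print(triple)
```

The sketch also shows the limitation noted above: every expansion has exactly the structure of the template, so a protocol that deviates from it cannot be represented without defining another template.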

Curcin et al.[22] use a very similar approach for the provenance modelling in diagnostic decision support systems. A more flexible approach is the use of knowledge graph cells (KGCs), proposed by Vogt et al.[23] They provide a concept for the definition of knowledge structures. In particular, rules including ABox and TBox expressions might be defined that allow the dynamic modification of the KG. Thus, KGCs might be used to specify potential semantic structures of ELN protocols without particular information inside. The application of KGCs would require a complete definition over all possible semantic representations of ELN protocols, which is infeasible.

With respect to the vocabulary used to semantically describe the laboratory-specific information, the EXperimental ACTions (EXACT2) ontology, together with the Natural Language Processing (NLP) framework[18], aims at the automatic extraction of knowledge from biomedical protocols for prospective provenance. Similarly, the SeMAntic RepresenTation for Experimental Protocols (SMART Protocols) ontology reuses EXACT2 to represent prospective provenance from published protocols.[19] In contrast to both approaches that represent a plan, we aim at retrospective provenance, i.e., a particular course of activities. Both approaches, however, could be used to describe prospective provenance of the underlying plan of an ELN protocol, to allow the documentation of potential deviations from the original plan. The Reproduce Microscopy Experiments (REPRODUCE-ME) ontology[24] introduces a specific vocabulary to describe retrospective provenance for microscopy experiments. Besides, the domain-independent ontologies, PROV-O and its predecessor Open Provenance Model (OPM)[25], are frequently employed as upper-level ontology for provenance documentation.[3] Furthermore, many extensions for specific applications have been proposed. The Provenance, Authoring, and Versioning (PAV) ontology, for example, proposes a mechanism for the versioning and authoring of web resources[26], and CollabPG encodes collaborations within processes.[3] With respect to the application domain of the use case, the Open Biological and Biomedical Ontology (OBO) Foundry is a community initiative aiming at the development and maintenance of ontologies in the biomedical domain.[27] The Basic Formal Ontology (BFO)[28] is the upper-level ontology that is used for each of the OBO ontologies.

For the retrospective provenance documentation of research data from computational workflows, several specifically tailored tools and approaches have been proposed in the literature. ProvBook[17], for instance, tracks provenance in Jupyter Notebooks that are used for literate programming. Dataprov[29] is a wrapper tool that produces provenance information from the execution of analysis tools, and noWorkflow[30] captures provenance information from analysis scripts, for example those written in Python. Aside from these methods, other provenance tracking approaches, known as lineage retrieval[31] or lineage tracking, and workflow systems exist.[32] In general, in-silico workflow systems not only record provenance information but also specify the involved processing steps and enable their execution, possibly on a distributed system.[33] However, as these systems are limited to computational analyses, they are difficult to apply to the provenance of research data from wet lab experiments.

Regarding the completeness of the documentation with respect to reproducibility, plenty of standards exist that aim at the definition of the minimum set of information required to comprehend and reproduce the research investigation for different applications. With respect to the use case at hand, the minimum information for electrical cell stimulation[34] and the Minimum Information About a Cellular Assay (MIACA)[35] provide such references for the documentation. Similarly, standard operating procedures (SOPs) or published instructions for experiments encode standards for the documentation of a particular experiment.

When considering the publication or archiving of research data, metadata is important to provide additional context, enabling others (including the future self) to understand the research process and the resulting data. In particular, the FAIR guiding principles provide abstract recommendations for handling research data to enable its re-usability.[1] Together with the implementation suggestions of these guidelines[2], they provide a framework which is also applicable for research data from wet lab experiments. While both guidelines provide generic recommendations regarding research data documentation, different standards exist that provide vocabulary for their support. Several initiatives foster the development of documentation standards for research data, including the Data Documentation Initiative (DDI) that focuses on standardizing metadata for social science datasets.[36] The Dublin Core, instead, is a more general definition of 15 metadata elements for electronic resources.[37][38] Similarly, Data Catalog Vocabulary (DCAT) provides a common vocabulary for the interoperability of data catalogs[39] and, thus, also defines required metadata for research data. Additionally, domain-specific metadata standards have been developed. With respect to the use case, this includes metadata for microscopy images, such as that proposed by the RDM4mic Initiative.[40] In addition to these metadata, the information inside the data file might also be described. For this purpose, codebooks and data dictionaries are employed.[41][42] Considering a CSV file as an example, this includes information about each column such as the domain of the values and the unit of the measurements. This information is defined in a separate file that helps comprehend the raw data.
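The role of a data dictionary for a CSV file can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the column names, units, and value domains are invented for this example and are not taken from the study.

```python
import csv
import io

# Hypothetical data dictionary: one entry per CSV column, giving the unit
# and the permitted value domain. All names and limits are invented.
DATA_DICTIONARY = {
    "time": {"unit": "s", "min": 0.0, "max": None},
    "intensity": {"unit": "a.u.", "min": 0.0, "max": 255.0},
}

def check_rows(csv_text, dictionary):
    """Return (row_number, column, value) for every value outside its domain."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for n, row in enumerate(reader, start=1):
        for column, spec in dictionary.items():
            value = float(row[column])
            too_low = spec["min"] is not None and value < spec["min"]
            too_high = spec["max"] is not None and value > spec["max"]
            if too_low or too_high:
                problems.append((n, column, value))
    return problems

sample = "time,intensity\n0.0,12.5\n0.5,300\n"
problems = check_rows(sample, DATA_DICTIONARY)
```

Keeping the dictionary in a separate structure (or file) means the raw data stay untouched while their interpretation remains machine-checkable.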

For the publication and archiving of this data, including the semantic documentation, several approaches have been proposed. These include bundling formats such as BagIt[43], Oxford Common File Layout (OCFL)[44], and RO-Crate[45], as well as literate programming methods such as using Jupyter Notebook to combine (parts of) research data, their analysis source code, and results, as well as their documentation. RO-Crate[45] is a mechanism that allows the bundling of resources together with their associated metadata, supporting the FAIR publication and archiving of the research data. By re-using existing vocabulary such as schema.org or PROV-O, it implements a linked data approach to enable researchers to provide all information necessary to (re-)use the described research data. This includes basic properties such as author and title of the resource, a license for publication, a description of the files, and a description of the workflow used to create those files in terms of retrospective provenance, including employed software and other equipment. In brief, an RO-Crate bundle consists of the research data files and a metadata file called ro-crate-metadata.json, which contains structured metadata about the files and the entire bundle in JSON-LD format. While the ro-crate-metadata.json contains all information in a machine-interpretable way, it is accompanied by a human-readable HTML representation. RO-Crate has successfully been used for the documentation of retrospective provenance of in-silico studies[46], but can, due to the flexibility of the vocabulary, also be used for retrospective provenance of wet lab experiments.

Methods

The objective of the study was to investigate whether it is possible to create semantic documentation of the research process and the resulting research data in terms of provenance. To this end, semantic documentation was manually created by analyzing the ELN protocol. To support potential automation of the semantic model creation, based on the results of this analysis, a protocol template was designed that (i) guides researchers through the process while (ii) requiring them to provide all information necessary to comprehend the origin of the research data. The resulting protocol template was split up into a set of templates that encode steps of an experiment such as the staining or the imaging with a particular set of stimulation parameters. These sub-templates ease the re-use for new experiments, e.g., by combining them in other permutations. Based on this, researchers documented their wet lab experiments, resulting in a set of ELN protocols, each of which contains variations, such as differences in parameters, execution time, or execution order. The different protocols were then automatically analyzed, translated into a semantic model, and finally bundled into self-contained archives. The following provides a detailed description of each step.

Manual model engineering

The manual engineering process for the semantic model of the ELN protocol consisted of iterative modelling and reviewing. Domain experts were consulted during this process in order to validate the model. The main objective of this process was to check whether all information needed for the semantic provenance modelling is available in ELN protocols and whether it can be transferred into a semantic representation by employing existing ontologies. The aim of the resulting model was to document the provenance of the research data.

Protégé[47] was used for model engineering. In particular, the modelling was conducted as follows:

  1. BioPortal[48] and Ontobee[49] are used to identify relevant ontologies for terms from the ELN protocol and the inventory database items.
  2. A set of ontologies is selected from these search results so that the coverage of terms from the ELN in a single ontology is maximized. Ontologies from the OBO Foundry[27], compatible with the BFO[28], were preferred.
  3. Ontology classes representing inventory database items in the ELN (see Fig. 3) are added into the ELN description of the corresponding inventory database item as a reference for the semantic modelling.
  4. The semantic model itself is constructed by ABox statements, i.e., the creation of instances of these classes that represent the particular entities and activities of the protocol and the inventory database. Each instance gets a unique identifier in the local namespace, reflecting the individual entity; for example, MG-63_(P25,_LOT_57840088) is used to encode passage 25 of the MG-63 cells that were delivered with the lot number 57840088 (see also Fig. 5). The specific input and output relations of the activity classes are used in order to connect the particular entities correspondingly.
  5. References to the same entities in other KGs such as Wikidata[50] are included by employing the owl:sameAs relation. This is essential for linked open data according to the five-star deployment scheme proposed by Berners-Lee.[51]

The following three rules were considered during iterative modelling in order to prevent the modeller and the domain experts from introducing bias:

  • Use ontological classes of the same granularity as the terms in the experiment documentation, e.g., “washing” instead of “material processing.”
  • Avoid the introduction of new classes and attributes whenever possible (e.g., avoid TBox statements) and re-use existing ontologies.[52]
  • Use only information from the ELN protocol, and do not introduce further knowledge apart from the references to other KGs.

Thus, the semantic model serves as demonstrator for the inherent potential of ELN protocols.

Structure-based modelling approach

Manual model engineering reveals the potential of ELN protocols for the semantic documentation of research data. However, in order to use this at large scale, a more automated approach is needed. To approach this target, the structure-based method presented here employs the textual structure in the ELN protocols, as well as basic text analysis, which is introduced in the following sections.

Considering the ELN protocol from the manual model, we observed that the main content is structured by:

  • headings and paragraphs,
  • tables (table headings and body),
  • enumerations and lists, and
  • links to inventory items and research data.

Headings are used to structure the documentation, e.g., the general section about the experimental details or a particular set of activities is preceded by a heading (upper and lower part in Fig. 2, respectively). In the latter case, the different sets of activities in a protocol correspond to the templates we extracted, i.e., at each heading a new template was included.

Tables are used here for two different purposes. First, key-value tables encode general information about an experiment or inventory item, e.g., the objective of the investigation or the manufacturer of a resource. The description of inventory items mainly consists of a table of this kind (see Fig. 3). Second, activity tables list activities in two columns: “Step” and “Starting time.” Each row encodes an atomic activity of the experiment (see Fig. 2). In the activity tables especially, cells also include enumerations, lists, and paragraphs that further describe the atomic activities and parameters, as well as links to inventory items and research data. As an example, see the last row in the activity table in Fig. 2. Note that we assume each row defines an atomic activity, which we do not split up at this stage.

Considering our ultimate goal of retrospective research data provenance documentation, we exploited the structure of the ELN protocol as follows:

  1. General information such as the researcher conducting the experiment and the objective of the investigation are parsed from the key-value table at the beginning of the protocol. This information is added to the protocol activity using the relation qualifiedAssociation (prov:qualifiedAssociation).
  2. Activities described within the ELN protocol are hierarchically structured to represent different levels of granularity. The top-level activity resembles the entire experiment, while the different main sections are represented by second-level activities. Note that each main section contains an activity table. Finally, the third level represents activities from table rows of those tables.
  3. All activities are augmented by inventory items mentioned in the respective description by the used (prov:used) relation.
  4. For each research data file created during the investigation, a corresponding entity is created. Assuming that the mention of a file inside an activity marks the creation of this file, the activity is linked to the file using the relation wasGeneratedBy (prov:wasGeneratedBy).
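The four steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the layout of the input dictionary, the identifiers, and the function name `protocol_to_triples` are assumptions made for this example. The hierarchy is expressed with bfo:hasPart, which the Results section names for this purpose, and the association is attached directly to the researcher, whereas PROV-O would normally route it through an intermediate association node.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def protocol_to_triples(protocol: Dict) -> List[Triple]:
    """Translate a parsed ELN protocol (hypothetical dict layout) into triples."""
    triples: List[Triple] = []
    top = protocol["id"]  # top-level activity: the entire experiment

    # Step 1: general information from the key-value table (simplified).
    triples.append((top, "prov:qualifiedAssociation", protocol["researcher"]))

    for section in protocol["sections"]:
        # Step 2: every main section becomes a second-level activity.
        section_id = f"{top}/{section['name']}"
        triples.append((top, "bfo:hasPart", section_id))
        for number, row in enumerate(section["rows"], start=1):
            # Step 2 (continued): every table row is a third-level activity.
            row_id = f"{section_id}/{number}"
            triples.append((section_id, "bfo:hasPart", row_id))
            # Step 3: inventory items mentioned in the row description.
            for item in row.get("items", []):
                triples.append((row_id, "prov:used", item))
            # Step 4: files mentioned in the row are treated as generated by it.
            for path in row.get("files", []):
                triples.append((path, "prov:wasGeneratedBy", row_id))
    return triples

example = {
    "id": "eln0/protocol",
    "researcher": "researcher_A",
    "sections": [
        {"name": "staining", "rows": [
            {"items": ["PBS"], "files": []},
            {"items": ["LSM780"], "files": ["Data/example.czi"]},
        ]},
    ],
}
triples = protocol_to_triples(example)
```

Rows are deliberately not split further, mirroring the decision to treat each table row as atomic.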

As previously described, we do not further split up the third-level activities, i.e., complex structures such as enumerations and lists, including their order inside a step description, are taken as atomic.

Aside from the structural elements in the ELN, which were the basis for the manual model, we identified different repeating patterns that can be exploited. For example, from the textual description of activities such as “incubate 5 min in [Device] SANYO CO2 Incubator at 37C” or “wash cells with [Washing solution] PBS without Ca/Mg [..],” we observed the use of verb phrases indicating the activity of the step: “incubate” and “wash,” respectively. Here, we use the head verb of those phrases to assign the corresponding ontological class from a prior mapping. Similarly, information about researchers and institutions, manufacturers, file mime-types, and experiment type is included. For large-scale usage, this information might also be retrieved from an organizational or research information system.
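A prior mapping from head verbs to ontology classes might look like the following Python sketch. It rests on two assumptions that are not part of the study: the head verb is approximated by the first word of the step description, and the class given for “incubate” is a placeholder. Only the mapping of “wash” to washing (OBI_0302888) is taken from the text.

```python
from typing import Optional

# Prior mapping from head verbs to ontology classes. Only the entry for
# "wash" comes from the article; the entry for "incubate" is a placeholder.
VERB_TO_CLASS = {
    "wash": "obo:OBI_0302888",
    "incubate": "ex:IncubationPlaceholder",
}

def activity_class(step_text: str) -> Optional[str]:
    """Return the class mapped to the first word of a step description.

    Assumption: the head verb is the first token. Finding the head verb
    in general would require a syntactic parser.
    """
    words = step_text.strip().split()
    if not words:
        return None
    return VERB_TO_CLASS.get(words[0].lower())

wash_class = activity_class("wash cells with [Washing solution] PBS without Ca/Mg")
```

Steps whose first word is not in the mapping simply receive no specific class and keep only their general activity types.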

Parameters that are used in the textual description are identified by their unit, e.g., “1.5 ml,” “5 min,” and “37C” by employing regular expressions. They are then represented as blank nodes connected to the step using the relation has value specification (OBI_0001938) with the value as the numerical value of the parameter and the unit connected by has measurement unit label (IAO_0000039). We observed that most of the units mentioned in the protocols at hand are defined in the units ontology (UO).[53]
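The unit-based identification of parameters can be sketched in Python as follows. The regular expression, the list of units, the blank-node naming scheme, and the property `ex:value` for the numerical value are assumptions for illustration; the article does not give the expressions actually used. The two OBI and IAO relations are those named above.

```python
import re
from typing import List, Tuple

# A number followed by one of a few known units (assumed unit list).
PARAM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|min|ms|Hz|V|C)\b")

def extract_parameters(text: str) -> List[Tuple[float, str]]:
    """Return (value, unit) pairs found in a step description."""
    return [(float(value), unit) for value, unit in PARAM_RE.findall(text)]

def parameter_triples(step_id: str, params: List[Tuple[float, str]]) -> list:
    """Represent each parameter as a blank node attached to the step."""
    triples = []
    for index, (value, unit) in enumerate(params):
        node = f"_:{step_id}_p{index}"  # hypothetical blank-node label
        triples.append((step_id, "obo:OBI_0001938", node))  # has value specification
        triples.append((node, "ex:value", value))            # placeholder property
        triples.append((node, "obo:IAO_0000039", unit))      # has measurement unit label
    return triples

params = extract_parameters("incubate 5 min in [Device] SANYO CO2 Incubator at 37C")
triples = parameter_triples("step/3", params)
```

The trailing word boundary keeps ordinary words that merely start with a unit letter, such as “Volume,” from being read as a unit, and the “2” in “CO2” is skipped because no unit follows it.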

Another frequently used pattern observed in the textual description is the mixture of biological and chemical resources, e.g., “89% [Culture Medium] DMEM + 10% [Serum] FCS + 1% [Antibiotic] Gentamicin”. By employing the following regular expression, the contained information is extracted and transferred into a representation of activity of type creating a mixture of molecules in solution (OBI_0000685):

[\.\d]+\s*% <item> ('+' [\.\d]+\s*% <item>)+

Depending on the appearance of attribution notes in the corresponding contexts (e.g., “(Attributed to Susanne Staehlke)”), we create separate activities following the same specification. Figure 7 contains an example activity encoding the creation of the above mixture.
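The mixture pattern can be made concrete with a Python sketch. This is one possible rendering, not the expression used in the study: `<item>` is approximated by any text up to the next “+”, the number part is tightened so that a lone dot is not accepted, and the function name `parse_mixture` is hypothetical.

```python
import re
from typing import List, Tuple

NUMBER = r"\d+(?:\.\d+)?"
# One component: a percentage followed by an item description.
COMPONENT_RE = re.compile(rf"({NUMBER})\s*%\s*([^+]+)")
# A mixture: at least two components joined by '+'.
MIXTURE_RE = re.compile(rf"{NUMBER}\s*%[^+]+(?:\+\s*{NUMBER}\s*%[^+]+)+")

def parse_mixture(text: str) -> List[Tuple[float, str]]:
    """Return (percentage, item) pairs of the first mixture found in text."""
    match = MIXTURE_RE.search(text)
    if match is None:
        return []
    return [
        (float(percent), item.strip())
        for percent, item in COMPONENT_RE.findall(match.group(0))
    ]

components = parse_mixture(
    "89% [Culture Medium] DMEM + 10% [Serum] FCS + 1% [Antibiotic] Gentamicin"
)
```

Requiring at least one “+” keeps a single percentage, such as a gas concentration, from being mistaken for a mixture.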

Preparing ELN protocol template

ELN protocols encode instructions (i.e., lists of activities) to (re-)produce the particular research findings. This does not restrict researchers but rather provides a guideline based on earlier experiments. Specifically, they include parameters, timestamps, and the research data. Taking the first experiment of our study, which was documented as an ELN protocol, we derived a protocol template by marking all variable information as placeholders. Together with the domain experts, this generalization has been validated to allow the usage as a basis for new experiments. The main advantage for the researchers conducting experiments in the wet lab is that all parameters that need to be documented during the experiment are highlighted while the overall description of the process is already done. Thus, errors introduced from missing parameters or instructions are reduced. If, however, the documentation needs to be modified during the experimental execution, researchers can adjust the activities and description.

This protocol template might already be used for the documentation of identical experiments (including identical ordering of parameter variations). However, as the researchers in our use case permute the different parts of the experiment (i.e., the stimulation parameters in each experiment), the templates were further split up in individual steps. For the use case at hand, we identified the following four parts: (i) Preparation, (ii) Fluo-3 Staining, (iii) Ca-imaging with Stimulation, and (iv) Ca-imaging without Stimulation. Figure 4 illustrates the template for the approach using electrical stimulation. Placeholders that will be replaced with specific parameter values during an experiment are marked with orange background color. These templates can be re-combined and used to encode new experiments. A protocol template, therefore, can be interpreted as a combination of templates which themselves are combinations of activities in a textually structured description. In consequence, an ELN protocol represents a completed protocol template with actual parameters.
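The idea of sub-templates with placeholders that are recombined into protocols can be sketched with Python's `string.Template`. The template wording, placeholder names, and parameter values are invented for this example; only the names of the experiment parts follow the text.

```python
from string import Template

# Hypothetical sub-templates; placeholders are marked with '$'.
TEMPLATES = {
    "preparation": Template("Preparation: seed cells, passage $passage"),
    "staining": Template("Fluo-3 Staining: incubate $stain_min min"),
    "imaging_with_stimulation": Template(
        "Ca-imaging with Stimulation: $voltage V, $frequency Hz"
    ),
}

def build_protocol(parts):
    """Fill sub-templates in the given order.

    parts is a list of (template_name, values) pairs. substitute() raises
    KeyError when a placeholder has no value, so no parameter can be
    silently left out.
    """
    return [TEMPLATES[name].substitute(values) for name, values in parts]

protocol = build_protocol([
    ("preparation", {"passage": 25}),
    ("staining", {"stain_min": 40}),
    ("imaging_with_stimulation", {"voltage": 5, "frequency": 7.9}),
    ("imaging_with_stimulation", {"voltage": 1, "frequency": 20}),
])
```

Permuting the stimulation parameters then only requires reordering or repeating entries in the list passed to `build_protocol`.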


Figure 4. Template transferred from an ELN protocol section by highlighting parameters (marked with orange background color). The template contains the preparation and microscoping of a sample with stimulation. Note that this template aims at supporting researchers during their documentation, but the semantic translation approach is more general.

Bundling research data and re-use

The structure-based approach automatically translates the ELN protocol into a semantic representation of the activities and resources involved in the production of the research data. In order to combine this semantic representation (i.e., the documentation) with the research data, we employ the RO-Crate format. The RO-Crate bundle consists of the semantic model in a JSON-LD file ro-crate-metadata.json, the research data files, and a human-readable copy of the original ELN protocol and the inventory item description as HTML files.

By using the resulting RO-Crates for our use case, we answer the raised provenance questions. Therefore, we load all semantic representations from the RO-Crates into a linked data server with a SPARQL endpoint. In this study, we use Apache Jena Fuseki (v4.1.0)[54] for this purpose.

An advantage of the semantic representation of the research data documentation is its machine interpretability. This enables the comparison of the experimental processes with respect to similarities and potential differences that may have influenced the final result. This includes the particular execution times, but also omitted or additional steps as well as different parameter combinations. Furthermore, influences of the order of the different parts can easily be investigated (W7).

Results

First, we present the details of the manually engineered semantic representation of the Ca-imaging procedure which served as (i) a proof of concept for the effectiveness of retrospective provenance documentation from ELN protocols, (ii) a basis for analysis of the ELN protocol structure, and (iii) the development of the protocol template for research guidance. Second, details of the structure-based semantic translation for the seven Ca-imaging protocols with stimulation are given. Finally, we present the results of the evaluation of the RO-Crate bundles.

Manually engineered model

The semantic representation of the Ca-imaging procedure is based on the upper-level ontology BFO. In addition, PROV-O[25] is used for retrospective provenance documentation of the experimental results. Table 1 lists the most important ontologies used in the model. For the representation, an artifact-based modelling approach was selected, where artifacts are central to the model and are used to connect activities via their corresponding input and output relations. In total, the protocol as well as the inventory items are represented in about 80 resources of 46 types connected by almost 20 distinct predicates from 13 vocabularies.

Table 1. Ontologies selected for the manually engineered model. Upper rows list general ontologies, while the lower rows list domain-specific ontologies for resources and activities.
Name | Source | Details
BFO | Smith et al.[28] | Basic Formal Ontology
PROV-O | Moreau et al.[25] | PROV Ontology
BTO | Gremse et al.[55] | BRENDA Tissue Ontology
CHEBI | Degtyarenko et al.[56] | Chemical Entities of Biological Interest Ontology
CLO | Sarntivijai et al.[57] | Cell Line Ontology
OBI | Bandrowski et al.[58] | Ontology for Biomedical Investigations
FOAF | Brickley and Miller[59] | People and their web information

All inventory items that were mentioned as resources in the protocol were represented by instances of the corresponding ontology classes (ABox statements), which is exemplified in the following by use of the MG-63 cell line. The manually engineered representation, as well as the corresponding inventory database description, are illustrated in Figs. 5 and 3, respectively.


Figure 5. Graphical representation of the manually engineered semantic model of the MG-63 cell line used in the protocol. (See Schröder et al.[8])

In the ELN protocol, a passage with number 25 of the originally supplied MG-63 cells with lot number 57840088 was used, i.e., “[Cell line] MG-63 P25 LOT 57840088”.[b] This is modelled by using multiple instances of the corresponding class MG-63 cell (CLO_0007699), which are connected with the relation is_passage_of. The passage information is annotated using the attribute passage situation (CLO_0051628). Lot numbers are represented as an instance of lot number (IAO_0000132) and connected to the cell instances using the newly defined relation has_lot_number. The creation of a cell passage is attributed to a researcher using the relation wasAttributedTo (prov:wasAttributedTo). Finally, the supplier is an instance of class Organization (prov:Organization) and related to the cells using has_supplier (OBI_0000647).

The modelling of the ELN protocol can be summarized as the creation of instances of activity classes that require their individual input entities and often produce an output entity which serves as an input for the subsequent activity (artifact-based modelling). Examples of atomic activities and their corresponding activity classes include washing (OBI_0302888), creating a mixture of molecules in solution (OBI_0000685), or cell line cell culturing (CLO_0000000). The relations that are used to connect the entities to the activities are modelled in the corresponding ontology and depend on the actual activity class. Additionally, these processes are also of type Activity (prov:Activity) in order to encode general provenance information.

This modelling approach was employed for the entire ELN protocol. However, the most interesting part when it comes to the provenance documentation of research data is the activity that produces or uses the research data. The upper part in Fig. 6 illustrates the documentation from the ELN protocol relevant for the research data generation: the first two steps describe the creation of the data, while the last step contains the details about the actual analysis.


Figure 6. Graphical representation of the semantic model describing the data recording (see also Fig. 5). (See Schröder et al.[8])

Structure-based model

For the structure-based model, an activity-based modelling approach was used to resemble the textual structure of the ELN protocol. For this purpose, the model was built upon the general-purpose ontologies RO-Crate, PROV-O, and BFO. In total, for the representation of the seven protocols and their corresponding inventory items, 1935 resources of 18 types connected by 36 distinct predicates from seven vocabularies were used.

The structural hierarchy of the activities was represented by bfo:hasPart, while the sequential order was represented by wasInformedBy (prov:wasInformedBy). Figure 7 illustrates this structure. For each activity, the general types Action, prov:Activity, and bfo:process were used. Further links to external ontologies were added by owl:sameAs, for instance “wash” was augmented by washing (OBI_0302888).
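Given such a model, the execution order of activities can be recovered by following the wasInformedBy links. The following Python sketch assumes, for illustration only, that triples are plain tuples, that each activity is informed by its direct predecessor, and that the activities form a single unbranched chain; the identifiers are hypothetical.

```python
from typing import List, Tuple

def execution_order(triples: List[Tuple[str, str, str]]) -> List[str]:
    """Order activities by following prov:wasInformedBy links.

    Assumes one linear chain: (later, prov:wasInformedBy, earlier).
    """
    informed_by = {s: o for s, p, o in triples if p == "prov:wasInformedBy"}
    # The first activity informs another one but is not informed itself.
    starts = set(informed_by.values()) - set(informed_by.keys())
    if len(starts) != 1:
        raise ValueError("expected exactly one starting activity")
    successor = {earlier: later for later, earlier in informed_by.items()}
    order = [starts.pop()]
    while order[-1] in successor:
        order.append(successor[order[-1]])
    return order

order = execution_order([
    ("ap/3", "prov:wasInformedBy", "ap/2"),
    ("ap/2", "prov:wasInformedBy", "ap/1"),
    ("ap", "bfo:hasPart", "ap/1"),  # hierarchy triples are ignored here
])
```

Keeping hierarchy and sequence in separate relations allows the same traversal to be applied at every level of granularity.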


Figure 7. Graphical representation of an excerpt of the semantic model that was created semi-automatically.

The RO-Crate’s root data entity that describes the research data is required to be an entity of type Dataset (schema:Dataset). Thus, research data files are added to this dataset by hasPart (schema:hasPart). The connection of these file entities and the hierarchical structure of the activities is represented by wasGeneratedBy (prov:wasGeneratedBy) (see the right part of Fig. 7), when mentioned in the activities’ textual description. This means that all files are included in this root data entity (via hasPart), but are not necessarily associated to the activities, if they are not mentioned.
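The relation between root data entity, files, and activities can be illustrated by a Python sketch that assembles a minimal ro-crate-metadata.json. It assumes RO-Crate 1.1 (the version used in the study is not stated here), writes the PROV-O property as a full IRI, and omits properties such as name, description, and license. The file names and the activity identifier are hypothetical.

```python
import json

PROV_WAS_GENERATED_BY = "http://www.w3.org/ns/prov#wasGeneratedBy"

def build_metadata(files, generated_by):
    """Build a minimal RO-Crate metadata document.

    files: list of relative file paths.
    generated_by: dict mapping a file path to the id of the activity whose
    description mentions the file. Unmentioned files get no such link.
    """
    graph = [
        {   # metadata descriptor (RO-Crate 1.1, assumed)
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root data entity: every file is listed via hasPart
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    for path in files:
        entity = {"@id": path, "@type": "File"}
        if path in generated_by:
            entity[PROV_WAS_GENERATED_BY] = {"@id": generated_by[path]}
        graph.append(entity)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}

metadata = build_metadata(
    ["Data/a.czi", "Data/b.czi"],
    {"Data/a.czi": "#ap_1_with_stimulation/15"},
)
document = json.dumps(metadata, indent=2)
```

In this example, Data/b.czi is part of the dataset but carries no wasGeneratedBy link, which corresponds to a file that is not mentioned in any activity description.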

Following the RO-Crate specification, ELN inventory database items are encoded as the domain-independent type IndividualProduct as they provide contextual information. However, the ontological knowledge about the type of the biological and chemical resource was added using the relation owl:sameAs by the external references from the description in the ELN. The resulting entity is connected to the activities using used (prov:used). Resources with a specific passage or lot number are added as individual entities connected to a general entity encoding the inventory database item using the relation is_instance_of. Furthermore, attributes has_passage_number and has_lot_number are added with their corresponding information.

Several mixtures are used in the ELN protocols. This information is modelled around the activity creating a mixture of molecules in solution (OBI_0000685). All resources that are used in this activity are linked by has_specified_input (OBI_0000293) and the resulting mixture entity by has_specified_output (OBI_0000299). To specify the recipe of this mixture, a material combination objective (OBI_0000686) is created and linked to the activity using achieves_planned_objective (OBI_0000417). If an attribution of this mixture is annotated in the ELN protocol, the corresponding agent is associated with the resulting mixture entity via wasAttributedTo. Note that recipes of a mixture are independent of the actual creation activity, i.e., if multiple researchers create a mixture using the same recipe, the same recipe entity is referenced, but individual activities and mixture entities are created.
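The separation of recipe and creation activity can be sketched in Python with triples as plain tuples. The identifiers and the function name `mixture_triples` are hypothetical; the OBI relations and classes are those named above.

```python
def mixture_triples(activity_id, inputs, output, recipe_id, agent=None):
    """Triples for one 'creating a mixture of molecules in solution' activity."""
    triples = [(activity_id, "rdf:type", "obo:OBI_0000685")]
    # has_specified_input for every resource that goes into the mixture
    triples += [(activity_id, "obo:OBI_0000293", item) for item in inputs]
    # has_specified_output for the resulting mixture entity
    triples.append((activity_id, "obo:OBI_0000299", output))
    # the recipe is a material combination objective shared between activities
    triples.append((recipe_id, "rdf:type", "obo:OBI_0000686"))
    triples.append((activity_id, "obo:OBI_0000417", recipe_id))
    if agent is not None:
        triples.append((output, "prov:wasAttributedTo", agent))
    return triples

resources = ["DMEM", "FCS", "Gentamicin"]
first = mixture_triples("mix/1", resources, "medium/1", "recipe/medium", "researcher_A")
second = mixture_triples("mix/2", resources, "medium/2", "recipe/medium", "researcher_B")
combined = set(first) | set(second)
```

Because both activities reference the same recipe identifier, the recipe's type statement appears only once in the combined set, while activities and mixture entities remain individual.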

With respect to parameters, we extracted values and units for the following types: (i) time and duration (min and ms), (ii) temperature (Celsius), (iii) frequency (Hz), and (iv) voltage (V), and represented them by their corresponding classes. The frequency and the voltage are of particular interest, as they provide the parameters for the stimulation of the cells during the Ca-imaging approach.

ELN protocols and protocol template

By providing templates for the individual parts of the experiment (preparation, Fluo-3 staining, Ca-imaging with and without stimulation), the researchers were able to compile seven ELN protocols with different permutations of the experiment parameters. In comparison to the predefined protocol template, we observed that the researchers further modified the ELN protocol description to reflect the particular course of activities and observations conducted in the wet lab, e.g., the repetition of an experimental setting due to issues in the previous experiment or the documentation of issues during the experiment. That means the model represents such deviations from the original plan (prospective provenance) and allows for tracking the actually documented activity sequence by means of retrospective provenance.

Research data bundles

In summary, seven RO-Crates have been created, one for each ELN protocol of the Ca-imaging experiments with stimulation. The corresponding semantic representation was automatically created using the structure-based approach. All research data that was produced in a particular experiment, together with this semantic representation, was bundled in the RO-Crate. In order to foster readability, a copy of the ELN protocol and the inventory items' description was included in the form of HTML files. The resulting RO-Crates contain between 110 and 135 files and are between 107 and 185 MB in size. The particular ELN protocols are encoded in models of 2,174 to 2,553 triples, with 15,823 triples in total. As some triples (such as those describing researchers, institutions, and resources) are identical across all RO-Crates, the number of unique triples is only 13,490. The number of triples per protocol differs due to deviations in the documentation from the original plan and the number of research files.

The structure-based approach employs RO-Crate, PROV-O, and BFO as upper-level ontologies. RO-Crate and PROV-O in particular are designed to encode provenance information about resources. Provenance information about experimenter, manufacturer, biological and chemical resources, activities, and research data is transferred by this approach into a semantic representation. To illustrate the capabilities of the resulting RO-Crate bundles, we evaluated SPARQL queries for the W7 questions in our use case. Considering the question “How was a particular file created?” (W3), Fig. 8 presents the corresponding SPARQL query for a Ca-imaging approach in a particular experiment. Table 2 illustrates an excerpt of the result of this query, i.e., the sequence of activities from one experiment, providing the answer to question W3. That is, for every atomic activity within the Ca-imaging approach, the description as well as the created research data are listed in the order of execution. Moreover, all resources and equipment (W2), as well as the parameters, are returned by the query.


Figure 8. This SPARQL query selects (1) the ontological activity classes, (2) the research data produced, (3) the resources and equipment used, and (4) the parameters for each atomic activity, ordered by execution, in a Ca-imaging approach with stimulation from one of the use case ELN protocols that were translated using the structure-based modelling approach.

Table 2. An excerpt of the resulting output for the SPARQL query in Fig. 8.
Activity | Text | Act.-Class | Resources | Files | Par.-Units | Par.-Values
[...]
ap_1_with_stimulation/14 | place [Device] IonOptix 12 well plate chamber electrodes on plate | obo:NCIT_C52253 | IonOptix 12 well plate chamber | | |
ap_1_with_stimulation/15 | incubate for 10min with stimulation in LSM hood: [...] | obo:OMIT_0005807, obo:OBI_0001007, obo:OBI_0302893 | LSM780, ZEN 2011 (black edition) | Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi | obo:UO_0000031, obo:UO_0000028, obo:UO_0000218, obo:UO_0000106 | 5, 10, 7.9
[...]

Beside queries for individual experiments, the semantic models enable the comparison of the documentation of multiple experiments. As an example, we consider the question “What was the order of the stimulation parameters in a particular experiment?” (W7) that should be answered for seven experiments. Figure 9 illustrates the query for the comparison of multiple experiments based on the order of their stimulation parameters. The corresponding results are shown in Table 3.


Figure 9. This SPARQL query selects all experiments following the Ca-imaging procedure and collects their stimulation parameters in the order that they have been investigated.

Table 3. The result for the SPARQL query in Fig. 9 illustrating a comparison of multiple experiments based on the order of their stimulation parameters.
Protocol Title Stimulation parameters
eln1124/protocol Ca-imaging (with stimulation) 29.01.2021 7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V
eln1042/protocol Ca-imaging (with stimulation) 20Hz, 1V | 7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V
eln1021/protocol Ca-imaging (with stimulation) 20Hz, 1V | 20Hz, 5V | 7.9Hz, 5V | 7.9Hz, 1V
eln1022/protocol Ca-imaging (with stimulation) 7.9Hz, 5V | 7.9Hz, 1V | 20Hz, 1V | 20Hz, 5V | 7.9Hz, 5V
eln1023/protocol Ca-imaging (with stimulation) Failed (durch ATP Zugabe hat sich der Bildausschnitt verändert) 7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V
eln942/protocol Ca-imaging (with stimulation) 7.9Hz, 5V | 7.9Hz, 1V | 20Hz, 1V | 20Hz, 5V
eln1071/protocol Ca-imaging (with stimulation) 22.01.2021 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V | 7.9Hz, 1V

The remaining W7 questions could also be validated based on similar queries, as shown in the appendix of this paper. Thus, the proposed approach demonstrates the feasibility of research data documentation using ELN protocols.
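
Once the query results are available in tabular form, the comparison itself is easy to automate. The following minimal sketch is not part of the original study. It parses the parameter sequences listed in Table 3 and reports protocols in which a frequency/voltage setting occurs more than once. The function names and the particular checks are illustrative assumptions.

```python
from collections import Counter

# Stimulation parameter sequences copied from Table 3 (protocol -> sequence).
TABLE_3 = {
    "eln1124/protocol": "7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V",
    "eln1042/protocol": "20Hz, 1V | 7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V",
    "eln1021/protocol": "20Hz, 1V | 20Hz, 5V | 7.9Hz, 5V | 7.9Hz, 1V",
    "eln1022/protocol": "7.9Hz, 5V | 7.9Hz, 1V | 20Hz, 1V | 20Hz, 5V | 7.9Hz, 5V",
    "eln1023/protocol": "7.9Hz, 1V | 7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V",
    "eln942/protocol": "7.9Hz, 5V | 7.9Hz, 1V | 20Hz, 1V | 20Hz, 5V",
    "eln1071/protocol": "7.9Hz, 5V | 20Hz, 5V | 20Hz, 1V | 7.9Hz, 1V",
}


def parse_sequence(text):
    """Split 'f, v | f, v | ...' into an ordered list of (frequency, voltage)."""
    steps = []
    for step in text.split("|"):
        frequency, voltage = (part.strip() for part in step.split(","))
        steps.append((frequency, voltage))
    return steps


def repeated_settings(sequence):
    """Return the settings that occur more than once in a sequence."""
    counts = Counter(sequence)
    return [setting for setting, count in counts.items() if count > 1]


def same_order(protocol_a, protocol_b, table=TABLE_3):
    """Check whether two protocols applied the settings in the same order."""
    return parse_sequence(table[protocol_a]) == parse_sequence(table[protocol_b])


sequences = {protocol: parse_sequence(text) for protocol, text in TABLE_3.items()}
with_repeats = {
    protocol: repeated_settings(sequence)
    for protocol, sequence in sequences.items()
    if repeated_settings(sequence)
}
print(with_repeats)
```

For the data in Table 3, only eln1022/protocol is reported, because it lists the setting 7.9Hz, 5V twice.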

Discussion

The results of the manual modelling show that it is feasible to translate the information of an ELN protocol into a semantic representation for documenting the retrospective provenance of research data. Moreover, we have shown that creating ready-to-publish bundles with RO-Crate, containing the research data, the associated metadata, and the retrospective provenance documentation, makes it possible to answer questions about the experimental procedure that produced the research data. The manually engineered model implements an artifact-based modelling approach that uses ontological terms to their full extent. Thus, the resulting representation mainly consists of a sequential list of activities and entities connected via their specific input and output relations. The level of granularity of the model corresponds in most cases to the terms used in the documentation, although existing ontologies do not always provide the same level of detail for all terms. As an example, the terms “relocate,” “transfer,” and “take out” can be subsumed under the moving of materials, but they still have some distinct differences. Furthermore, when “take out” is used in the context of a fridge or freezer, ontological modelling additionally requires encoding the warming up of the material. Thus, providing ontological definitions for these different situations requires considerable future work in ontological engineering.

In contrast to the manual model, the structure-based approach implements an activity-based modelling mechanism: it uses the same activities but not their specific input and output relations. As a result, the structure-based approach does not specify the particular role of the used entities. Furthermore, in the manually engineered model, the semantic representation of an entity that results from the sequential execution of activities is difficult without introducing TBox statements. The reason is that this entity needs to reflect the result of the particular activity sequence. In the structure-based approach, these entities need not be defined, as the main part of the model consists of a hierarchy of activities, including the used resources. This allows the model to represent only the information that is actually contained in the textual description of the ELN protocol, without artificially introducing entities with properties that are the direct result of the activities.

Besides the process documentation, the structure-based approach adds metadata about the MIME type, the file size, and the checksum, which allows the integrity of the research data to be validated. This representation of the research data might be extended by additional metadata, which, however, would require either applying file-type-specific extraction methods (e.g., for CZI files) or having the researchers themselves provide the information (e.g., in the form of data dictionaries for tabular data). Moreover, representing the research data itself in the same representation format as the metadata and the retrospective provenance documentation would enable further data integration and thus allow for automatic data analysis approaches.
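
To illustrate how such file metadata supports integrity checks, the following minimal sketch collects the MIME type, size, and checksum of a file and later verifies the file against the recorded values. It is not the implementation used in this study. The choice of SHA-256, the function names, and the sample file are assumptions made for illustration only.

```python
import hashlib
import mimetypes
import os
import tempfile


def describe_file(path, chunk_size=65536):
    """Collect MIME type, size in bytes, and SHA-256 checksum for one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large image files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # guess_type relies on the file extension and returns None if it is unknown.
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "mime_type": mime_type or "application/octet-stream",
        "size_bytes": os.path.getsize(path),
        "sha256": digest.hexdigest(),
    }


def verify_file(path, recorded):
    """Return True if size and checksum still match the recorded metadata."""
    current = describe_file(path)
    return (
        current["size_bytes"] == recorded["size_bytes"]
        and current["sha256"] == recorded["sha256"]
    )


# Demonstration with a hypothetical sample file in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "series.txt")
    with open(sample, "wb") as handle:
        handle.write(b"frame;intensity\n")
    record = describe_file(sample)
    unchanged = verify_file(sample, record)

    # Appending data simulates a later, undocumented modification.
    with open(sample, "ab") as handle:
        handle.write(b"1;0.42\n")
    modification_detected = not verify_file(sample, record)

print(record, unchanged, modification_detected)
```

Because the MIME type is guessed from the file extension, formats that the standard library does not know fall back to a generic type in this sketch.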

Employing the structure-based approach at a large scale requires knowledge about the relation of terms from the textual description in the ELN to classes and attributes from ontologies. Here, we implemented this relation by a hard-coded mapping, for instance from verb phrases to ontology classes in the case of activities. This can also be achieved by a suggestion system that proposes ontological classes to the researchers, selected from automated queries of ontological databases. Similarly, the set of external identifiers might be extended. The structure-based approach currently integrates the Open Researcher and Contributor ID (ORCID) and the Research Organization Registry (ROR) for persons and organizations, respectively, and it also uses references to Wikidata entities. Several initiatives have proposed the use of persistent identifiers for other aspects of wet lab experiments, e.g., RRIDs can be used to reference scientific resources similar to the inventory database of the ELN.[10] While using persistent identifiers, we observed two aspects that are crucial:

  1. The granularity of the entity referenced by the identifier needs to be on the same level as that needed for the application. As an example, the organization referenced by the Research Organization Registry (ROR)[60] does not reflect the particular department with which the researchers are affiliated.
  2. The entity referenced by the identifier needs to reflect evolution, too. Although the identifier should reference a particular version of an entity, the entity behind it might change; thus, the registry needs to encode these versions and provide a corresponding identifier for each version. To the best of our knowledge, this is currently not supported by, e.g., RRIDs.

A fine-grained solution for referencing researchers, organizations, and research projects on an institutional level might be implemented by organizational information systems.

Another important aspect is privacy protection, for instance, regarding the names of all persons involved in an experiment. While the identity of all involved persons is of interest for archival purposes, it might not be desirable to publish all personal details. The structured representation of the RO-Crate allows all involved persons to be queried (W1) and would thus directly allow for an easy implementation of pseudonymization via graph update operations.
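
The idea of pseudonymization as a graph update can be sketched on a small in-memory list of triples. This is a simplified illustration and not the procedure used for the published RO-Crates. The person identifiers, the names, and the use of schema:Person and schema:name are hypothetical, and a real graph would typically be updated through a triple store rather than a Python list.

```python
def pseudonymize(triples, keep=()):
    """Return a copy of the triples with the names of persons replaced.

    Persons whose identifiers are listed in `keep` retain their names; all
    others receive a generated label such as 'Person1 Anonymous'.
    """
    persons = sorted(
        {s for s, p, o in triples if p == "rdf:type" and o == "schema:Person"}
    )
    aliases = {}
    for person in persons:
        if person not in keep:
            aliases[person] = f"Person{len(aliases) + 1} Anonymous"

    updated = []
    for s, p, o in triples:
        if p == "schema:name" and s in aliases:
            o = aliases[s]  # replace the literal, keep the graph structure
        updated.append((s, p, o))
    return updated


# Hypothetical example graph with three persons and one protocol.
graph = [
    ("ex:p1", "rdf:type", "schema:Person"),
    ("ex:p1", "schema:name", "Alice Example"),
    ("ex:p2", "rdf:type", "schema:Person"),
    ("ex:p2", "schema:name", "Bob Example"),
    ("ex:p3", "rdf:type", "schema:Person"),
    ("ex:p3", "schema:name", "Carol Example"),
    ("ex:protocol", "schema:contributor", "ex:p2"),
]

published = pseudonymize(graph, keep={"ex:p1"})
names = {s: o for s, p, o in published if p == "schema:name"}
print(names)
```

Note that this sketch only replaces names. Other identifying statements, such as links to external person identifiers, would need to be handled in the same way before publication.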

Compared with recent advances in information extraction, we employed only basic methods. While these do not extract all information of interest, they sketch the potential benefits of automatic text analysis. More sophisticated information extraction methods, for instance, those trained on labelled published protocols[61], could further improve the results. This is also true for the extraction of parameters and their assignment to activities, as can be seen by recently established NLP challenges such as MeasEval.[62] Moreover, disambiguating detected terms with respect to their context and linking them to the corresponding ontology classes is one of the core challenges in modern NLP.
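
As an illustration of what such basic text analysis can look like, the following sketch combines a hard-coded verb-to-class lookup with a regular expression for numeric parameters. It is not the code used in this study. The mapping of “place” to obo:NCIT_C52253 follows the corresponding row of the activity table above, whereas the assignment of the unit symbols to individual Units Ontology identifiers, the regular expression, and the example sentence are assumptions.

```python
import re

# Hard-coded mapping from the leading verb of a step to an ontology class.
# Only the entry for "place" is taken from the activity table in this paper.
ACTIVITY_CLASSES = {
    "place": "obo:NCIT_C52253",
}

# Assumed mapping from unit symbols to Units Ontology (UO) terms.
UNIT_CLASSES = {
    "min": "obo:UO_0000031",  # minute
    "V": "obo:UO_0000218",    # volt
    "Hz": "obo:UO_0000106",   # hertz
}

# A number (optionally with decimals) directly followed by a known unit symbol.
PARAMETER = re.compile(r"(\d+(?:\.\d+)?)\s*(min|Hz|V)\b")


def extract_parameters(text):
    """Return all value/unit pairs found in a protocol step."""
    return [
        {"value": float(number), "unit": unit, "unit_class": UNIT_CLASSES[unit]}
        for number, unit in PARAMETER.findall(text)
    ]


def activity_class(text):
    """Look up the ontology class for the first word of a protocol step."""
    words = text.split()
    return ACTIVITY_CLASSES.get(words[0].lower()) if words else None


step = "incubate for 10min with stimulation at 5V and 7.9Hz"
parameters = extract_parameters(step)
print(activity_class("place electrodes on plate"), parameters)
```

A lookup of this kind returns no class for verbs that are not in the mapping, which is one reason why a suggestion system or more advanced entity linking would be needed at scale.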

With respect to the completeness of the documentation of wet lab experiments, minimal information guidelines provide a reference that can potentially be exploited to create protocol templates. In combination with the proposed structure-based approach, this would allow deployment of minimum information checklists using the Minim model[63] to enable the validation of the generated documentation.
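
The validation idea can be sketched with a simple checklist of required and recommended fields. The sketch below is only loosely inspired by checklist-based validation and does not implement the Minim model itself. The field names, the requirement levels, and the example record are hypothetical.

```python
# Hypothetical checklist: field name and requirement level.
CHECKLIST = [
    ("experimentalist", "MUST"),
    ("start_time", "MUST"),
    ("lot_number", "SHOULD"),
    ("passage_number", "SHOULD"),
]


def validate(record, checklist=CHECKLIST):
    """Report which checklist items are missing from a documentation record.

    A record is considered valid if no MUST item is missing; missing SHOULD
    items are reported but do not invalidate the record.
    """
    missing = {"MUST": [], "SHOULD": []}
    for field, level in checklist:
        if record.get(field) in (None, ""):
            missing[level].append(field)
    return {"valid": not missing["MUST"], "missing": missing}


# Hypothetical record derived from an ELN protocol.
record = {
    "experimentalist": "Person1 Anonymous",
    "start_time": "09:00:00",
    "lot_number": "",
}
report = validate(record)
print(report)
```

In a template-based workflow, such a report could be shown to the researcher before the documentation is bundled, so that missing items can be added while the experiment is still fresh.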

When ELNs are used over a longer period of time, inventory database items are regularly updated, for example, because the supplier changes or because the software or firmware of a device is updated. Evolution provenance methods can be employed to represent such changes. In order to also reflect these versions for the research data in the RO-Crate, a data storage solution with versioning is needed. Intra-consortia sharing platforms[64] can be employed for this purpose.

Overall, we have shown that our approach helps to generate increasingly FAIR data. The captured ELN protocols, together with the data entries in the RO-Crate format, increase the findability of data produced in wet lab experiments by creating a binding between experiment steps and data. Likewise, the approach increases accessibility by allowing rich SPARQL queries to be formulated that combine the experiment metadata with the data itself. In terms of interoperability and reusability, the use of common ontologies allows different experiment runs to be compared easily and documentation to be generated more easily. However, as noted by Mons et al.[65], FAIRness is not an absolute but a spectrum, with trade-offs between the ability to find and reuse data and the effort spent on documentation. Our approach illustrates this by highlighting the differences between automated capture and manual capture. In particular, while automated capture reduces the burden of capturing FAIR data, it also means, for the time being, a decrease in the richness of the associated metadata needed for reusability. The manually captured model thus provides a valuable target for the automated capture of metadata for the data produced in the wet lab.

Conclusion

The presented study investigated the feasibility of creating semantic provenance documentation for research data using ELN protocols from wet-lab experiments. ELN protocols contain specific information about an experiment such as the produced research data but also timestamps, lot and passage number as well as parameters. This is in contrast to templates that serve as general guidelines without such information.

The manually engineered model was used as a proof of concept for the translation of ELN protocols using a Ca-imaging experiment. In order to support researchers in the wet lab, we derived four templates encoding parts of this initial protocol that can be used to create new experiment documentation. Based on these results, a structure-based approach was implemented to translate these protocols into a semantic representation. This approach uses the structure in the description, including headings, tables, and links, as well as some basic text analysis. Furthermore, the resulting semantic model is bundled together with the research data. Potential provenance questions from the viewpoint of other researchers using these bundles have been implemented as SPARQL queries in order to evaluate the proposed methodology. We have shown that the structure-based approach, in combination with RO-Crate bundling, can be used to successfully document research data based on the description in the form of ELN protocols. Thus, these RO-Crates enable the sharing, publication, and archiving of the research data in terms of the FAIR principles.[1][2] Furthermore, in order to guide researchers while conducting Ca-imaging experiments, the four derived sub-templates can be combined to provide a documentation basis for new experiments.

Integrating the proposed approach, as well as the sketched extensions, into a comprehensive virtual research environment (VRE) would enable the tracking of the entire research process and the research data from the creation of a hypothesis to the publication of the data. In particular, the ELN can be used for the documentation of the wet lab investigation of a research project. The funding information of the research projects, including involved researchers and the consortia, can be stored in a research information system. Furthermore, the semantic representation of the protocol can be automatically synced with a linked data server, and the research data can be stored in an institutional repository. The individual platforms can be connected with a semantic search interface for researchers that enables searching for similar experiments and data, as well as creating reports about experimental activities.

Appendix

Queries and answers for the W7 questions

Note that for better readability, we shortened URIs in some of the following results, e.g., https://eln-provenance.elaine.uni-rostock.de/942/approach_1_with_stimulation/1 has been shortened to ap_1_with_stimulation/1, and http://localhost:3030/Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi has been shortened to 02_Zeitserie-Stimulation_5V_7.9Hz.czi.
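
The shortening described in this note can be expressed as a small helper. The following sketch reproduces the two examples given above. The function name and the fallback of returning unknown URIs unchanged are assumptions, and the rule of abbreviating “approach_” to “ap_” is inferred from the first example.

```python
import re

# Base of the activity URIs followed by the numeric experiment identifier.
ELN_BASE = re.compile(r"^https://eln-provenance\.elaine\.uni-rostock\.de/\d+/")
# Base of the data file URIs used in the examples.
DATA_BASE = "http://localhost:3030/Data/"


def shorten_uri(uri):
    """Shorten activity and data file URIs for display in result tables."""
    if uri.startswith(DATA_BASE):
        return uri[len(DATA_BASE):]
    local, replaced = ELN_BASE.subn("", uri)
    if replaced:
        # Abbreviate the leading 'approach_' of activity identifiers.
        if local.startswith("approach_"):
            local = "ap_" + local[len("approach_"):]
        return local
    return uri  # leave all other URIs untouched


activity = shorten_uri(
    "https://eln-provenance.elaine.uni-rostock.de/942/approach_1_with_stimulation/1"
)
data_file = shorten_uri(
    "http://localhost:3030/Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi"
)
print(activity, data_file)
```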


W1: Who participated in the study?

FigA1 Schröder JofBioSem22 13.png

Experimentalist Involved persons
Susanne Staehlke Person1 Anonymous, Person2 Anonymous

Note that before publication, we pseudonymized some researchers with respect to privacy protection. Refer to the Discussion section for more details.


W2: Which biological and chemical resources and which equipment were used in the study?

FigA2 Schröder JofBioSem22 13.png

Activity Used resources # of previous steps
ap_1_with_stimulation/1 Tube: 10ml 0
ap_1_with_stimulation/2 PBS without Ca/Mg 1
ap_1_with_stimulation/3 Eppendorf Centrifuge 2
ap_1_with_stimulation/4 3
ap_1_with_stimulation/5 50% HEPES I (isotonic) + 50% HEPES II (hypotonic) 4
ap_1_with_stimulation/6 Fluo-3/AM 5
ap_1_with_stimulation/7 Eppendorf Thermomixer C (incubation shaker) 6
ap_1_with_stimulation/8 LSM780, IonOptix 12 well plate chamber, IonOptix C-Pace EM 7
ap_1_with_stimulation/9 Eppendorf Centrifuge 8
ap_1_with_stimulation/10 9
ap_1_with_stimulation/11 HEPES I (isotonic) 10
ap_1_with_stimulation/12 12 well plate, PBS without Ca/Mg 11
ap_1_with_stimulation/13 HEPES I (isotonic) 12
ap_1_with_stimulation/14 IonOptix 12 well plate chamber 13
ap_1_with_stimulation/15 LSM780, ZEN 2011 (black edition) 14
ap_1_with_stimulation/16 IonOptix 12 well plate chamber 15
ap_1_with_stimulation/17 LSM780, ATP, ZEN 2011 (black edition) 16


W3: How was a particular file created?

FigA3 Schröder JofBioSem22 13.png

File Activity Protocol
Data/02_Zeitserie-Stimulation_5V_7.9Hz.czi eln942:ap_1_with_stimulation/15 eln942:protocol


W4: When was an activity conducted?

FigA4 Schröder JofBioSem22 13.png

Activity Starting time # of previous steps
ap_1_with_stimulation/1 09:00:00 0
ap_1_with_stimulation/2 immediately afterwards 1
ap_1_with_stimulation/3 09:01:00 2
ap_1_with_stimulation/4 09:06:00 3
ap_1_with_stimulation/5 immediately afterwards 4
ap_1_with_stimulation/6 immediately afterwards 5
ap_1_with_stimulation/7 09:10:00 6
ap_1_with_stimulation/8 immediately afterwards 7
ap_1_with_stimulation/9 09:40:00 8
ap_1_with_stimulation/10 09:45:00 9
ap_1_with_stimulation/11 10
ap_1_with_stimulation/12 immediately afterwards 11
ap_1_with_stimulation/13 immediately afterwards 12
ap_1_with_stimulation/14 immediately afterwards 13
ap_1_with_stimulation/15 09:50:00 14
ap_1_with_stimulation/16 10:00:00 15
ap_1_with_stimulation/17 immediately afterwards 16


W5: Why was the experiment done?

FigA5 Schröder JofBioSem22 13.png

Template Protocol Objective
Ca-imaging eln1023/protocol Intracellular calcium dynamic caused by electric fields
Ca-imaging eln1021/protocol Intracellular calcium dynamic caused by electric fields
Ca-imaging eln942/protocol Intracellular calcium dynamic caused by electric fields
Ca-imaging eln1071/protocol Intracellular calcium dynamic caused by electric fields
Ca-imaging eln1042/protocol Intracellular calcium dynamic caused by electric fields
Ca-imaging eln1124/protocol Intracellular calcium dynamic caused by electric fields
Ca-imaging eln1022/protocol Intracellular calcium dynamic caused by electric fields


W6: Where was the experiment conducted?

FigA6 Schröder JofBioSem22 13.png

Organization
University Medical Center Rostock

Note that the granularity of external identifiers is not always sufficient, e.g., the researchers and the equipment are affiliated with the “Department of Cell Biology,” which is part of the “University Medical Center Rostock.” See the Discussion section for more on this issue.

Abbreviations

BFO: Basic Formal Ontology

DCAT: Data Catalog Vocabulary

DDI: Data Documentation Initiative

ELN: Electronic Laboratory Notebook

EXACT2: EXperimental ACTions

FAIR: Findable, Accessible, Interoperable, and Reusable

KG: Knowledge Graph

KGC: Knowledge Graph Cell

MIACA: Minimum Information About a Cellular Assay

NLP: Natural Language Processing

OBO: Open Biological and Biomedical Ontology

OCFL: Oxford Common File Layout

OPM: Open Provenance Model

ORCID: Open Researcher and Contributor ID

PAV: Provenance, Authoring and Versioning

PROV-O: PROV Ontology

REPRODUCE-ME: Reproduce Microscopy Experiments

RO-Crate: Research Object Crate

ROR: Research Organization Registry

RRID: Research Resource Identifiers

SMART Protocols: SeMAntic RepresenTation for Experimental Protocols

SOP: Standard Operating Procedure

UO: Units Ontology

VRE: Virtual Research Environment

Footnotes

  1. A lot number is an identifier for a particular set of materials produced by one manufacturer. Thus, lot numbers make it possible to track information about the provenance of these materials.
  2. Note that this is not part of the inventory item description, which aims at the general cell specification. However, the particular information for a specific experiment is part of the ELN protocol.

Acknowledgements

We thank Tazin Hossain for her help with the prototypical implementation of parts of the structure-based approach.

Author contributions

Author contributions according to CRediT: MS: Conceptualization, Data curation, Methodology, Investigation, Software, Writing. SSt: Resources, Data curation. PG: Methodology, Writing. JBN: Resources, Funding acquisition. SSp: Funding acquisition, Supervision. FK: Conceptualization, Methodology, Writing, Funding acquisition, Supervision. All authors read and approved the final manuscript.

Funding

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – SFB 1270/1 - 299150580. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; and in the decision to publish the results. Open Access funding enabled and organized by Projekt DEAL.

Availability of data and materials

The manually engineered model is published on GitHub. The results of the structure-based modelling approach and the research data are published as RO-Crate bundles. The source that builds the model based on the ELN API is available here. This study is based on the previously published data by Staehlke and Nebe.[66]

Competing interests

The authors declare that they have no competing interests.

References

  1. 1.0 1.1 1.2 Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem et al. (1 December 2016). "The FAIR Guiding Principles for scientific data management and stewardship" (in en). Scientific Data 3 (1): 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC PMC4792175. PMID 26978244. http://www.nature.com/articles/sdata201618. 
  2. 2.0 2.1 2.2 Jacobsen, Annika; de Miranda Azevedo, Ricardo; Juty, Nick; Batista, Dominique; Coles, Simon; Cornet, Ronald; Courtot, Mélanie; Crosas, Mercè et al. (1 January 2020). "FAIR Principles: Interpretations and Implementation Considerations" (in en). Data Intelligence 2 (1-2): 10–29. doi:10.1162/dint_r_00024. ISSN 2641-435X. https://direct.mit.edu/dint/article/2/1-2/10-29/10017. 
  3. 3.0 3.1 3.2 3.3 Yu, Fangyu; Zhou, Beisi; Lu, Tun; Gu, Ning (2019), Sun, Yuqing; Lu, Tun; Xie, Xiaolan et al., eds., "Research on Data Provenance Model for Multidisciplinary Collaboration", Computer Supported Cooperative Work and Social Computing (Singapore: Springer Singapore) 917: 32–49, doi:10.1007/978-981-13-3044-5_3, ISBN 978-981-13-3043-8, http://link.springer.com/10.1007/978-981-13-3044-5_3. Retrieved 2022-04-01 
  4. Moreau, Luc; Groth, Paul (15 September 2013). "Provenance: An Introduction to PROV" (in en). Synthesis Lectures on the Semantic Web: Theory and Technology 3 (4): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007. ISSN 2160-4711. http://www.morganclaypool.com/doi/abs/10.2200/S00528ED1V01Y201308WBE007. 
  5. 5.0 5.1 5.2 Belhajjame, K.; B'Far, R.; Cheney, J. et al. (30 April 2013). "PROV-DM: The PROV Data Model". W3C. https://www.w3.org/TR/2013/REC-prov-dm-20130430/. 
  6. Hogan, Aidan; Blomqvist, Eva; Cochez, Michael; D’amato, Claudia; Melo, Gerard De; Gutierrez, Claudio; Kirrane, Sabrina; Gayo, José Emilio Labra et al. (31 May 2022). "Knowledge Graphs" (in en). ACM Computing Surveys 54 (4): 1–37. doi:10.1145/3447772. ISSN 0360-0300. https://dl.acm.org/doi/10.1145/3447772. 
  7. 7.0 7.1 7.2 Staehlke, Susanne; Koertge, Andreas; Nebe, Barbara (1 April 2015). "Intracellular calcium dynamics dependent on defined microtopographical features of titanium" (in en). Biomaterials 46: 48–57. doi:10.1016/j.biomaterials.2014.12.016. https://linkinghub.elsevier.com/retrieve/pii/S0142961214012666. 
  8. 8.0 8.1 8.2 Schröder, Max; Stählke, Susanne; Nebe, Barbara; Krüger, Frank (2020). "Towards in-situ knowledge acquisition for research data provenance from electronic lab notebooks" (in en). Proceedings of the 1st Workshop on Research Data Management for Linked Open Science (DaMaLOS) Co-located with 19th International Semantic Web Conference. doi:10.4126/FRL01-006423288. https://repository.publisso.de/resource/frl:6423288. 
  9. Carpi, Nicolas; Minges, Alexander; Piel, Matthieu (14 April 2017). "eLabFTW: An open source laboratory notebook for research labs". The Journal of Open Source Software 2 (12): 146. doi:10.21105/joss.00146. ISSN 2475-9066. http://joss.theoj.org/papers/10.21105/joss.00146. 
  10. 10.0 10.1 "RRID Portal". SciCrunch. 2021. https://scicrunch.org/resources. 
  11. Ram, Sudha; Liu, Jun (2007), Chen, Peter P.; Wong, Leah Y., eds., "Understanding the Semantics of Data Provenance to Support Active Conceptual Modeling", Active Conceptual Modeling of Learning (Berlin, Heidelberg: Springer Berlin Heidelberg) 4512: 17–29, doi:10.1007/978-3-540-77503-4_3, ISBN 978-3-540-77502-7, http://link.springer.com/10.1007/978-3-540-77503-4_3. Retrieved 2022-04-01 
  12. 12.0 12.1 12.2 Herschel, Melanie; Diestelkämper, Ralf; Ben Lahmar, Houssem (1 December 2017). "A survey on provenance: What for? What form? What from?" (in en). The VLDB Journal 26 (6): 881–906. doi:10.1007/s00778-017-0486-1. ISSN 1066-8888. http://link.springer.com/10.1007/s00778-017-0486-1. 
  13. 13.0 13.1 13.2 Lim, Chunhyeok; Lu, Shiyong; Chebotko, Artem; Fotouhi, Farshad (1 July 2010). "Prospective and Retrospective Provenance Collection in Scientific Workflow Environments". 2010 IEEE International Conference on Services Computing (Miami, FL, USA: IEEE): 449–456. doi:10.1109/SCC.2010.18. ISBN 978-1-4244-8147-7. http://ieeexplore.ieee.org/document/5557202/. 
  14. Belhajjame, Khalid; Wolstencroft, Katy; Corcho, Oscar; Oinn, Tom; Tanoh, Franck; William, Alan; Goble, Carole (1 May 2008). "Metadata Management in the Taverna Workflow System". 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) (Lyon, France: IEEE): 651–656. doi:10.1109/CCGRID.2008.17. http://ieeexplore.ieee.org/document/4534278/. 
  15. Altintas, Ilkay; Barney, Oscar; Jaeger-Frank, Efrat (2006), Moreau, Luc; Foster, Ian, eds., "Provenance Collection Support in the Kepler Scientific Workflow System", Provenance and Annotation of Data (Berlin, Heidelberg: Springer Berlin Heidelberg) 4145: 118–132, doi:10.1007/11890850_14, ISBN 978-3-540-46302-3, http://link.springer.com/10.1007/11890850_14. Retrieved 2022-04-01 
  16. Goecks, Jeremy; Nekrutenko, Anton; Taylor, James; Galaxy Team, The (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences" (in en). Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. ISSN 1465-6906. PMC PMC2945788. PMID 20738864. http://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-8-r86. 
  17. 17.0 17.1 Samuel, S.; König-Ries, B. (2018). "ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility" (PDF). Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International Semantic Web Conference 2180. http://ceur-ws.org/Vol-2180/paper-57.pdf. 
  18. 18.0 18.1 Soldatova, Larisa N; Nadis, Daniel; King, Ross D; Basu, Piyali S; Haddi, Emma; Baumlé, Véronique; Saunders, Nigel J; Marwan, Wolfgang et al. (1 December 2014). "EXACT2: the semantics of biomedical protocols" (in en). BMC Bioinformatics 15 (S14): S5. doi:10.1186/1471-2105-15-S14-S5. ISSN 1471-2105. PMC PMC4255744. PMID 25472549. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S14-S5. 
  19. 19.0 19.1 Giraldo Pasmín, O.X.; Corcho, O.; Castro, A.G. (2014). "SMART Protocols: seMAntic represenTation for experimental protocols". Linked Science 2014: Proceedings of the 4th Workshop on Linked Science 2014 - Making Sense Out of Data. 1282. pp. 36–47. ISSN 1613-0073. https://oa.upm.es/36778/. 
  20. Hughes, Gareth; Mills, Hugo; De Roure, David; Frey, Jeremy G.; Moreau, Luc; schraefel, m. c.; Smith, Graham; Zaluska, Ed (2004). "The semantic smart laboratory: a system for supporting the chemical eScientist" (in en). Organic & Biomolecular Chemistry 2 (22): 3284. doi:10.1039/b410075a. ISSN 1477-0520. http://xlink.rsc.org/?DOI=b410075a. 
  21. 21.0 21.1 Moreau, Luc; Batlajery, Belfrit Victor; Huynh, Trung Dong; Michaelides, Danius; Packer, Heather (1 February 2018). "A Templating System to Generate Provenance". IEEE Transactions on Software Engineering 44 (2): 103–121. doi:10.1109/TSE.2017.2659745. ISSN 0098-5589. https://ieeexplore.ieee.org/document/7909036/. 
  22. Curcin, Vasa; Fairweather, Elliot; Danger, Roxana; Corrigan, Derek (1 January 2017). "Templates as a method for implementing data provenance in decision support systems" (in en). Journal of Biomedical Informatics 65: 1–21. doi:10.1016/j.jbi.2016.10.022. https://linkinghub.elsevier.com/retrieve/pii/S1532046416301599. 
  23. Vogt, Lars; D'Souza, Jennifer; Stocker, Markus; Auer, Sören (1 August 2020). "Toward Representing Research Contributions in Scholarly Knowledge Graphs Using Knowledge Graph Cells" (in en). Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (Virtual Event China: ACM): 107–116. doi:10.1145/3383583.3398530. ISBN 978-1-4503-7585-6. https://dl.acm.org/doi/10.1145/3383583.3398530. 
  24. Samuel, Sheeba; König-Ries, Birgitta (2017), Blomqvist, Eva; Hose, Katja; Paulheim, Heiko et al., eds., "REPRODUCE-ME: Ontology-Based Data Access for Reproducibility of Microscopy Experiments", The Semantic Web: ESWC 2017 Satellite Events (Cham: Springer International Publishing) 10577: 17–20, doi:10.1007/978-3-319-70407-4_4, ISBN 978-3-319-70406-7, http://link.springer.com/10.1007/978-3-319-70407-4_4. Retrieved 2022-04-01 
  25. 25.0 25.1 25.2 Moreau, Luc; Groth, Paul; Cheney, James; Lebo, Timothy; Miles, Simon (1 December 2015). "The rationale of PROV" (in en). Journal of Web Semantics 35: 235–257. doi:10.1016/j.websem.2015.04.001. https://linkinghub.elsevier.com/retrieve/pii/S1570826815000177. 
  26. Ciccarese, Paolo; Soiland-Reyes, Stian; Belhajjame, Khalid; Gray, Alasdair JG; Goble, Carole; Clark, Tim (2013). "PAV ontology: provenance, authoring and versioning" (in en). Journal of Biomedical Semantics 4 (1): 37. doi:10.1186/2041-1480-4-37. ISSN 2041-1480. PMC PMC4177195. PMID 24267948. http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-37. 
  27. 27.0 27.1 The OBI Consortium; Smith, Barry; Ashburner, Michael; Rosse, Cornelius; Bard, Jonathan; Bug, William; Ceusters, Werner; Goldberg, Louis J et al. (1 November 2007). "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration" (in en). Nature Biotechnology 25 (11): 1251–1255. doi:10.1038/nbt1346. ISSN 1087-0156. PMC PMC2814061. PMID 17989687. http://www.nature.com/articles/nbt1346. 
  28. 28.0 28.1 28.2 Smith, B.; Kumar, A.; Bittner, T. (2005). "Basic Formal Ontology for Bioinformatics" (PDF). IFOMIS Reports. http://ontology.buffalo.edu/smith/articles/BFO_for_bioinformatics.pdf. 
  29. Bartusch, F.; Hanussek, M.; Krüger, J. (2018). Atkinson, M.; Gesing, S.. ed. "Automatic generation of provenance metadata during execution of scientific workflows" (PDF). Proceedings of the 10th International Workshop on Science Gateways 2357: 1–6. http://ceur-ws.org/Vol-2357/paper8.pdf. 
  30. Murta, Leonardo; Braganholo, Vanessa; Chirigati, Fernando; Koop, David; Freire, Juliana (2015), Ludäscher, Bertram; Plale, Beth, eds., "noWorkflow: Capturing and Analyzing Provenance of Scripts" (in en), Provenance and Annotation of Data and Processes (Cham: Springer International Publishing) 8628: 71–83, doi:10.1007/978-3-319-16462-5_6, ISBN 978-3-319-16461-8, http://link.springer.com/10.1007/978-3-319-16462-5_6. Retrieved 2022-04-01 
  31. Bose, Rajendra; Frew, James (1 March 2005). "Lineage retrieval for scientific data processing: a survey" (in en). ACM Computing Surveys 37 (1): 1–28. doi:10.1145/1057977.1057978. ISSN 0360-0300. https://dl.acm.org/doi/10.1145/1057977.1057978. 
  32. Davidson, Susan B.; Freire, Juliana (2008). "Provenance and scientific workflows: challenges and opportunities" (in en). Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD '08 (Vancouver, Canada: ACM Press): 1345. doi:10.1145/1376616.1376772. ISBN 978-1-60558-102-6. http://portal.acm.org/citation.cfm?doid=1376616.1376772. 
  33. Deelman, Ewa; Gannon, Dennis; Shields, Matthew; Taylor, Ian (1 May 2009). "Workflows and e-Science: An overview of workflow system features and capabilities" (in en). Future Generation Computer Systems 25 (5): 528–540. doi:10.1016/j.future.2008.06.012. https://linkinghub.elsevier.com/retrieve/pii/S0167739X08000861. 
  34. Budde, Kai; Zimmermann, Julius; Neuhaus, Elisa; Schröder, Max; Uhrmacher, Adelinde M.; van Rienen, Ursula (1 July 2019). "Requirements for Documenting Electrical Cell Stimulation Experiments for Replicability and Numerical Modeling". 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (Berlin, Germany: IEEE): 1082–1088. doi:10.1109/EMBC.2019.8856863. ISBN 978-1-5386-1311-5. https://ieeexplore.ieee.org/document/8856863/. 
  35. MIACA Standards Initiative (2006). "MIACA - Minimum Information About a Cellular Assay". SourceForge. http://miaca.sourceforge.net/. 
  36. Rasmussen, Karsten Boye; Blank, Grant (1 March 2007). "The data documentation initiative: a preservation standard for research" (in en). Archival Science 7 (1): 55–71. doi:10.1007/s10502-006-9036-0. ISSN 1389-0166. http://link.springer.com/10.1007/s10502-006-9036-0. 
  37. Weibel, S.; Kunze, J.; Lagoze, C.; Wolf, M. (1 September 1998) (in en). Dublin Core Metadata for Resource Discovery. pp. RFC2413. doi:10.17487/rfc2413. https://www.rfc-editor.org/info/rfc2413. 
  38. Kunze, J.; Baker, T. (1 August 2007) (in en). The Dublin Core Metadata Element Set. pp. RFC5013. doi:10.17487/rfc5013. https://www.rfc-editor.org/info/rfc5013. 
  39. Albertoni, R.; Browning, D.; Cox, S. et al. (4 February 2020). "Data Catalog Vocabulary (DCAT) - Version 2". W3C. https://www.w3.org/TR/2020/REC-vocab-dcat-2-20200204/. 
  40. Kunis, S. (22 October 2021). "Workgroup RDM4mic - Research data management for microscopy". Zenodo. doi:10.5281/zenodo.5591958. https://zenodo.org/record/5591958. 
  41. Buchanan, Erin M.; Crain, Sarah E.; Cunningham, Ari L.; Johnson, Hannah R.; Stash, Hannah; Papadatou-Pastou, Marietta; Isager, Peder M.; Carlsson, Rickard et al. (1 January 2021). "Getting Started Creating Data Dictionaries: How to Create a Shareable Data Set" (in en). Advances in Methods and Practices in Psychological Science 4 (1): 251524592092800. doi:10.1177/2515245920928007. ISSN 2515-2459. http://journals.sagepub.com/doi/10.1177/2515245920928007. 
  42. Rashid, Sabbir M.; McCusker, James P.; Pinheiro, Paulo; Bax, Marcello P.; Santos, Henrique O.; Stingone, Jeanette A.; Das, Amar K.; McGuinness, Deborah L. (1 October 2020). "The Semantic Data Dictionary – An Approach for Describing and Annotating Data" (in en). Data Intelligence 2 (4): 443–486. doi:10.1162/dint_a_00058. ISSN 2641-435X. PMC PMC7583433. PMID 33103120. https://direct.mit.edu/dint/article/2/4/443-486/94892. 
  43. Kunze, J.; Littman, J.; Madden, E.; Scancella, J.; Adams, C. (1 October 2018) (in en). The BagIt File Packaging Format (V1.0). pp. RFC8493. doi:10.17487/rfc8493. https://www.rfc-editor.org/info/rfc8493. 
  44. Hankinson, Andrew; Brower, Donald; Jefferies, Neil; Metz, Rosalyn; Morley, Julian; Warner, Simeon; Woods, Andrew (4 June 2019). "The Oxford Common File Layout: A Common Approach to Digital Preservation" (in en). Publications 7 (2): 39. doi:10.3390/publications7020039. ISSN 2304-6775. https://www.mdpi.com/2304-6775/7/2/39. 
  45. Carragáin, Eoghan Ó; Goble, Carole; Sefton, Peter; Soiland-Reyes, Stian (20 June 2019). A lightweight approach to research object data packaging. doi:10.5281/ZENODO.3250687. https://zenodo.org/record/3250687. 
  46. Chard, Kyle; Gaffney, Niall; Jones, Matthew B.; Kowalik, Kacper; Ludascher, Bertram; McPhillips, Timothy; Nabrzyski, Jarek; Stodden, Victoria et al. (1 September 2019). "Application of BagIt-Serialized Research Object Bundles for Packaging and Re-Execution of Computational Analyses". 2019 15th International Conference on eScience (eScience) (San Diego, CA, USA: IEEE): 514–521. doi:10.1109/eScience.2019.00068. ISBN 978-1-7281-2451-3. https://ieeexplore.ieee.org/document/9041738/. 
  47. Musen, Mark A. (16 June 2015). "The protégé project: a look back and a look forward" (in en). AI Matters 1 (4): 4–12. doi:10.1145/2757001.2757003. ISSN 2372-3483. PMC 4883684. PMID 27239556. https://dl.acm.org/doi/10.1145/2757001.2757003. 
  48. National Center for Biomedical Ontology (2021). "BioPortal". Board of Trustees of Leland Stanford Junior University. https://bioportal.bioontology.org/. 
  49. Ong, Edison; Xiang, Zuoshuang; Zhao, Bin; Liu, Yue; Lin, Yu; Zheng, Jie; Mungall, Chris; Courtot, Mélanie et al. (4 January 2017). "Ontobee: A linked ontology data server to support ontology term dereferencing, linkage, query and integration". Nucleic Acids Research 45 (D1): D347–D352. doi:10.1093/nar/gkw918. ISSN 1362-4962. PMC 5210626. PMID 27733503. https://pubmed.ncbi.nlm.nih.gov/27733503. 
  50. Vrandečić, Denny; Krötzsch, Markus (23 September 2014). "Wikidata: a free collaborative knowledgebase" (in en). Communications of the ACM 57 (10): 78–85. doi:10.1145/2629489. ISSN 0001-0782. https://dl.acm.org/doi/10.1145/2629489. 
  51. Hyland, B.; Atemezing, G.; Pendleton, M. et al., ed. (27 June 2013). "Linked Data Glossary". W3C. https://dvcs.w3.org/hg/gld/raw-file/default/glossary/index.html. 
  52. Heath, Tom; Bizer, Christian (9 February 2011). "Linked Data: Evolving the Web into a Global Data Space" (in en). Synthesis Lectures on the Semantic Web: Theory and Technology 1 (1): 1–136. doi:10.2200/S00334ED1V01Y201102WBE001. ISSN 2160-4711. http://www.morganclaypool.com/doi/abs/10.2200/S00334ED1V01Y201102WBE001. 
  53. Gkoutos, G. V.; Schofield, P. N.; Hoehndorf, R. (10 October 2012). "The Units Ontology: a tool for integrating units of measurement in science" (in en). Database 2012 (0): bas033. doi:10.1093/database/bas033. ISSN 1758-0463. PMC 3468815. PMID 23060432. https://academic.oup.com/database/article-lookup/doi/10.1093/database/bas033. 
  54. The Apache Software Foundation (2021). "Apache Jena Fuseki". https://jena.apache.org/documentation/fuseki2/index.html. 
  55. Gremse, M.; Chang, A.; Schomburg, I.; Grote, A.; Scheer, M.; Ebeling, C.; Schomburg, D. (1 January 2011). "The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources" (in en). Nucleic Acids Research 39 (Database): D507–D513. doi:10.1093/nar/gkq968. ISSN 0305-1048. PMC 3013802. PMID 21030441. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkq968. 
  56. Degtyarenko, K.; de Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcantara, R.; Darsow, M. et al. (23 December 2007). "ChEBI: a database and ontology for chemical entities of biological interest" (in en). Nucleic Acids Research 36 (Database): D344–D350. doi:10.1093/nar/gkm791. ISSN 0305-1048. PMC 2238832. PMID 17932057. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkm791. 
  57. Sarntivijai, Sirarat; Lin, Yu; Xiang, Zuoshuang; Meehan, Terrence F; Diehl, Alexander D; Vempati, Uma D; Schürer, Stephan C; Pang, Chao et al. (2014). "CLO: The cell line ontology" (in en). Journal of Biomedical Semantics 5 (1): 37. doi:10.1186/2041-1480-5-37. ISSN 2041-1480. PMC 4387853. PMID 25852852. http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-5-37. 
  58. Bandrowski, Anita; Brinkman, Ryan; Brochhausen, Mathias; Brush, Matthew H.; Bug, Bill; Chibucos, Marcus C.; Clancy, Kevin; Courtot, Mélanie et al. (29 April 2016). Xue, Yu. ed. "The Ontology for Biomedical Investigations" (in en). PLOS ONE 11 (4): e0154556. doi:10.1371/journal.pone.0154556. ISSN 1932-6203. PMC 4851331. PMID 27128319. https://dx.plos.org/10.1371/journal.pone.0154556. 
  59. Brickley, D.; Miller, L. (14 January 2014). "FOAF Vocabulary Specification 0.99". xmlns.com. http://xmlns.com/foaf/spec/. 
  60. Conlon, M. (2021). "Research Organization Registry". https://ror.org/. 
  61. Kulkarni, Chaitanya; Xu, Wei; Ritter, Alan; Machiraju, Raghu (2018). "An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols" (in en). Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (New Orleans, Louisiana: Association for Computational Linguistics): 97–106. doi:10.18653/v1/N18-2016. http://aclweb.org/anthology/N18-2016. 
  62. Harper, C.; Cox, J.; Daniel, R. et al. (1 February 2021). "Welcome to MeasEval: Counts and Measurements!". CodaLab. https://competitions.codalab.org/competitions/25770. 
  63. Soiland-Reyes, S.; Klyne, G. (17 October 2017). "Minim model for defining checklists". RO-manager on GitHub. https://github.com/wf4ever/ro-manager/blob/master/Minim/Minim-description.md. 
  64. Schröder, Max; LeBlanc, Hayley; Spors, Sascha; Krüger, Frank (25 February 2020). "Intra-consortia data sharing platforms for interdisciplinary collaborative research projects" (in en). it - Information Technology 62 (1): 19–28. doi:10.1515/itit-2019-0039. ISSN 2196-7032. https://www.degruyter.com/document/doi/10.1515/itit-2019-0039/html. 
  65. Mons, Barend; Neylon, Cameron; Velterop, Jan; Dumontier, Michel; da Silva Santos, Luiz Olavo Bonino; Wilkinson, Mark D. (7 March 2017). "Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud". Information Services & Use 37 (1): 49–56. doi:10.3233/ISU-170824. https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/ISU-170824. 
  66. Staehlke, Susanne; Nebe, J. Barbara (10 June 2021). "Research data of Calcium Imaging after electrical stimulation" (in en). Zenodo. doi:10.5281/zenodo.4923173. https://zenodo.org/record/4923173. Retrieved 2022-04-01. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. To more easily differentiate footnotes from references, the original footnotes (which were numbered) were updated to use lowercase letters. Most footnotes referencing web pages were turned into proper citations.