Revision as of 19:23, 31 January 2021
Full article title | Making data and workflows findable for machines |
---|---|
Journal | Data Intelligence |
Author(s) | Weigel, Tobias; Schwardmann, Ulrich; Klump, Jens; Bendoukha, Sofiane; Quick, Robert |
Author affiliation(s) | Deutsches Klimarechenzentrum, Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, CSIRO, Indiana University Bloomington |
Primary contact | Email: weigel at dkrz dot de |
Year published | 2020 |
Volume and issue | 2(1–2) |
Page(s) | 40-46 |
DOI | 10.1162/dint_a_00026 |
ISSN | 2641-435X |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://www.mitpressjournals.org/doi/full/10.1162/dint_a_00026 |
Download | https://www.mitpressjournals.org/doi/pdf/10.1162/dint_a_00026 (PDF) |
This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed. |
Abstract
Research data management currently faces a sharp increase in the number of data objects, along with a growing variety of types (data types, formats) and of workflows by which data infrastructures must manage objects across their lifecycle. Researchers want to shorten the workflows from data generation to analysis and publication, and the full workflow needs to become transparent to multiple stakeholders, including research administrators and funders. This poses challenges for research infrastructures and user-oriented data services: they must not only make data and workflows findable, accessible, interoperable, and reusable (FAIR), but also do so in a way that leverages machine support for better efficiency. One primary need yet to be addressed is findability, and achieving better findability benefits other aspects of data and workflow management. In this article, we describe how machine capabilities can be extended to make workflows more findable, in particular by leveraging the Digital Object Architecture, common object operations, and machine learning techniques.
Keywords: findability, workflows, automation, FAIR data, data infrastructures, data services
Introduction
In several scientific disciplines, the number, size, and variety of data objects to be managed are growing. Examples of particular interest to the challenges discussed in this article include climate modeling[1], geophysics[2], and “omics”-based scientific approaches.[3] The supporting data infrastructures and services are challenged to offer adequate solutions, and researchers are looking toward increased automation in their processes to cope with these demands. Aspects of automation are intrinsic to making data and workflows findable, accessible, interoperable, and reusable according to the FAIR guiding principles.[4] This article highlights the steps required to automatically identify data objects, associate them with metadata, and make both those data and the processes that generated them more findable. Persistent identifiers, machine processes with autonomous decision-making capability, and machine-actionable metadata are critical elements for practical solutions.
The motivation comes from the increased interest of researchers and funders in not only making available those data that underpin analysis in scientific publications, but also giving insight into the generative history of these data as they were generated, processed, analyzed, and eventually published. Readers wish to investigate the provenance of data underlying publications, gaining access to contextual information on data in the provenance graph and on workflows or individual data processing steps. In this article, we investigate how such information can be aggregated and leveraged to improve the general findability of data and the workflows that produce them, improving the quality of information that search catalogs such as B2FIND or the CSIRO Data Access Portal can depend upon. The potential next step, enabling machines to find resources automatically as part of orchestration, will only be touched upon marginally. Concerning aggregation for findability, the article highlights key requirements and elements of possible solutions that can inform future work.
Researchers who work with data are also interested in making their workflows more efficient. This includes shortening the time from data production to analysis, but also short-cutting workflows, for example by using in-situ visualization in a high-performance computing (HPC) workflow to detect errors while a computing run is still in progress and to restart the process quickly with modified parameters. Another important usage trend is the wish of users to work with data at higher levels of abstraction. Researchers are increasingly relying on tools such as Jupyter Notebook and standard software libraries to deal with issues of data access and management, giving rise to the wider adoption of virtual research environments (VREs).[5][6] It is much more efficient to let researchers focus on the scientific questions surrounding data analysis and to reduce the resources they spend on data management and access. This is part of a larger cultural change that has a wide impact on the evolution of data services, and improving findability is a key concern.
A key capability necessary to support future scenarios is, therefore, support at the data infrastructure level for better automation of the processes dealing with data and workflows. Out of the many possible facets of this challenge that could be derived from the FAIR principles, we focus on the automation of findability (principles F1-F3), emphasizing that identifiers are a foundational element from which the other principles must follow.[7] A key question is: How can automated processes help to make more data and workflows findable, particularly from early research workflow stages? In this article, we understand an automated process as one that is capable of limited, autonomous decision-making. Such decision-making is driven by rule systems specified by humans, which could, in a later evolution, be replaced by machine learning.
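To make the notion of an automated process driven by human-specified rules more concrete, the following is a minimal Python sketch, not a method taken from the article. Everything in it is an assumption made for illustration: the `DataObject` class, the rule list, the `example-prefix/` identifier scheme, and the use of a content hash as a stand-in for an identifier. A real infrastructure would register a persistent identifier with a resolver service rather than derive one locally.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DataObject:
    """Hypothetical data object: a path, its bytes, and attached metadata."""
    path: str
    content: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


def assign_identifier(obj: DataObject) -> None:
    # Stand-in for minting a persistent identifier: a truncated content hash
    # under a made-up prefix. This is NOT a resolvable identifier.
    digest = hashlib.sha256(obj.content).hexdigest()[:16]
    obj.metadata["identifier"] = f"example-prefix/{digest}"


def tag_netcdf(obj: DataObject) -> None:
    # Record a format hint that a search catalog could later index.
    obj.metadata["format"] = "NetCDF"


# Each rule pairs a human-specified condition with an action.
Rule = Tuple[Callable[[DataObject], bool], Callable[[DataObject], None]]

RULES: List[Rule] = [
    (lambda o: "identifier" not in o.metadata, assign_identifier),
    (lambda o: o.path.endswith(".nc"), tag_netcdf),
]


def process(obj: DataObject, rules: List[Rule] = RULES) -> DataObject:
    """Apply every rule whose condition holds; the 'decision' is limited
    to choosing which predefined actions fire for this object."""
    for condition, action in rules:
        if condition(obj):
            action(obj)
    return obj


if __name__ == "__main__":
    result = process(DataObject("run1/tas.nc", b"example bytes"))
    print(result.metadata)
```

In this sketch the autonomy is deliberately narrow: the process only selects among actions that a human wrote down in advance. Replacing the rule conditions with a trained classifier would correspond to the later evolution toward machine learning mentioned above.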
Essential requirements for automating data and workflow findability for machines
References
- ↑ Balaji, V.; Taylor, K.E.; Juckes, M. et al. (2018). "Requirements for a global data infrastructure in support of CMIP6". Geoscientific Model Development 11 (9): 3659–3680. doi:10.5194/gmd-11-3659-2018.
- ↑ Squire, G.; Wu, M.; Friedrich, C. et al. (2018). "IN43C-0903: Scientific Software Solution Centre for Discovering, Sharing and Reusing Research Software". Proceedings from the 2018 AGU Fall Meeting. https://agu.confex.com/agu/fm18/meetingapp.cgi/Paper/459873.
- ↑ Goble, C.; Cohen-Boulakia, S.; Soiland-Reyes, S. et al. (2020). "FAIR Computational Workflows". Data Intelligence 2 (1–2): 108–21. doi:10.1162/dint_a_00033.
- ↑ Mons, B.; Neylon, C.; Velterop, J. et al. (2017). "Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud". Information Services & Use 37 (1): 49–56. doi:10.3233/ISU-170824.
- ↑ Wyborn, L.A.; Fraser, R.; Evans, B.J.K. et al. (2017). "ED32B-03: Building a Generic Virtual Research Environment Framework for Multiple Earth and Space Science Domains and a Diversity of Users". Proceedings from the 2017 AGU Fall Meeting. https://agu.confex.com/agu/fm17/meetingapp.cgi/Paper/293857.
- ↑ Barker, M.; Olabarriaga, S.D.; Wilkins-Diehr, N. et al. (2019). "The global impact of science gateways, virtual research environments and virtual laboratories". Future Generation Computer Systems 95: 240–48. doi:10.1016/j.future.2018.12.026.
- ↑ Juty, N.; Wimalaratne, S.M.; Soiland-Reyes, S. et al. (2020). "Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR". Data Intelligence 2 (1–2): 30–39. doi:10.1162/dint_a_00025.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.