Journal:The FAIR Guiding Principles for scientific data management and stewardship

Full article title	The FAIR Guiding Principles for scientific data management and stewardship
Journal	Scientific Data
Author(s)	Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; Bouwman, J.; Brookes, A.J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, S.; Evelo, C.T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A.J.; Groth, P.; Goble, C.; Grethe, J.S.; Heringa, J.; 't Hoen, P.A.; Hooft, R.; Kuhn, T.; Kok, R.; Kok, J.; Lusher, S.J.; Martone, M.E.; Mons, A.; Packer, A.L.; Persson, B.; Rocca-Serra, P.; Roos, M.; van Schaik, R.; Sansone, S.A.; Schultes, E.; Sengstag, T.; Slater, T.; Strawn, G.; Swertz, M.A.; Thompson, M.; van der Lei, J.; van Mulligen, E.; Velterop, J.; Waagmeester, A.; Wittenburg, P.; Wolstencroft, K.; Zhao, J.; Mons, B.
Author affiliation(s)	Universidad Politécnica de Madrid, Stanford University, Nature Genetics, Euretos and Phortos Consultants, Wellcome Genome Campus, Lygature, Vrije Universiteit Amsterdam, National Institutes of Health, TNO, University of Leicester, Harvard Medical School, Harvard University, Data Archiving and Networked Services at The Hague, Beijing Genomics Institute, Maastricht University, Wageningen UR Plant Breeding, University of Oxford, Heriot-Watt University, University of Manchester, University of California San Diego, Dutch Techcenter for the Life Sciences, VU University Amsterdam, Leiden University, Netherlands eScience Center, National Center for Microscopy and Imaging Research, Phortos Consultants, UNIFESP Foundation, Uppsala University, Leiden University Medical Center, Bayer CropScience, University of Basel, Cray, University of Groningen, Erasmus MC - Rotterdam, Independent Open Access and Open Science Advocate, Micelio, Max Planck Compute and Data Facility, Dutch TechCenter for Life Sciences
Primary contact	E-mail: Barend Mons
Year published	2016
Volume and issue	3
Page(s)	160018
DOI	10.1038/sdata.2016.18
ISSN	2052-4463
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.nature.com/articles/sdata201618
Download	https://www.nature.com/articles/sdata201618.pdf (PDF)

Abstract

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This comment article represents the first formal publication of the FAIR Principles, and it includes the rationale behind them as well as some exemplar implementations in the community.

Comment

Supporting discovery through good data management

Good data management is not a goal in itself but rather the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process. Unfortunately, the existing digital ecosystem surrounding scholarly data publication prevents us from extracting maximum benefit from our research investments (e.g., Roche et al.^[1]). Partially in response to this, science funders, publishers and governmental agencies are beginning to require data management and stewardship plans for data generated in publicly funded experiments. Beyond proper collection, annotation, and archival purposes, data stewardship includes the notion of "long-term care" of valuable digital assets, with the goal that they should be discovered and re-used for downstream investigations, either alone, or in combination with newly generated data. The outcomes from good data management and stewardship, therefore, are high-quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation, and reuse in downstream studies. What constitutes "good data management" is, however, largely undefined, and is generally left as a decision for the data or repository owner. Therefore, bringing some clarity around the goals and desiderata of good data management and stewardship, and defining simple guideposts to inform those who publish and/or preserve scholarly data, would be of great utility.

This article describes four foundational principles — findability, accessibility, interoperability, and reusability — that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to "data" in the conventional sense, but also to the algorithms, tools, and workflows that led to that data. All scholarly digital research objects^[2] — from data to analytical pipelines — benefit from application of these principles, since all components of the research process must be available to ensure transparency, reproducibility, and reusability.

There are numerous and diverse stakeholders who stand to benefit from overcoming these obstacles: researchers wanting to share, get credit for, and reuse each other’s data and interpretations; professional data publishers offering their services; software and tool-builders providing data analysis and processing services such as reusable workflows; funding agencies (private and public) increasingly concerned with long-term data stewardship; and a data science community mining, integrating, and analyzing new and existing data to advance discovery. To facilitate the reading of this manuscript by these diverse stakeholders, we provide definitions for common abbreviations in Box 1. Humans, however, are not the only critical stakeholders in the milieu of scientific data. Similar problems are encountered by the applications and computational agents that we task to undertake data retrieval and analysis on our behalf. These "computational stakeholders" are increasingly relevant, and they demand as much, or more, attention as their importance grows. One of the grand challenges of data-intensive science, therefore, is to improve knowledge discovery through assisting both humans and their computational agents in the discovery of, access to, and integration and analysis of task-appropriate scientific data and other scholarly digital objects.

Box 1: Terms and abbreviations
BD2K: Big Data 2 Knowledge, a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximise community engagement.
DOI: Digital Object Identifier, a code used to permanently and stably identify (usually digital) objects; DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.
FAIR: Findable, Accessible, Interoperable, Reusable.
FORCE11: The Future of Research Communications and e-Scholarship, a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing; initiated in 2011.
Interoperability: The ability of data or tools from non-cooperating resources to integrate or work together with minimal effort.
JDDCP: Joint Declaration of Data Citation Principles, acknowledging data as a first-class research output and supporting good research practices around data re-use; JDDCP proposes a set of guiding principles for citation of data within scholarly literature, another dataset, or any other research object.
RDF: Resource Description Framework, a globally-accepted framework for data and knowledge representation that is intended to be read and interpreted by machines.

For certain types of important digital objects, there are well-curated, deeply integrated, special-purpose repositories such as GenBank^[3], worldwide Protein Data Bank (wwPDB)^[4], and UniProt^[5] in the life sciences, and the Space Physics Data Facility (SPDF; http://spdf.gsfc.nasa.gov/) and the Set of Identifications, Measurements and Bibliography for Astronomical Data (SIMBAD)^[6] in the space sciences.

These foundational and critical core resources are continuously curating and capturing high-value reference datasets and fine-tuning them to enhance scholarly output, provide support for both human and mechanical users, and provide extensive tooling to access their content in rich, dynamic ways. However, not all datasets or even data types can be captured by, or submitted to, these repositories. Many important datasets emerging from traditional, low-throughput bench science do not fit the data models of these special-purpose repositories, yet these datasets are no less important with respect to integrative research, reproducibility, and reuse in general. Apparently in response to this, we see the emergence of numerous general-purpose data repositories, at scales ranging from institutional (for example, a single university) to open, globally-scoped repositories such as Dataverse^[7], FigShare (http://figshare.com), Dryad^[8], Mendeley Data (https://data.mendeley.com/), Zenodo (http://zenodo.org/), DataHub (http://datahub.io), DANS (http://www.dans.knaw.nl/), and EUDAT^[9]. Such repositories accept a wide range of data types in a wide variety of formats, generally do not attempt to integrate or harmonize the deposited data, and place few restrictions (or requirements) on the descriptors of the data deposition. The resulting data ecosystem, therefore, appears to be moving away from centralization and is becoming more diverse and less integrated, thereby exacerbating the discovery and re-usability problem for both human and computational stakeholders.

A specific example of these obstacles could be imagined in the domain of gene regulation and expression analysis. Suppose a researcher has generated a dataset of differentially selected polyadenylation sites in a non-model pathogenic organism grown under a variety of environmental conditions that stimulate its pathogenic state. The researcher is interested in comparing the alternatively polyadenylated genes in this local dataset to other examples of alternative polyadenylation as well as the expression levels of these genes — both in this organism and related model organisms — during the infection process. Given that there is no special-purpose archive for differential polyadenylation data and no model organism database for this pathogen, where does the researcher begin?

We will consider the current approach to this problem from a variety of data discovery and integration perspectives. If the desired datasets existed, where might they have been published, and how would one begin to search for them, using what search tools? The desired search would need to filter based on specific species, tissues, types of data (Poly-A, microarray, NGS), conditions (infection), and genes; is that information ("metadata") captured by the repositories, and if so, what format is it in, is it searchable, and how? Once the data is discovered, can it be downloaded? In what format(s)? Can that format be easily integrated with private in-house data (the local dataset of alternative polyadenylation sites) as well as other data publications from third parties and with the community’s core gene/protein data repositories? Can this integration be done automatically to save time and avoid copy/paste errors? Does the researcher have permission to use the data from these third-party researchers, under what license conditions, and who should be cited if a data-point is reused?
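The search described above can be made concrete with a minimal sketch, assuming a repository exposes its metadata as simple key-value records. The record fields (`species`, `data_type`, `condition`, `format`, `license`), the dataset identifiers, and the `search` helper are all hypothetical and chosen only for illustration; they do not correspond to any particular repository's schema or API.

```python
# Hypothetical metadata records, as a repository might expose them.
# All field names and values are illustrative assumptions.
records = [
    {"id": "ds1", "species": "species-A", "data_type": "Poly-A",
     "condition": "infection", "format": "text/csv", "license": "CC-BY-4.0"},
    # ds2 does not record the growth condition at all.
    {"id": "ds2", "species": "species-A", "data_type": "Poly-A",
     "format": "text/csv"},
    {"id": "ds3", "species": "species-B", "data_type": "microarray",
     "condition": "infection", "format": "text/csv", "license": "CC0-1.0"},
]


def search(records, **criteria):
    """Return records whose metadata match every criterion.

    A record that lacks one of the requested fields can never match,
    which is how missing metadata makes a dataset undiscoverable.
    """
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


hits = search(records, data_type="Poly-A", condition="infection")
hit_ids = [r["id"] for r in hits]
```

In this sketch `ds2` may well contain relevant data, but because its growth condition was never captured as metadata, a filtered search cannot find it. The same holds for the license question: a record without a `license` field leaves the researcher unable to tell whether reuse is permitted.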

Questions such as these highlight some of the barriers to data discovery and reuse, not only for humans, but even more so for machines; yet it is precisely these kinds of deeply and broadly integrative analyses that constitute the bulk of contemporary e-Science. The reason that we often need several weeks (or months) of specialist technical effort to gather the data necessary to answer such research questions is not the lack of appropriate technology; the reason is that we do not pay our valuable digital objects the careful attention they deserve when we create and preserve them. Overcoming these barriers, therefore, necessitates that all stakeholders (including researchers and both special-purpose and general-purpose repositories) evolve to meet the emergent challenges described above. The goal is for scholarly digital objects of all kinds to become "first class citizens" in the scientific publication ecosystem, where the quality of the publication, and more importantly its impact, is a function of its ability to be accurately and appropriately found, re-used, and cited over time, by all stakeholders, both human and mechanical.

With this goal in mind, a workshop was held in Leiden, Netherlands, in 2014, named Jointly Designing a Data Fairport. This workshop brought together a wide group of academic and private stakeholders, all of whom had an interest in overcoming data discovery and reuse obstacles. From the deliberations at the workshop the notion emerged that, through the definition of and widespread support for a minimal set of community-agreed guiding principles and practices, all stakeholders could more easily discover, access, appropriately integrate and re-use, and adequately cite the vast quantities of information being generated by contemporary data-intensive science. The meeting concluded with a draft formulation of a set of foundational principles that were subsequently elaborated in greater detail: namely, that all research objects should be findable, accessible, interoperable and reusable (FAIR) both for machines and for people. These are now referred to as the FAIR Guiding Principles. Subsequently, a dedicated FAIR working group, established by several members of the FORCE11 community^[10], fine-tuned and improved the Principles. The results of these efforts are reported here.

The significance of machines in data-rich research environments

The emphasis on applying FAIRness to both human-driven and machine-driven activities is a specific focus of the FAIR Guiding Principles that distinguishes them from many peer initiatives (discussed in the subsequent section). Humans and machines often face distinct barriers when attempting to find and process data on the web. Humans have an intuitive sense of "semantics" (the meaning or intent of a digital object) because we are capable of identifying and interpreting a wide variety of contextual cues, whether those take the form of structural/visual/iconic cues in the layout of a web page, or the content of narrative notes. As such, we are less likely to make errors in the selection of appropriate data or other digital objects, although humans will face similar difficulties if sufficient contextual metadata is lacking. The primary limitation of humans, however, is that we are unable to operate at the scope, scale, and speed necessitated by the volume of contemporary scientific data and the complexity of e-Science. It is for this reason that humans increasingly rely on computational agents to undertake discovery and integration tasks on their behalf. This requires machines to be capable of acting autonomously and appropriately when faced with the wide range of types, formats, and access-mechanisms/protocols that will be encountered during their self-guided exploration of the global data ecosystem. It also requires that the machines keep an exquisite record of provenance such that the data they are collecting can be accurately and adequately cited. Assisting these agents, therefore, is a critical consideration for all participants in the data management and stewardship process, from researchers and data producers to data repository hosts.

Throughout this paper, we use the phrase "machine actionable" to indicate a continuum of possible states wherein a digital object provides increasingly more detailed information to an autonomously acting, computational data explorer. This information enables the agent — to a degree dependent on the amount of detail provided — to have the capacity, when faced with a digital object never encountered before, to: a) identify the type of object (with respect to both structure and intent); b) determine if it is useful within the context of the agent’s current task by interrogating metadata and/or data elements; c) determine if it is usable, with respect to license, consent, or other accessibility or use constraints; and d) take appropriate action, in much the same manner that a human would.
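The four capacities (a) to (d) can be read as successive checks an agent applies to an object it has never seen. The following is a minimal sketch under stated assumptions: the `DigitalObject` fields, the sets of known formats and permitted licenses, and the `assess` function are hypothetical illustrations, not part of the FAIR Principles or of any existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DigitalObject:
    """Hypothetical, simplified metadata for a discovered digital object."""
    identifier: str
    media_type: Optional[str] = None  # structure, e.g. "text/csv"
    topic: Optional[str] = None       # intent, e.g. "polyadenylation"
    license: Optional[str] = None     # e.g. "CC-BY-4.0"


# Assumed capabilities and constraints of this particular agent.
KNOWN_FORMATS = {"text/csv", "application/rdf+xml"}
PERMITTED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}


def assess(obj: DigitalObject, task_topic: str) -> List[str]:
    """Return the steps (a to d) the agent can complete for this object."""
    completed: List[str] = []
    # a) identify the type of object, with respect to structure and intent
    if obj.media_type is None or obj.topic is None:
        return completed
    completed.append("a:identified")
    # b) determine whether it is useful for the current task
    if obj.topic != task_topic:
        return completed
    completed.append("b:useful")
    # c) determine whether it is usable under its license
    if obj.license not in PERMITTED_LICENSES:
        return completed
    completed.append("c:usable")
    # d) take appropriate action, possible here only for formats the agent can parse
    if obj.media_type in KNOWN_FORMATS:
        completed.append("d:actionable")
    return completed


full = assess(
    DigitalObject("obj-1", "text/csv", "polyadenylation", "CC-BY-4.0"),
    "polyadenylation",
)
no_license = assess(
    DigitalObject("obj-2", "text/csv", "polyadenylation"),
    "polyadenylation",
)
```

The length of the returned list is one crude way to picture the continuum: the more detail an object provides, the further along the list an agent can proceed without human help.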

For example, a machine may be capable of determining the data-type of a discovered digital object, but not capable of parsing it due to it being in an unknown format; or it may be capable of processing the contained data, but not capable of determining the licensing requirements related to the retrieval and/or use of that data. The optimal state, in which machines fully "understand" a digital object and can autonomously and correctly operate on it, may rarely be achieved. Nevertheless, the FAIR principles provide "steps along a path" toward machine-actionability; adopting the FAIR principles, in whole or in part, leads the resource along the continuum towards this optimal state. In addition, the idea of being machine-actionable applies in two contexts: first, when referring to the contextual metadata surrounding a digital object ("what is it?"), and second, when referring to the content of the digital object itself ("how do I process it/integrate it?"). Either or both of these may be machine-actionable, and each forms its own continuum of actionability.

Finally, we wish to draw a distinction between data that is machine-actionable as a result of specific investment in software supporting that data-type (for example, bespoke parsers that understand life science wwPDB files or space science Space Physics Archive Search and Extract (SPASE) files) and data that is machine-actionable exclusively through the utilization of general-purpose, open technologies. To reiterate the earlier point: ultimate machine-actionability occurs when a machine can make a useful decision regarding data that it has not encountered before. This distinction is important when considering both (a) the rapidly growing and evolving data environment, with new technologies and new, more complex data-types continuously being developed, and (b) the growth of general-purpose repositories, where the data-types likely to be encountered by an agent are unpredictable. Creating bespoke parsers, in all computer languages for all data-types and all analytical tools that require those data-types, is not a sustainable activity. As such, the focus on assisting machines in their discovery and exploration of data through application of more generalized interoperability technologies and standards at the data/repository level becomes a top-level priority for good data stewardship.

References

↑ Roche, D.G.; Kruuk, L.E.; Lanfear, R.; Binning, S.A. (2015). "Public data archiving in ecology and evolution: How well are we doing?". PLOS Biology 13: e1002295. doi:10.1371/journal.pbio.1002295. PMC PMC4640582. PMID 26556502. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640582.
↑ Bechhofer, S.; De Roure, D.; Gamble, M. et al. (2010). "Research objects: Towards exchange and reuse of digital knowledge". Nature Precedings. doi:10.1038/npre.2010.4626.1.
↑ Benson, D.A.; Cavanaugh, M.; Clark, K. et al. (2013). "GenBank". Nucleic Acids Research 41 (D1): D36-42. doi:10.1093/nar/gks1195. PMC PMC3531190. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531190.
↑ Berman, H.; Henrick, K.; Nakamura, H. (2003). "Announcing the worldwide Protein Data Bank". Nature Structural Biology 10 (12): 980. doi:10.1038/nsb1203-980. PMID 14634627.
↑ UniProt Consortium (2015). "UniProt: A hub for protein information". Nucleic Acids Research 43 (D1): D204-12. doi:10.1093/nar/gku989. PMC PMC4384041. PMID 25348405. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4384041.
↑ Wenger, M.; Ochsenbein, F.; Egret, D. et al. (2000). "The SIMBAD astronomical database: The CDS reference database for astronomical objects". Astronomy and Astrophysics Supplement Series 143 (1): 9–22. doi:10.1051/aas:2000332.
↑ Crosas, M. (2011). "The Dataverse Network: An open-source application for sharing, discovering and preserving data". D-Lib Magazine 17 (1/2): 2. doi:10.1045/january2011-crosas.
↑ White, H.C.; Carrier, S.; Thompson, A. et al. (2008). "The Dryad Data Repository: A Singapore Framework metadata architecture in a DSpace environment". DC-2008--Berlin Proceedings 2008: 157–162. http://dcpapers.dublincore.org/pubs/article/view/928.
↑ Lecarpentier, D.; Wittenburg, P.; Elbers, W. (2013). "EUDAT: A new cross-disciplinary data infrastructure for science". International Journal of Digital Curation 8 (1): 279–287. doi:10.2218/ijdc.v8i1.260.
↑ Martone, M.E. (2015). "FORCE11: Building the Future for Research Communications and e-Scholarship". BioScience 65 (7): 635. doi:10.1093/biosci/biv095.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.