Journal:The FAIR Guiding Principles for scientific data management and stewardship
Full article title | The FAIR Guiding Principles for scientific data management and stewardship |
---|---|
Journal | Scientific Data |
Author(s) | List of authors |
Author affiliation(s) | List of author affiliations |
Primary contact | Barend Mons |
Year published | 2016 |
Volume and issue | 3 |
Page(s) | 160018 |
DOI | 10.1038/sdata.2016.18 |
ISSN | 2052-4463 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://www.nature.com/articles/sdata201618 |
Download | https://www.nature.com/articles/sdata201618.pdf (PDF) |
Abstract
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This comment article represents the first formal publication of the FAIR Principles, and it includes the rationale behind them as well as some exemplar implementations in the community.
Comment
Supporting discovery through good data management
Good data management is not a goal in itself but rather the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process. Unfortunately, the existing digital ecosystem surrounding scholarly data publication prevents us from extracting maximum benefit from our research investments (e.g., Roche et al.[1]). Partially in response to this, science funders, publishers and governmental agencies are beginning to require data management and stewardship plans for data generated in publicly funded experiments. Beyond proper collection, annotation, and archiving, data stewardship includes the notion of "long-term care" of valuable digital assets, with the goal that they should be discovered and re-used for downstream investigations, either alone, or in combination with newly generated data. The outcomes from good data management and stewardship, therefore, are high-quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation, and reuse in downstream studies. What constitutes "good data management" is, however, largely undefined, and is generally left as a decision for the data or repository owner. Therefore, bringing some clarity around the goals and desiderata of good data management and stewardship, and defining simple guideposts to inform those who publish and/or preserve scholarly data, would be of great utility.
This article describes four foundational principles — findability, accessibility, interoperability, and reusability — that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to "data" in the conventional sense, but also to the algorithms, tools, and workflows that led to that data. All scholarly digital research objects[2] — from data to analytical pipelines — benefit from application of these principles, since all components of the research process must be available to ensure transparency, reproducibility, and reusability.
There are numerous and diverse stakeholders who stand to benefit from overcoming these obstacles: researchers wanting to share, get credit, and reuse each other’s data and interpretations; professional data publishers offering their services; software and tool-builders providing data analysis and processing services such as reusable workflows; funding agencies (private and public) increasingly concerned with long-term data stewardship; and a data science community mining, integrating, and analyzing new and existing data to advance discovery. To facilitate the reading of this manuscript by these diverse stakeholders, we provide definitions for common abbreviations in Box 1. Humans, however, are not the only critical stakeholders in the milieu of scientific data. Similar problems are encountered by the applications and computational agents that we task to undertake data retrieval and analysis on our behalf. These "computational stakeholders" are increasingly relevant, and they demand as much, or more, attention as their importance grows. One of the grand challenges of data-intensive science, therefore, is to improve knowledge discovery through assisting both humans and their computational agents in the discovery of, access to, and integration and analysis of task-appropriate scientific data and other scholarly digital objects.
For certain types of important digital objects, there are well-curated, deeply integrated, special-purpose repositories such as GenBank[3], worldwide Protein Data Bank (wwPDB)[4], and UniProt[5] in the life sciences; Space Physics Data Facility (SPDF; http://spdf.gsfc.nasa.gov/) and Set of Identifications, Measurements and Bibliography for Astronomical Data (SIMBAD)[6] in the space sciences.
These foundational and critical core resources are continuously curating and capturing high-value reference datasets and fine-tuning them to enhance scholarly output, provide support for both human and mechanical users, and provide extensive tooling to access their content in rich, dynamic ways. However, not all datasets or even data types can be captured by, or submitted to, these repositories. Many important datasets emerging from traditional, low-throughput bench science don’t fit in the data models of these special-purpose repositories, yet these datasets are no less important with respect to integrative research, reproducibility, and reuse in general. Apparently in response to this, we see the emergence of numerous general-purpose data repositories, at scales ranging from institutional (for example, a single university), to open globally-scoped repositories such as Dataverse[7], FigShare (http://figshare.com), Dryad[8], Mendeley Data (https://data.mendeley.com/), Zenodo (http://zenodo.org/), DataHub (http://datahub.io), DANS (http://www.dans.knaw.nl/), and EUDAT.[9] Such repositories accept a wide range of data types in a wide variety of formats, generally do not attempt to integrate or harmonize the deposited data, and place few restrictions (or requirements) on the descriptors of the data deposition. The resulting data ecosystem, therefore, appears to be moving away from centralization and is becoming more diverse and less integrated, thereby exacerbating the discovery and re-usability problem for both human and computational stakeholders.
A specific example of these obstacles could be imagined in the domain of gene regulation and expression analysis. Suppose a researcher has generated a dataset of differentially selected polyadenylation sites in a non-model pathogenic organism grown under a variety of environmental conditions that stimulate its pathogenic state. The researcher is interested in comparing the alternatively polyadenylated genes in this local dataset to other examples of alternative polyadenylation as well as the expression levels of these genes — both in this organism and related model organisms — during the infection process. Given that there is no special-purpose archive for differential polyadenylation data and no model organism database for this pathogen, where does the researcher begin?
References
- Roche, D.G.; Kruuk, L.E.; Lanfear, R.; Binning, S.A. (2015). "Public data archiving in ecology and evolution: How well are we doing?". PLOS Biology 13: e1002295. doi:10.1371/journal.pbio.1002295. PMC PMC4640582. PMID 26556502. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640582.
- Bechhofer, S.; De Roure, D.; Gamble, M. et al. (2010). "Research objects: Towards exchange and reuse of digital knowledge". Nature Precedings. doi:10.1038/npre.2010.4626.1.
- Benson, D.A.; Cavanaugh, M.; Clark, K. et al. (2013). "GenBank". Nucleic Acids Research 41 (D1): D36-42. doi:10.1093/nar/gks1195. PMC PMC3531190. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531190.
- Berman, H.; Henrick, K.; Nakamura, H. (2003). "Announcing the worldwide Protein Data Bank". Nature Structural Biology 10 (12): 980. doi:10.1038/nsb1203-980. PMID 14634627.
- UniProt Consortium (2015). "UniProt: A hub for protein information". Nucleic Acids Research 43 (D1): D204-12. doi:10.1093/nar/gku989. PMC PMC4384041. PMID 25348405. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4384041.
- Wenger, M.; Ochsenbein, F.; Egret, D. et al. (2000). "The SIMBAD astronomical database: The CDS reference database for astronomical objects". Astronomy and Astrophysics Supplement Series 143 (1): 9–22. doi:10.1051/aas:2000332.
- Crosas, M. (2011). "The Dataverse Network: An open-source application for sharing, discovering and preserving data". D-Lib Magazine 17 (1/2): 2. doi:10.1045/january2011-crosas.
- White, H.C.; Carrier, S.; Thompson, A. et al. (2008). "The Dryad Data Repository: A Singapore Framework metadata architecture in a DSpace environment". DC-2008--Berlin Proceedings 2008: 157–162. http://dcpapers.dublincore.org/pubs/article/view/928.
- Lecarpentier, D.; Wittenburg, P.; Elbers, W. (2013). "EUDAT: A new cross-disciplinary data infrastructure for science". International Journal of Digital Curation 8 (1): 279–287. doi:10.2218/ijdc.v8i1.260.
Notes
This presentation is faithful to the original, with only a few minor changes to formatting. Where important information was missing from the references, it has been added.