Journal:Shared metadata for data-centric materials science
==Introduction: Metadata and FAIR data principles==
The amount of data that has been produced in [[materials science]] up to today, and its day-by-day increase, are massive.<ref>{{Cite journal |last=Rickman |first=J.M. |last2=Lookman |first2=T. |last3=Kalinin |first3=S.V. |date=2019-04 |title=Materials informatics: From the atomic-level to the continuum |url=https://linkinghub.elsevier.com/retrieve/pii/S1359645419300667 |journal=Acta Materialia |language=en |volume=168 |pages=473–510 |doi=10.1016/j.actamat.2019.01.051}}</ref> The dawn of the data-centric era<ref>{{Cite book |date=2009 |editor-last=Hey |editor-first=Anthony J. G. |title=The fourth paradigm: data-intensive scientific discovery |publisher=Microsoft Research |place=Redmond, Washington |isbn=978-0-9825442-0-4}}</ref> requires that such data are not just stored, but also carefully annotated in order to find, access, and possibly reuse them. Terms of good practice to be adopted by the scientific community for the [[Information management|management]] and stewardship of its data, the so-called [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR data principles]], have been compiled by the FORCE11 group.<ref name=":0">{{Cite journal |last=Wilkinson |first=Mark D. |last2=Dumontier |first2=Michel |last3=Aalbersberg |first3=IJsbrand Jan |last4=Appleton |first4=Gabrielle |last5=Axton |first5=Myles |last6=Baak |first6=Arie |last7=Blomberg |first7=Niklas |last8=Boiten |first8=Jan-Willem |last9=da Silva Santos |first9=Luiz Bonino |last10=Bourne |first10=Philip E. |last11=Bouwman |first11=Jildau |date=2016-03-15 |title=The FAIR Guiding Principles for scientific data management and stewardship |url=https://www.nature.com/articles/sdata201618 |journal=Scientific Data |language=en |volume=3 |issue=1 |pages=160018 |doi=10.1038/sdata.2016.18 |issn=2052-4463 |pmc=PMC4792175 |pmid=26978244}}</ref> Here, the acronym "FAIR" stands for "findable, accessible, interoperable, and reusable," which applies not only to data but also to [[metadata]]. Other terms for the “R” in FAIR are “repurposable” and “recyclable.” The former term indicates that data may be used for a different purpose than the original one for which they were created. The latter term hints at the fact that data in materials science are often exploited only once for supporting the thesis of a single publication, and then they are stored and forgotten. In this sense, they would constitute a “waste” that can be recycled, provided that they can be found and they are properly annotated.
Before examining the meaning and importance of the four terms of the FAIR acronym, it is worth defining what metadata are with respect to data. To that purpose, we start by introducing the concept of a data object, which represents the collective storage of [[information]] related to an elementary entry in a [[database]]. One can consider it as a row in a table, where the columns can be occupied by simple scalars, higher-order mathematical objects, strings of characters, or even full documents (or other media objects). In the materials science context, a data object is the collection of attributes (the columns in the above-mentioned table) that represent a material or, even more fundamentally, a snapshot of the material captured by a single configuration of atoms, or it may be a set of measurements from well-defined equivalent [[Sample (material)|samples]] (see below for a discussion on this concept). For instance, in computational materials science, the attributes of a data object could be both the inputs (e.g., the coordinates and chemical species of the atoms constituting the material, the description of the physical model used for calculating its properties) and the outputs (e.g., total energy, forces, electronic density of states, etc.) of a calculation. Logically and physically, inputs and outputs are at different levels, in the sense that the former determine the latter.
Hence, one can consider the inputs as metadata describing the data, i.e., the outputs. In turn, the set of coordinates A that are metadata to some observed quantities may itself be considered as data that depend on another set of coordinates B and on the forces acting on the atoms in set A. So, the set of coordinates B and the acting forces are metadata to the set A, now regarded as data. Metadata can always be considered to be data, as they could be the object of different, independent analyses than those performed on the calculated properties. In this respect, whether an attribute of a data object is data or metadata depends on the context. This simple example also depicts a provenance relationship between the data and their metadata.
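As a minimal illustration of this relationship, the sketch below represents a data object whose attributes are partitioned into metadata (the inputs that determine the result) and data (the outputs), with a simple provenance link between two such objects. All field names and numerical values are hypothetical and are not taken from any of the schemas discussed later in this article.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    """A single database entry: attributes split into metadata (inputs) and data (outputs)."""
    metadata: Dict[str, object]   # e.g., atomic coordinates, species, physical model
    data: Dict[str, object]       # e.g., total energy, forces
    provenance: List[str] = field(default_factory=list)  # identifiers of objects this one derives from


# Inputs of a calculation act as metadata for its outputs (all values are placeholders).
snapshot_a = DataObject(
    metadata={
        "chemical_species": ["Si", "Si"],
        "coordinates_angstrom": [[0.0, 0.0, 0.0], [1.36, 1.36, 1.36]],
        "physical_model": "hypothetical-DFT-setup",
    },
    data={"total_energy_eV": -217.3,
          "forces_eV_per_angstrom": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]},
)

# The same attributes become data when analysed in their own right: a geometry (set A)
# regarded as data produced from a previous geometry (set B) and the forces acting on it.
snapshot_b = DataObject(
    metadata={"parent_coordinates": snapshot_a.metadata["coordinates_angstrom"],
              "parent_forces": snapshot_a.data["forces_eV_per_angstrom"]},
    data={"coordinates_angstrom": [[0.0, 0.0, 0.0], [1.36, 1.36, 1.36]]},
    provenance=["snapshot_a"],
)
</syntaxhighlight>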
<blockquote>Metadata are attributes that are necessary to locate, fully characterize, and ultimately reproduce other attributes that are identified as data.</blockquote>
The metadata include a clear and unambiguous description of the data as well as their full provenance. This definition is reminiscent of the definition given by the National Institute of Standards and Technology (NIST)<ref>{{Cite journal |last=Grassi |first=Paul A |last2=Lefkovitz |first2=Naomi B |last3=Nadeau |first3=Ellen M |last4=Galluzzo |first4=Ryan J |last5=Dinh |first5=Abhiraj T |date=2018-01-11 |title=Attribute metadata: a proposed schema for evaluating federated attributes |url=https://nvlpubs.nist.gov/nistpubs/ir/2018/NIST.IR.8112.pdf |place=Gaithersburg, MD |pages=NIST IR 8112 |doi=10.6028/nist.ir.8112}}</ref>: “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about information or information about information.” With our definition, we highlight the role of data “reproducibility,” which is crucial in science.
Within the “full characterization” requirement, we highlight interpretation of the data as a crucial aspect. In other words, the metadata must provide enough information on a stored value (therein including, e.g., dimensionless constants) to make it unambiguous whether two data objects may be compared with respect to the value of a given attribute or not.
From a practical point of view, the metadata are organized in a schema. We summarize what the FAIR principles imply in terms of a metadata schema as follows (an illustrative metadata record follows the list):
*'''Findability''' is achieved by assigning unique and persistent identifiers (PIDs) to data and metadata, describing data with rich metadata, and registering (see below) the (meta)data in searchable resources. Widely known examples of PIDs are digital object identifiers (DOIs) and (permanent) Uniform Resource Identifiers (URIs). According to ISO/IEC 11179, a metadata registry (MDR) is a database of metadata that supports the functionality of registration. Registration accomplishes three main goals: identification, provenance, and monitoring [[Quality (business)|quality]]. Furthermore, an MDR manages the semantics of the metadata, i.e., the relationships (connections) among them.
*'''Accessibility''' is enabled by [[application programming interface]]s (APIs), which allow one to query and retrieve single entries as well as entire archives.
*'''Interoperability''' implies the use of formal, accessible, shared, and broadly applicable languages for knowledge representation (these are known as formal [[Ontology (information science)|ontologies]] and will be discussed in the later section “Outlook on ontologies in materials science”), use of vocabularies to annotate data and metadata, and inclusion of references.
*'''Reusability''' hints at the fact that data in materials science are often exploited only once for a focus-oriented research project, and many data are not even properly stored as they turned out to be irrelevant for the focus. In this sense, many data constitute a “waste” that can be recycled, provided that the data can be found and they are properly annotated.
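To make these requirements concrete, the following sketch shows a hypothetical, FAIR-oriented metadata record for a single data object. The field names, the PID, the URLs, and the vocabulary terms are illustrative assumptions of ours, not a prescribed schema.

<syntaxhighlight lang="python">
import json

# Hypothetical metadata record illustrating the four FAIR aspects (all identifiers are placeholders).
record = {
    # Findability: a persistent identifier and rich descriptive metadata
    "pid": "hdl:21.11100/example-0000-0000",
    "title": "DFT relaxation of bulk silicon (illustrative entry)",
    "keywords": ["silicon", "DFT", "relaxation"],
    # Accessibility: where and how the data object can be retrieved
    "access_url": "https://example.org/api/v1/entries/example-0000-0000",
    "license": "CC BY 4.0",
    # Interoperability: terms drawn from shared vocabularies/ontologies
    "vocabulary_terms": {"material": "https://example.org/vocab/material"},
    # Reusability: provenance sufficient to repurpose the data
    "provenance": {"code": "example-dft-code 1.0",
                   "workflow_pid": "hdl:21.11100/example-0000-0001"},
}

print(json.dumps(record, indent=2))
</syntaxhighlight>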
Establishing one or more metadata schemas that are FAIR-compliant, and that therefore enable the materials science community to efficiently share the heterogeneously and decentrally produced data, needs to be a community effort. The workshop “Shared Metadata and Data Formats for Big-Data Driven Materials Science: A NOMAD–FAIR-DI Workshop” was organized and held in Berlin in July 2019 to ignite this effort. In the following sections, we describe the identified challenges and first-stage plans, divided into different aspects that are crucial to be addressed in computational materials science.
The materials science community realized long ago that it is necessary to structure data by means of metadata schemas. In this section, we describe pioneering and recent examples of such schemas, and how a metadata schema becomes FAIR-compliant.
To our knowledge, the first systematic effort to build a metadata schema for exchanging data in chemistry and materials science is CIF, an acronym that originally stood for "Crystallographic Information File," the data exchange standard file format introduced in 1991 by Hall, Allen and Brown.<ref>{{Cite journal |last=Hall |first=S. R. |last2=Allen |first2=F. H. |last3=Brown |first3=I. D. |date=1991-11-01 |title=The crystallographic information file (CIF): a new standard archive file for crystallography |url=https://scripts.iucr.org/cgi-bin/paper?S010876739101067X |journal=Acta Crystallographica Section A Foundations of Crystallography |volume=47 |issue=6 |pages=655–685 |doi=10.1107/S010876739101067X}}</ref><ref>{{Cite journal |last=Bernstein |first=Herbert J. |last2=Bollinger |first2=John C. |last3=Brown |first3=I. David |last4=Gražulis |first4=Saulius |last5=Hester |first5=James R. |last6=McMahon |first6=Brian |last7=Spadaccini |first7=Nick |last8=Westbrook |first8=John D. |last9=Westrip |first9=Simon P. |date=2016-02-01 |title=Specification of the Crystallographic Information File format, version 2.0 |url=https://scripts.iucr.org/cgi-bin/paper?S1600576715021871 |journal=Journal of Applied Crystallography |volume=49 |issue=1 |pages=277–284 |doi=10.1107/S1600576715021871 |issn=1600-5767}}</ref> Later, the CIF acronym was extended to also mean "Crystallographic Information Framework"<ref>{{Cite book |last=Hall, S.R.; Spadaccini, N.; Brown, I.D. et al. |year=2006 |editor-last=Hall |editor-first=S. R. |editor2-last=McMahon |editor2-first=B. |title=International Tables for Crystallography: Definition and exchange of crystallographic data |url=https://it.iucr.org/Ga/ |chapter=Formal specification of the crystallographic information file. Version 1.1 specification |series=International Tables for Crystallography |edition=1 |publisher=International Union of Crystallography |place=Chester, England |volume=G |pages=25–36 |doi=10.1107/97809553602060000107 |isbn=978-1-4020-5411-2}}</ref>, a broader system of exchange protocols based on data dictionaries and relational rules expressible in different machine-readable manifestations. These include the Crystallographic Information File itself, but also, for instance, XML ([[Extensible Markup Language]]), a general framework for encoding text documents in a format that is meant to be at the same time human and machine readable. CIF was developed by the International Union of Crystallography (IUCr) working party on Crystallographic Information and was adopted in 1990 as a standard file structure for the archiving and distribution of crystallographic information. It is now well established and is in regular use for reporting [[crystal structure]] determinations to ''Acta Crystallographica'' and other journals. More recently, CIF has been adapted to different areas of science such as structural biology (mmCIF, the macromolecular CIF<ref>{{Citation |last=Westbrook |first=J. D. |last2=Yang |first2=H. |last3=Feng |first3=Z. |last4=Berman |first4=H. M. |date=2006-10-01 |editor-last=Hall |editor-first=S. R. |editor2-last=McMahon |editor2-first=B. |title=The use of mmCIF architecture for PDB data management |url=https://xrpp.iucr.org/cgi-bin/itr?url_ver=Z39.88-2003&rft_dat=what%3Dchapter%26volid%3DGa%26chnumo%3D5o5%26chvers%3Dv0001 |work=International Tables for Crystallography |edition=1 |publisher=International Union of Crystallography |place=Chester, England |volume=G |pages=539–543 |doi=10.1107/97809553602060000755 |isbn=978-1-4020-5411-2 |accessdate=2023-11-07}}</ref>) and [[spectroscopy]].<ref>{{Cite journal |last=El Mendili |first=Yassine |last2=Vaitkus |first2=Antanas |last3=Merkys |first3=Andrius |last4=Gražulis |first4=Saulius |last5=Chateigner |first5=Daniel |last6=Mathevet |first6=Fabrice |last7=Gascoin |first7=Stéphanie |last8=Petit |first8=Sebastien |last9=Bardeau |first9=Jean-François |last10=Zanatta |first10=Marco |last11=Secchi |first11=Maria |date=2019-06-01 |title=Raman Open Database: first interconnected Raman–X-ray diffraction open-access resource for material identification |url=http://scripts.iucr.org/cgi-bin/paper?S1600576719004229 |journal=Journal of Applied Crystallography |volume=52 |issue=3 |pages=618–625 |doi=10.1107/S1600576719004229 |issn=1600-5767 |pmc=PMC6557180 |pmid=31236093}}</ref> The CIF framework includes a strict syntax definition in machine-readable form and dictionaries defining the (meta)data items. It has been noted that the adoption of the CIF framework in IUCr publications has allowed for a significant reduction in the number of errors in published crystal structures.<ref>{{Cite journal |last=McMahon |first=B. |date=1996-05 |title=The role of journals in maintaining data integrity: Checking of crystal structure data in Acta Crystallographica |url=https://nvlpubs.nist.gov/nistpubs/jres/101/3/j3mcma.pdf |journal=Journal of Research of the National Institute of Standards and Technology |volume=101 |issue=3 |pages=347 |doi=10.6028/jres.101.036 |pmc=PMC4894614 |pmid=27805171}}</ref><ref>{{Cite journal |last=Brown |first=I. David |last2=McMahon |first2=Brian |date=2002-06-01 |title=CIF: the computer language of crystallography |url=https://scripts.iucr.org/cgi-bin/paper?S0108768102003464 |journal=Acta Crystallographica Section B Structural Science |volume=58 |issue=3 |pages=317–324 |doi=10.1107/S0108768102003464 |issn=0108-7681}}</ref>
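As an illustration of the data-name/data-value structure that CIF dictionaries standardize, the following sketch parses a few CIF-style items with plain Python. The tiny parser and the example fragment are ours: they handle only simple "tag value" pairs and are not a substitute for a real CIF library, which must also handle loops and multi-line values.

<syntaxhighlight lang="python">
# Minimal, illustrative CIF-like fragment (diamond-structure silicon cell parameters).
cif_text = """
_cell_length_a    5.431
_cell_length_b    5.431
_cell_length_c    5.431
_symmetry_space_group_name_H-M   'F d -3 m'
"""


def parse_simple_cif_items(text: str) -> dict:
    """Collect 'tag value' pairs; full CIF parsing (loops, multi-line values) needs a dedicated library."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            tag, _, value = line.partition(" ")
            items[tag] = value.strip().strip("'")
    return items


print(parse_simple_cif_items(cif_text))
# e.g. {'_cell_length_a': '5.431', ..., '_symmetry_space_group_name_H-M': 'F d -3 m'}
</syntaxhighlight>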
An early example of an exhaustive metadata schema for chemistry and materials science is the Chemical Markup Language (CML)<ref>{{Cite web |date=2012 |title=Chemical Markup Language |url=https://www.xml-cml.org/ |publisher=CMLC |accessdate=04 July 2023}}</ref><ref>{{Cite journal |last=Murray-Rust |first=Peter |last2=Townsend |first2=Joe A |last3=Adams |first3=Sam E |last4=Phadungsukanan |first4=Weerapong |last5=Thomas |first5=Jens |date=2011-12 |title=The semantics of Chemical Markup Language (CML): dictionaries and conventions |url=https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-43 |journal=Journal of Cheminformatics |language=en |volume=3 |issue=1 |pages=43 |doi=10.1186/1758-2946-3-43 |issn=1758-2946 |pmc=PMC3206453 |pmid=21999509}}</ref><ref name=":1">{{Cite journal |last=Murray-Rust |first=Peter |last2=Rzepa |first2=Henry S |date=2011-12 |title=CML: Evolution and design |url=https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-44 |journal=Journal of Cheminformatics |language=en |volume=3 |issue=1 |pages=44 |doi=10.1186/1758-2946-3-44 |issn=1758-2946 |pmc=PMC3205047 |pmid=21999549}}</ref>, whose first public version was released in 1995. CML is a dictionary for chemical metadata, encoded in XML. CML is accessible (for reading, writing, and validation) via the Java library JUMBO (Java Universal Molecular/Markup Browser for Objects).<ref name=":1" /> The general idea of CML is to represent with a common language all kinds of documents that contain chemical data, even though currently the language—as of the latest update in 2012<ref>{{Cite web |date=2012 |title=Schema 3 |work=Chemical Markup Language |url=https://www.xml-cml.org/schema/schema3/ |publisher=CMLC |accessdate=04 July 2023}}</ref>—covers mainly the description of molecules (e.g., IUPAC name, atomic coordinates, bond distances) and of inputs/outputs of computational chemistry codes such as Gaussian03<ref>{{Cite web |date=2023 |title=Gaussian - Expanding the limits of computational chemistry |url=https://gaussian.com/ |publisher=Gaussian, Inc. |accessdate=04 July 2023}}</ref> and NWChem.<ref>{{Cite journal |last=Valiev |first=M. |last2=Bylaska |first2=E.J. |last3=Govind |first3=N. |last4=Kowalski |first4=K. |last5=Straatsma |first5=T.P. |last6=Van Dam |first6=H.J.J. |last7=Wang |first7=D. |last8=Nieplocha |first8=J. |last9=Apra |first9=E. |last10=Windus |first10=T.L. |last11=de Jong |first11=W.A. |date=2010-09 |title=NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465510001438 |journal=Computer Physics Communications |language=en |volume=181 |issue=9 |pages=1477–1489 |doi=10.1016/j.cpc.2010.04.018}}</ref> Specifically, in the CML representation of computational chemistry calculations<ref>{{Cite web |date=2012 |title=Examples for Schema 3 CompChem |work=Chemical Markup Language |url=https://www.xml-cml.org/examples/schema3/compchem/ |publisher=CMLC |accessdate=04 July 2023}}</ref>, (ideally) all the information on a simulation that is contained in the input and output files is mapped onto a format that is in principle independent of the code itself. Such information is:
*Administrative data like the code version, libraries for the compilation, hardware, and user submitting the job;
*Materials-specific (or materials-snapshot-specific) data like computed structure (e.g., atomic species, coordinates), the physical method (e.g., electronic exchange-correlation treatment, relativistic treatment), and numerical settings (basis set, integration grids, etc.); and
*Computed quantities (energies, forces, sequence of atomic positions in case a structure relaxation or some dynamical propagation of the system is performed, etc.).
The different types of information are hierarchically organized in modules, e.g., environment (for the code version, hardware, run date, etc.), initialization (for the exchange-correlation treatment, spin, charge), molgeom (for the atomic coordinates and the localized basis set specification), and finalization (for the energies, forces, etc.). The most recent release of the CML schema contains more than 500 metadata-schema items, i.e., unique entries in the metadata schema. It is worth noting that CIF is the dictionary of choice for the crystallography domain within CML.
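The module structure described above can be illustrated with a small XML document assembled with Python's standard library. The element names follow the module names mentioned in the text, but the exact tags and attributes of the real CML schema are not reproduced here; this is an illustrative, CML-inspired sketch, not a validated CML instance.

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# Illustrative, CML-inspired compchem document (names and values are placeholders).
run = ET.Element("module", attrib={"role": "compchem-run"})

env = ET.SubElement(run, "module", attrib={"role": "environment"})
ET.SubElement(env, "parameter", attrib={"name": "codeVersion"}).text = "example-code 1.0"

init = ET.SubElement(run, "module", attrib={"role": "initialization"})
ET.SubElement(init, "parameter", attrib={"name": "exchangeCorrelation"}).text = "PBE"

geom = ET.SubElement(run, "module", attrib={"role": "molgeom"})
ET.SubElement(geom, "atom", attrib={"elementType": "Si", "x3": "0.0", "y3": "0.0", "z3": "0.0"})

final = ET.SubElement(run, "module", attrib={"role": "finalization"})
ET.SubElement(final, "property", attrib={"name": "totalEnergy", "units": "eV"}).text = "-217.3"

print(ET.tostring(run, encoding="unicode"))
</syntaxhighlight>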
Another long-standing activity is JCAMP-DX (Joint Committee on Atomic and Molecular Physical Data - Data Exchange)<ref>{{Cite journal |last=McDonald |first=Robert S. |last2=Wilks |first2=Paul A. |date=1988-01 |title=JCAMP-DX: A Standard Form for Exchange of Infrared Spectra in Computer Readable Form |url=http://journals.sagepub.com/doi/10.1366/0003702884428734 |journal=Applied Spectroscopy |language=en |volume=42 |issue=1 |pages=151–162 |doi=10.1366/0003702884428734 |issn=0003-7028}}</ref>, a standard file format for exchange of infrared spectra and related chemical and physical information that was established in 1988 and then updated with IUPAC recommendations until 2004. It contains standard dictionaries for infrared spectroscopy, chemical structure, nuclear magnetic resonance (NMR) spectroscopy<ref>{{Cite journal |last=Davies |first=Antony N. |last2=Lampen |first2=Peter |date=1993-08 |title=JCAMP-DX for NMR |url=http://journals.sagepub.com/doi/10.1366/0003702934067874 |journal=Applied Spectroscopy |language=en |volume=47 |issue=8 |pages=1093–1099 |doi=10.1366/0003702934067874 |issn=0003-7028}}</ref>, [[mass spectrometry]]<ref>{{Cite journal |last=Lampen |first=Peter |last2=Hillig |first2=Heinrich |last3=Davies |first3=Antony N. |last4=Linscheid |first4=Michael |date=1994-12 |title=JCAMP-DX for Mass Spectrometry |url=http://journals.sagepub.com/doi/10.1366/0003702944027840 |journal=Applied Spectroscopy |language=en |volume=48 |issue=12 |pages=1545–1552 |doi=10.1366/0003702944027840 |issn=0003-7028}}</ref>, and ion-mobility spectrometry.<ref>{{Cite journal |last=Baumbach |first=Jörg Ingo |last2=Davies |first2=Antony N. |last3=Lampen |first3=Peter |last4=Schmidt |first4=Hartwig |date=2001-01-01 |title=JCAMP-DX. A standard format for the exchange of ion mobility spectrometry data (IUPAC Recommendations 2001) |url=https://www.degruyter.com/document/doi/10.1351/pac200173111765/html |journal=Pure and Applied Chemistry |language=en |volume=73 |issue=11 |pages=1765–1782 |doi=10.1351/pac200173111765 |issn=1365-3075}}</ref> The European Theoretical Spectroscopy Facility (ETSF) File Format Specifications were proposed in 2007<ref>{{Cite journal |last=Gonze |first=X. |last2=Almbladh |first2=C.-O. |last3=Cucca |first3=A. |last4=Caliste |first4=D. |last5=Freysoldt |first5=C. |last6=Marques |first6=M.A.L. |last7=Olevano |first7=V. |last8=Pouillon |first8=Y. |last9=Verstraete |first9=M.J. |date=2008-10 |title=Specification of an extensible and portable file format for electronic structure and crystallographic data |url=https://linkinghub.elsevier.com/retrieve/pii/S0927025608001377 |journal=Computational Materials Science |language=en |volume=43 |issue=4 |pages=1056–1065 |doi=10.1016/j.commatsci.2008.02.023}}</ref><ref>{{Cite journal |last=Caliste |first=D. |last2=Pouillon |first2=Y. |last3=Verstraete |first3=M.J. |last4=Olevano |first4=V. |last5=Gonze |first5=X. |date=2008-11 |title=Sharing electronic structure and crystallographic data with ETSF_IO |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465508001963 |journal=Computer Physics Communications |language=en |volume=179 |issue=10 |pages=748–758 |doi=10.1016/j.cpc.2008.05.007}}</ref>, in the context of the European Network of Excellence NANOQUANTA, in order to overcome widely known portability issues of input/output file formats across platforms. The Electronic Structure Common Data Format (ESCDF) specification<ref name=":2">{{Cite journal |last=Ghiringhelli |first=Luca M. |last2=Carbogno |first2=Christian |last3=Levchenko |first3=Sergey |last4=Mohamed |first4=Fawzi |last5=Huhs |first5=Georg |last6=Lüders |first6=Martin |last7=Oliveira |first7=Micael |last8=Scheffler |first8=Matthias |date=2017-11-06 |title=Towards efficient data exchange and sharing for big-data driven materials science: metadata and data formats |url=https://www.nature.com/articles/s41524-017-0048-5 |journal=npj Computational Materials |language=en |volume=3 |issue=1 |pages=46 |doi=10.1038/s41524-017-0048-5 |issn=2057-3960}}</ref> is the ongoing continuation of the ETSF project and is part of the CECAM Electronic Structure Library, a community-maintained collection of software libraries and data standards for electronic-structure calculations.<ref>{{Cite journal |last=Oliveira |first=Micael J. T. |last2=Papior |first2=Nick |last3=Pouillon |first3=Yann |last4=Blum |first4=Volker |last5=Artacho |first5=Emilio |last6=Caliste |first6=Damien |last7=Corsetti |first7=Fabiano |last8=de Gironcoli |first8=Stefano |last9=Elena |first9=Alin M. |last10=García |first10=Alberto |last11=García-Suárez |first11=Víctor M. |date=2020-07-14 |title=The CECAM electronic structure library and the modular software development paradigm |url=https://pubs.aip.org/jcp/article/153/2/024117/1061500/The-CECAM-electronic-structure-library-and-the |journal=The Journal of Chemical Physics |language=en |volume=153 |issue=2 |pages=024117 |doi=10.1063/5.0012901 |issn=0021-9606}}</ref>
The largest databases of computational materials science data, AFLOW<ref>{{Cite journal |last=Curtarolo |first=Stefano |last2=Setyawan |first2=Wahyu |last3=Wang |first3=Shidong |last4=Xue |first4=Junkai |last5=Yang |first5=Kesong |last6=Taylor |first6=Richard H. |last7=Nelson |first7=Lance J. |last8=Hart |first8=Gus L.W. |last9=Sanvito |first9=Stefano |last10=Buongiorno-Nardelli |first10=Marco |last11=Mingo |first11=Natalio |date=2012-06 |title=AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations |url=https://linkinghub.elsevier.com/retrieve/pii/S0927025612000687 |journal=Computational Materials Science |language=en |volume=58 |pages=227–235 |doi=10.1016/j.commatsci.2012.02.002}}</ref>, Materials Cloud<ref>{{Cite journal |last=Talirz |first=Leopold |last2=Kumbhar |first2=Snehal |last3=Passaro |first3=Elsa |last4=Yakutovich |first4=Aliaksandr V. |last5=Granata |first5=Valeria |last6=Gargiulo |first6=Fernando |last7=Borelli |first7=Marco |last8=Uhrin |first8=Martin |last9=Huber |first9=Sebastiaan P. |last10=Zoupanos |first10=Spyros |last11=Adorf |first11=Carl S. |date=2020-09-08 |title=Materials Cloud, a platform for open computational science |url=https://www.nature.com/articles/s41597-020-00637-5 |journal=Scientific Data |language=en |volume=7 |issue=1 |pages=299 |doi=10.1038/s41597-020-00637-5 |issn=2052-4463 |pmc=PMC7479138 |pmid=32901046}}</ref>, Materials Project<ref>{{Cite journal |last=Jain |first=Anubhav |last2=Ong |first2=Shyue Ping |last3=Hautier |first3=Geoffroy |last4=Chen |first4=Wei |last5=Richards |first5=William Davidson |last6=Dacek |first6=Stephen |last7=Cholia |first7=Shreyas |last8=Gunter |first8=Dan |last9=Skinner |first9=David |last10=Ceder |first10=Gerbrand |last11=Persson |first11=Kristin A. |date=2013-07-01 |title=Commentary: The Materials Project: A materials genome approach to accelerating materials innovation |url=https://pubs.aip.org/apm/article/1/1/011002/119685/Commentary-The-Materials-Project-A-materials |journal=APL Materials |language=en |volume=1 |issue=1 |pages=011002 |doi=10.1063/1.4812323 |issn=2166-532X}}</ref>, the NOMAD Repository and Archive<ref>{{Cite journal |last=Draxl |first=Claudia |last2=Scheffler |first2=Matthias |date=2018 |title=NOMAD: The FAIR Concept for Big-Data-Driven Materials Science |url=https://arxiv.org/abs/1805.05039 |journal=arXiv |doi=10.48550/ARXIV.1805.05039}}</ref><ref>{{Cite journal |last=Draxl |first=Claudia |last2=Scheffler |first2=Matthias |date=2019-07-01 |title=The NOMAD laboratory: from data sharing to artificial intelligence |url=https://iopscience.iop.org/article/10.1088/2515-7639/ab13bb |journal=Journal of Physics: Materials |volume=2 |issue=3 |pages=036001 |doi=10.1088/2515-7639/ab13bb |issn=2515-7639}}</ref><ref>{{Citation |last=Draxl |first=Claudia |last2=Scheffler |first2=Matthias |date=2020 |editor-last=Andreoni |editor-first=Wanda |editor2-last=Yip |editor2-first=Sidney |title=Big Data-Driven Materials Science and Its FAIR Data Infrastructure |url=http://link.springer.com/10.1007/978-3-319-44677-6_104 |work=Handbook of Materials Modeling |language=en |publisher=Springer International Publishing |place=Cham |pages=49–73 |doi=10.1007/978-3-319-44677-6_104 |isbn=978-3-319-44676-9 |accessdate=2023-11-07}}</ref>, OQMD<ref>{{Cite journal |last=Kirklin |first=Scott |last2=Saal |first2=James E |last3=Meredig |first3=Bryce |last4=Thompson |first4=Alex |last5=Doak |first5=Jeff W |last6=Aykol |first6=Muratahan |last7=Rühl |first7=Stephan |last8=Wolverton |first8=Chris |date=2015-12-11 |title=The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies |url=https://www.nature.com/articles/npjcompumats201510 |journal=npj Computational Materials |language=en |volume=1 |issue=1 |pages=15010 |doi=10.1038/npjcompumats.2015.10 |issn=2057-3960}}</ref>, and TCOD<ref>{{Cite journal |last=Merkys |first=Andrius |last2=Mounet |first2=Nicolas |last3=Cepellotti |first3=Andrea |last4=Marzari |first4=Nicola |last5=Gražulis |first5=Saulius |last6=Pizzi |first6=Giovanni |date=2017-12 |title=A posteriori metadata from automated provenance tracking: integration of AiiDA and TCOD |url=https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0242-y |journal=Journal of Cheminformatics |language=en |volume=9 |issue=1 |pages=56 |doi=10.1186/s13321-017-0242-y |issn=1758-2946 |pmc=PMC5686034 |pmid=29138947}}</ref> offer APIs that rely on dedicated metadata schemas. Similarly, AiiDA<ref>{{Cite journal |last=Pizzi |first=Giovanni |last2=Cepellotti |first2=Andrea |last3=Sabatini |first3=Riccardo |last4=Marzari |first4=Nicola |last5=Kozinsky |first5=Boris |date=2016-01 |title=AiiDA: automated interactive infrastructure and database for computational science |url=https://linkinghub.elsevier.com/retrieve/pii/S0927025615005820 |journal=Computational Materials Science |language=en |volume=111 |pages=218–230 |doi=10.1016/j.commatsci.2015.09.013}}</ref><ref>{{Cite journal |last=Huber |first=Sebastiaan P. |last2=Zoupanos |first2=Spyros |last3=Uhrin |first3=Martin |last4=Talirz |first4=Leopold |last5=Kahle |first5=Leonid |last6=Häuselmann |first6=Rico |last7=Gresch |first7=Dominik |last8=Müller |first8=Tiziano |last9=Yakutovich |first9=Aliaksandr V. |last10=Andersen |first10=Casper W. |last11=Ramirez |first11=Francisco F. |date=2020-09-08 |title=AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance |url=https://www.nature.com/articles/s41597-020-00638-4 |journal=Scientific Data |language=en |volume=7 |issue=1 |pages=300 |doi=10.1038/s41597-020-00638-4 |issn=2052-4463 |pmc=PMC7479590 |pmid=32901044}}</ref><ref>{{Cite journal |last=Uhrin |first=Martin |last2=Huber |first2=Sebastiaan P. |last3=Yu |first3=Jusong |last4=Marzari |first4=Nicola |last5=Pizzi |first5=Giovanni |date=2021-02 |title=Workflows in AiiDA: Engineering a high-throughput, event-based engine for robust and modular computational workflows |url=https://linkinghub.elsevier.com/retrieve/pii/S0927025620305772 |journal=Computational Materials Science |language=en |volume=187 |pages=110086 |doi=10.1016/j.commatsci.2020.110086}}</ref> and ASE<ref>{{Cite journal |last=Hjorth Larsen |first=Ask |last2=Jørgen Mortensen |first2=Jens |last3=Blomqvist |first3=Jakob |last4=Castelli |first4=Ivano E |last5=Christensen |first5=Rune |last6=Dułak |first6=Marcin |last7=Friis |first7=Jesper |last8=Groves |first8=Michael N |last9=Hammer |first9=Bjørk |last10=Hargus |first10=Cory |last11=Hermes |first11=Eric D |date=2017-07-12 |title=The atomic simulation environment—a Python library for working with atoms |url=https://iopscience.iop.org/article/10.1088/1361-648X/aa680e |journal=Journal of Physics: Condensed Matter |volume=29 |issue=27 |pages=273002 |doi=10.1088/1361-648X/aa680e |issn=0953-8984}}</ref>, which are schedulers and workflow managers for computational materials science calculations, adopt their own metadata schemas. OpenKIM<ref>{{Cite journal |last=Tadmor |first=E. B. |last2=Elliott |first2=R. S. |last3=Sethna |first3=J. P. |last4=Miller |first4=R. E. |last5=Becker |first5=C. A. |date=2011-07 |title=The potential of atomistic simulations and the knowledgebase of interatomic models |url=http://link.springer.com/10.1007/s11837-011-0102-6 |journal=JOM |language=en |volume=63 |issue=7 |pages=17–17 |doi=10.1007/s11837-011-0102-6 |issn=1047-4838}}</ref> is a library of interatomic models (force fields) and simulation codes that test the predictions of these models, complemented with the necessary first-principles and experimental reference data. Within OpenKIM, a metadata schema is defined for the annotation of the models and reference data. Some of the metadata in all these schemas are straightforward to map onto each other (e.g., those related to the structure of the studied system, i.e., atomic coordinates and species, and simulation-cell specification), while others can be mapped with some care. The OPTIMADE (Open Databases Integration for Materials Design<ref name=":3">{{Cite journal |last=Andersen |first=Casper W. |last2=Armiento |first2=Rickard |last3=Blokhin |first3=Evgeny |last4=Conduit |first4=Gareth J. |last5=Dwaraknath |first5=Shyam |last6=Evans |first6=Matthew L. |last7=Fekete |first7=Ádám |last8=Gopakumar |first8=Abhijith |last9=Gražulis |first9=Saulius |last10=Merkys |first10=Andrius |last11=Mohamed |first11=Fawzi |date=2021-08-12 |title=OPTIMADE, an API for exchanging materials data |url=https://www.nature.com/articles/s41597-021-00974-z |journal=Scientific Data |language=en |volume=8 |issue=1 |pages=217 |doi=10.1038/s41597-021-00974-z |issn=2052-4463 |pmc=PMC8361091 |pmid=34385453}}</ref>) consortium has recognized this potential and has recently released the first version of an API that allows users to access a common subset of metadata-schema items, independent of the schema adopted for any specific database/repository that is part of the consortium.
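As a usage sketch of such an API, the snippet below queries the /structures endpoint of an OPTIMADE provider with the standard OPTIMADE filter grammar. The base URL is a placeholder that must be replaced by the address of an actual provider, and the third-party requests package is assumed to be installed; this is an illustration, not an endorsement of any particular endpoint.

<syntaxhighlight lang="python">
import requests

# Placeholder base URL of an OPTIMADE provider; replace with a real provider address.
BASE_URL = "https://example.org/optimade/v1"

# OPTIMADE filter grammar: select structures containing both Si and O.
params = {
    "filter": 'elements HAS ALL "Si", "O"',
    "page_limit": 5,
}

response = requests.get(f"{BASE_URL}/structures", params=params, timeout=30)
response.raise_for_status()

# Entries follow the JSON:API layout used by OPTIMADE: a "data" list of records
# with an "id" and an "attributes" dictionary.
for entry in response.json().get("data", []):
    attrs = entry.get("attributes", {})
    print(entry.get("id"), attrs.get("chemical_formula_reduced"))
</syntaxhighlight>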
In order to clarify how a metadata schema can explicitly be FAIR-compliant, we describe as an example the main features of the NOMAD Metainfo, onto which the information contained in the input and output files of atomistic codes, both ''ab initio'' and force-field based, is mapped. The first released version of the NOMAD Metainfo is described by Ghiringhelli ''et al.''<ref name=":2" /> and it has powered the NOMAD Archive since the latter went online in 2014, thus predating the formal introduction of the FAIR data principles.<ref name=":0" />
Here, we give a simplified description, graphically aided by Fig. 1, which highlights the hierarchical/modular architecture of the metadata schema. The elementary mode in which an atomistic materials science code is run (encompassed by the black rectangle) yields the computation of some observables (Output) for a given System, specified in terms of atomic species arranged by their coordinates in a box, and for a given physical model (Method), including specification of its numerical implementation. Sequences or collections of such runs are often defined via a Workflow. Examples of workflows (a schematic sketch follows the list) are:
*Perturbative physical models (e.g., second-order Møller–Plesset, MP2, Green’s function based methods such as G0W0, random-phase approximation, RPA) evaluated using self-consistent solutions provided by other models (e.g., density-functional theory, DFT, Hartree-Fock method, HF) applied on the same System;
*Sampling of some desired thermodynamic ensemble by means of, e.g., molecular dynamics;
*Global- and local-minima structure searches;
*Numerical evaluations of equations of state, phonons, or elastic constants by evaluating energies, forces, and possibly other observables; and
*Scans over the compositional space for a given class of materials (high-throughput screening).
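A schematic way to record such a workflow, in the spirit of the hierarchy of Fig. 1 but with field names that are our own simplification rather than the actual NOMAD Metainfo, is sketched below: each step is a code run with its System, Method, and Output, and the workflow section only references the runs it comprises.

<syntaxhighlight lang="python">
# Illustrative equation-of-state workflow: several single-point runs at different cell volumes.
runs = []
for i, scale in enumerate([0.98, 1.00, 1.02]):
    runs.append({
        "run_id": f"run_{i}",
        "system": {"species": ["Si", "Si"], "cell_scale": scale},
        "method": {"model": "DFT", "xc_functional": "PBE"},
        "output": {"total_energy_eV": None},  # filled in after the code run completes
    })

workflow = {
    "workflow_type": "equation_of_state",
    "run_refs": [r["run_id"] for r in runs],  # the workflow references its steps, not copies of them
    "results": {"equilibrium_volume": None},  # derived quantity, computed from the referenced runs
}

print(workflow["run_refs"])  # ['run_0', 'run_1', 'run_2']
</syntaxhighlight>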
All definitions in the NOMAD Metainfo have the following attributes (see the illustrative sketch after this list):
*A globally unique qualified name;
*Human-readable/interpretable description and expected format (e.g., scalar, string of a given length, array of given size);
*Allowed values; and
*Provenance, which is realized in terms of a hierarchical and modular schema, where each data object is linked to all the metadata that concur to its definition. Related to provenance, an important aspect of NOMAD Metainfo is its extensibility. It stems from the recognition that reproducibility is an empirical concept; thus, at any time, new, previously unknown or disregarded metadata may be recognized as necessary. The metadata schema must be ready to accommodate such extensions seamlessly.
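A compact way to picture such a definition, with attribute names chosen for illustration rather than taken from the NOMAD Metainfo source, is sketched below; the qualified name used in the example is hypothetical.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataDefinition:
    """Illustrative container for the attributes listed above (names are ours, not NOMAD's)."""
    qualified_name: str                   # globally unique, e.g., "section_run.energy_total" (hypothetical)
    description: str                      # human-readable meaning
    dtype: str                            # expected type, e.g., "float64" or "str"
    shape: List[str] = field(default_factory=list)  # symbolic dimensions, e.g., ["number_of_atoms"]
    unit: Optional[str] = None
    allowed_values: Optional[List[str]] = None
    parent_section: Optional[str] = None  # provenance via the containing section


energy_total = MetadataDefinition(
    qualified_name="section_run.energy_total",
    description="Total energy of the system for this code run.",
    dtype="float64",
    unit="joule",
    parent_section="section_run",
)
</syntaxhighlight>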
The representation in Fig. 1 is very simplified for tutorial purposes. For instance, a workflow can be arbitrarily complex; in particular, it may contain a hierarchy of sub-workflows. In the currently released version of the NOMAD Metainfo, the elementary-code-run modality is fully supported, i.e., ideally all the information contained in a code run is mapped onto the metadata schema, whereas the workflow modality is still under development. An important implication of the hierarchical schema is that any (complex) workflow can be mapped onto it, so that all the information obtained in its steps is stored. This mapping is achieved by parsers, which have been written by the NOMAD team for each supported simulation code. One of the outcomes of the parsing is the assignment of a PID to each parsed data object, thus allowing for its localization, e.g., via a URI.
In Fig. 1, the solid arrows stand for the <tt>is-contained-in</tt> relationship between section-type metadata. A few examples of quantity-type metadata are listed in each box/section. Such metadata are also in an <tt>is-contained-in</tt> relationship with the section they are listed in. The dashed arrows symbolize the relationship <tt>has reference in</tt>. In practice, in the example of an Output section, the quantity-type metadata contained in such a section are evaluated for a given system described in a System section and for a given physical model described in a Method section. So, the section Output contains a reference to the specific System and Method sections holding the necessary input information. At the same time, the Output section is contained in a given Atomistic-code run section. These relationships among metadata already build a basic ontology, induced by the way computational data are produced in practice, by means of workflows and code runs. This aspect will be reexamined in the later section “Outlook on ontologies in materials science.”
We now come to the <tt>category-type</tt> metadata, which allow for complementary, arbitrarily complex ontologies to be built starting from the same metadata. They define a concept, such as “energy” or “energy component,” in order to specify that a given quantity-type metadata item has a certain meaning, be it physical (such as “energy”), computer-hardware related, or administrative. For this purpose, each section- and quantity-type metadata item is related to a category-type metadata item by means of an <tt>is-a</tt> relationship. Each category-type metadata item can be related to another category-type metadata item by means of the same <tt>is-a</tt> relationship, thus building another ontology on the metadata, which can be connected with top-down ontologies such as EMMO<ref>{{Cite web |date=2021 |title=EMMO: an Ontology for Applied Sciences |url=https://emmc.info/emmo-info |publisher=EMMC |archiveurl=https://web.archive.org/web/20220526170653/https://emmc.info/emmo-info/ |archivedate=26 May 2022 |accessdate=04 July 2023}}</ref> (see Section “Outlook on ontologies in materials science” for a short description of EMMO).
The current version of NOMAD Metainfo includes more than 400 metadata-schema items. More specifically, these are the common metadata, i.e., those that are code-independent. Hundreds more metadata are code-specific, i.e., mapping pieces of information in the codes’ input/output that are specific to a given code and not transferable to other codes. The NOMAD Metainfo can be browsed at https://nomad-lab.eu/prod/v1/gui/analyze/metainfo.
To summarize, the NOMAD Metainfo addresses the FAIR data principles in the following sense:
*'''Findability''' is enabled by unique names and a human-understandable description;
*'''Accessibility''' is enabled by the PID assigned to each metadata-schema item, which can be accessed via a RESTful<ref>{{Cite web |last=Fielding, R.T. |date=2000 |title=Architectural Styles and the Design of Network-based Software Architectures |url=https://ics.uci.edu/~fielding/pubs/dissertation/top.htm |publisher=University of California, Irvine}}</ref> API (i.e., an API supporting access via web services, through common protocols such as HTTP), specifically developed for the NOMAD Metainfo. Essentially all NOMAD data are open-access, and users who wish to search and download data do not need to identify themselves; they only need to accept the CC BY license. Uploaders can opt for an embargo period, during which the data are shared only with a selected group of colleagues.
*'''Interoperability''' is enabled by the extensibility of the schema and the <tt>category-type</tt> metadata, which can be linked to existing and future ontologies (see Section “Outlook on ontologies in materials science”).
*'''Reusability/Repurposability/Recyclability''' is enabled by the modular/hierarchical structure that allows for accessing calculations at different abstraction scales, from the single observables in a code run to a whole complex workflow (see Section “Metadata for Computational Workflows”).
The usefulness and versatility of a metadata schema are demonstrated by the multiple access modalities it allows for. The NOMAD Metainfo schema is the basis of the whole NOMAD Laboratory infrastructure, which supports access to all the data in the NOMAD Archive via the NOMAD API (an implementation of the OPTIMADE API<ref name=":3" /> is also supported). This API powers three different access modes of the Archive: the Browser<ref>{{Cite web |date=2023 |title=Entries |work=NOMAD |url=https://nomad-lab.eu/prod/v1/gui/search/entries |publisher=MPCDF and FHI on behalf of Max-Planck-Society |accessdate=04 July 2023}}</ref>, which allows searches for single calculations or groups of calculations; the Encyclopedia<ref>{{Cite web |date=2023 |title=Search |work=NOMAD Encyclopedia |url=https://nomad-lab.eu/prod/rae/encyclopedia/#/search |publisher=The NOMAD Laboratory |accessdate=04 July 2023}}</ref>, which displays the content of the Archive organized by materials; and the [[Artificial intelligence|Artificial-Intelligence]] (AI) Toolkit<ref>{{Cite journal |last=Ghiringhelli |first=Luca M. |date=2021-09-09 |title=An AI-toolkit to develop and share research into new materials |url=https://www.nature.com/articles/s42254-021-00373-8 |journal=Nature Reviews Physics |language=en |volume=3 |issue=11 |pages=724–724 |doi=10.1038/s42254-021-00373-8 |issn=2522-5820}}</ref><ref>{{Cite journal |last=Sbailò |first=Luigi |last2=Fekete |first2=Ádám |last3=Ghiringhelli |first3=Luca M. |last4=Scheffler |first4=Matthias |date=2022-12-05 |title=The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding |url=https://www.nature.com/articles/s41524-022-00935-z |journal=npj Computational Materials |language=en |volume=8 |issue=1 |pages=250 |doi=10.1038/s41524-022-00935-z |issn=2057-3960}}</ref><ref>{{Cite web |date=2023 |title=NOMAD Artificial Intelligence Toolkit |url=https://nomad-lab.eu/aitoolkit |publisher=NOMAD Laboratory |accessdate=04 July 2023}}</ref>, which combines, in [[Jupyter Notebook]]s, script-based queries with AI ([[machine learning]] [ML], [[data mining]]) analyses of the filtered data. All three services are accessible via a web browser running the dedicated GUI offered by NOMAD.
==Metadata for ground-state electronic-structure calculations==

===Approximations to the DFT exchange-correlation functional===
Approximations to the DFT exchange-correlation (xc) functionals are identified by a name or acronym (e.g., “PBE”), although sometimes this identification is not unique or complete. As metadata, we suggest using the identifiers of the Libxc library<ref>{{Cite journal |last=Marques |first=Miguel A.L. |last2=Oliveira |first2=Micael J.T. |last3=Burnus |first3=Tobias |date=2012-10 |title=Libxc: A library of exchange and correlation functionals for density functional theory |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465512001750 |journal=Computer Physics Communications |language=en |volume=183 |issue=10 |pages=2272–2281 |doi=10.1016/j.cpc.2012.05.007}}</ref><ref>{{Cite journal |last=Lehtola |first=Susi |last2=Steigemann |first2=Conrad |last3=Oliveira |first3=Micael J.T. |last4=Marques |first4=Miguel A.L. |date=2018-01 |title=Recent developments in libxc — A comprehensive library of functionals for density functional theory |url=https://linkinghub.elsevier.com/retrieve/pii/S2352711017300602 |journal=SoftwareX |language=en |volume=7 |pages=1–5 |doi=10.1016/j.softx.2017.11.002}}</ref>, which is the largest bibliography of xc functionals. In order to be both human and computer friendly, the Libxc identifiers consist of a human-readable string that has a unique integer associated with it. Often, the above-noted identification needs some refinement, because xc functionals typically depend on a set of parameters, and these may be modified for a given calculation. There is therefore a need to standardize the way in which such parameters are referenced. Just as the Libxc identifiers can be used for the functionals themselves, the Libxc naming scheme can also be used for their internal parameters. Code developers have to ensure that this information is contained in the respective input and/or output files. As Libxc provides version numbers of the xc functionals, it is important that this information is also available.
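As a minimal illustration (not a normative schema), the metadata of a PBE calculation could record the Libxc names together with their integer identifiers and the library version. The integer values shown below are, to the best of our knowledge, the Libxc identifiers for the PBE exchange and correlation parts, but they should be verified against the Libxc release actually used.

<syntaxhighlight lang="python">
# Illustrative annotation of the xc functional of a calculation using Libxc
# identifiers; the field names are examples, not an established schema.
xc_functional_metadata = {
    "xc_functional": [
        {"libxc_name": "GGA_X_PBE", "libxc_id": 101},  # PBE exchange (verify ID)
        {"libxc_name": "GGA_C_PBE", "libxc_id": 130},  # PBE correlation (verify ID)
    ],
    "libxc_version": "6.2.2",   # record the library version explicitly
    "modified_parameters": {},  # any non-default internal parameters, by Libxc name
}
</syntaxhighlight>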
===Basis sets===
Complete and unambiguous specification of the basis set is crucial for judging the precision of a calculation. Ground-state calculations should include the full information about the basis sets used, including a DOI by which a basis set can be referenced. The use of repositories of basis sets, like the Basis Set Exchange repository<ref>{{Cite journal |last=Pritchard |first=Benjamin P. |last2=Altarawy |first2=Doaa |last3=Didier |first3=Brett |last4=Gibson |first4=Tara D. |last5=Windus |first5=Theresa L. |date=2019-11-25 |title=New Basis Set Exchange: An Open, Up-to-Date Resource for the Molecular Sciences Community |url=https://pubs.acs.org/doi/10.1021/acs.jcim.9b00725 |journal=Journal of Chemical Information and Modeling |language=en |volume=59 |issue=11 |pages=4814–4820 |doi=10.1021/acs.jcim.9b00725 |issn=1549-9596}}</ref>, is therefore strongly recommended.
Basis sets can be coarsely divided into two classes, i.e., atom-position-dependent (atom-centered, bond-centered) and cell-dependent (such as plane waves) ones. Also, a combination of both is possible, as, e.g., realized in augmented plane-wave or projector-augmented-wave methods. For the atom-centered basis, the list of centers needs to be provided, and these may even contain positions where no actual atomic nucleus is located. The NOMAD Metainfo contains a rather complete set of metadata to describe atom-centered basis sets. A more complete description of cell-dependent basis sets can be found in the ESCDF, which is planned to be merged with the NOMAD Metainfo.
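The following sketch illustrates, with purely illustrative field names and values, the kind of information a complete basis-set annotation should carry for the two classes discussed above.

<syntaxhighlight lang="python">
# Cell-dependent basis: the cut-off is the decisive convergence parameter.
plane_wave_basis = {
    "class": "cell-dependent",
    "type": "plane waves",
    "cutoff_energy_eV": 500.0,
}

# Atom-position-dependent basis: the per-species sets and the list of centers
# (which may include ghost or bond centers) must be recorded, ideally with a
# PID/DOI pointing to a repository such as the Basis Set Exchange.
atom_centered_basis = {
    "class": "atom-position-dependent",
    "type": "Gaussian-type orbitals",
    "per_species": {"Si": "def2-TZVP", "C": "def2-TZVP"},
    "source": "Basis Set Exchange",   # repository; add the DOI of the basis set
    "additional_centers": [],         # e.g., bond midpoints or ghost atoms
}
</syntaxhighlight>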
===Energy reference===
In order to enable interoperability and reusability of energies computed with different electronic-structure methods, it is necessary to define a “general energy zero.” An analysis of this problem and some clues on how to tackle it were already discussed by some of us in a previous work.<ref name=":2" /> The following is a further attempt to advance and systematize ideas and solutions.
The problem of comparing energies is not restricted to computational materials science and chemistry. In fact, it also arises in experimental chemistry, as, for instance, only enthalpy or entropy differences can be measured, not absolute values. To solve this, chemists have defined a reference state for each element, the "standard state": the element in its natural form at standard conditions. The heat of formation then measures the change from the elements to the compound. In computational materials science and chemistry, we can adopt a similar approach. For each element, we need to define a reference system as the zero of the energy scale. To do so, we introduce some definitions:
*A system is a defined set of one or more atoms, with a given geometry and, if periodic, a given unit cell. It can be an atom, a molecule, a periodic crystal, etc. If relevant, the charge, the spin state, or the magnetic ordering needs to be specified.
*A reference system is a well-defined system to which other systems are compared.
*A calculated energy is the energy obtained by a numerical simulation of a system with given input data and parameters, defining the Hamiltonian (i.e., the DFT xc-functional approximation) or the many-electron model (e.g., Hartree-Fock, MP2, or “coupled-cluster singles, doubles, and perturbative triples,” CCSD[T]), the basis set, and the numerical parameters.
Whether the reference system is an atom, an element in its natural form, some molecule or other system, does not matter, as long as it is well-defined. Defining the system by atoms requires specifying how the orbitals are occupied, whether the atom is spherical, spin-polarized, etc. For each computational method and numerical settings, the energy per atom of the reference system must be calculated. The standard energy is then obtained by subtracting these values (multiplied by the number of constituents) from the calculated total energy. For example, to determine the energy of formation of a molecule like H<sub>2</sub>O or a crystal like SiC, we calculate the difference in total energies as <math>E\left( H_{2}O \right) - E\left( H_{2} \right) - \frac{1}{2}E\left( O_{2} \right)</math> or <math>E\left( {SiC} \right) - E\left( {Si} \right) - E(C)</math>, respectively. Here, H<sub>2</sub> and O<sub>2</sub> are isolated, neutral molecules while Si and C are free, neutral atoms. However, using the energy per atom of Si and C in their crystalline ground-state structure would be an option as well. We propose to tabulate the reference energies for the most common computational methods, so that they can be applied without further computations and preferably automatically by the codes themselves.
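A minimal sketch of how such tabulated reference energies could be used is given below. All numerical values are placeholders; in practice, the per-atom reference energies would be taken from a table computed with exactly the same method and numerical settings as the total energy.

<syntaxhighlight lang="python">
def standard_energy(total_energy, composition, reference_energies):
    """Subtract per-atom reference energies from a calculated total energy.

    composition: dict mapping chemical symbol -> number of atoms in the system
    reference_energies: dict mapping chemical symbol -> energy per atom of the
        chosen reference system, computed with the same method and settings
    """
    return total_energy - sum(n * reference_energies[el]
                              for el, n in composition.items())

# Placeholder numbers (eV), for illustration only.
refs = {"Si": -7890.12, "C": -1028.54}
e_std = standard_energy(total_energy=-8920.31,
                        composition={"Si": 1, "C": 1},
                        reference_energies=refs)
print(f"Energy relative to the reference systems: {e_std:.2f} eV")
</syntaxhighlight>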
One factor here is the choice of the pseudopotential (PP). Irrespective of the method used, the computational settings determine the quality of a calculation. Most decisive here is the basis-set cut-off. For the plane-wave basis, convergence with respect to this parameter is straightforward. In any case, depending on the code, the method, and the details of the calculation, care needs to be taken to define all the adjustable parameters that significantly affect the energy when defining computational methods.
To tabulate standard energies, as suggested above, every computational method needs to be applied to all reference systems. This requires care in choosing the reference systems to ensure that an as-wide-as-possible range of codes and methods are actually suited for these calculations. It may be that some codes cannot constrain the occupancies of atoms, or keep them spherical, which would be a problem if spherical atoms were chosen as the reference. Clearly, periodic crystals such as silicon are not suitable for molecular codes. It is possible, however, that some other codes could help with bridging this gap. For example, FHI-aims<ref>{{Cite journal |last=Blum |first=Volker |last2=Gehrke |first2=Ralf |last3=Hanke |first3=Felix |last4=Havu |first4=Paula |last5=Havu |first5=Ville |last6=Ren |first6=Xinguo |last7=Reuter |first7=Karsten |last8=Scheffler |first8=Matthias |date=2009-11 |title=Ab initio molecular simulations with numeric atom-centered orbitals |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465509002033 |journal=Computer Physics Communications |language=en |volume=180 |issue=11 |pages=2175–2196 |doi=10.1016/j.cpc.2009.06.022}}</ref> is not only capable of simulating crystalline systems, but can also handle atoms and molecules, and it can employ Gaussian-type orbital (GTO) basis sets. Thus, FHI-aims is able to reproduce energy differences between atoms/molecules and crystals. In this way, it can support codes such as Gaussian16 or GAMESS.<ref>{{Cite journal |last=Barca |first=Giuseppe M. J. |last2=Bertoni |first2=Colleen |last3=Carrington |first3=Laura |last4=Datta |first4=Dipayan |last5=De Silva |first5=Nuwan |last6=Deustua |first6=J. Emiliano |last7=Fedorov |first7=Dmitri G. |last8=Gour |first8=Jeffrey R. |last9=Gunina |first9=Anastasia O. |last10=Guidez |first10=Emilie |last11=Harville |first11=Taylor |date=2020-04-21 |title=Recent developments in the general atomic and molecular electronic structure system |url=https://pubs.aip.org/jcp/article/152/15/154102/1058751/Recent-developments-in-the-general-atomic-and |journal=The Journal of Chemical Physics |language=en |volume=152 |issue=15 |pages=154102 |doi=10.1063/5.0005188 |issn=0021-9606}}</ref>
==Metadata for external-perturbation and excited-state electronic-structure calculations==

===Diagrammatic techniques and TDDFT===
The most common application of the ''GW'' approximation (the one-body Green's function ''G'' and the dynamically screened Coulomb interaction ''W''<ref>{{Cite journal |last=Reining |first=Lucia |date=2018-05 |title=The GW approximation: content, successes and limitations |url=https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.1344 |journal=WIREs Computational Molecular Science |language=en |volume=8 |issue=3 |pages=e1344 |doi=10.1002/wcms.1344 |issn=1759-0876}}</ref>) is to compute quasi-particle energies, i.e., energies that describe the removal or addition of a single electron. For this, the many-body electron-electron interaction is described by a two-particle operator, called the electronic self-energy. To compute this object, on the technical side we may need an additional (auxiliary) basis set, not the same as the one used in the ground-state calculation, coming with additional parameters. Likewise, there are various ways of performing the analytical continuation of the Green's function, as there are various ways of carrying out the required frequency integration, possibly employing a plasmon-pole model as an approximation. There are also different ways to evaluate the screened Coulomb potential ''W''. Most important is the flavor of ''GW'', i.e., whether it is done in a single-shot manner, called ''G''<sub>0</sub>''W''<sub>0</sub>, or in a self-consistent way. If the latter, the kind of self-consistency (scf) used must be specified: partial scf, quasi-particle scf, or any other type that would remedy the starting-point dependence, i.e., the dependence of the results on the xc functional of the initial DFT (or Hartree-Fock or similar) calculation used for the ground state (GS).
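The choices that make a ''GW'' calculation unambiguous can thus be summarized, for instance, as in the sketch below; the field names and values are illustrative, not an established schema.

<syntaxhighlight lang="python">
# Illustrative record of the choices that define a GW calculation.
gw_metadata = {
    "flavor": "G0W0",                          # or eigenvalue-self-consistent, qsGW, ...
    "starting_point": "GGA_X_PBE+GGA_C_PBE",   # xc functional of the ground-state run
    "frequency_treatment": "analytic continuation",  # or contour deformation,
                                                     # plasmon-pole model, ...
    "screening": "random-phase approximation",       # how W is evaluated
    "auxiliary_basis": None,                   # name of any additional (auxiliary) basis set
}
</syntaxhighlight>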
While the ''GW'' approximation is the method of choice for quasi-particle energies (and potentially also lifetimes) within the realm of MBPT, we need to solve the Bethe-Salpeter equation (BSE) to tackle electron-hole interactions. This approach should typically be applied on top of a ''GW'' calculation, but often the quasi-particle states are approximated by DFT results adjusted by a scissors operator that widens the band gap to mimic the ''GW'' correction. In all cases, BSE carries along all subtleties from the underlying steps. In addition, it comes with its own issues, like the way of screening the Coulomb interaction (electron-hole this time), the representation of non-local operators, and the like.
===Density-functional perturbation theory===
Density-functional perturbation theory is used to obtain physical properties that are related to the (density-)response of the system to external perturbations, like the displacement potential according to lattice vibrations. Also in this case, the calculation relies on a preliminary GS run, inheriting all issues therefrom. After having chosen the type of perturbation, which requires method-dependent definitions and inputs, one needs to choose the order of perturbation: The linear-response approach, which is implemented in many codes (e.g., VASP<ref>{{Cite journal |last=Kresse |first=G. |last2=Furthmüller |first2=J. |date=1996-10-15 |title=Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set |url=https://link.aps.org/doi/10.1103/PhysRevB.54.11169 |journal=Physical Review B |language=en |volume=54 |issue=16 |pages=11169–11186 |doi=10.1103/PhysRevB.54.11169 |issn=0163-1829}}</ref>, octopus<ref>{{Cite journal |last=Marques |first=M |date=2003-03-01 |title=octopus: a first-principles tool for excited electron–ion dynamics |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465502006860 |journal=Computer Physics Communications |language=en |volume=151 |issue=1 |pages=60–78 |doi=10.1016/S0010-4655(02)00686-0}}</ref>, CASTEP<ref>{{Cite journal |last=Segall |first=M D |last2=Lindan |first2=Philip J D |last3=Probert |first3=M J |last4=Pickard |first4=C J |last5=Hasnip |first5=P J |last6=Clark |first6=S J |last7=Payne |first7=M C |date=2002-03-25 |title=First-principles simulation: ideas, illustrations and the CASTEP code |url=https://iopscience.iop.org/article/10.1088/0953-8984/14/11/301 |journal=Journal of Physics: Condensed Matter |volume=14 |issue=11 |pages=2717–2744 |doi=10.1088/0953-8984/14/11/301 |issn=0953-8984}}</ref>, FHI-aims<ref>{{Cite journal |last=Shang |first=Honghui |last2=Carbogno |first2=Christian |last3=Rinke |first3=Patrick |last4=Scheffler |first4=Matthias |date=2017-06 |title=Lattice dynamics calculations based on density-functional perturbation theory in real space |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465517300437 |journal=Computer Physics Communications |language=en |volume=215 |pages=26–46 |doi=10.1016/j.cpc.2017.02.001}}</ref>, Quantum Espresso<ref>{{Cite journal |last=Giannozzi |first=Paolo |last2=Baroni |first2=Stefano |last3=Bonini |first3=Nicola |last4=Calandra |first4=Matteo |last5=Car |first5=Roberto |last6=Cavazzoni |first6=Carlo |last7=Ceresoli |first7=Davide |last8=Chiarotti |first8=Guido L. |last9=Cococcioni |first9=Matteo |last10=Dabo |first10=Ismaila |last11=Dal Corso |first11=Andrea |date=2009-09-30 |title=QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials |url=https://pubmed.ncbi.nlm.nih.gov/21832390 |journal=Journal of Physics: Condensed Matter |volume=21 |issue=39 |pages=395502 |doi=10.1088/0953-8984/21/39/395502 |issn=1361-648X |pmid=21832390}}</ref>, ABINIT<ref>{{Cite journal |last=Gonze |first=X. |last2=Amadon |first2=B. |last3=Anglade |first3=P.-M. |last4=Beuken |first4=J.-M. |last5=Bottin |first5=F. |last6=Boulanger |first6=P. |last7=Bruneval |first7=F. |last8=Caliste |first8=D. |last9=Caracas |first9=R. |last10=Côté |first10=M. |last11=Deutsch |first11=T. |date=2009-12 |title=ABINIT: First-principles approach to material and nanosystem properties |url=https://linkinghub.elsevier.com/retrieve/pii/S0010465509002276 |journal=Computer Physics Communications |language=en |volume=180 |issue=12 |pages=2582–2615 |doi=10.1016/j.cpc.2009.07.007}}</ref>), allows for the determination of second-order derivatives of the total energy. Some of these codes also allow for the calculation of third-order derivatives, such as those needed for anharmonic vibrational effects. The variation of the Kohn-Sham orbitals can be obtained from the Sternheimer equation, where different methods are used for deriving its solution (iterative methods, direct linearization, integral formulation).
===Quantum-chemistry methods===
To summarize, quantum-chemical methods offer an excellent toolbox for accurate ''ab initio'' calculations for molecules (especially for small and medium-sized ones). However, severe issues concerning reproducibility and replicability remain, in particular for extended and/or open-shell systems. This calls for a more detailed specification of the implemented techniques by the developers, for example through a better design of the outputs, and for a thorough analysis and documentation of the employed methods and parameters by the users. A possible strategy addressing these issues would be two-fold:
#Promoting the compliance of the developed software with the FAIR principles for software<ref>{{Cite journal |last=Lamprecht |first=Anna-Lena |last2=Garcia |first2=Leyla |last3=Kuzak |first3=Mateusz |last4=Martinez |first4=Carlos |last5=Arcila |first5=Ricardo |last6=Martin Del Pico |first6=Eva |last7=Dominguez Del Angel |first7=Victoria |last8=van de Sandt |first8=Stephanie |last9=Ison |first9=Jon |last10=Martinez |first10=Paula Andrea |last11=McQuilton |first11=Peter |date=2020-06-12 |editor-last=Groth |editor-first=Paul |editor2-last=Groth |editor2-first=Paul |editor3-last=Dumontier |editor3-first=Michel |title=Towards FAIR principles for research software |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/DS-190026 |journal=Data Science |volume=3 |issue=1 |pages=37–59 |doi=10.3233/DS-190026}}</ref><ref>{{Cite journal |last=Barker |first=Michelle |last2=Chue Hong |first2=Neil P. |last3=Katz |first3=Daniel S. |last4=Lamprecht |first4=Anna-Lena |last5=Martinez-Ortiz |first5=Carlos |last6=Psomopoulos |first6=Fotis |last7=Harrow |first7=Jennifer |last8=Castro |first8=Leyla Jael |last9=Gruenpeter |first9=Morane |last10=Martinez |first10=Paula Andrea |last11=Honeyman |first11=Tom |date=2022-10-14 |title=Introducing the FAIR Principles for research software |url=https://www.nature.com/articles/s41597-022-01710-x |journal=Scientific Data |language=en |volume=9 |issue=1 |pages=622 |doi=10.1038/s41597-022-01710-x |issn=2052-4463 |pmc=PMC9562067 |pmid=36241754}}</ref>, which comprise the recommendation to publish the software in a repository with [[version control]], have a well-defined license, register the code in a community registry, assign a PID to each version, and enable its proper citation.<ref>{{Cite journal |last=Katz |first=Daniel S. |last2=Chue Hong |first2=Neil P. |last3=Clark |first3=Tim |last4=Muench |first4=August |last5=Stall |first5=Shelley |last6=Bouquin |first6=Daina |last7=Cannon |first7=Matthew |last8=Edmunds |first8=Scott |last9=Faez |first9=Telli |last10=Feeney |first10=Patricia |last11=Fenner |first11=Martin |date=2021-01-12 |title=Recognizing the value of software: a software citation guide |url=https://f1000research.com/articles/9-1257/v2 |journal=F1000Research |language=en |volume=9 |pages=1257 |doi=10.12688/f1000research.26932.2 |issn=2046-1402 |pmc=PMC7805487 |pmid=33500780}}</ref><ref>{{Cite journal |last=Smith |first=Arfon M. |last2=Katz |first2=Daniel S. |last3=Niemeyer |first3=Kyle E. |last4=FORCE11 Software Citation Working Group |date=2016-09-19 |title=Software citation principles |url=https://peerj.com/articles/cs-86 |journal=PeerJ Computer Science |language=en |volume=2 |pages=e86 |doi=10.7717/peerj-cs.86 |issn=2376-5992}}</ref> Reproducibility can be enhanced by publishing software code under a Free Libre Open-Source Software (FLOSS)<ref>{{Cite web |last=Hertz, J.C.; Lucas, M.; Scott, J. |date=April 2006 |title=DoD Open Technology Development (OTD) Roadmap |work=Terry Bollinger Online Resources |url=https://www.terrybollinger.com/index.html#open_source_reports |publisher=Terry Bollinger |accessdate=04 July 2023}}</ref><ref>{{Cite web |last=Stallman, R. |date=11 September 2021 |title=FLOSS and FOSS |work=GNU Operating System |url=https://www.gnu.org/philosophy/floss-and-foss.html |publisher=Richard Stallman |accessdate=04 July 2023}}</ref> license and by documenting the computation environment (hardware, operating system version, and the computational framework and libraries that were used, if any); and
#Creation of well-defined benchmark datasets.
Interoperability among different implementations of (what is intended to be) the same theoretical model can be assessed by the quantitative comparison, over different codes (including different versions thereof), of a set of properties on an agreed-upon set of materials. Such datasets would obviously need to be stored in a FAIR-compliant fashion. A large community-based effort in this direction is being carried out in the DFT community, while in the many-body-theory community, the implementation of this idea is just at its beginning.
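As a schematic example of such a quantitative comparison, the sketch below computes the mean absolute deviation of per-atom energies obtained with two hypothetical codes over a common set of materials; the numbers are placeholders, not real benchmark data.

<syntaxhighlight lang="python">
# Placeholder per-atom energies (eV/atom) from two implementations of the
# same theoretical model, for the same materials and numerical settings.
energies_code_a = {"Si": -5.42, "C": -9.10, "SiC": -7.45}
energies_code_b = {"Si": -5.41, "C": -9.12, "SiC": -7.44}

common = sorted(energies_code_a.keys() & energies_code_b.keys())
mad = sum(abs(energies_code_a[m] - energies_code_b[m]) for m in common) / len(common)
print(f"Mean absolute deviation over {len(common)} materials: {1000 * mad:.1f} meV/atom")
</syntaxhighlight>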
==Metadata for potential-energy sampling==
==Outlook on ontologies in materials science==

In data science, an ontology is a formal representation of the knowledge of a community about a domain of interest, for a purpose. As ontologies are currently less common in basic materials science than in other fields of science, let us explain these terms:
*'''Formal representation''' means that: (1) the ontology is a representation, hence a simplification, or a model, of the target domain, and (2) the attribute "formal" communicates that the ontological terms and the relationships between them must have a deterministic and unambiguous meaning. Furthermore, formal representation implies that the mechanism used to specify the ontology must have a degree of logical processing capability, e.g., inference and reasoning should be possible. Crucially, the attribute "formal" also refers to the fact that an ontology should be machine-readable.
*'''Knowledge''' is the accumulated set of facts, pieces of information, and skills of the experts of the domain of interest that are represented in the ontology.
*The '''community''' influences the ontology in two aspects: (1) it implies an overall agreement between a group of experts/users of the knowledge as represented in the ontology, and (2) it indicates that the ontology is not meant to convince an entire population, nor does it aim to be universal. However, if it fulfills the requirements of larger communities, the ontology will be adopted by broader audiences and will find its way towards standardization.
*The '''domain of interest''' is the common ground for the community, e.g., a scientific discipline, a subdiscipline, or a market segment. It is often used as a boundary to limit the scope of the ontology. It is a proper tool for detecting overlapping concepts, modularizing ontologies, and identifying extension and integration points.
*The '''purpose''' conveys the goals of the ontology designers so that the ontology is applicable to a set of situations. In many ontology design efforts, the purpose is formulated as a collection of so-called competency questions. These questions, and the answers provided to them, identify the intent and viewpoint of the designers and set the potential applications of the ontology.
In practice, ontologies are often mapped onto, and visualized by means of, directed acyclic graphs, where an edge is one of a well-defined set of relationships (e.g., <tt>is a</tt>, <tt>has property</tt>) and each node is a class, i.e., a concept which is specific to the domain of interest. Each node-edge-node triple is interpreted as a subject-predicate-object expression. For instance, in an ontology for catalysis, one could find the triples: “catalytic material–has property–selectivity”, and “selectivity–refers to–reaction product.” Ontologies address the interoperability requirement of FAIR data. By means of a machine-readable formal structure, which can be connected to an existing or ''ex novo'' derived metadata schema of a database, ontologies allow queries over various databases, even from different fields.
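As a minimal sketch, the catalysis triples mentioned above can be expressed with the Python rdflib library; the namespace and the class/property names are hypothetical and only serve to illustrate the subject-predicate-object structure.

<syntaxhighlight lang="python">
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace for an illustrative catalysis ontology.
CAT = Namespace("https://example.org/catalysis#")

g = Graph()
g.bind("cat", CAT)
g.add((CAT.CatalyticMaterial, RDF.type, RDFS.Class))
g.add((CAT.Selectivity, RDF.type, RDFS.Class))
g.add((CAT.ReactionProduct, RDF.type, RDFS.Class))

# The two triples from the text: subject - predicate - object.
g.add((CAT.CatalyticMaterial, CAT.hasProperty, CAT.Selectivity))
g.add((CAT.Selectivity, CAT.refersTo, CAT.ReactionProduct))

print(g.serialize(format="turtle"))
</syntaxhighlight>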
==Abbreviations, acronyms, and initialisms==
*'''AI''': artificial intelligence
*'''aiMD''': ''ab initio'' molecular dynamics (molecular dynamics with ''ab initio'' calculated forces and energies)
*'''API''': application programming interface
*'''BSE''': Bethe-Salpeter equation
*'''CASPT2''': complete active-space second-order perturbation theory
*'''CASSCF''': complete active-space self-consistent field
*'''CC''': coupled cluster
*'''CML''': Chemical Markup Language
*'''CIF''': Crystallographic Information File; Crystallographic Information Framework
*'''DFT''': density-functional theory
*'''DFPT''': density-functional perturbation theory
*'''DMFT''': dynamical mean-field theory
*'''DOI''': digital object identifier
*'''ELN''': electronic laboratory notebook
*'''EMMC''': European Materials Modelling Council
*'''EMMO''': Elemental Multiperspective Material Ontology
*'''ESCDF''': Electronic Structure Common Data Format
*'''ETSF''': European Theoretical Spectroscopy Facility
*'''FAIR''': findable, accessible, interoperable, reusable
*'''FLOSS''': Free Libre Open-Source Software
*'''GS''': ground state
*'''''GW''''': Green's function ''G'' and dynamically screened Coulomb interaction ''W''
*'''IUCr''': International Union of Crystallography
*'''JCAMP-DX''': Joint Committee on Atomic and Molecular Physical Data - Data Exchange
*'''JUMBO''': Java Universal Molecular/Markup Browser for Objects
*'''LIMS''': laboratory information management system
*'''MBPT''': many-body perturbation theory
*'''MD''': molecular dynamics
*'''MDR''': metadata registry
*'''ML''': machine learning
*'''MM''': molecular mechanics
*'''MRCI''': multireference configuration interaction
*'''NOMAD''': Novel-Materials Discovery Laboratory
*'''OPTIMADE''': Open Databases Integration for Materials Design
*'''PID''': persistent identifier
*'''TDDFT''': time-dependent DFT
*'''XML''': Extensible Markup Language
==Acknowledgements==

==Notes==
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Several inline URLs from the original were turned into full citations for this version. The original didn't state what ''GW'' was; for this version, an explanation and citation were added for clarity. The URL to the EMMC and EMMO website was broken when adding this to LIMSwiki; an archived URL was used in its place.
<!--Place all category tags here-->
Revision as of 21:31, 7 November 2023
| Full article title | Shared metadata for data-centric materials science |
|---|---|
| Journal | Scientific Data |
| Author(s) | Ghiringhelli, Luca M.; Baldauf, Carsten; Bereau, Tristan; Brockhauser, Sandor; Carbogno, Christian; Chamanara, Javad; Cozzini, Stefano; Curtarolo, Stefano; Draxl, Claudia; Dwaraknath, Shyam; Fekete, Ádám; Kermode, James; Koch, Christoph T.; Kühbach, Markus; Ladines, Alvin Noe; Lambrix, Patrick; Himmer, Maja-Olivia; Levchenko, Sergey V.; Oliveira, Micael; Michalchuk, Adam; Miller, Ronald E.; Onat, Berk; Pavone, Pasquale; Pizzi, Giovanni; Regler, Benjamin; Rignanese, Gian-Marco; Schaarschmidt, Jörg; Scheidgen, Markus; Schneidewind, Astrid; Sheveleva, Tatyana; Su, Chuanxun; Usvyat, Denis; Valsson, Omar; Wöll, Christof; Scheffler, Matthias |
| Author affiliation(s) | Friedrich-Alexander Universität, Humboldt-Universität zu Berlin, Fritz-Haber-Institut of the Max-Planck-Gesellschaft, University of Amsterdam, TIB – Leibniz Information Centre for Science and Technology and University Library, AREA Science Park, Duke University, Lawrence Berkeley National Laboratory, University of Warwick, Linköping University, Skolkovo Institute of Science and Technology, Max Planck Institute for the Structure and Dynamics of Matter, Federal Institute for Materials Research and Testing, University of Birmingham, Carleton University, École Polytechnique Fédérale de Lausanne, Paul Scherrer Institut, Chemin des Étoiles, Karlsruhe Institute of Technology, Forschungszentrum Jülich GmbH, University of Science and Technology of China, University of North Texas |
| Primary contact | Email: luca dot ghiringhelli at physik dot hu dash berlin dot de |
| Year published | 2023 |
| Volume and issue | 10 |
| Article # | 626 |
| DOI | 10.1038/s41597-023-02501-8 |
| ISSN | 2052-4463 |
| Distribution license | Creative Commons Attribution 4.0 International |
| Website | https://www.nature.com/articles/s41597-023-02501-8 |
| Download | https://www.nature.com/articles/s41597-023-02501-8.pdf (PDF) |
This article should be considered a work in progress and incomplete until this notice is removed.
==Abstract==
The expansive production of data in materials science, as well as their widespread sharing and repurposing, requires educated support and stewardship. In order to ensure that this need helps rather than hinders scientific work, the implementation of the FAIR data principles (that ask for data and information to be findable, accessible, interoperable, and reusable) must not be too narrow. At the same time, the wider materials science community ought to agree on the strategies to tackle the challenges that are specific to its data, both from computations and experiments. In this paper, we present the result of the discussions held at the workshop on “Shared Metadata and Data Formats for Big-Data Driven Materials Science.” We start from an operative definition of metadata and the features that a FAIR-compliant metadata schema should have. We will mainly focus on computational materials science data and propose a constructive approach for the "FAIR-ification" of the (meta)data related to ground-state and excited-states calculations, potential energy sampling, and generalized workflows. Finally, challenges with the FAIR-ification of experimental (meta)data and materials science ontologies are presented, together with an outlook of how to meet them.
Keywords: materials science, data sharing, FAIR data principles, file formats, metadata, ontologies, workflows
==Introduction: Metadata and FAIR data principles==
The amount of data that has been produced in materials science up to today, and its day-by-day increase, are massive.[1] The dawn of the data-centric era[2] requires that such data are not just stored, but also carefully annotated in order to find, access, and possibly reuse them. Terms of good practice to be adopted by the scientific community for the management and stewardship of its data, the so-called FAIR data principles, have been compiled by the FORCE11 group.[3] Here, the acronym "FAIR" stands for "findable, accessible, interoperable, and reusable," which applies not only to data but also to metadata. Other terms for the “R” in FAIR are “repurposable” and “recyclable.” The former term indicates that data may be used for a different purpose than the original one for which they were created. The latter term hints at the fact that data in materials science are often exploited only once for supporting the thesis of a single publication, and then they are stored and forgotten. In this sense, they would constitute a “waste” that can be recycled, provided that they can be found and they are properly annotated.
Before examining the meaning and importance of the four terms of the FAIR acronym, it is worth defining what metadata are with respect to data. To that purpose, we start by introducing the concept of a data object, which represents the collective storage of information related to an elementary entry in a database. One can consider it as a row in a table, where the columns can be occupied by simple scalars, higher-order mathematical objects, strings of characters, or even full documents (or other media objects). In the materials science context, a data object is the collection of attributes (the columns in the above-mentioned table) that represent a material or, even more fundamentally, a snapshot of the material captured by a single configuration of atoms, or it may be a set of measurements from well-defined equivalent samples (see below for a discussion on this concept). For instance, in computational materials science, the attributes of a data object could be both the inputs (e.g., the coordinates and chemical species of the atoms constituting the material, the description of the physical model used for calculating its properties), and the outputs (e.g., total energy, forces, electronic density of states, etc.) of a calculation. Logically and physically, inputs and outputs are at different levels, in the sense that the former determine the latter. Hence, one can consider the inputs as metadata describing the data, i.e., the outputs. In turn, the set of coordinates A that are metadata to some observed quantities, may be considered as data that depend on another set of coordinates B, and the forces acting on the atoms in that set A. So, the set of coordinates B and the acting forces are metadata to the set A, now regarded as data. Metadata can always be considered to be data as they could be objects of different, independent analyses than those performed on the calculated properties. In this respect, whether an attribute of a data object is data or metadata depends on the context. This simple example also depicts a provenance relationship between the data and their metadata.
The above discussion can be summarized in a more general definition of the term metadata:
Metadata are attributes that are necessary to locate, fully characterize, and ultimately reproduce other attributes that are identified as data.
The metadata include a clear and unambiguous description of the data as well as their full provenance. This definition is reminiscent of the definition given by the National Institute of Standards and Technology (NIST)[4]: “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about information or information about information.” With our definition, we highlight the role of data “reproducibility,” which is crucial in science.
Within the “full characterization” requirement, we highlight interpretation of the data as a crucial aspect. In other words, the metadata must provide enough information on a stored value (including, e.g., dimensionless constants) to make it unambiguous whether two data objects may be compared with respect to the value of a given attribute or not.
Next, we should note that, whereas in computational materials science the concept of a data object identified by a single atomic configuration is well defined, in experimental materials science the concept of a class of equivalent samples is very hard to implement operationally. For instance, a single specimen can be altered by a measurement operation and thus cannot, strictly speaking, be measured twice. At the same time, two specimens prepared with the same synthesis recipe may differ in substantial aspects due to the presence of different impurities or even crystal phases, thus yielding different values of a measured quantity. In this respect, here we use the term "equivalent sample" in its abstract, ideal meaning, but we also note that one of the main purposes of introducing well-defined metadata in materials science is to provide enough characterization of experimental samples to put the concept of equivalent samples into practice.
The need for storing and characterizing data by means of metadata is determined by two main aspects, related to data usage. The first aspect is as old as science: reproducibility. In an experiment or computation, all the necessary information needed to reproduce the measured/calculated data (i.e., the metadata) should be recorded, stored, and retrievable. The second aspect becomes prominent with the demand for reusability. Data can and should also be usable for purposes that were not anticipated at the time they were recorded. A useful way of looking at metadata is that they are attributes of data objects answering the questions who, what, when, where, why, and how. For example, “Who has produced the data?”, “What are the data expected to represent (in physical terms)?”, “When were they produced?”, “Where are they stored?”, “For what purpose were they produced?”, and “By means of which methods were the data obtained?”. The latter two questions also refer to the concept of provenance, i.e., the logical sequence of operations that determines, ideally uniquely, the data. Keeping track of the provenance requires the possibility to record the whole workflow that has led to some calculated or measured properties (for more details, see the later section “Metadata for computational workflows”).
From a practical point of view, the metadata are organized in a schema. We summarize what the FAIR principles imply in terms of a metadata schema as follows:
- Findability is achieved by assigning unique and persistent identifiers (PIDs) to data and metadata, describing data with rich metadata, and registering (see below) the (meta)data in searchable resources. Widely known examples of PIDs are digital object identifiers (DOIs) and (permanent) Uniform Resource Identifiers (URIs). According to ISO/IEC 11179, a metadata registry (MDR) is a database of metadata that supports the functionality of registration. Registration accomplishes three main goals: identification, provenance, and monitoring quality. Furthermore, an MDR manages the semantics of the metadata, i.e., the relationships (connections) among them.
- Accessibility is enabled by application programming interfaces (APIs), which allow one to query and retrieve single entries as well as entire archives.
- Interoperability implies the use of formal, accessible, shared, and broadly applicable languages for knowledge representation (these are known as formal ontologies and will be discussed in the later section “Outlook on ontologies in materials science”), use of vocabularies to annotate data and metadata, and inclusion of references.
- Reusability hints at the fact that data in materials science are often exploited only once for a focus-oriented research project, and many data are not even properly stored as they turned out to be irrelevant for the focus. In this sense, many data constitute a “waste” that can be recycled, provided that the data can be found and they are properly annotated.
Establishing one or more metadata schemas that are FAIR-compliant, and that therefore enable the materials science community to efficiently share the heterogeneously and decentrally produced data, needs to be a community effort. The workshop “Shared Metadata and Data Formats for Big-Data Driven Materials Science: A NOMAD–FAIR-DI Workshop” was organized and held in Berlin in July 2019 to ignite this effort. In the following sections, we describe the identified challenges and first-stage plans, divided into the different aspects that crucially need to be addressed in computational materials science.
In the next section, we describe the identified challenges and first plans for FAIR metadata schemas for computational materials science, where we also summarize as an example the main ideas behind the metadata schema implemented in the Novel-Materials Discovery (NOMAD) Laboratory for storing and managing millions of data objects produced by means of atomistic calculations (both ab initio and molecular mechanics), employing tens of different codes, which cover the overwhelming majority of what is actually used in terms of volume-of-data production in the community. We then follow with more detailed sections discussing the specific challenges related to interoperability and reusability for ground-state calculations (Section “Metadata for ground-state electronic-structure calculations”), perturbative and excited-state calculations (Section “Metadata for external-perturbation and excited-state electronic-structure calculations”), potential-energy sampling (molecular dynamics and more, Section “Metadata for potential-energy sampling”), and generalized workflows (Section “Metadata for computational workflows”). Challenges related to the choice of file formats are discussed in Section “File Formats.” Outlooks on metadata schema(s) for experimental materials science and on the introduction of formal ontologies for materials science databases constitute Sections “Metadata schemas for experimental materials science” and “Outlook on ontologies in materials science,” respectively.
==Towards FAIR metadata schemas for computational materials science==
The materials science community realized long ago that it is necessary to structure data by means of metadata schemas. In this section, we describe pioneering and recent examples of such schemas, and how a metadata schema becomes FAIR-compliant.
To our knowledge, the first systematic effort to build a metadata schema for exchanging data in chemistry and materials science is CIF, an acronym that originally stood for "Crystallographic Information File," the data exchange standard file format introduced in 1991 by Hall, Allen and Brown.[5][6] Later, the CIF acronym was extended to also mean "Crystallographic Information Framework"[7], a broader system of exchange protocols based on data dictionaries and relational rules expressible in different machine-readable manifestations. These include the Crystallographic Information File itself, but also, for instance, XML (Extensible Markup Language), a general framework for encoding text documents in a format that is meant to be at the same time human and machine readable. CIF was developed by the International Union of Crystallography (IUCr) working party on Crystallographic Information and was adopted in 1990 as a standard file structure for the archiving and distribution of crystallographic information. It is now well established and is in regular use for reporting crystal structure determinations to Acta Crystallographica and other journals. More recently, CIF has been adapted to different areas of science such as structural biology (mmCIF, the macromolecular CIF[8]) and spectroscopy.[9] The CIF framework includes strict syntax definition in a machine-readable form and dictionary defining (meta)data items. It has been noted that the adoption of the CIF framework in IUCr publications has allowed for a significant reduction of the amount of errors in published crystal structures.[10][11]
An early example of an exhaustive metadata schema for chemistry and materials science is the Chemical Markup Language (CML)[12][13][14], whose first public version was released in 1995. CML is a dictionary for chemical metadata, encoded in XML. CML is accessible (for reading, writing, and validation) via the Java library JUMBO (Java Universal Molecular/Markup Browser for Objects).[14] The general idea of CML is to represent with a common language all kinds of documents that contain chemical data, even though currently (as of the latest update in 2012[15]) the language covers mainly the description of molecules (e.g., IUPAC name, atomic coordinates, bond distances) and of inputs/outputs of computational chemistry codes such as Gaussian03[16] and NWChem.[17] Specifically, in the CML representation of computational chemistry calculations[18], (ideally) all the information on a simulation that is contained in the input and output files is mapped onto a format that is in principle independent of the code itself. Such information is:
- Administrative data like the code version, libraries for the compilation, hardware, user submitting the job;
- Materials-specific (or materials-snapshot-specific) data like computed structure (e.g., atomic species, coordinates), the physical method (e.g., electronic exchange-correlation treatment, relativistic treatment), numerical settings (basis set, integration grids, etc.);
- Computed quantities (energies, forces, sequence of atomic positions in case a structure relaxation or some dynamical propagation of the system is performed, etc.).
The different types of information are hierarchically organized in modules, e.g., environment (for the code version, hardware, run date, etc.), initialization (for the exchange-correlation treatment, spin, charge), molgeom (for the atomic coordinates and the localized basis set specification), and finalization (for the energies, forces, etc.). The most recent release of the CML schema contains more than 500 metadata-schema items, i.e., unique entries in the metadata schema. It is worth noting that CIF is the dictionary of choice for the crystallography domain within CML.
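As a rough illustration of this modular organization, the following Python sketch builds a small CML-like XML document with the standard library; the element and attribute names are simplified placeholders and do not reproduce the official CML schema.

<syntaxhighlight lang="python">
# Illustrative sketch only: a CML-inspired XML document with the module layout
# described above (environment, initialization, molgeom, finalization).
# Element and attribute names are simplified placeholders, not the official CML schema.
import xml.etree.ElementTree as ET

root = ET.Element("module", {"title": "calculation"})

env = ET.SubElement(root, "module", {"title": "environment"})
ET.SubElement(env, "parameter", {"name": "program"}).text = "SomeCode 1.2.3"
ET.SubElement(env, "parameter", {"name": "runDate"}).text = "2023-01-01"

init = ET.SubElement(root, "module", {"title": "initialization"})
ET.SubElement(init, "parameter", {"name": "xcFunctional"}).text = "PBE"
ET.SubElement(init, "parameter", {"name": "spin"}).text = "unpolarized"

geom = ET.SubElement(root, "module", {"title": "molgeom"})
atoms = ET.SubElement(geom, "atomArray")
ET.SubElement(atoms, "atom", {"elementType": "O", "x3": "0.00", "y3": "0.00", "z3": "0.00"})
ET.SubElement(atoms, "atom", {"elementType": "H", "x3": "0.00", "y3": "0.76", "z3": "0.59"})
ET.SubElement(atoms, "atom", {"elementType": "H", "x3": "0.00", "y3": "-0.76", "z3": "0.59"})

final = ET.SubElement(root, "module", {"title": "finalization"})
ET.SubElement(final, "property", {"name": "totalEnergy", "units": "eV"}).text = "-2080.123"

print(ET.tostring(root, encoding="unicode"))
</syntaxhighlight>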
Another long-standing activity is JCAMP-DX (Joint Committee on Atomic and Molecular Physical Data - Data Exchange)[19], a standard file format for exchange of infrared spectra and related chemical and physical information that was established in 1988 and then updated with IUPAC recommendations until 2004. It contains standard dictionaries for infrared spectroscopy, chemical structure, nuclear magnetic resonance (NMR) spectroscopy[20], mass spectrometry[21], and ion-mobility spectrometry.[22] The European Theoretical Spectroscopy Facility (ETSF) File Format Specifications were proposed in 2007[23][24][25], in the context of the European Network of Excellence NANOQUANTA, in order to overcome widely known portability issues of input/output file formats across platforms. The Electronic Structure Common Data Format (ESCDF) Specifications[26] is the ongoing continuation of the ETSF project and is part of the CECAM Electronic Structure Library, a community-maintained collection of software libraries and data standards for electronic-structure calculations.[27]
The largest databases of computational materials science data, AFLOW[28], Materials Cloud[29], Materials Project[30], the NOMAD Repository and Archive[31][32][33], OQMD[34], and TCOD[35], offer APIs that rely on dedicated metadata schemas. Similarly, AiiDA[36][37][38] and ASE[39], which are schedulers and workflow managers for computational materials science calculations, adopt their own metadata schemas. OpenKIM[40] is a library of interatomic models (force fields) and simulation codes that test the predictions of these models, complemented with the necessary first-principles and experimental reference data. Within OpenKIM, a metadata schema is defined for the annotation of the models and reference data. Some of the metadata in all these schemas are straightforward to map onto each other (e.g., those related to the structure of the studied system, i.e., atomic coordinates and species, and simulation-cell specification), while others can be mapped only with some care. The OPTIMADE (Open Databases Integration for Materials Design[41]) consortium has recognized this potential and has recently released the first version of an API that allows users to access a common subset of metadata-schema items, independent of the schema adopted for any specific database/repository that is part of the consortium.
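As an illustration of this common access layer, the following Python sketch queries an OPTIMADE structures endpoint; the base URL is a placeholder (the list of actual providers is maintained at https://providers.optimade.org), while the filter grammar and response fields follow the OPTIMADE specification.

<syntaxhighlight lang="python">
# Minimal sketch of an OPTIMADE query, assuming the `requests` package and a valid
# provider base URL. The URL below is a placeholder, not a real endpoint.
import requests

BASE_URL = "https://example.org/optimade"  # placeholder; replace with a real provider endpoint

response = requests.get(
    f"{BASE_URL}/v1/structures",
    params={
        # OPTIMADE filter grammar: all structures containing both Si and O, with exactly two elements
        "filter": 'elements HAS ALL "Si","O" AND nelements=2',
        "response_fields": "chemical_formula_reduced,nelements",
    },
    timeout=30,
)
response.raise_for_status()

for entry in response.json()["data"]:
    attrs = entry["attributes"]
    print(entry["id"], attrs.get("chemical_formula_reduced"), attrs.get("nelements"))
</syntaxhighlight>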
In order to clarify how a metadata schema can explicitly be FAIR-compliant, we describe as an example the main features of the NOMAD Metainfo, onto which the information contained in the input and output files of atomistic codes, both ab initio and force-field based, is mapped. The first released version of the NOMAD Metainfo is described by Ghiringhelli et al.[26] and it has powered the NOMAD Archive since the latter went online in 2014, thus predating the formal introduction of the FAIR data principles.[3]
Here, we give a simplified description, graphically aided by Fig. 1, which highlights the hierarchical/modular architecture of the metadata schema. The elementary mode in which an atomistic materials science code is run (encompassed by the black rectangle) yields the computation of some observables (Output) for a given System, specified in terms of atomic species arranged by their coordinates in a box, and for a given physical model (Method), including specification of its numerical implementation. Sequences or collections of such runs are often defined via a Workflow. Examples of workflows are:
- Perturbative physical models (e.g., second-order Møller–Plesset, MP2, Green’s function based methods such as G0W0, random-phase approximation, RPA) evaluated using self-consistent solutions provided by other models (e.g., density-functional theory, DFT, Hartree-Fock method, HF) applied on the same System;
- Sampling of some desired thermodynamic ensemble by means of, e.g., molecular dynamics;
- Global- and local-minima structure searches;
- Numerical evaluations of equations of state, phonons, or elastic constants by evaluating energies, forces, and possibly other observables; and
- Scans over the compositional space for a given class of materials (high-throughput screening).
The workflows can also be nested, e.g., a scan over materials (different compositions and/or crystal structures) may contain a local optimization for each material, plus extra calculations based on each locally optimal structure, such as the evaluation of phonons, bulk moduli, or elastic constants.
The NOMAD Metainfo organizes metadata into sections, which are represented in Fig. 1 by the labeled boxes. Sections are a type of metadata that group other metadata, e.g., other sections or quantity-type metadata. The latter are metadata related to scalars, tensors, and strings, which represent the physical quantities resulting from calculations or measurements. In a relational database model, the sections would correspond to tables, where the data objects would be the rows, and the quantity-type metadata the columns. In its simplest realization, a metadata schema is a key-value dictionary, where the key is a name identifying a given metadata item. In the NOMAD Metainfo, similarly to CML, the key is a complex entity grouping several attributes. Each item in the NOMAD Metainfo has attributes, starting with its name, a string that must be globally unique, well defined, intuitive, and as short as possible. Other attributes are the human-understandable description, which clarifies the meaning of the metadata, the parent section, i.e., the section the metadata belongs to, and the type, i.e., whether the metadata is, e.g., a section or a quantity. Another possible type, the category type, will be discussed below. For the quantity-type metadata, other important attributes are the physical units and the shape, i.e., the dimensions (scalar, vector of a certain length, matrix with a certain number of rows and columns, etc.), and the allowed values, for metadata that admit only a discrete and finite set of values.
All definitions in the NOMAD Metainfo have the following attributes:
- A globally unique qualified name;
- Human-readable/interpretable description and expected format (e.g., scalar, string of a given length, array of given size);
- Allowed values;
- Provenance, which is realized in terms of a hierarchical and modular schema, where each data object is linked to all the metadata that contribute to its definition. Related to provenance, an important aspect of the NOMAD Metainfo is its extensibility. This stems from the recognition that reproducibility is an empirical concept; thus, at any time, new, previously unknown or disregarded metadata may be recognized as necessary. The metadata schema must be ready to accommodate such extensions seamlessly.
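Purely as a conceptual sketch (not the actual NOMAD Metainfo implementation), the attributes listed above could be grouped in a small Python data structure as follows; the item shown is an invented example.

<syntaxhighlight lang="python">
# Conceptual sketch of a metadata-schema item carrying the attributes discussed above.
# This is not the NOMAD Metainfo code; names and the example item are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetainfoItem:
    name: str                               # globally unique, short, intuitive name
    description: str                        # human-understandable meaning of the item
    kind: str                               # "section", "quantity", or "category"
    parent_section: Optional[str] = None    # section this item belongs to
    units: Optional[str] = None             # physical units (quantity-type items)
    shape: tuple = ()                       # () scalar, (3,) vector, (3, 3) matrix, ...
    allowed_values: Optional[list] = None   # discrete set of admissible values, if any

total_energy = MetainfoItem(
    name="energy_total",
    description="Total energy of the system for the given method and system.",
    kind="quantity",
    parent_section="calculation",           # illustrative parent-section name
    units="joule",
    shape=(),
)
print(total_energy)
</syntaxhighlight>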
The representation in Fig. 1 is very simplified for tutorial purposes. For instance, a workflow can be arbitrarily complex. In particular, it may contain a hierarchy of sub-workflows. In the currently released version of the NOMAD Metainfo, the elementary-code-run modality is fully supported, i.e., ideally all the information contained in a code run is mapped onto the metadata schema. However, the workflow modality is still under development. An important implication of the hierarchical schema is the mapping of any (complex) workflow onto the schema. That way, all the information obtained by its steps is stored. This is achieved by parsers, which have been written by the NOMAD team for each supported simulation code. One of the outcomes of the parsing is the assignment of a PID to each parsed data object, thus allowing for its localization, e.g., via a URI.
The NOMAD Metainfo is inspired by CML, in particular in being hierarchical/modular. Each instance of a metadata schema is uniquely identified, so that it can be associated with a URI for its convenient accessibility. An instance of a metadata schema can be generated by a dedicated parser, which pairs each parsed value with its corresponding metadata label. As an example, in Listing 1, we show a portion of the YAML file (see Section “File formats”) instantiating the Metainfo for a specific entry of the NOMAD Archive. This entry can be retrieved by typing “entry_id = zvUhEDeW43JQjEHOdvmy8pRu-GEq” in the search bar at https://nomad-lab.eu/prod/v1/gui/search/entries. In Listing 1, key-value pairs are visible, as well as the nested-section structuring.
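Since the actual archive entry cannot be reproduced here, the following sketch (using Python with the PyYAML package) illustrates only the kind of nested, key-value structure such a YAML instantiation exhibits; all keys and values are invented for illustration.

<syntaxhighlight lang="python">
# Illustrative sketch of a nested, key-value (YAML) structure of the kind shown in Listing 1.
# Keys and values are invented and do not reproduce the actual NOMAD Archive entry.
import yaml  # PyYAML

archive_yaml = """
run:
  program:
    name: ExampleCode
    version: "1.0"
  system:
  - atom_labels: [Si, Si]
    lattice_vectors_units: angstrom
  calculation:
  - energy:
      total:
        value: -1234.5
        units: eV
"""

archive = yaml.safe_load(archive_yaml)
print(archive["run"]["program"]["name"])
print(archive["run"]["calculation"][0]["energy"]["total"]["value"])
</syntaxhighlight>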
The modularity and uniqueness together allow for a straightforward extensibility, including customization, i.e., introduction of metadata-schema items that do not need to be shared among all users, but may be used by a smaller subset of users, without conflicts.
In Fig. 1, the solid arrows stand for the relationship is contained in between section-type metadata. A few examples of quantity-type metadata are listed in each box/section. Such metadata are also in an is-contained-in relationship with the section they are listed in. The dashed arrows symbolize the relationship has reference in. In practice, in the example of an Output section, the quantity-type metadata contained in such a section are evaluated for a given system described in a System section and for a given physical model described in a Method section. So, the section Output contains a reference to the specific System and Method sections holding the necessary input information. At the same time, the Output section is contained in a given Atomistic-code run section. These relationships among metadata already build a basic ontology, induced by the way computational data are produced in practice, by means of workflows and code runs. This aspect will be reexamined in the later Section “Outlook on ontologies in materials science.”
We now come to the category-type metadata, which allow complementary, arbitrarily complex ontologies to be built starting from the same metadata. They define a concept, such as “energy” or “energy component,” in order to specify that a given quantity-type metadata item has a certain meaning, be it physical (such as “energy”), computer-hardware related, or administrative. To this purpose, each section- and quantity-type metadata item is related to a category-type metadata item by means of an is a relationship. Each category-type metadata item can be related to another category-type metadata item by means of the same is a relationship, thus building another ontology on the metadata, which can be connected with top-down ontologies such as EMMO[42] (see Section “Outlook on ontologies in materials science” for a short description of EMMO).
The current version of NOMAD Metainfo includes more than 400 metadata-schema items. More specifically, these are the common metadata, i.e., those that are code-independent. Hundreds more metadata are code-specific, i.e., mapping pieces of information in the codes’ input/output that are specific to a given code and not transferable to other codes. The NOMAD Metainfo can be browsed at https://nomad-lab.eu/prod/v1/gui/analyze/metainfo.
To summarize, the NOMAD Metainfo addresses the FAIR data principles in the following sense:
- Findability is enabled by unique names and a human-understandable description;
- Accessibility is enabled by the PID assigned to each metadata-schema item, which can be accessed via a RESTful[43] API (i.e., an API supporting access via web services, through common protocols such as HTTP), specifically developed for the NOMAD Metainfo. Essentially all NOMAD data are open-access, and users who wish to search and download data do not need to identify themselves; they only need to accept the CC BY license. Uploaders can opt for an embargo, in which case the data are shared only with a selected group of colleagues.
- Interoperability is enabled by the extensibility of the schema and the category-type metadata, which can be linked to existing and future ontologies (see Section “Outlook on ontologies in materials science”).
- Reusability/Repurposability/Recyclability is enabled by the modular/hierarchical structure that allows for accessing calculations at different abstraction scales, from the single observables in a code run to a whole complex workflow (see Section “Metadata for Computational Workflows”).
The usefulness and versatility of a metadata schema are demonstrated by the multiple access modalities it allows for. The NOMAD Metainfo schema is the basis of the whole NOMAD Laboratory infrastructure, which supports access to all the data in the NOMAD Archive via the NOMAD API (an implementation of the OPTIMADE API[41] is also supported). This API powers three different access modes of the Archive: the Browser[44], which allows searches for single or groups of calculations; the Encyclopedia[45], which displays the content of the Archive organized by materials; and the Artificial-Intelligence (AI) Toolkit[46][47][48], which combines, in Jupyter Notebooks, script-based queries with AI (machine learning [ML], data mining) analyses of the filtered data. All three services are accessible via a web browser running the dedicated GUI offered by NOMAD.
==Metadata for ground-state electronic-structure calculations==
By ground-state calculations, we mean calculations of the electronic structure—e.g., eigenvalues and eigenfunctions of the single-particle Kohn-Sham equations, the electron density, the total energy and possibly its derivatives (forces, force constants)—for a fixed configuration of nuclei. This refers to a point located on the Born-Oppenheimer potential-energy surface, and is a necessary step in geometry optimization, molecular dynamics, the computation of vibrational (phonon) spectra or elastic constants, and more. Thus, ground-state calculations represent the most common task in computational materials science, and the involved approximations are relatively well established. For this reason, they are already extensively covered by the NOMAD Metainfo. Nevertheless, some challenges in defining metadata for such calculations still remain, as discussed below. In particular, density-functional theory (DFT) is the workhorse approach for the great majority of ground-state calculations in materials science. Highly accurate quantum-chemistry models are more computationally expensive than DFT and their use in applications is less widespread. However, they can provide accurate benchmark references for DFT, making high-quality quantum-chemical data essential also for DFT-based studies. Below we analyze the ground-state electronic structure calculations mainly in reference to DFT, but most of the stated principles are also valid for quantum-chemical calculations. A detailed discussion of the latter is deferred to Section “Quantum-chemistry methods.”
===Approximations to the DFT exchange-correlation functional===
Approximations to the DFT exchange-correlation (xc) functionals are identified by a name or acronym (e.g., “PBE”), although sometimes this identification is not unique or complete. As metadata, we suggest using the identifiers of the Libxc library[49][50], which is the largest collection of xc functionals. In order to be both human and computer friendly, the Libxc identifiers consist of a human-readable string that has a unique integer associated with it. Often, the above-noted identification needs some refinement, because xc functionals typically depend on a set of parameters, and these may be modified for a given calculation. Obviously, there is a need to standardize the way in which such parameters are referenced. Just as it is possible to use the Libxc identifiers for the functionals themselves, one may also use the Libxc naming scheme for their internal parameters. Code developers then have to ensure that this information is contained in the respective input and/or output files. As Libxc provides version numbers of the xc functionals, it is important that this information is also made available.
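A minimal sketch of such an annotation, assuming the Libxc naming scheme, is given below; the integer identifiers, parameter fields, and version string are illustrative and should be taken from the Libxc release actually used.

<syntaxhighlight lang="python">
# Hedged sketch of how an xc-functional specification could be annotated following
# the Libxc naming scheme. Integer IDs and the version string are illustrative.
xc_functional_metadata = {
    "xc_functional": [
        {
            "libxc_name": "GGA_X_PBE",   # human-readable Libxc identifier (exchange part)
            "libxc_id": 101,             # integer associated with the identifier (illustrative)
            "parameters": {},            # non-default internal parameters, if any
        },
        {
            "libxc_name": "GGA_C_PBE",   # correlation part
            "libxc_id": 130,             # illustrative
            "parameters": {},
        },
    ],
    "libxc_version": "6.2.2",            # version of the Libxc library used (illustrative)
}

print(xc_functional_metadata["xc_functional"][0]["libxc_name"])
</syntaxhighlight>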
===Basis sets===
Complete and unambiguous specification of the basis set is crucial for judging the precision of a calculation. Ground-state calculations should include the full information about the basis sets used, including a DOI by which a basis set can be referenced. The use of repositories of basis sets, like the Basis Set Exchange repository[51], is therefore strongly recommended.
Basis sets can be coarsely divided into two classes, i.e., atom-position-dependent (atom-centered, bond-centered) and cell-dependent (such as plane waves) ones. Also, a combination of both is possible, as, e.g., realized in augmented plane-wave or projector-augmented-wave methods. For the atom-centered basis, the list of centers needs to be provided, and these may even contain positions where no actual atomic nucleus is located. The NOMAD Metainfo contains a rather complete set of metadata to describe atom-centered basis sets. A more complete description of cell-dependent basis sets can be found in the ESCDF, which is planned to be merged with the NOMAD Metainfo.
===Energy reference===
In order to enable interoperability and reusability of energies computed with different electronic-structure methods, it is necessary to define a “general energy zero.” An analysis of this problem and some clues on how to tackle it were already discussed by some of us in a previous work.[26] The following is a further attempt to advance and systematize ideas and solutions.
The problem of comparing energies is not restricted to computational materials science and chemistry. In fact, it also arises in experimental chemistry where, for instance, only enthalpy or entropy differences can be measured, but not absolute values. To solve this, chemists have defined a reference state for each element, the "standard state," i.e., the element in its natural form at standard conditions, while the heat of formation measures the change from the elements to the compound. In computational materials science and chemistry, we can adopt a similar approach. For each element, we need to define a reference system as the zero of the energy scale. To do so, we introduce some definitions:
- A system is a defined set of one or more atoms, with a given geometry and, if periodic, a given unit cell. It can be an atom, a molecule, a periodic crystal, etc. If relevant, the charge, the spin-state or magnetic ordering needs to be specified.
- A reference system is a well-defined system to which other systems are compared.
- A calculated energy is the energy obtained by a numerical simulation of a system with given input data and parameters, defining the Hamiltonian (i.e., the DFT xc-functional approximation) or the many-electron model (e.g., Hartree-Fock, MP2, or “coupled-cluster singles, doubles, and perturbative triples,” CCSD(T)), the basis set, and the numerical parameters.
Whether the reference system is an atom, an element in its natural form, some molecule, or another system does not matter, as long as it is well-defined. Defining the system by atoms requires specifying how the orbitals are occupied, whether the atom is spherical, spin-polarized, etc. For each computational method and set of numerical settings, the energy per atom of the reference system must be calculated. The standard energy is then obtained by subtracting these values (multiplied by the number of constituents) from the calculated total energy. For example, to determine the energy of formation of a molecule like H2O or a crystal like SiC, we calculate the difference in total energies as E_f(H2O) = E(H2O) − E(H2) − ½E(O2) or E_f(SiC) = E(SiC) − E(Si) − E(C), respectively. Here, H2 and O2 are isolated, neutral molecules while Si and C are free, neutral atoms. However, using the energy per atom of Si and C in their crystalline ground-state structure would be an option as well. We propose to tabulate the reference energies for the most common computational methods, so that they can be applied without further computations, preferably automatically by the codes themselves.
Finally, we need to define what is meant by a computational method. The Hamiltonian and DFT functional are clearly part of the definition, as are the basis set and the potential shape (including pseudopotentials (PP) and effective core potentials). The specific implementation may also be relevant: Gaussian-based molecular-orbital codes may give the same energy for an identical setup (see Section “Quantum-chemistry methods”), while plane-wave DFT codes may not.
One factor here is the choice of the PP. Irrespective of the method used, the computational settings determine the quality of a calculation. Most decisive here is the basis-set cut-off. For the plane-wave basis, convergence with respect to this parameter is straightforward. In any case, when defining computational methods, care needs to be taken, depending on the code, the method, and the details of the calculation, to specify all the adjustable parameters that significantly affect the energy.
To tabulate standard energies, as suggested above, every computational method needs to be applied to all reference systems. This requires care in choosing the reference systems, to ensure that as wide a range of codes and methods as possible is actually suited for these calculations. It may be that some codes cannot constrain the occupancies of atoms, or keep them spherical, which would be a problem if spherical atoms were chosen as the reference. Clearly, periodic crystals such as silicon are not suitable for molecular codes. It is possible, however, that some other codes could help with bridging this gap. For example, FHI-aims[52] is not only capable of simulating crystalline systems, but can also handle atoms and molecules, and it can employ Gaussian-type orbital (GTO) basis sets. Thus, FHI-aims is able to reproduce energy differences between atoms/molecules and crystals. In this way, it can support codes such as Gaussian16 or GAMESS.[53]
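The bookkeeping implied by such tabulated reference energies can be sketched as follows; the method identifier and all numerical values are placeholders.

<syntaxhighlight lang="python">
# Minimal sketch: tabulated per-atom reference energies for one computational method
# are subtracted from a calculated total energy to obtain a "standard energy".
# All numerical values and the method label are placeholders.
reference_energy_per_atom = {
    # (method identifier, element) -> reference energy per atom in eV (placeholder values)
    ("PBE/plane-wave/500eV", "Si"): -0.1,
    ("PBE/plane-wave/500eV", "C"): -0.2,
}

def standard_energy(total_energy_ev, composition, method):
    """Subtract reference energies (times the number of atoms of each element)."""
    offset = sum(n * reference_energy_per_atom[(method, el)] for el, n in composition.items())
    return total_energy_ev - offset

# Example: a hypothetical SiC cell with 4 Si and 4 C atoms and a placeholder total energy.
print(standard_energy(-45.0, {"Si": 4, "C": 4}, "PBE/plane-wave/500eV"))
</syntaxhighlight>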
==Metadata for external-perturbation and excited-state electronic-structure calculations==
A direct link from the DFT ground state (GS) to excitations is provided by time-dependent DFT (TDDFT). Alternatively, charged and neutral electronic excitations are described by means of Green-function approaches from many-body perturbation theory (MBPT). This route is predominantly (but not exclusively) used for the solid state, while TDDFT and quantum-chemistry approaches are typically preferred for finite systems. For strongly correlated materials, in turn, dynamical mean-field theory (DMFT) is often the methodology of choice, potentially combined with DFT and Green-function methods. Lattice excitations, if not directly treated by DFT molecular dynamics, are often handled by density-functional perturbation theory (DFPT); for their interaction with light, Green-function techniques are also used. DFPT not only allows for the description of vibrational properties, but also for treating macroscopic electric fields, applied macroscopic strains, or combinations of these. The type of perturbation is intimately related to the physical properties of interest, e.g., harmonic and anharmonic phonons, effective charges, Raman tensors, dielectric constants, hyper-polarizabilities, and many others.
Characterizing the corresponding research data is a complex task, for various reasons. First, such calculations rely on an underlying ground-state calculation, and thus carry along all uncertainties from it. Second, the methodology for excited states is scientifically and technically more involved, as it includes the many-body effects that govern diverse interactions. The methods thus rely on various, often not fully characterized, approximations.
===Diagrammatic techniques and TDDFT===
The most common application of the GW approximation (the one-body Green's function G and the dynamically screened Coulomb interaction W[54]) is to compute quasi-particle energies, i.e., energies that describe the removal or addition of a single electron. For this, the many-body electron-electron interaction is described by a non-local, frequency-dependent operator, called the electronic self-energy. To compute this object, on the technical side we may need an additional (auxiliary) basis set, not the same as the one used in the ground-state calculation, which comes with additional parameters. Likewise, there are various ways of performing the analytical continuation of the Green’s function, as there are various ways of carrying out the required frequency integration, possibly employing a plasmon-pole model as an approximation. There are also different ways to evaluate the screened Coulomb potential W. Most important is the flavor of GW, i.e., whether it is done in a single-shot manner, called G0W0, or in a self-consistent way. In the latter case, one must specify which kind of self-consistency (scf) is used: partial scf, quasi-particle scf, or any other type that would remedy the starting-point dependence, i.e., the dependence of the results on the xc functional of the initial DFT (or Hartree-Fock or the like) calculation used for the GS.
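A hedged sketch of metadata items that could capture the GW choices just discussed is shown below; the keys and values are purely illustrative and do not constitute an established schema.

<syntaxhighlight lang="python">
# Illustrative sketch only: possible metadata items characterizing a GW calculation.
# Keys and values are invented for illustration, not an established schema.
gw_metadata = {
    "gw_flavor": "G0W0",                              # or a self-consistent variant (e.g., evGW, qsGW)
    "starting_point_xc": "GGA_X_PBE+GGA_C_PBE",       # xc functional of the underlying DFT run
    "frequency_treatment": "analytic_continuation",   # or "contour_deformation", "plasmon_pole"
    "screening": "RPA",                               # how W is evaluated
    "auxiliary_basis": "example-RI-basis",            # additional (auxiliary) basis set, if used
}
print(gw_metadata["gw_flavor"])
</syntaxhighlight>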
While the GW approximation is the method of choice for quasi-particle energies (and potentially also lifetimes) within the realm of MBPT, we need to solve the Bethe-Salpeter equation (BSE) to tackle electron-hole interactions. This approach should typically be applied on top of a GW calculation, but often the quasi-particle states are approximated by DFT results, adjusted by a scissors operator that widens the band gap to mimic the GW correction. In all cases, BSE carries along all subtleties from the underlying steps. In addition, it comes with its own issues, like the way of screening the Coulomb interaction (electron-hole this time), the representation of non-local operators, and the like.
DMFT, as a rather young and quickly developing field, naturally features a plethora of “experimental” implementations, differing in many aspects, with one of the major obstacles being the vast number of possible combinations of software. Some of the approaches are computationally light, allowing for the construction of model Hamiltonians based on DFT calculations; others are computationally too demanding and can be applied only to simple systems with a few orbitals; most of the methods rely on Green’s functions and self-energies. Diagrammatic extensions beyond standard DMFT methods employ various kinds of vertex functions. Other issues concern how to handle the Coulomb interactions, whose parameters can either be chosen empirically or calculated from first principles.
Specific issues of TDDFT concern, in the first place, the distinction between the linear-response regime and the time propagation of the electronic states in the presence of a time-dependent potential. For the former, the xc kernel plays the same role as the xc functional of the GS, raising (besides numerical precision) questions related to accuracy. For the latter, there are various ways and flavors of implementing the time-evolution operator. One can write this operator as a simple exponential or use more elaborate expressions, like the Magnus expansion or enforced time-reversal symmetry. Regarding the exponential, one can employ a Crank-Nicolson expansion, expand in a Taylor series, or employ Houston states. Each of these comes with its own approximations and, additionally, numerical issues.
In summary, all the variety captured by the different methods, together with the related multitude of computational parameters, needs to be carefully reflected in the metadata schema. This is imperative not only for ensuring reproducible results but also for evaluating the accuracy of methods and commonly used approximations. Moreover, further subtleties related to the algorithms in the actual implementations in different codes require the code developers to embark on this challenge.
===Density-functional perturbation theory===
Density-functional perturbation theory is used to obtain physical properties that are related to the (density) response of the system to external perturbations, such as the atomic displacements associated with lattice vibrations. Also in this case, the calculation relies on a preliminary GS run, inheriting all issues therefrom. After having chosen the type of perturbation, which requires method-dependent definitions and inputs, one needs to choose the order of the perturbation: the linear-response approach, which is implemented in many codes (e.g., VASP[55], octopus[56], CASTEP[57], FHI-aims[58], Quantum Espresso[59], ABINIT[60]), allows for the determination of second-order derivatives of the total energy. Some of these codes also allow for the calculation of third-order derivatives, e.g., for anharmonic vibrational effects. The variation of the Kohn-Sham orbitals can be obtained from the Sternheimer equation, where different methods are used for deriving its solution (iterative methods, direct linearization, integral formulation).
===Quantum-chemistry methods===
Quantum chemistry offers several methodological hierarchies for calculating quantities related to excited states, such as excitation energies, transition moments, ionization potentials, etc. As high-quality methods are computationally intensive, without additional approximations such methods can be applied to relatively small molecular systems only.
Among the standard quantum chemical approaches that can be routinely applied to study excited states of small to medium-sized molecules one can distinguish two large groups, i.e., single-reference and multi-reference methods. The single-reference coupled-cluster (CC) hierarchy for excited states can be formulated in terms of the so-called equation-of-motion approach or time-dependent linear response.
Generally, for well-behaved closed-shell molecules, the single-reference quantum-chemical methods can be used as a black box. The formalisms of the MPn and CC models are uniquely defined and well documented. The GTO basis sets from the standard basis-set families (Pople, Dunning, etc.) are also uniquely defined by their acronyms. In practical implementations of these methods, various thresholds are of course introduced for prescreening, convergence, etc., but the default values for these thresholds are routinely set very conservatively to guarantee a sub-microhartree precision of the final total energies. Problems might, however, arise due to the iterative character of most of the mentioned techniques, as convergence to a certain state (both in the ground-state and/or excited-state parts of the calculation) depends on the starting guess, preconditioner, possible level shifts, type of convergence accelerator, etc. Unfortunately, the parameters that control the convergence are often not sufficiently well documented and might not be found in the output. Such problems mainly occur in open-shell cases (note that in the Delta methods at least one of the calculations has to involve an open-shell system). Sometimes a cross-check between several codes becomes essential to detect convergence faults.
When it comes to larger systems, for which approximate CC models are utilized, the importance of the involved tolerances and underlying protocols substantially increases. The approximations can include, for example, the density-fitting technique, local approximations, the Laplace transform, and others. Important parameters here are the auxiliary basis set, the fitting metric, the type of fitting (local or non-local) and, if local, how the fit domains are determined, etc. The results of calculations that use local correlation techniques are influenced by the choice of the virtual space and the corresponding truncation protocols and tolerances, the pair hierarchies and the corresponding approximations for the CC terms, etc. For Laplace-transform-based methods, the details of the numerical quadrature matter. Unfortunately, these subtleties are very specific and technical and, even if given in the output, can hardly be properly understood and analyzed by non-specialists who are not involved in the development of the related methods. Therefore, the protocols behind the approximations are usually appropriately automatized, and the defaults are chosen such that, for certain (benchmark) sets of systems, the deviations in the energy are substantially smaller than the expected error of the method itself (e.g., 0.01 eV for the excitation energy). However, for these methods, additional benchmarks and cross-checks between different programs and approaches would be very important.
Multi-reference methods come in quite a number of different flavors, of which the most widely used are complete active-space self-consistent field (CASSCF), complete active-space second-order perturbation theory (CASPT2), and multi-reference configuration interaction (MRCI). For difficult cases (e.g., strongly correlated systems), these methods might remain the only option to obtain qualitatively and quantitatively correct results. Unfortunately, compared to the single-reference methods, they are computationally expensive and much less of a black box. First of all, for each calculation one has to specify the active space or spaces, and the results may depend dramatically on this choice. Furthermore, the underlying theory is not always uniquely defined by the acronym used. For example, different formulations of CASPT2, MRCI, or other theories are not mutually equivalent, depending on whether and how much internal contraction is used, and additional approximations that neglect certain terms (e.g., many-electron density matrices) can be implicitly invoked. Besides, certain deficiencies of these methods, such as the lack of size consistency in MRCI or intruder states in CASPT2, are often corrected by additional (sometimes empirical) schemes, which again are not always fully specified. All this makes the interpretation of deviations in results and cross-checks of these methods less conclusive.
To summarize, quantum-chemical methods offer an excellent toolbox for accurate ab initio calculations for molecules (especially small and medium-sized ones). However, severe issues concerning reproducibility and replicability remain, in particular for extended and/or open-shell systems. This calls for a more detailed specification of the implemented techniques by the developers, for example through a better design of the outputs, and for a thorough analysis and documentation of the employed methods and parameters by the users. A possible strategy addressing these issues would be two-fold:
- Promoting the compliance of the developed software with the FAIR principles for software[61][62], which comprise the recommendation to publish the software in a repository with version control, have a well-defined license, register the code in a community registry, assign to each version a PID, and enable its proper citation.[63][64] Reproducibility can be enhanced by publishing software code under the Free Libre Open-Source Software (FLOSS)[65][66] license and by documenting the computation environment (hardware, operating system version, computational framework and libraries that were used, if any); and
- Creation of well-defined benchmark datasets.
Interoperability among different implementations of (what is intended to be) the same theoretical model can be assessed by the quantitative comparison, over different codes (including different versions thereof), of a set of properties on an agreed-upon set of materials. Such datasets would obviously need to be stored in a FAIR-compliant fashion. A large community-based effort in this direction is being carried out in the DFT community, while in the many-body-theory community the implementation of this idea is just at its beginning.
==Metadata for potential-energy sampling==
Molecular dynamics (MD) simulations model the time evolution of a system. They employ either ab initio calculated forces and energies (aiMD) or molecular mechanics (MM), i.e., forces and energies defined through empirical atomistic and coarse-grained potentials. The FAIR storing and sharing of their inputs and outputs comes with a number of specific challenges in comparison to single-point electronic-structure calculations.
Conceptually, aiMD and MM are similar, as a sequence of system configurations is evolved at discrete time steps. Positions, velocities, and forces at a given time step are used to evaluate positions and velocities, and hence forces, in the new configuration, and so on. In practice, MM simulations are orders of magnitude faster than aiMD, enabling much longer time scales and/or much larger system sizes. Even though the trend towards massive parallelization will, in the near future, enable aiMD to handle system sizes comparable to today’s standards for MM simulations, the latter will probably always enable larger systems. However, with machine-learned potentials and active-learning techniques for their training, aiMD and MM may grow together in the future.
In this Section, we focus on challenges more specific to MM simulations, having in mind large length scales, long time scales, and complex phase-space-exploration algorithms and workflows. They can be summarized as follows:
- (i). In many cases, the investigated systems feature thousands of atoms with complex short- and long-range order and disorder, e.g., describing microstructural evolution such as crack propagation. This requires large, complex simulation cells with a range of chemical species to be correctly described and categorized.
- (ii). Force fields exist in a wide variety of flavors that require proper classification. On top of that, they allow for granular fine-tuning of the interactions, even for individual atoms. Faithfully representing complex force fields thus also requires capturing the chemical-bonding topology that is often needed to define the actual interactions.
- (iii). The large length and long time scales presently come together with a multitude of simulation protocols, which use specific boundary conditions, thermostats, constraints, integrators, etc. The various approaches enable additional observables to be computed as statistical averages or correlations. Representing these properties implies the need to efficiently store and access large volumes of data, e.g., trajectories, including positions, and possibly also velocities and forces, for each atom at each time step.
For the purpose of illustration, we start by identifying some typical use cases, then describe what is currently implemented in the NOMAD infrastructure and what is missing. The examples we adopt fall into two classes: (i) high-throughput studies of systems that are individually simple (1,000–10,000 particles), where the value of sharing comes from the ability to run analyses across many variants of, e.g., chemical composition or force field; and (ii) sporadic simulations of very large systems or very long time scales, which cannot readily be repeated by other researchers and are thus individually valuable to share. Examples of the first class could be MD simulations in the NVT ensemble for liquid butane or bulk silicon, using well-defined standard force fields (e.g., CHARMM or Stillinger-Weber). Quantities of interest are typically computed during MD simulations (e.g., liquid densities). For flexibility, full trajectory files should also be stored, but some important observables might be worth precomputing (e.g., radial distribution functions). The second class could include multi-billion-atom MD simulations of dislocation formation[68] or solidification[69,70], or very long time-scale simulations of protein folding.[71] For more complex use cases, the current infrastructure as discussed in Section “Towards FAIR metadata schemas for computational materials science” is not yet sufficient. The challenges to be addressed are the need for support for (i) complex, heterogeneous, possibly multi-resolution systems; (ii) custom force fields; (iii) advanced sampling; (iv) classes of sampling besides MD (e.g., Monte Carlo, global structure prediction/search); and (v) larger simulations (i.e., the need for sparsification of the stored data with minimal loss of information).
Complex systems include heterogeneous systems, e.g., adsorbates on surfaces, interfaces, and solute (macro)molecules in solvent fluids, as well as multi-resolution systems, i.e., systems that are described at different levels of granularity. The representation of complex systems requires a hierarchy of structural components, from atoms, through moieties and molecules, to larger (super)structures. Annotating such complexity will require human intervention as well as algorithms for automatically recognizing the structural elements (see, e.g., Leitherer et al.[72]).
Annotation of force fields in publicly accessible databases has been pioneered by OpenKIM[40] in materials science and MoSDeF[73] for soft matter. However, many simulations are performed with customized force fields. The field is already being augmented, and will likely be further enriched, by machine-learned (ML) force fields. So far, the great majority of ML force fields are used only in the publication where they are defined. The reusability-oriented annotation of force fields, including ML ones, also requires establishing criteria for comparing them. Comparisons can be carried out by means of standardized benchmark datasets, with a well-defined set of properties; differences among predicted properties can then establish a metric for the similarity of the force fields.
Advanced sampling techniques (e.g., metadynamics[74], umbrella sampling[75], replica exchange[76], transition-path sampling[77], and forward-flux sampling[78]) are typically supported by libraries such as PLUMED[79] and OpenPathSampling.[80] These libraries are used as plugins to codes performing classical-force-field-based MD (e.g., GROMACS[81], DL_POLY[82], LAMMPS[83]), ab initio MD (e.g., CP2K[84] and Quantum Espresso[58]), or both (e.g., i-PI[85]). The input and output of these plugins will serve as the basis for the metadata related to these sampling techniques. In this regard, it would also be interesting to connect materials science databases, such as the NOMAD Repository and Archive[31] or the Materials Cloud Archive[29], to the PLUMED-NEST[86], the public repository of the PLUMED consortium[87], for example by allowing for automatic uploading of PLUMED input files to the PLUMED-NEST when uploading to the data repositories.
For long time- and large length-scale simulations, several questions arise: How should we deal with simulations where the amount of data produced becomes too large to systematically store and share? Can we afford to store and share all of it? If storage is limited or data retrieval is impractically slow, how can we identify the significant and crucial parts of the simulation and store them in a reduced form? Keeping the full data locally and sharing the metadata together with only the important parts of the simulations would be a viable alternative, assuming the different servers have enough redundancy. Standard analysis techniques, such as similarity analysis and the monitoring of dynamics, can also be used to identify changes in structure and dynamics, so as to store only the significant frames or specific regions in MD simulations (e.g., some QM/MM models use large MM buffer-atom regions that may not need to be stored entirely). Furthermore, on the one hand, the cost/benefit of storing versus running a new simulation must be weighed. On the other hand, researchers may soon face increased requirements from funding agencies to store their data for a number of years, in which case the present endeavor offers a convenient implementation. We note ongoing developments of compression algorithms for trajectories; see, for example, the work of Brehm and Thomas.[88]
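As a minimal illustration of the sparsification idea, the following sketch simply keeps every n-th frame of a synthetic trajectory; in practice, the fixed stride would be replaced by a selection based on detected changes in structure or dynamics.

<syntaxhighlight lang="python">
# Minimal sparsification sketch: keep only every n-th frame of a trajectory before sharing.
# The trajectory here is random placeholder data; real data would come from an MD engine.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_atoms, stride = 10_000, 1_000, 100

# Placeholder trajectory: atomic positions with shape (n_frames, n_atoms, 3).
trajectory = rng.normal(size=(n_frames, n_atoms, 3)).astype(np.float32)

reduced = trajectory[::stride]            # keep every 100th frame
np.save("trajectory_reduced.npy", reduced)
print(f"{trajectory.nbytes / 1e6:.0f} MB -> {reduced.nbytes / 1e6:.0f} MB")
</syntaxhighlight>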
==Metadata for computational workflows==
A computational workflow represents the coordinated execution of repeatable (computational) steps while accounting for dependencies and concurrency of tasks. In other words, a workflow can be thought of as a script, a wrapper code that manages the scheduling of other codes by controlling what should run in parallel and what sequentially and/or iteratively. This definition can be extended to workflows in experimental materials science or hybrid computational-experimental investigations but, consistent with the previous sections, we limit the discussion to computational aspects only.
Once shared, workflows become useful building blocks that can be combined or modified for developing new ones. Furthermore, FAIR data can be reused as part of workflows completely unrelated to the workflows with which they were generated. An obvious example is AI-based data analytics, which can entail complex workflows involving data originally created for different purposes. During the last decade, the interest in workflow development has grown considerably in the scientific community[89], and various multi-purpose engines for managing calculation workflows have been developed, including AFLOW[28,90,91], AiiDA[36,92], ASE[39], and Fireworks.[93] Using these infrastructures, a number of workflows have been used for scientific purposes, like convergence studies[94], equations of state (e.g., the AFLOW Automatic Gibbs Library[95] and the AiiDA common workflows, ACWF[96]), phonons[97,98,99,100,101], elastic properties (e.g., the elastic-properties library for Inorganic Crystalline Compounds of the Materials Project[102], the AFLOW Automatic Elasticity Library, AEL[103], and ElaStic[104]), anharmonic properties (e.g., the Anharmonic Phonon Library, APL[105], and the AFLOW Automatic Anharmonic Phonon Library, AAPL[106]), high-throughput screening in the compositional space (e.g., AFLOW Partial Occupation, POCC[107]), charge transport (e.g., in organic semiconductors[108,109]), studies of covalent organic frameworks (COFs) for gas-storage applications[110], spin-dynamics simulations[111], high-throughput automated extraction of tight-binding Hamiltonians via Wannier functions[112], and high-throughput on-surface chemistry.[113]
There are two types of metadata associated with workflows. Thinking of a workflow as a code to be run, the first type of metadata characterizes the code itself. The second type is the annotation of a run of a workflow, i.e., its inputs and outputs. This type of metadata has already been described in the Section “Towards FAIR metadata schemas for computational materials science,” together with a schematic list of possible workflow classes. It is important to realize that the inputs and outputs of the elementary-mode runs of the atomistic codes that are invoked in a workflow run are complemented by the inputs and outputs of the overarching workflows. A simple example: in an equation-of-state type of workflow, the energy and volume per unit cell of each single configuration that is part of the workflow are outputs of the elementary runs of the code, while the energy-vs-volume equation of state, e.g., fit to the Birch-Murnaghan model, is an output of the workflow.
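To make the distinction concrete, the following sketch mimics the workflow-level output of such an equation-of-state workflow: synthetic energy-volume pairs, standing in for the outputs of the elementary code runs, are fitted to the third-order Birch-Murnaghan model with SciPy. All numbers are placeholders.

<syntaxhighlight lang="python">
# Sketch of a workflow-level output: energy-volume pairs (here synthetic placeholders,
# standing in for elementary code-run outputs) fitted to the third-order Birch-Murnaghan
# equation of state.
import numpy as np
from scipy.optimize import curve_fit

def birch_murnaghan(v, e0, v0, b0, b0_prime):
    """Third-order Birch-Murnaghan energy-volume relation."""
    eta = (v0 / v) ** (2.0 / 3.0)
    return e0 + 9.0 * v0 * b0 / 16.0 * (
        (eta - 1.0) ** 3 * b0_prime + (eta - 1.0) ** 2 * (6.0 - 4.0 * eta)
    )

volumes = np.linspace(35.0, 45.0, 9)                        # unit-cell volumes (placeholder units)
energies = birch_murnaghan(volumes, -10.0, 40.0, 0.6, 4.5)  # synthetic "code-run outputs"

popt, _ = curve_fit(birch_murnaghan, volumes, energies, p0=[-10.0, 40.0, 0.5, 4.0])
e0, v0, b0, b0_prime = popt
print(f"E0={e0:.3f}, V0={v0:.3f}, B0={b0:.3f}, B0'={b0_prime:.3f}")
</syntaxhighlight>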
==File formats==
On an abstract level, a metadata schema is independent of its representation in computer memory, on a hard drive, or on just a piece of paper. But on a practical level, all data and metadata need to be managed, i.e., stored, indexed, accessed, shared, deleted, archived, etc. File formats used in the community address different requirements and intended use cases. Some file formats privilege human readability (e.g., XML, JSON, YAML) but are not very storage-efficient, while others are binary and overall optimized for efficient searches, but require interpreters to be understood by a person (e.g., HDF5[114]). There are a few use cases and data properties in the domain of computational materials science that are worth mentioning. First, such data are very heterogeneous and contain many simple properties (e.g., the name of a used code, or a list of considered atoms) that are mixed with properties in the form of large vectors, matrices, or tensors (e.g., the density of states or wave functions). The number of different properties requires hierarchical organization (e.g., with XML, JSON, YAML, or HDF5). It is desirable that many properties be easily human readable (e.g., to quickly verify the sanity of a piece of data); on the other hand, large matrices should be stored as efficiently as possible for archiving, retrieving, and searching purposes. Second, there are use cases where random (non-sequential) access to individual properties is desirable (e.g., return all band structures from a set of DFT calculations). Third, computational materials science (meta)data need to be archived (efficient storage, prevention of corruption, backups, etc.) on one side, but they also need to be shared via APIs, e.g., for search queries. This requires transforming (meta)data from a representation in one file format (e.g., BagIt or HDF5) to another representation in a different format (e.g., JSON or XML).
These use cases and data properties lead to several conclusions. Even on a technical level, (meta)data need to be handled independently of the file format. Pieces of information have to be managed in different formats, and we need to be able to transform one representation into another. If many different resources (files, databases, etc.) are used to store (meta)data from a logically conjoined dataset, references to these resources themselves become an important piece of metadata. We propose to use an abstract interface (e.g., implemented as a Python library) based on an abstract schema. This interface allows one to manage (meta)data independently of the actual representation used underneath. Various implementations of such an abstract interface can then realize storage in various file formats and access to databases.
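A minimal sketch of such an abstract interface, with two interchangeable backends (JSON and HDF5 via h5py), could look as follows; a real implementation would additionally cover nested sections, large arrays, partial reads, and schema validation.

<syntaxhighlight lang="python">
# Minimal sketch of an abstract (meta)data interface with interchangeable backends.
# The schema and data shown are illustrative placeholders.
import abc
import json
import h5py

class ArchiveBackend(abc.ABC):
    @abc.abstractmethod
    def write(self, path: str, data: dict) -> None: ...
    @abc.abstractmethod
    def read(self, path: str) -> dict: ...

class JsonBackend(ArchiveBackend):
    def write(self, path, data):
        with open(path, "w") as f:
            json.dump(data, f)
    def read(self, path):
        with open(path) as f:
            return json.load(f)

class Hdf5Backend(ArchiveBackend):
    def write(self, path, data):
        with h5py.File(path, "w") as f:
            for key, value in data.items():
                f[key] = value            # scalars, strings, and arrays become datasets
    def read(self, path):
        with h5py.File(path, "r") as f:
            return {key: f[key][()] for key in f.keys()}

data = {"code_name": "ExampleCode", "energy_total": -1234.5}
for backend, path in [(JsonBackend(), "entry.json"), (Hdf5Backend(), "entry.h5")]:
    backend.write(path, data)
    print(type(backend).__name__, backend.read(path))
</syntaxhighlight>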
==Metadata schemas for experimental materials science==
In contrast to computational materials science, in experimental materials science the atomic structure and composition are only approximately known. Several techniques are used to collect data that may be more or less directly interpreted in terms of the atomic and/or electronic structure of the material. In cases where the structure of the material is already known, careful characterization of properties helps to establish valuable relationships between structure and properties which, in turn, may help to refine theoretical models of these structure-property links. The inherent uncertainty in every measurement process causes the precision with which data can be reproduced to be lower, in most cases, than in theoretical/computational materials science. These uncertainties are present even in a well-characterized experimental setup, i.e., when a comprehensive set of metadata is used. In many cases, it is not even the focus of an experiment to produce the most perfectly characterized data, but rather to invest just enough effort to address the specific question that drives the experiment.
The information available about the material whose properties are to be measured is also much less complete than in the computational world, where often the position of every atom is known. However, while physical measurements may be limited in their precision, the accuracy with which a physically observable quantity is obtained is, by virtue of being physically observable, much higher than in computational materials science, where the accuracy of the obtained physical quantity may depend strongly on the validity of the approximations being applied.
The uncertainty in retrieving structure-property relationships in computational materials science, which depends on the suitability of the applied theoretical model and its computational implementation, translates in the realm of experiments to an uncertainty in the atomic structure of the object that is being characterized, and generally also to some uncertainty in the measurement process itself. The metadata necessary to reproduce a given experimental data set must thus include detailed information about the material and its history, together with all the parameters required to describe the state of the instrument used for the characterization. In most cases, both classes of metadata, i.e., those describing the material and those describing the instrument, are going to be incomplete. While, for example, the full history of temperature, air pressure, humidity, and other relevant environmental parameters is not commonly tracked for the complete lifetime of a material (counter-examples exist, e.g., in pharmaceutical research), information about the state of the instrument is generally also not as comprehensive as it ideally should be (e.g., parameters are not recorded, or are not properly controlled, such as hysteresis effects in devices involving magnetic fields or in many mechanical setups).
To overcome part of the uncertainty in the data, one needs to collect as many metadata about the material and its history as possible, including those that one has no immediate use for at the moment but might potentially need in the future. Since most of the research equipment used for characterization tasks is commercial instrumentation, collecting these metadata in an (ideally) fully automated fashion requires the manufacturer’s support. In many cases, the formats in which scientific data are provided by these instruments are proprietary. Even if all the data describing the instrument’s condition of operation are stored, large parts of them may get lost when using the vendor’s software to export the data to other formats, mostly because the “standard format” does not foresee storing vendor- and instrument-specific metadata. It is, however, worth mentioning here that the CIF dictionaries (see the Section “Towards FAIR metadata schemas for computational materials science”) already contain (meta)data names to describe instrumentation, sample history, and standard uncertainties in both measured and computed values. As a useful addition, the CIF framework provides tools for implementing quality criteria, which can be used for evaluating the trustworthiness of data objects. In this respect, the community has been developing with CIF a powerful tool onto which a FAIR representation of at least structural data can be built.
At large research infrastructures like synchrotrons and neutron-scattering facilities, where a significant fraction of instruments is custom built and data are often shared with external partners, standards for file formats and metadata structures are being agreed upon, a prominent one being the NeXus standard. NeXus[115] defines hierarchies and rules for how metadata should be described and allows compliant storage using HDF5. Experimental research communities can profit from these activities and provide NeXus-format application definitions, which describe the necessary metadata that should be stored in a dataset, along with definitions for some optional metadata. This common file format for scientific data is slowly beginning to spread to other communities. Having a standard file format for different types of scientific data seems to be an important step towards FAIR data management, since it considerably lowers the threshold for sharing data across communities. Note that NeXus provides a glossary and a connected ontology, which helps machine interpretability, and thus reusability.
While standard file formats are of very high value in making data findable and accessible, they also make data more interoperable, thanks to the common use of keywords to describe a given parameter, since the barrier for reading the data is lowered. However, making experimental data truly reproducible requires in many cases more metadata to be collected. Only if the uncertainty with which data can be reproduced is well understood can the data also be fully reusable. As discussed in the previous paragraph, part of these metadata must be provided by the manufacturers of commercially available components of the experimental setup. Often this just requires more exhaustive data-export functions and/or proper, i.e., versioned, descriptions of all the instrument-state-describing metadata collected during the experiment. Additionally, it may be necessary to equip home-built laboratory equipment with additional sensors and functionalities for logging their signals.
Even with added sensors and automated logging of all accessible metadata, in many cases it is also necessary to compile and complete the record of metadata describing the current and past states of the sample being characterized, by manually adding information and/or combining data from different sources. Tools for doing this in a machine-readable fashion are electronic laboratory notebooks (ELNs) and/or laboratory information management systems (LIMS). Many such systems are already available [116,117,118,119,120,121,122], including open-source solutions that combine features of both ELN and LIMS in a single software package. Server-client solutions that do not require a specific client, but may be accessed through any web browser, have the advantage that information may be accessed and edited from any electronic device capable of interacting with the server. Such ease of access, combined with the establishment of rules and practices for the holistic recording of metadata about sample conditions and experimental workflows, will also help to increase the reproducibility, and thus the reusability, of experimental data. The easier such a system is to use, and the more apparent it makes the benefits of the availability of FAIR experimental data, the faster it will be adopted by the scientific community.
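As a sketch of what machine-readable interaction with a browser-accessible ELN/LIMS could look like, the snippet below sends a sample-metadata record to a hypothetical REST endpoint. The URL, endpoint path, payload fields, and token are invented for the example and do not correspond to any particular ELN or LIMS product.

```python
import json
import requests  # third-party HTTP client library

# Hypothetical ELN/LIMS REST endpoint; replace with your system's actual API.
ELN_URL = "https://eln.example.org/api/v1/samples"
API_TOKEN = "replace-with-your-token"

sample_metadata = {
    "name": "sample-0042",
    "composition": "Ga2O3",
    "history": [{"step": "annealing", "temperature_K": 1100, "duration_s": 3600}],
}

# Submit the record as JSON; a real ELN may use a different payload layout.
response = requests.post(
    ELN_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}",
             "Content-Type": "application/json"},
    data=json.dumps(sample_metadata),
    timeout=10,
)
response.raise_for_status()
print("Created record:", response.json().get("id"))
```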
Outlook on ontologies in materials science
In data science, an ontology is a formal representation of the knowledge of a community about a domain of interest, for a purpose. As ontologies are currently less common in basic materials science than in other fields of science, let us explain these terms:
- Formal representation means that: (1) the ontology is a representation, hence it is a simplification, or a model, of the target domain, and (2) the attribute formal communicates that the ontological terms and relationships between them must have a deterministic and unambiguous meaning. Furthermore, formal representation implies that the mechanism to specify the ontology must have a degree of logical processing capability, e.g., inference and reasoning should be possible. Crucially, the attribute formal refers to the fact that an ontology should be machine-readable.
- Knowledge is the accumulated set of facts, pieces of information, and skills of the experts of the domain of interest that are represented in the ontology.
- The community influences the ontology in two aspects: (1) it implies an overall agreement between a group of experts/users of the knowledge as represented in the ontology, and (2) it indicates that the ontology is not meant to convince an entire population, nor does it aim to be universal. However, if it fulfills the requirements of larger communities, the ontology will be adopted by broader audiences and will find its way towards standardization.
- The domain of interest is the common ground for the community, e.g., a scientific discipline, a sub-discipline, or a market sector. It is often used as a boundary to limit the scope of the ontology, and it is a proper tool for detecting overlapping concepts, modularizing ontologies, and identifying extension and integration points.
- The purpose conveys the goals of the ontology designers so that the ontology is applicable to a set of situations. In many ontology design efforts, the purpose is formulated by a collection of so-called competency questions. These questions and the answers provided to them identify the intent and viewpoint of the designers and set the potential applications of the ontology.
In practice, ontologies are often mapped onto, and visualized by means of, directed acyclic graphs, where each edge is one of a well-defined set of relationships (e.g., is a, has property) and each node is a class, i.e., a concept specific to the domain of interest. Each node-edge-node triple is interpreted as a subject-predicate-object expression. For instance, in an ontology for catalysis, one could find the triples “catalytic material–has property–selectivity” and “selectivity–refers to–reaction product.” Ontologies address the interoperability requirement of FAIR data: by means of a machine-readable formal structure, which can be connected to an existing or ex novo derived metadata schema of a database, ontologies allow queries over various databases, even from different fields.
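To illustrate the triple structure just described, the following sketch encodes the two catalysis triples with the rdflib Python library. The namespace and term names are invented for the example and do not belong to any published ontology.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

# Hypothetical namespace for an illustrative catalysis ontology.
CAT = Namespace("https://example.org/catalysis#")

g = Graph()
g.bind("cat", CAT)

# "catalytic material - has property - selectivity"
g.add((CAT.CatalyticMaterial, CAT.hasProperty, CAT.Selectivity))
# "selectivity - refers to - reaction product"
g.add((CAT.Selectivity, CAT.refersTo, CAT.ReactionProduct))
# A simple "is a" relation, as used for class hierarchies.
g.add((CAT.Zeolite, RDFS.subClassOf, CAT.CatalyticMaterial))

# Serialize the graph as Turtle to inspect the subject-predicate-object triples.
print(g.serialize(format="turtle"))
```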
The literature already contains several ontologies created for representing (aspects of) materials science. The most ambitious project is probably EMMO [42], which stands for both the European Materials Modelling Ontology, developed within the European Materials Modelling Council (EMMC), and the Elemental Multiperspective Material Ontology. EMMO is designed to describe the fundamental concepts of physics, chemistry, and materials science in a formal way, providing an all-purpose common ground for describing materials, models, and data that can be adapted by all sub-domains of condensed-matter physics and chemistry. The development of EMMO also includes a handful of domain ontologies that assume EMMO as their top-level ontology. [123] These domain ontologies span subjects such as “atomistic and electronic modeling,” “crystallography,” “mechanical testing,” and more. So far, however, EMMO and its domain ontologies have not been connected to existing databases.
Other domain-specific ontologies, not related to EMMO, have been developed. For instance, the Materials Ontology [124] was developed for the exchange of data among databases for thermal properties, the MatOnto ontology [125] addresses oxygen ion conducting materials in the fuel cell domain, the NanoParticle Ontology [126] maps properties of nanoparticles with the purpose of designing new nanoparticles with given properties, while the eNanoMapper ontology [127] focuses on assessing risks related to the use of nanomaterials from the engineering point of view.
An application-oriented ontology is the Materials Design Ontology (MDO) [128], developed under the guidance of the schemas from OPTIMADE [41] and therefore aimed at dealing with data from the various materials-data repositories (e.g., AFLOW, Materials Project, etc.) on a common ground. In practice, MDO connects calculated structures with the calculated properties and with the physical model adopted to calculate structures and properties. Furthermore, the provenance of each calculation is also represented in MDO. MDO has recently been extended using text mining on thousands of journal articles. [129]
The hierarchical structure of NOMAD Metainfo already includes ontological aspects. More specifically, it represents atomistic calculations as performed by all the parsed simulation codes. NOMAD Metainfo already contains five types of relations between the metadata: (a) is subclass of, (b) is part of, (c) has reference, (d) has dimension, and (e) has category. The latter relation, has category, is introduced to conceptually describe physical quantities (e.g., “energy,” “velocity,” etc.). Recently [130], this basic NOMAD Metainfo ontology has been expanded to include a representation of operations among arrays (in an ontology, any mathematical concept needs to be represented in order to properly operate with the physical quantities in complex queries). This extension allowed for the introduction of a “similarity” relationship, which has been applied as a proof of concept to the calculated electronic density of states stored in the NOMAD Archive, in order to identify materials with similar electronic structures. [131,132]
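As a purely generic illustration of a numerical “similarity” measure between spectra (not the actual fingerprint method of refs. [131,132]), the sketch below computes a cosine similarity between two electronic densities of states sampled on a common energy grid. The toy Gaussian densities of states are invented for the example.

```python
import numpy as np

def cosine_similarity(dos_a: np.ndarray, dos_b: np.ndarray) -> float:
    """Cosine similarity between two DOS arrays on the same energy grid.

    A generic, illustrative measure; production similarity searches
    (e.g., refs. [131,132]) rely on dedicated DOS fingerprints instead.
    """
    norm = np.linalg.norm(dos_a) * np.linalg.norm(dos_b)
    return float(np.dot(dos_a, dos_b) / norm) if norm > 0 else 0.0

# Two toy densities of states on a shared 100-point energy grid (eV).
energy = np.linspace(-10.0, 10.0, 100)
dos_1 = np.exp(-((energy - 1.0) ** 2))
dos_2 = np.exp(-((energy - 1.5) ** 2))

print(f"similarity = {cosine_similarity(dos_1, dos_2):.3f}")
```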
Achievements and challenges of ontologies for materials science were discussed at the first "Workshop on Ontologies for Materials-Databases Interoperability" (OMDI2021), held in Linköping and virtually in October 2021. The workshop was organized by the OPTIMADE consortium [41] and funded by Psi-k. [133] The main outcomes of the workshop were: (a) a strengthening of the idea that the development of useful ontologies requires a community effort; (b) the recognition that ontologies need to be built from the data, i.e., their development needs to be driven by existing data and by the aim of connecting data from different sources; and (c) the observation that tools for text mining need to be developed [129,134] in order to map into ontologies the enormous wealth buried in decades of scientific literature. Another important outcome of the workshop was an insightful warning: "Is the field proposing solutions (i.e., the existing ontologies) still in search of a problem?" In other words, the community realizes that it needs specific questions to be addressed (the competency questions) in order to shape the ontologies and then to propose demonstrative applications of such ontologies that answer the agreed-upon questions.
Discussion and outlook
Defining, as completely as possible, a pool of metadata for all the methods and computed quantities described above is crucial for processing, storing, and providing FAIR materials science data. A key challenge is the mapping into a metadata schema of the full set of input parameters, including those hidden in the specific codes, and of all the available output. This practice will facilitate reproducibility, benchmarking, and peer-review processes.
In particular, we emphasize the importance of developing a hierarchical and modular metadata schema in order to represent the complexity of materials science data and allow for access, reproduction, and repurposing of data, from single-structure calculations to complex workflows. Furthermore, the modularity of the schema enables its extensibility, which is vital for the long-term maintenance of the metadata infrastructure.
As an example, we presented the current status of the NOMAD metadata schema, which was designed to comply with the FAIR principles. By means of existing parsers that map a growing set of atomistic-simulation code packages onto the hierarchical, modular NOMAD metadata schema, the NOMAD infrastructure already provides the community with FAIR storage of materials science data. The challenges of fully covering ground-state electronic calculations, and of extending the schema to excited states, dynamical simulations, and complex workflows, were examined in detail. By means of a community effort, all aspects of the different subfields, and all the practical details of each specific implementation, can be mapped onto the NOMAD metadata schema. Finally, we discussed the challenges of the "FAIR-ification" of experimental materials science metadata and the creation of ontologies for materials science. Ontologies will unlock the interoperability of FAIR data by enabling the access and reuse of data across materials science areas, and even outside materials science.
As a perspective, probably the biggest benefit of meeting the interoperability challenge will be to allow for routine comparisons between computational evaluations and experimental observations. In fact, it is not trivial to associate a given computed quantity, derived through a given theoretical model, with an experimentally measured quantity. This association requires the judgment of a domain expert and a full characterization of both compared quantities. This is where a formalized ontology, applied to FAIR data in materials science, could automate the process.
Abbreviations, acronyms, and initialisms
- AI: artificial intelligence
- aiMD: ab initio molecular dynamics
- API: application programming interface
- BSE: Bethe-Salpeter equation
- CASPT2: complete active-space second-order perturbation theory
- CASSCF: complete active-space self-consistent field
- CC: coupled-cluster
- CML: Chemical Markup Language
- CIF: Crystallographic Information File; Crystallographic Information Framework
- DFT: density-functional theory
- DFPT: density-functional perturbation theory
- DMFT: dynamical mean-field theory
- DOI: digital object identifier
- ELN: electronic laboratory notebook
- EMMC: European Materials Modelling Council
- EMMO: Elemental Multiperspective Material Ontology
- ESCDF: Electronic Structure Common Data Format
- ETSF: European Theoretical Spectroscopy Facility
- FAIR: findable, accessible, interoperable, reusable
- FLOSS: Free Libre Open-Source Software
- GS: ground state
- GW: Green's function G and dynamically screened Coulomb interaction W
- IUCr: International Union of Crystallography
- JCAMP-DX: Joint Committee on Atomic and Molecular Physical Data - Data Exchange
- JUMBO: Java Universal Molecular/Markup Browser for Objects
- LIMS: laboratory information management system
- MBPT: many-body perturbation theory
- MD: molecular dynamics
- MDR: metadata registry
- ML: machine learning
- MM: molecular mechanics
- MRCI: multireference configuration interaction
- NOMAD: Novel-Materials Discovery Laboratory
- OPTIMADE: Open Databases Integration for Materials Design
- PID: persistent identifier
- TDDFT: time-dependent DFT
- XML: Extensible Markup Language
Acknowledgements
We would like to thank all the participants in the workshop “Shared Metadata and Data Formats for Big-Data Driven Materials Science: A NOMAD–FAIR-DI Workshop,” as listed at META2019, who contributed questions and comments to the ideas discussed in this paper. The organizers of and participants in the OMDI2021 workshop (see https://liu.se/en/research/omdi2021 for the full list of names) are acknowledged for insightful discussions that inspired some of the concepts discussed in the Section “Outlook on ontologies in materials science.” This work received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement N° 951786 (NOMAD CoE) and from the German Research Foundation (DFG) through the NFDI consortium FAIRmat, project 460197019. We acknowledge support by the Open Access Publication Fund of Humboldt-Universität zu Berlin. SVL’s contribution was supported by RSCF grant 21-13-00419.
Author contributions
The present paper is inspired by and based on the minutes of the workgroup discussions at the workshop “Shared Metadata and Data Formats for Big-Data Driven Materials Science: A NOMAD–FAIR-DI Workshop.” Here, we report the composition of the original work groups, which is reflected in the main contributions to the paper’s sections. Metadata, metadata schemas and ontologies (Introduction, Section “Towards FAIR metadata schemas for computational materials science” and section “Outlook on ontologies in materials science”): Patrick Lambrix, Javad Chamanara, Carsten Baldauf, Tatyana Sheveleva, Benjamin Regler, Alvin Noe Ladines, Christoph T. Koch, Christof Wöll, Stefano Cozzini, Astrid Schneidewind, Maja-Olivia Himmer; Ground-state calculations (Section “Metadata for ground-state electronic-structure calculations”): Micael Oliveira, Sergey Levchenko; Perturbative and excited-states calculations (Section “Metadata for external-perturbation and excited-state electronic-structure calculations”): Claudia Draxl, Pasquale Pavone, Denis Usvyat; Potential-energy sampling (Section “Metadata for potential-energy sampling”): James Kermode, Tristan Bereau, Christian Carbogno, Omar Valsson, Markus Kühbach, Chuanxun Su, Ron Miller, Berk Onat; Workflows (Section “Metadata for Computational Workflows”): Stefano Curtarolo, Shyam Dwaraknath, Adam Michalchuk, Giovanni Pizzi, Gian-Marco Rignanese, Jörg Schaarschmidt; Data formats (Section “File Formats”): Ádám Fekete, Markus Scheidgen; Metadata for experiments (Section “Metadata schemas for experimental materials science”): Christoph T. Koch, Sandor Brockhauser, Astrid Schneidewind. Luca M. Ghiringhelli and Matthias Scheffler coordinated the formation of the work groups, participated in the discussions of several work groups, and prepared the first draft of the paper. All authors contributed to the final version of the paper.
Funding
Open access funding enabled and organized by Projekt DEAL.
Competing interests
The authors declare no competing interests.
References
- ↑ Rickman, J.M.; Lookman, T.; Kalinin, S.V. (1 April 2019). "Materials informatics: From the atomic-level to the continuum" (in en). Acta Materialia 168: 473–510. doi:10.1016/j.actamat.2019.01.051. https://linkinghub.elsevier.com/retrieve/pii/S1359645419300667.
- ↑ Hey, Anthony J. G., ed. (2009). The fourth paradigm: data-intensive scientific discovery. Redmond, Washington: Microsoft Research. ISBN 978-0-9825442-0-4.
- ↑ 3.0 3.1 Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem et al. (15 March 2016). "The FAIR Guiding Principles for scientific data management and stewardship" (in en). Scientific Data 3 (1): 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC PMC4792175. PMID 26978244. https://www.nature.com/articles/sdata201618.
- ↑ Grassi, Paul A; Lefkovitz, Naomi B; Nadeau, Ellen M; Galluzzo, Ryan J; Dinh, Abhiraj T (11 January 2018). Attribute metadata: a proposed schema for evaluating federated attributes. Gaithersburg, MD. pp. NIST IR 8112. doi:10.6028/nist.ir.8112. https://nvlpubs.nist.gov/nistpubs/ir/2018/NIST.IR.8112.pdf.
- ↑ Hall, S. R.; Allen, F. H.; Brown, I. D. (1 November 1991). "The crystallographic information file (CIF): a new standard archive file for crystallography". Acta Crystallographica Section A Foundations of Crystallography 47 (6): 655–685. doi:10.1107/S010876739101067X. https://scripts.iucr.org/cgi-bin/paper?S010876739101067X.
- ↑ Bernstein, Herbert J.; Bollinger, John C.; Brown, I. David; Gražulis, Saulius; Hester, James R.; McMahon, Brian; Spadaccini, Nick; Westbrook, John D. et al. (1 February 2016). "Specification of the Crystallographic Information File format, version 2.0". Journal of Applied Crystallography 49 (1): 277–284. doi:10.1107/S1600576715021871. ISSN 1600-5767. https://scripts.iucr.org/cgi-bin/paper?S1600576715021871.
- ↑ Hall, S.R.; Spadaccini, N.; Brown, I.D. et al. (2006). "Formal specification of the crystallographic information file. Version 1.1 specification". In Hall, S. R.; McMahon, B.. International Tables for Crystallography: Definition and exchange of crystallographic data. International Tables for Crystallography. G (1 ed.). Chester, England: International Union of Crystallography. pp. 25–36. doi:10.1107/97809553602060000107. ISBN 978-1-4020-5411-2. https://it.iucr.org/Ga/.
- ↑ Westbrook, J. D.; Yang, H.; Feng, Z.; Berman, H. M. (1 October 2006), Hall, S. R.; McMahon, B., eds., "The use of mmCIF architecture for PDB data management", International Tables for Crystallography (Chester, England: International Union of Crystallography) G: 539–543, doi:10.1107/97809553602060000755, ISBN 978-1-4020-5411-2, https://xrpp.iucr.org/cgi-bin/itr?url_ver=Z39.88-2003&rft_dat=what%3Dchapter%26volid%3DGa%26chnumo%3D5o5%26chvers%3Dv0001. Retrieved 2023-11-07
- ↑ El Mendili, Yassine; Vaitkus, Antanas; Merkys, Andrius; Gražulis, Saulius; Chateigner, Daniel; Mathevet, Fabrice; Gascoin, Stéphanie; Petit, Sebastien et al. (1 June 2019). "Raman Open Database: first interconnected Raman–X-ray diffraction open-access resource for material identification". Journal of Applied Crystallography 52 (3): 618–625. doi:10.1107/S1600576719004229. ISSN 1600-5767. PMC PMC6557180. PMID 31236093. http://scripts.iucr.org/cgi-bin/paper?S1600576719004229.
- ↑ McMahon, B. (1 May 1996). "The role of journals in maintaining data integrity: Checking of crystal structure data in Acta Crystallographica". Journal of Research of the National Institute of Standards and Technology 101 (3): 347. doi:10.6028/jres.101.036. PMC PMC4894614. PMID 27805171. https://nvlpubs.nist.gov/nistpubs/jres/101/3/j3mcma.pdf.
- ↑ Brown, I. David; McMahon, Brian (1 June 2002). "CIF: the computer language of crystallography". Acta Crystallographica Section B Structural Science 58 (3): 317–324. doi:10.1107/S0108768102003464. ISSN 0108-7681. https://scripts.iucr.org/cgi-bin/paper?S0108768102003464.
- ↑ "Chemical Markup Language". CMLC. 2012. https://www.xml-cml.org/. Retrieved 04 July 2023.
- ↑ Murray-Rust, Peter; Townsend, Joe A; Adams, Sam E; Phadungsukanan, Weerapong; Thomas, Jens (1 December 2011). "The semantics of Chemical Markup Language (CML): dictionaries and conventions" (in en). Journal of Cheminformatics 3 (1): 43. doi:10.1186/1758-2946-3-43. ISSN 1758-2946. PMC PMC3206453. PMID 21999509. https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-43.
- ↑ 14.0 14.1 Murray-Rust, Peter; Rzepa, Henry S (1 December 2011). "CML: Evolution and design" (in en). Journal of Cheminformatics 3 (1): 44. doi:10.1186/1758-2946-3-44. ISSN 1758-2946. PMC PMC3205047. PMID 21999549. https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-44.
- ↑ "Schema 3". Chemical Markup Language. CMLC. 2012. https://www.xml-cml.org/schema/schema3/. Retrieved 04 July 2023.
- ↑ "Gaussian - Expanding the limits of computational chemistry". Gaussian, Inc.. 2023. https://gaussian.com/. Retrieved 04 July 2023.
- ↑ Valiev, M.; Bylaska, E.J.; Govind, N.; Kowalski, K.; Straatsma, T.P.; Van Dam, H.J.J.; Wang, D.; Nieplocha, J. et al. (1 September 2010). "NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations" (in en). Computer Physics Communications 181 (9): 1477–1489. doi:10.1016/j.cpc.2010.04.018. https://linkinghub.elsevier.com/retrieve/pii/S0010465510001438.
- ↑ "Examples for Schema 3 CompChem". Chemical Markup Language. CMLC. 2012. https://www.xml-cml.org/examples/schema3/compchem/. Retrieved 04 July 2023.
- ↑ McDonald, Robert S.; Wilks, Paul A. (1 January 1988). "JCAMP-DX: A Standard Form for Exchange of Infrared Spectra in Computer Readable Form" (in en). Applied Spectroscopy 42 (1): 151–162. doi:10.1366/0003702884428734. ISSN 0003-7028. http://journals.sagepub.com/doi/10.1366/0003702884428734.
- ↑ Davies, Antony N.; Lampen, Peter (1 August 1993). "JCAMP-DX for NMR" (in en). Applied Spectroscopy 47 (8): 1093–1099. doi:10.1366/0003702934067874. ISSN 0003-7028. http://journals.sagepub.com/doi/10.1366/0003702934067874.
- ↑ Lampen, Peter; Hillig, Heinrich; Davies, Antony N.; Linscheid, Michael (1 December 1994). "JCAMP-DX for Mass Spectrometry" (in en). Applied Spectroscopy 48 (12): 1545–1552. doi:10.1366/0003702944027840. ISSN 0003-7028. http://journals.sagepub.com/doi/10.1366/0003702944027840.
- ↑ Baumbach, Jörg Ingo; Davies, Antony N.; Lampen, Peter; Schmidt, Hartwig (1 January 2001). "JCAMP-DX. A standard format for the exchange of ion mobility spectrometry data (IUPAC Recommendations 2001)" (in en). Pure and Applied Chemistry 73 (11): 1765–1782. doi:10.1351/pac200173111765. ISSN 1365-3075. https://www.degruyter.com/document/doi/10.1351/pac200173111765/html.
- ↑ Gonze, X.; Almbladh, C.-O.; Cucca, A.; Caliste, D.; Freysoldt, C.; Marques, M.A.L.; Olevano, V.; Pouillon, Y. et al. (1 October 2008). "Specification of an extensible and portable file format for electronic structure and crystallographic data" (in en). Computational Materials Science 43 (4): 1056–1065. doi:10.1016/j.commatsci.2008.02.023. https://linkinghub.elsevier.com/retrieve/pii/S0927025608001377.
- ↑ Gonze, X.; Almbladh, C.-O.; Cucca, A.; Caliste, D.; Freysoldt, C.; Marques, M.A.L.; Olevano, V.; Pouillon, Y. et al. (1 October 2008). "Specification of an extensible and portable file format for electronic structure and crystallographic data" (in en). Computational Materials Science 43 (4): 1056–1065. doi:10.1016/j.commatsci.2008.02.023. https://linkinghub.elsevier.com/retrieve/pii/S0927025608001377.
- ↑ Caliste, D.; Pouillon, Y.; Verstraete, M.J.; Olevano, V.; Gonze, X. (1 November 2008). "Sharing electronic structure and crystallographic data with ETSF_IO" (in en). Computer Physics Communications 179 (10): 748–758. doi:10.1016/j.cpc.2008.05.007. https://linkinghub.elsevier.com/retrieve/pii/S0010465508001963.
- ↑ 26.0 26.1 26.2 Ghiringhelli, Luca M.; Carbogno, Christian; Levchenko, Sergey; Mohamed, Fawzi; Huhs, Georg; Lüders, Martin; Oliveira, Micael; Scheffler, Matthias (6 November 2017). "Towards efficient data exchange and sharing for big-data driven materials science: metadata and data formats" (in en). npj Computational Materials 3 (1): 46. doi:10.1038/s41524-017-0048-5. ISSN 2057-3960. https://www.nature.com/articles/s41524-017-0048-5.
- ↑ Oliveira, Micael J. T.; Papior, Nick; Pouillon, Yann; Blum, Volker; Artacho, Emilio; Caliste, Damien; Corsetti, Fabiano; de Gironcoli, Stefano et al. (14 July 2020). "The CECAM electronic structure library and the modular software development paradigm" (in en). The Journal of Chemical Physics 153 (2): 024117. doi:10.1063/5.0012901. ISSN 0021-9606. https://pubs.aip.org/jcp/article/153/2/024117/1061500/The-CECAM-electronic-structure-library-and-the.
- ↑ Curtarolo, Stefano; Setyawan, Wahyu; Wang, Shidong; Xue, Junkai; Yang, Kesong; Taylor, Richard H.; Nelson, Lance J.; Hart, Gus L.W. et al. (1 June 2012). "AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations" (in en). Computational Materials Science 58: 227–235. doi:10.1016/j.commatsci.2012.02.002. https://linkinghub.elsevier.com/retrieve/pii/S0927025612000687.
- ↑ Talirz, Leopold; Kumbhar, Snehal; Passaro, Elsa; Yakutovich, Aliaksandr V.; Granata, Valeria; Gargiulo, Fernando; Borelli, Marco; Uhrin, Martin et al. (8 September 2020). "Materials Cloud, a platform for open computational science" (in en). Scientific Data 7 (1): 299. doi:10.1038/s41597-020-00637-5. ISSN 2052-4463. PMC PMC7479138. PMID 32901046. https://www.nature.com/articles/s41597-020-00637-5.
- ↑ Jain, Anubhav; Ong, Shyue Ping; Hautier, Geoffroy; Chen, Wei; Richards, William Davidson; Dacek, Stephen; Cholia, Shreyas; Gunter, Dan et al. (1 July 2013). "Commentary: The Materials Project: A materials genome approach to accelerating materials innovation" (in en). APL Materials 1 (1): 011002. doi:10.1063/1.4812323. ISSN 2166-532X. https://pubs.aip.org/apm/article/1/1/011002/119685/Commentary-The-Materials-Project-A-materials.
- ↑ Draxl, Claudia; Scheffler, Matthias (2018). "NOMAD: The FAIR Concept for Big-Data-Driven Materials Science". arXiv. doi:10.48550/ARXIV.1805.05039. https://arxiv.org/abs/1805.05039.
- ↑ Draxl, Claudia; Scheffler, Matthias (1 July 2019). "The NOMAD laboratory: from data sharing to artificial intelligence". Journal of Physics: Materials 2 (3): 036001. doi:10.1088/2515-7639/ab13bb. ISSN 2515-7639. https://iopscience.iop.org/article/10.1088/2515-7639/ab13bb.
- ↑ Draxl, Claudia; Scheffler, Matthias (2020), Andreoni, Wanda; Yip, Sidney, eds., "Big Data-Driven Materials Science and Its FAIR Data Infrastructure" (in en), Handbook of Materials Modeling (Cham: Springer International Publishing): 49–73, doi:10.1007/978-3-319-44677-6_104, ISBN 978-3-319-44676-9, http://link.springer.com/10.1007/978-3-319-44677-6_104. Retrieved 2023-11-07
- ↑ Kirklin, Scott; Saal, James E; Meredig, Bryce; Thompson, Alex; Doak, Jeff W; Aykol, Muratahan; Rühl, Stephan; Wolverton, Chris (11 December 2015). "The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies" (in en). npj Computational Materials 1 (1): 15010. doi:10.1038/npjcompumats.2015.10. ISSN 2057-3960. https://www.nature.com/articles/npjcompumats201510.
- ↑ Merkys, Andrius; Mounet, Nicolas; Cepellotti, Andrea; Marzari, Nicola; Gražulis, Saulius; Pizzi, Giovanni (1 December 2017). "A posteriori metadata from automated provenance tracking: integration of AiiDA and TCOD" (in en). Journal of Cheminformatics 9 (1): 56. doi:10.1186/s13321-017-0242-y. ISSN 1758-2946. PMC PMC5686034. PMID 29138947. https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0242-y.
- ↑ Pizzi, Giovanni; Cepellotti, Andrea; Sabatini, Riccardo; Marzari, Nicola; Kozinsky, Boris (1 January 2016). "AiiDA: automated interactive infrastructure and database for computational science" (in en). Computational Materials Science 111: 218–230. doi:10.1016/j.commatsci.2015.09.013. https://linkinghub.elsevier.com/retrieve/pii/S0927025615005820.
- ↑ Huber, Sebastiaan P.; Zoupanos, Spyros; Uhrin, Martin; Talirz, Leopold; Kahle, Leonid; Häuselmann, Rico; Gresch, Dominik; Müller, Tiziano et al. (8 September 2020). "AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance" (in en). Scientific Data 7 (1): 300. doi:10.1038/s41597-020-00638-4. ISSN 2052-4463. PMC PMC7479590. PMID 32901044. https://www.nature.com/articles/s41597-020-00638-4.
- ↑ Uhrin, Martin; Huber, Sebastiaan P.; Yu, Jusong; Marzari, Nicola; Pizzi, Giovanni (1 February 2021). "Workflows in AiiDA: Engineering a high-throughput, event-based engine for robust and modular computational workflows" (in en). Computational Materials Science 187: 110086. doi:10.1016/j.commatsci.2020.110086. https://linkinghub.elsevier.com/retrieve/pii/S0927025620305772.
- ↑ Hjorth Larsen, Ask; Jørgen Mortensen, Jens; Blomqvist, Jakob; Castelli, Ivano E; Christensen, Rune; Dułak, Marcin; Friis, Jesper; Groves, Michael N et al. (12 July 2017). "The atomic simulation environment—a Python library for working with atoms". Journal of Physics: Condensed Matter 29 (27): 273002. doi:10.1088/1361-648X/aa680e. ISSN 0953-8984. https://iopscience.iop.org/article/10.1088/1361-648X/aa680e.
- ↑ Tadmor, E. B.; Elliott, R. S.; Sethna, J. P.; Miller, R. E.; Becker, C. A. (1 July 2011). "The potential of atomistic simulations and the knowledgebase of interatomic models" (in en). JOM 63 (7): 17–17. doi:10.1007/s11837-011-0102-6. ISSN 1047-4838. http://link.springer.com/10.1007/s11837-011-0102-6.
- ↑ 41.0 41.1 Andersen, Casper W.; Armiento, Rickard; Blokhin, Evgeny; Conduit, Gareth J.; Dwaraknath, Shyam; Evans, Matthew L.; Fekete, Ádám; Gopakumar, Abhijith et al. (12 August 2021). "OPTIMADE, an API for exchanging materials data" (in en). Scientific Data 8 (1): 217. doi:10.1038/s41597-021-00974-z. ISSN 2052-4463. PMC PMC8361091. PMID 34385453. https://www.nature.com/articles/s41597-021-00974-z.
- ↑ "EMMO: an Ontology for Applied Sciences". EMMC. 2021. Archived from the original on 26 May 2022. https://web.archive.org/web/20220526170653/https://emmc.info/emmo-info/. Retrieved 04 July 2023.
- ↑ Fielding, R.T. (2000). "Architectural Styles and the Design of Network-based Software Architectures". University of California, Irvine. https://ics.uci.edu/~fielding/pubs/dissertation/top.htm.
- ↑ "Entries". NOMAD. MPCDF and FHI on behalf of Max-Planck-Society. 2023. https://nomad-lab.eu/prod/v1/gui/search/entries. Retrieved 04 July 2023.
- ↑ "Search". NOMAD Encyclopedia. The NOMAD Laboratory. 2023. https://nomad-lab.eu/prod/rae/encyclopedia/#/search. Retrieved 04 July 2023.
- ↑ Ghiringhelli, Luca M. (9 September 2021). "An AI-toolkit to develop and share research into new materials" (in en). Nature Reviews Physics 3 (11): 724–724. doi:10.1038/s42254-021-00373-8. ISSN 2522-5820. https://www.nature.com/articles/s42254-021-00373-8.
- ↑ Sbailò, Luigi; Fekete, Ádám; Ghiringhelli, Luca M.; Scheffler, Matthias (5 December 2022). "The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding" (in en). npj Computational Materials 8 (1): 250. doi:10.1038/s41524-022-00935-z. ISSN 2057-3960. https://www.nature.com/articles/s41524-022-00935-z.
- ↑ "NOMAD Artificial Intelligence Toolkit". NOMAD Laboratory. 2023. https://nomad-lab.eu/aitoolkit. Retrieved 04 July 2023.
- ↑ Marques, Miguel A.L.; Oliveira, Micael J.T.; Burnus, Tobias (1 October 2012). "Libxc: A library of exchange and correlation functionals for density functional theory" (in en). Computer Physics Communications 183 (10): 2272–2281. doi:10.1016/j.cpc.2012.05.007. https://linkinghub.elsevier.com/retrieve/pii/S0010465512001750.
- ↑ Lehtola, Susi; Steigemann, Conrad; Oliveira, Micael J.T.; Marques, Miguel A.L. (1 January 2018). "Recent developments in libxc — A comprehensive library of functionals for density functional theory" (in en). SoftwareX 7: 1–5. doi:10.1016/j.softx.2017.11.002. https://linkinghub.elsevier.com/retrieve/pii/S2352711017300602.
- ↑ Pritchard, Benjamin P.; Altarawy, Doaa; Didier, Brett; Gibson, Tara D.; Windus, Theresa L. (25 November 2019). "New Basis Set Exchange: An Open, Up-to-Date Resource for the Molecular Sciences Community" (in en). Journal of Chemical Information and Modeling 59 (11): 4814–4820. doi:10.1021/acs.jcim.9b00725. ISSN 1549-9596. https://pubs.acs.org/doi/10.1021/acs.jcim.9b00725.
- ↑ Blum, Volker; Gehrke, Ralf; Hanke, Felix; Havu, Paula; Havu, Ville; Ren, Xinguo; Reuter, Karsten; Scheffler, Matthias (1 November 2009). "Ab initio molecular simulations with numeric atom-centered orbitals" (in en). Computer Physics Communications 180 (11): 2175–2196. doi:10.1016/j.cpc.2009.06.022. https://linkinghub.elsevier.com/retrieve/pii/S0010465509002033.
- ↑ Barca, Giuseppe M. J.; Bertoni, Colleen; Carrington, Laura; Datta, Dipayan; De Silva, Nuwan; Deustua, J. Emiliano; Fedorov, Dmitri G.; Gour, Jeffrey R. et al. (21 April 2020). "Recent developments in the general atomic and molecular electronic structure system" (in en). The Journal of Chemical Physics 152 (15): 154102. doi:10.1063/5.0005188. ISSN 0021-9606. https://pubs.aip.org/jcp/article/152/15/154102/1058751/Recent-developments-in-the-general-atomic-and.
- ↑ Reining, Lucia (1 May 2018). "The GW approximation: content, successes and limitations" (in en). WIREs Computational Molecular Science 8 (3): e1344. doi:10.1002/wcms.1344. ISSN 1759-0876. https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.1344.
- ↑ Kresse, G.; Furthmüller, J. (15 October 1996). "Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set" (in en). Physical Review B 54 (16): 11169–11186. doi:10.1103/PhysRevB.54.11169. ISSN 0163-1829. https://link.aps.org/doi/10.1103/PhysRevB.54.11169.
- ↑ Marques, M (1 March 2003). "octopus: a first-principles tool for excited electron–ion dynamics" (in en). Computer Physics Communications 151 (1): 60–78. doi:10.1016/S0010-4655(02)00686-0. https://linkinghub.elsevier.com/retrieve/pii/S0010465502006860.
- ↑ Segall, M D; Lindan, Philip J D; Probert, M J; Pickard, C J; Hasnip, P J; Clark, S J; Payne, M C (25 March 2002). "First-principles simulation: ideas, illustrations and the CASTEP code". Journal of Physics: Condensed Matter 14 (11): 2717–2744. doi:10.1088/0953-8984/14/11/301. ISSN 0953-8984. https://iopscience.iop.org/article/10.1088/0953-8984/14/11/301.
- ↑ Shang, Honghui; Carbogno, Christian; Rinke, Patrick; Scheffler, Matthias (1 June 2017). "Lattice dynamics calculations based on density-functional perturbation theory in real space" (in en). Computer Physics Communications 215: 26–46. doi:10.1016/j.cpc.2017.02.001. https://linkinghub.elsevier.com/retrieve/pii/S0010465517300437.
- ↑ Giannozzi, Paolo; Baroni, Stefano; Bonini, Nicola; Calandra, Matteo; Car, Roberto; Cavazzoni, Carlo; Ceresoli, Davide; Chiarotti, Guido L. et al. (30 September 2009). "QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials". Journal of Physics. Condensed Matter: An Institute of Physics Journal 21 (39): 395502. doi:10.1088/0953-8984/21/39/395502. ISSN 1361-648X. PMID 21832390. https://pubmed.ncbi.nlm.nih.gov/21832390.
- ↑ Gonze, X.; Amadon, B.; Anglade, P.-M.; Beuken, J.-M.; Bottin, F.; Boulanger, P.; Bruneval, F.; Caliste, D. et al. (1 December 2009). "ABINIT: First-principles approach to material and nanosystem properties" (in en). Computer Physics Communications 180 (12): 2582–2615. doi:10.1016/j.cpc.2009.07.007. https://linkinghub.elsevier.com/retrieve/pii/S0010465509002276.
- ↑ Lamprecht, Anna-Lena; Garcia, Leyla; Kuzak, Mateusz; Martinez, Carlos; Arcila, Ricardo; Martin Del Pico, Eva; Dominguez Del Angel, Victoria; van de Sandt, Stephanie et al. (12 June 2020). Groth, Paul; Groth, Paul; Dumontier, Michel. eds. "Towards FAIR principles for research software". Data Science 3 (1): 37–59. doi:10.3233/DS-190026. https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/DS-190026.
- ↑ Barker, Michelle; Chue Hong, Neil P.; Katz, Daniel S.; Lamprecht, Anna-Lena; Martinez-Ortiz, Carlos; Psomopoulos, Fotis; Harrow, Jennifer; Castro, Leyla Jael et al. (14 October 2022). "Introducing the FAIR Principles for research software" (in en). Scientific Data 9 (1): 622. doi:10.1038/s41597-022-01710-x. ISSN 2052-4463. PMC PMC9562067. PMID 36241754. https://www.nature.com/articles/s41597-022-01710-x.
- ↑ Katz, Daniel S.; Chue Hong, Neil P.; Clark, Tim; Muench, August; Stall, Shelley; Bouquin, Daina; Cannon, Matthew; Edmunds, Scott et al. (12 January 2021). "Recognizing the value of software: a software citation guide" (in en). F1000Research 9: 1257. doi:10.12688/f1000research.26932.2. ISSN 2046-1402. PMC PMC7805487. PMID 33500780. https://f1000research.com/articles/9-1257/v2.
- ↑ Smith, Arfon M.; Katz, Daniel S.; Niemeyer, Kyle E.; FORCE11 Software Citation Working Group (19 September 2016). "Software citation principles" (in en). PeerJ Computer Science 2: e86. doi:10.7717/peerj-cs.86. ISSN 2376-5992. https://peerj.com/articles/cs-86.
- ↑ Hertz, J.C.; Lucas, M.; Scott, J. (April 2006). "DoD Open Technology Development (OTD) Roadmap". Terry Bollinger Online Resources. Terry Bollinger. https://www.terrybollinger.com/index.html#open_source_reports. Retrieved 04 July 2023.
- ↑ Stallman, R. (11 September 2021). "FLOSS and FOSS". GNU Operating System. Richard Stallman. https://www.gnu.org/philosophy/floss-and-foss.html. Retrieved 04 July 2023.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Several inline URLs from the original were turned into full citations for this version. The original didn't state what GW was; for this version, an explanation and citation were given for clarity. The URL to the EMMC and EMMO website was broken when adding this to LIMSwiki; an archived URL was used in its place.