Journal:The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding
Full article title | The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding |
---|---|
Journal | npj Computational Materials |
Author(s) | Sbailò, Luigi; Fekete, Ádám; Ghiringhelli, Luca M.; Scheffler, Matthias |
Author affiliation(s) | Humboldt-Universität zu Berlin, Max-Planck-Gesellschaft |
Primary contact | Email: ghiringhelli at fhi dash berlin dot mpg dot de |
Year published | 2022 |
Volume and issue | 8 |
Article # | 250 |
DOI | 10.1038/s41524-022-00935-z |
ISSN | 2057-3960 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://www.nature.com/articles/s41524-022-00935-z |
Download | https://www.nature.com/articles/s41524-022-00935-z.pdf (PDF) |
Abstract
We present the Novel Materials Discovery (NOMAD) Artificial Intelligence (AI) Toolkit, a web-browser-based infrastructure for the interactive AI-based analysis of materials science data under FAIR (findable, accessible, interoperable, and reusable) data principles. The AI Toolkit readily operates on FAIR data stored in the central server of the NOMAD Archive, the largest database of materials science data worldwide, as well as locally stored, user-owned data. The NOMAD Oasis, a local, stand-alone server, can also be used to run the AI Toolkit. By using Jupyter Notebooks that run in a web browser, the NOMAD data can be queried and accessed; data mining, machine learning (ML), and other AI techniques can then be applied to analyze them. This infrastructure brings the concept of reproducibility in materials science to the next level by allowing researchers to share not only the data contributing to their scientific publications, but also all the developed methods and analytics tools. Besides reproducing published results, users of the NOMAD AI Toolkit can modify Jupyter Notebooks toward their own research work.
Keywords: computational methods, theory and computation, artificial intelligence, materials science, FAIR principles
Introduction
Data-centric science has been identified as the fourth paradigm of scientific research. We observe that the novelty introduced by this paradigm is twofold. First, we have seen the creation of large, interconnected databases of scientific data, which are increasingly expected to comply with the so-called FAIR principles [1] of scientific data management and stewardship, meaning that the data and related metadata need to be findable, accessible, interoperable, and reusable (or repurposable, or recyclable). Second, we have seen growing use of artificial intelligence (AI) algorithms, applied to scientific data, in order to find patterns and trends that would be difficult, if not impossible, to identify by unassisted human observation and intuition.
In the last few years, materials science has experienced both of these novelties. Databases, in particular from computational materials science, have been created via high-throughput screening initiatives, mainly boosted by the U.S.-based Materials Genome Initiative (MGI), starting in the early 2010s, e.g., AFLOW [2], the Materials Project [3], and OQMD. [4] At the end of 2014, the NOMAD (Novel Materials Discovery) Laboratory launched the NOMAD Repository & Archive [5,6,7], the first FAIR storage infrastructure for computational materials science data. NOMAD’s servers and storage are hosted by the Max Planck Computing and Data Facility (MPCDF) in Garching (Germany). The NOMAD Repository stores, as of today, input and output files from more than 50 different atomistic (ab initio and molecular mechanics) codes. In total, more than 100 million total-energy calculations have been uploaded by various materials scientists from their local storage, or from other public databases. The NOMAD Archive stores the same information, but it is converted, normalized, and characterized by means of a metadata schema, the NOMAD Metainfo [8], which allows for the labeling of most of the data in a code-independent representation. The translation from the content of raw input and output files into the code-independent NOMAD Metainfo format makes the data ready for AI analysis.
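To make the idea of a code-independent representation concrete, the following is a minimal sketch of how outputs from two different simulation codes could be mapped onto one shared record format. It is not the NOMAD Metainfo schema or the NOMAD parsers: the code names (`code_a`, `code_b`), the raw keys, and the common key `total_energy_ev` are all hypothetical and chosen only for illustration. The only real-world values are the Hartree and Rydberg conversion factors to electronvolts.

```python
# Illustrative sketch only: all field names and code names are hypothetical
# and do NOT reflect the actual NOMAD Metainfo schema or parsers.

HARTREE_TO_EV = 27.211386245988  # 1 Hartree in eV
RYDBERG_TO_EV = 13.605693122994  # 1 Rydberg in eV

# For each (hypothetical) code: raw key -> (common key, factor converting to eV)
CODE_MAPPINGS = {
    "code_a": {"E_tot_Ha": ("total_energy_ev", HARTREE_TO_EV)},
    "code_b": {"etot_ry": ("total_energy_ev", RYDBERG_TO_EV)},
}


def normalize(code_name, raw_record):
    """Translate one code-specific record into the shared representation."""
    mapping = CODE_MAPPINGS[code_name]
    normalized = {"source_code": code_name}
    for raw_key, value in raw_record.items():
        if raw_key in mapping:  # raw keys without a mapping are skipped
            common_key, factor = mapping[raw_key]
            normalized[common_key] = value * factor
    return normalized


# Two records that use different key names and units for the same quantity
record_a = normalize("code_a", {"E_tot_Ha": -1.0})
record_b = normalize("code_b", {"etot_ry": -2.0})
```

After normalization, both records expose the energy under the same key and in the same unit, so a downstream analysis does not need to know which code produced which record.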
Besides the above-mentioned databases, other platforms for the open-access storage and access of materials science data appeared in recent years, such as the Materials Data Facility [9,10] and Materials Cloud. [11] Furthermore, many groups have been storing their materials science data on Zenodo [12] and have provided the digital object identifier (DOI) to openly access them in publications. The peculiarity of the NOMAD Repository & Archive is in the fact that users upload the full input and output files from their calculations into the Repository, and then such information is mapped onto the Archive, which (other) users can access via a unified application programming interface (API).
Materials science has also embraced the second aspect of the fourth paradigm, i.e., AI-driven analysis. The applications of AI to materials science span two main classes of methods. One is the modeling of potential-energy surfaces by means of statistical models that promise to yield ab initio accuracy at a fraction of the evaluation time [13,14,15,16,17,18] (if the CPU time necessary to produce the training data set is not considered). The other class is the advent of so-called materials informatics, i.e., the statistical modeling of materials aimed at predicting their physical, often technologically relevant properties [19,20,21,22,23,24], by knowing limited input information about them, often just their stoichiometry. The latter aims at identifying the minimal set of descriptors (the materials’ genes) that correlate with properties of interest. This aspect, together with the observation that only a very small fraction of the almost infinite number of possible materials is known today, may lead to the identification of undiscovered materials that have properties (e.g., conductivity, plasticity, elasticity, etc.) superior to the known ones.
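As a rough illustration of what "descriptors that correlate with properties of interest" can mean in practice, the sketch below ranks candidate descriptors by the absolute Pearson correlation with a target property. This is a deliberately simple stand-in, not the method of any of the cited works, and the descriptor names and numerical values are synthetic toy data invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_descriptors(descriptors, target):
    """Sort descriptor names by |correlation| with the target, strongest first."""
    scores = {
        name: abs(pearson(values, target))
        for name, values in descriptors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic toy values (not real materials data), one entry per "material"
toy_descriptors = {
    "descriptor_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "descriptor_2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "descriptor_3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
toy_property = [2.1, 3.9, 6.2, 8.0, 9.8]

ranking = rank_descriptors(toy_descriptors, toy_property)
for name, score in ranking:
    print(f"{name}: |r| = {score:.3f}")
```

A linear correlation ranking only captures one-descriptor, linear relationships; it is meant to convey the idea of screening descriptors against a property, not to replace a proper feature-selection method.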
The NOMAD Center of Excellence (CoE) has recognized the importance of enabling the AI analysis of the stored FAIR data and has launched the NOMAD AI Toolkit. This web-based infrastructure allows users to run computational notebooks in a web browser (i.e., interactive documents that freely mix code, results, graphics, and text, supported by a suitable virtual environment) for performing complex queries and AI-based exploratory analysis and predictive modeling on the data contained in the NOMAD Archive. In this respect, the AI Toolkit takes the concept of FAIR data to its next, necessary step, by recognizing that the most promising purpose of the FAIR principles is enabling AI analysis of the stored data. As a mnemonic, the next step in FAIR data starts by upgrading its meaning to "findable and AI-ready data." [25]
The mission of the NOMAD AI Toolkit is threefold, as reflected in the access points shown on its home page (Fig. 1):
- Providing an API and libraries for accessing and analyzing the NOMAD Archive data via state-of-the-art (and beyond) AI tools;
- Providing a set of tutorials with a shallow learning curve, from the hands-on introduction to the mastering of AI techniques; and
- Maintaining a community-driven, growing collection of computational notebooks, each dedicated to an AI-based materials science publication. (By providing both the annotated data and the scripts for their analysis, students and scholars worldwide are enabled to retrace all the steps that the original researchers followed to reach publication-level results. Furthermore, the users can modify the existing notebooks and quickly check alternative ideas.)
The data science community has introduced several platforms for performing AI-based analysis of scientific data, typically by providing rich libraries for machine learning (ML) and AI, and often offering users online resources for running electronic notebooks. General-purpose frameworks such as Binder [26] and Google Colab [27], as well as dedicated materials science frameworks such as nanoHUB [28], pyIron [29], AiidaLab [30], and MatBench [31], are the most used by the community. In all these cases, great effort is devoted to education via online and in-person tutorials. The main specificity of the NOMAD AI Toolkit is in connecting, within the same infrastructure, the data stored in the NOMAD Archive to their AI analysis. Moreover, as detailed below, users have in the same environment all available AI tools, as well as access to the NOMAD data, without the need to install anything.
The rest of this paper is structured as follows. In the “Results” section, we describe the technology of the AI Toolkit. In the “Discussion” and “Data Availability” sections, we describe two exemplary notebooks: one notebook is a tutorial introduction to the interactive querying and exploratory analysis of the NOMAD Archive data, and the other notebook demonstrates the possibility of reporting publication-level materials science results [32] while enabling users to work hands-on with the workflow, by modifying the input parameters and observing the impact of their interventions.
Results
Technology
References
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.