Journal:The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding

Full article title	The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding
Journal	npj Computational Materials
Author(s)	Sbailò, Luigi; Fekete, Ádám; Ghiringhelli, Luca M.; Scheffler, Matthias
Author affiliation(s)	Humboldt-Universität zu Berlin, Max-Planck-Gesellschaft
Primary contact	Email: ghiringhelli at fhi dash berlin dot mpg dot de
Year published	2022
Volume and issue	8
Article #	250
DOI	10.1038/s41524-022-00935-z
ISSN	2057-3960
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.nature.com/articles/s41524-022-00935-z
Download	https://www.nature.com/articles/s41524-022-00935-z.pdf (PDF)

This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.

Abstract

We present the Novel Materials Discovery (NOMAD) Artificial Intelligence (AI) Toolkit, a web-browser-based infrastructure for the interactive AI-based analysis of materials science data under FAIR (findable, accessible, interoperable, and reusable) data principles. The AI Toolkit readily operates on FAIR data stored in the central server of the NOMAD Archive, the largest database of materials science data worldwide, as well as locally stored, user-owned data. The NOMAD Oasis, a local, stand-alone server can also be used to run the AI Toolkit. By using Jupyter Notebooks that run in a web-browser, the NOMAD data can be queried and accessed; data mining, machine learning (ML), and other AI techniques can then be applied to analyze them. This infrastructure brings the concept of reproducibility in materials science to the next level by allowing researchers to share not only the data contributing to their scientific publications, but also all the developed methods and analytics tools. Besides reproducing published results, users of the NOMAD AI Toolkit can modify Jupyter Notebooks toward their own research work.

Keywords: computational methods, theory and computation, artificial intelligence, materials science, FAIR principles

Introduction

Data-centric science has been identified as the fourth paradigm of scientific research. We observe that the novelty introduced by this paradigm is twofold. First, we have seen the creation of large, interconnected databases of scientific data, which are increasingly expected to comply with the so-called FAIR principles [1] of scientific data management and stewardship, meaning that the data and related metadata need to be findable, accessible, interoperable, and reusable (or repurposable, or recyclable). Second, we have seen growing use of artificial intelligence (AI) algorithms, applied to scientific data, in order to find patterns and trends that would be difficult, if not impossible, to identify by unassisted human observation and intuition.

In the last few years, materials science has experienced both of these novelties. Databases, in particular from computational materials science, have been created via high-throughput screening initiatives, mainly boosted by the U.S.-based Materials Genome Initiative (MGI), starting in the early 2010s, e.g., AFLOW [2,] the Materials Project [3], and OQMD. [4] At the end of 2014, the NOMAD (Novel Materials Discovery) Laboratory launched the NOMAD Repository & Archive [5,6,7], the first FAIR storage infrastructure for computational materials science data. NOMAD’s servers and storage are hosted by the Max Planck Computing and Data Facility (MPCDF) in Garching (Germany). The NOMAD Repository stores, as of today, input and output files from more than 50 different atomistic (ab initio and molecular mechanics) codes. It total, more than 100 million total-energy calculations have been uploaded by various materials scientists from their local storage, or from other public databases. The NOMAD Archive stores the same information, but it is converted, normalized, and characterized by means of a metadata schema, the NOMAD Metainfo [8], which allows for the labeling of most of the data in a code-independent representation. The translation from the content of raw input and output files into the code-independent NOMAD Metainfo format makes the data ready for AI analysis.

Besides the above-mentioned databases, other platforms for the open-access storage and access of materials science data appeared in recent years, such as the Materials Data Facility [9,10] and Materials Cloud. [11] Furthermore, many groups have been storing their materials science data on Zenodo [12] and have provided the digital object identifier (DOI) to openly access them in publications. The peculiarity of the NOMAD Repository & Archive is in the fact that users upload the full input and output files from their calculations into the Repository, and then such information is mapped onto the Archive, which (other) users can access via a unified application programming interface (API).

Materials science has embraced also the second aspect of the fourth paradigm, i.e., AI-driven analysis. The applications of AI to materials science span two main classes of methods. One is the modeling of potential-energy surfaces by means of statistical models that promise to yield ab initio accuracy at a fraction of the evaluation time [13,14,15,16,17,18] (if the CPU time necessary to produce the training data set is not considered). The other class is the advent of so-called materials informatics, i.e., the statistical modeling of materials aimed at predicting their physical, often technologically relevant properties [19,20,21,22,23,24], by knowing limited input information about them, often just their stoichiometry. The latter aims at identifying the minimal set of descriptors (the materials’ genes) that correlate with properties of interest. This aspect, together with the observation that only a very small amount of the almost infinite number of possible materials is known today, may lead to the identification of undiscovered materials that have properties (e.g., conductivity, plasticity, elasticity, etc.) superior to the known ones.

The NOMAD CoE has recognized the importance of enabling the AI analysis of the stored FAIR data and has launched the NOMAD AI Toolkit. This web-based infrastructure allows users to run in web-browser computational notebooks (i.e., interactive documents that freely mix code, results, graphics, and text, supported by a suitable virtual environment) for performing complex queries and AI-based exploratory analysis and predictive modeling on the data contained in the NOMAD Archive. In this respect, the AI Toolkit pushes to the next, necessary step the concept of FAIR data, by recognizing that the most promising purpose of the FAIR principles is enabling AI analysis of the stored data. As a mnemonic, the next step in FAIR data starts by upgrading its meaning to "findable and AI-ready data." [25]

The mission of the NOMAD AI Toolkit is threefold, as reflected in the access points shown in its home page (Fig. 1):

Providing an API and libraries for accessing and analyzing the NOMAD Archive data via state-of-the-art (and beyond) AI tools;
Providing a set of tutorials with a shallow learning curve, from the hands-on introduction to the mastering of AI techniques; and
Maintaining a community-driven, growing collection of computational notebooks, each dedicated to an AI-based materials science publication. (By providing both the annotated data and the scripts for their analysis, students and scholars worldwide are enabled to retrace all the steps that the original researchers followed to reach publication-level results. Furthermore, the users can modify the existing notebooks and quickly check alternative ideas.)

Figure 1. The NOMAD AI Toolkit homepage showcases the three purposes of the NOMAD AI toolkit: querying (and analyzing) the content of the NOMAD Archive, providing tutorials for AI tools, and accessing the AI workflow of published work. The fourth access point, "Get to work," is for experienced users, who can create and manage their own workspace.

The data science community has introduced several platforms for performing AI-based analysis of scientific data, typically by providing rich libraries for machine learning (ML) and AI, and often offering users online resources for running electronic notebooks. General purpose frameworks such as Binder [26] and Google Colab [27]—as well as dedicated materials science frameworks such as nanoHUB [28], pyIron [29], AiidaLab [30], and MatBench [31]—are the most used by the community. In all these cases, great effort is devoted to education via online and in-person tutorials. The main specificity of the NOMAD AI Toolkit is in connecting within the same infrastructure the data, as stored in the NOMAD Archive, to their AI analysis. Moreover, as detailed below, users have in the same environment all available AI tools, as well as access to the NOMAD data, without the need to install anything.

The rest of this paper is structured as follows. In the “Results” section, we describe the technology of the AI Toolkit. In the “Discussion” and “Data Availability” sections, we describe two exemplary notebooks: one notebook is a tutorial introduction to the interactive querying and exploratory analysis of the NOMAD Archive data, and the other notebook demonstrates the possibility to report publication-level materials science results [32], while enabling the users to put their hands on the workflow, by modifying the input parameters and observing the impact of their interventions.

Results

Technology

We provide a user-friendly infrastructure to apply the latest AI developments and the most popular ML methods to materials science data. The NOMAD AI Toolkit aims at facilitating the deployment of sophisticated AI algorithms by means of an intuitive interface that is accessible from a webpage. In this way, AI-powered methodologies are transferred to materials science. In fact, the most recent advances in AI are usually available as software stored on web repositories. However, these need to be installed in a local environment, which requires specific bindings and environment variables. Such an installation can be a tedious process, which limits the diffusion of these computational methods, and also brings in the problem of reproducibility of published results. The NOMAD AI Toolkit offers a solution to this, by providing the software, that we install and maintain, in an environment that is accessible directly from the web.

Docker [33] allows the installation of software in a container that is isolated from the host machine where it is running. In the NOMAD AI Toolkit, we maintain such a container, installing therein software that has been used to produce recently published results and taking care of the versioning of all required packages. Jupyter Notebooks are then used inside the container to interact with the underlying computational engine. Interactions include the execution of code, displaying the results of computations, and writing comments or explanations by using markup language. We opted for Jupyter Notebooks because such interactivity is ideal for combining computation and analysis of the results in a single framework. The kernel of the notebooks, i.e., the computational engine that runs the code, is set to read Python. Python has built-in support for scientific computing as the SciPy ecosystem, and it is highly extensible because it allows the wrapping of code written in compiled languages such as C or C++. This technological infrastructure is built using JupyterHub [34] and deploys servers that are orchestrated by Kubernetes on computing facilities offered by the MPCDF in Garching, Germany. Users of the AI Toolkit can currently run their analyses on up to eight CPU cores, with up to 10 GB RAM.

A key feature of the NOMAD AI Toolkit is that we allow users to create, modify, and store computational notebooks where original AI workflows are developed. From the “Get to work” button accessible at https://nomad-lab.eu/aitoolkit, registered users are redirected to a personal space, where we provide 10 GB of cloud storage and where work can also be saved. Jupyter Notebooks, which are created inside the “work” directory in the user's personal space, are stored on our servers and can be accessed and edited over time. These notebooks are placed in the NOMAD AI Toolkit environment, which means that all software and methods demonstrated in other tutorials can be deployed therein. The versatility of Jupyter Notebooks in fact facilitates an interactive and instantaneous combination of different methods. This is useful if one aims at, e.g., combining different methods available in the NOMAD AI Toolkit in an original manner, or to deploy a specific algorithm to a dataset that is retrieved from the NOMAD Archive. The original notebook, which is developed in the "work" directory, might then lead to a publication, and the notebook can then be added to the “Published results” section of the AI Toolkit.

Contributing

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.