Journal:Integration of X-ray absorption fine structure databases for data-driven materials science

Full article title	Integration of X-ray absorption fine structure databases for data-driven materials science
Journal	Science and Technology of Advanced Materials: Methods
Author(s)	Ishii, Masashi; Tanabe, Kosuke; Matsuda, Asahiko; Ofuchi, Hironori; Matsumoto, Takahiro; Yaji, Toyonari; Inada, Yasuhiro; Nitani, Hiroaki; Kimura, Masao; Asakura, Kiyotaka
Author affiliation(s)	National Institute for Materials Science, Japan Synchrotron Radiation Research Institute, Ritsumeikan University, Hokkaido University
Primary contact	Email: ISHII dot Masashi at nims dot go dot jp
Year published	2023
Volume and issue	3(1)
Article #	2197518
DOI	10.1080/27660400.2023.2197518
ISSN	2766-0400
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.tandfonline.com/doi/full/10.1080/27660400.2023.2197518
Download	https://www.tandfonline.com/doi/epdf/10.1080/27660400.2023.2197518 (PDF)

This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.

Abstract

With the aim of introducing data-driven science and establishing an infrastructure for making X-ray absorption fine structure (XAFS) spectra findable and reusable, we have integrated XAFS databases in Japan. This integrated database (MDR XAFS DB) enables cross searching of spectra from more than 2,000 samples and more than 700 unique materials with machine-readable metadata. The introduction of a materials dictionary with approximately 6,000 synonyms has improved the search performance, and links with large external databases have been established. In order to compare spectra in the database, the energy calibration policies of each institution were compiled, and the energy calibration methods across institutions were shown. This clarified how to utilize the MDR XAFS DB as a knowledge base. The database created through this cross-institution initiative is a model case for the further development of databases for other methods and materials informatics processes using them.

Keywords: X-ray absorption fine structure, data integration, metadata, materials data repository, DOI, RDF

Introduction

While new data-driven scientific discoveries are progressing in various fields [1], ensuring sources of data has become a serious challenge. In particular, data collection in experimental science requires innovations due to the time-consuming tasks involved in data acquisition. There have been trials in many studies, for example, in the development of high-throughput experiments using robotics and combinatorial techniques. [2–4] However, measurements that require a variety of experimental environments, such as operando [5] and low-temperature measurements, are not always suitable for such high-throughput experiments. For the accumulation of data from experiments that require diverse environments, one possible solution is the integration of data through the cooperation of related researchers. [6] Given the diverse range of users involved, the requirements for this data integration are as follows:

The benefits of data integration should be not only in data-driven science but also in everyday research.
The data and metadata should be in as few formats as possible (ideally one format).
The publication infrastructure should be prepared as a repository with policies for data utilization, such as the FAIR Principles. (FAIR is an acronym for "findable, accessible, interoperable, reusable" and is a basic guideline for the utilization of data.) [7]
The database infrastructure should have search functionality and not just storage online.

The X-ray absorption fine structure (XAFS) [8,9] discussed in this paper is a typical synchrotron radiation experimental technique that provides the atomic-level local structure (e.g., bond length, coordination number, etc.) and electronic states of a specific element by exciting its inner-shell electrons. Atomic-scale observation areas have a high commonality even if the samples are intended for various applications or are processed in multiple ways. In other words, many researchers across different fields, including materials science, can discuss a single spectrum and feed back the knowledge they obtained from their samples. The establishment of a basis on which various XAFS spectra can be superimposed and compared stimulates research. We have established an infrastructure for sharing XAFS spectra by integrating XAFS databases in Japan. In this paper, we clarify the problems with integrating data and discuss the solutions attempted in this initiative.

Activities of XAFS database

In order to understand international trends in XAFS databases, we have summarized well-known data provision services outside of Japan:

1. Farrel Lytle Database: The Farrel Lytle Database is a collection of data measured by F.W. Lytle and is probably the world’s oldest and largest XAFS database operated by the International X-ray Absorption Society (IXAS). There are over 7,000 RAW data items, and PROCESSED data compressed into a standard format are also available.

2. IXAS X-ray Absorption Data Library: The IXAS X-ray Absorption Data Library is operated by IXAS and publishes a total of 276 spectra covering 20 absorption edges, measured primarily at the Advanced Photon Source (APS) and the Stanford Synchrotron Radiation Lightsource (SSRL). There are 105 unique sample types. Data are stored in the XAFS Data Interchange (XDI) Format [10], in which each metadata line in the header takes the form # + Key + Value. This makes the data easy to reuse.
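
The "# + Key + Value" header layout lends itself to simple machine reading. The following is a minimal sketch, assuming each metadata line has the form "# Key: Value" and that the header ends at the first line that does not start with "#". The function name parse_header, the key names, and the sample text are illustrative assumptions, not taken from an actual file in the library.

```python
def parse_header(text):
    """Collect '# Key: Value' header lines into a dict (assumed layout)."""
    metadata = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first data row
        body = line[1:].strip()
        key, sep, value = body.partition(":")
        # Lines without a colon (e.g. separators) carry no key/value pair.
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    return metadata


# Hypothetical file content, for illustration only.
sample = """# Element.symbol: Ni
# Element.edge: K
# Mono.name: Si(111)
8300.0 1.02
8301.0 1.05
"""

header = parse_header(sample)
print(header)
```

Because the metadata travel in the same file as the spectrum, a reader like this needs no external lookup to learn which element and edge were measured.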

3. ID21 Sulfur XANES Spectra Database: The ID21 Sulfur XANES Spectra Database represents a collection of data provided by the ID21 beamline users at the European Synchrotron Radiation Facility (ESRF). The database is particularly rich in chemical information on samples, which makes it easy to reuse data. Graphical and text data are provided. The database contains 43 inorganic and 29 organic material spectra.

In response to such XAFS database activity outside Japan, the database constructed in this initiative has successfully integrated the major XAFS databases currently available in Japan. The features of these databases are summarized below:

4. BL14B2 XAFS Standard Sample Database: The BL14B2 XAFS Standard Sample Database is the largest XAFS database in Japan, owned by SPring-8 and operated by Japan Synchrotron Radiation Research Institute (JASRI). The database contains spectral data on 1,913 chemical substances. All of the measured samples are defined as "Standard." For example, for commercial products, information such as the supplier and model number is included in the metadata, making them traceable. The data can also be obtained in bulk by installing the downloader software provided.

5. Hokkaido University XAFS DB: The Hokkaido University XAFS DB is the oldest XAFS database in Japan. It was developed in collaboration with the Japan XAFS Society (JXS) and is operated by the Institute for Catalysis (ICAT). Its history and operational policy are described by Asakura et al. [6], who point out the necessity of data integration for the XAFS community, one of the triggers for this project. Currently, approximately 300 spectral data are included in the database.

6. Ritsumeikan University Soft X-ray XAFS Database: The Ritsumeikan University Soft X-ray XAFS Database is an open-access database from Ritsumeikan University, which has a soft X-ray synchrotron radiation facility. The database is operated by the Ritsumeikan SR Center. While most of the data are hard X-ray XAFS spectra, this database is a valuable data source that complements the spectra of light elements. Currently, 194 spectra from 98 samples are available using the following detection techniques: Total Electron Yield (TEY), Partial Electron Yield (PEY), Partial Fluorescence Yield (PFY), Inverse Partial Fluorescence Yield (IPFY), and Total Fluorescence Yield (TFY).

7. Photon Factory XAFS Database: The Photon Factory XAFS Database is published by the Institute of Materials Structure Science (IMSS), which operates the Photon Factory (PF). Data are registered by facility personnel and PF users, and currently 148 spectral data are publicly available. The metadata must be parsed from the header of the data file.

Integration of XAFS databases: Issues and trials

We have integrated the four above-mentioned Japanese databases in this initiative and created a new public infrastructure, the MDR XAFS DB. [11] The most important function of an integrated database is cross searching, and there are two main issues in realizing this: designing and collecting metadata describing spectra and sample details, and unifying the vocabulary used in the metadata, including not only metadata items (keys) but also descriptions (values).

Since XAFS experiments are usually performed at large synchrotron radiation facilities, the conditions of the storage ring for X-ray generation and the optical system for extraction of monochromatic X-rays can almost all be automatically obtained as metadata. The problem is how to collect user-dependent metadata, such as experimental conditions, in a defined format, that is, keys and values expressing sample composition, shape, customized measurement parameters, etc., since these can be written in a variety of ways. Therefore, the format of user-dependent metadata needs to be defined and structured. Another problem is that each synchrotron radiation facility has its own metadata descriptions. In the following, such individual metadata is referred to as "local metadata." Local metadata must eventually be integrated with data that is shared with other facilities. Even if the above issue is resolved, if the vocabulary used for keys and values is not unified, the search performance of the integrated database will deteriorate. In this study, we focused on the project goals of integrating XAFS spectral data and cross searches, and we found the following practical solutions to the above issues.

Design and collection of metadata

Although the data format of XAFS spectra is based on simple columns of incidence and absorption X-ray intensities in a certain photon energy range, various formats are available. In Japan, there are 9809 (PF and SPring-8 Standard), REX [12], and Athena [13] formats, etc., that are compatible with post-experimental data analysis software. Metadata is placed in the header, providing the metadata necessary for analysis and some additional information. However, considering data reuse, these few pieces of metadata are not sufficient, and a wide variety of metadata needs to be organized, as described below. In such cases, it is not desirable to include a few lines of metadata as a header, and it is necessary to prepare a structured metadata file separate from the data file. In other words, it is necessary to maintain the existing data file, add a structured metadata file, and consider how to use it as a new information source to achieve the desired functionality.

Here we describe the general concept of metadata and the methods we adopted to achieve this goal. Figure 1(a) conceptually shows a general metadata hierarchy (stacked metadata model). Figure 1(b) shows schematically the scale of the users of each hierarchy level. The first (top) level is metadata that is always present in any study, such as names, institutions, etc. Its users are broad, and its content is shallow and requires no specialized knowledge. The second level is large category metadata, such as specific measurements (e.g., synchrotron radiation experiments) and samples, which require a certain level of specialized knowledge and have fewer users. The third (bottom) level is metadata specific to XAFS that is highly specialized and has in-depth content with little commonality. Its users are limited to a small number of researchers in the materials field. In general, as shown in Figure 1(a), the number of metadata keys increases as the hierarchy becomes deeper, and it is necessary to handle a variety of contents. The relationship between (a) and (b) is that of a pyramid and an inverted pyramid. We believe that there is more than one way to use metadata, but the appropriate key should be used according to the purpose. It is desirable that all the keys are used for wide and shallow and narrow and deep use, as shown in Figure 1. Since the purpose of the MDR XAFS DB is a cross search, we extracted the keys in the first and second levels with a careful review, according to the purpose of the search.

Figure 1. (a) Stacked metadata model with a hierarchy of keys that increase in number as they become more specialized, and (b) the scale of users at each level of the hierarchy.

We organized local metadata as shown in Table 1. The keys are classified according to the following purposes:

Keys for general information
Keys related to the reproducibility and reliability of XAFS experiments
Keys necessary for the integration of XAFS spectrum data

Table 1. Categorization of keys contained in local metadata.
Purpose	Typical keys	Use case
General information	Date, Experimenter, Facility, Beamline, Method, Sample	Comparison with other experimental data, Discovering relevant data
Reproducibility and reliability of XAFS experiments	Monochromator, Mirror, Slit, Energy calibration, Number of measurement points, Step width, Ion chamber gas, Amplifier gain	Accuracy evaluation, Detection limits, Reproduction of experiments, Precise analyses
Integration of XAFS spectrum data	Column name, Unit, Data format	Big data creation, Statistical analysis, Machine learning

In the case of purpose two, it is highly specialized and not necessary for all researchers of materials, but it is essential for XAFS researchers. Therefore, purpose two corresponds to the third level in Figure 1(a). Additionally, purpose three is information necessary for recent data-driven research. That is, in order to perform big data creation, statistical analysis, and machine learning, information about the definition of the content in each column and its data format is necessary at the data merging stage. Further, since multiple data formats are mixed in the MDR XAFS DB, as mentioned above, this information is necessary for XAFS spectrum analysts.

Consequently, most of the metadata in purposes two and three are necessary for data use but not for cross searches. It is clear that general information in purpose one, e.g., beamline name, measurement technique, and sample name, is suitable for cross searches. The number of metadata keys commonly handled here is likely to be fewer than 10. The next section, on the construction of the MDR XAFS DB, discusses which keys to assign to these general metadata and how to use them, including the constraints of the actual data infrastructure.

Unification of vocabulary

Examples of successful lexicon creation can be seen in Wikidata projects. There, each vocabulary is uniquely managed by assigning IDs to each vocabulary in turn, and synonyms are registered to prevent vocabulary fluctuations. The National Institute for Materials Science (NIMS) has adopted a similar system to manage research vocabulary and has established the materials vocabulary platform (MatVoc)^[1], which manages material names and other information using IDs called "QIDs." This platform is already in use in the search system and was released to the public in January 2023. We have used this dictionary to streamline the process of checking whether the material is the same as previously registered data. Currently, this work is performed manually by the database editor, but in the future, it may be used by users to identify names when registering data, and furthermore, it may be automated by machines. Lexicographic control is extremely important for material names, which are extremely diverse in the way they are described. However, as the registration of spectra by individuals begins in the future, it is quite possible that common names and abbreviations will be included in the metadata for beamlines and facilities as well, and the importance of vocabulary management is expected to increase. In fact, as discussed later, facility and beamline policies are incorporated into the energy calibration and metadata contents, thus they can be parameters for data screening.

Furthermore, these IDs are also used as Uniform Resource Identifiers (URIs), which form a publicly available namespace, that is, a space of material-related lexicons.^[2] In this space, one can find the standardized names of materials together with their QIDs and chemical formulas (if present). For example, the QID for tin(II) chloride dihydrate is Q2307. (This URI has content for Q2307 in machine-readable format.)

There are currently 713 entities registered as XAFS-related material names, and the number of synonyms is about 6,000. Within MatVoc, many materials are assigned Chemical Abstracts Service (CAS) registry numbers so that the vocabulary can be linked with large external databases. (The mapping to external URIs and the resulting validation of data linkage are discussed further in the later section on the contents of the MDR XAFS DB.) The concepts of data and vocabulary management in the project are not limited to the MDR XAFS DB but are general in nature, and their details will be presented at another time.
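
The ID-plus-synonym approach can be illustrated with a small lookup table. This is a minimal sketch and not the MatVoc implementation: only the QID Q2307 for tin(II) chloride dihydrate comes from the text, while the synonym strings, the normalization rule, and the function names are assumptions made for illustration.

```python
# Hypothetical miniature of a synonym dictionary. Only "Q2307" and the name
# "tin(II) chloride dihydrate" come from the text; the other synonyms are
# illustrative assumptions.
SYNONYMS = {
    "Q2307": [
        "tin(II) chloride dihydrate",
        "stannous chloride dihydrate",
        "SnCl2·2H2O",
    ],
}


def normalize(name):
    """Collapse whitespace and ignore case so spelling variants match."""
    return " ".join(name.split()).casefold()


def build_index(synonyms):
    """Invert {QID: [names]} into {normalized name: QID}."""
    index = {}
    for qid, names in synonyms.items():
        for name in names:
            index[normalize(name)] = qid
    return index


INDEX = build_index(SYNONYMS)


def resolve(name):
    """Return the QID for a material name, or None if it is unregistered."""
    return INDEX.get(normalize(name))


print(resolve("Stannous  Chloride Dihydrate"))
print(resolve("unregistered material"))
```

Resolving every incoming name to a single ID before comparison is what lets an editor check whether a material matches previously registered data, regardless of how the name was written.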

Construction of the MDR XAFS DB

Database policy

As described above, earlier efforts to build XAFS databases were made individually. Taking a broad view, we are in a transitional period between the past, when spectral data only needed to be understood by the person who measured them, and a recent policy that aims for a cyber society in which understandable metadata are added to the data and shared with many people. In fact, some databases still follow the tradition of leaving information that should be recorded separately as metadata in the file name or sample name, to serve as a reminder to the person who recorded it. On the other hand, databases that seek to collect data systematically have machine-readable metadata, even though they cannot follow pioneering standard data formats such as NeXus. [14] Therefore, deep data linkage is possible through an interface that allows correspondence to be established. Although these differences in policies among the participating institutions were a challenge in integrating the databases, a construction policy was formulated and the integrated database MDR XAFS DB was constructed based on this policy. Here, the Materials Data Repository (MDR) [15], which serves as the database infrastructure, is operated as part of a data platform project that has been underway at NIMS since 2017.

MDR has functions and operational policies suitable for open data in accordance with the FAIR Principles [7], which are becoming a fundamental concept for data utilization. Notably, data registered in the MDR are assigned a Digital Object Identifier (DOI) to enhance the visibility of the data. MDR also has an application programming interface (API) function, which enables not only a graphical user interface (GUI) but also large data unit operations that are suitable for data-driven science. The repository in this project is divided into three main areas: publications, datasets, and collections that systematically archive data. At the time of writing this paper, approximately 1,272 publications and 2,370 datasets have been registered. Each dataset in the XAFS DB is stored in the datasets area, and all data are also registered in a collection for systematic browsing. Currently, there are 15 such systematically organized sets of data, that is, collections. The MDR is an open data repository and can be used according to the license granted to each piece of data.

Considering the background so far, i.e., the requirements from the XAFS community, including the cross searches described in the prior section, and MDR’s engineering abilities, we decided on the following construction policy for the MDR XAFS DB:

Each spectral data provided by each institution must be accompanied by a structured local metadata file in Yet Another Markup Language (YAML) format.
Keys in the local metadata should be standardized so that the data can be searched seamlessly without being aware of the differences between data-providing institutions.
The keys to be standardized are the names of materials, chemical formulas, absorption edges, beamline names, and monochromator crystals.
The set of metadata and the spectral data of the sample and reference sample should be defined as "1 Work," and each Work should be assigned a DOI.
Each data providing institution is responsible for the quality and rights of the data, and data that have already been published should be used.
The data to be released in the MDR XAFS DB should be open-access spectra and their supplementary data only, and the license should be Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). [16]

Metadata implementation for cross searches

The policy in the prior subsection had to be consistent with the cross-search requirements discussed in the section prior to that. That is, the names of materials, chemical formulas, absorption edges, and spectrometer crystals had to be extracted from the local metadata provided in YAML format by each participating institution and then embedded in the MDR metadata. Since MDR is not a specialized repository for a specific area of materials science, it is not suitable for creating an advanced database customized for a single purpose, i.e., XAFS. On the other hand, it is advantageous for linking with other data in MDR because it integrates data from a wide range of areas that are not limited to XAFS. In any case, based on this data provision concept, the MDR has its own data structure and rules for input (schema) [17], so it was not possible to fit all the key values for these cross searches into the MDR metadata. For example, with beamline names there is no commonality except for synchrotron radiation experiments, and there are no applicable keys in the MDR metadata schema. Therefore, the following keys for cross searches were extracted from the local metadata of each organization and implemented as values for "Keyword," which is one of the keys in the MDR metadata schema. The following is an example of keywords extracted in YAML format:

subjects:
  - subject: Nickel # Material name
  - subject: Ni # Chemical formula
  - subject: Ni K-edge # Absorption edge
  - subject: Pure metal # Material superordinate
  - subject: Si(111) # Monochromator crystals
  - subject: BL-12C # Beamline name
  - subject: Photon Factory # Data provider
  - subject: XAFS # Measurement method (fixed)
  - subject: collection – MDR XAFS DB # Identification of collection (fixed)

The comment text after each "#" states the definition of the value and is included for the reader's convenience. Although metadata keys should be precisely defined, the polymorphic key "Subject" is utilized here. This is because it follows DataCite’s schema for obtaining DOIs^[3], but it should be noted that this key is used only as the index for cross searches in MDR. As described below, we have demonstrated that these simplified keys are sufficient for screening data. When cross searching many fields, the use of a univocal key may inadvertently limit the search target. The advantage of the MDR keyword function is that users can filter the data by sequentially selecting these keys. For example, selecting "Absorption edge" narrows the data to the relevant excitation element, and then selecting "Material superordinate" leads to the desired material system. Here, the vocabulary used in the keywords should follow the nomenclature described in the prior section so that users can search the data seamlessly regardless of the institution that registered them. Furthermore, it is also possible to select an institution by choosing "Data provider" in the keywords.
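
The sequential filtering described here amounts to intersecting keyword selections one after another. The following is a minimal sketch under stated assumptions: the record layout, the "id" field, and the second and third records are hypothetical, and the function filter_by_keywords is an illustration rather than the MDR search implementation. Only the keywords of the first record are taken from the example in the text.

```python
# Hypothetical works, each carrying a flat list of "subject" keywords.
# Only the first record mirrors the example in the text; the others are
# made up for illustration.
works = [
    {"id": "work-1", "subjects": ["Nickel", "Ni", "Ni K-edge", "Pure metal",
                                  "Si(111)", "BL-12C", "Photon Factory", "XAFS"]},
    {"id": "work-2", "subjects": ["Nickel oxide", "NiO", "Ni K-edge", "Oxide",
                                  "Si(311)", "XAFS"]},
    {"id": "work-3", "subjects": ["Copper", "Cu", "Cu K-edge", "Pure metal",
                                  "Si(111)", "XAFS"]},
]


def filter_by_keywords(records, *keywords):
    """Keep only records that contain every selected keyword, in turn."""
    selected = list(records)
    for keyword in keywords:
        selected = [r for r in selected if keyword in r["subjects"]]
    return selected


# First narrow by absorption edge, then by material superordinate.
step1 = filter_by_keywords(works, "Ni K-edge")
step2 = filter_by_keywords(works, "Ni K-edge", "Pure metal")
print([r["id"] for r in step1])
print([r["id"] for r in step2])
```

Because every keyword sits under the same polymorphic key, the order of selection does not matter and no keyword is excluded by a mismatch in key names between institutions.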

Database management

These cross-institutional initiatives require systematic database management. This section describes how data are registered, assigned DOIs, and maintained. As shown in Figure 2, data registration begins with the submission of spectral data and local metadata including necessary information, such as data provider information and rights statements. Registration is completed when it is confirmed that the registration data are displayed correctly on the test server. Within MDR, after the DOI is issued via electronic submission, the data are added to the MDR XAFS DB in the MDR Collection and eventually released to the public. The cross-search keywords described in the prior subsection are also used to obtain DOIs and are the target of searches by DataCite, an organization that grants DOIs for research data. Automating and simplifying the registration procedure will make it easier for users to register data directly in the future. Data registration is a joint initiative of materials scientists, engineers in charge of MDR, and service team members to handle data from the data-providing institutions that have contracts with NIMS. The contract procedure guarantees the legality of data use, and the names of these responsible institutions also appear in the keywords mentioned above. Granting a DOI means that spectral data are no longer merely stored but published, with the responsibility that publication carries. For example, due to the persistence of DOIs, if a serious error is found, a tombstone page is created indicating the reason for the error. Indeed, tombstone pages have been created for seven spectral data so far. This situation is undesirable, and further consideration should be given to how much effort needs to be devoted to the peer review of registration data.

Figure 2. Spectra registration flow for publication in MDR XAFS.

Contents of the MDR XAFS DB

Statistics

As of September 2022, the statistical information of the MDR XAFS DB, which was created by integrating the databases of the four institutions described above, is as follows:

Total number of data: 2,174 (contains seven invalidated data with DOIs)
Total number of absorption edges: K-edge 1,310 and L-edge 864
Unique absorption edges: K-edge 47 and L-edge 23
Unique materials: 713

Figure 3 summarizes the number of K-edge (a) and L-edge (b) data in histograms. As shown in these figures, the number of spectra per absorption edge ranges from more than 100 at the Ni K-edge and W L-edge down to zero at unregistered edges. The number of highly monochromatic incident X-ray measurements using Si(311) as the monochromator crystal is also shown as a line graph. Approximately 45 percent of the K-edge and 30 percent of the L-edge data are high-resolution spectral measurements, and the MDR XAFS DB can easily filter these high-resolution spectra using the keyword "Si(311)."

Figure 3. Number of data for (a) K-absorption edge and (b) L-absorption edge shown in histograms.

Figure 4 shows the number of registered absorption edges sorted in descending order of number. The inset shows the top 10 absorption edges marked in yellow in the figure and their spectral numbers for both K-edge and L-edge. (More detailed registration numbers are listed on the MDR XAFS DB ReadMe page.) The accumulation is also shown. The results show that 90 percent of spectra are covered by 24 elements in K-edge and 13 elements in L-edge, which roughly correspond to 50 percent of the major absorption edges, indicating that there are many absorption edges with low registration numbers. Ideally, these curves should increase linearly or follow a curve according to a strategic spectrum collection plan. We are considering extending the K-edge spectrum to the Zn-Zr region, where a gap is seen in Figure 3(a), and the L-edge spectrum to lighter elements. Establishing a cooperative system in the community, such as by supplying samples to participating institutions, is also desirable.

Figure 4. Number of absorption edge spectra registered in the MDR XAFS DB sorted in descending order and their accumulation.
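
The accumulation curve in Figure 4 can be computed by sorting the per-edge counts in descending order and taking a running sum. The following is a minimal sketch of that calculation. The counts are hypothetical placeholders, not the registered numbers, and the function name edges_needed is an assumption for illustration.

```python
def edges_needed(counts, fraction=0.9):
    """Number of top-ranked edges whose spectra reach `fraction` of the total."""
    total = sum(counts.values())
    running = 0
    ranked = sorted(counts.values(), reverse=True)
    for rank, n in enumerate(ranked, start=1):
        running += n
        if running >= fraction * total:
            return rank
    return len(counts)


# Hypothetical spectra counts per absorption edge (illustration only).
counts = {"Ni K": 50, "Cu K": 30, "Fe K": 12, "Zn K": 5, "Ti K": 3}

n_edges = edges_needed(counts, fraction=0.9)
print(n_edges, "of", len(counts), "edges cover 90 percent of the spectra")
```

A steep early rise followed by a long flat tail in such a curve signals that a few edges dominate the collection, which is the situation the text describes.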

Metadata analysis

In this project, we have standardized sample nomenclature with an emphasis on linking with other material data. However, in practical terms it is not sufficient to only use nomenclatures. Instead, it is necessary to map with more general external information, for example, linking with the ID of a well-known large external database or providing detailed product information. Therefore, we investigated the keys related to samples in the local metadata of each data-providing institution. The metadata keys related to the samples and their numbers for the four institutions are summarized in Table 2. Since the names of the keys in the local metadata of each institution are not unified at this time, keys with the same meaning are placed on the same line.

Table 2. Metadata keys related to samples and the number of keys.
JASRI		Ritsumeikan University		Hokkaido University		KEK	
Key	Number of values	Key	Number of values	Key	Number of values	Key	Number of values
name	1757	name	75	name	206	name	136
chemical_formula	1684	chemical_formula	75	chemical_formula	206	chemical_formula	121
		CAS_number	68	CAS_number	169		
supplier	1753	manufacturer	31			manufacturer	2
model_number	1737	Product_number	24			product_number	1
lot_number	1715	sample_lot_number	16				
		additional_data	75	additional_metadata	121	additional_data	62
Total	8646	Total	364	Total	702	Total	322
Average	4.920/work	Average	4.853/work	Average	3.408/work	Average	2.368/work
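
Because keys with the same meaning have different names at each institution, comparing sample metadata across providers requires an alias table. The following is a minimal sketch, assuming that keys placed on the same line in Table 2 are equivalent. The canonical names on the left, the function name harmonize, and the sample values are assumptions for illustration, not part of any institution's schema.

```python
# Canonical names (left) are assumptions; the aliases (right) are the local
# sample keys listed in Table 2.
ALIASES = {
    "name": ["name"],
    "chemical_formula": ["chemical_formula"],
    "cas_number": ["CAS_number"],
    "supplier": ["supplier", "manufacturer"],
    "product_number": ["model_number", "Product_number", "product_number"],
    "lot_number": ["lot_number", "sample_lot_number"],
    "additional_data": ["additional_data", "additional_metadata"],
}

# Invert to {local key: canonical key} for direct lookup.
LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}


def harmonize(sample):
    """Rename known local keys to canonical ones; keep unknown keys as-is."""
    return {LOOKUP.get(key, key): value for key, value in sample.items()}


# Hypothetical sample metadata from one provider.
local = {"name": "Nickel", "manufacturer": "ExampleCo", "sample_lot_number": "A-1"}
print(harmonize(local))
```

Unknown keys pass through unchanged, so facility-specific information is not lost when the common keys are aligned.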

As summarized in Table 2 and discussed in the following paragraphs, it is clear that each facility has its own characteristics. Local metadata about the sample is entered using a user interface provided by the facility and merged with facility-specific metadata (e.g., storage ring current) and beamline metadata (e.g., optical element settings). In other words, metadata is not designed by individual users. Considering that once metadata is established, it will be used by many users, it is important to recognize that these characteristics will have a significant impact on the MDR XAFS DB.

References

^[1] "NIMS XAFS DB Project Materials Dictionary". MatVoc Explorer. National Institute for Materials Science. 2023. https://matvoc.nims.go.jp/explore/en/dictionary/Q713.
^[2] Ishii, M. (2023). "MatVoc vocabulary". MDR XAFS Ontology. https://dice.nims.go.jp/ontology/mdr-xafs-ont/Item.
^[3] "Create DOIs". DataCite. 2023. https://datacite.org/create-dois/.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Several inline URLs from the original were turned into full citations for this version.