Journal:Developing a framework for open and FAIR data management practices for next generation risk- and benefit assessment of fish and seafood
Full article title | Developing a framework for open and FAIR data management practices for next generation risk- and benefit assessment of fish and seafood |
---|---|
Journal | EFSA Journal |
Author(s) | Pineda-Pampliega, Javier; Bernhard, Annette; Hannisdal, Rita; Ørnsrud, Robin; Mathisen, Gro H.; Solstad, Gisle; Rasinger, Josef D. |
Author affiliation(s) | Norwegian Scientific Committee for Food and Environment |
Primary contact | eu dash fora at efsa dot europa dot eu |
Year published | 2022 |
Volume and issue | 20(S2) |
Article # | e200917 |
DOI | 10.2903/j.efsa.2022.e200917 |
ISSN | 1831-4732 |
Distribution license | Creative Commons Attribution-NoDerivs 4.0 International |
Website | https://efsa.onlinelibrary.wiley.com/doi/full/10.2903/j.efsa.2022.e200917 |
Download | https://efsa.onlinelibrary.wiley.com/doi/pdfdirect/10.2903/j.efsa.2022.e200917 (PDF) |
This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed. |
Abstract
Risk and risk–benefit assessments of food are complex exercises, in which access to and use of several disconnected individual stand-alone databases is required to obtain hazard and exposure information. Data obtained from such databases ideally should be in line with the FAIR principles, i.e. the data must be findable, accessible, interoperable, and reusable. However, often cases are encountered when one or more of these principles are not followed. In this project, we set out to assess if existing commonly used databases in risk assessment are in line with the FAIR principles. We also investigated how access, interoperability, and reusability of data could be improved. We used the OpenFoodTox and the Seafood database as examples and showed how commonly used freely available open-source tools and repositories can be implemented in the data extraction process of risk assessments to increase data reusability and crosstalk across different databases.
Keywords: FAIR, food safety, risk assessment, OpenFoodTox, Seafood database, R, Shiny, Zenodo
Introduction
Description of work program
Aims
This project assessed how to apply FAIR data principles [Wilkinson et al., 2016] in risk and risk–benefit assessments of food. Focusing on key databases recently used in a risk–benefit assessment of fish and seafood in the Norwegian diet [VKM et al., 2022a], the OpenFoodTox and the Seafood database, we aimed to demonstrate how open-source software tools can be used to make data stored in publicly available repositories more findable, accessible, interoperable, and reusable.
Methods
Using the [[R (programming language)|programming language R] [R Core Team, 2021] and data obtained from the European Food Safety Authority (EFSA) OpenFoodTox Tool [Dorne et al., 2020; Kovarich et al., 2022] and from the Institute of Marine Research (IMR) Seafood database [Institute of Marine Research, 2022], we assessed if programmatic optimization of access and the creation of a web-tool for selection and merging of subsets of the stored data improved findability, accessibility, interoperability, and reusability of the data. In this section, a brief description of both data and tools used is provided.
R
The programming language and environment R has been designed for the statistical analysis of data and the creation of graphics. [R Core Team, 2021] Over the past years, R has increasingly gained interest in the scientific research community [Hackenberger, 2020] as it is effective for data handling and includes many tools for basic and advanced data analysis. [R Core Team, 2021] R is a well-developed non-static language, which means that its base features can easily be extended via packages that can provide new functions and functionalities for different data science challenges, including bioinformatics and data mining. [Giorgi et al., 2022] In addition to this, R is supported by a big open-source community actively using this language and continuously adding new functionalities. R is licensed under the terms of the Free Software Foundation's GNU General Public License in source code form. [R Core Team, 2021] To facilitate programming with R, we used RStudio, an integrated development environment for R. [RStudio Team, 2022]
Shiny
As commented above, R can be expanded through packages, including one commonly used one called Shiny. [Chang et al., 2021] This package was designed with the idea of creating interactive web applications which use R in the backend. While the creator of a web-based Shiny-tool does need to know R, the end user of the web application created with Shiny does not need to have any knowledge of R. In addition to local installations of R and Shiny, Shiny web app also can be stored on a server, which users can access through their web browser. In both cases, the appearance and functionalities of the applications are the same, and the underlying R code can be shared freely.
Git and GitHub
Git is a version control system designed to allow different users to work on the same programming project, ensuring the traceability of progress and changes in the project. One of the most widely used providers of internet hosting for software development and version control using Git is GitHub. [Microsoft, 2022] GitHub implements Git and offers a free version, in which users can host different smaller projects and scripts, providing an easy way to share code created in R and other programming languages on the web. The scripts generated during this project will be hosted and accessible on GitHub in this repository. [Pineda-Pampliega, 2022]
Zenodo
Under the European OpenAIRE program, and with the idea of championing the sharing of scientific data, the Zenodo [European Organization for Nuclear Research, and OpenAIRE, 2013] open repository was developed and operated by CERN. [European Council for Nuclear Research] This open-source repository was developed for scientific data in a broad way, allowing to deposit not only research papers, but also data sets, software, reports, supplementary data and any other research-related digital artifacts. Submissions to Zenodo obtain a persistent digital object identifier (DOI), which facilitates the citation of the stored items and allows the sharing of data prior to their publication in peer-reviewed journals.
For a speedy exchange of evidence and supporting materials which could be used in food and feed safety risk assessments, EFSA has created a curated open repository called the Knowledge Junction within Zenodo. In addition to EFSA, several other institutions use Knowledge Junctions to share different data related to food security. For example, The Norwegian Scientific Committee for Food and Environment (VKM), which is part of this project, uses this Zenodo repository to upload finished reports (i.e., risk assessment and risk–benefit assessment) and supplementary materials of interest (i.e., literature searches, datasets, codes, etc.). To date, for VKM, the most recent example of the use of Zenodo is the opinion on the "Risk-Benefit Assessment of Sunscreen" [VKM et al., 2022b] For this opinion, the fellow Javier Pineda-Pampliega contributed to the preparation of the public sharing of the report's supplementary material, including datasets and R codes currently hosted on the VKM Knowledge Junction. [Norwegian Scientific Committee for Food and Environment (VKM), 2022]
Zenodo recently implemented the possibility to import GitHub workspaces; it now is possible to host completed GitHub projects also on Zenodo. This offers the advantage of obtaining a DOI for one's code, which simplifies the traceability and proper citation of code used to create the results.
OpenFoodTox
The EFSA's Chemical Hazards Database, OpenFoodTox [Dorne et al., 2020; Kovarich et al., 2022], is a structured database summarizing the outcomes of hazard identification and characterization for human and animal health and for the environment. It includes all regulated products and contaminants and provides open-source data for the (1) substance characterization, (2) links to EFSA outputs, and the values of (3) reference points, (4) reference values, and (5) genotoxicity. This database has become an essential tool for risk assessors and has provided the basis for the development and implementation of new approach methodologies (NAMs) in food and feed safety research. OpenFoodTox is hosted both on the EFSA webpage (as an interactive web tool) and on Zenodo in the EFSA Knowledge Junction.
Seafood database
The Institute of Marine Research in Norway routinely collects samples of key marine species for national and international monitoring programs. Their ISO/IEC 17025 accredited laboratories perform analyses of contaminants and nutrients using state-of-the-art methods. All the data generated, comprising multiple data points for over 25,000 individuals collected over a period of up to 15 years, are aggregated in a large in-house database. This database can be accessed freely through the online Seafood database portal [Institute of Marine Research, 2022], where the user can select between fish, shellfish, and seaweed divided by wild or farmed, and even prepared products, which can be found in Norwegian supermarkets. The database holds data of both Nutrients (separated into five categories: Amino acids, Fatty acids, Macro nutrients, Minerals, and Trace elements and Vitamins) and Contaminants (separated into four categories: Drug residues, Heavy metals, Organic pollutants, and Other undesirable substances).
Activities
With the aim to investigate the application of FAIR data principles in risk–benefit assessment of seafood, it was essential to evaluate opportunities and limitations in the OpenFoodTox and the Seafood database. Once evaluated, we developed publicly available R and Shiny code, which attempts to address potential limitations found and to add new functionalities for sub-setting and improved crosstalk between hazard and occurrence data repositories.
Evaluation and actions on the OpenFoodTox database
The OpenFoodTox database can be used in two different ways. The first (1) option is through the EFSA-hosted web application. The EFSA-hosted web application of the OpenFoodTox tool presents a classical interface, where different compounds can be searched by name. When searching, selected substances appear in five different categories of results: Substance characterization, EFSA outputs, Reference points, Reference values, and Genotoxicity. The resulting output represents the main limitation, as each category only can be downloaded individually (either in pdf, csv or xlsx format). In other words, after a search, the users need to download five different files and manually merge the data.
The second (2) option to access data is to download the entire OpenFoodTox database in xlsx format (Microsoft Excel Open XML Spreadsheet) from Zenodo. The data comprises five individual spreadsheets providing data on (1) substance characterization, (2) EFSA outputs, (3) reference points, (4) reference values, and (5) genotoxicity results. There is another “complete” spreadsheet, which is a combination of the five spreadsheets commented above (each one in a different tab) in addition to a dictionary spreadsheet. [Dorne et al., 2020] This makes data interoperable. However, as was described in the example above, to work with subsets of data spreading across the different spreadsheets, data aggregation and merging again must be performed manually using additional software for tabular data files. The most common among these tools is Excel, which is part of the commercial Microsoft Office Suite, but other free alternatives such as OpenOffice, LibreOffice, or online tools such as Google Drive Sheets also can be used. In any case, for merging the large individual datasets, the user needs to be proficient in the terminology of terms and use of spreadsheet tools for efficient filtering, merging, and sub-setting of the data in the desired format.
To evaluate potential complementary solutions to access, subset, and merge data stored in the EFSA OpenFoodTox database on Zenodo, in the present project using R (vers. 4.1.2) running in RStudio (vers. 2022.2.3.492), functions (i.e., pieces of code which work together for a common purpose) were written using R markdown, being characterized by the following features:
- Data can be downloaded directly from the OpenFoodTox URL to eliminate the need for the user to search for and/or download the data in Excel.
- The database offers the possibility to search for up to 15 elements at the same time, with an implemented control of any repeated entry values. In the case of repetition, the repeated value is indicated, but not considered in the search.
- If a search is entered for a general term and several compounds appear in the database, an indication for the number of the different compounds is provided. For example, the search “lead” returns four results, because the components identified in the database are: “Lead,” “Lead (II),” “Lead sulphate,” and “Tetraethyl lead.”
- To increase the (computational) reusability of the data in automated analysis pipelines, the information is downloaded in a plain text file (txt). This is a standard format of plain text that can be open in many different software tools. However, also the possibility to download data in csv (comma-separated values) is provided.
- To increase traceability information on the OpenFoodTox database version and the date and time when the file was created are automatically appended to the name of the downloaded file.
After the creation of the R script, to increase the number of potential users of this tool, we assessed if an additional approach that does not require knowledge and use of R could be developed. For this, the creation of a web-based application using Shiny was attempted. The use of Shiny opens the possibility to access and subset OpenFoodTox data using an internet browser only and also allows for the implementation of additional functions into our R code. That is, in addition to the characteristics of the function described above, the Shiny application developed in this project (Figure 1) has the following extra functions:
- Increased traceability: an indication of which version of the OpenFoodTox database used has been included. At the time of writing this report, the fifth iteration of the OpenFoodTox was released (and published on Zenodo on 16 June 2022).
- Implementation of interactive tables, allowing to filter results in real-time.
- Initially, tables will show all columns in the dataset, but tools for sub-setting and selection of individual columns to be retained are provided. This functionality makes it easier to take snapshots only of the columns of interest for further uses.
- With one of the objectives of this project being to facilitate the interaction and crosstalk between databases of interest to risk assessors, the option to add links to PubChem for each selected compound was implemented. PubChem is a database of chemical molecules and their activities, maintained by the National Centre for Biotechnology Information (NCBI) of the United States.
|
Evaluation and actions on the Seafood database
The Seafood database contains information collected over a period of up to 15 years, with different data points for over 25,000 individual samples. This represents a comprehensive data repository of nutrients and contaminants in fish and seafood comprising more than 700,000 records. Due to the experience gained in the previous work with the OpenFoodTox tool, we directly designed a web application using Shiny to work with the Seafood database. As with OpenFoodTox, the first step was to evaluate the potential limitations and challenges of the existing system to access the database, which for the general public currently occurs via a web interface. Having gotten access to the data underlying the web-based tool hosted at the IMR, in the present project, we assessed alternative solutions by addressing issues of the current web application using R and Shiny (Figure 2). We also set out to include additional functions potentially of use to risk assessors. The Seafood database Shiny web application is characterized by the following features:
- The publicly available web interface of the Seafood database is not version controlled. Furthermore, it is not updated with a defined periodicity, as it depends on data from different projects which are made available at different times throughout the year. This could be a challenge for the traceability of results and repeatability of analysis. As an attempted solution, we suggested for the database to be version controlled and to be updated at defined intervals only, e.g., annually. In addition, we implemented code to show a message highlighting the date when the database was last updated (Figure 3A). In a new version of our code, we also will include a button in the Shiny app to select which version of data the user wants to retrieve (i.e., to select the data regarding the day of the update).
- One common situation users of the Seafood database often encounter is the interest in the comparison of the presence of different compounds in different species or products. In the current web interface of the Seafood database, to check all the substances evaluated, it is only possible to select species or products one by one. In addition, to compare the concentration of different substances between species or products, the maximum number of substances is 10 by search. This makes it difficult to prepare a subset of desired data for further comparisons downstream. As a solution, in the prepared Shiny-based application, the user can select up to 15 species or products simultaneously, with information on all nutrients or contaminants. In addition, if the user is interested in only a particular set of compounds, up to 15 nutrients and another 15 contaminants can be selected.
- The R of FAIR means "reusability" of the data. This implies that for performing additional data analyses not yet envisaged by the data providers, users of a database should be able to access data presented in a non-aggregated way. Currently, the Seafood database does not provide this option; the results of searches are presented as numerical summaries (with sample size, mean, minimum and maximum values for each parameter). This makes it difficult to reuse this data in new evaluations. In the present project, at the IMR, access to all data contained in the Seafood database was provided and two tables are presented in the Shiny application developed: one with a summary of the data (as in the IMR web interface), and another table with the non-aggregated data (Figure 3B).
- Continuing with reusability, in addition to access to non-aggregated data, the format in which data can be downloaded by the user is also important to consider. The Seafood database allows downloading in Portable Document Format (pdf) format only. This format is widely used to present documents which include text and images and has the advantage of being immutable, i.e., independent of application software, hardware, and operating systems, and documents are always displayed in the same way. However, this characteristic is a weakness for sharing data intended to be used in downstream analyses. For this, the data needs to be reusable and interoperable. The newly developed Shiny application allows for the download of selected data in txt or csv formats, being the most typical format to share data which could be used for further analysis. Both data from the summary table and the non-aggregated data can be downloaded in the desired formats. In addition, to ensure traceability when files are downloaded, the name consists of the date and the time of the creation and also incorporates the version of the database (the date of the latest update of the data; Figure 3C).
|
|
In addition to addressing specific limitations of the Seafood database listed above, we added extra functionalities in the Shiny code that we considered could be useful for the users:
- An increase in the number of inputs could entail an increase in the number of mistakes due to a repetition of terms in a search. To avoid this, our application indicates if a value is repeated but, even more important, the repeated value is not considered for the search, showing the same results as if the value were introduced only once (Figure 3A).
- Despite quality control measures in place, databases may contain erroneous entries (e.g., the inclusion of text in numeric rows and vice versa, or empty values). The Shiny application developed here includes a filter to flag and eliminate any rows which potentially contain mistakes. In addition, an indication of the number of eliminated entries is provided (Figure 3A).
- During the quantification of substances, it is possible that values are below the limit of quantification (LOQ) of a specific analysis. The LOQ is the lowest concentration of an analyte that can be quantified with a given certainty. In the Seafood database, values for contaminants below the LOQ are routinely reported using "Upper bound" summation where the LOQ is used as if it were the actual concentration measured. This may result in many data points of the same value, and such data sets are referred to as "Left censored." The web interface of the Seafood database indicates which values are below the LOQ, and also lists the numerical value of the respective LOQ for each method and compound in question. Additionally, the Shiny application developed here allows for further modification of the data and the possibility to calculate "Lower bound" (substitute values < LOQ by 0) or "Upper bound" (substitute values < LOQ with the actual LOQ) (Figure 3D).
- To increase data interoperability and crosstalk between different databases, common unique identifiers must be found. In our opinion, codes of the chemical substances in question provide a good option. Different unique identifiers do exist including InChI (International Chemical Identifier) or SMILES (Simplified Molecular Input Line Entry System), which both are included in the Seafood database. In addition to these, the paramCode was added in the Shiny App, which is suggested by EFSA to be used when reporting on different substances in food and feed. [EFSA, 2019]
One general challenge we found when working with the Seafood database is that its web interface is designed to share aggregated occurrence data with the public; access to non-aggregated data is limited to in-house use and can be made available on request to risk assessors. Hazard data from OpenFoodTox on the other hand can be accessed both via a web interface for quick screening of information and through a dedicated Zenodo repository for bulk download and direct reuse (e.g., in exposure calculations for risk assessments). In addition, data is version controlled and linked to persistent citable DOIs. This, in our view, strongly facilitates the timely dissemination of information and the reproducibility of the data analysis performed. The benefits of publishing data on an open repository such as Zenodo sparked a discussion at the IMR on how seafood data could be made available to a wider audience in the future, which is an important first step towards further implementation of the FAIR data principles.
Spin off activities in implementing FAIR data management practices
In addition to the work described above, during the project period supporting activities were carried out to improve communication in project work relying on coding and data sharing across different work groups and institutes. Within the Marine Toxicology group at the IMR, several software tools are used to advance work on several cross-disciplinary projects. Microsoft Teams is used to allow communication between members of the group through video calls or chat. Microsoft SharePoint is used as a document repository and for interactive document creation and editing. Linking SharePoint to OneDrive, within the group R code could be developed locally using RStudio. This allowed for efficient local collaboration between members of the team. To share different elements for a project externally, in addition to Teams, GitHub accounts were set up, and using RStudio scripts created earlier, were directly uploaded. This workflow was shown to VKM, which implemented this workflow for their research in 2022 [VKM et al., 2022b], allowing them for the first time to share supplementary codes and datasheets interactively on Zenodo. [Norwegian Scientific Committee for Food and Environment (VKM), 2022] Lastly, the fellow also engaged in discussions with IMR IT staff about modern software development and recent developments in micro-services architectures with standardized and structured data representation formats for sharing information between systems and services, such as JSON and XML.
Other spin off activities during the project
The performed work is not only represented in this report but was also presented on a poster at the ONE Health, Environment, Society conference in Brussels in June 2022. This conference also invited attendees to participate in a video contest, where a short summary of the project was also presented. In addition to the project work related to FAIR data management, the EU-FORA program also has offered further opportunities. Being integrated into the working group, the fellow had the opportunity to familiarize himself with a new field, participating in a paper regarding proteomics. Finally, to continue training in food security, the fellow also carried out the training "Risk assessment in biotechnology," offered by the European Commission as part of the training initiative "Better Training for Safer Food" (BTSF).
Conclusions
Large amounts of data that could be used in food safety risk assessments are available in different database. However, this steady increase of data has not been always followed by an improvement in the ways to easily access these data, provide data traceability, or offer easy data reuse. To tackle these challenges, it is recommended that data for risk assessments must follow the FAIR principles, i.e., data must be findable, accessible, interoperable, and reusable. Based on publicly available databases and open-source software tools, this project has been attempting to provide a proof of concept to show how using custom code and alternative approaches could improve some characteristics of well-known databases, including OpenFoodTox and the Seafood database. The use of platforms such as GitHub or Zenodo could make the data more findable and interoperable. The creation of web applications with Shiny could increase the accessibility to the data and make easy interaction between databases. The reusability was obtained through the selection of the appropriate formats for the data downloaded and the application of adequate systems to ensure traceability. Following these FAIR principles in the different databases is an essential step to ensuring the success of the future risk–benefit assessment, by offering more timely results with adequate spending of human and economic resources.
Acknowledgements
The authors would like to thank Mauricio Munera, Ole Jakob Nøstbakken, Arne Duinker, Livar Frøyland and Gro-Ingunn Hemre.
Funding
This report is funded by EFSA as part of the EU-FORA program.
Declaration of interest
If you wish to access the declaration of interests of any expert contributing to an EFSA scientific assessment, please contact interestmanagement at efsa dot europa dot eu.
References
Notes
This presentation is faithful to the original, with only a few minor changes to presentation and updates to spelling and grammar. In some cases important information was missing from the references, and that information was added. No other changes have been made, in accord with the "NoDerivs" portion of the license.