Journal:Defending our public biological databases as a global critical infrastructure

Full article title	Defending our public biological databases as a global critical infrastructure
Journal	Frontiers in Bioengineering and Biotechnology
Author(s)	Caswell, Jacob; Gans, Jason D.; Generous, Nicolas; Hudson, Corey M.; Merkley, Eric; Johnson, Curtis; Oehmen, Christopher; Omberg, Kristin; Purvine, Emilie; Taylor, Karen; Ting, Christina L.; Wolinsky, Murray; Xie, Gary
Author affiliation(s)	Sandia National Laboratories, Los Alamos National Laboratory, Pacific Northwest National Laboratory
Primary contact	Email: karen at pnnl dot gov
Editors	Murch, Randall S.
Year published	2019
Volume and issue	7
Page(s)	58
DOI	10.3389/fbioe.2019.00058
ISSN	2296-4185
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.frontiersin.org/articles/10.3389/fbioe.2019.00058/full
Download	https://www.frontiersin.org/articles/10.3389/fbioe.2019.00058/pdf (PDF)


Abstract

Progress in modern biology is being driven, in part, by the large amounts of freely available data in public resources such as the International Nucleotide Sequence Database Collaboration (INSDC), the world's primary database of biological sequence (and related) information. INSDC and similar databases have dramatically increased the pace of fundamental biological discovery and enabled a host of innovative therapeutic, diagnostic, and forensic applications. However, as high-value, openly shared resources with a high degree of assumed trust, these repositories share compelling similarities to the early days of the internet. Consequently, as public biological databases continue to increase in size and importance, we expect that they will face the same threats as undefended cyberspace. There is a unique opportunity, before a significant breach and loss of trust occurs, to ensure they evolve with quality and security as a design philosophy rather than costly “retrofitted” mitigations. This perspective article surveys some potential quality assurance and security weaknesses in existing open genomic and proteomic repositories, describes methods to mitigate the likelihood of both intentional and unintentional errors, and offers recommendations for risk mitigation based on lessons learned from cybersecurity.

Keywords: cyberbiosecurity, biosecurity, cybersecurity, biological databases, machine learning, bioeconomy

Introduction

Although an openly shared interaction platform confers great value to the biological research community, it may also introduce quality and security risks. Without a system for trusted correction and revision, these shared resources may facilitate widespread dissemination and use of low-quality content, for instance, taxonomically misclassified or erroneous sequences. Furthermore, as these public databases increase in size and importance, they may fall victim to the same security issues and abuses that plague cyberspace to this day. If we act now by developing the databases with quality and security as a design philosophy, we can protect these databases at a much lower cost and with fewer challenges than we currently face with the internet.

In this perspective article, we outline some potential quality assurance and security weaknesses in existing public biological repositories. In the background section, we discuss errors present in public biological databases and possible security vulnerabilities inherent in their access, publication, and distribution models and systems. We cover both unintentional and intentional errors, the latter of which have not been given significant consideration in the literature.^[1] We then provide recommendations to mitigate or account for these errors and vulnerabilities, with the aim of introducing greater trust in the data and analyses, and point to approaches used by other internet databases. Finally, we summarize our recommendations in the conclusions section.

This article focuses on databases that contain public and freely available data. We recognize that other biological databases exist that contain private, sensitive, or otherwise valuable data (e.g., human genomes). While unauthorized disclosure is not a formal concern in public, non-human databases, safeguarding against intentional or unintentional erroneous content is. Some approaches have been proposed to protect against unauthorized disclosure^[2]^[3]^[4] and, while we don't survey these approaches in this perspective, we note that the public database community may benefit from these ideas as well.

Background: Problems with public biological databases

Data integrity

An important goal for bioinformatics is the continuous improvement of biological databases. Given the rapid pace of this improvement and the rate of data production, however, the content of these repositories is not without error. For example, the problem of contaminated sequences has been recognized for nearly two decades, with evidence indicating that bacteria and human error are the two most common sources of contamination.^[5]^[6] Ancient DNA is also particularly affected by human contamination.^[7] These contaminants are frequently introduced during experiments^[5]^[8] from natural associations and insufficient purification.^[9] In the past few years, additional reports have highlighted cases of DNA contamination in published genome data^[10]^[11], suggesting that DNA contamination may be more widespread than previously thought. We recognize that errors and omissions can occur in open databases at both the sequence and the metadata levels, but in this article we focus mainly on sequence and taxonomic data concerns to illustrate some of the many possible data integrity challenges.
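To make the contamination problem concrete, the following is a minimal illustrative sketch, not a method taken from the article or from any production tool. It flags records whose distinct k-mers (length-k subsequences) overlap heavily with a known contaminant sequence. The function names, the k-mer length of 8, and the 0.5 threshold are all hypothetical choices made for illustration; the sketch also ignores reverse complements, ambiguous bases, and the indexed reference databases that a realistic screen would likely need.

```python
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    # range() is empty when the sequence is shorter than k
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contamination_fraction(query: str, contaminant: str, k: int = 8) -> float:
    """Fraction of the query's distinct k-mers that also occur in the contaminant."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    shared = query_kmers & kmers(contaminant, k)
    return len(shared) / len(query_kmers)


def flag_records(records: Dict[str, str], contaminant: str,
                 k: int = 8, threshold: float = 0.5) -> Dict[str, float]:
    """Return {record_id: fraction} for records at or above the threshold.

    The threshold is an arbitrary illustrative value, not a validated cutoff.
    """
    flagged = {}
    for record_id, seq in records.items():
        frac = contamination_fraction(seq, contaminant, k)
        if frac >= threshold:
            flagged[record_id] = frac
    return flagged
```

A call such as `flag_records(records, contaminant)` takes a dictionary mapping record identifiers to sequences and returns only the records that exceed the overlap threshold, which could then be queued for manual review rather than removed automatically.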

References

↑ Moussouni, F.; Berti‐Équille, L. (2014). "Cleaning, Integrating, and Warehousing Genomic Data From Biomedical Resources". In Elloumi, M.; Zomaya, A.Y. Biological Knowledge Discovery Handbook. John Wiley & Sons. pp. 35–58. doi:10.1002/9781118617151.ch02. ISBN 9781118617151.
↑ Kim, M.; Lauter, K. (2015). "Private genome analysis through homomorphic encryption". BMC Medical Informatics and Decision Making 15 (Suppl 5): S3. doi:10.1186/1472-6947-15-S5-S3. PMC PMC4699052. PMID 26733152. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4699052.
↑ Mandal, A.; Mitchell, J.C.; Montgomery, H.; Roy, A. (2018). "Data oblivious genome variants search on Intel SGX". In Garcia-Alfaro, J.; Herrera-Joancomartí, J.; Livraga, G.; Rios, R. Data Privacy Management, Cryptocurrencies and Blockchain Technology. Lecture Notes in Computer Science. Springer International Publishing. pp. 21. doi:10.1007/978-3-030-00305-0_21. ISBN 9783030003050.
↑ Ozercan, H.I.; Ileri, A.M.; Ayday, E.; Alkan, C. (2018). "Realizing the potential of blockchain technologies in genomics". Genome Research 28 (9): 1255–63. doi:10.1101/gr.207464.116. PMC PMC6120626. PMID 30076130. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6120626.
↑ ^5.0 ^5.1 Merchant, S.; Wood, D.E.; Salzberg, S.L. (2014). "Unexpected cross-species contamination in genome sequencing projects". PeerJ 2: e675. doi:10.7717/peerj.675. PMC PMC4243333. PMID 25426337. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243333.
↑ Strong, M.J.; Xu, G.; Morici, L. et al. (2014). "Microbial contamination in next generation sequencing: implications for sequence-based analysis of clinical samples". PLoS Pathogens 10 (11): e1004437. doi:10.1371/journal.ppat.1004437. PMC PMC4239086. PMID 25412476. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4239086.
↑ Pilli, E.; Modi, A.; Serpico, C. et al. (2013). "Monitoring DNA contamination in handled vs. directly excavated ancient human skeletal remains". PLoS One 8 (1): e52524. doi:10.1371/journal.pone.0052524. PMC PMC3556025. PMID 23372650. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3556025.
↑ Ballenghien, M.; Faivre, N.; Galtier, N. (2017). "Patterns of cross-contamination in a multispecies population genomic project: Detection, quantification, impact, and solutions". BMC Biology 15 (1): 25. doi:10.1186/s12915-017-0366-6. PMC PMC5370491. PMID 28356154. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5370491.
↑ Simion, P.; Philippe, H.; Baurain, D. et al. (2017). "A Large and Consistent Phylogenomic Dataset Supports Sponges as the Sister Group to All Other Animals". Current Biology 27 (7): 958–67. doi:10.1016/j.cub.2017.02.031. PMID 28318975.
↑ Witt, N.; Rodger, G.; Vandesompele, J. et al. (2009). "An assessment of air as a source of DNA contamination encountered when performing PCR". Journal of Biomolecular Techniques 20 (5): 236–40. PMC PMC2777341. PMID 19949694. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777341.
↑ Longo, M.S.; O'Neill, M.J.; O'Neill, R.J. (2011). "Abundant human DNA contamination identified in non-primate genome databases". PLoS One 6 (2): e16410. doi:10.1371/journal.pone.0016410. PMC PMC3040168. PMID 21358816. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040168.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, grammar, and punctuation. In some cases important information was missing from the references, and that information was added. The footnotes in the original material were turned into inline references for this version.