Journal:Defending our public biological databases as a global critical infrastructure

Full article title	Defending our public biological databases as a global critical infrastructure
Journal	Frontiers in Bioengineering and Biotechnology
Author(s)	Caswell, Jacob; Gans, Jason D.; Generous, Nicolas; Hudson, Corey M.; Merkley, Eric; Johnson, Curtis; Oehmen, Christopher; Omberg, Kristin; Purvine, Emilie; Taylor, Karen; Ting, Christina L.; Wolinsky, Murray; Xie, Gary
Author affiliation(s)	Sandia National Laboratories, Los Alamos National Laboratory, Pacific Northwest National Laboratory
Primary contact	Email: karen at pnnl dot gov
Editors	Murch, Randall S.
Year published	2019
Volume and issue	7
Page(s)	58
DOI	10.3389/fbioe.2019.00058
ISSN	2296-4185
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.frontiersin.org/articles/10.3389/fbioe.2019.00058/full
Download	https://www.frontiersin.org/articles/10.3389/fbioe.2019.00058/pdf (PDF)


Abstract

Progress in modern biology is being driven, in part, by the large amounts of freely available data in public resources such as the International Nucleotide Sequence Database Collaboration (INSDC), the world's primary database of biological sequence (and related) information. INSDC and similar databases have dramatically increased the pace of fundamental biological discovery and enabled a host of innovative therapeutic, diagnostic, and forensic applications. However, as high-value, openly shared resources with a high degree of assumed trust, these repositories share compelling similarities to the early days of the internet. Consequently, as public biological databases continue to increase in size and importance, we expect that they will face the same threats as undefended cyberspace. There is a unique opportunity, before a significant breach and loss of trust occurs, to ensure they evolve with quality and security as a design philosophy rather than costly “retrofitted” mitigations. This perspective article surveys some potential quality assurance and security weaknesses in existing open genomic and proteomic repositories, describes methods to mitigate the likelihood of both intentional and unintentional errors, and offers recommendations for risk mitigation based on lessons learned from cybersecurity.

Keywords: cyberbiosecurity, biosecurity, cybersecurity, biological databases, machine learning, bioeconomy

Introduction

Although an openly shared interaction platform confers great value to the biological research community, it may also introduce quality and security risks. Without a system for trusted correction and revision, these shared resources may facilitate widespread dissemination and use of low-quality content, for instance, taxonomically misclassified or erroneous sequences. Furthermore, as these public databases increase in size and importance, they may fall victim to the same security issues and abuses that plague cyberspace to this day. If we act now by developing the databases with quality and security as a design philosophy, we can protect these databases at a much lower cost and with fewer challenges than we currently face with the internet.

In this perspective article, we outline some potential quality assurance and security weaknesses in existing public biological repositories. In the background section, we discuss errors present in public biological databases and possible security vulnerabilities inherent in their access, publication, and distribution models and systems. Both unintentional and intentional errors are discussed, the latter of which has not been given significant consideration in the literature.^[1] Afterwards, we attempt to introduce greater trust in the data and analyses by providing recommendations to mitigate or account for these errors and vulnerabilities and by pointing to approaches used by other internet databases. Finally, we summarize our recommendations in the conclusions section.

This article focuses on databases which contain public and freely available data. We recognize that other biological databases exist which contain private, sensitive, or otherwise valuable data (e.g., human genomes). While unauthorized disclosure is not a formal concern in public, non-human databases, safeguarding against intentional or unintentional erroneous content is. Some approaches have been proposed to protect against unauthorized disclosure^[2]^[3]^[4] and, while we don't survey these approaches in this perspective, we note that the public database community may benefit from these ideas as well.

Background: Problems with public biological databases

Data integrity

An important goal for bioinformatics is the continuous improvement of biological databases. Given the rapid pace of this improvement and the rate of data production, however, the content of these repositories is not without error. For example, the problem of contaminated sequences has been recognized for nearly two decades, with evidence stating that bacteria and human error are the two most common sources of contamination.^[5]^[6] Ancient DNA is also particularly affected by human contamination.^[7] These contaminants are frequently introduced during experiments^[5]^[8] from natural associations and insufficient purification.^[9] In the past few years, additional reports have highlighted cases of DNA contamination in published genome data^[10]^[11], suggesting that DNA contamination may be more widespread than previously thought. We recognize that errors and omissions can occur in open databases at both the sequence and the metadata levels, but in this article we focus mainly on sequence and taxonomic data concerns to illustrate some of the many possible data integrity challenges.

In addition to contamination, two high-profile examples of sequence errors include the reassembly of a misassembled Francisella tularensis genome^[12] and the identification of single nucleotide errors in a reference Tobacco mosaic virus (TMV) genome.^[13] Without a way to flag or remove the erroneous entries, future researchers are left to continually rediscover them. The errors in the reference TMV sequence are particularly disturbing. The taxonomic assignment corresponds to a pathogenic strain, but due to two erroneous single nucleotide polymorphisms (SNPs), virions synthesized from the published reference sequence are atypically not infectious. Overlooked contamination in reference genomes can thereby lead to wrong or confusing results and may have major detrimental effects on biological conclusions.^[14]^[15] While resequencing could be used to identify and correct sequence errors, it is only possible when the original source material is available. For the given example of single nucleotide errors in the TMV genome, the biological sample (sequenced in 1982) no longer exists. In addition to missing samples, samples of high consequence human and agricultural pathogens may not be available for resequencing.

Database integrity considerations for proteomics are generally similar to those for genomics because databases of protein sequences are derived from genome sequencing, via genome annotation and in silico translation. A sequence database error is unlikely to result in spurious detection of a protein that is not present in the sample (false positive), but it could easily lead to a failure to detect a protein that is present (false negative). This is particularly concerning for discovery of accurate peptide signatures for use in targeted assays, a rapidly growing area of research.
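The false-negative effect described above can be illustrated with a minimal sketch. The gene sequence, the erroneous database entry, and the observed peptide are all hypothetical toy values, and the function names are introduced here only for illustration. The codon table is the standard genetic code. A real proteomics search engine scores spectra against predicted peptides rather than doing a substring test, so this is only a simplified stand-in.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """In silico translation in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def peptide_detectable(observed_peptide, database_dna):
    """Simplified search: is the observed peptide in the translated entry?"""
    return observed_peptide in translate(database_dna)


# Hypothetical true coding sequence (translates to MAKGELR).
true_gene = "ATGGCTAAAGGTGAACTGCGT"
# Hypothetical database entry with one erroneous nucleotide (GGT -> GAT).
erroneous_entry = "ATGGCTAAAGATGAACTGCGT"
# Peptide actually present in the sample.
observed_peptide = "GELR"

found_with_true = peptide_detectable(observed_peptide, true_gene)
found_with_error = peptide_detectable(observed_peptide, erroneous_entry)
print(found_with_true, found_with_error)
```

In this toy case the peptide is present in the sample but cannot be matched against the erroneous entry, which is the false-negative scenario. The error does not cause a match for a peptide that is absent.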

In this section we discussed the issue of errors in genomic and proteomic databases and their impacts for research and application. Sources of these errors may include, among others, entry errors derived from data transfer, original errors derived from source data, and metadata errors (typically provenance-related) derived from the analysis pipeline. Original errors can arise from sequencing and sample preparation instrumentation chemistry, hardware, and software. Metadata errors can arise from bioinformatics software and faulty human interpretation. Each of these errors may be considered noise or the result of some other unintentional cause, but the key problem to note is that each element of the analytical process introduces some level of artifact when creating the analytical product, i.e., what is defined as a peak or a spot, what is the gene scaffold, what is the closed genome, etc. Any difference in process would therefore by its nature have some impact on the final genome. Our goal here is to start drawing connections between these process elements and genome anomalies.

Vulnerabilities and intentional tampering

In contrast to the data integrity issues discussed in the prior section, errors may also be intentionally introduced into a biological database. For example, consider the hypothetical scenario discussed by Peccoud et al.^[16] whereby a graduate student reads an article and subsequently requests the plasmids described, but receives a faulty sample. It may be that the published sequences were fabricated, or that the source laboratory unwittingly sent faulty plasmids. One could also imagine a scenario where an intentionally mislabeled or harmful sequence is submitted to an open database that could later be unknowingly synthesized in a research setting or, more seriously, in a production capacity. Furthermore, depending on how sequences could be submitted to the database, the adversary may be able to keep the pathogenic sequence from being detected by certain anomaly detection heuristics.

Individuals may also exploit the vulnerabilities inherent in the database as a cybersystem, leading to errors introduced after publication of data despite manipulation and deletion controls. As with any database, biological databases can be compromised, enabling data integrity issues related to insertion, manipulation, exfiltration, and deletion of data, as well as providing a platform for privilege escalation, unauthorized surveillance, or distribution of malware. Ultimately, the effects of the operating environment and the tools used to deliver databases will inform the most appropriate threat model.

Approaches for improving biological databases

In 2000, a workshop titled Bioinformatics: Converting Data to Knowledge^[17] tackled the question of biological database integrity as one of its focus areas. At that time, suggested solutions included building organism-type (e.g., eukaryote) specific grammar-based tools, enabling database self-validation through specialized ontologies, advocating for quality control in laboratories to minimize likelihood of errors, and authorizing only trained curators and annotators to enter data. They also recommended that data provenance be maintained so that the data history and evolution can be understood over time. These approaches broadly fall into two categories: ensuring integrity before or during data entry and analyzing data already in a database. Nearly 20 years later, we still emphasize the importance of quality control in laboratories and standardized data entry procedures, but it is clear that errors continue to make their way into databases for a variety of reasons. In this section, we highlight several categories of existing methods to detect data integrity issues in biological databases and outline the strengths and weaknesses of each. We also provide recommendations for improving biological database security.

Automated approaches for detecting anomalies

Some biological databases rely on manual curation, such as the SwissProt subset of the UniProt (Universal Protein Resource Database). This effort requires significant resources to maintain, consisting of three principal investigators, a large staff, and an external advisory board.^[18] Given the complexity and exponential growth of biological data, automated methods are needed.

Some tools have been developed to assess the technical quality of genome assemblies (e.g., QUAST^[19]), their completeness in terms of gene content (e.g., BUSCO^[20], ProDeGe^[21]), and even their contamination level (e.g., acdc^[22], CheckM^[23]). Currently there are several analysis pipelines and search methodologies to detect potentially contaminated sequences in published and assembled genomes, such as Taxoblast^[24], GenomePeek^[25], homology search^[26], and a multi-step cleaning process followed by a consensus of rankings.^[27]^[28] All these tools and methods require human review or the use of additional tools to distinguish true positives from false positives and are therefore not feasible at scale.

Another database quality issue is the automated identification of taxonomically anomalous, questionable, or erroneous GenBank taxonomic assignments. Automated error identification of taxonomic assignments now draws on methods such as anomaly detection, classification, and prediction techniques. These methods have proved impactful in areas like computer vision^[29] and natural language processing.^[30] They have also been adopted by bioinformatics and computational biology.^[31] Much of the work in applying machine learning to biological data is for classification and prediction of metadata, e.g., gene or taxonomy prediction in genomics, and structure and function prediction in proteomics. Verification of sequence metadata contained in a database is then performed by comparing with the predicted metadata from the sequence.
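The verification step described above, comparing declared metadata against metadata predicted from the sequence, can be sketched as follows. Everything here is an illustrative assumption: the accessions, clade labels, and sequences are made up, and `toy_predict_taxon` is a deliberately trivial GC-content rule standing in for whatever trained classifier would be used in practice.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def toy_predict_taxon(seq):
    """Hypothetical stand-in for a trained classifier (not a real method)."""
    return "high_GC_clade" if gc_content(seq) >= 0.5 else "low_GC_clade"


def flag_label_mismatches(records, predict):
    """Return records whose declared label disagrees with the prediction.

    Each record is (accession, declared_label, sequence). Flagged records
    are candidates for review, not confirmed errors.
    """
    flagged = []
    for accession, declared, seq in records:
        predicted = predict(seq)
        if predicted != declared:
            flagged.append((accession, declared, predicted))
    return flagged


# Hypothetical database records.
records = [
    ("ACC0001", "high_GC_clade", "GCGCGGCCATGC"),
    ("ACC0002", "low_GC_clade", "ATATTTAAGCAT"),
    ("ACC0003", "low_GC_clade", "GGCCGCGCGTCC"),  # label conflicts with sequence
]

flagged = flag_label_mismatches(records, toy_predict_taxon)
for accession, declared, predicted in flagged:
    print(accession, "declared:", declared, "predicted:", predicted)
```

Passing the predictor in as an argument keeps the comparison logic independent of the model, so the same loop could wrap a taxonomy, gene, or function predictor.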

Sequence-based methods to detect taxonomically misclassified bacterial genome sequences tend to be based either on distance measures between pairs of sequences or on consistency with a reference 16S rRNA phylogeny. Common distance metrics include the average nucleotide identity (ANI), digital DNA-DNA hybridization (dDDH), multi-locus sequence analysis (MLSA), k-mer overlap (summarized by Federhen et al.^[32]), and information theoretic distances.^[33] Given a genome distance, taxonomic misclassifications have been discovered by identifying outlier genomes that exceed a manually determined distance threshold to trusted reference genomes.^[34]^[35]^[36]^[37]^[38]^[32]^[39] The need for reference genomes is problematic, since approximately 20 percent of the bacterial genome sequences in GenBank currently (as of August, 2017) do not have a reference (or “type”) genome available (NCBI).^[40] The lack of bacterial genomes with a “type” designation is not due to the cost of sequencing but rather the need to satisfy a specific set of formal requirements^[41], which includes submitting culturable isolates to more than one culture collection. This poses a significant challenge for unculturable bacteria.
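As a minimal sketch of the threshold-based outlier approach described above, the following uses a Jaccard distance over k-mer sets as a simple stand-in for k-mer overlap. The sequences, taxon labels, value of `k`, and threshold are toy assumptions chosen for illustration. Production analyses would use whole genomes and established metrics such as ANI, with thresholds determined by domain experts.

```python
def kmer_set(seq, k=4):
    """Set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=4):
    """1 - |shared k-mers| / |all k-mers|; 0.0 means identical k-mer content."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def flag_outliers(genomes, references, threshold, k=4):
    """Split genomes into flagged outliers and entries that cannot be checked.

    genomes: {name: (declared_taxon, sequence)}
    references: {taxon: trusted reference sequence}
    """
    flagged, unverifiable = [], []
    for name, (taxon, seq) in genomes.items():
        ref = references.get(taxon)
        if ref is None:
            # No reference ("type") genome exists for the declared taxon.
            unverifiable.append(name)
        elif jaccard_distance(seq, ref, k) > threshold:
            flagged.append(name)
    return flagged, unverifiable


# Hypothetical toy data.
references = {"taxon_A": "ACGTACGTTGCAACGTTAGC"}
genomes = {
    "g1": ("taxon_A", "ACGTACGTTGCAACGTTAGT"),  # one substitution vs. reference
    "g2": ("taxon_A", "TTTTGGGGCCCCAAAATTTT"),  # shares no 4-mers with reference
    "g3": ("taxon_B", "ACGTACGTTGCAACGTTAGC"),  # declared taxon has no reference
}

flagged, unverifiable = flag_outliers(genomes, references, threshold=0.5)
print("flagged:", flagged)
print("unverifiable:", unverifiable)
```

The separate `unverifiable` list reflects the limitation noted above: a genome whose declared taxon lacks a reference cannot be checked by this kind of method at all.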

Distinct from these pairwise distance-based methods, a recent method for identifying taxonomically mislabeled sequences^[42] uses consistency between a given set of taxonomic labels and a phylogenetic tree computed from a multiple sequence alignment of 16S rRNA sequences. This approach uses a single model of evolution to identify sequences whose taxonomic placement is most likely incorrect. However, there are multiple, competing methods for assigning bacterial taxonomy and, in particular, multiple sequence alignment of 16S rRNA can fail to resolve closely related species.^[43]^[44]^[45]

References

↑ Moussouni, F.; Berti‐Équille, L. (2014). "Cleaning, Integrating, and Warehousing Genomic Data From Biomedical Resources". In Elloumi, M.; Zomaya, A.Y.. Biological Knowledge Discovery Handbook. John Wiley & Sons. pp. 35–58. doi:10.1002/9781118617151.ch02. ISBN 9781118617151.
↑ Kim, M.; Lauter, K. (2015). "Private genome analysis through homomorphic encryption". BMC Medical Informatics and Decision Making 15 (Suppl 5): S3. doi:10.1186/1472-6947-15-S5-S3. PMC PMC4699052. PMID 26733152. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4699052.
↑ Mandal, A.; Mitchell, J.C.; Montgomery, H.; Roy, A. (2018). "Data oblivious genome variants search on Intel SGX". In Garcia-Alfaro, J.; Herrera-Joancomartí, J.; Livraga, G.; Rios, R.. Data Privacy Management, Cryptocurrencies and Blockchain Technology. Lecture Notes in Computer Science. Springer International Publishing. pp. 21. doi:10.1007/978-3-030-00305-0_21. ISBN 9783030003050.
↑ Ozercan, H.I.; Ileri, A.M.; Ayday, E.; Alkan, C. (2018). "Realizing the potential of blockchain technologies in genomics". Genome Research 28 (9): 1255–63. doi:10.1101/gr.207464.116. PMC PMC6120626. PMID 30076130. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6120626.
↑ Merchant, S.; Wood, D.E.; Salzberg, S.L. (2014). "Unexpected cross-species contamination in genome sequencing projects". PeerJ 2: e675. doi:10.7717/peerj.675. PMC PMC4243333. PMID 25426337. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243333.
↑ Strong, M.J.; Xu, G.; Morici, L. et al. (2014). "Microbial contamination in next generation sequencing: implications for sequence-based analysis of clinical samples". PLoS Pathogens 10 (11): e1004437. doi:10.1371/journal.ppat.1004437. PMC PMC4239086. PMID 25412476. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4239086.
↑ Pilli, E.; Modi, A.; Serpico, C. et al. (2013). "Monitoring DNA contamination in handled vs. directly excavated ancient human skeletal remains". PLoS One 8 (1): e52524. doi:10.1371/journal.pone.0052524. PMC PMC3556025. PMID 23372650. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3556025.
↑ Ballenghien, M.; Faivre, N.; Galtier, N. (2017). "Patterns of cross-contamination in a multispecies population genomic project: Detection, quantification, impact, and solutions". BMC Biology 15 (1): 25. doi:10.1186/s12915-017-0366-6. PMC PMC5370491. PMID 28356154. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5370491.
↑ Simion, P.; Philippe, H.; Baurain, D. et al. (2017). "A Large and Consistent Phylogenomic Dataset Supports Sponges as the Sister Group to All Other Animals". Current Biology 27 (7): 958-967. doi:10.1016/j.cub.2017.02.031. PMID 28318975.
↑ Witt, N.; Rodger, G.; Vandesompele, J. et al. (2009). "An assessment of air as a source of DNA contamination encountered when performing PCR". Journal of Biomolecular Techniques 20 (5): 236–40. PMC PMC2777341. PMID 19949694. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777341.
↑ Longo, M.S.; O'Neill, M.J.; O'Neill, R.J. (2011). "Abundant human DNA contamination identified in non-primate genome databases". PLoS One 6 (2): e16410. doi:10.1371/journal.pone.0016410. PMC PMC3040168. PMID 21358816. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040168.
↑ Puiu, D.; Salzberg, S.L. (2008). "Re-assembly of the genome of Francisella tularensis subsp. holarctica OSU18". PLoS One 3 (10): e3427. doi:10.1371/journal.pone.0003427. PMC PMC2561293. PMID 18927608. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2561293.
↑ Cooper, P. (2014). "Proof by synthesis of Tobacco mosaic virus". Genome Biology 15 (5): R67. doi:10.1186/gb-2014-15-5-r67. PMC PMC4072989. PMID 24887356. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072989.
↑ Philippe, H.; Brinkmann, H.; Lavrov, D.V. et al. (2011). "Resolving difficult phylogenetic questions: Why more sequences are not enough". PLoS Biology 9 (3): e1000602. doi:10.1371/journal.pbio.1000602. PMC PMC3057953. PMID 21423652. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057953.
↑ Laurin-Lemay, S.; Brinkmann, H.; Philippe, H. (2012). "Origin of land plants revisited in the light of sequence contamination and missing data". Current Biology 22 (15): R593–4. doi:10.1016/j.cub.2012.06.013. PMID 22877776.
↑ Peccoud, J.; Gallegos, J.E.; Murch, R. et al. (2018). "Cyberbiosecurity: From Naive Trust to Risk Awareness". Trends in Biotechnology 36 (1): 4–7. doi:10.1016/j.tibtech.2017.10.012. PMID 29224719.
↑ National Research Council (2000). "Bioinformatics: Converting Data to Knowledge". National Academies Press. doi:10.17226/9990. https://www.nap.edu/catalog/9990/bioinformatics-converting-data-to-knowledge-workshop-summary.
↑ Pundir, S.; Martin, M.J.; O'Donovan, C. (2017). "UniProt Protein Knowledgebase". Methods in Molecular Biology 1558: 41–55. doi:10.1007/978-1-4939-6783-4_2. PMC PMC5565770. PMID 28150232. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5565770.
↑ Gurevich, A.; Saveliev, V.; Vyahhi, N.; Tesler, G. (2013). "QUAST: Quality assessment tool for genome assemblies". Bioinformatics 29 (8): 1072–5. doi:10.1093/bioinformatics/btt086. PMC PMC3624806. PMID 23422339. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806.
↑ Simão, F.A.; Waterhouse, R.M.; Ioannidis, P. et al. (2015). "BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs". Bioinformatics 31 (19): 3210–2. doi:10.1093/bioinformatics/btv351. PMID 26059717.
↑ Tennessen, K.; Andersen, E.; Clingenpeel, S. et al. (2016). "ProDeGe: A computational protocol for fully automated decontamination of genomes". The ISME Journal 10 (1): 269–72. doi:10.1038/ismej.2015.100. PMC PMC4681846. PMID 26057843. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681846.
↑ Lux, M.; Krüger, J.; Rinke, C. et al. (2016). "acdc - Automated Contamination Detection and Confidence estimation for single-cell genome data". BMC Bioinformatics 17 (1): 543. doi:10.1186/s12859-016-1397-7. PMC PMC5168860. PMID 27998267. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5168860.
↑ Parks, D.H.; Imelfort, M.; Skennerton, C.T. et al. (2015). "CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes". Genome Research 25 (7): 1043–55. doi:10.1101/gr.186072.114. PMC PMC4484387. PMID 25977477. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484387.
↑ Dittami, S.M.; Corre, E. (2017). "Detection of bacterial contaminants and hybrid sequences in the genome of the kelp Saccharina japonica using Taxoblast". PeerJ 5: e4073. doi:10.7717/peerj.4073. PMC PMC5695246. PMID 29158994. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5695246.
↑ McNair, K.; Edwards, R.A. (2015). "GenomePeek: An online tool for prokaryotic genome and metagenome analysis". PeerJ 3: e1025. doi:10.7717/peerj.1025. PMC PMC4476108. PMID 26157610. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4476108.
↑ Kryukov, K.; Imanishi, T. (2016). "Human Contamination in Public Genome Assemblies". PLoS One 11 (9): e0162424. doi:10.1371/journal.pone.0162424. PMC PMC5017631. PMID 27611326. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017631.
↑ Cornet, L.; Meunier, L.; Van Vlierberghe, M. et al. (2018). "Consensus assessment of the contamination level of publicly available cyanobacterial genomes". PLoS One 13 (7): e0200323. doi:10.1371/journal.pone.0200323. PMC PMC6059444. PMID 30044797. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6059444.
↑ Lu, J.; Salzberg, S.L. (2018). "Removing contaminants from databases of draft genomes". PLoS Computational Biology 14 (6): e1006277. doi:10.1371/journal.pcbi.1006277. PMC PMC6034898. PMID 29939994. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6034898.
↑ Krizhevsky, A.; Sutskever, I.; Hinton, G.E. (2017). "ImageNet classification with deep convolutional neural networks". Communications of the ACM 60 (6): 84–90. doi:10.1145/3065386.
↑ Sutskever, I.; Vinyals, O.; Le, Q.V. (2014). "Sequence to Sequence Learning with Neural Networks". Proceedings from Advances in Neural Information Processing Systems 2014. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.
↑ Larrañaga, P.; Calvo, B.; Santana, R. et al. (2006). "Machine learning in bioinformatics". Briefings in Bioinformatics 7 (1): 86–112. doi:10.1093/bib/bbk007. PMID 16761367.
↑ Federhen, S.; Rossello-Mora, R.; Klenk, H.-P. et al. (2016). "Meeting report: GenBank microbial genomic taxonomy workshop (12–13 May, 2015)". Standards in Genomic Sciences 11: 15. doi:10.1186/s40793-016-0134-1. PMC PMC4748488. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4748488.
↑ Li, M.; Chen, X.; Li, X. et al. (2004). "The Similarity Metric". IEEE Transactions on Information Theory 50 (12): 3250–64. doi:10.1109/TIT.2004.838101.
↑ Goris, J.; Konstantinidis, K.T.; Klappenbach, J.A. et al. (2007). "DNA-DNA hybridization values and their relationship to whole-genome sequence similarities". International Journal of Systematic and Evolutionary Microbiology 57 (Pt 1): 81–91. doi:10.1099/ijs.0.64483-0. PMID 17220447.
↑ Colston, S.M.; Fullmer, M.S.; Beka, L. et al. (2014). "Bioinformatic genome comparisons for taxonomic and phylogenetic assignments using Aeromonas as a test case". mBio 5 (6): e02136. doi:10.1128/mBio.02136-14. PMC PMC4251997. PMID 25406383. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251997.
↑ Figueras, M.J.; Beaz-Hidalgo, R.; Hossain, M.J.; Liles, M.R. (2014). "Taxonomic affiliation of new genomes should be verified using average nucleotide identity and multilocus phylogenetic analysis". Genome Announcements 2 (6): e00927-14. doi:10.1128/genomeA.00927-14. PMC PMC4256179. PMID 25477398. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256179.
↑ Kim, M.; Oh, H.S.; Park, S.C.; Chun, J. (2014). "Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes". International Journal of Systematic and Evolutionary Microbiology 64 (Pt 2): 346-51. doi:10.1099/ijs.0.059774-0. PMID 24505072.
↑ Beaz-Hidalgo, R.; Hossain, M.J.; Liles, M.R.; Figueras, M.J. (2015). "Strategies to avoid wrongly labelled genomes using as example the detected wrong taxonomic affiliation for aeromonas genomes in the GenBank database". PLoS One 10 (1): e0115813. doi:10.1371/journal.pone.0115813. PMC PMC4301921. PMID 25607802. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4301921.
↑ Tanizawa, Y.; Fujisawa, T.; Kaminuma, E. et al. (2016). "DFAST and DAGA: web-based integrated genome annotation tools and resources". Bioscience of Microbiota, Food and Health 35 (4): 173-184. doi:10.12938/bmfh.16-003. PMC PMC5107635. PMID 27867804. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5107635.
↑ "Bacterial ANI Report". NCBI. ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_bacteria.txt.
↑ Federhen, S. (2015). "Type material in the NCBI Taxonomy Database". Nucleic Acids Research 43 (DB1): D1086–98. doi:10.1093/nar/gku1127. PMC PMC4383940. PMID 25398905. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383940.
↑ Kozlov, A.M.; Zhang, J.; Yilmaz, P. et al. (2016). "Phylogeny-aware identification and correction of taxonomically mislabeled sequences". Nucleic Acids Research 44 (11): 5022-33. doi:10.1093/nar/gkw396. PMC PMC4914121. PMID 27166378. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4914121.
↑ Richter, M.; Rosselló-Móra, R. (2009). "Shifting the genomic gold standard for the prokaryotic species definition". Proceedings of the National Academy of Sciences of the United States of America 106 (45): 19126-31. doi:10.1073/pnas.0906412106. PMC PMC2776425. PMID 19855009. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2776425.
↑ Kämpfer, P.; Glaeser, S.P. (2012). "Prokaryotic taxonomy in the sequencing era--the polyphasic approach revisited". Environmental Microbiology 14 (2): 291-317. doi:10.1111/j.1462-2920.2011.02615.x. PMID 22040009.
↑ Larsen, M.V.; Cosentino, S.; Lukjancenko, O. et al. (2014). "Benchmarking of methods for genomic taxonomy". Journal of Clinical Microbiology 52 (5): 1529-39. doi:10.1128/JCM.02981-13. PMC PMC3993634. PMID 24574292. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3993634.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, grammar, and punctuation. In some cases important information was missing from the references, and that information was added. The footnotes in the original material were turned into inline references for this version.

[MoussouniClean13-1] Moussouni, F.; Berti‐Équille, L. (2014). "Cleaning, Integrating, and Warehousing Genomic Data From Biomedical Resources". In Elloumi, M.; Zomaya, A.Y.. Biological Knowledge Discovery Handbook. John Wiley & Sons. pp. 35–58. doi:10.1002/9781118617151.ch02. ISBN 9781118617151.

[KimPrivate15-2] Kim, M.; Lauter, K. (2015). "Private genome analysis through homomorphic encryption". BMC Medical Informatics and Decision Making 15 (Suppl 5): S3. doi:10.1186/1472-6947-15-S5-S3. PMC PMC4699052. PMID 26733152. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4699052.

[MandalData18-3] Mandal, A.; Mitchell, J.C.; Montgomery, H.; Roy, A. (2018). "Data oblivious genome variants search on Intel SGX". In Garcia-Alfaro, J.; Herrera-Joancomartí, J.; Livraga, G.; Rios, R.. Data Privacy Management, Cryptocurrencies and Blockchain Technology. Lecture Notes in Computer Science. Springer International Publishing. pp. 21. doi:10.1007/978-3-030-00305-0_21. ISBN 9783030003050.

[OzercanRealiz18-4] Ozercan, H.I.; Ileri, A.M.; Ayday, E.; Alkan, C. (2018). "Realizing the potential of blockchain technologies in genomics". Gemone Research 28 (9): 1255–63. doi:10.1101/gr.207464.116. PMC PMC6120626. PMID 30076130. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6120626.

[MerchantUnexp14-5] 5.0 ^5.1 Merchant, S.; Wood, D.E.; Salzberg, S.L. (2014). "Unexpected cross-species contamination in genome sequencing projects". PeerJ 2: e675. doi:10.7717/peerj.675. PMC PMC4243333. PMID 25426337. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243333.

[StrongMicro14-6] Strong, M.J.; Xu, G.; Morici, L. et al. (2014). "Microbial contamination in next generation sequencing: implications for sequence-based analysis of clinical samples". PLoS Pathogens 10 (11): e1004437. doi:10.1371/journal.ppat.1004437. PMC PMC4239086. PMID 25412476. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4239086.

[PilliMonit13-7] Pilli, E.; Modi, A.; Serpico, C. et al. (2013). "Monitoring DNA contamination in handled vs. directly excavated ancient human skeletal remains". PLoS One 8 (1): e52524. doi:10.1371/journal.pone.0052524. PMC PMC3556025. PMID 23372650. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3556025.

[BallenghienPatterns17-8] Ballenghien, M.; Faivre, N;. Galtier, N. (2017). "Patterns of cross-contamination in a multispecies population genomic project: Detection, quantification, impact, and solutions". BMC Biology 15 (1): 25. doi:10.1186/s12915-017-0366-6. PMC PMC5370491. PMID 28356154. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5370491.

[SimionALarge17-9] Simion, P.; Philippe, H.; Baurain, D. et al. (2017). "A Large and Consistent Phylogenomic Dataset Supports Sponges as the Sister Group to All Other Animals". Current Biology 27 (7): 958-967. doi:10.1016/j.cub.2017.02.031. PMID 28318975.

[WittAnAss09-10] Witt, N.; Rodger, G.; Vandesompele, J. et al. (2009). "An assessment of air as a source of DNA contamination encountered when performing PCR". Journal of Biomolecular Techniques 20 (5): 236–40. PMC PMC2777341. PMID 19949694. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777341.

[LongoAbund11-11] Longo, M.S.; O'Neill, M.J.; O'Neill, R.J. (2011). "Abundant human DNA contamination identified in non-primate genome databases". PLoS One 6 (2): e16410. doi:10.1371/journal.pone.0016410. PMC PMC3040168. PMID 21358816. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040168.

[PuiuReass08-12] Puiu, D.; Salzberg, S.L. (2008). "Re-assembly of the genome of Francisella tularensis subsp. holarctica OSU18". PLoS One 3 (10): e3427. doi:10.1371/journal.pone.0003427. PMC PMC2561293. PMID 18927608. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2561293.

[CooperProof14-13] Cooper, P. (2014). "Proof by synthesis of Tobacco mosaic virus". Genome Biology 15 (5): R67. doi:10.1186/gb-2014-15-5-r67. PMC PMC4072989. PMID 24887356. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072989.

[PhilippeResolv11-14] Philippe, H;. Brinkmann, H.; Lavrov, D.V. et al. (2011). "Resolving difficult phylogenetic questions: Why more sequences are not enough". PLoS Biology 9 (3): e1000602. doi:10.1371/journal.pbio.1000602. PMC PMC3057953. PMID 21423652. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057953.

[Laurin-LemayOrigin12-15] Laurin-Lemay, S.; Brinkmann, H.; Philippe, H. (2012). "Resolving difficult phylogenetic questions: Why more sequences are not enough". PLoS Biology 22 (15): R593–4. doi:10.1016/j.cub.2012.06.013. PMID 22877776.

[PeccoudCyber18-16] Peccoud, J.; Gallegos, J.E.; Murch, R. et al. (2018). "Cyberbiosecurity: From Naive Trust to Risk Awareness". Trends in Biotechnology 36 (1): 4–7. doi:10.1016/j.tibtech.2017.10.012. PMID 29224719.

[NRCBio00-17] National Research Council (2000). "Bioinformatics: Converting Data to Knowledge". National Academies Press. doi:10.17226/9990. https://www.nap.edu/catalog/9990/bioinformatics-converting-data-to-knowledge-workshop-summary.

[PundirUniProt17-18] Pundir, S.; Martin, M.J.; O'Donovan, C. (2017). "UniProt Protein Knowledgebase". Methods in Molecular Biology 1558: 41–55. doi:10.1007/978-1-4939-6783-4_2. PMC PMC5565770. PMID 28150232. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5565770.

[GurevichQUAST13-19] Gurevich, A.; Saveliev, V.; Vyahhi, N.; Tesler, G. (2013). "QUAST: Quality assessment tool for genome assemblies". Bioinformatics 29 (8): 1072–5. doi:10.1093/bioinformatics/btt086. PMC PMC3624806. PMID 23422339. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806.

[Sim.C3.A3oBUSCO15-20] Simão, F.A.; Waterhouse, R.M.; Ioannidis, P. et al. (2015). "BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs". Bioinformatics 31 (19): 3210–2. doi:10.1093/bioinformatics/btv351. PMID 26059717.

[TennessenProDeGe16-21] Tennessen, K.; Andersen, E.; Clingenpeel, S. et al. (2016). "ProDeGe: A computational protocol for fully automated decontamination of genomes". The ISME Journal 10 (1): 269–72. doi:10.1038/ismej.2015.100. PMC PMC4681846. PMID 26057843. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681846.

[LuxACDC16-22] Lux, M.; Krüger, J.; Rinke, C. et al. (2016). "acdc - Automated Contamination Detection and Confidence estimation for single-cell genome data". BMC Bioinformatics 17 (1): 543. doi:10.1186/s12859-016-1397-7. PMC PMC5168860. PMID 27998267. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5168860.

[ParksCheckM15-23] Parks, D.H.; Imelfort, M.; Skennerton, C.T. et al. (2015). "CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes". Genome Research 25 (7): 1043–55. doi:10.1101/gr.186072.114. PMC PMC4484387. PMID 25977477. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484387.

[DittamiDetect17-24] Dittami, S.M.; Corre, E. (2017). "Detection of bacterial contaminants and hybrid sequences in the genome of the kelp Saccharina japonica using Taxoblast". PeerJ 5: e4073. doi:10.7717/peerj.4073. PMC PMC5695246. PMID 29158994. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5695246.

[McNairGenome15-25] McNair, K.; Edwards, R.A. (2015). "GenomePeek: An online tool for prokaryotic genome and metagenome analysis". PeerJ 3: e1025. doi:10.7717/peerj.1025. PMC PMC4476108. PMID 26157610. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4476108.

[RyukovHuman16-26] Kryukov, K.; Imanishi, T. (2016). "Human Contamination in Public Genome Assemblies". PLoS One 11 (9): e0162424. doi:10.1371/journal.pone.0162424. PMC PMC5017631. PMID 27611326. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017631.

[CornetConcensus18-27] Cornet, L.; Meunier, L.; Van Vlierberghe, M. et al. (2018). "Consensus assessment of the contamination level of publicly available cyanobacterial genomes". PLoS One 13 (7): e0200323. doi:10.1371/journal.pone.0200323. PMC PMC6059444. PMID 30044797. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6059444.

[LuRemoving18-28] Lu, J.; Salzberg, S.L. (2018). "Removing contaminants from databases of draft genomes". PLoS Computational Biology 14 (6): e1006277. doi:10.1371/journal.pcbi.1006277. PMC PMC6034898. PMID 29939994. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6034898.

[KrizhevskyImageNet17-29] Krizhevsky, A.; Sutskever, I.; Hinton, G.E. (2017). "ImageNet classification with deep convolutional neural networks". Communications of the ACM 60 (6): 84–90. doi:10.1145/3065386.

[SutskeverSeq14-30] Sutskever, I.; Vinyals, O.; Le, Q.V. (2014). "Sequence to Sequence Learning with Neural Networks". Proceedings from Advances in Neural Information Processing Systems 2014. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.

[Larra.C3.B1agaMach06-31] Larrañaga, P.; Calvo, B.; Santana, R. et al. (2006). "Machine learning in bioinformatics". Briefings in Bioinformatics 7 (1): 86–112. doi:10.1093/bib/bbk007. PMID 16761367.

[FederhenMeeting16-32] Federhen, S.; Rossello-Mora, R.; Klenk, H.-P. et al. (2016). "Meeting report: GenBank microbial genomic taxonomy workshop (12–13 May, 2015)". Standards in Genomic Sciences 11: 15. doi:10.1186/s40793-016-0134-1. PMC PMC4748488. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4748488.

[LiTheSim04-33] Li, M.; Chen, X.; Li, X. et al. (2004). "The Similarity Metric". IEEE Transactions on Information Theory 50 (12): 3250–64. doi:10.1109/TIT.2004.838101.

[GorisDNA07-34] Goris, J.; Konstantinidis, K.T.; Klappenbach, J.A. et al. (2007). "DNA-DNA hybridization values and their relationship to whole-genome sequence similarities". International Journal of Systematic and Evolutionary Microbiology 57 (Pt 1): 81–91. doi:10.1099/ijs.0.64483-0. PMID 17220447.

[ColstonBio14-35] Colston, S.M.; Fullmer, M.S.; Beka, L. et al. (2014). "Bioinformatic genome comparisons for taxonomic and phylogenetic assignments using Aeromonas as a test case". mBio 5 (6): e02136. doi:10.1128/mBio.02136-14. PMC PMC4251997. PMID 25406383. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251997.

[FiguerasTax14-36] Figueras, M.J.; Beaz-Hidalgo, R.; Hossain, M.J.; Liles, M.R. (2014). "Taxonomic affiliation of new genomes should be verified using average nucleotide identity and multilocus phylogenetic analysis". Genome Announcements 2 (6): e00927-14. doi:10.1128/genomeA.00927-14. PMC PMC4256179. PMID 25477398. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256179.

[KimTowards14-37] Kim, M.; Oh, H.S.; Park, S.C.; Chun, J. (2014). "Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes". International Journal of Systematic and Evolutionary Microbiology 64 (Pt 2): 346–51. doi:10.1099/ijs.0.059774-0. PMID 24505072.

[Beaz-HidalgoStrat15-38] Beaz-Hidalgo, R.; Hossain, M.J.; Liles, M.R.; Figueras, M.J. (2015). "Strategies to avoid wrongly labelled genomes using as example the detected wrong taxonomic affiliation for Aeromonas genomes in the GenBank database". PLoS One 10 (1): e0115813. doi:10.1371/journal.pone.0115813. PMC PMC4301921. PMID 25607802. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4301921.

[TanizawaDFAST16-39] Tanizawa, Y.; Fujisawa, T.; Kaminuma, E. et al. (2016). "DFAST and DAGA: web-based integrated genome annotation tools and resources". Bioscience of Microbiota, Food and Health 35 (4): 173–184. doi:10.12938/bmfh.16-003. PMC PMC5107635. PMID 27867804. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5107635.

[NCBIBacterial-40] "Bacterial ANI Report". NCBI. ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_bacteria.txt.

[FederhenType15-41] Federhen, S. (2015). "Type material in the NCBI Taxonomy Database". Nucleic Acids Research 43 (DB1): D1086–98. doi:10.1093/nar/gku1127. PMC PMC4383940. PMID 25398905. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383940.

[KozlovPhylo16-42] Kozlov, A.M.; Zhang, J.; Yilmaz, P. et al. (2016). "Phylogeny-aware identification and correction of taxonomically mislabeled sequences". Nucleic Acids Research 44 (11): 5022–33. doi:10.1093/nar/gkw396. PMC PMC4914121. PMID 27166378. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4914121.

[RichterShift09-43] Richter, M.; Rosselló-Móra, R. (2009). "Shifting the genomic gold standard for the prokaryotic species definition". Proceedings of the National Academy of Sciences of the United States of America 106 (45): 19126–31. doi:10.1073/pnas.0906412106. PMC PMC2776425. PMID 19855009. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2776425.

[K.C3.A4mpferProkary12-44] Kämpfer, P.; Glaeser, S.P. (2012). "Prokaryotic taxonomy in the sequencing era – the polyphasic approach revisited". Environmental Microbiology 14 (2): 291–317. doi:10.1111/j.1462-2920.2011.02615.x. PMID 22040009.

[LarsenBench14-45] Larsen, M.V.; Cosentino, S.; Lukjancenko, O. et al. (2014). "Benchmarking of methods for genomic taxonomy". Journal of Clinical Microbiology 52 (5): 1529–39. doi:10.1128/JCM.02981-13. PMC PMC3993634. PMID 24574292. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3993634.


