Journal:Named data networking for genomics data management and integrated workflows

From LIMSWiki
Revision as of 23:47, 18 March 2021 by Shawndouglas (talk | contribs) (Saving and adding more.)
Jump to navigationJump to search
Full article title Named data networking for genomics data management and integrated workflows
Journal Frontiers in Big Data
Author(s) Ogle, Cameron; Reddick, David; McKnight, Coleman; Biggs, Tyler; Pauly, Rini; Ficklin, Stephen P.; Feltus, F. Alex; Shannigrahi, Susmit
Author affiliation(s) Clemson University, Tennessee Tech University, Washington State University
Primary contact Email: sshannigrahi at tntech dot edu
Year published 2021
Volume and issue 4
Article # 582468
DOI 10.3389/fdata.2021.582468
ISSN 2624-909X
Distribution license Creative Commons Attribution 4.0 International
Website https://www.frontiersin.org/articles/10.3389/fdata.2021.582468/full
Download https://www.frontiersin.org/articles/10.3389/fdata.2021.582468/pdf (PDF)

Abstract

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in various geographically distributed repositories such as those hosted by the NCBI, as well as in the DNA Data Bank of Japan (DDBJ), European Bioinformatics Institute (EBI), and NASA’s GeneLab.

In this work, we first systematically point out the data management challenges of the genomics community. We then introduce named data networking (NDN), a novel but well-researched internet architecture capable of solving these challenges at the network layer. NDN performs all operations such as forwarding requests to data sources, content discovery, access, and retrieval using content names (that are similar to traditional filenames or filepaths), all while eliminating the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval using in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federation of content repositories, retrieval from multiple sources, remote data subsetting, and others. Using name-based operations also streamlines deployment and integration of workflows with various cloud platforms.

We make four signigicant contributions with this wor. First, we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate. Second, we describe our efforts in applying NDN for a contemporary genomics workflow (GEMmaker) and quantify the improvements. The preliminary evaluation shows a sixfold speed up in data insertion into the workflow. Third, as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in the "Method" section) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes that can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP).

The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN’s properties that can benefit the genomics community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in a future work.

Keywords: genomics data, genomics workflows, large science data, cloud computing, named data networking

Introduction

Scientific communities are entering a new era of exploration and discovery in many fields, driven by high-density data accumulation. A few examples are climate science[1], high-energy particle physics (HEP)[2], astrophysics[3][4], genomics[5], seismology[6], and biomedical research[7], just to name a few. Often referred to as “data-intensive” science, these communities utilize and generate extremely large volumes of data, often reaching into the petabytes[8] and soon projected to reach into the exabytes.

Data-intensive science has created radically new opportunities. Take for example high-throughput DNA sequencing (HTDS). Until recently, HTDS was slow and expensive, and only a few institutes were capable of performing it at scale.[9] With the advances in supercomputers, specialized DNA sequencers, and better bioinformatics algorithms, the effectiveness and cost of sequencing has dropped considerably and continues to drop. For example, sequencing the first reference human genome cost around $2.7 billion over 15 years, while currently it costs under $1,000 to resequence a human genome.[10] With commercial incentives, several companies are offering fragmented genome re-sequencing under $100, performed in only a few days. This massive drop in cost and improvement in speed supports more advanced scientific discovery. For example, earlier scientists could only test their hypothesis on a small number of genomes or gene expression conditions within or between species. With more publicly available datasets[5], scientists can test their hypotheses against a larger number of genomes, potentially enabling them to identify rare mutations, precisely classify diseases based on a specific patient, and, thusly, more accurately treat the disease.[11]

While the growth of DNA sequencing is encouraging, it has also created difficulty in genomics data management. For example, the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA) database hosts 42 petabytes of publicly accessible DNA sequence data.[12] Scientists desiring to use public data must discover (or locate) the data and move it from globally distributed sites to on-premize clusters and distributed computing platforms, including public and commercial clouds. Public repositories such as the NCBI SRA contain a subset of all available genomics data.[13] Similar repositories are hosted by NASA, the National Institutes of Health (NIH), and other organizations. Even though these datasets are highly curated, each public repository uses their own standards for data naming, retrieval, and discovery that makes locating and utilizing these datasets difficult.

Moreover, data management problems require the community to build and the scientists to spend time learning complex infrastructures (e.g., cloud platforms, grids) and creating tools, scripts, and workflows that can (semi-) automate their research. The current trend of moving from localized institutional storage and computing to an on-demand cloud computing model adds another layer of complexity to the workflows. The next generation of scientific breakthroughs may require massive data. Our ability to manage, distribute, and utilize these types of extreme-scale datasets and securely integrate them with computational platforms may dictate our success (or failure) in future scientific research.



References

  1. Cinquini, L.; Chrichton, D.; Mattmann, C. et al. (2014). "The Earth System Grid Federation: An open infrastructure for access to distributed geospatial data". Future Generation Computer Systems 36: 400–17. doi:10.1016/j.future.2013.07.002. 
  2. ATLAS Collaboration; Aad, G.; Abat, E. et al. (2008). "The ATLAS Experiment at the CERN Large Hadron Collider". Journal of Instrumentation 3: S08003. doi:10.1088/1748-0221/3/08/S08003. 
  3. Dewdney, P.E.; Hall, P.J.; Schillizzi, R.T. et al. (2009). "The Square Kilometre Array". Proceedings of the IEEE 97 (8): 1482-1496. doi:10.1109/JPROC.2009.2021005. 
  4. LSST Dark Energy Science Collaboration; Abate, A.; Aldering, G. et al. (2012). "Large Synoptic Survey Telescope Dark Energy Science Collaboration". arXiv: 1–133. https://arxiv.org/abs/1211.0310v1. 
  5. 5.0 5.1 Sayers, E.W.; Beck, J.; Brister, J.R. et al. (2020). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research (D1): D9–D16. doi:10.1093/nar/gkz899. PMC PMC6943063. PMID 31602479. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6943063. 
  6. Tsuchiya, S.; Sakamoto, Y.; Tsuchimoto, Y. et al. (2012). "Big Data Processing in Cloud Environments" (PDF). Fujitsu Scientific & Technical Journal 48 (2): 159–68. https://www.fujitsu.com/downloads/MAG/vol48-2/paper09.pdf. 
  7. Luo, J.; Wu, M.; Gopukumar, D. et al. (2016). "Big Data Application in Biomedical Research and Health Care: A Literature Review". Biomedical Informatics Insights 8: 1–10. doi:10.4137/BII.S31559. PMC PMC4720168. PMID 26843812. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720168. 
  8. Shannigrahi, S.; Fan, C.; Papadopoulos, C. et al. (2018). "NDN-SCI for managing large scale genomics data". Proceedings of the 5th ACM Conference on Information-Centric Networking: 204–05. doi:10.1145/3267955.3269022. 
  9. McCombie, W.R.; McPherson, J.D.; Mardis, E.R. (2016). "Next-Generation Sequencing Technologies". Cold Spring Harbor Perspectives in Medicine 9 (11): 1–10. doi:10.4137/BII.S31559. PMC PMC4720168. PMID 26843812. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720168. 
  10. National Human Genome Research Institute (2020). "The Cost of Sequencing a Human Genome". National Institutes of Health. https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost. Retrieved 16 March 2020. 
  11. Lowy-Gallego, E.; Fairley, S.; Zheng-Bradley, X. et al. (2019). "Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project". Wellcome Open Research 4: 50. doi:10.12688/wellcomeopenres.15126.2. PMC PMC7059836. PMID 32175479. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7059836. 
  12. NCBI (2020). "Sequence Read Archive". NCBI. https://trace.ncbi.nlm.nih.gov/Traces/sra/. Retrieved 04 February 2020. 
  13. Stephens, Z.D.; Lee, S.Y.; Faghri, F. et al. (2015). "Big Data: Astronomical or Genomical?". PLoS Biology 13 (7): e1002195. doi:10.1371/journal.pbio.1002195. PMC PMC4494865. PMID 26151137. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494865. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original paper listed references alphabetically; this wiki lists them by order of appearance, by design. The two footnotes were turned into inline references for convenience.