Journal:AI meets exascale computing: Advancing cancer research with large-scale high-performance computing
Full article title | AI meets exascale computing: Advancing cancer research with large-scale high-performance computing |
---|---|
Journal | Frontiers in Oncology |
Author(s) |
Bhattacharya, Tanmoy; Brettin, Thomas; Doroshow, James H.; Evrard, Yvonne A.; Greenspan, Emily J.; Gryshuk, Amy L.; Hoang, Thuc T.; Vea Lauzon, Carolyn B.; Nissley, Dwight; Penberthy, Lynne; Stahlberg, Eric; Stevens, Rick; Streitz, Fred; Tourassi, Georgia; Xia, Fangfang; Zaki, George |
Author affiliation(s) |
Los Alamos National Laboratory, Argonne National Laboratory, National Cancer Institute, Frederick National Laboratory for Cancer Research, Lawrence Livermore National Laboratory, National Nuclear Security Administration, U.S. Department of Energy Office of Science, University of Chicago, Oak Ridge National Laboratory |
Primary contact | Email: george dot zaki at nih dot gov |
Editors | Meerzaman, Daoud |
Year published | 2019 |
Volume and issue | 9 |
Page(s) | 984 |
DOI | 10.3389/fonc.2019.00984 |
ISSN | 2234-943X |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://www.frontiersin.org/articles/10.3389/fonc.2019.00984/full |
Download | https://www.frontiersin.org/articles/10.3389/fonc.2019.00984/pdf (PDF) |
This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed. |
Abstract
The application of data science in cancer research has been boosted by major advances in three primary areas: (1) data: diversity, amount, and availability of biomedical data; (2) advances in artificial intelligence (AI) and machine learning (ML) algorithms that enable learning from complex, large-scale data; and (3) advances in computer architectures allowing unprecedented acceleration of simulation and machine learning algorithms. These advances help build in silico ML models that can provide transformative insights from data, including molecular dynamics simulations, next-generation sequencing, omics, imaging, and unstructured clinical text documents. Unique challenges persist, however, in building ML models related to cancer, including: (1) access, sharing, labeling, and integration of multimodal and multi-institutional data across different cancer types; (2) developing AI models for cancer research capable of scaling on next-generation high-performance computers; and (3) assessing robustness and reliability in the AI models. In this paper, we review the National Cancer Institute (NCI)-Department of Energy (DOE) collaboration, the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C), a multi-institution collaborative effort focused on advancing computing and data technologies to accelerate cancer research on the molecular, cellular, and population levels. This collaboration integrates various types of generated data, pre-exascale compute resources, and advances in ML models to increase understanding of basic cancer biology, identify promising new treatment options, predict outcomes, and, eventually, prescribe specialized treatments for patients with cancer.
Keywords: cancer research, high-performance computing, artificial intelligence, deep learning, natural language processing, multi-scale modeling, precision medicine, uncertainty quantification
Introduction
Predictive computational models for patients with cancer can in the future support prevention and treatment decisions by informing choices to achieve the best possible clinical outcome. Toward this vision, in 2015, the national Precision Medicine Initiative (PMI)[1] was announced, motivating efforts to target and advance precision oncology, including looking ahead to the scientific, data, and computational capabilities needed to advance this vision. At the same time, the horizon of computing was changing in the life sciences, as the capabilities and transformations enabled by exascale computing were coming into focus, driven by the accelerated growth in data volumes and anticipated new sources of information catalyzed by new technologies and initiatives such as PMI.
The National Strategic Computing Initiative (NSCI) in 2015 named the Department of Energy (DOE) as a lead agency for “advanced simulation through a capable exascale computing program” and the National Institutes of Health (NIH) as one of the deployment agencies to participate “in the co-design process to integrate the special requirements of their respective missions.” This interagency coordination structure opened the avenue for a tight collaboration between the NCI and the DOE. With shared aims to advance cancer research while shaping the future for exascale computing, the NCI and DOE established the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) in June of 2016 through a five-year memorandum of understanding with three co-designed pilot efforts to address both national priorities. The high-level goals of these three pilots were to push the frontiers of computing technologies in specific areas of cancer research:
- at the cellular level: advance the capabilities of patient-derived pre-clinical models to identify new treatments;
- at the molecular level: further understand the basic biology of undruggable targets; and
- at the population level: gain critical insights on the drivers of population cancer outcomes.
The pilots would also develop new uncertainty quantification (UQ) methods to evaluate confidence in the AI model predictions.
Using co-design principles, each of the pilots in the JDACS4C collaboration is based on—and driven by—team science, which is the hallmark of the collaboration's success. Enabled by deep learning, Pilot One (cellular-level) combines data in innovative ways to develop computationally predictive models for tumor response to novel therapeutic agents. Pilot Two (molecular-level) combines experimental data, simulation, and AI to provide new windows to understand and explore the biology of cancers related to the Ras superfamily of proteins. Pilot Three (population-level) uses AI and clinical information at unprecedented scales to enable precision cancer surveillance to transform cancer care.
AI and large-scale computing to predict tumor treatment response
After years of efforts within the research and pharmaceutical sectors, many patients with cancer still do not respond to standard-of-care treatments, and emergence of therapy resistance is common. Efforts in precision medicine may someday change this by using a targeted therapeutics approach, individually tailored to each patient based on predictive models that use molecular and drug signatures. The Predictive Modeling for Pre-Clinical Screening Pilot (Pilot One) aims to develop predictive capabilities of drug response in pre-clinical models of cancer to improve and expedite the selection and development of new targeted therapies for patients with cancer. Highlights of the work done in Pilot One are shown in Figure 1.
As omics data continues to accumulate, computational models integrating multimodal data sources become possible. Multimodal deep learning[2] aims to enhance learned features for one task by learning features over multiple modalities. Early Pilot One work[3] measured performance of multimodal deep neural network drug pair response models with five-fold cross-validation. Using the NCI-ALMANAC[4] data, the best model performance was obtained when gene expression, microRNA, proteome, and Dragon7 drug descriptors[5] were combined, yielding an R-squared value of 0.944, which indicates that over 94% of the variation in tumor response is explained by the variation among the contributing gene expression, microRNA expression, proteomics, and drug property data.
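To make the R-squared figure concrete, the following is a minimal sketch of how a cross-validated R-squared can be computed. It is not the Pilot One pipeline: the one-variable least-squares model, the toy data, the unshuffled contiguous folds, and the choice to pool held-out predictions before scoring are all assumptions made for illustration.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (a stand-in for the real model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def cross_validated_r2(xs, ys, k=5):
    """Predict every sample from a model that never saw it, then score once."""
    preds = [0.0] * len(xs)
    for train, test in k_fold_indices(len(xs), k):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a * xs[i] + b
    return r_squared(ys, preds)


# Toy data, made up for illustration: a linear trend with alternating noise.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
cv_r2 = cross_validated_r2(xs, ys, k=5)
```

An R-squared of 1.0 means the held-out predictions reproduce the observed responses exactly, while 0.0 means the model does no better than always predicting the mean response.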
Mechanistically informed feature selection is an alternative approach that has the potential to increase predictive model performance. The LINCS landmark genes[6], for example, have been used to train deep learning models to predict gene expression of non-landmark genes[7] and to classify drug-target interactions.[8] Ongoing work in Pilot One is exploring the impact on prediction of using gene sets such as the LINCS landmark genes and other mechanistically defined gene sets. The potential of employing mechanistically informed feature selection extends beyond improving prediction accuracy, to the realm of building models on the basis of existing biological knowledge.
Transfer learning is another area of important research activity. The goal of transfer learning is to improve learning in the target learning task by leveraging knowledge from an existing source task.[9] Given the challenges in obtaining sufficient data for target patient-derived xenografts (PDXs), where tumors are grown in mouse host animals, ongoing transfer learning work holds promise for learning on cell lines as a source for the target PDX model predictions. Pilot One is first working on generating models that generalize across cell line studies, a precursor to transfer learning from cell lines to PDXs.
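The idea of reusing a source task to help a data-poor target task can be sketched with a toy warm-start experiment. Everything here is hypothetical: the one-variable linear model, the synthetic "source" and "target" data, and the step counts stand in for cell-line and PDX models only by analogy.

```python
def gd_fit(xs, ys, w=0.0, b=0.0, lr=0.05, epochs=200):
    """Fit y = w*x + b by full-batch gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Source task: plentiful synthetic data (by analogy, cell-line screens).
src_x = [i / 10 for i in range(50)]
src_y = [3.0 * x - 2.0 for x in src_x]

# Target task: a related but shifted relationship with only three
# training samples (by analogy, scarce PDX measurements).
tgt_x = [0.5, 2.0, 4.0]
tgt_y = [3.2 * x - 1.8 for x in tgt_x]
test_x = [1.0, 3.0]
test_y = [3.2 * x - 1.8 for x in test_x]

# Step 1: learn the source task thoroughly.
w_src, b_src = gd_fit(src_x, src_y, epochs=500)

# Step 2a: fine-tune on the target, starting from the source weights.
w_ft, b_ft = gd_fit(tgt_x, tgt_y, w=w_src, b=b_src, epochs=20)

# Step 2b: baseline with the same small budget, starting from zero.
w_sc, b_sc = gd_fit(tgt_x, tgt_y, epochs=20)

mse_finetuned = mse(test_x, test_y, w_ft, b_ft)
mse_scratch = mse(test_x, test_y, w_sc, b_sc)
```

With the same small training budget on the target data, the warm-started model begins much closer to the target relationship than the model trained from scratch, which is the effect transfer learning aims to exploit.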
Using data from the NCI-ALMANAC[4], NCI-60[10], GDSC[11], CTRP[12], gCSI[13], and CCLE[14], models can be constructed that generalize across cell-line studies. For example, using multi-task networks which combine additional learning of three different classification tasks—tumor/normal, cancer type, and cancer site—with learning of the drug response task, it could be possible to capture more of the total variance and improve precision and recall when training on CTRP and predicting on CCLE. Demonstrating cross-study model capability will provide additional confidence that general models can be developed for prediction tasks on cell lines, PDXs, and organoids.
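A multi-task objective of the kind described above can be sketched as a weighted sum of one regression loss and three classification losses. The field names, the task weights, and the use of mean squared error plus softmax cross-entropy are assumptions made for illustration; the source does not specify the loss functions used.

```python
import math

# The three auxiliary classification tasks named in the text.
AUX_TASKS = ("tumor_normal", "cancer_type", "cancer_site")


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label, logits):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[label])


def multitask_loss(samples, weights):
    """Return (weighted total, per-task losses) averaged over the samples.

    Each sample is a dict with hypothetical keys: 'response_true',
    'response_pred', and '<task>_label' / '<task>_logits' per auxiliary task.
    """
    n = len(samples)
    losses = {
        "response": sum(
            (s["response_true"] - s["response_pred"]) ** 2 for s in samples
        ) / n
    }
    for task in AUX_TASKS:
        losses[task] = sum(
            cross_entropy(s[task + "_label"], s[task + "_logits"])
            for s in samples
        ) / n
    total = sum(weights[name] * value for name, value in losses.items())
    return total, losses


# One made-up sample with uninformative (all-zero) classifier outputs.
sample = {
    "response_true": 0.5, "response_pred": 0.5,
    "tumor_normal_label": 0, "tumor_normal_logits": [0.0, 0.0],
    "cancer_type_label": 1, "cancer_type_logits": [0.0, 0.0, 0.0],
    "cancer_site_label": 2, "cancer_site_logits": [0.0, 0.0, 0.0, 0.0],
}
equal_weights = {"response": 1.0, "tumor_normal": 1.0,
                 "cancer_type": 1.0, "cancer_site": 1.0}
total, per_task = multitask_loss([sample], equal_weights)
```

Because all four losses are minimized through shared network layers in a real multi-task model, the auxiliary labels can shape the learned features even though only the drug response prediction is of direct interest.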
Answering questions of how much data and what methods are suitable is a critical part of Pilot One. Although it is generally held that deep learning methods outperform traditional machine learning methods when large data sets are used, this has not yet been explored in the context of the drug response prediction problem. Early efforts underway in Pilot One are exploring the relationship among sample size, deep learning methods, and traditional machine learning methods to better characterize the dependencies on predictive performance. This information on sample size, together with model accuracy metrics, will be of critical importance to future experimental designs for those who wish to pursue deep learning approaches to the drug response prediction problem.
Such extensive deep learning and machine learning investigations require significant computational resources, such as those available at DOE Leadership Computing Facilities (LCF) employed by Pilot One. A recent experiment searched 23,200 deep neural network models using COXEN[15] selected features and Bayesian optimization ideas[16] to find the best model hyperparameters (hyperparameters generally define the choice of functions and relationship among functions in a given deep learning model). This produced the best cross-study validation results to date, underscoring the critical need for feature selection and hyperparameter optimization when building predictive models. Further, uncertainty quantification (explained in more depth later) adds a new level of computing demand. Uncertainty quantification experiments involving over 30 billion predictions from 450 of the best models generated on the DOE Summit LCF system are ongoing to understand the relationship between best model uncertainty and the model that performs best in cross-study validation experiments.
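Two ideas from this paragraph can be sketched in a few lines: a hyperparameter sweep and an ensemble-based view of prediction uncertainty. Note the substitutions: the sweep below uses plain random search, a simpler stand-in for the Bayesian optimization ideas cited above, and the search space, the toy objective, and the use of ensemble spread as an uncertainty measure are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, pstdev

# Hypothetical search space; the real sweep's hyperparameters are not given.
SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "layers": [2, 4, 8],
    "dropout": [0.0, 0.1, 0.5],
}


def toy_objective(params):
    """Made-up validation loss; a real sweep would train and score a model."""
    return (
        (math.log10(params["learning_rate"]) + 3) ** 2
        + (params["layers"] - 4) ** 2 / 4
        + params["dropout"]
    )


def random_search(objective, space, n_trials, seed=0):
    """Sample configurations uniformly; return (loss, params) sorted by loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        trials.append((objective(params), params))
    return sorted(trials, key=lambda trial: trial[0])


def ensemble_uncertainty(predictions_by_model):
    """Per-sample (mean, population std) across models.

    predictions_by_model[m][i] is model m's prediction for sample i.
    """
    per_sample = zip(*predictions_by_model)
    return [(mean(p), pstdev(p)) for p in per_sample]


trials = random_search(toy_objective, SPACE, n_trials=50, seed=0)
best_loss, best_params = trials[0]

# Two hypothetical models that disagree on sample 0 and agree on sample 1.
spread = ensemble_uncertainty([[1.0, 2.0], [3.0, 2.0]])
```

Each trial in a real sweep is an independent model training run, which is why searches over thousands of configurations map naturally onto leadership-class machines.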
Reflecting on insights from Pilot One activities and current gaps in available literature, future work will focus on exploring new predictive models to better utilize, ground, and enrich biological knowledge. Efforts to improve drug representations for response prediction are expected to benefit from research involving training semi-supervised networks on millions of compounds. In efforts to improve understanding of trained models, mechanistic information is being incorporated into more interpretable deep learning models. Active learning in response prediction—which balances uncertainty, accuracy, and lead discovery—will be used to guide the acquisition of experimental data for animal models in a cost-effective and timely manner. And finally, a necessary step toward precision models is gaining a fine-grained understanding of prediction error, an insight enabled by the demonstrated capability in large-scale model sweeps.
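One common way to balance predicted benefit against uncertainty when choosing the next experiments is an upper-confidence-bound style score. The sketch below is a generic heuristic, not the method Pilot One uses; the candidate names, the scoring rule, and the `kappa` trade-off parameter are all hypothetical.

```python
def select_experiments(candidates, budget, kappa=1.0):
    """Pick the `budget` candidates with the highest acquisition score.

    candidates: list of (name, predicted_response, predicted_std) tuples.
    Score = predicted_response + kappa * predicted_std, so a larger kappa
    favors exploring uncertain candidates over exploiting confident ones.
    """
    ranked = sorted(
        candidates,
        key=lambda c: c[1] + kappa * c[2],
        reverse=True,
    )
    return [name for name, _, _ in ranked[:budget]]


# Made-up model outputs for four hypothetical drug/model combinations.
candidates = [
    ("A", 0.9, 0.05),  # strong predicted response, confident
    ("B", 0.5, 0.60),  # moderate response, highly uncertain
    ("C", 0.7, 0.10),
    ("D", 0.2, 0.10),
]

explore = select_experiments(candidates, budget=2, kappa=1.0)
exploit = select_experiments(candidates, budget=2, kappa=0.0)
```

With `kappa=1.0` the uncertain candidate B is ranked first, whereas with `kappa=0.0` the selection reduces to picking the highest predicted responses, which illustrates the trade-off between learning about the model and pursuing likely leads.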
Oncogenic mutations in Ras genes are associated with more than 30% of cancers and are particularly prevalent in those of the lung, colon, and pancreas. Though Ras mutations have been studied for decades, there are currently no Ras inhibitors, and a detailed molecular mechanism for how Ras engages and activates proximal signaling proteins (RAF) remains elusive.[17] Ras signaling takes place at and is dependent on cellular membranes, a complex cellular environment that is difficult to recapitulate using current experimental technologies.
Pilot Two, Improving Outcomes for RAS-related Cancer, is focused on delivering a validated multiscale model of Ras biology on a cell membrane by combining the experimental capabilities at the Frederick National Laboratory for Cancer Research with the computational resources of the National Nuclear Security Administration (NNSA), a semi-autonomous agency of the DOE. The principal challenge in modeling this system is the diverse length and time scales involved. Lipid membranes evolve over a macroscopic scale (micrometers and milliseconds). Capturing this evolution is critical, as changes in lipid concentration define the local environment in which Ras operates. The Ras protein itself, however, binds over time and length scales which are microscopic (nanometers and microseconds). In order to elucidate the behavior of Ras proteins in the context of a realistic membrane, our modeling effort must span the multiple orders of magnitude between microscopic and macroscopic behavior. The Pilot Two team has built such a framework, developing a macroscopic model that captures the evolution of the lipid environment and which is consistent with an optimized microscopic model that captures protein-protein and protein-lipid interactions at the molecular scale. Macroscopic model components (lipid environment, lipid-lipid interactions, protein behavior, and protein-lipid interactions) were characterized through close collaboration between the experimenters at Frederick National Laboratory and the computational scientists from the DOE/NNSA. The microscopic model is based on standard Martini force fields for coarse-grained molecular dynamics (CGMD), modified to correctly capture certain details of lipid phase behavior.[18][19][20][21] A snapshot from a typical micro-scale simulation run, showing two Ras proteins on a 30 nm × 30 nm patch of lipid membrane (containing ~150,000 particles) is shown in Figure 2.
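A rough back-of-the-envelope calculation shows why a single microscopic simulation cannot simply be enlarged to the macroscopic scale. The patch size and particle count come from the text; the 1 micrometer membrane side and the assumption that particle count scales linearly with membrane area are illustrative choices, not figures from the study.

```python
# From the text: a 30 nm x 30 nm patch contains roughly 150,000 particles.
PATCH_SIDE_NM = 30.0
PARTICLES_PER_PATCH = 150_000

# Assumption for illustration: a square membrane region 1 micrometer wide.
MEMBRANE_SIDE_NM = 1_000.0

# Unit conversion between the two time scales named in the text.
MICROSECONDS_PER_MILLISECOND = 1_000


def particles_for_membrane(side_nm):
    """Scale the patch particle count by area (assumes uniform density)."""
    n_patches = (side_nm / PATCH_SIDE_NM) ** 2
    return n_patches, n_patches * PARTICLES_PER_PATCH


n_patches, n_particles = particles_for_membrane(MEMBRANE_SIDE_NM)

# Extra cost factor if the larger system also had to run a thousand times
# longer (milliseconds instead of microseconds) at the same time step.
combined_factor = n_patches * MICROSECONDS_PER_MILLISECOND
```

Under these assumptions, the larger region corresponds to more than a thousand patches and on the order of a hundred million particles, before accounting for the thousandfold longer time scale, which motivates coupling a cheaper macroscopic model to targeted microscopic simulations.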
References
- ↑ "What is the Precision Medicine Initiative?". Genetics Home Reference. National Institutes of Health. 2019. https://ghr.nlm.nih.gov/primer/precisionmedicine/initiative. Retrieved 20 September 2019.
- ↑ Sun, D.; Wang, M.; Li, A. (2019). "A Multimodal Deep Neural Network for Human Breast Cancer Prognosis Prediction by Integrating Multi-Dimensional Data". IEEE/ACM Transactions on Computational Biology and Bioinformatics 16 (3): 841–50. doi:10.1109/TCBB.2018.2806438.
- ↑ Xia, F.; Shukla, M.; Brettin, T. et al. (2018). "Predicting tumor cell line response to drug pairs with deep learning". BMC Bioinformatics 19 (Suppl. 18): 486. doi:10.1186/s12859-018-2509-3. PMC PMC6302446. PMID 30577754. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302446.
- ↑ 4.0 4.1 Holbeck, S.L.; Camalier, R.; Crowell, J.A. et al. (2017). "The National Cancer Institute ALMANAC: A Comprehensive Screening Resource for the Detection of Anticancer Drug Pairs with Enhanced Therapeutic Activity". Cancer Research 77 (13): 3564-3576. doi:10.1158/0008-5472.CAN-17-0489. PMC PMC5499996. PMID 28446463. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5499996.
- ↑ "Dragon". Kode Chemoinformatics srl. 2019. https://chm.kode-solutions.net/products_dragon.php. Retrieved 30 April 2019.
- ↑ Subramanian, A.; Narayan, R.; Corsello, S.M. et al. (2017). "A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles". Cell 171 (6): P1437-1452.E17. doi:10.1016/j.cell.2017.10.049. PMC PMC5990023. PMID 29195078. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5990023.
- ↑ Chen, Y.; Li, Y.; Narayan, R. et al. (2016). "Gene expression inference with deep learning". Bioinformatics 32 (12): 1832-9. doi:10.1093/bioinformatics/btw074. PMC PMC4908320. PMID 26873929. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908320.
- ↑ Xie, L.; He, S.; Song, X. et al. (2018). "Deep learning-based transcriptome data classification for drug-target interaction prediction". BMC Genomics 19 (Suppl. 7): 667. doi:10.1186/s12864-018-5031-0. PMC PMC6156897. PMID 30255785. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6156897.
- ↑ Torrey, L.; Shavlik, J. (2010). "Chapter 11: Transfer Learning". In Olivas, E.S.; Guerrero, J.D.M.; Martinez-Sober, M. et al. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global. pp. 242–64. doi:10.4018/978-1-60566-766-9. ISBN 9781605667669.
- ↑ "NCI-60 Human Tumor Cell Lines Screen". Developmental Therapeutics Program. National Institutes of Health. 26 August 2015. https://dtp.cancer.gov/discovery_development/nci-60/default.htm.
- ↑ Yang, W.; Soares, J.; Greninger, P. et al. (2013). "Genomics of Drug Sensitivity in Cancer (GDSC): A resource for therapeutic biomarker discovery in cancer cells". Nucleic Acids Research 41 (DB1): D955-61. doi:10.1093/nar/gks1111. PMC PMC3531057. PMID 23180760. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531057.
- ↑ Basu, A.; Bodycombe, N.E.; Cheah, J.H. et al. (2013). "An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules". Cell 154 (5): 1151–61. doi:10.1016/j.cell.2013.08.003. PMC PMC3954635. PMID 23993102. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3954635.
- ↑ Klijn, C.; Durinck, S.; Stawiski, E.W. et al. (2015). "A comprehensive transcriptional portrait of human cancer cell lines". Nature Biotechnology 33 (3): 306–12. doi:10.1038/nbt.3080. PMID 25485619.
- ↑ Barretina, J.; Caponigro, G.; Stransky, N. et al. (2012). "The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity". Nature 483 (7391): 603-7. doi:10.1038/nature11003. PMC PMC3320027. PMID 22460905. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027.
- ↑ Smith, S.C.; Baras, A.S.; Lee, J.K. et al. (2010). "The COXEN principle: translating signatures of in vitro chemosensitivity into tools for clinical outcome prediction and drug discovery in cancer". Cancer Research 70 (5): 1753-8. doi:10.1158/0008-5472.CAN-09-3562. PMC PMC2831138. PMID 20160033. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2831138.
- ↑ Wozniak, J.M.; Jain, R.; Balaprakash, P. et al. (2018). "CANDLE/Supervisor: A workflow framework for machine learning applied to cancer research". BMC Bioinformatics 19 (Suppl. 18): 491. doi:10.1186/s12859-018-2508-4. PMC PMC6302440. PMID 30577736. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302440.
- ↑ Simanshu, D.K.; Nissley, D.V.; McCormick, F. (2017). "RAS Proteins and Their Regulators in Human Disease". Cell 170 (1): 17–33. doi:10.1016/j.cell.2017.06.009. PMC PMC5555610. PMID 28666118. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5555610.
- ↑ Carpenter, T.S.; López, C.A.; Neale, C. et al. (2018). "Capturing Phase Behavior of Ternary Lipid Mixtures with a Refined Martini Coarse-Grained Force Field". Journal of Chemical Theory and Computation 14 (11): 6050-6062. doi:10.1021/acs.jctc.8b00496. PMID 30253091.
- ↑ Neale, C.; García, A.E. (2018). "Methionine 170 is an Environmentally Sensitive Membrane Anchor in the Disordered HVR of K-Ras4B". Journal of Physical Chemistry B 122 (44): 10086-10096. doi:10.1021/acs.jpcb.8b07919. PMID 30351122.
- ↑ Ingólfsson, H.I.; Carpenter, T.S.; Bhatia, H. et al. (2017). "Computational Lipidomics of the Neuronal Plasma Membrane". Biophysical Journal 113 (10): 2271-2280. doi:10.1016/j.bpj.2017.10.017. PMC PMC5700369. PMID 29113676. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5700369.
- ↑ Travers, T.; López, C.A.; Van, Q.N. et al. (2018). "Molecular recognition of RAS/RAF complex at the membrane: Role of RAF cysteine-rich domain". Scientific Reports 8 (1): 8461. doi:10.1038/s41598-018-26832-4. PMC PMC5981303. PMID 29855542. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5981303.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.