Journal:Data management and modeling in plant biology

Full article title	Data management and modeling in plant biology
Journal	Frontiers in Plant Science
Author(s)	Krantz, Maria; Zimmer, David; Adler, Stephan O.; Kitashova, Anastasia; Klipp, Edda; Mühlhaus, Timo; Nägele, Thomas
Author affiliation(s)	Humboldt-Universität zu Berlin, Technische Universität Kaiserslautern, Ludwig-Maximilians-Universität München
Primary contact	Email: thomas dot naegele at lmu dot de
Editors	Fukushima, Atsushi
Year published	2021
Volume and issue	12
Article #	717958
DOI	10.3389/fpls.2021.717958
ISSN	1664-462X
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.frontiersin.org/articles/10.3389/fpls.2021.717958/full
Download	https://www.frontiersin.org/articles/10.3389/fpls.2021.717958/pdf (PDF)

Abstract

The study of plant-environment interactions is a multidisciplinary research field. With the emergence of quantitative large-scale and high-throughput techniques, the amount and dimensionality of experimental data have strongly increased. Appropriate strategies for data storage, management, and evaluation are needed to make efficient use of experimental findings. Computational approaches to data mining are essential for deriving statistical trends and signatures contained in data matrices. Although current biology is challenged by high data dimensionality in general, this is particularly true for plant biology. As sessile organisms, plants have to cope with environmental fluctuations. This typically results in strong dynamics of metabolite and protein concentrations, which are often challenging to quantify. Summarizing experimental output results in complex data arrays, which need computational statistics and numerical methods for building quantitative models. Experimental findings need to be combined with computational models to gain a mechanistic understanding of plant metabolism. For this, bioinformatics and mathematics need to be combined with experimental setups in physiology, biochemistry, and molecular biology. This review presents and discusses concepts at the interface of experiment and computation, which are likely to shape current and future plant biology. Finally, this interface is discussed with regard to its capabilities and limitations to develop a quantitative model of plant-environment interactions.

Keywords: genome-scale networks, omics analysis, metabolic regulation, plant-environment interactions, machine learning, mathematical modeling, differential equations

Introduction

Experimental high-throughput analysis of genomes, transcriptomes, proteomes, and metabolomes results in a vast number of simultaneously quantified molecular entities. Current biological research frequently applies a combination of experimental high-throughput techniques to address a wide spectrum of complex research questions. On the genome level, high-throughput sequencing (HTS) technologies have revolutionized genetics and genomics, and sequencing projects have provided comprehensive information about many species’ genomes.^[1]^[2]^[3]^[4]^[5] To date, thousands of genomes have been sequenced and pan-genomics approaches have been initiated, which assemble diverse sets of individual genomes into a collection of all DNA sequences occurring in a species.^[6] In plant sciences, the concept of pan-genomics is already being discussed as a way to support breeding strategies or evolutionary studies, and it may contribute significantly to explaining gene presence and absence variation.^[7]

Based on such comprehensive genome information, genome-scale models of plant metabolism have been developed and applied to predict plant metabolism in diverse contexts. Validation and biotechnological application of such large-scale models need appropriate experimental techniques and platforms, unifying sample analysis in multi-omics approaches.^[8] Although omics techniques have become a generic element of numerous research projects to quantify transcripts, proteins, and metabolites, the actual handling, normalization, and integration of multidimensional experimental data output is still a central challenge in biology.^[9] The need for integrative analysis of experimental high-throughput data has already been suggested and discussed earlier. For example, almost a decade ago, integrative approaches were suggested for transcriptomics, proteomics, and metabolomics data to promote a systems-level understanding of the genus Arabidopsis.^[10] Since then, machine learning, computational statistics, and mathematical modeling have significantly advanced data integration strategies. Due to their capability to improve the understanding of the genotype-phenotype relation on a molecular level, systems biology and multi-omics integration have become central topics in the discussion about future perspectives of biology and medicine. Yet, in order to make experiments comparable and to increase consistency and reproducibility across different experimental platforms, laboratories, or research communities, quantitative omics data are needed.^[11] Furthermore, quantitative experimental data necessitate appropriate processing strategies to make them comparable to other independent studies and statistics. Making data and data processing publicly available via databases and repositories may represent one of the most important steps to establish and expand a cross-disciplinary scientific platform for omics data integration. Together with the need for traceable long-term data storage and versioning, these topics are becoming increasingly important in quantitative biology.

Searching for database entries from the last two decades on omics and integrative omics approaches reveals rapidly increasing research and publication activity in the integrative multi-omics research field (Figure 1). The number of genomics-related articles published per year increased linearly to a very high level during the last 20 years, while transcriptomics and metabolomics articles in particular have been published at an increasing rate during the last decade (Figure 1A). Between 2000 and 2015, more proteomics-related articles were published than transcriptomics and metabolomics articles, but since 2017 their number has been between those of these two disciplines. Interestingly, since 2017, the number of articles found by the queries “multi-omics” or “multiomics” has been increasing exponentially (Figure 1B). A similar, yet weaker, trend is also observable for “omics data integration” articles (Figure 1B). Of course, these numbers are only crude estimates based on our chosen vocabulary and on a search within one specific database (for example, we have not checked combinations of different omics disciplines, e.g., “genomics” and “transcriptomics” instead of “multi-omics”). Yet, these results still indicate that an increasing number of studies focus on a multi-omics design and that omics data integration is gaining more and more attention.

Figure 1. Number of articles found by article search in the PubMed library covering two decades, i.e., 2000–2020. (A) Timeline of the number of articles on different omics disciplines (blue: genomics; orange: transcriptomics; gray: proteomics; and yellow: metabolomics). Each discipline was searched with a single keyword. (B) Timeline of the number of articles found by a search on omics data integration (green line; single words were connected by AND-expression) and on multi-omics (or multiomics, blue line).
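A yearly keyword count of this kind could be reproduced roughly as sketched below. This is a minimal sketch and not the authors' procedure: the article does not state which tool or exact query strings were used. The sketch assumes the public NCBI E-utilities `esearch` endpoint and the PubMed `[pdat]` (publication date) field tag. All function names are hypothetical helpers introduced for illustration.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities search endpoint (assumed here; the article names no tool).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_query(words):
    """Connect single words with AND, as the caption describes for panel B."""
    return " AND ".join(words)


def build_term(query, year):
    """Restrict a query to one publication year via the [pdat] field tag."""
    return f"({query}) AND {year}[pdat]"


def build_url(term):
    """Build an esearch URL that requests only the hit count, as JSON."""
    params = {"db": "pubmed", "term": term, "rettype": "count", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(payload):
    """Extract the hit count from an esearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


def yearly_counts(query, first_year=2000, last_year=2020, pause=0.5):
    """Return {year: count} for a query; sends one web request per year."""
    counts = {}
    for year in range(first_year, last_year + 1):
        with urlopen(build_url(build_term(query, year))) as response:
            counts[year] = parse_count(response.read().decode("utf-8"))
        time.sleep(pause)  # pause between requests to respect NCBI rate limits
    return counts
```

Calling `yearly_counts("genomics")` would give one series of the kind shown in panel A, and `yearly_counts(and_query(["omics", "data", "integration"]))` one of the kind shown in panel B. These query strings are guesses, and PubMed is continuously updated, so the counts returned today are not expected to match the figure exactly.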

This article aims to summarize and discuss current advances and limitations of integrative molecular analysis, computational modeling, and data science. It covers both experimental and theoretical methodology to support the design and analysis of interdisciplinary research in plant biology. Particular emphasis is placed on methodologies for capturing system dynamics of plant metabolism induced by a changing environment.

References

1. International Human Genome Sequencing Consortium; Whitehead Institute for Biomedical Research, Center for Genome Research; Lander, Eric S.; Linton, Lauren M.; Birren, Bruce; Nusbaum, Chad; Zody, Michael C.; Baldwin, Jennifer et al. (15 February 2001). "Initial sequencing and analysis of the human genome" (in en). Nature 409 (6822): 860–921. doi:10.1038/35057062. ISSN 0028-0836. http://www.nature.com/articles/35057062.
2. The 1000 Genomes Project Consortium (1 November 2012). "An integrated map of genetic variation from 1,092 human genomes" (in en). Nature 491 (7422): 56–65. doi:10.1038/nature11632. ISSN 0028-0836. PMC3498066. PMID 23128226. http://www.nature.com/articles/nature11632.
3. Alonso-Blanco, Carlos; Andrade, Jorge; Becker, Claude; Bemm, Felix; Bergelson, Joy; Borgwardt, Karsten M.; Cao, Jun; Chae, Eunyoung et al. (1 July 2016). "1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana" (in en). Cell 166 (2): 481–491. doi:10.1016/j.cell.2016.05.063. PMC4949382. PMID 27293186. https://linkinghub.elsevier.com/retrieve/pii/S0092867416306675.
4. Stein, Joshua C.; Yu, Yeisoo; Copetti, Dario; Zwickl, Derrick J.; Zhang, Li; Zhang, Chengjun; Chougule, Kapeel; Gao, Dongying et al. (1 February 2018). "Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza" (in en). Nature Genetics 50 (2): 285–296. doi:10.1038/s41588-018-0040-0. ISSN 1061-4036. http://www.nature.com/articles/s41588-018-0040-0.
5. Sun, Hequan; Rowan, Beth A.; Flood, Pádraic J.; Brandt, Ronny; Fuss, Janina; Hancock, Angela M.; Michelmore, Richard W.; Huettel, Bruno et al. (1 December 2019). "Linked-read sequencing of gametes allows efficient genome-wide analysis of meiotic recombination" (in en). Nature Communications 10 (1): 4310. doi:10.1038/s41467-019-12209-2. ISSN 2041-1723. PMC6754367. PMID 31541084. http://www.nature.com/articles/s41467-019-12209-2.
6. Sherman, Rachel M.; Salzberg, Steven L. (1 April 2020). "Pan-genomics in the human genome era" (in en). Nature Reviews Genetics 21 (4): 243–254. doi:10.1038/s41576-020-0210-7. ISSN 1471-0056. PMC7752153. PMID 32034321. http://www.nature.com/articles/s41576-020-0210-7.
7. Bayer, Philipp E.; Golicz, Agnieszka A.; Scheben, Armin; Batley, Jacqueline; Edwards, David (1 August 2020). "Plant pan-genomes are the new reference" (in en). Nature Plants 6 (8): 914–920. doi:10.1038/s41477-020-0733-0. ISSN 2055-0278. http://www.nature.com/articles/s41477-020-0733-0.
8. Weckwerth, Wolfram; Ghatak, Arindam; Bellaire, Anke; Chaturvedi, Palak; Varshney, Rajeev K. (1 July 2020). "PANOMICS meets germplasm" (in en). Plant Biotechnology Journal 18 (7): 1507–1525. doi:10.1111/pbi.13372. ISSN 1467-7644. PMC7292548. PMID 32163658. https://onlinelibrary.wiley.com/doi/10.1111/pbi.13372.
9. Scossa, Federico; Alseekh, Saleh; Fernie, Alisdair R. (1 February 2021). "Integrating multi-omics data for crop improvement" (in en). Journal of Plant Physiology 257: 153352. doi:10.1016/j.jplph.2020.153352. https://linkinghub.elsevier.com/retrieve/pii/S017616172030242X.
10. Liberman, Louisa M.; Sozzani, Rosangela; Benfey, Philip N. (1 April 2012). "Integrative systems biology: an attempt to describe a simple weed" (in en). Current Opinion in Plant Biology 15 (2): 162–167. doi:10.1016/j.pbi.2012.01.004. PMC3435099. PMID 22277598. https://linkinghub.elsevier.com/retrieve/pii/S1369526612000052.
11. Pinu, Farhana R.; Beale, David J.; Paten, Amy M.; Kouremenos, Konstantinos; Swarup, Sanjay; Schirra, Horst J.; Wishart, David (18 April 2019). "Systems Biology and Multi-Omics Integration: Viewpoints from the Metabolomics Research Community" (in en). Metabolites 9 (4): 76. doi:10.3390/metabo9040076. ISSN 2218-1989. PMC6523452. PMID 31003499. https://www.mdpi.com/2218-1989/9/4/76.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; however, this version lists them in order of appearance, by design.