Difference between revisions of "Journal:Data management and modeling in plant biology"

From LIMSWiki
(Saving and adding more.)
Experimental high-throughput analysis of [[Genomics|genomes]], transcriptomes, [[Proteomics|proteomes]], and metabolomes results in a vast number of simultaneously quantified molecular entities. Current biological research frequently applies a combination of experimental high-throughput techniques to address a wide spectrum of complex research questions. On the genome level, high-throughput sequencing (HTS) technologies have revolutionized genetics and genomics, and [[sequencing]] projects have provided comprehensive [[information]] about many species’ genomes.<ref>{{Cite journal |last=International Human Genome Sequencing Consortium |last2=Whitehead Institute for Biomedical Research, Center for Genome Research: |last3=Lander |first3=Eric S. |last4=Linton |first4=Lauren M. |last5=Birren |first5=Bruce |last6=Nusbaum |first6=Chad |last7=Zody |first7=Michael C. |last8=Baldwin |first8=Jennifer |last9=Devon |first9=Keri |last10=Dewar |first10=Ken |last11=Doyle |first11=Michael |date=2001-02-15 |title=Initial sequencing and analysis of the human genome |url=http://www.nature.com/articles/35057062 |journal=Nature |language=en |volume=409 |issue=6822 |pages=860–921 |doi=10.1038/35057062 |issn=0028-0836}}</ref><ref>{{Cite journal |last=The 1000 Genomes Project Consortium |date=2012-11 |title=An integrated map of genetic variation from 1,092 human genomes |url=http://www.nature.com/articles/nature11632 |journal=Nature |language=en |volume=491 |issue=7422 |pages=56–65 |doi=10.1038/nature11632 |issn=0028-0836 |pmc=PMC3498066 |pmid=23128226}}</ref><ref>{{Cite journal |last=Alonso-Blanco |first=Carlos |last2=Andrade |first2=Jorge |last3=Becker |first3=Claude |last4=Bemm |first4=Felix |last5=Bergelson |first5=Joy |last6=Borgwardt |first6=Karsten M. |last7=Cao |first7=Jun |last8=Chae |first8=Eunyoung |last9=Dezwaan |first9=Todd M. |last10=Ding |first10=Wei |last11=Ecker |first11=Joseph R. 
|date=2016-07 |title=1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana |url=https://linkinghub.elsevier.com/retrieve/pii/S0092867416306675 |journal=Cell |language=en |volume=166 |issue=2 |pages=481–491 |doi=10.1016/j.cell.2016.05.063 |pmc=PMC4949382 |pmid=27293186}}</ref><ref>{{Cite journal |last=Stein |first=Joshua C. |last2=Yu |first2=Yeisoo |last3=Copetti |first3=Dario |last4=Zwickl |first4=Derrick J. |last5=Zhang |first5=Li |last6=Zhang |first6=Chengjun |last7=Chougule |first7=Kapeel |last8=Gao |first8=Dongying |last9=Iwata |first9=Aiko |last10=Goicoechea |first10=Jose Luis |last11=Wei |first11=Sharon |date=2018-02 |title=Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza |url=http://www.nature.com/articles/s41588-018-0040-0 |journal=Nature Genetics |language=en |volume=50 |issue=2 |pages=285–296 |doi=10.1038/s41588-018-0040-0 |issn=1061-4036}}</ref><ref>{{Cite journal |last=Sun |first=Hequan |last2=Rowan |first2=Beth A. |last3=Flood |first3=Pádraic J. |last4=Brandt |first4=Ronny |last5=Fuss |first5=Janina |last6=Hancock |first6=Angela M. |last7=Michelmore |first7=Richard W. |last8=Huettel |first8=Bruno |last9=Schneeberger |first9=Korbinian |date=2019-12 |title=Linked-read sequencing of gametes allows efficient genome-wide analysis of meiotic recombination |url=http://www.nature.com/articles/s41467-019-12209-2 |journal=Nature Communications |language=en |volume=10 |issue=1 |pages=4310 |doi=10.1038/s41467-019-12209-2 |issn=2041-1723 |pmc=PMC6754367 |pmid=31541084}}</ref> To date, thousands of genomes have been sequenced and pan-genomics approaches have been initiated, which assemble diverse sets of individual genomes to a collection of all [[DNA sequencing|DNA sequences]] occurring in a species.<ref>{{Cite journal |last=Sherman |first=Rachel M. |last2=Salzberg |first2=Steven L. 
|date=2020-04 |title=Pan-genomics in the human genome era |url=http://www.nature.com/articles/s41576-020-0210-7 |journal=Nature Reviews Genetics |language=en |volume=21 |issue=4 |pages=243–254 |doi=10.1038/s41576-020-0210-7 |issn=1471-0056 |pmc=PMC7752153 |pmid=32034321}}</ref> In plant sciences, the concept of pan-genomics is already discussed to support breeding strategies or evolutionary studies and may significantly contribute to the explanation of gene presence and absence variation.<ref>{{Cite journal |last=Bayer |first=Philipp E. |last2=Golicz |first2=Agnieszka A. |last3=Scheben |first3=Armin |last4=Batley |first4=Jacqueline |last5=Edwards |first5=David |date=2020-08 |title=Plant pan-genomes are the new reference |url=http://www.nature.com/articles/s41477-020-0733-0 |journal=Nature Plants |language=en |volume=6 |issue=8 |pages=914–920 |doi=10.1038/s41477-020-0733-0 |issn=2055-0278}}</ref>


Based on such comprehensive genome information, genome-scale models of plant metabolism have been developed and applied to predict plant metabolism in diverse contexts. Validation and biotechnological application of such large-scale models require appropriate experimental techniques and platforms that unify [[Sample (material)|sample]] analysis in multi-omics approaches.<ref>{{Cite journal |last=Weckwerth |first=Wolfram |last2=Ghatak |first2=Arindam |last3=Bellaire |first3=Anke |last4=Chaturvedi |first4=Palak |last5=Varshney |first5=Rajeev K. |date=2020-07 |title=PANOMICS meets germplasm |url=https://onlinelibrary.wiley.com/doi/10.1111/pbi.13372 |journal=Plant Biotechnology Journal |language=en |volume=18 |issue=7 |pages=1507–1525 |doi=10.1111/pbi.13372 |issn=1467-7644 |pmc=PMC7292548 |pmid=32163658}}</ref> Although [[omics]] techniques have become a generic element of numerous research projects to quantify transcripts, proteins, and metabolites, the actual [[Information management|handling]], normalization, and integration of multidimensional experimental data output remain a central challenge in biology.<ref>{{Cite journal |last=Scossa |first=Federico |last2=Alseekh |first2=Saleh |last3=Fernie |first3=Alisdair R. |date=2021-02 |title=Integrating multi-omics data for crop improvement |url=https://linkinghub.elsevier.com/retrieve/pii/S017616172030242X |journal=Journal of Plant Physiology |language=en |volume=257 |pages=153352 |doi=10.1016/j.jplph.2020.153352}}</ref> The need for integrative analysis of experimental high-throughput data was recognized and discussed some time ago. 
For example, almost a decade ago, integrative approaches were suggested for transcriptomics, proteomics, and metabolomics data to promote a systems-level understanding of the genus ''Arabidopsis''.<ref>{{Cite journal |last=Liberman |first=Louisa M |last2=Sozzani |first2=Rosangela |last3=Benfey |first3=Philip N |date=2012-04 |title=Integrative systems biology: an attempt to describe a simple weed |url=https://linkinghub.elsevier.com/retrieve/pii/S1369526612000052 |journal=Current Opinion in Plant Biology |language=en |volume=15 |issue=2 |pages=162–167 |doi=10.1016/j.pbi.2012.01.004 |pmc=PMC3435099 |pmid=22277598}}</ref> Since then, [[machine learning]], computational statistics, and mathematical modeling have significantly advanced [[data integration]] strategies. Due to their capability to improve the understanding of the genotype-phenotype relation on a molecular level, systems biology and multi-omics integration have become central topics in the discussion about future perspectives of biology and medicine. Yet, in order to make experiments comparable and to increase consistency and reproducibility across different experimental platforms, [[Laboratory|laboratories]], or research communities, quantitative omics data are needed.<ref name=":1">{{Cite journal |last=Pinu |first=Farhana R. |last2=Beale |first2=David J. |last3=Paten |first3=Amy M. |last4=Kouremenos |first4=Konstantinos |last5=Swarup |first5=Sanjay |last6=Schirra |first6=Horst J. |last7=Wishart |first7=David |date=2019-04-18 |title=Systems Biology and Multi-Omics Integration: Viewpoints from the Metabolomics Research Community |url=https://www.mdpi.com/2218-1989/9/4/76 |journal=Metabolites |language=en |volume=9 |issue=4 |pages=76 |doi=10.3390/metabo9040076 |issn=2218-1989 |pmc=PMC6523452 |pmid=31003499}}</ref> Furthermore, quantitative experimental data require appropriate processing strategies to make them comparable with other independent studies and statistics. 
Making data and data processing publicly available via databases and repositories may represent one of the most important steps to establish and expand a cross-disciplinary scientific platform for omics data integration. Together with the need for traceable long-term data storage and versioning, these topics are becoming increasingly important in quantitative biology.


Searching for database entries from the last two decades on omics and integrative omics approaches reveals rapidly increasing research and publication activity in the integrative multi-omics research field (Figure 1). The number of genomics-related articles published per year increased linearly to a very high level during the last 20 years, while transcriptomics and metabolomics articles in particular have been published at an increasing rate during the last decade (Figure 1A). Between 2000 and 2015, more proteomics-related articles were published than transcriptomics and metabolomics articles, but since 2017 their number has been between those of the other two omics disciplines. Interestingly, since 2017, the number of articles found by the queries “multi-omics” or “multiomics” has been increasing exponentially (Figure 1B). A similar, yet weaker trend is also observable for “omics data integration” articles (Figure 1B). Of course, these numbers are only crude estimates based on our specific choice of vocabulary and a search within one specific database (for example, we have not checked combinations of different omics disciplines, e.g., “genomics” and “transcriptomics” instead of “multi-omics”). Yet, these results still indicate that an increasing number of studies focus on a multi-omics design and that omics data integration is gaining increasing attention.


==Research data management provides the groundwork for successful data integration==
Data integration methods, especially machine learning approaches, profit heavily from the increasing availability of data. Aside from high-dimensionality and sparsity of biological data, a fundamental challenge in data integration lies in accessibility and quality of information and knowledge. Modern approaches require not only massive, but particularly well-annotated data sets.<ref>{{Cite journal |last=Webb |first=Sarah |date=2018-02 |title=Deep learning for biology |url=http://www.nature.com/articles/d41586-018-02174-z |journal=Nature |language=en |volume=554 |issue=7693 |pages=555–557 |doi=10.1038/d41586-018-02174-z |issn=0028-0836}}</ref>
Currently, the default medium of scientific communication in the domain of biology is the publication of research in peer-reviewed scientific journals centered around free text-based communication. While this format has many benefits, such as quality control by curators who are experts in the respective field, it also has the drawback of being gated by paywalls. This issue is already being addressed by the increasing establishment of open-access journals, but the format suffers from more intrinsic problems. It was designed as a human-readable medium and is thus prone to design flaws that a human reader can resolve implicitly but that pose problems for the application of machine learning techniques. Examples include the heterogeneity of supplemental material, the embedding of data as schematic descriptions, and, most severely, the communication of findings as free text. While these challenges have already been identified and are currently tackled by manual curation and the application of natural language processing (NLP) and pattern recognition, their frequent occurrence still hinders the direct computational use of published knowledge for data integration.<ref>{{Cite journal |last=Karp |first=Peter D. |date=2016 |title=Can we replace curation with information extraction software? |url=https://academic.oup.com/database/article-lookup/doi/10.1093/database/baw150 |journal=Database |language=en |volume=2016 |pages=baw150 |doi=10.1093/database/baw150 |issn=1758-0463 |pmc=PMC5199131 |pmid=28025341}}</ref>
An alternative approach to scientific communication is realized by the creation of knowledge databases. In plant research, there are various information resources and data portals of extremely high quality. UniProt<ref>{{Cite journal |last=The UniProt Consortium |date=2019-01-08 |title=UniProt: a worldwide hub of protein knowledge |url=https://academic.oup.com/nar/article/47/D1/D506/5160987 |journal=Nucleic Acids Research |language=en |volume=47 |issue=D1 |pages=D506–D515 |doi=10.1093/nar/gky1049 |issn=0305-1048 |pmc=PMC6323992 |pmid=30395287}}</ref> and Ensembl Plants<ref>{{Citation |last=Bolser |first=Dan |last2=Staines |first2=Daniel M. |last3=Pritchard |first3=Emily |last4=Kersey |first4=Paul |date=2016 |editor-last=Edwards |editor-first=David |title=Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data |url=http://link.springer.com/10.1007/978-1-4939-3167-5_6 |work=Plant Bioinformatics |publisher=Springer New York |place=New York, NY |volume=1374 |pages=115–140 |doi=10.1007/978-1-4939-3167-5_6 |isbn=978-1-4939-3166-8 |accessdate=2021-12-17}}</ref> are integrative resources presenting genome-scale information for a growing number of sequenced plant species. Additionally, PLAZA<ref>{{Cite journal |last=Van Bel |first=Michiel |last2=Diels |first2=Tim |last3=Vancaester |first3=Emmelien |last4=Kreft |first4=Lukasz |last5=Botzki |first5=Alexander |last6=Van de Peer |first6=Yves |last7=Coppens |first7=Frederik |last8=Vandepoele |first8=Klaas |date=2018-01-04 |title=PLAZA 4.0: an integrative resource for functional, evolutionary and comparative plant genomics |url=http://academic.oup.com/nar/article/46/D1/D1190/4561641 |journal=Nucleic Acids Research |language=en |volume=46 |issue=D1 |pages=D1190–D1196 |doi=10.1093/nar/gkx1002 |issn=0305-1048 |pmc=PMC5753339 |pmid=29069403}}</ref> provides an integrative resource for functional, evolutionary, and comparative plant genomics. 
Data portals and specific databases like The Arabidopsis Information Resource (TAIR)<ref>{{Cite journal |last=Berardini |first=Tanya Z. |last2=Reiser |first2=Leonore |last3=Li |first3=Donghui |last4=Mezheritsky |first4=Yarik |last5=Muller |first5=Robert |last6=Strait |first6=Emily |last7=Huala |first7=Eva |date=2015-08 |title=The arabidopsis information resource: Making and mining the “gold standard” annotated reference plant genome: Tair: Making and Mining the “Gold Standard” Plant Genome |url=https://onlinelibrary.wiley.com/doi/10.1002/dvg.22877 |journal=genesis |language=en |volume=53 |issue=8 |pages=474–485 |doi=10.1002/dvg.22877 |pmc=PMC4545719 |pmid=26201819}}</ref>, Araport<ref>{{Cite journal |last=Krishnakumar |first=Vivek |last2=Hanlon |first2=Matthew R. |last3=Contrino |first3=Sergio |last4=Ferlanti |first4=Erik S. |last5=Karamycheva |first5=Svetlana |last6=Kim |first6=Maria |last7=Rosen |first7=Benjamin D. |last8=Cheng |first8=Chia-Yi |last9=Moreira |first9=Walter |last10=Mock |first10=Stephen A. |last11=Stubbs |first11=Joseph |date=2015-01-28 |title=Araport: the Arabidopsis Information Portal |url=http://academic.oup.com/nar/article/43/D1/D1003/2439074/Araport-the-Arabidopsis-Information-Portal |journal=Nucleic Acids Research |language=en |volume=43 |issue=D1 |pages=D1003–D1009 |doi=10.1093/nar/gku1200 |issn=1362-4962 |pmc=PMC4383980 |pmid=25414324}}</ref>, Aramemnon<ref>{{Cite journal |last=Schwacke |first=Rainer |last2=Schneider |first2=Anja |last3=van der Graaff |first3=Eric |last4=Fischer |first4=Karsten |last5=Catoni |first5=Elisabetta |last6=Desimone |first6=Marcelo |last7=Frommer |first7=Wolf B. 
|last8=Flügge |first8=Ulf-Ingo |last9=Kunze |first9=Reinhard |date=2003-01-01 |title=ARAMEMNON, a Novel Database for Arabidopsis Integral Membrane Proteins |url=https://academic.oup.com/plphys/article/131/1/16/6114365 |journal=Plant Physiology |language=en |volume=131 |issue=1 |pages=16–26 |doi=10.1104/pp.011577 |issn=1532-2548 |pmc=PMC166783 |pmid=12529511}}</ref>, or Phytozome<ref>{{Cite journal |last=Goodstein |first=David M. |last2=Shu |first2=Shengqiang |last3=Howson |first3=Russell |last4=Neupane |first4=Rochak |last5=Hayes |first5=Richard D. |last6=Fazo |first6=Joni |last7=Mitros |first7=Therese |last8=Dirks |first8=William |last9=Hellsten |first9=Uffe |last10=Putnam |first10=Nicholas |last11=Rokhsar |first11=Daniel S. |date=2012-01 |title=Phytozome: a comparative platform for green plant genomics |url=https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkr944 |journal=Nucleic Acids Research |language=en |volume=40 |issue=D1 |pages=D1178–D1186 |doi=10.1093/nar/gkr944 |issn=1362-4962 |pmc=PMC3245001 |pmid=22110026}}</ref> provide fine-grained species-specific reference knowledge. Generally, these resources offer a more condensed compilation of knowledge and often retain the advantage of being manually curated. However, each release of a knowledge database represents only a snapshot of the knowledge at the time of its creation, which burdens the maintainers with ongoing upkeep and leaves users uncertain about how current the data source is. Compared to free text, knowledge databases are often easier to access by computational means and provide better interoperability for the application of machine learning methods; nevertheless, they were, and still are, designed with a human operator in mind and often lack important metadata. This affects not only processes like data retrieval but also the documentation of how data were obtained and integrated when assembling the database.
The communication of findings in scientific publications or their integration in knowledge databases is of course limited by the questions asked at the time of creation. Therefore, best practice suggests publishing raw measurements data in a technology-specific data repository. ProteomeXchange<ref>{{Cite journal |last=Vizcaíno |first=Juan A |last2=Deutsch |first2=Eric W |last3=Wang |first3=Rui |last4=Csordas |first4=Attila |last5=Reisinger |first5=Florian |last6=Ríos |first6=Daniel |last7=Dianes |first7=José A |last8=Sun |first8=Zhi |last9=Farrah |first9=Terry |last10=Bandeira |first10=Nuno |last11=Binz |first11=Pierre-Alain |date=2014-03 |title=ProteomeXchange provides globally coordinated proteomics data submission and dissemination |url=http://www.nature.com/articles/nbt.2839 |journal=Nature Biotechnology |language=en |volume=32 |issue=3 |pages=223–226 |doi=10.1038/nbt.2839 |issn=1087-0156 |pmc=PMC3986813 |pmid=24727771}}</ref>, Gene Expression Omnibus (GEO)<ref>{{Citation |last=Clough |first=Emily |last2=Barrett |first2=Tanya |date=2016 |editor-last=Mathé |editor-first=Ewy |editor2-last=Davis |editor2-first=Sean |title=The Gene Expression Omnibus Database |url=http://link.springer.com/10.1007/978-1-4939-3578-9_5 |work=Statistical Genomics |publisher=Springer New York |place=New York, NY |volume=1418 |pages=93–110 |doi=10.1007/978-1-4939-3578-9_5 |isbn=978-1-4939-3576-5 |pmc=PMC4944384 |pmid=27008011 |accessdate=2021-12-17}}</ref>, SRA/ENA<ref>{{Cite journal |last=Leinonen |first=R. |last2=Sugawara |first2=H. |last3=Shumway |first3=M. 
|last4=on behalf of the International Nucleotide Sequence Database Collaboration |date=2011-01-01 |title=The Sequence Read Archive |url=https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkq1019 |journal=Nucleic Acids Research |language=en |volume=39 |issue=Database |pages=D19–D21 |doi=10.1093/nar/gkq1019 |issn=0305-1048 |pmc=PMC3013647 |pmid=21062823}}</ref>, and MetaboLights<ref>{{Cite journal |last=Haug |first=Kenneth |last2=Salek |first2=Reza M. |last3=Conesa |first3=Pablo |last4=Hastings |first4=Janna |last5=de Matos |first5=Paula |last6=Rijnbeek |first6=Mark |last7=Mahendraker |first7=Tejasvi |last8=Williams |first8=Mark |last9=Neumann |first9=Steffen |last10=Rocca-Serra |first10=Philippe |last11=Maguire |first11=Eamonn |date=2013-01-01 |title=MetaboLights—an open-access general-purpose repository for metabolomics studies and associated meta-data |url=http://academic.oup.com/nar/article/41/D1/D781/1050654/MetaboLightsan-openaccess-generalpurpose |journal=Nucleic Acids Research |language=en |volume=41 |issue=D1 |pages=D781–D786 |doi=10.1093/nar/gks1004 |issn=0305-1048 |pmc=PMC3531110 |pmid=23109552}}</ref> are well-established data exchange platforms that enforce certain metadata annotation tailored to the individual technology. Generic data repositories like Figshare and Dataverse do not require a technology-specific and laborious annotation process, but in turn do not ensure the necessary metadata annotation. Repositories can improve the process of peer review, since reviewers can evaluate the data themselves with respect to reproducibility, and they also make the raw data accessible to the community for reevaluation. This allows new hypotheses to be tested using existing data sets. Nonetheless, the reuse of published data sets is limited by the level of detail in which their creation is described. 
Therefore, consortia and initiatives coordinate standardization efforts in plant research and have developed standards and checklists that formally enable researchers to communicate their findings with the required metadata. In the plant field, excellent standards for experimental data collection are the Minimum Information for Biological and Biomedical Investigations (MIBBI)<ref>{{Cite journal |last=Taylor |first=Chris F |last2=Field |first2=Dawn |last3=Sansone |first3=Susanna-Assunta |last4=Aerts |first4=Jan |last5=Apweiler |first5=Rolf |last6=Ashburner |first6=Michael |last7=Ball |first7=Catherine A |last8=Binz |first8=Pierre-Alain |last9=Bogue |first9=Molly |last10=Booth |first10=Tim |last11=Brazma |first11=Alvis |date=2008-08 |title=Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project |url=http://www.nature.com/articles/nbt.1411 |journal=Nature Biotechnology |language=en |volume=26 |issue=8 |pages=889–896 |doi=10.1038/nbt.1411 |issn=1087-0156 |pmc=PMC2771753 |pmid=18688244}}</ref>, Minimum Information About a Microarray Experiment for Plants (MIAME/Plant)<ref>{{Cite journal |last=Zimmermann |first=Philip |last2=Schildknecht |first2=Beatrice |last3=Craigon |first3=David |last4=Garcia-Hernandez |first4=Margarita |last5=Gruissem |first5=Wilhelm |last6=May |first6=Sean |last7=Mukherjee |first7=Gaurab |last8=Parkinson |first8=Helen |last9=Rhee |first9=Seung |last10=Wagner |first10=Ulrich |last11=Hennig |first11=Lars |date=2006-12 |title=MIAME/Plant – adding value to plant microarrray experiments |url=https://plantmethods.biomedcentral.com/articles/10.1186/1746-4811-2-1 |journal=Plant Methods |language=en |volume=2 |issue=1 |pages=1 |doi=10.1186/1746-4811-2-1 |issn=1746-4811 |pmc=PMC1334190 |pmid=16401339}}</ref>, and Minimum Information About a Plant Phenotyping Experiment (MIAPPE).<ref>{{Cite journal |last=Krajewski |first=Paweł |last2=Chen |first2=Dijun |last3=Ćwiek |first3=Hanna |last4=van Dijk |first4=Aalt 
D.J. |last5=Fiorani |first5=Fabio |last6=Kersey |first6=Paul |last7=Klukas |first7=Christian |last8=Lange |first8=Matthias |last9=Markiewicz |first9=Augustyn |last10=Nap |first10=Jan Peter |last11=van Oeveren |first11=Jan |date=2015-09 |title=Towards recommendations for metadata and data handling in plant phenotyping |url=https://academic.oup.com/jxb/article-lookup/doi/10.1093/jxb/erv271 |journal=Journal of Experimental Botany |language=en |volume=66 |issue=18 |pages=5417–5427 |doi=10.1093/jxb/erv271 |issn=0022-0957}}</ref> However, it is exceedingly difficult for researchers to judge the necessity of certain meta information beforehand. Additionally, considerable effort and skills are required to provide adequate metadata annotation to the research data. Researchers also need to allocate the resources and capacity to actually do so in daily research practice. In addition, many researchers view data as sensitive research output that could easily be misused or misinterpreted when taken out of context. Thus, many scientists do not trust global repositories unless they have direct and personal connections to these researchers’ own work or find it too time consuming to validate their trustworthiness.
Nevertheless, it is evident that all ways of research communication (e.g., scientific journals, knowledge databases, and data repositories) heavily benefit from improved metadata description, not only in terms of reproducibility, but also accessibility and thus reusability.<ref>{{Cite journal |last=Leonelli |first=Sabina |date=2019-04-05 |title=The challenges of big data biology |url=https://elifesciences.org/articles/47381 |journal=eLife |language=en |volume=8 |pages=e47381 |doi=10.7554/eLife.47381 |issn=2050-084X |pmc=PMC6450665 |pmid=30950793}}</ref> It is apparent that research data management requires a constant endeavor of researchers and that well-accepted standards need to be developed. Here, the FAIR Guiding Principles form a conceptual roof and formulate the necessary goals to achieve. The FAIR Guiding Principles<ref>{{Cite journal |last=Wilkinson |first=Mark D. |last2=Dumontier |first2=Michel |last3=Aalbersberg |first3=IJsbrand Jan |last4=Appleton |first4=Gabrielle |last5=Axton |first5=Myles |last6=Baak |first6=Arie |last7=Blomberg |first7=Niklas |last8=Boiten |first8=Jan-Willem |last9=da Silva Santos |first9=Luiz Bonino |last10=Bourne |first10=Philip E. |last11=Bouwman |first11=Jildau |date=2016-12 |title=The FAIR Guiding Principles for scientific data management and stewardship |url=http://www.nature.com/articles/sdata201618 |journal=Scientific Data |language=en |volume=3 |issue=1 |pages=160018 |doi=10.1038/sdata.2016.18 |issn=2052-4463 |pmc=PMC4792175 |pmid=26978244}}</ref> are founded on four core elements: (i) findability, (ii) accessibility, (iii) interoperability, and (iv) re-usability. Findable data is described/annotated with rich metadata and is assigned a globally unique identifier, which is indexed in a searchable source, e.g., a database. The metadata must specify what kind of identifier is used. 
According to accessibility, metadata and data must be retrievable based on their identifier by using a standardized protocol, which is open and universally implementable. Interoperable data use a standard vocabulary based on the FAIR principles and include qualified references to other (meta)data and most importantly are represented using a formal, accessible, shared, and broadly applicable language for knowledge representation. Consequently, re-usable (meta)data have a plurality of accurate and relevant attributes. In addition, they need to be associated with their provenance and meet domain-specific community standards.
Generic implementations that help researchers abide by the FAIR principles already exist. The usage of Research Object (RO)<ref>{{Cite journal |last=Hettne |first=Kristina M |last2=Dharuri |first2=Harish |last3=Zhao |first3=Jun |last4=Wolstencroft |first4=Katherine |last5=Belhajjame |first5=Khalid |last6=Soiland-Reyes |first6=Stian |last7=Mina |first7=Eleni |last8=Thompson |first8=Mark |last9=Cruickshank |first9=Don |last10=Verdes-Montenegro |first10=Lourdes |last11=Garrido |first11=Julian |date=2014-12 |title=Structuring research methods and data with the research object model: genomics workflows as a case study |url=https://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-5-41 |journal=Journal of Biomedical Semantics |language=en |volume=5 |issue=1 |pages=41 |doi=10.1186/2041-1480-5-41 |issn=2041-1480 |pmc=PMC4177597 |pmid=25276335}}</ref>, Research Object Crate (RO-Crate)<ref>{{Cite journal |last=Carragáin |first=Eoghan Ó |last2=Goble |first2=Carole |last3=Sefton |first3=Peter |last4=Soiland-Reyes |first4=Stian |date=2019-06-20 |title=A lightweight approach to research object data packaging |url=https://zenodo.org/record/3250687 |doi=10.5281/ZENODO.3250687}}</ref>, or ISA data model<ref>{{Cite journal |last=González-Beltrán |first=Alejandra |last2=Maguire |first2=Eamonn |last3=Sansone |first3=Susanna-Assunta |last4=Rocca-Serra |first4=Philippe |date=2014-12 |title=linkedISA: semantic representation of ISA-Tab experimental metadata |url=https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S14-S4 |journal=BMC Bioinformatics |language=en |volume=15 |issue=S14 |pages=S4 |doi=10.1186/1471-2105-15-S14-S4 |issn=1471-2105 |pmc=PMC4255742 |pmid=25472428}}</ref> can lead to a rich description of the experimental metadata (i.e., sample characteristics, technology and measurement types, sample-to-data relationships) that makes the resulting data and discoveries reproducible and reusable. 
Scientific findings accompanied by rich metadata descriptions are representable as knowledge graphs. Such graphs greatly increase the value of these findings to the scientific community, since embedding into traversable tree-like structures results in a cross linking of available scientific data, which makes knowledge searchable. In practice, this is achieved using domain-specific [[Ontology (information science)|ontologies]], which constrain the used vocabulary and conserve the relationship of single terms.
Reproducibility and provenance play an important role, especially in the computational analysis itself. Recent efforts to make analytic pipelines independent of their runtime environment strongly improved reusability and reproducibility of workflows. Containerization of processing tools and analytic pipelines facilitates the sharing and collaborative development of workflows on specialized platforms like WorkflowHUB. Analogously, computation requires metadata and specifications. In this regard, the BioCompute Object Project<ref name=":2">{{Cite journal |last=Simonyan |first=Vahan |last2=Goecks |first2=Jeremy |last3=Mazumder |first3=Raja |date=2017 |title=Biocompute Objects—A Step towards Evaluation and Validation of Biomedical Scientific Computations |url=http://journal.pda.org/lookup/doi/10.5731/pdajpst.2016.006734 |journal=PDA Journal of Pharmaceutical Science and Technology |language=en |volume=71 |issue=2 |pages=136–146 |doi=10.5731/pdajpst.2016.006734 |issn=1079-7440 |pmc=PMC5510742 |pmid=27974626}}</ref> aims to ease the exchange of HTS workflows between various organizations by providing a JSON format that, at a minimum, contains all the software versions and parameters necessary to evaluate or verify a computational pipeline.
It becomes evident that a combination of computation, data, and their metadata is essential to achieve the common goal of a well-annotated research object living up to the FAIR principles.<ref name=":2" /><ref>{{Cite journal |last=Vicente-Saez |first=Ruben |last2=Martinez-Fuentes |first2=Clara |date=2018-07 |title=Open Science now: A systematic literature review for an integrated definition |url=https://linkinghub.elsevier.com/retrieve/pii/S0148296317305441 |journal=Journal of Business Research |language=en |volume=88 |pages=428–436 |doi=10.1016/j.jbusres.2017.12.043}}</ref> Therefore, community-driven initiatives like DataPLANT support plant scientists in every research data management concern and provide a tailor-made service environment to contextualize research data according to the FAIR principles with minimal additional effort in modern plant biology.





Revision as of 22:26, 17 December 2021

Full article title Data management and modeling in plant biology
Journal Frontiers in Plant Science
Author(s) Krantz, Maria; Zimmer, David; Adler, Stephan O.; Kitashova, Anastasia; Klipp, Edda; Mühlhaus, Timo; Nägele, Thomas
Author affiliation(s) Humboldt-Universität zu Berlin, Technische Universität Kaiserslautern, Ludwig-Maximilians-Universität München
Primary contact Email: thomas dot naegele at lmu dot de
Editors Fukushima, Atsushi
Year published 2021
Volume and issue 12
Article # 717958
DOI 10.3389/fpls.2021.717958
ISSN 1664-462X
Distribution license Creative Commons Attribution 4.0 International
Website https://www.frontiersin.org/articles/10.3389/fpls.2021.717958/full
Download https://www.frontiersin.org/articles/10.3389/fpls.2021.717958/pdf (PDF)

Abstract

The study of plant-environment interactions is a multidisciplinary research field. With the emergence of quantitative large-scale and high-throughput techniques, the amount and dimensionality of experimental data have strongly increased. Appropriate strategies for data storage, management, and evaluation are needed to make efficient use of experimental findings. Computational approaches to data mining are essential for deriving statistical trends and signatures contained in data matrices. Although current biology is challenged by high data dimensionality in general, this is particularly true for plant biology. As sessile organisms, plants have to cope with environmental fluctuations. This typically results in strong dynamics of metabolite and protein concentrations, which are often challenging to quantify. Summarizing experimental output results in complex data arrays, which need computational statistics and numerical methods for building quantitative models. Experimental findings need to be combined with computational models to gain a mechanistic understanding of plant metabolism. For this, bioinformatics and mathematics need to be combined with experimental setups in physiology, biochemistry, and molecular biology. This review presents and discusses concepts at the interface of experiment and computation, which are likely to shape current and future plant biology. Finally, this interface is discussed with regard to its capabilities and limitations to develop a quantitative model of plant-environment interactions.

Keywords: genome-scale networks, omics analysis, metabolic regulation, plant-environment interactions, machine learning, mathematical modeling, differential equations

Introduction

Experimental high-throughput analysis of genomes, transcriptomes, proteomes, and metabolomes results in a vast number of simultaneously quantified molecular entities. Current biological research frequently applies a combination of experimental high-throughput techniques to address a wide spectrum of complex research questions. On the genome level, high-throughput sequencing (HTS) technologies have revolutionized genetics and genomics, and sequencing projects have provided comprehensive information about many species’ genomes.[1][2][3][4][5] To date, thousands of genomes have been sequenced and pan-genomics approaches have been initiated, which assemble diverse sets of individual genomes to a collection of all DNA sequences occurring in a species.[6] In plant sciences, the concept of pan-genomics is already discussed to support breeding strategies or evolutionary studies and may significantly contribute to the explanation of gene presence and absence variation.[7]

Based on such comprehensive genome information, genome-scale models of plant metabolism have been developed and applied to predict plant metabolism in a diverse context. Validation and biotechnological application of such large-scale models need appropriate experimental techniques and platforms, unifying sample analysis in multi-omics approaches.[8] Although omics techniques have become a generic element of numerous research projects to quantify transcripts, proteins, and metabolites, the actual handling, normalization, and integration of multidimensional experimental data output is still a central challenge in biology.[9] The need for integrative analysis of experimental high-throughput data has already been suggested and discussed earlier. For example, almost a decade ago, integrative approaches were suggested for transcriptomics, proteomics, and metabolomics data to promote a systems-level understanding of the genus Arabidopsis.[10] Since then, machine learning, computational statistics, and mathematical modeling have significantly advanced data integration strategies. Due to their capability to improve the understanding of the genotype-phenotype relation on a molecular level, systems biology and multi-omics integration have become central topics in the discussion about future perspectives of biology and medicine. Yet, in order to make experiments comparable and to increase consistency and reproducibility across different experimental platforms, laboratories, or research communities, quantitative omics data are needed.[11] Furthermore, quantitative experimental data necessitates appropriate processing strategies to make it comparable to other independent studies and statistics. Making data and data processing publicly available via databases and repositories may represent one of the most important steps to establish and expand a cross-disciplinary scientific platform for omics data integration. 
Together with the need for traceable long-term data storage and versioning, these topics are becoming increasingly important in quantitative biology.

Searching for database entries from the last two decades on omics and integrative omics approaches reveals rapidly increasing research and publication activity in the integrative multi-omics research field (Figure 1). The yearly number of genomics-related articles increased linearly to a very high level during the last 20 years, while transcriptomics and metabolomics articles in particular have been published at an increasing rate during the last decade (Figure 1A). Between 2000 and 2015, more proteomics-related articles were published than transcriptomics and metabolomics articles, but since 2017 their number has lain between those of these two omics disciplines. Interestingly, since 2017, the number of articles searchable by the queries “multi-omics” or “multiomics” has been increasing exponentially (Figure 1B). A similar, yet weaker trend is also observable for “omics data integration” articles (Figure 1B). Of course, these numbers are only crude estimates based on our chosen specific vocabulary and searched within one specific database (for example, we have not checked the combination of different omics disciplines, i.e., “genomics” and “transcriptomics” instead of “multi-omics”). Yet, these results still indicate that an increasing number of studies focus on a multi-omics design and that omics data integration is gaining more and more attention.



Figure 1. Number of articles found by article search in the PubMed library covering two decades, i.e., 2000–2020. (A) Timeline of number of articles on different omics disciplines (blue: genomics; orange: transcriptomics; gray: proteomics; and yellow: metabolomics). Articles were searched by single key word search. (B) Timeline of number of articles found by search on omics data integration (green line; single words were connected by AND-expression) and multi-omics (or multiomics, blue line).

This article aims to summarize and discuss current advances and limitations of integrative molecular analysis, computational modeling, and data science. It focuses on both experimental and theoretical methodology to support design and analysis of interdisciplinary research in plant biology. A particular focus is laid on methodologies for capturing system dynamics of plant metabolism induced by a changing environment.

On a large scale: How does genome-scale metabolic network reconstruction support data integration in plant biology?

The availability of comprehensive genome information has enabled the reconstruction of genome-scale metabolic networks, which predict, based on gene annotation, a functional cellular network structure. This crucially supports the interpretation of gene functions and makes pathways accessible to computational biology and mathematics.[12] Further, reconstructed networks significantly facilitate a mechanistic description of genotype-phenotype relationships and enable the application of constraint-based analysis methods.[13][14] Major constraints are thermodynamics, mass and charge conservation, and substrate/enzyme availability. Constraints dramatically reduce the parameter space that explains a genotype-phenotype relationship and, hence, strongly increase the probability of finding physiologically relevant solutions for the underlying equation systems. Thus, it is not surprising that, in current plant biology, genome-scale reconstruction has become an integral part of modeling from the single-cell to the multi-tissue level.[15] For example, model reconstructions have been applied to analyze metabolic regulation in autotrophic and heterotrophic tissues, to study C4 plant metabolism, to evaluate diurnal metabolic interactions in plant leaf tissue, and to analyze photorespiration.[16][17][18][19]
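
As a sketch of what constraint-based analysis typically looks like, the widely used flux balance analysis formulation can be written as a linear program. The symbols are introduced here for illustration and are not taken from the text above: S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, c selects the objective (for example, a biomass reaction), and v^min and v^max are per-reaction flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0 \\
& v^{\min}_{j} \le v_{j} \le v^{\max}_{j} \quad \text{for all reactions } j
\end{aligned}
```

In this form, mass conservation at steady state corresponds to S v = 0, while thermodynamic directionality and substrate/enzyme availability enter through the bounds (for example, v^min_j = 0 for an irreversible reaction).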

The experimental basis for constraining, validating, and optimizing large-scale models are high-throughput experiments, i.e., omics analyses. For example, to investigate effects of nitrogen assimilation on metabolism in maize (Zea mays), a genome-scale metabolic model for maize leaf was created comprising more than 5,800 genes, 8,500 reactions, and 9,000 metabolites.[20] Using a combination of transcriptomic and proteomic data to constrain metabolic flux predictions, the authors were able to reproduce experimentally determined metabolomic data to significantly higher accuracy than without these constraints. Applying a combination of publicly available data on maize metabolism, reaction networks, and results from omics experiments, information about reaction stoichiometry, directionality, and compartmentalization was derived. Algorithmic model curation was combined with manual modification to, for example, resolve gaps in the network model with reactions from similar organisms. Information about transcripts and proteins, which were experimentally observed to significantly differ in mutants and under variable nitrogen supply, were then incorporated into the model by switching on/off corresponding reactions. Flux predictions through the metabolic network were compared to metabolomics measurements. With this integrated setup, model application unraveled genes coding for enzymes, which are involved in regulation of biomass formation under variable nitrogen supply.[20] In another study, publicly available transcriptomics and metabolomics data were used within a constraint-based modeling approach to investigate network structure and flux distribution in root cell types and tissue layers of Arabidopsis thaliana. Based on transcriptomics and metabolomics data, it was possible to extract tissue and cell type specific models from a general genome-scale model of root metabolism. 
By this, the authors were able to simulate and analyze cell types as autonomous subsystems, which communicate with each other via metabolites or proteins. But it was also shown and discussed that further experimental evidence and constraints are essential to support hypotheses derived from their simulations.[21] This example nicely illustrates how large-scale data integration can (i) unravel novel and detailed mechanistic insights into plant metabolism, and also (ii) indicate design and research focus of follow-up studies to prove model predictions. By placing metabolites, proteins, or transcripts into a pathway and network context, genome-scale models significantly support the biochemical and physiological interpretation of molecular data.
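
The step of switching reactions on or off according to omics observations can be sketched as follows. This is a minimal illustration and not the cited authors' implementation: the reaction names, gene identifiers, flux bounds, and the simple rule that a reaction is switched off only when all of its associated genes are flagged as inactive are assumptions made for this example.

```python
# Hypothetical toy model: reaction -> (lower bound, upper bound, associated genes).
# All names and numbers are illustrative assumptions, not data from the cited study.
reactions = {
    "nitrate_reductase": (0.0, 10.0, {"geneA"}),
    "glutamine_synthetase": (0.0, 10.0, {"geneB", "geneC"}),
    "biomass": (0.0, 1000.0, set()),
}


def apply_expression_constraints(reactions, inactive_genes):
    """Return a copy of the model in which a reaction is switched off
    (both bounds set to 0) when every gene associated with it is inactive.
    Reactions without gene associations are left unchanged."""
    constrained = {}
    for name, (lower, upper, genes) in reactions.items():
        if genes and genes <= inactive_genes:
            constrained[name] = (0.0, 0.0, genes)
        else:
            constrained[name] = (lower, upper, genes)
    return constrained


# Genes flagged as inactive, e.g., from transcriptomic or proteomic data.
constrained = apply_expression_constraints(reactions, inactive_genes={"geneA", "geneB"})
```

The constrained bounds would then be passed to a flux prediction step, whose output can be compared with metabolomics measurements.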

Also, in a biotechnological context, such data integration strategies have become an important and promising tool to advance and improve bioengineering strategies. As an example, a genome-scale metabolic network reconstruction for the green microalgal model species Chlamydomonas reinhardtii has been developed which reliably and quantitatively predicts growth depending on the light source.[22] This metabolic network comprises 10 compartments, accounting for more than 1,000 genes associated with more than 2,000 reactions and over 1,000 metabolites. Regulatory effects arising from different light conditions are covered by the model, which enables estimation of growth under different laboratory conditions. The model has been refined using metabolite profiling to include further branches of metabolism, e.g., amino acids and peptides as nitrogen sources.[23] Although it was developed a decade ago, the original model (named iRC1080) still represents a valid and supportive platform for data interpretation, and it still fruitfully initiates further model development and validation.[24] These examples, together with many other studies that have been summarized recently[25], provide strong evidence for the capability of genome-scale metabolic models to couple statistics with metabolic models.

Large-scale models need quantitative large-scale experiments on integrative platforms for validation and iterative parameter optimization

Reconstruction of a genome-scale metabolic network from genome sequence information is an iterative process, which needs several rounds of automatized and manual model adjustment, reconfiguration, and fine-tuning.[26] It strictly depends on genome annotation, and due to the strong increase of genome sequence information, high-throughput annotation algorithms are necessary to cope with this vast amount of data. Particularly in eukaryotic genomes, annotation errors due to assembly errors are still a challenge in the field, and direct RNA sequencing is discussed to improve gene annotation in the future.[27][28] However, as soon as a model has been curated and applied to predict metabolic flux or growth, quantitative experiments are needed to validate the model output, and to iteratively adjust model parameters. In addition to validation variables like growth rates, lipid content, ATP concentration, or total protein amount, experimental omics analyses potentially provide detailed information about pathway regulation, gene regulatory networks, and signaling cascades. Here, mass spectrometry-based proteomics and metabolomics analyses play a crucial role, which are not only able to analyze post-translational modifications or protein localization, but also can quantify turnover rates and metabolic fluxes down to subcellular scale.[29][30]

Quality of experimental data limits optimization of in silico models. If absolute quantitative model predictions about metabolite or protein dynamics cannot be experimentally validated due to missing absolute quantitative experiments, accuracy and reliability of the model frequently remain ambiguous or elusive. Several complex and non-intuitive questions about stability or regulatory patterns might still be addressed with such a model. Yet, the physiological constitution of a plant, or organism in general, which results from a certain growth setup, can hardly be modeled and simulated without quantitative information. For example, plant growth strictly depends on various growth parameters, e.g., light intensity and quality, soil composition, water availability, and humidity. It is well known that a slight modification of only one of those growth parameters might strongly affect the (molecular) phenotype, which makes comparative studies difficult. For example, different light sources might be applied (LEDs, fluorescent tubes, etc.) in different laboratories, which immediately results in different growth behavior and physiological properties.[31] While global harmonization of growth cabinets, greenhouses, or climate chambers remains impractical, augmentation of quantitative omics analysis seems realistic. Recommendations and potential pitfalls of experimental designs are already discussed on a research community level.[11] The authors recommend quality control samples (QCs) and universal standardized operating protocols (SOPs) for quantitative and reproducible experiments. Further, collecting and publishing comprehensive metadata is recommended to guide through and inform about experiments.[32][33][34]

In plant biology, absolute quantification of primary and secondary metabolites might represent a suitable approach to make studies comparable across platforms and growth regimes. Plant metabolism shows a high plasticity across different diurnal light periods. For example, under short day growth conditions with eight hours of light and 16 hours of darkness, dynamics of sugar and amino acid concentrations are significantly stronger than under long day growth conditions, i.e., under 16 hours of light and eight hours of darkness.[35] Additionally, the ratio of monosaccharides and disaccharides may vary significantly between growth setups, which is not detectable within a qualitative omics study because it does not allow the absolute comparison of two or more different substances. In mass spectrometry, one reason for this is that different molecules, e.g., sucrose and glucose, produce different ions with different masses, which are detected with different intensity. Hence, to make resulting mass spectra and chromatographic peaks comparable across different substances, they need to be individually scaled against a dilution series of standard substances, i.e., a calibration curve. This yields the absolute amount of substance within a sample, which can then be normalized to sample protein amount or sample weight. Depending on the applied growth conditions and treatment, normalization to either fresh or dry weight might be favorable. For example, exposing plants to heat and/or drought stress directly affects leaf water content and, thus, under such conditions normalization to dry weight should be favored if metabolite concentrations are quantified.
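
The calibration and normalization steps described above can be sketched in a few lines. This is a minimal example under stated assumptions: the calibration is taken to be linear, and the standard amounts, peak areas, and sample dry weight are invented numbers chosen only to make the arithmetic easy to follow.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical dilution series of a sucrose standard (invented values).
standard_amount_nmol = [1.0, 2.0, 4.0, 8.0]
standard_peak_area = [210.0, 410.0, 810.0, 1610.0]

slope, intercept = fit_line(standard_amount_nmol, standard_peak_area)


def quantify(peak_area, dry_weight_mg):
    """Convert a peak area to an absolute amount via the calibration line,
    then normalize to sample dry weight (nmol per mg dry weight)."""
    amount_nmol = (peak_area - intercept) / slope
    return amount_nmol / dry_weight_mg


sucrose_nmol_per_mg_dw = quantify(peak_area=1010.0, dry_weight_mg=2.5)
```

Each substance needs its own calibration line, because equal amounts of, for example, sucrose and glucose do not produce equal signal intensities.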

While such an approach is appropriate for absolute quantification of central primary metabolites—i.e., sugars, amino acids, or organic acids[36]—it is hardly feasible for each individual substance within a metabolite profile. For many substances, appropriate standard substances are lacking, and even if they are available, they might be expensive due to costly purification and/or synthesis procedures. Further problems might occur when purified substances, like polar and apolar amino acids, need to be diluted and mixed within calibration samples due to their different solubilities in water. The vast number of metabolites, which are estimated to comprise between 200,000 and 1,000,000 across the plant kingdom and up to 5,000 within a single species[37][38], makes quantitative metabolomics challenging. Based on these numbers, it seems unfeasible to resolve the quantity of hundreds or thousands of compounds within a Gas chromatography–mass spectrometry (GC-MS) or Liquid chromatography–mass spectrometry (LC-MS) run. While a combination of different analytical platforms promises to cover a large panel of compounds[39][40], semi-quantitative analysis might represent a suitable approach to increase reproducibility and comparability of high-throughput analysis among quantification platforms. Here, structural elucidation of metabolic compounds based on mass spectrometry data might indicate a compound’s class.[41] This information, together with chromatographic information about retention time or index, might allow classification of an unknown substance by database search and comparison to known substances with similar mass spectra and physical properties like polarity. This would enable the comparison of chromatographic peak areas of an unknown substance to a known and most similar standard substance. 
For example, an unknown substance which, based on its mass spectrum information, is predicted to be a disaccharide might be semi-quantified applying the calibration of a known disaccharide with similar retention time or index. In this way, semi-quantitative information of an unknown substance might be derived from GC-MS (primary metabolites) or LC-MS (secondary metabolites) runs, which would facilitate comparison and data exchange of independent studies and on different experimental platforms.
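
The semi-quantification idea can be illustrated with a small sketch. It assumes that a compound class has already been predicted from the mass spectrum and that linear calibrations exist for a few standards; the retention indices and calibration parameters below are invented for illustration and are not reference values.

```python
# Hypothetical calibrated standards:
# name -> (compound class, retention index, calibration slope, calibration intercept).
# All numbers are invented for illustration.
standards = {
    "sucrose": ("disaccharide", 2630.0, 200.0, 10.0),
    "trehalose": ("disaccharide", 2720.0, 180.0, 5.0),
    "glucose": ("monosaccharide", 1890.0, 150.0, 8.0),
}


def closest_standard(predicted_class, retention_index):
    """Pick the standard of the same class with the nearest retention index."""
    candidates = [
        (abs(ri - retention_index), name)
        for name, (cls, ri, _slope, _intercept) in standards.items()
        if cls == predicted_class
    ]
    if not candidates:
        return None
    return min(candidates)[1]


def semi_quantify(peak_area, predicted_class, retention_index):
    """Estimate the amount of an unknown substance using the calibration
    of the most similar known standard. Returns (standard name, estimate)."""
    name = closest_standard(predicted_class, retention_index)
    if name is None:
        return None, None
    _cls, _ri, slope, intercept = standards[name]
    return name, (peak_area - intercept) / slope


surrogate, estimate = semi_quantify(
    peak_area=905.0, predicted_class="disaccharide", retention_index=2700.0
)
```

The result is only an estimate in units of the surrogate standard, which is why the approach is called semi-quantitative.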

Research data management provides the groundwork for successful data integration

Data integration methods, especially machine learning approaches, profit heavily from the increasing availability of data. Aside from high-dimensionality and sparsity of biological data, a fundamental challenge in data integration lies in accessibility and quality of information and knowledge. Modern approaches require not only massive, but particularly well-annotated data sets.[42]

Currently, the default medium of scientific communication in the domain of biology is the publication of research in peer-reviewed scientific journals centered around free text-based communication. While this format has many benefits, such as quality control by curators who are experts in the respective field, it also has the drawback of being gated by paywalls. This issue is already being addressed with the increased founding of open-access journals, but the approach suffers from more intrinsic problems. The format itself was designed as a human-readable medium and is thus prone to design flaws that can be implicitly solved by a human reader but impose problems on the application of machine learning techniques. Examples are the heterogeneity of supplementary materials, the embedding of data as schematic descriptions, and, most severely, the communication of findings as free text. While these challenges have been identified and are currently tackled by manual curation and the application of natural language processing (NLP) and pattern recognition, their frequent occurrence still hinders the direct computational use of published knowledge for data integration.[43]

An alternative approach of scientific communication is realized by the creation of knowledge databases. In plant research, there are various information resources and data portals of extremely high quality. UniProt[44] and Ensembl plants[45] are integrative resources presenting genome-scale information for a growing number of sequenced plant species. Additionally, PLAZA[46] provides an integrative resource for functional, evolutionary, and comparative plant genomics. Data portals and specific databases like The Arabidopsis Information Resource (TAIR)[47], Araport[48], Aramemnon[49], or Phytozome[50] provide fine-grained species-specific reference knowledge. Generally, these resources offer a more condensed compilation of knowledge and often preserve the virtue of being manually curated. However, each iteration of a knowledge database only represents a snapshot of the knowledge at the time of creation, which imposes the initiator with the additional burden of maintenance and the user with uncertainty with regards to the currentness of the data source. In comparison to free text, knowledge databases are often easier to access by computational means and provide better interoperability when it comes to the application of machine learning methods; nevertheless, they were and still are designed with a human operator in mind and often lack important metadata information. This does not only affect processes like data retrieval but also the documentation of how data was obtained and integrated when assembling the database.

The communication of findings in scientific publications or their integration in knowledge databases is of course limited by the questions asked at the time of creation. Therefore, best practice suggests publishing raw measurements data in a technology-specific data repository. ProteomeXchange[51], Gene Expression Omnibus (GEO)[52], SRA/ENA[53], and Metabolights[54] are well established data exchange platforms that enforce certain metadata annotation tailored to the individual technology. Generic data repositories like Figshare and Dataverse do not require a technology-specific and laborious annotation process, but in turn do not ensure the necessary metadata annotation. Repositories can improve the process of peer review since the evaluation of data itself can be analyzed with respect to their reproducibility and also make the raw data accessible to the community for reevaluation. This allows the testing of new hypotheses using existing data sets. Nonetheless, the reuse of published data sets is limited by the level of detail in which their creation is described. Therefore, consortia and initiatives coordinate standardization efforts in plant research and developed standards and checklists to formally enable researchers to communicate their findings with required metadata. In the plant field, excellent standardizations for experimental data collections are the Minimal Information on Biological and Biomedical Investigations (MIBBI)[55], Minimal Information About a Microarray Experiment for Plants (MIAME/Plant)[56], and Minimal Information About Plant Phenotyping Experiment (MIAPPE).[57] However, it is exceedingly difficult for researchers to judge the necessity of certain meta information beforehand. Additionally, considerable effort and skills are required to provide adequate metadata annotation to the research data. Researchers also need to allocate the resources and capacity to actually do so in daily research practice. 
In addition, many researchers view data as sensitive research output that could easily be misused or misinterpreted when taken out of context. Thus, many scientists do not trust global repositories unless these have a direct and personal connection to their own work, or they find it too time-consuming to validate the repositories' trustworthiness.

Nevertheless, it is evident that all ways of research communication (e.g., scientific journals, knowledge databases, and data repositories) benefit heavily from improved metadata description, not only in terms of reproducibility, but also accessibility and thus reusability.[58] Research data management requires a constant effort from researchers, and well-accepted standards need to be developed. Here, the FAIR Guiding Principles form a conceptual roof and formulate the goals to achieve. The FAIR Guiding Principles[59] are founded on four core elements: (i) findability, (ii) accessibility, (iii) interoperability, and (iv) reusability. Findable data are described with rich metadata and assigned a globally unique identifier, and they are indexed in a searchable resource, e.g., a database. The metadata must specify what kind of identifier is used. To be accessible, metadata and data must be retrievable by their identifier using a standardized protocol that is open and universally implementable. Interoperable data use vocabularies that themselves follow the FAIR principles, include qualified references to other (meta)data, and, most importantly, are represented using a formal, accessible, shared, and broadly applicable language for knowledge representation. Finally, reusable (meta)data have a plurality of accurate and relevant attributes. In addition, they need to be associated with their provenance and meet domain-specific community standards.
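As a rough illustration of how the four FAIR elements translate into concrete checks, the following Python sketch tests a single metadata record against coarse proxies for each element. The field names (identifier, identifier_type, access_url, vocabulary, references, provenance, community_standard), the example values, and the set of accepted protocols are assumptions made for this example; they are not part of the FAIR Guiding Principles, and the checks are far simpler than a real FAIR assessment.

```python
from urllib.parse import urlparse

# Assumption for this sketch: these URL schemes stand in for
# "open and universally implementable" protocols.
OPEN_PROTOCOLS = {"http", "https", "ftp"}


def check_fair(record):
    """Return a coarse pass/fail flag per FAIR element for one metadata record."""
    # Findable: an identifier exists and its kind is stated.
    findable = bool(record.get("identifier")) and bool(record.get("identifier_type"))
    # Accessible: the record names a retrieval URL that uses an open protocol.
    scheme = urlparse(record.get("access_url") or "").scheme
    accessible = scheme in OPEN_PROTOCOLS
    # Interoperable: a shared vocabulary is named and other (meta)data are referenced.
    interoperable = bool(record.get("vocabulary")) and bool(record.get("references"))
    # Reusable: provenance is recorded and a community standard is declared.
    reusable = bool(record.get("provenance")) and bool(record.get("community_standard"))
    return {
        "findable": findable,
        "accessible": accessible,
        "interoperable": interoperable,
        "reusable": reusable,
    }


# Hypothetical record; every value below is invented for illustration.
record = {
    "identifier": "https://doi.example.org/10.9999/demo-dataset",
    "identifier_type": "DOI",
    "access_url": "https://repository.example.org/datasets/demo",
    "vocabulary": "https://vocab.example.org/plant-terms",
    "references": ["https://repository.example.org/samples/leaf-07"],
    "provenance": "Measured in a hypothetical lab; processed with pipeline v1",
    "community_standard": "MIAPPE",
}

result = check_fair(record)
```

A record that lacks any of the listed fields fails the corresponding element, which mirrors the point made above: each FAIR element rests on metadata that must be supplied explicitly.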

Generic frameworks that help researchers abide by the FAIR principles already exist. The use of Research Object (RO)[60], Research Object Crate (RO-Crate)[61], or the ISA data model[62] can lead to a rich description of the experimental metadata (i.e., sample characteristics, technology and measurement types, sample-to-data relationships) that makes the resulting data and discoveries reproducible and reusable. Scientific findings accompanied by rich metadata descriptions can be represented as knowledge graphs. Such graphs greatly increase the value of the findings to the scientific community, since embedding them into traversable, tree-like structures cross-links the available scientific data and makes knowledge searchable. In practice, this is achieved using domain-specific ontologies, which constrain the vocabulary used and preserve the relationships between individual terms.
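The idea of a knowledge graph whose vocabulary is constrained by an ontology can be sketched in a few lines of Python. In this minimal example, a small set of allowed relationship terms stands in for a domain ontology, and a breadth-first traversal shows how cross-linked metadata become searchable. All identifiers and relationship names (derived_from, measured_by, part_of) are hypothetical and are not taken from RO, RO-Crate, ISA, or any published ontology.

```python
from collections import deque

# Hypothetical controlled vocabulary standing in for a domain ontology.
ALLOWED_PREDICATES = {"derived_from", "measured_by", "part_of"}


def add_triple(graph, subject, predicate, obj):
    """Add a (subject, predicate, object) statement, rejecting unknown terms."""
    if predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"'{predicate}' is not part of the controlled vocabulary")
    graph.setdefault(subject, []).append((predicate, obj))


def reachable(graph, start):
    """Breadth-first traversal: every node linked directly or indirectly to start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Invented example: one data set, the sample it came from, and the assay used.
graph = {}
add_triple(graph, "dataset:rnaseq_01", "derived_from", "sample:leaf_07")
add_triple(graph, "dataset:rnaseq_01", "measured_by", "assay:rna_sequencing")
add_triple(graph, "sample:leaf_07", "part_of", "plant:arabidopsis_col0")

linked = reachable(graph, "dataset:rnaseq_01")
```

Starting from the data set, the traversal also reaches the plant the sample was taken from, even though the two are not linked directly. Rejecting terms outside the vocabulary is what keeps such links comparable across contributors.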

Reproducibility and provenance play an important role, especially in the computational analysis itself. Recent efforts to make analytic pipelines independent of their runtime environment have strongly improved the reusability and reproducibility of workflows. Containerization of processing tools and analytic pipelines facilitates the sharing and collaborative development of workflows on specialized platforms like WorkflowHub. Analogously to data, computation requires metadata and specifications. In this regard, the BioCompute Object Project[63] aims to ease the exchange of HTS workflows between organizations by providing a JSON format that, at a minimum, contains all the software versions and parameters necessary to evaluate or verify a computational pipeline.
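To make the notion of computational metadata more tangible, the sketch below assembles a small JSON record of the software versions and parameters used in a pipeline run. It is only loosely inspired by the idea described above: the structure, the field names, and the tool names (read_trimmer, read_aligner) are invented for illustration and do not follow the actual BioCompute Object schema.

```python
import json
import platform
import sys


def build_run_record(pipeline_name, steps):
    """Collect the software versions and parameters needed to re-run a pipeline."""
    return {
        "pipeline": pipeline_name,
        "environment": {
            "python_version": platform.python_version(),
            "platform": sys.platform,
        },
        "steps": [
            {"tool": tool, "version": version, "parameters": dict(params)}
            for tool, version, params in steps
        ],
    }


# Hypothetical two-step HTS pipeline; tools, versions, and parameters are made up.
steps = [
    ("read_trimmer", "1.0.0", {"min_quality": 20}),
    ("read_aligner", "2.3.1", {"max_mismatches": 2}),
]

record = build_run_record("example_hts_pipeline", steps)

# Serialize to JSON so the record can travel with the results, then read it back.
serialized = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(serialized)
```

Because the record survives a round trip through JSON unchanged, it can be stored next to the output files and later compared against the environment in which someone attempts to reproduce the analysis.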

It becomes evident that a combination of computation, data, and their metadata is essential to achieve the common goal of a well-annotated research object that lives up to the FAIR principles.[63][64] Therefore, community-driven initiatives like DataPLANT support plant scientists in all aspects of research data management and provide a tailor-made service environment to contextualize research data according to the FAIR principles with minimal additional effort in modern plant biology.


References

  1. International Human Genome Sequencing Consortium; Whitehead Institute for Biomedical Research, Center for Genome Research:; Lander, Eric S.; Linton, Lauren M.; Birren, Bruce; Nusbaum, Chad; Zody, Michael C.; Baldwin, Jennifer et al. (15 February 2001). "Initial sequencing and analysis of the human genome" (in en). Nature 409 (6822): 860–921. doi:10.1038/35057062. ISSN 0028-0836. http://www.nature.com/articles/35057062. 
  2. The 1000 Genomes Project Consortium (1 November 2012). "An integrated map of genetic variation from 1,092 human genomes" (in en). Nature 491 (7422): 56–65. doi:10.1038/nature11632. ISSN 0028-0836. PMC PMC3498066. PMID 23128226. http://www.nature.com/articles/nature11632. 
  3. Alonso-Blanco, Carlos; Andrade, Jorge; Becker, Claude; Bemm, Felix; Bergelson, Joy; Borgwardt, Karsten M.; Cao, Jun; Chae, Eunyoung et al. (1 July 2016). "1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana" (in en). Cell 166 (2): 481–491. doi:10.1016/j.cell.2016.05.063. PMC PMC4949382. PMID 27293186. https://linkinghub.elsevier.com/retrieve/pii/S0092867416306675. 
  4. Stein, Joshua C.; Yu, Yeisoo; Copetti, Dario; Zwickl, Derrick J.; Zhang, Li; Zhang, Chengjun; Chougule, Kapeel; Gao, Dongying et al. (1 February 2018). "Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza" (in en). Nature Genetics 50 (2): 285–296. doi:10.1038/s41588-018-0040-0. ISSN 1061-4036. http://www.nature.com/articles/s41588-018-0040-0. 
  5. Sun, Hequan; Rowan, Beth A.; Flood, Pádraic J.; Brandt, Ronny; Fuss, Janina; Hancock, Angela M.; Michelmore, Richard W.; Huettel, Bruno et al. (1 December 2019). "Linked-read sequencing of gametes allows efficient genome-wide analysis of meiotic recombination" (in en). Nature Communications 10 (1): 4310. doi:10.1038/s41467-019-12209-2. ISSN 2041-1723. PMC PMC6754367. PMID 31541084. http://www.nature.com/articles/s41467-019-12209-2. 
  6. Sherman, Rachel M.; Salzberg, Steven L. (1 April 2020). "Pan-genomics in the human genome era" (in en). Nature Reviews Genetics 21 (4): 243–254. doi:10.1038/s41576-020-0210-7. ISSN 1471-0056. PMC PMC7752153. PMID 32034321. http://www.nature.com/articles/s41576-020-0210-7. 
  7. Bayer, Philipp E.; Golicz, Agnieszka A.; Scheben, Armin; Batley, Jacqueline; Edwards, David (1 August 2020). "Plant pan-genomes are the new reference" (in en). Nature Plants 6 (8): 914–920. doi:10.1038/s41477-020-0733-0. ISSN 2055-0278. http://www.nature.com/articles/s41477-020-0733-0. 
  8. Weckwerth, Wolfram; Ghatak, Arindam; Bellaire, Anke; Chaturvedi, Palak; Varshney, Rajeev K. (1 July 2020). "PANOMICS meets germplasm" (in en). Plant Biotechnology Journal 18 (7): 1507–1525. doi:10.1111/pbi.13372. ISSN 1467-7644. PMC PMC7292548. PMID 32163658. https://onlinelibrary.wiley.com/doi/10.1111/pbi.13372. 
  9. Scossa, Federico; Alseekh, Saleh; Fernie, Alisdair R. (1 February 2021). "Integrating multi-omics data for crop improvement" (in en). Journal of Plant Physiology 257: 153352. doi:10.1016/j.jplph.2020.153352. https://linkinghub.elsevier.com/retrieve/pii/S017616172030242X. 
  10. Liberman, Louisa M; Sozzani, Rosangela; Benfey, Philip N (1 April 2012). "Integrative systems biology: an attempt to describe a simple weed" (in en). Current Opinion in Plant Biology 15 (2): 162–167. doi:10.1016/j.pbi.2012.01.004. PMC PMC3435099. PMID 22277598. https://linkinghub.elsevier.com/retrieve/pii/S1369526612000052. 
  11. Pinu, Farhana R.; Beale, David J.; Paten, Amy M.; Kouremenos, Konstantinos; Swarup, Sanjay; Schirra, Horst J.; Wishart, David (18 April 2019). "Systems Biology and Multi-Omics Integration: Viewpoints from the Metabolomics Research Community" (in en). Metabolites 9 (4): 76. doi:10.3390/metabo9040076. ISSN 2218-1989. PMC PMC6523452. PMID 31003499. https://www.mdpi.com/2218-1989/9/4/76. 
  12. Oberhardt, Matthew A; Palsson, Bernhard Ø; Papin, Jason A (1 January 2009). "Applications of genome‐scale metabolic reconstructions" (in en). Molecular Systems Biology 5 (1): 320. doi:10.1038/msb.2009.77. ISSN 1744-4292. PMC PMC2795471. PMID 19888215. https://onlinelibrary.wiley.com/doi/10.1038/msb.2009.77. 
  13. Lewis, Nathan E.; Nagarajan, Harish; Palsson, Bernhard O. (1 April 2012). "Constraining the metabolic genotype–phenotype relationship using a phylogeny of in silico methods" (in en). Nature Reviews Microbiology 10 (4): 291–305. doi:10.1038/nrmicro2737. ISSN 1740-1526. PMC PMC3536058. PMID 22367118. http://www.nature.com/articles/nrmicro2737. 
  14. Ramon, Charlotte; Gollub, Mattia G.; Stelling, Jörg (26 October 2018). "Integrating –omics data into genome-scale metabolic network models: principles and challenges" (in en). Essays in Biochemistry 62 (4): 563–574. doi:10.1042/EBC20180011. ISSN 0071-1365. https://portlandpress.com/essaysbiochem/article/62/4/563/78519/Integrating-omics-data-into-genome-scale-metabolic. 
  15. Gomes de Oliveira Dal’Molin, Cristiana; Nielsen, Lars Keld (1 February 2018). "Plant genome-scale reconstruction: from single cell to multi-tissue modelling and omics analyses" (in en). Current Opinion in Biotechnology 49: 42–48. doi:10.1016/j.copbio.2017.07.009. https://linkinghub.elsevier.com/retrieve/pii/S0958166917301052. 
  16. de Oliveira Dal’Molin, Cristiana Gomes; Quek, Lake-Ee; Palfreyman, Robin William; Brumbley, Stevens Michael; Nielsen, Lars Keld (1 December 2010). "C4GEM, a Genome-Scale Metabolic Model to Study C4 Plant Metabolism" (in en). Plant Physiology 154 (4): 1871–1885. doi:10.1104/pp.110.166488. ISSN 1532-2548. PMC PMC2996019. PMID 20974891. https://academic.oup.com/plphys/article/154/4/1871/6108787. 
  17. de Oliveira Dal'Molin, Cristiana Gomes; Quek, Lake-Ee; Palfreyman, Robin William; Brumbley, Stevens Michael; Nielsen, Lars Keld (3 February 2010). "AraGEM, a Genome-Scale Reconstruction of the Primary Metabolic Network in Arabidopsis" (in en). Plant Physiology 152 (2): 579–589. doi:10.1104/pp.109.148817. ISSN 1532-2548. PMC PMC2815881. PMID 20044452. https://academic.oup.com/plphys/article/152/2/579/6108441. 
  18. Cheung, C.Y. Maurice; Poolman, Mark G.; Fell, David. A.; Ratcliffe, R. George; Sweetlove, Lee J. (2 June 2014). "A Diel Flux Balance Model Captures Interactions between Light and Dark Metabolism during Day-Night Cycles in C3 and Crassulacean Acid Metabolism Leaves" (in en). Plant Physiology 165 (2): 917–929. doi:10.1104/pp.113.234468. ISSN 1532-2548. PMC PMC4044858. PMID 24596328. https://academic.oup.com/plphys/article/165/2/917/6113238. 
  19. Yuan, Huili; Cheung, C.Y. Maurice; Poolman, Mark G.; Hilbers, Peter A. J.; Riel, Natal A. W. (1 January 2016). "A genome‐scale metabolic network reconstruction of tomato ( Solanum lycopersicum L.) and its application to photorespiratory metabolism" (in en). The Plant Journal 85 (2): 289–304. doi:10.1111/tpj.13075. ISSN 0960-7412. https://onlinelibrary.wiley.com/doi/10.1111/tpj.13075. 
  20. Simons, Margaret; Saha, Rajib; Amiour, Nardjis; Kumar, Akhil; Guillard, Lenaïg; Clément, Gilles; Miquel, Martine; Li, Zhenni et al. (5 November 2014). "Assessing the Metabolic Impact of Nitrogen Availability Using a Compartmentalized Maize Leaf Genome-Scale Model" (in en). Plant Physiology 166 (3): 1659–1674. doi:10.1104/pp.114.245787. ISSN 1532-2548. PMC PMC4226342. PMID 25248718. https://academic.oup.com/plphys/article/166/3/1659/6111218. 
  21. Scheunemann, Michael; Brady, Siobhan M.; Nikoloski, Zoran (1 December 2018). "Integration of large-scale data for extraction of integrated Arabidopsis root cell-type specific models" (in en). Scientific Reports 8 (1): 7919. doi:10.1038/s41598-018-26232-8. ISSN 2045-2322. PMC PMC5962614. PMID 29784955. http://www.nature.com/articles/s41598-018-26232-8. 
  22. Chang, Roger L; Ghamsari, Lila; Manichaikul, Ani; Hom, Erik F Y; Balaji, Santhanam; Fu, Weiqi; Shen, Yun; Hao, Tong et al. (1 January 2011). "Metabolic network reconstruction of Chlamydomonas offers insight into light‐driven algal metabolism" (in en). Molecular Systems Biology 7 (1): 518. doi:10.1038/msb.2011.52. ISSN 1744-4292. PMC PMC3202792. PMID 21811229. https://onlinelibrary.wiley.com/doi/10.1038/msb.2011.52. 
  23. Chaiboonchoe, Amphun; Dohai, Bushra Saeed; Cai, Hong; Nelson, David R.; Jijakli, Kenan; Salehi-Ashtiani, Kourosh (10 December 2014). "Microalgal Metabolic Network Model Refinement through High-Throughput Functional Metabolic Profiling". Frontiers in Bioengineering and Biotechnology 2. doi:10.3389/fbioe.2014.00068. ISSN 2296-4185. PMC PMC4261833. PMID 25540776. http://journal.frontiersin.org/article/10.3389/fbioe.2014.00068/abstract. 
  24. Shene, Carolina; Asenjo, Juan A.; Chisti, Yusuf (1 December 2018). "Metabolic modelling and simulation of the light and dark metabolism of Chlamydomonas reinhardtii" (in en). The Plant Journal 96 (5): 1076–1088. doi:10.1111/tpj.14078. https://onlinelibrary.wiley.com/doi/10.1111/tpj.14078. 
  25. Tong, Hao; Nikoloski, Zoran (1 February 2021). "Machine learning approaches for crop improvement: Leveraging phenotypic and genotypic big data" (in en). Journal of Plant Physiology 257: 153354. doi:10.1016/j.jplph.2020.153354. https://linkinghub.elsevier.com/retrieve/pii/S0176161720302443. 
  26. Thiele, Ines; Palsson, Bernhard Ø (1 January 2010). "A protocol for generating a high-quality genome-scale metabolic reconstruction" (in en). Nature Protocols 5 (1): 93–121. doi:10.1038/nprot.2009.203. ISSN 1754-2189. PMC PMC3125167. PMID 20057383. http://www.nature.com/articles/nprot.2009.203. 
  27. Salzberg, Steven L. (1 December 2019). "Next-generation genome annotation: we still struggle to get it right" (in en). Genome Biology 20 (1): 92. doi:10.1186/s13059-019-1715-2. ISSN 1474-760X. PMC PMC6521345. PMID 31097009. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1715-2. 
  28. Workman, Rachael E.; Tang, Alison D.; Tang, Paul S.; Jain, Miten; Tyson, John R.; Razaghi, Roham; Zuzarte, Philip C.; Gilpatrick, Timothy et al. (1 December 2019). "Nanopore native RNA sequencing of a human poly(A) transcriptome" (in en). Nature Methods 16 (12): 1297–1305. doi:10.1038/s41592-019-0617-2. ISSN 1548-7091. PMC PMC7768885. PMID 31740818. http://www.nature.com/articles/s41592-019-0617-2. 
  29. Szecowka, Marek; Heise, Robert; Tohge, Takayuki; Nunes-Nesi, Adriano; Vosloh, Daniel; Huege, Jan; Feil, Regina; Lunn, John et al. (26 March 2013). "Metabolic Fluxes in an Illuminated Arabidopsis Rosette" (in en). The Plant Cell 25 (2): 694–714. doi:10.1105/tpc.112.106989. ISSN 1532-298X. PMC PMC3608787. PMID 23444331. https://academic.oup.com/plcell/article/25/2/694/6096534. 
  30. Chen, Yanmei; Wang, Yi; Yang, Jun; Zhou, Wenbin; Dai, Shaojun (1 July 2021). "Exploring the diversity of plant proteome" (in en). Journal of Integrative Plant Biology 63 (7): 1197–1210. doi:10.1111/jipb.13087. ISSN 1672-9072. https://onlinelibrary.wiley.com/doi/10.1111/jipb.13087. 
  31. Seiler, Franka; Soll, Jürgen; Bölter, Bettina (13 June 2017). "Comparative Phenotypical and Molecular Analyses of Arabidopsis Grown under Fluorescent and LED Light" (in en). Plants 6 (2): 24. doi:10.3390/plants6020024. ISSN 2223-7747. PMC PMC5489796. PMID 28608805. https://www.mdpi.com/2223-7747/6/2/24. 
  32. Ara, Takeshi; Enomoto, Mitsuo; Arita, Masanori; Ikeda, Chiaki; Kera, Kota; Yamada, Manabu; Nishioka, Takaaki; Ikeda, Tasuku et al. (7 April 2015). "Metabolonote: A Wiki-Based Database for Managing Hierarchical Metadata of Metabolome Analyses". Frontiers in Bioengineering and Biotechnology 3. doi:10.3389/fbioe.2015.00038. ISSN 2296-4185. PMC PMC4388006. PMID 25905099. http://journal.frontiersin.org/article/10.3389/fbioe.2015.00038/abstract. 
  33. Meyer, Rachel S. (1 July 2015). "Encouraging metadata curation in the Diversity Seek initiative" (in en). Nature Plants 1 (7): 15099. doi:10.1038/nplants.2015.99. ISSN 2055-0278. http://www.nature.com/articles/nplants201599. 
  34. Kale, Namrata S.; Haug, Kenneth; Conesa, Pablo; Jayseelan, Kalaivani; Moreno, Pablo; Rocca‐Serra, Philippe; Nainala, Venkata Chandrasekhar; Spicer, Rachel A. et al. (1 March 2016). "MetaboLights: An Open‐Access Database Repository for Metabolomics Data" (in en). Current Protocols in Bioinformatics 53 (1). doi:10.1002/0471250953.bi1413s53. ISSN 1934-3396. https://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1413s53. 
  35. Sulpice, Ronan; Flis, Anna; Ivakov, Alexander A.; Apelt, Federico; Krohn, Nicole; Encke, Beatrice; Abel, Christin; Feil, Regina et al. (1 January 2014). "Arabidopsis Coordinates the Diurnal Regulation of Carbon Allocation and Growth across a Wide Range of Photoperiods" (in en). Molecular Plant 7 (1): 137–155. doi:10.1093/mp/sst127. https://linkinghub.elsevier.com/retrieve/pii/S1674205214608601. 
  36. Weiszmann, Jakob; Fürtauer, Lisa; Weckwerth, Wolfram; Nägele, Thomas (1 November 2018). "Vacuolar sucrose cleavage prevents limitation of cytosolic carbohydrate metabolism and stabilizes photosynthesis under abiotic stress" (in en). The FEBS Journal 285 (21): 4082–4098. doi:10.1111/febs.14656. ISSN 1742-464X. https://onlinelibrary.wiley.com/doi/10.1111/febs.14656. 
  37. Fernie, Alisdair R.; Trethewey, Richard N.; Krotzky, Arno J.; Willmitzer, Lothar (1 September 2004). "Metabolite profiling: from diagnostics to systems biology" (in en). Nature Reviews Molecular Cell Biology 5 (9): 763–769. doi:10.1038/nrm1451. ISSN 1471-0072. http://www.nature.com/articles/nrm1451. 
  38. Fang, Chuanying; Fernie, Alisdair R.; Luo, Jie (1 January 2019). "Exploring the Diversity of Plant Metabolism" (in en). Trends in Plant Science 24 (1): 83–98. doi:10.1016/j.tplants.2018.09.006. https://linkinghub.elsevier.com/retrieve/pii/S1360138518302115. 
  39. Pazhamala, Lekha T.; Kudapa, Himabindu; Weckwerth, Wolfram; Millar, A. Harvey; Varshney, Rajeev K. (1 July 2021). "Systems biology for crop improvement" (in en). The Plant Genome 14 (2). doi:10.1002/tpg2.20098. ISSN 1940-3372. https://onlinelibrary.wiley.com/doi/10.1002/tpg2.20098. 
  40. Zancarini, Anouk; Westerhuis, Johan A; Smilde, Age K; Bouwmeester, Harro J (1 August 2021). "Integration of omics data to unravel root microbiome recruitment" (in en). Current Opinion in Biotechnology 70: 255–261. doi:10.1016/j.copbio.2021.06.016. https://linkinghub.elsevier.com/retrieve/pii/S0958166921001002. 
  41. De Vijlder, Thomas; Valkenborg, Dirk; Lemière, Filip; Romijn, Edwin P.; Laukens, Kris; Cuyckens, Filip (1 September 2018). "A tutorial in small molecule identification via electrospray ionization-mass spectrometry: The practical art of structural elucidation" (in en). Mass Spectrometry Reviews 37 (5): 607–629. doi:10.1002/mas.21551. PMC PMC6099382. PMID 29120505. https://onlinelibrary.wiley.com/doi/10.1002/mas.21551. 
  42. Webb, Sarah (1 February 2018). "Deep learning for biology" (in en). Nature 554 (7693): 555–557. doi:10.1038/d41586-018-02174-z. ISSN 0028-0836. http://www.nature.com/articles/d41586-018-02174-z. 
  43. Karp, Peter D. (2016). "Can we replace curation with information extraction software?" (in en). Database 2016: baw150. doi:10.1093/database/baw150. ISSN 1758-0463. PMC PMC5199131. PMID 28025341. https://academic.oup.com/database/article-lookup/doi/10.1093/database/baw150. 
  44. The UniProt Consortium (8 January 2019). "UniProt: a worldwide hub of protein knowledge" (in en). Nucleic Acids Research 47 (D1): D506–D515. doi:10.1093/nar/gky1049. ISSN 0305-1048. PMC PMC6323992. PMID 30395287. https://academic.oup.com/nar/article/47/D1/D506/5160987. 
  45. Bolser, Dan; Staines, Daniel M.; Pritchard, Emily; Kersey, Paul (2016), Edwards, David, ed., "Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data", Plant Bioinformatics (New York, NY: Springer New York) 1374: 115–140, doi:10.1007/978-1-4939-3167-5_6, ISBN 978-1-4939-3166-8, http://link.springer.com/10.1007/978-1-4939-3167-5_6. Retrieved 2021-12-17 
  46. Van Bel, Michiel; Diels, Tim; Vancaester, Emmelien; Kreft, Lukasz; Botzki, Alexander; Van de Peer, Yves; Coppens, Frederik; Vandepoele, Klaas (4 January 2018). "PLAZA 4.0: an integrative resource for functional, evolutionary and comparative plant genomics" (in en). Nucleic Acids Research 46 (D1): D1190–D1196. doi:10.1093/nar/gkx1002. ISSN 0305-1048. PMC PMC5753339. PMID 29069403. http://academic.oup.com/nar/article/46/D1/D1190/4561641. 
  47. Berardini, Tanya Z.; Reiser, Leonore; Li, Donghui; Mezheritsky, Yarik; Muller, Robert; Strait, Emily; Huala, Eva (1 August 2015). "The Arabidopsis Information Resource: Making and mining the “gold standard” annotated reference plant genome" (in en). genesis 53 (8): 474–485. doi:10.1002/dvg.22877. PMC PMC4545719. PMID 26201819. https://onlinelibrary.wiley.com/doi/10.1002/dvg.22877. 
  48. Krishnakumar, Vivek; Hanlon, Matthew R.; Contrino, Sergio; Ferlanti, Erik S.; Karamycheva, Svetlana; Kim, Maria; Rosen, Benjamin D.; Cheng, Chia-Yi et al. (28 January 2015). "Araport: the Arabidopsis Information Portal" (in en). Nucleic Acids Research 43 (D1): D1003–D1009. doi:10.1093/nar/gku1200. ISSN 1362-4962. PMC PMC4383980. PMID 25414324. http://academic.oup.com/nar/article/43/D1/D1003/2439074/Araport-the-Arabidopsis-Information-Portal. 
  49. Schwacke, Rainer; Schneider, Anja; van der Graaff, Eric; Fischer, Karsten; Catoni, Elisabetta; Desimone, Marcelo; Frommer, Wolf B.; Flügge, Ulf-Ingo et al. (1 January 2003). "ARAMEMNON, a Novel Database for Arabidopsis Integral Membrane Proteins" (in en). Plant Physiology 131 (1): 16–26. doi:10.1104/pp.011577. ISSN 1532-2548. PMC PMC166783. PMID 12529511. https://academic.oup.com/plphys/article/131/1/16/6114365. 
  50. Goodstein, David M.; Shu, Shengqiang; Howson, Russell; Neupane, Rochak; Hayes, Richard D.; Fazo, Joni; Mitros, Therese; Dirks, William et al. (1 January 2012). "Phytozome: a comparative platform for green plant genomics" (in en). Nucleic Acids Research 40 (D1): D1178–D1186. doi:10.1093/nar/gkr944. ISSN 1362-4962. PMC PMC3245001. PMID 22110026. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkr944. 
  51. Vizcaíno, Juan A; Deutsch, Eric W; Wang, Rui; Csordas, Attila; Reisinger, Florian; Ríos, Daniel; Dianes, José A; Sun, Zhi et al. (1 March 2014). "ProteomeXchange provides globally coordinated proteomics data submission and dissemination" (in en). Nature Biotechnology 32 (3): 223–226. doi:10.1038/nbt.2839. ISSN 1087-0156. PMC PMC3986813. PMID 24727771. http://www.nature.com/articles/nbt.2839. 
  52. Clough, Emily; Barrett, Tanya (2016), Mathé, Ewy; Davis, Sean, eds., "The Gene Expression Omnibus Database", Statistical Genomics (New York, NY: Springer New York) 1418: 93–110, doi:10.1007/978-1-4939-3578-9_5, ISBN 978-1-4939-3576-5, PMC PMC4944384, PMID 27008011, http://link.springer.com/10.1007/978-1-4939-3578-9_5. Retrieved 2021-12-17 
  53. Leinonen, R.; Sugawara, H.; Shumway, M.; on behalf of the International Nucleotide Sequence Database Collaboration (1 January 2011). "The Sequence Read Archive" (in en). Nucleic Acids Research 39 (Database): D19–D21. doi:10.1093/nar/gkq1019. ISSN 0305-1048. PMC PMC3013647. PMID 21062823. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkq1019. 
  54. Haug, Kenneth; Salek, Reza M.; Conesa, Pablo; Hastings, Janna; de Matos, Paula; Rijnbeek, Mark; Mahendraker, Tejasvi; Williams, Mark et al. (1 January 2013). "MetaboLights—an open-access general-purpose repository for metabolomics studies and associated meta-data" (in en). Nucleic Acids Research 41 (D1): D781–D786. doi:10.1093/nar/gks1004. ISSN 0305-1048. PMC PMC3531110. PMID 23109552. http://academic.oup.com/nar/article/41/D1/D781/1050654/MetaboLightsan-openaccess-generalpurpose. 
  55. Taylor, Chris F; Field, Dawn; Sansone, Susanna-Assunta; Aerts, Jan; Apweiler, Rolf; Ashburner, Michael; Ball, Catherine A; Binz, Pierre-Alain et al. (1 August 2008). "Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project" (in en). Nature Biotechnology 26 (8): 889–896. doi:10.1038/nbt.1411. ISSN 1087-0156. PMC PMC2771753. PMID 18688244. http://www.nature.com/articles/nbt.1411. 
  56. Zimmermann, Philip; Schildknecht, Beatrice; Craigon, David; Garcia-Hernandez, Margarita; Gruissem, Wilhelm; May, Sean; Mukherjee, Gaurab; Parkinson, Helen et al. (1 December 2006). "MIAME/Plant – adding value to plant microarrray experiments" (in en). Plant Methods 2 (1): 1. doi:10.1186/1746-4811-2-1. ISSN 1746-4811. PMC PMC1334190. PMID 16401339. https://plantmethods.biomedcentral.com/articles/10.1186/1746-4811-2-1. 
  57. Krajewski, Paweł; Chen, Dijun; Ćwiek, Hanna; van Dijk, Aalt D.J.; Fiorani, Fabio; Kersey, Paul; Klukas, Christian; Lange, Matthias et al. (1 September 2015). "Towards recommendations for metadata and data handling in plant phenotyping" (in en). Journal of Experimental Botany 66 (18): 5417–5427. doi:10.1093/jxb/erv271. ISSN 0022-0957. https://academic.oup.com/jxb/article-lookup/doi/10.1093/jxb/erv271. 
  58. Leonelli, Sabina (5 April 2019). "The challenges of big data biology" (in en). eLife 8: e47381. doi:10.7554/eLife.47381. ISSN 2050-084X. PMC PMC6450665. PMID 30950793. https://elifesciences.org/articles/47381. 
  59. Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem et al. (1 December 2016). "The FAIR Guiding Principles for scientific data management and stewardship" (in en). Scientific Data 3 (1): 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC PMC4792175. PMID 26978244. http://www.nature.com/articles/sdata201618. 
  60. Hettne, Kristina M; Dharuri, Harish; Zhao, Jun; Wolstencroft, Katherine; Belhajjame, Khalid; Soiland-Reyes, Stian; Mina, Eleni; Thompson, Mark et al. (1 December 2014). "Structuring research methods and data with the research object model: genomics workflows as a case study" (in en). Journal of Biomedical Semantics 5 (1): 41. doi:10.1186/2041-1480-5-41. ISSN 2041-1480. PMC PMC4177597. PMID 25276335. https://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-5-41. 
  61. Carragáin, Eoghan Ó; Goble, Carole; Sefton, Peter; Soiland-Reyes, Stian (20 June 2019). A lightweight approach to research object data packaging. doi:10.5281/ZENODO.3250687. https://zenodo.org/record/3250687. 
  62. González-Beltrán, Alejandra; Maguire, Eamonn; Sansone, Susanna-Assunta; Rocca-Serra, Philippe (1 December 2014). "linkedISA: semantic representation of ISA-Tab experimental metadata" (in en). BMC Bioinformatics 15 (S14): S4. doi:10.1186/1471-2105-15-S14-S4. ISSN 1471-2105. PMC PMC4255742. PMID 25472428. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S14-S4. 
  63. Simonyan, Vahan; Goecks, Jeremy; Mazumder, Raja (2017). "Biocompute Objects—A Step towards Evaluation and Validation of Biomedical Scientific Computations" (in en). PDA Journal of Pharmaceutical Science and Technology 71 (2): 136–146. doi:10.5731/pdajpst.2016.006734. ISSN 1079-7440. PMC PMC5510742. PMID 27974626. http://journal.pda.org/lookup/doi/10.5731/pdajpst.2016.006734. 
  64. Vicente-Saez, Ruben; Martinez-Fuentes, Clara (1 July 2018). "Open Science now: A systematic literature review for an integrated definition" (in en). Journal of Business Research 88: 428–436. doi:10.1016/j.jbusres.2017.12.043. https://linkinghub.elsevier.com/retrieve/pii/S0148296317305441. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; however, this version lists them in order of appearance, by design.