Difference between revisions of "Journal:Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces"
|website = [https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0130-z https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0130-z]
|download = [https://biodatamining.biomedcentral.com/track/pdf/10.1186/s13040-017-0130-z?site=biodatamining.biomedcentral.com https://biodatamining.biomedcentral.com/track/pdf/10.1186/s13040-017-0130-z] (PDF)
}}
==Implementation==
===Framework===
Here, a simple framework for the rapid development of specialized databases is proposed. This framework integrates several database technologies into a unified platform that allows the user to search and retrieve data from existing sources, combine established data with newly discovered information in a single data model, define an object-based data architecture, and persist it to memory for future inquiries. The framework consists of three main data streams: from existing data sources to the user (search and retrieve data); from new information sources, such as experiments and data analytics, to the user (generate data); and from the individual user to a self-curated specialized database (model, store and retrieve data) (Fig. 1). The platform enables the user to search and retrieve data from local databases using structured information interfaces, and from online databases using structured URL interfaces. The user interface consists of a database repository, which defines the differently structured access paths to existing resources, and a query generation engine, which provides advanced search mechanisms such as field-oriented search and logical relations among search terms. The user designs their own data architecture using object-oriented programming (OOP), in which new data and database-derived information are encapsulated in class-defined entities; these entities form the database schema. Entities are persisted to memory with a persistent agent (PA). The PA uses an object-relational mapper to map the object-based schema to a set of interconnected tables that resemble a typical relational database, which allows the user to query the database with SQL-like queries. The relational model is transferred to a database manager that stores the data in memory.
[[File:Fig1 Tsur BioDataMining2017 10.gif|567px]] | |||
{{clear}} | |||
{| | |||
| STYLE="vertical-align:top;"| | |||
{| border="0" cellpadding="5" cellspacing="0" width="567px" | |||
|- | |||
| style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' Framework schematics. Our framework is composed of structured information interfaces, used to search and fetch information from local and online repositories, incorporation of user generated new data, query generation interface and data source selection, a persistent agent which persists user-defined object schema to memory and finally, a database manager. This framework supports generation, search and retrieval of data, as well as modeling, storing and searching persisted schema.</blockquote> | |||
|- | |||
|} | |||
|} | |||
This framework can be implemented in various ways using different programming languages and library providers. For example, Southern and colleagues developed a Java API which maps entities to NCBI's PubChem schema and provides wrapper functions for calling NCBI eUtilities and PubChem web services.<ref name="SouthernAJava11">{{cite journal |title=A Java API for working with PubChem datasets |journal=Bioinformatics |author=Southern, M.R.; Griffin, P.R. |volume=27 |issue=5 |pages=741–2 |year=2011 |doi=10.1093/bioinformatics/btq715 |pmid=21216779 |pmc=PMC3105478}}</ref> Another implementation is the BioPython project, which provides modules to access NCBI's databases from within Python.<ref name="CockBiopython09">{{cite journal |title=Biopython: Freely available Python tools for computational molecular biology and bioinformatics |journal=Bioinformatics |author=Cock, P.J.; Antao, T.; Chang, J.T. et al. |volume=25 |issue=11 |pages=1422-3 |year=2009 |doi=10.1093/bioinformatics/btp163 |pmid=19304878 |pmc=PMC2682512}}</ref> Object persistence can be achieved both in Python, using its standard library, which supports a family of hash-based file formats and object serialization, and in Java, using the Java Persistence API (JPA). Most JPA persistence providers offer the option to automatically create the database schema based on metadata. Popular implementations of JPA are Hibernate, EclipseLink and Apache OpenJPA. Common database managers include Apache Derby, Firebird, SQLite, and HSQL.
===Software description=== | |||
The proposed framework for the development of specialized databases can be implemented using different resources. Here, various open-source and free resources were utilized for implementation. Java was chosen as the development environment, EclipseLink as the JPA provider and Apache Derby as the database manager. Java was used to create interfaces to online databases such as MalaCards, Biomodels and NCBI's databases. The structured information interface was used to derive data from local repositories such as Aneurisk, which was downloaded from the Aneurisk web dataset. By integrating these tools with a series of data parsers, a powerful framework for the curation of specialized databases is provided, which would incorporate new data and database-derived information into a user-defined database architecture. A schematic of the implementation is presented in Fig. 2. | |||
[[File:Fig2 Tsur BioDataMining2017 10.gif|584px]] | |||
{{clear}} | |||
{| | |||
| STYLE="vertical-align:top;"| | |||
{| border="0" cellpadding="5" cellspacing="0" width="584px" | |||
|- | |||
| style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 2.''' Implemented framework. Our implementation of the proposed framework utilizes Java as the development environment, EclipseLink as the JPA provider and Apache Derby as the database manager. Parsing layers were based on J3D, jsoup, Apache Commons and Org.w3c libraries. We implemented structured interfaces to various databases including MalaCards, Biomodels and several NCBI’s data sets.</blockquote> | |||
|- | |||
|} | |||
|} | |||
The main data stream from established online databases was implemented using a structured URL interface. Databases often use a fixed URL syntax which translates a standard set of input parameters into the values necessary to search and retrieve requested data. For example, Entrez Programming Utilities provide a structured URL interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and associated biomedical literature.<ref name="NCBIEntrez">{{cite web |url=https://www.ncbi.nlm.nih.gov/books/NBK25501/ |title=Entrez Programming Utilities Help |author=National Center for Biotechnology Information; U.S. National Library of Medicine |date=2017}}</ref> A series of data processing tools was utilized to implement parsers for syntactic analysis of the retrieved data. The w3c.dom package provides the Document Object Model (DOM) interfaces, which were used as the API for XML processing. The Apache Commons libraries, the jsoup library and the org.j3d library of the Java 3D Community were utilized for CSV, HTML and STL parsing, respectively. An API was developed with which the user can specify the database she wants to search and retrieve data from. This API contains an expandable database repository and a simple query generation engine which can be used to search a specified database. The user uses Java OOP to encapsulate the retrieved data and integrate it with her own data. The mapping between Java objects and database tables is defined via persistence metadata, which is used by the JPA provider (EclipseLink) to execute SQL-like database operations for static and dynamic queries. EclipseLink defines the metadata via annotations in the Java class and with XML. The open-source relational database Apache Derby, part of the Apache DB Project, was used as the database manager. Derby is written in Java and is suitable for embedding due to its small footprint and ease of use. Derby supports SQL data storage and querying in a client/server operation mode, commonly used by specialized databases.
The framework consists of five packages, each encapsulating a family of associated functionalities. The UML package diagram is shown in Additional file 1: Figure S1. The database package contains enumerated types which represent specific search fields for advanced search, types of retrieved information and available databases. The URL Interfaces package consists of a series of classes providing structured URL access to the different databases. The Parsers package consists of classes for HTML, XML and CSV parsing. Each of those classes is generalized by model specific parsers as will be discussed below. The Persistency package consists of the persistent Agent class that can persist the model’s objects to memory. Finally, the Model package consists of the model’s specific classes as will also be discussed below. Simplified UML views of the classes’ diagram of each of the packages are shown in Additional file 1: Figures S2–5. | |||
The use of the system is quite simple. For example, a query for retrieving two articles according to their PMIDs from NCBI's PubMed database can be generated using: | |||
<code>Query query = new Query();
query.setDatabase(DBType.PUBMED);
query.addId("23371018");
query.addId("10227670");
query.setSearchType(SearchType.FETCH);</code>
<br /> | |||
The following structured URL is automatically generated by the framework as a result: | |||
<nowiki>http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=23371018,10227670&retmode=xml</nowiki> | |||
<br /> | |||
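How such a fetch URL might be assembled can be sketched as follows. This is a minimal illustration and not the framework's own code: the class name <code>EfetchUrlSketch</code> and its methods are hypothetical, and only the base address and parameter layout are taken from the generated URL shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds an efetch URL from a database name and a list of record IDs. */
final class EfetchUrlSketch {
    // Base address copied from the generated URL shown in the text.
    private static final String BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    private final String database;
    private final List<String> ids = new ArrayList<>();

    EfetchUrlSketch(String database) {
        this.database = database;
    }

    /** Adds one record ID and returns this builder so calls can be chained. */
    EfetchUrlSketch addId(String id) {
        ids.add(id);
        return this;
    }

    /** Joins the IDs with commas and appends the fixed efetch parameters. */
    String toUrl() {
        return BASE + "efetch.fcgi?db=" + database
                + "&id=" + String.join(",", ids)
                + "&retmode=xml";
    }

    public static void main(String[] args) {
        String url = new EfetchUrlSketch("pubmed")
                .addId("23371018")
                .addId("10227670")
                .toUrl();
        System.out.println(url);
    }
}
```

The sketch only concatenates strings; a production implementation would presumably also validate the IDs and URL-encode any free-text parameters.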
More complicated queries can also be produced. For example, to retrieve publication IDs for all articles published in the journal ''Science'' in 2009, with the terms “breast” and “cancer,” the following query can be generated: | |||
<code>Query query = new Query(); | |||
query.setDatabase(DBType.PUBMED); | |||
query.addTerm("breast"); | |||
query.addTerm("cancer"); | |||
query.addField(SearchFields.JOURNAL, "science"); | |||
query.addField(SearchFields.PUBLICATION_DATA, "2009"); | |||
query.setSearchType(SearchType.SEARCH); | |||
List<String> results = Entrez.searchEntrez(query);</code>
<br /> | |||
As a result, the following structured URL is automatically generated by the framework: | |||
<nowiki>http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+2009[pdat]+breast+AND+cancer&retmode=xml&rettype=uilist</nowiki> | |||
<br /> | |||
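The translation of field-oriented search terms into the <code>term</code> parameter can be illustrated with a small sketch. The class <code>EsearchUrlSketch</code>, its <code>Field</code> enum and the ordering rule (field-restricted values first, then free terms joined by <code>AND</code>) are assumptions chosen to reproduce the URL shown above; the framework's actual query engine may work differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: builds an esearch URL from free terms and field-restricted values. */
final class EsearchUrlSketch {
    /** Search fields and their bracketed tags, as they appear in the URL in the text. */
    enum Field {
        JOURNAL("journal"),
        PUBLICATION_DATE("pdat");

        final String tag;

        Field(String tag) {
            this.tag = tag;
        }
    }

    private static final String BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    private final String database;
    private final List<String> terms = new ArrayList<>();
    // LinkedHashMap keeps the fields in the order in which they were added.
    private final Map<Field, String> fields = new LinkedHashMap<>();

    EsearchUrlSketch(String database) {
        this.database = database;
    }

    EsearchUrlSketch addTerm(String term) {
        terms.add(term);
        return this;
    }

    EsearchUrlSketch addField(Field field, String value) {
        fields.put(field, value);
        return this;
    }

    /** Emits field-restricted values first (value[tag]), then the free terms joined by AND. */
    String toUrl() {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<Field, String> entry : fields.entrySet()) {
            parts.add(entry.getValue() + "[" + entry.getKey().tag + "]");
        }
        if (!terms.isEmpty()) {
            parts.add(String.join("+AND+", terms));
        }
        return BASE + "esearch.fcgi?db=" + database
                + "&term=" + String.join("+", parts)
                + "&retmode=xml&rettype=uilist";
    }

    public static void main(String[] args) {
        String url = new EsearchUrlSketch("pubmed")
                .addTerm("breast")
                .addTerm("cancer")
                .addField(Field.JOURNAL, "science")
                .addField(Field.PUBLICATION_DATE, "2009")
                .toUrl();
        System.out.println(url);
    }
}
```

Keeping the field tags in an enumerated type mirrors the description of the database package, where search fields are represented by enumerated types; the sketch does not URL-encode its input.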
The created query can be sent to the Entrez utilities system for the retrieval of the corresponding articles using: | |||
<code>Document xmlDocs = Entrez.callEntrez(query);</code> | |||
<br /> | |||
The retrieved XML documents can then be sent for parsing for the generation of a linked list of "Article" objects: | |||
<code>PubmedParser parser = new PubmedParser(xmlDocs); | |||
parser.parse(); | |||
List<Article> articles = parser.getArticles();</code> | |||
<br /> | |||
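The kind of DOM-based parsing described above can be sketched with the standard <code>javax.xml.parsers</code> and <code>org.w3c.dom</code> APIs. The class <code>PubmedDomSketch</code> and its <code>ArticleRecord</code> type are hypothetical stand-ins for the framework's <code>PubmedParser</code> and <code>Article</code> classes, the XML sample is a hand-written and heavily simplified PubMed-style record, and the use of "?" for a missing element is an assumption modeled on the printed output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch: extracts ID and title from PubMed-style XML using the DOM API. */
final class PubmedDomSketch {
    /** Minimal stand-in for an article entity. */
    static final class ArticleRecord {
        final String id;
        final String title;

        ArticleRecord(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /** Parses the XML text and returns one record per PubmedArticle element. */
    static List<ArticleRecord> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("PubmedArticle");
            List<ArticleRecord> articles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element article = (Element) nodes.item(i);
                articles.add(new ArticleRecord(
                        firstText(article, "PMID"),
                        firstText(article, "ArticleTitle")));
            }
            return articles;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    /** Returns the text of the first descendant with the given tag, or "?" if it is absent. */
    private static String firstText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "?" : matches.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
                + "<PMID>10227670</PMID><Article><ArticleTitle>"
                + "Three dimensional analysis of microaneurysms in the human diabetic retina."
                + "</ArticleTitle></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>";
        for (ArticleRecord a : parse(xml)) {
            System.out.println("ID: " + a.id);
            System.out.println("TITLE: " + a.title);
        }
    }
}
```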
Each of the articles can now be connected to other objects. They can then be persisted to memory using: | |||
<code>for (Article article: articles){ | |||
persistAgent.PersistArticle(article); | |||
}</code> | |||
<br /> | |||
Now, the articles can be retrieved from the persistence agent by simply calling on its showArticles function. This function makes use of a simple SQL-like query: | |||
<code>Query q = entityManager.createQuery("SELECT a FROM Article a");
List<Article> articleList = q.getResultList();</code>
<br /> | |||
The retrieved articles can then be published on the command line to produce: | |||
<blockquote> | |||
<nowiki>---------------------------------------</nowiki> | |||
ID: 23371018 | |||
TITLE: Non-dimensional analysis of retinal microaneurysms: critical threshold for treatment. | |||
AUTHOR: Ezra Elishai | |||
JOURNAL: Integrative biology : quantitative biosciences from nano to macro 5(3), 2013, DOI: 10.1039/c3ib20259c | |||
<nowiki>---------------------------------------</nowiki> | |||
ID: 10227670 | |||
TITLE: Three dimensional analysis of microaneurysms in the human diabetic retina. | |||
AUTHOR: Moore J | |||
JOURNAL: Journal of anatomy 194 (Pt 1)(?), 1999, DOI: ? | |||
<nowiki>---------------------------------------</nowiki> | |||
</blockquote> | |||
==Results== | |||
The proposed framework can be easily adopted for the curation of specialized databases. Here, the curation of a specialized database designed to store and retrieve aneurysm-associated data is demonstrated. Aneurysms characterize important vascular pathologies which, depending on the aneurysm's location and geometry, could potentially cause blindness, stroke and death.<ref name="ChalouhiReview13">{{cite journal |title=Review of cerebral aneurysm formation, growth, and rupture |journal=Stroke |author=Chalouhi, N.; Hoh, B.L.; Hasan, D. |volume=44 |issue=12 |pages=3613-22 |year=2013 |doi=10.1161/STROKEAHA.113.002390 |pmid=24130141}}</ref> The query generation engine was used to retrieve tens of thousands of aneurysm-related research articles from NCBI's PubMed and PMC databases. An XML parser was utilized to analyze the retrieved data and store it in a linked list of "Article" objects; this list encapsulates each article's information. Abstracts are automatically downloaded and stored in the database for easy retrieval. The structured URL interface to the MalaCards database was used to retrieve hundreds of data records of associated human diseases such as aortic aneurysm and cerebritis. The HTML parser was used to parse the information regarding the key characteristics of each disease, which were stored in a linked list of "Disease" objects.
For example, the following simple query for “aneurysm”: | |||
<code>Query query = new Query(); | |||
query.setDatabase(DBType.MALA_CARDS); | |||
query.addTerm("aneurysm");
Document results = MalaCards.callMalaCards(query);</code> | |||
<br /> | |||
can be sent to the MalaCards parser to retrieve related diseases:
<code>MalaCardsParser parser = new MalaCardsParser(results, query); | |||
parser.parse(); | |||
List<Disease> diseases = parser.getDiseases();</code> | |||
<br /> | |||
After persistence is performed (as was earlier described), the retrieved list of diseases can be published to produce (only the first seven retrieved diseases are shown here): | |||
<blockquote> | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Familial Thoracic Aortic Aneurysm and Dissection | |||
Link at MalaCards:/card/familial_thoracic_aortic_aneurysm_and_dissection?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Coronary Aneurysm | |||
Link at MalaCards:/card/coronary_aneurysm?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Angiopathy, Hereditary, with Nephropathy, Aneurysms, and Muscle Cramps | |||
Link at MalaCards:/card/angiopathy_hereditary_with_nephropathy_aneurysms_and_muscle_cramps?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Aneurysmal Bone Cysts | |||
Link at MalaCards:/card/aneurysmal_bone_cysts?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Intracranial Berry Aneurysm | |||
Link at MalaCards:/card/intracranial_berry_aneurysm?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Cerebral Aneurysms | |||
Link at MalaCards:/card/cerebral_aneurysms?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
Name: Loeys-Dietz Syndrome | |||
Link at MalaCards:/card/loeys_dietz_syndrome?search=aneurysm | |||
<nowiki>---------------------------------------</nowiki> | |||
</blockquote> | |||
<br /> | |||
The platform was similarly used to download tens of related biological models such as the differentiation of endothelial cells, downloading the related XML files, parsing them, and encapsulating the data in a linked-list of "Model" objects. Two of them are: | |||
<blockquote> | |||
<nowiki>---------------------------------------</nowiki> | |||
Id: BIOMD0000000058 | |||
Description: The model reproduces the same amplitude antiphase calcium oscillations of coupled… | |||
<nowiki>---------------------------------------</nowiki> | |||
Id: BIOMD0000000291 | |||
Description: adsorption of albumin-bilirubin complex to the surface of carbon pyropolymer… | |||
<nowiki>---------------------------------------</nowiki> | |||
</blockquote> | |||
<br /> | |||
The structured information interface, the CSV parser and the STL loader were utilized to parse data from the Aneurisk repository, which contains clinical data and three-dimensional models of hundreds of aneurysms. Examples of retrieved patients' information are: | |||
<blockquote> | |||
<nowiki>---------------------------------------</nowiki> | |||
Patient ID: C0004 | |||
SEX: F, AGE: 60 | |||
Aneurysm type: TER, location:ICA, status: U | |||
<nowiki>---------------------------------------</nowiki> | |||
Patient ID: C0005 | |||
SEX: F, AGE: 26 | |||
Aneurysm type: LAT, location:ICA, status: R | |||
<nowiki>---------------------------------------</nowiki> | |||
Patient ID: C0006 | |||
SEX: F, AGE: 45 | |||
Aneurysm type: LAT, location:ICA, status: U | |||
<nowiki>---------------------------------------</nowiki> | |||
Patient ID: C0007 | |||
SEX: F, AGE: 44 | |||
Aneurysm type: LAT, location:ICA, status: U | |||
<nowiki>---------------------------------------</nowiki> | |||
Patient ID: C0008 | |||
SEX: M, AGE: 68 | |||
Aneurysm type: TER, location:ACA, status: R | |||
<nowiki>---------------------------------------</nowiki> | |||
</blockquote> | |||
<br /> | |||
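How clinical records of this kind might be read from a CSV file and encapsulated in objects can be sketched as follows. The implementation described in this article uses the Apache Commons CSV library; the sketch below substitutes plain string splitting from the Java standard library so that it is self-contained. The class <code>PatientCsvSketch</code> and the column order (ID, sex, age, type, location, status) are assumptions for illustration and do not reflect the actual layout of the Aneurisk files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: parses patient rows from CSV text with a header line. */
final class PatientCsvSketch {
    /** Minimal stand-in for a patient entity. */
    static final class Patient {
        final String id;
        final String sex;
        final int age;
        final String type;
        final String location;
        final String status;

        Patient(String id, String sex, int age, String type, String location, String status) {
            this.id = id;
            this.sex = sex;
            this.age = age;
            this.type = type;
            this.location = location;
            this.status = status;
        }

        /** Formats the record in the same layout as the listing in the text. */
        String describe() {
            return "Patient ID: " + id + "\n"
                    + "SEX: " + sex + ", AGE: " + age + "\n"
                    + "Aneurysm type: " + type + ", location:" + location + ", status: " + status;
        }
    }

    /** Skips the header line and converts each remaining non-empty line to a Patient. */
    static List<Patient> parse(String csv) {
        List<Patient> patients = new ArrayList<>();
        String[] lines = csv.split("\\R");
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] cols = line.split(",", -1);
            if (cols.length < 6) {
                throw new IllegalArgumentException("Expected 6 columns: " + line);
            }
            patients.add(new Patient(
                    cols[0].trim(), cols[1].trim(), Integer.parseInt(cols[2].trim()),
                    cols[3].trim(), cols[4].trim(), cols[5].trim()));
        }
        return patients;
    }

    public static void main(String[] args) {
        String csv = "id,sex,age,type,location,status\n"
                + "C0004,F,60,TER,ICA,U\n"
                + "C0005,F,26,LAT,ICA,R\n";
        for (Patient p : parse(csv)) {
            System.out.println("---------------------------------------");
            System.out.println(p.describe());
        }
    }
}
```

Plain splitting does not handle quoted fields or embedded commas, which is one reason a dedicated CSV library is the more robust choice in practice.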
Aneurysm data was integrated with our previously published criteria of aneurysm risk of rupture.<ref name="EzraNondimens13" /> A schematic of the database with its different associations is shown in Fig. 3. Simplified UML views of the models’ related classes are presented in Additional file 1: Figures S6–7.
[[File:Fig3 Tsur BioDataMining2017 10.gif|576px]] | |||
{{clear}} | |||
{| | |||
| STYLE="vertical-align:top;"| | |||
{| border="0" cellpadding="5" cellspacing="0" width="576px" | |||
|- | |||
| style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 3.''' Construction of a specialized database. We have demonstrated generation of a specialized database for aneurysm-related vascular pathologies. This database contains three-dimensional geometries of aneurysms, patients’ clinical information, articles, mathematical biological models, related diseases and our model of aneurysms’ risk of rupture.</blockquote>
|- | |||
|} | |||
|} | |||
The persisted database can now be queried with SQL-like commands. For example, all relevant information regarding a patient, including articles, models and diseases, can be easily retrieved according to the patient's identification number (in this case: CD0985674) using:
<code>Query q = entityManager.createQuery | |||
("SELECT p FROM Aneurysm p WHERE p.patientID = :patientID"); | |||
q.setParameter("patientID", "CD0985674");</code>
==Discussion== | |||
In the last two decades a tremendous interest has developed in computational biology and bioinformatics, disciplines which have emerged from the intersection of biology and computer science. Practically, bioinformatics became a fertile new ground for programmers, who have gained access to an entirely new class of questions and challenges.<ref name="KanehisaBioinfo03">{{cite journal |title=Bioinformatics in the post-sequence era |journal=Nature Genetics |author=Kanehisa, M.; Bork, P. |volume=33 |issue=Supp 1 |pages=305–10 |year=2003 |doi=10.1038/ng1109 |pmid=12610540}}</ref> Commonly used software packages include the Bio*.org projects, such as BioRuby<ref name="GotoBioRuby10">{{cite journal |title=BioRuby: Bioinformatics software for the Ruby programming language |journal=Bioinformatics |author=Goto, N.; Prins, P.; Nakao, M. et al. |volume=26 |issue=20 |pages=2617–19 |year=2010 |doi=10.1093/bioinformatics/btq475 |pmid=20739307 |pmc=PMC2951089}}</ref>, BioPerl<ref name="StajichAnIntro07">{{cite book |chapter=An Introduction to BioPerl |title=Plant Bioinformatics |author=Stajich, J.E. |editor=Edwards, D. |series=Methods in Molecular Biology |volume=406 |pages=538–548 |year=2007 |doi=10.1007/978-1-59745-535-0_26 |isbn=9781597455350}}</ref>, BioJava<ref name="HollandBioJava08">{{cite journal |title=BioJava: An open-source framework for bioinformatics |journal=Bioinformatics |author=Holland, R.C.; Down, T.A.; Pocock, M. et al. |volume=24 |issue=18 |pages=2096-7 |year=2008 |doi=10.1093/bioinformatics/btn397 |pmid=18689808 |pmc=PMC2530884}}</ref> and BioPython<ref name="CockBiopython09" />, which have recently been assembled under the Open Bioinformatics Foundation. Each of these projects represents an international association of developers of open-source code libraries for bioinformatics, [[genomics]] and life science research. However, these platforms are not oriented for the curation of databases. 
Sequence and non-bibliographic databases constitute the most important cornerstone for research in computational biology and bioinformatics. While primary databases such as NCBI's Nucleotide and Protein databases are of great importance to biological research, specialized databases which serve specific research communities are rapidly developing. During the last decade, hundreds of specialized databases have been developed, each making use of different frameworks and libraries. Although many database development environments exist, they often rely on a tabular structure, where the designer creates objects such as tables, columns, keys, indexes, relationships and constraints. While those basic entities are prevalent for simple data organization, they can rarely meet the needs of researchers who make use of a wide spectrum of data types, from sequencing data and microarray experiments to statistical models and simulations.
Here, a simple and unified open-source framework for the rapid development of specialized databases, based on user-defined objects and relations, is proposed. These objects can be designed with the full arsenal of tools in OOP, giving the user maximum flexibility. It is important to note that the proposed framework aims to assist developers who are capable of building object-oriented data models. After defining the data model (as exemplified in Additional file 1: Figure S7), developers can use the framework to load it with data from the supported web-based and local repositories, and to persist it to and retrieve it from memory. Moreover, this framework can be rapidly extended or modified to support additional parsers and databases. The framework allows the user to concentrate on the biological models, the new data and the database architecture, rather than on concerns regarding data management and access to the different online and local datasets. This implementation is provided with a set of free, open-source tools, to increase availability and to enable ease of use. The framework can be easily utilized to work with the variety of bioinformatics tools available via the open-source BioJava project.
The most important aspect of the proposed work is the integration of the technologies most relevant to OO-based database design in a single framework. This is in clear contrast to BioPython/BioJava, which emphasize utilization of algorithms for bioinformatics. While the proposed framework is focused on database design, BioPython/BioJava are focused on fundamental bioinformatics tasks ranging from sequence alignment to molecular structure prediction. Obviously, some aspects of the proposed framework, such as interfacing with web-based databases, are congruent with BioPython/BioJava. However, the context of use is largely different. Moreover, by utilizing flexible structuring of URLs, the proposed framework also supports interfaces to databases such as MalaCards and BioModels. We note that another project, BioServices<ref name="CokelaerBioServices13">{{cite journal |title=BioServices: A common Python package to access biological Web Services programmatically |journal=Bioinformatics |author=Cokelaer, T.; Pultz, D.; Harder, L.M. et al. |volume=29 |issue=24 |pages=3241-2 |year=2013 |doi=10.1093/bioinformatics/btt547 |pmid=24064416 |pmc=PMC3842755}}</ref>, does indeed provide an interface to BioModels. Moreover, our framework provides an interface to Apache Derby, a strong ORM-based database manager. This is currently, to our knowledge, not embedded within BioPython/BioJava. We note that another Python package named BioSQL<ref name="IvesTheORCHESTRA08">{{cite journal |title=The ORCHESTRA Collaborative Data Sharing System |journal=ACM SIGMOD Record |author=Ives, Z.G.; Green, T.J.; Karvounarakis, G. et al. |volume=37 |issue=3 |pages=26-32 |year=2008 |doi=10.1145/1462571.1462577}}</ref> offers interfacing for SQL-based relational databases, in contrast to our framework, which is based on ORM.
The creation of a specialized database for aneurysm-associated data and possible queries was demonstrated. A more substantial "data to knowledge" utility of this database can be developed. For example, our predictive model of aneurysms’ risk of rupture uses a non-dimensional analysis of fluid dynamics to set a critical geometrical threshold of treatment.<ref name="EzraNondimens13" /> Piccinelli and colleagues could derive geometrical measures of patients’ aneurysms’ geometries, which can be used with the prediction model to determine risk of rupture.<ref name="PiccinelliAuto12">{{cite journal |title=Automatic neck plane detection and 3D geometric characterization of aneurysmal sacs |journal=Annals of Biomedical Engineering |author=Piccinelli, M.; Steinman, D.A.; Hoi, Y. et al. |volume=40 |issue=10 |pages=2188-211 |year=2012 |doi=10.1007/s10439-012-0577-5 |pmid=22532324}}</ref> These patients’ geometries are incorporated as STL files in this database. A specialized database of aneurysm-related data can be used to further investigate the links between the different aneurysm-associated diseases and the underlying biological models to advance knowledge in this field.
==Declarations== | |||
===Acknowledgements=== | |||
The author wishes to thank Aviad Ezra and Tamara Pearlman for their insightful comments.
====Funding==== | |||
This work was supported by a JCT research grant. | |||
====Availability of data and materials==== | |||
A simplified version of the framework, with additional code examples, is provided at the NBEL-lab.com website. To ensure public access to the files, the source code was also uploaded to GitHub at https://github.com/NBEL-lab/BioDataMining/. As described above, the framework depends on a series of freely accessible modules: commons-csv V1.2 (https://commons.apache.org/proper/commons-csv/), Apache Derby (https://db.apache.org/derby/), EclipseLink (https://eclipse.org/eclipselink/downloads/), javax persistence V2.1 (http://mvnrepository.com/artifact/org.eclipse.persistence/javax.persistence/2.1.0), and jsoup V1.8.3 (http://jsoup.org/download). All necessary libraries are available in the project's lib directory. A tutorial and installation instructions are located in the framework’s client/Client.java.
Project name: development of entity-based data models for bioinformatics | |||
Operating system(s): cross platform | |||
Programming language: Java | |||
Licence: GNU General Public License V3 | |||
====Authors' contributions==== | |||
EET designed the framework, developed the code and wrote the manuscript. | |||
====Competing interests==== | |||
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. | |||
====Consent for publication==== | |||
Not applicable. | |||
====Ethics approval and consent to participate==== | |||
Not applicable. | |||
==Additional files== | |||
[https://static-content.springer.com/esm/art%3A10.1186%2Fs13040-017-0130-z/MediaObjects/13040_2017_130_MOESM1_ESM.docx Additional file 1]: Supplementary information: Framework's design and architecture (.docx 500kb) | |||
==References==
Latest revision as of 17:08, 16 August 2017
Full article title | Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces |
---|---|
Journal | BioData Mining |
Author(s) | Tsur, Elishai Ezra |
Author affiliation(s) | Jerusalem College of Technology |
Primary contact | Email: elishai85 at gmail dot com |
Year published | 2017 |
Volume and issue | 10 |
Page(s) | 11 |
DOI | 10.1186/s13040-017-0130-z |
ISSN | 1756-0381 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0130-z |
Download | https://biodatamining.biomedcentral.com/track/pdf/10.1186/s13040-017-0130-z (PDF) |
Databases are imperative for research in bioinformatics and computational biology. Current challenges in database design include data heterogeneity and context-dependent interconnections between data entities. These challenges drove the development of unified data interfaces and specialized databases. The curation of specialized databases is an ever-growing challenge due to the introduction of new data sources and the emergence of new relational connections between established datasets. Here, an open-source framework for the curation of specialized databases is proposed. The framework supports user-designed models of data encapsulation, object persistence and structured interfaces to local and external data sources such as MalaCards, Biomodels and the National Center for Biotechnology Information (NCBI) databases. The proposed framework was implemented using Java as the development environment, EclipseLink as the data persistence agent and Apache Derby as the database manager. Syntactic analysis was based on J3D, jsoup, Apache Commons and w3c.dom open libraries. Finally, the construction of a specialized database for aneurysm-associated vascular diseases is demonstrated. This database contains three-dimensional geometries of aneurysms, patients' clinical information, articles, biological models, related diseases and our recently published model of aneurysms’ risk of rupture. The framework is available at: http://nbel-lab.com.
Keywords: specialized databases, object-relational databases, EclipseLink, Apache Derby, object-oriented programming
Background
In the last few decades the intersection of computer science and biology has evolved to the point at which answers to fundamental biological questions have emerged.[1] Some of the most important cross-talks between biology and computer science lie within the data-intensive nature of modern biology.[2] It is currently evident that fields such as computational biology and bioinformatics are practically fueled by the increasing computational resources available and the development of software encapsulation and abstraction layers.[3] An important cornerstone of the computer-science/biology interface is object-centered reductionism, where relations between discrete biological entities such as DNA, protein and RNA are investigated.[1] Data regarding biological entities is stored in databases, which have become the most important cornerstone for research in computational biology and bioinformatics.
Biological database designers currently face two main challenges: data heterogeneity and the emergence of new relational connections between data entities. Today, biological data is not limited to sequential information, which is typically stored in primary databases such as NCBI's Nucleotide and Protein data sets. Biological data also encompasses graphs[4], statistical models[5], geometric information[6], vector fields[7], patterns[8], images[9], computational models[10] and others. A recent important advance regarding data heterogeneity was made by Allan and colleagues, who developed OMERO, an open-source software platform which uses a server-based middleware application to provide a unified interface for images, matrices and tables.[9] However, while OMERO provides a unified interface for file types, it is currently limited to microscopy images. Another important effort is the development of Semantic Web languages (SWLs), which promote web-based standardization of data formats by utilizing Extensible Markup Language (XML) and Resource Description Framework (RDF). SWLs have been implemented by many biological portals such as MGED Ontology, which provides terms for annotating microarray experiments; BioPAX, which provides an exchange format for biological pathway data; and Gene Ontology (GO), which describes biological processes, molecular functions and cellular components of gene products.[11]
The management of relational connections between biological data entities is a great challenge due to the variety of contexts in which data can be related. The vast spectrum of possible relations between biological entities drove the momentum for the curation of specialized databases. Specialized databases include organism-centered datasets such as FlyBase (Drosophila)[12], WormBase (Nematode)[13], AceDB (C. elegans)[14], and TAIR (Arabidopsis)[15]; biological pathways databases such as MetaCyc and BioCyc[16]; and disease databases such as NCBI's OMIM database. Today, specialized databases are often curated to serve consortiums and single laboratories that define their own data relations architecture with their own data format. Specialized database curation is an ever-growing need since new data sources are constantly evolving due to rapidly advancing biological research: new experimental techniques produce types of data greater in both variety and number, requiring database structures to change accordingly. Moreover, most specialized databases contain both new data and data that were derived from established datasets. This hybrid approach of the new and the old presents a major challenge to specialized database designers, who must query, acquire and parse data from existing databases, as well as integrate it into their own database architecture.
The relational model, in which sets of tables are used to organize data, was first introduced by Edgar Codd in 1970 and is currently the most widely used model for data representation.[17] Although relational database management systems (DBMSs) have often been used for biological data management, they are in many ways inadequate. One of the main reasons for this inadequacy is that a relational data model cannot accurately encapsulate important biological data structures such as pedigrees, taxonomies, maps, networks, cascade processes, etc. Moreover, while application development techniques and programming languages have evolved significantly over the past decades, relational database technology has remained relatively unchanged, frequently causing discrepancies between the object-oriented model used by many modern applications and the relational model.[18] However, while relational databases may be inconvenient to use from modern programming languages, they are still the main choice for many applications due to their maturity and reliability. One of the main alternatives to the relational data model is an object-based representation of information, in which entities are defined with a set of properties and connected as attributes. Object-oriented database management systems (OODBMSs), popularized as NoSQL, allow the encapsulation of internal details of the data associated with the heterogeneity of the underlying data sources, extending object-oriented programming with data persistence. Data persistence allows objects to be created, stored and used directly with no need to explicitly serialize objects to or from a database. Examples of OODBMSs include HBase and DocDb. Importantly, a technique called object-relational mapping (ORM) allows the user to use SQL-like queries while dealing directly with objects, creating a hybrid of both object and relational approaches.
Here, an ORM-based open-source platform is proposed. The platform provides an efficient way for researchers to curate their own database using their own model of encapsulation, while providing direct access to external biological databases such as the National Center for Biotechnology Information (NCBI) databases, MalaCards, Biomodels and others using structured interfaces.
Implementation
Framework
Here, a simple framework for rapid development of specialized databases is proposed. This framework integrates several database technologies into a unified platform which allows the user to search and retrieve data from existing sources, incorporate established data with newly discovered information in a single data model, define object-based data architecture and persist it to memory for future inquiries. The framework consists of three main data streams: existing data sources to the user (search and retrieve data), new information sources such as experiments and data analytics to the user (generate data), and the individual user to a self-curated specialized database (model, store and retrieve data) (Fig. 1). The platform enables the user to search and retrieve data from local databases using structured information interfaces, and from online databases using structured URL interfaces. The user interface consists of a database repository which defines the differently structured accesses to existing resources, and a query generation engine which provides advanced searching mechanisms such as field-oriented search and the use of logical relations among search terms. The user designs their own data architecture using object-oriented programming (OOP), within which new data and the database-derived information are encapsulated in class-defined entities, an action which forms the database schema. Entities are persisted to memory with a persistent agent (PA). The PA uses an object-relational mapper to map the object-based schema to a set of interconnected tables which resemble a typical relational database. This allows the user to query the database with SQL-like queries. The relational model is transferred to a database manager that stores the data in memory.
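The mapping step described above, in which a class-defined entity is turned into a relational table, can be illustrated with a toy sketch. The code below is not EclipseLink and not part of the framework; the `Article` class, its fields and the type-mapping rules are hypothetical and chosen only to show the idea of deriving a table definition from an object model using plain Java reflection.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;
import java.util.stream.Collectors;

/** Toy illustration of object-relational mapping; a real mapper does far more. */
final class ToyMapperSketch {

    /** Hypothetical entity; field names are assumptions made for this sketch. */
    static final class Article {
        String id;
        String title;
        int pubYear;
    }

    /** Maps a Java field type to a simplified SQL column type. */
    static String sqlType(Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return "INTEGER";
        }
        if (type == double.class || type == Double.class) {
            return "DOUBLE";
        }
        return "VARCHAR(255)"; // fallback used for strings and anything else
    }

    /** Derives a CREATE TABLE statement from the instance fields of a class. */
    static String createTableFor(Class<?> entity) {
        String columns = Arrays.stream(entity.getDeclaredFields())
                .filter(f -> !f.isSynthetic() && !Modifier.isStatic(f.getModifiers()))
                // Reflection does not guarantee field order, so sort by name.
                .sorted(Comparator.comparing(Field::getName))
                .map(f -> f.getName().toUpperCase(Locale.ROOT) + " " + sqlType(f.getType()))
                .collect(Collectors.joining(", "));
        return "CREATE TABLE " + entity.getSimpleName().toUpperCase(Locale.ROOT)
                + " (" + columns + ")";
    }

    public static void main(String[] args) {
        System.out.println(createTableFor(Article.class));
    }
}
```

A production mapper additionally handles keys, relationships between entities and schema evolution, which is why the framework delegates this step to an established JPA provider.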
This framework can be implemented in various ways using different programming languages and library providers. For example, Southern and colleagues developed a Java API, which maps entities to NCBI's PubChem schema and provides wrapper functions for calling NCBI eUtilities and PubChem web services.[19] Another implementation is the BioPython project, which provides modules to access NCBI's databases from within Python.[20] Object persistence can be achieved both in Python using its standard library, which supports a family of hash-based file formats and object serialization, and in Java using the Java Persistence API (JPA). Most JPA persistence providers offer the option to automatically create the database schema based on metadata. Popular implementations of JPA are Hibernate, EclipseLink and Apache OpenJPA. Common database managers include Apache Derby, Firebird, SQLite, and HSQL.
Software description
The proposed framework for the development of specialized databases can be implemented using different resources. Here, various open-source and free resources were utilized for implementation. Java was chosen as the development environment, EclipseLink as the JPA provider and Apache Derby as the database manager. Java was used to create interfaces to online databases such as MalaCards, Biomodels and NCBI's databases. The structured information interface was used to derive data from local repositories such as Aneurisk, which was downloaded from the Aneurisk web dataset. By integrating these tools with a series of data parsers, a powerful framework for the curation of specialized databases is provided, which would incorporate new data and database-derived information into a user-defined database architecture. A schematic of the implementation is presented in Fig. 2.
The main data stream from established online databases was implemented using a structured URL interface. Databases often use a fixed URL syntax which translates a standard set of input parameters into the values necessary to search and retrieve requested data. For example, Entrez Programming Utilities provide a structured URL interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and associated biomedical literature.[21] A series of data processing tools was utilized to implement parsers for syntactic analysis of the retrieved data. The w3c.dom package provides the Document Object Model (DOM) interfaces, which were used as the API for XML processing. The Apache Commons libraries, the jsoup library and the org.j3d library of the Java 3D Community were utilized for CSV, HTML and STL parsing, respectively. An API was developed with which the user can specify the database from which she wants to search and retrieve data. This API contains an expandable database repository and a simple query generation engine which can be used to search a specified database. The user uses Java OOP to encapsulate the retrieved data and integrate it with her own data. The mapping of Java objects and database tables is defined via persistence metadata, which is used by the JPA provider (EclipseLink) to execute SQL-like database operations for static and dynamic queries. EclipseLink defines the metadata via annotations in the Java class and with XML. The open-source relational database Apache Derby, part of the Apache DB Project, was used as the database manager. Derby is written in Java and is suitable for embedding due to its limited footprint and ease of use. Derby supports SQL data storing and querying in a client/server operation mode, commonly used by specialized databases.
The framework consists of five packages, each encapsulating a family of associated functionalities. The UML package diagram is shown in Additional file 1: Figure S1. The database package contains enumerated types which represent specific search fields for advanced search, types of retrieved information and available databases. The URL Interfaces package consists of a series of classes providing structured URL access to the different databases. The Parsers package consists of classes for HTML, XML and CSV parsing. Each of those classes is generalized by model specific parsers as will be discussed below. The Persistency package consists of the persistent Agent class that can persist the model’s objects to memory. Finally, the Model package consists of the model’s specific classes as will also be discussed below. Simplified UML views of the classes’ diagram of each of the packages are shown in Additional file 1: Figures S2–5.
The use of the system is quite simple. For example, a query for retrieving two articles according to their PMIDs from NCBI's PubMed database can be generated using:
Query query = new Query();
query.setDatabase(DBType.PUBMED);
query.addId("23371018");
query.addId("10227670");
query.setSearchType(SearchType.FETCH);
The following structured URL is automatically generated by the framework as a result:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=23371018,10227670&retmode=xml
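How such a fetch URL might be assembled from a database name and a list of IDs can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's own code: the class `EfetchUrlSketch` and the method `buildFetchUrl` are hypothetical names, and only the URL layout is taken from the example above.

```java
import java.util.List;

/** Hypothetical helper showing how a structured efetch URL could be assembled. */
final class EfetchUrlSketch {

    // Base address copied from the example URL in the text.
    private static final String BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    /** Joins the record IDs with commas and places them in a fixed URL template. */
    static String buildFetchUrl(String database, List<String> ids) {
        return BASE + "efetch.fcgi?db=" + database
                + "&id=" + String.join(",", ids)
                + "&retmode=xml";
    }

    public static void main(String[] args) {
        System.out.println(buildFetchUrl("pubmed", List.of("23371018", "10227670")));
    }
}
```

Because the URL syntax is fixed, supporting another Entrez database in this sketch only requires passing a different `db` value.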
More complicated queries can also be produced. For example, to retrieve publication IDs for all articles published in the journal Science in 2009, with the terms “breast” and “cancer,” the following query can be generated:
Query query = new Query();
query.setDatabase(DBType.PUBMED);
query.addTerm("breast");
query.addTerm("cancer");
query.addField(SearchFields.JOURNAL, "science");
query.addField(SearchFields.PUBLICATION_DATA, "2009");
query.setSearchType(SearchType.SEARCH);
List<String> results = Entrez.searchEntrez(query);
As a result, the following structured URL is automatically generated by the framework:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+2009[pdat]+breast+AND+cancer&retmode=xml&rettype=uilist
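The way field-oriented restrictions and free-text terms combine into the `term` parameter can be sketched as below. This is an illustrative reconstruction inferred from the URL above, not the framework's query generation engine: `EsearchTermSketch`, `buildTerm` and `buildSearchUrl` are hypothetical names, and the sketch assumes that field restrictions are written as `value[tag]`, that free-text terms are joined with `AND`, and that no URL-encoding of special characters is needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper showing how an esearch term string could be composed. */
final class EsearchTermSketch {

    private static final String BASE =
            "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    /** Builds "value[tag]" parts for each field, then appends the AND-joined terms. */
    static String buildTerm(Map<String, String> fields, List<String> terms) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            parts.add(field.getValue() + "[" + field.getKey() + "]");
        }
        if (!terms.isEmpty()) {
            parts.add(String.join("+AND+", terms));
        }
        // "+" stands for a space in the query string.
        return String.join("+", parts);
    }

    /** Places the composed term in the fixed esearch URL template. */
    static String buildSearchUrl(String database, Map<String, String> fields,
                                 List<String> terms) {
        return BASE + "?db=" + database
                + "&term=" + buildTerm(fields, terms)
                + "&retmode=xml&rettype=uilist";
    }

    /** Field restrictions used in the example: journal name and publication date. */
    static Map<String, String> exampleFields() {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("journal", "science");
        fields.put("pdat", "2009");
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("pubmed", exampleFields(),
                List.of("breast", "cancer")));
    }
}
```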
The created query can be sent to the Entrez utilities system for the retrieval of the corresponding articles using:
Document xmlDocs = Entrez.callEntrez(query);
The retrieved XML documents can then be sent for parsing for the generation of a linked list of "Article" objects:
PubmedParser parser = new PubmedParser(xmlDocs);
parser.parse();
List<Article> articles = parser.getArticles();
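What a DOM-based parsing step of this kind might look like is sketched below using only the JDK's `org.w3c.dom` and `javax.xml.parsers` packages. It is not the framework's `PubmedParser`: the class `PubmedDomSketch` and its methods are hypothetical, and the embedded sample is a heavily trimmed stand-in for a real efetch response, keeping only the `PMID` and `ArticleTitle` elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of extracting article data from PubMed-style XML. */
final class PubmedDomSketch {

    /** Trimmed-down sample; a real response carries many more elements. */
    static final String SAMPLE =
            "<PubmedArticleSet>"
            + "<PubmedArticle><MedlineCitation><PMID>23371018</PMID>"
            + "<Article><ArticleTitle>Non-dimensional analysis of retinal microaneurysms: "
            + "critical threshold for treatment.</ArticleTitle></Article>"
            + "</MedlineCitation></PubmedArticle>"
            + "<PubmedArticle><MedlineCitation><PMID>10227670</PMID>"
            + "<Article><ArticleTitle>Three dimensional analysis of microaneurysms in the "
            + "human diabetic retina.</ArticleTitle></Article>"
            + "</MedlineCitation></PubmedArticle>"
            + "</PubmedArticleSet>";

    /** Returns one "PMID: title" line per PubmedArticle element. */
    static List<String> extractSummaries(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList articles = doc.getElementsByTagName("PubmedArticle");
            List<String> summaries = new ArrayList<>();
            for (int i = 0; i < articles.getLength(); i++) {
                Element article = (Element) articles.item(i);
                summaries.add(firstText(article, "PMID") + ": "
                        + firstText(article, "ArticleTitle"));
            }
            return summaries;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Text of the first descendant with the given tag, or "?" when it is missing. */
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "?" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        extractSummaries(SAMPLE).forEach(System.out::println);
    }
}
```

Returning a placeholder for missing elements keeps the parser tolerant of incomplete records, which are common in older bibliographic entries.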
Each of the articles can now be connected to other objects. They can then be persisted to memory using:
for (Article article: articles){
persistAgent.PersistArticle(article);
}
Now, the articles can be retrieved from the persistence agent by simply calling on its showArticles function. This function makes use of a simple SQL-like query:
Query q = entityManager.createQuery("SELECT a FROM Article a");
List<Article> articleList = q.getResultList();
The retrieved articles can then be printed to the command line to produce:
---------------------------------------
ID: 23371018
TITLE: Non-dimensional analysis of retinal microaneurysms: critical threshold for treatment.
AUTHOR: Ezra Elishai
JOURNAL: Integrative biology : quantitative biosciences from nano to macro 5(3), 2013, DOI: 10.1039/c3ib20259c
---------------------------------------
ID: 10227670
TITLE: Three dimensional analysis of microaneurysms in the human diabetic retina.
AUTHOR: Moore J
JOURNAL: Journal of anatomy 194 (Pt 1)(?), 1999, DOI: ?
---------------------------------------
Results
The proposed framework can be easily adopted for the curation of specialized databases. Here, the curation of a specialized database for storing and retrieving aneurysm-associated data is demonstrated. Aneurysms characterize important vascular pathologies, which, depending on the aneurysm's location and geometry, could potentially cause blindness, stroke and death.[22] The query generation engine was used to retrieve tens of thousands of aneurysm-related research articles from NCBI's PubMed and PMC databases. An XML parser was utilized to analyze and store the retrieved data in a linked list of "Article" objects; this list encapsulates each article's information. Abstracts are automatically downloaded and stored in the database for easy retrieval. The structured URL interface to the MalaCards database was used to retrieve hundreds of data records of associated human diseases such as aortic aneurysm and cerebritis. The HTML parser was used to parse the information regarding the key characteristics of each disease, which were stored in a linked list of "Disease" objects.
For example, the following simple query for “aneurysm”:
Query query = new Query();
query.setDatabase(DBType.MALA_CARDS);
query.addTerm("aneurysm");
Document results = MalaCards.callMalaCards(query);
can be sent to the MalaCards parser to retrieve related diseases:
MalaCardsParser parser = new MalaCardsParser(results, query);
parser.parse();
List<Disease> diseases = parser.getDiseases();
After persistence is performed (as described earlier), the retrieved list of diseases can be printed to produce (only the first seven retrieved diseases are shown here):
---------------------------------------
Name: Familial Thoracic Aortic Aneurysm and Dissection
Link at MalaCards:/card/familial_thoracic_aortic_aneurysm_and_dissection?search=aneurysm
---------------------------------------
Name: Coronary Aneurysm
Link at MalaCards:/card/coronary_aneurysm?search=aneurysm
---------------------------------------
Name: Angiopathy, Hereditary, with Nephropathy, Aneurysms, and Muscle Cramps
Link at MalaCards:/card/angiopathy_hereditary_with_nephropathy_aneurysms_and_muscle_cramps?search=aneurysm
---------------------------------------
Name: Aneurysmal Bone Cysts
Link at MalaCards:/card/aneurysmal_bone_cysts?search=aneurysm
---------------------------------------
Name: Intracranial Berry Aneurysm
Link at MalaCards:/card/intracranial_berry_aneurysm?search=aneurysm
---------------------------------------
Name: Cerebral Aneurysms
Link at MalaCards:/card/cerebral_aneurysms?search=aneurysm
---------------------------------------
Name: Loeys-Dietz Syndrome
Link at MalaCards:/card/loeys_dietz_syndrome?search=aneurysm
---------------------------------------
The platform was similarly used to retrieve dozens of related biological models, such as models of the differentiation of endothelial cells: the related XML files were downloaded and parsed, and the data were encapsulated in a linked list of "Model" objects. Two of them are:
---------------------------------------
Id: BIOMD0000000058
Description: The model reproduces the same amplitude antiphase calcium oscillations of coupled…
---------------------------------------
Id: BIOMD0000000291
Description: adsorption of albumin-bilirubin complex to the surface of carbon pyropolymer…
---------------------------------------
The structured information interface, the CSV parser and the STL loader were utilized to parse data from the Aneurisk repository, which contains clinical data and three-dimensional models of hundreds of aneurysms. Examples of retrieved patients' information are:
---------------------------------------
Patient ID: C0004
SEX: F, AGE: 60
Aneurysm type: TER, location:ICA, status: U
---------------------------------------
Patient ID: C0005
SEX: F, AGE: 26
Aneurysm type: LAT, location:ICA, status: R
---------------------------------------
Patient ID: C0006
SEX: F, AGE: 45
Aneurysm type: LAT, location:ICA, status: U
---------------------------------------
Patient ID: C0007
SEX: F, AGE: 44
Aneurysm type: LAT, location:ICA, status: U
---------------------------------------
Patient ID: C0008
SEX: M, AGE: 68
Aneurysm type: TER, location:ACA, status: R
---------------------------------------
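The step from a CSV file to patient records like those listed above can be sketched as follows. The framework itself uses the Apache Commons CSV parser; this sketch swaps in plain string splitting so that it depends only on the JDK. The class `PatientCsvSketch`, the `Patient` entity and the column order (`id,sex,age,type,location,status`) are assumptions made for illustration and do not reflect the actual layout of the Aneurisk files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of turning CSV rows into patient objects. */
final class PatientCsvSketch {

    /** Hypothetical entity holding one row of clinical data. */
    static final class Patient {
        final String id;
        final String sex;
        final int age;
        final String type;
        final String location;
        final String status;

        Patient(String id, String sex, int age, String type, String location, String status) {
            this.id = id;
            this.sex = sex;
            this.age = age;
            this.type = type;
            this.location = location;
            this.status = status;
        }

        /** Formats the record in the same layout as the listing in the text. */
        String describe() {
            return "Patient ID: " + id + "\n"
                    + "SEX: " + sex + ", AGE: " + age + "\n"
                    + "Aneurysm type: " + type + ", location:" + location
                    + ", status: " + status;
        }
    }

    /** Assumed layout: a header line followed by six comma-separated columns per row. */
    static final String SAMPLE =
            "id,sex,age,type,location,status\n"
            + "C0004,F,60,TER,ICA,U\n"
            + "C0005,F,26,LAT,ICA,R\n";

    /** Parses every non-empty line after the header into a Patient. */
    static List<Patient> parse(String csv) {
        List<Patient> patients = new ArrayList<>();
        String[] lines = csv.split("\\R");
        for (int i = 1; i < lines.length; i++) { // index 0 is the header
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] cells = line.split(",");
            if (cells.length != 6) {
                throw new IllegalArgumentException("Expected 6 columns: " + line);
            }
            patients.add(new Patient(cells[0].trim(), cells[1].trim(),
                    Integer.parseInt(cells[2].trim()), cells[3].trim(),
                    cells[4].trim(), cells[5].trim()));
        }
        return patients;
    }

    public static void main(String[] args) {
        for (Patient patient : parse(SAMPLE)) {
            System.out.println(patient.describe());
        }
    }
}
```

Plain splitting does not handle quoted fields or embedded commas, which is one reason a dedicated CSV library is preferable for real data.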
Aneurysm data was integrated with our previously published criteria for aneurysm risk of rupture.[7] A schematic of the database with its different associations is shown in Fig. 3. Simplified UML views of the models’ related classes are presented in Additional file 1: Figures S6–7.
The persisted database can now be queried with SQL-like commands. For example, all relevant information regarding a patient, including articles, models and diseases, can be easily retrieved according to the patient's identification number (in this case: CD0985674) using:
Query q = entityManager.createQuery
("SELECT p FROM Aneurysm p WHERE p.patientID = :patientID");
q.setParameter("patientID", "CD0985674");
List<Aneurysm> aneurysms = q.getResultList();
Discussion
In the last two decades a tremendous interest has developed in computational biology and bioinformatics, disciplines which have emerged from the intersection of biology and computer science. Practically, bioinformatics became a fertile new ground for programmers, who have gained access to an entirely new class of questions and challenges.[23] Commonly used software packages include the Bio*.org projects, such as BioRuby[24], BioPerl[25], BioJava[26] and BioPython[20], which have recently been assembled under the Open Bioinformatics Foundation. Each of these projects represents an international association of developers of open-source code libraries for bioinformatics, genomics and life science research. However, these platforms are not oriented toward the curation of databases. Sequence and non-bibliographic databases constitute the most important cornerstone for research in computational biology and bioinformatics. While primary databases such as NCBI's Nucleotide and Protein databases are of great importance to biological research, specialized databases which serve specific research communities are rapidly developing. During the last decade, hundreds of specialized databases have been developed, each making use of different frameworks and libraries. Although many database development environments exist, they often rely on a tabular structure, where the designer creates objects such as tables, columns, keys, indexes, relationships and constraints. While those basic entities are prevalent for simple data organization, they can rarely answer the needs of researchers who make use of a wide spectrum of data types, from sequencing data and microarray experiments to statistical models and simulations.
Here, a simple and unified open-source framework for the rapid development of specialized databases, based on user-defined objects and relations, is proposed. These objects can be designed with the full arsenal of tools in OOP, giving the user maximum flexibility. It is important to note that the proposed framework aims to assist developers who are capable of building object-oriented data models. After defining the data model (as exemplified in Additional file 1: Figure S7), developers can use the framework to load it with data from the supported web-based and local repositories, and to persist it to and retrieve it from memory. Moreover, this framework can be rapidly extended or modified to support additional parsers and databases. The framework allows the user to concentrate on the biological models, the new data and the database architecture, rather than on concerns regarding data management and access to the different online and local datasets. This implementation is provided with a set of free, open-source tools, to increase availability and to enable ease of use. The framework can be easily utilized to work with the variety of bioinformatics tools available via the open-source BioJava project.
The most important aspect of the proposed work is the integration of the technologies most relevant to OO-based database design in a single framework. This is in clear contrast to BioPython/BioJava, which emphasize utilization of algorithms for bioinformatics. While the proposed framework is focused on database design, BioPython/BioJava are focused on fundamental bioinformatics tasks ranging from sequence alignment to molecular structure prediction. Obviously, some aspects of the proposed framework, such as interfacing with web-based databases, are congruent with BioPython/BioJava. However, the context of use is largely different. Moreover, by utilizing flexible structuring of URLs, the proposed framework also supports interfaces to databases such as MalaCards and BioModels. We note that another project, BioServices[27], does indeed provide an interface to BioModels. Moreover, our framework provides an interface to Apache Derby, a strong ORM-based database manager. This is currently, to our knowledge, not embedded within BioPython/BioJava. We note that another Python package named BioSQL[28] offers interfacing for SQL-based relational databases, in contrast to our framework, which is based on ORM.
The creation of a specialized database for aneurysm-associated data and possible queries was demonstrated. A more substantial "data to knowledge" utility of this database can be developed. For example, our predictive model of aneurysms’ risk of rupture uses a non-dimensional analysis of fluid dynamics to set a critical geometrical threshold of treatment.[7] Piccinelli and colleagues derived geometrical measures of patients’ aneurysms’ geometries, which can be used with the prediction model to determine risk of rupture.[29] These patients’ geometries are incorporated as STL files in this database. A specialized database of aneurysm-related data can be used to further investigate the links between the different aneurysm-associated diseases and the underlying biological models to advance knowledge in this field.
Declarations
Acknowledgements
The author wishes to thank Aviad Ezra and Tamara Pearlman for their insightful comments.
Funding
This work was supported by a JCT research grant.
Availability of data and materials
A simplified version of the framework, with additional code examples, is provided at the NBEL-lab.com website. To ensure public access to the files, the source code was also uploaded to GitHub at https://github.com/NBEL-lab/BioDataMining/. As described above, the framework depends on a series of freely accessible modules: commons-csv V1.2 (https://commons.apache.org/proper/commons-csv/), Apache Derby (https://db.apache.org/derby/), EclipseLink (https://eclipse.org/eclipselink/downloads/), javax persistence V2.1 (http://mvnrepository.com/artifact/org.eclipse.persistence/javax.persistence/2.1.0), and jsoup V1.8.3 (http://jsoup.org/download). All necessary libraries are available in the project’s lib directory. A tutorial and installation instructions are located in the framework’s client/Client.java.
Project name: development of entity-based data models for bioinformatics
Operating system(s): cross platform
Programming language: Java
Licence: GNU General Public License V3
Authors' contributions
EET designed the framework, developed the code and wrote the manuscript.
Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Additional files
Additional file 1: Supplementary information: Framework's design and architecture (.docx 500kb)
References
- ↑ 1.0 1.1 Kitano, H. (2002). "Computational systems biology". Nature 420 (6912): 206–10. doi:10.1038/nature01254. PMID 12432404.
- ↑ Stein, L.D. (2003). "Integrating biological databases". Nature Reviews Genetics 4 (5): 337–45. doi:10.1038/nrg1065. PMID 12728276.
- ↑ Cannata, N.; Merelli, E.; Altman, R.B. (2005). "Time to organize the bioinformatics resourceome". PLOS Computational Biology 1 (7): e76. doi:10.1371/journal.pcbi.0010076. PMC PMC1323464. PMID 16738704. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1323464.
- ↑ Sharan, R.; Ideker, T. (2006). "Modeling cellular machinery through biological network comparison". Nature Biotechnology 24 (4): 427–33. doi:10.1038/nbt1196. PMID 16601728.
- ↑ Wilkinson, D.J. (2009). "Stochastic modelling for quantitative description of heterogeneous biological systems". Nature Reviews Genetics 10 (2): 122–33. doi:10.1038/nrg2509. PMID 19139763.
- ↑ Delp, S.L.; Ku, J.P.; Pande, V.S. et al. (2012). "Simbios: An NIH national center for physics-based simulation of biological structures". JAMIA 19 (2): 186–89. doi:10.1136/amiajnl-2011-000488. PMC PMC3277621. PMID 22081222. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3277621.
- ↑ 7.0 7.1 7.2 Ezra, E.; Keinan, E.; Mandel, Y. et al. (2013). "Non-dimensional analysis of retinal microaneurysms: Critical threshold for treatment". Integrative Biology 5 (3): 474-80. doi:10.1039/c3ib20259c. PMC PMC3781337. PMID 23371018. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3781337.
- ↑ Naumova, O.Y.; Lee, M.; Koposov, R. et al. (2012). "Differential patterns of whole-genome DNA methylation in institutionalized children and children raised by their biological parents". Development and Psychopathology 24 (1): 143–55. doi:10.1017/S0954579411000605. PMC PMC3470853. PMID 22123582. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470853.
- ↑ 9.0 9.1 Allan, C.; Burel, J.M.; Moore, J. et al. (2012). "OMERO: Flexible, model-driven data management for experimental biology". Nature Methods 9 (3): 245–53. doi:10.1038/nmeth.1896. PMC PMC3437820. PMID 22373911. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3437820.
- ↑ Chelliah, V.; Laibe, C.; Le Novère, N. et al. (2013). "BioModels Database: A repository of mathematical models of biological processes". Methods in Molecular Biology 1021: 189–99. doi:10.1007/978-1-62703-450-0_10. PMID 23715986.
- ↑ Pasquier, C. (2008). "Biological data integration using Semantic Web technologies". Biochimie 90 (4): 584–94. doi:10.1016/j.biochi.2008.02.007. PMID 18294970.
- ↑ FlyBase Consortium (2003). "The FlyBase database of the Drosophila genome projects and community literature". Nucleic Acids Research 31 (1): 172–5. PMC PMC165541. PMID 12519974. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC165541.
- ↑ Stein, L.; Sternberg, P.; Durbin, R. et al. (2001). "WormBase: Network access to the genome and biology of Caenorhabditis elegans". Nucleic Acids Research 29 (1): 82–6. PMC PMC29781. PMID 11125056. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29781.
- ↑ Martinelli, S.D.; Brown, C.G.; Durbin, R. (1997). "Gene expression and development databases for C. elegans". Seminars in Cell & Developmental Biology 8 (5): 459–67. doi:10.1006/scdb.1997.0171.
- ↑ Poole, R.L. (2007). "The TAIR database". Methods in Molecular Biology 406: 179–212. PMID 18287693.
- ↑ Caspi, R.; Foerster, H.; Fulcher, C.A. et al. (2008). "The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases". Nucleic Acids Research 36 (DB1): D623–31. doi:10.1093/nar/gkm900. PMC PMC2238876. PMID 17965431. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238876.
- ↑ Codd, E.F. (1970). "A relational model of data for large shared data banks". Communications of the ACM 13 (6): 377–87. doi:10.1145/362384.362685.
- ↑ Elmasri, R.; Navathe, S.B. (2008). Fundamentals of Database Systems (5th ed.). Pearson. ISBN 9780321369574.
- ↑ Southern, M.R.; Griffin, P.R. (2011). "A Java API for working with PubChem datasets". Bioinformatics 27 (5): 741–2. doi:10.1093/bioinformatics/btq715. PMC PMC3105478. PMID 21216779. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3105478.
- ↑ 20.0 20.1 Cock, P.J.; Antao, T.; Chang, J.T. et al. (2009). "Biopython: Freely available Python tools for computational molecular biology and bioinformatics". Bioinformatics 25 (11): 1422-3. doi:10.1093/bioinformatics/btp163. PMC PMC2682512. PMID 19304878. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512.
- ↑ National Center for Biotechnology Information; U.S. National Library of Medicine (2017). "Entrez Programming Utilities Help". https://www.ncbi.nlm.nih.gov/books/NBK25501/.
- ↑ Chalouhi, N.; Hoh, B.L.; Hasan, D. (2013). "Review of cerebral aneurysm formation, growth, and rupture". Stroke 44 (12): 3613-22. doi:10.1161/STROKEAHA.113.002390. PMID 24130141.
- ↑ Kanehisa, M.; Bork, P. (2003). "Bioinformatics in the post-sequence era". Nature Genetics 33 (Supp 1): 305–10. doi:10.1038/ng1109. PMID 12610540.
- ↑ Goto, N.; Prins, P.; Nakao, M. et al. (2010). "BioRuby: Bioinformatics software for the Ruby programming language". Bioinformatics 26 (20): 2617–19. doi:10.1093/bioinformatics/btq475. PMC PMC2951089. PMID 20739307. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951089.
- ↑ Stajich, J.E. (2007). "An Introduction to BioPerl". In Edwards, D.. Plant Bioinformatics. Methods in Molecular Biology. 406. pp. 538–548. doi:10.1007/978-1-59745-535-0_26. ISBN 9781597455350.
- ↑ Holland, R.C.; Down, T.A.; Pocock, M. et al. (2008). "BioJava: An open-source framework for bioinformatics". Bioinformatics 24 (18): 2096-7. doi:10.1093/bioinformatics/btn397. PMC PMC2530884. PMID 18689808. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2530884.
- ↑ Cokelaer, T.; Pultz, D.; Harder, L.M. et al. (2013). "BioServices: A common Python package to access biological Web Services programmatically". Bioinformatics 29 (24): 3241-2. doi:10.1093/bioinformatics/btt547. PMC PMC3842755. PMID 24064416. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3842755.
- ↑ Ives, Z.G.; Green, T.J.; Karvounarakis, G. et al. (2008). "The ORCHESTRA Collaborative Data Sharing System". ACM SIGMOD Record 37 (3): 26–32. doi:10.1145/1462571.1462577.
- ↑ Piccinelli, M.; Steinman, D.A.; Hoi, Y. et al. (2012). "Automatic neck plane detection and 3D geometric characterization of aneurysmal sacs". Annals of Biomedical Engineering 40 (10): 2188-211. doi:10.1007/s10439-012-0577-5. PMID 22532324.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Grammar and word use were updated to make the text easier to read.