Journal:Practical approaches for mining frequent patterns in molecular datasets
Full article title | Practical approaches for mining frequent patterns in molecular datasets |
---|---|
Journal | Bioinformatics and Biology Insights |
Author(s) | Naulaerts, S.; Moens, S.; Engelen, K.; Vanden Berghe, W.; Goethals, B.; Laukens, K.; Meysman, P. |
Author affiliation(s) | University of Antwerp, Antwerp University Hospital, Fondazione Edmund Mach |
Primary contact | Email: pieter dot meysman at uantwerpen dot be |
Editors | Dandekar, T. |
Year published | 2016 |
Volume and issue | 10 |
Page(s) | 37–47 |
DOI | 10.4137/BBI.S38419 |
ISSN | 1177-9322 |
Distribution license | Creative Commons Attribution 3.0 Unported |
Website | http://www.la-press.com/ (HTML) |
Download | http://www.la-press.com/ (PDF) |
Abstract
Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still keep promising algorithms from gaining popularity in the life science community. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations to common bioinformatics problems and illustrate some of the advantages and disadvantages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should base their choice of software on their research question, their programming proficiency, and the added value of extra features.
Keywords: frequent itemset mining, protein domain structure, protein–protein interaction, gene expression, Mycobacterium tuberculosis
Introduction
In the last decade, various information-rich resources have become available to study organisms on a systems-wide scale. The rapid accumulation of complex biological data in extensive compendia demands powerful and specialized pattern mining techniques.[1][2][3][4] A popular group of pattern mining techniques is itemset mining and its derivative, association rule mining. These methods are best known for their ability to detect frequently co-occurring products in lists of customer supermarket baskets, effectively identifying the patterns in customers’ shopping behavior.[5] In this context, the shopping cart is formally known as a transaction, while the individual products are the items. The goal of this data mining approach is to discover sets of correlated items (i.e., itemsets), which can be highly relevant in the life sciences. For example, one can investigate which genes are often coexpressed in tissue samples or which mutations often occur together in tumors of a given type.
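The transaction and itemset vocabulary above can be made concrete with a small sketch. The Python below is only an illustration, not one of the software implementations evaluated in the article: the gene names, the five toy "tissue samples", the function name `frequent_itemsets`, and the `min_support` threshold are all hypothetical, and the counting is a naive enumeration of every subset rather than an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each transaction is the set of genes found
# expressed in one tissue sample (names are illustrative only).
transactions = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneC"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneC"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at
    least `min_support` transactions.

    Naive approach: enumerate all non-empty subsets of each
    transaction and count them. This is exponential in transaction
    size, so it is only suitable for tiny examples.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Keep itemsets present in at least 3 of the 5 samples.
result = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this toy data, each single gene and each pair of genes meets the threshold, while the full three-gene set appears in only two samples and is therefore not reported as frequent. Real datasets need a dedicated miner, because the number of candidate itemsets grows exponentially with the number of items.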
References
- ↑ Meysman, P.; Sonego, P.; Bianco, L. et al. (2014). "COLOMBOS v2.0: An ever expanding collection of bacterial expression compendia". Nucleic Acids Research 42 (D1): D649-53. doi:10.1093/nar/gkt1086. PMC PMC3965013. PMID 24214998. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965013.
- ↑ UniProt Consortium (2013). "Update on activities at the Universal Protein Resource (UniProt) in 2013". Nucleic Acids Research 41 (D1): D43-7. doi:10.1093/nar/gks1068. PMC PMC3531094. PMID 23161681. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531094.
- ↑ Maietta, P.; Lopez, G.; Carro, A. et al. (2014). "FireDB: A compendium of biological and pharmacologically relevant ligands". Nucleic Acids Research 42 (D1): D267-72. doi:10.1093/nar/gkt1127. PMC PMC3965074. PMID 24243844. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965074.
- ↑ Orchard, S.; Ammari, M.; Aranda, B. et al. (2014). "The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases". Nucleic Acids Research 42 (D1): D358-63. doi:10.1093/nar/gkt1115. PMC PMC3965093. PMID 24234451. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965093.
- ↑ Agrawal, R.; Imielinski, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data 22 (2): 207–16.
Notes
This version is faithful to the original, with only a few minor changes to presentation. In some cases, important information was missing from the references, and that information was added.