Journal:PCM-SABRE: A platform for benchmarking and comparing outcome prediction methods in precision cancer medicine

Full article title	PCM-SABRE: A platform for benchmarking and comparing outcome prediction methods in precision cancer medicine
Journal	BMC Bioinformatics
Author(s)	Eyal-Altman, Noah; Last, Mark; Rubin, Eitan
Author affiliation(s)	Ben-Gurion University of the Negev
Primary contact	Email: eyalnoa at post dot bgu dot ac dot il
Year published	2017
Volume and issue	18
Page(s)	40
DOI	10.1186/s12859-016-1435-5
ISSN	1471-2105
Distribution license	Creative Commons Attribution 4.0 International
Website	http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1435-5
Download	http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1435-5 (PDF)

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

Background: Numerous publications attempt to predict cancer survival outcome from gene expression data using machine-learning methods. A direct comparison of these works is challenging for the following reasons: (1) inconsistent measures used to evaluate the performance of different models, and (2) incomplete specification of critical stages in the process of knowledge discovery. There is a need for a platform that would allow researchers to replicate previous works and to test the impact of changes in the knowledge discovery process on the accuracy of the induced models.

Results: We developed the PCM-SABRE platform, which supports the entire knowledge discovery process for cancer outcome analysis. PCM-SABRE was developed using KNIME. By using PCM-SABRE to reproduce the results of previously published works on breast cancer survival, we define a baseline for evaluating future attempts to predict cancer outcome with machine learning. We used PCM-SABRE to replicate previous works that describe predictive models of breast cancer recurrence, and tested the performance of all possible combinations of the feature selection methods and data mining algorithms that were used in either of the works. We reconstructed the work of Chou et al., observing similar trends: superior performance of the probabilistic neural network (PNN) and logistic regression (LR) algorithms, and an inconclusive impact of feature pre-selection with the decision tree algorithm on subsequent analysis.

Conclusions: PCM-SABRE is a software tool that provides an intuitive environment for rapid development of predictive models in cancer precision medicine.

Keywords: Breast cancer, data mining, reproducible research

Background

Predicting the outcome of cancer from gene expression data is a clinically important, computationally challenging task. For example, early-stage, estrogen-receptor-positive, HER2-negative breast cancer patients who are considered to be at low risk for recurrence can avoid chemotherapy, while patients at high or intermediate risk are treated with aggressive (and harmful) chemotherapy.^[1]

Efforts to stratify patients by risk of recurrence in other tumor types, and the ability to stratify patients by overall chances of survival, are not as advanced. Moreover, the relative success of risk stratification for breast cancer patients has been challenged^[2], with the suggestion that it in fact stratifies patients into tumor subtypes, which can be achieved with much simpler tests.

As a result, a large number of papers have been published, and are still being published, in which gene expression data is analyzed in order to construct models that predict cancer survival or cancer recurrence. Many of these efforts concentrate on breast cancer, the second most commonly diagnosed cancer among American women (after skin cancer).^[3] About 1 in 8 U.S. women (about 12 percent) will develop invasive breast cancer over the course of their lifetimes, and similar rates are reported worldwide.^[4] Breast cancer is an attractive domain for risk stratification, as it is estimated that resection is a sufficient treatment for 70 to 80 percent of patients, while the remaining patients will develop advanced metastatic lesions, which are largely impossible to cure.^[5] Aggressive chemotherapy reduces the chance of advanced metastasis for patients in the latter group, but it is a harmful and unnecessary therapy for the rest. Thus, great efforts have been invested in stratifying patients’ risk of recurrence.

Due to the importance of risk stratification in breast cancer, combined with the disease's relatively high prevalence, breast cancer is the type of tumor for which expression profiles of newly diagnosed patients are most abundant. Several works have been published that apply machine-learning techniques to this data for predicting cancer survivability.^[6]^[7] Unfortunately, we found it quite challenging to directly compare these works for the following reasons:

(1) Incomplete specification of critical stages in the process of knowledge discovery, such as feature selection.
(2) Differences in the measures used to evaluate model performance. Some works provide only the overall accuracy of the proposed classifier, some offer only the area under the curve (AUC), and others provide no statistical measures at all, presenting only Kaplan-Meier charts that visualize the survival curves based on predicted classes.
(3) Differences in inclusion/exclusion criteria across studies, with little or no overlap between the patients considered.
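To illustrate why differing evaluation measures hinder comparison, the following minimal Python sketch scores two classifiers on the same patients using both overall accuracy and AUC. It is not part of PCM-SABRE (which is built in KNIME). The labels, the scores, the 0.5 decision threshold, and the names `accuracy`, `roc_auc`, `model_a`, and `model_b` are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of patients classified correctly at a fixed threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 8 patients without recurrence (0), 2 with recurrence (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Hypothetical model A: assigns every patient the same low risk score.
model_a = [0.2] * 10

# Hypothetical model B: ranks patients perfectly, but its scores run high.
model_b = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, accuracy(labels, scores), roc_auc(labels, scores))
```

On this made-up cohort, model A reaches the higher accuracy (0.8 versus 0.6) simply by predicting the majority class, while model B reaches the higher AUC (1.0 versus 0.5). A paper that reports only accuracy and a paper that reports only AUC would therefore rank the two models in opposite order.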

Incomplete documentation of the analytic process is a common cause of irreproducibility in published results. We conclude that there is a need for a platform that would allow researchers to describe their analytic work in the field of risk stratification for cancer patients in a reproducible way that can be used for further investigation. Such a platform should allow the replication of previous works and the methodical evaluation of how alterations in one or more stages of the knowledge discovery process affect performance in the task of cancer survival prediction. Such a tool can help to understand and compare the current state of predictions for breast cancer and, if applied to new cancer types, to prevent the "Tower of Babel" situation that has emerged for breast cancer.

Implementation

References

↑ Sparano, J.A.; Gray, R.J.; Makower, D.F. et al. (2015). "Prospective Validation of a 21-Gene Expression Assay in Breast Cancer". New England Journal of Medicine 373 (21): 2005–14. doi:10.1056/NEJMoa1510764. PMID 26412349.
↑ Senkus, E.; Kyriakides, S.; Ohno, S. et al. (2015). "Primary breast cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up". Annals of Oncology 26 (Suppl 5): v8–30. doi:10.1093/annonc/mdv298. PMID 26314782.
↑ "U.S. Breast Cancer Statistics". Breastcancer.org. 2016. http://www.breastcancer.org/symptoms/understand_bc/statistics. Retrieved 20 December 2016.
↑ "Breast cancer statistics". World Cancer Research Fund International. 2016. http://www.wcrf.org/int/cancer-facts-figures/data-specific-cancers/breast-cancer-statistics. Retrieved 20 December 2016.
↑ "Statistics for Metastatic Breast Cancer". Metastatic Breast Cancer Network. 2016. http://www.mbcn.org/statistics-for-metastatic-breast-cancer/. Retrieved 20 December 2016.
↑ Györffy, B.; Lanczky, A.; Eklund, A.C. et al. (2010). "An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients". Breast Cancer Research and Treatment 123 (3): 725–31. doi:10.1007/s10549-009-0674-9. PMID 20020197.
↑ Naoi, Y.; Kishi, K.; Tanei, T. et al. (2011). "Development of 95-gene classifier as a powerful predictor of recurrences in node-negative and ER-positive breast cancer patients". Breast Cancer Research and Treatment 128 (3): 633–41. doi:10.1007/s10549-010-1145-z. PMID 20803240.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Some grammar was corrected where necessary.