Research Project · Machine Learning in Healthcare

Evaluating ML Algorithms for Prediction of Cancer Recurrence

A comparative analysis of SVM, Random Forest, AdaBoost, and XGBoost on breast cancer recurrence data from the University of Wisconsin Hospitals.

VINAY GANDHI

198 Observations
35 Variables
4 ML Algorithms
99.37% Best Accuracy
Abstract

The Problem We Set Out to Solve

Breast cancer affects roughly 10% of women worldwide at some stage of their lives, with a 5-year survival rate of 88% post-diagnosis. However, recurrence after treatment remains a serious and unpredictable threat — one that existing manual diagnostic methods fail to address reliably. This project applies machine learning to bridge that gap.

⚕️

Clinical Challenge

Cancer may return locally, regionally, or distantly after treatment. Even oncologists cannot reliably predict recurrence from scans and records alone.

🧮

ML Approach

Four algorithms (SVM, Random Forest, AdaBoost, and XGBoost) were trained and evaluated against a real-world clinical dataset of 35 variables, including 30 imaging-derived tumour measurements.

🎯

Key Outcome

Extreme Gradient Boosting achieved 99.37% accuracy, clearly outperforming the other techniques on accuracy and specificity and coming second only to SVM on sensitivity.

Introduction

Breast Cancer — By the Numbers

1 in 8

U.S. women face a lifetime risk of developing breast cancer

1 in 3

Cancer diagnoses globally are breast cancer cases

88%

5-year survival rate after diagnosis, dropping to 80% at 10 years

Incidence rates continue to rise globally with insufficient prediction tools

Types of Cancer Recurrence

🏠
Local Recurrence
Cancer returns to the same original site where it first developed.
🔗
Regional Recurrence
Cancer returns to nearby lymph nodes close to the original site.
🌐
Distant Recurrence
Cancer spreads and returns in a completely different part of the body.

Data mining can significantly reduce both false positives and false negatives in clinical decision-making, enabling earlier and more reliable recurrence prediction.

Dataset

Data Source & Structure

Dataset sourced from the University of Wisconsin Hospitals, Madison, collected by Dr. William H. Wolberg. Available via UCI Machine Learning Repository.

Column(s) | Description | Type | Notes
Col 1 | Patient ID Number | Identifier | Removed before modelling
Col 2 | Cancer Outcome (Recurrence / Non-Recurrence) | Target | Binary class label
Col 3 | Time elapsed since first cancer occurrence | Temporal | Continuous numeric
Col 4–33 | Tumour measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension (mean, SE, worst/largest) | Feature | 30 numeric features from imaging (10 measurements × 3 statistics)
Col 34 | Tumour Size (diameter in cm) | Feature | Continuous numeric
Col 35 | Lymph Node Status (axillary nodes at surgery) | Feature | ⚠ 4 missing values
151

Non-Recurrence

76.3% of the dataset — represents treated patients where cancer did not return.

47

Recurrence

23.7% of the dataset — class imbalance noted; handled during preprocessing and boosting.

198

Total Observations

Real-world clinical data — raw, uncleaned, and representative of actual hospital records.
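
As a quick check on the class split above, the following minimal sketch recomputes the class shares and the accuracy that a trivial "always predict the majority class" rule would reach on the full dataset. The variable names are illustrative and not taken from the study.

```python
# Class counts reported for the dataset.
counts = {"non_recurrence": 151, "recurrence": 47}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A rule that always outputs the majority class is correct exactly
# as often as the majority class occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any model accuracy reported on this data is easier to interpret when read against that baseline.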

Data Preprocessing

Cleaning the Raw Signal

Real-world medical data is noisy by nature — missing values, outliers, and correlated features all degrade model accuracy. A 5-step preprocessing pipeline was applied before any ML training.

STEP 01
Missing Values
K-Means imputation on 4 missing Lymph Node values using centroidal nearest-neighbour averaging
STEP 02
Outlier Removal
Histograms and box plots were inspected for each variable; values far from the mean that occurred fewer than twice were removed
STEP 03
Constant Features
Zero-variance and near-zero-variance predictors eliminated using prevalence ratio thresholds
STEP 04
Correlated Variables
Linearly dependent and highly correlated features removed for SVM/Naive Bayes; retained for RF and XGBoost
STEP 05
Normalisation
Data scaled and normalised for distance-based algorithms; feature selection via Wrapper method applied
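
Steps 03 to 05 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the thresholds (`min_variance`, `threshold=0.95`), the toy column values, and the use of a plain variance cut-off instead of a prevalence-ratio rule are all choices made for the example.

```python
import math
import statistics

def drop_constant(columns, min_variance=1e-12):
    """Step 03: keep only columns whose variance exceeds a small threshold."""
    return {name: vals for name, vals in columns.items()
            if statistics.pvariance(vals) > min_variance}

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.95):
    """Step 04: greedily keep a column only if it is not highly
    correlated with any column already kept."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept

def standardise(columns):
    """Step 05: rescale each column to zero mean and unit variance."""
    scaled = {}
    for name, vals in columns.items():
        mean = statistics.fmean(vals)
        std = statistics.pstdev(vals)
        scaled[name] = [(v - mean) / std for v in vals]
    return scaled

# Toy values (assumed for illustration, not taken from the dataset).
toy = {
    "radius_mean": [10.0, 12.0, 14.0, 16.0, 18.0],
    "perimeter_mean": [63.0, 75.5, 88.0, 100.6, 113.0],  # tracks radius closely
    "texture_mean": [20.0, 15.0, 22.0, 17.0, 19.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],
}

cleaned = standardise(drop_correlated(drop_constant(toy)))
print(sorted(cleaned))
```

For the tree-based models, where correlated variables were retained, the `drop_correlated` call would simply be skipped.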

Feature Selection Approach

Wrapper method used — iteratively adds/removes predictors and evaluates model performance to find the maximally reproducible feature subset. Feature importance visualised prior to selection.

Wrapper Method · Importance Visualisation · Cross-Validation
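
A wrapper method of the kind described above can be sketched as greedy forward selection: add the single feature that most improves a model score, and stop when no addition helps. The function `toy_score` below is a hypothetical stand-in for a cross-validated model score, and its numbers are invented for the example.

```python
def forward_select(candidates, score, min_gain=1e-9):
    """Greedy forward selection: repeatedly add the feature that
    improves score(subset) the most, until nothing improves it."""
    selected = []
    best = score(selected)
    while True:
        trials = [(score(selected + [f]), f)
                  for f in candidates if f not in selected]
        if not trials:
            break
        trial_score, feature = max(trials)
        if trial_score - best <= min_gain:
            break
        selected.append(feature)
        best = trial_score
    return selected, best

def toy_score(subset):
    """Hypothetical scoring function standing in for cross-validated accuracy."""
    value = {"tumour_size": 0.10, "lymph_nodes": 0.08,
             "radius_mean": 0.05, "perimeter_mean": 0.04}
    score = 0.60 + sum(value[f] for f in subset)
    # Radius and perimeter carry nearly the same information.
    if "radius_mean" in subset and "perimeter_mean" in subset:
        score -= 0.05
    # Small cost per feature so that redundant additions do not pay off.
    return score - 0.01 * len(subset)

selected, best = forward_select(
    ["radius_mean", "perimeter_mean", "tumour_size", "lymph_nodes"], toy_score)
print(selected, round(best, 2))
```

In practice the scoring function would retrain and evaluate the chosen classifier for each candidate subset, which is what makes wrapper methods comparatively expensive.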

Why Outlier Analysis Was Done Twice

The study ran models both with and without outlier removal. Accuracy was significantly lower when outliers were retained, confirming their detrimental effect. However, not all outliers were deleted — those that helped models learn edge behaviours were preserved.
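
One common way to turn a box plot into a rule is Tukey's fence: flag values more than 1.5 interquartile ranges outside the quartiles. The sketch below assumes that rule and uses invented values; the study does not state which cut-off it applied. It only flags candidates, leaving the decision to delete or keep each one to the analyst.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented tumour-size readings in cm, with one extreme value.
tumour_size_cm = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 2.6, 3.0, 9.5]
outliers = iqr_outliers(tumour_size_cm)
print(outliers)
```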

Algorithms

Four Techniques Evaluated

ALGO 01

Support Vector Machine (SVM)

y = w · X' + b   |   Kernels: Linear · Gaussian (RBF)

SVM finds an optimal separating hyperplane between recurrence and non-recurrence classes by maximising the margin between support vectors — the nearest data points to the boundary. Two kernels were implemented: a linear kernel for linearly separable data and a Gaussian (RBF) kernel for non-linear boundaries. Correlated features were removed prior to SVM training.

Binary Classifier · Two Kernels · Margin Maximisation
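
The two kernels can be illustrated with a small sketch of the decision functions. The weights, bias, support vectors, and `gamma` below are hypothetical values chosen for the example; in a trained SVM they come out of the margin-maximisation step.

```python
import math

def linear_decision(w, b, x):
    """Linear SVM score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-gamma * squared_distance)

def kernel_decision(support_vectors, dual_coefs, b, x, gamma=0.5):
    """Kernel SVM score: sum_i coef_i * K(sv_i, x) + b, where each
    coef_i is alpha_i * y_i for support vector i."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

x = [2.0, 0.5]
linear_score = linear_decision([1.0, -1.0], 0.0, x)
rbf_score = kernel_decision([[2.0, 0.5], [0.0, 0.0]], [1.0, -1.0], 0.0, x)
print(linear_score, rbf_score > 0)
```

The RBF score is dominated by nearby support vectors, which is what lets the boundary bend around non-linear class structure.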
ALGO 02

Random Forest

Ensemble: N decision trees → majority vote → final prediction

An ensemble method that builds multiple decision trees on bootstrapped samples and aggregates their votes. The "divide and conquer" approach combines many individually noisy trees into a single stronger learner, reducing variance and overfitting relative to any one tree. Correlated variables were retained for Random Forest training, as tree ensembles are comparatively robust to multicollinearity.

Ensemble Method · Bootstrap Sampling · Variance Reduction
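
The two ingredients named above, bootstrap sampling and majority voting, can be sketched without any tree-fitting code. The per-tree votes below are hypothetical, and the seed is fixed only to make the example repeatable.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def majority_vote(votes):
    """Return the label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
# One bootstrap sample of the 198 observations per tree (three trees here).
samples = [bootstrap_indices(198, rng) for _ in range(3)]

# Hypothetical votes from five trees for a single patient.
tree_votes = ["non-recurrence", "recurrence", "non-recurrence",
              "non-recurrence", "recurrence"]
prediction = majority_vote(tree_votes)
print(prediction)
```

Because each tree sees a different resample, their errors are only partly correlated, and the vote averages much of that noise away.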
ALGO 03

AdaBoost

F(x) = Σ αₜhₜ(x)  |  Loss: exp(-yᵢF(xᵢ))

Adaptive Boosting iteratively trains weak classifiers, weighting misclassified samples more heavily in each round. The final prediction is a weighted combination of all weak hypotheses. It minimises exponential loss to produce a robust, high-accuracy ensemble.
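
A single boosting round makes the reweighting concrete. The sketch below uses the standard discrete AdaBoost update with labels in {-1, +1}; the five-sample example is invented, and the function assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight alpha and the
    renormalised sample weights for the next round."""
    # Weighted error of the current weak classifier.
    eps = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified samples (yt * yp == -1) are scaled up, others down.
    raw = [w * math.exp(-alpha * yt * yp)
           for w, yt, yp in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]

y_true = [+1, +1, -1, -1, -1]
y_pred = [+1, -1, -1, -1, -1]   # the weak learner gets sample 1 wrong
alpha, new_weights = adaboost_round([0.2] * 5, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After this round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.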

ALGO 04

Extreme Gradient Boosting (XGBoost)

ŷᵢ = Σₖ fₖ(xᵢ), each fₖ a regression tree  |  Regularised tree ensemble

XGBoost builds trees sequentially, each correcting the errors of the previous one, with L1/L2 regularisation preventing overfitting. It is computationally efficient and handles class imbalance well — a critical property for this dataset's 76/24 class split.
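
The role of regularisation in each tree can be shown with the closed-form leaf weight and split gain used in gradient tree boosting. The sketch assumes logistic loss and an L2 penalty `lam` plus a per-leaf penalty `gamma`; the four-sample example and the parameter values are invented for illustration.

```python
def logistic_grad_hess(p, y):
    """First and second derivatives of logistic loss w.r.t. the raw score,
    for predicted probability p and label y in {0, 1}."""
    return p - y, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf value: -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction from splitting a leaf into left and right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Four samples, all starting at probability 0.5.
labels = [1, 1, 0, 0]
stats = [logistic_grad_hess(0.5, y) for y in labels]

# Candidate split: first two samples go left, last two go right.
GL, HL = sum(g for g, _ in stats[:2]), sum(h for _, h in stats[:2])
GR, HR = sum(g for g, _ in stats[2:]), sum(h for _, h in stats[2:])

w_left = leaf_weight(GL, HL, lam=1.0)
w_right = leaf_weight(GR, HR, lam=1.0)
gain = split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0)
print(round(w_left, 4), round(w_right, 4), round(gain, 4))
```

Larger `lam` shrinks every leaf value toward zero, and larger `gamma` rejects splits whose gain is too small, which is how the penalty terms limit overfitting.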

Experimental Results

Performance Comparison

Models evaluated by Accuracy, Specificity, and Sensitivity derived from confusion matrices on train/test splits. Cross-validation applied throughout.
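
The three metrics follow directly from the four cells of a confusion matrix. The sketch below uses invented labels and lets the caller choose which class counts as positive, since that choice determines which class sensitivity and specificity describe.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for the given positive label."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Invented example: N = non-recurrence, R = recurrence.
y_true = ["N"] * 6 + ["R"] * 2
y_pred = ["N"] * 5 + ["R"] + ["R", "N"]

result = metrics(*confusion_counts(y_true, y_pred, positive="R"))
print(result)
```

Swapping the positive label to "N" swaps the sensitivity and specificity values while leaving accuracy unchanged.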

Model | Accuracy | Specificity | Sensitivity
SVM | 82.05% | 22.22% | 100.00%
AdaBoost | 74.36% | 22.22% | 90.00%
Random Forest | 76.92% | 22.22% | 93.33%
XGBoost ⭐ | 99.37% | 100.00% | 97.37%
🏆

Extreme Gradient Boosting Wins

XGBoost achieved the highest accuracy and specificity of the four techniques, and its sensitivity (97.37%) was second only to SVM's 100%. Critically, it reached perfect specificity (100%), meaning it made no errors on the class counted as negative, while the other three models all stayed at 22.22%. Its computational efficiency also exceeded that of competing algorithms.

99.37% Accuracy
100% Specificity
97.37% Sensitivity
Analysis

Reading the Results

📊

Specificity Gap

SVM, AdaBoost, and Random Forest all share an identical specificity of 22.22% — suggesting they struggle to correctly classify the minority "recurrence" class. This is a direct consequence of class imbalance (151 vs 47). XGBoost's regularisation and sequential correction mechanism resolves this completely.
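
The reading above can be checked against the reported figures. Accuracy is a weighted average of the two class-wise rates, accuracy = p × sensitivity + (1 − p) × specificity, where p is the share of evaluated cases in the class counted as positive, so p can be backed out from the three numbers. This is an inference from the published percentages, not something the study states.

```python
# Reported (accuracy, specificity, sensitivity) in percent.
reported = {
    "SVM": (82.05, 22.22, 100.00),
    "AdaBoost": (74.36, 22.22, 90.00),
    "Random Forest": (76.92, 22.22, 93.33),
}

def implied_positive_share(accuracy, specificity, sensitivity):
    """Solve accuracy = p*sensitivity + (1-p)*specificity for p."""
    return (accuracy - specificity) / (sensitivity - specificity)

implied = {name: implied_positive_share(*vals) for name, vals in reported.items()}
for name, share in implied.items():
    print(f"{name}: positive class is about {share:.1%} of evaluated cases")
```

All three models imply a positive share of roughly 77%, close to the 76.3% non-recurrence share of the dataset. That is consistent with sensitivity being measured on non-recurrence cases and specificity on recurrence cases, so the shared 22.22% does reflect errors on the minority class.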

🎯

Why SVM Scores 100% Sensitivity

SVM achieves perfect sensitivity, but at the cost of specificity: it assigns almost every patient to the class counted as positive. With 22.22% specificity, an accuracy of 82.05% on a roughly 76/24 class split is only possible if that positive class is the majority non-recurrence class. On that reading, SVM never mislabels a non-recurrence patient but misses most recurrence cases, which is the costlier error in clinical use.

⚖️

The Accuracy–Balance Trade-off

Overall accuracy alone is misleading when class imbalance is present. XGBoost leads on accuracy and specificity and trails only SVM on sensitivity (97.37% vs 100%). It is the only model that scores highly on all three metrics at once, which suggests it has learned the underlying pattern rather than exploiting class frequency shortcuts.
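
One simple way to see the trade-off is balanced accuracy, the mean of sensitivity and specificity, which weights both classes equally regardless of their size. The study does not report this metric; the sketch below derives it from the reported percentages purely as an illustration.

```python
# Reported (accuracy, specificity, sensitivity) in percent.
reported = {
    "SVM": (82.05, 22.22, 100.00),
    "AdaBoost": (74.36, 22.22, 90.00),
    "Random Forest": (76.92, 22.22, 93.33),
    "XGBoost": (99.37, 100.00, 97.37),
}

# Balanced accuracy = (sensitivity + specificity) / 2.
balanced = {name: (spec + sens) / 2 for name, (_, spec, sens) in reported.items()}

for name, (acc, _, _) in reported.items():
    print(f"{name}: accuracy {acc:.2f}%, balanced accuracy {balanced[name]:.2f}%")

best_model = max(balanced, key=balanced.get)
print("highest balanced accuracy:", best_model)
```

The three models with 22.22% specificity fall to roughly 56 to 61% on this measure despite plain accuracies of 74 to 82%, while XGBoost stays above 98%.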

Computational Efficiency

XGBoost also demonstrated superior computational speed compared to the other algorithms. This is significant for real-world clinical deployment where timely predictions from large patient databases are required at scale.

Literature Comparison

Study | Method | Best Accuracy
Delen et al. (SEER database, 202k records) | ANN, Decision Trees, Logistic Regression | ~93%
Lundin et al. (951 patients) | ANN + Logistic Regression | 5/10/15-yr survival
Ebrahimi & Razavi AR | SVM (declared best predictor) | SVM best in literature
This Study (Gandhi) | XGBoost | 99.37%
Conclusion & Future Work

What We Proved

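This study demonstrates that ML, and XGBoost specifically, can achieve near-perfect accuracy in predicting breast cancer recurrence on this dataset, exceeding the accuracies reported in the studies listed in the literature comparison, although those studies used different datasets and outcomes.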

XGBoost Validated for Medical Prediction

With 99.37% accuracy and perfect specificity, XGBoost is demonstrated as the superior algorithm for this recurrence prediction task — both in accuracy and computational speed.

🔬

Real-World Data Successfully Processed

The 5-step preprocessing pipeline effectively cleaned noisy clinical data, handling missing values, outliers, and correlated features to maximise model performance.

🏥

Clinical Deployment Potential

An automatic, reliable recurrence prediction system built on this framework could directly assist oncologists in follow-up planning and early intervention, reducing late-stage recurrence deaths.

🔭

Future Work Directions

Expand dataset size beyond 198 observations; explore deep learning approaches (CNN, LSTM on imaging data); integrate genomic/proteomic features; and validate on diverse patient populations.

Machine learning — specifically Extreme Gradient Boosting — can serve as the automatic, reliable recurrence prediction system that clinical practice urgently needs, achieving 99.37% accuracy on real-world tumour data.

References

Bibliography

[1] Ahmad LG et al. "Using Three Machine Learning Techniques for Predicting Breast Cancer." Health and Medical Informatics.
[2] Ikoro GO. "Using SAS Enterprise Miner to predict breast cancer at early stage." Queen Mary University of London; University of East Anglia.
[3] Mandeep Rana et al. "Breast Cancer Diagnosis and Recurrence Prediction Using Machine Learning Techniques." IJRET: International Journal of Research in Engineering and Technology.
[4] American Cancer Society. http://www.cancer.org/
[5] Dumitru D. "Prediction of recurrent events in breast cancer using the Naive Bayesian classification." 2000.
[6] Mangasarian OL & Wolberg WH. "Cancer diagnosis via linear programming." SIAM News, pp. 1–18, 1990.