Research Project · Machine Learning in Healthcare

Evaluating ML Algorithms for Prediction of Cancer Recurrence

A comparative analysis of SVM, Random Forest, AdaBoost, and XGBoost on breast cancer recurrence data from the University of Wisconsin Hospitals.

VINAY GANDHI

198 Observations

35 Variables

4 ML Algorithms

99.37% Best Accuracy

Abstract

The Problem We Set Out to Solve

Breast cancer affects roughly 10% of women worldwide at some stage of their lives, with a 5-year survival rate of 88% post-diagnosis. However, recurrence after treatment remains a serious and unpredictable threat — one that existing manual diagnostic methods fail to address reliably. This project applies machine learning to bridge that gap.

⚕️

Clinical Challenge

Cancer may return locally, regionally, or distantly after treatment. Even oncologists cannot reliably predict recurrence from scans and records alone.

🧮

ML Approach

Four algorithms (SVM, Random Forest, AdaBoost, and XGBoost) were trained and evaluated against a real-world clinical dataset with 35 variables, 33 of them predictive features.

🎯

Key Outcome

Extreme Gradient Boosting achieved 99.37% accuracy and 100% specificity, well ahead of the other techniques on both metrics; only SVM reported a higher sensitivity (100% vs 97.37%).

Introduction

Breast Cancer — By the Numbers

1 in 8

U.S. women face a lifetime risk of developing breast cancer

1 in 3

Cancer diagnoses globally are breast cancer cases

88%

5-year survival rate after diagnosis, dropping to 80% at 10 years

↑

Incidence rates continue to rise globally with insufficient prediction tools

Types of Cancer Recurrence

🏠

Local Recurrence
Cancer returns to the same original site where it first developed.

🔗

Regional Recurrence
Cancer returns to nearby lymph nodes close to the original site.

🌐

Distant Recurrence
Cancer spreads and returns in a completely different part of the body.

Data mining can significantly reduce both false positives and false negatives in clinical decision-making, enabling earlier and more reliable recurrence prediction.

Dataset

Data Source & Structure

Dataset sourced from the University of Wisconsin Hospitals, Madison, collected by Dr. William H. Wolberg. Available via UCI Machine Learning Repository.

Column(s)	Description	Type	Notes
Col 1	Patient ID Number	Identifier	Removed before modelling
Col 2	Cancer Outcome — Recurrence / Non-Recurrence	Target	Binary class label
Col 3	Time elapsed since first cancer occurrence	Temporal	Continuous numeric
Col 4–33	Tumour measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension (mean, SE, and worst/largest value of each)	Feature	30 numeric features from imaging
Col 34	Tumour Size (diameter in cm)	Feature	Continuous numeric
Col 35	Lymph Node Status (axillary nodes at surgery)	Feature	⚠ 4 missing values

151

Non-Recurrence

76.3% of the dataset — represents treated patients where cancer did not return.

47

Recurrence

23.7% of the dataset — class imbalance noted; handled during preprocessing and boosting.

198

Total Observations

Real-world clinical data — raw, uncleaned, and representative of actual hospital records.
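The class split above can be checked with a few lines of arithmetic. The sketch below recomputes the two percentages and adds one derived figure that is not in the source: the accuracy of a trivial model that always predicts the majority class, which is a useful floor when reading the accuracy results.

```python
# Class counts as reported for the 198-patient dataset.
counts = {"non_recurrence": 151, "recurrence": 47}
total = sum(counts.values())

# Share of each class, in percent.
shares = {label: 100 * n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
# This baseline is a derived illustration, not a result from the study.
majority_baseline = max(counts.values()) / total

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model evaluated on data with this split needs to beat roughly 76% accuracy before its accuracy figure says anything about learned structure.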

Data Preprocessing

Cleaning the Raw Signal

Real-world medical data is noisy by nature — missing values, outliers, and correlated features all degrade model accuracy. A 5-step preprocessing pipeline was applied before any ML training.

STEP 01

Missing Values

The 4 missing Lymph Node values were imputed with K-Means: each incomplete record was matched to its nearest cluster centroid and filled with that cluster's average
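The source does not spell out the imputation procedure, so the following is only one plausible reading of it: cluster the complete records on the observed columns, then fill each missing value with the mean of that column in the nearest cluster. The toy rows (tumour size, lymph node count), the choice of k=2, and the function names are all hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def impute_with_kmeans(rows, target_col, k=2):
    """Fill None in target_col with that column's mean in the nearest cluster."""
    other = [i for i in range(len(rows[0])) if i != target_col]
    complete = [r for r in rows if r[target_col] is not None]
    # Cluster only the complete rows, using the columns that are always observed.
    centroids, assign = kmeans([[r[i] for i in other] for r in complete], k)
    overall = sum(r[target_col] for r in complete) / len(complete)
    cluster_mean = []
    for c in range(k):
        vals = [r[target_col] for r, a in zip(complete, assign) if a == c]
        # Fall back to the overall mean if a cluster ended up empty.
        cluster_mean.append(sum(vals) / len(vals) if vals else overall)
    filled = []
    for r in rows:
        if r[target_col] is None:
            observed = [r[i] for i in other]
            nearest = min(range(k), key=lambda c: math.dist(observed, centroids[c]))
            r = r[:target_col] + [cluster_mean[nearest]] + r[target_col + 1:]
        filled.append(list(r))
    return filled

# Hypothetical rows: [tumour size in cm, lymph node count]; None marks a gap.
rows = [
    [1.0, 0.0], [5.0, 10.0], [1.2, 1.0], [5.5, 12.0], [1.1, 2.0], [4.8, 11.0],
    [1.3, None], [5.2, None],
]
filled = impute_with_kmeans(rows, target_col=1, k=2)
print(filled[-2], filled[-1])
```

In this toy data the small-tumour record inherits the low node count of its cluster and the large-tumour record inherits the high one, which is the behaviour a single global mean would not give.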

STEP 02

Outlier Removal

Histograms and box plots were inspected for each variable; values far from the mean that occurred fewer than twice were removed

STEP 03

Constant Features

Zero-variance and near-zero-variance predictors eliminated using prevalence ratio thresholds
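A minimal sketch of a near-zero-variance filter follows. The source does not give its thresholds, so the two cut-offs used here (a frequency ratio above 95/5 between the two most common values, and fewer than 10% distinct values) are assumptions borrowed from a common convention, and the column names are hypothetical.

```python
from collections import Counter

def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column that is constant or dominated by a single value."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # zero variance: only one distinct value
    # Ratio of the most common value's count to the second most common.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of all observations.
    percent_unique = 100 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut

# Hypothetical columns for illustration only.
columns = {
    "constant": [1.0] * 100,
    "nearly_constant": [0] * 98 + [1] * 2,
    "varied": list(range(100)),
}
kept = [name for name, vals in columns.items() if not is_near_zero_variance(vals)]
print(kept)
```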

STEP 04

Correlated Variables

Linearly dependent and highly correlated features removed for SVM/Naive Bayes; retained for RF and XGBoost
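One simple way to drop highly correlated features is a greedy pass over the columns that keeps a feature only if its absolute Pearson correlation with every already-kept feature stays under a threshold. The 0.9 threshold, the greedy order, and the toy columns below are assumptions; the source does not state which rule was used.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes constant columns were already removed

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with a kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns: perimeter is an exact function of radius.
radius = [10.0, 12.0, 14.0, 16.0, 18.0]
columns = {
    "radius": radius,
    "perimeter": [2 * math.pi * r for r in radius],
    "texture": [20.0, 15.0, 22.0, 14.0, 19.0],
}
kept = drop_correlated(columns)
print(kept)
```

Radius and perimeter are a natural example of such a pair in tumour measurements, which is why the filter matters for models that are sensitive to redundant inputs.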

STEP 05

Normalisation

Data scaled and normalised for distance-based algorithms; feature selection via Wrapper method applied
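The source does not say which scaling was applied, so the sketch below shows the two most common options: min-max scaling to the range 0 to 1 and z-score standardisation. The sample values are hypothetical.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical tumour sizes in cm.
tumour_size = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = min_max_scale(tumour_size)
z_scores = standardise(tumour_size)
print(scaled)
print([round(z, 3) for z in z_scores])
```

Either transform puts features such as radius and fractal dimension on comparable scales, so that no single feature dominates a distance or kernel computation.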

Feature Selection Approach

A wrapper method was used: it iteratively adds or removes predictors and evaluates model performance to find the best-performing, reproducible feature subset. Feature importance was visualised prior to selection.

Wrapper Method · Importance Visualisation · Cross-Validation
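The wrapper idea can be sketched as greedy forward selection: try adding each remaining feature, score the resulting model by cross-validation, keep the best addition, and stop when nothing improves. The source does not name the model or search strategy it wrapped, so the nearest-centroid classifier, the leave-one-out scoring, and the toy data here are stand-ins.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    hits = 0
    for i in range(len(X)):
        train_X = [[r[f] for f in features] for j, r in enumerate(X) if j != i]
        train_y = [t for j, t in enumerate(y) if j != i]
        x = [X[i][f] for f in features]
        hits += nearest_centroid_predict(train_X, train_y, x) == y[i]
    return hits / len(X)

def forward_select(X, y):
    """Greedily add the feature that most improves the cross-validated score."""
    remaining = list(range(len(X[0])))
    selected, best_score = [], 0.0
    while remaining:
        scores = {f: loo_accuracy(X, y, selected + [f]) for f in remaining}
        candidate = max(remaining, key=lambda f: scores[f])
        if scores[candidate] <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

# Hypothetical data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[1.0, 5, 0.5], [1.2, 1, 0.4], [0.9, 4, 0.6], [1.1, 2, 0.5],
     [3.0, 4, 0.5], [3.2, 2, 0.6], [2.9, 5, 0.4], [3.1, 1, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
selected, score = forward_select(X, y)
print(selected, score)
```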

Why Outlier Analysis Was Done Twice

The study ran models both with and without outlier removal. Accuracy was significantly lower when outliers were retained, confirming their detrimental effect. However, not all outliers were deleted — those that helped models learn edge behaviours were preserved.

Algorithms

Four Techniques Evaluated

ALGO 01

Support Vector Machine (SVM)

y = w · X' + b | Kernels: Linear · Gaussian (RBF)

SVM finds an optimal separating hyperplane between recurrence and non-recurrence classes by maximising the margin between support vectors — the nearest data points to the boundary. Two kernels were implemented: a linear kernel for linearly separable data and a Gaussian (RBF) kernel for non-linear boundaries. Correlated features were removed prior to SVM training.

Binary Classifier · Two Kernels · Margin Maximisation
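To make the two kernels concrete, the sketch below evaluates an SVM decision function of the form f(x) = Σ αᵢyᵢK(svᵢ, x) + b with a linear and a Gaussian (RBF) kernel. The support vectors, coefficients, bias, and gamma are hand-picked for illustration and are not fitted values from the study.

```python
import math

def linear_kernel(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel: similarity decays with squared Euclidean distance."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision(x, support_vectors, dual_coefs, bias, kernel):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias; the sign gives the class."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias

# Two hand-picked support vectors: [1, 1] in class -1 and [3, 3] in class +1.
support_vectors = [[1.0, 1.0], [3.0, 3.0]]

# Linear case: these coefficients correspond to w = (0.5, 0.5), b = -2,
# the maximum-margin separator for the two points above.
linear_at_sv = [decision(sv, support_vectors, [-0.25, 0.25], -2.0, linear_kernel)
                for sv in support_vectors]

# RBF case: symmetric illustrative coefficients, so the nearer support
# vector decides the sign.
rbf_low = decision([0.5, 1.0], support_vectors, [-1.0, 1.0], 0.0, rbf_kernel)
rbf_high = decision([3.2, 2.9], support_vectors, [-1.0, 1.0], 0.0, rbf_kernel)

print(linear_at_sv, rbf_low, rbf_high)
```

In the linear case both support vectors sit exactly on the margin, where the decision value is -1 or +1.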

ALGO 02

Random Forest

Ensemble: N decision trees → majority vote → final prediction

An ensemble method that builds multiple decision trees on bootstrapped samples and aggregates their votes. The "divide and conquer" approach converts many weak learners into a single strong learner, reducing variance and overfitting. Correlated variables were retained for Random Forest training as it naturally handles multicollinearity.

Ensemble Method · Bootstrap Sampling · Variance Reduction
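The bootstrap-then-vote mechanism can be sketched in miniature. To stay short, each "tree" below is a one-split stump trained on a bootstrap sample and one randomly chosen feature; a real Random Forest grows full trees. The data, tree count, and seed are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    """Best single-threshold rule on one feature, chosen by training errors."""
    values = sorted(set(row[feature] for row in X))
    majority = Counter(y).most_common(1)[0][0]
    # Fallback rule: everything falls on the left side and gets the majority label.
    best = (feature, float("inf"), majority, majority)
    best_errors = sum(label != majority for label in y)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for row, t in zip(X, y) if row[feature] <= threshold]
        right = [t for row, t in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(t != left_label for t in left)
                  + sum(t != right_label for t in right))
        if errors < best_errors:
            best, best_errors = (feature, threshold, left_label, right_label), errors
    return best

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(X)) for _ in X]
        # Random feature choice decorrelates the individual learners.
        feature = rng.randrange(len(X[0]))
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over all learners."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature data: "N" = non-recurrence, "R" = recurrence.
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 1.9],
     [4.0, 5.0], [4.5, 5.2], [3.8, 4.8], [4.2, 5.1]]
y = ["N"] * 4 + ["R"] * 4
forest = train_forest(X, y)
print(predict_forest(forest, [1.1, 2.0]), predict_forest(forest, [4.1, 5.0]))
```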

ALGO 03

AdaBoost

F(x) = Σ αₜhₜ(x) | Loss: exp(-yᵢF(xᵢ))

Adaptive Boosting iteratively trains weak classifiers, weighting misclassified samples more heavily in each round. The final prediction is a weighted combination of all weak hypotheses. It minimises exponential loss to produce a robust, high-accuracy ensemble.
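The reweighting loop can be written out directly from the formula F(x) = Σ αₜhₜ(x). The sketch below uses threshold stumps as weak classifiers on a hypothetical ten-point, one-dimensional series with labels in {+1, -1}; no single threshold separates it, but three boosted stumps do.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` below the threshold and `-polarity` at or above it."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        model.append((alpha, threshold, polarity))
        # Misclassified points (y * h = -1) gain weight; correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Sign of the weighted vote F(x) = sum_t alpha_t * h_t(x)."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1

# Hypothetical toy series, sorted by x.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
predictions = [predict(model, x) for x in xs]
print([t for _, t, _ in model], predictions == ys)
```

Each round's stump misclassifies a different group of points, and the weighted vote of the three recovers every training label.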

ALGO 04

Extreme Gradient Boosting (XGBoost)

ŷᵢ = Σₖ fₖ(xᵢ) | Regularised tree ensemble

XGBoost builds trees sequentially, each correcting the errors of the previous one, with L1/L2 regularisation preventing overfitting. It is computationally efficient and handles class imbalance well — a critical property for this dataset's 76/24 class split.
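How regularisation enters each tree can be shown with the two standard second-order formulas from the published XGBoost formulation: the optimal leaf weight w* = -G / (H + λ) and the split gain ½[G_L²/(H_L + λ) + G_R²/(H_R + λ) - G²/(H + λ)] - γ, where G and H are sums of first and second derivatives of the loss. The sketch covers the L2 term only, and the four-sample example with λ = 1 is hypothetical, not the study's configuration.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value; a larger lambda shrinks it towards zero."""
    return -grad_sum / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the per-leaf complexity penalty gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Four hypothetical samples, logistic loss, initial prediction p = 0.5 for all.
labels = [1, 1, 0, 0]
probs = [0.5] * 4
grads = [p - y for p, y in zip(probs, labels)]  # first derivative of log loss
hess = [p * (1 - p) for p in probs]             # second derivative of log loss

# Candidate split: first two samples go left, last two go right.
g_left, h_left = sum(grads[:2]), sum(hess[:2])
g_right, h_right = sum(grads[2:]), sum(hess[2:])

w_left = leaf_weight(g_left, h_left)
w_right = leaf_weight(g_right, h_right)
gain = split_gain(g_left, h_left, g_right, h_right)
print(w_left, w_right, gain)
```

The left leaf, which holds the positive labels, receives a positive weight that raises its predicted log-odds, and the right leaf receives the mirror-image negative weight.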

Experimental Results

Performance Comparison

Models evaluated by Accuracy, Specificity, and Sensitivity derived from confusion matrices on train/test splits. Cross-validation applied throughout.

Algorithm	Accuracy	Specificity	Sensitivity
SVM	82.05%	22.22%	100.00%
AdaBoost	74.36%	22.22%	90.00%
Random Forest	76.92%	22.22%	93.33%
XGBoost ⭐	99.37%	100.00%	97.37%
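The three metrics follow directly from a confusion matrix. The sketch below defines them and applies them to illustrative counts. These counts are not taken from the study, which does not publish its confusion matrices; they are simply small integer counts that reproduce each reported percentage to two decimals.

```python
def metrics(tp, fn, tn, fp):
    """Return (accuracy, specificity, sensitivity) as percentages."""
    accuracy = 100 * (tp + tn) / (tp + fn + tn + fp)
    specificity = 100 * tn / (tn + fp)   # share of negatives correctly identified
    sensitivity = 100 * tp / (tp + fn)   # share of positives correctly identified
    return round(accuracy, 2), round(specificity, 2), round(sensitivity, 2)

# Illustrative counts only: chosen because they reproduce the reported figures.
illustrative_counts = {
    "SVM": dict(tp=30, fn=0, tn=2, fp=7),
    "AdaBoost": dict(tp=27, fn=3, tn=2, fp=7),
    "Random Forest": dict(tp=28, fn=2, tn=2, fp=7),
    "XGBoost": dict(tp=37, fn=1, tn=120, fp=0),
}

results = {name: metrics(**c) for name, c in illustrative_counts.items()}
for name, (acc, spec, sens) in results.items():
    print(f"{name}: accuracy={acc}% specificity={spec}% sensitivity={sens}%")
```

These illustrative counts total 39 cases for the first three models and 158 for XGBoost. Whether all four models were scored on the same split is not stated in the source, so the counts should be read as a consistency check on the arithmetic rather than as the study's data.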

🏆

Extreme Gradient Boosting Wins

XGBoost posted the highest accuracy and specificity of the four techniques, and its 97.37% sensitivity was second only to SVM's 100%. Critically, it achieved perfect specificity (100%), meaning it correctly identified every true non-recurrence case, while maintaining near-perfect sensitivity and accuracy. Its computational efficiency also exceeded that of competing algorithms.

99.37% Accuracy

100% Specificity

97.37% Sensitivity

Analysis

Reading the Results

📊

Specificity Gap

SVM, AdaBoost, and Random Forest all share an identical specificity of 22.22%, which suggests they struggle to correctly classify the minority "recurrence" class. The most likely cause is class imbalance (151 vs 47). XGBoost, with its regularisation and sequential error correction, reports 100% specificity and shows no such gap.

🎯

Why SVM Scores 100% Sensitivity

SVM achieves perfect sensitivity, but read alongside its 22.22% specificity this is less reassuring than it looks. If non-recurrence is the positive class, as the specificity gap suggests, SVM labels every non-recurrence patient correctly while catching only about one in five recurrence cases. It over-predicts the majority class, so for clinical use its errors fall on missed recurrences rather than on unnecessary treatment.

⚖️

The Accuracy–Balance Trade-off

Overall accuracy alone is misleading when class imbalance is present. XGBoost leads on accuracy and specificity and stays within three points of SVM's perfect sensitivity, which suggests it has learned the underlying pattern rather than exploiting class-frequency shortcuts.
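One way to make the imbalance point concrete is balanced accuracy, the mean of sensitivity and specificity, which weights both classes equally. It is not a metric the study reports; the sketch below derives it from the reported figures only.

```python
# Reported results as (accuracy, specificity, sensitivity), in percent.
reported = {
    "SVM": (82.05, 22.22, 100.00),
    "AdaBoost": (74.36, 22.22, 90.00),
    "Random Forest": (76.92, 22.22, 93.33),
    "XGBoost": (99.37, 100.00, 97.37),
}

# Balanced accuracy: the average of per-class recall, so a model cannot
# score well just by favouring the larger class.
balanced = {name: (spec + sens) / 2 for name, (_, spec, sens) in reported.items()}

for name, value in balanced.items():
    accuracy = reported[name][0]
    print(f"{name}: accuracy={accuracy:.2f}% balanced accuracy={value:.2f}%")
```

For the three weaker models the balanced figure lands roughly 18 to 21 points below raw accuracy, while for XGBoost the two are within a point of each other.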

⚡

Computational Efficiency

XGBoost also demonstrated superior computational speed compared to the other algorithms. This is significant for real-world clinical deployment where timely predictions from large patient databases are required at scale.

Literature Comparison

Study	Method	Reported Result
Delen et al. (SEER database, 202k records)	ANN, Decision Trees, Logistic Regression	~93%
Lundin et al. (951 patients)	ANN + Logistic Regression	5/10/15-yr survival
Ebrahimi & Razavi AR	SVM (declared best predictor)	SVM best in literature
This Study (Gandhi)	XGBoost	99.37%

Conclusion & Future Work

What We Proved

This study demonstrates that ML, and XGBoost specifically, can achieve near-perfect accuracy in predicting breast cancer recurrence on this dataset, exceeding the accuracies reported in the literature compared above, although those studies used different datasets and tasks.

✅

XGBoost Validated for Medical Prediction

With 99.37% accuracy and perfect specificity, XGBoost is demonstrated as the superior algorithm for this recurrence prediction task — both in accuracy and computational speed.

🔬

Real-World Data Successfully Processed

The 5-step preprocessing pipeline effectively cleaned noisy clinical data, handling missing values, outliers, and correlated features to maximise model performance.

🏥

Clinical Deployment Potential

An automatic, reliable recurrence prediction system built on this framework could directly assist oncologists in follow-up planning and early intervention, reducing late-stage recurrence deaths.

🔭

Future Work Directions

Expand dataset size beyond 198 observations; explore deep learning approaches (CNN, LSTM on imaging data); integrate genomic/proteomic features; and validate on diverse patient populations.

Key Takeaway

Machine learning — specifically Extreme Gradient Boosting — can serve as the automatic, reliable recurrence prediction system that clinical practice urgently needs, achieving 99.37% accuracy on real-world tumour data.

