A comparative analysis of SVM, Random Forest, AdaBoost, and XGBoost on breast cancer recurrence data from the University of Wisconsin Hospitals.
VINAY GANDHI
Breast cancer affects roughly 10% of women worldwide at some stage of their lives, with a 5-year survival rate of 88% post-diagnosis. However, recurrence after treatment remains a serious and unpredictable threat — one that existing manual diagnostic methods fail to address reliably. This project applies machine learning to bridge that gap.
Cancer may return locally, regionally, or distantly after treatment. Even oncologists cannot reliably predict recurrence from scans and records alone.
Four algorithms (SVM, Random Forest, AdaBoost, and XGBoost) were trained and evaluated against a real-world clinical dataset with 35 columns: a patient ID, the outcome label, and 33 predictors.
Extreme Gradient Boosting achieved 99.37% accuracy, significantly outperforming the other techniques across all three evaluation metrics.
U.S. women face a lifetime risk of developing breast cancer
Cancer diagnoses globally are breast cancer cases
88%: 5-year survival rate after diagnosis, dropping to 80% at 10 years
Incidence rates continue to rise globally with insufficient prediction tools
Data mining can significantly reduce both false positives and false negatives in clinical decision-making, enabling earlier and more reliable recurrence prediction.
Dataset sourced from the University of Wisconsin Hospitals, Madison, collected by Dr. William H. Wolberg. Available via UCI Machine Learning Repository.
| Column(s) | Description | Type | Notes |
|---|---|---|---|
| Col 1 | Patient ID Number | Identifier | Removed before modelling |
| Col 2 | Cancer Outcome — Recurrence / Non-Recurrence | Target | Binary class label |
| Col 3 | Time elapsed since first cancer occurrence | Temporal | Continuous numeric |
| Col 4–33 | Tumour measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension (mean, SE, max of each) | Feature | 30 numeric features from imaging (10 measurements × 3 statistics) |
| Col 34 | Tumour Size (diameter in cm) | Feature | Continuous numeric |
| Col 35 | Lymph Node Status (axillary nodes at surgery) | Feature | ⚠ 4 missing values |
Non-recurrence (151 of 198 patients): 76.3% of the dataset; treated patients whose cancer did not return.
Recurrence (47 of 198 patients): 23.7% of the dataset; class imbalance noted and handled during preprocessing and boosting.
Real-world clinical data — raw, uncleaned, and representative of actual hospital records.
Real-world medical data is noisy by nature — missing values, outliers, and correlated features all degrade model accuracy. A 5-step preprocessing pipeline was applied before any ML training.
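The source does not list the five steps individually, so the following is only a minimal sketch of two of the issues it names: missing values and correlated features. Median imputation, the 0.9 correlation threshold, the column names, and all values are illustrative assumptions, not the study's actual choices.

```python
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy values, not taken from the study's dataset.
lymph_nodes = impute_median([0, 2, None, 1, 5, None, 3])
features = {
    "radius_mean": [12.0, 14.5, 17.1, 20.3, 11.2],
    "perimeter_mean": [78.0, 94.1, 111.6, 133.0, 72.5],  # nearly proportional to radius
    "texture_mean": [20.1, 15.3, 22.8, 17.0, 19.4],
}
kept = drop_correlated(features)
print(lymph_nodes)
print(kept)
```

In this toy case the perimeter column is dropped because it is almost a linear function of the radius column. Per the text, such filtering was applied before SVM training but not before Random Forest training.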
Feature selection used a wrapper method, which iteratively adds or removes predictors and re-evaluates model performance to find the best-performing feature subset. Feature importance was visualised prior to selection.
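As a rough illustration of how a wrapper method searches, the sketch below implements greedy forward selection only (the add step; a full stepwise search would also try removals). The scoring function is a hypothetical stand-in for "train a model on this subset and return its cross-validated accuracy", and the gain values are invented for the example.

```python
def forward_select(candidates, score_fn, min_gain=1e-9):
    """Greedy forward selection: repeatedly add the feature that most improves score_fn."""
    selected, best_score = [], score_fn([])
    remaining = list(candidates)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score - best_score <= min_gain:
            break  # no remaining candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical per-feature gains; a real wrapper would retrain a model here.
TOY_GAIN = {"tumour_size": 0.10, "lymph_nodes": 0.06, "radius_mean": 0.03, "texture_se": 0.0}

def toy_score(subset):
    return round(0.70 + sum(TOY_GAIN[f] for f in subset), 4)

chosen, score = forward_select(list(TOY_GAIN), toy_score)
print(chosen, score)
```

The search stops as soon as no candidate raises the score, so the uninformative toy feature is never selected.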
The study ran models both with and without outlier removal. Accuracy was significantly lower when outliers were retained, confirming their detrimental effect. However, not all outliers were deleted — those that helped models learn edge behaviours were preserved.
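The text does not say how outliers were detected. One common convention, assumed here purely for illustration, is the 1.5 × IQR rule; the sketch flags candidates rather than deleting them, which matches the idea of reviewing outliers before deciding which to keep. The values are toy data.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Mark each value True if it falls outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v < low or v > high for v in values]

tumour_size_cm = [1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 3.5, 10.0]  # toy values
flags = flag_outliers(tumour_size_cm)
print(list(zip(tumour_size_cm, flags)))
```

Only the 10.0 cm value is flagged in this example; whether to drop or retain it would remain a modelling decision.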
SVM finds an optimal separating hyperplane between recurrence and non-recurrence classes by maximising the margin between support vectors — the nearest data points to the boundary. Two kernels were implemented: a linear kernel for linearly separable data and a Gaussian (RBF) kernel for non-linear boundaries. Correlated features were removed prior to SVM training.
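To make the two kernels concrete, here is a small sketch of the linear and Gaussian (RBF) kernel functions and of the SVM decision value built from them. The support vectors, coefficients, bias, and `gamma` are hand-picked for illustration and are not fitted to any data.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, dual_coefs, bias, kernel):
    """f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b; the sign of f(x) gives the class."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias

# Toy support vectors and coefficients (alpha_i * y_i), chosen by hand.
svs = [(1.0, 1.0), (3.0, 3.0)]
coefs = [-1.0, 1.0]

near_positive = decision_value((3.0, 2.5), svs, coefs, 0.0, rbf_kernel)
near_negative = decision_value((1.0, 1.5), svs, coefs, 0.0, rbf_kernel)
print(near_positive, near_negative)
```

A point close to the positively weighted support vector gets a positive decision value and one close to the negatively weighted vector gets a negative value; swapping in `linear_kernel` changes only the similarity measure, not the decision rule.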
Random Forest is an ensemble method that builds multiple decision trees on bootstrapped samples and aggregates their votes. This "divide and conquer" approach combines many individually unstable trees into a single stronger predictor, reducing variance and overfitting. Correlated variables were retained for Random Forest training because the algorithm tolerates multicollinearity well.
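The bootstrap-and-vote idea can be sketched in a few lines. This is a deliberately simplified illustration: each "tree" is a one-feature threshold stump, and the random feature subsetting that a real Random Forest performs at each split is omitted. The data and the tree count of 25 are assumptions for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Pick the threshold on the single feature that minimises training errors."""
    best_errors, best_threshold = None, None
    for threshold in sorted({x for x, _ in rows}):
        errors = sum((x > threshold) != label for x, label in rows)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold

def forest_predict(thresholds, x):
    """Majority vote over all stumps; each stump predicts True when x > threshold."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy rows: (feature value, recurrence label).
rows = [(1.0, False), (1.5, False), (2.0, False), (2.4, False),
        (3.6, True), (4.0, True), (4.5, True), (5.0, True)]
rng = random.Random(0)
thresholds = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(forest_predict(thresholds, 0.5), forest_predict(thresholds, 6.0))
```

Because every stump sees a different resample, the individual thresholds vary, and the vote smooths out that variation.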
Adaptive Boosting iteratively trains weak classifiers, weighting misclassified samples more heavily in each round. The final prediction is a weighted combination of all weak hypotheses. It minimises exponential loss to produce a robust, high-accuracy ensemble.
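The reweighting step described above can be shown for a single round of discrete AdaBoost. This is a sketch under the usual textbook formulation (labels coded as +1/-1, weighted error strictly between 0 and 1); the five-sample data are invented.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round; returns the learner weight alpha and updated sample weights."""
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Correct predictions (t * p = +1) shrink; mistakes (t * p = -1) grow.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [+1, +1, -1, -1, -1]
y_pred = [+1, -1, -1, -1, -1]  # the weak learner misclassifies the second sample
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. The final ensemble prediction is the sign of the alpha-weighted sum of the weak learners' outputs.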
XGBoost builds trees sequentially, each correcting the errors of the previous one, with L1/L2 regularisation preventing overfitting. It is computationally efficient and handles class imbalance well — a critical property for this dataset's 76/24 class split.
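To show where the L1/L2 penalties enter, the sketch below computes the value of a single tree leaf for logistic loss using the standard second-order formula, weight = -soft_threshold(G, alpha) / (H + lambda), where G and H are the summed gradients and Hessians of the samples in the leaf. The margins, labels, and penalty values are illustrative assumptions, and the function names are hypothetical rather than part of the XGBoost API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(g, alpha):
    """Shrink g towards zero by alpha (the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(margins, labels, reg_lambda=1.0, reg_alpha=0.0):
    """Regularised leaf value for logistic loss given current margins and 0/1 labels."""
    probs = [sigmoid(m) for m in margins]
    grad = sum(p - y for p, y in zip(probs, labels))   # G: sum of first derivatives
    hess = sum(p * (1 - p) for p in probs)             # H: sum of second derivatives
    return -soft_threshold(grad, reg_alpha) / (hess + reg_lambda)

margins = [0.0, 0.0, 0.0, 0.0]  # current model output before this tree
labels = [1, 1, 1, 0]           # three of the four samples in the leaf are positive

w_unreg = leaf_weight(margins, labels, reg_lambda=0.0)
w_l2 = leaf_weight(margins, labels, reg_lambda=1.0)
w_l1_l2 = leaf_weight(margins, labels, reg_lambda=1.0, reg_alpha=0.5)
print(w_unreg, w_l2, w_l1_l2)
```

Each added penalty pulls the correction closer to zero, which is how regularisation limits how aggressively any single tree can fit the residual errors.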
Models were evaluated on accuracy, specificity, and sensitivity, each derived from confusion matrices on train/test splits. Cross-validation was applied throughout.
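For reference, the three metrics follow directly from the four confusion-matrix counts. The sketch below makes the choice of positive class explicit, since sensitivity and specificity swap when that choice is reversed; the labels "R" and "N" and the ten toy predictions are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn) with respect to the given positive label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity (TP rate), and specificity (TN rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy labels: "R" = recurrence, "N" = non-recurrence.
y_true = ["R"] * 3 + ["N"] * 7
y_pred = ["R", "R", "N"] + ["N"] * 6 + ["R"]

as_recurrence = metrics(y_true, y_pred, positive="R")
as_non_recurrence = metrics(y_true, y_pred, positive="N")
print(as_recurrence)
print(as_non_recurrence)
```

Accuracy is the same under both conventions, while sensitivity and specificity trade places, so reported values are only interpretable once the positive class is stated. The functions assume both classes appear in `y_true`.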
XGBoost outperformed all other techniques across every evaluation metric. Critically, it achieved perfect specificity (100%) — meaning it correctly identified every true non-recurrence case — while maintaining near-perfect sensitivity and accuracy. Its computational efficiency also exceeded that of competing algorithms.
SVM, AdaBoost, and Random Forest all share an identical specificity of 22.22%, suggesting they struggle to correctly classify the minority "recurrence" class. This pattern is consistent with the class imbalance in the data (151 non-recurrence vs 47 recurrence cases). In this evaluation, XGBoost's regularisation and sequential error correction did not show the same weakness.
SVM achieves perfect sensitivity — it never misses a recurrence case — but at the cost of specificity. It over-classifies patients as recurrence-positive (high false positive rate), making it overly conservative. For clinical use, this trades unnecessary treatment against missed recurrences.
Overall accuracy alone is misleading when class imbalance is present. XGBoost wins on all three metrics simultaneously — accuracy, specificity, and sensitivity — demonstrating it has truly learned the underlying pattern rather than exploiting class frequency shortcuts.
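A quick calculation with the class counts reported for this dataset (151 non-recurrence, 47 recurrence) shows why. The "model" below is a hypothetical baseline that always predicts non-recurrence, and recurrence is treated as the positive class for the purpose of the example.

```python
n_non_recurrence, n_recurrence = 151, 47

y_true = ["N"] * n_non_recurrence + ["R"] * n_recurrence
y_pred = ["N"] * len(y_true)  # baseline: always predict non-recurrence

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# With recurrence as the positive class:
true_pos = sum(t == "R" and p == "R" for t, p in zip(y_true, y_pred))
true_neg = sum(t == "N" and p == "N" for t, p in zip(y_true, y_pred))
sensitivity = true_pos / n_recurrence
specificity = true_neg / n_non_recurrence

print(f"accuracy={accuracy:.1%} sensitivity={sensitivity:.1%} specificity={specificity:.1%}")
```

This baseline reaches about 76.3% accuracy while detecting no recurrence cases at all, which is why accuracy needs to be read alongside sensitivity and specificity.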
XGBoost also demonstrated superior computational speed compared to the other algorithms. This is significant for real-world clinical deployment where timely predictions from large patient databases are required at scale.
| Study | Method | Best accuracy / reported outcome |
|---|---|---|
| Delen et al. (SEER database, 202k records) | ANN, Decision Trees, Logistic Regression | ~93% |
| Lundin et al. (951 patients) | ANN + Logistic Regression | 5/10/15-yr survival |
| Ebrahimi & Razavi AR | SVM (declared best predictor) | SVM best in literature |
| This Study (Gandhi) | XGBoost | 99.37% |
This study demonstrates that ML, and XGBoost specifically, can achieve near-perfect accuracy in predicting breast cancer recurrence on this dataset, exceeding the best accuracies reported by the prior studies compared above.
With 99.37% accuracy and perfect specificity, XGBoost is demonstrated as the superior algorithm for this recurrence prediction task — both in accuracy and computational speed.
The 5-step preprocessing pipeline effectively cleaned noisy clinical data, handling missing values, outliers, and correlated features to maximise model performance.
An automatic, reliable recurrence prediction system built on this framework could directly assist oncologists in follow-up planning and early intervention, reducing late-stage recurrence deaths.
Expand dataset size beyond 198 observations; explore deep learning approaches (CNN, LSTM on imaging data); integrate genomic/proteomic features; and validate on diverse patient populations.
Machine learning — specifically Extreme Gradient Boosting — can serve as the automatic, reliable recurrence prediction system that clinical practice urgently needs, achieving 99.37% accuracy on real-world tumour data.