01 Bias–variance tradeoff
High bias (underfitting): the model class is too simple to capture the signal. High variance (overfitting): the model fits noise in the training set. Expected error decomposes (exactly for squared loss, conceptually otherwise) into bias² + variance + irreducible noise. Tuning model complexity trades off the first two; the noise term is a floor that no model can beat.
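For reference, the squared-loss form of the decomposition can be written out as below. The notation is an assumption, not from the notes above: data are taken to follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², f̂ is the model fit on a random training set, and expectations run over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```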
Training error typically keeps falling as complexity grows, while test error often follows a U-shape. Early stopping, regularization, and cross-validation are all ways to land near the bottom of that U.
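A minimal sketch of the train/test error curves, using only the standard library. The k-nearest-neighbour regressor, the noisy sine data, and all function names are illustrative assumptions; here a smaller k means a more flexible model.

```python
import math
import random

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def knn_mse(train_x, train_y, eval_x, eval_y, k):
    """Mean squared error of the k-NN regressor on (eval_x, eval_y)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2
              for x, y in zip(eval_x, eval_y)]
    return sum(errors) / len(errors)

def make_data(n, rng):
    """Hypothetical data: a sine wave plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(60, rng)
test_x, test_y = make_data(200, rng)

# Small k = flexible model (high variance); large k = rigid model (high bias).
curve = {}
for k in (1, 3, 7, 15, 30, 60):
    train_err = knn_mse(train_x, train_y, train_x, train_y, k)
    test_err = knn_mse(train_x, train_y, test_x, test_y, k)
    curve[k] = (train_err, test_err)
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k=1 each training point is its own nearest neighbour, so training error is zero by construction; the test column is the one to read when picking k.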
02 Train / validation / test
Split data so that hyperparameters are chosen on validation data, and final reporting uses a held-out test set touched only once. When data is scarce, K-fold CV rotates the validation role across k folds and averages the scores, which reduces the variance of the score estimate compared with a single split.
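The K-fold procedure can be sketched in plain Python as follows. The helper names (`kfold_splits`, `cross_validate`, `fit_mean`) and the toy mean-predictor model are hypothetical, chosen only to keep the example self-contained.

```python
import random

def kfold_splits(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx); each sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def fit_mean(xs, ys):
    """Toy 'model': always predict the mean training target."""
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    """Squared error of the constant prediction stored in `model`."""
    return sum((model - y) ** 2 for y in ys) / len(ys)

def cross_validate(xs, ys, fit, score, n_folds=5):
    """Fit on k-1 folds, score on the held-out fold, and average over folds."""
    fold_scores = []
    for train_idx, val_idx in kfold_splits(len(xs), n_folds):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_scores.append(
            score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(fold_scores) / len(fold_scores), fold_scores

xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
mean_score, fold_scores = cross_validate(xs, ys, fit_mean, mse)
print(f"CV MSE: {mean_score:.2f} over {len(fold_scores)} folds")
```

Any hyperparameter search would call `cross_validate` once per candidate setting and keep the test set out of the loop entirely.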
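For time series, validate on chunks that come after the training period. Random splits let the model train on observations that lie in the future of its validation points, which leaks information and inflates scores.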
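One common way to respect time order is an expanding-window split, sketched below under the assumption that samples are already sorted by time. The function name and its parameters are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, min_train):
    """Yield (train_idx, val_idx) where validation always follows training in time."""
    fold_size = (n_samples - min_train) // n_splits
    if fold_size < 1:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        train_end = min_train + i * fold_size
        val_end = train_end + fold_size
        # Train on everything up to train_end, validate on the next chunk.
        yield list(range(train_end)), list(range(train_end, val_end))

splits = list(expanding_window_splits(n_samples=12, n_splits=3, min_train=3))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  validate {val_idx[0]}..{val_idx[-1]}")
```

Each fold trains on a longer history than the previous one, and no validation index ever precedes a training index.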
03 Metrics (what “good” means)
| Setting | Common metrics | Notes |
|---|---|---|
| Binary classification | ROC-AUC, PR-AUC, F1 | Prefer PR-AUC when positives are rare; ROC-AUC can look optimistic because the many true negatives keep the false-positive rate low. |
| Multi-class | Macro/micro F1, log loss | Macro treats classes equally; micro follows global counts. |
| Regression | RMSE, MAE, MAPE | MAPE blows up when targets are near zero; prefer MAE or a robust loss (e.g. Huber) when outliers matter. |
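
To make the macro/micro distinction in the table concrete, here is a small sketch that computes both F1 averages from scratch. The function name and the toy labels are assumptions for illustration; the example uses a classifier that always predicts the majority class.

```python
def f1_macro_micro(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class_f1) / len(per_class_f1)          # every class counts equally
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)    # pooled global counts
    return macro, micro

# Imbalanced toy labels: the "model" always predicts the majority class "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
macro_f1, micro_f1 = f1_macro_micro(y_true, y_pred)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

Macro F1 is pulled down by the ignored minority class, while micro F1 follows the pooled counts and stays high.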
AUC summarizes ranking quality across thresholds; it does not replace calibration when you need accurate probabilities.
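A short sketch of why ranking quality and calibration are separate concerns. The pairwise AUC helper, the Brier score (used here as one calibration-sensitive metric, an addition not named above), and the toy scores are all illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(labels, probs):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

labels = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
# A monotone distortion keeps the ranking but changes the probabilities.
distorted = [p ** 3 for p in probs]

auc_original, auc_distorted = roc_auc(labels, probs), roc_auc(labels, distorted)
brier_original, brier_distorted = brier(labels, probs), brier(labels, distorted)
print(f"AUC:   {auc_original:.3f} vs {auc_distorted:.3f}")
print(f"Brier: {brier_original:.3f} vs {brier_distorted:.3f}")
```

The distorted scores keep the same ordering, so AUC is unchanged, yet their Brier score differs; a model can rank well and still output poor probabilities.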
04 Ensembles
Bagging (e.g. random forests) reduces variance by averaging many high-variance learners trained on bootstrap resamples. Boosting (e.g. gradient boosting) reduces bias by adding learners sequentially, each fit to the errors of the current ensemble; it is often a top performer on tabular data. Stacking trains a meta-model on base-model predictions; it is powerful but overfits easily unless those predictions are generated out-of-fold.
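The "fit the errors sequentially" idea behind boosting can be sketched with one-split regression stumps and squared loss. This is a toy illustration under stated assumptions (1-D inputs, noise-free sine targets, hypothetical function names and learning rate), not how production libraries implement it.

```python
import math

def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on 1-D inputs."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((thr, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

xs = [i / 20 for i in range(20)]
ys = [math.sin(2 * math.pi * x) for x in xs]
base, stumps, history = boost(xs, ys)
print(f"train MSE after round 1: {history[0]:.4f}, after round 50: {history[-1]:.4f}")
```

Training error shrinks as rounds accumulate, which is the bias reduction described above; on real data the number of rounds and the learning rate would be chosen on validation data to avoid overfitting.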