Random Forest + Cross-Validation — Interactive Lab

A pedagogical companion to "Introduction to Machine Learning: Random Forest Regression"

Five rotating exams instead of one lucky split

The post predicts Bolivia's Municipal Sustainable Development Index (IMDS) from 64-dimensional satellite-image embeddings, across 339 municipalities. Instead of a single train/test split, it evaluates the Random Forest with 5-fold cross-validation: the data are cut into five folds, and each fold takes a turn as the test set while the model trains on the other four. Every municipality is therefore predicted exactly once, by a forest that never saw it. The headline result is a pooled out-of-fold R² of 0.225, with a per-fold mean ± SD of 0.224 ± 0.173: real but limited signal, and genuinely unstable across slices of the country.
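The procedure above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the post's actual code: the data are synthetic stand-ins with the same shape (339 rows, 64 features), and the seeds, the number of trees, and the weak signal placed in column 30 are all assumptions made for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the real data (assumption): 339 rows, 64 features,
# with a weak signal in one column so that R² stays modest.
rng = np.random.default_rng(0)
X = rng.normal(size=(339, 64))
y = 50 + 3 * X[:, 30] + rng.normal(scale=6, size=339)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Each row's prediction comes from the forest trained on the other four folds.
oof_pred = cross_val_predict(model, X, y, cv=cv)

# Pooled R²: one score over all out-of-fold predictions.
pooled_r2 = r2_score(y, oof_pred)

# Per-fold R²: one score per held-out fold (same splits, since the seed is fixed).
fold_r2 = np.array([r2_score(y[test], oof_pred[test]) for _, test in cv.split(X)])

print(f"pooled OOF R2: {pooled_r2:.3f}")
print(f"per-fold R2:   {fold_r2.mean():.3f} +/- {fold_r2.std(ddof=1):.3f}")
```

The numbers printed here describe the synthetic data only; they are not expected to reproduce the post's 0.225 or 0.224 ± 0.173.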

Pick a round — watch the test fold rotate

In each round, one fold (orange) is held out for testing and the other four (steel) train the model. Step through all five and notice that every fold is the test set exactly once.
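The rotation itself needs nothing more than shuffled row indices. The sketch below uses plain NumPy; the seed and the use of `np.array_split` are illustrative choices, not necessarily how the post builds its folds.

```python
import numpy as np

n_municipalities, k = 339, 5
rng = np.random.default_rng(42)            # arbitrary seed (assumption)
shuffled = rng.permutation(n_municipalities)
folds = np.array_split(shuffled, k)        # 339 = 4 * 68 + 67

times_tested = np.zeros(n_municipalities, dtype=int)
for round_idx in range(k):
    test_idx = folds[round_idx]            # the held-out fold for this round
    train_idx = np.concatenate(
        [fold for i, fold in enumerate(folds) if i != round_idx]
    )                                      # the other four folds
    times_tested[test_idx] += 1
    print(f"round {round_idx + 1}: train={len(train_idx)}, test={len(test_idx)}")

print("every row tested exactly once:", bool(np.all(times_tested == 1)))
```

With 339 rows, four folds hold 68 municipalities and one holds 67, so each round trains on 271 or 272 rows.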

Per-fold metrics

The five fold scores, and why the standard deviation matters as much as the mean. One fold even scores a negative R².

Out-of-fold predictions

All 339 predictions, colored by fold. See regression to the mean, and compare the predicted vs actual distribution.

Feature importance

Which embedding dimensions the forest leans on — A30 dominates — and how little hyperparameter tuning buys.

Glossary (open a card if a term is unfamiliar)

Cross-validation (k-fold)
Cut the data into k folds; each fold is the test set once while the model trains on the rest. Averages out the luck of a single split.
Out-of-fold (OOF) prediction
Each observation's prediction from the round in which it was held out. Every town predicted once, by a model that never saw it.
Random Forest
An ensemble of decorrelated decision trees: bagging + random feature subsets. Prediction = the average of all B trees.
Bagging
Bootstrap aggregating: train B trees on B resamples and average them. Reduces variance without inflating bias.
Pooled vs averaged R²
Pooled = one R² over all OOF predictions; averaged = mean of the per-fold R². Report the per-fold mean ± SD for honesty.
Train/test split
A single held-out test set. On small data the score is a lottery — see Appendix A of the post; we use CV instead.
R², RMSE, MAE
R² is the fraction of variance explained (negative = worse than predicting the mean); RMSE penalises large errors; MAE is the average absolute error. Pooled OOF: R² = 0.225, RMSE = 5.95, MAE = 4.42 (the last two in IMDS points).
Permutation importance
Shuffle one feature, measure the drop in R². Less biased than impurity-based MDI. A30 dominates here.
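The three metrics and the pooled-versus-averaged distinction from the glossary can be written out directly. The functions follow the standard definitions; the arrays are hypothetical out-of-fold values generated for the demo, not the post's data.

```python
import numpy as np

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error: squaring weights large errors more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical OOF arrays (assumption): noisy, shrunken predictions.
rng = np.random.default_rng(1)
y_true = rng.normal(50, 7, size=339)
y_pred = 50 + 0.25 * (y_true - 50) + rng.normal(0, 3, size=339)
fold_id = rng.permutation(np.arange(339) % 5)   # fold label (0..4) per row

# Pooled: a single R² over all 339 out-of-fold predictions.
pooled = r2(y_true, y_pred)

# Averaged: one R² per fold, then the mean and SD of those five numbers.
per_fold = np.array(
    [r2(y_true[fold_id == f], y_pred[fold_id == f]) for f in range(5)]
)

print(f"pooled R2:   {pooled:.3f}")
print(f"per-fold R2: {per_fold.mean():.3f} +/- {per_fold.std(ddof=1):.3f}")
print(f"RMSE: {rmse(y_true, y_pred):.2f}  MAE: {mae(y_true, y_pred):.2f}")
```

The pooled and averaged values are usually close but not identical, because each fold's R² is normalised by that fold's own variance.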

Performance swings from fold to fold — so report the spread

The five rounds produce five of each metric. Reporting only the mean would hide the most important part: the model's quality is genuinely unstable across different slices of Bolivia. Toggle a metric and watch the bars against the mean (dashed) and the ±1 standard-deviation band.

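[Interactive chart: bars of the selected metric for each fold, with the mean (dashed) and the ±1 standard-deviation band. Summary cards report the mean across folds, the standard deviation (the spread you must report), the worst fold, and the best fold.]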

What to look for

  • R² ranges from negative to ~0.45. On one fold the forest does worse than predicting the national average — a single train/test split could have shown you only the good fold or only the bad one.
  • The standard deviation (≈ 0.17) is almost as large as the mean (≈ 0.22). "R² = 0.22" is true and misleading; "0.224 ± 0.173" is honest.
  • The three metrics partly disagree about which fold is hardest. R² is measured relative to each fold's own variance, so a fold of unusually similar towns can score a low R² even when its absolute errors are small.
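A small numeric sketch shows how R² depends on a fold's own variance. Both toy folds below are invented for illustration: they receive exactly the same prediction errors (3 IMDS points each), yet their R² values differ sharply because one fold's towns are far more alike than the other's.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Identical errors in both folds (hypothetical values).
errors = np.array([3.0, -3.0, 3.0, -3.0, 3.0, -3.0])

toy_folds = {
    "diverse": np.array([30.0, 40.0, 45.0, 55.0, 60.0, 70.0]),  # wide spread
    "similar": np.array([48.0, 49.0, 50.0, 50.0, 51.0, 52.0]),  # narrow spread
}

results = {}
for name, y_true in toy_folds.items():
    y_pred = y_true + errors
    results[name] = {
        "mae": float(np.mean(np.abs(y_true - y_pred))),
        "r2": float(r2(y_true, y_pred)),
    }
    print(f"{name}: MAE={results[name]['mae']:.1f}, R2={results[name]['r2']:.3f}")

# diverse: MAE=3.0, R2=0.949
# similar: MAE=3.0, R2=-4.400
```

MAE is 3.0 in both folds, but the residual sum of squares (54) is small next to the diverse fold's total variation (1050) and large next to the similar fold's (10). This is why R² and the absolute-error metrics can rank folds differently.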

All 339 predictions — colored by the fold that produced them

Because every municipality has an out-of-fold prediction, we can plot all 339 at once — not just a small test slice. Switch between the actual-vs-predicted scatter and the distribution overlay, and filter folds to see how the rounds interleave.

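[Interactive chart: actual-vs-predicted scatter or distribution overlay, with a view toggle and fold filters. Summary cards report the pooled OOF R² over all 339 points, the pooled RMSE and MAE in IMDS points, the predicted vs actual SD (variance compression), and a KS test (distributions differ).]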

What to look for

  • The cloud is flatter than the 45° line. Low-IMDS towns are predicted too high and high-IMDS towns too low — regression to the mean, the fingerprint of a low-signal model.
  • The fold colors are thoroughly mixed. No region of the plot belongs to one fold — confirmation that shuffling spread the municipalities evenly, so the per-fold metrics are trustworthy.
  • In the distribution view, the means match but the widths do not. The predicted spread is about half the actual (a ~48% reduction), and a KS test rejects equal distributions — the expected behavior of a model that explains only ~22% of the variance.
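The diagnostics in this list (a slope flatter than 45°, a compressed predicted spread, and a two-sample KS comparison) can be computed as follows. The arrays are hypothetical, and the KS function is a hand-rolled statistic only, written here to keep the sketch dependency-free; it does not produce a p-value.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Hypothetical actual and out-of-fold predicted values (assumption):
# predictions track the truth weakly and are pulled toward the mean.
rng = np.random.default_rng(2)
actual = rng.normal(50, 7, size=339)
predicted = 50 + 0.3 * (actual - 50) + rng.normal(0, 2.5, size=339)

# Slope of predicted on actual: below 1 means the cloud is flatter than 45°.
slope = float(np.polyfit(actual, predicted, 1)[0])

# Ratio of spreads: below 1 means the predictions are variance-compressed.
sd_ratio = float(predicted.std(ddof=1) / actual.std(ddof=1))

print(f"slope of predicted vs actual: {slope:.2f}")
print(f"predicted SD / actual SD:     {sd_ratio:.2f}")
print(f"mean difference:              {predicted.mean() - actual.mean():.2f}")
print(f"KS statistic:                 {ks_statistic(actual, predicted):.3f}")
```

For a significance test rather than the bare statistic, `scipy.stats.ks_2samp` returns both the statistic and a p-value.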

Which satellite features matter — and how little tuning helps

Both importance measures, computed on a baseline forest fit on all 339 municipalities, single out embedding dimension A30, with A59 a distant second. Toggle the method to compare the impurity-based (MDI) and permutation rankings.

[Interactive chart: feature-importance bars, with a toggle between the MDI and permutation methods.]

What to look for

  • A30 dominates. Shuffling it alone costs about 0.25 in R² — larger than the model's entire out-of-fold R², because removing the best feature drags the forest below the mean-prediction baseline.
  • Two different methods agree. MDI and permutation both crown A30, which is reassuring — it is genuine signal, not a counting artifact of one method.
  • Importance has a long tail. After A30 and A59 the bars decline gently: the forest reads many faint patterns anchored by one strong one.
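Both importance measures are available in scikit-learn, and a comparison along these lines might look like the sketch below. The data are synthetic: a strong signal is planted in column 30 and a weaker one in column 59 purely to mimic the pattern described, and the zero-based `A00`..`A63` column names are an assumption about how the embedding dimensions are labelled.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in (assumption): signal planted in columns 30 and 59.
rng = np.random.default_rng(0)
X = rng.normal(size=(339, 64))
y = 50 + 4 * X[:, 30] + 1.5 * X[:, 59] + rng.normal(scale=5, size=339)
names = [f"A{j:02d}" for j in range(64)]   # hypothetical column names

# Baseline forest fit on all rows, as in the text.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# MDI: impurity-based importance accumulated while the trees were grown.
mdi = forest.feature_importances_

# Permutation: shuffle one column at a time and record the drop in R².
perm = permutation_importance(
    forest, X, y, scoring="r2", n_repeats=5, random_state=0
)

top_mdi = names[int(np.argmax(mdi))]
top_perm = names[int(np.argmax(perm.importances_mean))]
print("top feature by MDI:        ", top_mdi)
print("top feature by permutation:", top_perm)

# Five largest permutation importances, to show the tail.
for j in np.argsort(perm.importances_mean)[::-1][:5]:
    print(f"{names[j]}: drop in R2 = {perm.importances_mean[j]:.3f}")
```

Because this sketch scores the forest on the same rows it was trained on, the permutation drop is measured against an in-sample R², which is typically much higher than the out-of-fold one.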

Does tuning rescue the model? (Appendix B)

Grid search, random search, and Optuna all optimize the same 5-fold cross-validated R². They rank as theory predicts — Optuna ≥ random ≥ grid — but every gain over the untuned baseline is smaller than the 0.17 fold-to-fold standard deviation. The tuning gain is smaller than the noise.
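The comparison of tuning gain against fold-to-fold noise can be sketched as below. This is not the post's tuning code: the data are synthetic, and the "tuned" configuration is a single hand-picked hypothetical setting standing in for whatever grid search, random search, or Optuna would return.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(339, 64))
y = 50 + 3 * X[:, 30] + rng.normal(scale=6, size=339)

# The same five folds are reused for both configurations.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

def cv_r2(**params):
    """Per-fold R² for a forest with the given hyperparameters."""
    model = RandomForestRegressor(n_estimators=50, random_state=42, **params)
    return cross_val_score(model, X, y, cv=cv, scoring="r2")

baseline = cv_r2()                                        # library defaults
tuned = cv_r2(max_features="sqrt", min_samples_leaf=3)    # hypothetical "tuned" setting

gain = float(tuned.mean() - baseline.mean())
noise = float(baseline.std(ddof=1))        # fold-to-fold SD of the baseline

print(f"baseline: {baseline.mean():.3f} +/- {noise:.3f}")
print(f"tuned:    {tuned.mean():.3f} +/- {tuned.std(ddof=1):.3f}")
print(f"gain:     {gain:+.3f}")
print("gain exceeds fold-to-fold SD:", abs(gain) > noise)
```

Reading the gain next to the standard deviation, rather than on its own, is the check the paragraph above describes.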