Five rotating exams instead of one lucky split
The post predicts Bolivia's Municipal Sustainable Development Index (IMDS) from 64-dimensional satellite-image embeddings, across 339 municipalities. Instead of a single train/test split, it evaluates the Random Forest with 5-fold cross-validation: the data are cut into five folds, and each fold takes a turn as the test set while the model trains on the other four. Every municipality is therefore predicted exactly once — by a forest that never saw it. The headline result is a pooled out-of-fold R² of 0.22, with a per-fold spread of 0.224 ± 0.173 — real but limited signal, and genuinely unstable across slices of the country.
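The rotation itself can be sketched in a few lines of plain Python. This is a minimal illustration, not the post's code: the helper name `kfold_indices`, the seed, and the shuffle-then-deal scheme are assumptions, and only the counts (339 municipalities, 5 folds) come from the text.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once.

    Hypothetical helper for illustration: shuffle once, then deal the
    shuffled indices into n_folds nearly equal folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        # The training set is everything outside the current test fold.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


splits = list(kfold_indices(339, n_folds=5))
fold_sizes = [len(test) for _, test in splits]
tested = sorted(j for _, test in splits for j in test)
```

With 339 municipalities the folds hold 68, 68, 68, 68, and 67 indices, and `tested` covers every index exactly once, which is what makes one out-of-fold prediction per municipality possible.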
Pick a round — watch the test fold rotate
In each round, one fold (orange) is held out for testing and the other four (steel) train the model. Step through all five and notice that every fold is the test set exactly once.
Per-fold metrics
The five fold scores, and why the standard deviation matters as much as the mean. One fold even scores a negative R².
Out-of-fold predictions
All 339 predictions, colored by fold. See regression to the mean, and compare the predicted vs actual distribution.
Feature importance
Which embedding dimensions the forest leans on — A30 dominates — and how little hyperparameter tuning buys.
Glossary
Cross-validation (k-fold)
Out-of-fold (OOF) prediction
Random Forest
Bagging
Pooled vs averaged R²
Train/test split
R², RMSE, MAE
Permutation importance
Performance swings from fold to fold — so report the spread
The five rounds produce five of each metric. Reporting only the mean would hide the most important part: the model's quality is genuinely unstable across different slices of Bolivia. Toggle a metric and watch the bars against the mean (dashed) and the ±1 standard-deviation band.
What to look for
- R² ranges from negative to ~0.45. On one fold the forest does worse than predicting the national average — a single train/test split could have shown you only the good fold or only the bad one.
- The standard deviation (≈ 0.17) is almost as large as the mean (≈ 0.22). "R² = 0.22" is true but misleading; "0.224 ± 0.173" is honest.
- The three metrics partly disagree about which fold is hardest. R² is measured relative to each fold's own variance, so a fold of unusually similar towns can score a low R² even when its absolute errors are small.
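The dependence of R² on each fold's own variance, and the difference between pooled and averaged R², can be checked with toy numbers. The values below are made up for illustration and are not the post's results. Both folds miss every target by exactly 1.0, and the only difference is how spread out the actual values are.

```python
from statistics import mean, stdev


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    centre = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - centre) ** 2 for a in actual)
    return 1.0 - sse / sst


# Toy (actual, predicted) pairs; every prediction is off by exactly 1.0.
folds = {
    "spread-out fold": ([40.0, 50.0, 60.0], [41.0, 49.0, 61.0]),
    "similar-towns fold": ([50.0, 51.0, 52.0], [51.0, 50.0, 53.0]),
}

per_fold = {name: r2_score(a, p) for name, (a, p) in folds.items()}
scores = list(per_fold.values())

# Averaged R²: mean and standard deviation of the per-fold scores.
averaged = mean(scores)
spread = stdev(scores)

# Pooled R²: concatenate all out-of-fold predictions, then score once.
all_actual = [a for acts, _ in folds.values() for a in acts]
all_pred = [p for _, preds in folds.values() for p in preds]
pooled = r2_score(all_actual, all_pred)
```

The spread-out fold scores R² = 0.985 and the similar-towns fold scores R² = -0.5, although the absolute errors are identical. The averaged R² is 0.2425, while the pooled R² is about 0.97, so the two summaries can differ substantially and it is worth stating which one is being reported.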
All 339 predictions — colored by the fold that produced them
Because every municipality has an out-of-fold prediction, we can plot all 339 at once — not just a small test slice. Switch between the actual-vs-predicted scatter and the distribution overlay, and filter folds to see how the rounds interleave.
What to look for
- The cloud is flatter than the 45° line. Low-IMDS towns are predicted too high and high-IMDS towns too low — regression to the mean, the fingerprint of a low-signal model.
- The fold colors are thoroughly mixed. No region of the plot belongs to one fold, which is consistent with the shuffle having spread municipalities evenly across folds, so the per-fold metrics are comparable with one another.
- In the distribution view, the means match but the widths do not. The predicted spread is about half the actual (a ~48% reduction), and a KS test rejects equal distributions — the expected behavior of a model that explains only ~22% of the variance.
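The link between low R² and a compressed predicted distribution can be illustrated with a simpler stand-in. The sketch below is an assumption-laden toy: it fits an ordinary least-squares line in-sample on synthetic data, not the post's Random Forest or its out-of-fold predictions. For such a fit, the ratio of predicted to actual standard deviation equals the square root of R², and the predicted mean equals the actual mean.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)

# Synthetic stand-in data (assumption): one weak predictor x, noisy target y.
x = [rng.gauss(0.0, 1.0) for _ in range(339)]
y = [50.0 + 2.0 * xi + rng.gauss(0.0, 4.0) for xi in x]

# Ordinary least-squares line: y ≈ intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar
pred = [intercept + slope * xi for xi in x]

# R² of the fitted line.
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
sst = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1.0 - sse / sst

# Ratio of predicted spread to actual spread; equals sqrt(r2) for this fit.
spread_ratio = pstdev(pred) / pstdev(y)
```

Under this identity an R² of 0.22 would correspond to a spread ratio of about 0.47, the same range as the roughly halved spread described above, though a forest's out-of-fold predictions need not obey the identity exactly.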
Which satellite features matter — and how little tuning helps
Both importance measures, computed on a baseline forest fit on all 339 municipalities, single out embedding dimension A30, with A59 a distant second. Toggle the method to compare the impurity-based (MDI) and permutation rankings.
What to look for
- A30 dominates. Shuffling it alone costs about 0.25 in R² — larger than the model's entire out-of-fold R², because removing the best feature drags the forest below the mean-prediction baseline.
- Two different methods agree. MDI and permutation both crown A30, which is reassuring — it is genuine signal, not a counting artifact of one method.
- Importance has a long tail. After A30 and A59 the bars decline gently: the forest reads many faint patterns anchored by one strong one.
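Permutation importance, as used above, measures how much a score drops when one feature column is shuffled. The following is a minimal sketch of that idea with a hypothetical fixed model (`toy_model`) on synthetic data. It does not reproduce the post's forest, its embedding dimensions, or its numbers, and the function names and the number of repeats are assumptions.

```python
import random
from statistics import mean


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    centre = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - centre) ** 2 for a in actual)
    return 1.0 - sse / sst


def toy_model(row):
    """Hypothetical fixed model: leans on column 0, barely on 1, ignores 2."""
    return 3.0 * row[0] + 0.1 * row[1]


def permutation_importance(model, rows, target, n_repeats=5, seed=0):
    """Return the mean drop in R² per column when that column is shuffled."""
    rng = random.Random(seed)
    baseline = r2_score(target, [model(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column, leaving all other columns untouched.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - r2_score(target, [model(r) for r in permuted]))
        importances.append(mean(drops))
    return importances


data_rng = random.Random(1)
rows = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
target = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, target)
```

In this toy setup the column the model leans on shows a large drop, the weakly used column a small one, and the ignored column a drop of exactly zero, the same qualitative pattern as one dominant feature followed by a tail.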
Does tuning rescue the model? (Appendix B)
Grid search, random search, and Optuna all optimize the same 5-fold cross-validated R². They rank in the order one would typically expect (Optuna ≥ random ≥ grid), but every gain over the untuned baseline is smaller than the 0.17 fold-to-fold standard deviation: the tuning gain is smaller than the noise.