Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| synthetic-co2-cross-section | country (cross-section) | 120 × 14 | synthetic-co2-cross-section.dta | synthetic-co2-cross-section.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
use "${BASE}synthetic-co2-cross-section.dta", clear
describe
notes
Python
# Notebook only (Colab/Jupyter): the leading "!" runs a shell command
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
df = pd.read_stata(BASE + "synthetic-co2-cross-section.dta")
# load every dataset at once
files = ["synthetic-co2-cross-section"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "synthetic-co2-cross-section.dta", "synthetic-co2-cross-section.dta")
df, meta = pyreadstat.read_dta("synthetic-co2-cross-section.dta")
Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
df <- read_dta(paste0(BASE, "synthetic-co2-cross-section.dta"))
Overview & sources
Companion data for a hands-on R tutorial comparing three principled responses to the variable selection problem — Bayesian Model Averaging (BMA), the LASSO, and Weighted Average Least Squares (WALS) — on a fully synthetic cross-section of 120 fictional countries. Twelve candidate regressors compete to explain log CO2 emissions: 7 have true nonzero effects and 5 are pure noise deliberately correlated with GDP and the true predictors, creating realistic multicollinearity. Because the data-generating process is known, the data carries its own “answer key” against which each method is graded. The convergence of mechanically distinct methods on the same variables — four are flagged by all three (triple-robust) — illustrates methodological triangulation. The entire DGP is open and reproducible (set.seed(2017)).
synthetic-co2-cross-section.csv has one row per fictional country (no time dimension): a string identifier, the dependent variable log_co2, and the 12 candidate regressors. There are no real countries behind the rows.
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated from a calibrated cross-sectional DGP with set.seed(2017); 7 true predictors, 5 noise variables (open & reproducible) | Mendez, C. (2026). See the post's R script script.R for the full data-generating process. |
| Bayesian Model Averaging (BMA) | Method: posterior inclusion probabilities over the 2^12 = 4,096 model space | Hoeting, J.A., Madigan, D., Raftery, A.E., & Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–417. (PIP threshold conventions: Raftery 1995.) |
| LASSO | Method: L1-penalized regression for automatic variable selection (Post-LASSO refit per Belloni & Chernozhukov 2013) | Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288. |
| WALS (Weighted Average Least Squares) | Method: fast frequentist model averaging via a semi-orthogonal transform and a Laplace prior, yielding t-statistics | Magnus, J.R., Powell, O., & Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. Journal of Econometrics, 154(2), 139–153. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Three Methods for Robust Variable Selection: BMA, LASSO, and WALS [Data set]. https://carlos-mendez.org/post/r_bma_lasso_wals/
Hoeting, J.A., Madigan, D., Raftery, A.E., & Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–417.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Magnus, J.R., Powell, O., & Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. Journal of Econometrics, 154(2), 139–153.
BibTeX
@misc{mendez2026rbmalassowals,
author = {Mendez, Carlos},
title = {Three Methods for Robust Variable Selection: BMA, LASSO, and WALS},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_bma_lasso_wals/}},
note = {Data set}
}
@article{hoeting1999bma,
author = {Hoeting, Jennifer A. and Madigan, David and Raftery, Adrian E. and Volinsky, Chris T.},
title = {Bayesian Model Averaging: A Tutorial},
journal = {Statistical Science},
volume = {14}, number = {4}, pages = {382--417}, year = {1999}
}
@article{tibshirani1996lasso,
author = {Tibshirani, Robert},
title = {Regression Shrinkage and Selection via the Lasso},
journal = {Journal of the Royal Statistical Society, Series B},
volume = {58}, number = {1}, pages = {267--288}, year = {1996}
}
@article{magnus2010wals,
author = {Magnus, Jan R. and Powell, Owen and Prufer, Patricia},
title = {A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics},
journal = {Journal of Econometrics},
volume = {154}, number = {2}, pages = {139--153}, year = {2010}
}
Variable explorer
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| agriculture | continuous | Agricultural activity | Agricultural share; true predictor with the weakest effect (true β = 0.005; missed by all three methods). | % / index | synthetic-co2-cross-section | Synthetic (this study) |
| corruption | continuous | Corruption index (noise) | Noise regressor: weakly (negatively) correlated with GDP but with zero true effect. | index | synthetic-co2-cross-section | Synthetic (this study) |
| country | identifier | Country identifier | Synthetic country label (Country_001 … Country_120); the cross-section key. | string | synthetic-co2-cross-section | Synthetic (this study) |
| democracy | continuous | Democracy index | Democratic-governance score; true predictor with a small effect (true β = 0.004; borderline for BMA). | index | synthetic-co2-cross-section | Synthetic (this study) |
| fdi | continuous | Foreign direct investment (noise) | Noise regressor: zero true effect on log_co2 (the one noise variable not built from GDP). | % / index | synthetic-co2-cross-section | Synthetic (this study) |
| fossil_fuel | continuous | Fossil fuel dependence (%) | Fossil-fuel share of energy; true predictor (true β = 0.012, semi-elasticity). | % of energy | synthetic-co2-cross-section | Synthetic (this study) |
| industry | continuous | Industry share | Industrial output share; true predictor (true β = 0.008, composition effect). | % / index | synthetic-co2-cross-section | Synthetic (this study) |
| log_co2 | continuous | Log CO2 emissions (dependent variable) | Natural-log CO2 emissions; the outcome all three methods explain. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_credit | continuous | Log domestic credit (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_gdp | continuous | Log GDP per capita | Natural-log GDP per capita; the dominant true predictor (true β = 1.200, an elasticity). | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_tourism | continuous | Log tourism (noise) | Noise regressor: correlated with GDP (~0.3) but with zero true effect on log_co2. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_trade | continuous | Log trade openness (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| trade_network | continuous | Trade network centrality | Trade-centrality measure; true predictor with a moderate effect (true β = 0.500). | index (0-1 scale) | synthetic-co2-cross-section | Synthetic (this study) |
| urban_pop | continuous | Urban population (%) | Urbanization rate; true predictor with a moderate effect (true β = 0.010; borderline for BMA). | % of population | synthetic-co2-cross-section | Synthetic (this study) |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | synthetic-co2-cross-section |
|---|---|
agriculture | ● |
corruption | ● |
country | ● |
democracy | ● |
fdi | ● |
fossil_fuel | ● |
industry | ● |
log_co2 | ● |
log_credit | ● |
log_gdp | ● |
log_tourism | ● |
log_trade | ● |
trade_network | ● |
urban_pop | ● |
Construction & formulas
The target model regresses log CO2 on the 12 candidate regressors:
- General model: log_co2 = β₀ + Σⱼ βⱼ xⱼ + ε, j = 1…12.
- BMA, posterior inclusion probability (PIP): PIPⱼ = Σ_{k: j ∈ Mₖ} P(Mₖ | y), the total posterior mass on models containing variable j; PIP ≥ 0.80 is the robustness threshold (Raftery 1995). Posterior mean: E[βⱼ | y] = Σₖ β̂ⱼ,ₖ · P(Mₖ | y).
- LASSO, L1 penalty: β̂ = argmin (1/2n)·‖y − Xβ‖² + λ‖β‖₁; λ chosen by 10-fold CV (lambda.min / lambda.1se). Post-LASSO refits OLS on the selected variables for unbiased magnitudes.
- WALS, Laplace prior: split y = X₁β₁ + X₂β₂ + ε into focus regressors X₁ (here the intercept) and auxiliary regressors X₂ (the 12 candidates); orthogonalize X₂ and average each coefficient independently under p(γⱼ) ∝ exp(−|γⱼ|/τ), the same prior that underlies the LASSO penalty. Robustness flagged at |t| ≥ 2.
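As a rough illustration of the PIP formula, the sketch below enumerates every model for a small toy problem and approximates P(Mₖ | y) with BIC weights under a uniform model prior. The BIC approximation, the function name bic_pip, and the three toy regressors are assumptions made for illustration; the tutorial's R script may use a different prior and a dedicated BMA package.

```python
from itertools import combinations

import numpy as np


def bic_pip(X, y):
    """Approximate posterior inclusion probabilities by full enumeration.

    Assumption: posterior model probabilities are approximated by
    exp(-BIC/2) weights with a uniform prior over models.
    """
    n, p = X.shape
    models, bics = [], []
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            # Every model keeps the intercept; subset picks the regressors.
            Z = np.column_stack([np.ones(n), X[:, list(subset)]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = float(np.sum((y - Z @ beta) ** 2))
            bics.append(n * np.log(rss / n) + Z.shape[1] * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))  # subtract min for stability
    weights /= weights.sum()
    pip = np.zeros(p)
    for subset, w in zip(models, weights):
        pip[list(subset)] += w  # add model mass to each included variable
    return pip, weights


# Toy data (hypothetical): one true predictor, one correlated noise
# variable, one independent noise variable.
rng = np.random.default_rng(0)
n = 120
x_true = rng.normal(size=n)
x_corr = 0.5 * x_true + rng.normal(size=n)
x_indep = rng.normal(size=n)
X = np.column_stack([x_true, x_corr, x_indep])
y = 1.0 + 1.2 * x_true + rng.normal(0.0, 0.3, n)

pip, weights = bic_pip(X, y)
print(dict(zip(["x_true", "x_corr", "x_indep"], pip.round(3))))
```

With 12 candidates the same loop would visit all 4,096 models, which is still feasible; larger model spaces usually call for sampling rather than enumeration.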
Synthetic data-generating process (set.seed(2017), n = 120). GDP drives
the system: log_gdp ~ N(8.5, 1.5²). The true predictors and four of the five noise variables are
built as linear functions of log_gdp plus Gaussian noise, so those noise variables are correlated
with GDP yet have zero true effect; fdi is the exception, drawn independently of GDP. The outcome is
log_co2 = 2.0 + 1.200·log_gdp + 0.008·industry + 0.012·fossil_fuel + 0.010·urban_pop
+ 0.004·democracy + 0.500·trade_network + 0.005·agriculture + N(0, 0.3²); the five noise variables
(log_trade, fdi, corruption, log_tourism, log_credit) enter the outcome with coefficient
exactly 0.
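The data-generating process can be sketched in Python directly from the formulas listed in the variable dictionary. This is only a structural sketch: NumPy's generator does not reproduce the draws of R's set.seed(2017), so the values will differ from the published files, and any truncation or rounding that script.R may apply is not reproduced. The function name simulate_co2_cross_section is hypothetical.

```python
import numpy as np
import pandas as pd


def simulate_co2_cross_section(n=120, seed=2017):
    """Simulate one cross-section with the documented coefficients.

    Assumption: NumPy draws, so values will NOT match the R output.
    """
    rng = np.random.default_rng(seed)
    log_gdp = rng.normal(8.5, 1.5, n)

    def from_gdp(intercept, slope, sd):
        # Linear function of log_gdp plus Gaussian noise.
        return intercept + slope * log_gdp + rng.normal(0.0, sd, n)

    df = pd.DataFrame({
        "country": [f"Country_{i:03d}" for i in range(1, n + 1)],
        "log_gdp": log_gdp,
        # True predictors
        "industry": from_gdp(15, 1.5, 6),
        "fossil_fuel": from_gdp(30, 3, 10),
        "urban_pop": from_gdp(20, 5, 12),
        "democracy": from_gdp(5, 2, 8),
        "trade_network": from_gdp(0.2, 0.05, 0.15),
        "agriculture": from_gdp(40, -3, 8),
        # Noise variables (true coefficient 0 in the outcome)
        "log_trade": from_gdp(3.5, 0.1, 0.5),
        "fdi": 2 + rng.normal(0.0, 4, n),  # not built from GDP
        "corruption": from_gdp(0.8, -0.05, 0.15),
        "log_tourism": from_gdp(12, 0.3, 1.2),
        "log_credit": from_gdp(2.5, 0.15, 0.6),
    })
    log_co2 = (
        2.0
        + 1.200 * df["log_gdp"]
        + 0.008 * df["industry"]
        + 0.012 * df["fossil_fuel"]
        + 0.010 * df["urban_pop"]
        + 0.004 * df["democracy"]
        + 0.500 * df["trade_network"]
        + 0.005 * df["agriculture"]
        + rng.normal(0.0, 0.3, n)
    )
    df.insert(1, "log_co2", log_co2)  # outcome right after the identifier
    return df


df = simulate_co2_cross_section()
print(df.shape)
print(df.describe().T[["mean", "std"]].round(2))
```

Comparing the printed means with the published statistics table is a quick sanity check that the formulas were transcribed correctly.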
The datasets
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| country | identifier | Country identifier | Synthetic country label (Country_001 … Country_120); the cross-section key. | Sequential identifier assigned to each of the 120 simulated observations. | string | Synthetic (this study) | 120 countries |
| log_co2 | continuous | Log CO2 emissions (dependent variable) | Natural-log CO2 emissions; the outcome all three methods explain. | log_co2 = 2.0 + 1.200·log_gdp + 0.008·industry + 0.012·fossil_fuel + 0.010·urban_pop + 0.004·democracy + 0.500·trade_network + 0.005·agriculture + N(0, 0.3²). Noise variables enter with coefficient 0. | log units | Synthetic (this study) | 120 countries |
| log_gdp | continuous | Log GDP per capita | Natural-log GDP per capita; the dominant true predictor (true β = 1.200, an elasticity). | log_gdp ~ N(8.5, 1.5²); drives all other regressors. | log units | Synthetic (this study) | 120 countries |
| industry | continuous | Industry share | Industrial output share; true predictor (true β = 0.008, composition effect). | industry = 15 + 1.5·log_gdp + N(0, 6²). | % / index | Synthetic (this study) | 120 countries |
| fossil_fuel | continuous | Fossil fuel dependence (%) | Fossil-fuel share of energy; true predictor (true β = 0.012, semi-elasticity). | fossil_fuel = 30 + 3·log_gdp + N(0, 10²). | % of energy | Synthetic (this study) | 120 countries |
| urban_pop | continuous | Urban population (%) | Urbanization rate; true predictor with a moderate effect (true β = 0.010; borderline for BMA). | urban_pop = 20 + 5·log_gdp + N(0, 12²). | % of population | Synthetic (this study) | 120 countries |
| democracy | continuous | Democracy index | Democratic-governance score; true predictor with a small effect (true β = 0.004; borderline for BMA). | democracy = 5 + 2·log_gdp + N(0, 8²). | index | Synthetic (this study) | 120 countries |
| trade_network | continuous | Trade network centrality | Trade-centrality measure; true predictor with a moderate effect (true β = 0.500). | trade_network = 0.2 + 0.05·log_gdp + N(0, 0.15²). | index (0-1 scale) | Synthetic (this study) | 120 countries |
| agriculture | continuous | Agricultural activity | Agricultural share; true predictor with the weakest effect (true β = 0.005; missed by all three methods). | agriculture = 40 − 3·log_gdp + N(0, 8²) (negatively correlated with GDP). | % / index | Synthetic (this study) | 120 countries |
| log_trade | continuous | Log trade openness (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log_trade = 3.5 + 0.1·log_gdp + N(0, 0.5²). True β = 0. | log units | Synthetic (this study) | 120 countries |
| fdi | continuous | Foreign direct investment (noise) | Noise regressor: zero true effect on log_co2 (the one noise variable not built from GDP). | fdi = 2 + N(0, 4²). True β = 0. | % / index | Synthetic (this study) | 120 countries |
| corruption | continuous | Corruption index (noise) | Noise regressor: weakly (negatively) correlated with GDP but with zero true effect. | corruption = 0.8 − 0.05·log_gdp + N(0, 0.15²). True β = 0. | index | Synthetic (this study) | 120 countries |
| log_tourism | continuous | Log tourism (noise) | Noise regressor: correlated with GDP (~0.3) but with zero true effect on log_co2. | log_tourism = 12 + 0.3·log_gdp + N(0, 1.2²). True β = 0. | log units | Synthetic (this study) | 120 countries |
| log_credit | continuous | Log domestic credit (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log_credit = 2.5 + 0.15·log_gdp + N(0, 0.6²). True β = 0. | log units | Synthetic (this study) | 120 countries |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country | 100% | 120 | 120 | – | – | – | – | – |
| log_co2 | 100% | 120 | 120 | 8.76 | 14.22 | 14.16 | 20.36 | 2.11 |
| log_gdp | 100% | 120 | 120 | 4.61 | 8.53 | 8.53 | 13.21 | 1.57 |
| industry | 100% | 120 | 120 | 8.32 | 27.87 | 28.31 | 44.98 | 6.21 |
| fossil_fuel | 100% | 120 | 120 | 24.72 | 55.49 | 55.26 | 81.22 | 9.62 |
| urban_pop | 100% | 120 | 120 | 29.81 | 62.52 | 63.23 | 97.62 | 13.25 |
| democracy | 100% | 120 | 120 | 3.10 | 22.94 | 23.21 | 45.00 | 8.32 |
| trade_network | 100% | 120 | 120 | 0.182 | 0.643 | 0.651 | 1.04 | 0.171 |
| agriculture | 100% | 120 | 110 | 1.00 | 13.87 | 14.30 | 37.11 | 8.11 |
| log_trade | 100% | 120 | 120 | 3.45 | 4.43 | 4.42 | 5.84 | 0.458 |
| fdi | 100% | 120 | 116 | -5.00 | 2.23 | 1.50 | 13.62 | 4.19 |
| corruption | 100% | 120 | 120 | 0.050 | 0.367 | 0.374 | 0.715 | 0.164 |
| log_tourism | 100% | 120 | 120 | 11.54 | 14.61 | 14.57 | 19.63 | 1.32 |
| log_credit | 100% | 120 | 120 | 2.30 | 3.83 | 3.89 | 5.50 | 0.652 |
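A table with these columns can be recomputed from any loaded DataFrame. The helper below is a minimal sketch; the name describe_columns and the toy frame are hypothetical, and it assumes the sample standard deviation (pandas' default, ddof = 1), which may or may not be the convention used for the published SD column.

```python
import pandas as pd


def describe_columns(df):
    """Per-column coverage, counts, and basic moments.

    Assumption: SD is the sample standard deviation (ddof=1).
    """
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "Variable": name,
            "Coverage": f"{col.notna().mean():.0%}",  # share of non-missing
            "N": int(col.notna().sum()),
            "Distinct": int(col.nunique()),            # ignores missing values
            "Min": col.min() if numeric else None,
            "Mean": col.mean() if numeric else None,
            "Median": col.median() if numeric else None,
            "Max": col.max() if numeric else None,
            "SD": col.std() if numeric else None,
        })
    return pd.DataFrame(rows).set_index("Variable")


# Toy frame (hypothetical values) with one missing observation.
toy = pd.DataFrame({
    "country": ["Country_001", "Country_002", "Country_003", "Country_004"],
    "x": [1.0, 2.0, 2.0, None],
})
stats = describe_columns(toy)
print(stats)
```

Applied to the frame loaded from synthetic-co2-cross-section.dta, the same call should give one row per variable, with the moment columns left empty for the string identifier.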
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial. The 120 “countries” and all 13 numeric variables are simulated; results are internally consistent with the calibrated DGP but are not empirical evidence about real CO2 emissions or their determinants.
- Built-in answer key. The true coefficients are known by design (7 nonzero, 5 exactly zero). The point of the dataset is to grade each method against this truth; that luxury does not exist with real data, which is exactly why the post recommends using all three methods.
- Deliberate multicollinearity. The noise variables are constructed to be correlated with GDP and the true predictors, so a naive OLS can show spurious significance. This is intentional, to make variable selection genuinely hard.
- Power, not bias, limits detection. All methods reach perfect specificity, but small true effects (agriculture, β = 0.005; and the borderline urban_pop / democracy for BMA) are hard to detect at n = 120 — a sample-size limitation, not a flaw in any method.
- Cross-section only. No time dimension and no panel structure; the methods here address model uncertainty, not endogeneity, nonlinearity, or heteroskedasticity, which real applications must handle separately.
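The multicollinearity caveat can be illustrated with a deliberately simplified simulation: one noise variable (log_tourism) and an outcome that depends on log_gdp only. Dropping the other six true predictors, the helper name ols, and the use of NumPy draws instead of the R script's are all assumptions made to keep the sketch short.

```python
import numpy as np


def ols(X, y):
    """OLS with an intercept; returns coefficients and t-statistics."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return beta, beta / se


rng = np.random.default_rng(2017)  # NumPy seed; does not match R's draws
n = 120
log_gdp = rng.normal(8.5, 1.5, n)
log_tourism = 12 + 0.3 * log_gdp + rng.normal(0.0, 1.2, n)  # true effect: 0
log_co2 = 2.0 + 1.2 * log_gdp + rng.normal(0.0, 0.3, n)     # simplified outcome

# Regression 1: noise variable alone (GDP omitted).
b_alone, t_alone = ols(log_tourism, log_co2)
# Regression 2: noise variable with GDP controlled for.
b_ctrl, t_ctrl = ols(np.column_stack([log_tourism, log_gdp]), log_co2)

print(f"tourism alone:    coef={b_alone[1]:.3f}, t={t_alone[1]:.2f}")
print(f"tourism with GDP: coef={b_ctrl[1]:.3f}, t={t_ctrl[1]:.2f}")
```

In the first regression the tourism slope absorbs part of the GDP effect because the two are correlated; once log_gdp is included, the slope should shrink toward its true value of zero.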