Data dictionary · Robust Variable Selection: BMA, LASSO, and WALS

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`synthetic-co2-cross-section`	country (cross-section)	120 × 14	synthetic-co2-cross-section.dta	synthetic-co2-cross-section.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
use "${BASE}synthetic-co2-cross-section.dta", clear
describe
notes

Python

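# Colab/Jupyter only: the leading "!" runs a shell command.
# pyreadstat is needed only for the last block; pd.read_stata works without it.
!pip install -q pyreadstat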
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
df = pd.read_stata(BASE + "synthetic-co2-cross-section.dta")

# load every dataset at once
files = ["synthetic-co2-cross-section"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "synthetic-co2-cross-section.dta", "synthetic-co2-cross-section.dta")
df, meta = pyreadstat.read_dta("synthetic-co2-cross-section.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
df <- read_dta(paste0(BASE, "synthetic-co2-cross-section.dta"))

Overview & sources

Companion data for a hands-on R tutorial comparing three principled responses to the variable selection problem: Bayesian Model Averaging (BMA), the LASSO, and Weighted Average Least Squares (WALS). All three are applied to a fully synthetic cross-section of 120 fictional countries. Twelve candidate regressors compete to explain log CO₂ emissions: 7 have true nonzero effects and 5 are pure noise. Four of the five noise variables (all except fdi) are built from GDP, so they are also correlated with the true predictors, which creates realistic multicollinearity. Because the data-generating process (DGP) is known, the data carries its own “answer key” against which each method is graded. The convergence of mechanically distinct methods on the same variables (four are flagged by all three, i.e. triple-robust) illustrates methodological triangulation. The entire DGP is open and reproducible (set.seed(2017)).

One file, a pure cross-section. synthetic-co2-cross-section.csv has one row per fictional country (no time dimension): a string identifier, the dependent variable log_co2, and the 12 candidate regressors. There are no real countries behind the rows.
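The structure described above can be checked mechanically once the file is loaded. The following is a minimal sketch, assuming the data is already in a pandas DataFrame (for example via `pd.read_stata`). `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here, and the column list is copied from this page, not read from the file.

```python
import pandas as pd

# Column names as listed in this data dictionary (order is not checked).
EXPECTED_COLUMNS = [
    "country", "log_co2", "log_gdp", "industry", "fossil_fuel", "urban_pop",
    "democracy", "trade_network", "agriculture", "log_trade", "fdi",
    "corruption", "log_tourism", "log_credit",
]


def check_schema(df: pd.DataFrame, expected_rows: int = 120) -> list:
    """Return a list of problems; an empty list means the frame matches."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    extra = [c for c in df.columns if c not in EXPECTED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    # One row per fictional country: the identifier must be unique.
    if "country" in df.columns and df["country"].duplicated().any():
        problems.append("country identifier is not unique")
    # The dictionary reports 100% coverage for every variable.
    numeric = [c for c in EXPECTED_COLUMNS[1:] if c in df.columns]
    if df[numeric].isna().any().any():
        problems.append("missing values in numeric columns")
    return problems
```

Calling `check_schema(df)` on the loaded dataset should return an empty list if the file matches this page; any returned strings name the mismatch.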

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated from a calibrated cross-sectional DGP with set.seed(2017); 7 true predictors, 5 noise variables (open & reproducible)	Mendez, C. (2026). See the post's R script script.R for the full data-generating process.
Bayesian Model Averaging (BMA)	Method: posterior inclusion probabilities over the 2^12 = 4,096 model space	Hoeting, J.A., Madigan, D., Raftery, A.E., & Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–417. (PIP threshold conventions: Raftery 1995.)
LASSO	Method: L1-penalized regression for automatic variable selection (Post-LASSO refit per Belloni & Chernozhukov 2013)	Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
WALS (Weighted Average Least Squares)	Method: fast frequentist model averaging via a semi-orthogonal transform and a Laplace prior, yielding t-statistics	Magnus, J.R., Powell, O., & Prüfer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. Journal of Econometrics, 154(2), 139–153.

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Three Methods for Robust Variable Selection: BMA, LASSO, and WALS [Data set]. https://carlos-mendez.org/post/r_bma_lasso_wals/

Hoeting, J.A., Madigan, D., Raftery, A.E., & Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–417.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

Magnus, J.R., Powell, O., & Prüfer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. Journal of Econometrics, 154(2), 139–153.

BibTeX

@misc{mendez2026rbmalassowals,
  author       = {Mendez, Carlos},
  title        = {Three Methods for Robust Variable Selection: BMA, LASSO, and WALS},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_bma_lasso_wals/}},
  note         = {Data set}
}

@article{hoeting1999bma,
  author  = {Hoeting, Jennifer A. and Madigan, David and Raftery, Adrian E. and Volinsky, Chris T.},
  title   = {Bayesian Model Averaging: A Tutorial},
  journal = {Statistical Science},
  volume  = {14}, number = {4}, pages = {382--417}, year = {1999}
}
@article{tibshirani1996lasso,
  author  = {Tibshirani, Robert},
  title   = {Regression Shrinkage and Selection via the Lasso},
  journal = {Journal of the Royal Statistical Society, Series B},
  volume  = {58}, number = {1}, pages = {267--288}, year = {1996}
}
@article{magnus2010wals,
  author  = {Magnus, Jan R. and Powell, Owen and Pr{\"u}fer, Patricia},
  title   = {A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics},
  journal = {Journal of Econometrics},
  volume  = {154}, number = {2}, pages = {139--153}, year = {2010}
}

Variable explorer (all 14 variables)

Variable	Type	Label	Definition	Units	In files	Source
`agriculture`	continuous	Agricultural activity	Agricultural share; true predictor with the weakest effect (true β = 0.005; missed by all three methods).	% / index	synthetic-co2-cross-section	Synthetic (this study)
`corruption`	continuous	Corruption index (noise)	Noise regressor: weakly (negatively) correlated with GDP but with zero true effect.	index	synthetic-co2-cross-section	Synthetic (this study)
`country`	identifier	Country identifier	Synthetic country label (Country_001 … Country_120); the cross-section key.	string	synthetic-co2-cross-section	Synthetic (this study)
`democracy`	continuous	Democracy index	Democratic-governance score; true predictor with a small effect (true β = 0.004; borderline for BMA).	index	synthetic-co2-cross-section	Synthetic (this study)
`fdi`	continuous	Foreign direct investment (noise)	Noise regressor: zero true effect on log_co2 (the one noise variable not built from GDP).	% / index	synthetic-co2-cross-section	Synthetic (this study)
`fossil_fuel`	continuous	Fossil fuel dependence (%)	Fossil-fuel share of energy; true predictor (true β = 0.012, semi-elasticity).	% of energy	synthetic-co2-cross-section	Synthetic (this study)
`industry`	continuous	Industry share	Industrial output share; true predictor (true β = 0.008, composition effect).	% / index	synthetic-co2-cross-section	Synthetic (this study)
`log_co2`	continuous	Log CO2 emissions (dependent variable)	Natural-log CO2 emissions; the outcome all three methods explain.	log units	synthetic-co2-cross-section	Synthetic (this study)
`log_credit`	continuous	Log domestic credit (noise)	Noise regressor: correlated with GDP but with zero true effect on log_co2.	log units	synthetic-co2-cross-section	Synthetic (this study)
`log_gdp`	continuous	Log GDP per capita	Natural-log GDP per capita; the dominant true predictor (true β = 1.200, an elasticity).	log units	synthetic-co2-cross-section	Synthetic (this study)
`log_tourism`	continuous	Log tourism (noise)	Noise regressor: correlated with GDP (~0.3) but with zero true effect on log_co2.	log units	synthetic-co2-cross-section	Synthetic (this study)
`log_trade`	continuous	Log trade openness (noise)	Noise regressor: correlated with GDP but with zero true effect on log_co2.	log units	synthetic-co2-cross-section	Synthetic (this study)
`trade_network`	continuous	Trade network centrality	Trade-centrality measure; true predictor with a moderate effect (true β = 0.500).	index (0-1 scale)	synthetic-co2-cross-section	Synthetic (this study)
`urban_pop`	continuous	Urban population (%)	Urbanization rate; true predictor with a moderate effect (true β = 0.010; borderline for BMA).	% of population	synthetic-co2-cross-section	Synthetic (this study)

Cross-file variable index

Which file each variable appears in (● = present).

Variable	synthetic-co2-cross-section
`agriculture`	●
`corruption`	●
`country`	●
`democracy`	●
`fdi`	●
`fossil_fuel`	●
`industry`	●
`log_co2`	●
`log_credit`	●
`log_gdp`	●
`log_tourism`	●
`log_trade`	●
`trade_network`	●
`urban_pop`	●

Construction & formulas

The target model regresses log CO₂ on the 12 candidate regressors:

General model: log_co2 = β₀ + Σⱼ βⱼ xⱼ + ε, j = 1…12.
BMA (posterior inclusion probability, PIP): PIPⱼ = Σ_{k: j ∈ Mₖ} P(Mₖ | y), the total posterior mass on models containing variable j; PIP ≥ 0.80 is the robustness threshold (Raftery 1995). Posterior mean: E[βⱼ | y] = Σₖ β̂ⱼ,ₖ · P(Mₖ | y), where β̂ⱼ,ₖ = 0 in models that exclude variable j.
LASSO (L1 penalty): β̂ = argmin_β (1/(2n))·‖y − Xβ‖² + λ‖β‖₁; λ is chosen by 10-fold CV (lambda.min or lambda.1se). Post-LASSO refits OLS on the selected variables to remove the shrinkage bias in the magnitudes.
WALS (Laplace prior): split y = X₁β₁ + X₂β₂ + ε into focus regressors X₁ (here the intercept) and auxiliary regressors X₂ (the 12 candidates); orthogonalize X₂ and average each coefficient independently under p(γⱼ) ∝ exp(−|γⱼ|/τ), the same prior that underlies the LASSO penalty. Robustness is flagged at |t| ≥ 2.
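The PIP and LASSO formulas above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the tutorial's R implementation. `bma_pip` enumerates every model and approximates P(Mₖ | y) with BIC weights under a uniform model prior (a swapped-in approximation; the R post may use a different prior). `lasso_cd` minimises the stated objective by coordinate descent on centred, unstandardised data for a fixed λ, with no cross-validation. Both function names are hypothetical.

```python
import itertools
import numpy as np


def bma_pip(X, y):
    """PIPs by full enumeration with BIC-approximated model weights.

    Assumptions: uniform prior over models, intercept in every model.
    """
    n, p = X.shape
    masks, bics = [], []
    for mask in itertools.product([False, True], repeat=p):
        mask = np.array(mask)
        Z = np.column_stack([np.ones(n), X[:, mask]])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        rss = resid @ resid
        bics.append(n * np.log(rss / n) + Z.shape[1] * np.log(n))
        masks.append(mask)
    bics = np.array(bics)
    w = np.exp(-0.5 * (bics - bics.min()))  # subtract the min for stability
    w /= w.sum()                             # approximate P(M_k | y)
    # PIP_j = sum of the weights of all models that contain variable j
    return w @ np.array(masks, dtype=float)


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1.

    Data are centred so the (unpenalised) intercept drops out.
    """
    n, p = X.shape
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    col_ss = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # residual with variable j's current contribution added back
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial / n
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
    return beta
```

With the 12 candidate regressors, `bma_pip` visits all 2^12 = 4,096 models. The nonzero entries of `lasso_cd(X, y, lam)` are the selected variables, which a Post-LASSO step would then refit by OLS.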

Synthetic data-generating process (set.seed(2017), n = 120). GDP drives the system: log_gdp ~ N(8.5, 1.5²). The other true predictors and four of the five noise variables are built as linear functions of log_gdp plus Gaussian noise, so those noise variables are correlated with GDP yet have zero true effect; fdi is the exception and is drawn independently of GDP. The outcome is log_co2 = 2.0 + 1.200·log_gdp + 0.008·industry + 0.012·fossil_fuel + 0.010·urban_pop + 0.004·democracy + 0.500·trade_network + 0.005·agriculture + N(0, 0.3²); the five noise variables (log_trade, fdi, corruption, log_tourism, log_credit) enter the outcome with coefficient exactly 0.
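The data-generating process can be mirrored in Python using the equations listed on this page. The sketch below is an approximation of the recipe, not a port of script.R: NumPy's random generator differs from R's, so it will not reproduce the published file even with the same seed, and any bounding or rounding that script.R may apply is omitted. `simulate_co2`, `TRUE_BETA`, and `from_gdp` are hypothetical names.

```python
import numpy as np

# True nonzero coefficients of the outcome equation (intercept is 2.0).
TRUE_BETA = {
    "log_gdp": 1.200, "industry": 0.008, "fossil_fuel": 0.012,
    "urban_pop": 0.010, "democracy": 0.004, "trade_network": 0.500,
    "agriculture": 0.005,
}


def simulate_co2(n=120, seed=2017):
    """Draw one synthetic cross-section; returns a dict of NumPy arrays.

    Assumption: draws come from NumPy, so values differ from the R output.
    """
    rng = np.random.default_rng(seed)
    gdp = rng.normal(8.5, 1.5, n)  # log_gdp ~ N(8.5, 1.5^2)

    def from_gdp(intercept, slope, sd):
        # linear function of log_gdp plus Gaussian noise
        return intercept + slope * gdp + rng.normal(0.0, sd, n)

    d = {
        "log_gdp": gdp,
        # true predictors
        "industry": from_gdp(15, 1.5, 6),
        "fossil_fuel": from_gdp(30, 3, 10),
        "urban_pop": from_gdp(20, 5, 12),
        "democracy": from_gdp(5, 2, 8),
        "trade_network": from_gdp(0.2, 0.05, 0.15),
        "agriculture": from_gdp(40, -3, 8),
        # noise variables (true beta = 0); fdi has slope 0 on log_gdp
        "log_trade": from_gdp(3.5, 0.1, 0.5),
        "fdi": from_gdp(2, 0.0, 4),
        "corruption": from_gdp(0.8, -0.05, 0.15),
        "log_tourism": from_gdp(12, 0.3, 1.2),
        "log_credit": from_gdp(2.5, 0.15, 0.6),
    }
    signal = sum(b * d[name] for name, b in TRUE_BETA.items())
    d["log_co2"] = 2.0 + signal + rng.normal(0.0, 0.3, n)
    return d
```

Drawing a much larger sample, for example `simulate_co2(n=5000)`, makes it easier to see that OLS recovers the large coefficients, while the small ones remain noisy at n = 120.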

The datasets

Each dataset is documented with its full variable dictionary and a table of summary statistics and data coverage.

synthetic-co2-cross-section · grain: country (cross-section) · 120 rows × 14 columns · time: none (no time dimension) · 120 fictional countries

Cross-section key: country · Purpose: grade BMA, LASSO, and WALS variable selection against a known answer key (7 true predictors, 5 noise).

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`country` identifier	Country identifier	Synthetic country label (Country_001 … Country_120); the cross-section key.	Sequential identifier assigned to each of the 120 simulated observations.	string	Synthetic (this study)	120 countries
`log_co2` continuous	Log CO2 emissions (dependent variable)	Natural-log CO2 emissions; the outcome all three methods explain.	log_co2 = 2.0 + 1.200·log_gdp + 0.008·industry + 0.012·fossil_fuel + 0.010·urban_pop + 0.004·democracy + 0.500·trade_network + 0.005·agriculture + N(0, 0.3²). Noise variables enter with coefficient 0.	log units	Synthetic (this study)	120 countries
`log_gdp` continuous	Log GDP per capita	Natural-log GDP per capita; the dominant true predictor (true β = 1.200, an elasticity).	log_gdp ~ N(8.5, 1.5²); drives all other regressors.	log units	Synthetic (this study)	120 countries
`industry` continuous	Industry share	Industrial output share; true predictor (true β = 0.008, composition effect).	industry = 15 + 1.5·log_gdp + N(0, 6²).	% / index	Synthetic (this study)	120 countries
`fossil_fuel` continuous	Fossil fuel dependence (%)	Fossil-fuel share of energy; true predictor (true β = 0.012, semi-elasticity).	fossil_fuel = 30 + 3·log_gdp + N(0, 10²).	% of energy	Synthetic (this study)	120 countries
`urban_pop` continuous	Urban population (%)	Urbanization rate; true predictor with a moderate effect (true β = 0.010; borderline for BMA).	urban_pop = 20 + 5·log_gdp + N(0, 12²).	% of population	Synthetic (this study)	120 countries
`democracy` continuous	Democracy index	Democratic-governance score; true predictor with a small effect (true β = 0.004; borderline for BMA).	democracy = 5 + 2·log_gdp + N(0, 8²).	index	Synthetic (this study)	120 countries
`trade_network` continuous	Trade network centrality	Trade-centrality measure; true predictor with a moderate effect (true β = 0.500).	trade_network = 0.2 + 0.05·log_gdp + N(0, 0.15²).	index (0-1 scale)	Synthetic (this study)	120 countries
`agriculture` continuous	Agricultural activity	Agricultural share; true predictor with the weakest effect (true β = 0.005; missed by all three methods).	agriculture = 40 − 3·log_gdp + N(0, 8²) (negatively correlated with GDP).	% / index	Synthetic (this study)	120 countries
`log_trade` continuous	Log trade openness (noise)	Noise regressor — correlated with GDP but with zero true effect on log_co2.	log_trade = 3.5 + 0.1·log_gdp + N(0, 0.5²). True β = 0.	log units	Synthetic (this study)	120 countries
`fdi` continuous	Foreign direct investment (noise)	Noise regressor — zero true effect on log_co2 (the one noise variable not built from GDP).	fdi = 2 + N(0, 4²). True β = 0.	% / index	Synthetic (this study)	120 countries
`corruption` continuous	Corruption index (noise)	Noise regressor — weakly (negatively) correlated with GDP but with zero true effect.	corruption = 0.8 − 0.05·log_gdp + N(0, 0.15²). True β = 0.	index	Synthetic (this study)	120 countries
`log_tourism` continuous	Log tourism (noise)	Noise regressor — correlated with GDP (~0.3) but with zero true effect on log_co2.	log_tourism = 12 + 0.3·log_gdp + N(0, 1.2²). True β = 0.	log units	Synthetic (this study)	120 countries
`log_credit` continuous	Log domestic credit (noise)	Noise regressor — correlated with GDP but with zero true effect on log_co2.	log_credit = 2.5 + 0.15·log_gdp + N(0, 0.6²). True β = 0.	log units	Synthetic (this study)	120 countries

Summary statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`country`	100%	120	120	–	–	–	–	–
`log_co2`	100%	120	120	8.76	14.22	14.16	20.36	2.11
`log_gdp`	100%	120	120	4.61	8.53	8.53	13.21	1.57
`industry`	100%	120	120	8.32	27.87	28.31	44.98	6.21
`fossil_fuel`	100%	120	120	24.72	55.49	55.26	81.22	9.62
`urban_pop`	100%	120	120	29.81	62.52	63.23	97.62	13.25
`democracy`	100%	120	120	3.10	22.94	23.21	45.00	8.32
`trade_network`	100%	120	120	0.182	0.643	0.651	1.04	0.171
`agriculture`	100%	120	110	1.00	13.87	14.30	37.11	8.11
`log_trade`	100%	120	120	3.45	4.43	4.42	5.84	0.458
`fdi`	100%	120	116	-5.00	2.23	1.50	13.62	4.19
`corruption`	100%	120	120	0.050	0.367	0.374	0.715	0.164
`log_tourism`	100%	120	120	11.54	14.61	14.57	19.63	1.32
`log_credit`	100%	120	120	2.30	3.83	3.89	5.50	0.652
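The columns of this statistics table can be recomputed from the loaded data. The sketch below assumes a pandas DataFrame `df` and covers only the numeric variables. `summarize` is a hypothetical helper, and the use of the sample standard deviation (pandas' default, ddof = 1) is an assumption, since this page does not state which convention the table uses.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """One row per numeric variable with coverage and basic statistics."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({
        "coverage_pct": num.notna().mean() * 100,  # share of non-missing rows
        "n": num.count(),                          # non-missing observations
        "distinct": num.nunique(),                 # distinct non-missing values
        "min": num.min(),
        "mean": num.mean(),
        "median": num.median(),
        "max": num.max(),
        "sd": num.std(),                           # sample SD (ddof=1), assumed
    })
```

Rounding the result, for example `summarize(df).round(2)`, gives output that can be compared with the table by eye.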

Known limitations & caveats

Synthetic data. There is no real data behind this tutorial. The 120 “countries” and all 13 numeric variables are simulated; results are internally consistent with the calibrated DGP but are not empirical evidence about real CO₂ emissions or their determinants.
Built-in answer key. The true coefficients are known by design (7 nonzero, 5 exactly zero). The point of the dataset is to grade each method against this truth; that luxury does not exist with real data, which is exactly why the post recommends using all three methods.
Deliberate multicollinearity. Four of the five noise variables (all except fdi) are constructed from GDP and are therefore correlated with GDP and with the true predictors, so a naive OLS can show spurious significance. This is intentional, to make variable selection genuinely hard.
Power, not bias, limits detection. All methods reach perfect specificity, but small true effects (agriculture, β = 0.005; and the borderline urban_pop / democracy for BMA) are hard to detect at n = 120 — a sample-size limitation, not a flaw in any method.
Cross-section only. No time dimension and no panel structure; the methods here address model uncertainty, not endogeneity, nonlinearity, or heteroskedasticity, which real applications must handle separately.
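Grading a method against the built-in answer key amounts to comparing its selected variables with the known true and noise sets. The sketch below is one way to do that in plain Python. `grade` is a hypothetical helper, and the selection used in the usage note is illustrative, not output from the tutorial.

```python
# Answer key from the data-generating process described on this page.
TRUE_PREDICTORS = {
    "log_gdp", "industry", "fossil_fuel", "urban_pop",
    "democracy", "trade_network", "agriculture",
}
NOISE = {"log_trade", "fdi", "corruption", "log_tourism", "log_credit"}


def grade(selected):
    """Score a set of selected variable names against the answer key."""
    selected = set(selected)
    unknown = selected - TRUE_PREDICTORS - NOISE
    if unknown:
        raise ValueError(f"not a candidate regressor: {sorted(unknown)}")
    true_pos = selected & TRUE_PREDICTORS
    false_pos = selected & NOISE
    return {
        # share of the 7 true predictors that were selected
        "sensitivity": len(true_pos) / len(TRUE_PREDICTORS),
        # share of the 5 noise variables that were correctly left out
        "specificity": (len(NOISE) - len(false_pos)) / len(NOISE),
        "missed": sorted(TRUE_PREDICTORS - selected),
        "false_positives": sorted(false_pos),
    }
```

For an illustrative selection that contains every true predictor except agriculture and no noise variable, `grade` reports a sensitivity of 6/7 and a specificity of 1.0, the "power, not bias" pattern described in the caveats above.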