Interactive data dictionary

Robust Variable Selection: BMA, LASSO, and WALS

A synthetic CO2 cross-section with a built-in answer key, for grading three variable-selection methods in R.

1 dataset · 14 variables · 120 countries · 12 + 1 regressors + outcome

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| synthetic-co2-cross-section | country (cross-section) | 120 × 14 | synthetic-co2-cross-section.dta | synthetic-co2-cross-section.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
use "${BASE}synthetic-co2-cross-section.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
df = pd.read_stata(BASE + "synthetic-co2-cross-section.dta")

# load every dataset at once
files = ["synthetic-co2-cross-section"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "synthetic-co2-cross-section.dta", "synthetic-co2-cross-section.dta")
df, meta = pyreadstat.read_dta("synthetic-co2-cross-section.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/data/"
df <- read_dta(paste0(BASE, "synthetic-co2-cross-section.dta"))

Overview & sources

Companion data for a hands-on R tutorial comparing three principled responses to the variable selection problem: Bayesian Model Averaging (BMA), the LASSO, and Weighted Average Least Squares (WALS). The data are a fully synthetic cross-section of 120 fictional countries. Twelve candidate regressors compete to explain log CO2 emissions: 7 have true nonzero effects and 5 are pure noise. Four of the five noise variables are built from GDP, so they are correlated with GDP and with the true predictors, creating realistic multicollinearity; fdi is the exception. Because the data-generating process (DGP) is known, the data carry their own “answer key” against which each method is graded. The three methods are mechanically distinct, yet four variables are flagged by all three (triple-robust), which illustrates methodological triangulation. The entire DGP is open and reproducible (set.seed(2017)).
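
The answer key lends itself to a small grading helper. The sketch below is one possible way to score a method's selection against the 7 true predictors and 5 noise variables listed in this dictionary; it is not the post's own code. The function name `grade` and the example selection are hypothetical.

```python
# Answer key taken from this data dictionary: 7 true predictors, 5 noise variables.
TRUE_PREDICTORS = {
    "log_gdp", "industry", "fossil_fuel", "urban_pop",
    "democracy", "trade_network", "agriculture",
}
NOISE = {"log_trade", "fdi", "corruption", "log_tourism", "log_credit"}


def grade(selected):
    """Score one method's selected variables against the known answer key."""
    selected = set(selected)
    unknown = selected - TRUE_PREDICTORS - NOISE
    if unknown:
        raise ValueError(f"not a candidate regressor: {sorted(unknown)}")
    return {
        "true_positives": sorted(selected & TRUE_PREDICTORS),
        "false_positives": sorted(selected & NOISE),
        "missed": sorted(TRUE_PREDICTORS - selected),
    }


# Hypothetical selection, for illustration only (not a result from the post).
example = grade(["log_gdp", "fossil_fuel", "trade_network", "fdi"])
print(example)
```

Calling `grade` once per method, with the variables that method flags, gives three comparable scorecards.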

One file, a pure cross-section. synthetic-co2-cross-section.csv has one row per fictional country (no time dimension): a string identifier, the dependent variable log_co2, and the 12 candidate regressors. There are no real countries behind the rows.
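
After downloading the CSV, the layout described above (120 rows, one identifier, one outcome, 12 regressors) can be checked programmatically. The following is a minimal sketch using only the standard library; `check_layout` is a hypothetical helper, and it assumes the CSV column names match the variable names in this dictionary. The in-memory sample is made up so the block runs without a download.

```python
import csv
import io

# The 14 column names listed in this data dictionary.
EXPECTED = {
    "country", "log_co2", "log_gdp", "industry", "fossil_fuel", "urban_pop",
    "democracy", "trade_network", "agriculture", "log_trade", "fdi",
    "corruption", "log_tourism", "log_credit",
}


def check_layout(rows, n_expected=120):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    if len(rows) != n_expected:
        problems.append(f"expected {n_expected} rows, found {len(rows)}")
    columns = set(rows[0]) if rows else set()
    if columns != EXPECTED:
        problems.append(f"unexpected columns: {sorted(columns ^ EXPECTED)}")
    ids = [row.get("country") for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("country identifiers are not unique")
    return problems


# Tiny in-memory stand-in for the real CSV (made-up values, two rows only).
header = ",".join(sorted(EXPECTED))
line = ",".join(["0"] * len(EXPECTED))
sample = io.StringIO("\n".join([header, line, line]))
rows = list(csv.DictReader(sample))
print(check_layout(rows))  # reports the wrong row count and the duplicate ids
```

For the real file, replace the sample with `list(csv.DictReader(open("synthetic-co2-cross-section.csv", newline="")))`.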

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values: simulated from a calibrated cross-sectional DGP with set.seed(2017); 7 true predictors, 5 noise variables (open & reproducible) | Mendez, C. (2026). See the post's R script script.R for the full data-generating process. |
| Bayesian Model Averaging (BMA) | Method: posterior inclusion probabilities over the 2^12 = 4,096 model space | Hoeting, J.A., Madigan, D., Raftery, A.E., & Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–417. (PIP threshold conventions: Raftery 1995.) |
| LASSO | Method: L1-penalized regression for automatic variable selection (Post-LASSO refit per Belloni & Chernozhukov 2013) | Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288. |
| WALS (Weighted Average Least Squares) | Method: fast frequentist model averaging via a semi-orthogonal transform and a Laplace prior, yielding t-statistics | Magnus, J.R., Powell, O., & Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. Journal of Econometrics, 154(2), 139–153. |
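
The 2^12 = 4,096 figure in the BMA row counts every subset of the 12 candidate regressors. A minimal sketch enumerates that model space and shows that each variable appears in exactly half of the models. Under a uniform prior over models (an assumption here; the post may use a different model prior), the prior inclusion probability of any variable is therefore 0.5. The function name `model_space` is illustrative.

```python
from itertools import combinations

REGRESSORS = [
    "log_gdp", "industry", "fossil_fuel", "urban_pop", "democracy",
    "trade_network", "agriculture", "log_trade", "fdi", "corruption",
    "log_tourism", "log_credit",
]


def model_space(regressors):
    """Yield every subset of the candidate regressors, from empty to full."""
    for k in range(len(regressors) + 1):
        yield from combinations(regressors, k)


models = list(model_space(REGRESSORS))
n_models = len(models)                              # 2**12 models in total
n_with_gdp = sum("log_gdp" in m for m in models)    # models that include log_gdp
prior_inclusion = n_with_gdp / n_models             # share under a uniform model prior
print(n_models, n_with_gdp, prior_inclusion)        # 4096 2048 0.5
```

A posterior inclusion probability replaces the equal weights in that last ratio with posterior model probabilities, summed over the models that contain the variable.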

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Three Methods for Robust Variable Selection: BMA, LASSO, and WALS [Data set]. https://carlos-mendez.org/post/r_bma_lasso_wals/

Hoeting, J.A., Madigan, D., Raftery, A.E., & Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–417.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

Magnus, J.R., Powell, O., & Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. Journal of Econometrics, 154(2), 139–153.

BibTeX

@misc{mendez2026rbmalassowals,
  author       = {Mendez, Carlos},
  title        = {Three Methods for Robust Variable Selection: BMA, LASSO, and WALS},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_bma_lasso_wals/}},
  note         = {Data set}
}

@article{hoeting1999bma,
  author  = {Hoeting, Jennifer A. and Madigan, David and Raftery, Adrian E. and Volinsky, Chris T.},
  title   = {Bayesian Model Averaging: A Tutorial},
  journal = {Statistical Science},
  volume  = {14}, number = {4}, pages = {382--417}, year = {1999}
}
@article{tibshirani1996lasso,
  author  = {Tibshirani, Robert},
  title   = {Regression Shrinkage and Selection via the Lasso},
  journal = {Journal of the Royal Statistical Society, Series B},
  volume  = {58}, number = {1}, pages = {267--288}, year = {1996}
}
@article{magnus2010wals,
  author  = {Magnus, Jan R. and Powell, Owen and Prufer, Patricia},
  title   = {A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics},
  journal = {Journal of Econometrics},
  volume  = {154}, number = {2}, pages = {139--153}, year = {2010}
}

Variable explorer: all 14 variables


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| agriculture | continuous | min 1, median 14.3, max 37.1 | Agricultural activity | Agricultural share; true predictor with one of the weakest effects (true β = 0.005; missed by all three methods). | % / index | synthetic-co2-cross-section | Synthetic (this study) |
| corruption | continuous | min 0.05, median 0.374, max 0.715 | Corruption index (noise) | Noise regressor: weakly (negatively) correlated with GDP but with zero true effect. | index | synthetic-co2-cross-section | Synthetic (this study) |
| country | identifier | | Country identifier | Synthetic country label (Country_001 … Country_120); the cross-section key. | string | synthetic-co2-cross-section | Synthetic (this study) |
| democracy | continuous | min 3.1, median 23.2, max 45 | Democracy index | Democratic-governance score; true predictor with a small effect (true β = 0.004; borderline for BMA). | index | synthetic-co2-cross-section | Synthetic (this study) |
| fdi | continuous | min -5, median 1.5, max 13.6 | Foreign direct investment (noise) | Noise regressor: zero true effect on log_co2 (the one noise variable not built from GDP). | % / index | synthetic-co2-cross-section | Synthetic (this study) |
| fossil_fuel | continuous | min 24.7, median 55.3, max 81.2 | Fossil fuel dependence (%) | Fossil-fuel share of energy; true predictor (true β = 0.012, semi-elasticity). | % of energy | synthetic-co2-cross-section | Synthetic (this study) |
| industry | continuous | min 8.32, median 28.3, max 45 | Industry share | Industrial output share; true predictor (true β = 0.008, composition effect). | % / index | synthetic-co2-cross-section | Synthetic (this study) |
| log_co2 | continuous | min 8.76, median 14.2, max 20.4 | Log CO2 emissions (dependent variable) | Natural-log CO2 emissions; the outcome all three methods explain. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_credit | continuous | min 2.3, median 3.89, max 5.5 | Log domestic credit (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_gdp | continuous | min 4.61, median 8.53, max 13.2 | Log GDP per capita | Natural-log GDP per capita; the dominant true predictor (true β = 1.200, an elasticity). | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_tourism | continuous | min 11.5, median 14.6, max 19.6 | Log tourism (noise) | Noise regressor: correlated with GDP (~0.3) but with zero true effect on log_co2. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| log_trade | continuous | min 3.45, median 4.42, max 5.84 | Log trade openness (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log units | synthetic-co2-cross-section | Synthetic (this study) |
| trade_network | continuous | min 0.182, median 0.651, max 1.04 | Trade network centrality | Trade-centrality measure; true predictor with a moderate effect (true β = 0.500). | index (0-1 scale) | synthetic-co2-cross-section | Synthetic (this study) |
| urban_pop | continuous | min 29.8, median 63.2, max 97.6 | Urban population (%) | Urbanization rate; true predictor with a moderate effect (true β = 0.010; borderline for BMA). | % of population | synthetic-co2-cross-section | Synthetic (this study) |

Cross-file variable index

There is only one file: all 14 variables appear in synthetic-co2-cross-section.

Construction & formulas

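The target model regresses log CO2 on the 12 candidate regressors:

$$\text{log\_co2}_i = \beta_0 + \sum_{j=1}^{12} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, 120,$$

where $x_{ij}$ is the value of candidate regressor $j$ for country $i$.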

Synthetic data-generating process (set.seed(2017), n = 120). GDP drives the system: log_gdp ~ N(8.5, 1.5²). The six other true predictors and four of the five noise variables are linear functions of log_gdp plus Gaussian noise, so those noise variables are correlated with GDP yet have zero true effect; fdi is drawn independently of GDP. The outcome is

log_co2 = 2.0 + 1.200·log_gdp + 0.008·industry + 0.012·fossil_fuel + 0.010·urban_pop + 0.004·democracy + 0.500·trade_network + 0.005·agriculture + N(0, 0.3²).

The five noise variables (log_trade, fdi, corruption, log_tourism, log_credit) enter the outcome with coefficient exactly 0.
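
The data-generating process can be sketched in a few lines of Python using the construction formulas given in this dictionary. This is an illustration of the structure only: Python's random stream differs from R's set.seed(2017), so the draws will not match the published file, and any post-processing in the author's script.R is not reproduced. The names `simulate` and `TRUE_BETA` are hypothetical.

```python
import random

# True coefficients of the outcome equation; noise variables have coefficient 0.
TRUE_BETA = {
    "log_gdp": 1.200, "industry": 0.008, "fossil_fuel": 0.012, "urban_pop": 0.010,
    "democracy": 0.004, "trade_network": 0.500, "agriculture": 0.005,
}


def simulate(n=120, seed=2017):
    """Draw n synthetic countries following the documented formulas."""
    rng = random.Random(seed)  # not the same random stream as R's set.seed(2017)
    rows = []
    for i in range(1, n + 1):
        g = rng.gauss(8.5, 1.5)  # log_gdp drives most other regressors
        row = {
            "country": f"Country_{i:03d}",
            "log_gdp": g,
            "industry": 15 + 1.5 * g + rng.gauss(0, 6),
            "fossil_fuel": 30 + 3 * g + rng.gauss(0, 10),
            "urban_pop": 20 + 5 * g + rng.gauss(0, 12),
            "democracy": 5 + 2 * g + rng.gauss(0, 8),
            "trade_network": 0.2 + 0.05 * g + rng.gauss(0, 0.15),
            "agriculture": 40 - 3 * g + rng.gauss(0, 8),
            # Noise variables: most depend on log_gdp, none enter the outcome.
            "log_trade": 3.5 + 0.1 * g + rng.gauss(0, 0.5),
            "fdi": 2 + rng.gauss(0, 4),
            "corruption": 0.8 - 0.05 * g + rng.gauss(0, 0.15),
            "log_tourism": 12 + 0.3 * g + rng.gauss(0, 1.2),
            "log_credit": 2.5 + 0.15 * g + rng.gauss(0, 0.6),
        }
        signal = 2.0 + sum(beta * row[name] for name, beta in TRUE_BETA.items())
        row["log_co2"] = signal + rng.gauss(0, 0.3)
        rows.append(row)
    return rows


data = simulate()
print(len(data), len(data[0]))  # 120 rows, 14 variables
```

Because the noise variables never appear in `signal`, any method that selects them is picking up their correlation with log_gdp rather than a true effect.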

The datasets

There is one dataset. It is shown with its full variable dictionary and a statistics table with distribution summaries and data coverage.


Grain: country (cross-section) · Size: 120 × 14 · Time dimension: none · Units: 120 fictional countries

Key: country · Purpose: grade BMA, LASSO, and WALS variable selection against a known answer key (7 true predictors, 5 noise).

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| country (identifier) | Country identifier | Synthetic country label (Country_001 … Country_120); the cross-section key. | Sequential identifier assigned to each of the 120 simulated observations. | string | Synthetic (this study) | 120 countries |
| log_co2 (continuous) | Log CO2 emissions (dependent variable) | Natural-log CO2 emissions; the outcome all three methods explain. | log_co2 = 2.0 + 1.200·log_gdp + 0.008·industry + 0.012·fossil_fuel + 0.010·urban_pop + 0.004·democracy + 0.500·trade_network + 0.005·agriculture + N(0, 0.3²). Noise variables enter with coefficient 0. | log units | Synthetic (this study) | 120 countries |
| log_gdp (continuous) | Log GDP per capita | Natural-log GDP per capita; the dominant true predictor (true β = 1.200, an elasticity). | log_gdp ~ N(8.5, 1.5²); drives all other regressors except fdi. | log units | Synthetic (this study) | 120 countries |
| industry (continuous) | Industry share | Industrial output share; true predictor (true β = 0.008, composition effect). | industry = 15 + 1.5·log_gdp + N(0, 6²). | % / index | Synthetic (this study) | 120 countries |
| fossil_fuel (continuous) | Fossil fuel dependence (%) | Fossil-fuel share of energy; true predictor (true β = 0.012, semi-elasticity). | fossil_fuel = 30 + 3·log_gdp + N(0, 10²). | % of energy | Synthetic (this study) | 120 countries |
| urban_pop (continuous) | Urban population (%) | Urbanization rate; true predictor with a moderate effect (true β = 0.010; borderline for BMA). | urban_pop = 20 + 5·log_gdp + N(0, 12²). | % of population | Synthetic (this study) | 120 countries |
| democracy (continuous) | Democracy index | Democratic-governance score; true predictor with a small effect (true β = 0.004; borderline for BMA). | democracy = 5 + 2·log_gdp + N(0, 8²). | index | Synthetic (this study) | 120 countries |
| trade_network (continuous) | Trade network centrality | Trade-centrality measure; true predictor with a moderate effect (true β = 0.500). | trade_network = 0.2 + 0.05·log_gdp + N(0, 0.15²). | index (0-1 scale) | Synthetic (this study) | 120 countries |
| agriculture (continuous) | Agricultural activity | Agricultural share; true predictor with one of the weakest effects (true β = 0.005; missed by all three methods). | agriculture = 40 − 3·log_gdp + N(0, 8²) (negatively correlated with GDP). | % / index | Synthetic (this study) | 120 countries |
| log_trade (continuous) | Log trade openness (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log_trade = 3.5 + 0.1·log_gdp + N(0, 0.5²). True β = 0. | log units | Synthetic (this study) | 120 countries |
| fdi (continuous) | Foreign direct investment (noise) | Noise regressor: zero true effect on log_co2 (the one noise variable not built from GDP). | fdi = 2 + N(0, 4²). True β = 0. | % / index | Synthetic (this study) | 120 countries |
| corruption (continuous) | Corruption index (noise) | Noise regressor: weakly (negatively) correlated with GDP but with zero true effect. | corruption = 0.8 − 0.05·log_gdp + N(0, 0.15²). True β = 0. | index | Synthetic (this study) | 120 countries |
| log_tourism (continuous) | Log tourism (noise) | Noise regressor: correlated with GDP (~0.3) but with zero true effect on log_co2. | log_tourism = 12 + 0.3·log_gdp + N(0, 1.2²). True β = 0. | log units | Synthetic (this study) | 120 countries |
| log_credit (continuous) | Log domestic credit (noise) | Noise regressor: correlated with GDP but with zero true effect on log_co2. | log_credit = 2.5 + 0.15·log_gdp + N(0, 0.6²). True β = 0. | log units | Synthetic (this study) | 120 countries |
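
The construction formulas imply how strongly each regressor is correlated with log_gdp in the population. For x = a + b·log_gdp + e, with sd(log_gdp) = 1.5 and independent noise e of standard deviation s, the correlation is b·1.5 / sqrt((b·1.5)² + s²). A small sketch evaluates this for every constructed regressor. These are population values implied by the formulas, so sample correlations in the 120 rows will differ somewhat; `implied_corr` is a hypothetical helper.

```python
import math

SD_GDP = 1.5  # log_gdp ~ N(8.5, 1.5²)

# (slope on log_gdp, SD of the added Gaussian noise), from the Construction column.
CONSTRUCTION = {
    "industry": (1.5, 6.0),
    "fossil_fuel": (3.0, 10.0),
    "urban_pop": (5.0, 12.0),
    "democracy": (2.0, 8.0),
    "trade_network": (0.05, 0.15),
    "agriculture": (-3.0, 8.0),
    "log_trade": (0.1, 0.5),
    "fdi": (0.0, 4.0),          # not built from GDP
    "corruption": (-0.05, 0.15),
    "log_tourism": (0.3, 1.2),
    "log_credit": (0.15, 0.6),
}


def implied_corr(slope, noise_sd, sd_gdp=SD_GDP):
    """Population correlation between log_gdp and slope*log_gdp + noise."""
    return slope * sd_gdp / math.hypot(slope * sd_gdp, noise_sd)


for name, (slope, noise_sd) in CONSTRUCTION.items():
    print(f"{name:14s} {implied_corr(slope, noise_sd):+.2f}")
```

The sign follows the slope, and fdi comes out at exactly zero because its formula has no log_gdp term.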

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| country | | 100% | 120 | 120 | | | | | |
| log_co2 | min 8.76, median 14.2, max 20.4 | 100% | 120 | 120 | 8.76 | 14.22 | 14.16 | 20.36 | 2.11 |
| log_gdp | min 4.61, median 8.53, max 13.2 | 100% | 120 | 120 | 4.61 | 8.53 | 8.53 | 13.21 | 1.57 |
| industry | min 8.32, median 28.3, max 45 | 100% | 120 | 120 | 8.32 | 27.87 | 28.31 | 44.98 | 6.21 |
| fossil_fuel | min 24.7, median 55.3, max 81.2 | 100% | 120 | 120 | 24.72 | 55.49 | 55.26 | 81.22 | 9.62 |
| urban_pop | min 29.8, median 63.2, max 97.6 | 100% | 120 | 120 | 29.81 | 62.52 | 63.23 | 97.62 | 13.25 |
| democracy | min 3.1, median 23.2, max 45 | 100% | 120 | 120 | 3.10 | 22.94 | 23.21 | 45.00 | 8.32 |
| trade_network | min 0.182, median 0.651, max 1.04 | 100% | 120 | 120 | 0.182 | 0.643 | 0.651 | 1.04 | 0.171 |
| agriculture | min 1, median 14.3, max 37.1 | 100% | 120 | 110 | 1.00 | 13.87 | 14.30 | 37.11 | 8.11 |
| log_trade | min 3.45, median 4.42, max 5.84 | 100% | 120 | 120 | 3.45 | 4.43 | 4.42 | 5.84 | 0.458 |
| fdi | min -5, median 1.5, max 13.6 | 100% | 120 | 116 | -5.00 | 2.23 | 1.50 | 13.62 | 4.19 |
| corruption | min 0.05, median 0.374, max 0.715 | 100% | 120 | 120 | 0.050 | 0.367 | 0.374 | 0.715 | 0.164 |
| log_tourism | min 11.5, median 14.6, max 19.6 | 100% | 120 | 120 | 11.54 | 14.61 | 14.57 | 19.63 | 1.32 |
| log_credit | min 2.3, median 3.89, max 5.5 | 100% | 120 | 120 | 2.30 | 3.83 | 3.89 | 5.50 | 0.652 |
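
The columns of the statistics table can be recomputed for any numeric variable with the standard library. The sketch below is a hypothetical helper, `summarize`, and it assumes the SD column is the sample standard deviation (n − 1 denominator); the page does not say which definition it uses. The demo values are made up.

```python
import statistics


def summarize(values):
    """N, distinct count, min, mean, median, max and sample SD for one column."""
    return {
        "n": len(values),
        "distinct": len(set(values)),
        "min": min(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "sd": statistics.stdev(values),  # sample SD; an assumption, see above
    }


# Made-up values for illustration, not taken from the dataset.
demo = summarize([1.0, 2.0, 2.0, 5.0])
print(demo)
```

A Distinct count below N, as for agriculture (110) and fdi (116), means some values repeat across countries.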

Known limitations & caveats