Data dictionary · Causal Machine Learning for Policy Evaluation

Downloads

Each dataset is available as a labeled Stata .dta file and as its source .csv file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`cml_data`	jobseeker (cross-section)	5,000 × 8	cml_data.dta	cml_data.csv
`cml_truth`	jobseeker (cross-section)	5,000 × 5	cml_truth.dta	cml_truth.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
use "${BASE}cml_data.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
df = pd.read_stata(BASE + "cml_data.dta")

# load every dataset at once
files = ["cml_data", "cml_truth"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "cml_data.dta", "cml_data.dta")
df, meta = pyreadstat.read_dta("cml_data.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
df <- read_dta(paste0(BASE, "cml_data.dta"))

Overview & sources

Companion data for a hands-on Python tutorial on Causal Machine Learning (CML) for active-labour-market-programme (ALMP) evaluation. The cohort is a fully synthetic Flemish-ALMP-style sample of 5,000 jobseekers with six pre-treatment covariates, a binary training indicator, and months employed over a 30-month follow-up. The post walks the full CML roadmap — the average treatment effect (ATE) via DoubleML's cross-fitted, doubly-robust Interactive Regression Model; group effects (GATE) by Dutch proficiency via doubly-robust pseudo-outcome averaging; individual effects (IATE) via EconML's CausalForestDML; and a welfare-maximising training-assignment rule. Because the data are simulated, the true individual treatment effect of every jobseeker is known, so every estimator is benchmarked against ground truth. The data-generating process is modelled on Cockx, Lechner & Bollens (2023) and the methodological roadmap in Lechner (2023).

Two files, one row per jobseeker, joined by row order. cml_data is the observed cross-section a real analyst would see: six covariates (X), the treatment D, and the outcome Y. cml_truth is the hidden ground truth available only because the data are simulated: both potential outcomes (Y0, Y1), the individual effect tau, and the true propensity pi_true, plus dutch_prof carried over for self-contained group-bys. Row i of cml_truth corresponds to row i of cml_data (no shared key column — the join is positional). The truth columns are not observable predictors and must never enter an estimator; they exist only to score the estimates.
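The positional join described above can be sketched as follows. This is a minimal illustration, not the post's own code: `join_positional` is a hypothetical helper, and the two small DataFrames are toy stand-ins for cml_data and cml_truth with only a few of their columns. Because dutch_prof is carried in both files, it can serve as a cheap check that row order was not disturbed (a necessary condition, not a proof of alignment).

```python
import pandas as pd


def join_positional(data: pd.DataFrame, truth: pd.DataFrame) -> pd.DataFrame:
    """Column-bind truth onto data by row position (hypothetical helper)."""
    if len(data) != len(truth):
        raise ValueError("files must have the same number of rows")
    # Discard any existing index so the bind is purely positional.
    data = data.reset_index(drop=True)
    truth = truth.reset_index(drop=True)
    # dutch_prof appears in both files and must agree row by row.
    same = data["dutch_prof"].to_numpy() == truth["dutch_prof"].to_numpy()
    if not same.all():
        raise ValueError("dutch_prof differs between files: row order was disturbed")
    # Drop the duplicate column so the merged frame has unique names.
    return pd.concat([data, truth.drop(columns="dutch_prof")], axis=1)


# Toy stand-ins (three rows, invented values) for the two files.
data = pd.DataFrame({"dutch_prof": [0, 3, 1], "D": [1, 0, 1], "Y": [20.0, 18.5, 25.0]})
truth = pd.DataFrame({"Y0": [13.0, 18.5, 20.0], "Y1": [20.0, 21.5, 25.0],
                      "tau": [7.0, 3.0, 5.0], "dutch_prof": [0, 3, 1]})

merged = join_positional(data, truth)
print(merged.columns.tolist())
```

With the real files, the same call would take the two DataFrames returned by `pd.read_stata`. Sorting either file on its own before the call should trip the dutch_prof check in most cases.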

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated via a calibrated ALMP data-generating process (open & reproducible)	Mendez, C. (2026). See the post's Python script script.py (simulate_almp) for the full DGP.
Cockx, Lechner & Bollens (2023)	Empirical case study the synthetic DGP is calibrated to (Flanders ALMP; Dutch-proficiency heterogeneity)	Cockx, B., Lechner, M., & Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. Labour Economics, 80, 102306. https://doi.org/10.1016/j.labeco.2023.102306
Method references	Estimators and concepts	Lechner (2023); Chernozhukov et al. (2018, DoubleML / IRM); Athey, Tibshirani & Wager (2019, generalized random forests / causal forests).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule [Data set]. https://carlos-mendez.org/post/python_cml/

Lechner, M. (2023). Causal Machine Learning and its use for public policy. Swiss Journal of Economics and Statistics, 159(8). https://doi.org/10.1186/s41937-023-00113-y

Cockx, B., Lechner, M., & Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. Labour Economics, 80, 102306. https://doi.org/10.1016/j.labeco.2023.102306

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148–1178. https://doi.org/10.1214/18-AOS1709

BibTeX

@misc{mendez2026pythoncml,
  author       = {Mendez, Carlos},
  title        = {Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_cml/}},
  note         = {Data set}
}

@article{lechner2023cml,
  author  = {Lechner, Michael},
  title   = {Causal Machine Learning and its use for public policy},
  journal = {Swiss Journal of Economics and Statistics},
  volume  = {159}, number = {8}, year = {2023}
}
@article{cockx2023priority,
  author  = {Cockx, Bart and Lechner, Michael and Bollens, Joost},
  title   = {Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium},
  journal = {Labour Economics},
  volume  = {80}, pages = {102306}, year = {2023}
}
@article{chernozhukov2018double,
  author  = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James},
  title   = {Double/debiased machine learning for treatment and structural parameters},
  journal = {The Econometrics Journal},
  volume  = {21}, number = {1}, pages = {C1--C68}, year = {2018}
}
@article{athey2019grf,
  author  = {Athey, Susan and Tibshirani, Julie and Wager, Stefan},
  title   = {Generalized random forests},
  journal = {Annals of Statistics},
  volume  = {47}, number = {2}, pages = {1148--1178}, year = {2019}
}

Variable explorer: all 12 variables


Variable	Type	Label	Definition	Units	In files	Source
`D`	dummy	Treatment: received training (1=yes)	Binary training indicator: 1 if the jobseeker received the ALMP training, else 0.	0/1	cml_data	Simulation
`Y`	continuous	Observed outcome: months employed (0–30)	Observed months employed over the 30-month follow-up, i.e. the realised potential outcome under the assigned treatment.	months (0-30)	cml_data	Simulation
`Y0`	continuous	Untreated potential outcome (months, truth)	Months employed the jobseeker WOULD have over 30 months WITHOUT training (counterfactual; observable only in the simulation).	months (0-30)	cml_truth	Simulation (ground truth)
`Y1`	continuous	Treated potential outcome (months, truth)	Months employed the jobseeker WOULD have over 30 months WITH training (counterfactual; observable only in the simulation).	months (0-30)	cml_truth	Simulation (ground truth)
`age`	continuous	Age (years)	Jobseeker age in years at programme entry (pre-treatment covariate).	years	cml_data	Simulation
`dutch_prof`	identifier	Dutch language proficiency (0–3)	Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier.	0-3 (ordinal)	cml_data, cml_truth	Simulation
`edu_years`	continuous	Years of education	Completed years of formal education (pre-treatment covariate).	years	cml_data	Simulation
`female`	dummy	Female (1=yes)	Sex indicator, 1 if female else 0 (pre-treatment covariate).	0/1	cml_data	Simulation
`migrant`	dummy	Migrant (1=yes)	Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate; a moderator of the effect).	0/1	cml_data	Simulation
`pi_true`	continuous	True propensity P(D=1 | X)	The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover.	0-1 (probability)	cml_truth	Simulation (ground truth)
`prior_emp_months`	continuous	Prior employment in look-back window (months)	Months employed during the pre-programme look-back window (pre-treatment covariate).	months	cml_data	Simulation
`tau`	continuous	True individual treatment effect (months)	The true per-jobseeker effect of training, τ = Y(1) − Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule.	months	cml_truth	Simulation (ground truth)

Cross-file variable index

Which file each variable appears in (● = present).

Variable	cml_data	cml_truth
`D`	●
`Y`	●
`Y0`		●
`Y1`		●
`age`	●
`dutch_prof`	●	●
`edu_years`	●
`female`	●
`migrant`	●
`pi_true`		●
`prior_emp_months`	●
`tau`		●

Construction & formulas

The post targets three causal estimands of increasing granularity, all under unconfoundedness (selection-on-observables) over the six covariates X:

ATE = E[Y(1) − Y(0)], the population-average effect, estimated by DoubleMLIRM (doubly-robust, random-forest nuisances, 5-fold cross-fitting).
GATE(z) = E[Y(1) − Y(0) | Z = z] with Z = dutch_prof, the subgroup average, estimated as the within-stratum mean of the doubly-robust pseudo-outcome ψ.
IATE(x) = E[Y(1) − Y(0) | X = x], the per-individual effect, estimated by EconML's CausalForestDML.

The doubly-robust score at observation i (whose group-mean gives the GATE) is ψ_i = g₁(Xᵢ) − g₀(Xᵢ) + Dᵢ·(Yᵢ − g₁(Xᵢ))/m(Xᵢ) − (1−Dᵢ)·(Yᵢ − g₀(Xᵢ))/(1 − m(Xᵢ)), where g_d(X) = E[Y | D = d, X] is the outcome regression and m(X) = P(D = 1 | X) is the propensity. The welfare of an assignment rule under a fixed cost c = 4 months is W = E[ rule(X)·(τ(X) − c) ].
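The score formula above translates almost directly into array code. The sketch below is a hedged illustration only: the function names `dr_pseudo_outcome` and `gate_by_group` are hypothetical, the nuisance predictions g0, g1, and m are assumed to come from cross-fitted learners fitted elsewhere, and the four rows of numbers are invented to keep the arithmetic easy to follow.

```python
import numpy as np


def dr_pseudo_outcome(y, d, g0, g1, m):
    """psi_i = g1 - g0 + D*(Y - g1)/m - (1 - D)*(Y - g0)/(1 - m)."""
    y, d, g0, g1, m = (np.asarray(a, dtype=float) for a in (y, d, g0, g1, m))
    return g1 - g0 + d * (y - g1) / m - (1.0 - d) * (y - g0) / (1.0 - m)


def gate_by_group(psi, z):
    """Mean of the pseudo-outcome within each level of the stratifier z."""
    psi, z = np.asarray(psi, dtype=float), np.asarray(z)
    return {int(level): float(psi[z == level].mean()) for level in np.unique(z)}


# Toy inputs (invented): outcomes, treatment, and nuisance predictions.
y = [20.0, 15.0, 25.0, 18.0]
d = [1, 0, 1, 0]
g0 = [14.0, 15.0, 20.0, 17.0]   # predicted E[Y | D=0, X]
g1 = [19.0, 18.0, 24.0, 21.0]   # predicted E[Y | D=1, X]
m = [0.5, 0.5, 0.8, 0.4]        # predicted P(D=1 | X)
z = [0, 0, 3, 3]                # stratifier, e.g. dutch_prof

psi = dr_pseudo_outcome(y, d, g0, g1, m)
ate_hat = float(psi.mean())       # the overall mean of psi estimates the ATE
gates = gate_by_group(psi, z)     # within-stratum means estimate the GATEs
print(ate_hat, gates)
```

The division by m and 1 − m is why overlap matters: predictions of m close to 0 or 1 inflate individual scores.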

Synthetic data-generating process (simulate_almp). Covariates are drawn independently:

age ~ U(20, 60)
edu_years ~ N(12, 3), clipped to [6, 20]
prior_emp_months ~ 60·Beta(2, 5)
dutch_prof ∈ {0, 1, 2, 3} with probabilities (0.25, 0.30, 0.30, 0.15)
female ~ Bernoulli(0.48)
migrant ~ Bernoulli(0.30)

Treatment and outcomes are then built from the covariates:

Propensity: logit π = −0.6 + 0.020(40−age) + 0.05(12−edu) + 0.015(30−prior_emp) + 0.30(3−dutch) + 0.20·migrant − 0.10·female, with π clipped to [0.05, 0.95], and D ~ Bernoulli(π).
Individual effect: τ = 3.0 + 1.5(3 − dutch_prof) + 0.4·migrant − 0.02(age − 40) + N(0, 0.3), so the effect is largest for low-Dutch, migrant, younger jobseekers (the policy punchline).
Untreated potential outcome: Y0 = 12 + 0.20·prior_emp + 0.30·edu + 1.0·dutch_prof − 0.005(age−40)² + N(0, 2.5), clipped to [0, 30].
Treated potential outcome: Y1 = clip(Y0 + τ, 0, 30).
Observed outcome: Y = D·Y1 + (1−D)·Y0.
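For readers who want to see the process end to end, here is a reconstruction of the DGP from the formulas above. It is a sketch, not the author's simulate_almp: the function name `simulate_almp_sketch`, the seed, and the random-number generator are assumptions, so it will not reproduce the published files row for row. It also assumes that the second argument of N(·, ·) is a standard deviation, a reading the reported SD of edu_years (2.95) is consistent with.

```python
import numpy as np
import pandas as pd


def simulate_almp_sketch(n=5000, seed=0):
    """Reconstruct the DGP from the documented formulas (not the author's script)."""
    rng = np.random.default_rng(seed)  # seed is arbitrary (assumption)

    # Covariates, drawn independently.
    age = rng.uniform(20, 60, n)
    edu = np.clip(rng.normal(12, 3, n), 6, 20)
    prior = 60 * rng.beta(2, 5, n)
    dutch = rng.choice([0, 1, 2, 3], size=n, p=[0.25, 0.30, 0.30, 0.15])
    female = rng.binomial(1, 0.48, n)
    migrant = rng.binomial(1, 0.30, n)

    # True propensity (logistic, clipped) and treatment draw.
    logit = (-0.6 + 0.020 * (40 - age) + 0.05 * (12 - edu)
             + 0.015 * (30 - prior) + 0.30 * (3 - dutch)
             + 0.20 * migrant - 0.10 * female)
    pi = np.clip(1.0 / (1.0 + np.exp(-logit)), 0.05, 0.95)
    d = rng.binomial(1, pi)

    # Individual effect and potential outcomes; noise scales read as SDs (assumption).
    tau = (3.0 + 1.5 * (3 - dutch) + 0.4 * migrant
           - 0.02 * (age - 40) + rng.normal(0, 0.3, n))
    y0 = np.clip(12 + 0.20 * prior + 0.30 * edu + 1.0 * dutch
                 - 0.005 * (age - 40) ** 2 + rng.normal(0, 2.5, n), 0, 30)
    y1 = np.clip(y0 + tau, 0, 30)
    y = d * y1 + (1 - d) * y0

    data = pd.DataFrame({"age": age, "edu_years": edu, "prior_emp_months": prior,
                         "dutch_prof": dutch, "female": female, "migrant": migrant,
                         "D": d, "Y": y})
    truth = pd.DataFrame({"Y0": y0, "Y1": y1, "tau": tau,
                          "pi_true": pi, "dutch_prof": dutch})
    return data, truth


data, truth = simulate_almp_sketch()
print(data.shape, truth.shape)
print(truth[["tau", "pi_true"]].mean().round(2))
```

The two returned frames mirror the column layout of cml_data and cml_truth, which makes the sketch convenient for trying an estimator at other sample sizes or seeds.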

The datasets

Each dataset below comes with its full variable dictionary and a summary-statistics table.


cml_data · jobseeker (cross-section) 5,000 × 8 · 30-month follow-up (single cohort, no calendar year) · 5,000 jobseekers

Panel key: row order (positional; no explicit id column) · The data a real analyst sees — used to estimate the ATE (DoubleML), GATE (by Dutch proficiency), and IATE (causal forest).

Variable dictionary

Variable	Type	Label	Definition	Construction	Units	Source	In files
`age`	continuous	Age (years)	Jobseeker age in years at programme entry (pre-treatment covariate).	Drawn age ~ Uniform(20, 60).	years	Simulation	cml_data.csv
`edu_years`	continuous	Years of education	Completed years of formal education (pre-treatment covariate).	Drawn N(12, 3), clipped to [6, 20].	years	Simulation	cml_data.csv
`prior_emp_months`	continuous	Prior employment in look-back window (months)	Months employed during the pre-programme look-back window (pre-treatment covariate).	Drawn 60 · Beta(2, 5).	months	Simulation	cml_data.csv
`dutch_prof`	identifier	Dutch language proficiency (0–3)	Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier.	Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15).	0-3 (ordinal)	Simulation	cml_data.csv & cml_truth.csv
`female`	dummy	Female (1=yes)	Sex indicator, 1 if female else 0 (pre-treatment covariate).	Drawn Bernoulli(0.48).	0/1	Simulation	cml_data.csv
`migrant`	dummy	Migrant (1=yes)	Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate; a moderator of the effect).	Drawn Bernoulli(0.30).	0/1	Simulation	cml_data.csv
`D`	dummy	Treatment: received training (1=yes)	Binary training indicator: 1 if the jobseeker received the ALMP training, else 0.	D ~ Bernoulli(pi_true), with pi_true the true logistic propensity in the covariates.	0/1	Simulation	cml_data.csv
`Y`	continuous	Observed outcome: months employed (0–30)	Observed months employed over the 30-month follow-up, i.e. the realised potential outcome under the assigned treatment.	Y = D·Y1 + (1−D)·Y0, where Y0/Y1 are the clipped potential outcomes.	months (0-30)	Simulation	cml_data.csv

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`age`	100%	5,000	5,000	20.02	39.82	39.68	59.99	11.54
`edu_years`	100%	5,000	4,870	6.00	12.02	11.94	20.00	2.95
`prior_emp_months`	100%	5,000	5,000	0.369	16.99	15.80	54.75	9.59
`dutch_prof`	100%	5,000	4	n/a	n/a	n/a	n/a	n/a
`female`	100%	5,000	2	0	0.492	0	1.00	0.500
`migrant`	100%	5,000	2	0	0.305	0	1.00	0.460
`D`	100%	5,000	2	0	0.528	1.00	1.00	0.499
`Y`	100%	5,000	4,778	9.81	22.68	22.81	30.00	4.18

cml_truth · jobseeker (cross-section) 5,000 × 5 · matches cml_data.csv (positional) · 5,000 jobseekers

Panel key: row order (positional; aligns 1:1 with cml_data.csv) · Benchmark every estimator against the truth: tau scores the IATE and feeds the welfare/oracle comparison.

Variable dictionary

Variable	Type	Label	Definition	Construction	Units	Source	In files
`Y0`	continuous	Untreated potential outcome (months, truth)	Months employed the jobseeker WOULD have over 30 months WITHOUT training (counterfactual; observable only in the simulation).	Y0 = clip(12 + 0.20·prior_emp + 0.30·edu + 1.0·dutch_prof − 0.005(age−40)² + N(0,2.5), 0, 30).	months (0-30)	Simulation (ground truth)	cml_truth.csv
`Y1`	continuous	Treated potential outcome (months, truth)	Months employed the jobseeker WOULD have over 30 months WITH training (counterfactual; observable only in the simulation).	Y1 = clip(Y0 + tau, 0, 30).	months (0-30)	Simulation (ground truth)	cml_truth.csv
`tau`	continuous	True individual treatment effect (months)	The true per-jobseeker effect of training, τ = Y(1) − Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule.	tau = 3.0 + 1.5(3 − dutch_prof) + 0.4·migrant − 0.02(age − 40) + N(0, 0.3).	months	Simulation (ground truth)	cml_truth.csv
`pi_true`	continuous	True propensity P(D=1 | X)	The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover.	logistic(−0.6 + 0.020(40−age) + 0.05(12−edu) + 0.015(30−prior_emp) + 0.30(3−dutch) + 0.20·migrant − 0.10·female), clipped to [0.05, 0.95].	0-1 (probability)	Simulation (ground truth)	cml_truth.csv
`dutch_prof`	identifier	Dutch language proficiency (0–3)	Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier.	Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15).	0-3 (ordinal)	Simulation	cml_data.csv & cml_truth.csv

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`Y0`	100%	5,000	4,982	8.83	19.66	19.61	30.00	3.47
`Y1`	100%	5,000	4,580	13.62	25.15	25.24	30.00	3.14
`tau`	100%	5,000	5,000	2.02	5.63	5.74	8.95	1.58
`pi_true`	100%	5,000	5,000	0.209	0.526	0.528	0.811	0.107
`dutch_prof`	100%	5,000	4	n/a	n/a	n/a	n/a	n/a
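The benchmarking role of cml_truth can be sketched in a few lines: tau scores IATE estimates, and the welfare formula W = E[rule(X)·(τ(X) − c)] with c = 4 scores an assignment rule. The code below is an illustration under stated assumptions. The helper names `iate_rmse` and `welfare` are hypothetical, the numbers are invented, the plug-in rule "train if the estimated effect exceeds the cost" is one natural choice rather than necessarily the post's rule, and the tau column is used as a stand-in for τ(X).

```python
import numpy as np

COST = 4.0  # fixed cost c in months, as in the welfare formula


def iate_rmse(tau_hat, tau):
    """Root-mean-squared error of IATE estimates against the true effects."""
    tau_hat, tau = np.asarray(tau_hat, dtype=float), np.asarray(tau, dtype=float)
    return float(np.sqrt(np.mean((tau_hat - tau) ** 2)))


def welfare(rule, tau, cost=COST):
    """Sample analogue of W = E[rule * (tau - c)] for a 0/1 assignment rule."""
    rule, tau = np.asarray(rule, dtype=float), np.asarray(tau, dtype=float)
    return float(np.mean(rule * (tau - cost)))


# Toy values (invented): true effects and some estimator's IATE predictions.
tau = np.array([7.0, 3.0, 5.0, 2.5])
tau_hat = np.array([6.0, 4.5, 5.5, 2.0])

oracle_rule = tau > COST        # treats exactly those whose true gain beats the cost
plug_in_rule = tau_hat > COST   # assumption: threshold the estimates at the cost
treat_all = np.ones_like(tau)

print("RMSE:", iate_rmse(tau_hat, tau))
for name, rule in [("oracle", oracle_rule), ("plug-in", plug_in_rule), ("treat all", treat_all)]:
    print(name, welfare(rule, tau))
```

The oracle rule is an upper bound on welfare by construction, so the gap between it and the plug-in rule measures how much estimation error costs in months.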

Known limitations & caveats

Synthetic data. There is no real cohort behind this tutorial; results are internally consistent with the calibration but are not empirical evidence about any real ALMP.
Truth columns are not observable predictors. Y0, Y1, tau, and pi_true in cml_truth.csv are knowable only because the data are simulated. They are for benchmarking and welfare scoring only — feeding any of them into an estimator is circular and must not be done.
Positional join. cml_data.csv and cml_truth.csv share no key column; row i of one matches row i of the other. Preserve row order when merging (e.g. pd.concat([df, truth], axis=1)), never sort one file independently.
Easy overlap by construction. True propensities are clipped to [0.05, 0.95] (and the logistic-regression estimate lands inside [0.21, 0.81]), so trimming choices and doubly-robust denominators are not stressed here. Real ALMP cohorts typically have weaker overlap, with propensities closer to 0 or 1, so trimming matters more.
Treatment share is high (≈53%). Calibrated to keep overlap comfortable in every Dutch-proficiency stratum; do not over-interpret the magnitude of effects as representative of a real programme.
Unconfoundedness holds by construction. The DGP satisfies selection-on-observables over the six covariates; in a real application this is the strong identifying assumption that justifies DoubleML and CausalForestDML over a naive comparison.
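The overlap caveats above suggest a routine diagnostic before trusting any doubly-robust estimate: summarise the estimated propensity and treated share within each Dutch-proficiency stratum. The sketch below assumes a hypothetical column `m_hat` holding propensities estimated by the analyst's own model. The helper name `overlap_by_stratum`, the trimming bounds, and the toy rows are illustrative choices, not part of the dataset.

```python
import pandas as pd


def overlap_by_stratum(df, prop_col="m_hat", treat_col="D",
                       group_col="dutch_prof", lo=0.05, hi=0.95):
    """Per-stratum treated share, propensity range, and share outside [lo, hi]."""
    out = df.groupby(group_col).agg(
        n=(treat_col, "size"),
        treated_share=(treat_col, "mean"),
        p_min=(prop_col, "min"),
        p_max=(prop_col, "max"),
    )
    # Flag observations whose propensity falls outside the trimming bounds.
    outside = (df[prop_col] < lo) | (df[prop_col] > hi)
    out["share_outside"] = outside.groupby(df[group_col]).mean()
    return out


# Toy rows (invented); m_hat would come from a fitted propensity model.
toy = pd.DataFrame({
    "dutch_prof": [0, 0, 3, 3],
    "D": [1, 0, 1, 0],
    "m_hat": [0.70, 0.60, 0.40, 0.02],
})
report = overlap_by_stratum(toy)
print(report)
```

A stratum with a treated share near 0 or 1, or a large share outside the bounds, is where GATE estimates would lean hardest on the outcome model.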