Interactive data dictionary

Causal Machine Learning for Policy Evaluation

From ATE to GATE to IATE to a welfare-maximising assignment rule, on a synthetic Flanders-ALMP cohort with known true effects.

2 datasets · 12 variables · 5,000 jobseekers · 30-month follow-up

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| cml_data | jobseeker (cross-section) | 5,000 × 8 | cml_data.dta | cml_data.csv |
| cml_truth | jobseeker (cross-section) | 5,000 × 5 | cml_truth.dta | cml_truth.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
use "${BASE}cml_data.dta", clear
describe
notes

Python

# Notebook-only shell magic (Colab/Jupyter); in a terminal run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
df = pd.read_stata(BASE + "cml_data.dta")

# load every dataset at once
files = ["cml_data", "cml_truth"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "cml_data.dta", "cml_data.dta")
df, meta = pyreadstat.read_dta("cml_data.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
df <- read_dta(paste0(BASE, "cml_data.dta"))

Overview & sources

Companion data for a hands-on Python tutorial on Causal Machine Learning (CML) for active-labour-market-programme (ALMP) evaluation. The cohort is a fully synthetic Flemish-ALMP-style sample of 5,000 jobseekers with six pre-treatment covariates, a binary training indicator, and months employed over a 30-month follow-up. The post walks the full CML roadmap — the average treatment effect (ATE) via DoubleML's cross-fitted, doubly-robust Interactive Regression Model; group effects (GATE) by Dutch proficiency via doubly-robust pseudo-outcome averaging; individual effects (IATE) via EconML's CausalForestDML; and a welfare-maximising training-assignment rule. Because the data are simulated, the true individual treatment effect of every jobseeker is known, so every estimator is benchmarked against ground truth. The data-generating process is modelled on Cockx, Lechner & Bollens (2023) and the methodological roadmap in Lechner (2023).

Two files, one row per jobseeker, joined by row order. cml_data is the observed cross-section a real analyst would see: six covariates (X), the treatment D, and the outcome Y. cml_truth is the hidden ground truth available only because the data are simulated: both potential outcomes (Y0, Y1), the individual effect tau, and the true propensity pi_true, plus dutch_prof carried over for self-contained group-bys. Row i of cml_truth corresponds to row i of cml_data (no shared key column — the join is positional). The truth columns are not observable predictors and must never enter an estimator; they exist only to score the estimates.
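Because the two files share no key column, the join has to be done by row position. The following is a minimal pandas sketch of that join, using tiny made-up frames in place of the real files. The helper name `join_by_row_order` and the toy values are illustrative assumptions, not part of the post's script.

```python
import pandas as pd


def join_by_row_order(data: pd.DataFrame, truth: pd.DataFrame) -> pd.DataFrame:
    """Attach the truth columns to the observed data by row position."""
    if len(data) != len(truth):
        raise ValueError("row counts differ; a positional join is not safe")
    data = data.reset_index(drop=True)
    truth = truth.reset_index(drop=True)
    # dutch_prof is present in both files, so it doubles as an alignment check.
    same = data["dutch_prof"].to_numpy() == truth["dutch_prof"].to_numpy()
    if not same.all():
        raise ValueError("dutch_prof disagrees between files; rows are misaligned")
    return pd.concat([data, truth.drop(columns="dutch_prof")], axis=1)


# Toy stand-ins for cml_data and cml_truth (values are made up).
toy_data = pd.DataFrame(
    {"dutch_prof": [0, 3, 1], "D": [1, 0, 1], "Y": [24.0, 21.5, 26.0]}
)
toy_truth = pd.DataFrame({"dutch_prof": [0, 3, 1], "tau": [7.9, 3.1, 6.2]})

merged = join_by_row_order(toy_data, toy_truth)
print(merged.columns.tolist())
```

On the real files, `data` and `truth` would be the two frames loaded with `pd.read_stata`. Sorting or filtering either frame before the join would silently break the alignment, which is why the sketch checks `dutch_prof` first.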

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values, simulated via a calibrated ALMP data-generating process (open & reproducible) | Mendez, C. (2026). See the post's Python script script.py (simulate_almp) for the full DGP. |
| Cockx, Lechner & Bollens (2023) | Empirical case study the synthetic DGP is calibrated to (Flanders ALMP; Dutch-proficiency heterogeneity) | Cockx, B., Lechner, M., & Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. Labour Economics, 80, 102306. https://doi.org/10.1016/j.labeco.2023.102306 |
| Method references | Estimators and concepts | Lechner (2023); Chernozhukov et al. (2018, DoubleML / IRM); Athey, Tibshirani & Wager (2019, generalized random forests / causal forests). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule [Data set]. https://carlos-mendez.org/post/python_cml/

Lechner, M. (2023). Causal Machine Learning and its use for public policy. Swiss Journal of Economics and Statistics, 159(8). https://doi.org/10.1186/s41937-023-00113-y

Cockx, B., Lechner, M., & Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. Labour Economics, 80, 102306. https://doi.org/10.1016/j.labeco.2023.102306

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148–1178. https://doi.org/10.1214/18-AOS1709

BibTeX

@misc{mendez2026pythoncml,
  author       = {Mendez, Carlos},
  title        = {Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_cml/}},
  note         = {Data set}
}

@article{lechner2023cml,
  author  = {Lechner, Michael},
  title   = {Causal Machine Learning and its use for public policy},
  journal = {Swiss Journal of Economics and Statistics},
  volume  = {159}, number = {8}, year = {2023}
}
@article{cockx2023priority,
  author  = {Cockx, Bart and Lechner, Michael and Bollens, Joost},
  title   = {Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium},
  journal = {Labour Economics},
  volume  = {80}, pages = {102306}, year = {2023}
}
@article{chernozhukov2018double,
  author  = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James},
  title   = {Double/debiased machine learning for treatment and structural parameters},
  journal = {The Econometrics Journal},
  volume  = {21}, number = {1}, pages = {C1--C68}, year = {2018}
}
@article{athey2019grf,
  author  = {Athey, Susan and Tibshirani, Julie and Wager, Stefan},
  title   = {Generalized random forests},
  journal = {Annals of Statistics},
  volume  = {47}, number = {2}, pages = {1148--1178}, year = {2019}
}

Variable explorer

All 12 variables across both files, with a short distribution summary for each.

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| D | dummy | share coded 1 = 0.528 | Treatment: received training (1=yes) | Binary training indicator: 1 if the jobseeker received the ALMP training, else 0. | 0/1 | cml_data | Simulation |
| Y | continuous | min 9.81, median 22.8, max 30 | Observed outcome: months employed (0–30) | Observed months employed over the 30-month follow-up; the realised potential outcome under the assigned treatment. | months (0-30) | cml_data | Simulation |
| Y0 | continuous | min 8.83, median 19.6, max 30 | Untreated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITHOUT training; counterfactual, observable only in the simulation. | months (0-30) | cml_truth | Simulation (ground truth) |
| Y1 | continuous | min 13.6, median 25.2, max 30 | Treated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITH training; counterfactual, observable only in the simulation. | months (0-30) | cml_truth | Simulation (ground truth) |
| age | continuous | min 20, median 39.7, max 60 | Age (years) | Jobseeker age in years at programme entry (pre-treatment covariate). | years | cml_data | Simulation |
| dutch_prof | identifier | | Dutch language proficiency (0–3) | Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. | 0-3 (ordinal) | cml_data, cml_truth | Simulation |
| edu_years | continuous | min 6, median 11.9, max 20 | Years of education | Completed years of formal education (pre-treatment covariate). | years | cml_data | Simulation |
| female | dummy | share coded 1 = 0.492 | Female (1=yes) | Sex indicator, 1 if female else 0 (pre-treatment covariate). | 0/1 | cml_data | Simulation |
| migrant | dummy | share coded 1 = 0.305 | Migrant (1=yes) | Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate; a moderator of the effect). | 0/1 | cml_data | Simulation |
| pi_true | continuous | min 0.209, median 0.528, max 0.811 | True propensity P(D=1 \| X) | The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover. | 0-1 (probability) | cml_truth | Simulation (ground truth) |
| prior_emp_months | continuous | min 0.369, median 15.8, max 54.7 | Prior employment in look-back window (months) | Months employed during the pre-programme look-back window (pre-treatment covariate). | months | cml_data | Simulation |
| tau | continuous | min 2.02, median 5.74, max 8.95 | True individual treatment effect (months) | The true per-jobseeker effect of training, τ = Y(1) − Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule. | months | cml_truth | Simulation (ground truth) |

Cross-file variable index

Which file each variable appears in (● = present).

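| Variable | cml_data | cml_truth |
|---|---|---|
| D | ● | |
| Y | ● | |
| Y0 | | ● |
| Y1 | | ● |
| age | ● | |
| dutch_prof | ● | ● |
| edu_years | ● | |
| female | ● | |
| migrant | ● | |
| pi_true | | ● |
| prior_emp_months | ● | |
| tau | | ● |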

Construction & formulas

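The post targets three causal estimands of increasing granularity, all under unconfoundedness (selection-on-observables) over the six covariates X: the ATE, E[Y(1) − Y(0)]; the GATE, E[Y(1) − Y(0) | G = g], with G = dutch_prof; and the IATE, E[Y(1) − Y(0) | X = x].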

The doubly-robust score at observation i (whose group-mean gives the GATE) is ψ_i = g₁(Xᵢ) − g₀(Xᵢ) + Dᵢ·(Yᵢ − g₁(Xᵢ))/m(Xᵢ) − (1−Dᵢ)·(Yᵢ − g₀(Xᵢ))/(1 − m(Xᵢ)), where g_d(X) = E[Y | D = d, X] is the outcome regression and m(X) = P(D = 1 | X) is the propensity. The welfare of an assignment rule under a fixed cost c = 4 months is W = E[ rule(X)·(τ(X) − c) ].
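To make the score concrete, here is a small NumPy sketch that evaluates ψ for each observation and averages it by group. It takes the nuisance predictions g₀, g₁ and m as given inputs; in the post these would come from cross-fitted learners, a step this sketch leaves out. The function names and the toy numbers are illustrative assumptions.

```python
import numpy as np


def dr_scores(y, d, g0, g1, m):
    """Doubly-robust (AIPW) score psi_i for each observation.

    y, d   : observed outcome and 0/1 treatment
    g0, g1 : predictions of E[Y | D=0, X] and E[Y | D=1, X]
    m      : predicted propensity P(D=1 | X), strictly between 0 and 1
    """
    y, d, g0, g1, m = (np.asarray(a, dtype=float) for a in (y, d, g0, g1, m))
    return g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)


def gate_by_group(psi, group):
    """Average the scores within each group level to get the GATEs."""
    group = np.asarray(group)
    return {int(g): float(psi[group == g].mean()) for g in np.unique(group)}


# Toy inputs (made up): four jobseekers, two Dutch-proficiency levels.
y = [20.0, 26.0, 18.0, 25.0]
d = [0, 1, 0, 1]
g0 = [20.0, 21.0, 19.0, 20.0]
g1 = [25.0, 25.0, 23.0, 24.0]
m = [0.5, 0.5, 0.5, 0.5]
dutch = [0, 0, 3, 3]

psi = dr_scores(y, d, g0, g1, m)
print(psi)                        # [5. 6. 6. 6.]
print(gate_by_group(psi, dutch))  # {0: 5.5, 3: 6.0}
print(psi.mean())                 # 5.75, the ATE estimate on the toy data
```

The overall mean of ψ estimates the ATE, and the same scores averaged within each `dutch_prof` level give the GATEs, so one set of scores serves both estimands.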

Synthetic data-generating process (simulate_almp). Covariates are drawn independently: age ~ U(20,60), edu_years ~ N(12,3) clipped to [6,20], prior_emp_months ~ 60·Beta(2,5), dutch_prof ∈ {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15), female ~ Bernoulli(0.48), migrant ~ Bernoulli(0.30). The true propensity is logistic in the covariates, logit π = −0.6 + 0.020(40−age) + 0.05(12−edu) + 0.015(30−prior_emp) + 0.30(3−dutch) + 0.20·migrant − 0.10·female, clipped to [0.05, 0.95], and D ~ Bernoulli(π). The true individual effect is τ = 3.0 + 1.5(3 − dutch_prof) + 0.4·migrant − 0.02(age − 40) + N(0, 0.3) — so the effect is largest for low-Dutch, migrant, younger jobseekers (the policy punchline). The untreated potential outcome is Y0 = 12 + 0.20·prior_emp + 0.30·edu + 1.0·dutch_prof − 0.005(age−40)² + N(0, 2.5) clipped to [0,30], the treated outcome is Y1 = clip(Y0 + τ, 0, 30), and the observed outcome is Y = D·Y1 + (1−D)·Y0.
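The formulas above translate almost line by line into NumPy. The sketch below is a reading of the stated process, not the post's actual `simulate_almp`: the function name `simulate_almp_sketch`, the seed, the order of the random draws, and the reading of N(0, s) as a standard deviation of s are all assumptions, so it will not reproduce the published files value for value.

```python
import numpy as np


def simulate_almp_sketch(n=5000, seed=0):
    """Draw one synthetic cohort following the formulas stated in the text."""
    rng = np.random.default_rng(seed)

    # Covariates, drawn independently.
    age = rng.uniform(20, 60, n)
    edu = np.clip(rng.normal(12, 3, n), 6, 20)
    prior = 60 * rng.beta(2, 5, n)
    dutch = rng.choice([0, 1, 2, 3], size=n, p=[0.25, 0.30, 0.30, 0.15])
    female = rng.binomial(1, 0.48, n)
    migrant = rng.binomial(1, 0.30, n)

    # True propensity (logistic index, then clipped) and treatment draw.
    index = (-0.6 + 0.020 * (40 - age) + 0.05 * (12 - edu)
             + 0.015 * (30 - prior) + 0.30 * (3 - dutch)
             + 0.20 * migrant - 0.10 * female)
    pi_true = np.clip(1 / (1 + np.exp(-index)), 0.05, 0.95)
    d = rng.binomial(1, pi_true)

    # Individual effect and potential outcomes (noise scale read as an sd).
    tau = (3.0 + 1.5 * (3 - dutch) + 0.4 * migrant
           - 0.02 * (age - 40) + rng.normal(0, 0.3, n))
    y0 = np.clip(12 + 0.20 * prior + 0.30 * edu + 1.0 * dutch
                 - 0.005 * (age - 40) ** 2 + rng.normal(0, 2.5, n), 0, 30)
    y1 = np.clip(y0 + tau, 0, 30)
    y = d * y1 + (1 - d) * y0  # only one potential outcome is observed

    return {
        "age": age, "edu_years": edu, "prior_emp_months": prior,
        "dutch_prof": dutch, "female": female, "migrant": migrant,
        "D": d, "Y": y, "Y0": y0, "Y1": y1, "tau": tau, "pi_true": pi_true,
    }


sim = simulate_almp_sketch(n=5000, seed=0)
print(round(float(sim["tau"].mean()), 2), round(float(sim["D"].mean()), 3))
```

Because Y1 is clipped at 30, the realised difference Y1 − Y0 can be smaller than tau for jobseekers near the ceiling, which is why tau is described as the pre-clipping effect.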

The datasets

Each dataset below has its full variable dictionary plus a statistics table with distribution summaries and data coverage.


cml_data · jobseeker (cross-section) · 5,000 × 8 · 30-month follow-up (single cohort, no calendar year) · 5,000 jobseekers

Panel key: row order (positional; no explicit id column) · The data a real analyst sees — used to estimate the ATE (DoubleML), GATE (by Dutch proficiency), and IATE (causal forest).

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| age (continuous) | Age (years) | Jobseeker age in years at programme entry (pre-treatment covariate). | Drawn age ~ Uniform(20, 60). | years | Simulation | cml_data.csv |
| edu_years (continuous) | Years of education | Completed years of formal education (pre-treatment covariate). | Drawn N(12, 3), clipped to [6, 20]. | years | Simulation | cml_data.csv |
| prior_emp_months (continuous) | Prior employment in look-back window (months) | Months employed during the pre-programme look-back window (pre-treatment covariate). | Drawn 60 · Beta(2, 5). | months | Simulation | cml_data.csv |
| dutch_prof (identifier) | Dutch language proficiency (0–3) | Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. | Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15). | 0-3 (ordinal) | Simulation | cml_data.csv & cml_truth.csv |
| female (dummy) | Female (1=yes) | Sex indicator, 1 if female else 0 (pre-treatment covariate). | Drawn Bernoulli(0.48). | 0/1 | Simulation | cml_data.csv |
| migrant (dummy) | Migrant (1=yes) | Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate; a moderator of the effect). | Drawn Bernoulli(0.30). | 0/1 | Simulation | cml_data.csv |
| D (dummy) | Treatment: received training (1=yes) | Binary training indicator: 1 if the jobseeker received the ALMP training, else 0. | D ~ Bernoulli(pi_true), with pi_true the true logistic propensity in the covariates. | 0/1 | Simulation | cml_data.csv |
| Y (continuous) | Observed outcome: months employed (0–30) | Observed months employed over the 30-month follow-up; the realised potential outcome under the assigned treatment. | Y = D·Y1 + (1−D)·Y0, where Y0/Y1 are the clipped potential outcomes. | months (0-30) | Simulation | cml_data.csv |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| age | min 20, median 39.7, max 60 | 100% | 5,000 | 5,000 | 20.02 | 39.82 | 39.68 | 59.99 | 11.54 |
| edu_years | min 6, median 11.9, max 20 | 100% | 5,000 | 4,870 | 6.00 | 12.02 | 11.94 | 20.00 | 2.95 |
| prior_emp_months | min 0.369, median 15.8, max 54.7 | 100% | 5,000 | 5,000 | 0.369 | 16.99 | 15.80 | 54.75 | 9.59 |
| dutch_prof | | 100% | 5,000 | 4 | | | | | |
| female | share coded 1 = 0.492 | 100% | 5,000 | 2 | 0 | 0.492 | 0 | 1.00 | 0.500 |
| migrant | share coded 1 = 0.305 | 100% | 5,000 | 2 | 0 | 0.305 | 0 | 1.00 | 0.460 |
| D | share coded 1 = 0.528 | 100% | 5,000 | 2 | 0 | 0.528 | 1.00 | 1.00 | 0.499 |
| Y | min 9.81, median 22.8, max 30 | 100% | 5,000 | 4,778 | 9.81 | 22.68 | 22.81 | 30.00 | 4.18 |

cml_truth · jobseeker (cross-section) · 5,000 × 5 · matches cml_data.csv (positional) · 5,000 jobseekers

Panel key: row order (positional; aligns 1:1 with cml_data.csv) · Benchmark every estimator against the truth: tau scores the IATE, and tau feeds the welfare/oracle comparison.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| Y0 (continuous) | Untreated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITHOUT training; counterfactual, observable only in the simulation. | Y0 = clip(12 + 0.20·prior_emp + 0.30·edu + 1.0·dutch_prof − 0.005(age−40)² + N(0,2.5), 0, 30). | months (0-30) | Simulation (ground truth) | cml_truth.csv |
| Y1 (continuous) | Treated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITH training; counterfactual, observable only in the simulation. | Y1 = clip(Y0 + tau, 0, 30). | months (0-30) | Simulation (ground truth) | cml_truth.csv |
| tau (continuous) | True individual treatment effect (months) | The true per-jobseeker effect of training, τ = Y(1) − Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule. | tau = 3.0 + 1.5(3 − dutch_prof) + 0.4·migrant − 0.02(age − 40) + N(0, 0.3). | months | Simulation (ground truth) | cml_truth.csv |
| pi_true (continuous) | True propensity P(D=1 \| X) | The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover. | logistic(−0.6 + 0.020(40−age) + 0.05(12−edu) + 0.015(30−prior_emp) + 0.30(3−dutch) + 0.20·migrant − 0.10·female), clipped to [0.05, 0.95]. | 0-1 (probability) | Simulation (ground truth) | cml_truth.csv |
| dutch_prof (identifier) | Dutch language proficiency (0–3) | Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. | Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15). | 0-3 (ordinal) | Simulation | cml_data.csv & cml_truth.csv |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| Y0 | min 8.83, median 19.6, max 30 | 100% | 5,000 | 4,982 | 8.83 | 19.66 | 19.61 | 30.00 | 3.47 |
| Y1 | min 13.6, median 25.2, max 30 | 100% | 5,000 | 4,580 | 13.62 | 25.15 | 25.24 | 30.00 | 3.14 |
| tau | min 2.02, median 5.74, max 8.95 | 100% | 5,000 | 5,000 | 2.02 | 5.63 | 5.74 | 8.95 | 1.58 |
| pi_true | min 0.209, median 0.528, max 0.811 | 100% | 5,000 | 5,000 | 0.209 | 0.526 | 0.528 | 0.811 | 0.107 |
| dutch_prof | | 100% | 5,000 | 4 | | | | | |
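Since tau is known for every row, any 0/1 assignment rule can be scored directly with the welfare formula W = E[rule · (τ − c)] at the fixed cost c = 4 months. The sketch below does this on a handful of made-up effects. Defining the oracle as "treat exactly when tau exceeds the cost" is an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np


def welfare(rule, tau, cost=4.0):
    """Mean net gain, W = E[rule * (tau - cost)], for a 0/1 rule."""
    rule = np.asarray(rule, dtype=float)
    tau = np.asarray(tau, dtype=float)
    return float(np.mean(rule * (tau - cost)))


# Toy true effects in months (made up), within the range seen in cml_truth.
tau = np.array([2.5, 3.5, 5.0, 6.5, 8.0])

treat_all = np.ones_like(tau)
treat_none = np.zeros_like(tau)
oracle = (tau > 4.0).astype(int)  # treat only those whose gain exceeds the cost

print(welfare(treat_none, tau))  # 0.0
print(welfare(treat_all, tau))   # 1.1
print(welfare(oracle, tau))      # 1.5
```

An estimated rule would replace `oracle` with a 0/1 vector built from predicted effects; it is still scored against the true tau, so the gap to the oracle's welfare measures what the estimation error costs.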

Known limitations & caveats