Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| cml_data | jobseeker (cross-section) | 5,000 × 8 | cml_data.dta | cml_data.csv |
| cml_truth | jobseeker (cross-section) | 5,000 × 5 | cml_truth.dta | cml_truth.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
use "${BASE}cml_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
df = pd.read_stata(BASE + "cml_data.dta")
# load every dataset at once
files = ["cml_data", "cml_truth"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "cml_data.dta", "cml_data.dta")
df, meta = pyreadstat.read_dta("cml_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_cml/data/"
df <- read_dta(paste0(BASE, "cml_data.dta"))
Overview & sources
Companion data for a hands-on Python tutorial on Causal Machine Learning (CML) for active-labour-market-programme (ALMP) evaluation. The cohort is a fully synthetic Flemish-ALMP-style sample of 5,000 jobseekers with six pre-treatment covariates, a binary training indicator, and months employed over a 30-month follow-up. The post walks the full CML roadmap — the average treatment effect (ATE) via DoubleML's cross-fitted, doubly-robust Interactive Regression Model; group effects (GATE) by Dutch proficiency via doubly-robust pseudo-outcome averaging; individual effects (IATE) via EconML's CausalForestDML; and a welfare-maximising training-assignment rule. Because the data are simulated, the true individual treatment effect of every jobseeker is known, so every estimator is benchmarked against ground truth. The data-generating process is modelled on Cockx, Lechner & Bollens (2023) and the methodological roadmap in Lechner (2023).
cml_data is the observed cross-section a real analyst would see: six covariates (X), the treatment D, and the outcome Y. cml_truth is the hidden ground truth available only because the data are simulated: both potential outcomes (Y0, Y1), the individual effect tau, and the true propensity pi_true, plus dutch_prof carried over for self-contained group-bys. Row i of cml_truth corresponds to row i of cml_data (no shared key column — the join is positional). The truth columns are not observable predictors and must never enter an estimator; they exist only to score the estimates.
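The positional join described above can be sketched as follows. This is a minimal illustration, not code from the post: the helper name `join_truth` and the tiny demo frames are hypothetical, and the check assumes `dutch_prof` loads with the same coding in both files.

```python
import pandas as pd


def join_truth(df: pd.DataFrame, truth: pd.DataFrame) -> pd.DataFrame:
    """Attach the truth columns to the observed data by row position."""
    if len(df) != len(truth):
        raise ValueError("row counts differ; the positional join is not valid")
    # Reset both indexes so concat aligns on position, not on stale labels.
    df = df.reset_index(drop=True)
    truth = truth.reset_index(drop=True)
    # dutch_prof is in both files: use it as a sanity check, then drop the copy.
    if not (df["dutch_prof"] == truth["dutch_prof"]).all():
        raise ValueError("dutch_prof mismatch; row order was probably changed")
    return pd.concat([df, truth.drop(columns="dutch_prof")], axis=1)


# Tiny made-up frames, for illustration only (not the real data).
demo_df = pd.DataFrame({"dutch_prof": [0, 3, 1], "D": [1, 0, 1], "Y": [20.0, 25.0, 18.0]})
demo_truth = pd.DataFrame({"dutch_prof": [0, 3, 1], "tau": [7.5, 3.0, 6.0]})
merged = join_truth(demo_df, demo_truth)
```

The `dutch_prof` comparison is only a weak guard (it catches most accidental re-sorting, not all of it), so the safest habit is still to merge immediately after loading and before any sorting or filtering.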
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated via a calibrated ALMP data-generating process (open & reproducible) | Mendez, C. (2026). See the post's Python script script.py (simulate_almp) for the full DGP. |
| Cockx, Lechner & Bollens (2023) | Empirical case study the synthetic DGP is calibrated to (Flanders ALMP; Dutch-proficiency heterogeneity) | Cockx, B., Lechner, M., & Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. Labour Economics, 80, 102306. https://doi.org/10.1016/j.labeco.2023.102306 |
| Method references | Estimators and concepts | Lechner (2023); Chernozhukov et al. (2018, DoubleML / IRM); Athey, Tibshirani & Wager (2019, generalized random forests / causal forests). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule [Data set]. https://carlos-mendez.org/post/python_cml/
Lechner, M. (2023). Causal Machine Learning and its use for public policy. Swiss Journal of Economics and Statistics, 159(8). https://doi.org/10.1186/s41937-023-00113-y
Cockx, B., Lechner, M., & Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. Labour Economics, 80, 102306. https://doi.org/10.1016/j.labeco.2023.102306
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148–1178. https://doi.org/10.1214/18-AOS1709
BibTeX
@misc{mendez2026pythoncml,
author = {Mendez, Carlos},
title = {Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_cml/}},
note = {Data set}
}
@article{lechner2023cml,
author = {Lechner, Michael},
title = {Causal Machine Learning and its use for public policy},
journal = {Swiss Journal of Economics and Statistics},
volume = {159}, number = {8}, year = {2023}
}
@article{cockx2023priority,
author = {Cockx, Bart and Lechner, Michael and Bollens, Joost},
title = {Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium},
journal = {Labour Economics},
volume = {80}, pages = {102306}, year = {2023}
}
@article{chernozhukov2018double,
author = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James},
title = {Double/debiased machine learning for treatment and structural parameters},
journal = {The Econometrics Journal},
volume = {21}, number = {1}, pages = {C1--C68}, year = {2018}
}
@article{athey2019grf,
author = {Athey, Susan and Tibshirani, Julie and Wager, Stefan},
title = {Generalized random forests},
journal = {Annals of Statistics},
volume = {47}, number = {2}, pages = {1148--1178}, year = {2019}
}
Variable explorer (all 12 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| D | dummy | Treatment: received training (1=yes) | Binary training indicator: 1 if the jobseeker received the ALMP training, else 0. | 0/1 | cml_data | Simulation |
| Y | continuous | Observed outcome: months employed (0–30) | Observed months employed over the 30-month follow-up; the realised potential outcome under the assigned treatment. | months (0-30) | cml_data | Simulation |
| Y0 | continuous | Untreated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITHOUT training; counterfactual, observable only in the simulation. | months (0-30) | cml_truth | Simulation (ground truth) |
| Y1 | continuous | Treated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITH training; counterfactual, observable only in the simulation. | months (0-30) | cml_truth | Simulation (ground truth) |
| age | continuous | Age (years) | Jobseeker age in years at programme entry (pre-treatment covariate). | years | cml_data | Simulation |
| dutch_prof | identifier | Dutch language proficiency (0–3) | Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. | 0-3 (ordinal) | cml_data, cml_truth | Simulation |
| edu_years | continuous | Years of education | Completed years of formal education (pre-treatment covariate). | years | cml_data | Simulation |
| female | dummy | Female (1=yes) | Sex indicator, 1 if female else 0 (pre-treatment covariate). | 0/1 | cml_data | Simulation |
| migrant | dummy | Migrant (1=yes) | Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate; a moderator of the effect). | 0/1 | cml_data | Simulation |
| pi_true | continuous | True propensity P(D=1 \| X) | The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover. | 0-1 (probability) | cml_truth | Simulation (ground truth) |
| prior_emp_months | continuous | Prior employment in look-back window (months) | Months employed during the pre-programme look-back window (pre-treatment covariate). | months | cml_data | Simulation |
| tau | continuous | True individual treatment effect (months) | The true per-jobseeker effect of training, τ = Y(1) − Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule. | months | cml_truth | Simulation (ground truth) |
Cross-file variable index
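Which file each variable appears in (● = present).
| Variable | cml_data | cml_truth |
|---|---|---|
| age | ● | |
| edu_years | ● | |
| prior_emp_months | ● | |
| dutch_prof | ● | ● |
| female | ● | |
| migrant | ● | |
| D | ● | |
| Y | ● | |
| Y0 | | ● |
| Y1 | | ● |
| tau | | ● |
| pi_true | | ● |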
Construction & formulas
The post targets three causal estimands of increasing granularity, all under unconfoundedness (selection-on-observables) over the six covariates X:
- ATE = E[Y(1) − Y(0)]: the population-average effect, estimated by DoubleMLIRM (cross-fitted, doubly-robust, random-forest nuisances, 5-fold cross-fitting).
- GATE(z) = E[Y(1) − Y(0) | Z = z] with Z = dutch_prof: the subgroup average, estimated as the within-stratum mean of the doubly-robust pseudo-outcome ψ.
- IATE(x) = E[Y(1) − Y(0) | X = x]: the per-individual effect, estimated by EconML's CausalForestDML.
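Because the data are simulated, the benchmark versions of the first two estimands can be read off the truth file. The sketch below is illustrative only: `true_benchmarks` and its parameters are hypothetical names, the demo frame is made up, and it assumes the benchmark is the mean of the `tau` column. Since Y1 is clipped at 30, the mean of Y1 − Y0 can differ slightly from the mean of `tau`; which of the two the post scores against is not stated here.

```python
import pandas as pd


def true_benchmarks(truth: pd.DataFrame, effect_col: str = "tau",
                    group_col: str = "dutch_prof"):
    """Return (true ATE, true GATE by group) from the ground-truth frame."""
    ate = float(truth[effect_col].mean())                  # population average
    gate = truth.groupby(group_col)[effect_col].mean()     # subgroup averages
    return ate, gate


# Tiny made-up truth frame, for illustration only (not the real data).
demo_truth = pd.DataFrame({"dutch_prof": [0, 0, 3, 3], "tau": [8.0, 7.0, 3.0, 2.0]})
true_ate, true_gate = true_benchmarks(demo_truth)
```

An estimated ATE or GATE can then be compared against `true_ate` and `true_gate`, keeping the truth columns strictly on the scoring side.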
The doubly-robust score at observation i (whose group-mean gives
the GATE) is
ψ_i = g₁(Xᵢ) − g₀(Xᵢ) + Dᵢ·(Yᵢ − g₁(Xᵢ))/m(Xᵢ) − (1−Dᵢ)·(Yᵢ − g₀(Xᵢ))/(1 − m(Xᵢ)),
where g_d(X) = E[Y | D = d, X] is the outcome regression and
m(X) = P(D = 1 | X) is the propensity. The welfare of an assignment
rule under a fixed cost c = 4 months is W = E[ rule(X)·(τ(X) − c) ].
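The score ψ, its group mean (the GATE), and the welfare W can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names are hypothetical, the nuisance predictions g₀, g₁ and m are taken as given arrays (in practice they should come from cross-fitted models), and the numbers in the demo are made up.

```python
import numpy as np
import pandas as pd


def dr_pseudo_outcome(y, d, g0, g1, m):
    """psi_i = g1 - g0 + D (Y - g1) / m - (1 - D) (Y - g0) / (1 - m)."""
    y, d, g0, g1, m = (np.asarray(a, dtype=float) for a in (y, d, g0, g1, m))
    return g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)


def gate_from_scores(psi, z):
    """GATE(z): mean of the pseudo-outcome within each stratum of z."""
    return pd.Series(np.asarray(psi, dtype=float)).groupby(np.asarray(z)).mean()


def welfare(rule, tau, cost=4.0):
    """W = E[rule(X) * (tau(X) - c)] for a 0/1 assignment rule."""
    rule, tau = np.asarray(rule, dtype=float), np.asarray(tau, dtype=float)
    return float(np.mean(rule * (tau - cost)))


# Made-up inputs for four observations (illustration only).
psi = dr_pseudo_outcome(y=[20, 15, 25, 10], d=[1, 0, 1, 0],
                        g0=[14, 14, 20, 12], g1=[19, 18, 24, 15],
                        m=[0.5, 0.5, 0.5, 0.5])
ate_hat = float(psi.mean())                      # ATE estimate: mean of psi
gate_hat = gate_from_scores(psi, z=[0, 0, 3, 3])

tau_demo = np.array([7.0, 2.0, 6.0, 3.0])
w_rule = welfare(tau_demo > 4.0, tau_demo)       # treat only if tau exceeds the cost
w_all = welfare(np.ones(4), tau_demo)            # treat everyone
```

In the demo the selective rule earns a higher welfare than treating everyone, because it avoids jobseekers whose effect is below the cost of 4 months.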
Synthetic data-generating process (simulate_almp). Covariates are
drawn independently: age ~ U(20,60), edu_years ~ N(12,3) clipped to
[6,20], prior_emp_months ~ 60·Beta(2,5), dutch_prof ∈ {0,1,2,3} with
probabilities (0.25, 0.30, 0.30, 0.15), female ~ Bernoulli(0.48),
migrant ~ Bernoulli(0.30). The true propensity is logistic in the
covariates,
logit π = −0.6 + 0.020(40−age) + 0.05(12−edu) + 0.015(30−prior_emp) + 0.30(3−dutch) + 0.20·migrant − 0.10·female,
clipped to [0.05, 0.95], and D ~ Bernoulli(π). The true individual
effect is
τ = 3.0 + 1.5(3 − dutch_prof) + 0.4·migrant − 0.02(age − 40) + N(0, 0.3) — so the
effect is largest for low-Dutch, migrant, younger jobseekers (the policy punchline). The
untreated potential outcome is
Y0 = 12 + 0.20·prior_emp + 0.30·edu + 1.0·dutch_prof − 0.005(age−40)² + N(0, 2.5)
clipped to [0,30], the treated outcome is Y1 = clip(Y0 + τ, 0, 30),
and the observed outcome is Y = D·Y1 + (1−D)·Y0.
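The process above can be sketched in NumPy as follows. This is not the post's `simulate_almp`: the function name `simulate_almp_sketch`, the seed, and the order of the random draws are assumptions, so it will not reproduce the published files value for value. It also assumes the second argument of each N(·, ·) is a standard deviation.

```python
import numpy as np
import pandas as pd


def simulate_almp_sketch(n: int = 5000, seed: int = 0):
    """Draw one synthetic cohort following the formulas in the text."""
    rng = np.random.default_rng(seed)

    # Covariates, drawn independently.
    age = rng.uniform(20, 60, n)
    edu = np.clip(rng.normal(12, 3, n), 6, 20)
    prior = 60 * rng.beta(2, 5, n)
    dutch = rng.choice([0, 1, 2, 3], size=n, p=[0.25, 0.30, 0.30, 0.15])
    female = rng.binomial(1, 0.48, n)
    migrant = rng.binomial(1, 0.30, n)

    # True propensity (logistic, clipped) and treatment.
    logit = (-0.6 + 0.020 * (40 - age) + 0.05 * (12 - edu)
             + 0.015 * (30 - prior) + 0.30 * (3 - dutch)
             + 0.20 * migrant - 0.10 * female)
    pi = np.clip(1 / (1 + np.exp(-logit)), 0.05, 0.95)
    d = rng.binomial(1, pi)

    # True individual effect and potential outcomes.
    tau = (3.0 + 1.5 * (3 - dutch) + 0.4 * migrant
           - 0.02 * (age - 40) + rng.normal(0, 0.3, n))
    y0 = np.clip(12 + 0.20 * prior + 0.30 * edu + 1.0 * dutch
                 - 0.005 * (age - 40) ** 2 + rng.normal(0, 2.5, n), 0, 30)
    y1 = np.clip(y0 + tau, 0, 30)
    y = d * y1 + (1 - d) * y0

    data = pd.DataFrame({"age": age, "edu_years": edu, "prior_emp_months": prior,
                         "dutch_prof": dutch, "female": female,
                         "migrant": migrant, "D": d, "Y": y})
    truth = pd.DataFrame({"Y0": y0, "Y1": y1, "tau": tau,
                          "pi_true": pi, "dutch_prof": dutch})
    return data, truth


sim_data, sim_truth = simulate_almp_sketch()
```

The two returned frames mirror the layout of cml_data (8 columns) and cml_truth (5 columns), with rows aligned by position.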
The datasets
Each dataset has a full variable dictionary and a summary-statistics table with data coverage.
cml_data
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| age (continuous) | Age (years) | Jobseeker age in years at programme entry (pre-treatment covariate). | Drawn age ~ Uniform(20, 60). | years | Simulation | cml_data.csv |
| edu_years (continuous) | Years of education | Completed years of formal education (pre-treatment covariate). | Drawn N(12, 3), clipped to [6, 20]. | years | Simulation | cml_data.csv |
| prior_emp_months (continuous) | Prior employment in look-back window (months) | Months employed during the pre-programme look-back window (pre-treatment covariate). | Drawn 60 · Beta(2, 5). | months | Simulation | cml_data.csv |
| dutch_prof (identifier) | Dutch language proficiency (0–3) | Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. | Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15). | 0-3 (ordinal) | Simulation | cml_data.csv & cml_truth.csv |
| female (dummy) | Female (1=yes) | Sex indicator, 1 if female else 0 (pre-treatment covariate). | Drawn Bernoulli(0.48). | 0/1 | Simulation | cml_data.csv |
| migrant (dummy) | Migrant (1=yes) | Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate; a moderator of the effect). | Drawn Bernoulli(0.30). | 0/1 | Simulation | cml_data.csv |
| D (dummy) | Treatment: received training (1=yes) | Binary training indicator: 1 if the jobseeker received the ALMP training, else 0. | D ~ Bernoulli(pi_true), with pi_true the true logistic propensity in the covariates. | 0/1 | Simulation | cml_data.csv |
| Y (continuous) | Observed outcome: months employed (0–30) | Observed months employed over the 30-month follow-up; the realised potential outcome under the assigned treatment. | Y = D·Y1 + (1−D)·Y0, where Y0/Y1 are the clipped potential outcomes. | months (0-30) | Simulation | cml_data.csv |
Statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| age | 100% | 5,000 | 5,000 | 20.02 | 39.82 | 39.68 | 59.99 | 11.54 |
| edu_years | 100% | 5,000 | 4,870 | 6.00 | 12.02 | 11.94 | 20.00 | 2.95 |
| prior_emp_months | 100% | 5,000 | 5,000 | 0.369 | 16.99 | 15.80 | 54.75 | 9.59 |
| dutch_prof | 100% | 5,000 | 4 | – | – | – | – | – |
| female | 100% | 5,000 | 2 | 0 | 0.492 | 0 | 1.00 | 0.500 |
| migrant | 100% | 5,000 | 2 | 0 | 0.305 | 0 | 1.00 | 0.460 |
| D | 100% | 5,000 | 2 | 0 | 0.528 | 1.00 | 1.00 | 0.499 |
| Y | 100% | 5,000 | 4,778 | 9.81 | 22.68 | 22.81 | 30.00 | 4.18 |
cml_truth
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| Y0 (continuous) | Untreated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITHOUT training; counterfactual, observable only in the simulation. | Y0 = clip(12 + 0.20·prior_emp + 0.30·edu + 1.0·dutch_prof − 0.005(age−40)² + N(0,2.5), 0, 30). | months (0-30) | Simulation (ground truth) | cml_truth.csv |
| Y1 (continuous) | Treated potential outcome (months, truth) | Months employed the jobseeker WOULD have over 30 months WITH training; counterfactual, observable only in the simulation. | Y1 = clip(Y0 + tau, 0, 30). | months (0-30) | Simulation (ground truth) | cml_truth.csv |
| tau (continuous) | True individual treatment effect (months) | The true per-jobseeker effect of training, τ = Y(1) − Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule. | tau = 3.0 + 1.5(3 − dutch_prof) + 0.4·migrant − 0.02(age − 40) + N(0, 0.3). | months | Simulation (ground truth) | cml_truth.csv |
| pi_true (continuous) | True propensity P(D=1 \| X) | The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover. | logistic(−0.6 + 0.020(40−age) + 0.05(12−edu) + 0.015(30−prior_emp) + 0.30(3−dutch) + 0.20·migrant − 0.10·female), clipped to [0.05, 0.95]. | 0-1 (probability) | Simulation (ground truth) | cml_truth.csv |
| dutch_prof (identifier) | Dutch language proficiency (0–3) | Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. | Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15). | 0-3 (ordinal) | Simulation | cml_data.csv & cml_truth.csv |
Statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| Y0 | 100% | 5,000 | 4,982 | 8.83 | 19.66 | 19.61 | 30.00 | 3.47 |
| Y1 | 100% | 5,000 | 4,580 | 13.62 | 25.15 | 25.24 | 30.00 | 3.14 |
| tau | 100% | 5,000 | 5,000 | 2.02 | 5.63 | 5.74 | 8.95 | 1.58 |
| pi_true | 100% | 5,000 | 5,000 | 0.209 | 0.526 | 0.528 | 0.811 | 0.107 |
| dutch_prof | 100% | 5,000 | 4 | – | – | – | – | – |
Known limitations & caveats
- Synthetic data. There is no real cohort behind this tutorial; results are internally consistent with the calibration but are not empirical evidence about any real ALMP.
- Truth columns are not observable predictors. Y0, Y1, tau, and pi_true in cml_truth.csv are knowable only because the data are simulated. They are for benchmarking and welfare scoring only; feeding any of them into an estimator is circular and must not be done.
- Positional join. cml_data.csv and cml_truth.csv share no key column; row i of one matches row i of the other. Preserve row order when merging (e.g. pd.concat([df, truth], axis=1)) and never sort one file independently. Because dutch_prof appears in both files, drop one copy to avoid a duplicate column.
- Easy overlap by construction. True propensities are clipped to [0.05, 0.95] (and the logistic-regression estimate lands inside [0.21, 0.81]), so trimming choices and doubly-robust denominators are not stressed here. Real ALMP cohorts typically have weaker overlap, with propensities closer to 0 or 1, so trimming matters more.
- Treatment share is high (≈53%). Calibrated to keep overlap comfortable in every Dutch-proficiency stratum; do not over-interpret the magnitude of effects as representative of a real programme.
- Unconfoundedness holds by construction. The DGP satisfies selection-on-observables over the six covariates; in a real application this is the strong identifying assumption that justifies DoubleML and CausalForestDML over a naive comparison.
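The overlap caveat can be checked directly on any propensity vector, true or estimated. The sketch below is illustrative: `overlap_summary` is a hypothetical helper, the trimming bounds of 0.05 and 0.95 are an assumption chosen to match the clipping bounds in the text, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def overlap_summary(propensity, group, lower: float = 0.05, upper: float = 0.95):
    """Per-group propensity range and share of rows outside [lower, upper]."""
    frame = pd.DataFrame({"p": np.asarray(propensity, dtype=float),
                          "g": np.asarray(group)})
    frame["outside"] = (frame["p"] < lower) | (frame["p"] > upper)
    return frame.groupby("g").agg(
        n=("p", "size"),
        p_min=("p", "min"),
        p_max=("p", "max"),
        share_outside=("outside", "mean"),
    )


# Made-up propensities for two strata (illustration only).
demo = overlap_summary(propensity=[0.30, 0.60, 0.97, 0.50], group=[0, 0, 3, 3])
```

On the tutorial data one would pass an estimated propensity and dutch_prof; a share_outside near zero in every stratum is what "easy overlap" looks like, while a real cohort may show strata where a sizeable share of rows would be trimmed.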