Data dictionary · Causal Inference with DoWhy: The Lalonde Job-Training Data

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`lalonde_dowhy`	individual (cross-section)	445 × 10	lalonde_dowhy.dta	lalonde_dowhy.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
use "${BASE}lalonde_dowhy.dta", clear
describe
notes

Python

# Colab/Jupyter only: the leading "!" runs a shell command (in a terminal: pip install pyreadstat)
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
df = pd.read_stata(BASE + "lalonde_dowhy.dta")

# load every dataset at once
files = ["lalonde_dowhy"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "lalonde_dowhy.dta", "lalonde_dowhy.dta")
df, meta = pyreadstat.read_dta("lalonde_dowhy.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
df <- read_dta(paste0(BASE, "lalonde_dowhy.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that estimates the causal effect of a job-training program on earnings using DoWhy's four-step framework (Model → Identify → Estimate → Refute). The file is the Lalonde sample from the U.S. National Supported Work (NSW) Demonstration, a 1970s randomized employment program for disadvantaged workers (LaLonde 1986; the experimental subsample was popularized by Dehejia & Wahba 1999). It records 445 participants (185 randomly assigned to job training and 260 to control), eight pre-treatment covariates, and real earnings in 1978 (re78) as the outcome. The post encodes the eight covariates as common causes in a causal graph, identifies the backdoor estimand, and estimates the average treatment effect with five methods: regression adjustment, IPW, doubly robust AIPW, propensity-score stratification, and propensity-score matching. All five estimates cluster around an earnings gain of roughly 34–38% relative to the control-group mean. These are real data from a field experiment, not a simulation.
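Because assignment was randomized, a plain difference in mean re78 between the two arms is a natural benchmark, and dividing it by the control mean gives the kind of relative gain quoted above. The sketch below is illustrative only: the function name diff_in_means is hypothetical and the numbers are toy values, not values from lalonde_dowhy.

```python
from statistics import mean

def diff_in_means(outcome, treated):
    """Return (control mean, treated mean, difference, gain relative to control)."""
    y1 = [y for y, t in zip(outcome, treated) if t == 1]
    y0 = [y for y, t in zip(outcome, treated) if t == 0]
    m1, m0 = mean(y1), mean(y0)
    diff = m1 - m0
    return m0, m1, diff, diff / m0

# Toy numbers for illustration only (NOT taken from the dataset)
toy_re78 = [0.0, 4000.0, 8000.0, 0.0, 3000.0, 6000.0]
toy_treat = [1, 1, 1, 0, 0, 0]

m0, m1, diff, rel = diff_in_means(toy_re78, toy_treat)
print(f"control mean={m0:.0f}, treated mean={m1:.0f}, diff={diff:.0f}, gain={rel:.1%}")
```

On the real file, the same call would be diff_in_means(df["re78"], df["treat"]) after loading df as shown in the loading snippets.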

One file, one cross-section. lalonde_dowhy is a single cross-sectional table — one row per study participant (445 rows), with no panel or time dimension. treat is the randomized assignment, re78 the post-program outcome, and the remaining eight columns are pre-treatment covariates (including prior earnings in 1974 and 1975). It is the verbatim DoWhy lalonde_dataset() export used by the post.
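A quick structural check can confirm that a loaded copy matches this description. The sketch below assumes pandas is installed; describe_design and EXPECTED_COLUMNS are hypothetical names introduced here, and the three-row frame is a toy stand-in so the example runs without a download.

```python
import pandas as pd

# The ten columns documented in this data dictionary
EXPECTED_COLUMNS = ["treat", "age", "educ", "black", "hisp",
                    "married", "nodegr", "re74", "re75", "re78"]

def describe_design(df):
    """Summarize the cross-section: row count, arm sizes, missing columns."""
    n_treated = int((df["treat"] == 1).sum())
    return {
        "rows": len(df),
        "treated": n_treated,
        "control": len(df) - n_treated,
        "missing_columns": sorted(set(EXPECTED_COLUMNS) - set(df.columns)),
    }

# Toy frame for illustration only (NOT the real data)
toy = pd.DataFrame({"treat": [1, 0, 0], "re78": [0.0, 1200.0, 300.0]})
summary = describe_design(toy)
print(summary)
```

Applied to the real file, the documented counts imply rows = 445, treated = 185, control = 260, and an empty missing_columns list.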

Data sources

Source	Provides	Reference / URL
LaLonde (1986)	Original NSW Demonstration evaluation data and the benchmark research design	LaLonde, R. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4), 604-620. https://www.jstor.org/stable/1806062
Dehejia & Wahba (1999, 2002)	The experimental NSW subsample (445 obs) popularized for causal-inference benchmarking	Dehejia, R. & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. JASA, 94(448), 1053-1062. https://doi.org/10.1080/01621459.1999.10473858
DoWhy (Sharma & Kiciman 2020)	Library shipping the dataset (lalonde_dataset()) and the four-step estimation/refutation methods	Sharma, A. & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. arXiv:2011.04216. https://arxiv.org/abs/2011.04216
Method references	Estimators and concepts	Horvitz & Thompson (1952); Rosenbaum & Rubin (1983); Robins, Rotnitzky & Zhao (1994); Cochran (1968).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset [Data set]. https://carlos-mendez.org/post/python_dowhy/

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604-620.

Dehejia, R., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053-1062.

Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216.

BibTeX

@misc{mendez2026pythondowhy,
  author       = {Mendez, Carlos},
  title        = {Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_dowhy/}},
  note         = {Data set}
}

@article{lalonde1986evaluating,
  author  = {LaLonde, Robert J.},
  title   = {Evaluating the Econometric Evaluations of Training Programs with Experimental Data},
  journal = {American Economic Review},
  volume  = {76}, number = {4}, pages = {604--620}, year = {1986}
}
@article{dehejia1999causal,
  author  = {Dehejia, Rajeev H. and Wahba, Sadek},
  title   = {Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs},
  journal = {Journal of the American Statistical Association},
  volume  = {94}, number = {448}, pages = {1053--1062}, year = {1999}
}
@article{sharma2020dowhy,
  author  = {Sharma, Amit and Kiciman, Emre},
  title   = {{DoWhy}: An End-to-End Library for Causal Inference},
  journal = {arXiv preprint arXiv:2011.04216}, year = {2020}
}

Variable explorer: all 10 variables

Variable	Type	Label	Definition	Units	In files	Source
`age`	continuous	Age (years)	Participant age at baseline, in years.	years	lalonde_dowhy	NSW Demonstration baseline
`black`	dummy	Black (1 = yes)	Race indicator: 1 if the participant is Black, else 0.	0/1	lalonde_dowhy	NSW Demonstration baseline
`educ`	continuous	Years of education	Completed years of schooling at baseline.	years	lalonde_dowhy	NSW Demonstration baseline
`hisp`	dummy	Hispanic (1 = yes)	Ethnicity indicator: 1 if the participant is Hispanic, else 0.	0/1	lalonde_dowhy	NSW Demonstration baseline
`married`	dummy	Married (1 = yes)	Marital-status indicator: 1 if married at baseline, else 0.	0/1	lalonde_dowhy	NSW Demonstration baseline
`nodegr`	dummy	No high-school degree (1 = yes)	1 if the participant lacks a high-school diploma, else 0.	0/1	lalonde_dowhy	NSW Demonstration baseline
`re74`	continuous	Real earnings in 1974 (US$)	Pre-program real annual earnings in 1974 (prior earnings, a key confounder).	US$ (1974)	lalonde_dowhy	NSW Demonstration baseline
`re75`	continuous	Real earnings in 1975 (US$)	Pre-program real annual earnings in 1975 (prior earnings, a key confounder).	US$ (1975)	lalonde_dowhy	NSW Demonstration baseline
`re78`	continuous	Real earnings in 1978 (US$)	Post-program real annual earnings in 1978 — the outcome variable.	US$ (1978)	lalonde_dowhy	NSW Demonstration follow-up
`treat`	dummy	Job-training assignment (1 = trained)	Randomized treatment indicator: 1 if assigned to NSW job training, 0 if control.	0/1	lalonde_dowhy	NSW Demonstration (LaLonde 1986)

Cross-file variable index

Which file each variable appears in (● = present).

Variable	lalonde_dowhy
`age`	●
`black`	●
`educ`	●
`hisp`	●
`married`	●
`nodegr`	●
`re74`	●
`re75`	●
`re78`	●
`treat`	●

Construction & formulas

The post applies DoWhy's four-step framework to this data, targeting the average treatment effect (ATE), ATE = E[Y(1) − Y(0)], of treat on re78:

1. Model — encode a DAG in which the eight covariates (age, educ, black, hisp, married, nodegr, re74, re75) are common causes of both treat and re78; treat → re78 is the effect of interest.
2. Identify — the backdoor criterion returns the adjustment estimand d/d[treat] E[re78 | age, educ, black, hisp, married, nodegr, re74, re75] under the unconfoundedness assumption.
3. Estimate — five estimators of the same ATE: regression adjustment (models E[Y|X,T]); IPW τ = (1/n) Σ [ T·Y / e(X) − (1−T)·Y / (1−e(X)) ] with propensity score e(X) = P(T=1|X); doubly robust AIPW = regression estimate plus an IPW-weighted residual correction (consistent if either the outcome or the propensity model is correct); PS stratification (5 strata); and PS matching (nearest-neighbour, shifting the estimand toward the ATT).
4. Refute — placebo-treatment, random-common-cause, and data-subset tests stress the estimate (the placebo collapses the effect from ~$1,676 to ~$62).
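The IPW formula in step 3 can be written out directly. The sketch below is a minimal pure-Python version, not the post's or DoWhy's code: the function name ipw_ate and the toy inputs are hypothetical, and the propensity scores are passed in rather than estimated.

```python
def ipw_ate(treat, outcome, pscore):
    """Inverse-probability-weighted estimate of the ATE.

    Implements (1/n) * sum( T*Y/e(X) - (1-T)*Y/(1-e(X)) ).
    """
    n = len(outcome)
    total = 0.0
    for t, y, e in zip(treat, outcome, pscore):
        if not 0.0 < e < 1.0:
            # Scores of exactly 0 or 1 make the weights undefined (no overlap)
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / n

# Toy inputs for illustration only (NOT taken from the dataset)
toy_treat = [1, 1, 0, 0]
toy_y = [10.0, 6.0, 4.0, 2.0]
toy_e = [0.5, 0.5, 0.5, 0.5]   # assumed known, constant propensity score

ate_hat = ipw_ate(toy_treat, toy_y, toy_e)
print(ate_hat)  # 5.0
```

With a constant propensity of 0.5 and equal-sized arms, the IPW estimate coincides with the plain difference in means (8 − 3 = 5), which makes a handy sanity check. In practice e(X) would be estimated, for example with a logistic regression of treat on the eight covariates.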

The dataset itself contains no constructed variables: every column is an observed field from the NSW evaluation (assignment, demographics, prior/post earnings). The quantities above are computed by the analysis code, not stored in the file.
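The idea behind the placebo-treatment test in the Refute step can be sketched without any library: shuffle the treatment labels so they carry no information about the outcome, re-estimate, and check that the effect collapses toward zero. This is a simplified illustration in the spirit of that test, not DoWhy's implementation; the function names, the number of simulations, and the toy data (built with an effect of exactly 5) are assumptions made here.

```python
import random
from statistics import mean

def diff_in_means(treat, outcome):
    """Treated mean minus control mean."""
    y1 = [y for t, y in zip(treat, outcome) if t == 1]
    y0 = [y for t, y in zip(treat, outcome) if t == 0]
    return mean(y1) - mean(y0)

def placebo_effect(treat, outcome, n_sims=200, seed=0):
    """Average estimate after replacing the treatment with shuffled copies."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        fake = list(treat)
        rng.shuffle(fake)  # breaks any link between treatment and outcome
        estimates.append(diff_in_means(fake, outcome))
    return mean(estimates)

# Toy data for illustration only: treated outcomes are shifted up by exactly 5
toy_treat = [1] * 50 + [0] * 50
toy_y = [5.0 + (i % 5) for i in range(50)] + [float(i % 5) for i in range(50)]

original = diff_in_means(toy_treat, toy_y)
placebo = placebo_effect(toy_treat, toy_y)
print(f"original={original:.2f}, placebo={placebo:.2f}")
```

A placebo estimate close to zero next to a clearly non-zero original estimate is the pattern reported above, where the effect drops from about $1,676 to about $62.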

The datasets

The dataset is shown with its full variable dictionary plus a statistics table with data coverage.

lalonde_dowhy · individual (cross-section) · 445 × 10 · covariates measured pre-1976; re74/re75 = 1974/1975 earnings; re78 = 1978 earnings · 445 participants (185 treated, 260 control)

Row key: one row per participant (no explicit id column) · Purpose: estimate the ATE of NSW job training on 1978 earnings via DoWhy (Model/Identify/Estimate/Refute).

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`treat` dummy	Job-training assignment (1 = trained)	Randomized treatment indicator: 1 if assigned to NSW job training, 0 if control.	From NSW random assignment; cast from boolean to integer in the post.	0/1	NSW Demonstration (LaLonde 1986)	185 treated / 260 control
`re78` continuous	Real earnings in 1978 (US$)	Post-program real annual earnings in 1978 — the outcome variable.	Observed earnings; real (inflation-adjusted) US dollars.	US$ (1978)	NSW Demonstration follow-up	all 445
`age` continuous	Age (years)	Participant age at baseline, in years.	Observed at enrollment.	years	NSW Demonstration baseline	all 445
`educ` continuous	Years of education	Completed years of schooling at baseline.	Observed at enrollment.	years	NSW Demonstration baseline	all 445
`black` dummy	Black (1 = yes)	Race indicator: 1 if the participant is Black, else 0.	Observed demographic at baseline.	0/1	NSW Demonstration baseline	all 445
`hisp` dummy	Hispanic (1 = yes)	Ethnicity indicator: 1 if the participant is Hispanic, else 0.	Observed demographic at baseline.	0/1	NSW Demonstration baseline	all 445
`married` dummy	Married (1 = yes)	Marital-status indicator: 1 if married at baseline, else 0.	Observed demographic at baseline.	0/1	NSW Demonstration baseline	all 445
`nodegr` dummy	No high-school degree (1 = yes)	1 if the participant lacks a high-school diploma, else 0.	Observed at baseline (no high-school degree).	0/1	NSW Demonstration baseline	all 445
`re74` continuous	Real earnings in 1974 (US$)	Pre-program real annual earnings in 1974 (prior earnings, a key confounder).	Observed earnings; real US dollars.	US$ (1974)	NSW Demonstration baseline	all 445
`re75` continuous	Real earnings in 1975 (US$)	Pre-program real annual earnings in 1975 (prior earnings, a key confounder).	Observed earnings; real US dollars.	US$ (1975)	NSW Demonstration baseline	all 445

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`treat`	100%	445	2	0	0.416	0	1.00	0.493
`re78`	100%	445	308	0	5,300.8	3,701.8	60,308	6,631.5
`age`	100%	445	34	17.00	25.37	24.00	55.00	7.10
`educ`	100%	445	14	3.00	10.20	10.00	16.00	1.79
`black`	100%	445	2	0	0.834	1.00	1.00	0.373
`hisp`	100%	445	2	0	0.088	0	1.00	0.283
`married`	100%	445	2	0	0.169	0	1.00	0.375
`nodegr`	100%	445	2	0	0.782	1.00	1.00	0.413
`re74`	100%	445	115	0	2,102.3	0	39,571	5,363.6
`re75`	100%	445	155	0	1,377.1	0	25,142	3,151.0
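A table with these columns can be rebuilt from a loaded DataFrame in a few lines of pandas. The sketch below is one possible way to do it, not the code that produced the table: the function name summarize is hypothetical, the frame is a toy stand-in, and the SD uses the pandas default (sample SD, ddof=1), which is an assumption since the table does not state its convention.

```python
import pandas as pd

def summarize(df):
    """One row per variable: coverage, N, distinct, min, mean, median, max, SD."""
    return pd.DataFrame({
        "Coverage": df.notna().mean(),  # share of non-missing rows
        "N": df.count(),                # number of non-missing rows
        "Distinct": df.nunique(),
        "Min": df.min(),
        "Mean": df.mean(),
        "Median": df.median(),
        "Max": df.max(),
        "SD": df.std(),                 # sample SD (ddof=1), an assumption
    })

# Toy frame for illustration only (NOT the real data)
toy = pd.DataFrame({"treat": [1, 0, 1, 0],
                    "re78": [0.0, 0.0, 4000.0, 8000.0]})
stats = summarize(toy)
print(stats.round(3))
```

Calling summarize(df) on the loaded lalonde_dowhy frame should give one row per variable in the same layout as the table above.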

Known limitations & caveats

Small sample. Only 445 observations (185 treated / 260 control); propensity-score methods can suffer from poor overlap and estimates carry high variance.
Earnings are right-skewed with mass at zero. re74, re75, and re78 have a large spike at $0 (no earnings) and a long right tail, so means exceed medians substantially.
Unconfoundedness. The backdoor estimand assumes the eight covariates capture all confounding. That assumption is credible here because assignment was randomized, but the same data are often used to study what happens when it does not hold.
ATE vs ATT. Four estimators target the ATE; propensity-score matching discards unmatched controls and shifts the estimand toward the ATT, so its estimate answers a slightly different question.
Experimental subsample. This is the randomized NSW experimental sample (Dehejia-Wahba), not the larger observational PSID/CPS comparison versions; results are not directly comparable to those non-experimental variants.
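The earnings caveat above is easy to quantify once the file is loaded. The sketch below reports the share of exact zeros next to the mean and median for each earnings column; zero_mass_report is a hypothetical helper name and the frame holds toy values, not the real data.

```python
import pandas as pd

def zero_mass_report(df, columns):
    """Share of exact zeros plus mean and median for each listed column."""
    rows = []
    for col in columns:
        s = df[col]
        rows.append({
            "variable": col,
            "share_zero": float((s == 0).mean()),
            "mean": float(s.mean()),
            "median": float(s.median()),
        })
    return pd.DataFrame(rows).set_index("variable")

# Toy frame for illustration only (NOT the real data)
toy = pd.DataFrame({
    "re74": [0.0, 0.0, 0.0, 1000.0, 9000.0],
    "re78": [0.0, 0.0, 2000.0, 3000.0, 15000.0],
})
report = zero_mass_report(toy, ["re74", "re78"])
print(report)
```

In the statistics table, re74 and re75 both have a median of 0, which already implies that at least half of the participants report zero earnings in those years; this report makes the exact share visible.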