Downloads
Each dataset is available as a labeled Stata .dta file and as its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| lalonde_dowhy | individual (cross-section) | 445 × 10 | lalonde_dowhy.dta | lalonde_dowhy.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
use "${BASE}lalonde_dowhy.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
df = pd.read_stata(BASE + "lalonde_dowhy.dta")
# load every dataset at once
files = ["lalonde_dowhy"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "lalonde_dowhy.dta", "lalonde_dowhy.dta")
df, meta = pyreadstat.read_dta("lalonde_dowhy.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
df <- read_dta(paste0(BASE, "lalonde_dowhy.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that estimates the causal effect of a job-training program on earnings using DoWhy's four-step framework (Model → Identify → Estimate → Refute). The file is the LaLonde sample from the U.S. National Supported Work (NSW) Demonstration, a 1970s randomized employment program for disadvantaged workers (LaLonde 1986; the experimental subsample popularized by Dehejia & Wahba 1999). It records 445 participants (185 randomly assigned to job training and 260 to control) with eight pre-treatment covariates and real earnings in 1978 (re78) as the outcome. The post encodes the eight covariates as common causes in a causal graph, identifies the backdoor estimand, and estimates the average treatment effect with five methods (regression adjustment, IPW, doubly robust AIPW, propensity-score stratification, and propensity-score matching), all of which cluster around a 34–38% earnings gain over the control mean. These are real data from a field experiment, not a simulation.
lalonde_dowhy is a single cross-sectional table — one row per study participant (445 rows), with no panel or time dimension. treat is the randomized assignment, re78 the post-program outcome, and the remaining eight columns are pre-treatment covariates (including prior earnings in 1974 and 1975). It is the verbatim DoWhy lalonde_dataset() export used by the post.
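As a quick check of the layout described above, here is a minimal sketch that verifies the ten expected columns and counts the two arms. It uses only the Python standard library. The helper name `count_arms` and the three-row sample are hypothetical, and the sketch assumes `treat` is written as 0/1 (or True/False) text in the CSV.

```python
import csv
import io

# The ten column names listed in the variable dictionary.
EXPECTED_COLUMNS = {
    "treat", "age", "educ", "black", "hisp",
    "married", "nodegr", "re74", "re75", "re78",
}


def count_arms(csv_text):
    """Return (n_treated, n_control) after checking the expected columns exist."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    treated = control = 0
    for row in reader:
        # Assumption: treat is written as 0/1 (or True/False) text in the CSV.
        if row["treat"].strip().lower() in {"1", "1.0", "true"}:
            treated += 1
        else:
            control += 1
    return treated, control


# Hypothetical three-row sample in the same layout (values are made up).
sample = (
    "treat,age,educ,black,hisp,married,nodegr,re74,re75,re78\n"
    "1,25,10,1,0,0,1,0,0,5000\n"
    "0,30,12,0,0,1,0,1200,800,0\n"
    "1,22,9,1,0,0,1,0,0,0\n"
)
print(count_arms(sample))  # (2, 1)
```

Run against the text of the real lalonde_dowhy.csv, the same helper should return (185, 260) if the file matches the description above.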
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| LaLonde (1986) | Original NSW Demonstration evaluation data and the benchmark research design | LaLonde, R. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4), 604-620. https://www.jstor.org/stable/1806062 |
| Dehejia & Wahba (1999, 2002) | The experimental NSW subsample (445 obs) popularized for causal-inference benchmarking | Dehejia, R. & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. JASA, 94(448), 1053-1062. https://doi.org/10.1080/01621459.1999.10473858 |
| DoWhy (Sharma & Kiciman 2020) | Library shipping the dataset (lalonde_dataset()) and the four-step estimation/refutation methods | Sharma, A. & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. arXiv:2011.04216. https://arxiv.org/abs/2011.04216 |
| Method references | Estimators and concepts | Horvitz & Thompson (1952); Rosenbaum & Rubin (1983); Robins, Rotnitzky & Zhao (1994); Cochran (1968). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset [Data set]. https://carlos-mendez.org/post/python_dowhy/
LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604-620.
Dehejia, R., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053-1062.
Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216.
BibTeX
@misc{mendez2026pythondowhy,
author = {Mendez, Carlos},
title = {Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_dowhy/}},
note = {Data set}
}
@article{lalonde1986evaluating,
author = {LaLonde, Robert J.},
title = {Evaluating the Econometric Evaluations of Training Programs with Experimental Data},
journal = {American Economic Review},
volume = {76}, number = {4}, pages = {604--620}, year = {1986}
}
@article{dehejia1999causal,
author = {Dehejia, Rajeev H. and Wahba, Sadek},
title = {Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs},
journal = {Journal of the American Statistical Association},
volume = {94}, number = {448}, pages = {1053--1062}, year = {1999}
}
@article{sharma2020dowhy,
author = {Sharma, Amit and Kiciman, Emre},
title = {{DoWhy}: An End-to-End Library for Causal Inference},
journal = {arXiv preprint arXiv:2011.04216}, year = {2020}
}
Variable explorer: all 10 variables
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| age | continuous | Age (years) | Participant age at baseline, in years. | years | lalonde_dowhy | NSW Demonstration baseline |
| black | dummy | Black (1 = yes) | Race indicator: 1 if the participant is Black, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| educ | continuous | Years of education | Completed years of schooling at baseline. | years | lalonde_dowhy | NSW Demonstration baseline |
| hisp | dummy | Hispanic (1 = yes) | Ethnicity indicator: 1 if the participant is Hispanic, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| married | dummy | Married (1 = yes) | Marital-status indicator: 1 if married at baseline, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| nodegr | dummy | No high-school degree (1 = yes) | 1 if the participant lacks a high-school diploma, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| re74 | continuous | Real earnings in 1974 (US$) | Pre-program real annual earnings in 1974 (prior earnings, a key confounder). | US$ (1974) | lalonde_dowhy | NSW Demonstration baseline |
| re75 | continuous | Real earnings in 1975 (US$) | Pre-program real annual earnings in 1975 (prior earnings, a key confounder). | US$ (1975) | lalonde_dowhy | NSW Demonstration baseline |
| re78 | continuous | Real earnings in 1978 (US$) | Post-program real annual earnings in 1978, the outcome variable. | US$ (1978) | lalonde_dowhy | NSW Demonstration follow-up |
| treat | dummy | Job-training assignment (1 = trained) | Randomized treatment indicator: 1 if assigned to NSW job training, 0 if control. | 0/1 | lalonde_dowhy | NSW Demonstration (LaLonde 1986) |
Cross-file variable index
All 10 variables appear in the single file lalonde_dowhy.
Construction & formulas
The post applies DoWhy's four-step framework to this data, targeting the average treatment effect (ATE), ATE = E[Y(1) − Y(0)], of `treat` on `re78`:
1. Model: encode a DAG in which the eight covariates (`age`, `educ`, `black`, `hisp`, `married`, `nodegr`, `re74`, `re75`) are common causes of both `treat` and `re78`; `treat → re78` is the effect of interest.
2. Identify: the backdoor criterion returns the adjustment estimand `d/d[treat] E[re78 | age, educ, black, hisp, married, nodegr, re74, re75]` under the unconfoundedness assumption.
3. Estimate: five estimators of the same ATE: regression adjustment (models `E[Y|X,T]`); IPW, `τ = (1/n) Σ [ T·Y / e(X) − (1−T)·Y / (1−e(X)) ]`, with propensity score `e(X) = P(T=1|X)`; doubly robust AIPW, the regression estimate plus an IPW-weighted residual correction (consistent if either the outcome or the propensity model is correct); PS stratification (5 strata); and PS matching (nearest-neighbour, shifting the estimand toward the ATT).
4. Refute: placebo-treatment, random-common-cause, and data-subset tests stress the estimate (the placebo collapses the effect from ~$1,676 to ~$62).
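To make the IPW and AIPW formulas in step 3 concrete, here is a minimal pure-Python sketch. It is not the post's or DoWhy's implementation. The function names `ipw_ate` and `aipw_ate` are hypothetical, the propensity scores and outcome-model predictions are passed in rather than fitted, and the four-row toy data are made up (not the NSW sample).

```python
def ipw_ate(t, y, e):
    """IPW estimate: (1/n) * sum( T*Y/e(X) - (1-T)*Y/(1-e(X)) )."""
    total = 0.0
    for ti, yi, ei in zip(t, y, e):
        total += ti * yi / ei - (1 - ti) * yi / (1 - ei)
    return total / len(y)


def aipw_ate(t, y, e, mu1, mu0):
    """Regression estimate (mu1 - mu0) plus an IPW-weighted residual correction."""
    total = 0.0
    for ti, yi, ei, m1, m0 in zip(t, y, e, mu1, mu0):
        regression_part = m1 - m0
        correction = ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
        total += regression_part + correction
    return total / len(y)


# Toy data (hypothetical): 2 treated, 2 control, constant propensity 0.5,
# as one would expect under 50/50 randomization.
t = [1, 1, 0, 0]
y = [10.0, 6.0, 4.0, 2.0]
e = [0.5, 0.5, 0.5, 0.5]

ipw = ipw_ate(t, y, e)
# Outcome model equal to the group means (8 for treated, 3 for control).
aipw_good = aipw_ate(t, y, e, mu1=[8.0] * 4, mu0=[3.0] * 4)
# Deliberately wrong outcome model; the propensity scores are still correct.
aipw_bad = aipw_ate(t, y, e, mu1=[0.0] * 4, mu0=[0.0] * 4)
print(ipw, aipw_good, aipw_bad)  # 5.0 5.0 5.0
```

All three numbers equal the raw difference in group means (8 − 3 = 5). The last call illustrates the doubly robust property on this toy example: with a useless outcome model, the residual correction alone recovers the IPW answer because the propensity scores are right.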
The dataset itself contains no constructed variables: every column is an observed field from the NSW evaluation (assignment, demographics, prior/post earnings). The quantities above are computed by the analysis code, not stored in the file.
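The idea behind the placebo-treatment refutation can be sketched in a few lines: replace the real assignment with a shuffled copy and check that the estimated effect falls toward zero. This is an illustration only. It uses a simple difference in means as the estimator, the helper names `diff_in_means` and `placebo_effect` are hypothetical, the toy data are made up, and DoWhy's own refuters may differ in detail.

```python
import random
from statistics import mean


def diff_in_means(t, y):
    """Mean outcome of the treated minus mean outcome of the controls."""
    treated = [yi for ti, yi in zip(t, y) if ti == 1]
    control = [yi for ti, yi in zip(t, y) if ti == 0]
    return mean(treated) - mean(control)


def placebo_effect(t, y, n_draws=500, seed=0):
    """Average estimate after repeatedly shuffling the treatment labels."""
    rng = random.Random(seed)
    shuffled = list(t)
    draws = []
    for _ in range(n_draws):
        rng.shuffle(shuffled)  # break any link between treatment and outcome
        draws.append(diff_in_means(shuffled, y))
    return mean(draws)


# Toy data (hypothetical): 40 units, alternating assignment, built-in effect of 5.
t = [i % 2 for i in range(40)]
y = [float(i % 7) + 5.0 * ti for i, ti in enumerate(t)]

original = diff_in_means(t, y)
placebo = placebo_effect(t, y)
print(f"original: {original:.2f}, placebo: {placebo:.2f}")
```

An estimate that stays large under a placebo treatment would suggest the original result is picking up something other than the treatment.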
The datasets
Switch datasets with the tabs. Each shows the full variable dictionary plus a sortable statistics table with mini distributions and data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| treat (dummy) | Job-training assignment (1 = trained) | Randomized treatment indicator: 1 if assigned to NSW job training, 0 if control. | From NSW random assignment; cast from boolean to integer in the post. | 0/1 | NSW Demonstration (LaLonde 1986) | 185 treated / 260 control |
| re78 (continuous) | Real earnings in 1978 (US$) | Post-program real annual earnings in 1978, the outcome variable. | Observed earnings; real (inflation-adjusted) US dollars. | US$ (1978) | NSW Demonstration follow-up | all 445 |
| age (continuous) | Age (years) | Participant age at baseline, in years. | Observed at enrollment. | years | NSW Demonstration baseline | all 445 |
| educ (continuous) | Years of education | Completed years of schooling at baseline. | Observed at enrollment. | years | NSW Demonstration baseline | all 445 |
| black (dummy) | Black (1 = yes) | Race indicator: 1 if the participant is Black, else 0. | Observed demographic at baseline. | 0/1 | NSW Demonstration baseline | all 445 |
| hisp (dummy) | Hispanic (1 = yes) | Ethnicity indicator: 1 if the participant is Hispanic, else 0. | Observed demographic at baseline. | 0/1 | NSW Demonstration baseline | all 445 |
| married (dummy) | Married (1 = yes) | Marital-status indicator: 1 if married at baseline, else 0. | Observed demographic at baseline. | 0/1 | NSW Demonstration baseline | all 445 |
| nodegr (dummy) | No high-school degree (1 = yes) | 1 if the participant lacks a high-school diploma, else 0. | Observed at baseline (no high-school degree). | 0/1 | NSW Demonstration baseline | all 445 |
| re74 (continuous) | Real earnings in 1974 (US$) | Pre-program real annual earnings in 1974 (prior earnings, a key confounder). | Observed earnings; real US dollars. | US$ (1974) | NSW Demonstration baseline | all 445 |
| re75 (continuous) | Real earnings in 1975 (US$) | Pre-program real annual earnings in 1975 (prior earnings, a key confounder). | Observed earnings; real US dollars. | US$ (1975) | NSW Demonstration baseline | all 445 |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| treat | 100% | 445 | 2 | 0 | 0.416 | 0 | 1.00 | 0.493 |
| re78 | 100% | 445 | 308 | 0 | 5,300.8 | 3,701.8 | 60,308 | 6,631.5 |
| age | 100% | 445 | 34 | 17.00 | 25.37 | 24.00 | 55.00 | 7.10 |
| educ | 100% | 445 | 14 | 3.00 | 10.20 | 10.00 | 16.00 | 1.79 |
| black | 100% | 445 | 2 | 0 | 0.834 | 1.00 | 1.00 | 0.373 |
| hisp | 100% | 445 | 2 | 0 | 0.088 | 0 | 1.00 | 0.283 |
| married | 100% | 445 | 2 | 0 | 0.169 | 0 | 1.00 | 0.375 |
| nodegr | 100% | 445 | 2 | 0 | 0.782 | 1.00 | 1.00 | 0.413 |
| re74 | 100% | 445 | 115 | 0 | 2,102.3 | 0 | 39,571 | 5,363.6 |
| re75 | 100% | 445 | 155 | 0 | 1,377.1 | 0 | 25,142 | 3,151.0 |
Known limitations & caveats
- Small sample. Only 445 observations (185 treated / 260 control); propensity-score methods can suffer from poor overlap and estimates carry high variance.
- Earnings are right-skewed with mass at zero. `re74`, `re75`, and `re78` have a large spike at $0 (no earnings) and a long right tail, so means exceed medians substantially.
- Unconfoundedness. The backdoor estimand assumes the eight covariates capture all confounding; this is credible here because assignment was randomized, but the same data is often used to study what happens when it is not.
- ATE vs ATT. Four estimators target the ATE; propensity-score matching discards unmatched controls and shifts the estimand toward the ATT, so its estimate answers a slightly different question.
- Experimental subsample. This is the randomized NSW experimental sample (Dehejia-Wahba), not the larger observational PSID/CPS comparison versions; results are not directly comparable to those non-experimental variants.
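The skewness caveat above is easy to check on any of the earnings columns. The following minimal sketch reports the share of exact zeros next to the mean and median. The helper name `zero_share_summary` is hypothetical and the six earnings values are made up for illustration (not taken from the file).

```python
from statistics import mean, median


def zero_share_summary(values):
    """Share of exact zeros, plus mean and median, for one earnings column."""
    values = list(values)
    return {
        "share_zero": sum(1 for v in values if v == 0) / len(values),
        "mean": mean(values),
        "median": median(values),
    }


# Hypothetical earnings: half the units earn nothing, one earns a lot.
earnings = [0.0, 0.0, 0.0, 1200.0, 3500.0, 25000.0]
summary = zero_share_summary(earnings)
print(summary)  # {'share_zero': 0.5, 'mean': 4950.0, 'median': 600.0}
```

A mean far above the median, together with a large zero share, is the pattern the statistics table shows for `re74` and `re75`, whose medians are 0.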