Interactive data dictionary

Causal Inference with DoWhy: The Lalonde Job-Training Data

The NSW Demonstration evaluation sample used to estimate the average treatment effect (ATE) of job training on earnings.

1 dataset · 10 variables · 445 participants · assignment: treat = 185 / control = 260

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Size (rows × columns) | Stata | Source |
| --- | --- | --- | --- | --- |
| lalonde_dowhy | individual (cross-section) | 445 × 10 | lalonde_dowhy.dta | lalonde_dowhy.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
use "${BASE}lalonde_dowhy.dta", clear
describe
notes

Python

# Notebook only (Colab/Jupyter): the leading ! runs a shell command
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
df = pd.read_stata(BASE + "lalonde_dowhy.dta")

# load every dataset at once
files = ["lalonde_dowhy"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "lalonde_dowhy.dta", "lalonde_dowhy.dta")
df, meta = pyreadstat.read_dta("lalonde_dowhy.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy/data/"
df <- read_dta(paste0(BASE, "lalonde_dowhy.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that estimates the causal effect of a job-training program on earnings using DoWhy's four-step framework (Model → Identify → Estimate → Refute). The file is the Lalonde sample from the U.S. National Supported Work (NSW) Demonstration, a 1970s randomized employment program for disadvantaged workers (LaLonde 1986; the experimental subsample was popularized by Dehejia & Wahba 1999). It records 445 participants, 185 randomly assigned to job training (treatment) and 260 to control, with eight pre-treatment covariates and real earnings in 1978 (re78) as the outcome.

The post encodes the eight covariates as common causes in a causal graph, identifies the backdoor estimand, and estimates the average treatment effect with five methods: regression adjustment, inverse probability weighting (IPW), doubly robust augmented IPW (AIPW), propensity-score stratification, and propensity-score matching. All five estimates cluster around an earnings gain of roughly 34–38% of the control-group mean. These are real data from a field experiment, not a simulation.
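The four steps named above map onto DoWhy's `CausalModel` API roughly as follows. This is a minimal sketch, not the post's code: the function name `run_four_steps`, the default estimator `backdoor.linear_regression`, and the placebo refuter are assumptions chosen for illustration, and the function expects a DataFrame `df` already loaded from lalonde_dowhy.dta as in the Python snippet earlier on this page.

```python
# Sketch only: assumes `pip install dowhy` and a DataFrame `df` loaded from
# lalonde_dowhy.dta. The post's exact graph and settings may differ.

TREATMENT = "treat"
OUTCOME = "re78"
# The eight pre-treatment covariates documented in this dictionary.
COMMON_CAUSES = ["age", "educ", "black", "hisp", "married", "nodegr", "re74", "re75"]


def run_four_steps(df, method_name="backdoor.linear_regression"):
    """Model -> Identify -> Estimate -> Refute for `treat` on `re78`."""
    # Imported lazily so this module can be loaded without DoWhy installed.
    from dowhy import CausalModel

    # 1. Model: treat -> re78, with every covariate as a common cause.
    model = CausalModel(
        data=df,
        treatment=TREATMENT,
        outcome=OUTCOME,
        common_causes=COMMON_CAUSES,
    )
    # 2. Identify: derive the backdoor estimand from the graph.
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    # 3. Estimate: regression adjustment is only one of several estimators.
    estimate = model.estimate_effect(estimand, method_name=method_name)
    # 4. Refute: swapping in a placebo treatment should pull the effect toward zero.
    refutation = model.refute_estimate(
        estimand, estimate, method_name="placebo_treatment_refuter"
    )
    return estimate, refutation
```

A call such as `estimate, refutation = run_four_steps(df)` followed by `print(estimate.value)` would show the estimated effect in 1978 dollars; passing a different `method_name` is how one would try the other estimators.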

One file, one cross-section. lalonde_dowhy is a single cross-sectional table — one row per study participant (445 rows), with no panel or time dimension. treat is the randomized assignment, re78 the post-program outcome, and the remaining eight columns are pre-treatment covariates (including prior earnings in 1974 and 1975). It is the verbatim DoWhy lalonde_dataset() export used by the post.
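After loading the file, the layout described above (ten columns, 185 treated and 260 control rows) can be checked in a few lines. The helper below is a hypothetical sketch, not part of the post or of DoWhy; it takes plain iterables, so it works with `df.columns` and `df["treat"]` from pandas as well as with ordinary lists. The demo call uses stand-in values rather than the real file.

```python
from collections import Counter

# Column names as documented in this data dictionary.
EXPECTED_COLUMNS = {
    "treat", "age", "educ", "black", "hisp",
    "married", "nodegr", "re74", "re75", "re78",
}
# Assignment counts as documented: 185 treated, 260 control.
EXPECTED_COUNTS = Counter({1: 185, 0: 260})


def check_structure(columns, treat_values):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    columns = set(columns)
    missing = EXPECTED_COLUMNS - columns
    extra = columns - EXPECTED_COLUMNS
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    # int() accepts 0/1 integers, floats, and booleans alike.
    counts = Counter(int(t) for t in treat_values)
    if counts != EXPECTED_COUNTS:
        problems.append(f"unexpected treat counts: {dict(counts)}")
    return problems


# Stand-in values with the documented layout (not the real data).
demo_treat = [1] * 185 + [0] * 260
print(check_structure(EXPECTED_COLUMNS, demo_treat))  # -> []
```

With the real data the call would be `check_structure(df.columns, df["treat"])`.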

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| LaLonde (1986) | Original NSW Demonstration evaluation data and the benchmark research design | LaLonde, R. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4), 604-620. https://www.jstor.org/stable/1806062 |
| Dehejia & Wahba (1999, 2002) | The experimental NSW subsample (445 obs) popularized for causal-inference benchmarking | Dehejia, R. & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. JASA, 94(448), 1053-1062. https://doi.org/10.1080/01621459.1999.10473858 |
| DoWhy (Sharma & Kiciman 2020) | Library shipping the dataset (lalonde_dataset()) and the four-step estimation/refutation methods | Sharma, A. & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. arXiv:2011.04216. https://arxiv.org/abs/2011.04216 |
| Method references | Estimators and concepts | Horvitz & Thompson (1952); Rosenbaum & Rubin (1983); Robins, Rotnitzky & Zhao (1994); Cochran (1968). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset [Data set]. https://carlos-mendez.org/post/python_dowhy/

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604-620.

Dehejia, R., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053-1062.

Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216.

BibTeX

@misc{mendez2026pythondowhy,
  author       = {Mendez, Carlos},
  title        = {Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_dowhy/}},
  note         = {Data set}
}

@article{lalonde1986evaluating,
  author  = {LaLonde, Robert J.},
  title   = {Evaluating the Econometric Evaluations of Training Programs with Experimental Data},
  journal = {American Economic Review},
  volume  = {76}, number = {4}, pages = {604--620}, year = {1986}
}
@article{dehejia1999causal,
  author  = {Dehejia, Rajeev H. and Wahba, Sadek},
  title   = {Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs},
  journal = {Journal of the American Statistical Association},
  volume  = {94}, number = {448}, pages = {1053--1062}, year = {1999}
}
@article{sharma2020dowhy,
  author  = {Sharma, Amit and Kiciman, Emre},
  title   = {{DoWhy}: An End-to-End Library for Causal Inference},
  journal = {arXiv preprint arXiv:2011.04216}, year = {2020}
}

Variable explorer: all 10 variables


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| age | continuous | min 17, median 24, max 55 | Age (years) | Participant age at baseline, in years. | years | lalonde_dowhy | NSW Demonstration baseline |
| black | dummy | share coded 1 = 0.834 | Black (1 = yes) | Race indicator: 1 if the participant is Black, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| educ | continuous | min 3, median 10, max 16 | Years of education | Completed years of schooling at baseline. | years | lalonde_dowhy | NSW Demonstration baseline |
| hisp | dummy | share coded 1 = 0.088 | Hispanic (1 = yes) | Ethnicity indicator: 1 if the participant is Hispanic, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| married | dummy | share coded 1 = 0.169 | Married (1 = yes) | Marital-status indicator: 1 if married at baseline, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| nodegr | dummy | share coded 1 = 0.782 | No high-school degree (1 = yes) | 1 if the participant lacks a high-school diploma, else 0. | 0/1 | lalonde_dowhy | NSW Demonstration baseline |
| re74 | continuous | min 0, median 0, max 3.96e+04 | Real earnings in 1974 (US$) | Pre-program real annual earnings in 1974 (prior earnings, a key confounder). | US$ (1974) | lalonde_dowhy | NSW Demonstration baseline |
| re75 | continuous | min 0, median 0, max 2.51e+04 | Real earnings in 1975 (US$) | Pre-program real annual earnings in 1975 (prior earnings, a key confounder). | US$ (1975) | lalonde_dowhy | NSW Demonstration baseline |
| re78 | continuous | min 0, median 3.7e+03, max 6.03e+04 | Real earnings in 1978 (US$) | Post-program real annual earnings in 1978; the outcome variable. | US$ (1978) | lalonde_dowhy | NSW Demonstration follow-up |
| treat | dummy | share coded 1 = 0.416 | Job-training assignment (1 = trained) | Randomized treatment indicator: 1 if assigned to NSW job training, 0 if control. | 0/1 | lalonde_dowhy | NSW Demonstration (LaLonde 1986) |

Cross-file variable index

Which file each variable appears in (● = present).

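| Variable | lalonde_dowhy |
| --- | --- |
| age | ● |
| black | ● |
| educ | ● |
| hisp | ● |
| married | ● |
| nodegr | ● |
| re74 | ● |
| re75 | ● |
| re78 | ● |
| treat | ● |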

Construction & formulas

The post applies DoWhy's four-step framework to this data, targeting the average treatment effect (ATE) of treat on re78: ATE = E[Y(1) − Y(0)], where Y(1) and Y(0) are a participant's potential 1978 earnings with and without job training.

The dataset itself contains no constructed variables: every column is an observed field from the NSW evaluation (assignment, demographics, prior/post earnings). The quantities above are computed by the analysis code, not stored in the file.
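Because treat was randomly assigned, E[Y(1) − Y(0)] equals E[re78 | treat = 1] − E[re78 | treat = 0], so a plain difference in group means is a natural benchmark for the ATE. The sketch below is a baseline for orientation, not one of the post's five estimators; the function name `difference_in_means` is hypothetical, and the demo values are made up for illustration rather than drawn from the file.

```python
from statistics import fmean


def difference_in_means(treat, outcome):
    """Return (ate, control_mean, pct_of_control) from paired treat/outcome values.

    ate            = mean(outcome | treat = 1) - mean(outcome | treat = 0)
    pct_of_control = 100 * ate / control_mean (NaN if the control mean is 0)
    """
    pairs = list(zip(treat, outcome))
    treated = [float(y) for t, y in pairs if int(t) == 1]
    control = [float(y) for t, y in pairs if int(t) == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control row")
    control_mean = fmean(control)
    ate = fmean(treated) - control_mean
    pct = 100 * ate / control_mean if control_mean else float("nan")
    return ate, control_mean, pct


# Toy values for illustration only (not from lalonde_dowhy).
print(difference_in_means([1, 1, 0, 0], [6.0, 4.0, 3.0, 5.0]))  # -> (1.0, 4.0, 25.0)
```

With the real data the call would be `difference_in_means(df["treat"], df["re78"])`; the third return value expresses the gain as a percentage of the control-group mean.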

The datasets

Each dataset below shows its full variable dictionary plus a statistics table with distribution summaries and data coverage.


individual (cross-section)  445 × 10 · covariates pre-1976; re74/re75 = 1974/1975 earnings; re78 = 1978 earnings · 445 participants (185 treated, 260 control)

Row key: one row per participant (no explicit id column) · Use: estimate the ATE of NSW job training on 1978 earnings via DoWhy (Model/Identify/Estimate/Refute).

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| treat | dummy | Job-training assignment (1 = trained) | Randomized treatment indicator: 1 if assigned to NSW job training, 0 if control. | From NSW random assignment; cast from boolean to integer in the post. | 0/1 | NSW Demonstration (LaLonde 1986) | 185 treated / 260 control |
| re78 | continuous | Real earnings in 1978 (US$) | Post-program real annual earnings in 1978; the outcome variable. | Observed earnings; real (inflation-adjusted) US dollars. | US$ (1978) | NSW Demonstration follow-up | all 445 |
| age | continuous | Age (years) | Participant age at baseline, in years. | Observed at enrollment. | years | NSW Demonstration baseline | all 445 |
| educ | continuous | Years of education | Completed years of schooling at baseline. | Observed at enrollment. | years | NSW Demonstration baseline | all 445 |
| black | dummy | Black (1 = yes) | Race indicator: 1 if the participant is Black, else 0. | Observed demographic at baseline. | 0/1 | NSW Demonstration baseline | all 445 |
| hisp | dummy | Hispanic (1 = yes) | Ethnicity indicator: 1 if the participant is Hispanic, else 0. | Observed demographic at baseline. | 0/1 | NSW Demonstration baseline | all 445 |
| married | dummy | Married (1 = yes) | Marital-status indicator: 1 if married at baseline, else 0. | Observed demographic at baseline. | 0/1 | NSW Demonstration baseline | all 445 |
| nodegr | dummy | No high-school degree (1 = yes) | 1 if the participant lacks a high-school diploma, else 0. | Observed at baseline (no high-school degree). | 0/1 | NSW Demonstration baseline | all 445 |
| re74 | continuous | Real earnings in 1974 (US$) | Pre-program real annual earnings in 1974 (prior earnings, a key confounder). | Observed earnings; real US dollars. | US$ (1974) | NSW Demonstration baseline | all 445 |
| re75 | continuous | Real earnings in 1975 (US$) | Pre-program real annual earnings in 1975 (prior earnings, a key confounder). | Observed earnings; real US dollars. | US$ (1975) | NSW Demonstration baseline | all 445 |

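Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| treat | share coded 1 = 0.416 | 100% | 445 | 2 | 0 | 0.416 | 0 | 1.00 | 0.493 |
| re78 | min 0, median 3.7e+03, max 6.03e+04 | 100% | 445 | 308 | 0 | 5,300.8 | 3,701.8 | 60,308 | 6,631.5 |
| age | min 17, median 24, max 55 | 100% | 445 | 34 | 17.00 | 25.37 | 24.00 | 55.00 | 7.10 |
| educ | min 3, median 10, max 16 | 100% | 445 | 14 | 3.00 | 10.20 | 10.00 | 16.00 | 1.79 |
| black | share coded 1 = 0.834 | 100% | 445 | 2 | 0 | 0.834 | 1.00 | 1.00 | 0.373 |
| hisp | share coded 1 = 0.088 | 100% | 445 | 2 | 0 | 0.088 | 0 | 1.00 | 0.283 |
| married | share coded 1 = 0.169 | 100% | 445 | 2 | 0 | 0.169 | 0 | 1.00 | 0.375 |
| nodegr | share coded 1 = 0.782 | 100% | 445 | 2 | 0 | 0.782 | 1.00 | 1.00 | 0.413 |
| re74 | min 0, median 0, max 3.96e+04 | 100% | 445 | 115 | 0 | 2,102.3 | 0 | 39,571 | 5,363.6 |
| re75 | min 0, median 0, max 2.51e+04 | 100% | 445 | 155 | 0 | 1,377.1 | 0 | 25,142 | 3,151.0 |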
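The columns of the statistics table can be recomputed per variable with the standard library. The helper below is a hypothetical sketch; in particular, using the sample standard deviation (n − 1 denominator, which is also the pandas default) is an assumption, since the page does not say which convention its SD column follows. For a 0/1 dummy, the mean is the "share coded 1".

```python
from statistics import fmean, median, stdev


def describe(values):
    """Summary statistics matching the table's columns for one variable."""
    values = [float(v) for v in values]
    return {
        "n": len(values),
        "distinct": len(set(values)),
        "min": min(values),
        "mean": fmean(values),      # for a 0/1 dummy: the share coded 1
        "median": median(values),
        "max": max(values),
        "sd": stdev(values),        # sample SD (n - 1); an assumption, see above
    }


# Toy dummy variable for illustration only (not from lalonde_dowhy).
print(describe([0, 0, 1, 1])["mean"])  # -> 0.5
```

With the real data, `{col: describe(df[col]) for col in df.columns}` would produce one summary per variable.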

Known limitations & caveats