Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
Download all data (ZIP) | stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| assets3_raw | household (cross-section) | 9,913 × 11 | assets3_raw.dta | assets3_raw.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate/data/"
use "${BASE}assets3_raw.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate/data/"
df = pd.read_stata(BASE + "assets3_raw.dta")
# load every dataset at once
files = ["assets3_raw"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "assets3_raw.dta", "assets3_raw.dta")
df, meta = pyreadstat.read_dta("assets3_raw.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate/data/"
df <- read_dta(paste0(BASE, "assets3_raw.dta"))
Overview & sources
Companion data for a hands-on Stata tutorial on estimating Conditional Average Treatment Effects (CATE) with Stata 19's new cate command. The dataset is the canonical assets3 excerpt, an extract from Chernozhukov & Hansen (2004) that ships with Stata 19 (webuse assets3) and covers 9,913 U.S. households. The outcome is net total financial assets (assets, in US$). The treatment is eligibility for an employer-offered 401(k) plan (e401k), not actual participation. The remaining columns are demographic and financial covariates along which the treatment effect may vary. The post contrasts partialing-out (PO) and augmented inverse-probability weighting (AIPW) estimators against a parametric teffects aipw benchmark, then probes heterogeneity with estat heterogeneity, estat projection, GATE on prespecified income groups, GATES on data-driven quartiles, estat classification, and a nonparametric estat series fit. The raw eligible-versus-ineligible gap of $19,557 shrinks to a doubly robust ATE near $8,000, and income emerges as the dominant moderator.
assets3_raw is a single cross-section — one row per household, no time dimension. It is exported verbatim from Stata 19's built-in assets3 sample by the post's do-file (export delimited); the string-coded columns (e.g. e401k, pension) carry Stata's value-label text rather than the underlying 0/1 codes.
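Because the binary columns arrive as value-label text, most estimators will need them recoded to 0/1 first. The following is a minimal sketch of that step in pandas, not part of the post's do-file. The three example rows are hypothetical, and the label spellings are taken from the variable dictionary on this page, so they are assumed to match the file exactly.

```python
import pandas as pd

# Hypothetical rows mimicking the string-coded columns of assets3_raw.csv.
# Label spellings follow this page's variable dictionary (an assumption).
df = pd.DataFrame({
    "e401k": ["Eligible", "Not eligible", "Eligible"],
    "pension": ["No pension", "Receives pension", "No pension"],
    "twoearn": ["Yes", "No", "No"],
})

# One explicit label -> code map per column (1 = the "positive" category).
label_maps = {
    "e401k": {"Eligible": 1, "Not eligible": 0},
    "pension": {"Receives pension": 1, "No pension": 0},
    "twoearn": {"Yes": 1, "No": 0},
}

for col, mapping in label_maps.items():
    # Fail loudly on any label the map does not cover, instead of
    # silently producing missing values.
    unknown = set(df[col]) - set(mapping)
    if unknown:
        raise ValueError(f"unexpected labels in {col}: {sorted(unknown)}")
    df[col + "_num"] = df[col].map(mapping).astype(int)

print(df[["e401k", "e401k_num", "pension_num", "twoearn_num"]])
```

Keeping the original text column next to the new `_num` column makes it easy to check the coding direction before dropping the strings.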
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Chernozhukov & Hansen (2004) | Source study for the assets3 sample (401(k) eligibility and the household wealth distribution) | Chernozhukov, V., & Hansen, C. (2004). The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. Review of Economics and Statistics, 86(3), 735–751. https://doi.org/10.1162/0034653041811734 |
| StataCorp (Stata 19) | Distribution of the analysis dataset (webuse assets3) and the cate estimation command | StataCorp. (2025). Stata 19 Causal Inference and Treatment-Effects Reference Manual: cate. https://www.stata.com/manuals/causal.pdf |
| Method references | CATE / heterogeneous-treatment-effect estimators | Athey, Tibshirani & Wager (2019); Chernozhukov et al. (2018); Robinson (1988). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Conditional Average Treatment Effects (CATE) with Stata 19 [Data set]. https://carlos-mendez.org/post/stata_cate/
Chernozhukov, V., & Hansen, C. (2004). The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. Review of Economics and Statistics, 86(3), 735–751. https://doi.org/10.1162/0034653041811734
BibTeX
@misc{mendez2026statacate,
author = {Mendez, Carlos},
title = {Conditional Average Treatment Effects (CATE) with Stata 19},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/stata_cate/}},
note = {Data set}
}
@article{chernozhukov2004effects,
author = {Chernozhukov, Victor and Hansen, Christian},
title = {The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis},
journal = {Review of Economics and Statistics},
volume = {86}, number = {3}, pages = {735--751}, year = {2004},
doi = {10.1162/0034653041811734}
}
Variable explorer (all 11 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| age | continuous | Age of household head (years) | Age of the household head in years. | years | assets3_raw | Stata assets3 |
| assets | continuous | Net total financial assets (US$) | Household net total financial assets; the outcome variable. | US$ | assets3_raw | Chernozhukov & Hansen (2004) / Stata assets3 |
| e401k | identifier | 401(k) eligibility (treatment) | Whether the household head's employer offers a 401(k) plan (eligibility, not participation); the treatment. | category | assets3_raw | Stata assets3 |
| educ | continuous | Years of education | Years of completed education of the household head. | years | assets3_raw | Stata assets3 |
| income | continuous | Household income (US$) | Annual household income. | US$ | assets3_raw | Stata assets3 |
| incomecat | identifier | Income category (0–4) | Coarse household-income category used for the prespecified GATE groups. | 0–4 | assets3_raw | Stata assets3 |
| ira | identifier | IRA participation | Whether the household holds an Individual Retirement Account (IRA). | category | assets3_raw | Stata assets3 |
| married | identifier | Marital status | Marital status of the household head. | category | assets3_raw | Stata assets3 |
| ownhome | identifier | Homeowner | Whether the household owns its home. | category | assets3_raw | Stata assets3 |
| pension | identifier | Pension benefits status | Whether the household receives defined-benefit pension benefits. | category | assets3_raw | Stata assets3 |
| twoearn | identifier | Two-earner household | Whether the household has two earners. | category | assets3_raw | Stata assets3 |
Cross-file variable index
Which file each variable appears in: with a single dataset, all 11 variables appear in assets3_raw.
Construction & formulas
The estimand is the Conditional Average Treatment Effect (CATE), the average effect of 401(k) eligibility d on net assets y for households with covariate profile x:
- CATE: τ(x) = E{ y(1) − y(0) | X = x }, a function of x, not a single number.
- ATE: ATE = E{ τ(X) }, the CATE averaged over the sample.
- GATE (prespecified group g): τ(g) = E{ Γ_i | G_i = g }, the doubly robust AIPW score Γ_i averaged over households in group g.
- GATES: the same average, but groups are data-driven quartiles of the predicted effect τ̂_i (out-of-sample, cross-fit).
- AIPW score: Γ_i = [ ŷ(1)_i + d_i{y_i − ŷ(1)_i}/f̂_i ] − [ ŷ(0)_i + (1−d_i){y_i − ŷ(0)_i}/(1−f̂_i) ], which is doubly robust (consistent if either the outcome model or the propensity model is correct).
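To make the AIPW score formula concrete, here is a minimal sketch in plain Python that computes Γ_i from given nuisance predictions and then averages it overall (ATE) and within groups (GATE). The function names and all numbers are hypothetical toy inputs, not values from assets3_raw and not the internals of Stata's cate command.

```python
from statistics import mean

def aipw_scores(y, d, yhat1, yhat0, fhat):
    """Return the AIPW score for each unit.

    y: outcome, d: 0/1 treatment, yhat1/yhat0: predicted outcomes under
    treatment/control, fhat: predicted propensity P(d = 1 | x).
    """
    scores = []
    for y_i, d_i, m1, m0, f in zip(y, d, yhat1, yhat0, fhat):
        treated_part = m1 + d_i * (y_i - m1) / f
        control_part = m0 + (1 - d_i) * (y_i - m0) / (1 - f)
        scores.append(treated_part - control_part)
    return scores

def group_means(scores, groups):
    """Average the scores within each group label (the GATE for that group)."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    return {g: mean(v) for g, v in sorted(by_group.items())}

# Hypothetical toy inputs (four households, two prespecified groups).
y = [10.0, 2.0, 8.0, 1.0]
d = [1, 0, 1, 0]
yhat1 = [9.0, 5.0, 7.0, 4.0]
yhat0 = [3.0, 2.0, 2.0, 1.0]
fhat = [0.5, 0.5, 0.5, 0.5]
group = [0, 0, 1, 1]

gamma = aipw_scores(y, d, yhat1, yhat0, fhat)
ate = mean(gamma)
gate = group_means(gamma, group)
print(gamma, ate, gate)
```

Each treated unit's score corrects the model prediction ŷ(1) by its own inverse-probability-weighted residual; control units do the same for ŷ(0). This is why an error in one of the two nuisance models can be offset by the other.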
Stata 19's cate command estimates τ(x) with cross-fitted lasso nuisance models
(the learners for the outcome and treatment functions) plus a generalized random
forest for the individual-effect function, and an honest-tree bootstrap for inference. The
PO (partialing-out, partial-linear) and AIPW (fully interactive) routes
make different model assumptions but both return a per-household τ̂(x_i). Note: the columns
in assets3_raw.csv are the raw inputs only; the IATE predictions τ̂_i are
produced by the do-file and not part of this dataset.
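The cross-fitting idea mentioned above can be sketched independently of any particular learner: each household's nuisance prediction comes from a model fitted only on the other folds. In the illustration below the "model" is deliberately trivial (the arm mean of y), standing in for the lasso learners; the function name, the fold rule, and the data are all hypothetical.

```python
from statistics import mean

def cross_fit_arm_means(y, d, n_folds=2):
    """Out-of-fold predictions of E[y | d = 1] and E[y | d = 0].

    Every unit's prediction uses only units from the other folds, which is
    the defining property of cross-fitting. The learner is just an arm mean.
    """
    n = len(y)
    fold = [i % n_folds for i in range(n)]  # deterministic fold assignment
    yhat1 = [None] * n
    yhat0 = [None] * n
    for k in range(n_folds):
        train = [i for i in range(n) if fold[i] != k]
        # "Fit" on the training folds only.
        m1 = mean(y[i] for i in train if d[i] == 1)
        m0 = mean(y[i] for i in train if d[i] == 0)
        # Predict for the held-out fold k.
        for i in range(n):
            if fold[i] == k:
                yhat1[i], yhat0[i] = m1, m0
    return yhat1, yhat0, fold

# Hypothetical toy data: eight households, both arms present in each fold.
y = [10.0, 8.0, 2.0, 1.0, 12.0, 6.0, 3.0, 0.0]
d = [1, 1, 0, 0, 1, 1, 0, 0]

yhat1, yhat0, fold = cross_fit_arm_means(y, d)
print(yhat1)
print(yhat0)
```

In practice folds are usually assigned at random, and each training set must contain both treated and control units, otherwise the arm mean is undefined.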
The datasets
Each dataset comes with a full variable dictionary and a statistics table with data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| assets (continuous) | Net total financial assets (US$) | Household net total financial assets; the outcome variable. | From the assets3 sample; net of liabilities, so values can be negative. | US$ | Chernozhukov & Hansen (2004) / Stata assets3 | 9,913 households |
| e401k (identifier) | 401(k) eligibility (treatment) | Whether the household head's employer offers a 401(k) plan (eligibility, not participation); the treatment. | Value-label text from the assets3 sample (Eligible / Not eligible). | category | Stata assets3 | 3,682 Eligible; 6,231 Not eligible |
| age (continuous) | Age of household head (years) | Age of the household head in years. | From the assets3 sample. | years | Stata assets3 | 9,913 households |
| educ (continuous) | Years of education | Years of completed education of the household head. | From the assets3 sample. | years | Stata assets3 | 9,913 households |
| income (continuous) | Household income (US$) | Annual household income. | From the assets3 sample. | US$ | Stata assets3 | 9,913 households |
| incomecat (identifier) | Income category (0–4) | Coarse household-income category used for the prespecified GATE groups. | Discrete income bin from the assets3 sample (0 = lowest … 4 = highest). | 0–4 | Stata assets3 | 9,913 households |
| pension (identifier) | Pension benefits status | Whether the household receives defined-benefit pension benefits. | Value-label text from the assets3 sample (Receives pension / No pension). | category | Stata assets3 | 9,913 households |
| married (identifier) | Marital status | Marital status of the household head. | Value-label text from the assets3 sample (Married / Not married). | category | Stata assets3 | 9,913 households |
| twoearn (identifier) | Two-earner household | Whether the household has two earners. | Value-label text from the assets3 sample (Yes / No). | category | Stata assets3 | 9,913 households |
| ira (identifier) | IRA participation | Whether the household holds an Individual Retirement Account (IRA). | Value-label text from the assets3 sample (Yes / No). | category | Stata assets3 | 9,913 households |
| ownhome (identifier) | Homeowner | Whether the household owns its home. | Value-label text from the assets3 sample (Yes / No). | category | Stata assets3 | 9,913 households |
Summary statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| assets | 100% | 9,913 | 5,168 | -502,302 | 18,054 | 1,499.0 | 1,536,798 | 63,529 |
| e401k | 100% | 9,913 | 2 | n/a | n/a | n/a | n/a | n/a |
| age | 100% | 9,913 | 40 | 25.00 | 41.06 | 40.00 | 64.00 | 10.34 |
| educ | 100% | 9,913 | 18 | 1.00 | 13.21 | 12.00 | 18.00 | 2.81 |
| income | 100% | 9,913 | 7,332 | 0 | 37,208 | 31,488 | 242,124 | 24,771 |
| incomecat | 100% | 9,913 | 5 | n/a | n/a | n/a | n/a | n/a |
| pension | 100% | 9,913 | 2 | n/a | n/a | n/a | n/a | n/a |
| married | 100% | 9,913 | 2 | n/a | n/a | n/a | n/a | n/a |
| twoearn | 100% | 9,913 | 2 | n/a | n/a | n/a | n/a | n/a |
| ira | 100% | 9,913 | 2 | n/a | n/a | n/a | n/a | n/a |
| ownhome | 100% | 9,913 | 2 | n/a | n/a | n/a | n/a | n/a |
Known limitations & caveats
- Observational, not randomized. 401(k) eligibility is not randomly assigned (it depends on the employer the household head chose), so the raw eligible-versus-ineligible asset gap ($19,557) mixes the causal effect with selection; identification rests on unconfoundedness given the covariates.
- Eligibility, not participation. e401k records whether the employer offers a 401(k), not whether the household contributes; this is a distinct treatment from actual participation.
- Heavy-tailed outcome. Net assets is extremely right-skewed (mean $18,054 vs. median $1,499; max ≈ $1.5M; min = −$502,302 for negative-net-worth households), so mean-based comparisons are sensitive to a few extreme households and the treatment effect likely varies across the distribution.
- String-coded categories. In this CSV the binary covariates are stored as their Stata value-label text (e.g. e401k = "Eligible"/"Not eligible", pension = "Receives pension"/"No pension") rather than 0/1; incomecat alone is numeric (0–4).
- Excerpt. assets3 is a teaching excerpt of the original Chernozhukov & Hansen (2004) data shipped with Stata; it is intended for illustration and may not match the full source extract variable-for-variable.
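The first and third caveats can be illustrated together. A raw gap is simply a difference in group means, and with a heavy-tailed outcome a single extreme household can dominate it. The sketch below uses six hypothetical households (not values from assets3_raw) and compares the mean gap with the median gap. Neither number is a causal estimate.

```python
from statistics import mean, median

# Hypothetical (assets, e401k label) pairs; NOT values from assets3_raw.
rows = [
    (500.0, "Not eligible"),
    (1500.0, "Not eligible"),
    (-2000.0, "Not eligible"),
    (3000.0, "Eligible"),
    (250000.0, "Eligible"),   # one extreme household in the right tail
    (2000.0, "Eligible"),
]

eligible = [a for a, e in rows if e == "Eligible"]
ineligible = [a for a, e in rows if e == "Not eligible"]

# Raw (unadjusted) gap: difference in group means.
raw_gap = mean(eligible) - mean(ineligible)
# The same comparison at the median is far less sensitive to the tail.
median_gap = median(eligible) - median(ineligible)

print(raw_gap, median_gap)
```

In this toy example the mean gap is driven almost entirely by one household, which mirrors why the mean of assets ($18,054) sits so far above its median ($1,499).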