Data dictionary · Evaluating a Cash Transfer Program (RCT) with Panel Data

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows	Stata	Source
`dataSIM4RCT`	household-year (two waves)	4,000 × 14	dataSIM4RCT.dta	dataSIM4RCT.dta

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
use "${BASE}dataSIM4RCT.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
df = pd.read_stata(BASE + "dataSIM4RCT.dta")

# load every dataset at once
files = ["dataSIM4RCT"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "dataSIM4RCT.dta", "dataSIM4RCT.dta")
df, meta = pyreadstat.read_dta("dataSIM4RCT.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
df <- read_dta(paste0(BASE, "dataSIM4RCT.dta"))

Overview & sources

Companion data for a hands-on Stata tutorial that evaluates the causal effect of a cash transfer program on household consumption. The data are fully synthetic: 2,000 households in a developing country, observed in a balanced panel across a 2021 baseline and a 2024 endline (4,000 observations). The outcome is log monthly consumption and the program raises it by a known true effect of 12% (0.12 log points). The tutorial walks from baseline-balance checks through three cross-sectional estimators — regression adjustment (RA), inverse probability weighting (IPW), and doubly robust AIPW/IPWRA — then difference-in-differences and doubly robust DiD on the panel, and an endogenous-treatment IV model for imperfect compliance. Because the ground truth is known, every estimate can be checked against 0.12.

One file, long panel. dataSIM4RCT.dta is a strongly balanced household panel with two rows per household (2021 baseline and 2024 endline), keyed by id × year. Random assignment to the program offer (treat) is fixed within a household; actual receipt (D) turns on only at endline and only for households that take up the transfer (imperfect take-up: 85% of the offered, 5% of controls). The variables wave, year, and post are three encodings of the same two-period time axis; alpha and eps are data-generating-process internals exposed for transparency.
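The structural claims above (balanced panel, assignment fixed within household, no receipt at baseline) are easy to verify after loading. The sketch below is one way to do it in plain Python; `check_panel` is a hypothetical helper name, the four-row `toy` list is a hand-made stand-in for the real file, and the check assumes the variables are numerically coded as described in this dictionary.

```python
from collections import defaultdict

def check_panel(rows):
    """Check the panel properties on an iterable of row dicts.

    Each row needs the keys id, year, treat, and D (names from the
    data dictionary). Returns one boolean per property.
    """
    by_id = defaultdict(list)
    for r in rows:
        by_id[r["id"]].append(r)

    # Balanced: every household has exactly one 2021 row and one 2024 row.
    balanced = all(
        sorted(r["year"] for r in hh) == [2021, 2024] for hh in by_id.values()
    )
    # Assignment takes a single value within each household.
    treat_fixed = all(len({r["treat"] for r in hh}) == 1 for hh in by_id.values())
    # Receipt is zero on every baseline row.
    d_zero_at_baseline = all(
        r["D"] == 0 for hh in by_id.values() for r in hh if r["year"] == 2021
    )
    return {
        "balanced": balanced,
        "treat_fixed": treat_fixed,
        "d_zero_at_baseline": d_zero_at_baseline,
    }

# Hypothetical stand-in rows, not values from the real file.
toy = [
    {"id": 1, "year": 2021, "treat": 1, "D": 0},
    {"id": 1, "year": 2024, "treat": 1, "D": 1},
    {"id": 2, "year": 2021, "treat": 0, "D": 0},
    {"id": 2, "year": 2024, "treat": 0, "D": 0},
]
result = check_panel(toy)
print(result)
```

With the pandas frame from the loading snippet, the same helper can be called as `check_panel(df.to_dict("records"))`.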

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated household panel with a calibrated, known 12% true effect (open & reproducible)	Mendez, C. (2026). See the post's Stata do-file analysis.do and the tutorial for the design.
Method references	Estimators and concepts (RA / IPW / doubly-robust AIPW-IPWRA / DiD / DR-DiD / endogenous-treatment IV)	Stata teffects manual; Sant'Anna & Zhao (2020); Imbens & Rubin (2015); Rios-Avila, Sant'Anna & Callaway (drdid).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata [Data set]. https://carlos-mendez.org/post/stata_rct/

Sant'Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122. https://doi.org/10.1016/j.jeconom.2020.06.003

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.

StataCorp. teffects: Treatment-effects estimation. Stata Treatment-Effects Reference Manual.

BibTeX

@misc{mendez2026statarct,
  author       = {Mendez, Carlos},
  title        = {Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_rct/}},
  note         = {Data set}
}

@article{santanna2020doubly,
  author  = {Sant'Anna, Pedro H. C. and Zhao, Jun},
  title   = {Doubly robust difference-in-differences estimators},
  journal = {Journal of Econometrics},
  volume  = {219}, number = {1}, pages = {101--122}, year = {2020},
  doi     = {10.1016/j.jeconom.2020.06.003}
}
@book{imbens2015causal,
  author    = {Imbens, Guido W. and Rubin, Donald B.},
  title     = {Causal Inference for Statistics, Social, and Biomedical Sciences},
  publisher = {Cambridge University Press}, year = {2015}
}

Variable explorer (all 14 variables)

Variable	Type	Label	Definition	Units	In files	Source
`D`	dummy	Receipt of cash transfer (endogenous)	Actual receipt of the transfer; endogenous take-up, non-zero only at endline.	0/1	dataSIM4RCT	Simulation
`age`	continuous	Age of household head	Age in years of the household head (time-invariant in this panel).	years	dataSIM4RCT	Simulation
`alpha`	continuous	Household DGP component (random effect)	Simulation random component contributing to consumption; not a tutorial covariate.	log scale	dataSIM4RCT	Simulation (DGP internal)
`edu`	continuous	Years of education (household head)	Years of education of the household head (time-invariant in this panel).	years	dataSIM4RCT	Simulation
`eps`	continuous	Idiosyncratic DGP error term	Simulation idiosyncratic shock to consumption; not a tutorial covariate.	log scale	dataSIM4RCT	Simulation (DGP internal)
`female`	dummy	Female-headed household	1 if the household head is female, else 0 (the chance baseline imbalance, SMD ~9.3%).	0/1	dataSIM4RCT	Simulation
`id`	identifier	Household identifier	Unique household ID; the panel unit, repeated across the two waves.	integer	dataSIM4RCT	Simulation
`post`	dummy	Endline indicator (1 = 2024)	Binary post-treatment period flag; 1 at endline, 0 at baseline.	0/1	dataSIM4RCT	Simulation
`poverty`	dummy	Poverty status at baseline	1 if the household is in poverty at baseline (the randomization stratum), else 0.	0/1	dataSIM4RCT	Simulation
`treat`	dummy	Assignment to offer (intent-to-treat)	Random assignment to the program offer; exogenous, fixed within household across waves.	0/1	dataSIM4RCT	Simulation (randomized)
`wave`	identifier	Survey wave index (1=baseline, 2=endline)	Integer wave index; an alternative encoding of the time axis to year/post.	1/2	dataSIM4RCT	Simulation
`y`	continuous	Log monthly consumption (outcome)	Outcome variable: natural log of monthly household consumption in each wave.	log of monetary units	dataSIM4RCT	Simulation
`y0`	continuous	Baseline log consumption (pre-treatment)	Each household's 2021 baseline value of y, carried to both rows for ANCOVA-style adjustment.	log of monetary units	dataSIM4RCT	Simulation
`year`	year	Survey year (2021 or 2024)	Calendar year of the survey wave.	year	dataSIM4RCT	Simulation

Cross-file variable index

Which file each variable appears in (● = present).

Variable	dataSIM4RCT
`D`	●
`age`	●
`alpha`	●
`edu`	●
`eps`	●
`female`	●
`id`	●
`post`	●
`poverty`	●
`treat`	●
`wave`	●
`y`	●
`y0`	●
`year`	●

Construction & formulas

The outcome y is log monthly consumption. The causal target is the program's average effect, with a known true value of 0.12 log points (≈12%). Two estimands recur throughout:

ATE (policymaker's quantity): ATE = E[Y(1) − Y(0)] — the average effect if the program were scaled to everyone.
ATT (evaluator's quantity): ATT = E[Y(1) − Y(0) | T=1] — the average effect among the treated. DiD estimates the ATT only.

Five estimation strategies are applied to these data:

Regression adjustment (RA): fit outcome models on treated and control, impute both potential outcomes, average the difference: τ_RA = (1/N) Σ [μ̂_1(X_i) − μ̂_0(X_i)]. Consistent if the outcome model is correct (teffects ra).
Inverse probability weighting (IPW): reweight by the inverse propensity score p̂(X)=Pr(T=1|X): τ_IPW = (1/N) Σ [T_i Y_i / p̂(X_i) − (1−T_i) Y_i / (1−p̂(X_i))]. Consistent if the treatment model is correct (teffects ipw).
Doubly robust (AIPW / IPWRA): RA plus an IPW-weighted bias-correction term, τ_DR = τ_RA + (1/N) Σ [T_i(Y_i−μ̂_1(X_i))/p̂(X_i) − (1−T_i)(Y_i−μ̂_0(X_i))/(1−p̂(X_i))]. Consistent if either model is correct (teffects aipw / teffects ipwra).
Difference-in-differences (DiD): the difference of pre/post changes, τ_DiD = (Ȳ_tr,post − Ȳ_tr,pre) − (Ȳ_ct,post − Ȳ_ct,pre), netting out time-invariant unobservables; estimates the ATT (xtdidregress).
Doubly robust DiD (DR-DiD): Sant'Anna & Zhao (2020) extend the DR logic to changes ΔY, consistent if the outcome or the propensity model holds (drdid, xthdidregress aipw).
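The RA, IPW, and doubly robust formulas above can be traced by hand on a toy problem. The sketch below is an illustration under stated assumptions, not the tutorial's code and not what `teffects` does internally: it uses a single binary covariate `x` (think of the poverty stratum), hypothetical simulated data, and saturated "models" in which μ̂_t(x) is a cell mean and p̂(x) is the treated share in each cell. Under those assumptions the three estimators coincide exactly, which makes the bias-correction term easy to inspect.

```python
import random
from collections import defaultdict

random.seed(0)
TRUE_EFFECT = 0.12  # known effect, from the data dictionary

# Hypothetical toy data: binary covariate x, randomized treatment t,
# outcome y with a homogeneous treatment effect.
data = []
for _ in range(4000):
    x = 1 if random.random() < 0.3 else 0
    t = 1 if random.random() < 0.5 else 0
    y = 10.0 - 0.3 * x + TRUE_EFFECT * t + random.gauss(0, 0.3)
    data.append((x, t, y))

# Cell sums and counts by (x, t) feed both "models".
sums = defaultdict(float)
counts = defaultdict(int)
for x, t, y in data:
    sums[(x, t)] += y
    counts[(x, t)] += 1

def mu(t, x):
    """Outcome model: mean of y in the (x, t) cell."""
    return sums[(x, t)] / counts[(x, t)]

def p(x):
    """Propensity model: treated share among units with covariate x."""
    return counts[(x, 1)] / (counts[(x, 0)] + counts[(x, 1)])

n = len(data)

# Regression adjustment: average the imputed difference over all units.
tau_ra = sum(mu(1, x) - mu(0, x) for x, _, _ in data) / n

# Inverse probability weighting.
tau_ipw = sum(t * y / p(x) - (1 - t) * y / (1 - p(x)) for x, t, y in data) / n

# Doubly robust: RA plus the IPW-weighted residual correction.
correction = sum(
    t * (y - mu(1, x)) / p(x) - (1 - t) * (y - mu(0, x)) / (1 - p(x))
    for x, t, y in data
) / n
tau_dr = tau_ra + correction

print(tau_ra, tau_ipw, tau_dr)
```

With continuous covariates such as age and edu, the cell means would be replaced by fitted regressions and the cell shares by a logit or probit, and the three estimates would then generally differ slightly.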

Imperfect compliance is handled by using random assignment treat as an instrument for receipt D (endogenous-treatment regression, etregress), separating the effect of the offer (ITT) from the effect of receipt.
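A back-of-the-envelope calculation shows how far apart the offer and receipt effects should be. It assumes, as this dictionary states, that the 0.12 bump applies only to households that receive the transfer and is the same for every household. The Wald ratio used here is a simple stand-in for the logic of instrumenting receipt with assignment; it is not the estimator that `etregress` implements.

```python
effect_of_receipt = 0.12  # true effect on recipients, from the data dictionary
takeup_offered = 0.85     # share of offered households that receive the transfer
takeup_control = 0.05     # share of control households that receive the transfer

# First stage: how much the offer shifts the probability of receipt.
first_stage = takeup_offered - takeup_control

# Intent-to-treat: the receipt effect diluted by incomplete take-up.
itt = effect_of_receipt * first_stage

# Wald ratio: dividing the ITT by the first stage recovers the receipt effect.
wald = itt / first_stage

print(round(first_stage, 3), round(itt, 3), round(wald, 3))
```

Under these assumptions an offer-based comparison should land near 0.096 rather than 0.12, and only receipt-based estimates should be compared with the 0.12 benchmark.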

Data-generating process. For household i, log consumption is built from a baseline level plus a household component alpha, an idiosyncratic shock eps, a common time trend, and the 0.12 treatment bump applied at endline to households that receive the transfer. The baseline value is stored as y0 and carried to both rows; wave/year/post encode the two-period time axis.
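The data-generating process described above can be sketched in a few lines, together with the two-period DiD it supports. This is a minimal illustration, not the simulation code behind the file: the 0.12 bump, the roughly 52% assignment share, and the 85%/5% take-up rates come from this dictionary, while the baseline level of 10, the standard deviations of 0.3, the trend of 0.05, purely random take-up, and a household component held fixed across waves are assumptions chosen to roughly match the statistics table.

```python
import random
from statistics import fmean

random.seed(1)
N = 2000
TRUE_EFFECT = 0.12  # from the data dictionary
TREND = 0.05        # assumed common time trend (hypothetical value)

rows = []
for _ in range(N):
    treat = 1 if random.random() < 0.52 else 0
    alpha = random.gauss(0, 0.3)  # household component, fixed across waves here
    takeup = 0.85 if treat else 0.05
    d_end = 1 if random.random() < takeup else 0  # receipt at endline
    y_pre = 10.0 + alpha + random.gauss(0, 0.3)
    y_post = 10.0 + alpha + TREND + TRUE_EFFECT * d_end + random.gauss(0, 0.3)
    rows.append((treat, d_end, y_pre, y_post))

def arm_change(arm):
    """Mean pre-to-post change in y within one assignment arm."""
    return fmean([post - pre for t, _, pre, post in rows if t == arm])

def arm_receipt(arm):
    """Share receiving the transfer at endline within one assignment arm."""
    return fmean([d for t, d, _, _ in rows if t == arm])

# DiD by assignment arm: alpha and the trend both difference out.
tau_did = arm_change(1) - arm_change(0)

# First stage and the implied effect of receipt.
first_stage = arm_receipt(1) - arm_receipt(0)
receipt_effect = tau_did / first_stage

print(round(tau_did, 3), round(first_stage, 3), round(receipt_effect, 3))
```

Because the groups here are defined by the offer, `tau_did` targets the offer effect (about 0.12 × 0.80 under these assumptions), and dividing by the first stage brings it back toward 0.12.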

The datasets

Each dataset is documented with a full variable dictionary and a statistics table that reports data coverage.


household-year (two waves) 4,000 × 14 · 2021, 2024 · 2,000 households (strongly balanced; 4,000 rows)

Panel key: id × year · Evaluate a cash-transfer program (RA / IPW / DR / DiD / DR-DiD / endogenous-treatment IV) against a known 0.12 effect.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source
`id` identifier	Household identifier	Unique household ID; the panel unit, repeated across the two waves.	1..2000, one per household.	integer	Simulation
`age` continuous	Age of household head	Age in years of the household head (time-invariant in this panel).	Drawn at baseline; held fixed across waves.	years	Simulation
`female` dummy	Female-headed household	1 if the household head is female, else 0 (the chance baseline imbalance, SMD ~9.3%).	Drawn at baseline; held fixed across waves.	0/1	Simulation
`poverty` dummy	Poverty status at baseline	1 if the household is in poverty at baseline (the randomization stratum), else 0.	Drawn at baseline; randomization was stratified on this variable.	0/1	Simulation
`edu` continuous	Years of education (household head)	Years of education of the household head (time-invariant in this panel).	Drawn at baseline; held fixed across waves.	years	Simulation
`treat` dummy	Assignment to offer (intent-to-treat)	Random assignment to the program offer; exogenous, fixed within household across waves.	Stratified (block) randomization within poverty strata; ~52% assigned.	0/1	Simulation (randomized)
`wave` identifier	Survey wave index (1=baseline, 2=endline)	Integer wave index; an alternative encoding of the time axis to year/post.	1 for the 2021 baseline, 2 for the 2024 endline.	1/2	Simulation
`year` year	Survey year (2021 or 2024)	Calendar year of the survey wave.	2021 for the baseline wave, 2024 for the endline wave.	year	Simulation
`post` dummy	Endline indicator (1 = 2024)	Binary post-treatment period flag; 1 at endline, 0 at baseline.	1 if year==2024 (endline), else 0.	0/1	Simulation
`D` dummy	Receipt of cash transfer (endogenous)	Actual receipt of the transfer; endogenous take-up, non-zero only at endline.	0 at baseline; at endline ~85% of the offered and ~5% of controls receive (imperfect compliance).	0/1	Simulation
`alpha` continuous	Household DGP component (random effect)	Simulation random component contributing to consumption; not a tutorial covariate.	Generated by the data-generating process (household/wave-level term).	log scale	Simulation (DGP internal)
`eps` continuous	Idiosyncratic DGP error term	Simulation idiosyncratic shock to consumption; not a tutorial covariate.	Generated by the data-generating process (per-observation noise).	log scale	Simulation (DGP internal)
`y` continuous	Log monthly consumption (outcome)	Outcome variable: natural log of monthly household consumption in each wave.	Baseline level + household and time components, plus the 0.12 treatment bump at endline for recipients.	log of monetary units	Simulation
`y0` continuous	Baseline log consumption (pre-treatment)	Each household's 2021 baseline value of y, carried to both rows for ANCOVA-style adjustment.	y at the baseline wave; constant within household across the two rows.	log of monetary units	Simulation
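The `female` row quotes a standardized mean difference (SMD) as the measure of baseline imbalance. The sketch below shows one common convention, the gap in arm means divided by the square root of the average of the two arm variances. Whether the tutorial uses exactly this convention is an assumption, `smd` is a hypothetical helper name, and the two lists are made-up values rather than data from the file.

```python
from math import sqrt
from statistics import fmean, variance

def smd(treated, control):
    """Standardized mean difference: gap in means over the pooled SD."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (fmean(treated) - fmean(control)) / pooled_sd

# Hypothetical baseline values of a 0/1 covariate in each arm.
treated = [1] * 53 + [0] * 47
control = [1] * 48 + [0] * 52

value = smd(treated, control)
print(round(value, 3))
```

On the real file the same function would be applied to baseline rows only, for example the `female` values of households with treat equal to 1 versus 0.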

Summary statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`id`	100%	4,000	2,000	n/a	n/a	n/a	n/a	n/a
`age`	100%	4,000	49	18.00	35.13	35.00	68.00	9.65
`female`	100%	4,000	2	0	0.508	1.00	1.00	0.500
`poverty`	100%	4,000	2	0	0.312	0	1.00	0.464
`edu`	100%	4,000	13	6.00	12.03	12.00	18.00	1.99
`treat`	100%	4,000	2	0	0.518	1.00	1.00	0.500
`wave`	100%	4,000	2	n/a	n/a	n/a	n/a	n/a
`year`	100%	4,000	2	2021	2022.5	2022.5	2024	1.50
`post`	100%	4,000	2	0	0.500	0.500	1.00	0.500
`D`	100%	4,000	2	0	0.231	0	1.00	0.421
`alpha`	100%	4,000	4,000	-0.920	0.005	0.004	0.994	0.302
`eps`	100%	4,000	4,000	-1.23	0.002	0.006	1.17	0.302
`y`	100%	4,000	3,994	8.45	10.06	10.06	11.55	0.439
`y0`	100%	4,000	1,997	8.45	10.02	10.01	11.48	0.435
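The mean of D in the table can be cross-checked against the compliance rates quoted in the variable dictionary. The short calculation below uses only numbers stated in this document; the take-up rates are approximate, so the match is expected to be close rather than exact.

```python
share_offered = 0.518  # mean of treat in the statistics table
takeup_offered = 0.85  # approximate take-up among the offered
takeup_control = 0.05  # approximate take-up among controls
share_endline = 0.5    # mean of post: half of the rows are endline rows

# Share of households receiving the transfer at endline.
receipt_at_endline = (
    share_offered * takeup_offered + (1 - share_offered) * takeup_control
)

# D is zero on every baseline row, so the pooled mean is half of that.
implied_mean_D = share_endline * receipt_at_endline

print(round(receipt_at_endline, 3), round(implied_mean_D, 3))
```

The implied pooled mean is about 0.232, in line with the 0.231 reported for D.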

Known limitations & caveats

Synthetic data. There is no real program behind this tutorial; values are simulated with a known 0.12 log-point effect. Results are internally consistent with the calibration but are not empirical evidence about real cash-transfer programs.
Two periods only. The panel has a single baseline (2021) and endline (2024), so it cannot test pre-treatment (parallel) trends or estimate dynamic/staggered effects.
Homogeneous effect by construction. The treatment effect is the same for all households, so ATE and ATT nearly coincide; real programs typically show heterogeneity worth exploring by subgroup.
Imperfect compliance. Random assignment treat ≠ actual receipt D (85% take-up among the offered, 5% among controls); offer-based (ITT) and receipt-based estimates differ — use treat for the policy offer and the IV/DR-receipt models for the effect of receipt.
DGP internals. alpha and eps are simulation components (household and idiosyncratic terms), and y0 is the baseline outcome carried to both rows; they are exposed for transparency and are not analysis covariates in the tutorial.