Interactive data dictionary

Evaluating a Cash Transfer Program (RCT) with Panel Data

The simulated two-wave household panel behind the Stata RCT tutorial, with a known 12% true effect.

1 dataset · 14 variables · 2,000 households · waves: 2021 & 2024

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × variables | Stata | Source |
| --- | --- | --- | --- | --- |
| dataSIM4RCT | household-year (two waves) | 4,000 × 14 | dataSIM4RCT.dta | dataSIM4RCT.dta |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
use "${BASE}dataSIM4RCT.dta", clear
describe
notes

Python

# notebook shell escape (Colab/Jupyter); from a terminal run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
df = pd.read_stata(BASE + "dataSIM4RCT.dta")

# load every dataset at once
files = ["dataSIM4RCT"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "dataSIM4RCT.dta", "dataSIM4RCT.dta")
df, meta = pyreadstat.read_dta("dataSIM4RCT.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb
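The labels stored in a labeled .dta can also be read with pandas alone. The sketch below is a minimal illustration on a toy file written to a temporary folder, so it runs offline; the toy values and the file name toy_panel.dta are assumptions, and only the three labels are copied from this dictionary. Pointing the reader at a downloaded dataSIM4RCT.dta should work the same way.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for dataSIM4RCT.dta (values are illustrative, not real data).
toy = pd.DataFrame({"id": [1, 1], "year": [2021, 2024], "y": [10.0, 10.1]})
toy_labels = {
    "id": "Household identifier",
    "year": "Survey year (2021 or 2024)",
    "y": "Log monthly consumption (outcome)",
}

# Write a labeled .dta to a temporary folder (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_panel.dta")
toy.to_stata(path, write_index=False, variable_labels=toy_labels)

# iterator=True returns a StataReader, which exposes the variable labels.
with pd.read_stata(path, iterator=True) as reader:
    labels = reader.variable_labels()

for name, label in labels.items():
    print(f"{name:<6} {label}")
```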

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
df <- read_dta(paste0(BASE, "dataSIM4RCT.dta"))

Overview & sources

Companion data for a hands-on Stata tutorial that evaluates the causal effect of a cash transfer program on household consumption. The data are fully synthetic: 2,000 households in a developing country, observed in a balanced panel across a 2021 baseline and a 2024 endline (4,000 observations). The outcome is log monthly consumption and the program raises it by a known true effect of 12% (0.12 log points). The tutorial walks from baseline-balance checks through three cross-sectional estimators — regression adjustment (RA), inverse probability weighting (IPW), and doubly robust AIPW/IPWRA — then difference-in-differences and doubly robust DiD on the panel, and an endogenous-treatment IV model for imperfect compliance. Because the ground truth is known, every estimate can be checked against 0.12.
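Because the outcome is in logs, "12%" is the usual small-effect approximation to 0.12 log points. A short worked conversion, using only the 0.12 value stated above, shows the exact proportional change that a 0.12 shift in log consumption implies (about 12.7%).

```python
import math

TRUE_EFFECT_LOG_POINTS = 0.12  # true effect stated in the data description

# Exact proportional change implied by a 0.12 shift in log consumption.
exact_pct_change = math.exp(TRUE_EFFECT_LOG_POINTS) - 1.0

print(f"log points:     {TRUE_EFFECT_LOG_POINTS:.2f}")
print(f"exact % change: {100 * exact_pct_change:.2f}%")
```

Estimates should be compared with 0.12 on the log scale, not with the converted percentage.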

One file, long panel. dataSIM4RCT.dta is a strongly balanced household panel — two rows per household (2021 baseline and 2024 endline), keyed by id × year. Random assignment to the program offer (treat) is fixed within a household; actual receipt (D) turns on only at endline and only for compliers (imperfect take-up: 85% of the offered, 5% of controls). The variables wave, year, and post are three encodings of the same two-period time axis; alpha and eps are data-generating-process internals exposed for transparency.
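The structural claims in this paragraph can be checked mechanically once the file is loaded. The sketch below is one possible set of checks, not part of the tutorial; check_panel is a hypothetical helper name, and it is demonstrated on a four-row toy frame so that it runs without a download. Column names follow the dictionary (id, year, wave, post, treat, D).

```python
import pandas as pd


def check_panel(df: pd.DataFrame) -> dict:
    """Return boolean structure checks for a two-wave household panel."""
    by_id = df.groupby("id")
    post_from_year = (df["year"] == 2024).astype(int)
    return {
        # id x year identifies each row exactly once
        "key_unique": not df.duplicated(["id", "year"]).any(),
        # strongly balanced: every household has exactly two rows
        "two_rows_per_household": bool((by_id.size() == 2).all()),
        # assignment does not change between waves
        "treat_fixed_within_household": bool((by_id["treat"].nunique() == 1).all()),
        # receipt can only switch on at endline
        "no_receipt_at_baseline": bool((df.loc[df["post"] == 0, "D"] == 0).all()),
        # wave, year, and post encode the same time axis
        "time_encodings_agree": bool(
            ((df["post"] == post_from_year) & (df["wave"] == df["post"] + 1)).all()
        ),
    }


# Toy frame with two households (illustrative values only).
toy = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "year":  [2021, 2024, 2021, 2024],
    "wave":  [1, 2, 1, 2],
    "post":  [0, 1, 0, 1],
    "treat": [1, 1, 0, 0],
    "D":     [0, 1, 0, 0],
})
checks = check_panel(toy)
print(checks)
```

With the real file, check_panel(df) can be called on the frame returned by pd.read_stata.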

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| Synthetic (this study) | All values: simulated household panel with a calibrated, known 12% true effect (open & reproducible) | Mendez, C. (2026). See the post's Stata do-file analysis.do and the tutorial for the design. |
| Method references | Estimators and concepts (RA / IPW / doubly-robust AIPW-IPWRA / DiD / DR-DiD / endogenous-treatment IV) | Stata teffects manual; Sant'Anna & Zhao (2020); Imbens & Rubin (2015); Rios-Avila, Sant'Anna & Callaway (drdid). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata [Data set]. https://carlos-mendez.org/post/stata_rct/

Sant'Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122. https://doi.org/10.1016/j.jeconom.2020.06.003

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.

StataCorp. teffects: Treatment-effects estimation. Stata Treatment-Effects Reference Manual.

BibTeX

@misc{mendez2026statarct,
  author       = {Mendez, Carlos},
  title        = {Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_rct/}},
  note         = {Data set}
}

@article{santanna2020doubly,
  author  = {Sant'Anna, Pedro H. C. and Zhao, Jun},
  title   = {Doubly robust difference-in-differences estimators},
  journal = {Journal of Econometrics},
  volume  = {219}, number = {1}, pages = {101--122}, year = {2020},
  doi     = {10.1016/j.jeconom.2020.06.003}
}
@book{imbens2015causal,
  author    = {Imbens, Guido W. and Rubin, Donald B.},
  title     = {Causal Inference for Statistics, Social, and Biomedical Sciences},
  publisher = {Cambridge University Press}, year = {2015}
}

Variable explorer (all 14 variables)


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| D | dummy | share coded 1 = 0.231 | Receipt of cash transfer (endogenous) | Actual receipt of the transfer; endogenous take-up, non-zero only at endline. | 0/1 | dataSIM4RCT | Simulation |
| age | continuous | min 18, median 35, max 68 | Age of household head | Age in years of the household head (time-invariant in this panel). | years | dataSIM4RCT | Simulation |
| alpha | continuous | min -0.92, median 0.00406, max 0.994 | Household DGP component (random effect) | Simulation random component contributing to consumption; not a tutorial covariate. | log scale | dataSIM4RCT | Simulation (DGP internal) |
| edu | continuous | min 6, median 12, max 18 | Years of education (household head) | Years of education of the household head (time-invariant in this panel). | years | dataSIM4RCT | Simulation |
| eps | continuous | min -1.23, median 0.00606, max 1.17 | Idiosyncratic DGP error term | Simulation idiosyncratic shock to consumption; not a tutorial covariate. | log scale | dataSIM4RCT | Simulation (DGP internal) |
| female | dummy | share coded 1 = 0.508 | Female-headed household | 1 if the household head is female, else 0 (the chance baseline imbalance, SMD ~9.3%). | 0/1 | dataSIM4RCT | Simulation |
| id | identifier | | Household identifier | Unique household ID; the panel unit, repeated across the two waves. | integer | dataSIM4RCT | Simulation |
| post | dummy | share coded 1 = 0.500 | Endline indicator (1 = 2024) | Binary post-treatment period flag; 1 at endline, 0 at baseline. | 0/1 | dataSIM4RCT | Simulation |
| poverty | dummy | share coded 1 = 0.312 | Poverty status at baseline | 1 if the household is in poverty at baseline (the randomization stratum), else 0. | 0/1 | dataSIM4RCT | Simulation |
| treat | dummy | share coded 1 = 0.518 | Assignment to offer (intent-to-treat) | Random assignment to the program offer; exogenous, fixed within household across waves. | 0/1 | dataSIM4RCT | Simulation (randomized) |
| wave | identifier | | Survey wave index (1=baseline, 2=endline) | Integer wave index; an alternative encoding of the time axis to year/post. | 1/2 | dataSIM4RCT | Simulation |
| y | continuous | min 8.45, median 10.1, max 11.6 | Log monthly consumption (outcome) | Outcome variable: natural log of monthly household consumption in each wave. | log of monetary units | dataSIM4RCT | Simulation |
| y0 | continuous | min 8.45, median 10, max 11.5 | Baseline log consumption (pre-treatment) | Each household's 2021 baseline value of y, carried to both rows for ANCOVA-style adjustment. | log of monetary units | dataSIM4RCT | Simulation |
| year | year | | Survey year (2021 or 2024) | Calendar year of the survey wave. | year | dataSIM4RCT | Simulation |
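The female row quotes a standardized mean difference (SMD) of about 9.3% between assignment arms. The dictionary does not say which SMD formula the tutorial uses, so the sketch below assumes one common definition: the difference in means divided by the square root of the average of the two group variances. The function name and the toy inputs are illustrative; on the real data the inputs would be baseline-wave values of female for treat == 1 and treat == 0.

```python
import numpy as np


def standardized_mean_difference(x_treated, x_control) -> float:
    """SMD = (mean_1 - mean_0) / sqrt((var_1 + var_0) / 2). Assumed formula."""
    x1 = np.asarray(x_treated, dtype=float)
    x0 = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    return float((x1.mean() - x0.mean()) / pooled_sd)


# Toy 0/1 covariate for four offered and four control households.
smd_toy = standardized_mean_difference([1, 1, 0, 1], [0, 1, 0, 1])
print(f"SMD on toy data: {smd_toy:.3f}")
```

A common rule of thumb treats absolute SMDs under 0.10 as acceptable balance, which is why a value near 0.093 is described here as a chance imbalance.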

Cross-file variable index

Which file each variable appears in (● = present).

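| Variable | dataSIM4RCT |
| --- | --- |
| D | ● |
| age | ● |
| alpha | ● |
| edu | ● |
| eps | ● |
| female | ● |
| id | ● |
| post | ● |
| poverty | ● |
| treat | ● |
| wave | ● |
| y | ● |
| y0 | ● |
| year | ● |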

Construction & formulas

The outcome y is log monthly consumption. The causal target is the program's average effect, with a known true value of 0.12 log points (≈12%). Two estimands recur throughout:

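Five estimation strategies are applied to these data: regression adjustment (RA), inverse probability weighting (IPW), doubly robust estimation (AIPW/IPWRA), difference-in-differences (DiD), and doubly robust DiD (DR-DiD).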

Imperfect compliance is handled by using random assignment treat as an instrument for receipt D (endogenous-treatment regression, etregress), separating the effect of the offer (ITT) from the effect of receipt.
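The link between the offer effect and the receipt effect can be illustrated with simple arithmetic. The sketch below uses the Wald ratio (ITT divided by the difference in take-up), which is a simpler stand-in for etregress and not the same estimator. It assumes that the 0.12 bump applies only to recipients, that the effect is constant across households, and that assignment affects consumption only through receipt; the 85% and 5% take-up rates are the ones stated in this dictionary.

```python
def wald_estimate(itt_effect: float, takeup_offered: float, takeup_control: float) -> float:
    """Effect of receipt implied by an ITT effect and the take-up gap."""
    first_stage = takeup_offered - takeup_control
    if first_stage == 0:
        raise ValueError("assignment does not shift take-up; ratio undefined")
    return itt_effect / first_stage


TAKEUP_OFFERED = 0.85      # share of offered households that receive
TAKEUP_CONTROL = 0.05      # share of control households that receive
EFFECT_OF_RECEIPT = 0.12   # known true effect, in log points

# Under the assumptions above, the offer effect is diluted by non-compliance.
implied_itt = EFFECT_OF_RECEIPT * (TAKEUP_OFFERED - TAKEUP_CONTROL)
recovered = wald_estimate(implied_itt, TAKEUP_OFFERED, TAKEUP_CONTROL)

print(f"implied ITT:       {implied_itt:.3f}")
print(f"recovered effect:  {recovered:.3f}")
```

Under these assumptions, a comparison by assignment alone would sit near 0.096 rather than 0.12, and dividing by the 0.80 take-up gap recovers the effect of receipt.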

Data-generating process. For household i, log consumption is built from a baseline level plus a household component alpha, an idiosyncratic shock eps, a common time trend, and the 0.12 treatment bump applied at endline to households that receive the transfer. The baseline value is stored as y0 and carried to both rows; wave/year/post encode the two-period time axis.
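A process of this shape can be sketched in a few lines. This is not the author's simulation code: the baseline level, the time trend, and the use of simple rather than stratified randomization are assumptions, the two standard deviations are set near the 0.302 reported in the statistics table, and alpha is held fixed within a household for simplicity. Only the 2,000 households, the 0.12 effect, and the 85%/5% take-up rates come from this dictionary.

```python
import numpy as np
import pandas as pd

BASE_LEVEL = 10.0   # assumed baseline level of log consumption
TIME_TREND = 0.05   # assumed common change between 2021 and 2024
SD_ALPHA = 0.3      # household component spread (close to reported 0.302)
SD_EPS = 0.3        # idiosyncratic shock spread (close to reported 0.302)


def simulate_panel(n_households=2000, true_effect=0.12, seed=0):
    """Simulate a two-wave panel in the spirit of the described DGP."""
    rng = np.random.default_rng(seed)
    ids = np.arange(1, n_households + 1)
    treat = rng.binomial(1, 0.5, size=n_households)   # simple randomization (assumed)
    takeup = rng.binomial(1, np.where(treat == 1, 0.85, 0.05))
    alpha = rng.normal(0.0, SD_ALPHA, size=n_households)

    waves = []
    for wave, (year, post) in enumerate(((2021, 0), (2024, 1)), start=1):
        eps = rng.normal(0.0, SD_EPS, size=n_households)
        received = takeup * post                      # receipt only at endline
        y = BASE_LEVEL + alpha + TIME_TREND * post + true_effect * received + eps
        waves.append(pd.DataFrame({
            "id": ids, "wave": wave, "year": year, "post": post,
            "treat": treat, "D": received, "alpha": alpha, "eps": eps, "y": y,
        }))

    panel = pd.concat(waves, ignore_index=True)
    baseline_y = panel.loc[panel["post"] == 0].set_index("id")["y"]
    panel["y0"] = panel["id"].map(baseline_y)         # baseline y on both rows
    return panel.sort_values(["id", "year"]).reset_index(drop=True)


panel = simulate_panel()
print(panel.head())
print(panel.groupby(["treat", "post"])["D"].mean())
```

The last line prints take-up by arm and period, which should be zero at baseline and close to 0.85 and 0.05 at endline.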

The datasets

Each dataset is shown with its full variable dictionary and a statistics table covering distributions and data coverage.


household-year (two waves) · 4,000 × 14 · 2021, 2024 · 2,000 households (strongly balanced; 4,000 rows)

Panel key: id × year · Evaluate a cash-transfer program (RA / IPW / DR / DiD / DR-DiD / endogenous-treatment IV) against a known 0.12 effect.
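With a two-group, two-period panel, the simplest difference-in-differences estimate is a combination of four cell means. The sketch below is a minimal illustration of that arithmetic, not the tutorial's Stata code; did_2x2 is a hypothetical helper, the toy numbers are made up, and grouping on treat means the result is an effect of the offer, not of receipt.

```python
import pandas as pd


def did_2x2(df: pd.DataFrame, outcome="y", group="treat", period="post") -> float:
    """(treated post - treated pre) - (control post - control pre)."""
    means = df.groupby([group, period])[outcome].mean()
    change_treated = means.loc[(1, 1)] - means.loc[(1, 0)]
    change_control = means.loc[(0, 1)] - means.loc[(0, 0)]
    return float(change_treated - change_control)


# Toy panel: two offered and two control households (illustrative values).
toy = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3, 4, 4],
    "post":  [0, 1, 0, 1, 0, 1, 0, 1],
    "treat": [1, 1, 1, 1, 0, 0, 0, 0],
    "y":     [10.0, 10.3, 10.2, 10.5, 9.9, 10.1, 10.1, 10.3],
})
did_toy = did_2x2(toy)
print(f"DiD on toy data: {did_toy:.3f}")
```

Here the offered group gains 0.3 and the control group gains 0.2, so the estimate is 0.1.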

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| id | identifier | Household identifier | Unique household ID; the panel unit, repeated across the two waves. | 1..2000, one per household. | integer | Simulation | 100% |
| age | continuous | Age of household head | Age in years of the household head (time-invariant in this panel). | Drawn at baseline; held fixed across waves. | years | Simulation | 100% |
| female | dummy | Female-headed household | 1 if the household head is female, else 0 (the chance baseline imbalance, SMD ~9.3%). | Drawn at baseline; held fixed across waves. | 0/1 | Simulation | 100% |
| poverty | dummy | Poverty status at baseline | 1 if the household is in poverty at baseline (the randomization stratum), else 0. | Drawn at baseline; randomization was stratified on this variable. | 0/1 | Simulation | 100% |
| edu | continuous | Years of education (household head) | Years of education of the household head (time-invariant in this panel). | Drawn at baseline; held fixed across waves. | years | Simulation | 100% |
| treat | dummy | Assignment to offer (intent-to-treat) | Random assignment to the program offer; exogenous, fixed within household across waves. | Stratified (block) randomization within poverty strata; ~52% assigned. | 0/1 | Simulation (randomized) | 100% |
| wave | identifier | Survey wave index (1=baseline, 2=endline) | Integer wave index; an alternative encoding of the time axis to year/post. | 1 for the 2021 baseline, 2 for the 2024 endline. | 1/2 | Simulation | 100% |
| year | year | Survey year (2021 or 2024) | Calendar year of the survey wave. | 2021 for the baseline wave, 2024 for the endline wave. | year | Simulation | 100% |
| post | dummy | Endline indicator (1 = 2024) | Binary post-treatment period flag; 1 at endline, 0 at baseline. | 1 if year==2024 (endline), else 0. | 0/1 | Simulation | 100% |
| D | dummy | Receipt of cash transfer (endogenous) | Actual receipt of the transfer; endogenous take-up, non-zero only at endline. | 0 at baseline; at endline ~85% of the offered and ~5% of controls receive (imperfect compliance). | 0/1 | Simulation | 100% |
| alpha | continuous | Household DGP component (random effect) | Simulation random component contributing to consumption; not a tutorial covariate. | Generated by the data-generating process (household/wave-level term). | log scale | Simulation (DGP internal) | 100% |
| eps | continuous | Idiosyncratic DGP error term | Simulation idiosyncratic shock to consumption; not a tutorial covariate. | Generated by the data-generating process (per-observation noise). | log scale | Simulation (DGP internal) | 100% |
| y | continuous | Log monthly consumption (outcome) | Outcome variable: natural log of monthly household consumption in each wave. | Baseline level + household and time components, plus the 0.12 treatment bump at endline for recipients. | log of monetary units | Simulation | 100% |
| y0 | continuous | Baseline log consumption (pre-treatment) | Each household's 2021 baseline value of y, carried to both rows for ANCOVA-style adjustment. | y at the baseline wave; constant within household across the two rows. | log of monetary units | Simulation | 100% |
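The y0 column exists so that the endline outcome can be regressed on assignment while controlling for the baseline outcome. The sketch below shows that ANCOVA-style regression with ordinary least squares in NumPy; it is an illustration under stated assumptions, not the tutorial's specification. The helper name and the noise-free toy data (built so that the coefficient on treat is exactly 0.1) are made up.

```python
import numpy as np


def ancova_effect(y_endline, treat, y0) -> float:
    """OLS of endline y on a constant, treat, and baseline y0; returns the treat coefficient."""
    y = np.asarray(y_endline, dtype=float)
    X = np.column_stack([
        np.ones(len(y)),
        np.asarray(treat, dtype=float),
        np.asarray(y0, dtype=float),
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])


# Noise-free toy data: y = 1.0 + 0.9 * y0 + 0.1 * treat (illustrative only).
y0_toy = np.array([9.8, 10.0, 10.2, 10.4, 9.9, 10.1])
treat_toy = np.array([1, 0, 1, 0, 1, 0])
y_toy = 1.0 + 0.9 * y0_toy + 0.1 * treat_toy

effect_toy = ancova_effect(y_toy, treat_toy, y0_toy)
print(f"ANCOVA coefficient on treat: {effect_toy:.3f}")
```

On the real file the inputs would be the endline rows only (post == 1), since y0 is repeated on both rows of each household.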

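Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | | 100% | 4,000 | 2,000 | | | | | |
| age | min 18, median 35, max 68 | 100% | 4,000 | 49 | 18.00 | 35.13 | 35.00 | 68.00 | 9.65 |
| female | share coded 1 = 0.508 | 100% | 4,000 | 2 | 0 | 0.508 | 1.00 | 1.00 | 0.500 |
| poverty | share coded 1 = 0.312 | 100% | 4,000 | 2 | 0 | 0.312 | 0 | 1.00 | 0.464 |
| edu | min 6, median 12, max 18 | 100% | 4,000 | 13 | 6.00 | 12.03 | 12.00 | 18.00 | 1.99 |
| treat | share coded 1 = 0.518 | 100% | 4,000 | 2 | 0 | 0.518 | 1.00 | 1.00 | 0.500 |
| wave | | 100% | 4,000 | 2 | | | | | |
| year | | 100% | 4,000 | 2 | 2021 | 2022.5 | 2022 | 2024 | 1.50 |
| post | share coded 1 = 0.500 | 100% | 4,000 | 2 | 0 | 0.500 | 0.500 | 1.00 | 0.500 |
| D | share coded 1 = 0.231 | 100% | 4,000 | 2 | 0 | 0.231 | 0 | 1.00 | 0.421 |
| alpha | min -0.92, median 0.00406, max 0.994 | 100% | 4,000 | 4,000 | -0.920 | 0.005 | 0.004 | 0.994 | 0.302 |
| eps | min -1.23, median 0.00606, max 1.17 | 100% | 4,000 | 4,000 | -1.23 | 0.002 | 0.006 | 1.17 | 0.302 |
| y | min 8.45, median 10.1, max 11.6 | 100% | 4,000 | 3,994 | 8.45 | 10.06 | 10.06 | 11.55 | 0.439 |
| y0 | min 8.45, median 10, max 11.5 | 100% | 4,000 | 1,997 | 8.45 | 10.02 | 10.01 | 11.48 | 0.435 |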
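For the 0/1 variables, the reported SD is tied to the reported share by the Bernoulli formula sqrt(p(1 - p)), which gives a quick consistency check on the table. The sketch below applies it to the shares and SDs listed above; the 0.002 tolerance is an assumption chosen to absorb rounding to three decimals and the sample (n - 1) correction.

```python
import math


def bernoulli_sd(share: float) -> float:
    """Standard deviation of a 0/1 variable with the given share of ones."""
    return math.sqrt(share * (1.0 - share))


# (share coded 1, reported SD) copied from the statistics table.
reported = {
    "female":  (0.508, 0.500),
    "poverty": (0.312, 0.464),
    "treat":   (0.518, 0.500),
    "post":    (0.500, 0.500),
    "D":       (0.231, 0.421),
}

TOLERANCE = 0.002  # assumed allowance for rounding
gaps = {}
for name, (share, sd) in reported.items():
    gaps[name] = abs(bernoulli_sd(share) - sd)
    print(f"{name:<8} implied SD {bernoulli_sd(share):.3f}  reported {sd:.3f}")
```

The same idea reproduces the mean of D: half the rows are endline, so 0.5 × (0.518 × 0.85 + 0.482 × 0.05) ≈ 0.232, in line with the reported 0.231.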

Known limitations & caveats