Downloads
Each dataset is available as a labeled Stata .dta and its source file.
Download all data (ZIP)
stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| dataSIM4RCT | household-year (two waves) | 4,000 × 14 | dataSIM4RCT.dta | dataSIM4RCT.dta |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
use "${BASE}dataSIM4RCT.dta", clear
describe
notes
Python
# Colab/Jupyter cell syntax; in a terminal run `pip install pyreadstat` instead
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
df = pd.read_stata(BASE + "dataSIM4RCT.dta")
# load every dataset at once
files = ["dataSIM4RCT"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "dataSIM4RCT.dta", "dataSIM4RCT.dta")
df, meta = pyreadstat.read_dta("dataSIM4RCT.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_rct/data/"
df <- read_dta(paste0(BASE, "dataSIM4RCT.dta"))
Overview & sources
Companion data for a hands-on Stata tutorial that evaluates the causal effect of a cash transfer program on household consumption. The data are fully synthetic: 2,000 households in a developing country, observed in a balanced panel across a 2021 baseline and a 2024 endline (4,000 observations). The outcome is log monthly consumption and the program raises it by a known true effect of 12% (0.12 log points). The tutorial walks from baseline-balance checks through three cross-sectional estimators — regression adjustment (RA), inverse probability weighting (IPW), and doubly robust AIPW/IPWRA — then difference-in-differences and doubly robust DiD on the panel, and an endogenous-treatment IV model for imperfect compliance. Because the ground truth is known, every estimate can be checked against 0.12.
dataSIM4RCT.dta is a strongly balanced household panel — two rows per household (2021 baseline and 2024 endline), keyed by id × year. Random assignment to the program offer (treat) is fixed within a household; actual receipt (D) turns on only at endline and only for compliers (imperfect take-up: 85% of the offered, 5% of controls). The variables wave, year, and post are three encodings of the same two-period time axis; alpha and eps are data-generating-process internals exposed for transparency.
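The structural claims above (two rows per household, `treat` fixed within household, `D` zero at baseline) are easy to verify after loading. The following is a minimal sketch, not part of the tutorial: the function name `check_panel` and the toy rows are hypothetical, and only the column names `id`, `year`, `treat`, and `D` come from the variable dictionary.

```python
from collections import defaultdict

def check_panel(rows):
    """Return a list of problems found in a two-wave household panel.

    `rows` is a list of dicts with keys id, year, treat, D.
    An empty list means every check passed.
    """
    by_id = defaultdict(list)
    for r in rows:
        by_id[r["id"]].append(r)

    problems = []
    for hid, obs in by_id.items():
        years = sorted(o["year"] for o in obs)
        if years != [2021, 2024]:
            problems.append(f"id {hid}: expected years [2021, 2024], got {years}")
        if len({o["treat"] for o in obs}) != 1:
            problems.append(f"id {hid}: treat varies within household")
        if any(o["D"] != 0 for o in obs if o["year"] == 2021):
            problems.append(f"id {hid}: D is non-zero at baseline")
    return problems

# Hypothetical toy rows for illustration, NOT taken from dataSIM4RCT.dta
toy = [
    {"id": 1, "year": 2021, "treat": 1, "D": 0},
    {"id": 1, "year": 2024, "treat": 1, "D": 1},
    {"id": 2, "year": 2021, "treat": 0, "D": 0},
    {"id": 2, "year": 2024, "treat": 0, "D": 0},
]
print(check_panel(toy))  # prints []
```

With the real file loaded through pandas, passing `df.to_dict("records")` to the same function should work, assuming the columns keep the names listed in the variable dictionary.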
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated household panel with a calibrated, known 12% true effect (open & reproducible) | Mendez, C. (2026). See the post's Stata do-file analysis.do and the tutorial for the design. |
| Method references | Estimators and concepts (RA / IPW / doubly-robust AIPW-IPWRA / DiD / DR-DiD / endogenous-treatment IV) | Stata teffects manual; Sant'Anna & Zhao (2020); Imbens & Rubin (2015); Rios-Avila, Sant'Anna & Callaway (drdid). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata [Data set]. https://carlos-mendez.org/post/stata_rct/
Sant'Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122. https://doi.org/10.1016/j.jeconom.2020.06.003
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.
StataCorp. teffects: Treatment-effects estimation. Stata Treatment-Effects Reference Manual.
BibTeX
@misc{mendez2026statarct,
author = {Mendez, Carlos},
title = {Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/stata_rct/}},
note = {Data set}
}
@article{santanna2020doubly,
author = {Sant'Anna, Pedro H. C. and Zhao, Jun},
title = {Doubly robust difference-in-differences estimators},
journal = {Journal of Econometrics},
volume = {219}, number = {1}, pages = {101--122}, year = {2020},
doi = {10.1016/j.jeconom.2020.06.003}
}
@book{imbens2015causal,
author = {Imbens, Guido W. and Rubin, Donald B.},
title = {Causal Inference for Statistics, Social, and Biomedical Sciences},
publisher = {Cambridge University Press}, year = {2015}
}
Variable explorer (all 14 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| D | dummy | Receipt of cash transfer (endogenous) | Actual receipt of the transfer; endogenous take-up, non-zero only at endline. | 0/1 | dataSIM4RCT | Simulation |
| age | continuous | Age of household head | Age in years of the household head (time-invariant in this panel). | years | dataSIM4RCT | Simulation |
| alpha | continuous | Household DGP component (random effect) | Simulation random component contributing to consumption; not a tutorial covariate. | log scale | dataSIM4RCT | Simulation (DGP internal) |
| edu | continuous | Years of education (household head) | Years of education of the household head (time-invariant in this panel). | years | dataSIM4RCT | Simulation |
| eps | continuous | Idiosyncratic DGP error term | Simulation idiosyncratic shock to consumption; not a tutorial covariate. | log scale | dataSIM4RCT | Simulation (DGP internal) |
| female | dummy | Female-headed household | 1 if the household head is female, else 0 (the chance baseline imbalance, SMD ~9.3%). | 0/1 | dataSIM4RCT | Simulation |
| id | identifier | Household identifier | Unique household ID; the panel unit, repeated across the two waves. | integer | dataSIM4RCT | Simulation |
| post | dummy | Endline indicator (1 = 2024) | Binary post-treatment period flag; 1 at endline, 0 at baseline. | 0/1 | dataSIM4RCT | Simulation |
| poverty | dummy | Poverty status at baseline | 1 if the household is in poverty at baseline (the randomization stratum), else 0. | 0/1 | dataSIM4RCT | Simulation |
| treat | dummy | Assignment to offer (intent-to-treat) | Random assignment to the program offer; exogenous, fixed within household across waves. | 0/1 | dataSIM4RCT | Simulation (randomized) |
| wave | identifier | Survey wave index (1=baseline, 2=endline) | Integer wave index; an alternative encoding of the time axis to year/post. | 1/2 | dataSIM4RCT | Simulation |
| y | continuous | Log monthly consumption (outcome) | Outcome variable: natural log of monthly household consumption in each wave. | log of monetary units | dataSIM4RCT | Simulation |
| y0 | continuous | Baseline log consumption (pre-treatment) | Each household's 2021 baseline value of y, carried to both rows for ANCOVA-style adjustment. | log of monetary units | dataSIM4RCT | Simulation |
| year | year | Survey year (2021 or 2024) | Calendar year of the survey wave. | year | dataSIM4RCT | Simulation |
Cross-file variable index
All 14 variables appear in the single file, dataSIM4RCT.
Construction & formulas
The outcome y is log monthly consumption. The causal target is the program's
average effect, with a known true value of 0.12 log points (≈12%).
Two estimands recur throughout:
- ATE (policymaker's quantity): ATE = E[Y(1) − Y(0)], the average effect if the program were scaled to everyone.
- ATT (evaluator's quantity): ATT = E[Y(1) − Y(0) | T=1], the average effect among the treated. DiD estimates the ATT only.
Five estimation strategies are applied to these data:
- Regression adjustment (RA): fit outcome models on treated and control, impute both potential outcomes, and average the difference: τ_RA = (1/N) Σ [μ̂_1(X_i) − μ̂_0(X_i)]. Consistent if the outcome model is correct (teffects ra).
- Inverse probability weighting (IPW): reweight by the inverse propensity score p̂(X) = Pr(T=1|X): τ_IPW = (1/N) Σ [T_i Y_i / p̂ − (1−T_i) Y_i / (1−p̂)]. Consistent if the treatment model is correct (teffects ipw).
- Doubly robust (AIPW / IPWRA): RA plus an IPW-weighted bias-correction term, τ_DR = τ_RA + (1/N) Σ [T_i(Y_i−μ̂_1)/p̂ − (1−T_i)(Y_i−μ̂_0)/(1−p̂)]. Consistent if either model is correct (teffects aipw / teffects ipwra).
- Difference-in-differences (DiD): the difference of pre/post changes, τ_DiD = (Ȳ_tr,post − Ȳ_tr,pre) − (Ȳ_ct,post − Ȳ_ct,pre), netting out time-invariant unobservables; estimates the ATT (xtdidregress).
- Doubly robust DiD (DR-DiD): Sant'Anna & Zhao (2020) extend the DR logic to changes ΔY, consistent if the outcome or the propensity model holds (drdid, xthdidregress aipw).
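To make the RA, IPW, and doubly robust formulas concrete, here is a minimal Python sketch on a hypothetical cross-section. It is not the tutorial's Stata code and does not use dataSIM4RCT.dta: the single binary covariate `x`, the intercept, the covariate coefficient, and the noise level are all assumptions chosen for illustration; only the 0.12 effect comes from the text. The outcome and propensity "models" are saturated cell means, the simplest possible choice.

```python
import random

random.seed(0)
N = 2000
TRUE_EFFECT = 0.12  # from the source; all other numbers below are assumed

# Hypothetical cross-section: one binary covariate x, randomized treatment t.
data = []
for _ in range(N):
    x = 1 if random.random() < 0.3 else 0
    t = 1 if random.random() < 0.5 else 0
    y = 10.0 - 0.3 * x + TRUE_EFFECT * t + random.gauss(0, 0.4)
    data.append((x, t, y))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Saturated "models": outcome cell means mu[t, x] and propensity shares p[x].
mu = {(t, x): mean(y for xi, ti, y in data if xi == x and ti == t)
      for t in (0, 1) for x in (0, 1)}
p = {x: mean(ti for xi, ti, _ in data if xi == x) for x in (0, 1)}

# The three estimators, written term by term as in the formulas above.
tau_ra = mean(mu[1, x] - mu[0, x] for x, _, _ in data)
tau_ipw = mean(t * y / p[x] - (1 - t) * y / (1 - p[x]) for x, t, y in data)
correction = mean(
    t * (y - mu[1, x]) / p[x] - (1 - t) * (y - mu[0, x]) / (1 - p[x])
    for x, t, y in data
)
tau_dr = tau_ra + correction

print(f"RA={tau_ra:.4f}  IPW={tau_ipw:.4f}  DR={tau_dr:.4f}")
```

Because both models here are saturated in one binary covariate, the three estimates coincide up to floating-point error and the correction term is zero. With parametric models and continuous covariates, as in teffects, they generally differ, which is where double robustness matters.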
Imperfect compliance is handled by using random assignment treat as an instrument
for receipt D (endogenous-treatment regression, etregress), separating
the effect of the offer (ITT) from the effect of receipt.
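The gap between the offer effect and the receipt effect follows from simple arithmetic on the stated take-up rates. The sketch below assumes the homogeneous 0.12 effect of receipt and uses the ratio (Wald) logic; it only illustrates why the two quantities differ and is not the model-based etregress estimator used in the tutorial. The variable names are hypothetical.

```python
TRUE_EFFECT = 0.12     # effect of receipt in log points (from the source)
TAKEUP_OFFERED = 0.85  # share of offered households that receive
TAKEUP_CONTROL = 0.05  # share of control households that receive

# First stage: how much the offer shifts the probability of receipt.
first_stage = TAKEUP_OFFERED - TAKEUP_CONTROL

# Intent-to-treat effect: the receipt effect diluted by incomplete take-up.
itt = TRUE_EFFECT * first_stage

# Wald ratio: rescaling the ITT by the first stage recovers the receipt effect.
wald = itt / first_stage

print(f"first stage = {first_stage:.2f}")       # 0.80
print(f"implied ITT = {itt:.3f} log points")    # 0.096
print(f"Wald ratio  = {wald:.2f} log points")   # 0.12
```

Under these assumptions, an estimator that compares households by assignment should land near 0.096, while one that targets the effect of receipt should land near 0.12.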
Data-generating process. For household i, log consumption is built
from a baseline level plus a household component alpha, an idiosyncratic shock
eps, a common time trend, and the 0.12 treatment bump applied at endline to households
that receive the transfer. The baseline value is stored as y0 and carried to both rows;
wave/year/post encode the two-period time axis.
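A rough Python sketch of a data-generating process with this shape follows. It is not the author's simulation code. The baseline level, trend, and standard deviations are assumed values; randomization is simple rather than stratified on poverty; and `alpha` is drawn once per household. Only the 0.12 effect, the 2,000 households, the roughly 52% offer share, and the 85%/5% take-up rates come from the text.

```python
import random

random.seed(1)
N_HOUSEHOLDS = 2000
TRUE_EFFECT = 0.12                                    # from the source
P_OFFER, P_TAKE_T, P_TAKE_C = 0.52, 0.85, 0.05        # approximate source values
BASE, TREND, SD_ALPHA, SD_EPS = 10.0, 0.05, 0.3, 0.3  # ASSUMED values

rows = []
for hid in range(1, N_HOUSEHOLDS + 1):
    treat = 1 if random.random() < P_OFFER else 0     # simple (unstratified) draw
    alpha = random.gauss(0, SD_ALPHA)                 # household component
    takes_up = random.random() < (P_TAKE_T if treat else P_TAKE_C)
    for post in (0, 1):
        d = 1 if (post == 1 and takes_up) else 0      # receipt only at endline
        eps = random.gauss(0, SD_EPS)                 # idiosyncratic shock
        y = BASE + alpha + TREND * post + TRUE_EFFECT * d + eps
        rows.append({"id": hid, "treat": treat, "post": post, "D": d, "y": y})

def cell_mean(treat, post):
    ys = [r["y"] for r in rows if r["treat"] == treat and r["post"] == post]
    return sum(ys) / len(ys)

# 2x2 difference-in-differences by assignment group.
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
share_d = sum(r["D"] for r in rows) / len(rows)

print(f"DiD by assignment: {did:.3f}")
print(f"share with D = 1:  {share_d:.3f}")
```

Because groups are compared by assignment and take-up is incomplete, the DiD here should sit near 0.12 × (0.85 − 0.05) = 0.096 rather than 0.12, up to sampling noise. The share of rows with D = 1 should sit near 0.5 × (0.52 × 0.85 + 0.48 × 0.05) ≈ 0.23, in line with the mean of D reported in the statistics table.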
The datasets
Switch datasets with the tabs. Each shows the full variable dictionary plus a sortable statistics table with mini distributions and data coverage.
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| id | identifier | Household identifier | Unique household ID; the panel unit, repeated across the two waves. | 1..2000, one per household. | integer | Simulation | 100% |
| age | continuous | Age of household head | Age in years of the household head (time-invariant in this panel). | Drawn at baseline; held fixed across waves. | years | Simulation | 100% |
| female | dummy | Female-headed household | 1 if the household head is female, else 0 (the chance baseline imbalance, SMD ~9.3%). | Drawn at baseline; held fixed across waves. | 0/1 | Simulation | 100% |
| poverty | dummy | Poverty status at baseline | 1 if the household is in poverty at baseline (the randomization stratum), else 0. | Drawn at baseline; randomization was stratified on this variable. | 0/1 | Simulation | 100% |
| edu | continuous | Years of education (household head) | Years of education of the household head (time-invariant in this panel). | Drawn at baseline; held fixed across waves. | years | Simulation | 100% |
| treat | dummy | Assignment to offer (intent-to-treat) | Random assignment to the program offer; exogenous, fixed within household across waves. | Stratified (block) randomization within poverty strata; ~52% assigned. | 0/1 | Simulation (randomized) | 100% |
| wave | identifier | Survey wave index (1=baseline, 2=endline) | Integer wave index; an alternative encoding of the time axis to year/post. | 1 for the 2021 baseline, 2 for the 2024 endline. | 1/2 | Simulation | 100% |
| year | year | Survey year (2021 or 2024) | Calendar year of the survey wave. | 2021 for the baseline wave, 2024 for the endline wave. | year | Simulation | 100% |
| post | dummy | Endline indicator (1 = 2024) | Binary post-treatment period flag; 1 at endline, 0 at baseline. | 1 if year==2024 (endline), else 0. | 0/1 | Simulation | 100% |
| D | dummy | Receipt of cash transfer (endogenous) | Actual receipt of the transfer; endogenous take-up, non-zero only at endline. | 0 at baseline; at endline ~85% of the offered and ~5% of controls receive (imperfect compliance). | 0/1 | Simulation | 100% |
| alpha | continuous | Household DGP component (random effect) | Simulation random component contributing to consumption; not a tutorial covariate. | Generated by the data-generating process (household/wave-level term). | log scale | Simulation (DGP internal) | 100% |
| eps | continuous | Idiosyncratic DGP error term | Simulation idiosyncratic shock to consumption; not a tutorial covariate. | Generated by the data-generating process (per-observation noise). | log scale | Simulation (DGP internal) | 100% |
| y | continuous | Log monthly consumption (outcome) | Outcome variable: natural log of monthly household consumption in each wave. | Baseline level + household and time components, plus the 0.12 treatment bump at endline for recipients. | log of monetary units | Simulation | 100% |
| y0 | continuous | Baseline log consumption (pre-treatment) | Each household's 2021 baseline value of y, carried to both rows for ANCOVA-style adjustment. | y at the baseline wave; constant within household across the two rows. | log of monetary units | Simulation | 100% |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| id | 100% | 4,000 | 2,000 | – | – | – | – | – |
| age | 100% | 4,000 | 49 | 18.00 | 35.13 | 35.00 | 68.00 | 9.65 |
| female | 100% | 4,000 | 2 | 0 | 0.508 | 1.00 | 1.00 | 0.500 |
| poverty | 100% | 4,000 | 2 | 0 | 0.312 | 0 | 1.00 | 0.464 |
| edu | 100% | 4,000 | 13 | 6.00 | 12.03 | 12.00 | 18.00 | 1.99 |
| treat | 100% | 4,000 | 2 | 0 | 0.518 | 1.00 | 1.00 | 0.500 |
| wave | 100% | 4,000 | 2 | – | – | – | – | – |
| year | 100% | 4,000 | 2 | 2021 | 2022.5 | 2022 | 2024 | 1.50 |
| post | 100% | 4,000 | 2 | 0 | 0.500 | 0.500 | 1.00 | 0.500 |
| D | 100% | 4,000 | 2 | 0 | 0.231 | 0 | 1.00 | 0.421 |
| alpha | 100% | 4,000 | 4,000 | -0.920 | 0.005 | 0.004 | 0.994 | 0.302 |
| eps | 100% | 4,000 | 4,000 | -1.23 | 0.002 | 0.006 | 1.17 | 0.302 |
| y | 100% | 4,000 | 3,994 | 8.45 | 10.06 | 10.06 | 11.55 | 0.439 |
| y0 | 100% | 4,000 | 1,997 | 8.45 | 10.02 | 10.01 | 11.48 | 0.435 |
Known limitations & caveats
- Synthetic data. There is no real program behind this tutorial; values are simulated with a known 0.12 log-point effect. Results are internally consistent with the calibration but are not empirical evidence about real cash-transfer programs.
- Two periods only. The panel has a single baseline (2021) and endline (2024), so it cannot test pre-treatment (parallel) trends or estimate dynamic/staggered effects.
- Homogeneous effect by construction. The treatment effect is the same for all households, so ATE and ATT nearly coincide; real programs typically show heterogeneity worth exploring by subgroup.
- Imperfect compliance. Random assignment treat ≠ actual receipt D (85% take-up among the offered, 5% among controls); offer-based (ITT) and receipt-based estimates differ. Use treat for the policy offer and the IV/DR-receipt models for the effect of receipt.
- DGP internals. alpha and eps are simulation components (household and idiosyncratic terms), and y0 is the baseline outcome carried to both rows; they are exposed for transparency and are not analysis covariates in the tutorial.