Did after-school tutoring raise GPA? A disciplined evaluation
Nagoya University (GSID)
June 11, 2026
Act I
A fictitious government runs an after-school tutoring program in 10 of 35 high schools to raise the GPA of low-income students.
Look only at the treated schools and GPA jumps from 60 to 96. Spectacular. Or is something else rising too?
Interrupted Time Series — treated schools only. GPA leaps across the red treatment line from ~60 to ~96.
Act II
Outcome: gpa, a school’s mean GPA on a 0–100 scale.
A strongly balanced panel: 35 schools × 2 periods = 70 observations. The estimand is the ATT, \(E[Y_i(1)-Y_i(0)\mid D_i=1]\): the effect for the schools that actually got the program.
Treatment-timing heatmap. Treated schools (IDs 26–35) flip pre → post simultaneously at time 2; the 25 comparison schools never switch.
The comparison group (rising gently) supplies the dashed counterfactual: where the treated would have ended up without the program.
\[E[Y_{i,1}(0) - Y_{i,0}(0) \mid D=1] = E[Y_{i,1}(0) - Y_{i,0}(0) \mid D=0]\]
Different starting levels are fine. Different changes — divergent slopes — would break the design. This is the one assumption that does the causal work.
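A minimal sketch of why levels are harmless and slopes are not, using the group means from the deck's 2×2 table. The `extra_trend` argument is hypothetical: it stands for an untreated trend in the treated schools that differs from the comparison trend, which the data cannot reveal. Parallel trends asserts it is zero.

```python
def implied_effect(treated_pre, treated_post, comp_pre, comp_post, extra_trend=0.0):
    """Program effect implied by the data for a given (hypothetical) trend gap.

    The counterfactual post-period mean for treated schools is their own
    starting level plus the comparison group's change, plus any extra trend
    the treated group would have had without the program.
    """
    counterfactual_post = treated_pre + (comp_post - comp_pre) + extra_trend
    return treated_post - counterfactual_post


# Parallel trends holds: the counterfactual uses the comparison change as is.
baseline = implied_effect(60.17, 96.37, 71.22, 82.10)

# Different starting level: shift both treated means by the same 20 points.
shifted = implied_effect(60.17 + 20, 96.37 + 20, 71.22, 82.10)

# Divergent slope: treated schools would have gained 5 extra points anyway.
diverging = implied_effect(60.17, 96.37, 71.22, 82.10, extra_trend=5.0)

print(round(baseline, 2), round(shifted, 2), round(diverging, 2))
```

The level shift leaves the answer unchanged, while every point of divergent trend is misattributed one-for-one to the program.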
\[DiD = \Big(\bar{Y}_{1}^{T} - \bar{Y}_{0}^{T}\Big) - \Big(\bar{Y}_{1}^{C} - \bar{Y}_{0}^{C}\Big)\]
Treated change \(36.20\) minus comparison change \(10.88\) = the program’s effect, with the common time trend removed.
| Group | Pre | Post | Change |
|---|---|---|---|
| Comparison (25 schools) | 71.22 | 82.10 | 10.88 |
| Treated (10 schools) | 60.17 | 96.37 | 36.20 |
| DiD estimate | | | 25.32 |
About 30% of the treated group’s raw gain (10.88 of 36.20 points) was natural drift, not the program.
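The arithmetic behind the table can be reproduced in a few lines. This is a sketch on the four rounded group means only, in Python rather than the Stata commands the deck names, so the last digit may differ from estimates computed on the school-level data.

```python
# Group means copied from the table above (GPA on a 0-100 scale).
means = {
    "comparison": {"pre": 71.22, "post": 82.10},
    "treated": {"pre": 60.17, "post": 96.37},
}


def change(group):
    """Post-minus-pre change in mean GPA for one group."""
    return means[group]["post"] - means[group]["pre"]


treated_change = change("treated")
comparison_change = change("comparison")

# Difference-in-differences: treated change net of the common trend.
did = treated_change - comparison_change

# Share of the treated group's raw gain that the comparison trend explains.
drift_share = comparison_change / treated_change

print(f"treated change    = {treated_change:.2f}")
print(f"comparison change = {comparison_change:.2f}")
print(f"DiD               = {did:.2f}")
print(f"drift share       = {drift_share:.1%}")
```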
diff command: ATT = 25.315, SE 0.627, p < 0.001

| Contrast | Estimate | SE | Sig.? |
|---|---|---|---|
| Before: Diff (T−C) | −11.049 | 0.443 | yes |
| After: Diff (T−C) | 14.266 | 0.443 | yes |
| Diff-in-Diff | 25.315 | 0.627 | yes |
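
The third row follows from the first two. The sketch below recomputes it from the reported contrasts. It assumes the before and after contrasts are independent, so their variances add; that assumption is mine, chosen because it reproduces the unclustered SE in the table to within rounding.

```python
import math

# Reported treated-minus-comparison contrasts and their standard errors.
before_diff, se_before = -11.049, 0.443
after_diff, se_after = 14.266, 0.443

# DiD is the after-period gap minus the before-period gap.
did = after_diff - before_diff

# Assumption: independent contrasts, so the variances add.
se_did = math.sqrt(se_before**2 + se_after**2)

# Ratio of estimate to standard error, for a rough significance check.
t_stat = did / se_did

print(f"DiD = {did:.3f}, SE = {se_did:.3f}, t = {t_stat:.1f}")
```

Clustering by school relaxes the independence assumption, which is why a clustered SE for the same point estimate can differ.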
The coefficient on txp (Treat × Post) is the DiD.
\[Y_{it} = \alpha + \beta_1 \text{Treat}_i + \beta_2 \text{Post}_t + \beta_3 (\text{Treat}_i \times \text{Post}_t) + \varepsilon_{it}\]
\(\hat\beta_3 = 25.315\) (robust SE 0.615): the DiD.
The rest is nuisance: constant 71.22 (comparison pre-mean), \(\hat\beta_1 = -11.05\) (baseline gap), \(\hat\beta_2 = 10.89\) (common trend).
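To see why the interaction coefficient equals the DiD, the sketch below fits the saturated regression by ordinary least squares on a noise-free stand-in dataset. The stand-in is an assumption: each school is given exactly its group's rounded mean from the 2×2 table, so only point estimates are meaningful here, not standard errors. The small solver is written out to keep the example dependency-free.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Rounded cell means keyed by (treat, post); group sizes from the deck.
cell_means = {(0, 0): 71.22, (0, 1): 82.10, (1, 0): 60.17, (1, 1): 96.37}
n_schools = {0: 25, 1: 10}

rows, y = [], []
for (treat, post), cell_mean in cell_means.items():
    for _ in range(n_schools[treat]):
        rows.append([1.0, treat, post, treat * post])  # const, Treat, Post, txp
        y.append(cell_mean)

alpha, beta1, beta2, beta3 = ols(rows, y)
print(f"alpha={alpha:.2f} beta1={beta1:.2f} beta2={beta2:.2f} beta3={beta3:.2f}")
```

Because the model has four parameters for four cells, the coefficients are just differences of cell means. Last-digit gaps against the reported coefficients presumably reflect rounding in the table means.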
\[Y_{it} = \beta_3 (\text{Treat}_i \times \text{Post}_t) + \gamma_i + \vartheta_t + \varepsilon_{it}\]
\(\gamma_i\) wipes out permanent school differences.
\(\vartheta_t\) wipes out common shocks. What remains: \(\hat\beta_3 = 25.315\).
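In a balanced two-period panel, the two-way fixed effects estimate of \(\beta_3\) equals the difference in mean within-school changes, which makes the two "wipe-outs" easy to show directly. The data below are simulated under hypothetical values (`EFFECT`, `COMMON_SHOCK`, and the range of school effects are illustrative, not the deck's data), with no noise so the recovery is exact.

```python
import random

random.seed(0)

EFFECT = 25.0        # hypothetical program effect
COMMON_SHOCK = 10.0  # hypothetical period effect shared by all schools

schools = []
for school_id in range(1, 36):
    treated = school_id >= 26  # IDs 26-35 are treated, as in the deck
    # Permanent school effect; treated schools start lower on average.
    gamma = random.uniform(50.0, 80.0) - (10.0 if treated else 0.0)
    pre = gamma
    post = gamma + COMMON_SHOCK + (EFFECT if treated else 0.0)
    schools.append((treated, pre, post))


def mean(values):
    return sum(values) / len(values)


# Step 1: differencing within each school removes its permanent effect.
treated_changes = [post - pre for treated, pre, post in schools if treated]
comparison_changes = [post - pre for treated, pre, post in schools if not treated]

# Step 2: subtracting the comparison change removes the common shock.
beta3 = mean(treated_changes) - mean(comparison_changes)

print(f"recovered effect = {beta3:.3f}")
```

However different the school effects are, they never reach the estimate. Only a treated-specific trend would.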
| Method | Estimate | SE | Clustered? |
|---|---|---|---|
| diff (manual) | 25.315 | 0.627 | no |
| reg (interaction) | 25.315 | 0.615 | robust |
| didregress | 25.315 | 0.834 | yes |
| xtreg (TWFE) | 25.315 | 0.585 | yes |
| reghdfe (+ covariate) | 25.328 | 0.605 | yes |
Adding the female-share control moves the estimate by ~0.01 points. The design — not the covariates — does the heavy lifting.
Act III
ATT \(\hat\beta_3 = 25.32\) (SE 0.627), versus the naive ITS estimate of 36.20. The 10.88-point gap was secular drift.
\[Y_{it} = \alpha + \sum_{\substack{j=-m \\ j \neq -1}}^{q} \theta_j \, \text{Treat}_i \cdot \mathbf{1}[t = k+j] + \gamma_i + \vartheta_t + \varepsilon_{it}\]
Replace the single interaction with one coefficient \(\theta_j\) per period relative to the onset period \(k\). Leads (\(j<0\)) test pre-trends; lags (\(j\geq 0\)) trace dynamics. The base period \(j=-1\) is omitted, so every \(\theta_j\) is measured relative to it.
Event study. Pre-treatment coefficients hug zero; at onset the effect leaps to ~25 and holds. Bands are 95% CIs.
| Period | Coefficient | SE | Sig.? |
|---|---|---|---|
| lead 4 | 0.342 | 0.401 | no |
| lead 3 | −0.322 | 0.441 | no |
| lead 2 | 0.593 | 0.423 | no |
| lag 0 | 25.028 | 0.445 | yes |
| lag 1 | 24.705 | 0.559 | yes |
| lag 2 | 24.768 | 0.739 | yes |
| lag 3 | 25.701 | 0.797 | yes |
Lags span < 1 GPA point across four periods: no fade-out, no ramp-up — an immediate, sustained effect.
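Both readings of the table can be checked mechanically. The sketch below builds 95% intervals as coefficient ± 1.96 × SE; the normal critical value is an assumption, since the deck does not say which reference distribution its bands use.

```python
Z_95 = 1.96  # assumed normal critical value for a 95% interval

# (period, coefficient, standard error) copied from the event-study table.
event_study = [
    ("lead 4", 0.342, 0.401),
    ("lead 3", -0.322, 0.441),
    ("lead 2", 0.593, 0.423),
    ("lag 0", 25.028, 0.445),
    ("lag 1", 24.705, 0.559),
    ("lag 2", 24.768, 0.739),
    ("lag 3", 25.701, 0.797),
]


def interval(coef, se):
    """Symmetric 95% confidence interval under the normal approximation."""
    return coef - Z_95 * se, coef + Z_95 * se


leads = [(c, s) for name, c, s in event_study if name.startswith("lead")]
lags = [c for name, c, s in event_study if name.startswith("lag")]

# Pre-trend check: every lead interval should contain zero.
leads_cover_zero = all(lo <= 0.0 <= hi for lo, hi in (interval(c, s) for c, s in leads))

# Dynamics check: spread between the largest and smallest lag coefficient.
lag_span = max(lags) - min(lags)

print(f"all lead CIs cover zero: {leads_cover_zero}")
print(f"lag span: {lag_span:.3f} GPA points")
```

A lead interval that excluded zero would be the warning sign. Intervals that include it are only an absence of evidence against the design.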
Objection. Flat leads and five estimators that agree still cannot prove the comparison group is a valid counterfactual.
Response. Correct. The ATT is identified only under parallel trends and SUTVA (no spillovers, consistent treatment). The event study is consistent with parallel trends — it never proves them. A failed pre-trend would refute the design; a passed one only fails to refute it.