Difference-in-Differences in Python

From the classic 2×2 design to staggered adoption and honest sensitivity

5.12 · classic 2×2 ATT · true effect 5.0
2.41 · Callaway–Sant’Anna · TWFE biased to 2.18
M = 15 · HonestDiD breakdown

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

An education ministry rolls out AI tutors in some cities: did it work, or were scores already rising?

Some cities adopt AI tutoring bots; others do not. Scores climb in the treated cities.

But maybe those cities were already on the way up. How do you separate the policy from the trend it rode in on?

DiD uses the control group as a mirror for the treated group’s missing counterfactual

Both groups track in lockstep for periods 0–4, then the treated group jumps at treatment onset in period 5. The gap from period 5 onward is the effect.

Where we’re going

  • The logic: why differencing twice identifies the ATT under parallel trends
  • The classic 2×2 estimator and a formal pre-trends test
  • Event studies: the effect period by period
  • Why naive TWFE breaks under staggered timing — and how Callaway–Sant’Anna fixes it
  • HonestDiD: how wrong can parallel trends be before the answer flips?

The Investigation

Act II

DiD estimates the ATT — the effect on the units that actually got treated

\[\text{ATT} = E[Y^1_k - Y^0_k \mid \text{Post}]\]

The average gap between what treated units experienced and what they would have experienced without the policy.

\(E[Y^0_k\mid\text{Post}]\) — the treated group’s untreated outcome — is a counterfactual we never observe. DiD reconstructs it from the control group.

Difference twice: kill the level gap, then kill the common trend

\[\hat{\delta}^{2 \times 2}_{kU} = \big( \bar{Y}_k^{Post} - \bar{Y}_k^{Pre} \big) - \big( \bar{Y}_U^{Post} - \bar{Y}_U^{Pre} \big)\]

The first difference removes time-invariant differences between groups; the second removes the trend common to both.

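In expectation, algebra splits this into the ATT plus a non-parallel-trends bias:

\[\hat{\delta}^{2 \times 2}_{kU} = \underbrace{E[Y^1_k \mid \text{Post}] - E[Y^0_k \mid \text{Post}]}_{\text{ATT}} + \underbrace{\big( E[Y^0_k \mid \text{Post}] - E[Y^0_k \mid \text{Pre}] \big) - \big( E[Y^0_U \mid \text{Post}] - E[Y^0_U \mid \text{Pre}] \big)}_{\text{non-parallel-trends bias}}\]

The bias vanishes exactly when parallel trends holds, that is, when the treated group's untreated outcome would have changed by the same amount as the control group's.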
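To make the double difference concrete, here is a minimal sketch using four hypothetical cell means. The numbers and the helper `did_2x2` are illustrative assumptions, not values from the lab data. It shows the level gap and the common trend cancelling, and a parallel-trends violation passing one-for-one into the estimate.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences from four cell means."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical cell means: level gap of 2, common trend of 3, true effect of 5.
control_pre, control_post = 10.0, 13.0
treated_pre = control_pre + 2.0            # time-invariant level gap
treated_post = treated_pre + 3.0 + 5.0     # common trend + treatment effect

estimate = did_2x2(treated_pre, treated_post, control_pre, control_post)
print(estimate)  # 5.0

# Parallel trends violated: the treated group would have trended by 4, not 3.
biased_post = treated_pre + 4.0 + 5.0
biased = did_2x2(treated_pre, biased_post, control_pre, control_post)
print(biased)  # 6.0 = ATT 5.0 + bias 1.0
```

The level gap of 2 never appears in either result, and the extra unit of trend in the second case is indistinguishable from treatment effect.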

The lab: a 100-unit, 10-period panel built with a known true effect of 5.0

  • 1,000 observations — 100 units × 10 periods (0–9)
  • 50 treated, 50 control; treatment switches on at period 5
  • True effect = 5.0 baked in, so every estimate has a ground truth to hit

A built-in true_effect column gives every estimate a known target to hit.
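For readers who want to see what such a panel looks like without the package, the sketch below simulates one with the same dimensions using only the standard library. The data-generating process (unit levels, a common trend of 0.5 per period, unit-variance noise, seed 0) is an assumption made for illustration. It is not the internals of `generate_did_data`, which the source does not show.

```python
import random
from statistics import mean

random.seed(0)  # arbitrary seed for this sketch
N_UNITS, N_PERIODS, ONSET, TRUE_EFFECT = 100, 10, 5, 5.0

rows = []
for unit in range(N_UNITS):
    treated = int(unit < N_UNITS // 2)       # first 50 units are treated
    unit_level = random.gauss(10.0, 1.0)     # time-invariant unit effect
    for period in range(N_PERIODS):
        post = int(period >= ONSET)
        outcome = (unit_level
                   + 0.5 * period                   # trend common to both groups
                   + TRUE_EFFECT * treated * post   # effect only for treated, post
                   + random.gauss(0.0, 1.0))        # idiosyncratic noise
        rows.append({"unit": unit, "period": period, "treated": treated,
                     "post": post, "outcome": outcome})

def cell_mean(treated, post):
    """Mean outcome in one of the four group-by-period cells."""
    return mean(r["outcome"] for r in rows
                if r["treated"] == treated and r["post"] == post)

att_hat = ((cell_mean(1, 1) - cell_mean(1, 0))
           - (cell_mean(0, 1) - cell_mean(0, 0)))
print(len(rows), round(att_hat, 2))
```

Because the true effect is written into the simulation, the printed estimate can be checked directly against 5.0.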

Before treatment the two groups overlap; after, the treated box jumps far higher

Outcome distributions by group × period. Control (steel) and treated (orange) overlap pre-treatment near 10.6–11.1; the treated box jumps to ~18.9 post-treatment.

The shaded gap is the causal effect: treated outcomes minus their projected counterfactual

The teal dashed line is the control group’s path shifted to the treated group’s pre-level — the no-treatment counterfactual. The shaded gap is the ~5.1 ATT.

Six lines fit the classic 2×2 estimator with the diff-diff package

from diff_diff import DifferenceInDifferences, generate_did_data
data = generate_did_data(n_units=100, n_periods=10,
                         treatment_effect=5.0, treatment_period=5, seed=42)
did = DifferenceInDifferences()
res = did.fit(data, outcome="outcome", treatment="treated", time="post")
res.print_summary()   # ATT = 5.1216, 95% CI [4.6399, 5.6034]

The classic estimator recovers 5.12 — within 2.4% of the true 5.0

5.12

Classic 2×2 \(\widehat{\text{ATT}}\) (SE 0.25, \(t = 20.9\)) · 95% CI [4.64, 5.60] covers the true 5.0
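The same 2×2 estimate can be read off as the interaction coefficient in a regression of the outcome on a treated dummy, a post dummy, and their product. The sketch below checks this with NumPy least squares on eight hypothetical observations. It is an illustration of the equivalence, not a description of how the diff-diff package computes its estimate or its standard errors.

```python
import numpy as np

# Hypothetical observations: (treated, post, outcome)
obs = [(0, 0, 10.0), (0, 0, 11.0), (0, 1, 13.0), (0, 1, 14.0),
       (1, 0, 12.0), (1, 0, 13.0), (1, 1, 20.0), (1, 1, 21.0)]
treated = np.array([o[0] for o in obs], dtype=float)
post = np.array([o[1] for o in obs], dtype=float)
y = np.array([o[2] for o in obs])

# Design matrix: intercept, treated, post, treated x post
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_ols = coef[3]                      # interaction coefficient

def cell(t, p):
    """Mean outcome in one treated-by-post cell."""
    return y[(treated == t) & (post == p)].mean()

delta_means = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))
print(round(float(delta_ols), 6), float(delta_means))
```

The regression form matters in practice because it is where clustered standard errors and covariates are usually added.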

The event study splits the single ATT into one effect per period relative to treatment

\[Y_{it} = \gamma_i + \lambda_t + \sum_{k=-K+1}^{-2}\beta_k^{lead}D_{it}^k + \sum_{k=0}^{L}\beta_k^{lag}D_{it}^k + \varepsilon_{it}\]

The leads \(\beta_k^{lead}\) are placebo tests (should be ~0); the lags \(\beta_k^{lag}\) trace how the effect evolves. The period before treatment is the omitted reference.
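With one treatment date, two groups, and a balanced panel, each event-study coefficient reduces to a 2×2 DiD between that period and the reference period. The sketch below computes the coefficients that way from hypothetical, noise-free group means. The group paths are assumptions chosen so that parallel trends holds and the effect is a constant 5.0.

```python
ONSET, REF = 5, -1   # treatment starts at period 5; event time -1 is the reference

# Hypothetical noise-free group means by period
control = {t: 10.0 + 0.5 * t for t in range(10)}
treated = {t: 11.0 + 0.5 * t + (5.0 if t >= ONSET else 0.0) for t in range(10)}

ref_period = ONSET + REF
beta = {}
for t in range(10):
    k = t - ONSET                # event time relative to onset
    if k == REF:
        continue                 # omitted reference period
    beta[k] = ((treated[t] - treated[ref_period])
               - (control[t] - control[ref_period]))

leads = {k: b for k, b in beta.items() if k < 0}    # placebo coefficients
lags = {k: b for k, b in beta.items() if k >= 0}    # post-treatment effects
print(leads)
print(lags)
```

Every lead is exactly zero and every lag is exactly 5.0 here because there is no noise. In estimated data the leads scatter around zero and their confidence intervals do the testing.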

Leads sit on zero, lags snap to ~5.0 — the visual signature of a clean DiD

Pre-treatment coefficients (steel) hover at zero with CIs crossing it; post-treatment coefficients (orange) jump to ~5.0, matching the teal true-effect line.

Real policies roll out city by city — and that staggered timing breaks naive TWFE

  • 3,000 observations — 300 units × 10 periods
  • Three cohorts adopt at periods 3, 5, 7 (60, 75, 75 units)
  • 90 never-treated units — a clean control group
  • Effects grow over time by construction (2.0 → 3.2 for the earliest cohort)

TWFE fits one pooled \(\delta\) — a weighted average of many 2×2 comparisons, some of them poisoned.
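A toy example makes the problem visible. The sketch below builds a hypothetical two-unit, three-period panel (not the 300-unit lab data) in which the effect grows with exposure, then computes the TWFE coefficient through the two-way within transformation, which is equivalent to the fixed-effects regression in a balanced panel. The untreated outcome is set to zero so that any gap between TWFE and the truth comes from the comparisons alone.

```python
import numpy as np

# Hypothetical toy panel: rows = units, columns = periods 0..2
adopt = np.array([1, 2])                  # early unit adopts at 1, late unit at 2
periods = np.arange(3)
D = (periods[None, :] >= adopt[:, None]).astype(float)     # treatment indicator
effect = (1.0 + (periods[None, :] - adopt[:, None])) * D   # 1 at onset, +1 per period
Y = effect.copy()                         # untreated outcome is 0 everywhere

def double_demean(a):
    """Remove unit means and period means (two-way within transformation)."""
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())

D_t, Y_t = double_demean(D), double_demean(Y)
twfe = (D_t * Y_t).sum() / (D_t ** 2).sum()   # pooled TWFE coefficient
true_att = effect[D == 1].mean()              # average effect over treated cells
print(round(float(twfe), 3), round(float(true_att), 3))
```

In this toy the pooled coefficient is 0.5 while the average effect on treated cells is 4/3. The shortfall comes from the comparison that uses the early unit as a control for the late one: the early unit's effect is still growing, and that growth is subtracted as if it were a trend.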

Cohorts move in parallel, then jump at their own onset — TWFE then mis-uses early adopters as controls

Four groups track together pre-treatment, then cohorts 3 (orange), 5 (teal), and 7 (near black) each jump at their own onset; the never-treated group (steel) drifts gently up.

The Goodman–Bacon decomposition shows 28.3% of TWFE’s weight is on forbidden comparisons

Left: each 2×2 comparison as a point; forbidden ones (dark orange) cluster low. Right: weight by type, with 28.3% on forbidden comparisons.

Comparison type                     Weight   Avg effect
Treated vs never-treated (clean)    0.433    2.37
Earlier vs later treated            0.284    2.20
Later vs earlier (forbidden)        0.283    1.60

Forbidden comparisons drag TWFE down to a biased 2.18

2.18

Naive TWFE \(\hat{\delta}\) · downward-biased by 28.3% weight on forbidden comparisons (avg effect 1.60)

Callaway–Sant’Anna rebuilds the estimate from clean group-time ATTs only

\[\text{ATT}(g,t) = E[Y_t - Y_{g-1}\mid G=g] - E[Y_t - Y_{g-1}\mid G=\infty]\]

Each building block is a 2×2 that compares cohort \(g\) against the never-treated group only — every forbidden comparison eliminated by construction.

A doubly robust version reweights controls (propensity score) and models their outcome change (outcome regression); it remains valid if either of the two models is correctly specified.
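The group-time building block can be sketched directly from its definition. The example below uses hypothetical, noise-free cohort means (two treated cohorts and a never-treated group, with an effect of 2.0 at onset growing by 0.5 per period) and the plain unconditional formula, with no covariates and no doubly robust adjustment. The equal-weight average by exposure is a simplification: it ignores cohort sizes, which a real aggregation would take into account.

```python
# Hypothetical noise-free cohort mean outcomes for periods 0..4.
# Untreated path: level + 0.5 * t; effect at exposure e: 2.0 + 0.5 * e.
NEVER = None
levels = {1: 10.0, 3: 8.0, NEVER: 5.0}    # cohort label = first treated period

def outcome(g, t):
    """Mean outcome of cohort g in period t."""
    y0 = levels[g] + 0.5 * t
    if g is not NEVER and t >= g:
        return y0 + 2.0 + 0.5 * (t - g)
    return y0

att = {}
for g in (1, 3):
    base = g - 1                           # last pre-treatment period for cohort g
    for t in range(g, 5):
        change_treated = outcome(g, t) - outcome(g, base)
        change_never = outcome(NEVER, t) - outcome(NEVER, base)
        att[(g, t)] = change_treated - change_never

# Event-study aggregation: average ATT(g, t) by exposure e = t - g
by_exposure = {}
for (g, t), value in att.items():
    by_exposure.setdefault(t - g, []).append(value)
event_study = {e: sum(v) / len(v) for e, v in sorted(by_exposure.items())}
print(att)
print(event_study)
```

Each ATT(g, t) matches the effect that was built in, because no already-treated cohort ever serves as a control.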

Naive vs clean: same data, but CS uses only valid comparisons

Naive TWFE

  • one pooled coefficient
  • 28.3% weight on forbidden comparisons
  • \(\hat{\delta} = 2.18\) (biased low)
  • hides the dynamics

Callaway–Sant’Anna

  • clean group-time ATTs
  • never-treated controls only
  • overall \(\widehat{\text{ATT}} = 2.41\)
  • recovers the growth path

CS recovers a clean 2.41, and the effect grows from 1.97 to 3.27 over six periods

CS event study: pre-treatment effects pinned near zero (period −1 = 0 by construction); post-treatment effects rise steadily from ~2.0 to ~3.3.

The Resolution

Act III

The CI stays above zero even when violations are 15× the worst pre-trend

Robust 95% CI (steel band) widening with M; ATT (teal) flat at 2.41; the lower bound is still positive (0.38) at the M = 15 grid edge (orange line).

At \(M=0\): CI [2.53, 2.66]. At \(M=15\): CI [0.38, 4.81] — still excludes zero.
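The role of M can be illustrated with a deliberately simplified sketch of a relative-magnitudes bound. It assumes that the post-treatment violation of parallel trends in the first treated period is at most M times the largest period-to-period change among the pre-treatment coefficients, and it ignores sampling uncertainty entirely. The function name and all numbers are hypothetical. The intervals reported above come from the full HonestDiD procedure, which also accounts for estimation error, so this sketch does not reproduce them.

```python
def relative_magnitude_bounds(pre_coefs, beta_post, m):
    """Bounds on the first post-period effect under a relative-magnitudes limit.

    pre_coefs: event-study leads in time order, ending with the reference (0.0).
    Sampling error is ignored; this is the identified set, not a robust CI.
    """
    max_pre_step = max(abs(b - a) for a, b in zip(pre_coefs, pre_coefs[1:]))
    slack = m * max_pre_step
    return beta_post - slack, beta_post + slack

pre = [0.25, -0.25, 0.0]   # hypothetical lead coefficients; last is the reference
beta0 = 2.0                # hypothetical first post-period coefficient

for m in (0, 1, 2, 4):
    print(m, relative_magnitude_bounds(pre, beta0, m))

# Smallest M at which the bounds reach zero in this simplified setting
max_step = max(abs(b - a) for a, b in zip(pre, pre[1:]))
breakdown = beta0 / max_step
print(breakdown)  # 4.0
```

The bounds widen linearly in M, which is the widening band in the plot. The breakdown value is the M at which the lower bound first touches zero.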

The conclusion survives violations 15× worse than anything seen pre-treatment

M = 15

HonestDiD sensitivity · CI excludes zero even at 15× the largest pre-treatment deviation, so the breakdown value is at least 15 (the edge of the grid)

Three estimators, one honest verdict: the effect is real, growing, and robust

Setting             Estimator             \(\widehat{\text{ATT}}\)
Single timing       Classic 2×2           5.12
Staggered (naive)   TWFE                  2.18
Staggered (clean)   Callaway–Sant’Anna    2.41

Always report the breakdown value alongside the estimate. Here the CI still excludes zero at \(M = 15\), the edge of the grid, so the conclusion is exceptionally robust.

Let the design — not the default regression — choose your comparisons.