Bouncing Back Better? Evaluating the Economic Impact of the Aceh Tsunami

Abstract

Localized natural disasters destroy capital and lives yet can attract reconstruction aid large enough to rebuild a region “better than before,” leaving the net long-run effect genuinely ambiguous. This tutorial asks whether the Indonesian province of Aceh — which lost roughly 130,000 people to the 2004 Indian Ocean tsunami but then received the largest developing-world reconstruction effort ever, about USD 7.7 billion committed and USD 7.0 billion spent — ended up on a higher or lower growth path a decade later, and how that effect can be credibly measured. It replicates Heger & Neumayer (2019) on synthetic calibrated data spanning a district panel of 125 Sumatran districts observed annually over 1999–2012 (1,750 rows, 10 flooded Aceh districts treated) and a finer panel of 276 Aceh sub-districts with satellite night-lights. Treating coastal inundation as a quasi-natural experiment, it estimates a dynamic four-period difference-in-differences with pyfixest, an event study with diff-diff, a continuous night-lights dose-response, a synthetic control with mlsynth, and Conley spatial-HAC standard errors validated by Moran’s I. Flooded districts lost 7.9% of output in 2005 (−0.0792, p < 0.01) but grew 6.3 percentage points per year faster during 2006–08 (+0.0628, p < 0.05), and synthetic control places flooded Aceh +18.3% above its no-tsunami counterfactual by 2012; the night-lights rebound (+0.0160, p < 0.001) concentrates in the worst-hit quintile, while a neighbour placebo finds nothing. The case shows that well-governed mega-reconstruction can leave a poor region on a permanently higher trajectory — provided inference accounts for spatially clustered treatment, which roughly doubles the recovery effect’s standard error (0.0146 to 0.0244) and downgrades it from a spuriously confident 1% to an honest 5% significance.

1. Overview

On 26 December 2004, a magnitude-9.1 earthquake off the coast of Sumatra sent a tsunami across the Indian Ocean. The Indonesian province of Aceh bore the worst of it: roughly 130,000 people died, the wave reached up to 9 km inland, and about a third of the coastline was flooded. Then something unusual happened. Aceh received the single largest reconstruction effort ever directed at a developing-world disaster — about USD 7.7 billion committed, USD 7.0 billion actually spent — under a well-coordinated agency with low corruption.

So here is a genuinely hard question: a decade later, was Aceh richer or poorer than it would have been without the tsunami? Catastrophe destroys capital and lives; massive, well-spent aid rebuilds — better than before, sometimes. Which force won? And — the part this tutorial really cares about — how could you ever measure that credibly, when you only get to observe the world where the tsunami did happen?

This post is a hands-on answer. We treat the tsunami as a natural experiment: the wave flooded some districts and spared others for reasons of coastal geography that have nothing to do with their economic prospects. Comparing the flooded “treated” districts to the un-flooded “control” districts — before and after 2004 — lets us isolate the disaster-plus-reconstruction effect. We will measure it four different ways, each answering a slightly sharper version of the question, and we will be honest about uncertainty when the treated places all sit in one corner of the map.

A note on the data (please read this). This tutorial is inspired by and based on the study by Heger & Neumayer (2019), but it runs on synthetic data created for teaching. The paper’s real inputs (World Bank GDP, satellite night-lights, tsunami inundation maps) are licensed or confidential. Our dataset is calibrated so that re-running the paper’s analyses reproduces its findings — the signs, the statistical significance, and the approximate magnitudes of the key coefficients. The direction and significance of most results match the paper closely; the magnitudes can differ slightly (we tabulate exactly how in Section 11). Use this to learn the methods, not to draw new conclusions about Aceh.

1.1 Learning objectives

By the end of this tutorial, you will be able to:

  • Frame a localized natural disaster as a quasi-natural experiment, and explain why a flooded-vs-not comparison can identify a causal effect under parallel trends.
  • Measure disaster exposure and economic activity from administrative and satellite data — district GDP, sub-district night-lights, and satellite inundation maps.
  • Estimate a dynamic, four-period difference-in-differences on district GDP growth with pyfixest, and read it as an event study with diff-diff.
  • Quantify a dose-response relationship at finer resolution using continuous flood intensity on sub-district night-lights.
  • Build a synthetic-control counterfactual for flooded Aceh with mlsynth and read its path and gap plots.
  • Defend your inference when treatment is geographically clustered, using Moran’s I and Conley spatial standard errors, and validate the result with placebo and heterogeneity checks.

1.2 Study design

graph LR
    subgraph SETTING["The natural experiment"]
    A["<b>2004 tsunami</b><br/>floods some<br/>Aceh districts"]
    B["<b>Treated</b><br/>10 flooded<br/>districts"]
    C["<b>Control</b><br/>non-flooded<br/>districts"]
    A --> B
    A --> C
    end

    subgraph MEASURE["Two outcomes"]
    D["<b>District GDP</b><br/>growth"]
    E["<b>Sub-district</b><br/>night-lights"]
    end

    subgraph METHODS["Four causal tools"]
    F["<b>Difference-in-<br/>Differences</b><br/>pyfixest"]
    G["<b>Event study</b><br/>diff-diff"]
    H["<b>Synthetic<br/>control</b><br/>mlsynth"]
    I["<b>Conley spatial<br/>std. errors</b>"]
    end

    B --> D
    C --> D
    B --> E
    D --> F --> G
    D --> H
    F --> I

    style A fill:#d97757,stroke:#141413,color:#fff
    style B fill:#d97757,stroke:#141413,color:#fff
    style C fill:#6a9bcc,stroke:#141413,color:#fff
    style H fill:#00d4c8,stroke:#141413,color:#fff
    style I fill:#00d4c8,stroke:#141413,color:#fff

Read the diagram left to right: the tsunami splits districts into treated and control; we observe two outcomes (district GDP and finer sub-district night-lights); and we deploy four causal tools — DiD and its event-study view, an independent synthetic control, and the spatial standard errors that keep our confidence honest. Each maps onto a section below.

1.3 Key concepts at a glance

The post leans on a small vocabulary repeatedly. Each concept below has three parts. The definition is always visible; the example and analogy sit behind clickable cards — open them when you need them, leave them closed for a quick scan. If a later section mentions “parallel trends” or “Conley standard errors” and the term feels slippery, this is the section to re-read.

1. Difference-in-Differences (DiD). Compare the change in the treated group to the change in the control group. The difference of those two differences is the causal estimate. It nets out anything permanent about a district and any trend shared by the whole country.

Example

Flooded districts’ mean growth went from 0.0567 (before) to 0.0671 (after) — a change of +0.0103. Controls went 0.0519 to 0.0497 — a change of −0.0022. The 2×2 DiD is $0.0103 - (-0.0022) = +0.0125$.

Analogy

Time two runners on parallel tracks. Both speed up when the gun fires (the national trend). Credit your coaching only with the extra burst of the runner you coached — the gap between the two changes.

2. Parallel trends. The identifying assumption of DiD: absent the tsunami, flooded and non-flooded districts would have grown by the same amount on average. Their levels may differ; their trends must match.

Example

We test it directly: the pre-tsunami (2003–04) DiD coefficient is +0.0172 and statistically insignificant (p = 0.28). No detectable divergence before treatment — the assumption survives its placebo.

Analogy

Two boats drifting on the same current. They sit at different points, but the current carries them in step. Only an engine — the treatment — should make one pull ahead.

3. ATT $E[Y(1) - Y(0) \mid D=1]$. The Average effect of the Treatment on the Treated. Here: the effect on the flooded districts, not on some randomly chosen district. DiD and synthetic control both target the ATT.

Example

The +18.3% synthetic-control gap is the ATT for flooded Aceh. It does not claim that flooding any district would raise its GDP 18% — only that these districts, given this reconstruction, ended up that much higher.

Analogy

The bonus speed measured on the car that actually got the coaching — not a promise about any car you might pick off the street.

4. Counterfactual. The output flooded Aceh would have had with no tsunami. It is never observed; it must be estimated — by the control group’s trend (DiD) or a weighted blend of donor districts (synthetic control).

Example

“Synthetic Aceh” is a weighted recipe of 76 Rest-of-Sumatra donor districts (top weights: JAMBI_D01 0.13, BABEL_D05 0.12) chosen to match flooded Aceh’s pre-2005 GDP path. After 2005 it is our stand-in for the no-tsunami Aceh.

Analogy

The parallel-universe Aceh where the wave never came. We cannot visit it, so we build the most convincing look-alike we can from places the wave did miss.

5. Dose-response. Bigger exposure should mean a bigger effect. Instead of an on/off treatment dummy, use continuous intensity — the share of a sub-district flooded — or intensity quintiles.

Example

Each unit of “share of population flooded” raises night-lights growth by +0.016/year during recovery (p < 0.001). And only the top intensity quintile shows a significant rebound — quintiles 1–4 are flat.

Analogy

Medicine dosage. A sip does little; the full dose moves the needle. If only the largest doses show an effect, the drug is real but the average hides where it acts.

6. Night-lights as an economic proxy. Satellite night-time brightness (DMSP-OLS “Digital Numbers”, 0–63) stands in for local economic activity where GDP is unavailable, after a log transform.

Example

In 2004 the flooded sub-districts averaged a luminosity of 5.79 versus 2.36 for non-flooded ones — they are the denser, more active coastal places, and their lights are what we track over time.

Analogy

Judging a city’s bustle from a night flight overhead. You cannot read the GDP accounts from 800 km up, but brighter usually means busier.

7. Conley spatial-HAC standard errors. Standard errors that allow a district’s errors to be correlated with nearby districts in the same year (spatial) and with itself over time (serial). They are larger — and more honest — than naive errors when the treated units cluster in space.

Example

The recovery effect’s standard error roughly doubles, from 0.0146 (naive) to 0.0244 (Conley-HAC). That turns a t of 4.3 into a t of 2.57 — the point estimate (+0.0628) never moves, but it is significant at 5%, not 1%.

Analogy

Counting a milling crowd. Rows of seats suggest many independent heads, but if everyone keeps shuffling between seats you have far fewer truly independent observations than it looks.

2. Setup and the three star libraries

Three specialist packages do the heavy lifting, and each gets a one-line introduction the first time we use it:

  • pyfixest runs fixed-effects regressions with a fast, Stata-flavored formula syntax. Everything left of the | is estimated; everything right of it is absorbed as fixed effects, so we never build dummy columns by hand.
  • diff-diff is a small package built to teach difference-in-differences: it returns the 2×2 estimate and the event-study path in one or two lines each.
  • mlsynth implements modern synthetic-control estimators; we use VanillaSC, the classic Abadie–Diamond–Hainmueller method.
# In Colab, install the three estimation libraries first:
# !pip install pyfixest==0.50.1 diff-diff==3.5.2 "mlsynth @ git+https://github.com/jgreathouse9/mlsynth.git"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyfixest as pf
import diff_diff as dd
from mlsynth import VanillaSC

np.random.seed(42)  # reproducibility

# Site dark-theme palette for figures
STEEL_BLUE, WARM_ORANGE, TEAL = "#6a9bcc", "#d97757", "#00d4c8"
DARK_NAVY, GRID_LINE, LIGHT_TEXT = "#0f1729", "#1f2b5e", "#c8d0e0"

Two small design helpers encode the paper’s difference-in-differences structure. The post period is split into event-time windows — pre (2003–04), tsunami (2005), recovery (2006–08), and post-recovery (2009–12) — all measured against the omitted 2000–02 baseline. The function below turns a treatment column into the four interaction terms those windows need.

# The four event-time windows (the 2000-02 baseline is the omitted reference)
PERIOD_TO_TERM = {"pre": "D_pre", "tsunami": "D_2005",
                  "recovery": "D_recov", "postrec": "D_post"}
DID_TERMS = ["D_pre", "D_2005", "D_recov", "D_post"]

def make_did_terms(df, treat_col):
    """Build treatment x period interactions: D_pre, D_2005, D_recov, D_post."""
    out = df.copy()
    treat = out[treat_col].astype(float)
    for period, term in PERIOD_TO_TERM.items():
        out[term] = treat * (out["period"] == period).astype(float)
    return out

3. The data: measuring a disaster at two geographic levels

Evaluating a localized disaster forces a measurement problem to the surface. National GDP would barely flinch at a shock to one province — so we need sub-national data, and we need it at two grains. We load both panels straight from the post’s data folder on GitHub, so the code runs unchanged in Colab.

BASE = ("https://raw.githubusercontent.com/cmg777/starter-academic-v501/"
        "master/content/post/python_did_sc_tsunami/data/")
district = pd.read_csv(BASE + "aceh_tsunami_district_panel.csv")
subdistrict = pd.read_csv(BASE + "aceh_tsunami_subdistrict_panel.csv")

print("district panel    :", district.shape)
print("subdistrict panel :", subdistrict.shape)
print(district.groupby(["region_group", "flooded"]).size().unstack("flooded", fill_value=0))
district panel    : (1750, 30)
subdistrict panel : (3864, 19)
flooded          0   1
region_group          
Aceh            13  10
North Sumatra   24   2
Rest of Sumatra 76   0

The district panel is 125 districts observed annually over 1999–2012 (1,750 rows); the sub-district panel is 276 Aceh sub-districts (kecamatans) over the same years. The treatment group is small and concentrated: 10 flooded Aceh districts against 13 non-flooded Aceh districts plus 76 Rest-of-Sumatra controls. (North Sumatra’s two flooded islands are held back for a robustness check, because they were also hit by a separate earthquake in March 2005.) That smallness — only 10 treated units — is the recurring source of statistical caution in this case study.

3.1 The first outcome — district GDP growth

The main outcome (gdp_growth) is the annual growth rate of real district GDP, measured in the spirit of the World Bank’s INDO-DAPOER database, which draws on Indonesia’s large annual socio-economic survey (SUSENAS). Two construction details from the paper matter. First, oil and gas are excluded: that sector is volatile and concentrated in a few non-treated districts, so leaving it in would add noise unrelated to the tsunami. Second, GDP is in constant prices so we measure real output, not inflation.

print(district[["gdp_growth", "gdp_pc_growth", "gdp_const_usd_m"]].describe().round(3).loc[
      ["count", "mean", "std", "min", "max"]])
       gdp_growth  gdp_pc_growth  gdp_const_usd_m
count     1621.00        1621.00          1750.00
mean         0.05           0.04           671.18
std          0.07           0.07           593.99
min         -0.17          -0.20            33.09
max          0.29           0.30          3748.41

District growth averages about 5.2% a year with a wide spread (standard deviation 6.6 percentage points). Notice gdp_growth has 1,621 values, not 1,750: growth is undefined in 1999 (no prior year to difference against) and is missing for one district (Subulussalam) over 2003–06 due to an administrative boundary change. Every estimator below simply drops those rows — a small but honest detail that keeps the sample sizes matching the paper exactly.

3.2 The second outcome — sub-district night-lights

GDP at the district level is too coarse to capture how intensely a place was hit. So the paper drops to the finer sub-district grain and switches to a satellite proxy: night-time luminosity from the DMSP-OLS program. Each pixel records a “Digital Number” from 0 (dark) to 63 (saturated bright). To turn pixel brightness into a sub-district economic measure, the paper sums the lights across a sub-district’s pixels and takes a logarithm:

$$\text{NL}_{ct} = \log\left( \sum_{n=1}^{N} \left( \text{DN}_{nct} + 0.001 \right) \right)$$

In words: a sub-district $c$’s log-luminosity in year $t$ is the log of the total brightness summed over its $N$ pixels, with a tiny $0.001$ added so that pixels reading exactly zero do not break the logarithm. Summing (rather than averaging) keeps the measure comparable to GDP, which is also a total; the log tames the heavy right-skew of brightness. Our outcome nl_growth is the annual change in this log measure — a luminosity growth rate that lines up conceptually with the GDP growth rate.

3.3 Identifying the treated areas — where the wave actually reached

The credibility of the whole exercise rests on how “flooded” is defined. The paper does not let economics decide it. Treatment is read off satellite inundation maps produced 1–5 days after the tsunami by remote-sensing agencies (Germany’s DLR/ZKI and the Dartmouth Flood Observatory), which compared the coastline before and after to flag pixels the water reached. Whether the wave penetrated a given stretch of coast was governed by elevation, vegetation, and offshore depth — geographic happenstance, plausibly unrelated to a district’s economic prospects. That is exactly what makes the flooding a credible natural experiment.

The data encodes exposure three ways, from coarse to fine:

print("Binary treatment (district level):")
print(district.groupby("flooded")["district_id"].nunique())
print("\nContinuous + quintile intensity (sub-district level), among flooded units:")
print(subdistrict.loc[subdistrict.flooded == 1, ["share_pop_flooded", "share_area_flooded"]].describe().round(3).loc[["mean", "min", "max"]])
Binary treatment (district level):
flooded
0    113
1     12
Name: district_id, dtype: int64

Continuous + quintile intensity (sub-district level), among flooded units:
       share_pop_flooded  share_area_flooded
mean               0.190               0.012
min                0.001               0.000
max                0.620               0.105

flooded is the simple on/off dummy used for the district DiD. For the finer night-lights analysis we also have share_pop_flooded and share_area_flooded — the fraction of a sub-district’s population (or land area) that the inundation maps marked as flooded — plus a flood_intensity_quintile ranking. Notice share_area_flooded has a tiny mean (about 1.2%): land area includes a lot of unpopulated hinterland, so the share of area flooded is small even where damage was severe. That tiny scale will make its regression coefficient look enormous later — same story, different units.

3.4 A transparent word on the synthetic data

Before we model anything: the panels above are simulated. The data-generating process was tuned so that a fixed-effects DiD recovers, column by column, coefficients close to the paper’s reported values (within about 0.005 on the headline cells), with the same signs and significance stars. Spatial and serial shocks were injected so the standard errors behave like the paper’s without moving the point estimates. We will hold ourselves accountable for this in Section 11, where a table lines our numbers up against the paper’s. With the measurement settled, let us look at the data before modeling it.

4. Exploratory analysis: the space-time dynamics

Good causal work looks at the data before it regresses it. Three views build the intuition the models will formalize. First, a handful of individual districts over time — three badly-hit ones (Banda Aceh, Aceh Besar, Aceh Jaya) against two highland controls — with GDP indexed so every district starts at 100 in 2004.

def indexed(name):
    s = district[district.district_name == name].set_index("year")["gdp_const_usd_m"]
    return s / s.loc[2004] * 100

fig, ax = plt.subplots(figsize=(9, 5.2))
for name in ["Banda Aceh", "Aceh Besar", "Aceh Jaya"]:
    ax.plot(indexed(name), color=WARM_ORANGE, lw=2.2, label=f"{name} (flooded)")
for name in ["Aceh Tengah", "Bener Meriah"]:
    ax.plot(indexed(name), "--", color=STEEL_BLUE, lw=2, label=f"{name} (control)")
ax.axvline(2004.5, color=LIGHT_TEXT, ls=":")
ax.set(xlabel="Year", ylabel="Real GDP (2004 = 100)")
ax.legend(); plt.show()

Key Aceh districts indexed to 2004 = 100; flooded districts dip in 2005 and rebound far above controls. Three flooded districts (orange) versus two highland controls (blue), each indexed to 100 in 2004.

The flooded districts sit on the same path as the controls through 2004, buckle in 2005, and then climb steeply — Banda Aceh, the provincial capital, ends near 260 (a 2.6× increase over its 2004 level). The control districts grow too, but far more gently. The eye already sees a disaster followed by an over-shooting recovery; the rest of the post is about measuring it and trusting the measurement.

Single districts are noisy, though. The next view summarizes the distribution of growth in each group across the event-time periods.

samp = district[district.region_group != "North Sumatra"].dropna(subset=["gdp_growth"]).copy()
samp["group"] = np.where(samp.flooded == 1, "Treated (flooded)", "Control")
# (full grouped-boxplot styling is in script.py)

Box-plots of GDP growth by group across event-time periods; the treated 2005 box drops below zero and the recovery box lifts above the control box. Distribution of district GDP growth, treated (orange) vs control (blue), in each event-time period.

The boxes make the dynamics unmistakable. In 2000–04 the treated and control boxes overlap almost perfectly — similar centers, similar spread. In 2005 the treated box drops bodily below zero (a median contraction) while the control box stays put. In 2006–08 the treated box jumps above the control box. The disaster and the rebound are both visible as shifts in the whole distribution, not just a couple of outliers.

Finally, the single figure that motivates difference-in-differences: the group means over time.

means = (samp.groupby(["year", "flooded"])["gdp_growth"].mean()
         .unstack("flooded").rename(columns={0: "Control", 1: "Treated"}))

fig, ax = plt.subplots(figsize=(9, 5.2))
ax.plot(means["Control"], "--o", color=STEEL_BLUE, label="Control")
ax.plot(means["Treated"], "-o", color=WARM_ORANGE, label="Treated (flooded)")
ax.axvline(2004.5, color=LIGHT_TEXT, ls=":"); ax.legend(); plt.show()

Treated vs control group-mean growth: parallel before 2005, then the treated line dives and overshoots. Mean annual GDP growth, treated vs control. Parallel before the tsunami, sharply divergent after.

This is the picture difference-in-differences was invented for. Before 2005 the two lines move in lockstep — the visual signature of parallel trends. At the tsunami the treated line plunges to about −0.027 while the control line barely moves. Then the treated line over-shoots, peaking near +0.124 in 2007 before settling back toward the control. The eye is convinced something happened; now we quantify it and, crucially, attach a margin of error.

5. Difference-in-differences on district GDP growth

5.1 The intuition: a 2×2 difference of differences

Start with the simplest possible version. Split time into “before” (≤ 2004) and “after” (≥ 2005), compute the mean growth in each of the four treated/control × before/after cells, and form the difference of the two differences:

$$\widehat{\text{DiD}} = \big( \bar{g}_{\text{treated, after}} - \bar{g}_{\text{treated, before}} \big) - \big( \bar{g}_{\text{control, after}} - \bar{g}_{\text{control, before}} \big)$$

In words: take how much the treated group’s average growth changed across the break, subtract how much the control group’s changed, and what remains is the part attributable to the tsunami — because the control change captures whatever was happening nationwide anyway.

sample = make_did_terms(district[district.region_group != "North Sumatra"], "flooded").dropna(subset=["gdp_growth"])
cell = sample.groupby(["flooded", "post"])["gdp_growth"].mean().unstack("post")
cell.columns, cell.index = ["Before (<=2004)", "After (>=2005)"], ["Control", "Treated"]
cell["change"] = cell["After (>=2005)"] - cell["Before (<=2004)"]
print(cell.round(4))

res = dd.DifferenceInDifferences(cluster="district_id").fit(
    sample, outcome="gdp_growth", treatment="flooded", time="post")
print(f"\nDiD ATT = {res.att:+.4f}  (SE {res.se:.4f}, p = {res.p_value:.3f})")
         Before (<=2004)  After (>=2005)  change
Control           0.0519          0.0497 -0.0022
Treated           0.0567          0.0671  0.0103

DiD ATT = +0.0125  (SE 0.0142, p = 0.379)

The hand calculation gives $0.0103 - (-0.0022) = +0.0125$, and diff-diff confirms it with a standard error: +0.0125, but statistically insignificant (p = 0.38). Before you conclude “no effect,” look closer. This single “after” window blends two opposite phases — the 2005 destruction and the 2006–08 boom — into one average, and they nearly cancel. The pooled estimate is not wrong; it is just uninformative. The fix is to let the effect vary over time.

5.2 The dynamic DiD (the paper’s headline)

The paper’s central specification keeps the same logic but splits the post period into the four event-time windows, each entering as a treatment-times-period interaction relative to the 2000–02 baseline:

$$\Delta Y_{it} = \beta_1 D_i \mathbf{1}[t \in \text{pre}] + \beta_2 D_i \mathbf{1}[t = 2005] + \beta_3 D_i \mathbf{1}[t \in \text{recovery}] + \beta_4 D_i \mathbf{1}[t \in \text{post}] + \alpha_i + \gamma_t + \varepsilon_{it}$$

Here $\Delta Y_{it}$ is district $i$’s GDP growth in year $t$; $D_i = 1$ for flooded districts; $\mathbf{1}[\cdot]$ is an indicator that is 1 when the year falls in that window; $\alpha_i$ is a district fixed effect (absorbing anything permanent about a district) and $\gamma_t$ is a year fixed effect (absorbing common national shocks). The four $\beta$’s are the story: $\beta_2$ should be the negative 2005 shock, $\beta_3$ the positive reconstruction boom, while $\beta_1$ and $\beta_4$ should be near zero. These map exactly onto the code variables D_pre, D_2005, D_recov, D_post. In pyfixest, the part after the | lists the fixed effects to absorb:

m = pf.feols("gdp_growth ~ D_pre + D_2005 + D_recov + D_post | district_id + year",
             data=make_did_terms(district[district.region_group != "North Sumatra"], "flooded"),
             vcov={"CRV1": "district_id"})
m.coef().round(4)
D_pre      0.0172
D_2005    -0.0792
D_recov    0.0628
D_post     0.0114

Reading these against the paper’s three control pools (the full table from script.py) gives the headline result:

Coefficient(1) Sumatra controls(2) Rest of Sumatra(3) Aceh non-flooded
Pre-tsunami (2003-04)+0.0172 (0.0159)+0.0176 (0.0162)+0.0154 (0.0187)
Tsunami (2005)−0.0792*** (0.0240)−0.0782*** (0.0247)−0.0841*** (0.0281)
Recovery (2006-08)+0.0628** (0.0244)+0.0682*** (0.0247)+0.0310 (0.0281)
Post-recovery (2009-12)+0.0114 (0.0146)+0.0132 (0.0147)+0.0008 (0.0204)
Observations1,2831,118295

Conley spatial-HAC standard errors in parentheses; *** p<.01, ** p<.05, * p<.10.

Three things to take away. First, the pre-tsunami coefficient is small and insignificant (+0.0172) — the parallel-trends assumption passes its formal placebo test, so the comparison is credible. Second, the 2005 coefficient is −0.0792 (p < 0.01): flooded districts grew almost 8 percentage points slower the year the wave hit. Third, the recovery coefficient is +0.0628 (p < 0.05): over 2006–08 they grew 6.3 points per year faster than controls — and three years of that premium (≈ +0.19 cumulatively) more than erases the one-year loss. The post-recovery coefficient is a near-zero +0.0114, meaning the gain neither evaporated nor kept compounding: Aceh settled onto a permanently higher path. This is the paper’s signature result — “sustainable recovery beyond the counterfactual trend.” Notice column 3, which compares flooded Aceh to its own non-flooded neighbors, halves the recovery coefficient to +0.0310 (insignificant): reconstruction money spilled across district lines, shrinking the within-Aceh contrast.

The estimand here is the ATT — the effect on the flooded districts — and identification rests on parallel trends, not randomization. This is an observational study; the placebo and spatial-error checks below are what earn it credibility.

5.3 The event study: seeing the whole path

The dynamic DiD has four coefficients; an event study plots them (plus the pinned baseline) so the pre-trend and the recovery path are visible at a glance. diff-diff produces it directly:

mp = dd.MultiPeriodDiD(cluster="district_id").fit(
    sample, outcome="gdp_growth", treatment="flooded", time="period",
    reference_period="baseline", absorb=["district_id"])
for p in ["pre", "tsunami", "recovery", "postrec"]:
    e = mp.period_effects[p]
    print(f"{p:9s} effect={e.effect:+.4f}  se={e.se:.4f}  p={e.p_value:.3f}")
pre       effect=+0.0172  se=0.0160  p=0.283
tsunami   effect=-0.0792  se=0.0260  p=0.002
recovery  effect=+0.0628  se=0.0247  p=0.011
postrec   effect=+0.0114  se=0.0149  p=0.444

Event study: a flat pre-trend, a 2005 collapse, and a 2006-08 rebound, with 95% confidence intervals. Each point is the treated-minus-control effect in that period, relative to the 2000–02 baseline; bars are 95% confidence intervals.

The figure tells the entire story in one arc. The baseline and pre-tsunami points sit on zero (parallel trends — the identifying assumption is visibly satisfied). The 2005 point collapses to −0.079 with a confidence interval well below zero. The recovery point rebounds to +0.063, also significantly positive. And the post-recovery point drifts back toward zero but stays positive — the higher level persists. A single “after” dummy (Section 5.1) averaged the deep red 2005 point with the high green recovery point and got a muted, insignificant number; the event study shows why that average was misleading.

5.4 Did people just leave? The per-capita check

A worry: maybe “growth” per district rose only because the population fell (the tragic arithmetic of 130,000 deaths and displacement). If GDP and population dropped together, GDP per capita need not have moved. Re-running the same DiD on gdp_pc_growth addresses it:

Coefficient(1) Sumatra controls
Tsunami (2005)+0.0192 (0.0239)
Recovery (2006-08)+0.0827*** (0.0261)

In per-capita terms there is no significant 2005 loss (+0.0192) — output and population fell together that year — but the recovery gain is even larger and highly significant (+0.0827, p < 0.01): fewer people then shared a rebuilt, better-capitalized economy. The effect is not a denominator artifact; it survives, indeed strengthens, when we divide by population.

6. Night-lights dose-response: how intensity matters

District GDP answers “did flooded districts grow faster?” Night-lights, available for the much finer sub-districts, can answer a sharper question: did the places hit harder rebound more? First, a quick descriptive — how bright were flooded vs non-flooded sub-districts before the tsunami?

snap = subdistrict[subdistrict.year == 2004]
print(snap.groupby("flooded")["avg_luminosity"].agg(["count", "mean", "std", "max"]).round(2))
         count  mean   std   max
flooded                          
0          208  2.36  4.41  36.0
1           68  5.79  8.31  39.0

The 68 flooded sub-districts averaged a 2004 luminosity of 5.79 versus 2.36 for the 208 non-flooded ones — about 2.5× brighter, because flooded places are the denser, more economically active coastal strips. Now the dose-response: instead of the on/off flooded dummy, we interact the continuous flood intensity with the event-time periods.

def nl_fit(treat):
    df = make_did_terms(subdistrict, treat)
    return pf.feols("nl_growth ~ D_pre + D_2005 + D_recov + D_post | kecamatan_id + year",
                    data=df, vcov={"CRV1": "kecamatan_id"})

print(nl_fit("share_pop_flooded").coef().round(4))
D_pre      0.0052
D_2005    -0.0073
D_recov    0.0160
D_post     0.0019
CoefficientShare of population floodedShare of area flooded
Pre-tsunami (2003-04)+0.0052 (0.0034)+0.565 (0.358)
Tsunami (2005)−0.0073** (0.0035)−0.727* (0.381)
Recovery (2006-08)+0.0160*** (0.0022)+1.660*** (0.246)
Post-recovery (2009-12)+0.0019 (0.0024)+0.270 (0.250)
N3,4443,444

The recovery coefficient on “share of population flooded” is +0.0160 (p < 0.001): each additional unit of population-share flooded buys that much extra annual luminosity growth during reconstruction. The “share of area” column tells the same story with a coefficient about 100× larger (+1.660) — purely because, as we saw in Section 3.3, the share of area flooded is a tiny number, so a one-unit move is enormous. Same effect, different yardstick. The pre-period coefficients are small and the 2005 dip is weak-to-modest, mirroring the district results at finer resolution.

The dose-response sharpens further if we ask where the effect lives. Splitting flood intensity into quintiles and interacting each with the post period:

Night-lights dose-response: continuous period effects (left) and quintile effects (right); only the top quintile is significant. Left: period coefficients for the continuous dose. Right: effect by intensity quintile — only the worst-hit fifth (Q5) rebounds significantly.

QuintileQ1Q2Q3Q4Q5 (worst-hit)
Effect (share of population)+0.0010+0.0010+0.0009+0.0008+0.0018**

Only the top quintile — the most heavily flooded fifth of sub-districts — shows a statistically significant rebound (+0.0018, p ≈ 0.02); quintiles 1 through 4 are flat and indistinguishable from zero. The average effect is not spread evenly; it is concentrated exactly where the damage, and therefore the reconstruction spending, was greatest. That is a substantive lesson about disaster aid as much as a statistical one.

7. Synthetic control: building a counterfactual Aceh

Difference-in-differences leans on the control group’s trend as the counterfactual. The synthetic control method builds a more bespoke one: a weighted blend of donor districts chosen so that the blend tracks flooded Aceh’s pre-tsunami path almost exactly. Formally, it picks non-negative weights $w$ that sum to one to minimize the pre-treatment mismatch:

$$w^{\ast} = \arg\min_{w}\ \left( X_1 - X_0 w \right)^{\top} V \left( X_1 - X_0 w \right) \quad \text{subject to} \quad w_j \geq 0, \quad \sum_j w_j = 1$$

where $X_1$ holds treated Aceh’s pre-2005 outcomes and $X_0$ the donors’ (one column per donor). After 2005 the fitted “synthetic Aceh” is left to run free; the gap between actual and synthetic Aceh is the estimated effect. Before fitting a model, the raw group averages already hint at the answer:

GDP indexed to 2004 = 100: flooded Aceh dips, then climbs above both control groups. Figure 2 of the paper, reproduced: flooded Aceh (orange) ends well above non-flooded Aceh (blue) and the rest of Sumatra (teal).

By 2012 the flooded-Aceh index reaches 177 (2004 = 100), versus 162 for non-flooded Aceh and 142 for the rest of Sumatra. Now the formal version. mlsynth wants a long panel with one treated unit and a pool of donors, so we collapse the 10 flooded Aceh districts into a single average and pair them with the 76 Rest-of-Sumatra donor districts:

treated = (district[(district.flooded == 1) & (district.region_group == "Aceh")]
           .groupby("year", as_index=False)["gdp_const_usd_m"].mean()
           .assign(unitid="Aceh (flooded)").rename(columns={"year": "time", "gdp_const_usd_m": "outcome"}))
donors = (district[district.region_group == "Rest of Sumatra"][["district_id", "year", "gdp_const_usd_m"]]
          .rename(columns={"district_id": "unitid", "year": "time", "gdp_const_usd_m": "outcome"}))
panel = pd.concat([treated[["unitid", "time", "outcome"]], donors], ignore_index=True)
panel["treat"] = ((panel.unitid == "Aceh (flooded)") & (panel.time >= 2005)).astype(int)

out = VanillaSC({"df": panel, "outcome": "outcome", "treat": "treat",
                 "unitid": "unitid", "time": "time", "display_graphs": False}).fit().model_dump()
print(f"pre-RMSE = {out['fit_diagnostics']['rmse_pre']:.3f}")
print(f"ATT = +{out['effects']['att']:.1f} GDP units (+{out['effects']['att_percent']:.1f}%)")
pre-RMSE = 0.485
ATT = +32.9 GDP units (+18.3%)

Synthetic control: flooded-Aceh GDP vs a synthetic Aceh built from 76 donors; the shaded gap is the estimated effect. Figure 3 reproduced: synthetic Aceh tracks the treated path before 2005 (pre-RMSE 0.485), then the actual line pulls clearly above it.

The pre-treatment fit is excellent: a root-mean-squared prediction error of 0.485 against GDP levels near 200 means synthetic Aceh shadows the real thing almost perfectly before 2005 — which is what licenses us to trust it as a counterfactual afterward. After the tsunami the actual line pulls away, ending +18.3% above its synthetic twin (370.9 vs 295.0 by 2012). The gap plot isolates that divergence:

The treated-minus-synthetic gap: near zero before 2005, opening up afterward. The estimated effect over time: indistinguishable from zero before 2005, then steadily positive.

A synthetic control is only as credible as its donor recipe — if one donor carried all the weight, the counterfactual would be fragile. Here the weight is spread:

Donor weights for synthetic Aceh: a handful of Sumatra districts, none dominant. The six largest donor weights. No single district dominates, which makes the counterfactual robust.

The top six donors — districts in Jambi, Bangka-Belitung, the Riau Islands, and Bengkulu — together carry about 62% of the weight, and the largest single weight is only 0.13. A counterfactual assembled from many modest contributors is far harder to dismiss than one resting on a single look-alike. Two very different methods — difference-in-differences and synthetic control — now agree: flooded Aceh ended up materially above where it was heading.

8. Spatial standard errors: honest inference for a clustered treatment

Every result so far came with a standard error, and those numbers were not the defaults. Here is why they cannot be. All 10 treated districts sit in one corner of Sumatra:

Longitude-latitude scatter of all Sumatra districts; the 10 treated units cluster on Aceh&rsquo;s NW coast. Every flooded (treated) district, in orange, sits in the far north-west. Their growth shocks are unlikely to be independent.

When the treated units are packed together, their year-to-year shocks are not independent draws — a good monsoon, a regional price swing, or the reconstruction boom itself hits them together. Tobler’s first law of geography puts it plainly: near things are more related than distant things. The default (“naive”) standard error assumes every observation is independent, so it counts more truly independent information than the data really contain, and reports standard errors that are too small. The first step is to check whether the problem is real, using Moran’s I — the spatial analogue of a correlation coefficient — on the regression residuals:

# residualize growth on flooded + year, then test whether the leftover is spatially clustered
# (full Moran's I + permutation code is in script.py)
print("Pooled within-year Moran's I = +0.065  (permutation p = 0.003)")
Pooled within-year Moran's I = +0.065  (permutation p = 0.003)

A Moran’s I of +0.065 with a permutation p-value of 0.003 says the residual growth of nearby districts is significantly positively correlated within a year — the independence assumption behind naive errors is violated, and we must do something about it. The fix is a Conley spatial-HAC standard error: a single “sandwich” estimator that counts two extra kinds of error correlation — serial (a district correlated with itself over time) and spatial (different districts within 100 km in the same year), with the spatial weight fading linearly to zero at the cutoff. The point estimates never change; only the standard errors do. Running the same DiD with four different standard errors side by side:

CoefficientEstimateNaiveClusteredConleyConley-HACt(HAC)
Pre-tsunami+0.01720.01440.01590.01440.0159+1.08
Tsunami (2005)−0.07920.02360.02580.02160.0240−3.30
Recovery (2006-08)+0.06280.01460.02440.01450.0244+2.57
Post-recovery+0.01140.01090.01480.01060.0146+0.78

Look at the recovery row. The estimate is +0.0628 in every column — the point estimate is rock-solid. But its standard error climbs from 0.0146 (naive) to 0.0244 (Conley-HAC), a 1.68× inflation driven mostly by the serial correlation of the multi-year recovery window. That difference is decisive: under the naive error the recovery effect would have a t-statistic above 4 and look significant at the 1% level (***); under the honest Conley-HAC error its t is 2.57, significant at 5% (**). The point estimate never moved — only our honesty about its uncertainty did. A careless analyst would have overstated the confidence threefold.

How far should the spatial cutoff reach? Too short and you miss real correlation; too long and you dilute the kernel with distant, weakly-related pairs. Sweeping the cutoff shows the standard error is stable across the 25–100 km range the paper uses, then declines as far-flung pairs water it down:

Conley-HAC standard error of the recovery effect as the distance cutoff widens from 0 to 300 km. The recovery effect’s standard error is flat through ~100 km (the paper’s choice), then drifts down as distant pairs dilute the spatial kernel.

(The full Conley sandwich — within-transformation, the two error “meats,” and the negative-variance clamp — lives in script.py as conley_did_estimate; it reproduces pyfixest’s point estimates to four decimals while adding the spatial standard errors pyfixest cannot compute.)

9. Robustness: placebo and heterogeneity

Two final checks decide whether to believe the headline. The first is a placebo: if our design is sound, then districts that merely neighbor a flooded district — but were not themselves flooded — should show no effect. We drop the truly flooded districts and pretend their neighbors were treated:

nonflooded = district[district.flooded == 0]
# re-run the dynamic DiD with `neighbour_of_flooded` as the fake treatment (full code in script.py)
Check2005Recovery (2006-08)N
Placebo (neighbours of flooded)+0.0025 (ns)+0.0064 (ns)1,465
City (Kota) districts−0.0424 (ns)+0.1226***295
Rural (Kabupaten) districts−0.0883***+0.0479*988

The placebo finds nothing — every coefficient is small and insignificant (2005 +0.0025, recovery +0.0064). That is exactly what we want: the result is not some artifact of generalized regional spillovers leaking onto whoever happens to be nearby. The effect is specific to the districts the water actually reached.

The second check looks inside the average. Splitting the treated districts into cities (Kota) and rural regencies (Kabupaten) reveals very different experiences. Rural districts took the brunt of the 2005 shock (−0.0883, p < 0.01) — agriculture floods badly — with a modest rebound (+0.0479). Cities, by contrast, barely contracted in 2005 (−0.0424, insignificant) but rebounded enormously (+0.1226, p < 0.01), reflecting the urban concentration of reconstruction. One caveat the paper itself flags: there are only 2 flooded city districts, so the city column is statistically fragile (few independent clusters) — read its precision, not just its point estimate, with care.

10. Discussion

What we found. Four methods converge on one story. Flooded districts lost about 7.9% of output in 2005 but grew 6.3 percentage points per year faster in 2006–08, ending on a permanently higher path — Aceh’s “recovery beyond the counterfactual trend.” Night-lights confirm it at finer resolution and show the gain concentrated in the worst-hit places. A synthetic control built from 76 donor districts puts flooded Aceh +18.3% above its no-tsunami twin by 2012. The result is robust to a neighbor-district placebo and survives honest, spatially-corrected inference (where it is significant at 5%, not the spuriously confident 1%).

So what? The substantive lesson is not “disasters are good” — they are not; 130,000 people died. It is that a localized catastrophe followed by large, well-governed reconstruction can leave a poor region on a higher long-run trajectory. Aceh received aid worth about 150% of its damages, spent through a low-corruption agency, on infrastructure rebuilt “better than before.” That combination — not the wave — is what bent the growth path upward. For disaster policy, the design lesson is just as important as the result: a credible evaluation needs exogenous exposure (geography, not choice), a finer-than-national unit of analysis, and spatially honest standard errors.

Limitations. Be appropriately humble. The data are synthetic — calibrated to teach the methods, not to report new facts about Aceh. The treatment group is tiny (10 districts), so point estimates are fragile and standard errors wide; the Aceh-only and city columns are especially imprecise. Identification is observational: parallel trends is an assumption, supported by the flat pre-trend and the null placebo but never proven. And a single, exceptionally well-funded case study travels poorly — Aceh’s recovery is evidence about well-governed mega-reconstruction, not about disaster aid in general.

11. Reproduction audit: synthetic data vs the paper

Because the data are synthetic, transparency demands that we line our numbers up against the published ones. The data-generating process was tuned to match the paper column by column; signs and significance agree throughout, and magnitudes land within about 0.005 on the headline cells.

ResultThis synthetic dataPaper (reported)SignSignificance
DiD GDP, 2005 (Table 2, col 1)−0.0792***≈ −0.081***
DiD GDP, recovery 2006-08 (col 1)+0.0628**≈ +0.059**
DiD GDP, recovery vs Aceh controls (col 3)+0.0310 (ns)≈ +0.030**partial
DiD per-capita, recovery (Table 8)+0.0827***≈ +0.078***
Night-lights, share-of-pop recovery (Table 3)+0.0160***≈ +0.016***
Night-lights, share-of-area recovery (Table 3)+1.660***≈ +1.75***
Night-lights quintiles (Table 4)only Q5 significantonly Q5 significant
City vs rural, 2005 (Table 7)rural −0.0883*** / city nsrural ≈ −0.098*** / city ns
Placebo neighbours (Table 9)all nsall ns
Synthetic control ATT+18.3%“recovery beyond counterfactual”qualitative

Two honest gaps. The Aceh-only column-3 recovery effect matches the paper in magnitude (+0.031 vs +0.030) but reads as insignificant here, because with the same 10 treated units in every column our synthetic standard errors are similar across columns, whereas the paper’s Aceh-only sample is more precise. And the night-lights quintile magnitudes sit on Table 3’s (smaller) scale rather than the paper’s Table 4 scale — the paper’s own Tables 3 and 4 are mutually inconsistent in units, so no single process can reproduce both; we match the pattern (only Q5 significant) exactly. Everywhere else, direction and significance track the paper closely.

12. Summary and takeaways

Number to rememberValue
2005 output shock−0.0792*** (≈ −8%)
2006–08 recovery premium+0.0628** (per year)
Synthetic-control gap by 2012+18.3%
Moran’s I (spatial autocorrelation)+0.065 (p = 0.003)
Recovery SE: naive → Conley-HAC0.0146 → 0.0244
Night-lights recovery (share-of-pop)+0.0160***
  • A single “after” hides the story. The pooled 2×2 DiD was an insignificant +0.0125; only splitting time into event-time windows revealed the −0.079 collapse and +0.063 overshoot. When effects evolve, let them.
  • Triangulate. Difference-in-differences, an event study, a dose-response, and a synthetic control all pointed the same way — a far stronger claim than any one method alone.
  • Satellite data unlock localized questions. Night-lights gave a finer, exogenous measure that exposed the dose-response (only the worst-hit quintile rebounds) invisible at the district level.
  • Clustered treatment demands honest inference. With all treated units in one corner of the map, Conley spatial standard errors were not optional — they downgraded the recovery effect from a spurious *** to an honest **, without touching the point estimate.
  • Mind the small print. Ten treated districts make for fragile estimates; the result is about well-governed mega-reconstruction, on synthetic data, identified by an assumption. Strong evidence, stated with the caveats it deserves.
  • Next step. Try modern staggered-adoption DiD estimators, add prediction intervals to the synthetic control, or widen the donor pool — each is a natural extension of the toolkit here.

13. Exercises

  1. Drop a donor. Re-fit VanillaSC after excluding the top-weighted donor (JAMBI_D01). Does the post-2005 gap shrink, hold, or grow relative to the headline +18.3%? What does the answer tell you about the counterfactual’s robustness?
  2. Cutoff sensitivity. Recompute the recovery effect’s Conley-HAC standard error at cutoffs of {0, 50, 150, 300} km. At which cutoff, if any, does the recovery effect’s significance change? Relate your answer to the cutoff figure in Section 8.
  3. Your own event study. Estimate the event study a second way with pyfixest’s factor syntax — pf.feols("gdp_growth ~ i(period, flooded, ref='baseline') | district_id + year", ...) — and check that its coefficients match the diff-diff version to four decimals. Why should two different libraries agree exactly?

14. References

  1. Heger, M. P., & Neumayer, E. (2019). The impact of the Indian Ocean tsunami on Aceh’s long-term economic growth. Journal of Development Economics, 141, 102365. https://doi.org/10.1016/j.jdeveco.2019.06.008
  2. Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies. Journal of the American Statistical Association, 105(490), 493–505.
  3. Conley, T. G. (1999). GMM estimation with cross-sectional dependence. Journal of Econometrics, 92(1), 1–45.
  4. Indonesia Database for Policy and Economic Research (INDO-DAPOER) and SUSENAS — World Bank / BPS-Statistics Indonesia. https://datacatalog.worldbank.org/
  5. DMSP-OLS Nighttime Lights — NOAA National Centers for Environmental Information. https://www.ncei.noaa.gov/
  6. Center for Satellite Based Crisis Information (ZKI), German Aerospace Center (DLR), and the Dartmouth Flood Observatory (inundation maps).
  7. pyfixest documentation — https://pyfixest.org/
  8. diff-diff documentation — https://github.com/igerber/diff-diff
  9. mlsynth documentation — https://github.com/jgreathouse9/mlsynth

This tutorial is a teaching replication built on synthetic data; see the data note in Section 1 and the reproduction audit in Section 11. The companion script.py regenerates every figure and table.


AI Podcast: Evaluating the Impact of Natural Disasters

Click play to load
0:00 0:00
Carlos Mendez
Carlos Mendez
Associate Professor of Development Economics

My research interests focus on the integration of development economics, spatial data science, and econometrics to better understand and inform the process of sustainable development across regions.

Related