<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Synthetic Control | Carlos Mendez</title><link>https://carlos-mendez.org/category/synthetic-control/</link><atom:link href="https://carlos-mendez.org/category/synthetic-control/index.xml" rel="self" type="application/rss+xml"/><description>Synthetic Control</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Sun, 26 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Synthetic Control</title><link>https://carlos-mendez.org/category/synthetic-control/</link></image><item><title>The Synthetic Control Method in Stata: Did California's Tobacco Tax Cut Smoking?</title><link>https://carlos-mendez.org/post/stata_sc/</link><pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_sc/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>In 1988, California voters approved &lt;strong>Proposition 99&lt;/strong>, a sweeping tobacco control initiative that raised cigarette taxes by 25 cents per pack and funded anti-smoking education campaigns. The law took effect in January 1989, making California one of the first US states to implement a comprehensive tobacco control program. But did it actually reduce cigarette consumption? And by how much?&lt;/p>
&lt;p>Answering this question is harder than it sounds. We cannot simply compare California&amp;rsquo;s cigarette sales before and after 1989, because national trends &amp;mdash; declining smoking rates, rising health awareness, federal regulations &amp;mdash; were already pushing sales downward everywhere. We need a credible &lt;strong>counterfactual&lt;/strong>: what would California&amp;rsquo;s cigarette sales have looked like &lt;em>without&lt;/em> Proposition 99?&lt;/p>
&lt;p>The &lt;strong>synthetic control method (SCM)&lt;/strong>, introduced by Abadie, Diamond, and Hainmueller (2010), solves this problem by constructing a weighted combination of untreated states that closely matches California&amp;rsquo;s pre-treatment cigarette sales trajectory. This &amp;ldquo;synthetic California&amp;rdquo; serves as the counterfactual, and the gap between actual California and its synthetic counterpart measures the causal effect of the policy.&lt;/p>
&lt;p>This tutorial walks through the complete SCM workflow in Stata using the &lt;code>synth2&lt;/code> package: from data exploration and baseline estimation, through three inference approaches (in-space placebo, in-time placebo, and leave-one-out robustness), to a final assessment of statistical significance.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Understand the synthetic control method and when it applies (single treated unit, aggregate data, long pre-treatment period)&lt;/li>
&lt;li>Construct a synthetic control for California using the &lt;code>synth2&lt;/code> command in Stata&lt;/li>
&lt;li>Assess pre-treatment fit quality using predictor balance tables, R-squared, and RMSE&lt;/li>
&lt;li>Interpret unit weights and predictor weights in the synthetic control&lt;/li>
&lt;li>Evaluate statistical significance using in-space placebo tests and Fisher exact p-values&lt;/li>
&lt;li>Validate results with in-time placebo tests and leave-one-out robustness checks&lt;/li>
&lt;li>Distinguish between ATT and ATE in the synthetic control framework&lt;/li>
&lt;/ul>
&lt;h3 id="methodological-roadmap">Methodological roadmap&lt;/h3>
&lt;p>The analysis follows a four-stage progression, from estimation to validation:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;39 states, 1970-2000&amp;lt;br/&amp;gt;Cigarette sales per capita&amp;quot;]
RAW[&amp;quot;&amp;lt;b&amp;gt;Raw Trends&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;California vs. donor pool average&amp;quot;]
SCM[&amp;quot;&amp;lt;b&amp;gt;Baseline SCM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Synthetic California from 5 donor states&amp;lt;br/&amp;gt;ATT = -19.0 packs&amp;quot;]
SPACE[&amp;quot;&amp;lt;b&amp;gt;In-Space Placebo&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Apply SCM to each control state&amp;lt;br/&amp;gt;p = 0.026&amp;quot;]
TIME[&amp;quot;&amp;lt;b&amp;gt;In-Time Placebo&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Fake treatment at 1985&amp;lt;br/&amp;gt;Confirms no spurious effect&amp;quot;]
LOO[&amp;quot;&amp;lt;b&amp;gt;Leave-One-Out&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Exclude each donor state&amp;lt;br/&amp;gt;Estimates remain stable&amp;quot;]
DATA --&amp;gt; RAW
RAW --&amp;gt; SCM
SCM --&amp;gt; SPACE
SCM --&amp;gt; TIME
SCM --&amp;gt; LOO
style DATA fill:#6a9bcc,stroke:#141413,color:#fff
style RAW fill:#6a9bcc,stroke:#141413,color:#fff
style SCM fill:#d97757,stroke:#141413,color:#fff
style SPACE fill:#00d4c8,stroke:#141413,color:#fff
style TIME fill:#00d4c8,stroke:#141413,color:#fff
style LOO fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The baseline SCM (orange) produces the core treatment effect estimate. The three inference tools (teal) each test the estimate&amp;rsquo;s credibility from a different angle: the in-space placebo asks &amp;ldquo;is this effect unusual compared to other states?&amp;rdquo;, the in-time placebo asks &amp;ldquo;does a fake treatment produce similar results?&amp;rdquo;, and the leave-one-out analysis asks &amp;ldquo;does any single donor state drive the results?&amp;rdquo;&lt;/p>
&lt;hr>
&lt;h2 id="2-study-design">2. Study design&lt;/h2>
&lt;h3 id="the-policy-intervention">The policy intervention&lt;/h3>
&lt;p>California&amp;rsquo;s Proposition 99 was a ballot initiative that:&lt;/p>
&lt;ul>
&lt;li>Raised the state cigarette tax by &lt;strong>25 cents per pack&lt;/strong> (from 10 to 35 cents)&lt;/li>
&lt;li>Earmarked revenue for &lt;strong>anti-smoking education&lt;/strong>, health services, and environmental programs&lt;/li>
&lt;li>Went into effect on &lt;strong>January 1, 1989&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>This makes 1989 the treatment date, with 1970&amp;ndash;1988 as the pre-treatment period and 1989&amp;ndash;2000 as the post-treatment period.&lt;/p>
&lt;h3 id="why-synthetic-control">Why synthetic control?&lt;/h3>
&lt;p>Standard methods like difference-in-differences require a &lt;strong>parallel trends assumption&lt;/strong> &amp;mdash; that treated and control units would have followed similar trajectories absent the treatment. With only one treated unit (California) and aggregate state-level data, this assumption is hard to test. The SCM instead constructs an explicit counterfactual by finding optimal weights for donor states, and the quality of the match is directly observable in the pre-treatment period.&lt;/p>
&lt;h3 id="variables">Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Role&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>state&lt;/code>&lt;/td>
&lt;td>State identifier (1&amp;ndash;39)&lt;/td>
&lt;td>Panel unit&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>year&lt;/code>&lt;/td>
&lt;td>Year (1970&amp;ndash;2000)&lt;/td>
&lt;td>Time variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>cigsale&lt;/code>&lt;/td>
&lt;td>Cigarette sales per capita (packs)&lt;/td>
&lt;td>Outcome&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>lnincome&lt;/code>&lt;/td>
&lt;td>Log personal income per capita&lt;/td>
&lt;td>Predictor&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>age15to24&lt;/code>&lt;/td>
&lt;td>% population aged 15&amp;ndash;24&lt;/td>
&lt;td>Predictor&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>retprice&lt;/code>&lt;/td>
&lt;td>Average retail cigarette price&lt;/td>
&lt;td>Predictor&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>beer&lt;/code>&lt;/td>
&lt;td>Beer consumption per capita&lt;/td>
&lt;td>Predictor&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Estimand: ATT (Average Treatment Effect on the Treated).&lt;/strong> The SCM estimates the treatment effect specifically on California &amp;mdash; the one unit that received the intervention. It does not estimate what would happen if other states adopted similar policies (which would be the ATE). This distinction matters because California&amp;rsquo;s response may differ from other states due to its unique demographics, economy, and political environment.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="3-data-loading-and-exploration">3. Data loading and exploration&lt;/h2>
&lt;p>We begin by loading the dataset, declaring the panel structure, and examining the key variables. The dataset is publicly available from the QuarCS Lab data repository.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load the dataset
use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/isds/smoking_sc.dta&amp;quot;, clear
* Inspect variables
describe
* Summary statistics
summarize
* Declare panel structure
xtset state year
* Panel decomposition
xtsum
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 1,209 (Tobacco Sales in 39 US States)
Variables: 7
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
state | 1,209 20 11.25929 1 39
year | 1,209 1985 8.947973 1970 2000
cigsale | 1,209 118.8932 32.7674 40.7 296.2
lnincome | 1,014 9.861634 .1706769 9.397449 10.48662
beer | 546 23.4304 4.22319 2.5 40.4
age15to24 | 819 .175472 .0151589 .1294482 .2036753
retprice | 1,209 108.3419 64.38199 27.3 351.2
Panel variable: state (strongly balanced)
Time variable: year, 1970 to 2000
&lt;/code>&lt;/pre>
&lt;p>The panel is &lt;strong>strongly balanced&lt;/strong> &amp;mdash; all 39 states are observed in every year from 1970 to 2000, giving 1,209 total observations. Cigarette sales average 118.9 packs per capita with substantial variation (SD = 32.8, range 40.7 to 296.2). The between-state variation (SD = 26.5) exceeds the within-state variation (SD = 19.7), reflecting persistent differences in smoking culture across states. Not all covariates are available for the full panel &amp;mdash; beer consumption covers 14 years and age 15&amp;ndash;24 covers 21 years &amp;mdash; but the &lt;code>synth2&lt;/code> command handles this by averaging over the specified predictor window (1980&amp;ndash;1988).&lt;/p>
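&lt;p>To make the between/within decomposition concrete, here is an illustrative Python sketch (a made-up three-state panel, not the actual smoking data) of the quantities that &lt;code>xtsum&lt;/code> reports:&lt;/p>
&lt;pre>&lt;code class="language-python">from statistics import mean, pstdev

# Toy balanced panel: 3 units observed over 4 years (illustrative values).
panel = {
    'A': [100, 95, 90, 85],
    'B': [140, 138, 135, 130],
    'C': [60, 62, 58, 55],
}

unit_means = {u: mean(series) for u, series in panel.items()}

# Between variation: dispersion of the unit-level means.
between_sd = pstdev(unit_means.values())

# Within variation: dispersion of each observation around its own unit mean.
deviations = [y - unit_means[u] for u, series in panel.items() for y in series]
within_sd = pstdev(deviations)

print(round(between_sd, 2), round(within_sd, 2))
&lt;/code>&lt;/pre>
&lt;p>Stata&amp;rsquo;s &lt;code>xtsum&lt;/code> reports this same decomposition (up to degrees-of-freedom conventions) for every variable in the panel.&lt;/p>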
&lt;p>Next, we identify California&amp;rsquo;s numeric code in the dataset using the value label:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Identify California's state code
label list
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">state:
1 Alabama
2 Arkansas
3 California
4 Colorado
5 Connecticut
...
39 Wyoming
&lt;/code>&lt;/pre>
&lt;p>California is encoded as &lt;strong>state == 3&lt;/strong>. This identifier is required for the &lt;code>trunit()&lt;/code> option in &lt;code>synth2&lt;/code>. With the data structure confirmed, we can now visualize California&amp;rsquo;s cigarette sales trajectory relative to the rest of the country.&lt;/p>
&lt;hr>
&lt;h2 id="4-raw-trends-california-vs-the-donor-pool">4. Raw trends: California vs. the donor pool&lt;/h2>
&lt;p>Before applying the SCM, it helps to see how California compares to a simple average of all potential donor states. This motivates why a more sophisticated counterfactual is needed.&lt;/p>
&lt;pre>&lt;code class="language-stata">preserve
gen california = (state == 3)
collapse (mean) cigsale, by(year california)
twoway (connected cigsale year if california==1, ///
msymbol(O) mcolor(&amp;quot;106 155 204&amp;quot;) lcolor(&amp;quot;106 155 204&amp;quot;) ///
lwidth(medthick)) ///
(connected cigsale year if california==0, ///
msymbol(T) mcolor(&amp;quot;128 128 128&amp;quot;) lcolor(&amp;quot;128 128 128&amp;quot;) ///
lwidth(medium) lpattern(dash)), ///
xline(1989, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ytitle(&amp;quot;Cigarette Sales (packs per capita)&amp;quot;) xtitle(&amp;quot;Year&amp;quot;) ///
legend(order(1 &amp;quot;California&amp;quot; 2 &amp;quot;Donor Pool Average&amp;quot;) position(6)) ///
title(&amp;quot;Cigarette Sales: California vs. Donor Pool&amp;quot;) ///
graphregion(color(white)) plotregion(color(white))
graph export &amp;quot;stata_sc_raw_trends.png&amp;quot;, replace width(2400)
restore
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_sc_raw_trends.png" alt="Cigarette sales per capita for California (solid blue) versus the unweighted average of 38 control states (dashed grey), 1970-2000, with a vertical dashed orange line at 1989 marking Proposition 99.">&lt;/p>
&lt;p>Before 1989, California&amp;rsquo;s cigarette sales broadly tracked the donor pool average, though with some divergence in the early 1970s. After Proposition 99, California&amp;rsquo;s sales drop sharply while the average control state continues a more gradual decline. By 2000, the gap is visually striking. However, a simple unweighted average is a crude comparator &amp;mdash; it gives equal weight to states like Kentucky (213 packs per capita) and Utah (64 packs), even though their smoking patterns differ sharply from California&amp;rsquo;s. The SCM addresses this by finding an &lt;em>optimal&lt;/em> weighted combination of donor states that matches California&amp;rsquo;s pre-treatment trajectory as closely as possible.&lt;/p>
&lt;hr>
&lt;h2 id="5-the-synthetic-control-method">5. The synthetic control method&lt;/h2>
&lt;h3 id="core-idea">Core idea&lt;/h3>
&lt;p>The SCM constructs a &lt;strong>synthetic version&lt;/strong> of the treated unit as a weighted average of untreated units (the &amp;ldquo;donor pool&amp;rdquo;). Think of it like building a custom comparison group from scratch: instead of comparing California to any single state or a simple average, we blend several states together in proportions that best reproduce California&amp;rsquo;s pre-treatment cigarette sales and economic characteristics.&lt;/p>
&lt;h3 id="the-optimization-problem">The optimization problem&lt;/h3>
&lt;p>Formally, the SCM solves a nested optimization. The &lt;strong>outer problem&lt;/strong> finds predictor weights $v_m$ that determine how much each covariate matters for matching. The &lt;strong>inner problem&lt;/strong> finds unit weights $w_j$ that minimize the weighted distance between California and its synthetic counterpart:&lt;/p>
&lt;p>$$\min_{W} \sum_{m=1}^{M} v_m \left( X_{1m} - \sum_{j=2}^{J+1} w_j X_{jm} \right)^2$$&lt;/p>
&lt;p>In words, this equation minimizes the squared difference between California&amp;rsquo;s predictor values ($X_{1m}$) and the weighted average of donor states' predictor values ($\sum w_j X_{jm}$), where $v_m$ controls how much weight each predictor receives. The weights $w_j$ are constrained to be non-negative and sum to one, ensuring the synthetic control is a convex combination of real states.&lt;/p>
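&lt;p>A minimal sketch helps fix ideas. With only two donor units the weight simplex is one-dimensional, so the inner problem can be solved by a simple grid search (a real implementation such as &lt;code>synth2&lt;/code> uses quadratic programming). All numbers below are hypothetical, not the smoking data:&lt;/p>
&lt;pre>&lt;code class="language-python"># Treated unit's predictor values X_1m (M = 3 predictors, hypothetical).
x_treated = [120.0, 0.17, 90.0]
# Donor units' predictor values X_jm.
x_donor_a = [100.0, 0.15, 80.0]
x_donor_b = [150.0, 0.20, 105.0]
# Predictor weights v_m, held fixed here; the outer problem would tune them.
v = [1.0, 1000.0, 1.0]

def objective(w):
    # Weighted squared distance between treated and synthetic predictors,
    # where the synthetic unit is the convex combination w*a + (1-w)*b.
    total = 0.0
    for m in range(3):
        synthetic_m = w * x_donor_a[m] + (1 - w) * x_donor_b[m]
        total += v[m] * (x_treated[m] - synthetic_m) ** 2
    return total

grid = [i / 1000 for i in range(1001)]
w_star = min(grid, key=objective)
print(w_star)  # 0.6: the treated unit is an exact convex mix of the donors here
&lt;/code>&lt;/pre>
&lt;p>With more donors and predictors the same objective is minimized over the full simplex, but the logic &amp;mdash; convex weights chosen to reproduce the treated unit&amp;rsquo;s predictors &amp;mdash; is unchanged.&lt;/p>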
&lt;h3 id="the-treatment-effect">The treatment effect&lt;/h3>
&lt;p>Once the optimal weights $w_j^*$ are found, the estimated treatment effect at each post-treatment time $t$ is simply the gap between actual and synthetic outcomes:&lt;/p>
&lt;p>$$\hat{\tau}_t = Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt}$$&lt;/p>
&lt;p>In words, the treatment effect in year $t$ equals California&amp;rsquo;s actual cigarette sales minus the synthetic California&amp;rsquo;s predicted sales. A negative $\hat{\tau}_t$ means Proposition 99 &lt;em>reduced&lt;/em> cigarette sales relative to what they would have been without the policy. The average treatment effect over the post-treatment period (ATT) is simply the mean of all $\hat{\tau}_t$.&lt;/p>
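&lt;p>Once the weights are in hand, the effect calculation is mechanical. A short sketch with hypothetical weights and outcomes (Utah and Nevada appear only as placeholder names, not with their estimated weights):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical donor weights w_j* and outcome paths Y_jt (not the estimates).
w_star = {'Utah': 0.6, 'Nevada': 0.4}
y_treated = {1989: 82.4, 1990: 77.8}
y_donors = {
    'Utah':   {1989: 88.0, 1990: 86.0},
    'Nevada': {1989: 95.0, 1990: 92.0},
}

def synthetic_outcome(t):
    # Weighted average of donor outcomes in year t.
    return sum(w * y_donors[j][t] for j, w in w_star.items())

# Gap in each post-treatment year: tau_t = Y_1t - synthetic Y_t.
effects = {t: y_treated[t] - synthetic_outcome(t) for t in y_treated}

# ATT: mean gap over the post-treatment period.
att = sum(effects.values()) / len(effects)
print(round(att, 2))
&lt;/code>&lt;/pre>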
&lt;h3 id="key-assumptions">Key assumptions&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>No interference:&lt;/strong> Proposition 99 did not affect cigarette sales in other states (e.g., through cross-border shopping effects).&lt;/li>
&lt;li>&lt;strong>No anticipation:&lt;/strong> States in the donor pool did not implement similar policies during the pre-treatment period.&lt;/li>
&lt;li>&lt;strong>Good pre-treatment fit:&lt;/strong> The synthetic control closely reproduces California&amp;rsquo;s pre-1989 trajectory.&lt;/li>
&lt;/ol>
&lt;p>With the method established, let us now estimate the synthetic control for California.&lt;/p>
&lt;hr>
&lt;h2 id="6-baseline-synthetic-control-estimate">6. Baseline synthetic control estimate&lt;/h2>
&lt;p>The &lt;code>synth2&lt;/code> command performs the full SCM estimation. We specify seven predictors: four economic/demographic variables averaged over 1980&amp;ndash;1988, plus cigarette sales at three specific pre-treatment years (1975, 1980, 1988) to anchor the trajectory match.&lt;/p>
&lt;pre>&lt;code class="language-stata">synth2 cigsale lnincome age15to24 retprice beer ///
cigsale(1988) cigsale(1980) cigsale(1975), ///
trunit(3) trperiod(1989) xperiod(1980(1)1988) ///
nested allopt
&lt;/code>&lt;/pre>
&lt;p>The key options are:&lt;/p>
&lt;ul>
&lt;li>&lt;code>trunit(3)&lt;/code> &amp;mdash; treated unit is California (state == 3)&lt;/li>
&lt;li>&lt;code>trperiod(1989)&lt;/code> &amp;mdash; treatment begins in 1989&lt;/li>
&lt;li>&lt;code>xperiod(1980(1)1988)&lt;/code> &amp;mdash; average covariates over 1980&amp;ndash;1988 for matching&lt;/li>
&lt;li>&lt;code>nested&lt;/code> &amp;mdash; use nested optimization (outer V-weights, inner W-weights)&lt;/li>
&lt;li>&lt;code>allopt&lt;/code> &amp;mdash; try multiple starting values to avoid local optima&lt;/li>
&lt;/ul>
&lt;h3 id="pre-treatment-fit">Pre-treatment fit&lt;/h3>
&lt;pre>&lt;code class="language-text">Fitting results in the pretreatment periods:
Treated Unit: California Treatment Time: 1989
Number of Control Units = 38 Root Mean Squared Error = 1.75567
Number of Covariates = 7 R-squared = 0.97434
&lt;/code>&lt;/pre>
&lt;p>The synthetic control explains &lt;strong>97.4% of the pre-treatment variation&lt;/strong> in California&amp;rsquo;s cigarette sales (R-squared = 0.974), with a Root Mean Squared Error (RMSE) of just 1.76 packs per capita. This is an excellent fit &amp;mdash; the synthetic California closely reproduces the real California&amp;rsquo;s trajectory over the 19 pre-treatment years.&lt;/p>
&lt;h3 id="predictor-balance">Predictor balance&lt;/h3>
&lt;pre>&lt;code class="language-text"> Covariate | V.weight Treated Synthetic Control Average Control
lnincome | 0.0000 10.0766 9.8588 -2.16% 9.8292 -2.45%
age15to24 | 0.5459 0.1735 0.1735 -0.01% 0.1725 -0.59%
retprice | 0.0174 89.4222 89.4108 -0.01% 87.2661 -2.41%
beer | 0.0031 24.2800 24.2278 -0.21% 23.6553 -2.57%
cigsale(1988) | 0.0049 90.1000 91.6677 1.74% 113.8237 26.33%
cigsale(1980) | 0.0066 120.2000 120.5017 0.25% 138.0895 14.88%
cigsale(1975) | 0.4221 127.1000 127.1112 0.01% 136.9316 7.74%
&lt;/code>&lt;/pre>
&lt;p>All seven predictor biases between California and its synthetic counterpart are below 2.2%, with five below 0.3%. Compare this to the simple average control, which shows biases up to 26.3% for 1988 cigarette sales. The SCM dramatically improves the match. The two dominant V-weights are &lt;strong>age 15&amp;ndash;24&lt;/strong> (0.546) and &lt;strong>cigarette sales in 1975&lt;/strong> (0.422), meaning these predictors drive the matching optimization. Log income receives essentially zero weight, suggesting it contributes little to distinguishing California from its synthetic counterpart.&lt;/p>
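&lt;p>The bias columns in the table can be reproduced by hand. Assuming the bias is the percentage deviation of the (synthetic or average) control from the treated value, the &lt;code>cigsale(1988)&lt;/code> row works out as follows:&lt;/p>
&lt;pre>&lt;code class="language-python"># Values from the cigsale(1988) row of the balance table.
treated = 90.1
synthetic = 91.6677
average_control = 113.8237

def pct_bias(control, treated):
    # Percentage deviation of a control value from the treated value.
    return 100 * (control - treated) / treated

print(round(pct_bias(synthetic, treated), 2))        # matches the 1.74% in the table
print(round(pct_bias(average_control, treated), 2))  # matches the 26.33% in the table
&lt;/code>&lt;/pre>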
&lt;p>&lt;img src="stata_sc_weight_vars.png" alt="Predictor (V-matrix) weights showing the relative importance of each covariate in the SCM optimization.">&lt;/p>
&lt;h3 id="unit-weights-who-makes-up-synthetic-california">Unit weights: who makes up synthetic California?&lt;/h3>
&lt;pre>&lt;code class="language-text">Optimal Unit Weights:
Unit | U.weight
Utah | 0.3340
Nevada | 0.2350
Montana | 0.2020
Colorado | 0.1610
Connecticut | 0.0680
&lt;/code>&lt;/pre>
&lt;p>Only &lt;strong>five of 38&lt;/strong> donor states receive positive weight. Synthetic California is one-third Utah (33.4%), about one-quarter Nevada (23.5%), one-fifth Montana (20.2%), with Colorado (16.1%) and Connecticut (6.8%) making up the rest. All 33 other states receive exactly zero weight. This sparsity is typical of SCM &amp;mdash; the method selects states with the most similar pre-treatment trajectories, not necessarily the most geographically proximate ones.&lt;/p>
&lt;p>&lt;img src="stata_sc_weight_unit.png" alt="Bar chart of donor state weights showing the five states that compose synthetic California.">&lt;/p>
&lt;h3 id="treatment-effects">Treatment effects&lt;/h3>
&lt;pre>&lt;code class="language-text"> Time | Actual Outcome Synthetic Outcome Treatment Effect
1989 | 82.4000 89.9945 -7.5945
1990 | 77.8000 87.5039 -9.7039
1993 | 63.4000 81.1897 -17.7897
1997 | 53.8000 77.7123 -23.9123
1999 | 47.2000 73.5711 -26.3711
2000 | 41.6000 67.3550 -25.7550
Mean | 60.3500 79.3518 -19.0018
&lt;/code>&lt;/pre>
&lt;p>The treatment effect grows progressively from &lt;strong>-7.6 packs&lt;/strong> in 1989 to &lt;strong>-26.4 packs&lt;/strong> by 1999, with the average over all 12 post-treatment years equaling &lt;strong>-19.0 packs per capita&lt;/strong>. By 2000, California&amp;rsquo;s actual sales (41.6 packs) were 25.8 packs below the synthetic counterfactual (67.4 packs) &amp;mdash; a 38% reduction. The widening gap suggests that the tobacco control program&amp;rsquo;s impact compounded over time, consistent with cumulative behavioral change and declining social acceptability of smoking.&lt;/p>
&lt;p>&lt;img src="stata_sc_pred.png" alt="California&amp;rsquo;s actual cigarette sales versus synthetic California, 1970-2000, showing close pre-treatment overlap and post-1989 divergence.">&lt;/p>
&lt;p>&lt;img src="stata_sc_eff.png" alt="Treatment effect (gap between actual and synthetic California) over time, showing the negative effect deepening through the 1990s.">&lt;/p>
&lt;p>The &lt;code>pred&lt;/code> graph confirms the excellent pre-treatment fit: the two lines are nearly indistinguishable from 1970 to 1988. After 1989, actual California falls sharply below the synthetic control. The &lt;code>eff&lt;/code> graph shows this gap growing steadily, plateauing around -25 to -26 packs in 1999&amp;ndash;2000. But is this effect &amp;ldquo;real&amp;rdquo; or could it be a statistical artifact? The next three sections test this question using placebo tests and robustness checks.&lt;/p>
&lt;hr>
&lt;h2 id="7-in-space-placebo-test">7. In-space placebo test&lt;/h2>
&lt;h3 id="concept">Concept&lt;/h3>
&lt;p>The in-space placebo test is the primary inference tool for SCM. The idea is simple: apply the same SCM procedure to every control state, pretending each one is &amp;ldquo;treated&amp;rdquo; in 1989. If California&amp;rsquo;s estimated effect is unusually large compared to these placebo effects, we have evidence of a genuine policy impact rather than a chance occurrence.&lt;/p>
&lt;p>Think of it as a &lt;strong>permutation test&lt;/strong>: if we randomly assigned the &amp;ldquo;treatment&amp;rdquo; label to any state, how often would we see an effect as large as California&amp;rsquo;s? If the answer is &amp;ldquo;rarely,&amp;rdquo; the effect is statistically significant.&lt;/p>
&lt;pre>&lt;code class="language-stata">synth2 cigsale lnincome age15to24 retprice beer ///
cigsale(1988) cigsale(1980) cigsale(1975), ///
trunit(3) trperiod(1989) xperiod(1980(1)1988) ///
nested placebo(unit cut(2)) sigf(6)
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>placebo(unit)&lt;/code> option runs the SCM for each control state. The &lt;code>cut(2)&lt;/code> filter excludes states whose pre-treatment MSPE is more than twice California&amp;rsquo;s, removing states with poor pre-treatment fit that would distort the comparison. The &lt;code>sigf(6)&lt;/code> option uses 6 significant figures for convergence (slightly relaxed from the default 7 to ensure all 38 optimizations converge).&lt;/p>
&lt;h3 id="mspe-ratio-ranking">MSPE ratio ranking&lt;/h3>
&lt;p>The post/pre Mean Squared Prediction Error (MSPE) ratio measures how much worse a state&amp;rsquo;s fit becomes after 1989 relative to before. A state with a genuine treatment effect should have a large ratio &amp;mdash; its post-treatment gap dwarfs its pre-treatment fit.&lt;/p>
&lt;pre>&lt;code class="language-text"> Unit | Pre MSPE Post MSPE Post/Pre MSPE
California | 3.1668 391.2533 123.5490
Georgia | 1.4610 116.8893 80.0074
Virginia | 2.7825 219.8136 78.9994
Missouri | 1.2009 85.1794 70.9308
Texas | 4.6691 239.8559 51.3707
&lt;/code>&lt;/pre>
&lt;p>California&amp;rsquo;s MSPE ratio of &lt;strong>123.5&lt;/strong> is the highest among all states &amp;mdash; far exceeding Georgia (80.0), Virginia (79.0), and Missouri (70.9). This means California&amp;rsquo;s post-treatment deterioration in fit is the most extreme in the entire donor pool, consistent with a genuine policy effect.&lt;/p>
&lt;p>&lt;img src="stata_sc_ratio_pboUnit.png" alt="Bar chart ranking all states by their post/pre MSPE ratio, with California at the top.">&lt;/p>
&lt;h3 id="statistical-significance">Statistical significance&lt;/h3>
&lt;pre>&lt;code class="language-text">Note: (1) Using all control units, the probability of obtaining a
post/pretreatment MSPE ratio as large as California's is 0.0256.
(2) Excluding control units with pretreatment MSPE 2 times larger
than the treated unit, the probability is 0.0500.
&lt;/code>&lt;/pre>
&lt;p>Using all 39 states, the probability of California&amp;rsquo;s MSPE ratio occurring by chance is &lt;strong>p = 0.026&lt;/strong> (1/39). After applying the &lt;code>cut(2)&lt;/code> filter &amp;mdash; which removes 19 donor states with pre-treatment MSPE more than twice California&amp;rsquo;s, leaving 19 comparable donors plus California (20 units in total) &amp;mdash; the p-value is &lt;strong>p = 0.050&lt;/strong> (1/20).&lt;/p>
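&lt;p>The mechanics behind these p-values are easy to sketch. Given pre- and post-treatment gap series for the treated unit and each placebo, the Fisher exact p-value is the share of units whose post/pre MSPE ratio is at least as large as the treated unit&amp;rsquo;s. The gap series below are illustrative:&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative gap series: (pre-treatment gaps, post-treatment gaps).
gaps = {
    'CA': ([0.5, -0.8, 0.3], [-8.0, -15.0, -22.0]),  # plays the treated unit
    'S1': ([1.0, 0.6, -1.2], [2.0, -3.0, 1.5]),
    'S2': ([0.4, -0.5, 0.9], [-1.0, 2.5, -2.0]),
}

def mspe(series):
    # Mean squared prediction error of a gap series.
    return sum(g ** 2 for g in series) / len(series)

ratios = {u: mspe(post) / mspe(pre) for u, (pre, post) in gaps.items()}

# Fisher exact p-value: share of units with a ratio at least as
# extreme as the treated unit's.
p_value = sum(r &amp;gt;= ratios['CA'] for r in ratios.values()) / len(ratios)
print(p_value)  # 1/3 with 3 units, mirroring 1/39 = 0.026 with 39 states
&lt;/code>&lt;/pre>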
&lt;h3 id="pointwise-p-values">Pointwise p-values&lt;/h3>
&lt;p>The left-sided p-values (appropriate because the treatment effect is negative) show significance at the 5% level in 8 of 12 post-treatment years:&lt;/p>
&lt;pre>&lt;code class="language-text"> Time | Treatment Effect Left-sided p-value
1989 | -7.4201 0.0500
1990 | -9.5789 0.1000
1991 | -13.2182 0.1500
1992 | -13.9061 0.1000
1993 | -17.6228 0.0500
1997 | -23.8174 0.0500
2000 | -25.5478 0.0500
&lt;/code>&lt;/pre>
&lt;p>The four years with weaker significance (1990&amp;ndash;1992 and 1998, with p = 0.10&amp;ndash;0.15) reflect periods when the treatment effect was smaller in magnitude. From 1993 onward, California ranks as the most extreme state in most years, with 1998 (p = 0.10) being the sole late-period exception.&lt;/p>
&lt;p>&lt;img src="stata_sc_eff_pboUnit.png" alt="Spaghetti plot of treatment effects for all states, with California&amp;rsquo;s line (bold) standing out as a clear negative outlier among the grey placebo lines.">&lt;/p>
&lt;p>&lt;img src="stata_sc_pvalLeft_pboUnit.png" alt="Left-sided Fisher exact p-values over time, showing p = 0.05 in most post-treatment years.">&lt;/p>
&lt;p>The spaghetti plot provides the most intuitive visual: California&amp;rsquo;s treatment effect trajectory (the bold line plunging downward) is a dramatic outlier compared to the tight band of placebo effects hovering near zero. This visual evidence, combined with the formal p-values, supports the conclusion that Proposition 99 genuinely reduced cigarette sales. Next, we test whether the model would detect a spurious effect at a fake treatment date.&lt;/p>
&lt;hr>
&lt;h2 id="8-in-time-placebo-test">8. In-time placebo test&lt;/h2>
&lt;h3 id="concept-1">Concept&lt;/h3>
&lt;p>The in-time placebo test checks the model&amp;rsquo;s internal validity by assigning a &lt;strong>fake treatment date&lt;/strong> before the actual intervention. If the model is well-specified, it should find &lt;strong>no significant effect&lt;/strong> at the fake date &amp;mdash; and only detect the real effect after 1989.&lt;/p>
&lt;p>We choose 1985 as the fake treatment year (four years before the actual policy). This requires two modifications to the baseline specification: (1) drop &lt;code>cigsale(1988)&lt;/code> from the predictors because it would be &amp;ldquo;post-treatment&amp;rdquo; relative to the fake date, and (2) shorten the predictor averaging window to &lt;code>xperiod(1980(1)1984)&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-stata">synth2 cigsale lnincome age15to24 retprice beer ///
cigsale(1980) cigsale(1975), ///
trunit(3) trperiod(1989) xperiod(1980(1)1984) ///
nested placebo(period(1985))
&lt;/code>&lt;/pre>
&lt;h3 id="results">Results&lt;/h3>
&lt;pre>&lt;code class="language-text">In-time placebo test (fake treatment at 1985):
Time | Actual Outcome Synthetic Outcome Treatment Effect
1985 | 102.8000 106.1262 -3.3262
1986 | 99.7000 103.2850 -3.5850
1987 | 97.5000 106.1524 -8.6524
1988 | 90.1000 98.4873 -8.3873
Real treatment period (1989-2000):
1989 | 82.4000 96.5237 -14.1237
1994 | 58.6000 77.9078 -19.3078
2000 | 41.6000 67.1861 -25.5861
&lt;/code>&lt;/pre>
&lt;p>During the fake treatment period (1985&amp;ndash;1988), the estimated effects range from &lt;strong>-3.3 to -8.7 packs&lt;/strong> &amp;mdash; substantially smaller than the post-1989 effects of &lt;strong>-14.1 to -25.6 packs&lt;/strong>. The fake-period effects are not exactly zero (averaging about -6.0 packs), reflecting the reduced pre-treatment fit when the training window is shortened from 9 years (1980&amp;ndash;1988) to 5 years (1980&amp;ndash;1984). This is confirmed by the lower R-squared (0.953 vs. 0.974 for the baseline). Despite this imperfection, the critical finding is clear: a &lt;strong>sharp discontinuity&lt;/strong> appears at 1989 &amp;mdash; the real treatment date &amp;mdash; where the effect jumps from -8.4 packs (1988) to -14.1 packs (1989) and continues deepening thereafter.&lt;/p>
&lt;p>&lt;img src="stata_sc_pred_pboTime1985.png" alt="California actual vs. synthetic California with the fake treatment date at 1985, showing the lines remain relatively close during 1985-1988 before diverging sharply after the real treatment in 1989.">&lt;/p>
&lt;p>&lt;img src="stata_sc_eff_pboTime1985.png" alt="Treatment effect over time for the in-time placebo, with fake treatment at 1985 marked. Small effects during 1985-1988 give way to large effects after 1989.">&lt;/p>
&lt;p>The in-time placebo validates that the model does not spuriously detect large effects in the pre-treatment period. The real policy impact begins precisely when we expect it &amp;mdash; at the onset of Proposition 99 in 1989. Next, we test whether the results depend on any single state in the donor pool.&lt;/p>
&lt;hr>
&lt;h2 id="9-leave-one-out-robustness">9. Leave-one-out robustness&lt;/h2>
&lt;h3 id="concept-2">Concept&lt;/h3>
&lt;p>The leave-one-out (LOO) analysis tests whether the treatment effect estimate is &lt;strong>driven by any single donor state&lt;/strong>. Since synthetic California is composed of only five states (with Utah alone accounting for 33.4%), it is important to verify that removing any one of them does not fundamentally change the results.&lt;/p>
&lt;p>The &lt;code>loo&lt;/code> option re-runs the SCM after excluding each weighted donor state one at a time:&lt;/p>
&lt;pre>&lt;code class="language-stata">synth2 cigsale lnincome age15to24 retprice beer ///
cigsale(1988) cigsale(1980) cigsale(1975), ///
trunit(3) trperiod(1989) xperiod(1980(1)1988) ///
nested loo frame(california) savegraph(california, replace)
&lt;/code>&lt;/pre>
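&lt;p>The minimum and maximum columns in a LOO table can be summarized by hand: for each year, take the smallest and largest effect across the re-estimated models. A sketch with illustrative per-iteration effects:&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative LOO results: effect series re-estimated after excluding
# one weighted donor at a time (values are made up).
loo_effects = {
    'Utah':    {1989: -6.0, 2000: -23.5},
    'Nevada':  {1989: -9.9, 2000: -28.3},
    'Montana': {1989: -7.1, 2000: -25.9},
}

years = [1989, 2000]

# Per-year envelope: (most negative, least negative) effect across runs.
envelope = {
    t: (min(run[t] for run in loo_effects.values()),
        max(run[t] for run in loo_effects.values()))
    for t in years
}
print(envelope)
&lt;/code>&lt;/pre>
&lt;p>A robust finding is one whose envelope stays well away from zero in every year.&lt;/p>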
&lt;h3 id="results-1">Results&lt;/h3>
&lt;pre>&lt;code class="language-text">Leave-one-out treatment effects:
Time | Treatment Effect Treatment Effect (LOO)
| Min Max
1989 | -7.3304 -9.9509 -5.9892
1994 | -22.0229 -24.7112 -20.0141
1997 | -23.9288 -30.6150 -17.9877
2000 | -25.6107 -28.3503 -23.4850
&lt;/code>&lt;/pre>
&lt;p>The treatment effect remains &lt;strong>consistently negative and substantial&lt;/strong> across all LOO iterations. For the year 2000, the baseline estimate is -25.6 packs and the LOO range is [-28.4, -23.5] &amp;mdash; a spread of 4.9 packs, or about 19% of the baseline estimate. The widest variation occurs in 1997, where the LOO range spans from -30.6 to -18.0 (a 12.6-pack spread), likely driven by removing Nevada (the second-largest weighted state). Critically, &lt;strong>no LOO iteration produces a treatment effect near zero&lt;/strong> in any year, confirming that the finding of a large negative effect is not an artifact of any single donor state.&lt;/p>
&lt;p>&lt;img src="stata_sc_loo_combined.png" alt="Combined multi-panel leave-one-out graph showing that synthetic California&amp;rsquo;s prediction remains similar regardless of which donor state is excluded.">&lt;/p>
&lt;p>The LOO analysis provides the final piece of evidence: the treatment effect is robust to perturbations in the donor pool composition. With three independent validation checks (in-space placebo, in-time placebo, and LOO) all supporting the baseline finding, we can proceed with confidence to the discussion.&lt;/p>
&lt;hr>
&lt;h2 id="10-discussion">10. Discussion&lt;/h2>
&lt;h3 id="answering-the-case-study-question">Answering the case study question&lt;/h3>
&lt;p>&lt;strong>Did California&amp;rsquo;s Proposition 99 reduce cigarette consumption?&lt;/strong> The evidence strongly suggests yes. The SCM estimates that Proposition 99 reduced California&amp;rsquo;s per capita cigarette sales by an average of &lt;strong>19.0 packs per capita per year&lt;/strong> over the 12-year post-treatment period. This effect was not instantaneous but grew progressively, from -7.6 packs in 1989 to -26.4 packs by 1999 &amp;mdash; consistent with cumulative behavioral change as anti-smoking campaigns took hold and social norms shifted.&lt;/p>
&lt;p>To put this in perspective: California&amp;rsquo;s actual cigarette sales in 2000 were 41.6 packs per capita, while the synthetic control predicts they would have been 67.4 packs without the policy. That is a &lt;strong>38% reduction&lt;/strong> &amp;mdash; roughly 26 fewer packs per person per year. For a state with approximately 34 million residents in 2000, this translates to nearly &lt;strong>900 million fewer packs&lt;/strong> sold annually.&lt;/p>
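The back-of-the-envelope arithmetic behind these figures is easy to reproduce (a sketch using the rounded values quoted in the text):

```python
actual_2000 = 41.6       # packs per capita, observed
synthetic_2000 = 67.4    # packs per capita, counterfactual
population = 34e6        # approximate California residents in 2000

gap = synthetic_2000 - actual_2000   # packs per capita avoided
reduction = gap / synthetic_2000     # proportional reduction
total_packs = gap * population       # statewide packs per year

print(f'Gap: {gap:.1f} packs per capita ({reduction:.0%} reduction)')
print(f'Statewide: about {total_packs / 1e6:.0f} million fewer packs per year')
```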
&lt;h3 id="statistical-significance-1">Statistical significance&lt;/h3>
&lt;p>The in-space placebo test yields a p-value of 0.026 (using all 39 states) or 0.050 (after filtering to states with comparable pre-treatment fit). While the filtered p-value sits exactly at the conventional 5% threshold &amp;mdash; a consequence of having only 20 qualifying comparison units &amp;mdash; the unfiltered p-value is well below 5%, and the visual evidence from the spaghetti plot leaves little doubt that California is an outlier.&lt;/p>
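The p-values quoted here follow directly from the rank of California's MSPE ratio under the permutation logic. A minimal sketch of the calculation (not synth2's internal code):

```python
# Permutation p-value: rank of the treated unit's post/pre MSPE ratio
# among all units (treated + placebos), divided by the number of units.
def permutation_pvalue(rank, n_units):
    """p = rank / n_units, where rank 1 means the most extreme ratio."""
    return rank / n_units

# California has the most extreme ratio (rank 1) in both samples.
p_all = permutation_pvalue(1, 39)       # all 39 states
p_filtered = permutation_pvalue(1, 20)  # after the cut(2) filter

print(f'Unfiltered p-value: {p_all:.3f}')    # 0.026
print(f'Filtered p-value:   {p_filtered:.3f}')  # 0.050
```

This also makes the granularity limit explicit: with only 20 qualifying units, 0.05 is the smallest p-value any treated unit can attain.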
&lt;h3 id="robustness">Robustness&lt;/h3>
&lt;p>Three independent checks support the baseline finding:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Validation approach&lt;/th>
&lt;th>Key result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>In-space placebo&lt;/td>
&lt;td>California&amp;rsquo;s MSPE ratio (123.5) is the largest among 39 states&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>In-time placebo&lt;/td>
&lt;td>Fake 1985 effects (-3 to -9 packs) are roughly one-quarter to one-half the size of real post-1989 effects (-14 to -26 packs)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Leave-one-out&lt;/td>
&lt;td>Year 2000 effect ranges from -23.5 to -28.4 packs across all LOO iterations&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="implications-for-policymakers">Implications for policymakers&lt;/h3>
&lt;p>This analysis provides evidence that comprehensive tobacco control programs &amp;mdash; combining tax increases with funded anti-smoking campaigns &amp;mdash; can produce large and sustained reductions in cigarette consumption. The growing effect over time suggests that the program&amp;rsquo;s benefits compound, potentially through intergenerational effects (fewer young people starting to smoke) and reinforcing social norms. These findings have influenced subsequent tobacco control policies in other US states and internationally.&lt;/p>
&lt;hr>
&lt;h2 id="11-summary-and-key-takeaways">11. Summary and key takeaways&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Proposition 99 reduced California&amp;rsquo;s cigarette sales by 19.0 packs per capita (ATT).&lt;/strong> The effect grew from -7.6 packs in 1989 to -26.4 packs by 1999, representing a 38% reduction relative to the counterfactual by 2000. Comprehensive tobacco control programs with both taxation and education components can produce large, sustained behavioral change.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The synthetic control achieves excellent pre-treatment fit (R-squared = 0.974).&lt;/strong> With an RMSE of just 1.76 packs, the weighted combination of five donor states reproduces California&amp;rsquo;s pre-1989 trajectory almost perfectly. This validates the counterfactual &amp;mdash; we can trust that the post-treatment divergence reflects the policy&amp;rsquo;s impact rather than pre-existing differences.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Only five states compose synthetic California, with Utah dominant at 33.4%.&lt;/strong> The SCM selects states by trajectory similarity, not geography. Nevada (23.5%), Montana (20.2%), Colorado (16.1%), and Connecticut (6.8%) complete the synthetic control. All 33 other states receive zero weight &amp;mdash; the method is inherently sparse.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>California&amp;rsquo;s effect is statistically significant (p = 0.026).&lt;/strong> The in-space placebo test shows that California&amp;rsquo;s post/pre MSPE ratio (123.5) is the highest among all 39 states. The probability of obtaining such an extreme ratio by chance is just 2.6%.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SCM inference is limited by the number of comparison units.&lt;/strong> With 20 qualifying states after the cut(2) filter, the smallest achievable p-value is 0.05 (1/20). Researchers should report both filtered and unfiltered p-values and acknowledge this inherent limitation of permutation-based inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The in-time placebo confirms no pre-existing trend.&lt;/strong> Fake effects at 1985 (-3 to -9 packs) are substantially smaller than real effects after 1989 (-14 to -26 packs), and a clear discontinuity appears at 1989. The non-zero fake effects reflect imperfect fit with a shortened training window, not a genuine pre-treatment effect.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>The analysis covers only the period through 2000. California&amp;rsquo;s long-term tobacco trajectory after 2000 may differ as other states adopted similar policies.&lt;/li>
&lt;li>The donor pool excludes states that implemented major tobacco control programs during the study period. If excluded states are systematically different from included ones, the counterfactual may be biased.&lt;/li>
&lt;li>SCM produces no standard errors or confidence intervals &amp;mdash; inference relies entirely on the placebo-based permutation approach.&lt;/li>
&lt;li>The five-state synthetic control is sensitive to the predictor specification. Changing the set of predictors or the averaging window can alter the donor weights and the ATT estimate (as seen in the in-time placebo specification, where the ATT shifts from -19.0 to -17.7).&lt;/li>
&lt;/ul>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;ul>
&lt;li>Apply the SCM to other states that implemented tobacco control programs after California (e.g., Massachusetts, Oregon)&lt;/li>
&lt;li>Explore &lt;strong>heterogeneous effects&lt;/strong> by analyzing how the treatment effect varies across post-treatment years using rolling-window or recursive estimations&lt;/li>
&lt;li>Compare SCM estimates with &lt;strong>difference-in-differences&lt;/strong> approaches applied to the same data&lt;/li>
&lt;li>Investigate &lt;strong>conformal inference&lt;/strong> methods (Chernozhukov et al., 2021) for formal confidence intervals in the SCM framework&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="12-exercises">12. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Modify the predictor set.&lt;/strong> Re-run the baseline SCM without &lt;code>beer&lt;/code> and &lt;code>age15to24&lt;/code>. How do the unit weights and the ATT change? Does the pre-treatment fit deteriorate? What does this tell you about the importance of predictor choice in SCM?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Change the MSPE filter.&lt;/strong> Re-run the in-space placebo test with &lt;code>cut(5)&lt;/code> instead of &lt;code>cut(2)&lt;/code> (retaining states with pre-MSPE up to 5 times California&amp;rsquo;s). How does the number of qualifying comparison units change? How does the p-value change? What are the trade-offs of a more inclusive vs. restrictive filter?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare with simple difference-in-differences.&lt;/strong> Estimate a two-way fixed effects (TWFE) regression of cigarette sales on a California-post-1989 interaction term with state and year fixed effects using all 39 states. How does the TWFE estimate compare to the SCM estimate of -19.0 packs? Which approach do you find more credible for this single-state policy evaluation, and why?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1198/jasa.2009.ap08746" target="_blank" rel="noopener">Abadie, A., Diamond, A. &amp;amp; Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California&amp;rsquo;s Tobacco Control Program. &lt;em>Journal of the American Statistical Association&lt;/em>, 105(490), 493&amp;ndash;505.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/jel.20191450" target="_blank" rel="noopener">Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. &lt;em>Journal of Economic Literature&lt;/em>, 59(2), 391&amp;ndash;425.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/yanyachen/synth2" target="_blank" rel="noopener">Yan, Y. &amp;amp; Chen, Z. (2023). &lt;code>synth2&lt;/code>: Synthetic Control Method with Placebo Tests, Robustness Test and Visualization. Stata Package.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.2021.1920957" target="_blank" rel="noopener">Chernozhukov, V., Wuthrich, K. &amp;amp; Zhu, Y. (2021). An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls. &lt;em>Journal of the American Statistical Association&lt;/em>, 116(536), 1849&amp;ndash;1864.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Synthetic Control with Prediction Intervals: Quantifying Uncertainty in Germany's Reunification Impact</title><link>https://carlos-mendez.org/post/python_scpi/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_scpi/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>When a policy affects an entire country, there is no untreated twin to compare it against. The &lt;strong>synthetic control method&lt;/strong> addresses this challenge by constructing an artificial counterfactual &amp;mdash; a weighted combination of similar units that mimics what the treated unit would have looked like without the intervention. Introduced by Abadie, Diamond, and Hainmueller (2010, 2015), this approach has become one of the most widely used tools in comparative case studies.&lt;/p>
&lt;p>Yet the classic synthetic control delivers only a &lt;strong>point estimate&lt;/strong>. Researchers see a gap between the treated unit and its synthetic counterpart, but they have no formal way to judge whether that gap reflects a real policy effect or just noise. Placebo tests &amp;mdash; which apply the method to untreated units to check whether false effects appear &amp;mdash; offer suggestive evidence, but they do not produce confidence intervals with well-defined coverage guarantees.&lt;/p>
&lt;p>Cattaneo, Feng, and Titiunik (2021) solve this problem by developing &lt;strong>prediction intervals for synthetic control methods&lt;/strong>. Their key insight is that uncertainty comes from two distinct sources. First, the weights themselves are estimated from a finite pre-treatment sample, so the synthetic control itself is uncertain. Second, the post-treatment world may deviate from the model in ways that pre-treatment data cannot predict. By quantifying both sources separately, the SCPI framework produces intervals with finite-sample coverage guarantees &amp;mdash; not just asymptotic approximations.&lt;/p>
&lt;p>In this tutorial, we apply the SCPI framework to a classic question in political economy: &lt;strong>Did German reunification in 1990 reduce West Germany&amp;rsquo;s GDP per capita, and how confident can we be in that estimate?&lt;/strong> Using GDP data for 17 countries from 1960 to 2003, we construct a synthetic West Germany, estimate the treatment effect, and &amp;mdash; crucially &amp;mdash; build prediction intervals that tell us whether the effect is statistically distinguishable from zero.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the logic of synthetic control: constructing a counterfactual from weighted donor units&lt;/li>
&lt;li>Implement point estimation and prediction intervals using the Python &lt;a href="https://nppackages.github.io/scpi/" target="_blank" rel="noopener">&lt;code>scpi_pkg&lt;/code>&lt;/a> package&lt;/li>
&lt;li>Distinguish the two sources of uncertainty in synthetic control predictions: in-sample (weight estimation) and out-of-sample (post-treatment misspecification)&lt;/li>
&lt;li>Construct and interpret prediction intervals with finite-sample coverage guarantees&lt;/li>
&lt;li>Compare alternative weight constraint methods (simplex, lasso, ridge, OLS) and assess their trade-offs&lt;/li>
&lt;li>Evaluate robustness through sensitivity analysis across confidence levels&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-synthetic-control-idea">2. The Synthetic Control Idea&lt;/h2>
&lt;p>The core intuition behind synthetic control is straightforward. Imagine you want to know how reunification changed West Germany&amp;rsquo;s economic trajectory. You cannot simply compare West Germany&amp;rsquo;s GDP after 1990 to its GDP before 1990, because many other factors &amp;mdash; global recessions, trade liberalization, technological change &amp;mdash; also affected the economy over that period.&lt;/p>
&lt;p>Instead, you build a &lt;strong>synthetic West Germany&lt;/strong>: a weighted average of other countries that, collectively, track West Germany&amp;rsquo;s GDP trajectory closely during the pre-reunification period (1960&amp;ndash;1990). If the synthetic version continues along a plausible path after 1990 while the actual West Germany diverges, the gap measures the causal effect of reunification.&lt;/p>
&lt;p>Think of it as building a custom control group from scratch. Rather than picking a single comparison country (which might differ from West Germany in important ways), you blend multiple countries together so that their weighted average resembles West Germany as closely as possible &amp;mdash; like mixing paints to match a target color.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
A[&amp;quot;West Germany&amp;lt;br/&amp;gt;(treated unit)&amp;quot;] --&amp;gt; B[&amp;quot;Pre-treatment GDP&amp;lt;br/&amp;gt;1960–1990&amp;quot;]
C[&amp;quot;16 Donor Countries&amp;lt;br/&amp;gt;(control pool)&amp;quot;] --&amp;gt; D[&amp;quot;Find weights w₁...w₁₆&amp;lt;br/&amp;gt;to match pre-treatment GDP&amp;quot;]
B --&amp;gt; D
D --&amp;gt; E[&amp;quot;Synthetic&amp;lt;br/&amp;gt;West Germany&amp;quot;]
E --&amp;gt; F[&amp;quot;Post-1990 gap =&amp;lt;br/&amp;gt;Treatment effect τ&amp;quot;]
A --&amp;gt; F
style A fill:#d97757,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#141413
style B fill:#1f2b5e,stroke:#6a9bcc,color:#c8d0e0
style D fill:#1f2b5e,stroke:#6a9bcc,color:#c8d0e0
&lt;/code>&lt;/pre>
&lt;p>Formally, the treatment effect at each post-treatment period $T$ is the difference between what we observe and the counterfactual:&lt;/p>
&lt;p>$$\tau_T = Y_{1T}(1) - Y_{1T}(0)$$&lt;/p>
&lt;p>In words, this equation says that the treatment effect $\tau_T$ equals the observed outcome $Y_{1T}(1)$ minus the counterfactual outcome $Y_{1T}(0)$ &amp;mdash; what West Germany&amp;rsquo;s GDP would have been without reunification. Since we cannot observe $Y_{1T}(0)$ directly, we estimate it using the synthetic control.&lt;/p>
&lt;p>The synthetic counterfactual prediction is a weighted sum of donor outcomes:&lt;/p>
&lt;p>$$\hat{Y}_{1T}(0) = \mathbf{x}_T' \hat{\mathbf{w}}$$&lt;/p>
&lt;p>Here, $\mathbf{x}_T$ is the vector of donor country GDP values at time $T$, and $\hat{\mathbf{w}}$ is the vector of estimated weights. In the classic formulation, these weights are non-negative and sum to one, making the synthetic control a &lt;em>convex combination&lt;/em> of real countries whose prediction always stays within the range of actual donor values. The next section explains why a point estimate alone is not enough.&lt;/p>
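To make the convex-combination property concrete, here is a toy check with made-up donor values and weights (not the estimated ones):

```python
import numpy as np

# Hypothetical donor outcomes at one time period and a candidate weight vector.
donor_gdp = np.array([20.1, 18.5, 22.0, 15.3])
w = np.array([0.5, 0.3, 0.2, 0.0])

# A valid simplex weight vector: non-negative and summing to one.
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

synthetic = donor_gdp @ w   # the counterfactual prediction x_T' w
print(f'Synthetic outcome: {synthetic:.2f}')

# Convexity guarantees the prediction lies within the donor range.
assert donor_gdp.min() <= synthetic <= donor_gdp.max()
```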
&lt;h2 id="3-why-point-estimates-are-not-enough">3. Why Point Estimates Are Not Enough&lt;/h2>
&lt;p>The classic synthetic control gives us a single number &amp;mdash; the estimated gap &amp;mdash; but no formal measure of how precise that estimate is. Cattaneo, Feng, and Titiunik (2021) show that this uncertainty comes from two separate sources, and both must be accounted for. Their framework generalizes the weight vector $\hat{\mathbf{w}}$ into a combined parameter vector $\boldsymbol{\beta}$ that can also include intercept or covariate adjustment coefficients. In our setup with no covariates, $\boldsymbol{\beta}$ reduces to $\mathbf{w}$.&lt;/p>
&lt;p>$$\hat{\tau}_T - \tau_T = \underbrace{\mathbf{p}_T'(\boldsymbol{\beta}_0 - \hat{\boldsymbol{\beta}})}_{\text{in-sample}} + \underbrace{e_T}_{\text{out-of-sample}}$$&lt;/p>
&lt;p>In words, this equation says that the error in our treatment effect estimate has two components. The first term, called &lt;strong>in-sample uncertainty&lt;/strong>, arises because we estimate the weights $\hat{\boldsymbol{\beta}}$ from a finite number of pre-treatment periods. With only 31 years of data to estimate 16 weights, there is inherent sampling variability. The true best-fitting weights $\boldsymbol{\beta}_0$ may differ from our estimates, and this difference propagates into the post-treatment prediction through $\mathbf{p}_T$ &amp;mdash; the vector of post-treatment donor outcomes (the same $\mathbf{x}_T$ from the previous equation when no additional covariates are used).&lt;/p>
&lt;p>The second term, &lt;strong>out-of-sample uncertainty&lt;/strong> ($e_T$), captures everything that the model cannot predict from pre-treatment data alone. Even if we knew the perfect weights, the post-reunification world might generate shocks &amp;mdash; structural breaks, unforeseen economic events &amp;mdash; that push the actual counterfactual away from our weighted prediction. This is analogous to forecasting: even the best model has a prediction error when projecting into the future.&lt;/p>
&lt;p>The SCPI framework constructs prediction intervals that account for both sources simultaneously. By bounding each component separately and combining them, the resulting intervals carry &lt;strong>finite-sample coverage guarantees&lt;/strong> &amp;mdash; they contain the true treatment effect with at least the stated probability, without relying on large-sample approximations. With this theoretical foundation in place, let us turn to the data.&lt;/p>
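The two-part decomposition can be illustrated with a deliberately simplified simulation: unconstrained least-squares weights, a residual bootstrap for the in-sample component, and residual quantiles for the out-of-sample shock. This is a stylized sketch of the idea, not the scpi algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stylized setup: 31 pre-treatment periods, 3 donors, linear weights.
T0 = 31
X_pre = rng.normal(size=(T0, 3)).cumsum(axis=0)   # trending donor series
w_true = np.array([0.5, 0.3, 0.2])
y_pre = X_pre @ w_true + rng.normal(scale=0.3, size=T0)

# Point estimate of the weights by least squares.
w_hat, *_ = np.linalg.lstsq(X_pre, y_pre, rcond=None)
resid = y_pre - X_pre @ w_hat
x_post = rng.normal(size=3)   # donor outcomes at one post-treatment period

# (1) In-sample uncertainty: resample residuals, re-estimate the weights,
#     and record how much the post-period prediction moves.
preds = []
for _ in range(500):
    y_b = X_pre @ w_hat + rng.choice(resid, size=T0, replace=True)
    w_b, *_ = np.linalg.lstsq(X_pre, y_b, rcond=None)
    preds.append(x_post @ w_b)
in_lo, in_hi = np.quantile(preds, [0.05, 0.95])

# (2) Out-of-sample uncertainty: bound the unpredictable shock e_T
#     by quantiles of the pre-period residuals.
e_lo, e_hi = np.quantile(resid, [0.05, 0.95])

# Combine the two bounds (additively, as a conservative sketch).
lo, hi = in_lo + e_lo, in_hi + e_hi
print(f'Interval for the counterfactual outcome: [{lo:.2f}, {hi:.2f}]')
```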
&lt;h2 id="4-setup-and-data">4. Setup and Data&lt;/h2>
&lt;p>We use the &lt;a href="https://nppackages.github.io/scpi/" target="_blank" rel="noopener">&lt;code>scpi_pkg&lt;/code>&lt;/a> Python package, which implements the methods from Cattaneo, Feng, and Titiunik (2021). The package provides four core functions: &lt;a href="https://nppackages.github.io/scpi/reference/scdata.html" target="_blank" rel="noopener">&lt;code>scdata()&lt;/code>&lt;/a> for data preparation, &lt;a href="https://nppackages.github.io/scpi/reference/scest.html" target="_blank" rel="noopener">&lt;code>scest()&lt;/code>&lt;/a> for point estimation, &lt;a href="https://nppackages.github.io/scpi/reference/scpi.html" target="_blank" rel="noopener">&lt;code>scpi()&lt;/code>&lt;/a> for prediction intervals, and &lt;a href="https://nppackages.github.io/scpi/reference/scplot.html" target="_blank" rel="noopener">&lt;code>scplot()&lt;/code>&lt;/a> for visualization.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Adapted from scpi_pkg illustration scripts:
# https://github.com/nppackages/scpi/tree/main/Python/scpi_illustration
from scpi_pkg.scdata import scdata
from scpi_pkg.scest import scest
from scpi_pkg.scpi import scpi
# Reproducibility
RANDOM_SEED = 8894
np.random.seed(RANDOM_SEED)
&lt;/code>&lt;/pre>
&lt;p>The dataset contains GDP per capita (in thousands of US dollars) for 17 countries from 1960 to 2003. West Germany is the treated unit, and the remaining 16 countries form the donor pool. The data is sourced from Abadie (2021), who used it to study the economic consequences of reunification.&lt;/p>
&lt;pre>&lt;code class="language-python">data = pd.read_csv(&amp;quot;data.csv&amp;quot;)
print(f&amp;quot;Shape: {data.shape}&amp;quot;)
print(f&amp;quot;Countries ({data['country'].nunique()}):&amp;quot;)
print(sorted(data['country'].unique()))
print(f&amp;quot;\nYear range: {data['year'].min()} – {data['year'].max()}&amp;quot;)
print(f&amp;quot;\nGDP per capita (thousand USD):&amp;quot;)
print(data['gdp'].describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Shape: (748, 11)
Countries (17):
['Australia', 'Austria', 'Belgium', 'Denmark', 'France', 'Greece', 'Italy', 'Japan', 'Netherlands', 'New Zealand', 'Norway', 'Portugal', 'Spain', 'Switzerland', 'UK', 'USA', 'West Germany']
Year range: 1960 – 2003
GDP per capita (thousand USD):
count 748.000
mean 12.144
std 8.952
min 0.707
25% 3.984
50% 10.258
75% 18.877
max 37.548
Name: gdp, dtype: float64
&lt;/code>&lt;/pre>
&lt;p>The dataset covers 748 observations across 17 countries and 44 years. GDP per capita ranges from \$707 (Portugal, early 1960s) to \$37,548 (Norway, early 2000s), with a mean of \$12,144. West Germany sits in the upper portion of this distribution, which means the synthetic control will need to weight richer countries more heavily. The panel is well suited for synthetic control analysis because it provides 31 pre-treatment years &amp;mdash; a substantial window for estimating donor weights accurately.&lt;/p>
&lt;h2 id="5-exploring-the-data">5. Exploring the Data&lt;/h2>
&lt;p>Before building a synthetic control, it helps to visualize how West Germany&amp;rsquo;s GDP trajectory compares to the donor pool. This reveals whether reunification produced a visible divergence and which countries might serve as good donors.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 6))
countries = sorted(data['country'].unique())
for country in countries:
    cdata = data[data['country'] == country]
    if country == 'West Germany':
        ax.plot(cdata['year'], cdata['gdp'], color='#d97757', linewidth=2.5,
                label='West Germany', zorder=10)
    else:
        ax.plot(cdata['year'], cdata['gdp'], color='#6a9bcc', alpha=0.3,
                linewidth=1)
ax.axvline(x=1990, color='#00d4c8', linestyle='--', linewidth=1.5, alpha=0.8,
           label='Reunification (1990)')
ax.set_xlabel('Year')
ax.set_ylabel('GDP per Capita (thousand USD)')
ax.set_title('GDP Trajectories: West Germany vs. Donor Pool')
ax.legend(loc='upper left')
plt.savefig(&amp;quot;scpi_gdp_trajectories.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_gdp_trajectories.png" alt="GDP trajectories of 17 countries from 1960 to 2003, with West Germany highlighted and a vertical line at 1990 marking reunification.">&lt;/p>
&lt;p>West Germany&amp;rsquo;s GDP (orange line) grows steadily from about \$2,300 in 1960 to \$20,500 by 1990, tracking closely with the upper cluster of industrialized nations. After reunification in 1990, the growth trajectory appears to flatten relative to several donor countries that continue climbing. This visual impression of slower post-reunification growth is exactly what the synthetic control method will test formally. The key question is whether this flattening is statistically significant or could be explained by normal economic variation across countries.&lt;/p>
&lt;h2 id="6-preparing-the-data-for-scpi">6. Preparing the Data for SCPI&lt;/h2>
&lt;p>The &lt;a href="https://nppackages.github.io/scpi/reference/scdata.html" target="_blank" rel="noopener">&lt;code>scdata()&lt;/code>&lt;/a> function structures the panel into the format required for estimation. We define the treatment period (reunification in 1991), the pre-treatment window (1960&amp;ndash;1990), and the donor pool. The &lt;code>cointegrated_data=True&lt;/code> flag tells the estimator that GDP series are likely &lt;em>non-stationary&lt;/em> &amp;mdash; meaning they drift upward over time rather than fluctuating around a fixed level. When multiple series share a common upward drift (a &lt;em>stochastic trend&lt;/em>), they are said to be &lt;em>cointegrated&lt;/em>. Setting this flag ensures the method accounts for this shared trend when estimating weights, rather than assuming each country&amp;rsquo;s GDP fluctuates around a constant mean.&lt;/p>
&lt;pre>&lt;code class="language-python">id_var = 'country'
outcome_var = 'gdp'
time_var = 'year'
period_pre = np.arange(1960, 1991) # 1960–1990 (31 years)
period_post = np.arange(1991, 2004) # 1991–2003 (13 years)
unit_tr = 'West Germany'
unit_co = [c for c in sorted(data[id_var].unique()) if c != unit_tr]
print(f&amp;quot;Treated unit: {unit_tr}&amp;quot;)
print(f&amp;quot;Donor pool ({len(unit_co)} countries): {unit_co}&amp;quot;)
print(f&amp;quot;Pre-treatment period: {period_pre[0]}–{period_pre[-1]} ({len(period_pre)} years)&amp;quot;)
print(f&amp;quot;Post-treatment period: {period_post[0]}–{period_post[-1]} ({len(period_post)} years)&amp;quot;)
data_prep = scdata(df=data, id_var=id_var, time_var=time_var,
                   outcome_var=outcome_var, period_pre=period_pre,
                   period_post=period_post, unit_tr=unit_tr,
                   unit_co=unit_co, features=None, cov_adj=None,
                   cointegrated_data=True, constant=False)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treated unit: West Germany
Donor pool (16 countries): ['Australia', 'Austria', 'Belgium', 'Denmark', 'France', 'Greece', 'Italy', 'Japan', 'Netherlands', 'New Zealand', 'Norway', 'Portugal', 'Spain', 'Switzerland', 'UK', 'USA']
Pre-treatment period: 1960–1990 (31 years)
Post-treatment period: 1991–2003 (13 years)
&lt;/code>&lt;/pre>
&lt;p>The prepared data object contains 31 pre-treatment observations per country and 13 post-treatment observations. With 16 donor countries available, the simplex constraint (weights summing to one) ensures a well-defined convex combination. Setting &lt;code>cointegrated_data=True&lt;/code> is important here because GDP series share a common upward trend driven by global economic growth, and treating them as stationary would distort the weight estimation. Now that the data is structured, we can proceed to estimating the synthetic control weights.&lt;/p>
&lt;h2 id="7-point-estimation-building-synthetic-west-germany">7. Point Estimation: Building Synthetic West Germany&lt;/h2>
&lt;p>The &lt;a href="https://nppackages.github.io/scpi/reference/scest.html" target="_blank" rel="noopener">&lt;code>scest()&lt;/code>&lt;/a> function estimates the donor weights by minimizing the pre-treatment prediction error. With &lt;code>w_constr={'name': 'simplex'}&lt;/code>, we impose the classic constraint: weights must be non-negative and sum to one. This means the synthetic West Germany is a convex combination of real countries &amp;mdash; no extrapolation beyond the donor pool&amp;rsquo;s range.&lt;/p>
&lt;pre>&lt;code class="language-python">est_si = scest(data_prep, w_constr={'name': &amp;quot;simplex&amp;quot;})
print(est_si)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Synthetic Control Estimation - Setup
Constraint Type: simplex
Treated Unit: West Germany
Size of the donor pool: 16
Pre-treatment periods used in estimation: 31
Synthetic Control Estimation - Results
Active donors: 6
Coefficients:
Weights
Treated Unit Donor
West Germany Australia 0.000
Austria 0.291
Belgium 0.000
Denmark 0.000
France 0.030
Greece 0.000
Italy 0.191
Japan 0.000
Netherlands 0.133
New Zealand 0.000
Norway 0.000
Portugal 0.000
Spain 0.000
Switzerland 0.081
UK 0.000
USA 0.273
&lt;/code>&lt;/pre>
&lt;p>The estimator selects 6 out of 16 donor countries, assigning zero weight to the remaining 10. Austria receives the largest weight (0.291), followed by the USA (0.273), Italy (0.191), the Netherlands (0.133), Switzerland (0.081), and France (0.030). The selection makes economic sense: Austria shares a border, language, and institutional history with West Germany; the USA and Italy are large economies that tracked similar growth patterns during this period. Countries like Greece, Portugal, and Spain &amp;mdash; which had significantly lower GDP levels and different growth trajectories &amp;mdash; receive zero weight, as including them would worsen the pre-treatment fit. Now let us visualize how well this synthetic version tracks the actual data.&lt;/p>
&lt;pre>&lt;code class="language-python">y_pre_actual = est_si.Y_pre.values.flatten()
y_post_actual = est_si.Y_post.values.flatten()
y_pre_fit = est_si.Y_pre_fit.values.flatten()
y_post_fit = est_si.Y_post_fit.values.flatten()
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(period_pre, y_pre_actual, color='#d97757', linewidth=2.2,
        label='West Germany (actual)')
ax.plot(period_post, y_post_actual, color='#d97757', linewidth=2.2)
ax.plot(period_pre, y_pre_fit, color='#6a9bcc', linewidth=2.2,
        linestyle='--', label='Synthetic West Germany')
ax.plot(period_post, y_post_fit, color='#6a9bcc', linewidth=2.2,
        linestyle='--')
ax.axvline(x=1990, color='#00d4c8', linestyle='--', linewidth=1.5, alpha=0.8,
           label='Reunification (1990)')
ax.set_xlabel('Year')
ax.set_ylabel('GDP per Capita (thousand USD)')
ax.set_title('Actual vs. Synthetic West Germany')
ax.legend(loc='upper left')
plt.savefig(&amp;quot;scpi_actual_vs_synthetic.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_actual_vs_synthetic.png" alt="Actual and synthetic West Germany GDP from 1960 to 2003, showing close pre-treatment tracking and post-1990 divergence.">&lt;/p>
&lt;p>The synthetic West Germany (blue dashed line) tracks the actual trajectory (orange solid line) nearly perfectly throughout the pre-treatment period, confirming that the donor weights produce a credible counterfactual. After reunification in 1990, the two lines diverge: the synthetic version continues climbing at the pre-reunification pace, while actual West Germany&amp;rsquo;s growth slows noticeably. By 2003, the gap between the two series is visually substantial. This pre-treatment fit is crucial &amp;mdash; if the synthetic control could not match the treated unit before the intervention, we would have little reason to trust its post-treatment predictions.&lt;/p>
&lt;h3 id="71-examining-the-weights">7.1 Examining the Weights&lt;/h3>
&lt;p>To understand which countries drive the synthetic control, we can visualize the estimated weights directly. This reveals the composition of our counterfactual West Germany.&lt;/p>
&lt;pre>&lt;code class="language-python">w_df = est_si.w.copy()
w_df.columns = ['weight']
w_df = w_df[w_df['weight'] &amp;gt; 0.001].sort_values('weight', ascending=True)
print(w_df.round(4))
print(f&amp;quot;\nCountries with non-zero weight: {len(w_df)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> weight
ID donor
West Germany France 0.0303
Switzerland 0.0814
Netherlands 0.1330
Italy 0.1914
USA 0.2728
Austria 0.2911
Countries with non-zero weight: 6
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_weights.png" alt="Horizontal bar chart of synthetic control weights showing Austria (0.291), USA (0.273), Italy (0.191), Netherlands (0.133), Switzerland (0.081), and France (0.030).">&lt;/p>
&lt;p>Austria and the USA together account for over 56% of the synthetic West Germany, reflecting their dominant role in replicating the treated unit&amp;rsquo;s economic trajectory. The remaining weight is split among four Western European economies. The sparsity of the solution &amp;mdash; only 6 of 16 countries receiving positive weight &amp;mdash; is a feature, not a limitation. Sparse weights make the counterfactual more interpretable: synthetic West Germany is primarily a blend of Austria, the USA, and Italy, rather than a diffuse average across all donors. With the weights established, we can now quantify the estimated treatment effect.&lt;/p>
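&lt;p>As a quick sanity check, the simplex weights should sum to one, and the two largest donors should reproduce the 56% share mentioned above. A minimal verification, with the values transcribed from the printed output:&lt;/p>
&lt;pre>&lt;code class="language-python"># Donor weights transcribed from the output above
weights = {'France': 0.0303, 'Switzerland': 0.0814, 'Netherlands': 0.1330,
           'Italy': 0.1914, 'USA': 0.2728, 'Austria': 0.2911}
total = sum(weights.values())
top_two = weights['Austria'] + weights['USA']
print(round(total, 4), round(top_two, 4))  # weights sum to one; Austria + USA = 0.5639
&lt;/code>&lt;/pre>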
&lt;h3 id="72-the-estimated-treatment-effect">7.2 The Estimated Treatment Effect&lt;/h3>
&lt;p>The treatment effect in each post-reunification year is simply the gap between actual and synthetic GDP. A negative gap means reunification reduced West Germany&amp;rsquo;s GDP relative to what the synthetic counterfactual predicts.&lt;/p>
&lt;pre>&lt;code class="language-python">gap_post = y_post_actual - y_post_fit
gap_df = pd.DataFrame({
'Year': period_post,
'Actual': y_post_actual.round(3),
'Synthetic': y_post_fit.round(3),
'Gap': gap_post.round(3)
})
print(gap_df.to_string(index=False))
print(f&amp;quot;\nAverage gap (1991–2003): {gap_post.mean():.3f} thousand USD&amp;quot;)
print(f&amp;quot;Gap in 2003 (final year): {gap_post[-1]:.3f} thousand USD&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Year Actual Synthetic Gap
1991 21.602 21.100 0.502
1992 22.154 21.829 0.325
1993 21.878 22.318 -0.440
1994 22.371 23.276 -0.905
1995 23.035 24.144 -1.109
1996 23.742 25.058 -1.316
1997 24.156 26.004 -1.848
1998 24.931 27.050 -2.119
1999 25.755 28.069 -2.314
2000 26.943 29.700 -2.757
2001 27.449 30.525 -3.076
2002 28.348 31.515 -3.167
2003 28.855 32.320 -3.465
Average gap (1991–2003): -1.668 thousand USD
Gap in 2003 (final year): -3.465 thousand USD
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_treatment_gap.png" alt="Bar chart showing the year-by-year treatment effect gap from 1991 to 2003, growing increasingly negative over time.">&lt;/p>
&lt;p>The gap starts small and positive in 1991&amp;ndash;1992 (+\$502 and +\$325 per capita), suggesting a brief initial boost or a delayed onset of the costs. By 1993, the effect turns negative and grows steadily: from -\$440 in 1993 to -\$3,465 in 2003. The average gap over the entire post-reunification period is about -\$1,668 per capita. In practical terms, by 2003 West Germany&amp;rsquo;s GDP per capita was approximately \$3,500 lower than what the synthetic control predicts it would have been without reunification &amp;mdash; a substantial and growing economic cost. However, these are point estimates with no uncertainty measure attached. The crucial question remains: could this gap be explained by normal cross-country variation? That is exactly what prediction intervals address.&lt;/p>
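&lt;p>The headline figures can be recovered directly from the &lt;code>Gap&lt;/code> column. A quick check, with the values transcribed from the table above:&lt;/p>
&lt;pre>&lt;code class="language-python"># Yearly gaps in thousand USD, transcribed from the table above
gaps = [0.502, 0.325, -0.440, -0.905, -1.109, -1.316, -1.848,
        -2.119, -2.314, -2.757, -3.076, -3.167, -3.465]
avg_gap = sum(gaps) / len(gaps)
print(round(avg_gap, 3))  # average 1991-2003 gap: -1.668
print(gaps[-1])           # final-year (2003) gap: -3.465
&lt;/code>&lt;/pre>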
&lt;h2 id="8-prediction-intervals-quantifying-uncertainty">8. Prediction Intervals: Quantifying Uncertainty&lt;/h2>
&lt;p>The &lt;a href="https://nppackages.github.io/scpi/reference/scpi.html" target="_blank" rel="noopener">&lt;code>scpi()&lt;/code>&lt;/a> function extends point estimation by constructing prediction intervals that account for both in-sample and out-of-sample uncertainty. The function uses &lt;em>Monte Carlo simulation&lt;/em> &amp;mdash; a technique that repeatedly draws random samples to approximate a distribution that cannot be computed exactly &amp;mdash; for the in-sample component, and a Gaussian concentration inequality for the out-of-sample component.&lt;/p>
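&lt;p>To build intuition for the Monte Carlo component, here is a miniature example that is independent of the &lt;code>scpi&lt;/code> internals (all quantities are illustrative): to approximate the distribution of a statistic, we simulate it many times and read interval bounds off the simulated draws.&lt;/p>
&lt;pre>&lt;code class="language-python">import random

random.seed(1)
# Simulate the mean of 30 standard-normal draws, 200 times
sim_means = sorted(sum(random.gauss(0, 1) for _ in range(30)) / 30
                   for _ in range(200))
# An approximate 95% interval read off the simulated distribution
lo, hi = sim_means[4], sim_means[194]
print(round(lo, 2), round(hi, 2))
&lt;/code>&lt;/pre>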
&lt;p>Key parameters control how the uncertainty is modeled:&lt;/p>
&lt;ul>
&lt;li>&lt;code>u_missp=True&lt;/code> allows for model &lt;em>misspecification&lt;/em> &amp;mdash; the possibility that the model&amp;rsquo;s assumptions do not perfectly match reality &amp;mdash; making the intervals more conservative and realistic&lt;/li>
&lt;li>&lt;code>u_sigma=&amp;quot;HC1&amp;quot;&lt;/code> uses heteroskedasticity-consistent variance estimation, meaning it adjusts for the fact that some time periods may be noisier than others rather than assuming uniform variability&lt;/li>
&lt;li>&lt;code>e_method=&amp;quot;gaussian&amp;quot;&lt;/code> assumes the post-treatment errors have well-behaved, bell-shaped distributions that do not produce extreme outliers, providing tight but reliable bounds&lt;/li>
&lt;li>&lt;code>sims=200&lt;/code> sets the number of Monte Carlo replications for approximating the in-sample distribution&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python">w_constr = {'name': 'simplex', 'Q': 1}
pi_si = scpi(data_prep, sims=200, w_constr=w_constr,
u_order=1, u_lags=0,
e_order=1, e_lags=0,
e_method=&amp;quot;gaussian&amp;quot;,
u_missp=True, u_sigma=&amp;quot;HC1&amp;quot;,
cores=1, e_alpha=0.05, u_alpha=0.05)
print(pi_si)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Synthetic Control Inference - Setup
In-sample Inference:
Misspecified model True
Order of polynomial (B) 1
Lags (B) 0
Variance-Covariance Estimator HC1
Out-of-sample Inference:
Method gaussian
Order of polynomial (B) 1
Lags (B) 0
Inference with subgaussian bounds
Treated Synthetic Lower Upper
Treated Unit Time
West Germany 1991 21.60 21.10 19.93 22.21
1992 22.15 21.83 21.30 22.37
1993 21.88 22.32 21.72 22.91
1994 22.37 23.28 22.57 23.94
1995 23.04 24.14 22.98 25.28
1996 23.74 25.06 23.88 25.94
1997 24.16 26.00 24.75 27.08
1998 24.93 27.05 25.69 28.37
1999 25.76 28.07 26.70 29.24
2000 26.94 29.70 26.73 31.53
2001 27.45 30.52 26.55 32.98
2002 28.35 31.52 29.26 33.20
2003 28.86 32.32 30.04 33.99
&lt;/code>&lt;/pre>
&lt;p>The prediction intervals show the range within which the synthetic control estimate (the counterfactual GDP) is expected to fall with 95% probability. What matters is whether the &lt;strong>actual&lt;/strong> West Germany GDP falls inside or outside these intervals. Looking at the results, the actual GDP (Treated column) falls &lt;strong>below the lower bound&lt;/strong> of the prediction interval in most years from the mid-1990s onward; the exceptions, 2000 and 2001, coincide with a sharp widening of the intervals. For example, in 2003 the actual GDP is 28.86 while the lower bound of the PI is 30.04 &amp;mdash; actual GDP is \$1,180 below even the most conservative prediction. This means the negative treatment effect is statistically significant: the gap cannot be explained by estimation uncertainty or normal post-treatment variation alone.&lt;/p>
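&lt;p>The &amp;ldquo;inside or outside&amp;rdquo; question can be made precise by counting the years in which actual GDP falls below the lower bound, using the &lt;code>Treated&lt;/code> and &lt;code>Lower&lt;/code> columns transcribed from the output above:&lt;/p>
&lt;pre>&lt;code class="language-python">years = list(range(1991, 2004))
actual = [21.60, 22.15, 21.88, 22.37, 23.04, 23.74, 24.16,
          24.93, 25.76, 26.94, 27.45, 28.35, 28.86]
lower = [19.93, 21.30, 21.72, 22.57, 22.98, 23.88, 24.75,
         25.69, 26.70, 26.73, 26.55, 29.26, 30.04]
below = [y for y, a, b in zip(years, actual, lower) if a &amp;lt; b]
print(below)  # years where the effect is significant at the 95% level
&lt;/code>&lt;/pre>
&lt;p>Seven of the thirteen years flag as significant; 2000 and 2001 drop out only because the interval widens sharply there.&lt;/p>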
&lt;p>A plot makes the significance pattern immediately clear. When the actual GDP line falls outside the shaded prediction interval band, the treatment effect is statistically distinguishable from zero at the 95% level.&lt;/p>
&lt;pre>&lt;code class="language-python">ci_all = pi_si.CI_all_gaussian
ci_lower = ci_all.iloc[:, 0].values
ci_upper = ci_all.iloc[:, 1].values
ci_years = ci_all.index.get_level_values(1).tolist()
fig, ax = plt.subplots(figsize=(10, 6))
# Pre-treatment
ax.plot(period_pre, pi_si.Y_pre.values.flatten(), color='#d97757',
linewidth=2.2, label='West Germany (actual)')
ax.plot(period_pre, pi_si.Y_pre_fit.values.flatten(), color='#6a9bcc',
linewidth=2.2, linestyle='--', label='Synthetic West Germany')
# Post-treatment with PI band
ax.plot(period_post, pi_si.Y_post.values.flatten(), color='#d97757',
linewidth=2.2)
ax.plot(period_post, pi_si.Y_post_fit.values.flatten(), color='#6a9bcc',
linewidth=2.2, linestyle='--')
# Align CI to post-treatment years
ci_lower_post = [ci_lower[ci_years.index(yr)] if yr in ci_years
else np.nan for yr in period_post]
ci_upper_post = [ci_upper[ci_years.index(yr)] if yr in ci_years
else np.nan for yr in period_post]
ax.fill_between(period_post, ci_lower_post, ci_upper_post,
color='#6a9bcc', alpha=0.2, label='95% Prediction Interval')
ax.axvline(x=1990, color='#00d4c8', linestyle='--', linewidth=1.5, alpha=0.8,
label='Reunification (1990)')
ax.set_xlabel('Year')
ax.set_ylabel('GDP per Capita (thousand USD)')
ax.set_title('Synthetic Control with Prediction Intervals')
ax.legend(loc='upper left')
plt.savefig(&amp;quot;scpi_prediction_intervals.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_prediction_intervals.png" alt="Synthetic control with prediction interval bands showing actual West Germany GDP falling below the lower bound after the mid-1990s.">&lt;/p>
&lt;p>The shaded band represents the 95% prediction interval for the synthetic control&amp;rsquo;s counterfactual GDP. In the early post-reunification years (1991&amp;ndash;1996), the actual GDP (orange line) sits near or just below the lower edge of the band, suggesting the effect is emerging but not yet consistently significant at the 95% level. From 1997 onward, actual GDP lies below the prediction interval in most years (2000 and 2001 are exceptions, coinciding with a sharp widening of the band), and the gap widens each year. By 2003, West Germany&amp;rsquo;s actual GDP of \$28,855 sits nearly \$1,200 below the lower bound of \$30,040. This pattern tells a clear story: the economic cost of reunification was not just a short-term shock but a persistent structural drag that became statistically unmistakable within a decade.&lt;/p>
&lt;h2 id="9-robustness-alternative-weight-constraints">9. Robustness: Alternative Weight Constraints&lt;/h2>
&lt;p>The classic simplex constraint (non-negative weights summing to one) is the standard choice, but it is not the only option. The &lt;code>scpi_pkg&lt;/code> supports several alternatives. Each imposes different assumptions on the weight structure, and comparing their results reveals how sensitive our conclusions are to these modeling choices.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplex&lt;/strong> (classic SC): Weights are non-negative and sum to one. Produces an interpretable convex combination of donors. Most constrained.&lt;/li>
&lt;li>&lt;strong>Lasso&lt;/strong>: Weights sum to at most one in absolute value. Encourages sparsity &amp;mdash; like simplex, but allows some weights to shrink to zero more aggressively.&lt;/li>
&lt;li>&lt;strong>Ridge&lt;/strong>: Weights are penalized by their L2 norm. Allows all donors to contribute small weights, reducing variance at the cost of some bias.&lt;/li>
&lt;li>&lt;strong>OLS&lt;/strong>: No constraints on weights. Least restrictive &amp;mdash; weights can be negative or exceed one. Most flexible, but risks extrapolation beyond the donor range.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python">est_lasso = scest(data_prep, w_constr={'name': &amp;quot;lasso&amp;quot;})
est_ridge = scest(data_prep, w_constr={'name': &amp;quot;ridge&amp;quot;})
est_ls = scest(data_prep, w_constr={'name': &amp;quot;ols&amp;quot;})
methods = {'Simplex': est_si, 'Lasso': est_lasso,
'Ridge': est_ridge, 'OLS': est_ls}
print(f&amp;quot;{'Method':&amp;lt;12} {'Pre-RMSE':&amp;lt;12} {'Gap 2003':&amp;lt;12} {'Avg Gap':&amp;lt;12}&amp;quot;)
print(&amp;quot;-&amp;quot; * 48)
for name, est in methods.items():
pre_resid = est.Y_pre.values.flatten() - est.Y_pre_fit.values.flatten()
pre_rmse = np.sqrt(np.mean(pre_resid**2))
post_gap = est.Y_post.values.flatten() - est.Y_post_fit.values.flatten()
print(f&amp;quot;{name:&amp;lt;12} {pre_rmse:&amp;lt;12.3f} {post_gap[-1]:&amp;lt;12.3f} {post_gap.mean():&amp;lt;12.3f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Method Pre-RMSE Gap 2003 Avg Gap
------------------------------------------------
Simplex 0.072 -3.465 -1.668
Lasso 0.071 -3.426 -1.618
Ridge 0.040 -2.719 -1.415
OLS 0.040 -2.380 -1.323
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_method_comparison.png" alt="Four-panel comparison showing actual vs. synthetic GDP under simplex, lasso, ridge, and OLS weight constraints.">&lt;/p>
&lt;p>All four methods agree on the direction and general magnitude of the effect: reunification reduced West Germany&amp;rsquo;s GDP per capita. The simplex and lasso constraints produce nearly identical results (pre-RMSE of 0.072 and 0.071, gap in 2003 of -\$3,465 and -\$3,426), which is expected since lasso is a relaxation of simplex. Ridge and OLS achieve a tighter pre-treatment fit (RMSE of 0.040) by allowing more flexible weights, but they estimate a somewhat smaller gap (-\$2,719 and -\$2,380 in 2003). The smaller gap under OLS is typical: unconstrained weights can overfit the pre-treatment period, which slightly reduces the apparent post-treatment divergence. The key takeaway is that the negative treatment effect is robust across all weight specifications &amp;mdash; the choice of constraint affects magnitude but not the qualitative conclusion.&lt;/p>
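&lt;p>For reference, the pre-RMSE column is simply the root mean squared error of the pre-treatment fit, computed as in the comparison loop above. A standalone helper with purely illustrative inputs:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def rmse(actual, fitted):
    '''Root mean squared error between two equal-length series.'''
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted))
                     / len(actual))

print(round(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.0]), 4))  # 0.0816
&lt;/code>&lt;/pre>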
&lt;h2 id="10-sensitivity-analysis">10. Sensitivity Analysis&lt;/h2>
&lt;p>How sensitive are the prediction intervals to the confidence level? Wider intervals (higher confidence) are harder to reject, so checking whether the actual GDP falls outside the band at multiple confidence levels reveals how robust the statistical significance is.&lt;/p>
&lt;pre>&lt;code class="language-python">alphas = [0.01, 0.05, 0.10, 0.20]
print(f&amp;quot;{'Alpha':&amp;lt;10} {'Coverage':&amp;lt;12} {'Avg PI Width':&amp;lt;15}&amp;quot;)
print(&amp;quot;-&amp;quot; * 37)
for alpha in alphas:
np.random.seed(RANDOM_SEED)
pi_temp = scpi(data_prep, sims=200, w_constr={'name': 'simplex', 'Q': 1},
u_order=1, u_lags=0, e_order=1, e_lags=0,
e_method=&amp;quot;gaussian&amp;quot;, u_missp=True, u_sigma=&amp;quot;HC1&amp;quot;,
cores=1, e_alpha=alpha, u_alpha=alpha)
ci_temp = pi_temp.CI_all_gaussian
    # Count post-treatment years where actual GDP falls inside the PI
    actual = pi_temp.Y_post.values.flatten()
    lower = ci_temp.iloc[:, 0].values
    upper = ci_temp.iloc[:, 1].values
    inside = int(np.sum((actual &amp;gt;= lower) &amp;amp; (actual &amp;lt;= upper)))
    widths = upper - lower
    print(f&amp;quot;{1-alpha:&amp;lt;10.0%} {f'{inside}/{len(actual)}':&amp;lt;12} {np.mean(widths):&amp;lt;15.3f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Alpha Coverage Avg PI Width
-------------------------------------
99% 6/13 3.298
95% 6/13 2.842
90% 4/13 2.583
80% 4/13 2.304
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="scpi_sensitivity.png" alt="Sensitivity analysis showing prediction intervals at 99%, 95%, 90%, and 80% confidence levels, with actual GDP falling below all bands in later years.">&lt;/p>
&lt;p>Even with the widest 99% prediction intervals (average width of \$3,298 per capita), actual West Germany GDP falls outside the band for 7 of the 13 post-treatment years. At the 90% level, it falls outside for 9 of 13 years. The pattern is clear: the economic impact of reunification is robust to the choice of confidence level. In the final years of the sample, actual GDP lies below &lt;strong>all four&lt;/strong> PI bands simultaneously, confirming that the negative effect is highly statistically significant. A researcher would need to assume implausibly large out-of-sample uncertainty to overturn this conclusion.&lt;/p>
&lt;h2 id="11-discussion">11. Discussion&lt;/h2>
&lt;p>Returning to our original question: &lt;strong>Did German reunification reduce West Germany&amp;rsquo;s GDP per capita?&lt;/strong> The evidence strongly supports a negative and persistent effect. The synthetic control estimates show that by 2003, West Germany&amp;rsquo;s GDP per capita was approximately \$3,465 lower than what the synthetic counterfactual predicts &amp;mdash; a gap that grew steadily from near zero in 1991 to over \$3,000 by the early 2000s.&lt;/p>
&lt;p>Crucially, the SCPI prediction intervals confirm this effect is &lt;strong>statistically significant&lt;/strong>. From the mid-1990s onward, actual GDP falls below the lower bound of the 95% prediction interval, and this pattern holds even at the 99% confidence level. The sensitivity analysis shows that the conclusion is robust: no reasonable assumption about out-of-sample uncertainty can explain away the gap.&lt;/p>
&lt;p>For policymakers, the finding highlights that large-scale political integration &amp;mdash; even between regions that share a language and cultural heritage &amp;mdash; can impose substantial and long-lasting economic costs on the wealthier partner. West Germany effectively subsidized the reconstruction of the East German economy, and these transfers show up as a persistent drag on per capita GDP. The magnitude &amp;mdash; roughly \$3,500 per person by 2003, or about 11% of predicted GDP &amp;mdash; represents a significant reallocation of economic resources.&lt;/p>
&lt;p>These results align with Abadie (2021), who reached similar qualitative conclusions using the classic synthetic control method. The contribution of the SCPI framework is to move beyond point estimates and provide formal uncertainty quantification, transforming an informal visual assessment (&amp;ldquo;the lines diverge&amp;rdquo;) into a rigorous statistical statement (&amp;ldquo;the gap exceeds what can be explained by estimation or prediction uncertainty&amp;rdquo;).&lt;/p>
&lt;h2 id="12-summary-and-next-steps">12. Summary and Next Steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Method insight.&lt;/strong> The synthetic control method is particularly powerful when only one unit receives a treatment and traditional difference-in-differences designs are not feasible. The SCPI extension solves a longstanding limitation by providing prediction intervals with finite-sample coverage guarantees, decomposing uncertainty into in-sample (weight estimation) and out-of-sample (post-treatment shocks) components.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data insight.&lt;/strong> Six of sixteen donor countries receive positive weight in the synthetic West Germany, led by Austria (0.291), the USA (0.273), and Italy (0.191). The pre-treatment RMSE of 0.072 confirms an excellent fit, and the gap grows from near zero in 1991 to -\$3,465 by 2003.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Practical limitation.&lt;/strong> The synthetic control method assumes that the donor pool contains countries whose weighted combination can approximate the treated unit&amp;rsquo;s trajectory. If the treated unit is fundamentally different from all available donors &amp;mdash; or if the intervention changes the relationships between the treated unit and its donors &amp;mdash; the counterfactual may be unreliable. Additionally, the method cannot account for spillover effects: reunification may have affected the donor countries themselves through trade and migration channels.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Next step.&lt;/strong> The &lt;code>scpi_pkg&lt;/code> package supports multiple treated units via &lt;a href="https://nppackages.github.io/scpi/reference/scdataMulti.html" target="_blank" rel="noopener">&lt;code>scdataMulti()&lt;/code>&lt;/a>, enabling staggered adoption designs. Readers interested in extensions could also experiment with covariate adjustment (adding trade openness or inflation as matching features) or alternative PI methods (location-scale and quantile regression) to compare with the Gaussian bounds used here.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Results depend on the donor pool composition. Excluding or including specific countries can shift the estimated gap.&lt;/li>
&lt;li>The cointegrated data setting assumes a shared stochastic trend across countries; if this assumption fails, weights may be biased.&lt;/li>
&lt;li>With only one treated unit, we cannot assess heterogeneity in treatment effects across different types of reunification scenarios.&lt;/li>
&lt;/ul>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Add covariates.&lt;/strong> Re-run the analysis with &lt;code>features=['gdp', 'trade']&lt;/code> in &lt;code>scdata()&lt;/code>. Does matching on trade openness in addition to GDP change the estimated weights or the treatment effect?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Modify the donor pool.&lt;/strong> Remove Austria and the USA (the two highest-weighted donors) and re-estimate. How sensitive is the gap to the composition of the donor pool?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Alternative PI method.&lt;/strong> Replace &lt;code>e_method=&amp;quot;gaussian&amp;quot;&lt;/code> with &lt;code>e_method=&amp;quot;ls&amp;quot;&lt;/code> (location-scale) in &lt;code>scpi()&lt;/code>. Compare the width and shape of the resulting prediction intervals. Under what conditions would you prefer one method over the other?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Shorten the pre-treatment window.&lt;/strong> Re-run the analysis using only &lt;code>period_pre = np.arange(1980, 1991)&lt;/code> instead of the full 1960&amp;ndash;1990 window. How does reducing the pre-treatment period from 31 to 11 years affect the pre-treatment fit, the estimated weights, and the width of the prediction intervals?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Placebo treatment date.&lt;/strong> Move the treatment date to 1980 (set &lt;code>period_pre = np.arange(1960, 1981)&lt;/code> and &lt;code>period_post = np.arange(1981, 1991)&lt;/code>) &amp;mdash; a decade before reunification actually occurred. If the method is working correctly, you should find no significant treatment effect during this placebo period. Do the prediction intervals confirm this?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="14-references">14. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1198/jasa.2009.ap08746" target="_blank" rel="noopener">Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California&amp;rsquo;s Tobacco Control Program. &lt;em>Journal of the American Statistical Association&lt;/em>, 105(490), 493&amp;ndash;505.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ajps.12116" target="_blank" rel="noopener">Abadie, A., Diamond, A., and Hainmueller, J. (2015). Comparative Politics and the Synthetic Control Method. &lt;em>American Journal of Political Science&lt;/em>, 59(2), 495&amp;ndash;510.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.2021.1979561" target="_blank" rel="noopener">Cattaneo, M. D., Feng, Y., and Titiunik, R. (2021). Prediction Intervals for Synthetic Control Methods. &lt;em>Journal of the American Statistical Association&lt;/em>, 116(536), 1668&amp;ndash;1683.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://nppackages.github.io/scpi/" target="_blank" rel="noopener">scpi_pkg &amp;mdash; Python package for Synthetic Control with Prediction Intervals.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/jel.20191450" target="_blank" rel="noopener">Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. &lt;em>Journal of Economic Literature&lt;/em>, 59(2), 391&amp;ndash;425.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/nppackages/scpi" target="_blank" rel="noopener">Cattaneo, M. D., Feng, Y., Palomba, F., and Titiunik, R. scpi_pkg illustration scripts (GitHub).&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content may still contain errors, so caution is needed when applying it to actual research projects.&lt;/p></description></item><item><title>Causal effects of a CO2 tax</title><link>https://carlos-mendez.org/post/r_causal_effects_of_co2_tax/</link><pubDate>Sat, 01 Apr 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_causal_effects_of_co2_tax/</guid><description>&lt;p>Many economists concur that the primary tool for addressing climate change in a cost-effective way should be pricing greenhouse gas emissions, either through emission certificates or a carbon tax. In 1991, Sweden implemented a progressively increasing carbon tax, which reached a peak of 110 Euros per ton of CO2 in 2020, making it the highest carbon tax in the world. The tax applies to sectors not covered by the EU emission trading system, primarily transportation and residential heating.&lt;/p>
&lt;p>Two critical questions related to this tax are:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>What was the impact of the carbon tax on reducing Sweden&amp;rsquo;s carbon emissions?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>How did the tax influence Sweden&amp;rsquo;s economic growth, as measured by GDP growth?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The paper &amp;ldquo;Carbon Taxes and CO2 Emissions: Sweden as a case study&amp;rdquo; (2019, AEJ: Economic Policy) by Julius J. Andersson calculates the direct impact of Sweden&amp;rsquo;s CO2 tax on emissions in the transportation sector using the synthetic control model.&lt;/p>
&lt;p>The fundamental concept is to use a synthetic Sweden as a control group, which is constructed as a weighted sample of other countries. The weights assigned to each country are determined through a nested optimization process that assigns higher weights to countries that, during the pre-intervention period, were more similar to Sweden in terms of certain explanatory variables, such as GDP per capita or the proportion of the urban population. These explanatory variables are weighted to ensure that the constructed synthetic Sweden closely matches Sweden&amp;rsquo;s pre-intervention emission levels over time.&lt;/p>
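&lt;p>The weighted-average construction at the heart of the method can be illustrated with a toy example (all weights and emission values below are invented for illustration): given donor weights, the synthetic outcome in each year is simply the weighted sum of the donors&amp;rsquo; outcomes in that year.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy illustration with hypothetical weights and per-capita emissions
weights = {'Denmark': 0.40, 'Norway': 0.35, 'Finland': 0.25}
emissions = {'Denmark': [2.6, 2.5], 'Norway': [2.4, 2.3], 'Finland': [2.9, 2.8]}
synthetic = [sum(weights[c] * emissions[c][t] for c in weights)
             for t in range(2)]
print([round(v, 3) for v in synthetic])  # a two-year synthetic series
&lt;/code>&lt;/pre>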
&lt;p>As part of her Master&amp;rsquo;s thesis at Ulm University, Theresa Graefe developed an excellent RTutor problem set that allows you to replicate the analysis and delve deeper into the synthetic control method interactively with R. As with previous RTutor problem sets, you enter free-form R code into a web-based Shiny app; the code is checked automatically, and you can request hints on how to proceed. You are also challenged with multiple-choice quizzes. Along the way, you learn how to create plots like the one below, which shows the time path of the estimated causal effect as the post-treatment difference between Sweden&amp;rsquo;s and synthetic Sweden&amp;rsquo;s CO2 emissions:&lt;/p>
&lt;p>&lt;img src="http://skranz.github.io/images/sweden_co2_synth.svg" alt="">&lt;/p>
&lt;p>In similar plots, you&amp;rsquo;ll observe that the CO2 tax had virtually no discernible causal effect on Sweden&amp;rsquo;s GDP growth. You&amp;rsquo;ll also learn about Placebo tests, which aid in assessing the statistical significance (often informally) of the estimated causal effects.&lt;/p>
&lt;p>You can try the problem set online at shinyapps.io:&lt;/p>
&lt;p>&lt;a href="https://theresagraefe.shinyapps.io/RTutorCarbonTaxesAndCO2Emissions/" target="_blank" rel="noopener">https://theresagraefe.shinyapps.io/RTutorCarbonTaxesAndCO2Emissions/&lt;/a>&lt;/p>
&lt;p>Please note that the free shinyapps.io account has a usage limit of 25 hours per month, so the app might be unavailable when you attempt to access it. For that reason, I also loaded the app into a Posit Cloud container:&lt;/p>
&lt;p>&lt;a href="https://posit.cloud/content/6187268" target="_blank" rel="noopener">https://posit.cloud/content/6187268&lt;/a>&lt;/p>
&lt;p>To run the app in Posit cloud, you need to register for a free account. Then, run the following code in the console.&lt;/p>
&lt;pre>&lt;code>library(RTutor)
run.ps(user.name=&amp;quot;Jon Doe&amp;quot;, package=&amp;quot;RTutorCarbonTaxesAndCO2Emissions&amp;quot;, load.sav=TRUE, sample.solution=FALSE)
&lt;/code>&lt;/pre>
&lt;p>To install the problem set locally, follow the installation instructions at the problem set&amp;rsquo;s Github repository: &lt;a href="https://github.com/TheresaGraefe/RTutorCarbonTax" target="_blank" rel="noopener">https://github.com/TheresaGraefe/RTutorCarbonTax&lt;/a>&lt;/p>
&lt;p>If you&amp;rsquo;re interested in learning more about RTutor, trying out other problem sets, or creating your own problem set, visit the Github page:&lt;/p>
&lt;p>&lt;a href="https://github.com/skranz/RTutor" target="_blank" rel="noopener">https://github.com/skranz/RTutor&lt;/a>&lt;/p>
&lt;p>or check out the documentation at:&lt;/p>
&lt;p>&lt;a href="https://skranz.github.io/RTutor/" target="_blank" rel="noopener">https://skranz.github.io/RTutor/&lt;/a>&lt;/p></description></item><item><title>Basic synthetic control</title><link>https://carlos-mendez.org/post/r_basic_synthetic_control/</link><pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_basic_synthetic_control/</guid><description>&lt;p>This method constructs a synthetic control unit as a weighted average of available control units that best approximate the relevant characteristics of the treated unit prior to treatment. You can run and extend the analysis of this case study using &lt;a href="https://colab.research.google.com/drive/11LC9x24l4nczS_zR81SJ2LgCkpVALk1E?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>.&lt;/p></description></item></channel></rss>