Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_scpi/data/"
use "${BASE}data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_scpi/data/"
df = pd.read_stata(BASE + "data.dta")
# load every dataset at once
files = ["data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "data.dta", "data.dta")
df, meta = pyreadstat.read_dta("data.dta")
Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_scpi/data/"
df <- read_dta(paste0(BASE, "data.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that applies the synthetic control with prediction intervals (SCPI) framework of Cattaneo, Feng, and Titiunik (2021) to a classic question in political economy: did German reunification in 1990 reduce West Germany's GDP per capita, and how confident can we be in the estimate? The analysis treats West Germany as the treated unit and 16 OECD countries as the donor pool, builds a synthetic West Germany from 31 pre-treatment years (1960–1990) under a simplex constraint, and constructs prediction intervals that decompose uncertainty into in-sample (weight estimation) and out-of-sample (post-treatment) components with finite-sample coverage guarantees. The estimated gap grows to roughly −$3,465 per capita by 2003, and actual GDP falls below the 99% prediction interval in 7 of 13 post-treatment years.
data.csv is an annual country panel (one row per country × year) covering 17 countries over 1960–2003 (748 rows). The tutorial uses only the country, year, and gdp columns (features=None). Apart from the numeric index identifier, the remaining seven columns are the original Abadie predictor covariates (inflation, trade, schooling, three investment ratios, industry share), carried verbatim from the source dataset and available for the post's covariate-adjustment exercise.
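The panel dimensions quoted above follow from simple arithmetic, and the same logic gives a quick sanity check after loading. The sketch below uses only the Python standard library; the helper name `is_balanced` and the toy unit labels are hypothetical stand-ins for the real (country, year) pairs in data.csv, not part of the tutorial's code.

```python
from itertools import product


def is_balanced(pairs):
    """Return True if every unit appears exactly once in every period."""
    pairs = list(pairs)
    units = {unit for unit, _ in pairs}
    periods = {period for _, period in pairs}
    # Balanced panel: no duplicate rows, and one row per (unit, period) cell.
    return len(pairs) == len(set(pairs)) == len(units) * len(periods)


# Dimensions reported for data.csv: 17 countries, years 1960 through 2003.
n_countries = 17
years = range(1960, 2004)
n_rows = n_countries * len(years)
print(n_rows)  # 748

# Toy stand-in for the real panel, with hypothetical labels "c0" ... "c16".
toy_pairs = list(product([f"c{i}" for i in range(n_countries)], years))
print(is_balanced(toy_pairs))       # True
print(is_balanced(toy_pairs[:-1]))  # False: one country-year is missing
```

With the real data, the same check could be run on the (country, year) pairs taken from the loaded table.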
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Abadie (2021) | The German-reunification panel — GDP per capita and OECD covariates for 17 countries (1960–2003) | Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. Journal of Economic Literature, 59(2), 391–425. https://doi.org/10.1257/jel.20191450 |
| scpi_pkg illustration data | Distributed form of the panel used here (the scpi Python package illustration scripts) | Cattaneo, M. D., Feng, Y., Palomba, F., & Titiunik, R. scpi_pkg. https://github.com/nppackages/scpi |
| Method references | Estimators and concepts | Abadie, Diamond & Hainmueller (2010, 2015); Cattaneo, Feng & Titiunik (2021). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Synthetic Control with Prediction Intervals: Quantifying Uncertainty in Germany's Reunification Impact [Data set]. https://carlos-mendez.org/post/python_scpi/
Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. Journal of Economic Literature, 59(2), 391–425.
Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021). Prediction Intervals for Synthetic Control Methods. Journal of the American Statistical Association, 116(536), 1668–1683.
BibTeX
@misc{mendez2026pythonscpi,
author = {Mendez, Carlos},
title = {Synthetic Control with Prediction Intervals: Quantifying Uncertainty in Germany's Reunification Impact},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_scpi/}},
note = {Data set}
}
@article{abadie2021using,
author = {Abadie, Alberto},
title = {Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects},
journal = {Journal of Economic Literature},
volume = {59}, number = {2}, pages = {391--425}, year = {2021}
}
@article{cattaneo2021prediction,
author = {Cattaneo, Matias D. and Feng, Yingjie and Titiunik, Rocio},
title = {Prediction Intervals for Synthetic Control Methods},
journal = {Journal of the American Statistical Association},
volume = {116}, number = {536}, pages = {1668--1683}, year = {2021}
}
Variable explorer (all 11 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| country | identifier | Country name | Country identifier: the treated unit (West Germany) plus 16 OECD donor countries. | string | data | Abadie (2021) |
| gdp | continuous | GDP per capita (thousand USD) | Real GDP per capita in thousands of US dollars, the outcome variable for the synthetic control. | thousand US$ | data | Abadie (2021) |
| index | identifier | Country numeric ID | Numeric identifier for the country (from the source Abadie dataset; not sequential). | integer code | data | Abadie (2021) |
| industry | continuous | Industry share of GDP (%) | Industry value added as a share of GDP, a structural predictor covariate. | % of GDP | data | Abadie (2021) |
| infrate | continuous | Inflation rate (%) | Annual inflation rate (consumer prices), a predictor covariate in the original Abadie analysis. | % per year | data | Abadie (2021) |
| invest60 | continuous | Investment ratio, 1960s average | Average investment-to-output ratio over the 1960s (time-invariant per country). | ratio | data | Abadie (2021) |
| invest70 | continuous | Investment ratio, 1970s average | Average investment-to-output ratio over the 1970s (time-invariant per country). | ratio | data | Abadie (2021) |
| invest80 | continuous | Investment ratio, 1980s average (%) | Average investment-to-output ratio over the 1980s (time-invariant per country), recorded in percent rather than as a 0–1 ratio. | % | data | Abadie (2021) |
| schooling | continuous | Secondary schooling (%) | Share of the population with secondary schooling, a human-capital predictor covariate. | % of population | data | Abadie (2021) |
| trade | continuous | Trade openness (% of GDP) | Trade (exports + imports) as a share of GDP, a predictor covariate. | % of GDP | data | Abadie (2021) |
| year | year | Calendar year | Annual time index, 1960–2003. | year | data | Abadie (2021) |
Cross-file variable index
All 11 variables appear in the single file, data.
Construction & formulas
The synthetic control builds a counterfactual West Germany as a weighted average of donor countries, then reads the treatment effect off the post-treatment gap.
- Synthetic counterfactual: $\hat{Y}_{1T}(0) = x_T' \hat{w}$, the donor GDP values $x_T$ at time $T$ times the estimated weight vector $\hat{w}$.
- Simplex constraint (classic SC): $w_j \ge 0$ and $\sum_j w_j = 1$, a convex combination of real donors with no extrapolation. (Alternatives: lasso, ridge, OLS.)
- Treatment effect (ATT for the treated unit): $\hat{\tau}_T = Y_{1T}(1) - \hat{Y}_{1T}(0)$, the estimated gap between actual and synthetic GDP in each post-1990 year. A negative gap means reunification lowered West Germany's GDP relative to its counterfactual.
- Pre-treatment fit (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{T_0} \sum_t (Y_t - \hat{Y}_t)^2}$, with the sum taken over the $T_0$ pre-treatment periods (simplex RMSE = 0.072).
- Prediction-interval decomposition: $\hat{\tau}_T - \tau_T = \underbrace{p_T'(\beta_0 - \hat{\beta})}_{\text{in-sample}} + \underbrace{e_T}_{\text{out-of-sample}}$, where $\tau_T = Y_{1T}(1) - Y_{1T}(0)$ is the true effect. The in-sample term reflects finite-sample weight uncertainty (Monte Carlo, HC1 variance); the out-of-sample term $e_T$ reflects post-treatment forecasting noise (Gaussian bound). Combining the two yields intervals with finite-sample coverage guarantees.
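The first four formulas above can be traced by hand on a toy example. The sketch below is a minimal illustration in plain Python, not the tutorial's estimation code: the weights are assumed rather than estimated, all numbers are hypothetical (they are not taken from data.csv), and the helper names `synthetic` and `on_simplex` are invented for this example.

```python
import math


def synthetic(donors, w):
    """Synthetic outcome for one period: the weighted sum x_T' w."""
    return sum(x * wj for x, wj in zip(donors, w))


def on_simplex(w, tol=1e-9):
    """Check the simplex constraint: w_j >= 0 and sum_j w_j = 1."""
    return all(wj >= 0 for wj in w) and abs(sum(w) - 1.0) <= tol


# Hypothetical values in thousand USD for three donors (assumed, not real data).
w = [0.5, 0.3, 0.2]                                   # assumed weights
pre_donors = [[10.0, 12.0, 8.0], [11.0, 13.0, 9.0]]   # two pre-treatment years
pre_actual = [10.3, 11.1]                             # treated unit, same years
post_donors = [12.0, 14.0, 10.0]                      # one post-treatment year
post_actual = 11.0

print(on_simplex(w))  # True

# Pre-treatment fit: root mean squared gap between actual and synthetic.
pre_fit = [synthetic(x, w) for x in pre_donors]
rmse = math.sqrt(
    sum((y - yhat) ** 2 for y, yhat in zip(pre_actual, pre_fit)) / len(pre_actual)
)
print(round(rmse, 3))  # 0.1

# Post-treatment gap: actual minus synthetic counterfactual.
y0_hat = synthetic(post_donors, w)
gap = post_actual - y0_hat
print(round(gap, 3))  # -1.2

# A convex combination never leaves the range spanned by the donors.
print(min(post_donors) <= y0_hat <= max(post_donors))  # True
```

In the toy numbers the gap is negative, which under the sign convention above would read as the treated unit falling below its synthetic counterfactual.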
The datasets
There is one dataset, data. It is documented below with a full variable dictionary and a statistics table that includes data coverage.
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| index | identifier | Country numeric ID | Numeric identifier for the country (from the source Abadie dataset; not sequential). | Carried from the source dataset; constant within a country across years. | integer code | Abadie (2021) | all rows |
| country | identifier | Country name | Country identifier: the treated unit (West Germany) plus 16 OECD donor countries. | Carried from the source dataset. | string | Abadie (2021) | 17 countries |
| year | year | Calendar year | Annual time index, 1960–2003. | Carried from the source dataset. | year | Abadie (2021) | 1960–2003 (44 years) |
| gdp | continuous | GDP per capita (thousand USD) | Real GDP per capita in thousands of US dollars, the outcome variable for the synthetic control. | Carried from the source dataset; the only outcome used in estimation (outcome_var='gdp'). | thousand US$ | Abadie (2021) | all rows (748) |
| infrate | continuous | Inflation rate (%) | Annual inflation rate (consumer prices), a predictor covariate in the original Abadie analysis. | Carried from the source dataset; not used by this tutorial's headline estimation. | % per year | Abadie (2021) | 727 of 748 rows |
| trade | continuous | Trade openness (% of GDP) | Trade (exports + imports) as a share of GDP, a predictor covariate. | Carried from the source dataset; available for the covariate-adjustment exercise. | % of GDP | Abadie (2021) | 646 of 748 rows |
| schooling | continuous | Secondary schooling (%) | Share of the population with secondary schooling, a human-capital predictor covariate. | Carried from the source dataset; reported only for selected years (sparse). | % of population | Abadie (2021) | 151 of 748 rows (sparse) |
| invest60 | continuous | Investment ratio, 1960s average | Average investment-to-output ratio over the 1960s (time-invariant per country). | Carried from the source dataset; one value per country (period average). | ratio | Abadie (2021) | 17 rows (one per country) |
| invest70 | continuous | Investment ratio, 1970s average | Average investment-to-output ratio over the 1970s (time-invariant per country). | Carried from the source dataset; one value per country (period average). | ratio | Abadie (2021) | 17 rows (one per country) |
| invest80 | continuous | Investment ratio, 1980s average (%) | Average investment-to-output ratio over the 1980s (time-invariant per country), recorded in percent rather than as a 0–1 ratio. | Carried from the source dataset; one value per country (period average). | % | Abadie (2021) | 17 rows (one per country) |
| industry | continuous | Industry share of GDP (%) | Industry value added as a share of GDP, a structural predictor covariate. | Carried from the source dataset; available for the covariate-adjustment exercise. | % of GDP | Abadie (2021) | 541 of 748 rows |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| index | 100% | 748 | 17 | – | – | – | – | – |
| country | 100% | 748 | 17 | – | – | – | – | – |
| year | 100% | 748 | 44 | 1960 | 1981.5 | 1981 | 2003 | 12.71 |
| gdp | 100% | 748 | 739 | 0.707 | 12.14 | 10.26 | 37.55 | 8.95 |
| infrate | 97% | 727 | 726 | -0.915 | 5.87 | 4.08 | 28.78 | 5.13 |
| trade | 86% | 646 | 646 | 9.43 | 53.12 | 49.53 | 149.7 | 26.46 |
| schooling | 20% | 151 | 133 | 3.50 | 36.36 | 38.00 | 69.60 | 15.50 |
| invest60 | 2% | 17 | 17 | 0.201 | 0.287 | 0.278 | 0.373 | 0.045 |
| invest70 | 2% | 17 | 17 | 0.226 | 0.317 | 0.318 | 0.420 | 0.044 |
| invest80 | 2% | 17 | 17 | 17.59 | 25.96 | 26.49 | 34.99 | 4.28 |
| industry | 72% | 541 | 540 | 21.59 | 33.24 | 33.07 | 48.00 | 5.16 |
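The Coverage column is the share of the 748 rows with a non-missing value. As a small worked check, the sketch below recomputes it from the N column of the table. The helper name `coverage_pct` is hypothetical, and rounding to the nearest whole percent is an assumption about how the page displays the figure, although it reproduces every value shown.

```python
TOTAL_ROWS = 748  # 17 countries x 44 years

# Non-missing counts (N) copied from the statistics table.
non_missing = {
    "gdp": 748,
    "infrate": 727,
    "trade": 646,
    "schooling": 151,
    "invest60": 17,
    "invest70": 17,
    "invest80": 17,
    "industry": 541,
}


def coverage_pct(n, total=TOTAL_ROWS):
    """Share of rows with a non-missing value, as a whole percent."""
    return round(100 * n / total)


for name, n in non_missing.items():
    print(f"{name:10s} {n:4d} {coverage_pct(n):3d}%")
```

The 2% figure for the three investment columns is expected rather than a data problem: each holds one period average per country, so only 17 of the 748 rows carry a value.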
Known limitations & caveats
- Most columns are not used by the tutorial. Only `country`, `year`, and `gdp` enter the synthetic-control estimation (`features=None`). The seven covariate columns are retained from the source dataset for the covariate-adjustment exercise and are not part of the headline results.
- Sparse covariate coverage. Several predictors are only sparsely populated: `schooling` has 151 of 748 values, and `invest60`/`invest70`/`invest80` are time-invariant period averages reported once per country (17 non-missing each).
- Single treated unit. With one treated unit (West Germany), the method cannot assess heterogeneity in treatment effects, and results depend on the donor-pool composition: excluding or including specific countries can shift the estimated gap.
- Cointegration assumption. The estimation sets `cointegrated_data=True`, assuming GDP series share a common stochastic trend; if that assumption fails, the weights may be biased.