Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| source_data | region-year | 774 × 17 | source_data.dta | source_data.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
use "${BASE}source_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
df = pd.read_stata(BASE + "source_data.dta")
# load every dataset at once
files = ["source_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "source_data.dta", "source_data.dta")
df, meta = pyreadstat.read_dta("source_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
df <- read_dta(paste0(BASE, "source_data.dta"))
Overview & sources
Companion data for a beginner-friendly R tutorial on the synthetic control method, applied to the classic Basque Country case study of Abadie and Gardeazabal (2003). The panel is the basque dataset bundled with the Synth R package: annual observations from 1955–1997 for 18 Spanish regional units. The outcome is real GDP per capita (gdpcap, 1986 thousands of USD); the remaining columns are the 13 pre-treatment predictors matched by the algorithm — six sectoral production shares, five education levels, an investment-to-GDP ratio, and a 1969 population density. The tutorial frames the analysis as a causal problem (estimand: ATT), builds a “synthetic Basque” from a weighted recipe of the other regions (85% Catalonia, 15% Madrid), and stress-tests the result with Catalonia and in-space placebos.
source_data is an annual regional panel (one row per region × year, 18 units × 43 years = 774 rows over 1955–1997). Region 1 (“Spain (Espana)”) is the national aggregate and is dropped from the analysis; regions 2–18 are the 17 autonomous communities. Region 17 is the Basque Country, the treated unit (terrorism onset 1970). The outcome gdpcap is observed for every row; the predictor columns are sparse by design — the sectoral shares are pooled period averages (one block per region), the education levels are period snapshots, and popdens is a single 1969 cross-section — which is why their non-missing counts are far below 774.
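The panel arithmetic described above can be checked without downloading anything. The sketch below builds only the identifier skeleton (region numbers 1–18 and years 1955–1997, as stated in the text) and confirms the row counts. The names `panel`, `analysis_panel`, and `donor_pool` are illustrative and do not come from the post.

```python
from itertools import product

REGIONS = range(1, 19)       # 1 = Spain aggregate, 2-18 = autonomous communities
YEARS = range(1955, 1998)    # 1955-1997 inclusive
TREATED = 17                 # Basque Country

# One (regionno, year) pair per row of the balanced panel.
panel = list(product(REGIONS, YEARS))

# Drop the national aggregate, as the analysis does.
analysis_panel = [(r, y) for r, y in panel if r != 1]

# Donor pool: every remaining region except the treated one.
donor_pool = sorted({r for r, _ in analysis_panel if r != TREATED})

print(len(panel), len(analysis_panel), len(donor_pool))  # prints 774 731 16
```

The 16-region donor pool is the same count that appears later as the column dimension of the donor predictor matrix.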
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synth R package (basque dataset) | The full annual regional panel used in this tutorial — distributed verbatim with the package. | Abadie, A., Diamond, A., & Hainmueller, J. (2011). Synth: An R Package for Synthetic Control Methods in Comparative Case Studies. Journal of Statistical Software, 42(13), 1–17. https://doi.org/10.18637/jss.v042.i13 |
| Abadie & Gardeazabal (2003) | Original study and source of the data; defines the treated unit, donor pool, predictors, and 1970 treatment date. | Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188 |
| Method references | The synthetic control estimator and inference | Abadie, Diamond & Hainmueller (2010); Abadie (2021). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Basic Synthetic Control with R: The Basque Country Case Study [Data set]. https://carlos-mendez.org/post/r_basic_synthetic_control/
Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746
BibTeX
@misc{mendez2026rbasicsyntheticcontrol,
author = {Mendez, Carlos},
title = {Basic Synthetic Control with R: The Basque Country Case Study},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_basic_synthetic_control/}},
note = {Data set}
}
@article{abadie2003economic,
author = {Abadie, Alberto and Gardeazabal, Javier},
title = {The Economic Costs of Conflict: A Case Study of the Basque Country},
journal = {American Economic Review},
volume = {93}, number = {1}, pages = {113--132}, year = {2003}
}
@article{abadie2010synthetic,
author = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
title = {Synthetic Control Methods for Comparative Case Studies},
journal = {Journal of the American Statistical Association},
volume = {105}, number = {490}, pages = {493--505}, year = {2010}
}
Variable explorer (all 17 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| gdpcap | continuous | Real GDP per capita (1986 thousand USD) | Outcome variable: real GDP per capita in thousands of 1986 US dollars. | 1986 thousand US$ | source_data | Abadie & Gardeazabal (2003) via Synth |
| invest | continuous | Investment / GDP ratio (%) | Investment as a share of GDP (predictor). | % of GDP | source_data | Abadie & Gardeazabal (2003) via Synth |
| popdens | continuous | Population density, 1969 (per km^2) | Population density at the time of treatment (1969 cross-section; static control predictor). | people / km^2 | source_data | Abadie & Gardeazabal (2003) via Synth |
| regionname | identifier | Region name | Name of the regional unit (e.g. 'Basque Country (Pais Vasco)', 'Cataluna', 'Madrid (Comunidad De)'). | string | source_data | Synth package (basque) |
| regionno | identifier | Region number (1=Spain aggregate) | Integer identifier for the regional unit; 1 is the Spain national aggregate, 2-18 are the 17 autonomous communities (17 = Basque Country, treated). | 1-18 | source_data | Synth package (basque) |
| school_high | continuous | Population: higher education (raw count) | Number of people with higher education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_illit | continuous | Population: illiterate (raw count) | Number of people with no schooling / illiterate (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_med | continuous | Population: secondary education (raw count) | Number of people with intermediate/secondary education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_post_high | continuous | Population: post-secondary education (raw count) | Number of people with post-higher education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_prim | continuous | Population: primary education (raw count) | Number of people with primary education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_agriculture | continuous | Agriculture share of production (%) | Share of regional production in agriculture (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_construction | continuous | Construction share of production (%) | Share of regional production in construction (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_energy | continuous | Energy share of production (%) | Share of regional production in energy (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_industry | continuous | Industry share of production (%) | Share of regional production in industry (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_services_nonventa | continuous | Non-market services share of production (%) | Share of regional production in non-marketable ('nonventa') services (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_services_venta | continuous | Market services share of production (%) | Share of regional production in marketable ('venta') services (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| year | year | Calendar year | Annual time index, 1955-1997. Terrorism onset (treatment) is dated to 1970. | year | source_data | Synth package (basque) |
Cross-file variable index
All 17 variables appear in source_data, the only file in this collection.
Construction & formulas
The synthetic control method estimates the missing counterfactual GDP path of the treated unit (the Basque Country) as a weighted average of donor (untreated) regions. Let $X_1$ be the treated unit's pre-treatment predictor vector ($13 \times 1$), $X_0$ the donor predictor matrix ($13 \times 16$), and $Z_1$ / $Z_0$ the pre-treatment outcomes (annual `gdpcap`) for the treated unit and donors.
- Donor weights $W^*$ (inner problem): minimize the weighted predictor distance $\lVert X_1 - X_0 W \rVert_V = \sqrt{(X_1 - X_0 W)' V (X_1 - X_0 W)}$ over $W \ge 0$, $\sum_j w_j = 1$ (a convex combination of donors).
- Predictor weights $V^*$ (outer problem): minimize the pre-treatment outcome error $(Z_1 - Z_0 W^*(V))' (Z_1 - Z_0 W^*(V))$. The diagonal of $V$ controls how much each predictor matters and is chosen by cross-validation on pre-1970 GDP.
- ATT (gap): $\hat{\alpha}_{1t} = Y_{1t} - \sum_j w_j^* Y_{jt}$ for $t \ge 1970$, that is, actual Basque GDP minus the synthetic (donor-weighted) GDP, year by year. The headline ATT averages this gap over 1970–1997.
- Placebo / MSPE ratio: run the same pipeline treating each donor as if treated; rank by the post/pre mean-squared-prediction-error ratio. A trimmed pseudo p-value compares the Basque ratio to the distribution of comparable-fit placebos.
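To make the inner problem concrete, here is a minimal sketch that minimizes the V-weighted predictor distance over the simplex (non-negative weights that sum to one). It uses projected gradient descent, a generic solver chosen only for illustration and not necessarily the optimizer used by the Synth package or the post. The helper names and the toy inputs (2 predictors, 3 donors) are made up.

```python
def project_to_simplex(values):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    ordered = sorted(values, reverse=True)
    running, theta = 0.0, 0.0
    for k, u_k in enumerate(ordered, start=1):
        running += u_k
        candidate = (running - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in values]


def solve_donor_weights(x1, x0, v, n_iter=5000):
    """Minimize (X1 - X0 W)' V (X1 - X0 W) over the simplex.

    x1: treated predictors (length K); x0: K rows by J donor columns;
    v: diagonal of V (length K).
    """
    n_pred, n_donors = len(x0), len(x0[0])
    # Conservative step size: 1 / (2 * trace(X0' V X0)).
    bound = 2.0 * sum(v[k] * x0[k][j] ** 2
                      for k in range(n_pred) for j in range(n_donors))
    step = 1.0 / bound
    w = [1.0 / n_donors] * n_donors  # start from equal weights
    for _ in range(n_iter):
        resid = [x1[k] - sum(x0[k][j] * w[j] for j in range(n_donors))
                 for k in range(n_pred)]
        grad = [-2.0 * sum(x0[k][j] * v[k] * resid[k] for k in range(n_pred))
                for j in range(n_donors)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(n_donors)])
    return w


# Toy inputs (made up, not from the Basque data): 2 predictors, 3 donors.
X0 = [[1.0, 0.0, 1.0],
      [0.0, 1.0, 1.0]]
X1 = [0.6, 0.4]   # equals 0.6 * donor 1 + 0.4 * donor 2
V = [1.0, 1.0]

weights = solve_donor_weights(X1, X0, V)
print([round(w, 3) for w in weights])
```

In this toy case the treated vector is an exact convex combination of the first two donors, so the recovered weights concentrate on them and the third donor gets essentially zero weight, mirroring the sparse "recipe" a synthetic control typically produces.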
Education predictors are converted to within-region percentage shares and the top two education levels
are collapsed before matching (inside the post's prepare_basque() helper); the sectoral
predictors are pooled 1961–1969 averages. The raw, untransformed values are stored here.
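The education transformation described above can be sketched as follows. This is an assumption about what a helper like `prepare_basque()` does, since its code is not shown on this page: the two top levels are summed into `school_high`, and each level is then divided by the region's total across levels. The function name `education_shares` and the input figures are hypothetical.

```python
def education_shares(counts):
    """Collapse the top two levels, then convert to percentage shares."""
    collapsed = {
        "school_illit": counts["school_illit"],
        "school_prim": counts["school_prim"],
        "school_med": counts["school_med"],
        # school_high absorbs school_post_high
        "school_high": counts["school_high"] + counts["school_post_high"],
    }
    total = sum(collapsed.values())
    return {name: 100.0 * value / total for name, value in collapsed.items()}


# Made-up raw figures for a single region (not taken from the dataset).
example = {
    "school_illit": 10.0,
    "school_prim": 60.0,
    "school_med": 20.0,
    "school_high": 6.0,
    "school_post_high": 4.0,
}

shares = education_shares(example)
print(shares)
# {'school_illit': 10.0, 'school_prim': 60.0, 'school_med': 20.0, 'school_high': 10.0}
```

Working in shares rather than raw counts keeps large regions from dominating the predictor distance simply because they have more people.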
The datasets
Switch datasets with the tabs. Each shows the full variable dictionary plus a sortable statistics table with mini distributions and data coverage.
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| regionno | identifier | Region number (1=Spain aggregate) | Integer identifier for the regional unit; 1 is the Spain national aggregate, 2-18 are the 17 autonomous communities (17 = Basque Country, treated). | From the bundled basque dataset. | 1-18 | Synth package (basque) | all 774 rows |
| regionname | identifier | Region name | Name of the regional unit (e.g. 'Basque Country (Pais Vasco)', 'Cataluna', 'Madrid (Comunidad De)'). | From the bundled basque dataset (some names stored without diacritics). | string | Synth package (basque) | all 774 rows |
| year | year | Calendar year | Annual time index, 1955-1997. Terrorism onset (treatment) is dated to 1970. | From the bundled basque dataset. | year | Synth package (basque) | 1955-1997 (43 years) |
| gdpcap | continuous | Real GDP per capita (1986 thousand USD) | Outcome variable: real GDP per capita in thousands of 1986 US dollars. | Observed regional GDP per capita; this is the outcome the synthetic control matches and predicts. | 1986 thousand US$ | Abadie & Gardeazabal (2003) via Synth | all 774 rows (1955-1997) |
| sec_agriculture | continuous | Agriculture share of production (%) | Share of regional production in agriculture (sectoral predictor). | Pooled 1961-1969 sectoral average; one value per region. Originally sec.agriculture. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_energy | continuous | Energy share of production (%) | Share of regional production in energy (sectoral predictor). | Pooled 1961-1969 sectoral average; one value per region. Originally sec.energy. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_industry | continuous | Industry share of production (%) | Share of regional production in industry (sectoral predictor). | Pooled 1961-1969 sectoral average; one value per region. Originally sec.industry. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_construction | continuous | Construction share of production (%) | Share of regional production in construction (sectoral predictor). | Pooled 1961-1969 sectoral average; one value per region. Originally sec.construction. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_services_venta | continuous | Market services share of production (%) | Share of regional production in marketable ('venta') services (sectoral predictor). | Pooled 1961-1969 sectoral average; one value per region. Originally sec.services.venta. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_services_nonventa | continuous | Non-market services share of production (%) | Share of regional production in non-marketable ('nonventa') services (sectoral predictor). | Pooled 1961-1969 sectoral average; one value per region. Originally sec.services.nonventa. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| school_illit | continuous | Population: illiterate (raw count) | Number of people with no schooling / illiterate (education predictor, raw count). | Education-level figure; converted to a within-region percentage share before matching. Originally school.illit. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_prim | continuous | Population: primary education (raw count) | Number of people with primary education (education predictor, raw count). | Education-level figure; converted to a within-region percentage share before matching. Originally school.prim. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_med | continuous | Population: secondary education (raw count) | Number of people with intermediate/secondary education (education predictor, raw count). | Education-level figure; converted to a within-region percentage share before matching. Originally school.med. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_high | continuous | Population: higher education (raw count) | Number of people with higher education (education predictor, raw count). | Education-level figure; collapsed with school_post_high and converted to a share before matching. Originally school.high. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_post_high | continuous | Population: post-secondary education (raw count) | Number of people with post-higher education (education predictor, raw count). | Education-level figure; collapsed into school_high and converted to a share before matching. Originally school.post.high. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| popdens | continuous | Population density, 1969 (per km^2) | Population density at the time of treatment (1969 cross-section; static control predictor). | Single 1969 value per region. Originally popdens. | people / km^2 | Abadie & Gardeazabal (2003) via Synth | 18 rows (one 1969 value per region) |
| invest | continuous | Investment / GDP ratio (%) | Investment as a share of GDP (predictor). | Annual investment-to-GDP ratio; pre-treatment values enter the predictor set. | % of GDP | Abadie & Gardeazabal (2003) via Synth | 576 rows |
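The "Originally ..." notes in the dictionary follow one rule: dots in the source column names become underscores. A minimal sketch of that mapping is shown below; the helper name `to_stata_name` is hypothetical, and the sample names are the ones quoted in the table.

```python
def to_stata_name(name):
    """Replace dots with underscores so the name is a valid Stata variable name."""
    return name.replace(".", "_")


# Original column names as quoted in the dictionary above.
ORIGINAL_NAMES = [
    "sec.agriculture",
    "sec.services.nonventa",
    "school.post.high",
    "popdens",
]

renamed = {name: to_stata_name(name) for name in ORIGINAL_NAMES}
for old, new in renamed.items():
    print(f"{old} -> {new}")
```

Names without dots, such as `popdens`, pass through unchanged, which is useful when mapping these files back to the column names in the original package object.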
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| regionno | 100% | 774 | 18 | — | — | — | — | — |
| regionname | 100% | 774 | 18 | — | — | — | — | — |
| year | 100% | 774 | 43 | 1955 | 1976.0 | 1976 | 1997 | 12.42 |
| gdpcap | 100% | 774 | 771 | 1.24 | 5.39 | 5.34 | 12.35 | 2.24 |
| sec_agriculture | 12% | 90 | 89 | 1.32 | 20.27 | 19.24 | 46.50 | 10.38 |
| sec_energy | 12% | 90 | 80 | 1.60 | 5.19 | 3.67 | 21.36 | 4.04 |
| sec_industry | 12% | 90 | 89 | 9.56 | 23.92 | 23.14 | 46.22 | 9.28 |
| sec_construction | 12% | 90 | 84 | 4.34 | 7.21 | 7.13 | 11.28 | 1.36 |
| sec_services_venta | 12% | 90 | 87 | 26.23 | 36.49 | 34.75 | 58.21 | 7.26 |
| sec_services_nonventa | 12% | 90 | 86 | 3.43 | 6.93 | 6.68 | 13.11 | 1.98 |
| school_illit | 14% | 108 | 108 | 8.10 | 308.1 | 116.2 | 2,863.3 | 630.8 |
| school_prim | 14% | 108 | 108 | 151.3 | 2,118.5 | 852.1 | 19,460 | 4,216.8 |
| school_med | 14% | 108 | 108 | 8.61 | 145.6 | 47.75 | 1,696.1 | 297.5 |
| school_high | 14% | 108 | 108 | 3.06 | 45.94 | 16.70 | 474.9 | 92.11 |
| school_post_high | 14% | 108 | 108 | 1.66 | 25.46 | 7.71 | 252.2 | 51.58 |
| popdens | 2% | 18 | 18 | 22.38 | 105.8 | 80.38 | 442.5 | 101.5 |
| invest | 74% | 576 | 576 | 9.33 | 21.40 | 21.35 | 39.41 | 4.11 |
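The Coverage column is simply N divided by the 774 panel rows, rounded to a whole percent. The short check below reproduces it for one variable of each kind, using the N values from the table; the dictionary name `non_missing` is illustrative.

```python
TOTAL_ROWS = 774  # 18 units x 43 years

# Non-missing counts (N) copied from the statistics table.
non_missing = {
    "gdpcap": 774,
    "sec_agriculture": 90,
    "school_illit": 108,
    "popdens": 18,
    "invest": 576,
}

coverage = {name: round(100 * n / TOTAL_ROWS) for name, n in non_missing.items()}
print(coverage)
# {'gdpcap': 100, 'sec_agriculture': 12, 'school_illit': 14, 'popdens': 2, 'invest': 74}
```

After loading the real file, the same figures should be recoverable by counting non-missing values per column and dividing by the row count.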
Known limitations & caveats
- Predictor columns are sparse by design. Only `gdpcap` is observed for all 774 rows. The six sectoral shares (~90 non-missing) and five education levels (~108) are pooled period averages/snapshots, and `popdens` is a single 1969 cross-section (18 rows). The `Synth` workflow reads these via period-aggregation arguments, so the gaps are expected, not data errors.
- School columns are raw counts, not shares. `school_illit`, `school_prim`, `school_med`, `school_high`, and `school_post_high` are stored as raw figures; the post converts them to within-region percentage shares (and collapses `school_high` + `school_post_high`) before matching.
- Region 1 is an aggregate. Row group `regionno == 1` (“Spain (Espana)”) is the national total and is dropped from the analysis; the 17 autonomous communities are regions 2–18.
- Column names use underscores. The source `basque` object names sectoral and school variables with dots (e.g. `sec.agriculture`); they are renamed to underscores here (`sec_agriculture`) so the variables are valid Stata names. Values are unchanged.
- Real data, single treated unit. This is real regional accounting data, not simulated. With one treated region and a 16-region donor pool, inference rests on placebos rather than standard errors; the discrete pseudo p-values have limited resolution.
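The limited resolution of the pseudo p-values follows directly from the small number of units. The sketch below computes a post/pre MSPE ratio and a rank-based pseudo p-value. It assumes one common convention (the treated unit counts itself in both numerator and denominator) and does not implement the post's trimming rule, which is not specified here. All function names and toy ratios are hypothetical.

```python
def mspe(gaps):
    """Mean squared prediction error of a sequence of gaps."""
    return sum(g * g for g in gaps) / len(gaps)


def mspe_ratio(pre_gaps, post_gaps):
    """Post-treatment MSPE divided by pre-treatment MSPE."""
    return mspe(post_gaps) / mspe(pre_gaps)


def pseudo_p_value(treated_ratio, placebo_ratios):
    """Share of units (treated included) whose ratio is at least the treated one."""
    at_least = 1 + sum(1 for r in placebo_ratios if r >= treated_ratio)
    return at_least / (1 + len(placebo_ratios))


N_DONORS = 16  # donor pool size stated in the text

# Best case: no placebo ratio reaches the treated one (toy ratios, made up).
smallest_p = pseudo_p_value(10.0, [1.0] * N_DONORS)
print(round(smallest_p, 4))  # prints 0.0588
```

With 16 placebos the p-value can only move in steps of 1/17, so it can never fall below about 0.059 under this convention; trimming poorly fitting placebos shrinks the comparison set and makes the grid coarser still.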