Interactive data dictionary

Basic Synthetic Control: The Basque Country Case Study

The real Basque regional panel that ships with the Synth R package — the data behind Abadie & Gardeazabal's economic-cost-of-conflict study.

1 dataset · 17 variables · 18 regions · years 1955–1997

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| source_data | region-year | 774 × 17 | source_data.dta | source_data.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
use "${BASE}source_data.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
df = pd.read_stata(BASE + "source_data.dta")

# load every dataset at once
files = ["source_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "source_data.dta", "source_data.dta")
df, meta = pyreadstat.read_dta("source_data.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
df <- read_dta(paste0(BASE, "source_data.dta"))

Overview & sources

Companion data for a beginner-friendly R tutorial on the synthetic control method, applied to the classic Basque Country case study of Abadie and Gardeazabal (2003). The panel is the basque dataset bundled with the Synth R package: annual observations from 1955–1997 for 18 Spanish regional units. The outcome is real GDP per capita (gdpcap, 1986 thousands of USD); the remaining columns are the 13 pre-treatment predictors matched by the algorithm — six sectoral production shares, five education levels, an investment-to-GDP ratio, and a 1969 population density. The tutorial frames the analysis as a causal problem (estimand: ATT), builds a “synthetic Basque” from a weighted recipe of the other regions (85% Catalonia, 15% Madrid), and stress-tests the result with Catalonia and in-space placebos.
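The "weighted recipe" idea can be illustrated with a minimal Python sketch. The weights (0.85 Catalonia, 0.15 Madrid) are the ones quoted above; the GDP per capita values are hypothetical placeholders rather than figures from the dataset, and the function name synthetic_path is an illustrative assumption, not part of the tutorial's code.

```python
# Minimal sketch: a synthetic control path is a weighted average of donor paths.
# Weights are the ones quoted in the text; all gdpcap numbers are HYPOTHETICAL.

def synthetic_path(donor_paths, weights):
    """Return the weighted average of donor outcome paths, year by year."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_years = len(next(iter(donor_paths.values())))
    return [
        sum(weights[name] * donor_paths[name][t] for name in weights)
        for t in range(n_years)
    ]

weights = {"Cataluna": 0.85, "Madrid (Comunidad De)": 0.15}

# Hypothetical outcome paths over three years (thousands of 1986 USD).
donor_paths = {
    "Cataluna": [6.0, 7.0, 8.0],
    "Madrid (Comunidad De)": [8.0, 9.0, 10.0],
}
basque_observed = [6.5, 7.0, 7.5]

synthetic = synthetic_path(donor_paths, weights)

# The estimated effect in each year is the observed path minus the synthetic one.
gap = [obs - syn for obs, syn in zip(basque_observed, synthetic)]
```

A negative gap in the post-treatment years would be read as GDP per capita lost relative to the counterfactual, which is the quantity the ATT summarizes.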

One file. source_data is an annual regional panel (one row per region × year, 18 units × 43 years = 774 rows over 1955–1997). Region 1 (“Spain (Espana)”) is the national aggregate and is dropped from the analysis; regions 2–18 are the 17 autonomous communities. Region 17 is the Basque Country, the treated unit (terrorism onset 1970). The outcome gdpcap is observed for every row; the predictor columns are sparse by design, which is why their non-missing counts are far below 774: the sectoral shares are recorded only in a 1961–1969 block for each region (90 rows, pooled into period averages before matching), the education levels are period snapshots (108 rows), and popdens is a single 1969 cross-section (18 rows).
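The panel arithmetic above can be checked with a short Python sketch that rebuilds the region × year key grid without loading the data. The region numbers and the year range come from the text; treating every year before 1970 as pre-treatment is an assumption about where the cut falls, and the variable names are illustrative.

```python
from itertools import product

# Panel key grid: regionno 1..18 crossed with year 1955..1997.
regions = range(1, 19)
years = range(1955, 1998)
panel_keys = list(product(regions, years))

SPAIN, BASQUE = 1, 17          # national aggregate and treated unit
TREATMENT_YEAR = 1970          # terrorism onset as dated in the text

# Donor pool: every region except the aggregate and the treated unit.
donors = [r for r in regions if r not in (SPAIN, BASQUE)]

# Rows that remain once the national aggregate is dropped.
analysis_keys = [(r, y) for r, y in panel_keys if r != SPAIN]

# ASSUMPTION: all years strictly before onset count as pre-treatment.
pre_treatment_years = [y for y in years if y < TREATMENT_YEAR]

print(len(panel_keys), len(donors), len(analysis_keys), len(pre_treatment_years))
```

The donor count of 16 is the column dimension of the donor predictor matrix in the synthetic control setup.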

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Synth R package (basque dataset) | The full annual regional panel used in this tutorial, distributed verbatim with the package. | Abadie, A., Diamond, A., & Hainmueller, J. (2011). Synth: An R Package for Synthetic Control Methods in Comparative Case Studies. Journal of Statistical Software, 42(13), 1–17. https://doi.org/10.18637/jss.v042.i13 |
| Abadie & Gardeazabal (2003) | Original study and source of the data; defines the treated unit, donor pool, predictors, and 1970 treatment date. | Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188 |
| Method references | The synthetic control estimator and inference. | Abadie, Diamond & Hainmueller (2010); Abadie (2021). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Basic Synthetic Control with R: The Basque Country Case Study [Data set]. https://carlos-mendez.org/post/r_basic_synthetic_control/

Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746

BibTeX

@misc{mendez2026rbasicsyntheticcontrol,
  author       = {Mendez, Carlos},
  title        = {Basic Synthetic Control with R: The Basque Country Case Study},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_basic_synthetic_control/}},
  note         = {Data set}
}

@article{abadie2003economic,
  author  = {Abadie, Alberto and Gardeazabal, Javier},
  title   = {The Economic Costs of Conflict: A Case Study of the Basque Country},
  journal = {American Economic Review},
  volume  = {93}, number = {1}, pages = {113--132}, year = {2003}
}
@article{abadie2010synthetic,
  author  = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
  title   = {Synthetic Control Methods for Comparative Case Studies},
  journal = {Journal of the American Statistical Association},
  volume  = {105}, number = {490}, pages = {493--505}, year = {2010}
}

Variable explorer (all 17 variables)


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| gdpcap | continuous | min 1.24 · median 5.34 · max 12.4 | Real GDP per capita (1986 thousand USD) | Outcome variable: real GDP per capita in thousands of 1986 US dollars. | 1986 thousand US$ | source_data | Abadie & Gardeazabal (2003) via Synth |
| invest | continuous | min 9.33 · median 21.4 · max 39.4 | Investment / GDP ratio (%) | Investment as a share of GDP (predictor). | % of GDP | source_data | Abadie & Gardeazabal (2003) via Synth |
| popdens | continuous | min 22.4 · median 80.4 · max 442 | Population density, 1969 (per km^2) | Population density in 1969, the last year before treatment onset (single cross-section; static control predictor). | people / km^2 | source_data | Abadie & Gardeazabal (2003) via Synth |
| regionname | identifier | | Region name | Name of the regional unit (e.g. 'Basque Country (Pais Vasco)', 'Cataluna', 'Madrid (Comunidad De)'). | string | source_data | Synth package (basque) |
| regionno | identifier | | Region number (1=Spain aggregate) | Integer identifier for the regional unit; 1 is the Spain national aggregate, 2-18 are the 17 autonomous communities (17 = Basque Country, treated). | 1-18 | source_data | Synth package (basque) |
| school_high | continuous | min 3.06 · median 16.7 · max 475 | Population: higher education (raw count) | Number of people with higher education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_illit | continuous | min 8.1 · median 116 · max 2.86e+03 | Population: illiterate (raw count) | Number of people with no schooling / illiterate (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_med | continuous | min 8.61 · median 47.8 · max 1.7e+03 | Population: secondary education (raw count) | Number of people with intermediate/secondary education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_post_high | continuous | min 1.66 · median 7.71 · max 252 | Population: post-secondary education (raw count) | Number of people with post-higher education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| school_prim | continuous | min 151 · median 852 · max 1.95e+04 | Population: primary education (raw count) | Number of people with primary education (education predictor, raw count). | count | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_agriculture | continuous | min 1.32 · median 19.2 · max 46.5 | Agriculture share of production (%) | Share of regional production in agriculture (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_construction | continuous | min 4.34 · median 7.13 · max 11.3 | Construction share of production (%) | Share of regional production in construction (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_energy | continuous | min 1.6 · median 3.67 · max 21.4 | Energy share of production (%) | Share of regional production in energy (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_industry | continuous | min 9.56 · median 23.1 · max 46.2 | Industry share of production (%) | Share of regional production in industry (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_services_nonventa | continuous | min 3.43 · median 6.68 · max 13.1 | Non-market services share of production (%) | Share of regional production in non-marketable ('nonventa') services (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| sec_services_venta | continuous | min 26.2 · median 34.8 · max 58.2 | Market services share of production (%) | Share of regional production in marketable ('venta') services (sectoral predictor). | % of production | source_data | Abadie & Gardeazabal (2003) via Synth |
| year | year | | Calendar year | Annual time index, 1955-1997. Terrorism onset (treatment) is dated to 1970. | year | source_data | Synth package (basque) |

Cross-file variable index

Which file each variable appears in (● = present). With a single dataset, all 17 variables appear in source_data.

Construction & formulas

The synthetic control method estimates the missing counterfactual GDP path of the treated unit (the Basque Country) as a weighted average of donor (untreated) regions. Let X1 be the treated unit's pre-treatment predictor vector (13 × 1), X0 the donor predictor matrix (13 × 16), and Z1/Z0 the pre-treatment outcomes (annual gdpcap) for the treated unit and donors.
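The estimator described above is commonly written as a nested optimization, following Abadie, Diamond & Hainmueller (2010). The statement below is the standard textbook form and may differ in detail from the notation used in the post; W (the 16 × 1 vector of donor weights w_j), V (a diagonal, positive semidefinite matrix of predictor importance weights), and Y_{jt} (the outcome of donor j in year t) are symbols introduced here for illustration.

```latex
W^{*}(V) \;=\; \arg\min_{W}\; (X_1 - X_0 W)^{\top}\, V\, (X_1 - X_0 W)
\quad \text{subject to} \quad w_j \ge 0 \;\; (j = 1, \dots, 16), \qquad \sum_{j=1}^{16} w_j = 1,

V^{*} \;=\; \arg\min_{V}\; \bigl(Z_1 - Z_0\, W^{*}(V)\bigr)^{\top} \bigl(Z_1 - Z_0\, W^{*}(V)\bigr),

\hat{Y}^{N}_{1t} \;=\; \sum_{j=1}^{16} w_j^{*}\, Y_{jt},
\qquad
\hat{\alpha}_{1t} \;=\; Y_{1t} - \hat{Y}^{N}_{1t} \quad \text{for } t \ge 1970 .
```

The inner problem picks donor weights that reproduce the treated unit's predictors, the outer problem picks predictor weights so that the resulting synthetic unit tracks the pre-treatment outcome path, and the post-1970 gap is the estimated effect on the treated unit.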

Education predictors are converted to within-region percentage shares and the top two education levels are collapsed before matching (inside the post's prepare_basque() helper); the sectoral predictors are pooled 1961–1969 averages. The raw, untransformed values are stored here.
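The education transformation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the post's prepare_basque() helper: it assumes the share denominator is the sum of the five raw education columns, the merged column name school_higher is hypothetical, and the example counts are made up.

```python
# Sketch of the education transformation for ONE region-year record.
# ASSUMPTION: shares are taken over the sum of the five raw education counts.

def education_shares(counts):
    """Merge the two highest levels, then convert counts to percentage shares."""
    collapsed = {
        "school_illit": counts["school_illit"],
        "school_prim": counts["school_prim"],
        "school_med": counts["school_med"],
        # Hypothetical name for the merged top category.
        "school_higher": counts["school_high"] + counts["school_post_high"],
    }
    total = sum(collapsed.values())
    if total <= 0:
        raise ValueError("education counts must sum to a positive number")
    return {name: 100.0 * value / total for name, value in collapsed.items()}

# Hypothetical raw counts, chosen so the arithmetic is easy to follow.
example = {
    "school_illit": 10.0,
    "school_prim": 60.0,
    "school_med": 20.0,
    "school_high": 7.0,
    "school_post_high": 3.0,
}
shares = education_shares(example)
```

Working with shares rather than raw counts keeps large regions from dominating the match purely because of their population size.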

The datasets

Each dataset is documented with its full variable dictionary and a statistics table covering distributions and data coverage.


source_data · region-year · 774 × 17 · 1955-1997 · 18 regional units (Spain aggregate + 17 autonomous communities)

Panel key: regionno × year · Source panel for the synthetic control: outcome (gdpcap) + pre-treatment predictors matched by synth().

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| regionno | identifier | Region number (1=Spain aggregate) | Integer identifier for the regional unit; 1 is the Spain national aggregate, 2-18 are the 17 autonomous communities (17 = Basque Country, treated). | From the bundled basque dataset. | 1-18 | Synth package (basque) | all 774 rows |
| regionname | identifier | Region name | Name of the regional unit (e.g. 'Basque Country (Pais Vasco)', 'Cataluna', 'Madrid (Comunidad De)'). | From the bundled basque dataset (some names stored without diacritics). | string | Synth package (basque) | all 774 rows |
| year | year | Calendar year | Annual time index, 1955-1997. Terrorism onset (treatment) is dated to 1970. | From the bundled basque dataset. | year | Synth package (basque) | 1955-1997 (43 years) |
| gdpcap | continuous | Real GDP per capita (1986 thousand USD) | Outcome variable: real GDP per capita in thousands of 1986 US dollars. | Observed regional GDP per capita; this is the outcome the synthetic control matches and predicts. | 1986 thousand US$ | Abadie & Gardeazabal (2003) via Synth | all 774 rows (1955-1997) |
| sec_agriculture | continuous | Agriculture share of production (%) | Share of regional production in agriculture (sectoral predictor). | Recorded only in a 1961-1969 block (about 5 rows per region); pooled into a 1961-1969 average before matching. Originally sec.agriculture. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_energy | continuous | Energy share of production (%) | Share of regional production in energy (sectoral predictor). | Recorded only in a 1961-1969 block (about 5 rows per region); pooled into a 1961-1969 average before matching. Originally sec.energy. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_industry | continuous | Industry share of production (%) | Share of regional production in industry (sectoral predictor). | Recorded only in a 1961-1969 block (about 5 rows per region); pooled into a 1961-1969 average before matching. Originally sec.industry. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_construction | continuous | Construction share of production (%) | Share of regional production in construction (sectoral predictor). | Recorded only in a 1961-1969 block (about 5 rows per region); pooled into a 1961-1969 average before matching. Originally sec.construction. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_services_venta | continuous | Market services share of production (%) | Share of regional production in marketable ('venta') services (sectoral predictor). | Recorded only in a 1961-1969 block (about 5 rows per region); pooled into a 1961-1969 average before matching. Originally sec.services.venta. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| sec_services_nonventa | continuous | Non-market services share of production (%) | Share of regional production in non-marketable ('nonventa') services (sectoral predictor). | Recorded only in a 1961-1969 block (about 5 rows per region); pooled into a 1961-1969 average before matching. Originally sec.services.nonventa. | % of production | Abadie & Gardeazabal (2003) via Synth | ~90 rows (period block per region) |
| school_illit | continuous | Population: illiterate (raw count) | Number of people with no schooling / illiterate (education predictor, raw count). | Education-level figure; converted to a within-region percentage share before matching. Originally school.illit. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_prim | continuous | Population: primary education (raw count) | Number of people with primary education (education predictor, raw count). | Education-level figure; converted to a within-region percentage share before matching. Originally school.prim. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_med | continuous | Population: secondary education (raw count) | Number of people with intermediate/secondary education (education predictor, raw count). | Education-level figure; converted to a within-region percentage share before matching. Originally school.med. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_high | continuous | Population: higher education (raw count) | Number of people with higher education (education predictor, raw count). | Education-level figure; collapsed with school_post_high and converted to a share before matching. Originally school.high. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| school_post_high | continuous | Population: post-secondary education (raw count) | Number of people with post-higher education (education predictor, raw count). | Education-level figure; collapsed into school_high and converted to a share before matching. Originally school.post.high. | count | Abadie & Gardeazabal (2003) via Synth | ~108 rows (period snapshots per region) |
| popdens | continuous | Population density, 1969 (per km^2) | Population density in 1969, the last year before treatment onset (single cross-section; static control predictor). | Single 1969 value per region. Originally popdens. | people / km^2 | Abadie & Gardeazabal (2003) via Synth | 18 rows (one 1969 value per region) |
| invest | continuous | Investment / GDP ratio (%) | Investment as a share of GDP (predictor). | Annual investment-to-GDP ratio; pre-treatment values enter the predictor set. | % of GDP | Abadie & Gardeazabal (2003) via Synth | 576 rows |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| regionno | | 100% | 774 | 18 | | | | | |
| regionname | | 100% | 774 | 18 | | | | | |
| year | | 100% | 774 | 43 | 1955 | 1976.0 | 1976 | 1997 | 12.42 |
| gdpcap | min 1.24 · median 5.34 · max 12.4 | 100% | 774 | 771 | 1.24 | 5.39 | 5.34 | 12.35 | 2.24 |
| sec_agriculture | min 1.32 · median 19.2 · max 46.5 | 12% | 90 | 89 | 1.32 | 20.27 | 19.24 | 46.50 | 10.38 |
| sec_energy | min 1.6 · median 3.67 · max 21.4 | 12% | 90 | 80 | 1.60 | 5.19 | 3.67 | 21.36 | 4.04 |
| sec_industry | min 9.56 · median 23.1 · max 46.2 | 12% | 90 | 89 | 9.56 | 23.92 | 23.14 | 46.22 | 9.28 |
| sec_construction | min 4.34 · median 7.13 · max 11.3 | 12% | 90 | 84 | 4.34 | 7.21 | 7.13 | 11.28 | 1.36 |
| sec_services_venta | min 26.2 · median 34.8 · max 58.2 | 12% | 90 | 87 | 26.23 | 36.49 | 34.75 | 58.21 | 7.26 |
| sec_services_nonventa | min 3.43 · median 6.68 · max 13.1 | 12% | 90 | 86 | 3.43 | 6.93 | 6.68 | 13.11 | 1.98 |
| school_illit | min 8.1 · median 116 · max 2.86e+03 | 14% | 108 | 108 | 8.10 | 308.1 | 116.2 | 2,863.3 | 630.8 |
| school_prim | min 151 · median 852 · max 1.95e+04 | 14% | 108 | 108 | 151.3 | 2,118.5 | 852.1 | 19,460 | 4,216.8 |
| school_med | min 8.61 · median 47.8 · max 1.7e+03 | 14% | 108 | 108 | 8.61 | 145.6 | 47.75 | 1,696.1 | 297.5 |
| school_high | min 3.06 · median 16.7 · max 475 | 14% | 108 | 108 | 3.06 | 45.94 | 16.70 | 474.9 | 92.11 |
| school_post_high | min 1.66 · median 7.71 · max 252 | 14% | 108 | 108 | 1.66 | 25.46 | 7.71 | 252.2 | 51.58 |
| popdens | min 22.4 · median 80.4 · max 442 | 2% | 18 | 18 | 22.38 | 105.8 | 80.38 | 442.5 | 101.5 |
| invest | min 9.33 · median 21.4 · max 39.4 | 74% | 576 | 576 | 9.33 | 21.40 | 21.35 | 39.41 | 4.11 |
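The Coverage column follows from the N column and the 774-row panel. The Python sketch below reproduces that arithmetic for a few variables. The N values are copied from the table; the rows-per-region figures assume the non-missing rows are spread evenly across the 18 regions, which is an assumption rather than something the table states.

```python
# Coverage = non-missing rows as a percentage of the full 18 x 43 panel.
N_REGIONS, N_YEARS = 18, 43
TOTAL_ROWS = N_REGIONS * N_YEARS

# Non-missing counts (N) copied from the statistics table.
non_missing = {
    "gdpcap": 774,
    "sec_agriculture": 90,
    "school_illit": 108,
    "popdens": 18,
    "invest": 576,
}

def coverage_pct(n, total=TOTAL_ROWS):
    """Share of panel rows with a non-missing value, rounded to a whole percent."""
    return round(100 * n / total)

coverage = {name: coverage_pct(n) for name, n in non_missing.items()}

# ASSUMPTION: non-missing rows are evenly distributed across regions.
rows_per_region = {name: n // N_REGIONS for name, n in non_missing.items()}

for name in non_missing:
    print(f"{name}: {coverage[name]}% coverage, {rows_per_region[name]} rows per region")
```

Under the even-spread assumption, the sectoral shares work out to 5 rows per region and the education levels to 6, which is consistent with their description as short period blocks rather than full annual series.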

Known limitations & caveats