Data dictionary · Basic Synthetic Control: The Basque Country Case Study

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`source_data`	region-year	774 × 17	source_data.dta	source_data.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
use "${BASE}source_data.dta", clear
describe
notes

Python

# Notebook-only shell magic (Colab/Jupyter); in a terminal run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
df = pd.read_stata(BASE + "source_data.dta")

# load every dataset at once
files = ["source_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "source_data.dta", "source_data.dta")
df, meta = pyreadstat.read_dta("source_data.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_basic_synthetic_control/data/"
df <- read_dta(paste0(BASE, "source_data.dta"))

Overview & sources

Companion data for a beginner-friendly R tutorial on the synthetic control method, applied to the classic Basque Country case study of Abadie and Gardeazabal (2003). The panel is the basque dataset bundled with the Synth R package: annual observations from 1955 to 1997 for 18 Spanish regional units. The outcome is real GDP per capita (gdpcap, in thousands of 1986 USD); the remaining columns are the 13 pre-treatment predictors matched by the algorithm: six sectoral production shares, five education levels, an investment-to-GDP ratio, and 1969 population density. The tutorial frames the analysis as a causal problem (estimand: ATT), builds a “synthetic Basque” from a weighted recipe of the other regions (85% Catalonia, 15% Madrid), and stress-tests the result with a Catalonia placebo and in-space placebos.

One file. source_data is an annual regional panel (one row per region × year, 18 units × 43 years = 774 rows over 1955–1997). Region 1 (“Spain (Espana)”) is the national aggregate and is dropped from the analysis; regions 2–18 are the 17 autonomous communities. Region 17 is the Basque Country, the treated unit (terrorism onset 1970). The outcome gdpcap is observed for every row; the predictor columns are sparse by design. The sectoral shares are recorded only for the block of years that is pooled into period averages (one block per region), the education levels are period snapshots, and popdens is a single 1969 cross-section, which is why their non-missing counts are far below 774.

Data sources

Source	Provides	Reference / URL
Synth R package (basque dataset)	The full annual regional panel used in this tutorial — distributed verbatim with the package.	Abadie, A., Diamond, A., & Hainmueller, J. (2011). Synth: An R Package for Synthetic Control Methods in Comparative Case Studies. Journal of Statistical Software, 42(13), 1–17. https://doi.org/10.18637/jss.v042.i13
Abadie & Gardeazabal (2003)	Original study and source of the data; defines the treated unit, donor pool, predictors, and 1970 treatment date.	Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188
Method references	The synthetic control estimator and inference	Abadie, Diamond & Hainmueller (2010); Abadie (2021).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Basic Synthetic Control with R: The Basque Country Case Study [Data set]. https://carlos-mendez.org/post/r_basic_synthetic_control/

Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746

BibTeX

@misc{mendez2026rbasicsyntheticcontrol,
  author       = {Mendez, Carlos},
  title        = {Basic Synthetic Control with R: The Basque Country Case Study},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_basic_synthetic_control/}},
  note         = {Data set}
}

@article{abadie2003economic,
  author  = {Abadie, Alberto and Gardeazabal, Javier},
  title   = {The Economic Costs of Conflict: A Case Study of the Basque Country},
  journal = {American Economic Review},
  volume  = {93}, number = {1}, pages = {113--132}, year = {2003},
  doi     = {10.1257/000282803321455188}
}

@article{abadie2010synthetic,
  author  = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
  title   = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
  journal = {Journal of the American Statistical Association},
  volume  = {105}, number = {490}, pages = {493--505}, year = {2010},
  doi     = {10.1198/jasa.2009.ap08746}
}

Variable explorer (all 17 variables)

Variable	Type	Label	Definition	Units	In files	Source
`gdpcap`	continuous	Real GDP per capita (1986 thousand USD)	Outcome variable: real GDP per capita in thousands of 1986 US dollars.	1986 thousand US$	source_data	Abadie & Gardeazabal (2003) via Synth
`invest`	continuous	Investment / GDP ratio (%)	Investment as a share of GDP (predictor).	% of GDP	source_data	Abadie & Gardeazabal (2003) via Synth
`popdens`	continuous	Population density, 1969 (per km^2)	Population density in 1969, the year before treatment onset (single cross-section; static control predictor).	people / km^2	source_data	Abadie & Gardeazabal (2003) via Synth
`regionname`	identifier	Region name	Name of the regional unit (e.g. 'Basque Country (Pais Vasco)', 'Cataluna', 'Madrid (Comunidad De)').	string	source_data	Synth package (basque)
`regionno`	identifier	Region number (1=Spain aggregate)	Integer identifier for the regional unit; 1 is the Spain national aggregate, 2-18 are the 17 autonomous communities (17 = Basque Country, treated).	1-18	source_data	Synth package (basque)
`school_high`	continuous	Population: higher education (raw count)	Number of people with higher education (education predictor, raw count).	count	source_data	Abadie & Gardeazabal (2003) via Synth
`school_illit`	continuous	Population: illiterate (raw count)	Number of people with no schooling / illiterate (education predictor, raw count).	count	source_data	Abadie & Gardeazabal (2003) via Synth
`school_med`	continuous	Population: secondary education (raw count)	Number of people with intermediate/secondary education (education predictor, raw count).	count	source_data	Abadie & Gardeazabal (2003) via Synth
`school_post_high`	continuous	Population: post-secondary education (raw count)	Number of people with post-higher education (education predictor, raw count).	count	source_data	Abadie & Gardeazabal (2003) via Synth
`school_prim`	continuous	Population: primary education (raw count)	Number of people with primary education (education predictor, raw count).	count	source_data	Abadie & Gardeazabal (2003) via Synth
`sec_agriculture`	continuous	Agriculture share of production (%)	Share of regional production in agriculture (sectoral predictor).	% of production	source_data	Abadie & Gardeazabal (2003) via Synth
`sec_construction`	continuous	Construction share of production (%)	Share of regional production in construction (sectoral predictor).	% of production	source_data	Abadie & Gardeazabal (2003) via Synth
`sec_energy`	continuous	Energy share of production (%)	Share of regional production in energy (sectoral predictor).	% of production	source_data	Abadie & Gardeazabal (2003) via Synth
`sec_industry`	continuous	Industry share of production (%)	Share of regional production in industry (sectoral predictor).	% of production	source_data	Abadie & Gardeazabal (2003) via Synth
`sec_services_nonventa`	continuous	Non-market services share of production (%)	Share of regional production in non-marketable ('nonventa') services (sectoral predictor).	% of production	source_data	Abadie & Gardeazabal (2003) via Synth
`sec_services_venta`	continuous	Market services share of production (%)	Share of regional production in marketable ('venta') services (sectoral predictor).	% of production	source_data	Abadie & Gardeazabal (2003) via Synth
`year`	year	Calendar year	Annual time index, 1955-1997. Terrorism onset (treatment) is dated to 1970.	year	source_data	Synth package (basque)

Cross-file variable index

Which file each variable appears in (● = present).

Variable	source_data
`gdpcap`	●
`invest`	●
`popdens`	●
`regionname`	●
`regionno`	●
`school_high`	●
`school_illit`	●
`school_med`	●
`school_post_high`	●
`school_prim`	●
`sec_agriculture`	●
`sec_construction`	●
`sec_energy`	●
`sec_industry`	●
`sec_services_nonventa`	●
`sec_services_venta`	●
`year`	●

Construction & formulas

The synthetic control method estimates the missing counterfactual GDP path of the treated unit (the Basque Country) as a weighted average of donor (untreated) regions. Let X1 be the treated unit's pre-treatment predictor vector (13 × 1), X0 the donor predictor matrix (13 × 16), and Z1/Z0 the pre-treatment outcomes (annual gdpcap) for the treated unit and donors.

Donor weights W* (inner problem): minimize the weighted predictor distance ‖X1 − X0·W‖_V = √[(X1 − X0·W)′ V (X1 − X0·W)] over W ≥ 0, Σ w_j = 1 (a convex combination of donors).
Predictor weights V* (outer problem): minimize the pre-treatment outcome error (Z1 − Z0·W*(V))′(Z1 − Z0·W*(V)) — the diagonal of V dials how much each predictor matters; chosen by cross-validation on pre-1970 GDP.
ATT (gap): α̂_1t = Y_1t − Σ_j w_j* · Y_jt for t ≥ 1970 — actual Basque GDP minus the synthetic (donor-weighted) GDP, year by year; the headline ATT averages this over 1970–1997.
Placebo / MSPE ratio: run the same pipeline treating each donor as if treated; rank by the post/pre mean-squared-prediction-error ratio. A trimmed pseudo p-value compares the Basque ratio to the distribution of comparable-fit placebos.
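
The inner problem and the gap can be sketched for the special case of only two donors, where the simplex constraint leaves a single free weight w in [0, 1] and the V-weighted least-squares problem has a closed form. This is a minimal illustration under stated assumptions, not the tutorial's code (which relies on the Synth package): the function names are hypothetical and every number is a made-up toy value, not taken from source_data.

```python
def two_donor_weight(x1, xa, xb, v):
    """Weight on donor A minimizing sum_k v[k] * (x1[k] - w*xa[k] - (1-w)*xb[k])**2, w in [0, 1]."""
    num = sum(vk * (t - b) * (a - b) for vk, t, a, b in zip(v, x1, xa, xb))
    den = sum(vk * (a - b) ** 2 for vk, a, b in zip(v, xa, xb))
    if den == 0:
        return 0.5  # identical donors: every weight fits equally well
    return min(1.0, max(0.0, num / den))  # clip to the [0, 1] constraint


def gaps(y_treated, y_a, y_b, w):
    """Year-by-year gap: treated outcome minus the donor-weighted synthetic outcome."""
    return [y1 - (w * ya + (1 - w) * yb) for y1, ya, yb in zip(y_treated, y_a, y_b)]


# Toy predictors (3 predictors, equal V weights); the treated unit is built as 0.85*A + 0.15*B.
xa = [10.0, 2.0, 5.0]
xb = [4.0, 8.0, 1.0]
x1 = [0.85 * a + 0.15 * b for a, b in zip(xa, xb)]
v = [1.0, 1.0, 1.0]
w = two_donor_weight(x1, xa, xb, v)

# Toy outcome paths; "treatment" starts in 1970.
years = [1968, 1969, 1970, 1971, 1972]
y_a = [5.0, 5.2, 5.4, 5.6, 5.8]
y_b = [4.0, 4.2, 4.4, 4.6, 4.8]
y_treated = [4.85, 5.05, 5.05, 5.15, 5.25]

gap = gaps(y_treated, y_a, y_b, w)
post = [g for year, g in zip(years, gap) if year >= 1970]
att = sum(post) / len(post)  # headline ATT: average post-treatment gap
print("weight on donor A:", round(w, 4))
print("average post-1970 gap:", round(att, 4))
```

With more than two donors there is no such closed form, and the weights come from a constrained quadratic optimizer, with V chosen in the outer problem.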

Education predictors are converted to within-region percentage shares and the top two education levels are collapsed before matching (inside the post's prepare_basque() helper); the sectoral predictors are pooled 1961–1969 averages. The raw, untransformed values are stored here.
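
The education transformation can be sketched as follows. This is an illustration, not the post's prepare_basque() helper: the function name education_shares is hypothetical, the input numbers are toy values, and the denominator is assumed to be the sum of the five education columns for the region.

```python
def education_shares(illit, prim, med, high, post_high):
    """Collapse the top two levels, then express each level as a percent of the total."""
    levels = {
        "school_illit": illit,
        "school_prim": prim,
        "school_med": med,
        "school_high": high + post_high,  # collapsed top category
    }
    total = sum(levels.values())  # assumed denominator: all five raw columns
    return {name: 100.0 * value / total for name, value in levels.items()}


# Toy raw figures for one region (not real data).
shares = education_shares(illit=100.0, prim=700.0, med=150.0, high=30.0, post_high=20.0)
for name, pct in shares.items():
    print(name, round(pct, 1))
```

After the transformation the four shares sum to 100 within each region, so regions of very different population size become comparable.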

The datasets

The single dataset, source_data, is documented below with its full variable dictionary and a statistics table with data coverage.

source_data · region-year panel · 774 rows × 17 columns · 1955-1997 · 18 regional units (Spain aggregate + 17 autonomous communities)

Panel key: regionno × year. Source panel for the synthetic control: outcome (gdpcap) plus pre-treatment predictors matched by synth().

Variable dictionary

Variable	Type	Label	Definition	Construction	Units	Source	Coverage
`regionno`	identifier	Region number (1=Spain aggregate)	Integer identifier for the regional unit; 1 is the Spain national aggregate, 2-18 are the 17 autonomous communities (17 = Basque Country, treated).	From the bundled basque dataset.	1-18	Synth package (basque)	all 774 rows
`regionname`	identifier	Region name	Name of the regional unit (e.g. 'Basque Country (Pais Vasco)', 'Cataluna', 'Madrid (Comunidad De)').	From the bundled basque dataset (some names stored without diacritics).	string	Synth package (basque)	all 774 rows
`year`	year	Calendar year	Annual time index, 1955-1997. Terrorism onset (treatment) is dated to 1970.	From the bundled basque dataset.	year	Synth package (basque)	1955-1997 (43 years)
`gdpcap`	continuous	Real GDP per capita (1986 thousand USD)	Outcome variable: real GDP per capita in thousands of 1986 US dollars.	Observed regional GDP per capita; this is the outcome the synthetic control matches and predicts.	1986 thousand US$	Abadie & Gardeazabal (2003) via Synth	all 774 rows (1955-1997)
`sec_agriculture`	continuous	Agriculture share of production (%)	Share of regional production in agriculture (sectoral predictor).	Raw values for a few pre-treatment years per region; pooled into a 1961-1969 average before matching. Originally sec.agriculture.	% of production	Abadie & Gardeazabal (2003) via Synth	~90 rows (period block per region)
`sec_energy`	continuous	Energy share of production (%)	Share of regional production in energy (sectoral predictor).	Raw values for a few pre-treatment years per region; pooled into a 1961-1969 average before matching. Originally sec.energy.	% of production	Abadie & Gardeazabal (2003) via Synth	~90 rows (period block per region)
`sec_industry`	continuous	Industry share of production (%)	Share of regional production in industry (sectoral predictor).	Raw values for a few pre-treatment years per region; pooled into a 1961-1969 average before matching. Originally sec.industry.	% of production	Abadie & Gardeazabal (2003) via Synth	~90 rows (period block per region)
`sec_construction`	continuous	Construction share of production (%)	Share of regional production in construction (sectoral predictor).	Raw values for a few pre-treatment years per region; pooled into a 1961-1969 average before matching. Originally sec.construction.	% of production	Abadie & Gardeazabal (2003) via Synth	~90 rows (period block per region)
`sec_services_venta`	continuous	Market services share of production (%)	Share of regional production in marketable ('venta') services (sectoral predictor).	Raw values for a few pre-treatment years per region; pooled into a 1961-1969 average before matching. Originally sec.services.venta.	% of production	Abadie & Gardeazabal (2003) via Synth	~90 rows (period block per region)
`sec_services_nonventa`	continuous	Non-market services share of production (%)	Share of regional production in non-marketable ('nonventa') services (sectoral predictor).	Raw values for a few pre-treatment years per region; pooled into a 1961-1969 average before matching. Originally sec.services.nonventa.	% of production	Abadie & Gardeazabal (2003) via Synth	~90 rows (period block per region)
`school_illit`	continuous	Population: illiterate (raw count)	Number of people with no schooling / illiterate (education predictor, raw count).	Education-level figure; converted to a within-region percentage share before matching. Originally school.illit.	count	Abadie & Gardeazabal (2003) via Synth	~108 rows (period snapshots per region)
`school_prim`	continuous	Population: primary education (raw count)	Number of people with primary education (education predictor, raw count).	Education-level figure; converted to a within-region percentage share before matching. Originally school.prim.	count	Abadie & Gardeazabal (2003) via Synth	~108 rows (period snapshots per region)
`school_med`	continuous	Population: secondary education (raw count)	Number of people with intermediate/secondary education (education predictor, raw count).	Education-level figure; converted to a within-region percentage share before matching. Originally school.med.	count	Abadie & Gardeazabal (2003) via Synth	~108 rows (period snapshots per region)
`school_high`	continuous	Population: higher education (raw count)	Number of people with higher education (education predictor, raw count).	Education-level figure; collapsed with school_post_high and converted to a share before matching. Originally school.high.	count	Abadie & Gardeazabal (2003) via Synth	~108 rows (period snapshots per region)
`school_post_high`	continuous	Population: post-secondary education (raw count)	Number of people with post-higher education (education predictor, raw count).	Education-level figure; collapsed into school_high and converted to a share before matching. Originally school.post.high.	count	Abadie & Gardeazabal (2003) via Synth	~108 rows (period snapshots per region)
`popdens`	continuous	Population density, 1969 (per km^2)	Population density in 1969, the year before treatment onset (single cross-section; static control predictor).	Single 1969 value per region. Originally popdens.	people / km^2	Abadie & Gardeazabal (2003) via Synth	18 rows (one 1969 value per region)
`invest`	continuous	Investment / GDP ratio (%)	Investment as a share of GDP (predictor).	Annual investment-to-GDP ratio; pre-treatment values enter the predictor set.	% of GDP	Abadie & Gardeazabal (2003) via Synth	576 rows

Summary statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`regionno`	100%	774	18	—	—	—	—	—
`regionname`	100%	774	18	—	—	—	—	—
`year`	100%	774	43	1955	1976.0	1976	1997	12.42
`gdpcap`	100%	774	771	1.24	5.39	5.34	12.35	2.24
`sec_agriculture`	12%	90	89	1.32	20.27	19.24	46.50	10.38
`sec_energy`	12%	90	80	1.60	5.19	3.67	21.36	4.04
`sec_industry`	12%	90	89	9.56	23.92	23.14	46.22	9.28
`sec_construction`	12%	90	84	4.34	7.21	7.13	11.28	1.36
`sec_services_venta`	12%	90	87	26.23	36.49	34.75	58.21	7.26
`sec_services_nonventa`	12%	90	86	3.43	6.93	6.68	13.11	1.98
`school_illit`	14%	108	108	8.10	308.1	116.2	2,863.3	630.8
`school_prim`	14%	108	108	151.3	2,118.5	852.1	19,460	4,216.8
`school_med`	14%	108	108	8.61	145.6	47.75	1,696.1	297.5
`school_high`	14%	108	108	3.06	45.94	16.70	474.9	92.11
`school_post_high`	14%	108	108	1.66	25.46	7.71	252.2	51.58
`popdens`	2%	18	18	22.38	105.8	80.38	442.5	101.5
`invest`	74%	576	576	9.33	21.40	21.35	39.41	4.11
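
The Coverage and N columns can be cross-checked with simple arithmetic. The sketch below recomputes the coverage percentages from the non-missing counts reported in the table and, assuming each variable is observed for the same number of years in all 18 units (an assumption, not verified against the file), the implied number of observed years per unit.

```python
N_UNITS = 18
N_YEARS = 1997 - 1955 + 1   # 43 annual observations per unit
N_ROWS = N_UNITS * N_YEARS  # 774 region-year rows

# Non-missing counts (N) copied from the statistics table.
non_missing = {"gdpcap": 774, "sec_*": 90, "school_*": 108, "popdens": 18, "invest": 576}

summary = {}
for name, n in non_missing.items():
    # Under the balanced-coverage assumption, n / 18 is the number of observed years per unit.
    years_per_unit, remainder = divmod(n, N_UNITS)
    coverage_pct = round(100 * n / N_ROWS)
    summary[name] = (coverage_pct, years_per_unit, remainder)
    print(f"{name:10s} coverage={coverage_pct:3d}%  years per unit={years_per_unit}  remainder={remainder}")
```

A remainder of zero for every variable is consistent with balanced coverage across units, though it does not prove it.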

Known limitations & caveats

Predictor columns are sparse by design. Only gdpcap (and the identifiers) is observed for all 774 rows. The six sectoral shares (~90 non-missing) and five education levels (~108) are recorded only for the short pre-treatment blocks that feed the pooled period averages and snapshots, invest covers 576 rows, and popdens is a single 1969 cross-section (18 rows). The Synth workflow reads these via period-aggregation arguments, so the gaps are expected, not data errors.
School columns are raw counts, not shares. school_illit, school_prim, school_med, school_high, and school_post_high are stored as raw figures; the post converts them to within-region percentage shares (and collapses school_high + school_post_high) before matching.
Region 1 is an aggregate. Row group regionno == 1 (“Spain (Espana)”) is the national total and is dropped from the analysis; the 17 autonomous communities are regions 2–18.
Column names use underscores. The source basque object names sectoral and school variables with dots (e.g. sec.agriculture); they are renamed to underscores here (sec_agriculture) so the variables are valid Stata names. Values are unchanged.
Real data, single treated unit. This is real regional accounting data, not simulated. With one treated region and a 16-region donor pool, inference rests on placebos rather than standard errors; the discrete pseudo p-values have limited resolution.
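
The limited resolution of placebo-based inference can be made concrete with a short sketch of the post/pre MSPE ratio and the pseudo p-value. This is an illustration under stated assumptions, not the tutorial's implementation: the function names, the trimming rule (drop placebos whose pre-treatment MSPE exceeds a chosen multiple of the treated unit's), and all gap values are hypothetical.

```python
def mspe(gap_values):
    """Mean squared prediction error of a list of gaps."""
    return sum(g * g for g in gap_values) / len(gap_values)


def pseudo_p_value(pre_gaps, post_gaps, treated, max_pre_multiple=None):
    """Share of retained units whose post/pre MSPE ratio is at least the treated unit's.

    pre_gaps and post_gaps map unit name -> list of gaps. If max_pre_multiple is set,
    placebos with pre-treatment MSPE above that multiple of the treated unit's are
    dropped first (a hypothetical trimming rule). Assumes no unit has zero pre-period MSPE.
    """
    pre = {u: mspe(g) for u, g in pre_gaps.items()}
    ratio = {u: mspe(post_gaps[u]) / pre[u] for u in pre_gaps}
    kept = [
        u for u in ratio
        if u == treated
        or max_pre_multiple is None
        or pre[u] <= max_pre_multiple * pre[treated]
    ]
    extreme = sum(1 for u in kept if ratio[u] >= ratio[treated])
    return extreme / len(kept)


# Toy gaps for one "treated" unit T and three placebos; C has a poor pre-treatment fit.
pre_gaps = {"T": [0.1, -0.1], "A": [0.1, 0.1], "B": [0.2, -0.2], "C": [1.0, 1.0]}
post_gaps = {"T": [1.0, 1.0], "A": [0.2, 0.2], "B": [0.2, 0.2], "C": [20.0, 20.0]}

p_all = pseudo_p_value(pre_gaps, post_gaps, treated="T")
p_trimmed = pseudo_p_value(pre_gaps, post_gaps, treated="T", max_pre_multiple=5)

# With 16 donors plus the treated unit, the smallest attainable untrimmed value is 1/17.
min_p = 1 / (16 + 1)
print(p_all, p_trimmed, round(min_p, 3))
```

Because the pseudo p-value is a rank share, it can only take values k/n for the n retained units, which is why trimming changes both the ranking and the attainable resolution.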