Interactive data dictionary

Pooled PCA for Building Development Indicators Across Time

Subnational Human Development data for 153 South American regions in 2013 and 2019, in wide and long-panel form.

2 datasets · 153 regions · 12 countries · periods: 2013, 2019

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
| --- | --- | --- | --- | --- |
| data | region (one row per region) | 153 × 24 | data.dta | data.csv |
| data_long | region-period | 306 × 10 | data_long.dta | data_long.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data/"
use "${BASE}data.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data/"
df = pd.read_stata(BASE + "data.dta")

# load every dataset at once
files = ["data", "data_long"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "data.dta", "data.dta")
df, meta = pyreadstat.read_dta("data.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data/"
df <- read_dta(paste0(BASE, "data.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that builds a temporally comparable composite Human Development Index using pooled PCA and contrasts it with the naive per-period approach. The data are a South American subset of the Subnational Human Development Index from the Global Data Lab (Smits & Permanyer, 2019), with Education, Health, and Income sub-indices (plus the official SHDI and its underlying components) for 153 sub-national regions across 12 South American countries in 2013 and 2019. Pooled PCA stacks both periods and computes a single set of standardization parameters, eigenvector weights, and normalization bounds from the combined data, so a region's 2013 score is measured against the same yardstick as its 2019 score. The pooled PC1 captures 72.42% of variance with weights [0.5642, 0.5448, 0.6204] and registers a net development shift of +0.1439 that per-period PCA forces to zero by construction.

Two files, same underlying observations in two shapes. data.csv is the wide source — one row per region with year-suffixed columns (*2013 / *2019) for the SHDI, its three sub-indices, and the underlying HDI components. data_long.csv is the reshaped long panel the pooled PCA actually consumes — one row per region × period (153 × 2 = 306 rows), keeping only the three sub-indices used by the index plus the official SHDI, population, and a display label. The post melts data.csv into data_long.csv; both ship here so the reshape is fully reproducible.
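
The reshape from the wide file to the long panel can be sketched as follows. This is a minimal illustration, not the post's own code: the helper name `wide_to_long_panel`, the toy two-region frame, and its values are hypothetical, and the row order of the real data_long.csv is not assumed. The rounding (4 decimals for the indices, 1 for population) follows what this dictionary documents for data_long; the region_country label is left out.

```python
import pandas as pd

ID_COLS = ["GDLcode", "region", "country"]
# wide column stem -> long column name (as listed in this data dictionary)
STEMS = {
    "edindex": "education",
    "healthindex": "health",
    "incindex": "income",
    "shdi": "shdi_official",
    "pop": "pop",
}


def wide_to_long_panel(wide, years=(2013, 2019)):
    """Stack year-suffixed columns into one row per region and period."""
    frames = []
    for year in years:
        rename = {f"{stem}{year}": name for stem, name in STEMS.items()}
        part = wide[ID_COLS + list(rename)].rename(columns=rename)
        part.insert(len(ID_COLS), "period", f"Y{year}")  # Y2013 / Y2019 tag
        frames.append(part)
    long = pd.concat(frames, ignore_index=True)
    index_cols = ["education", "health", "income", "shdi_official"]
    long[index_cols] = long[index_cols].round(4)
    long["pop"] = long["pop"].round(1)
    return long


# Hypothetical toy input with the same column names as data.csv
wide = pd.DataFrame({
    "GDLcode": ["XXXr101", "XXXr102"],
    "region": ["Region A", "Region B"],
    "country": ["Country X", "Country X"],
    "edindex2013": [0.61234, 0.70456], "edindex2019": [0.63211, 0.71987],
    "healthindex2013": [0.80123, 0.85432], "healthindex2019": [0.81999, 0.86001],
    "incindex2013": [0.70011, 0.74444], "incindex2019": [0.69555, 0.75123],
    "shdi2013": [0.70001, 0.76543], "shdi2019": [0.71234, 0.77321],
    "pop2013": [1210.94, 350.27], "pop2019": [1265.61, 372.88],
})
long = wide_to_long_panel(wide)
print(long)
```

Applied to the real data.csv, the same function should return 306 rows (153 regions × 2 periods), which can be compared against data_long.csv.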

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| Global Data Lab: Subnational HDI | All values: SHDI, sub-indices (education/health/income), and HDI components by region-year | Global Data Lab. Subnational Human Development Index (SHDI). https://globaldatalab.org/shdi/ |
| Smits & Permanyer (2019) | Source database construction and methodology for the SHDI | Smits, J., & Permanyer, I. (2019). The Subnational Human Development Database. Scientific Data, 6, 190038. https://doi.org/10.1038/sdata.2019.38 |
| Method references | PCA / composite-index methods | Jolliffe & Cadima (2016), Phil. Trans. R. Soc. A 374(2065); Peiro-Palomino, Picazo-Tadeo & Rios (2023), Oxford Economic Papers 75(2); UNDP (2024), Human Development Index Technical Notes. |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Pooled PCA for Building Development Indicators Across Time [Data set]. https://carlos-mendez.org/post/python_pca2/

Smits, J., & Permanyer, I. (2019). The Subnational Human Development Database. Scientific Data, 6, 190038. https://doi.org/10.1038/sdata.2019.38

BibTeX

@misc{mendez2026pythonpca2,
  author       = {Mendez, Carlos},
  title        = {Pooled PCA for Building Development Indicators Across Time},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_pca2/}},
  note         = {Data set}
}

@article{smits2019subnational,
  author  = {Smits, Jeroen and Permanyer, Iñaki},
  title   = {The Subnational Human Development Database},
  journal = {Scientific Data},
  volume  = {6}, pages = {190038}, year = {2019},
  doi     = {10.1038/sdata.2019.38}
}

Variable explorer (31 variables)

Each row gives a variable's type, a distribution summary (min, median, max) for continuous variables, and the files it appears in.

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GDLcode | identifier | | Global Data Lab region code | Unique identifier of the sub-national region in the Global Data Lab database. | code | data, data_long | Global Data Lab |
| SubContinent | identifier | | Sub-continent | Geographic sub-continent of the region (constant: South America). | string | data | Global Data Lab |
| country | identifier | | Country name | Country containing the region (12 South American countries). | string | data, data_long | Global Data Lab |
| edindex2013 | continuous | min 0.382, median 0.655, max 0.926 | Education sub-index, 2013 | SHDI education dimension index for 2013. | 0-1 | data | Global Data Lab |
| edindex2019 | continuous | min 0.446, median 0.679, max 0.946 | Education sub-index, 2019 | SHDI education dimension index for 2019. | 0-1 | data | Global Data Lab |
| education | continuous | min 0.382, median 0.671, max 0.946 | Education sub-index | SHDI education dimension index (higher = more education). | 0-1 | data_long | Global Data Lab |
| esch2013 | continuous | min 9.52, median 14, max 18 | Expected years of schooling, 2013 | Expected years of schooling for a school-entry-age child (2013). | years | data | Global Data Lab |
| esch2019 | continuous | min 10.3, median 14.2, max 18 | Expected years of schooling, 2019 | Expected years of schooling for a school-entry-age child (2019). | years | data | Global Data Lab |
| gnic2013 | continuous | min 1.88e+03, median 1.43e+04, max 2.79e+04 | GNI per capita, 2013 | Gross National Income per capita (2013); basis of the income sub-index. | 2011 PPP US$ | data | Global Data Lab |
| gnic2019 | continuous | min 1.97e+03, median 1.19e+04, max 2.68e+04 | GNI per capita, 2019 | Gross National Income per capita (2019); basis of the income sub-index. | 2011 PPP US$ | data | Global Data Lab |
| health | continuous | min 0.666, median 0.854, max 0.953 | Health sub-index | SHDI health dimension index (higher = better health). | 0-1 | data_long | Global Data Lab |
| healthindex2013 | continuous | min 0.666, median 0.843, max 0.939 | Health sub-index, 2013 | SHDI health dimension index for 2013. | 0-1 | data | Global Data Lab |
| healthindex2019 | continuous | min 0.694, median 0.863, max 0.953 | Health sub-index, 2019 | SHDI health dimension index for 2019. | 0-1 | data | Global Data Lab |
| incindex2013 | continuous | min 0.443, median 0.75, max 0.851 | Income sub-index, 2013 | SHDI income dimension index for 2013. | 0-1 | data | Global Data Lab |
| incindex2019 | continuous | min 0.45, median 0.722, max 0.844 | Income sub-index, 2019 | SHDI income dimension index for 2019. | 0-1 | data | Global Data Lab |
| income | continuous | min 0.443, median 0.736, max 0.851 | Income sub-index | SHDI income dimension index (higher = higher income). | 0-1 | data_long | Global Data Lab |
| lgnic2013 | continuous | min 7.54, median 9.57, max 10.2 | Log GNI per capita, 2013 | Natural log of GNI per capita (2013); used to build the income sub-index. | log 2011 PPP US$ | data | Global Data Lab |
| lgnic2019 | continuous | min 7.59, median 9.38, max 10.2 | Log GNI per capita, 2019 | Natural log of GNI per capita (2019); used to build the income sub-index. | log 2011 PPP US$ | data | Global Data Lab |
| lifexp2013 | continuous | min 63.3, median 74.8, max 81 | Life expectancy at birth, 2013 | Life expectancy at birth in years (2013); basis of the health sub-index. | years | data | Global Data Lab |
| lifexp2019 | continuous | min 65.1, median 76.1, max 82 | Life expectancy at birth, 2019 | Life expectancy at birth in years (2019); basis of the health sub-index. | years | data | Global Data Lab |
| msch2013 | continuous | min 3.52, median 7.91, max 12.8 | Mean years of schooling, 2013 | Mean years of schooling of the adult population (2013). | years | data | Global Data Lab |
| msch2019 | continuous | min 4.24, median 8.51, max 13.4 | Mean years of schooling, 2019 | Mean years of schooling of the adult population (2019). | years | data | Global Data Lab |
| period | identifier | | Period (year tag) | Time period of the observation in the long panel (Y2013 or Y2019). | Y2013 / Y2019 | data_long | Derived (reshape) |
| pop | continuous | min 8.6, median 1.23e+03, max 4.56e+04 | Population (thousands) | Regional population for the period. | thousands of persons | data_long | Global Data Lab |
| pop2013 | continuous | min 9.99, median 1.21e+03, max 4.34e+04 | Population (thousands), 2013 | Regional population in 2013. | thousands of persons | data | Global Data Lab |
| pop2019 | continuous | min 8.63, median 1.27e+03, max 4.56e+04 | Population (thousands), 2019 | Regional population in 2019. | thousands of persons | data | Global Data Lab |
| region | identifier | | Region name | Name of the sub-national region. | string | data, data_long | Global Data Lab |
| region_country | identifier | | Region + country display label | Human-readable label combining a shortened region name with a 3-letter country code. | string | data_long | Derived (this study) |
| shdi2013 | continuous | min 0.554, median 0.743, max 0.878 | Official Subnational HDI, 2013 | Composite Subnational HDI for 2013. | 0-1 | data | Global Data Lab |
| shdi2019 | continuous | min 0.558, median 0.744, max 0.883 | Official Subnational HDI, 2019 | Composite Subnational HDI for 2019. | 0-1 | data | Global Data Lab |
| shdi_official | continuous | min 0.554, median 0.744, max 0.883 | Official Subnational HDI | Global Data Lab's published composite Subnational HDI (validation benchmark). | 0-1 | data_long | Global Data Lab |

Cross-file variable index

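GDLcode, region, and country appear in both files. SubContinent and the 20 year-suffixed columns (shdi, healthindex, incindex, edindex, esch, msch, lifexp, gnic, lgnic, and pop, each for 2013 and 2019) appear only in data. period, education, health, income, shdi_official, pop, and region_country appear only in data_long.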

Construction & formulas

The composite index is built with pooled PCA on the three SHDI sub-indices (education, health, income), stacking both periods so a single set of parameters applies to all 306 region-year observations.
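
As a rough illustration of the pooled approach described above, the sketch below stacks two periods, standardizes with pooled means and standard deviations, takes the leading eigenvector of the pooled correlation matrix as weights, and rescales PC1 with pooled min-max bounds. It is not the tutorial's code: the synthetic arrays, the function name `first_component`, the population standard deviation (`ddof=0`), the sign convention, and the min-max rescaling are all assumptions made for the example.

```python
import numpy as np


def first_component(X):
    """Return PC1 scores, unit-norm weights, and PC1's variance share."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)      # one set of parameters for X
    Z = (X - mu) / sigma
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    w = eigvecs[:, -1]                             # eigh sorts eigenvalues ascending
    if w.sum() < 0:                                # assumed sign: higher = more developed
        w = -w
    return Z @ w, w, eigvals[-1] / eigvals.sum()


# Synthetic stand-in for (education, health, income) in two periods.
rng = np.random.default_rng(0)
n = 50
common = rng.normal(size=(n, 1))
x2013 = 0.70 + 0.05 * common + 0.02 * rng.normal(size=(n, 3))
x2019 = x2013 + 0.02                               # every region improves a little

# Pooled PCA: fit once on the stacked rows, then split by period.
stacked = np.vstack([x2013, x2019])
pc1, w, share = first_component(stacked)
index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())   # pooled min-max bounds
pooled_shift = pc1[n:].mean() - pc1[:n].mean()

# Per-period PCA: each period is centered on its own mean.
pc1_a, _, _ = first_component(x2013)
pc1_b, _, _ = first_component(x2019)
per_period_shift = pc1_b.mean() - pc1_a.mean()

print("weights:", w.round(4), "variance share:", round(float(share), 4))
print("pooled shift:", pooled_shift, "per-period shift:", per_period_shift)
```

On this toy input the pooled PC1 means differ between periods, whereas each per-period PC1 is centered on zero, so a per-period comparison cannot register a common shift.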

Wide vs long. data.csv stores each region once with year-suffixed columns; data_long.csv melts it so each (region, period) pair is a row, rounding the sub-indices/SHDI to 4 decimals and population to 1 decimal and adding a region_country display label (shortened region name + 3-letter country code). The official SHDI uses a geometric mean of its three components, SHDI = (Education × Health × Income)^(1/3), and serves as the external validation benchmark.
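
The geometric-mean formula can be used as a quick consistency check on the long panel. The sketch below is an assumption-laden example rather than part of the post: the function names, the tolerance of 0.001 (chosen loosely because the published columns are rounded), and the two example rows are hypothetical.

```python
import math


def shdi_from_subindices(education, health, income):
    """Geometric mean of the three dimension indices."""
    return (education * health * income) ** (1 / 3)


def flag_mismatches(rows, tol=1e-3):
    """Return (GDLcode, period, recomputed) for rows whose published SHDI
    differs from the recomputed geometric mean by more than tol."""
    flagged = []
    for row in rows:
        recomputed = shdi_from_subindices(
            row["education"], row["health"], row["income"]
        )
        if not math.isclose(recomputed, row["shdi_official"], abs_tol=tol):
            flagged.append((row["GDLcode"], row["period"], recomputed))
    return flagged


# Hypothetical rows: the first is consistent, the second is deliberately off.
rows = [
    {"GDLcode": "XXXr101", "period": "Y2013",
     "education": 0.5, "health": 0.5, "income": 0.5, "shdi_official": 0.5},
    {"GDLcode": "XXXr102", "period": "Y2013",
     "education": 0.8, "health": 0.9, "income": 0.7, "shdi_official": 0.9},
]
flagged = flag_mismatches(rows)
print(flagged)
```

With the real file, `rows` could come from `csv.DictReader` over data_long.csv after converting the numeric fields to `float`.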

The datasets

Each dataset below has a full variable dictionary followed by a statistics table with distribution summaries and data coverage.


data · region (one row per region) · 153 × 24 · 2013 & 2019 (year-suffixed columns) · 153 sub-national regions, 12 South American countries

Panel key: GDLcode · Wide source table; melted into the long panel that pooled PCA consumes.

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GDLcode | identifier | Global Data Lab region code | Unique identifier of the sub-national region in the Global Data Lab database. | Assigned by the Global Data Lab (e.g. ARGr101 = a region of Argentina). | code | Global Data Lab | all rows |
| region | identifier | Region name | Name of the sub-national region. | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| country | identifier | Country name | Country containing the region (12 South American countries). | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| SubContinent | identifier | Sub-continent | Geographic sub-continent of the region (constant: South America). | From the Global Data Lab SHDI database; this subset is South America only. | string | Global Data Lab | data.csv |
| shdi2013 | continuous | Official Subnational HDI, 2013 | Composite Subnational HDI for 2013. | Geometric mean of the 2013 education, health, and income sub-indices. | 0-1 | Global Data Lab | data.csv |
| shdi2019 | continuous | Official Subnational HDI, 2019 | Composite Subnational HDI for 2019. | Geometric mean of the 2019 education, health, and income sub-indices. | 0-1 | Global Data Lab | data.csv |
| healthindex2013 | continuous | Health sub-index, 2013 | SHDI health dimension index for 2013. | Derived from regional life expectancy (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| healthindex2019 | continuous | Health sub-index, 2019 | SHDI health dimension index for 2019. | Derived from regional life expectancy (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| incindex2013 | continuous | Income sub-index, 2013 | SHDI income dimension index for 2013. | Derived from log GNI per capita (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| incindex2019 | continuous | Income sub-index, 2019 | SHDI income dimension index for 2019. | Derived from log GNI per capita (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| edindex2013 | continuous | Education sub-index, 2013 | SHDI education dimension index for 2013. | Derived from expected and mean years of schooling (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| edindex2019 | continuous | Education sub-index, 2019 | SHDI education dimension index for 2019. | Derived from expected and mean years of schooling (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| esch2013 | continuous | Expected years of schooling, 2013 | Expected years of schooling for a school-entry-age child (2013). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| esch2019 | continuous | Expected years of schooling, 2019 | Expected years of schooling for a school-entry-age child (2019). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| msch2013 | continuous | Mean years of schooling, 2013 | Mean years of schooling of the adult population (2013). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| msch2019 | continuous | Mean years of schooling, 2019 | Mean years of schooling of the adult population (2019). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| lifexp2013 | continuous | Life expectancy at birth, 2013 | Life expectancy at birth in years (2013); basis of the health sub-index. | Global Data Lab health component. | years | Global Data Lab | data.csv |
| lifexp2019 | continuous | Life expectancy at birth, 2019 | Life expectancy at birth in years (2019); basis of the health sub-index. | Global Data Lab health component. | years | Global Data Lab | data.csv |
| gnic2013 | continuous | GNI per capita, 2013 | Gross National Income per capita (2013); basis of the income sub-index. | Global Data Lab income component, in 2011 PPP US$. | 2011 PPP US$ | Global Data Lab | data.csv |
| gnic2019 | continuous | GNI per capita, 2019 | Gross National Income per capita (2019); basis of the income sub-index. | Global Data Lab income component, in 2011 PPP US$. | 2011 PPP US$ | Global Data Lab | data.csv |
| lgnic2013 | continuous | Log GNI per capita, 2013 | Natural log of GNI per capita (2013); used to build the income sub-index. | log(gnic2013). | log 2011 PPP US$ | Global Data Lab | data.csv |
| lgnic2019 | continuous | Log GNI per capita, 2019 | Natural log of GNI per capita (2019); used to build the income sub-index. | log(gnic2019). | log 2011 PPP US$ | Global Data Lab | data.csv |
| pop2013 | continuous | Population (thousands), 2013 | Regional population in 2013. | Global Data Lab population estimate. | thousands of persons | Global Data Lab | data.csv |
| pop2019 | continuous | Population (thousands), 2019 | Regional population in 2019. | Global Data Lab population estimate. | thousands of persons | Global Data Lab | data.csv |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDLcode | | 100% | 153 | 153 | | | | | |
| region | | 100% | 153 | 150 | | | | | |
| country | | 100% | 153 | 12 | | | | | |
| SubContinent | | 100% | 153 | 1 | | | | | |
| shdi2013 | min 0.554, median 0.743, max 0.878 | 100% | 153 | 114 | 0.554 | 0.742 | 0.743 | 0.878 | 0.059 |
| shdi2019 | min 0.558, median 0.744, max 0.883 | 100% | 153 | 107 | 0.558 | 0.748 | 0.744 | 0.883 | 0.061 |
| healthindex2013 | min 0.666, median 0.843, max 0.939 | 100% | 153 | 91 | 0.666 | 0.837 | 0.843 | 0.939 | 0.045 |
| healthindex2019 | min 0.694, median 0.863, max 0.953 | 100% | 153 | 93 | 0.694 | 0.850 | 0.863 | 0.953 | 0.049 |
| incindex2013 | min 0.443, median 0.75, max 0.851 | 100% | 153 | 109 | 0.443 | 0.735 | 0.750 | 0.851 | 0.074 |
| incindex2019 | min 0.45, median 0.722, max 0.844 | 100% | 153 | 117 | 0.450 | 0.715 | 0.722 | 0.844 | 0.075 |
| edindex2013 | min 0.382, median 0.655, max 0.926 | 100% | 153 | 114 | 0.382 | 0.667 | 0.655 | 0.926 | 0.080 |
| edindex2019 | min 0.446, median 0.679, max 0.946 | 100% | 153 | 115 | 0.446 | 0.690 | 0.679 | 0.946 | 0.082 |
| esch2013 | min 9.52, median 14, max 18 | 100% | 153 | 150 | 9.52 | 14.18 | 14.05 | 18.00 | 1.60 |
| esch2019 | min 10.3, median 14.2, max 18 | 100% | 153 | 147 | 10.27 | 14.35 | 14.22 | 18.00 | 1.78 |
| msch2013 | min 3.52, median 7.91, max 12.8 | 100% | 153 | 148 | 3.52 | 8.21 | 7.91 | 12.77 | 1.52 |
| msch2019 | min 4.24, median 8.51, max 13.4 | 100% | 153 | 147 | 4.24 | 8.74 | 8.51 | 13.38 | 1.55 |
| lifexp2013 | min 63.3, median 74.8, max 81 | 100% | 153 | 134 | 63.28 | 74.41 | 74.79 | 81.04 | 2.94 |
| lifexp2019 | min 65.1, median 76.1, max 82 | 100% | 153 | 135 | 65.11 | 75.28 | 76.07 | 81.96 | 3.15 |
| gnic2013 | min 1.88e+03, median 1.43e+04, max 2.79e+04 | 100% | 153 | 153 | 1,880.5 | 14,430 | 14,297 | 27,897 | 5,941.6 |
| gnic2019 | min 1.97e+03, median 1.19e+04, max 2.68e+04 | 100% | 153 | 153 | 1,971.3 | 12,722 | 11,870 | 26,778 | 5,678.9 |
| lgnic2013 | min 7.54, median 9.57, max 10.2 | 100% | 153 | 147 | 7.54 | 9.47 | 9.57 | 10.24 | 0.492 |
| lgnic2019 | min 7.59, median 9.38, max 10.2 | 100% | 153 | 151 | 7.59 | 9.34 | 9.38 | 10.20 | 0.494 |
| pop2013 | min 9.99, median 1.21e+03, max 4.34e+04 | 100% | 153 | 153 | 9.99 | 2,642.6 | 1,210.9 | 43,440 | 4,755.3 |
| pop2019 | min 8.63, median 1.27e+03, max 4.56e+04 | 100% | 153 | 153 | 8.63 | 2,791.2 | 1,265.6 | 45,604 | 5,012.5 |

data_long · region-period · 306 × 10 · 2013, 2019 (period = Y2013 / Y2019) · 306 rows = 153 regions × 2 periods

Panel key: GDLcode × period · Analysis dataset for pooled PCA: stacked region-year rows with the three SHDI sub-indices.
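
Because the statistics for this file list 153 distinct GDLcode values but only 150 distinct region names, the region name alone does not identify a row. A small check along the following lines can confirm that a chosen key is unique; the function name `duplicate_keys` and the example rows are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key=("GDLcode", "period")):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[k] for k in key) for row in rows)
    return [k for k, n in counts.items() if n > 1]


# Hypothetical rows: two different regions that share a name.
rows = [
    {"GDLcode": "XXXr101", "region": "Region A", "period": "Y2013"},
    {"GDLcode": "XXXr101", "region": "Region A", "period": "Y2019"},
    {"GDLcode": "YYYr205", "region": "Region A", "period": "Y2013"},
    {"GDLcode": "YYYr205", "region": "Region A", "period": "Y2019"},
]

by_code = duplicate_keys(rows)                               # documented panel key
by_name = duplicate_keys(rows, key=("region", "period"))     # name-based key
print("duplicates by GDLcode x period:", by_code)
print("duplicates by region x period:", by_name)
```

Joining or indexing on GDLcode and period, rather than on region name, avoids silently merging regions that share a name.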

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GDLcode | identifier | Global Data Lab region code | Unique identifier of the sub-national region in the Global Data Lab database. | Assigned by the Global Data Lab (e.g. ARGr101 = a region of Argentina). | code | Global Data Lab | all rows |
| region | identifier | Region name | Name of the sub-national region. | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| country | identifier | Country name | Country containing the region (12 South American countries). | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| period | identifier | Period (year tag) | Time period of the observation in the long panel (Y2013 or Y2019). | Created when reshaping wide to long; one row per region per period. | Y2013 / Y2019 | Derived (reshape) | data_long.csv |
| education | continuous | Education sub-index | SHDI education dimension index (higher = more education). | Global Data Lab education index from expected & mean years of schooling; long-panel copy of edindex2013/edindex2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| health | continuous | Health sub-index | SHDI health dimension index (higher = better health). | Global Data Lab health index from life expectancy; long-panel copy of healthindex2013/healthindex2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| income | continuous | Income sub-index | SHDI income dimension index (higher = higher income). | Global Data Lab income index from log GNI per capita; long-panel copy of incindex2013/incindex2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| shdi_official | continuous | Official Subnational HDI | Global Data Lab's published composite Subnational HDI (validation benchmark). | Geometric mean of the three sub-indices: (Education × Health × Income)^(1/3); long-panel copy of shdi2013/shdi2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| pop | continuous | Population (thousands) | Regional population for the period. | Global Data Lab population; long-panel copy of pop2013/pop2019, rounded to 1 dp. | thousands of persons | Global Data Lab | data_long.csv |
| region_country | identifier | Region + country display label | Human-readable label combining a shortened region name with a 3-letter country code. | First 25 chars of region name + uppercase country abbreviation, e.g. 'City of Buenos Aires (ARG)'. | string | Derived (this study) | data_long.csv |
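
The region_country construction in the last row can be approximated as below. This is a guess at the recipe, not the study's code: the dictionary does not say where the 3-letter abbreviation comes from, so the function takes it as an argument, and the function name, the trailing-space trim, and the long example name are assumptions.

```python
def region_country_label(region, country_code, max_chars=25):
    """First max_chars characters of the region name plus '(CODE)'."""
    short = region[:max_chars].rstrip()   # trimming trailing spaces is an assumption
    return f"{short} ({country_code.upper()})"


label = region_country_label("City of Buenos Aires", "arg")
# Hypothetical long name, to show the 25-character cut.
long_label = region_country_label("A Hypothetical Region With A Very Long Name", "xxx")
print(label)
print(long_label)
```

One plausible source for the code is the first three characters of GDLcode (for example, the `ARG` prefix in ARGr101), but that mapping is not confirmed by this dictionary.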

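Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDLcode | | 100% | 306 | 153 | | | | | |
| region | | 100% | 306 | 150 | | | | | |
| country | | 100% | 306 | 12 | | | | | |
| period | | 100% | 306 | 2 | | | | | |
| education | min 0.382, median 0.671, max 0.946 | 100% | 306 | 174 | 0.382 | 0.679 | 0.671 | 0.946 | 0.082 |
| health | min 0.666, median 0.854, max 0.953 | 100% | 306 | 139 | 0.666 | 0.844 | 0.854 | 0.953 | 0.047 |
| income | min 0.443, median 0.736, max 0.851 | 100% | 306 | 178 | 0.443 | 0.725 | 0.736 | 0.851 | 0.075 |
| shdi_official | min 0.554, median 0.744, max 0.883 | 100% | 306 | 167 | 0.554 | 0.745 | 0.744 | 0.883 | 0.060 |
| pop | min 8.6, median 1.23e+03, max 4.56e+04 | 100% | 306 | 305 | 8.60 | 2,716.9 | 1,234.5 | 45,604 | 4,878.1 |
| region_country | | 100% | 306 | 153 | | | | | |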

Known limitations & caveats