Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| data | region (one row per region) | 153 × 24 | data.dta | data.csv |
| data_long | region-period | 306 × 10 | data_long.dta | data_long.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data/"
use "${BASE}data.dta", clear
describe
notes
Python
!pip install -q pyreadstat  # notebook shell command (Colab/Jupyter); in a terminal run: pip install pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data/"
df = pd.read_stata(BASE + "data.dta")
# load every dataset at once
files = ["data", "data_long"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "data.dta", "data.dta")
df, meta = pyreadstat.read_dta("data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data/"
df <- read_dta(paste0(BASE, "data.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that builds a temporally comparable composite Human Development Index using pooled PCA and contrasts it with the naive per-period approach. The data are a South American subset of the Subnational Human Development Index from the Global Data Lab (Smits & Permanyer, 2019), with Education, Health, and Income sub-indices (plus the official SHDI and its underlying components) for 153 sub-national regions across 12 South American countries in 2013 and 2019. Pooled PCA stacks both periods and computes a single set of standardization parameters, eigenvector weights, and normalization bounds from the combined data, so a region's 2013 score is measured against the same yardstick as its 2019 score. The pooled PC1 captures 72.42% of variance with weights [0.5642, 0.5448, 0.6204] and registers a net development shift of +0.1439 that per-period PCA forces to zero by construction.
data.csv is the wide source — one row per region with year-suffixed columns (*2013 / *2019) for the SHDI, its three sub-indices, and the underlying HDI components. data_long.csv is the reshaped long panel the pooled PCA actually consumes — one row per region × period (153 × 2 = 306 rows), keeping only the three sub-indices used by the index plus the official SHDI, population, and a display label. The post melts data.csv into data_long.csv; both ship here so the reshape is fully reproducible.
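The wide-to-long reshape described above can be sketched as follows. This is a minimal illustration, not the post's own code: the two regions and their values are made up, only the three sub-index columns are included, and it stacks one slice per year with `pd.concat` instead of calling `pd.melt`. The column names and the `Y2013` / `Y2019` period tags follow the variable dictionary.

```python
import pandas as pd

# Tiny stand-in for data.csv: two made-up regions, illustrative values only.
wide = pd.DataFrame({
    "GDLcode": ["XXXr101", "XXXr102"],
    "region": ["Region A", "Region B"],
    "country": ["Country X", "Country X"],
    "edindex2013": [0.60, 0.70], "edindex2019": [0.65, 0.72],
    "healthindex2013": [0.80, 0.85], "healthindex2019": [0.82, 0.86],
    "incindex2013": [0.70, 0.75], "incindex2019": [0.68, 0.74],
})

# Wide column stub -> long column name, as listed in the variable dictionary.
stubs = {"edindex": "education", "healthindex": "health", "incindex": "income"}
ids = ["GDLcode", "region", "country"]

frames = []
for year in (2013, 2019):
    # Select this year's columns and strip the year suffix by renaming.
    cols = {f"{stub}{year}": name for stub, name in stubs.items()}
    part = wide[ids + list(cols)].rename(columns=cols)
    part.insert(len(ids), "period", f"Y{year}")
    frames.append(part)

# One row per region x period.
long = pd.concat(frames, ignore_index=True)
print(long)
```

With the real file, the same loop would yield 153 × 2 = 306 rows once the remaining columns (official SHDI, population, display label) are added.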
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Global Data Lab — Subnational HDI | All values: SHDI, sub-indices (education/health/income), and HDI components by region-year | Global Data Lab. Subnational Human Development Index (SHDI). https://globaldatalab.org/shdi/ |
| Smits & Permanyer (2019) | Source database construction and methodology for the SHDI | Smits, J., & Permanyer, I. (2019). The Subnational Human Development Database. Scientific Data, 6, 190038. https://doi.org/10.1038/sdata.2019.38 |
| Method references | PCA / composite-index methods | Jolliffe & Cadima (2016), Phil. Trans. R. Soc. A 374(2065); Peiro-Palomino, Picazo-Tadeo & Rios (2023), Oxford Economic Papers 75(2); UNDP (2024), Human Development Index Technical Notes. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Pooled PCA for Building Development Indicators Across Time [Data set]. https://carlos-mendez.org/post/python_pca2/
Smits, J., & Permanyer, I. (2019). The Subnational Human Development Database. Scientific Data, 6, 190038. https://doi.org/10.1038/sdata.2019.38
BibTeX
@misc{mendez2026pythonpca2,
author = {Mendez, Carlos},
title = {Pooled PCA for Building Development Indicators Across Time},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_pca2/}},
note = {Data set}
}
@article{smits2019subnational,
author = {Smits, Jeroen and Permanyer, Iñaki},
title = {The Subnational Human Development Database},
journal = {Scientific Data},
volume = {6}, pages = {190038}, year = {2019},
doi = {10.1038/sdata.2019.38}
}
Variable explorer
All 31 variables across both files.
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| GDLcode | identifier | Global Data Lab region code | Unique identifier of the sub-national region in the Global Data Lab database. | code | data, data_long | Global Data Lab |
| SubContinent | identifier | Sub-continent | Geographic sub-continent of the region (constant: South America). | string | data | Global Data Lab |
| country | identifier | Country name | Country containing the region (12 South American countries). | string | data, data_long | Global Data Lab |
| edindex2013 | continuous | Education sub-index, 2013 | SHDI education dimension index for 2013. | 0-1 | data | Global Data Lab |
| edindex2019 | continuous | Education sub-index, 2019 | SHDI education dimension index for 2019. | 0-1 | data | Global Data Lab |
| education | continuous | Education sub-index | SHDI education dimension index (higher = more education). | 0-1 | data_long | Global Data Lab |
| esch2013 | continuous | Expected years of schooling, 2013 | Expected years of schooling for a school-entry-age child (2013). | years | data | Global Data Lab |
| esch2019 | continuous | Expected years of schooling, 2019 | Expected years of schooling for a school-entry-age child (2019). | years | data | Global Data Lab |
| gnic2013 | continuous | GNI per capita, 2013 | Gross National Income per capita (2013); basis of the income sub-index. | 2011 PPP US$ | data | Global Data Lab |
| gnic2019 | continuous | GNI per capita, 2019 | Gross National Income per capita (2019); basis of the income sub-index. | 2011 PPP US$ | data | Global Data Lab |
| health | continuous | Health sub-index | SHDI health dimension index (higher = better health). | 0-1 | data_long | Global Data Lab |
| healthindex2013 | continuous | Health sub-index, 2013 | SHDI health dimension index for 2013. | 0-1 | data | Global Data Lab |
| healthindex2019 | continuous | Health sub-index, 2019 | SHDI health dimension index for 2019. | 0-1 | data | Global Data Lab |
| incindex2013 | continuous | Income sub-index, 2013 | SHDI income dimension index for 2013. | 0-1 | data | Global Data Lab |
| incindex2019 | continuous | Income sub-index, 2019 | SHDI income dimension index for 2019. | 0-1 | data | Global Data Lab |
| income | continuous | Income sub-index | SHDI income dimension index (higher = higher income). | 0-1 | data_long | Global Data Lab |
| lgnic2013 | continuous | Log GNI per capita, 2013 | Natural log of GNI per capita (2013); used to build the income sub-index. | log 2011 PPP US$ | data | Global Data Lab |
| lgnic2019 | continuous | Log GNI per capita, 2019 | Natural log of GNI per capita (2019); used to build the income sub-index. | log 2011 PPP US$ | data | Global Data Lab |
| lifexp2013 | continuous | Life expectancy at birth, 2013 | Life expectancy at birth in years (2013); basis of the health sub-index. | years | data | Global Data Lab |
| lifexp2019 | continuous | Life expectancy at birth, 2019 | Life expectancy at birth in years (2019); basis of the health sub-index. | years | data | Global Data Lab |
| msch2013 | continuous | Mean years of schooling, 2013 | Mean years of schooling of the adult population (2013). | years | data | Global Data Lab |
| msch2019 | continuous | Mean years of schooling, 2019 | Mean years of schooling of the adult population (2019). | years | data | Global Data Lab |
| period | identifier | Period (year tag) | Time period of the observation in the long panel (Y2013 or Y2019). | Y2013 / Y2019 | data_long | Derived (reshape) |
| pop | continuous | Population (thousands) | Regional population for the period. | thousands of persons | data_long | Global Data Lab |
| pop2013 | continuous | Population (thousands), 2013 | Regional population in 2013. | thousands of persons | data | Global Data Lab |
| pop2019 | continuous | Population (thousands), 2019 | Regional population in 2019. | thousands of persons | data | Global Data Lab |
| region | identifier | Region name | Name of the sub-national region. | string | data, data_long | Global Data Lab |
| region_country | identifier | Region + country display label | Human-readable label combining a shortened region name with a 3-letter country code. | string | data_long | Derived (this study) |
| shdi2013 | continuous | Official Subnational HDI, 2013 | Composite Subnational HDI for 2013. | 0-1 | data | Global Data Lab |
| shdi2019 | continuous | Official Subnational HDI, 2019 | Composite Subnational HDI for 2019. | 0-1 | data | Global Data Lab |
| shdi_official | continuous | Official Subnational HDI | Global Data Lab's published composite Subnational HDI (validation benchmark). | 0-1 | data_long | Global Data Lab |
Construction & formulas
The composite index is built with pooled PCA on the three SHDI sub-indices
(education, health, income), stacking both periods so a single
set of parameters applies to all 306 region-year observations.
- Pooled standardization (z-score): $Z = (X - \mu_{\text{pool}}) / \sigma_{\text{pool}}$, where the mean and SD are computed across all 306 rows (both years) rather than period by period, giving a fixed yardstick. Pooled means [0.6786, 0.8437, 0.7254], SDs [0.0814, 0.0472, 0.0749].
- Pooled covariance & eigen-decomposition: $\Sigma = \frac{1}{nT} Z^{\top} Z$, then $\Sigma v_k = \lambda_k v_k$. Eigenvalues [2.1726, 0.5631, 0.2643]; PC1 eigenvector (weights) [0.5642, 0.5448, 0.6204].
- Variance explained: $\lambda_k / \sum_j \lambda_j$, giving PC1 = 72.42%, PC2 = 18.77%, PC3 = 8.81%.
- PC1 score (composite index): $\mathrm{PC1}_i = w_1 Z_{\text{edu}} + w_2 Z_{\text{health}} + w_3 Z_{\text{income}}$, the projection of each region-year onto the first eigenvector.
- Pooled normalization (0–1): $\mathrm{HDI}_i = (\mathrm{PC1}_i - \mathrm{PC1}_{\min}) / (\mathrm{PC1}_{\max} - \mathrm{PC1}_{\min})$, using the min/max across all 306 observations (a common scale across periods).
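The five steps above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the tutorial's implementation: the input is a small synthetic matrix rather than the real 306-row panel, the SD uses `ddof=0` (chosen because the eigenvalues listed above sum to 3, the number of indicators), and the sign flip on the first eigenvector is a convention added here because eigenvector signs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the stacked panel: 12 rows (e.g. 6 regions x 2 periods),
# 3 columns standing for education, health, income. Values are made up.
common = rng.normal(size=(12, 1))
X = 0.7 + 0.05 * common + 0.02 * rng.normal(size=(12, 3))

# 1. Pooled standardization: one mean and SD per column over ALL rows.
mu = X.mean(axis=0)
sigma = X.std(axis=0)            # population SD (ddof=0), an assumption
Z = (X - mu) / sigma

# 2. Pooled covariance of Z and its eigen-decomposition.
S = Z.T @ Z / len(Z)
eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Share of total variance carried by each component.
explained = eigvals / eigvals.sum()

# 4. PC1 score: projection onto the first eigenvector (weights w).
w = eigvecs[:, 0]
if w.sum() < 0:                  # fix the arbitrary sign so higher = more developed
    w = -w
pc1 = Z @ w

# 5. Pooled min-max normalization over all rows (both periods together).
index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

print("weights:", np.round(w, 4))
print("variance explained:", np.round(explained, 4))
```

Because `mu`, `sigma`, `w`, and the min/max are each computed once on the stacked rows, a row from the first period and a row from the second period are scored on the same scale.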
Wide vs long. data.csv stores each region once with year-suffixed
columns; data_long.csv melts it so each (region, period) pair is a row, rounding the
sub-indices/SHDI to 4 decimals and population to 1 decimal and adding a region_country
display label (shortened region name + 3-letter country code). The official SHDI uses a geometric
mean of its three components, SHDI = (Education × Health × Income)^(1/3),
and serves as the external validation benchmark.
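As a quick worked example of the geometric-mean formula above, the helper below is a hypothetical function (its name is not from the source) applied to made-up sub-index values.

```python
def shdi_geometric_mean(education, health, income):
    """Geometric mean of three sub-indices, each on a 0-1 scale."""
    return (education * health * income) ** (1 / 3)

# Hypothetical sub-index values for one region-period.
value = shdi_geometric_mean(0.68, 0.84, 0.73)
print(round(value, 4))
```

A geometric mean never exceeds the arithmetic mean of the same three values, so a weak dimension pulls the composite down more than it would under simple averaging.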
The datasets
data (wide: one row per region, 153 rows × 24 columns)
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| GDLcode | identifier | Global Data Lab region code | Unique identifier of the sub-national region in the Global Data Lab database. | Assigned by the Global Data Lab (e.g. ARGr101 = a region of Argentina). | code | Global Data Lab | all rows |
| region | identifier | Region name | Name of the sub-national region. | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| country | identifier | Country name | Country containing the region (12 South American countries). | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| SubContinent | identifier | Sub-continent | Geographic sub-continent of the region (constant: South America). | From the Global Data Lab SHDI database; this subset is South America only. | string | Global Data Lab | data.csv |
| shdi2013 | continuous | Official Subnational HDI, 2013 | Composite Subnational HDI for 2013. | Geometric mean of the 2013 education, health, and income sub-indices. | 0-1 | Global Data Lab | data.csv |
| shdi2019 | continuous | Official Subnational HDI, 2019 | Composite Subnational HDI for 2019. | Geometric mean of the 2019 education, health, and income sub-indices. | 0-1 | Global Data Lab | data.csv |
| healthindex2013 | continuous | Health sub-index, 2013 | SHDI health dimension index for 2013. | Derived from regional life expectancy (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| healthindex2019 | continuous | Health sub-index, 2019 | SHDI health dimension index for 2019. | Derived from regional life expectancy (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| incindex2013 | continuous | Income sub-index, 2013 | SHDI income dimension index for 2013. | Derived from log GNI per capita (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| incindex2019 | continuous | Income sub-index, 2019 | SHDI income dimension index for 2019. | Derived from log GNI per capita (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| edindex2013 | continuous | Education sub-index, 2013 | SHDI education dimension index for 2013. | Derived from expected and mean years of schooling (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| edindex2019 | continuous | Education sub-index, 2019 | SHDI education dimension index for 2019. | Derived from expected and mean years of schooling (Global Data Lab). | 0-1 | Global Data Lab | data.csv |
| esch2013 | continuous | Expected years of schooling, 2013 | Expected years of schooling for a school-entry-age child (2013). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| esch2019 | continuous | Expected years of schooling, 2019 | Expected years of schooling for a school-entry-age child (2019). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| msch2013 | continuous | Mean years of schooling, 2013 | Mean years of schooling of the adult population (2013). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| msch2019 | continuous | Mean years of schooling, 2019 | Mean years of schooling of the adult population (2019). | Global Data Lab education component. | years | Global Data Lab | data.csv |
| lifexp2013 | continuous | Life expectancy at birth, 2013 | Life expectancy at birth in years (2013); basis of the health sub-index. | Global Data Lab health component. | years | Global Data Lab | data.csv |
| lifexp2019 | continuous | Life expectancy at birth, 2019 | Life expectancy at birth in years (2019); basis of the health sub-index. | Global Data Lab health component. | years | Global Data Lab | data.csv |
| gnic2013 | continuous | GNI per capita, 2013 | Gross National Income per capita (2013); basis of the income sub-index. | Global Data Lab income component, in 2011 PPP US$. | 2011 PPP US$ | Global Data Lab | data.csv |
| gnic2019 | continuous | GNI per capita, 2019 | Gross National Income per capita (2019); basis of the income sub-index. | Global Data Lab income component, in 2011 PPP US$. | 2011 PPP US$ | Global Data Lab | data.csv |
| lgnic2013 | continuous | Log GNI per capita, 2013 | Natural log of GNI per capita (2013); used to build the income sub-index. | log(gnic2013). | log 2011 PPP US$ | Global Data Lab | data.csv |
| lgnic2019 | continuous | Log GNI per capita, 2019 | Natural log of GNI per capita (2019); used to build the income sub-index. | log(gnic2019). | log 2011 PPP US$ | Global Data Lab | data.csv |
| pop2013 | continuous | Population (thousands), 2013 | Regional population in 2013. | Global Data Lab population estimate. | thousands of persons | Global Data Lab | data.csv |
| pop2019 | continuous | Population (thousands), 2019 | Regional population in 2019. | Global Data Lab population estimate. | thousands of persons | Global Data Lab | data.csv |
Summary statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| GDLcode | 100% | 153 | 153 | n/a | n/a | n/a | n/a | n/a |
| region | 100% | 153 | 150 | n/a | n/a | n/a | n/a | n/a |
| country | 100% | 153 | 12 | n/a | n/a | n/a | n/a | n/a |
| SubContinent | 100% | 153 | 1 | n/a | n/a | n/a | n/a | n/a |
| shdi2013 | 100% | 153 | 114 | 0.554 | 0.742 | 0.743 | 0.878 | 0.059 |
| shdi2019 | 100% | 153 | 107 | 0.558 | 0.748 | 0.744 | 0.883 | 0.061 |
| healthindex2013 | 100% | 153 | 91 | 0.666 | 0.837 | 0.843 | 0.939 | 0.045 |
| healthindex2019 | 100% | 153 | 93 | 0.694 | 0.850 | 0.863 | 0.953 | 0.049 |
| incindex2013 | 100% | 153 | 109 | 0.443 | 0.735 | 0.750 | 0.851 | 0.074 |
| incindex2019 | 100% | 153 | 117 | 0.450 | 0.715 | 0.722 | 0.844 | 0.075 |
| edindex2013 | 100% | 153 | 114 | 0.382 | 0.667 | 0.655 | 0.926 | 0.080 |
| edindex2019 | 100% | 153 | 115 | 0.446 | 0.690 | 0.679 | 0.946 | 0.082 |
| esch2013 | 100% | 153 | 150 | 9.52 | 14.18 | 14.05 | 18.00 | 1.60 |
| esch2019 | 100% | 153 | 147 | 10.27 | 14.35 | 14.22 | 18.00 | 1.78 |
| msch2013 | 100% | 153 | 148 | 3.52 | 8.21 | 7.91 | 12.77 | 1.52 |
| msch2019 | 100% | 153 | 147 | 4.24 | 8.74 | 8.51 | 13.38 | 1.55 |
| lifexp2013 | 100% | 153 | 134 | 63.28 | 74.41 | 74.79 | 81.04 | 2.94 |
| lifexp2019 | 100% | 153 | 135 | 65.11 | 75.28 | 76.07 | 81.96 | 3.15 |
| gnic2013 | 100% | 153 | 153 | 1,880.5 | 14,430 | 14,297 | 27,897 | 5,941.6 |
| gnic2019 | 100% | 153 | 153 | 1,971.3 | 12,722 | 11,870 | 26,778 | 5,678.9 |
| lgnic2013 | 100% | 153 | 147 | 7.54 | 9.47 | 9.57 | 10.24 | 0.492 |
| lgnic2019 | 100% | 153 | 151 | 7.59 | 9.34 | 9.38 | 10.20 | 0.494 |
| pop2013 | 100% | 153 | 153 | 9.99 | 2,642.6 | 1,210.9 | 43,440 | 4,755.3 |
| pop2019 | 100% | 153 | 153 | 8.63 | 2,791.2 | 1,265.6 | 45,604 | 5,012.5 |
data_long (long: one row per region-period, 306 rows × 10 columns)
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| GDLcode | identifier | Global Data Lab region code | Unique identifier of the sub-national region in the Global Data Lab database. | Assigned by the Global Data Lab (e.g. ARGr101 = a region of Argentina). | code | Global Data Lab | all rows |
| region | identifier | Region name | Name of the sub-national region. | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| country | identifier | Country name | Country containing the region (12 South American countries). | From the Global Data Lab SHDI database. | string | Global Data Lab | all rows |
| period | identifier | Period (year tag) | Time period of the observation in the long panel (Y2013 or Y2019). | Created when reshaping wide to long; one row per region per period. | Y2013 / Y2019 | Derived (reshape) | data_long.csv |
| education | continuous | Education sub-index | SHDI education dimension index (higher = more education). | Global Data Lab education index from expected & mean years of schooling; long-panel copy of edindex2013/edindex2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| health | continuous | Health sub-index | SHDI health dimension index (higher = better health). | Global Data Lab health index from life expectancy; long-panel copy of healthindex2013/healthindex2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| income | continuous | Income sub-index | SHDI income dimension index (higher = higher income). | Global Data Lab income index from log GNI per capita; long-panel copy of incindex2013/incindex2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| shdi_official | continuous | Official Subnational HDI | Global Data Lab's published composite Subnational HDI (validation benchmark). | Geometric mean of the three sub-indices: (Education x Health x Income)^(1/3); long-panel copy of shdi2013/shdi2019, rounded to 4 dp. | 0-1 | Global Data Lab | data_long.csv |
| pop | continuous | Population (thousands) | Regional population for the period. | Global Data Lab population; long-panel copy of pop2013/pop2019, rounded to 1 dp. | thousands of persons | Global Data Lab | data_long.csv |
| region_country | identifier | Region + country display label | Human-readable label combining a shortened region name with a 3-letter country code. | First 25 chars of region name + uppercase country abbreviation, e.g. 'City of Buenos Aires (ARG)'. | string | Derived (this study) | data_long.csv |
Summary statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| GDLcode | 100% | 306 | 153 | n/a | n/a | n/a | n/a | n/a |
| region | 100% | 306 | 150 | n/a | n/a | n/a | n/a | n/a |
| country | 100% | 306 | 12 | n/a | n/a | n/a | n/a | n/a |
| period | 100% | 306 | 2 | n/a | n/a | n/a | n/a | n/a |
| education | 100% | 306 | 174 | 0.382 | 0.679 | 0.671 | 0.946 | 0.082 |
| health | 100% | 306 | 139 | 0.666 | 0.844 | 0.854 | 0.953 | 0.047 |
| income | 100% | 306 | 178 | 0.443 | 0.725 | 0.736 | 0.851 | 0.075 |
| shdi_official | 100% | 306 | 167 | 0.554 | 0.745 | 0.744 | 0.883 | 0.060 |
| pop | 100% | 306 | 305 | 8.60 | 2,716.9 | 1,234.5 | 45,604 | 4,878.1 |
| region_country | 100% | 306 | 153 | n/a | n/a | n/a | n/a | n/a |
Known limitations & caveats
- Real data. Values are the Global Data Lab's published Subnational HDI for South America (Smits & Permanyer, 2019); they are not simulated.
- Two periods only. 2013 and 2019 are the minimum for temporal analysis; more periods would strengthen the pooled estimates and allow testing the constant-correlation assumption.
- Sample-relative index. The PCA-based HDI is relative to this specific 153-region sample — adding or removing regions changes every score, and Min-Max normalization is sensitive to outliers (Potaro-Siparuni, Guyana anchors the bottom).
- South America only. Correlation structure and eigenvector weights are specific to this region; other world regions can produce different weights.
- Income decline is genuine. Average income fell 2013→2019 (driven by Venezuela's collapse, 0.782→0.630), which is why a fixed (pooled) yardstick is needed to compare scores across time.
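Why a fixed yardstick matters can be seen with a single indicator. The sketch below uses made-up values for four regions that all improve by the same amount; the function and variable names are illustrative and not taken from the tutorial. Standardizing each period against its own mean resets every period's average to zero, so the average change disappears, while pooled standardization keeps it.

```python
import numpy as np

# Made-up indicator for 4 regions; every region improves by 0.05.
x_2013 = np.array([0.50, 0.60, 0.70, 0.80])
x_2019 = x_2013 + 0.05

def zscore(x, mean, sd):
    """Standardize x against a given mean and SD."""
    return (x - mean) / sd

# Per-period: each year is measured against its own mean and SD.
z13_own = zscore(x_2013, x_2013.mean(), x_2013.std())
z19_own = zscore(x_2019, x_2019.mean(), x_2019.std())
shift_per_period = z19_own.mean() - z13_own.mean()

# Pooled: both years share one mean and SD from the stacked data.
stacked = np.concatenate([x_2013, x_2019])
z13_pool = zscore(x_2013, stacked.mean(), stacked.std())
z19_pool = zscore(x_2019, stacked.mean(), stacked.std())
shift_pooled = z19_pool.mean() - z13_pool.mean()

print("per-period mean shift:", round(shift_per_period, 6))
print("pooled mean shift:", round(shift_pooled, 6))
```

A PC1 score is a weighted sum of such z-scores, so the same logic carries over: with per-period standardization each period's mean score is zero, and any net shift between periods is removed.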