Data dictionary · Introduction to PCA for Building Development Indicators

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`health_data`	country (cross-section)	50 × 3	health_data.dta	health_data.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
use "${BASE}health_data.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
df = pd.read_stata(BASE + "health_data.dta")

# load every dataset at once
files = ["health_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "health_data.dta", "health_data.dta")
df, meta = pyreadstat.read_dta("health_data.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
df <- read_dta(paste0(BASE, "health_data.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that builds a composite Health Index from two correlated health indicators using Principal Component Analysis (PCA). The dataset is fully simulated: 50 countries are generated from a single latent health factor that drives life_exp (Life Expectancy) positively and infant_mort (Infant Mortality) negatively, so that PCA's recovery of the known single-factor structure can be verified. The post follows a six-step manual pipeline in NumPy/pandas — polarity adjustment, z-score standardization, the covariance matrix, eigen-decomposition, PC1 scoring, and Min-Max normalization — then replicates it with scikit-learn. Polarity adjustment flips the raw correlation from −0.9595 to +0.9595; eigen-decomposition yields eigenvalues 1.9595 and 0.0405 with equal weights of 0.7071, so PC1 explains 97.97% of total variance and the manual scores match scikit-learn to within 1.33×10⁻¹⁵. This file is the single raw input the whole pipeline is built on.
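As a quick consistency check on the figures quoted above, the sketch below recomputes the eigenvalues, the weights, and the PC1 variance share from the reported correlation alone. It relies only on the closed-form result for a 2×2 correlation matrix (eigenvalues 1 ± r); the variable names are illustrative and are not taken from the post's script.

```python
import math

r = 0.9595  # correlation after polarity adjustment, as reported above (rounded)

# For the 2x2 correlation matrix [[1, r], [r, 1]] the eigenvalues are 1 + r and 1 - r.
lambda_1 = 1 + r
lambda_2 = 1 - r

# The PC1 eigenvector has unit length and equal entries, so each weight is 1/sqrt(2).
weight = 1 / math.sqrt(2)

# Share of total variance captured by PC1.
share_pc1 = lambda_1 / (lambda_1 + lambda_2)

print(f"eigenvalues: {lambda_1:.4f}, {lambda_2:.4f}")
print(f"weight: {weight:.4f}")
print(f"PC1 share: {share_pc1:.4%}")
```

Because r is entered here to only four decimals, the share can differ from the reported 97.97% in the last digit; that small gap is presumably a rounding effect.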

One file. health_data is a cross-section — one row per country — with a country identifier and two raw health indicators. The composite Health Index is derived from these two columns and is not stored here (it is an output of the post's script, regenerated on every run).

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated via a single-latent-factor data-generating process (open & reproducible, seed = 42)	Mendez, C. (2026). See the post's Python script script.py (simulate_health_data) for the full DGP.
Method references	PCA and composite-index construction concepts	Jolliffe & Cadima (2016), Principal Component Analysis: A Review and Recent Developments; Pearson (1901); Hotelling (1933); OECD/JRC (2008) Handbook on Constructing Composite Indicators.

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Introduction to PCA Analysis for Building Development Indicators [Data set]. https://carlos-mendez.org/post/python_pca/

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.

BibTeX

@misc{mendez2026pythonpca,
  author       = {Mendez, Carlos},
  title        = {Introduction to PCA Analysis for Building Development Indicators},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_pca/}},
  note         = {Data set}
}

@article{jolliffe2016pca,
  author  = {Jolliffe, Ian T. and Cadima, Jorge},
  title   = {Principal component analysis: a review and recent developments},
  journal = {Philosophical Transactions of the Royal Society A},
  volume  = {374}, number = {2065}, pages = {20150202}, year = {2016}
}

Variable explorer

Variable	Type	Label	Definition	Units	In files	Source
`country`	identifier	Country identifier	Synthetic country label.	string	health_data	Simulation
`infant_mort`	continuous	Infant mortality (per 1,000 live births)	Deaths before age 1 per 1,000 live births; a negative health indicator (higher is worse), polarity-adjusted before PCA.	per 1,000 live births	health_data	Simulation
`life_exp`	continuous	Life expectancy (years)	Average life expectancy at birth; a positive health indicator (higher is better).	years	health_data	Simulation

Cross-file variable index

Which file each variable appears in (● = present).

Variable	health_data
`country`	●
`infant_mort`	●
`life_exp`	●

Construction & formulas

The dataset holds two raw indicators per country; the composite index is built from them in the six steps below, with the variance explained by PC1 reported as an additional diagnostic (all derived columns are outputs of the post's script, not stored in this file).

Polarity adjustment: flip "more is worse" indicators — IM* = -1 × infant_mort — so higher always means better health.
Standardization (z-score): Z_ij = (X_ij - X̄_j) / σ_j (population SD, ddof=0); each variable then has mean 0 and SD 1.
Covariance matrix: Σ = (1/n) ZᵀZ; for standardized data this is the 2×2 correlation matrix with 1s on the diagonal and r off it.
Eigen-decomposition: solve Σv = λv; for a 2×2 correlation matrix with r > 0 (as here, after polarity adjustment), λ₁ = 1 + r and λ₂ = 1 - r, with eigenvectors [1/√2, 1/√2] and [1/√2, -1/√2].
PC1 score: PC1_i = w₁·Z_i,LE + w₂·Z_i,IM, the projection of country i onto the first eigenvector, where Z_i,IM is the z-score of the polarity-adjusted infant mortality IM*.
Variance explained: λ_k / Σλ (PC1 = 97.97%).
Min-Max normalization: HI_i = (PC1_i - min) / (max - min), rescaling scores to the human-readable [0, 1] Health Index.
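The steps above can be sketched end to end in NumPy. This is a minimal illustration on five made-up countries, not the tutorial's dataset and not the post's script; all variable names and toy values are assumptions chosen for readability.

```python
import numpy as np

# Hypothetical toy data (NOT the tutorial's dataset): five countries.
life_exp = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
infant_mort = np.array([50.0, 42.0, 30.0, 18.0, 10.0])

# Step 1: polarity adjustment, so that higher always means better health.
infant_mort_adj = -1.0 * infant_mort

# Step 2: z-score standardization with the population SD (ddof=0).
X = np.column_stack([life_exp, infant_mort_adj])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# Step 3: covariance matrix of Z, which equals the correlation matrix.
n = Z.shape[0]
cov = (Z.T @ Z) / n
r = cov[0, 1]

# Step 4: eigen-decomposition; eigh returns eigenvalues in ascending order,
# so reorder to put the largest eigenvalue (PC1) first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvector signs are arbitrary; orient PC1 so its weights are positive.
w = eigvecs[:, 0]
if w.sum() < 0:
    w = -w

# Step 5: the PC1 score is the projection of each row of Z onto w.
pc1 = Z @ w

# Diagnostic: share of total variance explained by PC1.
explained = eigvals[0] / eigvals.sum()

# Step 6: Min-Max normalization to the [0, 1] Health Index.
health_index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

print("r =", round(r, 4))
print("eigenvalues =", np.round(eigvals, 4))
print("weights =", np.round(w, 4))
print("explained by PC1 =", round(explained, 4))
print("health index =", np.round(health_index, 3))
```

The sign flip after `eigh` matters in practice: an eigenvector and its negative are equally valid, so without a fixed orientation the index could come out reversed, with the healthiest country scoring 0.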

Synthetic data-generating process (seed = 42): for country i, base_health ~ Uniform(0, 1) is a latent health capacity; then life_exp = 55 + 30·base_health + N(0, 2) and infant_mort = 60 - 55·base_health + N(0, 3). Because both indicators load on the same latent factor with opposite signs, the raw correlation is a strong −0.96 — ideal for near-lossless PCA compression.
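A hedged sketch of this data-generating process follows. It is not the post's `simulate_health_data` function: the helper name `simulate_health_data_sketch` is hypothetical, and it assumes NumPy's `default_rng`, reads N(0, s) as a normal with standard deviation s, and draws the three random vectors in the order shown. If the original script differs on any of these points, the simulated values will not match the stored file even with seed 42.

```python
import numpy as np

def simulate_health_data_sketch(n=50, seed=42):
    """Hypothetical re-implementation of the DGP described in the text.

    Assumptions: default_rng as the generator, N(0, s) interpreted as
    standard deviation s, and the draw order used below.
    """
    rng = np.random.default_rng(seed)
    base_health = rng.uniform(0.0, 1.0, size=n)  # latent health capacity
    life_exp = 55 + 30 * base_health + rng.normal(0, 2, size=n)
    infant_mort = 60 - 55 * base_health + rng.normal(0, 3, size=n)
    country = [f"Country_{i:02d}" for i in range(1, n + 1)]
    # Both indicators are rounded to 1 decimal, as in the variable dictionary.
    return country, np.round(life_exp, 1), np.round(infant_mort, 1)

country, life_exp, infant_mort = simulate_health_data_sketch()
r = np.corrcoef(life_exp, infant_mort)[0, 1]
print(country[0], country[-1])
print("raw correlation:", round(r, 3))
```

Under these assumptions the population correlation implied by the formulas is about −0.96, so any 50-country draw should show a strongly negative sample correlation.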

The datasets

health_data · grain: country (cross-section) · 50 rows × 3 columns · time dimension: none (cross-sectional) · units: 50 simulated countries

Panel key: country · Raw input from which the composite PCA Health Index (PC1) is built.

Variable dictionary

Variable	Type	Label	Definition	Construction	Units	Source	Coverage
`country`	identifier	Country identifier	Synthetic country label.	Generated as Country_01 ... Country_50 (zero-padded sequential index).	string	Simulation	all rows
`life_exp`	continuous	Life expectancy (years)	Average life expectancy at birth; a positive health indicator (higher is better).	Simulated: life_exp = 55 + 30*base_health + N(0, 2), rounded to 1 decimal; base_health ~ Uniform(0, 1).	years	Simulation	all rows (range 54.9-84.7, mean 70.72)
`infant_mort`	continuous	Infant mortality (per 1,000 live births)	Deaths before age 1 per 1,000 live births; a negative health indicator (higher is worse), polarity-adjusted before PCA.	Simulated: infant_mort = 60 - 55*base_health + N(0, 3), rounded to 1 decimal; same base_health as life_exp (shared latent factor).	per 1,000 live births	Simulation	all rows (range 3.5-58.7, mean 30.30)

Summary statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`country`	100%	50	50	–	–	–	–	–
`life_exp`	100%	50	46	54.90	70.72	71.25	84.70	8.62
`infant_mort`	100%	50	50	3.50	30.30	30.25	58.70	15.57

Known limitations & caveats

Synthetic data. There is no real data behind this tutorial; values are drawn from a known single-latent-factor process and are internally consistent with that calibration, but are not empirical evidence about real-world health outcomes.
Two standardized variables → equal weights. With exactly two standardized indicators that are correlated (r ≠ 0), PCA always assigns weights of equal magnitude (1/√2 each) regardless of how strong the correlation is, so PC1 is just a scaled simple average of the two z-scores. PCA's real advantage appears only with three or more variables.
Relative, not absolute. The Health Index measures performance relative to the 50-country sample (Min-Max maps the worst to 0 and the best to 1); it is not an absolute or cross-sample-comparable score.
Derived columns are not stored here. infant_mort_adj, z-scores, PC1, and the Health Index are computed in the script and saved to the post's results CSVs; this file holds only the three raw input columns.
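The equal-weights and relative-scale caveats above can be checked numerically. The sketch below is illustrative only: the helper names `pc1_weights` and `min_max` and the PC1 scores are hypothetical, and the second part holds the PC1 scores fixed for simplicity, although in the full pipeline standardization would also shift them when the sample changes.

```python
import numpy as np

def pc1_weights(r):
    """PC1 eigenvector of the 2x2 correlation matrix [[1, r], [r, 1]]."""
    corr = np.array([[1.0, r], [r, 1.0]])
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    w = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    return w if w[0] >= 0 else -w            # fix the arbitrary sign

def min_max(scores):
    """Rescale scores to [0, 1] relative to the sample minimum and maximum."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())

# Equal weights: the magnitude is 1/sqrt(2) whether the correlation is weak or strong.
for r in (0.2, 0.5, 0.9595):
    print(r, np.round(pc1_weights(r), 4))

# Relative scale: adding one stronger performer lowers the index of every
# other country above the minimum, even with their PC1 scores held fixed.
pc1_sample = np.array([-2.0, -0.5, 0.0, 1.0, 1.5])  # hypothetical PC1 scores
pc1_extended = np.append(pc1_sample, 3.0)
print(np.round(min_max(pc1_sample), 3))
print(np.round(min_max(pc1_extended), 3))
```

With a negative correlation the two weights keep the same magnitude but take opposite signs, which is why the polarity adjustment is applied before PCA.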