Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| health_data | country (cross-section) | 50 × 3 | health_data.dta | health_data.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
use "${BASE}health_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
df = pd.read_stata(BASE + "health_data.dta")
# load every dataset at once
files = ["health_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "health_data.dta", "health_data.dta")
df, meta = pyreadstat.read_dta("health_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
df <- read_dta(paste0(BASE, "health_data.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that builds a composite Health Index from two correlated health indicators using Principal Component Analysis (PCA). The dataset is fully simulated: 50 countries are generated from a single latent health factor that drives life_exp (Life Expectancy) positively and infant_mort (Infant Mortality) negatively, so that PCA's recovery of the known single-factor structure can be verified. The post follows a six-step manual pipeline in NumPy/pandas — polarity adjustment, z-score standardization, the covariance matrix, eigen-decomposition, PC1 scoring, and Min-Max normalization — then replicates it with scikit-learn. Polarity adjustment flips the raw correlation from −0.9595 to +0.9595; eigen-decomposition yields eigenvalues 1.9595 and 0.0405 with equal weights of 0.7071, so PC1 explains 97.97% of total variance and the manual scores match scikit-learn to within 1.33×10⁻¹⁵. This file is the single raw input the whole pipeline is built on.
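The eigenvalues and variance share quoted above follow directly from the 2×2 correlation matrix. A short worked derivation, assuming r = 0.9595 is the polarity-adjusted correlation rounded to four decimals (the `pmatrix` environment assumes the amsmath package):

```latex
\[
\det\begin{pmatrix} 1-\lambda & r \\ r & 1-\lambda \end{pmatrix}
= (1-\lambda)^2 - r^2 = 0
\quad\Longrightarrow\quad
\lambda_1 = 1 + r, \qquad \lambda_2 = 1 - r .
\]
\[
r = 0.9595: \qquad
\lambda_1 = 1.9595, \qquad
\lambda_2 = 0.0405, \qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1.9595}{2} = 0.97975 .
\]
```

The four-decimal inputs give 0.97975, in line with the reported 97.97% once the rounding of r is taken into account.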
health_data is a cross-section — one row per country — with a country identifier and two raw health indicators. The composite Health Index is derived from these two columns and is not stored here (it is an output of the post's script, regenerated on every run).
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated via a single-latent-factor data-generating process (open & reproducible, seed = 42) | Mendez, C. (2026). See the post's Python script script.py (simulate_health_data) for the full DGP. |
| Method references | PCA and composite-index construction concepts | Jolliffe & Cadima (2016), Principal Component Analysis: A Review and Recent Developments; Pearson (1901); Hotelling (1933); OECD/JRC (2008) Handbook on Constructing Composite Indicators. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Introduction to PCA Analysis for Building Development Indicators [Data set]. https://carlos-mendez.org/post/python_pca/
Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.
BibTeX
@misc{mendez2026pythonpca,
author = {Mendez, Carlos},
title = {Introduction to PCA Analysis for Building Development Indicators},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_pca/}},
note = {Data set}
}
@article{jolliffe2016pca,
author = {Jolliffe, Ian T. and Cadima, Jorge},
title = {Principal component analysis: a review and recent developments},
journal = {Philosophical Transactions of the Royal Society A},
volume = {374}, number = {2065}, pages = {20150202}, year = {2016}
}
Variable explorer (all 3 variables)
| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| country | identifier | – | Country identifier | Synthetic country label. | string | health_data | Simulation |
| infant_mort | continuous | | Infant mortality (per 1,000 live births) | Deaths before age 1 per 1,000 live births; a negative health indicator (higher is worse), polarity-adjusted before PCA. | per 1,000 live births | health_data | Simulation |
| life_exp | continuous | | Life expectancy (years) | Average life expectancy at birth; a positive health indicator (higher is better). | years | health_data | Simulation |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | health_data |
|---|---|
| country | ● |
| infant_mort | ● |
| life_exp | ● |
Construction & formulas
The dataset holds two raw indicators per country; the composite index is built from them in six steps, with variance explained reported alongside as a diagnostic (all derived columns are outputs of the post's script, not stored in this file).
- Polarity adjustment: flip "more is worse" indicators, IM* = -1 × infant_mort, so higher always means better health.
- Standardization (z-score): Z_ij = (X_ij - X̄_j) / σ_j (population SD, ddof=0); each variable then has mean 0 and SD 1.
- Covariance matrix: Σ = (1/n) ZᵀZ; for standardized data this is the 2×2 correlation matrix with 1s on the diagonal and r off it.
- Eigen-decomposition: solve Σv = λv; for a 2×2 correlation matrix λ₁ = 1 + r, λ₂ = 1 - r, with eigenvectors [1/√2, 1/√2] and [1/√2, -1/√2].
- PC1 score: PC1_i = w₁·Z_i,LE + w₂·Z_i,IM, the projection of country i onto the first eigenvector.
- Variance explained: λ_k / Σλ (PC1 = 97.97%).
- Min-Max normalization: HI_i = (PC1_i - min) / (max - min), rescaling scores to the human-readable [0, 1] Health Index.
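The steps above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the post's script: the five data points are hypothetical toy values (not rows from health_data), and the sign fix on the eigenvector is an added convention so that higher scores mean better health.

```python
import numpy as np

# Hypothetical toy values for five countries (NOT taken from health_data)
life_exp = np.array([58.0, 63.5, 70.2, 77.8, 83.1])
infant_mort = np.array([52.0, 44.1, 30.5, 15.2, 6.3])

# Step 1: polarity adjustment, so that higher always means better health
im_adj = -1.0 * infant_mort

# Step 2: z-score standardization with the population SD (ddof=0)
X = np.column_stack([life_exp, im_adj])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# Step 3: covariance of standardized data, which equals the correlation matrix
n = Z.shape[0]
cov = (Z.T @ Z) / n

# Step 4: eigen-decomposition (eigh returns ascending eigenvalues, so reorder)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvector signs are arbitrary; flip so the PC1 weights are positive
w = eigvecs[:, 0]
if w.sum() < 0:
    w = -w

# Step 5: PC1 score, the projection of each country onto the first eigenvector
pc1 = Z @ w

# Diagnostic: share of total variance carried by each component
explained = eigvals / eigvals.sum()

# Step 6: Min-Max normalization to a [0, 1] Health Index
health_index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

print("weights:", np.round(w, 4))
print("variance explained:", np.round(explained, 4))
print("health index:", np.round(health_index, 3))
```

Swapping the toy arrays for the life_exp and infant_mort columns of the loaded file should, under these assumptions, reproduce the quantities described in the overview.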
Synthetic data-generating process (seed = 42): for country i,
base_health ~ Uniform(0, 1) is a latent health capacity; then
life_exp = 55 + 30·base_health + N(0, 2) and
infant_mort = 60 - 55·base_health + N(0, 3). Because both indicators load on the
same latent factor with opposite signs, the raw correlation is a strong −0.96 — ideal for
near-lossless PCA compression.
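The data-generating process above can be sketched as follows. This is a hypothetical re-implementation, not the post's simulate_health_data function: the function name, the choice of NumPy's default_rng, and the reading of N(0, 2) and N(0, 3) as standard deviations of 2 and 3 are all assumptions, so the draws are not expected to match the published file value for value.

```python
import numpy as np


def simulate_health_sketch(n=50, seed=42):
    """Hypothetical sketch of the single-latent-factor DGP described above."""
    # Assumption: the post's script may use a different random number generator
    rng = np.random.default_rng(seed)

    # Latent health capacity shared by both indicators
    base_health = rng.uniform(0.0, 1.0, size=n)

    # Assumption: the second argument of N(., .) is a standard deviation
    life_exp = 55 + 30 * base_health + rng.normal(0.0, 2.0, size=n)
    infant_mort = 60 - 55 * base_health + rng.normal(0.0, 3.0, size=n)

    # Zero-padded sequential labels: Country_01 ... Country_50
    country = [f"Country_{i:02d}" for i in range(1, n + 1)]
    return country, np.round(life_exp, 1), np.round(infant_mort, 1)


country, life_exp, infant_mort = simulate_health_sketch()
r = np.corrcoef(life_exp, infant_mort)[0, 1]
print(country[0], country[-1])
print(f"raw correlation: {r:.4f}")
```

Because both indicators load on the same base_health draw with opposite signs, the printed correlation should come out strongly negative, although its exact value depends on the generator used.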
The datasets
Switch datasets with the tabs. Each shows the full variable dictionary plus a sortable statistics table with mini distributions and data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| country (identifier) | Country identifier | Synthetic country label. | Generated as Country_01 ... Country_50 (zero-padded sequential index). | string | Simulation | all rows |
| life_exp (continuous) | Life expectancy (years) | Average life expectancy at birth; a positive health indicator (higher is better). | Simulated: life_exp = 55 + 30*base_health + N(0, 2), rounded to 1 decimal; base_health ~ Uniform(0, 1). | years | Simulation | all rows (range 54.9-84.7, mean 70.72) |
| infant_mort (continuous) | Infant mortality (per 1,000 live births) | Deaths before age 1 per 1,000 live births; a negative health indicator (higher is worse), polarity-adjusted before PCA. | Simulated: infant_mort = 60 - 55*base_health + N(0, 3), rounded to 1 decimal; same base_health as life_exp (shared latent factor). | per 1,000 live births | Simulation | all rows (range 3.5-58.7, mean 30.30) |
Distribution & statistics
| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| country | – | 100% | 50 | 50 | – | – | – | – | – |
| life_exp | | 100% | 50 | 46 | 54.90 | 70.72 | 71.25 | 84.70 | 8.62 |
| infant_mort | | 100% | 50 | 50 | 3.50 | 30.30 | 30.25 | 58.70 | 15.57 |
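After loading the file, the documented figures can serve as a quick sanity check. The helper below is a hypothetical sketch (the name check_health_data and the 0.05 tolerance are assumptions); it compares a DataFrame against the row count, column names, and min/max values listed in the table above.

```python
import numpy as np
import pandas as pd

# (min, max) per numeric column, copied from the statistics table above
DOCUMENTED_RANGES = {"life_exp": (54.9, 84.7), "infant_mort": (3.5, 58.7)}
EXPECTED_COLUMNS = {"country", "life_exp", "infant_mort"}


def check_health_data(df, n_rows=50, atol=0.05):
    """Hypothetical helper: list mismatches between df and the documented table."""
    if set(df.columns) != EXPECTED_COLUMNS:
        # Without the right columns the remaining checks cannot run
        return [f"unexpected columns: {sorted(df.columns)}"]

    problems = []
    if len(df) != n_rows:
        problems.append(f"expected {n_rows} rows, found {len(df)}")
    if df["country"].nunique() != len(df):
        problems.append("country labels are not unique")
    if df.isna().any().any():
        problems.append("missing values present")
    for col, (lo, hi) in DOCUMENTED_RANGES.items():
        if not np.isclose(df[col].min(), lo, atol=atol):
            problems.append(f"{col}: min {df[col].min()} differs from {lo}")
        if not np.isclose(df[col].max(), hi, atol=atol):
            problems.append(f"{col}: max {df[col].max()} differs from {hi}")
    return problems


# Stand-in frame with three rows, used only to show what a failure report looks like
toy = pd.DataFrame({
    "country": ["Country_01", "Country_02", "Country_03"],
    "life_exp": [60.0, 70.0, 80.0],
    "infant_mort": [40.0, 25.0, 10.0],
})
for problem in check_health_data(toy):
    print(problem)
```

Calling check_health_data(df) on the frame returned by pd.read_stata should return an empty list if the download matches the documentation.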
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial; values are drawn from a known single-latent-factor process and are internally consistent with that calibration, but are not empirical evidence about real-world health outcomes.
- Two standardized variables → equal weights. With exactly two standardized indicators, PCA assigns weights of equal magnitude (1/√2 each) whatever the strength of the correlation, as long as it is nonzero; after polarity adjustment the correlation is positive, so PC1 is just a (scaled) simple average of the two z-scores. PCA's real advantage appears only with three or more variables.
- Relative, not absolute. The Health Index measures performance relative to the 50-country sample (Min-Max maps the worst to 0 and the best to 1); it is not an absolute or cross-sample-comparable score.
- Derived columns are not stored here. infant_mort_adj, z-scores, PC1, and the Health Index are computed in the script and saved to the post's results CSVs; this file holds only the three raw input columns.
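The equal-weights caveat can be checked numerically. The sketch below (pc1_of_2x2 is a hypothetical helper name) decomposes the 2×2 correlation matrix for a few positive values of r; only the eigenvalue changes with r, while the PC1 weights stay at 1/√2.

```python
import numpy as np


def pc1_of_2x2(r):
    """Leading eigenvalue and eigenvector of the correlation matrix [[1, r], [r, 1]]."""
    corr = np.array([[1.0, r], [r, 1.0]])
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues come back in ascending order
    w = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    if w[0] < 0:                             # eigenvector sign is arbitrary; make it positive
        w = -w
    return eigvals[-1], w


# 0.9595 is the polarity-adjusted correlation reported in the overview;
# the other two values are arbitrary illustrations
for r in (0.20, 0.50, 0.9595):
    lam1, w = pc1_of_2x2(r)
    print(f"r={r:.4f}  lambda1={lam1:.4f}  weights={np.round(w, 4)}")
```

A weaker correlation lowers the share of variance PC1 carries (λ₁ / 2) but leaves the index weights unchanged, which is why a two-indicator PCA index ranks countries the same way as a simple average of z-scores.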