Interactive data dictionary

Introduction to PCA for Building Development Indicators

Companion input data for a step-by-step PCA tutorial — a fully simulated cross-section of two health indicators for 50 countries.

1 dataset · 3 variables · 50 countries · cross-section structure

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
| --- | --- | --- | --- | --- |
| health_data | country (cross-section) | 50 × 3 | health_data.dta | health_data.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
use "${BASE}health_data.dta", clear
describe
notes

Python

# Notebook-only shell command (Colab/Jupyter); in a terminal run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
df = pd.read_stata(BASE + "health_data.dta")

# load every dataset at once
files = ["health_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "health_data.dta", "health_data.dta")
df, meta = pyreadstat.read_dta("health_data.dta")

To run it, paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/data/"
df <- read_dta(paste0(BASE, "health_data.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that builds a composite Health Index from two correlated health indicators using Principal Component Analysis (PCA). The dataset is fully simulated: 50 countries are generated from a single latent health factor that drives life_exp (Life Expectancy) positively and infant_mort (Infant Mortality) negatively, so that PCA's recovery of the known single-factor structure can be verified. The post follows a six-step manual pipeline in NumPy/pandas — polarity adjustment, z-score standardization, the covariance matrix, eigen-decomposition, PC1 scoring, and Min-Max normalization — then replicates it with scikit-learn. Polarity adjustment flips the raw correlation from −0.9595 to +0.9595; eigen-decomposition yields eigenvalues 1.9595 and 0.0405 with equal weights of 0.7071, so PC1 explains 97.97% of total variance and the manual scores match scikit-learn to within 1.33×10⁻¹⁵. This file is the single raw input the whole pipeline is built on.
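The eigenvalues quoted above follow from the correlation alone: for two standardized variables with correlation r, the 2 × 2 correlation matrix has eigenvalues 1 + r and 1 − r, and PC1 puts a weight of 1/√2 ≈ 0.7071 on each variable. A minimal sketch that checks this with NumPy, using only the rounded r = 0.9595 reported above (so the last digit of the variance share may differ from the post's unrounded figure):

```python
import numpy as np

r = 0.9595  # correlation after polarity adjustment, as reported above (rounded)

# Correlation matrix of two standardized indicators
R = np.array([[1.0, r],
              [r, 1.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder: largest first

pc1_weights = np.abs(eigvecs[:, 0])           # the sign of an eigenvector is arbitrary
explained_share = eigvals[0] / eigvals.sum()  # share of total variance carried by PC1

print(np.round(eigvals, 4))       # eigenvalues 1 + r and 1 - r
print(np.round(pc1_weights, 4))   # equal weights of 1/sqrt(2)
print(round(explained_share, 4))
```

Because the weights are equal whenever two standardized indicators are positively correlated, the interesting output here is the variance share, not the weights themselves.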

One file. health_data is a cross-section — one row per country — with a country identifier and two raw health indicators. The composite Health Index is derived from these two columns and is not stored here (it is an output of the post's script, regenerated on every run).

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| Synthetic (this study) | All values: simulated via a single-latent-factor data-generating process (open & reproducible, seed = 42) | Mendez, C. (2026). See the post's Python script script.py (simulate_health_data) for the full DGP. |
| Method references | PCA and composite-index construction concepts | Jolliffe & Cadima (2016), Principal Component Analysis: A Review and Recent Developments; Pearson (1901); Hotelling (1933); OECD/JRC (2008) Handbook on Constructing Composite Indicators. |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Introduction to PCA Analysis for Building Development Indicators [Data set]. https://carlos-mendez.org/post/python_pca/

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.

BibTeX

@misc{mendez2026pythonpca,
  author       = {Mendez, Carlos},
  title        = {Introduction to PCA Analysis for Building Development Indicators},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_pca/}},
  note         = {Data set}
}

@article{jolliffe2016pca,
  author  = {Jolliffe, Ian T. and Cadima, Jorge},
  title   = {Principal component analysis: a review and recent developments},
  journal = {Philosophical Transactions of the Royal Society A},
  volume  = {374}, number = {2065}, pages = {20150202}, year = {2016}
}

Variable explorer

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| country | identifier | | Country identifier | Synthetic country label. | string | health_data | Simulation |
| infant_mort | continuous | min 3.5, median 30.2, max 58.7 | Infant mortality (per 1,000 live births) | Deaths before age 1 per 1,000 live births; a negative health indicator (higher is worse), polarity-adjusted before PCA. | per 1,000 live births | health_data | Simulation |
| life_exp | continuous | min 54.9, median 71.2, max 84.7 | Life expectancy (years) | Average life expectancy at birth; a positive health indicator (higher is better). | years | health_data | Simulation |

Cross-file variable index

Which file each variable appears in (● = present).

| Variable | health_data |
| --- | --- |
| country | ● |
| infant_mort | ● |
| life_exp | ● |

Construction & formulas

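The dataset holds two raw indicators per country; the composite index is built from them in six steps: polarity adjustment, z-score standardization, the covariance matrix, eigen-decomposition, PC1 scoring, and Min-Max normalization. All derived columns are outputs of the post's script and are not stored in this file.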
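As a rough illustration of how the two raw columns become an index, the sketch below runs a six-step PCA pipeline in NumPy/pandas on five made-up rows. It is not the post's script: the data values, the `health_index` column name, the use of the population SD (ddof = 0), and the sign convention for PC1 are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from health_data
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "life_exp": [58.0, 64.5, 71.0, 77.5, 83.0],
    "infant_mort": [52.0, 41.0, 30.5, 17.0, 6.5],
})

# 1. Polarity adjustment: flip infant_mort so higher always means healthier
X = np.column_stack([df["life_exp"], -df["infant_mort"]])

# 2. Z-score standardization (population SD, ddof=0, is an assumption here)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Covariance matrix of the standardized data (equals the correlation matrix)
C = (Z.T @ Z) / len(Z)

# 4. Eigen-decomposition, sorted so the largest eigenvalue comes first
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. PC1 scores; fix the arbitrary sign so higher scores mean better health
w = eigvecs[:, 0]
if w.sum() < 0:
    w = -w
pc1 = Z @ w

# 6. Min-Max normalization to a 0-1 index
df["health_index"] = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

print(df[["country", "health_index"]])
```

The choice between ddof = 0 and ddof = 1 rescales the eigenvalues by a constant factor but leaves the PC1 weights and the final 0-1 index unchanged, which is why the sketch does not depend on matching the post's convention.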

Synthetic data-generating process (seed = 42): for country i, base_health ~ Uniform(0, 1) is a latent health capacity; then life_exp = 55 + 30·base_health + N(0, 2) and infant_mort = 60 - 55·base_health + N(0, 3). Because both indicators load on the same latent factor with opposite signs, the raw correlation is a strong −0.96 — ideal for near-lossless PCA compression.
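The data-generating process above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the post's simulate_health_data function: the name `simulate_health_sketch` is hypothetical, N(0, s) is read as a normal distribution with standard deviation s, and the random draws come from `numpy.random.default_rng`, so the values will generally differ from the published file even with seed = 42.

```python
import numpy as np
import pandas as pd

def simulate_health_sketch(n=50, seed=42):
    """Illustrative single-latent-factor DGP (assumptions noted in the text)."""
    rng = np.random.default_rng(seed)
    base_health = rng.uniform(0, 1, n)  # latent health capacity
    # Both indicators load on base_health, with opposite signs
    life_exp = 55 + 30 * base_health + rng.normal(0, 2, n)
    infant_mort = 60 - 55 * base_health + rng.normal(0, 3, n)
    return pd.DataFrame({
        "country": [f"Country_{i:02d}" for i in range(1, n + 1)],
        "life_exp": life_exp.round(1),
        "infant_mort": infant_mort.round(1),
    })

sim = simulate_health_sketch()
print(sim.head())
print(sim["life_exp"].corr(sim["infant_mort"]))  # strongly negative by construction
```

Under this reading of the noise terms, the implied population correlation is about −0.96, which is consistent with the figure reported above.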

The datasets

Each dataset is documented with its full variable dictionary and a statistics table with distribution summaries and data coverage.

health_data · country (cross-section) · 50 × 3 · time dimension: none (cross-sectional) · 50 simulated countries

Key: country (one row per country; the dataset is a cross-section, not a panel) · Raw input from which the composite PCA Health Index (PC1) is built.
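Since country is the row key, it can be worth confirming after loading that it has no missing or repeated values. A small sketch of such a check follows; the helper name `check_cross_section` and the two mini-frames are hypothetical and exist only for this example.

```python
import pandas as pd

def check_cross_section(df, key="country"):
    """Return a list of problems; an empty list means the key is clean."""
    problems = []
    if df[key].isna().any():
        problems.append(f"{key} has missing values")
    dupes = df.loc[df[key].duplicated(keep=False), key].unique().tolist()
    if dupes:
        problems.append(f"duplicated {key} values: {dupes}")
    return problems

# Hypothetical mini-frames for illustration; not rows from health_data
good = pd.DataFrame({"country": ["Country_01", "Country_02"], "life_exp": [70.1, 65.3]})
bad = pd.DataFrame({"country": ["Country_01", "Country_01"], "life_exp": [70.1, 65.3]})

print(check_cross_section(good))
print(check_cross_section(bad))
```

The same function could be called on the DataFrame returned by pd.read_stata for health_data.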

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- |
| country (identifier) | Country identifier | Synthetic country label. | Generated as Country_01 ... Country_50 (zero-padded sequential index). | string | Simulation | all rows |
| life_exp (continuous) | Life expectancy (years) | Average life expectancy at birth; a positive health indicator (higher is better). | Simulated: life_exp = 55 + 30*base_health + N(0, 2), rounded to 1 decimal; base_health ~ Uniform(0, 1). | years | Simulation | all rows (range 54.9-84.7, mean 70.72) |
| infant_mort (continuous) | Infant mortality (per 1,000 live births) | Deaths before age 1 per 1,000 live births; a negative health indicator (higher is worse), polarity-adjusted before PCA. | Simulated: infant_mort = 60 - 55*base_health + N(0, 3), rounded to 1 decimal; same base_health as life_exp (shared latent factor). | per 1,000 live births | Simulation | all rows (range 3.5-58.7, mean 30.30) |

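Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| country | | 100% | 50 | 50 | | | | | |
| life_exp | min 54.9, median 71.2, max 84.7 | 100% | 50 | 46 | 54.90 | 70.72 | 71.25 | 84.70 | 8.62 |
| infant_mort | min 3.5, median 30.2, max 58.7 | 100% | 50 | 50 | 3.50 | 30.30 | 30.25 | 58.70 | 15.57 |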
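A table with these columns can be rebuilt from any loaded copy of the data with pandas. The sketch below is one possible way to do it; the helper name `summarize` and the demo values are hypothetical, and it uses the pandas default sample SD (ddof = 1), which may or may not be the convention behind the SD column above.

```python
import pandas as pd

def summarize(df, columns):
    """Per-variable summary: coverage, N, distinct, min, mean, median, max, SD."""
    rows = []
    for col in columns:
        s = df[col]
        rows.append({
            "variable": col,
            "coverage_pct": 100 * s.notna().mean(),
            "n": int(s.notna().sum()),
            "distinct": s.nunique(),
            "min": s.min(),
            "mean": s.mean(),
            "median": s.median(),
            "max": s.max(),
            "sd": s.std(),  # pandas default is the sample SD (ddof=1)
        })
    return pd.DataFrame(rows).set_index("variable")

# Hypothetical values for illustration; not rows from health_data
demo = pd.DataFrame({"life_exp": [60.0, 70.0, 80.0, None],
                     "infant_mort": [40.0, 30.0, 30.0, 10.0]})
stats = summarize(demo, ["life_exp", "infant_mort"])
print(stats.round(2))
```

Applied to the real file, the call would be summarize(df, ["life_exp", "infant_mort"]) on the DataFrame loaded earlier.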

Known limitations & caveats