Data dictionary · Introduction to Panel Data Methods in Python

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`raw_data`	worker-year	11,045 × 9	raw_data.dta	raw_data.csv
`data_panel`	worker-year	4,398 × 10	data_panel.dta	data_panel.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
use "${BASE}raw_data.dta", clear
describe
notes

Python

# Notebook-only line (Colab/Jupyter): the leading "!" runs a shell command
!pip install -q pyreadstat

import urllib.request

import pandas as pd
import pyreadstat

BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"

# pandas reads a .dta file straight from an https URL
df = pd.read_stata(BASE + "raw_data.dta")

# load every dataset at once into a dict keyed by dataset name
files = ["raw_data", "data_panel"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
urllib.request.urlretrieve(BASE + "raw_data.dta", "raw_data.dta")
df, meta = pyreadstat.read_dta("raw_data.dta")  # meta carries the variable labels

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
df <- read_dta(paste0(BASE, "raw_data.dta"))

Overview & sources

Companion data for a beginner-friendly Python tutorial that walks through seven canonical panel-data estimators — pooled OLS, the between estimator, first-differences, the within (fixed-effects) estimator, two-way fixed effects, random effects, and Mundlak's correlated random effects (CRE) — on a single worker wage panel. The running question is whether union membership raises wages. The data are real NLSY-style observations on US workers from the quarcs-lab/data-open repository (isds/wage_panel_bob4.dta). The cross-sectional estimators report a union premium of 7–11 log points; once unobserved worker traits are netted out, the within estimators report roughly 21 log points — a near-tripling that illustrates selection on unobservables. The Hausman test fails to reject random effects (H = 1.79, p = 0.180) while the Mundlak term (−0.144, p = 0.072) hints at negative selection; both point toward CRE/Mundlak as the specification to lead with.

Two files. raw_data is the full NLSY-style download — one row per worker × year, five waves (2010, 2012, 2014, 2016, 2018), 2,209 workers, balanced. data_panel is the tutorial's analysis sample: the same data restricted to 2010 and 2012 only so that T = 2 and the first-difference and within estimators coincide; rows with missing lwage/union/age/schooling are dropped (2,199 workers × 2 years = 4,398 observations, balanced) and a female dummy is added.
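The construction of data_panel described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the tutorial's own build script: the inline CSV is a hypothetical three-worker miniature with invented values and only a subset of the columns, and the names RAW_CSV, KEEP_YEARS, and REQUIRED are assumptions made for the example.

```python
import csv
import io

# Hypothetical miniature of raw_data.csv (invented values, subset of columns).
RAW_CSV = """\
ID,year,age,schooling,union,lwage,gender
1.0,2010,30,12,0,2.996,Female
1.0,2012,32,12,1,3.091,Female
1.0,2014,34,12,1,3.219,Female
2.0,2010,41,,0,3.555,Male
2.0,2012,43,,0,3.584,Male
2.0,2014,45,,0,3.638,Male
3.0,2010,27,16,0,2.708,Male
3.0,2012,29,16,0,2.890,Male
3.0,2014,31,16,1,2.944,Male
"""

KEEP_YEARS = {2010, 2012}                          # two waves, so T = 2
REQUIRED = ("lwage", "union", "age", "schooling")  # rows missing any of these are dropped

panel = []
for row in csv.DictReader(io.StringIO(RAW_CSV)):
    if int(row["year"]) not in KEEP_YEARS:
        continue                                   # keep 2010 and 2012 only
    if any(row[col] == "" for col in REQUIRED):
        continue                                   # drop rows with missing values
    row["ID"] = int(float(row["ID"]))              # the CSV stores the ID as a float
    row["year"] = int(row["year"])
    row["female"] = 1 if row["gender"] == "Female" else 0  # derived dummy
    panel.append(row)

# Balance check: every remaining worker must appear in every kept wave.
waves = {}
for row in panel:
    waves.setdefault(row["ID"], set()).add(row["year"])
balanced = all(years == KEEP_YEARS for years in waves.values())

print(len(panel), "rows,", len(waves), "workers, balanced:", balanced)
```

In this toy input, worker 2 is missing schooling in every wave, so the worker drops out entirely and the remaining sample stays balanced. With pandas, the same steps would typically use `isin`, `dropna(subset=...)`, and a boolean comparison on gender.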

Data sources

Source	Provides	Reference / URL
quarcs-lab data-open	The source wage panel (real NLSY-style worker observations)	QuaRCS-lab data-open repository, isds/wage_panel_bob4.dta. https://github.com/quarcs-lab/data-open
Method references	Estimators, specification tests, and the panel-data framework	Wooldridge (2010); Hausman (1978); Mundlak (1978).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Introduction to Panel Data Methods in Python [Data set]. https://carlos-mendez.org/post/python_panel_intro/

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.

Hausman, J. A. (1978). Specification Tests in Econometrics. Econometrica, 46(6), 1251–1271.

Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1), 69–85.

BibTeX

@misc{mendez2026pythonpanelintro,
  author       = {Mendez, Carlos},
  title        = {Introduction to Panel Data Methods in Python},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_panel_intro/}},
  note         = {Data set}
}

@book{wooldridge2010econometric,
  author    = {Wooldridge, Jeffrey M.},
  title     = {Econometric Analysis of Cross Section and Panel Data},
  edition   = {2nd}, publisher = {MIT Press}, year = {2010}
}
@article{hausman1978specification,
  author  = {Hausman, Jerry A.},
  title   = {Specification Tests in Econometrics},
  journal = {Econometrica},
  volume  = {46}, number = {6}, pages = {1251--1271}, year = {1978}
}
@article{mundlak1978pooling,
  author  = {Mundlak, Yair},
  title   = {On the Pooling of Time Series and Cross Section Data},
  journal = {Econometrica},
  volume  = {46}, number = {1}, pages = {69--85}, year = {1978}
}

Variable explorer: all 10 variables


Variable	Type	Label	Definition	Units	In files	Source
`ID`	identifier	Worker identifier	Unique person ID; the panel cross-sectional (unit) dimension.	integer id	raw_data, data_panel	quarcs-lab data-open
`age`	continuous	Age (years)	Worker age in years at the survey wave.	years	raw_data, data_panel	quarcs-lab data-open
`female`	dummy	Female (1=yes)	1 if the worker's gender is Female, else 0; time-invariant covariate.	0/1	data_panel	Derived (this study)
`gender`	identifier	Gender	Worker's reported gender (Female / Male).	category	raw_data, data_panel	quarcs-lab data-open
`lwage`	continuous	Log hourly wage	Natural log of the hourly wage rate; the outcome variable.	log US$/hour	raw_data, data_panel	quarcs-lab data-open
`region`	identifier	Census region	US Census region of residence.	category	raw_data, data_panel	quarcs-lab data-open
`schooling`	continuous	Years of schooling	Completed years of education; time-invariant within a worker over the panel window.	years	raw_data, data_panel	quarcs-lab data-open
`union`	dummy	Union member (1=yes)	1 if the worker is a union member in that wave, else 0; the treatment of interest.	0/1	raw_data, data_panel	quarcs-lab data-open
`wagerate`	continuous	Hourly wage rate	Worker's hourly wage rate (level), the basis for log wage.	US$/hour	raw_data, data_panel	quarcs-lab data-open
`year`	year	Survey year	Calendar year of the observation; the panel time dimension.	year	raw_data, data_panel	quarcs-lab data-open

Cross-file variable index

Which file each variable appears in (● = present).

Variable	raw_data	data_panel
`ID`	●	●
`age`	●	●
`female`		●
`gender`	●	●
`lwage`	●	●
`region`	●	●
`schooling`	●	●
`union`	●	●
`wagerate`	●	●
`year`	●	●

Construction & formulas

All estimators target the coefficient β on union in the panel model y_it = α_i + β x_it + u_it, where y is lwage, x is union, and α_i is the unobserved worker effect. They differ in which variation identifies β.

Pooled OLS (POLS): OLS on every row, ignoring the panel — uses all variation.
Between: OLS on worker means ȳ_i on x̄_i — cross-sectional variation only.
First-differences (FDFE): regress Δy_it on Δx_it; α_i cancels by differencing.
Within / Fixed effects (FE): OLS on demeaned data x̃_it = x_it − x̄_i; α_i vanishes by demeaning. With T = 2, FE ≈ FDFE.
Two-way FE (TWFE): absorbs worker effects α_i and year effects δ_t.
Random effects (RE): GLS treating α_i as a random draw uncorrelated with the regressors — a weighted blend of between and within.
Correlated random effects (CRE / Mundlak): RE plus the worker mean x̄_i of each time-varying regressor; the within coefficient equals FE and the mean coefficient tests RE-vs-FE.
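To make the different sources of variation concrete, the sketch below computes several of these estimators by hand on a hypothetical four-worker, two-period panel. The numbers are invented and are not meant to reproduce the tutorial's estimates. The helper `ols` is an assumption of this example (a tiny normal-equations solver, used only to stay dependency-free), the Mundlak regression is run as pooled OLS rather than GLS, and TWFE and RE are omitted for brevity.

```python
from statistics import mean

# Hypothetical toy panel (values invented): worker id, year, union (x), log wage (y).
rows = [
    (1, 2010, 0, 2.9), (1, 2012, 1, 3.1),  # joins the union
    (2, 2010, 1, 3.4), (2, 2012, 0, 3.3),  # leaves the union
    (3, 2010, 0, 2.8), (3, 2012, 0, 2.9),  # never a member
    (4, 2010, 1, 3.6), (4, 2012, 1, 3.7),  # always a member
]

def ols(X, y):
    """Least-squares coefficients via the normal equations (tiny problems only)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                A[r] = [a - A[r][c] * b for a, b in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

ids = sorted({i for i, _, _, _ in rows})
x_bar = {i: mean(x for j, _, x, _ in rows if j == i) for i in ids}
y_bar = {i: mean(y for j, _, _, y in rows if j == i) for i in ids}
y_all = [y for *_, y in rows]

# Pooled OLS: every row, intercept plus union.
b_pols = ols([[1, x] for _, _, x, _ in rows], y_all)[1]

# Between: one observation per worker, using worker means.
b_between = ols([[1, x_bar[i]] for i in ids], [y_bar[i] for i in ids])[1]

# First differences (no intercept): change from 2010 to 2012 within each worker.
first = {i: (x, y) for i, t, x, y in rows if t == 2010}
last = {i: (x, y) for i, t, x, y in rows if t == 2012}
dx = [last[i][0] - first[i][0] for i in ids]
dy = [last[i][1] - first[i][1] for i in ids]
b_fd = ols([[d] for d in dx], dy)[0]

# Within / FE: deviations from worker means (no intercept after demeaning).
b_fe = ols([[x - x_bar[i]] for i, _, x, _ in rows],
           [y - y_bar[i] for i, _, _, y in rows])[0]

# CRE / Mundlak (pooled OLS version): union plus its worker mean.
_, b_cre, b_mean = ols([[1, x, x_bar[i]] for i, _, x, _ in rows], y_all)

print("POLS", round(b_pols, 3), "Between", round(b_between, 3))
print("FD", round(b_fd, 3), "FE", round(b_fe, 3), "CRE", round(b_cre, 3))
```

On this toy panel the first-difference and within estimates are identical (both 0.15) because only the two switchers contribute, the Mundlak regression recovers the same coefficient on union, and its coefficient on the worker mean equals the between estimate minus the within estimate (0.80 − 0.15 = 0.65).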

Between/within variance decomposition (per variable): between SD is SD(x̄_i), within SD is SD(x_it − x̄_i), and the between share is between² / (between² + within²).

Hausman test: H = (β_FE − β_RE)² / (V_FE − V_RE) ~ χ²(1) under the null that α_i is uncorrelated with the regressors.
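Both formulas above are short enough to check by hand. The following is a minimal sketch using only the standard library. The function names, the toy union histories, and the coefficient and variance inputs are hypothetical, and the use of population standard deviations (dividing by n) is an assumption, since this dictionary does not say which convention the tutorial uses.

```python
import math
from statistics import mean, pstdev

def between_within(panel):
    """panel maps worker id -> list of that worker's values over time."""
    means = {i: mean(v) for i, v in panel.items()}
    between_sd = pstdev(list(means.values()))                  # SD of worker means
    within_sd = pstdev([x - means[i] for i, v in panel.items() for x in v])
    share = between_sd**2 / (between_sd**2 + within_sd**2)     # between share
    return between_sd, within_sd, share

def chi2_1df_pvalue(h):
    """Upper-tail probability of a chi-square(1) variable: P(|Z| > sqrt(h))."""
    return math.erfc(math.sqrt(h / 2))

def hausman_1df(b_fe, b_re, v_fe, v_re):
    """Scalar Hausman statistic for one coefficient, with its p-value."""
    diff_var = v_fe - v_re
    if diff_var <= 0:
        raise ValueError("V_FE - V_RE must be positive for H to be defined")
    h = (b_fe - b_re) ** 2 / diff_var
    return h, chi2_1df_pvalue(h)

# Hypothetical union histories for four workers over two waves.
union = {1: [0, 1], 2: [1, 0], 3: [0, 0], 4: [1, 1]}
b_sd, w_sd, share = between_within(union)
print("between share:", round(share, 3))

# Hypothetical coefficient and variance inputs, for illustration only.
h, p = hausman_1df(b_fe=0.20, b_re=0.10, v_fe=0.010, v_re=0.006)
print("H:", round(h, 2), "p:", round(p, 3))

# The H reported in the overview maps to a p-value of about 0.18.
print("p for H = 1.79:", round(chi2_1df_pvalue(1.79), 3))
```

The last line gives a quick consistency check on the overview: with one degree of freedom, H = 1.79 corresponds to a p-value of roughly 0.18, in line with the reported p = 0.180 once rounding of H is allowed for.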

The datasets

Each dataset below is documented with its full variable dictionary and a statistics table with data coverage.


raw_data · worker-year 11,045 × 9 · 2010, 2012, 2014, 2016, 2018 · 2,209 US workers (balanced, 5 waves)

Panel key: ID × year · Source download; full NLSY-style panel for extensions (e.g. using all five waves).

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`ID` identifier	Worker identifier	Unique person ID; the panel cross-sectional (unit) dimension.	From the source NLSY-style file; stored as a float in the CSV.	integer id	quarcs-lab data-open	both files
`year` year	Survey year	Calendar year of the observation; the panel time dimension.	From the source file; raw has 2010-2018 (biennial), the panel keeps 2010 and 2012.	year	quarcs-lab data-open	both files
`age` continuous	Age (years)	Worker age in years at the survey wave.	From the source file.	years	quarcs-lab data-open	both files
`wagerate` continuous	Hourly wage rate	Worker's hourly wage rate (level), the basis for log wage.	From the source file; lwage = log(wagerate).	US$/hour	quarcs-lab data-open	both files
`schooling` continuous	Years of schooling	Completed years of education; time-invariant within a worker over the panel window.	From the source file.	years	quarcs-lab data-open	both files
`region` identifier	Census region	US Census region of residence.	From the source file (categorical text).	category	quarcs-lab data-open	both files
`union` dummy	Union member (1=yes)	1 if the worker is a union member in that wave, else 0; the treatment of interest.	Mapped from the source Yes/No string to 1/0.	0/1	quarcs-lab data-open	both files
`lwage` continuous	Log hourly wage	Natural log of the hourly wage rate; the outcome variable.	log(wagerate), from the source file.	log US$/hour	quarcs-lab data-open	both files
`gender` identifier	Gender	Worker's reported gender (Female / Male).	From the source file (categorical text).	category	quarcs-lab data-open	both files

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`ID`	100%	11,045	2,209	—	—	—	—	—
`year`	100%	11,045	5	2010	2014.0	2014	2018	2.83
`age`	100%	11,045	30	25.00	38.71	38.00	54.00	6.80
`wagerate`	100%	11,045	7,728	0.177	29.66	23.92	888.9	25.55
`schooling`	100%	10,995	14	3.00	14.50	15.00	17.00	2.18
`region`	100%	11,045	5	—	—	—	—	—
`union`	100%	11,045	2	0	0.162	0	1.00	0.369
`lwage`	100%	11,045	7,726	-1.73	3.19	3.17	6.79	0.603
`gender`	100%	11,045	2	—	—	—	—	—

data_panel · worker-year 4,398 × 10 · 2010, 2012 · 2,199 US workers (balanced, 2 waves)

Panel key: ID × year · Estimation sample for all seven panel estimators (POLS / Between / FDFE / FE / TWFE / RE / CRE).

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`ID` identifier	Worker identifier	Unique person ID; the panel cross-sectional (unit) dimension.	From the source NLSY-style file; stored as a float in the CSV.	integer id	quarcs-lab data-open	both files
`year` year	Survey year	Calendar year of the observation; the panel time dimension.	From the source file; raw has 2010-2018 (biennial), the panel keeps 2010 and 2012.	year	quarcs-lab data-open	both files
`age` continuous	Age (years)	Worker age in years at the survey wave.	From the source file.	years	quarcs-lab data-open	both files
`wagerate` continuous	Hourly wage rate	Worker's hourly wage rate (level), the basis for log wage.	From the source file; lwage = log(wagerate).	US$/hour	quarcs-lab data-open	both files
`schooling` continuous	Years of schooling	Completed years of education; time-invariant within a worker over the panel window.	From the source file.	years	quarcs-lab data-open	both files
`region` identifier	Census region	US Census region of residence.	From the source file (categorical text).	category	quarcs-lab data-open	both files
`union` dummy	Union member (1=yes)	1 if the worker is a union member in that wave, else 0; the treatment of interest.	Mapped from the source Yes/No string to 1/0.	0/1	quarcs-lab data-open	both files
`lwage` continuous	Log hourly wage	Natural log of the hourly wage rate; the outcome variable.	log(wagerate), from the source file.	log US$/hour	quarcs-lab data-open	both files
`gender` identifier	Gender	Worker's reported gender (Female / Male).	From the source file (categorical text).	category	quarcs-lab data-open	both files
`female` dummy	Female (1=yes)	1 if the worker's gender is Female, else 0; time-invariant covariate.	Derived in the analysis panel: 1 if gender == 'Female', else 0.	0/1	Derived (this study)	data_panel only

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`ID`	100%	4,398	2,199	—	—	—	—	—
`year`	100%	4,398	2	2010	2011.0	2011	2012	1.00
`age`	100%	4,398	25	25.00	35.68	35.00	49.00	6.26
`wagerate`	100%	4,398	3,014	0.177	26.84	22.10	429.9	20.26
`schooling`	100%	4,398	14	3.00	14.50	15.00	17.00	2.18
`region`	100%	4,398	5	—	—	—	—	—
`union`	100%	4,398	2	0	0.163	0	1.00	0.369
`lwage`	100%	4,398	3,014	-1.73	3.11	3.10	6.06	0.598
`gender`	100%	4,398	2	—	—	—	—	—
`female`	100%	4,398	2	0	0.517	1.00	1.00	0.500
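Several entries in the statistics tables follow directly from the variable definitions, which gives a quick way to sanity-check a fresh download. The sketch below uses only figures copied from the tables in this dictionary. Treating the reported SDs as population standard deviations is an assumption, and the variable names are made up for the example.

```python
import math
from statistics import pstdev

# lwage = log(wagerate) is monotone, so min and max carry over between the two.
lwage_min = round(math.log(0.177), 2)        # both files: min wagerate 0.177
lwage_max_raw = round(math.log(888.9), 2)    # raw_data: max wagerate 888.9
lwage_max_panel = round(math.log(429.9), 2)  # data_panel: max wagerate 429.9
print(lwage_min, lwage_max_raw, lwage_max_panel)  # -1.73 6.79 6.06

# A 0/1 dummy with mean p has standard deviation sqrt(p * (1 - p)).
sd_female = round(math.sqrt(0.517 * (1 - 0.517)), 3)  # data_panel female
sd_union = round(math.sqrt(0.163 * (1 - 0.163)), 3)   # data_panel union
print(sd_female, sd_union)  # 0.5 0.369

# In a balanced panel every wave has equal weight, so SD(year) depends only on the waves.
sd_year_raw = round(pstdev([2010, 2012, 2014, 2016, 2018]), 2)
sd_year_panel = round(pstdev([2010, 2012]), 2)
print(sd_year_raw, sd_year_panel)  # 2.83 1.0
```

Each computed value matches the corresponding table entry to the precision shown, which is what one would expect if lwage is the natural log of wagerate and both panels are balanced.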

Known limitations & caveats

Two periods only in the analysis panel. data_panel keeps just 2010 and 2012 (T = 2) so the first-difference and within estimators coincide; with only two waves the fixed-effects estimate is power-limited and the Hausman test has low power. raw_data retains all five waves for extensions.
Thin within variation. Union status is 93.9% between-worker and only 9.1% within; schooling has zero within-variation in the two-period window, so fixed effects mechanically drops it.
10 workers dropped. The analysis panel keeps 2,199 of the 2,209 raw workers; ten are dropped for missing values in lwage/union/age/schooling.
Estimands differ. Within estimators (FDFE/FE/TWFE/CRE) identify the effect for union switchers under strict exogeneity; POLS/Between report a population-weighted association without a causal interpretation absent unconfoundedness.
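The sample sizes quoted throughout this dictionary can be reconciled with a few lines of arithmetic. The sketch below only restates figures from the tables above; the variable names are made up for the example, and the final comment is an inference rather than something the dictionary states.

```python
# Figures copied from this dictionary.
raw_workers, raw_waves, raw_rows = 2_209, 5, 11_045
panel_workers, panel_waves, panel_rows = 2_199, 2, 4_398
schooling_n_raw = 10_995  # non-missing schooling values in raw_data

# A balanced panel has exactly workers x waves rows.
assert raw_workers * raw_waves == raw_rows
assert panel_workers * panel_waves == panel_rows

dropped_workers = raw_workers - panel_workers
missing_schooling = raw_rows - schooling_n_raw
print(dropped_workers, missing_schooling)  # 10 50

# 50 = 10 x 5, which would be consistent with the ten dropped workers
# lacking schooling in all five waves (an inference, not a documented fact).
print(missing_schooling == dropped_workers * raw_waves)  # True
```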