Interactive data dictionary

Introduction to Panel Data Methods in Python

An NLSY-style two-period wage panel for teaching seven panel-data estimators.

2 datasets · 2,209 workers (raw) · 2010–2018 waves (raw) · T = 2 analysis panel

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| raw_data | worker-year | 11,045 × 9 | raw_data.dta | raw_data.csv |
| data_panel | worker-year | 4,398 × 10 | data_panel.dta | data_panel.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
use "${BASE}raw_data.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
df = pd.read_stata(BASE + "raw_data.dta")

# load every dataset at once
files = ["raw_data", "data_panel"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "raw_data.dta", "raw_data.dta")
df, meta = pyreadstat.read_dta("raw_data.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
df <- read_dta(paste0(BASE, "raw_data.dta"))

Overview & sources

Companion data for a beginner-friendly Python tutorial that walks through seven canonical panel-data estimators — pooled OLS, the between estimator, first-differences, the within (fixed-effects) estimator, two-way fixed effects, random effects, and Mundlak's correlated random effects (CRE) — on a single worker wage panel. The running question is whether union membership raises wages. The data are real NLSY-style observations on US workers from the quarcs-lab/data-open repository (isds/wage_panel_bob4.dta). The cross-sectional estimators report a union premium of 7–11 log points; once unobserved worker traits are netted out, the within estimators report roughly 21 log points — a near-tripling that illustrates selection on unobservables. The Hausman test fails to reject random effects (H = 1.79, p = 0.180) while the Mundlak term (−0.144, p = 0.072) hints at negative selection; both point toward CRE/Mundlak as the specification to lead with.

Two files. raw_data is the full NLSY-style download — one row per worker × year, five waves (2010, 2012, 2014, 2016, 2018), 2,209 workers, balanced. data_panel is the tutorial's analysis sample: the same data restricted to 2010 and 2012 only so that T = 2 and the first-difference and within estimators coincide; rows with missing lwage/union/age/schooling are dropped (2,199 workers × 2 years = 4,398 observations, balanced) and a female dummy is added.
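The construction of data_panel described above can be sketched in pandas as follows. This is a minimal sketch on a made-up toy frame, not the tutorial's actual script: the helper name `build_panel` is hypothetical, and the step that keeps only workers observed in both waves is an assumption added here so that the result stays balanced after incomplete rows are dropped.

```python
import pandas as pd

# Toy stand-in for raw_data; all values are invented for illustration only.
raw = pd.DataFrame({
    "ID":        [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "year":      [2010, 2012, 2014] * 3,
    "lwage":     [3.0, 3.1, 3.2, 2.8, 2.9, 3.0, 3.3, 3.4, 3.5],
    "union":     [0, 1, 1, 0, 0, 0, 1, 1, 0],
    "age":       [30, 32, 34, 40, 42, 44, 28, 30, 32],
    "schooling": [12.0, 12.0, 12.0, None, None, None, 16.0, 16.0, 16.0],
    "gender":    ["Female"] * 3 + ["Male"] * 3 + ["Female"] * 3,
})


def build_panel(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep 2010/2012, drop incomplete rows, add `female`."""
    required = ["lwage", "union", "age", "schooling"]
    panel = raw[raw["year"].isin([2010, 2012])].dropna(subset=required).copy()
    # Assumption: keep only workers still observed in both waves (balanced panel).
    n_waves = panel.groupby("ID")["year"].transform("nunique")
    panel = panel[n_waves == 2].copy()
    # Dummy derived from the categorical gender column.
    panel["female"] = (panel["gender"] == "Female").astype(int)
    return panel.reset_index(drop=True)


panel = build_panel(raw)
print(panel.shape)  # rows x columns of the toy analysis panel
```

In the toy frame, worker 2 has no schooling value and drops out entirely, which mirrors how the real sample goes from 2,209 to 2,199 workers.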

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| quarcs-lab data-open | The source wage panel (real NLSY-style worker observations) | QuaRCS-lab data-open repository, isds/wage_panel_bob4.dta. https://github.com/quarcs-lab/data-open |
| Method references | Estimators, specification tests, and the panel-data framework | Wooldridge (2010); Hausman (1978); Mundlak (1978). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Introduction to Panel Data Methods in Python [Data set]. https://carlos-mendez.org/post/python_panel_intro/

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.

Hausman, J. A. (1978). Specification Tests in Econometrics. Econometrica, 46(6), 1251–1271.

Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1), 69–85.

BibTeX

@misc{mendez2026pythonpanelintro,
  author       = {Mendez, Carlos},
  title        = {Introduction to Panel Data Methods in Python},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_panel_intro/}},
  note         = {Data set}
}

@book{wooldridge2010econometric,
  author    = {Wooldridge, Jeffrey M.},
  title     = {Econometric Analysis of Cross Section and Panel Data},
  edition   = {2nd}, publisher = {MIT Press}, year = {2010}
}
@article{hausman1978specification,
  author  = {Hausman, Jerry A.},
  title   = {Specification Tests in Econometrics},
  journal = {Econometrica},
  volume  = {46}, number = {6}, pages = {1251--1271}, year = {1978}
}
@article{mundlak1978pooling,
  author  = {Mundlak, Yair},
  title   = {On the Pooling of Time Series and Cross Section Data},
  journal = {Econometrica},
  volume  = {46}, number = {1}, pages = {69--85}, year = {1978}
}

Variable explorer (all 10 variables)


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| ID | identifier | | Worker identifier | Unique person ID; the panel cross-sectional (unit) dimension. | integer id | raw_data, data_panel | quarcs-lab data-open |
| age | continuous | min 25, median 38, max 54 | Age (years) | Worker age in years at the survey wave. | years | raw_data, data_panel | quarcs-lab data-open |
| female | dummy | share coded 1 = 0.517 | Female (1=yes) | 1 if the worker's gender is Female, else 0; time-invariant covariate. | 0/1 | data_panel | Derived (this study) |
| gender | identifier | | Gender | Worker's reported gender (Female / Male). | category | raw_data, data_panel | quarcs-lab data-open |
| lwage | continuous | min -1.73, median 3.17, max 6.79 | Log hourly wage | Natural log of the hourly wage rate; the outcome variable. | log US$/hour | raw_data, data_panel | quarcs-lab data-open |
| region | identifier | | Census region | US Census region of residence. | category | raw_data, data_panel | quarcs-lab data-open |
| schooling | continuous | min 3, median 15, max 17 | Years of schooling | Completed years of education; time-invariant within a worker over the panel window. | years | raw_data, data_panel | quarcs-lab data-open |
| union | dummy | share coded 1 = 0.162 | Union member (1=yes) | 1 if the worker is a union member in that wave, else 0; the treatment of interest. | 0/1 | raw_data, data_panel | quarcs-lab data-open |
| wagerate | continuous | min 0.177, median 23.9, max 889 | Hourly wage rate | Worker's hourly wage rate (level), the basis for log wage. | US$/hour | raw_data, data_panel | quarcs-lab data-open |
| year | year | | Survey year | Calendar year of the observation; the panel time dimension. | year | raw_data, data_panel | quarcs-lab data-open |

Cross-file variable index

Which file each variable appears in (● = present).

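| Variable | raw_data | data_panel |
|---|---|---|
| ID | ● | ● |
| age | ● | ● |
| female | | ● |
| gender | ● | ● |
| lwage | ● | ● |
| region | ● | ● |
| schooling | ● | ● |
| union | ● | ● |
| wagerate | ● | ● |
| year | ● | ● |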

Construction & formulas

All estimators target the coefficient β on union in the panel model y_it = α_i + β x_it + u_it, where y is lwage, x is union, and α_i is the unobserved worker effect. They differ in which variation identifies β.
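With T = 2, the first-difference and within estimators of β in this model give the same number. The sketch below checks that on simulated data, not on the real wage panel. The sample size, the value of `beta_true`, and the noise scale are all assumptions chosen for illustration, and both regressions are run without an intercept or year effect, matching the model as written above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                    # simulated workers (not the real data)
alpha = rng.normal(size=n)                 # unobserved worker effect alpha_i
x = rng.integers(0, 2, size=(n, 2)).astype(float)  # x_it for the two waves
beta_true = 0.2                            # assumed value for the simulation
y = alpha[:, None] + beta_true * x + rng.normal(scale=0.1, size=(n, 2))

# First differences: regress (y_i2 - y_i1) on (x_i2 - x_i1), no intercept.
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
beta_fd = (dx @ dy) / (dx @ dx)

# Within: subtract each worker's own mean, then pool the demeaned rows.
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
beta_fe = (xw * yw).sum() / (xw ** 2).sum()

print(beta_fd, beta_fe)  # the two estimates agree up to floating-point error
```

The equality follows because, with two waves, each demeaned value is exactly plus or minus half the first difference, so the scaling cancels in the ratio. Both transformations also remove α_i, which is why only workers whose union status changes contribute to the estimate.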

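Between/within variance decomposition (per variable): between SD is SD(x̄_i), within SD is SD(x_it − x̄_i), and the between share is between² / (between² + within²), where x̄_i is worker i's mean of x over the waves.

Hausman: H = (β_FE − β_RE)² / (V_FE − V_RE) ~ χ²(1), where V_FE and V_RE are the estimated variances of the two coefficient estimates.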
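The two formulas above can be sketched in a few lines. The function names are hypothetical, the toy frame and the Hausman inputs are made up, and the use of the sample standard deviation (pandas' default, ddof = 1) is an assumption, since the degrees-of-freedom convention is not stated here. The χ²(1) p-value uses the identity P(χ²(1) > h) = erfc(√(h/2)), so no statistics library is needed.

```python
import math
import pandas as pd


def between_within(df: pd.DataFrame, var: str, unit: str = "ID") -> dict:
    """Between SD, within SD, and between share of variance for one variable."""
    unit_mean = df.groupby(unit)[var].transform("mean")  # x-bar_i on every row
    between_sd = df.groupby(unit)[var].mean().std()       # SD across worker means
    within_sd = (df[var] - unit_mean).std()               # SD of x_it - x-bar_i
    share = between_sd**2 / (between_sd**2 + within_sd**2)
    return {"between_sd": between_sd, "within_sd": within_sd, "between_share": share}


def chi2_1_pvalue(h: float) -> float:
    """Upper-tail probability of a chi-square(1) variable at h."""
    return math.erfc(math.sqrt(h / 2))


def hausman_scalar(b_fe: float, b_re: float, v_fe: float, v_re: float):
    """Scalar Hausman statistic and p-value for a single coefficient."""
    if v_fe <= v_re:
        raise ValueError("V_FE - V_RE must be positive for this scalar form")
    h = (b_fe - b_re) ** 2 / (v_fe - v_re)
    return h, chi2_1_pvalue(h)


# Made-up toy panel: worker 1 switches union status, worker 2 never does.
toy = pd.DataFrame({"ID": [1, 1, 2, 2], "union": [0, 1, 1, 1]})
print(between_within(toy, "union"))

# Made-up coefficient and variance values, only to exercise the function.
print(hausman_scalar(b_fe=0.2, b_re=0.1, v_fe=0.004, v_re=0.002))

# p-value implied by the reported H = 1.79.
print(chi2_1_pvalue(1.79))
```

Evaluating `chi2_1_pvalue(1.79)` gives roughly 0.18, in line with the reported p = 0.180; small differences in the third decimal are expected because H is rounded to two decimals.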

The datasets

Each dataset below has a full variable dictionary followed by a statistics table with distributions and data coverage.


raw_data · worker-year · 11,045 × 9 · 2010, 2012, 2014, 2016, 2018 · 2,209 US workers (balanced, 5 waves)

Panel key: ID x year · Source download; full NLSY-style panel for extensions (e.g. using all five waves).
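Before estimating anything, it can help to confirm that ID and year really form a unique key and that the panel is balanced. The helper below is a hypothetical sketch, shown on a made-up toy frame. It also casts ID to an integer, since the dictionary notes that ID is stored as a float in the CSV.

```python
import pandas as pd


def check_panel_key(df: pd.DataFrame, unit: str = "ID", time: str = "year") -> dict:
    """Hypothetical helper: is (unit, time) a unique key, and is the panel balanced?"""
    df = df.copy()
    df[unit] = df[unit].astype("int64")          # IDs may arrive as floats from CSV
    unique_key = not df.duplicated([unit, time]).any()
    waves_per_unit = df.groupby(unit)[time].nunique()
    n_waves = int(df[time].nunique())
    balanced = bool((waves_per_unit == n_waves).all())
    return {
        "unique_key": bool(unique_key),
        "balanced": balanced,
        "n_units": int(waves_per_unit.size),
        "n_waves": n_waves,
    }


# Made-up toy panel with two workers and two waves.
toy = pd.DataFrame({"ID": [1.0, 1.0, 2.0, 2.0], "year": [2010, 2012, 2010, 2012]})
print(check_panel_key(toy))
```

Applied to the real raw_data file, the counts should match the figures stated above (2,209 workers, 5 waves) if the file is as described.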

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| ID (identifier) | Worker identifier | Unique person ID; the panel cross-sectional (unit) dimension. | From the source NLSY-style file; stored as a float in the CSV. | integer id | quarcs-lab data-open | both files |
| year (year) | Survey year | Calendar year of the observation; the panel time dimension. | From the source file; raw has 2010-2018 (biennial), the panel keeps 2010 and 2012. | year | quarcs-lab data-open | both files |
| age (continuous) | Age (years) | Worker age in years at the survey wave. | From the source file. | years | quarcs-lab data-open | both files |
| wagerate (continuous) | Hourly wage rate | Worker's hourly wage rate (level), the basis for log wage. | From the source file; lwage = log(wagerate). | US$/hour | quarcs-lab data-open | both files |
| schooling (continuous) | Years of schooling | Completed years of education; time-invariant within a worker over the panel window. | From the source file. | years | quarcs-lab data-open | both files |
| region (identifier) | Census region | US Census region of residence. | From the source file (categorical text). | category | quarcs-lab data-open | both files |
| union (dummy) | Union member (1=yes) | 1 if the worker is a union member in that wave, else 0; the treatment of interest. | Mapped from the source Yes/No string to 1/0. | 0/1 | quarcs-lab data-open | both files |
| lwage (continuous) | Log hourly wage | Natural log of the hourly wage rate; the outcome variable. | log(wagerate), from the source file. | log US$/hour | quarcs-lab data-open | both files |
| gender (identifier) | Gender | Worker's reported gender (Female / Male). | From the source file (categorical text). | category | quarcs-lab data-open | both files |

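Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| ID | | 100% | 11,045 | 2,209 | | | | | |
| year | | 100% | 11,045 | 5 | 2010 | 2014.0 | 2014 | 2018 | 2.83 |
| age | min 25, median 38, max 54 | 100% | 11,045 | 30 | 25.00 | 38.71 | 38.00 | 54.00 | 6.80 |
| wagerate | min 0.177, median 23.9, max 889 | 100% | 11,045 | 7,728 | 0.177 | 29.66 | 23.92 | 888.9 | 25.55 |
| schooling | min 3, median 15, max 17 | 100% | 10,995 | 14 | 3.00 | 14.50 | 15.00 | 17.00 | 2.18 |
| region | | 100% | 11,045 | 5 | | | | | |
| union | share coded 1 = 0.162 | 100% | 11,045 | 2 | 0 | 0.162 | 0 | 1.00 | 0.369 |
| lwage | min -1.73, median 3.17, max 6.79 | 100% | 11,045 | 7,726 | -1.73 | 3.19 | 3.17 | 6.79 | 0.603 |
| gender | | 100% | 11,045 | 2 | | | | | |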

data_panel · worker-year · 4,398 × 10 · 2010, 2012 · 2,199 US workers (balanced, 2 waves)

Panel key: ID x year · Estimation sample for all seven panel estimators (POLS / Between / FD / FE / TWFE / RE / CRE).

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| ID (identifier) | Worker identifier | Unique person ID; the panel cross-sectional (unit) dimension. | From the source NLSY-style file; stored as a float in the CSV. | integer id | quarcs-lab data-open | both files |
| year (year) | Survey year | Calendar year of the observation; the panel time dimension. | From the source file; raw has 2010-2018 (biennial), the panel keeps 2010 and 2012. | year | quarcs-lab data-open | both files |
| age (continuous) | Age (years) | Worker age in years at the survey wave. | From the source file. | years | quarcs-lab data-open | both files |
| wagerate (continuous) | Hourly wage rate | Worker's hourly wage rate (level), the basis for log wage. | From the source file; lwage = log(wagerate). | US$/hour | quarcs-lab data-open | both files |
| schooling (continuous) | Years of schooling | Completed years of education; time-invariant within a worker over the panel window. | From the source file. | years | quarcs-lab data-open | both files |
| region (identifier) | Census region | US Census region of residence. | From the source file (categorical text). | category | quarcs-lab data-open | both files |
| union (dummy) | Union member (1=yes) | 1 if the worker is a union member in that wave, else 0; the treatment of interest. | Mapped from the source Yes/No string to 1/0. | 0/1 | quarcs-lab data-open | both files |
| lwage (continuous) | Log hourly wage | Natural log of the hourly wage rate; the outcome variable. | log(wagerate), from the source file. | log US$/hour | quarcs-lab data-open | both files |
| gender (identifier) | Gender | Worker's reported gender (Female / Male). | From the source file (categorical text). | category | quarcs-lab data-open | both files |
| female (dummy) | Female (1=yes) | 1 if the worker's gender is Female, else 0; time-invariant covariate. | Derived in the analysis panel: 1 if gender == 'Female', else 0. | 0/1 | Derived (this study) | data_panel only |

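Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| ID | | 100% | 4,398 | 2,199 | | | | | |
| year | | 100% | 4,398 | 2 | 2010 | 2011.0 | 2011 | 2012 | 1.00 |
| age | min 25, median 35, max 49 | 100% | 4,398 | 25 | 25.00 | 35.68 | 35.00 | 49.00 | 6.26 |
| wagerate | min 0.177, median 22.1, max 430 | 100% | 4,398 | 3,014 | 0.177 | 26.84 | 22.10 | 429.9 | 20.26 |
| schooling | min 3, median 15, max 17 | 100% | 4,398 | 14 | 3.00 | 14.50 | 15.00 | 17.00 | 2.18 |
| region | | 100% | 4,398 | 5 | | | | | |
| union | share coded 1 = 0.163 | 100% | 4,398 | 2 | 0 | 0.163 | 0 | 1.00 | 0.369 |
| lwage | min -1.73, median 3.1, max 6.06 | 100% | 4,398 | 3,014 | -1.73 | 3.11 | 3.10 | 6.06 | 0.598 |
| gender | | 100% | 4,398 | 2 | | | | | |
| female | share coded 1 = 0.517 | 100% | 4,398 | 2 | 0 | 0.517 | 1.00 | 1.00 | 0.500 |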

Known limitations & caveats