Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| raw_data | worker-year | 11,045 × 9 | raw_data.dta | raw_data.csv |
| data_panel | worker-year | 4,398 × 10 | data_panel.dta | data_panel.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
use "${BASE}raw_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
df = pd.read_stata(BASE + "raw_data.dta")
# load every dataset at once
files = ["raw_data", "data_panel"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "raw_data.dta", "raw_data.dta")
df, meta = pyreadstat.read_dta("raw_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_panel_intro/data/"
df <- read_dta(paste0(BASE, "raw_data.dta"))
Overview & sources
Companion data for a beginner-friendly Python tutorial that walks through seven canonical panel-data estimators — pooled OLS, the between estimator, first-differences, the within (fixed-effects) estimator, two-way fixed effects, random effects, and Mundlak's correlated random effects (CRE) — on a single worker wage panel. The running question is whether union membership raises wages. The data are real NLSY-style observations on US workers from the quarcs-lab/data-open repository (isds/wage_panel_bob4.dta). The cross-sectional estimators report a union premium of 7–11 log points; once unobserved worker traits are netted out, the within estimators report roughly 21 log points — a near-tripling that illustrates selection on unobservables. The Hausman test fails to reject random effects (H = 1.79, p = 0.180) while the Mundlak term (−0.144, p = 0.072) hints at negative selection; both point toward CRE/Mundlak as the specification to lead with.
raw_data is the full NLSY-style download — one row per worker × year, five waves (2010, 2012, 2014, 2016, 2018), 2,209 workers, balanced. data_panel is the tutorial's analysis sample: the same data restricted to 2010 and 2012 only so that T = 2 and the first-difference and within estimators coincide; rows with missing lwage/union/age/schooling are dropped (2,199 workers × 2 years = 4,398 observations, balanced) and a female dummy is added.
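The construction of data_panel described above can be sketched in pandas. This is a minimal illustration, not the tutorial's actual script: the function name `build_panel` and the toy values are hypothetical, and the step that drops a worker entirely when either year has a missing value is an assumption made so the result stays balanced.

```python
import numpy as np
import pandas as pd

def build_panel(raw: pd.DataFrame) -> pd.DataFrame:
    """Restrict a worker-year table to 2010 and 2012 and add a female dummy."""
    required = ["lwage", "union", "age", "schooling"]
    panel = raw[raw["year"].isin([2010, 2012])].dropna(subset=required).copy()
    # Assumption: a worker with a missing value in either year is dropped
    # entirely, so every remaining worker has exactly two rows.
    n_years = panel.groupby("ID")["year"].transform("nunique")
    panel = panel[n_years == 2].copy()
    panel["female"] = (panel["gender"] == "Female").astype(int)
    return panel.sort_values(["ID", "year"]).reset_index(drop=True)

# Toy input (made-up values, not the real data): 3 workers x 3 waves.
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "year": [2010, 2012, 2014] * 3,
    "lwage": [3.0, 3.1, 3.2, 2.8, 2.9, 3.0, 3.3, 3.4, 3.5],
    "union": [0, 1, 1, 0, 0, 0, 1, 1, 1],
    "age": [30, 32, 34, 40, 42, 44, 28, 30, 32],
    "schooling": [12, 12, 12, 16, 16, 16, 14, np.nan, 14],
    "gender": ["Female"] * 3 + ["Male"] * 3 + ["Female"] * 3,
})

panel = build_panel(toy)
# True when every remaining worker appears in exactly two years.
is_balanced = bool((panel.groupby("ID").size() == 2).all())
```

Applied to the real raw_data file, the result would need to be checked against the documented 4,398 rows and 2,199 workers before relying on it.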
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| quarcs-lab data-open | The source wage panel (real NLSY-style worker observations) | QuaRCS-lab data-open repository, isds/wage_panel_bob4.dta. https://github.com/quarcs-lab/data-open |
| Method references | Estimators, specification tests, and the panel-data framework | Wooldridge (2010); Hausman (1978); Mundlak (1978). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Introduction to Panel Data Methods in Python [Data set]. https://carlos-mendez.org/post/python_panel_intro/
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.
Hausman, J. A. (1978). Specification Tests in Econometrics. Econometrica, 46(6), 1251–1271.
Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1), 69–85.
BibTeX
@misc{mendez2026pythonpanelintro,
author = {Mendez, Carlos},
title = {Introduction to Panel Data Methods in Python},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_panel_intro/}},
note = {Data set}
}
@book{wooldridge2010econometric,
author = {Wooldridge, Jeffrey M.},
title = {Econometric Analysis of Cross Section and Panel Data},
edition = {2nd}, publisher = {MIT Press}, year = {2010}
}
@article{hausman1978specification,
author = {Hausman, Jerry A.},
title = {Specification Tests in Econometrics},
journal = {Econometrica},
volume = {46}, number = {6}, pages = {1251--1271}, year = {1978}
}
@article{mundlak1978pooling,
author = {Mundlak, Yair},
title = {On the Pooling of Time Series and Cross Section Data},
journal = {Econometrica},
volume = {46}, number = {1}, pages = {69--85}, year = {1978}
}
Variable explorer (all 10 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| ID | identifier | Worker identifier | Unique person ID; the panel cross-sectional (unit) dimension. | integer id | raw_data, data_panel | quarcs-lab data-open |
| age | continuous | Age (years) | Worker age in years at the survey wave. | years | raw_data, data_panel | quarcs-lab data-open |
| female | dummy | Female (1=yes) | 1 if the worker's gender is Female, else 0; time-invariant covariate. | 0/1 | data_panel | Derived (this study) |
| gender | identifier | Gender | Worker's reported gender (Female / Male). | category | raw_data, data_panel | quarcs-lab data-open |
| lwage | continuous | Log hourly wage | Natural log of the hourly wage rate; the outcome variable. | log US$/hour | raw_data, data_panel | quarcs-lab data-open |
| region | identifier | Census region | US Census region of residence. | category | raw_data, data_panel | quarcs-lab data-open |
| schooling | continuous | Years of schooling | Completed years of education; time-invariant within a worker over the panel window. | years | raw_data, data_panel | quarcs-lab data-open |
| union | dummy | Union member (1=yes) | 1 if the worker is a union member in that wave, else 0; the treatment of interest. | 0/1 | raw_data, data_panel | quarcs-lab data-open |
| wagerate | continuous | Hourly wage rate | Worker's hourly wage rate (level), the basis for log wage. | US$/hour | raw_data, data_panel | quarcs-lab data-open |
| year | year | Survey year | Calendar year of the observation; the panel time dimension. | year | raw_data, data_panel | quarcs-lab data-open |
Cross-file variable index
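Which file each variable appears in (● = present).
| Variable | raw_data | data_panel |
|---|---|---|
| ID | ● | ● |
| year | ● | ● |
| age | ● | ● |
| wagerate | ● | ● |
| schooling | ● | ● |
| region | ● | ● |
| union | ● | ● |
| lwage | ● | ● |
| gender | ● | ● |
| female | | ● |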
Construction & formulas
All estimators target the coefficient $\beta$ on union in the panel model $y_{it} = \alpha_i + \beta x_{it} + u_{it}$, where $y$ is lwage, $x$ is union, and $\alpha_i$ is the unobserved worker effect. They differ in which variation identifies $\beta$.
- Pooled OLS (POLS): OLS on every row, ignoring the panel structure; uses all variation.
- Between: OLS on worker means, $\bar{y}_i$ on $\bar{x}_i$; cross-sectional variation only.
- First-differences (FDFE): regress $\Delta y_{it}$ on $\Delta x_{it}$; $\alpha_i$ cancels by differencing.
- Within / Fixed effects (FE): OLS on demeaned data $\tilde{x}_{it} = x_{it} - \bar{x}_i$; $\alpha_i$ vanishes by demeaning. With T = 2, FE ≈ FDFE.
- Two-way FE (TWFE): absorbs worker effects $\alpha_i$ and year effects $\delta_t$.
- Random effects (RE): GLS treating $\alpha_i$ as a random draw uncorrelated with the regressors; a weighted blend of between and within.
- Correlated random effects (CRE / Mundlak): RE plus the worker mean $\bar{x}_i$ of each time-varying regressor; the within coefficient equals FE and the mean coefficient tests RE-vs-FE.
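Two of the claims in the list can be checked numerically with plain NumPy: with T = 2 the first-difference and within slopes match, and adding the worker mean to a pooled regression reproduces the within slope. The sketch below uses synthetic data, not the wage panel; the sample size, the true coefficient of 0.2, and all variable names are illustrative assumptions. The first-difference regression is run without an intercept and the within regression without year effects, which is the case where the two slopes are numerically identical. The Mundlak step is done by pooled OLS rather than GLS, a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_years = 500, 2          # synthetic balanced panel, T = 2

alpha = rng.normal(size=n_workers)   # unobserved worker effect alpha_i
# Binary regressor correlated with alpha_i (selection on unobservables).
x = (rng.normal(size=(n_workers, n_years)) + alpha[:, None] > 0.5).astype(float)
y = alpha[:, None] + 0.2 * x + rng.normal(scale=0.3, size=(n_workers, n_years))

# First differences (no intercept): regress delta y on delta x.
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
beta_fd = float(dx @ dy / (dx @ dx))

# Within / fixed effects: OLS on data demeaned by worker.
x_til = x - x.mean(axis=1, keepdims=True)
y_til = y - y.mean(axis=1, keepdims=True)
beta_fe = float((x_til * y_til).sum() / (x_til ** 2).sum())

# Mundlak device via pooled OLS: regress y on [1, x, worker mean of x].
x_bar = np.repeat(x.mean(axis=1), n_years)   # aligned with x.ravel()
X = np.column_stack([np.ones(n_workers * n_years), x.ravel(), x_bar])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
beta_mundlak = float(coef[1])                # coefficient on x itself

print(beta_fd, beta_fe, beta_mundlak)
```

The three printed slopes agree up to floating-point error. The coefficient on the worker mean, `coef[2]`, is the term that the RE-vs-FE comparison looks at.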
Between/within variance decomposition (per variable): between SD is $\mathrm{SD}(\bar{x}_i)$, within SD is $\mathrm{SD}(x_{it} - \bar{x}_i)$, and the between share is $\text{between}^2 / (\text{between}^2 + \text{within}^2)$.
Hausman: $H = (\beta_{FE} - \beta_{RE})^2 / (V_{FE} - V_{RE}) \sim \chi^2(1)$.
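The two formulas above translate into a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the sample standard deviation (`ddof=1`) is a choice the page does not specify, and the p-value uses the identity that the upper tail of a chi-squared distribution with one degree of freedom equals `erfc(sqrt(H / 2))`.

```python
import math
import numpy as np

def between_within(x):
    """x: balanced (workers, years) array.

    Returns (between SD, within SD, between share).
    Assumption: sample SDs with ddof=1; the source does not state the divisor.
    """
    x = np.asarray(x, dtype=float)
    x_bar = x.mean(axis=1)                      # worker means
    between = x_bar.std(ddof=1)                 # SD of worker means
    within = (x - x_bar[:, None]).std(ddof=1)   # SD of deviations from means
    share = between**2 / (between**2 + within**2)
    return between, within, share

def hausman_scalar(b_fe, b_re, v_fe, v_re):
    """Single-coefficient Hausman statistic H = (b_FE - b_RE)^2 / (V_FE - V_RE)."""
    return (b_fe - b_re) ** 2 / (v_fe - v_re)

def chi2_1_pvalue(h):
    """Upper-tail probability of a chi-squared(1) variable at h."""
    return math.erfc(math.sqrt(h / 2.0))

# A variable that never changes within a worker has zero within SD,
# so its between share is 1 (illustrative values).
b, w, s = between_within([[12, 12], [14, 14], [16, 16]])

# Illustrative inputs only: squared gap of 1.0 over a variance gap of 1.0.
h_demo = hausman_scalar(b_fe=1.0, b_re=0.0, v_fe=2.0, v_re=1.0)

# p-value implied by the reported H = 1.79.
p_reported = chi2_1_pvalue(1.79)
```

Evaluated at the reported H = 1.79, `chi2_1_pvalue` gives a value close to the reported p = 0.180; a small gap is expected because H is rounded to two decimals.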
The datasets
Each dataset below has a full variable dictionary and a statistics table with data coverage.
Variable dictionary (raw_data)
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| ID | identifier | Worker identifier | Unique person ID; the panel cross-sectional (unit) dimension. | From the source NLSY-style file; stored as a float in the CSV. | integer id | quarcs-lab data-open | both files |
| year | year | Survey year | Calendar year of the observation; the panel time dimension. | From the source file; raw has 2010-2018 (biennial), the panel keeps 2010 and 2012. | year | quarcs-lab data-open | both files |
| age | continuous | Age (years) | Worker age in years at the survey wave. | From the source file. | years | quarcs-lab data-open | both files |
| wagerate | continuous | Hourly wage rate | Worker's hourly wage rate (level), the basis for log wage. | From the source file; lwage = log(wagerate). | US$/hour | quarcs-lab data-open | both files |
| schooling | continuous | Years of schooling | Completed years of education; time-invariant within a worker over the panel window. | From the source file. | years | quarcs-lab data-open | both files |
| region | identifier | Census region | US Census region of residence. | From the source file (categorical text). | category | quarcs-lab data-open | both files |
| union | dummy | Union member (1=yes) | 1 if the worker is a union member in that wave, else 0; the treatment of interest. | Mapped from the source Yes/No string to 1/0. | 0/1 | quarcs-lab data-open | both files |
| lwage | continuous | Log hourly wage | Natural log of the hourly wage rate; the outcome variable. | log(wagerate), from the source file. | log US$/hour | quarcs-lab data-open | both files |
| gender | identifier | Gender | Worker's reported gender (Female / Male). | From the source file (categorical text). | category | quarcs-lab data-open | both files |
Distribution & statistics (raw_data)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| ID | 100% | 11,045 | 2,209 | – | – | – | – | – |
| year | 100% | 11,045 | 5 | 2010 | 2014.0 | 2014 | 2018 | 2.83 |
| age | 100% | 11,045 | 30 | 25.00 | 38.71 | 38.00 | 54.00 | 6.80 |
| wagerate | 100% | 11,045 | 7,728 | 0.177 | 29.66 | 23.92 | 888.9 | 25.55 |
| schooling | 100% | 10,995 | 14 | 3.00 | 14.50 | 15.00 | 17.00 | 2.18 |
| region | 100% | 11,045 | 5 | – | – | – | – | – |
| union | 100% | 11,045 | 2 | 0 | 0.162 | 0 | 1.00 | 0.369 |
| lwage | 100% | 11,045 | 7,726 | -1.73 | 3.19 | 3.17 | 6.79 | 0.603 |
| gender | 100% | 11,045 | 2 | – | – | – | – | – |
Variable dictionary (data_panel)
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| ID | identifier | Worker identifier | Unique person ID; the panel cross-sectional (unit) dimension. | From the source NLSY-style file; stored as a float in the CSV. | integer id | quarcs-lab data-open | both files |
| year | year | Survey year | Calendar year of the observation; the panel time dimension. | From the source file; raw has 2010-2018 (biennial), the panel keeps 2010 and 2012. | year | quarcs-lab data-open | both files |
| age | continuous | Age (years) | Worker age in years at the survey wave. | From the source file. | years | quarcs-lab data-open | both files |
| wagerate | continuous | Hourly wage rate | Worker's hourly wage rate (level), the basis for log wage. | From the source file; lwage = log(wagerate). | US$/hour | quarcs-lab data-open | both files |
| schooling | continuous | Years of schooling | Completed years of education; time-invariant within a worker over the panel window. | From the source file. | years | quarcs-lab data-open | both files |
| region | identifier | Census region | US Census region of residence. | From the source file (categorical text). | category | quarcs-lab data-open | both files |
| union | dummy | Union member (1=yes) | 1 if the worker is a union member in that wave, else 0; the treatment of interest. | Mapped from the source Yes/No string to 1/0. | 0/1 | quarcs-lab data-open | both files |
| lwage | continuous | Log hourly wage | Natural log of the hourly wage rate; the outcome variable. | log(wagerate), from the source file. | log US$/hour | quarcs-lab data-open | both files |
| gender | identifier | Gender | Worker's reported gender (Female / Male). | From the source file (categorical text). | category | quarcs-lab data-open | both files |
| female | dummy | Female (1=yes) | 1 if the worker's gender is Female, else 0; time-invariant covariate. | Derived in the analysis panel: 1 if gender == 'Female', else 0. | 0/1 | Derived (this study) | data_panel only |
Distribution & statistics (data_panel)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| ID | 100% | 4,398 | 2,199 | – | – | – | – | – |
| year | 100% | 4,398 | 2 | 2010 | 2011.0 | 2011 | 2012 | 1.00 |
| age | 100% | 4,398 | 25 | 25.00 | 35.68 | 35.00 | 49.00 | 6.26 |
| wagerate | 100% | 4,398 | 3,014 | 0.177 | 26.84 | 22.10 | 429.9 | 20.26 |
| schooling | 100% | 4,398 | 14 | 3.00 | 14.50 | 15.00 | 17.00 | 2.18 |
| region | 100% | 4,398 | 5 | – | – | – | – | – |
| union | 100% | 4,398 | 2 | 0 | 0.163 | 0 | 1.00 | 0.369 |
| lwage | 100% | 4,398 | 3,014 | -1.73 | 3.11 | 3.10 | 6.06 | 0.598 |
| gender | 100% | 4,398 | 2 | – | – | – | – | – |
| female | 100% | 4,398 | 2 | 0 | 0.517 | 1.00 | 1.00 | 0.500 |
Known limitations & caveats
- Two periods only in the analysis panel. data_panel keeps just 2010 and 2012 (T = 2) so the first-difference and within estimators coincide; with only two waves the fixed-effects estimate is power-limited and the Hausman test has low power. raw_data retains all five waves for extensions.
- Thin within variation. Union status is 93.9% between-worker and only 9.1% within; schooling has zero within-variation in the two-period window, so fixed effects mechanically drops it.
- 10 workers dropped. The analysis panel keeps 2,199 of the 2,209 raw workers; ten are dropped for missing values in lwage/union/age/schooling.
- Estimands differ. Within estimators (FDFE/FE/TWFE/CRE) identify the effect for union switchers under strict exogeneity; POLS/Between report a population-weighted association without a causal interpretation absent unconfoundedness.