Panel Standard Errors — Interactive Lab

A pedagogical companion to Standard Errors in Panel Data: A Beginner's Guide in Python.

Standard errors, fixed effects, and the bias–inference split

A regression's headline number is its point estimate. Whether that number is meaningful depends on the standard error beside it. In panel data (the same firm observed year after year), the textbook standard-error formula is almost always wrong. This app shows the problem in three ways: a forest plot of eight SE estimators applied to the same coefficient, a Monte Carlo bar chart of empirical rejection rates, and a simulation lab that builds intuition for sampling distributions across many draws.

The headline takeaway has two parts. First, no SE choice can rescue a biased point estimate: pooled OLS gives β̂ = 1.03 against a true β = 0.5, and changing the standard error never moves the coefficient. Second, once fixed effects remove the bias (β̂ = 0.48), the SE choice determines whether your 95% CI actually covers the truth 95% of the time. Entity-clustering on 100 firms rejects close to the nominal 5%; time-clustering on only 10 years over-rejects.
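To see the bias half of the story in isolation, here is a minimal sketch with a hypothetical DGP (not the post's). An unobserved firm "ability" raises both R&D intensity x and performance y. By construction, pooled OLS converges to β + Cov(x, ability)/Var(x) = 0.5 + 1/2 = 1.0, while the within (demeaned) estimator recovers β = 0.5. The helper names are illustrative, and no SE calculation appears because no SE choice can move either slope.

```python
import numpy as np

def simulate_panel(n_firms=100, n_years=10, beta=0.5, seed=0):
    # Hypothetical DGP: time-invariant firm "ability" drives both x and y
    rng = np.random.default_rng(seed)
    firm = np.repeat(np.arange(n_firms), n_years)          # firm id of each row
    ability = rng.normal(size=n_firms)[firm]                # unobserved confounder
    x = ability + rng.normal(size=firm.size)                # R&D intensity
    y = beta * x + ability + rng.normal(size=firm.size)     # firm performance
    return firm, x, y

def pooled_ols_slope(x, y):
    # Slope from regressing y on a constant and x (ignores the panel structure)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

def within_slope(firm, x, y):
    # Fixed effects via the within transformation: subtract each firm's own mean
    counts = np.bincount(firm)
    x_dm = x - (np.bincount(firm, weights=x) / counts)[firm]
    y_dm = y - (np.bincount(firm, weights=y) / counts)[firm]
    return (x_dm @ y_dm) / (x_dm @ x_dm)

firm, x, y = simulate_panel()
b_pooled = pooled_ols_slope(x, y)   # probability limit 1.0 under this DGP
b_fe = within_slope(firm, x, y)     # probability limit 0.5, the true beta
print(f"pooled OLS slope: {b_pooled:.2f}")
print(f"within FE slope:  {b_fe:.2f}")
```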

Why does panel data break the textbook standard error?

Conventional standard errors assume every observation is an independent draw. In a panel, two observations from the same firm (say, firm 1 in 2015 and firm 1 in 2016) share the firm's idiosyncrasies and so are not independent. Ten correlated observations of one firm carry less information than ten independent ones, so a formula that treats every row as a fresh draw typically overstates precision. The animation below contrasts two pull-toward-zero forces: an L1 (LASSO) penalty drives coefficients exactly to zero, while an L2 (Ridge) penalty only shrinks them toward zero without ever reaching it. Use it as a visual reminder that two formulas applied to the same data can give qualitatively different answers. The SE choice teaches the same lesson in panel regressions.
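How much does within-firm correlation matter? A classic back-of-the-envelope answer is the Moulton factor. Take a balanced panel with T observations per firm, within-firm correlation ρ_x in the regressor and ρ_u in the error. The true sampling variance of the OLS slope is then roughly 1 + (T − 1)ρ_x ρ_u times what the conventional formula reports. The correlations below are hypothetical and chosen only for illustration.

```python
import math

def moulton_factor(obs_per_cluster, rho_x, rho_u):
    # Approximate ratio of the true OLS variance to the conventional (i.i.d.) variance
    # for a balanced panel with equicorrelated regressor and error within each cluster
    return 1 + (obs_per_cluster - 1) * rho_x * rho_u

# Hypothetical values: 10 years per firm, within-firm correlation 0.5 in x and in the error
factor = moulton_factor(obs_per_cluster=10, rho_x=0.5, rho_u=0.5)
se_inflation = math.sqrt(factor)   # how many times too small the conventional SE is
print(factor)                      # 3.25
print(round(se_inflation, 2))      # 1.8
```

Under these assumptions the conventional SE would be about 1.8 times too small, even though the point estimate itself is unaffected.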


SE Forest Plot

The post's eight estimator rows as a forest plot. The teal dashed line marks the truth β = 0.5. Pooled OLS bars never touch it; fixed-effects bars do.


Rejection Rates

Monte Carlo (500 replications) empirical rejection rates for six FE + SE combinations. The dashed line at 5% is the nominal target. Time-clustered SEs land at 9.0%.


Bias vs Variance Lab

A simulation sandbox: see how 100 fresh draws from the same data-generating process (DGP) produce a sampling distribution. The width of that distribution is what an honest SE should reflect.

Glossary

Bias vs inference
Bias: the point estimate is wrong on average. Inference: the standard error misstates the uncertainty. SEs cannot fix bias; only the right estimator (e.g., fixed effects) can.
Conventional SE
Var(β̂) = σ²(X'X)⁻¹, with σ² replaced by its estimate σ̂²; the SE is the square root of the diagonal. Assumes errors are i.i.d. Almost never appropriate in panel data because of within-cluster correlation.
White / HC SE
Allows heteroskedasticity (variance differs across observations) but still assumes independence. Doesn't help when the issue is correlation.
Cluster-robust SE
Allows arbitrary correlation within a cluster (e.g., firm). Standard in microeconomics. Requires "enough" clusters (rule of thumb: 40–50+).
Two-way clustering
Clusters along both entity and time. Insurance when errors plausibly correlate along both dimensions (Cameron, Gelbach & Miller 2011).
Driscoll–Kraay SE
Newey-West-style time kernel applied to cross-sectional averages. Robust to cross-sectional (including spatial) dependence as well as serial correlation; best when T is large.
Fixed effects (FE)
Subtract each entity's own mean from its observations. Removes any time-invariant confounder (e.g., firm ability) and so removes a major source of omitted-variable bias.
Rejection rate (size)
Pr(reject H₀ | H₀ true). At α = 0.05, this should equal 5%. Over-rejection = rejecting a true H₀ more than 5% of the time (too many false positives). Under-rejection = rejecting it less than 5% of the time, i.e. intervals wider than needed (conservative).
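Most of the SE definitions above differ only in the "meat" M of the sandwich (X'X)⁻¹ M (X'X)⁻¹. Below is a minimal NumPy sketch for pooled OLS. It omits the small-sample scaling factors that packages usually apply, so numbers will differ slightly from Stata or statsmodels output. The function names and the simulated panel are hypothetical, not any library's API, and Driscoll–Kraay is left out to keep it short.

```python
import numpy as np

def ols(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    return beta, y - X @ beta, XtX_inv

def se_conventional(X, resid, XtX_inv):
    # sigma_hat^2 (X'X)^-1, valid only under i.i.d. errors
    n, k = X.shape
    return np.sqrt(np.diag(resid @ resid / (n - k) * XtX_inv))

def cluster_meat(X, resid, groups):
    # Sum over clusters g of (X_g' u_g)(X_g' u_g)'
    _, idx = np.unique(groups, return_inverse=True)
    sums = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(sums, idx, X * resid[:, None])
    return sums.T @ sums

def se_sandwich(XtX_inv, meat):
    V = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.clip(np.diag(V), 0, None))  # two-way V can have tiny negative diagonals

def se_white(X, resid, XtX_inv):
    # HC0: every observation is its own cluster
    return se_sandwich(XtX_inv, cluster_meat(X, resid, np.arange(len(resid))))

def se_cluster(X, resid, XtX_inv, groups):
    # One-way cluster-robust (CR0)
    return se_sandwich(XtX_inv, cluster_meat(X, resid, groups))

def se_two_way(X, resid, XtX_inv, firm, year):
    # Cameron-Gelbach-Miller: M_firm + M_year - M_(firm x year)
    firm_year = firm * (year.max() + 1) + year
    meat = (cluster_meat(X, resid, firm) + cluster_meat(X, resid, year)
            - cluster_meat(X, resid, firm_year))
    return se_sandwich(XtX_inv, meat)

# Hypothetical panel: firm components in both x and the error, no confounding
rng = np.random.default_rng(0)
n_firms, n_years = 100, 10
firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(n_years), n_firms)
x = rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
y = 0.5 * x + rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
X = np.column_stack([np.ones_like(x), x])
beta, resid, XtX_inv = ols(X, y)
for name, se in [("conventional", se_conventional(X, resid, XtX_inv)),
                 ("White (HC0)", se_white(X, resid, XtX_inv)),
                 ("firm-cluster", se_cluster(X, resid, XtX_inv, firm)),
                 ("year-cluster", se_cluster(X, resid, XtX_inv, year)),
                 ("two-way", se_two_way(X, resid, XtX_inv, firm, year))]:
    print(f"{name:<13} SE(slope) = {se[1]:.4f}")
```

In this DGP, the firm-clustered SE should come out clearly larger than the conventional and White SEs. Those two ignore the within-firm correlation.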

Eight estimators, one coefficient — a forest plot

All eight rows below estimate the same parameter — the effect of R&D intensity (x) on firm performance (y) — using the same simulated panel (100 firms × 10 years). What differs is the model (pooled OLS vs. entity FE vs. two-way FE) and the SE estimator (six varieties, from conventional through Driscoll-Kraay). The teal dashed line marks the truth, β = 0.5. Toggle a method off to focus the view.

What to look for

  • All pooled OLS bars miss the truth. None of the five SE choices rescues the biased β̂ = 1.03. This is the §13.1 lesson visualised: standard errors address precision, not accuracy.
  • The two FE bars (β̂ ≈ 0.48) cover the truth. Demeaning removes the firm-level confounder, and that is what moves the point estimate to where it belongs.
  • Hover any row for the t-statistic and full 95% CI. The most striking row is "Pooled OLS (Driscoll-Kraay)" with t = 65.4 — an impressively significant result for a coefficient that is more than double the truth.
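The hover numbers follow from two lines of arithmetic: t = β̂/SE tests H₀: β = 0, and the 95% CI is β̂ ± 1.96·SE. Plugging in the rounded Driscoll-Kraay row (β̂ = 1.03, SE = 0.0158) gives t ≈ 65.2 rather than 65.4, presumably because the app uses the unrounded β̂. The helper name below is illustrative.

```python
def t_stat_and_ci(beta_hat, se, null=0.0, z=1.96):
    # t-statistic for H0: beta = null, and a normal-approximation 95% CI
    t = (beta_hat - null) / se
    return t, (beta_hat - z * se, beta_hat + z * se)

t0, (lo, hi) = t_stat_and_ci(1.03, 0.0158)            # the row's headline t (H0: beta = 0)
t_truth, _ = t_stat_and_ci(1.03, 0.0158, null=0.5)    # same row, testing the true beta
print(round(t0, 1), round(t_truth, 1))                # 65.2 33.5
print(round(lo, 3), round(hi, 3))                     # 0.999 1.061
print(lo <= 0.5 <= hi)                                # False: the CI excludes the truth
```

Even tested against the true β = 0.5, the row's t is above 33. A tight SE around a biased estimate confidently rejects the truth.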


Why does Driscoll-Kraay look so confident?

Driscoll-Kraay (β̂ = 1.03, SE = 0.0158, t = 65.4) has the narrowest CI of the eight rows. The estimator is robust to cross-sectional dependence, but the simulated DGP has very weak cross-sectional dependence: firms are conditionally independent given firm ability. Driscoll-Kraay averages the regression scores across firms within each year and applies a bandwidth-3 kernel to those 10 yearly averages. In this DGP that produces an SE smaller than the entity-clustered one, and with only T = 10 years the estimator's large-T justification is thin. In a panel with strong common shocks (e.g., banks during the 2008 crisis), the same estimator would give a much wider interval. The narrow CI here reflects the DGP, not a universal property of DK.
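A minimal sketch of the Driscoll–Kraay calculation for OLS shows where the bandwidth enters. The scores x_it·u_it are summed over firms within each period (averages would also work after rescaling), and a Newey-West (Bartlett) kernel is applied to that short time series. It assumes "bandwidth 3" means 3 lags with weights 1 − j/(3 + 1). Software differs in this convention and in small-sample scaling, so treat the function (a hypothetical name) as illustrative rather than as the app's exact computation.

```python
import numpy as np

def driscoll_kraay_se(X, resid, period, lags=3):
    # Driscoll-Kraay SE for OLS: Bartlett-weighted HAC on per-period score sums
    XtX_inv = np.linalg.inv(X.T @ X)
    periods = np.unique(period)                      # sorted, so h is in time order
    h = np.vstack([(X[period == t] * resid[period == t][:, None]).sum(axis=0)
                   for t in periods])                # one K-vector per period
    S = h.T @ h                                      # lag-0 term
    for j in range(1, lags + 1):
        w = 1 - j / (lags + 1)                       # Bartlett weight
        gamma = h[j:].T @ h[:-j]                     # sum over t of h_t h_{t-j}'
        S += w * (gamma + gamma.T)
    V = XtX_inv @ S @ XtX_inv
    return np.sqrt(np.diag(V))

# Hypothetical panel with a common yearly shock (cross-sectional dependence)
rng = np.random.default_rng(0)
n_firms, n_years = 100, 10
year = np.tile(np.arange(n_years), n_firms)
x = rng.normal(size=n_years)[year] + rng.normal(size=year.size)
y = 0.5 * x + rng.normal(size=n_years)[year] + rng.normal(size=year.size)
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
print(driscoll_kraay_se(X, resid, year, lags=3))
```

With lags=0 the formula collapses to clustering on year. With only 10 periods, the whole "meat" matrix is built from 10 vectors, which is why a large T matters.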

Monte Carlo rejection rates — does the test land at 5%?

For each combination of FE model + SE estimator, we simulate 500 independent datasets from the same DGP and ask: across those 500 runs, how often does the 95% CI miss the true β = 0.5? An honest test should reject about 5% of the time. A rate materially above 5% is over-rejection (false positives); a rate materially below 5% is conservative.
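One cell of this experiment (FE + entity-cluster) can be sketched in a few lines. The minimal NumPy version below assumes a hypothetical DGP: firm ability drives both x and y, with i.i.d. idiosyncratic noise. It fits the within (FE) estimator, uses a CR0 entity-clustered SE with no small-sample correction, and counts how often |t| exceeds 1.96 when testing the true β. It is not the post's code, so its rejection rate will not match the bars exactly.

```python
import numpy as np

def demean_by(v, groups):
    # Within transformation: subtract each group's mean
    return v - (np.bincount(groups, weights=v) / np.bincount(groups))[groups]

def fe_cluster_t(firm, x, y, beta0):
    # Within (FE) slope with a CR0 entity-clustered SE; returns t for H0: beta = beta0
    xd, yd = demean_by(x, firm), demean_by(y, firm)
    sxx = xd @ xd
    b = (xd @ yd) / sxx
    u = yd - b * xd
    firm_scores = np.bincount(firm, weights=xd * u)   # sum of x*u within each firm
    se = np.sqrt(firm_scores @ firm_scores) / sxx
    return (b - beta0) / se

def rejection_rate(n_reps=500, n_firms=100, n_years=10, beta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    firm = np.repeat(np.arange(n_firms), n_years)
    rejections = 0
    for _ in range(n_reps):
        ability = rng.normal(size=n_firms)[firm]            # hypothetical confounder
        x = ability + rng.normal(size=firm.size)
        y = beta * x + ability + rng.normal(size=firm.size)
        t = fe_cluster_t(firm, x, y, beta)                  # test the TRUE beta
        rejections += int(abs(t) > 1.96)                    # 5% two-sided test
    return rejections / n_reps

print(f"empirical rejection rate: {rejection_rate():.3f}")
```

Swapping the clustering variable, or the SE formula inside `fe_cluster_t`, gives the other bars' analogues.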

What the bars are telling you

  • Four combinations land near 5%. FE + conventional (6.0%), FE + White (6.4%) and FE + entity-cluster (6.6%) are within simulation noise of the nominal target, while FE + both-cluster (7.8%) sits somewhat above it. After demeaning, the within-firm residuals are reasonably well-behaved.
  • FE + time-cluster over-rejects badly (9.0%). With only 10 year-clusters, the asymptotic theory behind cluster-robust SEs is a poor guide. The standard error is too small, the t-statistic too large, and false positives nearly double the nominal rate.
  • TWFE + entity-cluster under-rejects (3.2%). Absorbing time effects costs degrees of freedom and slightly inflates the SE, giving wider intervals than needed; erring on the conservative side is much safer than being over-confident.
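How big is "simulation noise" with 500 replications? If the true rejection rate were exactly 5%, the Monte Carlo estimate would have a binomial standard error of √(0.05 × 0.95 / 500) ≈ 0.97 percentage points. That puts the 95% band at roughly 3.1% to 6.9%. The sketch below checks the six reported rates against that band. The helper name is illustrative.

```python
import math

def mc_band(p0=0.05, n_reps=500, z=1.96):
    # Binomial standard error of a simulated rejection rate when the true rate is p0
    se = math.sqrt(p0 * (1 - p0) / n_reps)
    return se, (p0 - z * se, p0 + z * se)

se, (lo, hi) = mc_band()
print(f"MC standard error: {se:.4f}")        # 0.0097
print(f"95% band: {lo:.3f} to {hi:.3f}")     # 0.031 to 0.069

rates = {"FE + conventional": 0.060, "FE + White": 0.064, "FE + entity-cluster": 0.066,
         "FE + both-cluster": 0.078, "FE + time-cluster": 0.090,
         "TWFE + entity-cluster": 0.032}
outside = [name for name, r in rates.items() if not lo <= r <= hi]
print(outside)   # ['FE + both-cluster', 'FE + time-cluster']
```

By this yardstick, only the two specifications that cluster on the 10 years sit outside the band. The 3.2% rate lies just inside its lower edge.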

Rule of thumb

Cluster on the dimension with the larger number of groups. In this DGP that means firms (100), not years (10). With fewer than ~30–40 clusters the cluster-robust SE becomes unreliable, and you need a small-sample remedy (e.g., the wild cluster bootstrap or the CR2 correction). The post's §13.2 decision framework is the takeaway: always start with the right model (FE), then pick an SE estimator with enough clusters along the chosen dimension.
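For the few-clusters case, here is a minimal sketch of one such remedy: the restricted wild cluster bootstrap-t with Rademacher weights. It imposes the null, flips the sign of each cluster's residuals at random, and compares the observed cluster-robust t with its bootstrap distribution. The function names and the simulated 10-year panel are hypothetical, and CR2 is not shown. With 10 clusters there are only 2¹⁰ = 1,024 distinct sign patterns, which limits how finely the p-value can resolve.

```python
import numpy as np

def cluster_t(X, y, groups, coef, beta0):
    # OLS plus a CR0 cluster-robust t-statistic for H0: beta[coef] = beta0
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b
    sums = np.zeros((groups.max() + 1, X.shape[1]))
    np.add.at(sums, groups, X * u[:, None])
    V = XtX_inv @ sums.T @ sums @ XtX_inv
    return (b[coef] - beta0) / np.sqrt(V[coef, coef])

def wild_cluster_bootstrap_p(X, y, groups, coef, beta0, n_boot=999, seed=0):
    # groups: integer cluster codes 0..G-1; X needs at least one column besides coef
    rng = np.random.default_rng(seed)
    t_obs = cluster_t(X, y, groups, coef, beta0)
    # Restricted fit: impose beta[coef] = beta0 and regress on the remaining columns
    X_r = np.delete(X, coef, axis=1)
    gamma = np.linalg.lstsq(X_r, y - beta0 * X[:, coef], rcond=None)[0]
    fitted_r = X_r @ gamma + beta0 * X[:, coef]
    u_r = y - fitted_r
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        v = rng.choice([-1.0, 1.0], size=groups.max() + 1)   # one sign per cluster
        t_boot[b] = cluster_t(X, fitted_r + u_r * v[groups], groups, coef, beta0)
    return np.mean(np.abs(t_boot) >= abs(t_obs))

# Hypothetical panel: 100 firms x 10 years with a common yearly shock, clustered by year
rng = np.random.default_rng(1)
n_firms, n_years = 100, 10
year = np.tile(np.arange(n_years), n_firms)
x = rng.normal(size=n_years)[year] + rng.normal(size=year.size)
y = 0.5 * x + rng.normal(size=n_years)[year] + rng.normal(size=year.size)
X = np.column_stack([np.ones_like(x), x])
p = wild_cluster_bootstrap_p(X, y, year, coef=1, beta0=0.5)
print(f"wild cluster bootstrap p-value for H0: beta = 0.5: {p:.3f}")
```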

Bias vs variance sandbox

A single regression gives you one number. To understand what a standard error should reflect, you need to imagine running the same regression many times on fresh data and looking at the distribution of point estimates. This tab simulates that thought experiment. Each "Run 100 simulations" press generates 100 fresh datasets and recomputes two estimators; the resulting histogram is the empirical sampling distribution — exactly the object an honest SE is trying to summarise.
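The same thought experiment takes a few lines of NumPy under a hypothetical DGP. Pooled OLS is unbiased here, but both x and the error have firm-level components. The standard deviation of β̂ across 100 fresh draws is the target an SE should estimate. The average conventional SE should fall well short of it, and the average firm-clustered SE should land close. The function name and DGP are illustrative only.

```python
import numpy as np

def one_draw(rng, firm, n_firms, beta=0.5):
    # Hypothetical DGP: firm components in x and in the error, uncorrelated with each other
    x = rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
    y = beta * x + rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b
    se_conv = np.sqrt(u @ u / (len(y) - 2) * XtX_inv[1, 1])
    sums = np.zeros((n_firms, 2))
    np.add.at(sums, firm, X * u[:, None])                      # per-firm score sums
    se_clu = np.sqrt((XtX_inv @ sums.T @ sums @ XtX_inv)[1, 1])
    return b[1], se_conv, se_clu

rng = np.random.default_rng(0)
n_firms, n_years = 100, 10
firm = np.repeat(np.arange(n_firms), n_years)
draws = np.array([one_draw(rng, firm, n_firms) for _ in range(100)])

print(f"SD of beta_hat across draws:  {draws[:, 0].std(ddof=1):.4f}")
print(f"mean conventional SE:         {draws[:, 1].mean():.4f}")
print(f"mean firm-clustered SE:       {draws[:, 2].mean():.4f}")
```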

How to read this lab

The simulator below uses a Double-LASSO setup because the JavaScript machinery for it is built in. Read the teal estimator ("Rigorous") as a stand-in for a well-sized estimator — tight, centred — and the orange estimator ("CV") as a stand-in for a mis-sized estimator — wider, possibly biased. The pedagogical message is the same as the panel-SE post: two reasonable-looking choices can produce very different sampling distributions on identical data.
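For readers new to Double LASSO, a minimal sketch of post-double-selection shows where the lab's |I_y|, |I_d| and union come from. It runs one LASSO of the outcome on the controls and one of the treatment on the controls. It then fits OLS of y on the treatment plus the union of selected controls. Here λ is fixed by hand at a hypothetical 0.1, so neither the theory-driven nor the cross-validated rule that the two panels compare is implemented. The coordinate-descent LASSO is a bare-bones stand-in for a library solver, and none of this is the lab's JavaScript.

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=200):
    # Coordinate descent for (1/(2n))||y - Z b||^2 + lam * ||b||_1 (no intercept; centre y first)
    n, p = Z.shape
    b = np.zeros(p)
    r = y.astype(float).copy()                 # current residual y - Z b
    col_ss = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r += Z[:, j] * b[j]                # partial residual without feature j
            rho = Z[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]   # soft-threshold
            r -= Z[:, j] * b[j]
    return b

def post_double_selection(y, d, W, lam=0.1):
    Wc = (W - W.mean(axis=0)) / W.std(axis=0)             # standardise the controls
    I_y = np.flatnonzero(lasso_cd(Wc, y - y.mean(), lam))  # controls that predict y
    I_d = np.flatnonzero(lasso_cd(Wc, d - d.mean(), lam))  # controls that predict d
    union = np.union1d(I_y, I_d)
    Z = np.column_stack([np.ones_like(d), d, W[:, union]])
    alpha_hat = np.linalg.lstsq(Z, y, rcond=None)[0][1]    # coefficient on d
    return alpha_hat, I_y, I_d, union

# Hypothetical data: 50 controls, only the first two matter, true alpha = 1
rng = np.random.default_rng(0)
n, p = 300, 50
W = rng.normal(size=(n, p))
d = W[:, 0] + rng.normal(size=n)
y = 1.0 * d + W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)
alpha_hat, I_y, I_d, union = post_double_selection(y, d, W)
print(f"alpha_hat = {alpha_hat:.3f}, |I_y| = {I_y.size}, |I_d| = {I_d.size}, |union| = {union.size}")
```

Keeping the union, rather than only the y-equation selections, guards against dropping a confounder like W[:, 0] that predicts the treatment strongly.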

Capped at 300 so the "Run 100 sims" button finishes quickly.
Capped at 50 for the 100-sim run.
Strength of the true relationship between covariates and outcomes.
0 = controls predict y and treatment equally · 1 = controls predict treatment well, y barely.

Well-sized (Rigorous)

Theory-driven λ, comparable to a correctly-sized SE.

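Readouts: α̂ (treatment-effect estimate), SE(α̂), |I_y| and |I_d| (number of controls selected in the outcome and treatment equations), and |I_y ∪ I_d| (size of their union).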

Mis-sized (CV)

Data-driven λ, comparable to an over-confident SE.

Readouts: α̂, SE(α̂), |I_y|, |I_d|, and the union |I_y ∪ I_d|.

Bias vs variance over many simulations

Single runs are noisy. Run the whole pipeline 100 times with fresh draws to see the sampling distribution. The width of each histogram is what an honest SE should approximate.