Standard errors, fixed effects, and the bias–inference split
A regression's headline number is its point estimate. Whether that number is meaningful depends on the standard error beside it. In panel data, where the same firm is observed year after year, the textbook standard-error formula is almost always wrong. This app shows why in three ways: a forest plot of eight model-and-SE combinations for the same coefficient, a Monte Carlo bar chart of empirical rejection rates, and a simulation lab that builds intuition for sampling distributions across many draws.
The headline takeaway has two parts. First, no SE choice can rescue a biased point estimate: pooled OLS gives β̂ = 1.03 against a true β = 0.5, and changing the standard error never moves the coefficient. Second, once fixed effects remove the bias (β̂ = 0.48), the SE choice determines whether your 95% CI actually has 95% coverage. Entity-clustering on 100 firms behaves; time-clustering on 10 years over-rejects.
Why does panel data break the textbook standard error?
Conventional standard errors assume every observation is an independent draw. In a panel, two observations from the same firm — say firm 1 in 2015 and firm 1 in 2016 — share the firm's idiosyncrasies and so are not independent. The animation below shows the difference between two pull-toward-zero forces: an L1 (LASSO) penalty drives coefficients to exactly zero, while an L2 (Ridge) penalty only asymptotes. Use it as a visual reminder that two formulas, applied to the same data, can give qualitatively different answers — the same lesson the SE choice teaches in panel regressions.
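The independence failure described above can be sketched numerically. The snippet below is a minimal illustration under assumed settings, not the post's DGP: a persistent firm component sits in both the regressor and the error (independently of each other, so pooled OLS is unbiased here), and the conventional SE is compared with a firm-clustered SE. The function names, the unit variances, and the absence of a finite-sample correction in the clustered formula are all choices made for this sketch.

```python
import math
import random


def simulate_panel(n_firms=100, n_years=10, beta=0.5, seed=0):
    """Return (firm, x, y) rows. Assumed DGP: firm effects in x and in the error."""
    rng = random.Random(seed)
    rows = []
    for firm in range(n_firms):
        x_firm = rng.gauss(0, 1)  # persistent part of the regressor
        e_firm = rng.gauss(0, 1)  # persistent part of the error, independent of x
        for _ in range(n_years):
            x = x_firm + rng.gauss(0, 1)
            y = beta * x + e_firm + rng.gauss(0, 1)
            rows.append((firm, x, y))
    return rows


def ols_with_two_ses(rows):
    """Single-regressor OLS with intercept: slope, conventional SE, firm-clustered SE."""
    n = len(rows)
    x_bar = sum(r[1] for r in rows) / n
    y_bar = sum(r[2] for r in rows) / n
    sxx = sum((r[1] - x_bar) ** 2 for r in rows)
    slope = sum((r[1] - x_bar) * (r[2] - y_bar) for r in rows) / sxx

    ssr = 0.0
    firm_scores = {}  # firm -> sum over years of (x - x_bar) * residual
    for firm, x, y in rows:
        resid = (y - y_bar) - slope * (x - x_bar)
        ssr += resid ** 2
        firm_scores[firm] = firm_scores.get(firm, 0.0) + (x - x_bar) * resid

    # Conventional formula: treats all n residuals as independent draws.
    conventional = math.sqrt(ssr / (n - 2) / sxx)
    # Sandwich formula: sums scores within each firm before squaring,
    # so within-firm correlation is allowed (no finite-sample correction).
    clustered = math.sqrt(sum(s ** 2 for s in firm_scores.values())) / sxx
    return slope, conventional, clustered


slope, conventional, clustered = ols_with_two_ses(simulate_panel())
print(f"slope={slope:.3f}  conventional SE={conventional:.4f}  clustered SE={clustered:.4f}")
```

Under these assumptions the clustered SE should come out noticeably larger than the conventional one, because ten observations on one firm carry less independent information than ten observations on ten firms.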
SE Forest Plot
The post's eight estimator rows as a forest plot. The teal dashed line marks the truth β = 0.5. Pooled OLS bars never touch it; fixed-effects bars do.
Rejection Rates
Monte Carlo empirical rejection rates (500 simulated datasets) for six FE + SE combinations. The dashed line at 5% is the nominal target. Time-clustered SEs land at 9.0%.
Bias vs Variance Lab
A simulation sandbox: see how 100 fresh draws from the same DGP produce a sampling distribution. The width of that distribution is what an honest SE should reflect.
Glossary (open a card if a term is unfamiliar)
Bias vs inference
Conventional SE
White / HC SE
Cluster-robust SE
Two-way clustering
Driscoll-Kraay SE
Fixed effects (FE)
Rejection rate (size)
Eight estimators, one coefficient — a forest plot
All eight rows below estimate the same parameter — the effect of R&D intensity (x) on firm performance (y) — using the same simulated panel (100 firms × 10 years). What differs is the model (pooled OLS vs. entity FE vs. two-way FE) and the SE estimator (six varieties, from conventional through Driscoll-Kraay). The teal dashed line marks the truth, β = 0.5. Toggle a method off to focus the view.
What to look for
- All pooled OLS bars miss the truth. Five SE choices, none of them rescue the biased β̂ = 1.03. This is the §13.1 lesson visualised: standard errors address precision, not accuracy.
- The two FE bars (β̂ ≈ 0.48) cover the truth. Demeaning the firm-level confounder is what moves the point estimate to where it belongs.
- Hover any row for the t-statistic and full 95% CI. The most striking row is "Pooled OLS (Driscoll-Kraay)" with t = 65.4 — an impressively significant result for a coefficient that is more than double the truth.
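The bias pattern in the bullets above can be reproduced in a few lines. This is a sketch under an assumed DGP (unobserved firm ability with unit variance entering both x and y), not the post's exact simulation, so the printed numbers will not match 1.03 and 0.48 exactly. The helper names are hypothetical.

```python
import random


def simulate_confounded_panel(n_firms=100, n_years=10, beta=0.5, seed=1):
    """Assumed DGP: firm ability raises both R&D intensity (x) and performance (y)."""
    rng = random.Random(seed)
    panel = {}
    for firm in range(n_firms):
        ability = rng.gauss(0, 1)  # unobserved confounder, fixed within firm
        obs = []
        for _ in range(n_years):
            x = ability + rng.gauss(0, 1)
            y = beta * x + ability + rng.gauss(0, 1)
            obs.append((x, y))
        panel[firm] = obs
    return panel


def slope(pairs):
    """OLS slope of y on x with an intercept."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    return sxy / sxx


def within_transform(panel):
    """Subtract each firm's own mean from x and y (the entity-FE transformation)."""
    demeaned = []
    for obs in panel.values():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        demeaned.extend((x - x_bar, y - y_bar) for x, y in obs)
    return demeaned


panel = simulate_confounded_panel()
pooled = slope([pair for obs in panel.values() for pair in obs])
fixed_effects = slope(within_transform(panel))
print(f"pooled OLS: {pooled:.2f}   entity FE: {fixed_effects:.2f}   truth: 0.50")
```

Demeaning removes anything constant within a firm, including ability, which is why the FE slope lands near the truth while the pooled slope absorbs the confounder. No standard error appears anywhere in this calculation, which is the point: the SE choice cannot move either estimate.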
Why does Driscoll-Kraay look so confident?
Pooled OLS with Driscoll-Kraay SEs (β̂ = 1.03, SE = 0.0158, t = 65.4) has the narrowest CI of the eight rows. The estimator is robust to cross-sectional dependence, but the simulated DGP has very weak cross-sectional dependence: firms are conditionally independent given firm ability. Driscoll-Kraay's bandwidth-3 kernel borrows strength across firms aggressively, producing an SE that is technically valid but smaller than the entity-clustered one. In a panel with strong common shocks (e.g., banks during the 2008 crisis), the same estimator would give a much wider interval. The narrow CI here reflects the DGP, not a universal property of DK.
Monte Carlo rejection rates — does the test land at 5%?
For each combination of FE model + SE estimator, we simulate 500 independent datasets from the same DGP and ask: across those 500 runs, how often does the 95% CI miss the true β = 0.5? An honest test should reject at 5% — anything materially above 5% is over-rejection (false positives); anything materially below is conservative.
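The procedure just described can be sketched as a loop: simulate, estimate, test the true β, count rejections. The code below is an illustration under assumptions, not the post's simulation code. It uses a hypothetical DGP (unit-variance firm ability and i.i.d. noise), covers only two of the six combinations (entity FE with entity-clustered and with time-clustered SEs), applies no finite-sample correction, and uses the normal critical value 1.96.

```python
import math
import random


def fe_slope_and_cluster_ses(panel):
    """panel: list of firms, each a list of (x, y) per year.

    Returns the entity-FE slope with entity-clustered and time-clustered SEs.
    """
    n_years = len(panel[0])
    demeaned = []  # (firm, year, x_tilde, y_tilde)
    for firm, obs in enumerate(panel):
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for year, (x, y) in enumerate(obs):
            demeaned.append((firm, year, x - x_bar, y - y_bar))

    sxx = sum(x * x for _, _, x, _ in demeaned)
    slope = sum(x * y for _, _, x, y in demeaned) / sxx

    by_firm = [0.0] * len(panel)
    by_year = [0.0] * n_years
    for firm, year, x, y in demeaned:
        score = x * (y - slope * x)  # regressor times residual
        by_firm[firm] += score
        by_year[year] += score

    se_entity = math.sqrt(sum(s * s for s in by_firm)) / sxx
    se_time = math.sqrt(sum(s * s for s in by_year)) / sxx
    return slope, se_entity, se_time


def rejection_rates(n_sims=500, n_firms=100, n_years=10, beta=0.5, seed=0):
    """Share of simulated datasets in which |t| > 1.96 for H0: slope == true beta."""
    rng = random.Random(seed)
    rejects = {"entity": 0, "time": 0}
    for _ in range(n_sims):
        panel = []
        for _ in range(n_firms):
            ability = rng.gauss(0, 1)  # assumed firm-level confounder
            obs = []
            for _ in range(n_years):
                x = ability + rng.gauss(0, 1)
                obs.append((x, beta * x + ability + rng.gauss(0, 1)))
            panel.append(obs)
        slope, se_entity, se_time = fe_slope_and_cluster_ses(panel)
        for name, se in (("entity", se_entity), ("time", se_time)):
            if abs(slope - beta) / se > 1.96:
                rejects[name] += 1
    return {name: count / n_sims for name, count in rejects.items()}


rates = rejection_rates()
print({name: f"{rate:.1%}" for name, rate in rates.items()})
```

Because the null being tested is true in every simulated dataset, each rejection is a false positive, so the printed shares are direct estimates of test size. The exact values depend on the assumed DGP and seed and need not match the app's bars.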
What the bars are telling you
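- Four combinations land near 5%. FE + conventional (6.0%), FE + White (6.4%), and FE + entity-cluster (6.6%) fall within simulation noise of the nominal target: with 500 runs, the Monte Carlo standard error of a 5% rejection rate is about 1 percentage point. FE + both-cluster (7.8%) runs somewhat higher but stays below the time-clustered rate. After demeaning, the within-firm residuals are reasonably well-behaved.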
- FE + time-cluster over-rejects badly (9.0%). With only 10 year-clusters, the asymptotic theory behind cluster-robust SEs simply does not hold. The standard error is too small; the t-statistic too large; false positives nearly double the nominal rate.
- TWFE + entity-cluster under-rejects (3.2%). Absorbing time effects costs degrees of freedom and slightly inflates the SE — wider intervals than needed, but conservative is much safer than over-confident.
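One rough way to judge which of the bars above differ from 5% by more than simulation noise is the binomial standard error of a rejection rate estimated from 500 runs. The sketch below applies that formula to the six rates quoted in the bullets; the function name and the labels are ours, and treating each bar as an independent binomial proportion is a simplifying assumption.

```python
import math


def monte_carlo_se(nominal_rate=0.05, n_sims=500):
    """Binomial standard error of an estimated rejection rate under the nominal size."""
    return math.sqrt(nominal_rate * (1 - nominal_rate) / n_sims)


# Rejection rates as reported in the bullets above.
reported = {
    "FE + conventional": 0.060,
    "FE + White": 0.064,
    "FE + entity-cluster": 0.066,
    "FE + both-cluster": 0.078,
    "FE + time-cluster": 0.090,
    "TWFE + entity-cluster": 0.032,
}

mc_se = monte_carlo_se()
distances = {label: (rate - 0.05) / mc_se for label, rate in reported.items()}
for label, z in distances.items():
    print(f"{label:22s} {reported[label]:.1%}  {z:+.1f} MC standard errors from 5%")
```

Distances of more than about two standard errors are hard to attribute to simulation noise alone, which gives a simple yardstick for reading the chart.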
Rule of thumb
Cluster on the dimension with the larger number of groups. In this DGP that's firms (100) not years (10). With fewer than ~30–40 clusters the cluster-robust SE breaks down, and you need a small-sample correction (e.g., wild cluster bootstrap, CR2). The post's §13.2 decision framework is the takeaway: always start with the right model (FE), then pick the SE estimator with enough clusters along the chosen dimension.
Bias vs variance sandbox
A single regression gives you one number. To understand what a standard error should reflect, you need to imagine running the same regression many times on fresh data and looking at the distribution of point estimates. This tab simulates that thought experiment. Each "Run 100 simulations" press generates 100 fresh datasets and recomputes two estimators; the resulting histogram is the empirical sampling distribution — exactly the object an honest SE is trying to summarise.
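The same thought experiment can be run outside the browser. The sketch below is not the lab's Double-LASSO machinery; it swaps in a plain entity-FE regression on an assumed panel DGP with i.i.d. noise, draws 100 fresh datasets, and compares the spread of the 100 point estimates with the average conventional SE reported by the individual runs. All names and parameter values are illustrative.

```python
import math
import random
import statistics


def fe_estimate(rng, n_firms=100, n_years=10, beta=0.5):
    """Draw one fresh panel and return (entity-FE slope, conventional SE)."""
    sxx = sxy = syy = 0.0
    for _ in range(n_firms):
        ability = rng.gauss(0, 1)  # assumed firm-level confounder
        xs = [ability + rng.gauss(0, 1) for _ in range(n_years)]
        ys = [beta * x + ability + rng.gauss(0, 1) for x in xs]
        x_bar = sum(xs) / n_years
        y_bar = sum(ys) / n_years
        for x, y in zip(xs, ys):
            sxx += (x - x_bar) ** 2
            sxy += (x - x_bar) * (y - y_bar)
            syy += (y - y_bar) ** 2
    slope = sxy / sxx
    ssr = syy - slope * sxy  # residual sum of squares after demeaning
    dof = n_firms * n_years - n_firms - 1  # firm means and the slope are estimated
    return slope, math.sqrt(ssr / dof / sxx)


rng = random.Random(42)
draws = [fe_estimate(rng) for _ in range(100)]
estimates = [b for b, _ in draws]
empirical_sd = statistics.stdev(estimates)  # width of the sampling distribution
mean_reported_se = statistics.mean(se for _, se in draws)

print(f"mean estimate    : {statistics.mean(estimates):.3f}")
print(f"empirical SD     : {empirical_sd:.4f}")
print(f"mean reported SE : {mean_reported_se:.4f}")
```

In this deliberately well-behaved setting the two numbers should be close, which is what "honest" means here. Adding within-firm correlation to the noise term would be one way to watch the conventional SE fall short of the empirical spread.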
How to read this lab
The simulator below uses a Double-LASSO setup because the JavaScript machinery for it is built in. Read the teal estimator ("Rigorous") as a stand-in for a well-sized estimator — tight, centred — and the orange estimator ("CV") as a stand-in for a mis-sized estimator — wider, possibly biased. The pedagogical message is the same as the panel-SE post: two reasonable-looking choices can produce very different sampling distributions on identical data.
Well-sized (Rigorous)
Theory-driven λ, comparable to a correctly-sized SE.
Mis-sized (CV)
Data-driven λ, comparable to an over-confident SE.
Bias vs variance over many simulations
Single runs are noisy. Run the whole pipeline 100 times with fresh draws to see the sampling distribution. The width of each histogram is what an honest SE should approximate.