Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| simulated_panel_data | unit-period (spatial panel) | 675 × 14 | simulated_panel_data.dta | simulated_panel_data.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_mgwrfer/data/"
use "${BASE}simulated_panel_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_mgwrfer/data/"
df = pd.read_stata(BASE + "simulated_panel_data.dta")
# load every dataset at once
files = ["simulated_panel_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "simulated_panel_data.dta", "simulated_panel_data.dta")
df, meta = pyreadstat.read_dta("simulated_panel_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_mgwrfer/data/"
df <- read_dta(paste0(BASE, "simulated_panel_data.dta"))
Overview & sources
Companion data for a Python tutorial that follows Li & Fotheringham (2026), who introduce Multiscale Geographically Weighted Fixed Effects Regression (MGWFER), a local panel framework that removes time-invariant spatial confounders from Multiscale GWR. The dataset is a fully synthetic spatial panel generated from the paper's data-generating process (Eqs. 39–45), reproduced verbatim, on a 15×15 grid of 225 spatial units observed over 3 time periods (675 observations). Each covariate is coupled to a time-invariant spatial context sc_i (Cor(x_k, sc) ≈ 0.84), so the indirect contextual effect channel is active and Cor(x_4, y) ≈ 0.84 even though β_4 ≡ 0. Because the true coefficient surfaces (beta1_true–beta4_true) and the confounder (alpha_true) are carried as columns, the panel provides a ground truth against which cross-sectional OLS, pooled OLS, fixed effects, cross-sectional MGWR, pooled MGWR, and MGWFER can be benchmarked. The entire data-generating process is open and reproducible.
simulated_panel_data is a balanced spatial panel: one row per spatial unit × time period (225 units × 3 periods = 675 rows). Spatial position is fixed by integer grid coordinates (coord_i, coord_j) on a 15×15 lattice; time_id indexes the three periods. Alongside the observed outcome y and four covariates x1–x4, the file carries the known truth columns — the fixed effect alpha_true and the four spatially varying slopes beta1_true–beta4_true — so every estimator can be scored against ground truth.
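The structural claims above (225 units × 3 periods, fixed coordinates, time-invariant truth columns) can be checked once the file is loaded. The sketch below is one possible way to do it. The helper check_balanced_panel is a hypothetical name, not part of the tutorial's script, and the stand-in frame is rebuilt from the documented identifier layout so the example runs without a download.

```python
import numpy as np
import pandas as pd


def check_balanced_panel(df, unit="unit_id", time="time_id",
                         fixed_cols=("coord_i", "coord_j", "alpha_true")):
    """Hypothetical helper: summarise the structure of a long-format panel."""
    periods_per_unit = df.groupby(unit)[time].nunique()
    n_units = int(periods_per_unit.size)
    n_periods = int(df[time].nunique())
    balanced = bool(
        (periods_per_unit == n_periods).all()      # every unit has every period
        and len(df) == n_units * n_periods         # no extra rows
        and not df.duplicated([unit, time]).any()  # one row per unit-period
    )
    # A time-invariant column takes exactly one value within each unit.
    invariant = {c: bool((df.groupby(unit)[c].nunique() == 1).all())
                 for c in fixed_cols}
    return {"n_units": n_units, "n_periods": n_periods,
            "balanced": balanced, "time_invariant": invariant}


# Stand-in frame following the documented layout (15x15 grid, 3 periods).
n_side, n_periods = 15, 3
n_units = n_side * n_side
grid_i = np.repeat(np.arange(1, n_side + 1), n_side)
grid_j = np.tile(np.arange(1, n_side + 1), n_side)
demo = pd.DataFrame({
    "unit_id": np.repeat(np.arange(n_units), n_periods),
    "time_id": np.tile(np.arange(n_periods), n_units),
    "coord_i": np.repeat(grid_i, n_periods),
    "coord_j": np.repeat(grid_j, n_periods),
})
demo["alpha_true"] = 30 * (np.exp(demo["coord_j"] / 15) - 1)

report = check_balanced_panel(demo)
print(report)
```

To run the same check on the published file, pass the frame returned by pd.read_stata in place of demo.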
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Li & Fotheringham (2026) | Replicated study; the verbatim data-generating process (Eqs. 39–45) and the MGWFER algorithm | Li, Z., & Fotheringham, A. S. (2026). Spatial Context as a Time-Invariant Confounder: A Fixed-Effects Extension of MGWR. Annals of the American Association of Geographers. https://doi.org/10.1080/24694452.2026.2654481 |
| Synthetic (this study) | All values — simulated from the paper's DGP with a fixed random seed (open & reproducible) | Mendez, C. (2026). See the post's Python script script.py for the full DGP (NumPy default_rng, seed 42). |
| Method references | Estimators and software | Fotheringham, Yang & Kang (2017, Multiscale GWR); Oshan et al. (2019, the mgwr package); GeoZhipengLi/MGWPR (panel-enabled mgwr fork); Wooldridge (2010, omitted-variable-bias derivation). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). MGWFER: Causal Spatially Varying Coefficients via Panel Fixed Effects [Data set]. https://carlos-mendez.org/post/python_mgwrfer/
Li, Z., & Fotheringham, A. S. (2026). Spatial Context as a Time-Invariant Confounder: A Fixed-Effects Extension of MGWR. Annals of the American Association of Geographers. https://doi.org/10.1080/24694452.2026.2654481
BibTeX
@misc{mendez2026pythonmgwrfer,
author = {Mendez, Carlos},
title = {MGWFER: Causal Spatially Varying Coefficients via Panel Fixed Effects},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_mgwrfer/}},
note = {Data set}
}
@article{li2026spatial,
author = {Li, Zhipeng and Fotheringham, A. Stewart},
title = {Spatial Context as a Time-Invariant Confounder: A Fixed-Effects Extension of {MGWR}},
journal = {Annals of the American Association of Geographers},
year = {2026},
doi = {10.1080/24694452.2026.2654481}
}
Variable explorer (all 14 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| alpha_true | continuous | True spatial context / fixed effect (sc_i) | Known time-invariant confounder; the intrinsic contextual effect MGWFER recovers. Truth column, not an observable predictor. | synthetic units | simulated_panel_data | Simulation (ground truth) |
| beta1_true | continuous | True local slope beta1 (quadratic dome) | Known spatially varying coefficient on x1; ground truth for scoring. Quadratic dome peaking at the grid centre. | coefficient | simulated_panel_data | Simulation (ground truth) |
| beta2_true | continuous | True local slope beta2 (linear gradient) | Known spatially varying coefficient on x2; ground truth for scoring. Linear gradient in i+j. | coefficient | simulated_panel_data | Simulation (ground truth) |
| beta3_true | continuous | True local slope beta3 (constant 1.5) | Known spatially homogeneous coefficient on x3; ground truth for scoring. | coefficient | simulated_panel_data | Simulation (ground truth) |
| beta4_true | continuous | True local slope beta4 (null = 0) | Known null coefficient on x4; ground truth for false-positive testing. | coefficient | simulated_panel_data | Simulation (ground truth) |
| coord_i | identifier | Grid row coordinate (i) | Row position of the unit on the 15x15 lattice; spatial coordinate for kernel weighting. | grid units (1-15) | simulated_panel_data | Simulation |
| coord_j | identifier | Grid column coordinate (j) | Column position on the 15x15 lattice; drives the exponential spatial-context gradient. | grid units (1-15) | simulated_panel_data | Simulation |
| time_id | identifier | Time period index | Period index within the panel (3 periods per unit). Not a calendar year. | integer (0-2) | simulated_panel_data | Simulation |
| unit_id | identifier | Spatial unit ID | Identifier of the spatial unit (one of 225 grid cells); repeats across the unit's 3 time periods. | integer ID (0-224) | simulated_panel_data | Simulation |
| x1 | continuous | Covariate x1 (effect = quadratic dome) | Causally-active covariate; its true local slope beta1 is a quadratic dome peaking at the grid centre. | synthetic units | simulated_panel_data | Simulation |
| x2 | continuous | Covariate x2 (effect = linear gradient) | Causally-active covariate; its true local slope beta2 is a linear gradient increasing with i+j. | synthetic units | simulated_panel_data | Simulation |
| x3 | continuous | Covariate x3 (effect = constant 1.5) | Causally-active covariate; its true local slope beta3 is constant at 1.5 everywhere. | synthetic units | simulated_panel_data | Simulation |
| x4 | continuous | Covariate x4 (null effect; spurious link to y) | Covariate with NO causal effect on y (beta4 = 0); shares parent sc with y, so Cor(x4, y) ~ 0.84. | synthetic units | simulated_panel_data | Simulation |
| y | continuous | Outcome variable | Simulated response: spatial context plus three causally-active covariates plus noise. | synthetic units | simulated_panel_data | Simulation |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | simulated_panel_data |
|---|---|
| alpha_true | ● |
| beta1_true | ● |
| beta2_true | ● |
| beta3_true | ● |
| beta4_true | ● |
| coord_i | ● |
| coord_j | ● |
| time_id | ● |
| unit_id | ● |
| x1 | ● |
| x2 | ● |
| x3 | ● |
| x4 | ● |
| y | ● |
Construction & formulas
The data are generated from a two-part data-generating process on a 15×15 grid of
N = 225 units indexed by integer coordinates (i, j), each observed over
T = 3 periods (paper Eqs. 39–45). The columns alpha_true and
beta1_true–beta4_true are the known truth the
estimators are scored against.
- Time-invariant spatial context / fixed effect (alpha_true, sc_i): sc_i = 30·(exp(j/15) − 1), an exponential gradient in the column index j (range 2.07–51.55, mean 23.29). It enters the outcome directly and drives the covariate levels.
- True slope β₁ (beta1_true): a quadratic dome peaking at the grid centre, 1 + (q² − (q − i/2)²)(q² − (q − j/2)²)/q⁴ with q = ⌈15/4⌉ = 4 (range 1.05–2.00).
- True slope β₂ (beta2_true): a linear gradient, 1 + (i + j)/(2·15) (range 1.07–2.00).
- True slope β₃ (beta3_true): constant 1.5 everywhere (tests spatial homogeneity).
- True slope β₄ (beta4_true): identically 0 everywhere (a null effect that tests false-positive detection).
- Covariate equation (the indirect contextual channel, Eqs. 40–43): x_k,it = 0.05·sc_i + ν_k,it, ν ~ N(0, 0.5), for k = 1, 2, 3, 4. Every covariate is a noisy linear function of spatial context, giving Cor(x_k, sc) ≈ 0.84.
- Outcome equation (Eqs. 44–45): y_it = sc_i + β₁(i,j)·x1 + β₂(i,j)·x2 + β₃(i,j)·x3 + ε_it, ε ~ N(0, 0.5). Note that x4 is excluded from y (β₄ ≡ 0), so its 0.84 correlation with y is spurious, transmitted only through the shared parent sc.
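The construction steps above can be sketched in NumPy as follows. This is an illustrative reading of the listed formulas, not the post's script.py: the function name simulate_panel is hypothetical, N(0, 0.5) is read as a standard deviation of 0.5 (an assumption, though it is consistent with the covariate SDs of about 0.9 reported in the statistics table), and the order of random draws is unknown, so the values will not match the published file even with seed 42.

```python
import numpy as np
import pandas as pd

GRID, PERIODS = 15, 3   # 15x15 lattice, 3 periods (from the text)
NOISE_SD = 0.5          # assumption: the 0.5 in N(0, 0.5) is a standard deviation


def simulate_panel(seed=42):
    """Hypothetical sketch of the documented DGP; not the author's script."""
    rng = np.random.default_rng(seed)
    n_units = GRID * GRID
    i = np.repeat(np.arange(1, GRID + 1), GRID).astype(float)  # row index
    j = np.tile(np.arange(1, GRID + 1), GRID).astype(float)    # column index

    # Truth surfaces, one value per spatial unit.
    sc = 30.0 * (np.exp(j / 15) - 1.0)
    q = np.ceil(15 / 4)
    beta = np.column_stack([
        1 + (q**2 - (q - i / 2)**2) * (q**2 - (q - j / 2)**2) / q**4,
        1 + (i + j) / (2 * 15),
        np.full(n_units, 1.5),
        np.zeros(n_units),
    ])

    def to_long(a):
        # Repeat each unit's value for each of its periods.
        return np.repeat(a, PERIODS, axis=0)

    sc_l, beta_l = to_long(sc), to_long(beta)
    n_obs = n_units * PERIODS
    x = 0.05 * sc_l[:, None] + rng.normal(0.0, NOISE_SD, size=(n_obs, 4))
    # beta4 is zero, so x4 drops out of the outcome.
    y = sc_l + (beta_l * x).sum(axis=1) + rng.normal(0.0, NOISE_SD, size=n_obs)

    df = pd.DataFrame({
        "unit_id": to_long(np.arange(n_units)),
        "time_id": np.tile(np.arange(PERIODS), n_units),
        "coord_i": to_long(i).astype(int),
        "coord_j": to_long(j).astype(int),
        "y": y,
    })
    for k in range(4):
        df[f"x{k + 1}"] = x[:, k]
    df["alpha_true"] = sc_l
    for k in range(4):
        df[f"beta{k + 1}_true"] = beta_l[:, k]
    return df


panel = simulate_panel()
print(panel.shape)
print(panel[["x1", "x2", "x3", "x4", "y"]].corrwith(panel["alpha_true"]).round(2))
```

The printed correlations should land near the 0.84 quoted for the covariates, which is a quick way to confirm that the indirect channel is active in the simulated draw.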
The headline correction: pooled estimators recover β_k + δ_k (true slope plus the
indirect contextual effect δ_k); the within-transformation
ỹ_it = y_it − ȳ_i removes the time-invariant sc_i exactly, neutralising
δ_k and restoring identification of the local slopes.
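To make the correction concrete, the sketch below contrasts a pooled regression with the within (unit-demeaned) regression on data drawn from the same formulas. It deliberately swaps in plain global OLS for both fits, so it illustrates the bias mechanism only and is not an implementation of MGWR or MGWFER. The seed, the helper names demean_by_unit and ols_slopes, and the reading of 0.5 as a standard deviation are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # arbitrary seed for this sketch
GRID, T, NOISE_SD = 15, 3, 0.5        # 0.5 read as a standard deviation (assumption)

i = np.repeat(np.arange(1, GRID + 1), GRID).astype(float)
j = np.tile(np.arange(1, GRID + 1), GRID).astype(float)
sc = 30 * (np.exp(j / 15) - 1)        # time-invariant spatial context
q = np.ceil(15 / 4)
beta = np.column_stack([
    1 + (q**2 - (q - i / 2)**2) * (q**2 - (q - j / 2)**2) / q**4,
    1 + (i + j) / 30,
    np.full(i.size, 1.5),
    np.zeros(i.size),                 # beta4 is identically zero
])

unit = np.repeat(np.arange(i.size), T)   # unit index for each long-format row
x = 0.05 * sc[unit][:, None] + rng.normal(0, NOISE_SD, (unit.size, 4))
y = sc[unit] + (beta[unit] * x).sum(axis=1) + rng.normal(0, NOISE_SD, unit.size)


def demean_by_unit(a, unit):
    """Within transformation: subtract each unit's mean over time."""
    frame = pd.DataFrame(a)
    centred = frame - frame.groupby(unit).transform("mean")
    return centred.to_numpy().reshape(a.shape)


def ols_slopes(X, y, intercept):
    """Global least-squares slopes (intercept dropped from the result)."""
    if intercept:
        X = np.column_stack([np.ones(len(X)), X])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef[1:] if intercept else coef


pooled = ols_slopes(x, y, intercept=True)
within = ols_slopes(demean_by_unit(x, unit), demean_by_unit(y, unit), intercept=False)

print("pooled slopes:", pooled.round(2))   # inflated by the contextual channel
print("within slopes:", within.round(2))   # close to the average true slopes
```

In this setup the pooled coefficient on x4 comes out far from zero while the within coefficient sits near zero, because demeaning removes sc_i from both sides of the regression.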
The datasets
Each dataset is documented with its full variable dictionary and a statistics table that includes data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| unit_id (identifier) | Spatial unit ID | Identifier of the spatial unit (one of 225 grid cells); repeats across the unit's 3 time periods. | 0..224, in row-major order over the 15x15 grid (np.repeat(arange(225), 3)). | integer ID (0-224) | Simulation | all rows |
| time_id (identifier) | Time period index | Period index within the panel (3 periods per unit). Not a calendar year. | np.tile(arange(3), 225); values 0, 1, 2. | integer (0-2) | Simulation | all rows |
| coord_i (identifier) | Grid row coordinate (i) | Row position of the unit on the 15x15 lattice; spatial coordinate for kernel weighting. | Row index 1..15, np.repeat(arange(1,16), 15) then replicated across time. | grid units (1-15) | Simulation | all rows |
| coord_j (identifier) | Grid column coordinate (j) | Column position on the 15x15 lattice; drives the exponential spatial-context gradient. | Column index 1..15, np.tile(arange(1,16), 15) then replicated across time. | grid units (1-15) | Simulation | all rows |
| y (continuous) | Outcome variable | Simulated response: spatial context plus three causally-active covariates plus noise. | y = sc_i + beta1*x1 + beta2*x2 + beta3*x3 + epsilon; epsilon ~ N(0, 0.5). x4 is excluded (paper Eqs. 44-45). | synthetic units | Simulation | all rows |
| x1 (continuous) | Covariate x1 (effect = quadratic dome) | Causally-active covariate; its true local slope beta1 is a quadratic dome peaking at the grid centre. | x1 = 0.05*sc_i + N(0, 0.5) (indirect contextual channel, paper Eq. 40). | synthetic units | Simulation | all rows |
| x2 (continuous) | Covariate x2 (effect = linear gradient) | Causally-active covariate; its true local slope beta2 is a linear gradient increasing with i+j. | x2 = 0.05*sc_i + N(0, 0.5) (paper Eq. 41). | synthetic units | Simulation | all rows |
| x3 (continuous) | Covariate x3 (effect = constant 1.5) | Causally-active covariate; its true local slope beta3 is constant at 1.5 everywhere. | x3 = 0.05*sc_i + N(0, 0.5) (paper Eq. 42). | synthetic units | Simulation | all rows |
| x4 (continuous) | Covariate x4 (null effect; spurious link to y) | Covariate with NO causal effect on y (beta4 = 0); shares parent sc with y, so Cor(x4, y) ~ 0.84. | x4 = 0.05*sc_i + N(0, 0.5) (paper Eq. 43); omitted from the y equation. | synthetic units | Simulation | all rows |
| alpha_true (continuous) | True spatial context / fixed effect (sc_i) | Known time-invariant confounder; the intrinsic contextual effect MGWFER recovers. Truth column, not an observable predictor. | sc_i = 30*(exp(j/15) - 1); exponential in column index j (range 2.07-51.55). Constant across the unit's 3 periods. | synthetic units | Simulation (ground truth) | all rows |
| beta1_true (continuous) | True local slope beta1 (quadratic dome) | Known spatially varying coefficient on x1; ground truth for scoring. Quadratic dome peaking at the grid centre. | 1 + (q^2-(q-i/2)^2)*(q^2-(q-j/2)^2)/q^4, q=ceil(15/4) (range 1.05-2.00). Constant across periods. | coefficient | Simulation (ground truth) | all rows |
| beta2_true (continuous) | True local slope beta2 (linear gradient) | Known spatially varying coefficient on x2; ground truth for scoring. Linear gradient in i+j. | 1 + (i+j)/(2*15) (range 1.07-2.00). Constant across periods. | coefficient | Simulation (ground truth) | all rows |
| beta3_true (continuous) | True local slope beta3 (constant 1.5) | Known spatially homogeneous coefficient on x3; ground truth for scoring. | 1.5 everywhere (np.full(225, 1.5)). | coefficient | Simulation (ground truth) | all rows |
| beta4_true (continuous) | True local slope beta4 (null = 0) | Known null coefficient on x4; ground truth for false-positive testing. | 0 everywhere (np.zeros(225)). | coefficient | Simulation (ground truth) | all rows |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| unit_id | 100% | 675 | 225 | – | – | – | – | – |
| time_id | 100% | 675 | 3 | – | – | – | – | – |
| coord_i | 100% | 675 | 15 | – | – | – | – | – |
| coord_j | 100% | 675 | 15 | – | – | – | – | – |
| y | 100% | 675 | 675 | -0.577 | 28.55 | 26.20 | 66.16 | 18.78 |
| x1 | 100% | 675 | 675 | -1.02 | 1.15 | 1.08 | 3.45 | 0.904 |
| x2 | 100% | 675 | 675 | -1.61 | 1.16 | 1.09 | 3.98 | 0.929 |
| x3 | 100% | 675 | 675 | -1.03 | 1.11 | 1.05 | 3.77 | 0.909 |
| x4 | 100% | 675 | 675 | -1.20 | 1.18 | 1.12 | 3.70 | 0.935 |
| alpha_true | 100% | 675 | 15 | 2.07 | 23.29 | 21.14 | 51.55 | 15.23 |
| beta1_true | 100% | 675 | 36 | 1.05 | 1.50 | 1.46 | 2.00 | 0.268 |
| beta2_true | 100% | 675 | 29 | 1.07 | 1.53 | 1.53 | 2.00 | 0.204 |
| beta3_true | 100% | 675 | 1 | 1.50 | 1.50 | 1.50 | 1.50 | 0 |
| beta4_true | 100% | 675 | 1 | 0 | 0 | 0 | 0 | 0 |
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial; values are simulated from the paper's DGP with a fixed seed. Results are internally consistent with the design but are not empirical evidence about real-world spatial processes.
- Truth columns are not observable in practice. alpha_true and beta1_true–beta4_true are carried only so estimators can be scored. In an applied setting these are exactly the unknowns the methods try to recover, so do not feed them to the models as predictors.
- Indirect channel by construction. Every covariate has a correlation of about 0.84 with spatial context, and x4 has a correlation of about 0.84 with y despite β₄ = 0. Any model that fails to condition on sc (cross-sectional/pooled OLS, MGWR_cs, PMGWR) will misread this as a real local effect. That is the point of the simulation, not a data error.
- Reduced 15×15 grid. The paper uses a 30×30 grid; this panel uses 15×15 (225 units) so the multiscale bandwidth search completes in minutes. All qualitative conclusions are preserved, but exact RMSE/correlation magnitudes differ from the paper's.
- Balanced, short panel. Only T = 3 periods; the within estimator's degrees of freedom are NT − K − N = 446. Fixed-effects identification relies on strict exogeneity conditional on the fixed effects (no time-varying confounders).
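The degrees-of-freedom figure in the last caveat follows from simple arithmetic, sketched below. K = 4 is an assumption here: it is the count of covariates x1–x4, and it is the value implied by the stated result of 446.

```python
n_units = 225        # N: spatial units on the 15x15 grid
n_periods = 3        # T: time periods per unit
n_covariates = 4     # K: x1..x4 (assumed; implied by the stated 446)

n_obs = n_units * n_periods                    # NT = 675 rows
df_within = n_obs - n_covariates - n_units     # NT - K - N
print(n_obs, df_within)
```

Each unit's fixed effect uses up one degree of freedom, which is why a panel with only three periods leaves roughly two thirds of the observations for estimating the slopes.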