Does a cash bonus shorten unemployment? Debiasing an RCT with ML
Nagoya University (GSID)
June 11, 2026
Act I
Pennsylvania offered some unemployment-insurance claimants a cash bonus for finding work quickly. The bonus group did leave the rolls a bit sooner.
But people who get a treatment can differ from those who don’t. Is the gap the bonus — or the people?
Log unemployment duration by group — both distributions pile up near 3.0–3.5; the bonus group’s mean sits ~0.09 log points lower.
Act II
Outcome \(Y\): inuidur1, log unemployment duration (mean 2.028, sd 1.215)
Treatment \(D\): tg, the bonus offer (1 = offered, 0 = control)
Treatment was assigned at random, so \(D\) is independent of \(X\) by design; the estimand is a clean Average Treatment Effect (ATE).
Mean of each of the 15 covariates, control (steel) vs bonus (orange) — the bars are nearly the same height everywhere.
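A balance check like the one in the figure can be sketched with standardized mean differences. This is a minimal illustration on synthetic data, not the Pennsylvania sample; the helper name `standardized_mean_diff` and the simulated arrays are assumptions made for the example.

```python
import numpy as np

def standardized_mean_diff(X, d):
    """Per-covariate (treated mean - control mean) / pooled standard deviation."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d).astype(bool)
    mean_t, mean_c = X[d].mean(axis=0), X[~d].mean(axis=0)
    pooled_sd = np.sqrt((X[d].var(axis=0, ddof=1) + X[~d].var(axis=0, ddof=1)) / 2.0)
    return (mean_t - mean_c) / pooled_sd

# Synthetic stand-in: 15 covariates, coin-flip assignment independent of X.
rng = np.random.default_rng(0)
n, p = 5000, 15
X = rng.normal(size=(n, p))
d = rng.integers(0, 2, size=n)

smd = standardized_mean_diff(X, d)
print("largest |SMD| across covariates:", np.round(np.abs(smd).max(), 3))
```

Under random assignment every standardized difference should sit close to zero, which is the numerical counterpart of bars that are nearly the same height.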
Standard linear adjustment regresses \(Y\) on \(D\) and the covariates:
\[Y_i = \alpha + \beta\, D_i + X_i'\gamma + \epsilon_i\]
A linear \(X_i'\gamma\) can miss nonlinear structure in \(X \to Y\). The leftover variation widens the standard errors.
Naive OLS: −0.0855. With covariates: −0.0717. Randomization already makes both estimates unbiased for the ATE, so the covariates are there to buy precision, not to remove bias.
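The comparison between the naive and the covariate-adjusted regression can be sketched as follows. The data are simulated with a made-up effect of −0.08 and three hypothetical covariates, so the printed numbers will not match the slide; the point is only that both coefficients target the same effect while the adjusted one has the smaller standard error.

```python
import numpy as np

def ols(y, Z):
    """OLS with an intercept; returns coefficients and classical standard errors."""
    Z = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return beta, se

# Synthetic randomized experiment (assumed effect -0.08, not the study data).
rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))
d = rng.integers(0, 2, size=n).astype(float)
y = -0.08 * d + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

b_naive, se_naive = ols(y, d)                       # Y on D only
b_adj, se_adj = ols(y, np.column_stack([d, X]))     # Y on D and X

print(f"naive:    {b_naive[1]:.4f} (SE {se_naive[1]:.4f})")
print(f"adjusted: {b_adj[1]:.4f} (SE {se_adj[1]:.4f})")
```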
\[Y = D\,\theta_0 + g_0(X) + \varepsilon, \qquad E[\varepsilon \mid D, X] = 0\]
\[D = m_0(X) + V, \qquad E[V \mid X] = 0\]
\(\theta_0\) is the causal ATE. \(g_0(X)\) and \(m_0(X)\) are nuisance functions — scaffolding we estimate but don’t report.
\[\tilde{Y} = Y - \hat{g}_0(X), \qquad \tilde{D} = D - \hat{m}_0(X)\]
\[\hat{\theta}_0 = \frac{\sum_i \tilde{D}_i\,\tilde{Y}_i}{\sum_i \tilde{D}_i^{2}}\]
Like noise-canceling headphones: ML learns how \(X\) drives \(Y\) and \(D\), we subtract it, and only the \(D \to Y\) signal is left.
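The subtract-then-regress recipe can be checked in its simplest form. In this sketch both nuisance fits are plain linear regressions on synthetic data (an assumption for illustration), and each residual is the variable minus its prediction from \(X\). With linear learners the residual-on-residual ratio reproduces the OLS coefficient on \(D\) exactly (the Frisch-Waugh-Lovell theorem); swapping in a flexible learner is what turns the same recipe into DML.

```python
import numpy as np

def residualize(target, X):
    """Subtract the in-sample linear prediction of `target` from X (with intercept)."""
    Z = np.column_stack([np.ones(len(target)), X])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

# Synthetic data with a made-up effect of -0.08 (not the study sample).
rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 3))
d = rng.integers(0, 2, size=n).astype(float)
y = -0.08 * d + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

# Residual-on-residual estimate: sum(D~ * Y~) / sum(D~ ** 2).
y_res, d_res = residualize(y, X), residualize(d, X)
theta_resid = (d_res @ y_res) / (d_res @ d_res)

# Coefficient on D from the full regression of Y on [1, D, X].
Z_full = np.column_stack([np.ones(n), d, X])
theta_ols = np.linalg.lstsq(Z_full, y, rcond=None)[0][1]

print(f"residual-on-residual: {theta_resid:.6f}")
print(f"full OLS coefficient: {theta_ols:.6f}")
```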
\[\hat{\theta}_0^{CF} = \frac{\sum_{k=1}^{K}\sum_{i \in I_k} \tilde{D}_i^{(k)}\,\tilde{Y}_i^{(k)}}{\sum_{k=1}^{K}\sum_{i \in I_k} \big(\tilde{D}_i^{(k)}\big)^2}\]
Split the sample into \(K=5\) folds. Each fold’s residuals come from models trained on the other four, and the residuals from all five folds are pooled into the single ratio above.
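Cross-fitting can be written out by hand in a few lines. The sketch below is an illustration, not the doubleml implementation. The learner `poly_fit_predict` (least squares on linear and squared terms), the function `cross_fit_plr`, and the simulated data with an effect of −0.08 are all assumptions made for the example. The outcome learner is fit to \(Y\) on \(X\), so it targets \(E[Y \mid X]\), and the standard error is the usual plug-in from the partialling-out score.

```python
import numpy as np

def poly_fit_predict(X_train, t_train, X_test):
    """Hypothetical stand-in learner: least squares on [1, X, X^2]."""
    def feats(X):
        return np.column_stack([np.ones(len(X)), X, X ** 2])
    coef, *_ = np.linalg.lstsq(feats(X_train), t_train, rcond=None)
    return feats(X_test) @ coef

def cross_fit_plr(y, d, X, fit_predict, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual estimate and its standard error."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    y_res, d_res = np.empty(n), np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False  # fit on the other folds only
        y_res[test_idx] = y[test_idx] - fit_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - fit_predict(X[train], d[train], X[test_idx])
    # Pool the out-of-fold residuals from all folds into one ratio.
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se

# Synthetic data with nonlinear X -> Y structure (not the study sample).
rng = np.random.default_rng(3)
n = 4000
X = rng.normal(size=(n, 2))
d = rng.integers(0, 2, size=n).astype(float)
y = -0.08 * d + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

theta_cf, se_cf = cross_fit_plr(y, d, X, poly_fit_predict, n_folds=5)
print(f"theta = {theta_cf:.4f}, SE = {se_cf:.4f}")
```

Because `fit_predict` is just a callable, any learner that maps training data to test-set predictions can be dropped in without touching the fold logic.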
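doubleml: wrap the data, pick a learner, cross-fit, read θ

```python
from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

# df and COVARIATES (the covariate column names) are assumed to be defined earlier
dml_data = DoubleMLData(df, y_col="inuidur1", d_cols="tg", x_cols=COVARIATES)
learner = RandomForestRegressor(n_estimators=500, max_depth=5,
                                max_features="sqrt", random_state=42)
# ml_l predicts Y from X, ml_m predicts D from X; each gets its own clone
dml_plr_rf = DoubleMLPLR(dml_data, clone(learner), clone(learner), n_folds=5)
dml_plr_rf.fit()
print(dml_plr_rf.summary)  # tg: -0.0736 (SE 0.0354, p 0.0378)
```

Act III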
−0.0736
\(\hat\theta_0\), DML · Random Forest (SE 0.0354, \(p=0.038\)) · 95% CI [−0.143, −0.004]
| Learner | \(\hat\theta_0\) | SE | \(p\) | 95% CI |
|---|---|---|---|---|
| Random Forest | −0.0736 | 0.0354 | 0.038 | [−0.143, −0.004] |
| Lasso | −0.0712 | 0.0354 | 0.044 | [−0.141, −0.002] |
Two utterly different learners, a 0.0024 gap — under 7% of one standard error.
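The intervals, p-values, and the learner gap can be re-derived from the reported point estimates and standard errors. This is a small arithmetic check assuming a normal approximation; because the inputs are rounded to four decimals, the last digit of a recomputed p-value can differ slightly from the package output.

```python
import math

def normal_summary(theta, se, z_crit=1.959964):
    """95% confidence interval and two-sided p-value under a normal approximation."""
    z = theta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return theta - z_crit * se, theta + z_crit * se, p_value

# Point estimates and standard errors as reported in the table.
estimates = {"Random Forest": (-0.0736, 0.0354), "Lasso": (-0.0712, 0.0354)}

results = {}
for name, (theta, se) in estimates.items():
    lo, hi, p = normal_summary(theta, se)
    results[name] = (lo, hi, p)
    print(f"{name}: CI [{lo:.3f}, {hi:.3f}], p = {p:.3f}")

# Gap between the two learners, as a share of one standard error.
gap = abs(estimates["Random Forest"][0] - estimates["Lasso"][0])
gap_share = gap / 0.0354
print(f"gap = {gap:.4f}, share of one SE = {gap_share:.1%}")
```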
Naive OLS (−0.0855) is largest; covariate OLS and both DML estimates cluster near −0.07. Dashed line = zero; only DML carries valid CIs.
DML-RF [−0.143, −0.004] and DML-Lasso [−0.141, −0.002]; near-identical width, upper bounds hugging zero.
Objection. “You ran fancy machine learning, so the −0.0736 must be the true causal effect.”
Response. Identification comes from randomization, not DML. ML only partials out \(X\) and delivers valid inference. DML disciplines adjustment — it cannot manufacture identification.