Introduction to Causal Inference: Double Machine Learning

Does a cash bonus shorten unemployment? Debiasing an RCT with ML

−0.0736 · DML · Random Forest

−0.0712 · DML · Lasso · same answer

5,099 UI claimants · randomized

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

Did the bonus cause faster reemployment — or did different people get it?

Pennsylvania offered some unemployment-insurance claimants a cash bonus for finding work quickly. The bonus group did leave the rolls a bit sooner.

But people who get a treatment can differ from those who don’t. Is the gap the bonus — or the people?

The raw gap is real but small: bonus duration sits ~0.09 log points lower

Log unemployment duration by group — both distributions pile up near 3.0–3.5; the bonus group’s mean sits ~0.09 log points lower.

Where we’re going

The lab: a 5,099-claimant randomized bonus experiment with 15 covariates
Why even an RCT benefits from covariate adjustment — precision, not bias
The Partially Linear model and the partialling-out estimator
Double Machine Learning: cross-fitting two ML predictions
The lesson: on an RCT, DML sharpens inference; it doesn’t move the point

The Investigation

Act II

The lab: 5,099 claimants, randomly split 3,354 control vs 1,745 bonus

Outcome \(Y\) — inuidur1, log unemployment duration (mean 2.028, sd 1.215)
Treatment \(D\) — tg, the bonus offer (1 = offered, 0 = control)
Controls \(X\) — 15 demographic and labor-market covariates

Treatment was assigned at random, so \(D\) is independent of \(X\) by design — the estimand is a clean Average Treatment Effect (ATE).

Randomization worked: covariate means line up almost exactly across groups

Mean of each of the 15 covariates, control (steel) vs bonus (orange) — the bars are nearly the same height everywhere.
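A balance check like the one in the figure can be sketched with standardized mean differences. The data below are simulated stand-ins (three hypothetical covariates, not the real bonus data), and the assignment share of 0.34 is only chosen to roughly mimic 1,745 of 5,099.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical covariates (NOT the bonus data): two binary, one continuous
X = np.column_stack([
    rng.binomial(1, 0.4, n),
    rng.binomial(1, 0.1, n),
    rng.normal(0.0, 1.0, n),
])
# Random assignment, independent of X by construction
d = rng.binomial(1, 0.34, n)

def standardized_mean_diff(x, d):
    """Difference in group means divided by the pooled standard deviation."""
    x1, x0 = x[d == 1], x[d == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

smd = np.array([standardized_mean_diff(X[:, j], d) for j in range(X.shape[1])])
print(np.round(smd, 3))  # values near zero indicate balance
```

Under randomization these differences are nonzero only by chance, which is why the bars in the figure are nearly the same height.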

In an RCT, covariates can’t fix bias — but they can sharpen precision

Standard linear adjustment regresses \(Y\) on \(D\) and the covariates:

\[Y_i = \alpha + \beta\, D_i + X_i'\gamma + \epsilon_i\]

A linear \(X_i'\gamma\) can miss nonlinear structure in \(X \to Y\). The leftover variation widens the standard errors.
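The precision argument can be illustrated on a simulated RCT. This is a minimal sketch under assumed values (one covariate, an effect of −0.07, classical standard errors), not a reanalysis of the bonus data.

```python
import numpy as np

def ols_coef_se(Z, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)                       # covariate that predicts the outcome
d = rng.binomial(1, 0.5, n).astype(float)    # randomized treatment
y = -0.07 * d + 1.0 * x + rng.normal(size=n)

const = np.ones(n)
b_naive, se_naive = ols_coef_se(np.column_stack([const, d]), y)
b_adj, se_adj = ols_coef_se(np.column_stack([const, d, x]), y)

print("naive   :", b_naive[1], se_naive[1])
print("adjusted:", b_adj[1], se_adj[1])      # similar point estimate, smaller SE
```

Both regressions target the same effect; adjusting for `x` removes outcome variance it explains, so the standard error on `d` shrinks.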

Linear adjustment already pulls the naive −0.0855 toward −0.0717

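from sklearn.linear_model import LinearRegression

# Assumes df (the bonus data) and COVARIATES (the 15 control column names) are already defined

ols = LinearRegression()                 # naive: Y on D only
ols.fit(df[["tg"]], df["inuidur1"])
print(ols.coef_[0])                      # coefficient on tg → -0.0855

ols_full = LinearRegression()            # add the 15 covariates linearly
ols_full.fit(df[["tg"] + COVARIATES], df["inuidur1"])
print(ols_full.coef_[0])                 # coefficient on tg → -0.0717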

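Naive OLS: −0.0855. With covariates: −0.0717. Under randomization both target the same ATE, so the shift reflects chance covariate imbalance rather than bias removal; what adjustment buys is precision.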

DML splits the outcome into a linear treatment term plus a flexible nuisance

\[Y = D\,\theta_0 + g_0(X) + \varepsilon, \qquad E[\varepsilon \mid D, X] = 0\]

\[D = m_0(X) + V, \qquad E[V \mid X] = 0\]

\(\theta_0\) is the causal ATE. \(g_0(X)\) and \(m_0(X)\) are nuisance functions — scaffolding we estimate but don’t report.
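Why residualizing isolates \(\theta_0\): a short derivation using only the two equations above. Take \(E[\cdot \mid X]\) of the outcome equation, using \(E[D \mid X] = m_0(X)\) and \(E[\varepsilon \mid X] = 0\), then subtract the result from the original equation so that \(g_0(X)\) cancels.

\[E[Y \mid X] = \theta_0\, m_0(X) + g_0(X)\]

\[Y - E[Y \mid X] = \theta_0\,\big(D - m_0(X)\big) + \varepsilon = \theta_0\, V + \varepsilon\]

A regression of the outcome residual on the treatment residual \(V\) therefore has slope \(\theta_0\).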

Partial out both sides, then regress the residuals — that’s the whole trick

\[\tilde{Y} = Y - \hat{\ell}_0(X), \qquad \tilde{D} = D - \hat{m}_0(X), \qquad \text{where } \ell_0(X) = E[Y \mid X]\]

\[\hat{\theta}_0 = \frac{\sum_i \tilde{D}_i\,\tilde{Y}_i}{\sum_i \tilde{D}_i^{2}}\]

Like noise-canceling headphones: ML learns how \(X\) drives \(Y\) and \(D\), we subtract it, and only the \(D \to Y\) signal is left.
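With linear nuisance models the partialling-out recipe reduces to the Frisch-Waugh-Lovell result: the residual-on-residual slope equals the coefficient from the full regression. A minimal sketch on simulated data (all values illustrative), with least squares standing in for the ML learner:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 controls
d = rng.binomial(1, 0.5, n).astype(float)
y = -0.07 * d + X @ np.array([2.0, 0.5, -0.3, 0.8]) + rng.normal(size=n)

def residualize(target, X):
    """Residual of target after a least-squares projection on X."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

# Partial X out of both sides, then regress residual on residual
y_res = residualize(y, X)
d_res = residualize(d, X)
theta_partial = (d_res @ y_res) / (d_res @ d_res)

# Coefficient on d from the full regression of y on [d, X]
full_coef, *_ = np.linalg.lstsq(np.column_stack([d, X]), y, rcond=None)
theta_full = full_coef[0]

print(theta_partial, theta_full)  # the two agree up to floating-point error
```

DML keeps the same final step but swaps the least-squares projections for flexible learners.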

Cross-fitting computes each residual out-of-sample to kill regularization bias

\[\hat{\theta}_0^{CF} = \frac{\sum_{k=1}^{K}\sum_{i \in I_k} \tilde{D}_i^{(k)}\,\tilde{Y}_i^{(k)}}{\sum_{k=1}^{K}\sum_{i \in I_k} \big(\tilde{D}_i^{(k)}\big)^2}\]

Split into \(K=5\) folds; each fold’s residuals come from models trained on the other four, and the pooled residuals feed one final regression.
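The cross-fitting loop can be written by hand in a few lines. This is a sketch on simulated data, assuming scikit-learn is available; the data-generating process, the forest settings, and the names `ml_l` and `ml_m` are illustrative choices, not the deck's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 5))
d = rng.binomial(1, 0.5, n).astype(float)       # randomized treatment
theta_true = -0.07                              # illustrative value
y = theta_true * d + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

y_res = np.empty(n)
d_res = np.empty(n)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit both nuisance models on the other four folds only
    ml_l = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    ml_m = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    ml_l.fit(X[train_idx], y[train_idx])
    ml_m.fit(X[train_idx], d[train_idx])
    # Residualize the held-out fold with out-of-sample predictions
    y_res[test_idx] = y[test_idx] - ml_l.predict(X[test_idx])
    d_res[test_idx] = d[test_idx] - ml_m.predict(X[test_idx])

# Pooled residual-on-residual regression across all folds
theta_hat = (d_res @ y_res) / (d_res @ d_res)

# Standard error from the partialling-out score
psi = (y_res - theta_hat * d_res) * d_res
se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
print(theta_hat, se)
```

Because no observation is predicted by a model that saw it during training, overfitting in the learners does not leak into the residuals.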

Six lines in `doubleml`: wrap the data, pick a learner, cross-fit, read θ

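from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

# Assumes df (the bonus data) and COVARIATES (the 15 control column names) are already defined
dml_data = DoubleMLData(df, y_col="inuidur1", d_cols="tg", x_cols=COVARIATES)

learner = RandomForestRegressor(n_estimators=500, max_depth=5,
                                max_features="sqrt", random_state=42)
# One clone predicts Y from X, the other predicts D from X
dml_plr_rf = DoubleMLPLR(dml_data, clone(learner), clone(learner), n_folds=5)
dml_plr_rf.fit()
print(dml_plr_rf.summary)            # tg: -0.0736  (SE 0.0354, p 0.0378)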

The Resolution

Act III

The bonus offer shortens unemployment duration by about 7% (−0.0736 log points)

−0.0736

\(\hat\theta_0\), DML · Random Forest (SE 0.0354, \(p=0.038\)) · 95% CI [−0.143, −0.004]
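Because the outcome is in logs, the estimate is in log points. The small-change approximation reads 0.0736 log points as roughly 7.4%; the exact conversion is \(100\,(e^{\hat\theta_0} - 1)\). A quick check using only the numbers reported above (the helper name is illustrative):

```python
import math

theta_hat = -0.0736                 # DML Random Forest estimate, in log points
ci_low, ci_high = -0.143, -0.004    # reported 95% CI, in log points

def log_points_to_pct(b):
    """Exact percent change in the outcome implied by a change of b in its log."""
    return 100 * (math.exp(b) - 1)

print(round(log_points_to_pct(theta_hat), 1))   # -7.1
print(round(log_points_to_pct(ci_low), 1),      # -13.3
      round(log_points_to_pct(ci_high), 1))     # -0.4
```

In duration terms the interval runs from about a 13% reduction to almost none.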

Swap Random Forest for Lasso and the answer barely moves: −0.0712

Learner	\(\hat\theta_0\)	SE	\(p\)	95% CI
Random Forest	−0.0736	0.0354	0.038	[−0.143, −0.004]
Lasso	−0.0712	0.0354	0.044	[−0.141, −0.002]

Two utterly different learners, a 0.0024 gap — under 7% of one standard error.
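The table's p-values and intervals follow from each estimate and its standard error under a normal approximation. A short check using only the reported numbers (the function name is illustrative):

```python
import math

def wald_summary(theta, se, z_crit=1.959964):
    """Two-sided normal-approximation p-value and 95% CI for an estimate."""
    z = theta / se
    p = math.erfc(abs(z) / math.sqrt(2))      # equals 2 * (1 - Phi(|z|))
    return p, (theta - z_crit * se, theta + z_crit * se)

rows = {"Random Forest": (-0.0736, 0.0354), "Lasso": (-0.0712, 0.0354)}
results = {name: wald_summary(theta, se) for name, (theta, se) in rows.items()}

for name, (p, (lo, hi)) in results.items():
    print(f"{name}: p={p:.3f}, CI=[{lo:.3f}, {hi:.3f}]")
# Random Forest: p=0.038, CI=[-0.143, -0.004]
# Lasso: p=0.044, CI=[-0.141, -0.002]

gap_in_se = abs(-0.0736 - (-0.0712)) / 0.0354
print(f"gap = {gap_in_se:.3f} standard errors")   # gap = 0.068 standard errors
```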

All four roads lead to ~−0.07: the methods agree on sign and size

Naive OLS (−0.0855) is largest; covariate OLS and both DML estimates cluster near −0.07. Dashed line = zero; only the DML estimates are reported with standard errors and CIs here.

Both DML intervals exclude zero — but only just

DML-RF [−0.143, −0.004] and DML-Lasso [−0.141, −0.002]; near-identical width, upper bounds hugging zero.

Does flexible ML make this causal? No — the RCT does

Objection. “You ran fancy machine learning, so the −0.0736 must be the true causal effect.”

Response. Identification comes from randomization, not DML. ML only partials out \(X\) and delivers valid inference. DML disciplines adjustment — it cannot manufacture identification.
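A small simulation makes the point concrete. It is a sketch with invented numbers: the same partialling-out estimator is run once on a randomized treatment and once on a treatment driven by an unobserved variable `u`. Linear adjustment stands in for the ML learner, since the failure comes from the missing variable rather than from the learner.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
theta_true = -0.07                      # illustrative effect
x = rng.normal(size=n)                  # observed covariate
u = rng.normal(size=n)                  # unobserved confounder

def partial_out_estimate(y, d, x):
    """Residual-on-residual estimate of the coefficient on d, adjusting for x linearly."""
    Z = np.column_stack([np.ones(len(x)), x])
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    d_res = d - Z @ np.linalg.lstsq(Z, d, rcond=None)[0]
    return (d_res @ y_res) / (d_res @ d_res)

# Case 1: randomized treatment, independent of x and u
d_rct = rng.binomial(1, 0.5, n).astype(float)
y_rct = theta_true * d_rct + x + u + rng.normal(size=n)
est_rct = partial_out_estimate(y_rct, d_rct, x)

# Case 2: treatment take-up depends on the unobserved u
d_obs = (u + rng.normal(size=n) > 0).astype(float)
y_obs = theta_true * d_obs + x + u + rng.normal(size=n)
est_obs = partial_out_estimate(y_obs, d_obs, x)

print(est_rct)   # close to theta_true
print(est_obs)   # pulled far from theta_true by the unobserved confounder
```

Adjusting for `x` cannot remove the influence of `u`, however flexible the learner; only the design can.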

On a clean experiment, Double ML sharpens the answer — it doesn’t change it.