Causal Inference with DoWhy

Did job training raise earnings? Model · Identify · Estimate · Refute

$1,620 doubly robust ATE

$62 placebo collapse

5 methods agree

Carlos Mendez

Nagoya University (GSID)

July 8, 2026

The Tension

Act I

Trained workers out-earned controls by $1,794 — but did training cause it?

185 disadvantaged workers got job training; 260 did not. The trained group earned $1,794 more in 1978.

But people who enroll may differ in age, schooling, or prior earnings. Is the gap the program — or the people?

Five disciplined estimators land near $1,620, all below the naive $1,794

Estimated ATE across six methods. The five adjusted estimators cluster tightly; the naive difference sits highest.

One estimand, four steps, five estimators — do they agree?

The estimand: the average treatment effect (ATE) of National Supported Work (NSW) job training on 1978 earnings
DoWhy’s four steps — Model → Identify → Estimate → Refute
Five estimators across three paradigms — do they agree?
The discipline: refutation tests that try to break the result

The Investigation

Act II

We target the ATE: the effect of training on a random worker

\[\text{ATE} = E[Y(1) - Y(0)]\]

The expected earnings gain from moving anyone — treated or not — from no-training to training.

Four methods target the ATE directly; matching drifts toward the ATT (the effect on those actually trained) because it discards unmatched controls.
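To make the ATE/ATT distinction concrete, here is a minimal sketch on an invented table of potential outcomes. The six workers and their earnings are hypothetical, not NSW data, and in real data only one of the two potential outcomes is ever observed per worker.

```python
# Hypothetical potential outcomes for six workers (not NSW data).
# Each row: (treated?, Y(0) earnings without training, Y(1) earnings with training)
workers = [
    (1, 2000, 5000),
    (1, 3000, 5500),
    (1, 1000, 4500),
    (0, 6000, 7000),
    (0, 5000, 6000),
    (0, 7000, 7500),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# ATE: average gain over everyone, treated or not.
ate = mean(y1 - y0 for _, y0, y1 in workers)

# ATT: average gain over the workers who were actually trained.
att = mean(y1 - y0 for t, y0, y1 in workers if t == 1)

# Naive gap: observed treated mean minus observed control mean.
naive = (mean(y1 for t, _, y1 in workers if t == 1)
         - mean(y0 for t, y0, _ in workers if t == 0))

print(round(ate, 2), att, naive)
```

In this toy table the trained workers gain more than the untrained would have, so the ATT exceeds the ATE, while the naive gap has the wrong sign because the two groups start from different baselines.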

DoWhy forces four explicit steps instead of one black-box regression

Model — encode assumptions as a causal graph (a DAG)
Identify — graph theory finds the adjustment formula (the estimand)
Estimate — compute the number with one or more methods
Refute — stress-test the result with falsification tests

You cannot estimate before you identify, and you cannot identify before you model. That ordering is the contribution.

The lab: 445 NSW workers, 8 pre-treatment covariates, randomized

Outcome — real earnings in 1978 (re78), mean $5,301, heavily right-skewed
Treatment — randomized job training (185 trained, 260 control)
Covariates — age, education, race, marital status, degree, prior earnings (re74, re75)

All eight covariates are measured before treatment, so none can be a mediator or a collider downstream of treatment. The only role left for them is as potential confounders.

Both groups overlap heavily — and both spike at zero earnings

Distribution of 1978 earnings by treatment group. Training mean $6,349 vs control $4,555; both right-skewed with a spike at zero.
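As a quick arithmetic check, the group sizes and group means quoted above reproduce both the naive gap and the pooled mean. The sketch uses only numbers already stated; the variable names are illustrative.

```python
# Group sizes and 1978 mean earnings as reported above.
n_trained, n_control = 185, 260
mean_trained, mean_control = 6349, 4555

# Naive difference in means between the two groups.
naive_gap = mean_trained - mean_control

# Pooled mean: size-weighted average of the two group means.
pooled_mean = (n_trained * mean_trained + n_control * mean_control) / (n_trained + n_control)

print(naive_gap, round(pooled_mean))
```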

Randomization isn’t perfect: `nodegr` is imbalanced by 0.31 SD

Love plot of absolute standardized mean differences. nodegr, hisp, and educ exceed the 0.1 balance threshold (orange); the rest are balanced (blue).
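The balance metric behind a Love plot can be sketched in a few lines. This version assumes the common pooled-SD definition of the standardized mean difference (the slides do not say which variant was used), and the covariate values are invented rather than taken from the NSW data.

```python
from math import sqrt
from statistics import mean, variance

def abs_smd(treated, control):
    """Absolute standardized mean difference with a pooled-SD denominator."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return abs(mean(treated) - mean(control)) / pooled_sd

# Hypothetical binary covariate (1 = no degree); not the NSW values.
treated = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 70% without a degree
control = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # 90% without a degree

smd = abs_smd(treated, control)
status = "imbalanced" if smd > 0.1 else "balanced"
print(round(smd, 2), status)
```

A covariate is flagged when its absolute SMD exceeds the 0.1 threshold, which is the rule the plot colors encode.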

Step 1 — Model: every covariate is a common cause of both treatment and outcome

DoWhy’s DAG: the eight covariates point to both treatment (treat) and outcome (re78); treat points to re78.

Step 2 — Identify: the backdoor criterion seals all confounding paths

\[\frac{d}{d[\text{treat}]}\, E[\text{re78} \mid \text{age}, \text{educ}, \text{black}, \dots, \text{re75}]\]

Conditioning on the eight covariates blocks every backdoor path, so the effect is identified — under unconfoundedness (no hidden common cause).
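The backdoor logic can be checked by hand on this DAG. The sketch below is a simplified path checker written for illustration, not DoWhy's identification code: it enumerates paths from treat to re78 that start with an arrow into treat and tests whether the adjustment set blocks each one. It ignores descendants of colliders, which is harmless here because the graph has none on these paths.

```python
covariates = ["age", "educ", "black", "hisp", "married", "nodegr", "re74", "re75"]

# Directed edges (parent, child), following the DAG described above.
edges = ({(x, "treat") for x in covariates}
         | {(x, "re78") for x in covariates}
         | {("treat", "re78")})

def neighbors(node):
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def simple_paths(start, end, path=None):
    """Yield every simple path in the undirected skeleton from start to end."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for nxt in neighbors(path[-1]):
        if nxt not in path:
            yield from simple_paths(start, end, path + [nxt])

def is_blocked(path, adjust):
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider and node not in adjust:
            return True   # simplification: descendants of colliders are ignored
        if not collider and node in adjust:
            return True
    return False

def open_backdoor_paths(treatment, outcome, adjust):
    # A backdoor path starts with an arrow pointing into the treatment.
    backdoor = [p for p in simple_paths(treatment, outcome) if (p[1], treatment) in edges]
    return [p for p in backdoor if not is_blocked(p, adjust)]

unadjusted = open_backdoor_paths("treat", "re78", set())
adjusted = open_backdoor_paths("treat", "re78", set(covariates))
missing_one = open_backdoor_paths("treat", "re78", set(covariates) - {"re75"})
print(len(unadjusted), len(adjusted), missing_one)
```

Dropping a single covariate from the adjustment set reopens exactly one path, which is why the identified estimand conditions on all eight.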

DoWhy checks backdoor, instrumental-variable, and front-door strategies automatically, and returns the formula — not a guess about what to “control for”.

Step 3 — Estimate: three paradigms, one question

Outcome modeling

Models $E[Y \mid X, T]$
Regression adjustment

Treatment modeling

Models $P(T \mid X)$
IPW · stratification · matching

Doubly robust

Models both
AIPW

If outcome-based and treatment-based methods agree, it is unlikely that either model is badly misspecified. That agreement is the robustness check.

Regression adjustment compares like with like: $1,676

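# Assumes `model` is the dowhy.CausalModel from Step 1 and
# `identified_estimand` is the result of model.identify_effect() from Step 2.
estimate_ra = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    confidence_intervals=True)   # ATE = $1,676.34
print(estimate_ra.value)         # point estimate of the ATE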

Models $E[Y \mid X, T]$ and reads the treatment coefficient — the gap at the same covariate values.
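What "reading the treatment coefficient" does can be seen on a tiny synthetic example. This sketch is not DoWhy's implementation: it uses one invented covariate and obtains the coefficient on treatment in y ~ 1 + t + x by partialling out x (the Frisch-Waugh-Lovell result), with the true effect set to 2,000 by construction.

```python
def mean(v):
    return sum(v) / len(v)

def slope_intercept(x, y):
    """Simple least-squares fit y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def residuals(x, y):
    a, b = slope_intercept(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Synthetic data: the covariate x drives both enrollment and earnings.
x = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
t = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0]      # treatment more common at high x
y = [1000 + 3000 * xi + 2000 * ti for xi, ti in zip(x, t)]   # true effect = 2000

naive = (mean([yi for yi, ti in zip(y, t) if ti])
         - mean([yi for yi, ti in zip(y, t) if not ti]))

# Coefficient on t in y ~ 1 + t + x: regress the x-residuals of y on the x-residuals of t.
_, adjusted = slope_intercept(residuals(x, t), residuals(x, y))

print(round(naive), round(adjusted))
```

The naive gap doubles the true effect because trained units have higher x; holding x fixed removes that part of the gap.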

IPW re-weights surprising cases by $1/\hat{e}(X)$ if treated, $1/(1-\hat{e}(X))$ if control: $1,559

\[\hat{\tau}_{IPW} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i Y_i}{\hat{e}(X_i)} - \frac{(1-T_i)Y_i}{1-\hat{e}(X_i)}\right]\]

A treated worker who was unlikely to be treated ($\hat{e} = 0.1$) gets weight 10 — they are the most informative comparison.
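The IPW formula translates directly into code. In this minimal sketch the eight units and their propensity scores are invented and treated as known; in practice the propensity scores would come from a fitted model.

```python
def ipw_ate(t, y, e):
    """IPW estimate of the ATE, term by term as in the formula above."""
    n = len(y)
    return sum(ti * yi / ei - (1 - ti) * yi / (1 - ei)
               for ti, yi, ei in zip(t, y, e)) / n

# Hypothetical mini-sample with known propensity scores (not NSW data).
t = [1, 0, 0, 0, 1, 1, 1, 0]
y = [9000, 3000, 4000, 5000, 6000, 7000, 8000, 2000]
e = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]

# Each unit's weight: 1/e if treated, 1/(1 - e) if control.
weights = [1 / ei if ti else 1 / (1 - ei) for ti, ei in zip(t, e)]

naive = sum(yi for ti, yi in zip(t, y) if ti) / 4 - sum(yi for ti, yi in zip(t, y) if not ti) / 4
ipw = ipw_ate(t, y, e)

print(max(weights), naive, round(ipw))
```

The treated unit with a propensity of 0.25 and the control unit with a propensity of 0.75 each carry weight 4, the largest in the sample, which is what moves the estimate away from the naive gap.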

AIPW gets two shots at the truth: $1,620

\[\hat{\tau}_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-T_i)(Y_i - \hat{\mu}_0(X_i))}{1-\hat{e}(X_i)}\right]\]

Consistent if either the outcome model or the propensity model is correct — belt and suspenders.
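The "either model" property can be demonstrated on synthetic data where the truth is known. In this sketch the data, the correct models, and the deliberately wrong models are all invented for illustration; the true ATE is 2,000 by construction.

```python
def aipw_ate(t, y, x, mu1, mu0, e):
    """AIPW estimate: outcome-model contrast plus inverse-probability-weighted residuals."""
    total = 0.0
    for ti, yi, xi in zip(t, y, x):
        m1, m0, ei = mu1(xi), mu0(xi), e(xi)
        total += m1 - m0 + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
    return total / len(y)

# Synthetic data: Y = 1000*x + 2000*T, so the true ATE is 2000.
x = [0, 0, 0, 0, 1, 1, 1, 1]
t = [1, 0, 0, 0, 1, 1, 1, 0]      # treated share: 0.25 when x == 0, 0.75 when x == 1
y = [1000 * xi + 2000 * ti for xi, ti in zip(x, t)]

def good_mu1(xi): return 1000 * xi + 2000        # correct outcome model, treated
def good_mu0(xi): return 1000 * xi               # correct outcome model, control
def bad_mu(xi): return 0.0                       # deliberately wrong outcome model
def good_e(xi): return 0.25 if xi == 0 else 0.75 # correct propensity model
def bad_e(xi): return 0.5                        # deliberately wrong propensity model

right_outcome_wrong_e = aipw_ate(t, y, x, good_mu1, good_mu0, bad_e)
wrong_outcome_right_e = aipw_ate(t, y, x, bad_mu, bad_mu, good_e)
both_wrong = aipw_ate(t, y, x, bad_mu, bad_mu, bad_e)

print(round(right_outcome_wrong_e), round(wrong_outcome_right_e), round(both_wrong))
```

With one correct model the estimate recovers 2,000; with both wrong it drifts to 2,500. The protection covers one failure, not two.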

All five adjusted estimators agree: $1,559 to $1,736

Method	Estimated ATE	What it models
Naive (diff. in means)	$1,794	nothing
Regression adjustment	$1,676	outcome
IPW	$1,559	treatment
Doubly robust (AIPW)	$1,620	both
PS stratification	$1,617	treatment
PS matching	$1,736	treatment (→ ATT)

The doubly robust $1,620 is the most credible single estimate: it stays consistent if one of the two models is misspecified, though not if both are.

The Resolution

Act III

Step 4 — Refute: a placebo treatment collapses the effect to $62

$62

Placebo ATE after randomly permuting treatment ($p = 0.92$) — down from $1,676

Add a fake confounder, drop 20% of the data — the estimate barely moves

Refutation test	New effect	p-value	Reading
Placebo treatment	$62	0.92	effect vanishes
Random common cause	$1,676	0.90	stable with noise
Data subset (80%)	$1,728	0.80	stable across subsamples

Surviving placebo, random-common-cause, and subset tests is evidence, not proof — refutation can falsify, never confirm.
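The logic of the placebo and subset refuters can be sketched without any library. This is an illustration on a synthetic randomized experiment with a simple difference in means, not DoWhy's refuter code; the sample size, effect size, seed, and number of repetitions are arbitrary assumptions.

```python
import random
from statistics import mean

def diff_in_means(t, y):
    treated = [yi for ti, yi in zip(t, y) if ti]
    control = [yi for ti, yi in zip(t, y) if not ti]
    return mean(treated) - mean(control)

rng = random.Random(0)

# Synthetic randomized experiment: true effect 2.0, unit-variance noise.
n = 400
t = [i % 2 for i in range(n)]
y = [2.0 * ti + rng.gauss(0, 1) for ti in t]
original = diff_in_means(t, y)

# Placebo refuter: shuffle the treatment labels, which severs any link to y.
placebo_effects = []
for _ in range(200):
    fake_t = t[:]
    rng.shuffle(fake_t)
    placebo_effects.append(diff_in_means(fake_t, y))
placebo = mean(placebo_effects)

# Subset refuter: re-estimate on random 80% subsamples.
subset_effects = []
for _ in range(200):
    keep = rng.sample(range(n), int(0.8 * n))
    subset_effects.append(diff_in_means([t[i] for i in keep], [y[i] for i in keep]))
subset = mean(subset_effects)

print(round(original, 2), round(placebo, 2), round(subset, 2))
```

A sound estimate should behave like this one: the placebo effect sits near zero and the subsample average stays close to the original. An estimate that kept its size under a shuffled treatment would be a warning sign.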

Does machine-picked adjustment make this causal? No — one assumption still carries the weight

Objection. DoWhy automated the workflow, so the estimate must be airtight.

Response. The ATE is identified only under unconfoundedness: no hidden common cause of training and earnings. The four steps make assumptions explicit and expose them to falsification tests; they cannot manufacture identification. Here randomization makes unconfoundedness credible; in observational data it is the load-bearing risk.

The training effect is real: ~$1,620, a 34–38% earnings gain

$1,620

Doubly robust ATE on a control mean of $4,555 · five methods agree · refutation tests survive
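The 34–38% range follows from dividing the lowest and highest adjusted estimates by the control mean. A short check using only figures reported above:

```python
control_mean = 4555   # control-group mean earnings in 1978
estimates = {"IPW (lowest)": 1559, "AIPW": 1620, "PS matching (highest)": 1736}

# Express each estimated ATE as a percentage of the control mean.
gains = {name: 100 * ate / control_mean for name, ate in estimates.items()}

for name, pct in gains.items():
    print(f"{name}: {pct:.1f}% of the control mean")
```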

State your assumptions, identify the estimand, then let the data — and the refutations — speak.