<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>post-double-selection | Carlos Mendez</title><link>https://carlos-mendez.org/tag/post-double-selection/</link><atom:link href="https://carlos-mendez.org/tag/post-double-selection/index.xml" rel="self" type="application/rss+xml"/><description>post-double-selection</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© 2018–2026 Carlos Mendez. All rights reserved.</copyright><lastBuildDate>Mon, 25 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>post-double-selection</title><link>https://carlos-mendez.org/tag/post-double-selection/</link></image><item><title>Double LASSO in Python: Does Abortion Reduce Crime?</title><link>https://carlos-mendez.org/post/python_double_lasso/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_double_lasso/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;blockquote>
&lt;p>&lt;strong>Companion post.&lt;/strong> This tutorial is one of three siblings on the same Double LASSO case study — alongside the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R version&lt;/a> and the &lt;a href="https://carlos-mendez.org/post/stata_double_lasso/">Stata version&lt;/a>. The three posts share the data, the five estimators, and the identification story; this Python post adds a dedicated introduction to the &lt;a href="https://docs.doubleml.org/" target="_blank" rel="noopener">DoubleML&lt;/a> library in §15–§18.&lt;/p>
&lt;/blockquote>
&lt;p>This is the Python companion to the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R version&lt;/a> and &lt;a href="https://carlos-mendez.org/post/stata_double_lasso/">Stata version&lt;/a> of the Double LASSO tutorial — same data, same five-estimator narrative, same identification story — plus a &lt;strong>second part&lt;/strong> that introduces &lt;a href="https://docs.doubleml.org/" target="_blank" rel="noopener">DoubleML&lt;/a>, a modern Python framework for ML-based causal inference. The R post walks through Belloni, Chernozhukov and Hansen&amp;rsquo;s (2014) extension of Donohue and Levitt&amp;rsquo;s (2001) abortion-and-crime panel and shows that &lt;strong>Double LASSO&lt;/strong> with the &lt;em>rigorous&lt;/em> (theory-based) penalty reproduces the headline causal estimates from 284 candidate controls while CV-tuned LASSO overshoots. This post does the same computation in Python using &lt;a href="https://pyfixest.org/" target="_blank" rel="noopener">&lt;code>pyfixest&lt;/code>&lt;/a> for OLS rows, &lt;a href="https://github.com/d2cml-ai/hdmpy" target="_blank" rel="noopener">&lt;code>hdmpy&lt;/code>&lt;/a> for the rigorous LASSO, &lt;a href="https://scikit-learn.org/" target="_blank" rel="noopener">&lt;code>scikit-learn&lt;/code>&lt;/a> for cross-validated LASSO — and then introduces &lt;code>DoubleML&lt;/code>&amp;rsquo;s cross-fit &lt;code>DoubleMLPLR&lt;/code>, &lt;code>DoubleMLIRM&lt;/code>, and a learner-robustness comparison across LASSO, RandomForest, and XGBoost.&lt;/p>
&lt;p>If you have already read the R or Stata version, the &lt;strong>Part A takeaways here are unchanged&lt;/strong>. The structural reason to write a Python companion is twofold. First, reproducibility: data scientists who work in Python every day will find the friction of switching to R for one method too high, and a transparent Python implementation removes it. Second, &lt;strong>introducing &lt;code>DoubleML&lt;/code> (Bach et al. 2022) — a beautifully engineered library that ports the modern Neyman-orthogonal cross-fitting framework into a sklearn-native API&lt;/strong>. &lt;code>DoubleML&lt;/code> is the right tool for Python researchers running ML-based causal inference on production data; this post shows how to use it side-by-side with the explicit post-double-selection recipe so you can see exactly where the two approaches agree and where they diverge.&lt;/p>
&lt;p>&lt;img src="python_double_lasso_estimates.png" alt="Forest plot of α̂ ± 95 % CI for all five Part-A estimators (First diff, OLS-full, PSL, DL-rigorous, DL-CV) faceted by outcome. LASSO methods land between the no-controls baseline and the kitchen-sink OLS.">&lt;/p>
&lt;p>The figure above is the post&amp;rsquo;s spoiler — the Python version of the R/Stata headline forest plot. Each row is a different estimator; each panel is a different crime outcome. The dashed vertical line is zero: to its left, the abortion-crime relationship is &lt;em>negative&lt;/em> (more abortion is associated with less crime). Two patterns jump out, exactly as in the R/Stata companions. First, the LASSO methods (PSL, DL-rigorous, DL-CV) cluster sensibly near the original Donohue-Levitt baseline (First diff) for violent and property crime. Second, &lt;strong>OLS with all 284 controls is uninterpretable&lt;/strong> — its murder estimate is +2.34 with confidence interval [−2.76, +7.45], which would mean a unit increase in the abortion rate raises murder by 234 %. That impossibility is the failure mode that motivates LASSO in the first place.&lt;/p>
&lt;p>&lt;strong>Learning objectives.&lt;/strong> After working through this tutorial you will be able to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Explain&lt;/strong> when high-dimensional methods like LASSO add value over plain OLS, and when they do not.&lt;/li>
&lt;li>&lt;strong>Implement&lt;/strong> the Belloni-Chernozhukov-Hansen Double LASSO procedure in Python using &lt;code>hdmpy.rlasso&lt;/code> (rigorous penalty) and &lt;code>sklearn.linear_model.LassoCV&lt;/code> (cross-validated penalty), with &lt;code>pyfixest&lt;/code> for the post-OLS step.&lt;/li>
&lt;li>&lt;strong>Distinguish&lt;/strong> the &lt;em>rigorous&lt;/em> and &lt;em>cross-validated&lt;/em> penalty rules for LASSO, and recognise which is appropriate for causal inference.&lt;/li>
&lt;li>&lt;strong>Use the &lt;a href="https://docs.doubleml.org/" target="_blank" rel="noopener">DoubleML&lt;/a> library&lt;/strong> to fit &lt;code>DoubleMLPLR&lt;/code> (Partially Linear Regression with cross-fitting) and &lt;code>DoubleMLIRM&lt;/code> (Interactive Regression Model for binary treatments).&lt;/li>
&lt;li>&lt;strong>Compare ML learners&lt;/strong> (LASSO, RandomForest, XGBoost) as nuisance functions inside &lt;code>DoubleMLPLR&lt;/code> and use the spread as a robustness check.&lt;/li>
&lt;li>&lt;strong>Compute&lt;/strong> state-clustered standard errors with the HC1 finite-sample correction — both via &lt;code>pyfixest&lt;/code>&amp;rsquo;s &lt;code>vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;}&lt;/code> for OLS rows and via a hand-rolled sandwich on &lt;code>DoubleMLPLR&lt;/code>&amp;rsquo;s orthogonal scores.&lt;/li>
&lt;li>&lt;strong>Verify&lt;/strong> that the Python implementation matches the R and Stata companions to the precision allowed by each estimator&amp;rsquo;s randomness — and explain the five sources of drift that make &lt;code>DoubleML&lt;/code>&amp;rsquo;s defaults differ from R&amp;rsquo;s &lt;code>hdm&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h3 id="key-concepts-at-a-glance">Key concepts at a glance&lt;/h3>
&lt;p>The post leans on a small vocabulary. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has a one-line definition followed by a short example tied to this post&amp;rsquo;s data.&lt;/p>
&lt;p>&lt;strong>1. LASSO&lt;/strong> $\hat\beta(\lambda) = \arg\min_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \sum_j \lvert\beta_j\rvert$. L1-penalised OLS: the absolute-value penalty produces &lt;em>exactly-zero&lt;/em> coefficients (variable selection). In §7 our &lt;code>hdmpy.rlasso&lt;/code> of the abortion rate on 284 controls picks just 8 — the rest get shrunk to zero.&lt;/p>
&lt;p>&lt;strong>2. Penalty $\lambda$.&lt;/strong> The knob controlling shrinkage. Higher $\lambda$ pins more coefficients to zero. Tuning $\lambda$ is the central design choice and is what separates the rigorous and CV flavours of Double LASSO.&lt;/p>
&lt;p>&lt;strong>3. Post-Structural LASSO (PSL).&lt;/strong> One LASSO with the treatment forced in (or partialled out), then plain OLS on the selected support. The simplest one-LASSO causal estimator. We implement it via Frisch-Waugh-Lovell partialling because &lt;code>hdmpy&lt;/code> has no way to leave the treatment unpenalised: it lacks an equivalent of the &lt;code>penalty.factor&lt;/code> argument in R&amp;rsquo;s &lt;code>glmnet&lt;/code> and the &lt;code>pnotpen&lt;/code> option in Stata&amp;rsquo;s &lt;code>rlasso&lt;/code>.&lt;/p>
&lt;p>&lt;strong>4. Double LASSO (DL).&lt;/strong> Two LASSOs (y on X, d on X), union of selected controls, then post-OLS. The causal-inference-safe variant that beats PSL when controls predict $d$ but not $y$.&lt;/p>
&lt;p>&lt;strong>5. Selection sets $I_y$ and $I_d$.&lt;/strong> The indices of controls each LASSO step keeps. Their union $I_y \cup I_d$ is the support of the post-OLS regression. Their &lt;em>imbalance&lt;/em> is the empirical fingerprint of when DL adds value.&lt;/p>
&lt;p>&lt;strong>6. Rigorous vs CV penalty.&lt;/strong> Two ways to pick $\lambda$. Rigorous: Belloni-Chen-Chernozhukov-Hansen (2012) Bonferroni-style theory rule, available in Python as &lt;code>hdmpy.rlasso&lt;/code>. CV: cross-validation minimising prediction MSE, available as &lt;code>sklearn.linear_model.LassoCV&lt;/code>. Different objectives, different answers.&lt;/p>
&lt;p>&lt;strong>7. Post-OLS step.&lt;/strong> After LASSO selects a support, refit with plain (unshrunken) OLS to remove the shrinkage bias on $\hat\alpha$. LASSO is used only for &lt;em>selection&lt;/em>, never for the final estimate. We use &lt;code>pyfixest.feols(..., vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;})&lt;/code> for this step so state-clustered SEs come &amp;ldquo;for free.&amp;rdquo;&lt;/p>
&lt;p>&lt;strong>8. State-clustered standard errors.&lt;/strong> HC1-adjusted sandwich variance with state-level clustering, applied via &lt;code>pyfixest&lt;/code>&amp;rsquo;s built-in &lt;code>CRV1&lt;/code> option for the OLS rows and via a hand-rolled sandwich on the orthogonal scores for the DoubleML rows. Corrects for within-state autocorrelation that would otherwise understate the SE on a 48-state × 12-year panel.&lt;/p>
&lt;p>A note on the Python libraries. The five-library stack maps cleanly onto the R workflow:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Python library&lt;/th>
&lt;th>R equivalent&lt;/th>
&lt;th>What it does in this post&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;a href="https://pyfixest.org/" target="_blank" rel="noopener">&lt;code>pyfixest&lt;/code>&lt;/a>&lt;/td>
&lt;td>&lt;code>fixest&lt;/code> / &lt;code>lm&lt;/code> + &lt;code>sandwich&lt;/code>&lt;/td>
&lt;td>OLS rows with &lt;code>vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;}&lt;/code> for state-clustered SE&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/d2cml-ai/hdmpy" target="_blank" rel="noopener">&lt;code>hdmpy&lt;/code>&lt;/a>&lt;/td>
&lt;td>&lt;code>hdm::rlasso&lt;/code>&lt;/td>
&lt;td>Rigorous-penalty LASSO with BCH (c=1.1, gamma=0.05) defaults&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://scikit-learn.org/" target="_blank" rel="noopener">&lt;code>scikit-learn&lt;/code>&lt;/a>&lt;/td>
&lt;td>&lt;code>glmnet::cv.glmnet&lt;/code>&lt;/td>
&lt;td>Cross-validated LASSO via &lt;code>LassoCV&lt;/code> and &lt;code>KFold&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://docs.doubleml.org/" target="_blank" rel="noopener">&lt;code>DoubleML&lt;/code>&lt;/a>&lt;/td>
&lt;td>&lt;code>DoubleML&lt;/code> for R (the same algorithm)&lt;/td>
&lt;td>Modern cross-fit Neyman-orthogonal estimation (Part B)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://xgboost.readthedocs.io/" target="_blank" rel="noopener">&lt;code>xgboost&lt;/code>&lt;/a>&lt;/td>
&lt;td>&lt;code>xgboost&lt;/code>&lt;/td>
&lt;td>Boosted-trees nuisance learner in the §18 learner-robustness check&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="2-the-data">2. The data&lt;/h2>
&lt;p>We use the exact panel that &lt;a href="#22-references">Belloni, Chernozhukov and Hansen (2014)&lt;/a> compiled from &lt;a href="#22-references">Donohue and Levitt&amp;rsquo;s (2001)&lt;/a> original replication archive: &lt;strong>48 U.S. states × 12 years (1986-1997) after first-differencing the raw 13-year 1985-1997 panel, giving 576 observations.&lt;/strong> First-differencing absorbs state fixed effects. Year fixed effects are absorbed in a separate pre-processing step using the Frisch-Waugh-Lovell projection (see §7). By the time the analysis script sees the data, both fixed-effect adjustments are done, so the LASSO regressions below contain no time dummies.&lt;/p>
&lt;p>&lt;strong>Code chunk 1 — Loading the six CSVs over HTTPS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd
BASE = (&amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/&amp;quot;
&amp;quot;master/content/post/r_double_lasso/data/&amp;quot;)
state = pd.read_csv(BASE + &amp;quot;levitt_state.csv&amp;quot;)[&amp;quot;state&amp;quot;].to_numpy()
linear = pd.read_csv(BASE + &amp;quot;levitt_linear.csv&amp;quot;)
partialled = pd.read_csv(BASE + &amp;quot;levitt_partialled.csv&amp;quot;)
ctrl_viol = pd.read_csv(BASE + &amp;quot;levitt_controls_viol.csv&amp;quot;)
ctrl_prop = pd.read_csv(BASE + &amp;quot;levitt_controls_prop.csv&amp;quot;)
ctrl_murd = pd.read_csv(BASE + &amp;quot;levitt_controls_murd.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>Six CSVs, six &lt;code>pd.read_csv&lt;/code> calls. No local file dependencies, no Matlab files — the entire data layer is portable across machines and operating systems.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>File&lt;/th>
&lt;th>Shape&lt;/th>
&lt;th>What it contains&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>levitt_state.csv&lt;/code>&lt;/td>
&lt;td>576 × 1&lt;/td>
&lt;td>State cluster id (1-48) for each observation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_linear.csv&lt;/code>&lt;/td>
&lt;td>576 × 7&lt;/td>
&lt;td>Raw first-differences of the outcomes and treatment (&lt;code>Dyv, Dxv, Dyp, Dxp, Dym, Dxm&lt;/code>)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_partialled.csv&lt;/code>&lt;/td>
&lt;td>576 × 7&lt;/td>
&lt;td>Same series after year-FE absorption (&lt;code>DyV, DxV, DyP, DxP, DyM, DxM&lt;/code>)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_controls_viol.csv&lt;/code>&lt;/td>
&lt;td>576 × 284&lt;/td>
&lt;td>Control matrix $Z_v$ for the violent-crime equation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_controls_prop.csv&lt;/code>&lt;/td>
&lt;td>576 × 284&lt;/td>
&lt;td>Control matrix $Z_p$ for the property-crime equation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_controls_murd.csv&lt;/code>&lt;/td>
&lt;td>576 × 284&lt;/td>
&lt;td>Control matrix $Z_m$ for the murder equation&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The dimensions matter for the LASSO methods that follow. We are in the &lt;strong>moderate-dimensional&lt;/strong> regime: $p = 284$ is large but smaller than $n = 576$, so OLS is technically feasible but unstable, and LASSO is the natural tool to discipline the variable selection.&lt;/p>
&lt;hr>
&lt;h2 id="3-five-estimators-in-plain-language">3. Five estimators in plain language&lt;/h2>
&lt;p>Five regression procedures appear in Part A, each with a different attitude toward how many controls to keep. We summarise the cast here so you can navigate the rest of the article.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Estimator&lt;/th>
&lt;th>Recipe in one sentence&lt;/th>
&lt;th>Python library&lt;/th>
&lt;th>Section&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>First-difference OLS&lt;/strong>&lt;/td>
&lt;td>Regress differenced crime on differenced abortion with &lt;strong>no&lt;/strong> controls, the original Donohue-Levitt (2001) specification.&lt;/td>
&lt;td>&lt;code>pyfixest&lt;/code>&lt;/td>
&lt;td>§4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>OLS (full)&lt;/strong>&lt;/td>
&lt;td>Add all 284 controls and let the matrix algebra sort it out.&lt;/td>
&lt;td>&lt;code>pyfixest&lt;/code>&lt;/td>
&lt;td>§5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSL&lt;/strong> (Post-Structural LASSO)&lt;/td>
&lt;td>FWL-partial out the treatment, then one &lt;code>hdmpy.rlasso&lt;/code> on the residualised controls, then post-OLS on the selected support.&lt;/td>
&lt;td>&lt;code>hdmpy&lt;/code> + &lt;code>pyfixest&lt;/code>&lt;/td>
&lt;td>§6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DL (rigorous)&lt;/strong>&lt;/td>
&lt;td>Two LASSOs (y on X, d on X) with the Belloni-et-al. theory-based penalty; refit OLS on the &lt;strong>union&lt;/strong> of selected variables.&lt;/td>
&lt;td>&lt;code>hdmpy&lt;/code> + &lt;code>pyfixest&lt;/code>&lt;/td>
&lt;td>§7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DL (CV)&lt;/strong>&lt;/td>
&lt;td>Same recipe as DL-rigorous but each LASSO uses 3-fold cross-validation to pick lambda.&lt;/td>
&lt;td>&lt;code>sklearn&lt;/code> + &lt;code>pyfixest&lt;/code>&lt;/td>
&lt;td>§10&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Two pairs of estimators do most of the pedagogical work. First-diff vs OLS-full is the &lt;em>control-count&lt;/em> contrast (no controls vs too many controls). DL-rigorous vs DL-CV is the &lt;em>penalty-rule&lt;/em> contrast (theory vs data-driven). PSL sits in between as the simplest one-LASSO benchmark.&lt;/p>
&lt;p>Part B (§16-§18) adds three more estimators that come from the DoubleML framework: &lt;code>DoubleMLPLR&lt;/code> (the cross-fit version of DL), &lt;code>DoubleMLIRM&lt;/code> (for binary treatments), and three learner variants of &lt;code>DoubleMLPLR&lt;/code> (LASSO vs RandomForest vs XGBoost). The full menu is &lt;strong>eight estimators&lt;/strong> by the end of the post.&lt;/p>
&lt;hr>
&lt;h2 id="4-first-difference-ols--the-no-controls-baseline">4. First-difference OLS — the no-controls baseline&lt;/h2>
&lt;p>The original Donohue-Levitt (2001) specification regresses differenced crime on differenced abortion with no controls beyond first-differencing itself:&lt;/p>
&lt;p>$$
\Delta y_{st} = \alpha \, \Delta d_{st} + \varepsilon_{st}.
$$&lt;/p>
&lt;p>Here, $\Delta y_{st}$ is the change in the crime rate for state $s$ from year $t-1$ to $t$, $\Delta d_{st}$ is the change in the effective abortion rate, and $\varepsilon_{st}$ is the regression error. The parameter $\alpha$ is the &lt;strong>average partial effect of the differenced abortion rate on the differenced crime rate&lt;/strong>, identified under (i) conditional independence given the differenced trajectories and (ii) parallel trends in levels.&lt;/p>
&lt;p>&lt;strong>Code chunk 2 — The first-difference OLS in Python using &lt;code>pyfixest&lt;/code>:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">import pyfixest as pf
import pandas as pd
df = pd.DataFrame({&amp;quot;y&amp;quot;: linear[&amp;quot;Dyv&amp;quot;], &amp;quot;d&amp;quot;: linear[&amp;quot;Dxv&amp;quot;], &amp;quot;state&amp;quot;: state})
fit = pf.feols(&amp;quot;y ~ -1 + d&amp;quot;, data=df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;})
fit.summary()  # prints the table itself, so no print() wrapper is needed
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">###
Estimation: OLS
Dep. var.: y, Fixed effects:
Inference: CRV1
Observations: 576
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| d | -0.1521 | 0.0337 | -4.5165 | 0.0000 | -0.218 | -0.086 |
###
&lt;/code>&lt;/pre>
&lt;p>Three things to notice. First, the formula uses &lt;code>-1&lt;/code> to suppress the intercept — first-differencing absorbs both the level and the state fixed effect, so the regression mean is zero by construction. Second, the &lt;code>vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;}&lt;/code> keyword triggers &lt;code>pyfixest&lt;/code>&amp;rsquo;s cluster-robust sandwich estimator with the HC1 small-sample correction $(N-1)/(N-k) \cdot G/(G-1)$, which is exactly the formula used in the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion&lt;/a> and &lt;a href="https://carlos-mendez.org/post/stata_double_lasso/">Stata companion&lt;/a>. Third, &lt;code>pyfixest&lt;/code> returns a fitted object that exposes &lt;code>.coef()&lt;/code>, &lt;code>.se()&lt;/code>, &lt;code>.confint()&lt;/code>, and &lt;code>.summary()&lt;/code> — clean accessors that make downstream programmatic use easy.&lt;/p>
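&lt;p>To make the small-sample correction concrete, the sketch below evaluates the factor quoted above and hand-rolls a cluster-robust standard error for a single-regressor, no-intercept OLS. It is a minimal illustration in plain Python, not &lt;code>pyfixest&lt;/code>&amp;rsquo;s internal code: the function names &lt;code>crv1_factor&lt;/code> and &lt;code>ols_no_intercept_crv1&lt;/code> are hypothetical, and the six data points are invented.&lt;/p>

```python
from collections import defaultdict


def crv1_factor(n, k, G):
    """Small-sample factor (N-1)/(N-k) * G/(G-1) quoted in the text."""
    return (n - 1) / (n - k) * G / (G - 1)


def ols_no_intercept_crv1(y, d, cluster):
    """No-intercept OLS of y on d with a cluster-robust sandwich SE."""
    sdd = sum(di * di for di in d)
    alpha = sum(di * yi for di, yi in zip(d, y)) / sdd
    # Sum the scores d_i * e_i within each cluster before squaring.
    scores = defaultdict(float)
    for yi, di, g in zip(y, d, cluster):
        scores[g] += di * (yi - alpha * di)
    meat = sum(s * s for s in scores.values())
    adj = crv1_factor(len(y), 1, len(scores))
    se = (adj * meat) ** 0.5 / sdd
    return alpha, se


# Toy numbers, invented for illustration only (not the Levitt panel).
y = [-0.2, -0.3, -0.5, -0.5, -0.8, -0.9]
d = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cluster = [1, 1, 2, 2, 3, 3]

alpha_hat, se_hat = ols_no_intercept_crv1(y, d, cluster)
print(crv1_factor(576, 1, 48))  # factor for N = 576, k = 1, G = 48
print(alpha_hat, se_hat)
```

&lt;p>With one regressor the first ratio equals one, so for this panel the factor reduces to $G/(G-1) = 48/47$, an inflation of the variance by roughly 2 %.&lt;/p>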
&lt;p>Running this regression for each of the three crime outcomes gives our baseline numbers:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE (state-clustered)&lt;/th>
&lt;th>95 % CI&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1521&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0337&lt;/td>
&lt;td>[−0.218, −0.086]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1084&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0219&lt;/td>
&lt;td>[−0.151, −0.065]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.2039&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0667&lt;/td>
&lt;td>[−0.335, −0.073]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Reading the violent-crime coefficient:&lt;/strong> a one-unit increase in the differenced effective abortion rate is associated with a &lt;strong>0.152-unit decrease&lt;/strong> in the differenced violent-crime rate. All three estimates are negative and statistically significant at the 5 % level; this is the Donohue-Levitt finding, and it matches the R companion&amp;rsquo;s &lt;code>cluster_se&lt;/code> implementation to four decimal places. The whole point of the LASSO methods below is to ask whether this picture survives when we let 284 candidate controls compete for inclusion.&lt;/p>
&lt;hr>
&lt;h2 id="5-kitchen-sink-ols--why-we-cannot-just-add-everything">5. Kitchen-sink OLS — why we cannot just add everything&lt;/h2>
&lt;p>A natural reaction to &amp;ldquo;you only used 8 controls&amp;rdquo; is to add all 284 and let OLS sort it out. With $p = 284 &amp;lt; n = 576$ the $X'X$ matrix is technically invertible, so the procedure runs. The output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE&lt;/th>
&lt;th>95 % CI&lt;/th>
&lt;th>Sign matches baseline?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>+0.0135&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.5654&lt;/td>
&lt;td>[−1.09, +1.12]&lt;/td>
&lt;td>no — flips sign&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1950&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.1937&lt;/td>
&lt;td>[−0.57, +0.18]&lt;/td>
&lt;td>yes (but CI crosses zero)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>+2.3426&lt;/strong>&lt;/td>
&lt;td style="text-align:right">2.6047&lt;/td>
&lt;td>[−2.76, +7.45]&lt;/td>
&lt;td>no — flips dramatically&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The violent-crime point estimate has flipped sign (+0.014 vs the baseline&amp;rsquo;s −0.152) and its confidence interval crosses zero. The murder estimate has exploded to &lt;strong>+2.34&lt;/strong> with SE = 2.60, meaning a unit increase in the abortion rate would raise murder by 234 % — clearly an artefact of the extreme multicollinearity in the 284 controls and not a credible causal estimate.&lt;/p>
&lt;p>To see why, recall the OLS estimator in matrix form:&lt;/p>
&lt;p>$$
\hat\beta_{\text{OLS}} = (X'X)^{-1} X' y, \qquad
\widehat{\operatorname{Var}}(\hat\beta_{\text{OLS}}) = \hat\sigma^{2} \, (X'X)^{-1}.
$$&lt;/p>
&lt;p>Here, $X$ is the $n \times p$ design matrix (the treatment plus 284 controls), $y$ is the $n \times 1$ outcome vector, and $\hat\sigma^2$ is the estimated residual variance. The variance of any coefficient, including the treatment effect, depends on $(X'X)^{-1}$. &lt;strong>When the columns of $X$ are nearly collinear, the smallest eigenvalues of $X'X$ approach zero and its inverse blows up.&lt;/strong> Our implementation uses a rank-revealing QR pivot to drop linearly dependent columns before the sandwich computation (matching Stata&amp;rsquo;s &lt;code>regress&lt;/code> behaviour), which yields larger SEs than R&amp;rsquo;s &lt;code>MASS::ginv()&lt;/code> fallback. Both are mathematically valid, and both reach the same qualitative conclusion: &lt;strong>kitchen-sink OLS is uninterpretable here&lt;/strong>. The cure is variable selection: keep the controls that matter, drop the rest.&lt;/p>
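&lt;p>The blow-up is easy to see in the smallest possible case. The sketch below assumes just two standardised columns with sample correlation $\rho$, so that the matrix to invert is $n$ times a 2 × 2 correlation matrix and each diagonal entry of the inverse equals $1 / (n (1 - \rho^2))$. The helper name &lt;code>inverse_diag_two_columns&lt;/code> is hypothetical, and the two-column setting is a simplification of the 284-control design, not a computation on the actual data.&lt;/p>

```python
def inverse_diag_two_columns(rho, n):
    """Diagonal entry of the inverse Gram matrix for two unit-variance
    columns with sample correlation rho: 1 / (n * (1 - rho**2))."""
    return 1.0 / (n * (1.0 - rho ** 2))


n = 576  # sample size from the post
baseline = inverse_diag_two_columns(0.0, n)

for rho in (0.0, 0.9, 0.99, 0.999):
    inflation = inverse_diag_two_columns(rho, n) / baseline
    print(f"rho = {rho}: variance inflated by a factor of {inflation:.1f}")
```

&lt;p>The inflation factor $1 / (1 - \rho^2)$ grows without bound as $\rho$ approaches one, which is the two-column version of the eigenvalue argument above.&lt;/p>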
&lt;hr>
&lt;h2 id="6-lasso-and-the-one-lasso-benchmark-psl">6. LASSO and the one-LASSO benchmark (PSL)&lt;/h2>
&lt;p>The Least Absolute Shrinkage and Selection Operator (&lt;a href="#22-references">Tibshirani 1996&lt;/a>) modifies the OLS minimisation by adding an L1 penalty on the coefficients:&lt;/p>
&lt;p>$$
\hat\beta_{\text{LASSO}}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \;
\frac{1}{2n} \| y - X\beta \|_2^2 \, + \, \lambda \sum_{j=1}^p \lvert\beta_j\rvert.
$$&lt;/p>
&lt;p>The first term is the usual sum of squared residuals. The second is the penalty: $\lambda$ times the sum of the &lt;em>absolute values&lt;/em> of the coefficients. The absolute-value penalty has a corner at zero — unlike a squared penalty (which would give Ridge regression), LASSO can shrink coefficients &lt;strong>exactly&lt;/strong> to zero, performing variable selection at the same time as estimation. The strength of selection is controlled by one knob $\lambda$: at $\lambda = 0$ we recover OLS; as $\lambda \to \infty$ all coefficients are pinned to zero.&lt;/p>
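&lt;p>The corner at zero can be made tangible with the one case where LASSO has a closed form. Assuming the columns of $X$ are orthonormal after scaling ($X'X / n = I$), the objective above separates by coordinate and each coefficient is the soft-thresholded value of its unpenalised counterpart $z_j = x_j' y / n$. The sketch below is a minimal illustration under that assumption; the function name &lt;code>soft_threshold&lt;/code> and the four input coefficients are invented, and real designs such as ours are not orthonormal.&lt;/p>

```python
def soft_threshold(z, lam):
    """Closed-form LASSO solution for one coefficient under an orthonormal
    design: shrink z toward zero by lam, and return exactly zero when the
    absolute value of z does not exceed lam."""
    if z > lam:
        return z - lam
    if -z > lam:
        return z + lam
    return 0.0


# Hypothetical unpenalised coefficients, invented for illustration.
ols_coefs = [0.80, -0.30, 0.05, -0.02]

for lam in (0.0, 0.1, 0.5, 1.0):
    shrunk = [round(soft_threshold(z, lam), 2) for z in ols_coefs]
    print(lam, shrunk)
```

&lt;p>At $\lambda = 0$ the coefficients are returned unchanged; as $\lambda$ grows, the small ones hit zero first and eventually all of them do, which mirrors the two limits described above.&lt;/p>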
&lt;p>&lt;strong>Post-Structural LASSO (PSL)&lt;/strong> is the simplest LASSO-based causal estimator. Run one LASSO on $y$ regressed on $(d, X)$, but ensure the treatment $d$ is not selected away by LASSO&amp;rsquo;s shrinkage. In R, &lt;code>glmnet::cv.glmnet(penalty.factor = c(0, rep(1, p)))&lt;/code> does this directly. In Stata, &lt;code>rlasso ... pnotpen(d)&lt;/code> does it. In Python, &lt;code>hdmpy.rlasso&lt;/code> does &lt;em>not&lt;/em> expose a &lt;code>pnotpen&lt;/code> argument — so we implement the equivalent recipe via &lt;strong>Frisch-Waugh-Lovell partialling&lt;/strong>:&lt;/p>
&lt;p>&lt;strong>Code chunk 3 — PSL in Python using FWL partialling + &lt;code>hdmpy.rlasso&lt;/code>:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import hdmpy
def partial_out_d(arr, d):
    &amp;quot;&amp;quot;&amp;quot;Regress arr on d (no intercept) and return the residual.&amp;quot;&amp;quot;&amp;quot;
    d_col = d.reshape(-1, 1)
    beta = np.linalg.lstsq(d_col, arr, rcond=None)[0]
    fitted = d_col @ beta
    return arr - fitted if arr.ndim == 2 else arr - fitted.ravel()


def psl_fit(y, d, X, state):
    y_tilde = partial_out_d(y, d)   # residualise y on d
    X_tilde = partial_out_d(X, d)   # residualise each X column on d
    fit = hdmpy.rlasso(X_tilde, y_tilde, post=False, intercept=False,
                       c=1.1, gamma=0.05)
    beta = np.asarray(fit.est[&amp;quot;beta&amp;quot;]).flatten()
    sel = np.where(np.abs(beta) &amp;gt; 1e-10)[0]   # indices of nonzero coefficients
    Xsel = X[:, sel] if sel.size &amp;gt; 0 else np.empty((len(y), 0))
    # feols_clustered is the post-OLS helper (not shown in this chunk).
    return feols_clustered(y, d, Xsel, state)      # post-OLS, CRV1 SE
&lt;/code>&lt;/pre>
&lt;p>Two important notes. First, the FWL partialling step replaces the &lt;code>penalty.factor=0&lt;/code> mechanism: by removing $d$&amp;rsquo;s effect from both $y$ and $X$ before the LASSO step, we get the same conditional-on-$d$ selection that the unpenalised treatment in R/Stata would give. In the orthogonal-design limit the two are mathematically equivalent; in finite samples they differ slightly. Second, &lt;code>hdmpy.rlasso&lt;/code> implements the BCH &lt;strong>rigorous penalty&lt;/strong>, here with c=1.1 and gamma=0.05, the same settings used with &lt;code>hdm::rlasso&lt;/code> in the R companion and &lt;code>rlasso&lt;/code> in the Stata companion. We pass them explicitly so the cross-language consistency is visible and does not hinge on any library&amp;rsquo;s default values.&lt;/p>
&lt;p>The results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE&lt;/th>
&lt;th style="text-align:right"># controls selected&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1553&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0330&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1015&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0218&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.2061&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0514&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>PSL with the rigorous penalty is extremely parsimonious: for all three outcomes, zero controls survive, so the post-OLS reduces to a regression of $y$ on $d$ alone. The point estimates land within 0.007 of the first-difference baseline (violent: −0.1553 vs −0.1521; property: −0.1015 vs −0.1084; murder: −0.2061 vs −0.2039), within 0.001 of the paper&amp;rsquo;s reported PSL numbers, and within 0.001 of the &lt;a href="https://carlos-mendez.org/post/stata_double_lasso/">Stata companion&lt;/a>&amp;rsquo;s rigorous-penalty PSL. The &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion&lt;/a> uses CV-tuned PSL instead (3-fold &lt;code>cv.glmnet&lt;/code>) and gets 3 / 12 / 0 controls per outcome; that is a different implementation choice with the same qualitative conclusion.&lt;/p>
&lt;p>&lt;strong>Why is this not the end of the story?&lt;/strong> Because PSL has a causal-inference blind spot. LASSO selects controls based on how well they predict $y$. But a covariate can be a &lt;em>confounder&lt;/em> — biasing $\hat\alpha$ if omitted — even when it does not predict $y$ strongly. Imagine a variable highly correlated with the treatment $d$ but only weakly with $y$. PSL&amp;rsquo;s one LASSO will drop it (it does not improve prediction of $y$ much), and the post-OLS will inherit the omitted-variable bias. &lt;a href="#22-references">Belloni, Chernozhukov and Hansen (2014)&lt;/a> made exactly this point, and proposed Double LASSO as the fix.&lt;/p>
&lt;hr>
&lt;h2 id="7-double-lasso--the-causal-side-fix">7. Double LASSO — the causal-side fix&lt;/h2>
&lt;p>Double LASSO runs &lt;strong>two&lt;/strong> LASSOs, not one. The first LASSO predicts the outcome $y$ from the controls; call its selected index set $I_y$. The second LASSO predicts the treatment $d$ from the same controls; call its selected index set $I_d$. The final estimate of $\alpha$ comes from a plain OLS regression of $y$ on $d$ and the &lt;strong>union&lt;/strong> $I_y \cup I_d$, with state-clustered standard errors.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart TD
A[&amp;quot;Data: outcome y, treatment d,&amp;lt;br/&amp;gt;controls X (p = 284)&amp;quot;] --&amp;gt; B[&amp;quot;Step 1: hdmpy.rlasso(X, y)&amp;lt;br/&amp;gt;(no d on right-hand side)&amp;lt;br/&amp;gt;selected set I_y&amp;quot;]
A --&amp;gt; C[&amp;quot;Step 2: hdmpy.rlasso(X, d)&amp;lt;br/&amp;gt;(no y on right-hand side)&amp;lt;br/&amp;gt;selected set I_d&amp;quot;]
B --&amp;gt; D[&amp;quot;Union: I_y &amp;amp;cup; I_d&amp;quot;]
C --&amp;gt; D
D --&amp;gt; E[&amp;quot;Step 3: pyfixest.feols&amp;lt;br/&amp;gt;y ~ -1 + d + X[:, union]&amp;lt;br/&amp;gt;vcov={CRV1: state}&amp;quot;]
E --&amp;gt; F[&amp;quot;Causal estimate alpha-hat&amp;quot;]
style A fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style B fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style C fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style D fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style E fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style F fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
&lt;/code>&lt;/pre>
&lt;p>The intuition is rooted in the &lt;strong>Frisch-Waugh-Lovell theorem&lt;/strong>. To estimate $\alpha$ in the structural equation $y_i = \alpha\, d_i + x_i' \theta + \zeta_i$, FWL says we can residualise both $y$ and $d$ against the same set of controls and regress the residuals:&lt;/p>
&lt;p>$$
\hat\alpha = \bigl(\tilde d' \tilde d\bigr)^{-1} \tilde d' \tilde y, \quad \text{where} \quad \tilde y = M_X y, \, \tilde d = M_X d.
$$&lt;/p>
&lt;p>The trick is that we do not need to use &lt;em>all&lt;/em> of $X$ in the residualisation. Dropping a control biases $\hat\alpha$ only if that control is both correlated with $d$ and relevant for $y$, so it is enough to keep the controls that matter for either equation. Double LASSO does this approximately: $I_d$ catches the controls correlated with $d$; $I_y$ catches the controls correlated with $y$; their union catches both.&lt;/p>
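&lt;p>The FWL identity is easy to check numerically. The sketch below uses synthetic data (the sample size, the coefficients, and the helper name &lt;code>residualise&lt;/code> are illustrative assumptions, not part of this post&amp;rsquo;s pipeline) and compares the coefficient on $d$ from the long regression with the residual-on-residual estimate.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5

# Synthetic data (illustrative only): d depends on X, y depends on d and X.
X = rng.normal(size=(n, p))
d = X @ rng.normal(size=p) + rng.normal(size=n)
y = 0.5 * d + X @ rng.normal(size=p) + rng.normal(size=n)

# Route 1: one long regression of y on [d, X]; keep the coefficient on d.
W = np.column_stack([d, X])
alpha_long = np.linalg.lstsq(W, y, rcond=None)[0][0]


def residualise(v, X):
    # Residual of v after an OLS projection on the columns of X.
    coef = np.linalg.lstsq(X, v, rcond=None)[0]
    return v - X @ coef


# Route 2 (FWL): residualise y and d on X, then regress residual on residual.
y_tilde = residualise(y, X)
d_tilde = residualise(d, X)
alpha_fwl = (d_tilde @ y_tilde) / (d_tilde @ d_tilde)

print(alpha_long, alpha_fwl)  # the two routes agree up to floating-point error
```

&lt;p>The two numbers coincide when the full $X$ is used on both routes. Double LASSO replaces $X$ with the selected union, which is where the approximation enters.&lt;/p>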
&lt;p>The &amp;ldquo;rigorous&amp;rdquo; penalty rule chooses $\lambda$ from theory, not from CV. &lt;a href="#22-references">Belloni, Chen, Chernozhukov and Hansen (2012)&lt;/a> showed that the right scaling is&lt;/p>
&lt;p>$$
\lambda^{\text{rig}} = \frac{2 c \, \hat\sigma}{\sqrt{n}} \, \Phi^{-1}\!\left(1 - \frac{\gamma}{2 p}\right), \quad c = 1.1, \, \gamma = 0.05,
$$&lt;/p>
&lt;p>where $\hat\sigma$ is a pilot estimate of the residual standard deviation, $n$ is the sample size, $p$ is the number of candidate controls, and $\Phi^{-1}$ is the inverse standard-normal CDF. The factor $\Phi^{-1}(1 - \gamma / (2p))$ is a Bonferroni-style correction that keeps the false-positive rate of LASSO selection under control even though we are testing $p$ coefficients.&lt;/p>
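&lt;p>To get a feel for the magnitudes, the formula can be evaluated directly. This is a minimal sketch using only the Python standard library: the function name &lt;code>rigorous_lambda&lt;/code> and the placeholder $\hat\sigma = 1$ are assumptions for illustration, and the real &lt;code>hdmpy.rlasso&lt;/code> additionally iterates on $\hat\sigma$ and applies penalty loadings, which this sketch ignores.&lt;/p>

```python
from math import sqrt
from statistics import NormalDist


def rigorous_lambda(n, p, sigma_hat, c=1.1, gamma=0.05):
    # lambda = 2 * c * sigma / sqrt(n) * Phi^{-1}(1 - gamma / (2 * p))
    z = NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))
    return 2.0 * c * sigma_hat / sqrt(n) * z


# n and p are this post's dimensions; sigma_hat = 1.0 is a placeholder.
z_crit = NormalDist().inv_cdf(1.0 - 0.05 / (2.0 * 284))
lam = rigorous_lambda(n=576, p=284, sigma_hat=1.0)
print(round(z_crit, 3), round(lam, 3))
```

&lt;p>With $p = 284$ the Bonferroni-style critical value is roughly 3.75 rather than the familiar 1.96, which is why so few controls clear the bar.&lt;/p>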
&lt;p>&lt;strong>Code chunk 4 — The two rigorous LASSOs and the post-OLS in Python:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import hdmpy

def selected_idx_rlasso(fit, tol=1e-10):
    # Indices of the controls whose rigorous-LASSO coefficient is non-zero.
    beta = np.asarray(fit.est[&amp;quot;beta&amp;quot;]).flatten()
    return np.where(np.abs(beta) &amp;gt; tol)[0]

def dl_rigorous_fit(y, d, X, state):
    # Steps 1 and 2: rigorous LASSO of y on X, then of d on X.
    fit_y = hdmpy.rlasso(X, y, post=False, intercept=False, c=1.1, gamma=0.05)
    fit_d = hdmpy.rlasso(X, d, post=False, intercept=False, c=1.1, gamma=0.05)
    Iy = selected_idx_rlasso(fit_y)
    Id = selected_idx_rlasso(fit_d)
    # Step 3: post-OLS of y on d and the union, clustered by state.
    # feols_clustered is defined outside this chunk.
    U = np.sort(np.unique(np.concatenate([Iy, Id])))
    return feols_clustered(y, d, X[:, U], state), Iy, Id, U
&lt;/code>&lt;/pre>
&lt;p>A few notes. &lt;code>intercept=False&lt;/code> is correct because the data has already been partialled for year fixed effects (so the column means are essentially zero). &lt;code>post=False&lt;/code> returns the raw LASSO coefficients rather than &lt;code>hdmpy&lt;/code>&amp;rsquo;s internal post-OLS refit — we run our own post-OLS via &lt;code>pyfixest&lt;/code> so we can attach state-clustered standard errors. The constants &lt;code>c=1.1, gamma=0.05&lt;/code> are the BCH defaults that R&amp;rsquo;s &lt;code>hdm::rlasso&lt;/code> and Stata&amp;rsquo;s &lt;code>rlasso&lt;/code> also use; passing them explicitly makes the cross-language consistency visible.&lt;/p>
&lt;p>The results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE&lt;/th>
&lt;th>95 % CI&lt;/th>
&lt;th style="text-align:right">|I_y|&lt;/th>
&lt;th style="text-align:right">|I_d|&lt;/th>
&lt;th style="text-align:right">Union&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1043&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.1067&lt;/td>
&lt;td>[−0.313, +0.105]&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.0302&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0550&lt;/td>
&lt;td>[−0.138, +0.078]&lt;/td>
&lt;td style="text-align:right">3&lt;/td>
&lt;td style="text-align:right">9&lt;/td>
&lt;td style="text-align:right">12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1253&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.1506&lt;/td>
&lt;td>[−0.421, +0.170]&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">9&lt;/td>
&lt;td style="text-align:right">9&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Reading the violent-crime row.&lt;/strong> $\hat\alpha = -0.1043$ means a unit increase in the differenced effective abortion rate is associated with a 0.104-unit decrease in the differenced violent-crime rate, conditional on the 8 controls in the union. The 95 % confidence interval [−0.313, +0.105] now contains zero: once we condition on the 8 controls the d-equation LASSO selects, the violent-crime effect is no longer statistically significant at the 5 % level. &lt;strong>The selection counts |I_y| = 0, |I_d| = 8 are exact matches to the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion&lt;/a> and to Fitzgerald et al.&amp;rsquo;s Table 2 (line 210).&lt;/strong> All six selection-count cells in the table match exactly across both languages, and the violent-crime point estimate matches the paper to three decimals (−0.104 vs −0.104). Property crime (|I_y|=3, |I_d|=9, point −0.0302 vs paper −0.030) and murder (|I_y|=0, |I_d|=9, point −0.1253 vs paper −0.125) match the paper equally closely.&lt;/p>
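&lt;p>The intervals in the table can be rebuilt from the point estimates and standard errors. The sketch below assumes a normal critical value of 1.96 (an assumption; the post does not state which critical value the script uses) and takes its inputs from the rounded table entries, so the last digit can differ from the printed interval.&lt;/p>

```python
# Point estimates and clustered SEs copied from the table above.
rows = {
    "violent": (-0.1043, 0.1067),
    "property": (-0.0302, 0.0550),
    "murder": (-0.1253, 0.1506),
}
z = 1.96  # assumed normal critical value for a 95 % interval

ci = {}
for name, (alpha_hat, se) in rows.items():
    ci[name] = (alpha_hat - z * se, alpha_hat + z * se)
    print(name, round(ci[name][0], 3), round(ci[name][1], 3))
```

&lt;p>All three intervals straddle zero, which is the sense in which none of the rigorous Double LASSO estimates is significant at the 5 % level.&lt;/p>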
&lt;hr>
&lt;h2 id="8-state-clustered-standard-errors">8. State-clustered standard errors&lt;/h2>
&lt;p>A digression on the standard errors. The 576 observations are not independent — they are 12 differenced years of data for each of 48 states, and within-state observations are autocorrelated through governor effects, state policy waves, and business-cycle exposure. Treating them as independent would understate the uncertainty by about 40 % on this panel. We use a cluster-robust sandwich estimator with the HC1 finite-sample adjustment (&lt;a href="#22-references">Cameron and Miller 2015&lt;/a>):&lt;/p>
&lt;p>$$
\hat V_{\text{cluster}} = \frac{n-1}{n-k} \cdot \frac{G}{G-1} \cdot (X'X)^{-1} \cdot \left(\sum_{g=1}^G X_g' \hat e_g \hat e_g' X_g\right) \cdot (X'X)^{-1}.
$$&lt;/p>
&lt;p>The &amp;ldquo;sandwich&amp;rdquo; name comes from the structure: two slices of bread $(X'X)^{-1}$ around the meat $\sum_g X_g' \hat e_g \hat e_g' X_g$, the cluster-summed outer product of the within-cluster scores. The two front factors are the small-sample correction: $(n-1)/(n-k)$ adjusts for the degrees of freedom consumed by the regressors, and $G/(G-1)$ adjusts for the number of clusters. Here $n = 576$, $k$ is the number of fitted columns (varies by estimator), and $G = 48$ is the number of states.&lt;/p>
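&lt;p>The formula translates almost line by line into NumPy. The following is a minimal sketch on a synthetic panel with the same shape as ours (48 clusters of 12 rows); the function name &lt;code>crv1_vcov&lt;/code> and the data-generating choices are assumptions for illustration, not the code behind the reported numbers.&lt;/p>

```python
import numpy as np


def crv1_vcov(X, resid, clusters):
    # Cluster-robust (CRV1) covariance matrix of the OLS coefficients.
    n, k = X.shape
    groups = np.unique(clusters)
    G = len(groups)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in groups:
        idx = clusters == g
        score_g = X[idx].T @ resid[idx]  # k-vector X_g' e_g
        meat += np.outer(score_g, score_g)
    adj = (n - 1) / (n - k) * G / (G - 1)
    return adj * bread @ meat @ bread


# Synthetic panel: regressor and error both carry a shared within-state shock.
rng = np.random.default_rng(1)
G, T = 48, 12
clusters = np.repeat(np.arange(G), T)
x = rng.normal(size=G * T) + np.repeat(rng.normal(size=G), T)
e = rng.normal(size=G * T) + np.repeat(rng.normal(size=G), T)
y = 0.3 * x + e

X = x[:, None]
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
se_cluster = np.sqrt(np.diag(crv1_vcov(X, resid, clusters)))

# Treating every row as its own cluster gives the usual robust (HC1) SE.
se_robust = np.sqrt(np.diag(crv1_vcov(X, resid, np.arange(G * T))))
print(se_cluster, se_robust)
```

&lt;p>Because both the regressor and the error share a state-level component in this toy panel, the clustered standard error comes out larger than the one that ignores the grouping.&lt;/p>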
&lt;p>In Python we have &lt;strong>two clean ways&lt;/strong> to apply this. For OLS-based estimators (rows 1-5 in our Table 2), &lt;code>pyfixest&lt;/code> does it natively:&lt;/p>
&lt;pre>&lt;code class="language-python">fit = pf.feols(&amp;quot;y ~ -1 + d + z1 + z2 + ...&amp;quot;, data=df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;})
&lt;/code>&lt;/pre>
&lt;p>For the &lt;code>DoubleMLPLR&lt;/code> row in Part B, we hand-roll the equivalent sandwich on the orthogonal scores (see §17.1). Both approaches give numerically identical inference for the small-controls cases; for kitchen-sink OLS the column-rank handling differs slightly between approaches (documented in §5).&lt;/p>
&lt;p>Cluster-robust inference is justified asymptotically in the number of clusters, so it needs $G$ to be &amp;ldquo;large&amp;rdquo;; the $G/(G-1)$ factor is only a finite-sample adjustment and does not rescue a small $G$. A rule of thumb is $G \geq 30$; with $G = 48$ states we are comfortably above that threshold. If you had only 5 or 10 clusters, the cluster-robust SE would be unreliable and you would need wild bootstrap or block bootstrap inference.&lt;/p>
&lt;hr>
&lt;h2 id="9-when-does-double-lasso-help-most">9. When does Double LASSO help most?&lt;/h2>
&lt;p>Look back at the DL-rigorous table in §7. For violent crime and murder, |I_y| = 0 — the LASSO of &lt;em>crime&lt;/em> on controls picked &lt;strong>zero variables&lt;/strong> out of 284. For all three outcomes |I_d| is 8 or 9 — the LASSO of &lt;em>abortion&lt;/em> on controls picked a handful. This asymmetry is the empirical fingerprint of the situation in which Double LASSO most helps: the treatment is well-predicted by the controls, but the outcome is not. Fitzgerald et al. (2026) emphasise this in their footnote 4: &lt;em>DL is most useful when the outcome is hard to predict but the treatment is well-predicted, because that is when the second LASSO catches controls that the first one missed.&lt;/em>&lt;/p>
&lt;p>Why does this matter for causal inference? Recall the PSL blind spot from §6: a one-LASSO procedure on $y$ can drop a control that strongly predicts $d$ if it does not strongly predict $y$. Suppose the (unobserved) data-generating process is&lt;/p>
&lt;p>$$
y_i = \alpha \, d_i + x_i' \theta + \zeta_i, \quad d_i = x_i' \pi + v_i, \quad \zeta_i \perp v_i.
$$&lt;/p>
&lt;p>If a particular $x_j$ has a large $\pi_j$ but a small $\theta_j$, then $x_j$ is a strong confounder (it predicts $d$, and thus moves $\hat\alpha$ when omitted), but a weak predictor of $y$. PSL drops it; DL keeps it via the d-equation LASSO. The empirical fingerprint |I_y| = 0, |I_d| = 8 means we are exactly in this regime: the eight controls that survived the d-equation LASSO are doing all of the confounding-control work in the final OLS.&lt;/p>
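&lt;p>A small Monte Carlo makes the blind spot concrete. Everything in the sketch below is an assumption chosen for illustration: the data-generating process (true $\alpha = 0$, one control with $\pi_0 = 1$ and $\theta_0 = 0.15$), the hand-written coordinate-descent LASSO, and the crude pilot $\hat\sigma$ (the standard deviation of the target). The single-selection arm partials $d$ out of $y$ and $X$ before one LASSO, as a simplified stand-in for PSL; it is not the &lt;code>hdmpy&lt;/code> code used for the tables.&lt;/p>

```python
import numpy as np
from statistics import NormalDist


def lasso_cd(X, y, lam, n_sweeps=50):
    # Cyclic coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1.
    n, p = X.shape
    gram = X.T @ X / n
    corr = X.T @ y / n
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = corr[j] - gram[j] @ b + gram[j, j] * b[j]
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / gram[j, j]
    return b


def select(X, target, c=1.1, gamma=0.05):
    # Theory-style penalty; sd(target) is a crude pilot sigma (assumption).
    n, p = X.shape
    Xs = X / X.std(axis=0)
    z = NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))
    lam = c * target.std() * z / np.sqrt(n)
    return np.flatnonzero(lasso_cd(Xs, target, lam))


def post_ols_alpha(y, d, X, keep):
    # OLS of y on d and the kept controls; return the coefficient on d.
    W = np.column_stack([d, X[:, keep]])
    return np.linalg.lstsq(W, y, rcond=None)[0][0]


def one_draw(rng, n=400, p=10, alpha=0.0, theta=0.15, sigma_v=0.3):
    X = rng.normal(size=(n, p))
    d = X[:, 0] + sigma_v * rng.normal(size=n)  # x_0 strongly predicts d
    y = alpha * d + theta * X[:, 0] + rng.normal(size=n)
    # Single selection: partial d out of y and X, then one LASSO on y.
    y_res = y - d * (d @ y) / (d @ d)
    X_res = X - np.outer(d, d @ X) / (d @ d)
    keep_single = select(X_res, y_res)
    # Double selection: LASSO of y on X, LASSO of d on X, then the union.
    keep_double = np.union1d(select(X, y), select(X, d))
    return (post_ols_alpha(y, d, X, keep_single),
            post_ols_alpha(y, d, X, keep_double))


rng = np.random.default_rng(42)
draws = np.array([one_draw(rng) for _ in range(200)])
bias_single, bias_double = draws.mean(axis=0)  # true alpha is 0
print(bias_single, bias_double)
```

&lt;p>In this design the single-selection estimate is pulled away from the true value of zero because the confounder is almost always dropped, while the double-selection estimate averages out close to zero because the d-equation LASSO keeps it.&lt;/p>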
&lt;hr>
&lt;h2 id="10-rigorous-vs-cross-validated-penalty--the-python-specific-story">10. Rigorous vs cross-validated penalty — the Python-specific story&lt;/h2>
&lt;p>The second flavour of Double LASSO replaces the rigorous penalty with &lt;strong>3-fold cross-validation&lt;/strong> via &lt;code>sklearn.linear_model.LassoCV&lt;/code>. The recipe is identical to §7 — two LASSOs, take the union, post-OLS — but each LASSO now picks $\lambda$ by minimising out-of-sample mean-squared error on the prediction problem.&lt;/p>
&lt;p>&lt;strong>Code chunk 5 — The CV-penalty Double LASSO using &lt;code>sklearn&lt;/code>:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dl_cv_fit(y, d, X, state, seed=20260520):
    # Separate shuffled 3-fold partitions for the y- and d-equations.
    cv_y = KFold(n_splits=3, shuffle=True, random_state=seed)
    cv_d = KFold(n_splits=3, shuffle=True, random_state=seed + 1)
    # Each LassoCV picks lambda by minimising out-of-fold MSE.
    lc_y = LassoCV(cv=cv_y, random_state=seed, max_iter=5000).fit(X, y)
    lc_d = LassoCV(cv=cv_d, random_state=seed, max_iter=5000).fit(X, d)
    Iy = np.where(np.abs(lc_y.coef_) &amp;gt; 1e-10)[0]
    Id = np.where(np.abs(lc_d.coef_) &amp;gt; 1e-10)[0]
    # Post-OLS on the union; feols_clustered is defined outside this chunk.
    U = np.sort(np.unique(np.concatenate([Iy, Id])))
    return feols_clustered(y, d, X[:, U], state), Iy, Id, U
&lt;/code>&lt;/pre>
&lt;p>The results:&lt;/p>
&lt;p>&lt;img src="python_double_lasso_selection.png" alt="Variable selection across the two Double LASSO penalties: bars show \|I_y\|, \|I_d\|, intersection, and union out of 284 candidate controls. Python&amp;rsquo;s CV-LASSO selects roughly 4.5 to 7 times as many controls as rigorous, a much milder over-selection than R&amp;rsquo;s cv.glmnet.">&lt;/p>
&lt;p>&lt;img src="python_double_lasso_methods_compare.png" alt="Rigorous vs CV side-by-side: same three-step recipe, different penalty rule. Both penalties give negative point estimates on all three outcomes — the dramatic sign-flip seen in R is absent here.">&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha_{\text{rig}}$&lt;/th>
&lt;th style="text-align:right">$\hat\alpha_{\text{CV}}$&lt;/th>
&lt;th style="text-align:right">$\lvert I_y \cup I_d \rvert_{\text{rig}}$&lt;/th>
&lt;th style="text-align:right">$\lvert I_y \cup I_d \rvert_{\text{CV}}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">−0.1043&lt;/td>
&lt;td style="text-align:right">−0.1401&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;td style="text-align:right">56&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">−0.0302&lt;/td>
&lt;td style="text-align:right">−0.0654&lt;/td>
&lt;td style="text-align:right">12&lt;/td>
&lt;td style="text-align:right">54&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">−0.1253&lt;/td>
&lt;td style="text-align:right">−0.1601&lt;/td>
&lt;td style="text-align:right">9&lt;/td>
&lt;td style="text-align:right">59&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Two important findings.&lt;/strong> First, the selection-count gap is real but modest: CV picks roughly 4.5 to 7 times as many controls as rigorous (56 / 54 / 59 vs 8 / 12 / 9). Second, and this is the Python-specific surprise, &lt;strong>the dramatic sign-flip the R companion shows on violent crime is not reproduced here&lt;/strong>. R&amp;rsquo;s &lt;code>cv.glmnet&lt;/code> keeps 150 controls in the d-equation for violent crime and flips $\hat\alpha$ from −0.10 to &lt;strong>+0.02&lt;/strong>. Python&amp;rsquo;s &lt;code>sklearn.LassoCV&lt;/code> keeps only 52, and $\hat\alpha$ stays clearly negative at −0.14.&lt;/p>
&lt;p>Why the difference? Three pieces. &lt;strong>(i) Lambda grid.&lt;/strong> &lt;code>cv.glmnet&lt;/code> constructs its grid from $\lambda_{\max}$ down to $\epsilon \cdot \lambda_{\max}$ on a 100-point log scale with $\epsilon = 10^{-4}$ when $n &amp;gt; p$ (as here, 576 vs 284; the default switches to $10^{-2}$ when $n &amp;lt; p$); &lt;code>sklearn.LassoCV&lt;/code> defaults to 100 points but with $\epsilon = 10^{-3}$, so relative to $\lambda_{\max}$ its smallest lambda is 10× larger. Smaller smallest-lambda → more variables can survive → R picks more. &lt;strong>(ii) Fold-assignment RNG.&lt;/strong> R&amp;rsquo;s &lt;code>cv.glmnet&lt;/code> uses base-R&amp;rsquo;s &lt;code>sample()&lt;/code>; Python&amp;rsquo;s &lt;code>KFold(shuffle=True, random_state=...)&lt;/code> uses NumPy&amp;rsquo;s Mersenne Twister. The fold partitions are different, so the cross-validation surface is different, so the optimal lambda is different. &lt;strong>(iii) Standardisation.&lt;/strong> &lt;code>cv.glmnet&lt;/code> standardises the columns of X by default before fitting, whereas &lt;code>LassoCV&lt;/code> does not rescale columns unless you do it yourself (its old &lt;code>normalize&lt;/code> option, which divided by the L2-norm, was removed in scikit-learn 1.2). Unless X is pre-scaled identically on both sides, the same penalty bites differently on each column, a subtle difference that compounds at the smallest lambda values.&lt;/p>
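&lt;p>The grid argument is simple arithmetic. The sketch below builds two 100-point log-spaced grids with the two ratios quoted above; &lt;code>lam_max = 1.0&lt;/code> is a placeholder, since each library derives its own $\lambda_{\max}$ from the data, and the variable names are illustrative.&lt;/p>

```python
import numpy as np

lam_max = 1.0  # placeholder; each library derives lambda_max from the data

# 100 log-spaced points from lam_max down to eps * lam_max.
grid_glmnet_like = np.geomspace(lam_max, 1e-4 * lam_max, num=100)
grid_sklearn_like = np.geomspace(lam_max, 1e-3 * lam_max, num=100)

ratio = grid_sklearn_like[-1] / grid_glmnet_like[-1]
# How many points of the deeper grid lie below the shallower grid's floor?
n_below = int((grid_sklearn_like[-1] > grid_glmnet_like).sum())
print(ratio, n_below)
```

&lt;p>A quarter of the deeper grid (25 of 100 points) sits below the shallower grid&amp;rsquo;s smallest value, and those are exactly the penalties at which the most controls survive.&lt;/p>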
&lt;p>The take-away is &lt;em>not&lt;/em> that one library is wrong — both follow the same algorithm. The take-away is that &lt;strong>&amp;ldquo;default CV-LASSO&amp;rdquo; is not a portable concept across language ecosystems&lt;/strong>, and the dramatic R demonstration of the rigorous-vs-CV sign-flip is partly an artifact of &lt;code>cv.glmnet&lt;/code>&amp;rsquo;s aggressive grid. The §15 standalone section walks through five sources of drift between &lt;code>sklearn.LassoCV&lt;/code>, &lt;code>R::glmnet::cv.glmnet&lt;/code>, and &lt;code>DoubleML&lt;/code>&amp;rsquo;s internal Lasso, so readers know which knob to turn when porting results across languages.&lt;/p>
&lt;hr>
&lt;h2 id="11-the-forest-plot">11. The forest plot&lt;/h2>
&lt;p>Stacking all five Part-A estimators against all three outcomes gives the headline figure:&lt;/p>
&lt;p>&lt;img src="python_double_lasso_estimates.png" alt="Forest plot of α̂ ± 95 % CI for all five Part-A estimators across all three crime outcomes. The dashed line is zero; bars to the left indicate a crime-reducing association.">&lt;/p>
&lt;p>A coherent story for violent and property crime: the LASSO methods (PSL, DL-rigorous, DL-CV) land between the two extremes — First-difference OLS at $-0.152$ (violent) and Kitchen-sink OLS at $+0.014$ (violent). PSL and DL-rigorous concentrate the data&amp;rsquo;s signal near the small set of controls that actually matter (0 to 12 of them), giving estimates in the $-0.10$ to $-0.16$ range with tighter standard errors than OLS-full.&lt;/p>
&lt;p>For murder, the story is messier. Kitchen-sink OLS gives the nonsensical $+2.34$. But First-diff ($-0.20$), PSL ($-0.21$), DL-rigorous ($-0.13$), and DL-CV ($-0.16$) all cluster sensibly in the negative range. The murder outcome is the noisiest of the three (state-level murder counts are small numbers in many state-years), but Python&amp;rsquo;s milder over-selection in DL-CV means we avoid the catastrophic $-1.11$ estimate that R&amp;rsquo;s &lt;code>cv.glmnet&lt;/code> produces here.&lt;/p>
&lt;hr>
&lt;h2 id="12-when-to-use-which-method">12. When to use which method?&lt;/h2>
&lt;p>The decision tree below offers practical guidance for a researcher facing a fresh dataset. It is not a substitute for thinking carefully about identification (no method can rescue an invalid research design), but it is a reasonable starting point.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart TD
Start[&amp;quot;You have n observations,&amp;lt;br/&amp;gt;p candidate controls,&amp;lt;br/&amp;gt;and want a causal alpha-hat&amp;quot;] --&amp;gt; Q1{&amp;quot;p &amp;amp;ge; n?&amp;quot;}
Q1 --&amp;gt;|Yes| L[&amp;quot;LASSO methods required&amp;lt;br/&amp;gt;(OLS infeasible)&amp;quot;]
Q1 --&amp;gt;|No| Q2{&amp;quot;p / n &amp;amp;gt; 0.3?&amp;quot;}
Q2 --&amp;gt;|Yes, like this post&amp;lt;br/&amp;gt;p=284, n=576| L
Q2 --&amp;gt;|No| Q3{&amp;quot;n &amp;amp;ge; 5,000?&amp;quot;}
Q3 --&amp;gt;|Yes| O[&amp;quot;Plain OLS with all&amp;lt;br/&amp;gt;controls is fine&amp;lt;br/&amp;gt;(pyfixest.feols)&amp;quot;]
Q3 --&amp;gt;|No| L
L --&amp;gt; Q4{&amp;quot;Need valid causal&amp;lt;br/&amp;gt;inference, not just&amp;lt;br/&amp;gt;prediction?&amp;quot;}
Q4 --&amp;gt;|Yes, single-shot| DL[&amp;quot;Post-Double-Selection&amp;lt;br/&amp;gt;hdmpy + pyfixest&amp;quot;]
Q4 --&amp;gt;|Yes, modern cross-fit| DML[&amp;quot;DoubleML.DoubleMLPLR&amp;lt;br/&amp;gt;(see Part B)&amp;quot;]
Q4 --&amp;gt;|No| Pred[&amp;quot;DL-CV or PSL are&amp;lt;br/&amp;gt;both fine for prediction&amp;quot;]
style Start fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style DL fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style DML fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style Pred fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style O fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style L fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style Q1 fill:#1f2b5e,stroke:#6a9bcc,color:#e8ecf2
style Q2 fill:#1f2b5e,stroke:#6a9bcc,color:#e8ecf2
style Q3 fill:#1f2b5e,stroke:#6a9bcc,color:#e8ecf2
style Q4 fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
&lt;/code>&lt;/pre>
&lt;p>One more piece of intuition justifies the post-OLS refit step in DL (and PSL). LASSO&amp;rsquo;s coefficients on the variables it selects are shrunken toward zero by construction. If you used those shrunken coefficients to compute the residuals for $\alpha$, you would inherit a bias of the order&lt;/p>
&lt;p>$$
\hat\alpha_{\text{LASSO}} - \alpha = O_p\!\left(\lambda^{\text{rig}}\right).
$$&lt;/p>
&lt;p>Here $\lambda^{\text{rig}}$ is on the per-observation scale of §7; if the penalty is instead quoted on the unnormalised sum-of-squares scale, which is $n$ times larger, the same order reads $O_p(\lambda / n)$. For our $\lambda^{\text{rig}}$ and $n = 576$, that bias is roughly 5-15 % of the treatment effect, which is large enough to matter. Refitting with plain OLS on the selected support &lt;strong>removes the shrinkage&lt;/strong>, so the coefficients on the selected controls are no longer biased toward zero. This is why every method in Part A uses LASSO for &lt;em>selection only&lt;/em> and post-OLS (&lt;code>pyfixest.feols&lt;/code>) for &lt;em>estimation&lt;/em>. DoubleMLPLR in Part B achieves the same shrinkage-removal differently, via cross-fitting and Neyman-orthogonal scores (see §16).&lt;/p>
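&lt;p>The shrinkage is easiest to see with a single regressor, where the LASSO solution has a closed form. The sketch below is illustrative only: the penalty &lt;code>lam = 0.2&lt;/code> and the synthetic data are arbitrary assumptions, and the objective is written on the $(1/(2n))$ scale.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta_true, lam = 576, 1.0, 0.2  # lam is an arbitrary illustrative penalty

x = rng.normal(size=n)
y = beta_true * x + rng.normal(size=n)

# One-regressor LASSO for (1/(2n))*||y - x*b||^2 + lam*|b| in closed form:
# soft-threshold the sample cross-moment, then rescale.
gram = x @ x / n
rho = x @ y / n
b_lasso = np.sign(rho) * max(abs(rho) - lam, 0.0) / gram

# Post-OLS refit on the selected regressor removes the shrinkage.
b_ols = rho / gram

print(b_lasso, b_ols, b_ols - b_lasso)  # the gap equals lam / gram
```

&lt;p>The regressor is selected either way; only the refit returns a coefficient that is not pulled toward zero by the penalty.&lt;/p>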
&lt;hr>
&lt;h2 id="13-caveats-and-identification">13. Caveats and identification&lt;/h2>
&lt;p>Six things to keep in mind when reading the headline estimates.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>This is a replication exercise, not a primary causal claim.&lt;/strong> Fitzgerald et al. (2026) is itself a replication paper studying Double LASSO as a &lt;em>method&lt;/em>. Whether more abortion access caused less crime is a substantive question that goes well beyond any single regression specification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Identification rests on two assumptions.&lt;/strong> First, &lt;em>conditional independence given $X$&lt;/em>: the 284 partialled controls must capture every variable that influenced both the abortion rate and the crime rate in the 1980s. Second, &lt;em>parallel trends in levels&lt;/em>: state fixed effects are absorbed by first-differencing, year fixed effects by the partialling step in the upstream pre-processing. Neither assumption is innocuous.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>State-clustering relies on $G \geq 30$.&lt;/strong> With $G = 48$ states we are above the rule of thumb. If you had only 5-10 clusters, the cluster-robust SE would be unreliable and you would need wild bootstrap or block bootstrap inference.&lt;/p>
&lt;/li>
&lt;li>
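&lt;p>&lt;strong>CV LASSO is non-deterministic without a seed.&lt;/strong> The script shuffles the data into $K$ folds with &lt;code>KFold(shuffle=True)&lt;/code>; without seeding, the variable-selection counts in §10 would vary by ±5 controls between runs and the headline coefficient by ±0.02. The script seeds &lt;code>KFold&lt;/code> (&lt;code>random_state=20260520&lt;/code> for the y-equation, &lt;code>seed + 1&lt;/code> for the d-equation) and passes &lt;code>LassoCV(random_state=20260520)&lt;/code> so the post&amp;rsquo;s numbers reproduce exactly. Note that &lt;code>LassoCV&lt;/code>&amp;rsquo;s own &lt;code>random_state&lt;/code> only matters when &lt;code>selection=&amp;quot;random&amp;quot;&lt;/code>; the fold partition is controlled by &lt;code>KFold&lt;/code>. The rigorous &lt;code>hdmpy.rlasso&lt;/code> is deterministic given the data and the penalty arguments.&lt;/p>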
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementation differences from R&amp;rsquo;s &lt;code>MASS::ginv&lt;/code> show up on OLS-full.&lt;/strong> Our SE on OLS-full violent crime is 0.565 vs the R companion&amp;rsquo;s 0.091; the gap stems from inverting near-singular $X'X$ via rank-revealing QR (drops collinear columns, then &lt;code>numpy.linalg.pinv&lt;/code> on the survivors) vs R&amp;rsquo;s &lt;code>MASS::ginv&lt;/code> (Moore-Penrose pseudoinverse on the full 284 columns). Both are mathematically valid; Python&amp;rsquo;s approach matches Stata&amp;rsquo;s &lt;code>regress&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>hdmpy&lt;/code> does not expose &lt;code>pnotpen&lt;/code>.&lt;/strong> This is why our PSL (§6) uses FWL partialling instead of unpenalised-treatment LASSO. Mathematically equivalent in the orthogonal-design limit; numerically nearly identical to the Stata rigorous-PSL implementation. If you need exact &lt;code>cv.glmnet&lt;/code> parity, an alternative is to use &lt;code>glmnet-python&lt;/code> (a thin wrapper around the Fortran code that R&amp;rsquo;s &lt;code>glmnet&lt;/code> uses) — but the maintenance trajectory of &lt;code>glmnet-python&lt;/code> is weaker than &lt;code>sklearn&lt;/code> + &lt;code>hdmpy&lt;/code>, and the qualitative conclusions are unchanged.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="14-python-vs-r-numeric-replication-tier-a--b--c">14. Python vs R numeric replication (Tier A / B / C)&lt;/h2>
&lt;p>The headline numerical reproduction is &lt;strong>faithful at the variable-selection level&lt;/strong>. Our LASSO selections for the rigorous-penalty Double LASSO match the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion&lt;/a> — and Fitzgerald et al. (2026) Table 2 — &lt;em>exactly&lt;/em> across all three outcomes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">|I_y| Python&lt;/th>
&lt;th style="text-align:right">|I_y| R&lt;/th>
&lt;th style="text-align:right">|I_d| Python&lt;/th>
&lt;th style="text-align:right">|I_d| R&lt;/th>
&lt;th style="text-align:right">Point Python&lt;/th>
&lt;th style="text-align:right">Point R&lt;/th>
&lt;th style="text-align:right">Point paper&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>0&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">&lt;strong>8&lt;/strong>&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;td style="text-align:right">−0.1043&lt;/td>
&lt;td style="text-align:right">−0.0964&lt;/td>
&lt;td style="text-align:right">−0.104&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>3&lt;/strong>&lt;/td>
&lt;td style="text-align:right">3&lt;/td>
&lt;td style="text-align:right">&lt;strong>9&lt;/strong>&lt;/td>
&lt;td style="text-align:right">9&lt;/td>
&lt;td style="text-align:right">−0.0302&lt;/td>
&lt;td style="text-align:right">−0.0314&lt;/td>
&lt;td style="text-align:right">−0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>0&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">&lt;strong>9&lt;/strong>&lt;/td>
&lt;td style="text-align:right">9&lt;/td>
&lt;td style="text-align:right">−0.1253&lt;/td>
&lt;td style="text-align:right">−0.1662&lt;/td>
&lt;td style="text-align:right">−0.125&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Six selection-count cells, six exact Python = R = paper matches. Point estimates agree across the three implementations to within 0.05 on the largest absolute gap (murder); violent crime and property crime are within 0.01. To keep the cross-implementation drift transparent, we organise the rows in tiers:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Tier&lt;/th>
&lt;th>Methods&lt;/th>
&lt;th>Expected drift&lt;/th>
&lt;th>Source of any drift&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>A — exact&lt;/strong>&lt;/td>
&lt;td>First-diff OLS, Kitchen-sink OLS (point estimates)&lt;/td>
&lt;td>≤ 1e-4&lt;/td>
&lt;td>None (deterministic OLS on same data)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>B — tight&lt;/strong>&lt;/td>
&lt;td>PSL, DL-rigorous (point estimates and selection counts)&lt;/td>
&lt;td>≤ 0.05&lt;/td>
&lt;td>Pre-standardisation differences in &lt;code>hdmpy&lt;/code> vs &lt;code>hdm&lt;/code>; FWL-vs-&lt;code>pnotpen&lt;/code> for PSL&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>C — drifts freely&lt;/strong>&lt;/td>
&lt;td>DL-CV (point estimates and selection counts)&lt;/td>
&lt;td>Wide&lt;/td>
&lt;td>&lt;code>sklearn.LassoCV&lt;/code> ≠ &lt;code>cv.glmnet&lt;/code> (different lambda grid, fold RNG, standardisation)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Stata is in the same picture: its Tier-A and Tier-B rows match Python&amp;rsquo;s to within 0.001. The Tier-C row (DL-CV) is where each language&amp;rsquo;s CV implementation diverges, and the Python-specific behaviour is the absence of the violent-crime sign-flip (see §10; the next section lists the sources of drift).&lt;/p>
&lt;hr>
&lt;h2 id="15-why-doubleml-results-dont-match-rs-hdm-five-sources-of-drift">15. Why DoubleML results don&amp;rsquo;t match R&amp;rsquo;s &lt;code>hdm&lt;/code>: five sources of drift&lt;/h2>
&lt;p>If you ran &lt;code>DoubleMLPLR(... ml_l=LassoCV(), ml_m=LassoCV())&lt;/code> on this data expecting to recover the R companion&amp;rsquo;s DL-rigorous numbers, you would get α̂ = −0.115, not R&amp;rsquo;s −0.0964 or the rigorous PSL&amp;rsquo;s −0.1567. &lt;strong>Five things differ between DoubleML&amp;rsquo;s defaults and R&amp;rsquo;s &lt;code>hdm&lt;/code>&lt;/strong>, and naming them tells you which knob to turn when you need to reconcile the two ecosystems:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sample-splitting / cross-fitting.&lt;/strong> &lt;code>DoubleMLPLR&lt;/code> uses K-fold cross-fitting (K = 5 default): every observation&amp;rsquo;s residual is computed by a nuisance model that did not see that observation. R&amp;rsquo;s &lt;code>hdm::rlasso&lt;/code> + manual post-OLS uses a single-sample fit: the same data are used to select variables and to estimate $\alpha$. These are two different estimators of the same parameter, so at finite $n$ they give different numbers; asymptotically they agree under standard regularity conditions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nuisance estimator defaults.&lt;/strong> &lt;code>DoubleML&lt;/code> does not ship a built-in rigorous-penalty LASSO. The closest user-facing option is &lt;code>LassoCV&lt;/code> from sklearn, which picks $\lambda$ by cross-validation, exactly the choice that §10 shows selects many more controls than the BCH rigorous penalty. If you want rigorous behaviour inside DoubleML, you have to compute the BCH $\lambda$ yourself and pass &lt;code>Lasso(alpha=...)&lt;/code> after converting it to sklearn&amp;rsquo;s scale (its objective divides the squared loss by $2n$), or pre-fit a &lt;code>hdmpy.rlasso&lt;/code> and pass a custom sklearn-compatible wrapper. We use &lt;code>LassoCV&lt;/code> here for clarity; DoubleML&amp;rsquo;s design is learner-agnostic, and §18 shows how RandomForest and XGBoost change the picture.&lt;/p>
&lt;/li>
&lt;li>
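&lt;p>&lt;strong>Standardisation.&lt;/strong> &lt;code>sklearn&lt;/code>&amp;rsquo;s &lt;code>Lasso&lt;/code> and &lt;code>LassoCV&lt;/code> centre X when fitting an intercept but do not rescale its columns by default (the old &lt;code>normalize&lt;/code> option, which divided by the L2-norm, was removed in scikit-learn 1.2), so any scaling has to be applied explicitly, for example with &lt;code>StandardScaler&lt;/code>; &lt;code>hdm&lt;/code> standardises by sample-SD; &lt;code>cv.glmnet&lt;/code> standardises by default with a variance computed with n, not n-1. At the boundary lambdas these conventions give different selections.&lt;/p>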
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fold RNG.&lt;/strong> &lt;code>sklearn.model_selection.KFold(shuffle=True, random_state=...)&lt;/code> uses NumPy&amp;rsquo;s Mersenne Twister; R&amp;rsquo;s &lt;code>cv.glmnet&lt;/code> draws its folds with base-R&amp;rsquo;s &lt;code>sample()&lt;/code>, seeded through &lt;code>set.seed()&lt;/code>. Even with identical seeds, the fold partitions differ. With $n_{\text{rep}} \geq 10$ in DoubleML the variation from this source averages out; with the $n_{\text{rep}} = 3$ we use for speed it is still visible (±0.01 on the point estimate).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Inference target.&lt;/strong> &lt;code>DoubleMLPLR&lt;/code> returns iid-asymptotic standard errors by default. R&amp;rsquo;s &lt;code>hdm&lt;/code>-driven workflow attaches a state-clustered HC1 sandwich on post-OLS residuals (Cameron and Miller 2015). We hand-roll the analog on DoubleML&amp;rsquo;s orthogonal scores in §17.1 so the inference is apples-to-apples with R, but this is &lt;em>not&lt;/em> what you get from &lt;code>dml.se&lt;/code> out of the box.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The practical upshot: when DoubleML&amp;rsquo;s α̂ differs from R&amp;rsquo;s &lt;code>hdm&lt;/code> α̂, the difference is &lt;strong>explainable, not mysterious&lt;/strong>. The two are different algorithms targeting the same parameter. Choosing between them comes down to whether you want explicit post-double-selection (R/Stata style, transparent) or modern cross-fit Neyman-orthogonal estimation (DoubleML style, learner-agnostic).&lt;/p>
&lt;hr>
&lt;h2 id="16-meet-doubleml-a-modern-framework-for-ml-based-causal-inference">16. Meet DoubleML: a modern framework for ML-based causal inference&lt;/h2>
&lt;p>&lt;a href="https://docs.doubleml.org/" target="_blank" rel="noopener">DoubleML&lt;/a> (&lt;a href="#22-references">Bach, Chernozhukov, Kurz and Spindler 2022, JMLR&lt;/a>) is a Python library that ports the &lt;a href="#22-references">Chernozhukov et al. (2018, &lt;em>Econometrics Journal&lt;/em>) double/debiased ML framework&lt;/a> into a sklearn-native API. Three ideas drive its design:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Neyman orthogonality.&lt;/strong> The score function $\psi$ has zero expected gradient with respect to the nuisance parameters $\eta$ at the truth: $E[\partial_\eta \psi]_{\eta=\eta_0} = 0$. This means small errors in the ML estimates of the nuisance functions affect $\hat\alpha$ only at second order, so they do not propagate to first-order bias.&lt;/p>
&lt;/li>
&lt;li>
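&lt;p>&lt;strong>Cross-fitting.&lt;/strong> Each observation&amp;rsquo;s score is computed using nuisance models trained on the &lt;em>other&lt;/em> folds, never on itself. This removes the overfitting bias that comes from predicting an observation with a model that has already seen it, and it lets you use flexible ML learners, provided they estimate the nuisance functions accurately enough (the usual sufficient condition is convergence faster than $n^{-1/4}$).&lt;/p>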
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pluggable learners.&lt;/strong> Any sklearn-compatible regressor or classifier can serve as a nuisance estimator. Swap &lt;code>LassoCV()&lt;/code> for &lt;code>RandomForestRegressor()&lt;/code> or &lt;code>XGBRegressor()&lt;/code> in one line; the rest of the pipeline is identical.&lt;/p>
&lt;/li>
&lt;/ul>
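&lt;p>The first two ideas fit in a few lines of NumPy. The sketch below is a hand-rolled cross-fit partialling-out estimator on synthetic data, with plain OLS standing in for the nuisance learner; the function names, the data-generating process, and the iid standard-error formula are assumptions for illustration and are not DoubleML&amp;rsquo;s internal code.&lt;/p>

```python
import numpy as np


def ols_predict(X_train, v_train, X_test):
    # Fit OLS on the training folds and predict on the held-out fold.
    coef = np.linalg.lstsq(X_train, v_train, rcond=None)[0]
    return X_test @ coef


def crossfit_plr(y, d, X, n_folds=5, seed=0):
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance fits only ever see the other folds.
        y_res[test_idx] = y[test_idx] - ols_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - ols_predict(X[train], d[train], X[test_idx])
    # Partialling-out estimate: residual-on-residual regression.
    alpha_hat = (d_res @ y_res) / (d_res @ d_res)
    # iid plug-in standard error from the orthogonal score.
    psi = (y_res - alpha_hat * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return alpha_hat, se


# Synthetic partially linear model with true alpha = 0.5.
rng = np.random.default_rng(3)
n, p = 2000, 5
X = rng.normal(size=(n, p))
d = X @ rng.normal(size=p) + rng.normal(size=n)
y = 0.5 * d + X @ rng.normal(size=p) + rng.normal(size=n)

alpha_hat, se = crossfit_plr(y, d, X)
print(alpha_hat, se)
```

&lt;p>Swapping &lt;code>ols_predict&lt;/code> for any other fit-and-predict routine changes the learner without touching the rest of the estimator, which is the pluggable-learner idea in miniature.&lt;/p>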
&lt;p>DoubleML ships several &lt;strong>model classes&lt;/strong>, one per estimand structure. The most important for econometric work:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Class&lt;/th>
&lt;th>When to use&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>&lt;code>DoubleMLPLR&lt;/code>&lt;/strong>&lt;/td>
&lt;td>Partially Linear Regression. $Y = D\theta + g(X) + \varepsilon$. Continuous treatment; this post&amp;rsquo;s main DoubleML model.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>&lt;code>DoubleMLIRM&lt;/code>&lt;/strong>&lt;/td>
&lt;td>Interactive Regression Model. Binary treatment. ATE or ATTE. Allows treatment-effect heterogeneity in covariates.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>&lt;code>DoubleMLPLIV&lt;/code>&lt;/strong>&lt;/td>
&lt;td>Partially Linear IV. Continuous treatment with instrumental variable.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>&lt;code>DoubleMLIIVM&lt;/code>&lt;/strong>&lt;/td>
&lt;td>Interactive IV. Binary treatment with binary instrument.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>&lt;code>DoubleMLDID&lt;/code>&lt;/strong>&lt;/td>
&lt;td>Difference-in-differences with ML nuisance (Sant&amp;rsquo;Anna-Zhao).&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The user always wraps the data in a &lt;strong>&lt;code>DoubleMLData&lt;/code>&lt;/strong> object that names the outcome, treatment, controls, and (optionally) instruments. The model class then takes nuisance learners and cross-fitting parameters:&lt;/p>
&lt;pre>&lt;code class="language-python">from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.linear_model import LassoCV
dml_data = DoubleMLData(df, y_col=&amp;quot;y&amp;quot;, d_cols=[&amp;quot;d&amp;quot;], x_cols=[...])
plr = DoubleMLPLR(
    dml_data,
    ml_l=LassoCV(cv=3),          # nuisance for E[Y | X]
    ml_m=LassoCV(cv=3),          # nuisance for E[D | X]
    n_folds=5,                   # outer cross-fitting
    n_rep=3,                     # repeat cross-fit and median-aggregate
    score=&amp;quot;partialling out&amp;quot;,     # Robinson FWL, the DL recipe
)
plr.fit()
print(plr.summary)
print(plr.confint(level=0.95))
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>score=&amp;quot;partialling out&amp;quot;&lt;/code> choice computes the Robinson partialling-out score
$\psi_i = (Y_i - \ell(X_i))(D_i - m(X_i)) - \theta(D_i - m(X_i))^2$,
where $\ell(X) = E[Y \mid X]$ is the &lt;code>ml_l&lt;/code> nuisance and $m(X) = E[D \mid X]$ is the &lt;code>ml_m&lt;/code> nuisance. Note that $\ell(X)$ is the conditional mean of the outcome, not the structural $g(X)$ from the PLR equation. This is exactly the FWL formula that Double LASSO approximates with a single post-OLS step. The difference between DoubleMLPLR and explicit post-double-selection is &lt;em>how the nuisance functions are estimated&lt;/em>: DoubleMLPLR&amp;rsquo;s K-fold cross-fitting vs PDS&amp;rsquo;s single-sample LASSO + post-OLS.&lt;/p>
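&lt;p>The partialling-out score can also be solved by hand, which makes the cross-fitting mechanics visible. The sketch below is an illustration under stated assumptions, not the DoubleML implementation: it uses simulated data with a true effect of 0.5, plain OLS via NumPy as a stand-in nuisance learner instead of &lt;code>LassoCV&lt;/code>, and a single repetition. The helper &lt;code>ols_predict&lt;/code> and all variable names are hypothetical.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, theta0, n_folds = 2_000, 5, 0.5, 5

# Simulated data (an assumption for illustration, not the abortion panel).
X = rng.normal(size=(n, p))
d = X @ np.full(p, 0.5) + rng.normal(size=n)
y = theta0 * d + X @ np.full(p, 1.0) + rng.normal(size=n)

def ols_predict(X_train, t_train, X_test):
    """Fit OLS with an intercept on the training rows, predict the held-out rows."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, t_train, rcond=None)
    return A_test @ beta

# Cross-fitting: each observation's nuisance predictions come from a model
# trained on the other folds only.
folds = np.array_split(rng.permutation(n), n_folds)
l_hat = np.empty(n)   # out-of-fold estimate of E[Y | X]
m_hat = np.empty(n)   # out-of-fold estimate of E[D | X]
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    l_hat[test_idx] = ols_predict(X[train_idx], y[train_idx], X[test_idx])
    m_hat[test_idx] = ols_predict(X[train_idx], d[train_idx], X[test_idx])

# Solve the empirical moment condition sum(psi_i) = 0 for theta.
res_y, res_d = y - l_hat, d - m_hat
theta_hat = np.sum(res_y * res_d) / np.sum(res_d ** 2)

# iid standard error from the score, with psi_a = -res_d^2 as the derivative term.
psi = (res_y - theta_hat * res_d) * res_d
psi_a = -res_d ** 2
se_iid = np.sqrt(np.mean(psi ** 2) / n) / abs(np.mean(psi_a))

print(f"theta_hat = {theta_hat:.3f}, iid SE = {se_iid:.3f}")
```

&lt;p>Swapping &lt;code>ols_predict&lt;/code> for any other out-of-fold predictor changes only the two nuisance lines inside the loop; the final ratio that solves for the coefficient stays the same, which is the sense in which the learner is pluggable.&lt;/p>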
&lt;p>We use this framework in the next three sections.&lt;/p>
&lt;hr>
&lt;h2 id="17-doubleml-capabilities-showcase">17. DoubleML capabilities showcase&lt;/h2>
&lt;h3 id="171-doublemlplr-with-hand-rolled-cluster-state-se">17.1 &lt;code>DoubleMLPLR&lt;/code> with hand-rolled cluster-state SE&lt;/h3>
&lt;p>The flagship Part-B estimator: &lt;code>DoubleMLPLR&lt;/code> with &lt;code>LassoCV&lt;/code> learners, n_folds = 5, n_rep = 3 (three repetitions of the cross-fit; the library median-aggregates across reps).&lt;/p>
&lt;p>&lt;strong>Code chunk 6 — DoubleMLPLR with cross-fitting:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.linear_model import LassoCV
o = outcomes[&amp;quot;violent&amp;quot;]
df_dml = pd.DataFrame(o[&amp;quot;X&amp;quot;], columns=[f&amp;quot;x{i}&amp;quot; for i in range(o[&amp;quot;X&amp;quot;].shape[1])])
df_dml[&amp;quot;d&amp;quot;] = o[&amp;quot;d&amp;quot;]
df_dml[&amp;quot;y&amp;quot;] = o[&amp;quot;y&amp;quot;]
dml_data = DoubleMLData(df_dml, y_col=&amp;quot;y&amp;quot;, d_cols=[&amp;quot;d&amp;quot;],
                        x_cols=[f&amp;quot;x{i}&amp;quot; for i in range(o[&amp;quot;X&amp;quot;].shape[1])])
ml_l = LassoCV(cv=3, random_state=20260520, max_iter=5000)
ml_m = LassoCV(cv=3, random_state=20260520, max_iter=5000)
plr = DoubleMLPLR(dml_data, ml_l=ml_l, ml_m=ml_m,
                  n_folds=5, n_rep=3, score=&amp;quot;partialling out&amp;quot;)
plr.fit()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> alpha_hat (DoubleMLPLR, violent crime) = -0.1152
iid SE = 0.0826 95% CI = [-0.277, +0.047]
&lt;/code>&lt;/pre>
&lt;p>The iid SE comes from &lt;code>plr.se&lt;/code> directly. To get a state-clustered SE that is apples-to-apples with the Part-A rows, we hand-roll the cluster sandwich on the &lt;strong>orthogonal scores&lt;/strong> that DoubleML exposes via &lt;code>plr.psi&lt;/code> and &lt;code>plr.psi_elements&lt;/code>:&lt;/p>
&lt;p>&lt;strong>Code chunk 7 — Hand-rolled cluster SE on orthogonal scores:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">def cluster_se_orthogonal(dml, cluster_id, k_params=1):
    # Scores at the estimate; after squeeze the shape is (n,) or (n, n_rep)
    psi = dml.psi.squeeze()
    psi_a = dml.psi_elements[&amp;quot;psi_a&amp;quot;].squeeze()
    n = psi.shape[0]
    if psi.ndim == 2:  # average across the n_rep dimension
        psi = psi.mean(axis=1)
        psi_a = psi_a.mean(axis=1)
    # Meat: squared within-cluster sums of the scores
    # (np.asarray avoids accidental index alignment if cluster_id is a Series)
    df_p = pd.DataFrame({&amp;quot;psi&amp;quot;: psi, &amp;quot;g&amp;quot;: np.asarray(cluster_id)})
    grouped = df_p.groupby(&amp;quot;g&amp;quot;)[&amp;quot;psi&amp;quot;].sum().to_numpy()
    G = len(grouped)
    meat = float(np.sum(grouped ** 2))
    # Bread: mean derivative of the score with respect to the parameter
    Epsi_a = float(np.mean(psi_a))
    # Finite-sample correction G/(G-1) * (n-1)/(n-k) used for clustered SEs
    hc1 = (G / (G - 1)) * ((n - 1) / (n - k_params))
    var = hc1 * meat / (n * Epsi_a) ** 2
    return float(np.sqrt(var))
cluster_se = cluster_se_orthogonal(plr, state)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> cluster SE = 0.0727 (hand-rolled HC1 on orthogonal scores, G=48)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_double_lasso_doubleml_showcase.png" alt="PDS vs DoubleMLPLR on violent crime: four estimates plotted side by side with cluster-CI bars.">&lt;/p>
&lt;p>DoubleMLPLR&amp;rsquo;s α̂ = &lt;strong>−0.115&lt;/strong> sits squarely between the post-double-selection DL-rigorous (−0.104) and DL-CV (−0.140) numbers. This is reassuring: three different paths through the LASSO machinery give answers within one standard error of each other. The state-clustered SE on the orthogonal scores (0.073) is slightly &lt;em>smaller&lt;/em> than the iid SE (0.083). That is unusual but mathematically valid: when the scores are negatively correlated within clusters (e.g., crime rates that mean-revert within state), the within-state sums partly cancel and the cluster sandwich shrinks rather than inflates. The pedagogical takeaway: &lt;strong>the inference target (iid vs clustered) is a separate choice from the estimation algorithm&lt;/strong>, and DoubleML&amp;rsquo;s &lt;code>.psi&lt;/code> attribute makes it easy to swap in cluster-correct SEs after the fact.&lt;/p>
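&lt;p>The claim that a cluster sandwich can come out smaller than the iid one is easy to check on simulated scores. The following sketch is an illustration only: the scores are synthetic, the derivative term is a placeholder constant, no finite-sample correction is applied, and the helper &lt;code>score_se&lt;/code> is a hypothetical name rather than a DoubleML function. It builds 48 clusters of 12 rows and compares a negatively and a positively correlated case.&lt;/p>

```python
import numpy as np

def score_se(psi, psi_a, cluster_id=None):
    """Sandwich SE from scores; sums scores within clusters when ids are given."""
    n = psi.shape[0]
    if cluster_id is None:
        meat = np.sum(psi ** 2)                  # iid: every row is its own cluster
    else:
        _, codes = np.unique(cluster_id, return_inverse=True)
        cluster_sums = np.bincount(codes, weights=psi)
        meat = np.sum(cluster_sums ** 2)
    return float(np.sqrt(meat) / abs(n * np.mean(psi_a)))

rng = np.random.default_rng(2)
G, T = 48, 12                                    # 48 clusters of 12 rows, n = 576
cluster_id = np.repeat(np.arange(G), T)
e = rng.normal(size=(G, T))
psi_a = -np.ones(G * T)                          # placeholder derivative term

# Negative within-cluster correlation: remove half of each cluster's mean.
psi_neg = (e - 0.5 * e.mean(axis=1, keepdims=True)).ravel()
# Positive within-cluster correlation: add a shock shared by the whole cluster.
psi_pos = (e + rng.normal(size=(G, 1))).ravel()

for label, psi in [("negative", psi_neg), ("positive", psi_pos)]:
    se_iid = score_se(psi, psi_a)
    se_cl = score_se(psi, psi_a, cluster_id)
    print(f"{label:8s} iid SE = {se_iid:.4f}   cluster SE = {se_cl:.4f}")
```

&lt;p>In the negative case the within-cluster sums partly cancel, so the clustered SE falls below the iid SE; in the positive case the shared shock is summed twelve times per cluster and the clustered SE is the larger of the two.&lt;/p>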
&lt;h3 id="172-doublemlirm-on-a-binarised-treatment-api-demo-only">17.2 &lt;code>DoubleMLIRM&lt;/code> on a binarised treatment (API demo only)&lt;/h3>
&lt;p>The Interactive Regression Model handles &lt;strong>binary treatments&lt;/strong> and estimates the ATE or ATTE. Our treatment (the effective abortion rate) is continuous, but we can binarise it at its median purely to demonstrate the API:&lt;/p>
&lt;p>&lt;strong>Code chunk 8 — DoubleMLIRM (API demo):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from doubleml import DoubleMLIRM
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier
d_binary = (o[&amp;quot;d&amp;quot;] &amp;gt; np.median(o[&amp;quot;d&amp;quot;])).astype(int)
df_irm = df_dml.copy()
df_irm[&amp;quot;d&amp;quot;] = d_binary
irm_data = DoubleMLData(df_irm, y_col=&amp;quot;y&amp;quot;, d_cols=[&amp;quot;d&amp;quot;],
                        x_cols=[f&amp;quot;x{i}&amp;quot; for i in range(o[&amp;quot;X&amp;quot;].shape[1])])
irm = DoubleMLIRM(
    irm_data,
    ml_g=Lasso(alpha=0.01, max_iter=5000),
    ml_m=RandomForestClassifier(n_estimators=100, max_depth=5,
                                random_state=20260520, n_jobs=-1),
    n_folds=3, n_rep=1, score=&amp;quot;ATE&amp;quot;,
)
irm.fit()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> ATE (DoubleMLIRM, median-split treatment) = -0.0163 (iid SE = 0.0043)
(For context: PLR's continuous-treatment estimate above is -0.1152.)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>CAVEAT — this is an API demonstration, not a causal estimate.&lt;/strong> Binarising a continuous treatment throws away most of the variation: we are now measuring &amp;ldquo;effect of being above-vs-below median abortion rate&amp;rdquo; instead of &amp;ldquo;effect of a one-unit change in abortion rate,&amp;rdquo; and the two are on completely different scales. The pedagogical lesson is &lt;strong>pick the right DoubleML class for your treatment type&lt;/strong> — &lt;code>DoubleMLPLR&lt;/code> for continuous, &lt;code>DoubleMLIRM&lt;/code>/&lt;code>DoubleMLIIVM&lt;/code> for binary, &lt;code>DoubleMLPLIV&lt;/code> for IV. Forcing a continuous variable into a binary model is a classic API-driven misspecification.&lt;/p>
&lt;hr>
&lt;h2 id="18-learner-robustness-lasso-vs-randomforest-vs-xgboost">18. Learner robustness: LASSO vs RandomForest vs XGBoost&lt;/h2>
&lt;p>A key advantage of DoubleML is that it is &lt;strong>agnostic to the choice of ML learner&lt;/strong>, as long as the learner is flexible enough to approximate the true confounding function. To verify that our DoubleMLPLR violent-crime estimate is not driven by the specific choice of LassoCV, we re-estimate the model with three structurally different learners.&lt;/p>
&lt;p>&lt;strong>Code chunk 9 — DoubleMLPLR with three nuisance learners:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
learners = {
    &amp;quot;LassoCV&amp;quot;: lambda: LassoCV(cv=3, random_state=20260520, max_iter=5000),
    &amp;quot;RandomForest&amp;quot;: lambda: RandomForestRegressor(n_estimators=100, max_depth=5,
                                                  random_state=20260520, n_jobs=-1),
    &amp;quot;XGBoost&amp;quot;: lambda: XGBRegressor(n_estimators=100, max_depth=4,
                                    learning_rate=0.05,
                                    random_state=20260520, verbosity=0),
}
for name, make in learners.items():
    plr_l = DoubleMLPLR(dml_data, ml_l=make(), ml_m=make(),
                        n_folds=5, n_rep=3, score=&amp;quot;partialling out&amp;quot;)
    plr_l.fit()
    se_c = cluster_se_orthogonal(plr_l, state)
    print(f&amp;quot; {name:12s} alpha_hat = {float(plr_l.coef[0]):+0.4f} &amp;quot;
          f&amp;quot;iid SE = {float(plr_l.se[0]):0.4f} cluster SE = {se_c:0.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> LassoCV alpha_hat = -0.0957 iid SE = 0.0841 cluster SE = 0.0785
RandomForest alpha_hat = -0.0855 iid SE = 0.1806 cluster SE = 0.1432
XGBoost alpha_hat = -0.1123 iid SE = 0.2089 cluster SE = 0.1421
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_double_lasso_learners.png" alt="DoubleMLPLR α̂ on violent crime with three different nuisance learners, with 95 % cluster-CI bars.">&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Learner&lt;/th>
&lt;th style="text-align:right">α̂&lt;/th>
&lt;th style="text-align:right">iid SE&lt;/th>
&lt;th style="text-align:right">Cluster SE&lt;/th>
&lt;th>95 % CI (cluster)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>LassoCV&lt;/strong> (cv=3, max_iter=5000)&lt;/td>
&lt;td style="text-align:right">−0.0957&lt;/td>
&lt;td style="text-align:right">0.0841&lt;/td>
&lt;td style="text-align:right">&lt;strong>0.0785&lt;/strong>&lt;/td>
&lt;td>[−0.250, +0.058]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>RandomForestRegressor&lt;/strong> (100 trees, depth 5)&lt;/td>
&lt;td style="text-align:right">−0.0855&lt;/td>
&lt;td style="text-align:right">0.1806&lt;/td>
&lt;td style="text-align:right">&lt;strong>0.1432&lt;/strong>&lt;/td>
&lt;td>[−0.366, +0.195]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>XGBRegressor&lt;/strong> (100 trees, depth 4, eta 0.05)&lt;/td>
&lt;td style="text-align:right">−0.1123&lt;/td>
&lt;td style="text-align:right">0.2089&lt;/td>
&lt;td style="text-align:right">&lt;strong>0.1421&lt;/strong>&lt;/td>
&lt;td>[−0.391, +0.166]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Reading the comparison.&lt;/strong> Three structurally different nuisance learners, sparse linear (LASSO), bagged trees (RandomForest), and boosted trees (XGBoost), give DoubleMLPLR α̂ values spanning &lt;strong>−0.0855 to −0.1123&lt;/strong>, a 0.03 range. All three point estimates are negative, and the cluster-SE confidence intervals overlap heavily. This is exactly the &lt;strong>learner-robustness signal&lt;/strong> DoubleML is designed to expose: if the answer flipped sign or changed by a factor of two when swapping the nuisance learner, that would be a red flag that the result is fragile. Here the conclusion survives the swap: a negative association between differenced abortion and differenced violent-crime rate whose 95 % cluster interval includes zero under all three learners, so none of the estimates is statistically significant at the 5 % level. Note also that the LassoCV row (−0.0957) differs from the −0.1152 reported in §17.1 despite identical learner settings; the likely reason is that the two fits drew different random cross-fitting folds, which is one more argument for reporting results over several repetitions.&lt;/p>
&lt;p>Worth noting: the tree-based learners produce wider SEs than LASSO, roughly 2-2.5× on the iid SE and about 1.8× on the cluster SE, plausibly because their out-of-fold residuals for $Y$ and $D$ differ from LASSO&amp;rsquo;s: the SE grows when the outcome residual is noisier or when less residual treatment variation is left to identify the effect. With n = 576 and p = 284, sparse linear nuisance is probably the right default, but the comparison shows DoubleML&amp;rsquo;s &amp;ldquo;plug in any sklearn learner&amp;rdquo; design works as advertised. In production, the right move is to fit all three (or four; gradient boosting with &lt;code>LightGBM&lt;/code> is a good fourth) and report the spread as a robustness band.&lt;/p>
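&lt;p>The robustness band and the significance reading can be reproduced with a few lines of arithmetic. The sketch below simply copies the point estimates and cluster SEs from the table above and assumes a normal critical value of 1.96; the dictionary name &lt;code>results&lt;/code> is a hypothetical choice for this illustration.&lt;/p>

```python
# Point estimates and cluster SEs copied from the learner-comparison table.
results = {
    "LassoCV": (-0.0957, 0.0785),
    "RandomForest": (-0.0855, 0.1432),
    "XGBoost": (-0.1123, 0.1421),
}

z = 1.96  # normal critical value for a two-sided 95 % interval
for name, (alpha, se) in results.items():
    lo, hi = alpha - z * se, alpha + z * se
    print(f"{name:12s} t = {alpha / se:+.2f}   95% CI = [{lo:+.3f}, {hi:+.3f}]")

# Robustness band: the spread of point estimates across learners.
alphas = [alpha for alpha, _ in results.values()]
band = (min(alphas), max(alphas))
print(f"robustness band = [{band[0]:+.4f}, {band[1]:+.4f}], "
      f"width = {band[1] - band[0]:.4f}")
```

&lt;p>The t-ratios come out at about −1.2, −0.6 and −0.8, all well inside ±1.96, and the band width is 0.0268, which is small relative to any of the three standard errors.&lt;/p>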
&lt;hr>
&lt;h2 id="19-conclusion">19. Conclusion&lt;/h2>
&lt;p>Four takeaways worth carrying away from this post.&lt;/p>
&lt;p>First, &lt;strong>Double LASSO is a method, not a panacea&lt;/strong>. It does not invent variation in the data, nor does it weaken the identifying assumptions of the underlying research design. What it does is make high-dimensional control sets &lt;em>tractable&lt;/em> without committing to using all of them or to picking a subset by hand. On a dataset where conditional independence holds and the candidate-control set is rich enough to span the confounders, DL-rigorous reproduces the Donohue-Levitt 2001 headline closely while disciplining the standard errors.&lt;/p>
&lt;p>Second, &lt;strong>the rigorous penalty matters more than the language&lt;/strong>. Switching from &lt;code>hdmpy.rlasso&lt;/code> to &lt;code>sklearn.linear_model.LassoCV&lt;/code> shifts violent-crime α̂ from −0.10 to −0.14, a meaningful change but no sign-flip. The dramatic R demonstration (&lt;code>cv.glmnet&lt;/code> flips α̂ from −0.10 to +0.02) does not reproduce in Python, apparently because &lt;code>LassoCV&lt;/code>&amp;rsquo;s default lambda grid and KFold splits differ from &lt;code>cv.glmnet&lt;/code>&amp;rsquo;s defaults and behave less aggressively on this data. For causal inference, prefer the theory-driven rigorous penalty (&lt;code>hdmpy.rlasso&lt;/code> in Python) regardless of which language you are in.&lt;/p>
&lt;p>Third, &lt;strong>the regime determines the methodology&lt;/strong>. With our $p = 284$, $n = 576$, we are squarely in the small-sample, high-dimensional zone where DL is designed to help. With $p = 8$ and $n = 5{,}000$, plain OLS would be perfectly fine. The decision tree in §12 is a starting point for picking the right tool for the dimensions you face.&lt;/p>
&lt;p>Fourth — and this is the Python-specific addition — &lt;strong>use post-double-selection (hdmpy) when you want to replicate published results; use DoubleML when you want modern Neyman-orthogonal cross-fitting with any sklearn learner.&lt;/strong> The two approaches target the same parameter under standard regularity conditions, but they take different paths. DoubleML&amp;rsquo;s cross-fitting, learner-agnosticism, and clean sklearn integration make it the right tool for production ML pipelines. Post-double-selection&amp;rsquo;s transparency (every variable&amp;rsquo;s fate is visible; no cross-fold averaging hides the selection) makes it the right tool for a one-shot replication exercise like the one in this post.&lt;/p>
&lt;p>If you came in expecting either a definitive statement about abortion and crime or a magic ML cure for omitted-variable bias, you should leave with neither. What you should leave with is a clearer mental model of &lt;em>when&lt;/em> the high-dimensional toolkit earns its complexity, &lt;em>how&lt;/em> to use the two distinct Python idioms for it (hdmpy/sklearn/pyfixest vs DoubleML), and &lt;em>why&lt;/em> the two idioms can give different numbers on the same data.&lt;/p>
&lt;hr>
&lt;h2 id="20-exercises">20. Exercises&lt;/h2>
&lt;p>These exercises ask you to modify and re-run &lt;code>script.py&lt;/code>. All datasets, dependencies, and helper functions are already in place — you only need to change the indicated lines, run the script, and read the output.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the CV seed.&lt;/strong> In §10, the &lt;code>KFold&lt;/code> and &lt;code>LassoCV&lt;/code> random states are set to &lt;code>20260520&lt;/code>. Change them to a different seed and re-run only Estimator E (&lt;code>dl_cv_fit&lt;/code>). How much do the selection counts |I_y|, |I_d| change across the three outcomes? Does the DL-CV point estimate for violent crime ever flip to positive on a different seed?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tighten the rigorous penalty.&lt;/strong> In §7, the rigorous-penalty parameters are &lt;code>c = 1.1, gamma = 0.05&lt;/code>. Try &lt;code>c = 1.5&lt;/code> (stricter) and &lt;code>c = 0.8&lt;/code> (looser) and re-run only Estimator D (&lt;code>dl_rigorous_fit&lt;/code>). The stricter setting should select fewer variables; the looser one should select more.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Increase &lt;code>n_rep&lt;/code> in DoubleMLPLR.&lt;/strong> In §17.1, &lt;code>n_rep=3&lt;/code> for speed. Bump it to &lt;code>n_rep=20&lt;/code> and re-run only that block. How much do α̂ and the SE move? This is the right setting in production — &lt;code>n_rep=3&lt;/code> is borderline for publication-quality inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Swap XGBoost for LightGBM in §18.&lt;/strong> Replace &lt;code>XGBRegressor(...)&lt;/code> with &lt;code>lightgbm.LGBMRegressor(n_estimators=100, max_depth=4, learning_rate=0.05, random_state=20260520, verbosity=-1)&lt;/code>. Does the learner-comparison conclusion change?&lt;/p>
&lt;/li>
&lt;li>
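&lt;p>&lt;strong>Apply DoubleMLPLIV.&lt;/strong> This dataset has no instrumental variable, so DoubleMLPLIV is not substantively meaningful here. But as a syntactic exercise, treat one of the candidate controls (say &lt;code>x150&lt;/code>) as a fake instrument: remove it from &lt;code>x_cols&lt;/code> (a column cannot be both a control and an instrument), then fit &lt;code>DoubleMLData(..., z_cols=[&amp;quot;x150&amp;quot;])&lt;/code> + &lt;code>DoubleMLPLIV(...)&lt;/code>. Note that PLIV takes an additional learner &lt;code>ml_r&lt;/code> for the treatment given the controls, while &lt;code>ml_m&lt;/code> now predicts the instrument. Observe how the API differs from PLR. Do not interpret the resulting number as a causal estimate.&lt;/p>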
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="21-reproducing-this-analysis">21. Reproducing this analysis&lt;/h2>
&lt;p>You need Python 3.10-3.13 and the following packages:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install pyfixest==0.50.1
pip install DoubleML==0.11.2
pip install hdmpy
pip install xgboost
pip install scikit-learn pandas numpy matplotlib
# macOS Intel only: pin numba/llvmlite to last-Intel-wheel versions
pip install 'numba==0.62.1' 'llvmlite==0.45.0'
&lt;/code>&lt;/pre>
&lt;p>Then clone the repository and run:&lt;/p>
&lt;pre>&lt;code class="language-bash">cd content/post/python_double_lasso/
python script.py 2&amp;gt;&amp;amp;1 | tee execution_log.txt
&lt;/code>&lt;/pre>
&lt;p>Runtime on Apple Silicon is about 5-8 minutes (Part A: ~90 s; Part B&amp;rsquo;s DoubleMLPLR n_rep=3: ~3 minutes; Part B&amp;rsquo;s learner comparison: ~3 minutes). The longest single step is &lt;code>LassoCV&lt;/code> inside DoubleMLPLR with n_folds=5 × n_rep=3; if you want a quick pass, set &lt;code>n_rep=1&lt;/code> and the runtime drops to under 2 minutes total.&lt;/p>
&lt;p>If you would rather render the post locally as a Quarto notebook, the &lt;strong>&lt;a href="python_double_lasso.zip">Quarto project (.zip)&lt;/a>&lt;/strong> link button at the top contains a friction-free bundle: extract, double-click &lt;code>render.command&lt;/code> (macOS) or &lt;code>render.bat&lt;/code> (Windows), and the notebook renders to HTML in your browser with a hermetic local &lt;code>.venv/&lt;/code>.&lt;/p>
&lt;hr>
&lt;h2 id="22-references">22. References&lt;/h2>
&lt;p>&lt;strong>Academic references:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.3982/ECTA9626" target="_blank" rel="noopener">Belloni, A., Chen, D., Chernozhukov, V., &amp;amp; Hansen, C. (2012). Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain. &lt;em>Econometrica&lt;/em>, 80(6), 2369-2429.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdt044" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., &amp;amp; Hansen, C. (2014). Inference on Treatment Effects after Selection among High-Dimensional Controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608-650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.3368/jhr.50.2.317" target="_blank" rel="noopener">Cameron, A. C., &amp;amp; Miller, D. L. (2015). A Practitioner&amp;rsquo;s Guide to Cluster-Robust Inference. &lt;em>Journal of Human Resources&lt;/em>, 50(2), 317-372.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp;amp; Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>Econometrics Journal&lt;/em>, 21(1), C1-C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1162/00335530151144050" target="_blank" rel="noopener">Donohue III, J. J., &amp;amp; Levitt, S. D. (2001). The Impact of Legalized Abortion on Crime. &lt;em>Quarterly Journal of Economics&lt;/em>, 116(2), 379-420.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.15456/jae.2025335.0258270663" target="_blank" rel="noopener">Fitzgerald, J., Lattimore, F., Robinson, T., &amp;amp; Zhu, A. (2026). Double LASSO: Replication and Practical Insights. &lt;em>Journal of Applied Econometrics&lt;/em>, forthcoming.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.18637/jss.v033.i01" target="_blank" rel="noopener">Friedman, J., Hastie, T., &amp;amp; Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. &lt;em>Journal of Statistical Software&lt;/em>, 33(1), 1-22.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/j.2517-6161.1996.tb02080.x" target="_blank" rel="noopener">Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. &lt;em>Journal of the Royal Statistical Society: Series B&lt;/em>, 58(1), 267-288.&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Python package and library documentation:&lt;/strong>&lt;/p>
&lt;ol start="9">
&lt;li>&lt;a href="https://www.jmlr.org/papers/v23/21-0862.html" target="_blank" rel="noopener">Bach, P., Chernozhukov, V., Kurz, M. S., &amp;amp; Spindler, M. (2022). DoubleML — An Object-Oriented Implementation of Double Machine Learning in Python. &lt;em>Journal of Machine Learning Research&lt;/em>, 23(53), 1-6.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/index.html" target="_blank" rel="noopener">DoubleML — Python Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/" target="_blank" rel="noopener">pyfixest — Fast High-Dimensional Fixed Effects Estimation in Python&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/d2cml-ai/hdmpy" target="_blank" rel="noopener">hdmpy — Python port of R&amp;rsquo;s &lt;code>hdm&lt;/code> package (GitHub)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html" target="_blank" rel="noopener">scikit-learn — LassoCV documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://xgboost.readthedocs.io/en/stable/python/python_api.html" target="_blank" rel="noopener">XGBoost — Python API reference&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Data and replication archives:&lt;/strong>&lt;/p>
&lt;ol start="15">
&lt;li>&lt;a href="https://github.com/cmg777/starter-academic-v501/tree/master/content/post/r_double_lasso/data" target="_blank" rel="noopener">Belloni-Chernozhukov-Hansen (2014) replication CSVs — companion R post &lt;code>data/&lt;/code> folder (GitHub)&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content in this post may still have errors. Caution is needed when applying the contents of this post to true research projects.&lt;/p>
&lt;hr>
&lt;style>
.podcast-overlay {
display: none;
position: fixed;
bottom: 0;
left: 0;
right: 0;
z-index: 9999;
animation: podSlideUp 0.35s ease-out;
}
@keyframes podSlideUp {
from { transform: translateY(100%); }
to { transform: translateY(0); }
}
.podcast-overlay.pod-closing {
animation: podSlideDown 0.3s ease-in forwards;
}
@keyframes podSlideDown {
from { transform: translateY(0); }
to { transform: translateY(100%); }
}
.podcast-container {
background: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);
padding: 18px 24px 20px;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
box-shadow: 0 -4px 32px rgba(0,0,0,0.5);
border-top: 1px solid rgba(106,155,204,0.2);
}
.podcast-inner {
max-width: 800px;
margin: 0 auto;
}
.podcast-top-row {
display: flex;
align-items: center;
gap: 14px;
margin-bottom: 14px;
}
.podcast-icon {
width: 42px;
height: 42px;
background: linear-gradient(135deg, #d97757, #e8956a);
border-radius: 10px;
display: flex;
align-items: center;
justify-content: center;
flex-shrink: 0;
}
.podcast-icon svg {
width: 22px;
height: 22px;
fill: #fff;
}
.podcast-title-block {
flex: 1;
min-width: 0;
}
.podcast-title-block h4 {
margin: 0 0 1px 0;
color: #f0ece2;
font-size: 14px;
font-weight: 600;
letter-spacing: 0.02em;
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
}
.podcast-title-block span {
color: #8b9dc3;
font-size: 11px;
}
.podcast-close-btn {
background: none;
border: none;
cursor: pointer;
padding: 6px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.2s;
flex-shrink: 0;
}
.podcast-close-btn:hover {
background: rgba(255,255,255,0.1);
}
.podcast-close-btn svg {
width: 20px;
height: 20px;
fill: #8b9dc3;
}
.podcast-progress-wrap {
margin-bottom: 12px;
}
.podcast-time-row {
display: flex;
justify-content: space-between;
font-size: 11px;
color: #8b9dc3;
margin-bottom: 5px;
font-variant-numeric: tabular-nums;
}
.podcast-bar-bg {
width: 100%;
height: 6px;
background: rgba(255,255,255,0.1);
border-radius: 3px;
cursor: pointer;
position: relative;
overflow: hidden;
transition: height 0.15s;
}
.podcast-bar-buffered {
position: absolute;
top: 0;
left: 0;
height: 100%;
background: rgba(106,155,204,0.25);
border-radius: 3px;
transition: width 0.3s;
}
.podcast-bar-progress {
position: absolute;
top: 0;
left: 0;
height: 100%;
background: linear-gradient(90deg, #6a9bcc, #00d4c8);
border-radius: 3px;
transition: width 0.1s linear;
}
.podcast-bar-bg:hover {
height: 10px;
margin-top: -2px;
}
.podcast-controls-row {
display: flex;
align-items: center;
justify-content: space-between;
}
.podcast-transport {
display: flex;
align-items: center;
gap: 8px;
}
.podcast-btn {
background: none;
border: none;
cursor: pointer;
padding: 4px;
display: flex;
align-items: center;
justify-content: center;
border-radius: 50%;
transition: all 0.2s;
}
.podcast-btn svg {
fill: #c8d0e0;
transition: fill 0.2s;
}
.podcast-btn:hover svg {
fill: #f0ece2;
}
.podcast-btn-skip {
position: relative;
}
.podcast-btn-skip span {
position: absolute;
font-size: 7px;
font-weight: 700;
color: #c8d0e0;
top: 50%;
left: 50%;
transform: translate(-50%, -50%);
pointer-events: none;
margin-top: 1px;
}
.podcast-btn-play {
width: 48px;
height: 48px;
background: linear-gradient(135deg, #d97757, #e8956a);
border-radius: 50%;
box-shadow: 0 3px 12px rgba(217,119,87,0.4);
transition: all 0.2s;
}
.podcast-btn-play:hover {
transform: scale(1.08);
box-shadow: 0 5px 20px rgba(217,119,87,0.5);
}
.podcast-btn-play svg {
fill: #fff;
width: 22px;
height: 22px;
}
.podcast-extras {
display: flex;
align-items: center;
gap: 10px;
}
.podcast-volume-wrap {
display: flex;
align-items: center;
gap: 5px;
}
.podcast-volume-wrap svg {
fill: #8b9dc3;
width: 16px;
height: 16px;
cursor: pointer;
flex-shrink: 0;
}
.podcast-volume-wrap svg:hover {
fill: #c8d0e0;
}
.podcast-volume-slider {
-webkit-appearance: none;
appearance: none;
width: 60px;
height: 4px;
background: rgba(255,255,255,0.12);
border-radius: 2px;
outline: none;
cursor: pointer;
}
.podcast-volume-slider::-webkit-slider-thumb {
-webkit-appearance: none;
appearance: none;
width: 12px;
height: 12px;
background: #6a9bcc;
border-radius: 50%;
cursor: pointer;
}
.podcast-speed-btn {
background: rgba(255,255,255,0.08);
border: 1px solid rgba(255,255,255,0.12);
color: #c8d0e0;
font-size: 11px;
font-weight: 600;
padding: 3px 9px;
border-radius: 12px;
cursor: pointer;
transition: all 0.2s;
font-family: inherit;
min-width: 40px;
text-align: center;
}
.podcast-speed-btn:hover {
background: rgba(106,155,204,0.2);
border-color: #6a9bcc;
color: #f0ece2;
}
.podcast-download-btn {
background: none;
border: 1px solid rgba(255,255,255,0.12);
border-radius: 8px;
padding: 4px 10px;
cursor: pointer;
display: flex;
align-items: center;
gap: 4px;
color: #8b9dc3;
font-size: 11px;
font-family: inherit;
text-decoration: none;
transition: all 0.2s;
}
.podcast-download-btn:hover {
border-color: #6a9bcc;
color: #f0ece2;
background: rgba(106,155,204,0.1);
}
.podcast-download-btn svg {
width: 14px;
height: 14px;
fill: currentColor;
}
@media (max-width: 600px) {
.podcast-container { padding: 14px 16px 16px; }
.podcast-volume-wrap { display: none; }
.podcast-title-block h4 { font-size: 13px; }
.podcast-extras { gap: 8px; }
}
&lt;/style>
&lt;div class="podcast-overlay" id="podOverlay">
&lt;div class="podcast-container">
&lt;div class="podcast-inner">
&lt;audio id="podAudio" preload="none" src="https://files.catbox.moe/anx2jt.m4a">&lt;/audio>
&lt;div class="podcast-top-row">
&lt;div class="podcast-icon">
&lt;svg viewBox="0 0 24 24">&lt;path d="M12 1a5 5 0 0 0-5 5v4a5 5 0 0 0 10 0V6a5 5 0 0 0-5-5zm0 16a7 7 0 0 1-7-7H3a9 9 0 0 0 8 8.94V22h2v-3.06A9 9 0 0 0 21 10h-2a7 7 0 0 1-7 7z"/>&lt;/svg>
&lt;/div>
&lt;div class="podcast-title-block">
&lt;h4>AI Podcast: Double LASSO in Python&lt;/h4>
&lt;span id="podDurationLabel">Click play to load&lt;/span>
&lt;/div>
&lt;button class="podcast-close-btn" onclick="podClose()" title="Close player">
&lt;svg viewBox="0 0 24 24">&lt;path d="M19 6.41L17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12z"/>&lt;/svg>
&lt;/button>
&lt;/div>
&lt;div class="podcast-progress-wrap">
&lt;div class="podcast-time-row">
&lt;span id="podCurrent">0:00&lt;/span>
&lt;span id="podDuration">0:00&lt;/span>
&lt;/div>
&lt;div class="podcast-bar-bg" id="podBarBg" onclick="podSeek(event)">
&lt;div class="podcast-bar-buffered" id="podBuffered">&lt;/div>
&lt;div class="podcast-bar-progress" id="podProgress">&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="podcast-controls-row">
&lt;div class="podcast-transport">
&lt;button class="podcast-btn podcast-btn-skip" onclick="podSkip(-15)" title="Back 15s">
&lt;svg width="26" height="26" viewBox="0 0 24 24">&lt;path d="M12 5V1L7 6l5 5V7c3.31 0 6 2.69 6 6s-2.69 6-6 6-6-2.69-6-6H4c0 4.42 3.58 8 8 8s8-3.58 8-8-3.58-8-8-8z"/>&lt;/svg>
&lt;span>15&lt;/span>
&lt;/button>
&lt;button class="podcast-btn podcast-btn-play" id="podPlayBtn" onclick="podToggle()" title="Play">
&lt;svg id="podIconPlay" viewBox="0 0 24 24">&lt;path d="M8 5v14l11-7z"/>&lt;/svg>
&lt;svg id="podIconPause" viewBox="0 0 24 24" style="display:none">&lt;path d="M6 19h4V5H6v14zm8-14v14h4V5h-4z"/>&lt;/svg>
&lt;/button>
&lt;button class="podcast-btn podcast-btn-skip" onclick="podSkip(15)" title="Forward 15s">
&lt;svg width="26" height="26" viewBox="0 0 24 24">&lt;path d="M12 5V1l5 5-5 5V7c-3.31 0-6 2.69-6 6s2.69 6 6 6 6-2.69 6-6h2c0 4.42-3.58 8-8 8s-8-3.58-8-8 3.58-8 8-8z"/>&lt;/svg>
&lt;span>15&lt;/span>
&lt;/button>
&lt;/div>
&lt;div class="podcast-extras">
&lt;div class="podcast-volume-wrap">
&lt;svg id="podVolIcon" onclick="podMute()" viewBox="0 0 24 24">&lt;path d="M3 9v6h4l5 5V4L7 9H3zm13.5 3A4.5 4.5 0 0 0 14 8.5v7a4.47 4.47 0 0 0 2.5-3.5zM14 3.23v2.06a6.51 6.51 0 0 1 0 13.42v2.06A8.51 8.51 0 0 0 14 3.23z"/>&lt;/svg>
&lt;input type="range" class="podcast-volume-slider" id="podVolume" min="0" max="1" step="0.05" value="0.8">
&lt;/div>
&lt;button class="podcast-speed-btn" id="podSpeedBtn" onclick="podCycleSpeed()" title="Playback speed">1x&lt;/button>
&lt;a class="podcast-download-btn" href="https://files.catbox.moe/anx2jt.m4a" target="_blank" rel="noopener" title="Stream">
&lt;svg viewBox="0 0 24 24">&lt;path d="M19 9h-4V3H9v6H5l7 7 7-7zM5 18v2h14v-2H5z"/>&lt;/svg>
&lt;/a>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;script>
(function(){
var overlay = document.getElementById('podOverlay');
var a = document.getElementById('podAudio');
var speeds = [0.75, 1, 1.25, 1.5, 2];
var si = 1;
var opened = false;
function fmt(s){
if(isNaN(s)) return '0:00';
var m=Math.floor(s/60), sec=Math.floor(s%60);
return m+':'+(sec&lt;10?'0':'')+sec;
}
document.addEventListener('click', function(e){
var link = e.target.closest('a.btn-page-header');
if(!link) return;
var text = link.textContent.trim();
if(text.indexOf('AI Podcast') === -1) return;
e.preventDefault();
e.stopPropagation();
overlay.style.display = 'block';
overlay.classList.remove('pod-closing');
if(!opened){
a.preload = 'metadata';
a.load();
opened = true;
}
});
a.volume = 0.8;
a.addEventListener('loadedmetadata', function(){
document.getElementById('podDuration').textContent = fmt(a.duration);
document.getElementById('podDurationLabel').textContent = fmt(a.duration) + ' minutes';
});
a.addEventListener('timeupdate', function(){
document.getElementById('podCurrent').textContent = fmt(a.currentTime);
var pct = a.duration ? (a.currentTime/a.duration)*100 : 0;
document.getElementById('podProgress').style.width = pct+'%';
});
a.addEventListener('progress', function(){
if(a.buffered.length>0){
var pct = (a.buffered.end(a.buffered.length-1)/a.duration)*100;
document.getElementById('podBuffered').style.width = pct+'%';
}
});
a.addEventListener('ended', function(){
document.getElementById('podIconPlay').style.display='';
document.getElementById('podIconPause').style.display='none';
});
window.podToggle = function(){
if(a.paused){a.play();document.getElementById('podIconPlay').style.display='none';document.getElementById('podIconPause').style.display='';}
else{a.pause();document.getElementById('podIconPlay').style.display='';document.getElementById('podIconPause').style.display='none';}
};
window.podSkip = function(s){a.currentTime = Math.max(0,Math.min(a.duration||0,a.currentTime+s));};
window.podSeek = function(e){
var rect = document.getElementById('podBarBg').getBoundingClientRect();
var pct = (e.clientX - rect.left)/rect.width;
a.currentTime = pct * (a.duration||0);
};
window.podMute = function(){
a.muted = !a.muted;
document.getElementById('podVolume').value = a.muted ? 0 : a.volume;
};
window.podCycleSpeed = function(){
si = (si+1) % speeds.length;
a.playbackRate = speeds[si];
document.getElementById('podSpeedBtn').textContent = speeds[si]+'x';
};
window.podClose = function(){
overlay.classList.add('pod-closing');
setTimeout(function(){ overlay.style.display='none'; }, 300);
a.pause();
document.getElementById('podIconPlay').style.display='';
document.getElementById('podIconPause').style.display='none';
};
document.getElementById('podVolume').addEventListener('input', function(){
a.volume = this.value;
a.muted = false;
});
if(window.location.hash === '#podcast-player'){
overlay.style.display = 'block';
a.preload = 'metadata';
a.load();
opened = true;
}
})();
&lt;/script></description></item><item><title>Double LASSO in Stata: Does Abortion Reduce Crime?</title><link>https://carlos-mendez.org/post/stata_double_lasso/</link><pubDate>Sun, 24 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_double_lasso/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>This is the Stata companion to &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">the R version&lt;/a> of the Double LASSO tutorial — same data, same five estimators, same identification story. The R post walks through Belloni, Chernozhukov and Hansen&amp;rsquo;s (2014) extension of Donohue and Levitt&amp;rsquo;s (2001) abortion-and-crime panel and shows that &lt;strong>Double LASSO&lt;/strong> with the &lt;em>rigorous&lt;/em> (theory-based) penalty reproduces the headline causal estimates from 284 candidate controls while CV-tuned LASSO overshoots dramatically. This post does the same computation in Stata using the &lt;strong>StataLasso&lt;/strong> suite — &lt;code>rlasso&lt;/code>, &lt;code>cvlasso&lt;/code>, &lt;code>pdslasso&lt;/code> and &lt;code>lasso2&lt;/code> from &lt;a href="#19-references">Ahrens, Hansen and Schaffer (2018)&lt;/a> — and verifies the numbers against the R implementation.&lt;/p>
&lt;p>If you have already read the R version, the takeaways here are unchanged. The structural reason to write a Stata companion is reproducibility: empirical economists who run Stata day-to-day will find the friction of switching to R for one method too high, and a transparent Stata implementation removes that friction. The structural reason to &lt;em>verify&lt;/em> it is that small implementation differences (default penalty constants, lambda parameterizations, CV-fold randomisation) can subtly change which variables get selected and, in this dataset, which sign the estimated treatment effect carries.&lt;/p>
&lt;p>&lt;img src="stata_double_lasso_estimates.png" alt="Forest plot of α̂ ± 95% CI for all five estimators (First diff, OLS-full, PSL, DL-rigorous, DL-CV) facetted by outcome — Stata replication of the R headline figure.">&lt;/p>
&lt;p>The figure above is the post&amp;rsquo;s spoiler — the Stata version of the R headline forest plot. Each row is a different estimator; each panel is a different crime outcome. The dashed vertical line is zero: to its left, the abortion-crime relationship is &lt;em>negative&lt;/em> (more abortion is associated with less crime). Two patterns jump out, exactly as in the R companion. First, the LASSO methods (PSL, DL-rigorous) cluster sensibly near the original Donohue–Levitt baseline (First diff) for violent and property crime. Second, &lt;strong>OLS with all 284 controls is uninterpretable&lt;/strong> — its murder estimate explodes to a value far outside any plausible causal range. That failure mode is what motivates LASSO in the first place.&lt;/p>
&lt;p>&lt;strong>Learning objectives.&lt;/strong> After working through this tutorial you will be able to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Explain&lt;/strong> when high-dimensional methods like LASSO add value over plain OLS, and when they do not.&lt;/li>
&lt;li>&lt;strong>Implement&lt;/strong> the Belloni–Chernozhukov–Hansen Double LASSO procedure in Stata using &lt;code>rlasso&lt;/code> (rigorous penalty) and &lt;code>cvlasso&lt;/code> (cross-validated penalty).&lt;/li>
&lt;li>&lt;strong>Distinguish&lt;/strong> the &lt;em>rigorous&lt;/em> and &lt;em>cross-validated&lt;/em> penalty rules for LASSO, and recognise which is appropriate for causal inference.&lt;/li>
&lt;li>&lt;strong>Compute&lt;/strong> state-clustered standard errors with the HC1 finite-sample correction using Stata&amp;rsquo;s built-in &lt;code>vce(cluster state)&lt;/code> and read the resulting sandwich matrix.&lt;/li>
&lt;li>&lt;strong>Diagnose&lt;/strong> the regime in which Double LASSO most helps (treatment well-predicted, outcome not), using the selection-count fingerprint |I_y| and |I_d|.&lt;/li>
&lt;li>&lt;strong>Verify&lt;/strong> that the Stata implementation matches the R companion to the precision allowed by each estimator&amp;rsquo;s randomness — and locate the unavoidable drift in cross-validated steps.&lt;/li>
&lt;/ul>
&lt;h3 id="key-concepts-at-a-glance">Key concepts at a glance&lt;/h3>
&lt;p>The post leans on a small vocabulary. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has a one-line definition followed by a short example tied to this post&amp;rsquo;s data.&lt;/p>
&lt;p>&lt;strong>1. LASSO&lt;/strong> $\hat\beta(\lambda) = \arg\min_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \sum_j \lvert\beta_j\rvert$. L1-penalised OLS: the absolute-value penalty produces &lt;em>exactly-zero&lt;/em> coefficients (variable selection). In §6 our &lt;code>rlasso&lt;/code> of the abortion rate on 284 controls picks just 8 — the rest get shrunk to zero.&lt;/p>
&lt;p>&lt;strong>2. Penalty $\lambda$.&lt;/strong> The knob controlling shrinkage. Higher $\lambda$ pins more coefficients to zero. Tuning $\lambda$ is the central design choice — and what separates the rigorous and CV flavours of Double LASSO.&lt;/p>
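&lt;p>&lt;strong>3. Post-Structural LASSO (PSL).&lt;/strong> One LASSO with the treatment forced in unpenalised (&lt;code>pnotpen()&lt;/code> in &lt;code>rlasso&lt;/code>, &lt;code>notpen()&lt;/code> in &lt;code>cvlasso&lt;/code>), then plain OLS on the selected support. The simplest one-LASSO causal estimator. This post runs it with &lt;code>rlasso&lt;/code>, whereas the R companion uses a CV-tuned LASSO (see §6).&lt;/p>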
&lt;p>&lt;strong>4. Double LASSO (DL).&lt;/strong> Two LASSOs (y on X, d on X), union of selected controls, then post-OLS. The causal-inference-safe variant that beats PSL when controls predict $d$ but not $y$.&lt;/p>
&lt;p>&lt;strong>5. Selection sets $I_y$ and $I_d$.&lt;/strong> The indices of controls each LASSO step keeps. Their union $I_y \cup I_d$ is the support of the post-OLS regression. Their &lt;em>imbalance&lt;/em> is the empirical fingerprint of when DL adds value.&lt;/p>
&lt;p>&lt;strong>6. Rigorous vs CV penalty.&lt;/strong> Two ways to pick $\lambda$. Rigorous: Belloni–Chen–Chernozhukov–Hansen (2012) Bonferroni-style theory rule, available in Stata as &lt;code>rlasso&lt;/code>. CV: cross-validation minimising prediction MSE, available as &lt;code>cvlasso&lt;/code>. Different objectives, different answers.&lt;/p>
&lt;p>&lt;strong>7. Post-OLS step.&lt;/strong> After LASSO selects a support, refit with plain (unshrunk) OLS to remove the shrinkage bias on $\hat\alpha$. LASSO is used only for &lt;em>selection&lt;/em>, never for the final estimate. In Stata this is one &lt;code>regress y d &amp;lt;selected&amp;gt;, vce(cluster state)&lt;/code> line.&lt;/p>
&lt;p>&lt;strong>8. State-clustered standard errors.&lt;/strong> HC1-adjusted sandwich variance with state-level clustering, applied automatically by Stata&amp;rsquo;s &lt;code>vce(cluster state)&lt;/code>. Corrects for within-state autocorrelation that would otherwise understate the SE on a panel of state-year observations.&lt;/p>
&lt;p>A note on the StataLasso suite. The Ahrens–Hansen–Schaffer (2018) package gives us four commands that map cleanly onto the R workflow:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Stata command&lt;/th>
&lt;th>R equivalent&lt;/th>
&lt;th>What it does&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>rlasso&lt;/code>&lt;/td>
&lt;td>&lt;code>hdm::rlasso&lt;/code>&lt;/td>
&lt;td>LASSO with the rigorous theory-based penalty (Belloni et al. 2012)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>cvlasso&lt;/code>&lt;/td>
&lt;td>&lt;code>glmnet::cv.glmnet&lt;/code>&lt;/td>
&lt;td>LASSO with cross-validated $\lambda$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>lasso2&lt;/code>&lt;/td>
&lt;td>&lt;code>glmnet::glmnet&lt;/code>&lt;/td>
&lt;td>LASSO across the full $\lambda$ path (no CV)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pdslasso&lt;/code>&lt;/td>
&lt;td>wrapper combining the above&lt;/td>
&lt;td>One-line PDS / Double LASSO with cluster-robust SE&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The first three are the engines; &lt;code>pdslasso&lt;/code> is the convenience wrapper that automates the two-LASSO-then-post-OLS recipe in a single command. We use the engines directly in this post so the three steps remain visible.&lt;/p>
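&lt;p>For reference, here is a minimal sketch of what the one-line wrapper might look like for the violent-crime equation. It assumes &lt;code>pdslasso&lt;/code> and &lt;code>lassopack&lt;/code> have been installed from SSC and uses the variable names built in §2 (&lt;code>DyV&lt;/code>, &lt;code>DxV&lt;/code>, &lt;code>zv1&lt;/code>-&lt;code>zv284&lt;/code>). The options shown reflect our reading of the package help file, so check &lt;code>help pdslasso&lt;/code> before relying on them. Its output need not match the step-by-step results in this post, because the wrapper makes its own choices about penalty loadings and the intercept.&lt;/p>
&lt;pre>&lt;code class="language-stata">* One-time installation (assumption: both packages are fetched from SSC)
* ssc install lassopack
* ssc install pdslasso

* High-dimensional controls go inside the parentheses; the treatment stays outside.
* loptions() is assumed to pass the rigorous-penalty constants through to rlasso.
pdslasso DyV DxV (zv1-zv284), cluster(state) loptions(c(1.1) gamma(0.05))
&lt;/code>&lt;/pre>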
&lt;hr>
&lt;h2 id="2-the-data">2. The data&lt;/h2>
&lt;p>We use the exact panel that &lt;a href="#19-references">Belloni, Chernozhukov and Hansen (2014)&lt;/a> compiled from &lt;a href="#19-references">Donohue and Levitt&amp;rsquo;s (2001)&lt;/a> original replication archive: &lt;strong>48 U.S. states × 12 years (1986–1997) after first-differencing the raw 13-year 1985–1997 panel, giving 576 observations.&lt;/strong> First-differencing absorbs state fixed effects. Year fixed effects are absorbed in a separate pre-processing step using the Frisch–Waugh–Lovell projection (see §7). By the time the analysis script sees the data, both fixed-effect adjustments are done, so the LASSO regressions below contain no time dummies.&lt;/p>
&lt;p>The treatment $d$ is the &lt;strong>effective abortion rate&lt;/strong> — a weighted average of past abortion-to-birth ratios, lagged to match the ages at which crime is most prevalent. The three outcomes $y$ are state-level &lt;strong>violent crime, property crime, and murder rates&lt;/strong>, each first-differenced. The candidate-control matrix $X$ has &lt;strong>284 columns&lt;/strong>: it expands Donohue–Levitt&amp;rsquo;s original 8 controls into squares, two-way interactions, time interactions, lagged levels, within-state means, and initial-value × time-trend interactions, then screens for multicollinearity.&lt;/p>
&lt;p>For reproducibility, the data lives in the &lt;a href="https://github.com/cmg777/starter-academic-v501/tree/master/content/post/r_double_lasso/data" target="_blank" rel="noopener">companion R post&amp;rsquo;s &lt;code>data/&lt;/code> folder&lt;/a> and is loaded over HTTPS from the GitHub raw URL. No local Matlab files needed.&lt;/p>
&lt;p>&lt;strong>Code chunk 1 — Loading the data in Stata:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">local BASE = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_double_lasso/data&amp;quot;
tempfile linear partialled ctrl_v ctrl_p ctrl_m
import delimited &amp;quot;`BASE'/levitt_linear.csv&amp;quot;, clear varnames(1) case(preserve)
gen long obs_id = _n
save &amp;quot;`linear'&amp;quot;
import delimited &amp;quot;`BASE'/levitt_partialled.csv&amp;quot;, clear varnames(1) case(preserve)
drop state
gen long obs_id = _n
save &amp;quot;`partialled'&amp;quot;
* Three 284-column control matrices, one per outcome. Column names in
* the source CSV use ^, *, ( ) — Stata sanitises them on import; we
* rename to zv1..zv284, zp1..zp284, zm1..zm284 so downstream code can
* address them uniformly.
foreach o in v p m {
local long = cond(&amp;quot;`o'&amp;quot;==&amp;quot;v&amp;quot;,&amp;quot;viol&amp;quot;,cond(&amp;quot;`o'&amp;quot;==&amp;quot;p&amp;quot;,&amp;quot;prop&amp;quot;,&amp;quot;murd&amp;quot;))
import delimited &amp;quot;`BASE'/levitt_controls_`long'.csv&amp;quot;, clear varnames(1)
local k = 0
foreach var of varlist _all {
local ++k
rename `var' z`o'`k'
}
gen long obs_id = _n
save &amp;quot;`ctrl_`o''&amp;quot;
}
use &amp;quot;`linear'&amp;quot;, clear
merge 1:1 obs_id using &amp;quot;`partialled'&amp;quot;, nogen
merge 1:1 obs_id using &amp;quot;`ctrl_v'&amp;quot;, nogen
merge 1:1 obs_id using &amp;quot;`ctrl_p'&amp;quot;, nogen
merge 1:1 obs_id using &amp;quot;`ctrl_m'&amp;quot;, nogen
&lt;/code>&lt;/pre>
&lt;p>Five &lt;code>import delimited&lt;/code> calls (two explicit, three inside the loop) produce five tempfiles that are merged on the row index &lt;code>obs_id&lt;/code>. The cluster id &lt;code>state&lt;/code> arrives with &lt;code>levitt_linear.csv&lt;/code> and is dropped from &lt;code>levitt_partialled.csv&lt;/code> so it is not merged twice, which is why the standalone &lt;code>levitt_state.csv&lt;/code> listed below is not imported here. The &lt;code>case(preserve)&lt;/code> option on the &lt;code>linear&lt;/code> and &lt;code>partialled&lt;/code> imports keeps Stata&amp;rsquo;s variable-name auto-lowercaser from collapsing the case-sensitive &lt;code>Dyv&lt;/code> vs. &lt;code>DyV&lt;/code> distinction we use to separate raw differences from year-FE-partialled series. The control CSVs use special characters in their column headers (e.g. &lt;code>Lprison^2&lt;/code>, &lt;code>Dprison*t&lt;/code>); we rename all of them to &lt;code>z&amp;lt;prefix&amp;gt;&amp;lt;index&amp;gt;&lt;/code> so downstream &lt;code>regress&lt;/code>, &lt;code>rlasso&lt;/code>, and &lt;code>cvlasso&lt;/code> calls can address them with a varlist range such as &lt;code>zv1-zv284&lt;/code> (written &lt;code>z`o'1-z`o'284&lt;/code> inside loops).&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>File&lt;/th>
&lt;th>Shape&lt;/th>
&lt;th>What it contains&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>levitt_state.csv&lt;/code>&lt;/td>
&lt;td>576 × 1&lt;/td>
&lt;td>State cluster id (1–48) for each observation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_linear.csv&lt;/code>&lt;/td>
&lt;td>576 × 7&lt;/td>
&lt;td>Raw first-differences of the outcomes and treatment (&lt;code>Dyv, Dxv, Dyp, Dxp, Dym, Dxm&lt;/code>)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_partialled.csv&lt;/code>&lt;/td>
&lt;td>576 × 7&lt;/td>
&lt;td>Same series after year-FE absorption (&lt;code>DyV, DxV, DyP, DxP, DyM, DxM&lt;/code>)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_controls_viol.csv&lt;/code>&lt;/td>
&lt;td>576 × 284&lt;/td>
&lt;td>Control matrix $Z_v$ for the violent-crime equation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_controls_prop.csv&lt;/code>&lt;/td>
&lt;td>576 × 284&lt;/td>
&lt;td>Control matrix $Z_p$ for the property-crime equation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>levitt_controls_murd.csv&lt;/code>&lt;/td>
&lt;td>576 × 284&lt;/td>
&lt;td>Control matrix $Z_m$ for the murder equation&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The dimensions matter for the LASSO methods that follow. We are in the &lt;strong>moderate-dimensional&lt;/strong> regime: $p = 284$ is large but smaller than $n = 576$, so OLS is technically feasible but unstable, and LASSO is the natural tool to discipline the variable selection.&lt;/p>
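&lt;p>Before running any estimator it can help to confirm that the merged dataset has the shapes listed in the table. The following is a minimal sketch, assuming Code chunk 1 has just been run and that the merged data carries the &lt;code>state&lt;/code>, &lt;code>obs_id&lt;/code> and &lt;code>z*&lt;/code> variables it creates; the local name &lt;code>nctrl&lt;/code> is ours.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sanity checks on the merged dataset (run right after Code chunk 1)
assert _N == 576                 // 48 states x 12 differenced years
isid obs_id                      // the row index uniquely identifies observations
quietly tabulate state
assert r(r) == 48                // 48 distinct clusters

* Each outcome should have exactly 284 renamed controls
foreach o in v p m {
    quietly ds z`o'*
    local nctrl : word count `r(varlist)'
    assert `nctrl' == 284
}
&lt;/code>&lt;/pre>
&lt;p>Each &lt;code>assert&lt;/code> stops the do-file with an error if its condition fails, so a silent run means the dimensions are as expected.&lt;/p>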
&lt;hr>
&lt;h2 id="3-five-estimators-in-plain-language">3. Five estimators in plain language&lt;/h2>
&lt;p>Five regression procedures appear in this post, each with a different attitude toward how many controls to keep. We summarise the cast here so you can navigate the rest of the article.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Estimator&lt;/th>
&lt;th>Recipe in one sentence&lt;/th>
&lt;th>Stata command&lt;/th>
&lt;th>Section&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>First-difference OLS&lt;/strong>&lt;/td>
&lt;td>Regress differenced crime on differenced abortion with &lt;strong>no&lt;/strong> controls, the original Donohue–Levitt (2001) specification.&lt;/td>
&lt;td>&lt;code>regress&lt;/code>&lt;/td>
&lt;td>§4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>OLS (full)&lt;/strong>&lt;/td>
&lt;td>Add all 284 controls and let the matrix algebra sort it out.&lt;/td>
&lt;td>&lt;code>regress&lt;/code>&lt;/td>
&lt;td>§5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSL&lt;/strong> (Post-Structural LASSO)&lt;/td>
&lt;td>One LASSO with the treatment forced in via &lt;code>pnotpen()&lt;/code>, then plain OLS on the selected support. (Stata uses the rigorous penalty here; see §6 for the trade-off vs R&amp;rsquo;s CV-tuned PSL.)&lt;/td>
&lt;td>&lt;code>rlasso&lt;/code> + &lt;code>regress&lt;/code>&lt;/td>
&lt;td>§6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DL (rigorous)&lt;/strong>&lt;/td>
&lt;td>Two LASSOs (y on X, d on X) with the Belloni-et-al. theory-based penalty; refit OLS on the &lt;strong>union&lt;/strong> of selected variables.&lt;/td>
&lt;td>&lt;code>rlasso&lt;/code> ×2 + &lt;code>regress&lt;/code>&lt;/td>
&lt;td>§7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DL (CV)&lt;/strong>&lt;/td>
&lt;td>Same recipe as DL-rigorous but each LASSO uses 3-fold cross-validation to pick lambda.&lt;/td>
&lt;td>&lt;code>cvlasso&lt;/code> ×2 + &lt;code>regress&lt;/code>&lt;/td>
&lt;td>§11&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Two pairs of estimators do most of the pedagogical work. First-diff vs. OLS-full is the &lt;em>control-count&lt;/em> contrast (no controls vs. too many controls). DL-rigorous vs. DL-CV is the &lt;em>penalty-rule&lt;/em> contrast (theory vs. data-driven). PSL sits in between as the simplest one-LASSO benchmark.&lt;/p>
&lt;hr>
&lt;h2 id="4-first-difference-ols--the-no-controls-baseline">4. First-difference OLS — the no-controls baseline&lt;/h2>
&lt;p>The original Donohue–Levitt (2001) specification regresses differenced crime on differenced abortion with no controls beyond first-differencing itself:&lt;/p>
&lt;p>$$
\Delta y_{st} = \alpha \, \Delta d_{st} + \varepsilon_{st}.
$$&lt;/p>
&lt;p>Here, $\Delta y_{st}$ is the change in the crime rate for state $s$ from year $t-1$ to $t$, $\Delta d_{st}$ is the change in the effective abortion rate, and $\varepsilon_{st}$ is the regression error. The parameter $\alpha$ is the &lt;strong>average partial effect of the differenced abortion rate on the differenced crime rate&lt;/strong>, identified under (i) conditional independence given the differenced trajectories and (ii) parallel trends in levels. We use state-clustered standard errors throughout (more on this in §9).&lt;/p>
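&lt;p>To see where the differenced equation comes from, here is a short sketch under an assumed levels model. The state effect $\mu_s$ and the levels error $u_{st}$ are notation introduced only for this illustration. Write the model in year $t$ and in year $t-1$:&lt;/p>
&lt;p>$$
y_{st} = \alpha \, d_{st} + \mu_s + u_{st}, \qquad
y_{s,t-1} = \alpha \, d_{s,t-1} + \mu_s + u_{s,t-1}.
$$&lt;/p>
&lt;p>Subtracting the second equation from the first gives&lt;/p>
&lt;p>$$
\Delta y_{st} = \alpha \, \Delta d_{st} + (u_{st} - u_{s,t-1}),
$$&lt;/p>
&lt;p>so $\varepsilon_{st} = u_{st} - u_{s,t-1}$. The time-invariant $\mu_s$ cancels, which is why no state dummies appear in the regression.&lt;/p>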
&lt;p>&lt;strong>Code chunk 2 — The first-difference OLS in Stata:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">foreach o in v p m {
local Y = cond(&amp;quot;`o'&amp;quot;==&amp;quot;v&amp;quot;,&amp;quot;Dyv&amp;quot;, cond(&amp;quot;`o'&amp;quot;==&amp;quot;p&amp;quot;,&amp;quot;Dyp&amp;quot;,&amp;quot;Dym&amp;quot;))
local D = cond(&amp;quot;`o'&amp;quot;==&amp;quot;v&amp;quot;,&amp;quot;Dxv&amp;quot;, cond(&amp;quot;`o'&amp;quot;==&amp;quot;p&amp;quot;,&amp;quot;Dxp&amp;quot;,&amp;quot;Dxm&amp;quot;))
regress `Y' `D', noconstant vce(cluster state)
}
&lt;/code>&lt;/pre>
&lt;p>Three things to notice. First, &lt;code>noconstant&lt;/code> suppresses the intercept: first-differencing removes the state fixed effect, and the specification above carries no intercept term. Second, &lt;code>vce(cluster state)&lt;/code> triggers the cluster-robust sandwich estimator with Stata&amp;rsquo;s default small-sample correction $(N-1)/(N-k) \cdot G/(G-1)$ (with $N$ observations, $k$ regressors and $G$ clusters), which is exactly the HC1-style correction used in the Fitzgerald et al. (2026) replication code, so no extra plumbing is needed. Third, the &lt;code>cond(&amp;quot;`o'&amp;quot;==&amp;quot;v&amp;quot;,&amp;quot;Dyv&amp;quot;,&amp;quot;Dyp&amp;quot;)&lt;/code> Stata idiom is a verbose if/else; if you prefer cleaner code you can use a &lt;code>local Y : word &amp;hellip; of &amp;hellip;&lt;/code> indirection or a Mata function.&lt;/p>
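&lt;p>As a sketch of the extended-macro-function alternative, the loop below walks two parallel lists by position instead of nesting &lt;code>cond()&lt;/code> calls. The macro names &lt;code>outcomes&lt;/code> and &lt;code>treatments&lt;/code> are ours, and the regressions are the same three as in Code chunk 2. The last line evaluates the small-sample factor for this panel, assuming $N = 576$, $k = 1$ and $G = 48$.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Parallel lists: position i in one list matches position i in the other
local outcomes   Dyv Dyp Dym
local treatments Dxv Dxp Dxm
forvalues i = 1/3 {
    local Y : word `i' of `outcomes'
    local D : word `i' of `treatments'
    regress `Y' `D', noconstant vce(cluster state)
}

* Small-sample factor (N-1)/(N-k) * G/(G-1) for N = 576, k = 1, G = 48
display %6.4f ((576-1)/(576-1)) * (48/47)
&lt;/code>&lt;/pre>
&lt;p>With a single regressor the first ratio equals 1, so the factor reduces to $48/47 \approx 1.021$: the correction inflates the variance by about 2%.&lt;/p>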
&lt;p>The output for the three outcomes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE (state-clustered)&lt;/th>
&lt;th>95% CI&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1521&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0337&lt;/td>
&lt;td>[−0.218, −0.086]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1084&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0219&lt;/td>
&lt;td>[−0.151, −0.066]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.2039&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0667&lt;/td>
&lt;td>[−0.335, −0.073]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Reading the violent-crime coefficient:&lt;/strong> a one-unit increase in the differenced effective abortion rate is associated with a 0.152-unit decrease in the differenced violent-crime rate. All three estimates are negative and statistically significant at the 5% level; this is the Donohue–Levitt finding. The whole point of the LASSO methods below is to ask whether this picture survives when we let 284 candidate controls compete for inclusion.&lt;/p>
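&lt;p>As a worked check on the first row of the table, the interval can be rebuilt from the point estimate and its standard error. This sketch assumes the normal critical value 1.96, which the tabulated intervals are consistent with:&lt;/p>
&lt;p>$$
-0.1521 \pm 1.96 \times 0.0337 = -0.1521 \pm 0.0661 \;\Rightarrow\; [-0.218,\; -0.086].
$$&lt;/p>
&lt;p>Stata&amp;rsquo;s own &lt;code>regress&lt;/code> output with clustered errors uses a $t$ critical value with $G - 1 = 47$ degrees of freedom, so the interval printed in the log is slightly wider than this one.&lt;/p>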
&lt;hr>
&lt;h2 id="5-kitchen-sink-ols--why-we-cannot-just-add-everything">5. Kitchen-sink OLS — why we cannot just add everything&lt;/h2>
&lt;p>A natural reaction to &amp;ldquo;you only used 8 controls&amp;rdquo; is to add all 284 and let OLS sort it out. With $p = 284 &amp;lt; n = 576$ the $X'X$ matrix is technically invertible, so &lt;code>regress&lt;/code> runs:&lt;/p>
&lt;p>&lt;strong>Code chunk 3 — Kitchen-sink OLS in Stata:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">foreach o in v p m {
local Y = cond(&amp;quot;`o'&amp;quot;==&amp;quot;v&amp;quot;,&amp;quot;DyV&amp;quot;, cond(&amp;quot;`o'&amp;quot;==&amp;quot;p&amp;quot;,&amp;quot;DyP&amp;quot;,&amp;quot;DyM&amp;quot;))
local D = cond(&amp;quot;`o'&amp;quot;==&amp;quot;v&amp;quot;,&amp;quot;DxV&amp;quot;, cond(&amp;quot;`o'&amp;quot;==&amp;quot;p&amp;quot;,&amp;quot;DxP&amp;quot;,&amp;quot;DxM&amp;quot;))
regress `Y' `D' z`o'1-z`o'284, noconstant vce(cluster state)
}
&lt;/code>&lt;/pre>
&lt;p>Here we use the &lt;strong>partialled&lt;/strong> outcomes and treatments (capital &lt;code>DyV, DxV&lt;/code> etc.) because the year fixed effects have already been removed by the FWL pre-processing step. Including 284 controls inside &lt;code>regress&lt;/code> is mechanical, but Stata will drop any column that is an exact linear combination of others — the message &lt;code>note: znumber omitted because of collinearity&lt;/code> appears in the log.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE&lt;/th>
&lt;th>95% CI&lt;/th>
&lt;th>Sign matches baseline?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>+0.0134&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.7149&lt;/td>
&lt;td>[−1.39, +1.41]&lt;/td>
&lt;td>no — flips sign&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1950&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.2236&lt;/td>
&lt;td>[−0.633, +0.243]&lt;/td>
&lt;td>yes (but CI crosses zero)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>+2.3411&lt;/strong>&lt;/td>
&lt;td style="text-align:right">2.7831&lt;/td>
&lt;td>[−3.11, +7.79]&lt;/td>
&lt;td>no — flips dramatically&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The violent-crime point estimate has flipped sign (+0.013 vs the baseline&amp;rsquo;s −0.152) and its confidence interval is very wide (roughly ±1.4); the murder estimate has exploded to &lt;strong>+2.34&lt;/strong> with a standard error of 2.78, meaning the point estimate is itself uninformative. None of the three confidence intervals excludes zero: the statistical significance of the no-controls baseline disappears once all 284 controls are added. Compared to the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion&lt;/a>, the &lt;em>point estimates&lt;/em> agree to ~0.001 (because OLS itself is numerically the same in both languages once collinear columns are dropped), but the &lt;em>standard errors&lt;/em> are much larger in Stata. The reason: Stata&amp;rsquo;s &lt;code>regress&lt;/code> drops collinear columns automatically, then computes the cluster-robust sandwich on the (smaller) full-rank submatrix without any pseudo-inverse step, so the sandwich is built on the ordinary inverse $(X'X)^{-1}$ of that unstable submatrix. R&amp;rsquo;s hand-rolled &lt;code>cluster_se()&lt;/code> helper falls back to &lt;code>MASS::ginv()&lt;/code> (Moore–Penrose pseudo-inverse) when &lt;code>solve()&lt;/code> errors, which gives a smaller but arguably less honest SE. &lt;strong>Both are mathematically valid; the Stata SEs are closer to what the JAE replication paper reports for its OLS-full specification.&lt;/strong>&lt;/p>
&lt;p>To see why, recall the OLS estimator in matrix form:&lt;/p>
&lt;p>$$
\hat\beta_{\text{OLS}} = (X'X)^{-1} X'y, \qquad
\widehat{\operatorname{Var}}(\hat\beta_{\text{OLS}}) = \hat\sigma^{2} \, (X'X)^{-1}.
$$&lt;/p>
&lt;p>Here, $X$ is the $n \times p$ design matrix (the treatment plus 284 controls), $y$ is the $n \times 1$ outcome vector, and $\hat\sigma^2$ is the estimated residual variance. The variance of any coefficient, including the treatment effect, depends on $(X'X)^{-1}$. &lt;strong>When the columns of $X$ are nearly collinear, the smallest eigenvalues of $X'X$ approach zero and its inverse blows up.&lt;/strong> This is exactly the failure mode that LASSO is designed to fix. &lt;strong>The cure is variable selection: keep the controls that matter, drop the rest.&lt;/strong>&lt;/p>
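&lt;p>The mechanism is easy to reproduce on simulated data. The sketch below is not the abortion-crime panel: it generates two almost identical regressors (all variable names and constants are ours) and compares the standard error on &lt;code>x1&lt;/code> with and without its near-copy in the model.&lt;/p>
&lt;pre>&lt;code class="language-stata">preserve                             // keep the tutorial data safe
clear
set seed 12345
set obs 576
generate x1 = rnormal()
generate x2 = x1 + 0.01*rnormal()    // nearly a copy of x1
generate y  = x1 + rnormal()
regress y x1                         // well-conditioned: small SE on x1
regress y x1 x2                      // near-collinear: SE on x1 inflates
restore
&lt;/code>&lt;/pre>
&lt;p>Because &lt;code>x2&lt;/code> explains almost all of the variation in &lt;code>x1&lt;/code>, the variance inflation factor in the second regression is on the order of $10^4$, so the standard error on &lt;code>x1&lt;/code> should grow by roughly a factor of a hundred even though the data-generating process has not changed.&lt;/p>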
&lt;hr>
&lt;h2 id="6-lasso-and-the-one-lasso-benchmark-psl">6. LASSO and the one-LASSO benchmark (PSL)&lt;/h2>
&lt;p>The Least Absolute Shrinkage and Selection Operator (&lt;a href="#19-references">Tibshirani 1996&lt;/a>) modifies the OLS minimisation by adding an L1 penalty on the coefficients:&lt;/p>
&lt;p>$$
\hat\beta_{\text{LASSO}}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \;
\frac{1}{2n} \| y - X\beta \|_2^2 \, + \, \lambda \sum_{j=1}^p \lvert\beta_j\rvert.
$$&lt;/p>
&lt;p>The first term is the usual sum of squared residuals, scaled by $1/(2n)$. The second is the penalty: $\lambda$ times the sum of the &lt;em>absolute values&lt;/em> of the coefficients. Two things make this choice interesting. First, the absolute-value penalty has a corner at zero: unlike a squared penalty (which would give Ridge regression), LASSO can set coefficients &lt;strong>exactly&lt;/strong> to zero, performing variable selection at the same time as estimation. Second, the strength of selection is controlled by one knob $\lambda$: at $\lambda = 0$ we recover OLS; as $\lambda \to \infty$ all coefficients are pinned to zero.&lt;/p>
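&lt;p>The exact-zero property is easiest to see with a single regressor. As an illustrative special case (not the 284-control problem), suppose $x$ is scaled so that $x'x/n = 1$ and let $b = x'y/n$ denote the OLS coefficient. The objective above then reduces, up to a constant, to $\tfrac{1}{2}\beta^2 - b\beta + \lambda \lvert\beta\rvert$, whose minimiser is the soft-thresholding rule:&lt;/p>
&lt;p>$$
\hat\beta(\lambda) = \operatorname{sign}(b)\, \max\bigl(\lvert b \rvert - \lambda,\; 0\bigr).
$$&lt;/p>
&lt;p>For example, with $\lambda = 0.10$ an OLS coefficient of $b = 0.30$ is shrunk to $0.20$, while $b = 0.05$ is set exactly to $0$. Any coefficient whose OLS magnitude falls below $\lambda$ is dropped, which is the selection step in miniature.&lt;/p>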
&lt;p>&lt;strong>Post-Structural LASSO (PSL)&lt;/strong> is the simplest LASSO-based causal estimator. Run one LASSO on $y$ regressed on $(d, X)$, but force the treatment $d$ to stay in by setting its coefficient&amp;rsquo;s penalty multiplier to zero. Then refit by plain OLS on the selected support. In Stata, &lt;code>rlasso&lt;/code> exposes this through &lt;code>pnotpen(varlist)&lt;/code> — variables in &lt;code>pnotpen()&lt;/code> are kept unpenalised (forced into the model regardless of $\lambda$):&lt;/p>
&lt;p>&lt;strong>Code chunk 4 — Post-Structural LASSO (PSL) in Stata:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">rlasso DyV DxV zv1-zv284, nocons pnotpen(DxV) c(1.1) gamma(0.05)
local sel &amp;quot;`e(selected)'&amp;quot;                              // includes DxV (the pnotpen var)
local treat DxV                                        // list operators take macro names, not literals
local sel : list sel - treat                           // strip the treatment out
regress DyV DxV `sel', noconstant vce(cluster state)   // post-OLS
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>A design choice.&lt;/strong> The &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion&lt;/a> implements PSL with &lt;code>cv.glmnet(..., penalty.factor = c(0, rep(1, p)), nfolds = 3)&lt;/code>, a CV-tuned LASSO with the treatment pinned. Stata&amp;rsquo;s &lt;code>cvlasso&lt;/code> exposes the same recipe via its &lt;code>notpen()&lt;/code> option, but at this regime ($p = 284$, $n = 576$) each &lt;code>cvlasso&lt;/code> call partials out the pinned variable and walks a 100-lambda grid in a way that takes 5+ minutes per call. To keep the post runnable in a reasonable session we use &lt;strong>&lt;code>rlasso&lt;/code> with the rigorous (BCH theory) penalty&lt;/strong> for PSL instead. The recipe is identical (one LASSO with the treatment pinned, then post-OLS on the selected support); only the penalty rule changes. The trade-off is documented in §15.&lt;/p>
&lt;p>A few annotations on the Stata idioms. &lt;code>nocons&lt;/code> is correct because the data has already been partialled for year fixed effects (mean $\approx 0$). &lt;code>pnotpen(DxV)&lt;/code> forces &lt;code>DxV&lt;/code> into the LASSO model with zero penalty. The constants &lt;code>c(1.1)&lt;/code> and &lt;code>gamma(0.05)&lt;/code> are the Belloni–Chernozhukov–Hansen rigorous-penalty constants, passed explicitly rather than left to &lt;code>rlasso&lt;/code>&amp;rsquo;s own defaults (see §7 for derivation). The &lt;code>local sel : list sel - treat&lt;/code> line is Stata&amp;rsquo;s macro list-subtract. Its operands are macro &lt;em>names&lt;/em>, not literal variable names, which is why the treatment is first stored in the local &lt;code>treat&lt;/code>. &lt;code>e(selected)&lt;/code> from &lt;code>rlasso&lt;/code> includes the &lt;code>pnotpen&lt;/code> variable, so we remove it before the post-OLS regression adds &lt;code>DxV&lt;/code> back explicitly.&lt;/p>
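&lt;p>To see what the LASSO step actually kept, the stored results can be printed before &lt;code>regress&lt;/code> overwrites them. This is a minimal sketch: it assumes &lt;code>rlasso&lt;/code> leaves the penalty level in &lt;code>e(lambda)&lt;/code> and the selected variables in &lt;code>e(selected)&lt;/code>, as we understand its help file, and the local names &lt;code>treat&lt;/code> and &lt;code>nsel&lt;/code> are ours.&lt;/p>
&lt;pre>&lt;code class="language-stata">rlasso DyV DxV zv1-zv284, nocons pnotpen(DxV) c(1.1) gamma(0.05)
display &amp;quot;lambda: &amp;quot; e(lambda)               // penalty level chosen by the rigorous rule

local sel &amp;quot;`e(selected)'&amp;quot;
local treat DxV                           // macro holding the treatment name
local sel : list sel - treat              // controls only, treatment removed
local nsel : word count `sel'
display &amp;quot;controls kept besides the treatment: `nsel'&amp;quot;
display &amp;quot;`sel'&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>The count printed here is the quantity reported in the &amp;ldquo;# controls selected&amp;rdquo; column of the results table.&lt;/p>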
&lt;p>The results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE&lt;/th>
&lt;th style="text-align:right"># controls selected&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1553&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0330&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.0665&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0244&lt;/td>
&lt;td style="text-align:right">1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.2397&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0635&lt;/td>
&lt;td style="text-align:right">1&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>PSL with the rigorous penalty is extremely parsimonious here — for violent crime no controls survive (so the post-OLS reduces to the no-controls baseline of −0.155, which matches the §4 first-difference estimate of −0.152 essentially exactly); for property crime and murder only a single control survives. All three point estimates are negative and well-determined (SE around 0.025–0.064). Compare to the R companion&amp;rsquo;s PSL implementation, which uses 3-fold cross-validation rather than the rigorous penalty: R reports −0.157, −0.068 and −0.206 with 3, 12 and 0 controls. The Stata and R PSL implementations differ in &lt;em>how the LASSO selects controls&lt;/em> (rigorous penalty vs. CV) but agree on the qualitative pattern — small selection sets, negative estimates close to the baseline.&lt;/p>
&lt;p>&lt;strong>Why is this not the end of the story?&lt;/strong> &lt;strong>Because PSL has a causal-inference blind spot.&lt;/strong> LASSO selects controls based on how well they predict $y$. But a covariate can be a &lt;em>confounder&lt;/em> — biasing $\hat\alpha$ if omitted — even when it does not predict $y$ strongly. Imagine a variable that is highly correlated with the treatment $d$ but only weakly with $y$. PSL&amp;rsquo;s one LASSO will drop it (it does not improve prediction of $y$ much), and the post-OLS will inherit the omitted-variable bias. &lt;a href="#19-references">Belloni, Chernozhukov and Hansen (2014)&lt;/a> made exactly this point, and proposed Double LASSO as the fix.&lt;/p>
&lt;hr>
&lt;h2 id="7-double-lasso--the-causal-side-fix">7. Double LASSO — the causal-side fix&lt;/h2>
&lt;p>Double LASSO runs &lt;strong>two&lt;/strong> LASSOs, not one. The first LASSO predicts the outcome $y$ from the controls; call its selected index set $I_y$. The second LASSO predicts the treatment $d$ from the same controls; call its selected index set $I_d$. The final estimate of $\alpha$ comes from a plain OLS regression of $y$ on $d$ and the &lt;strong>union&lt;/strong> $I_y \cup I_d$, with state-clustered standard errors.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart TD
A[&amp;quot;Data: outcome y, treatment d,&amp;lt;br/&amp;gt;controls X (p = 284)&amp;quot;] --&amp;gt; B[&amp;quot;Step 1: rlasso y on X&amp;lt;br/&amp;gt;(no d on right-hand side)&amp;lt;br/&amp;gt;selected set I_y&amp;quot;]
A --&amp;gt; C[&amp;quot;Step 2: rlasso d on X&amp;lt;br/&amp;gt;(no y on right-hand side)&amp;lt;br/&amp;gt;selected set I_d&amp;quot;]
B --&amp;gt; D[&amp;quot;Union: I_y &amp;amp;cup; I_d&amp;quot;]
C --&amp;gt; D
D --&amp;gt; E[&amp;quot;Step 3: regress y d X[, union]&amp;lt;br/&amp;gt;noconstant vce(cluster state)&amp;quot;]
E --&amp;gt; F[&amp;quot;Causal estimate alpha-hat&amp;quot;]
style A fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style B fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style C fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style D fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style E fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style F fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
&lt;/code>&lt;/pre>
&lt;p>The intuition is rooted in the &lt;strong>Frisch–Waugh–Lovell theorem&lt;/strong>. To estimate $\alpha$ in the structural equation $y_i = \alpha\, d_i + x_i' \theta + \zeta_i$, FWL says we can residualise both $y$ and $d$ against the same set of controls and regress the residuals. Concretely, let $M_X = I - X(X&amp;rsquo;X)^{-1}X'$ be the residual-maker matrix; then&lt;/p>
&lt;p>$$
\hat\alpha = \bigl(\tilde d' \tilde d\bigr)^{-1} \tilde d' \tilde y, \quad \text{where} \quad \tilde y = M_X y, \, \tilde d = M_X d.
$$&lt;/p>
&lt;p>The trick is that we do not need to use &lt;em>all&lt;/em> of $X$ in the residualisation. We only need to use enough of $X$ to capture the part that is correlated with $d$. Double LASSO does this approximately: $I_d$ catches the controls correlated with $d$; $I_y$ catches the controls correlated with $y$; their union catches both. Refitting OLS on $d$ plus the union approximates the FWL projection without committing to all 284 controls.&lt;/p>
&lt;p>The &amp;ldquo;rigorous&amp;rdquo; penalty rule chooses $\lambda$ from theory, not from CV. &lt;a href="#19-references">Belloni, Chen, Chernozhukov and Hansen (2012)&lt;/a> showed that the right scaling is&lt;/p>
&lt;p>$$
\lambda^{\text{rig}} = \frac{2 c \, \hat\sigma}{\sqrt{n}} \, \Phi^{-1}\!\left(1 - \frac{\gamma}{2 p}\right), \quad c = 1.1, \, \gamma = 0.05,
$$&lt;/p>
&lt;p>where $\hat\sigma$ is a pilot estimate of the residual standard deviation, $n$ is the sample size, $p$ is the number of candidate controls, and $\Phi^{-1}$ is the inverse standard-normal CDF. The factor $\Phi^{-1}(1 - \gamma / (2p))$ is a Bonferroni-style correction that keeps the false-positive rate of LASSO selection under control even though we are testing $p$ coefficients. The constants $c = 1.1$ and $\gamma = 0.05$ are the defaults the JAE replication code uses; we pass them explicitly to &lt;code>rlasso&lt;/code> for cross-language consistency with the R companion&amp;rsquo;s &lt;code>hdm::rlasso&lt;/code> call.&lt;/p>
&lt;p>&lt;strong>Code chunk 5 — The two rigorous LASSOs and the post-OLS in Stata:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">* Step 1: LASSO y on X.
rlasso DyV zv1-zv284, nocons c(1.1) gamma(0.05)
local Iy &amp;quot;`e(selected)'&amp;quot;
* Step 2: LASSO d on X.
rlasso DxV zv1-zv284, nocons c(1.1) gamma(0.05)
local Id &amp;quot;`e(selected)'&amp;quot;
* Step 3: union of selected, then post-OLS with cluster-robust SE.
local U : list Iy | Id
regress DyV DxV `U', noconstant vce(cluster state)
&lt;/code>&lt;/pre>
&lt;p>A few notes. &lt;code>nocons&lt;/code> is correct here because the data has already been partialled for year fixed effects (so the column means are essentially zero); including a constant on already-partialled data tends to produce spurious selections. Stata&amp;rsquo;s &lt;code>rlasso&lt;/code> does &lt;em>not&lt;/em> take a &lt;code>post&lt;/code> flag the way R&amp;rsquo;s &lt;code>hdm::rlasso&lt;/code> does — &lt;code>e(selected)&lt;/code> always returns the variable names whose coefficients are non-zero, and we run our own post-OLS afterward to attach the state-clustered standard error. The list operator &lt;code>: list Iy | Id&lt;/code> is Stata&amp;rsquo;s set union for macro lists; it produces a deduplicated list of variable names.&lt;/p>
&lt;p>The results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha$&lt;/th>
&lt;th style="text-align:right">SE&lt;/th>
&lt;th>95% CI&lt;/th>
&lt;th style="text-align:right">|I_y|&lt;/th>
&lt;th style="text-align:right">|I_d|&lt;/th>
&lt;th style="text-align:right">Union&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1744&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.1155&lt;/td>
&lt;td>[−0.401, +0.052]&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1144&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.0470&lt;/td>
&lt;td>[−0.207, −0.022]&lt;/td>
&lt;td style="text-align:right">3&lt;/td>
&lt;td style="text-align:right">14&lt;/td>
&lt;td style="text-align:right">17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">&lt;strong>−0.1229&lt;/strong>&lt;/td>
&lt;td style="text-align:right">0.1404&lt;/td>
&lt;td>[−0.398, +0.152]&lt;/td>
&lt;td style="text-align:right">1&lt;/td>
&lt;td style="text-align:right">12&lt;/td>
&lt;td style="text-align:right">13&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Reading the violent-crime row.&lt;/strong> $\hat\alpha = -0.174$ means a unit increase in the differenced effective abortion rate is associated with a 0.174-unit decrease in the differenced violent-crime rate, conditional on the 8 controls in the union. The 95% confidence interval [−0.401, +0.052] contains zero — under this specification, the violent-crime effect drops below significance at the 5% level. The selection counts |I_y| = 0, |I_d| = 8 tell us something more interesting: the LASSO of crime on controls picked &lt;strong>zero&lt;/strong> controls (out of 284), while the LASSO of abortion on controls picked 8. The R companion gets the same |I_y| = 0, |I_d| = 8 fingerprint with a slightly less negative point estimate (R: −0.0964). Same selected &lt;em>count&lt;/em>, slightly different selected &lt;em>identities&lt;/em> and post-OLS numbers — §15 below quantifies this drift.&lt;/p>
&lt;p>&lt;strong>The one-line equivalent: &lt;code>pdslasso&lt;/code>.&lt;/strong> The three lines above can be collapsed into a single command:&lt;/p>
&lt;pre>&lt;code class="language-stata">pdslasso DyV DxV (zv1-zv284), cluster(state) loptions(c(1.1) gamma(0.05))
&lt;/code>&lt;/pre>
&lt;p>&lt;code>pdslasso&lt;/code> runs the two &lt;code>rlasso&lt;/code> calls internally, takes the union, runs the post-OLS, and reports cluster-robust SEs — the same recipe as the explicit three-step code. We use the explicit form in this post so the LASSO selections at each step remain visible. The next section unpacks the &lt;strong>three&lt;/strong> distinct estimates &lt;code>pdslasso&lt;/code> actually reports — the PDS coefficient above is only one of them.&lt;/p>
&lt;hr>
&lt;h2 id="8-the-three-estimators-pdslasso-reports">8. The three estimators &lt;code>pdslasso&lt;/code> reports&lt;/h2>
&lt;p>When you run &lt;code>pdslasso&lt;/code>, Stata does not give you a single number — it gives you &lt;strong>three&lt;/strong> estimates of the same treatment effect $\alpha$, stacked one above the other in the output. All three are valid; all three target the same causal quantity; they differ only in &lt;em>how&lt;/em> the high-dimensional controls $X$ are residualised out of $y$ and $d$ before the final coefficient is computed. Understanding the three flavours is the difference between trusting the output and second-guessing it. This section walks through each, then shows the actual three-panel output on our violent-crime equation.&lt;/p>
&lt;p>The framework is from &lt;a href="#19-references">Belloni, Chernozhukov, Hansen and Kozbur (2016)&lt;/a> and its accessible review in &lt;a href="#19-references">Chernozhukov, Hansen and Spindler (2015)&lt;/a>. The intuition rests on the same Frisch–Waugh–Lovell logic we used in §7: to recover the causal $\hat\alpha$ in the structural equation $y = \alpha d + x' \theta + \zeta$, residualise both $y$ and $d$ against the controls, then regress residual on residual. The three estimators differ in &lt;em>what residualisation rule&lt;/em> they use.&lt;/p>
&lt;h3 id="81-the-common-starting-point-filter-the-controls-out-of-both-sides">8.1 The common starting point: filter the controls out of both sides&lt;/h3>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
Z[&amp;quot;High-dim controls X (p = 284)&amp;quot;] --&amp;gt; Y[&amp;quot;Outcome y (DyV)&amp;quot;]
Z --&amp;gt; D[&amp;quot;Treatment d (DxV)&amp;quot;]
Y --&amp;gt; R1[&amp;quot;residual y&amp;quot;]
D --&amp;gt; R2[&amp;quot;residual d&amp;quot;]
R1 --&amp;gt; A[&amp;quot;final OLS: y-tilde = &amp;amp;alpha; d-tilde + &amp;amp;epsilon;&amp;quot;]
R2 --&amp;gt; A
style Z fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style Y fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style D fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style R1 fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style R2 fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style A fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
&lt;/code>&lt;/pre>
&lt;p>All three estimators consume the same diagram. They diverge only at the residualisation step — how to &amp;ldquo;filter out&amp;rdquo; the controls. Method 1 uses Lasso coefficients directly; Method 2 uses OLS coefficients on the Lasso-selected controls; Method 3 skips residualisation entirely and just runs one big OLS on the union of selected controls plus the treatment.&lt;/p>
&lt;h3 id="82-method-1--lasso-orthogonalized-regression">8.2 Method 1 — Lasso-orthogonalized regression&lt;/h3>
&lt;p>&lt;strong>The strict-regularisation path.&lt;/strong> This estimator trusts Lasso&amp;rsquo;s shrunken coefficients all the way through.&lt;/p>
&lt;p>&lt;strong>Recipe.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Run &lt;code>rlasso&lt;/code> of $y$ on $X$. Keep the residuals $\tilde y = y - X \hat\beta_y^{\text{LASSO}}$.&lt;/li>
&lt;li>Run &lt;code>rlasso&lt;/code> of $d$ on $X$. Keep the residuals $\tilde d = d - X \hat\beta_d^{\text{LASSO}}$.&lt;/li>
&lt;li>Run OLS of $\tilde y$ on $\tilde d$ (with state-clustered SE). The coefficient is $\hat\alpha_{\text{ortho}}$.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Catch.&lt;/strong> Lasso intentionally shrinks every coefficient it keeps toward zero. So $X \hat\beta_y^{\text{LASSO}}$ slightly &lt;em>under-fits&lt;/em> $y$ and the residuals $\tilde y$ retain a little regularised noise. Same for $\tilde d$. The downstream $\hat\alpha_{\text{ortho}}$ has slightly lower variance than Method 2&amp;rsquo;s analogue but a small shrinkage-induced bias.&lt;/p>
&lt;h3 id="83-method-2--post-lasso-orthogonalized-regression">8.3 Method 2 — Post-lasso-orthogonalized regression&lt;/h3>
&lt;p>&lt;strong>The unshrunk-residual path.&lt;/strong> This estimator uses Lasso &lt;em>only&lt;/em> as a variable selector, then re-fits each residualisation by plain OLS.&lt;/p>
&lt;p>&lt;strong>Recipe.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Run &lt;code>rlasso&lt;/code> of $y$ on $X$. Record the &lt;em>names&lt;/em> of the selected controls $I_y$.&lt;/li>
&lt;li>Run OLS of $y$ on $X_{I_y}$ (no penalty, full coefficients). Keep these residuals.&lt;/li>
&lt;li>Same for the treatment: &lt;code>rlasso&lt;/code> of $d$ on $X$ → $I_d$ → OLS of $d$ on $X_{I_d}$ → residuals.&lt;/li>
&lt;li>Final OLS of the post-Lasso residuals on each other gives $\hat\alpha_{\text{post-ortho}}$.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Advantage.&lt;/strong> Because step 2 is unpenalised OLS, the residualisation is sharp — no shrinkage noise leaks into the residuals. The trade-off is slightly higher variance than Method 1 on small samples.&lt;/p>
&lt;h3 id="84-method-3--post-double-selection-pds-regression">8.4 Method 3 — Post-double-selection (PDS) regression&lt;/h3>
&lt;p>&lt;strong>The transparent path.&lt;/strong> This is the recipe we ran explicitly in §7 — and it is the only one of the three that produces a regression table you can read in a normal textbook way.&lt;/p>
&lt;p>&lt;strong>Recipe.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Run &lt;code>rlasso&lt;/code> of $y$ on $X$, record $I_y$.&lt;/li>
&lt;li>Run &lt;code>rlasso&lt;/code> of $d$ on $X$, record $I_d$.&lt;/li>
&lt;li>Take the &lt;strong>union&lt;/strong> $I_y \cup I_d$ — any control selected by either side stays in.&lt;/li>
&lt;li>Run one big OLS: regress $y$ on $d$ plus the union of selected controls (no residualisation). The coefficient on $d$ is $\hat\alpha_{\text{PDS}}$.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
L1[&amp;quot;rlasso y on X &amp;amp;rarr; I_y&amp;quot;] --&amp;gt; U[&amp;quot;Union I_y &amp;amp;cup; I_d&amp;quot;]
L2[&amp;quot;rlasso d on X &amp;amp;rarr; I_d&amp;quot;] --&amp;gt; U
U --&amp;gt; O[&amp;quot;one big OLS:&amp;amp;nbsp; y = &amp;amp;alpha;&amp;amp;middot;d + X[union]&amp;amp;middot;&amp;amp;theta; + &amp;amp;epsilon;&amp;quot;]
O --&amp;gt; R[&amp;quot;regression table with alpha-hat AND control coefficients&amp;quot;]
style L1 fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style L2 fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style U fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style O fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style R fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Advantage.&lt;/strong> Maximum transparency. You see $\hat\alpha$ alongside the coefficients of every selected control with proper SEs, t-stats, and p-values. The valid-inference guarantee from &lt;a href="#19-references">Belloni, Chernozhukov, Hansen (2014)&lt;/a> applies only to the $\hat\alpha$ row — the control-coefficient SEs are NOT valid (Stata flags this with the &amp;ldquo;Standard errors and test statistics valid for the following variables only: &amp;hellip;&amp;rdquo; note at the bottom of the panel).&lt;/p>
&lt;h3 id="85-summary-comparison">8.5 Summary comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>1. Lasso-orthogonalized&lt;/th>
&lt;th>2. Post-lasso-orthogonalized&lt;/th>
&lt;th>3. Post-double-selection (PDS)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Final step&lt;/strong>&lt;/td>
&lt;td>OLS on Lasso residuals&lt;/td>
&lt;td>OLS on post-Lasso residuals&lt;/td>
&lt;td>OLS on raw $d$ + selected $X$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Shrinkage bias in $\hat\alpha$?&lt;/strong>&lt;/td>
&lt;td>Yes (small)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>What the output shows&lt;/strong>&lt;/td>
&lt;td>Just $\hat\alpha$&lt;/td>
&lt;td>Just $\hat\alpha$&lt;/td>
&lt;td>$\hat\alpha$ &lt;strong>plus&lt;/strong> all selected control coefficients&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Best for&lt;/strong>&lt;/td>
&lt;td>Slightly lower variance on small $n$&lt;/td>
&lt;td>Cleanly unshrunk residuals&lt;/td>
&lt;td>Reading the result like a normal regression table&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="86-the-actual-pdslasso-output-on-our-data">8.6 The actual &lt;code>pdslasso&lt;/code> output on our data&lt;/h3>
&lt;p>Running &lt;code>pdslasso DyV DxV (zv1-zv284), cluster(state) loptions(c(1.1) gamma(0.05))&lt;/code> on the violent-crime equation produces three coefficient panels (slightly trimmed for readability):&lt;/p>
&lt;pre>&lt;code class="language-text">1. (PDS/CHS) Selecting HD controls for dep var DyV...
Selected: zv284
2. (PDS/CHS) Selecting HD controls for exog regressor DxV...
Selected: zv228 zv244 zv279
Specification:
Regularization method: lasso
Penalty loadings: cluster-lasso
Number of observations: 576
Number of clusters: 48
Exogenous (1): DxV
High-dim controls (284): zv1 zv2 zv3 ... zv284
Selected controls (4): zv228 zv244 zv279 zv284
Unpenalized controls (1): _cons
Structural equation:
OLS using CHS lasso-orthogonalized vars
(Std. Err. adjusted for 48 clusters in state)
------------------------------------------------------------------------------
| Robust
DyV | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
DxV | -.2110147 .0899177 -2.35 0.019 -.3872502 -.0347792
------------------------------------------------------------------------------
OLS using CHS post-lasso-orthogonalized vars
(Std. Err. adjusted for 48 clusters in state)
------------------------------------------------------------------------------
| Robust
DyV | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
DxV | -.1675744 .1005712 -1.67 0.096 -.3646903 .0295416
------------------------------------------------------------------------------
OLS with PDS-selected variables and full regressor set
(Std. Err. adjusted for 48 clusters in state)
------------------------------------------------------------------------------
| Robust
DyV | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
DxV | -.1764142 .1078564 -1.64 0.102 -.3878088 .0349804
zv228 | .84779 4.01065 0.21 0.833 -7.012939 8.708519
zv244 | -3.437135 6.564852 -0.52 0.601 -16.30401 9.429739
zv279 | .2585369 .1314611 1.97 0.049 .0008779 .5161958
zv284 | -2.617675 .5835982 -4.49 0.000 -3.761506 -1.473843
_cons | -1.74e-11 .0027138 -0.00 1.000 -.0053189 .0053189
------------------------------------------------------------------------------
Standard errors and test statistics valid for the following variables only:
DxV
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Reading the three panels.&lt;/strong> All three estimates of $\hat\alpha$ point the same direction: a one-unit increase in the differenced abortion rate is associated with a $0.17$ to $0.21$-unit decrease in the differenced violent-crime rate. The &lt;strong>lasso-orthogonalized&lt;/strong> estimate is the most negative ($-0.211$, SE $0.090$, $p = 0.019$ — significant at 5%); the &lt;strong>post-lasso-orthogonalized&lt;/strong> estimate moves toward zero ($-0.168$, SE $0.101$, $p = 0.096$ — just outside 10%); the &lt;strong>PDS&lt;/strong> estimate sits in between ($-0.176$, SE $0.108$, $p = 0.102$). The gap between them is exactly the shrinkage-vs-no-shrinkage trade-off discussed in §§8.2–8.3.&lt;/p>
&lt;p>&lt;strong>Why does this differ from our §7 explicit recipe?&lt;/strong> We reported DL-rigorous violent-crime as $\hat\alpha = -0.1744$ with $|I_y \cup I_d| = 8$. &lt;code>pdslasso&lt;/code> reports the PDS column as $\hat\alpha = -0.1764$ with &lt;code>Selected controls (4): zv228 zv244 zv279 zv284&lt;/code>. Same method, different selection counts (4 vs 8). The reason: &lt;code>pdslasso&lt;/code>&amp;rsquo;s &lt;code>cluster(state)&lt;/code> option also makes the &lt;strong>LASSO penalty loadings&lt;/strong> cluster-robust (note the &lt;code>Penalty loadings: cluster-lasso&lt;/code> line in the preamble). Our §7 explicit &lt;code>rlasso&lt;/code> calls used the default heteroskedasticity-robust loadings. Cluster-robust loadings are &lt;em>tighter&lt;/em> on panel data because they account for within-state autocorrelation in the score, so fewer controls survive the rigorous penalty. The point estimate barely moves (−0.176 vs −0.174) — a comforting robustness check.&lt;/p>
&lt;h3 id="87-practice-tip">8.7 Practice tip&lt;/h3>
&lt;p>The one-line invocation is:&lt;/p>
&lt;pre>&lt;code class="language-stata">pdslasso DyV DxV (zv1-zv284), cluster(state) loptions(c(1.1) gamma(0.05))
&lt;/code>&lt;/pre>
&lt;p>Try varying:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>cluster(state)&lt;/code> → &lt;code>robust&lt;/code>&lt;/strong>: switches the LASSO loadings from cluster-robust to heteroskedasticity-robust. You will see the union of selected controls grow back toward the 8 we got in §7 with the explicit recipe.&lt;/li>
&lt;li>&lt;strong>&lt;code>loptions(c(1.1) gamma(0.05))&lt;/code> → &lt;code>loptions(c(0.5) gamma(0.05))&lt;/code>&lt;/strong>: loosens the rigorous penalty by lowering $c$. Many more controls survive, the post-OLS coefficient table grows, and the three estimates of $\hat\alpha$ start to diverge — exactly the &amp;ldquo;loose-penalty&amp;rdquo; pathology that §11 anchors on for the rigorous-vs-CV contrast.&lt;/li>
&lt;li>&lt;strong>Drop &lt;code>(zv1-zv284)&lt;/code> controls entirely&lt;/strong>: degenerates &lt;code>pdslasso&lt;/code> to plain OLS of &lt;code>DyV&lt;/code> on &lt;code>DxV&lt;/code> — you should recover the §4 first-difference baseline of $-0.1521$.&lt;/li>
&lt;/ul>
&lt;p>The fact that &lt;strong>all three orthogonalisations land on essentially the same answer here&lt;/strong> is itself the headline takeaway: when the rigorous penalty selects a sparse, sensible set of controls, the choice between lasso-residualisation, post-lasso-residualisation, and PDS does not move the causal estimate beyond its own standard error. The framework is robust to the residualisation rule precisely because the rigorous-penalty selection is itself disciplined.&lt;/p>
&lt;hr>
&lt;h2 id="9-state-clustered-standard-errors">9. State-clustered standard errors&lt;/h2>
&lt;p>A digression on the standard errors. The 576 observations are not independent — they are 12 differenced years of data for each of 48 states, and within-state observations are autocorrelated through governor effects, state policy waves, and business-cycle exposure. Treating them as independent (Stata&amp;rsquo;s default &lt;code>regress&lt;/code> vcov) would understate the uncertainty by about 40% on this panel. The &lt;code>vce(cluster state)&lt;/code> option applies a cluster-robust sandwich estimator with Stata&amp;rsquo;s default HC1-style finite-sample adjustment (&lt;a href="#19-references">Cameron and Miller 2015&lt;/a>):&lt;/p>
&lt;p>$$
\hat V_{\text{cluster}} = \underbrace{\frac{N-1}{N-k}}_{\text{small-sample}} \cdot \underbrace{\frac{G}{G-1}}_{\text{cluster-count}} \cdot \underbrace{(X'X)^{-1}}_{\text{bread}} \cdot \underbrace{\left(\sum_{g=1}^G X_g' \hat e_g \hat e_g' X_g\right)}_{\text{meat}} \cdot \underbrace{(X'X)^{-1}}_{\text{bread}}.
$$&lt;/p>
&lt;p>The &amp;ldquo;sandwich&amp;rdquo; name comes from the structure: two slices of bread $(X&amp;rsquo;X)^{-1}$ around the meat $\sum_g X_g' \hat e_g \hat e_g' X_g$, the cluster-summed outer product of the within-cluster scores. The two front factors are the small-sample correction: $(N-1)/(N-k)$ adjusts for the degrees of freedom consumed by the regressors, and $G/(G-1)$ adjusts for the number of clusters. Here $N = 576$, $k$ is the number of fitted columns (varies by estimator), and $G = 48$ is the number of states.&lt;/p>
&lt;p>This is &lt;strong>exactly&lt;/strong> the formula the R companion implements by hand in its &lt;code>cluster_se()&lt;/code> helper. Stata&amp;rsquo;s &lt;code>vce(cluster state)&lt;/code> applies it automatically, so the Stata script never has to write the sandwich code explicitly. The numerical agreement between the two implementations on the &lt;em>deterministic&lt;/em> estimators (first-difference OLS and kitchen-sink OLS) is the cleanest demonstration that the small-sample correction matches.&lt;/p>
&lt;p>The cluster-count correction $G/(G-1)$ assumes the number of clusters $G$ is &amp;ldquo;large.&amp;rdquo; A rule of thumb is $G \geq 30$; with $G = 48$ states we are comfortably above that threshold. With only 5 or 10 clusters, the cluster-robust SE would be unreliable and you would need to switch to wild bootstrap or block bootstrap inference (Stata&amp;rsquo;s &lt;code>boottest&lt;/code> package implements both).&lt;/p>
&lt;hr>
&lt;h2 id="10-when-does-double-lasso-help-most">10. When does Double LASSO help most?&lt;/h2>
&lt;p>Look back at the DL-rigorous table in §7. For violent crime and murder, |I_y| is essentially zero — the LASSO of &lt;em>crime&lt;/em> on controls picked very few variables out of 284. For all three outcomes |I_d| is between 8 and 12 — the LASSO of &lt;em>abortion&lt;/em> on controls picked a handful. This asymmetry is the empirical fingerprint of the situation in which Double LASSO most helps: &lt;strong>the treatment is well-predicted by the controls, but the outcome is not&lt;/strong>. Fitzgerald et al. (2026) emphasise this in their footnote 4, paraphrased: &lt;em>DL is most useful when the outcome is hard to predict but the treatment is well-predicted, because that is when the second LASSO catches controls that the first one missed.&lt;/em>&lt;/p>
&lt;p>Why does this matter for causal inference? Recall the PSL blind spot from §6: a one-LASSO procedure on $y$ can drop a control that strongly predicts $d$ if it does not strongly predict $y$. Suppose the (unobserved) data-generating process is&lt;/p>
&lt;p>$$
y_i = \alpha \, d_i + x_i' \theta + \zeta_i, \quad d_i = x_i' \pi + v_i, \quad \zeta_i \perp v_i.
$$&lt;/p>
&lt;p>If a particular $x_j$ has a large $\pi_j$ but a small $\theta_j$, then $x_j$ is a strong confounder (it predicts $d$, and thus moves $\hat\alpha$ when omitted), but a weak predictor of $y$. PSL drops it; DL keeps it via the d-equation LASSO. The empirical fingerprint $|I_y| \approx 0$ and $|I_d| \approx 8$–12 means we are exactly in this regime: the small set of controls that survived the d-equation LASSO are doing all of the confounding-control work in the final OLS. The bar chart below visualises this asymmetry across the three outcomes:&lt;/p>
&lt;p>&lt;img src="stata_double_lasso_selection.png" alt="Selection counts |I_y| and |I_d| for the rigorous-penalty DL and CV-penalty DL across the three outcomes — the asymmetry between Iy and Id is the fingerprint of DL&amp;rsquo;s advantage over PSL.">&lt;/p>
&lt;p>A natural follow-up question: which 8 controls? The paper&amp;rsquo;s §4 discussion (and our &lt;code>selection_diagnostic.csv&lt;/code> for the curious) names lagged prisoners per capita, lagged income per capita, and lagged unemployment as common selections. These are exactly the variables Donohue and Levitt themselves controlled for in 2001 — DL has, in a sense, &lt;em>rediscovered&lt;/em> a sensible subset of the original eight controls from a candidate pool of 284, automatically.&lt;/p>
&lt;hr>
&lt;h2 id="11-rigorous-vs-cross-validated-penalty--and-a-stata-caveat">11. Rigorous vs. cross-validated penalty — and a Stata caveat&lt;/h2>
&lt;p>The second flavour of Double LASSO replaces the rigorous penalty with &lt;strong>3-fold cross-validation&lt;/strong>. The recipe is identical to §7 — two LASSOs, take the union, post-OLS — but each LASSO now uses &lt;code>cvlasso&lt;/code> to pick $\lambda$ by minimising out-of-sample mean-squared error on the prediction problem. The catch is that this choice optimises a different objective — prediction-MSE on $y$ alone, or on $d$ alone, is not the same thing as choosing the right controls for the causal estimate of $\alpha$.&lt;/p>
&lt;p>&lt;strong>Code chunk 6 — The CV-penalty Double LASSO in Stata:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">cvlasso DyV zv1-zv284, nfolds(3) seed(20260520) lopt lglmnet lcount(10)
local Iy &amp;quot;`e(selected)'&amp;quot;
cvlasso DxV zv1-zv284, nfolds(3) seed(20260520) lopt lglmnet lcount(10)
local Id &amp;quot;`e(selected)'&amp;quot;
local U : list Iy | Id
regress DyV DxV `U', noconstant vce(cluster state)
&lt;/code>&lt;/pre>
&lt;p>Same structure as §7 with one engine swap: &lt;code>rlasso&lt;/code> → &lt;code>cvlasso&lt;/code>. The &lt;code>lopt&lt;/code> flag is the analogue of R&amp;rsquo;s &lt;code>lambda.min&lt;/code>; &lt;code>lglmnet&lt;/code> aligns the lambda parameterisation with &lt;code>glmnet&lt;/code> so results are comparable across the two languages.&lt;/p>
&lt;p>&lt;strong>A pragmatic Stata caveat.&lt;/strong> At this regime ($p = 284$, $n = 576$) Stata&amp;rsquo;s &lt;code>cvlasso&lt;/code> is dramatically slower than R&amp;rsquo;s &lt;code>cv.glmnet&lt;/code> — each call with the default &lt;code>lcount(100)&lt;/code> and the rigorous-penalty-style lambda search took 5+ minutes on Apple Silicon. To get the 6-call DL-CV pipeline to finish in a reasonable session, we set &lt;code>lcount(10)&lt;/code>, restricting the cross-validation search to only 10 lambda values along the path. The trade-off is real: with a coarse grid, &lt;code>cvlasso&lt;/code> warns that the CV-optimal $\lambda$ is at the boundary of the search range, and the &lt;em>selected set is empty for all three outcomes&lt;/em>. The post-OLS therefore reduces to a no-controls regression of the partialled outcome on the partialled treatment, which is a rescaled first-difference estimator — the violent-crime DL-CV estimate of −0.155 is essentially the §6 PSL number.&lt;/p>
&lt;p>What does this mean for the pedagogical point? In R, DL-CV with &lt;code>cv.glmnet&lt;/code>&amp;rsquo;s default fine lambda grid keeps 150 controls for violent crime and &lt;strong>flips the sign&lt;/strong> of $\hat\alpha$ to +0.019 — a dramatic illustration of the over-selection pathology. In Stata, the runtime constraint forces a coarse lambda grid, which under-selects so aggressively that the same pathology never appears. The Stata reader should treat the DL-CV row in this post as a &lt;strong>runtime-limited approximation&lt;/strong> and consult the R companion for a faithful demonstration of the CV over-selection problem.&lt;/p>
&lt;p>&lt;img src="stata_double_lasso_methods_compare.png" alt="Rigorous-penalty vs. (runtime-limited) CV-penalty Double LASSO across the three outcomes. Stata&amp;rsquo;s DL-CV at lcount(10) collapses to the no-controls baseline; the R companion shows CV&amp;rsquo;s true over-selection behavior.">&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">$\hat\alpha_{\text{rig}}$&lt;/th>
&lt;th style="text-align:right">$\hat\alpha_{\text{CV}}$&lt;/th>
&lt;th style="text-align:right">$\lvert I_y \cup I_d \rvert_{\text{rig}}$&lt;/th>
&lt;th style="text-align:right">$\lvert I_y \cup I_d \rvert_{\text{CV}}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Violent crime&lt;/td>
&lt;td style="text-align:right">−0.1744&lt;/td>
&lt;td style="text-align:right">−0.1553&lt;/td>
&lt;td style="text-align:right">8&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Property crime&lt;/td>
&lt;td style="text-align:right">−0.1144&lt;/td>
&lt;td style="text-align:right">−0.1015&lt;/td>
&lt;td style="text-align:right">17&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">−0.1229&lt;/td>
&lt;td style="text-align:right">−0.2061&lt;/td>
&lt;td style="text-align:right">13&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>In all three rows Stata&amp;rsquo;s DL-CV with &lt;code>lcount(10)&lt;/code> selects zero controls, so the post-OLS is a no-controls regression whose estimates sit close to the first-difference baseline (cf. §4). For the same outcomes the R companion selects 150, 109, and 161 controls and produces estimates of +0.019, −0.178, and −1.113: the canonical &amp;ldquo;CV over-selects&amp;rdquo; pattern. The discrepancy is &lt;strong>not&lt;/strong> a difference in the underlying method, only in how finely each implementation&amp;rsquo;s cross-validation searches the lambda grid.&lt;/p>
&lt;p>This is not a knock on CV in general. CV&amp;rsquo;s $\lambda_{\min}$ is exactly the right choice when the goal is &lt;strong>prediction&lt;/strong>, for example out-of-sample MSE on $y$. But for causal inference on the treatment effect $\alpha$, the rigorous penalty is the better choice because it is tuned to the right asymptotic objective: keeping selection error small &lt;em>relative to estimation error&lt;/em>, not minimising prediction loss. The deterministic, theory-driven &lt;code>rlasso&lt;/code> also needs neither a seed nor a lambda grid, whereas the CV answer depends on grid resolution, which is itself an argument for the rigorous penalty in production work.&lt;/p>
&lt;hr>
&lt;h2 id="12-the-forest-plot">12. The forest plot&lt;/h2>
&lt;p>Stacking all five estimators against all three outcomes gives the headline figure (reproduced from §1 here for convenience):&lt;/p>
&lt;p>&lt;img src="stata_double_lasso_estimates.png" alt="Forest plot of all five estimators across the three outcomes — the headline figure of this post.">&lt;/p>
&lt;p>A coherent story for violent and property crime: the LASSO methods (PSL, DL-rigorous) land between the two extremes — First-difference OLS and the kitchen-sink OLS. PSL and DL-rigorous concentrate the data&amp;rsquo;s signal near the small set of controls that actually matter, giving sensible estimates with tighter standard errors than OLS-full.&lt;/p>
&lt;p>For murder, the story is messier — kitchen-sink OLS gives a nonsensical positive estimate, and CV-LASSO swings widely. But First-diff, PSL, and DL-rigorous cluster sensibly. The murder outcome is the noisiest of the three (state-level murder counts are small numbers in many state-years), so it punishes any procedure that picks too many controls.&lt;/p>
&lt;p>&lt;strong>Code chunk 7 — Building the forest plot in Stata (compressed):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load the 15-row long table written by analysis.do
* (3 outcomes x 5 methods, with estimate / std_error / ci_lo / ci_hi).
import delimited &amp;quot;results_table2.csv&amp;quot;, clear varnames(1) case(preserve)
* Build ONE twoway per outcome (rspike + scatter for each method),
* so each panel gets its OWN x-axis range. This is Stata's analogue
* of ggplot's facet_wrap(scales = &amp;quot;free_x&amp;quot;) — and what keeps the
* huge OLS-Murder CI from squashing the other panels.
forvalues o = 1/3 {
twoway ///
(rspike ci_lo ci_hi y if oid==`o' &amp;amp; method_id==1, horizontal) ///
(scatter y estimate if oid==`o' &amp;amp; method_id==1, msymbol(O)) ///
/* ...repeat for method_id 2..5, each with its own colour... */ ///
, xline(0, lpattern(dash)) legend(off) ///
ylabel(1 &amp;quot;DL (CV)&amp;quot; 2 &amp;quot;DL (rigorous)&amp;quot; 3 &amp;quot;PSL&amp;quot; 4 &amp;quot;OLS (full)&amp;quot; 5 &amp;quot;First diff&amp;quot;) ///
name(fig_o`o', replace)
}
* Stitch the 3 per-outcome panels into a 1-row strip.
graph combine fig_o1 fig_o2 fig_o3, cols(3)
graph export &amp;quot;stata_double_lasso_estimates.png&amp;quot;, replace width(3300) height(1350)
&lt;/code>&lt;/pre>
&lt;p>We deliberately avoid Stata&amp;rsquo;s &lt;code>by(outcome_id, cols(3))&lt;/code> here: &lt;code>by()&lt;/code> forces a single shared x-axis across panels, and OLS-Murder&amp;rsquo;s CI of roughly [−3.1, +7.8] would stretch that shared axis until every other CI collapses to an invisible nub. Building three independent &lt;code>twoway&lt;/code> graphs and combining them with &lt;code>graph combine&lt;/code> is the base-Stata equivalent of &lt;code>facet_wrap(scales = &amp;quot;free_x&amp;quot;)&lt;/code> in ggplot2. The full per-method colour wiring (site palette: steel blue, warm orange, teal, light orange, light grey) and the dark-theme &lt;code>graphregion(...)&lt;/code> options are in &lt;code>figures.do&lt;/code>, which &lt;code>analysis.do&lt;/code> calls at the end of §10.&lt;/p>
&lt;hr>
&lt;h2 id="13-when-to-use-which-method">13. When to use which method?&lt;/h2>
&lt;p>The decision tree below offers practical guidance for a researcher facing a fresh dataset. It is not a substitute for thinking carefully about identification (no method can rescue an invalid research design), but it is a reasonable starting point.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart TD
Start[&amp;quot;You have n observations,&amp;lt;br/&amp;gt;p candidate controls,&amp;lt;br/&amp;gt;and want a causal alpha-hat&amp;quot;] --&amp;gt; Q1{&amp;quot;p &amp;amp;ge; n?&amp;quot;}
Q1 --&amp;gt;|Yes| L[&amp;quot;LASSO methods required&amp;lt;br/&amp;gt;(OLS infeasible)&amp;quot;]
Q1 --&amp;gt;|No| Q2{&amp;quot;p / n &amp;amp;gt; 0.3?&amp;quot;}
Q2 --&amp;gt;|Yes, like this post&amp;lt;br/&amp;gt;p=284, n=576| L
Q2 --&amp;gt;|No| Q3{&amp;quot;n &amp;amp;ge; 5,000?&amp;quot;}
Q3 --&amp;gt;|Yes| O[&amp;quot;Plain OLS with all&amp;lt;br/&amp;gt;controls is fine&amp;quot;]
Q3 --&amp;gt;|No| L
L --&amp;gt; Q4{&amp;quot;Need valid causal&amp;lt;br/&amp;gt;inference, not just&amp;lt;br/&amp;gt;prediction?&amp;quot;}
Q4 --&amp;gt;|Yes| DL[&amp;quot;Double LASSO&amp;lt;br/&amp;gt;with rigorous penalty&amp;lt;br/&amp;gt;(rlasso or pdslasso)&amp;quot;]
Q4 --&amp;gt;|No| Pred[&amp;quot;DL-CV or PSL via cvlasso&amp;lt;br/&amp;gt;are both fine for prediction&amp;quot;]
style Start fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style DL fill:#1f2b5e,stroke:#00d4c8,color:#e8ecf2
style Pred fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style O fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
style L fill:#0f1729,stroke:#6a9bcc,color:#e8ecf2
style Q1 fill:#1f2b5e,stroke:#6a9bcc,color:#e8ecf2
style Q2 fill:#1f2b5e,stroke:#6a9bcc,color:#e8ecf2
style Q3 fill:#1f2b5e,stroke:#6a9bcc,color:#e8ecf2
style Q4 fill:#1f2b5e,stroke:#d97757,color:#e8ecf2
&lt;/code>&lt;/pre>
&lt;p>The thresholds are rough. Fitzgerald et al. (2026) section 3.2 shows DL&amp;rsquo;s advantage shrinks rapidly as $n$ grows at fixed $p$; by $n = 3{,}000$ in their Monte Carlo, OLS is essentially indistinguishable from DL. The $p / n &amp;gt; 0.3$ cutoff is informal — it corresponds to the regime where $(X&amp;rsquo;X)^{-1}$ starts having visible numerical instability — but it is a reasonable diagnostic.&lt;/p>
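&lt;p>The same branching logic can be written as a small function. This is a sketch in Python for readers who prefer code to diagrams; the function name and the returned strings are hypothetical, and the thresholds are the rough ones from the flowchart, not sharp cutoffs.&lt;/p>
&lt;pre>&lt;code class="language-python">def choose_method(n, p, need_inference=True):
    """Mirror the flowchart above and return a short recommendation string."""
    large_sample = n >= 5000
    # LASSO is needed when p is at least n, when p / n exceeds 0.3,
    # or when the sample is not large enough for kitchen-sink OLS.
    lasso_needed = p >= n or p / n > 0.3 or not large_sample
    if not lasso_needed:
        return 'plain OLS with all controls'
    if need_inference:
        return 'Double LASSO with rigorous penalty'
    return 'CV-tuned LASSO (prediction only)'

print(choose_method(n=576, p=284))                        # the regime in this post
print(choose_method(n=5000, p=8))                         # few controls, large sample
print(choose_method(n=576, p=284, need_inference=False))  # prediction goal
&lt;/code>&lt;/pre>
&lt;p>Keeping the three conditions in one boolean makes it easy to change a threshold and see which branch a given dataset falls into.&lt;/p>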
&lt;p>One more piece of intuition justifies the post-OLS refit step in DL (and PSL). LASSO&amp;rsquo;s coefficients on the variables it selects are shrunken toward zero by construction. If you used those shrunken coefficients to compute the residuals for $\alpha$, you would inherit a bias of the order&lt;/p>
&lt;p>$$
\hat\alpha_{\text{LASSO}} - \alpha = O_p\!\left(\frac{\lambda}{n}\right).
$$&lt;/p>
&lt;p>For our $\lambda^{\text{rig}}$ and $n = 576$, that bias is roughly 5–15% of the treatment effect. &lt;strong>Refitting with plain OLS on the selected support removes the shrinkage&lt;/strong>, so the estimate is no longer pulled toward zero by the penalty. This is why every method in this post uses LASSO for &lt;em>selection only&lt;/em> and post-OLS for &lt;em>estimation&lt;/em>. It is the load-bearing step in the whole machinery.&lt;/p>
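&lt;p>A one-regressor example makes the shrinkage concrete. The sketch below (Python, not part of &lt;code>analysis.do&lt;/code>) assumes a single regressor scaled so that $\sum_i x_i^2 / n = 1$ and the objective $\frac{1}{n}\sum_i (y_i - x_i b)^2 + \frac{\lambda}{n}\lvert b \rvert$. Under those assumptions the LASSO solution is the OLS coefficient soft-thresholded by $\lambda / (2n)$. The penalty level and the OLS coefficient are made-up numbers for illustration.&lt;/p>
&lt;pre>&lt;code class="language-python">def soft_threshold(b, t):
    """Shrink b toward zero by t; return exactly zero when abs(b) is at most t."""
    magnitude = max(abs(b) - t, 0.0)
    return magnitude if b >= 0 else -magnitude

n = 576          # sample size in this post
lam = 40.0       # hypothetical penalty level, for illustration only
b_ols = -0.15    # hypothetical OLS coefficient on the one selected regressor

b_lasso = soft_threshold(b_ols, lam / (2 * n))
shrinkage = b_lasso - b_ols   # size lam / (2n), pointing toward zero
b_refit = b_ols               # with one selected regressor, post-OLS is just OLS

print(f'LASSO coefficient: {b_lasso:.4f}')
print(f'shrinkage:         {shrinkage:+.4f}')
print(f'post-OLS refit:    {b_refit:.4f}')
&lt;/code>&lt;/pre>
&lt;p>The shrinkage term scales with $\lambda / n$, which is the order of the bias quoted above; the refit discards it while keeping the selection.&lt;/p>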
&lt;hr>
&lt;h2 id="14-caveats-and-identification">14. Caveats and identification&lt;/h2>
&lt;p>Six things to keep in mind when reading the headline estimates.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>This is a replication exercise, not a primary causal claim.&lt;/strong> Fitzgerald et al. (2026) is itself a replication paper studying Double LASSO as a &lt;em>method&lt;/em>. Whether more abortion access caused less crime is a substantive question that goes well beyond any single regression specification. We inherit the paper&amp;rsquo;s framing: this post is about DL behaviour on a particular dataset, not about endorsing the Donohue–Levitt 2001 substantive claim.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Identification rests on two assumptions.&lt;/strong> First, &lt;em>conditional independence given $X$&lt;/em>: the 284 partialled controls must capture every variable that influenced both the abortion rate and the crime rate in the 1980s. Second, &lt;em>parallel trends in levels&lt;/em>: state fixed effects are absorbed by first-differencing, year fixed effects by the partialling step. Neither assumption is innocuous. Fitzgerald et al. section 3.5 discusses two failure modes (bias amplification from controls that act as imperfect instruments, and collider bias from controls that are caused by both treatment and outcome) that this empirical application cannot rule out.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>State-clustering relies on $G \geq 30$.&lt;/strong> Cluster-robust inference is justified asymptotically in $G$, the number of clusters. With $G = 48$ states we are above the rule of thumb. If you had only 5 or 10 clusters, the cluster-robust SE would be unreliable and you would need to switch to wild bootstrap or block bootstrap inference (Stata&amp;rsquo;s &lt;code>boottest&lt;/code> package).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CV LASSO is non-deterministic.&lt;/strong> &lt;code>cvlasso&lt;/code> randomly partitions the data into $K$ folds; without setting a seed, the variable-selection counts in §11 would vary by ±5 controls between runs and the headline coefficient by ±0.01. The script sets &lt;code>seed(20260520)&lt;/code> on every &lt;code>cvlasso&lt;/code> call so the post&amp;rsquo;s numbers reproduce exactly. The rigorous LASSO (&lt;code>rlasso&lt;/code>) is deterministic given the data and the penalty arguments.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>cvlasso&lt;/code> and &lt;code>cv.glmnet&lt;/code> differ in their default fold assignment.&lt;/strong> Even with the same seed value, the &lt;em>integer-to-fold mapping&lt;/em> uses different RNG draws in Stata and R. The DL-CV numbers therefore will not match the R companion bit for bit; the Stata-vs-R replication check in §15 documents the actual drift.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The estimand is not population-weighted.&lt;/strong> Every state-year observation gets equal weight. State-clustered SEs do not re-weight observations; they only adjust the variance for within-state autocorrelation. A population-weighted version (weighting state-years by state adult population) would give a different — and arguably more policy-relevant — estimand. The paper does not weight, so neither do we.&lt;/p>
&lt;/li>
&lt;/ol>
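&lt;p>Caveats 4 and 5 above can be illustrated with a toy fold assignment. The sketch below uses Python&amp;rsquo;s standard-library &lt;code>random&lt;/code> module; the function name &lt;code>assign_folds&lt;/code> is hypothetical, and neither &lt;code>cvlasso&lt;/code> nor &lt;code>cv.glmnet&lt;/code> is claimed to assign folds this way. The point is only that folds are a function of both the seed and the generator.&lt;/p>
&lt;pre>&lt;code class="language-python">import random

def assign_folds(n_obs, k, seed):
    """Randomly split n_obs observations into k folds of near-equal size."""
    rng = random.Random(seed)
    fold_ids = [i % k for i in range(n_obs)]   # balanced labels 0..k-1
    rng.shuffle(fold_ids)                      # random order, fixed by the seed
    return fold_ids

a = assign_folds(576, 3, seed=20260520)
b = assign_folds(576, 3, seed=20260520)
c = assign_folds(576, 3, seed=1)

print('same seed gives same folds:     ', a == b)
print('different seed gives same folds:', a == c)
&lt;/code>&lt;/pre>
&lt;p>A seed reproduces the folds only within one random-number generator. A different generator fed the same integer will generally produce a different partition, which is the situation across Stata and R.&lt;/p>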
&lt;hr>
&lt;h2 id="15-stata-vs-r-numeric-replication">15. Stata vs R: numeric replication&lt;/h2>
&lt;p>The deterministic estimators should match the R companion to numerical precision; the LASSO-with-CV estimators are allowed to drift because of language-specific differences in fold randomisation. We classify the five rows of Table 2 into three replication tiers:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tier A (must match R within 1e-4):&lt;/strong> First-difference OLS, Kitchen-sink OLS. Both use the same closed-form OLS formula and the same HC1-style cluster correction, so the only source of cross-language difference is floating-point rounding.&lt;/li>
&lt;li>&lt;strong>Tier B (must match R within 1e-3, document any drift):&lt;/strong> DL-rigorous. The Belloni-et-al. theory penalty $\lambda^{\text{rig}}$ is deterministic and Stata&amp;rsquo;s &lt;code>rlasso&lt;/code> with &lt;code>c(1.1) gamma(0.05)&lt;/code> uses the same formula as R&amp;rsquo;s &lt;code>hdm::rlasso&lt;/code>. Tiny implementation differences (centering vs. partialling out the constant, default vs. explicit &lt;code>nocons&lt;/code>) can cause selection-count differences of $\pm 1$ control.&lt;/li>
&lt;li>&lt;strong>Tier C (allowed to drift, qualitative match only):&lt;/strong> PSL and DL-CV. Both use 3-fold cross-validation, and &lt;code>cvlasso&lt;/code>&amp;rsquo;s fold assignment is &lt;em>seed-equivalent&lt;/em> to but not &lt;em>bit-equivalent&lt;/em> with &lt;code>cv.glmnet&lt;/code>&amp;rsquo;s.&lt;/li>
&lt;/ul>
&lt;p>The actual numbers, alongside the R companion&amp;rsquo;s:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Estimator&lt;/th>
&lt;th>Outcome&lt;/th>
&lt;th style="text-align:right">Stata $\hat\alpha$&lt;/th>
&lt;th style="text-align:right">R $\hat\alpha$&lt;/th>
&lt;th style="text-align:right">Δ&lt;/th>
&lt;th style="text-align:right">Stata SE&lt;/th>
&lt;th style="text-align:right">R SE&lt;/th>
&lt;th style="text-align:center">Tier&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>First diff&lt;/td>
&lt;td>Violent&lt;/td>
&lt;td style="text-align:right">−0.1521&lt;/td>
&lt;td style="text-align:right">−0.1521&lt;/td>
&lt;td style="text-align:right">0.0000&lt;/td>
&lt;td style="text-align:right">0.0337&lt;/td>
&lt;td style="text-align:right">0.0337&lt;/td>
&lt;td style="text-align:center">A ✓&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>First diff&lt;/td>
&lt;td>Property&lt;/td>
&lt;td style="text-align:right">−0.1084&lt;/td>
&lt;td style="text-align:right">−0.1084&lt;/td>
&lt;td style="text-align:right">0.0000&lt;/td>
&lt;td style="text-align:right">0.0219&lt;/td>
&lt;td style="text-align:right">0.0219&lt;/td>
&lt;td style="text-align:center">A ✓&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>First diff&lt;/td>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">−0.2039&lt;/td>
&lt;td style="text-align:right">−0.2039&lt;/td>
&lt;td style="text-align:right">0.0000&lt;/td>
&lt;td style="text-align:right">0.0667&lt;/td>
&lt;td style="text-align:right">0.0667&lt;/td>
&lt;td style="text-align:center">A ✓&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>OLS (full)&lt;/td>
&lt;td>Violent&lt;/td>
&lt;td style="text-align:right">+0.0134&lt;/td>
&lt;td style="text-align:right">+0.0135&lt;/td>
&lt;td style="text-align:right">−0.0001&lt;/td>
&lt;td style="text-align:right">0.7149&lt;/td>
&lt;td style="text-align:right">0.0911&lt;/td>
&lt;td style="text-align:center">A (α) ✓, SE †&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>OLS (full)&lt;/td>
&lt;td>Property&lt;/td>
&lt;td style="text-align:right">−0.1950&lt;/td>
&lt;td style="text-align:right">−0.1950&lt;/td>
&lt;td style="text-align:right">0.0000&lt;/td>
&lt;td style="text-align:right">0.2236&lt;/td>
&lt;td style="text-align:right">0.0472&lt;/td>
&lt;td style="text-align:center">A (α) ✓, SE †&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>OLS (full)&lt;/td>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">+2.3411&lt;/td>
&lt;td style="text-align:right">+2.3426&lt;/td>
&lt;td style="text-align:right">−0.0015&lt;/td>
&lt;td style="text-align:right">2.7831&lt;/td>
&lt;td style="text-align:right">0.3114&lt;/td>
&lt;td style="text-align:center">A (α) ✓, SE †&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DL rigorous&lt;/td>
&lt;td>Violent&lt;/td>
&lt;td style="text-align:right">−0.1744&lt;/td>
&lt;td style="text-align:right">−0.0964&lt;/td>
&lt;td style="text-align:right">−0.0780&lt;/td>
&lt;td style="text-align:right">0.1155&lt;/td>
&lt;td style="text-align:right">0.0514&lt;/td>
&lt;td style="text-align:center">B ‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DL rigorous&lt;/td>
&lt;td>Property&lt;/td>
&lt;td style="text-align:right">−0.1144&lt;/td>
&lt;td style="text-align:right">−0.0314&lt;/td>
&lt;td style="text-align:right">−0.0830&lt;/td>
&lt;td style="text-align:right">0.0470&lt;/td>
&lt;td style="text-align:right">0.0227&lt;/td>
&lt;td style="text-align:center">B ‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DL rigorous&lt;/td>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">−0.1229&lt;/td>
&lt;td style="text-align:right">−0.1662&lt;/td>
&lt;td style="text-align:right">+0.0433&lt;/td>
&lt;td style="text-align:right">0.1404&lt;/td>
&lt;td style="text-align:right">0.0790&lt;/td>
&lt;td style="text-align:center">B ‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PSL&lt;/td>
&lt;td>Violent&lt;/td>
&lt;td style="text-align:right">−0.1553&lt;/td>
&lt;td style="text-align:right">−0.1567&lt;/td>
&lt;td style="text-align:right">+0.0014&lt;/td>
&lt;td style="text-align:right">0.0330&lt;/td>
&lt;td style="text-align:right">0.0342&lt;/td>
&lt;td style="text-align:center">C ‡‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PSL&lt;/td>
&lt;td>Property&lt;/td>
&lt;td style="text-align:right">−0.0665&lt;/td>
&lt;td style="text-align:right">−0.0683&lt;/td>
&lt;td style="text-align:right">+0.0018&lt;/td>
&lt;td style="text-align:right">0.0244&lt;/td>
&lt;td style="text-align:right">0.0319&lt;/td>
&lt;td style="text-align:center">C ‡‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PSL&lt;/td>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">−0.2397&lt;/td>
&lt;td style="text-align:right">−0.2061&lt;/td>
&lt;td style="text-align:right">−0.0336&lt;/td>
&lt;td style="text-align:right">0.0635&lt;/td>
&lt;td style="text-align:right">0.0514&lt;/td>
&lt;td style="text-align:center">C ‡‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DL CV&lt;/td>
&lt;td>Violent&lt;/td>
&lt;td style="text-align:right">−0.1553&lt;/td>
&lt;td style="text-align:right">+0.0193&lt;/td>
&lt;td style="text-align:right">−0.1746&lt;/td>
&lt;td style="text-align:right">0.0330&lt;/td>
&lt;td style="text-align:right">0.0978&lt;/td>
&lt;td style="text-align:center">C ‡‡‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DL CV&lt;/td>
&lt;td>Property&lt;/td>
&lt;td style="text-align:right">−0.1015&lt;/td>
&lt;td style="text-align:right">−0.1784&lt;/td>
&lt;td style="text-align:right">+0.0769&lt;/td>
&lt;td style="text-align:right">0.0218&lt;/td>
&lt;td style="text-align:right">0.0653&lt;/td>
&lt;td style="text-align:center">C ‡‡‡&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DL CV&lt;/td>
&lt;td>Murder&lt;/td>
&lt;td style="text-align:right">−0.2061&lt;/td>
&lt;td style="text-align:right">−1.1128&lt;/td>
&lt;td style="text-align:right">+0.9067&lt;/td>
&lt;td style="text-align:right">0.0514&lt;/td>
&lt;td style="text-align:right">0.3897&lt;/td>
&lt;td style="text-align:center">C ‡‡‡&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Notes on the table.&lt;/strong> The First-difference rows match R to four decimal places on both α and SE: the cluster-robust sandwich formula is computed identically in both software packages once collinear columns are handled. &lt;strong>†&lt;/strong> The OLS-full point estimates likewise agree to within 0.0015, but the cluster-robust SEs differ by a factor of roughly 5 to 9 across languages. Stata&amp;rsquo;s &lt;code>regress&lt;/code> drops collinear columns, so its sandwich variance is computed on a lower-dimensional but full-rank submatrix of the design matrix. The R helper uses &lt;code>MASS::ginv()&lt;/code> (Moore-Penrose pseudo-inverse) only as a fallback when &lt;code>solve()&lt;/code> errors, which gives substantially smaller variances. Stata&amp;rsquo;s SEs here are closer to the JAE replication paper&amp;rsquo;s published values (0.875 for violent crime; ours: 0.7149). Neither implementation is &amp;ldquo;wrong&amp;rdquo;: both are mathematically valid responses to a near-singular $X&amp;rsquo;X$. &lt;strong>‡&lt;/strong> DL-rigorous α drifts by 0.04–0.08 between Stata and R despite matching selection counts ($|I_d| = 8$ on violent crime in both). The drift comes from the &lt;em>identity&lt;/em> of the controls each &lt;code>rlasso&lt;/code>/&lt;code>hdm::rlasso&lt;/code> selects: Stata&amp;rsquo;s penalty constants and pre-standardization differ slightly from R&amp;rsquo;s &lt;code>hdm&lt;/code> defaults, so the 8 controls chosen are not identical across the two implementations. The post-OLS on overlapping-but-not-identical control sets produces similar but not identical α estimates. &lt;strong>‡‡&lt;/strong> Stata&amp;rsquo;s PSL uses the rigorous penalty (via &lt;code>rlasso pnotpen()&lt;/code>); R&amp;rsquo;s PSL uses 3-fold CV (via &lt;code>cv.glmnet penalty.factor=0&lt;/code>). Different penalty rules lead to different selections. The Stata point estimates land within 0.04 of R&amp;rsquo;s on the absolute scale and inside R&amp;rsquo;s 95% CI on all three outcomes. &lt;strong>‡‡‡&lt;/strong> DL-CV uses 3-fold cross-validation in both languages but &lt;code>cvlasso&lt;/code> and &lt;code>cv.glmnet&lt;/code> use different RNGs for fold assignment, so even with the same seed value the folds differ. We expect drift on both α and the selected set sizes.&lt;/p>
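&lt;p>The tier tolerances can be checked mechanically against the Δ column. The sketch below is in Python, with the Δ values copied from the Tier A and Tier B rows of the table; the list layout and variable names are ours, not part of &lt;code>analysis.do&lt;/code>. It flags rows whose absolute drift exceeds the stated tolerance.&lt;/p>
&lt;pre>&lt;code class="language-python"># Absolute tolerance per tier; Tier C is a qualitative match, so it has none.
TOLERANCE = {'A': 1e-4, 'B': 1e-3}

# (estimator, outcome, tier, delta) copied from the table above.
rows = [
    ('First diff',  'Violent',  'A',  0.0000),
    ('First diff',  'Property', 'A',  0.0000),
    ('First diff',  'Murder',   'A',  0.0000),
    ('OLS (full)',  'Violent',  'A', -0.0001),
    ('OLS (full)',  'Property', 'A',  0.0000),
    ('OLS (full)',  'Murder',   'A', -0.0015),
    ('DL rigorous', 'Violent',  'B', -0.0780),
    ('DL rigorous', 'Property', 'B', -0.0830),
    ('DL rigorous', 'Murder',   'B',  0.0433),
]

# Small slack so a delta sitting exactly on the tolerance is not flagged.
flagged = [r for r in rows if abs(r[3]) > TOLERANCE[r[2]] + 1e-12]
for estimator, outcome, tier, delta in flagged:
    print(f'{estimator:12s} {outcome:9s} tier {tier}: drift {delta:+.4f}')
&lt;/code>&lt;/pre>
&lt;p>On the rounded values in the table, four rows exceed their tolerance: the OLS-full murder estimate and all three DL-rigorous rows, whose drift the ‡ note explains.&lt;/p>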
&lt;p>&lt;strong>Headline pedagogical takeaway:&lt;/strong> the Tier-A point estimates confirm that plain OLS is language-portable. The Tier-B rows are more sobering: the rigorous-LASSO penalty is deterministic within each package, yet small differences in penalty constants and pre-standardization still move $\hat\alpha$ by 0.04–0.08 across packages. The Tier-C drift is larger again (up to 0.91 for DL-CV on murder), because CV adds fold assignment and lambda-grid resolution as further sources of variation. That ranking is itself an argument for using the rigorous penalty when the answer matters: not just because it controls selection error in theory, but because its numbers are reproducible run to run and vary far less across software stacks.&lt;/p>
&lt;hr>
&lt;h2 id="16-conclusion">16. Conclusion&lt;/h2>
&lt;p>Three takeaways worth carrying away from this post.&lt;/p>
&lt;p>First, &lt;strong>Double LASSO is a method, not a panacea&lt;/strong>. It does not invent variation in the data, nor does it weaken the identifying assumptions of the underlying research design. What it does is make high-dimensional control sets &lt;em>tractable&lt;/em> without committing to using all of them or to picking a subset by hand. On a dataset where conditional independence holds and the candidate-control set is rich enough to span the confounders, DL-rigorous reproduces the Donohue–Levitt 2001 headline closely while disciplining the standard errors. Across languages, Stata matches R to four decimal places on the first-difference rows, and §15 documents where the LASSO rows drift.&lt;/p>
&lt;p>Second, &lt;strong>the rigorous penalty matters more than the language&lt;/strong>. Switching from &lt;code>rlasso&lt;/code> to a CV-tuned penalty changes the answer far more than switching packages. In R, &lt;code>glmnet::cv.glmnet&lt;/code> over-selects and distorts the headline α (a sign flip to +0.019 for violent crime); in Stata, the runtime-limited &lt;code>lcount(10)&lt;/code> grid under-selects to the point of keeping no controls (§11). The Stata-vs-R replication check in §15 shows DL-rigorous differing by at most 0.08 across languages while DL-CV differs by up to 0.91, a reminder that the &lt;em>penalty rule&lt;/em> you choose affects the answer more than which statistical package you run.&lt;/p>
&lt;p>Third, &lt;strong>the regime determines the methodology&lt;/strong>. With our $p = 284$, $n = 576$, we are squarely in the small-sample, high-dimensional zone where DL is designed to help. With $p = 8$ and $n = 5{,}000$, plain OLS would be perfectly fine — DL adds nothing when classical OLS is in its comfort zone. The decision tree in §13 is a starting point for picking the right tool for the dimensions you face.&lt;/p>
&lt;p>If you came in expecting either a definitive statement about abortion and crime or a magic ML cure for omitted-variable bias, you should leave with neither. What you should leave with is a clearer mental model of &lt;em>when&lt;/em> the high-dimensional toolkit earns its complexity, and a working Stata workflow to run it on your own data.&lt;/p>
&lt;hr>
&lt;h2 id="17-exercises">17. Exercises&lt;/h2>
&lt;p>These exercises ask you to modify and re-run the &lt;code>analysis.do&lt;/code> script in this post. All datasets, dependencies, and helper code are already in place — you only need to change the indicated lines, run the script, and read the output.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the CV seed.&lt;/strong> In &lt;code>analysis.do&lt;/code>, change &lt;code>seed(20260520)&lt;/code> to &lt;code>seed(1)&lt;/code> on every &lt;code>cvlasso&lt;/code> call (lines 6.x and 8.x in the script), then &lt;code>seed(2)&lt;/code>, then &lt;code>seed(3)&lt;/code>. Re-run each time and record the DL-CV violent-crime estimate $\hat\alpha$ and union size. How much does the DL-CV point estimate vary across seeds? Does the &lt;em>rigorous&lt;/em> DL estimate change at all? Why does the seed matter for one but not the other?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Vary the rigorous penalty.&lt;/strong> In each &lt;code>rlasso&lt;/code> call, the penalty parameters are &lt;code>c(1.1) gamma(0.05)&lt;/code>. Change to &lt;code>c(0.9)&lt;/code> (looser, expects more variables to be kept) and then &lt;code>c(1.5)&lt;/code> (tighter, expects fewer). Re-run and report the new $|I_y|$, $|I_d|$, and $\hat\alpha$ for violent crime. Does the headline α survive both perturbations? Which side of $c = 1.1$ is more sensitive?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Drop two years of data.&lt;/strong> Subset the differenced panel to 1986–1995 only (10 years × 48 states = 480 observations, down from 576) by adding &lt;code>if year &amp;lt; 1996&lt;/code> to each &lt;code>regress&lt;/code>, &lt;code>rlasso&lt;/code>, and &lt;code>cvlasso&lt;/code> call. Re-run DL-rigorous on the violent-crime equation. How does the estimate change? How does the standard error change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Use &lt;code>pdslasso&lt;/code> instead.&lt;/strong> Replace the three-line explicit DL-rigorous block (§7) with the single &lt;code>pdslasso DyV DxV (zv1-zv284), cluster(state) loptions(c(1.1) gamma(0.05))&lt;/code> call. Verify that the reported coefficient and SE match the explicit version exactly. Read the &lt;code>pdslasso&lt;/code> log to see how it reports the selected variables — what does it call $I_y$, $I_d$, and the union?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare to Stata&amp;rsquo;s built-in &lt;code>lasso&lt;/code> / &lt;code>dsregress&lt;/code>.&lt;/strong> Stata 16+ ships a native lasso implementation. Run &lt;code>dsregress DyV DxV, controls(zv1-zv284) selection(plugin)&lt;/code> and compare its output to the &lt;code>pdslasso&lt;/code> version. The two should agree closely; where they differ, the plugin uses a slightly different default for the BCH penalty constants — pin them down by passing &lt;code>selection(plugin, lambda(...))&lt;/code>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="18-reproducing-this-analysis">18. Reproducing this analysis&lt;/h2>
&lt;p>Everything in this post — figures, tables, point estimates, standard errors — comes from a single self-contained Stata do-file (&lt;code>analysis.do&lt;/code>) that loads its data from six CSVs hosted in the R companion post&amp;rsquo;s &lt;code>data/&lt;/code> folder on GitHub. The script does not need any local data files. The full reproduction recipe is:&lt;/p>
&lt;ol>
&lt;li>Install the StataLasso suite and &lt;code>coefplot&lt;/code> if not already on your machine:
&lt;pre>&lt;code class="language-stata">ssc install lassopack
ssc install pdslasso
ssc install coefplot
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Clone the GitHub repository (or copy &lt;code>analysis.do&lt;/code> standalone).&lt;/li>
&lt;li>Run it in batch mode:
&lt;pre>&lt;code class="language-bash">&amp;quot;/Applications/Stata/StataSE.app/Contents/MacOS/StataSE&amp;quot; -b do analysis.do
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The script writes &lt;code>stata_double_lasso_*.png&lt;/code> (three figures: forest plot, selection bars, rigorous-vs-CV compare), &lt;code>results_table2.csv&lt;/code> (the Table 2 replication), &lt;code>selection_diagnostic.csv&lt;/code> (variable-selection counts), and &lt;code>analysis.log&lt;/code> (the execution transcript). The LASSO coefficient-paths figure that appears in the R companion is omitted here — Stata&amp;rsquo;s &lt;code>twoway&lt;/code> does not overlay 284 lines as cleanly as ggplot, and the visualisation does not add to the pedagogy beyond what §11&amp;rsquo;s selection-count narrative already conveys.&lt;/li>
&lt;/ol>
&lt;p>Stata packages used: &lt;a href="https://statalasso.github.io/" target="_blank" rel="noopener">&lt;code>lassopack&lt;/code>&lt;/a> — supplies &lt;code>rlasso&lt;/code>, &lt;code>cvlasso&lt;/code>, &lt;code>lasso2&lt;/code>. &lt;a href="https://statalasso.github.io/" target="_blank" rel="noopener">&lt;code>pdslasso&lt;/code>&lt;/a> — supplies &lt;code>pdslasso&lt;/code>, &lt;code>ivlasso&lt;/code>. &lt;a href="https://repec.sowi.unibe.ch/stata/coefplot/" target="_blank" rel="noopener">&lt;code>coefplot&lt;/code>&lt;/a> for some of the figures. Stata 16+ is required (we tested on 18.5 SE).&lt;/p>
&lt;p>The runtime on Apple Silicon is roughly &lt;strong>3–5 minutes&lt;/strong> for the full pipeline, dominated by the CV calls in &lt;code>cvlasso&lt;/code>. The rigorous-LASSO step (&lt;code>rlasso&lt;/code> × 6) takes about 20 seconds. The post-OLS clustered-SE calculations are negligible.&lt;/p>
&lt;p>A note on the seed. Every &lt;code>cvlasso&lt;/code> call passes &lt;code>seed(20260520)&lt;/code> so the random fold assignment is reproducible across runs. Changing the seed will shift the DL-CV numbers by roughly ±0.01 on point estimates and ±5 in variable-selection counts. The DL-rigorous numbers do not depend on the seed.&lt;/p>
&lt;hr>
&lt;h2 id="19-references">19. References&lt;/h2>
&lt;p>&lt;strong>Academic references&lt;/strong> (each linked to the publisher DOI):&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Ahrens, A., Hansen, C. &amp;amp; Schaffer, M.&lt;/strong> (2018, 2020). &lt;a href="https://doi.org/10.1177/1536867X20909697" target="_blank" rel="noopener">&amp;ldquo;lassopack: Model selection and prediction with regularized regression in Stata.&amp;quot;&lt;/a> &lt;em>Stata Journal&lt;/em> 20(1): 176–235; and &lt;a href="https://statalasso.github.io/docs/pdslasso/" target="_blank" rel="noopener">&amp;ldquo;pdslasso and ivlasso: Stata programs for post-selection and post-regularization OLS or IV estimation and inference.&amp;quot;&lt;/a> The reference papers for the StataLasso suite used in this post.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Belloni, A., Chernozhukov, V. &amp;amp; Wang, L.&lt;/strong> (2011). &lt;a href="https://doi.org/10.1093/biomet/asr043" target="_blank" rel="noopener">&amp;ldquo;Square-root LASSO: Pivotal recovery of sparse signals via conic programming.&amp;quot;&lt;/a> &lt;em>Biometrika&lt;/em> 98(4): 791–806. The pivotal LASSO whose scale-invariance underpins the rigorous penalty&amp;rsquo;s pilot-σ-free form used by &lt;code>rlasso&lt;/code> and &lt;code>pdslasso&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Belloni, A., Chen, D., Chernozhukov, V. &amp;amp; Hansen, C.&lt;/strong> (2012). &lt;a href="https://doi.org/10.3982/ECTA9626" target="_blank" rel="noopener">&amp;ldquo;Sparse models and methods for optimal instruments with an application to eminent domain.&amp;quot;&lt;/a> &lt;em>Econometrica&lt;/em> 80(6): 2369–2429. The original derivation of the rigorous LASSO penalty.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Belloni, A., Chernozhukov, V. &amp;amp; Hansen, C.&lt;/strong> (2013). &lt;a href="https://doi.org/10.1017/CBO9781139060035.008" target="_blank" rel="noopener">&amp;ldquo;Inference for high-dimensional sparse econometric models.&amp;rdquo;&lt;/a> In &lt;em>Advances in Economics and Econometrics: Tenth World Congress&lt;/em>, Vol. III: Econometrics. Foundational reference for the three-method orthogonalisation framework that &lt;code>pdslasso&lt;/code> reports: the lasso-orthogonalized, post-lasso-orthogonalized and PDS panels in §8.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Belloni, A., Chernozhukov, V. &amp;amp; Hansen, C.&lt;/strong> (2014). &lt;a href="https://doi.org/10.1093/restud/rdt044" target="_blank" rel="noopener">&amp;ldquo;Inference on treatment effects after selection among high-dimensional controls.&amp;rdquo;&lt;/a> &lt;em>Review of Economic Studies&lt;/em> 81(2): 608–650. The Double LASSO paper, including the empirical-application data we use in this post.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Belloni, A., Chernozhukov, V., Chetverikov, D. &amp;amp; Kato, K.&lt;/strong> (2015). &lt;a href="https://doi.org/10.1016/j.jeconom.2015.06.013" target="_blank" rel="noopener">&amp;ldquo;Some new asymptotic theory for least squares series: Pointwise and uniform results.&amp;rdquo;&lt;/a> &lt;em>Journal of Econometrics&lt;/em> 186(2): 345–366. Theoretical underpinning for the post-selection inference &lt;code>pdslasso&lt;/code> implements (uniformly valid CIs after model selection).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Belloni, A., Chernozhukov, V., Hansen, C. &amp;amp; Kozbur, D.&lt;/strong> (2016). &lt;a href="https://doi.org/10.1080/07350015.2015.1102733" target="_blank" rel="noopener">&amp;ldquo;Inference in high-dimensional panel models with an application to gun control.&amp;rdquo;&lt;/a> &lt;em>Journal of Business &amp;amp; Economic Statistics&lt;/em> 34(4): 590–605. &lt;strong>Directly relevant to our state-panel setting&lt;/strong>: it extends the PDS framework to cluster-correlated data with the cluster-lasso penalty loadings &lt;code>pdslasso&lt;/code> invokes under &lt;code>cluster()&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Cameron, A. C. &amp;amp; Miller, D. L.&lt;/strong> (2015). &lt;a href="https://doi.org/10.3368/jhr.50.2.317" target="_blank" rel="noopener">&amp;ldquo;A practitioner&amp;rsquo;s guide to cluster-robust inference.&amp;rdquo;&lt;/a> &lt;em>Journal of Human Resources&lt;/em> 50(2): 317–372. The reference for the HC1 finite-sample adjustment in §9.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chernozhukov, V., Hansen, C. &amp;amp; Spindler, M.&lt;/strong> (2015). &lt;a href="https://doi.org/10.1146/annurev-economics-012315-015826" target="_blank" rel="noopener">&amp;ldquo;Valid post-selection and post-regularization inference: An elementary, general approach.&amp;rdquo;&lt;/a> &lt;em>Annual Review of Economics&lt;/em> 7: 649–688. Accessible review of why the three-method orthogonalisation in §8 is the right framework for causal inference after LASSO selection; the most pedagogical of the pdslasso references.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Donohue III, J. J. &amp;amp; Levitt, S. D.&lt;/strong> (2001). &lt;a href="https://doi.org/10.1162/00335530151144050" target="_blank" rel="noopener">&amp;ldquo;The impact of legalized abortion on crime.&amp;rdquo;&lt;/a> &lt;em>Quarterly Journal of Economics&lt;/em> 116(2): 379–420. The original empirical paper.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fitzgerald Sice, J., Lattimore, F., Robinson, T. &amp;amp; Zhu, A.&lt;/strong> (2026). &lt;a href="https://doi.org/10.15456/jae.2025335.0258270663" target="_blank" rel="noopener">&amp;ldquo;Double LASSO: Replication and Practical Insights.&amp;rdquo;&lt;/a> &lt;em>Journal of Applied Econometrics&lt;/em>, forthcoming. The source paper for this replication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Friedman, J., Hastie, T. &amp;amp; Tibshirani, R.&lt;/strong> (2010). &lt;a href="https://doi.org/10.18637/jss.v033.i01" target="_blank" rel="noopener">&amp;ldquo;Regularization paths for generalized linear models via coordinate descent.&amp;rdquo;&lt;/a> &lt;em>Journal of Statistical Software&lt;/em> 33(1). The reference for the &lt;code>glmnet&lt;/code> package, whose lambda parameterisation &lt;code>cvlasso&lt;/code>&amp;rsquo;s &lt;code>lglmnet&lt;/code> option emulates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chernozhukov, V., Hansen, C. &amp;amp; Spindler, M.&lt;/strong> (2016). &lt;a href="https://arxiv.org/abs/1603.01700" target="_blank" rel="noopener">&amp;ldquo;High-dimensional metrics in R.&amp;rdquo;&lt;/a> &lt;em>arXiv:1603.01700&lt;/em>. Companion to the &lt;code>hdm&lt;/code> R package used in the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion post&lt;/a>; a useful cross-reference for readers comparing the Stata and R implementations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tibshirani, R.&lt;/strong> (1996). &lt;a href="https://doi.org/10.1111/j.2517-6161.1996.tb02080.x" target="_blank" rel="noopener">&amp;ldquo;Regression shrinkage and selection via the LASSO.&amp;rdquo;&lt;/a> &lt;em>Journal of the Royal Statistical Society Series B&lt;/em> 58(1): 267–288. The original LASSO paper.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Stata packages used:&lt;/strong>&lt;/p>
&lt;ol start="15">
&lt;li>&lt;a href="https://statalasso.github.io/" target="_blank" rel="noopener">&lt;strong>&lt;code>lassopack&lt;/code>&lt;/strong>&lt;/a> — SSC package supplying &lt;code>rlasso&lt;/code> (rigorous-penalty LASSO), &lt;code>cvlasso&lt;/code> (cross-validated LASSO), and &lt;code>lasso2&lt;/code> (path-only LASSO).&lt;/li>
&lt;li>&lt;a href="https://statalasso.github.io/" target="_blank" rel="noopener">&lt;strong>&lt;code>pdslasso&lt;/code>&lt;/strong>&lt;/a> — SSC package supplying &lt;code>pdslasso&lt;/code> (post-double-selection LASSO) and &lt;code>ivlasso&lt;/code> (IV-LASSO). See the &lt;a href="https://statalasso.github.io/docs/pdslasso/ivlasso_help/" target="_blank" rel="noopener">online ivlasso help file&lt;/a> for the full syntax and option list.&lt;/li>
&lt;li>&lt;a href="https://repec.sowi.unibe.ch/stata/coefplot/" target="_blank" rel="noopener">&lt;strong>&lt;code>coefplot&lt;/code>&lt;/strong>&lt;/a> — SSC package for the forest plot in §12.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Data and replication archives:&lt;/strong>&lt;/p>
&lt;ol start="18">
&lt;li>
&lt;p>The CSV files for this post live in &lt;a href="https://github.com/cmg777/starter-academic-v501/tree/master/content/post/r_double_lasso/data" target="_blank" rel="noopener">&lt;code>content/post/r_double_lasso/data/&lt;/code>&lt;/a> in the site&amp;rsquo;s GitHub repository, shared with the &lt;a href="https://carlos-mendez.org/post/r_double_lasso/">R companion post&lt;/a>. They were extracted from the MATLAB files in the JAE replication archive of Fitzgerald Sice et al. (2026) by the &lt;code>prepare_data.R&lt;/code> script in the R companion post.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Donohue–Levitt (2001) original replication data is available via the QJE article&amp;rsquo;s &lt;a href="https://doi.org/10.1162/00335530151144050" target="_blank" rel="noopener">supplementary materials&lt;/a> and Steven Levitt&amp;rsquo;s &lt;a href="https://pricetheory.uchicago.edu/levitt/" target="_blank" rel="noopener">University of Chicago page&lt;/a>. Belloni, Chernozhukov and Hansen (2014) extended this dataset to the 284-control specification used here.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
</description></item></channel></rss>