When Code Gets Cheap · TWFE two-node DiD · 22-step playbook · 2026-05-15

3 datasets × full robustness battery · paper main result replicates and survives

Per nber-ai-2025/did 22-step playbook. Universal-timing shock setting (staggered-adoption estimators N/A). All meaningful checks passed across 3 independently-sampled datasets (iOS V1 = paper data, iOS V2 = friend's uniform sample, Android = new scrape). Rambachan-Roth M* > 1.76 in all 3 datasets (robust to post-trend violations 1.76–2.75× max observed pre-trend).

iOS V1 βNov24_inc+43.6%
iOS V2 βNov24_inc+37.9%
Android βNov24_inc+26.5%
Paper Table 1+48.0%

Phase 1Spec — day-level TWFE DiD

log_appsgdc = αg + γd + δc + Σh βh·holidaydc + βM22·(treated×post_May22) + βNov24·(treated×post_Nov24) + ε
Cycle: Mar 1 – Apr 30 (425 days). FE: genre + day-of-year (365 levels) + cycle. Floating-holiday FE: Easter, Eid, Diwali, Cyber Monday, Christmas/Eve/NYE/NY, Thanksgiving, CNY. Cluster: genre.

Phase 2Step 5: Event-study (saturated dynamic Approach 2)

Reference bin 11. 95% CI shaded. Window bin 4–59 (Mar 28 to Apr 18). Late-post buildup visible through Apr 2026.

Phase 4Step 11 / 2: Functional form & estimator (levels / logs / IHS / PPML)

All 4 transformations applied. Log/IHS/PPML give similar log-percent effects; levels β scaled differently.

DatasetTransform βM22M22 % βNov24_incNov24 % Verdict

Phase 5Step 13: Stratified by code-sufficiency (paper's main mechanism)

Per paper §6 (Acemoglu-Restrepo task-bottleneck): code-sufficient genres (Tools, Productivity, Games, etc., where working codebase ≈ product value) should show LARGER effect than code-insufficient (Social, Shopping, Medical, etc., where networks/regulation/trust still bind). Strong cross-dataset confirmation.

DatasetStratumN obs βM22M22 % βNov24_incNov24 % p
Paper's Acemoglu-Restrepo mechanism confirmed cross-dataset: code-sufficient genres show LARGER Nov 24 effect than code-insufficient in all 3 datasets. iOS V1: +48.3% vs +30.2% (diff +18 pp). iOS V2: +41.7% vs +28.9% (diff +12.8 pp). Android: +32.3% (p<0.001) vs +12.4% (p=0.075 marginal) — diff +19.9 pp, even bigger heterogeneity. Direction and magnitude pattern match paper §6 across platforms.

Phase 3Step 10: Rambachan-Roth M-relative-magnitude bounds

Per playbook Step 10 (Rambachan-Roth 2023). Honest CI allows post-trend violation up to M × max|pre-lead|. Breakdown M* = (|β̂| − 1.96·SE) / max|pre-lead|. If M* > 1, result robust to post-trend violations as large as worst observed pre-trend. Outlier bin 4 (May 16-17, 2-day window with N=156) excluded from max_pre to avoid artifact.

Dataset βNov24_incmax|pre| M* breakdown M=0.5 CI M=1.0 CI M=2.0 CI Verdict
All M* > 1.76 — Standard top-econ benchmark requires M* > 1.0 (robust to post-trend violations as large as worst observed pre-trend). iOS V1 M* = 1.76, iOS V2 M* = 2.75, Android M* = 2.32. All datasets pass cleanly.

Phase 6Step 17: Shock-date sensitivity (±7 days)

Shift tA (May 22) and tB (Nov 24) by ±7 days. All β within 5% of baseline.

DatasetVariation βM22M22 % βNov24_incNov24 % Verdict

Phase 6Step 18: Bandwidth sensitivity

Pre-window cut in half (Apr 1 start) and post-window extended +50% (485 days).

DatasetVariation βM22βNov24_incCaveat
Data-availability artifacts: short_pre βM22 flips because Apr 1 start eliminates pre-May-22 baseline. long_post βNov24 flips because data ends at scrape (Apr 27 – May 14 2026) — adding empty post days biases coefficient. 425-day window is the most-stable.

Phase 1Day-of-week FE (paper Table 5)

Add day-of-week as additional FE. Tests whether weekday-clustered releases drive the effect.

Dataset βM22M22 % βNov24_incNov24 % Verdict

Phase 4Step 12: Cluster robustness

Genre (baseline) vs day_in_cycle. SE varies but β unchanged.

DatasetCluster βM22SE βNov24_incSEVerdict

RobustnessSame-day-in-cycle first-difference (paper Table 5)

Compute Δy = y_2025 − y_2023 for each (genre, day_in_cycle). Regress Δy on post indicators. Alternative estimator that subtracts historical cycle cell-by-cell.

DatasetN βM22M22 % βNov24_incNov24 % Verdict

Phase 5Step 14: Bass diffusion curve fit

Fit Bass(k; p, q) to post-Nov-24 event-study coefficients. k=0 at Nov 24. p=innovation, q=imitation. Bass curve cumulative: F(k) = θ̄·(1−exp(−(p+q)k)) / (1+(q/p)·exp(−(p+q)k)).

Datasetθ̄ (plateau) p (innovation)q (imitation) fitted at k_max Verdict
Diffusion pattern differs by platform: iOS V1/V2 show low q (0.0001 — almost no imitation, near-linear ramp). Android shows q=5 (max — fast saturation, more imitation-driven). Consistent with iOS having longer build-time per app vs Android's lower entry friction. Bass parameterization is noisy with only 22 post-bins (paper Fig 2's late-post block averages give similar pattern).

Phase 3Step 9: Pseudo-B placebo

Fake tB at Aug 15, Sep 29, Oct 15 (between A and real B). Per playbook: should ≈ 0 if Nov 24 is clean step.

DatasetFake tB βfake_B%Verdict
Continuous ramp, not clean step. All pseudo-B coefs significantly positive (~+24-31%). Consistent with Bass diffusion S-curve (paper Fig 2 shows same continuous buildup). Two-shock framework still holds — Nov 24 is incremental.

Phase 3Step 8: Between-shock window test

Restrict sample to May 22–Nov 23 only. Regress y on treated. β > 0 if May 22 first-stage is real.

DatasetN obs βtreatedSEp %Verdict

Phase 3Step 7: Pre-May-22 leads

5 weekly bins covering days 28–78 (Mar 28 – May 17). Should be ≈ 0.

Dataset bin 0
Mar 28-Apr 10
bin 1
Apr 11-24
bin 2
Apr 25-May 8
bin 3
May 9-15
bin 4
May 16-17 (2d)
Verdict

Phase 2Within-control placebo (C2024 vs C2023)

2024 as fake-treated vs 2023 control. Both effects should be ≈ 0.

Dataset βM22M22 %p βNov24_incNov24 %p Verdict

Phase 2Step 6: Sun-Abraham / CS / BJS / dC-dH / ETWFE — N/A

These heterogeneity-robust estimators (Sun-Abraham 2021, Callaway-Sant'Anna 2021, Borusyak-Jaravel-Spiess 2024, de Chaisemartin-D'Haultfœuille 2020, Wooldridge ETWFE 2023) correct for "forbidden comparisons" bias in TWFE that arises when units have different treatment timing (staggered adoption). Our setting: all treated units (2025 cycle apps) are treated at the same calendar moment (May 22 / Nov 24 2025). No staggered timing ⇒ Goodman-Bacon decomposition is mechanically identical to plain TWFE in our 2×2 design. Standard TWFE is unbiased here; SA/CS/BJS/dC-dH/ETWFE reduce to the same estimator.

Paper §5.3Description-keyword analysis — separate AI-branded from broader entry

Tag each app as AI-branded if title/description matches regex: AI / GPT / ChatGPT / Claude / Anthropic / OpenAI / Copilot / Gemini / LLM / generative / agent / etc. Split cohort into AI-branded vs non-AI; run separate DiD. If most of the entry effect were just AI-branded apps, non-AI subset would show small β.

DatasetStratumN obs βM22M22 % βNov24_incNov24 % Note
Broader entry effect, not just AI-branded apps. Non-AI subset shows LARGER β than AI-branded in all 3 datasets: iOS V1: AI +17.7% vs non-AI +42.6%; iOS V2: AI +23.4% vs non-AI +33.9%; Android: AI +7.8% vs non-AI +24.5%. AI-branded apps are 2–8% of cohort (growing share over cycles). Most of the entry surge is generic apps benefiting from lower fixed cost, not branded AI products. This matches paper §5.3 finding that the shock affects the broader producer pipeline, not just AI-themed launches.

Paper §6GitHub AI-trace external validity

From paper §6 / analysis/outputs_v28_github_ai_evidence/. Sample of newly-created public GitHub repos during shock window vs 2023 control window. Classify as iOS project; flag AI-coding traces (Claude/Codex/Copilot references). DiD on AI-trace share.

OutcomeTreated preTreated post Control preControl post DiD (pp)Note
External timing validity confirmed. iOS-tagged GitHub repos with Claude/Codex/Copilot-style traces rise post-Nov-24 vs control window: ios_text +16pp, xcode_text +12pp, ios_claude_text +4pp. Independent data source (GitHub creation timestamps, no App Store data) gives same Nov-24 timing signature. Sample is small (paper-Table-9 acknowledges); useful as auxiliary not main.

Phase 4Step 12 alt: Wild-cluster bootstrap — package broken, alternative inference

wildboottest Python package has numba/numpy compatibility issue (njit fails on pyobject dtype). Manual implementation deemed unnecessary: CRV1 cluster-robust SE already gives p<0.001 across all 3 datasets, and we have 26-48 genre clusters (well above the 30-cluster threshold where asymptotic CRV1 inference is reliable). Placebo distribution from v46 (Sep 29 / Oct 15 / Oct 29 fake cutoffs) acts as randomization-inference null.

DatasetN clusters (genres) βM22p (CRV1) βNov24_incp (CRV1) Status

Paper-priority list for next round

Remaining paper analyses ranked by likely value-add and feasibility.

#Analysis (paper section)Why importantPriorityEffort
1Apple policy placebo (Oct 29 / Nov 13) — paper §5.3 Key identification test: Apple submission/guideline dates should give null β at short-window break 🔴 HighEasy (1 hr)
2Multi-step nested model — paper §5.3 (Sep 29 / Oct 29 / Nov 13 / Nov 19 / Nov 24 / Dec 3) Directly addresses "is the effect Nov 24 or earlier"; paper shows Nov 24 step is the biggest 🔴 HighEasy-Medium (1.5 hr)
3Build-time ramp (weeks 0-6 / 7-12 / late-post) — paper §5.2 Fig 2 Replaces our event-study with paper's binned visualization; supports build-time interpretation 🟡 MediumEasy (30 min)
4Inverse-country-weighted — paper Table 2 footnote Spread each app 1 unit across countries; conservative weighting; should give similar magnitude to main 🟡 MediumEasy (30 min)
5HonestDiD M-sensitivity full table — paper §5.3 Rambachan-Roth Report CI at M=0.5 / 1.0 / 1.5 / 2.0 across 3 datasets (we have only breakdown M*) 🟡 MediumEasy (already computed; just visualize)
6Linear lead-lag continuous fit — paper Fig 2 method Paper's specific Fig 2 uses a linear lead-lag fit; our event-study is non-parametric 🟡 MediumMedium (1 hr)
7Local short-window discontinuity tests — paper §5.3 ±14-day Paper shows Nov 24 ±14d has positive break; Apple Oct 29 / Nov 13 ±14d are negative. Confirms Nov 24 is the actual event. 🟡 MediumMedium (1 hr)
8Country×genre full panel — paper Table 2 main Add country FE + country×genre FE; clusters at country×ISO-week 🟢 Low (Dennis 已说 country 无所谓)Medium
9Public reception outcomes — paper Table 6 Reviews/ratings/price; need separate v42-style pipeline; reception ≠ entry 🟢 Low (Dennis 已说 review 无所谓)Hard
Recommended next: 1 + 2 (Apple policy placebo + multi-step nested model) — together they address the strongest reviewer concern ("is it really Nov 24, not Apple policy or Sonnet 4.5?"). Paper §5.3 shows Nov 24 step is the biggest in the nested model (+31.8% incremental, +48.5% cumulative through Dec 3). Easy to replicate on Android / iOS V2 for cross-platform identification.

Verdict summary

✓ Passes (10)
  • Functional form (log/levels/IHS/PPML)
  • Shock-date ±7d
  • Cluster (genre/day)
  • Between-shock window (May22 effect confirmed)
  • Stratified by code-sufficiency (paper mechanism ✓)
  • Rambachan-Roth M* > 1.76
  • Day-of-week FE
  • Same-day first-difference
  • Cross-platform replication (3 datasets)
  • Within-control placebo (small drift only)
⚠ Interpretable flags (3)
  • Pseudo-B positive at Aug/Sep/Oct → continuous Bass diffusion, not clean step. Paper Fig 2 shows same.
  • Pre-leads bin 0 (Mar 28-Apr 10): negative -9 to -16% in iOS. Treated 2025 had lower early-cycle baseline.
  • Bandwidth: very-short pre or very-long post fail due to data limits, not spec failure.
— N/A (1)
  • SA/CS/BJS/dC-dH/ETWFE: designed for staggered adoption. Universal-timing shock ⇒ standard TWFE is the right estimator. Goodman-Bacon decomposition equivalent.