docs(design): trim shipped plans, consolidate cluster, add index

Much of design/ described work that has since shipped. Trim each doc to its durable rationale + still-open items (the code is the source of truth for shipped detail; git history holds the full originals).

- Shipped plans -> status stubs: stats-capture, gamestream-host-plan, apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline, host-latency, gpu-contention (fixed stale status table), game-library, linux-setup (fixed m0->spike + stale zero-copy claim), session-aware-host-followups, windows-client-bootstrap, windows-dualsense-{scoping,game-detection}, windows-virtual-display, security-review (per-finding status table; #12 still open), apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is merged, M4 done); windows-secure-desktop.md archived (now a fallback behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:39:06 +00:00
parent 9ea2c17419
commit 7b99b41ede
27 changed files with 1322 additions and 3229 deletions
@@ -1,5 +1,11 @@
 # GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)

+> **Status:** Investigation / plan. §5.A (NV12/P010 on the IDD-push default path) is **SHIPPED**
+> — `3514702`, `capture/windows/idd_push.rs` + `encode/windows/nvenc.rs`. All other levers
+> (§5.B/§5.C/§5.E/§5.F/§5.G) are **OPEN**; §5.C is partial (REALTIME knob exists, no auto-gate).
+> Paired with [`host-latency-plan.md`](host-latency-plan.md) (mutual cross-refs — keep both).
+> Trimmed to design rationale + open items; git history holds the full original.
+
 > The headache, stated precisely:
 > a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
 > stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
@@ -14,11 +20,6 @@ supersedes several of that doc's conclusions** — the codebase moved a lot in t
 GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
 two of the old plan's premises. Read §1 (corrections) before acting on the old doc.

-Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
-mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
-their own adversarial verifiers. Every external claim below carries a source URL; every code claim
-carries a current `file:line`.
-
 ---

 ## 0. TL;DR — the corrected mental model and the action list
@@ -50,9 +51,9 @@ politely yield without losing anything" — Reflex, render-queue tricks — is a

 1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
   mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
-2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
-   BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
-   the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
+2. **Stop feeding NVENC RGB on the default path** — **DONE** for IDD-push (`3514702`): the install
+   default now converts BGRA→NV12 (SDR, off the SM) / FP16→P010 (HDR, still a shader on the SM) before
+   NVENC. Linux NV12-by-default and a video-engine HDR P010 convert are still open. (§5.A)
 3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
   deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
   implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
@@ -73,9 +74,10 @@ politely yield without losing anything" — Reflex, render-queue tricks — is a
 The old doc was right about the shape but several specifics are now wrong or stale:

 - **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
-  DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
-  **RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
-  *regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
+  DDA and WGC paths — **was false for IDD-push, which became the install default and fed NVENC
+  RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path *regressed*
+  on the exact axis the doc celebrated. **Since fixed** (`3514702`, §5.A): IDD-push now converts
+  BGRA→NV12 on the video engine (FP16→P010 shader for HDR) and feeds NVENC native YUV.
 - **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
  only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
  parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
@@ -108,39 +110,33 @@ honest residual ceiling at 100% GPU. Those carry forward.

 ---

-## 2. How the pipeline actually serializes today (verified against current code)
+## 2. How the pipeline serializes today — the key insight

-The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
-`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
-**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
-near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** —
-which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
-`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
-diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
-stall"* (`punktfunk1.rs:2466-2468`).
+The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs`, `punktfunk1.rs`): every
+`1/target_fps` tick it grabs the freshest frame with a **non-blocking** `try_latest()`, and **if
+nothing new arrived it re-encodes the held frame** (a near-empty P-frame). So the **outbound fps is
+pinned at `target_fps` no matter what the source did** — which is *why the raw fps counter lies* under
+contention. The only honest signal is the `uniq` / `diag_new` counter; the code itself states the
+diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode stall."*
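
The pacing behaviour described above can be sketched as a small simulation. This is a minimal sketch, not the repo's loop: `Source`, `PacerStats` and `run_ticks` are hypothetical names, and ticks are simulated instead of timed. It only illustrates why the sent-frame count stays pinned while `uniq` tracks what the source really delivered.

```rust
/// Hypothetical frame source: `try_latest` yields a frame id only on ticks
/// where the game actually presented something new.
struct Source {
    produced: Vec<bool>, // per tick: did a new frame arrive?
    next_id: u64,
    tick: usize,
}

impl Source {
    /// Non-blocking: returns `None` when nothing new arrived this tick.
    fn try_latest(&mut self) -> Option<u64> {
        let fresh = self.produced.get(self.tick).copied().unwrap_or(false);
        self.tick += 1;
        if fresh {
            self.next_id += 1;
            Some(self.next_id)
        } else {
            None
        }
    }
}

#[derive(Debug, Default, PartialEq)]
struct PacerStats {
    sent: u32, // frames handed to the encoder (what a raw fps counter sees)
    uniq: u32, // frames that were genuinely new
}

/// Runs `ticks` pacer ticks: take a new frame if there is one,
/// otherwise re-encode the held frame.
fn run_ticks(src: &mut Source, ticks: usize) -> PacerStats {
    let mut stats = PacerStats::default();
    let mut held: Option<u64> = None;
    for _ in 0..ticks {
        if let Some(frame) = src.try_latest() {
            held = Some(frame);
            stats.uniq += 1;
        }
        if held.is_some() {
            stats.sent += 1; // a repeat of the held frame when nothing was new
        }
    }
    stats
}
```

With a source that delivers on only two of six ticks, `sent` is 6 while `uniq` is 2, which is the "low new_fps at high send rate" signature.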

-The encode round-trip (NVENC, the dominant path):
-
-- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
-  pushes onto a `pending` FIFO.
-- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
-  completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
-- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.
-
-So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
-hand AU to the send thread`. The arithmetic matches the symptom — `1000/17 ≈ 59` and `1000/13 ≈ 77`
-fps bracket the observed ~50, the signature of **one frame in flight per round-trip**, not an ASIC
-throughput wall.
+The NVENC round-trip (the dominant path) is **depth-1 synchronous**: `encode_picture` is a
+non-blocking ASIC launch, but `lock_bitstream` **blocks the same thread** until that frame completes
+(no `enableEncodeAsync`, no completion event). The only thread split is encode-vs-network-send, never
+submit-vs-retrieve. So under contention the loop is strictly serial — `capture (+convert) → submit →
+block in lock_bitstream → hand AU to the send thread` — and the arithmetic matches the symptom:
+`1000/17 ≈ 59` and `1000/13 ≈ 77` fps sit just above the observed ~50 (capture and convert add to the
+round-trip), the signature of **one frame in flight per round-trip**, not an ASIC throughput wall.
 ([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
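
The serial-loop arithmetic can be written down as a one-line model. This is an idealized sketch under stated assumptions: `fps_ceiling` and its `asic_fps_limit` parameter are hypothetical, and the model assumes throughput scales linearly with frames in flight until some hardware limit is reached.

```rust
/// Idealized upper bound on frames per second when `depth` frames can be in
/// flight and each takes `round_trip_ms` from submit to retrieved bitstream.
/// `asic_fps_limit` is an assumed hardware throughput cap.
fn fps_ceiling(round_trip_ms: f64, depth: u32, asic_fps_limit: f64) -> f64 {
    let serial = f64::from(depth) * 1000.0 / round_trip_ms;
    serial.min(asic_fps_limit)
}
```

At depth 1, a 17 ms round-trip gives about 59 fps and a 13 ms round-trip about 77 fps. Under this model, raising the depth lifts the ceiling until the assumed hardware cap takes over.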

-Where the per-frame GPU work lands, by path (this is the crux of contention):
+Where the per-frame GPU work lands, by path (the crux of contention — **lower contended-engine load is
+better**):

-| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
-|---|---|---|---|---|
-| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
-| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
-| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
-| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
+| Path | Colour-convert | NVENC input | Contended-engine load/frame |
+|---|---|---|---|
+| **IDD-push** (install default) | **NV12/P010 on the video engine** (`3514702`; FP16→P010 via shader for HDR) | NV12/P010 | low (SDR) / shader-CSC on SM (HDR) |
+| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine** | NV12/P010 | low |
+| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine** | NV12/P010 | medium (one 3D `CopyResource` to release the dup fast) |
+| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |

 Measured magnitude of "RGB vs NV12 to the encoder":
 [**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
@@ -166,7 +162,8 @@ actual saturating game.
   Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
   itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
   frames.
-3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.
+3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall. (Per-stage
+   `cap`/`submit`/`wait` µs instrumentation landed under `PUNKTFUNK_PERF` in `3514702`.)
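
The reading rules in the steps above can be sketched as a small classifier. Everything here is an assumption for illustration: the `Sample` fields, the `Verdict` names, and the `ratio` threshold are hypothetical, and the baseline is taken to be the game's presented rate in a GPU-light scene.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Presented FPS itself collapsed: the game is GPU-bound.
    GameGpuBound,
    /// The game presents frames but `uniq` lags far behind: source/flip problem.
    SourceOrFlip,
    /// The source keeps up with what the pacer could take.
    SourceKeepsUp,
}

struct Sample {
    baseline_presented_fps: f64, // game's rate in a GPU-light scene (assumed known)
    presented_fps: f64,          // from PresentMon under load
    uniq_fps: f64,               // from the host's `uniq` counter
    target_fps: f64,             // what the client requested
}

/// `ratio` is a hypothetical tolerance, e.g. 0.9 means "within 10%".
fn classify(s: &Sample, ratio: f64) -> Verdict {
    if s.presented_fps < ratio * s.baseline_presented_fps {
        return Verdict::GameGpuBound;
    }
    // The pacer samples at `target_fps`, so `uniq` can never exceed it.
    let expected_uniq = s.presented_fps.min(s.target_fps);
    if s.uniq_fps < ratio * expected_uniq {
        Verdict::SourceOrFlip
    } else {
        Verdict::SourceKeepsUp
    }
}
```

A sample with 138 presented fps but only 50 `uniq` at a 120 fps target classifies as `SourceOrFlip`. The same `uniq` with 50 presented fps classifies as `GameGpuBound`, where no capture change helps.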

 > **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
 > GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
@@ -182,12 +179,12 @@ actual saturating game.
 | Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
 | GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
 | GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
-| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
-| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
+| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) now video-engine NV12** (`3514702`) ✅ | `wgc.rs:631` / `idd_push.rs` |
+| CSC off-SM (Win HDR) | IDD-push HDR via FP16→P010 **shader** (on-SM); other paths on-SM unless `PUNKTFUNK_HDR_SHADER_P010` | `wgc.rs:603` ⚠ |
 | CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
 | Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
 | Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
-| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
+| Zero-copy register-in-place | yes; IDD-push out-ring is now the convert target (NV12/P010), no extra copy | `nvenc.rs:623` ✅ |
 | AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
 | QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
 | Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
@@ -202,35 +199,32 @@ actual saturating game.

 ## 5. The levers, ranked, with honest verdicts

-### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**
+### A. Stop feeding NVENC RGB on the default path — **DONE for Windows IDD-push** (`3514702`)

-The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
-forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
-solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
-feeding NV12/P010. **Make IDD-push and Linux do the same.**
+The default Windows IDD-push path used to hand NVENC packed RGB, forcing NVENC's internal RGB→YUV CSC
+onto the SM the game saturates. `3514702` makes the out-ring the convert target: a D3D11 **video-engine**
+`VideoConverter` does BGRA→NV12 (SDR, BT.709 limited) in place, so NVENC gets native NV12 and skips its
+SM-side CSC; HDR uses the FP16→P010 shader (NVIDIA's VideoProcessor can't do RGB→P010). NV12 input forces
+`bit_depth=8`, so an HDR↔SDR toggle re-inits the session at the matching depth (NV12 can't feed a 10-bit
+session). This also removed the separate `CopyResource` (the convert writes the ring directly).

-- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
-  out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
-  `..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
-  out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
-  (`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
-  disagree on the format.
-- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
-  `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
-  `linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
-  runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
-- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
-  P010 convert where the VP supports it).
+**Verdict: REAL, but honestly *conditional*** — the convert has to land **off** the SM to fully pay off.
+`VideoProcessorBlt` is *designed* to use fixed-function video hardware and the hardforum numbers back the
+15%→2% drop, **but no NVIDIA doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat
+the "video engine" claim as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch
+the `enc`/`sm` columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that
+just relocates the CSC to the same SM (this is why the HDR P010 *shader* path is still on-SM; Sunshine's
+RGB→NV12 CUDA kernel still contends).

-**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
-CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
-to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
-doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
-as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
-columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
-relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).
+**Still open in §A:**
+- **Linux:** make NV12 the **default** for the tiled zero-copy path (gated behind `PUNKTFUNK_NV12` *and*
+  `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`, `linux/zerocopy/egl.rs:272`), feeding NVENC
+  `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already runs; emitting NV12 from it replaces the swizzle at
+  ~equal cost and deletes NVENC's CSC.
+- **Windows HDR:** move the FP16→P010 convert onto the video engine where the VP supports it (today's
+  shader keeps it on-SM), or flip `PUNKTFUNK_HDR_SHADER_P010` on by default for the non-IDD paths.

-### B. A *correct* async encode pipeline (the untried encoder lever)
+### B. A *correct* async encode pipeline (the untried encoder lever) — **OPEN**

 The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
 work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
@@ -263,7 +257,7 @@ Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `
 and HAGS can spike the *submit* call itself
 ([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).

-### C. Auto-gated REALTIME GPU scheduling priority
+### C. Auto-gated REALTIME GPU scheduling priority — **PARTIAL** (knob exists, no auto-gate)

 Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
 Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
@@ -274,8 +268,8 @@ It works **independently of HAGS** (HAGS does *not* reassign cross-process prior
 *"Windows continues to control prioritization"*
 [DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).

-We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
-to change:
+We ship only **HIGH(4)** by default with a static `realtime` opt-in (`PUNKTFUNK_GPU_PRIORITY_CLASS`,
+`dxgi.rs:208-330`) and **no auto-gate**. Two things to change:

 - **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
  app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
@@ -294,7 +288,7 @@ the host *takes* GPU time from the game; it measurably **costs the game fps**
 That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
 the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).

-### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat
+### D. Multi-vendor encoder hygiene (AMF/QSV) — **stable / mostly done, one caveat**

 Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
 `usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
@@ -313,7 +307,7 @@ Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF

 **Verdict: REAL but largely already captured.** No big win left here except via §G.

-### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix
+### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix — **OPEN**

 NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
 frame — most visible in the *light* scene (the "200-not-240"). Pin it:
@@ -334,7 +328,7 @@ frame — most visible in the *light* scene (the "200-not-240"). Pin it:
 **Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
 already pins P0). Cheap, low risk, do it for the light-scene win.

-### F. Escape the frame-source ceiling — only if §3 says (b)
+### F. Escape the frame-source ceiling — only if §3 says (b) — **OPEN**

 If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.

@@ -358,7 +352,7 @@ If `uniq` is the wall, no encoder/priority work helps — you need a better fram
 **Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
 frames the game didn't render.

-### G. The honest endgame — encode on a second GPU / the iGPU
+### G. The honest endgame — encode on a second GPU / the iGPU — **OPEN**

 For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
 contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
@@ -384,8 +378,8 @@ the consumer analogue is the iGPU.
 ## 6. Recommended order of attack

 1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
-2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
-   Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
+2. **§5.A NV12/P010 on the default paths** — IDD-push **DONE** (`3514702`); remaining: Linux NV12
+   default-on, Windows HDR P010 off-SM. Confirm off-SM with `nvidia-smi dmon`.
 3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
 4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
 5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
@@ -418,13 +412,39 @@ or a second slice of silicon (§G). Don't chase the rest with encoder micro-opti

 ---

-## 8. Open evidence gaps (flagged honestly)
+## 8. Open items / what's left
+
+Diagnostics + still-unbuilt levers (verbatim, highest leverage first):
+
+- **§3 automation** — instrument the `uniq`-vs-`fps` heuristic + a PresentMon probe so (a)/(b) is
+  decided automatically, not by hand. (Per-stage `cap`/`submit`/`wait` µs already land under
+  `PUNKTFUNK_PERF` from `3514702`; the uniq/PresentMon classifier is not yet automated.)
+- **§5.A residual** — Linux NV12 default-on for the tiled zero-copy path (drop the
+  `PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY` double-opt-in); move the Windows HDR FP16→P010 convert off the
+  SM (today it's a shader). Windows IDD-push SDR/HDR NV12/P010 is **DONE** (`3514702`).
+- **§5.B** — build a *correct* async NVENC pipeline: submit on one thread, blocking-`lock_bitstream`
+  on a dedicated retrieve thread, deep input+output surface pool (≈4–8), Windows per-buffer
+  `completionEvent` (`enableEncodeAsync=1`), same two-thread split on Linux.
+- **§5.C** — auto-gate REALTIME GPU priority: probe HAGS (`D3DKMTQueryAdapterInfo`) + VRAM headroom
+  (`IDXGIAdapter3::QueryVideoMemoryInfo`) continuously; REALTIME only when HAGS-off or HAGS-on with
+  comfortable headroom, downgrade to HIGH the instant VRAM tightens. (Static `realtime` opt-in exists
+  in `dxgi.rs`; no auto-gate.)
+- **§5.E** — clock / P-state pinning: Windows NvAPI DRS `PREFERRED_PSTATE=PREFER_MAX` (crash-safe undo
+  to `%ProgramData%\punktfunk\`); Linux `nvidia-smi -lgc` / `nvmlDeviceSetGpuLockedClocks` (+
+  `CudaNoStablePerfLimit` on R580/595). Gate `PUNKTFUNK_PIN_CLOCKS`, default off on battery/Deck.
+- **§5.F** — frame-source escape (only if §3 says (b)): swapchain-hook capture (OBS-style, anti-cheat
+  tradeoffs); NvFBC on Linux (keylase patch); compose-flip for the DLSS-FG half-rate case; WGC
+  `MinUpdateInterval = 1e7/(fps*2)` 2×-rate tweak.
+- **§5.G** — iGPU / second-GPU encode offload: pin capture to the gaming adapter, encoder to the iGPU
+  adapter, one cross-adapter shared-texture copy. Reuses the AMF/QSV/VAAPI backends.
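
The two-thread split named in the §5.B item can be sketched with standard-library threads. This is a shape-only sketch under stated assumptions: `InFlight` and `run_pipeline` are hypothetical, no NVENC call is made, and a bounded channel stands in for the surface pool. The send plays the role of the non-blocking encode launch and the blocking receive plays the role of the blocking bitstream lock.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for one in-flight encode. Real code would carry the registered
/// input surface and its output bitstream buffer.
struct InFlight {
    frame_id: u32,
}

/// Submits `frames` jobs from the calling thread and retrieves them on a
/// dedicated thread. `depth` bounds how many jobs may be queued at once.
fn run_pipeline(frames: u32, depth: usize) -> Vec<u32> {
    // Bounded queue: submit only blocks once `depth` jobs are waiting.
    let (tx, rx) = sync_channel::<InFlight>(depth);

    // Retrieve thread: blocks per job, in submission order.
    let retrieve = thread::spawn(move || {
        let mut done = Vec::new();
        for job in rx {
            done.push(job.frame_id);
        }
        done
    });

    // Submit side: never waits on a completed encode, only on pool space.
    for frame_id in 0..frames {
        tx.send(InFlight { frame_id }).expect("retrieve thread alive");
    }
    drop(tx); // closing the queue lets the retrieve thread finish

    retrieve.join().expect("retrieve thread panicked")
}
```

Calling `run_pipeline(8, 4)` returns the frame ids in submission order. The point of the shape is that the submit loop can keep launching while an earlier frame is still being waited on, which a same-thread loop cannot do.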
+
+### Open evidence gaps (verify on-box)

 - Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
  confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
-  `nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
+  `nvidia-smi dmon` (sm% vs enc%) on the IDD-push/WGC path before assuming the win landed.
 - The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
-  unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
-  whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
+  unmeasured. §3 + an A/B of IDD-push-RGB (pre-`3514702`) vs IDD-push-NV12 on the same scene settles it
+  and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
 - AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
  whitepaper; treat the *direction* as solid, the magnitude as TBD.