feat(host): HDR Vulkan layer so Vulkan games get HDR on the virtual display

NVIDIA/AMD Vulkan ICDs refuse to *advertise* an HDR color space for a surface on an IddCx indirect/virtual display, so Vulkan games (Doom: The Dark Ages, id Tech, Indiana Jones, …) report "device does not support HDR" — even though Windows HDR, DWM compose, and the client PQ stream all work, and the ICD happily *accepts + presents* a forced HDR swapchain there. The whole gap is enumeration; the community (Apollo/Sunshine/VDD) wrote this off as kernel-side / unfixable. Add VK_LAYER_PUNKTFUNK_hdr_inject (packaging/windows/pf-vkhdr-layer/): a standalone cdylib Vulkan implicit layer that appends {A2B10G10R10, HDR10_ST2084} + {RGBA16F, scRGB} to vkGetPhysicalDeviceSurfaceFormats[2]KHR (no need to hook vkCreateSwapchainKHR — the ICD doesn't validate the color space there). Self-gated on the surface monitor's actual advanced-color state (DisplayConfig GET_ADVANCED_COLOR_INFO), so it is a complete no-op on SDR sessions and real monitors (dedup). Always-on (registry-discovered) so it works regardless of how a game is launched — env-scoping silently fails for already-running Steam. Escape hatches: DISABLE_PF_VKHDR, PF_VKHDR_EXCLUDE, and a built-in kernel-anti- cheat denylist. The installer builds/signs/stages it and registers it under HKLM64\SOFTWARE\Khronos\Vulkan\ImplicitLayers (opt-out "Install the HDR Vulkan layer" task); windows-host CI fmt+clippy-gates it (msvc-only FFI). Live-validated on the RTX box: Doom: The Dark Ages enables HDR over the pf-vdisplay virtual display. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 11:33:20 +00:00
parent 3e7c9bd059
commit d01a8fd17a
31 changed files with 1021 additions and 0 deletions
@@ -0,0 +1,282 @@
+# Host latency & the GPU-contention collapse — analysis + prioritized plan
+
+> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
+> That follow-up re-verified this plan against the current code and overturned several specifics:
+> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
+> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
+> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
+> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
+> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a
+> useful record.
+
+Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
+"saturating game starves the stream" headache:
+
+> CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding
+> (GPU-100%) scene it collapses to 40-50. Capping the game is **not** an acceptable fix.
+
+This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the
+[Apollo comparison](apollo-comparison.md) + external NVIDIA/streaming research) followed by an
+**adversarial verification pass** — every candidate fix was attacked, against our actual code, to
+separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
+placebos.
+
+## Implementation status (2026-06-18)
+
+- ✅ **Tier 2B — Linux scheduling hygiene**: landed. `boost_thread_priority` now nices the
+  capture/encode + send threads on Linux (`setpriority`, best-effort) and its wrong gamescope
+  doc-comment is fixed; CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread
+  highest-priority CUDA stream (`cuStreamCreateWithPriority`, graceful NULL-stream fallback) with a
+  per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt
+  green. The stream-priority hint is **measure-then-keep** (NVIDIA Linux may ignore it).
+- ✅ **Tier 3A — Windows session tuning**: landed (`session_tuning.rs`, raw C-ABI FFI, no-op off
+  Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer,
+  `DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) and per-thread MMCSS "Games" + keep-display-awake. Wired
+  into both the native (`boost_thread_priority`) and GameStream (`stream.rs`) paths. Linux no-op
+  path builds green; the **FFI was validated on the real MSVC toolchain** (standalone probe compiled,
+  linked against winmm/kernel32/dwmapi/avrt, and ran — timer/priority/MMCSS all succeed).
+- ✅ **Tier 2A — Linux NV12 convert**: landed, gated behind `PUNKTFUNK_NV12` (default OFF → the
+  RGB/BGRx path is byte-for-byte unchanged). The tiled EGL/GL path produces NV12 (BT.709 limited) on
+  the GPU and feeds NVENC native YUV, deleting NVENC's internal RGB→YUV CSC off the contended SM.
+  **Validated on an RTX 5070 Ti two ways**: (1) `nv12-selftest` — synthetic RGBA→NV12 round-trip vs a
+  BT.709 reference, max abs error Y=0.56 / U=0.33 / V=0.26 LSB; (2) live `capture→NV12→NVENC→decode`
+  of animated content matches the RGB path's colour (avg RGB 230,18,18 vs 231,18,20 — no green-screen,
+  correct matrix + VUI). LINEAR/Vulkan-bridge (gamescope) path stays RGB. Next: glass-to-glass
+  latency + fps-under-saturation A/B on a real game (the Tier-0 measurement) before flipping default.
+
+---
+
+## 0. Three corrections to the mental model (read first)
+
+**(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards.**
+NVENC's encode core is YUV-native. RGB input makes the driver insert an **RGB→YUV CSC on the
+SM/3D-compute cores** — the *exact* engine a game saturates. Windows already does the right thing:
+`convert_to_yuv` runs the CSC on the dedicated **VIDEO engine** via `VideoProcessorBlt`
+(`capture/dxgi.rs:1023,1063`), logged as "0% 3D". **Linux still feeds NVENC RGB**
+(`encode/linux.rs:98 nvenc_input` → `RGBZ`/`BGRZ`; the `zerocopy/egl.rs:98` shader is a `.bgra`
+*swizzle*, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single
+biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.
+
+**(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a
+hardware ceiling.** We ship `D3DKMTSetProcessSchedulingPriorityClass=HIGH(4)` +
+`SetGPUThreadPriority(0x4000001E)` + `SetMaximumFrameLatency(1)` (`capture/dxgi.rs:160-263`). The
+residual ~20 ms `lock_bitstream` wall (documented at `dxgi.rs:155`) is GPU **context-scheduling
+latency**, bounded by **preemption granularity**: NVIDIA preempts *compute* at instruction level
+(~0.1 ms) but *graphics* only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a
+draw flood). No priority class preempts an in-flight game draw. So the winning strategy is **not
+more priority** — it is (1) do **less work on the contended graphics/3D engine**, and (2) **overlap
+the unavoidable per-frame scheduling wait across frames** to recover throughput.
+
+**(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it.**
+DXGI Desktop Duplication *and* WGC both capture **from the DWM compositor**, so captured fps is
+hard-ceilinged at the **compose rate**, never the game's 400 fps. Under saturation the *compositor
+itself* is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And
+borderless/fullscreen games on **Independent/Direct Flip** present straight to scanout, *bypassing
+DWM*, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at
+`target_fps` and **re-encodes held frames**, so *transmitted* fps stays ~240 while *unique* fps
+collapses. **This must be measured before blaming encode.**
+
+> Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all
+> shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. **Linux is the
+> least-hardened host and holds most of the headroom.**
+
+---
+
+## Tier 0 — Diagnose first (cheap, decisive, do before writing code)
+
+Everything below is gated on knowing *which* bucket the collapse is in. We already have the tooling.
+
+1. **Run the workload with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`.** The `uniq` counter
+   (genuinely-new captured frames vs re-encoded holds) already exists
+   (`gamestream/stream.rs:332-336,403`; `wgc_helper.rs:122-183`). Under CS2 at GPU-100%:
+   - **`fps`≈240 but `uniq`→40-50** ⇒ the *source/compositor* only produced 40-50 unique frames.
+     No encode/priority/cadence fix on our side exceeds that — it is the game's effective
+     present-to-compose rate at 100% GPU. The lever there is **reducing our own per-frame GPU
+     steal** (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
+   - **both `fps` and `uniq`→40-50** ⇒ our capture→convert→encode round-trip is being starved (the
+     `lock_bitstream` scheduling stall). The Tier 1/2 contention levers apply directly.
+2. **Confirm the game's flip mode on Windows.** If the game is on Independent/Direct Flip (MPO),
+   capture is bypassing DWM and seeing half the frames. We already have `capture/composed_flip.rs`
+   — verify ForceComposedFlip is actually engaged on the game path, and watch `cap_us`.
+3. Capture `cap_us` / `enc_us` / `pace_us` p50/p99 alongside, to localise the stall.
+
+Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This
+headless dev VM cannot reproduce the contention.
+
+---
+
+## Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)
+
+### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
+The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
+Levers, in order:
+- **Force Composed Flip on Windows** for the game path (defeat MPO/flip-metering frame loss).
+  Machinery exists (`composed_flip.rs`); confirm it engages and measure the unique-frame delta.
+- **Opt-in "double-refresh" virtual output**: create the per-session virtual output at ~2× the
+  client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
+  we already mint arbitrary-mode virtual outputs). Gate **off** by default and **never** on the
+  gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
+  engine). `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER`.
+- **Reflex / render-queue=0 style headroom** (non-capping): documented as the substitute for an fps
+  cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what
+  we can influence from the host side.
+
+Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our
+capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
+
+### 1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
+NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
+a hidden latency tax, *most visible in easy scenes* (the ~200-should-be-240 case). Sunshine ships
+this as `nvenc_latency_over_power` and calls it decisive. **Neither host does it.**
+- **Windows**: NvAPI **per-application DRS profile** `PREFERRED_PSTATE = PREFER_MAX` scoped to our
+  exe (not a global override). Load `nvapi64.dll` dynamically; treat `NvAPI_Initialize` failure as
+  "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). **Crash-safe undo is mandatory**: write
+  an undo record to `%ProgramData%\punktfunk\` *before* applying and revert a stale profile on next
+  startup — a crash must not leave the user's control panel modified.
+- **Linux**: prefer the **root-free** path — disable the CUDA "Force P2 State" downclock that
+  context creation triggers (env/per-context), and `nvidia-smi -pm 1` (persistence) where
+  permitted. `nvmlDeviceSetGpuLockedClocks` needs root/CAP_SYS_ADMIN (our host runs as a normal
+  user → silent no-op) and is brittle across SKUs; if used, query `nvmlDeviceGetMaxClockInfo`, lock
+  to *that*, and restore on teardown **and** via a SIGTERM/panic handler.
+- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default OFF on battery / Steam Deck** (thermal/power caps
+  make pinning actively harmful there).
+
+Impact: reliable, modest p99 / easy-scene win on both OSes. Does **not** fix the saturated-scene
+collapse (at 100% util the clock is already maxed). Low cost.
+
+---
+
+## Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)
+
+### 2A. Produce **NV12/P010** on Linux and feed it to NVENC native (delete the SM-side CSC)
+The strictly-correct version (verified): **extend the existing GL de-tile blit
+(`zerocopy/egl.rs`) to emit NV12** instead of swizzled BGRx — multi-render-target (GL_R8 luma
+full-res + GL_RG8 chroma half-res, or two passes) applying an **explicit BT.709 limited-range
+matrix matching the Windows `VideoConverter`** (`dxgi.rs:957`) so hosts look identical — then
+register `NV_ENC_BUFFER_FORMAT_NV12` with the encoder (teach `encode/linux.rs:98 nvenc_input` an
+NV12 case; `CudaHw sw_format` → `AV_PIX_FMT_NV12`).
+- Net: today = GL swizzle (3D) **+** NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the
+  swizzle it replaces) **+ zero NVENC CSC**. Removes one whole CSC pass and removes it from the SM.
+- **Do *not* implement this as a standalone CUDA convert kernel on the tiled path** — CUDA can't
+  sample a tiled NVIDIA surface (`cuGraphicsEGLRegisterImage` is Tegra-only, `egl.rs:6-12`), so it
+  would still need the GL detile, *and* a CUDA kernel runs on the same saturated SM. The CUDA-kernel
+  route is only clean on the **LINEAR/Vulkan-bridge (gamescope)** path, where it doubles as the NV12
+  producer; do it there if/when that path needs it.
+- Pitfalls: pervasive 4-byte-pixel assumptions break with NV12 — `cuda.rs` hardcodes
+  `WidthInBytes = width*4` (`:363,392,499`), `BufferPool`/`alloc_pitched` assume 4 B/px, GL dst is
+  `GL_RGBA8`; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you
+  get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12.
+- Impact: real, **modest, compounding** — a few ms of per-frame GPU time and a shorter time-slice
+  need, which stacks with cadence + power-pin. **Not** a standalone cure for the 240→40 collapse
+  (external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them).
+  Medium cost. Gate behind a `PUNKTFUNK_*` env and A/B `cap→encoded` p50 + the CS2 fps floor.
+
+### 2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")
+Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in
+priority:
+- **Arm the Linux `boost_thread_priority` no-op** (`punktfunk1.rs:1856` cfg branch): best-effort
+  `libc::setpriority(PRIO_PROCESS, 0, -10/-5)` on the calling thread (tid 0 = self), log-and-continue
+  on EPERM. **Do not** default to SCHED_RR/FIFO (can starve the compositor and the game's render
+  thread — the user refuses to add game frame-time); offer it only behind `PUNKTFUNK_SCHED_RR=1`.
+  **Fix the wrong doc-comment** at `punktfunk1.rs:1834-1835` ("the Linux host caps the game via
+  gamescope, so its threads aren't starved") — false for the uncapped/NVIDIA-direct path.
+- **Set CUDA context scheduling deliberately**: `cuCtxCreate` flag `CU_CTX_SCHED_BLOCKING_SYNC` on
+  this shared VM (frees a core vs the default AUTO/SPIN) — a CPU-efficiency fix, not throughput.
+- **High-priority CUDA stream + EGL context priority** (the missing analogue of the Windows
+  hardening): `cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange)` for our copies;
+  request `EGL_IMG_context_priority HIGH` (try `EGL_NV_context_priority_realtime`) at
+  `egl.rs:332`. **Caveat, load-bearing**: these are intra-process *hints* and NVIDIA's Linux driver
+  has been reported to **ignore** context priority (driver 545: high- vs low-priority EGL contexts
+  measured identical) and to **deny** realtime Vulkan queues. Implement with graceful fallback,
+  gate behind env, and **measure on driver 595** — do not architect around it or credit it before
+  measurement.
+
+> Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
+> the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA).
+> Replacing `cuCtxSynchronize` with a per-stream event chain for *contention* reasons (it's
+> per-context, never waited on the game's separate context — a non-fix; keep the full sync where it
+> guards dmabuf recycle, `egl.rs:491`).
+
+---
+
+## Tier 3 — Windows parity polish (Windows is already strong)
+
+- **3A. Host-process session tuning (we have *zero* today — verified):** `NtSetTimerResolution(0.5ms)`
+  / `timeBeginPeriod(1)` (default 15.6 ms granularity blocks precise pacing), `DwmEnableMMCSS(true)`,
+  `SetPriorityClass(HIGH_PRIORITY_CLASS)`, MMCSS-register the capture/encode threads ("Games"/"Pro
+  Audio"), `SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED)`. All revert on stop.
+  Foundational for any precise frame pacing and the encode|send split. Low cost, low risk.
+  (`gamestream/stream.rs` start/stop; Apollo's `streaming_will_start`/`_stopped`.)
+- **3B. Auto-gated REALTIME D3DKMT class** instead of fixed HIGH (the realtime opt-in already exists
+  at `dxgi.rs:199-207`): probe HAGS (`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom
+  (`IDXGIAdapter3::QueryVideoMemoryInfo`, continuously), allow REALTIME(5) only when safe (HAGS off,
+  or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises —
+  Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling
+  rung, same preemption ceiling), so rank it as cheap parity, not a fix.
+- **3C. Cheap experiment — `VideoProcessorBlt` directly from the DDA surface** (skip the same-format
+  `gpu_copy` at `dxgi.rs:2375`), then `ReleaseFrame`, *iff* it doesn't re-serialize `AcquireNextFrame`
+  (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming
+  the Blt is on the video engine). One-line source-texture change; benchmark only. Do **not** build a
+  D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM
+  (~5% 3D, no PCIe), not worth the interop rebuild.
+- **3D. Async NVENC + off-thread retrieve — measure-gated, uncertain.** Today retrieve
+  (`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which is *why*
+  `depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates submit/retrieve
+  on separate threads with completion events + a deep surface pool; doing that *could* let per-frame
+  scheduling waits **overlap across frames** and recover *throughput* — at a per-frame *latency* cost
+  (depth × frame time). This is the one place the research and our own prior measurement disagree, so
+  it is **strictly measure-first**, and it forecloses slice output (`reportSliceOffsets` needs
+  `enableEncodeAsync=0`). Treat as a structural experiment, not a committed win.
+
+---
+
+## Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)
+
+- **GL2 — Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
+  forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
+  spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met.
+  Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win,
+  orthogonal to the collapse.
+- **GL1 + GL6 — Sub-frame slice output + per-slice paced send** (the roadmap's "~2-4 ms lever"):
+  `enableSubFrameWrite` + `sliceMode` + transmit each slice as it completes. **Big**: needs the
+  direct NVENC SDK on Linux (libavcodec emits whole AUs) **and** a per-slice wire/FEC redesign in
+  `punktfunk-core` (today `PacketHeader`/`Packetizer`/reassembler are whole-AU; per-slice FEC blocks
+  wreck Leopard efficiency) **and** client slice-granular submit. Gate on
+  `NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK` (often absent on consumer GeForce). The paced-send half is
+  **already shipped** (`stream.rs spawn_sender`, `punktfunk1.rs paced_submit`) — don't re-implement.
+
+---
+
+## Dropped / why (so we don't re-propose placebo)
+
+| Candidate | Verdict | Why |
+|---|---|---|
+| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
+| Replace `cuCtxSynchronize` with per-stream event chain *for contention* | ✗ | `cuCtxSynchronize` is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
+| Vulkan `VK_EXT_global_priority` as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR `vkCmdCopyBuffer`, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
+| Async NVENC as a *throughput/collapse* fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our `depth>1` regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
+| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
+| Empty-frame (`LastPresentTime==0`) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
+| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets `zeroReorderDelay=1`, lookahead/multipass/AQ off; ffmpeg defaults match + we set `bf=0`. Only `lowDelayKeyFrameScale=1` is non-redundant → fold into GL2 (Windows SDK path only). |
+| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no `nvEncInvalidateRefFrames`; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
+| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
+| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |
+
+---
+
+## Recommended order of attack
+
+1. **Tier 0 diagnose** on the real boxes — settles whether the collapse is source-ceiling or
+   pipeline-starvation, and whether flip-bypass is halving capture.
+2. **Tier 2A (Linux NV12)** + **Tier 2B (Linux scheduling hygiene)** — the largest in-our-control
+   headroom; Linux is the least-hardened host.
+3. **Tier 1B (clock/power pin)** both OSes — cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
+4. **Tier 1A (cadence/flip)** — gated on Tier 0 (this is where a big chunk of the collapse may live).
+5. **Tier 3 (Windows polish)** — session tuning is the clear win; the rest is parity.
+6. **Tier 4** — only after the contention work; intra-refresh first, slice pipelining last.
+
+Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
+closes and the saturated floor rises, but a residual ceiling remains — at 100% GPU the game
+physically cannot also render the game *and* compose 240 unique frames, and WDDM/NVIDIA preemption
+granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it
+with encoder micro-optimisations.