docs(design): trim shipped plans, consolidate cluster, add index

Much of design/ described work that has since shipped. Trim each doc to its durable rationale + still-open items (the code is the source of truth for shipped detail; git history holds the full originals). - Shipped plans -> status stubs: stats-capture, gamestream-host-plan, apple-stage2-presenter, windows-service. - Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline, host-latency, gpu-contention (fixed stale status table), game-library, linux-setup (fixed m0->spike + stale zero-copy claim), session-aware-host-followups, windows-client-bootstrap, windows-dualsense-{scoping,game-detection}, windows-virtual-display, security-review (per-finding status table; #12 still open), apollo-comparison (shipped backlog collapsed to one-liners). - Windows-host cluster consolidated: windows-host.md -> redirect into windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is merged, M4 done); windows-secure-desktop.md archived (now a fallback behind IDD-push primary). - Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md. - New design/README.md: per-doc status table + consolidated open-items roll-up so nothing is tracked in only one buried doc. - Repoint 5 code comments to the archived secure-desktop doc path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:39:06 +00:00
parent 9ea2c17419
commit 7b99b41ede
27 changed files with 1322 additions and 3229 deletions
@@ -1,13 +1,18 @@
 # Host latency & the GPU-contention collapse — analysis + prioritized plan

+> **Status:** PARTLY SHIPPED. Tier 2A (Linux NV12 convert) = `1fc6f73`; Tier 2B (Linux
+> scheduling) + Tier 3A (Windows session tuning) = `112a054`. Tiers 1A, 1B, 3B, 3C, 3D, 4 are
+> still open. This doc is trimmed to design rationale + open items; the shipped code is the
+> source of truth for the landed tiers.
+
 > **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
 > That follow-up re-verified this plan against the current code and overturned several specifics:
 > the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
 > right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
 > latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
 > "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
-> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a
-> useful record.
+> Windows. **For current action prioritization see `gpu-contention-investigation.md`.** The
+> tiers/dropped-placebo analysis below remain a useful record.

 Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
 "saturating game starves the stream" headache:
@@ -21,64 +26,19 @@ This doc is the synthesis of a multi-agent investigation (deep read of our pipel
 separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
 placebos.

-## Implementation status (2026-06-18)
-
- ✅ **Tier 2B — Linux scheduling hygiene**: landed. `boost_thread_priority` now nices the
-  capture/encode + send threads on Linux (`setpriority`, best-effort) and its wrong gamescope
-  doc-comment is fixed; CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread
-  highest-priority CUDA stream (`cuStreamCreateWithPriority`, graceful NULL-stream fallback) with a
-  per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt
-  green. The stream-priority hint is **measure-then-keep** (NVIDIA Linux may ignore it).
- ✅ **Tier 3A — Windows session tuning**: landed (`session_tuning.rs`, raw C-ABI FFI, no-op off
-  Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer,
-  `DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) and per-thread MMCSS "Games" + keep-display-awake. Wired
-  into both the native (`boost_thread_priority`) and GameStream (`stream.rs`) paths. Linux no-op
-  path builds green; the **FFI was validated on the real MSVC toolchain** (standalone probe compiled,
-  linked against winmm/kernel32/dwmapi/avrt, and ran — timer/priority/MMCSS all succeed).
- ✅ **Tier 2A — Linux NV12 convert**: landed, gated behind `PUNKTFUNK_NV12` (default OFF → the
-  RGB/BGRx path is byte-for-byte unchanged). The tiled EGL/GL path produces NV12 (BT.709 limited) on
-  the GPU and feeds NVENC native YUV, deleting NVENC's internal RGB→YUV CSC off the contended SM.
-  **Validated on an RTX 5070 Ti two ways**: (1) `nv12-selftest` — synthetic RGBA→NV12 round-trip vs a
-  BT.709 reference, max abs error Y=0.56 / U=0.33 / V=0.26 LSB; (2) live `capture→NV12→NVENC→decode`
-  of animated content matches the RGB path's colour (avg RGB 230,18,18 vs 231,18,20 — no green-screen,
-  correct matrix + VUI). LINEAR/Vulkan-bridge (gamescope) path stays RGB. Next: glass-to-glass
-  latency + fps-under-saturation A/B on a real game (the Tier-0 measurement) before flipping default.
-
 ---

-## 0. Three corrections to the mental model (read first)
+## Mental model (§0A–0C) — see the follow-up

-**(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards.**
-NVENC's encode core is YUV-native. RGB input makes the driver insert an **RGB→YUV CSC on the
-SM/3D-compute cores** — the *exact* engine a game saturates. Windows already does the right thing:
-`convert_to_yuv` runs the CSC on the dedicated **VIDEO engine** via `VideoProcessorBlt`
-(`capture/dxgi.rs:1023,1063`), logged as "0% 3D". **Linux still feeds NVENC RGB**
-(`encode/linux.rs:98 nvenc_input` → `RGBZ`/`BGRZ`; the `zerocopy/egl.rs:98` shader is a `.bgra`
-*swizzle*, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single
-biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.
-
-**(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a
-hardware ceiling.** We ship `D3DKMTSetProcessSchedulingPriorityClass=HIGH(4)` +
-`SetGPUThreadPriority(0x4000001E)` + `SetMaximumFrameLatency(1)` (`capture/dxgi.rs:160-263`). The
-residual ~20 ms `lock_bitstream` wall (documented at `dxgi.rs:155`) is GPU **context-scheduling
-latency**, bounded by **preemption granularity**: NVIDIA preempts *compute* at instruction level
-(~0.1 ms) but *graphics* only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a
-draw flood). No priority class preempts an in-flight game draw. So the winning strategy is **not
-more priority** — it is (1) do **less work on the contended graphics/3D engine**, and (2) **overlap
-the unavoidable per-frame scheduling wait across frames** to recover throughput.
-
-**(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it.**
-DXGI Desktop Duplication *and* WGC both capture **from the DWM compositor**, so captured fps is
-hard-ceilinged at the **compose rate**, never the game's 400 fps. Under saturation the *compositor
-itself* is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And
-borderless/fullscreen games on **Independent/Direct Flip** present straight to scanout, *bypassing
-DWM*, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at
-`target_fps` and **re-encodes held frames**, so *transmitted* fps stays ~240 while *unique* fps
-collapses. **This must be measured before blaming encode.**
-
-> Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all
-> shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. **Linux is the
-> least-hardened host and holds most of the headroom.**
+The original three-correction mental model (A: feeding NVENC RGB is backwards; B: GPU priority is
+maxed on Windows and hits a preemption-granularity ceiling; C: a chunk of the collapse is upstream
+of the encoder at the compositor compose-rate, with Independent/Direct Flip bypassing DWM) is
+**partly corrected by `gpu-contention-investigation.md` §1** — notably that the default Windows
+IDD-push path now feeds NVENC RGB (so §0A's "Windows already does the right thing" no longer holds),
+and "capture sees half the frames" is DLSS-Frame-Gen-specific rather than general. Read the
+follow-up doc for the corrected model. The durable takeaways still stand: **do less work on the
+contended graphics/3D engine**, **overlap the unavoidable per-frame scheduling wait across frames**,
+and **measure source-vs-pipeline before blaming encode**.

 ---

@@ -105,7 +65,7 @@ headless dev VM cannot reproduce the contention.

 ---

-## Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)
+## Tier 1 — The two under-weighted, cross-platform levers (OPEN — confirmed by research, not yet done)

 ### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
 The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
@@ -146,50 +106,23 @@ collapse (at 100% util the clock is already maxed). Low cost.

 ---

-## Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)
+## Tier 2 — Linux work-deletion + scheduling hygiene

-### 2A. Produce **NV12/P010** on Linux and feed it to NVENC native (delete the SM-side CSC)
-The strictly-correct version (verified): **extend the existing GL de-tile blit
-(`zerocopy/egl.rs`) to emit NV12** instead of swizzled BGRx — multi-render-target (GL_R8 luma
-full-res + GL_RG8 chroma half-res, or two passes) applying an **explicit BT.709 limited-range
-matrix matching the Windows `VideoConverter`** (`dxgi.rs:957`) so hosts look identical — then
-register `NV_ENC_BUFFER_FORMAT_NV12` with the encoder (teach `encode/linux.rs:98 nvenc_input` an
-NV12 case; `CudaHw sw_format` → `AV_PIX_FMT_NV12`).
- Net: today = GL swizzle (3D) **+** NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the
-  swizzle it replaces) **+ zero NVENC CSC**. Removes one whole CSC pass and removes it from the SM.
- **Do *not* implement this as a standalone CUDA convert kernel on the tiled path** — CUDA can't
-  sample a tiled NVIDIA surface (`cuGraphicsEGLRegisterImage` is Tegra-only, `egl.rs:6-12`), so it
-  would still need the GL detile, *and* a CUDA kernel runs on the same saturated SM. The CUDA-kernel
-  route is only clean on the **LINEAR/Vulkan-bridge (gamescope)** path, where it doubles as the NV12
-  producer; do it there if/when that path needs it.
- Pitfalls: pervasive 4-byte-pixel assumptions break with NV12 — `cuda.rs` hardcodes
-  `WidthInBytes = width*4` (`:363,392,499`), `BufferPool`/`alloc_pitched` assume 4 B/px, GL dst is
-  `GL_RGBA8`; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you
-  get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12.
- Impact: real, **modest, compounding** — a few ms of per-frame GPU time and a shorter time-slice
-  need, which stacks with cadence + power-pin. **Not** a standalone cure for the 240→40 collapse
-  (external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them).
-  Medium cost. Gate behind a `PUNKTFUNK_*` env and A/B `cap→encoded` p50 + the CS2 fps floor.
+### 2A. Linux NV12 convert — **SHIPPED (`1fc6f73`)**
+GL de-tile blit emits NV12 (BT.709 limited) on the GPU and feeds NVENC native YUV, deleting NVENC's
+internal RGB→YUV CSC off the contended SM. Gated `PUNKTFUNK_NV12` (default OFF). Tiled EGL/GL path
+only; LINEAR/Vulkan-bridge (gamescope) stays RGB. Validated colour-correct on RTX 5070 Ti. Open
+follow-up: glass-to-glass latency + CS2 fps-under-saturation A/B before flipping the default, and
+the **P010** variant for the HDR/10-bit path. Code is the source of truth (`zerocopy/egl.rs`,
+`encode/linux.rs`).

-### 2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")
-Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in
-priority:
- **Arm the Linux `boost_thread_priority` no-op** (`punktfunk1.rs:1856` cfg branch): best-effort
-  `libc::setpriority(PRIO_PROCESS, 0, -10/-5)` on the calling thread (tid 0 = self), log-and-continue
-  on EPERM. **Do not** default to SCHED_RR/FIFO (can starve the compositor and the game's render
-  thread — the user refuses to add game frame-time); offer it only behind `PUNKTFUNK_SCHED_RR=1`.
-  **Fix the wrong doc-comment** at `punktfunk1.rs:1834-1835` ("the Linux host caps the game via
-  gamescope, so its threads aren't starved") — false for the uncapped/NVIDIA-direct path.
- **Set CUDA context scheduling deliberately**: `cuCtxCreate` flag `CU_CTX_SCHED_BLOCKING_SYNC` on
-  this shared VM (frees a core vs the default AUTO/SPIN) — a CPU-efficiency fix, not throughput.
- **High-priority CUDA stream + EGL context priority** (the missing analogue of the Windows
-  hardening): `cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange)` for our copies;
-  request `EGL_IMG_context_priority HIGH` (try `EGL_NV_context_priority_realtime`) at
-  `egl.rs:332`. **Caveat, load-bearing**: these are intra-process *hints* and NVIDIA's Linux driver
-  has been reported to **ignore** context priority (driver 545: high- vs low-priority EGL contexts
-  measured identical) and to **deny** realtime Vulkan queues. Implement with graceful fallback,
-  gate behind env, and **measure on driver 595** — do not architect around it or credit it before
-  measurement.
+### 2B. Linux scheduling hygiene — **SHIPPED (`112a054`)**
+`boost_thread_priority` nices capture/encode/send on Linux (best-effort `setpriority`);
+CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread highest-priority CUDA
+stream (`cuStreamCreateWithPriority`, NULL-stream fallback). The stream-priority hint is
+**measure-then-keep** (NVIDIA Linux may ignore it). **Do not** default to SCHED_RR/FIFO (can starve
+the compositor + the game's render thread); opt-in only behind `PUNKTFUNK_SCHED_RR=1`. Code is the
+source of truth (`punktfunk1.rs`).

 > Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
 > the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA).
@@ -201,36 +134,41 @@ priority:

 ## Tier 3 — Windows parity polish (Windows is already strong)

- **3A. Host-process session tuning (we have *zero* today — verified):** `NtSetTimerResolution(0.5ms)`
-  / `timeBeginPeriod(1)` (default 15.6 ms granularity blocks precise pacing), `DwmEnableMMCSS(true)`,
-  `SetPriorityClass(HIGH_PRIORITY_CLASS)`, MMCSS-register the capture/encode threads ("Games"/"Pro
-  Audio"), `SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED)`. All revert on stop.
-  Foundational for any precise frame pacing and the encode|send split. Low cost, low risk.
-  (`gamestream/stream.rs` start/stop; Apollo's `streaming_will_start`/`_stopped`.)
- **3B. Auto-gated REALTIME D3DKMT class** instead of fixed HIGH (the realtime opt-in already exists
-  at `dxgi.rs:199-207`): probe HAGS (`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom
-  (`IDXGIAdapter3::QueryVideoMemoryInfo`, continuously), allow REALTIME(5) only when safe (HAGS off,
-  or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises —
-  Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling
-  rung, same preemption ceiling), so rank it as cheap parity, not a fix.
- **3C. Cheap experiment — `VideoProcessorBlt` directly from the DDA surface** (skip the same-format
-  `gpu_copy` at `dxgi.rs:2375`), then `ReleaseFrame`, *iff* it doesn't re-serialize `AcquireNextFrame`
-  (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming
-  the Blt is on the video engine). One-line source-texture change; benchmark only. Do **not** build a
-  D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM
-  (~5% 3D, no PCIe), not worth the interop rebuild.
- **3D. Async NVENC + off-thread retrieve — measure-gated, uncertain.** Today retrieve
-  (`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which is *why*
-  `depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates submit/retrieve
-  on separate threads with completion events + a deep surface pool; doing that *could* let per-frame
-  scheduling waits **overlap across frames** and recover *throughput* — at a per-frame *latency* cost
-  (depth × frame time). This is the one place the research and our own prior measurement disagree, so
-  it is **strictly measure-first**, and it forecloses slice output (`reportSliceOffsets` needs
-  `enableEncodeAsync=0`). Treat as a structural experiment, not a committed win.
+### 3A. Host-process session tuning — **SHIPPED (`112a054`)**
+`session_tuning.rs` (raw C-ABI FFI, no-op off Windows): each capture/encode/send thread applies
+process-wide tuning once (1 ms timer, `DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) + per-thread MMCSS
+"Games" + keep-display-awake; reverts on stop. Wired into both native (`boost_thread_priority`) and
+GameStream (`stream.rs`) paths. FFI validated on the real MSVC toolchain.
+
+### 3B. Auto-gated REALTIME D3DKMT class (OPEN)
+Instead of fixed HIGH (the realtime opt-in already exists at `dxgi.rs:199-207`): probe HAGS
+(`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom (`IDXGIAdapter3::QueryVideoMemoryInfo`,
+continuously), allow REALTIME(5) only when safe (HAGS off, or HAGS on + VRAM comfortably below
+budget), downgrade to HIGH the moment VRAM pressure rises — Sunshine's actual gate avoids the
+HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling rung, same preemption ceiling), so
+rank it as cheap parity, not a fix.
+
+### 3C. `VideoProcessorBlt` directly from the DDA surface (OPEN — cheap experiment)
+Skip the same-format `gpu_copy` at `dxgi.rs:2375`, then `ReleaseFrame`, *iff* it doesn't
+re-serialize `AcquireNextFrame` (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but
+that note predates confirming the Blt is on the video engine). One-line source-texture change;
+benchmark only. Do **not** build a D3D11↔D3D12 copy-queue offload — the convert is already off-3D,
+the remaining copy is intra-VRAM (~5% 3D, no PCIe), not worth the interop rebuild.
+
+### 3D. Async NVENC + off-thread retrieve (OPEN — measure-gated, uncertain)
+Today retrieve (`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which
+is *why* `depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates
+submit/retrieve on separate threads with completion events + a deep surface pool; doing that *could*
+let per-frame scheduling waits **overlap across frames** and recover *throughput* — at a per-frame
+*latency* cost (depth × frame time). This is the one place the research and our own prior
+measurement disagree, so it is **strictly measure-first**, and it forecloses slice output
+(`reportSliceOffsets` needs `enableEncodeAsync=0`). Treat as a structural experiment, not a
+committed win. (The follow-up doc notes the prior "async stacks latency" result was a *same-thread*
+implementation, not a disproof of a correct two-thread pipeline.)

 ---

-## Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)
+## Tier 4 — Deferred 2nd-order latency (OPEN — not contention fixes; do after Tiers 0-2)

 - **GL2 — Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
  forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
@@ -264,16 +202,24 @@ priority:

 ---

-## Recommended order of attack
+## Open items / What's left

-1. **Tier 0 diagnose** on the real boxes — settles whether the collapse is source-ceiling or
-   pipeline-starvation, and whether flip-bypass is halving capture.
-2. **Tier 2A (Linux NV12)** + **Tier 2B (Linux scheduling hygiene)** — the largest in-our-control
-   headroom; Linux is the least-hardened host.
-3. **Tier 1B (clock/power pin)** both OSes — cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
-4. **Tier 1A (cadence/flip)** — gated on Tier 0 (this is where a big chunk of the collapse may live).
-5. **Tier 3 (Windows polish)** — session tuning is the clear win; the rest is parity.
-6. **Tier 4** — only after the contention work; intra-refresh first, slice pipelining last.
+For current action prioritization see [`gpu-contention-investigation.md`](gpu-contention-investigation.md).
+Still-open work tracked by this doc:
+
+- **Tier 0** — run the `PUNKTFUNK_PERF=1` uniq-vs-fps + flip-mode diagnosis on the real-GPU boxes
+  (gate for everything below).
+- **Tier 1A** — capture-source / compose-rate cadence levers (ForceComposedFlip verify;
+  `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER` double-refresh; Reflex/render-queue=0 headroom).
+- **Tier 1B** — GPU clock/power pinning (`PUNKTFUNK_PIN_CLOCKS`; NvAPI per-app DRS on Windows w/
+  crash-safe undo; root-free CUDA-P2/persistence on Linux; default OFF on battery/Deck).
+- **Tier 2A follow-up** — glass-to-glass + CS2-floor A/B before defaulting `PUNKTFUNK_NV12`, and the
+  **P010** HDR/10-bit variant.
+- **Tier 3B** — auto-gated REALTIME D3DKMT class (HAGS + VRAM-headroom gate).
+- **Tier 3C** — `VideoProcessorBlt` directly from the DDA surface (benchmark-only experiment).
+- **Tier 3D** — correct async NVENC two-thread submit/retrieve pipeline (strictly measure-first).
+- **Tier 4** — GL2 intra-refresh for RFI/recovery; GL1/GL6 sub-frame slice output + per-slice paced
+  send (paced-send half already shipped).

 Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
 closes and the saturated floor rises, but a residual ceiling remains — at 100% GPU the game