Files
punktfunk/design/host-latency-plan.md
enricobuehler 7b99b41ede docs(design): trim shipped plans, consolidate cluster, add index
Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).

- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
  apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
  host-latency, gpu-contention (fixed stale status table), game-library,
  linux-setup (fixed m0->spike + stale zero-copy claim),
  session-aware-host-followups, windows-client-bootstrap,
  windows-dualsense-{scoping,game-detection}, windows-virtual-display,
  security-review (per-finding status table; #12 still open),
  apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
  windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
  merged, M4 done); windows-secure-desktop.md archived (now a fallback
  behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
  roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:39:06 +00:00

229 lines
16 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Host latency & the GPU-contention collapse — analysis + prioritized plan
> **Status:** PARTLY SHIPPED. Tier 2A (Linux NV12 convert) = `1fc6f73`; Tier 2B (Linux
> scheduling) + Tier 3A (Windows session tuning) = `112a054`. Tiers 1A, 1B, 3B, 3C, 3D, 4 are
> still open. This doc is trimmed to design rationale + open items; the shipped code is the
> source of truth for the landed tiers.
> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
> That follow-up re-verified this plan against the current code and overturned several specifics:
> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
> Windows. **For current action prioritization see `gpu-contention-investigation.md`.** The
> tiers/dropped-placebo analysis below remain a useful record.
Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
"saturating game starves the stream" headache:
> CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding
> (GPU-100%) scene it collapses to 40-50. Capping the game is **not** an acceptable fix.
This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the
[Apollo comparison](apollo-comparison.md) + external NVIDIA/streaming research) followed by an
**adversarial verification pass** — every candidate fix was attacked, against our actual code, to
separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
placebos.
---
## Mental model (§0A0C) — see the follow-up
The original three-correction mental model (A: feeding NVENC RGB is backwards; B: GPU priority is
maxed on Windows and hits a preemption-granularity ceiling; C: a chunk of the collapse is upstream
of the encoder at the compositor compose-rate, with Independent/Direct Flip bypassing DWM) is
**partly corrected by `gpu-contention-investigation.md` §1** — notably that the default Windows
IDD-push path now feeds NVENC RGB (so §0A's "Windows already does the right thing" no longer holds),
and "capture sees half the frames" is DLSS-Frame-Gen-specific rather than general. Read the
follow-up doc for the corrected model. The durable takeaways still stand: **do less work on the
contended graphics/3D engine**, **overlap the unavoidable per-frame scheduling wait across frames**,
and **measure source-vs-pipeline before blaming encode**.
---
## Tier 0 — Diagnose first (cheap, decisive, do before writing code)
Everything below is gated on knowing *which* bucket the collapse is in. We already have the tooling.
1. **Run the workload with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`.** The `uniq` counter
(genuinely-new captured frames vs re-encoded holds) already exists
(`gamestream/stream.rs:332-336,403`; `wgc_helper.rs:122-183`). Under CS2 at GPU-100%:
- **`fps`≈240 but `uniq`→40-50** ⇒ the *source/compositor* only produced 40-50 unique frames.
No encode/priority/cadence fix on our side exceeds that — it is the game's effective
present-to-compose rate at 100% GPU. The lever there is **reducing our own per-frame GPU
steal** (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
- **both `fps` and `uniq`→40-50** ⇒ our capture→convert→encode round-trip is being starved (the
`lock_bitstream` scheduling stall). The Tier 1/2 contention levers apply directly.
2. **Confirm the game's flip mode on Windows.** If the game is on Independent/Direct Flip (MPO),
capture is bypassing DWM and seeing half the frames. We already have `capture/composed_flip.rs`
— verify ForceComposedFlip is actually engaged on the game path, and watch `cap_us`.
3. Capture `cap_us` / `enc_us` / `pace_us` p50/p99 alongside, to localise the stall.
Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This
headless dev VM cannot reproduce the contention.
---
## Tier 1 — The two under-weighted, cross-platform levers (OPEN — confirmed by research, not yet done)
### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
Levers, in order:
- **Force Composed Flip on Windows** for the game path (defeat MPO/flip-metering frame loss).
Machinery exists (`composed_flip.rs`); confirm it engages and measure the unique-frame delta.
- **Opt-in "double-refresh" virtual output**: create the per-session virtual output at ~2× the
client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
we already mint arbitrary-mode virtual outputs). Gate **off** by default and **never** on the
gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
engine). `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER`.
- **Reflex / render-queue=0 style headroom** (non-capping): documented as the substitute for an fps
cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what
we can influence from the host side.
Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our
capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
### 1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
a hidden latency tax, *most visible in easy scenes* (the ~200-should-be-240 case). Sunshine ships
this as `nvenc_latency_over_power` and calls it decisive. **Neither host does it.**
- **Windows**: NvAPI **per-application DRS profile** `PREFERRED_PSTATE = PREFER_MAX` scoped to our
exe (not a global override). Load `nvapi64.dll` dynamically; treat `NvAPI_Initialize` failure as
"no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). **Crash-safe undo is mandatory**: write
an undo record to `%ProgramData%\punktfunk\` *before* applying and revert a stale profile on next
startup — a crash must not leave the user's control panel modified.
- **Linux**: prefer the **root-free** path — disable the CUDA "Force P2 State" downclock that
context creation triggers (env/per-context), and `nvidia-smi -pm 1` (persistence) where
permitted. `nvmlDeviceSetGpuLockedClocks` needs root/CAP_SYS_ADMIN (our host runs as a normal
user → silent no-op) and is brittle across SKUs; if used, query `nvmlDeviceGetMaxClockInfo`, lock
to *that*, and restore on teardown **and** via a SIGTERM/panic handler.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default OFF on battery / Steam Deck** (thermal/power caps
make pinning actively harmful there).
Impact: reliable, modest p99 / easy-scene win on both OSes. Does **not** fix the saturated-scene
collapse (at 100% util the clock is already maxed). Low cost.
---
## Tier 2 — Linux work-deletion + scheduling hygiene
### 2A. Linux NV12 convert — **SHIPPED (`1fc6f73`)**
GL de-tile blit emits NV12 (BT.709 limited) on the GPU and feeds NVENC native YUV, deleting NVENC's
internal RGB→YUV CSC off the contended SM. Gated `PUNKTFUNK_NV12` (default OFF). Tiled EGL/GL path
only; LINEAR/Vulkan-bridge (gamescope) stays RGB. Validated colour-correct on RTX 5070 Ti. Open
follow-up: glass-to-glass latency + CS2 fps-under-saturation A/B before flipping the default, and
the **P010** variant for the HDR/10-bit path. Code is the source of truth (`zerocopy/egl.rs`,
`encode/linux.rs`).
### 2B. Linux scheduling hygiene — **SHIPPED (`112a054`)**
`boost_thread_priority` nices capture/encode/send on Linux (best-effort `setpriority`);
CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread highest-priority CUDA
stream (`cuStreamCreateWithPriority`, NULL-stream fallback). The stream-priority hint is
**measure-then-keep** (NVIDIA Linux may ignore it). **Do not** default to SCHED_RR/FIFO (can starve
the compositor + the game's render thread); opt-in only behind `PUNKTFUNK_SCHED_RR=1`. Code is the
source of truth (`punktfunk1.rs`).
> Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
> the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA).
> Replacing `cuCtxSynchronize` with a per-stream event chain for *contention* reasons (it's
> per-context, never waited on the game's separate context — a non-fix; keep the full sync where it
> guards dmabuf recycle, `egl.rs:491`).
---
## Tier 3 — Windows parity polish (Windows is already strong)
### 3A. Host-process session tuning — **SHIPPED (`112a054`)**
`session_tuning.rs` (raw C-ABI FFI, no-op off Windows): each capture/encode/send thread applies
process-wide tuning once (1 ms timer, `DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) + per-thread MMCSS
"Games" + keep-display-awake; reverts on stop. Wired into both native (`boost_thread_priority`) and
GameStream (`stream.rs`) paths. FFI validated on the real MSVC toolchain.
### 3B. Auto-gated REALTIME D3DKMT class (OPEN)
Instead of fixed HIGH (the realtime opt-in already exists at `dxgi.rs:199-207`): probe HAGS
(`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom (`IDXGIAdapter3::QueryVideoMemoryInfo`,
continuously), allow REALTIME(5) only when safe (HAGS off, or HAGS on + VRAM comfortably below
budget), downgrade to HIGH the moment VRAM pressure rises — Sunshine's actual gate avoids the
HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling rung, same preemption ceiling), so
rank it as cheap parity, not a fix.
### 3C. `VideoProcessorBlt` directly from the DDA surface (OPEN — cheap experiment)
Skip the same-format `gpu_copy` at `dxgi.rs:2375`, then `ReleaseFrame`, *iff* it doesn't
re-serialize `AcquireNextFrame` (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but
that note predates confirming the Blt is on the video engine). One-line source-texture change;
benchmark only. Do **not** build a D3D11↔D3D12 copy-queue offload — the convert is already off-3D,
the remaining copy is intra-VRAM (~5% 3D, no PCIe), not worth the interop rebuild.
### 3D. Async NVENC + off-thread retrieve (OPEN — measure-gated, uncertain)
Today retrieve (`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which
is *why* `depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates
submit/retrieve on separate threads with completion events + a deep surface pool; doing that *could*
let per-frame scheduling waits **overlap across frames** and recover *throughput* — at a per-frame
*latency* cost (depth × frame time). This is the one place the research and our own prior
measurement disagree, so it is **strictly measure-first**, and it forecloses slice output
(`reportSliceOffsets` needs `enableEncodeAsync=0`). Treat as a structural experiment, not a
committed win. (The follow-up doc notes the prior "async stacks latency" result was a *same-thread*
implementation, not a disproof of a correct two-thread pipeline.)
---
## Tier 4 — Deferred 2nd-order latency (OPEN — not contention fixes; do after Tiers 0-2)
- **GL2 — Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met.
Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win,
orthogonal to the collapse.
- **GL1 + GL6 — Sub-frame slice output + per-slice paced send** (the roadmap's "~2-4 ms lever"):
`enableSubFrameWrite` + `sliceMode` + transmit each slice as it completes. **Big**: needs the
direct NVENC SDK on Linux (libavcodec emits whole AUs) **and** a per-slice wire/FEC redesign in
`punktfunk-core` (today `PacketHeader`/`Packetizer`/reassembler are whole-AU; per-slice FEC blocks
wreck Leopard efficiency) **and** client slice-granular submit. Gate on
`NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK` (often absent on consumer GeForce). The paced-send half is
**already shipped** (`stream.rs spawn_sender`, `punktfunk1.rs paced_submit`) — don't re-implement.
---
## Dropped / why (so we don't re-propose placebo)
| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace `cuCtxSynchronize` with per-stream event chain *for contention* | ✗ | `cuCtxSynchronize` is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan `VK_EXT_global_priority` as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR `vkCmdCopyBuffer`, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a *throughput/collapse* fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our `depth>1` regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (`LastPresentTime==0`) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets `zeroReorderDelay=1`, lookahead/multipass/AQ off; ffmpeg defaults match + we set `bf=0`. Only `lowDelayKeyFrameScale=1` is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no `nvEncInvalidateRefFrames`; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |
---
## Open items / What's left
For current action prioritization see [`gpu-contention-investigation.md`](gpu-contention-investigation.md).
Still-open work tracked by this doc:
- **Tier 0** — run the `PUNKTFUNK_PERF=1` uniq-vs-fps + flip-mode diagnosis on the real-GPU boxes
(gate for everything below).
- **Tier 1A** — capture-source / compose-rate cadence levers (ForceComposedFlip verify;
`PUNKTFUNK_OUTPUT_HZ_MULTIPLIER` double-refresh; Reflex/render-queue=0 headroom).
- **Tier 1B** — GPU clock/power pinning (`PUNKTFUNK_PIN_CLOCKS`; NvAPI per-app DRS on Windows w/
crash-safe undo; root-free CUDA-P2/persistence on Linux; default OFF on battery/Deck).
- **Tier 2A follow-up** — glass-to-glass + CS2-floor A/B before defaulting `PUNKTFUNK_NV12`, and the
**P010** HDR/10-bit variant.
- **Tier 3B** — auto-gated REALTIME D3DKMT class (HAGS + VRAM-headroom gate).
- **Tier 3C** — `VideoProcessorBlt` directly from the DDA surface (benchmark-only experiment).
- **Tier 3D** — correct async NVENC two-thread submit/retrieve pipeline (strictly measure-first).
- **Tier 4** — GL2 intra-refresh for RFI/recovery; GL1/GL6 sub-frame slice output + per-slice paced
send (paced-send half already shipped).
Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
closes and the saturated floor rises, but a residual ceiling remains — at 100% GPU the game
physically cannot also render the game *and* compose 240 unique frames, and WDDM/NVIDIA preemption
granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it
with encoder micro-optimisations.