punktfunk/design/host-latency-plan.md
enricobuehler 7b99b41ede docs(design): trim shipped plans, consolidate cluster, add index
Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).

- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
  apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
  host-latency, gpu-contention (fixed stale status table), game-library,
  linux-setup (fixed m0->spike + stale zero-copy claim),
  session-aware-host-followups, windows-client-bootstrap,
  windows-dualsense-{scoping,game-detection}, windows-virtual-display,
  security-review (per-finding status table; #12 still open),
  apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
  windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
  merged, M4 done); windows-secure-desktop.md archived (now a fallback
  behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
  roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:39:06 +00:00


Host latency & the GPU-contention collapse — analysis + prioritized plan

Status: PARTLY SHIPPED. Tier 2A (Linux NV12 convert) = 1fc6f73; Tier 2B (Linux scheduling) + Tier 3A (Windows session tuning) = 112a054. Tiers 1A, 1B, 3B, 3C, 3D, 4 are still open. This doc is trimmed to design rationale + open items; the shipped code is the source of truth for the landed tiers.

⚠ Partially superseded (2026-06-25) by gpu-contention-investigation.md. That follow-up re-verified this plan against the current code and overturned several specifics: the default Windows path (IDD-push) now feeds NVENC RGB (regressing the §0A "Windows does it right" claim); PUNKTFUNK_ENCODE_DEPTH never existed (phantom knob); the "async NVENC stacks latency" result was a same-thread implementation, not a disproof of a correct two-thread pipeline; "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on Windows. For current action prioritization see gpu-contention-investigation.md. The tiers/dropped-placebo analysis below remain a useful record.

Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: latency, and specifically the "saturating game starves the stream" headache:

CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding (GPU-100%) scene it collapses to 40-50. Capping the game is not an acceptable fix.

This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the Apollo comparison + external NVIDIA/streaming research) followed by an adversarial verification pass — every candidate fix was attacked, against our actual code, to separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the placebos.


Mental model (§0A-0C) — see the follow-up

The original three-correction mental model (A: feeding NVENC RGB is backwards; B: GPU priority is maxed on Windows and hits a preemption-granularity ceiling; C: a chunk of the collapse is upstream of the encoder at the compositor compose-rate, with Independent/Direct Flip bypassing DWM) is partly corrected by gpu-contention-investigation.md §1 — notably that the default Windows IDD-push path now feeds NVENC RGB (so §0A's "Windows already does the right thing" no longer holds), and "capture sees half the frames" is DLSS-Frame-Gen-specific rather than general. Read the follow-up doc for the corrected model. The durable takeaways still stand: do less work on the contended graphics/3D engine, overlap the unavoidable per-frame scheduling wait across frames, and measure source-vs-pipeline before blaming encode.


Tier 0 — Diagnose first (cheap, decisive, do before writing code)

Everything below is gated on knowing which bucket the collapse is in. We already have the tooling.

  1. Run the workload with PUNKTFUNK_PERF=1 and read uniq vs fps. The uniq counter (genuinely-new captured frames vs re-encoded holds) already exists (gamestream/stream.rs:332-336,403; wgc_helper.rs:122-183). Under CS2 at GPU-100%:
    • fps≈240 but uniq→40-50 ⇒ the source/compositor only produced 40-50 unique frames. No encode/priority/cadence fix on our side exceeds that — it is the game's effective present-to-compose rate at 100% GPU. The lever there is reducing our own per-frame GPU steal (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
    • both fps and uniq→40-50 ⇒ our capture→convert→encode round-trip is being starved (the lock_bitstream scheduling stall). The Tier 1/2 contention levers apply directly.
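  2. Confirm the game's flip mode on Windows. If the game is on Independent/Direct Flip (MPO), capture bypasses DWM and can lose frames (the original "sees half the frames" figure was later found to be DLSS-Frame-Gen-specific rather than general; see gpu-contention-investigation.md). We already have capture/composed_flip.rs: verify ForceComposedFlip is actually engaged on the game path, and watch cap_us.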
  3. Capture cap_us / enc_us / pace_us p50/p99 alongside, to localise the stall.
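
The uniq-vs-fps reading in step 1 amounts to a two-way classification. A minimal sketch, assuming a hypothetical `classify` helper and a hypothetical `tolerance` knob (neither exists in the codebase as far as this doc states); the real counters come from PUNKTFUNK_PERF=1:

```rust
/// Which bucket the collapse falls into, per the Tier 0 reading of `uniq` vs `fps`.
#[derive(Debug, PartialEq, Eq)]
pub enum Bucket {
    /// Both counters are near the target: nothing to diagnose.
    Healthy,
    /// fps holds but uniq collapses: the source/compositor is the ceiling.
    SourceLimited,
    /// fps and uniq collapse together: capture -> convert -> encode is starved.
    PipelineStarved,
}

/// Classify one PERF sample.
/// `tolerance` is the fraction of `target_fps` below which a counter counts
/// as "collapsed". It is an illustrative threshold, not a measured one.
pub fn classify(target_fps: f64, fps: f64, uniq: f64, tolerance: f64) -> Bucket {
    let floor = target_fps * tolerance;
    match (fps >= floor, uniq >= floor) {
        (true, true) => Bucket::Healthy,
        (true, false) => Bucket::SourceLimited,
        // Once fps itself has collapsed, the pipeline is the suspect
        // regardless of what uniq reads.
        (false, _) => Bucket::PipelineStarved,
    }
}
```

With a 240 fps target and `tolerance = 0.8`, a sample of fps≈240 / uniq≈45 lands in `SourceLimited`, while fps≈45 / uniq≈45 lands in `PipelineStarved`, matching the two sub-bullets of step 1.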

Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This headless dev VM cannot reproduce the contention.


Tier 1 — The two under-weighted, cross-platform levers (OPEN — confirmed by research, not yet done)

1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)

The capture ceiling is the compositor's compose rate, and under load the compositor gets starved. Levers, in order:

  • Force Composed Flip on Windows for the game path (defeat MPO/flip-metering frame loss). Machinery exists (composed_flip.rs); confirm it engages and measure the unique-frame delta.
  • Opt-in "double-refresh" virtual output: create the per-session virtual output at ~2× the client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since we already mint arbitrary-mode virtual outputs). Gate off by default and never on the gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated engine). PUNKTFUNK_OUTPUT_HZ_MULTIPLIER.
  • Reflex / render-queue=0 style headroom (non-capping): documented as the substitute for an fps cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what we can influence from the host side.

Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
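
The gating described above (off by default, never on the game-attach path) could be sketched as follows. The function name, the `game_attach` flag, and the clamp to at most 2× are assumptions made for illustration; only the PUNKTFUNK_OUTPUT_HZ_MULTIPLIER name comes from this plan.

```rust
/// Largest multiplier accepted (assumed; the plan only talks about ~2x).
const MAX_MULTIPLIER: u32 = 2;

/// Refresh rate to request for the per-session virtual output.
///
/// `raw_multiplier` is the value of PUNKTFUNK_OUTPUT_HZ_MULTIPLIER, if set.
/// `game_attach` is true on the gamescope/SudoVDA game-attach path, where the
/// multiplier must never apply.
pub fn virtual_output_hz(client_hz: u32, raw_multiplier: Option<&str>, game_attach: bool) -> u32 {
    if game_attach {
        return client_hz;
    }
    let multiplier = raw_multiplier
        .and_then(|s| s.trim().parse::<u32>().ok())
        // Unset, unparsable, zero, or out-of-range values all fall back to 1x.
        .filter(|m| (1..=MAX_MULTIPLIER).contains(m))
        .unwrap_or(1);
    client_hz.saturating_mul(multiplier)
}
```

A caller would pass `std::env::var("PUNKTFUNK_OUTPUT_HZ_MULTIPLIER").ok().as_deref()` as `raw_multiplier`. Falling back to 1× on any bad input keeps the default behaviour unchanged, which fits the "measure before shipping it on by default" rule.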

1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)

NVIDIA's adaptive P-state logic downclocks between our small bursty frames and pays a clock ramp on every frame: a hidden latency tax, most visible in easy scenes (the ~200-should-be-240 case). Sunshine ships the fix (preferring latency over power) as nvenc_latency_over_power and calls it decisive. Neither of our hosts (Windows or Linux) does this yet.

  • Windows: NvAPI per-application DRS profile PREFERRED_PSTATE = PREFER_MAX scoped to our exe (not a global override). Load nvapi64.dll dynamically; treat NvAPI_Initialize failure as "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). Crash-safe undo is mandatory: write an undo record to %ProgramData%\punktfunk\ before applying and revert a stale profile on next startup — a crash must not leave the user's control panel modified.
  • Linux: prefer the root-free path — disable the CUDA "Force P2 State" downclock that context creation triggers (env/per-context), and nvidia-smi -pm 1 (persistence) where permitted. nvmlDeviceSetGpuLockedClocks needs root/CAP_SYS_ADMIN (our host runs as a normal user → silent no-op) and is brittle across SKUs; if used, query nvmlDeviceGetMaxClockInfo, lock to that, and restore on teardown and via a SIGTERM/panic handler.
  • Gate behind PUNKTFUNK_PIN_CLOCKS; default OFF on battery / Steam Deck (thermal/power caps make pinning actively harmful there).

Impact: reliable, modest p99 / easy-scene win on both OSes. Does not fix the saturated-scene collapse (at 100% util the clock is already maxed). Low cost.
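
The crash-safe undo requirement for the Windows DRS change boils down to a write-ahead record: persist the previous value before applying, and revert any record found at startup. A minimal std-only sketch, with the NvAPI calls abstracted as closures; the file name, function names, and record format are all hypothetical.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// File name of the undo record (hypothetical).
const UNDO_FILE: &str = "pstate-undo.txt";

fn undo_path(dir: &Path) -> PathBuf {
    dir.join(UNDO_FILE)
}

/// Startup or clean shutdown: if an undo record exists, hand the saved value
/// to `restore` and delete the record. Returns true when a record was reverted.
pub fn revert_stale(dir: &Path, restore: impl FnOnce(&str) -> io::Result<()>) -> io::Result<bool> {
    let path = undo_path(dir);
    match fs::read_to_string(&path) {
        Ok(previous) => {
            restore(previous.trim())?;
            fs::remove_file(&path)?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}

/// Persist the previous setting and flush it to disk *before* running `apply`,
/// so a crash at any later point still leaves a record to revert from.
/// If `apply` fails the record is kept; the next `revert_stale` cleans it up.
pub fn pin_with_undo(dir: &Path, previous: &str, apply: impl FnOnce() -> io::Result<()>) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut file = fs::File::create(undo_path(dir))?;
    file.write_all(previous.as_bytes())?;
    file.sync_all()?;
    apply()
}
```

On Windows `dir` would be the %ProgramData%\punktfunk\ directory named above. Calling `revert_stale` both at startup and at normal teardown means there is only one revert path to get right.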


Tier 2 — Linux work-deletion + scheduling hygiene

2A. Linux NV12 convert — SHIPPED (1fc6f73)

GL de-tile blit emits NV12 (BT.709 limited) on the GPU and feeds NVENC native YUV, deleting NVENC's internal RGB→YUV CSC off the contended SM. Gated PUNKTFUNK_NV12 (default OFF). Tiled EGL/GL path only; LINEAR/Vulkan-bridge (gamescope) stays RGB. Validated colour-correct on RTX 5070 Ti. Open follow-up: glass-to-glass latency + CS2 fps-under-saturation A/B before flipping the default, and the P010 variant for the HDR/10-bit path. Code is the source of truth (zerocopy/egl.rs, encode/linux.rs).
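
For checking colour correctness of the NV12 output, a CPU reference for the per-pixel maths is handy. The sketch below is the standard BT.709 limited-range conversion (Kr = 0.2126, Kb = 0.0722, luma scaled to 16..235, chroma to 16..240); it is not a transcription of the shader in zerocopy/egl.rs, and the function name is hypothetical. NV12 then stores a full-resolution Y plane followed by a half-resolution interleaved CbCr plane.

```rust
/// Convert one full-range 8-bit RGB pixel to 8-bit BT.709 limited-range Y'CbCr.
pub fn rgb_to_ycbcr_bt709_limited(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    // BT.709 luma coefficients.
    const KR: f64 = 0.2126;
    const KB: f64 = 0.0722;
    const KG: f64 = 1.0 - KR - KB;

    // Normalise to 0.0..=1.0.
    let (r, g, b) = (r as f64 / 255.0, g as f64 / 255.0, b as f64 / 255.0);

    // Luma in 0..=1, chroma differences in -0.5..=0.5.
    let y = KR * r + KG * g + KB * b;
    let cb = (b - y) / (2.0 * (1.0 - KB));
    let cr = (r - y) / (2.0 * (1.0 - KR));

    // Limited ("video") range: Y in 16..=235, Cb/Cr in 16..=240 centred on 128.
    let quantise = |v: f64| v.round().clamp(0.0, 255.0) as u8;
    (
        quantise(16.0 + 219.0 * y),
        quantise(128.0 + 224.0 * cb),
        quantise(128.0 + 224.0 * cr),
    )
}
```

Spot values: black maps to (16, 128, 128), white to (235, 128, 128), and pure red to (63, 102, 240). Comparing a few decoded pixels against these is a cheap sanity check before the glass-to-glass A/B.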

2B. Linux scheduling hygiene — SHIPPED (112a054)

boost_thread_priority nices capture/encode/send on Linux (best-effort setpriority); CUDA context uses CU_CTX_SCHED_BLOCKING_SYNC; copies run on a per-thread highest-priority CUDA stream (cuStreamCreateWithPriority, NULL-stream fallback). The stream-priority hint is measure-then-keep (NVIDIA Linux may ignore it). Do not default to SCHED_RR/FIFO (can starve the compositor + the game's render thread); opt-in only behind PUNKTFUNK_SCHED_RR=1. Code is the source of truth (punktfunk1.rs).

Explicitly not doing on Linux:

  • Vulkan VK_EXT_global_priority as "the" lever: it only touches the minority gamescope/LINEAR copy, not the convert, and is likely a silent no-op on consumer NVIDIA.
  • Replacing cuCtxSynchronize with a per-stream event chain for contention reasons: the sync is per-context and never waits on the game's separate context, so it is a non-fix. Keep the full sync where it guards dmabuf recycle (egl.rs:491).


Tier 3 — Windows parity polish (Windows is already strong)

3A. Host-process session tuning — SHIPPED (112a054)

session_tuning.rs (raw C-ABI FFI, no-op off Windows): each capture/encode/send thread applies process-wide tuning once (1 ms timer, DwmEnableMMCSS, HIGH_PRIORITY_CLASS) + per-thread MMCSS "Games" + keep-display-awake; reverts on stop. Wired into both native (boost_thread_priority) and GameStream (stream.rs) paths. FFI validated on the real MSVC toolchain.

3B. Auto-gated REALTIME D3DKMT class (OPEN)

Instead of a fixed HIGH class (the realtime opt-in already exists at dxgi.rs:199-207), probe HAGS (D3DKMTQueryAdapterInfo HwSchEnabled) and VRAM headroom (IDXGIAdapter3::QueryVideoMemoryInfo, polled continuously). Allow REALTIME(5) only when safe (HAGS off, or HAGS on + VRAM comfortably below budget), and downgrade to HIGH the moment VRAM pressure rises. This is Sunshine's actual gate; it avoids the HAGS + near-full-VRAM NVENC freeze/crash. The gain is marginal (one scheduling rung, same preemption ceiling), so rank it as cheap parity, not a fix.
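
The gate itself is a small pure function once the two probes have been read. A sketch under stated assumptions: the type and function names are hypothetical, and the 0.85 "comfortable" fraction is a placeholder, since this plan does not fix a threshold. `vram_usage` and `vram_budget` stand for the CurrentUsage and Budget fields that QueryVideoMemoryInfo returns.

```rust
/// GPU scheduling priority class to request for the session.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuPriority {
    High,
    Realtime,
}

/// Fraction of the VRAM budget below which usage counts as "comfortable".
/// Assumed value for illustration; it would need tuning on real hardware.
const COMFORTABLE_FRACTION: f64 = 0.85;

/// Decide the priority class from the current HAGS state and VRAM readings.
/// Intended to be re-evaluated on every poll so that rising VRAM pressure
/// downgrades to High immediately.
pub fn choose_priority(hags_enabled: bool, vram_usage: u64, vram_budget: u64) -> GpuPriority {
    if !hags_enabled {
        // HAGS off: REALTIME is considered safe regardless of VRAM.
        return GpuPriority::Realtime;
    }
    let comfortable =
        vram_budget > 0 && (vram_usage as f64) < (vram_budget as f64) * COMFORTABLE_FRACTION;
    if comfortable {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Treating a zero budget as "not comfortable" makes a failed or missing probe fall back to HIGH, which is the conservative direction.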

3C. VideoProcessorBlt directly from the DDA surface (OPEN — cheap experiment)

Skip the same-format gpu_copy at dxgi.rs:2375, then ReleaseFrame, iff it doesn't re-serialize AcquireNextFrame (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming the Blt is on the video engine). One-line source-texture change; benchmark only. Do not build a D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM (~5% 3D, no PCIe), not worth the interop rebuild.

3D. Async NVENC + off-thread retrieve (OPEN — measure-gated, uncertain)

Today retrieve (lock_bitstream) runs inline on the submit thread (nvenc.rs:524-558), which is why depth>1 was measured to regress (wgc_helper.rs:111-114). The NVENC guide mandates submit/retrieve on separate threads with completion events + a deep surface pool; doing that could let per-frame scheduling waits overlap across frames and recover throughput — at a per-frame latency cost (depth × frame time). This is the one place the research and our own prior measurement disagree, so it is strictly measure-first, and it forecloses slice output (reportSliceOffsets needs enableEncodeAsync=0). Treat as a structural experiment, not a committed win. (The follow-up doc notes the prior "async stacks latency" result was a same-thread implementation, not a disproof of a correct two-thread pipeline.)
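
The shape of the two-thread pipeline can be shown without any NVENC calls. The sketch below uses a bounded std channel as a stand-in for the surface pool: the submit thread blocks once roughly `depth` encodes are outstanding, and the retrieve thread drains them in order. All names are hypothetical; a real implementation would carry an output buffer and completion event in `InFlight` and call the encoder where the comments indicate.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for one in-flight encode.
pub struct InFlight {
    pub frame_index: u64,
}

/// Push `frames` frames through a submit thread and a retrieve thread and
/// return the order in which frames were retrieved.
pub fn run_pipeline(frames: u64, depth: usize) -> Vec<u64> {
    // Bounded queue: models a pool of `depth` output surfaces.
    let (tx, rx) = sync_channel::<InFlight>(depth);

    let submit = thread::spawn(move || {
        for frame_index in 0..frames {
            // Real code would submit the frame to the encoder here.
            // send() blocks while the queue is full, i.e. while the
            // retrieve thread has not yet freed a slot.
            if tx.send(InFlight { frame_index }).is_err() {
                break; // retrieve side has gone away
            }
        }
        // tx is dropped here, which ends the retrieve loop.
    });

    let retrieve = thread::spawn(move || {
        let mut done = Vec::new();
        for job in rx {
            // Real code would wait on the completion event and then
            // lock/copy/unlock the bitstream for this frame.
            done.push(job.frame_index);
        }
        done
    });

    submit.join().expect("submit thread panicked");
    retrieve.join().expect("retrieve thread panicked")
}

/// Worst-case queueing latency added by the pipeline: depth x frame time.
pub fn added_latency_ms(depth: u32, fps: f64) -> f64 {
    depth as f64 * 1000.0 / fps
}
```

At 240 fps a frame is about 4.17 ms, so a depth of 3 can add up to 12.5 ms. That is the cost side of the trade, and the reason this stays measure-first.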


Tier 4 — Deferred 2nd-order latency (OPEN — not contention fixes; do after Tiers 0-2)

  • GL2 — Intra-refresh for RFI/recovery (enableIntraRefresh + recovery-point SEI) instead of a forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met. Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win, orthogonal to the collapse.
  • GL1 + GL6 — Sub-frame slice output + per-slice paced send (the roadmap's "~2-4 ms lever"): enableSubFrameWrite + sliceMode + transmit each slice as it completes. Big: needs the direct NVENC SDK on Linux (libavcodec emits whole AUs) and a per-slice wire/FEC redesign in punktfunk-core (today PacketHeader/Packetizer/reassembler are whole-AU; per-slice FEC blocks wreck Leopard efficiency) and client slice-granular submit. Gate on NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK (often absent on consumer GeForce). The paced-send half is already shipped (stream.rs spawn_sender, punktfunk1.rs paced_submit) — don't re-implement.

Dropped / why (so we don't re-propose placebo)

| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace cuCtxSynchronize with per-stream event chain for contention | | cuCtxSynchronize is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan VK_EXT_global_priority as the Linux priority lever | | Touches only the minority gamescope/LINEAR vkCmdCopyBuffer, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a throughput/collapse fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our depth>1 regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (LastPresentTime==0) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets zeroReorderDelay=1, lookahead/multipass/AQ off; ffmpeg defaults match + we set bf=0. Only lowDelayKeyFrameScale=1 is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no nvEncInvalidateRefFrames; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |

Open items / What's left

For current action prioritization see gpu-contention-investigation.md. Still-open work tracked by this doc:

  • Tier 0 — run the PUNKTFUNK_PERF=1 uniq-vs-fps + flip-mode diagnosis on the real-GPU boxes (gate for everything below).
  • Tier 1A — capture-source / compose-rate cadence levers (ForceComposedFlip verify; PUNKTFUNK_OUTPUT_HZ_MULTIPLIER double-refresh; Reflex/render-queue=0 headroom).
  • Tier 1B — GPU clock/power pinning (PUNKTFUNK_PIN_CLOCKS; NvAPI per-app DRS on Windows w/ crash-safe undo; root-free CUDA-P2/persistence on Linux; default OFF on battery/Deck).
  • Tier 2A follow-up — glass-to-glass + CS2-floor A/B before defaulting PUNKTFUNK_NV12, and the P010 HDR/10-bit variant.
  • Tier 3B — auto-gated REALTIME D3DKMT class (HAGS + VRAM-headroom gate).
  • Tier 3C — VideoProcessorBlt directly from the DDA surface (benchmark-only experiment).
  • Tier 3D — correct async NVENC two-thread submit/retrieve pipeline (strictly measure-first).
  • Tier 4 — GL2 intra-refresh for RFI/recovery; GL1/GL6 sub-frame slice output + per-slice paced send (paced-send half already shipped).

Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap closes and the saturated floor rises, but a residual ceiling remains. At 100% utilisation the GPU physically cannot both render the game and compose 240 unique frames, and WDDM/NVIDIA preemption granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it with encoder micro-optimisations.