Files
punktfunk/docs/host-latency-plan.md
T
enricobuehler 112a054c35 perf(host): latency hardening for the game-vs-encode GPU contention collapse
Verified, prioritized analysis in docs/host-latency-plan.md (multi-agent
investigation + adversarial verification). Lands the two low-risk tiers:

Tier 2B — Linux scheduling hygiene:
- boost_thread_priority now nices the capture/encode (-10) and send (-5)
  threads on Linux (setpriority, best-effort; no-op without CAP_SYS_NICE),
  and the wrong "gamescope caps the game" doc-comment is corrected.
- CUDA context created with CU_CTX_SCHED_BLOCKING_SYNC (frees a core on the
  shared box instead of busy-spinning on completion).
- Copies moved off the default stream onto a per-thread highest-priority
  CUDA stream (cuStreamCreateWithPriority, graceful NULL-stream fallback)
  with a per-stream sync that no longer blocks on the other worker thread's
  in-flight copies. Stream priority is measure-then-keep (NVIDIA Linux may
  ignore it); never regresses.

Tier 3A — Windows session tuning (new session_tuning.rs, raw C-ABI FFI,
no-op off Windows): once-per-process 1ms timer + DwmEnableMMCSS + HIGH
priority class; per-thread MMCSS "Games" + keep-display-awake. Wired into
both the native (boost_thread_priority) and GameStream (stream.rs) paths.
We had zero session tuning before (Apollo streaming_will_start parity).

Tier 2A (Linux NV12 convert) is specified but intentionally not landed:
it is colour-correctness-critical and needs A/B validation on a GPU box
with a display (green-screen risk). Builds + clippy + fmt green on Linux.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 23:05:57 +00:00

20 KiB
Raw Blame History

Host latency & the GPU-contention collapse — analysis + prioritized plan

Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: latency, and specifically the "saturating game starves the stream" headache:

CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding (GPU-100%) scene it collapses to 40-50. Capping the game is not an acceptable fix.

This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the Apollo comparison + external NVIDIA/streaming research) followed by an adversarial verification pass — every candidate fix was attacked, against our actual code, to separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the placebos.

Implementation status (2026-06-18)

  • Tier 2B — Linux scheduling hygiene: landed. boost_thread_priority now nices the capture/encode + send threads on Linux (setpriority, best-effort) and its wrong gamescope doc-comment is fixed; CUDA context uses CU_CTX_SCHED_BLOCKING_SYNC; copies run on a per-thread highest-priority CUDA stream (cuStreamCreateWithPriority, graceful NULL-stream fallback) with a per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt green. The stream-priority hint is measure-then-keep (NVIDIA Linux may ignore it).
  • Tier 3A — Windows session tuning: landed (session_tuning.rs, raw C-ABI FFI, no-op off Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer, DwmEnableMMCSS, HIGH_PRIORITY_CLASS) and per-thread MMCSS "Games" + keep-display-awake. Wired into both the native (boost_thread_priority) and GameStream (stream.rs) paths. Linux no-op path builds green; the Windows path is validated by the Windows CI runner / on-box.
  • Tier 2A — Linux NV12 convert: specified to the code level (below) but not landed — it is a ~300-line, colour-correctness-critical change that cannot be A/B-validated on the headless dev VM (no display; the project has already been burned by the exact green-screen failure mode this risks — Steam-Deck SEPARATE_LAYERS bug). Execute + A/B it on a GPU box with a display.

0. Three corrections to the mental model (read first)

(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards. NVENC's encode core is YUV-native. RGB input makes the driver insert an RGB→YUV CSC on the SM/3D-compute cores — the exact engine a game saturates. Windows already does the right thing: convert_to_yuv runs the CSC on the dedicated VIDEO engine via VideoProcessorBlt (capture/dxgi.rs:1023,1063), logged as "0% 3D". Linux still feeds NVENC RGB (encode/linux.rs:98 nvenc_inputRGBZ/BGRZ; the zerocopy/egl.rs:98 shader is a .bgra swizzle, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.

(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a hardware ceiling. We ship D3DKMTSetProcessSchedulingPriorityClass=HIGH(4) + SetGPUThreadPriority(0x4000001E) + SetMaximumFrameLatency(1) (capture/dxgi.rs:160-263). The residual ~20 ms lock_bitstream wall (documented at dxgi.rs:155) is GPU context-scheduling latency, bounded by preemption granularity: NVIDIA preempts compute at instruction level (~0.1 ms) but graphics only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a draw flood). No priority class preempts an in-flight game draw. So the winning strategy is not more priority — it is (1) do less work on the contended graphics/3D engine, and (2) overlap the unavoidable per-frame scheduling wait across frames to recover throughput.

(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it. DXGI Desktop Duplication and WGC both capture from the DWM compositor, so captured fps is hard-ceilinged at the compose rate, never the game's 400 fps. Under saturation the compositor itself is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And borderless/fullscreen games on Independent/Direct Flip present straight to scanout, bypassing DWM, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at target_fps and re-encodes held frames, so transmitted fps stays ~240 while unique fps collapses. This must be measured before blaming encode.

Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. Linux is the least-hardened host and holds most of the headroom.


Tier 0 — Diagnose first (cheap, decisive, do before writing code)

Everything below is gated on knowing which bucket the collapse is in. We already have the tooling.

  1. Run the workload with PUNKTFUNK_PERF=1 and read uniq vs fps. The uniq counter (genuinely-new captured frames vs re-encoded holds) already exists (gamestream/stream.rs:332-336,403; wgc_helper.rs:122-183). Under CS2 at GPU-100%:
    • fps≈240 but uniq→40-50 ⇒ the source/compositor only produced 40-50 unique frames. No encode/priority/cadence fix on our side exceeds that — it is the game's effective present-to-compose rate at 100% GPU. The lever there is reducing our own per-frame GPU steal (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
    • both fps and uniq→40-50 ⇒ our capture→convert→encode round-trip is being starved (the lock_bitstream scheduling stall). The Tier 1/2 contention levers apply directly.
  2. Confirm the game's flip mode on Windows. If the game is on Independent/Direct Flip (MPO), capture is bypassing DWM and seeing half the frames. We already have capture/composed_flip.rs — verify ForceComposedFlip is actually engaged on the game path, and watch cap_us.
  3. Capture cap_us / enc_us / pace_us p50/p99 alongside, to localise the stall.

Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This headless dev VM cannot reproduce the contention.


Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)

1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)

The capture ceiling is the compositor's compose rate, and under load the compositor gets starved. Levers, in order:

  • Force Composed Flip on Windows for the game path (defeat MPO/flip-metering frame loss). Machinery exists (composed_flip.rs); confirm it engages and measure the unique-frame delta.
  • Opt-in "double-refresh" virtual output: create the per-session virtual output at ~2× the client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since we already mint arbitrary-mode virtual outputs). Gate off by default and never on the gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated engine). PUNKTFUNK_OUTPUT_HZ_MULTIPLIER.
  • Reflex / render-queue=0 style headroom (non-capping): documented as the substitute for an fps cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what we can influence from the host side.

Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.

1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)

NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame — a hidden latency tax, most visible in easy scenes (the ~200-should-be-240 case). Sunshine ships this as nvenc_latency_over_power and calls it decisive. Neither host does it.

  • Windows: NvAPI per-application DRS profile PREFERRED_PSTATE = PREFER_MAX scoped to our exe (not a global override). Load nvapi64.dll dynamically; treat NvAPI_Initialize failure as "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). Crash-safe undo is mandatory: write an undo record to %ProgramData%\punktfunk\ before applying and revert a stale profile on next startup — a crash must not leave the user's control panel modified.
  • Linux: prefer the root-free path — disable the CUDA "Force P2 State" downclock that context creation triggers (env/per-context), and nvidia-smi -pm 1 (persistence) where permitted. nvmlDeviceSetGpuLockedClocks needs root/CAP_SYS_ADMIN (our host runs as a normal user → silent no-op) and is brittle across SKUs; if used, query nvmlDeviceGetMaxClockInfo, lock to that, and restore on teardown and via a SIGTERM/panic handler.
  • Gate behind PUNKTFUNK_PIN_CLOCKS; default OFF on battery / Steam Deck (thermal/power caps make pinning actively harmful there).

Impact: reliable, modest p99 / easy-scene win on both OSes. Does not fix the saturated-scene collapse (at 100% util the clock is already maxed). Low cost.


Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)

2A. Produce NV12/P010 on Linux and feed it to NVENC native (delete the SM-side CSC)

The strictly-correct version (verified): extend the existing GL de-tile blit (zerocopy/egl.rs) to emit NV12 instead of swizzled BGRx — multi-render-target (GL_R8 luma full-res + GL_RG8 chroma half-res, or two passes) applying an explicit BT.709 limited-range matrix matching the Windows VideoConverter (dxgi.rs:957) so hosts look identical — then register NV_ENC_BUFFER_FORMAT_NV12 with the encoder (teach encode/linux.rs:98 nvenc_input an NV12 case; CudaHw sw_formatAV_PIX_FMT_NV12).

  • Net: today = GL swizzle (3D) + NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the swizzle it replaces) + zero NVENC CSC. Removes one whole CSC pass and removes it from the SM.
  • Do not implement this as a standalone CUDA convert kernel on the tiled path — CUDA can't sample a tiled NVIDIA surface (cuGraphicsEGLRegisterImage is Tegra-only, egl.rs:6-12), so it would still need the GL detile, and a CUDA kernel runs on the same saturated SM. The CUDA-kernel route is only clean on the LINEAR/Vulkan-bridge (gamescope) path, where it doubles as the NV12 producer; do it there if/when that path needs it.
  • Pitfalls: pervasive 4-byte-pixel assumptions break with NV12 — cuda.rs hardcodes WidthInBytes = width*4 (:363,392,499), BufferPool/alloc_pitched assume 4 B/px, GL dst is GL_RGBA8; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12.
  • Impact: real, modest, compounding — a few ms of per-frame GPU time and a shorter time-slice need, which stacks with cadence + power-pin. Not a standalone cure for the 240→40 collapse (external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them). Medium cost. Gate behind a PUNKTFUNK_* env and A/B cap→encoded p50 + the CS2 fps floor.

2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")

Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in priority:

  • Arm the Linux boost_thread_priority no-op (punktfunk1.rs:1856 cfg branch): best-effort libc::setpriority(PRIO_PROCESS, 0, -10/-5) on the calling thread (tid 0 = self), log-and-continue on EPERM. Do not default to SCHED_RR/FIFO (can starve the compositor and the game's render thread — the user refuses to add game frame-time); offer it only behind PUNKTFUNK_SCHED_RR=1. Fix the wrong doc-comment at punktfunk1.rs:1834-1835 ("the Linux host caps the game via gamescope, so its threads aren't starved") — false for the uncapped/NVIDIA-direct path.
  • Set CUDA context scheduling deliberately: cuCtxCreate flag CU_CTX_SCHED_BLOCKING_SYNC on this shared VM (frees a core vs the default AUTO/SPIN) — a CPU-efficiency fix, not throughput.
  • High-priority CUDA stream + EGL context priority (the missing analogue of the Windows hardening): cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange) for our copies; request EGL_IMG_context_priority HIGH (try EGL_NV_context_priority_realtime) at egl.rs:332. Caveat, load-bearing: these are intra-process hints and NVIDIA's Linux driver has been reported to ignore context priority (driver 545: high- vs low-priority EGL contexts measured identical) and to deny realtime Vulkan queues. Implement with graceful fallback, gate behind env, and measure on driver 595 — do not architect around it or credit it before measurement.

Explicitly not doing on Linux: Vulkan VK_EXT_global_priority as "the" lever (it only touches the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA). Replacing cuCtxSynchronize with a per-stream event chain for contention reasons (it's per-context, never waited on the game's separate context — a non-fix; keep the full sync where it guards dmabuf recycle, egl.rs:491).


Tier 3 — Windows parity polish (Windows is already strong)

  • 3A. Host-process session tuning (we have zero today — verified): NtSetTimerResolution(0.5ms) / timeBeginPeriod(1) (default 15.6 ms granularity blocks precise pacing), DwmEnableMMCSS(true), SetPriorityClass(HIGH_PRIORITY_CLASS), MMCSS-register the capture/encode threads ("Games"/"Pro Audio"), SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED). All revert on stop. Foundational for any precise frame pacing and the encode|send split. Low cost, low risk. (gamestream/stream.rs start/stop; Apollo's streaming_will_start/_stopped.)
  • 3B. Auto-gated REALTIME D3DKMT class instead of fixed HIGH (the realtime opt-in already exists at dxgi.rs:199-207): probe HAGS (D3DKMTQueryAdapterInfo HwSchEnabled) and VRAM headroom (IDXGIAdapter3::QueryVideoMemoryInfo, continuously), allow REALTIME(5) only when safe (HAGS off, or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises — Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling rung, same preemption ceiling), so rank it as cheap parity, not a fix.
  • 3C. Cheap experiment — VideoProcessorBlt directly from the DDA surface (skip the same-format gpu_copy at dxgi.rs:2375), then ReleaseFrame, iff it doesn't re-serialize AcquireNextFrame (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming the Blt is on the video engine). One-line source-texture change; benchmark only. Do not build a D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM (~5% 3D, no PCIe), not worth the interop rebuild.
  • 3D. Async NVENC + off-thread retrieve — measure-gated, uncertain. Today retrieve (lock_bitstream) runs inline on the submit thread (nvenc.rs:524-558), which is why depth>1 was measured to regress (wgc_helper.rs:111-114). The NVENC guide mandates submit/retrieve on separate threads with completion events + a deep surface pool; doing that could let per-frame scheduling waits overlap across frames and recover throughput — at a per-frame latency cost (depth × frame time). This is the one place the research and our own prior measurement disagree, so it is strictly measure-first, and it forecloses slice output (reportSliceOffsets needs enableEncodeAsync=0). Treat as a structural experiment, not a committed win.

Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)

  • GL2 — Intra-refresh for RFI/recovery (enableIntraRefresh + recovery-point SEI) instead of a forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met. Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win, orthogonal to the collapse.
  • GL1 + GL6 — Sub-frame slice output + per-slice paced send (the roadmap's "~2-4 ms lever"): enableSubFrameWrite + sliceMode + transmit each slice as it completes. Big: needs the direct NVENC SDK on Linux (libavcodec emits whole AUs) and a per-slice wire/FEC redesign in punktfunk-core (today PacketHeader/Packetizer/reassembler are whole-AU; per-slice FEC blocks wreck Leopard efficiency) and client slice-granular submit. Gate on NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK (often absent on consumer GeForce). The paced-send half is already shipped (stream.rs spawn_sender, punktfunk1.rs paced_submit) — don't re-implement.

Dropped / why (so we don't re-propose placebo)

Candidate Verdict Why
Feed NVENC ARGB to "offload CSC to ASIC" ✗ backwards RGB input forces CSC onto the SM; YUV-native is correct (see §0A).
Replace cuCtxSynchronize with per-stream event chain for contention cuCtxSynchronize is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle.
Vulkan VK_EXT_global_priority as the Linux priority lever Touches only the minority gamescope/LINEAR vkCmdCopyBuffer, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority.
Async NVENC as a throughput/collapse fix ✗ (→ measure-gated 3D) Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our depth>1 regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D).
D3D12 copy-queue offload of the DDA copy Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild.
Empty-frame (LastPresentTime==0) skip ✗ for this Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip.
GL5 — set ULL RC knobs explicitly ✗ (audit only) ULL preset already sets zeroReorderDelay=1, lookahead/multipass/AQ off; ffmpeg defaults match + we set bf=0. Only lowDelayKeyFrameScale=1 is non-redundant → fold into GL2 (Windows SDK path only).
GL3 — true ref-frame invalidation ✗ for this No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no nvEncInvalidateRefFrames; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness.
GL4 — move input injection off the ENet thread ✗ for this CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness.
SCHED_RR/FIFO by default (Linux) ✗ default Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only.

  1. Tier 0 diagnose on the real boxes — settles whether the collapse is source-ceiling or pipeline-starvation, and whether flip-bypass is halving capture.
  2. Tier 2A (Linux NV12) + Tier 2B (Linux scheduling hygiene) — the largest in-our-control headroom; Linux is the least-hardened host.
  3. Tier 1B (clock/power pin) both OSes — cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
  4. Tier 1A (cadence/flip) — gated on Tier 0 (this is where a big chunk of the collapse may live).
  5. Tier 3 (Windows polish) — session tuning is the clear win; the rest is parity.
  6. Tier 4 — only after the contention work; intra-refresh first, slice pipelining last.

Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap closes and the saturated floor rises, but a residual ceiling remains — at 100% GPU the game physically cannot also render the game and compose 240 unique frames, and WDDM/NVIDIA preemption granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it with encoder micro-optimisations.