Verified, prioritized analysis in docs/host-latency-plan.md (multi-agent investigation + adversarial verification). Lands the two low-risk tiers: Tier 2B — Linux scheduling hygiene: - boost_thread_priority now nices the capture/encode (-10) and send (-5) threads on Linux (setpriority, best-effort; no-op without CAP_SYS_NICE), and the wrong "gamescope caps the game" doc-comment is corrected. - CUDA context created with CU_CTX_SCHED_BLOCKING_SYNC (frees a core on the shared box instead of busy-spinning on completion). - Copies moved off the default stream onto a per-thread highest-priority CUDA stream (cuStreamCreateWithPriority, graceful NULL-stream fallback) with a per-stream sync that no longer blocks on the other worker thread's in-flight copies. Stream priority is measure-then-keep (NVIDIA Linux may ignore it); never regresses. Tier 3A — Windows session tuning (new session_tuning.rs, raw C-ABI FFI, no-op off Windows): once-per-process 1ms timer + DwmEnableMMCSS + HIGH priority class; per-thread MMCSS "Games" + keep-display-awake. Wired into both the native (boost_thread_priority) and GameStream (stream.rs) paths. We had zero session tuning before (Apollo streaming_will_start parity). Tier 2A (Linux NV12 convert) is specified but intentionally not landed: it is colour-correctness-critical and needs A/B validation on a GPU box with a display (green-screen risk). Builds + clippy + fmt green on Linux. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
20 KiB
Host latency & the GPU-contention collapse — analysis + prioritized plan
Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: latency, and specifically the "saturating game starves the stream" headache:
CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding (GPU-100%) scene it collapses to 40-50. Capping the game is not an acceptable fix.
This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the Apollo comparison + external NVIDIA/streaming research) followed by an adversarial verification pass — every candidate fix was attacked, against our actual code, to separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the placebos.
Implementation status (2026-06-18)
- ✅ Tier 2B — Linux scheduling hygiene: landed.
boost_thread_prioritynow nices the capture/encode + send threads on Linux (setpriority, best-effort) and its wrong gamescope doc-comment is fixed; CUDA context usesCU_CTX_SCHED_BLOCKING_SYNC; copies run on a per-thread highest-priority CUDA stream (cuStreamCreateWithPriority, graceful NULL-stream fallback) with a per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt green. The stream-priority hint is measure-then-keep (NVIDIA Linux may ignore it). - ✅ Tier 3A — Windows session tuning: landed (
session_tuning.rs, raw C-ABI FFI, no-op off Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer,DwmEnableMMCSS,HIGH_PRIORITY_CLASS) and per-thread MMCSS "Games" + keep-display-awake. Wired into both the native (boost_thread_priority) and GameStream (stream.rs) paths. Linux no-op path builds green; the Windows path is validated by the Windows CI runner / on-box. - ⏳ Tier 2A — Linux NV12 convert: specified to the code level (below) but not landed — it is
a ~300-line, colour-correctness-critical change that cannot be A/B-validated on the headless dev
VM (no display; the project has already been burned by the exact green-screen failure mode this
risks — Steam-Deck
SEPARATE_LAYERSbug). Execute + A/B it on a GPU box with a display.
0. Three corrections to the mental model (read first)
(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards.
NVENC's encode core is YUV-native. RGB input makes the driver insert an RGB→YUV CSC on the
SM/3D-compute cores — the exact engine a game saturates. Windows already does the right thing:
convert_to_yuv runs the CSC on the dedicated VIDEO engine via VideoProcessorBlt
(capture/dxgi.rs:1023,1063), logged as "0% 3D". Linux still feeds NVENC RGB
(encode/linux.rs:98 nvenc_input → RGBZ/BGRZ; the zerocopy/egl.rs:98 shader is a .bgra
swizzle, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single
biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.
(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a
hardware ceiling. We ship D3DKMTSetProcessSchedulingPriorityClass=HIGH(4) +
SetGPUThreadPriority(0x4000001E) + SetMaximumFrameLatency(1) (capture/dxgi.rs:160-263). The
residual ~20 ms lock_bitstream wall (documented at dxgi.rs:155) is GPU context-scheduling
latency, bounded by preemption granularity: NVIDIA preempts compute at instruction level
(~0.1 ms) but graphics only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a
draw flood). No priority class preempts an in-flight game draw. So the winning strategy is not
more priority — it is (1) do less work on the contended graphics/3D engine, and (2) overlap
the unavoidable per-frame scheduling wait across frames to recover throughput.
(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it.
DXGI Desktop Duplication and WGC both capture from the DWM compositor, so captured fps is
hard-ceilinged at the compose rate, never the game's 400 fps. Under saturation the compositor
itself is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And
borderless/fullscreen games on Independent/Direct Flip present straight to scanout, bypassing
DWM, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at
target_fps and re-encodes held frames, so transmitted fps stays ~240 while unique fps
collapses. This must be measured before blaming encode.
Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. Linux is the least-hardened host and holds most of the headroom.
Tier 0 — Diagnose first (cheap, decisive, do before writing code)
Everything below is gated on knowing which bucket the collapse is in. We already have the tooling.
- Run the workload with
PUNKTFUNK_PERF=1and readuniqvsfps. Theuniqcounter (genuinely-new captured frames vs re-encoded holds) already exists (gamestream/stream.rs:332-336,403;wgc_helper.rs:122-183). Under CS2 at GPU-100%:fps≈240 butuniq→40-50 ⇒ the source/compositor only produced 40-50 unique frames. No encode/priority/cadence fix on our side exceeds that — it is the game's effective present-to-compose rate at 100% GPU. The lever there is reducing our own per-frame GPU steal (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).- both
fpsanduniq→40-50 ⇒ our capture→convert→encode round-trip is being starved (thelock_bitstreamscheduling stall). The Tier 1/2 contention levers apply directly.
- Confirm the game's flip mode on Windows. If the game is on Independent/Direct Flip (MPO),
capture is bypassing DWM and seeing half the frames. We already have
capture/composed_flip.rs— verify ForceComposedFlip is actually engaged on the game path, and watchcap_us. - Capture
cap_us/enc_us/pace_usp50/p99 alongside, to localise the stall.
Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This headless dev VM cannot reproduce the contention.
Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)
1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
The capture ceiling is the compositor's compose rate, and under load the compositor gets starved. Levers, in order:
- Force Composed Flip on Windows for the game path (defeat MPO/flip-metering frame loss).
Machinery exists (
composed_flip.rs); confirm it engages and measure the unique-frame delta. - Opt-in "double-refresh" virtual output: create the per-session virtual output at ~2× the
client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
we already mint arbitrary-mode virtual outputs). Gate off by default and never on the
gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
engine).
PUNKTFUNK_OUTPUT_HZ_MULTIPLIER. - Reflex / render-queue=0 style headroom (non-capping): documented as the substitute for an fps cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what we can influence from the host side.
Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
a hidden latency tax, most visible in easy scenes (the ~200-should-be-240 case). Sunshine ships
this as nvenc_latency_over_power and calls it decisive. Neither host does it.
- Windows: NvAPI per-application DRS profile
PREFERRED_PSTATE = PREFER_MAXscoped to our exe (not a global override). Loadnvapi64.dlldynamically; treatNvAPI_Initializefailure as "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). Crash-safe undo is mandatory: write an undo record to%ProgramData%\punktfunk\before applying and revert a stale profile on next startup — a crash must not leave the user's control panel modified. - Linux: prefer the root-free path — disable the CUDA "Force P2 State" downclock that
context creation triggers (env/per-context), and
nvidia-smi -pm 1(persistence) where permitted.nvmlDeviceSetGpuLockedClocksneeds root/CAP_SYS_ADMIN (our host runs as a normal user → silent no-op) and is brittle across SKUs; if used, querynvmlDeviceGetMaxClockInfo, lock to that, and restore on teardown and via a SIGTERM/panic handler. - Gate behind
PUNKTFUNK_PIN_CLOCKS; default OFF on battery / Steam Deck (thermal/power caps make pinning actively harmful there).
Impact: reliable, modest p99 / easy-scene win on both OSes. Does not fix the saturated-scene collapse (at 100% util the clock is already maxed). Low cost.
Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)
2A. Produce NV12/P010 on Linux and feed it to NVENC native (delete the SM-side CSC)
The strictly-correct version (verified): extend the existing GL de-tile blit
(zerocopy/egl.rs) to emit NV12 instead of swizzled BGRx — multi-render-target (GL_R8 luma
full-res + GL_RG8 chroma half-res, or two passes) applying an explicit BT.709 limited-range
matrix matching the Windows VideoConverter (dxgi.rs:957) so hosts look identical — then
register NV_ENC_BUFFER_FORMAT_NV12 with the encoder (teach encode/linux.rs:98 nvenc_input an
NV12 case; CudaHw sw_format → AV_PIX_FMT_NV12).
- Net: today = GL swizzle (3D) + NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the swizzle it replaces) + zero NVENC CSC. Removes one whole CSC pass and removes it from the SM.
- Do not implement this as a standalone CUDA convert kernel on the tiled path — CUDA can't
sample a tiled NVIDIA surface (
cuGraphicsEGLRegisterImageis Tegra-only,egl.rs:6-12), so it would still need the GL detile, and a CUDA kernel runs on the same saturated SM. The CUDA-kernel route is only clean on the LINEAR/Vulkan-bridge (gamescope) path, where it doubles as the NV12 producer; do it there if/when that path needs it. - Pitfalls: pervasive 4-byte-pixel assumptions break with NV12 —
cuda.rshardcodesWidthInBytes = width*4(:363,392,499),BufferPool/alloc_pitchedassume 4 B/px, GL dst isGL_RGBA8; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12. - Impact: real, modest, compounding — a few ms of per-frame GPU time and a shorter time-slice
need, which stacks with cadence + power-pin. Not a standalone cure for the 240→40 collapse
(external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them).
Medium cost. Gate behind a
PUNKTFUNK_*env and A/Bcap→encodedp50 + the CS2 fps floor.
2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")
Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in priority:
- Arm the Linux
boost_thread_priorityno-op (punktfunk1.rs:1856cfg branch): best-effortlibc::setpriority(PRIO_PROCESS, 0, -10/-5)on the calling thread (tid 0 = self), log-and-continue on EPERM. Do not default to SCHED_RR/FIFO (can starve the compositor and the game's render thread — the user refuses to add game frame-time); offer it only behindPUNKTFUNK_SCHED_RR=1. Fix the wrong doc-comment atpunktfunk1.rs:1834-1835("the Linux host caps the game via gamescope, so its threads aren't starved") — false for the uncapped/NVIDIA-direct path. - Set CUDA context scheduling deliberately:
cuCtxCreateflagCU_CTX_SCHED_BLOCKING_SYNCon this shared VM (frees a core vs the default AUTO/SPIN) — a CPU-efficiency fix, not throughput. - High-priority CUDA stream + EGL context priority (the missing analogue of the Windows
hardening):
cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange)for our copies; requestEGL_IMG_context_priority HIGH(tryEGL_NV_context_priority_realtime) ategl.rs:332. Caveat, load-bearing: these are intra-process hints and NVIDIA's Linux driver has been reported to ignore context priority (driver 545: high- vs low-priority EGL contexts measured identical) and to deny realtime Vulkan queues. Implement with graceful fallback, gate behind env, and measure on driver 595 — do not architect around it or credit it before measurement.
Explicitly not doing on Linux: Vulkan
VK_EXT_global_priorityas "the" lever (it only touches the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA). ReplacingcuCtxSynchronizewith a per-stream event chain for contention reasons (it's per-context, never waited on the game's separate context — a non-fix; keep the full sync where it guards dmabuf recycle,egl.rs:491).
Tier 3 — Windows parity polish (Windows is already strong)
- 3A. Host-process session tuning (we have zero today — verified):
NtSetTimerResolution(0.5ms)/timeBeginPeriod(1)(default 15.6 ms granularity blocks precise pacing),DwmEnableMMCSS(true),SetPriorityClass(HIGH_PRIORITY_CLASS), MMCSS-register the capture/encode threads ("Games"/"Pro Audio"),SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED). All revert on stop. Foundational for any precise frame pacing and the encode|send split. Low cost, low risk. (gamestream/stream.rsstart/stop; Apollo'sstreaming_will_start/_stopped.) - 3B. Auto-gated REALTIME D3DKMT class instead of fixed HIGH (the realtime opt-in already exists
at
dxgi.rs:199-207): probe HAGS (D3DKMTQueryAdapterInfoHwSchEnabled) and VRAM headroom (IDXGIAdapter3::QueryVideoMemoryInfo, continuously), allow REALTIME(5) only when safe (HAGS off, or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises — Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling rung, same preemption ceiling), so rank it as cheap parity, not a fix. - 3C. Cheap experiment —
VideoProcessorBltdirectly from the DDA surface (skip the same-formatgpu_copyatdxgi.rs:2375), thenReleaseFrame, iff it doesn't re-serializeAcquireNextFrame(the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming the Blt is on the video engine). One-line source-texture change; benchmark only. Do not build a D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM (~5% 3D, no PCIe), not worth the interop rebuild. - 3D. Async NVENC + off-thread retrieve — measure-gated, uncertain. Today retrieve
(
lock_bitstream) runs inline on the submit thread (nvenc.rs:524-558), which is whydepth>1was measured to regress (wgc_helper.rs:111-114). The NVENC guide mandates submit/retrieve on separate threads with completion events + a deep surface pool; doing that could let per-frame scheduling waits overlap across frames and recover throughput — at a per-frame latency cost (depth × frame time). This is the one place the research and our own prior measurement disagree, so it is strictly measure-first, and it forecloses slice output (reportSliceOffsetsneedsenableEncodeAsync=0). Treat as a structural experiment, not a committed win.
Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)
- GL2 — Intra-refresh for RFI/recovery (
enableIntraRefresh+ recovery-point SEI) instead of a forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met. Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win, orthogonal to the collapse. - GL1 + GL6 — Sub-frame slice output + per-slice paced send (the roadmap's "~2-4 ms lever"):
enableSubFrameWrite+sliceMode+ transmit each slice as it completes. Big: needs the direct NVENC SDK on Linux (libavcodec emits whole AUs) and a per-slice wire/FEC redesign inpunktfunk-core(todayPacketHeader/Packetizer/reassembler are whole-AU; per-slice FEC blocks wreck Leopard efficiency) and client slice-granular submit. Gate onNV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK(often absent on consumer GeForce). The paced-send half is already shipped (stream.rs spawn_sender,punktfunk1.rs paced_submit) — don't re-implement.
Dropped / why (so we don't re-propose placebo)
| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
Replace cuCtxSynchronize with per-stream event chain for contention |
✗ | cuCtxSynchronize is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
Vulkan VK_EXT_global_priority as the Linux priority lever |
✗ | Touches only the minority gamescope/LINEAR vkCmdCopyBuffer, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a throughput/collapse fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our depth>1 regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
Empty-frame (LastPresentTime==0) skip |
✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets zeroReorderDelay=1, lookahead/multipass/AQ off; ffmpeg defaults match + we set bf=0. Only lowDelayKeyFrameScale=1 is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no nvEncInvalidateRefFrames; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |
Recommended order of attack
- Tier 0 diagnose on the real boxes — settles whether the collapse is source-ceiling or pipeline-starvation, and whether flip-bypass is halving capture.
- Tier 2A (Linux NV12) + Tier 2B (Linux scheduling hygiene) — the largest in-our-control headroom; Linux is the least-hardened host.
- Tier 1B (clock/power pin) both OSes — cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
- Tier 1A (cadence/flip) — gated on Tier 0 (this is where a big chunk of the collapse may live).
- Tier 3 (Windows polish) — session tuning is the clear win; the rest is parity.
- Tier 4 — only after the contention work; intra-refresh first, slice pipelining last.
Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap closes and the saturated floor rises, but a residual ceiling remains — at 100% GPU the game physically cannot also render the game and compose 240 unique frames, and WDDM/NVIDIA preemption granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it with encoder micro-optimisations.