Files
punktfunk/docs/host-latency-plan.md
T
enricobuehler 6b3cbce120 wip: host latency/GPU-contention notes + Windows packaging tweaks
Pre-existing working-tree changes committed to the branch on request: the
gpu-contention investigation doc, host-latency-plan additions, and small
pack-host-installer / stage-pf-vdisplay packaging-script edits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 06:53:09 +00:00

283 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Host latency & the GPU-contention collapse — analysis + prioritized plan
> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
> That follow-up re-verified this plan against the current code and overturned several specifics:
> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a
> useful record.
Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
"saturating game starves the stream" headache:
> CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding
> (GPU-100%) scene it collapses to 40-50. Capping the game is **not** an acceptable fix.
This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the
[Apollo comparison](apollo-comparison.md) + external NVIDIA/streaming research) followed by an
**adversarial verification pass** — every candidate fix was attacked, against our actual code, to
separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
placebos.
## Implementation status (2026-06-18)
-**Tier 2B — Linux scheduling hygiene**: landed. `boost_thread_priority` now nices the
capture/encode + send threads on Linux (`setpriority`, best-effort) and its wrong gamescope
doc-comment is fixed; CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread
highest-priority CUDA stream (`cuStreamCreateWithPriority`, graceful NULL-stream fallback) with a
per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt
green. The stream-priority hint is **measure-then-keep** (NVIDIA Linux may ignore it).
-**Tier 3A — Windows session tuning**: landed (`session_tuning.rs`, raw C-ABI FFI, no-op off
Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer,
`DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) and per-thread MMCSS "Games" + keep-display-awake. Wired
into both the native (`boost_thread_priority`) and GameStream (`stream.rs`) paths. Linux no-op
path builds green; the **FFI was validated on the real MSVC toolchain** (standalone probe compiled,
linked against winmm/kernel32/dwmapi/avrt, and ran — timer/priority/MMCSS all succeed).
-**Tier 2A — Linux NV12 convert**: landed, gated behind `PUNKTFUNK_NV12` (default OFF → the
RGB/BGRx path is byte-for-byte unchanged). The tiled EGL/GL path produces NV12 (BT.709 limited) on
the GPU and feeds NVENC native YUV, deleting NVENC's internal RGB→YUV CSC off the contended SM.
**Validated on an RTX 5070 Ti two ways**: (1) `nv12-selftest` — synthetic RGBA→NV12 round-trip vs a
BT.709 reference, max abs error Y=0.56 / U=0.33 / V=0.26 LSB; (2) live `capture→NV12→NVENC→decode`
of animated content matches the RGB path's colour (avg RGB 230,18,18 vs 231,18,20 — no green-screen,
correct matrix + VUI). LINEAR/Vulkan-bridge (gamescope) path stays RGB. Next: glass-to-glass
latency + fps-under-saturation A/B on a real game (the Tier-0 measurement) before flipping default.
---
## 0. Three corrections to the mental model (read first)
**(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards.**
NVENC's encode core is YUV-native. RGB input makes the driver insert an **RGB→YUV CSC on the
SM/3D-compute cores** — the *exact* engine a game saturates. Windows already does the right thing:
`convert_to_yuv` runs the CSC on the dedicated **VIDEO engine** via `VideoProcessorBlt`
(`capture/dxgi.rs:1023,1063`), logged as "0% 3D". **Linux still feeds NVENC RGB**
(`encode/linux.rs:98 nvenc_input``RGBZ`/`BGRZ`; the `zerocopy/egl.rs:98` shader is a `.bgra`
*swizzle*, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single
biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.
**(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a
hardware ceiling.** We ship `D3DKMTSetProcessSchedulingPriorityClass=HIGH(4)` +
`SetGPUThreadPriority(0x4000001E)` + `SetMaximumFrameLatency(1)` (`capture/dxgi.rs:160-263`). The
residual ~20 ms `lock_bitstream` wall (documented at `dxgi.rs:155`) is GPU **context-scheduling
latency**, bounded by **preemption granularity**: NVIDIA preempts *compute* at instruction level
(~0.1 ms) but *graphics* only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a
draw flood). No priority class preempts an in-flight game draw. So the winning strategy is **not
more priority** — it is (1) do **less work on the contended graphics/3D engine**, and (2) **overlap
the unavoidable per-frame scheduling wait across frames** to recover throughput.
**(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it.**
DXGI Desktop Duplication *and* WGC both capture **from the DWM compositor**, so captured fps is
hard-ceilinged at the **compose rate**, never the game's 400 fps. Under saturation the *compositor
itself* is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And
borderless/fullscreen games on **Independent/Direct Flip** present straight to scanout, *bypassing
DWM*, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at
`target_fps` and **re-encodes held frames**, so *transmitted* fps stays ~240 while *unique* fps
collapses. **This must be measured before blaming encode.**
> Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all
> shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. **Linux is the
> least-hardened host and holds most of the headroom.**
---
## Tier 0 — Diagnose first (cheap, decisive, do before writing code)
Everything below is gated on knowing *which* bucket the collapse is in. We already have the tooling.
1. **Run the workload with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`.** The `uniq` counter
(genuinely-new captured frames vs re-encoded holds) already exists
(`gamestream/stream.rs:332-336,403`; `wgc_helper.rs:122-183`). Under CS2 at GPU-100%:
- **`fps`≈240 but `uniq`→40-50** ⇒ the *source/compositor* only produced 40-50 unique frames.
No encode/priority/cadence fix on our side exceeds that — it is the game's effective
present-to-compose rate at 100% GPU. The lever there is **reducing our own per-frame GPU
steal** (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
- **both `fps` and `uniq`→40-50** ⇒ our capture→convert→encode round-trip is being starved (the
`lock_bitstream` scheduling stall). The Tier 1/2 contention levers apply directly.
2. **Confirm the game's flip mode on Windows.** If the game is on Independent/Direct Flip (MPO),
capture is bypassing DWM and seeing half the frames. We already have `capture/composed_flip.rs`
— verify ForceComposedFlip is actually engaged on the game path, and watch `cap_us`.
3. Capture `cap_us` / `enc_us` / `pace_us` p50/p99 alongside, to localise the stall.
Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This
headless dev VM cannot reproduce the contention.
---
## Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)
### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
Levers, in order:
- **Force Composed Flip on Windows** for the game path (defeat MPO/flip-metering frame loss).
Machinery exists (`composed_flip.rs`); confirm it engages and measure the unique-frame delta.
- **Opt-in "double-refresh" virtual output**: create the per-session virtual output at ~2× the
client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
we already mint arbitrary-mode virtual outputs). Gate **off** by default and **never** on the
gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
engine). `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER`.
- **Reflex / render-queue=0 style headroom** (non-capping): documented as the substitute for an fps
cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what
we can influence from the host side.
Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our
capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
### 1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
a hidden latency tax, *most visible in easy scenes* (the ~200-should-be-240 case). Sunshine ships
this as `nvenc_latency_over_power` and calls it decisive. **Neither host does it.**
- **Windows**: NvAPI **per-application DRS profile** `PREFERRED_PSTATE = PREFER_MAX` scoped to our
exe (not a global override). Load `nvapi64.dll` dynamically; treat `NvAPI_Initialize` failure as
"no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). **Crash-safe undo is mandatory**: write
an undo record to `%ProgramData%\punktfunk\` *before* applying and revert a stale profile on next
startup — a crash must not leave the user's control panel modified.
- **Linux**: prefer the **root-free** path — disable the CUDA "Force P2 State" downclock that
context creation triggers (env/per-context), and `nvidia-smi -pm 1` (persistence) where
permitted. `nvmlDeviceSetGpuLockedClocks` needs root/CAP_SYS_ADMIN (our host runs as a normal
user → silent no-op) and is brittle across SKUs; if used, query `nvmlDeviceGetMaxClockInfo`, lock
to *that*, and restore on teardown **and** via a SIGTERM/panic handler.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default OFF on battery / Steam Deck** (thermal/power caps
make pinning actively harmful there).
Impact: reliable, modest p99 / easy-scene win on both OSes. Does **not** fix the saturated-scene
collapse (at 100% util the clock is already maxed). Low cost.
---
## Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)
### 2A. Produce **NV12/P010** on Linux and feed it to NVENC native (delete the SM-side CSC)
The strictly-correct version (verified): **extend the existing GL de-tile blit
(`zerocopy/egl.rs`) to emit NV12** instead of swizzled BGRx — multi-render-target (GL_R8 luma
full-res + GL_RG8 chroma half-res, or two passes) applying an **explicit BT.709 limited-range
matrix matching the Windows `VideoConverter`** (`dxgi.rs:957`) so hosts look identical — then
register `NV_ENC_BUFFER_FORMAT_NV12` with the encoder (teach `encode/linux.rs:98 nvenc_input` an
NV12 case; `CudaHw sw_format``AV_PIX_FMT_NV12`).
- Net: today = GL swizzle (3D) **+** NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the
swizzle it replaces) **+ zero NVENC CSC**. Removes one whole CSC pass and removes it from the SM.
- **Do *not* implement this as a standalone CUDA convert kernel on the tiled path** — CUDA can't
sample a tiled NVIDIA surface (`cuGraphicsEGLRegisterImage` is Tegra-only, `egl.rs:6-12`), so it
would still need the GL detile, *and* a CUDA kernel runs on the same saturated SM. The CUDA-kernel
route is only clean on the **LINEAR/Vulkan-bridge (gamescope)** path, where it doubles as the NV12
producer; do it there if/when that path needs it.
- Pitfalls: pervasive 4-byte-pixel assumptions break with NV12 — `cuda.rs` hardcodes
`WidthInBytes = width*4` (`:363,392,499`), `BufferPool`/`alloc_pitched` assume 4 B/px, GL dst is
`GL_RGBA8`; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you
get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12.
- Impact: real, **modest, compounding** — a few ms of per-frame GPU time and a shorter time-slice
need, which stacks with cadence + power-pin. **Not** a standalone cure for the 240→40 collapse
(external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them).
Medium cost. Gate behind a `PUNKTFUNK_*` env and A/B `cap→encoded` p50 + the CS2 fps floor.
### 2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")
Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in
priority:
- **Arm the Linux `boost_thread_priority` no-op** (`punktfunk1.rs:1856` cfg branch): best-effort
`libc::setpriority(PRIO_PROCESS, 0, -10/-5)` on the calling thread (tid 0 = self), log-and-continue
on EPERM. **Do not** default to SCHED_RR/FIFO (can starve the compositor and the game's render
thread — the user refuses to add game frame-time); offer it only behind `PUNKTFUNK_SCHED_RR=1`.
**Fix the wrong doc-comment** at `punktfunk1.rs:1834-1835` ("the Linux host caps the game via
gamescope, so its threads aren't starved") — false for the uncapped/NVIDIA-direct path.
- **Set CUDA context scheduling deliberately**: `cuCtxCreate` flag `CU_CTX_SCHED_BLOCKING_SYNC` on
this shared VM (frees a core vs the default AUTO/SPIN) — a CPU-efficiency fix, not throughput.
- **High-priority CUDA stream + EGL context priority** (the missing analogue of the Windows
hardening): `cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange)` for our copies;
request `EGL_IMG_context_priority HIGH` (try `EGL_NV_context_priority_realtime`) at
`egl.rs:332`. **Caveat, load-bearing**: these are intra-process *hints* and NVIDIA's Linux driver
has been reported to **ignore** context priority (driver 545: high- vs low-priority EGL contexts
measured identical) and to **deny** realtime Vulkan queues. Implement with graceful fallback,
gate behind env, and **measure on driver 595** — do not architect around it or credit it before
measurement.
> Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
> the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA).
> Replacing `cuCtxSynchronize` with a per-stream event chain for *contention* reasons (it's
> per-context, never waited on the game's separate context — a non-fix; keep the full sync where it
> guards dmabuf recycle, `egl.rs:491`).
---
## Tier 3 — Windows parity polish (Windows is already strong)
- **3A. Host-process session tuning (we have *zero* today — verified):** `NtSetTimerResolution(0.5ms)`
/ `timeBeginPeriod(1)` (default 15.6 ms granularity blocks precise pacing), `DwmEnableMMCSS(true)`,
`SetPriorityClass(HIGH_PRIORITY_CLASS)`, MMCSS-register the capture/encode threads ("Games"/"Pro
Audio"), `SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED)`. All revert on stop.
Foundational for any precise frame pacing and the encode|send split. Low cost, low risk.
(`gamestream/stream.rs` start/stop; Apollo's `streaming_will_start`/`_stopped`.)
- **3B. Auto-gated REALTIME D3DKMT class** instead of fixed HIGH (the realtime opt-in already exists
at `dxgi.rs:199-207`): probe HAGS (`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom
(`IDXGIAdapter3::QueryVideoMemoryInfo`, continuously), allow REALTIME(5) only when safe (HAGS off,
or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises —
Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling
rung, same preemption ceiling), so rank it as cheap parity, not a fix.
- **3C. Cheap experiment — `VideoProcessorBlt` directly from the DDA surface** (skip the same-format
`gpu_copy` at `dxgi.rs:2375`), then `ReleaseFrame`, *iff* it doesn't re-serialize `AcquireNextFrame`
(the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming
the Blt is on the video engine). One-line source-texture change; benchmark only. Do **not** build a
D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM
(~5% 3D, no PCIe), not worth the interop rebuild.
- **3D. Async NVENC + off-thread retrieve — measure-gated, uncertain.** Today retrieve
(`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which is *why*
`depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates submit/retrieve
on separate threads with completion events + a deep surface pool; doing that *could* let per-frame
scheduling waits **overlap across frames** and recover *throughput* — at a per-frame *latency* cost
(depth × frame time). This is the one place the research and our own prior measurement disagree, so
it is **strictly measure-first**, and it forecloses slice output (`reportSliceOffsets` needs
`enableEncodeAsync=0`). Treat as a structural experiment, not a committed win.
---
## Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)
- **GL2 — Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met.
Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win,
orthogonal to the collapse.
- **GL1 + GL6 — Sub-frame slice output + per-slice paced send** (the roadmap's "~2-4 ms lever"):
`enableSubFrameWrite` + `sliceMode` + transmit each slice as it completes. **Big**: needs the
direct NVENC SDK on Linux (libavcodec emits whole AUs) **and** a per-slice wire/FEC redesign in
`punktfunk-core` (today `PacketHeader`/`Packetizer`/reassembler are whole-AU; per-slice FEC blocks
wreck Leopard efficiency) **and** client slice-granular submit. Gate on
`NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK` (often absent on consumer GeForce). The paced-send half is
**already shipped** (`stream.rs spawn_sender`, `punktfunk1.rs paced_submit`) — don't re-implement.
---
## Dropped / why (so we don't re-propose placebo)
| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace `cuCtxSynchronize` with per-stream event chain *for contention* | ✗ | `cuCtxSynchronize` is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan `VK_EXT_global_priority` as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR `vkCmdCopyBuffer`, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a *throughput/collapse* fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our `depth>1` regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (`LastPresentTime==0`) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets `zeroReorderDelay=1`, lookahead/multipass/AQ off; ffmpeg defaults match + we set `bf=0`. Only `lowDelayKeyFrameScale=1` is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no `nvEncInvalidateRefFrames`; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |
---
## Recommended order of attack
1. **Tier 0 diagnose** on the real boxes — settles whether the collapse is source-ceiling or
pipeline-starvation, and whether flip-bypass is halving capture.
2. **Tier 2A (Linux NV12)** + **Tier 2B (Linux scheduling hygiene)** — the largest in-our-control
headroom; Linux is the least-hardened host.
3. **Tier 1B (clock/power pin)** both OSes — cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
4. **Tier 1A (cadence/flip)** — gated on Tier 0 (this is where a big chunk of the collapse may live).
5. **Tier 3 (Windows polish)** — session tuning is the clear win; the rest is parity.
6. **Tier 4** — only after the contention work; intra-refresh first, slice pipelining last.
Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
closes and the saturated floor rises, but a residual ceiling remains — at 100% GPU the game
physically cannot also render the game *and* compose 240 unique frames, and WDDM/NVIDIA preemption
granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it
with encoder micro-optimisations.