punktfunk/design/host-latency-plan.md

# Host latency & the GPU-contention collapse — analysis + prioritized plan

> **Status:** PARTLY SHIPPED. Tier 2A (Linux NV12 convert) = `1fc6f73`; Tier 2B (Linux
> scheduling) + Tier 3A (Windows session tuning) = `112a054`. Tiers 1A, 1B, 3B, 3C, 3D, 4 are
> still open. This doc is trimmed to design rationale + open items; the shipped code is the
> source of truth for the landed tiers.

> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
> That follow-up re-verified this plan against the current code and overturned several specifics:
> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
> Windows. **For current action prioritization see `gpu-contention-investigation.md`.** The tier
> breakdown and the dropped-placebo analysis below remain a useful record.

Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
"saturating game starves the stream" headache:

> CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding
> (GPU-100%) scene it collapses to 40-50. Capping the game is **not** an acceptable fix.

This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the
[Apollo comparison](apollo-comparison.md) + external NVIDIA/streaming research) followed by an
**adversarial verification pass** — every candidate fix was attacked, against our actual code, to
separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
placebos.

---

## Mental model (§0A–0C) — see the follow-up

The original three-correction mental model (A: feeding NVENC RGB is backwards; B: GPU priority is
maxed on Windows and hits a preemption-granularity ceiling; C: a chunk of the collapse is upstream
of the encoder at the compositor compose-rate, with Independent/Direct Flip bypassing DWM) is
**partly corrected by `gpu-contention-investigation.md` §1** — notably that the default Windows
IDD-push path now feeds NVENC RGB (so §0A's "Windows already does the right thing" no longer holds),
and "capture sees half the frames" is DLSS-Frame-Gen-specific rather than general. Read the
follow-up doc for the corrected model. The durable takeaways still stand: **do less work on the
contended graphics/3D engine**, **overlap the unavoidable per-frame scheduling wait across frames**,
and **measure source-vs-pipeline before blaming encode**.

---

## Tier 0 — Diagnose first (cheap, decisive, do before writing code)

Everything below is gated on knowing *which* bucket the collapse is in. We already have the tooling.

1. **Run the workload with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`.** The `uniq` counter
   (genuinely-new captured frames vs re-encoded holds) already exists
   (`gamestream/stream.rs:332-336,403`; `wgc_helper.rs:122-183`). Under CS2 at GPU-100%:
   - **`fps`≈240 but `uniq`→40-50** ⇒ the *source/compositor* only produced 40-50 unique frames.
     No encode or priority fix on our side can exceed that: it is the game's effective
     present-to-compose rate at 100% GPU. The levers there are **reducing our own per-frame GPU
     steal** (Tier 2), so the game keeps more headroom, and the cadence work (Tier 1A).
   - **both `fps` and `uniq`→40-50** ⇒ our capture→convert→encode round-trip is being starved (the
     `lock_bitstream` scheduling stall). The Tier 1/2 contention levers apply directly.
2. **Confirm the game's flip mode on Windows.** If the game is on Independent/Direct Flip (MPO),
   capture bypasses DWM and can miss frames (the follow-up doc narrows the "half the frames" figure
   to DLSS Frame Gen). We already have `capture/composed_flip.rs`: verify ForceComposedFlip is
   actually engaged on the game path, and watch `cap_us`.
3. Capture `cap_us` / `enc_us` / `pace_us` p50/p99 alongside, to localise the stall.

Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This
headless dev VM cannot reproduce the contention.
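
The two-bucket reading in step 1 can be written down as a small classifier. This is a minimal sketch, not code from the repo: the `Bucket` enum, the `classify` function and the `tolerance` parameter are hypothetical names, and the 0.85 cut-off used in the usage note is an assumed value rather than a measured one.

```rust
/// Which bucket the collapse falls into, following the two cases in step 1.
#[derive(Debug, PartialEq)]
enum Bucket {
    /// Encode rate holds but unique frames collapse: source/compositor-bound.
    SourceBound,
    /// Both rates collapse together: capture -> convert -> encode is starved.
    PipelineStarved,
    /// Both rates are close to what the client requested.
    Healthy,
}

/// `fps` and `uniq` are the per-second counters from a `PUNKTFUNK_PERF=1` run,
/// `target` is the client-requested rate, and `tolerance` (an assumption) is
/// the fraction of `target` that still counts as "keeping up".
fn classify(fps: f64, uniq: f64, target: f64, tolerance: f64) -> Bucket {
    let fps_ok = fps >= target * tolerance;
    let uniq_ok = uniq >= target * tolerance;
    match (fps_ok, uniq_ok) {
        (true, true) => Bucket::Healthy,
        (true, false) => Bucket::SourceBound,
        // If the encode rate itself has dropped, the pipeline is the suspect
        // regardless of what `uniq` reads.
        (false, _) => Bucket::PipelineStarved,
    }
}
```

For the CS2 case, `classify(240.0, 45.0, 240.0, 0.85)` lands in `SourceBound`, while `classify(45.0, 45.0, 240.0, 0.85)` lands in `PipelineStarved`.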

---

## Tier 1 — The two under-weighted, cross-platform levers (OPEN — confirmed by research, not yet done)

### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
Levers, in order:
- **Force Composed Flip on Windows** for the game path (defeat MPO/flip-metering frame loss).
  Machinery exists (`composed_flip.rs`); confirm it engages and measure the unique-frame delta.
- **Opt-in "double-refresh" virtual output**: create the per-session virtual output at ~2× the
  client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
  we already mint arbitrary-mode virtual outputs). Gate **off** by default and **never** on the
  gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
  engine). `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER`.
- **Reflex / render-queue=0 style headroom** (non-capping): documented as the substitute for an fps
  cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what
  we can influence from the host side.

Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our
capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
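
The gating rules for the double-refresh output (off by default, never on the game-attach path) could look roughly like this. It is a sketch under assumptions: `OutputPath`, `parse_multiplier` and `output_hz` are hypothetical names, and treating `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER` as an integer limited to 1 or 2 is a guess at the knob's shape, not its defined format.

```rust
/// How the session's output is produced (illustrative names).
#[derive(Clone, Copy, Debug, PartialEq)]
enum OutputPath {
    /// Per-session virtual output composed by the desktop compositor.
    VirtualOutput,
    /// gamescope / SudoVDA game-attach: no DWM beat to break.
    GameAttach,
}

/// Parse the raw env value. Unset, unparsable or out-of-range values fall
/// back to 1, which keeps the feature off by default.
fn parse_multiplier(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|m| (1..=2).contains(m))
        .unwrap_or(1)
}

/// Refresh rate to request for the virtual output.
fn output_hz(client_hz: u32, multiplier: u32, path: OutputPath) -> u32 {
    match path {
        // Never multiply here: it only adds compose work to the saturated engine.
        OutputPath::GameAttach => client_hz,
        OutputPath::VirtualOutput => client_hz.saturating_mul(multiplier),
    }
}
```

A caller would feed it `std::env::var("PUNKTFUNK_OUTPUT_HZ_MULTIPLIER").ok().as_deref()`; with the variable set to `2` and a 240 Hz client, the virtual-output path requests 480 Hz and the game-attach path stays at 240 Hz.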

### 1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
a hidden latency tax, *most visible in easy scenes* (the ~200-should-be-240 case). Sunshine ships
this as `nvenc_latency_over_power` and calls it decisive. **Neither host does it.**
- **Windows**: NvAPI **per-application DRS profile** `PREFERRED_PSTATE = PREFER_MAX` scoped to our
  exe (not a global override). Load `nvapi64.dll` dynamically; treat `NvAPI_Initialize` failure as
  "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). **Crash-safe undo is mandatory**: write
  an undo record to `%ProgramData%\punktfunk\` *before* applying and revert a stale profile on next
  startup — a crash must not leave the user's control panel modified.
- **Linux**: prefer the **root-free** path — disable the CUDA "Force P2 State" downclock that
  context creation triggers (env/per-context), and `nvidia-smi -pm 1` (persistence) where
  permitted. `nvmlDeviceSetGpuLockedClocks` needs root/CAP_SYS_ADMIN (our host runs as a normal
  user → silent no-op) and is brittle across SKUs; if used, query `nvmlDeviceGetMaxClockInfo`, lock
  to *that*, and restore on teardown **and** via a SIGTERM/panic handler.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default OFF on battery / Steam Deck** (thermal/power caps
  make pinning actively harmful there).

Impact: reliable, modest p99 / easy-scene win on both OSes. Does **not** fix the saturated-scene
collapse (at 100% util the clock is already maxed). Low cost.
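
The crash-safe undo requirement on Windows is essentially a write-ahead record. The sketch below shows only that ordering, with the driver calls abstracted into closures; the file name `pstate-undo.txt` and both function names are hypothetical, and no NvAPI call is modelled.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

const UNDO_FILE: &str = "pstate-undo.txt"; // hypothetical file name

/// Persist the previous setting, flush it to disk, and only then apply the change.
fn apply_with_undo(
    dir: &Path,
    previous: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut f = File::create(dir.join(UNDO_FILE))?;
    f.write_all(previous.as_bytes())?;
    f.sync_all()?; // the record must be durable before the profile changes
    apply()
}

/// Run at startup (stale record left by a crash) and on clean teardown.
/// Returns `Ok(true)` if a record was found and reverted.
fn revert_and_clear(
    dir: &Path,
    revert: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    let path = dir.join(UNDO_FILE);
    let previous = match fs::read_to_string(&path) {
        Ok(s) => s,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    revert(&previous)?;
    // Delete only after a successful revert, so a failed revert is retried next start.
    fs::remove_file(&path)?;
    Ok(true)
}
```

Because the record is written before `apply` runs, a crash at any point leaves either no record (nothing changed) or a record that the next startup reverts. Re-applying the previous value when `apply` itself never ran is assumed to be harmless.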

---

## Tier 2 — Linux work-deletion + scheduling hygiene

### 2A. Linux NV12 convert — **SHIPPED (`1fc6f73`)**
GL de-tile blit emits NV12 (BT.709 limited) on the GPU and feeds NVENC native YUV, deleting NVENC's
internal RGB→YUV CSC off the contended SM. Gated `PUNKTFUNK_NV12` (default OFF). Tiled EGL/GL path
only; LINEAR/Vulkan-bridge (gamescope) stays RGB. Validated colour-correct on RTX 5070 Ti. Open
follow-up: glass-to-glass latency + CS2 fps-under-saturation A/B before flipping the default, and
the **P010** variant for the HDR/10-bit path. Code is the source of truth (`zerocopy/egl.rs`,
`encode/linux.rs`).
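
For reference, the arithmetic behind "NV12 (BT.709 limited)" is sketched below for a single pixel. The shipped path does this on the GPU inside the GL blit; this CPU function only illustrates the standard BT.709 coefficients and limited-range scaling, assuming 8-bit full-range RGB input. The function name is hypothetical. In NV12 the Y values fill a full-resolution plane and the Cb/Cr pairs are interleaved in a second plane at half resolution in both dimensions.

```rust
/// Convert one 8-bit full-range RGB pixel to 8-bit BT.709 limited-range Y'CbCr.
fn rgb_to_ycbcr_bt709_limited(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    // BT.709 luma coefficients.
    const KR: f32 = 0.2126;
    const KB: f32 = 0.0722;
    const KG: f32 = 1.0 - KR - KB;

    let (r, g, b) = (r as f32 / 255.0, g as f32 / 255.0, b as f32 / 255.0);
    let y = KR * r + KG * g + KB * b; // luma in [0, 1]
    let cb = (b - y) / (2.0 * (1.0 - KB)); // chroma in [-0.5, 0.5]
    let cr = (r - y) / (2.0 * (1.0 - KR));

    // Limited range: Y' spans 16..=235, Cb/Cr span 16..=240 centred on 128.
    let quantize = |v: f32| v.round().clamp(0.0, 255.0) as u8;
    (
        quantize(16.0 + 219.0 * y),
        quantize(128.0 + 224.0 * cb),
        quantize(128.0 + 224.0 * cr),
    )
}
```

White maps to `(235, 128, 128)` and black to `(16, 128, 128)`, which is a quick sanity check when validating colour correctness against a capture.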

### 2B. Linux scheduling hygiene — **SHIPPED (`112a054`)**
`boost_thread_priority` nices capture/encode/send on Linux (best-effort `setpriority`);
CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread highest-priority CUDA
stream (`cuStreamCreateWithPriority`, NULL-stream fallback). The stream-priority hint is
**measure-then-keep** (NVIDIA Linux may ignore it). **Do not** default to SCHED_RR/FIFO (can starve
the compositor + the game's render thread); opt-in only behind `PUNKTFUNK_SCHED_RR=1`. Code is the
source of truth (`punktfunk1.rs`).

> Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
> the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA), and
> replacing `cuCtxSynchronize` with a per-stream event chain for *contention* reasons (the sync is
> per-context and never waits on the game's separate context, so this is a non-fix; keep the full
> sync where it guards dmabuf recycle, `egl.rs:491`).

---

## Tier 3 — Windows parity polish (Windows is already strong)

### 3A. Host-process session tuning — **SHIPPED (`112a054`)**
`session_tuning.rs` (raw C-ABI FFI, no-op off Windows): each capture/encode/send thread applies
process-wide tuning once (1 ms timer, `DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) + per-thread MMCSS
"Games" + keep-display-awake; reverts on stop. Wired into both native (`boost_thread_priority`) and
GameStream (`stream.rs`) paths. FFI validated on the real MSVC toolchain.

### 3B. Auto-gated REALTIME D3DKMT class (OPEN)
The realtime opt-in already exists (`dxgi.rs:199-207`), but the default is a fixed HIGH. Instead,
probe HAGS (`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom
(`IDXGIAdapter3::QueryVideoMemoryInfo`, polled continuously). Allow REALTIME(5) only when safe
(HAGS off, or HAGS on with VRAM comfortably below budget), and downgrade to HIGH the moment VRAM
pressure rises. This mirrors Sunshine's actual gate, which avoids the HAGS+near-full-VRAM NVENC
freeze/crash. The gain is marginal (one scheduling rung, same preemption ceiling), so rank it as
cheap parity, not a fix.
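
The gate described above reduces to a small pure function once the two probes have been read. This is a sketch only: `GpuPriority`, `choose_priority` and the 90% pressure threshold are assumptions made for illustration, and the actual D3DKMT and DXGI queries are left out.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum GpuPriority {
    High,
    Realtime,
}

/// Assumed threshold: usage at or above 90% of the budget counts as VRAM pressure.
const VRAM_PRESSURE_RATIO: f64 = 0.90;

/// `hags_enabled` would come from the HAGS probe; `vram_usage` and
/// `vram_budget` from the continuously polled video-memory info.
fn choose_priority(hags_enabled: bool, vram_usage: u64, vram_budget: u64) -> GpuPriority {
    if !hags_enabled {
        // HAGS off: the freeze/crash combination cannot occur.
        return GpuPriority::Realtime;
    }
    if vram_budget == 0 {
        // Budget unknown: stay on the safe rung.
        return GpuPriority::High;
    }
    let ratio = vram_usage as f64 / vram_budget as f64;
    if ratio < VRAM_PRESSURE_RATIO {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Calling this on every poll gives the "downgrade the moment pressure rises" behaviour. A real implementation would probably want some hysteresis before re-promoting to REALTIME, which this sketch does not model.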

### 3C. `VideoProcessorBlt` directly from the DDA surface (OPEN — cheap experiment)
Skip the same-format `gpu_copy` at `dxgi.rs:2375`: Blt straight from the DDA surface, then
`ReleaseFrame`, *iff* that doesn't re-serialize `AcquireNextFrame` (the existing decouple-copy was
measured at 40-200 fps vs ~60 fps, but that note predates confirming the Blt is on the video
engine). It is a one-line source-texture change; benchmark only. Do **not** build a D3D11↔D3D12
copy-queue offload: the convert is already off-3D and the remaining copy is intra-VRAM (~5% 3D, no
PCIe), so it is not worth the interop rebuild.

### 3D. Async NVENC + off-thread retrieve (OPEN — measure-gated, uncertain)
Today retrieve (`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which
is *why* `depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates
submit/retrieve on separate threads with completion events + a deep surface pool; doing that *could*
let per-frame scheduling waits **overlap across frames** and recover *throughput* — at a per-frame
*latency* cost (depth × frame time). This is the one place the research and our own prior
measurement disagree, so it is **strictly measure-first**, and it forecloses slice output
(`reportSliceOffsets` needs `enableEncodeAsync=0`). Treat as a structural experiment, not a
committed win. (The follow-up doc notes the prior "async stacks latency" result was a *same-thread*
implementation, not a disproof of a correct two-thread pipeline.)
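
The shape of the two-thread pipeline, and the latency it trades for overlap, can be sketched with a bounded channel standing in for the surface pool. Nothing here touches NVENC: `InFlight`, `run_pipeline` and `added_latency_ms` are hypothetical names, and the comments mark where the real submit and `lock_bitstream` calls would go.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for one in-flight encode (a real one would carry the output
/// buffer and its completion event).
struct InFlight {
    frame: u64,
}

/// Submit `frames` frames with roughly `depth` allowed in flight; returns the
/// order in which the retrieve thread completed them.
fn run_pipeline(frames: u64, depth: usize) -> Vec<u64> {
    // Bounded queue models the surface pool: submit blocks once it is full.
    let (submit_tx, submit_rx) = sync_channel::<InFlight>(depth);

    let retriever = thread::spawn(move || {
        let mut done = Vec::new();
        for job in submit_rx {
            // Real code: wait on the completion event, then lock_bitstream.
            done.push(job.frame);
        }
        done
    });

    for frame in 0..frames {
        // Real code: submit the encode. The submit thread never waits on its
        // own output, which is what lets scheduling waits overlap across frames.
        submit_tx.send(InFlight { frame }).expect("retrieve thread exited");
    }
    drop(submit_tx); // closing the channel lets the retrieve thread finish
    retriever.join().expect("retrieve thread panicked")
}

/// Worst-case added latency of the depth x frame-time cost, in milliseconds.
fn added_latency_ms(depth: u32, fps: f64) -> f64 {
    depth as f64 * 1000.0 / fps
}
```

At 240 fps a depth of 2 costs up to about 8.3 ms of added per-frame latency, which is the number the measurement has to justify with recovered throughput.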

---

## Tier 4 — Deferred 2nd-order latency (OPEN — not contention fixes; do after Tiers 0-2)

- **GL2 — Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
  forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
  spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met.
  Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win,
  orthogonal to the collapse.
- **GL1 + GL6 — Sub-frame slice output + per-slice paced send** (the roadmap's "~2-4 ms lever"):
  `enableSubFrameWrite` + `sliceMode` + transmit each slice as it completes. **Big**: needs the
  direct NVENC SDK on Linux (libavcodec emits whole AUs) **and** a per-slice wire/FEC redesign in
  `punktfunk-core` (today `PacketHeader`/`Packetizer`/reassembler are whole-AU; per-slice FEC blocks
  wreck Leopard efficiency) **and** client slice-granular submit. Gate on
  `NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK` (often absent on consumer GeForce). The paced-send half is
  **already shipped** (`stream.rs spawn_sender`, `punktfunk1.rs paced_submit`) — don't re-implement.

---

## Dropped / why (so we don't re-propose placebo)

| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace `cuCtxSynchronize` with per-stream event chain *for contention* | ✗ | `cuCtxSynchronize` is per-context and never waits on the game's separate context; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan `VK_EXT_global_priority` as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR `vkCmdCopyBuffer`, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a *throughput/collapse* fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our `depth>1` regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (`LastPresentTime==0`) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets `zeroReorderDelay=1`, lookahead/multipass/AQ off; ffmpeg defaults match + we set `bf=0`. Only `lowDelayKeyFrameScale=1` is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no `nvEncInvalidateRefFrames`; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |

---

## Open items / What's left

For current action prioritization see [`gpu-contention-investigation.md`](gpu-contention-investigation.md).
Still-open work tracked by this doc:

- **Tier 0** — run the `PUNKTFUNK_PERF=1` uniq-vs-fps + flip-mode diagnosis on the real-GPU boxes
  (gate for everything below).
- **Tier 1A** — capture-source / compose-rate cadence levers (ForceComposedFlip verify;
  `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER` double-refresh; Reflex/render-queue=0 headroom).
- **Tier 1B** — GPU clock/power pinning (`PUNKTFUNK_PIN_CLOCKS`; NvAPI per-app DRS on Windows w/
  crash-safe undo; root-free CUDA-P2/persistence on Linux; default OFF on battery/Deck).
- **Tier 2A follow-up** — glass-to-glass + CS2-floor A/B before defaulting `PUNKTFUNK_NV12`, and the
  **P010** HDR/10-bit variant.
- **Tier 3B** — auto-gated REALTIME D3DKMT class (HAGS + VRAM-headroom gate).
- **Tier 3C** — `VideoProcessorBlt` directly from the DDA surface (benchmark-only experiment).
- **Tier 3D** — correct async NVENC two-thread submit/retrieve pipeline (strictly measure-first).
- **Tier 4** — GL2 intra-refresh for RFI/recovery; GL1/GL6 sub-frame slice output + per-slice paced
  send (paced-send half already shipped).

Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
closes and the saturated floor rises, but a residual ceiling remains: at 100% GPU the hardware
physically cannot both render the game *and* compose 240 unique frames, and WDDM/NVIDIA preemption
granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it
with encoder micro-optimisations.