Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).
- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
host-latency, gpu-contention (fixed stale status table), game-library,
linux-setup (fixed m0->spike + stale zero-copy claim),
session-aware-host-followups, windows-client-bootstrap,
windows-dualsense-{scoping,game-detection}, windows-virtual-display,
security-review (per-finding status table; #12 still open),
apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
merged, M4 done); windows-secure-desktop.md archived (now a fallback
behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Host latency & the GPU-contention collapse — analysis + prioritized plan
Status: PARTLY SHIPPED. Tier 2A (Linux NV12 convert) = 1fc6f73; Tier 2B (Linux scheduling) + Tier 3A (Windows session tuning) = 112a054. Tiers 1A, 1B, 3B, 3C, 3D, 4 are still open. This doc is trimmed to design rationale + open items; the shipped code is the source of truth for the landed tiers.
⚠ Partially superseded (2026-06-25) by gpu-contention-investigation.md. That follow-up re-verified this plan against the current code and overturned several specifics: the default Windows path (IDD-push) now feeds NVENC RGB (regressing the §0A "Windows does it right" claim); PUNKTFUNK_ENCODE_DEPTH never existed (phantom knob); the "async NVENC stacks latency" result was a same-thread implementation, not a disproof of a correct two-thread pipeline; "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on Windows. For current action prioritization see gpu-contention-investigation.md. The tiers/dropped-placebo analysis below remains a useful record.
Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: latency, and specifically the "saturating game starves the stream" headache:
CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding (GPU-100%) scene it collapses to 40-50. Capping the game is not an acceptable fix.
This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the Apollo comparison + external NVIDIA/streaming research) followed by an adversarial verification pass — every candidate fix was attacked, against our actual code, to separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the placebos.
Mental model (§0A–0C) — see the follow-up
The original three-correction mental model (A: feeding NVENC RGB is backwards; B: GPU priority is
maxed on Windows and hits a preemption-granularity ceiling; C: a chunk of the collapse is upstream
of the encoder at the compositor compose-rate, with Independent/Direct Flip bypassing DWM) is
partly corrected by gpu-contention-investigation.md §1 — notably that the default Windows
IDD-push path now feeds NVENC RGB (so §0A's "Windows already does the right thing" no longer holds),
and "capture sees half the frames" is DLSS-Frame-Gen-specific rather than general. Read the
follow-up doc for the corrected model. The durable takeaways still stand: do less work on the
contended graphics/3D engine, overlap the unavoidable per-frame scheduling wait across frames,
and measure source-vs-pipeline before blaming encode.
Tier 0 — Diagnose first (cheap, decisive, do before writing code)
Everything below is gated on knowing which bucket the collapse is in. We already have the tooling.
- Run the workload with PUNKTFUNK_PERF=1 and read uniq vs fps. The uniq counter (genuinely-new captured frames vs re-encoded holds) already exists (gamestream/stream.rs:332-336,403; wgc_helper.rs:122-183). Under CS2 at GPU-100%:
  - fps≈240 but uniq→40-50 ⇒ the source/compositor only produced 40-50 unique frames. No encode/priority/cadence fix on our side exceeds that; it is the game's effective present-to-compose rate at 100% GPU. The lever there is reducing our own per-frame GPU steal (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
  - both fps and uniq→40-50 ⇒ our capture→convert→encode round-trip is being starved (the lock_bitstream scheduling stall). The Tier 1/2 contention levers apply directly.
- Confirm the game's flip mode on Windows. If the game is on Independent/Direct Flip (MPO), capture is bypassing DWM and seeing half the frames. We already have capture/composed_flip.rs; verify ForceComposedFlip is actually engaged on the game path, and watch cap_us.
- Capture cap_us/enc_us/pace_us p50/p99 alongside, to localise the stall.
Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This headless dev VM cannot reproduce the contention.
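The uniq-vs-fps reading above is a two-way split, and it can be sketched as a small classifier. This is an illustrative sketch only: the `Bucket` enum, the `classify` function, and the 50% collapse threshold are assumptions made for the example, not names or values from the codebase.

```rust
/// Which side of the pipeline a frame-rate collapse sits on.
#[derive(Debug, PartialEq)]
enum Bucket {
    /// Both counters are near the requested rate.
    Healthy,
    /// We still emit frames, but most are re-encoded holds.
    SourceStarved,
    /// Even the emitted frame rate has collapsed.
    PipelineStarved,
}

/// ASSUMPTION: "collapsed" means below half the requested rate.
const COLLAPSE_RATIO: f64 = 0.5;

/// Classify one PUNKTFUNK_PERF-style sample (all values in frames per second).
fn classify(requested_fps: f64, fps: f64, uniq: f64) -> Bucket {
    let floor = requested_fps * COLLAPSE_RATIO;
    match (fps < floor, uniq < floor) {
        (false, false) => Bucket::Healthy,
        (false, true) => Bucket::SourceStarved,
        // uniq can never exceed fps, so a low fps implies the pipeline bucket.
        (true, _) => Bucket::PipelineStarved,
    }
}

fn main() {
    // The two CS2 cases from the list above, plus the easy-scene case.
    println!("{:?}", classify(240.0, 240.0, 45.0));
    println!("{:?}", classify(240.0, 45.0, 45.0));
    println!("{:?}", classify(240.0, 200.0, 200.0));
}
```

A source-starved result points at Tier 2 and Tier 1A; a pipeline-starved result points at the Tier 1/2 contention levers.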
Tier 1 — The two under-weighted, cross-platform levers (OPEN — confirmed by research, not yet done)
1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
The capture ceiling is the compositor's compose rate, and under load the compositor gets starved. Levers, in order:
- Force Composed Flip on Windows for the game path (defeat MPO/flip-metering frame loss). Machinery exists (composed_flip.rs); confirm it engages and measure the unique-frame delta.
- Opt-in "double-refresh" virtual output: create the per-session virtual output at ~2× the client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since we already mint arbitrary-mode virtual outputs). Gate off by default and never on the gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated engine). PUNKTFUNK_OUTPUT_HZ_MULTIPLIER.
- Reflex / render-queue=0 style headroom (non-capping): documented as the substitute for an fps cap; removes render-queue backpressure so the compositor/capture get scheduled. Investigate what we can influence from the host side.
Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
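The gating rules for the double-refresh output (off by default, never on the game-attach path) could look roughly like the sketch below. The variable name PUNKTFUNK_OUTPUT_HZ_MULTIPLIER comes from this doc, but its value format (an integer, accepted only as 1 or 2) and the helper names `parse_multiplier` and `virtual_output_hz` are assumptions for illustration.

```rust
use std::env;

/// ASSUMPTION: the knob holds a small positive integer; unset, malformed,
/// or out-of-range values fall back to 1 (feature off).
fn parse_multiplier(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|m| (1..=2).contains(m))
        .unwrap_or(1)
}

/// Refresh rate to request for the per-session virtual output.
fn virtual_output_hz(client_hz: u32, raw_multiplier: Option<&str>, game_attach: bool) -> u32 {
    if game_attach {
        // No DWM beat on the game-attach path: never multiply there.
        return client_hz;
    }
    client_hz.saturating_mul(parse_multiplier(raw_multiplier))
}

fn main() {
    let raw = env::var("PUNKTFUNK_OUTPUT_HZ_MULTIPLIER").ok();
    let hz = virtual_output_hz(240, raw.as_deref(), false);
    println!("virtual output mode: {hz} Hz");
}
```

Falling back to 1 on any unparsable value keeps the default-off behaviour even when the variable is set by mistake.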
1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
a hidden latency tax, most visible in easy scenes (the ~200-should-be-240 case). Sunshine ships
this as nvenc_latency_over_power and calls it decisive. Neither host does it.
- Windows: NvAPI per-application DRS profile PREFERRED_PSTATE = PREFER_MAX scoped to our exe (not a global override). Load nvapi64.dll dynamically; treat NvAPI_Initialize failure as "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). Crash-safe undo is mandatory: write an undo record to %ProgramData%\punktfunk\ before applying and revert a stale profile on next startup, because a crash must not leave the user's control panel modified.
- Linux: prefer the root-free path: disable the CUDA "Force P2 State" downclock that context creation triggers (env/per-context), and nvidia-smi -pm 1 (persistence) where permitted. nvmlDeviceSetGpuLockedClocks needs root/CAP_SYS_ADMIN (our host runs as a normal user → silent no-op) and is brittle across SKUs; if used, query nvmlDeviceGetMaxClockInfo, lock to that, and restore on teardown and via a SIGTERM/panic handler.
- Gate behind PUNKTFUNK_PIN_CLOCKS; default OFF on battery / Steam Deck (thermal/power caps make pinning actively harmful there).
Impact: reliable, modest p99 / easy-scene win on both OSes. Does not fix the saturated-scene collapse (at 100% util the clock is already maxed). Low cost.
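The crash-safe undo requirement is a write-ahead pattern: persist the previous value first, change the driver setting second, and check for a leftover record at startup. The sketch below shows only that ordering, with a plain string standing in for the driver profile. The file name `pstate-undo.txt` and both function names are hypothetical, and no NvAPI call is made.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical record location inside the state directory.
fn undo_record(dir: &Path) -> PathBuf {
    dir.join("pstate-undo.txt")
}

/// Persist the previous value to disk, then run `apply`.
fn apply_with_undo<F: FnOnce()>(dir: &Path, previous: &str, apply: F) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut file = File::create(undo_record(dir))?;
    file.write_all(previous.as_bytes())?;
    file.sync_all()?; // the record must be durable before the setting changes
    apply();
    Ok(())
}

/// Restore from a record if one exists (clean stop or next startup after a
/// crash). Returns whether a record was found.
fn revert_if_recorded<F: FnOnce(&str)>(dir: &Path, restore: F) -> io::Result<bool> {
    let path = undo_record(dir);
    match fs::read_to_string(&path) {
        Ok(previous) => {
            restore(previous.trim());
            fs::remove_file(&path)?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("punktfunk-undo-demo");
    let mut setting = String::from("adaptive");

    // Run 1: apply the pin, then "crash" without reverting.
    let previous = setting.clone();
    apply_with_undo(&dir, &previous, || setting = String::from("prefer-max"))?;

    // Run 2 startup: the stale record is found and the old value restored.
    let reverted = revert_if_recorded(&dir, |p| setting = p.to_string())?;
    println!("reverted={reverted} setting={setting}");
    Ok(())
}
```

Using the same revert function for clean shutdown and for startup recovery means there is only one restore path to keep correct.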
Tier 2 — Linux work-deletion + scheduling hygiene
2A. Linux NV12 convert — SHIPPED (1fc6f73)
GL de-tile blit emits NV12 (BT.709 limited) on the GPU and feeds NVENC native YUV, deleting NVENC's
internal RGB→YUV CSC off the contended SM. Gated PUNKTFUNK_NV12 (default OFF). Tiled EGL/GL path
only; LINEAR/Vulkan-bridge (gamescope) stays RGB. Validated colour-correct on RTX 5070 Ti. Open
follow-up: glass-to-glass latency + CS2 fps-under-saturation A/B before flipping the default, and
the P010 variant for the HDR/10-bit path. Code is the source of truth (zerocopy/egl.rs,
encode/linux.rs).
2B. Linux scheduling hygiene — SHIPPED (112a054)
boost_thread_priority nices capture/encode/send on Linux (best-effort setpriority);
CUDA context uses CU_CTX_SCHED_BLOCKING_SYNC; copies run on a per-thread highest-priority CUDA
stream (cuStreamCreateWithPriority, NULL-stream fallback). The stream-priority hint is
measure-then-keep (NVIDIA Linux may ignore it). Do not default to SCHED_RR/FIFO (can starve
the compositor + the game's render thread); opt-in only behind PUNKTFUNK_SCHED_RR=1. Code is the
source of truth (punktfunk1.rs).
Explicitly not doing on Linux: Vulkan VK_EXT_global_priority as "the" lever (it only touches the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA). Replacing cuCtxSynchronize with a per-stream event chain for contention reasons (it's per-context, never waited on the game's separate context, so it is a non-fix; keep the full sync where it guards dmabuf recycle, egl.rs:491).
Tier 3 — Windows parity polish (Windows is already strong)
3A. Host-process session tuning — SHIPPED (112a054)
session_tuning.rs (raw C-ABI FFI, no-op off Windows): each capture/encode/send thread applies
process-wide tuning once (1 ms timer, DwmEnableMMCSS, HIGH_PRIORITY_CLASS) + per-thread MMCSS
"Games" + keep-display-awake; reverts on stop. Wired into both native (boost_thread_priority) and
GameStream (stream.rs) paths. FFI validated on the real MSVC toolchain.
3B. Auto-gated REALTIME D3DKMT class (OPEN)
Instead of fixed HIGH (the realtime opt-in already exists at dxgi.rs:199-207): probe HAGS
(D3DKMTQueryAdapterInfo HwSchEnabled) and VRAM headroom (IDXGIAdapter3::QueryVideoMemoryInfo,
continuously), allow REALTIME(5) only when safe (HAGS off, or HAGS on + VRAM comfortably below
budget), downgrade to HIGH the moment VRAM pressure rises — Sunshine's actual gate avoids the
HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling rung, same preemption ceiling), so
rank it as cheap parity, not a fix.
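The gate described above reduces to a pure decision over two probed inputs, re-evaluated whenever the VRAM numbers are refreshed. A minimal sketch follows, assuming the HAGS flag and the CurrentUsage/Budget values from QueryVideoMemoryInfo have already been read elsewhere. The 85% headroom threshold and the names `GpuPriority` and `pick_priority` are assumptions, not measured or existing values.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum GpuPriority {
    High,
    Realtime,
}

/// ASSUMPTION: "comfortably below budget" means under 85% of the budget.
const VRAM_HEADROOM: f64 = 0.85;

/// Decide the scheduling class from the latest probe results.
fn pick_priority(hags_enabled: bool, vram_used: u64, vram_budget: u64) -> GpuPriority {
    if !hags_enabled {
        // HAGS off: REALTIME is considered safe regardless of VRAM.
        return GpuPriority::Realtime;
    }
    let comfortable =
        vram_budget > 0 && (vram_used as f64) < (vram_budget as f64) * VRAM_HEADROOM;
    if comfortable {
        GpuPriority::Realtime
    } else {
        // HAGS on and near-full VRAM: stay on HIGH to avoid the freeze/crash case.
        GpuPriority::High
    }
}

fn main() {
    const GIB: u64 = 1 << 30;
    println!("{:?}", pick_priority(true, 10 * GIB, 24 * GIB));
    println!("{:?}", pick_priority(true, 23 * GIB, 24 * GIB));
    println!("{:?}", pick_priority(false, 23 * GIB, 24 * GIB));
}
```

Calling this on every VRAM poll gives the "downgrade the moment pressure rises" behaviour without extra state; a real implementation would probably add hysteresis to avoid flapping at the threshold.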
3C. VideoProcessorBlt directly from the DDA surface (OPEN — cheap experiment)
Skip the same-format gpu_copy at dxgi.rs:2375, then ReleaseFrame, iff it doesn't
re-serialize AcquireNextFrame (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but
that note predates confirming the Blt is on the video engine). One-line source-texture change;
benchmark only. Do not build a D3D11↔D3D12 copy-queue offload — the convert is already off-3D,
the remaining copy is intra-VRAM (~5% 3D, no PCIe), not worth the interop rebuild.
3D. Async NVENC + off-thread retrieve (OPEN — measure-gated, uncertain)
Today retrieve (lock_bitstream) runs inline on the submit thread (nvenc.rs:524-558), which
is why depth>1 was measured to regress (wgc_helper.rs:111-114). The NVENC guide mandates
submit/retrieve on separate threads with completion events + a deep surface pool; doing that could
let per-frame scheduling waits overlap across frames and recover throughput — at a per-frame
latency cost (depth × frame time). This is the one place the research and our own prior
measurement disagree, so it is strictly measure-first, and it forecloses slice output
(reportSliceOffsets needs enableEncodeAsync=0). Treat as a structural experiment, not a
committed win. (The follow-up doc notes the prior "async stacks latency" result was a same-thread
implementation, not a disproof of a correct two-thread pipeline.)
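The shape of the two-thread pipeline, and the latency it costs, can be sketched without any NVENC calls. In the sketch below a bounded channel stands in for the surface pool (at most `depth` frames queued, plus the one the retrieve thread is currently holding) and the receiving thread stands in for the completion-event wait plus lock_bitstream. The names `run_pipeline` and `added_latency_ms` are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Worst-case added latency from pipelining: depth × frame time, in ms.
fn added_latency_ms(depth: u32, fps: f64) -> f64 {
    depth as f64 * 1000.0 / fps
}

/// Submit `frames` frame ids on this thread and retrieve them on another.
/// Returns the ids in the order the retrieve thread finished them.
fn run_pipeline(frames: u32, depth: usize) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(depth);

    // Retrieve thread: stand-in for waiting on the completion event and
    // locking the bitstream for each submitted frame.
    let retriever = thread::spawn(move || {
        let mut done = Vec::new();
        for frame in rx {
            done.push(frame);
        }
        done
    });

    // Submit thread: `send` blocks once the channel holds `depth` frames,
    // which is the back-pressure a fixed surface pool would apply.
    for frame in 0..frames {
        tx.send(frame).expect("retrieve thread exited early");
    }
    drop(tx); // closing the channel lets the retrieve loop finish

    retriever.join().expect("retrieve thread panicked")
}

fn main() {
    println!("retrieved: {:?}", run_pipeline(5, 2));
    println!("depth 2 at 240 fps adds up to {:.2} ms", added_latency_ms(2, 240.0));
}
```

The point of the arithmetic is that each extra frame in flight at 240 fps is worth about 4.2 ms of latency, so any throughput recovered has to be weighed against that per-frame cost.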
Tier 4 — Deferred 2nd-order latency (OPEN — not contention fixes; do after Tiers 0-2)
- GL2: Intra-refresh for RFI/recovery (enableIntraRefresh + recovery-point SEI) instead of a forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met. Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win, orthogonal to the collapse.
- GL1 + GL6: Sub-frame slice output + per-slice paced send (the roadmap's "~2-4 ms lever"): enableSubFrameWrite + sliceMode + transmit each slice as it completes. Big: needs the direct NVENC SDK on Linux (libavcodec emits whole AUs) and a per-slice wire/FEC redesign in punktfunk-core (today PacketHeader/Packetizer/reassembler are whole-AU; per-slice FEC blocks wreck Leopard efficiency) and client slice-granular submit. Gate on NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK (often absent on consumer GeForce). The paced-send half is already shipped (stream.rs spawn_sender, punktfunk1.rs paced_submit), so don't re-implement it.
Dropped / why (so we don't re-propose placebo)
| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace cuCtxSynchronize with per-stream event chain for contention | ✗ | cuCtxSynchronize is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan VK_EXT_global_priority as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR vkCmdCopyBuffer, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a throughput/collapse fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our depth>1 regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (LastPresentTime==0) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets zeroReorderDelay=1, lookahead/multipass/AQ off; ffmpeg defaults match + we set bf=0. Only lowDelayKeyFrameScale=1 is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no nvEncInvalidateRefFrames; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |
Open items / What's left
For current action prioritization see gpu-contention-investigation.md.
Still-open work tracked by this doc:
- Tier 0: run the PUNKTFUNK_PERF=1 uniq-vs-fps + flip-mode diagnosis on the real-GPU boxes (gate for everything below).
- Tier 1A: capture-source / compose-rate cadence levers (ForceComposedFlip verify; PUNKTFUNK_OUTPUT_HZ_MULTIPLIER double-refresh; Reflex/render-queue=0 headroom).
- Tier 1B: GPU clock/power pinning (PUNKTFUNK_PIN_CLOCKS; NvAPI per-app DRS on Windows w/ crash-safe undo; root-free CUDA-P2/persistence on Linux; default OFF on battery/Deck).
- Tier 2A follow-up: glass-to-glass + CS2-floor A/B before defaulting PUNKTFUNK_NV12, and the P010 HDR/10-bit variant.
- Tier 3B: auto-gated REALTIME D3DKMT class (HAGS + VRAM-headroom gate).
- Tier 3C: VideoProcessorBlt directly from the DDA surface (benchmark-only experiment).
- Tier 3D: correct async NVENC two-thread submit/retrieve pipeline (strictly measure-first).
- Tier 4: GL2 intra-refresh for RFI/recovery; GL1/GL6 sub-frame slice output + per-slice paced send (paced-send half already shipped).
Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap closes and the saturated floor rises, but a residual ceiling remains: at 100% utilisation the GPU physically cannot both render the game and compose 240 unique frames, and WDDM/NVIDIA preemption granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it with encoder micro-optimisations.