diff --git a/docs/gpu-contention-investigation.md b/docs/gpu-contention-investigation.md new file mode 100644 index 0000000..66296a3 --- /dev/null +++ b/docs/gpu-contention-investigation.md @@ -0,0 +1,430 @@ +# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25) + +> The headache, stated precisely: +> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the +> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the +> game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light +> titles like CS2). **Capping is not an acceptable fix** — demanding titles exhaust the GPU even +> when capped. + +This is the second, deeper pass on the problem. The first pass is +[`host-latency-plan.md`](host-latency-plan.md) (a 25-agent investigation, 2026-06-18). **This doc +supersedes several of that doc's conclusions** — the codebase moved a lot in the week since +(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the +GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned +two of the old plan's premises. Read §1 (corrections) before acting on the old doc. + +Method: five parallel investigations — three deep reads of the *current* code (encode, capture, +mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with +their own adversarial verifiers. Every external claim below carries a source URL; every code claim +carries a current `file:line`. + +--- + +## 0. TL;DR — the corrected mental model and the action list + +**The governing fact:** NVENC is a **dedicated ASIC on its own GPU runlist**, physically separate +from the SM/CUDA/graphics cores a 3D game saturates. The game does **not** steal the encode block. 
+It steals everything that *feeds* the block — capture-acquire, the **RGB→YUV colour-convert**, the +copy into the encoder's input surface, the readback — **and the GPU-scheduler time** to run that +feed work, which is queued behind the game's graphics context. +([NVENC app-note](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html), +[engine-table proof, UNC RTAS'24](https://www.cs.unc.edu/~jbakita/rtas24.pdf)) + +**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart +before writing code:** + +| Bottleneck | Symptom | Fix family | +|---|---|---| +| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU | +| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case | + +**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work +either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works +only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping +equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which +costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game +politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7). + +**Action list, highest leverage first** (detail in §5–§6): + +1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation + mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter. +2. 
**Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC + BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on + the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A) +3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another, + deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread* + implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never + tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B) +4. **Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't). + Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C) +5. **Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the + collapse). (§5.E) +6. **If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose + ceiling. Big lift, anti-cheat tradeoffs. (§5.F) +7. **The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach + that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G) + +--- + +## 1. Corrections to `host-latency-plan.md` (read before reusing it) + +The old doc was right about the shape but several specifics are now wrong or stale: + +- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the + DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC + **RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path + *regressed* on the exact axis the doc celebrated. 
(§5.A, `capture/windows/idd_push.rs:545-551,743`) +- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists + only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never + parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the + native path; GameStream and the WGC helper are hardcoded depth-1. +- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that + produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread** — + it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with + **zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on + *separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B) +- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right. + Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU + context does preempt a saturating graphics context at pixel granularity** — that is precisely how + NVIDIA VR Async-TimeWarp injects a frame into a busy game + ([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we + default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C) +- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The + "half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact** + (FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal + fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a + general lever. 
([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676), + [Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621)) +- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at + Capture SDK 7.1 / Win10-1803 + ([NVIDIA deprecation bulletin](https://developer.download.nvidia.com/designworks/capture-sdk/docs/NVFBC_Win10_Deprecation_Tech_Bulletin.pdf)). + Linux-only, and there only via the consumer `keylase` patch. + +What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling +is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the +honest residual ceiling at 100% GPU. Those carry forward. + +--- + +## 2. How the pipeline actually serializes today (verified against current code) + +The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`, +`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a +**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a +near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** — +which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` / +`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the +diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode +stall"* (`punktfunk1.rs:2466-2468`). + +The encode round-trip (NVENC, the dominant path): + +- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it + pushes onto a `pending` FIFO. +- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode + completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event. 
+- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve. + +So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream → +hand AU to the send thread`. The arithmetic matches the symptom — `1000/17 ≈ 59` and `1000/13 ≈ 77` +fps are the ceiling that a 13–17 ms `encode_ms` alone puts on a one-at-a-time loop, and the +capture (+convert) step in the same serial loop pulls the result down toward the observed ~50. That is +the signature of **one frame in flight per round-trip**, not an ASIC +throughput wall. +([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2)) + +Where the per-frame GPU work lands, by path (this is the crux of contention): + +| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame | +|---|---|---|---|---| +| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) | +| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low | +| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium | +| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high | + +Measured magnitude of "RGB vs NV12 to the encoder": +[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/). +NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of +features that **internally use CUDA** +([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)). + +--- + +## 3.
Diagnose first — cheap, decisive, do before any code + +Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM +cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an +actual saturating game. + +1. **Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%: + - `fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling** — the compositor/IDD only produced + 40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F. + - both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention** — the round-trip + is starving. Go to §5.A/B/C. +2. **Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)** — + "Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs + Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS + itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing + frames. +3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall. + +> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's +> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the +> throughput of a saturated single GPU is split between game and host no matter what. + +--- + +## 4. 
Current-state audit (what's shipped / regressed / missing) + +| Area | State | Where | +|---|---|---| +| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ | +| Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ | +| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ | +| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ | +| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` | +| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ | +| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ | +| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ | +| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ | +| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ | +| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ | +| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ | +| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ | +| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ | +| **Clock / P-state pin** | **none** (zero hits repo-wide) | ✗ | +| **Async NVENC (2-thread)** | **none** | ✗ | +| **Frame-source escape (hook/NvFBC-Linux)** | **none** | ✗ | +| **Second-GPU / iGPU encode offload** | **none** | ✗ | +| DSCP/QoS | implemented, 
`PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ | + +--- + +## 5. The levers, ranked, with honest verdicts + +### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win** + +The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB, +forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already +solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and +feeding NV12/P010. **Make IDD-push and Linux do the same.** + +- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the + out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` / + `..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the + out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan` + (`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't + disagree on the format. +- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind + `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`, + `linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already + runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC. +- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine + P010 convert where the VP supports it). + +**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA +CSC — but the convert has to land **off** the SM to fully pay off. 
`VideoProcessorBlt` is *designed* +to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA +doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim +as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm` +columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just +relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends). + +### B. A *correct* async encode pipeline (the untried encoder lever) + +The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit +work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion +event in asynchronous mode, or calling `NvEncLockBitstream` in synchronous mode — should be done in +the **secondary thread**."* +([NVENC prog-guide, threading model](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)) +We do the opposite — submit and blocking-retrieve on **one** thread. Queuing more `pending` entries +(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with **no overlap**, +which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong +implementation, not a disproof. 
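The difference between "queue more on one thread" and "submit on one thread, retrieve on another" can be shown without any NVENC dependency. The following is a minimal simulation under stated assumptions, not the project's encoder code: `InFlight`, `submit`, `retrieve`, and `run_pipeline` are hypothetical names, the encoder is modelled as a fixed delay after submission, and a bounded `std::sync::mpsc::sync_channel` stands in for the surface pool.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one in-flight output buffer.
struct InFlight {
    frame_id: u64,
    /// Simulated completion time (assumption: a fixed delay after submit).
    done_at: Instant,
}

/// Stand-in for the non-blocking submit call: returns immediately.
fn submit(frame_id: u64, encode_time: Duration) -> InFlight {
    InFlight { frame_id, done_at: Instant::now() + encode_time }
}

/// Stand-in for the blocking retrieve call: sleeps until the frame is "done".
fn retrieve(job: InFlight) -> u64 {
    thread::sleep(job.done_at.saturating_duration_since(Instant::now()));
    job.frame_id
}

/// Submits `frames` frames on the calling thread while a dedicated thread
/// does the blocking retrieve. Returns frame ids in retrieval order.
fn run_pipeline(frames: u64, depth: usize, encode_time: Duration) -> Vec<u64> {
    // The bounded queue plays the role of the surface pool.
    let (tx, rx): (SyncSender<InFlight>, Receiver<InFlight>) = sync_channel(depth);

    let retriever = thread::spawn(move || {
        let mut out = Vec::new();
        // The loop ends once the sender is dropped and the queue is drained.
        for job in rx {
            out.push(retrieve(job));
        }
        out
    });

    for id in 0..frames {
        // Blocks only when the queue already holds `depth` entries,
        // never on the completion of an individual frame.
        tx.send(submit(id, encode_time)).expect("retrieve thread exited early");
    }
    drop(tx); // signal end of stream
    retriever.join().expect("retrieve thread panicked")
}
```

Calling `run_pipeline(8, 4, Duration::from_millis(2))` returns the eight frame ids in submission order. The point of the shape is that the submitting thread is throttled only by pool depth. The simulation says nothing about real throughput, which still depends on how much GPU time the scheduler grants.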
+ +The fix: **submit on the capture/encode thread; do `lock_bitstream` on a dedicated retrieve thread; +hold a deep input+output surface pool (≈4–8); on Windows register a `completionEvent` per output +buffer (`enableEncodeAsync=1`) — on Linux async events are unsupported, so use the same two-thread +split with a blocking retrieve.** +([async is Windows/WDDM-only](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html); +FFmpeg models the same knob as `delay`/`async_depth`, +[libavcodec/nvenc.c](https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c)). + +This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice, +and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do +frame N+1's convert. + +**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.** +The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the +scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for +each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is +**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by. +Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`), +and HAGS can spike the *submit* call itself +([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)). + +### C. 
Auto-gated REALTIME GPU scheduling priority + +Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and +Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind +fullscreen games +([OBS commit](https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0), +[Sunshine `display_base.cpp`](https://raw.githubusercontent.com/LizardByte/Sunshine/master/src/platform/windows/display_base.cpp)). +It works **independently of HAGS** (HAGS does *not* reassign cross-process priority — Microsoft: +*"Windows continues to control prioritization"* +[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)). + +We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things +to change: + +- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated + app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.** + The lever is available to us specifically. +- **Gate it to dodge the freeze.** REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a **documented + NVENC hang** (Sunshine ships `nvenc_realtime_hags` to downgrade to HIGH for exactly this; + [Sunshine config](https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2configuration.html), + [NVIDIA repro](https://forums.developer.nvidia.com/t/bug-report-nvenc-encoder-hangs-on-windows-when-using-d3d11-in-real-time-mode/357466)). + Implement the old plan's "Tier 3B": probe HAGS via `D3DKMTQueryAdapterInfo` and VRAM headroom via + `IDXGIAdapter3::QueryVideoMemoryInfo` (continuously); use REALTIME only when HAGS-off, or HAGS-on + with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens. 
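The gating rule in the last bullet can be written as a small pure function. This is a sketch under assumptions, not the shipped logic: `GpuState`, `GpuPriorityClass`, `pick_priority`, and the 15% headroom threshold are all hypothetical (the text above only says "comfortable" headroom). In a real host the inputs would come from the HAGS probe and from the `Budget` / `CurrentUsage` fields that `IDXGIAdapter3::QueryVideoMemoryInfo` reports.

```rust
/// Scheduling class the host would request (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuPriorityClass {
    High,
    Realtime,
}

/// Snapshot of the two inputs the gate needs (hypothetical struct).
#[derive(Debug, Clone, Copy)]
struct GpuState {
    hags_enabled: bool,
    vram_budget_bytes: u64,
    vram_used_bytes: u64,
}

/// Assumption: "comfortable headroom" means at least 15% of the budget is free.
const MIN_HEADROOM_PERCENT: u128 = 15;

/// REALTIME when HAGS is off, or when HAGS is on and VRAM headroom is
/// comfortable; HIGH otherwise. Meant to be re-evaluated continuously so the
/// class drops back to HIGH as soon as VRAM tightens.
fn pick_priority(state: GpuState) -> GpuPriorityClass {
    if !state.hags_enabled {
        return GpuPriorityClass::Realtime;
    }
    let budget = state.vram_budget_bytes as u128;
    let free = state.vram_budget_bytes.saturating_sub(state.vram_used_bytes) as u128;
    // free / budget >= threshold, kept in integers to avoid float rounding.
    if budget > 0 && free * 100 >= budget * MIN_HEADROOM_PERCENT {
        GpuPriorityClass::Realtime
    } else {
        GpuPriorityClass::High
    }
}
```

With HAGS on and a 24 GB budget, 12 GB in use yields `Realtime` and 23 GB in use yields `High`. Keeping the decision pure makes it easy to unit-test separately from the Windows calls that feed it.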
+ +**Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever.** Priority is how +the host *takes* GPU time from the game; it measurably **costs the game fps** +([Doom Eternal 121→60 with Sunshine running](https://github.com/LizardByte/Sunshine/issues/3703)). +That's acceptable for a streaming host (the remote view is the product), but say so plainly and make +the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`). + +### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat + +Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF +`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` + +`low_power=1` VDEnc path (`:226`). Keep them. Two notes: + +- **AMF/QSV suffer contention *worse* than NVENC.** OBS: *"For Intel and AMD GPUs, the hardware + encoder requires significant resources of the same type a 3D app/game requires… different from + NVIDIA's NVENC, which has dedicated encoding circuits"* + ([OBS KB](https://obsproject.com/forum/threads/how-to-debug-encoding-overloaded.168625/)). So on an + AMD/Intel host the collapse is *expected to be harder* — and §G (iGPU offload) is even more + attractive there. +- **The AMF busy-poll floor** (a fixed-sleep `QueryOutput` poll imposes ~15 ms via timer + granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's `QUERY_TIMEOUT` patch); since we + go through libavcodec we inherit it — just **confirm the pinned FFmpeg build includes it**. + ([ffmpeg-devel](https://www.mail-archive.com/ffmpeg-devel@ffmpeg.org/msg170489.html)) + +**Verdict: REAL but largely already captured.** No big win left here except via §G. + +### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix + +NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every +frame — most visible in the *light* scene (the "200-not-240"). 
Pin it: + +- **Windows:** NvAPI per-application DRS `PREFERRED_PSTATE = PREFER_MAX` scoped to our exe (this is + exactly Sunshine's `nvenc_latency_over_power`, + [Sunshine nvprefs](https://github.com/LizardByte/Sunshine/blob/master/src/platform/windows/nvprefs/driver_settings.cpp)). + **Crash-safe undo is mandatory** — persist an undo record to `%ProgramData%\punktfunk\` *before* + applying, revert a stale profile on next start, so a crash never leaves the user's control panel + modified. +- **Linux:** `nvidia-smi -lgc`/NVML `nvmlDeviceSetGpuLockedClocks` (needs root/`CAP_SYS_ADMIN`; query + `nvmlDeviceGetMaxClockInfo`, lock to that, restore on teardown *and* SIGTERM). Plus the newly-added + `CudaNoStablePerfLimit` driver profile — *new in R580/595, so usable on the 595 box* — to defeat + the CUDA "Force P2" memory-clock clamp. +- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default off on battery / Steam Deck** (pinning is harmful + there). + +**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game +already pins P0). Cheap, low risk, do it for the light-scene win. + +### F. Escape the frame-source ceiling — only if §3 says (b) + +If `uniq` is the wall, no encoder/priority work helps — you need a better frame source. + +- **Swapchain-hook capture (the real fix).** Inject a hook on `IDXGISwapChain::Present`/`Present1`, + `vkQueuePresentKHR`, `wglSwapBuffers` and copy the backbuffer to a shared texture *before* the + compositor — OBS Game Capture's mechanism. Sees **every presented frame**, no compose/refresh + gating. + ([OBS dxgi-capture](https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/dxgi-capture.cpp)) + **Tradeoffs are serious:** anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs + whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an + opt-in "game capture" mode, not the default. 
+- **NvFBC:** **not an option on Windows** (dead, §1). On **Linux** it's viable via the consumer + keylase patch and captures below composition — worth a flag for the Linux NVIDIA host. +- **Compose-flip (narrow):** the topmost 1×1 layered-window trick (we already have + `composed_flip.rs`) forces DWM composition and fixes specifically the **DLSS-Frame-Gen** half-rate + case. Adds host-display latency; don't enable globally. +- **WGC "deliver 2× rate":** Apollo sets `MinUpdateInterval = 1e7/(fps*2)` so the pacer always has a + fresh frame to pick ([Apollo](https://github.com/ClassicOldSong/Apollo/pull/785)); we set it to 1× + refresh (`wgc.rs:310`). Cheap tweak to try on the WGC path. + +**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents +frames the game didn't render. + +### G. The honest endgame — encode on a second GPU / the iGPU + +For *demanding* titles that saturate the GPU even when capped, the only thing that **removes** +contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a +**different** GPU — a second dGPU or, more realistically, the **iGPU** (Intel QuickSync / AMD VCN), +which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once, +encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder" +play, and the OBS "second GPU is harmful" verdict does **not** apply — that verdict is about moving +*only the NVENC block*; moving capture + CSC + copies off the gaming GPU genuinely frees it. +([OBS forum](https://obsproject.com/forum/threads/can-you-use-a-2nd-gpu-to-eliminate-encoder-overload.149644/)) + +We're unusually well-placed for this: we already have working AMF and QSV backends +(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. 
The missing piece is a capture/topology +mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one +cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but +it's the only path that lets a demanding game and a clean stream coexist on one machine. + +**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."** +Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session; +the consumer analogue is the iGPU. + +--- + +## 6. Recommended order of attack + +1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)* +2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on; + Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`. +3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it. +4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win. +5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization. +6. **§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall. +7. **§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build. + +After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the +honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM +graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only +*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie +are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps), +or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation. + +--- + +## 7. 
Placebos & dead ends (so we don't re-propose them) + +| Candidate | Verdict | Why | +|---|---|---| +| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) | +| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) | +| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) | +| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) | +| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". 
([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) | +| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. | +| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). | +| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. | +| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. | +| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. | + +--- + +## 8. Open evidence gaps (flagged honestly) + +- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not + confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with + `nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it. +- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is + unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you + whether §5.A alone is enough or whether §5.C is doing the heavy lifting. +- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD + whitepaper; treat the *direction* as solid, the magnitude as TBD. 
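For the on-box measurements these gaps call for, the §3 triage rule (`uniq` vs `fps` against the target) can be encoded as a small classifier so that logged samples are labelled consistently. This is a sketch under assumptions: `Bottleneck`, `classify`, and the 90% "close enough" tolerance are hypothetical and would need tuning against real `PUNKTFUNK_PERF=1` output.

```rust
/// Which bottleneck a (target, fps, uniq) sample points at, following §3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Bottleneck {
    /// (a) both `fps` and `uniq` collapsed together.
    FeedContention,
    /// (b) `fps` holds the target but `uniq` does not.
    SourceCeiling,
    /// Both counters are at the target.
    Healthy,
    /// Anything else; take more samples.
    Inconclusive,
}

/// Assumption: a counter within 90% of its reference counts as "matching".
const NEAR: f64 = 0.90;

fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bottleneck {
    if target_fps <= 0.0 {
        return Bottleneck::Inconclusive;
    }
    let fps_on_target = fps >= NEAR * target_fps;
    let uniq_on_target = uniq >= NEAR * target_fps;
    let uniq_tracks_fps = uniq >= NEAR * fps;

    match (fps_on_target, uniq_on_target, uniq_tracks_fps) {
        (true, true, _) => Bottleneck::Healthy,
        // Sends hit the target, but most of them are held re-encodes.
        (true, false, _) => Bottleneck::SourceCeiling,
        // The send rate itself collapsed and unique frames follow it.
        (false, _, true) => Bottleneck::FeedContention,
        (false, _, false) => Bottleneck::Inconclusive,
    }
}
```

A sample of `(240, 240, 45)` classifies as `SourceCeiling`, and `(240, 50, 50)` as `FeedContention`. The classifier only labels counter readings. It does not replace the PresentMon check, which is what tells a source ceiling apart from a game that is itself GPU-bound.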
diff --git a/docs/host-latency-plan.md b/docs/host-latency-plan.md index b415e41..92cf7a8 100644 --- a/docs/host-latency-plan.md +++ b/docs/host-latency-plan.md @@ -1,5 +1,14 @@ # Host latency & the GPU-contention collapse — analysis + prioritized plan +> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).** +> That follow-up re-verified this plan against the current code and overturned several specifics: +> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it +> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks +> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline; +> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on +> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a +> useful record. + Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the "saturating game starves the stream" headache: diff --git a/packaging/windows/pack-host-installer.ps1 b/packaging/windows/pack-host-installer.ps1 index cafdb9d..705e6fe 100644 --- a/packaging/windows/pack-host-installer.ps1 +++ b/packaging/windows/pack-host-installer.ps1 @@ -141,7 +141,7 @@ $defines = @( ) # --- stage the pf-vdisplay virtual-display driver bundle -------------------------------------- -# pf-vdisplay is our all-Rust IddCx driver (packaging/windows/vdisplay-driver/), vendored signed under +# pf-vdisplay is our all-Rust IddCx driver (packaging/windows/drivers/), vendored signed under # packaging/windows/pf-vdisplay/. It replaced the vendored SudoVDA C++ driver. 
if (-not $NoDriver) { $stage = Join-Path $OutDir 'stage' diff --git a/packaging/windows/stage-pf-vdisplay.ps1 b/packaging/windows/stage-pf-vdisplay.ps1 index ecd4d1f..5497f29 100644 --- a/packaging/windows/stage-pf-vdisplay.ps1 +++ b/packaging/windows/stage-pf-vdisplay.ps1 @@ -4,11 +4,11 @@ driver + the fetched nefcon device tool. .DESCRIPTION - pf-vdisplay (our all-Rust IddCx virtual display) is built from packaging/windows/vdisplay-driver/, and + pf-vdisplay (our all-Rust IddCx virtual display) is built from packaging/windows/drivers/, and the SIGNED output (pf_vdisplay.dll/.inf/.cat + punktfunk-driver.cer) is VENDORED under packaging/windows/pf-vdisplay/ (signer punktfunk-ds-test — shared with the gamepad drivers — Class= Display, HWID root\pf_vdisplay). Rebuild + re-vendor with - packaging/windows/vdisplay-driver/deploy-dev.ps1 when the driver source changes, then copy the staged + packaging/windows/drivers/deploy-dev.ps1 when the driver source changes, then copy the staged pf_vdisplay.{dll,inf,cat} over the vendored copies. nefcon publishes a pinned release, so we fetch + SHA-256-verify it (it provides nefconc.exe, used to create the root-enumerated device node — pnputil can't). @@ -36,7 +36,7 @@ New-Item -ItemType Directory -Force -Path $OutDir | Out-Null # --- vendored pf-vdisplay driver -------------------------------------------------------------- $inf = Get-ChildItem -Path $VendorDir -Filter pf_vdisplay.inf -ErrorAction SilentlyContinue | Select-Object -First 1 -if (-not $inf) { throw "no vendored pf_vdisplay.inf under $VendorDir — re-vendor via vdisplay-driver/deploy-dev.ps1" } +if (-not $inf) { throw "no vendored pf_vdisplay.inf under $VendorDir — re-vendor via drivers/deploy-dev.ps1" } Copy-Item (Join-Path $VendorDir '*') $OutDir -Force Write-Host "==> vendored pf-vdisplay staged from $VendorDir"