wip: host latency/GPU-contention notes + Windows packaging tweaks

Pre-existing working-tree changes committed to the branch on request: the gpu-contention investigation doc, host-latency-plan additions, and small pack-host-installer / stage-pf-vdisplay packaging-script edits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 06:53:09 +00:00
parent 739fa74e68
commit 6b3cbce120
4 changed files with 443 additions and 4 deletions
@@ -0,0 +1,430 @@
+# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)
+
+> The headache, stated precisely:
+> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
+> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
+> game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light
+> titles like CS2). **Capping is not an acceptable fix** — demanding titles exhaust the GPU even
+> when capped.
+
+This is the second, deeper pass on the problem. The first pass is
+[`host-latency-plan.md`](host-latency-plan.md) (a 25-agent investigation, 2026-06-18). **This doc
+supersedes several of that doc's conclusions** — the codebase moved a lot in the week since
+(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
+GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
+two of the old plan's premises. Read §1 (corrections) before acting on the old doc.
+
+Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
+mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
+their own adversarial verifiers. Every external claim below carries a source URL; every code claim
+carries a current `file:line`.
+
+---
+
+## 0. TL;DR — the corrected mental model and the action list
+
+**The governing fact:** NVENC is a **dedicated ASIC on its own GPU runlist**, physically separate
+from the SM/CUDA/graphics cores a 3D game saturates. The game does **not** steal the encode block.
+It steals everything that *feeds* the block — capture-acquire, the **RGB→YUV colour-convert**, the
+copy into the encoder's input surface, the readback — **and the GPU-scheduler time** to run that
+feed work, which is queued behind the game's graphics context.
+([NVENC app-note](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html),
+[engine-table proof, UNC RTAS'24](https://www.cs.unc.edu/~jbakita/rtas24.pdf))
+
+**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart
+before writing code:**
+
+| Bottleneck | Symptom | Fix family |
+|---|---|---|
+| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
+| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |
+
+**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work
+either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works
+only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping
+equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which
+costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game
+politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7).
+
+**Action list, highest leverage first** (detail in §5–§6):
+
+1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
+   mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
+2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
+   BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
+   the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
+3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
+   deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
+   implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
+   tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
+4. **Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't).
+   Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
+5. **Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the
+   collapse). (§5.E)
+6. **If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose
+   ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
+7. **The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach
+   that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)
+
+---
+
+## 1. Corrections to `host-latency-plan.md` (read before reusing it)
+
+The old doc was right about the shape but several specifics are now wrong or stale:
+
+- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
+  DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
+  **RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
+  *regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
+- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
+  only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
+  parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
+  native path; GameStream and the WGC helper are hardcoded depth-1.
+- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that
+  produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread** —
+  it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with
+  **zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on
+  *separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B)
+- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right.
+  Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU
+  context does preempt a saturating graphics context at pixel granularity** — that is precisely how
+  NVIDIA VR Async-TimeWarp injects a frame into a busy game
+  ([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we
+  default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C)
+- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The
+  "half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact**
+  (FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal
+  fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a
+  general lever. ([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676),
+  [Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621))
+- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at
+  Capture SDK 7.1 / Win10-1803
+  ([NVIDIA deprecation bulletin](https://developer.download.nvidia.com/designworks/capture-sdk/docs/NVFBC_Win10_Deprecation_Tech_Bulletin.pdf)).
+  Linux-only, and there only via the consumer `keylase` patch.
+
+What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling
+is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the
+honest residual ceiling at 100% GPU. Those carry forward.
+
+---
+
+## 2. How the pipeline actually serializes today (verified against current code)
+
+The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
+`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
+**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
+near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** —
+which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
+`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
+diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
+stall"* (`punktfunk1.rs:2466-2468`).
+
+The encode round-trip (NVENC, the dominant path):
+
+- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
+  pushes onto a `pending` FIFO.
+- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
+  completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
+- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.
+
+So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
+hand AU to the send thread`. The arithmetic matches the symptom — `1000/17 ≈ 59` and `1000/13 ≈ 77`
+fps bracket the observed ~50, the signature of **one frame in flight per round-trip**, not an ASIC
+throughput wall.
+([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
+
+Where the per-frame GPU work lands, by path (this is the crux of contention):
+
+| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
+|---|---|---|---|---|
+| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
+| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
+| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
+| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
+
+Measured magnitude of "RGB vs NV12 to the encoder":
+[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
+NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of
+features that **internally use CUDA**
+([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)).
+
+---
+
+## 3. Diagnose first — cheap, decisive, do before any code
+
+Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM
+cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an
+actual saturating game.
+
+1. **Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%:
+   - `fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling** — the compositor/IDD only produced
+     40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
+   - both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention** — the round-trip
+     is starving. Go to §5.A/B/C.
+2. **Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)** —
+   "Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs
+   Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
+   itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
+   frames.
+3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.
+
+> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
+> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
+> throughput of a saturated single GPU is split between game and host no matter what.
+
+---
+
+## 4. Current-state audit (what's shipped / regressed / missing)
+
+| Area | State | Where |
+|---|---|---|
+| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ |
+| Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
+| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
+| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
+| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
+| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
+| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
+| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
+| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
+| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
+| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
+| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
+| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
+| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ |
+| **Clock / P-state pin** | **none** (zero hits repo-wide) | ✗ |
+| **Async NVENC (2-thread)** | **none** | ✗ |
+| **Frame-source escape (hook/NvFBC-Linux)** | **none** | ✗ |
+| **Second-GPU / iGPU encode offload** | **none** | ✗ |
+| DSCP/QoS | implemented, `PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ |
+
+---
+
+## 5. The levers, ranked, with honest verdicts
+
+### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**
+
+The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
+forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
+solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
+feeding NV12/P010. **Make IDD-push and Linux do the same.**
+
+- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
+  out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
+  `..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
+  out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
+  (`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
+  disagree on the format.
+- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
+  `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
+  `linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
+  runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
+- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
+  P010 convert where the VP supports it).
+
+**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
+CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
+to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
+doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
+as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
+columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
+relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).
+
+### B. A *correct* async encode pipeline (the untried encoder lever)
+
+The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
+work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
+event in asynchronous mode, or calling `NvEncLockBitstream` in synchronous mode — should be done in
+the **secondary thread**."*
+([NVENC prog-guide, threading model](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html))
+We do the opposite — submit and blocking-retrieve on **one** thread. Queuing more `pending` entries
+(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with **no overlap**,
+which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
+implementation, not a disproof.
+
+The fix: **submit on the capture/encode thread; do `lock_bitstream` on a dedicated retrieve thread;
+hold a deep input+output surface pool (≈4–8); on Windows register a `completionEvent` per output
+buffer (`enableEncodeAsync=1`) — on Linux async events are unsupported, so use the same two-thread
+split with a blocking retrieve.**
+([async is Windows/WDDM-only](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html);
+FFmpeg models the same knob as `delay`/`async_depth`,
+[libavcodec/nvenc.c](https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c)).
+
+This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice,
+and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do
+frame N+1's convert.
+
+**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.**
+The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the
+scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for
+each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
+**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
+Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`),
+and HAGS can spike the *submit* call itself
+([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).
+
+### C. Auto-gated REALTIME GPU scheduling priority
+
+Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
+Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
+fullscreen games
+([OBS commit](https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0),
+[Sunshine `display_base.cpp`](https://raw.githubusercontent.com/LizardByte/Sunshine/master/src/platform/windows/display_base.cpp)).
+It works **independently of HAGS** (HAGS does *not* reassign cross-process priority — Microsoft:
+*"Windows continues to control prioritization"*
+[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).
+
+We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
+to change:
+
+- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
+  app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
+  The lever is available to us specifically.
+- **Gate it to dodge the freeze.** REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a **documented
+  NVENC hang** (Sunshine ships `nvenc_realtime_hags` to downgrade to HIGH for exactly this;
+  [Sunshine config](https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2configuration.html),
+  [NVIDIA repro](https://forums.developer.nvidia.com/t/bug-report-nvenc-encoder-hangs-on-windows-when-using-d3d11-in-real-time-mode/357466)).
+  Implement the old plan's "Tier 3B": probe HAGS via `D3DKMTQueryAdapterInfo` and VRAM headroom via
+  `IDXGIAdapter3::QueryVideoMemoryInfo` (continuously); use REALTIME only when HAGS-off, or HAGS-on
+  with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
+
+**Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever.** Priority is how
+the host *takes* GPU time from the game; it measurably **costs the game fps**
+([Doom Eternal 121→60 with Sunshine running](https://github.com/LizardByte/Sunshine/issues/3703)).
+That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
+the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).
+
+### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat
+
+Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
+`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
+`low_power=1` VDEnc path (`:226`). Keep them. Two notes:
+
+- **AMF/QSV suffer contention *worse* than NVENC.** OBS: *"For Intel and AMD GPUs, the hardware
+  encoder requires significant resources of the same type a 3D app/game requires… different from
+  NVIDIA's NVENC, which has dedicated encoding circuits"*
+  ([OBS KB](https://obsproject.com/forum/threads/how-to-debug-encoding-overloaded.168625/)). So on an
+  AMD/Intel host the collapse is *expected to be harder* — and §G (iGPU offload) is even more
+  attractive there.
+- **The AMF busy-poll floor** (a fixed-sleep `QueryOutput` poll imposes ~15 ms via timer
+  granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's `QUERY_TIMEOUT` patch); since we
+  go through libavcodec we inherit it — just **confirm the pinned FFmpeg build includes it**.
+  ([ffmpeg-devel](https://www.mail-archive.com/ffmpeg-devel@ffmpeg.org/msg170489.html))
+
+**Verdict: REAL but largely already captured.** No big win left here except via §G.
+
+### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix
+
+NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
+frame — most visible in the *light* scene (the "200-not-240"). Pin it:
+
+- **Windows:** NvAPI per-application DRS `PREFERRED_PSTATE = PREFER_MAX` scoped to our exe (this is
+  exactly Sunshine's `nvenc_latency_over_power`,
+  [Sunshine nvprefs](https://github.com/LizardByte/Sunshine/blob/master/src/platform/windows/nvprefs/driver_settings.cpp)).
+  **Crash-safe undo is mandatory** — persist an undo record to `%ProgramData%\punktfunk\` *before*
+  applying, revert a stale profile on next start, so a crash never leaves the user's control panel
+  modified.
+- **Linux:** `nvidia-smi -lgc`/NVML `nvmlDeviceSetGpuLockedClocks` (needs root/`CAP_SYS_ADMIN`; query
+  `nvmlDeviceGetMaxClockInfo`, lock to that, restore on teardown *and* SIGTERM). Plus the newly-added
+  `CudaNoStablePerfLimit` driver profile — *new in R580/595, so usable on the 595 box* — to defeat
+  the CUDA "Force P2" memory-clock clamp.
+- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default off on battery / Steam Deck** (pinning is harmful
+  there).
+
+**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
+already pins P0). Cheap, low risk, do it for the light-scene win.
+
+### F. Escape the frame-source ceiling — only if §3 says (b)
+
+If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.
+
+- **Swapchain-hook capture (the real fix).** Inject a hook on `IDXGISwapChain::Present`/`Present1`,
+  `vkQueuePresentKHR`, `wglSwapBuffers` and copy the backbuffer to a shared texture *before* the
+  compositor — OBS Game Capture's mechanism. Sees **every presented frame**, no compose/refresh
+  gating.
+  ([OBS dxgi-capture](https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/dxgi-capture.cpp))
+  **Tradeoffs are serious:** anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs
+  whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an
+  opt-in "game capture" mode, not the default.
+- **NvFBC:** **not an option on Windows** (dead, §1). On **Linux** it's viable via the consumer
+  keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
+- **Compose-flip (narrow):** the topmost 1×1 layered-window trick (we already have
+  `composed_flip.rs`) forces DWM composition and fixes specifically the **DLSS-Frame-Gen** half-rate
+  case. Adds host-display latency; don't enable globally.
+- **WGC "deliver 2× rate":** Apollo sets `MinUpdateInterval = 1e7/(fps*2)` so the pacer always has a
+  fresh frame to pick ([Apollo](https://github.com/ClassicOldSong/Apollo/pull/785)); we set it to 1×
+  refresh (`wgc.rs:310`). Cheap tweak to try on the WGC path.
+
+**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
+frames the game didn't render.
+
+### G. The honest endgame — encode on a second GPU / the iGPU
+
+For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
+contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
+**different** GPU — a second dGPU or, more realistically, the **iGPU** (Intel QuickSync / AMD VCN),
+which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once,
+encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder"
+play, and the OBS "second GPU is harmful" verdict does **not** apply — that verdict is about moving
+*only the NVENC block*; moving capture + CSC + copies off the gaming GPU genuinely frees it.
+([OBS forum](https://obsproject.com/forum/threads/can-you-use-a-2nd-gpu-to-eliminate-encoder-overload.149644/))
+
+We're unusually well-placed for this: we already have working AMF and QSV backends
+(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. The missing piece is a capture/topology
+mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
+cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
+it's the only path that lets a demanding game and a clean stream coexist on one machine.
+
+**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."**
+Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session;
+the consumer analogue is the iGPU.
+
+---
+
+## 6. Recommended order of attack
+
+1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
+2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
+   Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
+3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
+4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
+5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
+6. **§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall.
+7. **§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build.
+
+After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the
+honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM
+graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only
+*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie
+are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps),
+or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.
+
+---
+
+## 7. Placebos & dead ends (so we don't re-propose them)
+
+| Candidate | Verdict | Why |
+|---|---|---|
+| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) |
+| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) |
+| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) |
+| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
+| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". ([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) |
+| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
+| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
+| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
+| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
+| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |
+
+---
+
+## 8. Open evidence gaps (flagged honestly)
+
+- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
+  confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
+  `nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
+- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
+  unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
+  whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
+- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
+  whitepaper; treat the *direction* as solid, the magnitude as TBD.
@@ -1,5 +1,14 @@
 # Host latency & the GPU-contention collapse — analysis + prioritized plan

+> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
+> That follow-up re-verified this plan against the current code and overturned several specifics:
+> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
+> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
+> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
+> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
+> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a
+> useful record.
+
 Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
 "saturating game starves the stream" headache: