
# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)

> The headache, stated precisely:
> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
> game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light
> titles like CS2). **Capping is not an acceptable fix** — demanding titles exhaust the GPU even
> when capped.

This is the second, deeper pass on the problem. The first pass is
[`host-latency-plan.md`](host-latency-plan.md) (a 25-agent investigation, 2026-06-18). **This doc
supersedes several of that doc's conclusions.** The codebase changed substantially in the week since
(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
GPU-priority knob became configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.

Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
carries a current `file:line`.

---

## 0. TL;DR — the corrected mental model and the action list

**The governing fact:** NVENC is a **dedicated ASIC on its own GPU runlist**, physically separate
from the SM/CUDA/graphics cores a 3D game saturates. The game does **not** steal the encode block.
It steals everything that *feeds* the block — capture-acquire, the **RGB→YUV colour-convert**, the
copy into the encoder's input surface, the readback — **and the GPU-scheduler time** to run that
feed work, which is queued behind the game's graphics context.
([NVENC app-note](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html),
[engine-table proof, UNC RTAS'24](https://www.cs.unc.edu/~jbakita/rtas24.pdf))

**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart
before writing code:**

| Bottleneck | Symptom | Fix family |
|---|---|---|
| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |

**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work
either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works
only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping
equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which
costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game
politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7).

**Action list, highest leverage first** (detail in §5; §6 re-sorts these into a build order):

1. **Diagnose first** (§3). Read `uniq` vs `fps` under the real workload and check the PresentMon
   presentation mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints
   the counter.
2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
   BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
   the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
   deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
   implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
   tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
4. **Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't).
   Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
5. **Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the
   collapse). (§5.E)
6. **If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose
   ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
7. **The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach
   that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)

---

## 1. Corrections to `host-latency-plan.md` (read before reusing it)

The old doc was right about the shape but several specifics are now wrong or stale:

- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
  DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
  **RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
  *regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
  only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
  parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
  native path; GameStream and the WGC helper are hardcoded depth-1.
- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that
  produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread** —
  it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with
  **zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on
  *separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B)
- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right.
  Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU
  context does preempt a saturating graphics context at pixel granularity** — that is precisely how
  NVIDIA VR Async-TimeWarp injects a frame into a busy game
  ([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we
  default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C)
- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The
  "half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact**
  (FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal
  fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a
  general lever. ([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676),
  [Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621))
- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at
  Capture SDK 7.1 / Win10-1803
  ([NVIDIA deprecation bulletin](https://developer.download.nvidia.com/designworks/capture-sdk/docs/NVFBC_Win10_Deprecation_Tech_Bulletin.pdf)).
  It survives on Linux only, and on consumer GeForce cards only via the `keylase` patch.

What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling
is real and upstream of encode; split-encode is a pixel-rate lever, not a contention lever; and an
honest residual ceiling remains at 100% GPU. Those carry forward.

---

## 2. How the pipeline actually serializes today (verified against current code)

The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** —
which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
stall"* (`punktfunk1.rs:2466-2468`).
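
The pacer behaviour described above can be modelled with a short, self-contained simulation. This is a sketch, not the code in `stream.rs`: `simulate_pacer` and `PacerStats` are hypothetical names, and the source is idealised as completing one frame at a fixed interval.

```rust
/// Counters a fixed-cadence pacer would report over a run.
#[derive(Debug, PartialEq)]
pub struct PacerStats {
    /// Access units sent (what the raw `fps` counter sees).
    pub sent: u64,
    /// Sends that carried a frame not sent before (the `uniq` counter).
    pub unique: u64,
}

/// Simulates a pacer ticking at `target_fps` for `duration_us` against a
/// source that completes a new frame every `source_interval_us`.
pub fn simulate_pacer(target_fps: u64, source_interval_us: u64, duration_us: u64) -> PacerStats {
    let ticks = duration_us * target_fps / 1_000_000;
    let mut stats = PacerStats { sent: 0, unique: 0 };
    let mut last_sent: Option<u64> = None;
    for k in 0..ticks {
        let now_us = k * 1_000_000 / target_fps;
        // Stand-in for try_latest(): index of the newest frame finished by now.
        let latest = now_us / source_interval_us;
        if last_sent != Some(latest) {
            stats.unique += 1; // a fresh frame arrived since the last tick
            last_sent = Some(latest);
        }
        // Fresh or held, one access unit goes out on every tick.
        stats.sent += 1;
    }
    stats
}
```

With a 240 fps target and a source that finishes a frame every 20 ms, the model sends 240 access units in one second of which 50 are new, which is the `fps`≈240 / `uniq`≈50 split that makes the raw counter misleading.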

The encode round-trip (NVENC, the dominant path):

- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
  pushes onto a `pending` FIFO.
- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
  completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.

So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
hand AU to the send thread`. The arithmetic matches the symptom: an `encode_ms` of 13–17 alone caps
the loop at `1000/17 ≈ 59` to `1000/13 ≈ 77` fps, and the observed ~50 fps (a 20 ms round-trip)
leaves only 3–7 ms per frame for whatever else runs serially around the encode wait (capture, copy,
hand-off). That is the signature of **one frame in flight per round-trip**, not an ASIC
throughput wall.
([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
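
The round-trip arithmetic can be checked with a small sketch. The function names are hypothetical, and the 3 ms "feed" stage used in the note below is an illustrative assumption, not a measurement; only the 13–17 ms `encode_ms` range comes from the observations above.

```rust
/// Frame-rate ceiling of a strictly serial loop: every frame pays each stage
/// back-to-back before the next frame may start.
pub fn serial_fps(stage_ms: &[f64]) -> f64 {
    1000.0 / stage_ms.iter().sum::<f64>()
}

/// Ceiling when stages overlap across frames: the slowest stage sets the rate.
/// This is an ideal bound; it assumes each stage gets GPU time when it is ready.
pub fn overlapped_fps(stage_ms: &[f64]) -> f64 {
    let slowest = stage_ms.iter().cloned().fold(0.0_f64, f64::max);
    1000.0 / slowest
}
```

`serial_fps(&[17.0])` is about 59 and `serial_fps(&[13.0])` about 77, while `serial_fps(&[3.0, 17.0])` is exactly 50: a few milliseconds of serial work around a 17 ms encode wait is enough to land on the observed figure. `overlapped_fps(&[3.0, 17.0])` returns about 59, the most that overlap alone could recover in this toy model.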

Where the per-frame GPU work lands, by path (this is the crux of contention):

| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |

Measured magnitude of "RGB vs NV12 to the encoder":
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of
features that **internally use CUDA**
([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)).

---

## 3. Diagnose first — cheap, decisive, do before any code

Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM
cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an
actual saturating game.

1. **Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%:
   - `fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling** — the compositor/IDD only produced
     40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
   - both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention** — the round-trip
     is starving. Go to §5.A/B/C.
2. **Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)** —
   "Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs
   Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
   itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
   frames.
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.

> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
> throughput of a saturated single GPU is split between game and host no matter what.
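
The decision rule in steps 1 and 2 can be written down as a pure function. This is a sketch under stated assumptions: `Verdict` and `classify` are hypothetical names, `presented_fps` is the figure read off PresentMon by hand, and the 90% "close enough" threshold is arbitrary rather than tuned.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// (a) the send rate itself collapsed: the encode round-trip is starved.
    FeedContention,
    /// (b) the send rate holds, but far fewer frames are new than the game presented.
    SourceCeiling,
    /// `uniq` follows what the game presents (capped by the target): nothing
    /// downstream of the game is losing frames.
    TracksSource,
}

pub fn classify(target_fps: f64, fps: f64, uniq: f64, presented_fps: f64) -> Verdict {
    // ASSUMPTION: 0.9 is a placeholder tolerance, not a measured threshold.
    const NEAR: f64 = 0.9;
    if fps < NEAR * target_fps {
        return Verdict::FeedContention;
    }
    // The stream can never carry more unique frames than the game presented.
    let deliverable = target_fps.min(presented_fps);
    if uniq < NEAR * deliverable {
        Verdict::SourceCeiling
    } else {
        Verdict::TracksSource
    }
}
```

For example, `classify(240.0, 240.0, 45.0, 140.0)` yields `SourceCeiling`, `classify(240.0, 50.0, 50.0, 140.0)` yields `FeedContention`, and `classify(240.0, 240.0, 50.0, 50.0)` yields `TracksSource`, the case where the game itself only presented 50 frames and no capture fix applies.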

---

## 4. Current-state audit (what's shipped / regressed / missing)

| Area | State | Where |
|---|---|---|
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ |
| Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ |
| **Clock / P-state pin** | **none** (zero hits repo-wide) | ✗ |
| **Async NVENC (2-thread)** | **none** | ✗ |
| **Frame-source escape (hook/NvFBC-Linux)** | **none** | ✗ |
| **Second-GPU / iGPU encode offload** | **none** | ✗ |
| DSCP/QoS | implemented, `PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ |

---

## 5. The levers, ranked, with honest verdicts

### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**

The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
feeding NV12/P010. **Make IDD-push and Linux do the same.**

- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
  out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
  `..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
  out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
  (`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
  disagree on the format.
- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
  `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
  `linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
  runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
  P010 convert where the VP supports it).
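
One way to keep capture and encode from disagreeing is to derive both sides from a single function. The sketch below is not the real `SessionPlan`; every type and function name in it is hypothetical, and it encodes the *proposed* state (all paths feed YUV), not what ships today. Only the two `NV_ENC_BUFFER_FORMAT_*` identifiers are taken from the NVENC headers.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CapturePath {
    IddPush,
    Wgc,
    Dda,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EncoderInput {
    Nv12,
    P010,
}

impl EncoderInput {
    /// Name of the matching NVENC buffer format.
    pub fn nvenc_format(self) -> &'static str {
        match self {
            EncoderInput::Nv12 => "NV_ENC_BUFFER_FORMAT_NV12",
            EncoderInput::P010 => "NV_ENC_BUFFER_FORMAT_YUV420_10BIT",
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct FormatPlan {
    pub input: EncoderInput,
    /// Whether capture still needs its own copy in addition to the convert.
    pub needs_separate_copy: bool,
}

/// Single owner of the format decision: both capture and encode read this.
pub fn plan_format(path: CapturePath, hdr: bool) -> FormatPlan {
    let input = if hdr { EncoderInput::P010 } else { EncoderInput::Nv12 };
    let needs_separate_copy = match path {
        // Proposed: the convert writes the out-ring, replacing CopyResource.
        CapturePath::IddPush => false,
        // Encodes the pool texture in place.
        CapturePath::Wgc => false,
        // Keeps one copy so the duplication surface is released quickly.
        CapturePath::Dda => true,
    };
    FormatPlan { input, needs_separate_copy }
}
```

Because the encoder's buffer format is computed from the same value the capture side converts into, a mismatch would have to be introduced deliberately rather than by two call sites drifting apart.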

**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).

### B. A *correct* async encode pipeline (the untried encoder lever)

The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
event in asynchronous mode, or calling `NvEncLockBitstream` in synchronous mode — should be done in
the **secondary thread**."*
([NVENC prog-guide, threading model](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html))
We do the opposite — submit and blocking-retrieve on **one** thread. Queuing more `pending` entries
(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with **no overlap**,
which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
implementation, not a disproof.

The fix: **submit on the capture/encode thread; do `lock_bitstream` on a dedicated retrieve thread;
hold a deep input+output surface pool (≈4–8); on Windows register a `completionEvent` per output
buffer (`enableEncodeAsync=1`) — on Linux async events are unsupported, so use the same two-thread
split with a blocking retrieve.**
([async is Windows/WDDM-only](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html);
FFmpeg models the same knob as `delay`/`async_depth`,
[libavcodec/nvenc.c](https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c)).

This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice,
and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do
frame N+1's convert.
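
The thread split can be sketched with the standard library alone. This is a shape illustration, not our encoder: `Pending`, `submit`, `retrieve` and `run_pipeline` are hypothetical stand-ins, the real `encode_picture` / `lock_bitstream` calls are mocked, and a bounded channel plays the role of the surface pool.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical handle for a frame the encoder accepted but has not finished.
pub struct Pending {
    pub index: u32,
}

/// Stand-in for the non-blocking `encode_picture` launch.
fn submit(index: u32) -> Pending {
    Pending { index }
}

/// Stand-in for the blocking `lock_bitstream`; a real one waits on the ASIC.
fn retrieve(p: Pending) -> (u32, Vec<u8>) {
    (p.index, vec![0u8; 4])
}

/// Pushes `frames` frames through a submit side (the calling thread) and a
/// dedicated retrieve thread, returning access units in output order.
pub fn run_pipeline(frames: u32, depth: usize) -> Vec<(u32, Vec<u8>)> {
    // `send` blocks once `depth` submissions are queued, so the channel
    // capacity acts as back-pressure in place of a surface pool.
    let (tx, rx) = sync_channel::<Pending>(depth);

    // Retrieve thread: the only place that blocks on encode completion.
    let retriever = thread::spawn(move || {
        let mut out = Vec::new();
        for pending in rx {
            out.push(retrieve(pending));
        }
        out
    });

    // Submit side: waits for pool space, never for an encode to finish.
    for i in 0..frames {
        tx.send(submit(i)).expect("retrieve thread exited early");
    }
    drop(tx); // closing the channel ends the retrieve loop

    retriever.join().expect("retrieve thread panicked")
}
```

The design point is that the submit loop never calls the blocking retrieve, so a slow completion delays output but not the next submission until the pool is exhausted. Output order is preserved because the channel is FIFO, matching the `pending` queue's ordering.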

**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.**
The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`),
and HAGS can spike the *submit* call itself
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).

### C. Auto-gated REALTIME GPU scheduling priority

Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
fullscreen games
([OBS commit](https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0),
[Sunshine `display_base.cpp`](https://raw.githubusercontent.com/LizardByte/Sunshine/master/src/platform/windows/display_base.cpp)).
It works **independently of HAGS** (HAGS does *not* reassign cross-process priority — Microsoft:
*"Windows continues to control prioritization"*
[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).

We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
to change:

- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
  app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
  The lever is available to us specifically.
- **Gate it to dodge the freeze.** REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a **documented
  NVENC hang** (Sunshine ships `nvenc_realtime_hags` to downgrade to HIGH for exactly this;
  [Sunshine config](https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2configuration.html),
  [NVIDIA repro](https://forums.developer.nvidia.com/t/bug-report-nvenc-encoder-hangs-on-windows-when-using-d3d11-in-real-time-mode/357466)).
  Implement the old plan's "Tier 3B": probe HAGS via `D3DKMTQueryAdapterInfo` and VRAM headroom via
  `IDXGIAdapter3::QueryVideoMemoryInfo` (continuously); use REALTIME only when HAGS-off, or HAGS-on
  with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
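
The gate itself is a small pure function once the two probes have produced their values. The following is a sketch only: `GpuPriority`, `GateInputs` and `choose_priority` are hypothetical names, the probes are assumed to be done elsewhere, and the 15% figure for "comfortable headroom" is a placeholder that would need tuning on real hardware.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the two probes, refreshed continuously by the caller.
pub struct GateInputs {
    /// Whether hardware-accelerated GPU scheduling is enabled.
    pub hags_on: bool,
    /// Video-memory budget in bytes.
    pub vram_budget: u64,
    /// Video memory currently in use, in bytes.
    pub vram_usage: u64,
}

pub fn choose_priority(g: &GateInputs) -> GpuPriority {
    // ASSUMPTION: 15 % free is an illustrative threshold, not a tested one.
    const MIN_FREE_PERCENT: u128 = 15;

    // Without HAGS the documented hang condition is not met.
    if !g.hags_on {
        return GpuPriority::Realtime;
    }
    // An unknown budget is treated as "no headroom" to stay on the safe side.
    if g.vram_budget == 0 {
        return GpuPriority::High;
    }
    let free = g.vram_budget.saturating_sub(g.vram_usage) as u128;
    // Integer form of free / budget >= 15 %.
    if free * 100 >= g.vram_budget as u128 * MIN_FREE_PERCENT {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Calling this on every probe refresh gives the "downgrade the instant VRAM tightens" behaviour; a production version would probably add hysteresis so the class does not flap around the threshold.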

**Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever.** Priority is how
the host *takes* GPU time from the game; it measurably **costs the game fps**
([Doom Eternal 121→60 with Sunshine running](https://github.com/LizardByte/Sunshine/issues/3703)).
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).

### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat

Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
`low_power=1` VDEnc path (`:226`). Keep them. Two notes:

- **AMF/QSV suffer contention *worse* than NVENC.** OBS: *"For Intel and AMD GPUs, the hardware
  encoder requires significant resources of the same type a 3D app/game requires… different from
  NVIDIA's NVENC, which has dedicated encoding circuits"*
  ([OBS forum](https://obsproject.com/forum/threads/how-to-debug-encoding-overloaded.168625/)). So on an
  AMD/Intel host the collapse is *expected to be worse*, and §G (iGPU offload) is even more
  attractive there.
- **The AMF busy-poll floor** (a fixed-sleep `QueryOutput` poll imposes ~15 ms via timer
  granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's `QUERY_TIMEOUT` patch); since we
  go through libavcodec we inherit it — just **confirm the pinned FFmpeg build includes it**.
  ([ffmpeg-devel](https://www.mail-archive.com/ffmpeg-devel@ffmpeg.org/msg170489.html))

**Verdict: REAL but largely already captured.** No big win left here except via §G.

### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix

NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
frame — most visible in the *light* scene (the "200-not-240"). Pin it:

- **Windows:** NvAPI per-application DRS `PREFERRED_PSTATE = PREFER_MAX` scoped to our exe (this is
  exactly Sunshine's `nvenc_latency_over_power`,
  [Sunshine nvprefs](https://github.com/LizardByte/Sunshine/blob/master/src/platform/windows/nvprefs/driver_settings.cpp)).
  **Crash-safe undo is mandatory** — persist an undo record to `%ProgramData%\punktfunk\` *before*
  applying, revert a stale profile on next start, so a crash never leaves the user's control panel
  modified.
- **Linux:** `nvidia-smi -lgc`/NVML `nvmlDeviceSetGpuLockedClocks` (needs root/`CAP_SYS_ADMIN`; query
  `nvmlDeviceGetMaxClockInfo`, lock to that, restore on teardown *and* SIGTERM). Plus the newly-added
  `CudaNoStablePerfLimit` driver profile — *new in R580/595, so usable on the 595 box* — to defeat
  the CUDA "Force P2" memory-clock clamp.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default off on battery / Steam Deck** (pinning is harmful
  there).
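
The crash-safe undo requirement boils down to "persist the old value before touching anything, revert on the next start if the record is still there". A minimal sketch of that ordering follows; `apply_with_undo` and `revert_if_stale` are hypothetical helpers, the actual NvAPI/NVML calls are passed in as closures, and the record format (a plain string) is an assumption.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Writes and syncs the undo record, then runs `apply`.
/// A crash after this point leaves a record for the next start to act on.
pub fn apply_with_undo<A>(record: &Path, previous_value: &str, apply: A) -> io::Result<()>
where
    A: FnOnce() -> io::Result<()>,
{
    let mut file = fs::File::create(record)?;
    file.write_all(previous_value.as_bytes())?;
    file.sync_all()?; // make sure the record reaches disk before we change state
    apply()
}

/// Run at startup and on clean teardown: if a record exists, pass its
/// contents to `revert` and delete it. Returns whether a revert happened.
pub fn revert_if_stale<R>(record: &Path, revert: R) -> io::Result<bool>
where
    R: FnOnce(&str) -> io::Result<()>,
{
    match fs::read_to_string(record) {
        Ok(previous) => {
            revert(&previous)?;
            // Only forget the record once the revert has succeeded.
            fs::remove_file(record)?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

If the process dies between writing the record and applying the setting, the next start reverts a value that was never changed, which is harmless; the opposite ordering could leave a modified profile with no record of what to restore.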

**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
already pins P0). Cheap, low risk, do it for the light-scene win.

### F. Escape the frame-source ceiling — only if §3 says (b)

If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.

- **Swapchain-hook capture (the real fix).** Inject a hook on `IDXGISwapChain::Present`/`Present1`,
  `vkQueuePresentKHR`, `wglSwapBuffers` and copy the backbuffer to a shared texture *before* the
  compositor — OBS Game Capture's mechanism. Sees **every presented frame**, no compose/refresh
  gating.
  ([OBS dxgi-capture](https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/dxgi-capture.cpp))
  **Tradeoffs are serious:** anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs
  whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an
  opt-in "game capture" mode, not the default.
- **NvFBC:** **not an option on Windows** (dead, §1). On **Linux** it's viable via the consumer
  keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
- **Compose-flip (narrow):** the topmost 1×1 layered-window trick (we already have
  `composed_flip.rs`) forces DWM composition and fixes specifically the **DLSS-Frame-Gen** half-rate
  case. Adds host-display latency; don't enable globally.
- **WGC "deliver 2× rate":** Apollo sets `MinUpdateInterval = 1e7/(fps*2)` so the pacer always has a
  fresh frame to pick ([Apollo](https://github.com/ClassicOldSong/Apollo/pull/785)); we set it to 1×
  refresh (`wgc.rs:310`). Cheap tweak to try on the WGC path.
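
The WGC interval arithmetic from the last bullet, as a sketch. `min_update_interval_ticks` is a hypothetical helper; the only facts relied on are the `1e7/(fps*2)` formula quoted above and that the interval is expressed in 100-nanosecond ticks.

```rust
/// 100-nanosecond ticks per second.
const TICKS_PER_SECOND: u64 = 10_000_000;

/// Interval, in 100 ns ticks, that asks the capture source for
/// `rate_multiplier` updates per streamed frame at `fps`.
pub fn min_update_interval_ticks(fps: u64, rate_multiplier: u64) -> u64 {
    TICKS_PER_SECOND / (fps * rate_multiplier)
}
```

At 120 fps the 1× setting gives 83,333 ticks (about 8.3 ms) and the 2× setting gives 41,666 ticks (about 4.2 ms), so the pacer would see up to two capture deliveries per tick and can pick the fresher one.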

**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
frames the game didn't render.

### G. The honest endgame — encode on a second GPU / the iGPU

For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
**different** GPU — a second dGPU or, more realistically, the **iGPU** (Intel QuickSync / AMD VCN),
which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once,
encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder"
play, and the OBS "second GPU is harmful" verdict does **not** apply — that verdict is about moving
*only the NVENC block*; moving capture + CSC + copies off the gaming GPU genuinely frees it.
([OBS forum](https://obsproject.com/forum/threads/can-you-use-a-2nd-gpu-to-eliminate-encoder-overload.149644/))

We're unusually well-placed for this: we already have working AMF and QSV backends
(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.
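
The adapter-selection half of that topology mode can be sketched as follows. Everything here is illustrative: `Adapter`, `Backend` and `pick_encode_adapter` are hypothetical, adapter enumeration is assumed to happen elsewhere, and "first non-gaming adapter with a known vendor" is a deliberately naive policy. The PCI vendor IDs are the standard ones.

```rust
pub const VENDOR_NVIDIA: u32 = 0x10DE;
pub const VENDOR_AMD: u32 = 0x1002;
pub const VENDOR_INTEL: u32 = 0x8086;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    Nvenc,
    Amf,
    Qsv,
}

#[derive(Debug, Clone, Copy)]
pub struct Adapter {
    pub index: usize,
    pub vendor_id: u32,
}

/// Maps a PCI vendor ID to the encoder backend that drives it.
fn backend_for(vendor_id: u32) -> Option<Backend> {
    match vendor_id {
        VENDOR_NVIDIA => Some(Backend::Nvenc),
        VENDOR_AMD => Some(Backend::Amf),
        VENDOR_INTEL => Some(Backend::Qsv),
        _ => None,
    }
}

/// Picks an adapter other than the one the game renders on, together with
/// the backend to use on it. `None` means no offload target exists and the
/// caller falls back to single-GPU encoding.
pub fn pick_encode_adapter(adapters: &[Adapter], game_adapter: usize) -> Option<(usize, Backend)> {
    adapters
        .iter()
        .filter(|a| a.index != game_adapter)
        .find_map(|a| backend_for(a.vendor_id).map(|b| (a.index, b)))
}
```

On a typical desktop with an NVIDIA dGPU at index 0 and an Intel iGPU at index 1, this returns `Some((1, Backend::Qsv))`; with a single adapter it returns `None`.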

**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."**
Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session;
the consumer analogue is the iGPU.

---

## 6. Recommended order of attack

1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
   Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
6. **§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall.
7. **§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build.

After steps 2–5 the light-scene gap should close and the saturated floor should rise materially. But report the
honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM
graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only
*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie
are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps),
or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.

---

## 7. Placebos & dead ends (so we don't re-propose them)

| Candidate | Verdict | Why |
|---|---|---|
| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) |
| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) |
| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) |
| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". ([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) |
| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |

---

## 8. Open evidence gaps (flagged honestly)

- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
  confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
  `nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
  unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
  whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
  whitepaper; treat the *direction* as solid, the magnitude as TBD.