Pre-existing working-tree changes committed to the branch on request: the
gpu-contention investigation doc, host-latency-plan additions, and small
pack-host-installer / stage-pf-vdisplay packaging-script edits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart
before writing code:**
| Bottleneck | Symptom | Fix family |
|---|---|---|
| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |
**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work
either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works
only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping
equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which
costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game
politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7).
**Action list, highest leverage first** (detail in §5–§6):
1.**Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
2.**Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
3.**Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
4.**Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't).
Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
5.**Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the
collapse). (§5.E)
6.**If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose
ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
7.**The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach
that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)
---
## 1. Corrections to `host-latency-plan.md` (read before reusing it)
The old doc was right about the shape but several specifics are now wrong or stale:
- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
**RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
*regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
native path; GameStream and the WGC helper are hardcoded depth-1.
- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that
produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread** —
it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with
**zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on
*separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B)
- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right.
Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU
context does preempt a saturating graphics context at pixel granularity** — that is precisely how
NVIDIA VR Async-TimeWarp injects a frame into a busy game
([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we
default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C)
- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The
"half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact**
(FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal
fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a
general lever. ([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676),
[Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621))
- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12`*and*`PUNKTFUNK_ZEROCOPY`) | high |
Measured magnitude of "RGB vs NV12 to the encoder":
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of
features that **internally use CUDA**
([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)).
---
## 3. Diagnose first — cheap, decisive, do before any code
Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM
cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an
actual saturating game.
1.**Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%:
-`fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling** — the compositor/IDD only produced
40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
- both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention** — the round-trip
is starving. Go to §5.A/B/C.
2.**Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)** —
"Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs
This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice,
and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do
frame N+1's convert.
**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.**
The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`),
and HAGS can spike the *submit* call itself
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).
### C. Auto-gated REALTIME GPU scheduling priority
Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
We're unusually well-placed for this: we already have working AMF and QSV backends
(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.
**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."**
Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session;
the consumer analogue is the iGPU.
---
## 6. Recommended order of attack
1.**§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
2.**§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
3.**§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4.**§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
5.**§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
6.**§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall.
7.**§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build.
After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the
honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM
graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only
*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie
are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps),
or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.
---
## 7. Placebos & dead ends (so we don't re-propose them)
| Candidate | Verdict | Why |
|---|---|---|
| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) |
| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) |
| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) |
| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". ([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) |
| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |
---
## 8. Open evidence gaps (flagged honestly)
- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
`nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
whitepaper; treat the *direction* as solid, the magnitude as TBD.
if(-not$inf){throw"no vendored pf_vdisplay.inf under $VendorDir — re-vendor via vdisplay-driver/deploy-dev.ps1"}
if(-not$inf){throw"no vendored pf_vdisplay.inf under $VendorDir — re-vendor via drivers/deploy-dev.ps1"}
Copy-Item(Join-Path$VendorDir'*')$OutDir-Force
Write-Host"==> vendored pf-vdisplay staged from $VendorDir"
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.