NVIDIA/AMD Vulkan ICDs refuse to *advertise* an HDR color space for a surface on an
IddCx indirect/virtual display, so Vulkan games (Doom: The Dark Ages, id Tech, Indiana
Jones, …) report "device does not support HDR" — even though Windows HDR, DWM compose,
and the client PQ stream all work, and the ICD happily *accepts + presents* a forced HDR
swapchain there. The whole gap is enumeration; the community (Apollo/Sunshine/VDD) wrote
this off as kernel-side / unfixable.
Add VK_LAYER_PUNKTFUNK_hdr_inject (packaging/windows/pf-vkhdr-layer/): a standalone
cdylib Vulkan implicit layer that appends {A2B10G10R10, HDR10_ST2084} + {RGBA16F, scRGB}
to vkGetPhysicalDeviceSurfaceFormats[2]KHR (no need to hook vkCreateSwapchainKHR — the
ICD doesn't validate the color space there). Self-gated on the surface monitor's actual
advanced-color state (DisplayConfig GET_ADVANCED_COLOR_INFO), so it is a complete no-op
on SDR sessions and real monitors (dedup). Always-on (registry-discovered) so it works
regardless of how a game is launched — env-scoping silently fails for already-running
Steam. Escape hatches: DISABLE_PF_VKHDR, PF_VKHDR_EXCLUDE, and a built-in kernel-anti-
cheat denylist.
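A minimal sketch of the append-with-dedup step, assuming plain stand-in types rather than real Vulkan bindings. `SurfaceFormat`, `INJECTED` and `inject_hdr_formats` are hypothetical names, the advanced-color probe is reduced to a boolean, and the Vulkan two-call count/fill idiom the real hook has to honour is left out. The numeric constants are the values from the Vulkan headers.

```rust
/// Stand-in for VkSurfaceFormatKHR, holding the raw enum values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SurfaceFormat {
    pub format: u32,
    pub color_space: u32,
}

// Enum values as defined in the Vulkan headers.
pub const VK_FORMAT_A2B10G10R10_UNORM_PACK32: u32 = 64;
pub const VK_FORMAT_R16G16B16A16_SFLOAT: u32 = 97;
pub const VK_COLOR_SPACE_EXTENDED_SRGB_LINEAR_EXT: u32 = 1_000_104_002;
pub const VK_COLOR_SPACE_HDR10_ST2084_EXT: u32 = 1_000_104_008;

/// The two pairs the layer appends: HDR10 (PQ) and scRGB.
pub const INJECTED: [SurfaceFormat; 2] = [
    SurfaceFormat {
        format: VK_FORMAT_A2B10G10R10_UNORM_PACK32,
        color_space: VK_COLOR_SPACE_HDR10_ST2084_EXT,
    },
    SurfaceFormat {
        format: VK_FORMAT_R16G16B16A16_SFLOAT,
        color_space: VK_COLOR_SPACE_EXTENDED_SRGB_LINEAR_EXT,
    },
];

/// Hypothetical helper: returns the ICD's list, plus the HDR pairs when the
/// surface's monitor reports advanced color. Pairs the ICD already advertises
/// are not duplicated, so the result is unchanged on real HDR monitors.
pub fn inject_hdr_formats(
    icd_formats: &[SurfaceFormat],
    advanced_color_active: bool,
) -> Vec<SurfaceFormat> {
    let mut out = icd_formats.to_vec();
    if advanced_color_active {
        for extra in INJECTED.iter() {
            if !out.contains(extra) {
                out.push(*extra);
            }
        }
    }
    out
}
```

With `advanced_color_active == false` the ICD list comes back untouched, which is the SDR no-op described above.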
The installer builds/signs/stages it and registers it under
HKLM64\SOFTWARE\Khronos\Vulkan\ImplicitLayers (opt-out "Install the HDR Vulkan layer"
task); windows-host CI fmt+clippy-gates it (msvc-only FFI).
Live-validated on the RTX box: Doom: The Dark Ages enables HDR over the pf-vdisplay
virtual display.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)
The headache, stated precisely: a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the stream tracks; the moment the game pins the GPU the stream collapses to 40–50 fps while the game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light titles like CS2). Capping is not an acceptable fix — demanding titles exhaust the GPU even when capped.
This is the second, deeper pass on the problem. The first pass is
host-latency-plan.md (a 25-agent investigation, 2026-06-18). This doc
supersedes several of that doc's conclusions — the codebase moved a lot in the week since
(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.
Method: five parallel investigations — three deep reads of the current code (encode, capture,
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
carries a current file:line.
0. TL;DR — the corrected mental model and the action list
The governing fact: NVENC is a dedicated ASIC on its own GPU runlist, physically separate from the SM/CUDA/graphics cores a 3D game saturates. The game does not steal the encode block. It steals everything that feeds the block — capture-acquire, the RGB→YUV colour-convert, the copy into the encoder's input surface, the readback — and the GPU-scheduler time to run that feed work, which is queued behind the game's graphics context. (NVENC app-note, engine-table proof, UNC RTAS'24)
Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart before writing code:
| Bottleneck | Symptom | Fix family |
|---|---|---|
| (a) feed-scheduling contention | uniq≈fps, both ~50; encode_ms 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| (b) frame-source ceiling | fps≈240 (held re-encodes) but uniq→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |
The single hardest truth: on one saturated GPU there is no free lunch. Any host GPU work either preempts the game (and steals its frames) or waits behind it. Capping the game works only because it cuts the game's total GPU demand and opens idle gaps. The non-capping equivalents are exactly three: need less GPU (footprint shrink), take more (priority — which costs the game fps), or use a different GPU (real isolation). Anything pitched as "make the game politely yield without losing anything" — Reflex, render-queue tricks — is a placebo here (§7).
Action list, highest leverage first (detail in §5–§6):
- Diagnose first (§3). Read uniq-vs-fps under the real workload + PresentMon presentation mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
- Stop feeding NVENC RGB on the default path. IDD-push (the install default) hands NVENC BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on the video engine like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
- Build a correct async encode pipeline: submit on one thread, blocking-retrieve on another, deep surface pool, Windows completion events. Our past "pipelining didn't help" was a same-thread implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
- Auto-gated REALTIME GPU priority. Our LocalSystem service can grant it (most apps can't). Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
- Lock clocks / pin P-state for jitter (cheap; fixes the light-scene "200-not-240", not the collapse). (§5.E)
- If source-bound: swapchain-hook capture (OBS-style), the real escape from the compose ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
- The honest endgame for demanding titles: encode on a second GPU / the iGPU. The only approach that removes contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)
1. Corrections to host-latency-plan.md (read before reusing it)
The old doc was right about the shape but several specifics are now wrong or stale:
- "Windows already feeds NVENC YUV on the video engine, so it does the right thing." True for the DDA and WGC paths, false for IDD-push, which is now the install default and feeds NVENC RGB, paying the SM-side CSC the old doc said Windows had eliminated. The default path regressed on the exact axis the doc celebrated. (§5.A, capture/windows/idd_push.rs:545-551,743)
- "PUNKTFUNK_ENCODE_DEPTH (default 4, ≤6) deep-pipelines." There is no such knob. It exists only in two stale comments (encode/windows/nvenc.rs:30, capture/windows/wgc.rs:57) and is never parsed. The real depth knob is PUNKTFUNK_IDD_DEPTH (default 2), used only by IDD-push on the native path; GameStream and the WGC helper are hardcoded depth-1.
- "Async NVENC is measure-gated and probably stacks latency (Tier 3D)." The measurement that produced that verdict (capture/windows/wgc_helper.rs:131-135) pipelined on a single thread: it queued more frames but still blocked lock_bitstream inline, so it added queue latency with zero overlap. That is not the pattern the NVENC guide prescribes (submit/retrieve on separate threads). The correct async pipeline is untried, not disproven. (§5.B)
- "More GPU priority is maxed and hits a hard preemption wall with no recourse." Half right. Priority is near-maxed (HIGH), but the "no recourse" intuition is wrong: a higher-priority GPU context does preempt a saturating graphics context at pixel granularity, which is precisely how NVIDIA VR Async-TimeWarp injects a frame into a busy game (VRWorks Context Priority). And we default to HIGH, leaving REALTIME unused even though our SYSTEM service can grant it. (§5.C)
- "Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss." The "half the frames" effect is specifically a DLSS-Frame-Generation flip-metering artifact (FG v310.x+ / RTX 50-series), not a general property of independent-flip games; normal fullscreen flip games are captured at full rate by DDA. So composed-flip is a narrow fix, not a general lever. (Apollo #676: DDA captured a flip game at full 120 fps; Sunshine #3621: version-pinned to FG 310.x)
- "NvFBC is a possible low-overhead capture path." Dead on Windows: deprecated, frozen at Capture SDK 7.1 / Win10-1803 (NVIDIA deprecation bulletin). Linux-only, and there only via the consumer keylase patch.
What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the honest residual ceiling at 100% GPU. Those carry forward.
2. How the pipeline actually serializes today (verified against current code)
The capture→encode loop is a fixed-cadence pacer (gamestream/stream.rs:375-480,
punktfunk1.rs:2430-2540): every 1/target_fps tick it grabs the freshest frame with a
non-blocking try_latest(), and if nothing new arrived it re-encodes the held frame (a
near-empty P-frame). So the outbound fps is pinned at target_fps no matter what the source did —
which is why the raw fps counter lies under contention. The only honest signal is the uniq /
diag_new counter (stream.rs:380, punktfunk1.rs:2433-2436), and the code itself states the
diagnostic: "low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
stall" (punktfunk1.rs:2466-2468).
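To make the "fps lies, uniq doesn't" point concrete, here is a small self-contained simulation of a fixed-cadence pacer. It is a sketch under stated assumptions, not the code in stream.rs or punktfunk1.rs: `Latest`, `Counters` and `simulate_one_second` are hypothetical names, the source produces frames at a perfectly even rate of at least 1 fps, and time is integer microseconds.

```rust
/// Hypothetical single-slot mailbox: the capture side overwrites, the pacer takes.
pub struct Latest<T> {
    slot: Option<T>,
}

impl<T> Latest<T> {
    pub fn new() -> Self {
        Latest { slot: None }
    }
    /// Capture side: replace whatever is waiting with the freshest frame.
    pub fn publish(&mut self, frame: T) {
        self.slot = Some(frame);
    }
    /// Pacer side: non-blocking, returns None when nothing new arrived.
    pub fn try_latest(&mut self) -> Option<T> {
        self.slot.take()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct Counters {
    pub sent: u32, // what the raw fps counter reports
    pub uniq: u32, // frames that were actually new
}

/// Simulate one second of pacing at `target_fps` against an evenly paced source.
pub fn simulate_one_second(target_fps: u32, source_fps: u32) -> Counters {
    let mut mailbox = Latest::new();
    let mut held: Option<u64> = None;
    let mut counters = Counters { sent: 0, uniq: 0 };
    let mut produced: u64 = 0;
    for tick in 0..u64::from(target_fps) {
        let now_us = tick * 1_000_000 / u64::from(target_fps);
        // Source frame k becomes available at k / source_fps seconds.
        let due = now_us * u64::from(source_fps) / 1_000_000 + 1;
        while produced < due {
            mailbox.publish(produced);
            produced += 1;
        }
        if let Some(frame) = mailbox.try_latest() {
            held = Some(frame);
            counters.uniq += 1;
        }
        if held.is_some() {
            counters.sent += 1; // a fresh frame, or a re-encode of the held one
        }
    }
    counters
}
```

In this model `simulate_one_second(240, 50)` sends 240 frames of which 50 are unique, while `simulate_one_second(120, 140)` sends 120 that are all unique because the mailbox drops the surplus.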
The encode round-trip (NVENC, the dominant path):
- submit → encode_picture (encode/windows/nvenc.rs:722) is a non-blocking ASIC launch; it pushes onto a pending FIFO.
- poll → lock_bitstream (nvenc.rs:801) blocks the same thread until that frame's encode completes. The session is synchronous: no enableEncodeAsync, no completion event.
- The only thread split is encode-vs-network-send, never submit-vs-retrieve.
So at depth-1 the loop is strictly serial: capture (+convert) → submit → block in lock_bitstream → hand AU to the send thread. The arithmetic matches the symptom: an encode_ms of 13–17 alone caps the loop at 1000/17 ≈ 59 to 1000/13 ≈ 77 fps, and the observed ~50 sits just under that range once the rest of the serial loop is added. That is the signature of one frame in flight per round-trip, not an ASIC throughput wall.
(independent NVENC latency study: ~7 frames across all presets)
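The depth-1 arithmetic can be written as a one-line bound. This is an idealised sketch: `pipeline_fps_ceiling` is a hypothetical helper, and the `in_flight` parameter applies Little's law under the assumption that frames genuinely overlap, which the current same-thread loop does not achieve and which the scheduler's share still bounds.

```rust
/// Upper bound on frames per second when `in_flight` frames overlap and each
/// takes `round_trip_ms` from submit to retrieved bitstream.
/// With `in_flight == 1` this is the strictly serial loop described above.
pub fn pipeline_fps_ceiling(round_trip_ms: f64, in_flight: u32) -> f64 {
    f64::from(in_flight) * 1000.0 / round_trip_ms
}
```

For the measured 13–17 ms round-trip, `pipeline_fps_ceiling(17.0, 1)` rounds to 59 and `pipeline_fps_ceiling(13.0, 1)` rounds to 77.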
Where the per-frame GPU work lands, by path (this is the crux of contention):
| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| IDD-push (install default) | none → NVENC internal RGB→YUV on the SM | CopyResource BGRA→out-ring (3D), idd_push.rs:743 | BGRA/Rgb10a2 | highest (SM CSC + 3D copy) |
| WGC (fallback default) | VideoProcessorBlt → NV12 on the video engine, wgc.rs:631 | none (encodes pool texture in place) | NV12/P010 | low |
| DDA | VideoProcessorBlt → NV12 on the video engine, dxgi.rs:1657-1762 | one CopyResource (3D) to release the dup fast, dxgi.rs:3099 | NV12/P010 | medium |
| Linux NVENC | none → NVENC internal RGB→YUV on the SM (default) | CUDA dev→dev copy + cuStreamSynchronize | RGBZ/BGRZ (NV12 only if PUNKTFUNK_NV12 and PUNKTFUNK_ZEROCOPY) | high |
Measured magnitude of "RGB vs NV12 to the encoder": RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%. NVENC's guide confirms the mechanism: "Encoding of RGB contents" is on the explicit list of features that internally use CUDA (NVENC prog-guide §Encoder Features using CUDA).
3. Diagnose first — cheap, decisive, do before any code
Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM cannot reproduce this — run on the RTX 4090 Windows box (and a real NVIDIA Linux box) with an actual saturating game.
- Run with PUNKTFUNK_PERF=1 and read uniq vs fps under CS2 at GPU-100%:
  - fps≈target but uniq→40–50 ⇒ (b) source ceiling: the compositor/IDD only produced 40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
  - both fps and uniq→40–50, with encode_ms 13–17 ⇒ (a) feed contention: the round-trip is starving. Go to §5.A/B/C.
- Classify the game's presentation with PresentMon: "Presented FPS" vs "Displayed FPS" and Presentation Mode (Hardware: Independent Flip vs Composed: Flip). Independent-Flip + uniq ≪ Presented ⇒ source/flip problem; Presented FPS itself collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing frames.
- Log cap_us/enc_us/pace_us p50/p99 alongside to localise the stall.
Necessary-but-not-sufficient caveat: if the game only rendered 50 frames because it's GPU-bound, nothing downstream creates the other 90. Source fixes address (b) only; the throughput of a saturated single GPU is split between game and host no matter what.
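The uniq-vs-fps decision rule above can be sketched as a small classifier. The names `Bottleneck` and `classify` are hypothetical, and the 0.9 tolerance is an assumption chosen for illustration, not a value from the codebase; tune it against real PUNKTFUNK_PERF output.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Bottleneck {
    /// (a) fps and uniq fall together: the encode round-trip is starving.
    FeedContention,
    /// (b) fps holds at target but uniq falls: the source is not producing frames.
    SourceCeiling,
    /// Both counters track the target.
    Healthy,
}

/// Assumed tolerance for "approximately equal"; not a value from the repo.
const TOLERANCE: f64 = 0.9;

/// Classify one perf sample: `fps` is the send rate, `uniq` the new-frame rate.
pub fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bottleneck {
    let sending_at_target = fps >= target_fps * TOLERANCE;
    let uniq_tracks_fps = uniq >= fps * TOLERANCE;
    match (sending_at_target, uniq_tracks_fps) {
        (true, true) => Bottleneck::Healthy,
        (true, false) => Bottleneck::SourceCeiling,
        (false, _) => Bottleneck::FeedContention,
    }
}
```

A sample of fps 240 with uniq 45 classifies as the source ceiling, while fps 50 with uniq 50 against a 240 target classifies as feed contention. The PresentMon check still has to rule out a game that simply rendered fewer frames.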
4. Current-state audit (what's shipped / regressed / missing)
| Area | State | Where |
|---|---|---|
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | session_tuning.rs ✅ |
| Thread priority (Linux) | setpriority −10/−5, native path only; GameStream Linux threads get none | punktfunk1.rs:1977 ⚠ |
| GPU sched priority | D3DKMTSetProcessSchedulingPriorityClass HIGH(4) default; realtime opt-in, no auto-gate; cross-process onto WGC helper | capture/windows/dxgi.rs:208-330 ⚠ |
| GPU thread/latency | SetGPUThreadPriority(0x4000001E), SetMaximumFrameLatency(1) | dxgi.rs:193-200 ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅; IDD-push (default) RGB→SM ✗ | wgc.rs:631 / idd_push.rs:545 |
| CSC off-SM (Win HDR) | on-SM unless PUNKTFUNK_HDR_SHADER_P010 (default off) | wgc.rs:603 ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is double-opt-in (PUNKTFUNK_NV12+PUNKTFUNK_ZEROCOPY) | encode/linux/mod.rs:104 ⚠ |
| Encode pipeline | depth-1 synchronous, inline lock_bitstream; IDD-push native = depth-2 same-thread | nvenc.rs:801 ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | nvenc.rs:424-447 ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy); IDD-push adds its own out-ring copy | nvenc.rs:623 ✅/⚠ |
| AMF tuning | usage=ultralowlatency, preanalysis=false | ffmpeg_win.rs:215-219 ✅ |
| QSV tuning | async_depth=1, low_power=1 (VDEnc) | ffmpeg_win.rs:226-227 ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | stream.rs, transport/qos.rs ✅ |
| Clock / P-state pin | none (zero hits repo-wide) | ✗ |
| Async NVENC (2-thread) | none | ✗ |
| Frame-source escape (hook/NvFBC-Linux) | none | ✗ |
| Second-GPU / iGPU encode offload | none | ✗ |
| DSCP/QoS | implemented, PUNKTFUNK_DSCP opt-in (default off) | transport/qos.rs ⚠ |
5. The levers, ranked, with honest verdicts
A. Stop feeding NVENC RGB on the default path — highest in-our-control win
The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
solved this by doing the CSC with ID3D11VideoProcessor::VideoProcessorBlt (video engine) and
feeding NV12/P010. Make IDD-push and Linux do the same.
- Windows IDD-push: add a VideoProcessorBlt BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the out-ring, exactly like wgc.rs:631 / dxgi.rs:1657-1762, and feed NV_ENC_BUFFER_FORMAT_NV12 / ..._YUV420_10BIT. This also lets you drop the separate CopyResource (the convert writes the out-ring), removing both contended-engine ops per frame. Plug it into SessionPlan (session_plan.rs, the single owner of the capture/encode decision) so capture and encode can't disagree on the format.
- Linux: make NV12 the default for the tiled zero-copy path (it's gated behind PUNKTFUNK_NV12 and PUNKTFUNK_ZEROCOPY today; encode/linux/mod.rs:104, linux/zerocopy/egl.rs:272), and feed NVENC NV_ENC_BUFFER_FORMAT_NV12. The GL detile already runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- Windows HDR: flip PUNKTFUNK_HDR_SHADER_P010 on by default (or, better, use a video-engine P010 convert where the VP supports it).
Verdict: REAL, but honestly conditional. Feeding NV12 provably removes NVENC's internal CUDA
CSC — but the convert has to land off the SM to fully pay off. VideoProcessorBlt is designed
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, but no NVIDIA
doc explicitly confirms VideoProcessorBlt runs off-SM on GeForce — treat the "video engine" claim
as well-founded-but-unverified and confirm on-box with nvidia-smi dmon (watch the enc/sm
columns) before and after. Do not convert with a CUDA/3D shader and call it done — that just
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).
B. A correct async encode pipeline (the untried encoder lever)
The NVENC Programming Guide is explicit: "The main encoder thread should be used only to submit
work… (non-blocking NvEncEncodePicture). Output buffer processing — waiting on the completion
event in asynchronous mode, or calling NvEncLockBitstream in synchronous mode — should be done in
the secondary thread."
(NVENC prog-guide, threading model)
We do the opposite — submit and blocking-retrieve on one thread. Queuing more pending entries
(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with no overlap,
which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
implementation, not a disproof.
The fix: submit on the capture/encode thread; do lock_bitstream on a dedicated retrieve thread;
hold a deep input+output surface pool (≈4–8); on Windows register a completionEvent per output
buffer (enableEncodeAsync=1) — on Linux async events are unsupported, so use the same two-thread
split with a blocking retrieve.
(async is Windows/WDDM-only;
FFmpeg models the same knob as delay/async_depth,
libavcodec/nvenc.c).
This lets the WDDM scheduler find a backlog when it finally grants the encoder context a slice, and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do frame N+1's convert.
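The two-thread shape can be sketched with std channels alone. This is a structural illustration, not NVENC code: `Pending`, `AccessUnit`, `blocking_retrieve` and `run_pipeline` are hypothetical stand-ins, the encoder is simulated, and the bounded channel plays the role of the surface pool (submit only blocks once `pool_depth` frames are in flight).

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;

/// Stand-in for an in-flight encode; a real session would own the input
/// surface and output bitstream buffer here until the retrieve completes.
pub struct Pending {
    pub frame_id: u64,
}

/// Stand-in for the encoded access unit handed to the send thread.
#[derive(Debug, PartialEq, Eq)]
pub struct AccessUnit {
    pub frame_id: u64,
}

/// Placeholder for the blocking retrieve (lock_bitstream in the text).
fn blocking_retrieve(pending: Pending) -> AccessUnit {
    AccessUnit { frame_id: pending.frame_id }
}

/// Submit on the calling thread, retrieve on a dedicated thread.
pub fn run_pipeline(frames: u64, pool_depth: usize) -> Vec<AccessUnit> {
    let (pending_tx, pending_rx) = sync_channel::<Pending>(pool_depth);
    let (au_tx, au_rx) = channel::<AccessUnit>();

    // Retrieve thread: the only place that blocks on encode completion.
    let retriever = thread::spawn(move || {
        for pending in pending_rx {
            if au_tx.send(blocking_retrieve(pending)).is_err() {
                break;
            }
        }
    });

    // Submit side: never waits for a bitstream, only for a free pool slot.
    for frame_id in 0..frames {
        pending_tx
            .send(Pending { frame_id })
            .expect("retrieve thread exited early");
    }
    drop(pending_tx); // closing the queue lets the retrieve thread drain and exit
    retriever.join().expect("retrieve thread panicked");
    au_rx.iter().collect()
}
```

The FIFO channel keeps access units in submit order, which matters because the retrieve side has to lock bitstreams in the order frames were submitted.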
Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.
The honest bound (and why this is second to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is ready for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
priority, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this forecloses sub-frame slice output (mutually exclusive with enableEncodeAsync),
and HAGS can spike the submit call itself
(100–200 ms nvEncEncodePicture stalls under HAGS).
C. Auto-gated REALTIME GPU scheduling priority
Raising the host process's WDDM GPU priority is the proven single-PC production lever — OBS and
Sunshine both set D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME to stop being descheduled behind
fullscreen games
(OBS commit,
Sunshine display_base.cpp).
It works independently of HAGS (HAGS does not reassign cross-process priority — Microsoft:
"Windows continues to control prioritization"
DirectX devblog).
We ship only HIGH(4) by default with a static realtime opt-in and no auto-gate. Two things
to change:
- We can actually grant REALTIME. It needs SeIncreaseBasePriorityPrivilege, which an unelevated app lacks (OBS logs the failure), but our host runs as a LocalSystem service, which holds it. The lever is available to us specifically.
- Gate it to dodge the freeze. REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a documented NVENC hang (Sunshine ships nvenc_realtime_hags to downgrade to HIGH for exactly this; Sunshine config, NVIDIA repro). Implement the old plan's "Tier 3B": probe HAGS via D3DKMTQueryAdapterInfo and VRAM headroom via IDXGIAdapter3::QueryVideoMemoryInfo (continuously); use REALTIME only when HAGS-off, or HAGS-on with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever. Priority is how
the host takes GPU time from the game; it measurably costs the game fps
(Doom Eternal 121→60 with Sunshine running).
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose PUNKTFUNK_GPU_PRIORITY_CLASS).
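The gate itself is a pure decision over two probes, which makes it easy to unit-test away from the hardware. In this sketch `GpuPriority`, `GateInput` and `choose_priority` are hypothetical names, the inputs are assumed to come from the HAGS and video-memory queries named above, and the 15% headroom threshold is an assumption for illustration, since the source only says "comfortable".

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the two probes the gate depends on.
pub struct GateInput {
    pub hags_enabled: bool,
    pub vram_budget_bytes: u64,
    pub vram_usage_bytes: u64,
}

/// Assumed definition of "comfortable" headroom; not a value from the source.
pub const MIN_HEADROOM_PERCENT: u64 = 15;

/// REALTIME when HAGS is off, or when HAGS is on with enough free VRAM;
/// HIGH otherwise. Intended to be re-evaluated on every poll.
pub fn choose_priority(input: &GateInput) -> GpuPriority {
    if !input.hags_enabled {
        return GpuPriority::Realtime;
    }
    let headroom = input.vram_budget_bytes.saturating_sub(input.vram_usage_bytes);
    // Integer comparison of headroom/budget against the threshold.
    let comfortable = input.vram_budget_bytes > 0
        && u128::from(headroom) * 100
            >= u128::from(input.vram_budget_bytes) * u128::from(MIN_HEADROOM_PERCENT);
    if comfortable {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Calling it from the polling loop and applying the class only when the result changes gives the "downgrade the instant VRAM tightens" behaviour without extra state.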
D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat
Our *_amf/*_qsv libavcodec config already follows the research's advice: AMF
usage=ultralowlatency + preanalysis=false (ffmpeg_win.rs:215), QSV async_depth=1 +
low_power=1 VDEnc path (:226). Keep them. Two notes:
- AMF/QSV suffer contention worse than NVENC. OBS: "For Intel and AMD GPUs, the hardware encoder requires significant resources of the same type a 3D app/game requires… different from NVIDIA's NVENC, which has dedicated encoding circuits" (OBS KB). So on an AMD/Intel host the collapse is expected to be harder — and §G (iGPU offload) is even more attractive there.
- The AMF busy-poll floor (a fixed-sleep QueryOutput poll imposes ~15 ms via timer granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's QUERY_TIMEOUT patch); since we go through libavcodec we inherit it. Just confirm the pinned FFmpeg build includes it. (ffmpeg-devel)
Verdict: REAL but largely already captured. No big win left here except via §G.
E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix
NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every frame — most visible in the light scene (the "200-not-240"). Pin it:
- Windows: NvAPI per-application DRS PREFERRED_PSTATE = PREFER_MAX scoped to our exe (this is exactly Sunshine's nvenc_latency_over_power, Sunshine nvprefs). Crash-safe undo is mandatory: persist an undo record to %ProgramData%\punktfunk\ before applying, revert a stale profile on next start, so a crash never leaves the user's control panel modified.
- Linux: nvidia-smi -lgc / NVML nvmlDeviceSetGpuLockedClocks (needs root/CAP_SYS_ADMIN; query nvmlDeviceGetMaxClockInfo, lock to that, restore on teardown and SIGTERM). Plus the newly-added CudaNoStablePerfLimit driver profile (new in R580/595, so usable on the 595 box) to defeat the CUDA "Force P2" memory-clock clamp.
- Gate behind PUNKTFUNK_PIN_CLOCKS; default off on battery / Steam Deck (pinning is harmful there).
Verdict: REAL for latency stability, marginal for the saturated collapse (at 100% util the game already pins P0). Cheap, low risk, do it for the light-scene win.
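The crash-safe undo ordering (record first, apply second, revert any leftover record on start) can be sketched independently of NvAPI. `apply_with_undo` and `revert_if_recorded` are hypothetical helpers, the driver calls are passed in as closures, and the record format is an opaque string chosen for illustration.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Persist the previous state durably, then apply the change.
/// If the process dies after this point, the record survives on disk.
pub fn apply_with_undo(
    record: &Path,
    previous_state: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    let mut file = File::create(record)?;
    file.write_all(previous_state.as_bytes())?;
    file.sync_all()?; // flush to disk before touching the driver profile
    apply()
}

/// Run at startup and at clean teardown: if a record exists, restore the
/// recorded state and delete the record. Returns whether a revert happened.
pub fn revert_if_recorded(
    record: &Path,
    restore: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    let previous_state = match fs::read_to_string(record) {
        Ok(text) => text,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    restore(&previous_state)?;
    fs::remove_file(record)?; // only forget the record once restore succeeded
    Ok(true)
}
```

Because the record is removed only after `restore` succeeds, a crash during the revert leaves the record in place and the next start retries it.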
F. Escape the frame-source ceiling — only if §3 says (b)
If uniq is the wall, no encoder/priority work helps — you need a better frame source.
- Swapchain-hook capture (the real fix). Inject a hook on IDXGISwapChain::Present/Present1, vkQueuePresentKHR, wglSwapBuffers and copy the backbuffer to a shared texture before the compositor, which is OBS Game Capture's mechanism. Sees every presented frame, no compose/refresh gating. (OBS dxgi-capture) Tradeoffs are serious: anti-cheat (EAC/BattlEye/Vanguard) flags injection and needs whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an opt-in "game capture" mode, not the default.
- NvFBC: not an option on Windows (dead, §1). On Linux it's viable via the consumer keylase patch and captures below composition; worth a flag for the Linux NVIDIA host.
- Compose-flip (narrow): the topmost 1×1 layered-window trick (we already have composed_flip.rs) forces DWM composition and fixes specifically the DLSS-Frame-Gen half-rate case. Adds host-display latency; don't enable globally.
- WGC "deliver 2× rate": Apollo sets MinUpdateInterval = 1e7/(fps*2) so the pacer always has a fresh frame to pick (Apollo); we set it to 1× refresh (wgc.rs:310). Cheap tweak to try on the WGC path.
Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow. None invents frames the game didn't render.
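The WGC interval arithmetic from the last bullet, as a sketch: the interval is expressed in 100 ns ticks, so one second is 1e7 ticks. `min_update_interval_ticks` is a hypothetical helper; the multiplier of 2 mirrors the Apollo formula quoted above and 1 mirrors what the text says we set today.

```rust
/// 100 ns ticks per second, the unit the 1e7 in the formula refers to.
pub const TICKS_PER_SECOND: i64 = 10_000_000;

/// Minimum update interval in 100 ns ticks for a capture that should deliver
/// `rate_multiplier` frames per stream frame. Returns None for a zero rate.
pub fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> Option<i64> {
    let deliveries_per_second = i64::from(fps).checked_mul(i64::from(rate_multiplier))?;
    if deliveries_per_second == 0 {
        None
    } else {
        Some(TICKS_PER_SECOND / deliveries_per_second)
    }
}
```

At 120 fps the 2× setting gives 41666 ticks (about 4.17 ms) against 83333 ticks for the 1× setting.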
G. The honest endgame — encode on a second GPU / the iGPU
For demanding titles that saturate the GPU even when capped, the only thing that removes contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a different GPU — a second dGPU or, more realistically, the iGPU (Intel QuickSync / AMD VCN), which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once, encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder" play, and the OBS "second GPU is harmful" verdict does not apply — that verdict is about moving only the NVENC block; moving capture + CSC + copies off the gaming GPU genuinely frees it. (OBS forum)
We're unusually well-placed for this: we already have working AMF and QSV backends
(encode/windows/ffmpeg_win.rs) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.
Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses." Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session; the consumer analogue is the iGPU.
6. Recommended order of attack
1. §3 Diagnose on the RTX box + a real game. Settles (a) vs (b). (half a day, decisive)
2. §5.A NV12/P010 on the default paths (IDD-push video-engine convert; Linux NV12 default-on; Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with nvidia-smi dmon.
3. §5.C Auto-gated REALTIME priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. §5.E Clock pin both OSes (crash-safe undo). Cheap light-scene win.
5. §5.B Correct two-thread async pipeline. Structural; recovers the depth-1 serialization.
6. §3-gated §5.F source escape (swapchain hook), only if uniq is the wall.
7. §5.G iGPU encode offload: the strategic answer for demanding titles; larger build.
After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the honest ceiling: on one saturated GPU the game and the host split a fixed pie — coarse WDDM graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only rendered 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps), or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.
7. Placebos & dead ends (so we don't re-propose them)
| Candidate | Verdict | Why |
|---|---|---|
| NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames as a "non-capping yield" | ✗ placebo | Shrinks the game's render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. (Battle(non)sense LDAT data) |
| HAGS on, as a contention fix | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it causes NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime queue. (OBS KB) |
| Split-frame encode (2/3/4-way) to fix contention | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured zero latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). splitEncodeMode=15 is the legit disable sentinel, not a bug. (SDK header) |
| Move the encoded-bitstream readback to a copy engine | ✗ placebo | Output is KB-scale; the cost of lock_bitstream is the completion wait, not copy bandwidth. (The input full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| CUDA stream priority / CUDA_DEVICE_MAX_CONNECTIONS / CU_CTX_SCHED_* | ✗ placebo cross-process | Intra-context only; the game is a separate context. Stream priority "will not preempt already executing work". (CUDA docs) |
| VK/EGL global-priority REALTIME on Linux NVIDIA | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| Windows "High performance" GPU preference | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| MIG / MPS / vGPU | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| NvFBC on Windows | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| Frame Generation / Smooth Motion to "make more frames" | ✗ red herring | We stream rendered frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |
8. Open evidence gaps (flagged honestly)
- Whether ID3D11VideoProcessor::VideoProcessorBlt (BGRA→NV12) runs off the SM on GeForce is not confirmed by any NVIDIA document, and it's the linchpin of §5.A's full payoff. Verify on-box with nvidia-smi dmon (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
- The exact share of the 13–17 ms encode_ms that is convert-on-SM vs scheduling-wait is unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD whitepaper; treat the direction as solid, the magnitude as TBD.