enricobuehler d01a8fd17a

2026-06-26 11:33:20 +00:00


GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)

The headache, stated precisely: a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the stream tracks; the moment the game pins the GPU the stream collapses to 40–50 fps while the game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light titles like CS2). Capping is not an acceptable fix — demanding titles exhaust the GPU even when capped.

This is the second, deeper pass on the problem. The first pass is host-latency-plan.md (a 25-agent investigation, 2026-06-18). This doc supersedes several of that doc's conclusions — the codebase moved a lot in the week since (the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned two of the old plan's premises. Read §1 (corrections) before acting on the old doc.

Method: five parallel investigations — three deep reads of the current code (encode, capture, mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with their own adversarial verifiers. Every external claim below carries a source URL; every code claim carries a current file:line.

0. TL;DR — the corrected mental model and the action list

The governing fact: NVENC is a dedicated ASIC on its own GPU runlist, physically separate from the SM/CUDA/graphics cores a 3D game saturates. The game does not steal the encode block. It steals everything that feeds the block — capture-acquire, the RGB→YUV colour-convert, the copy into the encoder's input surface, the readback — and the GPU-scheduler time to run that feed work, which is queued behind the game's graphics context. (NVENC app-note, engine-table proof, UNC RTAS'24)

Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart before writing code:

| Bottleneck | Symptom | Fix family |
| --- | --- | --- |
| (a) feed-scheduling contention | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| (b) frame-source ceiling | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |

The single hardest truth: on one saturated GPU there is no free lunch. Any host GPU work either preempts the game (and steals its frames) or waits behind it. Capping the game works only because it cuts the game's total GPU demand and opens idle gaps. The non-capping equivalents are exactly three: need less GPU (footprint shrink), take more (priority — which costs the game fps), or use a different GPU (real isolation). Anything pitched as "make the game politely yield without losing anything" — Reflex, render-queue tricks — is a placebo here (§7).

Action list, highest leverage first (detail in §5–§6):

1. Diagnose first (§3). Read uniq-vs-fps under the real workload + PresentMon presentation mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
2. Stop feeding NVENC RGB on the default path. IDD-push (the install default) hands NVENC BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on the video engine like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
3. Build a correct async encode pipeline: submit on one thread, blocking-retrieve on another, deep surface pool, Windows completion events. Our past "pipelining didn't help" was a same-thread implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
4. Auto-gated REALTIME GPU priority. Our LocalSystem service can grant it (most apps can't). Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
5. Lock clocks / pin P-state for jitter (cheap; fixes the light-scene "200-not-240", not the collapse). (§5.E)
6. If source-bound: swapchain-hook capture (OBS-style), the real escape from the compose ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
7. The honest endgame for demanding titles: encode on a second GPU / the iGPU. The only approach that removes contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)

1. Corrections to `host-latency-plan.md` (read before reusing it)

The old doc was right about the shape but several specifics are now wrong or stale:

"Windows already feeds NVENC YUV on the video engine, so it does the right thing." True for the DDA and WGC paths — false for IDD-push, which is now the install default and feeds NVENC RGB, paying the SM-side CSC the old doc said Windows had eliminated. The default path regressed on the exact axis the doc celebrated. (§5.A, capture/windows/idd_push.rs:545-551,743)
"PUNKTFUNK_ENCODE_DEPTH (default 4, ≤6) deep-pipelines." There is no such knob. It exists only in two stale comments (encode/windows/nvenc.rs:30, capture/windows/wgc.rs:57) and is never parsed. The real depth knob is PUNKTFUNK_IDD_DEPTH (default 2), used only by IDD-push on the native path; GameStream and the WGC helper are hardcoded depth-1.
"Async NVENC is measure-gated and probably stacks latency (Tier 3D)." The measurement that produced that verdict (capture/windows/wgc_helper.rs:131-135) pipelined on a single thread — it queued more frames but still blocked lock_bitstream inline, so it added queue latency with zero overlap. That is not the pattern the NVENC guide prescribes (submit/retrieve on separate threads). The correct async pipeline is untried, not disproven. (§5.B)
"More GPU priority is maxed and hits a hard preemption wall with no recourse." Half right. Priority is near-maxed (HIGH), but the "no recourse" intuition is wrong: a higher-priority GPU context does preempt a saturating graphics context at pixel granularity — that is precisely how NVIDIA VR Async-TimeWarp injects a frame into a busy game (VRWorks Context Priority). And we default to HIGH, leaving REALTIME unused even though our SYSTEM service can grant it. (§5.C)
"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss." The "half the frames" effect is specifically a DLSS-Frame-Generation flip-metering artifact (FG v310.x+ / RTX 50-series), not a general property of independent-flip games — normal fullscreen flip games are captured at full rate by DDA. So composed-flip is a narrow fix, not a general lever. (Apollo #676 — DDA captured a flip game at full 120 fps, Sunshine #3621 — version-pinned to FG 310.x)
"NvFBC is a possible low-overhead capture path." Dead on Windows — deprecated, frozen at Capture SDK 7.1 / Win10-1803 (NVIDIA deprecation bulletin). Linux-only, and there only via the consumer keylase patch.

What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the honest residual ceiling at 100% GPU. Those carry forward.

2. How the pipeline actually serializes today (verified against current code)

The capture→encode loop is a fixed-cadence pacer (gamestream/stream.rs:375-480, punktfunk1.rs:2430-2540): every 1/target_fps tick it grabs the freshest frame with a non-blocking try_latest(), and if nothing new arrived it re-encodes the held frame (a near-empty P-frame). So the outbound fps is pinned at target_fps no matter what the source did — which is why the raw fps counter lies under contention. The only honest signal is the uniq / diag_new counter (stream.rs:380, punktfunk1.rs:2433-2436), and the code itself states the diagnostic: "low new_fps at high send rate ⇒ the source isn't producing frames, not an encode stall" (punktfunk1.rs:2466-2468).
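The gap between the send rate and the unique-frame rate can be reproduced with a small simulation. This is a minimal sketch, not the loop in stream.rs or punktfunk1.rs: `simulate_pacer`, `PacerStats`, and the integer frame model are assumptions made for illustration, and the source is idealised as producing frames at a perfectly even rate.

```rust
/// Counters a fixed-cadence pacer would report after a run.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct PacerStats {
    /// Frames handed to the encoder (new frames plus held re-encodes).
    pub sent: u64,
    /// Frames that were actually new from the source.
    pub uniq: u64,
}

/// Simulate `seconds` of a pacer that ticks `target_fps` times per second
/// against a source that produces `source_fps` evenly spaced frames.
/// Hypothetical model: frame k is available from tick k * target / source.
pub fn simulate_pacer(target_fps: u64, source_fps: u64, seconds: u64) -> PacerStats {
    assert!(target_fps > 0, "target_fps must be non-zero");
    let mut stats = PacerStats { sent: 0, uniq: 0 };
    // Index of the frame currently held by the pacer, if any.
    let mut held: Option<u64> = None;

    for tick in 0..target_fps * seconds {
        // Freshest frame the source has finished by this tick
        // (the non-blocking "try latest" step).
        let newest = tick * source_fps / target_fps;
        if held != Some(newest) {
            // Something new arrived: encode it and count it as unique.
            held = Some(newest);
            stats.uniq += 1;
        }
        // Either way one frame goes out this tick; if nothing new arrived
        // it is a re-encode of the held frame.
        stats.sent += 1;
    }
    stats
}
```

With a 240 fps target and a 50 fps source, one simulated second reports `sent = 240` and `uniq = 50`: the send counter stays at the target while the unique counter shows what the source really delivered.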

The encode round-trip (NVENC, the dominant path):

submit → encode_picture (encode/windows/nvenc.rs:722) is a non-blocking ASIC launch; it pushes onto a pending FIFO.
poll → lock_bitstream (nvenc.rs:801) blocks the same thread until that frame's encode completes. The session is synchronous — no enableEncodeAsync, no completion event.
The only thread split is encode-vs-network-send, never submit-vs-retrieve.

So at depth-1 the loop is strictly serial: capture (+convert) → submit → block in lock_bitstream → hand AU to the send thread. The arithmetic matches the symptom: an encode_ms of 13–17 alone caps the loop at 1000/13 ≈ 77 down to 1000/17 ≈ 59 fps, and a few more milliseconds of capture and convert in the same serial round-trip would account for the observed ~50 (1000/50 = 20 ms per frame). That is the signature of one frame in flight per round-trip, not an ASIC throughput wall. (independent NVENC latency study: ~7 frames across all presets)

Where the per-frame GPU work lands, by path (this is the crux of contention):

| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
| --- | --- | --- | --- | --- |
| IDD-push (install default) | none → NVENC internal RGB→YUV on the SM | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | BGRA/Rgb10a2 | highest (SM CSC + 3D copy) |
| WGC (fallback default) | `VideoProcessorBlt` → NV12 on the video engine, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| DDA | `VideoProcessorBlt` → NV12 on the video engine, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| Linux NVENC | none → NVENC internal RGB→YUV on the SM (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` and `PUNKTFUNK_ZEROCOPY`) | high |

Measured magnitude of "RGB vs NV12 to the encoder": RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%. NVENC's guide confirms the mechanism: "Encoding of RGB contents" is on the explicit list of features that internally use CUDA (NVENC prog-guide §Encoder Features using CUDA).

3. Diagnose first — cheap, decisive, do before any code

Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM cannot reproduce this — run on the RTX 4090 Windows box (and a real NVIDIA Linux box) with an actual saturating game.

1. Run with PUNKTFUNK_PERF=1 and read uniq vs fps under CS2 at GPU-100%:
   - fps≈target but uniq→40–50 ⇒ (b) source ceiling: the compositor/IDD only produced 40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
   - both fps and uniq→40–50, with encode_ms 13–17 ⇒ (a) feed contention: the round-trip is starving. Go to §5.A/B/C.
2. Classify the game's presentation with PresentMon: "Presented FPS" vs "Displayed FPS" and Presentation Mode (Hardware: Independent Flip vs Composed: Flip). Independent-Flip + uniq ≪ Presented ⇒ source/flip problem; Presented FPS itself collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing frames.
3. Log cap_us / enc_us / pace_us p50/p99 alongside to localise the stall.
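The uniq-vs-fps rule above can be written down as a small decision function. This is a sketch under stated assumptions: the `Bottleneck` enum, the `classify` function and the 90% "tracking" threshold are hypothetical and not taken from the repo; only the shape of the rule comes from this section.

```rust
/// Which of the two bottlenecks the counters point at.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Bottleneck {
    /// Send rate and unique rate both hold up: nothing to fix.
    Healthy,
    /// (a) The round-trip itself is starving: fps collapsed.
    FeedContention,
    /// (b) Send rate holds at target but few of those frames are new.
    SourceCeiling,
}

/// Hypothetical threshold: a rate "tracks" another if it reaches 90% of it.
const TRACKING: f64 = 0.9;

/// Classify one sample of the perf counters taken under the real workload.
pub fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bottleneck {
    let sends_at_target = fps >= TRACKING * target_fps;
    let uniq_tracks_sends = uniq >= TRACKING * fps;
    match (sends_at_target, uniq_tracks_sends) {
        (true, true) => Bottleneck::Healthy,
        // fps≈target but uniq far below it: held re-encodes are padding the rate.
        (true, false) => Bottleneck::SourceCeiling,
        // The send rate itself collapsed. Treated as (a) here; a source
        // ceiling may still be hiding underneath and needs a re-check after the fix.
        (false, _) => Bottleneck::FeedContention,
    }
}
```

A sample of `classify(240.0, 240.0, 45.0)` lands on `SourceCeiling`, while `classify(240.0, 50.0, 50.0)` lands on `FeedContention`. The PresentMon step is still needed to tell a flip problem from a genuinely GPU-bound game.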

Necessary-but-not-sufficient caveat: if the game only rendered 50 frames because it's GPU-bound, nothing downstream creates the other 90. Source fixes address (b) only; the throughput of a saturated single GPU is split between game and host no matter what.

4. Current-state audit (what's shipped / regressed / missing)

| Area | State | Where |
| --- | --- | --- |
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ |
| Thread priority (Linux) | `setpriority` −10/−5, native path only; GameStream Linux threads get none | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` HIGH(4) default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅; IDD-push (default) RGB→SM ✗ | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default off) | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is double-opt-in (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy); IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ |
| Clock / P-state pin | none (zero hits repo-wide) | ✗ |
| Async NVENC (2-thread) | none | ✗ |
| Frame-source escape (hook/NvFBC-Linux) | none | ✗ |
| Second-GPU / iGPU encode offload | none | ✗ |
| DSCP/QoS | implemented, `PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ |

5. The levers, ranked, with honest verdicts

A. Stop feeding NVENC RGB on the default path — highest in-our-control win

The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB, forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already solved this by doing the CSC with ID3D11VideoProcessor::VideoProcessorBlt (video engine) and feeding NV12/P010. Make IDD-push and Linux do the same.

Windows IDD-push: add a VideoProcessorBlt BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the out-ring, exactly like wgc.rs:631 / dxgi.rs:1657-1762, and feed NV_ENC_BUFFER_FORMAT_NV12 / ..._YUV420_10BIT. This also lets you drop the separate CopyResource (the convert writes the out-ring), removing both contended-engine ops per frame. Plug it into SessionPlan (session_plan.rs, the single owner of the capture/encode decision) so capture and encode can't disagree on the format.
Linux: make NV12 the default for the tiled zero-copy path (it's gated behind PUNKTFUNK_NV12 and PUNKTFUNK_ZEROCOPY today — encode/linux/mod.rs:104, linux/zerocopy/egl.rs:272), and feed NVENC NV_ENC_BUFFER_FORMAT_NV12. The GL detile already runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
Windows HDR: flip PUNKTFUNK_HDR_SHADER_P010 on by default (or, better, use a video-engine P010 convert where the VP supports it).
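The "single owner of the format decision" idea in the list above can be sketched as one function that returns both halves of the plan, so capture and encode cannot drift apart. This is illustrative only: `FormatPlan`, `CaptureFormat`, `EncoderInput` and `plan_formats` are hypothetical names and not the actual `SessionPlan` API; the two NVENC buffer-format names are the ones quoted in this section.

```rust
/// What the capture side hands to the convert step (hypothetical enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum CaptureFormat {
    /// 8-bit BGRA desktop surface (SDR).
    Bgra8,
    /// FP16 scRGB surface (HDR).
    Fp16,
}

/// What the encoder is told to expect (hypothetical enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EncoderInput {
    Nv12,
    P010,
}

impl EncoderInput {
    /// NVENC buffer-format constant this input corresponds to.
    pub fn nvenc_name(self) -> &'static str {
        match self {
            EncoderInput::Nv12 => "NV_ENC_BUFFER_FORMAT_NV12",
            EncoderInput::P010 => "NV_ENC_BUFFER_FORMAT_YUV420_10BIT",
        }
    }
}

/// Both halves of the decision, produced together.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct FormatPlan {
    pub capture: CaptureFormat,
    pub encoder_input: EncoderInput,
}

/// Decide the convert step once: BGRA→NV12 for SDR, FP16→P010 for HDR.
pub fn plan_formats(hdr: bool) -> FormatPlan {
    if hdr {
        FormatPlan { capture: CaptureFormat::Fp16, encoder_input: EncoderInput::P010 }
    } else {
        FormatPlan { capture: CaptureFormat::Bgra8, encoder_input: EncoderInput::Nv12 }
    }
}
```

Because the encoder format is only ever read from the plan, a capture path cannot emit BGRA while the encoder session was opened for NV12.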

Verdict: REAL, but honestly conditional. Feeding NV12 provably removes NVENC's internal CUDA CSC — but the convert has to land off the SM to fully pay off. VideoProcessorBlt is designed to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, but no NVIDIA doc explicitly confirms VideoProcessorBlt runs off-SM on GeForce — treat the "video engine" claim as well-founded-but-unverified and confirm on-box with nvidia-smi dmon (watch the enc/sm columns) before and after. Do not convert with a CUDA/3D shader and call it done — that just relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).

B. A correct async encode pipeline (the untried encoder lever)

The NVENC Programming Guide is explicit: "The main encoder thread should be used only to submit work… (non-blocking NvEncEncodePicture). Output buffer processing — waiting on the completion event in asynchronous mode, or calling NvEncLockBitstream in synchronous mode — should be done in the secondary thread." (NVENC prog-guide, threading model) We do the opposite — submit and blocking-retrieve on one thread. Queuing more pending entries (IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with no overlap, which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong implementation, not a disproof.

The fix: submit on the capture/encode thread; do lock_bitstream on a dedicated retrieve thread; hold a deep input+output surface pool (≈4–8); on Windows, register a completionEvent per output buffer (enableEncodeAsync=1). On Linux async events are unsupported, so use the same two-thread split with a blocking retrieve. (Async mode is Windows/WDDM-only; FFmpeg models the same knob as delay/async_depth, libavcodec/nvenc.c.)

This lets the WDDM scheduler find a backlog when it finally grants the encoder context a slice, and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do frame N+1's convert.
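The submit/retrieve split can be sketched with two threads and a bounded queue standing in for the surface pool. This is a minimal illustration with a mock encoder, not NVENC bindings: `lock_bitstream` here is a hypothetical stand-in for the blocking retrieve, `AccessUnit` and `run_pipeline` are invented names, and the channel capacity plays the role of the pool depth.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;

/// Stand-in for one encoded access unit.
#[derive(Debug, PartialEq, Eq)]
pub struct AccessUnit {
    pub frame: u64,
    pub bytes: usize,
}

/// Hypothetical stand-in for the blocking retrieve call.
/// A real implementation would wait here until the frame's encode completes.
fn lock_bitstream(frame: u64) -> AccessUnit {
    AccessUnit { frame, bytes: 1_000 + (frame % 7) as usize }
}

/// Submit `frames` frames on the calling thread and retrieve them on a
/// second thread, with at most `pool_depth` frames queued between the two.
pub fn run_pipeline(frames: u64, pool_depth: usize) -> Vec<AccessUnit> {
    // Bounded FIFO of submitted-but-not-retrieved frames (the surface pool).
    let (pending_tx, pending_rx) = sync_channel::<u64>(pool_depth);
    // Unbounded hand-off of finished access units to the consumer.
    let (au_tx, au_rx) = channel::<AccessUnit>();

    // Retrieve thread: the only place that blocks on encode completion.
    let retrieve = thread::spawn(move || {
        for frame in pending_rx {
            if au_tx.send(lock_bitstream(frame)).is_err() {
                break; // consumer went away
            }
        }
    });

    // Submit thread (the caller): never waits on completion, only on a
    // free slot in the pool when every surface is already in flight.
    for frame in 0..frames {
        // A real encoder would issue its non-blocking submit call here.
        pending_tx.send(frame).expect("retrieve thread exited early");
    }
    drop(pending_tx); // signal end of stream

    retrieve.join().expect("retrieve thread panicked");
    au_rx.iter().collect()
}
```

Output order is preserved because both queues are FIFO. The design point is that the submit side blocks only on pool exhaustion, which is what lets a backlog build up for the scheduler to drain.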

Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded. The honest bound (and why this is second to §A/§C): pipelining cannot manufacture GPU time. If the scheduler grants the encode context only X% under load, depth only guarantees work is ready for each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is priority, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by. Watch out: this forecloses sub-frame slice output (mutually exclusive with enableEncodeAsync), and HAGS can spike the submit call itself (100–200 ms NvEncEncodePicture stalls under HAGS).

C. Auto-gated REALTIME GPU scheduling priority

Raising the host process's WDDM GPU priority is the proven single-PC production lever — OBS and Sunshine both set D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME to stop being descheduled behind fullscreen games (OBS commit, Sunshine display_base.cpp). It works independently of HAGS (HAGS does not reassign cross-process priority — Microsoft: "Windows continues to control prioritization" DirectX devblog).

We ship only HIGH(4) by default with a static realtime opt-in and no auto-gate. Two things to change:

We can actually grant REALTIME. It needs SeIncreaseBasePriorityPrivilege, which an unelevated app lacks (OBS logs the failure) — but our host runs as a LocalSystem service, which holds it. The lever is available to us specifically.
Gate it to dodge the freeze. REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a documented NVENC hang (Sunshine ships nvenc_realtime_hags to downgrade to HIGH for exactly this; Sunshine config, NVIDIA repro). Implement the old plan's "Tier 3B": probe HAGS via D3DKMTQueryAdapterInfo and VRAM headroom via IDXGIAdapter3::QueryVideoMemoryInfo (continuously); use REALTIME only when HAGS-off, or HAGS-on with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
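The gate described above reduces to a pure decision over two probed values. The sketch below assumes the caller has already read the HAGS state and the budget/usage pair (the kind of numbers `QueryVideoMemoryInfo` reports); `GpuState`, `choose_priority` and the 15% headroom threshold are hypothetical and would need tuning on real hardware.

```rust
/// GPU scheduling priority class the host would request.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum GpuPriorityClass {
    High,
    Realtime,
}

/// Snapshot of the probed state (field names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct GpuState {
    pub hags_enabled: bool,
    pub vram_budget_bytes: u64,
    pub vram_usage_bytes: u64,
}

/// Hypothetical threshold: "comfortable" means at least 15% of budget free.
pub const MIN_HEADROOM_PERCENT: u64 = 15;

/// REALTIME when HAGS is off, or when HAGS is on with comfortable VRAM
/// headroom; otherwise fall back to HIGH.
pub fn choose_priority(state: &GpuState) -> GpuPriorityClass {
    if !state.hags_enabled {
        return GpuPriorityClass::Realtime;
    }
    if state.vram_budget_bytes == 0 {
        // No usable budget information: stay on the safe class.
        return GpuPriorityClass::High;
    }
    let headroom = state.vram_budget_bytes.saturating_sub(state.vram_usage_bytes);
    // headroom / budget >= MIN_HEADROOM_PERCENT / 100, without floats or overflow.
    let comfortable = (headroom as u128) * 100
        >= (state.vram_budget_bytes as u128) * (MIN_HEADROOM_PERCENT as u128);
    if comfortable {
        GpuPriorityClass::Realtime
    } else {
        GpuPriorityClass::High
    }
}
```

Calling this on every VRAM poll gives the "downgrade the instant VRAM tightens" behaviour; some hysteresis around the threshold is probably needed in practice to avoid flapping between classes.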

Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever. Priority is how the host takes GPU time from the game; it measurably costs the game fps (Doom Eternal 121→60 with Sunshine running). That's acceptable for a streaming host (the remote view is the product), but say so plainly and make the class operator-configurable (we already expose PUNKTFUNK_GPU_PRIORITY_CLASS).

D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat

Our *_amf/*_qsv libavcodec config already follows the research's advice: AMF usage=ultralowlatency + preanalysis=false (ffmpeg_win.rs:215), QSV async_depth=1 + low_power=1 VDEnc path (:226). Keep them. Two notes:

AMF/QSV suffer contention worse than NVENC. OBS: "For Intel and AMD GPUs, the hardware encoder requires significant resources of the same type a 3D app/game requires… different from NVIDIA's NVENC, which has dedicated encoding circuits" (OBS KB). So on an AMD/Intel host the collapse is expected to be harder — and §G (iGPU offload) is even more attractive there.
The AMF busy-poll floor (a fixed-sleep QueryOutput poll imposes ~15 ms via timer granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's QUERY_TIMEOUT patch); since we go through libavcodec we inherit it — just confirm the pinned FFmpeg build includes it. (ffmpeg-devel)

Verdict: REAL but largely already captured. No big win left here except via §G.

E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix

NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every frame — most visible in the light scene (the "200-not-240"). Pin it:

Windows: NvAPI per-application DRS PREFERRED_PSTATE = PREFER_MAX scoped to our exe (this is exactly Sunshine's nvenc_latency_over_power, Sunshine nvprefs). Crash-safe undo is mandatory — persist an undo record to %ProgramData%\punktfunk\ before applying, revert a stale profile on next start, so a crash never leaves the user's control panel modified.
Linux: nvidia-smi -lgc or NVML nvmlDeviceSetGpuLockedClocks (needs root/CAP_SYS_ADMIN; query nvmlDeviceGetMaxClockInfo, lock to that, restore on teardown and SIGTERM). Plus the newly added CudaNoStablePerfLimit driver profile (new in R580/595, so usable on the 595 box) to defeat the CUDA "Force P2" memory-clock clamp.
Gate behind PUNKTFUNK_PIN_CLOCKS; default off on battery / Steam Deck (pinning is harmful there).
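The crash-safe undo requirement follows a write-record-then-apply pattern. The sketch below shows only that ordering, using a plain file as the undo record; `ClockPin`, `FakePin`, `pin_clocks`, `unpin_clocks` and `recover_stale` are hypothetical names, and the trait stands in for the real NvAPI DRS or NVML calls.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical abstraction over the real driver call (NvAPI DRS / NVML).
pub trait ClockPin {
    fn apply(&mut self) -> io::Result<()>;
    fn revert(&mut self) -> io::Result<()>;
}

/// In-memory stand-in used only to exercise the ordering.
pub struct FakePin {
    pub pinned: bool,
}

impl ClockPin for FakePin {
    fn apply(&mut self) -> io::Result<()> {
        self.pinned = true;
        Ok(())
    }
    fn revert(&mut self) -> io::Result<()> {
        self.pinned = false;
        Ok(())
    }
}

/// On start: if a record survived, a previous run died while pinned.
/// Revert first, then clear the record. Returns true if it recovered.
pub fn recover_stale(record: &Path, pin: &mut impl ClockPin) -> io::Result<bool> {
    if !record.exists() {
        return Ok(false);
    }
    pin.revert()?;
    fs::remove_file(record)?;
    Ok(true)
}

/// Persist the undo record BEFORE touching the driver, so a crash at any
/// later point leaves evidence behind for the next start.
pub fn pin_clocks(record: &Path, pin: &mut impl ClockPin) -> io::Result<()> {
    let mut file = fs::File::create(record)?;
    file.write_all(b"clocks pinned\n")?;
    file.sync_all()?; // make sure the record reaches disk first
    if let Err(err) = pin.apply() {
        // Nothing was changed, so the record would be a false alarm.
        let _ = fs::remove_file(record);
        return Err(err);
    }
    Ok(())
}

/// Normal teardown: revert, then delete the record.
pub fn unpin_clocks(record: &Path, pin: &mut impl ClockPin) -> io::Result<()> {
    pin.revert()?;
    fs::remove_file(record)
}
```

In a real host the record would live under the directory named above and would need to carry enough detail to restore the previous setting rather than a fixed marker string.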

Verdict: REAL for latency stability, marginal for the saturated collapse (at 100% util the game already pins P0). Cheap, low risk, do it for the light-scene win.

F. Escape the frame-source ceiling — only if §3 says (b)

If uniq is the wall, no encoder/priority work helps — you need a better frame source.

Swapchain-hook capture (the real fix). Inject a hook on IDXGISwapChain::Present/Present1, vkQueuePresentKHR, wglSwapBuffers and copy the backbuffer to a shared texture before the compositor — OBS Game Capture's mechanism. Sees every presented frame, no compose/refresh gating. (OBS dxgi-capture) Tradeoffs are serious: anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an opt-in "game capture" mode, not the default.
NvFBC: not an option on Windows (dead, §1). On Linux it's viable via the consumer keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
Compose-flip (narrow): the topmost 1×1 layered-window trick (we already have composed_flip.rs) forces DWM composition and fixes specifically the DLSS-Frame-Gen half-rate case. Adds host-display latency; don't enable globally.
WGC "deliver 2× rate": Apollo sets MinUpdateInterval = 1e7/(fps*2) so the pacer always has a fresh frame to pick (Apollo); we set it to 1× refresh (wgc.rs:310). Cheap tweak to try on the WGC path.
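The MinUpdateInterval arithmetic in the last item is easy to get wrong by a factor of two, so here it is spelled out. The interval is expressed in 100 ns ticks, hence the 1e7 numerator; `min_update_interval_ticks` and the `rate_multiplier` parameter are hypothetical names for illustration.

```rust
/// Number of 100 ns ticks in one second.
pub const TICKS_PER_SECOND: i64 = 10_000_000;

/// Minimum update interval in 100 ns ticks for `fps * rate_multiplier`
/// deliveries per second. Returns None if that rate is zero or overflows.
pub fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> Option<i64> {
    let per_second = i64::from(fps).checked_mul(i64::from(rate_multiplier))?;
    if per_second == 0 {
        return None;
    }
    // Integer division rounds down, i.e. a slightly shorter interval.
    Some(TICKS_PER_SECOND / per_second)
}
```

For a 120 fps stream, a multiplier of 1 gives 83333 ticks (about 8.33 ms) and a multiplier of 2 gives 41666 ticks (about 4.17 ms), which is the "deliver 2× rate" setting described above.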

Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow. None invents frames the game didn't render.

G. The honest endgame — encode on a second GPU / the iGPU

For demanding titles that saturate the GPU even when capped, the only thing that removes contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a different GPU — a second dGPU or, more realistically, the iGPU (Intel QuickSync / AMD VCN), which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once, encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder" play, and the OBS "second GPU is harmful" verdict does not apply — that verdict is about moving only the NVENC block; moving capture + CSC + copies off the gaming GPU genuinely frees it. (OBS forum)

We're unusually well-placed for this: we already have working AMF and QSV backends (encode/windows/ffmpeg_win.rs) and the Linux VAAPI backend. The missing piece is a capture/topology mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but it's the only path that lets a demanding game and a clean stream coexist on one machine.

Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses." Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session; the consumer analogue is the iGPU.

6. Recommended order of attack

1. §3 Diagnose on the RTX box + a real game. Settles (a) vs (b). (half a day, decisive)
2. §5.A NV12/P010 on the default paths (IDD-push video-engine convert; Linux NV12 default-on; Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with nvidia-smi dmon.
3. §5.C Auto-gated REALTIME priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. §5.E Clock pin both OSes (crash-safe undo). Cheap light-scene win.
5. §5.B Correct two-thread async pipeline. Structural; recovers the depth-1 serialization.
6. §3-gated §5.F source escape (swapchain hook), only if uniq is the wall.
7. §5.G iGPU encode offload, the strategic answer for demanding titles; larger build.

After steps 2–5 the light-scene gap closes and the saturated floor rises materially. But report the honest ceiling: on one saturated GPU the game and the host split a fixed pie. Coarse WDDM graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only rendered 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie are reducing the game's demand (cap, rejected), taking a bigger slice (priority, which costs game fps), or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.

7. Placebos & dead ends (so we don't re-propose them)

| Candidate | Verdict | Why |
| --- | --- | --- |
| NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames as a "non-capping yield" | ✗ placebo | Shrinks the game's render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. (Battle(non)sense LDAT data) |
| HAGS on, as a contention fix | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it causes NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime queue. (OBS KB) |
| Split-frame encode (2/3/4-way) to fix contention | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured zero latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit disable sentinel, not a bug. (SDK header) |
| Move the encoded-bitstream readback to a copy engine | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion wait, not copy bandwidth. (The input full-frame copy is the real one, but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*` | ✗ placebo cross-process | Intra-context only; the game is a separate context. Stream priority "will not preempt already executing work". (CUDA docs) |
| VK/EGL global-priority REALTIME on Linux NVIDIA | ✗ | Not reliably granted on the proprietary driver, and moot anyway: our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| Windows "High performance" GPU preference | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| MIG / MPS / vGPU | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| NvFBC on Windows | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| Frame Generation / Smooth Motion to "make more frames" | ✗ red herring | We stream rendered frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |

8. Open evidence gaps (flagged honestly)

Whether ID3D11VideoProcessor::VideoProcessorBlt (BGRA→NV12) runs off the SM on GeForce is not confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. Verify on-box with nvidia-smi dmon (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
The exact share of the 13–17 ms encode_ms that is convert-on-SM vs scheduling-wait is unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD whitepaper; treat the direction as solid, the magnitude as TBD.
