Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).
- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
host-latency, gpu-contention (fixed stale status table), game-library,
linux-setup (fixed m0->spike + stale zero-copy claim),
session-aware-host-followups, windows-client-bootstrap,
windows-dualsense-{scoping,game-detection}, windows-virtual-display,
security-review (per-finding status table; #12 still open),
apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
merged, M4 done); windows-secure-desktop.md archived (now a fallback
behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)
Status: Investigation / plan. §5.A (NV12/P010 on the IDD-push default path) is SHIPPED (3514702, capture/windows/idd_push.rs + encode/windows/nvenc.rs). All other levers (§5.B/§5.C/§5.E/§5.F/§5.G) are OPEN; §5.C is partial (REALTIME knob exists, no auto-gate). Paired with host-latency-plan.md (mutual cross-refs; keep both). Trimmed to design rationale + open items; git history holds the full original.
The headache, stated precisely: a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the stream tracks; the moment the game pins the GPU the stream collapses to 40–50 fps while the game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light titles like CS2). Capping is not an acceptable fix — demanding titles exhaust the GPU even when capped.
This is the second, deeper pass on the problem. The first pass is
host-latency-plan.md (a 25-agent investigation, 2026-06-18). This doc
supersedes several of that doc's conclusions — the codebase moved a lot in the week since
(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.
0. TL;DR — the corrected mental model and the action list
The governing fact: NVENC is a dedicated ASIC on its own GPU runlist, physically separate from the SM/CUDA/graphics cores a 3D game saturates. The game does not steal the encode block. It steals everything that feeds the block — capture-acquire, the RGB→YUV colour-convert, the copy into the encoder's input surface, the readback — and the GPU-scheduler time to run that feed work, which is queued behind the game's graphics context. (NVENC app-note, engine-table proof, UNC RTAS'24)
Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart before writing code:
| Bottleneck | Symptom | Fix family |
|---|---|---|
| (a) feed-scheduling contention | uniq≈fps, both ~50; encode_ms 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| (b) frame-source ceiling | fps≈240 (held re-encodes) but uniq→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |
The single hardest truth: on one saturated GPU there is no free lunch. Any host GPU work either preempts the game (and steals its frames) or waits behind it. Capping the game works only because it cuts the game's total GPU demand and opens idle gaps. The non-capping equivalents are exactly three: need less GPU (footprint shrink), take more (priority — which costs the game fps), or use a different GPU (real isolation). Anything pitched as "make the game politely yield without losing anything" — Reflex, render-queue tricks — is a placebo here (§7).
Action list, highest leverage first (detail in §5–§6):
- Diagnose first (§3). Read uniq-vs-fps under the real workload + PresentMon presentation mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
- Stop feeding NVENC RGB on the default path: DONE for IDD-push (3514702). The install default now converts BGRA→NV12 (SDR) / FP16→P010 (HDR) before NVENC; the SDR convert runs off the SM, while the HDR shader still runs on it. Linux NV12-default and a video-engine HDR P010 are still open. (§5.A)
- Build a correct async encode pipeline: submit on one thread, blocking-retrieve on another, deep surface pool, Windows completion events. Our past "pipelining didn't help" was a same-thread implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
- Auto-gated REALTIME GPU priority. Our LocalSystem service can grant it (most apps can't). Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
- Lock clocks / pin P-state for jitter (cheap; fixes the light-scene "200-not-240", not the collapse). (§5.E)
- If source-bound: swapchain-hook capture (OBS-style), the real escape from the compose ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
- The honest endgame for demanding titles: encode on a second GPU / the iGPU. The only approach that removes contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)
1. Corrections to host-latency-plan.md (read before reusing it)
The old doc was right about the shape but several specifics are now wrong or stale:
- "Windows already feeds NVENC YUV on the video engine, so it does the right thing." True for the DDA and WGC paths, but it was false for IDD-push, which became the install default and fed NVENC RGB, paying the SM-side CSC the old doc said Windows had eliminated. The default path regressed on the exact axis the doc celebrated. Since fixed (3514702, §5.A): IDD-push now converts BGRA→NV12 on the video engine (FP16→P010 shader for HDR) and feeds NVENC native YUV.
- "PUNKTFUNK_ENCODE_DEPTH (default 4, ≤6) deep-pipelines." There is no such knob. It exists only in two stale comments (encode/windows/nvenc.rs:30, capture/windows/wgc.rs:57) and is never parsed. The real depth knob is PUNKTFUNK_IDD_DEPTH (default 2), used only by IDD-push on the native path; GameStream and the WGC helper are hardcoded depth-1.
- "Async NVENC is measure-gated and probably stacks latency (Tier 3D)." The measurement that produced that verdict (capture/windows/wgc_helper.rs:131-135) pipelined on a single thread: it queued more frames but still blocked in lock_bitstream inline, so it added queue latency with zero overlap. That is not the pattern the NVENC guide prescribes (submit/retrieve on separate threads). The correct async pipeline is untried, not disproven. (§5.B)
- "More GPU priority is maxed and hits a hard preemption wall with no recourse." Half right. Priority is near-maxed (HIGH), but the "no recourse" intuition is wrong: a higher-priority GPU context does preempt a saturating graphics context at pixel granularity, which is precisely how NVIDIA VR Async-TimeWarp injects a frame into a busy game (VRWorks Context Priority). And we default to HIGH, leaving REALTIME unused even though our SYSTEM service can grant it. (§5.C)
- "Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss." The "half the frames" effect is specifically a DLSS-Frame-Generation flip-metering artifact (FG v310.x+ / RTX 50-series), not a general property of independent-flip games: normal fullscreen flip games are captured at full rate by DDA. So composed-flip is a narrow fix, not a general lever. (Apollo #676: DDA captured a flip game at full 120 fps; Sunshine #3621: version-pinned to FG 310.x)
- "NvFBC is a possible low-overhead capture path." Dead on Windows: deprecated, frozen at Capture SDK 7.1 / Win10-1803 (NVIDIA deprecation bulletin). Linux-only, and there only via the consumer keylase patch.
What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the honest residual ceiling at 100% GPU. Those carry forward.
2. How the pipeline serializes today — the key insight
The capture→encode loop is a fixed-cadence pacer (gamestream/stream.rs, punktfunk1.rs): every
1/target_fps tick it grabs the freshest frame with a non-blocking try_latest(), and if
nothing new arrived it re-encodes the held frame (a near-empty P-frame). So the outbound fps is
pinned at target_fps no matter what the source did — which is why the raw fps counter lies under
contention. The only honest signal is the uniq / diag_new counter; the code itself states the
diagnostic: "low new_fps at high send rate ⇒ the source isn't producing frames, not an encode stall."
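To make the "raw fps lies, uniq is honest" point concrete, here is a toy one-second simulation of a fixed-cadence pacer. It is a sketch, not the code in gamestream/stream.rs: the function name, the integer-microsecond clock, and the assumption that source frame k arrives at exactly k / source_fps seconds are all illustrative.

```rust
/// Simulate one second of a fixed-cadence pacer (a toy model, not the host's code).
/// Assumption: source frame k becomes available at exactly k / source_fps seconds.
/// Returns (frames_sent, unique_frames_sent).
fn simulate_pacer(target_fps: u64, source_fps: u64) -> (u64, u64) {
    let mut sent = 0;
    let mut uniq = 0;
    let mut held: Option<u64> = None; // id of the frame the pacer currently holds

    for tick in 0..target_fps {
        let now_us = tick * 1_000_000 / target_fps;
        // Id of the freshest source frame available at this tick.
        let freshest = now_us * source_fps / 1_000_000;

        // Stand-in for a non-blocking try_latest(): Some only if something new arrived.
        let new_frame = if held != Some(freshest) { Some(freshest) } else { None };

        if let Some(id) = new_frame {
            held = Some(id);
            uniq += 1; // a genuinely new frame gets encoded
        }
        // Otherwise the held frame is re-encoded; either way one frame goes out.
        sent += 1;
    }
    (sent, uniq)
}
```

With a 240 fps target and a source that only delivers 50 frames per second, this model sends 240 frames of which 50 are unique, which is the bottleneck (b) signature from §0.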
The NVENC round-trip (the dominant path) is depth-1 synchronous: encode_picture is a
non-blocking ASIC launch, but lock_bitstream blocks the same thread until that frame completes
(no enableEncodeAsync, no completion event). The only thread split is encode-vs-network-send, never
submit-vs-retrieve. So under contention the loop is strictly serial — capture (+convert) → submit → block in lock_bitstream → hand AU to the send thread — and the arithmetic matches the symptom:
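1000/17 ≈ 59 and 1000/13 ≈ 77 fps are the ceiling set by the encode round-trip alone; the observed ~50 sits just below that range, consistent with capture and convert time adding to each serial iteration. That is the signature of one frame in flight per round-trip, not an ASIC throughput wall.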
(independent NVENC latency study: ~7 frames across all presets)
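The serial-loop arithmetic can be written down as two small helpers. This is a back-of-envelope model only: the 3 ms capture+convert figure used in the note below is a hypothetical illustration, not a measurement, and the "pipelined" bound is the idealised case where stages overlap perfectly.

```rust
/// Frames per second of a strictly serial loop with one frame in flight:
/// every stage of frame N must finish before frame N+1 starts.
fn serial_fps(stage_ms: &[f64]) -> f64 {
    1000.0 / stage_ms.iter().sum::<f64>()
}

/// Idealised steady-state rate when the stages overlap perfectly:
/// throughput is bounded by the slowest single stage.
fn pipelined_fps(stage_ms: &[f64]) -> f64 {
    1000.0 / stage_ms.iter().cloned().fold(0.0, f64::max)
}
```

For example, `serial_fps(&[17.0])` is about 58.8 and `serial_fps(&[13.0])` is about 76.9, matching the figures above; adding a hypothetical 3 ms of capture and convert, `serial_fps(&[3.0, 17.0])` is 50, while `pipelined_fps(&[3.0, 17.0])` stays at about 58.8.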
Where the per-frame GPU work lands, by path (the crux of contention — lower contended-engine load is better):
| Path | Colour-convert | NVENC input | Contended-engine load/frame |
|---|---|---|---|
| IDD-push (install default) | NV12/P010 on the video engine (3514702; FP16→P010 via shader for HDR) | NV12/P010 | low (SDR) / shader-CSC on SM (HDR) |
| WGC (fallback default) | VideoProcessorBlt → NV12 on the video engine | NV12/P010 | low |
| DDA | VideoProcessorBlt → NV12 on the video engine | NV12/P010 | medium (one 3D CopyResource to release the dup fast) |
| Linux NVENC | none → NVENC internal RGB→YUV on the SM (default) | RGBZ/BGRZ (NV12 only if PUNKTFUNK_NV12 and PUNKTFUNK_ZEROCOPY) | high |
Measured magnitude of "RGB vs NV12 to the encoder": RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%. NVENC's guide confirms the mechanism: "Encoding of RGB contents" is on the explicit list of features that internally use CUDA (NVENC prog-guide §Encoder Features using CUDA).
3. Diagnose first — cheap, decisive, do before any code
Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM cannot reproduce this — run on the RTX 4090 Windows box (and a real NVIDIA Linux box) with an actual saturating game.
- Run with PUNKTFUNK_PERF=1 and read uniq vs fps under CS2 at GPU-100%:
  - fps≈target but uniq→40–50 ⇒ (b) source ceiling: the compositor/IDD only produced 40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
  - both fps and uniq→40–50, with encode_ms 13–17 ⇒ (a) feed contention: the round-trip is starving. Go to §5.A/B/C.
- Classify the game's presentation with PresentMon: "Presented FPS" vs "Displayed FPS" and Presentation Mode (Hardware: Independent Flip vs Composed: Flip). Independent-Flip + uniq ≪ Presented ⇒ source/flip problem; Presented FPS itself collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing frames.
- Log cap_us/enc_us/pace_us p50/p99 alongside to localise the stall. (Per-stage cap/submit/wait µs instrumentation landed under PUNKTFUNK_PERF in 3514702.)
Necessary-but-not-sufficient caveat: if the game only rendered 50 frames because it's GPU-bound, nothing downstream creates the other 90. Source fixes address (b) only; the throughput of a saturated single GPU is split between game and host no matter what.
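The uniq-vs-fps reading in the first step could be expressed as a small decision function. This is a minimal sketch, not the automated classifier listed as open in §8: the enum, the function name, the 10% tolerance, and the `Unclear` case for "fps collapsed and uniq is lower still" are all assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Bottleneck {
    Healthy,        // fps tracks the target and uniq tracks fps
    SourceCeiling,  // (b): fps ≈ target, but uniq is far below it
    FeedContention, // (a): fps and uniq collapse together
    Unclear,        // fps collapsed and uniq is lower still; needs PresentMon data
}

/// Classify one PUNKTFUNK_PERF sample. `TOLERANCE` is a hypothetical threshold:
/// a value within 10% of its reference counts as "tracking" it.
fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bottleneck {
    const TOLERANCE: f64 = 0.9;
    let fps_tracks_target = fps >= TOLERANCE * target_fps;
    let uniq_tracks_fps = uniq >= TOLERANCE * fps;
    match (fps_tracks_target, uniq_tracks_fps) {
        (true, true) => Bottleneck::Healthy,
        (true, false) => Bottleneck::SourceCeiling,
        (false, true) => Bottleneck::FeedContention,
        (false, false) => Bottleneck::Unclear,
    }
}
```

A sample of `classify(240.0, 240.0, 45.0)` lands on the source ceiling and `classify(240.0, 50.0, 50.0)` on feed contention; encode_ms and the PresentMon mode would still need to be checked by hand.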
4. Current-state audit (what's shipped / regressed / missing)
| Area | State | Where |
|---|---|---|
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | session_tuning.rs ✅ |
| Thread priority (Linux) | setpriority −10/−5, native path only; GameStream Linux threads get none | punktfunk1.rs:1977 ⚠ |
| GPU sched priority | D3DKMTSetProcessSchedulingPriorityClass HIGH(4) default; realtime opt-in, no auto-gate; cross-process onto WGC helper | capture/windows/dxgi.rs:208-330 ⚠ |
| GPU thread/latency | SetGPUThreadPriority(0x4000001E), SetMaximumFrameLatency(1) | dxgi.rs:193-200 ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅; IDD-push (default) now video-engine NV12 (3514702) ✅ | wgc.rs:631 / idd_push.rs |
| CSC off-SM (Win HDR) | IDD-push HDR via FP16→P010 shader (on-SM); other paths on-SM unless PUNKTFUNK_HDR_SHADER_P010 | wgc.rs:603 ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is double-opt-in (PUNKTFUNK_NV12+PUNKTFUNK_ZEROCOPY) | encode/linux/mod.rs:104 ⚠ |
| Encode pipeline | depth-1 synchronous, inline lock_bitstream; IDD-push native = depth-2 same-thread | nvenc.rs:801 ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | nvenc.rs:424-447 ✅ |
| Zero-copy register-in-place | yes; IDD-push out-ring is now the convert target (NV12/P010), no extra copy | nvenc.rs:623 ✅ |
| AMF tuning | usage=ultralowlatency, preanalysis=false | ffmpeg_win.rs:215-219 ✅ |
| QSV tuning | async_depth=1, low_power=1 (VDEnc) | ffmpeg_win.rs:226-227 ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | stream.rs, transport/qos.rs ✅ |
| Clock / P-state pin | none (zero hits repo-wide) | ✗ |
| Async NVENC (2-thread) | none | ✗ |
| Frame-source escape (hook/NvFBC-Linux) | none | ✗ |
| Second-GPU / iGPU encode offload | none | ✗ |
| DSCP/QoS | implemented, PUNKTFUNK_DSCP opt-in (default off) | transport/qos.rs ⚠ |
5. The levers, ranked, with honest verdicts
A. Stop feeding NVENC RGB on the default path — DONE for Windows IDD-push (3514702)
The default Windows IDD-push path used to hand NVENC packed RGB, forcing NVENC's internal RGB→YUV CSC
onto the SM the game saturates. 3514702 makes the out-ring the convert target: a D3D11 video-engine
VideoConverter does BGRA→NV12 (SDR, BT.709 limited) in place, so NVENC gets native NV12 and skips its
SM-side CSC; HDR uses the FP16→P010 shader (NVIDIA's VideoProcessor can't do RGB→P010). NV12 input forces
bit_depth=8, so an HDR↔SDR toggle re-inits the session at the matching depth (NV12 can't feed a 10-bit
session). This also removed the separate CopyResource (the convert writes the ring directly).
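For readers who want to see what "BGRA→NV12 (SDR, BT.709 limited)" means numerically, here is a CPU reference sketch. It is not how the host converts (that happens on the D3D11 video processor) and would be far too slow for streaming; the function names and the choice to average chroma over each 2×2 block are assumptions for illustration. It also shows why NV12 is half the size of BGRA: a full-resolution Y plane followed by one interleaved CbCr pair per 2×2 block.

```rust
/// 8-bit full-range RGB to BT.709 limited-range Y'CbCr (Y in 16..=235, C in 16..=240).
fn rgb_to_ycbcr_bt709_limited(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    let (r, g, b) = (r as f64 / 255.0, g as f64 / 255.0, b as f64 / 255.0);
    let y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    let cb = (b - y) / 1.8556;
    let cr = (r - y) / 1.5748;
    let quantize = |v: f64| v.round().clamp(0.0, 255.0) as u8;
    (
        quantize(16.0 + 219.0 * y),
        quantize(128.0 + 224.0 * cb),
        quantize(128.0 + 224.0 * cr),
    )
}

/// Convert a tightly packed BGRA image (even width and height) to NV12:
/// w*h bytes of Y, then (w/2)*(h/2) interleaved Cb,Cr pairs.
fn bgra_to_nv12(bgra: &[u8], w: usize, h: usize) -> Vec<u8> {
    assert!(w % 2 == 0 && h % 2 == 0 && bgra.len() == w * h * 4);
    let mut out = vec![0u8; w * h * 3 / 2];
    let (y_plane, uv_plane) = out.split_at_mut(w * h);
    for by in (0..h).step_by(2) {
        for bx in (0..w).step_by(2) {
            let (mut cb_sum, mut cr_sum) = (0u32, 0u32);
            for (dy, dx) in [(0, 0), (0, 1), (1, 0), (1, 1)] {
                let (x, y) = (bx + dx, by + dy);
                let p = &bgra[(y * w + x) * 4..]; // byte order is B, G, R, A
                let (luma, cb, cr) = rgb_to_ycbcr_bt709_limited(p[2], p[1], p[0]);
                y_plane[y * w + x] = luma;
                cb_sum += cb as u32;
                cr_sum += cr as u32;
            }
            // One CbCr pair per 2x2 block, averaged with rounding.
            let uv = (by / 2) * w + bx;
            uv_plane[uv] = ((cb_sum + 2) / 4) as u8;
            uv_plane[uv + 1] = ((cr_sum + 2) / 4) as u8;
        }
    }
    out
}
```

Full white maps to Y=235 with neutral chroma (128, 128) and full black to Y=16, which is the limited-range footroom and headroom that an 8-bit NV12 session expects.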
Verdict: REAL, but honestly conditional — the convert has to land off the SM to fully pay off.
VideoProcessorBlt is designed to use fixed-function video hardware and the hardforum numbers back the
15%→2% drop, but no NVIDIA doc explicitly confirms VideoProcessorBlt runs off-SM on GeForce — treat
the "video engine" claim as well-founded-but-unverified and confirm on-box with nvidia-smi dmon (watch
the enc/sm columns) before and after. Do not convert with a CUDA/3D shader and call it done — that
just relocates the CSC to the same SM (this is why the HDR P010 shader path is still on-SM; Sunshine's
RGB→NV12 CUDA kernel still contends).
Still open in §A:
- Linux: make NV12 the default for the tiled zero-copy path (gated behind PUNKTFUNK_NV12 and PUNKTFUNK_ZEROCOPY today; encode/linux/mod.rs:104, linux/zerocopy/egl.rs:272), feeding NVENC NV_ENC_BUFFER_FORMAT_NV12. The GL detile already runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- Windows HDR: move the FP16→P010 convert onto the video engine where the VP supports it (today's shader keeps it on-SM), or flip PUNKTFUNK_HDR_SHADER_P010 on by default for the non-IDD paths.
B. A correct async encode pipeline (the untried encoder lever) — OPEN
The NVENC Programming Guide is explicit: "The main encoder thread should be used only to submit
work… (non-blocking NvEncEncodePicture). Output buffer processing — waiting on the completion
event in asynchronous mode, or calling NvEncLockBitstream in synchronous mode — should be done in
the secondary thread."
(NVENC prog-guide, threading model)
We do the opposite — submit and blocking-retrieve on one thread. Queuing more pending entries
(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with no overlap,
which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
implementation, not a disproof.
The fix: submit on the capture/encode thread; do lock_bitstream on a dedicated retrieve thread;
hold a deep input+output surface pool (≈4–8); on Windows register a completionEvent per output
buffer (enableEncodeAsync=1) — on Linux async events are unsupported, so use the same two-thread
split with a blocking retrieve.
(async is Windows/WDDM-only;
FFmpeg models the same knob as delay/async_depth,
libavcodec/nvenc.c).
This lets the WDDM scheduler find a backlog when it finally grants the encoder context a slice, and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do frame N+1's convert.
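The shape of the two-thread split can be sketched with standard-library channels. Everything encoder-specific here is a stand-in: `Pending`, `blocking_retrieve`, and `run_pipeline` are hypothetical names, the sleep merely imitates waiting on the ASIC, and no NVENC API is called. The bounded queue plays the role of the surface pool, so the submit side only blocks once every surface is in flight.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for an in-flight encoder output buffer.
struct Pending {
    frame: u32,
}

/// Hypothetical stand-in for the blocking retrieve (lock_bitstream in the text).
fn blocking_retrieve(p: Pending) -> Vec<u8> {
    thread::sleep(Duration::from_millis(1)); // pretend to wait for the ASIC
    p.frame.to_le_bytes().to_vec()
}

/// Submit `frames` frames on the calling thread and retrieve them on a second
/// thread. Returns the frame ids in the order their access units came out.
fn run_pipeline(frames: u32, pool_depth: usize) -> Vec<u32> {
    // Bounded queue = surface pool of `pool_depth` entries.
    let (submit_tx, submit_rx) = sync_channel::<Pending>(pool_depth);
    let (au_tx, au_rx) = channel::<Vec<u8>>();

    let retriever = thread::spawn(move || {
        for pending in submit_rx {
            let au = blocking_retrieve(pending);
            if au_tx.send(au).is_err() {
                break; // consumer went away
            }
        }
    });

    for frame in 0..frames {
        // Capture, convert and the non-blocking submit would happen here.
        submit_tx.send(Pending { frame }).expect("retrieve thread alive");
    }
    drop(submit_tx); // close the queue so the retrieve thread exits
    retriever.join().expect("retrieve thread panicked");

    au_rx
        .iter()
        .map(|au| u32::from_le_bytes([au[0], au[1], au[2], au[3]]))
        .collect()
}
```

Because a single retrieve thread drains a FIFO queue, access units come out in submit order, so the send thread downstream needs no reordering. A real implementation would wait on per-buffer completion events on Windows instead of a blind blocking call.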
Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.
The honest bound (and why this is second to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is ready for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
priority, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this forecloses sub-frame slice output (mutually exclusive with enableEncodeAsync),
and HAGS can spike the submit call itself
(100–200 ms nvEncEncodePicture stalls under HAGS).
C. Auto-gated REALTIME GPU scheduling priority — PARTIAL (knob exists, no auto-gate)
Raising the host process's WDDM GPU priority is the proven single-PC production lever — OBS and
Sunshine both set D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME to stop being descheduled behind
fullscreen games
(OBS commit,
Sunshine display_base.cpp).
It works independently of HAGS (HAGS does not reassign cross-process priority — Microsoft:
"Windows continues to control prioritization"
DirectX devblog).
We ship only HIGH(4) by default with a static realtime opt-in (PUNKTFUNK_GPU_PRIORITY_CLASS,
dxgi.rs:208-330) and no auto-gate. Two things to change:
- We can actually grant REALTIME. It needs SeIncreaseBasePriorityPrivilege, which an unelevated app lacks (OBS logs the failure), but our host runs as a LocalSystem service, which holds it. The lever is available to us specifically.
- Gate it to dodge the freeze. REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a documented NVENC hang (Sunshine ships nvenc_realtime_hags to downgrade to HIGH for exactly this; Sunshine config, NVIDIA repro). Implement the old plan's "Tier 3B": probe HAGS via D3DKMTQueryAdapterInfo and VRAM headroom via IDXGIAdapter3::QueryVideoMemoryInfo (continuously); use REALTIME only when HAGS-off, or HAGS-on with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
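The gate itself reduces to a pure decision over three probed values. The sketch below assumes the probes have already been made; the struct, the function name, and the headroom threshold parameter are hypothetical, and what counts as "comfortable" headroom is not specified by this document. The two byte fields mirror the Budget and CurrentUsage values that QueryVideoMemoryInfo reports.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the probed state (hypothetical struct).
struct GpuState {
    hags_enabled: bool,
    vram_budget_bytes: u64,
    vram_usage_bytes: u64,
}

/// REALTIME when HAGS is off, or when HAGS is on and free VRAM is at least
/// `min_headroom_fraction` of the budget; HIGH otherwise.
fn choose_priority(state: &GpuState, min_headroom_fraction: f64) -> GpuPriority {
    if !state.hags_enabled {
        return GpuPriority::Realtime;
    }
    if state.vram_budget_bytes == 0 {
        return GpuPriority::High; // no usable budget reading: stay on the safe class
    }
    let free = state.vram_budget_bytes.saturating_sub(state.vram_usage_bytes) as f64;
    let headroom = free / state.vram_budget_bytes as f64;
    if headroom >= min_headroom_fraction {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Re-evaluating this on every probe gives the "downgrade the instant VRAM tightens" behaviour; a production version would probably add hysteresis so the class does not flap around the threshold, which is a design choice this sketch leaves out.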
Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever. Priority is how
the host takes GPU time from the game; it measurably costs the game fps
(Doom Eternal 121→60 with Sunshine running).
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose PUNKTFUNK_GPU_PRIORITY_CLASS).
D. Multi-vendor encoder hygiene (AMF/QSV) — stable / mostly done, one caveat
Our *_amf/*_qsv libavcodec config already follows the research's advice: AMF
usage=ultralowlatency + preanalysis=false (ffmpeg_win.rs:215), QSV async_depth=1 +
low_power=1 VDEnc path (:226). Keep them. Two notes:
- AMF/QSV suffer contention worse than NVENC. OBS: "For Intel and AMD GPUs, the hardware encoder requires significant resources of the same type a 3D app/game requires… different from NVIDIA's NVENC, which has dedicated encoding circuits" (OBS KB). So on an AMD/Intel host the collapse is expected to be harder — and §G (iGPU offload) is even more attractive there.
- The AMF busy-poll floor (a fixed-sleep QueryOutput poll imposes ~15 ms via timer granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's QUERY_TIMEOUT patch); since we go through libavcodec we inherit it. Just confirm the pinned FFmpeg build includes it. (ffmpeg-devel)
Verdict: REAL but largely already captured. No big win left here except via §G.
E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix — OPEN
NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every frame — most visible in the light scene (the "200-not-240"). Pin it:
- Windows: NvAPI per-application DRS PREFERRED_PSTATE = PREFER_MAX scoped to our exe (this is exactly Sunshine's nvenc_latency_over_power, Sunshine nvprefs). Crash-safe undo is mandatory: persist an undo record to %ProgramData%\punktfunk\ before applying, revert a stale profile on next start, so a crash never leaves the user's control panel modified.
- Linux: nvidia-smi -lgc / NVML nvmlDeviceSetGpuLockedClocks (needs root/CAP_SYS_ADMIN; query nvmlDeviceGetMaxClockInfo, lock to that, restore on teardown and SIGTERM). Plus the newly-added CudaNoStablePerfLimit driver profile (new in R580/595, so usable on the 595 box) to defeat the CUDA "Force P2" memory-clock clamp.
- Gate behind PUNKTFUNK_PIN_CLOCKS; default off on battery / Steam Deck (pinning is harmful there).
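The crash-safe undo requirement follows a simple ordering rule: persist the undo record first, apply second, and treat a record found at startup as proof of an unclean exit. A minimal sketch of that ordering follows. The `ClockPin` trait, `FakePin`, and the three function names are hypothetical, the record is just a marker file, and no NvAPI or NVML call is made.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical abstraction over whatever actually changes the setting.
trait ClockPin {
    fn apply(&mut self) -> io::Result<()>;
    fn revert(&mut self) -> io::Result<()>;
}

/// In-memory stand-in used to exercise the ordering.
struct FakePin {
    pinned: bool,
}

impl ClockPin for FakePin {
    fn apply(&mut self) -> io::Result<()> {
        self.pinned = true;
        Ok(())
    }
    fn revert(&mut self) -> io::Result<()> {
        self.pinned = false;
        Ok(())
    }
}

/// Call at startup: a leftover record means the last run died while pinned.
/// Returns true if a stale pin was reverted.
fn revert_stale(undo_record: &Path, pin: &mut dyn ClockPin) -> io::Result<bool> {
    if !undo_record.exists() {
        return Ok(false);
    }
    pin.revert()?;
    fs::remove_file(undo_record)?;
    Ok(true)
}

/// Persist the undo record BEFORE touching the setting, so a crash between
/// the two steps leaves at worst a record with nothing to undo.
fn pin_with_undo(undo_record: &Path, pin: &mut dyn ClockPin) -> io::Result<()> {
    fs::write(undo_record, b"clock pin applied")?;
    pin.apply()
}

/// Clean teardown: revert first, then drop the record.
fn unpin(undo_record: &Path, pin: &mut dyn ClockPin) -> io::Result<()> {
    pin.revert()?;
    fs::remove_file(undo_record)
}
```

This relies on the revert being idempotent, since a crash after writing the record but before applying leads to a revert of a setting that was never changed. A real record would also need to store the previous value rather than a bare marker.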
Verdict: REAL for latency stability, marginal for the saturated collapse (at 100% util the game already pins P0). Cheap, low risk, do it for the light-scene win.
F. Escape the frame-source ceiling — only if §3 says (b) — OPEN
If uniq is the wall, no encoder/priority work helps — you need a better frame source.
- Swapchain-hook capture (the real fix). Inject a hook on IDXGISwapChain::Present/Present1, vkQueuePresentKHR, wglSwapBuffers and copy the backbuffer to a shared texture before the compositor (OBS Game Capture's mechanism). Sees every presented frame, no compose/refresh gating. (OBS dxgi-capture) Tradeoffs are serious: anti-cheat (EAC/BattlEye/Vanguard) flags injection, so it needs whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an opt-in "game capture" mode, not the default.
- NvFBC: not an option on Windows (dead, §1). On Linux it's viable via the consumer keylase patch and captures below composition; worth a flag for the Linux NVIDIA host.
- Compose-flip (narrow): the topmost 1×1 layered-window trick (we already have composed_flip.rs) forces DWM composition and fixes specifically the DLSS-Frame-Gen half-rate case. Adds host-display latency; don't enable globally.
- WGC "deliver 2× rate": Apollo sets MinUpdateInterval = 1e7/(fps*2) so the pacer always has a fresh frame to pick (Apollo); we set it to 1× refresh (wgc.rs:310). Cheap tweak to try on the WGC path.
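The 1e7 in the MinUpdateInterval formula is the number of 100 ns ticks in one second, the unit Windows TimeSpan durations use. A small helper makes the 1× versus 2× difference explicit; the function and constant names are illustrative, not taken from wgc.rs.

```rust
/// 100 ns ticks per second, the unit of a Windows TimeSpan duration.
const TICKS_PER_SECOND: i64 = 10_000_000;

/// Minimum update interval in 100 ns ticks for a given stream rate.
/// `rate_multiplier` = 1 is the per-refresh setting; 2 is the "deliver 2x rate" tweak.
fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> i64 {
    assert!(fps > 0 && rate_multiplier > 0, "fps and multiplier must be non-zero");
    TICKS_PER_SECOND / (fps as i64 * rate_multiplier as i64)
}
```

At 120 fps the 1× setting is 83,333 ticks (about 8.33 ms) and the 2× setting is 41,666 ticks (about 4.17 ms), so the capture source is allowed to deliver up to two frames per pacer tick.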
Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow. None invents frames the game didn't render.
G. The honest endgame — encode on a second GPU / the iGPU — OPEN
For demanding titles that saturate the GPU even when capped, the only thing that removes contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a different GPU — a second dGPU or, more realistically, the iGPU (Intel QuickSync / AMD VCN), which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once, encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder" play, and the OBS "second GPU is harmful" verdict does not apply — that verdict is about moving only the NVENC block; moving capture + CSC + copies off the gaming GPU genuinely frees it. (OBS forum)
We're unusually well-placed for this: we already have working AMF and QSV backends
(encode/windows/ffmpeg_win.rs) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.
Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses." Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session; the consumer analogue is the iGPU.
6. Recommended order of attack
1. §3 Diagnose on the RTX box + a real game. Settles (a) vs (b). (half a day, decisive)
2. §5.A NV12/P010 on the default paths: IDD-push DONE (3514702); remaining: Linux NV12 default-on, Windows HDR P010 off-SM. Confirm off-SM with nvidia-smi dmon.
3. §5.C Auto-gated REALTIME priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. §5.E Clock pin both OSes (crash-safe undo). Cheap light-scene win.
5. §5.B Correct two-thread async pipeline. Structural; recovers the depth-1 serialization.
6. §3-gated §5.F source escape (swapchain hook), only if uniq is the wall.
7. §5.G iGPU encode offload: the strategic answer for demanding titles; larger build.
After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the honest ceiling: on one saturated GPU the game and the host split a fixed pie — coarse WDDM graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only rendered 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps), or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.
7. Placebos & dead ends (so we don't re-propose them)
| Candidate | Verdict | Why |
|---|---|---|
| NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames as a "non-capping yield" | ✗ placebo | Shrinks the game's render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. (Battle(non)sense LDAT data) |
| HAGS on, as a contention fix | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it causes NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime queue. (OBS KB) |
| Split-frame encode (2/3/4-way) to fix contention | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured zero latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). splitEncodeMode=15 is the legit disable sentinel, not a bug. (SDK header) |
| Move the encoded-bitstream readback to a copy engine | ✗ placebo | Output is KB-scale; the cost of lock_bitstream is the completion wait, not copy bandwidth. (The input full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| CUDA stream priority / CUDA_DEVICE_MAX_CONNECTIONS / CU_CTX_SCHED_* | ✗ placebo cross-process | Intra-context only; the game is a separate context. Stream priority "will not preempt already executing work". (CUDA docs) |
| VK/EGL global-priority REALTIME on Linux NVIDIA | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| Windows "High performance" GPU preference | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| MIG / MPS / vGPU | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| NvFBC on Windows | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| Frame Generation / Smooth Motion to "make more frames" | ✗ red herring | We stream rendered frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |
8. Open items / what's left
Diagnostics + still-unbuilt levers (verbatim, highest leverage first):
- §3 automation: instrument the uniq-vs-fps heuristic + a PresentMon probe so (a)/(b) is decided automatically, not by hand. (Per-stage cap/submit/wait µs already land under PUNKTFUNK_PERF from 3514702; the uniq/PresentMon classifier is not yet automated.)
- §5.A residual: Linux NV12 default-on for the tiled zero-copy path (drop the PUNKTFUNK_NV12+PUNKTFUNK_ZEROCOPY double-opt-in); move the Windows HDR FP16→P010 convert off the SM (today it's a shader). Windows IDD-push SDR/HDR NV12/P010 is DONE (3514702).
- §5.B: build a correct async NVENC pipeline: submit on one thread, blocking lock_bitstream on a dedicated retrieve thread, deep input+output surface pool (≈4–8), Windows per-buffer completionEvent (enableEncodeAsync=1), same two-thread split on Linux.
- §5.C: auto-gate REALTIME GPU priority: probe HAGS (D3DKMTQueryAdapterInfo) + VRAM headroom (IDXGIAdapter3::QueryVideoMemoryInfo) continuously; REALTIME only when HAGS-off or HAGS-on with comfortable headroom, downgrade to HIGH the instant VRAM tightens. (Static realtime opt-in exists in dxgi.rs; no auto-gate.)
- §5.E: clock / P-state pinning: Windows NvAPI DRS PREFERRED_PSTATE=PREFER_MAX (crash-safe undo to %ProgramData%\punktfunk\); Linux nvidia-smi -lgc / nvmlDeviceSetGpuLockedClocks (+ CudaNoStablePerfLimit on R580/595). Gate behind PUNKTFUNK_PIN_CLOCKS, default off on battery/Deck.
- §5.F: frame-source escape (only if §3 says (b)): swapchain-hook capture (OBS-style, anti-cheat tradeoffs); NvFBC on Linux (keylase patch); compose-flip for the DLSS-FG half-rate case; WGC MinUpdateInterval = 1e7/(fps*2) 2×-rate tweak.
- §5.G: iGPU / second-GPU encode offload: pin capture to the gaming adapter, encoder to the iGPU adapter, one cross-adapter shared-texture copy. Reuses the AMF/QSV/VAAPI backends.
Open evidence gaps (verify on-box)
- Whether ID3D11VideoProcessor::VideoProcessorBlt (BGRA→NV12) runs off the SM on GeForce is not confirmed by any NVIDIA document, and it is the linchpin of §5.A's full payoff. Verify on-box with nvidia-smi dmon (sm% vs enc%) on the IDD-push/WGC path before assuming the win landed.
- The exact share of the 13–17 ms encode_ms that is convert-on-SM vs scheduling-wait is unmeasured. §3 + an A/B of IDD-push-RGB (pre-3514702) vs IDD-push-NV12 on the same scene settles it and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD whitepaper; treat the direction as solid, the magnitude as TBD.