enricobuehler 7b99b41ede docs(design): trim shipped plans, consolidate cluster, add index

Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).

- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
  apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
  host-latency, gpu-contention (fixed stale status table), game-library,
  linux-setup (fixed m0->spike + stale zero-copy claim),
  session-aware-host-followups, windows-client-bootstrap,
  windows-dualsense-{scoping,game-detection}, windows-virtual-display,
  security-review (per-finding status table; #12 still open),
  apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
  windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
  merged, M4 done); windows-secure-desktop.md archived (now a fallback
  behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
  roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-26 16:39:06 +00:00


GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)

Status: Investigation / plan. §5.A (NV12/P010 on the IDD-push default path) is SHIPPED — 3514702, capture/windows/idd_push.rs + encode/windows/nvenc.rs. All other levers (§5.B/§5.C/§5.E/§5.F/§5.G) are OPEN; §5.C is partial (REALTIME knob exists, no auto-gate). Paired with host-latency-plan.md (mutual cross-refs — keep both). Trimmed to design rationale + open items; git history holds the full original.

The headache, stated precisely: a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the stream tracks; the moment the game pins the GPU the stream collapses to 40–50 fps while the game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light titles like CS2). Capping is not an acceptable fix — demanding titles exhaust the GPU even when capped.

This is the second, deeper pass on the problem. The first pass is host-latency-plan.md (a 25-agent investigation, 2026-06-18). This doc supersedes several of that doc's conclusions — the codebase moved a lot in the week since (the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned two of the old plan's premises. Read §1 (corrections) before acting on the old doc.

0. TL;DR — the corrected mental model and the action list

The governing fact: NVENC is a dedicated ASIC on its own GPU runlist, physically separate from the SM/CUDA/graphics cores a 3D game saturates. The game does not steal the encode block. It steals everything that feeds the block — capture-acquire, the RGB→YUV colour-convert, the copy into the encoder's input surface, the readback — and the GPU-scheduler time to run that feed work, which is queued behind the game's graphics context. (NVENC app-note, engine-table proof, UNC RTAS'24)

Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart before writing code:

Bottleneck	Symptom	Fix family
(a) feed-scheduling contention	`uniq`≈`fps`, both ~50; `encode_ms` 13–17	shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU
(b) frame-source ceiling	`fps`≈240 (held re-encodes) but `uniq`→40–50	capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case

The single hardest truth: on one saturated GPU there is no free lunch. Any host GPU work either preempts the game (and steals its frames) or waits behind it. Capping the game works only because it cuts the game's total GPU demand and opens idle gaps. The non-capping equivalents are exactly three: need less GPU (footprint shrink), take more (priority — which costs the game fps), or use a different GPU (real isolation). Anything pitched as "make the game politely yield without losing anything" — Reflex, render-queue tricks — is a placebo here (§7).

Action list, highest leverage first (detail in §5–§6):

Diagnose first (§3). Read uniq-vs-fps under the real workload + PresentMon presentation mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
Stop feeding NVENC RGB on the default path — DONE for IDD-push (3514702): the install default now converts BGRA→NV12 (SDR) / FP16→P010 (HDR) before NVENC, off the SM. Linux NV12-default and a video-engine HDR P010 are still open. (§5.A)
Build a correct async encode pipeline — submit on one thread, blocking-retrieve on another, deep surface pool, Windows completion events. Our past "pipelining didn't help" was a same-thread implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
Auto-gated REALTIME GPU priority. Our LocalSystem service can grant it (most apps can't). Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
Lock clocks / pin P-state for jitter (cheap; fixes the light-scene "200-not-240", not the collapse). (§5.E)
If source-bound: swapchain-hook capture (OBS-style) — the real escape from the compose ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
The honest endgame for demanding titles: encode on a second GPU / the iGPU. The only approach that removes contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)

1. Corrections to `host-latency-plan.md` (read before reusing it)

The old doc was right about the shape but several specifics are now wrong or stale:

"Windows already feeds NVENC YUV on the video engine, so it does the right thing." True for the DDA and WGC paths — was false for IDD-push, which became the install default and fed NVENC RGB, paying the SM-side CSC the old doc said Windows had eliminated. The default path regressed on the exact axis the doc celebrated. Since fixed (3514702, §5.A): IDD-push now converts BGRA→NV12 on the video engine (FP16→P010 shader for HDR) and feeds NVENC native YUV.
"PUNKTFUNK_ENCODE_DEPTH (default 4, ≤6) deep-pipelines." There is no such knob. It exists only in two stale comments (encode/windows/nvenc.rs:30, capture/windows/wgc.rs:57) and is never parsed. The real depth knob is PUNKTFUNK_IDD_DEPTH (default 2), used only by IDD-push on the native path; GameStream and the WGC helper are hardcoded depth-1.
"Async NVENC is measure-gated and probably stacks latency (Tier 3D)." The measurement that produced that verdict (capture/windows/wgc_helper.rs:131-135) pipelined on a single thread — it queued more frames but still blocked lock_bitstream inline, so it added queue latency with zero overlap. That is not the pattern the NVENC guide prescribes (submit/retrieve on separate threads). The correct async pipeline is untried, not disproven. (§5.B)
"More GPU priority is maxed and hits a hard preemption wall with no recourse." Half right. Priority is near-maxed (HIGH), but the "no recourse" intuition is wrong: a higher-priority GPU context does preempt a saturating graphics context at pixel granularity — that is precisely how NVIDIA VR Async-TimeWarp injects a frame into a busy game (VRWorks Context Priority). And we default to HIGH, leaving REALTIME unused even though our SYSTEM service can grant it. (§5.C)
"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss." The "half the frames" effect is specifically a DLSS-Frame-Generation flip-metering artifact (FG v310.x+ / RTX 50-series), not a general property of independent-flip games — normal fullscreen flip games are captured at full rate by DDA. So composed-flip is a narrow fix, not a general lever. (Apollo #676 — DDA captured a flip game at full 120 fps, Sunshine #3621 — version-pinned to FG 310.x)
"NvFBC is a possible low-overhead capture path." Dead on Windows — deprecated, frozen at Capture SDK 7.1 / Win10-1803 (NVIDIA deprecation bulletin). Linux-only, and there only via the consumer keylase patch.

What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the honest residual ceiling at 100% GPU. Those carry forward.

2. How the pipeline serializes today — the key insight

The capture→encode loop is a fixed-cadence pacer (gamestream/stream.rs, punktfunk1.rs): every 1/target_fps tick it grabs the freshest frame with a non-blocking try_latest(), and if nothing new arrived it re-encodes the held frame (a near-empty P-frame). So the outbound fps is pinned at target_fps no matter what the source did — which is why the raw fps counter lies under contention. The only honest signal is the uniq / diag_new counter; the code itself states the diagnostic: "low new_fps at high send rate ⇒ the source isn't producing frames, not an encode stall."
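The pacer behaviour described above can be sketched as a tick-driven simulation. This is an illustration only, not the code in gamestream/stream.rs: `Source`, `PacerStats` and `run_pacer` are hypothetical names, and wall-clock ticks are replaced by a loop index so the result is deterministic.

```rust
/// Hypothetical stand-in for the capture side. `arrivals[i]` is true when a
/// fresh frame landed before tick `i`.
struct Source {
    arrivals: Vec<bool>,
    next_id: u64,
}

impl Source {
    /// Non-blocking: returns a new frame id only if one arrived for this tick.
    fn try_latest(&mut self, tick: usize) -> Option<u64> {
        if self.arrivals.get(tick).copied().unwrap_or(false) {
            self.next_id += 1;
            Some(self.next_id)
        } else {
            None
        }
    }
}

#[derive(Debug, PartialEq)]
struct PacerStats {
    sent: u32, // what the raw fps counter sees
    uniq: u32, // what the uniq / diag_new counter sees
}

/// One iteration per 1/target_fps tick: take a fresh frame if there is one,
/// otherwise re-encode the held frame.
fn run_pacer(src: &mut Source, ticks: usize) -> PacerStats {
    let mut held: Option<u64> = None;
    let mut stats = PacerStats { sent: 0, uniq: 0 };
    for tick in 0..ticks {
        if let Some(id) = src.try_latest(tick) {
            held = Some(id);
            stats.uniq += 1;
        }
        // Fresh or held, something is encoded every tick once a frame exists.
        if held.is_some() {
            stats.sent += 1;
        }
    }
    stats
}
```

With a source that delivers on every fifth tick of 240, `sent` stays at 240 while `uniq` is 48, which is the gap between the two counters that the text warns about.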

The NVENC round-trip (the dominant path) is depth-1 synchronous: encode_picture is a non-blocking ASIC launch, but lock_bitstream blocks the same thread until that frame completes (no enableEncodeAsync, no completion event). The only thread split is encode-vs-network-send, never submit-vs-retrieve. So under contention the loop is strictly serial: capture (+convert) → submit → block in lock_bitstream → hand AU to the send thread. The arithmetic matches the symptom: an encode_ms of 13–17 alone caps the loop at 1000/17 ≈ 59 to 1000/13 ≈ 77 fps, even before the rest of the serial chain is counted, so the observed ~50 is the signature of one frame in flight per round-trip, not an ASIC throughput wall. (independent NVENC latency study: ~7 frames across all presets)
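The 1000/17 and 1000/13 figures follow from a one-frame-in-flight loop: the frame rate is the reciprocal of the summed serial stage times. A minimal sketch, with `depth1_fps_ceiling` as a hypothetical helper name:

```rust
/// Upper bound on frames per second when exactly one frame is in flight and
/// every stage in `stage_ms` runs back to back on the same thread.
fn depth1_fps_ceiling(stage_ms: &[f64]) -> f64 {
    let total_ms: f64 = stage_ms.iter().sum();
    1000.0 / total_ms
}
```

`depth1_fps_ceiling(&[17.0])` is about 58.8 and `depth1_fps_ceiling(&[13.0])` about 76.9. Any further serial stage passed in the slice only lowers the result.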

Where the per-frame GPU work lands, by path (the crux of contention — lower contended-engine load is better):

Path	Colour-convert	NVENC input	Contended-engine load/frame
IDD-push (install default)	NV12/P010 on the video engine (`3514702`; FP16→P010 via shader for HDR)	NV12/P010	low (SDR) / shader-CSC on SM (HDR)
WGC (fallback default)	`VideoProcessorBlt` → NV12 on the video engine	NV12/P010	low
DDA	`VideoProcessorBlt` → NV12 on the video engine	NV12/P010	medium (one 3D `CopyResource` to release the dup fast)
Linux NVENC	none → NVENC internal RGB→YUV on the SM (default)	RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` and `PUNKTFUNK_ZEROCOPY`)	high

Measured magnitude of "RGB vs NV12 to the encoder": RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%. NVENC's guide confirms the mechanism: "Encoding of RGB contents" is on the explicit list of features that internally use CUDA (NVENC prog-guide §Encoder Features using CUDA).

3. Diagnose first — cheap, decisive, do before any code

Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM cannot reproduce this — run on the RTX 4090 Windows box (and a real NVIDIA Linux box) with an actual saturating game.

Run with PUNKTFUNK_PERF=1 and read uniq vs fps under CS2 at GPU-100%:
- fps≈target but uniq→40–50 ⇒ (b) source ceiling — the compositor/IDD only produced 40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
- both fps and uniq→40–50, with encode_ms 13–17 ⇒ (a) feed contention — the round-trip is starving. Go to §5.A/B/C.
Classify the game's presentation with PresentMon — "Presented FPS" vs "Displayed FPS" and Presentation Mode (Hardware: Independent Flip vs Composed: Flip). Independent-Flip + uniq ≪ Presented ⇒ source/flip problem; Presented FPS itself collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing frames.
Log cap_us / enc_us / pace_us p50/p99 alongside to localise the stall. (Per-stage cap/submit/wait µs instrumentation landed under PUNKTFUNK_PERF in 3514702.)
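The uniq-vs-fps reading above can be written down as a small classifier. This is a sketch of the heuristic only: the 0.9 "close enough" ratio and the names `Bottleneck` and `classify` are assumptions, not values or identifiers from the repo.

```rust
#[derive(Debug, PartialEq)]
enum Bottleneck {
    /// fps and uniq both track the target.
    Healthy,
    /// (a) fps and uniq collapsed together: the round-trip is starving.
    FeedContention,
    /// (b) fps holds the target via held re-encodes but uniq collapsed.
    SourceCeiling,
    /// Counters disagree in a way the heuristic does not cover.
    Unclear,
}

/// Hypothetical threshold: a counter "tracks" another at >= 90% of it.
const NEAR: f64 = 0.90;

fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bottleneck {
    let fps_on_target = fps >= NEAR * target_fps;
    let uniq_on_target = uniq >= NEAR * target_fps;
    let uniq_tracks_fps = uniq >= NEAR * fps;
    if fps_on_target && uniq_on_target {
        Bottleneck::Healthy
    } else if fps_on_target {
        Bottleneck::SourceCeiling
    } else if uniq_tracks_fps {
        Bottleneck::FeedContention
    } else {
        Bottleneck::Unclear
    }
}
```

For example, `classify(240.0, 240.0, 45.0)` yields `SourceCeiling` and `classify(240.0, 50.0, 48.0)` yields `FeedContention`. The PresentMon and encode_ms checks still have to confirm the reading by hand.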

Necessary-but-not-sufficient caveat: if the game only rendered 50 frames because it's GPU-bound, nothing downstream creates the other 90. Source fixes address (b) only; the throughput of a saturated single GPU is split between game and host no matter what.

4. Current-state audit (what's shipped / regressed / missing)

Area	State	Where
Thread priority (Win)	HIGH class + MMCSS "Games" + 1 ms timer	`session_tuning.rs` ✅
Thread priority (Linux)	`setpriority` −10/−5 — native path only; GameStream Linux threads get none	`punktfunk1.rs:1977` ⚠
GPU sched priority	`D3DKMTSetProcessSchedulingPriorityClass` HIGH(4) default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper	`capture/windows/dxgi.rs:208-330` ⚠
GPU thread/latency	`SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)`	`dxgi.rs:193-200` ✅
CSC off-SM (Win SDR)	WGC/DDA video-engine NV12 ✅ — IDD-push (default) now video-engine NV12 (`3514702`) ✅	`wgc.rs:631` / `idd_push.rs`
CSC off-SM (Win HDR)	IDD-push HDR via FP16→P010 shader (on-SM); other paths on-SM unless `PUNKTFUNK_HDR_SHADER_P010`	`wgc.rs:603` ⚠
CSC off-SM (Linux)	RGB→SM by default; NV12 is double-opt-in (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`)	`encode/linux/mod.rs:104` ⚠
Encode pipeline	depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread	`nvenc.rs:801` ⚠
Split-encode	2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum	`nvenc.rs:424-447` ✅
Zero-copy register-in-place	yes; IDD-push out-ring is now the convert target (NV12/P010), no extra copy	`nvenc.rs:623` ✅
AMF tuning	`usage=ultralowlatency`, `preanalysis=false`	`ffmpeg_win.rs:215-219` ✅
QSV tuning	`async_depth=1`, `low_power=1` (VDEnc)	`ffmpeg_win.rs:226-227` ✅
Intra-refresh / infinite GOP	yes (killed the periodic-IDR freeze)	✅
encode|send split + paced send + sendmmsg + 32 MB sockbuf	yes	`stream.rs`, `transport/qos.rs` ✅
Clock / P-state pin	none (zero hits repo-wide)	✗
Async NVENC (2-thread)	none	✗
Frame-source escape (hook/NvFBC-Linux)	none	✗
Second-GPU / iGPU encode offload	none	✗
DSCP/QoS	implemented, `PUNKTFUNK_DSCP` opt-in (default off)	`transport/qos.rs` ⚠

5. The levers, ranked, with honest verdicts

A. Stop feeding NVENC RGB on the default path — DONE for Windows IDD-push (`3514702`)

The default Windows IDD-push path used to hand NVENC packed RGB, forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. 3514702 makes the out-ring the convert target: a D3D11 video-engine VideoConverter does BGRA→NV12 (SDR, BT.709 limited) in place, so NVENC gets native NV12 and skips its SM-side CSC; HDR uses the FP16→P010 shader (NVIDIA's VideoProcessor can't do RGB→P010). NV12 input forces bit_depth=8, so an HDR↔SDR toggle re-inits the session at the matching depth (NV12 can't feed a 10-bit session). This also removed the separate CopyResource (the convert writes the ring directly).
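For spot-checking the converter's output on a captured frame, a CPU reference of the BT.709 limited-range mapping is handy. The sketch below uses the standard BT.709 luma coefficients and 8-bit limited-range scaling; it is not taken from the repo, the function names are hypothetical, and a hardware converter may round differently by one code value.

```rust
/// 8-bit full-range RGB -> 8-bit BT.709 limited-range Y'CbCr.
fn rgb_to_ycbcr_bt709_limited(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    let (r, g, b) = (r as f64 / 255.0, g as f64 / 255.0, b as f64 / 255.0);
    // BT.709 luma weights.
    let y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    // Colour-difference signals, each normalised to [-0.5, 0.5].
    let cb = (b - y) / 1.8556;
    let cr = (r - y) / 1.5748;
    let quantise = |v: f64| v.round().clamp(0.0, 255.0) as u8;
    // Limited range: Y in 16..=235, Cb/Cr in 16..=240 centred on 128.
    (
        quantise(16.0 + 219.0 * y),
        quantise(128.0 + 224.0 * cb),
        quantise(128.0 + 224.0 * cr),
    )
}

/// Byte length of an NV12 frame: a full-resolution Y plane followed by one
/// half-resolution plane of interleaved Cb/Cr pairs.
fn nv12_len(width: usize, height: usize) -> usize {
    let chroma_w = (width + 1) / 2;
    let chroma_h = (height + 1) / 2;
    width * height + 2 * chroma_w * chroma_h
}
```

White maps to (235, 128, 128) and black to (16, 128, 128). A 1920×1080 NV12 frame is 3,110,400 bytes, half the size of the BGRA frame it replaces.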

Verdict: REAL, but honestly conditional — the convert has to land off the SM to fully pay off. VideoProcessorBlt is designed to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, but no NVIDIA doc explicitly confirms VideoProcessorBlt runs off-SM on GeForce — treat the "video engine" claim as well-founded-but-unverified and confirm on-box with nvidia-smi dmon (watch the enc/sm columns) before and after. Do not convert with a CUDA/3D shader and call it done — that just relocates the CSC to the same SM (this is why the HDR P010 shader path is still on-SM; Sunshine's RGB→NV12 CUDA kernel still contends).

Still open in §A:

Linux: make NV12 the default for the tiled zero-copy path (gated behind PUNKTFUNK_NV12 and PUNKTFUNK_ZEROCOPY today — encode/linux/mod.rs:104, linux/zerocopy/egl.rs:272), feeding NVENC NV_ENC_BUFFER_FORMAT_NV12. The GL detile already runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
Windows HDR: move the FP16→P010 convert onto the video engine where the VP supports it (today's shader keeps it on-SM), or flip PUNKTFUNK_HDR_SHADER_P010 on by default for the non-IDD paths.

B. A correct async encode pipeline (the untried encoder lever) — OPEN

The NVENC Programming Guide is explicit: "The main encoder thread should be used only to submit work… (non-blocking NvEncEncodePicture). Output buffer processing — waiting on the completion event in asynchronous mode, or calling NvEncLockBitstream in synchronous mode — should be done in the secondary thread." (NVENC prog-guide, threading model) We do the opposite — submit and blocking-retrieve on one thread. Queuing more pending entries (IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with no overlap, which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong implementation, not a disproof.

The fix: submit on the capture/encode thread; do lock_bitstream on a dedicated retrieve thread; hold a deep input+output surface pool (≈4–8); on Windows register a completionEvent per output buffer (enableEncodeAsync=1) — on Linux async events are unsupported, so use the same two-thread split with a blocking retrieve. (async is Windows/WDDM-only; FFmpeg models the same knob as delay/async_depth, libavcodec/nvenc.c).

This lets the WDDM scheduler find a backlog when it finally grants the encoder context a slice, and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do frame N+1's convert.
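The thread split can be sketched with standard-library channels. This is a shape illustration under stated assumptions, not an NVENC binding: `Pending`, `AccessUnit`, `blocking_retrieve` and `run_pipeline` are hypothetical, the encoder is a stub, and the bounded channel stands in for the surface pool.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical handle for a frame that was submitted but not yet retrieved.
#[derive(Debug)]
struct Pending {
    frame: u64,
}

/// Hypothetical encoded output.
#[derive(Debug, PartialEq)]
struct AccessUnit {
    frame: u64,
}

/// Stub for the blocking retrieve (lock_bitstream in the text). A real
/// implementation would block here until the ASIC finishes the frame.
fn blocking_retrieve(p: Pending) -> AccessUnit {
    AccessUnit { frame: p.frame }
}

fn run_pipeline(frames: u64, depth: usize) -> Vec<AccessUnit> {
    // Bounded queue: the submit side stalls only once `depth` frames are
    // already waiting for the retrieve thread.
    let (pending_tx, pending_rx) = mpsc::sync_channel::<Pending>(depth);
    let (au_tx, au_rx) = mpsc::channel::<AccessUnit>();

    // Retrieve thread: the only place that blocks on encoder completion.
    let retrieve = thread::spawn(move || {
        for pending in pending_rx {
            if au_tx.send(blocking_retrieve(pending)).is_err() {
                break;
            }
        }
    });

    // Submit side (the capture/encode thread): never waits on completion.
    for frame in 0..frames {
        pending_tx
            .send(Pending { frame })
            .expect("retrieve thread exited early");
    }
    drop(pending_tx); // closes the queue so the retrieve thread can finish
    retrieve.join().expect("retrieve thread panicked");

    // In a live pipeline the send thread would drain this concurrently.
    au_rx.into_iter().collect()
}
```

A single retrieve thread keeps output in submission order, so `run_pipeline(8, 4)` returns frames 0 through 7 in sequence.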

Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded. The honest bound (and why this is second to §A/§C): pipelining cannot manufacture GPU time. If the scheduler grants the encode context only X% under load, depth only guarantees work is ready for each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is priority, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by. Watch out: this forecloses sub-frame slice output (mutually exclusive with enableEncodeAsync), and HAGS can spike the submit call itself (100–200 ms NvEncEncodePicture stalls under HAGS).

C. Auto-gated REALTIME GPU scheduling priority — PARTIAL (knob exists, no auto-gate)

Raising the host process's WDDM GPU priority is the proven single-PC production lever — OBS and Sunshine both set D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME to stop being descheduled behind fullscreen games (OBS commit, Sunshine display_base.cpp). It works independently of HAGS (HAGS does not reassign cross-process priority — Microsoft: "Windows continues to control prioritization" DirectX devblog).

We ship only HIGH(4) by default with a static realtime opt-in (PUNKTFUNK_GPU_PRIORITY_CLASS, dxgi.rs:208-330) and no auto-gate. Two things to change:

We can actually grant REALTIME. It needs SeIncreaseBasePriorityPrivilege, which an unelevated app lacks (OBS logs the failure) — but our host runs as a LocalSystem service, which holds it. The lever is available to us specifically.
Gate it to dodge the freeze. REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a documented NVENC hang (Sunshine ships nvenc_realtime_hags to downgrade to HIGH for exactly this; Sunshine config, NVIDIA repro). Implement the old plan's "Tier 3B": probe HAGS via D3DKMTQueryAdapterInfo and VRAM headroom via IDXGIAdapter3::QueryVideoMemoryInfo (continuously); use REALTIME only when HAGS-off, or HAGS-on with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
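The gate itself is a pure decision once the two probes have been read. A minimal sketch, assuming the probes are done elsewhere: `GateInput`, `pick_priority` and the 15% headroom threshold are hypothetical, and the two VRAM fields are meant to mirror the Budget and CurrentUsage values that QueryVideoMemoryInfo reports.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the probes, taken continuously by the caller.
struct GateInput {
    hags_on: bool,
    vram_budget: u64, // bytes the OS currently allows the process
    vram_usage: u64,  // bytes currently in use
}

/// Hypothetical threshold for "comfortable" headroom; tune on-box.
const MIN_HEADROOM: f64 = 0.15;

fn pick_priority(input: &GateInput) -> GpuPriority {
    // HAGS off: the documented freeze combination cannot occur.
    if !input.hags_on {
        return GpuPriority::Realtime;
    }
    // No usable budget reading: stay on the safe class.
    if input.vram_budget == 0 {
        return GpuPriority::High;
    }
    let free = input.vram_budget.saturating_sub(input.vram_usage);
    let headroom = free as f64 / input.vram_budget as f64;
    if headroom >= MIN_HEADROOM {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Calling this on every probe tick gives the "downgrade the instant VRAM tightens" behaviour. A production version would likely add hysteresis so the class does not flap around the threshold.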

Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever. Priority is how the host takes GPU time from the game; it measurably costs the game fps (Doom Eternal 121→60 with Sunshine running). That's acceptable for a streaming host (the remote view is the product), but say so plainly and make the class operator-configurable (we already expose PUNKTFUNK_GPU_PRIORITY_CLASS).

D. Multi-vendor encoder hygiene (AMF/QSV) — stable / mostly done, one caveat

Our *_amf/*_qsv libavcodec config already follows the research's advice: AMF usage=ultralowlatency + preanalysis=false (ffmpeg_win.rs:215), QSV async_depth=1 + low_power=1 VDEnc path (:226). Keep them. Two notes:

AMF/QSV suffer contention worse than NVENC. OBS: "For Intel and AMD GPUs, the hardware encoder requires significant resources of the same type a 3D app/game requires… different from NVIDIA's NVENC, which has dedicated encoding circuits" (OBS KB). So on an AMD/Intel host the collapse is expected to be harder — and §G (iGPU offload) is even more attractive there.
The AMF busy-poll floor (a fixed-sleep QueryOutput poll imposes ~15 ms via timer granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's QUERY_TIMEOUT patch); since we go through libavcodec we inherit it — just confirm the pinned FFmpeg build includes it. (ffmpeg-devel)

Verdict: REAL but largely already captured. No big win left here except via §G.

E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix — OPEN

NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every frame — most visible in the light scene (the "200-not-240"). Pin it:

Windows: NvAPI per-application DRS PREFERRED_PSTATE = PREFER_MAX scoped to our exe (this is exactly Sunshine's nvenc_latency_over_power, Sunshine nvprefs). Crash-safe undo is mandatory — persist an undo record to %ProgramData%\punktfunk\ before applying, revert a stale profile on next start, so a crash never leaves the user's control panel modified.
Linux: nvidia-smi -lgc/NVML nvmlDeviceSetGpuLockedClocks (needs root/CAP_SYS_ADMIN; query nvmlDeviceGetMaxClockInfo, lock to that, restore on teardown and SIGTERM). Plus the newly-added CudaNoStablePerfLimit driver profile — new in R580/595, so usable on the 595 box — to defeat the CUDA "Force P2" memory-clock clamp.
Gate behind PUNKTFUNK_PIN_CLOCKS; default off on battery / Steam Deck (pinning is harmful there).
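The crash-safe undo requirement reduces to "persist the old value before touching anything, and treat a leftover record as a crash". A sketch of that ordering with plain files follows. The function names and the record format are hypothetical, and the actual NvAPI/NVML calls are passed in as closures rather than shown.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write the previous value to disk (flushed) before applying the change, so
/// a crash at any later point still leaves a record to revert from.
fn apply_with_undo(
    record: &Path,
    previous_value: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    let mut file = fs::File::create(record)?;
    file.write_all(previous_value.as_bytes())?;
    file.sync_all()?; // the record must be durable before the setting changes
    apply()
}

/// Run at startup and at clean teardown. A record that still exists means the
/// setting is applied; revert it, then delete the record.
/// Returns true if something was reverted.
fn revert_stale(
    record: &Path,
    revert: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    match fs::read_to_string(record) {
        Ok(previous_value) => {
            revert(&previous_value)?;
            fs::remove_file(record)?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

The record is removed only after the revert succeeds, so a failed revert is retried on the next start.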

Verdict: REAL for latency stability, marginal for the saturated collapse (at 100% util the game already pins P0). Cheap, low risk, do it for the light-scene win.

F. Escape the frame-source ceiling — only if §3 says (b) — OPEN

If uniq is the wall, no encoder/priority work helps — you need a better frame source.

Swapchain-hook capture (the real fix). Inject a hook on IDXGISwapChain::Present/Present1, vkQueuePresentKHR, wglSwapBuffers and copy the backbuffer to a shared texture before the compositor — OBS Game Capture's mechanism. Sees every presented frame, no compose/refresh gating. (OBS dxgi-capture) Tradeoffs are serious: anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an opt-in "game capture" mode, not the default.
NvFBC: not an option on Windows (dead, §1). On Linux it's viable via the consumer keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
Compose-flip (narrow): the topmost 1×1 layered-window trick (we already have composed_flip.rs) forces DWM composition and fixes specifically the DLSS-Frame-Gen half-rate case. Adds host-display latency; don't enable globally.
WGC "deliver 2× rate": Apollo sets MinUpdateInterval = 1e7/(fps*2) so the pacer always has a fresh frame to pick (Apollo); we set it to 1× refresh (wgc.rs:310). Cheap tweak to try on the WGC path.
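The 1e7 in that formula is the number of 100 ns ticks in one second, the unit of a Windows TimeSpan. A small sketch of the calculation, with `min_update_interval_ticks` as a hypothetical helper name:

```rust
/// Interval in 100 ns ticks for a capture source asked to deliver
/// `rate_multiplier` frames per stream frame. Zero inputs are clamped to 1
/// to avoid a division by zero.
fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> i64 {
    const TICKS_PER_SECOND: i64 = 10_000_000;
    let rate = (fps.max(1) as i64) * (rate_multiplier.max(1) as i64);
    TICKS_PER_SECOND / rate
}
```

At 120 fps the 1× setting is 83,333 ticks (about 8.3 ms) and the 2× setting is 41,666 ticks (about 4.2 ms).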

Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow. None invents frames the game didn't render.

G. The honest endgame — encode on a second GPU / the iGPU — OPEN

For demanding titles that saturate the GPU even when capped, the only thing that removes contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a different GPU — a second dGPU or, more realistically, the iGPU (Intel QuickSync / AMD VCN), which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once, encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder" play, and the OBS "second GPU is harmful" verdict does not apply — that verdict is about moving only the NVENC block; moving capture + CSC + copies off the gaming GPU genuinely frees it. (OBS forum)

We're unusually well-placed for this: we already have working AMF and QSV backends (encode/windows/ffmpeg_win.rs) and the Linux VAAPI backend. The missing piece is a capture/topology mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but it's the only path that lets a demanding game and a clean stream coexist on one machine.

Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses." Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session; the consumer analogue is the iGPU.

6. Recommended order of attack

1. §3 Diagnose on the RTX box + a real game. Settles (a) vs (b). (half a day, decisive)
2. §5.A NV12/P010 on the default paths: IDD-push DONE (3514702); remaining: Linux NV12 default-on, Windows HDR P010 off-SM. Confirm off-SM with nvidia-smi dmon.
3. §5.C Auto-gated REALTIME priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. §5.E Clock pin both OSes (crash-safe undo). Cheap light-scene win.
5. §5.B Correct two-thread async pipeline. Structural; recovers the depth-1 serialization.
6. §3-gated §5.F source escape (swapchain hook), only if uniq is the wall.
7. §5.G iGPU encode offload: the strategic answer for demanding titles; larger build.

After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the honest ceiling: on one saturated GPU the game and the host split a fixed pie — coarse WDDM graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only rendered 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps), or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.

7. Placebos & dead ends (so we don't re-propose them)

Candidate	Verdict	Why
NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames as a "non-capping yield"	✗ placebo	Shrinks the game's render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. (Battle(non)sense LDAT data)
HAGS on, as a contention fix	✗ neutral→harmful	Doesn't reassign cross-process priority (Microsoft); OBS reports it causes NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime queue. (OBS KB)
Split-frame encode (2/3/4-way) to fix contention	✗ (pixel-rate only)	Parallelizes the ASIC, not the contended copy/CSC; measured zero latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit disable sentinel, not a bug. (SDK header)
Move the encoded-bitstream readback to a copy engine	✗ placebo	Output is KB-scale; the cost of `lock_bitstream` is the completion wait, not copy bandwidth. (The input full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.)
CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`	✗ placebo cross-process	Intra-context only; the game is a separate context. Stream priority "will not preempt already executing work". (CUDA docs)
VK/EGL global-priority REALTIME on Linux NVIDIA	✗	Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue.
Windows "High performance" GPU preference	✗ single-GPU placebo	Only selects an adapter; real only to split work across adapters (→ that's §G).
MIG / MPS / vGPU	✗ N/A	MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU.
NvFBC on Windows	✗ dead	Deprecated, frozen at Capture SDK 7.1 / Win10-1803.
Frame Generation / Smooth Motion to "make more frames"	✗ red herring	We stream rendered frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention.

8. Open items / what's left

Diagnostics + still-unbuilt levers (verbatim, highest leverage first):

§3 automation — instrument the uniq-vs-fps heuristic + a PresentMon probe so (a)/(b) is decided automatically, not by hand. (Per-stage cap/submit/wait µs already land under PUNKTFUNK_PERF from 3514702; the uniq/PresentMon classifier is not yet automated.)
§5.A residual — Linux NV12 default-on for the tiled zero-copy path (drop the PUNKTFUNK_NV12+PUNKTFUNK_ZEROCOPY double-opt-in); move the Windows HDR FP16→P010 convert off the SM (today it's a shader). Windows IDD-push SDR/HDR NV12/P010 is DONE (3514702).
§5.B — build a correct async NVENC pipeline: submit on one thread, blocking-lock_bitstream on a dedicated retrieve thread, deep input+output surface pool (≈4–8), Windows per-buffer completionEvent (enableEncodeAsync=1), same two-thread split on Linux.
§5.C — auto-gate REALTIME GPU priority: probe HAGS (D3DKMTQueryAdapterInfo) + VRAM headroom (IDXGIAdapter3::QueryVideoMemoryInfo) continuously; REALTIME only when HAGS-off or HAGS-on with comfortable headroom, downgrade to HIGH the instant VRAM tightens. (Static realtime opt-in exists in dxgi.rs; no auto-gate.)
§5.E — clock / P-state pinning: Windows NvAPI DRS PREFERRED_PSTATE=PREFER_MAX (crash-safe undo to %ProgramData%\punktfunk\); Linux nvidia-smi -lgc / nvmlDeviceSetGpuLockedClocks (+ CudaNoStablePerfLimit on R580/595). Gate PUNKTFUNK_PIN_CLOCKS, default off on battery/Deck.
§5.F — frame-source escape (only if §3 says (b)): swapchain-hook capture (OBS-style, anti-cheat tradeoffs); NvFBC on Linux (keylase patch); compose-flip for the DLSS-FG half-rate case; WGC MinUpdateInterval = 1e7/(fps*2) 2×-rate tweak.
§5.G — iGPU / second-GPU encode offload: pin capture to the gaming adapter, encoder to the iGPU adapter, one cross-adapter shared-texture copy. Reuses the AMF/QSV/VAAPI backends.

Open evidence gaps (verify on-box)

Whether ID3D11VideoProcessor::VideoProcessorBlt (BGRA→NV12) runs off the SM on GeForce is not confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. Verify on-box with nvidia-smi dmon (sm% vs enc%) on the IDD-push/WGC path before assuming the win landed.
The exact share of the 13–17 ms encode_ms that is convert-on-SM vs scheduling-wait is unmeasured. §3 + an A/B of IDD-push-RGB (pre-3514702) vs IDD-push-NV12 on the same scene settles it and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD whitepaper; treat the direction as solid, the magnitude as TBD.
