# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)

> The headache, stated precisely:
> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
> game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light
> titles like CS2). **Capping is not an acceptable fix** — demanding titles exhaust the GPU even
> when capped.

This is the second, deeper pass on the problem. The first pass is
[`host-latency-plan.md`](host-latency-plan.md) (a 25-agent investigation, 2026-06-18). **This doc
supersedes several of that doc's conclusions** — the codebase moved a lot in the week since
(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.

Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
carries a current `file:line`.

---

## 0. TL;DR — the corrected mental model and the action list

**The governing fact:** NVENC is a **dedicated ASIC on its own GPU runlist**, physically separate
from the SM/CUDA/graphics cores a 3D game saturates. The game does **not** steal the encode block.
It steals everything that *feeds* the block — capture-acquire, the **RGB→YUV colour-convert**, the
copy into the encoder's input surface, the readback — **and the GPU-scheduler time** to run that
feed work, which is queued behind the game's graphics context.
([NVENC app-note](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html),
[engine-table proof, UNC RTAS'24](https://www.cs.unc.edu/~jbakita/rtas24.pdf))

**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart
before writing code:**

| Bottleneck | Symptom | Fix family |
|---|---|---|
| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |

**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work
either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works
only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping
equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which
costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game
politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7).

**Action list, highest leverage first** (detail in §5–§6):

1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
   mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
   BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
   the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
   deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
   implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
   tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
4. **Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't).
   Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
5. **Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the
   collapse). (§5.E)
6. **If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose
   ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
7. **The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach
   that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)

---

## 1. Corrections to `host-latency-plan.md` (read before reusing it)

The old doc was right about the shape but several specifics are now wrong or stale:

- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
  DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
  **RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
  *regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
  only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
  parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
  native path; GameStream and the WGC helper are hardcoded depth-1.
- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that
  produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread** —
  it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with
  **zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on
  *separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B)
- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right.
  Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU
  context does preempt a saturating graphics context at pixel granularity** — that is precisely how
  NVIDIA VR Async-TimeWarp injects a frame into a busy game
  ([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we
  default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C)
- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The
  "half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact**
  (FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal
  fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a
  general lever. ([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676),
  [Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621))
- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at
  Capture SDK 7.1 / Win10-1803
  ([NVIDIA deprecation bulletin](https://developer.download.nvidia.com/designworks/capture-sdk/docs/NVFBC_Win10_Deprecation_Tech_Bulletin.pdf)).
  It survives only on Linux, and on consumer GeForce cards only via the `keylase` patch.

What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling
is real and upstream of encode; split-encode is a pixel-rate lever, not a contention lever; and there
is an honest residual ceiling at 100% GPU. Those carry forward.

---

## 2. How the pipeline actually serializes today (verified against current code)

The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** —
which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
stall"* (`punktfunk1.rs:2466-2468`).

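A toy model makes the pacer behaviour concrete. This is a sketch, not the code in `stream.rs`: `Source`, `Counters`, and `run_pacer` are hypothetical names, a frame is reduced to a sequence number, and time is reduced to discrete pacer ticks.

```rust
/// Toy frame source: `arrivals[t]` is true when a new frame landed before tick `t`.
struct Source {
    arrivals: Vec<bool>,
    next_seq: u64,
}

impl Source {
    /// Non-blocking: returns a fresh frame's sequence number, or None if nothing new arrived.
    fn try_latest(&mut self, tick: usize) -> Option<u64> {
        if self.arrivals.get(tick).copied().unwrap_or(false) {
            self.next_seq += 1;
            Some(self.next_seq)
        } else {
            None
        }
    }
}

/// Counters for one measurement window.
#[derive(Debug, PartialEq)]
struct Counters {
    sent: u32, // what a raw fps counter reports
    uniq: u32, // encodes that actually carried new content
}

/// Runs one encode per tick, re-encoding the held frame when the source is dry.
fn run_pacer(source: &mut Source, ticks: usize) -> Counters {
    let mut held: Option<u64> = None;
    let mut counters = Counters { sent: 0, uniq: 0 };
    for tick in 0..ticks {
        if let Some(seq) = source.try_latest(tick) {
            held = Some(seq);
            counters.uniq += 1;
        }
        // Nothing can be sent until the first frame has arrived.
        if held.is_some() {
            counters.sent += 1;
        }
    }
    counters
}
```

With a source that delivers a frame on every third tick, 240 ticks yield `sent = 240` and `uniq = 80`: the send rate looks healthy while two thirds of the encodes are repeats.
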
The encode round-trip (NVENC, the dominant path):

- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
  pushes onto a `pending` FIFO.
- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
  completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.

So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
hand AU to the send thread`. The arithmetic matches the symptom: `1000/17 ≈ 59` and `1000/13 ≈ 77`
fps are the ceilings implied by `encode_ms` alone, and the capture and convert time that precedes
each submit pulls a serial loop below them, toward the observed ~50. That is the signature of
**one frame in flight per round-trip**, not an ASIC throughput wall.
([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))

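The depth-1 arithmetic can be written down as a two-line model. The function names are hypothetical, and the 3 ms capture/convert figure used in the note below is an illustrative assumption, not a measurement from this investigation.

```rust
/// Frames per second when every stage of a frame must finish before the next frame starts.
fn serial_fps(stage_ms: &[f64]) -> f64 {
    1000.0 / stage_ms.iter().sum::<f64>()
}

/// Idealised upper bound when stages overlap perfectly: only the slowest stage limits the rate.
/// Assumes the stages run on independent engines and never wait on each other.
fn overlapped_fps(stage_ms: &[f64]) -> f64 {
    1000.0 / stage_ms.iter().cloned().fold(0.0_f64, f64::max)
}
```

`serial_fps(&[17.0])` is about 58.8 and `serial_fps(&[13.0])` about 76.9, the two figures quoted above. With an assumed 3 ms of capture and convert in front of a 17 ms encode, `serial_fps(&[3.0, 17.0])` is 50.0, while `overlapped_fps(&[3.0, 17.0])` stays at about 58.8.
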
Where the per-frame GPU work lands, by path (this is the crux of contention):

| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |

Measured magnitude of "RGB vs NV12 to the encoder":
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of
features that **internally use CUDA**
([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)).

---

## 3. Diagnose first — cheap, decisive, do before any code

Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM
cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an
actual saturating game.

1. **Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%:
   - `fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling** — the compositor/IDD only produced
     40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
   - both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention** — the round-trip
     is starving. Go to §5.A/B/C.
2. **Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)** —
   "Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs
   Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
   itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
   frames.
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.

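The decision rule in step 1 can be sketched as a small classifier. Everything here is an assumption made for illustration: the type and function names are hypothetical, and `tolerance` (the fraction of `target_fps` a counter may fall short before it counts as collapsed) is not a value taken from the repo.

```rust
/// Outcome of reading `uniq` against `fps` for one measurement window.
#[derive(Debug, PartialEq)]
enum Bottleneck {
    /// (a) both counters collapsed: the encode round-trip is starving.
    FeedContention,
    /// (b) send rate holds but unique frames collapsed: the source is the wall.
    SourceCeiling,
    /// Unique frames track the target: nothing to fix.
    Healthy,
}

/// One perf sample, all values in frames per second.
struct PerfSample {
    target_fps: f64,
    fps: f64,
    uniq: f64,
}

/// Crude first-pass classification of a perf sample.
fn classify(sample: &PerfSample, tolerance: f64) -> Bottleneck {
    let floor = sample.target_fps * (1.0 - tolerance);
    if sample.uniq >= floor {
        Bottleneck::Healthy
    } else if sample.fps >= floor {
        // Held re-encodes keep `fps` up while `uniq` shows what the source delivered.
        Bottleneck::SourceCeiling
    } else {
        Bottleneck::FeedContention
    }
}
```

A sample of `target_fps = 240, fps = 240, uniq = 45` classifies as `SourceCeiling`, and `fps = 50, uniq = 48` as `FeedContention`. The result is only a pointer to the right subsection of §5; the PresentMon check in step 2 is still needed to tell a source problem from a game that is itself GPU-bound.
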
> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
> throughput of a saturated single GPU is split between game and host no matter what.

---

## 4. Current-state audit (what's shipped / regressed / missing)

| Area | State | Where |
|---|---|---|
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ |
| Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ |
| **Clock / P-state pin** | **none** (zero hits repo-wide) | ✗ |
| **Async NVENC (2-thread)** | **none** | ✗ |
| **Frame-source escape (hook/NvFBC-Linux)** | **none** | ✗ |
| **Second-GPU / iGPU encode offload** | **none** | ✗ |
| DSCP/QoS | implemented, `PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ |

---

## 5. The levers, ranked, with honest verdicts

### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**

The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
feeding NV12/P010. **Make IDD-push and Linux do the same.**

- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
  out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
  `..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
  out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
  (`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
  disagree on the format.
- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
  `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
  `linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
  runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
  P010 convert where the VP supports it).

**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).

### B. A *correct* async encode pipeline (the untried encoder lever)

The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
event in asynchronous mode, or calling `NvEncLockBitstream` in synchronous mode — should be done in
the **secondary thread**."*
([NVENC prog-guide, threading model](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html))
We do the opposite — submit and blocking-retrieve on **one** thread. Queuing more `pending` entries
(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with **no overlap**,
which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
implementation, not a disproof.

The fix: **submit on the capture/encode thread; do `lock_bitstream` on a dedicated retrieve thread;
hold a deep input+output surface pool (≈4–8); on Windows register a `completionEvent` per output
buffer (`enableEncodeAsync=1`) — on Linux async events are unsupported, so use the same two-thread
split with a blocking retrieve.**
([async is Windows/WDDM-only](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html);
FFmpeg models the same knob as `delay`/`async_depth`,
[libavcodec/nvenc.c](https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c)).

This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice,
and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do
frame N+1's convert.

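The thread split described above can be sketched with the standard library alone. This is a simulation under stated assumptions, not NVENC code: `Pending`, `blocking_retrieve`, and `run_pipeline` are hypothetical names, the encode is stood in for by a sleep, and a bounded channel plays the role of the surface pool.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

/// Stand-in for a submitted encode; a real one would carry the output buffer handle.
struct Pending {
    frame_no: u64,
    encode_time: Duration,
}

/// Stand-in for the blocking retrieve (`lock_bitstream` in the text):
/// waits until the frame is done, then returns its bytes.
fn blocking_retrieve(pending: &Pending) -> Vec<u8> {
    thread::sleep(pending.encode_time);
    pending.frame_no.to_le_bytes().to_vec()
}

/// Submits `frames` frames from the calling thread while a second thread retrieves them.
/// The channel holds up to `depth` submitted frames awaiting retrieval.
/// Returns `(frame number, access-unit size)` in completion order.
fn run_pipeline(frames: u64, depth: usize, encode_time: Duration) -> Vec<(u64, usize)> {
    let (submit_tx, submit_rx) = sync_channel::<Pending>(depth);

    // Retrieve thread: the only place that blocks on encoder output.
    let retriever = thread::spawn(move || {
        let mut done = Vec::new();
        for pending in submit_rx {
            let access_unit = blocking_retrieve(&pending);
            done.push((pending.frame_no, access_unit.len()));
        }
        done
    });

    // Submit side: blocks only when the bounded queue is full (pool exhausted).
    for frame_no in 0..frames {
        submit_tx
            .send(Pending { frame_no, encode_time })
            .expect("retrieve thread exited early");
    }
    drop(submit_tx); // closing the channel lets the retrieve loop finish

    retriever.join().expect("retrieve thread panicked")
}
```

The bounded queue is the design point: the submit side can run ahead by at most `depth` frames, which caps the added latency, and frames come out in submission order because a single retrieve thread drains a FIFO.
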
**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.**
The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`),
and HAGS can spike the *submit* call itself
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).

### C. Auto-gated REALTIME GPU scheduling priority

Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
fullscreen games
([OBS commit](https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0),
[Sunshine `display_base.cpp`](https://raw.githubusercontent.com/LizardByte/Sunshine/master/src/platform/windows/display_base.cpp)).
It works **independently of HAGS** (HAGS does *not* reassign cross-process priority — Microsoft:
*"Windows continues to control prioritization"*
[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).

We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
to change:

- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
  app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
  The lever is available to us specifically.
- **Gate it to dodge the freeze.** REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a **documented
  NVENC hang** (Sunshine ships `nvenc_realtime_hags` to downgrade to HIGH for exactly this;
  [Sunshine config](https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2configuration.html),
  [NVIDIA repro](https://forums.developer.nvidia.com/t/bug-report-nvenc-encoder-hangs-on-windows-when-using-d3d11-in-real-time-mode/357466)).
  Implement the old plan's "Tier 3B": probe HAGS via `D3DKMTQueryAdapterInfo` and VRAM headroom via
  `IDXGIAdapter3::QueryVideoMemoryInfo` (continuously); use REALTIME only when HAGS-off, or HAGS-on
  with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.

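The gate itself is a pure decision over two probed inputs, which makes it easy to test away from the GPU. A minimal sketch, assuming the HAGS flag and the VRAM budget/usage pair have already been read by the probes named above; `GpuState`, `choose_priority`, and the `min_headroom` threshold are hypothetical, and what counts as "comfortable" headroom still has to be found on the real box.

```rust
/// The two scheduling classes the gate chooses between.
#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the probed inputs (how they are probed is outside this sketch).
struct GpuState {
    hags_enabled: bool,
    vram_budget_bytes: u64,
    vram_usage_bytes: u64,
}

/// REALTIME when HAGS is off, or when HAGS is on and the free fraction of the
/// VRAM budget is at least `min_headroom`; otherwise HIGH.
fn choose_priority(state: &GpuState, min_headroom: f64) -> GpuPriority {
    if !state.hags_enabled {
        return GpuPriority::Realtime;
    }
    if state.vram_budget_bytes == 0 {
        // No usable budget reading: stay on the safe side.
        return GpuPriority::High;
    }
    let free = state.vram_budget_bytes.saturating_sub(state.vram_usage_bytes);
    let headroom = free as f64 / state.vram_budget_bytes as f64;
    if headroom >= min_headroom {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Calling this on every VRAM poll and applying the class only when the result changes gives the "downgrade the instant VRAM tightens" behaviour. A real implementation would probably want some hysteresis so the class does not flap around the threshold.
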
**Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever.** Priority is how
the host *takes* GPU time from the game; it measurably **costs the game fps**
([Doom Eternal 121→60 with Sunshine running](https://github.com/LizardByte/Sunshine/issues/3703)).
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).

### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat

Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
`low_power=1` VDEnc path (`:226`). Keep them. Two notes:

- **AMF/QSV suffer contention *worse* than NVENC.** OBS: *"For Intel and AMD GPUs, the hardware
  encoder requires significant resources of the same type a 3D app/game requires… different from
  NVIDIA's NVENC, which has dedicated encoding circuits"*
  ([OBS KB](https://obsproject.com/forum/threads/how-to-debug-encoding-overloaded.168625/)). So on an
  AMD/Intel host the collapse is *expected to be harder* — and §G (iGPU offload) is even more
  attractive there.
- **The AMF busy-poll floor** (a fixed-sleep `QueryOutput` poll imposes ~15 ms via timer
  granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's `QUERY_TIMEOUT` patch); since we
  go through libavcodec we inherit it — just **confirm the pinned FFmpeg build includes it**.
  ([ffmpeg-devel](https://www.mail-archive.com/ffmpeg-devel@ffmpeg.org/msg170489.html))

**Verdict: REAL but largely already captured.** No big win left here except via §G.

### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix

NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
frame — most visible in the *light* scene (the "200-not-240"). Pin it:

- **Windows:** NvAPI per-application DRS `PREFERRED_PSTATE = PREFER_MAX` scoped to our exe (this is
  exactly Sunshine's `nvenc_latency_over_power`,
  [Sunshine nvprefs](https://github.com/LizardByte/Sunshine/blob/master/src/platform/windows/nvprefs/driver_settings.cpp)).
  **Crash-safe undo is mandatory** — persist an undo record to `%ProgramData%\punktfunk\` *before*
  applying, revert a stale profile on next start, so a crash never leaves the user's control panel
  modified.
- **Linux:** `nvidia-smi -lgc`/NVML `nvmlDeviceSetGpuLockedClocks` (needs root/`CAP_SYS_ADMIN`; query
  `nvmlDeviceGetMaxClockInfo`, lock to that, restore on teardown *and* SIGTERM). Plus the newly-added
  `CudaNoStablePerfLimit` driver profile — *new in R580/595, so usable on the 595 box* — to defeat
  the CUDA "Force P2" memory-clock clamp.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default off on battery / Steam Deck** (pinning is harmful
  there).

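The "persist before applying, revert on next start" ordering from the Windows bullet can be sketched with plain file operations. This is an assumption-laden outline: the file names, the string-valued record, and the `apply`/`restore` callbacks are all hypothetical stand-ins for whatever the driver-settings code actually stores and does.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

const UNDO_FILE: &str = "pstate-undo.txt"; // hypothetical record name
const UNDO_TMP: &str = "pstate-undo.tmp"; // hypothetical scratch name

/// Persists the previous setting BEFORE calling `apply`, so a crash between the
/// two steps still leaves a record to revert from.
fn apply_pinned(
    dir: &Path,
    previous: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let tmp = dir.join(UNDO_TMP);
    let mut file = fs::File::create(&tmp)?;
    file.write_all(previous.as_bytes())?;
    file.sync_all()?; // flush to disk before the record becomes visible
    drop(file);
    fs::rename(&tmp, dir.join(UNDO_FILE))?; // rename so the record is never half-written
    apply()
}

/// Run on clean teardown and at every start: if a record exists, restore the
/// saved setting and delete the record. Returns whether a record was found.
fn revert_if_stale(
    dir: &Path,
    restore: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    let record = dir.join(UNDO_FILE);
    match fs::read_to_string(&record) {
        Ok(previous) => {
            restore(&previous)?;
            fs::remove_file(&record)?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

The record is deleted only after `restore` succeeds, so a failed revert is retried on the next start instead of being forgotten.
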
**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
already pins P0). Cheap, low risk, do it for the light-scene win.

### F. Escape the frame-source ceiling — only if §3 says (b)

If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.

- **Swapchain-hook capture (the real fix).** Inject a hook on `IDXGISwapChain::Present`/`Present1`,
  `vkQueuePresentKHR`, `wglSwapBuffers` and copy the backbuffer to a shared texture *before* the
  compositor — OBS Game Capture's mechanism. Sees **every presented frame**, no compose/refresh
  gating.
  ([OBS dxgi-capture](https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/dxgi-capture.cpp))
  **Tradeoffs are serious:** anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs
  whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an
  opt-in "game capture" mode, not the default.
- **NvFBC:** **not an option on Windows** (dead, §1). On **Linux** it's viable via the consumer
  keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
- **Compose-flip (narrow):** the topmost 1×1 layered-window trick (we already have
  `composed_flip.rs`) forces DWM composition and fixes specifically the **DLSS-Frame-Gen** half-rate
  case. Adds host-display latency; don't enable globally.
- **WGC "deliver 2× rate":** Apollo sets `MinUpdateInterval = 1e7/(fps*2)` so the pacer always has a
  fresh frame to pick ([Apollo](https://github.com/ClassicOldSong/Apollo/pull/785)); we set it to 1×
  refresh (`wgc.rs:310`). Cheap tweak to try on the WGC path.

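The `1e7/(fps*2)` expression in the WGC bullet is an interval in 100 ns ticks, the unit of a Windows `TimeSpan`. A small helper makes the unit and the rounding explicit; the function name and the `rate_multiplier` parameter are hypothetical.

```rust
/// 100 ns ticks per second, the unit of a Windows `TimeSpan`.
const TICKS_PER_SECOND: u64 = 10_000_000;

/// Interval in ticks for delivering frames at `fps * rate_multiplier`.
/// Returns None when the requested rate is zero or the product overflows.
fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> Option<u64> {
    let rate = u64::from(fps).checked_mul(u64::from(rate_multiplier))?;
    if rate == 0 {
        return None;
    }
    // Integer division rounds the interval down, so delivery is never slower than asked.
    Some(TICKS_PER_SECOND / rate)
}
```

At 120 fps the 2× setting gives 41,666 ticks (about 4.17 ms) against 83,333 ticks (about 8.33 ms) for the 1× setting.
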
**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
frames the game didn't render.

### G. The honest endgame — encode on a second GPU / the iGPU

For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
**different** GPU — a second dGPU or, more realistically, the **iGPU** (Intel QuickSync / AMD VCN),
which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once,
encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder"
play, and the OBS "second GPU is harmful" verdict does **not** apply — that verdict is about moving
*only the NVENC block*; moving capture + CSC + copies off the gaming GPU genuinely frees it.
([OBS forum](https://obsproject.com/forum/threads/can-you-use-a-2nd-gpu-to-eliminate-encoder-overload.149644/))

We're unusually well-placed for this: we already have working AMF and QSV backends
(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.

**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."**
Datacenter stacks (GeForce NOW, Stadia) "solve" this with one dedicated GPU + encoder per session;
the consumer analogue is the iGPU.

---

## 6. Recommended order of attack
|
||
|
||
1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
   Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
6. **§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall.
7. **§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build.

After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the
honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM
graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only
*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie
are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps),
or a second slice of silicon (§5.G). Don't chase the rest with encoder micro-optimisation.

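The ceiling above is plain arithmetic, sketched here under the assumption that "unique frames" means frames the game actually rendered: capture can never deliver more new frames per second than the game produced, nor more than the stream's target rate. The function names are hypothetical. With the 50-rendered / 140-target example from the text, the ceiling is 50 unique fps and roughly 64% of streamed frames have to be repeats, whatever the encoder does.

```rust
/// Upper bound on unique frames per second the stream can carry:
/// the smaller of what the game rendered and what the stream targets.
fn unique_fps_ceiling(rendered_fps: f64, target_fps: f64) -> f64 {
    rendered_fps.min(target_fps)
}

/// Fraction of streamed frames that must be repeats when streaming at `target_fps`.
/// Returns 0.0 for a non-positive target to avoid dividing by zero.
fn repeat_fraction(rendered_fps: f64, target_fps: f64) -> f64 {
    if target_fps <= 0.0 {
        return 0.0;
    }
    1.0 - unique_fps_ceiling(rendered_fps, target_fps) / target_fps
}
```
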
---

## 7. Placebos & dead ends (so we don't re-propose them)

| Candidate | Verdict | Why |
|---|---|---|
| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) |
| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) |
| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) |
| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". ([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) |
| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |

---

## 8. Open evidence gaps (flagged honestly)

- Whether `ID3D11VideoContext::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
  confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
  `nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
  unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells us
  whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
  whitepaper; treat the *direction* as solid, the magnitude as TBD.
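For the on-box `nvidia-smi dmon` check, a small parser makes the sm% vs enc% comparison repeatable across the A/B runs. This is a sketch under stated assumptions: it expects the captured text of something like `nvidia-smi dmon -s u`, where a `#`-prefixed header line names the columns (including `sm` and `enc`) and each data row is whitespace-separated. The column set varies by driver version, so columns are located by header name rather than fixed position, and rows with unparseable values (such as `-`) are skipped. `UtilSample`, `parse_dmon` and `mean_util` are hypothetical helpers, not existing code. If the convert really runs off the SM, mean sm% on the NV12 path should drop relative to the RGB path while enc% stays comparable.

```rust
/// One row of utilisation data: SM and encoder busy percentages.
#[derive(Debug, PartialEq)]
struct UtilSample {
    sm: u32,
    enc: u32,
}

/// Parse captured `nvidia-smi dmon` text into (sm%, enc%) samples.
/// Columns are located via the `#`-prefixed header line that names them.
fn parse_dmon(output: &str) -> Vec<UtilSample> {
    let mut cols: Option<(usize, usize)> = None;
    let mut samples = Vec::new();

    for line in output.lines() {
        let line = line.trim();
        if let Some(header) = line.strip_prefix('#') {
            // Header rows: remember the column indices if this row names them.
            let names: Vec<&str> = header.split_whitespace().collect();
            let sm = names.iter().position(|n| *n == "sm");
            let enc = names.iter().position(|n| *n == "enc");
            if let (Some(s), Some(e)) = (sm, enc) {
                cols = Some((s, e));
            }
            continue;
        }
        // Ignore data rows seen before any usable header.
        let Some((s, e)) = cols else { continue };
        let fields: Vec<&str> = line.split_whitespace().collect();
        let sm = fields.get(s).and_then(|v| v.parse::<u32>().ok());
        let enc = fields.get(e).and_then(|v| v.parse::<u32>().ok());
        if let (Some(sm), Some(enc)) = (sm, enc) {
            samples.push(UtilSample { sm, enc });
        }
    }
    samples
}

/// Mean (sm%, enc%) over a run, or None if there are no samples.
fn mean_util(samples: &[UtilSample]) -> Option<(f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let sm: u32 = samples.iter().map(|s| s.sm).sum();
    let enc: u32 = samples.iter().map(|s| s.enc).sum();
    Some((sm as f64 / n, enc as f64 / n))
}
```
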