wip: host latency/GPU-contention notes + Windows packaging tweaks
Pre-existing working-tree changes committed to the branch on request: the gpu-contention investigation doc, host-latency-plan additions, and small pack-host-installer / stage-pf-vdisplay packaging-script edits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,430 @@
# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)

> The headache, stated precisely:
> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
> game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light
> titles like CS2). **Capping is not an acceptable fix** — demanding titles exhaust the GPU even
> when capped.

This is the second, deeper pass on the problem. The first pass is
[`host-latency-plan.md`](host-latency-plan.md) (a 25-agent investigation, 2026-06-18). **This doc
supersedes several of that doc's conclusions** — the codebase moved a lot in the week since
(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.

Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
carries a current `file:line`.

---

## 0. TL;DR — the corrected mental model and the action list

**The governing fact:** NVENC is a **dedicated ASIC on its own GPU runlist**, physically separate
from the SM/CUDA/graphics cores a 3D game saturates. The game does **not** steal the encode block.
It steals everything that *feeds* the block — capture-acquire, the **RGB→YUV colour-convert**, the
copy into the encoder's input surface, the readback — **and the GPU-scheduler time** to run that
feed work, which is queued behind the game's graphics context.
([NVENC app-note](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html),
[engine-table proof, UNC RTAS'24](https://www.cs.unc.edu/~jbakita/rtas24.pdf))

**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart
before writing code:**

| Bottleneck | Symptom | Fix family |
|---|---|---|
| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |

**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work
either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works
only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping
equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which
costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game
politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7).

**Action list, highest leverage first** (detail in §5–§6):

1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
   mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
   BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
   the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
   deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
   implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
   tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
4. **Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't).
   Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
5. **Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the
   collapse). (§5.E)
6. **If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose
   ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
7. **The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach
   that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)

---

## 1. Corrections to `host-latency-plan.md` (read before reusing it)

The old doc was right about the shape but several specifics are now wrong or stale:

- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
  DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
  **RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
  *regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
  only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
  parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
  native path; GameStream and the WGC helper are hardcoded depth-1.
- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that
  produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread** —
  it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with
  **zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on
  *separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B)
- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right.
  Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU
  context does preempt a saturating graphics context at pixel granularity** — that is precisely how
  NVIDIA VR Async-TimeWarp injects a frame into a busy game
  ([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we
  default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C)
- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The
  "half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact**
  (FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal
  fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a
  general lever. ([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676),
  [Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621))
- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at
  Capture SDK 7.1 / Win10-1803
  ([NVIDIA deprecation bulletin](https://developer.download.nvidia.com/designworks/capture-sdk/docs/NVFBC_Win10_Deprecation_Tech_Bulletin.pdf)).
  Linux-only, and there only via the consumer `keylase` patch.

What the old doc got right, and what still holds: feeding NVENC RGB is backwards; the source/compose
ceiling is real and upstream of encode; split-encode is a pixel-rate lever, not a contention lever;
and there is an honest residual ceiling at 100% GPU. Those carry forward.

---

## 2. How the pipeline actually serializes today (verified against current code)

The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** —
which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
stall"* (`punktfunk1.rs:2466-2468`).
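
A minimal sketch of why the send counter and the `uniq` counter diverge, assuming a channel-fed pacer. This is not the code in `stream.rs`: `run_pacer`, `PacerStats`, and the `u64` frame ids are hypothetical stand-ins, and the timer tick is replaced by a slice that says how many source frames arrived before each tick, so the behaviour is deterministic. For example, `run_pacer(&[1, 0, 0, 2, 0, 1])` sends on all six ticks but counts only three unique frames.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Counters a pacer would report: frames sent vs. frames that were actually new.
#[derive(Debug, Default, PartialEq)]
pub struct PacerStats {
    pub sent: u32,
    pub uniq: u32,
}

/// Drain the channel without blocking and keep only the freshest frame.
/// Hypothetical stand-in for the real `try_latest()`.
fn try_latest(rx: &Receiver<u64>) -> Option<u64> {
    let mut newest = None;
    // `try_recv` returns Err when the queue is empty (or closed): stop draining.
    while let Ok(frame) = rx.try_recv() {
        newest = Some(frame);
    }
    newest
}

/// Run one pacer tick per entry; `arrivals[i]` source frames land before tick `i`.
pub fn run_pacer(arrivals: &[u32]) -> PacerStats {
    let (tx, rx) = channel();
    let mut next_id = 0u64;
    let mut held: Option<u64> = None;
    let mut stats = PacerStats::default();

    for &n in arrivals {
        // Simulated capture side: push `n` new frames.
        for _ in 0..n {
            tx.send(next_id).expect("receiver is alive for the whole loop");
            next_id += 1;
        }
        // Pacer side: take the freshest frame if there is one.
        if let Some(frame) = try_latest(&rx) {
            held = Some(frame);
            stats.uniq += 1;
        }
        // Encode whatever is held, new or repeated, so `sent` tracks the tick rate.
        if held.is_some() {
            stats.sent += 1;
        }
    }
    stats
}
```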

The encode round-trip (NVENC, the dominant path):

- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
  pushes onto a `pending` FIFO.
- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
  completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.

So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
hand AU to the send thread`. The arithmetic is consistent with the symptom: `1000/17 ≈ 59` and
`1000/13 ≈ 77` fps are the ceilings the encode wait alone would allow, and the capture and convert
time spent in the same serial loop pulls the rate below them, toward the observed ~50. That is the
signature of **one frame in flight per round-trip**, not an ASIC throughput wall.
([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
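
The round-trip arithmetic can be sketched as two small functions. The 13–17 ms encode wait comes from the measurements above; the 3 ms capture-plus-convert figure used in the example is an assumption chosen for illustration, not a measured value. With stages of 3 ms and 17 ms, `serial_fps` gives 50 fps while `pipelined_fps` gives about 59 fps. The pipelined figure is an idealised upper bound that assumes perfect overlap, which a contended GPU may not grant.

```rust
/// Frames per second when every stage of a frame runs back-to-back on one thread
/// (the depth-1 loop): the per-frame cost is the sum of the stage times.
pub fn serial_fps(stage_ms: &[f64]) -> f64 {
    1000.0 / stage_ms.iter().sum::<f64>()
}

/// Idealised frames per second when the stages overlap perfectly:
/// throughput is then bounded by the slowest single stage.
pub fn pipelined_fps(stage_ms: &[f64]) -> f64 {
    let slowest = stage_ms.iter().cloned().fold(0.0_f64, f64::max);
    1000.0 / slowest
}
```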

Where the per-frame GPU work lands, by path (this is the crux of contention):

| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |

Measured magnitude of "RGB vs NV12 to the encoder":
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of
features that **internally use CUDA**
([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)).

---

## 3. Diagnose first — cheap, decisive, do before any code

Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM
cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an
actual saturating game.

1. **Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%:
   - `fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling** — the compositor/IDD only produced
     40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
   - both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention** — the round-trip
     is starving. Go to §5.A/B/C.
2. **Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)** —
   "Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs
   Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
   itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
   frames.
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.
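
The decision rule in steps 1 and 2 can be sketched as a small classifier. Everything named here is hypothetical: the `Bottleneck` enum, the `classify` signature, and in particular the 0.85 "close enough" threshold, which is an arbitrary illustration value rather than something measured. `presented_fps` stands for the figure PresentMon reports for the game, so the expected unique rate is the smaller of the target and what the game presented.

```rust
/// Which of the two bottlenecks the counters point at.
#[derive(Debug, PartialEq)]
pub enum Bottleneck {
    /// `fps` itself collapsed: the encode round-trip is starving (case a).
    FeedContention,
    /// `fps` holds near target but `uniq` is far below what the game presented (case b).
    SourceCeiling,
    /// Both counters track what is achievable; nothing to fix at this layer.
    Healthy,
}

/// Classify one measurement window.
/// `target_fps`: what the client requested; `fps`: outbound frames per second;
/// `uniq`: unique frames per second; `presented_fps`: the game's presented rate.
pub fn classify(target_fps: f64, fps: f64, uniq: f64, presented_fps: f64) -> Bottleneck {
    // Hypothetical tolerance for "approximately equal"; tune against real traces.
    const NEAR: f64 = 0.85;

    // The stream cannot carry more unique frames than the game presented.
    let expected_uniq = target_fps.min(presented_fps);

    let fps_ok = fps >= target_fps * NEAR;
    let uniq_ok = uniq >= expected_uniq * NEAR;

    match (fps_ok, uniq_ok) {
        (true, true) => Bottleneck::Healthy,
        (true, false) => Bottleneck::SourceCeiling,
        // If the send rate itself fell, the loop is not keeping its cadence.
        (false, _) => Bottleneck::FeedContention,
    }
}
```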

> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
> throughput of a saturated single GPU is split between game and host no matter what.

---

## 4. Current-state audit (what's shipped / regressed / missing)

| Area | State | Where |
|---|---|---|
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ |
| Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ |
| **Clock / P-state pin** | **none** (zero hits repo-wide) | ✗ |
| **Async NVENC (2-thread)** | **none** | ✗ |
| **Frame-source escape (hook/NvFBC-Linux)** | **none** | ✗ |
| **Second-GPU / iGPU encode offload** | **none** | ✗ |
| DSCP/QoS | implemented, `PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ |

---

## 5. The levers, ranked, with honest verdicts

### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**

The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
feeding NV12/P010. **Make IDD-push and Linux do the same.**

- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
  out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
  `..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
  out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
  (`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
  disagree on the format.
- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
  `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
  `linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
  runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
  P010 convert where the VP supports it).

**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).

### B. A *correct* async encode pipeline (the untried encoder lever)

The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
event in asynchronous mode, or calling `NvEncLockBitstream` in synchronous mode — should be done in
the **secondary thread**."*
([NVENC prog-guide, threading model](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html))
We do the opposite — submit and blocking-retrieve on **one** thread. Queuing more `pending` entries
(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with **no overlap**,
which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
implementation, not a disproof.

The fix: **submit on the capture/encode thread; do `lock_bitstream` on a dedicated retrieve thread;
hold a deep input+output surface pool (≈4–8); on Windows register a `completionEvent` per output
buffer (`enableEncodeAsync=1`) — on Linux async events are unsupported, so use the same two-thread
split with a blocking retrieve.**
([async is Windows/WDDM-only](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html);
FFmpeg models the same knob as `delay`/`async_depth`,
[libavcodec/nvenc.c](https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c)).

This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice,
and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do
frame N+1's convert.
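
The thread split described above can be sketched with standard-library channels. This is a structural illustration only, with no NVENC calls: `Pending`, `AccessUnit`, `blocking_retrieve`, and `run_pipeline` are hypothetical names, and `blocking_retrieve` merely stands in for the blocking `lock_bitstream` wait. The bounded queue plays the role of the surface pool, so the submit side stalls only once `pool_depth` frames are in flight, and output order stays FIFO. Calling `run_pipeline(8, 4)` returns eight access units in submission order.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;

/// Hypothetical handle for an encode that has been submitted but not retrieved.
struct Pending {
    frame_id: u64,
}

/// Hypothetical finished access unit, ready for the send thread.
#[derive(Debug, PartialEq)]
pub struct AccessUnit {
    pub frame_id: u64,
}

/// Placeholder for the blocking retrieve; the real call would wait on the encoder.
fn blocking_retrieve(p: Pending) -> AccessUnit {
    AccessUnit { frame_id: p.frame_id }
}

/// Submit `frames` frames from this thread while a second thread retrieves them.
pub fn run_pipeline(frames: u64, pool_depth: usize) -> Vec<AccessUnit> {
    // Bounded queue models the surface pool: at most `pool_depth` frames in flight.
    let (pending_tx, pending_rx) = sync_channel::<Pending>(pool_depth);
    // Unbounded queue towards the consumer, so the retrieve thread never blocks on it.
    let (au_tx, au_rx) = channel::<AccessUnit>();

    // Retrieve thread: the only place that blocks on encode completion.
    let retriever = thread::spawn(move || {
        for p in pending_rx {
            if au_tx.send(blocking_retrieve(p)).is_err() {
                break; // consumer went away
            }
        }
    });

    // Submit side: hand the frame off and move straight on to the next one.
    for frame_id in 0..frames {
        pending_tx
            .send(Pending { frame_id })
            .expect("retrieve thread exited early");
    }
    drop(pending_tx); // closing the queue lets the retrieve thread drain and exit

    retriever.join().expect("retrieve thread panicked");
    au_rx.into_iter().collect()
}
```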

**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.**
The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`),
and HAGS can spike the *submit* call itself
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).

### C. Auto-gated REALTIME GPU scheduling priority

Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
fullscreen games
([OBS commit](https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0),
[Sunshine `display_base.cpp`](https://raw.githubusercontent.com/LizardByte/Sunshine/master/src/platform/windows/display_base.cpp)).
It works **independently of HAGS** (HAGS does *not* reassign cross-process priority — Microsoft:
*"Windows continues to control prioritization"*
[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).

We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
to change:

- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
  app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
  The lever is available to us specifically.
- **Gate it to dodge the freeze.** REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a **documented
  NVENC hang** (Sunshine ships `nvenc_realtime_hags` to downgrade to HIGH for exactly this;
  [Sunshine config](https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2configuration.html),
  [NVIDIA repro](https://forums.developer.nvidia.com/t/bug-report-nvenc-encoder-hangs-on-windows-when-using-d3d11-in-real-time-mode/357466)).
  Implement the old plan's "Tier 3B": probe HAGS via `D3DKMTQueryAdapterInfo` and VRAM headroom via
  `IDXGIAdapter3::QueryVideoMemoryInfo` (continuously); use REALTIME only when HAGS-off, or HAGS-on
  with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
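
The gate itself reduces to a pure decision function once the two probes have been read. The sketch below assumes the inputs have already been obtained elsewhere; `GateInput`, `GpuPriority`, and `choose_priority` are hypothetical names, and the two headroom thresholds are illustrative guesses, not tuned values. The budget and usage fields are meant to mirror the budget and current-usage figures that `QueryVideoMemoryInfo` reports. The small hysteresis band (enter at 25% free, leave below 15%) is an addition for this sketch so the class does not flap near the boundary; the downgrade still happens on the first poll that falls below the exit threshold.

```rust
/// The two scheduling classes the gate chooses between.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GpuPriority {
    High,
    Realtime,
}

/// One poll of the probes (hypothetical container for values read elsewhere).
pub struct GateInput {
    pub hags_enabled: bool,
    pub vram_budget_bytes: u64,
    pub vram_usage_bytes: u64,
}

// Hypothetical thresholds, expressed as the free fraction of the VRAM budget.
const ENTER_REALTIME_HEADROOM: f64 = 0.25;
const EXIT_REALTIME_HEADROOM: f64 = 0.15;

/// Decide the class for the next interval from the current class and a fresh poll.
pub fn choose_priority(current: GpuPriority, input: &GateInput) -> GpuPriority {
    // With HAGS off, the documented hang combination cannot occur.
    if !input.hags_enabled {
        return GpuPriority::Realtime;
    }
    // No budget information: stay on the safe class.
    if input.vram_budget_bytes == 0 {
        return GpuPriority::High;
    }
    let free = input.vram_budget_bytes.saturating_sub(input.vram_usage_bytes) as f64;
    let headroom = free / input.vram_budget_bytes as f64;

    match current {
        GpuPriority::High if headroom >= ENTER_REALTIME_HEADROOM => GpuPriority::Realtime,
        GpuPriority::Realtime if headroom >= EXIT_REALTIME_HEADROOM => GpuPriority::Realtime,
        _ => GpuPriority::High,
    }
}
```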

**Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever.** Priority is how
the host *takes* GPU time from the game; it measurably **costs the game fps**
([Doom Eternal 121→60 with Sunshine running](https://github.com/LizardByte/Sunshine/issues/3703)).
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).

### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat

Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
`low_power=1` VDEnc path (`:226`). Keep them. Two notes:

- **AMF/QSV suffer contention *worse* than NVENC.** OBS: *"For Intel and AMD GPUs, the hardware
  encoder requires significant resources of the same type a 3D app/game requires… different from
  NVIDIA's NVENC, which has dedicated encoding circuits"*
  ([OBS KB](https://obsproject.com/forum/threads/how-to-debug-encoding-overloaded.168625/)). So on an
  AMD/Intel host the collapse is *expected to be harder* — and §G (iGPU offload) is even more
  attractive there.
- **The AMF busy-poll floor** (a fixed-sleep `QueryOutput` poll imposes ~15 ms via timer
  granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's `QUERY_TIMEOUT` patch); since we
  go through libavcodec we inherit it — just **confirm the pinned FFmpeg build includes it**.
  ([ffmpeg-devel](https://www.mail-archive.com/ffmpeg-devel@ffmpeg.org/msg170489.html))

**Verdict: REAL but largely already captured.** No big win left here except via §G.

### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix

NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
frame — most visible in the *light* scene (the "200-not-240"). Pin it:

- **Windows:** NvAPI per-application DRS `PREFERRED_PSTATE = PREFER_MAX` scoped to our exe (this is
  exactly Sunshine's `nvenc_latency_over_power`,
  [Sunshine nvprefs](https://github.com/LizardByte/Sunshine/blob/master/src/platform/windows/nvprefs/driver_settings.cpp)).
  **Crash-safe undo is mandatory** — persist an undo record to `%ProgramData%\punktfunk\` *before*
  applying, revert a stale profile on next start, so a crash never leaves the user's control panel
  modified.
- **Linux:** `nvidia-smi -lgc`/NVML `nvmlDeviceSetGpuLockedClocks` (needs root/`CAP_SYS_ADMIN`; query
  `nvmlDeviceGetMaxClockInfo`, lock to that, restore on teardown *and* SIGTERM). Plus the newly-added
  `CudaNoStablePerfLimit` driver profile — *new in R580/595, so usable on the 595 box* — to defeat
  the CUDA "Force P2" memory-clock clamp.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default off on battery / Steam Deck** (pinning is harmful
  there).
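
The crash-safe undo ordering (record first, apply second, revert any stale record at startup) can be sketched without touching a driver. The `ClockPin` trait, the `FakePin` test double, and the three functions are hypothetical names for this sketch; the real apply and revert steps would wrap the NvAPI or NVML calls named above, and the undo path would live under the directory the text mentions. The record is written to a temporary file, flushed, and renamed into place so that a crash cannot leave a half-written record behind.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical abstraction over the platform-specific pin and unpin calls.
pub trait ClockPin {
    fn apply(&mut self) -> io::Result<()>;
    fn revert(&mut self) -> io::Result<()>;
}

/// Test double that only remembers whether the pin is currently applied.
#[derive(Default)]
pub struct FakePin {
    pub applied: bool,
}

impl ClockPin for FakePin {
    fn apply(&mut self) -> io::Result<()> {
        self.applied = true;
        Ok(())
    }
    fn revert(&mut self) -> io::Result<()> {
        self.applied = false;
        Ok(())
    }
}

/// Call at startup: if a record survived a crash, revert and clear it.
/// Returns true when a stale pin was found and reverted.
pub fn recover_stale(undo: &Path, pin: &mut dyn ClockPin) -> io::Result<bool> {
    if !undo.exists() {
        return Ok(false);
    }
    pin.revert()?;
    fs::remove_file(undo)?;
    Ok(true)
}

/// Persist the undo record durably *before* changing any driver state.
pub fn pin_clocks(undo: &Path, pin: &mut dyn ClockPin) -> io::Result<()> {
    let tmp = undo.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(b"clock pin applied\n")?;
    file.sync_all()?; // flush to disk so the record outlives a crash
    drop(file); // close the handle before renaming
    fs::rename(&tmp, undo)?;
    pin.apply()
}

/// Normal teardown: revert first, then clear the record.
pub fn unpin_clocks(undo: &Path, pin: &mut dyn ClockPin) -> io::Result<()> {
    pin.revert()?;
    fs::remove_file(undo)
}
```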

**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
already pins P0). Cheap, low risk, do it for the light-scene win.

### F. Escape the frame-source ceiling — only if §3 says (b)

If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.

- **Swapchain-hook capture (the real fix).** Inject a hook on `IDXGISwapChain::Present`/`Present1`,
  `vkQueuePresentKHR`, `wglSwapBuffers` and copy the backbuffer to a shared texture *before* the
  compositor — OBS Game Capture's mechanism. Sees **every presented frame**, no compose/refresh
  gating.
  ([OBS dxgi-capture](https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/dxgi-capture.cpp))
  **Tradeoffs are serious:** anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs
  whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an
  opt-in "game capture" mode, not the default.
- **NvFBC:** **not an option on Windows** (dead, §1). On **Linux** it's viable via the consumer
  keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
- **Compose-flip (narrow):** the topmost 1×1 layered-window trick (we already have
  `composed_flip.rs`) forces DWM composition and fixes specifically the **DLSS-Frame-Gen** half-rate
  case. Adds host-display latency; don't enable globally.
- **WGC "deliver 2× rate":** Apollo sets `MinUpdateInterval = 1e7/(fps*2)` so the pacer always has a
  fresh frame to pick ([Apollo](https://github.com/ClassicOldSong/Apollo/pull/785)); we set it to 1×
  refresh (`wgc.rs:310`). Cheap tweak to try on the WGC path.
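
The `1e7/(fps*2)` expression in the last bullet works because the interval is expressed in 100-nanosecond ticks, of which there are ten million per second. A small helper makes the arithmetic concrete; `min_update_interval_ticks` and its `rate_multiplier` parameter are hypothetical names for this sketch, not the code at `wgc.rs:310`. At 120 fps, a multiplier of 2 gives 41,666 ticks (about 4.17 ms) and a multiplier of 1 gives 83,333 ticks (about 8.33 ms).

```rust
/// Number of 100-ns ticks in one second.
const TICKS_PER_SEC: u64 = 10_000_000;

/// Minimum update interval, in 100-ns ticks, for a capture source asked to
/// deliver `rate_multiplier` frames per stream frame at `fps`.
/// Returns `None` when the requested rate is zero or overflows.
pub fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> Option<u64> {
    let rate = u64::from(fps).checked_mul(u64::from(rate_multiplier))?;
    if rate == 0 {
        return None;
    }
    // Integer division truncates, so the interval is never longer than requested.
    Some(TICKS_PER_SEC / rate)
}
```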

**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
frames the game didn't render.

### G. The honest endgame — encode on a second GPU / the iGPU

For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
**different** GPU — a second dGPU or, more realistically, the **iGPU** (Intel QuickSync / AMD VCN),
which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once,
encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder"
play, and the OBS "second GPU is harmful" verdict does **not** apply — that verdict is about moving
*only the NVENC block*; moving capture + CSC + copies off the gaming GPU genuinely frees it.
([OBS forum](https://obsproject.com/forum/threads/can-you-use-a-2nd-gpu-to-eliminate-encoder-overload.149644/))

We're unusually well-placed for this: we already have working AMF and QSV backends
(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.

**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."**
Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session;
the consumer analogue is the iGPU.

---

## 6. Recommended order of attack

1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
   Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
6. **§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall.
7. **§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build.

After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the
honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM
graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only
*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie
are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps),
or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.

---

## 7. Placebos & dead ends (so we don't re-propose them)
|
||||
|
||||
| Candidate | Verdict | Why |
|---|---|---|
| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) |
| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) |
| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) |
| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". ([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) |
| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |

---

## 8. Open evidence gaps (flagged honestly)

- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
  confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
  `nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
  unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
  whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
  whitepaper; treat the *direction* as solid, the magnitude as TBD.
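
The on-box `nvidia-smi dmon` check from the first bullet can be scripted roughly as below. This is a minimal sketch, not a tool that exists in the repo: the function names and the sample numbers are made up for illustration, and it assumes the utilization view (`nvidia-smi dmon -s u`) prints a `#`-prefixed header row naming `sm` and `enc` columns. Column sets differ between driver versions, so the parser reads the names from the header rather than hard-coding positions.

```python
import shutil
import subprocess
from statistics import mean

# Illustrative sample only: these numbers are invented, not measurements.
SAMPLE = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     97     41     35      0
    0     99     43     37      0
"""


def parse_dmon(text):
    """Turn `nvidia-smi dmon` text into a list of {column: value} rows."""
    columns, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first header row names the columns; later '#' rows
            # (the units row, periodic header repeats) are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        fields = line.split()
        if columns is None or len(fields) != len(columns):
            continue
        row = {}
        for name, value in zip(columns, fields):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # "-" marks a metric the GPU does not report
        rows.append(row)
    return rows


def summarize(rows, keys=("sm", "enc")):
    """Average the requested utilization columns over all samples."""
    summary = {}
    for key in keys:
        values = [r[key] for r in rows if r.get(key) is not None]
        summary[key] = mean(values) if values else None
    return summary


def capture(samples=10):
    """Collect `samples` utilization readings from the live GPU."""
    result = subprocess.run(
        ["nvidia-smi", "dmon", "-s", "u", "-c", str(samples)],
        capture_output=True, text=True, check=True, timeout=samples * 5 + 30,
    )
    return result.stdout


if __name__ == "__main__":
    text = capture() if shutil.which("nvidia-smi") else SAMPLE
    print(summarize(parse_dmon(text)))
```

One way to use it is to record one run with the BGRA→NV12 convert active and one without on the same scene, then compare the two `sm` averages while `enc` stays busy. That comparison is a suggestion for closing the gap, not a result this investigation has produced.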
@@ -1,5 +1,14 @@
# Host latency & the GPU-contention collapse — analysis + prioritized plan

> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
> That follow-up re-verified this plan against the current code and overturned several specifics:
> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a
> useful record.

Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
"saturating game starves the stream" headache:
