docs(design): trim shipped plans, consolidate cluster, add index

Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).

- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
  apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
  host-latency, gpu-contention (fixed stale status table), game-library,
  linux-setup (fixed m0->spike + stale zero-copy claim),
  session-aware-host-followups, windows-client-bootstrap,
  windows-dualsense-{scoping,game-detection}, windows-virtual-display,
  security-review (per-finding status table; #12 still open),
  apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
  windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
  merged, M4 done); windows-secure-desktop.md archived (now a fallback
  behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
  roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:39:06 +00:00
parent 9ea2c17419
commit 7b99b41ede
27 changed files with 1322 additions and 3229 deletions
+101 -81
@@ -1,5 +1,11 @@
# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)
> **Status:** Investigation / plan. §5.A (NV12/P010 on the IDD-push default path) is **SHIPPED**
> — `3514702`, `capture/windows/idd_push.rs` + `encode/windows/nvenc.rs`. All other levers
> (§5.B/§5.C/§5.E/§5.F/§5.G) are **OPEN**; §5.C is partial (REALTIME knob exists, no auto-gate).
> Paired with [`host-latency-plan.md`](host-latency-plan.md) (mutual cross-refs — keep both).
> Trimmed to design rationale + open items; git history holds the full original.
> The headache, stated precisely:
> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
@@ -14,11 +20,6 @@ supersedes several of that doc's conclusions** — the codebase moved a lot in t
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.
Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
carries a current `file:line`.
---
## 0. TL;DR — the corrected mental model and the action list
@@ -50,9 +51,9 @@ politely yield without losing anything" — Reflex, render-queue tricks — is a
1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
2. **Stop feeding NVENC RGB on the default path.** **DONE** for IDD-push (`3514702`): the install
default now converts BGRA→NV12 (SDR, on the video engine) / FP16→P010 (HDR, via a shader that still
runs on the SM) before NVENC. Linux NV12-default and a video-engine HDR P010 are still open. (§5.A)
3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
@@ -73,9 +74,10 @@ politely yield without losing anything" — Reflex, render-queue tricks — is a
The old doc was right about the shape but several specifics are now wrong or stale:
- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
**RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
*regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
DDA and WGC paths — **was false for IDD-push, which became the install default and fed NVENC
RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path *regressed*
on the exact axis the doc celebrated. **Since fixed** (`3514702`, §5.A): IDD-push now converts
BGRA→NV12 on the video engine (FP16→P010 shader for HDR) and feeds NVENC native YUV.
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
@@ -108,39 +110,33 @@ honest residual ceiling at 100% GPU. Those carry forward.
---
## 2. How the pipeline actually serializes today (verified against current code)
## 2. How the pipeline serializes today — the key insight
The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did**,
which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
stall"* (`punktfunk1.rs:2466-2468`).
The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs`, `punktfunk1.rs`): every
`1/target_fps` tick it grabs the freshest frame with a **non-blocking** `try_latest()`, and **if
nothing new arrived it re-encodes the held frame** (a near-empty P-frame). So the **outbound fps is
pinned at `target_fps` no matter what the source did** — which is *why the raw fps counter lies* under
contention. The only honest signal is the `uniq` / `diag_new` counter; the code itself states the
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode stall."*
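A minimal sketch of why the send rate hides source starvation, assuming a simplified pacer. `Source`, `PacerStats`, and `run_pacer` are hypothetical names for illustration, not the code in `gamestream/stream.rs`; only `try_latest()` and the `uniq` counter are taken from the text above.

```rust
/// Counters a fixed-cadence pacer would keep (hypothetical struct).
#[derive(Debug, Default, PartialEq)]
struct PacerStats {
    sent: u32, // ticks that produced an outbound frame (fresh or repeated)
    uniq: u32, // ticks where the source had delivered something new
}

/// Hypothetical stand-in for the capture source: a queue of pending frame ids.
struct Source {
    pending: Vec<u64>,
}

impl Source {
    /// Non-blocking: return the freshest pending frame and drop older ones,
    /// or `None` if nothing new arrived since the last call.
    fn try_latest(&mut self) -> Option<u64> {
        let latest = self.pending.last().copied();
        self.pending.clear();
        latest
    }
}

/// Run one pacer tick per entry; `arrivals[i]` is how many new frames the
/// source produced before tick `i`.
fn run_pacer(arrivals: &[u32]) -> PacerStats {
    let mut source = Source { pending: Vec::new() };
    let mut held: Option<u64> = None;
    let mut next_id = 0u64;
    let mut stats = PacerStats::default();

    for &n in arrivals {
        for _ in 0..n {
            source.pending.push(next_id);
            next_id += 1;
        }
        if let Some(frame) = source.try_latest() {
            held = Some(frame);
            stats.uniq += 1;
        }
        // Fresh or not, the held frame is (re-)encoded and sent on every tick.
        if held.is_some() {
            stats.sent += 1;
        }
    }
    stats
}
```

With arrivals `[1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]` the sketch sends on all 12 ticks while `uniq` only reaches 5, which is the gap the diagnostic relies on.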
The encode round-trip (NVENC, the dominant path):
- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
pushes onto a `pending` FIFO.
- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.
So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
hand AU to the send thread`. The arithmetic matches the symptom — `1000/17 ≈ 59` and `1000/13 ≈ 77`
fps bracket the observed ~50, the signature of **one frame in flight per round-trip**, not an ASIC
throughput wall.
The NVENC round-trip (the dominant path) is **depth-1 synchronous**: `encode_picture` is a
non-blocking ASIC launch, but `lock_bitstream` **blocks the same thread** until that frame completes
(no `enableEncodeAsync`, no completion event). The only thread split is encode-vs-network-send, never
submit-vs-retrieve. So under contention the loop is strictly serial — `capture (+convert) → submit →
block in lock_bitstream → hand AU to the send thread` — and the arithmetic matches the symptom:
`1000/17 ≈ 59` and `1000/13 ≈ 77` fps bracket the observed ~50, the signature of **one frame in
flight per round-trip**, not an ASIC throughput wall.
([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
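The arithmetic above can be written out as a short sketch. `serial_fps` is the depth-1 bound quoted in the text; `pipelined_fps` is an idealised assumption (frames in flight scale throughput until some engine cap is hit) and is not a measured model of NVENC. Both function names and the `engine_fps_cap` parameter are hypothetical.

```rust
/// Frame-rate ceiling when exactly one frame is in flight per round-trip.
fn serial_fps(round_trip_ms: f64) -> f64 {
    1000.0 / round_trip_ms
}

/// Idealised ceiling with `depth` frames in flight, clamped to a hypothetical
/// throughput cap of the encode engine. Real overlap will be lower.
fn pipelined_fps(round_trip_ms: f64, depth: u32, engine_fps_cap: f64) -> f64 {
    (serial_fps(round_trip_ms) * f64::from(depth)).min(engine_fps_cap)
}
```

For a 17 ms round-trip `serial_fps` gives about 59 fps and for 13 ms about 77 fps, the bracket quoted above; the pipelined figure is only an upper bound on what submit/retrieve overlap could recover.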
Where the per-frame GPU work lands, by path (this is the crux of contention):
Where the per-frame GPU work lands, by path (the crux of contention — **lower contended-engine load is
better**):
| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
| Path | Colour-convert | NVENC input | Contended-engine load/frame |
|---|---|---|---|
| **IDD-push** (install default) | **NV12/P010 on the video engine** (`3514702`; FP16→P010 via shader for HDR) | NV12/P010 | low (SDR) / shader-CSC on SM (HDR) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine** | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine** | NV12/P010 | medium (one 3D `CopyResource` to release the dup fast) |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
Measured magnitude of "RGB vs NV12 to the encoder":
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
@@ -166,7 +162,8 @@ actual saturating game.
Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
frames.
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall. (Per-stage
`cap`/`submit`/`wait` µs instrumentation landed under `PUNKTFUNK_PERF` in `3514702`.)
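The reading rule in steps 1 and 2 could be automated roughly as follows. This is a sketch under stated assumptions: the `Verdict` enum, `classify`, the `expected_game_fps` input, and both ratio thresholds are hypothetical, and the PresentMon presentation-mode check is left out for brevity.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Presented FPS itself collapsed: the game is GPU-bound and no capture
    /// trick invents the missing frames.
    GameGpuBound,
    /// The game presents frames that capture never sees (`uniq` far below
    /// Presented): a source/flip problem.
    SourceOrFlip,
    /// Capture keeps up with presentation; look at cap_us/enc_us/pace_us.
    CaptureTracks,
}

/// Classify one measurement window. All rates are frames per second.
fn classify(presented_fps: f64, uniq_fps: f64, expected_game_fps: f64) -> Verdict {
    // Hypothetical thresholds; tune on-box before trusting the verdict.
    const COLLAPSE_RATIO: f64 = 0.5;
    const TRACKING_RATIO: f64 = 0.8;

    if presented_fps < expected_game_fps * COLLAPSE_RATIO {
        Verdict::GameGpuBound
    } else if uniq_fps < presented_fps * TRACKING_RATIO {
        Verdict::SourceOrFlip
    } else {
        Verdict::CaptureTracks
    }
}
```

For example, a game expected at 140 fps that presents 140 while `uniq` reads 50 would be classified `SourceOrFlip`, whereas Presented dropping to 50 would be `GameGpuBound`.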
> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
@@ -182,12 +179,12 @@ actual saturating game.
| Thread priority (Linux) | `setpriority` 10/5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) now video-engine NV12** (`3514702`) ✅ | `wgc.rs:631` / `idd_push.rs` |
| CSC off-SM (Win HDR) | IDD-push HDR via FP16→P010 **shader** (on-SM); other paths on-SM unless `PUNKTFUNK_HDR_SHADER_P010` | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623`/⚠ |
| Zero-copy register-in-place | yes; IDD-push out-ring is now the convert target (NV12/P010), no extra copy | `nvenc.rs:623` ✅ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
@@ -202,35 +199,32 @@ actual saturating game.
## 5. The levers, ranked, with honest verdicts
### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**
### A. Stop feeding NVENC RGB on the default path — **DONE for Windows IDD-push** (`3514702`)
The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
feeding NV12/P010. **Make IDD-push and Linux do the same.**
The default Windows IDD-push path used to hand NVENC packed RGB, forcing NVENC's internal RGB→YUV CSC
onto the SM the game saturates. `3514702` makes the out-ring the convert target: a D3D11 **video-engine**
`VideoConverter` does BGRA→NV12 (SDR, BT.709 limited) in place, so NVENC gets native NV12 and skips its
SM-side CSC; HDR uses the FP16→P010 shader (NVIDIA's VideoProcessor can't do RGB→P010). NV12 input forces
`bit_depth=8`, so an HDR↔SDR toggle re-inits the session at the matching depth (NV12 can't feed a 10-bit
session). This also removed the separate `CopyResource` (the convert writes the ring directly).
- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
`..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
(`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
disagree on the format.
- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
`PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
`linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
P010 convert where the VP supports it).
**Verdict: REAL, but honestly *conditional*** — the convert has to land **off** the SM to fully pay off.
`VideoProcessorBlt` is *designed* to use fixed-function video hardware and the hardforum numbers back the
15%→2% drop, **but no NVIDIA doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat
the "video engine" claim as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch
the `enc`/`sm` columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that
just relocates the CSC to the same SM (this is why the HDR P010 *shader* path is still on-SM; Sunshine's
RGB→NV12 CUDA kernel still contends).
**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).
**Still open in §A:**
- **Linux:** make NV12 the **default** for the tiled zero-copy path (gated behind `PUNKTFUNK_NV12` *and*
`PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`, `linux/zerocopy/egl.rs:272`), feeding NVENC
`NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already runs; emitting NV12 from it replaces the swizzle at
~equal cost and deletes NVENC's CSC.
- **Windows HDR:** move the FP16→P010 convert onto the video engine where the VP supports it (today's
shader keeps it on-SM), or flip `PUNKTFUNK_HDR_SHADER_P010` on by default for the non-IDD paths.
### B. A *correct* async encode pipeline (the untried encoder lever)
### B. A *correct* async encode pipeline (the untried encoder lever) — **OPEN**
The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
@@ -263,7 +257,7 @@ Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `
and HAGS can spike the *submit* call itself
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).
### C. Auto-gated REALTIME GPU scheduling priority
### C. Auto-gated REALTIME GPU scheduling priority — **PARTIAL** (knob exists, no auto-gate)
Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
@@ -274,8 +268,8 @@ It works **independently of HAGS** (HAGS does *not* reassign cross-process prior
*"Windows continues to control prioritization"*
[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).
We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
to change:
We ship only **HIGH(4)** by default with a static `realtime` opt-in (`PUNKTFUNK_GPU_PRIORITY_CLASS`,
`dxgi.rs:208-330`) and **no auto-gate**. Two things to change:
- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
@@ -294,7 +288,7 @@ the host *takes* GPU time from the game; it measurably **costs the game fps**
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).
### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat
### D. Multi-vendor encoder hygiene (AMF/QSV) — **stable / mostly done, one caveat**
Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
@@ -313,7 +307,7 @@ Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
**Verdict: REAL but largely already captured.** No big win left here except via §G.
### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix
### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix — **OPEN**
NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
frame — most visible in the *light* scene (the "200-not-240"). Pin it:
@@ -334,7 +328,7 @@ frame — most visible in the *light* scene (the "200-not-240"). Pin it:
**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
already pins P0). Cheap, low risk, do it for the light-scene win.
### F. Escape the frame-source ceiling — only if §3 says (b)
### F. Escape the frame-source ceiling — only if §3 says (b) — **OPEN**
If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.
@@ -358,7 +352,7 @@ If `uniq` is the wall, no encoder/priority work helps — you need a better fram
**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
frames the game didn't render.
### G. The honest endgame — encode on a second GPU / the iGPU
### G. The honest endgame — encode on a second GPU / the iGPU — **OPEN**
For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
@@ -384,8 +378,8 @@ the consumer analogue is the iGPU.
## 6. Recommended order of attack
1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
2. **§5.A NV12/P010 on the default paths:** IDD-push **DONE** (`3514702`); remaining: Linux NV12
default-on, Windows HDR P010 off-SM. Confirm off-SM with `nvidia-smi dmon`.
3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
@@ -418,13 +412,39 @@ or a second slice of silicon (§G). Don't chase the rest with encoder micro-opti
---
## 8. Open evidence gaps (flagged honestly)
## 8. Open items / what's left
Diagnostics + still-unbuilt levers (verbatim, highest leverage first):
- **§3 automation** — instrument the `uniq`-vs-`fps` heuristic + a PresentMon probe so (a)/(b) is
decided automatically, not by hand. (Per-stage `cap`/`submit`/`wait` µs already land under
`PUNKTFUNK_PERF` from `3514702`; the uniq/PresentMon classifier is not yet automated.)
- **§5.A residual** — Linux NV12 default-on for the tiled zero-copy path (drop the
`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY` double-opt-in); move the Windows HDR FP16→P010 convert off the
SM (today it's a shader). Windows IDD-push SDR/HDR NV12/P010 is **DONE** (`3514702`).
- **§5.B** — build a *correct* async NVENC pipeline: submit on one thread, blocking-`lock_bitstream`
on a dedicated retrieve thread, deep input+output surface pool (≈4–8), Windows per-buffer
`completionEvent` (`enableEncodeAsync=1`), same two-thread split on Linux.
- **§5.C** — auto-gate REALTIME GPU priority: probe HAGS (`D3DKMTQueryAdapterInfo`) + VRAM headroom
(`IDXGIAdapter3::QueryVideoMemoryInfo`) continuously; REALTIME only when HAGS-off or HAGS-on with
comfortable headroom, downgrade to HIGH the instant VRAM tightens. (Static `realtime` opt-in exists
in `dxgi.rs`; no auto-gate.)
- **§5.E** — clock / P-state pinning: Windows NvAPI DRS `PREFERRED_PSTATE=PREFER_MAX` (crash-safe undo
to `%ProgramData%\punktfunk\`); Linux `nvidia-smi -lgc` / `nvmlDeviceSetGpuLockedClocks` (+
`CudaNoStablePerfLimit` on R580/595). Gate `PUNKTFUNK_PIN_CLOCKS`, default off on battery/Deck.
- **§5.F** — frame-source escape (only if §3 says (b)): swapchain-hook capture (OBS-style, anti-cheat
tradeoffs); NvFBC on Linux (keylase patch); compose-flip for the DLSS-FG half-rate case; WGC
`MinUpdateInterval = 1e7/(fps*2)` 2×-rate tweak.
- **§5.G** — iGPU / second-GPU encode offload: pin capture to the gaming adapter, encoder to the iGPU
adapter, one cross-adapter shared-texture copy. Reuses the AMF/QSV/VAAPI backends.
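The §5.C gate in the list above reduces to a small decision function. A minimal sketch, assuming the HAGS state and VRAM numbers have already been probed elsewhere: `GpuProbe`, `choose_class`, and `min_headroom_fraction` are hypothetical names, and mapping the two byte counts to the budget and current-usage values from `QueryVideoMemoryInfo` is an assumption about how the probe would be wired.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuPriorityClass {
    High,
    Realtime,
}

/// Snapshot of the inputs the gate needs (hypothetical struct).
struct GpuProbe {
    hags_enabled: bool,
    vram_budget_bytes: u64, // assumed: OS-reported budget for this process
    vram_usage_bytes: u64,  // assumed: current usage against that budget
}

/// REALTIME when HAGS is off, or when HAGS is on with comfortable VRAM
/// headroom; otherwise fall back to HIGH.
fn choose_class(probe: &GpuProbe, min_headroom_fraction: f64) -> GpuPriorityClass {
    if !probe.hags_enabled {
        return GpuPriorityClass::Realtime;
    }
    if probe.vram_budget_bytes == 0 {
        // No usable budget reading: stay conservative.
        return GpuPriorityClass::High;
    }
    let headroom = probe.vram_budget_bytes.saturating_sub(probe.vram_usage_bytes);
    let fraction = headroom as f64 / probe.vram_budget_bytes as f64;
    if fraction >= min_headroom_fraction {
        GpuPriorityClass::Realtime
    } else {
        GpuPriorityClass::High
    }
}
```

Calling this on every probe interval gives the "downgrade to HIGH the instant VRAM tightens" behaviour, since the verdict is recomputed from the latest snapshot rather than latched.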
### Open evidence gaps (verify on-box)
- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
`nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
`nvidia-smi dmon` (sm% vs enc%) on the IDD-push/WGC path before assuming the win landed.
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
unmeasured. §3 + an A/B of IDD-push-RGB (pre-`3514702`) vs IDD-push-NV12 on the same scene settles it
and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
whitepaper; treat the *direction* as solid, the magnitude as TBD.