docs(design): trim shipped plans, consolidate cluster, add index
Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).
- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
host-latency, gpu-contention (fixed stale status table), game-library,
linux-setup (fixed m0->spike + stale zero-copy claim),
session-aware-host-followups, windows-client-bootstrap,
windows-dualsense-{scoping,game-detection}, windows-virtual-display,
security-review (per-finding status table; #12 still open),
apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
merged, M4 done); windows-secure-desktop.md archived (now a fallback
behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -1,5 +1,11 @@
|
||||
# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)
|
||||
|
||||
> **Status:** Investigation / plan. §5.A (NV12/P010 on the IDD-push default path) is **SHIPPED**
|
||||
> — `3514702`, `capture/windows/idd_push.rs` + `encode/windows/nvenc.rs`. All other levers
|
||||
> (§5.B/§5.C/§5.E/§5.F/§5.G) are **OPEN**; §5.C is partial (REALTIME knob exists, no auto-gate).
|
||||
> Paired with [`host-latency-plan.md`](host-latency-plan.md) (mutual cross-refs — keep both).
|
||||
> Trimmed to design rationale + open items; git history holds the full original.
|
||||
|
||||
> The headache, stated precisely:
|
||||
> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
|
||||
> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
|
||||
@@ -14,11 +20,6 @@ supersedes several of that doc's conclusions** — the codebase moved a lot in t
|
||||
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
|
||||
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.
|
||||
|
||||
Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
|
||||
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
|
||||
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
|
||||
carries a current `file:line`.
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR — the corrected mental model and the action list
|
||||
@@ -50,9 +51,9 @@ politely yield without losing anything" — Reflex, render-queue tricks — is a
|
||||
|
||||
1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
|
||||
mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
|
||||
2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
|
||||
BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
|
||||
the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
|
||||
2. **Stop feeding NVENC RGB on the default path** — **DONE** for IDD-push (`3514702`): the install
|
||||
default now converts BGRA→NV12 (SDR) / FP16→P010 (HDR) before NVENC, off the SM. Linux NV12-default
|
||||
and a video-engine HDR P010 are still open. (§5.A)
|
||||
3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
|
||||
deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
|
||||
implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
|
||||
@@ -73,9 +74,10 @@ politely yield without losing anything" — Reflex, render-queue tricks — is a
|
||||
The old doc was right about the shape but several specifics are now wrong or stale:
|
||||
|
||||
- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
|
||||
DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
|
||||
**RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
|
||||
*regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
|
||||
DDA and WGC paths — **was false for IDD-push, which became the install default and fed NVENC
|
||||
RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path *regressed*
|
||||
on the exact axis the doc celebrated. **Since fixed** (`3514702`, §5.A): IDD-push now converts
|
||||
BGRA→NV12 on the video engine (FP16→P010 shader for HDR) and feeds NVENC native YUV.
|
||||
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
|
||||
only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
|
||||
parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
|
||||
@@ -108,39 +110,33 @@ honest residual ceiling at 100% GPU. Those carry forward.
|
||||
|
||||
---
|
||||
|
||||
## 2. How the pipeline actually serializes today (verified against current code)
|
||||
## 2. How the pipeline serializes today — the key insight
|
||||
|
||||
The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
|
||||
`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
|
||||
**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
|
||||
near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did** —
|
||||
which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
|
||||
`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
|
||||
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
|
||||
stall"* (`punktfunk1.rs:2466-2468`).
|
||||
The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs`, `punktfunk1.rs`): every
|
||||
`1/target_fps` tick it grabs the freshest frame with a **non-blocking** `try_latest()`, and **if
|
||||
nothing new arrived it re-encodes the held frame** (a near-empty P-frame). So the **outbound fps is
|
||||
pinned at `target_fps` no matter what the source did** — which is *why the raw fps counter lies* under
|
||||
contention. The only honest signal is the `uniq` / `diag_new` counter; the code itself states the
|
||||
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode stall."*
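
A minimal sketch of the pacing behaviour described above, with the clock and the capture source abstracted away so it runs standalone. Only `try_latest()` and the `uniq` counter come from the text; `FrameSource`, `Pacer`, `PacerStats`, `EveryNth` and `run_window` are hypothetical names for illustration, not the ones in `stream.rs` or `punktfunk1.rs`.

```rust
/// Hypothetical stand-in for the capture side; only `try_latest` is named in the doc.
trait FrameSource {
    /// Non-blocking: returns a frame id only if a new frame arrived since the last poll.
    fn try_latest(&mut self) -> Option<u64>;
}

/// Counters for one measurement window.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct PacerStats {
    sent: u32, // frames handed to the encoder
    uniq: u32, // frames that were actually new
}

/// Fixed-cadence pacer: one encode per tick, whether or not the source produced anything.
struct Pacer {
    held: Option<u64>,
    stats: PacerStats,
}

impl Pacer {
    fn new() -> Self {
        Pacer { held: None, stats: PacerStats::default() }
    }

    /// One `1/target_fps` tick. Returns the frame id that gets encoded, if any.
    fn tick(&mut self, src: &mut dyn FrameSource) -> Option<u64> {
        if let Some(frame) = src.try_latest() {
            self.held = Some(frame);
            self.stats.uniq += 1;
        }
        // With nothing new, the held frame is encoded again (a near-empty P-frame).
        if self.held.is_some() {
            self.stats.sent += 1;
        }
        self.held
    }
}

/// Test source (an assumption for the demo): a new frame on every `period`-th poll.
struct EveryNth {
    period: u32,
    polls: u32,
    next_id: u64,
}

impl FrameSource for EveryNth {
    fn try_latest(&mut self) -> Option<u64> {
        let produce = self.polls % self.period == 0;
        self.polls += 1;
        if produce {
            self.next_id += 1;
            Some(self.next_id)
        } else {
            None
        }
    }
}

/// Runs one second's worth of ticks at `target_fps` against a source that is
/// `period` times slower than the pacer.
fn run_window(target_fps: u32, period: u32) -> PacerStats {
    assert!(period > 0, "period must be non-zero");
    let mut src = EveryNth { period, polls: 0, next_id: 0 };
    let mut pacer = Pacer::new();
    for _ in 0..target_fps {
        pacer.tick(&mut src);
    }
    pacer.stats
}
```

With `run_window(120, 3)` the pacer reports 120 sent frames but only 40 unique ones, which is the gap between the raw fps counter and `uniq` that the paragraph above describes.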
|
||||
|
||||
The encode round-trip (NVENC, the dominant path):
|
||||
|
||||
- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
|
||||
pushes onto a `pending` FIFO.
|
||||
- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
|
||||
completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
|
||||
- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.
|
||||
|
||||
So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
|
||||
hand AU to the send thread`. The arithmetic matches the symptom — `1000/17 ≈ 59` and `1000/13 ≈ 77`
|
||||
fps bracket the observed ~50, the signature of **one frame in flight per round-trip**, not an ASIC
|
||||
throughput wall.
|
||||
The NVENC round-trip (the dominant path) is **depth-1 synchronous**: `encode_picture` is a
non-blocking ASIC launch, but `lock_bitstream` **blocks the same thread** until that frame completes
(no `enableEncodeAsync`, no completion event). The only thread split is encode-vs-network-send, never
submit-vs-retrieve. So under contention the loop is strictly serial: `capture (+convert) → submit →
block in lock_bitstream → hand AU to the send thread`. The arithmetic matches the symptom:
`1000/17 ≈ 59` and `1000/13 ≈ 77` fps are the ceilings set by the 13–17 ms encode wait alone, and the
observed ~50 sits just below them, consistent with the serial capture (+convert) step adding to each
round-trip. That is the signature of **one frame in flight per round-trip**, not an ASIC throughput wall.
|
||||
([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
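
The arithmetic above can be written down as a one-line bound. This is a sketch under a simplifying assumption: the `depth` parameter models an idealised pipeline in which `depth` round-trips overlap perfectly, which ignores ASIC throughput limits and scheduling jitter. `fps_ceiling` is a hypothetical helper, not a function in the codebase.

```rust
/// Upper bound on unique frames per second when each frame's submit-to-retrieve
/// round-trip takes `round_trip_ms` and `depth` frames are kept in flight.
/// depth = 1 is the synchronous loop described above; depth > 1 assumes
/// perfect overlap, so treat it as a best case rather than a prediction.
fn fps_ceiling(round_trip_ms: f64, depth: u32) -> f64 {
    assert!(round_trip_ms > 0.0, "round-trip time must be positive");
    assert!(depth > 0, "at least one frame must be in flight");
    f64::from(depth) * 1000.0 / round_trip_ms
}
```

`fps_ceiling(17.0, 1)` and `fps_ceiling(13.0, 1)` reproduce the ≈59 and ≈77 fps figures quoted in the text.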
|
||||
|
||||
Where the per-frame GPU work lands, by path (this is the crux of contention):
|
||||
Where the per-frame GPU work lands, by path (the crux of contention — **lower contended-engine load is
|
||||
better**):
|
||||
|
||||
| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
|
||||
| Path | Colour-convert | NVENC input | Contended-engine load/frame |
|---|---|---|---|
| **IDD-push** (install default) | **NV12/P010 on the video engine** (`3514702`; FP16→P010 via shader for HDR) | NV12/P010 | low (SDR) / shader-CSC on SM (HDR) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine** | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine** | NV12/P010 | medium (one 3D `CopyResource` to release the dup fast) |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
|
||||
|
||||
Measured magnitude of "RGB vs NV12 to the encoder":
|
||||
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
|
||||
@@ -166,7 +162,8 @@ actual saturating game.
|
||||
Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
|
||||
itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
|
||||
frames.
|
||||
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.
|
||||
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall. (Per-stage
|
||||
`cap`/`submit`/`wait` µs instrumentation landed under `PUNKTFUNK_PERF` in `3514702`.)
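
The two decision rules in step 2 can be sketched as a small classifier. Everything here is illustrative: `Verdict`, `classify`, the `baseline_presented_fps` input (the game's Presented FPS in an uncontended scene) and both ratio thresholds are assumptions, not values from the codebase, and the PresentMon presentation-mode check is left out for brevity.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Presented FPS itself collapsed: the game is GPU-bound, capture cannot help.
    GameGpuBound,
    /// `uniq` is far below Presented: the frame source / flip path is losing frames.
    SourceOrFlip,
    /// Capture keeps up with what the game presents.
    NotSourceLimited,
}

/// Hypothetical threshold: Presented below this share of the baseline counts as "collapsed".
const COLLAPSE_RATIO: f64 = 0.6;
/// Hypothetical threshold: `uniq` below this share of Presented counts as "far below".
const SHORTFALL_RATIO: f64 = 0.8;

/// Applies the two rules in order: a collapsed Presented FPS wins over a `uniq` shortfall,
/// because no capture change recovers frames the game never rendered.
fn classify(uniq_fps: f64, presented_fps: f64, baseline_presented_fps: f64) -> Verdict {
    if presented_fps < baseline_presented_fps * COLLAPSE_RATIO {
        Verdict::GameGpuBound
    } else if uniq_fps < presented_fps * SHORTFALL_RATIO {
        Verdict::SourceOrFlip
    } else {
        Verdict::NotSourceLimited
    }
}
```

The thresholds would need tuning against real captures; the ordering of the two checks is the part that follows from the text.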
|
||||
|
||||
> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
|
||||
> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
|
||||
@@ -182,12 +179,12 @@ actual saturating game.
|
||||
| Thread priority (Linux) | `setpriority` −10/−5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) now video-engine NV12** (`3514702`) ✅ | `wgc.rs:631` / `idd_push.rs` |
| CSC off-SM (Win HDR) | IDD-push HDR via FP16→P010 **shader** (on-SM); other paths on-SM unless `PUNKTFUNK_HDR_SHADER_P010` | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
| Zero-copy register-in-place | yes; IDD-push out-ring is now the convert target (NV12/P010), no extra copy | `nvenc.rs:623` ✅ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
|
||||
@@ -202,35 +199,32 @@ actual saturating game.
|
||||
|
||||
## 5. The levers, ranked, with honest verdicts
|
||||
|
||||
### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**
|
||||
### A. Stop feeding NVENC RGB on the default path — **DONE for Windows IDD-push** (`3514702`)
|
||||
|
||||
The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
|
||||
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
|
||||
solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
|
||||
feeding NV12/P010. **Make IDD-push and Linux do the same.**
|
||||
The default Windows IDD-push path used to hand NVENC packed RGB, forcing NVENC's internal RGB→YUV CSC
|
||||
onto the SM the game saturates. `3514702` makes the out-ring the convert target: a D3D11 **video-engine**
|
||||
`VideoConverter` does BGRA→NV12 (SDR, BT.709 limited) in place, so NVENC gets native NV12 and skips its
|
||||
SM-side CSC; HDR uses the FP16→P010 shader (NVIDIA's VideoProcessor can't do RGB→P010). NV12 input forces
|
||||
`bit_depth=8`, so an HDR↔SDR toggle re-inits the session at the matching depth (NV12 can't feed a 10-bit
|
||||
session). This also removed the separate `CopyResource` (the convert writes the ring directly).
|
||||
|
||||
- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
|
||||
out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
|
||||
`..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
|
||||
out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
|
||||
(`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
|
||||
disagree on the format.
|
||||
- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
|
||||
`PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
|
||||
`linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
|
||||
runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
|
||||
- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
|
||||
P010 convert where the VP supports it).
|
||||
**Verdict: REAL, but honestly *conditional*** — the convert has to land **off** the SM to fully pay off.
|
||||
`VideoProcessorBlt` is *designed* to use fixed-function video hardware and the hardforum numbers back the
|
||||
15%→2% drop, **but no NVIDIA doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat
|
||||
the "video engine" claim as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch
|
||||
the `enc`/`sm` columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that
|
||||
just relocates the CSC to the same SM (this is why the HDR P010 *shader* path is still on-SM; Sunshine's
|
||||
RGB→NV12 CUDA kernel still contends).
|
||||
|
||||
**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
|
||||
CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
|
||||
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
|
||||
doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
|
||||
as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
|
||||
columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
|
||||
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).
|
||||
**Still open in §A:**
|
||||
- **Linux:** make NV12 the **default** for the tiled zero-copy path (gated behind `PUNKTFUNK_NV12` *and*
|
||||
`PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`, `linux/zerocopy/egl.rs:272`), feeding NVENC
|
||||
`NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already runs; emitting NV12 from it replaces the swizzle at
|
||||
~equal cost and deletes NVENC's CSC.
|
||||
- **Windows HDR:** move the FP16→P010 convert onto the video engine where the VP supports it (today's
|
||||
shader keeps it on-SM), or flip `PUNKTFUNK_HDR_SHADER_P010` on by default for the non-IDD paths.
|
||||
|
||||
### B. A *correct* async encode pipeline (the untried encoder lever)
|
||||
### B. A *correct* async encode pipeline (the untried encoder lever) — **OPEN**
|
||||
|
||||
The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
|
||||
work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
|
||||
@@ -263,7 +257,7 @@ Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `
|
||||
and HAGS can spike the *submit* call itself
|
||||
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).
|
||||
|
||||
### C. Auto-gated REALTIME GPU scheduling priority
|
||||
### C. Auto-gated REALTIME GPU scheduling priority — **PARTIAL** (knob exists, no auto-gate)
|
||||
|
||||
Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
|
||||
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
|
||||
@@ -274,8 +268,8 @@ It works **independently of HAGS** (HAGS does *not* reassign cross-process prior
|
||||
*"Windows continues to control prioritization"*
|
||||
[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).
|
||||
|
||||
We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
|
||||
to change:
|
||||
We ship only **HIGH(4)** by default with a static `realtime` opt-in (`PUNKTFUNK_GPU_PRIORITY_CLASS`,
|
||||
`dxgi.rs:208-330`) and **no auto-gate**. Two things to change:
|
||||
|
||||
- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
|
||||
app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
|
||||
@@ -294,7 +288,7 @@ the host *takes* GPU time from the game; it measurably **costs the game fps**
|
||||
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
|
||||
the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).
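
The auto-gate this section argues for (REALTIME only when HAGS is off, or when HAGS is on with comfortable VRAM headroom, otherwise HIGH) reduces to a pure decision function. The sketch below assumes the HAGS state and the VRAM budget/usage figures have already been read by the caller; `GateInputs`, `choose_priority` and the 15% headroom threshold are hypothetical, not taken from `dxgi.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuPriorityClass {
    High,
    Realtime,
}

/// Snapshot of the two probes; field names are illustrative.
struct GateInputs {
    hags_enabled: bool,
    vram_budget_bytes: u64,
    vram_usage_bytes: u64,
}

/// Hypothetical threshold: at least this fraction of the VRAM budget must be free
/// before REALTIME is considered safe with HAGS on.
const MIN_HEADROOM_FRACTION: f64 = 0.15;

/// Re-evaluated on every probe, so a tightening VRAM budget downgrades immediately.
fn choose_priority(inputs: &GateInputs) -> GpuPriorityClass {
    if !inputs.hags_enabled {
        return GpuPriorityClass::Realtime;
    }
    // saturating_sub: usage above budget counts as zero headroom, not an underflow.
    let headroom = inputs.vram_budget_bytes.saturating_sub(inputs.vram_usage_bytes);
    let comfortable = inputs.vram_budget_bytes > 0
        && headroom as f64 / inputs.vram_budget_bytes as f64 >= MIN_HEADROOM_FRACTION;
    if comfortable {
        GpuPriorityClass::Realtime
    } else {
        GpuPriorityClass::High
    }
}
```

Keeping the decision separate from the probing makes the policy testable without a GPU; an operator override via `PUNKTFUNK_GPU_PRIORITY_CLASS` would sit in front of it.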
|
||||
|
||||
### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat
|
||||
### D. Multi-vendor encoder hygiene (AMF/QSV) — **stable / mostly done, one caveat**
|
||||
|
||||
Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
|
||||
`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
|
||||
@@ -313,7 +307,7 @@ Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
|
||||
|
||||
**Verdict: REAL but largely already captured.** No big win left here except via §G.
|
||||
|
||||
### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix
|
||||
### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix — **OPEN**
|
||||
|
||||
NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
|
||||
frame — most visible in the *light* scene (the "200-not-240"). Pin it:
|
||||
@@ -334,7 +328,7 @@ frame — most visible in the *light* scene (the "200-not-240"). Pin it:
|
||||
**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
|
||||
already pins P0). Cheap, low risk, do it for the light-scene win.
|
||||
|
||||
### F. Escape the frame-source ceiling — only if §3 says (b)
|
||||
### F. Escape the frame-source ceiling — only if §3 says (b) — **OPEN**
|
||||
|
||||
If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.
|
||||
|
||||
@@ -358,7 +352,7 @@ If `uniq` is the wall, no encoder/priority work helps — you need a better fram
|
||||
**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
|
||||
frames the game didn't render.
|
||||
|
||||
### G. The honest endgame — encode on a second GPU / the iGPU
|
||||
### G. The honest endgame — encode on a second GPU / the iGPU — **OPEN**
|
||||
|
||||
For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
|
||||
contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
|
||||
@@ -384,8 +378,8 @@ the consumer analogue is the iGPU.
|
||||
## 6. Recommended order of attack
|
||||
|
||||
1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
|
||||
2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
|
||||
Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
|
||||
2. **§5.A NV12/P010 on the default paths** — IDD-push **DONE** (`3514702`); remaining: Linux NV12
|
||||
default-on, Windows HDR P010 off-SM. Confirm off-SM with `nvidia-smi dmon`.
|
||||
3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
|
||||
4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
|
||||
5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
|
||||
@@ -418,13 +412,39 @@ or a second slice of silicon (§G). Don't chase the rest with encoder micro-opti
|
||||
|
||||
---
|
||||
|
||||
## 8. Open evidence gaps (flagged honestly)
|
||||
## 8. Open items / what's left
|
||||
|
||||
Diagnostics + still-unbuilt levers (verbatim, highest leverage first):
|
||||
|
||||
- **§3 automation** — instrument the `uniq`-vs-`fps` heuristic + a PresentMon probe so (a)/(b) is
|
||||
decided automatically, not by hand. (Per-stage `cap`/`submit`/`wait` µs already land under
|
||||
`PUNKTFUNK_PERF` from `3514702`; the uniq/PresentMon classifier is not yet automated.)
|
||||
- **§5.A residual** — Linux NV12 default-on for the tiled zero-copy path (drop the
|
||||
`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY` double-opt-in); move the Windows HDR FP16→P010 convert off the
|
||||
SM (today it's a shader). Windows IDD-push SDR/HDR NV12/P010 is **DONE** (`3514702`).
|
||||
- **§5.B** — build a *correct* async NVENC pipeline: submit on one thread, blocking-`lock_bitstream`
|
||||
on a dedicated retrieve thread, deep input+output surface pool (≈4–8), Windows per-buffer
|
||||
`completionEvent` (`enableEncodeAsync=1`), same two-thread split on Linux.
|
||||
- **§5.C** — auto-gate REALTIME GPU priority: probe HAGS (`D3DKMTQueryAdapterInfo`) + VRAM headroom
|
||||
(`IDXGIAdapter3::QueryVideoMemoryInfo`) continuously; REALTIME only when HAGS-off or HAGS-on with
|
||||
comfortable headroom, downgrade to HIGH the instant VRAM tightens. (Static `realtime` opt-in exists
|
||||
in `dxgi.rs`; no auto-gate.)
|
||||
- **§5.E** — clock / P-state pinning: Windows NvAPI DRS `PREFERRED_PSTATE=PREFER_MAX` (crash-safe undo
|
||||
to `%ProgramData%\punktfunk\`); Linux `nvidia-smi -lgc` / `nvmlDeviceSetGpuLockedClocks` (+
|
||||
`CudaNoStablePerfLimit` on R580/595). Gate `PUNKTFUNK_PIN_CLOCKS`, default off on battery/Deck.
|
||||
- **§5.F** — frame-source escape (only if §3 says (b)): swapchain-hook capture (OBS-style, anti-cheat
|
||||
tradeoffs); NvFBC on Linux (keylase patch); compose-flip for the DLSS-FG half-rate case; WGC
|
||||
`MinUpdateInterval = 1e7/(fps*2)` 2×-rate tweak.
|
||||
- **§5.G** — iGPU / second-GPU encode offload: pin capture to the gaming adapter, encoder to the iGPU
|
||||
adapter, one cross-adapter shared-texture copy. Reuses the AMF/QSV/VAAPI backends.
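
The thread split in the §5.B item above can be sketched with the standard library alone. This is not NVENC code: `InFlight` and `blocking_retrieve` are hypothetical stand-ins for an in-flight encode and the blocking `lock_bitstream` call, and the bounded channel stands in for the surface pool.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;

/// Hypothetical handle for a submitted frame; real code would carry the output buffer.
struct InFlight {
    frame_id: u64,
}

/// Stand-in for the blocking retrieve; returns a fake access unit.
fn blocking_retrieve(job: InFlight) -> Vec<u8> {
    job.frame_id.to_le_bytes().to_vec()
}

/// Submits `frames` jobs from the calling thread while a dedicated thread retrieves them.
/// Returns the frame ids in the order their access units came back.
fn run_pipeline(frames: u64, pool_depth: usize) -> Vec<u64> {
    // Bounded queue: the submit side only blocks once `pool_depth` jobs are queued.
    let (submit_tx, submit_rx) = sync_channel::<InFlight>(pool_depth);
    // Unbounded queue towards the (not modelled) network-send side.
    let (au_tx, au_rx) = channel::<(u64, Vec<u8>)>();

    let retriever = thread::spawn(move || {
        for job in submit_rx {
            let id = job.frame_id;
            let au = blocking_retrieve(job);
            if au_tx.send((id, au)).is_err() {
                break; // consumer went away
            }
        }
    });

    for frame_id in 0..frames {
        // The submit thread never waits on a completed encode, only on pool space.
        submit_tx
            .send(InFlight { frame_id })
            .expect("retrieve thread exited early");
    }
    drop(submit_tx); // closes the queue so the retrieve thread finishes

    retriever.join().expect("retrieve thread panicked");
    au_rx.iter().map(|(id, _)| id).collect()
}
```

The property worth noting is that the submit loop is decoupled from the retrieve wait, so several frames can be in flight at once, unlike a same-thread loop that blocks after every submit. Frames still come back in submission order because a single retrieve thread drains a FIFO.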
|
||||
|
||||
### Open evidence gaps (verify on-box)
|
||||
|
||||
- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
|
||||
confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
|
||||
`nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
|
||||
`nvidia-smi dmon` (sm% vs enc%) on the IDD-push/WGC path before assuming the win landed.
|
||||
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
|
||||
unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
|
||||
whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
|
||||
unmeasured. §3 + an A/B of IDD-push-RGB (pre-`3514702`) vs IDD-push-NV12 on the same scene settles it
|
||||
and tells you whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
|
||||
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
|
||||
whitepaper; treat the *direction* as solid, the magnitude as TBD.
|
||||
|
||||