perf(host/windows): move capture→encode off the 3D engine (NV12/P010 video-processor path, zero-copy, GPU priority)
apple / swift (push) Successful in 56s
ci / rust (push) Successful in 1m36s
android / android (push) Successful in 1m56s
ci / web (push) Successful in 27s
ci / docs-site (push) Successful in 28s
deb / build-publish (push) Successful in 2m26s
decky / build-publish (push) Successful in 11s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 5s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 5s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
ci / bench (push) Successful in 4m33s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 8m15s
docker / deploy-docs (push) Successful in 18s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 7m58s

The Windows host capped at ~60 fps with 35-40 ms latency on a GPU-heavy game:
the per-frame capture→encode path shared the 3D engine with the game and got
scheduled behind it. Rework to minimize 3D-engine work per frame:

- VideoConverter (D3D11 video processor): capture → NVENC-native NV12/P010 so
  NVENC skips its internal RGB→YUV (a 3D/compute step). Wired into both DDA
  (dxgi.rs) and WGC (wgc.rs). New PixelFormat::Nv12/P010 + NVENC YUV input.
- GPU scheduling hardening (Apollo-style): D3DKMTSetProcessSchedulingPriorityClass
  HIGH, absolute SetGPUThreadPriority, SetMaximumFrameLatency(1).
- WGC SDR zero-copy (hold pool frames; no CopyResource). DDA keeps a fast
  CopyResource to decouple its single-frame acquire/release from the async convert.
- Pipelined helper encode loop (PUNKTFUNK_ENCODE_DEPTH, default 1) + perf split
  (cap_wait / encode / write).

Live on the RTX 4090: hard 60 fps ceiling removed (now scene-scaling 40-200+),
latency much reduced. Residual cap in GPU-pinned scenes is the irreducible RGB→YUV
convert (no fixed-function unit on NVIDIA — VideoProcessing engine reads 0%) waiting
behind an uncapped game under WDDM context time-slicing; Linux avoids it via
gamescope capping the game to the display refresh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-17 13:08:03 +00:00
parent 15d3d423fa
commit 4cc57d5c39
8 changed files with 689 additions and 67 deletions
+7
View File
@@ -26,6 +26,13 @@ pub enum PixelFormat {
/// produces this: scRGB FP16 desktop pixels are converted to BT.2020 PQ and written here, then
/// handed to NVENC as `ABGR10` for an HEVC Main10 / HDR10 encode.
Rgb10a2,
/// `NV12` (DXGI `NV12`): 8-bit BT.709 limited-range YUV 4:2:0. Produced by the D3D11 **video
/// processor** (video engine, not the 3D engine) so the per-frame colour conversion doesn't fight a
/// GPU-saturating game; handed to NVENC as `NV12` (it encodes YUV natively — no internal RGB→YUV).
Nv12,
/// `P010` (DXGI `P010`): 10-bit BT.2020 PQ limited-range YUV 4:2:0. HDR analogue of [`Nv12`]:
/// video-processor output for HEVC Main10 / HDR10, handed to NVENC as `YUV420_10BIT`.
P010,
}
impl PixelFormat {