perf(host/windows): move capture→encode off the 3D engine (NV12/P010 video-processor path, zero-copy, GPU priority)
apple / swift (push) Successful in 56s
ci / rust (push) Successful in 1m36s
android / android (push) Successful in 1m56s
ci / web (push) Successful in 27s
ci / docs-site (push) Successful in 28s
deb / build-publish (push) Successful in 2m26s
decky / build-publish (push) Successful in 11s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 5s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 5s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
ci / bench (push) Successful in 4m33s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 8m15s
docker / deploy-docs (push) Successful in 18s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 7m58s

The Windows host capped at ~60 fps with 35-40 ms latency on a GPU-heavy game:
the per-frame capture→encode path shared the 3D engine with the game and got
scheduled behind it. Rework to minimize 3D-engine work per frame:

- VideoConverter (D3D11 video processor): capture → NVENC-native NV12/P010 so
  NVENC skips its internal RGB→YUV (a 3D/compute step). Wired into both DDA
  (dxgi.rs) and WGC (wgc.rs). New PixelFormat::Nv12/P010 + NVENC YUV input.
- GPU scheduling hardening (Apollo-style): D3DKMTSetProcessSchedulingPriorityClass
  HIGH, absolute SetGPUThreadPriority, SetMaximumFrameLatency(1).
- WGC SDR zero-copy (hold pool frames; no CopyResource). DDA keeps a fast
  CopyResource to decouple its single-frame acquire/release from the async convert.
- Pipelined helper encode loop (PUNKTFUNK_ENCODE_DEPTH, default 1) + perf split
  (cap_wait / encode / write).

Live on the RTX 4090: hard 60 fps ceiling removed (now scene-scaling 40-200+),
latency much reduced. Residual cap in GPU-pinned scenes is the irreducible RGB→YUV
convert (no fixed-function unit on NVIDIA — VideoProcessing engine reads 0%) waiting
behind an uncapped game under WDDM context time-slicing; Linux avoids it via
gamescope capping the game to the display refresh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-17 13:08:03 +00:00
parent 15d3d423fa
commit 4cc57d5c39
8 changed files with 689 additions and 67 deletions
+4 -3
View File
@@ -103,9 +103,10 @@ fn nvenc_input(format: PixelFormat) -> (Pixel, bool) {
PixelFormat::Rgba => (Pixel::RGBA, false),
PixelFormat::Rgb => (Pixel::RGBZ, true), // RGB -> rgb0
PixelFormat::Bgr => (Pixel::BGRZ, true), // BGR -> bgr0
// 10-bit HDR (R10G10B10A2) is produced only by the Windows DXGI HDR capture path; the Linux
// capturer never emits it. Map to BGRA so the match is exhaustive — unreachable here.
PixelFormat::Rgb10a2 => (Pixel::BGRA, false),
// Rgb10a2 (HDR) and NV12/P010 (the Windows video-processor YUV outputs) are produced only by
// the Windows capture/encode paths; the Linux capturer never emits them. Map to BGRA so the
// match is exhaustive — unreachable here.
PixelFormat::Rgb10a2 | PixelFormat::Nv12 | PixelFormat::P010 => (Pixel::BGRA, false),
}
}