perf(host/linux): NV12 GPU convert — feed NVENC native YUV, off the contended SM (Tier 2A)
apple / swift (push) Successful in 54s
windows-host / package (push) Failing after 2m18s
ci / web (push) Successful in 32s
ci / rust (push) Failing after 5m2s
decky / build-publish (push) Successful in 11s
android / android (push) Failing after 49s
ci / docs-site (push) Successful in 35s
ci / bench (push) Failing after 3m15s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 3m49s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 15s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Failing after 40s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Failing after 28s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 3s
docker / deploy-docs (push) Has been skipped
deb / build-publish (push) Successful in 5m54s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Failing after 11s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Failing after 1m36s

The Linux zero-copy tiled-GL path can now produce NV12 (BT.709 limited range)
on the GPU and feed NVENC native YUV, deleting NVENC's internal RGB->YUV CSC —
which runs on the SM/3D-compute engine a saturating game pins at 100% (the
game-vs-encode contention headache). Windows already does this via the D3D11
video processor; this closes the Linux gap. See docs/host-latency-plan.md §2A.

Gated behind PUNKTFUNK_NV12 (default OFF → the RGB/BGRx path is byte-for-byte
unchanged; zero regression). Only the tiled EGL/GL path converts; the
LINEAR/Vulkan-bridge (gamescope) path stays RGB.

- zerocopy/egl.rs: Nv12Blit — BT.709 limited Y pass (R8, full-res) + UV pass
  (RG8, half-res, GL_LINEAR 2x2 average); both CUDA-registered; import_nv12.
- zerocopy/cuda.rs: two-plane DeviceBuffer (Y W*H@1B + interleaved UV
  (W/2)*2 x H/2), paired Y+UV pool, copy_mapped_nv12 + copy_nv12_to_device,
  on the per-thread priority stream (dmabuf-recycle sync preserved).
- encode/linux.rs: nvenc_input(Nv12)->NV12; submit_cuda copies two planes into
  NVENC's surface; VUI signalled BT.709 limited (colorspace/range/primaries/trc).
- capture/linux.rs: gate (PUNKTFUNK_NV12 && tiled), report format Nv12.
- main.rs + zerocopy/mod.rs: `nv12-selftest` subcommand.

Validated on RTX 5070 Ti two ways: (1) nv12-selftest — synthetic RGBA->NV12
round-trip vs a BT.709 reference, max abs error Y=0.56/U=0.33/V=0.26 LSB;
(2) live capture->NV12->NVENC->decode of animated red content matches the RGB
path's colour (avg RGB 230,18,18 vs 231,18,20). build/clippy/fmt green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-18 23:39:11 +00:00
parent a58b6b8e76
commit 1fc6f73784
6 changed files with 792 additions and 24 deletions
+5
View File
@@ -125,6 +125,11 @@ fn real_main() -> Result<()> {
// Zero-copy FFI/GPU probe: init the EGL importer + CUDA context (no capture needed).
#[cfg(target_os = "linux")]
Some("zerocopy-probe") => zerocopy::probe(),
// NV12 colour self-test (no display/capture needed): convert a known RGBA pattern to NV12
// on the GPU and compare against a BT.709 limited-range reference. Validates the Tier 2A
// `PUNKTFUNK_NV12` convert is colour-correct. Prints PASS/FAIL + max Y/U/V error.
#[cfg(target_os = "linux")]
Some("nv12-selftest") => zerocopy::nv12_selftest(),
// Compositor readiness probe: exit 0 iff the (detected or PUNKTFUNK_COMPOSITOR-forced)
// compositor is up and able to create a virtual output *now*. A session-bringup
// script polls this to gate on real readiness instead of a blind `sleep`.