perf(host/linux): NV12 GPU convert — feed NVENC native YUV, off the contended SM (Tier 2A)
apple / swift (push) Successful in 54s
windows-host / package (push) Failing after 2m18s
ci / web (push) Successful in 32s
ci / rust (push) Failing after 5m2s
decky / build-publish (push) Successful in 11s
android / android (push) Failing after 49s
ci / docs-site (push) Successful in 35s
ci / bench (push) Failing after 3m15s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 3m49s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 15s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Failing after 40s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Failing after 28s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 3s
docker / deploy-docs (push) Has been skipped
deb / build-publish (push) Successful in 5m54s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Failing after 11s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Failing after 1m36s
apple / swift (push) Successful in 54s
windows-host / package (push) Failing after 2m18s
ci / web (push) Successful in 32s
ci / rust (push) Failing after 5m2s
decky / build-publish (push) Successful in 11s
android / android (push) Failing after 49s
ci / docs-site (push) Successful in 35s
ci / bench (push) Failing after 3m15s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 3m49s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 15s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Failing after 40s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Failing after 28s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 3s
docker / deploy-docs (push) Has been skipped
deb / build-publish (push) Successful in 5m54s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Failing after 11s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Failing after 1m36s
The Linux zero-copy tiled-GL path can now produce NV12 (BT.709 limited range) on the GPU and feed NVENC native YUV, deleting NVENC's internal RGB->YUV CSC — which runs on the SM/3D-compute engine a saturating game pins at 100% (the game-vs-encode contention headache). Windows already does this via the D3D11 video processor; this closes the Linux gap. See docs/host-latency-plan.md §2A. Gated behind PUNKTFUNK_NV12 (default OFF → the RGB/BGRx path is byte-for-byte unchanged; zero regression). Only the tiled EGL/GL path converts; the LINEAR/Vulkan-bridge (gamescope) path stays RGB. - zerocopy/egl.rs: Nv12Blit — BT.709 limited Y pass (R8, full-res) + UV pass (RG8, half-res, GL_LINEAR 2x2 average); both CUDA-registered; import_nv12. - zerocopy/cuda.rs: two-plane DeviceBuffer (Y W*H@1B + interleaved UV (W/2)*2 x H/2), paired Y+UV pool, copy_mapped_nv12 + copy_nv12_to_device, on the per-thread priority stream (dmabuf-recycle sync preserved). - encode/linux.rs: nvenc_input(Nv12)->NV12; submit_cuda copies two planes into NVENC's surface; VUI signalled BT.709 limited (colorspace/range/primaries/trc). - capture/linux.rs: gate (PUNKTFUNK_NV12 && tiled), report format Nv12. - main.rs + zerocopy/mod.rs: `nv12-selftest` subcommand. Validated on RTX 5070 Ti two ways: (1) nv12-selftest — synthetic RGBA->NV12 round-trip vs a BT.709 reference, max abs error Y=0.56/U=0.33/V=0.26 LSB; (2) live capture->NV12->NVENC->decode of animated red content matches the RGB path's colour (avg RGB 230,18,18 vs 231,18,20). build/clippy/fmt green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -103,10 +103,14 @@ fn nvenc_input(format: PixelFormat) -> (Pixel, bool) {
|
||||
PixelFormat::Rgba => (Pixel::RGBA, false),
|
||||
PixelFormat::Rgb => (Pixel::RGBZ, true), // RGB -> rgb0
|
||||
PixelFormat::Bgr => (Pixel::BGRZ, true), // BGR -> bgr0
|
||||
// Rgb10a2 (HDR) and NV12/P010 (the Windows video-processor YUV outputs) are produced only by
|
||||
// the Windows capture/encode paths; the Linux capturer never emits them. Map to BGRA so the
|
||||
// match is exhaustive — unreachable here.
|
||||
PixelFormat::Rgb10a2 | PixelFormat::Nv12 | PixelFormat::P010 => (Pixel::BGRA, false),
|
||||
// NV12 is native YUV: NVENC encodes it with NO internal RGB→YUV CSC (the Tier 2A win). On
|
||||
// Linux it's produced by the GPU convert on the zero-copy tiled path (`PUNKTFUNK_NV12`); on
|
||||
// Windows by the D3D11 video processor.
|
||||
PixelFormat::Nv12 => (Pixel::NV12, false),
|
||||
// Rgb10a2 (HDR) and P010 (the Windows 10-bit video-processor output) are produced only by
|
||||
// the Windows paths; the Linux capturer never emits them. Map to BGRA so the match is
|
||||
// exhaustive — unreachable here.
|
||||
PixelFormat::Rgb10a2 | PixelFormat::P010 => (Pixel::BGRA, false),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -204,6 +208,21 @@ impl NvencEncoder {
|
||||
(*video.as_mut_ptr()).gop_size = -1;
|
||||
}
|
||||
|
||||
// NV12 path: we did the RGB→YUV conversion ourselves as BT.709 *limited* range, so signal
|
||||
// that in the bitstream VUI (colorspace/range/primaries/transfer) — otherwise the client
|
||||
// decoder assumes a default and the picture comes out washed-out / wrong-contrast. The
|
||||
// RGB-input paths leave these unset (NVENC's internal CSC writes its own VUI). Matches the
|
||||
// Windows NV12 path's BT.709 limited-range signalling.
|
||||
if matches!(format, PixelFormat::Nv12) {
|
||||
unsafe {
|
||||
let raw = video.as_mut_ptr();
|
||||
(*raw).colorspace = ffi::AVColorSpace::AVCOL_SPC_BT709;
|
||||
(*raw).color_range = ffi::AVColorRange::AVCOL_RANGE_MPEG; // limited/studio
|
||||
(*raw).color_primaries = ffi::AVColorPrimaries::AVCOL_PRI_BT709;
|
||||
(*raw).color_trc = ffi::AVColorTransferCharacteristic::AVCOL_TRC_BT709;
|
||||
}
|
||||
}
|
||||
|
||||
// For the zero-copy path, take CUDA surfaces: wrap the shared CUcontext in CUDA
|
||||
// hwdevice/hwframes contexts and set `pix_fmt = CUDA` on the raw encoder context
|
||||
// *before* open (NVENC derives the device from `hw_frames_ctx`).
|
||||
@@ -419,9 +438,20 @@ impl NvencEncoder {
|
||||
ffi::av_frame_free(&mut f);
|
||||
bail!("av_hwframe_get_buffer(CUDA) failed ({r})");
|
||||
}
|
||||
let dst_ptr = (*f).data[0] as crate::zerocopy::cuda::CUdeviceptr;
|
||||
let dst_pitch = (*f).linesize[0] as usize;
|
||||
if let Err(e) = crate::zerocopy::cuda::copy_device_to_device(buf, dst_ptr, dst_pitch) {
|
||||
// NV12 surfaces are two-plane (Y in data[0], interleaved UV in data[1]); the RGB
|
||||
// surfaces are single-plane. Copy the matching layout into NVENC's pooled surface.
|
||||
let copy_res = if buf.is_nv12() {
|
||||
let y_ptr = (*f).data[0] as crate::zerocopy::cuda::CUdeviceptr;
|
||||
let y_pitch = (*f).linesize[0] as usize;
|
||||
let uv_ptr = (*f).data[1] as crate::zerocopy::cuda::CUdeviceptr;
|
||||
let uv_pitch = (*f).linesize[1] as usize;
|
||||
crate::zerocopy::cuda::copy_nv12_to_device(buf, y_ptr, y_pitch, uv_ptr, uv_pitch)
|
||||
} else {
|
||||
let dst_ptr = (*f).data[0] as crate::zerocopy::cuda::CUdeviceptr;
|
||||
let dst_pitch = (*f).linesize[0] as usize;
|
||||
crate::zerocopy::cuda::copy_device_to_device(buf, dst_ptr, dst_pitch)
|
||||
};
|
||||
if let Err(e) = copy_res {
|
||||
ffi::av_frame_free(&mut f);
|
||||
return Err(e).context("copy imported buffer into NVENC surface");
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user