4cc57d5c39
apple / swift (push) Successful in 56s
ci / rust (push) Successful in 1m36s
android / android (push) Successful in 1m56s
ci / web (push) Successful in 27s
ci / docs-site (push) Successful in 28s
deb / build-publish (push) Successful in 2m26s
decky / build-publish (push) Successful in 11s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 5s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 5s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
ci / bench (push) Successful in 4m33s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 8m15s
docker / deploy-docs (push) Successful in 18s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 7m58s
The Windows host capped at ~60 fps with 35-40 ms latency on a GPU-heavy game: the per-frame capture→encode path shared the 3D engine with the game and got scheduled behind it. Rework to minimize 3D-engine work per frame: - VideoConverter (D3D11 video processor): capture → NVENC-native NV12/P010 so NVENC skips its internal RGB→YUV (a 3D/compute step). Wired into both DDA (dxgi.rs) and WGC (wgc.rs). New PixelFormat::Nv12/P010 + NVENC YUV input. - GPU scheduling hardening (Apollo-style): D3DKMTSetProcessSchedulingPriorityClass HIGH, absolute SetGPUThreadPriority, SetMaximumFrameLatency(1). - WGC SDR zero-copy (hold pool frames; no CopyResource). DDA keeps a fast CopyResource to decouple its single-frame acquire/release from the async convert. - Pipelined helper encode loop (PUNKTFUNK_ENCODE_DEPTH, default 1) + perf split (cap_wait / encode / write). Live on the RTX 4090: hard 60 fps ceiling removed (now scene-scaling 40-200+), latency much reduced. Residual cap in GPU-pinned scenes is the irreducible RGB→YUV convert (no fixed-function unit on NVIDIA — VideoProcessing engine reads 0%) waiting behind an uncapped game under WDDM context time-slicing; Linux avoids it via gamescope capping the game to the display refresh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
52 lines
2.3 KiB
Rust
52 lines
2.3 KiB
Rust
//! Zero-copy capture→encode (plan §9): the PipeWire dmabuf is imported into CUDA via EGL and
|
|
//! handed straight to NVENC, eliminating the per-frame CPU copies (at 5K the CPU-copy path
|
|
//! moves ~3.5 GB/s). Opt in with `PUNKTFUNK_ZEROCOPY=1`; the CPU-copy path stays the default and
|
|
//! the runtime fallback (foreign-allocator / no-dmabuf / import failure).
|
|
//!
|
|
//! Pieces: [`cuda`] (driver-API FFI + the shared `CUcontext` + device buffers), [`egl`] (the
|
|
//! headless EGLDisplay + dmabuf→`EGLImage`→CUDA import). The encoder's CUDA-frame path lives in
|
|
//! `encode/linux.rs`; the dmabuf negotiation lives in `capture/linux.rs`.
|
|
|
|
pub mod cuda;
|
|
pub mod egl;
|
|
pub mod vulkan;
|
|
|
|
pub use cuda::DeviceBuffer;
|
|
pub use egl::{DmabufPlane, EglImporter};
|
|
|
|
/// Whether the zero-copy path is opted in (`PUNKTFUNK_ZEROCOPY` truthy).
|
|
pub fn enabled() -> bool {
|
|
std::env::var("PUNKTFUNK_ZEROCOPY")
|
|
.map(|v| matches!(v.trim(), "1" | "true" | "yes" | "on"))
|
|
.unwrap_or(false)
|
|
}
|
|
|
|
/// DRM FourCC for a packed 32-bit format name (little-endian, e.g. `b"XR24"`).
|
|
const fn fourcc(c: &[u8; 4]) -> u32 {
|
|
(c[0] as u32) | ((c[1] as u32) << 8) | ((c[2] as u32) << 16) | ((c[3] as u32) << 24)
|
|
}
|
|
|
|
/// Map a SPA/our [`crate::capture::PixelFormat`] to the DRM FourCC EGL expects for import.
|
|
/// SPA byte order `BGRx` ⇒ DRM `XRGB8888` (memory B,G,R,X), etc.
|
|
pub fn drm_fourcc(format: crate::capture::PixelFormat) -> Option<u32> {
|
|
use crate::capture::PixelFormat::*;
|
|
Some(match format {
|
|
Bgrx => fourcc(b"XR24"), // DRM_FORMAT_XRGB8888
|
|
Bgra => fourcc(b"AR24"), // DRM_FORMAT_ARGB8888
|
|
Rgbx => fourcc(b"XB24"), // DRM_FORMAT_XBGR8888
|
|
Rgba => fourcc(b"AB24"), // DRM_FORMAT_ABGR8888
|
|
// 24-bit packed RGB/BGR have no straightforward dmabuf import here; use the CPU path.
|
|
// Rgb10a2/Nv12/P010 are the Windows HDR / video-processor formats — never produced on Linux.
|
|
Rgb | Bgr | Rgb10a2 | Nv12 | P010 => return None,
|
|
})
|
|
}
|
|
|
|
/// Standalone probe (the `zerocopy-probe` subcommand): initialize the EGL importer + CUDA
|
|
/// context and report. De-risks the FFI/linking/GPU-access without needing a capture session.
|
|
pub fn probe() -> anyhow::Result<()> {
|
|
let _importer = EglImporter::new()?;
|
|
let ctx = cuda::context()?;
|
|
tracing::info!(cuda_ctx = ?ctx, "zero-copy probe OK — EGL display + CUDA context initialized");
|
|
Ok(())
|
|
}
|