punktfunk

Author	SHA1	Message	Date
enricobuehler	826da9968e	feat: M2 — Vulkan bridge: TRUE zero-copy for gamescope's LINEAR dmabufs (Phase 3) The missing zero-copy path is closed. NVIDIA's EGL won't sample LINEAR and the CUDA driver rejects raw dmabuf fds — but Vulkan imports dmabufs (VK_EXT_external_memory_dma_buf) and exports OPAQUE_FD memory that CUDA officially imports. zerocopy/vulkan.rs (ash): dmabuf fd → VkBuffer (import cached per fd) → vkCmdCopyBuffer (GPU) → exportable VkBuffer → vkGetMemoryFdKHR(OPAQUE_FD) → cuImportExternalMemory → CUdeviceptr The exportable buffer + CUDA mapping are per-resolution; per frame it's one GPU buffer copy (fence-waited) + one pitched CUDA copy into the encoder's pool. No CPU touches pixels. EglImporter::import_linear now routes through the bridge (lazy init; any failure still falls back to the CPU mmap path). cuda::ExternalDmabuf gained import_owned_fd for the Vulkan-exported fd. Validated live: gamescope 720p120 → "Vulkan→CUDA exportable staging buffer ready size=3686400" (exactly 1280×720×4), full-rate 122.7 fps, decoded frame pixel-correct (vkcube). KWin's tiled EGL path regression-tested intact. NV12 negotiation dropped — moot now that BGRx is fully zero-copy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 23:18:38 +00:00
enricobuehler	751789f932	feat: M2 — LINEAR-dmabuf CUDA import attempt + graceful zero-copy fallback (gamescope) gamescope only offers LINEAR dmabufs, which the EGL/GL interop path can't handle (NVIDIA's EGL lists no LINEAR modifier for sampling). Attempt a direct CUDA external-memory import (cuImportExternalMemory OPAQUE_FD, cached per buffer fd, one DtoD copy per frame into the pooled buffer): the FFI + plumbing are in place, and LINEAR(0) is now advertised alongside the tiled EGL modifiers (tiled first, so KWin still prefers it — regression-tested). Empirically the 595 desktop driver rejects raw dmabuf fds as OPAQUE_FD (CUDA_ERROR_UNKNOWN), matching the documented limitation — true LINEAR GPU import needs a Vulkan interop bridge (import dmabuf via VK_EXT_external_memory_dma_buf, GPU-copy into an exportable allocation, hand that to CUDA), noted as future work. So the importer now degrades instead of dying: on GPU-import failure it logs once, disables itself, and falls through to the CPU mmap path. Validated: gamescope + LUMEN_ZEROCOPY=1 runs full-rate (122.9 fps @720p120, valid HEVC) via the fallback; KWin keeps real zero-copy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 22:43:35 +00:00
enricobuehler	a473f4a926	perf: M2 — amortize per-frame zero-copy overhead (pool buffers + register once) The zero-copy import did real per-frame GPU churn that capped high-fps throughput: a fresh ~29MB cuMemAllocPitch + cuMemFree, a cuGraphicsGLRegisterImage/unregister, and a map of the same persistent blit texture — every frame. Two fixes: - BufferPool: a recycled free-list of pitched device buffers per resolution. DeviceBuffer returns its allocation to the pool on drop (after the encoder synchronized) instead of freeing — kills the per-frame 29MB alloc/free that took the device allocator lock and serialized against the GPU. - RegisteredTexture: register the (reused) GL_RGBA8 blit destination with CUDA ONCE when the GlBlit is built; each frame only maps → copies the array → unmaps, instead of registering/unregistering every frame. This is the "zero-copy should be overhead-free" cleanup. Verified the import still produces correct frames; the remaining per-frame cuCtxSynchronize pair (shared-context coupling) is the next step (CUDA stream + events). lumen-host builds, clippy/fmt/tests clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 20:38:26 +00:00
enricobuehler	aa91485008	feat: M2 — complete zero-copy dmabuf→NVENC capture path (EGL/GL→CUDA) The PipeWire dmabuf now reaches NVENC with no CPU touch. Verified live against headless KWin: a tiled BGRx dmabuf is imported and encoded to a pixel-correct H.265 stream (decoded frame matches the captured desktop — no tiling artifacts, no colour swap). The CPU-copy path stays the default and the runtime fallback. Capture side (zerocopy::egl): desktop NVIDIA can't register a dmabuf EGLImage with CUDA directly (cuGraphicsEGLRegisterImage is Tegra-only; cuGraphicsGLRegisterImage rejects EGLImage-backed textures), so we follow OBS/Sunshine — bind the EGLImage to a GL texture, render it through a fullscreen-triangle shader into an immutable GL_RGBA8 texture (de-tiling + .bgra swizzle to the BGRx the encoder wants), then register that texture with CUDA and copy it device-to-device into an owned buffer so the dmabuf returns to the compositor immediately. Encode side (encode/linux::submit_cuda): take a pooled CUDA surface via av_hwframe_get_buffer and device→device-copy our imported buffer into it, instead of wrapping our own pointer in a bare AVFrame. A bare frame is rejected with EINVAL (NVENC ignores frames with null buf[0]; the encode path's av_frame_ref needs a refcounted buffer), and a fresh device pointer every frame would thrash NVENC's bounded resource-registration cache — the pool recycles a small set. Also: gate FFmpeg AV_LOG_DEBUG behind LUMEN_FFMPEG_DEBUG for diagnosing hw-frame rejects, and refresh the now-accurate module docs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 16:28:29 +00:00
enricobuehler	e3876c0d8a	feat: M2 zero-copy — PipeWire dmabuf negotiation + EGL device-platform import (WIP) Wire the capture side of zero-copy (LUMEN_ZEROCOPY=1): - EGL importer now opens the headless EGLDisplay on the NVIDIA EGL device (EGL_PLATFORM_DEVICE_EXT) and queries its importable DRM modifiers (eglQueryDmaBufModifiersEXT). - The PipeWire stream advertises a BGRx dmabuf format with those modifiers as a mandatory enum Choice + a dmabuf-only Buffers param; the compositor fixates an importable tiled modifier. param_changed reads the negotiated modifier; the process callback imports the dmabuf (eglCreateImage with explicit LO/HI modifier) and would copy it into a CUDA buffer for the encoder. Validated against headless KWin (Plasma 6.4): negotiation succeeds (13 NVIDIA modifiers advertised, KWin fixates one, stream reaches Streaming with a real tiled dmabuf) and `eglCreateImage` succeeds. The remaining blocker is `cuGraphicsEGLRegisterImage` returning CUDA_ERROR_INVALID_VALUE on the dmabuf-imported EGLImage — the likely fix is to bind the EGLImage to a GL texture (glEGLImageTargetTexture2DOES) and register that via cuGraphicsGLRegisterImage (OBS/Sunshine's path), which needs a GL context. The CPU-copy path stays the default and is unaffected (regression-checked: real KWin capture → HEVC). LUMEN_ZEROCOPY is opt-in/experimental until the CUDA registration lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 15:41:31 +00:00
enricobuehler	16a00563a8	feat: M2 zero-copy foundation — EGL→CUDA import + NVENC CUDA-frame path Scaffolding for dmabuf zero-copy (plan §9), opt-in via LUMEN_ZEROCOPY: - src/zerocopy/{cuda,egl}.rs: hand-rolled CUDA Driver-API FFI (no Rust crate exposes the EGL-interop calls / CUeglFrame) with a shared process-wide CUcontext + pitched device buffers; an EGL importer (GBM platform on the NVIDIA render node) that turns a dmabuf into an EGLImage, registers it with CUDA, and copies it device-to-device into an owned buffer. `zerocopy-probe` subcommand validates the FFI/linking/GPU access — confirmed on the box (driver 595, EGL_EXT_image_dma_buf_import + modifiers). - CapturedFrame gains a FramePayload enum (Cpu(Vec<u8>) | Cuda(DeviceBuffer)); the encoder branches: CPU keeps the expand+upload path, CUDA wraps the device buffer in an AV_PIX_FMT_CUDA frame fed straight to hevc_nvenc (sharing our CUcontext via a hand-declared AVCUDADeviceContext, since ffmpeg-sys doesn't bind hwcontext_cuda.h). open_video/the encoder take a `cuda` flag derived from the first frame's payload. The capture-side dmabuf negotiation (which produces the Cuda frames) is the next step; the CPU path is unchanged and remains the default + fallback. Builds clean, clippy clean, tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 15:13:05 +00:00
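
Commit a473f4a926 describes a BufferPool whose DeviceBuffer returns its allocation to a per-resolution free-list on drop instead of freeing it. The following is a minimal sketch of that recycling pattern, not the repository's code: the allocation is a plain host `Vec<u8>` standing in for a pitched CUDA buffer, and `RawAlloc`, `acquire`, `allocations` and `idle` are hypothetical names introduced for illustration. It is single-threaded (`Rc<RefCell<..>>`); a pool shared across threads would presumably need `Arc<Mutex<..>>`.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for one pitched device allocation.
/// Real code would hold a CUDA device pointer and pitch, not host memory.
struct RawAlloc {
    width: u32,
    height: u32,
    bytes: Vec<u8>,
}

#[derive(Default)]
struct PoolInner {
    /// Idle allocations, keyed by resolution.
    free: HashMap<(u32, u32), Vec<RawAlloc>>,
    /// Number of fresh allocations made so far (for illustration only).
    allocations: usize,
}

/// Cheaply clonable handle to a shared free-list.
#[derive(Clone, Default)]
pub struct BufferPool {
    inner: Rc<RefCell<PoolInner>>,
}

/// A buffer on loan from the pool; it goes back to the pool when dropped.
pub struct DeviceBuffer {
    alloc: Option<RawAlloc>,
    pool: BufferPool,
}

impl BufferPool {
    pub fn new() -> Self {
        Self::default()
    }

    /// Reuse an idle buffer of this resolution, or allocate a new one.
    pub fn acquire(&self, width: u32, height: u32) -> DeviceBuffer {
        let mut inner = self.inner.borrow_mut();
        let recycled = inner.free.get_mut(&(width, height)).and_then(|v| v.pop());
        let alloc = match recycled {
            Some(a) => a,
            None => {
                inner.allocations += 1;
                // Assumes 4 bytes per pixel (BGRx) and no row padding.
                let len = width as usize * height as usize * 4;
                RawAlloc { width, height, bytes: vec![0; len] }
            }
        };
        DeviceBuffer { alloc: Some(alloc), pool: self.clone() }
    }

    pub fn allocations(&self) -> usize {
        self.inner.borrow().allocations
    }

    pub fn idle(&self, width: u32, height: u32) -> usize {
        self.inner.borrow().free.get(&(width, height)).map_or(0, Vec::len)
    }
}

impl DeviceBuffer {
    pub fn len(&self) -> usize {
        self.alloc.as_ref().map_or(0, |a| a.bytes.len())
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // Return the allocation to the free-list instead of freeing it.
        if let Some(a) = self.alloc.take() {
            let key = (a.width, a.height);
            self.pool.inner.borrow_mut().free.entry(key).or_default().push(a);
        }
    }
}
```

Acquiring, dropping and re-acquiring a 1280×720 buffer performs one allocation in this sketch; the second `acquire` pops the recycled entry. The commit notes that the real buffer is only returned after the encoder has synchronized, which this sketch does not model.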

6 Commits
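
Commit 751789f932 says the importer "logs once, disables itself, and falls through to the CPU mmap path" when the GPU import fails. A rough sketch of that degrade-once control flow follows. It is an illustration under assumptions, not the project's implementation: `FallbackImporter`, `Payload` and the closure signatures are hypothetical, the GPU handle is a bare `u64`, and the warning is collected in a `Vec` where real code would presumably call a logging macro.

```rust
/// Which path produced the frame (hypothetical, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Payload {
    /// Opaque device handle from a successful GPU import.
    Gpu(u64),
    /// Pixels copied through the CPU path.
    Cpu(Vec<u8>),
}

/// Tries the GPU import until it fails once, then permanently uses the CPU path.
pub struct FallbackImporter<G, C> {
    gpu_import: G,
    cpu_copy: C,
    gpu_disabled: bool,
    /// Stand-in for a log sink; holds at most one entry.
    pub warnings: Vec<String>,
}

impl<G, C> FallbackImporter<G, C>
where
    G: FnMut(i32) -> Result<u64, String>,
    C: FnMut(i32) -> Vec<u8>,
{
    pub fn new(gpu_import: G, cpu_copy: C) -> Self {
        Self { gpu_import, cpu_copy, gpu_disabled: false, warnings: Vec::new() }
    }

    /// Import the buffer behind `fd`, preferring the GPU path while it works.
    pub fn import(&mut self, fd: i32) -> Payload {
        if !self.gpu_disabled {
            match (self.gpu_import)(fd) {
                Ok(handle) => return Payload::Gpu(handle),
                Err(err) => {
                    // First failure: record it once and never retry the GPU path.
                    self.gpu_disabled = true;
                    self.warnings
                        .push(format!("GPU import failed ({err}); using CPU path"));
                }
            }
        }
        Payload::Cpu((self.cpu_copy)(fd))
    }
}
```

Disabling after the first failure keeps a driver that always rejects the import (as the commit reports for raw dmabuf fds) from paying for a failed attempt and a log line on every frame.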