The missing LINEAR zero-copy path is now filled in. NVIDIA's EGL won't sample LINEAR and the
CUDA driver rejects raw dmabuf fds, but Vulkan imports dmabufs (VK_EXT_external_memory_dma_buf)
and exports OPAQUE_FD memory that CUDA officially imports. zerocopy/vulkan.rs (ash) bridges the two:
dmabuf fd → VkBuffer (import cached per fd) → vkCmdCopyBuffer (GPU) →
exportable VkBuffer → vkGetMemoryFdKHR(OPAQUE_FD) → cuImportExternalMemory → CUdeviceptr
The exportable buffer + CUDA mapping are per-resolution; per frame it's one GPU buffer copy
(fence-waited) + one pitched CUDA copy into the encoder's pool. No CPU touches pixels.
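The "pitched" copy above is a 2D copy in which source and destination rows may start at different strides. A minimal host-side model of that row/pitch arithmetic is sketched below in plain Rust; it makes no CUDA calls, and `copy_pitched` and its parameters are hypothetical names rather than anything from the codebase.

```rust
/// Host-side model of a 2D pitched copy: copies `height` rows of
/// `width_bytes` each, where consecutive rows start `src_pitch` /
/// `dst_pitch` bytes apart. Padding bytes in `dst` are left untouched.
fn copy_pitched(
    src: &[u8],
    src_pitch: usize,
    dst: &mut [u8],
    dst_pitch: usize,
    width_bytes: usize,
    height: usize,
) -> Result<(), String> {
    if width_bytes > src_pitch || width_bytes > dst_pitch {
        return Err("row width exceeds pitch".to_string());
    }
    if height == 0 {
        return Ok(());
    }
    // The last row only needs `width_bytes`, not a full pitch.
    let needed = |pitch: usize| (height - 1) * pitch + width_bytes;
    if src.len() < needed(src_pitch) || dst.len() < needed(dst_pitch) {
        return Err("buffer too small for the requested copy".to_string());
    }
    for row in 0..height {
        let s = row * src_pitch;
        let d = row * dst_pitch;
        dst[d..d + width_bytes].copy_from_slice(&src[s..s + width_bytes]);
    }
    Ok(())
}
```

For a tightly packed BGRx source, `src_pitch` would equal `width * 4`, while the destination pitch is whatever the allocator chose for the pooled buffer.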
EglImporter::import_linear now routes through the bridge (lazy init; any failure still falls
back to the CPU mmap path). cuda::ExternalDmabuf gained import_owned_fd for the
Vulkan-exported fd.
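The "import cached per fd" idea can be illustrated with a small map keyed by fd. This is only a sketch: `ImportCache`, `Handle` and `get_or_import` are hypothetical names, and the closure stands in for the real Vulkan import.

```rust
use std::collections::HashMap;

/// Caches one import result per dmabuf fd, so a buffer the compositor
/// reuses is imported only once. `Handle` stands in for whatever the
/// real import yields (hypothetical here).
struct ImportCache<Handle> {
    by_fd: HashMap<i32, Handle>,
    imports: usize,
}

impl<Handle: Clone> ImportCache<Handle> {
    fn new() -> Self {
        Self { by_fd: HashMap::new(), imports: 0 }
    }

    /// Returns the cached handle for `fd`, or runs `import` once and
    /// remembers its result. Failed imports are not cached.
    fn get_or_import<E>(
        &mut self,
        fd: i32,
        import: impl FnOnce(i32) -> Result<Handle, E>,
    ) -> Result<Handle, E> {
        if let Some(handle) = self.by_fd.get(&fd) {
            return Ok(handle.clone());
        }
        let handle = import(fd)?;
        self.imports += 1;
        self.by_fd.insert(fd, handle.clone());
        Ok(handle)
    }
}
```

One caveat worth keeping in mind: fd numbers are reused by the kernel after close, so a cache like this needs to be invalidated whenever the stream's buffers are renegotiated.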
Validated live: gamescope 720p120 → "Vulkan→CUDA exportable staging buffer ready
size=3686400" (exactly 1280*720*4), full-rate 122.7 fps, decoded frame pixel-correct
(vkcube). KWin's tiled EGL path regression-tested intact. NV12 negotiation dropped — moot
now that BGRx is fully zero-copy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gamescope only offers LINEAR dmabufs, which the EGL/GL interop path can't handle (NVIDIA's
EGL lists no LINEAR modifier for sampling). Attempt a direct CUDA external-memory import
(cuImportExternalMemory OPAQUE_FD, cached per buffer fd, one DtoD copy per frame into the
pooled buffer): the FFI + plumbing are in place, and LINEAR(0) is now advertised alongside
the tiled EGL modifiers (tiled first, so KWin still prefers it — regression-tested).
Empirically the 595 desktop driver rejects raw dmabuf fds as OPAQUE_FD (CUDA_ERROR_UNKNOWN),
matching the documented limitation — true LINEAR GPU import needs a Vulkan interop bridge
(import dmabuf via VK_EXT_external_memory_dma_buf, GPU-copy into an exportable allocation,
hand that to CUDA), noted as future work. So the importer now degrades instead of dying:
on GPU-import failure it logs once, disables itself, and falls through to the CPU mmap path.
Validated: gamescope + LUMEN_ZEROCOPY=1 runs full-rate (122.9 fps @720p120, valid HEVC) via
the fallback; KWin keeps real zero-copy.
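The "log once, disable, fall through" behaviour described above can be sketched as follows. The type and field names (`LinearImporter`, `Path`, `gpu_attempts`) are hypothetical, and the closure stands in for the real CUDA import call.

```rust
/// Which path a frame ended up on.
#[derive(Debug, PartialEq)]
enum Path {
    Gpu,
    CpuMmap,
}

/// Tries the GPU import until it fails once, then stops trying.
struct LinearImporter<F>
where
    F: FnMut(i32) -> Result<(), String>,
{
    gpu_import: F,
    disabled: bool,
    gpu_attempts: u32,
}

impl<F> LinearImporter<F>
where
    F: FnMut(i32) -> Result<(), String>,
{
    fn new(gpu_import: F) -> Self {
        Self { gpu_import, disabled: false, gpu_attempts: 0 }
    }

    fn import(&mut self, fd: i32) -> Path {
        if !self.disabled {
            self.gpu_attempts += 1;
            match (self.gpu_import)(fd) {
                Ok(()) => return Path::Gpu,
                Err(err) => {
                    // Logged exactly once: later frames skip this branch.
                    eprintln!("GPU import failed ({}); using the CPU mmap path", err);
                    self.disabled = true;
                }
            }
        }
        Path::CpuMmap
    }
}
```

Disabling after the first failure keeps a persistently failing driver call off the per-frame hot path and avoids flooding the log.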
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The zero-copy import churned GPU resources on every frame, which capped high-fps throughput:
a fresh ~29MB cuMemAllocPitch + cuMemFree, a cuGraphicsGLRegisterImage/unregister pair, and a
map of the *same* persistent blit texture, all repeated per frame. Two fixes:
- BufferPool: a recycled free-list of pitched device buffers per resolution. DeviceBuffer
returns its allocation to the pool on drop (after the encoder has synchronized) instead of
freeing it. This removes the per-frame 29MB alloc/free, which took the device allocator
lock and serialized against the GPU.
- RegisteredTexture: register the (reused) GL_RGBA8 blit destination with CUDA ONCE when the
GlBlit is built; each frame only maps → copies the array → unmaps, instead of
registering/unregistering every frame.
This is the "zero-copy should be overhead-free" cleanup. Verified the import still produces
correct frames. The next step is removing the remaining per-frame cuCtxSynchronize pair
(shared-context coupling) by moving to a CUDA stream + events. lumen-host builds,
clippy/fmt/tests clean.
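The return-to-pool-on-drop pattern behind BufferPool can be sketched in plain Rust. This is an illustration under assumptions, not the project's code: a `Vec<u8>` stands in for the pitched device allocation, and the `Arc<Mutex<..>>` sharing, the field names and the `allocations` counter are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Resolution = (u32, u32);

#[derive(Default)]
struct PoolState {
    free: HashMap<Resolution, Vec<Vec<u8>>>,
    allocations: usize, // counts real allocations, for illustration only
}

#[derive(Clone, Default)]
struct BufferPool(Arc<Mutex<PoolState>>);

/// A buffer that hands its memory back to the pool when dropped.
struct DeviceBuffer {
    resolution: Resolution,
    memory: Option<Vec<u8>>, // stand-in for a pitched device allocation
    pool: BufferPool,
}

impl BufferPool {
    fn acquire(&self, width: u32, height: u32) -> DeviceBuffer {
        let resolution = (width, height);
        let mut state = self.0.lock().unwrap();
        let recycled = state.free.get_mut(&resolution).and_then(|list| list.pop());
        let memory = match recycled {
            Some(memory) => memory,
            None => {
                // Only reached when the free-list for this resolution is empty.
                state.allocations += 1;
                vec![0u8; width as usize * height as usize * 4]
            }
        };
        DeviceBuffer { resolution, memory: Some(memory), pool: self.clone() }
    }

    fn allocations(&self) -> usize {
        self.0.lock().unwrap().allocations
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        if let Some(memory) = self.memory.take() {
            let mut state = self.pool.0.lock().unwrap();
            state.free.entry(self.resolution).or_default().push(memory);
        }
    }
}
```

In steady state the number of real allocations equals the peak number of buffers in flight, not the number of frames.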
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The PipeWire dmabuf now reaches NVENC with no CPU touch. Verified live against
headless KWin: a tiled BGRx dmabuf is imported and encoded to a pixel-correct
H.265 stream (decoded frame matches the captured desktop — no tiling artifacts,
no colour swap). The CPU-copy path stays the default and the runtime fallback.
Capture side (zerocopy::egl): desktop NVIDIA can't register a dmabuf EGLImage
with CUDA directly (cuGraphicsEGLRegisterImage is Tegra-only; cuGraphicsGLRegisterImage
rejects EGLImage-backed textures), so we follow OBS/Sunshine — bind the EGLImage
to a GL texture, render it through a fullscreen-triangle shader into an immutable
GL_RGBA8 texture (de-tiling + .bgra swizzle to the BGRx the encoder wants), then
register that texture with CUDA and copy it device-to-device into an owned buffer
so the dmabuf returns to the compositor immediately.
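As a reading aid for the `.bgra` swizzle mentioned above, here is a CPU-side model of what that component reordering does to one 4-byte pixel. The real work happens in the fragment shader on the GPU; the function names below are hypothetical.

```rust
/// Models a `.bgra` swizzle on one 4-component pixel: the output is
/// (b, g, r, a) of the input, so components 0 and 2 trade places.
fn swizzle_bgra(px: [u8; 4]) -> [u8; 4] {
    [px[2], px[1], px[0], px[3]]
}

/// Applies the same reordering in place to a packed 4-bytes-per-pixel
/// buffer. Trailing bytes that do not form a whole pixel are ignored.
fn swizzle_frame(pixels: &mut [u8]) {
    for px in pixels.chunks_exact_mut(4) {
        px.swap(0, 2);
    }
}
```

The swizzle is its own inverse, so applying it twice returns the original pixel.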
Encode side (encode/linux::submit_cuda): take a *pooled* CUDA surface via
av_hwframe_get_buffer and device→device-copy our imported buffer into it, instead
of wrapping our own pointer in a bare AVFrame. A bare frame is rejected with
EINVAL (NVENC ignores frames with null buf[0]; the encode path's av_frame_ref
needs a refcounted buffer), and a fresh device pointer every frame would thrash
NVENC's bounded resource-registration cache — the pool recycles a small set.
Also: gate FFmpeg AV_LOG_DEBUG behind LUMEN_FFMPEG_DEBUG for diagnosing
hw-frame rejects, and refresh the now-accurate module docs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the capture side of zero-copy (LUMEN_ZEROCOPY=1):
- EGL importer now opens the headless EGLDisplay on the NVIDIA EGL device
(EGL_PLATFORM_DEVICE_EXT) and queries its importable DRM modifiers
(eglQueryDmaBufModifiersEXT).
- The PipeWire stream advertises a BGRx dmabuf format with those modifiers as a
mandatory enum Choice + a dmabuf-only Buffers param; the compositor fixates an
importable tiled modifier. param_changed reads the negotiated modifier; the
process callback imports the dmabuf (eglCreateImage with explicit LO/HI
modifier) and would copy it into a CUDA buffer for the encoder.
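The "explicit LO/HI modifier" refers to EGL taking the 64-bit DRM format modifier as two 32-bit attribute values (EGL_DMA_BUF_PLANE0_MODIFIER_LO_EXT / _HI_EXT). A minimal sketch of that split follows; the helper names are hypothetical and the value used in the note afterwards is an arbitrary example, not a modifier observed in this work.

```rust
/// Splits a 64-bit DRM format modifier into the (low, high) 32-bit
/// halves that the EGL dmabuf-import attributes expect.
fn split_modifier(modifier: u64) -> (u32, u32) {
    let lo = (modifier & 0xFFFF_FFFF) as u32;
    let hi = (modifier >> 32) as u32;
    (lo, hi)
}

/// Inverse of `split_modifier`, useful for logging or comparisons.
fn join_modifier(lo: u32, hi: u32) -> u64 {
    (u64::from(hi) << 32) | u64::from(lo)
}
```

For example, `split_modifier(0x0300_0000_0060_6014)` yields `(0x0060_6014, 0x0300_0000)`, and LINEAR (modifier 0) splits into `(0, 0)`.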
Validated against headless KWin (Plasma 6.4): negotiation succeeds (13 NVIDIA
modifiers advertised, KWin fixates one, stream reaches Streaming with a real
tiled dmabuf) and `eglCreateImage` succeeds. The remaining blocker is
`cuGraphicsEGLRegisterImage` returning CUDA_ERROR_INVALID_VALUE on the
dmabuf-imported EGLImage — the likely fix is to bind the EGLImage to a GL
texture (glEGLImageTargetTexture2DOES) and register that via
cuGraphicsGLRegisterImage (OBS/Sunshine's path), which needs a GL context.
The CPU-copy path stays the default and is unaffected (regression-checked: real
KWin capture → HEVC). LUMEN_ZEROCOPY is opt-in/experimental until the CUDA
registration lands.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Scaffolding for dmabuf zero-copy (plan §9), opt-in via LUMEN_ZEROCOPY:
- src/zerocopy/{cuda,egl}.rs: hand-rolled CUDA Driver-API FFI (no Rust crate
exposes the EGL-interop calls / CUeglFrame) with a shared process-wide
CUcontext + pitched device buffers; an EGL importer (GBM platform on the
NVIDIA render node) that turns a dmabuf into an EGLImage, registers it with
CUDA, and copies it device-to-device into an owned buffer. `zerocopy-probe`
subcommand validates the FFI/linking/GPU access — confirmed on the box
(driver 595, EGL_EXT_image_dma_buf_import + modifiers).
- CapturedFrame gains a FramePayload enum (Cpu(Vec<u8>) | Cuda(DeviceBuffer));
the encoder branches: CPU keeps the expand+upload path, CUDA wraps the device
buffer in an AV_PIX_FMT_CUDA frame fed straight to hevc_nvenc (sharing our
CUcontext via a hand-declared AVCUDADeviceContext, since ffmpeg-sys doesn't
bind hwcontext_cuda.h). open_video/the encoder take a `cuda` flag derived from
the first frame's payload.
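The payload split and the `cuda` flag derived from the first frame might look roughly like this. Only `CapturedFrame`, `FramePayload`, `Cpu`, `Cuda` and `DeviceBuffer` come from the description above; the fields, `EncodePath`, `wants_cuda` and `encode_path` are hypothetical stand-ins.

```rust
/// Stand-in for the real pitched device buffer (fields are hypothetical).
struct DeviceBuffer {
    device_ptr: u64,
    pitch: usize,
}

enum FramePayload {
    Cpu(Vec<u8>),
    Cuda(DeviceBuffer),
}

struct CapturedFrame {
    width: u32,
    height: u32,
    payload: FramePayload,
}

#[derive(Debug, PartialEq)]
enum EncodePath {
    ExpandAndUpload, // CPU pixels: expand, then upload
    CudaFrame,       // device buffer wrapped in a CUDA hardware frame
}

/// The encoder's `cuda` flag, derived from the first frame's payload.
fn wants_cuda(first: &CapturedFrame) -> bool {
    matches!(first.payload, FramePayload::Cuda(_))
}

/// Per-frame branch on the payload kind.
fn encode_path(frame: &CapturedFrame) -> EncodePath {
    match &frame.payload {
        FramePayload::Cpu(_) => EncodePath::ExpandAndUpload,
        FramePayload::Cuda(_) => EncodePath::CudaFrame,
    }
}
```

Making the payload an enum means any `match` over it has to handle both CPU and CUDA frames, so a new payload kind cannot be added without the encoder being updated.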
The capture-side dmabuf negotiation (which produces the Cuda frames) is the
next step; the CPU path is unchanged and remains the default + fallback. Builds
clean, clippy clean, tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>