punktfunk/crates/lumen-host
enricobuehler a473f4a926 perf: M2 — amortize per-frame zero-copy overhead (pool buffers + register once)
The zero-copy import did real per-frame GPU churn that capped high-fps throughput: a fresh
~29MB cuMemAllocPitch + cuMemFree, a cuGraphicsGLRegisterImage/unregister, and a map of the
*same* persistent blit texture — every frame. Two fixes:

- BufferPool: a recycled free-list of pitched device buffers per resolution. On drop (once the
  encoder has synchronized), DeviceBuffer returns its allocation to the pool instead of
  freeing it. This kills the per-frame 29MB alloc/free, which took the device allocator lock
  and serialized against the GPU.
- RegisteredTexture: register the (reused) GL_RGBA8 blit destination with CUDA ONCE when the
  GlBlit is built; each frame only maps → copies the array → unmaps, instead of
  registering/unregistering every frame.
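
A minimal sketch of the BufferPool idea described above, not the actual lumen-host code. It assumes a free-list keyed by (width, height) and uses a plain Vec<u8> as a stand-in for the pitched device allocation, so it runs without CUDA. The field names, the 4-bytes-per-pixel size, and the `allocations` counter are illustrative assumptions.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a pitched device allocation. Real code would hold a
/// device pointer and pitch obtained from cuMemAllocPitch (assumption).
struct RawAlloc {
    bytes: Vec<u8>,
}

#[derive(Default)]
struct PoolInner {
    /// Free-list of idle allocations, keyed by resolution.
    free: HashMap<(u32, u32), Vec<RawAlloc>>,
    /// Number of fresh allocations made so far (for illustration only).
    allocations: usize,
}

/// Cheaply cloneable handle to the shared pool.
#[derive(Clone, Default)]
pub struct BufferPool {
    inner: Arc<Mutex<PoolInner>>,
}

/// A buffer checked out of the pool; goes back to the pool on drop.
pub struct DeviceBuffer {
    alloc: Option<RawAlloc>,
    dims: (u32, u32),
    pool: BufferPool,
}

impl BufferPool {
    /// Reuse an idle buffer of this resolution, or allocate a new one.
    pub fn acquire(&self, width: u32, height: u32) -> DeviceBuffer {
        let dims = (width, height);
        let mut inner = self.inner.lock().unwrap();
        let recycled = inner.free.get_mut(&dims).and_then(|list| list.pop());
        let alloc = match recycled {
            Some(alloc) => alloc,
            None => {
                inner.allocations += 1;
                // Hypothetical size: 4 bytes per pixel, no pitch padding.
                RawAlloc { bytes: vec![0u8; width as usize * height as usize * 4] }
            }
        };
        drop(inner);
        DeviceBuffer { alloc: Some(alloc), dims, pool: self.clone() }
    }

    /// How many fresh allocations the pool has had to make.
    pub fn allocations(&self) -> usize {
        self.inner.lock().unwrap().allocations
    }
}

impl DeviceBuffer {
    pub fn size_bytes(&self) -> usize {
        self.alloc.as_ref().map_or(0, |a| a.bytes.len())
    }
}

impl Drop for DeviceBuffer {
    /// Return the allocation to the free-list instead of freeing it.
    /// Real code must only do this after the consumer (the encoder) is
    /// done with the buffer.
    fn drop(&mut self) {
        if let Some(alloc) = self.alloc.take() {
            let mut inner = self.pool.inner.lock().unwrap();
            inner.free.entry(self.dims).or_default().push(alloc);
        }
    }
}
```

With this shape, acquiring and dropping one buffer per frame at a fixed resolution allocates only on the first frame; later frames pop the same allocation off the free-list.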

This is the "zero-copy should be overhead-free" cleanup. Verified the import still produces
correct frames. Replacing the remaining per-frame cuCtxSynchronize pair (shared-context
coupling) with a CUDA stream + events is the next step. lumen-host builds, clippy/fmt/tests
clean.
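
The register-once pattern behind RegisteredTexture can be sketched as an RAII wrapper, shown here against a mock so it runs without a GPU. This is an illustration under assumptions, not the lumen-host implementation: `MockInterop`, `with_mapped`, and the call counters are hypothetical. The comments name the CUDA driver calls that real code would presumably issue at each point.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the CUDA/GL interop layer; it only counts calls.
#[derive(Default)]
pub struct MockInterop {
    pub registers: Cell<u32>,
    pub unregisters: Cell<u32>,
    pub maps: Cell<u32>,
    pub unmaps: Cell<u32>,
}

/// Owns the registration of one GL texture for its whole lifetime.
pub struct RegisteredTexture<'a> {
    interop: &'a MockInterop,
    gl_texture: u32,
}

impl<'a> RegisteredTexture<'a> {
    /// Register once, when the blit destination is created.
    /// Real code: cuGraphicsGLRegisterImage.
    pub fn new(interop: &'a MockInterop, gl_texture: u32) -> Self {
        interop.registers.set(interop.registers.get() + 1);
        Self { interop, gl_texture }
    }

    /// Per-frame work: map, run the copy, unmap.
    /// Real code: cuGraphicsMapResources, then
    /// cuGraphicsSubResourceGetMappedArray and a copy out of the array,
    /// then cuGraphicsUnmapResources.
    pub fn with_mapped<R>(&self, copy: impl FnOnce(u32) -> R) -> R {
        self.interop.maps.set(self.interop.maps.get() + 1);
        let out = copy(self.gl_texture);
        self.interop.unmaps.set(self.interop.unmaps.get() + 1);
        out
    }
}

impl Drop for RegisteredTexture<'_> {
    /// Unregister once, when the wrapper goes away.
    /// Real code: cuGraphicsUnregisterResource.
    fn drop(&mut self) {
        self.interop.unregisters.set(self.interop.unregisters.get() + 1);
    }
}
```

Over 60 frames this wrapper performs one register and one unregister in total, with a map/unmap pair per frame, instead of 60 of each.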

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 20:38:26 +00:00