1a483aae06
The gpu-contention plan's §5.B lever: today submit and the blocking lock_bitstream share one thread, so under a GPU-saturating game the pipeline serializes on the WDDM scheduling wait (1000/17ms ≈ 59fps — the depth-1 collapse; the old 'deeper pipeline just stacks latency' result was a same-thread implementation, not a disproof). Async mode opens the session enableEncodeAsync=1, registers an auto-reset completion event per pool bitstream, and moves the wait+lock+copy+ unlock onto an internal retrieve thread feeding poll() through a channel — the exact split the NVENC guide mandates. Register/map/unmap stay on the encode thread; teardown drops the job channel, joins the thread, THEN destroys the session. In-flight depth is bounded by PUNKTFUNK_NVENC_ASYNC_DEPTH (default 4, hard cap POOL-1) — both for output-buffer reuse and because NVENC encodes the capture ring's textures in place. Idle latency cost ≈ 0 (same-tick pickup); under contention completed frames queue instead of stalling capture. CI-compile validated only — on-glass A/B under game load on the RTX box still pending (box offline). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>