68c92f6874
Chasing the 8ms submit at 1440p on the 780M: the sampled PUNKTFUNK_PERF split (push/pull/send) shows desc+buffersrc at ~5us, hwmap-import+VPP CSC at ~0.2-0.5ms, and avcodec_send_frame owning the rest — so neither a VA-surface import cache nor CSC overlap would help. Two facts landed: (1) async_depth>=2 in libavcodec's vaapi_encode is a structural +1-frame latency (frame N's packet only materializes when N+1 queues; measured 18ms vs 8.3ms p50 at depth 1) — depth 1 stays the default, PUNKTFUNK_VAAPI_ASYNC_DEPTH exists for pixel rates beyond the ASIC's serial budget, and poll() now does a bounded in-flight wait so a deeper depth still ships the AU as soon as the ASIC finishes. (2) The residual send_frame block tracks GPU CLOCKS, not the ASIC: ~8ms/frame at a 60fps duty cycle vs ~4.4ms at 120fps pacing vs 3.5ms back-to-back (270fps CLI benchmark, even at -async_depth 1) — the clock-sag fix lands in gpuclocks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>