perf(encode/windows): AMF quality=speed + bf=0; drop the useless poll spin
ci / rust (push) Failing after 48s
windows-host / package (push) Failing after 10s
apple / swift (push) Successful in 1m6s
ci / web (push) Successful in 51s
ci / docs-site (push) Successful in 1m8s
android / android (push) Successful in 3m20s
decky / build-publish (push) Successful in 11s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 4s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 4s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
apple / screenshots (push) Successful in 5m10s
ci / bench (push) Successful in 4m43s
docker / deploy-docs (push) Successful in 18s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Failing after 3m25s
deb / build-publish (push) Failing after 44s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Failing after 3m21s

On-box A/B on the .173 Ryzen 7000 iGPU (720p60, real composition via input
injection — an idle virtual desktop composes ~1 fps and gives meaningless
encode timings): the encode-time-first `quality=speed` preset + explicit `bf=0`
cut host-side encode_us from ~36 ms to ~19.5 ms.

The blocking-poll idea from the prior commit was WRONG and is reverted to a
single non-blocking receive (default PUNKTFUNK_FFWIN_POLL_MS=0): libavcodec's
hevc_amf holds ~2 frames before releasing the oldest (needs frame N+2 to flush
N), so a spin between submits provably never yields the owed AU — verified with
a 150 ms cap pegging at exactly 150 ms across every usage preset and pipeline
depth. That ~2-frame buffer is inherent to the libavcodec wrapper, not host
scheduling; the real latency lever is a direct AMF SDK encoder (the AMF
analogue of the direct-NVENC path), tracked as the next AMD work item. The
env knob is retained for a future VCN/driver where a bounded spin can help.

Also measured and rejected: PUNKTFUNK_ZEROCOPY=1 on AMF is ~2x WORSE (68 ms vs
36 ms) — the D3D11 import path adds sync overhead beyond the readback it saves,
so the system-memory default stays. GPU-priority elevation is already
process-wide (dxgi.rs), so it covers the iGPU encode session with no change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
2026-07-03 14:57:39 +00:00
parent 24fa018c70
commit f204a89cef
@@ -1307,14 +1307,18 @@ impl Encoder for FfmpegWinEncoder {
self.force_kf = true;
}
/// Poll-contract note: the session encode loop's pipelining treats a `None` from `poll` as
/// "come back next tick" and was designed around direct NVENC, whose poll BLOCKS in
/// `lock_bitstream` until the owed AU is done. libavcodec's AMF wrapper is truly async
/// (EAGAIN until the ASIC finishes), so a single non-blocking try quantizes AU retrieval to
/// the submit cadence — measured +12 frame periods (~43 ms p50 at 720p60 on the Ryzen iGPU,
/// vs ~3.5 ms of actual encode). While an AU is owed (`in_flight > 0`), spin-poll with short
/// sleeps like NVENC's blocking wait, bounded to ~2 frame periods so an overloaded encoder
/// degrades back to next-tick pickup instead of stalling capture.
/// Poll for the next finished AU (single non-blocking `receive_packet`).
///
/// libavcodec's `hevc_amf`/`av1_amf` wrapper holds ~2 frames before releasing the oldest
/// (it needs frame N+2 submitted to flush N), so the encode→retrieve latency floors at
/// **~2 frame periods** — measured dead-stable at 36 ms p50 for 720p60 on the Ryzen 7000
/// iGPU across depth 1/2, every `usage` preset, and any spin (a spin between submits provably
/// never produces the owed AU — verified with a 150 ms cap pegging at exactly 150 ms). So the
/// buffer is inherent to the libavcodec path, NOT host scheduling: the real fix is a direct
/// AMF SDK encoder (the AMF analogue of `encode/windows/nvenc.rs`, whose delay=0 gives NVENC
/// its ~12 ms) — tracked as the next AMD latency lever. `PUNKTFUNK_FFWIN_POLL_MS` keeps a
/// bounded spin available for a future VCN/driver where the AU can land mid-spin (0 = off,
/// the default and correct choice on measured hardware).
fn poll(&mut self) -> Result<Option<EncodedFrame>> {
let fps = self.fps;
let enc = match &mut self.inner {
@@ -1322,14 +1326,12 @@ impl Encoder for FfmpegWinEncoder {
Some(Inner::ZeroCopy(z)) => &mut z.enc,
None => return Ok(None),
};
// Default cap: ~2 frame periods. `PUNKTFUNK_FFWIN_POLL_MS` overrides for on-box latency
// forensics (e.g. 150 to see WHEN the AU really lands vs. being gated on the next submit).
let cap_us = std::env::var("PUNKTFUNK_FFWIN_POLL_MS")
.ok()
.and_then(|s| s.parse::<u64>().ok())
.map(|ms| ms * 1000)
.unwrap_or_else(|| (2_000_000 / fps.max(1) as u64).max(10_000));
let deadline = (self.in_flight > 0)
.unwrap_or(0); // default: no spin — the libavcodec AMF buffer can't be spun out
let deadline = (cap_us > 0 && self.in_flight > 0)
.then(|| std::time::Instant::now() + std::time::Duration::from_micros(cap_us));
loop {
match poll_encoder(enc, fps)? {