feat(clients/windows): all-vendor video pipeline rewrite + app icon + hosts-page tiles

Decode+present rewrite (first real pixels on glass for this client):

- Decode: FFmpeg D3D11VA on NVIDIA/AMD/Intel. get_format now only returns AV_PIX_FMT_D3D11 and lets libavcodec build the decode pool from hw_device_ctx (hand-built frames contexts failed three different ways: NVIDIA rejects DECODER|SHADER_RESOURCE arrays, BindFlags=0 fails texture creation, Intel rejects non-128-aligned HEVC surfaces at the first SubmitDecoderBuffers). A DXVA profile probe before the hwdevice commits hardware-vs-software up front instead of burning the opening IDR; extra_hw_frames covers the frames the client holds.
- Present: the decoded slice is copied with ONE display-size-boxed CopySubresourceRegion (a planar slice is a single subresource in D3D11; the old two-copy D3D12-style code silently no-opped - the black screen) into a sampleable NV12/P010 texture, per-plane SRVs + YUV->RGB shaders.
- New dedicated render thread (render.rs): presenting is decoupled from the XAML thread; frame-latency-waitable swapchain + SetMaximumFrameLatency(1), newest-wins drain after the wait, crossbeam frame channel with pts for a capture->presented p50 log.
- HiDPI: pixel-sized buffers + SetMatrixTransform(96/dpi) - was blurry at 125/150 % scaling.
- Software fallback now feeds the same shaders (swscale -> NV12/P010 planes -> two dynamic plane textures); ps_rgba/X2BGR10 path deleted, hw/sw colour math identical.
- Adapter selection for hybrid boxes: PUNKTFUNK_ADAPTER > the window's monitor's adapter > default; PUNKTFUNK_D3D_DEBUG=1 debug layer.
- Session pump: request_keyframe at start and on hw->sw demotion (infinite GOP would otherwise sit on a black screen).

Validated live on the Arc Pro + RTX 3500 Ada laptop against the local Windows host: 60 fps D3D11VA on both vendors, software path, GUI on glass.

Also: embedded app icon (build.rs winresource + WM_SETICON, MSIX Square44x44 targetsize assets, pack-msix stages them) and the hosts-page tile rework (tap-to-connect tiles with sibling overflow menu - fixes forget-also-connects - in-tile rename editor, add-host modal via root state).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
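
The "newest-wins drain" mentioned for the render thread can be illustrated with a minimal sketch. It is not the code in render.rs: it assumes a hypothetical `Frame` type carrying only a pts, and uses `std::sync::mpsc` in place of the crossbeam channel so the example has no dependencies. After the swapchain wait returns, everything queued is drained and only the most recent frame is presented.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

/// Hypothetical frame type for illustration: only the presentation timestamp matters here.
#[derive(Debug, PartialEq)]
struct Frame {
    pts_us: u64,
}

/// Drain everything currently queued and keep only the newest frame.
/// Returns that frame (if any) plus the number of older frames that were skipped.
fn drain_newest(rx: &Receiver<Frame>) -> (Option<Frame>, usize) {
    let mut newest = None;
    let mut dropped = 0;
    loop {
        match rx.try_recv() {
            Ok(frame) => {
                // `replace` hands back the previously held frame, which is now stale.
                if newest.replace(frame).is_some() {
                    dropped += 1;
                }
            }
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    (newest, dropped)
}

fn main() {
    let (tx, rx) = channel();
    // Three frames arrive while the render thread is blocked on the swapchain wait.
    for pts_us in [0, 16_667, 33_333] {
        tx.send(Frame { pts_us }).unwrap();
    }
    let (frame, dropped) = drain_newest(&rx);
    println!("present {:?}, skipped {} stale frame(s)", frame, dropped);
}
```

Draining after the wait rather than before it means the frame chosen is the freshest one available at the moment the swapchain can accept it, which is the point of pairing the drain with a frame latency of 1.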
2026-07-02 16:24:23 +02:00
parent 2c416a4bff
commit a4c84ac620
36 changed files with 1797 additions and 581 deletions
@@ -2,15 +2,24 @@
 //!
 //! Two backends, picked at session start (override via [`DecoderPref`] / the Settings UI):
 //!
-//! * **D3D11VA** (any GPU): libavcodec decodes on the GPU straight into `ID3D11Texture2D`s that
-//!   carry `D3D11_BIND_SHADER_RESOURCE`, so the presenter samples the decoded NV12/P010 surface
-//!   directly — **zero copy** (no swscale, no CPU readback, no per-frame upload). The textures are
-//!   created by the process-wide shared device ([`crate::gpu`]) the presenter also draws with, which
-//!   is what makes them bindable there. This is the big latency/throughput win over software decode.
-//! * **Software**: libavcodec on the CPU + swscale to a packed 4-byte format the presenter uploads
-//!   (`RGBA` for SDR, `X2BGR10` for HDR). The fallback on a GPU-less box (WARP), when D3D11VA init
-//!   fails, or when a mid-session hardware error demotes us — the host's IDR/RFI recovery
-//!   resynchronizes on the next keyframe either way.
+//! * **D3D11VA** (any GPU — the vendor-agnostic DXVA path on NVIDIA/AMD/Intel): libavcodec decodes
+//!   on the GPU into an `ID3D11Texture2D` decode array (decoder-only bind — NVIDIA rejects a
+//!   decoder array that is also a shader resource). The presenter copies each decoded slice into
+//!   its own sampleable NV12/P010 texture and converts YUV→RGB in a shader — one cheap GPU-to-GPU
+//!   copy per frame (no swscale, no CPU readback). The decode array is created by the process-wide
+//!   shared device ([`crate::gpu`]) the presenter also draws with, so the copy stays on-GPU. This
+//!   is the big latency/throughput win over software.
+//! * **Software**: libavcodec on the CPU + swscale to the same planar layout the hardware path
+//!   produces (NV12, or P010 for 10-bit) — the presenter uploads the two planes and runs the SAME
+//!   YUV→RGB shaders, so hw/sw color math is identical. The fallback on a GPU-less box (WARP),
+//!   when D3D11VA init fails, or when a mid-session hardware error demotes us — the host's
+//!   IDR/RFI recovery resynchronizes on the next keyframe either way.
+//!
+//! D3D11VA viability is settled **before the session's first frame** by two probes: the adapter
+//! must expose the negotiated codec's DXVA decode profile ([`decode_profile_supported`] — hwaccel
+//! init otherwise only fails at the first AU, burning the IDR), and it must be able to create the
+//! decode surface pool ([`d3d11va_decode_supported`]). Either failing commits to software decode
+//! from frame one (a clean, gap-free stream) instead of dying mid-stream.
 //!
 //! Both run `AV_CODEC_FLAG_LOW_DELAY`; the host encodes zero-reorder streams (no B-frames, in-band
 //! parameter sets on every IDR), so decode is strictly one-in/one-out.
@@ -25,7 +34,9 @@ use ffmpeg::util::frame::Video as AvFrame;
 use ffmpeg_next as ffmpeg;
 use std::ffi::c_void;
 use std::ptr;
-use windows::core::Interface; // ID3D11Device::clone().into_raw() for the FFmpeg hwdevice ctx
+use windows::core::{Interface, GUID};
+use windows::Win32::Graphics::Direct3D11::{ID3D11Device, ID3D11VideoDevice};
+use windows::Win32::Graphics::Dxgi::Common::{DXGI_FORMAT, DXGI_FORMAT_NV12, DXGI_FORMAT_P010};

 /// Which decode backend to use; the Settings UI persists this as a string.
 #[derive(Clone, Copy, PartialEq, Eq, Debug, Default)]
@@ -69,21 +80,27 @@ impl DecodedFrame {
    }
 }

-/// Packed 4-byte-per-pixel frame for a D3D11 dynamic-texture upload (which takes a row pitch). The
-/// bytes are `R8G8B8A8` for SDR and `X2BGR10` (== DXGI `R10G10B10A2`, R in the low 10 bits) for HDR.
+/// A software-decoded frame in the same planar layout the hardware path produces: an NV12 (or
+/// P010 for 10-bit) luma plane + interleaved chroma plane, each with its swscale row stride
+/// (≥ the row bytes — swscale pads rows for SIMD). The presenter uploads them into two dynamic
+/// plane textures sampled by the same shaders as the D3D11VA path.
 pub struct CpuFrame {
    pub width: u32,
    pub height: u32,
-    /// Row stride in bytes (≥ width*4 — swscale pads rows for SIMD).
-    pub stride: usize,
-    pub pixels: Vec<u8>,
-    /// BT.2020 PQ HDR10 frame: `pixels` is `X2BGR10` and the presenter switches to a 10-bit
-    /// R10G10B10A2 + ST.2084 swapchain. `false` = ordinary 8-bit BT.709 SDR.
+    /// Luma plane (`W×H` samples, 1 byte each; 2 for 10-bit) + its row stride in bytes.
+    pub y: Vec<u8>,
+    pub y_stride: usize,
+    /// Interleaved chroma plane (`⌈W/2⌉×⌈H/2⌉` UV pairs) + its row stride in bytes.
+    pub uv: Vec<u8>,
+    pub uv_stride: usize,
+    /// P010 sample layout (10 bits in the high bits of 16) vs NV12. Selects texture/SRV formats.
+    pub ten_bit: bool,
+    /// BT.2020 PQ HDR10 vs ordinary BT.709 SDR. Selects shader + swapchain colour space.
    pub hdr: bool,
 }

 /// A decoded frame still on the GPU: a D3D11 texture **array** plus the slice index the decoder
-/// wrote this frame into. The presenter creates per-plane shader-resource views over the slice and
+/// wrote this frame into. The presenter copies the slice into its own sampleable texture and
 /// converts YUV→RGB in a pixel shader. The underlying surface stays alive — and out of the decoder's
 /// reuse pool — for exactly as long as `guard` (an `av_frame_clone` of the decoded frame) lives.
 pub struct GpuFrame {
@@ -91,16 +108,20 @@ pub struct GpuFrame {
    pub height: u32,
    /// Texture-array slice this frame occupies (`AVFrame::data[1]`).
    pub index: u32,
-    /// BT.2020 PQ HDR10 (P010, ST.2084) vs ordinary 8-bit BT.709 SDR (NV12). The present path keys
-    /// SRV format + shader off this (the host couples 10-bit ⟺ HDR).
+    /// The decode pool is P010 (10 bits in the high bits) vs NV12 — from the frames context's
+    /// `sw_format`. The presenter keys its copy-texture/SRV formats off this: they must match the
+    /// source array exactly for `CopySubresourceRegion`.
+    pub ten_bit: bool,
+    /// BT.2020 PQ HDR10 (ST.2084 transfer) vs ordinary BT.709 SDR. Selects shader + swapchain
+    /// colour space only (the host couples 10-bit ⟺ HDR today, but formats key off `ten_bit`).
    pub hdr: bool,
    guard: D3d11FrameGuard,
 }

 impl GpuFrame {
    /// The decoder's D3D11 texture array holding this frame's slice, borrowed from the live cloned
-    /// `AVFrame`. Construct the windows-rs interface on the thread that will use it (the presenter /
-    /// UI thread): COM interfaces are `!Send`, but the raw pointer is fine to carry across threads.
+    /// `AVFrame`. Construct the windows-rs interface on the thread that will use it (the render
+    /// thread): COM interfaces are `!Send`, but the raw pointer is fine to carry across threads.
    pub fn texture_ptr(&self) -> *mut c_void {
        unsafe { (*self.guard.0).data[0] as *mut c_void }
    }
@@ -108,7 +129,7 @@ impl GpuFrame {

 /// Owns a cloned decoded `AVFrame` (which refs the D3D11 surface in the decoder pool). Dropping it
 /// releases the surface back for reuse. The clone is plain refcounted data; freeing it from the
-/// presenter thread is fine.
+/// render thread is fine.
 pub struct D3d11FrameGuard(*mut ffmpeg::ffi::AVFrame);
 unsafe impl Send for D3d11FrameGuard {}
 impl Drop for D3d11FrameGuard {
@@ -139,6 +160,7 @@ pub fn ffmpeg_codec_id(wire: u8) -> ffmpeg::codec::Id {

 /// The `quic` codec bitfield this client can decode — whatever FFmpeg has a decoder for (HEVC/H.264
 /// always; AV1 when built in). Advertised to the host so it never emits a codec we can't decode.
+/// Deliberately NOT gated on the DXVA profiles: software decode covers anything FFmpeg can.
 pub fn decodable_codecs() -> u8 {
    let _ = ffmpeg::init();
    let mut bits = 0u8;
@@ -160,7 +182,7 @@ impl Decoder {
        if pref != DecoderPref::Software {
            match D3d11vaDecoder::new(codec_id) {
                Ok(d) => {
-                    tracing::info!(?codec_id, "D3D11VA hardware decode active (zero-copy)");
+                    tracing::info!(?codec_id, "D3D11VA hardware decode active");
                    return Ok(Decoder {
                        backend: Backend::D3d11va(d),
                        codec_id,
@@ -180,7 +202,7 @@ impl Decoder {
        })
    }

-    /// True for the zero-copy hardware backend (shown in the stream HUD).
+    /// True for the GPU hardware backend (shown in the stream HUD).
    pub fn is_hardware(&self) -> bool {
        matches!(self.backend, Backend::D3d11va(_))
    }
@@ -203,12 +225,73 @@ impl Decoder {
    }
 }

+// --- DXVA decode-profile probe --------------------------------------------------------
+
+/// DXVA decode-profile GUIDs (`dxva.h`), defined locally so no extra windows-rs feature or
+/// metadata surface is pulled in for four constants.
+const PROFILE_H264_VLD_NOFGT: GUID = GUID::from_u128(0x1b81be68_a0c7_11d3_b984_00c04f2e73c5);
+const PROFILE_HEVC_VLD_MAIN: GUID = GUID::from_u128(0x5b11d51b_2f4c_4452_bcc3_09f2a1160cc0);
+const PROFILE_HEVC_VLD_MAIN10: GUID = GUID::from_u128(0x107af0e0_ef1a_4d19_aba8_67a163073d13);
+const PROFILE_AV1_VLD_PROFILE0: GUID = GUID::from_u128(0xb8be4ccb_cf53_46ba_8d59_d6b8a6da5d2a);
+
+/// Does the shared device's adapter expose a DXVA decode profile for `codec_id`? Checked before
+/// building the FFmpeg hwdevice because hwaccel selection (`get_format`) only runs on the FIRST
+/// access unit — an unsupported profile would otherwise burn the opening IDR and recover through
+/// the mid-stream demotion path instead of committing to software up front. Also logs (once) the
+/// adapter's full profile list plus Main10 availability — the forensics for a new GPU/driver.
+fn decode_profile_supported(device: &ID3D11Device, codec_id: ffmpeg::codec::Id) -> Result<()> {
+    let video: ID3D11VideoDevice = device
+        .cast()
+        .context("device lacks ID3D11VideoDevice (created without VIDEO_SUPPORT)")?;
+    let profiles: Vec<GUID> = unsafe {
+        let n = video.GetVideoDecoderProfileCount();
+        (0..n)
+            .filter_map(|i| video.GetVideoDecoderProfile(i).ok())
+            .collect()
+    };
+    log_profiles_once(&profiles);
+
+    let (wanted, format, name): (GUID, DXGI_FORMAT, &str) = match codec_id {
+        ffmpeg::codec::Id::H264 => (PROFILE_H264_VLD_NOFGT, DXGI_FORMAT_NV12, "H.264 VLD NoFGT"),
+        ffmpeg::codec::Id::HEVC => (PROFILE_HEVC_VLD_MAIN, DXGI_FORMAT_NV12, "HEVC Main"),
+        ffmpeg::codec::Id::AV1 => (PROFILE_AV1_VLD_PROFILE0, DXGI_FORMAT_NV12, "AV1 Profile 0"),
+        other => bail!("no DXVA profile known for {other:?}"),
+    };
+    let ok = profiles.contains(&wanted)
+        && unsafe { video.CheckVideoDecoderFormat(&wanted, format) }
+            .map(|b| b.as_bool())
+            .unwrap_or(false);
+    if !ok {
+        bail!("adapter exposes no {name} decode profile");
+    }
+    // 10-bit (a mid-session HDR upgrade needs Main10): informational — if it's missing the
+    // decode error → software demotion + keyframe re-request path covers the switch.
+    if codec_id == ffmpeg::codec::Id::HEVC {
+        let main10 = profiles.contains(&PROFILE_HEVC_VLD_MAIN10)
+            && unsafe { video.CheckVideoDecoderFormat(&PROFILE_HEVC_VLD_MAIN10, DXGI_FORMAT_P010) }
+                .map(|b| b.as_bool())
+                .unwrap_or(false);
+        tracing::info!(main10, "HEVC Main10 (10-bit/HDR) decode profile");
+    }
+    Ok(())
+}
+
+/// One-time dump of the adapter's DXVA decode profiles.
+fn log_profiles_once(profiles: &[GUID]) {
+    use std::sync::atomic::{AtomicBool, Ordering};
+    static ONCE: AtomicBool = AtomicBool::new(true);
+    if ONCE.swap(false, Ordering::Relaxed) {
+        let list: Vec<String> = profiles.iter().map(|g| format!("{g:?}")).collect();
+        tracing::info!(count = profiles.len(), profiles = ?list, "adapter DXVA decode profiles");
+    }
+}
+
 // --- software backend ---------------------------------------------------------------

 struct SoftwareDecoder {
    decoder: ffmpeg::decoder::Video,
    /// Rebuilt whenever the decoded format/size **or output format** changes (mid-stream
-    /// `Reconfigure`, or an SDR↔HDR flip): `(ctx, src_fmt, w, h, dst_fmt)`.
+    /// `Reconfigure`, or an 8↔10-bit flip): `(ctx, src_fmt, w, h, dst_fmt)`.
    sws: Option<(scaling::Context, Pixel, u32, u32, Pixel)>,
 }

@@ -241,36 +324,24 @@ impl SoftwareDecoder {
        Ok(out)
    }

-    /// Convert the decoded YUV frame to a packed 4-byte format the presenter uploads directly:
-    /// SDR → `RGBA` (BT.709), HDR (SMPTE ST.2084 / PQ transfer) → `X2BGR10` (== DXGI R10G10B10A2)
-    /// using the BT.2020 matrix. For HDR the PQ-encoded values pass through unchanged (swscale only
-    /// applies the YUV→RGB matrix + range, never the transfer) — exactly what an HDR10 swapchain wants.
+    /// Convert the decoded planar YUV to the hardware path's layout: NV12 for 8-bit, P010 for
+    /// 10-bit — a chroma interleave (and 10→16-high-bits shift), NOT a colour conversion. The
+    /// matrix/range/transfer handling all lives in the presenter's shaders, shared with the
+    /// D3D11VA path, so software frames are bit-comparable with hardware ones.
    fn convert(&mut self, frame: &AvFrame) -> Result<CpuFrame> {
        use ffmpeg::color::TransferCharacteristic;
        let (fmt, w, h) = (frame.format(), frame.width(), frame.height());
        let hdr = frame.color_transfer_characteristic() == TransferCharacteristic::SMPTE2084;
-        let dst = if hdr { Pixel::X2BGR10LE } else { Pixel::RGBA };
+        // Source bit depth from the pix-fmt descriptor (stable FFmpeg public API).
+        let ten_bit = unsafe {
+            let desc = ffmpeg::ffi::av_pix_fmt_desc_get(fmt.into());
+            !desc.is_null() && (*desc).comp[0].depth > 8
+        };
+        let dst = if ten_bit { Pixel::P010LE } else { Pixel::NV12 };
        let rebuild = !matches!(&self.sws, Some((_, f, sw, sh, d)) if *f == fmt && *sw == w && *sh == h && *d == dst);
        if rebuild {
-            let mut ctx = scaling::Context::get(fmt, w, h, dst, w, h, scaling::Flags::POINT)
+            let ctx = scaling::Context::get(fmt, w, h, dst, w, h, scaling::Flags::POINT)
                .context("swscale context")?;
-            if hdr {
-                // BT.2020 non-constant-luminance YUV (limited range) → full-range RGB. swscale
-                // applies only the matrix + range here, so the samples stay PQ-encoded.
-                unsafe {
-                    let coef = ffmpeg::ffi::sws_getCoefficients(ffmpeg::ffi::SWS_CS_BT2020);
-                    ffmpeg::ffi::sws_setColorspaceDetails(
-                        ctx.as_mut_ptr(),
-                        coef,
-                        0, // src range: limited (video)
-                        coef,
-                        1, // dst range: full
-                        0,
-                        1 << 16,
-                        1 << 16, // brightness / contrast / saturation defaults (16.16)
-                    );
-                }
-            }
            self.sws = Some((ctx, fmt, w, h, dst));
        }
        let (sws, ..) = self.sws.as_mut().unwrap();
@@ -279,8 +350,11 @@ impl SoftwareDecoder {
        Ok(CpuFrame {
            width: w,
            height: h,
-            stride: conv.stride(0),
-            pixels: conv.data(0).to_vec(),
+            y: conv.data(0).to_vec(),
+            y_stride: conv.stride(0),
+            uv: conv.data(1).to_vec(),
+            uv_stride: conv.stride(1),
+            ten_bit,
            hdr,
        })
    }
@@ -295,11 +369,16 @@ impl SoftwareDecoder {
 // decoded surfaces transfer out through D3d11FrameGuard.

 const AVERROR_EAGAIN: i32 = -11; // -EAGAIN
-const D3D11_BIND_SHADER_RESOURCE: u32 = 0x8; // <d3d11.h>; FFmpeg ORs D3D11_BIND_DECODER itself
+
+/// D3D11VA decode surface pool depth: the zero-reorder DPB (1–2 refs) + the bounded decoded channel
+/// (2) + the frame the presenter currently holds (until its copy flushes) + one in-flight decode —
+/// 12 is comfortable. A GPU that can't create the pool at all is gated out by
+/// `d3d11va_decode_supported` and the session uses software decode.
+const DECODE_POOL_SIZE: i32 = 12;

 /// `hwcontext_d3d11va.h` — `AVHWDeviceContext::hwctx`. Leaving `lock` null makes FFmpeg install an
 /// `ID3D11Multithread` default lock + set multithread protection on `device_context` during init,
-/// which is what lets the presenter share this device's immediate context from the UI thread.
+/// which is what lets the presenter share this device's immediate context from the render thread.
 #[repr(C)]
 struct AVD3D11VADeviceContext {
    device: *mut c_void,         // ID3D11Device*
@@ -311,70 +390,79 @@ struct AVD3D11VADeviceContext {
    lock_ctx: *mut c_void,
 }

-/// `hwcontext_d3d11va.h` — `AVHWFramesContext::hwctx`. `BindFlags` lets us add
-/// `D3D11_BIND_SHADER_RESOURCE` so the decoded array texture is sampleable (zero copy).
+/// `hwcontext_d3d11va.h` — `AVHWFramesContext::hwctx`. The header is explicit: "The user must at
+/// least set D3D11_BIND_DECODER if the frames context is to be used for video decoding" — a
+/// user-built frames context gets NO default (BindFlags 0 → `CreateTexture2D` E_INVALIDARG); the
+/// automatic OR-in lives only in libavcodec's own frames-param path, which we bypass.
 #[repr(C)]
 struct AVD3D11VAFramesContext {
    texture: *mut c_void, // ID3D11Texture2D* (null → FFmpeg allocates the pool)
    bind_flags: u32,      // UINT BindFlags
    misc_flags: u32,      // UINT MiscFlags
+    texture_infos: *mut c_void, // AVD3D11FrameDescriptor* (FFmpeg-managed)
 }

+/// `D3D11_BIND_DECODER` — the decode pool's ONLY bind flag. Adding `D3D11_BIND_SHADER_RESOURCE`
+/// is what NVIDIA rejects on a decoder texture ARRAY; the presenter samples via its own copy.
+const BIND_DECODER: u32 = 0x200;
+
 fn averr(what: &str, code: i32) -> anyhow::Error {
    anyhow!("{what}: {}", ffmpeg::Error::from(code))
 }

-/// libavcodec's `get_format` callback: accept the D3D11 hw surface, building a frames context whose
-/// textures carry `BIND_SHADER_RESOURCE` (so the presenter can sample them). Returning anything but
-/// `AV_PIX_FMT_D3D11` aborts hardware decode → the session demotes to software.
+/// libavcodec's `get_format` callback: pick the D3D11 hw surface format and nothing else.
+/// Deliberately does NOT build a frames context — with `hw_device_ctx` set and `hw_frames_ctx`
+/// left null, libavcodec derives the decode pool itself (`ff_decode_get_hw_frames_ctx`), applying
+/// every vendor quirk: DXVA surface alignment (128 for HEVC/AV1), DPB-based pool sizing, and the
+/// decoder-only `D3D11_BIND_DECODER` flags. A hand-built context validated on NVIDIA was rejected
+/// by Intel at the first `SubmitDecoderBuffers` (E_INVALIDARG) — the vendor-proof path is the one
+/// the ffmpeg CLI/mpv ship. Returning anything but `AV_PIX_FMT_D3D11` aborts hardware decode →
+/// the session demotes to software.
 unsafe extern "C" fn get_format_d3d11(
    avctx: *mut ffmpeg::ffi::AVCodecContext,
    mut list: *const ffmpeg::ffi::AVPixelFormat,
 ) -> ffmpeg::ffi::AVPixelFormat {
    use ffmpeg::ffi::*;
    unsafe {
-        let mut found = false;
+        if (*avctx).hw_device_ctx.is_null() {
+            return AVPixelFormat::AV_PIX_FMT_NONE;
+        }
        while *list != AVPixelFormat::AV_PIX_FMT_NONE {
            if *list == AVPixelFormat::AV_PIX_FMT_D3D11 {
-                found = true;
-                break;
+                return AVPixelFormat::AV_PIX_FMT_D3D11;
            }
            list = list.add(1);
        }
-        if !found {
-            return AVPixelFormat::AV_PIX_FMT_NONE;
-        }
-        let device_ref = (*avctx).hw_device_ctx;
-        if device_ref.is_null() {
-            return AVPixelFormat::AV_PIX_FMT_NONE;
-        }
-        let frames_ref = av_hwframe_ctx_alloc(device_ref);
+        AVPixelFormat::AV_PIX_FMT_NONE
+    }
+}
+
+/// Predict whether D3D11VA decode will work by building the kind of pool libavcodec derives for the
+/// decoder: allocate an `AVHWFramesContext` (decoder-only, no shader-resource bind) and initialize
+/// it, which creates a real NV12 decode surface array. On a GPU/driver that can't create the pool
+/// this fails here, up front, so the session commits to software decode from the first frame (a
+/// clean, gap-free stream) rather than decoding the IDR then dying mid-stream on a texture error
+/// that a software demotion can't reliably recover from (an infinite GOP re-sends no IDR).
+unsafe fn d3d11va_decode_supported(hw_device: *mut ffmpeg::ffi::AVBufferRef) -> bool {
+    use ffmpeg::ffi::*;
+    unsafe {
+        let frames_ref = av_hwframe_ctx_alloc(hw_device);
        if frames_ref.is_null() {
-            return AVPixelFormat::AV_PIX_FMT_NONE;
+            return false;
        }
        let frames = (*frames_ref).data as *mut AVHWFramesContext;
        (*frames).format = AVPixelFormat::AV_PIX_FMT_D3D11;
-        let sw = if (*avctx).sw_pix_fmt != AVPixelFormat::AV_PIX_FMT_NONE {
-            (*avctx).sw_pix_fmt
-        } else {
-            AVPixelFormat::AV_PIX_FMT_NV12
-        };
-        (*frames).sw_format = sw;
-        (*frames).width = (*avctx).coded_width;
-        (*frames).height = (*avctx).coded_height;
-        // DPB + a few in-flight (decoded channel + the presenter's held frame); the host's
-        // zero-reorder stream needs only a small DPB, so 20 is comfortable headroom.
-        (*frames).initial_pool_size = 20;
+        (*frames).sw_format = AVPixelFormat::AV_PIX_FMT_NV12;
+        (*frames).width = 1920;
+        (*frames).height = 1152; // 128-aligned 1080p surface (the HEVC DXVA alignment, see get_format)
+        (*frames).initial_pool_size = DECODE_POOL_SIZE;
+        // Decoder-only, matching the pool libavcodec derives for the decoder.
        let fhw = (*frames).hwctx as *mut AVD3D11VAFramesContext;
-        (*fhw).bind_flags = D3D11_BIND_SHADER_RESOURCE;
+        (*fhw).bind_flags = BIND_DECODER;
        let r = av_hwframe_ctx_init(frames_ref);
-        if r < 0 {
-            let mut fr = frames_ref;
-            av_buffer_unref(&mut fr);
-            return AVPixelFormat::AV_PIX_FMT_NONE;
-        }
-        (*avctx).hw_frames_ctx = frames_ref; // decoder takes ownership
-        AVPixelFormat::AV_PIX_FMT_D3D11
+        let mut fr = frames_ref;
+        av_buffer_unref(&mut fr);
+        r >= 0
    }
 }

@@ -395,6 +483,8 @@ impl D3d11vaDecoder {
        if !shared.hardware {
            bail!("shared device is WARP (no hardware video decode)");
        }
+        // The adapter must expose the codec's DXVA profile — checked here, not at the first AU.
+        decode_profile_supported(&shared.device, codec_id)?;
        unsafe {
            // Build a D3D11VA hwdevice context around the *shared* device, so decoded textures live
            // on the same device the presenter samples + draws with.
@@ -417,6 +507,15 @@ impl D3d11vaDecoder {
                bail!("av_hwdevice_ctx_init: {}", ffmpeg::Error::from(r));
            }

+            // Up-front viability probe (see `d3d11va_decode_supported`): a GPU/driver that can't
+            // create the decode surface pool commits to software NOW, so it decodes cleanly from the
+            // first frame instead of failing mid-stream (which a demotion can't reliably recover).
+            if !d3d11va_decode_supported(hw_device) {
+                let mut hw = hw_device;
+                ffi::av_buffer_unref(&mut hw);
+                bail!("GPU can't create the D3D11VA decode surface pool — using software decode");
+            }
+
            let codec = ffi::avcodec_find_decoder(codec_id.into());
            if codec.is_null() {
                let mut hw = hw_device;
@@ -427,7 +526,11 @@ impl D3d11vaDecoder {
            (*ctx).hw_device_ctx = ffi::av_buffer_ref(hw_device);
            (*ctx).get_format = Some(get_format_d3d11);
            (*ctx).flags |= ffi::AV_CODEC_FLAG_LOW_DELAY as i32;
-            (*ctx).thread_count = 1; // hwaccel: threads only add latency
+            // hwaccel: threads only add latency.
+            (*ctx).thread_count = 1;
+            // On top of the DPB-based pool libavcodec sizes for us: the bounded decoded channel
+            // (2) + the frame the presenter holds until its copy flushes + margin.
+            (*ctx).extra_hw_frames = 4;
            let r = ffi::avcodec_open2(ctx, codec, ptr::null_mut());
            if r < 0 {
                let mut ctx = ctx;
@@ -499,6 +602,7 @@ impl D3d11vaDecoder {
                width: (*self.frame).width as u32,
                height: (*self.frame).height as u32,
                index: (*self.frame).data[1] as usize as u32,
+                ten_bit,
                hdr,
                guard: D3d11FrameGuard(cloned),
            };
@@ -532,7 +636,7 @@ fn log_layout_once(width: u32, height: u32, index: u32, hdr: bool, ten_bit: bool
            slice = index,
            hdr,
            ten_bit,
-            "D3D11VA first frame (zero-copy)"
+            "D3D11VA first frame"
        );
    }
 }
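
Three bits of arithmetic recur in the comments above: the 128-aligned probe surface (1080 rows become 1152), the P010 layout (10 bits in the high bits of 16), and the `⌈W/2⌉×⌈H/2⌉` chroma plane. The sketch below works them through; the helper names `align_up`, `to_p010`, and `chroma_dims` are hypothetical and do not appear in the commit.

```rust
/// Round `value` up to the next multiple of `align`, which must be a power of two.
fn align_up(value: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (value + align - 1) & !(align - 1)
}

/// Place a 10-bit sample in the high bits of a 16-bit word, as P010 stores it.
/// The low 6 bits are left at zero.
fn to_p010(sample10: u16) -> u16 {
    (sample10 & 0x03ff) << 6
}

/// Size of the interleaved chroma plane, in UV pairs, for a 4:2:0 frame.
/// Odd dimensions round up so the last luma column/row still has chroma.
fn chroma_dims(width: u32, height: u32) -> (u32, u32) {
    ((width + 1) / 2, (height + 1) / 2)
}

fn main() {
    // The probe's 1920x1152 surface: 1920 is already a multiple of 128, 1080 is not.
    println!("{}x{}", align_up(1920, 128), align_up(1080, 128)); // 1920x1152
    // Full-scale 10-bit white lands at 0xffc0, not 0x03ff.
    println!("{:#06x}", to_p010(0x03ff)); // 0xffc0
    // 1080p chroma plane: 960x540 UV pairs.
    println!("{:?}", chroma_dims(1920, 1080)); // (960, 540)
}
```

The shift by 6 is why a P010 plane cannot be sampled with the same SRV format as NV12: the texture and view formats have to follow `ten_bit`, as the `GpuFrame` and `CpuFrame` field comments note.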