feat(clients/windows): all-vendor video pipeline rewrite + app icon + hosts-page tiles

Decode+present rewrite (first real pixels on glass for this client):

- Decode: FFmpeg D3D11VA on NVIDIA/AMD/Intel. get_format now only returns AV_PIX_FMT_D3D11 and lets libavcodec build the decode pool from hw_device_ctx (hand-built frames contexts failed three different ways: NVIDIA rejects DECODER|SHADER_RESOURCE arrays, BindFlags=0 fails texture creation, Intel rejects non-128-aligned HEVC surfaces at the first SubmitDecoderBuffers). A DXVA profile probe before the hwdevice commits hardware-vs-software up front instead of burning the opening IDR; extra_hw_frames covers the frames the client holds.
- Present: the decoded slice is copied with ONE display-size-boxed CopySubresourceRegion (a planar slice is a single subresource in D3D11; the old two-copy D3D12-style code silently no-opped - the black screen) into a sampleable NV12/P010 texture, per-plane SRVs + YUV->RGB shaders.
- New dedicated render thread (render.rs): presenting is decoupled from the XAML thread; frame-latency-waitable swapchain + SetMaximumFrameLatency(1), newest-wins drain after the wait, crossbeam frame channel with pts for a capture->presented p50 log.
- HiDPI: pixel-sized buffers + SetMatrixTransform(96/dpi) - was blurry at 125/150 % scaling.
- Software fallback now feeds the same shaders (swscale -> NV12/P010 planes -> two dynamic plane textures); ps_rgba/X2BGR10 path deleted, hw/sw colour math identical.
- Adapter selection for hybrid boxes: PUNKTFUNK_ADAPTER > the window's monitor's adapter > default; PUNKTFUNK_D3D_DEBUG=1 debug layer.
- Session pump: request_keyframe at start and on hw->sw demotion (infinite GOP would otherwise sit on a black screen).

Validated live on the Arc Pro + RTX 3500 Ada laptop against the local Windows host: 60 fps D3D11VA on both vendors, software path, GUI on glass.

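For readers skimming the list above, here is a minimal sketch of three of the small calculations it mentions: the even-aligned display-size copy box, the 96/dpi scale, and the newest-wins drain. It is an illustration only, not the code in this commit. The function names (display_box, dip_scale, drain_newest) are hypothetical, and std::sync::mpsc stands in for the crossbeam channel so the sketch needs no external crates.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Even-aligned display-size extent for the copy box: (right, bottom).
/// NV12/P010 chroma is 2x2 subsampled, so odd luma extents round down.
/// Hypothetical helper, named for illustration only.
pub fn display_box(width: u32, height: u32) -> (u32, u32) {
    (width & !1, height & !1)
}

/// Scale that maps a pixel-sized buffer back to DIPs (96 DPI = 100 %).
/// DPI values below 96 are clamped to 96.
pub fn dip_scale(dpi: u32) -> f32 {
    96.0 / dpi.max(96) as f32
}

/// Newest-wins drain: take everything currently queued, keep only the last
/// item. Returns None when nothing was queued. Assumption: std mpsc is used
/// here as a stand-in for the crossbeam channel named in the commit.
pub fn drain_newest<T>(rx: &Receiver<T>) -> Option<T> {
    let mut newest = None;
    loop {
        match rx.try_recv() {
            // A later item replaces (and drops) the earlier one.
            Ok(frame) => newest = Some(frame),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => return newest,
        }
    }
}
```

With a 1920x1088 coded surface and a 1920x1080 display size, display_box(1920, 1080) gives the box extent that the single copy would use; at 150 % scaling (144 DPI) dip_scale returns 96/144. Calling drain_newest after the latency wait means stale frames are dropped before any GPU work is issued for them.
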
Also: embedded app icon (build.rs winresource + WM_SETICON, MSIX Square44x44 targetsize assets, pack-msix stages them) and the hosts-page tile rework: tap-to-connect tiles with a sibling overflow menu (fixes forget-also-connects), in-tile rename editor, add-host modal via root state.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:24:23 +02:00
parent 2c416a4bff
commit a4c84ac620
36 changed files with 1797 additions and 581 deletions
@@ -1,17 +1,29 @@
 //! Direct3D11 presenter for a WinUI 3 `SwapChainPanel`. It draws a decoded frame Contain-fit into a
 //! **composition** flip-model swapchain, which the reactor stream page binds to the panel via
-//! `SwapChainPanelHandle::set_swap_chain`.
+//! `SwapChainPanelHandle::set_swap_chain`. After that one UI-thread bind, the presenter lives on
+//! the dedicated render thread ([`crate::render`]) — presenting never touches (or is stalled by)
+//! the XAML thread.
 //!
-//! Two frame sources, one swapchain:
+//! Two frame sources, one pair of YUV shaders (identical colour math for both):
 //!
-//! * **GPU (zero-copy)** — [`crate::video::GpuFrame`] is a decoder-owned NV12/P010 `ID3D11Texture2D`
-//!   array slice (D3D11VA). We create per-plane shader-resource views over the slice and convert
-//!   YUV→RGB in a pixel shader: NV12 via BT.709 (`ps_nv12`), P010 via BT.2020 with the PQ transfer
-//!   left intact (`ps_p010`). No CPU copy. The decoder uses the **same** shared device
-//!   ([`crate::gpu`]) so the texture is bindable here.
-//! * **CPU upload** — [`crate::video::CpuFrame`] is packed RGBA (SDR) or X2BGR10 (HDR) from the
-//!   software decoder; we upload it into a dynamic texture and draw it with a passthrough shader
-//!   (`ps_rgba`). The fallback path.
+//! * **GPU (D3D11VA)** — [`crate::video::GpuFrame`] is a slice of the decoder-only NV12/P010
+//!   texture array. One `CopySubresourceRegion` with a display-size box moves the slice — **both
+//!   planes; in D3D11 a planar slice is a single subresource** (unlike D3D12) — into our
+//!   sampleable texture, which per-plane SRVs (R8/R8G8, R16/R16G16) expose to the shaders. The
+//!   source box is mandatory: the decode array is coded-size (e.g. 1920×1088), the target
+//!   display-size (1920×1080), and D3D11 silently drops size-mismatched full-resource copies.
+//! * **CPU upload** — [`crate::video::CpuFrame`] carries NV12/P010 planes from the software
+//!   decoder; they upload into two dynamic plane textures feeding the same SRV slots/shaders.
+//!
+//! **Pacing**: the swapchain is created with `DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT`
+//! and `SetMaximumFrameLatency(1)` (flagless fallback for odd drivers). The render thread waits
+//! on the latency waitable before drawing, so at most one present is ever queued (minimum compose
+//! latency) and a stream faster than the display drops frames *before* any GPU work. Every
+//! `ResizeBuffers` must re-pass the creation flags — that's `swap_flags`.
+//!
+//! **HiDPI**: buffers are sized in physical pixels and `IDXGISwapChain2::SetMatrixTransform`
+//! (scale 96/DPI) maps them to the panel's DIP coordinate space — without it XAML samples a
+//! DIP-sized buffer up and the video is blurry at 125/150 % scaling.
 //!
 //! **HDR10**: when a frame is BT.2020 PQ the swapchain flips to `R10G10B10A2` +
 //! `DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020` (+ HDR10 metadata) via `ResizeBuffers`/
@@ -21,21 +33,23 @@
 //! All `windows` types here come from the same windows-rs commit as `windows-reactor`, so the
 //! `IDXGISwapChain1` handed to `set_swap_chain` satisfies reactor's `windows_core::Interface`.

-use crate::video::{DecodedFrame, GpuFrame};
+use crate::video::{CpuFrame, DecodedFrame, GpuFrame};
 use anyhow::{anyhow, Context, Result};
 use windows::core::{Interface, PCSTR};
+use windows::Win32::Foundation::{CloseHandle, HANDLE, WAIT_OBJECT_0};
 use windows::Win32::Graphics::Direct3D::Fxc::{D3DCompile, D3DCOMPILE_OPTIMIZATION_LEVEL3};
 use windows::Win32::Graphics::Direct3D::{
-    ID3DBlob, D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST, D3D_SRV_DIMENSION_TEXTURE2DARRAY,
+    ID3DBlob, D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST, D3D_SRV_DIMENSION_TEXTURE2D,
 };
 use windows::Win32::Graphics::Direct3D11::*;
 use windows::Win32::Graphics::Dxgi::Common::*;
 use windows::Win32::Graphics::Dxgi::*;
+use windows::Win32::System::Threading::WaitForSingleObject;

-// One vertex shader (fullscreen triangle) + three pixel shaders, selected per frame source. tex0 is
-// RGBA (passthrough) or the luma plane; tex1 is the chroma plane. The YUV→RGB matrices fold the
-// limited→full range scale into the coefficients; for P010 the R16 sample is rescaled (×65535/65472)
-// to undo the 10-bits-in-the-high-bits packing, then converted with BT.2020 NCL, PQ preserved.
+// One vertex shader (fullscreen triangle) + two pixel shaders, selected per frame colour space.
+// tex0 is the luma plane, tex1 the chroma plane. The YUV→RGB matrices fold the limited→full range
+// scale into the coefficients; for P010 the R16 sample is rescaled (×65535/65472) to undo the
+// 10-bits-in-the-high-bits packing, then converted with BT.2020 NCL, PQ preserved.
 const SHADER_HLSL: &str = r#"
 struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };
 VSOut vs_main(uint vid : SV_VertexID) {
@@ -49,8 +63,6 @@ Texture2D tex0 : register(t0);
 Texture2D tex1 : register(t1);
 SamplerState smp : register(s0);

-float4 ps_rgba(VSOut i) : SV_Target { return tex0.Sample(smp, i.uv); }
-
 float4 ps_nv12(VSOut i) : SV_Target {
    float  y  = tex0.Sample(smp, i.uv).r;
    float2 uv = tex1.Sample(smp, i.uv).rg;
@@ -77,46 +89,53 @@ float4 ps_p010(VSOut i) : SV_Target {
 }
 "#;

-/// A bound GPU frame: per-plane SRVs over the decoder's texture-array slice, plus the `GpuFrame`
-/// itself kept alive so the decoder won't recycle the slice while we re-present it.
-struct GpuView {
+/// The currently bound frame: per-plane SRVs (over the GPU sample texture or the CPU plane
+/// textures) + the colour space that picks the shader. Redraws (resize, letterbox) re-present it.
+struct Bound {
    y: ID3D11ShaderResourceView,
    c: ID3D11ShaderResourceView,
-    /// Held only for its `Drop` (returns the decoder surface to the reuse pool) — never read.
-    #[allow(dead_code)]
-    frame: GpuFrame,
-}
-
-/// Current draw source.
-#[derive(Clone, Copy, PartialEq)]
-enum Mode {
-    Empty,
-    Rgba,
-    Nv12,
-    P010,
+    hdr: bool,
 }

 pub struct Presenter {
    device: ID3D11Device,
    context: ID3D11DeviceContext,
    vs: ID3D11VertexShader,
-    ps_rgba: ID3D11PixelShader,
    ps_nv12: ID3D11PixelShader,
    ps_p010: ID3D11PixelShader,
    sampler: ID3D11SamplerState,
    swap: IDXGISwapChain1,
+    /// Creation flags — MUST be re-passed to every `ResizeBuffers` or it fails.
+    swap_flags: u32,
+    /// The frame-latency waitable (owned; closed in `Drop`), `None` on the flagless fallback.
+    waitable: Option<HANDLE>,
    rtv: Option<ID3D11RenderTargetView>,
-    /// CPU-upload texture + SRV + dimensions; recreated when the decoded size/format changes.
-    cpu_tex: Option<(ID3D11Texture2D, ID3D11ShaderResourceView, u32, u32)>,
-    /// Bound zero-copy GPU frame (held to keep its decoder surface alive).
-    gpu: Option<GpuView>,
-    mode: Mode,
+    /// GPU path: sampleable copy target for the decoded slice — `(tex, w, h, ten_bit)`, recreated
+    /// when the decoded size/bit depth changes. Format must equal the decode array's (NV12/P010).
+    sample_tex: Option<(ID3D11Texture2D, u32, u32, bool)>,
+    /// The last GPU frame, held until the NEXT bind so its decode surface stays out of the reuse
+    /// pool at least until this frame's copy has been queued ahead of any later decoder write.
+    gpu_frame: Option<GpuFrame>,
+    /// CPU path: dynamic luma + chroma plane textures + their SRVs — `(y, uv, y_srv, uv_srv, w, h,
+    /// ten_bit)`, recreated when the decoded size/bit depth changes.
+    #[allow(clippy::type_complexity)]
+    plane_tex: Option<(
+        ID3D11Texture2D,
+        ID3D11Texture2D,
+        ID3D11ShaderResourceView,
+        ID3D11ShaderResourceView,
+        u32,
+        u32,
+        bool,
+    )>,
+    bound: Option<Bound>,
    /// Source frame dimensions, for the Contain-fit letterbox.
    src_w: u32,
    src_h: u32,
-    /// Panel (swapchain) size in pixels, updated on resize.
+    /// Panel (swapchain) size in physical pixels + the window DPI, updated on resize.
    panel_w: u32,
    panel_h: u32,
+    dpi: u32,
    /// Whether the swapchain is currently in 10-bit HDR10 (R10G10B10A2 + ST.2084) mode.
    hdr: bool,
    /// The source's static HDR mastering metadata received over the protocol (`0xCE`), applied via
@@ -126,45 +145,71 @@ pub struct Presenter {
 }

 /// Latest source HDR mastering metadata, written by the session pump (`session.rs`, the sole
-/// `next_hdr_meta` consumer) and read by `present_newest` on the UI thread — decoupled so the
+/// `next_hdr_meta` consumer) and read by the render thread before each present — decoupled so the
 /// presenter doesn't need the connector. One session at a time on the client, so a single slot.
 pub static LATEST_HDR_META: std::sync::Mutex<Option<punktfunk_core::quic::HdrMeta>> =
    std::sync::Mutex::new(None);

 impl Presenter {
    /// Create the presenter on the process-wide shared D3D11 device (the one the decoder uses), plus
-    /// the composition swapchain + shaders, sized to the panel.
-    pub fn new(width: u32, height: u32) -> Result<Presenter> {
+    /// the composition swapchain + shaders, sized to the panel in physical pixels at `dpi`.
+    pub fn new(width: u32, height: u32, dpi: u32) -> Result<Presenter> {
        let shared = crate::gpu::shared().ok_or_else(|| anyhow!("no shared D3D11 device"))?;
        let device = shared.device.clone();
        let context = shared.context.clone();
-        let (vs, ps_rgba, ps_nv12, ps_p010, sampler) = build_pipeline(&device)?;
-        let swap = create_composition_swapchain(&device, width.max(1), height.max(1))?;
-        Ok(Presenter {
+        let (vs, ps_nv12, ps_p010, sampler) = build_pipeline(&device)?;
+        let (swap, swap_flags) =
+            create_composition_swapchain(&device, width.max(1), height.max(1))?;
+        // ≤1 queued present: the render thread blocks on the waitable, so a frame is only drawn
+        // when the compositor is ready to take it — the newest-wins drain happens after the wait.
+        let waitable = (swap_flags & DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT.0 as u32
+            != 0)
+            .then(|| unsafe {
+                let sc2: IDXGISwapChain2 = swap.cast().ok()?;
+                sc2.SetMaximumFrameLatency(1).ok()?;
+                let h = sc2.GetFrameLatencyWaitableObject();
+                (!h.is_invalid()).then_some(h)
+            })
+            .flatten();
+        let p = Presenter {
            device,
            context,
            vs,
-            ps_rgba,
            ps_nv12,
            ps_p010,
            sampler,
            swap,
+            swap_flags,
+            waitable,
            rtv: None,
-            cpu_tex: None,
-            gpu: None,
-            mode: Mode::Empty,
+            sample_tex: None,
+            gpu_frame: None,
+            plane_tex: None,
+            bound: None,
            src_w: 1,
            src_h: 1,
            panel_w: width.max(1),
            panel_h: height.max(1),
+            dpi: dpi.max(96),
            hdr: false,
            hdr_meta: None,
-        })
+        };
+        p.apply_dpi_matrix();
+        Ok(p)
+    }
+
+    /// Block until the swapchain can take another present (≤ `timeout_ms`). True when a present
+    /// slot is free; also true on the flagless fallback (no throttle available, just present).
+    pub fn wait_present_slot(&self, timeout_ms: u32) -> bool {
+        match self.waitable {
+            Some(h) => unsafe { WaitForSingleObject(h, timeout_ms) == WAIT_OBJECT_0 },
+            None => true,
+        }
    }

    /// Update the source HDR mastering metadata (from the `0xCE` plane). Stored for the next HDR
    /// swapchain switch, and applied immediately if already presenting HDR. A no-op when unchanged
-    /// (so it's cheap to call every frame from the present loop).
+    /// (so it's cheap to call every frame from the render loop).
    pub fn set_hdr_metadata(&mut self, meta: punktfunk_core::quic::HdrMeta) {
        if self.hdr_meta == Some(meta) {
            return;
@@ -180,28 +225,54 @@ impl Presenter {
        &self.swap
    }

-    /// Resize the back buffers to the panel's new size (drops the stale RTV).
-    pub fn resize(&mut self, width: u32, height: u32) {
-        if width == 0 || height == 0 || (width == self.panel_w && height == self.panel_h) {
+    /// Resize the back buffers to the panel's new size in physical pixels at `dpi` (drops the
+    /// stale RTV, re-applies the DIP↔pixel matrix).
+    pub fn resize(&mut self, width: u32, height: u32, dpi: u32) {
+        let dpi = dpi.max(96);
+        if width == 0
+            || height == 0
+            || (width == self.panel_w && height == self.panel_h && dpi == self.dpi)
+        {
            return;
        }
        self.rtv = None; // release all back-buffer refs before ResizeBuffers
        unsafe {
-            let _ = self.swap.ResizeBuffers(
+            if let Err(e) = self.swap.ResizeBuffers(
                0,
                width,
                height,
                DXGI_FORMAT_UNKNOWN,
-                DXGI_SWAP_CHAIN_FLAG(0),
-            );
+                DXGI_SWAP_CHAIN_FLAG(self.swap_flags as i32),
+            ) {
+                tracing::warn!(error = %e, "ResizeBuffers failed");
+                return;
+            }
        }
        self.panel_w = width;
        self.panel_h = height;
+        self.dpi = dpi;
+        self.apply_dpi_matrix();
    }

-    /// Present one decoded frame (Contain-fit) — or, when `frame` is `None`, re-present the last one
-    /// (or black). Called from the reactor `on_rendering` per-frame callback on the UI thread. Takes
-    /// the frame by value so the GPU path can retain the decoder surface across re-presents.
+    /// Map the pixel-sized buffers into the panel's DIP coordinate space (scale 96/DPI) — XAML
+    /// otherwise stretches whatever size the buffers are to the panel's DIP bounds (blurry).
+    fn apply_dpi_matrix(&self) {
+        let s = 96.0 / self.dpi as f32;
+        if let Ok(sc2) = self.swap.cast::<IDXGISwapChain2>() {
+            let m = DXGI_MATRIX_3X2_F {
+                _11: s,
+                _22: s,
+                ..Default::default()
+            };
+            if let Err(e) = unsafe { sc2.SetMatrixTransform(&m) } {
+                tracing::warn!(error = %e, "SetMatrixTransform failed");
+            }
+        }
+    }
+
+    /// Present one decoded frame (Contain-fit) — or, when `frame` is `None`, re-present the last
+    /// one (or black). Called from the render thread. Takes the frame by value: the GPU path
+    /// retains the decoder surface until the next bind.
    pub fn present(&mut self, frame: Option<DecodedFrame>) {
        match frame {
            Some(DecodedFrame::Cpu(c)) => {
@@ -210,20 +281,14 @@ impl Presenter {
                }
                if let Err(e) = self.upload(&c) {
                    tracing::warn!(error = %e, "frame upload failed");
-                } else {
-                    self.mode = Mode::Rgba;
-                    self.src_w = c.width;
-                    self.src_h = c.height;
-                    self.gpu = None; // drop any held GPU frame
                }
            }
            Some(DecodedFrame::Gpu(g)) => {
                if g.hdr != self.hdr {
                    self.set_hdr(g.hdr);
                }
-                match self.bind_gpu(g) {
-                    Ok(()) => {}
-                    Err(e) => tracing::warn!(error = %e, "GPU frame bind failed"),
+                if let Err(e) = self.bind_gpu(g) {
+                    tracing::warn!(error = %e, "GPU frame bind failed");
                }
            }
            None => {}
@@ -231,46 +296,102 @@ impl Presenter {
        self.draw();
    }

-    /// Build per-plane SRVs over the decoded texture-array slice and retain the frame.
+    /// Copy the decoded slice into our sampleable texture and build per-plane SRVs over it. The
+    /// decode array is decoder-only (NVIDIA won't bind a decoder array as a shader resource), so
+    /// it can't be sampled directly — one GPU-to-GPU copy makes the frame sampleable on every
+    /// vendor. D3D11 planar semantics: the slice is ONE subresource (both planes copy together),
+    /// and the source box is display-size (the array is coded-size; a full-resource copy would
+    /// size-mismatch and be silently dropped).
    fn bind_gpu(&mut self, g: GpuFrame) -> Result<()> {
-        let tex: ID3D11Texture2D = unsafe {
+        let src: ID3D11Texture2D = unsafe {
            let raw = g.texture_ptr();
            ID3D11Texture2D::from_raw_borrowed(&raw)
                .ok_or_else(|| anyhow!("null D3D11 texture"))?
                .clone()
        };
-        // NV12: R8 luma + R8G8 chroma. P010: R16 luma + R16G16 chroma (10 bits in the high bits).
-        let (fy, fc) = if g.hdr {
-            (DXGI_FORMAT_R16_UNORM, DXGI_FORMAT_R16G16_UNORM)
-        } else {
-            (DXGI_FORMAT_R8_UNORM, DXGI_FORMAT_R8G8_UNORM)
+        self.ensure_sample_tex(g.width, g.height, g.ten_bit)?;
+        let dst = self.sample_tex.as_ref().unwrap().0.clone();
+        // Even-aligned luma coordinates (NV12/P010 chroma is 2×2 subsampled).
+        let src_box = D3D11_BOX {
+            left: 0,
+            top: 0,
+            front: 0,
+            right: g.width & !1,
+            bottom: g.height & !1,
+            back: 1,
        };
-        let y = self.array_srv(&tex, fy, g.index)?;
-        let c = self.array_srv(&tex, fc, g.index)?;
-        self.mode = if g.hdr { Mode::P010 } else { Mode::Nv12 };
+        unsafe {
+            self.context
+                .CopySubresourceRegion(&dst, 0, 0, 0, 0, &src, g.index, Some(&src_box));
+        }
+        let (fy, fc) = plane_formats(g.ten_bit);
+        let y = self.plane_srv(&dst, fy)?;
+        let c = self.plane_srv(&dst, fc)?;
+        if g.ten_bit != g.hdr {
+            warn_bitdepth_mismatch_once(g.ten_bit, g.hdr);
+        }
        self.src_w = g.width;
        self.src_h = g.height;
-        self.gpu = Some(GpuView { y, c, frame: g });
+        self.bound = Some(Bound { y, c, hdr: g.hdr });
+        // Hold the frame until the next bind: its decode surface stays out of the reuse pool
+        // until this copy is queued ahead of any later decoder write (previous frame drops here).
+        self.gpu_frame = Some(g);
        Ok(())
    }

-    /// A shader-resource view over a single slice of a texture array, reinterpreting the plane
-    /// format (the NV12/P010 sub-format trick D3D11 allows on video textures).
-    fn array_srv(
+    /// Ensure the sampleable copy texture matches the decoded frame's size + bit depth (NV12 for
+    /// 8-bit, P010 for 10-bit — the same format as the decode array, a `CopySubresourceRegion`
+    /// requirement), recreating it on a change.
+    fn ensure_sample_tex(&mut self, w: u32, h: u32, ten_bit: bool) -> Result<()> {
+        if matches!(&self.sample_tex, Some((_, tw, th, tb)) if *tw == w && *th == h && *tb == ten_bit)
+        {
+            return Ok(());
+        }
+        let desc = D3D11_TEXTURE2D_DESC {
+            Width: w,
+            Height: h,
+            MipLevels: 1,
+            ArraySize: 1,
+            Format: if ten_bit {
+                DXGI_FORMAT_P010
+            } else {
+                DXGI_FORMAT_NV12
+            },
+            SampleDesc: DXGI_SAMPLE_DESC {
+                Count: 1,
+                Quality: 0,
+            },
+            Usage: D3D11_USAGE_DEFAULT,
+            BindFlags: D3D11_BIND_SHADER_RESOURCE.0 as u32,
+            CPUAccessFlags: 0,
+            MiscFlags: 0,
+        };
+        let tex = unsafe {
+            let mut t = None;
+            self.device
+                .CreateTexture2D(&desc, None, Some(&mut t))
+                .context("CreateTexture2D (sample target)")?;
+            t.ok_or_else(|| anyhow!("null sample texture"))?
+        };
+        self.sample_tex = Some((tex, w, h, ten_bit));
+        Ok(())
+    }
+
+    /// A shader-resource view over one plane of a single (non-array) NV12/P010 texture — the
+    /// R8/R8G8 (or R16/R16G16) format selects the luma vs. chroma plane (the D3D11 video
+    /// sub-format trick).
+    fn plane_srv(
        &self,
        tex: &ID3D11Texture2D,
        format: DXGI_FORMAT,
-        slice: u32,
    ) -> Result<ID3D11ShaderResourceView> {
        let desc = D3D11_SHADER_RESOURCE_VIEW_DESC {
            Format: format,
-            ViewDimension: D3D_SRV_DIMENSION_TEXTURE2DARRAY,
+            ViewDimension: D3D_SRV_DIMENSION_TEXTURE2D,
            Anonymous: D3D11_SHADER_RESOURCE_VIEW_DESC_0 {
-                Texture2DArray: D3D11_TEX2D_ARRAY_SRV {
+                Texture2D: D3D11_TEX2D_SRV {
                    MostDetailedMip: 0,
                    MipLevels: 1,
-                    FirstArraySlice: slice,
-                    ArraySize: 1,
                },
            },
        };
@@ -278,37 +399,109 @@ impl Presenter {
            let mut srv = None;
            self.device
                .CreateShaderResourceView(tex, Some(&desc), Some(&mut srv))
-                .context("CreateShaderResourceView (array slice)")?;
+                .context("CreateShaderResourceView (plane)")?;
            srv.ok_or_else(|| anyhow!("null SRV"))
        }
    }

+    /// Upload a software-decoded frame's two planes into the dynamic plane textures (created to
+    /// match size/bit depth), feeding the same SRV slots + shaders as the GPU path.
+    fn upload(&mut self, frame: &CpuFrame) -> Result<()> {
+        let (w, h) = (frame.width, frame.height);
+        let rebuild = !matches!(&self.plane_tex,
+            Some((.., tw, th, tb)) if *tw == w && *th == h && *tb == frame.ten_bit);
+        if rebuild {
+            let (fy, fc) = plane_formats(frame.ten_bit);
+            let y = self.dynamic_tex(w, h, fy)?;
+            let uv = self.dynamic_tex(w.div_ceil(2), h.div_ceil(2), fc)?;
+            let y_srv = self.plane_srv(&y, fy)?;
+            let uv_srv = self.plane_srv(&uv, fc)?;
+            self.plane_tex = Some((y, uv, y_srv, uv_srv, w, h, frame.ten_bit));
+        }
+        let (y, uv, y_srv, uv_srv, ..) = self.plane_tex.as_ref().unwrap();
+        let bytes = if frame.ten_bit { 2 } else { 1 };
+        self.map_rows(y, &frame.y, frame.y_stride, w as usize * bytes, h as usize)?;
+        self.map_rows(
+            uv,
+            &frame.uv,
+            frame.uv_stride,
+            w.div_ceil(2) as usize * 2 * bytes,
+            h.div_ceil(2) as usize,
+        )?;
+        self.src_w = w;
+        self.src_h = h;
+        self.bound = Some(Bound {
+            y: y_srv.clone(),
+            c: uv_srv.clone(),
+            hdr: frame.hdr,
+        });
+        self.gpu_frame = None; // drop any held GPU frame
+        Ok(())
+    }
+
+    fn dynamic_tex(&self, w: u32, h: u32, format: DXGI_FORMAT) -> Result<ID3D11Texture2D> {
+        let desc = D3D11_TEXTURE2D_DESC {
+            Width: w,
+            Height: h,
+            MipLevels: 1,
+            ArraySize: 1,
+            Format: format,
+            SampleDesc: DXGI_SAMPLE_DESC {
+                Count: 1,
+                Quality: 0,
+            },
+            Usage: D3D11_USAGE_DYNAMIC,
+            BindFlags: D3D11_BIND_SHADER_RESOURCE.0 as u32,
+            CPUAccessFlags: D3D11_CPU_ACCESS_WRITE.0 as u32,
+            MiscFlags: 0,
+        };
+        unsafe {
+            let mut t = None;
+            self.device
+                .CreateTexture2D(&desc, None, Some(&mut t))
+                .context("CreateTexture2D (plane)")?;
+            t.ok_or_else(|| anyhow!("null plane texture"))
+        }
+    }
+
+    /// Map-discard `tex` and copy `rows` rows of `row_bytes` from `src` (stride `src_pitch`).
+    fn map_rows(
+        &self,
+        tex: &ID3D11Texture2D,
+        src: &[u8],
+        src_pitch: usize,
+        row_bytes: usize,
+        rows: usize,
+    ) -> Result<()> {
+        unsafe {
+            let mut mapped = D3D11_MAPPED_SUBRESOURCE::default();
+            self.context
+                .Map(tex, 0, D3D11_MAP_WRITE_DISCARD, 0, Some(&mut mapped))
+                .context("Map plane texture")?;
+            let dst = mapped.pData as *mut u8;
+            let dst_pitch = mapped.RowPitch as usize;
+            let n = row_bytes.min(src_pitch);
+            for r in 0..rows {
+                std::ptr::copy_nonoverlapping(
+                    src.as_ptr().add(r * src_pitch),
+                    dst.add(r * dst_pitch),
+                    n,
+                );
+            }
+            self.context.Unmap(tex, 0);
+        }
+        Ok(())
+    }
+
    fn draw(&mut self) {
        let Ok(rtv) = self.rtv() else {
            return;
        };
        let (pw, ph) = (self.panel_w, self.panel_h);
-        // Resolve the current source's shader + the (up to two) SRVs to bind — cheap interface
-        // clones. Each arm yields `Option<(&pixel_shader, [Option<SRV>; 2])>`.
-        let binding = match self.mode {
-            Mode::Rgba => self
-                .cpu_tex
-                .as_ref()
-                .map(|(_, srv, _, _)| (&self.ps_rgba, [Some(srv.clone()), None])),
-            Mode::Nv12 => self
-                .gpu
-                .as_ref()
-                .map(|g| (&self.ps_nv12, [Some(g.y.clone()), Some(g.c.clone())])),
-            Mode::P010 => self
-                .gpu
-                .as_ref()
-                .map(|g| (&self.ps_p010, [Some(g.y.clone()), Some(g.c.clone())])),
-            Mode::Empty => None,
-        };
        unsafe {
            let c = &self.context;
            c.ClearRenderTargetView(&rtv, &[0.0, 0.0, 0.0, 1.0]);
-            if let Some((ps, srvs)) = binding {
+            if let Some(bound) = &self.bound {
                // Contain-fit viewport: scale to the smaller axis, centre, letterbox the rest.
                let (ww, wh, vfw, vfh) = (
                    pw as f32,
@@ -332,8 +525,15 @@ impl Presenter {
                c.IASetInputLayout(None);
                c.IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
                c.VSSetShader(&self.vs, None);
-                c.PSSetShader(ps, None);
-                c.PSSetShaderResources(0, Some(&srvs));
+                c.PSSetShader(
+                    if bound.hdr {
+                        &self.ps_p010
+                    } else {
+                        &self.ps_nv12
+                    },
+                    None,
+                );
+                c.PSSetShaderResources(0, Some(&[Some(bound.y.clone()), Some(bound.c.clone())]));
                c.PSSetSamplers(0, Some(&[Some(self.sampler.clone())]));
                c.Draw(3, 0);
            }
@@ -347,7 +547,6 @@ impl Presenter {
    /// PQ-encoded BT.2020 for HDR, so the colour space is all the compositor needs.
    fn set_hdr(&mut self, on: bool) {
        self.rtv = None; // release back-buffer refs before ResizeBuffers
-        self.cpu_tex = None; // CPU texture format changes (R10G10B10A2 vs R8G8B8A8)
        let format = if on {
            DXGI_FORMAT_R10G10B10A2_UNORM
        } else {
@@ -359,7 +558,7 @@ impl Presenter {
                self.panel_w,
                self.panel_h,
                format,
-                DXGI_SWAP_CHAIN_FLAG(0),
+                DXGI_SWAP_CHAIN_FLAG(self.swap_flags as i32),
            ) {
                tracing::warn!(error = %e, "ResizeBuffers for HDR switch failed");
                return;
@@ -389,6 +588,7 @@ impl Presenter {
                self.apply_hdr_metadata();
            }
        }
+        self.apply_dpi_matrix(); // belt-and-braces: keep the DIP mapping across the format switch
        tracing::info!(hdr = on, "swapchain colour mode switched");
    }

@@ -410,68 +610,6 @@ impl Presenter {
        }
    }

-    fn upload(&mut self, frame: &crate::video::CpuFrame) -> Result<()> {
-        let (w, h) = (frame.width, frame.height);
-        let need_new = !matches!(&self.cpu_tex, Some((_, _, tw, th)) if *tw == w && *th == h);
-        if need_new {
-            let format = if self.hdr {
-                DXGI_FORMAT_R10G10B10A2_UNORM
-            } else {
-                DXGI_FORMAT_R8G8B8A8_UNORM
-            };
-            let desc = D3D11_TEXTURE2D_DESC {
-                Width: w,
-                Height: h,
-                MipLevels: 1,
-                ArraySize: 1,
-                Format: format,
-                SampleDesc: DXGI_SAMPLE_DESC {
-                    Count: 1,
-                    Quality: 0,
-                },
-                Usage: D3D11_USAGE_DYNAMIC,
-                BindFlags: D3D11_BIND_SHADER_RESOURCE.0 as u32,
-                CPUAccessFlags: D3D11_CPU_ACCESS_WRITE.0 as u32,
-                MiscFlags: 0,
-            };
-            let texture = unsafe {
-                let mut t = None;
-                self.device
-                    .CreateTexture2D(&desc, None, Some(&mut t))
-                    .context("CreateTexture2D")?;
-                t.unwrap()
-            };
-            let srv = unsafe {
-                let mut s = None;
-                self.device
-                    .CreateShaderResourceView(&texture, None, Some(&mut s))
-                    .context("CreateShaderResourceView")?;
-                s.unwrap()
-            };
-            self.cpu_tex = Some((texture, srv, w, h));
-        }
-        let (texture, _, _, _) = self.cpu_tex.as_ref().unwrap();
-        unsafe {
-            let mut mapped = D3D11_MAPPED_SUBRESOURCE::default();
-            self.context
-                .Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, Some(&mut mapped))
-                .context("Map video texture")?;
-            let dst = mapped.pData as *mut u8;
-            let dst_pitch = mapped.RowPitch as usize;
-            let src_pitch = frame.stride;
-            let row_bytes = (w as usize) * 4;
-            for y in 0..h as usize {
-                std::ptr::copy_nonoverlapping(
-                    frame.pixels.as_ptr().add(y * src_pitch),
-                    dst.add(y * dst_pitch),
-                    row_bytes.min(src_pitch),
-                );
-            }
-            self.context.Unmap(texture, 0);
-        }
-        Ok(())
-    }
-
    fn rtv(&mut self) -> Result<ID3D11RenderTargetView> {
        if self.rtv.is_none() {
            let back: ID3D11Texture2D = unsafe { self.swap.GetBuffer(0).context("GetBuffer")? };
@@ -488,18 +626,53 @@ impl Presenter {
    }
 }

-/// A composition flip-model swapchain (no HWND) for binding to a XAML `SwapChainPanel`.
+impl Drop for Presenter {
+    fn drop(&mut self) {
+        if let Some(h) = self.waitable.take() {
+            unsafe {
+                let _ = CloseHandle(h);
+            }
+        }
+    }
+}
+
+/// Luma + chroma plane view formats for NV12 (8-bit) vs P010 (10-in-16-bit).
+fn plane_formats(ten_bit: bool) -> (DXGI_FORMAT, DXGI_FORMAT) {
+    if ten_bit {
+        (DXGI_FORMAT_R16_UNORM, DXGI_FORMAT_R16G16_UNORM)
+    } else {
+        (DXGI_FORMAT_R8_UNORM, DXGI_FORMAT_R8G8_UNORM)
+    }
+}
+
+/// The host couples 10-bit ⟺ HDR today; a mismatch means the shader's transfer/matrix assumption
+/// is off for this stream (rendered anyway — approximate colour beats no picture).
+fn warn_bitdepth_mismatch_once(ten_bit: bool, hdr: bool) {
+    use std::sync::atomic::{AtomicBool, Ordering};
+    static ONCE: AtomicBool = AtomicBool::new(true);
+    if ONCE.swap(false, Ordering::Relaxed) {
+        tracing::warn!(
+            ten_bit,
+            hdr,
+            "bit depth / HDR mismatch — colour may be approximate"
+        );
+    }
+}
+
+/// A composition flip-model swapchain (no HWND) for binding to a XAML `SwapChainPanel`, with the
+/// frame-latency waitable when the driver allows it. Returns the swapchain + the flags it was
+/// created with (every `ResizeBuffers` must re-pass them).
 fn create_composition_swapchain(
    device: &ID3D11Device,
    width: u32,
    height: u32,
-) -> Result<IDXGISwapChain1> {
+) -> Result<(IDXGISwapChain1, u32)> {
    let dxdev: IDXGIDevice = device.cast().context("IDXGIDevice cast")?;
    let factory: IDXGIFactory2 = unsafe {
        let adapter = dxdev.GetAdapter().context("GetAdapter")?;
        adapter.GetParent().context("GetParent (IDXGIFactory2)")?
    };
-    let desc = DXGI_SWAP_CHAIN_DESC1 {
+    let mut desc = DXGI_SWAP_CHAIN_DESC1 {
        Width: width,
        Height: height,
        Format: DXGI_FORMAT_B8G8R8A8_UNORM,
@@ -512,16 +685,24 @@ fn create_composition_swapchain(
        BufferCount: 2,
        Scaling: DXGI_SCALING_STRETCH,
        SwapEffect: DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL,
-        // IGNORE (opaque), not PREMULTIPLIED: the video fills the panel and the HDR `X2BGR10`
-        // upload leaves the 2 padding/alpha bits 0 — premultiplied alpha would then make HDR frames
-        // transparent. Opaque is correct for a full-frame video surface either way.
+        // IGNORE (opaque), not PREMULTIPLIED: the video fills the panel with opaque RGB either way.
        AlphaMode: DXGI_ALPHA_MODE_IGNORE,
-        Flags: 0,
+        Flags: DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT.0 as u32,
    };
    unsafe {
-        factory
-            .CreateSwapChainForComposition(device, &desc, None)
-            .context("CreateSwapChainForComposition")
+        match factory.CreateSwapChainForComposition(device, &desc, None) {
+            Ok(sc) => Ok((sc, desc.Flags)),
+            Err(e) => {
+                // Odd driver/WARP combinations can reject the waitable — fall back to plain
+                // Present(1) pacing rather than failing the stream page.
+                tracing::warn!(error = %e, "waitable swapchain rejected — creating without");
+                desc.Flags = 0;
+                let sc = factory
+                    .CreateSwapChainForComposition(device, &desc, None)
+                    .context("CreateSwapChainForComposition")?;
+                Ok((sc, 0))
+            }
+        }
    }
 }

@@ -531,11 +712,9 @@ fn build_pipeline(
    ID3D11VertexShader,
    ID3D11PixelShader,
    ID3D11PixelShader,
-    ID3D11PixelShader,
    ID3D11SamplerState,
 )> {
    let vs_blob = compile(SHADER_HLSL, "vs_main", "vs_5_0")?;
-    let rgba_blob = compile(SHADER_HLSL, "ps_rgba", "ps_5_0")?;
    let nv12_blob = compile(SHADER_HLSL, "ps_nv12", "ps_5_0")?;
    let p010_blob = compile(SHADER_HLSL, "ps_p010", "ps_5_0")?;
    unsafe {
@@ -543,10 +722,6 @@ fn build_pipeline(
        device
            .CreateVertexShader(blob_bytes(&vs_blob), None, Some(&mut vs))
            .context("CreateVertexShader")?;
-        let mut ps_rgba = None;
-        device
-            .CreatePixelShader(blob_bytes(&rgba_blob), None, Some(&mut ps_rgba))
-            .context("CreatePixelShader (rgba)")?;
        let mut ps_nv12 = None;
        device
            .CreatePixelShader(blob_bytes(&nv12_blob), None, Some(&mut ps_nv12))
@@ -569,7 +744,6 @@ fn build_pipeline(
            .context("CreateSamplerState")?;
        Ok((
            vs.unwrap(),
-            ps_rgba.unwrap(),
            ps_nv12.unwrap(),
            ps_p010.unwrap(),
            sampler.unwrap(),