feat(windows-drivers): STEP 6 — IDD-push FramePublisher (driver) + host migration to proto::frame

The driver now publishes each acquired swap-chain surface into the host-created shared ring (the IDD-push path); the full glass-to-glass transport is code-complete. Both sides use the canonical pf_vdisplay_proto::frame layout (lockstep by compile-error, not "must match" comments). Driver compiles + LOADS on-glass (adapter inits, Status=OK; no regression: the publisher is dormant until a frame is acquired); host cargo check green; adversarially reviewed (no blockers: token layout, keyed-mutex key 0, names by target_id, and the format guard all match the host consumer).

- new driver frame_transport.rs: FramePublisher OPENS the host ring by target_id (OpenFileMapping header + magic Acquire readiness gate + OpenEvent + OpenSharedResourceByName RING_LEN keyed-mutex textures), writes its render LUID + DRV_STATUS back into the header; publish() is NON-BLOCKING (round-robin 0ms try-acquire -> CopyResource -> ReleaseSync -> FrameToken::pack store Release -> SetEvent; drops the frame if every slot is busy or the surface format != the ring format). Manual handle/view cleanup on every try_open early return; RAII Drop (slots -> unmap -> CloseHandle). Layout/consts/names/token all from pf_vdisplay_proto::frame.
- swap_chain_processor.rs run_core: lazy rate-limited attach (every ~30 frames) + is_stale re-attach (mid-session HDR ring recreate); publishes buffer.MetaData.pSurface via IDXGIResource::from_raw_borrowed (preserves IddCx's refcount) BEFORE IddCxSwapChainFinishedProcessingFrame. run/run_core gain the render LUID; callbacks.rs assign_swap_chain passes it.
- host idd_push.rs migrated onto pf_vdisplay_proto::frame (deleted the hand-rolled SharedHeader / MAGIC / VERSION / RING_LEN / DRV_STATUS_* / name fns / token packing): pure refactor, byte-identical, no behavior or gating change. DebugBlock + DXGI_SHARED_RESOURCE_RW kept local (not in the proto).
- driver windows crate gains Win32_System_Memory (MapViewOfFile/OpenFileMappingW/...); rustfmt'd the whole driver workspace (incl. wdk-probe, fmt-only).

Built via the ultracode flow: STEP-6 map workflow -> agent-implement -> box build (driver + host both green; caught nothing this time) -> adversarial-verify-agent (no blockers) -> FrameToken::pack hardening -> deploy (loads). Glass-to-glass frame validation awaits a composited session (per the parity finding: this headless box yields 0 frames for the proven SudoVDA path too).

FOLLOW-UPs: port the optional Global\pfvd-dbg DebugBlock triage channel to the new driver; STEP 7 HDR; STEP 8 drop SudoVDA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
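
Review note: the `FrameToken::pack` step named in the commit message can be sketched as below, for readers checking the host consumer against the driver. This is an illustration only. `TokenSketch` and `accept` are hypothetical names, not the proto crate's API; the bit layout is taken from the driver comment `(generation << 40) | (seq << 8) | slot`; masking the generation to 24 bits is an assumption made here so it cannot spill past bit 63.

```rust
/// Hypothetical stand-in for `pf_vdisplay_proto::frame::FrameToken`.
/// Layout assumed from the driver comment: (generation << 40) | (seq << 8) | slot.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TokenSketch {
    /// Only the low 24 bits fit above bit 40 of a u64 (assumption: the rest is masked off).
    pub generation: u32,
    pub seq: u32,
    pub slot: u8,
}

impl TokenSketch {
    /// Pack the three fields into the single u64 the driver stores with Release ordering.
    pub fn pack(self) -> u64 {
        ((u64::from(self.generation) & 0x00FF_FFFF) << 40)
            | (u64::from(self.seq) << 8)
            | u64::from(self.slot)
    }

    /// Inverse of `pack`; the `as` casts deliberately truncate to each field's width.
    pub fn unpack(raw: u64) -> Self {
        Self {
            generation: (raw >> 40) as u32, // bits 40..64
            seq: (raw >> 8) as u32,         // bits 8..40
            slot: raw as u8,                // bits 0..8
        }
    }
}

/// Host-side guard: accept a token only if it was stamped by the ring generation
/// the host currently owns, so a publish from a stale ring is ignored.
pub fn accept(raw: u64, current_generation: u32) -> Option<TokenSketch> {
    let token = TokenSketch::unpack(raw);
    (token.generation == (current_generation & 0x00FF_FFFF)).then_some(token)
}
```

With this layout a host that has moved to generation 4 would ignore a token packed under generation 3, which is the stale-ring rejection the driver comment describes.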
2026-06-25 10:28:47 +00:00
parent 590ceaa850
commit e2f004589c
10 changed files with 449 additions and 80 deletions
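
Review note: the attach cadence described for `run_core` (try on the first iteration, retry roughly every 30 iterations while detached, re-attach immediately once the ring is stale) can be modelled without any Windows API. A minimal sketch, assuming a hypothetical `AttachState` helper and closures standing in for `FramePublisher::is_stale` and `FramePublisher::try_open`; the real loop keeps this state in local variables instead.

```rust
/// Retry interval in loop iterations (the diff uses 30).
pub const RETRY_EVERY: u32 = 30;

/// Hypothetical helper holding the lazily attached publisher `P`.
pub struct AttachState<P> {
    publisher: Option<P>,
    since_try: u32,
}

impl<P> AttachState<P> {
    pub fn new() -> Self {
        // u32::MAX forces an attach attempt on the very first tick.
        Self { publisher: None, since_try: u32::MAX }
    }

    /// One loop iteration: drop a stale publisher, then attach if due.
    /// `try_open` is only invoked when an attempt is actually made.
    pub fn tick(
        &mut self,
        is_stale: impl Fn(&P) -> bool,
        try_open: impl FnOnce() -> Option<P>,
    ) -> Option<&mut P> {
        if self.publisher.as_ref().is_some_and(|p| is_stale(p)) {
            self.publisher = None;
            self.since_try = u32::MAX; // re-attach in this same tick
        }
        if self.publisher.is_none() {
            if self.since_try >= RETRY_EVERY {
                self.since_try = 0;
                self.publisher = try_open();
            } else {
                self.since_try += 1;
            }
        }
        self.publisher.as_mut()
    }
}
```

Keeping the counter below `RETRY_EVERY` whenever it is incremented also means the `+= 1` can never overflow from the `u32::MAX` sentinel.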
@@ -214,7 +214,17 @@ pub unsafe extern "C" fn assign_swap_chain(

    if let Some(device) = crate::direct_3d_device::pooled_device(luid) {
        let mut processor = crate::swap_chain_processor::SwapChainProcessor::new();
-        processor.run(swap_chain, device, new_frame_event, target_id);
+        // STEP 6: the publisher reports this render LUID into the host header so the host detects a
+        // render-adapter mismatch (it created the ring textures on its own GPU). `luid` is the OS-picked
+        // render adapter built above.
+        processor.run(
+            swap_chain,
+            device,
+            new_frame_event,
+            target_id,
+            luid.LowPart,
+            luid.HighPart,
+        );
        // Install on the monitor; drop any processor it replaced (a race lost above) OUTSIDE the lock.
        drop(crate::monitor::set_swap_chain_processor(monitor, processor));
    } else {
@@ -5,8 +5,9 @@
 //! D3D/DXGI types are the `windows` crate (refcounted COM, no manual Drop); the swap-chain/LUID hand-off
 //! to the wdk-sys IddCx world happens via raw pointers in `swap_chain_processor.rs`.
 //!
-//! STEP 5 only DRAINS the swap-chain to keep the monitor a live display — there is no frame publisher,
-//! so the device's immediate context is unused here (it returns to use in STEP 6's `CopyResource`).
+//! STEP 5 binds this device to the swap-chain to keep the monitor a live display; STEP 6 reuses the
+//! device's immediate context in the frame publisher's `CopyResource` (both on the swap-chain processor
+//! thread, the one thread this device is touched from).

 use std::sync::atomic::{AtomicI32, Ordering};
 use std::sync::{Arc, Mutex};
@@ -54,8 +55,6 @@ pub struct Direct3DDevice {
    pub device: ID3D11Device,
    /// The single (SINGLETHREADED) immediate context — used by STEP 6's frame-push publisher's
    /// `CopyResource` on the swap-chain processor thread (the one thread this device is touched from).
-    /// Unused in STEP 5 (drain-only); kept so the device matches the oracle exactly.
-    #[allow(dead_code)]
    pub device_context: ID3D11DeviceContext,
 }

@@ -0,0 +1,317 @@
+//! STEP 6 — IDD-push frame publisher (DRIVER side).
+//!
+//! The restricted WUDFHost token canNOT create named kernel objects (proven on the RTX box: it can't
+//! even write a world-writable file), so — exactly like the gamepad UMDF drivers
+//! (`crates/punktfunk-host/src/inject/dualsense_windows.rs`: *"the host creates the section, privileged,
+//! with a permissive SDDL so the WUDFHost can open it; the driver maps it"*) — the **host** creates the
+//! shared header + frame-ready event + ring of keyed-mutex textures, and the driver only **OPENS** them.
+//! The driver writes its actual render-adapter LUID + a status code back into the host-created header (our
+//! only driver-visibility channel: UMDF hides OutputDebugString in ETW and the token can't write files),
+//! then copies each acquired swap-chain surface into the next ring slot and signals the host.
+//!
+//! Host counterpart: `crates/punktfunk-host/src/capture/idd_push.rs`. The shared `SharedHeader` layout,
+//! the [`FrameToken`] packing, the `Global\` object-name scheme, the `MAGIC`/`RING_LEN` and the
+//! `DRV_STATUS_*` codes are NOT hand-duplicated here: both sides `use pf_vdisplay_proto::frame::*`, which
+//! OWNS the contract (with `const` size asserts so any drift is a compile error).
+//!
+//! Ported from the proven oracle (`packaging/windows/vdisplay-driver/pf-vdisplay/src/frame_transport.rs`).
+//! Differences from the oracle:
+//! * the layout/consts/names/token come from `pf_vdisplay_proto::frame` instead of being re-declared;
+//! * `dbglog!` replaces `log::info!`;
+//! * the optional fixed-name `Global\pfvd-dbg` `DebugBlock` bring-up channel is SKIPPED (not on the data
+//!   path). FOLLOW-UP: if the host bring-up diagnostics are needed again, port the oracle's `DebugBlock`
+//!   here too (it is owned by `idd_push.rs`, not the proto).
+
+use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
+
+use pf_vdisplay_proto::frame::{
+    DRV_STATUS_NO_DEVICE1, DRV_STATUS_OPENED, DRV_STATUS_TEX_FAIL, FrameToken, MAGIC, RING_LEN,
+    SharedHeader, event_name, header_name, texture_name,
+};
+use windows::Win32::Foundation::{CloseHandle, HANDLE};
+use windows::Win32::Graphics::Direct3D11::{
+    D3D11_TEXTURE2D_DESC, ID3D11Device, ID3D11Device1, ID3D11DeviceContext, ID3D11Texture2D,
+};
+use windows::Win32::Graphics::Dxgi::IDXGIKeyedMutex;
+use windows::Win32::System::Memory::{
+    FILE_MAP_ALL_ACCESS, MEMORY_MAPPED_VIEW_ADDRESS, MapViewOfFile, OpenFileMappingW,
+    UnmapViewOfFile,
+};
+use windows::Win32::System::Threading::{OpenEventW, SYNCHRONIZATION_ACCESS_RIGHTS, SetEvent};
+use windows::core::{HSTRING, Interface};
+
+/// `DXGI_SHARED_RESOURCE_READ | _WRITE` — passed to `OpenSharedResourceByName` (matches the host's
+/// `CreateSharedHandle` access). Kept local: it is an `OpenSharedResourceByName` arg, not part of the
+/// proto contract. (Same value the host uses in `idd_push.rs`.)
+const DXGI_SHARED_RESOURCE_RW: u32 = 0x8000_0000 | 0x1;
+/// SYNCHRONIZE | EVENT_MODIFY_STATE — the driver does not wait on the event, only SIGNALS it.
+const EVENT_ACCESS: u32 = 0x0010_0000 | 0x0002;
+/// `WAIT_TIMEOUT` as an HRESULT — `AcquireSync` returns this when the slot is held by the consumer.
+const WAIT_TIMEOUT_HRESULT: i32 = 0x0000_0102;
+
+struct Slot {
+    tex: ID3D11Texture2D,
+    mutex: IDXGIKeyedMutex,
+}
+
+/// Publishes acquired swap-chain surfaces into the HOST-created ring. Owned by the swap-chain processor
+/// thread; attached lazily once the host has created the shared objects.
+pub struct FramePublisher {
+    context: ID3D11DeviceContext,
+    map: HANDLE,
+    header: *mut SharedHeader,
+    event: HANDLE,
+    slots: Vec<Slot>,
+    next: u32,
+    seq: u64,
+    /// The host-created ring textures' DXGI format (from the shared header). A swap-chain surface whose
+    /// format differs (e.g. an FP16 HDR frame vs a BGRA ring) is dropped in `publish` — `CopyResource`
+    /// needs matching formats.
+    ring_format: u32,
+    /// The ring generation this publisher attached to. The host BUMPS the header generation when it
+    /// recreates the ring at a new format mid-session (the display's HDR mode flipped) — [`Self::is_stale`]
+    /// detects that so `run_core` re-attaches to the new-format textures instead of dropping every frame.
+    generation: u32,
+}
+
+// SAFETY: created and used only on the swap-chain processor thread.
+unsafe impl Send for FramePublisher {}
+
+impl FramePublisher {
+    /// Try ONCE to attach to the host-created shared objects. Returns `Err` cheaply if the host hasn't
+    /// created/published them yet — the drain loop retries periodically, so a non-IDD-push session just
+    /// keeps draining with no stall. Every early-return path explicitly cleans up the handles/mapping it
+    /// opened (raw-handle style, matching the rest of this driver): `Self` is not constructed yet, so
+    /// its `Drop` cannot run for them.
+    pub fn try_open(
+        target_id: u32,
+        render_luid_low: u32,
+        render_luid_high: i32,
+        device: &ID3D11Device,
+        context: &ID3D11DeviceContext,
+    ) -> windows::core::Result<Self> {
+        // 1. Open the host-created header (RW). Err if the host hasn't created it yet.
+        let map = unsafe {
+            OpenFileMappingW(
+                FILE_MAP_ALL_ACCESS.0,
+                false,
+                &HSTRING::from(header_name(target_id)),
+            )?
+        };
+        let view = unsafe {
+            MapViewOfFile(
+                map,
+                FILE_MAP_ALL_ACCESS,
+                0,
+                0,
+                core::mem::size_of::<SharedHeader>(),
+            )
+        };
+        if view.Value.is_null() {
+            unsafe {
+                let _ = CloseHandle(map);
+            }
+            return Err(windows::core::Error::from_win32());
+        }
+        let header = view.Value.cast::<SharedHeader>();
+
+        // 2. Report our render adapter to the host immediately (lets it detect a mismatch).
+        unsafe {
+            (*header).driver_render_luid_low = render_luid_low;
+            (*header).driver_render_luid_high = render_luid_high;
+        }
+
+        // 3. The host sets magic==MAGIC only once the ring textures exist. Not ready → retry later.
+        let magic = unsafe {
+            (*(core::ptr::addr_of!((*header).magic) as *const AtomicU32)).load(Ordering::Acquire)
+        };
+        if magic != MAGIC {
+            unsafe {
+                let _ = UnmapViewOfFile(MEMORY_MAPPED_VIEW_ADDRESS {
+                    Value: header.cast(),
+                });
+                let _ = CloseHandle(map);
+            }
+            return Err(windows::core::Error::from_win32());
+        }
+        let (generation, ring_len) =
+            unsafe { ((*header).generation, (*header).ring_len.min(RING_LEN)) };
+
+        // 4. Open the event (SYNCHRONIZE | EVENT_MODIFY_STATE so we can SetEvent).
+        let event = match unsafe {
+            OpenEventW(
+                SYNCHRONIZATION_ACCESS_RIGHTS(EVENT_ACCESS),
+                false,
+                &HSTRING::from(event_name(target_id)),
+            )
+        } {
+            Ok(e) => e,
+            Err(e) => {
+                unsafe {
+                    let _ = UnmapViewOfFile(MEMORY_MAPPED_VIEW_ADDRESS {
+                        Value: header.cast(),
+                    });
+                    let _ = CloseHandle(map);
+                }
+                return Err(e);
+            }
+        };
+
+        // 5. Open device1 + the ring textures the host created (same render adapter required).
+        let device1: ID3D11Device1 = match device.cast() {
+            Ok(d) => d,
+            Err(e) => {
+                unsafe {
+                    (*header).driver_status = DRV_STATUS_NO_DEVICE1;
+                    let _ = CloseHandle(event);
+                    let _ = UnmapViewOfFile(MEMORY_MAPPED_VIEW_ADDRESS {
+                        Value: header.cast(),
+                    });
+                    let _ = CloseHandle(map);
+                }
+                return Err(e);
+            }
+        };
+        let mut slots = Vec::new();
+        for k in 0..ring_len {
+            let name = HSTRING::from(texture_name(target_id, generation, k));
+            let opened: windows::core::Result<ID3D11Texture2D> =
+                unsafe { device1.OpenSharedResourceByName(&name, DXGI_SHARED_RESOURCE_RW) };
+            match opened {
+                Ok(tex) => match tex.cast::<IDXGIKeyedMutex>() {
+                    Ok(mutex) => slots.push(Slot { tex, mutex }),
+                    Err(e) => {
+                        unsafe {
+                            (*header).driver_status = DRV_STATUS_TEX_FAIL;
+                            (*header).driver_status_detail = e.code().0 as u32;
+                            let _ = CloseHandle(event);
+                            let _ = UnmapViewOfFile(MEMORY_MAPPED_VIEW_ADDRESS {
+                                Value: header.cast(),
+                            });
+                            let _ = CloseHandle(map);
+                        }
+                        return Err(e);
+                    }
+                },
+                Err(e) => {
+                    // Most likely a render-adapter mismatch (the host made the textures on a different
+                    // GPU than the swap-chain renders on). Tell the host so it can report it.
+                    unsafe {
+                        (*header).driver_status = DRV_STATUS_TEX_FAIL;
+                        (*header).driver_status_detail = e.code().0 as u32;
+                        let _ = CloseHandle(event);
+                        let _ = UnmapViewOfFile(MEMORY_MAPPED_VIEW_ADDRESS {
+                            Value: header.cast(),
+                        });
+                        let _ = CloseHandle(map);
+                    }
+                    return Err(e);
+                }
+            }
+        }
+
+        unsafe {
+            (*header).driver_status = DRV_STATUS_OPENED;
+        }
+        dbglog!(
+            "[pf-vd] frame-push(driver): attached to host ring gen {generation} ({ring_len} slots)"
+        );
+        Ok(Self {
+            context: context.clone(),
+            map,
+            header,
+            event,
+            slots,
+            next: 0,
+            seq: 0,
+            ring_format: unsafe { (*header).dxgi_format },
+            generation,
+        })
+    }
+
+    #[inline]
+    fn latest_cell(&self) -> &AtomicU64 {
+        unsafe { &*(core::ptr::addr_of!((*self.header).latest) as *const AtomicU64) }
+    }
+
+    /// True once the host has recreated the ring (bumped the header generation) — e.g. the display's HDR
+    /// mode flipped, so the ring format changed (FP16 ⇄ BGRA) and the texture names now carry a new
+    /// generation. `run_core` drops the publisher on this so it re-attaches to the new ring.
+    pub fn is_stale(&self) -> bool {
+        let cur = unsafe {
+            (*(core::ptr::addr_of!((*self.header).generation) as *const AtomicU32))
+                .load(Ordering::Acquire)
+        };
+        cur != self.generation
+    }
+
+    /// Copy `surface` into the next free ring slot and signal the host. Never blocks (0 ms try-acquire).
+    pub fn publish(&mut self, surface: &ID3D11Texture2D) {
+        let ring_len = self.slots.len() as u32;
+        if ring_len == 0 {
+            return;
+        }
+        // Format guard: `CopyResource` needs the surface + ring textures to share a DXGI format. Drop a
+        // frame that doesn't match (e.g. an FP16 HDR surface arriving while the ring is still BGRA, before
+        // the host recreates the ring as FP16) instead of corrupting / failing the copy.
+        let mut desc = D3D11_TEXTURE2D_DESC::default();
+        unsafe { surface.GetDesc(&mut desc) };
+        if desc.Format.0 as u32 != self.ring_format {
+            return;
+        }
+        let start = self.next;
+        for attempt in 0..ring_len {
+            let slot = (start + attempt) % ring_len;
+            let s = &self.slots[slot as usize];
+            match unsafe { s.mutex.AcquireSync(0, 0) } {
+                Ok(()) => {
+                    // STRAIGHT-LINE, NO `?` between acquire + release — a `?`-return here would leak the
+                    // keyed-mutex lock and wedge the host on this slot. The ordering below is load-bearing:
+                    // the CopyResource is GPU-ordered before the consumer via the slot keyed mutex, and the
+                    // `latest` store (Release) publishes the slot only AFTER the copy is queued + the mutex
+                    // released.
+                    unsafe {
+                        self.context.CopyResource(&s.tex, surface);
+                        let _ = s.mutex.ReleaseSync(0);
+                    }
+                    self.seq = self.seq.wrapping_add(1);
+                    // `latest` = (generation << 40) | (seq << 8) | slot, packed by the proto's `FrameToken`
+                    // (single source of truth — the host unpacks with the same type). Stamping the generation
+                    // lets the host REJECT a publish from a stale ring (an old-generation publisher racing the
+                    // host's mid-session ring recreate) so it never consumes an unwritten new-ring slot.
+                    let latest = FrameToken {
+                        generation: self.generation,
+                        seq: self.seq as u32,
+                        slot: slot as u8,
+                    }
+                    .pack();
+                    self.latest_cell().store(latest, Ordering::Release);
+                    unsafe {
+                        let _ = SetEvent(self.event);
+                    }
+                    self.next = (slot + 1) % ring_len;
+                    return;
+                }
+                Err(e) if e.code().0 == WAIT_TIMEOUT_HRESULT => continue,
+                Err(_) => return,
+            }
+        }
+        // All slots busy — drop this frame (never block the swap-chain thread).
+    }
+}
+
+impl Drop for FramePublisher {
+    fn drop(&mut self) {
+        // Slots FIRST (release the shared textures + keyed mutexes), THEN unmap the header, THEN the
+        // handles.
+        self.slots.clear();
+        unsafe {
+            if !self.header.is_null() {
+                let _ = UnmapViewOfFile(MEMORY_MAPPED_VIEW_ADDRESS {
+                    Value: self.header.cast(),
+                });
+            }
+            let _ = CloseHandle(self.event);
+            let _ = CloseHandle(self.map);
+        }
+    }
+}
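
Review note: the slot selection inside `publish` above is a bounded round-robin scan that never waits. The sketch below isolates just that control flow so it can be tested off-device. `TryAcquire` and `pick_slot` are hypothetical names, and the closure stands in for the zero-timeout `AcquireSync(0, 0)` call; the copy, release, token store and `SetEvent` are left out.

```rust
/// Outcome of a zero-timeout acquire on one slot (stand-in for `AcquireSync(0, 0)`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TryAcquire {
    /// The slot's keyed mutex was taken; the frame can be copied into it.
    Acquired,
    /// The consumer still holds the slot (the WAIT_TIMEOUT case).
    Busy,
    /// Any other error: give up on this frame.
    Failed,
}

/// Scan at most `ring_len` slots starting at `*next`.
/// Returns the slot to write, or `None` when the frame should be dropped.
pub fn pick_slot(
    next: &mut u32,
    ring_len: u32,
    mut try_acquire: impl FnMut(u32) -> TryAcquire,
) -> Option<u32> {
    if ring_len == 0 {
        return None;
    }
    let start = *next;
    for attempt in 0..ring_len {
        let slot = (start + attempt) % ring_len;
        match try_acquire(slot) {
            TryAcquire::Acquired => {
                // Advance the cursor past the slot just used.
                *next = (slot + 1) % ring_len;
                return Some(slot);
            }
            TryAcquire::Busy => continue,
            TryAcquire::Failed => return None,
        }
    }
    // Every slot was busy: drop the frame rather than block the caller.
    None
}
```

Because the scan is bounded by `ring_len` and each probe has a zero timeout, the worst case is one failed probe per slot before the frame is dropped, and the cursor is left untouched so the next frame starts from the same slot.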
@@ -18,6 +18,7 @@ mod control;
 mod direct_3d_device;
 mod edid;
 mod entry;
+mod frame_transport;
 mod monitor;
 mod swap_chain_processor;

@@ -1,18 +1,20 @@
-//! The swap-chain processor (STEP 5): a worker thread that DRAINS the IddCx swap-chain so the virtual
-//! monitor stays a usable display.
+//! The swap-chain processor (STEP 5 + STEP 6): a worker thread that DRAINS the IddCx swap-chain (so the
+//! virtual monitor stays a usable display) and PUBLISHES each acquired surface into the host-created
+//! shared ring (the IDD-push path).
 //!
-//! The OS presents the composited desktop to the driver through a swap-chain; the driver MUST consume
-//! it (acquire → finished-processing) or the monitor stalls. STEP 5 binds our render device to the
-//! swap-chain (`IddCxSwapChainSetDevice`) and loops acquire/finish, discarding each frame. It does NOT
-//! publish frames to the host — that is STEP 6 (the `CopyResource` of `out.MetaData.pSurface` into a
-//! shared ring), deliberately omitted here.
+//! The OS presents the composited desktop to the driver through a swap-chain; the driver MUST consume it
+//! (acquire → finished-processing) or the monitor stalls. STEP 5 binds our render device to the swap-chain
+//! (`IddCxSwapChainSetDevice`) and loops acquire/finish. STEP 6 lazily attaches a [`FramePublisher`] to
+//! the host's shared ring and, on each acquired frame, `CopyResource`s `out.MetaData.pSurface` into the
+//! next ring slot before finishing the frame (a non-IDD-push session simply never attaches and keeps
+//! draining).
 //!
 //! Ported from the proven oracle (`packaging/windows/vdisplay-driver/pf-vdisplay/src/
 //! swap_chain_processor.rs`) onto wdk-sys + wdk-iddcx. The oracle's `wdf_umdf`/`wdf_umdf_sys` are
 //! replaced by `wdk_sys::iddcx::*` + the `wdk_iddcx` DDI wrappers. Those wrappers return a RAW
 //! `NTSTATUS` (`i32`) that is HRESULT-shaped for the swap-chain DDIs, so we classify it by hand
 //! (`hr >= 0` = success; `0x8000_000A` = E_PENDING; `hr < 0 && != E_PENDING` = error) rather than with
-//! `nt_success`. The publisher + `render_luid_low/high` params are dropped (STEP 6).
+//! `nt_success`.

 use std::{
    mem::size_of,
@@ -35,7 +37,10 @@ use wdk_sys::{HANDLE, NTSTATUS, WDFOBJECT, call_unsafe_wdf_function_binding};
 use windows::{
    Win32::{
        Foundation::HANDLE as WHANDLE,
-        Graphics::Dxgi::IDXGIDevice,
+        Graphics::{
+            Direct3D11::ID3D11Texture2D,
+            Dxgi::{IDXGIDevice, IDXGIResource},
+        },
        System::Threading::{
            AvRevertMmThreadCharacteristics, AvSetMmThreadCharacteristicsW, WaitForSingleObject,
        },
@@ -43,7 +48,7 @@ use windows::{
    core::{Interface, w},
 };

-use crate::direct_3d_device::Direct3DDevice;
+use crate::{direct_3d_device::Direct3DDevice, frame_transport::FramePublisher};

 /// E_PENDING — `ReleaseAndAcquireBuffer2` returns this (HRESULT-shaped) when the swap-chain is valid but
 /// DWM has composed no new frame yet; wait on the surface-available event and retry.
@@ -89,6 +94,8 @@ impl SwapChainProcessor {
        device: Arc<Direct3DDevice>,
        available_buffer_event: HANDLE,
        target_id: u32,
+        render_luid_low: u32,
+        render_luid_high: i32,
    ) {
        let available_buffer_event = Sendable(available_buffer_event);
        let swap_chain = Sendable(swap_chain);
@@ -117,6 +124,8 @@ impl SwapChainProcessor {
                available_buffer_event.0,
                &terminate,
                target_id,
+                render_luid_low,
+                render_luid_high,
            );

            dbglog!(
@@ -147,6 +156,8 @@ impl SwapChainProcessor {
        available_buffer_event: HANDLE,
        terminate: &AtomicBool,
        target_id: u32,
+        render_luid_low: u32,
+        render_luid_high: i32,
    ) {
        // SetDevice fails (0x887A0026, FACILITY_DXGI) when the monitor briefly flaps INACTIVE during
        // topology activation — the OS unassigns + re-assigns the swap-chain, and a fresh run_core thread
@@ -208,6 +219,13 @@ impl SwapChainProcessor {
            return;
        }

+        // STEP 6 IDD-push: lazily ATTACH to the HOST-created shared ring. The restricted UMDF token can't
+        // create named objects, so the host creates the header + event + textures and we only OPEN them
+        // once they appear (`try_open`). Until then we just drain — exactly the STEP-5 behaviour — so a
+        // non-IDD-push session never stalls. Retried every ~30 loop iterations.
+        let mut publisher: Option<FramePublisher> = None;
+        let mut frames_since_try: u32 = u32::MAX; // attach attempt on the first loop iteration
+
        let mut logged_pending = false;
        let mut logged_frame = false;
        loop {
@@ -221,9 +239,40 @@ impl SwapChainProcessor {
                break;
            }

+            // The host recreates the shared ring (new format) mid-session when the display's HDR mode
+            // flips — it bumps the header generation. Detect that and drop the publisher so we re-attach to
+            // the new-format textures below; otherwise we'd keep CopyResource'ing into the stale ring, whose
+            // format now mismatches the surface → the publish() format-guard drops every frame and the
+            // stream freezes until the next swap-chain recreate.
+            if publisher.as_ref().is_some_and(FramePublisher::is_stale) {
+                publisher = None;
+                frames_since_try = u32::MAX; // re-attach immediately
+            }
+            // Lazy-attach (rate-limited) at the loop TOP so we keep trying even while the display is idle
+            // (E_PENDING / no frames presented yet), not only when a frame is acquired. `try_open` is a
+            // cheap OpenFileMapping that fails fast until the host has created the ring.
+            if publisher.is_none() {
+                if frames_since_try >= 30 {
+                    frames_since_try = 0;
+                    // `if let Ok` (not a `match` with an empty `Err` arm) keeps clippy's `single_match`
+                    // happy under `-D warnings`; semantics are identical — attach on success, retry on Err.
+                    if let Ok(p) = FramePublisher::try_open(
+                        target_id,
+                        render_luid_low,
+                        render_luid_high,
+                        &device.device,
+                        &device.device_context,
+                    ) {
+                        publisher = Some(p);
+                    }
+                } else {
+                    frames_since_try += 1;
+                }
+            }
+
            // ...Buffer2 is required once CAN_PROCESS_FP16 is set. AcquireSystemMemoryBuffer=FALSE keeps
-            // the GPU surface (out.MetaData.pSurface). STEP 5 only drains — it does NOT publish the
-            // surface (STEP 6 will). Built zeroed + field-assigned (driver style) so a bindgen field-set
+            // the GPU surface (out.MetaData.pSurface) — STEP 6 publishes it into the shared ring in the
+            // success branch below. Built zeroed + field-assigned (driver style) so a bindgen field-set
            // difference can't break a positional struct literal.
            let mut in_args: IDARG_IN_RELEASEANDACQUIREBUFFER2 = unsafe { core::mem::zeroed() };
            #[allow(clippy::cast_possible_truncation)]
@@ -275,9 +324,23 @@ impl SwapChainProcessor {
                    );
                    logged_frame = true;
                }
-                // STEP 6 publishes `buffer.MetaData.pSurface` into the shared ring HERE (the surface is
-                // valid until the next ReleaseAndAcquire). STEP 5 only drains, so we immediately finish
-                // the frame.
+                // STEP 6: copy the acquired surface into the shared ring BEFORE FinishedProcessingFrame
+                // (the surface is valid until the next ReleaseAndAcquire). The pointer is BORROWED —
+                // `from_raw_borrowed` does NOT take IddCx's refcount — and the GPU-side copy is ordered
+                // before the consumer via the slot keyed mutex. (Attach happens at the loop top.)
+                if let Some(p) = publisher.as_mut() {
+                    let raw = buffer.MetaData.pSurface as *mut core::ffi::c_void;
+                    if !raw.is_null() {
+                        // SAFETY: `raw` is IddCx's live surface pointer (valid until the next
+                        // ReleaseAndAcquire); `from_raw_borrowed` does not consume the refcount.
+                        if let Some(res) = unsafe { IDXGIResource::from_raw_borrowed(&raw) } {
+                            if let Ok(tex) = res.cast::<ID3D11Texture2D>() {
+                                p.publish(&tex);
+                            }
+                        }
+                    }
+                }
+
                // SAFETY: driver is loaded; `swap_chain` is valid.
                let hr = unsafe { wdk_iddcx::IddCxSwapChainFinishedProcessingFrame(swap_chain) };
                if !hr_success(hr) {