feat(windows-host): pf-vdisplay — fix the ADD/REMOVE wedge + per-client display-config persistence

Two phases of pf-vdisplay (IddCx virtual display) lifecycle work, both validated on-glass on the RTX box.

Phase 1: fix the long-standing IOCTL_ADD 0x80070490 (ERROR_NOT_FOUND) wedge that ghost-monitor slot-budget exhaustion produced under ADD/REMOVE churn (the reset-script/reboot recurring failure). Validated: 43 reconnect-churn cycles, 0 wedges, monitor-node count flat at 1.

* driver: on IddCxMonitorArrival failure, tear the created-but-not-arrived monitor down with WdfObjectDelete + reclaim its id. This is the leak, asymmetric with the create-failure path, that exhausted the 16-monitor MaxMonitorsSupported budget. Recover MONITOR_MODES from lock poisoning instead of failing closed (defensive; the driver builds panic=abort).
* host: collapse the build-retry churn: hold ONE monitor lease across all build attempts and preempt only on Lingering (not Active), so a cold start does 1 ADD, not 8. Reap not-present "punktfunk" monitor PDOs on startup (the reset-script step-2 logic, in-process) and self-heal a detected 0x80070490 by reaping + retrying ADD. Force-preempt a stuck-Active prior monitor on the begin_idd_setup timeout (the safety net the Lingering-only preempt would otherwise drop).

Phase 2: give each client (keyed by its cert FINGERPRINT) a STABLE virtual-monitor id (1..=15) so Windows reapplies that client's saved per-monitor config (DPI SCALING) across reconnects, and two clients never share/bleed config. Validated: distinct clients -> distinct ids (1, 2); the driver honors the host's id (echoed resolved == preferred).

* proto: rename AddRequest._reserved -> preferred_monitor_id (offset 20) and AddReply._reserved -> resolved_monitor_id (offset 12). Byte-compatible (offset asserts), NO PROTOCOL_VERSION bump, so a pre-Phase-2 driver degrades gracefully to auto-id (the host detects it via the resolved echo).
* driver: create_monitor honors a host-supplied preferred id via resolve_id (range 1..=15, never collides with a live monitor) and seeds the EDID serial + IddCx ConnectorIndex + ContainerId from it.
* host: a persisted LRU fingerprint->id map (%ProgramData%\punktfunk\pf-vdisplay-identity.json), threaded to add_monitor via a set_client_identity no-op trait method (Linux/GameStream unaffected).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 21:42:59 +02:00
parent 080c55dbf7
commit 0f798d62b6
8 changed files with 553 additions and 83 deletions
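
The byte-compatible rename described in the proto bullet can be illustrated with a small standalone sketch. The struct layouts below are assumptions, inferred from the fields the ADD handler reads and writes and from the offsets quoted in the commit message (20 and 12); the real definitions live in the proto crate and may differ. `IdOutcome` and `classify` are hypothetical names, and treating a zero echo as a pre-Phase-2 driver is an assumption based on the old handler writing `_reserved: 0`.

```rust
use core::mem::{offset_of, size_of};

/// Assumed mirror of the ADD request (not the real proto definition).
#[repr(C)]
pub struct AddRequest {
    pub session_id: u64,           // offset 0
    pub width: u32,                // offset 8
    pub height: u32,               // offset 12
    pub refresh_hz: u32,           // offset 16
    pub preferred_monitor_id: u32, // offset 20, formerly `_reserved`
}

/// Assumed mirror of the ADD reply (not the real proto definition).
#[repr(C)]
pub struct AddReply {
    pub adapter_luid_low: u32,    // offset 0
    pub adapter_luid_high: i32,   // offset 4
    pub target_id: u32,           // offset 8
    pub resolved_monitor_id: u32, // offset 12, formerly `_reserved`
}

// Compile-time layout checks: renaming a field must not move or resize anything,
// otherwise an old driver and a new host would disagree on the wire format.
const _: () = assert!(offset_of!(AddRequest, preferred_monitor_id) == 20);
const _: () = assert!(size_of::<AddRequest>() == 24);
const _: () = assert!(offset_of!(AddReply, resolved_monitor_id) == 12);
const _: () = assert!(size_of::<AddReply>() == 16);

/// Hypothetical host-side interpretation of the echoed id.
#[derive(Debug, PartialEq, Eq)]
pub enum IdOutcome {
    /// The driver used exactly the id the host asked for.
    Honored,
    /// The driver picked a different id (auto-allocation or a live collision).
    Reassigned(u32),
    /// The echo is 0: assumed to be a driver that still writes the old reserved field.
    LegacyDriver,
}

/// Classify the reply's `resolved_monitor_id` against the requested `preferred` id.
pub fn classify(preferred: u32, resolved: u32) -> IdOutcome {
    if resolved == 0 {
        IdOutcome::LegacyDriver
    } else if resolved == preferred {
        IdOutcome::Honored
    } else {
        IdOutcome::Reassigned(resolved)
    }
}
```

Because the checks are `const` assertions, a layout change fails the build rather than surfacing as a runtime mismatch between host and driver.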
@@ -133,9 +133,13 @@ unsafe fn add(request: WDFREQUEST) {
        complete(request, STATUS_INVALID_PARAMETER);
        return;
    }
-    let Some((target_id, luid_low, luid_high)) =
-        crate::monitor::create_monitor(req.session_id, req.width, req.height, req.refresh_hz)
-    else {
+    let Some((monitor_id, target_id, luid_low, luid_high)) = crate::monitor::create_monitor(
+        req.session_id,
+        req.width,
+        req.height,
+        req.refresh_hz,
+        req.preferred_monitor_id,
+    ) else {
        complete(request, STATUS_NOT_FOUND);
        return;
    };
@@ -143,7 +147,7 @@ unsafe fn add(request: WDFREQUEST) {
        adapter_luid_low: luid_low,
        adapter_luid_high: luid_high,
        target_id,
-        _reserved: 0,
+        resolved_monitor_id: monitor_id,
    };
    // SAFETY: `request` is the framework WDFREQUEST.
    unsafe { write_output_complete(request, &reply) };
@@ -7,7 +7,7 @@
 use std::sync::Mutex;
 use std::time::{Duration, Instant};

-use wdk_sys::iddcx;
+use wdk_sys::{WDFOBJECT, call_unsafe_wdf_function_binding, iddcx};

 /// One resolution with the refresh rates it supports.
 #[derive(Clone)]
@@ -69,10 +69,23 @@ unsafe impl Send for MonitorObject {}
 /// heavy per-monitor resources on device removal is instead done explicitly ([`cleanup_for_device_removal`]).
 pub static MONITOR_MODES: Mutex<Vec<MonitorObject>> = Mutex::new(Vec::new());

+/// Lock [`MONITOR_MODES`], recovering the guard on poison instead of failing. DEFENSIVE ONLY: this driver
+/// workspace builds with `panic = "abort"` (packaging/windows/drivers/Cargo.toml), so a panic while the
+/// lock is held aborts the process WITHOUT unwinding — `MutexGuard::drop` never runs, the poison flag is
+/// never set, and `.lock()` can never return `Err`. The `into_inner()` arm is therefore currently
+/// unreachable; it is retained to consolidate the lock pattern and to stay correct if the panic strategy
+/// ever becomes `unwind` (the guarded data is a plain `Vec` with no cross-field invariant a half-completed
+/// panic could corrupt, so recovering the guard is sound). NOTE: this does NOT explain the observed ADD
+/// 0x80070490 wedge; that is ghost-monitor slot-budget exhaustion (addressed by the arrival-failure
+/// `WdfObjectDelete` teardown in [`create_monitor`] + the host-side reap), not lock poisoning.
+fn lock_monitors() -> std::sync::MutexGuard<'static, Vec<MonitorObject>> {
+    MONITOR_MODES.lock().unwrap_or_else(|e| e.into_inner())
+}
+
 /// True if any virtual monitor currently exists — the host-gone watchdog only reaps when there's
 /// something to reap (see [`crate::control::start_watchdog`]).
 pub fn has_monitors() -> bool {
-    MONITOR_MODES.lock().map(|l| !l.is_empty()).unwrap_or(false)
+    !lock_monitors().is_empty()
 }

 /// Depart every monitor that has existed at least `grace` — the host-gone watchdog reap
@@ -85,9 +98,7 @@ pub fn reap_orphaned(grace: Duration) -> usize {
        Option<iddcx::IDDCX_MONITOR>,
        Option<crate::swap_chain_processor::SwapChainProcessor>,
    )> = {
-        let Ok(mut lock) = MONITOR_MODES.lock() else {
-            return 0;
-        };
+        let mut lock = lock_monitors();
        let mut taken = Vec::new();
        let mut i = 0;
        while i < lock.len() {
@@ -138,7 +149,8 @@ pub fn display_info(
    // Compute in u64 then saturate the u32 rational numerators: the old u32 `refresh*(h+4)^2` overflows
    // for a large mode (e.g. 8K@240), which panics→aborts the extern-"C" mode DDI in a debug build.
    // Identical for every real mode; only an absurd (also now bounds-rejected) mode saturates.
-    let clock_rate: u64 = u64::from(refresh_rate) * u64::from(height + 4) * u64::from(height + 4) + 1000;
+    let clock_rate: u64 =
+        u64::from(refresh_rate) * u64::from(height + 4) * u64::from(height + 4) + 1000;
    let clock_rate_u32 = u32::try_from(clock_rate).unwrap_or(u32::MAX);
    let mut si = pod_init!(wdk_sys::DISPLAYCONFIG_VIDEO_SIGNAL_INFO);
    si.pixelRate = clock_rate;
@@ -264,9 +276,7 @@ pub fn set_swap_chain_processor(
    object: iddcx::IDDCX_MONITOR,
    proc: crate::swap_chain_processor::SwapChainProcessor,
 ) -> Option<crate::swap_chain_processor::SwapChainProcessor> {
-    let Ok(mut lock) = MONITOR_MODES.lock() else {
-        return Some(proc);
-    };
+    let mut lock = lock_monitors();
    if let Some(m) = lock.iter_mut().find(|m| m.object == Some(object)) {
        m.swap_chain_processor.replace(proc)
    } else {
@@ -290,15 +300,17 @@ pub fn take_swap_chain_processor(
        .take()
 }

-/// `IOCTL_ADD`: create + arrive a virtual monitor at `width`x`height`@`refresh`. Returns the OS
-/// `(target_id, adapter_luid_low, adapter_luid_high)` for the [`AddReply`](pf_driver_proto::control::AddReply),
-/// or `None` on failure (no adapter yet / IddCx error).
+/// `IOCTL_ADD`: create + arrive a virtual monitor at `width`x`height`@`refresh` for `session_id`, naming it
+/// by `preferred_id` (the host's per-client stable id; `0` = auto-allocate). Returns the resolved
+/// `(monitor_id, target_id, adapter_luid_low, adapter_luid_high)` for the
+/// [`AddReply`](pf_driver_proto::control::AddReply), or `None` on failure (no adapter yet / IddCx error).
 pub fn create_monitor(
    session_id: u64,
    width: u32,
    height: u32,
    refresh: u32,
-) -> Option<(u32, u32, i32)> {
+    preferred_id: u32,
+) -> Option<(u32, u32, u32, i32)> {
    let adapter = crate::adapter::adapter()?;
    // Single identity per session (E1): if the host re-ADDs a still-live `session_id` (it shouldn't), depart
    // the stale monitor first, so one session maps to exactly one monitor (no duplicate EDID/target lingers).
@@ -307,7 +319,9 @@ pub fn create_monitor(
        .map(|l| l.iter().any(|m| m.session_id == session_id))
        .unwrap_or(false)
    {
-        dbglog!("[pf-vd] create_monitor: session {session_id} already live — departing the stale monitor");
+        dbglog!(
+            "[pf-vd] create_monitor: session {session_id} already live — departing the stale monitor"
+        );
        remove_monitor(session_id);
    }
    let mut modes = vec![Mode {
@@ -317,17 +331,17 @@ pub fn create_monitor(
    }];
    modes.extend(default_modes());

-    // Register the (pending) monitor so the mode DDIs can find it by EDID-serial id before arrival, under a
-    // REUSED id (the lowest not currently live). Reclaiming the id on REMOVE — instead of a monotonic
-    // counter — keeps the connector index / EDID serial / container GUID bounded, so IddCx reuses the same
-    // OS target slot on a fresh ADD rather than leaving a ghost monitor node behind (the slot-exhaustion
-    // wedge: sustained ADD/REMOVE churn eventually makes ADD fail 0x80070490 ERROR_NOT_FOUND). Allocated
-    // under the lock with the push so two concurrent ADDs can't pick the same id.
+    // Register the (pending) monitor so the mode DDIs can find it by EDID-serial id before arrival. The id
+    // seeds the EDID serial + IddCx ConnectorIndex + ContainerId — i.e. the monitor's OS IDENTITY. Honor the
+    // host's per-client `preferred_id` when it is valid + not currently live, so a given client gets a
+    // STABLE identity across reconnects (→ Windows reapplies its saved per-monitor DPI scaling); else fall
+    // back to the lowest-free id (auto — the original slot-based behavior). A bounded reused id (vs a
+    // monotonic counter) keeps IddCx reusing the same OS target slot rather than leaving a ghost monitor
+    // node behind (the slot-exhaustion wedge). Allocated under the lock with the push so two concurrent ADDs
+    // can't pick the same id.
    let id = {
-        let Ok(mut lock) = MONITOR_MODES.lock() else {
-            return None;
-        };
-        let id = alloc_monitor_id(&lock);
+        let mut lock = lock_monitors();
+        let id = resolve_id(&lock, preferred_id);
        lock.push(MonitorObject {
            object: None,
            id,
@@ -379,7 +393,8 @@ pub fn create_monitor(
        return None;
    }
    let monitor = create_out.MonitorObject;
-    if let Ok(mut lock) = MONITOR_MODES.lock() {
+    {
+        let mut lock = lock_monitors();
        if let Some(m) = lock.iter_mut().find(|m| m.id == id) {
            m.object = Some(monitor);
        }
@@ -391,6 +406,24 @@ pub fn create_monitor(
    let st = unsafe { wdk_iddcx::IddCxMonitorArrival(monitor, &mut arrival_out) };
    dbglog!("[pf-vd] IddCxMonitorArrival(id={id}) -> {st:#x}");
    if !wdk_iddcx::nt_success(st) {
+        // Arrival failed on a monitor we already CREATED. It must be torn down with `WdfObjectDelete`:
+        // `IddCxMonitorDeparture` is only valid for an ARRIVED monitor, so departing here would be a
+        // no-op that LEAKS the IddCx monitor object and permanently pins its slot against the adapter's
+        // `MaxMonitorsSupported` budget — the leak that, asymmetric with the create-failure path just
+        // above (which only reclaims the id, having no object to delete), accelerates the ADD 0x80070490
+        // wedge. Reclaim the id FIRST (drop the `MONITOR_MODES` entry that still holds this handle) so a
+        // concurrent `clear_all`/`reap_orphaned` can't grab + depart the handle we're about to delete,
+        // THEN delete the object — `monitor` is a local copy of the handle, valid across both.
+        dbglog!(
+            "[pf-vd] IddCxMonitorArrival(id={id}) FAILED — reclaiming the id + deleting the created monitor"
+        );
+        remove_by_id(id);
+        // SAFETY: `monitor` is the just-created (not-yet-arrived) IddCx monitor handle, now owned solely
+        // here (its `MONITOR_MODES` entry was just removed); `WdfObjectDelete` takes a `WDFOBJECT` (a raw
+        // handle cast, as in the swap-chain / device-cleanup teardowns).
+        unsafe {
+            call_unsafe_wdf_function_binding!(WdfObjectDelete, monitor as WDFOBJECT);
+        }
        return None;
    }

@@ -399,14 +432,15 @@ pub fn create_monitor(
        arrival_out.OsAdapterLuid.LowPart,
        arrival_out.OsAdapterLuid.HighPart,
    );
-    if let Ok(mut lock) = MONITOR_MODES.lock() {
+    {
+        let mut lock = lock_monitors();
        if let Some(m) = lock.iter_mut().find(|m| m.id == id) {
            m.target_id = target_id;
            m.adapter_luid_low = luid_low;
            m.adapter_luid_high = luid_high;
        }
    }
-    Some((target_id, luid_low, luid_high))
+    Some((id, target_id, luid_low, luid_high))
 }

 /// `IOCTL_REMOVE`: depart + drop the monitor for `session_id`. Returns true if one was removed.
@@ -415,9 +449,7 @@ pub fn remove_monitor(session_id: u64) -> bool {
    // (which RAII-joins its worker thread) only AFTER the lock guard is released — joining a worker
    // while holding `MONITOR_MODES` would head-block the whole control plane / risk a self-deadlock.
    let (monitor, processor) = {
-        let Ok(mut lock) = MONITOR_MODES.lock() else {
-            return false;
-        };
+        let mut lock = lock_monitors();
        let Some(pos) = lock.iter().position(|m| m.session_id == session_id) else {
            return false;
        };
@@ -441,9 +473,7 @@ pub fn clear_all() {
        Option<iddcx::IDDCX_MONITOR>,
        Option<crate::swap_chain_processor::SwapChainProcessor>,
    )> = {
-        let Ok(mut lock) = MONITOR_MODES.lock() else {
-            return;
-        };
+        let mut lock = lock_monitors();
        lock.drain(..)
            .map(|mut m| (m.object, m.swap_chain_processor.take()))
            .collect()
@@ -467,9 +497,7 @@ pub fn clear_all() {
 /// though the per-devnode WUDFHost (`ProcessSharingDisabled`) would also reap them when it exits.
 pub fn cleanup_for_device_removal() {
    let mut drained: Vec<Option<crate::swap_chain_processor::SwapChainProcessor>> = {
-        let Ok(mut lock) = MONITOR_MODES.lock() else {
-            return;
-        };
+        let mut lock = lock_monitors();
        lock.drain(..)
            .map(|mut m| m.swap_chain_processor.take())
            .collect()
@@ -483,8 +511,20 @@ pub fn cleanup_for_device_removal() {

 /// Drop a pending entry by id (create failed before arrival).
 fn remove_by_id(id: u32) {
-    if let Ok(mut lock) = MONITOR_MODES.lock() {
-        lock.retain(|m| m.id != id);
+    lock_monitors().retain(|m| m.id != id);
+}
+
+/// Resolve the id to name a new monitor by: honor the host's `preferred` per-client id when it is in the
+/// valid range (`1..=15`, so the IddCx `ConnectorIndex` = id stays `< MaxMonitorsSupported` = 16) AND not
+/// currently live (two live monitors MUST have distinct ids/connectors); otherwise fall back to
+/// [`alloc_monitor_id`] (auto, lowest-free). NEVER auto-departs a colliding live monitor — that would tear
+/// down an unrelated concurrent client — so the live-uniqueness invariant is preserved even against a host
+/// bug. `preferred == 0` (anonymous/TOFU/GameStream) always falls through to auto. Caller holds `MONITOR_MODES`.
+fn resolve_id(modes: &[MonitorObject], preferred: u32) -> u32 {
+    if (1..=15).contains(&preferred) && !modes.iter().any(|m| m.id == preferred) {
+        preferred
+    } else {
+        alloc_monitor_id(modes)
    }
 }
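
The id-selection rule that `resolve_id` documents can be exercised in isolation with a minimal sketch. `Monitor` and `alloc_lowest_free` are hypothetical stand-ins: the driver's `MonitorObject` carries more state, and the body of `alloc_monitor_id` is not part of this diff, so "lowest free id starting at 1" is an assumption taken from the surrounding comments.

```rust
/// Simplified stand-in for the driver's `MonitorObject`; only the id matters here.
pub struct Monitor {
    pub id: u32,
}

/// Assumed behavior of the auto path: the lowest id >= 1 that no live monitor uses.
pub fn alloc_lowest_free(live: &[Monitor]) -> u32 {
    let mut candidate = 1;
    while live.iter().any(|m| m.id == candidate) {
        candidate += 1;
    }
    candidate
}

/// Same rule as the diff's `resolve_id`: honor `preferred` only when it is in
/// 1..=15 and not already live; otherwise fall back to auto-allocation.
/// A colliding live monitor is never evicted.
pub fn resolve_id(live: &[Monitor], preferred: u32) -> u32 {
    if (1..=15).contains(&preferred) && !live.iter().any(|m| m.id == preferred) {
        preferred
    } else {
        alloc_lowest_free(live)
    }
}
```

With monitors 1 and 3 live, a preferred id of 2 is honored, while a preferred id of 3 (live), 0 (anonymous), or 16 (out of range) all fall back to the lowest free id, which is 2 in that state.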