fix(windows-drivers): reclaim pf-vdisplay monitor ids on REMOVE (P1, slot-reclaim)

The driver assigned each virtual monitor a monotonically-increasing NEXT_ID used as the EDID serial / IddCx ConnectorIndex / container GUID, and never reclaimed it on REMOVE. Under sustained ADD/REMOVE churn the connector index kept climbing, so IddCx/PnP allocated a NEW OS target slot every cycle and orphaned the old one (ghost "Generic Monitor (punktfunk)" nodes) until the adapter's target capacity was exhausted and ADD failed 0x80070490 ERROR_NOT_FOUND. Fix: `create_monitor` now allocates the LOWEST free id (`alloc_monitor_id`, computed under the MONITOR_MODES lock with the push) instead of a counter, so a departed monitor's id is reclaimed and a fresh ADD reuses its target slot rather than orphaning it. With <= N live monitors the id stays bounded to 1..=N+1. Deleted the now-unused NEXT_ID + AtomicU32/Ordering import. CI-compile-gated only — the wedge reproduces solely under sustained churn on the RTX box, so this needs an on-glass reconnect-storm A/B to confirm (box is ephemeral/down). Marked on-glass-pending in windows-host-rewrite.md §4; keep reset-pf-vdisplay.ps1 as the recovery until validated. NOT to be relied on (or merged to main) until that A/B passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 22:11:36 +00:00
parent 0bf3984614
commit 8cde8621ce
2 changed files with 32 additions and 14 deletions
@@ -211,10 +211,14 @@ These are expensive empirical wins; keep them intact when touching the code:
 2. **Make IDD-push the default** — today it is gated behind `PUNKTFUNK_IDD_PUSH` (`config.rs` default
   `false`); deployment sets it in `host.env`. Flip the code default (with the WGC/DDA fallback already in
   place) so a fresh install runs the validated path, or document the `host.env` requirement explicitly.
-3. **pf-vdisplay slot reclaim on REMOVE** (driver robustness) — sustained ADD/REMOVE churn wedges the
+3. **pf-vdisplay slot reclaim on REMOVE** (driver robustness) — 🟡 **fix landed, on-glass-validation
-   driver (`ADD → 0x80070490 ERROR_NOT_FOUND`) because IddCx monitor slots aren't reclaimed (ghost nodes
+   pending.** Sustained ADD/REMOVE churn wedged the driver (`ADD → 0x80070490 ERROR_NOT_FOUND`) because the
-   accumulate). Workaround today: `packaging/windows/reset-pf-vdisplay.ps1`. Real fix in `control.rs`/
+   monitor id (EDID serial / `ConnectorIndex` / container GUID) was a **monotonic** `NEXT_ID`, never
-   `monitor.rs`. On-glass-gated.
+   reclaimed → IddCx accumulated a new OS target slot per cycle until exhaustion. `monitor.rs` now allocates
   the **lowest free id** (`alloc_monitor_id`), reused on REMOVE, so a fresh ADD reuses the departed
   monitor's target slot instead of orphaning it. CI-compile-gated; the wedge only reproduces under
   sustained churn on the RTX box, so this needs an **on-glass reconnect-storm A/B** to confirm (the box is
   ephemeral). Keep `packaging/windows/reset-pf-vdisplay.ps1` as the recovery until validated.
 **P2 — hygiene / architecture completion**
 4. **D1-host — host-crate P0 lints.** Add `#![deny(unsafe_op_in_unsafe_fn)]` +
@@ -5,7 +5,6 @@
 //! `guid: u128` → `session_id: u64` for the owned `pf_vdisplay_proto` control plane.
 use std::sync::Mutex;
 use std::sync::atomic::{AtomicU32, Ordering};
 use std::time::{Duration, Instant};
 use wdk_sys::iddcx;
@@ -69,8 +68,6 @@ unsafe impl Send for MonitorObject {}
 /// thread ([`crate::control::start_watchdog`]) races device cleanup — for no real gain. Cleanup of the
 /// heavy per-monitor resources on device removal is instead done explicitly ([`cleanup_for_device_removal`]).
 pub static MONITOR_MODES: Mutex<Vec<MonitorObject>> = Mutex::new(Vec::new());
 /// Monitor id / EDID-serial counter (unique per created monitor).
 static NEXT_ID: AtomicU32 = AtomicU32::new(1);
 /// True if any virtual monitor currently exists — the host-gone watchdog only reaps when there's
 /// something to reap (see [`crate::control::start_watchdog`]).
@@ -323,8 +320,6 @@ pub fn create_monitor(
        dbglog!("[pf-vd] create_monitor: session {session_id} already live — departing the stale monitor");
        remove_monitor(session_id);
    }
    let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    let mut modes = vec![Mode {
        width,
        height,
@@ -332,8 +327,17 @@ pub fn create_monitor(
    }];
    modes.extend(default_modes());
-    // Register the (pending) monitor so the mode DDIs can find it by EDID-serial id before arrival.
+    // Register the (pending) monitor so the mode DDIs can find it by EDID-serial id before arrival, under a
-    if let Ok(mut lock) = MONITOR_MODES.lock() {
+    // REUSED id (the lowest not currently live). Reclaiming the id on REMOVE — instead of a monotonic
    // counter — keeps the connector index / EDID serial / container GUID bounded, so IddCx reuses the same
    // OS target slot on a fresh ADD rather than leaving a ghost monitor node behind (the slot-exhaustion
    // wedge: sustained ADD/REMOVE churn eventually makes ADD fail 0x80070490 ERROR_NOT_FOUND). Allocated
    // under the lock with the push so two concurrent ADDs can't pick the same id.
    let id = {
        let Ok(mut lock) = MONITOR_MODES.lock() else {
            return None;
        };
        let id = alloc_monitor_id(&lock);
        lock.push(MonitorObject {
            object: None,
            id,
@@ -345,9 +349,8 @@ pub fn create_monitor(
            swap_chain_processor: None,
            created_at: Instant::now(),
        });
-    } else {
+        id
-        return None;
+    };
    }
    // EDID (serial = id) describes the monitor; the OS calls back into parse_monitor_description.
    let mut edid = crate::edid::Edid::generate_with(id);
@@ -505,6 +508,17 @@ fn remove_by_id(id: u32) {
    }
 }
 /// The lowest monitor id (≥1) not currently live. Reusing freed ids (instead of a monotonic counter) keeps
 /// the connector index / EDID serial / container GUID bounded to the number of concurrent monitors, so a
 /// fresh ADD reuses a departed monitor's OS target slot rather than allocating a new one and orphaning the
 /// old (the ghost-monitor accumulation that wedges ADD at 0x80070490 ERROR_NOT_FOUND). Caller holds
 /// `MONITOR_MODES`. With ≤ N live ids, a free one always exists in `1..=N+1` (pigeonhole).
 fn alloc_monitor_id(modes: &[MonitorObject]) -> u32 {
    (1u32..=modes.len() as u32 + 1)
        .find(|id| !modes.iter().any(|m| m.id == *id))
        .unwrap_or(1)
 }
 /// A deterministic, monitor-unique container GUID (groups targets into a physical device). Derived from
 /// `id` so it is stable + collision-free without a random source.
 fn container_guid(id: u32) -> wdk_sys::GUID {