docs(windows-rewrite): audit the IDD-push rewrite against its plan

Driver track (M0+M1, STEPs 0-7) landed and is on-glass-validated, but the host-side goals (clean architecture, SudoVDA removal, unsafe reduction) and several driver-spec items (host-gone watchdog, SET_RENDER_ADAPTER, ownership model) are not yet done. Full findings + a prioritized P0-P2 fix list in the doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-25 12:39:42 +00:00
parent a11b0dd3c7
commit 0badc17d87
+287
View File
@@ -0,0 +1,287 @@
# Windows Host Rewrite — Audit
Status: **audit** (2026-06-25). Reviews the state of the Windows host rewrite against its plan
([`docs/windows-host-rewrite.md`](windows-host-rewrite.md)). Read-only assessment — no code changed.
Scope: the new IddCx driver workspace (`packaging/windows/drivers/`), the owned ABI crate
(`crates/pf-vdisplay-proto`), the host-side IDD-push path (`capture/idd_push.rs`,
`vdisplay/pf_vdisplay.rs`), and the deployment/packaging seam. Evidence is cited as `file:line`.
> **Remediation in progress (2026-06-25).** The findings below were the state at audit time; several are
> already being worked through. Resolved since: the **cutover (§3)** — STEP 8 gave the new driver its own
> `.inx` and re-vendored the installer to the new wdk-sys build (`pf_vdisplay.dll` 613 KB → 251 KB), so the
> new driver is now the shipped one; and the **proto ABI hardening (§6.1/§6.2)** — offset asserts + the
> owned gamepad SHM layouts have landed. Remaining items are tracked and in progress.
---
## 0. Bottom line
The framing "the Windows host has been rewritten with IDD-push as the main path" **overstates what is
on disk.** What actually landed is the **driver rewrite** (plan M0 + M1, STEPs 07): a clean, new,
all-Rust IddCx driver (`packaging/windows/drivers/pf-vdisplay`, ~2,000 LOC) on the unified
`windows-drivers-rs` stack, speaking an owned ABI crate (`pf-vdisplay-proto`), validated on-glass through
HDR. That is the hardest, highest-risk part of the plan (the `/INTEGRITYCHECK` answer, the `iddcx` binding
on `wdk-sys`, on-glass IDD-push + HDR) and it is genuinely well executed.
Three facts the framing hides:
1. **The new path is not the shipped path — it is not shipped at all.** The installer still vendors and
installs the **old** `vdisplay-driver/` (wdf-umdf) build
(`packaging/windows/pf-vdisplay/pf_vdisplay.dll`, dated 2026-06-24). The new driver has **no INF
in-tree**, is not vendored, and therefore cannot be packaged. IDD-push capture is gated behind
`PUNKTFUNK_IDD_PUSH`, which is **not set** in `scripts/windows/host.env.example`, so the default
capture path is **WGC→DDA** and the default display backend falls back to **SudoVDA** whenever the new
driver interface isn't enumerable. The new path runs only on a hand-built bench box with the env var
set.
2. **The host-side rewrite — Goal 1 — has not started.** No `src/windows/` tree, no `config.rs`/
`HostConfig`, no `SessionFactory`/`SessionPlan`, no `session/`. The old god-files are intact. SudoVDA
was not removed (135 refs; `sudovda.rs` is a *hard dependency* of the new path). Unsafe went **up**,
not down.
3. **The new driver itself diverges from its own spec in load-bearing ways** — the watchdog is dead code,
`SET_RENDER_ADAPTER` is a stub, the §2.5 ownership-model refactor wasn't done, and world-writable
logging was re-introduced.
So the riskiest **proof** is done (real progress). The **rewrite** (clean architecture, cutover,
hardening) is still ahead.
---
## 1. Goal / milestone scorecard
| Goal / milestone | Status | Evidence |
|---|---|---|
| **M0** proto ABI + driver toolchain + `/INTEGRITYCHECK` + iddcx binding | ✅ Done | `pf-vdisplay-proto`, vendored `windows-drivers-rs`, `clear-force-integrity.ps1` |
| **M1** new IddCx driver, first light + HDR | ✅ Done (on-glass) | STEPs 07; `swap_chain_processor.rs`, `frame_transport.rs`, `callbacks.rs` |
| **Goal 1** clean, layered host architecture | ❌ Not started | no `src/windows/`, `config.rs`, `session/`, `SessionFactory`/`SessionPlan` |
| **Goal 2** drop every trace of SudoVDA | ❌ Not done | 135 `sudovda` refs; `sudovda.rs` (1,193 LOC) is a hard dep of `pf_vdisplay.rs` + `idd_push.rs` |
| **Goal 3** minimize unsafe + P0 lints | ❌ Regressed | host unsafe ~476 (↑); driver ~160 vs ~60 target; **no** P0 lints anywhere; `OwnedHandle` in **0** host files |
| **§2.5** delete driver global statics / DeviceContext-owned state / `EvtCleanupCallback` | ❌ Not done | `MONITOR_MODES`/`NEXT_ID`/`ADAPTER`/`DEVICE_POOL` still process-globals; `DeviceContext{_device}` empty; no monitor cleanup callback |
| **M4** unify gamepad drivers onto new stack | ❌ Not started | workspace members = `wdk-probe/wdk-iddcx/pf-vdisplay` only; gamepad drivers still standalone wdf-umdf |
| **M6** cutover + delete old monoliths | ❌ Not reached | old driver trees + `dxgi/wgc/wgc_relay/sudovda/punktfunk1` all present (partly by-design as "reference until parity") |
---
## 2. What landed well (preserve, do not regress)
- **The §1 driver "jewels" survived the port.** The two real swap-chain leak fixes are verbatim with
their rationale: borrow `IDXGIDevice` once across `SetDevice` retries
(`swap_chain_processor.rs:174`), and check `terminate` at the loop top during a frame burst (`:238`).
`DEVICE_POOL` keyed by render LUID (the NVIDIA UMD-thread/VRAM leak fix) is intact
(`direct_3d_device.rs:115`). Monitor lock discipline (drop the worker **outside** `MONITOR_MODES`) is
correct (`monitor.rs:343-390`).
- **The frame transport is clean and correct** — the standout module. `FramePublisher` uses
`pf_vdisplay_proto::frame` for header/token/names (no hand-rolled offsets), straight-line
acquire→copy→release with no `?` between lock/unlock (`frame_transport.rs:266-275`), format guard
before `CopyResource`, stale-ring generation detection, correct drop order.
- **The proto control plane is properly owned**: fresh GUID (not SudoVDA's `e5bcc234`), centralized
`FrameToken::pack/unpack` used by both sides, and a **real version handshake the host actually
asserts** and bails on mismatch (`pf_vdisplay.rs:455-466`). Typed IOCTL dispatch collapsed the
per-call unsafe (`control.rs`).
- **Per-block `// SAFETY:` discipline** is already present throughout the new driver — most of the value
of `clippy::undocumented_unsafe_blocks` without the lint being on yet.
---
## 3. Deployment gap (the headline)
The new path is built and validated but not reachable by an installed product.
- **Installer ships the old driver.** `packaging/windows/stage-pf-vdisplay.ps1:7-8` vendors the signed
output of `packaging/windows/vdisplay-driver/` (the wdf-umdf tree); `punktfunk-host.iss` installs that
via `install-pf-vdisplay.ps1`. The vendored binary is `packaging/windows/pf-vdisplay/pf_vdisplay.dll`
(613,760 bytes — the old build).
- **New driver is not packageable.** `find packaging/windows/drivers -name '*.inf'` → none. The new
workspace is built + FORCE_INTEGRITY-cleared in CI (`windows-drivers.yml`) as a **compile/link gate
only**; nothing signs or vendors its output.
- **GUID split keeps them apart.** The old driver exposes the old SudoVDA interface GUID; the host's
`sudovda.rs` backend opens it. The new driver exposes the fresh `70667664-…` GUID; only
`pf_vdisplay.rs` opens it. With the old driver installed, `pf_vdisplay::is_available()` → false → the
host silently uses the SudoVDA backend.
- **IDD-push is off by default.** `scripts/windows/host.env.example` sets only
`PUNKTFUNK_ENCODER=auto`, `PUNKTFUNK_VIDEO_SOURCE=virtual`, `PUNKTFUNK_SECURE_DDA=1`, `RUST_LOG=info`.
`PUNKTFUNK_IDD_PUSH` is checked via `var_os(...).is_some()` (`capture.rs:348`, `punktfunk1.rs:2223+`,
`pf_vdisplay.rs:57`) but never set in deployment.
Net: a freshly installed Windows host runs **old driver + SudoVDA backend + WGC/DDA capture** — the
pre-rewrite path. The rewrite is a manually-validated parallel track, not a delivered feature.
---
## 4. Driver code audit — stability / correctness
### 4.1 P0 — the watchdog is dead code; host-crash leaks an orphan monitor
`WATCHDOG_PINGS` is incremented on `IOCTL_PING` (`control.rs:35`) but **nothing reads it** — the only
`thread::spawn` in the driver is the swap-chain worker (`swap_chain_processor.rs:104`). The comments are
misleading: "STEP 4's watchdog thread samples it" (`control.rs:17`) and "the watchdog reaps all monitors"
(`control.rs:14`) describe a thread that does not exist; `adapter_init_finished`
(`callbacks.rs:30-37`) does not start one despite its doc claiming so.
Consequence: if `serve` dies or the service is stopped with `TerminateProcess` (skipping `Drop` → no
`IOCTL_REMOVE`), the virtual monitor + its worker thread + pooled D3D device persist in WUDFHost until the
**next** host start issues `IOCTL_CLEAR_ALL`. If the host is not restarted, the orphan monitor stays
plugged into the desktop topology indefinitely.
The plan called for host-gone detection by **`EvtCleanupCallback` RAII**, a **polling watchdog**, or
**`EvtFileClose`** (§3.4) — none is implemented. Fix: implement the watchdog thread, or (preferred) wire
`EvtFileClose` so "host holds the control handle open" = liveness; and remove the false comments.
### 4.2 P1 — `SET_RENDER_ADAPTER` is a stub → hybrid-GPU is a hard failure
`control.rs:47` returns `STATUS_NOT_IMPLEMENTED`, contradicting plan §3.2 (which made it unconditional).
The driver renders the virtual monitor on whatever adapter the OS picks (`callbacks.rs:275`,
`pooled_device(luid)`) and reports that LUID to the host. On a hybrid **iGPU+dGPU** box, if the OS picks
the iGPU, the host's ring textures (created on the NVENC dGPU) fail `OpenSharedResourceByName`
`DRV_STATUS_TEX_FAIL` (`frame_transport.rs:195-208`) → the host's 20 s hard bail (§5.1). This is a silent
hard failure on common Optimus/hybrid configs. The single-dGPU RTX bench box never reproduced it.
### 4.3 P1 — the §2.5 ownership refactor wasn't done
State is still process-global: `MONITOR_MODES`/`NEXT_ID` (`monitor.rs:63,65`), `ADAPTER`
(`adapter.rs:41`), `DEVICE_POOL` (`direct_3d_device.rs:115`); `DeviceContext` is an empty `{ _device }`
(`entry.rs:20`). No `EvtCleanupCallback` on the monitor object (`monitor.rs:292-296` sets only Size +
scope). Monitor identity is still 3-keyed (`id`/`object`/`session_id`), not the collapsed single
`Monitor`.
This is why the plan's central payoff — *stable monitor reuse → drop the preempt dance → unblock
`max_concurrent>1` on Windows* — was not achieved. The host still does fresh-monitor-per-session with the
`IDD_SETUP_LOCK` preempt + `wait_for_monitor_released` dance (`punktfunk1.rs:2216-2237`), so Windows
IDD-push is effectively single-client even though `DEFAULT_MAX_CONCURRENT = 4`.
### 4.4 P2 — world-writable logging re-introduced
Plan §6 said delete the `C:\Users\Public\*.log` driver logging; the new driver re-added it
(`pf-vdisplay/src/log.rs:18``C:\Users\Public\pfvd-driver.log`). Info-leak / DoS surface; should move to
ETW or be gated off release builds.
### 4.5 P2 — no control-plane input validation
`create_monitor` receives `width/height/refresh` from the IOCTL with no bounds check (`control.rs:62-63`
`monitor.rs:243`). The host is a trusted LocalSystem process so the trust boundary holds, but a buggy
host could request an absurd mode. `read_input` uses `T: Copy`, not `bytemuck::Pod` (`control.rs:96`);
Pod would be a stronger guarantee.
---
## 5. Host code audit
### 5.1 P1 — when IDD-push is engaged there is no fallback
The plan kept WGC/DDA as a safety net; the code commits hard. `capture.rs:345` consumes the keepalive and
returns the IDD-push capturer with "no fall-through"; attach failure surfaces as a **20 s deadline
`bail!`** (`idd_push.rs:820-846`) that tears the session down black rather than degrading to DDA. Combined
with §4.2, hybrid-GPU = a guaranteed 20 s black-then-drop.
### 5.2 P1 — SudoVDA is a hard dependency of the "new" path
`pf_vdisplay.rs` and `idd_push.rs` import `isolate_displays_ccd`/`resolve_render_adapter_luid`/
`set_advanced_color`/`CURRENT_MON_GEN` directly from `super::sudovda` (`pf_vdisplay.rs:43-46`,
`idd_push.rs:351-356,809`). `punktfunk1.rs:2231` calls `crate::vdisplay::sudovda::wait_for_monitor_released`
even when pf-vdisplay is the live backend — benign **today** only because pf-vdisplay preempts inline and
the SudoVDA `MGR` is empty (`pf_vdisplay.rs:645-647`), but it is a fragile cross-static landmine. Plan §9
(move CCD/adapter helpers into neutral `windows/display_ccd.rs` + `adapter.rs`) is the right fix and is
unstarted.
### 5.3 P2 — texture-ownership contract is convention, not types
The §4 in-place-encode hazard is *mitigated* by a host-owned 3-slot `OUT_RING` +
`pipeline_depth().clamp(1, OUT_RING)` (`idd_push.rs:60,867-872`) — sound for the live synchronous loop —
but nothing type-enforces it. `nvenc.rs:7-10` still carries the "safe because the loop is synchronous"
comment, and `repeat_last()` (`idd_push.rs:755-766`) can re-hand an out-ring slot that may still be
encoding under depth>1. Narrow, but it is the residual corruption edge the plan wanted closed type-level.
### 5.4 P2 — HDR toggle recreates the whole ring mid-session
`recreate_ring` (`idd_push.rs:582-617`) drops + recreates all 6 keyed-mutex textures on an HDR mode flip,
polled on a 250 ms throttle (`idd_push.rs:622-626`) → up to a 250 ms format-mismatch freeze window where
the driver drops every frame (`frame_transport.rs:256-260`). Works, but heavy and visibly janky.
---
## 6. ABI / proto
### 6.1 P1 — gamepad SHM was not migrated into proto (the one real drift hazard)
Plan §3.1 wanted `XusbShm` (64 B) and `PadShm` (256 B incl. `device_type`) in `pf-vdisplay-proto`. They
are hand-duplicated across four sides on two build graphs, with `device_type` as a bare literal `140`:
host `inject/dualsense_windows.rs:45-52` (`OFF_DEVTYPE=140`) vs driver `dualsense-driver/src/lib.rs:753`
(`*view.add(140)`); XUSB host `inject/gamepad_windows.rs:36-47` vs driver `xusb-driver/src/lib.rs`. A
one-sided edit compiles clean on both and silently mis-routes. The `pf-vdisplay` frame/control contract
got compile-error-on-drift; the gamepad contract did not. (The gamepad drivers being standalone cargo
workspaces is the structural blocker — folding them into the unified workspace, M4, fixes both.)
### 6.2 P2 — proto advertises offset asserts but only has size asserts
`SharedHeader` (14 mixed-width fields + a `_pad`) is guarded by `size_of == 64` + bytemuck-Pod
(`pf-vdisplay-proto/src/lib.rs:232`), which catches most regressions but not a same-size field reorder.
Add `offset_of!` asserts for `magic/latest/generation/dxgi_format/driver_status` and the `AddReply` LUID
split.
---
## 7. Performance opportunities
- **Hybrid-GPU cross-adapter copy** (once §4.2 `SET_RENDER_ADAPTER` works): pinning the driver render to
the NVENC GPU removes a cross-adapter staging path entirely — correctness *and* latency.
- **HDR ring recreate** (§5.4) is the heaviest per-session-event op; if the display HDR state is known at
`open()` from the negotiated mode, size the ring right the first time and skip the recreate + 250 ms
window in the common case.
- **Keyed-mutex acquire timeout is 8 ms** on the host consume side (`idd_push.rs:725`) — at 240 Hz
(4.2 ms/frame) one stall already drops ≥2 frames. Reasonable as a safety bound; worth measuring under
load against a tighter value plus an explicit drop counter.
- The encode|send split, microburst pacing, and `pipeline_depth=2` convert/copy-vs-NVENC overlap are
preserved — no regression on the hot path.
---
## 8. Hygiene (Goal 3)
- **No P0 lints anywhere.** Neither the host crate nor the new driver crates carry
`deny(unsafe_op_in_unsafe_fn)` / `warn(clippy::undocumented_unsafe_blocks)` /
`warn(clippy::multiple_unsafe_ops_per_block)`. The plan claimed the driver workspace "already has it";
it does not (`pf-vdisplay/src/lib.rs:11` is only `allow(...)`). A few-line, high-leverage first step
before any further unsafe work.
- **`OwnedHandle`/`from_raw_handle` used in zero host files** — the plan's "single biggest cheap win."
`pf_vdisplay.rs` holds a raw `isize` device handle in the pinger thread; `idd_push.rs` holds raw
event/map handles. Obvious first conversions.
- **Unsafe counts moved the wrong way.** Host ~476 (target ~35); new driver ~160 (target for all three
drivers ~60), and the old gamepad drivers are untouched on top of that.
---
## 9. Recommended priority order
**P0 — correctness/stability, before relying on the path**
1. Make host-gone detection real: implement the watchdog thread **or** `EvtFileClose`, and delete the
false "watchdog" comments. Verify service stop is cooperative (named stop event → `Drop`
`IOCTL_REMOVE`), not `TerminateProcess`. (§4.1)
2. Implement `SET_RENDER_ADAPTER` (pin driver render to the NVENC adapter) **and** add a real capture
fallback (IDD-push attach failure → DDA) instead of the 20 s black bail. (§4.2, §5.1)
**P1 — ship-ability + the actual rewrite**
3. Cutover plan: give the new driver an in-tree INF, vendor *its* signed output, flip
`stage-pf-vdisplay.ps1`, and make IDD-push the code default (WGC/DDA fallback) or set
`PUNKTFUNK_IDD_PUSH=1` in `host.env`. Until then the rewrite does not reach users. (§3)
4. Migrate the gamepad SHM into `pf-vdisplay-proto` (kills the `140`-literal drift hazard). (§6.1)
5. Add the P0 lints; convert raw handles to `OwnedHandle`. (§8)
**P2 — the host-side architecture (Goal 1, the bulk of "rewrite the host")**
6. §2.5 driver ownership refactor (DeviceContext state + `EvtCleanupCallback` + single monitor identity)
— the prerequisite to `max_concurrent>1` on Windows. (§4.3)
7. §9 SudoVDA decoupling (split CCD/adapter helpers into neutral modules), then the §2.2/§2.4 host tree
(`config.rs`/`SessionFactory`) — the clean architecture that was Goal 1. (§5.2)
8. Offset asserts in proto; remove world-writable driver logging; M4 gamepad-driver unification; then M6
deletion of the old monoliths. (§6.2, §4.4)
---
## Appendix — methodology
Full read of the new driver (`packaging/windows/drivers/pf-vdisplay/src/*.rs`, `wdk-iddcx/src/lib.rs`)
and `pf-vdisplay-proto`; targeted read of the host IDD-push path (`capture/idd_push.rs`,
`vdisplay/pf_vdisplay.rs`, `capture.rs`, `vdisplay.rs`, `encode.rs`, `encode/nvenc.rs`); structural
grep/diff of plan §2.2/§6/§8/§9/§10 against the on-disk tree; packaging/CI inspection
(`punktfunk-host.iss`, `stage-pf-vdisplay.ps1`, `windows-drivers.yml`, `scripts/windows/host.env.example`).
Unsafe counts are raw `grep -c unsafe` over the relevant subtrees (occurrences, not blocks). Not validated
on hardware — this audit reads code and packaging only; on-glass behavior is per the commit log and
[`docs/windows-host-rewrite.md`](windows-host-rewrite.md) §1314.