Files
punktfunk/docs/windows-host-rewrite-remediation.md
T
enricobuehler 1cd87066d7 docs(windows-rewrite): track GB1/GB3 progress + box IP floats (DHCP)
Record GB1 (host-side recover-or-drop) + GB3 groundwork (driver descriptor guard/logging) in the tracker; note the RTX validation box IP floats (DHCP/ephemeral, recently .173/.158) instead of hardcoding .158.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:35:27 +00:00

169 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Windows Host Rewrite — Audit Remediation Tracker
Status: **in progress** (2026-06-25). Living hand-off doc for working through the findings in
[`docs/windows-host-rewrite-audit.md`](windows-host-rewrite-audit.md) (the audit of the IDD-push rewrite
vs [`docs/windows-host-rewrite.md`](windows-host-rewrite.md)). Keep this updated as items land so the work
can be handed off without losing tasks.
## TL;DR
- **9 commits on `main`, NOT pushed** (`+9` ahead of `origin/main`, tip `e60cda3`). Each is compile-verified
on the RTX box (see [Verification](#verification)).
- **Done:** the entire audit **P0 + P1 + P2** payload, the driver `unsafe` lint, and **F1** (SudoVDA helper
decoupling) complete.
- **Remaining:** **D2** (OwnedHandle), **D1-host** (unsafe-lint sweep), **E1** (driver ownership refactor),
**G** (gamepad-driver unification + old-tree deletion + host `src/windows/` tree).
- **Two cross-cutting follow-ups:** (1) **on-glass behavioral validation** of the committed driver/host
fixes (the box is single-GPU + headless-ish, so hybrid-GPU / HDR-toggle / fallback paths weren't
exercised at runtime); (2) **push** to run the full CI matrix (the local checks skip the `amf-qsv` path).
## Done — committed on `main` (unpushed)
| Commit | Audit § | What | Compile-verified |
|---|---|---|---|
| `0badc17` | — | The audit doc itself | — |
| `95dcef3` | §6.1/6.2 | **A** proto: `offset_of!` asserts on `SharedHeader`/`AddReply`/control structs; owned `XusbShm`/`PadShm` gamepad layouts (+ `min_const_generics`) | local `cargo test` + MSVC (box) |
| `0a7ae5e` | §4.1/4.2/4.4/4.5 | **B** driver: real host-gone **watchdog** (was dead code), **`SET_RENDER_ADAPTER`** impl, world-writable-log gate, mode bounds + `display_info` u64-saturate | driver `cargo build` (box) |
| `e5c9ee8` | §4.2h/6.1 | **C2/C5** host: render-pin comment/activation (driver now honors it); gamepad SHM consumers derive from `pf_vdisplay_proto::gamepad` | host clippy (box) |
| `ed58365` | §5.1 | **C1** host: IDD-push **attach fallback to DDA** (open() hands keepalive back; bounded `wait_for_attach` on `DRV_STATUS_OPENED`) instead of the 20s black bail | host clippy (box) |
| `b0d2838` | §5.3/5.4 | **C3/C4** host: `repeat_last` rotates+copies into a fresh out-ring slot; HDR ring sized FP16 at open when advanced-color is enabled | host clippy (box) |
| `a755d6e` | §8 | **D1-driver** `#![deny(unsafe_op_in_unsafe_fn)]` on `pf-vdisplay` + `wdk-iddcx` | driver `cargo build` (box) |
| `d638a93` | §9 | **F1 pt1**: `resolve_render_adapter_luid` → neutral `crate::win_adapter` | host clippy (box) |
| `e60cda3` | §9 | **F1 rest**: 6 CCD/HDR helpers + `SavedConfig` → neutral `crate::win_display`; SudoVDA reach-in fully broken | host clippy (box) + Linux `cargo check` |
## Remaining — to do
Ordered by suggested sequence. **On-glass = cannot be *finished* without a real session on the RTX box,
driven by a human** (driver install + client connect).
### D2 — `OwnedHandle` on the new path · audit §8 · compile-verifiable · moderate
- **Goal:** replace raw `HANDLE`/`isize` handles held across their lifetime with
`std::os::windows::io::OwnedHandle` (RAII close, fixes leak-on-error, deletes manual `CloseHandle`).
- **Targets:** `vdisplay/pf_vdisplay.rs` — the pinger thread's raw `isize` device handle (`pf_vdisplay.rs`
~324-344); `capture/idd_push.rs``IddPushCapturer { map, event, dbg_map: HANDLE }` (manually closed in
`Drop`). The plan also lists events/jobs/tokens/sections in `windows/process.rs`/`service.rs` (broader).
- **Risk:** handle ownership (double-close / premature close). Compile catches type errors; lifecycle
needs care. Touches the live IDD-push path → ideally smoke-tested on glass after.
- **Verify:** host clippy on the box (the new path is `--features nvenc`).
### D1-host — host-wide `unsafe` lint sweep · audit §8 · large/mechanical
- **Goal:** add `#![deny(unsafe_op_in_unsafe_fn)]` + `#![warn(clippy::undocumented_unsafe_blocks)]`
(+ optionally `multiple_unsafe_ops_per_block`) to the **host crate** (`crates/punktfunk-host/src/main.rs`),
and fix the fallout.
- **Scope:** large — hundreds of `unsafe` blocks across **both** Linux and Windows code need explicit
`unsafe {}` wrapping inside `unsafe fn`s and `// SAFETY:` comments. The driver already has the `deny`
(`a755d6e`); the host has none.
- **Verify:** Linux `cargo clippy -p punktfunk-host --all-targets -- -D warnings` (Linux/cross paths) **and**
host clippy on the box (Windows paths). Do it incrementally per-subsystem to keep the diff reviewable.
### E1 — driver ownership refactor · audit §4.3 / plan §2.5 + §14 step 5 · **on-glass-gated** · large
- **Goal:** move the driver's process-global statics (`MONITOR_MODES`, `NEXT_ID`, `ADAPTER`, `DEVICE_POOL`)
into a WDF `DeviceContext`; **wire `EvtCleanupCallback` on the `IDDCX_MONITOR` object** so the
`SwapChainProcessor` + D3D drop via RAII; collapse the 3-key monitor identity (`id`/`object`/`session_id`)
to one. Unblocks `max_concurrent>1` on Windows + removes the host-side preempt dance.
- **Why on-glass:** the plan's critique is explicit — *instrument that `MonitorContext::Drop` actually
RAN*; if the cleanup callback does not fire on this UMDF/IddCx stack, **keep the current explicit
REMOVE/teardown path as the fallback**. Cannot be signed off compile-only.
- **Files:** `packaging/windows/drivers/pf-vdisplay/src/{entry,adapter,monitor,callbacks,swap_chain_processor}.rs`.
- **Verify:** driver `cargo build` (compile) on the box; then on-glass reconnect-storm + leak check
(`LIVE_DEVICES` counter in `direct_3d_device.rs`, the world-readable log when `PFVD_DEBUG_LOG` is set).
### G — gamepad-driver unification (M4) + deletion (M6) + host tree · audit §6/§10 + plan §2.2 · **on-glass-gated** · largest
- **M4:** fold `pf_dualsense` + `pf_xusb` (today standalone `packaging/windows/{dualsense,xusb}-driver/` on
the old `wdf` stack) into the unified `packaging/windows/drivers/` workspace on `windows-drivers-rs`. This
also enables the **driver-side** gamepad-SHM→proto switch (host side already done in C5 — the driver still
hand-reads `view.add(140)`; point it at `pf_vdisplay_proto::gamepad::PadShm`/`XusbShm`).
- **M6:** delete the old `packaging/windows/vdisplay-driver/` tree + the old gamepad driver trees + the
bring-up scaffolding (`DebugBlock`/`spawn_observer`/`IDD_PERSIST`/`open_or_reuse` in `idd_push.rs`) — **only
after on-glass parity** of the new path.
- **Host architecture (Goal 1, plan §2.2/2.4):** the `src/windows/` subtree + `config.rs` (`HostConfig`) +
`SessionFactory`/`SessionPlan`**not started**. The biggest clarity lever; large.
### Cross-cutting follow-ups (not a single task)
- **On-glass validation of the committed fixes** — needs the RTX box + a client. Specifically: the
**watchdog** actually reaps on host-kill (B1); **`SET_RENDER_ADAPTER`** pins correctly on a *hybrid* box
(B2/C2 — the lab box is single-dGPU, so this path is unexercised); the **IDD-push→DDA fallback** triggers
+ the happy path still attaches within 4s (C1); **HDR ring sizing** + **out-ring repeat** under real HDR /
static-desktop pipelining (C3/C4).
- **Push** to run the full CI matrix — the local host checks use `--features nvenc` only (no FFmpeg), so the
`amf-qsv` encode path is unexercised locally; CI (`windows-host.yml`) covers it.
## Related workstream — fullscreen-game IDD-push capture bug (separate doc)
A **separate, newly-found bug** (NOT an audit finding) in the same IDD-push subsystem, with its own staged
fix plan: [`docs/windows-host-rewrite-game-capture-bug.md`](windows-host-rewrite-game-capture-bug.md).
**Symptom:** launching a fullscreen game (Doom the Dark Ages) on an HDR IDD-push stream flashes the desktop,
the game never shows, and reconnect = black screen + working audio. **Root cause:** the IDD-push ring is
fixed format+size at session start; the driver silently drops every frame whose surface descriptor no longer
matches (a game forces a mode-set); the host has no channel to learn the descriptor changed; and there is no
mid-session fallback → 20 s `bail!`.
**Intersections with this remediation — read before implementing:**
- **Stage 1 builds on our C1 (`ed58365`); do not duplicate it.** C1 added an IDD-push→DDA fallback, but
**open-time only** (driver never attaches). The game bug is **mid-session** (attached, then a game changes
format/size). The bug doc's Stage 1 (a composing capturer that fails over mid-session) is the
generalization — build it on C1's `open()`-returns-keepalive + bounded-attach infrastructure.
- **The bug doc was written against pre-remediation `main` (`a11b0dd`).** Its line numbers and its claim
"`capture.rs:348-356` … no fall-through" are **stale after our 9 commits** (C1 changed exactly that).
Rebase on current `main` first.
- **Stage 2 (new `SharedHeader` fields + `PROTOCOL_VERSION` bump)** must update the **`offset_of!`/size
asserts added in A (`95dcef3`)** — they catch drift at compile time (the intended safety net). Note: those
asserts live in the `frame` module of `crates/pf-vdisplay-proto/src/lib.rs` (the doc says `frame.rs`).
- **Stage 0 / S3 diagnostics rely on the driver log**, which **B3 (`0a7ae5e`) gated off in release builds**
(`debug_assertions || PFVD_DEBUG_LOG`). Enable it (`PFVD_DEBUG_LOG=1` or a debug build) for the repro.
- **S1/S2 (driver swap-chain resilience)** is adjacent to **E1** (same `swap_chain_processor.rs`/
`callbacks.rs`); coordinate so they don't conflict.
- The bug doc's "doc-lag" note (`stage-pf-vdisplay.ps1` still names the old `vdisplay-driver/` tree) is part
of our **G / M6** packaging cleanup.
**Stages (detail in the bug doc):** Stage 0 diagnostics (S3) → Stage 1 mid-session fallback (P3, host-only,
the user-visible fix) → Stage 2 adaptive ring (P1/P2; proto bump + driver re-vendor) → Stage 3 trim
advertised modes → Stage S driver resilience (S1/S2). Tracked as GB0GB3 in the task list.
**Progress (2026-06-25):** **GB1 landed host-side***recover-or-drop, no DDA* (per the owner's call): the
ring now tracks the display's ACTUAL mode (CCD `active_resolution`), recreating on a size/HDR change so a
game mode-set recovers in-place; if no frame resumes within 3 s it drops the session cleanly (client
reconnects). Commits `f98ab07` (first-frame failover) + `c87bfe0`. **Awaiting on-glass Doom validation.**
**GB3 groundwork landed** — driver `publish()` width/height guard + descriptor-on-drop logging + a flushed
process-lifetime log appender so the swap-chain worker's lines land (commit `789ad49`); **needs a driver
rebuild + re-vendor to deploy.** Stage 3 (trim modes) deprioritized; Stage S code-fix gated on these
diagnostics showing whether S1/S2 fire on-glass.
## Verification
The persistent validator is the **RTX box** `ssh "Enrico Bühler"@<ip>` (ENRICOS-DESKTOP, RTX 4090,
PS shell). **The IP FLOATS — DHCP + boots to Proxmox on reboot (new lease each time); recently `.173` /
`.158`, confirm the current IP first. EPHEMERAL — never reboot it, never depend on it surviving.** It has
WDK 26100 + LLVM 21.1.2 + the Rust toolchain. Build clone: `C:\Users\Public\pf-rewrite`.
```sh
# 0. (local, cross-platform) the proto crate + the Linux host build
cargo test -p pf-vdisplay-proto
cargo check -p punktfunk-host # Linux paths; the win_* mods are #[cfg(windows)]
# 1. reset the box clone to a clean base, then overlay your changed files
# ssh ... "cd C:\Users\Public\pf-rewrite; git fetch -q origin; git reset -q --hard origin/main; git clean -qfd; git checkout -q <rev>"
# scp <changed files> "Enrico Bühler@<ip>:C:/Users/Public/pf-rewrite/<same rel path>"
# 2. host clippy (warm target ~4s). NVENC import lib at C:\t\nvenc; no FFmpeg needed (amf-qsv off).
ssh ... "cd C:\Users\Public\pf-rewrite; $env:PUNKTFUNK_NVENC_LIB_DIR='C:\t\nvenc'; \
cargo clippy -p punktfunk-host --features nvenc --target x86_64-pc-windows-msvc -- -D warnings"
# 3. driver workspace build (fires deny(unsafe_op_in_unsafe_fn)); ~5s
ssh ... "cd C:\Users\Public\pf-rewrite\packaging\windows\drivers; \
$env:Version_Number='10.0.26100.0'; $env:LIBCLANG_PATH='C:\Program Files\LLVM\bin'; cargo build"
```
Gotchas: the box username has a `ü` → quote it; PS shell, filter output with `Select-Object -Last N`. After
a `git reset --hard` on the box clone, re-`scp` your working files (reset discards them). Do **not** build in
`C:\Users\Public\punktfunk-native` (the deployed host).
## New modules introduced by this work
- `crates/pf-vdisplay-proto/src/lib.rs` → added `mod gamepad` (`XusbShm`/`PadShm`/magics/name helpers) +
`offset_of!` asserts.
- `crates/punktfunk-host/src/win_adapter.rs``resolve_render_adapter_luid` (plan's `windows/adapter.rs`).
- `crates/punktfunk-host/src/win_display.rs` → CCD/HDR display helpers (plan's `windows/display_ccd.rs`).
- Driver: `start_watchdog`/`reap_orphaned` (control.rs/monitor.rs), `set_render_adapter` (adapter.rs),
`file_log_enabled` gate (log.rs).