punktfunk/docs/windows-host-rewrite-game-capture-bug.md
enricobuehler 48202a0f89 docs(windows-rewrite): mark game-capture bug FIXED + bring rewrite status current (§15)
The fullscreen-game-breaks-IDD-push bug is FIXED by the resolution-listening
recovery (c87bfe0: the 250ms poll now follows the display's actual resolution
and recreates the ring on any descriptor change, recover-or-drop), backed by
open-time first-frame DDA failover (f98ab07) and the driver publish() width/
height guard + flushed logging (789ad49). No protocol bump was needed — the host
reads the real resolution straight from Windows (CCD/GDI), so the bug doc's
Stage-1 composing capturer + Stage-2 protocol bump were unnecessary. Bug doc
marked FIXED with a Resolution section; the staged plan kept as superseded record.

windows-host-rewrite.md: the progress log was stale (ended at "M1 cont."). Added
§15 Current status — the driver STEP 0-8 port landed on main on-glass HDR-
validated; the host was refactored *in place* via windows-host-goal1 (not the §10
greenfield rebuild); §2.5 ownership model resolved the swap-chain-reuse / monitor-
leak open item; iddcx + /INTEGRITYCHECK CI-green. Remaining: the secure-desktop
on-glass gate (the single biggest unproven claim), M4 gamepad-driver migration,
M5/M6 cleanup, and the pf-vdisplay slot-reclaim driver fix. Top Status flipped
proposed → largely implemented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 21:35:55 +00:00

# pf-vdisplay: fullscreen game breaks video (IDD-push capture) — issue analysis
> **Status: FIXED ✅ (2026-06-25).** Resolved by the **resolution-listening recovery** — see
> [Resolution](#resolution-fixed-2026-06-25) below. The investigation that follows is kept as the record
> of how it was diagnosed. Companion to [`windows-host-rewrite.md`](./windows-host-rewrite.md).
## Resolution (fixed 2026-06-25)
The fix landed as the **recover-or-drop** design (host-only, **no protocol bump**), *not* the
composing-capturer mid-session failover originally sketched in
[Recommended fix](#recommended-fix-staged):
- **`c87bfe0` — IDD-push *recovers* from a game mode-set (the "resolution-listening" work).** The ring now
**tracks the display's actual mode**. At open it is sized to the display's real resolution (new
`win_display::active_resolution`, CCD/GDI). Mid-session the 250 ms poll — previously HDR-toggle-only —
now also **follows the active resolution**; on *any* descriptor change (size **or** HDR) it recreates the
ring at the new mode (`recreate_ring` generalized to a new size), the driver re-attaches via the existing
`is_stale()` path, and frames resume at the game's mode. **No freeze, no reconnect.** If a change is
genuinely unrecoverable (e.g. an exclusive flip the host can't follow) a `recovering_since` clock fires
after 3 s and `try_consume` drops the session cleanly so the client reconnects, instead of freezing
forever. A pure idle desktop (no mode change) never triggers it.
- **`f98ab07` — open-time first-frame failover to DDA (GB1 pt 1).** `wait_for_attach` now requires the
driver to publish a *first frame* (not just `DRV_STATUS_OPENED`); a display the driver attaches to but
whose frames its `publish()` guard rejects now fails `open()` within ~4 s → `capture.rs` falls back to
DDA → the game is captured + visible after a reconnect. A normal/idle open (frame within ~1 s) is never
false-failed, and DDA is itself a working path, so even a false positive degrades gracefully.
- **`789ad49` — driver `publish()` width/height guard + a process-lifetime flushed log appender** (GB3
groundwork): drops a surface whose descriptor no longer matches the host ring (`CopyResource` needs
matching dims too, else garbage) and logs the actual descriptor once per mismatch episode, so the
swap-chain WORKER-thread lines land (closing the bug-doc **S3** observability gap). Needs a driver
rebuild + re-vendor to deploy (separate from the host-only GB1 fix).
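
The recover-or-drop behaviour described for `c87bfe0` can be sketched as a small state machine. This is an illustration only, not the code in `idd_push.rs`: `Descriptor`, `RingTracker`, `PollOutcome` and `RECOVERY_TIMEOUT` are hypothetical names, and the clock is passed in so the sketch can be exercised without a display.

```rust
use std::time::{Duration, Instant};

/// What the ring is currently sized and formatted for (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Descriptor {
    pub width: u32,
    pub height: u32,
    pub hdr: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PollOutcome {
    /// Nothing changed and no recovery is pending.
    Steady,
    /// The display descriptor changed; the ring was recreated to match it.
    Recreated(Descriptor),
    /// Ring recreated, still waiting for the driver to publish into it.
    Recovering,
    /// No frame arrived within the window: drop the session so the client reconnects.
    Drop,
}

/// The document states a 3 s recovery window.
pub const RECOVERY_TIMEOUT: Duration = Duration::from_secs(3);

pub struct RingTracker {
    ring: Descriptor,
    generation: u64,
    recovering_since: Option<Instant>,
}

impl RingTracker {
    pub fn new(initial: Descriptor) -> Self {
        Self { ring: initial, generation: 0, recovering_since: None }
    }

    pub fn generation(&self) -> u64 {
        self.generation
    }

    /// Called when the driver publishes a frame: any pending recovery is over.
    pub fn on_frame(&mut self) {
        self.recovering_since = None;
    }

    /// One poll tick (every 250 ms in the document). `display` is what Windows reports now.
    pub fn poll(&mut self, display: Descriptor, now: Instant) -> PollOutcome {
        if display != self.ring {
            // Size or HDR changed: follow the display and bump the generation so the
            // driver sees the old attachment as stale.
            self.ring = display;
            self.generation += 1;
            if self.recovering_since.is_none() {
                self.recovering_since = Some(now);
            }
            return PollOutcome::Recreated(display);
        }
        match self.recovering_since {
            Some(since) if now.duration_since(since) >= RECOVERY_TIMEOUT => PollOutcome::Drop,
            Some(_) => PollOutcome::Recovering,
            None => PollOutcome::Steady,
        }
    }
}
```

In this sketch an idle desktop never starts the recovery clock, because the clock is armed only when the descriptor actually changes.
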
**Why this instead of the composing capturer (original Stage 1):** the host reads the display's real
resolution straight from Windows (CCD/GDI), so it doesn't need the driver to report it over a new
`SharedHeader` field — the original **Stage 2's protocol bump is unnecessary**. In-place recovery keeps the
fast IDD-push (zero-copy) path live *through* a game mode-set instead of permanently demoting to DDA;
open-time DDA failover (`f98ab07`) covers the "display already in a broken mode at connect" case.
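
A minimal sketch of the open-time failover follows, assuming a polling loop with an injected probe and sleep. `AttachStatus`, `AttachError`, `CaptureBackend`, `POLL_STEP` and `select_backend` are hypothetical names; only the roughly 4 s deadline and the "first frame, not just opened" rule come from the text above.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum CaptureBackend {
    IddPush,
    Dda,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AttachError {
    /// The driver never reported that it opened the ring.
    NotOpened,
    /// The driver attached but never published a frame.
    NoFirstFrame,
}

/// Hypothetical snapshot of what the driver has reported so far.
pub struct AttachStatus {
    pub opened: bool,
    pub first_frame_seen: bool,
}

/// The document gives "~4 s"; the poll step is an assumption.
pub const FIRST_FRAME_DEADLINE: Duration = Duration::from_secs(4);
pub const POLL_STEP: Duration = Duration::from_millis(50);

/// Succeeds only once the driver is opened AND has published a first frame.
pub fn wait_for_attach(
    mut probe: impl FnMut() -> AttachStatus,
    mut sleep: impl FnMut(Duration),
) -> Result<(), AttachError> {
    let mut waited = Duration::ZERO;
    let mut ever_opened = false;
    loop {
        let status = probe();
        ever_opened |= status.opened;
        if status.opened && status.first_frame_seen {
            return Ok(());
        }
        if waited >= FIRST_FRAME_DEADLINE {
            return Err(if ever_opened {
                AttachError::NoFirstFrame
            } else {
                AttachError::NotOpened
            });
        }
        sleep(POLL_STEP);
        waited += POLL_STEP;
    }
}

/// Any attach failure falls back to DDA, which is itself a working capture path.
pub fn select_backend(attach: Result<(), AttachError>) -> CaptureBackend {
    match attach {
        Ok(()) => CaptureBackend::IddPush,
        Err(_) => CaptureBackend::Dda,
    }
}
```

Because the fallback is a working path, a false positive in this sketch costs only the fast path for that session, which matches the "degrades gracefully" claim above.
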
**Deferred (non-blocking):** Stage 3 (trim `default_modes`) — deprioritized (recovery handles mode-sets and
trimming risks the live display-activation path); Stage S driver resilience (S1/S2) — gated on the
`789ad49` logging once a fresh repro is captured. Owner-confirmed the resolution-listening recovery fixes
the user-visible bug (2026-06-25).
## Context
The all-Rust `pf-vdisplay` IddCx virtual-display driver (STEP 0-8 of the Windows host rewrite,
now on `main`, on-glass-validated for plain desktop + HDR streaming) breaks when a **fullscreen
game** runs on the stream.
**Reproduction (RTX 4090 box `192.168.1.158`):** launch *Doom the Dark Ages* while streaming → the
desktop image **flashes** (a display mode-set fired), the game is **never visible**, and **disconnect
+ reconnect yields a black screen with working audio**. (The box was rebooted afterward, so live
logs from the incident are gone.)
**Runtime config in play** (`C:\ProgramData\punktfunk\host.env`):
- `PUNKTFUNK_IDD_PUSH=1` → capture comes from the driver's **shared-memory frame ring**, not DDA/WGC.
- `PUNKTFUNK_10BIT=1` (+ `PUNKTFUNK_HDR_SHADER_P010=1`) → **HDR active**; the ring is FP16.
- `PUNKTFUNK_MONITOR_LINGER_MS=0` → every (re)connect builds a **fresh** monitor + ring.
- `PUNKTFUNK_VDISPLAY=pf`, `PUNKTFUNK_ENCODER=nvenc`, `PUNKTFUNK_SECURE_DDA=1`.
The driver log (`C:\Users\Public\pfvd-driver.log`) at inspection showed **8 fresh
`IddCxMonitorCreate`/`Arrival` pairs (ids 1-8), all `0x0`, and ZERO swap-chain-processor lines**,
so monitor creation is healthy and the break is entirely **downstream of monitor creation**
(swap-chain drain / frame publish / host consume), exactly where a game-induced mode change lands.
## Root cause (one sentence)
The IDD-push ring is created **once** at session start with a **fixed format and fixed size**
derived from session-start state, there is **no channel for the driver to report the actual
acquired-surface descriptor** back to the host, and there is **no mid-session fallback** — so when
a game forces a format and/or resolution change on the virtual display, the driver silently drops
every frame, the host never learns it needs to adapt, and the stream goes black and then hard-crashes.
## How the symptom maps to the code
1. Game launches → forces a **mode set** on the virtual display (the "desktop flash"). This changes
the OS-composed surface's **DXGI format and/or width/height**, and triggers a swap-chain
unassign→reassign in the driver.
2. The driver's `publish()` copies the acquired surface into the host ring **only if formats match
exactly** (`desc.Format` u32 compare) — and `CopyResource` *also* silently requires identical
dimensions, which is never checked. → **every frame dropped.**
3. The host's only ring-recreate trigger is polling Windows' **HDR-enabled toggle**. It cannot
observe a game-driven format/size change → **host never recreates the ring** → driver re-attaches
to the same mismatched ring → keeps dropping.
4. With `PUNKTFUNK_IDD_PUSH=1` set, the ring is the **sole** capture source (no DDA/WGC fallback).
`next_frame()` repeats the last good frame, then **`bail!`s after a 20 s deadline → the stream
dies.**
5. **Reconnect stays black** because the game is still holding the display in the changed state; the
fresh ring is rebuilt at the **session-negotiated** format/size again and re-mismatches. Audio is
a fully independent plane, so it survives — matching "black + audio."
---
## Identified issues
### Primary
**P1 — IDD-push ring format is fixed at session start; host can't observe a game-driven format change.**
- Host picks the ring format once: FP16 (`DXGI_FORMAT_R16G16B16A16_FLOAT`) if
`advanced_color_enabled(target_id)` else `DXGI_FORMAT_B8G8R8A8_UNORM`.
`crates/punktfunk-host/src/capture/idd_push.rs:340-361`
- Driver drops any frame whose `desc.Format` ≠ the ring format, silently.
`packaging/windows/drivers/pf-vdisplay/src/frame_transport.rs:281-286`
- Host recreates the ring **only** on a Windows HDR-toggle poll (250 ms), never on a format change
it can't see. `idd_push.rs:619-640` (`poll_display_hdr` → `recreate_ring` at `:582-617`).
- Driver re-attaches on a host generation bump (`is_stale`), but nothing bumps it for this case.
`frame_transport.rs:259-270`.
- **No `SharedHeader` field carries the driver's actual acquired-surface format** — the driver only
writes `driver_status`, `driver_status_detail`, `driver_render_luid_low/high` back.
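
To make the P1 dead end concrete, here is a minimal model of the four decisions involved. The function names are hypothetical and the real code is in `idd_push.rs` and `frame_transport.rs`; the two format constants are the numeric `DXGI_FORMAT` values from `dxgiformat.h`.

```rust
/// Numeric values from `dxgiformat.h`.
pub const DXGI_FORMAT_R16G16B16A16_FLOAT: u32 = 10;
pub const DXGI_FORMAT_B8G8R8A8_UNORM: u32 = 87;

/// Host: the ring format is derived from the Windows HDR state only.
pub fn ring_format(advanced_color_enabled: bool) -> u32 {
    if advanced_color_enabled {
        DXGI_FORMAT_R16G16B16A16_FLOAT
    } else {
        DXGI_FORMAT_B8G8R8A8_UNORM
    }
}

/// Host (pre-fix): the poll asked for a new ring only when the HDR toggle flipped.
pub fn hdr_poll_wants_recreate(prev_hdr: bool, now_hdr: bool) -> bool {
    prev_hdr != now_hdr
}

/// Driver: re-attach only when the host has bumped the ring generation.
pub fn is_stale(attached_generation: u32, host_generation: u32) -> bool {
    attached_generation != host_generation
}

/// Driver: a frame is copied only on an exact format match.
pub fn driver_accepts(surface_format: u32, ring_format: u32) -> bool {
    surface_format == ring_format
}
```

With HDR left on, a surface that arrives in any other format (for example `DXGI_FORMAT_B8G8R8A8_UNORM`) is rejected by `driver_accepts`, while `hdr_poll_wants_recreate` and `is_stale` both stay false, so nothing in the loop ever corrects the mismatch.
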
**P2 — IDD-push ring size is fixed at session start; a resolution change is never detected.**
- `header.width/height` written once at `idd_push.rs:396-397`; ring slots sized once and never
resized; consumed frames always report the session size (`idd_push.rs:744-745`).
- `publish()` guards **format only, not width/height** (`frame_transport.rs:284`). `CopyResource`
requires identical dimensions, so a resolution change → silent no-op/garbage, no error logged.
- Driver never reports the acquired surface's real width/height to the host.
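
The gap between a format-only guard and a full descriptor guard can be shown in a few lines. This is a sketch with hypothetical types (`SurfaceDesc`, `PublishGuard`, `PublishDecision`), not the driver's `publish()`; the "log once per mismatch episode" behaviour mirrors what the Resolution section describes for `789ad49`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SurfaceDesc {
    pub format: u32,
    pub width: u32,
    pub height: u32,
}

/// The pre-fix check: a resized surface with the same format passes.
pub fn format_only_guard(surface: &SurfaceDesc, ring: &SurfaceDesc) -> bool {
    surface.format == ring.format
}

#[derive(Debug, PartialEq, Eq)]
pub enum PublishDecision {
    /// Descriptors match; copying into the ring is safe.
    Copy,
    /// Mismatch: skip the frame. `log` is true only for the first frame of an episode.
    Skip { log: bool },
}

pub struct PublishGuard {
    ring: SurfaceDesc,
    last_mismatch: Option<SurfaceDesc>,
}

impl PublishGuard {
    pub fn new(ring: SurfaceDesc) -> Self {
        Self { ring, last_mismatch: None }
    }

    /// Compares format AND dimensions, and rate-limits the log to one line per episode.
    pub fn check(&mut self, surface: SurfaceDesc) -> PublishDecision {
        if surface == self.ring {
            self.last_mismatch = None;
            return PublishDecision::Copy;
        }
        let log = self.last_mismatch != Some(surface);
        self.last_mismatch = Some(surface);
        PublishDecision::Skip { log }
    }
}
```

A 2560x1440 surface checked against a 3840x2160 ring of the same format passes `format_only_guard` but is skipped (and logged once) by `PublishGuard::check`.
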
**P3 — No mid-session capture fallback; a 20 s hard crash instead of degrade.**
- `PUNKTFUNK_IDD_PUSH=1` returns the IDD-push capturer early with the keepalive moved into it — **no
fall-through**. `crates/punktfunk-host/src/capture.rs:348-356`.
- `next_frame()` waits on the frame-ready event (16 ms), repeats the last frame, and **`bail!`s
after a 20 s deadline** → the encode loop tears the session down.
`idd_push.rs:819-847`.
- The WGC→DDA fallback that exists (`capture.rs:389-404`) is **open-time only** and on the
**non**-IDD-push path; it does not help here.
- The `VirtualOutput` already carries a `WinCaptureTarget { adapter_luid, gdi_name, target_id }`
(`vdisplay/pf_vdisplay.rs` `Monitor::target()`), so a DDA/WGC capturer **can** be opened on the
same virtual output — the wiring just doesn't exist for IDD-push.
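
The repeat-then-bail behaviour in P3 reduces to a clock that is reset only by a fresh frame. The sketch below assumes frames are identified by a `u64` and that time is injected; `StallClock` and `Next` are hypothetical names, and only the 16 ms wait and 20 s deadline come from the text.

```rust
use std::time::{Duration, Instant};

/// Wait on the frame-ready event per iteration, per the document.
pub const FRAME_WAIT: Duration = Duration::from_millis(16);
/// After this long without a fresh frame the session is torn down.
pub const HARD_DEADLINE: Duration = Duration::from_secs(20);

#[derive(Debug, PartialEq, Eq)]
pub enum Next {
    /// A new frame arrived from the ring.
    Fresh(u64),
    /// No new frame yet: hand the encoder the last good one again.
    Repeat(u64),
    /// The deadline expired; the caller ends the session.
    Bail,
}

pub struct StallClock {
    last_fresh: Instant,
    last_frame: u64,
}

impl StallClock {
    pub fn new(first_frame: u64, now: Instant) -> Self {
        Self { last_fresh: now, last_frame: first_frame }
    }

    /// How long the ring has gone without a fresh frame.
    pub fn stalled_for(&self, now: Instant) -> Duration {
        now.duration_since(self.last_fresh)
    }

    /// One iteration of the consume loop. `fresh` is `Some(id)` if the event fired.
    pub fn step(&mut self, fresh: Option<u64>, now: Instant) -> Next {
        match fresh {
            Some(id) => {
                self.last_fresh = now;
                self.last_frame = id;
                Next::Fresh(id)
            }
            None if self.stalled_for(now) >= HARD_DEADLINE => Next::Bail,
            None => Next::Repeat(self.last_frame),
        }
    }
}
```

A shorter threshold on `stalled_for` is the natural hook for an earlier warning or a failover decision, since it is available on every iteration long before the hard deadline.
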
### Secondary (verify during the fix; not the proven primary cause)
**S1 — Driver `run_core` exits permanently on a swap-chain error, with no clear re-arm.**
- On a `ReleaseAndAcquireBuffer2` error (e.g. `DXGI_ERROR_ACCESS_LOST` when a game grabs the
display), `run_core` `break`s and returns; the worker exits and deletes the swap-chain object.
`packaging/windows/drivers/pf-vdisplay/src/swap_chain_processor.rs:359-362` (+ delete at `:141-143`).
- A mode change drives unassign→assign which **does** respawn a fresh processor
(`callbacks.rs:309-318`, `:249-305`), so a clean mode change recovers. **Open question:** whether
the OS reliably re-assigns after a bare `ACCESS_LOST` exit (no unassign), or whether the monitor
stalls with a dead-but-installed processor. Confirm against the IddCx contract / upstream
`virtual-display-rs`. The standard IddCx model expects the OS to re-assign, but this needs proof.
**S2 — `IddCxSwapChainSetDevice` give-up leaves a dead-but-installed processor.**
- `assign_swap_chain` returns `STATUS_SUCCESS` and installs the processor **before** the worker's
`SetDevice` retries run; if all 60 retries (≈3 s) fail during a mode flap, the worker returns and
the processor is dead, but the OS believes the swap chain is assigned → potential permanent stall.
`swap_chain_processor.rs:191-226`, `callbacks.rs:279-293`.
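
One way to make the S2 "dead-but-installed" state observable is for the worker to publish its health where the callback side can read it. The sketch below is an assumption about a possible direction, not the driver's code: `ProcessorHealth`, `ProcessorState` and `run_set_device` are hypothetical, and the 50 ms interval is inferred from "60 retries (≈3 s)".

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::time::Duration;

pub const SET_DEVICE_RETRIES: u32 = 60;
/// Assumption: 60 retries over roughly 3 s implies about 50 ms between attempts.
pub const RETRY_INTERVAL: Duration = Duration::from_millis(50);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProcessorState {
    Starting,
    Running,
    Dead,
}

/// Shared between the worker thread and whoever installed the processor.
pub struct ProcessorHealth(AtomicU8);

impl ProcessorHealth {
    pub fn new() -> Self {
        Self(AtomicU8::new(0))
    }

    pub fn set(&self, state: ProcessorState) {
        let raw = match state {
            ProcessorState::Starting => 0,
            ProcessorState::Running => 1,
            ProcessorState::Dead => 2,
        };
        self.0.store(raw, Ordering::Release);
    }

    pub fn get(&self) -> ProcessorState {
        match self.0.load(Ordering::Acquire) {
            0 => ProcessorState::Starting,
            1 => ProcessorState::Running,
            _ => ProcessorState::Dead,
        }
    }
}

/// Worker side: retry the set-device call, then record the outcome instead of
/// returning silently. `try_set` stands in for the real IddCx call.
pub fn run_set_device(
    health: &ProcessorHealth,
    mut try_set: impl FnMut() -> bool,
    mut sleep: impl FnMut(Duration),
) -> bool {
    for attempt in 0..SET_DEVICE_RETRIES {
        if try_set() {
            health.set(ProcessorState::Running);
            return true;
        }
        if attempt + 1 < SET_DEVICE_RETRIES {
            sleep(RETRY_INTERVAL);
        }
    }
    health.set(ProcessorState::Dead);
    false
}
```

Whether a `Dead` processor should trigger a re-arm depends on the still-open S1 question of what the OS does after such an exit.
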
**S3 — Driver worker-thread diagnostics are not landing (impairs root-causing).**
- `dbglog!` → `log.rs` opens/appends/closes the file per call with **no explicit flush**, and the
observed log had only control-plane (IOCTL-thread) lines, no swap-chain-processor lines.
`packaging/windows/drivers/pf-vdisplay/src/log.rs:9-22`.
- Whatever the exact reason (write race / token / interleave), the practical effect is the
swap-chain processor's behavior during the break is **invisible**, which is why the cause can't be
pinned from logs alone today. **Fix this first** so the next repro is conclusive.
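
A process-lifetime appender, one of the two options later listed under Stage 0, might look like the sketch below. It uses only the standard library; `FlushedLog` is a hypothetical name and this is not the driver's `log.rs`. Whether `sync_data` on every line is affordable on the swap-chain worker thread is an open trade-off.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::sync::Mutex;

/// Keeps one append-mode handle open for the life of the process.
pub struct FlushedLog {
    file: Mutex<File>,
}

impl FlushedLog {
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file: Mutex::new(file) })
    }

    /// Writes one line under the lock so lines from different threads cannot interleave.
    pub fn line(&self, msg: &str) -> io::Result<()> {
        // A poisoned lock still guards a usable file handle, so keep logging.
        let mut file = self.file.lock().unwrap_or_else(|e| e.into_inner());
        file.write_all(format!("{msg}\n").as_bytes())?;
        file.flush()?;
        // Ask the OS to persist the data so the line survives an abrupt process exit.
        file.sync_data()
    }
}
```

Sharing one `FlushedLog` (for example behind an `Arc`) between the control-plane thread and the worker thread gives both the same handle and the same lock.
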
---
## Verified facts that de-risk the fix
- **The encoder already adapts to a mid-session size/format change.** `encode/nvenc.rs:580-618`:
`submit` detects `size_changed`/`hdr_changed`/device change per frame, tears down, and re-inits
adopting the new frame's geometry + pixel format. So a capturer that changes resolution/format
mid-session is handled downstream — **no encoder API change is needed** for either fix direction.
- **The stream loop relays per-frame geometry.** `CapturedFrame` carries `width`/`height`/`format`
(`capture.rs:50-57`); the loop reads `pipeline_depth()` live and forwards whatever `try_latest()`
returns.
- **WGC and DDA emit the same pixel formats the IDD-push path emits** (`Bgra` / `Rgb10a2`), so a
failover capturer feeds the encoder compatible frames.
- **A failover capturer fits the existing `Capturer` trait** (`next_frame` + `try_latest`,
`capture.rs:120-155`) — a composing capturer that owns the ring capturer + a lazily-opened
WGC/DDA capturer and switches between them is a clean drop-in.
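
The composing capturer mentioned in the last bullet can be sketched as follows. The trait here is deliberately simplified to a single `next_frame` method with a `String` error, which is an assumption: the real `Capturer` trait also has `try_latest` and its signatures are not shown in this document. `Frame`, `FailoverCapturer`, `OpenFallback` and `Scripted` are hypothetical.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Frame {
    pub width: u32,
    pub height: u32,
    pub source: &'static str,
}

/// Simplified stand-in for the host's capture trait.
pub trait Capturer {
    fn next_frame(&mut self) -> Result<Frame, String>;
}

/// Opens the fallback capturer lazily, only when it is first needed.
pub type OpenFallback = Box<dyn FnMut() -> Result<Box<dyn Capturer>, String>>;

pub struct FailoverCapturer {
    primary: Box<dyn Capturer>,
    open_fallback: OpenFallback,
    fallback: Option<Box<dyn Capturer>>,
}

impl FailoverCapturer {
    pub fn new(primary: Box<dyn Capturer>, open_fallback: OpenFallback) -> Self {
        Self { primary, open_fallback, fallback: None }
    }

    pub fn on_fallback(&self) -> bool {
        self.fallback.is_some()
    }
}

impl Capturer for FailoverCapturer {
    fn next_frame(&mut self) -> Result<Frame, String> {
        // Once demoted, keep serving from the fallback for the rest of the session.
        if let Some(fallback) = self.fallback.as_mut() {
            return fallback.next_frame();
        }
        match self.primary.next_frame() {
            Ok(frame) => Ok(frame),
            Err(_stalled) => {
                let mut fallback = (self.open_fallback)()?;
                let frame = fallback.next_frame();
                self.fallback = Some(fallback);
                frame
            }
        }
    }
}

/// Test double: yields `good_frames` frames, then reports a stall.
pub struct Scripted {
    pub source: &'static str,
    pub good_frames: u32,
}

impl Capturer for Scripted {
    fn next_frame(&mut self) -> Result<Frame, String> {
        if self.good_frames == 0 {
            return Err(format!("{} stalled", self.source));
        }
        self.good_frames -= 1;
        Ok(Frame { width: 1920, height: 1080, source: self.source })
    }
}
```

Because the wrapper implements the same trait as the capturers it owns, the stream loop would not need to know that a failover happened; it only sees frames whose geometry may differ.
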
---
## Recommended fix (staged)
> **Superseded — see [Resolution](#resolution-fixed-2026-06-25).** This was the original plan; the bug
> was fixed by the simpler **recover-or-drop** approach (host follows the OS resolution + open-time DDA
> failover), so Stage 1's composing capturer and Stage 2's protocol bump were not needed. Kept for context.
Defense-in-depth. Stages 0-1 are **host-only** (no driver rebuild, no protocol bump) and are the
fast, robust, user-visible fix. Stages 2-3 harden the fast path and need the driver re-vendor loop.
- **Stage 0 — Diagnostics first (land before anything else).**
  - `log.rs`: flush after each write (or keep a process-lifetime appender) and confirm worker-thread
    writes land. (S3)
  - Driver: in `publish()`, log/record the acquired surface's **actual format + width + height**
    even on the drop path, so a repro shows exactly what changed.
  - Host: replace the silent 20 s wait with a `tracing::warn!` at ~2 s of no fresh frame, including
    `driver_status`/`driver_status_detail` and the host's expected ring format/size.
  - Goal: the next Doom-launch repro definitively classifies the cause (format mismatch vs size
    mismatch vs `run_core` exit vs no-reassign).
- **Stage 1 — Mid-session fallback IDD-push → WGC/DDA (robust to ALL failure modes).** (P3)
  - Add a composing `Capturer` that owns the IDD-push capturer and, when it yields no fresh frame
    for a **short** window (~1.5 s, not 20 s), opens a DDA/WGC capturer on the same
    `WinCaptureTarget` and serves from it for the rest of the session (optionally probing the ring
    for recovery). Encoder follows the new format/size automatically (verified above).
  - This alone guarantees the session never goes permanently black again and makes Doom playable via
    WGC/DDA when the ring path is defeated, independent of the *why*.
  - Touch points: `capture.rs:334-356` (wire the composing capturer behind `PUNKTFUNK_IDD_PUSH`),
    `idd_push.rs` (expose a "stalled?" signal + shorten the deadline), reuse `dxgi.rs`/`wgc.rs`.
- **Stage 2 — Adaptive ring (makes the fast IDD-push path itself survive a game mode change).** (P1, P2)
  - Driver writes the **actual acquired-surface format + width + height** into new `SharedHeader`
    fields, in `publish()`, **even when about to drop the frame**.
  - Host watches those fields and, on any change vs the ring's current format/size, **recreates the
    ring at the new descriptor + bumps `generation`** (generalize `recreate_ring`/`poll_display_hdr`
    from "HDR toggled" to "descriptor changed"). Driver re-attaches via existing `is_stale()`.
  - Driver `publish()` gains a **width/height guard** alongside the format guard.
  - **Implications:** bump `pf_vdisplay_proto::PROTOCOL_VERSION` (host does a HARD version check in
    `pf_vdisplay.rs::mgr_ensure_device`), update the `const` size/offset asserts in
    `crates/pf-vdisplay-proto/src/frame.rs`, and deploy host + driver **in lockstep** (rebuild +
    re-sign + re-vendor `packaging/windows/pf-vdisplay/{dll,inf,cat}` on the RTX box, WUDFHost
    reload).
- **Stage 3 — Prevention (frequency reducer, not a standalone fix).** (reduces P1/P2 triggers)
  - Trim `monitor.rs::default_modes()` so the IDD advertises essentially only the negotiated mode, so
    a game can't pick a different fullscreen resolution. Verify it doesn't break mid-stream
    `Reconfigure`. Optionally re-assert the active mode after a detected mode change.
- **Stage S — Driver resilience (address S1/S2 once Stage 0 reveals if they fire).**
  - If logs show a permanent stall after `ACCESS_LOST`/SetDevice-give-up, add a re-arm path (e.g.
    delete the swap chain so the OS re-assigns, or signal `assign_swap_chain` to retry) and avoid
    installing a processor that has already failed `SetDevice`.
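
For the record, the Stage 2 header change (superseded, never implemented) would have looked roughly like the sketch below. The field order, the `acquired_*` names, the `format` field and the version number are all hypothetical; the real layout lives in `crates/pf-vdisplay-proto/src/frame.rs` and is not reproduced here. The point is only how `const` size/offset asserts pin a shared-memory layout at compile time.

```rust
use std::mem::{offset_of, size_of};

/// Hypothetical value; the real constant is `pf_vdisplay_proto::PROTOCOL_VERSION`.
pub const PROTOCOL_VERSION: u32 = 2;

/// Hypothetical layout of a shared header with the Stage 2 fields appended.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct SharedHeader {
    pub protocol_version: u32,
    pub generation: u32,
    // Written by the host: what the ring is sized and formatted for.
    pub width: u32,
    pub height: u32,
    pub format: u32,
    // Written by the driver today.
    pub driver_status: u32,
    pub driver_status_detail: u32,
    pub driver_render_luid_low: u32,
    pub driver_render_luid_high: i32,
    // Stage 2 additions: the surface the driver actually acquired (0 = none yet).
    pub acquired_format: u32,
    pub acquired_width: u32,
    pub acquired_height: u32,
}

// Compile-time layout checks: host and driver must agree byte for byte.
const _: () = assert!(size_of::<SharedHeader>() == 48);
const _: () = assert!(offset_of!(SharedHeader, acquired_format) == 36);

/// Host side: a hard version check, as the document describes.
pub fn version_ok(driver_version: u32) -> bool {
    driver_version == PROTOCOL_VERSION
}

/// Host side: recreate the ring when the driver reports a different descriptor.
pub fn needs_recreate(h: &SharedHeader) -> bool {
    h.acquired_format != 0
        && (h.acquired_format, h.acquired_width, h.acquired_height)
            != (h.format, h.width, h.height)
}
```

Appending fields changes `size_of` and therefore trips the assert until it is updated, which is what forces the lockstep host and driver deploy noted under Implications.
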
## Validation plan (RTX box `ssh "Enrico Bühler@192.168.1.158"`)
1. Deploy the Stage-0 host (+ driver if rebuilt); `punktfunk-host service stop/start`.
2. Connect a client, confirm normal stream. `type C:\Users\Public\pfvd-driver.log` to baseline.
3. Launch *Doom the Dark Ages* (or any fullscreen/HDR game). Capture: driver log + host service log
(find where the in-session `serve` logs land; `RUST_LOG=info`).
4. Read which mechanism fired (format/size/exit/no-reassign) from the Stage-0 diagnostics.
5. **Success:** game is visible, the stream survives the mode-set flash, no 20 s crash, reconnect
restores video. With Stage 1: the failover to WGC/DDA is logged and frames keep flowing. With
Stage 2: the ring recreates at the new descriptor and the fast path resumes.
## File map
| Area | Path |
|---|---|
| Host ring consumer | `crates/punktfunk-host/src/capture/idd_push.rs` |
| Capture selection / trait | `crates/punktfunk-host/src/capture.rs` |
| NVENC re-init (no change needed) | `crates/punktfunk-host/src/encode/nvenc.rs:564-618` |
| DDA / WGC capturers (failover targets) | `crates/punktfunk-host/src/capture/{dxgi,wgc}.rs` |
| Host monitor lifecycle / capture target | `crates/punktfunk-host/src/vdisplay/pf_vdisplay.rs` |
| Shared contract (Stage 2 fields + version) | `crates/pf-vdisplay-proto/src/{lib,frame}.rs` |
| Driver frame publisher (guards + reporting) | `packaging/windows/drivers/pf-vdisplay/src/frame_transport.rs` |
| Driver swap-chain lifecycle (S1/S2) | `packaging/windows/drivers/pf-vdisplay/src/swap_chain_processor.rs`, `callbacks.rs` |
| Driver logging (S3) | `packaging/windows/drivers/pf-vdisplay/src/log.rs` |
| Advertised modes (Stage 3) | `packaging/windows/drivers/pf-vdisplay/src/monitor.rs` (`default_modes`) |
| Vendored signed driver (Stage 2 re-vendor) | `packaging/windows/pf-vdisplay/{pf_vdisplay.dll,.inf,.cat}` |
## Notes / caveats
- Doc lag (unrelated to the fix, worth flagging): `stage-pf-vdisplay.ps1` / packaging comments still
reference the OLD `packaging/windows/vdisplay-driver/` tree; the active driver source is the NEW
`packaging/windows/drivers/pf-vdisplay/` tree (re-vendored in commit `a11b0dd`).
- The exact trigger (format vs resolution vs exclusive-flip vs processor-death) is **not yet proven
from logs** — Stage 0 exists to pin it. Stage 1 fixes the user-visible symptom regardless.