punktfunk/docs/windows-host-rewrite-game-capture-bug.md
enricobuehler 48202a0f89 docs(windows-rewrite): mark game-capture bug FIXED + bring rewrite status current (§15)
The fullscreen-game-breaks-IDD-push bug is FIXED by the resolution-listening
recovery (c87bfe0: the 250ms poll now follows the display's actual resolution
and recreates the ring on any descriptor change, recover-or-drop), backed by
open-time first-frame DDA failover (f98ab07) and the driver publish() width/
height guard + flushed logging (789ad49). No protocol bump was needed — the host
reads the real resolution straight from Windows (CCD/GDI), so the bug doc's
Stage-1 composing capturer + Stage-2 protocol bump were unnecessary. Bug doc
marked FIXED with a Resolution section; the staged plan kept as superseded record.

windows-host-rewrite.md: the progress log was stale (ended at "M1 cont."). Added
§15 Current status — the driver STEP 0-8 port landed on main on-glass HDR-
validated; the host was refactored *in place* via windows-host-goal1 (not the §10
greenfield rebuild); §2.5 ownership model resolved the swap-chain-reuse / monitor-
leak open item; iddcx + /INTEGRITYCHECK CI-green. Remaining: the secure-desktop
on-glass gate (the single biggest unproven claim), M4 gamepad-driver migration,
M5/M6 cleanup, and the pf-vdisplay slot-reclaim driver fix. Top Status flipped
proposed → largely implemented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 21:35:55 +00:00


pf-vdisplay: fullscreen game breaks video (IDD-push capture) — issue analysis

Status: FIXED (2026-06-25). Resolved by the resolution-listening recovery — see Resolution below. The investigation that follows is kept as the record of how it was diagnosed. Companion to windows-host-rewrite.md.

Resolution (fixed 2026-06-25)

The fix landed as the recover-or-drop design (host-only, no protocol bump), not the composing-capturer mid-session failover originally sketched in Recommended fix:

  • c87bfe0 — IDD-push recovers from a game mode-set (the "resolution-listening" work). The ring now tracks the display's actual mode. At open it is sized to the display's real resolution (new win_display::active_resolution, CCD/GDI). Mid-session the 250 ms poll — previously HDR-toggle-only — now also follows the active resolution; on any descriptor change (size or HDR) it recreates the ring at the new mode (recreate_ring generalized to a new size), the driver re-attaches via the existing is_stale() path, and frames resume at the game's mode. No freeze, no reconnect. If a change is genuinely unrecoverable (e.g. an exclusive flip the host can't follow) a recovering_since clock fires after 3 s and try_consume drops the session cleanly so the client reconnects, instead of freezing forever. A pure idle desktop (no mode change) never triggers it.
  • f98ab07 — open-time first-frame failover to DDA (GB1 pt 1). wait_for_attach now requires the driver to publish a first frame (not just DRV_STATUS_OPENED); a display the driver attaches to but whose frames its publish() guard rejects now fails open() within ~4 s → capture.rs falls back to DDA → the game is captured + visible after a reconnect. A normal/idle open (frame within ~1 s) is never false-failed, and DDA is itself a working path, so even a false positive degrades gracefully.
  • 789ad49 — driver publish() width/height guard + a process-lifetime flushed log appender (GB3 groundwork): drops a surface whose descriptor no longer matches the host ring (CopyResource needs matching dims too, else garbage) and logs the actual descriptor once per mismatch episode, so the swap-chain WORKER-thread lines land (closing the bug-doc S3 observability gap). Needs a driver rebuild + re-vendor to deploy (separate from the host-only GB1 fix).

Why this instead of the composing capturer (original Stage 1): the host reads the display's real resolution straight from Windows (CCD/GDI), so it doesn't need the driver to report it over a new SharedHeader field — the original Stage 2's protocol bump is unnecessary. In-place recovery keeps the fast IDD-push (zero-copy) path live through a game mode-set instead of permanently demoting to DDA; open-time DDA failover (f98ab07) covers the "display already in a broken mode at connect" case.

Deferred (non-blocking): Stage 3 (trim default_modes) — deprioritized (recovery handles mode-sets and trimming risks the live display-activation path); Stage S driver resilience (S1/S2) — gated on the 789ad49 logging once a fresh repro is captured. Owner-confirmed the resolution-listening recovery fixes the user-visible bug (2026-06-25).

Context

The all-Rust pf-vdisplay IddCx virtual-display driver (STEP 0-8 of the Windows host rewrite, now on main, on-glass-validated for plain desktop + HDR streaming) breaks when a fullscreen game runs on the stream.

Reproduction (RTX 4090 box 192.168.1.158): launch Doom the Dark Ages while streaming → the desktop image flashes (a display mode-set fired), the game is never visible, and disconnect + reconnect yields a black screen with working audio. (The box was rebooted afterward, so live logs from the incident are gone.)

Runtime config in play (C:\ProgramData\punktfunk\host.env):

  • PUNKTFUNK_IDD_PUSH=1 → capture comes from the driver's shared-memory frame ring, not DDA/WGC.
  • PUNKTFUNK_10BIT=1 (+ PUNKTFUNK_HDR_SHADER_P010=1) → HDR active; the ring is FP16.
  • PUNKTFUNK_MONITOR_LINGER_MS=0 → every (re)connect builds a fresh monitor + ring.
  • PUNKTFUNK_VDISPLAY=pf, PUNKTFUNK_ENCODER=nvenc, PUNKTFUNK_SECURE_DDA=1.

The driver log (C:\Users\Public\pfvd-driver.log) at inspection showed 8 fresh IddCxMonitorCreate/Arrival pairs (ids 1-8), all 0x0, and ZERO swap-chain-processor lines. Monitor creation is therefore healthy, and the break is entirely downstream of it (swap-chain drain / frame publish / host consume), exactly where a game-induced mode change lands.

Root cause (one sentence)

The IDD-push ring is created once at session start with a fixed format and fixed size derived from session-start state, there is no channel for the driver to report the actual acquired-surface descriptor back to the host, and there is no mid-session fallback — so when a game forces a format and/or resolution change on the virtual display, the driver silently drops every frame, the host never learns it needs to adapt, and the stream goes black and then hard-crashes.

How the symptom maps to the code

  1. Game launches → forces a mode set on the virtual display (the "desktop flash"). This changes the OS-composed surface's DXGI format and/or width/height, and triggers a swap-chain unassign→reassign in the driver.
  2. The driver's publish() copies the acquired surface into the host ring only if formats match exactly (desc.Format u32 compare) — and CopyResource also silently requires identical dimensions, which is never checked. → every frame dropped.
  3. The host's only ring-recreate trigger is polling Windows' HDR-enabled toggle. A game-driven format/size change it can't observe → host never recreates the ring → driver re-attaches to the same mismatched ring → keeps dropping.
  4. With PUNKTFUNK_IDD_PUSH=1 set, the ring is the sole capture source (no DDA/WGC fallback). next_frame() repeats the last good frame, then bail!s after a 20 s deadline → the stream dies.
  5. Reconnect stays black because the game is still holding the display in the changed state; the fresh ring is rebuilt at the session-negotiated format/size again and re-mismatches. Audio is a fully independent plane, so it survives — matching "black + audio."

Identified issues

Primary

P1 — IDD-push ring format is fixed at session start; host can't observe a game-driven format change.

  • Host picks the ring format once: FP16 (DXGI_FORMAT_R16G16B16A16_FLOAT) if advanced_color_enabled(target_id) else DXGI_FORMAT_B8G8R8A8_UNORM. crates/punktfunk-host/src/capture/idd_push.rs:340-361
  • Driver drops any frame whose desc.Format ≠ the ring format, silently. packaging/windows/drivers/pf-vdisplay/src/frame_transport.rs:281-286
  • Host recreates the ring only on a Windows HDR-toggle poll (250 ms), never on a format change it can't see. idd_push.rs:619-640 (poll_display_hdr, with recreate_ring at :582-617).
  • Driver re-attaches on a host generation bump (is_stale), but nothing bumps it for this case. frame_transport.rs:259-270.
  • No SharedHeader field carries the driver's actual acquired-surface format — the driver only writes driver_status, driver_status_detail, driver_render_luid_low/high back.

P2 — IDD-push ring size is fixed at session start; a resolution change is never detected.

  • header.width/height written once at idd_push.rs:396-397; ring slots sized once and never resized; consumed frames always report the session size (idd_push.rs:744-745).
  • publish() guards format only, not width/height (frame_transport.rs:284). CopyResource requires identical dimensions, so a resolution change → silent no-op/garbage, no error logged.
  • Driver never reports the acquired surface's real width/height to the host.
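To make P1 and P2 concrete, here is a minimal sketch of a publish-side guard that compares the full descriptor (format, width and height) and records the actual descriptor once per mismatch episode. It is an illustration under assumptions, not the code in frame_transport.rs: `SurfaceDesc`, `PublishGuard` and `admit` are hypothetical names, and the log is modelled as a plain `Vec<String>`. The two format constants are the documented DXGI_FORMAT enum values.

```rust
/// DXGI_FORMAT_R16G16B16A16_FLOAT and DXGI_FORMAT_B8G8R8A8_UNORM enum values.
pub const FORMAT_FP16: u32 = 10;
pub const FORMAT_BGRA8: u32 = 87;

/// Hypothetical subset of a texture descriptor that the copy depends on.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SurfaceDesc {
    pub format: u32,
    pub width: u32,
    pub height: u32,
}

pub struct PublishGuard {
    ring: SurfaceDesc,
    last_mismatch: Option<SurfaceDesc>,
}

impl PublishGuard {
    pub fn new(ring: SurfaceDesc) -> Self {
        Self { ring, last_mismatch: None }
    }

    /// Returns true when the acquired surface may be copied into the ring.
    /// On a mismatch it returns false and logs the descriptor, but only the
    /// first time a given mismatching descriptor is seen in a row.
    pub fn admit(&mut self, acquired: SurfaceDesc, log: &mut Vec<String>) -> bool {
        if acquired == self.ring {
            self.last_mismatch = None; // episode over
            return true;
        }
        if self.last_mismatch != Some(acquired) {
            log.push(format!(
                "publish: drop, surface fmt={} {}x{} != ring fmt={} {}x{}",
                acquired.format, acquired.width, acquired.height,
                self.ring.format, self.ring.width, self.ring.height
            ));
            self.last_mismatch = Some(acquired);
        }
        false
    }
}
```

Checking width and height alongside the format matters because a copy between mismatched textures does not report an error to the caller, so without the guard a resolution change is invisible in the log.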

P3 — No mid-session capture fallback; a 20 s hard crash instead of degrade.

  • PUNKTFUNK_IDD_PUSH=1 returns the IDD-push capturer early with the keepalive moved into it — no fall-through. crates/punktfunk-host/src/capture.rs:348-356.
  • next_frame() waits on the frame-ready event (16 ms), repeats the last frame, and bail!s after a 20 s deadline → the encode loop tears the session down. idd_push.rs:819-847.
  • The WGC→DDA fallback that exists (capture.rs:389-404) is open-time only and on the non-IDD-push path; it does not help here.
  • The VirtualOutput already carries a WinCaptureTarget { adapter_luid, gdi_name, target_id } (vdisplay/pf_vdisplay.rs Monitor::target()), so a DDA/WGC capturer can be opened on the same virtual output — the wiring just doesn't exist for IDD-push.

Secondary (verify during the fix; not the proven primary cause)

S1 — Driver run_core exits permanently on a swap-chain error, with no clear re-arm.

  • On a ReleaseAndAcquireBuffer2 error (e.g. DXGI_ERROR_ACCESS_LOST when a game grabs the display), run_core breaks and returns; the worker exits and deletes the swap-chain object. packaging/windows/drivers/pf-vdisplay/src/swap_chain_processor.rs:359-362 (+ delete at :141-143).
  • A mode change drives unassign→assign which does respawn a fresh processor (callbacks.rs:309-318, :249-305), so a clean mode change recovers. Open question: whether the OS reliably re-assigns after a bare ACCESS_LOST exit (no unassign), or whether the monitor stalls with a dead-but-installed processor. Confirm against the IddCx contract / upstream virtual-display-rs. The standard IddCx model expects the OS to re-assign, but this needs proof.

S2 — IddCxSwapChainSetDevice give-up leaves a dead-but-installed processor.

  • assign_swap_chain returns STATUS_SUCCESS and installs the processor before the worker's SetDevice retries run; if all 60 retries (≈3 s) fail during a mode flap, the worker returns and the processor is dead, but the OS believes the swap chain is assigned → potential permanent stall. swap_chain_processor.rs:191-226, callbacks.rs:279-293.

S3 — Driver worker-thread diagnostics are not landing (impairs root-causing).

  • dbglog! (log.rs) opens, appends to, and closes the file on every call with no explicit flush, and the observed log had only control-plane (IOCTL-thread) lines, no swap-chain-processor lines. packaging/windows/drivers/pf-vdisplay/src/log.rs:9-22.
  • Whatever the exact reason (write race / token / interleave), the practical effect is the swap-chain processor's behavior during the break is invisible, which is why the cause can't be pinned from logs alone today. Fix this first so the next repro is conclusive.
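One way to address S3 is a single appender that lives for the whole process and is shared by all threads. The sketch below is an assumption about the shape of such an appender, not the driver's log.rs: `FlushedLog`, `open` and `line` are hypothetical names, and it uses only the Rust standard library.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::sync::Mutex;

/// Hypothetical process-lifetime log appender shared by all threads.
pub struct FlushedLog {
    file: Mutex<File>,
}

impl FlushedLog {
    /// Opens (or creates) the log once, in append mode.
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file: Mutex::new(file) })
    }

    /// Writes one complete line under the lock, then flushes.
    pub fn line(&self, msg: &str) -> io::Result<()> {
        // A poisoned lock only means another thread panicked while logging;
        // keep logging rather than losing diagnostics.
        let mut f = self.file.lock().unwrap_or_else(|e| e.into_inner());
        // Build the whole line first so it reaches the file in one write.
        f.write_all(format!("{}\n", msg).as_bytes())?;
        // std's File is unbuffered, so this is cheap; File::sync_data would
        // be the stronger (and slower) choice if durability is the concern.
        f.flush()
    }
}
```

Holding one handle behind a mutex removes the per-call open/close and serializes writers, so lines from a worker thread and the IOCTL thread cannot interleave mid-line.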

Verified facts that de-risk the fix

  • The encoder already adapts to a mid-session size/format change. encode/nvenc.rs:580-618: submit detects size_changed/hdr_changed/device change per frame, tears down, and re-inits adopting the new frame's geometry + pixel format. So a capturer that changes resolution/format mid-session is handled downstream — no encoder API change is needed for either fix direction.
  • The stream loop relays per-frame geometry. CapturedFrame carries width/height/format (capture.rs:50-57); the loop reads pipeline_depth() live and forwards whatever try_latest() returns.
  • WGC and DDA emit the same pixel formats the IDD-push path emits (Bgra / Rgb10a2), so a failover capturer feeds the encoder compatible frames.
  • A failover capturer fits the existing Capturer trait (next_frame + try_latest, capture.rs:120-155) — a composing capturer that owns the ring capturer + a lazily-opened WGC/DDA capturer and switches between them is a clean drop-in.
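The composing capturer mentioned in the last bullet was ultimately not needed, but its shape is easy to sketch. Everything below is hypothetical: the `Capturer` trait is a simplified stand-in (one method, with the clock passed in for determinism), and `Frame`, `Failover`, `StallingRing` and `FakeDda` are illustration-only names, not types from capture.rs.

```rust
use std::time::{Duration, Instant};

/// Hypothetical frame type; the real CapturedFrame carries more.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Frame {
    pub width: u32,
    pub height: u32,
    pub source: &'static str,
}

/// Hypothetical, simplified stand-in for the Capturer trait.
pub trait Capturer {
    /// Returns a fresh frame, or None if nothing new has arrived.
    fn try_latest(&mut self, now: Instant) -> Option<Frame>;
}

/// Serves from `primary` until it stalls, then lazily opens a fallback
/// and serves from it for the rest of the session.
pub struct Failover<P, F, O> {
    primary: P,
    open_fallback: O,
    fallback: Option<F>,
    last_fresh: Instant,
    stall: Duration,
}

impl<P: Capturer, F: Capturer, O: FnMut() -> F> Failover<P, F, O> {
    pub fn new(primary: P, open_fallback: O, stall: Duration, now: Instant) -> Self {
        Self { primary, open_fallback, fallback: None, last_fresh: now, stall }
    }

    pub fn on_fallback(&self) -> bool {
        self.fallback.is_some()
    }
}

impl<P: Capturer, F: Capturer, O: FnMut() -> F> Capturer for Failover<P, F, O> {
    fn try_latest(&mut self, now: Instant) -> Option<Frame> {
        if let Some(fb) = self.fallback.as_mut() {
            return fb.try_latest(now);
        }
        if let Some(frame) = self.primary.try_latest(now) {
            self.last_fresh = now;
            return Some(frame);
        }
        if now.duration_since(self.last_fresh) >= self.stall {
            // Primary has been silent too long: switch over.
            let mut fb = (self.open_fallback)();
            let frame = fb.try_latest(now);
            self.fallback = Some(fb);
            return frame;
        }
        None
    }
}

/// Demo stand-in for a ring that stops producing frames.
pub struct StallingRing {
    pub frames_left: u32,
}

impl Capturer for StallingRing {
    fn try_latest(&mut self, _now: Instant) -> Option<Frame> {
        if self.frames_left == 0 {
            return None;
        }
        self.frames_left -= 1;
        Some(Frame { width: 3840, height: 2160, source: "ring" })
    }
}

/// Demo stand-in for a DDA capturer that always has a frame.
pub struct FakeDda;

impl Capturer for FakeDda {
    fn try_latest(&mut self, _now: Instant) -> Option<Frame> {
        Some(Frame { width: 2560, height: 1440, source: "dda" })
    }
}
```

Because the wrapper implements the same trait as the capturers it owns, the caller cannot tell which source is active except through the per-frame geometry, which is the property the encoder and stream loop already rely on.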

Recommended fix

Superseded: see Resolution. This was the original plan; the bug was fixed by the simpler recover-or-drop approach (host follows the OS resolution + open-time DDA failover), so Stage 1's composing capturer and Stage 2's protocol bump were not needed. Kept for context.

Defense-in-depth. Stages 0-1 are host-only (no driver rebuild, no protocol bump) and are the fast, robust, user-visible fix. Stages 2-3 harden the fast path and need the driver re-vendor loop.

  • Stage 0 — Diagnostics first (land before anything else).

    • log.rs: flush after each write (or keep a process-lifetime appender) and confirm worker-thread writes land. (S3)
    • Driver: in publish(), log/record the acquired surface's actual format + width + height even on the drop path, so a repro shows exactly what changed.
    • Host: replace the silent 20 s wait with a tracing::warn! at ~2 s of no fresh frame, including driver_status/driver_status_detail and the host's expected ring format/size.
    • Goal: the next Doom-launch repro definitively classifies the cause (format mismatch vs size mismatch vs run_core exit vs no-reassign).
  • Stage 1 — Mid-session fallback IDD-push → WGC/DDA (robust to ALL failure modes). (P3)

    • Add a composing Capturer that owns the IDD-push capturer and, when it yields no fresh frame for a short window (~1.5 s, not 20 s), opens a DDA/WGC capturer on the same WinCaptureTarget and serves from it for the rest of the session (optionally probing the ring for recovery). Encoder follows the new format/size automatically (verified above).
    • This alone guarantees the session never goes permanently black again and makes Doom playable via WGC/DDA when the ring path is defeated — independent of the why.
    • Touch points: capture.rs:334-356 (wire the composing capturer behind PUNKTFUNK_IDD_PUSH), idd_push.rs (expose a "stalled?" signal + shorten the deadline), reuse dxgi.rs/wgc.rs.
  • Stage 2 — Adaptive ring (makes the fast IDD-push path itself survive a game mode change). (P1, P2)

    • Driver writes the actual acquired-surface format + width + height into new SharedHeader fields, in publish(), even when about to drop the frame.
    • Host watches those fields and, on any change vs the ring's current format/size, recreates the ring at the new descriptor + bumps generation (generalize recreate_ring/poll_display_hdr from "HDR toggled" to "descriptor changed"). Driver re-attaches via existing is_stale().
    • Driver publish() gains a width/height guard alongside the format guard.
    • Implications: bump pf_vdisplay_proto::PROTOCOL_VERSION (host does a HARD version check in pf_vdisplay.rs::mgr_ensure_device), update the const size/offset asserts in crates/pf-vdisplay-proto/src/frame.rs, and deploy host + driver in lockstep (rebuild + re-sign + re-vendor packaging/windows/pf-vdisplay/{dll,inf,cat} on the RTX box, WUDFHost reload).
  • Stage 3 — Prevention (frequency reducer, not a standalone fix). (reduces P1/P2 triggers)

    • Trim monitor.rs::default_modes() so the IDD advertises essentially only the negotiated mode, so a game can't pick a different fullscreen resolution. Verify it doesn't break mid-stream Reconfigure. Optionally re-assert the active mode after a detected mode change.
  • Stage S — Driver resilience (address S1/S2 once Stage 0 reveals if they fire).

    • If logs show a permanent stall after ACCESS_LOST/SetDevice-give-up, add a re-arm path (e.g. delete the swap chain so the OS re-assigns, or signal assign_swap_chain to retry) and avoid installing a processor that has already failed SetDevice.
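For the record, the Stage 2 implications (new header fields, a version bump, and size asserts) would have looked roughly like the sketch below. The layout is entirely hypothetical: the field names, their order, the version number and the 32-byte size are invented for illustration and do not describe the real pf_vdisplay_proto SharedHeader.

```rust
/// Hypothetical header layout. Only the idea of appending three
/// "acquired surface" fields comes from the Stage 2 description.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct SharedHeaderSketch {
    pub protocol_version: u32,
    pub generation: u32,
    // What the host sized the ring for.
    pub width: u32,
    pub height: u32,
    pub format: u32,
    // Stage 2 additions: what the driver actually acquired.
    pub acquired_format: u32,
    pub acquired_width: u32,
    pub acquired_height: u32,
}

/// Hypothetical version number after the bump.
pub const PROTOCOL_VERSION: u32 = 2;

// Compile-time layout check: adding a field without updating this
// constant breaks the build instead of corrupting shared memory.
const _: () = assert!(std::mem::size_of::<SharedHeaderSketch>() == 32);

/// Hard version check: host and driver must agree exactly.
pub fn check_version(h: &SharedHeaderSketch) -> Result<(), String> {
    if h.protocol_version == PROTOCOL_VERSION {
        Ok(())
    } else {
        Err(format!(
            "protocol mismatch: header v{}, host v{}",
            h.protocol_version, PROTOCOL_VERSION
        ))
    }
}

/// True once the driver has reported a descriptor that differs from the ring's.
pub fn needs_recreate(h: &SharedHeaderSketch) -> bool {
    let reported = h.acquired_width != 0 && h.acquired_height != 0;
    reported
        && (h.acquired_format, h.acquired_width, h.acquired_height)
            != (h.format, h.width, h.height)
}
```

The exact-match version check is what forces the lockstep host + driver deployment listed under Implications, and is the cost the shipped fix avoided by reading the mode from Windows instead.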

Validation plan (RTX box ssh "Enrico Bühler@192.168.1.158")

  1. Deploy the Stage-0 host (+ driver if rebuilt); punktfunk-host service stop/start.
  2. Connect a client, confirm normal stream. type C:\Users\Public\pfvd-driver.log to baseline.
  3. Launch Doom the Dark Ages (or any fullscreen/HDR game). Capture: driver log + host service log (find where the in-session serve logs land; RUST_LOG=info).
  4. Read which mechanism fired (format/size/exit/no-reassign) from the Stage-0 diagnostics.
  5. Success: game is visible, the stream survives the mode-set flash, no 20 s crash, reconnect restores video. With Stage 1: the failover to WGC/DDA is logged and frames keep flowing. With Stage 2: the ring recreates at the new descriptor and the fast path resumes.

File map

| Area | Path |
| --- | --- |
| Host ring consumer | crates/punktfunk-host/src/capture/idd_push.rs |
| Capture selection / trait | crates/punktfunk-host/src/capture.rs |
| NVENC re-init (no change needed) | crates/punktfunk-host/src/encode/nvenc.rs:564-618 |
| DDA / WGC capturers (failover targets) | crates/punktfunk-host/src/capture/{dxgi,wgc}.rs |
| Host monitor lifecycle / capture target | crates/punktfunk-host/src/vdisplay/pf_vdisplay.rs |
| Shared contract (Stage 2 fields + version) | crates/pf-vdisplay-proto/src/{lib,frame}.rs |
| Driver frame publisher (guards + reporting) | packaging/windows/drivers/pf-vdisplay/src/frame_transport.rs |
| Driver swap-chain lifecycle (S1/S2) | packaging/windows/drivers/pf-vdisplay/src/swap_chain_processor.rs, callbacks.rs |
| Driver logging (S3) | packaging/windows/drivers/pf-vdisplay/src/log.rs |
| Advertised modes (Stage 3) | packaging/windows/drivers/pf-vdisplay/src/monitor.rs (default_modes) |
| Vendored signed driver (Stage 2 re-vendor) | packaging/windows/pf-vdisplay/{pf_vdisplay.dll,.inf,.cat} |

Notes / caveats

  • Doc lag (unrelated to the fix, worth flagging): stage-pf-vdisplay.ps1 / packaging comments still reference the OLD packaging/windows/vdisplay-driver/ tree; the active driver source is the NEW packaging/windows/drivers/pf-vdisplay/ tree (re-vendored in commit a11b0dd).
  • The exact trigger (format vs resolution vs exclusive-flip vs processor-death) is not yet proven from logs — Stage 0 exists to pin it. Stage 1 fixes the user-visible symptom regardless.