punktfunk/design/gamepad-driver-health.md
enricobuehler efb1ba26d7 fix(windows): opt-in pad-driver file logs + size-capped service log rotation
Two disk-write fixes:

- pf-xusb/pf-dualsense no longer write C:\Users\Public\pf*-driver.log
  unconditionally — the file log is now opt-in (debug builds, or the
  PFXUSB_DEBUG_LOG / PFDS_DEBUG_LOG system env var), mirroring the audit-§4.4
  fix pf-vdisplay already got: a release driver never writes the world-writable
  Public file (info-leak/DoS surface), and the per-report OUTPUT/SET_STATE hex
  dumps stop being a sustained per-rumble disk-write path during gameplay.
  OutputDebugStringA stays unconditional; the host's driver-silence WARN and
  the gamepad-driver-health failure-mode table now say the log is opt-in.

- service.log/host.log get one-generation rotation: at each (re)open a file
  over 10 MB is renamed to .old, so a crash-restart loop or a RUST_LOG=debug
  left in host.env can't grow the append-forever logs without bound. Rotation
  runs only before an open (never under a live appender — host.log's handle
  lacks FILE_SHARE_DELETE, so a racing rename harmlessly fails).

Windows CI compile/clippy pending (drivers workspace + host are not
Linux-cross-checkable); rides along with the next pad-driver redeploy.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:03:32 +00:00


# Windows gamepad-driver health: failure modes and how each is surfaced
Written for the "host doesn't see the client's gamepad" class of bug report (2026-07-02). The
Windows virtual pads have many silent ways to fail: the stack spans the host process, a named
shared-memory section, a PnP software devnode, a UMDF driver in its own WUDFHost.exe, and finally
the game's input API (XInput / HID / SDL). Before this work the host logged only its *own* create
calls — a pad could "exist" with no driver installed and nothing was ever logged. This document
enumerates the failure modes and states, for each, how it is now detected and what the log line
says. All host lines land in stderr / `%ProgramData%\punktfunk\logs\host.log` (service) **and** the
in-memory ring served at `GET /api/v1/logs` → the web console **Logs** page.
## The health signals
The gamepad drivers have no IOCTL plane (`hidclass` gates the device stack), so the only
cross-process channel is the shared section itself. Two fields were carved out of reserved space.
The change is layout-compatible: old drivers simply never write the new fields, and
`pf-driver-proto` pins the offsets:
| field | XusbShm | PadShm | writer | meaning |
|---|---|---|---|---|
| `driver_proto` | @32 | @144 | driver | `GAMEPAD_PROTO_VERSION` once attached; `0` = no driver on this section |
| `driver_heartbeat` | @36 | @148 | driver | XUSB: +1 per serviced XInput IOCTL (game-visible path). DS/DS4: +1 per ~8 ms timer tick (liveness) |
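
The offsets in the table can be pinned at compile time. The following is a minimal sketch, not the real `pf-driver-proto` definitions: the struct names, the opaque `pre_health` padding, and the `read_health` helper are hypothetical, and only the two field names and their offsets come from the table. It also assumes little-endian `u32` fields and works on a byte snapshot; reading a live shared section would need volatile or atomic access.

```rust
use core::mem::offset_of;

/// Hypothetical stand-in for the XUSB section. Everything before the health
/// fields is treated as opaque bytes because its real layout is not shown here.
#[repr(C)]
pub struct XusbShmSketch {
    pub pre_health: [u8; 32],
    pub driver_proto: u32,     // @32: 0 = no driver on this section
    pub driver_heartbeat: u32, // @36: +1 per serviced XInput IOCTL
}

/// Hypothetical stand-in for the DS/DS4 section.
#[repr(C)]
pub struct PadShmSketch {
    pub pre_health: [u8; 144],
    pub driver_proto: u32,     // @144
    pub driver_heartbeat: u32, // @148: +1 per ~8 ms timer tick
}

// Compile-time pins: the build fails if a field ever moves.
const _: () = assert!(offset_of!(XusbShmSketch, driver_proto) == 32);
const _: () = assert!(offset_of!(XusbShmSketch, driver_heartbeat) == 36);
const _: () = assert!(offset_of!(PadShmSketch, driver_proto) == 144);
const _: () = assert!(offset_of!(PadShmSketch, driver_heartbeat) == 148);

/// Reads `(driver_proto, driver_heartbeat)` from a byte snapshot of a section.
/// `proto_offset` is 32 for the XUSB layout and 144 for the DS/DS4 layout.
/// Returns `None` if the snapshot is too short to hold both fields.
pub fn read_health(snapshot: &[u8], proto_offset: usize) -> Option<(u32, u32)> {
    let word = |at: usize| -> Option<u32> {
        let b = snapshot.get(at..at + 4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    };
    Some((word(proto_offset)?, word(proto_offset + 4)?))
}
```
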
On the host side, every pad owns a `DriverAttach` watcher (`inject/windows/gamepad_raii.rs`), fed
from the existing `service()` poll. It is a small state machine in which each transition logs
exactly once:
- `driver_proto != 0` → INFO `gamepad driver attached to the shared section` (with `late=true` if
it came after the warning); WARN on a proto/host version mismatch.
- 3 s of silence → one diagnosis WARN combining: **driver-store check** (`pnputil /enum-drivers`,
cached once per process, only run on the failure path), **devnode PnP status** (`CM_Locate_DevNodeW`
+ `CM_Get_DevNode_Status` on the instance id captured from the SwDeviceCreate callback, with a
plain-language hint per CM problem code), and the driver's own debug log path.
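
The transitions above can be sketched as a poll-driven state machine. This is an illustration under assumptions, not the code in `gamepad_raii.rs`: the type and variant names (`AttachWatcher`, `AttachEvent`, `Phase`) are hypothetical, and the 3 s window is assumed to start at pad creation. The caller would map each returned event to the matching INFO or WARN line.

```rust
use std::time::{Duration, Instant};

/// What the caller should log; returned only when a transition happens.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttachEvent {
    /// `driver_proto` became non-zero. `late` is true if the silence
    /// warning had already fired; `version_mismatch` asks for a WARN.
    Attached { late: bool, version_mismatch: bool },
    /// No driver wrote `driver_proto` within the silence window.
    SilenceTimeout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Waiting,
    Warned,
    Attached,
}

/// Hypothetical watcher, one per pad.
pub struct AttachWatcher {
    created: Instant,
    phase: Phase,
    host_proto: u32,
}

impl AttachWatcher {
    /// Silence window before the diagnosis WARN (3 s per the text above).
    pub const SILENCE: Duration = Duration::from_secs(3);

    pub fn new(created: Instant, host_proto: u32) -> Self {
        Self { created, phase: Phase::Waiting, host_proto }
    }

    /// Feed the current `driver_proto` value on every poll.
    /// Each transition is reported exactly once; otherwise `None`.
    pub fn poll(&mut self, driver_proto: u32, now: Instant) -> Option<AttachEvent> {
        match self.phase {
            // Terminal state in this sketch: nothing further to report.
            Phase::Attached => None,
            // Attach wins from either Waiting or Warned.
            _ if driver_proto != 0 => {
                let late = self.phase == Phase::Warned;
                self.phase = Phase::Attached;
                Some(AttachEvent::Attached {
                    late,
                    version_mismatch: driver_proto != self.host_proto,
                })
            }
            // First time the silence window elapses: warn once.
            Phase::Waiting if now.duration_since(self.created) >= Self::SILENCE => {
                self.phase = Phase::Warned;
                Some(AttachEvent::SilenceTimeout)
            }
            _ => None,
        }
    }
}
```

Passing `now` in, rather than calling `Instant::now()` inside `poll`, keeps the sketch testable without sleeping.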
## Failure modes
| # | failure | cause examples | detection | surfaced as |
|---|---|---|---|---|
| 1 | Driver package not installed | fresh box, installer's `driver install --gamepad` skipped/failed, package pruned | attach timeout → `pnputil /enum-drivers` misses `pf_xusb.inf`/`pf_dualsense.inf` | WARN `driver package NOT in the driver store — run: punktfunk-host.exe driver install --gamepad` |
| 2 | Package present but binding failed | certificate not in Root/TrustedPublisher, Memory Integrity (HVCI) rejects it, stale DriverVer kept the old binary | attach timeout → devnode problem code (28 = drivers not installed, 52 = signature rejected, 31/39 = load failure) | WARN with the CM problem code + hint |
| 3 | Driver bound but crashed / never started | WUDFHost crash, `WdfDeviceCreate`/queue failure inside the driver | attach timeout → devnode status shows `driver_loaded`/`started` flags; the driver's own log (`C:\Users\Public\pf*-driver.log`) has the failing WDF call — opt-in like pf-vdisplay's (debug builds, or `PFXUSB_DEBUG_LOG`/`PFDS_DEBUG_LOG` set system-wide + device restart) | WARN referencing both |
| 4 | `SwDeviceCreate` fails outright | not Administrator/SYSTEM, PnP wedged, `_` in enumerator (E_INVALIDARG) | existing error path (unchanged) | WARN `SwDeviceCreate failed; … devnode unavailable`, pad continues on the out-of-band fallback |
| 5 | `SwDeviceCreate` callback never fires | PnP service hung | **was silently mis-read as success** (zero-init `HRESULT(0)` + ignored `WaitForSingleObject` return). Fixed: `result` inits to `E_FAIL`, the wait result is checked | ERROR `enumeration callback never fired (10s) — PnP may be wedged` |
| 6 | Driver attached, then WUDFHost died mid-session | crash, killed | `driver_heartbeat` freezes (DS/DS4: timer-driven, so a freeze is conclusive; XUSB: only advances while a game polls, so absence is *not* an error) | field exists for a future stall check; not auto-warned yet (XUSB semantics make a generic rule false-positive-prone) |
| 7 | Version skew host↔driver | new host + old installed driver (or vice versa) | `driver_proto` ≠ host's `GAMEPAD_PROTO_VERSION`; pre-health drivers read as never-attached | WARN `driver/host protocol mismatch — update the drivers` (mismatch) / the mode-1 diagnosis text notes the pre-health case |
| 8 | Whole backend latched off | first pad creation failed → `broken` latch disables pads for the session | existing behaviour, now with remedy text | ERROR `…controller input disabled until the next client connect (install/repair: punktfunk-host.exe driver install --gamepad)` |
| 9 | Section created but game can't see the pad | XInput slot ordering, HidHide-style HID filters on the game process, RPCS3 pad-handler config, GameInput's instance-path VID/PID parse | **not host-detectable** — outside our process and the driver's stack. XUSB `driver_heartbeat` advancing proves "some XInput client polls us", which brackets the problem to the game's side | diagnosis text points at the driver log; the client-side controller view (Android "Connected controllers") covers the other end of the chain |
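
For mode 2, the mapping from CM problem code to a plain-language hint could look like the sketch below. The four codes and their meanings are the ones listed in the table; the function name and the exact hint wording are hypothetical, and the remedies are inferred from the cause examples in that row rather than taken from the host's actual log text.

```rust
/// Returns a plain-language hint for a devnode problem code.
/// Hypothetical helper: only the code-to-meaning pairs come from the table.
pub fn problem_hint(code: u32) -> &'static str {
    match code {
        // 28 = drivers not installed
        28 => "drivers not installed for this devnode; reinstall the gamepad driver package",
        // 52 = signature rejected
        52 => "driver signature rejected; check the certificate stores and Memory Integrity (HVCI)",
        // 31 / 39 = load failure
        31 | 39 => "driver failed to load; check the driver's own debug log",
        // Anything else: report the raw code rather than guess.
        _ => "unrecognised problem code; see the raw code and devnode status flags",
    }
}

/// Formats the fragment of the diagnosis line that carries the code and hint.
pub fn describe_problem(code: u32) -> String {
    format!("devnode problem code {}: {}", code, problem_hint(code))
}
```

Keeping the fallback arm generic means an unexpected code still reaches the log with its number intact.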
## What deliberately did NOT change
- The `broken` latch stays one-way per session (retry loops against a missing driver would spam
PnP); the log line now says so and gives the remedy.
- No mgmt-API health endpoint yet — the log ring is the surfacing channel. If the web console ever
grows a "gamepad health" card, `DriverAttach` is the state to expose.
- The DS/DS4 heartbeat is not yet watched for mid-session stalls (mode 6): worth adding once the
XUSB/DS semantics split is encoded (DS freeze = conclusive, XUSB freeze = normal when no game
polls).
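
One way to encode the XUSB/DS semantics split for a future stall check is sketched below. Nothing like this exists in the host yet; the names (`PadKind`, `StallVerdict`, `classify_heartbeat`) and the `grace` threshold are hypothetical. The only behaviour taken from this document is that a frozen DS/DS4 heartbeat is conclusive while a frozen XUSB heartbeat is not.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PadKind {
    /// Heartbeat advances only while a game polls XInput.
    Xusb,
    /// DS/DS4: heartbeat advances on a ~8 ms driver timer.
    TimerDriven,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StallVerdict {
    /// The counter moved since the last sample.
    Alive,
    /// Timer-driven counter frozen for at least `grace`: the driver is gone.
    Stalled,
    /// Frozen, but that proves nothing (XUSB with no polling game, or a
    /// timer-driven pad that has not been frozen long enough yet).
    Inconclusive,
}

/// Classifies one heartbeat observation.
/// `prev`/`current` are two samples of `driver_heartbeat`; `frozen_for` is
/// how long the value has been unchanged; `grace` is an assumed threshold
/// that should be many timer ticks long to tolerate scheduling jitter.
pub fn classify_heartbeat(
    kind: PadKind,
    prev: u32,
    current: u32,
    frozen_for: Duration,
    grace: Duration,
) -> StallVerdict {
    // Any change counts as progress, including wrap-around of the counter.
    if current != prev {
        return StallVerdict::Alive;
    }
    match kind {
        PadKind::TimerDriven if frozen_for >= grace => StallVerdict::Stalled,
        _ => StallVerdict::Inconclusive,
    }
}
```

Only the `Stalled` verdict would justify a WARN; `Inconclusive` stays silent, which avoids the false positives described for XUSB.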
## Validation status
- Linux-side: workspace build/tests/clippy green; `pf-driver-proto` layout asserts pin the new
offsets (compile-time).
- Windows host code + both drivers: compile-checked in CI only (this box cannot cross-build the
native deps); **not yet on-box validated**. On-box test recipe: stop the service, `pnputil
/delete-driver` the gamepad package, connect a client with a pad → expect the mode-1 WARN in the
console Logs page within ~3 s of the pad arriving; reinstall drivers → expect the
`attached (late=true)` INFO on the next session.