feat(host): HDR Vulkan layer so Vulkan games get HDR on the virtual display

NVIDIA/AMD Vulkan ICDs refuse to *advertise* an HDR color space for a surface on an
IddCx indirect/virtual display, so Vulkan games (Doom: The Dark Ages, id Tech, Indiana
Jones, …) report "device does not support HDR" — even though Windows HDR, DWM compose,
and the client PQ stream all work, and the ICD happily *accepts + presents* a forced HDR
swapchain there. The whole gap is enumeration; the community (Apollo/Sunshine/VDD) wrote
this off as kernel-side / unfixable.

Add VK_LAYER_PUNKTFUNK_hdr_inject (packaging/windows/pf-vkhdr-layer/): a standalone
cdylib Vulkan implicit layer that appends {A2B10G10R10, HDR10_ST2084} + {RGBA16F, scRGB}
to vkGetPhysicalDeviceSurfaceFormats[2]KHR (no need to hook vkCreateSwapchainKHR — the
ICD doesn't validate the color space there). Self-gated on the surface monitor's actual
advanced-color state (DisplayConfig GET_ADVANCED_COLOR_INFO), so it is a complete no-op
on SDR sessions and real monitors (dedup). Always-on (registry-discovered) so it works
regardless of how a game is launched — env-scoping silently fails for already-running
Steam. Escape hatches: DISABLE_PF_VKHDR, PF_VKHDR_EXCLUDE, and a built-in kernel-anti-
cheat denylist.
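
A minimal sketch of the append-with-dedup step described above, written against plain integers rather than a Vulkan binding. `SurfaceFormat`, `HDR_PAIRS`, and `inject_hdr_formats` are illustrative names, not the layer's actual code; the `hdr_active` flag stands in for the advanced-color check, and the choice of `A2B10G10R10_UNORM_PACK32` and `EXTENDED_SRGB_LINEAR` for "A2B10G10R10" and "scRGB" is an assumption.

```rust
/// Stand-in for `VkSurfaceFormatKHR`, holding raw enum values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SurfaceFormat {
    pub format: u32,
    pub color_space: u32,
}

// Numeric values as defined in the Vulkan headers.
pub const VK_FORMAT_A2B10G10R10_UNORM_PACK32: u32 = 64;
pub const VK_FORMAT_R16G16B16A16_SFLOAT: u32 = 97;
pub const VK_COLOR_SPACE_EXTENDED_SRGB_LINEAR_EXT: u32 = 1_000_104_002;
pub const VK_COLOR_SPACE_HDR10_ST2084_EXT: u32 = 1_000_104_008;

/// The two pairs the layer wants the application to see (assumed mapping).
pub const HDR_PAIRS: [SurfaceFormat; 2] = [
    SurfaceFormat {
        format: VK_FORMAT_A2B10G10R10_UNORM_PACK32,
        color_space: VK_COLOR_SPACE_HDR10_ST2084_EXT,
    },
    SurfaceFormat {
        format: VK_FORMAT_R16G16B16A16_SFLOAT,
        color_space: VK_COLOR_SPACE_EXTENDED_SRGB_LINEAR_EXT,
    },
];

/// Append the HDR pairs to the list the ICD reported, skipping pairs that are
/// already present. When `hdr_active` is false the list is left untouched.
pub fn inject_hdr_formats(reported: &mut Vec<SurfaceFormat>, hdr_active: bool) {
    if !hdr_active {
        return;
    }
    for pair in HDR_PAIRS.iter().copied() {
        if !reported.contains(&pair) {
            reported.push(pair);
        }
    }
}
```

Calling `inject_hdr_formats` twice on the same list adds nothing the second time, which is the property that keeps the layer harmless on a monitor whose ICD already advertises these pairs.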

The installer builds/signs/stages it and registers it under
HKLM64\SOFTWARE\Khronos\Vulkan\ImplicitLayers (opt-out "Install the HDR Vulkan layer"
task); windows-host CI fmt+clippy-gates it (msvc-only FFI).

Live-validated on the RTX box: Doom: The Dark Ages enables HDR over the pf-vdisplay
virtual display.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 11:33:20 +00:00
parent 3e7c9bd059
commit d01a8fd17a
31 changed files with 1021 additions and 0 deletions
+126
@@ -0,0 +1,126 @@
---
title: "Apple Stage-2 Presenter (handoff)"
description: "Implementation plan for the explicit VTDecompressionSession → CAMetalLayer presenter — hand-paced present + true decode→present (glass-to-glass) measurement. Written so a Mac agent can pick it up."
---
> **Status update:** the stage-2 presenter described here has since been **built and live-validated**,
> shipping behind an opt-in flag (`AVSampleBufferDisplayLayer` remains the default known-good path).
> This page is preserved as the implementation/handoff record for that work.
The implementation plan for the **stage-2 Apple presenter**. The **stage-1** presenter feeds
compressed HEVC straight into `AVSampleBufferDisplayLayer`, which hardware-decodes **and presents
internally with no per-frame callback** — so we can't stamp decode or present, and we can't hand-pace.
Stage-2 takes explicit control: decode with `VTDecompressionSession`, present decoded frames through a
`CAMetalLayer` driven by a display link. Two wins: **~0.5 refresh off the present tail** (the biggest
client latency term at 60 Hz) and **true decode→present / glass-to-glass** numbers.
All of this is **macOS/iOS/tvOS-only** — build + validate on a Mac (`swift build && swift test`, then
live against a Linux host). The host + connector side is already done: `PunktfunkConnection.clockOffsetNs`
(the connect-time skew offset, host minus client) is what makes the present timestamp cross-machine
valid. See [Status](/docs/status) and roadmap §12.
## Where it plugs into the existing code
| Existing (stage-1) | Stage-2 change |
|---|---|
| `StreamPump` pulls AUs → `AnnexB.sampleBuffer` → `layer.enqueue` (compressed) | A `Stage2Pump` (or a mode flag on `StreamPump`) feeds AUs to `VTDecompressionSessionDecodeFrame` instead |
| `StreamView`/`StreamViewIOS` host an `AVSampleBufferDisplayLayer` | Host a `CAMetalLayer` (+ a display link); keep the input-capture + HUD overlay unchanged |
| `AnnexB.formatDescription(fromIDR:)` builds the format desc, refreshed on every IDR | **Reused** — it's the `VTDecompressionSession`'s format description; recreate the session when it changes |
| `LatencyMeter` records capture→client-receipt at `onFrame` | Extend to record **decode-completion** and **present** stages (below) |
Keep stage-1 behind a `UserDefaults` flag (e.g. `punktfunk.presenter = "stage1" | "stage2"`) so a
regression can fall back — `AVSampleBufferDisplayLayer` is the known-good path.
## Decode: VTDecompressionSession
1. Create the session from the IDR's `CMVideoFormatDescription`
(`AnnexB.formatDescription(fromIDR:)`):
```
VTDecompressionSessionCreate(
    allocator: nil,
    formatDescription: fmt,
    decoderSpecification: nil, // hardware by default; no need to force
    imageBufferAttributes: [
        kCVPixelBufferMetalCompatibilityKey: true,
        kCVPixelBufferPixelFormatTypeKey:
            kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange, // 8-bit SDR; 10-bit (…10BiPlanar) for HDR later
    ],
    outputCallback: <C-callback>,
    decompressionSessionOut: &session)
```
2. Per AU: build the same `CMSampleBuffer` as stage-1 (`AnnexB.sampleBuffer(au:format:)`, PTS =
`au.ptsNs` @ 1e9 timescale) and submit:
```
VTDecompressionSessionDecodeFrame(session, sampleBuffer,
    flags: ._EnableAsynchronousDecompression,
    frameRefcon: <pts or a boxed context>, infoFlagsOut: nil)
```
3. The **output callback** delivers `(status, infoFlags, imageBuffer: CVImageBuffer?, presentationTimeStamp, …)`.
`presentationTimeStamp` is `au.ptsNs` (the host capture clock). **Stamp decode-completion here**
(`CLOCK_REALTIME` ns), retain the `CVPixelBuffer`, and push `{pts, pixelBuffer, decodedNs}` into a
small NSLock-guarded ring (the "ready" queue) the display link drains.
4. **IDR / mode change**: when `AnnexB.formatDescription` yields a new desc, check
`VTDecompressionSessionCanAcceptFormatDescription`; if not, finish-and-recreate the session (same
trigger stage-1 uses to refresh `format`). On decoder error (`kVTVideoDecoderBadDataErr`, etc.) drop
to the next IDR — there's no out-of-band extradata; recovery keyframes re-carry the parameter sets.
## Present: CAMetalLayer + display link
- `CAMetalLayer` (device = system default, `pixelFormat = .bgra8Unorm`, `framebufferOnly = true`,
`drawableSize` = stream WxH). The view: macOS `NSView`/iOS `UIView` whose `layerClass`/backing layer
is the `CAMetalLayer` (mirror `StreamView`/`StreamViewIOS`).
- **Display link** drives present: macOS `CVDisplayLink` (or `CADisplayLink` on macOS 14+),
iOS/tvOS `CADisplayLink`. Each callback carries the **target present timestamp** (`CVTimeStamp` /
`targetTimestamp`).
- Each vsync: pop the **newest** ready frame (drop older undisplayed ones — low-latency default; no
smoothing buffer to start), render a fullscreen quad sampling the **biplanar YUV** (luma +
chroma planes via `CVMetalTextureCache`) with a BT.709 YUV→RGB fragment shader, then
`commandBuffer.present(drawable)` (or `present(drawable, atTime:)`). **Stamp present time** for the
frame just shown (use the display link's target timestamp converted to `CLOCK_REALTIME`).
- Colorspace: BT.709 8-bit for now (matches the host's SDR). HDR (BT.2020/PQ, 10-bit `…10BiPlanar` +
EDR `CAMetalLayer.wantsExtendedDynamicRangeContent`) is a later tie-in with the HDR roadmap (§10).
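
The "ready" queue and the pop-newest policy above can be sketched language-neutrally. The Rust below is only an illustration of the policy (the real client is Swift with an `NSLock`-guarded ring); `ReadyFrame`, `ReadyQueue`, and `take_newest` are hypothetical names, the pixel buffer handle is omitted, and the sketch assumes frames leave the decoder in presentation order.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// One decoded frame waiting for a vsync (pixel buffer handle omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct ReadyFrame {
    pub pts_ns: i64,     // host capture clock
    pub decoded_ns: i64, // client clock, stamped in the decode callback
}

/// Small lock-guarded ring shared by the decode callback and the display link.
pub struct ReadyQueue {
    inner: Mutex<VecDeque<ReadyFrame>>,
    capacity: usize,
}

impl ReadyQueue {
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1);
        Self {
            inner: Mutex::new(VecDeque::with_capacity(capacity)),
            capacity,
        }
    }

    /// Decode side: enqueue a frame, evicting the oldest one when full.
    pub fn push(&self, frame: ReadyFrame) {
        let mut queue = self.inner.lock().unwrap();
        if queue.len() == self.capacity {
            queue.pop_front();
        }
        queue.push_back(frame);
    }

    /// Present side: take the newest frame and discard every older one.
    /// Returns the frame plus the number of frames dropped undisplayed.
    pub fn take_newest(&self) -> Option<(ReadyFrame, usize)> {
        let mut queue = self.inner.lock().unwrap();
        let newest = queue.pop_back()?;
        let dropped = queue.len();
        queue.clear();
        Some((newest, dropped))
    }
}
```

Returning the dropped count alongside the frame is a design choice of this sketch: it gives the HUD a cheap signal for how often the decoder outruns the display.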
### Cheaper intermediate (2a) if the Metal path is too big in one step
Decode with `VTDecompressionSession` (gets the **decode-completion timestamp** = capture→decoded),
then wrap the decoded `CVPixelBuffer` in a `CMSampleBuffer` and `enqueue` it into the existing
`AVSampleBufferDisplayLayer` (it accepts uncompressed pixel buffers too). This yields the decode term
**without** a Metal renderer — but **not** true present (the layer still presents internally). Ship 2a
first if useful; 2b (CAMetalLayer + display link) is required for the on-glass present stamp.
## Measurement (the whole point)
Extend `LatencyMeter` (or add per-stage meters) so each frame records three instants, all
`CLOCK_REALTIME` ns, all shifted by `connection.clockOffsetNs` to the host clock:
- **capture→decoded** = `decodedNs + offset − pts_ns` (VideoToolbox decode latency, cross-machine)
- **decode→present** = `presentedNs − decodedNs` (the present tail stage-2 shortens)
- **capture→present** = `presentedNs + offset − pts_ns` — **the glass-to-glass number** (modulo the
host render→capture term, still unmeasured; see roadmap §12)
Surface `capture→present` p50/p95 in the HUD (extend the existing `model.latency*` line in
`ContentView`). `skewCorrected` stays false when `clockOffsetNs == 0` (old host) — then the numbers are
same-host-only, as today.
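
The three stage latencies reduce to a few lines of integer arithmetic. The sketch below restates them in Rust for clarity only (the client itself is Swift); `FrameStamps`, `StageLatency`, and `percentile` are hypothetical names, and the nearest-rank percentile is one reasonable choice for the p50/p95 HUD line, not necessarily what `LatencyMeter` does.

```rust
/// The three instants recorded per frame, all in nanoseconds.
pub struct FrameStamps {
    pub pts_ns: i64,       // host capture clock
    pub decoded_ns: i64,   // client clock
    pub presented_ns: i64, // client clock
}

#[derive(Debug, PartialEq, Eq)]
pub struct StageLatency {
    pub capture_to_decoded_ns: i64,
    pub decode_to_present_ns: i64,
    pub capture_to_present_ns: i64,
}

/// `clock_offset_ns` is host minus client, so adding it moves a client
/// timestamp onto the host clock before subtracting the capture time.
pub fn stage_latency(s: &FrameStamps, clock_offset_ns: i64) -> StageLatency {
    StageLatency {
        capture_to_decoded_ns: s.decoded_ns + clock_offset_ns - s.pts_ns,
        // Both stamps are on the client clock, so no offset is needed here.
        decode_to_present_ns: s.presented_ns - s.decoded_ns,
        capture_to_present_ns: s.presented_ns + clock_offset_ns - s.pts_ns,
    }
}

/// Nearest-rank percentile over a window of samples; `p` is in 0..=100.
pub fn percentile(samples: &[i64], p: f64) -> Option<i64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

By construction `capture_to_present_ns` equals `capture_to_decoded_ns + decode_to_present_ns`, which makes a handy sanity check on the stamps.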
## Validation
- `swift test`: add a decode-output test (decode a known IDR built like
`VideoToolboxRoundTripTests` → assert a `CVPixelBuffer` of the right dimensions + the
decode callback fires). Present is display-bound — validate it **live** via the HUD number.
- Live: connect to a Linux host (`punktfunk1-host --source virtual` on the GNOME box; see
[Ubuntu — GNOME](/docs/ubuntu-gnome)), confirm `capture→present` is a few ms over `capture→client`
and that `decode→present` shrank vs. an `AVSampleBufferDisplayLayer` baseline.
- Compare against the headless reference number: `punktfunk-probe` reports skew-corrected
capture→reassembled (~1.3 ms p50 GNOME box → dev box); capture→present should be that **+ decode +
present**.
## Gotchas
- VT decode is **async**; the output callback runs on a VT-managed thread — don't block it, just stamp
+ enqueue. Retain the `CVPixelBuffer` until presented (the ring owns it).
- `VTDecompressionSessionDecodeFrame` wants the **same** `CMSampleBuffer` shape stage-1 builds (AVCC
length-prefixed NALs, in-band parameter sets in the format desc, never as extradata).
- `CAMetalLayer.drawableSize` must track mode changes (the host can `Reconfigure` mid-stream — watch
`PunktfunkConnection.mode`/the new-IDR dimensions).
- Don't add a jitter/smoothing buffer for the first cut — present newest-ready for lowest latency; a
pacing policy can come later if frames look uneven.
- Keep `clients/apple/README.md`'s "Stage 2" item + [Status](/docs/status) updated when this lands.
+135
@@ -0,0 +1,135 @@
---
title: "CI & Docker"
description: "Gitea Actions setup — workflows, the dockerized pieces, and the runners."
---
CI runs on **Gitea Actions** (`git.unom.io`, org `unom`). The workflows live in
`.gitea/workflows/`; they run across Linux, macOS, and Windows runners and push a few images to the
Gitea container registry.
## Release model
Two tracks (full guide: [Release Channels](https://punktfunk.unom.io/docs/channels)). A push to
`main` publishes **canary** builds to the canary channels; a single **`vX.Y.Z` tag** is THE release
for every platform — built at that version, published to the **stable** channels, and every artifact
attached to one Gitea Release via the shared `scripts/ci/gitea-release.{sh,ps1}` helper (idempotent
create-or-fetch + delete-before-upload, so concurrent cross-runner attaches don't collide). The old
`host-v*` / `win-v*` / `host-win-v*` tag namespaces are retired — `v*` is the only release tag.
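
The two-track rule can be stated as a small pure function. This is a sketch in Rust, not the workflows' actual logic (which lives in YAML and shell); `Channel` and `channel_for_ref` are hypothetical names, and the strict three-part numeric check on `vX.Y.Z` is an assumption about how tags are shaped.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Channel {
    Canary,
    Stable { version: String },
}

/// Map a full Git ref to a release channel.
/// `refs/heads/main` is canary, `refs/tags/vX.Y.Z` is stable, anything else
/// (including the retired `host-v*` style tags) publishes nowhere.
pub fn channel_for_ref(git_ref: &str) -> Option<Channel> {
    if git_ref == "refs/heads/main" {
        return Some(Channel::Canary);
    }
    let version = git_ref.strip_prefix("refs/tags/v")?;
    let parts: Vec<&str> = version.split('.').collect();
    let is_semver_triple = parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit()));
    if is_semver_triple {
        Some(Channel::Stable { version: version.to_string() })
    } else {
        None
    }
}
```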
## Workflows
| Workflow | Trigger | Runner | What it does |
|---|---|---|---|
| `ci.yml` | push `main`, PRs | Linux | Rust workspace (fmt · clippy `-D warnings` · build · test · C-ABI harness · header drift) in `punktfunk-rust-ci`; `web/` + `docs-site/` build + typecheck in `oven/bun:1` |
| `apple.yml` | push `main`, PRs, manual | macOS | Rust core → `PunktfunkCore.xcframework` → `swift build`/`swift test` (CI gate, no publish) |
| `windows.yml` | push `main` (paths), PRs, manual | Windows | client build · clippy · fmt · test for `x86_64`/`aarch64` (CI gate, no publish) |
| `deb.yml` | push `main` → canary, `v*` → stable, manual | Linux | host/client/web `.deb` → apt (`canary`/`stable` distribution); `v*` attaches to the release |
| `rpm.yml` | push `main` → canary, `v*` → stable, manual | Linux | host `.rpm` (bazzite + fedora-44 bases) → rpm (`*-canary`/base groups); `v*` attaches |
| `windows-msix.yml` | push `main` (paths) → canary, `v*` → stable, manual | Windows | client MSIX `x64`+`arm64` → generic registry (`canary/`/`latest/`); `v*` attaches |
| `windows-host.yml` | push `main` (paths) → canary, `v*` → stable, manual | Windows | host Inno installer → generic registry (`canary/`/`latest/`); `v*` attaches |
| `android.yml` | push `main` → Play internal, `v*` → Play alpha, PRs, manual | Linux | signed AAB+APK → Play + generic registry; `v*` attaches |
| `release.yml` | push `main` (paths) → TestFlight, `v*` → DMG + TestFlight, manual | macOS | Apple mac/iOS(/tvOS on stable); `v*` notarized `.dmg` attaches |
| `flatpak.yml` | push `main` (paths) → canary branch, `v*` → stable, manual | Linux | client flatpak (OSTree repo + bundle, branch per channel); `v*` attaches |
| `decky.yml` | push `main` → canary, `v*` → stable, manual | Linux | Decky plugin zip → generic registry (`canary/`/`latest/`); `v*` attaches |
| `docker.yml` | push `main`, `v*`, manual | Linux | web/docs/CI images (`latest` + `sha-<short>`; `v*` adds a `vX.Y.Z` tag) |
## Dockerized pieces
The host and the native clients are intentionally **not** containerized (the host needs
the GPU/compositor stack of the box it runs on). What is:
| Image | Source | Notes |
|---|---|---|
| `git.unom.io/unom/punktfunk-web` | `web/Dockerfile` (repo-root context — orval needs `docs/api/openapi.json`) | Nitro `bun` bundle; `PORT` (3000) and `PUNKTFUNK_MGMT_URL` env at runtime |
| `git.unom.io/unom/punktfunk-docs` | `docs-site/Dockerfile` | This site; `PORT` (3000) |
| `git.unom.io/unom/punktfunk-rust-ci` | `ci/rust-ci.Dockerfile` | Ubuntu 26.04 + FFmpeg 8/PipeWire/GL/GBM dev libs + a libcuda **link stub** (driver userspace, no kernel module) + pinned rustup — the container `ci.yml`'s Rust job runs in |
Registry pushes authenticate with a repo Actions secret holding a registry token (a PAT
with `write:package`; the login username in `docker.yml` is the token owner, not the
push actor).
## Runners
- **Linux runner** — runs the Rust/web/docs jobs (as docker containers) and the image
build+push jobs.
- **macOS runner** — an Apple-silicon Mac running macOS, a **host-mode** `act_runner`
(upstream now ships it as `gitea-runner`) provisioned by
[`scripts/ci/setup-macos-runner.sh`](https://git.unom.io/unom/punktfunk/src/branch/main/scripts/ci/setup-macos-runner.sh):
rustup (+ both darwin targets for the universal xcframework), Node.js (host-mode runners
execute JS actions via `node` from PATH — nothing auto-provisions it), the runner binary
in `~/.local/bin`, state under `~/ci/act-runner/` (config, `.runner` registration,
`runner.log`), kept alive by the `io.gitea.act_runner` **root LaunchDaemon** — it cannot
be a user LaunchAgent: macOS Local Network privacy silently blocks LAN dials
("no route to host") from unbundled CLI binaries in gui/user launchd domains, while
system daemons are exempt. Needs full **Xcode** for `xcodebuild -create-xcframework`
(CLT alone only covers `swift build/test`); if `xcode-select` still points at CLT, the
script auto-detects `/Applications/Xcode*.app` and bakes a `DEVELOPER_DIR` override into
the daemon environment — no `xcode-select -s` required.
- **Windows runner** — builds and packages the native Windows client (MSIX) for the
release matrix.
Re-provisioning is idempotent — re-running `scripts/ci/setup-macos-runner.sh` on the macOS
runner with a fresh `GITEA_RUNNER_TOKEN` (org `unom` → Settings → Actions → Runners →
Create new runner) re-registers it without manual cleanup.
## Apple releases
`release.yml` produces the production client builds on the Mac runner. All three app
targets share the bundle ID **`io.unom.punktfunk`** (one App Store listing, universal
purchase — effectively unchangeable after first submission). Signing is **not** secret-based:
the runner uses its **login keychain** directly, so install the **Developer ID Application**,
**Apple Distribution**, and (for the Mac App Store `.pkg`) **3rd Party Mac Developer
Installer** identities once via Xcode, with the WWDR intermediate present so they show as
valid. The only secrets are `ASC_API_KEY_P8`/`ASC_API_KEY_ID`/`ASC_API_ISSUER_ID` (App Store
Connect API key — notarization + TestFlight upload). Per-platform state:
- **macOS (Developer ID)** — sandboxed app (`Config/Punktfunk-macOS.entitlements`) → export →
`notarytool` → stapled `.dmg` on the Gitea release.
- **macOS (App Store)** — manual-signed archive (Apple Distribution + the *Punktfunk macOS
App Store Distribution* profile) → upload to TestFlight. App Sandbox is **mandatory** here
and is now declared (app-sandbox + network client/server + audio-input + bluetooth/usb).
Prereqs (one-time, Apple portal): add the **macOS platform** to the App Store Connect app
record (universal purchase), install the Mac App Store distribution profile + the installer
cert above. `continue-on-error` until those exist.
- **iOS** — archive + upload to TestFlight (`method: app-store-connect`,
`destination: upload`). Crypto is declared exempt (`ITSAppUsesNonExemptEncryption`,
`Config/Info.plist`) so builds don't stall on the compliance question.
- **tvOS** — archive + upload to TestFlight (Rust core built from tier-3 targets, nightly
`-Zbuild-std` via `build-xcframework.sh`).
Each platform uses its own entitlements: `Config/Punktfunk-macOS.entitlements` (App
Sandbox is macOS-only) for the macOS app, and the shared `Config/Punktfunk.entitlements`
(keychain-access-groups only) for iOS/tvOS — `com.apple.security.app-sandbox` is invalid on
iOS/tvOS and would fail upload validation.
The runner needs a **release (non-beta) Xcode** — App Store processing rejects beta-SDK
builds, and a beta is unusable for the Rust side too: a newer-than-OS ld emits dylibs the
running dyld rejects ("mis-aligned LINKEDIT string pool"), killing every proc-macro build
with a misleading `E0463 can't find crate`. `build-xcframework.sh` therefore resolves
toolchains itself: non-beta Xcode for everything; with only CLT + a beta present it
builds macOS slices against CLT (packaging via any Xcode — `-create-xcframework` does no
linking) and **refuses iOS/tvOS slices** (CLT has no iOS SDK).
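
The toolchain resolution rule reads more easily as a decision function. The Rust below is a sketch of the rule as described here, not a port of `build-xcframework.sh`; the type and function names are hypothetical, and the behavior when neither a release Xcode nor CLT is installed is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Slice {
    MacOs,
    Ios,
    TvOs,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BuildWith {
    ReleaseXcode,
    CommandLineTools,
}

/// What is installed on the runner.
pub struct Installed {
    pub release_xcode: bool,
    pub beta_xcode: bool,
    pub clt: bool,
}

/// Pick the toolchain that compiles and links one xcframework slice.
/// A beta Xcode is never used for linking; it may only package the result.
pub fn toolchain_for(slice: Slice, have: &Installed) -> Result<BuildWith, String> {
    if have.release_xcode {
        return Ok(BuildWith::ReleaseXcode);
    }
    match slice {
        Slice::MacOs if have.clt => Ok(BuildWith::CommandLineTools),
        Slice::MacOs => Err("no release Xcode and no CLT: cannot build macOS".to_string()),
        Slice::Ios | Slice::TvOs => {
            Err("iOS/tvOS slices need a release Xcode (CLT has no iOS SDK)".to_string())
        }
    }
}

/// Packaging does no linking, so any Xcode (beta included) is acceptable.
pub fn can_package(have: &Installed) -> bool {
    have.release_xcode || have.beta_xcode
}
```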
## Deployment
`docker.yml`'s `deploy-docs` job ships this docs site after every image push: it syncs
`compose.production.yml` to the docs server and runs `docker compose pull && up -d` there
over SSH, driven by a small set of deploy secrets (`DEPLOY_HOST` / `DEPLOY_USER` /
`DEPLOY_PORT` / `DEPLOY_SSH_KEY`). A reverse proxy in front of that server serves the
container as <https://docs.punktfunk.unom.io>. The host and the web console are NOT
deployed — the console fronts a punktfunk host's management API on whatever box runs the
host.
## Troubleshooting
- **macOS runner offline** — check `~/ci/act-runner/runner.log` on the runner; restart with
`sudo launchctl kickstart -k system/io.gitea.act_runner`. "no route to host" in the log
means the daemon is running in a gui/user domain again — see the Local Network note
above.
- **`apple.yml` fails at the xcframework step** — Xcode missing or unselected:
`sudo xcode-select -s /Applications/Xcode.app/Contents/Developer` and accept the license
(`sudo xcodebuild -license accept`), then re-run.
- **Rust job can't pull `punktfunk-rust-ci`** — the runner host's docker daemon needs a
`docker login git.unom.io` if the org/registry isn't anonymously readable.
- **Stale builder image after toolchain/dep changes** — `docker.yml` re-pushes it on every
`main` push; a manual `workflow_dispatch` of `docker.yml` forces a rebuild.
+111
@@ -0,0 +1,111 @@
---
title: "DualSense Haptics"
description: "Feasibility and scoping for audio-driven DualSense haptics."
---
**Status: scoped, NO-GO for now (deferred).** Advanced voice-coil haptics on the DualSense are
driven by the controller's **USB audio interface** (4-channel surround, the back two channels carry
the haptic waveform), *not* by HID reports. Emulating that on a Linux host and faithfully replaying
it on the Apple client both hit hard walls, and the supply of software that actually *emits* these
haptics on a Linux host is essentially zero. We defer the audio-haptics feature and instead land the
parts of "really supporting the DualSense" that *are* reachable: **adaptive triggers (HID) and
two-motor rumble.**
(Grounded in a 4-agent feasibility read — host USB-gadget viability, DualSense audio descriptors,
Linux game demand, Apple client render path — 2026-06-10.)
## The one distinction that decides everything
| Feature | How it's driven | Reachable for us? |
|---|---|---|
| Basic rumble (2 motors) | HID output report `0x02`, bytes 3–4 | **Yes** — already parsed; client already has `nextRumble()` |
| **Adaptive triggers** (L2/R2 resistance) | HID output report `0x02`, bytes 11–22 / 22–33 | **Yes** — already parsed in `dualsense.rs`; just needs the `0xCD` back-channel + client render |
| **Advanced haptics** (voice-coil actuators) | **USB *audio* interface** — 4-ch, back 2 channels = haptic PCM | **No (for now)** — see the three walls below |
The UHID DualSense we already built is **HID-only**. It cannot present the DualSense's *audio*
interface, so it structurally cannot carry advanced haptics. That's not a bug in our implementation —
it's the wrong transport for this signal.
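
To make the HID side of the table concrete, here is a minimal sketch of pulling the two rumble motor bytes out of output report `0x02`. It is not the code in `dualsense.rs`; `Rumble` and `parse_rumble` are hypothetical names, and the layout is an assumption: byte 0 is the report ID, and bytes 3 and 4 are the right and left motor in that order. The sketch also ignores the report's valid-flag bytes.

```rust
/// Two-motor rumble strength as carried in a DualSense output report.
#[derive(Debug, PartialEq, Eq)]
pub struct Rumble {
    pub right: u8,
    pub left: u8,
}

/// Extract rumble from a raw USB output report.
/// Returns `None` for short buffers or a report ID other than 0x02.
pub fn parse_rumble(report: &[u8]) -> Option<Rumble> {
    const REPORT_ID_OUTPUT: u8 = 0x02;
    // Need bytes 0..=4: report id, two flag bytes, then the two motors.
    let head = report.get(..5)?;
    if head[0] != REPORT_ID_OUTPUT {
        return None;
    }
    Some(Rumble {
        right: head[3],
        left: head[4],
    })
}
```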
## The three walls (any one is fatal on its own)
### Wall 1 — Host capture needs a kernel rebuild
To *capture* haptic audio a game emits, the host must present a virtual device that owns the
DualSense audio interface. The standard way is a composite USB gadget (`configfs` + `f_hid` +
`f_uac2`) bound to a software UDC (`dummy_hcd`).
- ✅ Present & enabled on this box: `CONFIG_USB_CONFIGFS`, `CONFIG_USB_CONFIGFS_F_HID`,
`CONFIG_USB_CONFIGFS_F_UAC2`, plus `libcomposite`/`usb_f_hid`/`usb_f_uac2`/`u_audio` modules.
- ❌ **Blocker:** `# CONFIG_USB_DUMMY_HCD is not set` in `/boot/config-7.0.0-22-generic`. No
`dummy_hcd.ko`, no `/sys/class/udc/`. **No UDC → nothing to bind the gadget to.** Requires a
custom kernel build to enable `CONFIG_USB_DUMMY_HCD=m`, plus root for module-load/configfs.
A lighter alternative exists — a **virtual PipeWire/ALSA sink renamed as the DualSense** (this is how
the working Linux setups capture the back-2-channels today, via WirePlumber rules). It skips the
kernel rebuild, but is gated by the same Wall 2 below, and games' audio-device detection is
hardcoded per-title so it's fragile.
### Wall 2 — Almost nothing on a Linux host emits these haptics
This is the decisive one. The *supply* that would feed our capture barely exists:
- **Steam Input (Linux):** no official advanced-haptics support (open feature request as of 2026).
- **Sony's `hid-playstation` kernel driver:** explicitly does **not** expose VCM haptics or adaptive
triggers — basic rumble only.
- **RPCS3:** treats the DualSense as a generic pad; no advanced haptics.
- **Native Linux games:** effectively **zero** with advanced haptics.
- **The only working path** is a handful of Proton titles (FF7 Remake, Ghostwire, Deathloop, Animal
Well, Stellar Blade) via ClearlyClaire's *custom Wine patches*, **USB-only**, Steam Input
disabled, forced into a 4.0-surround profile, device renamed to match Windows. ~5–10 games total.
- Bluetooth can't carry it on Linux *or* Windows (Sony's proprietary A2DP repurposing isn't exposed).
A host-side capture feature is only as useful as the software willing to drive it. On Linux that set
is a niche-of-a-niche.
### Wall 3 — The Apple client can't faithfully replay it
Even with a captured waveform, the primary client (macOS/iOS) can't render it well:
- macOS GameController exposes the DualSense as a **basic gamepad** — no voice-coil / adaptive-trigger
access. Those are PS5-only in Apple's stack.
- CoreHaptics is **discrete, pattern-based** (`CHHapticPattern` events, ≤30 s), **not** a PCM
streaming sink. Converting a streamed haptic waveform to patterns is lossy — it throws away exactly
the fidelity that makes voice-coil haptics worth having.
- There is **no public macOS API** to route CoreAudio to the DualSense's channels 3–4. Doing it
anyway means private/reverse-engineered APIs that break across OS updates.
## What we *can* ship instead ("really supporting the DualSense" minus audio haptics)
The HID DualSense we built is the foundation, and the high-value parts are within reach:
1. **Adaptive triggers — GO.** `dualsense.rs` already parses the L2/R2 trigger effects out of HID
output report `0x02`. Finishing this is the paused HID work: route them over the `0xCD`
HID-output back-channel and render on the client. This delivers the headline "DualSense feel"
(trigger resistance/weapon tension) for any source that emits it — and it's pure HID, no audio
interface, no kernel rebuild.
2. **Two-motor rumble — already done.** Parsed host-side; the Apple client already has
`nextRumble()`. Wire it to `GCDeviceHaptics`/`CHHapticEngine` as discrete patterns (API-clean,
no private APIs).
3. **LED / player-LED / touchpad / motion** — already parsed; finish the `0xCC`/`0xCD` routing.
This is the resume-able HID DualSense Phase C/D/E work — it stands on its own and was never blocked.
## Conditions for a future GO on audio haptics
Revisit if **all three** change:
- A real DualSense is available on the dev box to capture an authoritative `lsusb -v` + the exact
UAC channel/sample-rate/format layout (today: undocumented, would need reverse-engineering).
- The host target gains a UDC (custom kernel with `dummy_hcd`, or real hardware OTG) **or** we accept
the PipeWire-renamed-sink path *and* the title set that emits haptics on Linux grows beyond the
Proton-patch niche.
- The client target shifts to one that can render PCM haptics (a Linux/Windows client with direct
CoreAudio-style channel access, or a future Apple API) — or we accept lossy pattern conversion.
Until then the cost/benefit is upside-down: three hard subsystems (kernel, USB gadget, audio
routing) to serve ~5–10 Proton titles, rendered lossily on the one client we ship.
## Recommendation
**Defer audio-driven advanced haptics. Land adaptive triggers (HID) + rumble instead** — that's the
reachable 80% of "really supporting the DualSense," needs no kernel work, and the parsing is already
written. Keep this doc as the down payment for the audio-haptics feature whenever the three
conditions above are met.
+377
@@ -0,0 +1,377 @@
# Game library: more game stores
Status: **design / not started** · Author research: web-backed, adversarially verified (2026-06-26).
Goal: extend the unified game library so it enumerates and launches titles from more stores —
on **Windows** Xbox / Game Pass, Epic, EA app (and GOG / Ubisoft / Battle.net / Amazon);
on **Linux** Heroic (Epic+GOG+Amazon), Lutris, and a `.desktop`/Flatpak catch-all.
---
## 1. Where the extension point already is
The library lives in [`crates/punktfunk-host/src/library.rs`](../crates/punktfunk-host/src/library.rs)
and is already a plug-in system — its own doc comment names these exact targets. Adding a store is
a new `LibraryProvider`, not a rewrite.
```rust
pub trait LibraryProvider {
fn store(&self) -> &'static str; // "steam", ...
fn list(&self) -> Vec<GameEntry>; // best-effort: empty (not Err) if the store is absent
}
pub struct GameEntry { id: String /* "<store>:<localid>" */, store, title, art: Artwork, launch: Option<LaunchSpec> }
pub struct Artwork { portrait, hero, logo, header: Option<String> } // URLs the CLIENT fetches
pub struct LaunchSpec{ kind: String, value: String } // today: "steam_appid" | "command"
```
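
As an illustration of the "empty, not `Err`" contract, here is a self-contained sketch of a provider and the merge step. The types are simplified stand-ins for the ones above (no artwork or launch spec), `DirProvider` is a hypothetical provider that treats each subdirectory as one title, and this `all_games` is a sketch rather than the function in `library.rs`. A real store provider would read the launcher's manifests instead of directory names.

```rust
use std::path::PathBuf;

/// Simplified stand-in for the real `GameEntry`.
#[derive(Debug, Clone, PartialEq)]
pub struct GameEntry {
    pub id: String, // "<store>:<localid>"
    pub store: &'static str,
    pub title: String,
}

pub trait LibraryProvider {
    fn store(&self) -> &'static str;
    fn list(&self) -> Vec<GameEntry>;
}

/// Hypothetical provider: one subdirectory of `root` per installed title.
pub struct DirProvider {
    pub store_name: &'static str,
    pub root: PathBuf,
}

impl LibraryProvider for DirProvider {
    fn store(&self) -> &'static str {
        self.store_name
    }

    fn list(&self) -> Vec<GameEntry> {
        // Best-effort: a missing or unreadable store yields an empty list.
        let dir = match std::fs::read_dir(&self.root) {
            Ok(dir) => dir,
            Err(_) => return Vec::new(),
        };
        let mut games: Vec<GameEntry> = dir
            .filter_map(|entry| entry.ok())
            .filter(|entry| entry.path().is_dir())
            .map(|entry| {
                let name = entry.file_name().to_string_lossy().into_owned();
                GameEntry {
                    id: format!("{}:{}", self.store_name, name),
                    store: self.store_name,
                    title: name,
                }
            })
            .collect();
        games.sort_by(|a, b| a.id.cmp(&b.id)); // stable order for the UI
        games
    }
}

/// Merge every provider's titles into one list.
pub fn all_games(providers: &[Box<dyn LibraryProvider>]) -> Vec<GameEntry> {
    providers.iter().flat_map(|p| p.list()).collect()
}
```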
Today: `SteamProvider` (reads local `.acf` / `.vdf` files — **no API key, no network**) plus a
user-curated `custom` store. `all_games()` merges them; `launch_command(id)` resolves a
store-qualified id **against the host's own library** and maps the `LaunchSpec` to a shell command,
with injection guards (`steam_appid` is validated digits-only; the client never sends a raw command).
**The "read the launcher's own on-disk files, no auth" approach is the gold standard we replicate per store.**
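
The injection guard can be sketched as follows. This is an illustration only, not the body of the real `command_for`: the `LaunchSpec` here is a simplified copy, and the exact Steam command string (`steam steam://rungameid/<appid>`) is an assumption chosen for the example.

```rust
/// Simplified copy of the library's launch spec.
pub struct LaunchSpec {
    pub kind: String,  // "steam_appid" | "command"
    pub value: String,
}

/// Map a launch spec from the host's own library to a shell command.
/// The client only ever sends a store-qualified id, so `value` always comes
/// from host-side data; the digit check is defense in depth.
pub fn command_for(spec: &LaunchSpec) -> Option<String> {
    match spec.kind.as_str() {
        "steam_appid" => {
            let digits_only =
                !spec.value.is_empty() && spec.value.bytes().all(|b| b.is_ascii_digit());
            if digits_only {
                // Assumed command form for this sketch.
                Some(format!("steam steam://rungameid/{}", spec.value))
            } else {
                None
            }
        }
        // Operator-curated entry: used as written, never client-supplied.
        "command" => Some(spec.value.clone()),
        _ => None,
    }
}
```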
Surfaces touched by adding stores:
- `library.rs` — new providers (the bulk of the work is small per store).
- [`mgmt.rs`](../crates/punktfunk-host/src/mgmt.rs) `:1138` — serves `/library`; OpenAPI-generated TS client picks up new stores as data.
- [`web/src/sections/Library/view.tsx`](../web/src/sections/Library/view.tsx) — the grid; **store badge is hard-coded** steam-vs-custom, needs generalizing per `game.store`.
- Launch wiring: [`punktfunk1.rs`](../crates/punktfunk-host/src/punktfunk1.rs) `:573` (native) and [`gamestream/stream.rs`](../crates/punktfunk-host/src/gamestream/stream.rs) `:122` (Moonlight).
> The legacy GameStream `apps.json` ([`gamestream/apps.rs`](../crates/punktfunk-host/src/gamestream/apps.rs))
> is a **separate** Moonlight surface (session recipes: compositor + nested command) and stays as-is.
---
## 2. The two cross-cutting pieces (this is the real work)
Per-store enumeration is mostly easy. Two shared problems gate everything — especially Windows.
### 2a. Launch abstraction + the Windows launch gap
- **Linux** runs the chosen title as a shell command **nested in the per-session gamescope**
(`set_launch_command` / `PUNKTFUNK_GAMESCOPE_APP`). Works today.
- **Windows** captures the whole desktop (DXGI/WGC); there is no nesting, and
`VirtualDisplay::set_launch_command` is a **no-op** ([`vdisplay.rs:57`](../crates/punktfunk-host/src/vdisplay.rs)).
So on Windows **nothing is auto-started** — the user just sees the desktop.
**Plan.** Stop returning a single Linux shell string from `command_for`; introduce an internal enum and
an OS-aware resolver:
```rust
enum LaunchAction { Shell(String), Spawn { exe: PathBuf, args: Vec<String>, workdir: Option<PathBuf> } }
fn resolve_launch(&LaunchSpec) -> Option<LaunchAction> // cfg-aware
fn launch_command(id) -> Option<String> // Linux: thin Shell wrapper (back-compat)
#[cfg(windows)] fn launch_title(id) -> Result<()> // resolve Spawn + run in interactive session
```
**The Windows launcher already exists in the codebase — reuse it.**
[`capture/windows/wgc_relay.rs:196-204`](../crates/punktfunk-host/src/capture/windows/wgc_relay.rs)
does exactly the needed sequence:
`WTSGetActiveConsoleSessionId → WTSQueryUserToken → DuplicateTokenEx(TokenPrimary) →
CreateEnvironmentBlock → CreateProcessAsUserW(lpDesktop="winsta0\\default")`.
- Factor that into `windows/interactive.rs::spawn_in_active_session(exe, args, workdir) -> u32`.
- **Critical:** use the **logged-in user token** (`WTSQueryUserToken`, as `wgc_relay` does) — **not**
`windows/service.rs:449-510`'s variant, which duplicates the **SYSTEM** token and only retargets its
session id. UWP/appx activation, the user-hive protocol handlers (`HKCU\Software\Classes`), and each
launcher's auth/entitlement context all require the *real user's* token. The host process stays SYSTEM.
- For URI-handoff kinds (Epic/Steam/EA/Amazon/GOG-Galaxy) build a **concrete EXE + the URI as a separate
argv element**. `CreateProcessAsUserW` does **no** shell/protocol resolution — never `cmd /c`, never a
bare URI. For schemes with no exe-argv form (`amazon-games://`, `origin2://`), add an impersonate-token
`ShellExecuteEx` fallback (`ImpersonateLoggedOnUser` on a worker thread + `CoInitialize`).
- **Order:** launch the title **after** the interactive capture pipeline is live, so the game renders onto
the already-captured desktop and grabs foreground.
- **Caveats:** `WTSQueryUserToken` fails when no interactive user is logged on (a pre-login box can stream
the login/secure desktop but can't auto-launch a title); on the lock/secure desktop a launch may queue
until unlock. **Needs on-glass validation** (RTX box) that each launcher EXE accepts its URI on argv and
that post-capture launch grabs foreground.
### 2b. Artwork: a layered, no-auth-first `ArtResolver`
Steam gets free CDN art keyed by appid. Most stores don't. Layered ladder, degrade to a title-only card:
1. **Steam** → public Steam CDN by appid (unchanged, client fetches directly).
2. **Stores that already hold public CDN URLs** → emit verbatim, **no host endpoint**: Heroic
`store_cache` `art_*` (Epic/GOG/Amazon CDN), itch `cover_url`, GOG via public `api.gog.com/products/<id>?expand=images`
(one cached lookup), Epic via local `catcache.bin` keyImages.
3. **Xbox** → one **unofficial** no-auth `displaycatalog.mp.microsoft.com` lookup by StoreId, cached,
degrade to no-art offline. (Not a stable contract — tolerate drift.)
4. **Genuinely-local art** (Lutris `coverart`/`banners` JPEGs, Flatpak/.desktop icons, Bottles) → a
**new host-served endpoint is required**, because `Artwork` carries URLs the client fetches and a file
on the host has no public URL.
5. **Opt-in SteamGridDB** enrichment (v2 API `https://www.steamgriddb.com/api/v2`, `Authorization: Bearer
<operator key>`, **off by default**) to fill gaps. Not no-auth; never blocks listing.
6. **None** → existing title-only card.
**New endpoint:** `GET /library/art/<entryId>/<slot>` (slot ∈ `portrait|hero|logo|header`) on `mgmt.rs`.
It resolves `entryId` in the host library to a **known on-disk absolute path** (never interpolates raw
client input into a filesystem path), sanitizes the slot, rejects `..`, streams the bytes with the right
content-type. Reserve `data:` URLs for tiny logos only (don't bloat the catalog JSON that crosses the
control plane). See open question on whether this GET bypasses the mgmt bearer (images are non-sensitive
and the streaming client connects over punktfunk/1, not the bearer-gated REST).
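The lookup rule above (resolve by id, never build a path from client input) can be sketched as follows. This is a minimal illustration, not the host's real code: `Slot`, `ArtIndex`, and `content_type` are hypothetical names, and the index is assumed to be filled during library enumeration.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// The four slots named in the endpoint description.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Slot {
    Portrait,
    Hero,
    Logo,
    Header,
}

impl Slot {
    /// Accept only the four known slot names; anything else is rejected.
    fn parse(raw: &str) -> Option<Slot> {
        match raw {
            "portrait" => Some(Slot::Portrait),
            "hero" => Some(Slot::Hero),
            "logo" => Some(Slot::Logo),
            "header" => Some(Slot::Header),
            _ => None,
        }
    }
}

/// Hypothetical index built by the host: (entry id, slot) -> absolute path.
struct ArtIndex {
    paths: HashMap<(String, Slot), PathBuf>,
}

impl ArtIndex {
    /// Client strings are used only as map keys, never joined onto a
    /// directory, so a `..` in either field simply fails the lookup.
    fn resolve(&self, entry_id: &str, slot: &str) -> Option<&Path> {
        let slot = Slot::parse(slot)?;
        self.paths
            .get(&(entry_id.to_string(), slot))
            .map(PathBuf::as_path)
    }
}

/// Pick a content type from the file extension (assumed JPEG/PNG only).
fn content_type(path: &Path) -> &'static str {
    match path
        .extension()
        .and_then(|e| e.to_str())
        .map(str::to_ascii_lowercase)
        .as_deref()
    {
        Some("jpg") | Some("jpeg") => "image/jpeg",
        Some("png") => "image/png",
        _ => "application/octet-stream",
    }
}
```

Because the request fields never reach the filesystem layer, traversal checks become a second line of defence rather than the only one.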
---
## 3. Security model (preserved and extended)
The invariant is unchanged: **the client sends only a store-qualified `GameEntry.id`** (e.g. `lutris:42`,
`xbox:9NBLGGH4R315`, `epic:fn:4fe…:Fortnite`) in `Hello.launch`. The host looks it up in its **own**
enumerated library, reads the **host-derived** `LaunchSpec`, and resolves it. The client never sends a
`LaunchSpec`, command, URI, or path.
Per-kind charset validators are belt-and-suspenders before any interpolation (values are already
host-derived from local files the host owns):
| kind | guard |
|---|---|
| `steam_appid`, `lutris_id`, `uplay` | digits only |
| `battlenet` | `^[A-Za-z0-9]+$` (case-sensitive) |
| `amazon` | `^[A-Za-z0-9-]+$` |
| `aumid` | `^[A-Za-z0-9._-]+![A-Za-z0-9._-]+$` (the `!` separator) |
| `epic` | ≤3 `:`-split parts, each `^[A-Za-z0-9._-]+$`, then URL-encode colons |
| `heroic` | runner ∈ {legendary,gog,nile} + appName `^[A-Za-z0-9._-]+$` |
| `ea_offer_ids` | `^[A-Za-z0-9._,-]+$` (allow comma) |
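
A few of the guards in the table can be written without a regex dependency. The sketch below is illustrative only: the function names are invented here, and it covers just the digits-only, `aumid`, `epic`, and `heroic` rows.

```rust
/// Matches `^[A-Za-z0-9._-]+$`.
fn is_token(s: &str) -> bool {
    !s.is_empty()
        && s.bytes()
            .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'-'))
}

/// `steam_appid`, `lutris_id`, `uplay`: digits only.
fn valid_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// `aumid`: `<PFN>!<AppId>` with a token on each side of the `!`.
fn valid_aumid(s: &str) -> bool {
    match s.split_once('!') {
        // A second `!` lands in `app` and fails the token check.
        Some((pfn, app)) => is_token(pfn) && is_token(app),
        None => false,
    }
}

/// `epic`: at most three `:`-separated parts, each a token.
fn valid_epic(s: &str) -> bool {
    let parts: Vec<&str> = s.split(':').collect();
    parts.len() <= 3 && parts.iter().all(|p| is_token(p))
}

/// `heroic`: `<runner>:<appName>` with the runner from a fixed allow-list.
fn valid_heroic(s: &str) -> bool {
    match s.split_once(':') {
        Some((runner, app)) => {
            matches!(runner, "legendary" | "gog" | "nile") && is_token(app)
        }
        None => false,
    }
}
```

Each guard rejects the empty string, so a missing value cannot slip through as a valid one.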
On **Windows never route a client-influenced string through `cmd /c start`.** `resolve_launch` yields
`Spawn{exe,args,workdir}`; `CreateProcessAsUserW` launches a concrete EXE with the URI/flags as separate
argv elements. The operator-only `command` kind (custom store + provider-generated Linux shell lines for
`desktop`/`itch`) is host-derived/operator-typed, never client-set.
The one net-new surface is `GET /library/art` — covered in §2b (id-resolved path, no traversal).
---
## 4. New `LaunchSpec` kinds
| kind | value holds | maps to |
|---|---|---|
| `lutris_id` | `pga.db` `games.id` (digits) | Linux Shell `lutris lutris:rungameid/<id>` (nests in gamescope) |
| `heroic` | `<runner>:<appName>` | Linux argv `heroic --no-gui "heroic://launch?appName=<app>&runner=<runner>"` |
| `aumid` | `<PFN>!<AppId>` | Windows Spawn `explorer.exe "shell:AppsFolder\<aumid>"` (interactive session) |
| `epic` | `<namespace>:<catalogItemId>:<appName>` | Windows Spawn `EpicGamesLauncher.exe` + `com.epicgames.launcher://apps/<ns>%3A<cat>%3A<app>?action=launch&silent=true` |
| `gog` | host-resolved `exe \t args \t workdir` | Windows Spawn `CreateProcessAsUserW(exe,args,workdir)` (direct exe, no Galaxy) |
| `uplay` | Ubisoft gameId (digits) | Windows `uplay://launch/<gameId>/0` |
| `battlenet` | product code (e.g. `WTCG`, `Fen`, `OSI`) | Windows Spawn `Battle.net.exe --exec="launch <code>"` |
| `amazon` | Amazon Games `DbSet.Id` | Windows `amazon-games://play/<Id>` (impersonate ShellExecute) |
| `ea_offer_ids` | comma-joined contentID list | Windows `origin2://game/launch/?offerIds=<list>&autoDownload=1` |
| `command` (existing) | host-derived shell line | Linux gamescope-nested (desktop/flatpak/itch reuse this) |
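
To make the mapping concrete, here is a minimal sketch of a resolver for two rows of the table. The `LaunchSpec` shape (a kind tag plus a value) is an assumption made for the example, and the real resolver would be cfg-aware and cover every kind.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum LaunchAction {
    Shell(String),
    Spawn {
        exe: PathBuf,
        args: Vec<String>,
        workdir: Option<PathBuf>,
    },
}

/// Assumed shape: a kind tag plus a host-derived value (the real type may differ).
struct LaunchSpec<'a> {
    kind: &'a str,
    value: &'a str,
}

fn resolve_launch(spec: &LaunchSpec<'_>) -> Option<LaunchAction> {
    let v = spec.value;
    match spec.kind {
        // Digits-only guard before the value reaches a shell line.
        "lutris_id" if !v.is_empty() && v.bytes().all(|b| b.is_ascii_digit()) => {
            Some(LaunchAction::Shell(format!("lutris lutris:rungameid/{}", v)))
        }
        // The AUMID travels as one argv element; no shell ever parses it.
        "aumid" if v.contains('!') => Some(LaunchAction::Spawn {
            exe: PathBuf::from("explorer.exe"),
            args: vec![format!("shell:AppsFolder\\{}", v)],
            workdir: None,
        }),
        // Unknown kinds and values that fail their guard resolve to nothing.
        _ => None,
    }
}
```

Returning `None` for anything unrecognised keeps the failure mode a title that does not auto-launch, rather than a command that runs with unexpected input.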
---
## 5. Per-store provider catalog
Confidence ratings reflect the state **after** adversarial web verification (research → verify). All
enumeration is no-auth and local, and does not require the launcher to be running, unless noted.
### Linux
#### Lutris — P0, effort M, confidence **high**
- **Enumerate:** read-only `rusqlite` open of `pga.db`
(`$XDG_DATA_HOME/lutris` | `~/.local/share/lutris` | `~/.var/app/net.lutris.Lutris/data/lutris`).
`SELECT id, slug, name, runner FROM games WHERE installed=1`. Optionally LEFT JOIN
`games_categories`/`categories` to drop the `.hidden` category. Open `mode=ro`/`immutable=1` (Lutris
holds it open). `installed=1` matters — the DB also lists owned-but-not-installed rows.
- **Launch:** `lutris_id` → `lutris lutris:rungameid/<id>` (execs the game; most nesting-friendly).
One-time on-box check that `games.id` == the `rungameid` int.
- **Artwork:** **local** JPEGs keyed by slug — `coverart/<slug>.jpg` (→ portrait), `banners/<slug>.jpg`
(→ header) under `~/.local/share/lutris` (0.5.18+), with `~/.cache/lutris` (≤0.5.17) and the Flatpak
cache as fallbacks. Needs the `/library/art` endpoint. hero/logo stay None.
- **Notes:** highest-confidence new store. A `runner=='steam'` row can duplicate `SteamProvider` — dedup
is a nicety. Verify bundled-SQLite is fine for deb/rpm/flatpak.
#### Heroic — P0, effort M, confidence **high** (one provider = Epic + GOG + Amazon, art free)
- **Enumerate:** parse `~/.config/heroic/store_cache/{legendary,gog,nile}_library.json` (Flatpak:
`~/.var/app/com.heroicgameslauncher.hgl/config/heroic/...`). Data key is `"library"` (legendary/nile)
or `"games"` (gog); ignore `__timestamp.*` siblings. Filter `is_installed==true` **and** cross-check
`install.install_path` exists (works around the gog `is_installed` bug, Heroic #2691). Fall back to
`legendaryConfig/legendary/installed.json` etc. when a cache file is absent.
*(Heroic uses `legendaryConfig/legendary`, **not** the standalone `~/.config/legendary`.)*
- **Launch:** `heroic` → `heroic --no-gui "heroic://launch?appName=<app>&runner=<runner>"` (argv, no shell).
`--no-gui` does the suppression; the `gui=false` query param is **inert/fabricated** — drop it.
**Ship enumeration+art first, gate launch:** Heroic is single-instance Electron — if already running it
forwards the URI and **exits**, which (as gamescope's foreground child) would tear the session down while
the game runs **outside** gamescope, uncaptured. Also Electron needs a display — fine nested in gamescope,
not in a bare headless context.
- **Artwork:** **free** — `art_square` → portrait, `art_cover` → header, `art_background`||`art_cover` →
hero, `art_logo` → logo are already public Epic/GOG/Amazon CDN URLs. Skip non-`http(s)` values
(sideloaded `file://` art). No host endpoint.
- **Notes:** do **not** also build separate Linux GOG/Amazon providers — native Linux GOG Galaxy doesn't
exist; Heroic is the canonical Linux path for those.
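
The art mapping described in the Artwork bullet is small enough to sketch directly. The `Artwork` struct here is an assumed stand-in for the real wire type, with one optional URL per slot.

```rust
/// Assumed stand-in for the wire type: one optional URL per slot.
#[derive(Debug, Default, PartialEq)]
struct Artwork {
    portrait: Option<String>,
    header: Option<String>,
    hero: Option<String>,
    logo: Option<String>,
}

/// Keep only http(s) URLs; sideloaded `file://` art is dropped.
fn public_url(v: Option<&str>) -> Option<String> {
    v.filter(|u| u.starts_with("https://") || u.starts_with("http://"))
        .map(str::to_owned)
}

/// Map the four `store_cache` art fields onto slots.
fn heroic_artwork(
    art_square: Option<&str>,
    art_cover: Option<&str>,
    art_background: Option<&str>,
    art_logo: Option<&str>,
) -> Artwork {
    Artwork {
        portrait: public_url(art_square),
        header: public_url(art_cover),
        // Hero falls back to the cover when there is no usable background.
        hero: public_url(art_background).or_else(|| public_url(art_cover)),
        logo: public_url(art_logo),
    }
}
```

The fallback runs after filtering, so a `file://` background still lets a public cover URL fill the hero slot.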
#### Desktop (`.desktop` + Flatpak) — P1, effort M, confidence medium (universal catch-all)
- **Enumerate:** scan `{/var/lib/flatpak/exports/share/applications,
~/.local/share/flatpak/.../applications, /usr/share/applications, /usr/local/share/applications,
~/.local/share/applications}/*.desktop`. Require `Type=Application` + `Categories` contains `Game`; skip
`NoDisplay`/`Hidden`/`Terminal=true` and known launcher app-ids (Steam/Heroic/Lutris/Bottles/RetroArch)
to avoid recursion/dupes.
- **Launch:** reuse `command` (host-derived shell line, nested in gamescope): cleaned `Exec` (strip
`%U/%F/%f/%u/%i/%c/%k`) else `flatpak run <app-id>`.
- **Artwork:** local — resolve `Icon=` via the hicolor theme / flatpak exported icons → `/library/art`.
App icons are low-res, not box art (acceptable header fallback).
- **Notes:** run **last** and dedup by install path / drop ids already surfaced by Steam/Heroic/Lutris.
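
The filter and the `Exec` cleanup can be sketched with plain string handling. This is a simplification: it ignores section boundaries and localized keys, and it re-joins `Exec` tokens with single spaces, so quoted arguments containing runs of whitespace would need a real parser. All function names are hypothetical.

```rust
/// First value for `key` in a `.desktop` file body (sections are ignored).
fn field<'a>(file: &'a str, key: &str) -> Option<&'a str> {
    file.lines()
        .filter_map(|l| l.split_once('='))
        .find(|(k, _)| k.trim() == key)
        .map(|(_, v)| v.trim())
}

/// `Type=Application`, `Categories` contains `Game`, and none of
/// `NoDisplay` / `Hidden` / `Terminal` is set to true.
fn is_game_entry(file: &str) -> bool {
    let is_true = |key: &str| field(file, key) == Some("true");
    field(file, "Type") == Some("Application")
        && field(file, "Categories")
            .map_or(false, |c| c.split(';').any(|x| x == "Game"))
        && !is_true("NoDisplay")
        && !is_true("Hidden")
        && !is_true("Terminal")
}

/// Strip the field codes listed above from an `Exec=` value.
fn clean_exec(exec: &str) -> String {
    const CODES: [&str; 7] = ["%U", "%F", "%f", "%u", "%i", "%c", "%k"];
    exec.split_whitespace()
        .filter(|tok| !CODES.contains(tok))
        .collect::<Vec<_>>()
        .join(" ")
}
```

The launcher app-id skip list and the dedup against other providers would sit on top of this filter.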
#### itch.io — P3, effort S, confidence medium (Linux + Windows)
- **Enumerate:** read-only `rusqlite` of `butler.db` (`~/.config/itch/db/butler.db`; Flatpak
`io.itch.itch`; Windows `%AppData%\itch\db`, per-user). JOIN `caves`→`games`. **Key on `cave.ID`** (a
game can have multiple caves; install location + verdict are per-cave). Read game title / `cover_url`;
resolve install dir from `InstallLocationID`+`InstallFolderName`||`CustomInstallFolder` + the Verdict
candidate. Confirm exact column names on-box.
- **Launch:** `command` → direct binary `basePath`+`candidate.path`, **only** for Verdict candidates with
`flavor==native` (html/jar/love need itch's runtime — fall back to custom).
- **Artwork:** **free** — `games.cover_url` is a public itch CDN URL.
### Windows
#### Epic Games Store — P1, effort M, confidence medium (cleanest Windows store to validate the launch wiring)
- **Enumerate:** read `C:\ProgramData\Epic\EpicGamesLauncher\Data\Manifests\*.item` (JSON; machine-wide,
SYSTEM-readable, launcher need not run). Read `DisplayName`, `AppName`, `CatalogNamespace`,
`CatalogItemId`, `InstallLocation`, `LaunchExecutable`, `MainGameAppName`, `AppCategories`. Iterate the
dir (filename is a random GUID).
**Use Playnite's EXCLUSION filter, not a positive `games` filter:** skip `AppName` starting `UE_`; skip
DLC only when `AppCategories` has `addons` && **not** `addons/launchable`; require `InstallLocation`
exists. (The first-pass positive filter `games + MainGameAppName==AppName` can drop legit games.)
- **Launch:** `epic` → Spawn `EpicGamesLauncher.exe` + `com.epicgames.launcher://apps/<ns>%3A<cat>%3A<app>?action=launch&silent=true`.
Build the **triple** only when both namespace and CatalogItemId are present; otherwise **fall back to the
bare `appName` URI (don't set launch=None)** — bare still works in Playnite today, it's just less robust.
CatalogItemId is **not** present in every `.item` — verify on a real box.
- **Artwork:** **free** — base64-decode + parse `Data\Catalog\catcache.bin`, index by catalogItemId, map
keyImages `DieselGameBoxTall`→portrait, `DieselGameBox`→hero, `DieselGameBoxLogo`→logo. None on miss.
- **Notes:** `.item` + `catcache.bin` are community-RE'd; `silent=true` may not suppress a cold-start
launcher window.
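
A minimal sketch of the exclusion filter and the triple-or-bare URI rule follows. It assumes the ids have already passed the charset guard from §3, so only the `:` separators need encoding; the function names are invented for the example.

```rust
/// Exclusion filter: drop engine installs and non-launchable DLC,
/// keep everything else whose install directory exists.
fn is_launchable(app_name: &str, categories: &[&str], install_exists: bool) -> bool {
    if app_name.starts_with("UE_") {
        return false;
    }
    let is_addon = categories.iter().any(|c| *c == "addons");
    let launchable_addon = categories.iter().any(|c| *c == "addons/launchable");
    if is_addon && !launchable_addon {
        return false;
    }
    install_exists
}

/// Build the launcher URI: the full triple when both ids are present,
/// otherwise the bare `appName` form.
fn epic_launch_uri(
    namespace: Option<&str>,
    catalog_item_id: Option<&str>,
    app_name: &str,
) -> String {
    let target = match (namespace, catalog_item_id) {
        (Some(ns), Some(cat)) if !ns.is_empty() && !cat.is_empty() => {
            // `%3A` is the URL-encoded `:` separator.
            format!("{}%3A{}%3A{}", ns, cat, app_name)
        }
        _ => app_name.to_string(),
    };
    format!(
        "com.epicgames.launcher://apps/{}?action=launch&silent=true",
        target
    )
}
```

The resulting string is passed to `EpicGamesLauncher.exe` as a single argv element, never through a shell.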
#### GOG — P1, effort M, confidence medium
- **Enumerate:** registry `HKLM\SOFTWARE\WOW6432Node\GOG.com\Games\<id>` (PATH/GAMENAME/gameID/EXE) or
Uninstall `<id>_is1` keys with `Publisher=='GOG.com'` (exclude `GOGPACK*`). Parse
`<PATH>\goggame-<id>.info` for `playTasks[isPrimary && type=='FileTask']` → exe/args/workingDir.
- **Launch:** `gog` → **direct-exe** Spawn (no Galaxy dependency, dodges cold-start/anti-cheat). Optional
fallback: `GalaxyClient.exe /launchViaAutostart /gameId=<id> /command=runGame /path="<dir>"` (note the
`/launchViaAutostart` token; `goggalaxy://openGameView/<id>` only **opens the page**, doesn't launch).
- **Artwork:** **free** — public no-auth `GET https://api.gog.com/products/<id>?expand=images` →
`images.logo2x`/`verticalCover`/`background`; cache resolved URLs. (`goggame-.info` carries no art; the
Galaxy `galaxy-2.0.db` is undocumented/locked — avoid.)
#### Xbox / Microsoft Store / Game Pass — P1, effort **L**, confidence medium (big Game Pass value, most plumbing)
- **Enumerate:** probe each fixed drive for an `XboxGames` dir (default `C:\XboxGames`; the `.GamingRoot`
binary layout is **undocumented** — just scan, don't depend on parsing it). For each
`<Title>\Content\MicrosoftGame.config` (**presence = it's a GDK game**, the game-vs-app signal) read
`ShellVisuals.DefaultDisplayName` (title), `<StoreId>` (12-char BigId, the art key), `Identity Name`,
`<Executable Id="Game">` (the AppId). **Read the PackageFamilyName from the
`C:\ProgramData\Microsoft\Windows\AppRepository\Packages\<PackageFullName>` directory name** (drop
the `_Version_Arch_ResourceId` middle and keep `Name_PublisherId`); **never compute the PFN by hashing the publisher**. AUMID = `PFN!AppId`.
- **Launch:** `aumid` → `explorer.exe shell:AppsFolder\<AUMID>` into the interactive session. **UWP
activation fails from SYSTEM/session-0 — the interactive user token is load-bearing.**
- **Artwork:** one **unofficial** no-auth lookup
`displaycatalog.mp.microsoft.com/v7.0/products/<StoreId>?market=US&languages=en-us&fieldsTemplate=Details`,
map `Images[]` ImagePurpose Poster→portrait / SuperHeroArt→hero / Logo→logo / BoxArt→header; cache to
the config dir, degrade to no-art offline. Not a stable contract.
- **Notes:** misses pure-UWP (non-GDK) Store games under the ACL-locked `WindowsApps` — accept for v1.
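
Deriving the family name from the directory name is a pure string operation. The sketch below assumes the standard `Name_Version_Architecture_ResourceId_PublisherId` layout of a package full name (the resource id is often empty); the function names and the sample package are made up.

```rust
/// `Name_Version_Arch_ResourceId_PublisherId` -> `Name_PublisherId`.
fn family_name(package_full_name: &str) -> Option<String> {
    let parts: Vec<&str> = package_full_name.split('_').collect();
    // Exactly five fields; an empty ResourceId still yields an empty part.
    if parts.len() != 5 {
        return None;
    }
    let (name, publisher_id) = (parts[0], parts[4]);
    if name.is_empty() || publisher_id.is_empty() {
        return None;
    }
    Some(format!("{}_{}", name, publisher_id))
}

/// AUMID = `PFN!AppId`, where AppId comes from `<Executable Id="...">`.
fn aumid(package_full_name: &str, app_id: &str) -> Option<String> {
    family_name(package_full_name).map(|pfn| format!("{}!{}", pfn, app_id))
}
```

Anything that does not split into exactly five fields is rejected rather than guessed at.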
#### Ubisoft Connect — P2, effort S, confidence medium
- **Enumerate:** registry `HKLM\SOFTWARE\WOW6432Node\Ubisoft\Launcher\Installs\<gameId>` (both reg views),
read `InstallDir`; title = install-dir leaf folder (primary) else the `Uplay Install <gameId>` Uninstall
`DisplayName`.
- **Launch:** `uplay` → `uplay://launch/<gameId>/0`. **Artwork:** none → title-only.
- **Notes:** smallest effort once the Windows URI-launch wiring exists; hive+scheme unchanged across the
Uplay→Ubisoft Connect rebrand.
#### Amazon Games — P2, effort S, confidence medium
- **Enumerate:** read-only `rusqlite` of
`%LocalAppData%\Amazon Games\Data\Games\Sql\GameInstallInfo.sqlite`:
`SELECT Id,ProductTitle,InstallDirectory FROM DbSet WHERE Installed=1`. **Per-user path** — the SYSTEM
service must resolve the **active session user's** profile (not the SYSTEM profile).
- **Launch:** `amazon` → `amazon-games://play/<Id>` (impersonate-token ShellExecute; no clean exe-argv form).
- **Artwork:** `ProductIconUrl`/`ProductLogoUrl` columns when present, else none.
#### Battle.net — P2, effort **L**, confidence medium (high catalog value: WoW/Diablo IV/Overwatch 2/CoD)
- **Enumerate:** hand-roll a ~4-field protobuf decode of `C:\ProgramData\Battle.net\Agent\product.db`
(`product_install{ uid, product_code, settings.install_path, cached_product_state.base_product_state.installed }`).
Registry fallback: Uninstall keys whose `UninstallString` matches `Battle.net.exe --uid=<uid>`.
`product.db` has **no titles** → maintain a ~30-entry `product_code`→name map (source from
bnetlauncher/Lutris/Heroic; codes are **case-sensitive**).
- **Launch:** `battlenet` → `Battle.net.exe --exec="launch <code>"` (more reliable than the
`battlenet://<code>` URI, which only hands off). **Artwork:** none → title-only.
- **Notes:** the protobuf + name map + no-art make it L; pin the `.proto` and decode defensively.
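
A hand-rolled decode needs only a varint reader and a field walker. The sketch below shows the generic protobuf wire format with defensive bounds checks; it deliberately assumes nothing about the real `product.db` field numbers, which would have to come from the pinned `.proto`.

```rust
use std::convert::TryFrom;

/// Decode a base-128 varint, returning (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A 64-bit varint never needs more than 10 bytes.
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

#[derive(Debug, PartialEq)]
enum Field<'a> {
    Varint(u64),
    /// Length-delimited: a string, bytes, or a nested message.
    Bytes(&'a [u8]),
}

/// Walk one message and return (field number, value) pairs.
/// Fixed-width fields are skipped; truncated input yields `None`.
fn fields(mut buf: &[u8]) -> Option<Vec<(u32, Field<'_>)>> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        let (key, n) = read_varint(buf)?;
        buf = &buf[n..];
        let number = (key >> 3) as u32;
        match key & 7 {
            0 => {
                let (v, n) = read_varint(buf)?;
                buf = &buf[n..];
                out.push((number, Field::Varint(v)));
            }
            2 => {
                let (len, n) = read_varint(buf)?;
                let end = n.checked_add(usize::try_from(len).ok()?)?;
                out.push((number, Field::Bytes(buf.get(n..end)?)));
                buf = &buf[end..];
            }
            1 => buf = buf.get(8..)?, // fixed64, skipped
            5 => buf = buf.get(4..)?, // fixed32, skipped
            _ => return None,         // groups or unknown wire types
        }
    }
    Some(out)
}
```

Nested messages such as `settings` are decoded by calling `fields` again on the `Bytes` payload.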
#### EA app — P2, effort M, confidence medium (most closed/fragile — ship last)
- **Enumerate:** registry `HKLM\SOFTWARE\WOW6432Node\{EA Games,Origin Games}\<id>` (Install Dir /
DisplayName), parse `<dir>\__Installer\installerdata.xml` for the **full** `<contentIDs>` list +
`<gameTitle locale='en_US'>`. Registry under-reports for EA-app (vs legacy Origin) installs — known
completeness gap. Keep the AES-256 encrypted `IS`-file decrypt **out** of the default path (optional
feature flag for completeness).
- **Launch:** `ea_offer_ids` → `origin2://game/launch/?offerIds=<full,comma,list>&autoDownload=1`. **Emit
the full contentID list** — a single offerId generally no longer launches under the EA app.
- **Artwork:** none no-auth → title-only.
#### Rockstar — P3, fold into custom
- Registry `HKLM\SOFTWARE\WOW6432Node\Rockstar Games\<Title>\InstallFolder`; direct-exe Spawn; no art.
Tiny catalog, most titles now bought on Steam/Epic.
---
## 6. Suggested structure & phasing
**Structure.** Split `library.rs` → a `library/` dir before it balloons:
`mod.rs` (trait, wire types, `LaunchAction`, custom CRUD, `all_games`, `resolve_launch`,
`launch_command`/`launch_title`), `steam.rs`, one file per provider, `art.rs` (ArtResolver +
displaycatalog/gog-api/steamgriddb helpers), `win_util.rs` (HKLM subkey enumerator, read-only SQLite
opener, tiny read-only XML reader). New deps: `rusqlite` (bundled, read-only) for lutris/itch/amazon DBs;
`roxmltree`/`quick-xml` for the Windows manifests; registry via the `windows` crate's
`Win32_System_Registry` feature (no new crate). Avoid `prost` — hand-roll the ~4 Battle.net fields.
| Phase | Deliverable | Files |
|---|---|---|
| **1 — Foundation** (no new stores) | Split `library.rs` → `library/`; add `LaunchAction` + `resolve_launch`; factor `windows/interactive.rs::spawn_in_active_session` out of `wgc_relay.rs`; make `set_launch_command` real on Windows; wire `launch_title` at session-start post-capture; add `win_util.rs` + deps | `library/{mod,steam,launch,art,win_util}.rs`; `windows/interactive.rs` (new); `capture/windows/wgc_relay.rs`; `punktfunk1.rs:573`; `gamestream/stream.rs:122`; `vdisplay.rs:57`; `main.rs`; `Cargo.toml` |
| **2 — Linux Lutris + Heroic + art endpoint** (P0) | `LutrisProvider`, `HeroicProvider` (art free); `GET /library/art/<id>/<slot>` for Lutris local JPEGs; wire into `all_games()`; unit tests for new `resolve_launch` arms + guards | `library/{lutris,heroic,art}.rs`; `library/mod.rs`; `mgmt.rs:1138` + new route |
| **3 — Windows Epic + GOG** (P1) | `EpicProvider` (.item + catcache art), `GogProvider` (registry + .info + api.gog.com art); validate `windows/interactive.rs` end-to-end on the RTX box | `library/{epic,gog,win_util,art,launch}.rs` |
| **4 — Xbox / Game Pass** (P1) | `XboxProvider` (XboxGames scan + MicrosoftGame.config + AppRepository PFN + aumid launch) + displaycatalog art with caching/offline degrade | `library/{xbox,art,launch}.rs` |
| **5 — Linux Desktop catch-all + easy Windows URI stores** (P1/P2) | `DesktopProvider` (last + dedup, icons via `/library/art`), `UplayProvider`, `AmazonProvider` (+ per-user-profile-under-SYSTEM helper) | `library/{desktop,uplay,amazon,win_util,art}.rs` |
| **6 — Remaining + opt-in enrichment** (P2/P3) | `BattleNetProvider` (hand-rolled protobuf + code→name map), `EaAppProvider`, `ItchProvider`; Rockstar/Bottles → custom; optional SteamGridDB v2 behind an operator key | `library/{battlenet,eaapp,itch,art,mod}.rs` |
Also generalize the web console store badge (`web/src/sections/Library/view.tsx`) to render per `game.store`.
---
## 7. Open questions
- **Art delivery auth:** the streaming client connects over punktfunk/1 (QUIC), not the bearer-gated mgmt
REST, yet already fetches Steam CDN URLs over plain HTTP. Should `GET /library/art/*` be an
unauthenticated read-only image GET on the mgmt listener (bearer bypass for that path only), a separate
tiny image server, or should local-art bytes ride the punktfunk/1 control plane?
- **Windows launch ordering** needs on-glass RTX-box validation: confirm launching *after* capture is live
grabs foreground+capture, and that `CreateProcessAsUserW(EpicGamesLauncher.exe/steam.exe, URI-as-argv)`
actually starts the game per launcher (vs needing the impersonate-ShellExecute fallback).
- **Per-user-profile resolution under SYSTEM** for Amazon (`%LocalAppData%`) and itch (`%AppData%`): add
`WTSQueryUserToken` + `GetUserProfileDirectoryW` (or read `USERPROFILE` from `CreateEnvironmentBlock`)?
- **`rusqlite` bundled SQLite** — acceptable for deb/rpm/flatpak and no link conflict? Otherwise fall back
to `lutris -l -j` (fragile: single-instance D-Bus forwarding).
- **Battle.net** product-code→name map source/maintenance, and `product.db` `.proto` drift across Agent versions.
- **Unofficial art sources** (Xbox displaycatalog): best-effort with aggressive caching + no-art degrade,
or Xbox-art local-tile-only for v1?
- **Heroic launch:** ship enumeration+art only at first, or invest in direct legendary/gogdl/nile CLI
launch (needs the user's on-disk auth tokens) to dodge the single-instance-Electron / gamescope-escape problem?
- **`config_dir()` consistency:** `library.rs` uses an XDG/HOME-based dir; confirm the Windows SYSTEM host
lands its art cache + custom store under `%ProgramData%\punktfunk` (there's a separate
`gamestream::config_dir()` that already does this).
- Should provider-generated Linux shell lines (`desktop`/`itch`) reuse the `command` kind (documented
"operator-only") or get a distinct internal kind to keep the mgmt-UI `command` semantics clean?
---
## 8. Verification notes (what the adversarial pass corrected)
First-pass research was web-re-checked; corrections folded into §5 above:
- **Epic:** bare-`AppName` URI is **not** universally removed (Playnite still uses it) — build the triple
when ids exist, fall back to bare; use Playnite's **exclusion** filter, not a positive `games` filter.
- **EA:** a single offerId no longer launches — emit the **full** comma-joined contentID list; registry
under-reports for EA-app installs.
- **Battle.net:** `battlenet://<code>` only hands off — use `Battle.net.exe --exec="launch <code>"`.
- **Xbox:** **read** the PFN from the AppRepository dir name, don't hash the publisher; `.GamingRoot`
layout is undocumented — just scan `XboxGames`.
- **Heroic:** `gui=false` is inert (`--no-gui` does it); single-instance Electron forwards-and-exits →
gate launch.
- **Lutris:** open the DB read-only; `lutris -l -j` fallback is fragile (single-instance D-Bus forwarding).
- **SteamGridDB:** v1 is deprecated — use v2 (`/api/v2`, Bearer key).
**Not web-confirmable / needs on-box validation:** every Windows launch path (each launcher's argv
handling, foreground grab, secure-desktop behavior), all registry keys / DB schemas against a live box,
and `rusqlite` packaging.
---
title: "gamescope Multi-User Isolation (deferred)"
description: "Research + design for concurrent INDEPENDENT gamescope desktops (multi-user), and why it's deferred. The shared-desktop multi-view case already landed."
---
**Status: deferred (2026-06-12).** Concurrent sessions landed for the **shared-desktop multi-view**
case — multiple devices viewing/controlling the *same* KWin/Mutter/wlroots desktop ([Status](/docs/status)).
This page captures the research for the *other* model — **independent desktops** (each client its own
gamescope instance: the multi-user / cloud-gaming-on-one-box case) — and why it's parked. Pick this
up from here if the use case becomes a priority.
## What landed vs what this is
| Model | Backends | Input | Audio | Status |
|---|---|---|---|---|
| **Shared-desktop multi-view** | kwin / mutter / wlroots | shared (all drive one desktop) | shared (all hear one desktop) | ✅ **landed** — correct semantics: stream *your* desktop to laptop + TV at once |
| **Independent desktops (multi-user)** | gamescope | **per-session** (each drives its own game) | **per-session** | ⏸ **deferred** — this page |
For independent desktops, shared input/audio is *wrong* — each user must drive and hear only their own
session. gamescope is the natural fit: each `create()` spawns a fresh nested compositor (own
rendering, own EIS input socket). The blocker is that the host's input/audio/mic are host-lifetime
**shared** services, and the gamescope EIS socket is relayed through a single global file.
## Current architecture (the research)
Each gamescope **process is per-session** (`vdisplay/gamescope.rs::create()` spawns one; the
`VirtualOutput.keepalive` owns it). But:
- **EIS input socket — single global file.** gamescope exports `LIBEI_SOCKET` for its children; a
shell wrapper relays it to the fixed path `/tmp/punktfunk-gamescope-ei` (`EI_SOCKET_FILE`).
**Two concurrent instances overwrite each other's socket name** in that one file.
- **Injector — one host-lifetime `!Send` service.** `punktfunk1.rs::InjectorService` opens **one**
`inject::open(backend)` for the whole run and forwards events over an mpsc channel. It was made
shared deliberately (the portal `CreateSession` churn wedged KWin's EIS — "EIS setup timed out").
For gamescope it reads the one global socket file, so all sessions' input lands in whichever
instance wrote last.
- **Audio — global default-sink monitor.** `audio::open_audio_capture()` sets
`STREAM_CAPTURE_SINK` and autoconnects to the host's **default sink monitor** (PW_ID_ANY) — the
whole system's output, not a per-gamescope node. gamescope exposes **no per-instance audio node**.
- **Mic — one global `Audio/Source`.** `MicService` feeds one PipeWire source named `punktfunk-mic`;
all clients' mic uplinks mix into it.
- Per-session already (no work): the gamescope process, the PipeWire video node, and the uinput
gamepads.
## What it would take
1. **Per-instance EIS socket** — give each gamescope a unique relay file
(`/tmp/punktfunk-gamescope-{id}-ei`) and carry the path on `VirtualOutput` (new field) so the
session can find its own socket.
2. **Per-session injector** — for gamescope sessions, create a **per-session** injector bound to that
socket (its own thread, since `InputInjector` is `!Send`), instead of the shared `InjectorService`.
Keep the shared service for the portal backends (kwin/mutter) where shared input is correct.
Ordering nuance: the input thread is wired before the gamescope socket exists, so the per-session
injector must open **lazily** (on first event, by which time gamescope is up) or be created after
`build_pipeline`.
3. **Per-session audio (the bigger piece).** gamescope has no per-instance audio node, but audio
*is* isolatable: create a **per-session PipeWire null-sink**, route that gamescope's apps to it
(`PULSE_SINK` / a target node on the spawn env), and capture **that sink's monitor** per session.
This is the largest addition — null-sink create/teardown + routing + per-session capture.
4. **Per-session mic** — a virtual `Audio/Source` per session (`punktfunk-mic-{id}`), routed into
that gamescope, instead of the one global source.
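
Steps 1, 2, and 4 can be sketched as follows. This is only an outline under stated assumptions: `LazyInjector` is a hypothetical wrapper, the `open` callback stands in for whatever binds an injector to a socket file, and the session id type is a guess.

```rust
use std::path::{Path, PathBuf};

/// Per-instance relay file, replacing the single global path.
fn ei_socket_file(session_id: u32) -> PathBuf {
    PathBuf::from(format!("/tmp/punktfunk-gamescope-{}-ei", session_id))
}

/// Per-session virtual mic source name.
fn mic_source_name(session_id: u32) -> String {
    format!("punktfunk-mic-{}", session_id)
}

/// Opens the injector on first use, so the input thread can be wired
/// before gamescope has written its socket file.
struct LazyInjector<I, F>
where
    F: FnMut(&Path) -> Option<I>,
{
    socket_file: PathBuf,
    open: F,
    inner: Option<I>,
}

impl<I, F> LazyInjector<I, F>
where
    F: FnMut(&Path) -> Option<I>,
{
    fn new(session_id: u32, open: F) -> Self {
        LazyInjector {
            socket_file: ei_socket_file(session_id),
            open,
            inner: None,
        }
    }

    /// Returns the injector, retrying the open until it first succeeds.
    fn get(&mut self) -> Option<&mut I> {
        if self.inner.is_none() {
            self.inner = (self.open)(&self.socket_file);
        }
        self.inner.as_mut()
    }
}
```

An event that arrives before the socket exists is dropped by the caller when `get` returns `None`; the next event retries the open.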
## Why deferred
- It's a **large multi-file refactor** — the whole input path (per-instance sockets + per-session
injector + the lazy-open ordering), **plus** per-session null-sink audio routing, **plus** per-session
mic — for a **niche** use case (multiple independent users gaming on one box).
- The **common** concurrency case — stream one desktop to several of *your own* devices — is the
shared-desktop multi-view model, which **already landed and is the correct semantics** for it.
- No correctness gap in what shipped: concurrent sessions work today; this is purely the *additional*
independent-desktops model.
Revisit when there's a real multi-user requirement. The plumbing list above is the whole job.
---
title: "GameStream Host"
description: "Stream to a stock Moonlight client on a client-sized virtual display."
---
The shippable milestone (plan §8). A stock Moonlight/Artemis client discovers this host,
pairs, launches, and gets video (then input, then audio) on a client-sized virtual display.
Ground-truth protocol reference: [`research/gamestream-protocol-research.json`](research/gamestream-protocol-research.json)
(distilled from Sunshine + moonlight-common-c source; cite those for byte-level detail).
## Architecture (respects the "one core" invariant)
- **punktfunk-core** gains a **P1 GameStream wire codec** (`ProtocolPhase::P1GameStream`, the
hook already exists): the exact RTP+`NV_VIDEO_PACKET` framing, the GameStream FEC shard
layout, and the video/audio AES-GCM/CBC paths. Hot path, native threads, **no async**.
Kept beside punktfunk's native internal format (P2), selected by phase.
- **punktfunk-host** gains the **control plane** (tokio/axum OK — I/O-bound, not the hot path):
mDNS discovery, nvhttp serverinfo + the 4-phase pairing, the RTSP handshake, the ENet
control stream + input injection, the virtual-display lifecycle, and Opus audio encode.
## Port map (base 47989; Moonlight derives all by offset)
| Port | Proto | Role |
|---|---|---|
| 47989 | TCP | HTTP nvhttp (unpaired: /serverinfo, /pair PIN flow) |
| 47984 | TCP | HTTPS nvhttp (paired; **client-cert pinned**) — /launch, /resume, … |
| 48010 | TCP | RTSP (OPTIONS/DESCRIBE/SETUP/ANNOUNCE/PLAY) |
| 47998 | UDP | Video RTP (+ RS-FEC, optional AES-GCM) |
| 47999 | UDP | Audio RTP (Opus, RS-FEC 4+2, optional **AES-CBC**) |
| 48000 | UDP | ENet control stream (AES-GCM) + remote input |
| 5353 | UDP | mDNS `_nvstream._tcp.local` advertisement |
## Key wire facts (the non-obvious ones)
- **Video datagram** = `RTP_PACKET(12, BIG-endian)` + `reserved[4]` + `NV_VIDEO_PACKET(16,
LITTLE-endian)` + payload. Endianness differs *within the same packet*. `header=0x80|0x10`.
- `fecInfo` (u32 LE) = `(dataShards<<22)|(fecIndex<<12)|(fecPercentage<<4)`; parityShards is
**recomputed** by the client as `ceil(dataShards*pct/100)` — must match exactly.
- `multiFecBlocks` = `(blockIdx<<4)|((nBlocks-1)<<6)`; **≤4 FEC blocks/frame**, ≤255 shards/block.
- Each frame's bitstream is prefixed with an 8-byte `video_short_frame_header_t`
(`headerType=0x01`, `frameType` 2=IDR, `lastPayloadLen`) before striping into shards.
- Shard size = `packetSize + 16`. Data shards first, then parity, over a contiguous RTP
sequence range. Last data shard zero-padded.
- **Video crypto** (when `SS_ENC_VIDEO` negotiated): AES-128-GCM, key = raw 16-byte RIKEY
(from `/launch?rikey=`), IV = `counter_le[8]||0,0,0||'V'(0x56)`, **NO AAD**, 32-byte
`ENC_VIDEO_HEADER{iv[12],frameNumber,tag[16]}` prefix; **FEC first, then encrypt per shard**.
- **Pairing**: PIN key = `SHA-256(salt[16] || ascii_pin)[..16]`; AES-128-**ECB** (no padding)
for the challenge blocks; SHA-256 rolling hashes; RSA-SHA256 signatures over X.509 certs;
the client cert is pinned for subsequent HTTPS. 4 phases over `/pair?phrase=…`.
- **RTSP** `Session: DEADBEEFCAFE;timeout = 90` (literal), `Transport: server_port=<p>`,
`streamid=video/0/0` / `control/13/0`. ANNOUNCE carries the negotiated config
(`x-nv-video[0].*`, `x-nv-vqos[0].*`) → maps to `punktfunk_core::Config`.
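
The bit-packing rules in the list above translate directly into a few helpers. This is a sketch of the arithmetic only, with invented function names; it does not serialize full packets.

```rust
/// Pack `fecInfo` (written to the wire as a little-endian u32).
fn fec_info(data_shards: u32, fec_index: u32, fec_percentage: u32) -> u32 {
    (data_shards << 22) | (fec_index << 12) | (fec_percentage << 4)
}

/// Parity count as the client recomputes it: ceil(dataShards * pct / 100).
fn parity_shards(data_shards: u32, fec_percentage: u32) -> u32 {
    (data_shards * fec_percentage + 99) / 100
}

/// Pack `multiFecBlocks` for block `block_idx` out of `n_blocks` (1..=4).
fn multi_fec_blocks(block_idx: u8, n_blocks: u8) -> u8 {
    (block_idx << 4) | ((n_blocks - 1) << 6)
}

/// 12-byte video GCM IV: 8-byte LE counter, three zero bytes, then 'V'.
fn video_iv(counter: u64) -> [u8; 12] {
    let mut iv = [0u8; 12];
    iv[..8].copy_from_slice(&counter.to_le_bytes());
    iv[11] = b'V';
    iv
}
```

Since the client derives the parity count itself, the host has to size its parity set with the same rounding, for example 2 parity shards for 7 data shards at 20%.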
## The two highest interop risks (validate EARLY)
1. **RS-FEC matrix compatibility.** Sunshine + Moonlight both use **nanors** (GF(2⁸), poly
0x11d, Vandermonde systematic). punktfunk-core uses `reed-solomon-erasure` (Cauchy) — parity
bytes likely **don't match**, so Moonlight silently fails to recover any frame with a lost
data shard. Mitigation: **on a clean LAN with no loss the client never runs RS decode**, so
defer this — get a frame decoded first, then FFI/port nanors for loss recovery.
2. **Crypto layout.** punktfunk's `SessionCrypto` (salt + seq-as-AAD) is wire-incompatible. P1
needs a separate GameStream GCM path. Mitigation: **video encryption is negotiated and
usually off on LAN** — implement plaintext video first, add GCM later.
## Phasing (each phase independently testable with a real Moonlight client)
- **P1.1 — Discovery + serverinfo + pairing.** mDNS `_nvstream._tcp`, HTTP/HTTPS nvhttp,
`/serverinfo` XML, the 4-phase pairing + cert pinning. *Acceptance: Moonlight discovers,
pairs (PIN), and shows the host as ready.* ← first slice.
- **P1.2 — Launch + RTSP + virtual display.** `/launch` (parse rikey/rikeyid/mode), the RTSP
handshake, negotiate `Config`, create a wlroots virtual output sized to the client.
*Acceptance: Moonlight completes RTSP and the host stands up the UDP streams.*
- **P1.3 — Video (punktfunk-core P1 codec), plaintext, clean-LAN.** RTP+NV framing + FEC shard
layout in punktfunk-core; wire the spike's NVENC AUs → UDP 47998. *Acceptance: Moonlight DISPLAYS video.*
- **P1.4 — Control + input.** ENet (`rusty_enet`) control stream; decode input → `inject.rs`
(uinput/reis); request-IDR → force NVENC keyframe. *Acceptance: mouse/keyboard work.*
- **P1.5 — Robustness: FEC recovery + encryption.** nanors-exact FEC; per-shard AES-GCM.
*Acceptance: stable under `tc netem` loss; encrypted streams.*
- **P1.6 — Audio + polish.** Opus + audio RTP/FEC/CBC (UDP 47999); disconnect teardown; KWin
backend for the user's KDE box. *Acceptance: full game stream with sound — the GameStream-host goal.*
## Crates (verified available)
`mdns-sd` 0.20 (discovery) · `axum` 0.8 + `rustls` + `tokio-rustls` (nvhttp/HTTPS, custom
`ClientCertVerifier` for pinning) · `rcgen` 0.14 + `x509-parser` 0.18 + `rsa`/`sha2`/`aes`/
`ecb` (pairing crypto) · hand-rolled RTSP over `tokio::net::TcpListener` · `rusty_enet` 0.4
(control) · `opus` 0.3 (audio) · `reis` 0.6 + `input-linux` (input) · `aes-gcm` (already in
core) for the P1 video/control GCM path; nanors (FFI/port) for FEC recovery in P1.5.
## Testing note
The host is headless; end-to-end needs a **stock Moonlight client on the LAN** pointed at
this box (manual "add host" by IP works without mDNS). P1.1 is testable with `curl` against
`/serverinfo` + the Moonlight pair flow; P1.3+ needs a client that can display.
@@ -0,0 +1,430 @@
# GPU-contention performance investigation — why a saturating game starves the stream (2026-06-25)
> The headache, stated precisely:
> a game renders ~140 fps on the host GPU; the client requests 120/240; in a GPU-light scene the
> stream tracks; the moment the game pins the GPU the **stream collapses to 40–50 fps** while the
> game keeps rendering 140. Capping the game's fps raises the stream back up (clearest in light
> titles like CS2). **Capping is not an acceptable fix** — demanding titles exhaust the GPU even
> when capped.
This is the second, deeper pass on the problem. The first pass is
[`host-latency-plan.md`](host-latency-plan.md) (a 25-agent investigation, 2026-06-18). **This doc
supersedes several of that doc's conclusions** — the codebase moved a lot in the week since
(the Windows-host rewrite landed IDD-push as the default capture path, split-encode shipped, the
GPU-priority knob got configurable), and a fresh, adversarially-verified research pass overturned
two of the old plan's premises. Read §1 (corrections) before acting on the old doc.
Method: five parallel investigations — three deep reads of the *current* code (encode, capture,
mitigations) and two web-research passes (encoder-side and GPU-scheduling-side), the latter run with
their own adversarial verifiers. Every external claim below carries a source URL; every code claim
carries a current `file:line`.
---
## 0. TL;DR — the corrected mental model and the action list
**The governing fact:** NVENC is a **dedicated ASIC on its own GPU runlist**, physically separate
from the SM/CUDA/graphics cores a 3D game saturates. The game does **not** steal the encode block.
It steals everything that *feeds* the block — capture-acquire, the **RGB→YUV colour-convert**, the
copy into the encoder's input surface, the readback — **and the GPU-scheduler time** to run that
feed work, which is queued behind the game's graphics context.
([NVENC app-note](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html),
[engine-table proof, UNC RTAS'24](https://www.cs.unc.edu/~jbakita/rtas24.pdf))
**Therefore there are two different bottlenecks with opposite fixes, and you must tell them apart
before writing code:**
| Bottleneck | Symptom | Fix family |
|---|---|---|
| **(a) feed-scheduling contention** | `uniq`≈`fps`, both ~50; `encode_ms` 13–17 | shrink the host's contended-engine footprint; raise GPU scheduling priority; pipeline correctly; in the limit, a second GPU |
| **(b) frame-source ceiling** | `fps`≈240 (held re-encodes) but `uniq`→40–50 | capture the game's real frames (swapchain hook); compose-flip for the DLSS-FG case |
**The single hardest truth:** on one saturated GPU there is **no free lunch**. Any host GPU work
either *preempts* the game (and steals its frames) or *waits* behind it. Capping the game works
only because it cuts the game's **total** GPU demand and opens idle gaps. The non-capping
equivalents are exactly three: **need less GPU** (footprint shrink), **take more** (priority — which
costs the game fps), or **use a different GPU** (real isolation). Anything pitched as "make the game
politely yield without losing anything" — Reflex, render-queue tricks — is a **placebo** here (§7).
**Action list, highest leverage first** (detail in §5–§6):
1. **Diagnose first** (§3). Read `uniq`-vs-`fps` under the real workload + PresentMon presentation
mode. Half a day; decides whether you're fighting (a) or (b). The repo already prints the counter.
2. **Stop feeding NVENC RGB on the default path.** IDD-push (the install default) hands NVENC
BGRA → NVENC runs its RGB→YUV CSC on the SM, the exact contended engine. Convert to NV12/P010 on
the **video engine** like the WGC/DDA paths already do. Biggest in-our-control win. (§5.A)
3. **Build a *correct* async encode pipeline** — submit on one thread, blocking-retrieve on another,
deep surface pool, Windows completion events. Our past "pipelining didn't help" was a *same-thread*
implementation that can't overlap; the two-thread pattern the NVENC guide mandates was never
tried. Recovers the depth-1 serialization that produces ~50 fps, up to the priority ceiling. (§5.B)
4. **Auto-gated REALTIME GPU priority.** Our `LocalSystem` service *can* grant it (most apps can't).
Gate on HAGS-state + VRAM headroom to dodge the documented NVENC freeze. (§5.C)
5. **Lock clocks / pin P-state** for jitter (cheap; fixes the light-scene "200-not-240", not the
collapse). (§5.E)
6. **If source-bound: swapchain-hook capture** (OBS-style) — the real escape from the compose
ceiling. Big lift, anti-cheat tradeoffs. (§5.F)
7. **The honest endgame for demanding titles: encode on a second GPU / the iGPU.** The only approach
that *removes* contention instead of re-prioritizing it. We already have AMF/QSV paths. (§5.G)
---
## 1. Corrections to `host-latency-plan.md` (read before reusing it)
The old doc was right about the shape but several specifics are now wrong or stale:
- **"Windows already feeds NVENC YUV on the video engine, so it does the right thing."** True for the
DDA and WGC paths — **false for IDD-push, which is now the install default** and feeds NVENC
**RGB**, paying the SM-side CSC the old doc said Windows had eliminated. The default path
*regressed* on the exact axis the doc celebrated. (§5.A, `capture/windows/idd_push.rs:545-551,743`)
- **"`PUNKTFUNK_ENCODE_DEPTH` (default 4, ≤6) deep-pipelines."** **There is no such knob.** It exists
only in two stale comments (`encode/windows/nvenc.rs:30`, `capture/windows/wgc.rs:57`) and is never
parsed. The real depth knob is `PUNKTFUNK_IDD_DEPTH` (default 2), used only by IDD-push on the
native path; GameStream and the WGC helper are hardcoded depth-1.
- **"Async NVENC is measure-gated and probably stacks latency (Tier 3D)."** The measurement that
produced that verdict (`capture/windows/wgc_helper.rs:131-135`) pipelined **on a single thread**:
it queued more frames but still blocked `lock_bitstream` inline, so it added queue latency with
**zero overlap**. That is not the pattern the NVENC guide prescribes (submit/retrieve on
*separate* threads). The correct async pipeline is **untried**, not disproven. (§5.B)
- **"More GPU priority is maxed and hits a hard preemption wall with no recourse."** Half right.
Priority *is* near-maxed (HIGH), but the "no recourse" intuition is wrong: a **higher-priority GPU
context does preempt a saturating graphics context at pixel granularity** — that is precisely how
NVIDIA VR Async-TimeWarp injects a frame into a busy game
([VRWorks Context Priority](https://developer.nvidia.com/vrworks/headset/contextpriority)). And we
default to HIGH, leaving **REALTIME unused** even though our SYSTEM service can grant it. (§5.C)
- **"Force Composed Flip / double-refresh recovers the 'capture sees half the frames' loss."** The
"half the frames" effect is **specifically a DLSS-Frame-Generation flip-metering artifact**
(FG v310.x+ / RTX 50-series), *not* a general property of independent-flip games — normal
fullscreen flip games are captured at full rate by DDA. So composed-flip is a **narrow** fix, not a
general lever. ([Apollo #676 — DDA captured a flip game at full 120 fps](https://github.com/ClassicOldSong/Apollo/issues/676),
[Sunshine #3621 — version-pinned to FG 310.x](https://github.com/LizardByte/Sunshine/issues/3621))
- **"NvFBC is a possible low-overhead capture path."** **Dead on Windows** — deprecated, frozen at
Capture SDK 7.1 / Win10-1803
([NVIDIA deprecation bulletin](https://developer.download.nvidia.com/designworks/capture-sdk/docs/NVFBC_Win10_Deprecation_Tech_Bulletin.pdf)).
Linux-only, and there only via the consumer `keylase` patch.
What the old doc got right and still holds: feeding NVENC RGB is backwards; the source/compose ceiling
is real and upstream of encode; split-encode is a pixel-rate lever not a contention lever; the
honest residual ceiling at 100% GPU. Those carry forward.
---
## 2. How the pipeline actually serializes today (verified against current code)
The capture→encode loop is a **fixed-cadence pacer** (`gamestream/stream.rs:375-480`,
`punktfunk1.rs:2430-2540`): every `1/target_fps` tick it grabs the freshest frame with a
**non-blocking** `try_latest()`, and **if nothing new arrived it re-encodes the held frame** (a
near-empty P-frame). So the **outbound fps is pinned at `target_fps` no matter what the source did**,
which is *why the raw fps counter lies* under contention. The only honest signal is the `uniq` /
`diag_new` counter (`stream.rs:380`, `punktfunk1.rs:2433-2436`), and the code itself states the
diagnostic: *"low new_fps at high send rate ⇒ the source isn't producing frames, not an encode
stall"* (`punktfunk1.rs:2466-2468`).
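
The pacer behaviour described above can be sketched as a small simulation. This is a minimal model, not the code in `stream.rs` or `punktfunk1.rs`: `Source`, `PacerStats`, and `run_pacer` are hypothetical names, and the source is a precomputed list of "did a new frame arrive on this tick" flags. It shows why the sent counter stays at the tick rate while `uniq` tracks the source.

```rust
/// Counters the pacer would report: `sent` is the outbound rate, `uniq` the
/// number of ticks that actually had a new source frame.
#[derive(Debug, Default, PartialEq)]
struct PacerStats {
    sent: u32,
    uniq: u32,
}

/// Hypothetical stand-in for the capture side. It yields a frame id only
/// when a new frame arrived on that tick, like a non-blocking `try_latest()`.
struct Source {
    arrivals: Vec<bool>, // one entry per pacer tick
    next_id: u64,
}

impl Source {
    fn try_latest(&mut self, tick: usize) -> Option<u64> {
        if self.arrivals.get(tick).copied().unwrap_or(false) {
            self.next_id += 1;
            Some(self.next_id)
        } else {
            None
        }
    }
}

/// Run `ticks` pacer iterations. Each tick sends one access unit: a fresh
/// frame if one arrived, otherwise a re-encode of the held frame.
fn run_pacer(source: &mut Source, ticks: usize) -> PacerStats {
    let mut stats = PacerStats::default();
    let mut held: Option<u64> = None;
    for tick in 0..ticks {
        if let Some(frame) = source.try_latest(tick) {
            held = Some(frame);
            stats.uniq += 1;
        }
        if held.is_some() {
            // Fresh or held, the outbound cadence is identical.
            stats.sent += 1;
        }
    }
    stats
}
```

Feeding it a source that delivers a frame on every third tick of 240 gives a sent count equal to the tick count and a `uniq` count of one third of it, which is the "high send rate, low new_fps" signature the code comment describes.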
The encode round-trip (NVENC, the dominant path):
- `submit` → `encode_picture` (`encode/windows/nvenc.rs:722`) is a **non-blocking** ASIC launch; it
pushes onto a `pending` FIFO.
- `poll` → `lock_bitstream` (`nvenc.rs:801`) **blocks the same thread** until that frame's encode
completes. The session is **synchronous** — no `enableEncodeAsync`, no completion event.
- The only thread split is **encode-vs-network-send**, never submit-vs-retrieve.
So at depth-1 the loop is strictly serial: `capture (+convert) → submit → block in lock_bitstream →
hand AU to the send thread`. The arithmetic matches the symptom — `1000/17 ≈ 59` and `1000/13 ≈ 77`
fps bracket the observed ~50, the signature of **one frame in flight per round-trip**, not an ASIC
throughput wall.
([independent NVENC latency study: ~7 frames across all presets](https://arxiv.org/html/2511.18688v2))
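
The arithmetic above can be written down as a tiny model. This is an idealized sketch under stated assumptions: `serial_fps_ceiling` is just the reciprocal of the round-trip, and `pipelined_fps_ceiling` assumes frames overlap perfectly up to `in_flight` and that the slowest single stage (`slowest_stage_ms`, a hypothetical input) caps throughput. Neither function exists in the repo.

```rust
/// Upper bound on frames per second when exactly one frame is in flight
/// and each frame costs `round_trip_ms` from capture to bitstream.
fn serial_fps_ceiling(round_trip_ms: f64) -> f64 {
    1000.0 / round_trip_ms
}

/// Idealized bound with `in_flight` frames overlapping. Throughput scales
/// with depth until the slowest individual stage becomes the limit.
/// Assumption: perfect overlap, no scheduler starvation.
fn pipelined_fps_ceiling(round_trip_ms: f64, in_flight: u32, slowest_stage_ms: f64) -> f64 {
    let by_depth = f64::from(in_flight) * 1000.0 / round_trip_ms;
    let by_stage = 1000.0 / slowest_stage_ms;
    by_depth.min(by_stage)
}
```

With a 13 ms and a 17 ms round-trip, `serial_fps_ceiling` reproduces the ≈77 and ≈59 fps figures quoted above. The second function is only an upper bound: under contention the scheduler share, not depth, decides how much of it is reachable.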
Where the per-frame GPU work lands, by path (this is the crux of contention):
| Path | Colour-convert | Extra copy | NVENC input | Contended-engine load/frame |
|---|---|---|---|---|
| **IDD-push** (install default) | **none → NVENC internal RGB→YUV on the SM** | `CopyResource` BGRA→out-ring (3D), `idd_push.rs:743` | **BGRA/Rgb10a2** | **highest** (SM CSC + 3D copy) |
| **WGC** (fallback default) | `VideoProcessorBlt` → NV12 on the **video engine**, `wgc.rs:631` | none (encodes pool texture in place) | NV12/P010 | low |
| **DDA** | `VideoProcessorBlt` → NV12 on the **video engine**, `dxgi.rs:1657-1762` | one `CopyResource` (3D) to release the dup fast, `dxgi.rs:3099` | NV12/P010 | medium |
| **Linux NVENC** | **none → NVENC internal RGB→YUV on the SM** (default) | CUDA dev→dev copy + `cuStreamSynchronize` | RGBZ/BGRZ (NV12 only if `PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY`) | high |
Measured magnitude of "RGB vs NV12 to the encoder":
[**RGB input ≈ video-engine 40% + 3D/CUDA 15%; NV12 input ≈ video 26% + 3D 2%**](https://hardforum.com/threads/can-someone-explain-to-me-how-nvenc-obs-work-with-nvidia-gpus-and-the-gpu-load-they-cause.2025896/).
NVENC's guide confirms the mechanism: *"Encoding of RGB contents"* is on the explicit list of
features that **internally use CUDA**
([NVENC prog-guide §Encoder Features using CUDA](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html)).
---
## 3. Diagnose first — cheap, decisive, do before any code
Everything in §5 is gated on knowing whether you're fighting bottleneck (a) or (b). The dev VM
cannot reproduce this — run on the **RTX 4090 Windows box** (and a real NVIDIA Linux box) with an
actual saturating game.
1. **Run with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`** under CS2 at GPU-100%:
- `fps`≈target but `uniq`→40–50 ⇒ **(b) source ceiling**: the compositor/IDD only produced
40–50 unique frames. No encode/priority fix exceeds that number. Go to §5.F.
- both `fps` and `uniq`→40–50, with `encode_ms` 13–17 ⇒ **(a) feed contention**: the round-trip
is starving. Go to §5.A/B/C.
2. **Classify the game's presentation with [PresentMon](https://github.com/GameTechDev/PresentMon)**:
"Presented FPS" vs "Displayed FPS" and **Presentation Mode** (Hardware: Independent Flip vs
Composed: Flip). Independent-Flip + `uniq` ≪ Presented ⇒ source/flip problem; **Presented FPS
itself** collapsed ⇒ the game is genuinely GPU-bound and no capture trick invents the missing
frames.
3. Log `cap_us` / `enc_us` / `pace_us` p50/p99 alongside to localise the stall.
> **Necessary-but-not-sufficient caveat:** if the game only *rendered* 50 frames because it's
> GPU-bound, **nothing downstream creates the other 90**. Source fixes address (b) only; the
> throughput of a saturated single GPU is split between game and host no matter what.
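
The `uniq`-vs-`fps` decision in step 1 can be expressed as a small classifier. This is a sketch only: the `Bottleneck` enum, the `classify` function, and the 0.9 "close enough to target" fraction are assumptions made for illustration, not values taken from the repo.

```rust
/// Which of the two bottlenecks the counters point at.
#[derive(Debug, PartialEq)]
enum Bottleneck {
    /// (a) both `fps` and `uniq` collapsed: the feed round-trip is starving.
    FeedContention,
    /// (b) `fps` holds at target but `uniq` collapsed: the source is the wall.
    SourceCeiling,
    /// `uniq` is close to target: nothing to fix.
    Healthy,
}

/// Hypothetical threshold: a counter within 90% of target counts as "at target".
const AT_TARGET_FRACTION: f64 = 0.9;

/// Classify one measurement window from the perf counters.
fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bottleneck {
    let floor = target_fps * AT_TARGET_FRACTION;
    if uniq >= floor {
        Bottleneck::Healthy
    } else if fps >= floor {
        // Outbound rate held by re-encoding the held frame.
        Bottleneck::SourceCeiling
    } else {
        Bottleneck::FeedContention
    }
}
```

A window with `fps` near 240 and `uniq` near 45 classifies as the source ceiling, while `fps` and `uniq` both near 50 classifies as feed contention. The PresentMon check in step 2 is still needed to tell a source ceiling from a game that is simply GPU-bound.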
---
## 4. Current-state audit (what's shipped / regressed / missing)
| Area | State | Where |
|---|---|---|
| Thread priority (Win) | HIGH class + MMCSS "Games" + 1 ms timer | `session_tuning.rs` ✅ |
| Thread priority (Linux) | `setpriority` 10/5 — **native path only; GameStream Linux threads get none** | `punktfunk1.rs:1977` ⚠ |
| GPU sched priority | `D3DKMTSetProcessSchedulingPriorityClass` **HIGH(4)** default; `realtime` opt-in, no auto-gate; cross-process onto WGC helper | `capture/windows/dxgi.rs:208-330` ⚠ |
| GPU thread/latency | `SetGPUThreadPriority(0x4000001E)`, `SetMaximumFrameLatency(1)` | `dxgi.rs:193-200` ✅ |
| CSC off-SM (Win SDR) | WGC/DDA video-engine NV12 ✅ — **IDD-push (default) RGB→SM ✗** | `wgc.rs:631` / `idd_push.rs:545` |
| CSC off-SM (Win HDR) | on-SM unless `PUNKTFUNK_HDR_SHADER_P010` (default **off**) | `wgc.rs:603` ⚠ |
| CSC off-SM (Linux) | RGB→SM by default; NV12 is **double-opt-in** (`PUNKTFUNK_NV12`+`PUNKTFUNK_ZEROCOPY`) | `encode/linux/mod.rs:104` ⚠ |
| Encode pipeline | depth-1 synchronous, inline `lock_bitstream`; IDD-push native = depth-2 same-thread | `nvenc.rs:801` ⚠ |
| Split-encode | 2-way >1 Gpix/s (HEVC/AV1); disabled 10-bit (correct); proper enum | `nvenc.rs:424-447` ✅ |
| Zero-copy register-in-place | yes (no encoder-owned pool copy) — IDD-push adds its own out-ring copy | `nvenc.rs:623` ✅/⚠ |
| AMF tuning | `usage=ultralowlatency`, `preanalysis=false` | `ffmpeg_win.rs:215-219` ✅ |
| QSV tuning | `async_depth=1`, `low_power=1` (VDEnc) | `ffmpeg_win.rs:226-227` ✅ |
| Intra-refresh / infinite GOP | yes (killed the periodic-IDR freeze) | ✅ |
| encode\|send split + paced send + sendmmsg + 32 MB sockbuf | yes | `stream.rs`, `transport/qos.rs` ✅ |
| **Clock / P-state pin** | **none** (zero hits repo-wide) | ✗ |
| **Async NVENC (2-thread)** | **none** | ✗ |
| **Frame-source escape (hook/NvFBC-Linux)** | **none** | ✗ |
| **Second-GPU / iGPU encode offload** | **none** | ✗ |
| DSCP/QoS | implemented, `PUNKTFUNK_DSCP` opt-in (default off) | `transport/qos.rs` ⚠ |
---
## 5. The levers, ranked, with honest verdicts
### A. Stop feeding NVENC RGB on the default path — **highest in-our-control win**
The default Windows capture path (IDD-push) and the default Linux path both hand NVENC packed RGB,
forcing NVENC's internal RGB→YUV CSC onto the SM the game saturates. The WGC and DDA paths already
solved this by doing the CSC with `ID3D11VideoProcessor::VideoProcessorBlt` (video engine) and
feeding NV12/P010. **Make IDD-push and Linux do the same.**
- **Windows IDD-push:** add a `VideoProcessorBlt` BGRA→NV12 (SDR) / FP16→P010 (HDR) step into the
out-ring, exactly like `wgc.rs:631` / `dxgi.rs:1657-1762`, and feed `NV_ENC_BUFFER_FORMAT_NV12` /
`..._YUV420_10BIT`. This *also* lets you drop the separate `CopyResource` (the convert writes the
out-ring), removing **both** contended-engine ops per frame. Plug it into `SessionPlan`
(`session_plan.rs`, the single owner of the capture/encode decision) so capture and encode can't
disagree on the format.
- **Linux:** make NV12 the **default** for the tiled zero-copy path (it's gated behind
`PUNKTFUNK_NV12` *and* `PUNKTFUNK_ZEROCOPY` today — `encode/linux/mod.rs:104`,
`linux/zerocopy/egl.rs:272`), and feed NVENC `NV_ENC_BUFFER_FORMAT_NV12`. The GL detile already
runs; emitting NV12 from it replaces the swizzle at ~equal cost and deletes NVENC's CSC.
- **Windows HDR:** flip `PUNKTFUNK_HDR_SHADER_P010` on by default (or, better, use a video-engine
P010 convert where the VP supports it).
**Verdict: REAL, but honestly *conditional*.** Feeding NV12 provably removes NVENC's internal CUDA
CSC — but the convert has to land **off** the SM to fully pay off. `VideoProcessorBlt` is *designed*
to use fixed-function video hardware and the hardforum numbers back the 15%→2% drop, **but no NVIDIA
doc explicitly confirms `VideoProcessorBlt` runs off-SM on GeForce** — treat the "video engine" claim
as well-founded-but-unverified and confirm on-box with `nvidia-smi dmon` (watch the `enc`/`sm`
columns) before and after. Do **not** convert with a CUDA/3D shader and call it done — that just
relocates the CSC to the same SM (Sunshine's RGB→NV12 CUDA kernel still contends).
### B. A *correct* async encode pipeline (the untried encoder lever)
The NVENC Programming Guide is explicit: *"The main encoder thread should be used only to submit
work… (non-blocking `NvEncEncodePicture`). Output buffer processing — waiting on the completion
event in asynchronous mode, or calling `NvEncLockBitstream` in synchronous mode — should be done in
the **secondary thread**."*
([NVENC prog-guide, threading model](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html))
We do the opposite — submit and blocking-retrieve on **one** thread. Queuing more `pending` entries
(IDD-push depth-2, or the abandoned wgc_helper experiment) adds queue latency with **no overlap**,
which is exactly the "deeper pipeline only stacks latency" result we recorded. It was the wrong
implementation, not a disproof.
The fix: **submit on the capture/encode thread; do `lock_bitstream` on a dedicated retrieve thread;
hold a deep input+output surface pool (≈4–8); on Windows register a `completionEvent` per output
buffer (`enableEncodeAsync=1`) — on Linux async events are unsupported, so use the same two-thread
split with a blocking retrieve.**
([async is Windows/WDDM-only](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-video-encoder-api-prog-guide/index.html);
FFmpeg models the same knob as `delay`/`async_depth`,
[libavcodec/nvenc.c](https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c)).
This lets the WDDM scheduler find a **backlog** when it finally grants the encoder context a slice,
and drain several frames back-to-back, while the ASIC encodes frame N as the contended engines do
frame N+1's convert.
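
The two-thread split can be sketched with standard-library channels. This is a shape illustration, not NVENC code: `Pending`, `blocking_retrieve`, and `run_pipeline` are hypothetical, the "encode" is a sleep, and the bounded channel stands in for the surface pool (its capacity plays the role of the pool depth).

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;
use std::time::Duration;

/// Hypothetical handle for one submitted frame. A real one would own the
/// input surface and the output bitstream buffer.
struct Pending {
    frame_id: u64,
}

/// Stand-in for the blocking retrieve: waits for the encode to finish and
/// returns the access unit. Here it sleeps and returns the id as bytes.
fn blocking_retrieve(pending: Pending, encode_time: Duration) -> Vec<u8> {
    thread::sleep(encode_time);
    pending.frame_id.to_be_bytes().to_vec()
}

/// Submit `frames` frames on the calling thread and retrieve them on a
/// dedicated thread. `depth` bounds how many frames may be queued.
fn run_pipeline(frames: u64, depth: usize, encode_time: Duration) -> Vec<Vec<u8>> {
    let (tx, rx): (SyncSender<Pending>, Receiver<Pending>) = sync_channel(depth);

    // Retrieve thread: the only place that blocks on encode completion.
    let retriever = thread::spawn(move || {
        let mut access_units = Vec::new();
        for pending in rx {
            access_units.push(blocking_retrieve(pending, encode_time));
        }
        access_units
    });

    // Submit side: returns immediately unless the pool is exhausted.
    for frame_id in 0..frames {
        tx.send(Pending { frame_id })
            .expect("retrieve thread exited early");
    }
    drop(tx); // closing the channel lets the retrieve thread finish

    retriever.join().expect("retrieve thread panicked")
}
```

The design point is that the submit loop never waits on a completed encode; it only blocks when `depth` frames are already queued, which is the back-pressure a fixed surface pool would impose.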
**Verdict: REAL throughput recovery for the depth-1 collapse, latency cost +1–2 frames, ceiling-bounded.**
The honest bound (and why this is *second* to §A/§C): pipelining cannot manufacture GPU time — if the
scheduler grants the encode context only X% under load, depth only guarantees work is *ready* for
each grant; it can't raise X. That is why Sunshine's documented lever for "GPU heavily loaded" is
**priority**, not depth. So §B recovers the serialization loss; §A/§C raise the share it's bounded by.
Watch out: this **forecloses sub-frame slice output** (mutually exclusive with `enableEncodeAsync`),
and HAGS can spike the *submit* call itself
([100–200 ms `nvEncEncodePicture` stalls under HAGS](https://forums.developer.nvidia.com/t/windows-11-hardware-accelerated-gpu-scheduling-issue/286128)).
### C. Auto-gated REALTIME GPU scheduling priority
Raising the host process's WDDM GPU priority is **the** proven single-PC production lever — OBS and
Sunshine both set `D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME` to stop being descheduled behind
fullscreen games
([OBS commit](https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0),
[Sunshine `display_base.cpp`](https://raw.githubusercontent.com/LizardByte/Sunshine/master/src/platform/windows/display_base.cpp)).
It works **independently of HAGS** (HAGS does *not* reassign cross-process priority — Microsoft:
*"Windows continues to control prioritization"*
[DirectX devblog](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/)).
We ship only **HIGH(4)** by default with a static `realtime` opt-in and **no auto-gate**. Two things
to change:
- **We can actually grant REALTIME.** It needs `SeIncreaseBasePriorityPrivilege`, which an unelevated
app lacks (OBS logs the failure) — **but our host runs as a `LocalSystem` service, which holds it.**
The lever is available to us specifically.
- **Gate it to dodge the freeze.** REALTIME + NVIDIA + HAGS-on + near-full-VRAM is a **documented
NVENC hang** (Sunshine ships `nvenc_realtime_hags` to downgrade to HIGH for exactly this;
[Sunshine config](https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2configuration.html),
[NVIDIA repro](https://forums.developer.nvidia.com/t/bug-report-nvenc-encoder-hangs-on-windows-when-using-d3d11-in-real-time-mode/357466)).
Implement the old plan's "Tier 3B": probe HAGS via `D3DKMTQueryAdapterInfo` and VRAM headroom via
`IDXGIAdapter3::QueryVideoMemoryInfo` (continuously); use REALTIME only when HAGS-off, or HAGS-on
with comfortable VRAM headroom; downgrade to HIGH the instant VRAM tightens.
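
The gate described in the second bullet reduces to a pure decision function once the probes have run. The sketch below assumes the HAGS state and VRAM numbers are already available (for example the `Budget` and `CurrentUsage` fields that `QueryVideoMemoryInfo` returns). `GateInput`, `choose_priority`, and the 15% headroom threshold are hypothetical; the right threshold would need on-box measurement.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the probes the gate depends on.
struct GateInput {
    hags_enabled: bool,
    vram_budget_bytes: u64,
    vram_usage_bytes: u64,
}

/// Hypothetical definition of "comfortable" headroom: at least 15% of the
/// budget still free. Not a measured value.
const MIN_HEADROOM_PERCENT: u128 = 15;

/// REALTIME when HAGS is off, or when HAGS is on with comfortable VRAM
/// headroom; HIGH otherwise.
fn choose_priority(input: &GateInput) -> GpuPriority {
    if !input.hags_enabled {
        return GpuPriority::Realtime;
    }
    let free = input.vram_budget_bytes.saturating_sub(input.vram_usage_bytes);
    // Compare free/budget against the threshold without floating point.
    let comfortable = input.vram_budget_bytes > 0
        && u128::from(free) * 100 >= u128::from(input.vram_budget_bytes) * MIN_HEADROOM_PERCENT;
    if comfortable {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Calling this on every VRAM poll gives the "downgrade the instant VRAM tightens" behaviour; in practice some hysteresis around the threshold would likely be wanted so the class does not flap.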
**Verdict: REAL — the genuine ceiling-raiser — but it is the no-free-lunch lever.** Priority is how
the host *takes* GPU time from the game; it measurably **costs the game fps**
([Doom Eternal 121→60 with Sunshine running](https://github.com/LizardByte/Sunshine/issues/3703)).
That's acceptable for a streaming host (the remote view is the product), but say so plainly and make
the class operator-configurable (we already expose `PUNKTFUNK_GPU_PRIORITY_CLASS`).
### D. Multi-vendor encoder hygiene (AMF/QSV) — mostly done, one caveat
Our `*_amf`/`*_qsv` libavcodec config already follows the research's advice: AMF
`usage=ultralowlatency` + `preanalysis=false` (`ffmpeg_win.rs:215`), QSV `async_depth=1` +
`low_power=1` VDEnc path (`:226`). Keep them. Two notes:
- **AMF/QSV suffer contention *worse* than NVENC.** OBS: *"For Intel and AMD GPUs, the hardware
encoder requires significant resources of the same type a 3D app/game requires… different from
NVIDIA's NVENC, which has dedicated encoding circuits"*
([OBS KB](https://obsproject.com/forum/threads/how-to-debug-encoding-overloaded.168625/)). So on an
AMD/Intel host the collapse is *expected to be harder* — and §G (iGPU offload) is even more
attractive there.
- **The AMF busy-poll floor** (a fixed-sleep `QueryOutput` poll imposes ~15 ms via timer
granularity) is fixed in FFmpeg's amf wrapper (Cameron Gutman's `QUERY_TIMEOUT` patch); since we
go through libavcodec we inherit it — just **confirm the pinned FFmpeg build includes it**.
([ffmpeg-devel](https://www.mail-archive.com/ffmpeg-devel@ffmpeg.org/msg170489.html))
**Verdict: REAL but largely already captured.** No big win left here except via §G.
### E. Lock clocks / pin P-state — cheap jitter fix, not a collapse fix
NVIDIA's adaptive clocking downclocks between our small bursty frames and pays a ramp tax every
frame — most visible in the *light* scene (the "200-not-240"). Pin it:
- **Windows:** NvAPI per-application DRS `PREFERRED_PSTATE = PREFER_MAX` scoped to our exe (this is
exactly Sunshine's `nvenc_latency_over_power`,
[Sunshine nvprefs](https://github.com/LizardByte/Sunshine/blob/master/src/platform/windows/nvprefs/driver_settings.cpp)).
**Crash-safe undo is mandatory** — persist an undo record to `%ProgramData%\punktfunk\` *before*
applying, revert a stale profile on next start, so a crash never leaves the user's control panel
modified.
- **Linux:** `nvidia-smi -lgc`/NVML `nvmlDeviceSetGpuLockedClocks` (needs root/`CAP_SYS_ADMIN`; query
`nvmlDeviceGetMaxClockInfo`, lock to that, restore on teardown *and* SIGTERM). Plus the newly-added
`CudaNoStablePerfLimit` driver profile — *new in R580/595, so usable on the 595 box* — to defeat
the CUDA "Force P2" memory-clock clamp.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default off on battery / Steam Deck** (pinning is harmful
there).
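
The crash-safe undo requirement ("persist before applying, revert a stale record on next start") can be sketched with plain file I/O. Everything here is illustrative: `apply_pinned` and `revert_stale` are hypothetical helpers, the undo record is a single string holding the previous setting, and the actual NvAPI or NVML calls are represented by closures.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Persist the previous value to disk, flush it, and only then apply the
/// new setting. If the process dies after this point the record survives.
fn apply_pinned(
    undo_path: &Path,
    previous_value: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    let mut file = fs::File::create(undo_path)?;
    file.write_all(previous_value.as_bytes())?;
    file.sync_all()?; // make sure the record is on disk before touching the driver
    apply()
}

/// On start (and on clean teardown): if an undo record exists, restore the
/// recorded value and delete the record. Returns whether a revert happened.
fn revert_stale(
    undo_path: &Path,
    revert: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    match fs::read_to_string(undo_path) {
        Ok(previous_value) => {
            revert(&previous_value)?;
            fs::remove_file(undo_path)?;
            Ok(true)
        }
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(err) => Err(err),
    }
}
```

The same `revert_stale` call serves both the clean-shutdown path and the next-start recovery path, so a crash between the two leaves nothing that the next launch cannot undo.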
**Verdict: REAL for latency *stability*, marginal for the saturated collapse** (at 100% util the game
already pins P0). Cheap, low risk, do it for the light-scene win.
### F. Escape the frame-source ceiling — only if §3 says (b)
If `uniq` is the wall, no encoder/priority work helps — you need a better frame source.
- **Swapchain-hook capture (the real fix).** Inject a hook on `IDXGISwapChain::Present`/`Present1`,
`vkQueuePresentKHR`, `wglSwapBuffers` and copy the backbuffer to a shared texture *before* the
compositor — OBS Game Capture's mechanism. Sees **every presented frame**, no compose/refresh
gating.
([OBS dxgi-capture](https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/dxgi-capture.cpp))
**Tradeoffs are serious:** anti-cheat (EAC/BattlEye/Vanguard) flags injection — needs
whitelisting/compat handling; per-graphics-API hooks; fragility across game updates. Scope it as an
opt-in "game capture" mode, not the default.
- **NvFBC:** **not an option on Windows** (dead, §1). On **Linux** it's viable via the consumer
keylase patch and captures below composition — worth a flag for the Linux NVIDIA host.
- **Compose-flip (narrow):** the topmost 1×1 layered-window trick (we already have
`composed_flip.rs`) forces DWM composition and fixes specifically the **DLSS-Frame-Gen** half-rate
case. Adds host-display latency; don't enable globally.
- **WGC "deliver 2× rate":** Apollo sets `MinUpdateInterval = 1e7/(fps*2)` so the pacer always has a
fresh frame to pick ([Apollo](https://github.com/ClassicOldSong/Apollo/pull/785)); we set it to 1×
refresh (`wgc.rs:310`). Cheap tweak to try on the WGC path.
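
The `MinUpdateInterval = 1e7/(fps*2)` expression from the WGC bullet is a conversion into 100 ns ticks. A minimal helper, assuming integer truncation is acceptable and using the hypothetical name `min_update_interval_ticks`:

```rust
/// Intervals are expressed in 100 ns ticks, so one second is 10_000_000 ticks.
const TICKS_PER_SECOND: u64 = 10_000_000;

/// Interval in 100 ns ticks for delivering frames at `fps * rate_multiplier`.
/// Returns `None` when the requested rate is zero or overflows.
fn min_update_interval_ticks(fps: u32, rate_multiplier: u32) -> Option<u64> {
    let per_second = u64::from(fps).checked_mul(u64::from(rate_multiplier))?;
    if per_second == 0 {
        return None;
    }
    Some(TICKS_PER_SECOND / per_second)
}
```

A `rate_multiplier` of 2 corresponds to the Apollo setting quoted above, and 1 to the current `wgc.rs` behaviour, so the tweak is a one-argument change at the call site.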
**Verdict: swapchain-hook is REAL and the only general escape; the rest are narrow.** None invents
frames the game didn't render.
### G. The honest endgame — encode on a second GPU / the iGPU
For *demanding* titles that saturate the GPU even when capped, the only thing that **removes**
contention rather than re-prioritizing it is to run the capture→convert→encode pipeline on a
**different** GPU — a second dGPU or, more realistically, the **iGPU** (Intel QuickSync / AMD VCN),
which most desktops already have. Render on the gaming GPU, copy the frame across the adapter once,
encode on the iGPU's independent media engine. This is the textbook "stream on a separate encoder"
play, and the OBS "second GPU is harmful" verdict does **not** apply — that verdict is about moving
*only the NVENC block*; moving capture + CSC + copies off the gaming GPU genuinely frees it.
([OBS forum](https://obsproject.com/forum/threads/can-you-use-a-2nd-gpu-to-eliminate-encoder-overload.149644/))
We're unusually well-placed for this: we already have working AMF and QSV backends
(`encode/windows/ffmpeg_win.rs`) and the Linux VAAPI backend. The missing piece is a capture/topology
mode that pins capture to the gaming adapter and the encoder to the iGPU adapter, with one
cross-adapter shared-texture copy. Cost: that copy still shares VRAM bandwidth, so it's not free, but
it's the only path that lets a demanding game and a clean stream coexist on one machine.
**Verdict: REAL — the cleanest isolation, and the right answer to "even capped it collapses."**
Datacenter stacks (GeForce NOW, Stadia) "solve" this by one dedicated GPU + encoder per session;
the consumer analogue is the iGPU.
---
## 6. Recommended order of attack
1. **§3 Diagnose** on the RTX box + a real game. Settles (a) vs (b). *(half a day, decisive)*
2. **§5.A NV12/P010 on the default paths** (IDD-push video-engine convert; Linux NV12 default-on;
Windows HDR P010 default). Biggest in-our-control floor-raise; confirm off-SM with `nvidia-smi dmon`.
3. **§5.C Auto-gated REALTIME** priority (HAGS + VRAM gate). Cheap, big, we can uniquely grant it.
4. **§5.E Clock pin** both OSes (crash-safe undo). Cheap light-scene win.
5. **§5.B Correct two-thread async pipeline.** Structural; recovers the depth-1 serialization.
6. **§3-gated §5.F** source escape (swapchain hook) — only if `uniq` is the wall.
7. **§5.G iGPU encode offload** — the strategic answer for demanding titles; larger build.
After 2–5 the light-scene gap closes and the saturated floor rises materially. But report the
honest ceiling: **on one saturated GPU the game and the host split a fixed pie** — coarse WDDM
graphics preemption caps how much priority can claw back, and a genuinely GPU-bound game that only
*rendered* 50 frames cannot also yield 140 unique frames to capture. The only escapes from that pie
are reducing the game's demand (cap — rejected), taking a bigger slice (priority — costs game fps),
or a second slice of silicon (§G). Don't chase the rest with encoder micro-optimisation.
---
## 7. Placebos & dead ends (so we don't re-propose them)
| Candidate | Verdict | Why |
|---|---|---|
| **NVIDIA Reflex / Ultra-Low-Latency / max-pre-rendered-frames** as a "non-capping yield" | ✗ placebo | Shrinks the *game's* render queue but the game still demands ~99% GPU → frees ≈0 SM headroom. Reflex needs in-game SDK (host can't force it); ULLM is host-forceable only on DX11/DX9 (DX12 since driver 551.23) and is NVIDIA's weaker mechanism. Only honest effect: µs of tail-jitter smoothing. ([Battle(non)sense LDAT data](https://forums.guru3d.com/threads/battle-non-sense-youtuber-claims-low-latency-mode-only-helps-when-gpu-load-is-99.429074/)) |
| **HAGS on, as a contention fix** | ✗ neutral→harmful | Doesn't reassign cross-process priority (Microsoft); OBS reports it *causes* NVENC latency spikes; it's the freeze-hazard variable. Needed only to enable the VK/D3D12 realtime *queue*. ([OBS KB](https://obsproject.com/kb/hags)) |
| **Split-frame encode (2/3/4-way) to fix contention** | ✗ (pixel-rate only) | Parallelizes the ASIC, not the contended copy/CSC; measured **zero** latency change at 4K. Correct use = raise the single-session pixel ceiling (5K@240). `splitEncodeMode=15` is the legit *disable* sentinel, not a bug. ([SDK header](https://raw.githubusercontent.com/FFmpeg/nv-codec-headers/master/include/ffnvcodec/nvEncodeAPI.h)) |
| **Move the encoded-bitstream readback to a copy engine** | ✗ placebo | Output is KB-scale; the cost of `lock_bitstream` is the completion *wait*, not copy bandwidth. (The *input* full-frame copy is the real one — but D3D11 can't target the copy engine; zero-copy already avoids it.) |
| **CUDA stream priority / `CUDA_DEVICE_MAX_CONNECTIONS` / `CU_CTX_SCHED_*`** | ✗ placebo cross-process | Intra-context only; the game is a *separate* context. Stream priority "will not preempt already executing work". ([CUDA docs](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html)) |
| **VK/EGL global-priority REALTIME on Linux NVIDIA** | ✗ | Not reliably granted on the proprietary driver, and moot anyway — our Linux NVENC is driven via CUDA/NVENC-SDK, not a Vulkan queue. |
| **Windows "High performance" GPU preference** | ✗ single-GPU placebo | Only selects an adapter; real only to split work across adapters (→ that's §G). |
| **MIG / MPS / vGPU** | ✗ N/A | MIG/vGPU are datacenter/pro + hypervisor/license; MPS is Linux-CUDA-only with no graphics notion. None apply to a consumer GPU. |
| **NvFBC on Windows** | ✗ dead | Deprecated, frozen at Capture SDK 7.1 / Win10-1803. |
| **Frame Generation / Smooth Motion** to "make more frames" | ✗ red herring | We stream *rendered* frames; FG adds optical-flow/tensor + present load to the same GPU → amplifies contention. |
---
## 8. Open evidence gaps (flagged honestly)
- Whether `ID3D11VideoProcessor::VideoProcessorBlt` (BGRA→NV12) runs **off the SM on GeForce** is not
confirmed by any NVIDIA document — it's the linchpin of §5.A's full payoff. **Verify on-box** with
`nvidia-smi dmon` (sm% vs enc%) on the WGC path before assuming IDD-push will match it.
- The exact share of the 13–17 ms `encode_ms` that is *convert-on-SM* vs *scheduling-wait* is
unmeasured. §3 + an A/B of IDD-push-RGB vs IDD-push-NV12 on the same scene settles it and tells you
whether §5.A alone is enough or whether §5.C is doing the heavy lifting.
- AMD VCN "degrades worse under contention" is practitioner-consensus + architecture, not an AMD
whitepaper; treat the *direction* as solid, the magnitude as TBD.
@@ -0,0 +1,270 @@
# HDR pipeline — investigation & implementation plan
Goal: **true, correct HDR glass-to-glass** for punktfunk, across the host (Windows today;
Linux blocked upstream) and every client (Windows / Apple / Android / Linux).
This is an audit of the current state, the gap list, and a phased plan. It was produced from
a full read of every HDR-touching subsystem cross-checked against the HDR10 standards
(CICP/H.273 VUI, SMPTE ST.2086 mastering, CEA-861.3 MaxCLL/MaxFALL) and the
Sunshine/Apollo/Moonlight reference implementation.
> Status legend: **blocker** (HDR can't work) · **correctness** (HDR works but looks wrong) ·
> **quality** (correct-ish, missing refinement) · **ok**.
---
## TL;DR
Our HDR is **correct in isolated islands but broken end-to-end.** The pixel math and the HEVC
VUI we *do* emit are right (self-test validated, matches Apollo). What's missing is the
**metadata chain**: nothing measures, signals, transports, or applies the *static HDR metadata*
(mastering display colour volume + content light level) that tells a display how to tone-map —
so every client hardcodes generic values or infers from the bitstream, and one line
(`abi.rs:896`, `video_caps = 0`) makes the entire (correct) Apple HDR pipeline dead code.
---
## What's already correct (the islands)
| Stage | Where |
|---|---|
| Windows host HEVC **VUI** — primaries=9 (BT.2020) / transfer=16 (PQ) / matrix=9 (BT.2020-NCL) / limited range | `encode/nvenc.rs:307-316` |
| Windows host **scRGB→BT.2020 PQ** shader (×80 nits → BT.709→2020 → ST.2084 OETF, 10000-nit peak) | `capture/dxgi.rs` — self-test `<1` code error, matches Apollo |
| Windows client **P010 decode + YUV→RGB** (BT.2020-NCL, limited→full) + **R10G10B10A2 / G2084-P2020 swapchain** | `present.rs:66-77, 320-370` |
| Android client **Main10 decode + reactive DataSpace** (BT2020-PQ/HLG) | `decode.rs:210-227` |
| Apple client decode/present **code** (P010 VideoToolbox, BT.2020 PQ Metal, `itur_2100_PQ` + EDR) | correct — but never runs (blocker #2) |
## Gap list
### Blockers
1. **No color-metadata transport in the protocol** *(the keystone).* The wire carries only
`Hello.video_caps` (10BIT/HDR bits) and `Welcome.bit_depth` (8/10) — `quic.rs:127-128`
explicitly defers color. No primaries/transfer/matrix/range, **no ST.2086 mastering, no
MaxCLL/MaxFALL**. ST.2086/CLL host→client is impossible by construction today.
2. **C ABI hardcodes `video_caps = 0`** (`abi.rs:896`) → Apple's complete HDR pipeline is dead
code; no ABI embedder can request HDR. One-line root cause.
3. **H.264 and AV1 emit zero color signaling on Windows** — the `if self.hdr` VUI block in
`nvenc.rs` only writes `hevcConfig`. Any H.264+10-bit or AV1+HDR stream decodes as BT.709 SDR.
*(AV1 is **not** a "copy the HEVC VUI" fix — AV1 has no VUI/SEI; it carries
primaries/transfer/matrix in the sequence-header `color_config` and mastering/CLL in
**METADATA OBUs** `HDR_MDCV`/`HDR_CLL`. Verify whether NVENC's AV1 path accepts them.)*
4. **Linux host is 8-bit only end to end** — capture offers only 8-bit PipeWire formats
(`capture/linux.rs:443-453, 594-654`; gamescope #2126, portals don't wire PipeWire 1.6
BT.2020/PQ); encode downgrades 10-bit (`encode/linux.rs:153-162` TODO, `vaapi.rs:719`) with
BT.709 hardcoded. The Windows-style 8-bit→Main10 upconvert shim is not implemented here.
5. **Linux client HDR is a complete non-feature**: `video_caps=0`, P010 decode path dead
(`video.rs:379`), CICP hardcoded BT.709 (`ui_stream.rs:239-243`), no Wayland
color-management (GTK4 0.11 too old).
### Correctness
6. **No host ever emits the ST.2086 mastering or CEA-861.3 CLL SEI.** Windows never reads
`IDXGIOutput6::GetDesc1`; `nvenc.rs` never builds an `NV_ENC_SEI_PAYLOAD`; Linux attaches no
libavcodec `side_data`. Apollo reads `GetDesc1` and attaches it.
7. **Clients hardcode mastering metadata.** `present.rs:584-595` ships fixed
`1000-nit / MaxCLL 1000 / MaxFALL 400` (with the literal "the protocol doesn't carry the
stream's real mastering metadata yet" comment). Apple/Android set none.
8. **HDR→SDR tone-mapping is unaddressed — and it's the common case.** Most client displays are
SDR. No client queries display peak; silent `SetColorSpace1`/`SetHDRMetaData` failures present
PQ as SDR gamma (crushed/dark). We lean entirely on OS auto-fallback.
9. **Windows secure desktop drops HDR to SDR** on lock/UAC (`dxgi.rs:325-368`,
`sudovda.rs:234-277`).
10. **GameStream silently streams SDR** on a Moonlight HDR request (`mod.rs:48-56`,
`rtsp.rs:288-293`) — logged, but no negotiated error. Real Apollo parity needs the Moonlight
`SS_HDR_METADATA` blob on the **ENet control channel**, not just in-band.
11. **Linux client software path is color-wrong even for SDR** — BT.601 applied to BT.709
(`video.rs:162-167`, no `color_state` on the texture). Standalone bug.
### Quality
12. No per-content MaxCLL/MaxFALL (`GetDesc1` doesn't expose it). No encoder-CSC-range vs
signaled-range reconciliation (black-crush risk). No automated 10-bit test — `probe` never
even reads `Welcome.bit_depth` (`main.rs:396-406`).
### Out of scope (call out, don't build)
- Dynamic metadata: HDR10+ (ST.2094-40) and Dolby Vision RPU. We handle *static* ST.2086 only,
with mid-stream changes carried by re-sending the static block (below).
- HLG: the colorimetry transfer enum carries `18` from day one (free), but the `0xCE` mastering
datagram is **omitted for HLG** (scene-referred, no mastering metadata).
---
## Protocol design (the keystone — pure-additive, hardware-free, CI-testable)
Two layers, both back-compat-safe via the established trailing-bytes / new-datagram-tag patterns.
### (A) Per-session colorimetry — 4 trailing bytes on `Welcome`
After the existing `bit_depth` (offset 59), append a fixed 4-byte CICP block at offsets 60..64.
(A future mirror on `Reconfigured` will announce a mid-stream SDR↔HDR / BT.709↔BT.2020 flip on the
control stream we already use for renegotiation — deferred to Step 1 with the mid-stream-flip work;
today a mode switch never changes the colour, and the `0xCE` re-send covers mastering changes.)
```
[60] colour_primaries (CICP: 1=BT.709, 9=BT.2020)
[61] transfer_characteristics (1=BT.709, 16=PQ/SMPTE2084, 18=HLG)
[62] matrix_coeffs (1=BT.709, 9=BT.2020-NCL) ← never emit 10 (CL): no client decodes it
[63] video_full_range_flag (0=limited, 1=full)
```
Decode with `b.get(60).unwrap_or(1)` etc. — an older host omits them → BT.709 limited SDR
(today's behavior). `Welcome` stays `Copy`. Modeled as a `ColorInfo` struct on the wire types
and exposed on `NativeClient` (with `bit_depth`) so clients *know* the colorimetry instead of
inferring it.
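
A minimal sketch of the back-compat decode described above, assuming the `Welcome` body is available as a byte slice. The field names, the `from_welcome_bytes` helper and the `is_hdr` rule (PQ or HLG transfer) are illustrative assumptions, not the shipped wire types.

```rust
/// Illustrative mirror of the 4-byte CICP block (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ColorInfo {
    pub primaries: u8,
    pub transfer: u8,
    pub matrix: u8,
    pub full_range: bool,
}

impl ColorInfo {
    /// What an older host implies by omitting the trailing bytes: BT.709 limited SDR.
    pub const SDR_BT709: ColorInfo = ColorInfo {
        primaries: 1,
        transfer: 1,
        matrix: 1,
        full_range: false,
    };

    /// Read offsets 60..64; every missing byte falls back to its SDR default.
    pub fn from_welcome_bytes(b: &[u8]) -> ColorInfo {
        ColorInfo {
            primaries: b.get(60).copied().unwrap_or(1),
            transfer: b.get(61).copied().unwrap_or(1),
            matrix: b.get(62).copied().unwrap_or(1),
            full_range: b.get(63).copied().unwrap_or(0) != 0,
        }
    }

    /// Assumed rule: PQ (16) or HLG (18) transfer means an HDR stream.
    pub fn is_hdr(&self) -> bool {
        matches!(self.transfer, 16 | 18)
    }
}
```

Because each byte is read with `get`, a short `Welcome` from an older host never indexes out of bounds and decodes to the same values clients assume today.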
### (B) Per-change mastering + CLL — a new host→client datagram, tag `0xCE`
ST.2086 is variable and changes mid-stream, so it rides a datagram (next tag after `0xCD`
HIDOUT), demuxed in `client.rs` like AUDIO/RUMBLE/HIDOUT. 28 bytes, standard SEI fixed-point:
```
[0] = 0xCE
G.x G.y B.x B.y R.x R.y 6 × u16 LE display primaries, 1/50000 units
wp.x wp.y 2 × u16 LE white point, 1/50000 units
max_display_mastering_luminance u32 LE 0.0001 cd/m²
min_display_mastering_luminance u32 LE 0.0001 cd/m²
max_cll u16 LE nits
max_fall u16 LE nits
```
- Sent on session start and whenever `GetDesc1`/source mastering changes; **re-sent on every
IDR/RFI keyframe** so a client that dropped the (best-effort) datagram converges within a GOP.
Until first receipt the client uses the Welcome transfer + a documented generic default.
- **Bounds-check length before reading** (reassembler-bounds security invariant) — truncation
test required.
- **Omitted entirely for HLG.**
- Units note: these map straight to DXGI `DXGI_HDR_METADATA_HDR10`, Android `KEY_HDR_STATIC_INFO`,
and Apple `CAEDRMetadata.hdr10`. On the **libavcodec/Linux** side they need conversion —
`AVMasteringDisplayMetadata` stores `AVRational`, not raw fixed-point.
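
The layout above can be sketched as a bounds-checked codec. This is an illustration only: the struct and method names are assumptions, and because the listed fields sum to 28 bytes after the tag byte, the sketch treats the datagram as 29 bytes including the tag. Adjust `HDR_META_LEN` if the real codec counts differently.

```rust
pub const TAG_HDR_META: u8 = 0xCE;
/// Assumption: 1 tag byte + 28 bytes of fields.
pub const HDR_META_LEN: usize = 1 + 28;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct HdrMeta {
    /// Display primaries in ST.2086 order G, B, R; each [x, y] in 1/50000 units.
    pub primaries_gbr: [[u16; 2]; 3],
    pub white_point: [u16; 2],
    /// Mastering luminance in 0.0001 cd/m² units.
    pub max_luminance: u32,
    pub min_luminance: u32,
    /// Content light levels in nits.
    pub max_cll: u16,
    pub max_fall: u16,
}

impl HdrMeta {
    pub fn encode(&self) -> [u8; HDR_META_LEN] {
        let mut out = [0u8; HDR_META_LEN];
        out[0] = TAG_HDR_META;
        let mut i = 1;
        for v in self.primaries_gbr.iter().flatten().chain(self.white_point.iter()) {
            out[i..i + 2].copy_from_slice(&v.to_le_bytes());
            i += 2;
        }
        out[i..i + 4].copy_from_slice(&self.max_luminance.to_le_bytes());
        out[i + 4..i + 8].copy_from_slice(&self.min_luminance.to_le_bytes());
        out[i + 8..i + 10].copy_from_slice(&self.max_cll.to_le_bytes());
        out[i + 10..i + 12].copy_from_slice(&self.max_fall.to_le_bytes());
        out
    }

    /// Returns None for a wrong tag or a truncated datagram; the length is
    /// checked before any field is read.
    pub fn decode(d: &[u8]) -> Option<HdrMeta> {
        if d.len() < HDR_META_LEN || d[0] != TAG_HDR_META {
            return None;
        }
        let u16_at = |o: usize| u16::from_le_bytes([d[o], d[o + 1]]);
        let u32_at = |o: usize| u32::from_le_bytes([d[o], d[o + 1], d[o + 2], d[o + 3]]);
        Some(HdrMeta {
            primaries_gbr: [
                [u16_at(1), u16_at(3)],
                [u16_at(5), u16_at(7)],
                [u16_at(9), u16_at(11)],
            ],
            white_point: [u16_at(13), u16_at(15)],
            max_luminance: u32_at(17),
            min_luminance: u32_at(21),
            max_cll: u16_at(25),
            max_fall: u16_at(27),
        })
    }
}
```

A truncation test then reduces to asserting that `decode` returns `None` for every prefix shorter than `HDR_META_LEN`.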
### (C) C ABI
- `punktfunk_connect_ex5(... video_caps: u8)` (ex4 delegates with 0); **fix `abi.rs:896`.**
- `punktfunk_connection_next_hdr_meta(c, *mut PunktfunkHdrMeta, timeout_ms)` — new plane,
one-puller contract like `next_audio`.
- `punktfunk_connection_color_info(c, *mut prim, *mut trc, *mut matrix, *mut range, *mut bit_depth)`.
- Regenerate `include/punktfunk_core.h` (cbindgen); `struct_size`/repr(C) guards on new structs.
---
## Phases
### Step 0 — Protocol + ABI carry color metadata end to end *(this change)*
The dominant cross-cutting blocker; everything else is downstream. No rendering changes, no
hardware, CI-testable.
- **core:** `ColorInfo` + 4 Welcome bytes; `HdrMeta` + `0xCE` codec (bounds-checked);
`NativeClient` `color`/`bit_depth` fields + HdrMeta receiver + demux + `next_hdr_meta`.
- **C ABI:** `connect_ex5`, `next_hdr_meta`, `color_info`, fix caps=0; regen header.
- **host:** populate `Welcome.color` from the negotiated bit-depth/HDR decision; send a `0xCE`
(generic default for now) when HDR is negotiated.
- **clients:** Windows/Android inherit the demux via shared core; Apple flips to `ex5`.
- **validation:** `quic.rs` round-trip + truncation + **SDR back-compat** tests; `probe` logs
`bit_depth` + colorimetry; loopback asserts a 10-bit Welcome carries trc=16 and a `0xCE` lands.
### Step 1 — Host emits correct in-band SEI + complete VUI on all codecs *(landed; RTX-validation pending)*
In-band SEI is read directly by decoders, so it fixes correctness even before clients consume
the protocol, and gives an Apollo/Moonlight on-glass parity gate.
- **Single source of truth:** the capturer learns the source display's mastering metadata and
exposes it via `Capturer::hdr_meta() -> Option<HdrMeta>`. The stream loop forwards it to the
encoder (`Encoder::set_hdr_meta` → in-band SEI) **and** the client (real `0xCE`, re-sent on each
keyframe). Pure byte-level logic (float→fixed conversion + the HEVC/H.264 SEI payloads) lives in
the unit-tested, cross-platform `src/hdr.rs` (`hdr_meta_from_display`, `hevc_mastering_display_sei`
type **137**, `hevc_content_light_level_sei` type **144** — note: NOT "type 4", that was a
drafting error).
- **Windows (done, CI-compiled / RTX on-glass pending):** `dxgi.rs` + `wgc.rs` read
`IDXGIOutput6::GetDesc1` at capture init / output change → `HdrMeta` (MaxCLL/MaxFALL left 0 —
GetDesc1 has none, like Apollo). `nvenc.rs` attaches the mastering + CLL SEI on every IDR for
HEVC/H.264. (AV1 mastering rides METADATA OBUs, not SEI — follow-up; AV1 `color_config` already
lands in Step 0's quick win.)
- **Linux encode-ready — DEFERRED into Step 4:** Linux capture is 8-bit only, so signalling
BT.2020 PQ + attaching mastering side-data on a downconverted 8-bit stream would be *incorrect*.
The libavcodec `side_data` path (with the `AVRational` conversion) lands together with the
8-bit→Main10 shim / true 10-bit capture in Step 4.
- **Windows secure-desktop relay** (`virtual_stream_relay`) still sends only the generic baseline
`0xCE`; the helper's in-band SEI carries the real grade. Wiring the relay's `0xCE` is a follow-up.
- **validation (RTX box):** `ffprobe -show_frames` shows mastering + CLL side-data with the
display's real luminance and VUI 9/16/9; stock Moonlight shows correct (not washed-out) HDR.
Add **encoder-CSC-range == signaled-range** check.
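
As a rough illustration of the "pure byte-level logic" mentioned for `src/hdr.rs`, the sketch below converts float display values to the SEI fixed-point units and builds the two payload bodies (24 bytes for mastering display, 4 bytes for content light level, both big-endian). The function names are hypothetical and are not the repository's `hdr_meta_from_display` / `hevc_*_sei` functions; NAL wrapping, the SEI type/size header (types 137 and 144) and emulation prevention are deliberately left out.

```rust
/// CIE xy chromaticity (0.0..=1.0) to 1/50000 units.
pub fn chromaticity_to_fixed(v: f32) -> u16 {
    (v.clamp(0.0, 1.0) * 50_000.0).round() as u16
}

/// Luminance in nits to 0.0001 cd/m² units.
pub fn nits_to_fixed(nits: f32) -> u32 {
    (nits.max(0.0) * 10_000.0).round() as u32
}

/// Mastering display colour volume payload body: G, B, R primaries, white
/// point, then max and min luminance, all big-endian.
pub fn mastering_display_payload(
    primaries_gbr: [[u16; 2]; 3],
    white_point: [u16; 2],
    max_lum: u32,
    min_lum: u32,
) -> [u8; 24] {
    let mut out = [0u8; 24];
    let mut i = 0;
    for v in primaries_gbr.iter().flatten().chain(white_point.iter()) {
        out[i..i + 2].copy_from_slice(&v.to_be_bytes());
        i += 2;
    }
    out[16..20].copy_from_slice(&max_lum.to_be_bytes());
    out[20..24].copy_from_slice(&min_lum.to_be_bytes());
    out
}

/// Content light level payload body: MaxCLL then MaxFALL, big-endian nits.
pub fn content_light_level_payload(max_cll: u16, max_fall: u16) -> [u8; 4] {
    let mut out = [0u8; 4];
    out[0..2].copy_from_slice(&max_cll.to_be_bytes());
    out[2..4].copy_from_slice(&max_fall.to_be_bytes());
    out
}
```

Note the endianness difference from the `0xCE` datagram: the wire form is little-endian, while the in-band SEI fields are big-endian.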
### Step 2 — Clients apply the metadata *(landed; CI/on-glass validation pending)*
All three clients now drain the protocol's `HdrMeta` (`next_hdr_meta` / `nextHdrMeta`) and apply it,
each remapping from the wire form (ST.2086 G,B,R order, mastering luminance in 0.0001 cd/m²) to the
platform's expected layout:
- **Windows (Rust, CI-compiled):** session pump drains `next_hdr_meta` into a `LATEST_HDR_META`
slot; `present_newest` applies it via `Presenter::set_hdr_metadata` → real `SetHDRMetaData`
(`hdr_meta_to_dxgi`: G,B,R→R,G,B reorder, 0.0001-nit→nit for `MaxMasteringLuminance`), dropping
the 1000/1000/400 hardcode. `SetColorSpace1`/`SetHDRMetaData` failures + an SDR-display
colour-space rejection are now **logged**, not swallowed.
- **Apple (Swift, mac-runner CI):** connect now advertises caps via `punktfunk_connect_ex5`
(`SessionModel` computes `videoCap10Bit|videoCapHDR` from `hdrEnabled`) — *this is the fix that
resurrects Apple's previously-dead HDR pipeline*. `nextHdrMeta`/`colorInfo` wrappers added; the
pump drains `nextHdrMeta` → `VideoDecoder.setHdrMeta` → `CVBufferSetAttachment` of
`kCVImageBufferMasteringDisplayColorVolumeKey` (24-byte BE SEI) +
`kCVImageBufferContentLightLevelInfoKey` (4-byte BE) on each HDR pixel buffer (the correct path
for the itur_2100_PQ layer; `CAEDRMetadata` on a PQ layer is ambiguous and was avoided).
- **Android (Rust `decode.rs`, cargo-ndk verified):** when `client.color.is_hdr()`, drain the first
`next_hdr_meta` and set `MediaFormat` `hdr-static-info` (`KEY_HDR_STATIC_INFO`) before
`configure()`; `android_hdr_static_info` builds the 25-byte CTA-861.3 Type-1 blob (LE, **R,G,B**
order, max-lum in **nits-u16**). `Display.getHdrCapabilities` gate deferred (the Surface DataSpace
already drives SurfaceFlinger tone-mapping on non-HDR displays).
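
The per-platform remapping can be sketched with plain structs. Everything here is illustrative: `WireMeta` and `Hdr10Layout` are hypothetical stand-ins (the latter mirrors the field order of `DXGI_HDR_METADATA_HDR10` without using the Windows bindings), and the Android minimum-luminance unit (0.0001 cd/m² as a u16) is my reading of CTA-861.3, not something the text above states.

```rust
/// Wire form: G,B,R primaries, luminance in 0.0001 cd/m².
#[derive(Debug, Clone, Copy)]
pub struct WireMeta {
    pub primaries_gbr: [[u16; 2]; 3],
    pub white_point: [u16; 2],
    pub max_luminance: u32,
    pub min_luminance: u32,
    pub max_cll: u16,
    pub max_fall: u16,
}

/// Hypothetical plain mirror of the DXGI HDR10 metadata layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Hdr10Layout {
    pub red: [u16; 2],
    pub green: [u16; 2],
    pub blue: [u16; 2],
    pub white_point: [u16; 2],
    pub max_mastering_luminance: u32, // whole nits
    pub min_mastering_luminance: u32, // 0.0001 cd/m²
    pub max_cll: u16,
    pub max_fall: u16,
}

pub fn to_dxgi_layout(m: &WireMeta) -> Hdr10Layout {
    let [g, b, r] = m.primaries_gbr; // reorder G,B,R -> R,G,B
    Hdr10Layout {
        red: r,
        green: g,
        blue: b,
        white_point: m.white_point,
        max_mastering_luminance: m.max_luminance / 10_000,
        min_mastering_luminance: m.min_luminance,
        max_cll: m.max_cll,
        max_fall: m.max_fall,
    }
}

/// 25-byte Type-1 blob: type byte 0, then little-endian u16 fields in
/// R,G,B order, white point, max (nits), min, MaxCLL, MaxFALL.
pub fn to_android_static_info(m: &WireMeta) -> [u8; 25] {
    let [g, b, r] = m.primaries_gbr;
    let sat = |v: u32| v.min(u16::MAX as u32) as u16; // saturate into u16
    let fields: [u16; 12] = [
        r[0], r[1], g[0], g[1], b[0], b[1],
        m.white_point[0], m.white_point[1],
        sat(m.max_luminance / 10_000),
        sat(m.min_luminance),
        m.max_cll, m.max_fall,
    ];
    let mut out = [0u8; 25];
    for (i, v) in fields.iter().enumerate() {
        out[1 + 2 * i..3 + 2 * i].copy_from_slice(&v.to_le_bytes());
    }
    out
}
```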
### Step 3 — Display-capability gate *(landed; CI/on-glass validation pending)*
The common-case correctness step — most client displays are SDR. **Chosen approach: capability-gate**
(not an in-shader BT.2390 tone-map). Rationale: with Steps 1–2 the host sends *correct* mastering
metadata, so an HDR display self-tone-maps from it; the real remaining gap is SDR displays, best
fixed by **not advertising HDR you can't present** — the host then sends a proper BT.709 SDR stream
instead of PQ the panel would mis-tone-map (washed-out/dark). No guessed tone-map curve, deterministic.
- **Windows** (`present::display_supports_hdr` via DXGI: any `IDXGIOutput6` colour space ==
`G2084`): `session.rs` ANDs it with the user's HDR setting before advertising caps; logs when it
drops to SDR.
- **Apple** (`SessionModel`, main-actor): `NSScreen.maximumExtendedDynamicRangeColorComponentValue
> 1` (macOS) / `UIScreen.main.potentialEDRHeadroom > 1` (iOS) ANDed with `hdrEnabled`.
- **Android** (`Settings.displaySupportsHdr` via `Display.getHdrCapabilities` HDR10/HDR10+): Kotlin
passes it to `nativeConnect`; `session.rs` gates the caps on the new `hdr_enabled` jboolean
(cargo-ndk-verified).
- **Deferred** (need on-glass / the RTX box): the **mid-session `Reconfigure` "downgrade to SDR"**
for a monitor move HDR↔SDR; and confirming the **host produces SDR for an SDR client even off an
HDR desktop** — on the native path the per-session SudoVDA follows the negotiated depth (SDR
client → SDR virtual display → SDR stream), so it should hold end-to-end; verify the
stale-HDR-SudoVDA edge case on the RTX box.
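
The gate itself is a single AND. A minimal sketch, assuming two capability bits whose values are hypothetical (the real `video_caps` bit assignments are not given in this document):

```rust
/// Hypothetical bit values, for illustration only.
pub const VIDEO_CAP_10BIT: u8 = 0x01;
pub const VIDEO_CAP_HDR: u8 = 0x02;

/// Advertise HDR only when the user enabled it AND the display can present it;
/// otherwise advertise nothing so the host sends a BT.709 SDR stream.
pub fn advertised_video_caps(user_hdr_enabled: bool, display_supports_hdr: bool) -> u8 {
    if user_hdr_enabled && display_supports_hdr {
        VIDEO_CAP_10BIT | VIDEO_CAP_HDR
    } else {
        0
    }
}
```

Each platform would feed `display_supports_hdr` from its own query (DXGI colour space, EDR headroom, or `Display.getHdrCapabilities`) and log when the result drops to SDR.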
### Step 4 — Linux (last; capture blocked upstream)
- **8-bit→Main10 NVENC upconvert shim** (`encode/linux.rs`) — Main10 transport with correct
VUI/SEI without HDR capture (gate so we don't claim HDR transfer on SDR content).
- **Linux encode color + side-data (the deferred Step 1c):** set
`color_primaries/trc/colorspace/range` from the negotiated `ColorInfo` and attach
`AV_FRAME_DATA_MASTERING_DISPLAY_METADATA` / `CONTENT_LIGHT_LEVEL` side-data (with the
`AVRational` conversion) in `encode/linux.rs` + `vaapi.rs` — only once the encoder actually
produces 10-bit, so the signalling matches the bits.
- True 10-bit capture: offer `ABGR2101010`/`P010` PipeWire formats + read colorimetry; pilot on
Sway/wlroots; track gamescope #2126. **Don't block the rest of the plan on it.**
- Linux client: `ex5` caps, P010 decode, GdkDmabufTexture CICP from Welcome,
`wp_color_management` when GTK ≥ 4.14.
## Quick wins (independent, land in parallel)
1. `connect_ex5` + fix `abi.rs:896` — resurrects Apple's pipeline *(Step 0)*.
2. H.264 VUI + AV1 `color_config` on `nvenc.rs` — closes two latent blockers *(Windows-only,
validated in CI / on the RTX box)*.
3. `probe` logs `bit_depth` + colorimetry — observability for every later round-trip assertion.
4. Linux client BT.601→BT.709 sws + texture `color_state` — standalone SDR correctness bug.
5. GameStream silent-downgrade already warns (`rtsp.rs:289`) — keep observable.
## Open questions
- **MaxCLL source:** `GetDesc1` doesn't expose it (Apollo zeroes). Static default, or measure
per-frame peak in the PQ shader (only truly-correct, adds a readback)?
- **GameStream:** implement `SS_HDR_METADATA` for Moonlight parity, or keep it deliberately SDR
and steer HDR users to punktfunk/1?
- **HLG:** carry the enum from day one (free) — but do any sources actually produce HLG?
- **Linux:** is shipping the 8-bit→Main10 shim as "HDR-capable transport" acceptable, or does it
risk advertising HDR we can't truly deliver?
## Ordering rationale
Step 0 first: it's the keystone (metadata transport is the dominant cross-cutter; the ABI line
is a one-line root cause) and needs no hardware. Step 1 next: in-band SEI is read directly by
decoders, so it fixes correctness even before our clients consume the protocol, and gives an
Apollo-parity on-glass gate. Steps 2–3 are mechanical per-client wiring once metadata flows.
Linux is last because capture is gated on upstream we don't control; the shim delivers Main10
transport without that dependency.
Hardware dependencies: Step 0 = none (CI); Step 1 = RTX Windows host; Steps 2–3 = a real HDR
display per platform; Step 4 = a Linux GPU box + HDR-capable Wayland compositor.
@@ -0,0 +1,282 @@
# Host latency & the GPU-contention collapse — analysis + prioritized plan
> **⚠ Partially superseded (2026-06-25) by [`gpu-contention-investigation.md`](gpu-contention-investigation.md).**
> That follow-up re-verified this plan against the current code and overturned several specifics:
> the default Windows path (IDD-push) now feeds NVENC **RGB** (regressing the §0A "Windows does it
> right" claim); `PUNKTFUNK_ENCODE_DEPTH` never existed (phantom knob); the "async NVENC stacks
> latency" result was a *same-thread* implementation, not a disproof of a correct two-thread pipeline;
> "capture sees half the frames" is DLSS-Frame-Gen-specific, not general; and NvFBC is dead on
> Windows. Use the new doc's ranked action list. The tiers/dropped-placebo analysis below remain a
> useful record.
Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
"saturating game starves the stream" headache:
> CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding
> (GPU-100%) scene it collapses to 40-50. Capping the game is **not** an acceptable fix.
This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the
[Apollo comparison](apollo-comparison.md) + external NVIDIA/streaming research) followed by an
**adversarial verification pass** — every candidate fix was attacked, against our actual code, to
separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
placebos.
## Implementation status (2026-06-18)
- **Tier 2B — Linux scheduling hygiene**: landed. `boost_thread_priority` now nices the
capture/encode + send threads on Linux (`setpriority`, best-effort) and its wrong gamescope
doc-comment is fixed; CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread
highest-priority CUDA stream (`cuStreamCreateWithPriority`, graceful NULL-stream fallback) with a
per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt
green. The stream-priority hint is **measure-then-keep** (NVIDIA Linux may ignore it).
- **Tier 3A — Windows session tuning**: landed (`session_tuning.rs`, raw C-ABI FFI, no-op off
Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer,
`DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) and per-thread MMCSS "Games" + keep-display-awake. Wired
into both the native (`boost_thread_priority`) and GameStream (`stream.rs`) paths. Linux no-op
path builds green; the **FFI was validated on the real MSVC toolchain** (standalone probe compiled,
linked against winmm/kernel32/dwmapi/avrt, and ran — timer/priority/MMCSS all succeed).
- **Tier 2A — Linux NV12 convert**: landed, gated behind `PUNKTFUNK_NV12` (default OFF → the
RGB/BGRx path is byte-for-byte unchanged). The tiled EGL/GL path produces NV12 (BT.709 limited) on
the GPU and feeds NVENC native YUV, deleting NVENC's internal RGB→YUV CSC off the contended SM.
**Validated on an RTX 5070 Ti two ways**: (1) `nv12-selftest` — synthetic RGBA→NV12 round-trip vs a
BT.709 reference, max abs error Y=0.56 / U=0.33 / V=0.26 LSB; (2) live `capture→NV12→NVENC→decode`
of animated content matches the RGB path's colour (avg RGB 230,18,18 vs 231,18,20 — no green-screen,
correct matrix + VUI). LINEAR/Vulkan-bridge (gamescope) path stays RGB. Next: glass-to-glass
latency + fps-under-saturation A/B on a real game (the Tier-0 measurement) before flipping default.
---
## 0. Three corrections to the mental model (read first)
**(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards.**
NVENC's encode core is YUV-native. RGB input makes the driver insert an **RGB→YUV CSC on the
SM/3D-compute cores** — the *exact* engine a game saturates. Windows already does the right thing:
`convert_to_yuv` runs the CSC on the dedicated **VIDEO engine** via `VideoProcessorBlt`
(`capture/dxgi.rs:1023,1063`), logged as "0% 3D". **Linux still feeds NVENC RGB**
(`encode/linux.rs:98 nvenc_input` → `RGBZ`/`BGRZ`; the `zerocopy/egl.rs:98` shader is a `.bgra`
*swizzle*, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single
biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.
**(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a
hardware ceiling.** We ship `D3DKMTSetProcessSchedulingPriorityClass=HIGH(4)` +
`SetGPUThreadPriority(0x4000001E)` + `SetMaximumFrameLatency(1)` (`capture/dxgi.rs:160-263`). The
residual ~20 ms `lock_bitstream` wall (documented at `dxgi.rs:155`) is GPU **context-scheduling
latency**, bounded by **preemption granularity**: NVIDIA preempts *compute* at instruction level
(~0.1 ms) but *graphics* only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a
draw flood). No priority class preempts an in-flight game draw. So the winning strategy is **not
more priority** — it is (1) do **less work on the contended graphics/3D engine**, and (2) **overlap
the unavoidable per-frame scheduling wait across frames** to recover throughput.
**(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it.**
DXGI Desktop Duplication *and* WGC both capture **from the DWM compositor**, so captured fps is
hard-ceilinged at the **compose rate**, never the game's 400 fps. Under saturation the *compositor
itself* is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And
borderless/fullscreen games on **Independent/Direct Flip** present straight to scanout, *bypassing
DWM*, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at
`target_fps` and **re-encodes held frames**, so *transmitted* fps stays ~240 while *unique* fps
collapses. **This must be measured before blaming encode.**
> Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all
> shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. **Linux is the
> least-hardened host and holds most of the headroom.**
---
## Tier 0 — Diagnose first (cheap, decisive, do before writing code)
Everything below is gated on knowing *which* bucket the collapse is in. We already have the tooling.
1. **Run the workload with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`.** The `uniq` counter
(genuinely-new captured frames vs re-encoded holds) already exists
(`gamestream/stream.rs:332-336,403`; `wgc_helper.rs:122-183`). Under CS2 at GPU-100%:
- **`fps`≈240 but `uniq`→40-50** ⇒ the *source/compositor* only produced 40-50 unique frames.
No encode/priority/cadence fix on our side exceeds that — it is the game's effective
present-to-compose rate at 100% GPU. The lever there is **reducing our own per-frame GPU
steal** (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
- **both `fps` and `uniq`→40-50** ⇒ our capture→convert→encode round-trip is being starved (the
`lock_bitstream` scheduling stall). The Tier 1/2 contention levers apply directly.
2. **Confirm the game's flip mode on Windows.** If the game is on Independent/Direct Flip (MPO),
capture is bypassing DWM and seeing half the frames. We already have `capture/composed_flip.rs`
— verify ForceComposedFlip is actually engaged on the game path, and watch `cap_us`.
3. Capture `cap_us` / `enc_us` / `pace_us` p50/p99 alongside, to localise the stall.
Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This
headless dev VM cannot reproduce the contention.
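
The `fps` vs `uniq` reading in step 1 amounts to a two-way classification. A small sketch of that decision, assuming a 90% "near target" threshold that is my own placeholder and not a value from the perf tooling:

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Bucket {
    /// Transmitted fps holds but unique frames collapse: the source or
    /// compositor is the limit (Tier 1A / Tier 2 territory).
    SourceStarved,
    /// Both collapse: our capture -> convert -> encode round-trip is starved.
    PipelineStarved,
    /// Both track the target.
    Healthy,
}

pub fn classify(target_fps: f64, fps: f64, uniq: f64) -> Bucket {
    const NEAR: f64 = 0.9; // assumed threshold for "close to target"
    let fps_ok = fps >= NEAR * target_fps;
    let uniq_ok = uniq >= NEAR * target_fps;
    match (fps_ok, uniq_ok) {
        (true, true) => Bucket::Healthy,
        (true, false) => Bucket::SourceStarved,
        (false, _) => Bucket::PipelineStarved,
    }
}
```

For the CS2 case in the text, `fps` ≈ 240 with `uniq` ≈ 45 lands in `SourceStarved`, while both at 40-50 land in `PipelineStarved`.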
---
## Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)
### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
Levers, in order:
- **Force Composed Flip on Windows** for the game path (defeat MPO/flip-metering frame loss).
Machinery exists (`composed_flip.rs`); confirm it engages and measure the unique-frame delta.
- **Opt-in "double-refresh" virtual output**: create the per-session virtual output at ~2× the
client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
we already mint arbitrary-mode virtual outputs). Gate **off** by default and **never** on the
gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
engine). `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER`.
- **Reflex / render-queue=0 style headroom** (non-capping): documented as the substitute for an fps
cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what
we can influence from the host side.
Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our
capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
### 1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
a hidden latency tax, *most visible in easy scenes* (the ~200-should-be-240 case). Sunshine ships
this as `nvenc_latency_over_power` and calls it decisive. **Neither host does it.**
- **Windows**: NvAPI **per-application DRS profile** `PREFERRED_PSTATE = PREFER_MAX` scoped to our
exe (not a global override). Load `nvapi64.dll` dynamically; treat `NvAPI_Initialize` failure as
"no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). **Crash-safe undo is mandatory**: write
an undo record to `%ProgramData%\punktfunk\` *before* applying and revert a stale profile on next
startup — a crash must not leave the user's control panel modified.
- **Linux**: prefer the **root-free** path — disable the CUDA "Force P2 State" downclock that
context creation triggers (env/per-context), and `nvidia-smi -pm 1` (persistence) where
permitted. `nvmlDeviceSetGpuLockedClocks` needs root/CAP_SYS_ADMIN (our host runs as a normal
user → silent no-op) and is brittle across SKUs; if used, query `nvmlDeviceGetMaxClockInfo`, lock
to *that*, and restore on teardown **and** via a SIGTERM/panic handler.
- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default OFF on battery / Steam Deck** (thermal/power caps
make pinning actively harmful there).
Impact: reliable, modest p99 / easy-scene win on both OSes. Does **not** fix the saturated-scene
collapse (at 100% util the clock is already maxed). Low cost.
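
The "write the undo record before applying, revert a stale one on next startup" rule can be sketched independently of NvAPI. The file name, the record contents and the function names below are hypothetical; the `apply` and `revert` closures stand in for the real driver calls, and a production version would also want to flush the record to disk before applying.

```rust
use std::{fs, io, path::Path};

/// Hypothetical record file name inside the undo directory.
const UNDO_FILE: &str = "pstate-undo.txt";

/// Persist the previous setting first, then apply the new one. If the process
/// dies after `apply`, the record is still on disk for the next startup.
pub fn apply_with_undo(
    dir: &Path,
    previous: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    fs::write(dir.join(UNDO_FILE), previous)?;
    apply()
}

/// Call at startup (and on clean teardown): if a record exists, revert to the
/// recorded setting and delete the record. Returns whether a revert happened.
pub fn recover_stale(
    dir: &Path,
    revert: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    let path = dir.join(UNDO_FILE);
    match fs::read_to_string(&path) {
        Ok(previous) => {
            revert(&previous)?;
            fs::remove_file(&path)?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

The record is removed only after `revert` succeeds, so a failed revert is retried on the following startup rather than silently forgotten.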
---
## Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)
### 2A. Produce **NV12/P010** on Linux and feed it to NVENC native (delete the SM-side CSC)
The strictly-correct version (verified): **extend the existing GL de-tile blit
(`zerocopy/egl.rs`) to emit NV12** instead of swizzled BGRx — multi-render-target (GL_R8 luma
full-res + GL_RG8 chroma half-res, or two passes) applying an **explicit BT.709 limited-range
matrix matching the Windows `VideoConverter`** (`dxgi.rs:957`) so hosts look identical — then
register `NV_ENC_BUFFER_FORMAT_NV12` with the encoder (teach `encode/linux.rs:98 nvenc_input` an
NV12 case; `CudaHw sw_format` → `AV_PIX_FMT_NV12`).
- Net: today = GL swizzle (3D) **+** NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the
swizzle it replaces) **+ zero NVENC CSC**. Removes one whole CSC pass and removes it from the SM.
- **Do *not* implement this as a standalone CUDA convert kernel on the tiled path** — CUDA can't
sample a tiled NVIDIA surface (`cuGraphicsEGLRegisterImage` is Tegra-only, `egl.rs:6-12`), so it
would still need the GL detile, *and* a CUDA kernel runs on the same saturated SM. The CUDA-kernel
route is only clean on the **LINEAR/Vulkan-bridge (gamescope)** path, where it doubles as the NV12
producer; do it there if/when that path needs it.
- Pitfalls: pervasive 4-byte-pixel assumptions break with NV12 — `cuda.rs` hardcodes
`WidthInBytes = width*4` (`:363,392,499`), `BufferPool`/`alloc_pitched` assume 4 B/px, GL dst is
`GL_RGBA8`; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you
get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12.
- Impact: real, **modest, compounding** — a few ms of per-frame GPU time and a shorter time-slice
need, which stacks with cadence + power-pin. **Not** a standalone cure for the 240→40 collapse
(external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them).
Medium cost. Gate behind a `PUNKTFUNK_*` env and A/B `cap→encoded` p50 + the CS2 fps floor.
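
Two pieces of the NV12 work are easy to pin down on the CPU: the plane sizes that replace the 4 B/px assumption, and a reference BT.709 limited-range conversion of the kind a round-trip self-test can compare against. The sketch below uses the standard BT.709 coefficients; whether it matches the Windows `VideoConverter` bit-for-bit is an assumption to verify, and the function names are hypothetical.

```rust
/// Tightly packed NV12 plane sizes in bytes: (luma, interleaved chroma).
/// Real buffers also carry a row pitch, which this sketch ignores.
pub fn nv12_plane_bytes(width: usize, height: usize) -> (usize, usize) {
    assert!(width % 2 == 0 && height % 2 == 0, "NV12 needs even dimensions");
    let luma = width * height; // one byte per pixel
    let chroma = width * height / 2; // (W/2 x H/2) samples x 2 bytes (Cb, Cr)
    (luma, chroma)
}

/// Reference conversion: gamma-encoded R'G'B' in 0.0..=1.0 to 8-bit
/// limited-range Y'CbCr using BT.709 coefficients.
pub fn rgb_to_ycbcr_bt709_limited(r: f32, g: f32, b: f32) -> (u8, u8, u8) {
    const KR: f32 = 0.2126;
    const KB: f32 = 0.0722;
    let y = KR * r + (1.0 - KR - KB) * g + KB * b;
    let cb = (b - y) / (2.0 * (1.0 - KB));
    let cr = (r - y) / (2.0 * (1.0 - KR));
    let quantize = |v: f32| v.round().clamp(0.0, 255.0) as u8;
    (
        quantize(16.0 + 219.0 * y),
        quantize(128.0 + 224.0 * cb),
        quantize(128.0 + 224.0 * cr),
    )
}
```

For a 1920×1080 frame this gives 2,073,600 luma bytes plus 1,036,800 chroma bytes, versus 8,294,400 bytes for the same frame at 4 B/px, which is why every `width*4` copy size needs a plane-aware variant.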
### 2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")
Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in
priority:
- **Arm the Linux `boost_thread_priority` no-op** (`punktfunk1.rs:1856` cfg branch): best-effort
`libc::setpriority(PRIO_PROCESS, 0, -10/-5)` on the calling thread (tid 0 = self), log-and-continue
on EPERM. **Do not** default to SCHED_RR/FIFO (can starve the compositor and the game's render
thread — the user refuses to add game frame-time); offer it only behind `PUNKTFUNK_SCHED_RR=1`.
**Fix the wrong doc-comment** at `punktfunk1.rs:1834-1835` ("the Linux host caps the game via
gamescope, so its threads aren't starved") — false for the uncapped/NVIDIA-direct path.
- **Set CUDA context scheduling deliberately**: `cuCtxCreate` flag `CU_CTX_SCHED_BLOCKING_SYNC` on
this shared VM (frees a core vs the default AUTO/SPIN) — a CPU-efficiency fix, not throughput.
- **High-priority CUDA stream + EGL context priority** (the missing analogue of the Windows
hardening): `cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange)` for our copies;
request `EGL_IMG_context_priority HIGH` (try `EGL_NV_context_priority_realtime`) at
`egl.rs:332`. **Caveat, load-bearing**: these are intra-process *hints* and NVIDIA's Linux driver
has been reported to **ignore** context priority (driver 545: high- vs low-priority EGL contexts
measured identical) and to **deny** realtime Vulkan queues. Implement with graceful fallback,
gate behind env, and **measure on driver 595** — do not architect around it or credit it before
measurement.
> Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
> the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA).
> Replacing `cuCtxSynchronize` with a per-stream event chain for *contention* reasons (it's
> per-context and never waits on the game's separate context, so it is a non-fix; keep the full sync where it
> guards dmabuf recycle, `egl.rs:491`).
---
## Tier 3 — Windows parity polish (Windows is already strong)
- **3A. Host-process session tuning (we have *zero* today — verified):** `NtSetTimerResolution(0.5ms)`
/ `timeBeginPeriod(1)` (default 15.6 ms granularity blocks precise pacing), `DwmEnableMMCSS(true)`,
`SetPriorityClass(HIGH_PRIORITY_CLASS)`, MMCSS-register the capture/encode threads ("Games"/"Pro
Audio"), `SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED)`. All revert on stop.
Foundational for any precise frame pacing and the encode|send split. Low cost, low risk.
(`gamestream/stream.rs` start/stop; Apollo's `streaming_will_start`/`_stopped`.)
- **3B. Auto-gated REALTIME D3DKMT class** instead of fixed HIGH (the realtime opt-in already exists
at `dxgi.rs:199-207`): probe HAGS (`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom
(`IDXGIAdapter3::QueryVideoMemoryInfo`, continuously), allow REALTIME(5) only when safe (HAGS off,
or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises —
Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling
rung, same preemption ceiling), so rank it as cheap parity, not a fix.
- **3C. Cheap experiment — `VideoProcessorBlt` directly from the DDA surface** (skip the same-format
`gpu_copy` at `dxgi.rs:2375`), then `ReleaseFrame`, *iff* it doesn't re-serialize `AcquireNextFrame`
(the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming
the Blt is on the video engine). One-line source-texture change; benchmark only. Do **not** build a
D3D11↔D3D12 copy-queue offload — the convert is already off-3D, the remaining copy is intra-VRAM
(~5% 3D, no PCIe), not worth the interop rebuild.
- **3D. Async NVENC + off-thread retrieve — measure-gated, uncertain.** Today retrieve
(`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which is *why*
`depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates submit/retrieve
on separate threads with completion events + a deep surface pool; doing that *could* let per-frame
scheduling waits **overlap across frames** and recover *throughput* — at a per-frame *latency* cost
(depth × frame time). This is the one place the research and our own prior measurement disagree, so
it is **strictly measure-first**, and it forecloses slice output (`reportSliceOffsets` needs
`enableEncodeAsync=0`). Treat as a structural experiment, not a committed win.
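
The 3B gate reduces to a small pure decision that can be unit-tested away from the D3DKMT calls. The sketch below is an illustration only: `GpuState`, `GpuPriority`, `choose_priority`, and the 85% headroom threshold are assumptions, and the probing itself (`D3DKMTQueryAdapterInfo`, `QueryVideoMemoryInfo`) is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuPriority {
    High,
    Realtime,
}

/// Snapshot of the two probes the gate relies on (hypothetical struct).
pub struct GpuState {
    pub hags_enabled: bool,
    pub vram_usage_bytes: u64,
    pub vram_budget_bytes: u64,
}

/// Assumed meaning of "comfortably below budget"; tune from measurement.
pub const VRAM_HEADROOM_PERCENT: u128 = 85;

/// REALTIME only when HAGS is off, or HAGS is on and VRAM usage is under the headroom
/// threshold; everything else falls back to HIGH.
pub fn choose_priority(s: &GpuState) -> GpuPriority {
    if !s.hags_enabled {
        return GpuPriority::Realtime;
    }
    if s.vram_budget_bytes == 0 {
        // No usable budget reading: stay on the conservative rung.
        return GpuPriority::High;
    }
    // usage / budget < 85%, in integer math to avoid float rounding.
    let comfortable =
        (s.vram_usage_bytes as u128) * 100 < (s.vram_budget_bytes as u128) * VRAM_HEADROOM_PERCENT;
    if comfortable {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Re-evaluating this on every VRAM poll gives the "downgrade the moment pressure rises" behaviour without any extra state.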
---
## Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)
- **GL2 — Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met.
Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win,
orthogonal to the collapse.
- **GL1 + GL6 — Sub-frame slice output + per-slice paced send** (the roadmap's "~2-4 ms lever"):
`enableSubFrameWrite` + `sliceMode` + transmit each slice as it completes. **Big**: needs the
direct NVENC SDK on Linux (libavcodec emits whole AUs) **and** a per-slice wire/FEC redesign in
`punktfunk-core` (today `PacketHeader`/`Packetizer`/reassembler are whole-AU; per-slice FEC blocks
wreck Leopard efficiency) **and** client slice-granular submit. Gate on
`NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK` (often absent on consumer GeForce). The paced-send half is
**already shipped** (`stream.rs spawn_sender`, `punktfunk1.rs paced_submit`) — don't re-implement.
---
## Dropped / why (so we don't re-propose placebo)
| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace `cuCtxSynchronize` with per-stream event chain *for contention* | ✗ | `cuCtxSynchronize` is per-context and never waits on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan `VK_EXT_global_priority` as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR `vkCmdCopyBuffer`, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a *throughput/collapse* fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our `depth>1` regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe — not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (`LastPresentTime==0`) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5 — set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets `zeroReorderDelay=1`, lookahead/multipass/AQ off; ffmpeg defaults match + we set `bf=0`. Only `lowDelayKeyFrameScale=1` is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3 — true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no `nvEncInvalidateRefFrames`; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4 — move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |
---
## Recommended order of attack
1. **Tier 0 diagnose** on the real boxes — settles whether the collapse is source-ceiling or
pipeline-starvation, and whether flip-bypass is halving capture.
2. **Tier 2A (Linux NV12)** + **Tier 2B (Linux scheduling hygiene)** — the largest in-our-control
headroom; Linux is the least-hardened host.
3. **Tier 1B (clock/power pin)** both OSes — cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
4. **Tier 1A (cadence/flip)** — gated on Tier 0 (this is where a big chunk of the collapse may live).
5. **Tier 3 (Windows polish)** — session tuning is the clear win; the rest is parity.
6. **Tier 4** — only after the contention work; intra-refresh first, slice pipelining last.
Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
closes and the saturated floor rises, but a residual ceiling remains: a GPU at 100% load
physically cannot both render the game *and* compose 240 unique frames, and WDDM/NVIDIA preemption
granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it
with encoder micro-optimisations.
@@ -0,0 +1,300 @@
---
title: "Implementation Plan"
description: "The full design: protocol core, milestones, and architecture."
---
*A ground-up low-latency desktop streaming stack, built Linux-first, with a shared Rust protocol core and native clients per platform.*
> The name `punktfunk` fits the lowercase house style (`unom`, `played`, `remplir`) and reads as "glass-to-glass light," which is the whole point.
---
## 0. The thesis (why this is worth building)
Two concrete gaps justify a new project rather than another fork:
1. **The 1 Gbps wall is a FEC design limit, not a bandwidth limit.** Moonlight/Sunshine protect each frame with Reed-Solomon over GF(2⁸), which caps a block at 255 shards. At 5120×1440@240 that ceiling is hit around 1 Gbps. Switching the erasure code to **Leopard-RS over GF(2¹⁶)** (via the `reed-solomon-simd` crate) raises the per-block shard limit to 65,536 and runs in O(n log n) with SIMD. The wall disappears as a *consequence* of a better core, not as a hack.
2. **Linux software virtual displays are a real, unfilled gap.** The compositor-side capability now exists (Mutter headless virtual monitors since GNOME 40; wlroots headless outputs; KWin virtual outputs in Plasma 6), but no streaming host *drives* those APIs to create a client-sized output on demand, capture it via PipeWire, and route input back via libei. Apollo's virtual display is Windows-only. This is the immediate, shippable win.
**Strategic ordering:** ship the Linux virtual-display host speaking the *existing* Moonlight protocol first (every Moonlight/Artemis client works on day one, no client to write). Only then introduce the new GF(2¹⁶) transport as a negotiated protocol extension with our own clients. Value early, hard parts deferred until de-risked.
---
## 1. Scope & non-goals
**In scope (eventually):**
- Linux streaming host with on-demand software virtual displays (KWin first, then wlroots, then Mutter).
- A shared Rust protocol/transport/FEC core exposed over a stable C ABI.
- A modern transport that removes the 1 Gbps ceiling.
- Native clients: Rust (Linux), Swift (macOS/iOS), Kotlin (Android) — all linking the same core.
**Explicit non-goals (at least at first):**
- Windows *host* support (Sunshine/Apollo already do this well; no gap to fill).
- Internet/NAT-traversal relay infrastructure (LAN/VPN first; lean on an existing mesh VPN such as Headscale/NetBird/Tailscale).
- Reinventing encoders/decoders (bind to FFmpeg + vendor SDKs; never rewrite codecs).
- A bespoke compositor (drive existing ones; only consider a dedicated headless compositor as a *deployment mode*, see §6).
---
## 2. Architecture overview
```mermaid
flowchart TD
subgraph Host["Linux Host (Rust)"]
VD["Virtual display orchestrator<br/>(KWin / wlroots / Mutter)"]
CAP["Capture<br/>(PipeWire / dmabuf)"]
ENC["Encoder<br/>(VAAPI / NVENC via FFmpeg)"]
VD --> CAP --> ENC
ENC --> COREH
IN_H["Input injector<br/>(libei / uinput)"]
COREH["punktfunk-core (C ABI)<br/>protocol · FEC · pacing · crypto"]
COREH --> IN_H
end
COREH <-->|"UDP+FEC video / QUIC control+audio"| COREC
subgraph Client["Client (Rust / Swift / Kotlin)"]
COREC["punktfunk-core (same crate, C ABI)"]
DEC["Decoder<br/>(VideoToolbox / NVDEC / VAAPI)"]
PRES["Present + frame pacing"]
INP["Input capture"]
COREC --> DEC --> PRES
INP --> COREC
end
```
**The load-bearing decision:** `punktfunk-core` is one crate, compiled once, linked by every host and client through a C ABI. Protocol logic, FEC, packet pacing, jitter buffering, pairing, and crypto live there and exist exactly once. Platform code (capture, encode, decode, present, input, UI) lives outside the core and is written in whatever language suits the platform.
---
## 3. Protocol strategy (three phases)
| Phase | Protocol | Clients that work | Bitrate ceiling | Purpose |
|------|----------|-------------------|-----------------|---------|
| **P1** | GameStream-compatible (existing Moonlight wire format) | All existing Moonlight/Artemis clients | ~1 Gbps (legacy GF(2⁸) FEC) | Ship the Linux virtual-display win with zero client work |
| **P2** | `punktfunk/1` negotiated extension: GF(2¹⁶) FEC, multi-block framing, optional QUIC control | punktfunk clients only; falls back to P1 for others | Multi-Gbps | Break the wall; introduce native clients |
| **P3** | `punktfunk/1` as primary; GameStream kept as compat shim | punktfunk everywhere, Moonlight as fallback | Multi-Gbps | Full control of features (mic passthrough, per-client identity, HDR signalling) |
Negotiation: extend the `serverinfo`/RTSP `SETUP` handshake with a capability flag. Old clients never see the flag and get P1 behavior. This is how Apollo/Artemis diverge cleanly, and it keeps you compatible while you build.
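
The fallback rule can be sketched as a pure function. Everything named here is hypothetical (`CAP_PUNKTFUNK_1`, `Protocol`, `negotiate`); how the flag is actually carried in `serverinfo` or RTSP `SETUP` is not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    /// P1: stock Moonlight wire format.
    GameStreamCompat,
    /// P2/P3: the negotiated extension.
    Punktfunk1,
}

/// Hypothetical capability bit advertised by both sides.
pub const CAP_PUNKTFUNK_1: u32 = 1 << 0;

/// `client_caps` is `None` when the client never sent the capability field,
/// which is what every stock Moonlight client does.
pub fn negotiate(host_caps: u32, client_caps: Option<u32>) -> Protocol {
    match client_caps {
        // Both sides must advertise the bit before the extension is used.
        Some(c) if (c & host_caps & CAP_PUNKTFUNK_1) != 0 => Protocol::Punktfunk1,
        // Missing field or missing bit: behave exactly like P1.
        _ => Protocol::GameStreamCompat,
    }
}
```

Treating "field absent" and "bit absent" identically means an old client takes the same path as a new client talking to an old host.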
---
## 4. Tech stack (settled)
**Language split:** Rust for the core and all non-Apple platform code; Swift only for the macOS/iOS client UI + VideoToolbox/Metal; Kotlin for Android UI + MediaCodec. The C ABI is the seam.
**Threading:** native OS threads for the video hot path. `tokio` is allowed *only* for the control plane (pairing, web config, QUIC control stream). The per-frame pipeline must never touch an async runtime.
### Core crate dependencies
| Concern | Crate | Notes |
|--------|-------|-------|
| FEC | `reed-solomon-simd` (v3+) | Leopard/GF(2¹⁶), SIMD, O(n log n) — the wall-breaker |
| QUIC (control/audio) | `quinn` | Datagram ext for audio; reliable streams for control |
| TLS / crypto | `rustls` + `ring` (or `aws-lc-rs`) | Pairing, session keys (AES-GCM to match GameStream in P1) |
| Serialization | `zerocopy` / `bytes` | Wire structs `#[repr(C)]`, zero-copy parse |
| C header gen | `cbindgen` | Generates `punktfunk_core.h` from the ABI module |
| Error/log | `tracing` | Structured; feature-gate off the hot path |
### Linux host dependencies
| Concern | Crate / API | Notes |
|--------|-------------|-------|
| Capture | `pipewire` (pipewire-rs) | ScreenCast portal stream → dmabuf |
| Portal / DBus | `ashpd` + `zbus` | xdg-desktop-portal: ScreenCast, RemoteDesktop |
| Encode | `ffmpeg-next` or `rsmpeg` | VAAPI / NVENC, dmabuf import (zero-copy) |
| Input inject | `reis` (libei) + `input-linux` (uinput fallback) | Wayland-native first, uinput as universal fallback |
| Virtual output | per-compositor (see §6) | KWin DBus / Sway `create_output` / Mutter DBus |
| Web config | `axum` + `tokio` + small Vite/React UI | You own this stack already |
### Apple client (P2+)
Swift + VideoToolbox (decode) + Metal (present) + SwiftUI. Imports `punktfunk_core.h` directly via a module map — no glue layer.
### Ruled out
- **Swift for the host/core:** no Linux Wayland/PipeWire/DRM/VAAPI ecosystem; ARC in hot loops. (Excellent *Apple-client* language, wrong for systems/Linux.)
- **Go:** GC disqualifies the hot path.
- **C++:** throws away the safety/concurrency wins that justified greenfield over forking.
- **Zig:** best-in-class C interop, but pre-1.0 with no Wayland/QUIC ecosystem — too much risk for a multi-month build. Revisit later if desired.
---
## 5. The C ABI boundary
Design it on day one; retrofitting an ABI is painful.
**Principles**
- Opaque handles only across the boundary: `PunktfunkSession*`, never Rust types.
- All cross-boundary structs are `#[repr(C)]`; primitives + pointer/len pairs for buffers.
- Async events via registered C callbacks (`fn ptr` + `void* userdata`).
- Explicit, documented ownership: who frees what, when. Provide `punktfunk_*_free` for every allocation that crosses out.
- Versioned ABI: `uint32_t punktfunk_abi_version(void)` + a `PunktfunkConfig` struct whose first field is its own size for forward-compat.
**Minimal surface (sketch)**
```c
// lifecycle
PunktfunkSession* punktfunk_session_new(const PunktfunkConfig* cfg);
void punktfunk_session_free(PunktfunkSession*);
// host: feed an encoded access unit (the core does FEC + packetize + pace + send)
int punktfunk_host_submit_frame(PunktfunkSession*, const uint8_t* data, size_t len,
uint64_t pts_ns, PunktfunkFrameFlags flags);
// client: pull a reassembled, FEC-recovered access unit ready to decode
int punktfunk_client_poll_frame(PunktfunkSession*, PunktfunkFrame* out /*borrowed until next poll*/);
// input (both directions): client captures, host receives via callback
int punktfunk_send_input(PunktfunkSession*, const PunktfunkInputEvent*);
void punktfunk_set_input_callback(PunktfunkSession*, PunktfunkInputCb, void* user);
// stats for the frame-pacing/quality logic and the web UI
void punktfunk_get_stats(PunktfunkSession*, PunktfunkStats* out);
```
Keep it this small. Everything platform-specific (how you got the encoded bytes, how you decode them) stays on the platform side.
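
On the Rust side, the size-prefixed `PunktfunkConfig` principle might be honoured roughly as follows. This is a sketch under assumptions: every field after `struct_size`, the default values, and the `read_config` helper are invented for illustration, and export attributes are omitted.

```rust
use std::mem::size_of;
use std::ptr;

/// Hypothetical ABI revision number, bumped on breaking changes.
pub const PUNKTFUNK_ABI_VERSION: u32 = 1;

/// Shape of the version query from the principles list.
pub extern "C" fn punktfunk_abi_version() -> u32 {
    PUNKTFUNK_ABI_VERSION
}

/// Hypothetical layout; only "first field is its own size" comes from the plan.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PunktfunkConfig {
    pub struct_size: u32, // caller sets this to the sizeof it was compiled against
    pub width: u32,
    pub height: u32,
    pub fps: u32,
    pub fec_percent: u32, // pretend this was appended in a later revision
}

impl Default for PunktfunkConfig {
    fn default() -> Self {
        Self { struct_size: size_of::<Self>() as u32, width: 1920, height: 1080, fps: 60, fec_percent: 20 }
    }
}

/// Copies only the bytes the caller claims to have; newer fields keep their defaults.
///
/// # Safety
/// `cfg` must be null or point to at least `struct_size` readable bytes.
pub unsafe fn read_config(cfg: *const PunktfunkConfig) -> Option<PunktfunkConfig> {
    if cfg.is_null() {
        return None;
    }
    let claimed = unsafe { ptr::read_unaligned(cfg as *const u32) } as usize;
    let field = size_of::<u32>();
    // Reject sizes that cannot hold the prefix or that split a field.
    if claimed < field || claimed % field != 0 {
        return None;
    }
    let n = claimed.min(size_of::<PunktfunkConfig>());
    let mut out = PunktfunkConfig::default();
    unsafe {
        ptr::copy_nonoverlapping(cfg as *const u8, &mut out as *mut PunktfunkConfig as *mut u8, n);
    }
    out.struct_size = size_of::<PunktfunkConfig>() as u32;
    Some(out)
}
```

A client built against an older, shorter struct still works because the core never reads past the size the caller declared; a newer client talking to an older core is truncated the same way.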
---
## 6. Virtual display orchestration
This is the differentiator and the most fragmented part. Two deployment models — support both eventually, pick one for the MVP.
**Model A — Attach to the running session.** Create a client-sized virtual output *inside the user's live desktop*, stream it, tear it down on disconnect. This is "add a monitor to my actual PC." Best UX, hardest because it depends on per-compositor runtime APIs.
**Model B — Dedicated headless session.** Spawn a separate headless compositor purely for the stream (e.g. `gnome-shell --headless --virtual-monitor WxH`, or a headless wlroots compositor). Cleaner isolation, sidesteps runtime-output APIs, ideal for "remote second PC." Worse for "mirror/extend my real desktop."
**Per-compositor (Model A) runtime virtual-output creation:**
- **KWin / Plasma 6 (recommended MVP target — a common KDE daily-driver setup, and where the gap is loudest):** KWin can create virtual outputs; KRdp already does this internally for remote sessions. Drive it via the KWin DBus interface; capture via `xdg-desktop-portal-kde` ScreenCast (PipeWire); inject input via the RemoteDesktop portal or `reis`.
- **wlroots (Sway/Hyprland — fastest to *prototype* the pipeline):** enable the headless backend (`WLR_BACKENDS=…,headless`), then `swaymsg create_output` / `hyprctl output create headless`. Capture via `wlr-screencopy` or the portal. Simplest API; good for validating capture→encode→send before fighting KWin/Mutter.
- **Mutter / GNOME:** virtual monitors via the headless backend; runtime creation via Mutter DBus (`org.gnome.Mutter.*` — partly experimental). Capture via `xdg-desktop-portal-gnome` ScreenCast.
**Recommendation:** do a 1–2 day wlroots spike to prove the *pipeline*, then build the real MVP on KWin because that's your deployment target. Abstract virtual-output creation behind a trait so compositors are pluggable:
```rust
trait VirtualDisplay {
fn create(&self, mode: Mode) -> Result<OutputHandle>;
fn destroy(&self, h: OutputHandle) -> Result<()>;
}
```
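
To show what "pluggable" buys, here is a self-contained variant of the trait with an in-memory backend. The concrete `Mode`, `OutputHandle`, `VdError`, and `VdResult` types, the `MockDisplay` backend, and the `with_output` helper are all assumptions for illustration; real backends would talk to KWin, wlroots, or Mutter.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Mode { pub width: u32, pub height: u32, pub refresh_hz: u32 }

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct OutputHandle(pub u32);

#[derive(Debug, PartialEq, Eq)]
pub enum VdError { InvalidMode, UnknownOutput }

pub type VdResult<T> = Result<T, VdError>;

pub trait VirtualDisplay {
    fn create(&self, mode: Mode) -> VdResult<OutputHandle>;
    fn destroy(&self, h: OutputHandle) -> VdResult<()>;
}

/// In-memory stand-in for a compositor backend; interior mutability because
/// the trait methods take `&self`.
#[derive(Default)]
pub struct MockDisplay {
    next_id: Cell<u32>,
    live: RefCell<HashMap<OutputHandle, Mode>>,
}

impl MockDisplay {
    pub fn live_outputs(&self) -> usize { self.live.borrow().len() }
}

impl VirtualDisplay for MockDisplay {
    fn create(&self, mode: Mode) -> VdResult<OutputHandle> {
        if mode.width == 0 || mode.height == 0 || mode.refresh_hz == 0 {
            return Err(VdError::InvalidMode);
        }
        let h = OutputHandle(self.next_id.get());
        self.next_id.set(h.0 + 1);
        self.live.borrow_mut().insert(h, mode);
        Ok(h)
    }

    fn destroy(&self, h: OutputHandle) -> VdResult<()> {
        self.live.borrow_mut().remove(&h).map(|_| ()).ok_or(VdError::UnknownOutput)
    }
}

/// Create on connect, run the session, destroy on disconnect.
pub fn with_output<D: VirtualDisplay, R>(
    d: &D,
    mode: Mode,
    session: impl FnOnce(OutputHandle) -> R,
) -> VdResult<R> {
    let h = d.create(mode)?;
    let result = session(h);
    d.destroy(h)?;
    Ok(result)
}
```

The host code only ever sees `VirtualDisplay`, so the mock can drive tests of the connect/disconnect lifecycle before any compositor backend exists.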
---
## 7. The hot path: pipeline & latency budget
Per-frame pipeline, each stage on its own thread, connected by bounded SPSC channels (drop-oldest on overflow, never block the encoder):
```
capture(dmabuf) → encode(NVENC/VAAPI) → core[FEC+packetize+pace+send]
│ network
client: recv → core[reorder+FEC recover+jitter] → decode → present
```
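
The drop-oldest policy between stages can be illustrated with a small bounded queue. This sketch uses a `Mutex` + `Condvar` for brevity rather than a lock-free SPSC ring, so it shows the overflow policy and not the final data structure; `DropOldest` is a hypothetical name.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

/// Bounded queue that evicts the oldest item instead of making the producer wait.
pub struct DropOldest<T> {
    inner: Mutex<VecDeque<T>>,
    ready: Condvar,
    cap: usize,
}

impl<T> DropOldest<T> {
    pub fn new(cap: usize) -> Arc<Self> {
        assert!(cap > 0, "capacity must be non-zero");
        Arc::new(Self {
            inner: Mutex::new(VecDeque::with_capacity(cap)),
            ready: Condvar::new(),
            cap,
        })
    }

    /// Producer side: never waits for space. Returns the evicted frame, if any,
    /// so the caller can count drops.
    pub fn push(&self, item: T) -> Option<T> {
        let mut q = self.inner.lock().unwrap();
        let dropped = if q.len() == self.cap { q.pop_front() } else { None };
        q.push_back(item);
        self.ready.notify_one();
        dropped
    }

    /// Consumer side: waits until a frame is available.
    pub fn pop(&self) -> T {
        let mut q = self.inner.lock().unwrap();
        loop {
            if let Some(item) = q.pop_front() {
                return item;
            }
            q = self.ready.wait(q).unwrap();
        }
    }

    /// Non-blocking variant for polling consumers.
    pub fn try_pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}
```

With capacity 1 this degenerates to "latest frame wins", which is often what a latency-sensitive stage wants; larger capacities trade latency for tolerance to short stalls.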
**Glass-to-glass budget (LAN, 240 Hz = 4.17 ms/frame):**
| Stage | Target | Notes |
|------|--------|-------|
| Capture latency | ≤ 1 frame | dmabuf, no copy to CPU |
| Encode | 1–4 ms | NVENC low-latency preset; tune lookahead off |
| FEC + packetize | < 1 ms | SIMD RS; pre-allocated shard buffers |
| Network (LAN) | < 1 ms | `sendmmsg` / UDP GSO to cut syscalls |
| Jitter buffer | 0–1 frame | adaptive; minimum that hides observed jitter |
| FEC recover + reassemble | < 1 ms | only when loss occurs |
| Decode | 1–4 ms | hardware decoder |
| Present | ≤ 1 frame | align to client vsync |
**Target: 15–35 ms glass-to-glass on LAN.** The art is *frame pacing*: matching capture/encode cadence to the client's actual refresh and keeping the jitter buffer as small as the link allows. This, not the codec, is what separates good from bad streaming. Budget real time for it.
**Throughput math to keep honest:** 5120×1440@240 ≈ 1.77 Gpx/s. At 0.5 bpp that's ~885 Mbps; 0.6 bpp ≈ 1.06 Gbps; 0.8 bpp (4:4:4 headroom) ≈ 1.4 Gbps. The GF(2¹⁶) FEC + multi-block framing must sustain these without the per-frame shard count being the limiter — which it no longer is once you leave GF(2⁸).
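
The arithmetic above is easy to keep in a helper so it can be re-run for other modes. A minimal sketch: the function names are hypothetical, and the 1200-byte shard payload used in the usage note is an assumption, not a measured value. Parity shards and any multi-block splitting are ignored.

```rust
/// Pixels per second for a given mode.
pub fn pixel_rate(width: u64, height: u64, fps: u64) -> u64 {
    width * height * fps
}

/// Stream bitrate in bits per second at a given bits-per-pixel budget.
pub fn bitrate_bps(pixels_per_sec: u64, bits_per_pixel: f64) -> f64 {
    pixels_per_sec as f64 * bits_per_pixel
}

/// Average encoded frame size in bytes, rounded up.
pub fn frame_bytes(bitrate_bps: f64, fps: u64) -> u64 {
    (bitrate_bps / 8.0 / fps as f64).ceil() as u64
}

/// Data shards needed to carry one frame at a given payload size.
pub fn data_shards(frame_bytes: u64, shard_payload_bytes: u64) -> u64 {
    (frame_bytes + shard_payload_bytes - 1) / shard_payload_bytes
}

/// Per-block shard limits cited in the thesis section.
pub const GF8_BLOCK_LIMIT: u64 = 255;
pub const GF16_BLOCK_LIMIT: u64 = 65_536;
```

For 5120×1440@240 at 0.5 bpp this gives 1,769,472,000 px/s, about 885 Mbps, and 460,800 bytes per average frame. Under the assumed 1200-byte payload that is 384 data shards, already more than a single GF(2⁸) block holds and far below the GF(2¹⁶) limit.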
---
## 8. Milestones
Sizing is rough and relative (Spike / S / M / L) for a focused solo dev; treat as ordering, not deadlines.
**M0 — Pipeline spike (S).** wlroots headless output → PipeWire capture → VAAPI/NVENC encode → dump H.265 to a file that plays. *Acceptance:* a valid encoded file from a virtual output, no streaming yet. Proves the Linux capture+encode chain end-to-end.
**M1 — `punktfunk-core` skeleton + C ABI (M).** Session lifecycle, GameStream-compatible packetization and GF(2⁸) FEC (P1), AES-GCM, `cbindgen` header, a tiny C test harness. *Acceptance:* core links from C; round-trips packets in a loopback test with simulated loss.
**M2 — P1 host: stream to stock Moonlight (L).** Wire M0's pipeline into the core; implement `serverinfo`/pairing/RTSP enough for a real Moonlight client to connect, with a KWin virtual output created on connect and destroyed on disconnect. Input via `reis`/uinput. *Acceptance:* **you play a game on your KDE box streamed to a stock Moonlight client on a virtual display, no dummy plug, no kernel args.** This is the shippable milestone and the project's reason to exist.
**M3 — Measurement harness (S).** Glass-to-glass latency measurement (on-screen QR/timestamp or photodiode), packet-loss injection, frame-pacing and stall metrics surfaced in the web UI. *Acceptance:* you can quantify a regression. Build this before optimizing anything.
**M4 — P2 transport: break the wall (L).** Add `punktfunk/1` negotiation; swap to `reed-solomon-simd` GF(2¹⁶) with multi-block per-frame framing; optional QUIC control/audio. Write a minimal **Rust** reference client (decode via VAAPI, present via wgpu/Vulkan) to exercise it. *Acceptance:* a stable stream above 1.4 Gbps at 5120×1440@240 with loss recovery working; latency unchanged vs. M2.
**M5 — Apple client (L).** Swift + VideoToolbox + Metal + SwiftUI, linking `punktfunk-core` via the C header. *Acceptance:* a Mac plays a stream at native resolution/refresh.
**M6 — Feature surface (M, ongoing).** Mic passthrough as a proper encrypted, per-client reverse audio stream (the thing the upstream PR got wrong); HDR signalling; per-client identity/permissions; pause/resume. *Acceptance:* feature parity with Apollo on the items you care about, plus mic done right.
---
## 9. Risk register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| KWin runtime virtual-output API is undocumented/unstable | High | High | Spike on wlroots first to de-risk the pipeline; study KRdp's source for the KWin path; keep `VirtualDisplay` pluggable so a stuck compositor doesn't block the project |
| Wayland input injection gaps (libei still evolving) | Med | Med | uinput fallback always available; `reis` for the Wayland-native path |
| dmabuf → encoder zero-copy import quirks per GPU/driver | High | Med | Validate on your actual NVIDIA + AMD hardware early (M0); have a CPU-copy fallback path |
| Encoder/decoder can't sustain 1.77 Gpx/s @ 240 | Med | High | Measure in M0/M4 on real silicon; this is a hardware ceiling no rewrite fixes — discover it before P2, not after |
| Frame pacing eats more time than expected | High | Med | M3 measurement harness first; treat pacing as a first-class subsystem, not a polish step |
| Scope creep into a full Moonlight replacement | High | High | P1 (stock-client compat) is the firewall: it forces you to ship value before writing a client |
| Solo bandwidth vs. other projects | High | Med | M2 is a complete, useful artifact on its own; the plan is safe to pause after any milestone |
---
## 10. Testing & measurement
- **Loopback correctness:** core encodes→FEC→loss-inject→recover→decode in-process; property tests over loss patterns and shard counts (proptest).
- **Glass-to-glass latency:** rendered timestamp/QR on host, read back on client capture; or a photodiode for true photons. Track p50/p99.
- **Loss resilience:** `tc netem` to inject loss/jitter/reorder; verify FEC recovery and graceful degradation.
- **Pacing:** log present timestamps vs. client vsync; alert on stalls and duplicate/dropped frames.
- **Soak:** multi-hour streams; watch for buffer growth, fd leaks, encoder session exhaustion.
- **Hardware matrix:** an NVIDIA box (NVENC), an AMD/Intel box (VAAPI), a Mac (VideoToolbox decode). Catch driver quirks early.
---
## 11. Repo / workspace structure
```
punktfunk/
├── Cargo.toml # workspace
├── crates/
│ ├── punktfunk-core/ # protocol, FEC, pacing, crypto — C ABI (cdylib + staticlib)
│ │ ├── src/abi.rs # #[no_mangle] extern "C" surface
│ │ ├── src/fec.rs # GF(2^16) blocking over reed-solomon-simd
│ │ ├── src/transport/ # udp+fec video, quinn control/audio
│ │ ├── src/protocol/ # gamestream-compat (P1) + punktfunk/1 (P2)
│ │ └── cbindgen.toml
│ ├── punktfunk-host/ # Linux host binary
│ │ ├── src/capture/ # pipewire / portal
│ │ ├── src/encode/ # ffmpeg vaapi/nvenc
│ │ ├── src/vdisplay/ # trait + kwin/wlroots/mutter impls
│ │ ├── src/input/ # reis + uinput
│ │ └── src/web/ # axum config/pairing API
│ └── punktfunk-probe/ # reference Rust client (M4)
├── clients/
│ ├── apple/ # Swift package, imports punktfunk_core.h (M5)
│ └── android/ # Kotlin + JNI (later)
├── include/ # generated punktfunk_core.h
└── tools/
├── latency-probe/
└── loss-harness/
```
---
## 12. Immediate next actions (first week)
1. **Stand up the workspace** with `punktfunk-core` (empty ABI + `cbindgen`) and `punktfunk-host` skeletons; wire up CI (Gitea Actions, BuildKit-based pipelines).
2. **M0 spike on wlroots:** headless output → PipeWire capture → NVENC/VAAPI encode → playable file. This validates the riskiest *pipeline* assumptions in days, on real GPU hardware.
3. **Read KRdp's source** for how KDE creates virtual outputs and casts them — it's the closest existing reference for the KWin path needed in M2.
4. **Decide P1 protocol depth:** confirm exactly which `serverinfo`/RTSP/pairing messages a current Moonlight client requires for a successful connect, so M2's compat surface is scoped precisely.
---
*The shape of the bet: M2 alone — virtual-display streaming to stock Moonlight clients on Linux — is a complete, useful, gap-filling release. Everything after it (the wall-breaking transport, native clients, mic-done-right) is upside you unlock from a position of having already shipped, with the hard transport work resting on a FEC core that makes the 1 Gbps ceiling a thing of the past rather than a thing to hack around.*
@@ -0,0 +1,159 @@
# Linux host setup — NVIDIA GPU VM (pipeline spike + GameStream host)
How to bring up the build environment for the punktfunk Linux host on an NVIDIA-GPU Ubuntu VM
and run the **pipeline spike** (capture→encode). `punktfunk-core` already builds and is tested
cross-platform; this is about the platform backends in `crates/punktfunk-host`.
> Target **Ubuntu 24.04 (noble)**: Sway 1.9, FFmpeg 6.1.1, xdg-desktop-portal 1.18.
> 22.04 (jammy) ships Sway 1.7 / FFmpeg 4.4 — too old for this path; build from source or
> upgrade. Package names/versions below were verified against the live Ubuntu archive.
## 1. Bootstrap
```sh
git clone git@git.unom.io:unom/punktfunk.git && cd punktfunk && git checkout m1-punktfunk-core
bash scripts/bootstrap-ubuntu.sh
```
It **verifies** the (already-installed) NVIDIA + NVENC stack, installs the Rust toolchain
(rustup) and the build/runtime deps (PipeWire, xdg-desktop-portal + the wlroots backend,
Sway, Wayland/DRM/EGL/GBM/VA dev libs, capture tools), **gates** the FFmpeg `-dev`
headers so it can't clobber your custom NVENC FFmpeg, and drops headless-Sway + portal
config templates into `~/.config` (only if absent). It does **not** reboot or edit GRUB.
After it runs, sanity-check the core on Linux:
```sh
cargo test --workspace # 21 tests; same suite that's green on macOS
```
## 2. NVIDIA prerequisites (one-time, may need a reboot)
Wayland on NVIDIA requires KMS modeset. The bootstrap checks it; if it isn't `Y`:
```sh
echo 'options nvidia-drm modeset=1 fbdev=1' | sudo tee /etc/modprobe.d/nvidia-drm.conf
sudo update-initramfs -u && sudo reboot
cat /sys/module/nvidia_drm/parameters/modeset # must print Y after reboot
```
- Driver **≥ 535** is the floor for headless wlroots (EGL/dmabuf); 550+ recommended.
- **Install the NVIDIA GL/EGL userspace, not just `nvidia-utils`:**
`sudo apt install libnvidia-gl-<NNN>` (matching the driver, e.g. `libnvidia-gl-595`).
`nvidia-utils-NNN` ships nvidia-smi + NVENC but **not** `libEGL_nvidia.so.0` or the GLVND
vendor JSON (`/usr/share/glvnd/egl_vendor.d/10_nvidia.json`). Without them libglvnd falls
back to Mesa, wlroots can't init EGL on the GPU and drops to the **pixman** software
renderer — and the ScreenCast portal then fails to negotiate a buffer format
(`unable to receive a valid format from wlr_screencopy`). Verify after install:
`ls /usr/share/glvnd/egl_vendor.d/10_nvidia.json && ldconfig -p | grep libEGL_nvidia`.
A correct GPU Sway logs `EGL vendor: NVIDIA` and a list of DMA-BUF formats.
- **Join the `render` + `video` groups:** `sudo usermod -aG render,video $USER`, then
**re-login** (group changes only apply to new logins). wlroots opens
`/dev/dri/renderD128` (group `render`) and `/dev/dri/card*` (group `video`), both 0660;
without membership Sway aborts with `Permission denied`. (`scripts/headless/*.sh` bridge a
not-yet-re-logged-in shell with `sg render`, but re-login is the clean fix.)
- A **headless VM GPU exposes no DRM connectors** — that's expected. We don't use the DRM
backend; `WLR_BACKENDS=headless` renders to an offscreen GBM/EGL surface and creates a
virtual `HEADLESS-1` output. Use the render node `/dev/dri/renderD128`.
- **NVENC in a VM:** full PCI **passthrough** = bare-metal NVENC, no license. **vGPU**
needs a valid license (vWS) or NVENC runs degraded — the bootstrap's smoke-encode tells
you if it actually works. Consumer GeForce cards also cap concurrent NVENC sessions
(~8); datacenter/RTX-pro are effectively unlimited — relevant once we serve many clients.
## 3. Bring up the headless compositor + prove capture→NVENC
```sh
# shell 1 — start headless GPU Sway on the shared user bus (blocks; -d for debug log)
bash scripts/headless/run-headless-sway.sh # success logs "EGL vendor: NVIDIA"
# shell 2 — same user: set the client mode, import the portal env, write the env file
bash scripts/headless/prepare-session.sh 2560x1440@60Hz
source /tmp/punktfunk-sway-env.sh
swaymsg -t get_outputs # confirm HEADLESS-1 active
swaymsg exec foot # optional: animated content to capture
bash scripts/headless/capture-smoke-test.sh # wf-recorder (wlr-screencopy) -> hevc_nvenc
ffprobe /tmp/punktfunk-headless-test.mkv # confirm a real H.265 stream
```
`wf-recorder` uses `wlr-screencopy` directly (no portal/D-Bus) — the fastest way to
de-risk the GPU encode path. **Note:** this `wf-recorder` path encodes straight to a file and *cannot*
feed PipeWire; the real integration uses the ScreenCast portal (see the pipeline spike). If shell 1 logged
a Mesa/EGL fallback (or Sway dropped to pixman) instead of `EGL vendor: NVIDIA`, install the
NVIDIA GL userspace (§2) — the portal cannot capture a pixman output.
**An idle headless output produces no frames** (its frame clock is driven by damage); give
it a real refresh mode (`prepare-session.sh` does) *and* run something animated
(`swaymsg exec foot`) or the capture will be ~1 frame.
The wlroots-on-NVIDIA env workarounds (`WLR_RENDERER=gles2`, `WLR_NO_HARDWARE_CURSORS=1`,
`GBM_BACKEND=nvidia-drm`, `sway --unsupported-gpu`, …) live in
`scripts/headless/env.sh`; `source` it before launching anything Wayland.
## 4. The spike proper — wire it into `punktfunk-core`
Goal (plan §8): headless output → PipeWire ScreenCast → NVENC → a playable file, then feed
the encoded access units into a `punktfunk_core::Session` (host role). The module seams exist
in `crates/punktfunk-host/src/{vdisplay,capture,encode,inject,pipeline}.rs`.
**Status: implemented and verified end-to-end** in `crates/punktfunk-host` (`spike.rs`,
`capture/linux.rs`, `encode/linux.rs`). After the §3 bring-up:
```sh
source /tmp/punktfunk-sway-env.sh
swaymsg exec foot # animated content
# Live portal capture → NVENC HEVC → playable file, with each AU also round-tripped
# through a punktfunk_core host→client Session (FEC + packetize + reassemble) and verified:
cargo run -p punktfunk-host -- m0 --source portal --seconds 5 --out /tmp/punktfunk-m0.h265
ffprobe /tmp/punktfunk-m0.h265
# No capture session needed (encode + core only): --source synthetic
```
Verified result: `1920x1080` HEVC, ~300 frames in 5s, `punktfunk-core loopback … 0 mismatches`.
The portal negotiates packed **`RGB` (24-bit, 3 bytes per pixel)** on wlroots; the encoder expands it to
`rgb0` (one pad byte/pixel, no colour math) since NVENC accepts `rgb0`/`bgr0` but not
`rgb24`. dmabuf zero-copy import is still deferred (plan §9) — this is the CPU-copy path.
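The `RGB` to `rgb0` expansion described above can be sketched in plain Rust. This is a minimal illustration, not the code in `encode/linux.rs`: the function name `expand_rgb24_to_rgb0` and the explicit `src_stride` parameter are assumptions made for the example, and a real PipeWire buffer would supply its own stride.

```rust
/// Expand packed 24-bit RGB rows into 32-bit `rgb0` rows (R, G, B, pad byte).
/// Hypothetical helper for illustration only.
/// `src_stride` is the byte distance between source rows; it may be larger
/// than `width * 3` when the producer pads each row.
fn expand_rgb24_to_rgb0(src: &[u8], width: usize, height: usize, src_stride: usize) -> Vec<u8> {
    assert!(src_stride >= width * 3, "stride shorter than one packed row");
    if height > 0 {
        assert!(
            src.len() >= src_stride * (height - 1) + width * 3,
            "source buffer too small"
        );
    }
    // Zero-initialised, so every fourth byte is already the pad byte.
    let mut dst = vec![0u8; width * height * 4];
    for y in 0..height {
        let src_row = &src[y * src_stride..y * src_stride + width * 3];
        let dst_row = &mut dst[y * width * 4..(y + 1) * width * 4];
        for (s, d) in src_row.chunks_exact(3).zip(dst_row.chunks_exact_mut(4)) {
            // Copy R, G, B unchanged (no colour math); d[3] stays 0.
            d[..3].copy_from_slice(s);
        }
    }
    dst
}

fn main() {
    // 2x2 frame with 2 bytes of row padding (stride 8 instead of 6).
    let src = [
        1, 2, 3, 4, 5, 6, 0xAA, 0xAA,
        7, 8, 9, 10, 11, 12, 0xAA, 0xAA,
    ];
    let out = expand_rgb24_to_rgb0(&src, 2, 2, 8);
    assert_eq!(out, [1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9, 0, 10, 11, 12, 0]);
    println!("{} source bytes -> {} rgb0 bytes", src.len(), out.len());
}
```

The destination is tightly packed (`width * 4` bytes per row) in this sketch; an encoder frame with its own line size would need the same per-row treatment on the output side.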
Crate choices, verified current:
- **Capture (portal path):** [`ashpd`](https://docs.rs/ashpd) **0.13** with the
`screencast` feature (the `pipewire` feature is *not* needed — `open_pipe_wire_remote`
is unconditional). Flow (0.13 API, verified against the vendored source): `Screencast::new` →
`create_session(Default)` → `select_sources(&session, SelectSourcesOptions::default()
.set_sources(BitFlags::from_flag(SourceType::Monitor))…)` → `start(&session, None,
Default)` → `.response()?` → `Stream::pipe_wire_node_id()` + `open_pipe_wire_remote()`.
Note 0.13 takes **options structs**, not the old positional args, and defaults to the
**tokio** runtime — drive the handshake on a *multi-thread* tokio runtime (a
current-thread one starves zbus's reader and the portal reports "Invalid session").
Pull frames with [`pipewire`](https://docs.rs/pipewire) **0.9** — it must match the
pipewire crate ashpd 0.13 links (the `pipewire-sys` `links` key is unique per build, so
`0.10` fails to resolve). 0.9 uses `MainLoopRc`/`ContextRc::connect_fd_rc(OwnedFd)`/
`StreamBox`. Only request `SourceType::Monitor` — the wlr backend's
`AvailableSourceTypes` is `1` (Monitor only); asking for `Window`/`Virtual` invalidates
the session. Set `XDG_CURRENT_DESKTOP=sway` so the wlr portal backend is chosen, and
import it into the portal's environment (see "Portal bring-up" below).
- **Encode:** [`ffmpeg-next`](https://crates.io/crates/ffmpeg-next) **8.x** (binds the
system FFmpeg 8.x via pkg-config; needs `clang`/`libclang`). Select the encoder by
name — `encoder::find_by_name("hevc_nvenc")`, *not* by codec id (that's the SW encoder).
Low-latency opts: `preset=p1`, `tune=ull`, `rc=cbr`, `bf=0`, `delay=0`, large `g`.
If your FFmpeg is in a non-standard prefix, `export FFMPEG_DIR=/that/prefix`.
- **Zero-copy is the hard part.** There's no direct dmabuf→CUDA import in FFmpeg.
**Start with the CPU-copy fallback** (download frame → `hwupload_cuda` → `hevc_nvenc`)
to get an end-to-end stream, then chase true dmabuf zero-copy. The plan flags this
(§9) and the `capture` module already has a `cpu_bytes` fallback field.
- **Input (GameStream host):** [`reis`](https://crates.io/crates/reis) (pure-Rust libei — no native
`libei` needed) with `input-linux`/uinput as the universal fallback.
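As a rough illustration of the low-latency encoder options listed under **Encode**, the sketch below collects them as key/value pairs and renders the equivalent `ffmpeg` command-line flags. It uses only the standard library; the helper names `nvenc_low_latency_opts` and `to_cli_args` are hypothetical, and the GOP value `600` is a placeholder, since the text only says "large `g`". With `ffmpeg-next` the same pairs would presumably go into an options dictionary when the encoder is opened, which is not shown here.

```rust
/// Low-latency `hevc_nvenc` options from the list above, as key/value strings.
/// Hypothetical helper; `gop` stands in for the unspecified "large `g`".
fn nvenc_low_latency_opts(gop: u32) -> Vec<(&'static str, String)> {
    vec![
        ("preset", "p1".to_string()),
        ("tune", "ull".to_string()),
        ("rc", "cbr".to_string()),
        ("bf", "0".to_string()),
        ("delay", "0".to_string()),
        ("g", gop.to_string()),
    ]
}

/// Render the pairs as ffmpeg CLI flags: ("preset", "p1") becomes "-preset", "p1".
fn to_cli_args(opts: &[(&str, String)]) -> Vec<String> {
    opts.iter()
        .flat_map(|(k, v)| [format!("-{}", k), v.clone()])
        .collect()
}

fn main() {
    let opts = nvenc_low_latency_opts(600); // placeholder GOP length
    let args = to_cli_args(&opts);
    assert_eq!(args.len(), 12);
    println!("ffmpeg ... -c:v hevc_nvenc {}", args.join(" "));
}
```

Keeping the options in one table makes it easy to try the same settings from the `ffmpeg` CLI before wiring them into the encoder module.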
Then continue toward the **GameStream host**: `serverinfo`/RTSP/pairing enough for a stock Moonlight client
to connect, a KWin virtual output created on connect, input via reis/uinput — the
shippable milestone.
## Troubleshooting
| Symptom | Fix |
|---|---|
| Sway aborts on NVIDIA | add `--unsupported-gpu` (the helper scripts do) |
| `not a KMS device` / no connectors | expected on a headless VM GPU — use `WLR_BACKENDS=headless`, not the DRM backend |
| Sway won't start at all | `WLR_RENDERER_ALLOW_SOFTWARE=1 WLR_RENDERER=pixman` to prove the pipeline, then fix EGL |
| ScreenCast portal finds no output | ensure `xdg-desktop-portal-wlr` is running in the same session, `XDG_CURRENT_DESKTOP=sway`, and `~/.config/xdg-desktop-portal-wlr/config` has `output_name=HEADLESS-1` |
| `Cannot load libnvidia-encode.so.1` | NVENC runtime lib missing (driver) or unlicensed vGPU |
| `cargo build` can't find FFmpeg | `export FFMPEG_DIR=$(pkg-config --variable=prefix libavcodec)` or point `PKG_CONFIG_PATH` at the custom build |
| bindgen: libclang not found | `export LIBCLANG_PATH=$(llvm-config --libdir)` |
@@ -0,0 +1,448 @@
{
"summary": "Extract the exact GameStream/Moonlight P1 host protocol from Sunshine + moonlight-common-c",
"agentCount": 6,
"logs": [
"[research:control-input] failed: API Error: The socket connection was closed unexpectedly. For more information, pass `verbose: true` in the second argument to fetch()"
],
"result": [
{
"area": "GameStream HTTP serverinfo + pairing handshake (host side, what stock Moonlight expects)",
"summary": "A GameStream host runs two HTTP servers from the same NvHTTP code: plain HTTP on port 47989 (insecure, unauthenticated) and HTTPS with mutual TLS on 47984. Ports are derived from a base port (config default 47989) plus a signed offset: PORT_HTTP=0 -> 47989, PORT_HTTPS=-5 -> 47984. Moonlight first GETs /serverinfo over HTTP (before pairing) to read an XML document of capabilities and the host's pairing/running state; key fields it parses are hostname, appversion (its major version selects the pairing hash: >=7 -> SHA-256, else SHA-1; Sunshine advertises \"7.1.431.-1\"), GfeVersion, uniqueid, HttpsPort, ExternalPort, mac, MaxLumaPixelsHEVC, ServerCodecModeSupport (a bitmask: 3=H264-only, 259=+HEVC, 3843=+AV1), PairStatus, currentgame and state. Pairing is a 4-phase challenge/response over /pair driven entirely by repeated HTTP GETs with a `phrase` query param: getservercert, clientchallenge, serverchallengeresp, clientpairingsecret, followed by a final pairchallenge over HTTPS. The shared secret is an AES-128 key = SHA(salt(16) || PIN-as-utf8) truncated to 16 bytes; salt is client-generated random 16 bytes sent hex-encoded in phase 1. All pairing AES uses AES-128 in ECB mode with NO padding (inputs zero-extended to a 16-byte multiple, so a 32-byte SHA-256 hash is exactly two blocks). Each side proves it knows the PIN by exchanging encrypted random challenges and verifying SHA hashes that bind both X.509 cert signatures and a per-side 16-byte secret; the secrets are additionally RSA-SHA256-signed by each side's cert key and verified. On success the host stores the client's self-signed X.509 cert in an allow-list; thereafter every HTTPS request requires that exact client cert (cert pinning via a custom OpenSSL verify callback comparing the presented cert's PEM against stored authorized clients). 
All XML responses are an `<root status_code=\"200\">...</root>` tree; pairing replies carry `paired` (1/0) plus the phase-specific element (plaincert, challengeresponse, pairingsecret).",
"ports": [
"47989/tcp = HTTP (insecure NvHTTP), offset PORT_HTTP=0 from base port",
"47984/tcp = HTTPS (mutual-TLS NvHTTP), offset PORT_HTTPS=-5 from base port",
"Base/config port default = 47989; map_port(p) = config.port + p (uint16, warns if <1024 or >65535)",
"Related stream ports (offsets, not part of this area but advertised/used post-launch): 48010/tcp RTSP (+21), 47998/udp video (+9), 47999/udp control (+10), 48000/udp audio (+11), 48002/udp mic (+13)",
"Moonlight resolves stream ports via LiGetPortFromPortFlagIndex; TCP set {47984,47989,48010}, UDP set {47998,47999,48000,48010}"
],
"wire_formats": [
{
"name": "/serverinfo XML response",
"layout": "<root status_code=\"200\">\n <hostname>...</hostname>\n <appversion>7.1.431.-1</appversion> (VERSION; major>=7 => client uses SHA-256)\n <GfeVersion>3.23.0.74</GfeVersion> (GFE_VERSION)\n <uniqueid>...</uniqueid> (host unique id)\n <HttpsPort>47984</HttpsPort> (net::map_port(PORT_HTTPS))\n <ExternalPort>47989</ExternalPort> (net::map_port(PORT_HTTP))\n <MaxLumaPixelsHEVC>1869449984</MaxLumaPixelsHEVC> (or \"0\" if HEVC disabled)\n <mac>aa:bb:cc:dd:ee:ff</mac> (real MAC on HTTPS; \"00:00:00:00:00:00\" on plain HTTP)\n <LocalIP>...</LocalIP>\n <ServerCodecModeSupport>3843</ServerCodecModeSupport> (bitmask; 3=H264 only,259=+HEVC,3843=+AV1)\n <ExternalIP>...</ExternalIP> (conditional)\n <PairStatus>1</PairStatus> (1 if request carried a known uniqueid over HTTPS, else 0)\n <currentgame>0</currentgame> (0 if idle, else running app id)\n <state>SUNSHINE_SERVER_FREE</state> (or SUNSHINE_SERVER_BUSY; GFE uses MJOLNIR_IDLE/_SERVER_BUSY)\n</root>",
"notes": "Served on BOTH HTTP(47989) and HTTPS(47984). Plain-HTTP version hides mac and forces PairStatus=0. Element NAMES are case-sensitive and Moonlight requires hostname, appversion, PairStatus, currentgame, state, and at least one port. appversion MAJOR number is the SHA-1-vs-SHA-256 switch."
},
{
"name": "/pair phase 1 request+response (getservercert)",
"layout": "GET /pair?uniqueid=<id>&uuid=<uuid>&devicename=<name>&updateState=1&phrase=getservercert&salt=<32 hex chars = 16 bytes>&clientcert=<hex(PEM bytes)>\nResponse: <root status_code=\"200\"><paired>1</paired><plaincert>hex(server X.509 PEM)</plaincert></root>",
"notes": "salt is client-random 16 bytes, sent as 32 hex chars. Server validates salt length >=32 hex chars, takes first 16 bytes, computes AES key = SHA256(salt||PIN)[..16]. clientcert is the client's self-signed cert (kept for later signature checks and TLS pinning). If PIN not yet entered, server may stall here until the user enters it."
},
{
"name": "/pair phase 2 request+response (clientchallenge)",
"layout": "GET /pair?uniqueid=<id>&clientchallenge=<hex(ECB-encrypt(randomChallenge[16]))>\nResponse: <root status_code=\"200\"><paired>1</paired><challengeresponse>hex(ECB-encrypt(hash[H] || serverChallenge[16]))</challengeresponse></root>",
"notes": "Client sends AES-ECB(16-byte random). Server decrypts -> clientChallenge; computes hash = SHA( clientChallenge || serverCertSignature || serversecret[16 random] ); generates serverChallenge[16 random]; returns ECB(hash || serverChallenge). H = 32 (SHA-256) or 20 (SHA-1). With no padding, the ECB input is hash(32)+serverChallenge(16)=48 bytes = 3 blocks for SHA-256."
},
{
"name": "/pair phase 3 request+response (serverchallengeresp)",
"layout": "GET /pair?uniqueid=<id>&serverchallengeresp=<hex(ECB-encrypt(challengeRespHash[H]))>\nResponse: <root status_code=\"200\"><pairingsecret>hex( serversecret[16] || RSA-SHA256-sign(serversecret) )</pairingsecret><paired>1</paired></root>",
"notes": "Client decrypts phase-2 challengeresponse into [serverResponseHash(H) || serverChallenge(16)], generates clientSecret[16 random], computes challengeRespHash = SHA( serverChallenge || clientCert.signature || clientSecret ), sends ECB(challengeRespHash). Server stores decrypted value as clienthash for the phase-4 check, then returns its serversecret plus that secret signed by the server cert's private key (sign256)."
},
{
"name": "/pair phase 4 request+response (clientpairingsecret)",
"layout": "GET /pair?uniqueid=<id>&clientpairingsecret=<hex( clientSecret[16] || RSA-SHA256-sign(clientSecret) )>\nResponse: <root status_code=\"200\"><paired>1 or 0</paired></root>",
"notes": "Server splits into secret(16)+signature(rest). Builds data = serverChallenge || clientCert.signature || clientSecret, hashes it, compares to the clienthash stored in phase 3 (proves client knew PIN). Also verify256(clientCert, clientSecret, signature) (proves client owns its cert key). Both must pass; on success the client cert is added to the authorized allow-list. paired=0 means PIN/cert mismatch."
},
{
"name": "/pair final pairchallenge (HTTPS)",
"layout": "GET /pair?uniqueid=<id>&phrase=pairchallenge (over HTTPS:47984, presenting the now-trusted client cert)\nResponse: <root status_code=\"200\"><paired>1</paired></root>",
"notes": "Moonlight calls executePairingChallenge() after phase 4; it must succeed over the mutual-TLS connection using the freshly-paired client cert, confirming the cert pinning round-trips. If this fails Moonlight calls /unpair and reports FAILED."
},
{
"name": "Other HTTPS endpoints (params only, post-pair)",
"layout": "/applist -> XML list of <App><IsHdrSupported><AppTitle><ID>; /appasset?appid=&assetidx=&assettype= -> PNG; /launch?uniqueid=&appid=&mode=WxHxFPS&additionalStates=&sops=&rikey=<hex AES key>&rikeyid=<int>&localAudioPlayMode=&surroundAudioInfo=&hdrMode=&corever=; /resume?rikey=&rikeyid=&surroundAudioInfo=; /cancel (no params)",
"notes": "Out of this area's depth but listed for completeness. rikey/rikeyid carry the AES-128 RTSP/stream key (16-byte key as hex, plus a 4-byte key id) used to seal the control/RTSP plane; mode is 'WIDTHxHEIGHTxFPS'. /launch and /resume require an already-paired (pinned) HTTPS connection."
}
],
"flow": [
"0. Moonlight GET http://host:47989/serverinfo -> parses appversion major (>=7 => SHA-256 hash, else SHA-1), HttpsPort, PairStatus, state.",
"1. Client generates salt=random16 and a self-signed RSA-2048 X.509 cert. GET /pair?...&phrase=getservercert&salt=hex(salt)&clientcert=hex(certPEM). User enters PIN on host. Both sides compute aesKey = SHA(salt || pinUTF8)[0..16].",
"2. Server replies plaincert=hex(serverCertPEM), paired=1. Client stores server cert.",
"3. Client: randomChallenge=random16; GET ...&clientchallenge=hex(AES_ECB_enc(randomChallenge, aesKey)).",
"4. Server: decrypt -> clientChallenge; serversecret=random16; respHash=SHA(clientChallenge || serverCert.signature || serversecret); serverChallenge=random16; reply challengeresponse=hex(AES_ECB_enc(respHash || serverChallenge)).",
"5. Client: decrypt challengeresponse -> [serverRespHash(H) || serverChallenge(16)]; clientSecret=random16; challengeRespHash=SHA(serverChallenge || clientCert.signature || clientSecret); GET ...&serverchallengeresp=hex(AES_ECB_enc(challengeRespHash)).",
"6. Server: decrypt -> store as clienthash; reply pairingsecret=hex(serversecret || RSA_SHA256_sign(serversecret, serverPrivKey)), paired=1.",
"7. Client: split pairingsecret into [serversecret(16) || serverSig]; verify expected = SHA(randomChallenge || serverCert.signature || serversecret) (sanity) and RSA-verify serverSig over serversecret with server cert. GET ...&clientpairingsecret=hex(clientSecret || RSA_SHA256_sign(clientSecret, clientPrivKey)).",
"8. Server: split into [clientSecret(16) || clientSig]; recompute SHA(serverChallenge || clientCert.signature || clientSecret) and compare to stored clienthash; RSA-verify clientSig over clientSecret with clientCert. Both pass => add clientCert to authorized list; reply paired=1 (else paired=0).",
"9. Client: GET https://host:47984/pair?...&phrase=pairchallenge over mutual-TLS presenting its now-trusted cert; expects paired=1. Pairing complete -> PairState.PAIRED.",
"10. On any mismatch the client GET /unpair and returns FAILED / PIN_WRONG."
],
"crypto": "PIN-derived key: aesKey = HASH(salt[16] || PIN_utf8)[0..16], where HASH = SHA-256 if server appversion major >=7 (Sunshine: 7.1.431.-1) else SHA-1. Salt = client random 16 bytes. PIN is the 4-digit code shown/entered by user; concatenation is salt FIRST then pin. Pairing cipher: AES-128 in ECB mode, NO padding / padding DISABLED (Sunshine ecb_t(key,false); Moonlight AESLightEngine block loop). Inputs are zero-extended to a 16-byte multiple before encryption (a 32-byte SHA-256 hash = 2 blocks; respHash(32)+serverChallenge(16)=48=3 blocks). Per-side proofs: serverHash = SHA(clientChallenge || serverCert.signature || serversecret16); clientHash = SHA(serverChallenge || clientCert.signature || clientSecret16) (cert.signature = the DER signature bytes of the self-signed X.509). Identity binding: each side RSA-signs its own 16-byte secret with its cert's private key using RSA-PKCS1 over SHA-256 (sign256/verify256), other side verifies. Certs: self-signed RSA-2048, SHA-256 signed, ~20-year validity. Result of pairing is NOT a streaming key — it only establishes mutual TLS trust (pinned certs). The actual AES-128 STREAM key is delivered separately at /launch as rikey (16-byte hex) + rikeyid; that is where punktfunk-core's existing AES-128-GCM session crypto plugs in. IMPORTANT: pairing AES-ECB-no-pad is distinct from and unrelated to punktfunk-core's AES-128-GCM session sealing.",
"rust_options": "HTTP/HTTPS control plane belongs in crates/punktfunk-host/src/web.rs (the existing stub explicitly permits tokio/axum here, off the hot path). Use axum or hyper for the two servers. TLS with mutual auth + custom cert pinning: use rustls via axum-server/tokio-rustls with a custom ClientCertVerifier (rustls::server::danger::ClientCertVerifier) that accepts any well-formed cert at handshake time and then matches the presented leaf DER/PEM against the paired allow-list (mirror Sunshine's verify callback), OR use openssl/openssl crate to match Sunshine 1:1. XML: build/parse with quick-xml or xml-rs (or just format! the small fixed templates and a tiny extractor). Crypto: aes crate (already a dep transitively) in ECB mode via the `ecb` crate with NoPadding (Aes128 + ecb::Decryptor/Encryptor, manual block handling) — note RustCrypto deprecates ECB so call the block cipher directly (aes::Aes128 + cipher::BlockEncrypt/BlockDecrypt over 16-byte chunks). Hashing: sha2 (SHA-256) and sha1 crates. X.509 self-signed cert generation + RSA-SHA256 sign/verify: rcgen for cert gen, rsa + sha2 for PKCS1v15 sign/verify, x509-parser or x509-cert to extract the cert's signature bytes (cert.getSignature() equivalent) and to do TLS-trust comparison. Persist authorized client certs (PEM) in a small JSON/sled store. Run all of this on tokio in web.rs; keep it fully separate from the native-thread per-frame pipeline.",
"reuse_from_punktfunk": "REUSE little of punktfunk-core's crypto here — its crypto.rs is AES-128-GCM session sealing (nonce = salt||seq, seq as AAD) for the VIDEO/INPUT plane, which corresponds to the post-/launch rikey stream key, NOT the pairing handshake. Pairing needs AES-128-ECB-no-pad + SHA-256/SHA-1 + RSA, none of which exist in punktfunk yet and must be newly built (best placed in punktfunk-host, not punktfunk-core, since it is control-plane only). The natural seam is the existing web.rs stub (WebConfig::run) whose TODO already says 'GameStream serverinfo, pairing handshake, RTSP SETUP' — implement the two HTTP servers and the 4-phase /pair state machine there. punktfunk-core's AES-128-GCM SessionCrypto IS reusable downstream: once paired and /launch hands over rikey (16-byte AES key) + rikeyid, feed that key into punktfunk-core's Session/SessionCrypto for the encrypted video/control planes. The internal 40-byte packet format is unrelated to this HTTP/pairing area. So: new pairing crypto + axum servers in punktfunk-host/web.rs (control), reuse punktfunk-core GCM for the data plane post-launch.",
"gotchas": [
"appversion MAJOR number is load-bearing: it silently switches the client's pairing hash. Advertise major >=7 (e.g. \"7.1.431.-1\") to get SHA-256; advertise <7 and the client uses SHA-1 with 20-byte hashes (changes all the ECB block counts). Mismatch => silent pairing failure.",
"AES-ECB has NO padding and NO IV. Do not use a library that auto-pads (PKCS7) — Sunshine passes ecb_t(key,false). Inputs are zero-extended to 16-byte multiples; a 32-byte SHA-256 hash is exactly 2 blocks but respHash(32)+serverChallenge(16)=48 must be encrypted as one 48-byte buffer.",
"The hash inputs use cert.getSignature() = the X.509 DER SIGNATURE bytes (the signatureValue), NOT the cert body, NOT a hash of the cert. Getting this field wrong is the most common pairing bug.",
"Salt+PIN order is salt FIRST then PIN (UTF-8 ascii digits). PIN is the literal 4-char string, not parsed as an integer.",
"Pairing does NOT yield the stream key. The streaming AES key arrives later as /launch?rikey=<hex>&rikeyid=<n>. Don't conflate the PIN-derived ECB key with the GCM stream key.",
"Two distinct servers: /serverinfo and /pair answer on BOTH 47989 (HTTP) and 47984 (HTTPS); /applist,/appasset,/launch,/resume,/cancel are HTTPS-ONLY and require the pinned client cert. The plain-HTTP /serverinfo must zero out mac and force PairStatus=0.",
"TLS pinning is custom: standard CA validation is bypassed; the verify callback accepts the handshake then checks the presented leaf cert's PEM against the stored authorized-client list. Client certs are self-signed, so a normal rustls verifier would reject them — you must supply a permissive ClientCertVerifier + post-hoc allow-list match.",
"All responses wrap in <root status_code=\"200\"> ... </root>; the status_code attribute (and HTTP 200) both matter. paired=1/0 appears in every pairing reply and the client checks it as a string \"1\".",
"RSA signatures are PKCS#1 v1.5 over SHA-256 (sign256/verify256). Use the cert's RSA key; ECDSA certs would change this.",
"ServerCodecModeSupport is a bitmask, advertise the decimal: 3 (H264 only), 259 (H264+HEVC), 3843 (H264+HEVC+AV1); flags SCM_H264=0x1, SCM_HEVC=0x100, SCM_HEVC_MAIN10=0x200, SCM_AV1_MAIN8=0x10000, SCM_AV1_MAIN10=0x20000.",
"If the user hasn't entered the PIN yet, the host stalls the phase-1 response until it's entered (or until clientchallenge); design the state machine to await PIN input keyed by uniqueid.",
"Each pairing phase validates the previous phase happened for that uniqueid; the host keeps per-client pairing session state (last_phase, cipher_key, serversecret, serverChallenge, clienthash) across the 4 separate HTTP GETs."
],
"sources": [
"Sunshine src/nvhttp.cpp: serverinfo() (root.hostname/appversion/GfeVersion/uniqueid/HttpsPort/ExternalPort/MaxLumaPixelsHEVC/mac/LocalIP/ServerCodecModeSupport/PairStatus/currentgame/state; codec_mode_flags ORing SCM_*; MaxLumaPixelsHEVC=\"1869449984\"; state SUNSHINE_SERVER_FREE/BUSY) and pair() 4-phase state machine (getservercert/clientchallenge/serverchallengeresp/clientpairingsecret; ecb_t(*cipher_key,false); crypto::hash; sign256; verify256; add_authorized_client) and the https_server.verify pinning callback",
"Sunshine src/nvhttp.h: VERSION=\"7.1.431.-1\", GFE_VERSION=\"3.23.0.74\", PORT_HTTP=0, PORT_HTTPS=-5",
"Sunshine src/crypto.cpp: gen_aes_key (salt||pin -> SHA-256 -> first 16 bytes), hash() = EVP_sha256, sign256/verify256 = EVP_sha256, ecb_t/gcm_t/cbc_t (EVP_aes_128_ecb/gcm/cbc, EVP_CIPHER_CTX_set_padding), AES-128 throughout",
"Sunshine src/network.cpp: map_port(p) = config::sunshine.port + p",
"moonlight-android PairingManager.java: serverMajorVersion>=7 -> Sha256PairingHash else Sha1PairingHash; saltPin (salt then pin utf-8); generateAesKey = copyOf(hash,16); encryptAes/decryptAes AESLightEngine ECB; performBlockCipher 16-byte block loop with zero-pad blockRoundedSize; phase byte orders (clientchallenge=enc(random16); challengeRespHash=hash(serverChallenge||cert.signature||clientSecret); clientPairingSecret=clientSecret||signData(clientSecret)); Sha256PairingHash.getHashLength=32 / Sha1=20; signData=SHA256withRSA; verifySignature; PairState{NOT_PAIRED,PAIRED,PIN_WRONG,FAILED,ALREADY_IN_PROGRESS}; executePairingChallenge final pairchallenge",
"moonlight-common-c src/Limelight.h: SCM_H264=0x1, SCM_HEVC=0x100, SCM_HEVC_MAIN10=0x200, SCM_AV1_MAIN8=0x10000, SCM_AV1_MAIN10=0x20000, SCM_*_444 flags; VIDEO_FORMAT_* constants; default ports 47984/47989/48010 tcp, 47998/47999/48000/48010 udp",
"DeepWiki LizardByte/Sunshine NVHTTP page: ServerCodecModeSupport decimal values 3 / 259 / 3843",
"punktfunk repo (local): crates/punktfunk-host/src/web.rs (WebConfig stub, control-plane seam) and crates/punktfunk-core/src/crypto.rs (AES-128-GCM SessionCrypto = data-plane, not pairing)"
]
},
{
"area": "RTSP handshake + SDP + stream config negotiation (GameStream / Sunshine ↔ Moonlight)",
"summary": "GameStream negotiation is two phases. Phase 1 is an HTTPS GET to /launch (or /resume) on port 47984 where the client passes the session parameters as URL query args, most importantly rikey (a 32-hex-char = 16-byte AES-128 key) and rikeyid (a signed 32-bit int). The host derives the per-stream AES-GCM key directly from rikey and a 16-byte IV from rikeyid as a big-endian uint32 placed in the first 4 bytes of a zero-filled 16-byte buffer (Sunshine make_launch_session). Phase 2 is a GameStream-flavored RTSP/1.0 exchange over TCP on port 48010 (RTSP_SETUP_PORT = base 47989 + 21). The sequence is OPTIONS → DESCRIBE → SETUP(audio) → SETUP(video) → SETUP(control) → ANNOUNCE → PLAY, each carrying a CSeq header. DESCRIBE returns an SDP-ish body of a= attributes advertising host capabilities (x-ss-general.featureFlags, encryptionSupported/Requested, refPicInvalidation, AV1 rtpmap, Opus surround-params). SETUP returns Session: DEADBEEFCAFE;timeout = 90 plus Transport: server_port=<port> and an X-SS-Connect-Data (control) or X-SS-Ping-Payload (A/V) header. ANNOUNCE is where the client sends the actual negotiated stream config as an SDP body of x-nv-video[0].*, x-nv-vqos[0].*, x-nv-general.*, x-nv-audio.*, x-nv-aqos.*, x-ss-video[0].*, x-ml-* attributes (resolution, fps, bitrate, packetSize, fecPercentage, codec/bitStreamFormat, HDR, surround, encryptionEnabled). PLAY simply ACKs and the host begins sending RTP. When encryption is negotiated the RTSP messages themselves are wrapped in an encrypted framing (typeAndLength MSB-flagged + sequenceNumber + 16-byte GCM tag) keyed by the same gcm_key with IV bytes 10/11 = direction+'R'.",
"ports": [
"47984 = HTTPS control (PORT_HTTPS, base-5); serves /launch, /resume, /serverinfo, /pair over TLS",
"47989 = HTTP control (base port; PORT_HTTP) — unauthenticated /serverinfo etc.",
"48010 TCP = RTSP_SETUP_PORT (base 47989 + 21) — the RTSP handshake; this is the task's stated port",
"47998 UDP = VIDEO_STREAM_PORT (base + 9) — RTP video, returned in SETUP video Transport server_port",
"47999 UDP = CONTROL_PORT (base + 10) — ENet/control + remote input, returned in SETUP control",
"48000 UDP = AUDIO_STREAM_PORT (base + 11) — RTP audio, returned in SETUP audio",
"All ports are base+offset via net::map_port(); base is configurable (default 47989). Moonlight overrides VideoPortNumber/AudioPortNumber/ControlPortNumber from the SETUP Transport server_port= field, with fallbacks video=47998, audio=48000, control=47999"
],
"wire_formats": [
{
"name": "RTSP request line + headers (plaintext)",
"layout": "ASCII text, CRLF line endings. Request line: <METHOD> <target> RTSP/1.0\\r\\n. Methods: OPTIONS, DESCRIBE, SETUP, ANNOUNCE, PLAY. Common headers: CSeq: <n>\\r\\n (monotonic, set by client currentSeqNumber), X-GS-ClientVersion: <n> (AppVersionQuad[0] map 3→10,4→11,5→12,7→14), Host: <addr> (TCP only). Header block terminated by \\r\\n\\r\\n. Body (if any) preceded by Content-type: application/sdp and Content-length: <bytes>.",
"notes": "Parsed by moonlight-common-c parseRtspMessage(), which Sunshine vendors in. NOT 100% standard RTSP — targets use streamid= scheme and there is an alternate ENet 'rtspru://' transport for ancient GFE (<v404); modern Moonlight uses raw TCP RTSP/1.0."
},
{
"name": "SETUP target (request URI)",
"layout": "Modern (GFE≥5): streamid=audio/0/0 , streamid=video/0/0 , streamid=control/13/0 (control id is 13 for GFE≥7.1.431, else 1). Legacy (GFE<5): streamid=audio , streamid=video. Sunshine parses by finding '=' then '/': type = audio|video|control.",
"notes": "Sunshine maps audio→AUDIO_STREAM_PORT(48000), video→VIDEO_STREAM_PORT(47998), control→CONTROL_PORT(47999); unknown type → 404."
},
{
"name": "SETUP response headers",
"layout": "CSeq: <echo>\\r\\nSession: DEADBEEFCAFE;timeout = 90\\r\\nTransport: server_port=<port>\\r\\n then ONE of: X-SS-Connect-Data: <control_connect_data> (for control stream) OR X-SS-Ping-Payload: <av_ping_payload> (for audio/video). 200 OK.",
"notes": "Session string is the literal constant DEADBEEFCAFE;timeout = 90 (note the spaces around '='). server_port is the UDP port the client must send/recv on. X-SS-Ping-Payload is the per-session magic the client must send as the first UDP datagram to A/V ports so the host learns the client's source port; X-SS-Connect-Data likewise for control."
},
{
"name": "DESCRIBE response body (host capabilities SDP)",
"layout": "Newline-joined a= lines (Sunshine cmd_describe builds via stringstream, each << std::endl):\\n a=x-ss-general.featureFlags:<platf::get_capabilities() uint32>\\n a=x-ss-general.encryptionSupported:<flags>\\n a=x-ss-general.encryptionRequested:<flags>\\n a=x-nv-video[0].refPicInvalidation:1 (only if encoder supports RFI)\\n sprop-parameter-sets=AAAAAU (emitted unless HEVC-only forced; HEVC indicator)\\n a=rtpmap:98 AV1/90000 (emitted unless AV1 forced off)\\n a=fmtp:97 surround-params=<channelCount><streams><coupledStreams><mapping digits...> (one per audio::stream_configs entry; 5.1/7.1 rotate mapping)\\n a=rtpmap audio is implied via fmtp:97 / fmtp:96.",
"notes": "encryptionSupported default = SS_ENC_CONTROL_V2|SS_ENC_AUDIO (=0x05); adds SS_ENC_VIDEO(0x02) unless encryption mode NEVER. encryptionRequested default = SS_ENC_CONTROL_V2 (=0x01); adds VIDEO|AUDIO if mode MANDATORY. Moonlight scans this body for 'AV1/90000', 'sprop-parameter-sets=AAAAAU' (HEVC), refPicInvalidation, the x-ss-general.* flags, and the fmtp:97 surround-params (channelCount, streamCount, coupledStreams, channel mapping for Opus multistream)."
},
{
"name": "ANNOUNCE request body (client→host negotiated config SDP)",
"layout": "Body: v=0\\r\\no=android 0 <sessver> IN <af> <addr>\\r\\ns=NVIDIA Streaming Client\\r\\n then a=<name>:<value> lines, then t=0 0\\r\\nm=video <port> \\r\\n. Sunshine cmd_announce splits on lines, takes s= as client name and a=name:value into a map. Keys parsed into stream::config_t: x-nv-video[0].clientViewportWd (width), x-nv-video[0].clientViewportHt (height), x-nv-video[0].maxFPS (fps int), x-nv-video[0].clientRefreshRateX100 (fps*100), x-nv-video[0].packetSize (MTU payload), x-nv-video[0].videoEncoderSlicesPerFrame (slices), x-nv-video[0].maxNumReferenceFrames, x-nv-video[0].encoderCscMode, x-nv-video[0].dynamicRangeMode (HDR 0/1), x-nv-vqos[0].bitStreamFormat (codec: 0=H264,1=HEVC,2=AV1), x-nv-vqos[0].bw.maximumBitrateKbps (bitrate), x-nv-vqos[0].fec.minRequiredFecPackets, x-nv-vqos[0].qosTrafficType, x-nv-audio.surround.numChannels, x-nv-audio.surround.channelMask, x-nv-audio.surround.AudioQuality, x-nv-aqos.packetDuration, x-nv-aqos.qosTrafficType, x-nv-general.useReliableUdp (controlProtocolType, 13 or 1), x-nv-general.featureFlags (bit 0x20 ⇒ enable SS_ENC_AUDIO), x-ml-general.featureFlags, x-ml-video.configuredBitrateKbps, x-ss-general.encryptionEnabled (the actual negotiated enc bitmask), x-ss-video[0].chromaSamplingType (0=4:2:0,1=4:4:4), x-ss-video[0].intraRefresh.",
"notes": "Sunshine fills defaults via try_emplace before reading: encoderCscMode=0, bitStreamFormat=0, dynamicRangeMode=0, packetDuration=5, useReliableUdp=1, fec.minRequiredFecPackets=0, x-nv-general.featureFlags=135, x-ml-general.featureFlags=0, vqos qosTrafficType=5, aqos qosTrafficType=4, configuredBitrateKbps=0, encryptionEnabled=0, chromaSamplingType=0, intraRefresh=0, clientRefreshRateX100=0. Missing REQUIRED keys (width/height/fps/packetSize/bitrate/channels) throw std::out_of_range → 400 BAD REQUEST. Moonlight's SdpGenerator.c additionally emits but Sunshine ignores: rateControlMode=4, timeoutLengthMs=7000, framesWithInvalidRefThreshold=0, fec.enable=1, fec.repairPercent (5 or 20), fec.minRequiredFecPackets=2, bllFec.enable=0, videoQualityScoreUpdateTime=5000, bw.minimumBitrateKbps, initialBitrateKbps, x-nv-clientSupportHevc, surround.enable, enableRecoveryMode=0, x-nv-ri.useControlChannel=1."
},
{
"name": "Encrypted RTSP framing (when SS_ENC_CONTROL_V2 negotiated)",
"layout": "struct encrypted_rtsp_header_t { uint32_t typeAndLength; /*big-endian; MSB ENCRYPTED_MESSAGE_TYPE_BIT=0x80000000 set, low bits = payload length*/ uint32_t sequenceNumber; /*big-endian, monotonic*/ uint8_t tag[16]; /*AES-128-GCM auth tag*/ }; followed by ciphertext payload (the plaintext RTSP message).",
"notes": "AES-GCM IV (12 bytes, NIST SP800-38D 8.2.1): bytes[0..4]=sequenceNumber big-endian, byte[10]='C'(client-originated) or 'H'(host-originated), byte[11]='R' (RTSP). Decrypt input is tag||ciphertext. Plaintext RTSP (no enc) is delimited by \\r\\n\\r\\n."
},
{
"name": "/launch HTTPS query (where RIKEY/RIKEYID arrive)",
"layout": "GET https://<host>:47984/launch?uniqueid=...&appid=<id>&mode=<W>x<H>x<fps>&additionalStates=...&sops=<0|1>&rikey=<32 hex chars>&rikeyid=<int32>&localAudioPlayMode=<0|1>&surroundAudioInfo=<packed>&remoteControllersBitmap=...&gcmap=...&hdrMode=<0|1>&clientHdrCapabilities=...&corever=<n>. /resume is the same minus appid/mode.",
"notes": "Sunshine requires rikey, rikeyid, localAudioPlayMode, appid present or 400. rikey = util::from_hex_vec(rikey,true) → gcm_key (16 bytes). iv = 16-byte buffer with big-endian uint32(rikeyid) in the FIRST 4 bytes, rest zero. mode split on 'x' → width/height/fps. surroundAudioInfo default 196610 (=0x30002 ⇒ 2-channel mask 0x3 high16=channels). corever decides whether encrypted RTSP is used. The launch response gives root.sessionUrl0 = <rtsp_url_scheme><addr>:48010 telling Moonlight where to open RTSP."
}
],
"flow": [
"0. (HTTPS, prior to RTSP) Client GETs /serverinfo, then /launch?...&rikey=<hex16>&rikeyid=<int>&mode=WxHxF&... on port 47984. Host stores gcm_key=hex(rikey), iv=BE32(rikeyid)||zeros, parses mode/sops/surroundAudioInfo/hdrMode, allocates the launch_session, and responds with root.sessionUrl0=rtsp[enc]://<addr>:48010.",
"1. Client opens TCP to 48010 and sends OPTIONS <rtspTargetUrl> RTSP/1.0 with CSeq:1, X-GS-ClientVersion. Host replies 200 OK echoing CSeq (cmd_option).",
"2. Client sends DESCRIBE <rtspTargetUrl> RTSP/1.0 (CSeq:2, Accept/If-Modified-Since). Host replies 200 OK with the a= capability body (featureFlags, encryptionSupported/Requested, refPicInvalidation, AV1 rtpmap, surround-params).",
"3. Client sends SETUP streamid=audio/0/0 (CSeq:3). Host replies Session: DEADBEEFCAFE;timeout = 90, Transport: server_port=48000, X-SS-Ping-Payload:<...>.",
"4. Client sends SETUP streamid=video/0/0 (CSeq:4). Host replies Transport: server_port=47998, X-SS-Ping-Payload:<...>.",
"5. Client sends SETUP streamid=control/13/0 (CSeq:5). Host replies Transport: server_port=47999, X-SS-Connect-Data:<...>. (Moonlight latches these server_port values into Video/Audio/ControlPortNumber.)",
"6. Client sends ANNOUNCE (CSeq:6) with Content-type: application/sdp and the full x-nv-*/x-ss-*/x-ml-* config body (resolution, fps, bitrate, packetSize, fecPercent, bitStreamFormat codec, dynamicRangeMode HDR, surround, encryptionEnabled). Host parses into config_t; missing required key → 400. 200 OK on success.",
"7. Client sends PLAY (CSeq:7; single PLAY '/' for GFE≥7.1.431, else per-stream). Host replies 200 OK and the RTP video/audio + control/input flows begin on the UDP ports. First UDP datagram from client on each A/V port carries the X-SS-Ping-Payload so the host learns the client source port (NAT punch / port-learn)."
],
"crypto": "RIKEY/RIKEYID origin: the Moonlight client app generates remoteInputAesKey[16] and remoteInputAesIv[16] (STREAM_CONFIGURATION in Limelight.h) before connecting; rikey = hex(remoteInputAesKey), rikeyid = a 32-bit int. They are delivered to the host NOT in RTSP but in the HTTPS /launch query string. Sunshine make_launch_session: gcm_key = from_hex_vec(rikey) (16 bytes, the AES-128-GCM key shared by video/audio/control/input/RTSP ciphers); iv (16 bytes) = big-endian uint32(rikeyid) in bytes[0..4], zero-padded. AES-128-GCM is used everywhere. Per-stream 12-byte GCM nonce construction (Sunshine stream.cpp / rtsp.cpp): VIDEO (host→client): bytes[0..]=gcm_iv_counter (LE incrementing), byte[11]='V'. CONTROL host→client: 12-byte iv = seq(LE) || byte[10]='H', byte[11]='C'. CONTROL client→host: seq(LE) || byte[10]='C', byte[11]='C'. RTSP (encrypted handshake): seq(BE in bytes[0..4]) || byte[10]='C'(client)|'H'(host), byte[11]='R'. Encryption is negotiated, not key-exchanged: DESCRIBE advertises x-ss-general.encryptionSupported (default SS_ENC_CONTROL_V2|SS_ENC_AUDIO=0x05, +SS_ENC_VIDEO=0x02 if allowed) and encryptionRequested (default SS_ENC_CONTROL_V2=0x01, +VIDEO|AUDIO if MANDATORY); the client echoes the chosen bitmask in ANNOUNCE x-ss-general.encryptionEnabled, and x-nv-general.featureFlags bit 0x20 forces SS_ENC_AUDIO on. Flags: SS_ENC_CONTROL_V2=0x01, SS_ENC_VIDEO=0x02, SS_ENC_AUDIO=0x04. Codec tag bitStreamFormat: 0=H264,1=HEVC,2=AV1 (client capability bits VIDEO_FORMAT_H264=0x0001,H265=0x0100,H265_MAIN10=0x0200,AV1_MAIN8=0x1000,AV1_MAIN10=0x2000).",
"rust_options": "For RTSP on 48010: a tiny synchronous TCP server using std::net::TcpListener on a native thread (NO tokio — keep off the hot path, consistent with punktfunk's no-async-on-hot-path invariant). Parse RTSP/1.0 manually: read until \\\\r\\\\n\\\\r\\\\n, split request line + headers, read Content-length body for ANNOUNCE. There is no need for a crate; a hand-rolled parser mirroring Sunshine's parseRtspMessage is simplest and avoids pulling RTSP libs that assume standard semantics (the streamid= targets and DEADBEEFCAFE session break them). For SDP build/parse, just format!/split on lines — it is line-oriented a=key:value. For the /launch HTTPS endpoint, reuse the existing crates/punktfunk-host/src/web.rs seam; a small hyper or tiny_http + rustls TLS server (control plane only, async OK here since it is not the hot path — matches the 'quic feature gated' precedent). For encrypted-RTSP framing use the aes-gcm crate already in punktfunk-core. Suggested new types: an RtspServer in punktfunk-host that produces a punktfunk_core::Config from the ANNOUNCE map. Hex decode rikey with the `hex` crate (or from_str_radix). Big-endian rikeyid → IV with u32::to_be_bytes.",
"reuse_from_punktfunk": "REUSE: punktfunk-core/src/crypto.rs SessionCrypto already implements AES-128-GCM with per-direction salting and seq-as-AAD — but NOTE it is NOT byte-compatible with GameStream's nonce layout. punktfunk uses nonce = salt(4) || seq(8, BE) with a direction bit folded into salt[0], whereas GameStream uses iv = seq(LE or BE per stream) with literal direction/stream marker bytes at [10]/[11] ('V', 'H'/'C'+'C', 'C'/'H'+'R'). To talk to stock Moonlight you must add a GameStream-exact nonce mode (new constructor or a feature) rather than reuse the existing salt scheme verbatim. The Aes128Gcm cipher init and seal/open plumbing are reusable. REUSE: punktfunk-core Config/FecConfig — fec_percent maps to GameStream's repairPercent and the recovery_for() ceil(k*pct/100) already matches GameStream's FEC math; map ANNOUNCE packetSize→shard_payload, maximumBitrateKbps→bitrate, fec.minRequiredFecPackets→minRequiredFecPackets. FecScheme::Gf8 is the GameStream-compatible field. BUILD NEW: the entire RTSP/SDP/launch negotiation layer (punktfunk's internal 40-byte packet format and Config are not wire-exact to RTP/RTSP); the RTSP server, SDP describe/announce codec, the /launch query parser that produces gcm_key+iv from rikey/rikeyid, and the GameStream RTP video/audio packetization + RTPFEC are all new (separate areas). The Session in punktfunk-core can consume the negotiated Config but its on-wire packet header must be swapped for GameStream RTP for Moonlight compat.",
"gotchas": [
"punktfunk-core's AES-GCM nonce layout is NOT GameStream-compatible (salt+BE-seq vs literal 'V'/'C'/'R' marker bytes + LE/BE seq). A stock Moonlight will fail auth unless you implement the exact per-stream IV construction. This is the single biggest bridging hazard.",
"rikeyid is parsed as a SIGNED int then cast to big-endian uint32 in Sunshine (util::from_view → int → endian::big<uint32_t>). Negative rikeyid values wrap; match the signed-int→BE-u32 path exactly.",
"The IV from /launch (BE32(rikeyid)||zeros, 16 bytes) is the *base*; the actual per-packet 12-byte GCM nonce is rebuilt per stream with seq + marker bytes — do not just use the 16-byte launch IV directly.",
"Session header value is the literal 'DEADBEEFCAFE;timeout = 90' WITH spaces around '='. Moonlight is lenient but match it.",
"SETUP Transport must be exactly 'server_port=<port>' (Moonlight greps for 'server_port='); it also expects X-SS-Ping-Payload (A/V) / X-SS-Connect-Data (control) headers and will send that payload as the first UDP datagram for port-learning — the host must accept it.",
"ANNOUNCE keys are case-sensitive and bracketed exactly 'x-nv-video[0].' — the [0] index is literal. Missing a REQUIRED key (width/height/fps/packetSize/bitrate/channels) yields 400; supply the same try_emplace defaults Sunshine uses or Moonlight builds may omit them.",
"Codec is bitStreamFormat (0/1/2), but capability advertisement in DESCRIBE uses sprop-parameter-sets=AAAAAU (HEVC marker) and a=rtpmap:98 AV1/90000 — Moonlight infers codec support from those strings, so emit/omit them to steer codec.",
"fec.repairPercent (GFE<7.1.431) vs fec.minRequiredFecPackets (GFE≥7.1.431) — newer clients send the latter; Sunshine reads minRequiredFecPackets and defaults repairPercent handling. Handle both.",
"Stream targets differ by GFE version: modern 'streamid=video/0/0' and 'streamid=control/13/0', legacy 'streamid=video'. Parse by splitting on '=' and '/', taking the type token, like Sunshine.",
"There is an alternate ENet 'rtspru://' RTSP transport for very old GFE (<v404); stock modern Moonlight uses raw TCP RTSP/1.0 on 48010, so TCP-only is sufficient for current clients (note legacy ENet path as unsupported).",
"Encrypted-RTSP is only used when corever/encryption is negotiated; you must support BOTH plaintext (\\r\\n\\r\\n delimited) and the encrypted_rtsp_header_t framing depending on the negotiated SS_ENC_CONTROL_V2 flag."
],
"sources": [
"Sunshine src/rtsp.cpp — cmd_option, cmd_describe (a=x-ss-general.featureFlags/encryptionSupported/encryptionRequested, refPicInvalidation, sprop-parameter-sets=AAAAAU, rtpmap:98 AV1/90000, fmtp:97 surround-params), cmd_setup (Session DEADBEEFCAFE;timeout = 90, Transport server_port=, X-SS-Connect-Data/X-SS-Ping-Payload, audio/video/control→ports), cmd_announce (full x-nv-*/x-ss-*/x-ml-* config_t parse + try_emplace defaults), cmd_play, encrypted_rtsp_header_t + IV construction",
"Sunshine src/rtsp.h — RTSP_SETUP_PORT = 21 (offset), launch_session_t (gcm_key, iv, rtsp_cipher)",
"Sunshine src/nvhttp.cpp — make_launch_session: rikey hex→gcm_key, rikeyid→BE uint32 IV[0..4]; /launch & /resume required args (rikey,rikeyid,localAudioPlayMode,appid); mode WxHxF, sops, surroundAudioInfo default 196610, hdrMode, corever; sessionUrl0=rtsp://addr:48010; serverinfo HttpsPort/ExternalPort",
"Sunshine src/stream.cpp — VIDEO/AUDIO/CONTROL port usage via net::map_port; per-stream GCM IV (iv[11]='V' video, [10]='H'/'C'+[11]='C' control); SS_ENC_CONTROL_V2/VIDEO/AUDIO flag usage",
"moonlight-common-c src/RtspConnection.c — performRtspHandshake sequence (OPTIONS,DESCRIBE,SETUP audio/video/control,ANNOUNCE,PLAY), streamid=audio/0/0|video/0/0|control/13/0 (or /1/0), Transport server_port fallbacks (video 47998,audio 48000,control 47999), RTSP/1.0, X-GS-ClientVersion mapping, useEnet/rtspru legacy path, encrypted rtspenc:// framing",
"moonlight-common-c src/SdpGenerator.c — full ANNOUNCE SDP: v=0/o=android/s=NVIDIA Streaming Client, x-nv-video[0].clientViewportWd/Ht/maxFPS/packetSize/rateControlMode=4/timeoutLengthMs=7000/videoEncoderSlicesPerFrame/dynamicRangeMode/maxNumReferenceFrames/clientRefreshRateX100/encoderCscMode, x-nv-vqos[0].bitStreamFormat/bw.max-min/fec.enable=1/fec.minRequiredFecPackets=2/fec.repairPercent/bllFec.enable=0/qosTrafficType, x-nv-audio.surround.*, x-nv-aqos.*, x-nv-general.useReliableUdp=13/featureFlags, x-ss-video[0].chromaSamplingType, x-ml-* attributes",
"moonlight-common-c src/Limelight.h — ENCFLG_*, STREAM_CFG_*, STREAM_CONFIGURATION.remoteInputAesKey[16]/remoteInputAesIv[16], VIDEO_FORMAT_H264 0x0001/H265 0x0100/H265_MAIN10 0x0200/AV1_MAIN8 0x1000/AV1_MAIN10 0x2000, COLORSPACE_*, COLOR_RANGE_*, CAPABILITY_*",
"moonlight-common-c src/Limelight-internal.h — SS_ENC_CONTROL_V2 0x01, SS_ENC_VIDEO 0x02, SS_ENC_AUDIO 0x04; RtspPortNumber/Control/Audio/VideoPortNumber externs",
"moonlight-common-c src/Connection.c — port number init (set from RTSP SETUP), resolveHostName base 47984",
"DeepWiki LizardByte/Sunshine port management — base 47989 with offsets: video 47998 (+9), control 47999 (+10), audio 48000 (+11), RTSP 48010 (+21) via net::map_port (1024-65535 validated)",
"punktfunk-core src/crypto.rs and src/config.rs (read directly) — existing AES-128-GCM SessionCrypto (salt+BE-seq nonce, NOT GameStream-exact) and Config/FecConfig for reuse assessment"
]
},
{
"area": "GameStream video stream wire format + FEC + encryption (host→client video plane)",
"summary": "Each video UDP datagram is one FEC shard: a 12-byte RTP_PACKET, then 4 reserved bytes, then a 16-byte NV_VIDEO_PACKET header, then the shard payload. RTP scalar fields are BIG-endian (network order); NV_VIDEO_PACKET scalar fields (streamPacketIndex, frameIndex, fecInfo) are LITTLE-endian. packetType is unused for video on the wire (Sunshine never sets it for video; client reads dataOffset from FLAG_EXTENSION instead). The frame's encoded bitstream is prefixed with an 8-byte video_short_frame_header_t (frame type, last-payload-len, processing latency), then split into one of 1..4 FEC blocks, each block striped into fixed-size data shards of blocksize = packetSize + MAX_RTP_HEADER_SIZE(16). Reed-Solomon GF(2^8) parity (via the nanors library, used identically by both Sunshine host and Moonlight client) is computed per block: parity_shards = ceil(data_shards * fecPercentage / 100). The shard's RS index within its block is derived purely from (sequenceNumber - blockLowestSequenceNumber); data shards occupy indices [0, data_shards), parity [data_shards, data_shards+parity). fecInfo packs fecIndex(10b)<<12 | dataShards(10b)<<22 | fecPercentage(8b)<<4. multiFecBlocks packs currentBlock(2b)<<4 | (lastBlock=blocks-1)(2b)<<6; multiFecFlags is always 0x10. When video encryption is negotiated (SS_ENC_VIDEO), each fully-formed shard (the whole blocksize buffer: RTP+reserved+NV_VIDEO_PACKET+payload) is AES-128-GCM encrypted IN PLACE *after* FEC encoding, and a 32-byte ENC_VIDEO_HEADER (iv[12], frameNumber[4 LE], tag[16]) is prepended as a wire prefix; there is NO AAD. The client strips/decrypts the prefix first, then runs FEC reconstruction over the decrypted shards. The AES key is the 16-byte RIKEY (StreamConfig.remoteInputAesKey / Sunshine launch_session.gcm_key), shared with the control stream; the IV is a deterministic 64-bit little-endian counter in iv[0..8], zero in iv[8..11], and 'V' (0x56) in iv[11].",
"ports": [
"Video RTP/UDP: client connects to host UDP port 47998 (RTP_VIDEO_PORT, base 47998). Host receives the client's first packet on this port to learn the client's source address/port, then sends all video shards back to that endpoint.",
"(context) Audio RTP/UDP 47998+1=47999, Control/ENet UDP 47999+1 / 47999, RTSP TCP 48010 — not part of this video-plane spec but relevant to the session bring-up that supplies RIKEY/packetSize/fecPercentage."
],
"wire_formats": [
{
"name": "RTP_PACKET (12 bytes, fields BIG-endian)",
"layout": "offset 0: uint8 header; offset 1: uint8 packetType; offset 2: uint16 sequenceNumber (BE); offset 4: uint32 timestamp (BE, 90kHz clock); offset 8: uint32 ssrc (BE). Host sets header = 0x80 | FLAG_EXTENSION (0x10) = 0x90. packetType is NOT set by Sunshine for video (left as whatever / 0); the client does not read it for video — it detects the extension purely via (header & FLAG_EXTENSION). sequenceNumber = BE16(lowseq + shardIndexInBlock); timestamp = BE32(round to 90kHz of (frame_timestamp - video_epoch)); ssrc copied through (host leaves it, client preserves it during recovery).",
"notes": "From moonlight Video.h _RTP_PACKET and Sunshine stream.cpp lines ~1490-1493. FLAG_EXTENSION=0x10 (Video.h). FIXED_RTP_HEADER_SIZE=12, MAX_RTP_HEADER_SIZE=16. The extra 4 bytes (the 'reserved') exist because of the extension flag."
},
{
"name": "reserved[4] (4 bytes)",
"layout": "offset 12: 4 opaque bytes (Sunshine: `char reserved[4];` inside video_packet_raw_t). Client computes dataOffset = sizeof(RTP_PACKET)=12, then += 4 because (header & FLAG_EXTENSION), giving dataOffset=16 before NV_VIDEO_PACKET. Contents are not interpreted.",
"notes": "RtpVideoQueue.c RtpvAddPacket: `int dataOffset = sizeof(*packet); if (header & FLAG_EXTENSION) dataOffset += 4; // 2 additional fields`. So the NV header always starts at byte 16."
},
{
"name": "NV_VIDEO_PACKET (16 bytes, scalar fields LITTLE-endian)",
"layout": "offset 16: uint32 streamPacketIndex (LE); offset 20: uint32 frameIndex (LE); offset 24: uint8 flags; offset 25: uint8 extraFlags; offset 26: uint8 multiFecFlags; offset 27: uint8 multiFecBlocks; offset 28: uint32 fecInfo (LE). Host: streamPacketIndex = ((uint32)(lowseq + x)) << 8 (low byte zero); frameIndex = frame index counter; flags = FLAG_CONTAINS_PIC_DATA(0x1) [| FLAG_SOF(0x4) if x==0] [| FLAG_EOF(0x2) if x==lastDataPkt]; multiFecFlags = 0x10; multiFecBlocks = (blockIndex<<4) | ((fec_blocks_needed-1)<<6); extraFlags 0 (NV_VIDEO_PACKET_EXTRA_FLAG_LTR_FRAME=0x1).",
"notes": "From Video.h _NV_VIDEO_PACKET and stream.cpp ~1434-1495. Client byteswaps streamPacketIndex/frameIndex/fecInfo with LE32 in RtpVideoQueue.c lines 564-566. Depacketizer does streamPacketIndex >>= 8; &= 0xFFFFFF (24-bit stream index). Flags/extraFlags/multiFecFlags/multiFecBlocks are single bytes (no swap)."
},
{
"name": "fecInfo bit packing (uint32, LE on wire)",
"layout": "fecInfo = (fecIndex << 12) | (dataShards << 22) | (fecPercentage << 4). Decode (RtpVideoQueue.c): fecIndex = (fecInfo & 0x3FF000) >> 12 (10 bits, the shard's RS index within its block); dataShards = (fecInfo & 0xFFC00000) >> 22 (10 bits); fecPercentage = (fecInfo & 0xFF0) >> 4 (8 bits). Bits 0..3 unused. parityShards is NOT transmitted — client recomputes: parityShards = (dataShards * fecPercentage + 99) / 100.",
"notes": "stream.cpp ~1485-1488 set side; RtpVideoQueue.c lines 583, 703-705 decode side. fecIndex tops out at 1023 (>=1024 packets/block = unrecoverable, logged as error). dataShards <= 255 (DATA_SHARDS_MAX)."
},
{
"name": "multiFecBlocks / multiFecFlags bit packing (uint8 each)",
"layout": "multiFecFlags = 0x10 (constant; marks multi-FEC protocol). multiFecBlocks = (currentBlockIndex << 4) | ((totalBlocks-1) << 6). Decode: currentBlock = (multiFecBlocks >> 4) & 0x3; lastBlock = (multiFecBlocks >> 6) & 0x3; totalBlocks = lastBlock+1 (1..4). Legacy/non-multiFec servers: client forces multiFecFlags=0x10, multiFecBlocks=0x00.",
"notes": "stream.cpp 1438-1439; RtpVideoQueue.c 584, 709. Only 2 bits each for current and last block → max 4 FEC blocks per frame (MAX_FEC_BLOCKS=4)."
},
{
"name": "video_short_frame_header_t (8 bytes, prepended to the frame bitstream, LITTLE-endian)",
"layout": "offset 0: uint8 headerType (always 0x01); offset 1: uint16 frame_processing_latency (LE, 1/10 ms, Sunshine ext, 0 if N/A); offset 3: uint8 frameType (1=normal P, 2=IDR, 4=P w/ intra-refresh, 5=P after ref-frame-invalidation); offset 4: uint16 lastPayloadLen (LE, length of final packet's real payload, for codecs like AV1 that can't tolerate zero padding); offset 6: uint8 unknown[2]. This 8-byte header precedes the actual H.264/HEVC/AV1 access unit, and the concatenation is what gets striped into shards.",
"notes": "stream.cpp video_short_frame_header_t (static_assert ==8). lastPayloadLen = (payloadSize + 8) % (packetSize - sizeof(NV_VIDEO_PACKET)); if 0 → set to (packetSize - sizeof(NV_VIDEO_PACKET)). frameType 2 == IDR is how the client detects keyframes."
},
{
"name": "ENC_VIDEO_HEADER / video_packet_enc_prefix_t (32 bytes) — present only when SS_ENC_VIDEO",
"layout": "offset 0: uint8 iv[12]; offset 12: uint32 frameNumber (LE); offset 16: uint8 tag[16]. This is a WIRE PREFIX that sits in front of the (encrypted) shard, NOT inside the FEC blocksize. On-wire encrypted packet = ENC_VIDEO_HEADER (32B) || ciphertext(blocksize bytes). ciphertext is the AES-128-GCM encryption of the entire plaintext shard (RTP+reserved+NV_VIDEO_PACKET+payload = blocksize). tag is the 16-byte GCM tag. iv is the literal 12-byte nonce used (sent so the client doesn't have to reconstruct it). frameNumber = packet->frame_index().",
"notes": "Video.h _ENC_VIDEO_HEADER and stream.cpp video_packet_enc_prefix_t + encrypt call ~1498-1515. Header must be a multiple of 16 bytes so the FEC blocksize stays a multiple of 16 (comment in Video.h). When encryption is on, the SDP negotiation subtracts sizeof(ENC_VIDEO_HEADER) from packetSize (SdpGenerator.c line 325)."
},
{
"name": "Full on-wire datagram (one shard)",
"layout": "UNENCRYPTED: [RTP_PACKET 12B][reserved 4B][NV_VIDEO_PACKET 16B][payload up to (packetSize - sizeof(NV_VIDEO_PACKET))]. Total header before payload = 32B = sizeof(video_packet_raw_t). ENCRYPTED: [ENC_VIDEO_HEADER 32B] || AES-GCM-ciphertext( the entire 32B+payload plaintext = blocksize ). The plaintext that gets encrypted INCLUDES the RTP and NV headers; the only cleartext fields on an encrypted packet are the 32-byte ENC_VIDEO_HEADER prefix.",
"notes": "Client recv buffer sizing (VideoStream.c 96-99): decryptedSize = packetSize + MAX_RTP_HEADER_SIZE(16) = blocksize; receiveSize = decryptedSize + (encrypted ? 32 : 0). So blocksize == packetSize + 16. After decrypt, client byteswaps RTP seq/timestamp/ssrc to host order, then calls RtpvAddPacket."
}
],
"flow": [
"Host has an encoded access unit (HEVC/H264/AV1) for frame N from NVENC. It computes frame_index N (monotonic) and builds the 8-byte video_short_frame_header_t (headerType=0x01, frameType per IDR/P, lastPayloadLen, latency).",
"Host prepends sizeof(video_packet_raw_t)=32 bytes of header space per shard via concat_and_insert with payload_blocksize = blocksize - 32, where blocksize = packetSize + 16. This produces the striped payload buffer with room for per-shard headers.",
"Host decides FEC block count: max_data_shards_per_fec_block = (255*100)/(100+fecPercentage); fec_blocks_needed = ceil(payload / (max_data_shards*blocksize)), capped at 4. If >4 needed, FEC is disabled for that frame (fecPercentage=0). Each block aligned to blocksize: aligned_size = roundup(payload/blocks, blocksize).",
"For each FEC block: host fills each data shard's RTP+NV headers (frameIndex, streamPacketIndex=(lowseq+x)<<8, multiFecFlags=0x10, multiFecBlocks, flags incl SOF on x==0 and EOF on last data packet).",
"Host calls fec::encode(block, blocksize, fecPercentage, minRequiredFecPackets, prefixsize): data_shards = ceil(blockBytes/blocksize) (last data shard zero-padded), parity_shards = ceil(data_shards*fecPercentage/100) (raised to minRequiredFecPackets if below), then nanors reed_solomon_new(data_shards, parity_shards) + reed_solomon_encode over all shards at blocksize.",
"Host stamps every shard (data AND parity) with fecInfo = (x<<12 | data_shards<<22 | percentage<<4), rtp.header=0x80|0x10, rtp.sequenceNumber=BE16(lowseq+x), rtp.timestamp=BE32(timestamp), multiFecBlocks, frameIndex.",
"If video encryption: for each shard, build iv = (gcm_iv_counter as 8 LE bytes)||0,0,0 with iv[11]='V'; increment counter; AES-128-GCM encrypt the whole blocksize buffer in place (no AAD), writing tag into the prefix; set prefix.frameNumber=frame_index, prefix.iv=iv. Wire packet = prefix(32B)||ciphertext(blocksize).",
"Host sends all data then parity shards as UDP datagrams (paced ~80% of 1Gbps, batched up to 64 packets) to the client's video endpoint; lowseq advances by total shards across all blocks of the frame.",
"Client recv: if encrypted, read prefix, drop early if prefix.frameNumber < currentFrameNumber, else AES-128-GCM decrypt ciphertext into buffer using iv+tag from prefix (auth-fail → drop). Then byteswap RTP seq/timestamp/ssrc to host order.",
"Client RtpvAddPacket: parse NV header (LE32 swaps), derive fecIndex, currentBlock, dataShards, fecPercentage, recompute parityShards; set bufferLowestSequenceNumber = seq - fecIndex; place shard at RS index = seq - bufferLowestSequenceNumber.",
"When a block has >= dataShards shards (data+parity), client runs nanors reed_solomon_decode(rs, packets, marks, totalPackets, blocksize) to recover any missing data shards (missing slots zero-filled, marks[]=1 for missing). Recovered data shards get synthetic RTP headers (seq/header/timestamp/ssrc copied from a present packet).",
"Client advances through blocks (currentBlock 0..lastBlock); once all blocks' data shards are present/recovered it strips the 32-byte video_packet_raw_t header off each data shard, concatenates payloads in sequence order, parses the 8-byte short frame header, and hands the reassembled access unit to the depacketizer/decoder."
],
"crypto": "\"Cipher: AES-128-GCM (EVP_aes_128_gcm). Key: the 16-byte RIKEY — Sunshine `launch_session.gcm_key`, Moonlight `StreamConfig.remoteInputAesKey[16]` — established during RTSP/pairing; the SAME key is used for the control stream and (if enabled) audio. There is no separate video key and no key derivation: the raw 16-byte RIKEY is used directly. IV/nonce: 12 bytes, constructed deterministically (NIST SP 800-38D 8.2.1): iv[0..8] = a 64-bit per-session counter (session->video.gcm_iv_counter, starts at 0) copied in NATIVE byte order (little-endian on x86), iv[8..11] = 0 except iv[11] = 'V' (0x56, the video-stream fixed field). The counter increments once per shard. The full 12-byte IV is transmitted in the ENC_VIDEO_HEADER so the client uses it verbatim (it does not reconstruct it). Tag: 16 bytes, standard GCM tag, transmitted in ENC_VIDEO_HEADER. AAD / associated data: NONE — Sunshine's gcm_t::encrypt for video calls EVP_EncryptUpdate only with plaintext (no AAD update), and Moonlight's PltDecryptMessage passes no AAD argument. (NOTE: this differs from punktfunk-core/crypto.rs which uses seq-as-AAD and a per-direction salt; GameStream video does neither.) Order: FEC FIRST, THEN ENCRYPT — encryption is applied per-shard after RS parity is computed, over the entire blocksize shard buffer (RTP+NV+payload), so the client must DECRYPT each shard before it can run FEC reconstruction. Encrypted plaintext length == blocksize; ciphertext length == blocksize (GCM is a stream cipher, no expansion); on the wire the packet grows only by the 32-byte prefix.\"",
"rust_options": "\"FEC math: punktfunk already has Gf8Coder over `reed-solomon-erasure` (galois_8). CRITICAL RISK: this is NOT guaranteed byte-compatible with nanors. Both Sunshine and Moonlight use nanors (Sunshine's rswrapper.h is 'a drop-in replacement for nanors rs.h', DATA_SHARDS_MAX=255), which uses a specific GF(2^8) field (primitive poly 0x11d) and a Vandermonde-derived generator matrix with a particular systematic encoding. `reed-solomon-erasure` uses Cauchy matrices by default and may produce DIFFERENT parity bytes — meaning Moonlight would FAIL to recover frames where any data shard is lost. RECOMMENDED: vendor/FFI the actual nanors C library (it is tiny, MIT, header+rs.c+oblas) and call reed_solomon_new/encode through a thin Rust FFI, OR port nanors' matrix construction exactly into a new gf8 backend. Do NOT assume reed-solomon-erasure interop without a byte-for-byte test against nanors output. (For punktfunk-to-punktfunk P2 traffic, keep the existing coders; for GameStream-client compat, use nanors.) Crypto: use `aes-gcm` (already a dep) but build a NEW path that (a) takes the raw RIKEY as the key, (b) builds the 12-byte IV as counter_le[8]||0||0||0||'V', (c) uses NO AAD, (d) does NOT use the per-direction salt logic in SessionCrypto. The cleanest approach is a small standalone `Aes128Gcm` call rather than reusing SessionCrypto (whose nonce/AAD scheme is incompatible). Byte layout: define `#[repr(C, packed)]` structs RtpPacket, NvVideoPacket, EncVideoHeader and use explicit `to_be_bytes`/`to_le_bytes` per field (RTP=BE, NV=LE) — do not rely on struct memory layout for endianness.\"",
"reuse_from_punktfunk": "\"REUSE the GF(2^8) concept/structure but NOT necessarily the implementation: punktfunk's `ErasureCoder` trait and `Gf8Coder` (reed-solomon-erasure) give the right data||parity systematic layout and the 255-shard ceiling, but parity bytes likely won't match nanors — so for client-facing GameStream compat add a `nanors`-backed coder (FFI or exact port) behind the same trait. The trait's reconstruct(data_count, recovery_count, received: indices 0..K originals, K..K+M recovery) maps cleanly onto Moonlight's layout (data shards at RS index 0..dataShards, parity after) so the adapter just needs to map RTP-seq→shard-index. REUSE aes-gcm crate but NOT SessionCrypto (its salt+seq-AAD scheme is wire-incompatible with GameStream video which uses no AAD and a counter||'V' IV). REUSE punktfunk's UDP transport for sending datagrams. Do NOT reuse punktfunk's internal 40-byte packet format — GameStream needs the exact 12+4+16 header + optional 32-byte enc prefix. NEW work: a `gamestream` wire module in punktfunk-host (NOT punktfunk-core, to keep punktfunk-core's clean internal protocol) that (1) builds RTP_PACKET/NV_VIDEO_PACKET/ENC_VIDEO_HEADER bytes, (2) implements the frame→FEC-block split (max 4 blocks, the 255/(1+F) shard math), (3) drives a nanors coder, (4) does the per-shard counter-IV AES-128-GCM-no-AAD encrypt, (5) paces/batches sends. Best location: a new file like crates/punktfunk-host/src/gamestream/video.rs (host side) with the nanors FFI either in punktfunk-host or a small new crate; punktfunk-core stays the punktfunk-native protocol and only its aes-gcm + the gf8 *math* are conceptually shared. The 'adapter' lives at the punktfunk-host pipeline seam: take the NVENC access unit + frame metadata from encode.rs and emit GameStream datagrams instead of (or alongside) punktfunk-native packets.\"",
"gotchas": [
"ENDIANNESS SPLIT: RTP_PACKET fields are BIG-endian, NV_VIDEO_PACKET fields are LITTLE-endian, within the SAME packet. Easy to get wrong. RTP: BE16/BE32; NV streamPacketIndex/frameIndex/fecInfo: LE32.",
"ENCRYPT-AFTER-FEC, not before: the GCM-encrypted region is the WHOLE shard (RTP+NV+payload) and the client must decrypt each shard before FEC. The 32-byte ENC_VIDEO_HEADER is a wire PREFIX outside the FEC blocksize, not part of the protected data. If you FEC after encrypt or include the prefix in the FEC math, recovery breaks.",
"NO AAD on video GCM — unlike punktfunk-core's SessionCrypto which authenticates the sequence number as AAD. Using SessionCrypto verbatim will fail Moonlight's tag check.",
"IV counter byte order: Sunshine copies the 64-bit counter with std::copy_n in NATIVE order (little-endian on the x86 build), so iv[0..8] is the counter LE; iv[11]='V'(0x56), iv[8..11]=0. The client uses the transmitted iv verbatim, so as long as you SEND the iv you used, internal byte order is self-consistent — but match LE to mirror Sunshine exactly and to keep nonces unique.",
"FEC parity matrix must match nanors EXACTLY. reed-solomon-erasure (punktfunk's current backend) is likely NOT byte-compatible (Cauchy vs nanors Vandermonde). Without a byte-for-byte match, Moonlight silently fails to recover any frame with a lost data shard. Validate against real nanors output or FFI nanors.",
"streamPacketIndex is (lowseq+x)<<8 with low byte zero; client does >>=8 then &0xFFFFFF → a 24-bit stream-wide packet index, distinct from the 16-bit RTP sequenceNumber. Both must be set consistently or the depacketizer's continuity check (FLAG_SOF / streamPacketIndex == lastPacketInStream+1) rejects the frame.",
"Max 4 FEC blocks per frame (2-bit fields). Max 1024 packets per block (10-bit fecIndex). Max 255 shards/block (GF(2^8)). data_shards per block = 255*100/(100+fecPercentage). Exceeding these → FEC disabled or unrecoverable frame.",
"Last data shard is zero-padded to blocksize before RS encode; lastPayloadLen in the short frame header tells the client the real length of the final packet's payload (needed for AV1). Padding must be zeros so RS math and the client's memset-padding agree.",
"fecPercentage and parityShards: the host transmits dataShards and fecPercentage in fecInfo but NOT parityShards; the client recomputes parityShards = (dataShards*fecPercentage+99)/100. Use the IDENTICAL rounding (ceil) or shard indices misalign. Sunshine may also bump fecPercentage up to satisfy minRequiredFecPackets — recompute percentage = 100*parity/data in that case and stamp the bumped value into fecInfo.",
"blocksize = packetSize + MAX_RTP_HEADER_SIZE(16). When encryption is enabled the SDP negotiation REDUCES packetSize by sizeof(ENC_VIDEO_HEADER)=32 first, so the encrypted-shard plaintext stays the original size. Get this off-by-32 right or buffers mismatch.",
"packetType in RTP_PACKET is effectively unused for video (Sunshine doesn't set it; client ignores it for video, keying on FLAG_EXTENSION instead). Do not rely on a video packetType constant like 97 (that's the AUDIO packetType; audio FEC is 127)."
],
"sources": [
"Sunshine src/stream.cpp — packetTypes[] (audio 97 / audio-fec 127, control types); struct video_short_frame_header_t (8B), video_packet_raw_t (RTP + reserved[4] + NV_VIDEO_PACKET), video_packet_enc_prefix_t (iv[12]/frameNumber/tag[16]); fec::encode() (data/parity shard math, nanors reed_solomon_new/encode, zero-pad last data shard); videoBroadcastThread() (frame->FEC-block split, fecInfo/multiFecBlocks/streamPacketIndex/flags packing, per-shard AES-GCM encrypt with counter||'V' IV, lines ~1434-1515); session cipher init from launch_session.gcm_key with SS_ENC_VIDEO (lines ~2032-2040).",
"Sunshine src/crypto.cpp — gcm_t::encrypt (EVP_aes_128_gcm, no AAD update, EVP_CTRL_GCM_GET_TAG tag_size=16), init_encrypt_gcm (EVP_CTRL_GCM_SET_IVLEN to iv size 12).",
"Sunshine src/rswrapper.h — 'drop-in replacement for nanors rs.h', #define DATA_SHARDS_MAX 255, reed_solomon_new/encode/decode signatures.",
"Sunshine src/rtsp.cpp — RIKEY/gcm IV context (12-byte IV, deterministic construction comment) confirming RIKEY is the GCM key.",
"moonlight-common-c src/Video.h — _RTP_PACKET (12B), _NV_VIDEO_PACKET (16B: streamPacketIndex/frameIndex/flags/extraFlags/multiFecFlags/multiFecBlocks/fecInfo), _ENC_VIDEO_HEADER (iv[12]/frameNumber/tag[16]), FLAG_CONTAINS_PIC_DATA=0x1/FLAG_EOF=0x2/FLAG_SOF=0x4, FLAG_EXTENSION=0x10, FIXED_RTP_HEADER_SIZE=12, MAX_RTP_HEADER_SIZE=16, SS_FRAME_FEC_STATUS.",
"moonlight-common-c src/RtpVideoQueue.c — RtpvAddPacket(): dataOffset=12(+4 for extension)=16; LE32 swaps of streamPacketIndex/frameIndex/fecInfo; fecIndex=(fecInfo&0x3FF000)>>12, dataShards=(fecInfo&0xFFC00000)>>22, fecPercentage=(fecInfo&0xFF0)>>4; currentBlock=(multiFecBlocks>>4)&0x3, lastBlock=(multiFecBlocks>>6)&0x3; parityShards=(dataShards*pct+99)/100; bufferLowestSequenceNumber=seq-fecIndex; reconstructFecBlock() nanors reed_solomon_new/decode over totalPackets at blocksize=packetSize+16, RS index = seq-bufferLowestSequenceNumber, recovered shards get synthetic RTP headers.",
"moonlight-common-c src/VideoStream.c — VideoReceiveThreadProc(): receiveSize=packetSize+16(+32 if SS_ENC_VIDEO); per-packet AES-GCM decrypt via PltDecryptMessage(ALGORITHM_AES_GCM, key=remoteInputAesKey[16], iv=encHeader->iv[12], tag=encHeader->tag[16], ciphertext after 32B header, NO AAD); early-drop if encHeader->frameNumber < currentFrameNumber; then BE16/BE32 swap of RTP seq/timestamp/ssrc before RtpvAddPacket.",
"moonlight-common-c src/VideoDepacketizer.c — processRtpPayload/reassembleFrame: streamPacketIndex >>= 8 & 0xFFFFFF (24-bit), SOF/EOF continuity, frameType for IDR detection.",
"moonlight-common-c src/PlatformCrypto.h — ALGORITHM_AES_GCM=2, PltDecryptMessage/PltEncryptMessage signatures (key,iv,tag,input,output — no AAD parameter).",
"moonlight-common-c src/Limelight.h — ENCFLG_VIDEO=0x2, ENCFLG_AUDIO=0x1, remoteInputAesKey[16] (the RIKEY).",
"moonlight-common-c src/SdpGenerator.c — SS_ENC_VIDEO negotiation; when enabled StreamConfig.packetSize -= sizeof(ENC_VIDEO_HEADER) (32).",
"moonlight-common-c .gitmodules + repo tree — nanors at /nanors (rs.c, rs.h, deps/obl GF(2^8) tables), confirming both host(Sunshine via rswrapper) and client use nanors.",
"punktfunk local: crates/punktfunk-core/src/crypto.rs (SessionCrypto: salt+seq-AAD scheme — incompatible with GameStream video), crates/punktfunk-core/src/fec/{mod.rs,gf8.rs} (ErasureCoder trait + reed-solomon-erasure galois_8 — needs nanors compat verification)."
]
},
{
"area": "GameStream audio stream — UDP RTP transport, Opus multistream config, Reed-Solomon FEC (4+2 over GF(2^8)), AES-CBC encryption",
"summary": "The audio stream is a one-way UDP RTP flow from host to client on the \"audio\" port (base+10; with the default GameStream base port 47989 that is 47999 — the well-known \"port 48000\" in the task refers to the default-numbered GameStream audio slot, but the offset is +10 / control is +11; both Sunshine and moonlight derive ports as base+offset). The host sends Opus-encoded 48 kHz audio in RTP packets with a fixed 12-byte RTP header where packetType=97 for data and 127 for FEC. Audio is grouped into fixed FEC blocks of 4 data shards + 2 parity shards (RTPA_DATA_SHARDS=4, RTPA_FEC_SHARDS=2) using Reed-Solomon over GF(2^8); critically Nvidia/Sunshine use a HARDCODED parity matrix that differs from a generic RS implementation (moonlight-common-c overrides its matrix with the bytes {0x77,0x40,0x38,0x0e,0xc7,0xa7,0x0d,0x6c} from OpenFEC to match the wire). Each Opus frame is one data shard; after every 4th data packet (sequenceNumber % 4 == 0 marks block start, (seq+1)%4==0 triggers encode) two FEC packets are emitted carrying an 12-byte AUDIO_FEC_HEADER after the RTP header. Opus is configured by the host: stereo = 2ch/1 stream/1 coupled, 5.1 = 6ch/4 streams/2 coupled (or 6/0 high quality), 7.1 = 8ch/5 streams/3 coupled (or 8/0 high quality), with channel mappings matching SMPTE/Vorbis order (FL,FR,FC,LFE,RL,RR,SL,SR). samplesPerFrame = 48 * AudioPacketDuration where AudioPacketDuration defaults to 5 ms (240 samples/channel) for lowest latency, optionally 10 ms (480 samples). Audio payload (when SS_ENC_AUDIO is enabled) is encrypted with AES-128-CBC (NOT GCM like video/control), keyed by the same session GCM/RI key, with a per-packet 16-byte IV whose first 4 bytes are big-endian (avRiKeyId + sequenceNumber) and the rest zero; payload is PKCS7-padded. The FEC parity is computed over the encrypted+padded shard bytes, all shards padded to the same block size. 
This differs from video FEC which uses larger dynamic block sizes, multi-FEC blocks, and a different packet header; audio FEC is a tiny fixed 4+2 layout.",
"ports": [
"Audio RTP (UDP, host→client): base_port + 10. With default GameStream base 47989 → 47999. The task's 'port 48000' corresponds to the canonical GameStream audio slot; verify against the actual base. Client binds and sends pings to this port (moonlight: SET_PORT(&saddr, AudioPortNumber), AudioPortNumber parsed from RTSP SETUP).",
"Video RTP: base_port + 9 (47998 default)",
"Control (ENet): base_port + 11 (48000 default)",
"RTSP setup: base_port + 21 (48010 default)",
"Audio port is dynamically negotiated via RTSP SETUP (moonlight reads AudioPortNumber); do not hardcode — but offset is +10 in Sunshine net::map_port(AUDIO_STREAM_PORT)"
],
"wire_formats": [
{
"name": "RTP_PACKET (12-byte audio RTP header)",
"layout": "Offset 0: header (uint8, set to 0x80 = RTP version 2, no padding/ext/CSRC). Offset 1: packetType (uint8): 97 (RTP_PAYLOAD_TYPE_AUDIO) for data, 127 (RTP_PAYLOAD_TYPE_FEC) for FEC. Offset 2: sequenceNumber (uint16, big-endian on wire, host-order after parse). Offset 4: timestamp (uint32, big-endian). Offset 8: ssrc (uint32, big-endian; Sunshine sets ssrc=0). Total 12 bytes, then payload.",
"notes": "Sunshine: audio_packet.rtp.header=0x80; .packetType=97; .ssrc=0; .sequenceNumber=big(seq); .timestamp=big(ts). seq increments by 1 per packet; timestamp += packetDuration (ms units, i.e. 5 or 10) each packet. Moonlight checks rtp->packetType==97 for data."
},
{
"name": "AUDIO_FEC_HEADER (12 bytes, follows RTP header in FEC packets)",
"layout": "Offset 0 (after the 12-byte RTP header, i.e. file offset 12): fecShardIndex (uint8) = 0 or 1 (which of the 2 parity shards). Offset 1: payloadType (uint8) = 97 (the data payload type being protected). Offset 2: baseSequenceNumber (uint16, big-endian) = seq of first data packet in the block (block start). Offset 4: baseTimestamp (uint32, big-endian) = timestamp of first data packet in block. Offset 8: ssrc (uint32, big-endian). Total 12 bytes, then the parity shard bytes (length = blockSize).",
"notes": "moonlight struct order: {uint8 fecShardIndex; uint8 payloadType; uint16 baseSequenceNumber; uint32 baseTimestamp; uint32 ssrc;}. Sunshine audio_fec_packet_t = {RTP_PACKET rtp; AUDIO_FEC_HEADER fecHeader;}. FEC packet rtp.packetType=127, rtp.sequenceNumber=big(baseSeq + x + 1) for shard x, fecHeader.payloadType=97, fecHeader.fecShardIndex=x."
},
{
"name": "FEC block (logical grouping)",
"layout": "4 data packets (RTPA_DATA_SHARDS) + 2 FEC packets (RTPA_FEC_SHARDS) = 6 total shards (RTPA_TOTAL_SHARDS). Block starts when sequenceNumber % 4 == 0. All shards (data payload + parity) padded to common blockSize for RS. Block can be recovered if any 4 of the 6 shards arrive.",
"notes": "Fixed small block: 4+2. moonlight RTPA_FEC_BLOCK holds dataPackets[4], fecPackets[2], marks[6]. Boundary: 'FEC blocks must start on a RTPA_DATA_SHARDS boundary.' This is unlike video which uses variable block sizes and multi-FEC blocks."
},
{
"name": "Encrypted audio payload (AES-128-CBC)",
"layout": "When SS_ENC_AUDIO flag set: payload = AES-128-CBC(opus_frame_PKCS7_padded). IV = 16 bytes: bytes[0..4] = big-endian uint32 (avRiKeyId + sequenceNumber), bytes[4..16] = 0. Key = session RI/GCM key (16 bytes, from remoteInputAesKey). Block size rounded via round_to_pkcs7_padded; max_block_size = round_to_pkcs7_padded(2048).",
"notes": "CBC, not GCM. No auth tag appended (unlike video/control GCM). moonlight: ivSeq = BE32(avRiKeyId + rtp->sequenceNumber); memcpy(iv,&ivSeq,4); decrypts then strips PKCS7. FEC parity computed over the ENCRYPTED+padded shard."
}
],
"flow": [
"RTSP SETUP negotiates audio: client/host agree on audio config (channels via AUDIO_CONFIGURATION mask), AudioPacketDuration (5 ms default, 10 ms fallback), and the audio UDP port (AudioPortNumber). avRiKeyId and the AES key/IV come from the launch/resume request (remoteInputAesKey/Iv).",
"Client opens UDP socket to host audio port and begins sending periodic ping packets every 500 ms to punch NAT and tell host where to send (legacy ping = ASCII 'PING' {0x50,0x49,0x4E,0x47}; modern = AudioPingPayload/SS_PING with sequence). Host learns client addr from first ping.",
"Host captures PCM, encodes with opus_multistream_encode_float at 48 kHz into samplesPerFrame-sized frames (240 samples/ch at 5 ms).",
"Host builds RTP data packet: header=0x80, packetType=97, seq++ (big-endian), timestamp += packetDuration (big-endian), ssrc=0. If SS_ENC_AUDIO: AES-128-CBC encrypt the PKCS7-padded Opus frame with IV=BE32(avRiKeyId+seq)||0. Send to client.",
"FEC accumulation: when seq % 4 == 0, record block baseSequenceNumber/baseTimestamp. The 4 (possibly encrypted) data shard payloads are placed in shards_p[seq % 4].",
"When (seq+1) % 4 == 0 (i.e. after the 4th data packet): reed_solomon_encode(rs, shards, RTPA_TOTAL_SHARDS=6, blockSize) generates 2 parity shards. Host sends 2 FEC packets: rtp.packetType=127, rtp.sequenceNumber=big(baseSeq + x + 1), fecHeader.fecShardIndex=x (0,1), fecHeader.payloadType=97, baseSequenceNumber/baseTimestamp/ssrc set.",
"Client (RtpAudioQueue) groups incoming packets by block (base seq aligned to 4). If ≥4 of 6 shards present and any data missing, reed_solomon_decode(rs, shards, marks, 6, blockSize) recovers them — using the hardcoded Nvidia parity matrix {0x77,0x40,0x38,0x0e,0xc7,0xa7,0x0d,0x6c}.",
"Client decrypts recovered/received data shards (AES-128-CBC, strip PKCS7), reorders by sequence, and feeds frames to opus_multistream_decoder_create/decode_float."
],
"crypto": "Cipher: AES-128-CBC (NOT GCM — video & control use GCM, audio uses CBC). Key: 16 bytes = the session AES key (Sunshine launch_session.gcm_key / moonlight StreamConfig.remoteInputAesKey), same key family used for the input/RI channel. IV: 16 bytes, per-packet — first 4 bytes = big-endian uint32 of (avRiKeyId + sequenceNumber), remaining 12 bytes = 0x00. avRiKeyId is a per-session 32-bit value from the RTSP/launch negotiation (session->audio.avRiKeyId). Padding: PKCS7 to AES block (16-byte) multiple; round_to_pkcs7_padded, max_block_size = round_to_pkcs7_padded(2048). No GMAC/auth tag is appended to audio packets. Encryption is gated by the SS_ENC_AUDIO encryption flag (config.encryptionFlagsEnabled & SS_ENC_AUDIO) — if disabled, raw Opus frame is sent. FEC parity is computed over the post-encryption, post-padding shard bytes so recovery yields ciphertext that is then decrypted.",
"rust_options": "Opus encode: use the `audiopus` crate or `opus` (libopus bindings) — both expose multistream via opus_multistream_encoder; if missing, FFI to libopus opus_multistream_encoder_create/opus_multistream_encode_float directly. Configure sampleRate=48000, the streams/coupledStreams and mapping per the negotiated AUDIO_CONFIGURATION; frame size = 48*packetDuration samples/ch. AES-128-CBC: use the `aes` + `cbc` crates (cbc::Encryptor<aes::Aes128>) with manual PKCS7 (`block-padding`/`Pkcs7`) — build the 16-byte IV as BE32(avRiKeyId+seq) || [0u8;12]. Reed-Solomon: do NOT use a generic RS matrix; the wire requires Nvidia's specific parity matrix. The `reed-solomon-erasure` crate computes its own (Vandermonde/Cauchy) matrix that will NOT match — either (a) port moonlight's approach: take the rs lib's encode path but inject the OpenFEC parity matrix bytes {0x77,0x40,0x38,0x0e,0xc7,0xa7,0x0d,0x6c} for the 4+2 case, or (b) hand-roll a tiny GF(2^8) 4-data/2-parity encoder/decoder using that exact 2x4 parity matrix (8 bytes = 2 parity rows × 4 data cols). punktfunk's existing `reed-solomon` GF(2^8) code can be reused ONLY if it lets you supply a custom generator/parity matrix; otherwise add a dedicated audio-FEC path. Big-endian field writes: use `byteorder`/`to_be_bytes`. UDP: std::net::UdpSocket on a native thread (no async, matching punktfunk's hot-path rule).",
"reuse_from_punktfunk": "REUSE: punktfunk-core's AES-128 primitive (crypto.rs) underlies CBC, but the MODE differs — punktfunk uses AES-128-GCM with per-direction nonce salts + seq-as-AAD; GameStream audio needs AES-128-CBC with the BE32(avRiKeyId+seq) IV and PKCS7, no AAD/tag. So add a CBC path; do not reuse the GCM nonce/AAD scheme for audio. punktfunk's GF(2^8) Reed-Solomon (reed-solomon crate) is the right field but the MATRIX is wrong for the wire — must supply Nvidia's hardcoded {0x77,0x40,0x38,0x0e,0xc7,0xa7,0x0d,0x6c} parity matrix or hand-roll the 4+2 encoder; punktfunk's internal 40-byte packet format and its FEC block sizing are NOT wire-compatible and cannot be reused for the on-wire audio packets. punktfunk's UDP transport + native-thread pacing model is reusable as plumbing. NEW: 12-byte RTP header serializer, 12-byte AUDIO_FEC_HEADER, the fixed 4+2 audio FEC block state machine (block starts at seq%4==0, encode at (seq+1)%4==0), the Opus multistream encoder integration, the 500 ms ping listener, and the AES-CBC+PKCS7 audio path. These are GameStream-specific and don't exist in punktfunk-core.",
"gotchas": [
"AES-CBC, not GCM. Audio is the one stream using CBC; reusing punktfunk's GCM code verbatim will break interop. No auth tag is on the wire for audio.",
"IV is only 4 meaningful bytes: BE32(avRiKeyId + sequenceNumber) then 12 zero bytes. The addition wraps as uint32. avRiKeyId is per-session from RTSP launch.",
"The Reed-Solomon parity matrix MUST be Nvidia's hardcoded one. moonlight explicitly notes 'the RS parity matrix computed by our RS implementation doesn't match the one Nvidia uses' and overrides it with the 8 OpenFEC bytes {0x77,0x40,0x38,0x0e,0xc7,0xa7,0x0d,0x6c}. A stock reed-solomon-erasure encoder will produce parity the client cannot decode.",
"FEC is a FIXED 4+2 block (RTPA_DATA_SHARDS=4, RTPA_FEC_SHARDS=2), unlike video which uses dynamic/large blocks and multi-FEC. Block boundaries must align to seq%4==0 or the client's queue logic rejects them.",
"FEC packets carry their OWN incrementing rtp.sequenceNumber = baseSeq + shardIndex + 1, distinct from data packet seq space conceptually but in the same 16-bit counter — get the +x+1 right.",
"FEC parity is computed AFTER encryption+padding (over ciphertext shards). All shards must be padded to the same blockSize before encode, and parity packets carry blockSize bytes.",
"Port offset is +10 (Sunshine), but the actual audio port is negotiated in RTSP SETUP (AudioPortNumber). Don't hardcode 48000 — 48000 is the CONTROL port (+11) under the default base; audio is +10 (47999 default). Confirm against your base port.",
"timestamp increments by packetDuration in ms units (5 or 10), not by sample count — Sunshine: timestamp += packetDuration.",
"samplesPerFrame = 48 * AudioPacketDuration → 240 samples/ch at 5 ms default, 480 at 10 ms. Opus must be configured to encode exactly this frame size.",
"Surround Opus uses multistream with specific stream/coupled counts (5.1: 4 streams/2 coupled normal or 6/0 high; 7.1: 5/3 normal or 8/0 high) — wrong stream layout makes the client's multistream decoder produce garbage. Channel mapping is FL,FR,FC,LFE,RL,RR,SL,SR (indices 0..7).",
"rtp.header byte is 0x80 (RTP v2); ssrc=0 in Sunshine. Match these or some clients may drop packets.",
"Client sends 500 ms pings to the audio port; host must read pings to discover the client's UDP source address before sending audio (one-way send relies on the learned addr)."
],
"sources": [
"moonlight-common-c/src/AudioStream.c — packetType==97 check, AES-CBC IV ivSeq=BE32(avRiKeyId+rtp->sequenceNumber), samplesPerFrame=48*AudioPacketDuration, chosenConfig=High/NormalQualityOpusConfig, SET_PORT(&saddr, AudioPortNumber), 500ms ping, MAX_PACKET_SIZE 1400, QUEUED_AUDIO_PACKET",
"moonlight-common-c/src/RtpAudioQueue.c — reed_solomon_new(RTPA_DATA_SHARDS, RTPA_FEC_SHARDS), hardcoded OpenFEC parity matrix {0x77,0x40,0x38,0x0e,0xc7,0xa7,0x0d,0x6c} ('doesn't match the one Nvidia uses'), reed_solomon_decode(rs, shards, marks, RTPA_TOTAL_SHARDS, blockSize), FEC block boundary on RTPA_DATA_SHARDS",
"moonlight-common-c/src/RtpAudioQueue.h — RTPA_DATA_SHARDS=4, RTPA_FEC_SHARDS=2, RTPA_TOTAL_SHARDS=6, AUDIO_FEC_HEADER {uint8 fecShardIndex; uint8 payloadType; uint16 baseSequenceNumber; uint32 baseTimestamp; uint32 ssrc;}, RTPA_FEC_BLOCK, RTP_AUDIO_QUEUE",
"moonlight-common-c/src/Limelight.h — MAKE_AUDIO_CONFIGURATION, AUDIO_CONFIGURATION_STEREO/51/71, CHANNEL_COUNT_FROM_AUDIO_CONFIGURATION, OPUS_MULTISTREAM_CONFIGURATION {sampleRate,channelCount,streams,coupledStreams,samplesPerFrame,mapping[8]}, remoteInputAesKey[16]/remoteInputAesIv[16]",
"moonlight-common-c/src/Limelight-internal.h — extern NormalQualityOpusConfig/HighQualityOpusConfig, extern int AudioPacketDuration, extern SS_PING AudioPingPayload",
"moonlight-common-c/src/SdpGenerator.c (via search) — AudioPacketDuration default 5 ms ('Use 5 ms packets by default for lowest latency'), 10 ms fallback",
"Sunshine/src/audio.cpp — SAMPLE_RATE=48000, opus_stream_config_t stream_configs[]: STEREO 2ch/1str/1coupled/96000bps, HIGH_STEREO 2/1/1/512000, SURROUND51 6/4/2/256000, HIGH_SURROUND51 6/6/0/1536000, SURROUND71 8/5/3/450000, HIGH_SURROUND71 8/8/0/2048000; opus_multistream_encode_float; buffer 1400",
"Sunshine/src/audio.h — opus_stream_config_t {int32 sampleRate;int channelCount;int streams;int coupledStreams;const uint8* mapping;int bitrate;}, stream_params_t, config_t {packetDuration,channels,mask,...}, enum stream_config_e {STEREO,HIGH_STEREO,SURROUND51,HIGH_SURROUND51,SURROUND71,HIGH_SURROUND71,MAX_STREAM_CONFIG}",
"Sunshine/src/stream.cpp — audioBroadcastThread: audio_packet.rtp.header=0x80/.packetType=97/.ssrc=0/.sequenceNumber=big(seq)/.timestamp=big(ts); seq++; timestamp+=packetDuration; IV=BE32(avRiKeyId+sequenceNumber); cbc cipher; if seq%4==0 set fecHeader.baseSequenceNumber/baseTimestamp; if (seq+1)%4==0 reed_solomon_encode(rs,shards,RTPA_TOTAL_SHARDS,bytes); fec_packet.rtp.packetType=127/.sequenceNumber=big(seq+x+1); fecHeader.fecShardIndex=x/.payloadType=97; net::map_port(AUDIO_STREAM_PORT); cbc_t{gcm_key,true}; round_to_pkcs7_padded(2048)",
"Sunshine/src/platform/common.h — speaker enum {FRONT_LEFT,FRONT_RIGHT,FRONT_CENTER,LOW_FREQUENCY,BACK_LEFT,BACK_RIGHT,SIDE_LEFT,SIDE_RIGHT}; map_stereo={FL,FR}; map_surround51={FL,FR,FC,LFE,BL,BR}; map_surround71={FL,FR,FC,LFE,BL,BR,SL,SR}",
"DeepWiki LizardByte/Sunshine Network Configuration — base port 47989, offsets: HTTP+0, HTTPS+1, Video+9 (47998), Audio+10 (47999), Control+11 (48000), RTSP+21 (48010); net::map_port validates 1024-65535",
"Sunshine/src/network.h — uint16_t map_port(int port) (maps offset onto base port)"
]
},
{
"area": "GameStream wire-format gap analysis + architecture recommendation for punktfunk-host (P1 / M2)",
"summary": "punktfunk-core today speaks an INTERNAL protocol that is structurally similar to GameStream but byte-incompatible on every wire surface, so a stock Moonlight client cannot connect to it as-is. Differences: (1) punktfunk prefixes each shard with a 40-byte little-endian `PacketHeader` and no RTP layer; GameStream uses a 12-byte big-endian RTP header + 4 reserved bytes + a 16-byte `NV_VIDEO_PACKET` (28 bytes total) carrying frameIndex/streamPacketIndex/flags and the FEC params bit-packed into a single `fecInfo` u32 and `multiFecBlocks` u8. (2) punktfunk's RS-FEC interleaves data+recovery shards within one block keyed by `shard_index`; GameStream packs ALL data shards first then ALL parity shards across a contiguous RTP sequence range, derives (data,parity,fecIndex,pct) from `fecInfo`, splits a frame into up to 4 FEC blocks via `multiFecBlocks`, and the data shards must be the literal RTP-framed bytes of the H.264/HEVC NAL slices (the depacketizer concatenates payloads to rebuild Annex-B). (3) punktfunk seals the whole 40-byte+payload packet under AES-128-GCM with an 8-byte seq prefix and seq-as-AAD; GameStream encrypts only the post-RTP payload, prefixing a `video_packet_enc_prefix_t {iv[12]; u32 frameNumber; u8 tag[16]}` where the IV is an 8-byte little-endian per-stream counter with iv[11]='V'. The RS math itself is identical (ceil(k*pct/100), GF(2^8), <=255 shards) so punktfunk's `reed-solomon` GF(2^8) coder CAN produce Moonlight-recoverable parity, but ONLY if punktfunk abandons its own shard layout and emits shards in GameStream's data-then-parity contiguous order with GameStream's exact shard size (packetSize + 4 reserved + RTP). 
Beyond video, GameStream needs an entire control plane punktfunk has not started: HTTPS:47984/HTTP:47989 nvhttp pairing (PIN->AES-128 via SHA-256(salt||pin)[..16], ECB challenge exchange, RSA-signed client cert), an RTSP:48010 handshake (OPTIONS/DESCRIBE/SETUP/ANNOUNCE/PLAY) carrying SDP `x-nv-*` params, an ENet control stream (UDP 48000) with its own AES-128-GCM framing and opcodes (request-IDR, loss-stats, ping, HDR, termination, rumble), an AES-CBC audio stream (UDP 47999), and mDNS `_nvstream._tcp` advertisement. Recommendation: put the GameStream video/FEC/crypto wire codec as a P1 \"wire mode\" INSIDE punktfunk-core (the invariant says protocol logic lives in the core), but keep the stateful control plane (nvhttp/RTSP/ENet/pairing/mDNS) in punktfunk-host as a tokio control-plane adapter that calls into core codec functions, because that machinery is I/O-bound, async, and not part of the hot path.",
"ports": [
"TCP 47984 — HTTPS nvhttp (paired control: /serverinfo, /pair, /applist, /launch, /resume, /cancel). Client-cert pinned to the paired client.",
"TCP 47989 — HTTP nvhttp (unpaired: /serverinfo unauthenticated, /pair PIN flow).",
"TCP 48010 — RTSP setup (OPTIONS/DESCRIBE/SETUP/ANNOUNCE/PLAY). Plaintext over TCP, or encrypted_rtsp_header_t {u32 typeAndLength MSB=0x80000000; u32 seq; u8 tag[16]} when encryption negotiated.",
"UDP 47998 — Video RTP stream (NV_VIDEO_PACKET + RS-FEC). ML_PORT_INDEX_UDP_47998=8.",
"UDP 47999 — Audio RTP stream (Opus, AES-CBC, RS-FEC). ML_PORT_INDEX_UDP_47999=9.",
"UDP 48000 — ENet control stream (reliable, AES-128-GCM, opcodes). ML_PORT_INDEX_UDP_48000=10.",
"UDP/mDNS 5353 — _nvstream._tcp.local advertisement so Moonlight auto-discovers the host.",
"Note: Moonlight derives all of these by offset from the HTTP base port (default 47989); changing the base shifts the whole set. punktfunk-host must advertise the actual HttpsPort/ExternalPort in serverinfo XML."
],
"wire_formats": [
{
"name": "DELTA: video packet header (punktfunk vs GameStream)",
"layout": "punktfunk PacketHeader = 40 bytes, little-endian, repr(C): pts_ns u64, frame_index u32, stream_seq u32, frame_bytes u32, user_flags u32, block_index u16, block_count u16, data_shards u16, recovery_shards u16, shard_index u16, shard_bytes u16, magic u8(0xC9), version u8, fec_scheme u8, flags u8. || GameStream on-wire = RTP_PACKET(12, big-endian: u8 header, u8 packetType, u16 sequenceNumber, u32 timestamp, u32 ssrc) + char reserved[4] + NV_VIDEO_PACKET(16, little-endian: u32 streamPacketIndex@0, u32 frameIndex@4, u8 flags@8, u8 extraFlags@9 (NV_VIDEO_PACKET_EXTRA_FLAG_LTR_FRAME=0x1), u8 multiFecFlags@10, u8 multiFecBlocks@11, u32 fecInfo@12) = 28 bytes before payload.",
"notes": "DELTA: drop pts_ns/frame_bytes/shard_bytes from the wire (GameStream carries none of these per-packet); add the RTP header + reserved[4]; replace explicit u16 FEC fields with the bit-packed fecInfo+multiFecBlocks. flags map 1:1 (FLAG_CONTAINS_PIC_DATA=0x1==punktfunk FLAG_PIC, FLAG_EOF=0x2, FLAG_SOF=0x4). frameIndex == punktfunk frame_index; streamPacketIndex == per-stream packet counter (NOT punktfunk stream_seq which is per-AU)."
},
{
"name": "fecInfo bit-packing (GameStream, exact)",
"layout": "fecInfo (u32, little-endian field) = (dataShards << 22) | (fecIndex << 12) | (fecPercentage << 4). Decode masks (moonlight-common-c): dataShards=(fecInfo & 0xFFC00000)>>22 (bits 22-31, 10 bits, <=1023 but RS caps at 255); fecIndex=(fecInfo & 0x3FF000)>>12 (bits 12-21, the shard's index within its block); fecPercentage=(fecInfo & 0xFF0)>>4 (bits 4-11). parityShards = (dataShards*fecPercentage + 99)/100 (ceiling). bits 0-3 unused.",
"notes": "This is IDENTICAL math to punktfunk's FecConfig::recovery_for (ceil(k*pct/100)). punktfunk already computes data_shards/recovery_shards as explicit u16; the only delta is packing them into this bitfield and emitting fecIndex as the contiguous index across [0..data) then [data..data+parity)."
},
{
"name": "multiFecBlocks bit-packing (GameStream, exact)",
"layout": "multiFecBlocks (u8) = (blockIndex << 4) | ((fec_blocks_needed - 1) << 6). Decode: fecCurrentBlockNumber=(multiFecBlocks>>4)&0x3; lastBlockNumber=(multiFecBlocks>>6)&0x3. Max 4 FEC blocks per frame (2 bits each).",
"notes": "DELTA vs punktfunk: punktfunk uses u16 block_index/block_count with no 4-block ceiling. For P1 wire mode, max_data_per_block must be chosen so a frame needs <=4 blocks AND each block <=255 total shards. punktfunk's p1_defaults (max_data_per_block=200, 15% FEC -> 230 total) already respects 255; just cap blocks at 4 for GameStream mode."
},
{
"name": "RS-FEC shard arrangement (the recoverability question)",
"layout": "GameStream: within one FEC block, RTP sequence numbers are contiguous: data shards occupy [bufferLowestSequenceNumber .. bufferFirstParitySequenceNumber-1], parity shards immediately follow. totalPackets = highest-lowest+1 = dataShards+parityShards. Each shard is exactly receiveSize = packetSize + MAX_RTP_HEADER_SIZE bytes, the last data shard zero-padded to receiveSize. Decode: rs=reed_solomon_new(dataShards, parityShards); reed_solomon_decode(rs, packets[], marks[], totalPackets, receiveSize) with marks[i]=1 for missing. Data shards = the RTP-framed bytes of the video payload concatenated; depacketizer strips RTP+NV header and concatenates payloads to rebuild the Annex-B AU.",
"notes": "punktfunk TODAY: shard_index addresses data [0..K) then recovery [K..K+M) within a block, reconstruct() takes received[] of length K+M with None=lost — STRUCTURALLY THE SAME ordering as GameStream's data-then-parity. VERDICT: punktfunk's reed-solomon GF(2^8) coder CAN produce moonlight-recoverable shards, because both use the same Vandermonde/Cauchy RS over GF(2^8) with data-first layout. BUT the byte CONTENT of each data shard must be GameStream's RTP-framed packet bytes (not punktfunk's 40-byte-header packets), and the shard size must be packetSize+RTP, and the parity must be computed over those exact bytes. CAVEAT (unverified at byte level): the specific RS library Moonlight uses (reed-solomon-new / Fec.c, a CM256/Plank-style Cauchy matrix) may use a different generator matrix than the Rust `reed-solomon` crate; parity bytes are only interoperable if the matrices match. This MUST be validated against real Moonlight before trusting it — if matrices differ, punktfunk must port/match Moonlight's Fec.c matrix exactly (this is the single highest-risk interop item)."
},
{
"name": "video AES-GCM crypto (DELTA)",
"layout": "GameStream video_packet_enc_prefix_t = { u8 iv[12]; u32 frameNumber; u8 tag[16] } prepended to the encrypted payload. IV = 8-byte little-endian per-stream gcm_iv_counter in iv[0..8], iv[11]='V' (0x56), iv[8..11]=0; counter increments per packet. Cipher = AES-128-GCM, key = the GCM key from /launch (riKey). Only the post-RTP/post-NV payload is encrypted; RTP+NV header stay in clear. video_short_frame_header_t (8 bytes, inside the encrypted payload, first packet of frame) = { u8 headerType=0x01; le_u16 frame_processing_latency; u8 frameType (1=P,2=IDR,4=intra-refresh,5=after-ref-invalidation); le_u16 lastPayloadLen; u8 unknown[2] }.",
"notes": "DELTA vs punktfunk crypto.rs: punktfunk seals the ENTIRE packet (header+payload) and uses a 4-byte salt + 8-byte big-endian seq nonce with seq as AAD, prefixing an 8-byte seq. GameStream encrypts only payload, uses 8-byte LE counter + 'V' marker (NO AAD), and the prefix carries iv+frameNumber+tag explicitly. punktfunk's per-direction salt-bit trick is a punktfunk invention not on the GameStream wire. For P1 wire mode the core needs a SEPARATE gcm path matching this prefix exactly."
},
{
"name": "ENet control crypto + opcodes (new in host)",
"layout": "Encrypted control: NVCTL_ENCRYPTED_PACKET_HEADER { le_u16 encryptedHeaderType=0x0001; le_u16 length; u32 seq } then [16-byte AES-GCM tag][encrypted V2 header + payload]. Cipher AES-128-GCM (Sunshine SS_ENC_CONTROL_V2): 12-byte LE IV = seq in bytes 0-3, bytes 10-11='CC'. Plain header V2 = { u16 type; u16 payloadLength }. Opcodes (Gen7 plain): 0x0305 Start A, 0x0307 Start B, 0x0301 invalidate-ref-frames, 0x0201 loss-stats, 0x0206 input, 0x010b rumble, 0x0100 termination, 0x010e HDR, 0x0302 request-IDR (encrypted gen). Periodic ping {le_u16 len=4; le_u32 ts}.",
"notes": "Entirely absent from punktfunk. Belongs in punktfunk-host (ENet via a Rust ENet crate); the AES-128-GCM seal/open of the control payload can reuse a core crypto primitive but the framing is host-side."
},
{
"name": "audio packet (new in host)",
"layout": "audio_packet_t = RTP_PACKET (12) + Opus payload. audio_fec_packet_t = RTP_PACKET + AUDIO_FEC_HEADER. Encryption = AES-128-CBC (NOT GCM); IV = big-endian u32(avRiKeyId + sequenceNumber) where avRiKeyId = first 4 bytes of the launch IV. Fixed RTPA_DATA_SHARDS / RTPA_FEC_SHARDS RS-FEC.",
"notes": "Audio is AES-CBC, different from video GCM — a separate codec path. Lower priority for M2 (can stream video-only first; Moonlight tolerates audio coming up after video)."
}
],
"flow": [
"PHASE A (core, low risk): Add a P1 'gamestream wire mode' to punktfunk-core alongside the internal format. New module crates/punktfunk-core/src/protocol/gamestream.rs implementing (a) RTP+reserved+NV_VIDEO_PACKET serialize/parse with exact bit-packing, (b) a GameStream-layout FEC packetizer/reassembler that emits data-then-parity contiguous RTP shards at packetSize+RTP shard size, (c) the video_packet_enc_prefix_t AES-128-GCM path. Gate behind ProtocolPhase::P1GameStream (already exists). Keep punktfunk's internal 40-byte format for P2.",
"PHASE B (validate the FEC matrix — HIGHEST RISK, do early): Before building any host networking, prove byte-for-byte that punktfunk's reed-solomon GF(2^8) parity matches Moonlight's expectation. Capture real Sunshine video packets (or vendor moonlight-common-c's Fec.c into a test) and assert punktfunk-encoded parity is decodable by Moonlight's RS and vice versa. If the generator matrices differ, port Moonlight's Cauchy matrix into punktfunk's gf8 coder. This gates everything: if shards aren't interoperable, P1 is dead.",
"PHASE C (host control plane, in punktfunk-host): Implement nvhttp on TCP 47989 (HTTP) + 47984 (HTTPS): /serverinfo XML (appversion, GfeVersion, uniqueid, HttpsPort, ExternalPort, mac, MaxLumaPixelsHEVC, ServerCodecModeSupport, currentgame, PairStatus, sessionUrl0), the /pair PIN state machine (getservercert -> clientchallenge -> serverchallengeresp -> clientpairingsecret) with PIN-AES = SHA-256(salt||pin)[..16], AES-128-ECB challenge, SHA-256, X.509 + RSA sign/verify. Persist the paired client cert; pin it for HTTPS client-cert auth.",
"PHASE D (RTSP on TCP 48010): OPTIONS/DESCRIBE/SETUP/ANNOUNCE/PLAY. DESCRIBE returns SDP with x-nv-video[0].* , x-nv-vqos[0].fec.* , x-ss-general.* attributes. SETUP returns server_port= per stream. ANNOUNCE parses client's packetSize, fec.minRequiredFecPackets, maximumBitrateKbps, videoEncoderSlicesPerFrame — feed these into the punktfunk-core Config (shard_payload=packetSize, fec_percent, etc).",
"PHASE E (data plane wiring): On PLAY, bind UDP 47998 (video), spawn the M0 capture->NVENC pipeline, and drive punktfunk-core's P1 packetizer to that socket. Bind UDP 48000 ENet control (request-IDR -> force NVENC keyframe; loss-stats -> adjust; termination). Audio (UDP 47999, AES-CBC) and full input can follow.",
"PHASE F (discovery + display): mDNS-advertise _nvstream._tcp. On RTSP SETUP/PLAY, create the wlroots virtual output sized to the negotiated WxH@fps, point M0 capture at it, tear down on RTSP TEARDOWN / ENet termination."
],
"crypto": "VIDEO (P1 wire): AES-128-GCM, key=riKey from /launch (16 bytes). Per-packet prefix video_packet_enc_prefix_t{iv[12],u32 frameNumber,u8 tag[16]}; IV = 8-byte LE per-stream counter in iv[0..8], iv[11]='V'(0x56), no AAD. Only payload encrypted. ||| CONTROL: AES-128-GCM (Sunshine SS_ENC_CONTROL_V2), 12-byte LE IV = seq[0..4], iv[10..12]='CC', 16-byte tag, NVCTL_ENCRYPTED_PACKET_HEADER prefix. ||| AUDIO: AES-128-CBC, IV = BE u32(avRiKeyId + seq), avRiKeyId = first 4 bytes of launch IV. ||| PAIRING (nvhttp): PIN-derived key = first 16 bytes of SHA-256(salt(16) || ascii-pin(4)); AES-128-ECB for the challenge/response blocks; SHA-256 for the rolling hashes; RSA (server key + client cert) for signing/verifying the pairing secret; X.509 certs exchanged (server cert returned in getservercert, client cert pinned for HTTPS). ||| DELTA vs punktfunk crypto.rs: punktfunk uses AES-128-GCM but with a 4-byte random salt + 8-byte BE seq nonce and seq-as-AAD, sealing the WHOLE packet and prefixing 8-byte seq — none of these match GameStream's iv/marker/prefix/no-AAD scheme. punktfunk has NO ECB/CBC, NO RSA/X.509, NO PIN-KDF. So: keep punktfunk's GCM for P2; add a distinct gamestream-gcm path for P1; add ECB+CBC+RSA+X.509+SHA-256-KDF in the host pairing layer (rustls/aws-lc-rs/rsa/x509 crates).",
"rust_options": "FEC: KEEP the existing `reed-solomon` GF(2^8) coder in punktfunk-core for math, but it MUST be validated byte-compatible with Moonlight's Fec.c (CM256/Plank Cauchy matrix) — if not, port that matrix. (reed-solomon-simd is GF(2^16), P2 only, NOT moonlight-compatible.) ||| ENet control: `rusty_enet` (pure-Rust ENet 1.3.x, no_std-friendly, actively maintained) — speaks the exact ENet wire protocol Moonlight expects; alternative is FFI to libenet via `enet-sys`. ||| RTSP: NO good off-the-shelf server crate handles GameStream's non-standard interleaved/encrypted RTSP — hand-roll a minimal parser over a tokio TcpListener (it's ~6 verbs); `httparse`-style manual parsing. Do NOT pull a full RTSP stack. ||| HTTPS with pinned client-cert: `axum`/`hyper` + `rustls` (ServerConfig with a custom `ClientCertVerifier` that checks the cert against the paired set) + `tokio-rustls`; or `actix-web` with rustls. The plan already commits to axum+tokio for the control plane. ||| X.509 gen: `rcgen` (generate the self-signed server cert + key on first run); parse/verify client certs with `x509-parser` + `rsa` + `sha2`. PIN-KDF and ECB/CBC/GCM via `aes`, `aes-gcm`, `cbc`, `ecb` (RustCrypto) or `aws-lc-rs`/`openssl`. ||| mDNS: `mdns-sd` (pure-Rust, registers `_nvstream._tcp.local` with TXT records) or `zeroconf` (FFI to Avahi). `mdns-sd` preferred (no daemon dependency). ||| Opus audio: `audiopus`/`opus` crate if/when audio is implemented.",
"reuse_from_punktfunk": "REUSE: (1) punktfunk-core's GF(2^8) `reed-solomon` coder and its data-then-parity reconstruct() contract — same ordering as GameStream; (2) FecConfig::recovery_for ceil(k*pct/100) — IDENTICAL to Moonlight's parity math; (3) the ReassemblerLimits bounds-before-allocate hardening pattern — reuse the same discipline when parsing attacker-controlled NV_VIDEO_PACKET fields; (4) aes-gcm dependency and crypto.rs structure (the GCM primitive itself, even though nonce/prefix scheme differs); (5) ProtocolPhase::P1GameStream / FecScheme::Gf8 enums already exist as the negotiation hook; (6) punktfunk-host M0's capture->NVENC pipeline produces exactly the HEVC/H264 Annex-B AUs that become GameStream video payload; (7) the Packetizer/Reassembler split is the right shape — add a parallel GameStream packetizer/reassembler beside them. ||| MUST BUILD NEW: the RTP+NV_VIDEO_PACKET (de)serialization with bit-packed fecInfo/multiFecBlocks; the GameStream-layout shard emitter (contiguous data-then-parity, packetSize+RTP shard size, no 40-byte punktfunk header); the video_packet_enc_prefix_t GCM path (iv counter + 'V', payload-only, no AAD); the ENTIRE control plane (nvhttp pairing, RTSP, ENet control, mDNS, X.509/RSA, ECB/CBC); audio AES-CBC path. ||| CANNOT REUSE on the wire: punktfunk's 40-byte PacketHeader, its 8-byte-seq GCM framing, its per-direction salt bit — all are punktfunk-internal inventions absent from GameStream.",
"gotchas": [
"RS GENERATOR MATRIX is the #1 interop risk: same GF(2^8) RS and same data-first ordering does NOT guarantee byte-compatible parity. Moonlight's Fec.c uses a specific Cauchy/Vandermonde matrix; the Rust `reed-solomon` crate may differ. Validate against real Moonlight FIRST (Phase B) or all of P1 fails silently as 'unrecoverable loss'.",
"GameStream RTP header is BIG-endian; punktfunk's PacketHeader is little-endian. NV_VIDEO_PACKET itself is little-endian. Don't conflate them.",
"streamPacketIndex is a per-STREAM monotonic packet counter (the RTP-ish sequence), NOT punktfunk's per-AU stream_seq. frameIndex is the per-frame counter. Two different counters.",
"multiFecBlocks caps a frame at 4 FEC blocks (2 bits). Combined with the 255-shard GF(2^8) cap, large frames at high res can overflow — this IS the 1 Gbps wall the plan describes. P1 must keep frames within 4 blocks x 255 shards; reduce via slicesPerFrame / bitrate from ANNOUNCE.",
"Video GCM uses NO AAD and an 8-byte LE counter + 'V' marker; punktfunk's GCM uses seq-as-AAD + per-direction salt. A naive reuse of punktfunk's seal() will produce undecryptable-by-Moonlight packets. Build the gamestream-gcm path separately.",
"Encryption is NEGOTIATED (ENCFLG_VIDEO=0x2, ENCFLG_AUDIO=0x1 in serverinfo/SETUP). Many Moonlight setups stream video in the CLEAR on LAN — implement plaintext video first, add GCM second; serverinfo's encryptionSupported/Requested controls this.",
"Audio is AES-CBC not GCM, with a BE counter IV — a third distinct crypto scheme. Easy to get wrong if you assume GCM everywhere.",
"Pairing PIN key = SHA-256(salt||pin)[..16] where pin is the 4 ASCII digits and salt is 16 raw bytes from the client — order and encoding matter exactly. ECB (not CBC) for the challenge blocks.",
"The shard size is packetSize + RTP_HEADER (not just packetSize). punktfunk's shard_payload must be set to the negotiated packetSize and the shard the core FEC-protects must include the RTP/NV framing bytes, else the depacketizer mis-aligns.",
"HTTPS endpoint pins the CLIENT cert obtained during pairing; a stock TLS server that accepts any cert will let unpaired clients in. Use a custom rustls ClientCertVerifier.",
"mDNS service name must be exactly _nvstream._tcp with the right TXT records or Moonlight won't auto-discover (manual IP add still works as a fallback for testing)."
],
"sources": [
"/home/enricobuehler/punktfunk/crates/punktfunk-core/src/packet.rs (PacketHeader 40-byte layout, Packetizer/Reassembler, ReassemblerLimits hardening, FLAG_* constants)",
"/home/enricobuehler/punktfunk/crates/punktfunk-core/src/crypto.rs (SessionCrypto AES-128-GCM, 4-byte salt + 8-byte BE seq nonce, seq-as-AAD, per-direction salt bit)",
"/home/enricobuehler/punktfunk/crates/punktfunk-core/src/config.rs (FecConfig::recovery_for ceil(k*pct/100), FecScheme::max_total_shards Gf8=255, ProtocolPhase::P1GameStream, p1_defaults)",
"/home/enricobuehler/punktfunk/crates/punktfunk-core/src/session.rs (seal_for_wire 8-byte seq prefix, submit_frame/poll_frame hot path)",
"/home/enricobuehler/punktfunk/crates/punktfunk-core/src/fec/mod.rs (ErasureCoder trait, data-then-parity reconstruct contract, GF(2^8) Gf8Coder)",
"/home/enricobuehler/punktfunk/crates/punktfunk-host/src/{web.rs,vdisplay.rs,inject.rs,pipeline.rs,m0.rs} (control-plane stub, VirtualDisplay trait + wlroots/kwin/mutter stubs, M0 capture->NVENC->AU pipeline + punktfunk-core loopback)",
"/home/enricobuehler/punktfunk/docs/implementation-plan.md sections 3,5,6,8 (P1/P2/P3 strategy, C ABI, virtual-display orchestration, milestones M0/M2)",
"moonlight-common-c/src/Video.h (NV_VIDEO_PACKET 16-byte struct, RTP_PACKET 12-byte struct, FLAG_CONTAINS_PIC_DATA/EOF/SOF, FIXED_RTP_HEADER_SIZE)",
"moonlight-common-c/src/RtpVideoQueue.c (fecInfo masks 0xFFC00000>>22 / 0x3FF000>>12 / 0xFF0>>4, parity=(data*pct+99)/100, reed_solomon_new/reed_solomon_decode, receiveSize=packetSize+MAX_RTP_HEADER_SIZE, contiguous data-then-parity sequence range, multiFecBlocks>>4&0x3 / >>6&0x3)",
"moonlight-common-c/src/ControlStream.c (ENet channels, opcodes 0x0305/0x0307/0x0301/0x0201/0x010b/0x0100/0x010e/0x0302, NVCTL_ENCRYPTED_PACKET_HEADER, AES-128-GCM control IV seq+'CC', ping/loss-stats/HDR/rumble/termination layouts)",
"moonlight-common-c/src/Limelight.h (ML_PORT_INDEX/FLAG constants 47984/47989/48010/47998/47999/48000, ENCFLG_AUDIO=0x1/VIDEO=0x2, VIDEO_FORMAT_* codec masks, STREAM_CONFIGURATION fields incl remoteInputAesKey/Iv, packetSize)",
"Sunshine/src/stream.cpp (video_packet_raw_t = RTP+reserved[4]+NV_VIDEO_PACKET, fecInfo send packing x<<12|data<<22|pct<<4, multiFecBlocks (block<<4)|((n-1)<<6), video_short_frame_header_t, video_packet_enc_prefix_t iv[12]+frameNumber+tag[16], IV 8-byte counter + iv[11]='V', audio AES-CBC IV=BE(avRiKeyId+seq), CONTROL/VIDEO/AUDIO_STREAM_PORT via map_port)",
"Sunshine/src/rtsp.cpp (OPTIONS/DESCRIBE/SETUP/ANNOUNCE/PLAY flow, server_port= response, SDP x-nv-video/x-nv-vqos/x-ss-general attributes, encrypted_rtsp_header_t MSB 0x80000000, RTSP_SETUP_PORT default TCP 48010, ANNOUNCE carries packetSize/fec/bitrate/slicesPerFrame)",
"Sunshine/src/nvhttp.cpp (endpoints /serverinfo /pair /applist /launch /resume /cancel, HTTPS 47984 / HTTP 47989, pairing state machine getservercert/clientchallenge/serverchallengeresp/clientpairingsecret, serverinfo XML fields)",
"Sunshine/src/crypto.cpp (gen_aes_key = SHA-256(salt||pin) truncated to 16 bytes, AES-128 ECB/GCM/CBC modes)",
"Moonlight/Sunshine port documentation (TCP 47984/47989/48010, UDP 47998-48000/48010) — moonlight-stream wiki and portforward.com (port roles confirmation)"
]
}
]
}
@@ -0,0 +1,189 @@
# punktfunk — security audit (2026-06-21)
Whole-project audit by a 10-surface multi-agent review; every finding adversarially verified (reachability, attacker-control, existing mitigation). **10 surfaces · 20 raw findings → 18 confirmed/partial, 2 refuted.** Threat model: a malicious network client (pre- and post-pairing) is the primary adversary; also an on-path MITM and a local unprivileged user (the host is privileged).
## Remediation status (2026-06-21)
All 12 ranked findings have been addressed: fixed, documented where a fix isn't safely possible, or (for #12) deferred with a fix ready:
| # | Sev | Status |
|---|---|---|
| #1 | high | **FIXED** (3526517) — secret files 0600 + dir 0700 / Windows icacls DACL |
| #2 | high | **FIXED** (3526517) — single-use SPAKE2 PIN (consumed at the host key-confirmation) |
| #3 | med | **FIXED** (3526517) — RTSP packetSize bounded + saturating packetizer math |
| #4 | low | **FIXED** — mgmt mTLS-cert auth restricted to a read-only allowlist; admin/state-changing routes require the bearer token |
| #5 | low | **DOCUMENTED (won't-fix on legacy)** — legacy GameStream GCM nonce reuse is inherent to Nvidia's old-style control encryption (Apollo/Moonlight identical); the GCM key is client-known. Real fix = V2 control-encryption negotiation; use punktfunk/1 for untrusted nets. Code comment at `control.rs` rumble loop. |
| #6 | low | **FIXED** — RTSP Content-Length/header caps + per-read timeout + concurrent-connection cap |
| #7 | low | **FIXED (GameStream) / DOCUMENTED (native)** — new `VirtualDisplay::set_launch_command` carries the launch command per-session (GameStream); native path keeps the env (safe under today's single-session model; plumb per-session with concurrent sessions) |
| #8 | info | **FIXED** — constant-time GameStream phase-4 hash compare (`crypto::ct_eq`) |
| #9 | info | **DOCUMENTED** — GameStream pairing over plain HTTP is inherent to GFE compat; steer untrusted networks to the SPAKE2 native plane |
| #10 | info | **FIXED** — fixed ALPN (`pkf1`) on both QUIC endpoints (coordinated client+host upgrade) |
| #11 | info | **FIXED** — FEC reconstruction failure is now a counted drop, not stream-fatal |
| #12 | low | **DEFERRED (fix ready, reverted)** — the scoped-dispatcher fix (undici `Agent` on `proxyRequest`'s `fetch` option) is designed and the mechanism verified sound (h3 honors the fetch option), but it needs `undici` added as a web dependency (`bun add undici` + lockfile regen), which requires the web build env — not available here. Reverted to keep the web build/proxy working. Latent-only: the loopback mgmt fetch is the web console's ONLY outbound TLS, so the global env weakens nothing today. Apply with: `cd web && bun add undici`, then scope `rejectUnauthorized:false` to the mgmt fetch and drop the global env. |
## Executive summary
Overall the punktfunk host is a security-conscious codebase with a strong cryptographic and wire-parsing core: the FEC/reassembler path bounds every attacker-controlled length field before allocation, AES-GCM is used correctly with per-direction nonce separation and seq-as-AAD on the native plane, and the native trust model (SPAKE2 PIN binding both cert fingerprints, fingerprint pinning that still verifies the real TLS handshake signature) is genuinely sound.
The most serious real defects are: (1) local disclosure of the host's master private key (key.pem), which is written with no restrictive mode/ACL while the far-less-sensitive mgmt token is carefully 0600; on Windows (%ProgramData% default Users-read ACL, LocalSystem service) this is a near-certain cross-privilege host-impersonation primitive. (2) The native SPAKE2 PIN ceremony permits unlimited online guesses against a static, non-rotating 4-digit PIN (no disarm-on-failure, no lockout), which contradicts the documented "one online guess" guarantee and lets a pre-auth LAN attacker brute-force pairing of a fully-trusted rogue client in a few hours against the default standalone/CLI flow.
Dominant themes: file-permission hygiene on secrets is inconsistent (the secure pattern exists but is applied selectively), pairing throttling relies on a single global rate-limit rather than attempt-bounding, and authorization is overbroad (any streaming-paired cert is also a full mgmt admin).
The remaining findings are a contained pre-auth RTSP video-thread DoS (unbounded packetSize and Content-Length), a legacy GameStream control-stream GCM nonce reuse that is muted because modern clients negotiate V2 and because exploiting it requires the key (which only a paired client holds), and several defense-in-depth nits (non-constant-time GameStream hash compare, no QUIC ALPN, cross-session env-var launch confusion, global NODE_TLS_REJECT_UNAUTHORIZED). No memory-unsafety or RCE was found on attacker wire bytes; panics are safe Rust and isolated by panic=unwind.
Net: a solid foundation whose highest-leverage fixes are tightening secret file permissions and making the PIN single-use/lockout-bounded.
## Findings (ranked by severity × exploitability)
### 🟠 #1 [HIGH] Host master private key (key.pem) written with no restrictive file mode / ACL — local secret disclosure enabling full host impersonation
**Surface:** `secrets-availability`
**Refs:** `crates/punktfunk-host/src/gamestream/cert.rs:36-44`, `crates/punktfunk-host/src/gamestream/mod.rs:216-232`, `crates/punktfunk-host/src/mgmt_token.rs:58-70`, `crates/punktfunk-host/src/service.rs:605-627`, `crates/punktfunk-host/src/native_pairing.rs:116-126`
**Why it ranks here / impact:** Ranked #1 because it is the highest verdict-adjusted severity (high, three corroborating findings merged) and the most reliably exploitable post-foothold: key.pem is the single trust root for ALL surfaces — GameStream TLS server cert, GameStream pairing signing key, the punktfunk/1 QUIC identity every client pins, and the mgmt HTTPS cert — so its disclosure yields full host impersonation/MITM that defeats client fingerprint pinning, plus the mgmt bearer token is likewise unprotected on Windows. ServerIdentity::load_or_create writes it with a bare fs::write (no mode) and create_dir_all (no DACL). On Windows the leak is near-certain and umask-independent: config_dir() is %ProgramData%\punktfunk, whose default ACL grants BUILTIN\Users read, and the host runs as LocalSystem — any local unprivileged user reads the SYSTEM service's key; the mgmt-token 0o600 hardening is #[cfg(unix)] so it is a no-op there. On Linux the file lands at umask (commonly 0664/0644, verified live as world-readable) and is reachable cross-user whenever the home/config chain is traversable. The project demonstrably knows the secure pattern (mgmt_token.rs uses OpenOptions::mode(0o600)+set_permissions) but applies it to the less-sensitive token and not the master key. Local-only (adversary #3), not pre-auth/network, which caps it below critical.
**Fix:** Write key.pem (and cert.pem) via OpenOptions::mode(0o600) + a follow-up set_permissions(0o600) on Unix, mirroring mgmt_token.rs; create config_dir() with DirBuilder::mode(0o700). On Windows set an explicit DACL granting only SYSTEM+Administrators on the punktfunk %ProgramData% subtree and per-file on key.pem / mgmt-token / *paired.json (or relocate the key under a SYSTEM-only path), since the default ProgramData ACL is Users-readable. Extend the same hardening to client-key.pem and the persisted trust stores. Add a regression test asserting 0600 on key.pem on Unix.
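The Unix half of this fix can be sketched as follows. This is a minimal illustration using only the standard library; `write_secret` and `create_private_dir` are hypothetical helper names, not functions from the punktfunk codebase, and the Windows DACL step is not covered here.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: write `bytes` to `path` readable by the owner only.
#[cfg(unix)]
pub fn write_secret(path: &Path, bytes: &[u8]) -> io::Result<()> {
    use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};

    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(0o600) // only takes effect when the file is newly created
        .open(path)?;
    // Tighten a pre-existing file as well, before any secret bytes land in it.
    file.set_permissions(fs::Permissions::from_mode(0o600))?;
    file.write_all(bytes)?;
    file.sync_all()
}

/// Hypothetical helper: create the config directory as 0700.
#[cfg(unix)]
pub fn create_private_dir(dir: &Path) -> io::Result<()> {
    use std::os::unix::fs::{DirBuilderExt, PermissionsExt};

    fs::DirBuilder::new().recursive(true).mode(0o700).create(dir)?;
    // Also fix up a directory that already existed with a looser mode.
    fs::set_permissions(dir, fs::Permissions::from_mode(0o700))
}
```

Setting the mode at open time avoids a window in which the file exists with umask-derived permissions; the follow-up `set_permissions` handles files that were created before the hardening was in place.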
### 🟠 #2 [HIGH] Native SPAKE2 PIN ceremony allows unlimited online guesses against a static 4-digit PIN — pre-auth brute-force to a fully-trusted rogue client
**Surface:** `pairing-pin`
**Refs:** `crates/punktfunk-host/src/punktfunk1.rs:388-446`, `crates/punktfunk-host/src/punktfunk1.rs:475-491`, `crates/punktfunk-host/src/punktfunk1.rs:82`, `crates/punktfunk-host/src/native_pairing.rs:189-234`, `crates/punktfunk-host/src/native_pairing.rs:128-131`, `crates/punktfunk-host/src/mgmt.rs:841-842`
**Why it ranks here / impact:** Ranked #2: high severity AND pre-auth + fully attacker-controlled, the strongest exploitability combination among the high-rated issues — gated only on pairing being armed and an hours-long active window. Merges the three pairing-pin brute-force findings (they share one root cause: no disarm/rotate-on-failure and no attempt budget). pair_ceremony logs a warning and returns Err on a wrong PIN but never calls np.disarm() or rotates the PIN; current_pin() returns the same value forever (cleared only by TTL or operator); the only throttle is one process-wide 2s PAIRING_COOLDOWN. The PIN space is 10,000. Critically the standalone punktfunk1-host default (--require-pairing forces allow_pairing) arms with expires_at:None at startup, so the indefinite static-PIN window is the DEFAULT for that binary, not an opt-in. At ~1 guess/2s the space exhausts in ~5.5h worst / ~2.8h avg, and on success the attacker's cert is permanently pinned, granting input injection, screen capture and app launch. This directly contradicts the documented 'one online guess, no offline dictionary' claim — the offline-dictionary resistance from SPAKE2 holds, but the online single-guess limit is simply not implemented. Mitigations partial: the web/mgmt arm path is TTL-bounded (15..600s), confining the worst case to the CLI/standalone mode.
**Fix:** Make a failed confirmation consume the PIN: on ok==false in pair_ceremony, call np.disarm() (or rotate to a fresh random PIN) so a single wrong guess closes the window — this is what actually delivers the documented 'one online guess'. Add a per-window failed-attempt budget (auto-disarm after N>=1 failures), give the CLI no-expiry arm path a default expiry, and disarm after a SUCCESSFUL pair too. Keep the 2s cooldown as defence-in-depth and raise the web-armed PIN to 6 digits.
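A rough sketch of the "one confirmation attempt consumes the PIN" rule, together with the brute-force arithmetic quoted above. `PairingWindow` and its methods are hypothetical stand-ins for the host's pairing state, and `confirmation_ok` stands for the outcome of the SPAKE2 key confirmation.

```rust
/// Hypothetical pairing window: an armed PIN survives exactly one
/// confirmation attempt, whether that attempt succeeds or fails.
pub struct PairingWindow {
    pin: Option<String>,
}

impl PairingWindow {
    pub fn arm(pin: &str) -> Self {
        Self { pin: Some(pin.to_owned()) }
    }

    pub fn is_armed(&self) -> bool {
        self.pin.is_some()
    }

    /// Feed in the key-confirmation result. The PIN is taken (disarmed)
    /// unconditionally, so a wrong guess closes the window.
    pub fn consume(&mut self, confirmation_ok: bool) -> bool {
        let was_armed = self.pin.take().is_some();
        was_armed && confirmation_ok
    }
}

/// Worst-case seconds to walk the whole PIN space at one guess per cooldown.
pub fn worst_case_secs(pin_space: u64, cooldown_secs: u64) -> u64 {
    pin_space * cooldown_secs
}
```

With a 10,000-value PIN space and a 2 s cooldown, `worst_case_secs` gives 20,000 s, which is the roughly 5.5 h figure above; with the window consumed on the first attempt, that walk is no longer possible without the operator re-arming each time.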
### 🟡 #3 [MEDIUM] Pre-auth RTSP ANNOUNCE packetSize underflows/panics the GameStream video pipeline (div-by-zero / OOB slice / allocation amplification)
**Surface:** `gamestream-parsing`
**Refs:** `crates/punktfunk-host/src/gamestream/rtsp.rs:275`, `crates/punktfunk-host/src/gamestream/video.rs:55-89`, `crates/punktfunk-host/src/gamestream/stream.rs:322`
**Why it ranks here / impact:** Ranked #3: medium and fully pre-auth + attacker-controlled — the highest-exploitability of the medium-and-below tier. The RTSP listener on TCP 48010 performs no TLS/pairing/auth; an unauthenticated peer drives OPTIONS→ANNOUNCE→PLAY (+ a UDP ping to the video port) and the video thread starts on state.stream alone, no paired session required. x-nv-video[0].packetSize is read with no bound and flows into VideoPacketizer::new where payload_per_shard = packet_size - 16: packetSize==16 → pps==0 → div-by-zero panic; packetSize<16 → underflow → OOB slice panic; packetSize==17 → one byte/shard → per-frame datagram flood. Reliable remote pre-auth DoS of a privileged media service, made stickier because the panic unwinds before running.store(false) leaving the session wedged until restart. Calibrated medium (not higher) because it is a SAFE Rust panic (checked slice access, no memory corruption/UB) isolated to the punktfunk-video thread by panic=unwind — the host process and other listeners survive; not RCE.
**Fix:** Validate packet_size in stream_config() before building StreamConfig: reject packetSize below a sane floor (e.g. < 64) and clamp to a sane max (e.g. <= 2048). Additionally harden VideoPacketizer::new to use checked/saturating arithmetic and refuse construction (or fall back to a default) when packet_size <= 16 (the header bytes subtracted to obtain payload_per_shard), so the per-frame path never sees pps==0 or a wrapped payload_per_shard. Also call running.store(false) on the unwind path so a panic doesn't wedge the session. Add a regression test over packetSize in {0,15,16,17}.
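The bounds check and the checked subtraction might look like the sketch below. The constants and function names are assumptions for illustration: the floor of 64 and ceiling of 2048 are the example values from the fix text, and 16 is the header size implied by `payload_per_shard = packet_size - 16`.

```rust
// Assumed values; the real constants live in the host crate.
const SHARD_HEADER: usize = 16;
const MIN_PACKET_SIZE: usize = 64;
const MAX_PACKET_SIZE: usize = 2048;

/// Reject a client-supplied packetSize below the floor, clamp to the ceiling.
pub fn validate_packet_size(requested: usize) -> Result<usize, String> {
    if requested < MIN_PACKET_SIZE {
        return Err(format!("packetSize {requested} is below the floor of {MIN_PACKET_SIZE}"));
    }
    Ok(requested.min(MAX_PACKET_SIZE))
}

/// Payload bytes per shard, or None if the header leaves no room.
/// Never returns Some(0), so callers cannot divide by zero.
pub fn payload_per_shard(packet_size: usize) -> Option<usize> {
    match packet_size.checked_sub(SHARD_HEADER) {
        Some(0) | None => None,
        Some(n) => Some(n),
    }
}

/// Number of data shards needed for a frame, or None for an unusable size.
pub fn shards_for_frame(frame_len: usize, packet_size: usize) -> Option<usize> {
    let pps = payload_per_shard(packet_size)?;
    Some(frame_len.saturating_add(pps - 1) / pps)
}
```

Keeping both layers means the packetizer stays safe even if a future caller bypasses the RTSP-level validation.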
### 🔵 #4 [LOW] Any paired punktfunk/1 streaming client gets full management-API authority via the mTLS-paired-cert auth path (no streaming-vs-admin separation)
**Surface:** `authz-trust`
**Refs:** `crates/punktfunk-host/src/mgmt.rs:459-488`, `crates/punktfunk-host/src/mgmt.rs:466-470`
**Why it ranks here / impact:** Ranked #4: low but a genuine post-auth privilege over-broadening with concrete admin impact. require_auth grants any verified peer cert whose fingerprint is in the native paired store full unscoped access to every /api/v1 route — the SAME paired set that admits a device to stream. So a device paired purely to watch the screen can DELETE /clients/{fp} (unpair others), POST /native/pair/arm (open a pairing window and read the PIN), approve arbitrary knocking devices, DELETE /session, and CRUD the library; there is no role/scope check anywhere in the router. The native client presents its identity via TLS client auth on both ports, so the credential is genuinely usable against mgmt. Bounded to low because it requires being an already-paired (operator-trusted) device and the mgmt port binds loopback by default — remote reach needs an explicit routable --mgmt-bind (and the mTLS path then bypasses the token requirement).
**Fix:** Separate streaming trust from management trust: keep a distinct admin allow-list (or an admin flag on a paired entry) for the mTLS mgmt path, or restrict mTLS-cert auth to read-only endpoints and require the bearer token for state-changing/admin routes. At minimum gate the pairing-administration endpoints (arm/approve/unpair) and session/library mutation behind the bearer token only.
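One way to picture the read-only restriction is a small authorization predicate. Everything in this sketch is illustrative: the `Credential` enum, the function name, and the two allowlisted paths are assumptions, not the actual mgmt router.

```rust
/// How the caller authenticated to the mgmt API (hypothetical model).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Credential {
    BearerToken,
    PairedCert,
    Anonymous,
}

/// Illustrative read-only routes a paired streaming cert may reach.
const CERT_READ_ONLY: &[&str] = &["/api/v1/status", "/api/v1/clients"];

/// Bearer token: full access. Paired cert: GET on allowlisted routes only.
pub fn is_authorized(cred: Credential, method: &str, path: &str) -> bool {
    match cred {
        Credential::BearerToken => true,
        Credential::PairedCert => method == "GET" && CERT_READ_ONLY.contains(&path),
        Credential::Anonymous => false,
    }
}
```

An allowlist fails closed: a newly added state-changing route is token-only unless someone deliberately lists it.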
### 🔵 #5 [LOW] GameStream legacy control-stream AES-GCM nonce reuse across directions (host rumble vs client input share key+nonce)
**Surface:** `crypto`
**Refs:** `crates/punktfunk-host/src/gamestream/control.rs:373-400`, `crates/punktfunk-host/src/gamestream/control.rs:257-266`, `crates/punktfunk-host/src/gamestream/control.rs:67,106-114`
**Why it ranks here / impact:** Ranked #5: a real, correctly-identified catastrophic-class crypto defect (AES-GCM (key,nonce) reuse) but adjusted to low because reachability and impact are heavily muted. The legacy NonceKind branches apply no direction separation (other => other), so host rumble (rumble_seq from 0) and client control (seq from 0) under the shared rikey produce identical (key,nonce). BUT: (1) it only triggers on the legacy auto-detected scheme — modern moonlight-common-c negotiates the V2 scheme which flips marker[0] to 'H' and is direction-separated, so the default path is safe; the doc claim 'the legacy path — which we hit' is stale; (2) the rikey is delivered only over the mTLS /launch, so a pure MITM cannot derive the key — only a paired client can; (3) a paired client can already legitimately send any client→host control message (in-scope-by-design), so forgery is largely redundant and the only genuinely new gain is recovering low-value rumble keystream / forging rumble to its own client. Post-auth, conditional path.
**Fix:** Separate the two directions' nonce spaces for the legacy schemes too — set a reserved high bit/byte of the legacy IV for host-originated packets (mirror the V2 'H' marker), or better, HKDF-derive an independent host→client key from the rikey with a direction label so host and client never share a GCM key. Never let host rumble and client input share (key,nonce).
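The direction-bit idea can be illustrated with a small IV builder. The layout below (little-endian sequence in bytes 0..4, 'C','C' in bytes 10..12) follows the control IV described in this repository's GameStream notes; the direction flag in byte 9 is a hypothetical choice, and changing the IV on one side only would not interoperate with a legacy client.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Direction {
    ClientToHost,
    HostToClient,
}

/// Build a 12-byte GCM IV whose value differs per direction even when
/// both sides use the same sequence number under the same key.
pub fn control_iv(seq: u32, dir: Direction) -> [u8; 12] {
    let mut iv = [0u8; 12];
    iv[0..4].copy_from_slice(&seq.to_le_bytes());
    // Hypothetical direction flag in an otherwise-zero byte.
    iv[9] = match dir {
        Direction::ClientToHost => 0x00,
        Direction::HostToClient => 0x80,
    };
    iv[10] = b'C';
    iv[11] = b'C';
    iv
}
```

Because the two directions can never produce the same IV, host rumble at sequence 0 and client input at sequence 0 no longer collide on (key, nonce).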
### 🔵 #6 [LOW] RTSP request Content-Length / header size unbounded with no read timeout or connection cap — pre-auth slow-loris / memory-growth DoS
**Surface:** `gamestream-parsing`
**Refs:** `crates/punktfunk-host/src/gamestream/rtsp.rs:82-106`, `crates/punktfunk-host/src/gamestream/rtsp.rs:24-48`
**Why it ranks here / impact:** Ranked #6: low, pre-auth and attacker-controlled but a rate-limited resource DoS, not unsafety or auth bypass. read_message parses content-length and computes total = end+4+content_len with no cap, looping buf.extend_from_slice until buf.len()>=total; the header scan is likewise unbounded and there is no body/header cap, no read/write timeout, and one unbounded native thread is spawned per connection with no global limit. Growth is bounded by attacker send rate (no pre-allocation), so it is slow exhaustion rather than instant OOM; the stronger lever is thread/FD exhaustion from many idle slow-loris connections at near-zero bandwidth. On a privileged LAN-facing plaintext listener with zero defensive caps.
**Fix:** Cap Content-Length and total header size to small constants (e.g. reject content_len > 64 KiB, total header > 16 KiB) and close on violation. Add a read timeout so a slow-loris connection cannot pin a thread indefinitely, and bound concurrent RTSP connections.
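A minimal sketch of the caps, assuming the example limits from the fix text (16 KiB of headers, 64 KiB of body). The function and error names are hypothetical; `TcpStream::set_read_timeout` is the standard-library call, and the 10-second value is an arbitrary placeholder.

```rust
use std::io;
use std::net::TcpStream;
use std::time::Duration;

pub const MAX_HEADER_BYTES: usize = 16 * 1024;
pub const MAX_BODY_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum RtspLimitError {
    HeaderTooLarge,
    BodyTooLarge,
}

/// Total bytes to read for one request: headers, the 4-byte "\r\n\r\n"
/// terminator, then the body. Errors out before any buffer grows.
pub fn total_message_len(header_end: usize, content_len: usize) -> Result<usize, RtspLimitError> {
    if header_end > MAX_HEADER_BYTES {
        return Err(RtspLimitError::HeaderTooLarge);
    }
    if content_len > MAX_BODY_BYTES {
        return Err(RtspLimitError::BodyTooLarge);
    }
    Ok(header_end + 4 + content_len)
}

/// While still scanning for the terminator, stop once the buffered bytes
/// exceed what a legal header block could occupy.
pub fn header_scan_within_cap(buffered: usize) -> bool {
    buffered <= MAX_HEADER_BYTES + 4
}

/// Bound how long a single read may block, so an idle peer cannot pin a thread.
pub fn apply_read_timeout(stream: &TcpStream) -> io::Result<()> {
    stream.set_read_timeout(Some(Duration::from_secs(10)))
}
```

A concurrent-connection cap would sit in the accept loop, for example a counter checked before spawning the per-connection thread.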
### 🔵 #7 [LOW] Per-session launch command carried via process-global PUNKTFUNK_GAMESCOPE_APP env var, stomped under concurrent native sessions (cross-session launch confusion)
**Surface:** `privilege-process-launch`
**Refs:** `crates/punktfunk-host/src/punktfunk1.rs:560-571`, `crates/punktfunk-host/src/punktfunk1.rs:140`, `crates/punktfunk-host/src/vdisplay/gamescope.rs:629-647`
**Why it ranks here / impact:** Ranked #7: low, post-auth cross-session isolation bug, explicitly NOT command injection. serve_session does std::env::set_var(PUNKTFUNK_GAMESCOPE_APP) per accepted connection with a stale comment claiming 'one session at a time', but DEFAULT_MAX_CONCURRENT=4 sessions run concurrently and the var is read in gamescope::spawn during VirtualDisplay::create — a genuine TOCTOU where client B's launch overwrites what client A's bare-spawn reads, and the never-cleared value leaks into a later no-launch client. Impact is capped because cmd always resolves through library::launch_command (digit-validated Steam appids / operator-only custom store), so the worst case is launching a DIFFERENT operator-approved title or a stale title — and it only affects the gamescope bare-spawn backend (kwin/mutter/wlroots/attach ignore the var).
**Fix:** Stop carrying the per-session launch command in a process-global env var. Plumb the resolved command through the VirtualDisplay::create call / per-session context (e.g. a field on Mode or a per-session GamescopeDisplay), and on the bare-spawn path pass it explicitly to spawn(); clear/scope it so a stale value never leaks to the next client.
### ⚪ #8 [INFO] GameStream pairing phase-4 hash compare is not constant-time
**Surface:** `pairing-pin`
**Refs:** `crates/punktfunk-host/src/gamestream/pairing.rs:226-247`
**Why it ranks here / impact:** Ranked #8: info / hardening only — a real variable-time `==` on attacker-influenced 32-byte SHA-256 digests, but not weaponizable. The compared `expected` mixes in host-random server_challenge that is never disclosed (so the attacker can neither compute nor aim at the target), the attacker cannot steer client_hash to a chosen value without the PIN key, and any mismatch removes the session (map.remove) forcing a fresh ceremony with new randomness — so there is no stable secret to recover prefix-by-prefix and no path from timing to PIN recovery or match forgery. Worth fixing for consistency since the codebase already has ct_eq for the native ceremony.
**Fix:** Use a constant-time comparator (subtle::ConstantTimeEq or the project's existing ct_eq) for hash_ok, matching the constant-time discipline already used in the native SPAKE2 ceremony.
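For illustration, a constant-time comparison typically accumulates the XOR of every byte pair instead of returning at the first mismatch. The function below is a standalone sketch, not the project's `crypto::ct_eq`; production code would normally rely on a vetted implementation such as `subtle::ConstantTimeEq`, since a hand-rolled loop is only best-effort against compiler optimizations.

```rust
/// Compare two byte strings without an early exit on the first difference.
/// Length is treated as public: unequal lengths return false immediately.
pub fn ct_eq_sketch(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        // OR in every differing bit; the loop always runs to the end.
        diff |= x ^ y;
    }
    // black_box discourages the optimizer from short-circuiting the loop.
    std::hint::black_box(diff) == 0
}
```

For the phase-4 check both inputs are 32-byte SHA-256 digests, so the length branch never distinguishes anything secret.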
### ⚪ #9 [INFO] GameStream pairing ceremony runs over plain HTTP — inherited GFE brute-forceable-PIN / MITM weakness
**Surface:** `authz-trust`
**Refs:** `crates/punktfunk-host/src/gamestream/nvhttp.rs:33`, `crates/punktfunk-host/src/gamestream/nvhttp.rs:215-264`, `crates/punktfunk-host/src/gamestream/pairing.rs:102-247`
**Why it ranks here / impact:** Ranked #9: info — real but intentional Moonlight-compat behavior, on record rather than a regression. The whole /pair flow (incl. phase-4 cert pinning) is on plain HTTP 47989 with no transport confidentiality and no rate-limiting; the AES key is pin_key(salt,pin) = SHA-256(salt||pin)[..16] feeding AES-128-ECB, so an on-path attacker observing a legitimate pairing can offline-brute-force the 4-digit PIN and forge a clientpairingsecret to get a cert pinned. This is the well-known GFE/Sunshine construction, fixed by interop, and is precisely why punktfunk/1's SPAKE2 path exists; it requires an active MITM during an operator-initiated pairing within the 300s window. A paired GameStream client is in-scope-by-design.
**Fix:** Inherent to GameStream compatibility — document it and steer users to punktfunk/1 (SPAKE2) for untrusted networks. Optionally rate-limit pairing sessions per uniqueid/IP and tighten/expire the awaiting-PIN window aggressively.
### ⚪ #10 [INFO] No ALPN configured on the native QUIC server/client (cross-protocol confusion hardening absent)
**Surface:** `cert-tls-identity`
**Refs:** `crates/punktfunk-core/src/quic.rs:1335-1354`, `crates/punktfunk-core/src/quic.rs:1412-1448`
**Why it ranks here / impact:** Ranked #10: info — factually correct (no alpn_protocols set on either endpoint; the cert.pem identity is shared with GameStream TLS) but no reachable confusion attack. ALPACA-style attacks need two TLS services sharing a cert on the SAME transport; here GameStream is TLS-over-TCP and punktfunk/1 is TLS-in-QUIC (UDP) — not cross-reachable — and there is exactly one QUIC server so ALPN would make no authorization decision. Trust is already enforced by fingerprint pinning + app-layer Hello/Welcome magic. Cheap future-proofing only.
**Fix:** Set a fixed ALPN on both endpoints (e.g. rustls_cfg.alpn_protocols = vec![b"pkf1".to_vec()]) so a mismatched protocol is rejected during the TLS handshake — defense-in-depth against ever multiplexing protocols on the QUIC endpoint.
### ⚪ #11 [INFO] FEC reconstruct error on the receive path is stream-fatal — code-contract inconsistency (not an exploitable DoS)
**Surface:** `core-wire-deser`
**Refs:** `crates/punktfunk-core/src/packet.rs:411`, `crates/punktfunk-core/src/session.rs:283-289`, `clients/probe/src/main.rs:959`, `crates/punktfunk-host/src/spike.rs:251`
**Why it ranks here / impact:** Ranked #11: info, a correctly-identified contract inconsistency with NO demonstrable exploit. Reassembler::push propagates coder.reconstruct(...)? and both real receive-side callers treat any non-NoFrame error as fatal, inconsistent with the surrounding 'malformed = silent drop, never fatal' discipline. But every Err arm was traced unreachable from hostile input: header firewall + block-geometry pinning guarantee equal-length, correctly-counted shards; reconstruct is only called once received>=data_shards; Config::validate rejects odd/zero shard_payload before any decode; and MDS Reed-Solomon decodes any data_shards distinct shards. Reaching the reassembler also requires an AES-GCM-decryptable packet, so it is the connected host (not a port-sprayer), and it is client-side only; the privileged host never runs the reassembler on attacker bytes. Pure defense-in-depth hardening.
**Fix:** Make a FEC reconstruction failure a counted drop rather than stream-fatal: in Reassembler::push match coder.reconstruct(...) and on Err bump packets_dropped (or a fec_failed counter), discard the block, and return Ok(None). Reserve poll_frame's Err for genuinely fatal conditions (role misuse, transport teardown), matching the discipline documented at packet.rs:298-300.
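The "counted drop" policy described in the fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `packet.rs`: `FecError`, `Fatal`, the `fec_failed` counter name, and `finish_block` are hypothetical stand-ins, and the `reconstructed` argument plays the role of the value returned by `coder.reconstruct(...)`.

```rust
/// Hypothetical error returned by the FEC coder when reconstruction fails.
#[derive(Debug, PartialEq)]
pub struct FecError;

/// Hypothetical fatal conditions that are still allowed to end the stream.
#[derive(Debug, PartialEq)]
pub enum Fatal {
    RoleMisuse,
    TransportClosed,
}

/// Stand-in for the part of the reassembler state this sketch needs.
#[derive(Default)]
pub struct Reassembler {
    /// Blocks discarded because reconstruction failed (assumed counter name).
    pub fec_failed: u64,
}

impl Reassembler {
    /// Apply the counted-drop policy to one block's reconstruction result.
    /// A failed reconstruction bumps the counter and yields `Ok(None)`;
    /// `Err` is reserved for the genuinely fatal cases in `Fatal`.
    pub fn finish_block(
        &mut self,
        reconstructed: Result<Vec<u8>, FecError>,
    ) -> Result<Option<Vec<u8>>, Fatal> {
        match reconstructed {
            Ok(frame) => Ok(Some(frame)),
            Err(FecError) => {
                self.fec_failed += 1; // count it, discard the block
                Ok(None)
            }
        }
    }
}
```

With this shape a caller that treats every `Err` as stream-fatal keeps working unchanged, because a bad block no longer surfaces as an `Err` at all.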
### 🔵 #12 [LOW] Web console sets NODE_TLS_REJECT_UNAUTHORIZED=0 process-globally — latent footgun disabling all outbound TLS verification
**Surface:** `deps-config-exposure`
**Refs:** `web/.env.example:22-24`, `web/web.env.example:11-14`, `web/server/util/auth.ts:17-22`, `web/vite.config.ts:23`
**Why it ranks here / impact:** Ranked #12: low and not currently exploitable (attackerControlled false), included as a latent defense-in-depth defect. NODE_TLS_REJECT_UNAUTHORIZED=0 disables certificate validation for every outbound TLS connection the Node process makes, but the only current server-side outbound hop is the loopback proxy to https://127.0.0.1:47990 (CDN/art fetches are browser-side), and a loopback connection cannot be MITM'd — so impact is nil today. Real impact materializes silently if anyone later adds a server-side off-host HTTPS call (update check, webhook, metadata fetch) or points PUNKTFUNK_MGMT_URL off-loopback.
**Fix:** Do not disable TLS verification globally. Pin the host's self-signed cert for the single loopback fetch: pass an https.Agent with the host cert as `ca` (or rejectUnauthorized:false on that one Agent only) to the proxyRequest fetch in server/routes/api/[...].ts, and drop NODE_TLS_REJECT_UNAUTHORIZED from the deployment env.
## Cross-cutting themes
- Inconsistent secret file-permission hygiene: the secure 0600/ACL pattern exists (mgmt-token) but is applied selectively, leaving the master private key and trust stores at umask/default-ACL — the highest-impact local-privilege gap, acute on Windows %ProgramData% + LocalSystem.
- Pairing throttling is rate-based, not attempt-bounded: a single global 2s cooldown and a static non-rotating 4-digit PIN with no disarm-on-failure/lockout mean the documented 'one online guess' property is not actually implemented for the native ceremony.
- Overbroad authorization / collapsed trust tiers: 'paired to stream' equals 'paired to administer' (mgmt mTLS), and GameStream pairing inherits the plaintext-HTTP brute-forceable-PIN model — coarse trust boundaries where finer scopes are warranted.
- Pre-auth attack surface on the GameStream/RTSP listeners with missing input bounds and resource caps (unbounded packetSize, unbounded Content-Length, no timeouts/connection caps) — contained DoS, but on a privileged plaintext service.
- Stale concurrency assumptions and process-global mutable state (legacy GCM nonce direction, PUNKTFUNK_GAMESCOPE_APP env var) that were safe under a since-removed 'one session at a time' invariant and now cause cross-session confusion / crypto reuse.
- Strong, well-tested cryptographic and memory-safety core (bounded wire parsing, correct AEAD/SPAKE2/pinning, catch_unwind FFI, panic=unwind isolation) — the foundation is solid; the residual risk is in operational hardening and trust-tier granularity, not in unsafe/RCE.
## Prioritized remediation (do in this order)
1. Lock down secret files: write key.pem (and cert.pem) 0600 + create config_dir 0700 on Unix using the existing mgmt_token OpenOptions::mode pattern, and set an explicit SYSTEM+Administrators-only DACL on the punktfunk %ProgramData% subtree / key.pem / mgmt-token / *paired.json on Windows. Extend to client-key.pem; add a 0600 regression test.
2. Make the native PIN single-use and lockout-bounded: disarm or rotate the PIN on a failed SPAKE2 confirmation, add a per-window failed-attempt budget, give the CLI no-expiry arm path a default expiry, and disarm after a successful pair — this is what delivers the documented 'one online guess'.
3. Bound the RTSP video path: validate/clamp x-nv-video[0].packetSize (floor ~64, cap ~2048) in stream_config() and use checked/saturating arithmetic in VideoPacketizer::new so pps==0 / underflow can never occur; store(false) on the unwind path; add a {0,15,16,17} regression test.
4. Cap RTSP request parsing: enforce a Content-Length and total-header-size limit, add a read timeout, and bound concurrent connections so a pre-auth peer cannot slow-loris exhaust threads/memory.
5. Separate streaming trust from management trust: require the mgmt bearer token (not just a paired streaming cert) for state-changing and pairing-administration routes (arm/approve/unpair/session/library), or keep a distinct admin allow-list.
6. Fix the legacy GameStream GCM nonce reuse: HKDF-derive an independent host→client key from the rikey (direction label), or mirror the V2 'H' direction marker into the legacy IV so host rumble and client input never share (key,nonce).
7. Stop carrying the per-session gamescope launch command in a process-global env var: plumb it through the per-session VirtualDisplay::create/context and clear it when no launch is requested, eliminating cross-session stomping under concurrency.
8. Apply the cheap hardening nits: constant-time compare for the GameStream phase-4 hash (use ct_eq), set a fixed ALPN ('pkf1') on both QUIC endpoints, make FEC reconstruct failures a counted drop instead of stream-fatal, and replace the global NODE_TLS_REJECT_UNAUTHORIZED with a cert-pinned https.Agent scoped to the loopback mgmt fetch.
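As an illustration of item 3, the clamp plus checked arithmetic might look like the sketch below. The floor and cap come from the text above; `HEADER_BYTES = 16` is an assumption inferred from the `{0,15,16,17}` regression set, and all function names are hypothetical rather than the real `stream_config()` / `VideoPacketizer::new` code.

```rust
/// Bounds taken from the remediation text (floor ~64, cap ~2048).
pub const MIN_PACKET_SIZE: u32 = 64;
pub const MAX_PACKET_SIZE: u32 = 2048;
/// Assumed per-packet header overhead; the real value lives in the packetizer.
pub const HEADER_BYTES: u32 = 16;

/// Clamp a client-supplied packetSize into the supported range.
pub fn clamp_packet_size(requested: u32) -> u32 {
    requested.clamp(MIN_PACKET_SIZE, MAX_PACKET_SIZE)
}

/// Payload bytes per packet, or None when the size leaves no room for payload.
/// `checked_sub` rules out underflow; the filter rules out a zero payload.
pub fn payload_per_packet(packet_size: u32) -> Option<u32> {
    packet_size.checked_sub(HEADER_BYTES).filter(|&p| p > 0)
}

/// Packets needed to carry one frame; can never divide by zero.
pub fn packets_for_frame(frame_bytes: u32, packet_size: u32) -> Option<u32> {
    let payload = payload_per_packet(packet_size)?;
    Some(frame_bytes.div_ceil(payload))
}
```

Keeping the clamp and the checked subtraction as two separate layers means the packetizer stays safe even if a future caller forgets to clamp first.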
## Security controls done right (positives)
- Defense-in-depth wire parsing: every attacker-controllable FEC/reassembler header field is bounded against negotiated limits BEFORE any allocation keyed on it (packet.rs:328-343) — shard_bytes exact-match, data/total/block counts in range, indices in bounds, frame_bytes<=max — with no integer overflow in the size math and regression tests (rejects_oversized_shard_counts, rejects_inconsistent_block_geometry_without_panicking).
- Reassembler memory is bounded to a 16-frame reorder window with prune-on-push and completed-frame dedup (packet.rs:451-468), so a flood of distinct frame indices cannot grow memory unboundedly and late shards cannot resurrect emitted frames.
- AEAD gates the reassembler: on an encrypted session open_from_wire verifies the GCM tag (with seq as AAD) before any bytes reach push (session.rs:120-131), so an attacker cannot reach the reassembler without the session key; oversized-datagram truncation is always detectable (recv buffers MAX+1, len>MAX dropped).
- Native AES-GCM is correct and misuse-resistant: 96-bit nonce = 4-byte salt || 8-byte BE seq with seq also as AEAD AAD (tampering fails the tag rather than shifting the nonce), per-direction salt-bit separation gives disjoint nonce spaces under the shared key, a fresh CSPRNG 128-bit key per session, and Config::validate rejects encrypt=true with an all-zero key (crypto.rs, session.rs).
- Host-side data-plane datagram decoders (mic / RichInput / HidOutput / InputEvent / gamepad) are all length-checked, Option-returning and non-fatal — the privileged host drops anything malformed and keeps draining, never reassembles attacker video, and never panics on truncated/hostile input.
- punktfunk/1 trust establishment is sound: PinVerify rejects a fingerprint mismatch AND still performs real TLS 1.2/1.3 CertificateVerify signature checks (not stubbed), so an active MITM cannot replay the host's public cert to satisfy a pin without the private key (quic.rs:1547-1608) — the single most important thing to get right, done correctly.
- SPAKE2 PIN pairing for the native plane is built correctly: a balanced PAKE binding BOTH cert fingerprints as identities and into the key-confirmation transcript, a wrong PIN yields a different key (one online guess, no offline dictionary, no error oracle), MITM with different certs per leg reaches no shared key, and confirmation MACs use a constant-time ct_eq — all exercised by tests.
- Authorization is cleanly split from authentication on both planes: AcceptAnyClientCert verifiers accept any self-signed cert at the handshake but still verify the handshake signature, so the post-handshake fingerprint proves key possession, and authorization is then enforced against the paired allow-list (--require-pairing default fail-closed; certless peers rejected).
- GameStream post-pair endpoints (applist/launch/resume/cancel) are gated by peer_is_paired() requiring a pinned mutual-TLS client cert, fail-closed for certless/unknown/None peers, with a dedicated regression test (nvhttp.rs:46-55, 303-328).
- No command injection on the launch surface: client-supplied launch ids resolve against the host's OWN catalog (client can only pick an existing title), Steam appids are digits-only validated (with a `570; rm -rf ~` rejection test), custom commands come only from the operator mgmt store, and gamescope::spawn uses argv (Command::new + args), never /bin/sh -c.
- Client-controlled display dimensions are validated (encode::validate_dimensions: zero/odd/over-max rejected) on both the initial Hello and mid-stream Reconfigure before reaching encoders or the privileged SudoVDA ADD ioctl, which marshals a fixed #[repr(C)] struct and only selects driver-advertised modes.
- mgmt API is authenticated on every route except /health even on loopback (fails closed on a blank token), with constant-time SHA-256-digest token comparison and a CSPRNG token persisted 0600 on Unix (O_CREAT mode + follow-up set_permissions, never briefly world-readable).
- Attacker-controlled device names are sanitized before logging/storage/UI (C0/C1 controls, Unicode bidi/format overrides, BOM stripped, length-capped, fingerprint fallback), blocking log/console-injection and trusted-device spoofing in the approval UI.
- The pending-knock / delegated-approval queue is bounded (PENDING_CAP=32, LRU eviction) and time-bounded (10-min TTL), in-memory only, so a LAN scanner cannot grow it unboundedly or poison the persistent trust store.
- Both trust stores are persisted atomically (temp + rename) with in-memory rollback on a failed persist, so a crash or full disk mid-write cannot truncate the allow-list and silently lock out or un-gate paired clients.
- C-ABI boundary is hardened: config_from_ptr enforces the struct_size skew guard before dereferencing, every narrowing field is range-checked before truncation, every data-processing entry point is wrapped in catch_unwind returning Panic (no unwind across FFI), and null/zero-length handling is consistent and safe.
- No secret material is logged anywhere — PINs, GCM/rikey keys, nonces/salts, the mgmt token, and private keys never reach tracing/println; pairing-failure logs include only the sanitized device name + fingerprint.
- Crypto stack is current and free of disclosed-vuln versions (rustls 0.23.40, quinn 0.11.9, ring 0.17.14, aes-gcm 0.10.3, spake2 0.4.0), ring-only with no aws-lc C dep; rsa 0.9.10 (RUSTSEC-2023-0071 Marvin) is used ONLY for sign/verify, never decryption, so the vulnerable path is not exercised.
- The Windows service launches the host with a correctly-scoped duplicated SYSTEM token (only the session id retargeted), a fixed winsta0\default desktop, a command line built from current_exe + an operator-controlled host.env subcommand (never network input), and a kill-on-job-close job object so a crash never orphans the SYSTEM host.
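To make the nonce layout from the AES-GCM bullet concrete, here is a minimal sketch of a 96-bit nonce built as 4-byte salt || 8-byte big-endian sequence number with one salt bit reserved for direction. Which bit carries the direction is not stated in this report, so `DIRECTION_BIT` and the `Direction` enum are assumptions for illustration only.

```rust
/// Assumed position of the direction marker inside the first salt byte.
pub const DIRECTION_BIT: u8 = 0x80;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Direction {
    ClientToHost,
    HostToClient,
}

/// Build the 96-bit GCM nonce: 4-byte salt || 8-byte big-endian seq.
/// The direction bit is forced on or off so the two directions can never
/// produce the same nonce under the shared session key.
pub fn build_nonce(salt: [u8; 4], dir: Direction, seq: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&salt);
    match dir {
        Direction::HostToClient => nonce[0] |= DIRECTION_BIT,
        Direction::ClientToHost => nonce[0] &= !DIRECTION_BIT,
    }
    nonce[4..].copy_from_slice(&seq.to_be_bytes());
    nonce
}
```

The same sequence number sent in both directions then yields two different nonces, which is the property the legacy GameStream path in remediation item 6 lacks.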
## Refuted (investigated, NOT vulnerabilities)
- **[unsafe-ffi-cabi] Free/close FFI entry points run Drop without catch_unwind — a panic in teardown is UB across the C boundary** — Verified all cited code. abi.rs:272-276 (punktfunk_session_free) and abi.rs:1627-1631 (punktfunk_connection_close) are verbatim as described — both call drop(Box::from_raw(..)) outside guard()/catch_unwind, while the guard helper (abi.rs:168-170) wraps catch_unwind. The doc at abi.rs:11 ("every entry point is wrapped in catch_unwind") is literally inaccurate for these two — that doc-vs-code discrepancy is REAL.
But the finding's security claim — "the unwind would cross the extern \"C\" frame, which is undefined behavior" — is REFUTED by the build toolchain. This crate is edition 2021 built with rustc 1.96.0. Since Rust 1.81 (Sept 2024), an unwind that reaches a non-`-unwind` `extern "C"` boundary is a defined, safe process abort, not UB. None of these functions use `extern "C-unwind"`. So the worst possible outcome is a clean abort, which is also exactly what catch_unwind→PunktfunkStatus::Panic would avoid only by returning a status — but free/close return void, so there is no status to return anyway.
Moreover the precondition (a panicking Drop) does not exist in the code. NativeClient::drop discards the worker join result with `let _ = w.join()`, so a worker-thread panic/poison cannot re-enter Drop. Config::drop only zeroizes. Session has no custom Drop. The transport Drop closes a socket. There is no .unwrap(), no Mutex::lock, and no result-propagating thread join on any teardown path — the finding's speculated panic sources are not present.
The finding is correctly self-rated as informational and explicitly non-attacker-controlled / non-pre-auth. The accurate residual is a documentation inconsistency (the module doc overstates the catch_unwind invariant), not a security weakness. The substantive recommendation (keep Drop impls panic-free) is already satisfied. Net: refuted as a vulnerability; severity info, and even as a code-quality nit the UB framing is incorrect for the current compiler. Worth at most a one-line doc fix to say free/close intentionally abort-on-panic.
- **[unsafe-ffi-cabi] C-ABI pointer/length contracts (32-byte fingerprint buffers, caller buffers) are trusted, not validated — standard FFI, embedder-only** — All three cited spans are present and behave exactly as described. abi.rs:877 `from_raw_parts(pin_sha256, 32)` (after null-check at 873), abi.rs:905 `from_raw_parts_mut(observed_sha256_out, 32)` (after null-check at 903), abi.rs:1003 `from_raw_parts_mut(host_sha256_out, 32)` (after null-check at 990). The submit_frame/send_mic claim is also accurate: host_submit_frame (283-302, with an extra null+len guard) and send_mic (1233-1253) build slices from caller (data,len). These are C-ABI entry points whose pointer/length arguments come from the trusted embedding app (PunktfunkKit/GTK/WinUI), not from wire bytes. None of the threat-model adversaries can influence them: the malicious network client (pre- or post-auth) controls protocol bytes that flow *through* the embedder, not the embedder's own FFI call arguments; a MITM is irrelevant; a local unprivileged user cannot call into another process's loaded library. Each function carries a documented Safety contract and null-checks its pointers, and the hard-coded 32 matches the fingerprint type, so even a buggy-but-honest caller passing a correctly-sized buffer is safe. This is idiomatic, sound FFI — not a vulnerability. The finding's own posture (info, attackerControlled=false, preAuth=false, no fix required, listed for the record) is correct and well-calibrated. In vuln terms this is refuted (not a vuln / not reachable by any in-scope adversary), consistent with the info classification.
+70
@@ -0,0 +1,70 @@
# Session-aware host — known limitations & follow-ups
Status: 2026-06-14. The host auto-detects the live session (Gaming / KDE / GNOME / wlroots) **per
connect** and routes both video and input at it — managed gamescope at the client's resolution in
Steam Gaming Mode, a KWin/Mutter virtual output at the client's resolution on a Desktop. A watcher
(opt-in: `PUNKTFUNK_SESSION_WATCH=1`) follows a Gaming↔Desktop switch **mid-stream** and rebuilds the
backend in place without a reconnect.
Live-validated on the Bazzite F44 box (`bazzite-deck-nvidia:testing`, RTX 4090): Desktop KDE at
5120×1440 + input; Gaming managed at 5120×1440; warm-session reuse on quick reconnect; Feature B
video-switch both directions.
## Resolved (2026-06-15, `3363576`)
- **#2 — mid-stream-switch input** ✅ `vdisplay::settle_desktop_portal()` pushes the live session env
into the systemd/D-Bus activation environment and restarts the KWin portal on a switch, so input
lands without a reconnect. Validated live: `settled desktop portal env … compositor=kwin` →
`libei: portal granted devices` → `device RESUMED` on a Gaming→Desktop mid-stream switch.
- **#3 — KWin/Mutter virtual output primary** ✅ `apply_session_env` defaults
`PUNKTFUNK_KWIN_VIRTUAL_PRIMARY` / `PUNKTFUNK_MUTTER_VIRTUAL_PRIMARY` on for the auto desktop path.
Validated live: `KWin: streamed output set as the sole desktop also_disabled=["HDMI-A-1"]` — panels
now render on the streamed screen.
## Still parked
### 1. F44 gamescope teardown corrupts the GPU context
Every gamescope teardown on this box (stop the autologin on connect; stop the managed session on
restore) risks leaking the NVIDIA GPU context — surfaces as `CUDA_ERROR_ILLEGAL_STATE` (401) in
`cuCtxCreate` / `vkCreateDevice` → `VK_ERROR_INITIALIZATION_FAILED` (-3), then a black screen that
**needs a reboot**. The 5 s debounced restore + the desktop restore-guard cut the teardown *count*
but don't eliminate it. Options, in order of preference:
- **SIGKILL the gamescope on teardown** instead of `systemctl stop` (SIGTERM). Hypothesis: skipping
gamescope's buggy SIGTERM teardown handler (the part that SIGSEGVs, exit 139) lets the process die
hard and the driver reclaim its GPU resources cleanly via normal process exit — no half-torn-down
context. Change `stop_autologin_sessions` + `stop_session` (`vdisplay/gamescope.rs`) to
`systemctl --user kill --signal=SIGKILL <unit>` (+ a follow-up `stop`/`reset-failed` to clear unit
state). **Untested** — this is the first thing to try; it would preserve "managed client-res
gaming AND TV-shows-gaming-when-idle".
- **Keep the managed session warm** (no per-disconnect restore): spawn once, reuse forever, never
tear down → ~1 teardown per host lifetime. Tradeoff: the TV is blank/idle when no client is
connected (the autologin is never restored; return to gaming manually).
- Upstream gamescope/driver fix.
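The first option (SIGKILL instead of SIGTERM) could be prototyped with something like the sketch below. It only builds the two `systemctl` invocations named above; the function name and the unit name in the usage note are hypothetical, and the sketch is as untested against the GPU-context leak as the hypothesis itself.

```rust
use std::process::Command;

/// Build the systemctl invocations for a hard teardown of one user unit:
/// first SIGKILL the unit's processes, then clear any failed unit state.
pub fn hard_teardown_commands(unit: &str) -> Vec<Command> {
    let mut kill = Command::new("systemctl");
    kill.args(["--user", "kill", "--signal=SIGKILL", unit]);

    let mut reset = Command::new("systemctl");
    reset.args(["--user", "reset-failed", unit]);

    vec![kill, reset]
}
```

A caller would run them in order, for example `for mut c in hard_teardown_commands("gamescope-session.service") { let _ = c.status(); }`, ignoring a non-zero exit from `reset-failed` when the unit was not in a failed state.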
(#2 mid-stream-switch input and #3 virtual-output-primary are **resolved** — see the Resolved section above.)
## Lower priority / polish
### 4. Mid-stream-switch input loss window (~6 s)
During the libei portal setup on a switch, buffered input drops (`libei: DROP — no resumed device`,
hundreds of events). Polish: pre-warm the portal, or hold events instead of dropping during the
device-resume window.
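The "hold events instead of dropping" idea could look roughly like this bounded queue. It is a sketch only: `HeldInput`, its capacity, and the evict-oldest policy are assumptions, and replaying key-down events that are several seconds stale may itself be undesirable, so a real version would likely filter what it replays.

```rust
use std::collections::VecDeque;

/// Bounded holding queue for input that arrives while no device is resumed.
pub struct HeldInput<E> {
    queue: VecDeque<E>,
    cap: usize,
    /// Events evicted because the queue was full.
    pub dropped: u64,
}

impl<E> HeldInput<E> {
    pub fn new(cap: usize) -> Self {
        Self { queue: VecDeque::with_capacity(cap), cap, dropped: 0 }
    }

    /// Hold one event; when full, evict the oldest so memory stays bounded.
    pub fn hold(&mut self, ev: E) {
        if self.cap == 0 {
            self.dropped += 1;
            return;
        }
        if self.queue.len() == self.cap {
            self.queue.pop_front();
            self.dropped += 1;
        }
        self.queue.push_back(ev);
    }

    /// Drain everything in arrival order once the device has resumed.
    pub fn flush(&mut self) -> Vec<E> {
        self.queue.drain(..).collect()
    }
}
```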
### 5. NVENC `InitializeEncoder failed: invalid param` (recovered)
At 5120×1440@240 the first NVENC open fails with `invalid param (8)` and **recovers** via the 2-way
split-encode path (the stream is live). Cosmetic but noisy — investigate the first-attempt failure /
silence the log.
### 6. NVENC HEVC bitrate cap (~800 Mbps on the RTX 4090)
HEVC opens at the GPU's max (~800 Mbps) when a higher rate is requested (e.g. 1600). Not a bug;
consider preferring AV1 when the client requests >~800 Mbps HEVC, and surface the cap in the
speed-test / bitrate UI.
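One way to express that preference is the small selection function below. It is a sketch under assumptions: the 800 Mbps figure is the approximate cap quoted above, `choose_codec` is a hypothetical name, and it assumes (without evidence from this note) that AV1 can actually take the higher requested rate.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Codec {
    Hevc,
    Av1,
}

/// Approximate HEVC ceiling observed on the RTX 4090, per the note above.
pub const HEVC_CAP_MBPS: u32 = 800;

/// Pick a codec and the effective bitrate to surface in the UI.
/// Above the HEVC cap, prefer AV1 when the client supports it;
/// otherwise stay on HEVC and report the clamped rate.
pub fn choose_codec(requested_mbps: u32, client_supports_av1: bool) -> (Codec, u32) {
    if requested_mbps > HEVC_CAP_MBPS && client_supports_av1 {
        (Codec::Av1, requested_mbps)
    } else {
        (Codec::Hevc, requested_mbps.min(HEVC_CAP_MBPS))
    }
}
```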
### 7. Restore-guard / keep-warm model interaction
`do_restore_tv_session`, when a desktop is active, still stops the idle managed gamescope (a teardown
— leak risk per #1) and consumes `STOPPED_AUTOLOGIN` (so a later return-to-gaming won't auto-restore
the TV session). Resolve together with the keep-warm decision in #1.
### 8. Feature B is opt-in
The mid-stream watcher is gated behind `PUNKTFUNK_SESSION_WATCH=1` pending broader validation. Promote
to default-on now that #2 (mid-stream input) has landed, once it's exercised on more boxes.
+170
@@ -0,0 +1,170 @@
# Windows native client — bootstrap handoff
A handoff for an agent picking up the **native Windows punktfunk/1 client**. The host side is done
and live-validated on a real RTX 4090; the client is the remaining piece. This doc is the concrete
starting point: the locked decisions, the reference code to port, the stack swaps, the dev loop, and
the gotchas. Read it top to bottom, then start at **Phase 1** (de-risk Reactor first).
## Status — WinUI 3 client landed (2026-06-15)
The client is implemented in `clients/windows` (binary `punktfunk-client`) and is
**build + clippy + fmt green on `x86_64-pc-windows-msvc` and `aarch64-pc-windows-msvc`** (the ARM64
target cross-compiled off the one x64 runner — see `windows.yml`; signed MSIX for both arches via
`windows-msix.yml`). It is the **WinUI 3** client this doc planned: native chrome (host list,
settings, in-app SPAKE2 PIN pairing) + the video on a **`SwapChainPanel`**, all in pure Rust.
- **Reactor is viable after all — it is what we use.** The locked decision held. windows-rs
[PR #4499](https://github.com/microsoft/windows-rs/pull/4499) (merged 2026-06-01) added a
`SwapChainPanel` widget to **`windows-reactor`** with `set_swap_chain` over
`CreateSwapChainForComposition` — so a DXGI presenter *can* be hosted. (An earlier read that Reactor
had no swapchain hatch was wrong/stale.) The UI is a declarative React-like tree
(`App::new().render(app)`, `use_state`/`use_resource`/`use_effect` hooks, `list_view`/`text_box`/
`combo_box`/`content_dialog`/`button`/`ToggleSwitch`); the video page is `swap_chain_panel()
.on_ready(|p| p.set_swap_chain(&sc))` driven by `on_rendering`. **`present.rs`** owns the D3D11
composition swapchain (WARP fallback, runtime shaders, Contain-fit) — the same renderer, bound to
the panel instead of an HWND.
- **windows-reactor is unpublished** (`version 0.0.0`) and fast-moving — depend on it as a **git dep
pinned to a commit** (`b4129fcc`), and pin the `windows` crate to the **same commit** so the
`IDXGISwapChain1` you pass to `set_swap_chain` satisfies reactor's `windows_core::Interface`. Its
`build.rs` downloads the Windows App SDK NuGets (Foundation/Interactive/Runtime) and stages the
bootstrap DLL + `resources.pri` next to the exe; it **`.unwrap()`s `CARGO_WORKSPACE_DIR`**, so set
it in the build env (`CARGO_WORKSPACE_DIR=C:\Users\Public\punktfunk`). It writes `/temp` + `/winmd`
to the workspace root (gitignored). The App SDK runtime must be installed to *run*.
- **Stream input is Win32 low-level hooks**, not XAML: reactor exposes only keyboard *accelerators* +
pointer *button-state* (no raw key-down/up, no pointer position, no wheel), insufficient for a game
stream. `input.rs` installs `WH_KEYBOARD_LL`/`WH_MOUSE_LL` on the stream page (uninstalled on exit),
maps the pointer through the window client rect, sends native VK + abs mouse + wheel, with a
Ctrl+Alt+Shift+Q capture toggle. (A future alternative: generate `Microsoft.UI.Xaml.UIElement`
bindings from the staged winmd and subscribe to `KeyDown`/`PointerMoved` — scoped to the panel.)
- **Build gotcha:** `CARGO_HOME` must be on an **ASCII path** (`C:\Users\Public\.cargo`). SDL3's
`build-from-source` PCH embeds the registry source path; the `ü` in the dev box's username makes
MSVC fail (`MSB8084` / `C4828`).
- **Still pending:** **on-glass validation** — the dev VM is headless / SSH Session 0, so the WinUI
window can't show there; validate over RDP or on the RTX box. Then **D3D11VA hardware decode** +
**10-bit/HDR present**, RAWINPUT relative-mouse pointer-lock, and a per-host speed test in the UI.
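Mapping the pointer through the window client rect interacts with the Contain-fit presenter: when the video is letterboxed, the bars should not produce absolute mouse positions. The sketch below shows one possible mapping to normalized video coordinates. It is an assumption about how `input.rs` could do this, not a description of what it does, and `map_pointer` is a hypothetical name.

```rust
/// Map a pointer position (window client coordinates) to normalized [0, 1]
/// video coordinates under Contain-fit. Returns None when the pointer is in
/// the letterbox bars or when any dimension is not positive.
pub fn map_pointer(
    px: f64,
    py: f64,
    win_w: f64,
    win_h: f64,
    vid_w: f64,
    vid_h: f64,
) -> Option<(f64, f64)> {
    if win_w <= 0.0 || win_h <= 0.0 || vid_w <= 0.0 || vid_h <= 0.0 {
        return None;
    }
    // Contain-fit: scale uniformly so the whole video fits inside the window.
    let scale = (win_w / vid_w).min(win_h / vid_h);
    let (drawn_w, drawn_h) = (vid_w * scale, vid_h * scale);
    // The drawn video is centered; the remainder is split into equal bars.
    let (off_x, off_y) = ((win_w - drawn_w) / 2.0, (win_h - drawn_h) / 2.0);
    let (x, y) = ((px - off_x) / drawn_w, (py - off_y) / drawn_h);
    if (0.0..=1.0).contains(&x) && (0.0..=1.0).contains(&y) {
        Some((x, y))
    } else {
        None
    }
}
```

The normalized pair can then be scaled to whatever absolute range the input message expects.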
## What we're building
A native Windows client that connects to a punktfunk/1 host (`serve` / `punktfunk1-host`), decodes
HEVC, presents it low-latency, plays Opus audio, and captures local mouse/keyboard/gamepad to send
back — i.e. the Windows analogue of the **GTK4 Linux client** (`clients/linux`),
which is the architectural template. The Windows client is close to a 1:1 port of the Linux client
with the platform layers swapped.
## Locked decisions (from the Windows-host/client plan, `docs/windows-host.md` + project memory)
- **Pure Rust.** `windows-rs` + **Windows App SDK "Reactor"** (WinUI 3 from Rust, merged windows-rs
PR #4479). No C++/C#. De-risk Reactor + `SwapChainPanel` FIRST — it's the only novel/uncertain
piece; everything else is a known-good port.
- **Links `punktfunk-core` directly** (Cargo path dep, `features = ["quic"]`) — **no C ABI**, exactly
like the GTK client. `NativeClient` is already `Sync` (mutexed plane receivers), so it drops into a
UI app cleanly. The C ABI (`punktfunk_connect` + `next_au`/`next_audio`/`next_rumble`/`next_hidout`/
`send_input`/`send_rich_input`) is the *Apple* path; the native Rust clients call
`crates/punktfunk-core/src/client.rs` (`NativeClient`) methods directly.
- **Video widget = WinUI 3 `SwapChainPanel`** (built-in), fed a D3D11 swapchain via
`ISwapChainPanelNative::SetSwapChain`.
- **Decode = FFmpeg-next + D3D11VA** (HEVC; **Main10** for 10-bit/HDR — see below).
- **Audio playback = WASAPI render** + Opus decode (`opus` crate, vendors libopus via cmake; set
`CMAKE_POLICY_VERSION_MINIMUM=3.5`).
- **Input capture→send**: the client captures LOCAL input and sends it. Mouse (abs + relative) +
keyboard via the **inverse VK table** (port `keymap.rs`); gamepad via **SDL3** (already a workspace
dep, cross-platform) → `NativeClient::send_input`/`send_rich_input`. (`SendInput`/`ViGEm` are
HOST-side injection — not used by the client.)
- **Discovery = `mdns-sd`** (cross-platform, browses `_punktfunk._udp`).
- **Trust = shared client identity + SPAKE2 PIN pairing + TOFU** (port `trust.rs`; same identity
files/logic as the other native clients).
## The reference: `clients/linux/src/`
Port these files (near 1:1; only the platform layers change):
| Linux file | Role | Windows swap |
|---|---|---|
| `main.rs` / `app.rs` | app shell, lifecycle | WinUI 3 `App`/`Window` via Reactor |
| `ui_hosts.rs` | host list / connect screen | WinUI 3 page |
| `ui_settings.rs` | settings | WinUI 3 page |
| `ui_stream.rs` | the streaming view | WinUI 3 page hosting `SwapChainPanel` |
| `video.rs` | FFmpeg decode + present | FFmpeg **D3D11VA** → D3D11 swapchain in `SwapChainPanel` |
| `audio.rs` | Opus decode + playback | **WASAPI render** (was PipeWire) |
| `session.rs` | `NativeClient` connect + plane pumps | **reuse almost verbatim** (core is cross-platform) |
| `trust.rs` | identity, PIN, TOFU | **reuse almost verbatim** |
| `discovery.rs` | mDNS browse | **reuse verbatim** (`mdns-sd`) |
| `keymap.rs` | inverse VK table | reuse; Windows VK is the native source so this is *simpler* |
| `gamepad.rs` | SDL3 pad capture + rumble/feedback | **reuse almost verbatim** (SDL3 is cross-platform) |
`session.rs`, `trust.rs`, `discovery.rs`, `keymap.rs`, `gamepad.rs` are mostly platform-neutral
(they touch `punktfunk-core` + SDL3 + mdns, all cross-platform) — expect to reuse them with minimal
changes. The real work is `video.rs` (D3D11VA + swapchain), `audio.rs` (WASAPI), and the WinUI shell.
## 10-bit + HDR (NEW — landed this session, the client MUST handle it)
The host now negotiates and emits **HEVC Main10 + BT.2020 PQ HDR10** when the captured desktop is
HDR (and 10-bit SDR Main10 when negotiated). The Apple client already does the matching present; the
Windows client should mirror it:
- **Advertise caps** in the `Hello`: `video_caps = VIDEO_CAP_10BIT | VIDEO_CAP_HDR`
(`crates/punktfunk-core/src/quic.rs`). The host enables 10-bit only if the client advertised it.
(The native-client connector in `client.rs` currently hardcodes `video_caps: 0` with a TODO —
thread the real caps through when you wire decode; or detect HDR purely in-band, see next.)
- **Detect HDR in-band** from the HEVC VUI (transfer characteristics = SMPTE ST 2084 / PQ), exactly
like the Apple client's `VideoDecoder.isHDRFormat` (`clients/apple/Sources/PunktfunkKit/`). This
handles a mid-session HDR toggle without renegotiation. `Welcome.bit_depth` (8/10) is also available.
- **Decode** Main10 → **P010** (10-bit) via D3D11VA.
- **Present HDR**: swapchain in `DXGI_FORMAT_R10G10B10A2_UNORM` (or `R16G16B16A16_FLOAT`),
`IDXGISwapChain3::SetColorSpace1(DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020)` +
`SetHDRMetaData` for HDR10; the host's stream is BT.2020 PQ, so present PQ. For SDR, the existing
`DXGI_FORMAT_B8G8R8A8_UNORM` + BT.709 path. (The host-side HDR conversion math is in
`crates/punktfunk-host/src/capture/dxgi.rs` `HDR_PS`/`HdrConverter` if you need the inverse.)
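The caps advertisement and the in-band HDR check can be sketched together as below. The bit values of `VIDEO_CAP_10BIT` / `VIDEO_CAP_HDR` are placeholders (the real constants live in `quic.rs`), and `advertised_caps`, `present_mode`, and `PresentMode` are hypothetical names. The one fixed fact is that H.265 VUI `transfer_characteristics` value 16 denotes SMPTE ST 2084 (PQ).

```rust
/// Placeholder bit values; use the real constants from punktfunk-core.
pub const VIDEO_CAP_10BIT: u32 = 1 << 0;
pub const VIDEO_CAP_HDR: u32 = 1 << 1;

/// H.265 VUI transfer_characteristics code for SMPTE ST 2084 (PQ).
pub const TRANSFER_SMPTE_ST2084: u8 = 16;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PresentMode {
    /// BGRA8 + BT.709 swapchain path.
    Sdr,
    /// 10-bit (or FP16) swapchain with the BT.2020 PQ color space.
    Hdr10Pq,
}

/// Caps to put in the Hello. HDR is only advertised when 10-bit decode works,
/// since the HDR stream described above is Main10.
pub fn advertised_caps(can_decode_10bit: bool, can_present_hdr: bool) -> u32 {
    let mut caps = 0;
    if can_decode_10bit {
        caps |= VIDEO_CAP_10BIT;
        if can_present_hdr {
            caps |= VIDEO_CAP_HDR;
        }
    }
    caps
}

/// Decide the present path from the VUI value seen in the stream, so a
/// mid-session HDR toggle is picked up without renegotiation.
pub fn present_mode(transfer_characteristics: u8) -> PresentMode {
    if transfer_characteristics == TRANSFER_SMPTE_ST2084 {
        PresentMode::Hdr10Pq
    } else {
        PresentMode::Sdr
    }
}
```

Re-evaluating `present_mode` whenever new parameter sets arrive is what lets the swapchain color space follow the host's HDR state.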
## Dev boxes
- **No-GPU dev box (UI + connect + software decode):** `ssh "Enrico Bühler"@192.168.1.57` — Win11 Pro
25H2 (build 26200), QEMU Q35, 8 vCPU/12 GB, **no working GPU** (so no NVENC, no D3D11VA hardware
decode — use FFmpeg software decode here; this box is for UI/connect/protocol work). Has Rust 1.96
MSVC, VS 2026 + VC tools + Win SDK, Win App Runtime 2.2, SudoVDA + Parsec VDD.
- **Real-GPU box (HDR / hardware decode / end-to-end):** `ssh "Enrico Bühler"@192.168.1.174` — Win11,
RTX 4090, runs the host. Use it to test the client against a live HDR host.
### Dev-loop gotchas (both boxes)
- **Build under an ASCII path** (`C:\Users\Public\…`). The username "Enrico Bühler" has a `ü` → MSVC
`LNK1201` PDB-write failure under `~/Developer`.
- **Toolchain gaps:** `winget install NASM.NASM Kitware.CMake LLVM.LLVM` (aws-lc-rs on the quic path,
ffmpeg-sys needs libclang).
- **`CMAKE_POLICY_VERSION_MINIMUM=3.5`** in the build env (CMake 4 rejects libopus's old minimum).
- **File transfer = `sftp`** (scp is broken under the PowerShell DefaultShell):
`printf 'put %s /C:/Users/Public/REL/PATH\n' LOCAL | sftp -b - "Enrico Bühler@192.168.1.57"`
Note the **leading slash** `/C:/…`. Let the VM regenerate its own `Cargo.lock` (don't transfer it).
- **Windows clippy is stricter** than Linux CI and `cfg(windows)` code is excluded from Linux CI →
run `cargo clippy -p punktfunk-client-windows -- -D warnings` ON THE VM before committing.
- Work on `main`; fetch+merge `origin/main` before pushing.
## Suggested phased plan
1. **De-risk Reactor (do this first).** A windows-rs Reactor (WinUI 3) hello-world that hosts a
`SwapChainPanel` and presents a cleared D3D11 swapchain into it. Confirm the windows-rs Reactor
version/API (PR #4479) and `ISwapChainPanelNative::SetSwapChain` interop. If Reactor proves too
raw, the fallback is `winit` + a child HWND swapchain, but try Reactor first per the decision.
2. **Crate scaffold.** `clients/windows`, `[target.'cfg(windows)'.dependencies]`:
`punktfunk-core { path, features=["quic"] }`, `windows`, the Reactor crate, `ffmpeg-next`, `opus`,
`sdl3`, `mdns-sd`, `anyhow`, `tracing`. Mirror `clients/linux/Cargo.toml`.
3. **Connect + control plane.** Port `session.rs` + `trust.rs`; validate headless against the 4090
box (`punktfunk1-host`/`serve`) — handshake, PIN/TOFU, plane counters — before any UI/decode.
4. **Decode + present.** FFmpeg D3D11VA → `SwapChainPanel`. SDR (8-bit BGRA) first, then **P010 +
HDR colorspace** (see the HDR section).
5. **Audio.** WASAPI render + Opus decode (port `audio.rs`).
6. **Input.** Mouse + keyboard capture→send (port `keymap.rs`), gamepad via SDL3 (port `gamepad.rs`),
feedback from `next_rumble`/`next_hidout`.
7. **Discovery + UI.** Port `discovery.rs` + `ui_hosts.rs` + `ui_settings.rs` to WinUI pages.
## Key references
- **Template:** `clients/linux/src/*` (the client to port).
- **Apple HDR present** (the pattern to mirror): `clients/apple/Sources/PunktfunkKit/{VideoDecoder,
MetalVideoPresenter,Stage2Pipeline}.swift` — in-band PQ detection, P010 decode, EDR present.
- **Core client API:** `crates/punktfunk-core/src/client.rs` (`NativeClient`).
- **Protocol:** `crates/punktfunk-core/src/quic.rs` (`Hello.video_caps`, `Welcome.bit_depth`,
`VIDEO_CAP_10BIT`/`VIDEO_CAP_HDR`).
- **Full Windows plan + SudoVDA/host details:** `docs/windows-host.md`.
- **Host HDR conversion (for the inverse math):** `crates/punktfunk-host/src/capture/dxgi.rs`
(`HDR_PS`, `HdrConverter`) + `crates/punktfunk-host/src/encode/nvenc.rs` (BT.2020/PQ VUI).
+241
@@ -0,0 +1,241 @@
# Windows virtual DualSense — game detection handoff
Goal: get the host's virtual DualSense **detected and usable in games** (Cyberpunk's native PS5 path +
others) on the Windows host. This doc is the portable handoff (the investigation lives here, not in any
one agent's memory). Run the experiments **on the Windows host** (`.173`, repo at
`C:\Users\Public\punktfunk-native`).
## Status (2026-06-22)
- **Input works.** Client → host → virtual DualSense → games read input. Verified in Steam's controller
test (buttons/sticks).
- **The HID is a CORRECT, COMPLETE DualSense.** An SDL3 probe reports our live device as
`name='DualSense Wireless Controller' vid=0x054C pid=0x0CE6 isGamepad=True gamepadType=PS5`. SDL =
HIDAPI = what Steam (and many games) build on → that's why Steam works. So the report descriptor,
feature reports, and identity are right; this is **not** a descriptor/feature-report problem.
- **Cyberpunk's native DualSense path does NOT detect it at all.** (Steam Input was off — Cyberpunk was
reading the raw HID.)
- **Rumble:** host-side is proven working (driver captures the game's `0x02`, `parse_ds_output` extracts
the motors, host forwards `0xCA` — log: `rumble: forwarding to client (0xCA) low=16128 high=16128`).
The break is the **client** (macOS) not rendering `0xCA` onto the physical pad. Separate task/agent.
## Root cause — CONFIRMED (2026-06-22, run live on the interactive desktop, console session 3)
The break is the device's **PnP identity / device-interface path**, not the HID descriptor or feature
reports. `hidclass` derives the HID child's path token and its `HID\VID_054C&PID_0CE6` hardware-ids from the
**parent bus device's hardware-id**. Our parent is the software (SWD) devnode `SWD\PUNKTFUNK\PF_PAD_0` whose
hardware-id is `pf_dualsense` (no VID/PID), so hidclass emits only the *VendorID+usage* fallback and **no
PID**. Measured on this box (one virtual pad live + one real 8BitDo present):
HID-child hardware-ids (`DEVPKEY_Device_HardwareIds`, CompatibleIds empty):
`HID\pf_dualsense` · `HID\VID_054C&UP:0001_U:0005` · `HID_DEVICE_SYSTEM_GAME` · `HID_DEVICE_UP:0001_U:0005`
· `HID_DEVICE`. **Note the absent `HID\VID_054C&PID_0CE6`.** `HIDD_ATTRIBUTES` itself is correct (VID 054C
/ PID 0CE6), which is why attribute-readers work.
Device-interface paths (from `HKLM\SYSTEM\CurrentControlSet\Control\DeviceClasses\{4d1e55b2-…}`):
| Device | HID interface path |
| --- | --- |
| **Ours (virtual)** | `\\?\HID#punktfunk#1&ca418da&0&0000#{…}` (**no `VID_/PID_` token**) |
| Real DualShock 4 (USB, registry remnant) | `\\?\HID#VID_054C&PID_05C4&REV_0100#…` |
| Real DualSense (BT, registry remnant) | `\\?\HID#{00001124-…}_VID&0002054c_PID&0ce6#…` |
**Cross-API enumeration (the decisive experiment — impossible over SSH, run live in the console session):**
| API | Sees our virtual DS5? | Identity reported | Reads from |
| --- | --- | --- | --- |
| SDL3 / HIDAPI | ✅ | 054C:0CE6, type=PS5 | `HIDD_ATTRIBUTES` → Steam works |
| RawInput | ✅ | 054C:0CE6 | `HIDD_ATTRIBUTES` |
| WGI `RawGameController` | ✅ | 054C:0CE6 | `HIDD_ATTRIBUTES` |
| WGI `Gamepad` | ❌ empty | — | (empty for *all* pads on this box — no Xbox-profile pad; not DS-specific) |
| **MS GameInput** | ✅ enumerates it | **vid=0x0000 pid=0x0000** | **PnP path / hardware-ids** |
| Cyberpunk native PS5 | ❌ | — | needs the DS5 VID/PID identity |
The GameInput result is the clincher: it **does** enumerate our pad — descriptor fingerprint matches exactly
(15 buttons, 6 axes, 1 hat, usage Game Pad 0x05) — but reports **vid/pid = 0**, while it reads the real
8BitDo's `vid=0x3434` correctly. So GameInput (and, by the same logic, a native PS5 path) takes VID/PID from
the **PnP device path / hardware-ids, NOT from `HIDD_ATTRIBUTES`**, and ours carry no `VID_054C&PID_0CE6`.
Everything that reads attributes directly (SDL / RawInput / WGI-raw) is fine; everything that keys off the
device *identity/path* (GameInput, native DualSense detection) sees a generic, unidentified gamepad → no
PS5 path.
**⇒ The fix must put `VID_054C&PID_0CE6` into the device-interface path and the `HID\VID&PID` hardware-ids**
(give the device a real-USB-like PnP identity), not merely correct `HIDD_ATTRIBUTES`. See "Fix options".
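To make the path-versus-attributes distinction concrete, here is a minimal Rust sketch of a reader that takes VID/PID from a `VID_xxxx&PID_xxxx` token in a device path. It is an illustration only: GameInput's real parsing is not known, and the function name `vid_pid_from_path` is hypothetical. It handles only the USB-style token, not the Bluetooth `_VID&…_PID&…` form shown in the table.

```rust
/// Extract `(vid, pid)` from a USB-style `VID_xxxx&PID_xxxx` token in a device path.
/// Returns `None` when the path carries no such token (as with `HID#punktfunk#…`).
/// Hypothetical helper for illustration; not GameInput's actual logic.
fn vid_pid_from_path(path: &str) -> Option<(u16, u16)> {
    let upper = path.to_ascii_uppercase();

    // Locate the vendor token and parse the four hex digits that follow it.
    let vid_at = upper.find("VID_")?;
    let vid = u16::from_str_radix(upper.get(vid_at + 4..vid_at + 8)?, 16).ok()?;

    // The product token must appear after the vendor token.
    let rest = &upper[vid_at + 8..];
    let pid_at = rest.find("PID_")?;
    let pid = u16::from_str_radix(rest.get(pid_at + 4..pid_at + 8)?, 16).ok()?;

    Some((vid, pid))
}
```

Fed the virtual pad's interface path, such a reader finds nothing to parse, which is consistent with a consumer falling back to `vid=0 pid=0` even though `HIDD_ATTRIBUTES` is correct.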
**Secondary driver gaps found (not the detection blocker, but fix while here):**
- `IOCTL_HID_GET_STRING` (id 4, ioctl `0x000b0013`) returns `STATUS_NOT_IMPLEMENTED` — a game polls it
repeatedly (seen live in `pfds-driver.log`). Implement manufacturer / product / serial strings
(`"DualSense Wireless Controller"`, a serial). Native PS5 code can read the serial to tell USB from BT.
- `DS_FEATURE_CALIBRATION` is **42** bytes but the report descriptor declares feature `0x05` as **41**
(`0x95 0x28` = 40 data + 1 id). Trim to 41 (motion-only; SDL accepts it regardless).
## Fix — implemented & validated at the identity layer (2026-06-22)
`create_swdevice` (`inject/dualsense_windows.rs`) now sets, via **`SW_DEVICE_CREATE_INFO` struct fields**
(NOT `pProperties` — empirically a `DEVPROPERTY` write of these PnP-owned identity keys is ignored; the
create-time struct fields are the supported lever, confirmed on `.173`):
- **`pszzCompatibleIds`** = `USB\VID_054C&PID_0CE6`, `USB\Class_03&SubClass_00&Prot_00`, `USB\Class_03`
(Windows appends `SWD\Generic`). HIDAPI/SDL/libScePad walk HID-child → `CM_Get_Parent` → this parent's
CompatibleIds and string-match `"USB"` → **`bus_type` now resolves to USB** (was UNKNOWN).
- **`pszzHardwareIds`** = `pf_dualsense` **first** (so the INF still binds our UMDF driver), then
`USB\VID_054C&PID_0CE6&REV_0100`, `USB\VID_054C&PID_0CE6`. hidclass then derives the real-DS5 child ids
**`HID\VID_054C&PID_0CE6[&REV_0100]`** (previously only `HID\VID_054C&UP:0001_U:0005`).
- **`pContainerId`** = a deterministic per-pad GUID `{50464453-0000-0000-0000-00000000000<idx>}` ("PFDS")
(avoids the null-sentinel-ContainerId `xinput1_4` slot-skip bug; groups the pad's devnodes).
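The `pszz*` fields are multi-strings (each entry NUL-terminated, with one extra NUL closing the list). The sketch below shows one way to build such a buffer and to format the per-pad ContainerId; it is not the code in `create_swdevice`, and the helper names `multi_sz` and `container_id` are hypothetical. It assumes the pad index is written as hexadecimal in the last GUID group, which matches the pattern above for single-digit indices.

```rust
/// Encode ids as a UTF-16 multi-string: every entry NUL-terminated, then a final NUL.
/// Hypothetical helper; order matters because the first hardware id binds the INF.
fn multi_sz(ids: &[&str]) -> Vec<u16> {
    let mut out = Vec::new();
    for id in ids {
        out.extend(id.encode_utf16());
        out.push(0); // terminator for this entry
    }
    out.push(0); // terminator for the whole list
    out
}

/// Format the deterministic per-pad ContainerId ("PFDS" = 0x50 0x46 0x44 0x53).
/// Assumption: the index is rendered as hex in the last group.
fn container_id(idx: u8) -> String {
    format!("{{50464453-0000-0000-0000-{:012X}}}", idx)
}
```

For example, `multi_sz(&["pf_dualsense", r"USB\VID_054C&PID_0CE6&REV_0100", r"USB\VID_054C&PID_0CE6"])` keeps `pf_dualsense` first, so the INF match is unchanged while the USB ids follow it.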
**Validated live** (real shipping path, `dualsense-windows-test --index 1` alongside the running service's
pad 0): INF still binds (`Service=MsHidUmdf`), parent CompatibleIds/HardwareIds + per-pad ContainerId set,
the HID child gains `HID\VID_054C&PID_0CE6`, and the HIDAPI parent-walk reports **bus_type=USB**.
SDL / RawInput / WGI `RawGameController` identity stays correct (054C:0CE6).
**Remaining gap (NOT fixed by the above): GameInput VID/PID still reads 0.** GameInput parses VID/PID from
the HID child's **instance path** (`HID\punktfunk\1&…`), which carries no `VID_…&PID_…` token; neither
CompatibleIds nor HardwareIds change the instance path. Only a real USB-bus instance path
(`HID\VID_054C&PID_0CE6\…`) does — i.e. a **ViGEm-style KMDF USB-emulating bus driver** (the rank-3, last
resort). Pursue only if a target title uses GameInput AND the identity fix above doesn't satisfy it; prior
art (HIDMaestro) shows pure user-mode pads ARE accepted by WGI/GameInput, so other parity (descriptor /
strings / mapping) may matter more than a genuine USB bus.
## Next steps
> **Deployed to `.173` (2026-06-22):** the host identity fix is live in the `PunktfunkHost` service (release
> rebuilt + restarted) and the driver fixes are installed + signed (`oem74.inf`, `punktfunk-ds-test` cert).
> The box is ready for the decisive on-glass test. A rollback copy of the prior driver is at
> `C:\Users\Public\giprobe\driver-backup-oem74`.
1. **Decisive on-glass test (only the user can run):** launch Cyberpunk 2077 with Steam Input OFF against a
virtual DS5 carrying the new identity; check the in-game glyphs/prompt switch to DualSense. Cleanest
single-pad test (frees the service's pad 0 so only the new-identity pad is present):
`sc stop PunktfunkHost` → `target\debug\punktfunk-host.exe dualsense-windows-test --index 0 --seconds 600`
(new identity + live cycling Cross/stick), launch the game; then deploy the release + restart with
`scripts\windows\deploy-host.ps1`.
2. **Driver-side correctness — DONE & installed (2026-06-22).** Rebuilt/resigned/reinstalled per the recipe
below; validated live (`hidstrings` probe + `pfds-driver.log`):
- `IOCTL_HID_GET_STRING` now implemented (was `STATUS_NOT_IMPLEMENTED`). **Discovery:** Windows polls
this device's string slots with low-word ids **`0x0E`/`0x0F`/`0x10`** (lang `0x0409`) cyclically — NOT
the `0/1/2` `HID_STRING_ID_*` constants. The handler maps them (+ `0/1/2` as fallbacks):
`0x0E`→manufacturer "Sony Interactive Entertainment", `0x0F`→product "DualSense Wireless Controller",
`0x10`→serial "35533AD6E774" (the `0x09` pairing-report MAC). Verified: `HidD_GetManufacturer/Product/
SerialNumberString` now return those three distinct strings.
- `DS_FEATURE_CALIBRATION` trimmed 42 → 41 bytes (1 id + 40 data) to match the descriptor's feature
`0x05` (`0x95 0x28`).
- The repo source (`packaging/windows/dualsense-driver/src/lib.rs`) and the m0 build copy were diverged
by *formatting only*; they are now back in sync (the repo file was copied to m0 before building).
3. If a GameInput-only title needs the real VID/PID → the rank-3 KMDF USB-emulating bus driver.
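The string-slot mapping from step 2 can be sketched as a small lookup. This is a model of the behaviour described above, not the handler in `packaging/windows/dualsense-driver/src/lib.rs`; the function name `hid_string` is hypothetical, and the sketch assumes the request value carries the string id in its low word and the language id in its high word.

```rust
/// Map an `IOCTL_HID_GET_STRING` request value to the string the pad should return.
/// Assumption: low word = string id, high word = language id (ignored here).
/// Hypothetical helper; ids 0/1/2 are kept as fallbacks, as the notes above describe.
fn hid_string(request: u32) -> Option<&'static str> {
    let string_id = (request & 0xFFFF) as u16;
    match string_id {
        0x0E | 0 => Some("Sony Interactive Entertainment"), // manufacturer
        0x0F | 1 => Some("DualSense Wireless Controller"),  // product
        0x10 | 2 => Some("35533AD6E774"),                   // serial (pairing-report MAC)
        _ => None, // unknown slot: the caller would fail the request
    }
}
```

With language `0x0409`, a request of `0x0409_000F` resolves to the product string, and an unmapped slot yields `None`.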
## On-box experiment tooling (built 2026-06-22, `C:\Users\Public\giprobe\`)
- `probe.cpp` (+`build.bat`) — GameInput enumeration/fingerprint via `LoadLibrary("GameInput.dll")` +
`GameInputCreate`/`RegisterDeviceCallback` (GDK header). Prints each device's vid/pid/usage/counts —
this is what proved GameInput reads our pad as vid=0.
- `swexp.cpp` (+`build-swexp.bat`) — standalone `SwDeviceCreate` identity experiment: variations for
`pszzCompatibleIds` (struct field) vs `DEVPKEY_Device_CompatibleIds` (pProperties — ignored),
`pszzHardwareIds` USB ids, `pContainerId`. Create at a spare instance id, hold, inspect. Built with the
VS18 MSVC toolchain via `vcvars64.bat`.
- WGI probe: Windows PowerShell **5.1** WinRT projection of `RawGameController`/`Gamepad` (pump the message
loop; subscribe `RawGameControllerAdded` to kick enumeration).
- Parent-walk bus check: from the HID child, `DEVPKEY_Device_Parent` → that node's
`DEVPKEY_Device_CompatibleIds`, match `^USB`/`^BTH` — mirrors HIDAPI's `hid_internal_detect_bus_type()`.
- NOTE: the agent shell's PowerShell tool chokes on inline `@'…'@` here-strings feeding `Add-Type` (throws
a spurious "Remove-Item on system path '/' is blocked"); write C#/scripts to a file and run them instead.
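The parent-walk bus check reduces to a prefix match over the parent's CompatibleIds. A minimal sketch of that last step, assuming the ids have already been read from `DEVPKEY_Device_CompatibleIds`; the enum and function names are hypothetical, and this is not HIDAPI's source:

```rust
/// Bus classification derived from a parent devnode's CompatibleIds.
#[derive(Debug, PartialEq, Eq)]
enum BusType {
    Usb,
    Bluetooth,
    Unknown,
}

/// Return the bus implied by the first id that starts with `USB` or `BTH`.
/// Hypothetical helper mirroring the `^USB` / `^BTH` match described above.
fn bus_type_from_compatible_ids(ids: &[&str]) -> BusType {
    for id in ids {
        let upper = id.to_ascii_uppercase();
        if upper.starts_with("USB") {
            return BusType::Usb;
        }
        if upper.starts_with("BTH") {
            return BusType::Bluetooth;
        }
    }
    BusType::Unknown
}
```

A parent that exposes only `SWD\Generic` classifies as `Unknown`, which is the state before the identity fix; adding `USB\VID_054C&PID_0CE6` flips it to `Usb`.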
## How to reproduce / iterate (on `.173`)
### 1. Spawn a live virtual DualSense to test against
```
C:\Users\Public\punktfunk-native\target\debug\punktfunk-host.exe dualsense-windows-test --seconds 60
```
Creates `SWD\PUNKTFUNK\PF_PAD_0` (+ its HID child) and holds it, pushing a cycling input. Or just connect
a client — the real session creates the identical device. (Build with the env `CMAKE_POLICY_VERSION_MINIMUM=3.5`.)
### 2. SDL3 detection oracle (already set up: `C:\Users\Public\sdltest\SDL3.dll`)
Confirms HID-level recognition (HIDAPI). Run while a device from step 1 is live. PowerShell + C# (note:
PS 5.1's Add-Type is C# 5 — **no** interpolated strings, **no** inline `out` vars, **no**
`Marshal.PtrToStringUTF8`; SDL3 bools are 1 byte → `[return: MarshalAs(UnmanagedType.I1)]`):
```powershell
$cs = @'
using System; using System.Runtime.InteropServices; using System.Text;
public static class S {
const string D = @"C:\Users\Public\sdltest\SDL3.dll";
[DllImport(D)][return: MarshalAs(UnmanagedType.I1)] public static extern bool SDL_Init(uint f);
[DllImport(D)] public static extern IntPtr SDL_GetJoysticks(out int c);
[DllImport(D)] public static extern IntPtr SDL_GetJoystickNameForID(uint id);
[DllImport(D)] public static extern ushort SDL_GetJoystickVendorForID(uint id);
[DllImport(D)] public static extern ushort SDL_GetJoystickProductForID(uint id);
[DllImport(D)][return: MarshalAs(UnmanagedType.I1)] public static extern bool SDL_IsGamepad(uint id);
[DllImport(D)] public static extern IntPtr SDL_OpenGamepad(uint id);
[DllImport(D)] public static extern int SDL_GetGamepadType(IntPtr g);
static string U(IntPtr p){ if(p==IntPtr.Zero)return""; int n=0; while(Marshal.ReadByte(p,n)!=0)n++; byte[] b=new byte[n]; Marshal.Copy(p,b,0,n); return Encoding.UTF8.GetString(b); }
public static string Run(){ if(!SDL_Init(0x2000))return"init fail"; System.Threading.Thread.Sleep(1500);
int n=0; IntPtr a=SDL_GetJoysticks(out n); StringBuilder sb=new StringBuilder("joysticks: "+n+"\n");
for(int i=0;i<n;i++){ uint id=(uint)Marshal.ReadInt32(a,i*4); bool ig=SDL_IsGamepad(id); int t=ig?SDL_GetGamepadType(SDL_OpenGamepad(id)):-1;
sb.AppendLine(" '"+U(SDL_GetJoystickNameForID(id))+"' vid=0x"+SDL_GetJoystickVendorForID(id).ToString("x4")+" pid=0x"+SDL_GetJoystickProductForID(id).ToString("x4")+" isGamepad="+ig+" type="+t+" (PS5=6)"); }
return sb.ToString(); }
}
'@
Add-Type -TypeDefinition $cs; [S]::Run()
```
Expected today: it lists our device with `type=6` (PS5). That's the baseline "HID is correct".
## Next experiments — MUST run ON THE INTERACTIVE DESKTOP, not over SSH
WGI / RawInput / GameInput enumeration returns **empty from a headless SSH session** (no window/message
pump) — only HIDAPI works headless. So these must run in the logged-in desktop session (RDP in, or run
locally) while a DualSense session is live:
1. **Determine which API Cyberpunk uses and whether it sees the SWD device.** Enumerate via, separately:
- `Windows.Gaming.Input` (`RawGameController.RawGameControllers`, `Gamepad.Gamepads`),
- RawInput (`GetRawInputDeviceList` → filter HID gamepad usage 01/05),
- GameInput (`GameInputCreate` → `EnumerateDevices`); `GameInputRedistService` is installed on `.173`.
Compare which list our `VID_054C&PID_0CE6` appears in. The one(s) it's *missing from* point at the API
Cyberpunk uses.
2. **If WGI/GameInput exclude it:** make the SwDeviceCreate device enumerate more like a real USB device.
`SwDeviceCreate` takes a `pProperties` (`DEVPROPERTY[]`) array — try setting bus-type / container-id /
compatible-IDs so the newer APIs accept it. If that's insufficient, the heavyweight option is a
USB-emulating bus driver (the way ViGEmBus presents a real-looking device) instead of SwDeviceCreate +
UMDF-HID.
3. **Rule out an XInput device taking priority** (a leftover ViGEm pad, etc.).
4. **Correctness (not the detection blocker):** `DS_FEATURE_CALIBRATION` in the driver is **42 bytes**
but the report descriptor declares feature `0x05` as **41** (1 id + 40 data, `0x95 0x28`). Trim to 41;
wrong calibration only affects motion, and SDL accepts the device regardless.
## On-box layout (`.173`, builds + tools)
- **Host repo / build:** `C:\Users\Public\punktfunk-native` → `cargo build -p punktfunk-host`
(debug for `dualsense-windows-test`; `--release --features nvenc` is what the service runs). The
build env is persisted Machine-scope (`PUNKTFUNK_NVENC_LIB_DIR`, `LIBCLANG_PATH`,
`CMAKE_POLICY_VERSION_MINIMUM`) — see `scripts\windows\`. **One-call rebuild+redeploy of the
service: `scripts\windows\deploy-host.ps1`** (stop → build → restart, `.bak` rollback); web:
`scripts\windows\build-web.ps1`. bun=`C:\Users\Public\bun`, node=`C:\Users\Public\node-v22.11.0-win-x64`.
- **Host service:** scheduled task / SCM `PunktfunkHost` runs `…\target\release\punktfunk-host.exe
service run` → spawns `serve` (currently native-only, `PUNKTFUNK_HOST_CMD=serve` in
`C:\ProgramData\punktfunk\host.env`). Restart: `sc stop/start PunktfunkHost`. Native port 9777, mgmt
47990. (NB: Sunshine/Apollo conflicts on the GameStream ports — keep it stopped, or run native-only.)
- **UMDF driver build project:** `C:\Users\Public\m0\windows-drivers-rs\examples\pf-dualsense`
(`pf_dualsense.inx` + `src\lib.rs` live here; the canonical copies are in the repo under
`packaging/windows/dualsense-driver/` — keep them in sync). Rebuild + reinstall recipe (e.g. after the
calibration fix), all from that dir, env `LIBCLANG_PATH=C:\Program Files\LLVM\bin`,
`Version_Number=10.0.26100.0`:
1. `cargo make` → `target\debug\pf_dualsense_package\`
2. **Clear the FORCE_INTEGRITY PE bit** (wdk-build sets `/INTEGRITYCHECK`, which blocks self-signed
load): clear bit 0x80 at `PE_header_offset+0x5e` of `pf_dualsense.dll`, then re-sign.
3. `signtool sign /fd SHA256 /sha1 6A52984E54376C45A1C236B1A2C8A746C5AB6131 pf_dualsense.dll`
4. `Inf2Cat /driver:<pkg> /os:10_x64` → re-sign the `.cat` with the same thumbprint.
5. `pnputil /delete-driver <old oemNN.inf> /uninstall /force` then `pnputil /add-driver
pf_dualsense.inf /install`. (Self-signed cert is already trusted on `.173`; Secure Boot ON, HVCI off.)
- **SDL oracle:** `C:\Users\Public\sdltest\SDL3.dll`. **Test device:** `punktfunk-host.exe
dualsense-windows-test --seconds N` creates one `SWD\PUNKTFUNK\PF_PAD_0` and holds it.
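Step 2 of the rebuild recipe (clearing the FORCE_INTEGRITY bit) can be sketched in Rust as an in-memory patch. This is an illustration under stated assumptions, not the script actually used on `.173`: the function name is hypothetical, the PE header offset is read from the DOS header field at `0x3C`, and the sketch does not touch the PE checksum. The caller would read `pf_dualsense.dll`, patch it, write it back, and then re-sign as in step 3.

```rust
/// `IMAGE_DLLCHARACTERISTICS_FORCE_INTEGRITY`, the flag that `/INTEGRITYCHECK` sets.
const FORCE_INTEGRITY: u16 = 0x0080;

/// Clear the FORCE_INTEGRITY bit in a PE image held in memory.
/// Returns `Ok(true)` if the bit was set, `Ok(false)` if it was already clear.
/// Hypothetical helper; the caller is responsible for file I/O and re-signing.
fn clear_force_integrity(image: &mut [u8]) -> Result<bool, &'static str> {
    // The DOS header stores the PE header offset (e_lfanew) at 0x3C.
    let lfanew = image.get(0x3C..0x40).ok_or("file too short for a DOS header")?;
    let pe = u32::from_le_bytes([lfanew[0], lfanew[1], lfanew[2], lfanew[3]]) as usize;

    if image.get(pe..pe.saturating_add(4)) != Some(b"PE\0\0".as_slice()) {
        return Err("PE signature not found");
    }

    // DllCharacteristics is a little-endian u16 at PE_header_offset + 0x5E.
    let at = pe + 0x5E;
    let field = image
        .get_mut(at..at + 2)
        .ok_or("file too short for DllCharacteristics")?;
    let flags = u16::from_le_bytes([field[0], field[1]]);
    field.copy_from_slice(&(flags & !FORCE_INTEGRITY).to_le_bytes());

    Ok(flags & FORCE_INTEGRITY != 0)
}
```

Returning whether the bit was set makes the step idempotent and easy to log: a second run over the same file reports `false` and changes nothing.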
## Key code
| What | File |
| --- | --- |
| Host backend (`create_swdevice`, the `Global\pfds-shm-<idx>` section, write_state/service/pump) | `crates/punktfunk-host/src/inject/dualsense_windows.rs` |
| UMDF driver (HID descriptor, feature reports, `on_output_report`) | `packaging/windows/dualsense-driver/src/lib.rs` |
| Shared report codec (`serialize_state` input, `parse_ds_output` feedback) | `crates/punktfunk-host/src/inject/dualsense_proto.rs` |
| Pad seam (`PadBackend`, `pump` → rumble `0xCA` / hidout `0xCD`) | `crates/punktfunk-host/src/punktfunk1.rs` |
## Facts proven (don't re-litigate)
- `SwDeviceCreate` requirements: enumerator must have **no underscore** (`punktfunk`); the completion
**callback is mandatory** (NULL → E_INVALIDARG). Per-session device works; auto-removed on disconnect.
- HID descriptor + feature reports are DS5-accurate enough that **SDL identifies it as PS5**.
- Host-side rumble works end to end; the client (macOS) rendering of `0xCA` is the open rumble bug.
+416
@@ -0,0 +1,416 @@
# Windows host — virtual DualSense scoping
**Status:** **M0 feasibility gate PASSED (2026-06-21)** — a self-authored **Rust** UMDF virtual DualSense
loads self-signed under Secure Boot, is recognized as a genuine DualSense by Steam, and receives `0x02`
output reports at its write callback. Driver source: `packaging/windows/dualsense-driver/`. (Earlier in
this doc's history the gate looked blocked by Secure Boot / driver code-integrity — that was wrong; the
real blocker was the PE FORCE_INTEGRITY bit that `wdk-build` sets via `/INTEGRITYCHECK`, cleared post-build.)
Web-research pass complete; the mechanism conclusion is **reversed**
from the 2026-06-20 draft. This doc **supersedes the 2026-06-20 VHF scoping** — VHF was the wrong
answer (it is kernel-only and cannot host a user-mode HID source), and the correct mechanism is a
**UMDF2 user-mode HID minidriver**, the same driver tier punktfunk already vendors/signs/installs for
SudoVDA. Two product decisions are now fixed and drive this plan: **(1)** the driver is for **public
end-user distribution** (so: EV cert + Microsoft attestation signing, not just the fleet self-signed
recipe), and **(2)** the strong preference is a **self-authored Rust driver**, with a thin C/C++ shim
as the realistic fallback and forking HIDMaestro as the last resort.
## TL;DR
Apollo's backlog item #23/#89 ("DS4 ViGEm target on Windows") is the **wrong target** if the goal is
*actual DualSense*. ViGEmBus emulates only **Xbox 360 (XUSB)** and **DualShock 4 (DS4)** — never a
DualSense. Because this is a *host-side* virtual pad, the DualSense-defining features (adaptive
triggers, the fine haptic actuators, DS5 identity) only work end-to-end if the **game sees a real
DualSense** and therefore drives them; a DS4 virtual pad means the game takes its DS4 code path and
never emits those commands, so the client's adaptive-trigger rendering is never exercised. ViGEm DS4
structurally **cannot** deliver adaptive triggers.
The right path is the Windows analog of what the Linux host already does over `/dev/uhid`: present a
**real virtual DualSense HID device** (Sony VID `054C` / PID `0CE6`, the inputtino PS5 report
descriptor punktfunk already ships). On Windows that is a **UMDF2 (user-mode) HID minidriver**
created/torn-down per session from the host via `SwDeviceCreate`, sitting as a lower filter under
the OS pass-through driver `mshidumdf.sys`. It is the **same driver tier as SudoVDA** (UMDF, not
kernel), so the existing vendor → sign → Inno-installer machinery applies almost unchanged.
> **Supersedes the 2026-06-20 VHF scoping.** That draft concluded "a kernel-mode virtual-HID device
> via the Virtual HID Framework (VHF) — a SudoVDA-class driver effort." The decisive correction:
> **VHF supports a HID *source* driver only in kernel mode** (Microsoft "Virtual HID Framework
> (VHF)"). A user-mode (UMDF) HID source is **not** a VHF use case — it is a UMDF2 HID minidriver
> built from the `vhidmini2` sample (or DMF's `Dmf_VirtualHidMini`). The earlier "KMDF is a higher
> bar than SudoVDA's UMDF/IddCx" framing is therefore wrong: the correct mechanism is **the same
> UMDF tier as SudoVDA**, not above it.
Everything except the host backend is already platform-agnostic and DualSense-complete (verified
against live code), so this is a well-bounded host-side addition. **The whole effort is gated by an
on-glass feasibility spike (M0)** that no prior art settles: whether a virtual `054C:0CE6` device is
accepted as a genuine DualSense by `Windows.Gaming.Input` / GameInput / Steam **and** whether the
game's output report `0x02` (the adaptive-trigger block) actually reaches the driver's write callback.
## Why this is the wrong place to copy Apollo
Apollo (and all of Sunshine's lineage) **does DualSense only on Linux** (`inputtino`,
`DualSenseWired`). Its Windows input path (`src/platform/windows/input.cpp`) is ViGEm
`XUSB_REPORT` + `DS4_REPORT_EX` only — `MPS2_TO_DS4_ACCEL` motion conversion, inverse-ViGEmBus gyro
calibration, DS4 touchpad packing. There is **zero** virtual-HID / DualSense code on Apollo's Windows
side. So:
- Copying Apollo on Windows gets us a **DS4**, with the adaptive-trigger ceiling baked in.
- There is **no in-ecosystem upstream** (Sunshine/Apollo/Wolf) that already solved a virtual
*DualSense* on Windows to vendor from. The closest prior art is in the **virtual-HID-controller**
space, not the streaming-host space: HIDMaestro and Nefarius DsHidMini (see *Mechanism*).
This is unchanged from the 2026-06-20 draft and remains correct.
## The parity target — and what's *already* done
The Linux host (`crates/punktfunk-host/src/inject/dualsense.rs`) creates a **UHID** device presenting
the genuine DualSense descriptor, so the kernel `hid-playstation` driver binds it and games see a real
DualSense — gamepad + motion + touchpad + lightbar/player-LEDs + adaptive triggers. It writes HID
**input** report `0x01` (controller state) and reads HID **output** report `0x02` (the game's
rumble/LED/trigger feedback), which it forwards to the client as `punktfunk_core::quic::HidOutput`.
Crucially, **everything except the host backend is already platform-agnostic and DualSense-complete**
(verified against live source):
| Layer | State | Where |
|---|---|---|
| Protocol planes (rich input `0xCC`, rumble `0xCA`, HID-output `0xCD`) | ✅ done | `punktfunk_core::quic` |
| Feedback abstraction (`HidOutput::{Led,PlayerLeds,Trigger,…}`) | ✅ done | `punktfunk_core::quic` |
| Pad-type negotiation (client pref > env > default), `GamepadPref::DualSense` | ✅ done | `punktfunk1.rs` `resolve_gamepad` (~1577) |
| Backend dispatch (`enum PadBackend`) | ✅ done; `DualSense`/`DualShock4` arms are `#[cfg(target_os="linux")]` | `punktfunk1.rs` (PadBackend ~1181–1272) |
| Clients (capture + adaptive-trigger/lightbar/haptic/touchpad/motion rendering) | ✅ done, all platforms | `clients/*` |
| C-ABI (`next_hidout` / `send_rich_input`) | ✅ done | `abi.rs` |
| **Host virtual-DualSense backend** | **Linux only (UHID)** | `inject/dualsense.rs` |
So a Windows DualSense backend needs **no protocol, client, or C-ABI change**. The whole DualSense
**HID contract already exists as pure, transport-independent Rust + const data**, kernel-verified
byte-for-byte against `hid-playstation.c` / inputtino / SDL, in `inject/dualsense.rs`:
- `DUALSENSE_RDESC` — the 232-byte USB report descriptor.
- `serialize_state` — the input report `0x01` packer (controller state → bytes).
- `parse_ds_output` — the output report `0x02` parser (game's rumble/LED/trigger block → `HidOutput`),
valid-flag gated.
- Feature blobs `0x05` calibration, `0x09` pairing, `0x20` firmware. **USB framing (no CRC).**
**No hardware capture is needed** — the bytes are already correct and proven. The *only* Linux
coupling is the `/dev/uhid` event framing (`UHID_CREATE2`/`INPUT2`/`OUTPUT`/`GET_REPORT`) in
`DualSensePad::open`/`write_state`/`service`. A Windows backend swaps that framing for the
`SwDeviceCreate` + IOCTL channel to the UMDF driver; **the report bytes are identical.**
> **One in-repo bug to fix in passing:** `DS_FEATURE_CALIBRATION` (`0x05`) is currently **42 bytes**;
> the spec is **41**. Trim it for strict Windows consumers as part of M1 (`42 → 41`).
`dualshock4.rs` (committed `3e6c9f6`) is a worked **second** example of the multi-pad-type
`PadBackend` pattern, reusing the DualSense state — a template for how the Windows arm slots in.
The host integration seam is small and already mapped: ~1 enum arm + 5 match arms in the
`PadBackend` block (`punktfunk1.rs` ~1181–1272), flipping `pick_gamepad`/`resolve_gamepad`
(~1558–1606) from `#[cfg(target_os = "linux")]` to `#[cfg(any(target_os = "linux", target_os =
"windows"))]`, plus the `inject.rs` module gating (~424–451). `gamepad_windows.rs` is today
ViGEm-Xbox360-only (138 LOC); the new `inject/dualsense_windows.rs` sits beside it, and ViGEm stays
for Xbox 360 / Xbox One.
## The Windows mechanism — UMDF2 HID minidriver (not VHF)
Windows has **no userspace HID-device creation** (unlike Linux UHID), so a real virtual DualSense
needs a driver component. The decisive correction over the prior draft:
- **VHF (Virtual HID Framework) supports a HID *source* driver only in kernel mode.** It is not the
mechanism for a user-mode virtual pad. (Microsoft, "Virtual HID Framework (VHF)".)
- The user-mode mechanism is a **UMDF2 HID minidriver**: a small lower-filter driver under the
OS-supplied pass-through driver **`mshidumdf.sys`** (which calls `HidRegisterMinidriver` on the
minidriver's behalf). This is the **same UMDF tier as SudoVDA**: *below* kernel work, not above it.
A second prior-research correction that matters for the language choice: **UMDF 2.0 is NOT
COM-based.** COM / `IDriverEntry` / `IWDFDriver` belong to legacy **UMDF 1.x**. UMDF 2.0 uses the
same **C-style WDF object model as KMDF** — a `DriverEntry` symbol plus C function pointers
(`EvtDriverDeviceAdd`, `EvtIoDeviceControl`) stored in config structs. There is no vtable to
implement. (Microsoft, "Porting a Driver from UMDF 1 to UMDF 2", "Getting Started with UMDF v2".)
This is precisely why a Rust FFI implementation is even conceivable (see *Driver language*).
### What the driver actually does (small, well-bounded)
A UMDF2 HID minidriver holds **no device logic** — it shuttles bytes. Its entire job is one
`EvtIoDeviceControl` callback branching on ~10 HID IOCTLs (Microsoft, "Creating WDF HID
Minidrivers"; reference source `vhidmini2`):
- In `EvtDriverDeviceAdd`: call `WdfFdoInitSetFilter`, then create the I/O queue(s).
- **Descriptor IOCTLs** (`GET_DEVICE_DESCRIPTOR` / `GET_REPORT_DESCRIPTOR` / `GET_DEVICE_ATTRIBUTES`)
— trivial: `RequestCopyFromBuffer` a static blob. For punktfunk these blobs are the **existing
`DUALSENSE_RDESC` (232 B)** + a `HID_DEVICE_ATTRIBUTES` filled `054C`/`0CE6`.
- **Output / feature IOCTLs** (`WRITE_REPORT` / `SET_OUTPUT_REPORT` / `GET_FEATURE` / `SET_FEATURE`)
— pull the `HID_XFER_PACKET` (report id + buffer) and hand the bytes to the host. These carry the
game's `0x02` output report (rumble / lightbar / **adaptive-trigger** block) — exactly what
`parse_ds_output` already decodes.
- **Input path** (`READ_REPORT`, pad → game) — the only non-trivial mechanic, an **inverted call**:
each `READ_REPORT` request is pended into a manual `WDFQUEUE`
(`WdfIoQueueDispatchManual` + `WdfRequestForwardToIoQueue`) and later popped
(`WdfIoQueueRetrieveNextRequest`), filled, and completed (`WdfRequestComplete`) whenever the host
has a fresh `0x01` input report. `vhidmini2` drives this from a periodic timer; punktfunk drives
it from each new `0x01` report arriving over the host channel — **structurally identical to the
existing Linux `/dev/uhid` loop.**
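The inverted call on the input path can be modelled without any WDF types. The sketch below is a plain-Rust model, not driver code: `u32` ids stand in for `WDFREQUEST` handles, the type and method names are hypothetical, and dropping a report when nothing is parked is an assumption of this model rather than a statement about the real driver.

```rust
use std::collections::VecDeque;

/// Model of the manual queue that holds pended `READ_REPORT` requests.
struct InvertedCallQueue {
    pending: VecDeque<u32>,
}

impl InvertedCallQueue {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// A `READ_REPORT` arrived with no data ready: park it
    /// (the role `WdfRequestForwardToIoQueue` plays in the driver).
    fn park(&mut self, request_id: u32) {
        self.pending.push_back(request_id);
    }

    /// A fresh `0x01` report arrived from the host: pop the oldest parked request
    /// and complete it with the report bytes. Returns `None` if nothing was parked,
    /// in which case this model simply drops the report.
    fn complete_next(&mut self, report: &[u8]) -> Option<(u32, Vec<u8>)> {
        let request_id = self.pending.pop_front()?;
        Some((request_id, report.to_vec()))
    }
}
```

Requests complete in arrival order, one per input report, which is the same shape as reading and answering events on the Linux `/dev/uhid` fd.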
Because UMDF can't marshal embedded pointers, `mshidumdf.sys` converts `IOCTL_HID_*` into
`IOCTL_UMDF_HID_*` (e.g. `IOCTL_UMDF_HID_GET_INPUT_REPORT`, `IOCTL_UMDF_HID_SET_FEATURE`), passing
`reportBuffer` / `reportId` as separate buffers — the driver branches on those.
### Integration sketch
```
host process (Rust) <-- SwDeviceCreate + IOCTL channel --> UMDF2 HID minidriver <-- HID --> game / Steam / GameInput
PadState -------------- input report 0x01 -------------> inverted READ_REPORT queue
HidOutput <----- output report 0x02 (WriteReport cb) ----- EvtIoDeviceControl
```
- **Descriptor reuse:** the exact inputtino PS5 descriptor + feature-report replies we already ship
for Linux (`dualsense.rs` `DS_*` constants) — same bytes, same VID/PID, so Windows + games
recognize it as a DualSense.
- **Host-side device creation:** `windows::Win32::Devices::Enumeration::Pnp::SwDeviceCreate`
→ `Result<HSWDEVICE>` (pure Win32, in the `windows` crate, **no WDK needed**), enumerating a
root device whose hardware IDs match the pre-staged INF. Requires Administrator. **The device
exists only while the `HSWDEVICE` handle (i.e. the host process) is open** — `SwDeviceClose`
removes it — so the pad is created/destroyed with the session, exactly like the Linux UHID fd.
The INF is pre-staged once (`pnputil /add-driver`).
- **Userspace bridge:** a `DualSenseManager`-shaped struct mirroring the Linux one (same `RichInput`
→ report `0x01` packing via `serialize_state`, same `HidOutput` parsing via `parse_ds_output`),
talking to the driver over an IOCTL channel instead of `/dev/uhid`.
- **Packaging:** vendor + sign the `.dll`/`.inf`/`.cat` and install via the existing
`packaging/windows` machinery (`pnputil` + an `install-*.ps1`, bundled in the Inno `setup.exe`).
The precedent — SudoVDA, a UMDF/IddCx driver — is already in the repo.
## Driver language — recommendation
The user strongly prefers a **self-authored Rust driver**. Verified verdict: **a Rust UMDF2 HID
minidriver is technically viable but unproven and pioneering** — it does not clear the bar for a
*low-risk* M2. Honest ranking of the three options:
### Option R — fully self-authored Rust driver (preferred; viable, but pioneering)
- **What's real today:** `microsoft/windows-drivers-rs` (`wdk`, `wdk-sys`, `wdk-build`,
`wdk-macros`) officially targets WDM + KMDF + **UMDF** (tested UMDF 2.33). It ships a *real* Rust
UMDF sample, `examples/sample-umdf-driver/src/lib.rs`, that `#[unsafe(export_name = "DriverEntry")]`,
builds a `WDF_DRIVER_CONFIG` with `EvtDriverDeviceAdd: Some(...)`, and calls `WdfDriverCreate` +
`WdfDeviceCreate` via `call_unsafe_wdf_function_binding!` over raw `wdk-sys` FFI. Because UMDF 2.0
is the C function-pointer model (no COM vtable), the FFI maps cleanly.
- **The gap:** that sample is a **bare stub** — no I/O queue, no IOCTL dispatch, no HID. The entire
HID-minidriver layer (`WdfFdoInitSetFilter`, the manual inverted-call queue, `IOCTL_UMDF_HID_*`
dispatch, `HID_XFER_PACKET`, `METHOD_NEITHER`) would be **hand-written `unsafe` FFI with no safe
wrappers**, against `vhidmini2`/GazeHid-scale glue (a few hundred lines). The heavy domain logic is
*not* in the driver — it already exists in `dualsense.rs`.
- **The honest blockers:** **zero precedent** — every shipping virtual-HID controller driver
(`vhidmini2`, HIDMaestro, DsHidMini, EmuController, GazeHid) is **C**. Microsoft labels
`windows-drivers-rs` "not yet recommended for production use" (Sept 2025) and has **not settled the
WHCP/attestation submission path for Rust drivers** — directly relevant given the public-distribution
requirement (though attestation re-signs the `.cat` and treats the `.dll` opaquely, so signing
*should* be language-agnostic — unverified). Whether all needed WDF symbols (`WdfIoQueueCreate`,
`WdfFdoInitSetFilter`, `WdfRequestRetrieveOutputMemory`, manual-queue APIs,
`WDF_IO_QUEUE_CONFIG_INIT`) are generated/usable for the UMDF target is **unverified against the
bindings — this is exactly what the M0 build spike must answer.** Note the Dec 2025
`windows-drivers-rs` build break (Discussion #591) is a transient LLVM-22-tip bindgen issue, fixed
by pinning LLVM 21.1.2 — not a fundamental defect.
Do **not** C-FFI-bind DMF's `Dmf_VirtualHidMini` from Rust (large, awkward C surface) — reimplement
the modest `vhidmini2` queue/IOCTL glue directly.
### Option C — thin C/C++ UMDF2 shim + all logic in the Rust host (realistic fallback / lowest-risk M2)
Clone `vhidmini2` (`WdfFdoInitSetFilter` + `EvtIoDeviceControl` + manual inverted-call queue, a few
hundred LOC of generic byte-shuttling); keep **all** DualSense logic in the existing Rust host
(`dualsense.rs` descriptors/packers/parsers fed over the IOCTL channel); the `SwDeviceCreate` host
bridge stays pure Rust in the `windows` crate (no WDK). This **mirrors HIDMaestro's split** (generic
C/C++ UMDF2 HID minidriver under `mshidumdf.sys`, all profile/DualSense logic in the user-mode
service) **and punktfunk's own Linux design.** It is the user's pre-ranked middle option and the
fastest way to reach the M0 on-glass gate.
### Option H — fork/reuse HIDMaestro (last resort)
HIDMaestro is a proven, pure-UMDF2 virtual controller (self-signed, no EV/test-signing/reboot)
recognized by DirectInput/XInput/SDL3/WGI/GameInput/RawInput + Steam, with a **DualSense profile**
(byte-exact VID/PID + descriptor). Use only if even the C shim stalls **and** adaptive-trigger
fidelity is not required — **HIDMaestro omits adaptive triggers from its DS5 feature list**, so it
cannot prove the very thing that makes a virtual DualSense worth building. Its driver is C; its
service is C#.
### Recommendation
**Lead with Option R for the long-term codebase, but de-risk the on-glass gate with Option C in M2.**
Concretely: run the **M0 spike in two halves** — (a) a `windows-drivers-rs` UMDF *build* spike to
confirm the WDF queue/IOCTL symbols are usable from Rust at all, and (b) the on-glass recognition gate
using whichever driver is fastest to stand up (the C `vhidmini2` shim is the safe bet). If (a) passes
**and** the on-glass gate passes, author the M2 driver in **Rust** (it would be the first Rust UMDF
HID driver, accepted as pioneering risk per the user's explicit preference). If (a) is shaky, ship M2
as the **C shim** and migrate the driver to Rust later, once `windows-drivers-rs` ships safe WDF/HID
abstractions. Either way the DualSense *logic* stays in Rust where it already lives. Forking HIDMaestro
is the fallback-of-fallbacks and is acceptable only if adaptive triggers are dropped from scope.
## Signing
Two recipes coexist in the Inno installer, selected by the bundled payload — the same pattern already
proven for SudoVDA.
### Fleet / self-signed (dev + internal boxes)
The in-repo precedent is `packaging/windows/install-sudovda.ps1`: import the bundled `.cer` into the
machine **Root** *and* **TrustedPublisher** stores (`certutil -addstore -f`), then `pnputil
/add-driver /install`. This installs silently **only** because the publisher is pre-trusted on that
machine. Microsoft is explicit that this auto-import-into-Root practice "should never be followed for
any driver package distributed outside your organization" — so it is the **fleet** path, never the
public one.
### Public end-user distribution — EV cert + Microsoft attestation
For arms-length public users, the correct tier is **Microsoft attestation signing** via Partner
Center (verified: "Attestation signing supports Windows Desktop kernel mode **and user mode**
drivers"; processable types include `.cab`/`.dll`). Pipeline:
1. **Prerequisites:** a registered **Windows Hardware Developer Program** (Partner Center) account
(free to register; sign in with an Entra ID global-admin work account; accept the agreements,
provide org/D-U-N-S info, respond to the legal-contact verification email) and an **EV
code-signing certificate** (mandatory to register *and* to sign the submission CAB; ~USD 250-560/yr;
FIPS hardware token/HSM mandatory; 1-7 business-day identity vetting). Windows ADK (`MakeCab`).
2. **Build the submission:** `MakeCab` the `.dll` + `.inf` (+ `.pdb`/symbols) into per-driver
subfolders (folder names < 40 chars, no special chars, no UNC); `SignTool sign` the CAB with the
EV cert (`/fd sha256` + RFC3161 timestamp `/tr … /td sha256`).
3. **Submit:** Partner Center → *Submit new hardware*, **leave test-signing unchecked**, request the
desired signatures.
4. **Microsoft re-signs:** it appends a Microsoft SHA-2 signature and **regenerates + signs a new
`.cat` with a Microsoft cert** (your `.cat` is replaced). Because the catalog signer is then
Microsoft (already trusted), **PnP installs silently — no publisher prompt, no test-signing, no
reboot, and no shipping our cert into users' Root store.** Validation: `devcon`/`pnputil` install
must not show "Windows can't verify the publisher of this driver software."
**Important nuance — is attestation even *required* for UMDF?** UMDF is user-mode, so it is **exempt
from kernel-mode code-integrity *load* enforcement** — the driver `.dll` will *load* without a
Microsoft signature. But **PnP *installation* still requires a signed catalog whose publisher is
trusted.** A driver signed only with a plain publicly-trusted (OV/EV) Authenticode cert that is *not*
already in TrustedPublisher will **install, but with the blocking "Windows Security / would you like
to install this device software?" prompt** (setupapi warning `0x800b0109`, error `0xe0000242`
"publisher … not yet established as trusted"). So a bare Authenticode signature is **not** sufficient
for a prompt-free public install — **attestation is the minimal correct public path.** The April 2026
kernel-trust change (removing trust for legacy cross-signed *kernel* drivers) **does not affect**
attestation/WHQL or user-mode UMDF drivers.
What attestation does **not** do: attestation-signed drivers are **not** distributed via Windows
Update — irrelevant here, since punktfunk bundles the driver in its Inno installer exactly like
SudoVDA. (Azure Trusted Signing is **not** an option for the driver `.cat` at all — it signs only
user-mode PE / `/INTEGRITYCHECK` / SmartScreen, and cannot substitute for the EV cert in Partner
Center; it could only improve SmartScreen reputation on the installer `.exe`.) Note attestation does
**not** require HLK/WHQL testing. The heavier fallback, only if attestation's "testing scenarios"
positioning ever hardens into a block, is full **WHQL/HLK** submission (also yields a Microsoft-signed
catalog, plus Windows Update eligibility).
### Coexistence in the Inno installer
`packaging/windows/punktfunk-host.iss` already gates the SudoVDA driver payload behind
`#ifdef WithDriver` + the `installdriver` task + a `[Run]` call to `install-sudovda.ps1`. Add an
analogous gated payload + `install-dualsense.ps1` for the virtual DualSense driver, switching the
bundled `.cat` per build:
- **fleet build** → self-signed `.cat` + `install-dualsense.ps1` keeps the
`certutil -addstore Root/TrustedPublisher` step (cloned from `install-sudovda.ps1`).
- **public build** → Microsoft-attestation-re-signed `.cat`, and `install-dualsense.ps1`
**drops** the `certutil` import (just `pnputil /add-driver /install`).
Operationally, the EV key lives on a non-exportable FIPS token, so the **CAB signing + Partner Center
submission is a manual offline step**, not a CI secret (cloud-HSM/Azure Key Vault EV options exist but
need per-CA confirmation). The Microsoft-resigned `.cat` is then committed as the vendored public
payload, the way SudoVDA's signed driver is vendored in `packaging/windows/sudovda/`.
## Feasibility gate (BLOCKING — M0, on-glass only)
No prior art settles the two questions that decide whether this whole effort is worth building. **This
gate blocks M1-M6** and can only be answered on the **RTX box (`192.168.1.173`)** — the dev VM is
headless/WARP and cannot validate game-facing HID recognition:
1. **Recognition:** is a virtual `054C:0CE6` UMDF2 device accepted as a *genuine DualSense* by
`Windows.Gaming.Input` / GameInput / Steam (and native-DS5 games)? HIDMaestro proves DualSense
*recognition* is possible, but…
2. **Adaptive-trigger fidelity:** does the game's output report `0x02` (the adaptive-trigger block)
actually reach the driver's `WriteReport`/`SetOutputReport` callback? **HIDMaestro omits adaptive
triggers**, so no prior art proves this — it must be **measured on glass**.
If (2) fails, the realistic product is a DualSense *identity* without adaptive triggers — at which
point the value over ViGEm DS4 collapses and the project should likely **defer** rather than ship.
**M0 RESULT (2026-06-21): GATE PASSED.** Both answered YES on the RTX box with a self-authored **Rust**
UMDF minidriver (`packaging/windows/dualsense-driver/`). (1) **Recognition:** Steam recognized the virtual
`054C:0CE6` device as a genuine DualSense and drove its DualSense-specific LEDs. (2) **`0x02` reaches the
write callback:** captured two Steam-Input output reports (`validFlag1=0x14` = LIGHTBAR|PLAYER_INDICATOR).
Adaptive-trigger-specific bytes ride the same `0x02` path (Cyberpunk confirmation is gravy, not a gate).
Three bugs had to be fixed to get there — the load wall was the PE **FORCE_INTEGRITY** bit (`wdk-build`'s
`/INTEGRITYCHECK`; clear bit `0x80` at PE+0x5e + re-sign), then `WdfTimerCreate` exec-level, then a parallel
queue's zeroed `NumberOfPresentedRequests`. **Option R (Rust) confirmed for M2; no C shim needed.**
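
The FORCE_INTEGRITY fix above comes down to one 16-bit field in the PE header. The following is a minimal sketch of that patch, not the repo's actual tooling: the function name `clear_force_integrity` is hypothetical, it operates on an in-memory image, and it deliberately leaves out the PE checksum update and the re-signing step that must follow.

```rust
/// IMAGE_DLLCHARACTERISTICS_FORCE_INTEGRITY, the bit set by the linker's /INTEGRITYCHECK.
const FORCE_INTEGRITY: u16 = 0x0080;
/// Offset of DllCharacteristics from the "PE\0\0" signature:
/// 4 (signature) + 20 (COFF header) + 70 (field offset inside the optional header).
const DLL_CHARACTERISTICS_OFFSET: usize = 0x5e;

/// Hypothetical helper: clears FORCE_INTEGRITY in an in-memory PE image.
/// Returns Ok(true) if the bit was set and is now cleared, Ok(false) if it was already clear.
/// The caller still has to write the file back and re-sign it.
pub fn clear_force_integrity(image: &mut [u8]) -> Result<bool, &'static str> {
    if image.len() < 0x40 || &image[..2] != b"MZ" {
        return Err("not an MZ image");
    }
    // e_lfanew (at 0x3c) holds the file offset of the PE signature.
    let pe = u32::from_le_bytes([image[0x3c], image[0x3d], image[0x3e], image[0x3f]]) as usize;
    let field = pe
        .checked_add(DLL_CHARACTERISTICS_OFFSET)
        .ok_or("bad e_lfanew")?;
    let end = field.checked_add(2).ok_or("bad e_lfanew")?;
    if end > image.len() {
        return Err("image too short for the optional header");
    }
    if &image[pe..pe + 4] != b"PE\0\0" {
        return Err("missing PE signature");
    }
    let flags = u16::from_le_bytes([image[field], image[field + 1]]);
    if flags & FORCE_INTEGRITY == 0 {
        return Ok(false);
    }
    let cleared = flags & !FORCE_INTEGRITY;
    image[field..end].copy_from_slice(&cleared.to_le_bytes());
    Ok(true)
}
```

The `0x5e` offset is the same for PE32 and PE32+ images, because `DllCharacteristics` sits at byte 70 of the optional header in both layouts.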
**Host integration status (2026-06-21): M1/M3/M4 landed; data plane runtime-proven.** The Linux
DualSense logic is shared via `inject/dualsense_proto.rs`; the Windows backend
`inject/dualsense_windows.rs` (`DualSenseWindowsManager`) drives the driver over the
`Global\pfds-shm-<idx>` section, and the `PadBackend`/`pick_gamepad` seam now resolves DualSense on
Windows. Live-verified on the RTX box: the manager creates the section + pushes report `0x01` and a
devnode serves it to a HID read (manager data plane works). **Open item — `SwDeviceCreate`
per-session devnode:** two `E_INVALIDARG` causes found — (1) an underscore in the enumerator name
(`pf_dualsense` → use `punktfunk`), (2) passing the completion callback is still rejected (cause
unresolved; needs a known-good C reference). So per-session auto-creation is **best-effort/non-fatal**:
the host falls back to an out-of-band `pf_dualsense` devnode (the INF lists both `root\pf_dualsense`
for devgen and `pf_dualsense` for SwDevice; the installer would create it, as SudoVDA does). Remaining:
fix the SwDeviceCreate callback E_INVALIDARG, then the M5 on-glass game test.
## Milestone plan (M0-M6)
| # | Milestone | Output | Gate / risk |
|---|---|---|---|
| **M0 ✅ DONE** | **Feasibility spike — PASSED (2026-06-21)** | (a) Rust `windows-drivers-rs` UMDF build spike — symbols usable, driver authored in Rust; (b) on-glass on the RTX box: self-signed Rust `054C:0CE6` UMDF minidriver loads under Secure Boot, Steam recognizes it as a DualSense, `0x02` output reaches the write callback. Source: `packaging/windows/dualsense-driver/` | **PASSED.** Option R (Rust) chosen for M2. Load needed clearing the PE FORCE_INTEGRITY bit |
| **M1** | Linux codec refactor | Extract the transport-independent contract from `dualsense.rs` into `inject/dualsense_proto.rs` (`DUALSENSE_RDESC`, `serialize_state`, `parse_ds_output`, feature blobs); **fix `DS_FEATURE_CALIBRATION` 42 → 41**; Linux backend keeps passing | Pure refactor; keep Linux loopback green |
| **M2** | UMDF2 driver | The HID minidriver + INF + signed `.cat` (test-signed for dev). **Language per M0(a):** Rust if the build spike is solid, else the `vhidmini2`-derived C shim. INF carries the required UMDF directives (`UmdfKernelModeClientPolicy=AllowKernelModeClients`, `UmdfMethodNeitherAction=Copy`, `UmdfFileObjectPolicy=AllowNullAndUnknownFileObjects`, `UmdfFsContextUsePolicy=CanUseFsContext2`), root-enumerated `HIDClass`, filter under `mshidumdf.sys` | Pioneering if Rust; manual inverted-call queue is the hard part |
| **M3** | Rust host bridge | `inject/dualsense_windows.rs`: `SwDeviceCreate` per-session device (hold `HSWDEVICE` for the session) + the inverted-call IOCTL channel, feeding `0x01` and surfacing `0x02` as `HidOutput` — reusing `dualsense_proto.rs` | Channel design (single control device + inverted-call IOCTL vs shared-memory) |
| **M4** | Un-gate the seam + negotiation | New `PadBackend::DualSense` Windows arm; relax the `#[cfg(target_os="linux")]` guards on DualSense/DualShock4 in `pick_gamepad`/`resolve_gamepad` to `any(linux, windows)`; wire `GamepadPref::DualSense` resolution | Small; `dualshock4.rs` is the template |
| **M5** | On-glass E2E | Client → host → virtual DualSense → game, with adaptive triggers / lightbar / touchpad / motion / rumble round-tripping; latency check | RTX box; the real proof |
| **M6** | Packaging / installer | Vendor + sign the driver; `install-dualsense.ps1` (fleet vs public variant); gate the payload in `punktfunk-host.iss`; complete the **EV cert + attestation** submission for the public build | EV-cert procurement + Partner Center turnaround are lead-time items — start early |
## Decision matrix
| Option | Adaptive triggers / DS5 identity | Effort | When it's right |
|---|---|---|---|
| **A. UMDF2 virtual DualSense** (parity) | ✅ full (pending the M0 gate) | medium — **UMDF, same tier as SudoVDA** (was mis-scoped as "kernel/large" in the 2026-06-20 draft) | the goal — matches the Linux host |
| **B. ViGEm DS4** (interim) | ❌ never (DS4 ceiling) | small | quick PS-pad on Windows w/ touchpad/motion/lightbar/rumble, no adaptive triggers |
| **C. Hybrid** | A for DS5 clients, B/Xbox 360 fallback | A + small | belt-and-suspenders once A exists |
| **D. Defer** | — | — | if the M0 gate fails (esp. output `0x02` fidelity), or a higher-ROI item wins the slot |
Xbox 360 (XInput, via ViGEm) is already implemented and covers most Windows games regardless; Xbox
One/Series fold into it on Windows. Windows-host DualShock 4 (ViGEm) remains separately deferred.
## Risk register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Output report `0x02` (adaptive triggers) never reaches the driver write callback | medium | **fatal** to the value prop | M0(b) measures it directly; if it fails → Option D |
| `054C:0CE6` UMDF2 device not accepted as a real DualSense by WGI/GameInput/Steam | low-med | fatal | M0(b); HIDMaestro suggests recognition works, but confirm |
| Rust UMDF driver pioneering risk (first of its kind; no safe WDF/HID wrappers; symbol coverage unproven) | medium | schedule | M0(a) build spike; **Option C (C shim) as the de-risked M2 fallback** |
| EV cert + Partner Center attestation lead time / friction | medium | schedule | Start procurement at M0; lean on the SudoVDA UMDF submission precedent |
| EV key non-exportable → can't sign in CI | high | low | Accept a manual offline sign+submit step; vendor the Microsoft-resigned `.cat` |
| `SwDeviceCreate` device lifetime tied to the host process handle | known | low | Hold `HSWDEVICE` for the session lifetime (matches Linux UHID fd semantics) |
| `windows-drivers-rs` transient toolchain breaks (e.g. LLVM-22 bindgen, Disc. #591) | low | low | Pin LLVM 21.1.2; not a fundamental defect |
| `DS_FEATURE_CALIBRATION` 42-byte blob rejected by strict Windows consumers | low | low | Trim to 41 bytes in M1 |
## Open questions
1. **Driver channel design** (unknown): punktfunk's own driver↔host protocol — simplest is a private
control device with an inverted-call IOCTL for input + IOCTLs for output/feature, vs HIDMaestro's
shared-memory section. `vhidmini2` has *no* service channel (it self-generates via a timer), so this
must be designed fresh (or read out of HIDMaestro/DsHidMini source). **Resolve in M3.**
2. **Rust UMDF symbol coverage** (unknown — the M0(a) gate): are all needed WDF symbols
(`WdfIoQueueCreate`, `WdfFdoInitSetFilter`, `WdfRequestRetrieveOutputMemory`, manual-queue APIs,
`WDF_IO_QUEUE_CONFIG_INIT`) generated/usable from `wdk-sys` for the UMDF target?
3. **Attestation for a Rust-authored `.dll`** (likely fine, unverified): attestation re-signs the
`.cat` and treats the `.dll` opaquely (allowed type), so language *should* be irrelevant to
signing — but Microsoft has not explicitly settled the WHCP path for Rust drivers. Confirm via a
Partner Center dry-run.
4. **Single multi-driver CAB** (unknown, operationally useful): can one Partner Center submission carry
*both* the existing SudoVDA driver and the new DualSense driver? Multi-driver CABs are supported in
general; unverified for this account.
5. **EV cert + Partner Center mechanics** (unknown): exact cost/turnaround; whether a cloud-HSM EV
option lets CI sign, or whether it must be a manual offline step (most likely the latter).
6. **HidHide** (carried over): needed at all on a usually-headless host, or only when a physical pad is
attached?
7. **Min-OS / UMDFVERSION target** (unknown): which `UmdfLibraryVersion` / WDK to target for the widest
Win10/11 install base, consistent with punktfunk's existing host support matrix.
8. **DsHidMini end-user signing tier** (unknown): self-signed vs attestation in its WixSharp MSI —
useful as a second public-distribution data point.
@@ -0,0 +1,432 @@
# Windows Host — Architecture, Status & Roadmap
> **Single source of truth** for the punktfunk Windows streaming host: the all-Rust **`pf-vdisplay`
> IddCx virtual-display driver** + **IDD-push zero-copy capture** + **NVENC/AMF/QSV encode**, shipped as
> a signed Inno Setup installer with a LocalSystem SCM service. Live-validated on the RTX box through
> 5120×1440@240 HDR, the secure desktop (lock/UAC), and a fullscreen game.
>
> This file **consolidates and replaces** five earlier docs (now retired into it): the rewrite design
> plan, the Goal-1 staged-refactor plan, the audit, the audit-remediation tracker, and the
> fullscreen-game capture-bug analysis. See the [consolidation note](#appendix--consolidation-note) for
> what moved where. **Last updated 2026-06-26.** Work lives on branch **`windows-host-goal1`** (off
> `main`, not yet merged).
---
## 1. Status at a glance
The Windows host is **functionally complete and validated on glass.** The hard, high-risk proofs are done:
a clean all-Rust IddCx driver on the unified `windows-drivers-rs` stack (the `/INTEGRITYCHECK` answer +
the `iddcx` `wdk-sys` binding), IDD-push zero-copy capture at 5K@240 HDR, the secure desktop (Winlogon /
UAC / lock), and the host re-architected into a clean, typed, layered shape. What remains is
**non-blocking**: hygiene (host `unsafe` lints, a few `OwnedHandle` rollouts), the SudoVDA backend
deletion (decoupled, not yet removed), a driver robustness gap (slot reclaim), the gamepad-driver
unification (M4), and old-monolith cleanup (M6) — plus the merge to `main`.
One framing correction baked into this doc: the host was **not** greenfield-rebuilt as the original plan
imagined. It was **refactored in place** via a staged, behavior-preserving sequence (the "Goal-1" plan),
which kept the live-validated host working at every step. The driver, by contrast, *was* rebuilt fresh
(the new `packaging/windows/drivers/pf-vdisplay/` tree).
### Scorecard (verified against `windows-host-goal1` HEAD, 2026-06-25)
| Item | Status | Evidence |
|---|---|---|
| **Goal 1** — clean, layered host architecture | ✅ **DONE** | `config.rs` (`HostConfig`), `session_plan.rs` (`SessionPlan`), `SessionContext`, `windows/`+`linux/` confinement (`38c68c3`), `VirtualDisplayManager` (§2.5), `EncoderCaps` (`0ccd0fe`) |
| **Goal 2** — drop every trace of SudoVDA | ✅ **DONE** | reach-in decoupled (F1: `d638a93`/`e60cda3` → `win_adapter`/`win_display`), then the `sudovda.rs` backend + the dual-backend select **deleted** (this branch) — pf-vdisplay is the sole Windows virtual-display backend |
| **Goal 3** — minimize `unsafe` + P0 lints | 🟡 **PARTIAL** (**box-validated**) | driver `deny(unsafe_op_in_unsafe_fn)` (`a755d6e`); **`OwnedHandle`/RAII rollout** — `idd_push.rs` (`011607e`, view-leak fix) + `service.rs` child/job (`4c95ba7`) + the 3 gamepad backends via shared `gamepad_raii.rs` (`e5c2b4e`) + the IDD-push `KeyedMutexGuard` hot loop (`6585643`) + the **SCM STOP/SESSION events** → `OnceLock<OwnedHandle>` (`61c02e6`, runtime-validated: clean ~1 s `sc stop`); **driver `pod_init!`** (`bf57704`, 27→1). **On-glass clean: host clippy `-D warnings` + driver build** (RTX box; `bd05bc8` fixed 11 lints the gate surfaced). The host-side raw-handle smuggling is fully retired; only host-crate P0 lints remain (deferred — churn>value) |
| **M0** — proto ABI + driver toolchain + `/INTEGRITYCHECK` + `iddcx` | ✅ **DONE** | `pf-driver-proto`; vendored `windows-drivers-rs` 0.5.1; `clear-force-integrity.ps1`; CI-green |
| **M1** — new IddCx driver, first light + HDR | ✅ **DONE (on-glass)** | STEP 08 (`d7a9fbf`-`cd59151`); HDR live ("Mac connects WITH HDR", `6399d28`) |
| **M2** — IDD-push capture + NVENC, glass-to-glass | ✅ **DONE (on-glass)** | 5120×1440@240 HDR zero-copy; integrated into the host path |
| **M3** — service / input / audio / **secure desktop** | ✅ **DONE (on-glass)** | secure desktop (lock/UAC) **owner-confirmed 2026-06-25** — IDD-push captures it + input reaches it |
| **M4** — gamepad drivers onto the unified stack | ❌ **OPEN** | `pf_dualsense`/`pf_xusb` still standalone (`packaging/windows/{dualsense,xusb}-driver/`), not in `drivers/` workspace |
| **M5** — WGC/DDA fallback reshape + GameStream-on-pipeline + AMF/QSV | 🟡 **PARTIAL** | fallbacks exist (`wgc.rs`/`wgc_relay.rs`/`dxgi.rs`), not reshaped onto the new seams; AMF/QSV CI-only (no lab hw) |
| **M6** — cut over + delete the old monoliths | 🟡 **PARTIAL** | old `vdisplay-driver/` tree deleted (`a2bd0cd`); host monoliths + bring-up scaffolding (`spawn_observer`/`DebugBlock`) remain |
| **Game-capture bug (GB1)** — fullscreen game breaks IDD-push | ✅ **FIXED** | resolution-listening recovery (`c87bfe0`) + open-time DDA failover (`f98ab07`) + driver guard/log (`789ad49`) |
| **Audit P0/P1/P2** | ✅ mostly **RESOLVED** | watchdog, `SET_RENDER_ADAPTER`, log gate, mode bounds, IDD-push fallback, F1, out-ring/HDR-ring, proto asserts — all landed; **open:** host hygiene (§8), E1 completion, slot-reclaim |
---
## 2. Architecture (what is on disk)
### 2.1 Layering & crates
- **`crates/punktfunk-host`** — one shared host crate (Linux + Windows; not split). Platform code is
confined under per-module `windows/`+`linux/` folders behind `#[cfg]` seams (`capture/{windows,linux}/`,
`encode/{windows,linux}/`, `inject/{windows,linux}/`, `audio/{windows,linux}/`, `vdisplay/{windows,linux}/`,
and top-level `src/windows/`+`src/linux/`). Module names stay flat (`#[path]`), so caller paths are
platform-agnostic.
- **`crates/punktfunk-core`** — the one linked protocol/FEC/crypto/QUIC core (unchanged here).
- **`crates/pf-driver-proto`** — the owned, `no_std` host↔driver ABI (frame ring + control plane +
gamepad SHM), consumed by both the host crate and the driver workspace (§2.7).
- **`packaging/windows/drivers/`** — the unified driver workspace on `microsoft/windows-drivers-rs`
(vendored 0.5.1 + an `iddcx` subset): members `pf-vdisplay` (the IddCx display driver), `wdk-iddcx`
(the typed IddCx DDI wrappers), `wdk-probe` (the CI link/surface gate), `vendor/{wdk-build,wdk-sys}`.
### 2.2 Session resolution — `HostConfig → SessionPlan → SessionContext` (Goal-1 realized)
The old ~40-knob `PUNKTFUNK_*` env soup, re-read and recomputed in three places, is replaced by a
resolve-once pipeline:
- **`config.rs` `HostConfig`** — typed config parsed **once** from `host.env`/env/flags
(`idd_push`/`encoder_pref`/`no_wgc`/`capture_backend`/`render_adapter`/`secure_dda`/`ten_bit`/`zerocopy`/…).
Each field's parser is byte-identical to the read it replaced. (Runtime-mutated Linux session vars from
`vdisplay::apply_session_env`, and single-use local tuning knobs, are deliberately kept live — see the
`config.rs` header.)
- **`session_plan.rs` `SessionPlan { display, capture, topology, encoder, input_format, bit_depth, hdr,
pipeline_depth }`** — a `Copy` plan resolved **once** per session from `HostConfig` + the negotiated
bit-depth, logged, and threaded through `build_pipeline`. `CaptureBackend::resolve()` is the one
resolver (`IddPush | Dda | Wgc`); `resolve_topology` decides `SingleProcess | TwoProcessRelay`. This
killed the latent capture/encode backend-disagreement bug.
- **`SessionContext`** — bundles the session entry's ~13 args (was `#[allow(too_many_arguments)]`) and the
plane receivers into one owned struct moved into the stream thread.
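
The resolve-once pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the code in `config.rs`/`session_plan.rs`: the three `HostConfig` fields are a hypothetical subset, the backend precedence and the topology rule are guesses made for the example, and the real `CaptureBackend::resolve()` may take different inputs.

```rust
/// Capture backends named in the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CaptureBackend { IddPush, Dda, Wgc }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Topology { SingleProcess, TwoProcessRelay }

/// Hypothetical subset of the typed config; the real struct has more fields.
#[derive(Clone, Debug, Default)]
pub struct HostConfig {
    pub idd_push: bool,
    pub no_wgc: bool,
    pub ten_bit: bool,
}

/// A small `Copy` plan, computed once and passed by value afterwards.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SessionPlan {
    pub capture: CaptureBackend,
    pub topology: Topology,
    pub bit_depth: u8,
    pub hdr: bool,
}

impl CaptureBackend {
    /// ASSUMED precedence: IDD-push if enabled, else WGC unless disabled, else DDA.
    pub fn resolve(cfg: &HostConfig) -> Self {
        if cfg.idd_push {
            Self::IddPush
        } else if !cfg.no_wgc {
            Self::Wgc
        } else {
            Self::Dda
        }
    }
}

/// ASSUMED rule: only the WGC path uses the two-process relay.
pub fn resolve_topology(capture: CaptureBackend) -> Topology {
    match capture {
        CaptureBackend::Wgc => Topology::TwoProcessRelay,
        _ => Topology::SingleProcess,
    }
}

impl SessionPlan {
    /// Resolve everything once; capture and encode then read the same plan,
    /// so they cannot disagree about the backend.
    pub fn resolve(cfg: &HostConfig, negotiated_bit_depth: u8) -> Self {
        let capture = CaptureBackend::resolve(cfg);
        let bit_depth = if cfg.ten_bit && negotiated_bit_depth >= 10 { 10 } else { 8 };
        Self {
            capture,
            topology: resolve_topology(capture),
            bit_depth,
            hdr: bit_depth == 10, // assumption: HDR follows the 10-bit path
        }
    }
}
```

Because the plan is `Copy` and built in one place, every later stage receives the same decision instead of re-deriving it from the environment.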
### 2.3 Ownership model — `VirtualDisplayManager` + `MonitorLease` (§2.5 realized)
A single **OnceLock `VirtualDisplayManager`** (`vdisplay/windows/manager.rs`) owns a *typed*
`Arc<OwnedHandle>` control-device handle (no raw-`isize` cross-thread smuggle), the refcounted
Idle/Active/Lingering state machine, and the monitor generation (`AtomicU64`). Both Windows backends
(`pf_vdisplay`, `sudovda`) shrank to thin `VdisplayDriver` impls (`open`/`add_monitor`/`remove_monitor`/
`ping`) behind it; `MonitorKey = Guid | Session(u64)`. A per-session `MonitorLease`'s `Drop` releases the
refcount (a stale lease can't tear down a fresh monitor). This deleted the old `CURRENT_MON_GEN`/`MON_GEN`/
two-`MGR`/`IDD_PERSIST`/`IDD_SETUP_LOCK`/`IDD_SESSION_STOP` globals. Validated on glass: **0 leaked active
monitors across a reconnect storm**, A/B-equivalent to the shipping host. (The 5-agent map found
`CURRENT_MON_GEN` had been **write-only** — the per-frame "monitor-gen bail" was never wired — so the gen
lives on the manager + lease only.)
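
The generation-stamped lease can be illustrated with a small sketch. It is an assumption-laden model, not `manager.rs`: `Manager`, `force_remove`, and `snapshot` are hypothetical names, the Idle/Active/Lingering state machine and the control-device handle are omitted, and a plain `Mutex` stands in for whatever synchronization the real manager uses.

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct State {
    generation: u64, // bumped whenever a fresh monitor is created or torn down
    refcount: u32,   // live leases on the current monitor
}

/// Hypothetical stand-in for the manager: shared state behind a mutex.
#[derive(Clone, Default)]
pub struct Manager {
    state: Arc<Mutex<State>>,
}

/// A per-session lease stamped with the generation it was issued for.
pub struct MonitorLease {
    mgr: Manager,
    generation: u64,
}

impl Manager {
    pub fn acquire(&self) -> MonitorLease {
        let mut s = self.state.lock().unwrap();
        if s.refcount == 0 {
            s.generation += 1; // first lease creates a fresh monitor
        }
        s.refcount += 1;
        MonitorLease { mgr: self.clone(), generation: s.generation }
    }

    /// Models a forced teardown: outstanding leases become stale.
    pub fn force_remove(&self) {
        let mut s = self.state.lock().unwrap();
        s.generation += 1;
        s.refcount = 0;
    }

    /// (generation, refcount), for inspection.
    pub fn snapshot(&self) -> (u64, u32) {
        let s = self.state.lock().unwrap();
        (s.generation, s.refcount)
    }
}

impl Drop for MonitorLease {
    fn drop(&mut self) {
        let mut s = self.mgr.state.lock().unwrap();
        // A stale lease (older generation) must not release the fresh monitor.
        if s.generation == self.generation && s.refcount > 0 {
            s.refcount -= 1;
        }
    }
}
```

The key property is in `Drop`: the release only counts when the lease's generation matches the manager's, so a lease that outlives its monitor becomes a no-op.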
### 2.4 The seam traits
`VirtualDisplay`/`VirtualOutput`/`VirtualLease` (RAII keepalive = release), `Capturer`
(`next_frame`/`try_latest`/`set_active`/`hdr_meta`/`pipeline_depth`), `Encoder`
(`submit`/`caps`/`request_keyframe`/`set_hdr_meta`/`invalidate_ref_frames`/`poll`/`flush`),
`AudioCapturer`/`VirtualMic`/`InputInjector`/`PadManager`. Realized tightenings: the capturer takes the
desired `OutputFormat { gpu, hdr }` **in** (killed the `capture → encode::windows_resolved_backend()`
back-reference recomputed in `dxgi.rs`); and `Encoder::caps() -> EncoderCaps { supports_rfi,
supports_hdr_metadata }` lets the session glue route loss-recovery by query (only Windows direct-NVENC
overrides it; the GameStream loop gates the RFI path on `supports_rfi`).
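
Routing loss recovery by capability query might look like the sketch below. Only `caps`, `request_keyframe`, `invalidate_ref_frames`, and the two `EncoderCaps` fields come from the text; the method signatures, the `Recovery` enum, `recover_from_loss`, and the two demo encoders are hypothetical and exist only to make the example self-contained.

```rust
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct EncoderCaps {
    pub supports_rfi: bool,
    pub supports_hdr_metadata: bool,
}

/// Reduced encoder seam; signatures are assumptions for this sketch.
pub trait Encoder {
    /// Default: no optional capabilities. Backends override when they have them.
    fn caps(&self) -> EncoderCaps {
        EncoderCaps::default()
    }
    fn request_keyframe(&mut self);
    fn invalidate_ref_frames(&mut self, first: u64, last: u64);
}

#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    Rfi,
    Keyframe,
}

/// Session glue: ask the encoder what it can do instead of matching on backend type.
pub fn recover_from_loss(enc: &mut dyn Encoder, first_lost: u64, last_lost: u64) -> Recovery {
    if enc.caps().supports_rfi {
        enc.invalidate_ref_frames(first_lost, last_lost);
        Recovery::Rfi
    } else {
        enc.request_keyframe();
        Recovery::Keyframe
    }
}

/// Demo encoder that overrides `caps`, as the direct-NVENC backend is said to.
#[derive(Default)]
pub struct DemoRfiEncoder {
    pub invalidated: Vec<(u64, u64)>,
}

impl Encoder for DemoRfiEncoder {
    fn caps(&self) -> EncoderCaps {
        EncoderCaps { supports_rfi: true, supports_hdr_metadata: true }
    }
    fn request_keyframe(&mut self) {}
    fn invalidate_ref_frames(&mut self, first: u64, last: u64) {
        self.invalidated.push((first, last));
    }
}

/// Demo encoder that keeps the default caps and falls back to keyframes.
#[derive(Default)]
pub struct DemoBasicEncoder {
    pub keyframes: u32,
}

impl Encoder for DemoBasicEncoder {
    fn request_keyframe(&mut self) {
        self.keyframes += 1;
    }
    fn invalidate_ref_frames(&mut self, _first: u64, _last: u64) {}
}
```

The default `caps()` keeps new backends safe by construction: unless a backend opts in, the glue takes the keyframe path.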
### 2.5 Capture — IDD-push primary (normal **and** secure desktop), WGC/DDA fallback, GB1 recovery
**IDD-push is the universal primary path.** Capture comes straight from the driver's shared keyed-mutex
texture ring (`capture/windows/idd_push.rs`) — no Desktop Duplication, no `win32u` reparenting hook. The
host creates the ring; the driver opens it (permissive `D:(A;;GA;;;WD)` SDDL). The generation-tagged
`latest = gen<<40 | seq<<8 | slot` stale-ring reject kills the HDR-flip garbage frame; a host-owned
3-slot `OUT_RING` rotated per frame is the texture-ownership contract that enables `pipeline_depth=2`
(convert/copy on the 3D engine overlapping NVENC on the ASIC). It captures the **secure desktop**
(Winlogon/UAC/lock) directly (validated 2026-06-25), so there is no separate secure capturer in the
primary path.
- **Open-time fallback:** `IddPushCapturer::open` waits a bounded ~4 s for a *first frame* (not just
`DRV_STATUS_OPENED`); on attach failure it returns the keepalive back so `capture.rs` opens **DDA** on
the same `WinCaptureTarget` — never a 20 s black bail (audit §5.1, `ed58365`/`f98ab07`).
- **Mid-session game mode-set recovery (GB1, fixed):** the 250 ms poll follows the display's *actual*
resolution (`win_display::active_resolution`, CCD/GDI) and recreates the ring on any descriptor change
(size **or** HDR) → the driver re-attaches → frames resume at the game's mode, **no reconnect**. If a
change is unrecoverable (e.g. an exclusive flip), a `recovering_since` clock drops the session after 3 s
so the client reconnects cleanly. No protocol bump was needed — the host reads the resolution straight
from Windows (`c87bfe0`; the driver's `publish()` width/height guard + flushed log is `789ad49`).
- **WGC + DDA** stay as demoted fallbacks for non-IddCx hardware (`wgc.rs`/`dxgi.rs`). The two-process WGC
secure-desktop relay (`wgc_relay.rs`) is no longer load-bearing now that IDD-push handles the secure
desktop; it is kept recoverable but slated for M5/M6 cleanup.
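
The `latest = gen<<40 | seq<<8 | slot` word described in this section can be packed and checked as in the sketch below. The shift amounts come from the text; the field widths they imply (8-bit slot, 32-bit sequence, 24-bit generation), the Rust field types, and the `accept` helper are assumptions, and the real `FrameToken` in `pf-driver-proto` may differ.

```rust
/// Assumed layout of the 64-bit `latest` word:
/// bits 0..8 slot, bits 8..40 seq, bits 40..64 generation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FrameToken {
    pub generation: u32, // only the low 24 bits are stored
    pub seq: u32,
    pub slot: u8,
}

const GENERATION_MASK: u32 = 0x00FF_FFFF;

impl FrameToken {
    pub const fn pack(self) -> u64 {
        (((self.generation & GENERATION_MASK) as u64) << 40)
            | ((self.seq as u64) << 8)
            | (self.slot as u64)
    }

    pub const fn unpack(raw: u64) -> Self {
        Self {
            generation: (raw >> 40) as u32,
            seq: (raw >> 8) as u32, // the cast keeps the low 32 bits
            slot: raw as u8,        // the cast keeps the low 8 bits
        }
    }
}

/// Hypothetical consumer-side check: drop any token published for an older ring,
/// which is how a stale frame from before an HDR flip would be rejected.
pub fn accept(raw: u64, current_generation: u32) -> Option<FrameToken> {
    let token = FrameToken::unpack(raw);
    if token.generation == (current_generation & GENERATION_MASK) {
        Some(token)
    } else {
        None
    }
}
```

Packing all three fields into one word lets the publisher update them with a single atomic store, so a reader never sees a slot from one ring paired with a generation from another.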
### 2.6 Encode — NVENC / AMF / QSV / software; `EncoderCaps`; HDR
`encode/windows/` dispatches per DXGI adapter vendor (`open_video`): **NVENC** (NVIDIA, direct SDK,
`nvenc.rs` — caps-probe-before-configure, bitrate-clamp binary search, true RFI over the DPB, in-band
ST.2086/CLL SEI), **AMF**/**QSV** (AMD/Intel via libavcodec, `ffmpeg_win.rs` — system-readback default,
opt-in zero-copy D3D11; CI-only, no lab hardware), or **software** H.264 (`sw.rs`). HDR (10-bit) forces
HEVC Main10 + BT.2020 PQ; the client auto-detects PQ from the VUI. The encoder adapts to a mid-session
size/format/HDR change per frame (tears down + re-inits), so the GB1 capturer's resolution changes are
handled downstream with no API change.
### 2.7 Host↔driver ABI — `pf-driver-proto`
One `no_std` crate, both build graphs. Owns the **frame plane** (`SharedHeader`, `FrameToken { generation,
seq, slot }` with `pack`/`unpack`, `Global\pfvd-*` name helpers), the **control plane** (fresh interface
GUID — not SudoVDA's `e5bcc234`; contiguous `0x900` IOCTL ops; `u64` session id; a real `GET_INFO` version
handshake the host **asserts** + bails on mismatch), and the **gamepad SHM** (`XusbShm` 64 B, `PadShm`
256 B incl. `device_type`). `bytemuck`-`Pod` + `size_of` **and** `offset_of!` asserts make ABI drift a
**compile error** (`95dcef3`). The host-side gamepad consumers derive their layouts from here; the
**driver-side** gamepad drivers do not yet (M4).
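
Turning ABI drift into a compile error can be done with `const` assertions on size and field offsets. The sketch below uses only the standard library (the `bytemuck` derive is left out to keep it self-contained); `PadShmSketch` and its field layout are hypothetical, and only the 256-byte total and the presence of a `device_type` field come from the text.

```rust
use core::mem::{offset_of, size_of};

/// Hypothetical 256-byte shared-memory block; NOT the real `PadShm` layout.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PadShmSketch {
    pub magic: u32,
    pub device_type: u32,
    pub seq: u64,
    pub report: [u8; 240],
}

// Each assertion is evaluated at compile time. If a field is added, removed,
// or reordered, the build fails in both the host and the driver graphs.
const _: () = assert!(size_of::<PadShmSketch>() == 256);
const _: () = assert!(offset_of!(PadShmSketch, magic) == 0);
const _: () = assert!(offset_of!(PadShmSketch, device_type) == 4);
const _: () = assert!(offset_of!(PadShmSketch, seq) == 8);
const _: () = assert!(offset_of!(PadShmSketch, report) == 16);

/// Convenience for callers that want the size as a constant.
pub const PAD_SHM_SKETCH_SIZE: usize = size_of::<PadShmSketch>();
```

Checking offsets as well as the total size matters: swapping two same-sized fields keeps `size_of` unchanged but silently breaks the other side of the ABI.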
### 2.8 The `pf-vdisplay` IddCx driver
All-Rust UMDF IddCx driver on `windows-drivers-rs` + the `iddcx` `wdk-sys` subset. STEP 08 landed
(`packaging/windows/drivers/pf-vdisplay/src/`): `entry.rs` (DriverEntry + `IDD_CX_CLIENT_CONFIG`, 15
callbacks), `adapter.rs` (caps + FP16 + `SET_RENDER_ADAPTER`), `monitor.rs`/`callbacks.rs` (the `*2` HDR
mode DDIs, EDID verbatim), `swap_chain_processor.rs` (the worker, `SetDevice`-retry + top-of-loop
`terminate`), `frame_transport.rs` (the `FramePublisher` on `pf_driver_proto::frame`), `control.rs` (the
typed IOCTL dispatch + host-gone **watchdog** + mode bounds). Self-signed-loadable under Secure Boot
(FORCE_INTEGRITY cleared post-link). **Known gaps:** ownership state is still partly process-global
(`MONITOR_MODES`/`NEXT_ID`/`ADAPTER`/`DEVICE_POOL`) with `EvtCleanupCallback` on the **WDFDEVICE** (not
per-`IDDCX_MONITOR`), see E1 in §4; and IddCx monitor **slot** reclaim on REMOVE has a fix landed but
not yet validated on glass (the ghost-monitor wedge, §4).
### 2.9 Service, packaging, installer
A `LocalSystem` SCM supervisor (`service.rs`) token-retargets and `CreateProcessAsUserW`s `serve` into the
console session (so `SendInput` reaches the streamed desktop + the secure desktop), relaunches on
session-change, and kills-on-close via a Job Object. Shipped as a **signed Inno Setup** `setup.exe`
(`packaging/windows/`, `windows-host.yml`) that bundles the **new** `pf-vdisplay` driver
(`pf_vdisplay.inx` in-tree, old `vdisplay-driver/` tree deleted) + FFmpeg DLLs and delegates to `service
install`. GameStream (Moonlight) is kept but the installer/service default to secure `serve` (GameStream
opt-in).
---
## 3. Validated invariants — preserve, do not regress
These are expensive empirical wins; keep them intact when touching the code:
- **Frame transport:** host-creates/driver-opens keyed-mutex ring; generation-tagged stale-ring reject;
0 ms try-acquire / drop-on-full publish (never block the swap-chain thread); the `OUT_RING` rotation +
`pipeline_depth=2` overlap; `repeat_last` rotates into a fresh out-ring slot (depth-safe).
- **Driver internals:** `edid.rs` (128-byte EDID + CTA-861.3 HDR block, dual checksums); the FP16 HDR
recipe (`CAN_PROCESS_FP16` + the `*2` DDIs + gamma/HDR accept-stubs + `HIGH_COLOR_SPACE`); `DEVICE_POOL`
per render-LUID (NVIDIA UMD/VRAM leak fix); target-id stamped on the monitor context; the two swap-chain
leak fixes (borrow `IDXGIDevice` across `SetDevice` retries; check `terminate` at the loop top).
- **Monitor lifecycle:** serialized ADD/REMOVE/teardown; restore CCD topology **before** REMOVE; the
generation-stamped lease (a stale lease can't tear down a fresh monitor); 0-leak across reconnects.
- **HDR color math:** `hdr.rs` (pure, unit-tested, ST.2086 + big-endian SEI); the FP16→P010/Rgb10a2
converters + `hdr_p010_selftest`; the cursor decomposition.
- **NVENC tuning:** caps-probe-before-configure (10-bit→8-bit graceful downgrade); bitrate-clamp binary
search (each GPU's real ceiling); true RFI over the DPB; CBR / infinite-GOP / P-only / ~1-frame VBV.
- **Gamepad recipe:** the SwDeviceCreate identity (enumerator with no `_`; mandatory completion callback;
synthesized DS5 compat-ids; non-null per-pad `ContainerId`); one `pf_dualsense` serving DualSense+DS4
via a `device_type` byte; XUSB declining `WAIT_*`; per-pad index via `pszDeviceLocation`.
- **Session glue:** the trait seam + RAII keepalive teardown; host-lifetime shared services + per-session
gamepads; the encode|send split + microburst pacing; `build_pipeline_with_retry` permanent-vs-transient
classification; the GameStream `VideoPacketizer` (GF8 Cauchy, Moonlight byte-exact); the pairing/trust
handshake.
- **Core discipline:** no async on the per-frame path; `pf-driver-proto` is the single ABI source
(drift = compile error); the version handshake the host asserts.
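The bitrate-clamp binary search from the NVENC bullet can be sketched as a pure search over a probe closure. This is an illustration, not the host's code: `clamp_bitrate` and its `accepts` probe are hypothetical names, and the sketch assumes acceptance is monotonic (if the GPU accepts a rate, it accepts every lower rate).

```rust
/// Largest bitrate in `floor..=requested` that `accepts` reports as usable,
/// or `None` if even `floor` is rejected (or `requested < floor`).
/// Assumes monotonic acceptance: an accepted rate implies all lower rates are accepted.
pub fn clamp_bitrate(
    requested: u64,
    floor: u64,
    mut accepts: impl FnMut(u64) -> bool,
) -> Option<u64> {
    if requested < floor || !accepts(floor) {
        return None;
    }
    if accepts(requested) {
        return Some(requested);
    }
    // Invariant: `lo` is accepted, `hi` is rejected.
    let (mut lo, mut hi) = (floor, requested);
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if accepts(mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    Some(lo)
}
```

In a real encoder the probe would be an attempt to configure a session at that rate, so the number of probes (logarithmic in the range) matters more than the arithmetic.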
---
## 4. Open work / next tasks (prioritized)
**P1 — ship-readiness / correctness**
1. **Merge `windows-host-goal1` → `main` + push** (outward-facing → confirm first). Pushing also runs the
full Windows CI matrix incl. the `amf-qsv` encode path, which local checks skip.
2. **Make IDD-push the default** — today it is gated behind `PUNKTFUNK_IDD_PUSH` (`config.rs` default
`false`); deployment sets it in `host.env`. Flip the code default (with the WGC/DDA fallback already in
place) so a fresh install runs the validated path, or document the `host.env` requirement explicitly.
3. **pf-vdisplay slot reclaim on REMOVE** (driver robustness) — 🟡 **fix landed, on-glass-validation
pending.** Sustained ADD/REMOVE churn wedged the driver (`ADD → 0x80070490 ERROR_NOT_FOUND`) because the
monitor id (EDID serial / `ConnectorIndex` / container GUID) was a **monotonic** `NEXT_ID`, never
reclaimed → IddCx accumulated a new OS target slot per cycle until exhaustion. `monitor.rs` now allocates
the **lowest free id** (`alloc_monitor_id`) and frees it on REMOVE, so a fresh ADD reuses the departed
monitor's target slot instead of orphaning it. CI-compile-gated; the wedge only reproduces under
sustained churn on the RTX box, so this needs an **on-glass reconnect-storm A/B** to confirm (the box is
ephemeral). Keep `packaging/windows/reset-pf-vdisplay.ps1` as the recovery until validated.
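The lowest-free-id idea from item 3 can be sketched as follows. This is a hedged illustration, not the body of `alloc_monitor_id`: the `MonitorIds` type and its methods are hypothetical, and the real driver may bound the id range or store the set differently.

```rust
use std::collections::BTreeSet;

/// Hypothetical id pool: hands out the lowest id not currently in use.
#[derive(Default)]
pub struct MonitorIds {
    in_use: BTreeSet<u32>,
}

impl MonitorIds {
    /// Allocate the lowest free id (ADD).
    pub fn alloc(&mut self) -> u32 {
        // Ids iterate in ascending order; the first gap is the lowest free id.
        let mut candidate = 0u32;
        for &id in &self.in_use {
            if id != candidate {
                break;
            }
            candidate += 1;
        }
        self.in_use.insert(candidate);
        candidate
    }

    /// Return an id to the pool (REMOVE). Returns false if it was not allocated.
    pub fn release(&mut self, id: u32) -> bool {
        self.in_use.remove(&id)
    }
}
```

Unlike a monotonic counter, the set of ids ever seen stays bounded by the peak number of concurrent monitors, which is the property the slot-exhaustion fix relies on.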
**P2 — hygiene / architecture completion** (the unsafe-reduction + stability priority)
4. **D1-host — host-crate P0 lints.** Add `#![deny(unsafe_op_in_unsafe_fn)]` +
`#![warn(clippy::undocumented_unsafe_blocks)]` to the host crate and fix the fallout (~30 of the 52
`unsafe fn`s need an inner `unsafe {}`). Stage it **per-module, Linux-first** (item-level `#[deny]` on
`linux/zerocopy/cuda.rs`/`egl.rs`, `encode/linux/vaapi.rs` — locally verifiable), then the Windows
modules (CI-gated), then promote to crate-level. The driver already has the deny.
5. **D2 — `OwnedHandle` / RAII rollout.** ✅ **DONE (complete).** `capture/windows/idd_push.rs` (`011607e`:
a `MappedSection` RAII for the mapping handle **+** the leaked `MapViewOfFile` view, + `OwnedHandle` for
the event / ring-slot shared handles); `windows/service.rs` (`4c95ba7`: the child process/thread + Job
handles, ~9 `CloseHandle` deleted); the **three gamepad backends** (`e5c2b4e`: a shared
`inject/windows/gamepad_raii.rs` — `Shm` for the section+view, `SwDevice` for the devnode — replacing the
duplicated `create_shm_section` + three hand-written `Drop`s); and the **SCM STOP/SESSION events**
(`61c02e6`: `AtomicIsize` raw-`isize` smuggle → `OnceLock<OwnedHandle>` the capture-free C handler reads,
owned for the process lifetime — also closes a latent close-then-signal window). **Runtime-validated on
the RTX box**: swapped in, `sc start` → RUNNING, `sc stop` → clean STOPPED in ~1 s (not a timeout-kill),
original restored. `manager.rs`/`pf_vdisplay.rs` already used the pattern.
6. **Hot-loop `KeyedMutexGuard` ✅ done** (`6585643`) — the IDD-push consume loop's hand-written
`AcquireSync`/`ReleaseSync` (with its "don't `?`-return between them or you leak the lock + stall the
driver" caveat) is now a RAII guard scoped to the convert/copy block: same release point (latency
unchanged), but leak-proof on any early return. **Driver `pod_init!` ✅** (`bf57704`, 27 `mem::zeroed` →
1). **Skipped `ThreadBound<T>`** (each `unsafe impl Send` wraps a distinct type — churn, no real gain) and
**scratched the IOCTL dispatcher** (`control.rs`'s `read_input<T>`/`write_output_complete<T>` are already
generic with minimal unsafe).
**On-glass build validation (RTX box, 2026-06-26).** Built this branch on the box in an isolated worktree:
**host `cargo clippy -p punktfunk-host --features nvenc -D warnings` = CLEAN**, **driver `cargo build` =
CLEAN** — validating the whole session's Windows + driver work on real hardware. The clippy gate (which the
goal1/§2.5 work never ran — it used `cargo check`) surfaced + fixed 11 lint issues (`bd05bc8`: 9 redundant
`as *mut c_void`, an `if_same_then_else`, an `unused_unsafe` in `pod_init!`). The only remaining item is a
runtime **latency A/B** for the `KeyedMutexGuard` (provably equivalent: same release point), if a deeper
check is wanted.
7. **D1-host P0 lints — deferred (low value / high churn).** A crate-wide `#![deny(unsafe_op_in_unsafe_fn)]`
produced 100+ FFI-wrap sites across the Linux modules; it *wraps* unsafe (discipline) rather than
reducing it and doesn't improve stability, so it was deprioritized vs the `OwnedHandle`/RAII reductions
above. Revisit as a final discipline pass (staged per-module) if desired.
8. **M6 scaffolding cleanup** — delete the bring-up diagnostics (`spawn_observer`/`DebugBlock` in
`idd_push.rs`) and, once full parity is proven on glass, the host monoliths.
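The RAII guard from item 6 can be sketched like this. It is a stand-in, not the code in `idd_push.rs`: the `KeyedMutex` trait, `CountingMutex`, and `convert_frame` are hypothetical, and the real guard would wrap `IDXGIKeyedMutex::AcquireSync`/`ReleaseSync` rather than a trait.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq, Eq)]
pub struct AcquireError;

/// Hypothetical abstraction over the keyed-mutex calls.
pub trait KeyedMutex {
    fn acquire_sync(&self, key: u64, timeout_ms: u32) -> Result<(), AcquireError>;
    fn release_sync(&self, key: u64);
}

/// Holds the lock for its lifetime; releases on drop, including early returns.
pub struct KeyedMutexGuard<'a, M: KeyedMutex> {
    mutex: &'a M,
    release_key: u64,
}

impl<'a, M: KeyedMutex> KeyedMutexGuard<'a, M> {
    pub fn acquire(
        mutex: &'a M,
        acquire_key: u64,
        release_key: u64,
        timeout_ms: u32,
    ) -> Result<Self, AcquireError> {
        mutex.acquire_sync(acquire_key, timeout_ms)?;
        Ok(Self { mutex, release_key })
    }
}

impl<M: KeyedMutex> Drop for KeyedMutexGuard<'_, M> {
    fn drop(&mut self) {
        self.mutex.release_sync(self.release_key);
    }
}

/// Test double that records whether the lock is held and how often it was released.
#[derive(Default)]
pub struct CountingMutex {
    pub held: Cell<bool>,
    pub releases: Cell<u32>,
}

impl KeyedMutex for CountingMutex {
    fn acquire_sync(&self, _key: u64, _timeout_ms: u32) -> Result<(), AcquireError> {
        // Fails if already held (models a 0 ms try-acquire).
        if self.held.replace(true) {
            return Err(AcquireError);
        }
        Ok(())
    }
    fn release_sync(&self, _key: u64) {
        self.held.set(false);
        self.releases.set(self.releases.get() + 1);
    }
}

/// The guard is scoped to the convert/copy block; an early `Err` still releases.
pub fn convert_frame<M: KeyedMutex>(mutex: &M, fail: bool) -> Result<(), &'static str> {
    let _guard = KeyedMutexGuard::acquire(mutex, 0, 0, 0).map_err(|_| "busy")?;
    if fail {
        return Err("convert failed");
    }
    Ok(())
}
```

The release point is the end of the guard's scope, so keeping the scope as tight as the old acquire/release pair leaves latency unchanged.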
**Explicitly NOT doing (stability decision): E1 — driver `DeviceContext` ownership + per-`IDDCX_MONITOR`
`EvtCleanupCallback`.** The current process-global design is *sound*: IddCx DDIs receive only an
`IDDCX_MONITOR` handle (never the WDFDEVICE/context), and `ProcessSharingDisabled` makes one devnode = one
host process that dies with the device. A "device-owned" variant would *add* a use-after-free window (the
watchdog races device cleanup) for no gain, and the per-monitor cleanup callback isn't reliably reachable
on this UMDF/IddCx stack. Cleanup is already deterministic (WDFDEVICE `EvtCleanupCallback` +
`cleanup_for_device_removal` + the host-gone watchdog). **Revisit only if `max_concurrent>1` on Windows is
actually needed.** (`monitor.rs` documents this rationale at the `MONITOR_MODES` static.)
**P3 — larger, mostly hardware-gated**
9. **M4 — gamepad-driver unification.** Fold `pf_dualsense` + `pf_xusb` (standalone
`packaging/windows/{dualsense,xusb}-driver/` on the old WDF stack) into the unified `drivers/` workspace
on `windows-drivers-rs` with WDF device contexts (true multi-pad), and point the **driver side** at
`pf_driver_proto::gamepad::{PadShm,XusbShm}` (host side already does — the `device_type`-at-offset-140
hand-duplication is the last ABI-drift hazard). Largest item.
10. **M5 — reshape WGC/DDA + GameStream onto `session/pipeline`**, then delete the old relay/monoliths.
AMF/QSV stays CI-only (no lab hardware).
11. **On-glass behavioral validation** of the committed-but-unexercised fixes: the watchdog reaping on
host-kill, `SET_RENDER_ADAPTER` on a **hybrid** box (the lab box is single-dGPU), the IDD-push→DDA
fallback trigger, HDR-ring sizing + out-ring repeat under real HDR/static-desktop pipelining.
---
## 5. Operations
### 5.1 RTX box on-glass recipe
The persistent on-glass validator is the **RTX box** (`ssh "Enrico Bühler"@<ip>`, ENRICOS-DESKTOP, RTX
4090, PS shell). **The IP FLOATS** (DHCP; boots to **Proxmox** on reboot → ephemeral, unreachable after a
reboot; recently `.173`/`.158` — confirm current first; **never reboot it, never depend on it surviving**).
It has WDK 26100 + LLVM 21.1.2 + the Rust toolchain; build clone at `C:\Users\Public\pf-rewrite` (the
user's active driver-dev tree — **don't clobber uncommitted WIP**; use a worktree). Username has a `ü` →
quote it; it only breaks SDL3/client builds, not the host. To validate a host branch: worktree-checkout,
build with `CARGO_TARGET_DIR=C:\t-goal1`, then stop the **PunktfunkHost** service, back up the binary +
`%ProgramData%\punktfunk\host.env`, copy your build in, restart, drive `punktfunk-probe.exe` loopback,
then restore + `git worktree remove`. Drive over ssh via `powershell -EncodedCommand <base64 UTF-16LE>`
(plain quoting mangles; prefer `Write-Output`/file-redirect for clean output). Driver redeploy:
`packaging/windows/redeploy-pf-vdisplay.ps1`; ghost-monitor recovery: `reset-pf-vdisplay.ps1`.
### 5.2 CI / validation
The persistent build validator is the **windows-amd64 CI runner** (no GPU — fine for builds / `iddcx`
link / `/INTEGRITYCHECK` self-sign / the surface-asserts; live NVENC encode + on-glass defers to the RTX
box). Workflows: `windows-host.yml` (the host installer), `windows-drivers.yml` (the driver workspace
build + FORCE_INTEGRITY clear), `windows-drivers-provision.yml` (WDK/LLVM toolchain), `windows-msix.yml`
(the client). A single Windows runner serializes the whole fleet; a `Cargo.toml` touch costs ~25 min of
queue, so driver pushes that avoid `Cargo.toml` skip the fleet serialization.
Local pre-push checks (this Linux box can't compile the Windows paths):
```sh
cargo test -p pf-driver-proto # the ABI crate (cross-platform)
cargo check -p punktfunk-host # Linux paths; win_* mods are #[cfg(windows)]
cargo clippy -p punktfunk-host --all-targets -- -D warnings
# Windows host clippy (on the box): PUNKTFUNK_NVENC_LIB_DIR=C:\t\nvenc;
# cargo clippy -p punktfunk-host --features nvenc --target x86_64-pc-windows-msvc -- -D warnings
# Driver build (on the box): cd packaging/windows/drivers; Version_Number=10.0.26100.0;
# LIBCLANG_PATH='C:\Program Files\LLVM\bin'; cargo build
```
Note: a pre-existing rustfmt-version drift exists in some Windows-only files (this box's rustfmt 1.9.0
wraps `offset_of!`/`unsafe fn` differently than the runner's) — don't reformat unrelated files to chase it.
### 5.3 Env knobs (Windows host)
`PUNKTFUNK_IDD_PUSH=1` (capture from the driver ring; default off), `PUNKTFUNK_VDISPLAY=pf|sudovda`,
`PUNKTFUNK_ENCODER=auto|nvenc` (auto → vendor-detect), `PUNKTFUNK_10BIT=1` + `PUNKTFUNK_HDR_SHADER_P010=1`
(HDR), `PUNKTFUNK_SECURE_DDA=1`, `PUNKTFUNK_NO_WGC=1` (pure DDA), `PUNKTFUNK_ZEROCOPY=1`,
`PUNKTFUNK_MONITOR_LINGER_MS`, `PFVD_DEBUG_LOG=1` (driver file log — release builds are silent without it).
Config lives in `%ProgramData%\punktfunk\host.env`; logs in `%ProgramData%\punktfunk\logs\host.log`.
### 5.4 Build / deploy / packaging
x64-only by design (no ARM64 NVIDIA driver / SudoVDA). The installer is the thin-`.iss` / fat-binary model
delegating to `service install`; tag `host-win-vX.Y.Z`. The driver is built + FORCE_INTEGRITY-cleared +
signed + `Inf2Cat`'d in CI from source. DriverVer must bump on any driver change; create the ROOT devnode
via nefcon (devgen is forbidden).
---
## 6. Reference (hard-won — keep)
### 6.1 The `/INTEGRITYCHECK` answer
`wdk-build` emits `cargo::rustc-cdylib-link-arg=/INTEGRITYCHECK` **unconditionally** (no cfg/env/Config
opt-out), so a self-signed driver can't load (CodeIntegrity 3004/3089). The fix: a deterministic,
idempotent post-link step `packaging/windows/clear-force-integrity.ps1` clears the PE FORCE_INTEGRITY bit
(`0x0080 @ e_lfanew+0x5e`) + verifies (CI-proven `0x01E0 → 0x0160`), **before** signing. Packaging order:
`cargo build` → clear-force-integrity → sign `.dll` → `Inf2Cat` → sign `.cat`. (A public build would use
real attestation signing, which satisfies `/INTEGRITYCHECK` legitimately.)
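For reference, the bit-clear itself is small. The sketch below is a Rust rendering of the idea, not a port of `clear-force-integrity.ps1`: the function name and error strings are hypothetical. The offset (`e_lfanew + 0x5e`) and the mask (`0x0080`) are the values stated above.

```rust
const FORCE_INTEGRITY: u16 = 0x0080; // IMAGE_DLLCHARACTERISTICS_FORCE_INTEGRITY
const DLL_CHARACTERISTICS_OFFSET: usize = 0x5e; // relative to e_lfanew

/// Clears FORCE_INTEGRITY in an in-memory PE image.
/// Returns the (before, after) DllCharacteristics values so the caller can verify.
/// Idempotent: a second call returns equal before/after values.
pub fn clear_force_integrity(image: &mut [u8]) -> Result<(u16, u16), &'static str> {
    if image.len() < 0x40 || &image[..2] != b"MZ" {
        return Err("not an MZ image");
    }
    // e_lfanew lives at 0x3c in the DOS header and points at the "PE\0\0" signature.
    let e_lfanew =
        u32::from_le_bytes([image[0x3c], image[0x3d], image[0x3e], image[0x3f]]) as usize;
    let sig = image
        .get(e_lfanew..e_lfanew + 4)
        .ok_or("truncated before PE signature")?;
    if sig != b"PE\0\0" {
        return Err("missing PE signature");
    }
    let at = e_lfanew + DLL_CHARACTERISTICS_OFFSET;
    let field = image
        .get_mut(at..at + 2)
        .ok_or("truncated optional header")?;
    let before = u16::from_le_bytes([field[0], field[1]]);
    let after = before & !FORCE_INTEGRITY;
    field.copy_from_slice(&after.to_le_bytes());
    Ok((before, after))
}
```

With the CI-observed input `0x01E0` this yields `0x0160`. The step has to run before signing because it changes the bytes the signature covers.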
### 6.2 The `iddcx` binding on `wdk-sys` (the make-or-break — proven, the 6 bindgen knobs)
IddCx DDIs are **function-table dispatched** (`IddFunctions[]` indexed by `_IDDFUNCENUM::<Name>TableIndex`,
`IddDriverGlobals` implicit arg 1) — the same model `wdk-sys` already implements for WDF. The vendored
`windows-drivers-rs` 0.5.1 (`packaging/windows/drivers/vendor/`, `[patch.crates-io]`'d) gets a first-class
`ApiSubset::Iddcx` that bindgens `iddcx/1.10/IddCx.h` reusing the identical `wdk_default(config)` baseline
(so WDF/DXGI types **resolve to**, not redefine, `wdk-sys`'s — type-identity by construction). The six
knobs `generate_iddcx` needed (each a real gotcha, all CI-proven):
1. **`--language=c++`** — `wdk_default` parses C; `IddCx.h`'s `IDARG_*` typedefs need C++ (else a "must use
'struct' tag" cascade).
2. **`-DIDD_STUB`** — table-dispatch mode; skips `IddCxFuncEnum.h`'s `#error IDDCX_VERSION_MAJOR not
defined`. **Do NOT add `WDF_STUB`** (would desync the shared WDF type-identity).
3. **`allowlist_recursively(false)` + `allowlist_file("(?i).*iddcx.*")`, full codegen (no `.complement()`)**
— emit ONLY IddCx items; WDF/Win types resolve via `use crate::types::*`.
4. **`allowlist_type("_?DXGI_.*" / "IDXGI.*" / "_?OPM_.*" / "_?D3DCOLORVALUE")`** — emit the non-WDF types
`wdk-sys` doesn't bindgen, locally. The `_?` is load-bearing (`typedef struct _OPM_X {} OPM_X` needs the
tag AND the alias).
5. **`pub type UINT = ::core::ffi::c_uint;` in `src/iddcx.rs`** — `UINT` is absent from `crate::types`.
6. **`translate_enum_integer_types(true)`** — emit native `u32` reprs for the DXGI/OPM ModuleConsts enums
(nested modules can't see a parent `UINT`).
Wrapper note: table dispatch via `_IDDFUNCENUM::<Name>TableIndex as usize` (the ModuleConsts const, **not**
a NewType `.0`); NTSTATUS is plain `i32` (`wdk_sys::NT_SUCCESS`). The driver `build.rs` adds the IddCxStub
link-search (the import lib is under `iddcx\1.0\` even though headers are `1.10`) + `#[no_mangle] pub static
IddMinimumVersionRequired: ULONG = 4`. The versioned `IDD_STRUCTURE_SIZE!` path is dropped — the WDK links
the iddcx **1.0** stub (lacks the version table); we target 1.10 vs a current framework, so `size_of` is
exactly correct.
### 6.3 Driver port checklist (STEP 08, as landed)
0. workspace `pf-vdisplay`(cdylib)+`wdk-iddcx`; prove `std::thread`+`OwnedHandle` link under UMDF (done).
1. `wdk-iddcx`: 11 typed DDI wrappers via one dispatch macro + re-export the inbound `PFN_*` types.
2. DriverEntry + `IDD_CX_CLIENT_CONFIG` (15 callbacks) + DeviceInitConfig + WdfDeviceCreate +
CreateDeviceInterface (the owned pf GUID) + DeviceInitialize; `edid.rs` salvaged verbatim.
3. DeviceContext + `WDF_DECLARE_CONTEXT_TYPE` blob; `init_adapter` in D0Entry (caps + FP16) →
AdapterInitAsync; the `*2` mode DDIs + `query_target_info` + gamma/HDR accept-stubs. (Box gate: loads
under Secure Boot, enumerates as an IddCx adapter, Status OK.)
4. control plane (`GET_INFO` version handshake the host asserts, ADD/REMOVE/SET_RENDER_ADAPTER/PING/
CLEAR_ALL) + create_monitor + real mode DDIs + watchdog + mode bounds; host switched to
`pf_driver_proto`.
5. `Direct3DDevice` + assign/unassign + `SwapChainProcessor` (worker, `SetDevice` 60×@50 ms single-borrow
retry, top-of-loop `terminate`, `ReleaseAndAcquireBuffer2`, `from_raw_borrowed`).
6. `FramePublisher` on `pf_driver_proto::frame` + keyed-mutex RAII guard; wire into `run_core`. (Box:
full IDD-push glass-to-glass + the **secure-desktop** gate — validated 2026-06-25.)
7. HDR / FP16 ring (validated: Mac connects WITH HDR).
8. its own `.inx` + an `unsafe`-reduction pass (`deny(unsafe_op_in_unsafe_fn)`, per-site `// SAFETY:`).
**Remaining driver work** beyond STEP 08: on-glass validation of the slot-reclaim-on-REMOVE fix, and M4
(fold the gamepad drivers in). E1 (DeviceContext-owned state + per-`IDDCX_MONITOR` `EvtCleanupCallback`,
the prerequisite for `max_concurrent>1`) is deliberately deferred. See §4.
### 6.4 Resolved product decisions (the five forks)
**A** the host was refactored **in place** (staged, behavior-preserving), not greenfield-rebuilt — the
driver *was* rebuilt fresh. **B** IDD-push primary for everything incl. the **secure desktop** (validated);
WGC+DDA demoted to non-IddCx fallbacks. **C** all drivers on `microsoft/windows-drivers-rs` (+ the `iddcx`
subset; `/INTEGRITYCHECK` solved) — done for `pf-vdisplay`, **pending for the gamepad drivers (M4)**.
**D** keep GameStream (Moonlight), default to secure `serve`. **E** concurrent sessions: the host-side
preempt dance was removed by §2.5, but true `max_concurrent>1` on Windows stays blocked on the E1 driver
swap-chain-reuse work.
---
## Appendix — consolidation note
This file replaces five docs (recoverable from git history):
- `windows-host-rewrite.md` (the original design + plan, §0–§15) — its current status, architecture, the
jewels, the seam traits, and the deep reference (§6) are folded in here.
- `windows-host-goal1-plan.md` (the 6-stage in-place host refactor, **complete**): its outcome is §2.2–2.4
and the Goal-1 scorecard row.
- `windows-host-rewrite-audit.md` (the 2026-06-25 audit) — its findings are reconciled to current reality
in §1 (scorecard) and §4 (only the still-open items survive: host hygiene, E1, slot-reclaim).
- `windows-host-rewrite-remediation.md` (the audit-remediation tracker) — its landed items are in §1; its
remaining items (D1-host, D2, E1, G) are §4 P2/P3.
- `windows-host-rewrite-game-capture-bug.md` (the GB1 investigation + fix) — **fixed**; the resolution is
§2.5 (capture). The full investigation narrative is in git history.
(The older `docs/windows-host.md`, a pre-rewrite implementation plan from 2026-06-22, is a separate
lineage and is left as-is.)
# Windows host + client — implementation plan
**Status: in progress — dev box provisioned, host-first.** A Windows host is an *"add backends
behind the existing traits"* job, not a parallel port: `punktfunk-core` and the whole control plane
are platform-agnostic and the host already compiles on non-Linux (macOS) thanks to existing
`cfg(target_os)` gating. The one piece that used to make it XL — a per-client *virtual* output, which
has no user-mode Windows API — is solved by reusing **[SudoVDA](https://github.com/SudoMaker/SudoVDA)**
(the SudoMaker Virtual Display Adapter, the same IDD the Apollo Sunshine-fork ships): a pre-built IDD
that creates virtual displays at **arbitrary `WxH@Hz` on the fly**. We install it and drive its IOCTL
control interface — **no driver to write or WHQL-sign.**
History: scoped 2026-06-10 (4-agent read of the host crate); SudoVDA path 2026-06-11; this concrete
plan + dev box + SudoVDA protocol + no-GPU strategy added 2026-06-14 (12-agent research pass).
## Status (2026-06-15) — full pipeline live-validated on an RTX 4090
Every OS-touching backend is implemented behind the existing traits and **builds clean on
`x86_64-pc-windows-msvc`** (and Linux unaffected). `serve` / `punktfunk1-host` **run on Windows**
(identity in `%APPDATA%`, QUIC bound, mDNS advertising, accepting sessions). The **full native
pipeline is validated live on a real RTX 4090** (Windows 11): SudoVDA virtual display → DXGI
Desktop Duplication (D3D11 zero-copy) → **NVENC HEVC** → punktfunk/1 → Rust reference client, at
720p60 and 1080p60 (0 mismatched frames, p50 1.6 / 3.45 ms cross-machine, ffmpeg-decodes clean),
coexisting with a running Apollo (two concurrent NVENC sessions).
| Backend | State | GPU-less validation on the VM |
|---|---|---|
| Virtual display (SudoVDA) | ✅ done | live: open/version/watchdog/ADD/REMOVE via the trait |
| Input (SendInput) | ✅ **live on RTX 4090** | mouse injection verified — cursor tracked the client's absolute diagonal sweep across the desktop in Session 1 (keyboard shares the same SendInput primitive) |
| Software encode (openh264) | ✅ done | **live: m0 synthetic→openh264→core FEC loopback, 120/120, 0 mismatches** |
| Audio (WASAPI loopback) | ✅ done | live: init chain opens (silent VM → no samples) |
| Capture (DXGI Desktop Duplication) | ✅ **live on RTX 4090** | SudoVDA monitor → D3D11 zero-copy duplication; output is enumerated under the *rendering* GPU, not the SudoVDA LUID (search all adapters) |
| NVENC (D3D11, `--features nvenc`) | ✅ **live on RTX 4090** | 720p60 + 1080p60 HEVC end-to-end to the Rust client; ffmpeg-decodes clean; ran alongside Apollo (2 NVENC sessions) |
| Run host (serve/punktfunk1-host) | ✅ live | punktfunk1-host starts + listens; `c_abi_connection_roundtrip` passes |
| Gamepad (ViGEm) | ✅ done | compiles incl. rumble back-channel; live needs ViGEmBus + a physical pad |
| Host→client audio wiring | ✅ done | builds on MSVC; `m3` `audio_thread` active on Windows (silent VM → no samples to send) |
| GameStream (Moonlight) audio | ✅ done | stereo path active on Windows (WASAPI→Opus→RTP/FEC); surround stays Linux-only (libopus multistream / `audiopus_sys`) |
| Rumble back-channel (ViGEm) | ✅ done | `request_notification` → background thread → 0xCA; live needs a physical pad |
| Game library (Steam discovery) | ✅ done | Windows Steam roots (Program Files) + VDF other-drive libraries; custom store already cross-platform. Non-default Steam install dir (registry) not yet covered |
**Remaining for full parity:**
- **Keyboard injection** — exercised via the same SendInput path (mouse verified live); not yet
asserted into a focused text field.
- **ViGEm rumble + gamepad input** — the pad is created live (ViGEmBus connected); the rumble
back-channel + input still need a physical pad to verify.
- **GameStream (Moonlight) path on a GPU box** — not yet run live (its fixed ports collide with
Apollo, so stop Apollo first).
- **Frame pacing on static content** — DXGI duplication is change-driven, so a blank/idle virtual
display delivers only ~12 fps (181/177 frames over ~15 s); a rendering app drives the full rate.
### Live UX hardening (2026-06-15, validated Mac ↔ RTX 4090)
Driven by live testing with the native macOS client at the display's native **5120×1440@240**:
- **Native resolution, not 1080p.** `sudovda::set_active_mode` enumerates the modes the IDD actually
advertises (`EnumDisplaySettingsW`) and sets the requested **resolution** at the best supported
refresh — keeping 5120×1440@240, never silently collapsing to the 1280×720/1920×1080 OS default
when an exact mode is briefly unavailable.
- **Bitrate auto-cap.** NVENC `init_session` probes and steps the average bitrate down (×3/4 to a
floor) when the requested rate exceeds the GPU's codec-level max, so a high client bitrate connects
instead of failing (matches the Linux host; we do NOT split NVENC sessions).
- **Mouse cursor.** DXGI duplication excludes the hardware cursor; we read the pointer
position/shape from the frame info (`GetFramePointerShape`) and GPU-composite it onto the captured
texture before NVENC (a CPU read-back would stall the pipeline). Color cursors alpha-blend;
**masked-color** cursors (the text I-beam) use an `INV_DEST_COLOR` blend for true screen inversion,
so the caret is visible on any background (no black box). Monochrome handled too.
- **Secure desktop (lock / login / UAC).** The host runs as **SYSTEM in the interactive session**;
the capturer `SetThreadDesktop`s onto the current input desktop and, on the WinSta switch,
**recreates the D3D11 device** and **re-resolves the virtual output's GDI name from the stable
SudoVDA target id** (the name changes across the topology rebuild — the old failure was hunting the
stale `\\.\DISPLAYn` and dropping). `ACCESS_LOST` / `INVALID_CALL` / device-removed are all treated
as recoverable, and a mid-stream resolution change is followed (capturer + NVENC re-init at the new
size). Validated: logging in / locking through the stream stays connected (one real session
recovered 1012 desktop switches and completed cleanly). *Display isolation* (`isolate_displays`
detaches other monitors so Winlogon renders to the virtual output) covers the case where a physical
monitor is also attached.
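The mode-selection rule from the first bullet (keep the requested resolution, pick the best refresh the IDD advertises) can be sketched as a pure function. This is an assumption-laden illustration, not the body of `sudovda::set_active_mode`: the `Mode` type and `pick_mode` are hypothetical, and "best" is taken here to mean the exact requested refresh if advertised, otherwise the highest advertised refresh at that resolution.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Mode {
    pub width: u32,
    pub height: u32,
    pub refresh_hz: u32,
}

/// Never changes the resolution: returns `None` if no advertised mode
/// matches `want`'s width and height, rather than falling back to another size.
pub fn pick_mode(advertised: &[Mode], want: Mode) -> Option<Mode> {
    let mut best: Option<Mode> = None;
    for m in advertised
        .iter()
        .copied()
        .filter(|m| m.width == want.width && m.height == want.height)
    {
        if m.refresh_hz == want.refresh_hz {
            return Some(m); // exact match wins
        }
        if best.map_or(true, |b| m.refresh_hz > b.refresh_hz) {
            best = Some(m);
        }
    }
    best
}
```

Returning `None` instead of a different resolution leaves the retry decision to the caller, which matches the stated goal of never silently collapsing to an OS default.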
### Running as SYSTEM (deployment) — the `PunktfunkHost` service
To capture the secure desktop the host must run as **SYSTEM in the interactive Session 1** (a Session
0 service can't duplicate Session 1). The end-user deployment is the built-in Windows **service**
(`src/service.rs`) — see [`windows-service.md`](windows-service.md). One elevated command:
```powershell
punktfunk-host service install # auto-start LocalSystem service + firewall rules + default host.env
punktfunk-host service start
```
The service runs in Session 0 but never captures: it duplicates its own LocalSystem token, retargets
it to the active console session, and `CreateProcessAsUserW`s the host there — supervising it across
exits and console-session switches (the Sunshine/Apollo model). Config lives in
`%ProgramData%\punktfunk\host.env`; logs in `%ProgramData%\punktfunk\logs\`.
> **Old bring-up chain (debug only, superseded by the service):** a scheduled task (Interactive,
> Highest) → `PsExec64 -s -i 1 -d wscript.exe launch.vbs` → `host-run.cmd` (hidden window), with
> `APPDATA=C:\Users\Public` as the shared-identity hack. The service replaces all of this; the host
> now resolves its config dir to `%ProgramData%\punktfunk` directly (`PUNKTFUNK_CONFIG_DIR` overrides).
### Real-GPU test box (RTX 4090, `ssh "Enrico Bühler"@192.168.1.174`)
Windows 11, RTX 4090 (driver 596.36) + AMD iGPU, SudoVDA + Apollo (sunshine) installed. SSH lands in
**Session 0 (non-interactive)** — DXGI duplication + SendInput need the **interactive Session 1**, so
launch the host there via an Interactive scheduled task (admin SSH session is the same user):
`Register-ScheduledTask -Principal (New-ScheduledTaskPrincipal -UserId (whoami) -LogonType
Interactive -RunLevel Highest)`, then `Start-ScheduledTask`. The host runs with desktop access; read
its redirected log over SSH. `nvEncodeAPI64.dll` ships with the driver, so a VM-built `--features
nvenc` exe runs here as-is (no SDK install). The 4090's Ada NVENC has no consumer session cap, so the
host encodes alongside Apollo. **Gotcha:** the SudoVDA monitor is rendered by — and DXGI-enumerated
under — the 4090, not the SudoVDA adapter LUID (the capturer searches all adapters; see the fix).
#### Native build on the 4090 (fast iteration loop)
Build on the box itself (edit locally → `sftp` to the repo → `cargo build` there → run via the task)
instead of build-on-VM-then-copy. Prereqs that bit us, in order:
1. **Full MSVC C++ build tools, incl. the CRT libs.** A VS install can land `cl.exe` + the Windows
SDK + sanitizer libs but *miss* the desktop CRT import libs (`VC\Tools\MSVC\<ver>\lib\x64\msvcrt.lib`,
`libcmt.lib`, …) → `LNK1104: msvcrt.lib`. Root cause here: the `Microsoft.VisualCpp.Redist.14`
package failed to install (1603), cascading to skip the NativeDesktop workload. Fix = (re)install
the C++ workload via the VS Installer **GUI** (the headless `setup.exe modify` over SSH fails — a
non-elevated SSH token gives 1603/87, and `--quiet` as SYSTEM hangs). A reboot may be needed first
(a pending reboot also yields 1603). Stop-gap: the desktop CRT libs are version-pinned, so they can
be copied from another box with the **identical** MSVC version (`14.51.36231` here).
2. **Build from an ASCII path.** A username with a non-ASCII char (`C:\Users\Enrico Bühler\…`) breaks
the MSVC PDB writer → `LNK1201: error writing to the program database`. Clone/copy the repo to
e.g. `C:\Users\Public\punktfunk-native` and build there (the VM worked only because it built in
`C:\Users\Public\punktfunk`).
3. `winget install NASM.NASM Kitware.CMake`; generate the NVENC import lib (`lib /def` → set
`PUNKTFUNK_NVENC_LIB_DIR`); set `CMAKE_POLICY_VERSION_MINIMUM=3.5` (libopus).
Build env (each `cargo` invocation): `$env:PATH += ";C:\Program Files\NASM;C:\Program Files\CMake\bin"`,
`$env:CMAKE_POLICY_VERSION_MINIMUM="3.5"`, `$env:PUNKTFUNK_NVENC_LIB_DIR="C:\Users\Public\nvenc"`, then
`cargo build --release -p punktfunk-host --features nvenc`. Validated: native build (1m37s) →
720p60 NVENC, 174/174 frames, p50 2.5 ms, ffmpeg-decodes clean.
All Windows backends are `clippy -D warnings` and `rustfmt` clean on `x86_64-pc-windows-msvc` (the
Windows-only modules are cfg-excluded from Linux CI, so run clippy on the VM after touching them — its
rustc 1.96 clippy is stricter than the Linux CI image on shared code, e.g. `needless_return`).
### Building & testing on a real-GPU Windows box (NVENC)
1. Install **SudoVDA** (virtual display) and **ViGEmBus** (gamepad) drivers; install the NVIDIA driver.
2. NVENC link lib: either install the NVIDIA Video Codec SDK, or generate an import lib from the
driver DLL — `lib /def:nvenc.def /machine:x64 /out:nvencodeapi.lib` where `nvenc.def` lists
`NvEncodeAPICreateInstance` and `NvEncodeAPIGetMaxSupportedVersion` — and set
`PUNKTFUNK_NVENC_LIB_DIR` to its directory.
3. `cargo build -p punktfunk-host --features nvenc,amf-qsv` for the all-vendor GPU host (NVENC for
NVIDIA; AMD AMF + Intel QSV via libavcodec — `amf-qsv` needs `FFMPEG_DIR` with the `*_amf`/`*_qsv`
encoders at build, e.g. the BtbN gpl-shared tree, and the FFmpeg DLLs on PATH at run; NASM + CMake
for aws-lc-rs; libclang for ffmpeg-sys-next). Default build (no feature) = openh264 software encoder.
4. Run in the **interactive session** (not a Session-0 service / not over SSH — SendInput + DXGI
Desktop Duplication need a desktop): `serve` or `punktfunk1-host --source virtual`.
`PUNKTFUNK_ENCODER=auto` (default) picks the backend from the GPU vendor — `nvenc`/`amf`/`qsv`/`sw`
force one. The DXGI capturer emits zero-copy D3D11 NV12/P010 to match any GPU backend; the SudoVDA
monitor activates once a real GPU drives WDDM, so capture + encode then work.
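A sketch of how the `PUNKTFUNK_ENCODER` values above could map to a backend. The accepted strings come from this section; the function, the enum, and the software fallback for an unknown vendor are assumptions. The vendor ids are the standard PCI ids (NVIDIA `0x10DE`, AMD `0x1002`, Intel `0x8086`).

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Nvenc,
    Amf,
    Qsv,
    Software,
}

/// `setting` is the raw env value (None = unset, treated as "auto");
/// `gpu_vendor_id` is the PCI vendor id of the rendering adapter, if any.
pub fn choose_backend(setting: Option<&str>, gpu_vendor_id: Option<u32>) -> Result<Backend, String> {
    match setting.map(str::trim).unwrap_or("auto") {
        "nvenc" => Ok(Backend::Nvenc),
        "amf" => Ok(Backend::Amf),
        "qsv" => Ok(Backend::Qsv),
        "sw" => Ok(Backend::Software),
        "auto" | "" => Ok(match gpu_vendor_id {
            Some(0x10DE) => Backend::Nvenc,
            Some(0x1002) => Backend::Amf,
            Some(0x8086) => Backend::Qsv,
            // Assumption: no recognised GPU falls back to the software encoder.
            _ => Backend::Software,
        }),
        other => Err(format!("unknown PUNKTFUNK_ENCODER value: {other}")),
    }
}
```

Forced values bypass vendor detection entirely, which is useful on hybrid boxes where the first enumerated adapter is not the one doing the rendering.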
### Dev loop (this repo → the Windows VM)
`ssh "Enrico Bühler"@192.168.1.57` (PowerShell shell). Repo cloned at `C:\Users\Public\punktfunk`
(Gitea). Sync uncommitted files with **sftp** (`sftp -b - host`, `/C:/...` paths — scp and
base64-over-ssh are unreliable here). Commit on Linux → `git reset --hard origin/main` on the VM.
Build env: `PATH` += cargo bin + NASM + CMake + LLVM (vcvars not needed — rustc/cc self-locate MSVC).
Set `CMAKE_POLICY_VERSION_MINIMUM=3.5` — CMake 4 rejects libopus's old `cmake_minimum_required` when
`audiopus_sys` (vendored by the `opus` crate) builds libopus from source for the host→client audio path.
## Decisions (locked 2026-06-14)
| Decision | Choice | Rationale |
|---|---|---|
| **Build order** | **Host first** | User preference. (Note: the research recommended *client* first, since the client is unblocked by the no-GPU problem and becomes the host's test endpoint — see "No-GPU dev strategy". Revisit if host progress stalls on GPU-gated steps.) |
| **Virtual display** | **SudoVDA** | Arbitrary modes on the fly (no baked EDID / registry mode list, unlike parsec-vdd), MIT/CC0 (bundleable), already installed on the dev box, proven by Apollo. |
| **Client UI** | **Pure Rust: `windows-rs` + Windows Reactor (WinUI 3)** | No C++/C#. Links `punktfunk-core` directly as a crate (like the GTK Linux client — no C ABI, no GC/FFI-lifetime hazard). Built-in `SwapChainPanel` widget for the video surface; `Custom` escape hatch + raw `Microsoft.UI.Xaml` as fallback. |
| **Client decode** | **FFmpeg + D3D11VA** | Exactly what Moonlight ships; feeds AnnexB H.264/HEVC/AV1 directly, decodes AV1 via the GPU DXVA profile with **no** Store Video Extension. Cost: ffmpeg dep + libclang. |
| **Host SW encode (no-GPU dev)** | **openh264** | BSD, no system ffmpeg, low-latency single-ref/zero-lookahead with intra-refresh. Lets the full capture→encode→FEC→send pipeline run GPU-less. |
| **Host HW encode** | **nvidia-video-codec-sdk (D3D11)** | `NV_ENC_DEVICE_TYPE_DIRECTX` + `NvEncRegisterResource` on the captured `ID3D11Texture2D` = true zero-copy, no CUDA bridge. Young crate — vendor + wrap behind the `Encoder` trait. Defers to a real-GPU box. |
## Dev box (`ssh "Enrico Bühler"@192.168.1.57`)
Windows 11 Pro 25H2 (build 26200), QEMU Q35, 8 vCPU, 12 GB. **No working GPU** (an `RTX 5070 Ti` node
is present but `Status: Unknown`; `nvidia-smi` fails → NVENC cannot initialize). Installed: Rust 1.96
(MSVC), Visual Studio Community 2026 + VC tools + Windows SDK 10.0.26100/28000, Windows App Runtime
2.2 (Reactor needs ≥2.0.1 ✅), **SudoVDA** (`ROOT\DISPLAY\0000`, hwid `root\sudomaker\sudovda`, INF
`oem6.inf`, Status OK) and Parsec VDD, git, winget. **Toolchain gaps to fill** (see Step 0): NASM,
CMake, libclang.
## Reused as-is (~95% of the codebase — no changes)
| Reusable | Why |
|---|---|
| `punktfunk-core` (protocol, FEC, crypto, session, transport, QUIC control plane, C ABI) | Zero platform deps; already compiles on Windows MSVC |
| GameStream wire logic (mDNS, serverinfo, pairing, RTSP, ENet) *except* the capture/encode/audio backends | pure protocol |
| Management REST API (`mgmt.rs`) + OpenAPI, `native_pairing`, `discovery` | axum/tokio/quinn — portable |
| `punktfunk1.rs` / `spike.rs` / `pipeline.rs` orchestration | trait-generic: call `capturer.next_frame()`, `encoder.submit/poll()`, `vd.create(mode)` — no changes |
| The trait boundaries: `Capturer`, `Encoder`, `VirtualDisplay`, `InputInjector`, `AudioCapturer`, `VirtualMic` | platform-neutral; Linux deps already isolated under `[target.'cfg(target_os="linux")'.dependencies]` |
## Step 0 — make `punktfunk-host` compile on `x86_64-pc-windows-msvc` — ✅ DONE (2026-06-14)
**Result:** the full dependency tree builds clean on MSVC (aws-lc-rs with NASM+CMake, quinn,
rusty_enet, axum/hyper/utoipa), and `punktfunk-host` compiles **and runs** (the `openapi` subcommand
emits the spec). Only **3 cfg-gates** were needed — the host was already ~95% portable:
`main.rs` `mod dmabuf_fence`/`mod drm_sync` → `#[cfg(target_os = "linux")]`; `vdisplay.rs` the
`use std::os::fd::OwnedFd` import + `VirtualOutput.remote_fd` field → `#[cfg(target_os = "linux")]`.
Verified green on Linux too. Build env on the VM: rustc+`cc`/`cmake` self-locate MSVC (vcvars not
needed); `PATH` must include cargo bin + NASM + CMake + LLVM.
The host already compiles on macOS (Linux backends are `cfg`-gated; heavy Linux deps are
target-gated). Getting to Windows MSVC is the **unix-but-not-linux** delta, not a from-scratch port:
1. **Toolchain**: `winget install NASM.NASM Kitware.CMake LLVM.LLVM`, set `LIBCLANG_PATH`
(or tick VS "C++ Clang tools"). NASM+CMake are for **aws-lc-rs** (pulled by `rustls`/`rcgen` on
the `quic` path); libclang is for `ffmpeg-sys`/bindgen (client decode + any host bindgen crate).
2. **`std::os::fd` / `libc`**: `vdisplay.rs:18` has an unconditional `use std::os::fd::OwnedFd;` and
`VirtualOutput.remote_fd: Option<OwnedFd>`; `std::os::fd` is `cfg(unix)`, so it builds on macOS
but breaks on Windows. Gate the import + field (`#[cfg(unix)]`, with a Windows arm or omission).
Sweep for other `cfg(target_os="linux")`-missing unix-isms (`libc`, fds).
3. **Build natively on the VM** (`cargo build -p punktfunk-host`, *not* cross-compile; xwin chokes on
aws-lc-rs/ffmpeg-sys/WDK). Triage the remaining errors. Suspect deps to verify link on MSVC:
`aws-lc-rs` (needs NASM+CMake), `rusty_enet`, the hyper/axum/utoipa stack (expected fine).
4. **CI**: add a `cargo build -p punktfunk-host --target x86_64-pc-windows-msvc` job so the Windows
path stops bit-rotting (the dev box can be a Gitea runner later).
This is the highest-value first move and is **fully doable GPU-less**.
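
The gating pattern from item 2 looks roughly like the sketch below. The struct is a trimmed-down, hypothetical stand-in for the host's `VirtualOutput` (only `remote_fd` comes from the text; `node_id` and `new` are illustrative), and it uses the `target_os = "linux"` gate that the result note above says was applied.

```rust
// Import the fd type only on the target that uses it.
#[cfg(target_os = "linux")]
use std::os::fd::OwnedFd;

/// Hypothetical, trimmed-down stand-in for the host's `VirtualOutput`.
pub struct VirtualOutput {
    pub node_id: u32,
    /// Present only on Linux; the field does not exist on Windows or macOS.
    #[cfg(target_os = "linux")]
    pub remote_fd: Option<OwnedFd>,
}

impl VirtualOutput {
    /// Compiles on every target because the gated field is initialised
    /// behind the same `cfg` as its declaration.
    pub fn new(node_id: u32) -> Self {
        Self {
            node_id,
            #[cfg(target_os = "linux")]
            remote_fd: None,
        }
    }
}
```

Every place that reads or writes the field needs the same attribute, which is why a sweep for other ungated unix-isms is part of the step.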
## Windows backends (new `#[cfg(target_os = "windows")]` code behind existing traits)
| Subsystem | Linux today | Windows backend | VM-testable? |
|---|---|---|---|
| **VirtualDisplay** | KWin/gamescope/Mutter/Sway | **SudoVDA** IOCTLs (below) + `SetDisplayConfig` mode-set | ✅ likely (WARP) — *spike* |
| **Capture** | PipeWire/dmabuf | **DXGI Desktop Duplication** primary, **WGC** fallback → `ID3D11Texture2D`; add `FramePayload::D3d11` | ⚠️ DDA-on-WARP unreliable; WGC-on-WARP unverified — *spike* |
| **Zero-copy** | dmabuf→EGL/Vulkan→CUDA | register `ID3D11Texture2D` with NVENC (`NV_ENC_DEVICE_TYPE_DIRECTX`) — no CUDA bridge | ❌ needs real GPU |
| **Encode** | ffmpeg `*_nvenc`/`*_vaapi` | `nvidia-video-codec-sdk` (NVIDIA) + libavcodec `*_amf`/`*_qsv` (AMD/Intel, `encode/ffmpeg_win.rs`, `--features amf-qsv`) + `openh264` SW fallback; vendor-auto via `PUNKTFUNK_ENCODER` | NVENC ✅ live / AMF/QSV CI-only |
| **Input kbd/mouse** | libei / wlr | **SendInput** with `MOUSEEVENTF_VIRTUALDESK` absolute mapping onto the virtual desktop rect (skip the VK→evdev table — client sends Win VKs; use `KEYEVENTF_SCANCODE`+`EXTENDEDKEY`) | ✅ |
| **Gamepad** | uinput xpad + FF | **ViGEmBus** via `vigem-client` (`Xbox360Wired`); rumble via `request_notification()` → `XNotification{large,small}` | ✅ (install driver) |
| **Audio capture** | PipeWire sink monitor | **WASAPI loopback** via the `wasapi` crate (48 kHz stereo f32 → existing Opus) | ⚠️ needs an audio endpoint |
| **Virtual mic** | PipeWire `Audio/Source` | virtual audio driver (`Virtual-Audio-Driver`) or defer | ❌ second driver — defer |
`punktfunk1.rs`/`spike.rs`/`pipeline.rs` are unchanged. Note: the Windows capture needs its own
`capture_virtual_output` entry point (the SudoVDA identity is a DXGI adapter LUID + DisplayConfig
TargetId → GDI `\\.\DisplayN`, which doesn't fit the PipeWire `node_id: u32` field — carry it inside
the `keepalive` / a Windows-specific seam rather than overloading `node_id`).
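
For the SendInput row above, absolute mouse events with `MOUSEEVENTF_VIRTUALDESK` take coordinates normalized to 0..=65535 across the whole virtual desktop. A minimal sketch of that mapping follows; `DesktopRect` and `to_absolute` are hypothetical names, and the round-to-nearest `65535 / (extent - 1)` scaling is one common convention rather than the host's confirmed formula.

```rust
/// Virtual-desktop rectangle in pixels (on Windows this would come from
/// the SM_XVIRTUALSCREEN / SM_CXVIRTUALSCREEN family of system metrics).
#[derive(Debug, Clone, Copy)]
pub struct DesktopRect {
    pub left: i32,
    pub top: i32,
    pub width: i32,
    pub height: i32,
}

/// Map a pixel position inside `rect` to the 0..=65535 range used by
/// absolute mouse input. Positions outside the rect are clamped to its edge.
pub fn to_absolute(rect: DesktopRect, x: i32, y: i32) -> (i32, i32) {
    let norm = |v: i32, origin: i32, extent: i32| -> i32 {
        let last = (extent.max(1) - 1) as i64; // index of the last pixel
        let span = last.max(1); // avoid dividing by zero on a 1-pixel extent
        let off = ((v - origin) as i64).clamp(0, last);
        // Scale with round-to-nearest.
        ((off * 65535 + span / 2) / span) as i32
    };
    (
        norm(x, rect.left, rect.width),
        norm(y, rect.top, rect.height),
    )
}
```

The rectangle's origin can be negative when a monitor sits left of or above the primary, which is why the offset is taken relative to `left`/`top` before scaling.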
## SudoVDA control protocol (the `VirtualDisplay` backend spec)
Pure Rust via the `windows` crate (no C lib; Apollo vendors a header-only client under
`third-party/sudovda/`). Reference port pattern: `parsec-vdd-rust` (SetupAPI/CM_* → `CreateFileW` →
`DeviceIoControl`). **Verify the IOCTL hex with a `const fn ctl_code()`**:
`CTL_CODE(dev,func,method,access) = (dev<<16)|(access<<14)|(func<<2)|method`, with
`FILE_DEVICE_UNKNOWN=0x22`, `METHOD_BUFFERED=0`, `FILE_ANY_ACCESS=0`.
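
A sketch of that check, using the function codes and hex values from the IOCTL list in this section. The constant names are illustrative; the compile-time assertions fail the build if any derived code disagrees with the documented hex.

```rust
const FILE_DEVICE_UNKNOWN: u32 = 0x22;
const METHOD_BUFFERED: u32 = 0;
const FILE_ANY_ACCESS: u32 = 0;

/// Rust equivalent of the Windows `CTL_CODE` macro.
pub const fn ctl_code(dev: u32, func: u32, method: u32, access: u32) -> u32 {
    (dev << 16) | (access << 14) | (func << 2) | method
}

/// All SudoVDA IOCTLs share device type, method and access.
const fn sudovda_ioctl(func: u32) -> u32 {
    ctl_code(FILE_DEVICE_UNKNOWN, func, METHOD_BUFFERED, FILE_ANY_ACCESS)
}

pub const IOCTL_ADD: u32 = sudovda_ioctl(0x800);
pub const IOCTL_REMOVE: u32 = sudovda_ioctl(0x801);
pub const IOCTL_SET_RENDER_ADAPTER: u32 = sudovda_ioctl(0x802);
pub const IOCTL_GET_WATCHDOG: u32 = sudovda_ioctl(0x803);
pub const IOCTL_DRIVER_PING: u32 = sudovda_ioctl(0x888);
pub const IOCTL_GET_PROTOCOL_VERSION: u32 = sudovda_ioctl(0x8FF);

// Compile-time checks against the hex values quoted in the text.
const _: () = assert!(IOCTL_ADD == 0x0022_2000);
const _: () = assert!(IOCTL_REMOVE == 0x0022_2004);
const _: () = assert!(IOCTL_SET_RENDER_ADAPTER == 0x0022_2008);
const _: () = assert!(IOCTL_GET_WATCHDOG == 0x0022_200C);
const _: () = assert!(IOCTL_DRIVER_PING == 0x0022_2220);
const _: () = assert!(IOCTL_GET_PROTOCOL_VERSION == 0x0022_23FC);
```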
- **Device interface GUID**: `{E5BCC234-1E0C-418A-A0D4-EF8B7501414D}` · **HWID**: `root\sudomaker\sudovda`
- **IOCTLs** (func → value): ADD `0x800` → `0x00222000`, REMOVE `0x801` → `0x00222004`,
SET_RENDER_ADAPTER `0x802` → `0x00222008`, GET_WATCHDOG `0x803` → `0x0022200C`,
DRIVER_PING `0x888` → `0x00222220`, GET_PROTOCOL_VERSION `0x8FF` → `0x002223FC`.
- **Add** (`#[repr(C)]` exact layout): in `{ u32 Width; u32 Height; u32 RefreshRate; GUID MonitorGuid;
CHAR DeviceName[14]; CHAR SerialNumber[14] }` → out `{ LUID AdapterLuid; u32 TargetId }`. **The mode
is set at create** (driver computes timing arithmetically — no EDID seeding). Pick a *stable
per-client* `MonitorGuid` (Windows persists that monitor's layout; remove is by GUID).
- **Resolve the capture target**: the monitor appears **asynchronously** — poll
`QueryDisplayConfig(QDC_ONLY_ACTIVE_PATHS)`, match `targetInfo.id == TargetId`,
`DisplayConfigGetDeviceInfo` → `viewGdiDeviceName` (`\\.\DisplayN`). Apollo polls 20 ms → ×2 → cap
320 ms. Then point DXGI Desktop Duplication at that output.
- **Keepalive (mandatory)**: `GET_WATCHDOG` → `{ u32 Timeout_s; u32 Countdown }` (default **3 s**,
driver-wide). Run one thread firing `DRIVER_PING` every `Timeout*1000/3` ms (~1 s). Miss it and the
driver tears down **all** virtual displays.
- **Teardown (RAII)**: `Drop` → `DeviceIoControl(REMOVE, { GUID MonitorGuid })` = the `VirtualOutput`
keepalive drop.
- **Mid-stream `Reconfigure`**: SudoVDA has no in-place mode IOCTL (Apollo only relaunches). Implement
punktfunk's `Reconfigure` as remove+re-add at the new mode (or add-second + migrate capture), and
**watch the Win11 24H2/25H2 IDD mode-apply regression** (post-create `ChangeDisplaySettingsEx` may
not move the *desktop* to the new mode without a Settings-UI poke — VirtualDrivers #471). The
~90 ms `Reconfigure` budget needs an isolated spike to confirm on 24H2/25H2.
- **Install / signing**: self-signed — ship `sudovda.cer`, import to Root + TrustedPublisher, create
the device node via `nefconc.exe` (`--create-device-node`/`--install-driver`). Installs **without**
test-signing (trusted-publisher). MIT/CC0 → bundleable (Apollo precedent). **Already installed on
the dev box.** Document it as a host prerequisite (like the Linux udev rule).
- **GPU caveat**: SudoVDA's `Driver.cpp` does `D3D11CreateDevice(UNKNOWN)` on a render adapter with
**no explicit WARP fallback**; on the GPU-less VM Windows binds the Basic Render Driver (WARP), so
display compositing *should* work but NVENC won't. Confirm `ADD` actually brings a monitor up on the
VM in the first spike.
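
The Add layout and the two timing rules above (resolve-poll backoff and watchdog ping interval) can be sketched as follows. Assumptions, stated plainly: the structs use natural C alignment with no packing pragma, the Rust field and type names are illustrative, and the backoff is read as "start at 20 ms, double each attempt, cap at 320 ms".

```rust
use std::mem::size_of;
use std::time::Duration;

/// Same layout as the Win32 `GUID`.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
pub struct Guid {
    pub data1: u32,
    pub data2: u16,
    pub data3: u16,
    pub data4: [u8; 8],
}

/// Same layout as the Win32 `LUID`.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
pub struct Luid {
    pub low_part: u32,
    pub high_part: i32,
}

/// Input buffer for ADD, field order as listed in the text.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
pub struct AddParams {
    pub width: u32,
    pub height: u32,
    pub refresh_rate: u32,
    pub monitor_guid: Guid,
    pub device_name: [u8; 14],
    pub serial_number: [u8; 14],
}

/// Output buffer for ADD.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
pub struct AddOut {
    pub adapter_luid: Luid,
    pub target_id: u32,
}

// 3 * 4 + 16 + 14 + 14 = 56 bytes; 8 + 4 = 12 bytes (under the stated assumptions).
const _: () = assert!(size_of::<AddParams>() == 56);
const _: () = assert!(size_of::<AddOut>() == 12);

/// Delays between `QueryDisplayConfig` polls: 20 ms, doubling, capped at 320 ms.
pub fn poll_delays() -> impl Iterator<Item = Duration> {
    std::iter::successors(Some(20u64), |ms| Some((ms * 2).min(320))).map(Duration::from_millis)
}

/// Ping three times per watchdog timeout: `Timeout * 1000 / 3` ms.
pub fn ping_interval(timeout_s: u32) -> Duration {
    Duration::from_millis(u64::from(timeout_s) * 1000 / 3)
}
```

The size assertions are worth keeping in the real backend too, since a layout mismatch with the driver would otherwise only show up as a failed `DeviceIoControl` at runtime.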
## No-GPU dev strategy
**Buildable + validatable on the VM now:** Step 0 (MSVC compile); the SudoVDA backend
(add/mode-set/keepalive/remove via WARP — *spike to confirm*); the openh264 SW encode path fed a CPU
BGRA staging copy → real AnnexB → FEC → UDP (the full transport minus HW); SendInput injection +
interactive-session/desktop-reattach; ViGEm gamepad + rumble; WASAPI loopback (if an endpoint
exists); and the entire client (software decode loopback).
**Defers to a real NVIDIA-GPU Windows box:** NVENC-D3D11 zero-copy encode; whether the captured
`ID3D11Texture2D` registers with NVENC zero-copy vs needing a `CopyResource`; the DDA-vs-WGC latency
bake-off (DDA-on-WARP is `E_NOTIMPL`-class); split-encode + bitrate-ceiling probe; and **all**
glass-to-glass / throughput numbers (no perf claim transfers from Linux).
## Windows-specific structural issues (no Linux precedent)
- **Interactive session, not a Session-0 service.** SendInput can't reach the desktop from Session 0.
Run the host in the user's interactive session and replicate Apollo/Sunshine's
`OpenInputDesktop`/`SetThreadDesktop` re-attach to survive UAC/lock-screen desktop switches. (Driving
the UAC *secure* desktop needs a UIAccess manifest + signing — out of scope; document it.)
- **Clock epoch on the host side.** The skew handshake assumes both ends read the same realtime epoch
in ns. The Windows host must emit timestamps from `GetSystemTimePreciseAsFileTime`→Unix-epoch-ns or
cross-machine latency numbers + `ClockProbe`/`ClockEcho` break.
- **IDD has no audio endpoint.** There's nothing to loop back on a headless box unless a real/virtual
render device exists → WASAPI loopback needs an endpoint, and the virtual *mic* (client→host) has no
clean user-mode path. Audio is potentially a second driver-install problem; defer the mic.
- **Color/range.** All clients assume BT.709 limited-range. A new openh264/NVENC-D3D11 path doing
BGRA→I420 must match, or colors wash out — validate against the existing decoders.
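
For the clock-epoch point above, the conversion from a `FILETIME` value (100 ns ticks since 1601-01-01 UTC) to Unix-epoch nanoseconds is plain arithmetic. A minimal sketch; the function name is hypothetical and the two `u32` halves stand in for `dwLowDateTime`/`dwHighDateTime` as returned by `GetSystemTimePreciseAsFileTime`.

```rust
/// 100 ns ticks between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
const FILETIME_UNIX_OFFSET_TICKS: u64 = 116_444_736_000_000_000;

/// Convert the two halves of a FILETIME into Unix-epoch nanoseconds.
/// Returns `None` for times before 1970 or values that would overflow `u64`.
pub fn filetime_to_unix_ns(low: u32, high: u32) -> Option<u64> {
    let ticks = (u64::from(high) << 32) | u64::from(low);
    ticks
        .checked_sub(FILETIME_UNIX_OFFSET_TICKS)?
        .checked_mul(100) // one tick is 100 ns
}
```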
## Phased plan (host-first)
0. **Compile on MSVC** (Step 0 above). GPU-less. ← *start here*
1. **SudoVDA `VirtualDisplay` backend** — ✅ *control path landed* (`vdisplay/sudovda.rs`:
add/keepalive/remove + GDI-name resolution + RAII teardown, behind the existing trait; `open()`
returns it on Windows). Compiles + live-tested on the VM. **Remaining:** monitor activation +
`\\.\DisplayN` resolution (needs a GPU), then `SetDisplayConfig` mid-stream `Reconfigure`.
2. **Capture + SW encode** — DXGI Desktop Duplication (or WGC) → `ID3D11Texture2D` → CPU staging →
openh264 → existing FEC/transport. First end-to-end Windows session, GPU-less, against the Linux
`punktfunk-probe` or the new Windows client.
3. **Input** — SendInput (kbd/mouse, VIRTUALDESK mapping) + interactive-session/desktop-reattach.
4. **Gamepad + audio** — ViGEm + rumble; WASAPI loopback.
5. **HW encode (real-GPU box)** — `nvidia-video-codec-sdk` D3D11 zero-copy; DDA-vs-WGC bake-off;
glass-to-glass numbers. Resolve to Xbox-360 pad on Windows (drop DualSense fidelity/virtual-mic to
follow-ups, as the host already does for non-Linux).
## The Windows client (separate track, pure Rust)
Structurally a sibling of `clients/linux` (GTK4) — same shape, different toolkit:
- **UI**: `windows-rs` + **Windows Reactor** (WinUI 3) for native chrome. Link `punktfunk-core`
directly (no C ABI). **De-risk early**: a Reactor window with a `SwapChainPanel` presenting a
test pattern through a flip-model waitable swapchain, before building on it. Fallback if Reactor's
3-week-old maturity bites: the `Custom` element + raw `windows-rs` `Microsoft.UI.Xaml`.
- **Decode**: FFmpeg `avcodec_send_packet`/`receive_frame` with the **D3D11VA** hwaccel → `NV12/P010`
`ID3D11Texture2D`. Feeds AnnexB directly (matches host output), decodes AV1 with no Store extension.
- **Present**: DXGI flip-model **waitable** swapchain (`FLIP_DISCARD` + `FRAME_LATENCY_WAITABLE_OBJECT`,
max latency 1) bound to the `SwapChainPanel` via `ISwapChainPanelNative::SetSwapChain`. **Not**
MediaPlayerElement.
- **Input capture**: RAWINPUT/`WM_INPUT` for relative/pointer-lock mouse; `Windows.Gaming.Input` for
gamepads + rumble. Forward via the linked `NativeClient` (`send_input`/`send_rich_input`).
- **Trust**: SPAKE2 PIN + TOFU pinning via core; persist the client identity in Windows Credential
Manager / DPAPI (the Keychain analog).
## Open risks / spikes (do these in isolation, early)
1. **`cargo build -p punktfunk-host` on the VM** — count + triage the real MSVC errors before
estimating Step 0. (GPU-less.)
2. **SudoVDA `ADD` on the VM** — ✅ *done 2026-06-15.* The control path is fully validated on the
GPU-less VM, both standalone and through the real `VirtualDisplay` trait (`vdisplay/sudovda.rs`):
device open by GUID, `GET_VERSION` (0.2.1), `GET_WATCHDOG` (3 s), `ADD 1920×1080@60` → returns
adapter LUID + `target_id`, watchdog ping holds it, RAII `Drop` → `REMOVE`. **Gap:** with no GPU the
target does NOT activate into a WDDM display path (`QueryDisplayConfig` active paths stay 0 → no
`\\.\DisplayN` to resolve/capture). So **activation + name-resolution + capture defer to a real
GPU** (passthrough on the Proxmox VM, or a GPU box) — consistent with capture/NVENC deferring anyway.
3. **IDD arbitrary-mode + `Reconfigure` on 24H2/25H2** — does 5120×1440@240 apply, and does a
remove+re-add (or re-modeset) hit the ~90 ms budget without a Settings-UI toggle? Make-or-break for
"native client resolution, no scaling".
4. **NVENC-D3D11 zero-copy** (real-GPU box) — does the captured texture register as-is, or need a
copy? Does `nvidia-video-codec-sdk`'s `NV_ENC_DEVICE_TYPE_DIRECTX` path work end-to-end? (Expect to
vendor/patch.)
5. **DDA vs WGC** against the SudoVDA monitor — measure latency/jitter on a real GPU; resolve the
primary-capture choice.
6. **Driver redistribution** — confirm bundling SudoVDA (`.cer` + nefcon) + ViGEmBus installers in the
punktfunk Windows package; document them as prerequisites.
## References
- SudoVDA: <https://github.com/SudoMaker/SudoVDA> · Apollo integration:
<https://github.com/ClassicOldSong/Apollo/tree/master/src/platform/windows> (`virtual_display.cpp`)
+ `third-party/sudovda/`
- parsec-vdd-rust (port pattern): <https://github.com/rohitsangwan01/parsec-vdd-rust>
- Win11 24H2 IDD mode-apply regression: VirtualDrivers/Virtual-Display-Driver #471
- Windows Reactor (WinUI 3 in Rust): windows-rs PR #4479
- Crates: `windows`, `windows-capture`, `vigem-client`, `wasapi`, `openh264`,
`nvidia-video-codec-sdk`, `ffmpeg-next`
@@ -0,0 +1,132 @@
# Windows secure-desktop capture — two-process design
Status: **all steps (1–6) implemented and live-validated on the RTX 4090 (2026-06-16).** The
two-process path works end to end (host as SYSTEM): the user-session WGC helper relays video, the mux
switches to the host's DDA on the secure desktop, a dead helper is rebuilt automatically, and the
SendInput injector follows desktop switches lazily. Only a *real* UAC/lock smoke test remains (can't
be triggered headless over SSH). The earlier user-mode WGC animation fix still ships; this is the
SYSTEM-mode design that adds secure-desktop (UAC/lock/login) coverage, since WGC and the secure desktop
need conflicting process tokens.
Implemented so far:
- **Step 1 — DesktopWatcher** (`capture/desktop_watch.rs`): polls the input-desktop name → atomic
`Default`/`Winlogon`. Committed `80e222d`.
- **Step 3 — WGC helper subcommand** (`wgc_helper.rs`, `punktfunk1-host wgc-helper`): WGC→NVENC→framed AUs on
stdout, stdin keyframe control. Committed `a0f6cdd`.
- **Step 4 — spawn + relay** (`capture/wgc_relay.rs`, `m3::virtual_stream_relay`): SYSTEM host spawns
the helper via `CreateProcessAsUserW` into `winsta0\default`, relays its stdout AUs to the QUIC send
thread, forwards keyframe requests, surfaces helper stderr in host tracing. Committed `9f50b39`.
- **Step 5 — source mux** (`m3::virtual_stream_relay`): the DesktopWatcher switches the AU source —
helper relay on `Default`, the host's own DDA capturer+encoder on `Winlogon`; every switch latches
"wait for IDR" + forces the now-active source to emit a keyframe.
**Live-validated on the RTX 4090 (2026-06-16, host as SYSTEM):**
- Step 4: the helper spawns via `CreateProcessAsUserW`, runs WGC with no hang (HDR FP16 BT.2020 PQ),
opens NVENC (D3D11 Main10), and relays AUs — `client-rs` over the LAN decoded 411 HEVC Main-10
frames. (Bug found+fixed: `CreateProcessAsUserW` gave the helper the *user's* env, dropping
`PUNKTFUNK_ENCODER=nvenc` → software-encoder fallback; fixed by `merged_env_block`.)
- Step 5: with `PUNKTFUNK_SECURE_TEST_PERIOD_MS=4000` driving a square-wave toggle, the source mux
switched `secure(DDA)` ↔ `normal(WGC relay)` cleanly 5× in one session; the client decoded 308 frames
continuously across every switch (the wait-for-IDR latch held — no decode break). The real Winlogon
DDA capture itself is pre-proven by the single-process secure path (commit `f4b4a6c`); step 5's new
surface is the mux, which the toggle exercises directly.
- Step 6: the helper relaunch watchdog. Force-killing the helper PID mid-stream triggered exactly one
`WGC helper exited — rebuilt output + helper fails=1` and the stream recovered — client-rs decoded
645 frames continuously across the kill. A ~30s mux soak (2s toggle) ran 16 switches with 0 rebuilds
/ 0 early-ends / 465 frames decoded. (Recovery rebuilds the whole output, not a same-target respawn,
which storm-failed with "no DXGI output for target N yet" after an abrupt kill.)
- Step 2: SendInput now uses the retry-on-failure model (`inject/sendinput.rs`) — the thread stays
bound to its desktop and only reattaches (`OpenInputDesktop`/`SetThreadDesktop`) on a `SendInput`
short write (desktop switched), instead of two syscalls per event. Validated: `client-rs --input-test`
injected for ~6s with no `blocked desktop` errors (steady-state path); the reattach-on-switch path
is the same `OpenInputDesktop` call the old per-event code used, now lazy.
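
The retry-on-failure model from Step 2 can be sketched independently of Win32 by hiding the two calls behind a trait. `InputBackend` and `inject` are hypothetical names, and the `u32` event type is a placeholder for the real `INPUT` records; on Windows, `send` would wrap `SendInput` and `reattach` would wrap `OpenInputDesktop` + `SetThreadDesktop`.

```rust
/// Platform seam so the retry logic can be exercised without Win32.
pub trait InputBackend {
    /// Returns how many events were actually injected (like `SendInput`).
    fn send(&mut self, events: &[u32]) -> usize;
    /// Re-bind the thread to the current input desktop; `false` on failure.
    fn reattach(&mut self) -> bool;
}

/// Send first; only on a short write reattach once and retry the remainder.
pub fn inject<B: InputBackend>(backend: &mut B, events: &[u32]) -> Result<usize, &'static str> {
    if events.is_empty() {
        return Ok(0);
    }
    let sent = backend.send(events);
    if sent == events.len() {
        return Ok(sent); // steady state: no desktop calls at all
    }
    // Short write: the input desktop has probably switched.
    if !backend.reattach() {
        return Err("reattach failed");
    }
    let retried = backend.send(&events[sent..]);
    if sent + retried == events.len() {
        Ok(sent + retried)
    } else {
        Err("blocked desktop")
    }
}
```

Retrying only the unsent tail is a choice of this sketch; it avoids duplicating events that did land before the switch.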
Remaining: a **final user-driven smoke test** — trigger a *real* UAC/lock on the box during a session
and confirm the dialog appears on the client AND that clicking/typing on it lands (the box's UAC
auto-elevates admins, so a real prompt can't be triggered headless over SSH; the mux switch itself is
proven by the timed toggle, and DDA-on-Winlogon capture + input by the single-process secure path).
> **Note:** the two-process path requires the host to run as SYSTEM (`run.cmd.sysbak` → `-s -i 1`).
> As SYSTEM, WASAPI loopback audio (session 0) does not capture the user session's audio — a known
> limitation of SYSTEM-mode capture, separate from this work.
## The constraint (verified live on the RTX 4090)
- **WGC** (the composed-desktop capture that fixes frozen HDR animations) **will not activate under
the SYSTEM account**: `CreateForMonitor` → `0x80070424`. Thread-level `ImpersonateLoggedOnUser` is
**insufficient** (tested: `impersonated=true`, still `0x80070424`). WGC needs the *process* to run
as the interactive user.
- **DDA + SendInput on the secure desktop (Winlogon: UAC/lock/login) require LOCAL_SYSTEM** (attach to
the Winlogon desktop). This is already shipped (task #17) when the host runs as SYSTEM.
- Therefore one process can't do both. Single-process (the simpler design) is **out**.
## Architecture: SYSTEM host + USER-session WGC helper, AU-relay (no shared GPU texture)
- **SYSTEM host** (the existing `punktfunk1-host`, launched as SYSTEM in interactive Session 1 via the
scheduled task → PsExec `-s -i 1`): owns the punktfunk/1 QUIC session, the single SudoVDA virtual
output (+ isolate/restore RAII — the *only* topology owner), the **DDA capture + NVENC encoder for
the secure desktop**, the **single SendInput injector** (serves *both* desktops), and the **AU
source mux** that feeds the QUIC data plane.
- **USER-session WGC helper** (a new `punktfunk1-host` subcommand, spawned by the SYSTEM host via
`WTSQueryUserToken(activeConsoleSessionId)` → `DuplicateTokenEx(TokenPrimary)` →
`CreateProcessAsUserW(lpDesktop="winsta0\\default", CREATE_NO_WINDOW)`): runs the existing
**WGC → scRGB/PQ → NVENC** pipeline and ships **Annex-B AUs** (`{data, pts_ns, keyframe}`) to the
SYSTEM host over a **named pipe**. It captures the SAME SudoVDA output **by GDI name only** — it
must NOT create its own virtual output / touch display topology (a second topology owner re-triggers
the ACCESS_LOST born-lost storm).
- **Mux**: the SYSTEM host relays the helper's AUs onto QUIC while the input desktop is `Default`
(normal — WGC, HDR/animation-correct), and switches to its own DDA encoder while it's `Winlogon`
(secure — UAC/lock/login). The client sees one continuous stream; the encoder/FEC/AES-GCM/QUIC send
path is untouched (same `EncodedFrame` flow). NVENC re-inits only on a size/format change across the
swap (already handled); same-mode is a pointer re-register.
- **Input**: stays entirely in the SYSTEM host (only it can attach to Winlogon). One windowless
SendInput thread, Sunshine's **retry-on-failure-only** model (cache HDESK thread-local; SendInput
first; only on 0-injected re-`OpenInputDesktop`+`SetThreadDesktop` and retry once) — serves both
desktops with no per-event reattach. (Ctrl+Alt+Del/SAS needs `SendSAS`, out of scope; clicking UAC
Yes/No + typing the login password are plain SendInput on Winlogon.)
Rejected: a shared NT-handle GPU texture (MIC/SDDL pain SYSTEM→user, keyed-mutex ring at 240 Hz,
nvenc pointer-cache churn — all for a static lock dialog). AU bytes over a pipe are far simpler.
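
To illustrate why AU bytes over a pipe are simple, here is a length-prefixed framing sketch for the `{data, pts_ns, keyframe}` records. The wire layout (u32 LE length, u64 LE `pts_ns`, one flags byte, payload) and the `MAX_AU_BYTES` cap are assumptions of this sketch, not the helper's actual format; it works over any `Read`/`Write`, so a named pipe or stdout both fit.

```rust
use std::io::{self, Read, Write};

/// One encoded access unit, as described in the text.
pub struct Au {
    pub data: Vec<u8>,
    pub pts_ns: u64,
    pub keyframe: bool,
}

/// Hypothetical sanity cap so a corrupt header cannot trigger a huge allocation.
const MAX_AU_BYTES: usize = 64 * 1024 * 1024;

/// Frame layout (assumed): u32 LE length | u64 LE pts_ns | u8 flags | payload.
pub fn write_au<W: Write>(w: &mut W, au: &Au) -> io::Result<()> {
    if au.data.len() > MAX_AU_BYTES {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "AU too large"));
    }
    w.write_all(&(au.data.len() as u32).to_le_bytes())?;
    w.write_all(&au.pts_ns.to_le_bytes())?;
    w.write_all(&[au.keyframe as u8])?; // bit 0 = keyframe
    w.write_all(&au.data)
}

pub fn read_au<R: Read>(r: &mut R) -> io::Result<Au> {
    let mut hdr = [0u8; 13];
    r.read_exact(&mut hdr)?;
    let len = u32::from_le_bytes([hdr[0], hdr[1], hdr[2], hdr[3]]) as usize;
    if len > MAX_AU_BYTES {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "AU length out of range"));
    }
    let mut pts = [0u8; 8];
    pts.copy_from_slice(&hdr[4..12]);
    let mut data = vec![0u8; len];
    r.read_exact(&mut data)?;
    Ok(Au {
        data,
        pts_ns: u64::from_le_bytes(pts),
        keyframe: hdr[12] & 1 != 0,
    })
}
```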
## Detection
`DesktopWatcher`: a dedicated thread polling the input-desktop NAME at 30–60 Hz;
`OpenInputDesktop(0,FALSE,0)` + `GetUserObjectInformationW(UOI_NAME)` == `"Winlogon"` (secure) vs
`"Default"` (normal) → `Arc<AtomicU8>`. This is the authoritative signal; WTS session notifications
miss UAC entirely. (May also register `WTSRegisterSessionNotification` to short-circuit lock/unlock.)
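
The poll-and-publish shape of the watcher can be sketched with the Win32 lookup injected as a closure, which keeps the loop testable off Windows. All names here are hypothetical; on Windows, `read_name` would wrap `OpenInputDesktop` + `GetUserObjectInformationW(UOI_NAME)`. Treating every name other than `"Winlogon"` as normal is an assumption of this sketch.

```rust
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

pub const DESKTOP_DEFAULT: u8 = 0;
pub const DESKTOP_WINLOGON: u8 = 1;

/// Map the input-desktop name to the published state.
pub fn classify(name: &str) -> u8 {
    if name == "Winlogon" {
        DESKTOP_WINLOGON // secure desktop: UAC / lock / login
    } else {
        DESKTOP_DEFAULT // assumption: anything else counts as normal
    }
}

/// Poll `read_name` every `period` and publish the result into `state`.
/// The loop polls at least once, then exits when `stop` is set.
pub fn spawn_watcher<F>(
    mut read_name: F,
    period: Duration,
    state: Arc<AtomicU8>,
    stop: Arc<AtomicBool>,
) -> JoinHandle<()>
where
    F: FnMut() -> Option<String> + Send + 'static,
{
    thread::spawn(move || loop {
        // A failed lookup keeps the previous state rather than guessing.
        if let Some(name) = read_name() {
            state.store(classify(&name), Ordering::Release);
        }
        if stop.load(Ordering::Relaxed) {
            break;
        }
        thread::sleep(period);
    })
}
```

At 30 to 60 Hz the period would be roughly 16 to 33 ms, so consumers see a desktop switch within one poll interval.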
## Implementation steps (each independently buildable/testable on the 4090)
1. **DesktopWatcher** (`capture/desktop_watch.rs`, ~40 lines): the poll + atomic. Test: lock / trigger
UAC over the existing stream, confirm the atomic flips `Default↔Winlogon` within a poll interval.
2. **SendInput retry-on-failure model** (`inject/sendinput.rs`): replace per-event reattach with the
cached-HDESK + retry-once model. Test: normal input unchanged; click UAC + type the lock password
land (works today via per-event reattach — this is a refactor).
3. **WGC helper subcommand** (`punktfunk1-host wgc-helper` or similar): the existing WGC pipeline → NVENC →
Annex-B AUs over a named-pipe server. Test standalone: as the user it writes a valid `.h265` to the
pipe (capturing the SudoVDA output by GDI name, no topology changes).
4. **Spawn + relay**: SYSTEM host spawns the helper (`CreateProcessAsUserW`), connects the pipe,
relays its AUs onto the live QUIC session. Test: normal-desktop stream sourced via the helper relay.
5. **Source mux**: relay helper AUs while `Default`, switch to the host's own DDA encoder while
`Winlogon` (reusing the DesktopWatcher). Test: normal (WGC, HDR) → trigger UAC → stream shows the
UAC dialog (DDA) → dismiss → back to WGC; QUIC session stays up throughout. **Full-coverage milestone.**
6. **Relaunch watchdog + soak**: `SERVICE_CONTROL_SESSIONCHANGE`-style relaunch of the helper on
console connect/disconnect; soak a few hundred lock/unlock+UAC switches (cf. task #17's 1012-switch
run) — no leak / black / disconnect. Cargo features for the fallback: `Win32_System_Threading`,
`Win32_System_Pipes`, `Win32_System_RemoteDesktop`.
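
The wait-for-IDR latch that the source mux (step 5) applies on every switch can be sketched as a small state machine. Type and method names are hypothetical; the rule it encodes is the one stated in this document: after a switch, drop everything until the now-active source delivers a keyframe.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    HelperRelay, // WGC helper, used on the `Default` desktop
    HostDda,     // host's own DDA encoder, used on `Winlogon`
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Desktop {
    Default,
    Winlogon,
}

pub struct Au {
    pub data: Vec<u8>,
    pub pts_ns: u64,
    pub keyframe: bool,
}

pub struct SourceMux {
    active: Source,
    wait_for_idr: bool,
}

impl SourceMux {
    pub fn new() -> Self {
        // Start latched so the very first forwarded AU is a keyframe.
        Self { active: Source::HelperRelay, wait_for_idr: true }
    }

    /// Feed the watcher state. Returns the newly active source on a switch;
    /// the caller should then ask that source for a keyframe.
    pub fn on_desktop(&mut self, desktop: Desktop) -> Option<Source> {
        let wanted = match desktop {
            Desktop::Default => Source::HelperRelay,
            Desktop::Winlogon => Source::HostDda,
        };
        if wanted == self.active {
            return None;
        }
        self.active = wanted;
        self.wait_for_idr = true;
        Some(wanted)
    }

    /// Decide whether an AU coming from `from` is forwarded to the send path.
    pub fn accept(&mut self, from: Source, au: &Au) -> bool {
        if from != self.active {
            return false; // stale AU from the inactive source
        }
        if self.wait_for_idr {
            if !au.keyframe {
                return false; // decoder could not use it yet
            }
            self.wait_for_idr = false;
        }
        true
    }
}
```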
## Risks / notes
- Validate on the real 4090 only (`ssh "Enrico Bühler"@192.168.1.174`, Session 1 via the Interactive
scheduled task) — the headless build VM can't reproduce Winlogon-on-virtual-display or WGC.
- The helper MUST capture the SudoVDA by GDI name and never create a second virtual output (avoids the
ACCESS_LOST born-lost storm — one isolate owner = the SYSTEM host).
- Confirm `reisolate` fires on a FRESH mid-session DDA open at the desktop boundary (task #17 only
validated DDA recovery within an already-DDA session).
- Brief one-frame repeat/flicker at the WGC↔DDA boundary is acceptable (the local lock/UAC transition
flickers too); never starve the encoder (repeat last frame across the swap gap).
- Pragmatic alternative if full coverage isn't worth the build: `PromptOnSecureDesktop=0` (UAC renders
on the normal desktop → WGC captures it) covers UAC (not lock/login) with one reversible registry
change.
@@ -0,0 +1,110 @@
# Windows service (deployment)
The `PunktfunkHost` Windows service is the end-user way to run the host on Windows. It replaces the
manual bring-up chain (a scheduled task → `PsExec64 -s -i 1` → `wscript launch.vbs` → `host-run.cmd`)
with one command, auto-start on boot, and supervision.
## Install (installer — recommended)
Download the signed installer from the package registry
(`punktfunk-host-windows`, <https://git.unom.io/unom/-/packages>) and run it (it elevates itself):
```
punktfunk-host-setup-<ver>.exe # wizard
punktfunk-host-setup-<ver>.exe /VERYSILENT # unattended
```
It lays the host into `C:\Program Files\punktfunk`, optionally installs the bundled **SudoVDA**
virtual-display driver, then runs `service install` + `service start` for you. Upgrades stop the
service first and re-point it; uninstall (Add/Remove Programs) runs `service uninstall`. Packaging
details: [`packaging/windows/README.md`](../packaging/windows/README.md). A self-signed CI build also
publishes a `.cer` — import it once (`Import-Certificate -FilePath punktfunk-host-windows.cer
-CertStoreLocation Cert:\LocalMachine\TrustedPublisher`) so Windows trusts the signed setup.
## Install (manual / CLI)
From an **elevated** (Administrator) prompt:
```powershell
punktfunk-host service install # register auto-start LocalSystem service + firewall rules + default host.env
punktfunk-host service start # start it now (also starts automatically on every boot)
```
`service install` is idempotent — run it again after upgrading the exe to re-point the service at the
new binary. Register whatever location you keep the exe in (e.g. `C:\Program Files\punktfunk\`); the
service records the current exe path.
Other subcommands:
```powershell
punktfunk-host service stop
punktfunk-host service status
punktfunk-host service uninstall # stop + delete the service + remove its firewall rules
```
## How it works
The host must run **as SYSTEM in the interactive session** (Session 1+): Desktop Duplication of the
secure desktop (UAC / lock / login) and `SendInput` need SYSTEM, and capture/injection need the
interactive session, which a plain Session-0 service is not in.
So the service (itself in Session 0) **never captures**. On start, and whenever the active console
session changes, it:
1. resolves the active console session (`WTSGetActiveConsoleSessionId`),
2. duplicates its own LocalSystem token and retargets it to that session (`SetTokenInformation`
with `TokenSessionId`),
3. launches the host there with `CreateProcessAsUserW` (`lpDesktop = winsta0\default`),
4. supervises it: relaunches on exit/crash (with backoff) and on a console connect/disconnect.
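
The "relaunch with backoff" part of step 4 could look like the sketch below. The doubling schedule, the `Backoff` type, and the concrete durations in the usage note are assumptions; the document only states that a backoff exists.

```rust
use std::time::Duration;

/// Exponential relaunch delay: `base`, doubled per consecutive failure, capped at `max`.
pub struct Backoff {
    base: Duration,
    max: Duration,
    failures: u32,
}

impl Backoff {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, failures: 0 }
    }

    /// Delay to wait before the next relaunch attempt.
    pub fn next_delay(&mut self) -> Duration {
        let shift = self.failures.min(16); // keep the multiplier within u32
        self.failures = self.failures.saturating_add(1);
        self.base.saturating_mul(1u32 << shift).min(self.max)
    }

    /// Call once the child has stayed up long enough to count as healthy.
    pub fn reset(&mut self) {
        self.failures = 0;
    }
}
```

With a 500 ms base and an 8 s cap, for example, a crash-looping host is retried after 0.5 s, 1 s, 2 s, 4 s and then every 8 s. A relaunch caused by a console session change is not a failure, so it would normally skip the delay.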
A kill-on-close **job object** ensures a service crash never orphans the SYSTEM host. The host in turn
spawns the WGC helper into the *user* session (see [`windows-secure-desktop.md`](windows-secure-desktop.md))
— two nested launches. Lock/unlock are handled inside the host (the `DesktopWatcher` DDA↔WGC mux), so
the service deliberately does **not** relaunch on lock/unlock — only on a real session switch.
This is the same model Sunshine/Apollo use.
## Configuration
Config lives in **`%ProgramData%\punktfunk\host.env`** (KEY=VALUE lines, `#` comments). `service
install` writes a default if none exists. Template: [`scripts/windows/host.env.example`](../scripts/windows/host.env.example).
```ini
PUNKTFUNK_ENCODER=nvenc
PUNKTFUNK_VIDEO_SOURCE=virtual
PUNKTFUNK_SECURE_DDA=1
RUST_LOG=info
# PUNKTFUNK_HOST_CMD=serve --gamestream # the host subcommand the service launches (default: native + Moonlight)
```
The service loads these into its environment and carries `PUNKTFUNK_*` + `RUST_LOG` to the host child
(the same env-merge the WGC helper uses). Restart the service after editing:
```powershell
punktfunk-host service stop; punktfunk-host service start
```
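
A sketch of the load-and-forward behaviour described above: parse `KEY=VALUE` lines, skip blanks and `#` comment lines, and keep only `PUNKTFUNK_*` plus `RUST_LOG` for the child. Function names are hypothetical, and how the real loader treats quoting or trailing inline comments is not specified here, so this sketch keeps the value verbatim after trimming.

```rust
use std::collections::BTreeMap;

/// Parse `host.env`-style text: KEY=VALUE lines, `#` comment lines, blanks ignored.
pub fn parse_env(text: &str) -> BTreeMap<String, String> {
    let mut vars = BTreeMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split on the first '=' only, so values may contain '='.
        if let Some((key, value)) = line.split_once('=') {
            let key = key.trim();
            if !key.is_empty() {
                vars.insert(key.to_string(), value.trim().to_string());
            }
        }
    }
    vars
}

/// Keep only the variables that are carried over to the host child.
pub fn forwarded(vars: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    vars.iter()
        .filter(|(key, _)| key.starts_with("PUNKTFUNK_") || key.as_str() == "RUST_LOG")
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}
```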
The host's identity (cert/pairing/mgmt token/library) also lives under `%ProgramData%\punktfunk` — a
machine-wide dir the SYSTEM service and the interactive user share, surviving user logout.
`PUNKTFUNK_CONFIG_DIR` overrides the location (both platforms; handy for tests).
## Logs
- `%ProgramData%\punktfunk\logs\service.log` — the service's own supervision log (spawn/exit/session
switches).
- `%ProgramData%\punktfunk\logs\host.log` — the host child's stdout/stderr.
## Prerequisites
- The host built with `--features nvenc` for NVENC (the driver ships `nvEncodeAPI64.dll`; no SDK
needed at runtime). Software encode otherwise.
- The **SudoVDA** indirect display driver installed (for `PUNKTFUNK_VIDEO_SOURCE=virtual`).
- **ViGEmBus** for virtual gamepads (optional).
## Gotchas
- `service install`/`uninstall` need an **elevated** prompt (the SCM rejects non-admin).
- `service run` is the SCM entry point — don't run it by hand (it errors with a hint).
- A **graceful** stop currently `TerminateProcess`es the host, so its RAII teardown (SudoVDA monitor
REMOVE) doesn't run; a stale virtual monitor can linger until the next start. A cooperative-stop
signal is a follow-up.
@@ -0,0 +1,497 @@
# Windows virtual display — a Rust port of SudoVDA (investigation & plan)
Status: **P1 done — `pf-vdisplay` validated streaming on glass at 5120×1440@240** (2026-06-22). The
all-Rust IddCx driver replaces the vendored **SudoVDA** C++ driver, matching the "all-Rust UMDF, zero
external driver deps" direction we finished for gamepads (ViGEmBus gone; DualSense/DS4/XUSB shipped).
The investigation/plan below is kept for context; see **Validated on-box** for the result.
## TL;DR
A Rust port is **feasible, low-on-blockers, and strategically aligned** — and there's an unexpected
architectural prize beyond "same thing, in Rust."
- **Signing is not a blocker.** An IddCx driver is UMDF *user-mode*; it needs **no WHQL, no
attestation, no test-signing**. A self-signed cert in LocalMachine `Root` + `TrustedPublisher`
loads it — **exactly the model our gamepad drivers already ship** (and exactly what SudoVDA and the
other forks do). ([Do UMDF drivers require signing?](https://learn.microsoft.com/en-us/archive/blogs/peterwie/do-umdf-drivers-require-signing))
- **We would not be first in Rust.** [`MolotovCherry/virtual-display-rs`](https://github.com/MolotovCherry/virtual-display-rs)
is a complete, shipping **IddCx driver written in Rust** (MIT), with hand-rolled IddCx/WDF bindgen
bindings (`wdf-umdf-sys` + `wdf-umdf`) and a reference swap-chain processor. This turns "greenfield
FFI" into "adapt a proven reference."
- **The prize: we can stop using DXGI Desktop Duplication.** An IddCx driver already *receives* the
composited desktop frames in its swap-chain. [Looking Glass](https://deepwiki.com/gnif/LookingGlass/2.5-indirect-display-driver-(idd))
ships exactly this in production — driver consumes the swap-chain, hands frames to a separate
process, "operates entirely independently of DDA." Doing the same would **delete an entire class of
multi-GPU bugs** the current `capture/dxgi.rs` is built to survive (ACCESS_LOST storms,
MODE_CHANGE_IN_PROGRESS, the `win32u.dll` reparenting patch).
Recommendation: **yes, build it in Rust**, in phases — a drop-in DDA-compatible driver first (own the
stack at low risk), then the direct-frame-push path (the real cleanup). Keep vendoring SudoVDA as the
safe interim until the Rust driver is on-glass-validated on the RTX box.
## Validated on-box (2026-06-22)
Before committing, the toolchain + load path were proven on the RTX box (Win11 26200, WDK 26100):
- **A Rust IddCx driver builds with our toolchain.** Cloned [`virtual-display-rs`](https://github.com/MolotovCherry/virtual-display-rs)
and built its driver `.dll` against our WDK (UMDF 2.31 + IddCx 1.4 stubs, bindgen over `IddCx.h` via
our LLVM, nightly-2024-07-26). One fix needed: its `build.rs` picked the **max** SDK Lib version
(`10.0.28000.0`, a base SDK with no IddCx) for the `IddCxStub` search path; resolving it by the
version that actually contains `um\x64\iddcx\1.4` (`10.0.26100.0`, the WDK) fixed the link.
- **It installs self-signed and loads.** Signed `.dll`/`.cat` with our existing driver cert (the
gamepad `punktfunk-ds-test`), `pnputil /add-driver`, root devnode via `devgen`. The device came up
**Status OK / CM_PROB_NONE**, Class Display, hosted by `WUDFRd` — a Rust IddCx adapter initialized
cleanly. (SudoVDA, already live here, independently confirms IddCx + self-signed UMDF work on this
box.) Test artifacts removed afterward; SudoVDA untouched.
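The `build.rs` fix from the first bullet comes down to "newest version that actually has IddCx" instead of "newest version". A sketch under stated assumptions: `pick_sdk_version` and the `has_iddcx` probe are hypothetical names, and a real build script would implement the probe as a directory-exists check for `um\x64\iddcx\1.4` under each SDK Lib version.

```rust
/// Pick the newest SDK version that ships the IddCx stub libs, instead of
/// blindly taking the maximum version.
/// `has_iddcx` is a caller-supplied probe (hypothetical); in a build script it
/// would test whether `<Lib>\<version>\um\x64\iddcx\1.4` exists on disk.
pub fn pick_sdk_version<'a>(
    versions: &[&'a str],
    has_iddcx: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    versions
        .iter()
        .copied()
        .filter(|v| has_iddcx(v))
        .max_by_key(|v| parse_version(v))
}

/// Split "10.0.26100.0" into numeric parts so versions compare numerically
/// rather than as strings. Non-numeric parts count as 0.
fn parse_version(v: &str) -> Vec<u32> {
    v.split('.').map(|p| p.parse().unwrap_or(0)).collect()
}
```

Passing the probe as a closure keeps the selection logic testable without touching the filesystem.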
**Conclusion:** the central risk ("can we build + load a Rust IddCx driver here?") is retired. The
binding question (D2) resolves toward **reusing `virtual-display-rs`'s self-contained `wdf-umdf-sys` +
`wdf-umdf` bindgen crates** (now proven to build + load on our box) rather than extending
`windows-drivers-rs` — IddCx functions are direct `IddCxStub` exports the WDF function-table macro
can't reach anyway, so a unified bindgen is the cleaner base for `pf-vdisplay`. Reference clone kept at
`C:\Users\Public\virtual-display-rs`.
**Scaffold + driver logic landed + on-glass:** `packaging/windows/vdisplay-driver/` — vendored
`wdf-umdf-sys`/`wdf-umdf` (MIT, + the SDK-version build.rs fix) + the `pf-vdisplay` driver crate. The
full IddCx driver is ported (entry → `IDD_CX_CLIENT_CONFIG` with all 7 callbacks → device/monitor
context → our own EDID → a real swap-chain drain), with the IPC/serde/`tokio` stack replaced by an
in-tree `monitor` model and `OutputDebugString` logging. **Validated on the RTX box:** built, signed
(our `punktfunk-ds-test` cert), installed, loaded **Status OK**, and **a real virtual monitor arrived**
("VirtuDisplay+", `DISPLAY\CHY0000`): our own all-Rust IddCx virtual display creating a monitor.
**IOCTL control plane done + on-glass (P1 functionally complete):** the SudoVDA-compatible control
plane is implemented (`EVT_IDD_CX_DEVICE_IO_CONTROL` + the `{e5bcc234-…}` interface registered via
`WdfDeviceCreateDeviceInterface`; `control.rs` with byte-identical structs) — `ADD` a monitor at a
requested mode → `{LUID, target_id}` (target id + adapter LUID captured from `IDARG_OUT_MONITORARRIVAL`),
`REMOVE` by GUID, `PING`/`GET_WATCHDOG` watchdog, `GET_VERSION`, `SET_RENDER_ADAPTER`
(`IddCxAdapterSetRenderAdapter`); per-`ADD` mode injection (requested mode preferred + fallbacks). Added
the five missing FFI wrappers to the vendored `wdf-umdf`. **Validated on the RTX box** with a probe
that mimics `vdisplay/sudovda.rs` exactly: `GET_VERSION → 0.2.1`, `GET_WATCHDOG → timeout=3`,
`ADD 1920×1080@60 → target_id=257 + adapter LUID`, a real "VirtuDisplay+" monitor arrived at the
requested mode, `REMOVE` ok. **Constraint:** pf-vdisplay can't coexist with SudoVDA — they register the
same interface GUID, so two IddCx adapters claiming it → `FAILED_POST_START`; pf-vdisplay *replaces*
SudoVDA (validated by disabling SudoVDA first).
**Watchdog + real-host drive validated:** added the watchdog thread (1 Hz countdown reset by any IOCTL;
tears down all monitors at 0 so a gone host never leaves a phantom display; mirrors SudoVDA's
`RunWatchdog`). Pointed the **real host** at it — removed SudoVDA's devnode so pf-vdisplay is the sole
`{e5bcc234}` provider, then ran the host's `vdisplay::sudovda::tests::live_create_drop`
(`PUNKTFUNK_SUDOVDA_LIVE=1`): **test passed**, and the pf-vdisplay log shows the host's IOCTLs landing —
`ADD 1920x1080@60 → target_id=258, luid=…02619823`, then the watchdog correctly tore the monitor down
when the test process exited without a final REMOVE. So `vdisplay/sudovda.rs` drives pf-vdisplay
unchanged through the full control contract.
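The watchdog behaviour described above (1 Hz countdown, reset by any IOCTL, teardown at 0) can be modelled as a small state machine. This is a sketch, not the driver's code: the `Watchdog` type and its method names are illustrative, and the real thread, locking, and monitor teardown are omitted.

```rust
/// Countdown watchdog: one tick per second, any IOCTL resets it, and reaching
/// zero means "tear down all monitors".
pub struct Watchdog {
    timeout_secs: u32,
    remaining: u32,
}

impl Watchdog {
    pub fn new(timeout_secs: u32) -> Self {
        Self { timeout_secs, remaining: timeout_secs }
    }

    /// Called from the IOCTL handler for every request (PING included).
    pub fn reset(&mut self) {
        self.remaining = self.timeout_secs;
    }

    /// Called once per second by the watchdog thread.
    /// Returns `true` exactly once, on the tick where the countdown hits zero,
    /// so the caller tears the monitors down a single time.
    pub fn tick(&mut self) -> bool {
        if self.remaining == 0 {
            return false; // already expired; wait for the next reset
        }
        self.remaining -= 1;
        self.remaining == 0
    }
}
```

With the `timeout=3` reported by `GET_WATCHDOG`, a host that stops issuing IOCTLs loses its monitors on the third tick, which matches the teardown seen when the test process exited without a final `REMOVE`.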
**Validated streaming end-to-end on glass (2026-06-22) — P1 complete.** pf-vdisplay is a working
SudoVDA replacement. Driven by the **real host** (`serve`, the LocalSystem service) with a stock client
at **5120×1440@240**: the monitor arrives, `resolve_gdi_name → \\.\DISPLAY10`, `set_active_mode` +
CCD-isolate succeed, the DXGI output resolves **under the RTX 4090**, WGC capture + NVENC run at
**steady 240 fps, ~2.4 ms encode**, 6512 AUs sent, clean teardown (`isolate restored rc=0x0`). Same
`vdisplay/sudovda.rs` path, unchanged — full parity with SudoVDA.
**The earlier "monitor arrives but never gets a swap-chain / no DXGI output" symptoms were a
measurement + state artifact, not a driver bug.** Two traps cost a lot of time:
1. **Session 0.** Every standalone probe (`vdtest`, the host's `live_create_drop` test) ran in
**Session 0** — the services session, whose desktop is a throwaway **1024×768** basic display. IddCx
activation happens in the **console Session 1**, where the 4090 drives the real desktop. So
`Screen.AllScreens`/CCD queries from Session 0 *can never* see the virtual monitor activate — they
report the wrong desktop. The only valid way to drive + observe it is the **host service** (SYSTEM,
which targets Session 1) plus the driver's own `OutputDebugString` (system-wide, session-agnostic).
2. **Accumulated device-state damage.** Repeated reinstalls + `Disable`/`Enable-PnpDevice` cycles +
a control handle the host **cached across all of it** left the device tree wedged (stale handle →
the host's PINGs fail → the 3 s watchdog tears the monitor down mid-session → capture opens a dying
display → "no DXGI output"). **A reboot cleared it and it worked on the first connect.** Lesson:
after device churn, restart the host service (fresh handle) — and when in doubt, reboot.
The swap-chain processor is a **faithful port of virtual-display-rs's** (it drains correctly via
`ReleaseAndAcquireBuffer` + `FinishedProcessingFrame` — the drain is *required*; a true no-op would
stall DWM and freeze the captured image). The EDID is our **own clean 128-byte block** (manufacturer
`PNK`, product `punktfunk`) — no SudoVDA bytes.
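Two parts of an EDID base block are mechanical enough to sketch: the packed three-letter manufacturer ID and the trailing checksum. The code below is an illustration of the EDID encoding rules, not the driver's `edid.rs`. `skeleton_block` is a hypothetical helper that fills in only the header, manufacturer ID, and checksum, so its output is not a complete, valid EDID.

```rust
/// Pack a three-letter PnP manufacturer ID into the two big-endian bytes at
/// EDID offsets 8 and 9. Each letter takes 5 bits, with 'A' = 1 ... 'Z' = 26.
pub fn encode_manufacturer(id: &str) -> Option<[u8; 2]> {
    let b = id.as_bytes();
    if b.len() != 3 || !b.iter().all(|c| c.is_ascii_uppercase()) {
        return None;
    }
    let v = b
        .iter()
        .fold(0u16, |acc, c| (acc << 5) | u16::from(c - b'A' + 1));
    Some(v.to_be_bytes())
}

/// Set byte 127 so that all 128 bytes sum to 0 modulo 256.
pub fn fix_checksum(block: &mut [u8; 128]) {
    let sum = block[..127].iter().fold(0u8, |a, b| a.wrapping_add(*b));
    block[127] = 0u8.wrapping_sub(sum);
}

/// Hypothetical helper: header + manufacturer ID + checksum only.
/// Version, display parameters, and descriptors are left zeroed here.
pub fn skeleton_block(manufacturer: &str) -> Option<[u8; 128]> {
    let mut block = [0u8; 128];
    block[..8].copy_from_slice(&[0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00]);
    block[8..10].copy_from_slice(&encode_manufacturer(manufacturer)?);
    fix_checksum(&mut block);
    Some(block)
}
```

For `PNK` the packed ID is `0x41 0xCB` (P=16, N=14, K=11).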
**Build gotcha (important for iterating):** updating an installed UMDF driver only takes if the INF
**DriverVer changes**. `deploy-dev.ps1` stamps a date.time `-v` on every run; without a bump the old
binary keeps running (silently). **Devnode hygiene:** create the root devnode with
`nefconc --create-device-node` (a clean `ROOT\DISPLAY` node), NOT `devgen /add` — devgen makes
**persistent `SWD\DEVGEN` software devices** that survive reboot *and* registry deletion and resurrect
on every `pnputil /add-driver` (they have `hwid root\pf_vdisplay`, so the driver install re-materializes
them). The production installer must use a single `nefconc`/INF-created node and never `devgen`.
## P2 — direct frame push (kill DDA): design & decision record
Status: **in progress.** P1 ships frames the old way (the driver drains its swap-chain and DDA/WGC
re-captures the composited desktop). P2 makes the driver *publish* each swap-chain frame to the host
directly, so we can retire Desktop Duplication and its multi-GPU survival code. Built behind
`PUNKTFUNK_IDD_PUSH`, A/B'd against DDA, and only then made the default.
### The decisive finding: producer and consumer are both in Session 0
The whole transport design hinged on one unknown — same-session or cross-session? **Measured on the
RTX box (2026-06-22):** the pf-vdisplay host process is `WUDFHost.exe` with
`-DeviceGroupId:pfVDisplayGroup`, running in **Session 0**; the punktfunk host service is `LocalSystem`,
also **Session 0**. So the swap-chain processor thread (spawned by our own `thread::spawn` inside the
driver, i.e. in `WUDFHost`) and the encoder live in the **same session**. This is the easy case:
- A D3D11 **shared keyed-mutex texture** created in the driver can be opened by name in the host with
`ID3D11Device1::OpenSharedResourceByName` — both devices created on the **same render-adapter LUID**
(which the driver already reports out of the `ADD` IOCTL via `OsAdapterLuid`, surfaced as
`WinCaptureTarget::adapter_luid`).
- Named kernel objects resolve through Session 0's shared `\BaseNamedObjects`, so **no `Global\`
prefix / `SeCreateGlobalPrivilege` gymnastics** are needed (kept the names unprefixed; documented
that this relies on both processes being Session 0). The Looking-Glass cross-*VM* shared-memory
device is unnecessary — this is cross-*process*, same-session, on one GPU.
This collapses the "Session-0 cross-process transport is the long pole" risk from the original plan.
### Transport: a ring of shared keyed-mutex textures + a metadata header + an event
A single ping-pong keyed mutex would couple the driver's present rate to the host's consume rate — and
**the swap-chain thread must never block** (a stalled `IddCxSwapChainReleaseAndAcquireBuffer`/processing loop
freezes DWM compositing system-wide). So, the Looking-Glass shape — multiple frame buffers, newest
wins:
- **Ring** of `N` (default 3) shared textures, `RESOURCE_MISC_SHARED_NTHANDLE |
SHARED_KEYEDMUTEX`, fixed size for the session. A **generation** counter bumps on a mode change
(resize): the driver tears down + recreates the ring at the new size, the host notices the
generation change and re-opens.
- **Named metadata header** (`CreateFileMapping`): `{magic, version, generation, width, height,
dxgi_format, ring_len, latest}` where `latest` packs `{write_index, monotonic sequence}` published
*after* the copy completes. Plain (unprefixed) names — Session-0 shared namespace.
- **Frame-ready auto-reset event** so the consumer waits instead of spinning.
- **Producer (driver, per acquired frame):** pick `(latest_index + 1) % N`; **try**-acquire that
slot's keyed mutex with a 0 ms timeout (if the host still holds it — rare with 3 slots — reuse the
current slot or skip, **never block**); `CopyResource` the acquired `MetaData.pSurface` into the
slot; release the mutex; publish `{index, ++seq}`; `SetEvent`. Then `FinishedProcessingFrame` as
today.
- **Consumer (host `IddPushCapturer`):** `WaitForSingleObject(event, timeout)`; read `latest`; if `seq`
advanced, acquire that slot's mutex, `CopyResource` into an owned NVENC-input texture, release, yield
`FramePayload::D3d11{texture, device}` — straight into the existing zero-copy NVENC path. No DDA, no
CPU readback.
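The producer/consumer protocol above can be modelled without any D3D at all, which is useful for reasoning about the never-block rule. In this sketch an `AtomicBool` stands in for each slot's keyed mutex and an `AtomicU64` for the header's `latest` field. The bit layout of `latest` (sequence in the upper 56 bits, slot index in the low 8) is an assumption, the frame-ready event is left out, and on contention the producer takes the "skip" option rather than reusing the current slot.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical packing: sequence in the upper 56 bits, slot index in the low 8.
pub fn pack_latest(index: u8, seq: u64) -> u64 {
    (seq << 8) | u64::from(index)
}

pub fn unpack_latest(v: u64) -> (u8, u64) {
    ((v & 0xFF) as u8, v >> 8)
}

pub struct Ring {
    held: Vec<AtomicBool>, // stands in for each slot's keyed mutex
    latest: AtomicU64,     // stands in for the `latest` header field
}

impl Ring {
    pub fn new(len: usize) -> Self {
        assert!((1..=256).contains(&len));
        Ring {
            held: (0..len).map(|_| AtomicBool::new(false)).collect(),
            latest: AtomicU64::new(pack_latest(0, 0)),
        }
    }

    /// Zero-timeout acquire: succeeds only if nobody holds the slot.
    pub fn try_acquire(&self, slot: usize) -> bool {
        self.held[slot]
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn release(&self, slot: usize) {
        self.held[slot].store(false, Ordering::Release);
    }

    /// Producer step: returns the slot published, or `None` if the frame was
    /// skipped because the consumer still holds the next slot. Never blocks.
    pub fn publish(&self) -> Option<usize> {
        let (idx, seq) = unpack_latest(self.latest.load(Ordering::Acquire));
        let next = (idx as usize + 1) % self.held.len();
        if !self.try_acquire(next) {
            return None;
        }
        // The copy of the acquired surface into the slot texture would go here.
        self.release(next);
        // Publish only after the copy has completed.
        self.latest.store(pack_latest(next as u8, seq + 1), Ordering::Release);
        Some(next)
    }

    /// Consumer step: returns `(slot, seq)` if a frame newer than `last_seq`
    /// is available and its slot could be acquired.
    pub fn consume(&self, last_seq: u64) -> Option<(usize, u64)> {
        let (idx, seq) = unpack_latest(self.latest.load(Ordering::Acquire));
        if seq <= last_seq || !self.try_acquire(idx as usize) {
            return None;
        }
        // The copy into the encoder-owned texture would go here.
        self.release(idx as usize);
        Some((idx as usize, seq))
    }
}
```

Because the producer only ever tries the slot after `latest`, a slow consumer costs at most a dropped frame and never stalls the swap-chain loop.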
### What P2 removes vs. keeps
- **Removes:** `capture/dxgi.rs`'s `DXGI_ERROR_ACCESS_LOST`/`MODE_CHANGE_IN_PROGRESS` re-duplication
churn, the legacy-`DuplicateOutput` fallback, and **`install_gpu_pref_hook()` (the `win32u.dll`
patch)** — by **pinning the render adapter to the encoder GPU** (`IddCxAdapterSetRenderAdapter`, the
existing `SET_RENDER_ADAPTER` IOCTL, driven before `ADD`), so the OS never reparents the output and
the shared texture + NVENC share one device by construction.
- **Keeps:** display **topology** (making the virtual display the composited desktop) and the
**watchdog** (now ours). The **two-process WGC secure-desktop relay** stays until we confirm the IDD
push also delivers the secure (Winlogon) desktop; if it does, that retires too.
### On-glass attempt 2026-06-22 — code complete, blocked at driver load
The full transport (driver publisher + host `IddPushCapturer` + render-LUID robustness + in-process
routing) is written and compiles clean. The first on-glass A/B exposed several real things and one
hard blocker:
- **The service captures in a Session-1 WGC helper, not in-process.** `should_use_helper()` returns
true for a SYSTEM service, so it spawns a user-session helper that does capture **and input
injection**. IDD-push must capture **in-process in Session 0** (where the driver publishes) — wired
via `should_use_helper()` returning false for `PUNKTFUNK_IDD_PUSH`. **Caveat:** `SendInput` from
Session 0 can't reach the user's Session-1 desktop, so in-process IDD-push has **no working input**
yet. Production needs either a Session-1 input-only helper, or `Global\`-namespaced shared textures
so a Session-1 helper consumes IDD-push for both video + input.
- **`SET_RENDER_ADAPTER` is ignored by the driver** (the IDD lands on a different adapter than pinned:
observed IDD adapter `0xd60722` vs pinned 4090 `0x15de1`). The render-LUID-in-header path makes the
host bind correctly regardless, but the driver should be made to actually honor the pin (or the host
must copy across adapters) so NVENC gets a 4090 surface.
- **Cursor is included** in the IddCx composited frame (DDA strips it) — so the host-side cursor
compositor (P2.5) is likely unnecessary for this path.
- **`FAILED_POST_START` was a red herring (churn, not the binary).** Comparing the 2157 (works) and
the `frame_transport` DLL import tables: **identical** (same 8 DLLs; the size/hash delta is just the
Authenticode signature). A clean install **+ reboot** (no `restart-device`/`disable-enable`/kill in
between) loads the `frame_transport` driver to **`OK`**. The earlier `FAILED_POST_START` was the
device wedging from the hot-reload churn (the deploy gotchas above). **Lesson: deploy = install +
reboot, full stop.**
- **THE REAL BLOCKER — the driver can't CREATE the shared objects.** With the driver loaded clean and
the monitor active, the host's `IddPushCapturer` still times out: `pfvd-hdr-<target> never appeared`.
The driver's own `OutputDebugString` is invisible (UMDF redirects it to ETW, not DebugView — verified
with a working DBWIN self-test), so a **file-logging** driver build was tried — and it wrote **no
file at all**, even though `init()` runs in `DriverEntry`, the device is `OK`, WUDFHost runs as
`LocalService`, and `C:\Users\Public` is world-writable. **WUDFHost runs with a restricted token: it
can neither write the filesystem nor create named kernel objects** (`CreateFileMappingW`/`CreateEventW`/
`CreateSharedHandle`), so `FramePublisher::new` fails silently. This is exactly why the **gamepad UMDF
drivers invert it**: `inject/dualsense_windows.rs` — *"the host creates the section (privileged → a
permissive SDDL so the WUDFHost can open it); the driver maps it"* — `Global\pfds-shm-<idx>` + SDDL
`D:(A;;GA;;;WD)`. **Fix: invert frame-push to match.** The HOST creates the header + event + ring
textures (`Global\` names, `D:(A;;GA;;;WD)` SDDL); the DRIVER only OPENS them, writes its actual
render LUID + a status code back into the host-created header (so we get driver visibility through the
host log), and runs the copy loop. The host creates the textures on the render adapter the driver
reports.
- **Also unresolved: `SET_RENDER_ADAPTER` appears ignored** (the host's pin to the 4090 vs the ADD-reply
adapter differ every time). The inverted header carries the driver's *actual* render LUID so the host
can create textures + run NVENC on the right adapter — but if that's the iGPU, NVENC (NVIDIA) can't
encode it, so the driver must be made to honor the pin (or the host must cross-adapter copy). Needs its
own investigation.
**Driver deploy gotchas learned (this box):** hot-reloading a UMDF display driver is unreliable —
`pnputil /restart-device` does NOT restart WUDFHost (old image stays mapped), `Disable/Enable-PnpDevice`
errors on the root-enumerated IDD, and **killing WUDFHost invalidates the host's cached `{e5bcc234}`
control handle** (every ADD then fails `0x80070006`, and the device can wedge to `FAILED_POST_START`).
A **reboot** loads a freshly-installed build cleanly. **Recovery** from a broken build is clean and
reboot-free: `pnputil /delete-driver <oemNN>.inf /uninstall` removes the bad package and the device
rebinds the previous (validated) package in the DriverStore — restored 2157 → `OK` immediately.
### On-glass attempt 2 (2026-06-23) — inversion works; in-process Session-0 path is a dead end
Implemented the **inversion** (host creates the header + event + ring textures with the
`D:(A;;GA;;;WD)` SDDL, driver only opens them) + a per-attempt **generation** (kills the
`DXGI_ERROR_NAME_ALREADY_EXISTS` retry collisions) + a fixed-name **`Global\pfvd-dbg` debug channel**
(structured counters the driver writes, since UMDF/ETW + the restricted token block its other logs).
Results on the RTX box:
- ✅ The host **creates the shared ring every time** (`created shared ring … render_luid=…`) — the
privileged-create / restricted-open split is sound.
- ✅ No more name collisions (generation fix).
- ❌ **The driver writes NOTHING**: debug block all zeros, crucially `run_core_entries=0`. The
swap-chain processor **never runs**, i.e. the OS **never assigns a swap-chain** to the virtual
monitor in this path.
**Root cause: an IddCx monitor only gets a swap-chain when something PRESENTS to it, and the in-process
path has no presenter.** The host + the CCD topology-isolate run in **Session 0, which has no DWM /
compositor**. The WGC path works because its capture helper lives in **Session 1**, where DWM composes
the desktop onto the display (that composition is the swap-chain trigger). So in-process Session-0
IDD-push gets no frames to push, full stop — a **fundamental** barrier, not a fixable bug. The original
plan's "Session-0 transport is the long pole" was right, but the long pole turned out to be *triggering
presentation*, not the shared-memory mechanics (those work).
**Consequence:** the only viable IDD-push shape is **option 3 — a Session-1 helper drives presentation +
consumes the `Global\` ring** (the inversion built here is exactly what it needs). But it carries an
unretired risk: it's still unproven whether the swap-chain gets assigned even with a Session-1 consumer
that isn't WGC. Until that's answered, **DDA/WGC stays the shipping Windows capture path** — it works.
All the IDD-push code (driver open-side + host create-side + debug channel) is written, compiles, and is
gated behind `PUNKTFUNK_IDD_PUSH` (off), so it's dormant and harmless.
### CONCLUSION (2026-06-23): IDD-push is not viable for bare-metal capture — the swap-chain is never assigned
After the inversion + a fixed-name debug channel + a host-created-ring observer + an autonomous
loopback test harness (`punktfunk-probe` → the SYSTEM service, paired via the mgmt API), the question
"does the driver's swap-chain processor ever run?" was answered **definitively: no.** The driver's
`run_core` is **never entered**: `run_core_entries=0` in *every* configuration tested:
- in-process (Session 0) and WGC-triggered (Session 1 helper) sessions,
- a user-created ring AND a host-created (LocalSystem) ring with a permissive `D:(A;;GA;;;WD)` SDDL,
- with and without a Low-IL (`S:(ML;;NW;;;LW)`) mandatory label,
- with WUDFHost confirmed **not** an AppContainer (`IsAppContainer=0`),
— even while WGC simultaneously captured the same virtual monitor's composition and streamed multi-MB
of HEVC. The gamepad UMDF drivers prove a UMDF driver *can* open + write a host-created `Global\`
section on this box, so the driver writing nothing is **not** an access problem — `run_core` simply
does not run.
**Root cause (researched + ecosystem-confirmed):** an IddCx virtual monitor only receives a swap-chain
(`EVT_IDD_CX_MONITOR_ASSIGN_SWAPCHAIN`) when the OS **presents/scans-out** to it, which requires a real
presentation consumer. **WGC/DDA capture of the composed desktop does NOT count** — it reads DWM's
composition, bypassing the driver's swap-chain. With no physical scanout and no consumer that routes
*through the driver*, the path stays inactive (`IDDCX_PATH_FLAGS=0`) and `ASSIGN_SWAPCHAIN` never fires.
Confirming evidence:
- **Every bare-metal virtual-display capture project uses WGC/DDA, not the driver swap-chain:** SudoVDA
(its swap-chain loop acquires-and-discards), Apollo/Sunshine (DDA + WGC backends), virtual-display-rs
(discards), parsec-vdd (no frame path). Only **Looking Glass** consumes the driver swap-chain — and
only because a **VM guest scans out** the display (the consumer). We have no equivalent on bare metal.
- Microsoft's own unanswered Q&A (learn.microsoft.com/answers 4096179) reports the identical symptom for
the IddSampleDriver: virtual display "always inactive," `ASSIGN_SWAPCHAIN` never runs.
**Verdict:** the "driver consumes its swap-chain and pushes frames" architecture (P2 / Looking-Glass
style) **cannot get frames** for punktfunk's bare-metal, whole-desktop, capture-only use case. The
shared-memory transport machinery (host-creates / driver-opens, the gamepad pattern) is all sound and
proven to *create*, but there is nothing for the driver to publish. **DDA/WGC remains the only viable
Windows capture path**, which is exactly what the entire ecosystem does. The IDD-push code stays
in-tree, compiles, and is gated `off` (`PUNKTFUNK_IDD_PUSH`) — dormant and harmless — documenting the
attempt so it isn't re-tried. "Better performance/lower overhead" must come from optimizing the WGC/DDA
path (e.g. trimming the Session-0↔Session-1 relay, zero-copy encode), not from IDD-push.
The only unexplored avenue is **driver-side** (a different adapter/monitor/path setup that might make the
OS treat the virtual display as a presentation target) — but it needs a reboot to test, the MS Q&A
suggests it's unsolved, and the unanimous ecosystem choice of WGC/DDA argues it's a dead end.
**Final exhaustion (2026-06-23, follow-up): both remaining avenues closed.**
- **Option 3 (present source) — TESTED, failed.** Added a present-trigger to the Session-1 WGC helper:
it successfully created a D3D11 swapchain on the virtual display and presented continuously (WGC even
captured the flashing window). The driver stayed `run_core_entries=0` / `frames_acquired=0`. So an
active *present source* on the display does NOT make the OS assign the driver's swap-chain either —
DWM composes the present onto the display (capturable) without routing it through the driver's
swap-chain.
- **Option 2 (driver flag) — closed by analysis.** The present-trigger succeeding proves the **path is
already active** (a swapchain presents to the display fine); the missing piece is **scanout routed
through the driver**, which the OS does only for a real consumer (physical display / VM guest / RDP).
The one IddCx flag for that — `IDDCX_ADAPTER_FLAGS_REMOTE_SESSION_DRIVER` — requires the **RDP
protocol stack** as the consumer, which bare-metal console capture has no equivalent of.
**Verdict is final:** IDD-push needs a presentation consumer (scanout / VM guest / RDP) that bare-metal
console desktop-capture fundamentally cannot provide. No host-side capture, no in-process path, no
present source, and no available driver flag overcomes it. WGC (normal desktop) + DDA (secure desktop)
is the only viable Windows capture path — as the entire ecosystem already does. The IDD-push +
present-trigger code stays in-tree, gated off, as the documented record of the attempt.
### Known gaps the build-out must close (tracked as P2.* tasks)
- **Cursor.** DDA/WGC composite the HW cursor host-side from frame-info; the IDD path delivers the
cursor separately (`IddCxMonitorSetupHardwareCursor` event → `QueryHardwareCursor`). The prototype
may ship cursor-less; the build-out wires the IDD cursor into the existing `CursorCompositor`.
- **HDR.** The default IddCx swap-chain surface is 8-bit `B8G8R8A8`; FP16/HDR needs the **IddCx 1.11
D3D12 acquire path** (`SetDevice2`/`ReleaseAndAcquireBuffer2` → `ID3D12Resource`). Build against
1.10, runtime-gate 1.11. SDR-only for the prototype.
## Why we'd do this
The user's goals, mapped to outcomes:
| Goal | Outcome |
| --- | --- |
| Drop external deps | No more vendored prebuilt SudoVDA `.dll`/`.cat` (third-party, C++, single upstream). |
| Increase Rust coverage | The display driver joins the gamepad drivers as in-tree Rust UMDF. |
| Own the stack / easier display management | We control the IOCTL protocol, the EDID, the mode list, the watchdog — and can fold the topology/mode logic that's currently scattered in `vdisplay/sudovda.rs` into the driver. |
| Cleaner code | Phase 2 retires `capture/dxgi.rs`'s DDA workarounds + the `win32u.dll` patch. |
## What we'd be replacing (current architecture)
- **Driver:** SudoVDA — UMDF2 IddCx, `Class=Display`, `UmdfExtensions=IddCx0102`,
`UpperFilters=IndirectKmd`, root-enumerated `Root\SudoMaker\SudoVDA`. Vendored prebuilt under
`packaging/windows/sudovda/`, installed by `install-sudovda.ps1` (cert → `nefconc` devnode →
`pnputil`). Source is public ([SudoMaker/SudoVDA](https://github.com/SudoMaker/SudoVDA), README-only
MIT/CC0 grant over the MS sample, ~1,900 LOC C++).
- **Host contract:** `crates/punktfunk-host/src/vdisplay/sudovda.rs` opens the control device by
interface GUID `{e5bcc234-…}` and drives a tiny `METHOD_BUFFERED` IOCTL protocol — byte-identical to
SudoVDA's `Common/Include/sudovda-ioctl.h`:
- `ADD (0x800)` `{w,h,refresh,GUID,name[14],serial[14]}` → `{LUID, target_id}`
- `REMOVE (0x801)` `{GUID}` · `SET_RENDER_ADAPTER (0x802)` `{LUID}` · `GET_WATCHDOG (0x803)` ·
`PING (0x888)` (mandatory keepalive) · `GET_VERSION (0x8FF)`
- **Capture:** `capture/dxgi.rs` finds the virtual monitor's GDI output **across all adapters** (it's
enumerated under the *rendering* GPU, not SudoVDA's LUID) and runs **DXGI Desktop Duplication**
(`DuplicateOutput1`, FP16 for HDR). This file is **dominated by virtual-display-over-DDA survival
code**: `DXGI_ERROR_ACCESS_LOST` re-duplication with retries, `MODE_CHANGE_IN_PROGRESS` backoff,
legacy-`DuplicateOutput` fallback, CCD display isolation to make the IDD the sole composited
desktop, and an **`install_gpu_pref_hook()` that patches `win32u.dll!NtGdiDdDDIGetCachedHybridQueryValue`**
to stop DXGI reparenting the output across GPUs. Most of that exists *because* we capture a virtual
display via DDA on a multi-GPU box.
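The function codes in the host contract above become full IOCTL values through the standard Windows `CTL_CODE` composition. The sketch below shows that arithmetic. The device type (`FILE_DEVICE_UNKNOWN`) and access (`FILE_ANY_ACCESS`) are assumptions made for illustration; `sudovda-ioctl.h` is the authority on what the driver actually uses. `VdaFunction` and `ioctl` are hypothetical names.

```rust
/// Windows `CTL_CODE` macro:
/// (DeviceType << 16) | (Access << 14) | (Function << 2) | Method.
pub const fn ctl_code(device_type: u32, function: u32, method: u32, access: u32) -> u32 {
    (device_type << 16) | (access << 14) | (function << 2) | method
}

// Standard winioctl.h constants.
pub const FILE_DEVICE_UNKNOWN: u32 = 0x22;
pub const METHOD_BUFFERED: u32 = 0;
pub const FILE_ANY_ACCESS: u32 = 0;

/// Function codes as listed in the host contract.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum VdaFunction {
    Add = 0x800,
    Remove = 0x801,
    SetRenderAdapter = 0x802,
    GetWatchdog = 0x803,
    Ping = 0x888,
    GetVersion = 0x8FF,
}

/// Assumption: FILE_DEVICE_UNKNOWN + FILE_ANY_ACCESS. Only METHOD_BUFFERED
/// is stated in the text.
pub const fn ioctl(f: VdaFunction) -> u32 {
    ctl_code(FILE_DEVICE_UNKNOWN, f as u32, METHOD_BUFFERED, FILE_ANY_ACCESS)
}
```

Under those assumptions `ADD` comes out as `0x0022_2000`, the familiar value for function `0x800` on an unknown-type device.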
## Feasibility findings
### Signing — green (the make-or-break)
UMDF user-mode ⇒ Code-Integrity signing rules don't apply to our binary (the only kernel piece is
Microsoft's inbox `IndirectKmd`). Self-signed cert in `Root` + `TrustedPublisher` is sufficient on a
normal Secure-Boot Win11 box — no `bcdedit /set testsigning`. SudoVDA and `virtual-display-rs` both
ship this way. This is the **same** model as our DualSense/DS4/XUSB drivers. (The only thing that
breaks install is a botched cert placement, not a signing *tier*.)
### Rust prior art — exists, MIT, reusable
`virtual-display-rs` proves an all-Rust IddCx driver runs in production and gives us:
`wdf-umdf-sys` (bindgen over WDF **and** `iddcx.h`, links `IddCxStub`), `wdf-umdf` (safe wrappers —
`iddcx.rs` ~300 LOC, with an `IddCxIsFunctionAvailable!` version-gate macro), and a reference driver
(`swap_chain_processor.rs` ~158 LOC, `direct_3d_device.rs`, `edid.rs`). **Caveat:** it uses its *own*
bindgen stack, **not** `microsoft/windows-drivers-rs` — see Decision D2.
### windows-drivers-rs IddCx support — absent, but a bounded extension
Our `wdk-sys` (m0) binds Base + WDF + feature-gated subsets (hid/gpio/spb/…). **Zero IddCx symbols.**
Adding it is the same shape as the existing subsets: an `ApiSubset::Iddcx` variant + `iddcx` feature →
`iddcx_headers()` returning `iddcx.h` for bindgen, and linking `IddCx.lib`. IddCx functions are **not**
WDF-table functions, so the `call_unsafe_wdf_function_binding!` macro doesn't apply — they're direct
`IddCx.lib` exports we'd `#[link(name="IddCx")] extern` (or bindgen) and wrap ourselves.
`windows` 0.58 (already in the tree) provides the Direct3D11/Dxgi APIs the swap-chain loop needs.
### The IddCx driver itself: well-understood, ~1-2k LOC
Required callbacks (baselined on the MS [IddSampleDriver](https://github.com/microsoft/Windows-driver-samples/blob/main/video/IndirectDisplay/IddSampleDriver/Driver.cpp), ~1,100 LOC, IddCx 1.4):
`EVT_IDD_CX_ADAPTER_INIT_FINISHED`, `ADAPTER_COMMIT_MODES`, `PARSE_MONITOR_DESCRIPTION`,
`MONITOR_GET_DEFAULT_DESCRIPTION_MODES`, `MONITOR_QUERY_TARGET_MODES`, `MONITOR_ASSIGN_SWAPCHAIN`
(the only callback with real D3D work), `MONITOR_UNASSIGN_SWAPCHAIN`, and `DEVICE_IO_CONTROL` (where
our ADD/REMOVE/PING IOCTLs live). Init flow: `WdfDeviceCreate → IddCxDeviceInitConfig →
IddCxDeviceInitialize → IddCxAdapterInitAsync → IddCxMonitorCreate → IddCxMonitorArrival`.
**Arbitrary resolutions don't need EDID timings:** ship one generic ~128/256-byte EDID base block to
make Windows treat the target as a real monitor, then advertise modes programmatically from the
mode-list callbacks — a static table **plus the runtime-requested client mode injected as preferred**
(exactly SudoVDA's `s_DefaultModes[]` + per-ADD preferred-mode approach). 5120×1440@240 just gets
added at ADD time.
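The "static table plus requested mode injected as preferred" idea can be sketched as a pure function. The `Mode` type, `build_mode_list`, and the contents of `DEFAULT_MODES` are illustrative assumptions; the driver's real fallback table is not reproduced here.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Mode {
    pub width: u32,
    pub height: u32,
    pub refresh_hz: u32,
}

/// Hypothetical fallback table, for illustration only.
pub const DEFAULT_MODES: &[Mode] = &[
    Mode { width: 1920, height: 1080, refresh_hz: 60 },
    Mode { width: 2560, height: 1440, refresh_hz: 60 },
    Mode { width: 3840, height: 2160, refresh_hz: 60 },
];

/// Build the list a mode-list callback would report for one ADD:
/// the client's requested mode first (preferred), then the static
/// fallbacks, with any duplicate of the requested mode dropped.
pub fn build_mode_list(requested: Mode, defaults: &[Mode]) -> Vec<Mode> {
    let mut modes = vec![requested];
    modes.extend(defaults.iter().copied().filter(|m| *m != requested));
    modes
}
```

A request for 5120×1440@240 then heads the list without needing a matching timing in the EDID.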
**HDR/10-bit:** supported, but it's the one place IddCx is *harder* than today. The default swap-chain
surface is **8-bit `A8R8G8B8`**; FP16/HDR requires the IddCx **1.11 D3D12 acquire path**
(`SetDevice2`/`ReleaseAndAcquireBuffer2` → `ID3D12Resource`, with a stricter sync model). Our box is
Win11 26200 (IddCx ≥ 1.10), so this is reachable, but it's real work — and our current WGC/DDA path
gives FP16 HDR "for free." Build against 1.10 and runtime-gate the newer DDIs (SudoVDA's pattern).
## The architectural prize: skip DDA (Phase 2)
An IddCx driver gets each presented frame from `IddCxSwapChainReleaseAndAcquireBuffer` as an
`IDXGIResource` on a device **we** bind via `IddCxSwapChainSetDevice`. We can copy it into a shared
texture / shared section and hand it to the host's encoder process directly — **no Desktop
Duplication**. Why this is the real win, not just a detour:
- **It's the *intended* IddCx use case.** IddCx exists for remote/wireless/USB displays that ship
swap-chain frames over a wire; consuming frames in the driver is the designed path, and **Looking
Glass already does exactly this** (driver → shared memory → separate consumer, no DDA).
- **It kills the multi-GPU bug class.** We call `IddCxAdapterSetRenderAdapter` to pin the swap-chain to
the **same GPU as our NVENC encoder before adding the monitor**, and the OS honors it. No more DXGI
reparenting the output onto the wrong GPU, no ACCESS_LOST storms, and we can **retire
`install_gpu_pref_hook()` (the `win32u.dll` patch)** and most of `capture/dxgi.rs`. Swap-chain
re-creation becomes a documented, in-band event (`ABANDON_SWAPCHAIN`) instead of an undocumented
failure we fight with retries.
What it does **not** remove (be honest): display **topology** management — making the virtual display
the sole/primary composited desktop so the game (and Winlogon) render to it — is independent of how we
*get* frames and stays (though we can integrate it more cleanly). And the watchdog stays, now ours.
The cost: a **Session-0 → service cross-process frame transport** (the driver host is `WUDFHost` in
Session 0 / LocalService; our host is a LocalSystem service). A `Global\`-named, explicitly-ACL'd
shared section + keyed-mutex texture (Looking Glass's shape) is where the engineering actually goes —
prototype this first, it's the only genuinely new risk. Plus the HDR D3D12 path above.
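To make the shared-section prototype concrete, here is a small sketch of a frame header that the driver side could write at the start of the section and the host side could validate before touching pixel data. Everything in it is an assumption for illustration: the `MAGIC` tag, the 24-byte little-endian layout, and the `FrameHeader` fields are not taken from Looking Glass or SudoVDA, and the sketch covers only the CPU-visible header, not section creation, ACLs, or the keyed-mutex texture.

```rust
/// Hypothetical tag identifying our section ("PFVD" as a little-endian u32).
pub const MAGIC: u32 = 0x4456_4650;
/// Header size in bytes: magic + width + height + stride (4 each) + frame_index (8).
pub const HEADER_LEN: usize = 24;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FrameHeader {
    pub width: u32,
    pub height: u32,
    /// Bytes per row of the frame payload that follows the header.
    pub stride: u32,
    /// Monotonic counter so the reader can detect new or dropped frames.
    pub frame_index: u64,
}

fn read_u32(buf: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[off..off + 4]);
    u32::from_le_bytes(b)
}

impl FrameHeader {
    /// Writer side (driver): serialize the header at the start of the section.
    pub fn write_to(&self, buf: &mut [u8]) -> Result<(), &'static str> {
        if buf.len() < HEADER_LEN {
            return Err("buffer too small for header");
        }
        buf[0..4].copy_from_slice(&MAGIC.to_le_bytes());
        buf[4..8].copy_from_slice(&self.width.to_le_bytes());
        buf[8..12].copy_from_slice(&self.height.to_le_bytes());
        buf[12..16].copy_from_slice(&self.stride.to_le_bytes());
        buf[16..24].copy_from_slice(&self.frame_index.to_le_bytes());
        Ok(())
    }

    /// Reader side (host): parse and validate before trusting any field.
    pub fn read_from(buf: &[u8]) -> Result<Self, &'static str> {
        if buf.len() < HEADER_LEN {
            return Err("buffer too small for header");
        }
        if read_u32(buf, 0) != MAGIC {
            return Err("bad magic");
        }
        let mut idx = [0u8; 8];
        idx.copy_from_slice(&buf[16..24]);
        let hdr = FrameHeader {
            width: read_u32(buf, 4),
            height: read_u32(buf, 8),
            stride: read_u32(buf, 12),
            frame_index: u64::from_le_bytes(idx),
        };
        // Reject headers whose payload would run past the end of the section.
        let payload = u64::from(hdr.stride) * u64::from(hdr.height);
        if payload > (buf.len() - HEADER_LEN) as u64 {
            return Err("payload exceeds section");
        }
        Ok(hdr)
    }
}
```

Validating on the reader side matters here because the two ends live in different processes and sessions; the host should treat the section contents as untrusted input even though we ship both sides.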
## Decisions to make at kickoff
- **D1 — Own the driver?** Recommend **yes, in Rust.** (Alternatives: fork SudoVDA's C++ — fastest to a
known-good HDR driver but reintroduces a C++ toolchain and README-only license provenance; or keep
vendoring — zero cost, but none of the goals.)
- **D2 — Binding stack?** The main implementation fork.
- **(a)** Extend our `windows-drivers-rs` (m0) with an `iddcx` subset — **one toolchain across all
our drivers**, our build env, but we write the IddCx bindings ourselves (+~3-5 wk), using
`virtual-display-rs`'s `iddcx.rs` as the 1:1 guide. *Preferred for consistency.*
- **(b)** Vendor `virtual-display-rs`'s `wdf-umdf*` crates (MIT) — fastest to first light, but a
*second* WDK-binding stack in-tree.
- Suggested sequence: **prototype on (b) to prove IddCx-on-our-box in days**, then build production on
**(a)** for consistency.
- **D3 — Frame transport?** Phase it: **DDA-compatible first** (zero capture-side change), **direct
push second** (the cleanup). Don't couple the driver rewrite to the transport rewrite.
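One way to keep the driver rewrite and the transport rewrite decoupled, as D3 suggests, is to make the capture backend a runtime choice that defaults to DDA. The sketch below assumes a hypothetical `DriverCaps` value reported by the installed driver and a hypothetical opt-in flag; none of these names exist in the codebase today.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaptureBackend {
    /// Phase 1: Desktop Duplication against the virtual display (no capture-side change).
    Dda,
    /// Phase 2: frames pushed by the driver through a shared section/texture.
    DirectPush,
}

/// What the installed driver reports about itself (hypothetical).
#[derive(Debug, Clone, Copy)]
pub struct DriverCaps {
    pub direct_push: bool,
}

/// Pick the backend: direct push only when the flag is set AND the driver
/// advertises support; anything else (old driver, no driver, flag off) stays on DDA.
pub fn select_backend(direct_push_flag: bool, caps: Option<DriverCaps>) -> CaptureBackend {
    match caps {
        Some(c) if direct_push_flag && c.direct_push => CaptureBackend::DirectPush,
        _ => CaptureBackend::Dda,
    }
}
```

With this shape, a P1 driver that never sets `direct_push` keeps working against the existing capture code, and the P2 backend can be validated side by side before the default flips.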
## Recommended plan
- **P0 — now:** keep vendoring SudoVDA. No change. (The gamepad-driver installer work just shipped;
this is independent.)
- **P1 — drop-in Rust IddCx driver (`pf-vdisplay`).** Replicate SudoVDA's IOCTL contract **exactly**
(same struct layouts; reuse or re-issue the control interface GUID) so `vdisplay/sudovda.rs` needs
**~zero change** (at most a GUID constant). Class=Display + IddCx INF, our own EDID + programmatic
mode list incl. the per-ADD client mode, the watchdog, a real swap-chain drain (the vdd port — the
drain is required so DWM keeps compositing; DDA/WGC still captures the desktop). Bundle + self-sign +
`pnputil`-install via the installer, identical to the gamepad-driver path we just built. **Outcome:** all-Rust, SudoVDA dependency dropped, DDA capture
unchanged. Effort ≈ **2-4 wk to first light**, **5-7 wk to parity** (HDR, multi-monitor, CI).
- **P2 — direct frame push (kill DDA).** Add a swap-chain processor that copies each frame into a
shared section/texture; new `capture` backend reads it directly; pin the render adapter to the
encoder GPU. Gate behind a flag, validate against DDA, then retire the DDA path + the `win32u.dll`
patch. HDR via the IddCx 1.11 D3D12 acquire path. **Outcome:** the real "owning the stack pays off"
cleanup. Effort: additional; the Session-0 transport is the long pole.
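For P1's "same struct layouts" requirement, a compile-time layout check is a cheap guard against ABI drift between the driver and `vdisplay/sudovda.rs`. The sketch below pins the size and alignment of a `#[repr(C)]` request struct. The `AddDisplayRequest` fields are illustrative only and are not SudoVDA's real ADD layout; the real definition has to be copied from its `sudovda-ioctl.h`.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical ADD request. The field set is illustrative, NOT SudoVDA's real layout.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AddDisplayRequest {
    pub width: u32,
    pub height: u32,
    pub refresh_hz: u32,
    pub monitor_guid: [u8; 16],
}

/// Expected wire size: three u32 fields plus a 16-byte GUID, no padding.
pub const ADD_REQUEST_LEN: usize = 28;

// Compile-time checks: the build fails if a field change alters the layout.
const _: () = assert!(size_of::<AddDisplayRequest>() == ADD_REQUEST_LEN);
const _: () = assert!(align_of::<AddDisplayRequest>() == 4);

/// Encode the request field by field in little-endian order, producing the
/// byte buffer a host would hand to the IOCTL call.
pub fn encode_add_request(req: &AddDisplayRequest) -> [u8; ADD_REQUEST_LEN] {
    let mut out = [0u8; ADD_REQUEST_LEN];
    out[0..4].copy_from_slice(&req.width.to_le_bytes());
    out[4..8].copy_from_slice(&req.height.to_le_bytes());
    out[8..12].copy_from_slice(&req.refresh_hz.to_le_bytes());
    out[12..28].copy_from_slice(&req.monitor_guid);
    out
}
```

The same two `const` assertions can sit next to every struct in the control interface, so a mismatch shows up as a build error on either side rather than as a silently misread IOCTL.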
## Risks
1. **D3D-in-a-driver swap-chain loop** — the one genuinely new piece; bugs here = black screens/TDR.
Mitigated by `virtual-display-rs`'s `swap_chain_processor.rs` + the MS sample as references.
2. **Session-0 cross-process transport** (P2) — the actual hard part; prototype it first.
3. **HDR = the harder IddCx 1.11 D3D12 path** — our current WGC/DDA HDR is free; the IddCx HDR path is not.
4. **Two binding stacks** if we go D2(b) — a maintenance cost cutting against "clean/consistent."
5. **No WHQL ⇒ no Windows Update / Dev-Center distribution** — same constraint our gamepad drivers
already accept (bundle + self-sign + import cert).
## References
- IddCx model + signing: [IDD model overview](https://learn.microsoft.com/en-us/windows-hardware/drivers/display/indirect-display-driver-model-overview) ·
[IddCx versions](https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iddcx-versions) ·
[1.10+ updates](https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iddcx1.10-updates) ·
[UMDF signing](https://learn.microsoft.com/en-us/archive/blogs/peterwie/do-umdf-drivers-require-signing)
- Swap-chain / frames: [IDDCX_METADATA](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/ns-iddcx-iddcx_metadata) ·
[SetDevice](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/nf-iddcx-iddcxswapchainsetdevice) ·
[SetRenderAdapter](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/nf-iddcx-iddcxadaptersetrenderadapter) ·
[ASSIGN_SWAPCHAIN](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/nc-iddcx-evt_idd_cx_monitor_assign_swapchain)
- Prior art: [Microsoft IddSampleDriver](https://github.com/microsoft/Windows-driver-samples/tree/main/video/IndirectDisplay) ·
[SudoMaker/SudoVDA](https://github.com/SudoMaker/SudoVDA) ([ioctl.h](https://github.com/SudoMaker/SudoVDA/blob/master/Common/Include/sudovda-ioctl.h)) ·
**[MolotovCherry/virtual-display-rs (Rust, MIT)](https://github.com/MolotovCherry/virtual-display-rs)** ·
[Looking Glass IDD (swap-chain → shm, no DDA)](https://deepwiki.com/gnif/LookingGlass/2.5-indirect-display-driver-(idd)) ·
[itsmikethetech/Virtual-Display-Driver](https://github.com/itsmikethetech/Virtual-Display-Driver)