feat(windows-client): D3D11VA zero-copy hw decode + HDR10 present + GUI polish

Until now the client did pure software HEVC decode, CPU swscale->RGBA, and a
full-frame dynamic-texture upload every frame, which is why performance was poor
on a GPU box (the GPU sat idle while the CPU churned). This adds a hardware
decode path, HDR10 present, and a GUI pass.

Performance -- D3D11VA zero-copy:
- gpu.rs (new): one D3D11 device (hardware + VIDEO_SUPPORT, WARP fallback,
  multithread-protected) shared by decoder and presenter via a Send/Sync
  OnceLock. Sharing is mandatory -- a decoded texture is only bindable on the
  device that created it. windows-rs COM interfaces are !Send/!Sync, so the
  unsafe impl is sound only under the multithread protection + disjoint
  decode(video ctx)/present(immediate ctx) split.
- video.rs: D3d11vaDecoder (raw FFI mirroring the Linux VAAPI module). The
  COM-typed AVD3D11VA{Device,Frames}Context are declared here (stable FFmpeg
  ABI) to avoid ffmpeg-sys binding the d3d11 headers; get_format builds a frames
  ctx with BindFlags=SHADER_RESOURCE so the NV12/P010 array slices are
  sampleable. av_frame_clone guard keeps each surface out of the reuse pool
  until the presenter drops it. Software decode stays as the fallback
  (DecoderPref Auto/Hardware/Software; auto falls back on init/decode error).
- present.rs: shared device; per-plane SRVs over the array slice
  (NV12->R8/R8G8, P010->R16/R16G16) + three pixel shaders (RGBA passthrough,
  NV12/BT.709, P010/BT.2020-PQ). present() now takes the frame by value so the
  GPU surface survives re-presents.
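
A minimal sketch of the device-sharing pattern from the gpu.rs bullet, using a
stand-in type in place of the real D3D11 device so it compiles without the
windows crate. `FakeDevice`, `SharedGpu`, and `shared_gpu` are hypothetical
names, not the ones in gpu.rs, and the SAFETY comment only restates the argument
above; it holds only for a device that really is multithread-protected.

```rust
use std::sync::OnceLock;

/// Stand-in for a COM device interface. The raw pointer makes this type
/// !Send/!Sync, like the windows-rs interface types described above.
pub struct FakeDevice {
    raw: *mut u8,
}

/// Wrapper that is shared by the decoder and the presenter (hypothetical name).
pub struct SharedGpu {
    device: FakeDevice,
}

// SAFETY (assumption taken from the commit message, not checked here): the
// real device is created multithread-protected, and the decode side only uses
// the video context while the present side only uses the immediate context.
unsafe impl Send for SharedGpu {}
unsafe impl Sync for SharedGpu {}

static GPU: OnceLock<SharedGpu> = OnceLock::new();

/// Returns the single process-wide device, creating it on first use.
pub fn shared_gpu() -> &'static SharedGpu {
    GPU.get_or_init(|| SharedGpu {
        // The real code would create the D3D11 device here (hardware first,
        // WARP as fallback); this sketch just allocates a dummy byte.
        device: FakeDevice {
            raw: Box::into_raw(Box::new(0u8)),
        },
    })
}

impl SharedGpu {
    /// Identity of the underlying device, used to show both users see one device.
    pub fn id(&self) -> usize {
        self.device.raw as usize
    }
}
```

Calling `shared_gpu()` from the decode thread and from the present thread yields
the same instance, which is the property that lets the presenter bind a texture
the decoder produced.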

HDR:
- Detected in-band (transfer == SMPTE2084), same signal as the other clients.
  Swapchain flips to R10G10B10A2 + ST.2084 + HDR10 metadata. New Settings toggle
  gates advertising VIDEO_CAP_10BIT|HDR; host still gates 10-bit behind its own
  PUNKTFUNK_10BIT + actual-HDR-content checks.
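
The HDR decision above can be sketched roughly as follows. The flag bit values
and the `PresentMode`, `advertised_caps`, and `present_mode` names are
assumptions made for illustration; only the transfer code point (16, the H.273 /
FFmpeg `AVCOL_TRC_SMPTE2084` value) is a fixed constant.

```rust
/// H.273 transfer-characteristics code point for SMPTE ST 2084 (PQ),
/// the same value as FFmpeg's `AVCOL_TRC_SMPTE2084`.
pub const TRC_SMPTE2084: u32 = 16;

// Hypothetical bit assignments; the real VIDEO_CAP_* values are defined by
// the project's protocol and are not shown in this commit message.
pub const VIDEO_CAP_10BIT: u32 = 1 << 0;
pub const VIDEO_CAP_HDR: u32 = 1 << 1;

/// How the swapchain should be configured for the current frame.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PresentMode {
    /// Default 8-bit SDR swapchain.
    Sdr,
    /// R10G10B10A2 + ST.2084 colour space + HDR10 metadata.
    Hdr10,
}

/// Capabilities the client advertises, gated by the Settings toggle.
pub fn advertised_caps(hdr_setting_enabled: bool) -> u32 {
    if hdr_setting_enabled {
        VIDEO_CAP_10BIT | VIDEO_CAP_HDR
    } else {
        0
    }
}

/// In-band detection: only a PQ transfer characteristic flips the swapchain.
pub fn present_mode(frame_transfer: u32) -> PresentMode {
    if frame_transfer == TRC_SMPTE2084 {
        PresentMode::Hdr10
    } else {
        PresentMode::Sdr
    }
}
```

Keeping detection separate from advertising mirrors the text: the toggle only
controls what the client offers, while the swapchain follows what the stream
actually signals.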

GUI (windows-reactor):
- Host cards with accent-monogram avatars + colored status pills, InfoBar for
  errors/pairing hints, ToggleSwitch settings (+ HDR, decoder, bitrate), button
  icons, a richer connecting screen, and a stream HUD with GPU/CPU-decode + HDR
  status chips.

Not yet on-glass validated: the Linux dev box can't compile the cfg(windows)
code (ffmpeg/windows crates unfetched; WARP has no hw decode) -- only
cargo fmt checks it here. API shapes verified against the windows-rs/reactor
source and the YUV->RGB coefficients checked by hand, but D3D11VA + shaders +
the GUI need a real build (Windows CI / build VM) and on-glass test on the RTX
box. The host-side HDR encode path is unchanged.
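
For reference, a hand check of the NV12/BT.709 coefficients could look like the
CPU-side sketch below. It assumes 8-bit limited (video) range input and derives
the matrix from the BT.709 luma weights; the real conversion lives in the pixel
shaders and may differ in range handling. `bt709_limited_to_rgb` is a
hypothetical helper name.

```rust
// BT.709 luma weights; the green weight follows from the other two.
const KR: f32 = 0.2126;
const KB: f32 = 0.0722;
const KG: f32 = 1.0 - KR - KB;

/// Converts one 8-bit limited-range BT.709 Y'CbCr sample to normalized RGB.
/// The result is still gamma-encoded; no transfer function is applied.
pub fn bt709_limited_to_rgb(y: u8, cb: u8, cr: u8) -> [f32; 3] {
    // Limited range: luma spans 16..=235, chroma spans 16..=240 around 128.
    let y = (f32::from(y) - 16.0) / 219.0;
    let cb = (f32::from(cb) - 128.0) / 224.0;
    let cr = (f32::from(cr) - 128.0) / 224.0;

    // R and B come straight from the colour-difference definitions
    // (factors 1.5748 and 1.8556); G is solved from Y = KR*R + KG*G + KB*B.
    let r = y + 2.0 * (1.0 - KR) * cr;
    let b = y + 2.0 * (1.0 - KB) * cb;
    let g = (y - KR * r - KB * b) / KG;

    [r, g, b].map(|c| c.clamp(0.0, 1.0))
}
```

Reference black (16, 128, 128) and reference white (235, 128, 128) should map to
0.0 and 1.0 on all three channels, which is a quick sanity test for any shader
port of the same matrix.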

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 23:16:07 +00:00
parent af9bb54785
commit 0cc36fa130
8 changed files with 1121 additions and 222 deletions
+18 -8
@@ -123,11 +123,21 @@ Low-latency desktop/game streaming stack, Linux-first, with a shared Rust protoc
framework backed by WinUI; PR #4499 added the `SwapChainPanel` widget + `set_swap_chain`). The
video is a **`SwapChainPanel`** bound to a **D3D11 composition swapchain** (WARP fallback for
the GPU-less dev box; runtime-compiled fullscreen-triangle shaders, Contain-fit letterbox),
-driven by reactor's per-frame `on_rendering`. **FFmpeg software HEVC decode** (D3D11VA hw decode
-is the follow-up), **WASAPI** render + mic capture, **SDL3** gamepads (rumble/lightbar/DualSense),
-`mdns-sd` discovery, and the full trust surface — all **in-app**: host list (live mDNS + saved +
-manual), settings (resolution/refresh/mic), SPAKE2 PIN pairing screen, TOFU, pinned-fp-mismatch
-re-pair. **Stream input** is Win32 low-level hooks (`WH_KEYBOARD_LL`/`WH_MOUSE_LL`) — reactor
+driven by reactor's per-frame `on_rendering`. **FFmpeg HEVC decode with a D3D11VA
+zero-copy hardware path** (`gpu.rs` shares one D3D11 device — hardware+`VIDEO_SUPPORT`, WARP
+fallback, multithread-protected — between the decoder and presenter; the decoder outputs
+NV12/P010 `ID3D11Texture2D` array slices with `BIND_SHADER_RESOURCE` and the presenter samples
+them via per-plane SRVs + YUV→RGB shaders — NV12/BT.709, P010/BT.2020-PQ; **software CPU decode
+stays as the robust fallback**, auto-selected with a `DecoderPref` override). **HDR10**: the
+client advertises 10-bit/HDR (Settings toggle), detects PQ in-band (`transfer == SMPTE2084`),
+and flips the swapchain to `R10G10B10A2` + ST.2084 with HDR10 metadata. **WASAPI** render + mic
+capture, **SDL3** gamepads (rumble/lightbar/DualSense), `mdns-sd` discovery, and the full trust
+surface — all **in-app**: a polished WinUI shell (host cards w/ monogram + status pills,
+`InfoBar` errors/hints, `ToggleSwitch` settings, status-chip stream HUD showing GPU/CPU decode +
+HDR), host list (live mDNS + saved + manual), settings (resolution/refresh/decoder/bitrate/HDR/
+mic), SPAKE2 PIN pairing screen, TOFU, pinned-fp-mismatch re-pair. **(D3D11VA + HDR present + the
+GUI polish are written against the windows-rs/reactor APIs but not yet on-glass validated — the
+dev VM is headless/WARP; needs the RTX box.)** **Stream input** is Win32 low-level hooks (`WH_KEYBOARD_LL`/`WH_MOUSE_LL`) — reactor
exposes no raw key/pointer events; native Windows VK + absolute mouse (client-rect Contain-fit) +
wheel, Ctrl+Alt+Shift+Q capture toggle. `--headless`/`--discover` keep CLI paths. Builds + clippy
+ fmt green on `x86_64-pc-windows-msvc` (on the dev VM). **windows-reactor is unpublished** (git
@@ -135,9 +145,9 @@ Low-latency desktop/game streaming stack, Linux-first, with a shared Rust protoc
with `set_swap_chain`); its `build.rs` downloads the Win App SDK NuGets + needs `CARGO_WORKSPACE_DIR`
set (in the VM build env; `/temp`+`/winmd` gitignored). Gotcha: `CARGO_HOME` must be an ASCII path
— the `ü` in the dev box's username breaks SDL3's MSVC precompiled-header build. Next: **on-glass
-validation** (the dev VM is headless/Session-0 → the WinUI window needs a display: RDP or the RTX
-box), D3D11VA hw decode + 10-bit/HDR present, RAWINPUT relative-mouse pointer-lock, and a per-host
-speed test in the UI.
+validation** of the D3D11VA decode + HDR present + GUI on the RTX box (the dev VM is
+headless/Session-0/WARP → the WinUI window + hardware decode need a real display+GPU: RDP or the
+RTX box), then RAWINPUT relative-mouse pointer-lock and a per-host speed test in the UI.
2. **Sub-frame pipelining**: overlap encode and transmit within a frame. Requires a direct
NVENC SDK wrapper (libavcodec only emits whole AUs) — the next big latency lever (~24 ms
at high res).