From 0cc36fa1304cb3948b674d7da34c9006496b1080 Mon Sep 17 00:00:00 2001
From: enricobuehler
Date: Thu, 18 Jun 2026 23:16:07 +0000
Subject: [PATCH] feat(windows-client): D3D11VA zero-copy hw decode + HDR10 present + GUI polish

The client was pure software HEVC decode + CPU swscale->RGBA + a full-frame
dynamic-texture upload every frame -- the reason performance was poor on a
GPU box (the GPU sat idle while the CPU churned). This adds a hardware path,
HDR, and a GUI pass.

Performance -- D3D11VA zero-copy:
- gpu.rs (new): one D3D11 device (hardware + VIDEO_SUPPORT, WARP fallback,
  multithread-protected) shared by decoder and presenter via a Send/Sync
  OnceLock. Sharing is mandatory -- a decoded texture is only bindable on the
  device that created it. windows-rs COM interfaces are !Send/!Sync, so the
  unsafe impl is sound only under the multithread protection + disjoint
  decode(video ctx)/present(immediate ctx) split.
- video.rs: D3d11vaDecoder (raw FFI mirroring the Linux VAAPI module). The
  COM-typed AVD3D11VA{Device,Frames}Context are declared here (stable FFmpeg
  ABI) to avoid ffmpeg-sys binding the d3d11 headers; get_format builds a
  frames ctx with BindFlags=SHADER_RESOURCE so the NV12/P010 array slices are
  sampleable. av_frame_clone guard keeps each surface out of the reuse pool
  until the presenter drops it. Software decode stays as the fallback
  (DecoderPref Auto/Hardware/Software; auto falls back on init/decode error).
- present.rs: shared device; per-plane SRVs over the array slice
  (NV12->R8/R8G8, P010->R16/R16G16) + three pixel shaders (RGBA passthrough,
  NV12/BT.709, P010/BT.2020-PQ). present() now takes the frame by value so
  the GPU surface survives re-presents.

HDR:
- Detected in-band (transfer == SMPTE2084), same signal as the other clients.
  Swapchain flips to R10G10B10A2 + ST.2084 + HDR10 metadata. New Settings
  toggle gates advertising VIDEO_CAP_10BIT|HDR; host still gates 10-bit
  behind its own PUNKTFUNK_10BIT + actual-HDR-content checks.

GUI (windows-reactor): - Host cards with accent-monogram avatars + colored status pills, InfoBar for errors/pairing hints, ToggleSwitch settings (+ HDR, decoder, bitrate), button icons, a richer connecting screen, and a stream HUD with GPU/CPU-decode + HDR status chips. Not yet on-glass validated: the Linux dev box can't compile the cfg(windows) code (ffmpeg/windows crates unfetched; WARP has no hw decode) -- only cargo fmt checks it here. API shapes verified against the windows-rs/reactor source and the YUV->RGB coefficients checked by hand, but D3D11VA + shaders + the GUI need a real build (Windows CI / build VM) and on-glass test on the RTX box. The host-side HDR encode path is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 26 +- clients/windows/src/app.rs | 387 +++++++++++++++++++++++------- clients/windows/src/gpu.rs | 121 ++++++++++ clients/windows/src/main.rs | 13 +- clients/windows/src/present.rs | 329 +++++++++++++++++++------- clients/windows/src/session.rs | 41 +++- clients/windows/src/trust.rs | 7 + clients/windows/src/video.rs | 419 +++++++++++++++++++++++++++++++-- 8 files changed, 1121 insertions(+), 222 deletions(-) create mode 100644 clients/windows/src/gpu.rs diff --git a/CLAUDE.md b/CLAUDE.md index efd9ef6..7023398 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -123,11 +123,21 @@ Low-latency desktop/game streaming stack, Linux-first, with a shared Rust protoc framework backed by WinUI; PR #4499 added the `SwapChainPanel` widget + `set_swap_chain`). The video is a **`SwapChainPanel`** bound to a **D3D11 composition swapchain** (WARP fallback for the GPU-less dev box; runtime-compiled fullscreen-triangle shaders, Contain-fit letterbox), - driven by reactor's per-frame `on_rendering`. 
**FFmpeg software HEVC decode** (D3D11VA hw decode - is the follow-up), **WASAPI** render + mic capture, **SDL3** gamepads (rumble/lightbar/DualSense), - `mdns-sd` discovery, and the full trust surface — all **in-app**: host list (live mDNS + saved + - manual), settings (resolution/refresh/mic), SPAKE2 PIN pairing screen, TOFU, pinned-fp-mismatch - re-pair. **Stream input** is Win32 low-level hooks (`WH_KEYBOARD_LL`/`WH_MOUSE_LL`) — reactor + driven by reactor's per-frame `on_rendering`. **FFmpeg HEVC decode with a D3D11VA + zero-copy hardware path** (`gpu.rs` shares one D3D11 device — hardware+`VIDEO_SUPPORT`, WARP + fallback, multithread-protected — between the decoder and presenter; the decoder outputs + NV12/P010 `ID3D11Texture2D` array slices with `BIND_SHADER_RESOURCE` and the presenter samples + them via per-plane SRVs + YUV→RGB shaders — NV12/BT.709, P010/BT.2020-PQ; **software CPU decode + stays as the robust fallback**, auto-selected with a `DecoderPref` override). **HDR10**: the + client advertises 10-bit/HDR (Settings toggle), detects PQ in-band (`transfer == SMPTE2084`), + and flips the swapchain to `R10G10B10A2` + ST.2084 with HDR10 metadata. **WASAPI** render + mic + capture, **SDL3** gamepads (rumble/lightbar/DualSense), `mdns-sd` discovery, and the full trust + surface — all **in-app**: a polished WinUI shell (host cards w/ monogram + status pills, + `InfoBar` errors/hints, `ToggleSwitch` settings, status-chip stream HUD showing GPU/CPU decode + + HDR), host list (live mDNS + saved + manual), settings (resolution/refresh/decoder/bitrate/HDR/ + mic), SPAKE2 PIN pairing screen, TOFU, pinned-fp-mismatch re-pair. 
**(D3D11VA + HDR present + the + GUI polish are written against the windows-rs/reactor APIs but not yet on-glass validated — the + dev VM is headless/WARP; needs the RTX box.)** **Stream input** is Win32 low-level hooks (`WH_KEYBOARD_LL`/`WH_MOUSE_LL`) — reactor exposes no raw key/pointer events; native Windows VK + absolute mouse (client-rect Contain-fit) + wheel, Ctrl+Alt+Shift+Q capture toggle. `--headless`/`--discover` keep CLI paths. Builds + clippy + fmt green on `x86_64-pc-windows-msvc` (on the dev VM). **windows-reactor is unpublished** (git @@ -135,9 +145,9 @@ Low-latency desktop/game streaming stack, Linux-first, with a shared Rust protoc with `set_swap_chain`); its `build.rs` downloads the Win App SDK NuGets + needs `CARGO_WORKSPACE_DIR` set (in the VM build env; `/temp`+`/winmd` gitignored). Gotcha: `CARGO_HOME` must be an ASCII path — the `ü` in the dev box's username breaks SDL3's MSVC precompiled-header build. Next: **on-glass - validation** (the dev VM is headless/Session-0 → the WinUI window needs a display: RDP or the RTX - box), D3D11VA hw decode + 10-bit/HDR present, RAWINPUT relative-mouse pointer-lock, and a per-host - speed test in the UI. + validation** of the D3D11VA decode + HDR present + GUI on the RTX box (the dev VM is + headless/Session-0/WARP → the WinUI window + hardware decode need a real display+GPU: RDP or the + RTX box), then RAWINPUT relative-mouse pointer-lock and a per-host speed test in the UI. 2. **Sub-frame pipelining**: overlap encode and transmit within a frame. Requires a direct NVENC SDK wrapper (libavcodec only emits whole AUs) — the next big latency lever (~2–4 ms at high res). 
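
Reviewer note (not part of the patch): the commit message says the YUV->RGB coefficients were checked by hand. The sketch below shows one way to re-derive the limited-range constants used by the `ps_nv12` (BT.709, 8-bit) and `ps_p010` (BT.2020 NCL, 10-bit) shaders in `present.rs` from the standard luma weights, and to confirm the `65535/65472` P010 rescale. The names `YuvCoeffs`, `limited_range_coeffs` and `p010_sample_to_unit` are hypothetical helpers invented for this note; they do not exist in the client.

```rust
/// Limited-range YCbCr -> RGB constants with the range scaling folded in,
/// matching the form used by the pixel shaders (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
struct YuvCoeffs {
    y_offset: f64, // black level as a fraction of full scale (16/255 or 64/1023)
    y_scale: f64,  // luma gain (255/219 or 1023/876)
    r_v: f64,      // R += r_v * (V - 0.5)
    g_u: f64,      // G -= g_u * (U - 0.5)
    g_v: f64,      // G -= g_v * (V - 0.5)
    b_u: f64,      // B += b_u * (U - 0.5)
}

/// Derive the constants from the luma weights `kr`/`kb` and the bit depth.
/// Assumes limited ("video") range: luma spans 219 steps and chroma 224 steps
/// at 8 bits, both scaled by 2^(bits - 8) at higher depths.
fn limited_range_coeffs(kr: f64, kb: f64, bits: u32) -> YuvCoeffs {
    let kg = 1.0 - kr - kb;
    let max = ((1u32 << bits) - 1) as f64; // 255 or 1023
    let step = (1u32 << (bits - 8)) as f64; // 1 or 4
    let c_scale = max / (224.0 * step);
    YuvCoeffs {
        y_offset: 16.0 * step / max,
        y_scale: max / (219.0 * step),
        r_v: 2.0 * (1.0 - kr) * c_scale,
        g_u: 2.0 * (1.0 - kb) * kb / kg * c_scale,
        g_v: 2.0 * (1.0 - kr) * kr / kg * c_scale,
        b_u: 2.0 * (1.0 - kb) * c_scale,
    }
}

/// P010 stores a 10-bit code in the high bits of a 16-bit word. Sampling it as
/// R16_UNORM yields (code << 6) / 65535; multiplying by 65535/65472 gives
/// code / 1023, which is the rescale the P010 shader applies.
fn p010_sample_to_unit(code10: u16) -> f64 {
    let stored = (code10 as u32) << 6;
    let unorm = stored as f64 / 65535.0;
    unorm * (65535.0 / 65472.0)
}

fn main() {
    // BT.709 luma weights at 8 bits (NV12 path).
    let bt709 = limited_range_coeffs(0.2126, 0.0722, 8);
    // BT.2020 non-constant-luminance weights at 10 bits (P010 path).
    let bt2020 = limited_range_coeffs(0.2627, 0.0593, 10);
    println!("BT.709  8-bit : {bt709:?}");
    println!("BT.2020 10-bit: {bt2020:?}");
    println!("P010 code 1023 -> {}", p010_sample_to_unit(1023));
}
```

Comparing the printed values with the literals in the shader source is a quick sanity check before the on-glass test. Note that the shaders centre chroma at exactly 0.5, whereas the nominal centre is 128/255 (or 512/1023); this sketch does not model that small offset.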
diff --git a/clients/windows/src/app.rs b/clients/windows/src/app.rs index b3ab5b1..312270b 100644 --- a/clients/windows/src/app.rs +++ b/clients/windows/src/app.rs @@ -16,7 +16,7 @@ use crate::gamepad::GamepadService; use crate::present::Presenter; use crate::session::{self, SessionEvent, SessionParams, Stats}; use crate::trust::{self, KnownHost, KnownHosts, Settings}; -use crate::video::DecodedFrame; +use crate::video::{DecodedFrame, DecoderPref}; use punktfunk_core::client::NativeClient; use punktfunk_core::config::{CompositorPref, GamepadPref, Mode}; use std::cell::RefCell; @@ -31,6 +31,14 @@ const RESOLUTIONS: &[(u32, u32)] = &[ (3840, 2160), ]; const REFRESH: &[u32] = &[0, 30, 60, 90, 120, 144, 165, 240]; +/// Decode backend presets: `(stored value, display label)`. +const DECODERS: &[(&str, &str)] = &[ + ("auto", "Automatic (GPU, fall back to CPU)"), + ("hardware", "Hardware (GPU / D3D11VA)"), + ("software", "Software (CPU)"), +]; +/// Bitrate presets in Mb/s; `0` = host default. +const BITRATES_MBPS: &[u32] = &[0, 10, 20, 30, 50, 80, 150]; #[derive(Clone, PartialEq)] enum Screen { @@ -189,10 +197,61 @@ fn page(children: Vec) -> Element { scroll_view(col).into() } -/// A clickable host row: name + address/badge + chevron. +/// A rounded square "monogram" for a host, the first letter on an accent fill — a clean leading +/// visual that avoids depending on an icon font being installed. +fn avatar(name: &str) -> Border { + let initial = name + .chars() + .find(|c| c.is_alphanumeric()) + .map(|c| c.to_uppercase().to_string()) + .unwrap_or_else(|| "?".into()); + border( + text_block(initial) + .font_size(17.0) + .semibold() + .foreground(ThemeRef::AccentText) + .horizontal_alignment(HorizontalAlignment::Center) + .vertical_alignment(VerticalAlignment::Center), + ) + .background(ThemeRef::Accent) + .corner_radius(10.0) + .width(40.0) + .height(40.0) +} + +/// Pill chip colour intent. 
+#[derive(Clone, Copy)] +enum Pill { + Accent, + Good, + Neutral, +} + +/// A small rounded status chip (paired/PIN/HDR/etc.). +fn pill(text: &str, kind: Pill) -> Border { + let (bg, fg) = match kind { + Pill::Accent => (ThemeRef::Accent, ThemeRef::AccentText), + Pill::Good => (ThemeRef::SystemSuccessBackground, ThemeRef::SystemSuccess), + Pill::Neutral => (ThemeRef::SubtleFill, ThemeRef::SecondaryText), + }; + border(text_block(text).font_size(11.0).semibold().foreground(fg)) + .background(bg) + .corner_radius(10.0) + .padding(edges(9.0, 3.0, 9.0, 3.0)) +} + +/// A clickable host row: monogram + name/address + status pill + chevron. fn host_card(name: &str, sub: &str, badge: &str, on_tap: impl Fn() + 'static) -> Element { + let kind = match badge { + "Paired" => Pill::Good, + "Open" => Pill::Neutral, + _ => Pill::Accent, // Trusted / PIN + }; card( grid(( + avatar(name) + .grid_column(0) + .vertical_alignment(VerticalAlignment::Center), vstack(( text_block(name).font_size(15.0).semibold(), text_block(sub) @@ -200,21 +259,25 @@ fn host_card(name: &str, sub: &str, badge: &str, on_tap: impl Fn() + 'static) -> .foreground(ThemeRef::SecondaryText), )) .spacing(2.0) - .grid_column(0) - .vertical_alignment(VerticalAlignment::Center), - text_block(badge) - .font_size(12.0) - .foreground(ThemeRef::SecondaryText) - .grid_column(1) + .grid_column(1) + .vertical_alignment(VerticalAlignment::Center) + .margin(edges(12.0, 0.0, 0.0, 0.0)), + pill(badge, kind) + .grid_column(2) .vertical_alignment(VerticalAlignment::Center) - .margin(edges(0.0, 0.0, 12.0, 0.0)), + .margin(edges(0.0, 0.0, 10.0, 0.0)), text_block("\u{203A}") .font_size(18.0) .foreground(ThemeRef::SecondaryText) - .grid_column(2) + .grid_column(3) .vertical_alignment(VerticalAlignment::Center), )) - .columns([GridLength::Star(1.0), GridLength::Auto, GridLength::Auto]), + .columns([ + GridLength::Auto, + GridLength::Star(1.0), + GridLength::Auto, + GridLength::Auto, + ]), ) .on_tapped(on_tap) .into() @@ -281,22 
+344,35 @@ fn root(cx: &mut RenderCx, ctx: &Arc) -> Element { }; match screen { Screen::Hosts => component(hosts_page, HostsProps { svc, hosts, status }), - Screen::Connecting => vstack(( - ProgressRing::indeterminate() - .width(48.0) - .height(48.0) - .horizontal_alignment(HorizontalAlignment::Center), - text_block("Connecting\u{2026}") - .font_size(16.0) - .horizontal_alignment(HorizontalAlignment::Center), - text_block(status.clone()) + Screen::Connecting => { + let target_name = ctx.shared.target.lock().unwrap().name.clone(); + let headline = if target_name.is_empty() { + "Connecting\u{2026}".to_string() + } else { + format!("Connecting to {target_name}\u{2026}") + }; + vstack(( + ProgressRing::indeterminate() + .width(48.0) + .height(48.0) + .horizontal_alignment(HorizontalAlignment::Center), + text_block(headline) + .font_size(18.0) + .semibold() + .horizontal_alignment(HorizontalAlignment::Center), + text_block(if status.is_empty() { + "Negotiating the session and creating the virtual display\u{2026}".to_string() + } else { + status.clone() + }) .foreground(ThemeRef::SecondaryText) .horizontal_alignment(HorizontalAlignment::Center), - )) - .spacing(16.0) - .horizontal_alignment(HorizontalAlignment::Center) - .vertical_alignment(VerticalAlignment::Center) - .into(), + )) + .spacing(16.0) + .horizontal_alignment(HorizontalAlignment::Center) + .vertical_alignment(VerticalAlignment::Center) + .into() + } // settings_page uses no hooks (it never touches `cx`), so calling it inline is sound. 
Screen::Settings => settings_page(ctx, &set_screen), Screen::Pair => component(pair_page, svc), @@ -327,6 +403,7 @@ fn hosts_page(props: &HostsProps, cx: &mut RenderCx) -> Element { .grid_column(0) .vertical_alignment(VerticalAlignment::Center), button("Settings") + .icon(SymbolGlyph::Setting) .on_click({ let ss = set_screen.clone(); move || ss.call(Screen::Settings) @@ -340,7 +417,13 @@ fn hosts_page(props: &HostsProps, cx: &mut RenderCx) -> Element { ); if !status.is_empty() { - body.push(card(text_block(status.to_string()).foreground(ThemeRef::SystemCritical)).into()); + body.push( + InfoBar::new("Couldn't connect") + .message(status.to_string()) + .error() + .is_closable(false) + .into(), + ); } // Saved (trusted/paired) hosts. @@ -439,6 +522,7 @@ fn hosts_page(props: &HostsProps, cx: &mut RenderCx) -> Element { .vertical_alignment(VerticalAlignment::Center), button("Connect") .accent() + .icon(SymbolGlyph::Forward) .on_click(connect_manual) .grid_column(1) .margin(edges(8.0, 0.0, 0.0, 0.0)), @@ -515,6 +599,8 @@ fn connect( gamepad: gamepad_pref, bitrate_kbps: s.bitrate_kbps, mic_enabled: s.mic_enabled, + hdr_enabled: s.hdr_enabled, + decoder: DecoderPref::from_name(&s.decoder), pin, identity: ctx.identity.clone(), }); @@ -594,64 +680,87 @@ fn pair_page(props: &Svc, cx: &mut RenderCx) -> Element { code.clone(), target.clone(), ); - button("Pair & Connect").accent().on_click(move || { - let pin = code2.trim().to_string(); - let (ctx3, ss, st, target3) = (ctx2.clone(), ss.clone(), st.clone(), target2.clone()); - std::thread::spawn(move || { - let name = - std::env::var("COMPUTERNAME").unwrap_or_else(|_| "windows-client".into()); - match NativeClient::pair( - &target3.addr, - target3.port, - (&ctx3.identity.0, &ctx3.identity.1), - &pin, - &name, - std::time::Duration::from_secs(90), - ) { - Ok(fp) => { - let mut k = KnownHosts::load(); - k.upsert(KnownHost { - name: target3.name.clone(), - addr: target3.addr.clone(), - port: target3.port, - fp_hex: 
trust::hex(&fp), - paired: true, - }); - let _ = k.save(); - connect(&ctx3, &target3, Some(fp), &ss, &st); + button("Pair & Connect") + .accent() + .icon(SymbolGlyph::Accept) + .on_click(move || { + let pin = code2.trim().to_string(); + let (ctx3, ss, st, target3) = + (ctx2.clone(), ss.clone(), st.clone(), target2.clone()); + std::thread::spawn(move || { + let name = + std::env::var("COMPUTERNAME").unwrap_or_else(|_| "windows-client".into()); + match NativeClient::pair( + &target3.addr, + target3.port, + (&ctx3.identity.0, &ctx3.identity.1), + &pin, + &name, + std::time::Duration::from_secs(90), + ) { + Ok(fp) => { + let mut k = KnownHosts::load(); + k.upsert(KnownHost { + name: target3.name.clone(), + addr: target3.addr.clone(), + port: target3.port, + fp_hex: trust::hex(&fp), + paired: true, + }); + let _ = k.save(); + connect(&ctx3, &target3, Some(fp), &ss, &st); + } + Err(e) => { + st.call(format!("Pairing failed: {e:?} (wrong PIN, or not armed?)")); + ss.call(Screen::Hosts); + } } - Err(e) => { - st.call(format!("Pairing failed: {e:?} (wrong PIN, or not armed?)")); - ss.call(Screen::Hosts); - } - } - }); - }) + }); + }) }; let cancel_btn = { let ss = set_screen.clone(); - button("Cancel").on_click(move || ss.call(Screen::Hosts)) + button("Cancel") + .icon(SymbolGlyph::Cancel) + .on_click(move || ss.call(Screen::Hosts)) }; let content = card(vstack(( - text_block(format!("Pair with {}", target.name)) - .font_size(20.0) - .semibold(), - text_block( - "Arm pairing on the host (its console or web console), then enter the 4-digit PIN it \ - shows.", - ) - .foreground(ThemeRef::SecondaryText) - .max_width(440.0), + grid(( + avatar(&target.name) + .grid_column(0) + .vertical_alignment(VerticalAlignment::Center), + vstack(( + text_block(format!("Pair with {}", target.name)) + .font_size(20.0) + .semibold(), + text_block(format!("{}:{}", target.addr, target.port)) + .font_size(12.0) + .foreground(ThemeRef::SecondaryText), + )) + .spacing(2.0) + .grid_column(1) + 
.vertical_alignment(VerticalAlignment::Center) + .margin(edges(12.0, 0.0, 0.0, 0.0)), + )) + .columns([GridLength::Auto, GridLength::Star(1.0)]), + InfoBar::new("Arm pairing on the host") + .message( + "On the host's console or web console, start pairing — it shows a 4-digit PIN. \ + Enter it below within 90 seconds.", + ) + .informational() + .is_closable(false), text_box(code) .placeholder("PIN") + .font_size(28.0) .on_changed(move |s| set_code.call(s)), hstack((pair_btn, cancel_btn)).spacing(8.0), )) - .spacing(14.0)) + .spacing(16.0)) .max_width(480.0) .horizontal_alignment(HorizontalAlignment::Center) - .margin(edges(0.0, 80.0, 0.0, 0.0)); + .margin(edges(0.0, 60.0, 0.0, 0.0)); page(vec![content.into()]) } @@ -708,10 +817,69 @@ fn settings_page(ctx: &Arc, set_screen: &AsyncSetState) -> Eleme s.save(); }) }; + let dec_i = DECODERS + .iter() + .position(|&(v, _)| v == s.decoder) + .unwrap_or(0) as i32; + let dec_names: Vec = DECODERS.iter().map(|&(_, l)| l.to_string()).collect(); + let decoder_combo = { + let ctx = ctx.clone(); + ComboBox::new(dec_names) + .header("Video decoder") + .selected_index(dec_i) + .on_selection_changed(move |i: i32| { + let (v, _) = DECODERS[(i.max(0) as usize).min(DECODERS.len() - 1)]; + let mut s = ctx.settings.lock().unwrap(); + s.decoder = v.to_string(); + s.save(); + }) + }; + + let br_i = BITRATES_MBPS + .iter() + .position(|&m| m * 1000 == s.bitrate_kbps) + .unwrap_or(0) as i32; + let br_names: Vec = BITRATES_MBPS + .iter() + .map(|&m| { + if m == 0 { + "Automatic".into() + } else { + format!("{m} Mb/s") + } + }) + .collect(); + let bitrate_combo = { + let ctx = ctx.clone(); + ComboBox::new(br_names) + .header("Bitrate") + .selected_index(br_i) + .on_selection_changed(move |i: i32| { + let m = BITRATES_MBPS[(i.max(0) as usize).min(BITRATES_MBPS.len() - 1)]; + let mut s = ctx.settings.lock().unwrap(); + s.bitrate_kbps = m * 1000; + s.save(); + }) + }; + + let hdr_toggle = { + let ctx = ctx.clone(); + 
ToggleSwitch::new(s.hdr_enabled) + .header("HDR (10-bit, BT.2020 PQ)") + .on_content("On") + .off_content("Off") + .on_changed(move |on: bool| { + let mut s = ctx.settings.lock().unwrap(); + s.hdr_enabled = on; + s.save(); + }) + }; let mic_toggle = { let ctx = ctx.clone(); - check_box(s.mic_enabled) - .label("Stream microphone to the host") + ToggleSwitch::new(s.mic_enabled) + .header("Stream microphone to the host") + .on_content("On") + .off_content("Off") .on_changed(move |on: bool| { let mut s = ctx.settings.lock().unwrap(); s.mic_enabled = on; @@ -727,6 +895,7 @@ fn settings_page(ctx: &Arc, set_screen: &AsyncSetState) -> Eleme .vertical_alignment(VerticalAlignment::Center), button("Back") .accent() + .icon(SymbolGlyph::Back) .on_click({ let ss = set_screen.clone(); move || ss.call(Screen::Hosts) @@ -739,7 +908,7 @@ fn settings_page(ctx: &Arc, set_screen: &AsyncSetState) -> Eleme let stream_card = card( vstack(( - text_block("Stream").font_size(15.0).semibold(), + text_block("Display").font_size(15.0).semibold(), text_block("The host creates a virtual display at exactly this mode.") .font_size(12.0) .foreground(ThemeRef::SecondaryText), @@ -749,13 +918,31 @@ fn settings_page(ctx: &Arc, set_screen: &AsyncSetState) -> Eleme .spacing(10.0), ); + let video_card = card( + vstack(( + text_block("Video").font_size(15.0).semibold(), + text_block( + "Hardware decode (D3D11VA) is zero-copy and far lighter than software — keep it on \ + Automatic unless debugging.", + ) + .font_size(12.0) + .foreground(ThemeRef::SecondaryText), + decoder_combo, + bitrate_combo, + hdr_toggle, + )) + .spacing(10.0), + ); + let audio_card = card(vstack((text_block("Audio").font_size(15.0).semibold(), mic_toggle)).spacing(10.0)); page(vec![ header.into(), - section("STREAM"), + section("DISPLAY"), stream_card.into(), + section("VIDEO"), + video_card.into(), section("AUDIO"), audio_card.into(), ]) @@ -764,12 +951,13 @@ fn settings_page(ctx: &Arc, set_screen: &AsyncSetState) -> Eleme // --- 
stream page -------------------------------------------------------------------------- fn present_newest(ctx: &mut PresentCtx) { + // Drain to the newest decoded frame (drop any backlog) and hand it to the presenter by value — + // the GPU zero-copy path retains the decoder surface across re-presents, so ownership matters. let mut newest = None; while let Ok(f) = ctx.frames.try_recv() { newest = Some(f); } - let cpu = newest.as_ref().map(|DecodedFrame::Cpu(c)| c); - ctx.presenter.present(cpu); + ctx.presenter.present(newest); } fn stream_page(props: &StreamProps, cx: &mut RenderCx) -> Element { @@ -839,34 +1027,53 @@ fn stream_page(props: &StreamProps, cx: &mut RenderCx) -> Element { .into() } -/// The streaming HUD overlay (top-right), mirroring the Apple client: mode + fps/throughput, the -/// capture→client latency + decode time, and the release-cursor hint. Layered over the +/// A small chip for the dark HUD: coloured text on a translucent dark fill. +fn hud_chip(text: &str, color: Color) -> Border { + border( + text_block(text) + .font_size(11.0) + .semibold() + .foreground(color), + ) + .background(Color::rgb(38, 38, 38)) + .corner_radius(8.0) + .padding(edges(8.0, 2.0, 8.0, 2.0)) +} + +/// The streaming HUD overlay (top-right), mirroring the Apple client: a chip row (mode · decode +/// path · HDR), the fps/throughput/latency line, and the release-cursor hint. Layered over the /// `SwapChainPanel` in the same grid cell. 
fn hud_overlay(stats: &Stats, mode: Option) -> Element { let res = mode .map(|m| format!("{}\u{00D7}{}@{}", m.width, m.height, m.refresh_hz)) .unwrap_or_else(|| "\u{2014}".into()); - let line1 = format!("{res} {:.0} fps {:.1} Mb/s", stats.fps, stats.mbps); - let line2 = format!( - "capture\u{2192}client {:.1} ms p50 \u{00B7} decode {:.1} ms", - stats.latency_ms, stats.decode_ms + let mut chips: Vec = vec![hud_chip(&res, Color::rgb(235, 235, 235)).into()]; + chips.push(if stats.hardware { + hud_chip("GPU decode", Color::rgb(120, 220, 150)).into() + } else { + hud_chip("CPU decode", Color::rgb(240, 190, 90)).into() + }); + if stats.hdr { + chips.push(hud_chip("HDR", Color::rgb(255, 205, 90)).into()); + } + let line = format!( + "{:.0} fps \u{00B7} {:.1} Mb/s \u{00B7} {:.1} ms p50 \u{00B7} decode {:.1} ms", + stats.fps, stats.mbps, stats.latency_ms, stats.decode_ms ); border( vstack(( - text_block(line1) - .font_size(12.0) - .foreground(Color::rgb(255, 255, 255)), - text_block(line2) + hstack(chips).spacing(6.0), + text_block(line) .font_size(11.0) - .foreground(Color::rgb(200, 200, 200)), + .foreground(Color::rgb(210, 210, 210)), text_block("Ctrl+Alt+Shift+Q releases the mouse") .font_size(11.0) - .foreground(Color::rgb(160, 160, 160)), + .foreground(Color::rgb(150, 150, 150)), )) - .spacing(2.0), + .spacing(6.0), ) .background(Color::rgb(0, 0, 0)) - .corner_radius(8.0) + .corner_radius(10.0) .padding(uniform(10.0)) .opacity(0.82) .horizontal_alignment(HorizontalAlignment::Right) diff --git a/clients/windows/src/gpu.rs b/clients/windows/src/gpu.rs new file mode 100644 index 0000000..3c76015 --- /dev/null +++ b/clients/windows/src/gpu.rs @@ -0,0 +1,121 @@ +//! The single Direct3D 11 device shared by the video decoder (D3D11VA hardware decode) and the +//! presenter (the `SwapChainPanel` composition swapchain + the present draw). +//! +//! Zero-copy hardware decode requires FFmpeg to decode HEVC into `ID3D11Texture2D`s created by the +//! 
**same** device the presenter binds as shader resources and draws with — a texture from one +//! device can't be sampled by another. So the device is created once, here, and both subsystems +//! pull it from a process-global `OnceLock` (initialised on whichever thread asks first: the +//! session pump when it builds the decoder, or the UI thread when it builds the presenter). +//! +//! **Thread-safety.** windows-rs COM interfaces are deliberately `!Send`/`!Sync` — thread-safety +//! is per-object, not universal. An `ID3D11Device` and its immediate context become free-threaded +//! once `ID3D11Multithread::SetMultithreadProtected(TRUE)` is set, which FFmpeg's D3D11VA backend +//! does inside `av_hwdevice_ctx_init` (it installs an `ID3D11Multithread`-based default lock when we +//! leave `AVD3D11VADeviceContext.lock` null). The decoder then uses FFmpeg's separate +//! `ID3D11VideoContext` for decode while the presenter uses the immediate context for draw; under +//! multithread protection D3D serialises the two internally, and decode/draw touch disjoint context +//! state. That makes the `unsafe impl Send + Sync` below sound for exactly this usage. + +use anyhow::{anyhow, Result}; +use std::sync::OnceLock; +use windows::core::Interface; +use windows::Win32::Graphics::Direct3D::{ + D3D_DRIVER_TYPE_HARDWARE, D3D_DRIVER_TYPE_WARP, D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_11_1, +}; +use windows::Win32::Graphics::Direct3D11::{ + D3D11CreateDevice, ID3D11Device, ID3D11DeviceContext, ID3D11Multithread, + D3D11_CREATE_DEVICE_BGRA_SUPPORT, D3D11_CREATE_DEVICE_VIDEO_SUPPORT, D3D11_SDK_VERSION, +}; + +pub struct SharedDevice { + pub device: ID3D11Device, + pub context: ID3D11DeviceContext, + /// True when this is a real GPU (hardware) adapter — a precondition for D3D11VA decode. WARP + /// (the GPU-less dev box) creates fine for present but cannot hardware-decode HEVC, so the + /// decoder skips straight to the software path there. 
+    pub hardware: bool,
+}
+
+// Sound for our usage — see the module docs: the device + immediate context are free-threaded under
+// the multithread protection FFmpeg installs, and decode (video context) / present (immediate
+// context) never share mutable context state.
+unsafe impl Send for SharedDevice {}
+unsafe impl Sync for SharedDevice {}
+
+static SHARED: OnceLock<Option<SharedDevice>> = OnceLock::new();
+
+/// The process-wide shared D3D11 device, created on first call. `None` only if D3D11 device
+/// creation fails for both a hardware adapter and WARP (effectively never — WARP is always present).
+pub fn shared() -> Option<&'static SharedDevice> {
+    SHARED.get_or_init(create).as_ref()
+}
+
+fn create() -> Option<SharedDevice> {
+    match create_device() {
+        Ok(d) => Some(d),
+        Err(e) => {
+            tracing::error!(error = %e, "shared D3D11 device creation failed — no present/decode");
+            None
+        }
+    }
+}
+
+fn create_device() -> Result<SharedDevice> {
+    // Preference order: a hardware adapter with video support (enables D3D11VA); the same without
+    // the VIDEO flag (a driver that rejects it still presents + software-decodes); finally WARP for
+    // the GPU-less box. BGRA_SUPPORT is required for the composition swapchain in every case.
+    let attempts = [
+        (D3D_DRIVER_TYPE_HARDWARE, true, true),
+        (D3D_DRIVER_TYPE_HARDWARE, false, true),
+        (D3D_DRIVER_TYPE_WARP, false, false),
+    ];
+    for (driver, video, hardware) in attempts {
+        let flags = if video {
+            D3D11_CREATE_DEVICE_BGRA_SUPPORT | D3D11_CREATE_DEVICE_VIDEO_SUPPORT
+        } else {
+            D3D11_CREATE_DEVICE_BGRA_SUPPORT
+        };
+        let mut device = None;
+        let mut context = None;
+        let r = unsafe {
+            D3D11CreateDevice(
+                None,
+                driver,
+                None,
+                flags,
+                Some(&[D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0]),
+                D3D11_SDK_VERSION,
+                Some(&mut device),
+                None,
+                Some(&mut context),
+            )
+        };
+        if r.is_ok() {
+            let (device, context) = (device.unwrap(), context.unwrap());
+            // Make the device + immediate context free-threaded: the decoder (D3D11VA video context,
+            // pump thread) and the presenter (immediate context, UI thread) both touch this device.
+            // FFmpeg also sets this during hwdevice init, but doing it up front keeps the
+            // cross-thread `Send`/`Sync` sound from the moment the device exists.
+            if let Ok(mt) = context.cast::<ID3D11Multithread>() {
+                unsafe { mt.SetMultithreadProtected(true) };
+            }
+            tracing::info!(
+                driver = if hardware {
+                    "hardware"
+                } else {
+                    "WARP (software)"
+                },
+                video,
+                "shared D3D11 device created"
+            );
+            return Ok(SharedDevice {
+                device,
+                context,
+                hardware,
+            });
+        }
+    }
+    Err(anyhow!(
+        "D3D11CreateDevice failed for both hardware and WARP"
+    ))
+}
diff --git a/clients/windows/src/main.rs b/clients/windows/src/main.rs
index 66fd6d9..a1a2b0d 100644
--- a/clients/windows/src/main.rs
+++ b/clients/windows/src/main.rs
@@ -10,7 +10,8 @@
 //! punktfunk-client (open the WinUI 3 window: host list, settings, pairing)
 //! punktfunk-client --discover (list punktfunk hosts on the LAN)
 //! punktfunk-client --headless --connect host[:port] [--pin HEX] [--pair PIN] [--mode WxHxHz]
-//! [--bitrate MBPS] [--mic] (no window; count frames + print stats)
+//! [--bitrate MBPS] [--mic] [--decoder auto|hardware|software] [--no-hdr]
+//!
(no window; count frames + print stats) // Link as a GUI (windows) subsystem binary so the default windowed launch (MSIX / double-click) // does NOT pop a console window. The CLI paths (--headless/--discover) reattach to the launching @@ -26,6 +27,8 @@ mod discovery; #[cfg(windows)] mod gamepad; #[cfg(windows)] +mod gpu; +#[cfg(windows)] mod input; #[cfg(windows)] mod present; @@ -162,7 +165,11 @@ fn run_headless_cli(args: &[String], identity: (String, String)) { } } - tracing::info!(%host, port, ?mode, tofu = pin.is_none(), "connecting (headless)"); + let decoder = arg("--decoder") + .map(|d| crate::video::DecoderPref::from_name(&d)) + .unwrap_or_default(); + + tracing::info!(%host, port, ?mode, tofu = pin.is_none(), ?decoder, "connecting (headless)"); let handle = session::start(session::SessionParams { host, port, @@ -171,6 +178,8 @@ fn run_headless_cli(args: &[String], identity: (String, String)) { gamepad: GamepadPref::Auto, bitrate_kbps, mic_enabled: flag("--mic"), + hdr_enabled: !flag("--no-hdr"), + decoder, pin, identity, }); diff --git a/clients/windows/src/present.rs b/clients/windows/src/present.rs index e72b708..1e5fb1d 100644 --- a/clients/windows/src/present.rs +++ b/clients/windows/src/present.rs @@ -1,32 +1,41 @@ -//! Direct3D11 presenter for a WinUI 3 `SwapChainPanel`: upload a decoded `CpuFrame` (RGBA) -//! into a dynamic texture and draw it Contain-fit into a **composition** flip-model swapchain, -//! which the reactor stream page binds to the panel via `SwapChainPanelHandle::set_swap_chain`. +//! Direct3D11 presenter for a WinUI 3 `SwapChainPanel`. It draws a decoded frame Contain-fit into a +//! **composition** flip-model swapchain, which the reactor stream page binds to the panel via +//! `SwapChainPanelHandle::set_swap_chain`. //! -//! The device prefers a hardware adapter and falls back to **WARP** (the GPU-less dev box runs -//! the whole present path in software). The draw is a single full-screen triangle sampling the -//! 
video texture; a letterbox is produced by clearing the back buffer black and setting the -//! viewport to the Contain-fit rect (no per-frame vertex buffer). +//! Two frame sources, one swapchain: //! -//! **HDR10**: when a frame is BT.2020 PQ (`CpuFrame::hdr`), the swapchain flips to -//! `R10G10B10A2` + `DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020` (+ HDR10 metadata) via -//! `ResizeBuffers`/`SetColorSpace1`; the decoded samples are already PQ-encoded so the shader is a -//! plain passthrough and the compositor maps PQ→display. SDR stays 8-bit B8G8R8A8. +//! * **GPU (zero-copy)** — [`crate::video::GpuFrame`] is a decoder-owned NV12/P010 `ID3D11Texture2D` +//! array slice (D3D11VA). We create per-plane shader-resource views over the slice and convert +//! YUV→RGB in a pixel shader: NV12 via BT.709 (`ps_nv12`), P010 via BT.2020 with the PQ transfer +//! left intact (`ps_p010`). No CPU copy. The decoder uses the **same** shared device +//! ([`crate::gpu`]) so the texture is bindable here. +//! * **CPU upload** — [`crate::video::CpuFrame`] is packed RGBA (SDR) or X2BGR10 (HDR) from the +//! software decoder; we upload it into a dynamic texture and draw it with a passthrough shader +//! (`ps_rgba`). The fallback path. +//! +//! **HDR10**: when a frame is BT.2020 PQ the swapchain flips to `R10G10B10A2` + +//! `DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020` (+ HDR10 metadata) via `ResizeBuffers`/ +//! `SetColorSpace1`; the shader output is already PQ-encoded so the compositor maps PQ→display. SDR +//! stays 8-bit B8G8R8A8. //! //! All `windows` types here come from the same windows-rs commit as `windows-reactor`, so the //! `IDXGISwapChain1` handed to `set_swap_chain` satisfies reactor's `windows_core::Interface`. 
-use crate::video::CpuFrame; +use crate::video::{DecodedFrame, GpuFrame}; use anyhow::{anyhow, Context, Result}; use windows::core::{Interface, PCSTR}; use windows::Win32::Graphics::Direct3D::Fxc::{D3DCompile, D3DCOMPILE_OPTIMIZATION_LEVEL3}; use windows::Win32::Graphics::Direct3D::{ - ID3DBlob, D3D_DRIVER_TYPE_HARDWARE, D3D_DRIVER_TYPE_WARP, D3D_FEATURE_LEVEL_11_0, - D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST, + ID3DBlob, D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST, D3D_SRV_DIMENSION_TEXTURE2DARRAY, }; use windows::Win32::Graphics::Direct3D11::*; use windows::Win32::Graphics::Dxgi::Common::*; use windows::Win32::Graphics::Dxgi::*; +// One vertex shader (fullscreen triangle) + three pixel shaders, selected per frame source. tex0 is +// RGBA (passthrough) or the luma plane; tex1 is the chroma plane. The YUV→RGB matrices fold the +// limited→full range scale into the coefficients; for P010 the R16 sample is rescaled (×65535/65472) +// to undo the 10-bits-in-the-high-bits packing, then converted with BT.2020 NCL, PQ preserved. 
const SHADER_HLSL: &str = r#" struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; }; VSOut vs_main(uint vid : SV_VertexID) { @@ -36,44 +45,104 @@ VSOut vs_main(uint vid : SV_VertexID) { o.uv = uv; return o; } -Texture2D tex : register(t0); +Texture2D tex0 : register(t0); +Texture2D tex1 : register(t1); SamplerState smp : register(s0); -float4 ps_main(VSOut i) : SV_Target { return tex.Sample(smp, i.uv); } + +float4 ps_rgba(VSOut i) : SV_Target { return tex0.Sample(smp, i.uv); } + +float4 ps_nv12(VSOut i) : SV_Target { + float y = tex0.Sample(smp, i.uv).r; + float2 uv = tex1.Sample(smp, i.uv).rg; + float yy = (y - 0.0627451) * 1.164384; // (Y-16/255)*255/219 + float u = uv.x - 0.5; + float v = uv.y - 0.5; // BT.709 limited, chroma scale folded + float r = yy + 1.792741 * v; + float g = yy - 0.213249 * u - 0.532909 * v; + float b = yy + 2.112402 * u; + return float4(saturate(float3(r, g, b)), 1.0); +} + +float4 ps_p010(VSOut i) : SV_Target { + const float S = 65535.0 / 65472.0; // undo P010 high-bit packing → exact 10-bit / 1023 + float y = tex0.Sample(smp, i.uv).r * S; + float2 uv = tex1.Sample(smp, i.uv).rg * S; + float yy = (y - 0.0625611) * 1.167808; // (Y-64/1023)*1023/876 + float u = uv.x - 0.5; + float v = uv.y - 0.5; // BT.2020 NCL limited, chroma scale folded; PQ kept + float r = yy + 1.683611 * v; + float g = yy - 0.187877 * u - 0.652337 * v; + float b = yy + 2.148072 * u; + return float4(saturate(float3(r, g, b)), 1.0); +} "#; +/// A bound GPU frame: per-plane SRVs over the decoder's texture-array slice, plus the `GpuFrame` +/// itself kept alive so the decoder won't recycle the slice while we re-present it. +struct GpuView { + y: ID3D11ShaderResourceView, + c: ID3D11ShaderResourceView, + frame: GpuFrame, +} + +/// Current draw source. 
+#[derive(Clone, Copy, PartialEq)] +enum Mode { + Empty, + Rgba, + Nv12, + P010, +} + pub struct Presenter { device: ID3D11Device, context: ID3D11DeviceContext, vs: ID3D11VertexShader, - ps: ID3D11PixelShader, + ps_rgba: ID3D11PixelShader, + ps_nv12: ID3D11PixelShader, + ps_p010: ID3D11PixelShader, sampler: ID3D11SamplerState, swap: IDXGISwapChain1, rtv: Option<ID3D11RenderTargetView>, - /// Video texture + SRV + dimensions; recreated when the decoded size changes. - tex: Option<(ID3D11Texture2D, ID3D11ShaderResourceView, u32, u32)>, + /// CPU-upload texture + SRV + dimensions; recreated when the decoded size/format changes. + cpu_tex: Option<(ID3D11Texture2D, ID3D11ShaderResourceView, u32, u32)>, + /// Bound zero-copy GPU frame (held to keep its decoder surface alive). + gpu: Option<GpuView>, + mode: Mode, + /// Source frame dimensions, for the Contain-fit letterbox. + src_w: u32, + src_h: u32, /// Panel (swapchain) size in pixels, updated on resize. panel_w: u32, panel_h: u32, - /// Whether the swapchain is currently in 10-bit HDR10 (R10G10B10A2 + ST.2084) mode; flipped - /// to match each frame's `hdr` flag. + /// Whether the swapchain is currently in 10-bit HDR10 (R10G10B10A2 + ST.2084) mode. hdr: bool, } impl Presenter { - /// Create the D3D11 device + composition swapchain + shaders, sized to the panel. + /// Create the presenter on the process-wide shared D3D11 device (the one the decoder uses), plus + /// the composition swapchain + shaders, sized to the panel. 
pub fn new(width: u32, height: u32) -> Result<Self> { - let (device, context) = create_device()?; - let (vs, ps, sampler) = build_pipeline(&device)?; + let shared = crate::gpu::shared().ok_or_else(|| anyhow!("no shared D3D11 device"))?; + let device = shared.device.clone(); + let context = shared.context.clone(); + let (vs, ps_rgba, ps_nv12, ps_p010, sampler) = build_pipeline(&device)?; let swap = create_composition_swapchain(&device, width.max(1), height.max(1))?; Ok(Presenter { device, context, vs, - ps, + ps_rgba, + ps_nv12, + ps_p010, sampler, swap, rtv: None, - tex: None, + cpu_tex: None, + gpu: None, + mode: Mode::Empty, + src_w: 1, + src_h: 1, panel_w: width.max(1), panel_h: height.max(1), hdr: false, @@ -104,31 +173,122 @@ impl Presenter { self.panel_h = height; } - /// Present one decoded frame (Contain-fit) — or, when `frame` is `None`, just re-present the - /// last texture (or black). Called from the reactor `on_rendering` per-frame callback. - pub fn present(&mut self, frame: Option<&CpuFrame>) { - if let Some(f) = frame { - if f.hdr != self.hdr { - self.set_hdr(f.hdr); + /// Present one decoded frame (Contain-fit) — or, when `frame` is `None`, re-present the last one + /// (or black). Called from the reactor `on_rendering` per-frame callback on the UI thread. Takes + /// the frame by value so the GPU path can retain the decoder surface across re-presents. 
+ pub fn present(&mut self, frame: Option<DecodedFrame>) { + match frame { + Some(DecodedFrame::Cpu(c)) => { + if c.hdr != self.hdr { + self.set_hdr(c.hdr); + } + if let Err(e) = self.upload(&c) { + tracing::warn!(error = %e, "frame upload failed"); + } else { + self.mode = Mode::Rgba; + self.src_w = c.width; + self.src_h = c.height; + self.gpu = None; // drop any held GPU frame + } } - if let Err(e) = self.upload(f) { - tracing::warn!(error = %e, "frame upload failed"); + Some(DecodedFrame::Gpu(g)) => { + if g.hdr != self.hdr { + self.set_hdr(g.hdr); + } + match self.bind_gpu(g) { + Ok(()) => {} + Err(e) => tracing::warn!(error = %e, "GPU frame bind failed"), + } } + None => {} } + self.draw(); + } + + /// Build per-plane SRVs over the decoded texture-array slice and retain the frame. + fn bind_gpu(&mut self, g: GpuFrame) -> Result<()> { + let tex: ID3D11Texture2D = unsafe { + let raw = g.texture_ptr(); + ID3D11Texture2D::from_raw_borrowed(&raw) + .ok_or_else(|| anyhow!("null D3D11 texture"))? + .clone() + }; + // NV12: R8 luma + R8G8 chroma. P010: R16 luma + R16G16 chroma (10 bits in the high bits). + let (fy, fc) = if g.hdr { + (DXGI_FORMAT_R16_UNORM, DXGI_FORMAT_R16G16_UNORM) + } else { + (DXGI_FORMAT_R8_UNORM, DXGI_FORMAT_R8G8_UNORM) + }; + let y = self.array_srv(&tex, fy, g.index)?; + let c = self.array_srv(&tex, fc, g.index)?; + self.mode = if g.hdr { Mode::P010 } else { Mode::Nv12 }; + self.src_w = g.width; + self.src_h = g.height; + self.gpu = Some(GpuView { y, c, frame: g }); + Ok(()) + } + + /// A shader-resource view over a single slice of a texture array, reinterpreting the plane + /// format (the NV12/P010 sub-format trick D3D11 allows on video textures). 
+ fn array_srv( + &self, + tex: &ID3D11Texture2D, + format: DXGI_FORMAT, + slice: u32, + ) -> Result<ID3D11ShaderResourceView> { + let desc = D3D11_SHADER_RESOURCE_VIEW_DESC { + Format: format, + ViewDimension: D3D_SRV_DIMENSION_TEXTURE2DARRAY, + Anonymous: D3D11_SHADER_RESOURCE_VIEW_DESC_0 { + Texture2DArray: D3D11_TEX2D_ARRAY_SRV { + MostDetailedMip: 0, + MipLevels: 1, + FirstArraySlice: slice, + ArraySize: 1, + }, + }, + }; + unsafe { + let mut srv = None; + self.device + .CreateShaderResourceView(tex, Some(&desc), Some(&mut srv)) + .context("CreateShaderResourceView (array slice)")?; + srv.ok_or_else(|| anyhow!("null SRV")) + } + } + + fn draw(&mut self) { let Ok(rtv) = self.rtv() else { return; }; let (pw, ph) = (self.panel_w, self.panel_h); + // Resolve the current source's shader + the (up to two) SRVs to bind — cheap interface + // clones. Each arm yields `Option<(&pixel_shader, [Option<srv>; 2])>`. + let binding = match self.mode { + Mode::Rgba => self + .cpu_tex + .as_ref() + .map(|(_, srv, _, _)| (&self.ps_rgba, [Some(srv.clone()), None])), + Mode::Nv12 => self + .gpu + .as_ref() + .map(|g| (&self.ps_nv12, [Some(g.y.clone()), Some(g.c.clone())])), + Mode::P010 => self + .gpu + .as_ref() + .map(|g| (&self.ps_p010, [Some(g.y.clone()), Some(g.c.clone())])), + Mode::Empty => None, + }; unsafe { let c = &self.context; c.ClearRenderTargetView(&rtv, &[0.0, 0.0, 0.0, 1.0]); - if let Some((_, srv, vw, vh)) = &self.tex { + if let Some((ps, srvs)) = binding { // Contain-fit viewport: scale to the smaller axis, centre, letterbox the rest. 
let (ww, wh, vfw, vfh) = ( pw as f32, ph as f32, - (*vw).max(1) as f32, - (*vh).max(1) as f32, + self.src_w.max(1) as f32, + self.src_h.max(1) as f32, ); let scale = (ww / vfw).min(wh / vfh); let (dw, dh) = (vfw * scale, vfh * scale); @@ -146,8 +306,8 @@ impl Presenter { c.IASetInputLayout(None); c.IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST); c.VSSetShader(&self.vs, None); - c.PSSetShader(&self.ps, None); - c.PSSetShaderResources(0, Some(&[Some(srv.clone())])); + c.PSSetShader(ps, None); + c.PSSetShaderResources(0, Some(&srvs)); c.PSSetSamplers(0, Some(&[Some(self.sampler.clone())])); c.Draw(3, 0); } @@ -155,14 +315,13 @@ impl Presenter { } } - /// Switch the swapchain between 8-bit SDR (B8G8R8A8, sRGB/BT.709) and 10-bit HDR10 - /// (R10G10B10A2, ST.2084 PQ BT.2020). `ResizeBuffers` can change the back-buffer format in - /// place, so the panel binding (`set_swap_chain`) stays valid — no rebind needed. The decoded - /// samples are already PQ-encoded BT.2020 (see `video::convert`), so the colour space is all the - /// compositor needs to map them to the display. + /// Switch the swapchain between 8-bit SDR (B8G8R8A8, BT.709) and 10-bit HDR10 (R10G10B10A2, + /// ST.2084 PQ BT.2020). `ResizeBuffers` changes the back-buffer format in place, so the panel + /// binding (`set_swap_chain`) stays valid — no rebind. Both frame sources already produce + /// PQ-encoded BT.2020 for HDR, so the colour space is all the compositor needs. 
fn set_hdr(&mut self, on: bool) { self.rtv = None; // release back-buffer refs before ResizeBuffers - self.tex = None; // texture format changes (R10G10B10A2 vs R8G8B8A8) + self.cpu_tex = None; // CPU texture format changes (R10G10B10A2 vs R8G8B8A8) let format = if on { DXGI_FORMAT_R10G10B10A2_UNORM } else { @@ -208,9 +367,9 @@ impl Presenter { tracing::info!(hdr = on, "swapchain colour mode switched"); } - fn upload(&mut self, frame: &CpuFrame) -> Result<()> { + fn upload(&mut self, frame: &crate::video::CpuFrame) -> Result<()> { let (w, h) = (frame.width, frame.height); - let need_new = !matches!(&self.tex, Some((_, _, tw, th)) if *tw == w && *th == h); + let need_new = !matches!(&self.cpu_tex, Some((_, _, tw, th)) if *tw == w && *th == h); if need_new { let format = if self.hdr { DXGI_FORMAT_R10G10B10A2_UNORM @@ -246,9 +405,9 @@ impl Presenter { .context("CreateShaderResourceView")?; s.unwrap() }; - self.tex = Some((texture, srv, w, h)); + self.cpu_tex = Some((texture, srv, w, h)); } - let (texture, _, _, _) = self.tex.as_ref().unwrap(); + let (texture, _, _, _) = self.cpu_tex.as_ref().unwrap(); unsafe { let mut mapped = D3D11_MAPPED_SUBRESOURCE::default(); self.context @@ -286,38 +445,6 @@ impl Presenter { } } -fn create_device() -> Result<(ID3D11Device, ID3D11DeviceContext)> { - for driver in [D3D_DRIVER_TYPE_HARDWARE, D3D_DRIVER_TYPE_WARP] { - let mut device = None; - let mut context = None; - let r = unsafe { - D3D11CreateDevice( - None, - driver, - None, - D3D11_CREATE_DEVICE_BGRA_SUPPORT, - Some(&[D3D_FEATURE_LEVEL_11_0]), - D3D11_SDK_VERSION, - Some(&mut device), - None, - Some(&mut context), - ) - }; - if r.is_ok() { - let name = if driver == D3D_DRIVER_TYPE_HARDWARE { - "hardware" - } else { - "WARP (software)" - }; - tracing::info!(driver = name, "D3D11 device created"); - return Ok((device.unwrap(), context.unwrap())); - } - } - Err(anyhow!( - "D3D11CreateDevice failed for both hardware and WARP" - )) -} - /// A composition flip-model swapchain (no 
HWND) for binding to a XAML `SwapChainPanel`. fn create_composition_swapchain( device: &ID3D11Device, @@ -357,18 +484,34 @@ fn create_composition_swapchain( fn build_pipeline( device: &ID3D11Device, -) -> Result<(ID3D11VertexShader, ID3D11PixelShader, ID3D11SamplerState)> { +) -> Result<( + ID3D11VertexShader, + ID3D11PixelShader, + ID3D11PixelShader, + ID3D11PixelShader, + ID3D11SamplerState, +)> { let vs_blob = compile(SHADER_HLSL, "vs_main", "vs_5_0")?; - let ps_blob = compile(SHADER_HLSL, "ps_main", "ps_5_0")?; + let rgba_blob = compile(SHADER_HLSL, "ps_rgba", "ps_5_0")?; + let nv12_blob = compile(SHADER_HLSL, "ps_nv12", "ps_5_0")?; + let p010_blob = compile(SHADER_HLSL, "ps_p010", "ps_5_0")?; unsafe { let mut vs = None; device .CreateVertexShader(blob_bytes(&vs_blob), None, Some(&mut vs)) .context("CreateVertexShader")?; - let mut ps = None; + let mut ps_rgba = None; device - .CreatePixelShader(blob_bytes(&ps_blob), None, Some(&mut ps)) - .context("CreatePixelShader")?; + .CreatePixelShader(blob_bytes(&rgba_blob), None, Some(&mut ps_rgba)) + .context("CreatePixelShader (rgba)")?; + let mut ps_nv12 = None; + device + .CreatePixelShader(blob_bytes(&nv12_blob), None, Some(&mut ps_nv12)) + .context("CreatePixelShader (nv12)")?; + let mut ps_p010 = None; + device + .CreatePixelShader(blob_bytes(&p010_blob), None, Some(&mut ps_p010)) + .context("CreatePixelShader (p010)")?; let sdesc = D3D11_SAMPLER_DESC { Filter: D3D11_FILTER_MIN_MAG_MIP_LINEAR, AddressU: D3D11_TEXTURE_ADDRESS_CLAMP, @@ -381,7 +524,13 @@ fn build_pipeline( device .CreateSamplerState(&sdesc, Some(&mut sampler)) .context("CreateSamplerState")?; - Ok((vs.unwrap(), ps.unwrap(), sampler.unwrap())) + Ok(( + vs.unwrap(), + ps_rgba.unwrap(), + ps_nv12.unwrap(), + ps_p010.unwrap(), + sampler.unwrap(), + )) } } @@ -427,9 +576,9 @@ fn blob_bytes(blob: &ID3DBlob) -> &[u8] { } } -/// Generic HDR10 mastering metadata: BT.2020 primaries + D65 white (0.00002 units), a 1000-nit -/// mastering display, MaxCLL 1000 / 
MaxFALL 400. The protocol doesn't carry the stream's real -/// mastering metadata yet (host follow-up), so these are sane defaults the display tone-maps from. +/// Generic HDR10 mastering metadata: BT.2020 primaries + D65 white, a 1000-nit mastering display, +/// MaxCLL 1000 / MaxFALL 400. The protocol doesn't carry the stream's real mastering metadata yet +/// (host follow-up), so these are sane defaults the display tone-maps from. fn hdr10_metadata() -> DXGI_HDR_METADATA_HDR10 { DXGI_HDR_METADATA_HDR10 { RedPrimary: [35400, 14600], diff --git a/clients/windows/src/session.rs b/clients/windows/src/session.rs index 080fe20..b3007f4 100644 --- a/clients/windows/src/session.rs +++ b/clients/windows/src/session.rs @@ -8,7 +8,7 @@ //! (software-only here) and the audio backend (WASAPI). The pump body is identical. use crate::audio; -use crate::video::{DecodedFrame, Decoder}; +use crate::video::{DecodedFrame, Decoder, DecoderPref}; use punktfunk_core::client::NativeClient; use punktfunk_core::config::{CompositorPref, GamepadPref, Mode}; use punktfunk_core::PunktfunkError; @@ -25,6 +25,10 @@ pub struct SessionParams { pub bitrate_kbps: u32, /// Stream the default microphone to the host's virtual mic source. pub mic_enabled: bool, + /// Advertise 10-bit + HDR10 so the host may upgrade HDR content to a Main10/PQ stream. + pub hdr_enabled: bool, + /// Which video decode backend to use (auto/hardware/software). + pub decoder: DecoderPref, /// Pinned host fingerprint; `None` = trust on first use (caller persists the observed one). pub pin: Option<[u8; 32]>, pub identity: (String, String), @@ -37,6 +41,10 @@ pub struct Stats { pub decode_ms: f32, /// Median capture→decoded latency over the last window (host-clock corrected). pub latency_ms: f32, + /// True when decoding on the GPU (D3D11VA zero-copy) vs. CPU (software). + pub hardware: bool, + /// True when the stream is BT.2020 PQ HDR10 (last decoded frame). 
+ pub hdr: bool, } pub enum SessionEvent { @@ -99,10 +107,15 @@ fn pump( params.compositor, params.gamepad, params.bitrate_kbps, - // Advertise 10-bit + HDR10: the presenter handles BT.2020 PQ (R10G10B10A2) frames, so the - // host may upgrade HDR content to a Main10/PQ stream (it still only does so for actual HDR - // content with its own 10-bit gate). 8-bit SDR is unaffected. - punktfunk_core::quic::VIDEO_CAP_10BIT | punktfunk_core::quic::VIDEO_CAP_HDR, + // Advertise 10-bit + HDR10 (when enabled): the presenter handles BT.2020 PQ frames (P010 on + // the GPU path, X2BGR10 on software), so the host may upgrade HDR content to a Main10/PQ + // stream — it still only does so for actual HDR content with its own 10-bit gate. 8-bit SDR + // is unaffected. A client that turns HDR off advertises `0` and always gets the 8-bit stream. + if params.hdr_enabled { + punktfunk_core::quic::VIDEO_CAP_10BIT | punktfunk_core::quic::VIDEO_CAP_HDR + } else { + 0 + }, None, // launch: the Windows client has no library picker yet params.pin, Some(params.identity), @@ -132,13 +145,15 @@ fn pump( fingerprint: connector.host_fingerprint, }); - let mut decoder = match Decoder::new() { + let mut decoder = match Decoder::new(params.decoder) { Ok(d) => d, Err(e) => { let _ = ev_tx.send_blocking(SessionEvent::Ended(Some(format!("video decoder: {e}")))); return; } }; + let mut hardware = decoder.is_hardware(); + let mut hdr = false; // Audio is best-effort: a session without it still streams. Gamepads are the // app-lifetime service's job (the UI attaches it on Connected). let player = audio::AudioPlayer::spawn() @@ -178,12 +193,16 @@ fn pump( match decoder.decode(&frame.data) { Ok(Some(decoded)) => { total_frames += 1; + hdr = decoded.hdr(); + // The backend can demote D3D11VA → software mid-session on a hardware error. 
+ hardware = decoder.is_hardware(); if total_frames == 1 { - let DecodedFrame::Cpu(c) = &decoded; + let (w, h) = decoded.dims(); tracing::info!( - width = c.width, - height = c.height, - path = "software", + width = w, + height = h, + path = if hardware { "d3d11va" } else { "software" }, + hdr, "first frame decoded" ); } @@ -253,6 +272,8 @@ fn pump( 0.0 }, latency_ms: p50 as f32 / 1000.0, + hardware, + hdr, })); window_start = Instant::now(); frames_n = 0; diff --git a/clients/windows/src/trust.rs b/clients/windows/src/trust.rs index aec6aeb..fa32df0 100644 --- a/clients/windows/src/trust.rs +++ b/clients/windows/src/trust.rs @@ -130,6 +130,11 @@ pub struct Settings { pub inhibit_shortcuts: bool, /// Stream the default microphone to the host's virtual mic source. pub mic_enabled: bool, + /// Advertise 10-bit + HDR10 so the host upgrades HDR content to a Main10/PQ stream (the client + /// presents it on a 10-bit ST.2084 swapchain). No effect on SDR content. + pub hdr_enabled: bool, + /// Video decode backend: `auto` (D3D11VA, fall back to software), `hardware`, or `software`. + pub decoder: String, } impl Default for Settings { @@ -143,6 +148,8 @@ impl Default for Settings { compositor: "auto".into(), inhibit_shortcuts: true, mic_enabled: false, + hdr_enabled: true, + decoder: "auto".into(), } } } diff --git a/clients/windows/src/video.rs b/clients/windows/src/video.rs index 655e0d7..4473377 100644 --- a/clients/windows/src/video.rs +++ b/clients/windows/src/video.rs @@ -1,27 +1,76 @@ //! Video decode: reassembled HEVC access units → frames for the D3D11 presenter. //! -//! The dev box has no working GPU, so this ships the **software** backend first: libavcodec -//! on the CPU + swscale to RGBA, uploaded into a D3D11 texture by the presenter. It runs -//! `AV_CODEC_FLAG_LOW_DELAY` with slice threading only — the host encodes zero-reorder -//! streams (no B-frames, in-band parameter sets on every IDR), so decode is strictly -//! 
one-in/one-out and frame threading would only add latency. +//! Two backends, picked at session start (override via [`DecoderPref`] / the Settings UI): //! -//! `DecodedFrame` is an enum so the real-GPU **D3D11VA** path (decode → `NV12`/`P010` -//! `ID3D11Texture2D`, zero-copy into the swapchain) can be added as a second variant without -//! touching the session pump or the presenter's frame contract. +//! * **D3D11VA** (any GPU): libavcodec decodes on the GPU straight into `ID3D11Texture2D`s that +//! carry `D3D11_BIND_SHADER_RESOURCE`, so the presenter samples the decoded NV12/P010 surface +//! directly — **zero copy** (no swscale, no CPU readback, no per-frame upload). The textures are +//! created by the process-wide shared device ([`crate::gpu`]) the presenter also draws with, which +//! is what makes them bindable there. This is the big latency/throughput win over software decode. +//! * **Software**: libavcodec on the CPU + swscale to a packed 4-byte format the presenter uploads +//! (`RGBA` for SDR, `X2BGR10` for HDR). The fallback on a GPU-less box (WARP), when D3D11VA init +//! fails, or when a mid-session hardware error demotes us — the host's IDR/RFI recovery +//! resynchronizes on the next keyframe either way. +//! +//! Both run `AV_CODEC_FLAG_LOW_DELAY`; the host encodes zero-reorder streams (no B-frames, in-band +//! parameter sets on every IDR), so decode is strictly one-in/one-out. +//! +//! HDR is detected in-band from the decoded frame's transfer characteristic (`SMPTE2084` / PQ in the +//! HEVC VUI) — the same signal every other punktfunk client keys off — not from a protocol field. 
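The backend choice described in the module docs above (auto falls back, hardware errors out, software never probes the GPU) can be modelled as a small pure function. This is a simplified sketch under stated assumptions: `Pref`, `Chosen`, `pref_from_name` and `choose` are hypothetical stand-ins for `DecoderPref` and `Decoder::new`, and the closure stands in for D3D11VA initialisation.

```rust
/// Hypothetical mirror of the decoder preference; unknown names map to Auto.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Pref {
    Auto,
    Hardware,
    Software,
}

/// Which backend ended up selected (illustrative only).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Chosen {
    D3d11va,
    Software,
}

pub fn pref_from_name(s: &str) -> Pref {
    match s {
        "hardware" => Pref::Hardware,
        "software" => Pref::Software,
        _ => Pref::Auto,
    }
}

/// `try_hw` stands in for hardware decoder initialisation. It is only called
/// when the preference allows hardware decode at all.
pub fn choose(
    pref: Pref,
    try_hw: impl FnOnce() -> Result<(), String>,
) -> Result<Chosen, String> {
    if pref == Pref::Software {
        return Ok(Chosen::Software);
    }
    match try_hw() {
        Ok(()) => Ok(Chosen::D3d11va),
        // Forced hardware: surface the failure instead of silently falling back.
        Err(e) if pref == Pref::Hardware => {
            Err(format!("decoder=hardware but D3D11VA failed: {e}"))
        }
        // Auto: fall back to software decode.
        Err(_) => Ok(Chosen::Software),
    }
}
```

Passing the probe as a closure keeps the software path free of any GPU work, which matches the intent that a forced software session never touches D3D11VA.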
-use anyhow::{anyhow, Context as _, Result}; +use anyhow::{anyhow, bail, Context as _, Result}; use ffmpeg::format::Pixel; use ffmpeg::software::scaling; use ffmpeg::util::frame::Video as AvFrame; use ffmpeg_next as ffmpeg; +use std::ffi::c_void; +use std::ptr; +use windows::core::Interface; // ID3D11Device::clone().into_raw() for the FFmpeg hwdevice ctx + +/// Which decode backend to use; the Settings UI persists this as a string. +#[derive(Clone, Copy, PartialEq, Eq, Debug, Default)] +pub enum DecoderPref { + /// Try D3D11VA, fall back to software. + #[default] + Auto, + /// Force D3D11VA (error out if unavailable, for debugging). + Hardware, + /// Force software decode. + Software, +} + +impl DecoderPref { + pub fn from_name(s: &str) -> DecoderPref { + match s { + "hardware" => DecoderPref::Hardware, + "software" => DecoderPref::Software, + _ => DecoderPref::Auto, + } + } +} pub enum DecodedFrame { Cpu(CpuFrame), + Gpu(GpuFrame), } -/// Packed 4-byte-per-pixel frame for a D3D11 texture upload (which takes a row pitch). The bytes -/// are `R8G8B8A8` for SDR and `X2BGR10` (== DXGI `R10G10B10A2`, R in the low 10 bits) for HDR. +impl DecodedFrame { + pub fn dims(&self) -> (u32, u32) { + match self { + DecodedFrame::Cpu(c) => (c.width, c.height), + DecodedFrame::Gpu(g) => (g.width, g.height), + } + } + pub fn hdr(&self) -> bool { + match self { + DecodedFrame::Cpu(c) => c.hdr, + DecodedFrame::Gpu(g) => g.hdr, + } + } +} + +/// Packed 4-byte-per-pixel frame for a D3D11 dynamic-texture upload (which takes a row pitch). The +/// bytes are `R8G8B8A8` for SDR and `X2BGR10` (== DXGI `R10G10B10A2`, R in the low 10 bits) for HDR. pub struct CpuFrame { pub width: u32, pub height: u32, @@ -33,26 +82,101 @@ pub struct CpuFrame { pub hdr: bool, } +/// A decoded frame still on the GPU: a D3D11 texture **array** plus the slice index the decoder +/// wrote this frame into. 
The presenter creates per-plane shader-resource views over the slice and +/// converts YUV→RGB in a pixel shader. The underlying surface stays alive — and out of the decoder's +/// reuse pool — for exactly as long as `guard` (an `av_frame_clone` of the decoded frame) lives. +pub struct GpuFrame { + pub width: u32, + pub height: u32, + /// Texture-array slice this frame occupies (`AVFrame::data[1]`). + pub index: u32, + /// BT.2020 PQ HDR10 (P010, ST.2084) vs ordinary 8-bit BT.709 SDR (NV12). + pub hdr: bool, + /// 10-bit (P010, R16 planes) vs 8-bit (NV12, R8 planes) — kept for the first-frame log; the + /// present path keys colour/format off `hdr` (the host couples 10-bit ⟺ HDR). + pub ten_bit: bool, + guard: D3d11FrameGuard, +} + +impl GpuFrame { + /// The decoder's D3D11 texture array holding this frame's slice, borrowed from the live cloned + /// `AVFrame`. Construct the windows-rs interface on the thread that will use it (the presenter / + /// UI thread): COM interfaces are `!Send`, but the raw pointer is fine to carry across threads. + pub fn texture_ptr(&self) -> *mut c_void { + unsafe { (*self.guard.0).data[0] as *mut c_void } + } +} + +/// Owns a cloned decoded `AVFrame` (which refs the D3D11 surface in the decoder pool). Dropping it +/// releases the surface back for reuse. The clone is plain refcounted data; freeing it from the +/// presenter thread is fine. 
+pub struct D3d11FrameGuard(*mut ffmpeg::ffi::AVFrame); +unsafe impl Send for D3d11FrameGuard {} +impl Drop for D3d11FrameGuard { + fn drop(&mut self) { + unsafe { ffmpeg::ffi::av_frame_free(&mut self.0) }; + } +} + +enum Backend { + D3d11va(D3d11vaDecoder), + Software(SoftwareDecoder), +} + pub struct Decoder { - inner: SoftwareDecoder, + backend: Backend, } impl Decoder { - pub fn new() -> Result<Self> { + pub fn new(pref: DecoderPref) -> Result<Self> { ffmpeg::init().context("ffmpeg init")?; + if pref != DecoderPref::Software { + match D3d11vaDecoder::new() { + Ok(d) => { + tracing::info!("D3D11VA hardware decode active (zero-copy)"); + return Ok(Decoder { + backend: Backend::D3d11va(d), + }); + } + Err(e) => { + if pref == DecoderPref::Hardware { + return Err(e.context("decoder=hardware but D3D11VA failed")); + } + tracing::info!(reason = %e, "D3D11VA unavailable — software decode"); + } + } + } Ok(Decoder { - inner: SoftwareDecoder::new()?, + backend: Backend::Software(SoftwareDecoder::new()?), }) } - /// Feed one access unit; returns the decoded frame (the host's streams are - /// one-in/one-out). A decode error after packet loss is survivable — log upstream and - /// keep feeding; the host's IDR/RFI recovery resynchronizes on the next keyframe. + /// True for the zero-copy hardware backend (shown in the stream HUD). + pub fn is_hardware(&self) -> bool { + matches!(self.backend, Backend::D3d11va(_)) + } + + /// Feed one access unit; returns the decoded frame (the host's streams are one-in/one-out). A + /// software decode error after packet loss is survivable — keep feeding. A D3D11VA error demotes + /// to software for the rest of the session (the next IDR resynchronizes). 
pub fn decode(&mut self, au: &[u8]) -> Result<Option<DecodedFrame>> { - Ok(self.inner.decode(au)?.map(DecodedFrame::Cpu)) + match &mut self.backend { + Backend::D3d11va(d) => match d.decode(au) { + Ok(f) => Ok(f.map(DecodedFrame::Gpu)), + Err(e) => { + tracing::warn!(error = %e, "D3D11VA decode failed — falling back to software"); + self.backend = Backend::Software(SoftwareDecoder::new()?); + Ok(None) + } + }, + Backend::Software(s) => Ok(s.decode(au)?.map(DecodedFrame::Cpu)), + } } } +// --- software backend --------------------------------------------------------------- + struct SoftwareDecoder { decoder: ffmpeg::decoder::Video, /// Rebuilt whenever the decoded format/size **or output format** changes (mid-stream @@ -90,10 +214,9 @@ impl SoftwareDecoder { } /// Convert the decoded YUV frame to a packed 4-byte format the presenter uploads directly: - /// SDR → `RGBA` (BT.709), HDR (SMPTE ST.2084 / PQ transfer) → `X2BGR10` (10-bit, == DXGI - /// R10G10B10A2) using the BT.2020 matrix. For HDR the PQ-encoded values pass through unchanged - /// (swscale only applies the YUV→RGB matrix + range, never the transfer) — exactly what an - /// HDR10/ST.2084 swapchain wants. + /// SDR → `RGBA` (BT.709), HDR (SMPTE ST.2084 / PQ transfer) → `X2BGR10` (== DXGI R10G10B10A2) + /// using the BT.2020 matrix. For HDR the PQ-encoded values pass through unchanged (swscale only + /// applies the YUV→RGB matrix + range, never the transfer) — exactly what an HDR10 swapchain wants. fn convert(&mut self, frame: &AvFrame) -> Result { use ffmpeg::color::TransferCharacteristic; let (fmt, w, h) = (frame.format(), frame.width(), frame.height()); @@ -134,3 +257,255 @@ impl SoftwareDecoder { }) } } + +// --- D3D11VA backend ------------------------------------------------------------------ +// +// Raw FFI: ffmpeg-next has no hwaccel wrappers. 
The COM-typed hwcontext structs are declared here +// (stable FFmpeg public ABI) rather than relied on from ffmpeg-sys bindgen — the generic +// AVHWDeviceContext / AVHWFramesContext (whose payload is an opaque `void *hwctx`) come from +// ffmpeg-sys, and we cast `hwctx` to the structs below. All owned pointers are freed in Drop; +// decoded surfaces transfer out through D3d11FrameGuard. + +const AVERROR_EAGAIN: i32 = -11; // -EAGAIN +const D3D11_BIND_SHADER_RESOURCE: u32 = 0x8; // from d3d11.h; FFmpeg ORs D3D11_BIND_DECODER itself + +/// `hwcontext_d3d11va.h` — `AVHWDeviceContext::hwctx`. Leaving `lock` null makes FFmpeg install an +/// `ID3D11Multithread` default lock + set multithread protection on `device_context` during init, +/// which is what lets the presenter share this device's immediate context from the UI thread. +#[repr(C)] +struct AVD3D11VADeviceContext { + device: *mut c_void, // ID3D11Device* + device_context: *mut c_void, // ID3D11DeviceContext* + video_device: *mut c_void, // ID3D11VideoDevice* + video_context: *mut c_void, // ID3D11VideoContext* + lock: *mut c_void, // void (*)(void*) + unlock: *mut c_void, // void (*)(void*) + lock_ctx: *mut c_void, +} + +/// `hwcontext_d3d11va.h` — `AVHWFramesContext::hwctx`. `BindFlags` lets us add +/// `D3D11_BIND_SHADER_RESOURCE` so the decoded array texture is sampleable (zero copy). +#[repr(C)] +struct AVD3D11VAFramesContext { + texture: *mut c_void, // ID3D11Texture2D* (null → FFmpeg allocates the pool) + bind_flags: u32, // UINT BindFlags + misc_flags: u32, // UINT MiscFlags +} + +fn averr(what: &str, code: i32) -> anyhow::Error { + anyhow!("{what}: {}", ffmpeg::Error::from(code)) +} + +/// libavcodec's `get_format` callback: accept the D3D11 hw surface, building a frames context whose +/// textures carry `BIND_SHADER_RESOURCE` (so the presenter can sample them). Returning anything but +/// `AV_PIX_FMT_D3D11` aborts hardware decode → the session demotes to software. 
+unsafe extern "C" fn get_format_d3d11( + avctx: *mut ffmpeg::ffi::AVCodecContext, + mut list: *const ffmpeg::ffi::AVPixelFormat, +) -> ffmpeg::ffi::AVPixelFormat { + use ffmpeg::ffi::*; + unsafe { + let mut found = false; + while *list != AVPixelFormat::AV_PIX_FMT_NONE { + if *list == AVPixelFormat::AV_PIX_FMT_D3D11 { + found = true; + break; + } + list = list.add(1); + } + if !found { + return AVPixelFormat::AV_PIX_FMT_NONE; + } + let device_ref = (*avctx).hw_device_ctx; + if device_ref.is_null() { + return AVPixelFormat::AV_PIX_FMT_NONE; + } + let frames_ref = av_hwframe_ctx_alloc(device_ref); + if frames_ref.is_null() { + return AVPixelFormat::AV_PIX_FMT_NONE; + } + let frames = (*frames_ref).data as *mut AVHWFramesContext; + (*frames).format = AVPixelFormat::AV_PIX_FMT_D3D11; + let sw = if (*avctx).sw_pix_fmt != AVPixelFormat::AV_PIX_FMT_NONE { + (*avctx).sw_pix_fmt + } else { + AVPixelFormat::AV_PIX_FMT_NV12 + }; + (*frames).sw_format = sw; + (*frames).width = (*avctx).coded_width; + (*frames).height = (*avctx).coded_height; + // DPB + a few in-flight (decoded channel + the presenter's held frame); the host's + // zero-reorder stream needs only a small DPB, so 20 is comfortable headroom. + (*frames).initial_pool_size = 20; + let fhw = (*frames).hwctx as *mut AVD3D11VAFramesContext; + (*fhw).bind_flags = D3D11_BIND_SHADER_RESOURCE; + let r = av_hwframe_ctx_init(frames_ref); + if r < 0 { + let mut fr = frames_ref; + av_buffer_unref(&mut fr); + return AVPixelFormat::AV_PIX_FMT_NONE; + } + (*avctx).hw_frames_ctx = frames_ref; // decoder takes ownership + AVPixelFormat::AV_PIX_FMT_D3D11 + } +} + +struct D3d11vaDecoder { + ctx: *mut ffmpeg::ffi::AVCodecContext, + hw_device: *mut ffmpeg::ffi::AVBufferRef, + packet: *mut ffmpeg::ffi::AVPacket, + frame: *mut ffmpeg::ffi::AVFrame, +} + +// Single-owner pointers, only touched from the session pump thread. 
+unsafe impl Send for D3d11vaDecoder {}
+
+impl D3d11vaDecoder {
+    fn new() -> Result<Self> {
+        use ffmpeg::ffi;
+        let shared = crate::gpu::shared().ok_or_else(|| anyhow!("no shared D3D11 device"))?;
+        if !shared.hardware {
+            bail!("shared device is WARP (no hardware video decode)");
+        }
+        unsafe {
+            // Build a D3D11VA hwdevice context around the *shared* device, so decoded textures live
+            // on the same device the presenter samples + draws with.
+            let hw_device =
+                ffi::av_hwdevice_ctx_alloc(ffi::AVHWDeviceType::AV_HWDEVICE_TYPE_D3D11VA);
+            if hw_device.is_null() {
+                bail!("av_hwdevice_ctx_alloc(D3D11VA) failed");
+            }
+            let devctx = (*hw_device).data as *mut ffi::AVHWDeviceContext;
+            let d3dctx = (*devctx).hwctx as *mut AVD3D11VADeviceContext;
+            // Hand FFmpeg an owned ref to the device + immediate context (it Releases them when the
+            // hwdevice ctx is freed). `into_raw()` transfers a +1 ref without releasing.
+            (*d3dctx).device = shared.device.clone().into_raw();
+            (*d3dctx).device_context = shared.context.clone().into_raw();
+            // lock left null → FFmpeg installs the ID3D11Multithread default lock in init.
+            let r = ffi::av_hwdevice_ctx_init(hw_device);
+            if r < 0 {
+                let mut hw = hw_device;
+                ffi::av_buffer_unref(&mut hw);
+                bail!("av_hwdevice_ctx_init: {}", ffmpeg::Error::from(r));
+            }
+
+            let codec = ffi::avcodec_find_decoder(ffi::AVCodecID::AV_CODEC_ID_HEVC);
+            if codec.is_null() {
+                let mut hw = hw_device;
+                ffi::av_buffer_unref(&mut hw);
+                bail!("no HEVC decoder");
+            }
+            let ctx = ffi::avcodec_alloc_context3(codec);
+            (*ctx).hw_device_ctx = ffi::av_buffer_ref(hw_device);
+            (*ctx).get_format = Some(get_format_d3d11);
+            (*ctx).flags |= ffi::AV_CODEC_FLAG_LOW_DELAY as i32;
+            (*ctx).thread_count = 1; // hwaccel: threads only add latency
+            let r = ffi::avcodec_open2(ctx, codec, ptr::null_mut());
+            if r < 0 {
+                let mut ctx = ctx;
+                ffi::avcodec_free_context(&mut ctx);
+                let mut hw = hw_device;
+                ffi::av_buffer_unref(&mut hw);
+                bail!("avcodec_open2 (D3D11VA): {}", ffmpeg::Error::from(r));
+            }
+            Ok(D3d11vaDecoder {
+                ctx,
+                hw_device,
+                packet: ffi::av_packet_alloc(),
+                frame: ffi::av_frame_alloc(),
+            })
+        }
+    }
+
+    fn decode(&mut self, au: &[u8]) -> Result<Option<GpuFrame>> {
+        use ffmpeg::ffi;
+        unsafe {
+            let r = ffi::av_new_packet(self.packet, au.len() as i32);
+            if r < 0 {
+                return Err(averr("av_new_packet", r));
+            }
+            ptr::copy_nonoverlapping(au.as_ptr(), (*self.packet).data, au.len());
+            let r = ffi::avcodec_send_packet(self.ctx, self.packet);
+            ffi::av_packet_unref(self.packet);
+            if r < 0 {
+                return Err(averr("send_packet", r));
+            }
+            let mut out = None;
+            loop {
+                let r = ffi::avcodec_receive_frame(self.ctx, self.frame);
+                if r == AVERROR_EAGAIN {
+                    break;
+                }
+                if r < 0 {
+                    return Err(averr("receive_frame", r));
+                }
+                out = Some(self.lift()?); // newest wins; older guards drop here
+                ffi::av_frame_unref(self.frame);
+            }
+            Ok(out)
+        }
+    }
+
+    /// Lift the decoded D3D11 surface into a `GpuFrame`. `data[0]` is the texture array, `data[1]`
+    /// the slice index.
+    /// We `av_frame_clone` so the surface stays referenced (kept out of the reuse
+    /// pool) until the presenter drops the guard.
+    unsafe fn lift(&mut self) -> Result<GpuFrame> {
+        use ffmpeg::ffi;
+        unsafe {
+            if (*self.frame).format != ffi::AVPixelFormat::AV_PIX_FMT_D3D11 as i32 {
+                bail!("decoder returned a software frame (no D3D11 surface)");
+            }
+            let hdr =
+                (*self.frame).color_trc == ffi::AVColorTransferCharacteristic::AVCOL_TRC_SMPTE2084;
+            let ten_bit = {
+                let hwfc = (*self.frame).hw_frames_ctx;
+                !hwfc.is_null()
+                    && (*((*hwfc).data as *const ffi::AVHWFramesContext)).sw_format
+                        == ffi::AVPixelFormat::AV_PIX_FMT_P010LE
+            };
+            let cloned = ffi::av_frame_clone(self.frame);
+            if cloned.is_null() {
+                bail!("av_frame_clone failed");
+            }
+            let frame = GpuFrame {
+                width: (*self.frame).width as u32,
+                height: (*self.frame).height as u32,
+                index: (*self.frame).data[1] as usize as u32,
+                hdr,
+                ten_bit,
+                guard: D3d11FrameGuard(cloned),
+            };
+            log_layout_once(frame.width, frame.height, frame.index, hdr, ten_bit);
+            Ok(frame)
+        }
+    }
+}
+
+impl Drop for D3d11vaDecoder {
+    fn drop(&mut self) {
+        use ffmpeg::ffi;
+        unsafe {
+            ffi::av_packet_free(&mut self.packet);
+            ffi::av_frame_free(&mut self.frame);
+            ffi::avcodec_free_context(&mut self.ctx);
+            ffi::av_buffer_unref(&mut self.hw_device);
+        }
+    }
+}
+
+/// One-time dump of the first decoded surface's layout — so a new GPU/driver combination's real
+/// format (slice index range, HDR/bit-depth) is visible in the logs without a debugger.
+fn log_layout_once(width: u32, height: u32, index: u32, hdr: bool, ten_bit: bool) {
+    use std::sync::atomic::{AtomicBool, Ordering};
+    static ONCE: AtomicBool = AtomicBool::new(true);
+    if ONCE.swap(false, Ordering::Relaxed) {
+        tracing::info!(
+            width,
+            height,
+            slice = index,
+            hdr,
+            ten_bit,
+            "D3D11VA first frame (zero-copy)"
+        );
+    }
+}