punktfunk/crates/punktfunk-host/src/wgc_helper.rs
enricobuehler e2c9bfd3d9
feat(windows): pf-vdisplay IDD-push — HDR + pipelined zero-copy capture
HDR (display-driven, matching the WGC path):
- CTA-861.3 HDR EDID (BT.2020 primaries + HDR Static Metadata block) so Windows
  offers "Use HDR" on the virtual display. The host FOLLOWS the display's live
  advanced-color state, recreating the shared ring at the matching format
  (FP16 in HDR / BGRA in SDR) on a toggle — no freeze.
- Always emit Main10/BT.2020-PQ Rgb10a2 while the display is HDR; the client
  auto-detects PQ from the HEVC VUI (clients under-report VIDEO_CAP_10BIT).
  Generic HDR10 mastering SEI on every IDR.
- Generation-tagged `latest` (gen<<40|seq<<8|slot) + driver `is_stale` re-attach
  kill the toggle-time garbage frame and any stale-ring read.
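
The generation tag in the last bullet can be sketched as below. This is a minimal illustration, not the driver's code: the field widths (8-bit slot, 32-bit sequence, 24-bit generation) are inferred from the shifts `gen<<40|seq<<8|slot`, and `pack_latest`, `unpack_latest` and `tag_is_stale` are hypothetical names.

```rust
/// Assumed field widths, inferred from the shifts `gen<<40 | seq<<8 | slot`.
const SLOT_BITS: u32 = 8;
const SEQ_BITS: u32 = 32;
const GEN_SHIFT: u32 = SLOT_BITS + SEQ_BITS; // 40
const GEN_MASK: u64 = (1 << (64 - GEN_SHIFT)) - 1; // 24 bits remain for the generation

/// Pack (generation, sequence, slot) into one u64 that can be published with a single store.
fn pack_latest(generation: u32, seq: u32, slot: u8) -> u64 {
    ((generation as u64 & GEN_MASK) << GEN_SHIFT) | ((seq as u64) << SLOT_BITS) | slot as u64
}

/// Split a published word back into (generation, sequence, slot).
fn unpack_latest(word: u64) -> (u32, u32, u8) {
    let generation = (word >> GEN_SHIFT) as u32;
    let seq = (word >> SLOT_BITS) as u32; // the cast keeps bits 8..40
    let slot = word as u8; // the cast keeps bits 0..8
    (generation, seq, slot)
}

/// A reader attached to ring generation `attached_gen` treats any other generation as stale.
fn tag_is_stale(word: u64, attached_gen: u32) -> bool {
    unpack_latest(word).0 != attached_gen
}
```

Under these assumptions, a reader attached at generation 3 would reject a word published by generation 4 instead of reading a slot from the wrong ring.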

Perf:
- Pipeline the encode loop (Capturer::pipeline_depth; IDD-push = 2): submit N+1
  before polling N so the convert/copy on the 3D engine overlaps the NVENC encode
  of N on the ASIC. PUNKTFUNK_IDD_DEPTH overrides (1 = synchronous).
- Rotating host output ring (OUT_RING) so the in-flight encode and the next
  convert never touch the same texture.
- HDR converts directly from the keyed-mutex slot's SRV into the output ring
  (drops the redundant slot->fp16 scratch copy); SDR copies the BGRA slot in.
  The slot mutex is held only across the convert/copy, not the encode.
  RING_LEN 3->6 for publish headroom.
- Capture-health diagnostic: new_fps vs repeat_fps under PUNKTFUNK_PERF (a low
  new_fps at a high send rate means the source isn't compositing, not an encode
  stall).
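
The pipelining in the first Perf bullet can be sketched as a loop that keeps up to `depth` submissions in flight. This is a toy model under stated assumptions: `PipelinedEncoder`, `FakeEncoder` and `run_pipelined` are hypothetical stand-ins, not the capturer or encoder types in the repository, and frames are plain integers.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an asynchronous encoder: `submit` queues work, `poll` waits for
/// the oldest queued frame and returns it.
trait PipelinedEncoder {
    fn submit(&mut self, frame: u64);
    fn poll(&mut self) -> u64;
}

/// Toy encoder that only records the call order, so the sketch can be run and checked.
struct FakeEncoder {
    queue: VecDeque<u64>,
    log: Vec<String>,
}

impl PipelinedEncoder for FakeEncoder {
    fn submit(&mut self, frame: u64) {
        self.log.push(format!("submit {}", frame));
        self.queue.push_back(frame);
    }
    fn poll(&mut self) -> u64 {
        let frame = self.queue.pop_front().expect("poll without a pending submit");
        self.log.push(format!("poll {}", frame));
        frame
    }
}

/// Encode `frames`, keeping at most `depth` submissions in flight. Outputs come back in order.
fn run_pipelined<E: PipelinedEncoder>(enc: &mut E, frames: &[u64], depth: usize) -> Vec<u64> {
    let depth = depth.max(1);
    let mut in_flight = 0usize;
    let mut out = Vec::with_capacity(frames.len());
    for &frame in frames {
        enc.submit(frame);
        in_flight += 1;
        // Wait for the oldest frame only once the pipeline is full.
        if in_flight >= depth {
            out.push(enc.poll());
            in_flight -= 1;
        }
    }
    // Drain whatever is still queued after the last submit.
    while in_flight > 0 {
        out.push(enc.poll());
        in_flight -= 1;
    }
    out
}
```

With `depth = 2` the call order is submit 0, submit 1, poll 0, submit 2, poll 1, poll 2. With `depth = 1` every submit is followed at once by its poll, which corresponds to the synchronous mode.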

Validated live on the RTX box: 5120x1440@240 HDR streams; driver composes
~180 new fps, encode 240 fps @ ~4.3 ms p50.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 00:39:28 +02:00


//! USER-session WGC helper (Windows) — part of the two-process secure-desktop design
//! (docs/windows-secure-desktop.md).
//!
//! WGC won't activate under the SYSTEM account, but the host must run as SYSTEM for the secure
//! desktop. So the SYSTEM host spawns THIS helper in the interactive user session
//! (`CreateProcessAsUserW`) to do the WGC capture + NVENC encode that needs the user token, and the
//! helper ships the encoded Annex-B access units back over its **stdout** pipe (which the host
//! inherits + reads). The host relays them on the live QUIC session while the normal desktop is up,
//! and switches to its own DDA encoder on the secure desktop. The helper captures the SAME SudoVDA
//! output **by GDI name only** — it never creates a virtual output / touches display topology (a
//! second topology owner would re-trigger the ACCESS_LOST born-lost storm).
//!
//! Wire framing on stdout, per AU:
//! `[u32 AU_MAGIC LE][u32 len LE][u64 pts_ns LE][u8 keyframe][len bytes data]`.
use crate::capture::{dxgi::WinCaptureTarget, wgc::WgcCapturer, Capturer};
use crate::encode::{self, Codec};
use anyhow::{Context, Result};
use std::io::{Read, Write};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
pub struct HelperOptions {
    pub target_id: u32,
    pub gdi_name: String,
    pub width: u32,
    pub height: u32,
    pub fps: u32,
    pub bitrate_kbps: u32,
    /// Negotiated encode bit depth (8, or 10 = HEVC Main10). HDR auto-upgrades to 10 from the
    /// captured frame's `Rgb10a2` format regardless.
    pub bit_depth: u8,
}
/// AU framing magic, written first in every AU header so the host can resync / detect a helper
/// crash on its stdout stream.
const AU_MAGIC: u32 = 0x5046_4155; // "PFAU" read as a big-endian value; LE on the wire is "UAFP"
/// Control byte the host writes on our stdin to force the next frame to be an IDR. Must match
/// `wgc_relay::CTL_KEYFRAME`.
const CTL_KEYFRAME: u8 = 0x01;
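// Sketch (not part of the original file): both ends of the one-byte control channel, written
// against generic `Read`/`Write` so it can be exercised without a real pipe. The module and
// function names are hypothetical; the host's actual sender lives in `wgc_relay` and is not
// shown here.
#[allow(dead_code)]
mod ctl_sketch {
    use std::io::{self, Read, Write};
    use std::sync::atomic::{AtomicBool, Ordering};

    /// Same value as `CTL_KEYFRAME` above, repeated so the sketch is self-contained.
    const KEYFRAME: u8 = 0x01;

    /// Host side: ask the helper for an IDR by writing one byte to the child's stdin.
    pub fn send_keyframe_request(child_stdin: &mut impl Write) -> io::Result<()> {
        child_stdin.write_all(&[KEYFRAME])?;
        child_stdin.flush()
    }

    /// Helper side: consume control bytes until EOF or a read error, raising `flag` for each
    /// keyframe request. Unknown bytes are ignored, as in the reader thread in `run`.
    pub fn drain_ctl(mut ctl: impl Read, flag: &AtomicBool) {
        let mut byte = [0u8; 1];
        while let Ok(n) = ctl.read(&mut byte) {
            if n == 0 {
                break; // writer closed the channel
            }
            if byte[0] == KEYFRAME {
                flag.store(true, Ordering::Relaxed);
            }
        }
    }
}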
pub fn run(opts: HelperOptions) -> Result<()> {
    tracing::info!(
        target_id = opts.target_id,
        gdi = %opts.gdi_name,
        mode = format!("{}x{}@{}", opts.width, opts.height, opts.fps),
        "WGC helper starting (user session)"
    );
    // This thread does WGC capture + video-processor convert + NVENC submit — the GPU-submitting hot
    // path. Elevate its OS priority so a CPU-heavy game can't deschedule it and delay submission (which
    // would leave our HIGH GPU priority with nothing queued to prioritise). Apollo's capture thread is
    // likewise CRITICAL.
    crate::punktfunk1::boost_thread_priority(true);
    // Capture the EXISTING SudoVDA output by GDI name / target id — do NOT create one (the host owns
    // the virtual output + its isolate/restore; a second topology owner breaks DDA recovery).
    let target = WinCaptureTarget {
        adapter_luid: 0,
        gdi_name: opts.gdi_name.clone(),
        target_id: opts.target_id,
    };
    let mut cap =
        WgcCapturer::open(target, Some((opts.width, opts.height, opts.fps))).context("WGC open")?;
    cap.set_active(true);
    // O3 present-trigger experiment: spawn a thread that PRESENTS a D3D swapchain to the virtual
    // display (a present SOURCE), testing whether that — unlike WGC's READ — makes the OS assign the
    // driver's IddCx swap-chain (so the driver's run_core runs + can push). Gated; diagnostic.
    if std::env::var_os("PUNKTFUNK_PRESENT_TRIGGER").is_some() {
        let (w, h) = (opts.width, opts.height);
        std::thread::Builder::new()
            .name("pf-present-trigger".into())
            .spawn(move || {
                tracing::info!("present-trigger: starting D3D present loop on the virtual display");
                if let Err(e) = unsafe { present_trigger(w, h) } {
                    tracing::warn!("present-trigger error: {e:#}");
                }
            })
            .ok();
    }
    // First frame establishes the real dimensions + whether the desktop is HDR (the encoder derives
    // Main10/HDR from the frame's PixelFormat::Rgb10a2). Then open NVENC on the capture device.
    let first = cap.next_frame().context("first WGC frame")?;
    let (w, h) = (first.width, first.height);
    let mut enc = encode::open_video(
        Codec::H265,
        first.format,
        w,
        h,
        opts.fps,
        opts.bitrate_kbps as u64 * 1000,
        false, // not cuda
        opts.bit_depth, // 8, or 10 = Main10 (HDR auto-upgrades from the Rgb10a2 frame regardless)
    )
    .context("open NVENC")?;
    // Control channel: the host writes a single byte on our stdin to force an IDR (client decode
    // recovery), mirroring `enc.request_keyframe()` in the single-process path. A reader thread sets
    // a flag the encode loop checks; stdin EOF (host gone) just stops the thread.
    let kf = Arc::new(AtomicBool::new(false));
    {
        let kf = kf.clone();
        std::thread::Builder::new()
            .name("wgc-helper-ctl".into())
            .spawn(move || {
                let mut stdin = std::io::stdin();
                let mut byte = [0u8; 1];
                while let Ok(n) = stdin.read(&mut byte) {
                    if n == 0 {
                        break; // host closed our stdin
                    }
                    if byte[0] == CTL_KEYFRAME {
                        kf.store(true, Ordering::Relaxed);
                    }
                }
            })
            .ok();
    }
    // Binary stdout — lock it once + write framed AUs. A short write / broken pipe means the host
    // (parent) went away → exit cleanly so the host's relaunch watchdog can respawn us.
    let stdout = std::io::stdout();
    let mut out = stdout.lock();
    // FIXED-CADENCE encode loop (mirrors the single-process `punktfunk1::virtual_stream` loop). The
    // host runs as SYSTEM and relays our AUs; to deliver a STEADY `fps` to the client (the "fixed 240"
    // goal) we must NOT gate on WGC's content-driven FrameArrived — `WgcCapturer::next_frame` blocks up
    // to its ~8 ms static-repeat timeout when the desktop is quiet, capping a barely-changing desktop
    // at ~125 fps regardless of the GPU. Instead we pace to `1/fps` and take the FRESHEST frame with the
    // non-blocking `try_latest`, repeating the last one when nothing newer arrived. Depth-1: NVENC's
    // `poll` (lock_bitstream) blocks until the just-submitted frame is encoded, so exactly one frame is
    // in flight per iteration. A deeper pipeline was measured to only stack latency under a
    // GPU-saturating game (the encodes serialize on the contended GPU anyway) — the in-game lever is
    // the GPU scheduling priority the SYSTEM host stamps on us, not pipeline depth.
    let interval = std::time::Duration::from_secs_f64(1.0 / opts.fps.max(1) as f64);
    let perf = std::env::var_os("PUNKTFUNK_PERF").is_some();
    let mut frames = 0u64;
    let mut repeats = 0u64; // frames where no newer capture had arrived (duplicate re-encode)
    let mut cap_ns = 0u64; // time in try_latest (capture + video-processor convert)
    let mut encode_ns = 0u64; // time blocked in lock_bitstream
    let mut write_ns = 0u64; // time writing the AU to the stdout pipe (relay backpressure)
    let mut window = std::time::Instant::now();
    // `frame` is held across iterations and repeated when `try_latest` has nothing newer, so a static
    // desktop still clocks `fps`. The capturer's held-set / output ring keep its texture alive across
    // the repeat; reassigning `frame` on a fresh capture drops the prior one (already drained by poll).
    let mut frame = first;
    let mut next = std::time::Instant::now();
    loop {
        if kf.swap(false, Ordering::Relaxed) {
            enc.request_keyframe();
        }
        // Freshest captured frame, or repeat the last (no new composition: static desktop / between a
        // game's presents). Non-blocking, so the cadence is OURS, not WGC's event rate.
        let t0 = std::time::Instant::now();
        match cap.try_latest().context("WGC try_latest")? {
            Some(f) => frame = f,
            None => repeats += 1,
        }
        if perf {
            cap_ns += t0.elapsed().as_nanos() as u64;
        }
        enc.submit(&frame).context("encoder submit")?;
        // Drain the just-submitted frame. NVENC's poll blocks in lock_bitstream until it's encoded, so
        // this returns exactly one AU (then None) — depth-1, no accumulation.
        loop {
            let p0 = std::time::Instant::now();
            let polled = enc.poll().context("encoder poll")?;
            if perf {
                encode_ns += p0.elapsed().as_nanos() as u64;
            }
            let Some(au) = polled else { break };
            let w0 = std::time::Instant::now();
            let wrote = write_au(&mut out, &au);
            if perf {
                write_ns += w0.elapsed().as_nanos() as u64;
            }
            if wrote.is_err() {
                tracing::info!("WGC helper: stdout closed (host gone) — exiting");
                return Ok(());
            }
        }
        // Pace to this frame's due time. If we're already past it (encode couldn't keep up under a
        // GPU-saturating game), skip the sleep and re-baseline so we don't spiral into catch-up.
        next += interval;
        match next.checked_duration_since(std::time::Instant::now()) {
            Some(d) => std::thread::sleep(d),
            None => next = std::time::Instant::now(),
        }
        if perf {
            frames += 1;
            let since = window.elapsed();
            if since.as_secs() >= 2 {
                let secs = since.as_secs_f64();
                let per = |ns: u64| format!("{:.2}", ns as f64 / frames as f64 / 1e6);
                tracing::info!(
                    fps = format!("{:.1}", frames as f64 / secs),
                    repeats,
                    cap_ms = per(cap_ns),
                    encode_ms = per(encode_ns),
                    write_ms = per(write_ns),
                    "WGC helper perf (fixed-cadence depth-1; encode_ms=lock_bitstream; repeats=duplicated frames)"
                );
                frames = 0;
                repeats = 0;
                cap_ns = 0;
                encode_ns = 0;
                write_ns = 0;
                window = std::time::Instant::now();
            }
        }
    }
}
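// Sketch (not part of the original file): the pacing step at the bottom of `run`'s loop, pulled
// out as a pure function so the re-baseline rule can be checked in isolation. `pacing_sketch`
// and `pace` are hypothetical names; `run` itself does this inline with `Instant::now()`.
#[allow(dead_code)]
mod pacing_sketch {
    use std::time::{Duration, Instant};

    /// Advance `deadline` by one `interval`. Returns the new deadline and how long to sleep.
    /// If `now` is already past the advanced deadline, re-baseline to `now` and do not sleep,
    /// so one late frame does not trigger a burst of catch-up frames.
    pub fn pace(
        deadline: Instant,
        interval: Duration,
        now: Instant,
    ) -> (Instant, Option<Duration>) {
        let due = deadline + interval;
        match due.checked_duration_since(now) {
            Some(sleep_for) => (due, Some(sleep_for)),
            None => (now, None),
        }
    }
}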
fn write_au(out: &mut impl Write, au: &encode::EncodedFrame) -> std::io::Result<()> {
    out.write_all(&AU_MAGIC.to_le_bytes())?;
    out.write_all(&(au.data.len() as u32).to_le_bytes())?;
    out.write_all(&au.pts_ns.to_le_bytes())?;
    out.write_all(&[au.keyframe as u8])?;
    out.write_all(&au.data)?;
    out.flush()
}
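// Sketch (not part of the original file): a reader for the framing `write_au` produces, i.e.
// magic, length, pts, keyframe flag, payload. It assumes the host parses the pipe with plain
// `std::io::Read`. `au_reader_sketch`, `ParsedAu`, `read_au` and the `MAX_AU_LEN` cap are
// hypothetical; the real parser lives on the host side and is not shown here.
#[allow(dead_code)]
mod au_reader_sketch {
    use std::io::{self, Read};

    /// Same value as `AU_MAGIC` above, repeated so the sketch is self-contained.
    const MAGIC: u32 = 0x5046_4155;
    /// Assumed sanity cap on one AU's payload, to reject a corrupt length field.
    const MAX_AU_LEN: usize = 64 << 20;

    pub struct ParsedAu {
        pub pts_ns: u64,
        pub keyframe: bool,
        pub data: Vec<u8>,
    }

    /// Read one AU. `Ok(None)` means the stream ended before a full header arrived.
    pub fn read_au(r: &mut impl Read) -> io::Result<Option<ParsedAu>> {
        let mut hdr = [0u8; 17]; // 4 magic + 4 len + 8 pts_ns + 1 keyframe
        match r.read_exact(&mut hdr) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
            Err(e) => return Err(e),
        }
        let magic = u32::from_le_bytes([hdr[0], hdr[1], hdr[2], hdr[3]]);
        if magic != MAGIC {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "bad AU magic"));
        }
        let len = u32::from_le_bytes([hdr[4], hdr[5], hdr[6], hdr[7]]) as usize;
        if len > MAX_AU_LEN {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "AU too large"));
        }
        let mut pts = [0u8; 8];
        pts.copy_from_slice(&hdr[8..16]);
        let mut data = vec![0u8; len];
        r.read_exact(&mut data)?; // EOF inside the payload is reported as an error
        Ok(Some(ParsedAu {
            pts_ns: u64::from_le_bytes(pts),
            keyframe: hdr[16] != 0,
            data,
        }))
    }
}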
/// O3 present-trigger experiment (see the gated call in `run`). Creates a small swapchain-backed
/// window on the virtual display (the CCD-isolated primary) and presents continuously — an active
/// present SOURCE on the display — to test whether that makes the OS assign the driver's IddCx
/// swap-chain (which WGC's read does not). Runs forever on its own thread.
///
/// # Safety
/// Win32/D3D11 FFI; called once on a dedicated helper thread.
unsafe fn present_trigger(disp_w: u32, disp_h: u32) -> Result<()> {
    use windows::core::{w, Interface};
    use windows::Win32::Foundation::{HMODULE, HWND, LPARAM, LRESULT, WPARAM};
    use windows::Win32::Graphics::Direct3D::D3D_DRIVER_TYPE_HARDWARE;
    use windows::Win32::Graphics::Direct3D11::{
        D3D11CreateDevice, ID3D11Device, ID3D11DeviceContext, ID3D11RenderTargetView,
        ID3D11Texture2D, D3D11_CREATE_DEVICE_BGRA_SUPPORT, D3D11_SDK_VERSION,
    };
    use windows::Win32::Graphics::Dxgi::Common::{DXGI_FORMAT_B8G8R8A8_UNORM, DXGI_SAMPLE_DESC};
    use windows::Win32::Graphics::Dxgi::{
        IDXGIAdapter, IDXGIDevice, IDXGIFactory2, DXGI_PRESENT, DXGI_SWAP_CHAIN_DESC1,
        DXGI_SWAP_EFFECT_FLIP_DISCARD, DXGI_USAGE_RENDER_TARGET_OUTPUT,
    };
    use windows::Win32::System::LibraryLoader::GetModuleHandleW;
    use windows::Win32::UI::WindowsAndMessaging::{
        CreateWindowExW, DefWindowProcW, DispatchMessageW, PeekMessageW, RegisterClassW,
        ShowWindow, MSG, PM_REMOVE, SW_SHOWNOACTIVATE, WNDCLASSW, WS_EX_NOACTIVATE, WS_EX_TOPMOST,
        WS_POPUP, WS_VISIBLE,
    };
    unsafe extern "system" fn wndproc(h: HWND, m: u32, wp: WPARAM, lp: LPARAM) -> LRESULT {
        DefWindowProcW(h, m, wp, lp)
    }
    let hinst: HMODULE = GetModuleHandleW(None)?;
    let cls = w!("pfPresentTrigger");
    let wc = WNDCLASSW {
        lpfnWndProc: Some(wndproc),
        hInstance: hinst.into(),
        lpszClassName: cls,
        ..Default::default()
    };
    RegisterClassW(&wc);
    // Small window at the top-left of the (primary = virtual) display so it barely obscures the
    // captured desktop; topmost + no-activate so it doesn't steal focus.
    let win_w = disp_w.min(96) as i32;
    let win_h = disp_h.min(96) as i32;
    let hwnd: HWND = CreateWindowExW(
        WS_EX_TOPMOST | WS_EX_NOACTIVATE,
        cls,
        w!("pf-present"),
        WS_POPUP | WS_VISIBLE,
        0,
        0,
        win_w,
        win_h,
        None,
        None,
        Some(hinst.into()),
        None,
    )?;
    let _ = ShowWindow(hwnd, SW_SHOWNOACTIVATE);
    let mut device: Option<ID3D11Device> = None;
    let mut context: Option<ID3D11DeviceContext> = None;
    D3D11CreateDevice(
        None,
        D3D_DRIVER_TYPE_HARDWARE,
        HMODULE::default(),
        D3D11_CREATE_DEVICE_BGRA_SUPPORT,
        None,
        D3D11_SDK_VERSION,
        Some(&mut device),
        None,
        Some(&mut context),
    )?;
    let device = device.context("present-trigger d3d11 device")?;
    let context = context.context("present-trigger d3d11 context")?;
    let dxgi_dev: IDXGIDevice = device.cast()?;
    let adapter: IDXGIAdapter = dxgi_dev.GetAdapter()?;
    let factory: IDXGIFactory2 = adapter.GetParent()?;
    let scd = DXGI_SWAP_CHAIN_DESC1 {
        Width: win_w as u32,
        Height: win_h as u32,
        Format: DXGI_FORMAT_B8G8R8A8_UNORM,
        SampleDesc: DXGI_SAMPLE_DESC {
            Count: 1,
            Quality: 0,
        },
        BufferUsage: DXGI_USAGE_RENDER_TARGET_OUTPUT,
        BufferCount: 2,
        SwapEffect: DXGI_SWAP_EFFECT_FLIP_DISCARD,
        ..Default::default()
    };
    let swapchain = factory.CreateSwapChainForHwnd(&device, hwnd, &scd, None, None)?;
    tracing::info!("present-trigger: swapchain created on the virtual display; presenting");
    let mut frame = 0u32;
    loop {
        let mut msg = MSG::default();
        while PeekMessageW(&mut msg, None, 0, 0, PM_REMOVE).as_bool() {
            let _ = DispatchMessageW(&msg);
        }
        let back: ID3D11Texture2D = swapchain.GetBuffer(0)?;
        let mut rtv: Option<ID3D11RenderTargetView> = None;
        device.CreateRenderTargetView(&back, None, Some(&mut rtv))?;
        let rtv = rtv.context("present-trigger rtv")?;
        let c = (frame % 120) as f32 / 120.0;
        context.ClearRenderTargetView(&rtv, &[c, 0.1, 0.2, 1.0]);
        let _ = swapchain.Present(1, DXGI_PRESENT(0));
        frame = frame.wrapping_add(1);
    }
}