diff --git a/design/README.md b/design/README.md
index dc97608..e1733db 100644
--- a/design/README.md
+++ b/design/README.md
@@ -24,6 +24,7 @@ holds the full originals.
 | [`multi-user-profiles.md`](multi-user-profiles.md) | Multi-user / profiles end to end: map a client to a real host OS user account (own isolated desktop), web-console config, per-profile passcode | **Design, schema-of-record** — not yet implemented |
 | [`host-latency-plan.md`](host-latency-plan.md) | Latency under GPU contention — 4-tier plan | **Partly shipped** — superseded by ↓; diagnostics + open tiers kept |
 | [`gpu-contention-investigation.md`](gpu-contention-investigation.md) | GPU-contention root-cause + ranked levers (supersedes ↑) | **Active plan** — §5.A shipped; §5.B/C/E/F/G open |
+| [`vrr-plan.md`](vrr-plan.md) | VRR over punktfunk/1 — skip the VRR virtual display (client panel follows stream cadence; virtual display Hz = sampling grid); frame-driven host loop + VFR rate control + present-on-arrival clients | **Design** — not yet implemented |
 | [`hdr-pipeline-plan.md`](hdr-pipeline-plan.md) | Glass-to-glass HDR | **Steps 0–3 shipped**; Step 4 (Linux) open |
 | [`windows-host-rewrite.md`](windows-host-rewrite.md) | **Windows host — the single architecture/status/reference doc** (validated invariants, ops, open work) | **Active reference** |
 | [`windows-build-and-packaging.md`](windows-build-and-packaging.md) | How the Windows host is built, signed, packaged (drivers-from-source, Inno, CI) | **Evergreen reference** |
@@ -53,6 +54,7 @@ owning doc.)
 - Sub-frame pipelining — overlap encode+transmit within a frame; needs a direct NVENC SDK wrapper (~2–4 ms). → `implementation-plan`, `gamestream-host-plan`
 - GPU-contention levers: correct async NVENC pipeline, auto-gated REALTIME GPU priority, clock/P-state pinning, frame-source escape (swapchain-hook/NvFBC/compose-flip), iGPU encode offload, PERF uniq-vs-fps instrumentation. → `gpu-contention-investigation` (§5.B/C/E/F/G), `host-latency-plan` (Tiers 1A/1B/3B/3C/3D/4)
 - Apple stage-2 as default (after resolution/HDR checks) + smoothing/pacing policy + glass-to-glass numbers via `tools/latency-probe`. → `apple-stage2-presenter`
+- VRR / frame-driven cadence: Stage A client present-on-arrival (Android/Linux → iOS → Windows), Stage B host frame-driven loop + VFR rate control + `FLAG_REPEAT`/`VIDEO_CAP_VRR`, Stage C gamescope `--adaptive-sync` headless experiment. → `vrr-plan`
 
 **HDR**
 - Linux 10-bit HDR (Step 4): 8-bit→Main10 shim, true 10-bit PipeWire capture (blocked upstream — gamescope #2126), Linux-client P010 + GTK color management. → `hdr-pipeline-plan`
diff --git a/design/vrr-plan.md b/design/vrr-plan.md
new file mode 100644
index 0000000..a2bd311
--- /dev/null
+++ b/design/vrr-plan.md
@@ -0,0 +1,158 @@
+# VRR over punktfunk/1 — design
+
+> **Status:** DESIGN — investigation complete (2026-07-03), nothing implemented. Key architectural
+> decision recorded here: **no VRR virtual display is needed** — client-side VRR is driven purely by
+> presentation cadence, so the host's virtual display Hz becomes a *sampling grid* decoupled from the
+> client panel. punktfunk/1-native only; GameStream/Moonlight stays fixed-cadence (stock clients).
+
+Goal: end-to-end variable refresh — the game's real frame pacing reaches the client's VRR panel
+instead of being resampled onto a fixed grid twice (host pacer, client vsync). Gains are both
+latency (the fixed-cadence quantization at capture and present is now the dominant remaining
+latency term — the Windows-client loopback p50 of ~18 ms is dominated by the 60 Hz virtual-display
+cadence while the wire is sub-millisecond) and smoothness (an 85 fps game on a 120 Hz grid presents
+as an irregular 8.3/16.7 ms alternation — judder baked in at the source that no client can undo).
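The 8.3/16.7 ms alternation quoted in the goal statement can be reproduced with a few lines of integer arithmetic. The sketch below is illustrative only and is not taken from the punktfunk sources: `present_tick` and `present_intervals` are hypothetical helper names, and the model assumes each game frame becomes visible at the first grid tick at or after it completes.

```rust
/// Grid tick at which frame `k` of a `fps` game becomes visible.
/// Assumption: frame `k` completes at `k / fps` seconds and is shown at the
/// first grid tick at or after that instant, i.e. ceil(k * grid_hz / fps).
fn present_tick(k: u64, fps: u64, grid_hz: u64) -> u64 {
    (k * grid_hz + fps - 1) / fps
}

/// Intervals between consecutive presented frames, measured in grid ticks.
/// Multiply by `1000 / grid_hz` to get milliseconds.
fn present_intervals(fps: u64, grid_hz: u64, frames: u64) -> Vec<u64> {
    (1..=frames)
        .map(|k| present_tick(k, fps, grid_hz) - present_tick(k - 1, fps, grid_hz))
        .collect()
}
```

For 85 fps on a 120 Hz grid, `present_intervals(85, 120, 17)` contains only 1-tick and 2-tick gaps (8.3 ms and 16.7 ms) in an irregular order, and the 17 frames span exactly 24 ticks, so the average rate is preserved while the per-frame pacing is not.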
+
+## The core decision: skip the VRR virtual display
+
+A VRR panel is not "driven at a framerate" by any API — it follows the presentation cadence. If the
+client presents each frame on arrival, the panel refreshes at the stream's cadence, whatever it is.
+So client VRR needs **frame-driven host emission + present-on-arrival clients**, and no VRR anywhere
+on the host display stack. This sidesteps two otherwise-hard blockers entirely:
+
+- **IddCx (Windows host) has no VRR support at all** (through 1.10, which pf-vdisplay is built
+  against): no VRR DDI, no VRR in the virtual EDID, and GPU control panels don't even list indirect
+  displays as VRR-capable. Not fixable by us; the community IDD projects' "can we fake it" issue is
+  open and unanswered.
+- **KWin/Mutter/wlroots virtual outputs are fixed-mode** (KWin hardcodes 60 Hz + out-of-band
+  `kscreen-doctor` custom modes, `vdisplay/linux/kwin.rs:101,138`; Mutter defaults 60 with the
+  `PUNKTFUNK_MUTTER_VIRTUAL_REFRESH` opt-in, `mutter.rs:244-258`; Sway takes one
+  `--custom WxH@Hz`, `wlroots.rs:93`).
+
+What a true-VRR virtual display *would* add is confined to the source end, exactly two residuals:
+(1) **sampling quantization → pacing wobble** — the game's output is sampled on the virtual
+display's fixed grid, and the game's true present times never reach the wire (our `pts_ns` is
+stamped at capture, already grid-aligned); (2) **up to one virtual-vblank of host latency** (a
+frame completed just after a composite waits for the next grid tick). Both scale with the grid:
+at 240 Hz the grid is 4.2 ms — pacing error ~±2 ms (below the ~4–5 ms perceptibility threshold)
+and ≤4.2 ms added latency. The high-Hz machinery already exists on every backend, and the Linux
+compositors composite on damage, so a 240 Hz virtual mode costs GPU work proportional to the game's
+actual fps, not 240 composites/s.
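To make "both scale with the grid" concrete, the two residuals can be written as functions of the grid rate. This is a minimal sketch under the assumptions stated in the paragraph above (pacing error of about half a grid period either way, added latency of at most one full period); `GridBounds` and `grid_bounds` are hypothetical names, not part of the codebase.

```rust
/// Worst-case effects of sampling the game's output on a fixed grid.
#[derive(Debug, Clone, Copy)]
struct GridBounds {
    /// Length of one grid tick in milliseconds.
    period_ms: f64,
    /// Pacing error magnitude (applies as plus or minus), assumed half a period.
    pacing_error_ms: f64,
    /// Longest wait for the next grid tick, assumed one full period.
    max_added_latency_ms: f64,
}

fn grid_bounds(grid_hz: f64) -> GridBounds {
    let period_ms = 1000.0 / grid_hz;
    GridBounds {
        period_ms,
        pacing_error_ms: period_ms / 2.0,
        max_added_latency_ms: period_ms,
    }
}
```

`grid_bounds(240.0)` gives a period of about 4.17 ms and a pacing error of about 2.08 ms, matching the figures quoted above, while `grid_bounds(60.0)` gives 16.7 ms and 8.3 ms, which is why a high grid rate is the cheap substitute for a true-VRR virtual display.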
+
+**Negotiation semantics shift**: today the client requests its native WxH@Hz and the mode's Hz means
+"the cadence you'll receive." Under client-VRR the virtual display Hz is the *sampling grid* (pick
+it high), while the client's panel VRR range governs presentation only.
+
+## Where the pieces stand (investigation findings)
+
+### Wire — already ~90 % ready
+
+- Every packet carries a wall-clock **capture** timestamp: `PacketHeader.pts_ns` is the first field
+  of the 40-byte header (`punktfunk-core/src/packet.rs:52-68`), threaded to `Frame.pts_ns` and ABI
+  `PunktfunkFrame.pts_ns`. Epoch = ns since UNIX epoch, stamped host-side via `SystemTime::now()`
+  (`punktfunk1.rs:100-105`). Plus a monotonic per-AU `frame_index`.
+- The clock-skew offset is **ABI-exposed**: `punktfunk_connection_clock_offset_ns`
+  (`abi.rs:2121-2137`; NTP-style min-RTT estimate, `quic.rs:417-426`). A client can convert host
+  capture time to its own clock — the raw material for a timestamp-scheduled presenter, and
+  something Moonlight fundamentally lacks (its "frame pacing" guesses; we have a measured offset).
+- **FEC, keepalives, and reorder are rate-agnostic**: FEC is self-describing per packet and adapts
+  on loss; QUIC keepalive is 4 s/8 s; the reassembler window is frame-count-based
+  (`REORDER_WINDOW = 16`, `packet.rs:47`). Nothing in the data plane divides by fps.
+
+Missing (all small): a `FLAG_REPEAT` (or `FLAG_NEW`) bit in the already-end-to-end
+`PacketHeader.user_flags` (free bits above `FLAG_PIC/EOF/SOF/PROBE`, `packet.rs:30-36` — no header
+size change); `VIDEO_CAP_VRR = 0x08` in `video_caps` (`quic.rs:107-116`) mirrored to the ABI
+constant with the lockstep assert (`abi.rs:856-864`); an append-only Hello/Welcome trailing field
+for the client's panel refresh range (the same trailing-byte back-compat pattern used 7×). One real
+caveat: **`Reconfigure`/`Reconfigured` are fixed-length, not tail-extensible** (decode requires
+exact lengths, `quic.rs:1029,1057`) — a mid-stream VRR toggle/range change needs a new typed
+control message, not a field append.
+
+### Host — fixed cadence is a consumer-loop choice, not a capture limitation
+
+Every capture producer is already push/event-driven: PipeWire delivers a buffer per composite on
+all Linux backends (damage-driven on kwin/mutter/wlroots — a static desktop produces *nothing*;
+gamescope pushes per output frame at its `-r` rate); the pf-vdisplay ring publishes one frame per
+DWM present and signals a frame-ready event, returning `E_PENDING` when DWM composed nothing
+(`swap_chain_processor.rs:306-333`). The fixed cadence is imposed entirely by the encode loops: the
+`next += 1/effective_hz` pacer (`punktfunk1.rs:3336,3398-3401,3606`; GameStream analogue
+`gamestream/stream.rs:805-808`) re-samples via `try_latest()` and **re-encodes the last frame as a
+synthetic repeat** when nothing new arrived (`punktfunk1.rs:3169-3179`) — repeats go on the wire
+indistinguishable from new frames (the `repeat` bool is host-internal stats only).
+
+Smallest cadence change: block on the existing `next_frame()` (Linux `recv_timeout`, IDD
+`WaitForSingleObject` on the frame-ready event) and submit one encode per delivered frame, keeping
+an **idle-timeout repeat** so a damage-idle desktop still emits keepalive frames. The wire PTS is
+already wall-clock, so timestamps survive unchanged.
+
+The load-bearing fixed-fps assumption is **rate control**: both encoder paths run CBR with a
+~1-frame VBV sized `bitrate/fps` and feed `frame_idx` as the encoder PTS
+(`encode/linux/mod.rs:280-297` — `time_base(1/fps)`, VBV `bitrate/fps × PUNKTFUNK_VBV_FRAMES`
+default 1; `encode/windows/nvenc.rs:663-672,787-788` — `frameRateNum = fps`, VBV `bitrate/fps`;
+PTS = `frame_idx` at `nvenc.rs:1189-1206` / `mod.rs:167,535`). Variable intervals won't corrupt
+ordering, but a game at 85 fps in a "240 Hz-grid" session drastically undershoots the bitrate
+target and bursts fight the 1-frame VBV. VFR needs: feed the real capture PTS to the encoder
+timeline, and either budget `frameRate` at the *expected* rate with a laxer VBV or move that path
+to VBR/CQ. This is the one real technical knot.
+
+### Clients — all four are vsync-locked newest-wins today
+
+No client has any tearing/VRR/present-immediate path; `clock_offset_ns` is used only for the
+latency HUD. Queue depth is 1–2 slots newest-wins everywhere; no de-jitter buffer anywhere.
+
+| Client | Today | Present-on-arrival path |
+|---|---|---|
+| Android | `releaseOutputBuffer(render=true)` immediately on the newest drained buffer (`native/src/decode.rs:274-334`); `setFrameRate` fixed hint (`decode.rs:100`) | **Closest** — present already arrival-driven; switch to the frame-rate change-strategy / seamless APIs so a VRR panel follows |
+| Linux | `set_paintable` on frame arrival; GTK/compositor frame clock scans out (`ui_stream.rs:475-588`) | Arrival side done; needs compositor VRR (GNOME/KDE enable VRR for fullscreen apps — the fullscreen `GtkGraphicsOffload` dmabuf direct-scanout path is exactly the eligible case) |
+| Apple | Main-runloop `CADisplayLink` at fixed display/stream cadence + 1-slot `ReadyRing` (`SessionPresenter.swift:69-76`, `Stage2Pipeline.swift:15-37`); macOS `displaySyncEnabled=false` is *not* tearing — WindowServer still composites at vsync (`MetalVideoPresenter.swift:193-200`) | iOS/iPadOS ProMotion: wide `CAFrameRateRange` + drive render from the decode callback instead of the link. macOS: WindowServer-limited (Moonlight reports VRR-follows-stream fullscreen only) |
+| Windows | Render thread waits the swapchain latency waitable (DWM vblank cadence) then `Present(1)`; no `ALLOW_TEARING` anywhere (`render.rs:157-225`, `present.rs:161-173,540`) | **Hardest** — the composition `SwapChainPanel` swapchain can't tear/independent-flip. Plausible route: arrival-driven presents through DWM's windowed-VRR (windowed G-Sync/FreeSync — DWM composes on demand, panel follows); needs on-glass validation, else a fullscreen HWND swapchain mode |
+
+### Client pacing policy: scheduled present, not raw arrival
+
+Raw present-on-arrival replays network+encode jitter onto the panel. Better: present at
+`pts_ns + clock_offset_ns + D` for a small constant `D` — the shared clock absorbs jitter and
+reproduces the *host-side* cadence exactly (still grid-quantized at the source; see residual (1)).
+`D` is a smoothness-vs-latency knob; on LAN it can be near zero. All the data for this is already
+on the wire today.
+
+## Staging
+
+1. **Stage A — client-only, no protocol change.** Timestamp-scheduled / present-on-arrival on
+   VRR-capable displays. Order: Android + Linux (architecturally ready) → iOS ProMotion → Windows
+   (DWM windowed-VRR validation) → macOS fullscreen. Biggest single latency win: removes avg ½ /
+   worst 1 client refresh (~8/16.7 ms at 60 Hz, halved at 120).
+2. **Stage B — host, native path only.** Frame-driven consumer loop + idle-repeat keepalive;
+   real-PTS encoder timeline + VFR-tolerant rate control; `FLAG_REPEAT` on the wire;
+   `VIDEO_CAP_VRR` + panel-range negotiation; grid-Hz mode semantics. Kills the capture-side
+   quantization down to the grid and stops burning encode on synthetic repeats.
+3. **Stage C — optional gamescope experiment.** gamescope has `--adaptive-sync` and it works even
+   nested per upstream #1694; we don't pass it in the headless spawn
+   (`vdisplay/linux/gamescope.rs:975-980`), and whether the *headless* backend honors it is
+   unverified (untestable until the dev VM's GPU passthrough returns). If it works, it removes even
+   the sampling grid on the path that matters most for gaming, at near-zero implementation cost.
+   An optimization, not the architecture. KWin/Mutter/wlroots/IddCx true VRR: upstream-blocked,
+   do not pursue.
+
+## Open questions / risks
+
+- **VFR rate control per encoder**: exact NVENC/VAAPI/AMF-QSV recipe (real-timestamp `time_base`
+  vs max-rate + enlarged VBV vs VBR/CQ); interaction with the 1-frame-VBV latency property we rely
+  on. The main Stage-B risk item.
+- **Does gamescope headless honor `--adaptive-sync`?** (Stage C gate; needs the GPU back.)
+- **DWM windowed VRR with a composition swapchain**: does arrival-cadence presenting through the
+  XAML `SwapChainPanel` actually drive a G-Sync/FreeSync panel variably? On-glass validation gates
+  the Windows-client stage-A entry.
+- **Panel VRR floor / LFC**: the idle-keepalive repeat cadence sets the stream's minimum rate; if
+  it sits below a panel's ~48 Hz floor the client compositor/driver's LFC handles doubling —
+  verify, and don't park the keepalive interval right at a floor boundary.
+- **Android**: seamless (`CHANGE_FRAME_RATE_ONLY_IF_SEAMLESS`) vs non-seamless switch strategy,
+  and real-device VRR panel coverage.
+- **Hello semantics**: how a VRR-capable client picks the grid Hz to request (host advertises its
+  max grid? client just asks 240 and the host clamps like today's mode ladder?).
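For the panel-floor question in the list above, a simplified model of low-framerate compensation helps when choosing a keepalive cadence. The sketch below is an assumption-laden illustration, not any driver's or compositor's actual algorithm: it picks the smallest whole-number repeat factor that lifts the stream rate into a panel's VRR window, and `lfc_multiplier` is a hypothetical name.

```rust
/// Smallest whole-number repeat factor `n` such that `stream_hz * n` falls
/// inside the panel's VRR window `[min_hz, max_hz]`.
/// Returns `None` when no such factor exists (simplified model; real LFC
/// implementations may behave differently).
fn lfc_multiplier(stream_hz: f64, min_hz: f64, max_hz: f64) -> Option<u32> {
    if stream_hz <= 0.0 || min_hz > max_hz {
        return None;
    }
    // Smallest n with stream_hz * n >= min_hz, but never less than 1.
    let n = (min_hz / stream_hz).ceil().max(1.0);
    if stream_hz * n <= max_hz {
        Some(n as u32)
    } else {
        None
    }
}
```

Under this model a 20 Hz keepalive on a 48 to 144 Hz panel is tripled to 60 Hz, while a 47 Hz stream on a narrow 48 to 60 Hz panel has no valid factor, which illustrates why a keepalive interval sitting just under a floor is the awkward case to avoid.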
+
+## External evidence (2026-07-03)
+
+- gamescope `--adaptive-sync` works in nested mode: [ValveSoftware/gamescope#1694](https://github.com/ValveSoftware/gamescope/issues/1694)
+- IddCx has no VRR path; community "can we fake it" open/unanswered: [Virtual-Display-Driver#24](https://github.com/itsmikethetech/Virtual-Display-Driver/issues/24), [IddCx DDI index](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/)
+- Client VRR panels do follow Moonlight's stream cadence in practice (and it's messy — our shared
+  clock is the differentiator): [moonlight-qt#1545](https://github.com/moonlight-stream/moonlight-qt/issues/1545), macOS fullscreen-only [moonlight-qt#1509](https://github.com/moonlight-stream/moonlight-qt/issues/1509)
+- Mutter `RecordVirtual` derives refresh from PipeWire; VRR only on real monitors: [mutter!1154](https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1154)