design(host): add vrr design doc

2026-07-03 19:22:19 +00:00
parent 3f33ed30ae
commit 3039626b87
2 changed files with 160 additions and 0 deletions
@@ -24,6 +24,7 @@ holds the full originals.
 | [`multi-user-profiles.md`](multi-user-profiles.md) | Multi-user / profiles end to end: map a client to a real host OS user account (own isolated desktop), web-console config, per-profile passcode | **Design, schema-of-record** — not yet implemented |
 | [`host-latency-plan.md`](host-latency-plan.md) | Latency under GPU contention — 4-tier plan | **Partly shipped** — superseded by ↓; diagnostics + open tiers kept |
 | [`gpu-contention-investigation.md`](gpu-contention-investigation.md) | GPU-contention root-cause + ranked levers (supersedes ↑) | **Active plan** — §5.A shipped; §5.B/C/E/F/G open |
 | [`vrr-plan.md`](vrr-plan.md) | VRR over punktfunk/1 — skip the VRR virtual display (client panel follows stream cadence; virtual display Hz = sampling grid); frame-driven host loop + VFR rate control + present-on-arrival clients | **Design** — not yet implemented |
 | [`hdr-pipeline-plan.md`](hdr-pipeline-plan.md) | Glass-to-glass HDR | **Steps 0–3 shipped**; Step 4 (Linux) open |
 | [`windows-host-rewrite.md`](windows-host-rewrite.md) | **Windows host — the single architecture/status/reference doc** (validated invariants, ops, open work) | **Active reference** |
 | [`windows-build-and-packaging.md`](windows-build-and-packaging.md) | How the Windows host is built, signed, packaged (drivers-from-source, Inno, CI) | **Evergreen reference** |
@@ -53,6 +54,7 @@ owning doc.)
 - Sub-frame pipelining — overlap encode+transmit within a frame; needs a direct NVENC SDK wrapper (~2–4 ms). → `implementation-plan`, `gamestream-host-plan`
 - GPU-contention levers: correct async NVENC pipeline, auto-gated REALTIME GPU priority, clock/P-state pinning, frame-source escape (swapchain-hook/NvFBC/compose-flip), iGPU encode offload, PERF uniq-vs-fps instrumentation. → `gpu-contention-investigation` (§5.B/C/E/F/G), `host-latency-plan` (Tiers 1A/1B/3B/3C/3D/4)
 - Apple stage-2 as default (after resolution/HDR checks) + smoothing/pacing policy + glass-to-glass numbers via `tools/latency-probe`. → `apple-stage2-presenter`
 - VRR / frame-driven cadence: Stage A client present-on-arrival (Android/Linux → iOS → Windows), Stage B host frame-driven loop + VFR rate control + `FLAG_REPEAT`/`VIDEO_CAP_VRR`, Stage C gamescope `--adaptive-sync` headless experiment. → `vrr-plan`
 **HDR**
 - Linux 10-bit HDR (Step 4): 8-bit→Main10 shim, true 10-bit PipeWire capture (blocked upstream — gamescope #2126), Linux-client P010 + GTK color management. → `hdr-pipeline-plan`
@@ -0,0 +1,158 @@
 # VRR over punktfunk/1 — design
 > **Status:** DESIGN — investigation complete (2026-07-03), nothing implemented. Key architectural
 > decision recorded here: **no VRR virtual display is needed** — client-side VRR is driven purely by
 > presentation cadence, so the host's virtual display Hz becomes a *sampling grid* decoupled from the
 > client panel. punktfunk/1-native only; GameStream/Moonlight stays fixed-cadence (stock clients).
 Goal: end-to-end variable refresh — the game's real frame pacing reaches the client's VRR panel
 instead of being resampled onto a fixed grid twice (host pacer, client vsync). Gains are both
 latency (the fixed-cadence quantization at capture and present is now the dominant remaining
 latency term — the Windows-client loopback p50 of ~18 ms is dominated by the 60 Hz virtual-display
 cadence while the wire is sub-millisecond) and smoothness (an 85 fps game on a 120 Hz grid presents
 as an irregular 8.3/16.7 ms alternation — judder baked in at the source that no client can undo).
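
The 8.3/16.7 ms alternation can be reproduced with a few lines of integer arithmetic. This is a minimal sketch, assuming each game frame is picked up at the first grid tick at or after it completes; `grid_intervals_ticks` and `ticks_to_ms` are illustrative names, not functions from the codebase.

```rust
/// Display intervals, in grid ticks, for `frames` consecutive game frames
/// sampled onto a fixed grid. Frame k completes at k / game_fps seconds and
/// is assumed to be picked up at the first grid tick at or after that instant.
fn grid_intervals_ticks(game_fps: u64, grid_hz: u64, frames: u64) -> Vec<u64> {
    // ceil(k * grid_hz / game_fps) in integer math, so no float rounding.
    let tick_of = |k: u64| (k * grid_hz + game_fps - 1) / game_fps;
    (0..frames).map(|k| tick_of(k + 1) - tick_of(k)).collect()
}

/// Converts a tick count on the grid to milliseconds.
fn ticks_to_ms(ticks: u64, grid_hz: u64) -> f64 {
    ticks as f64 * 1000.0 / grid_hz as f64
}
```

For `grid_intervals_ticks(85, 120, 85)` every interval is either 1 or 2 ticks, i.e. about 8.3 ms or 16.7 ms, even though the game's true frame time is a steady ~11.8 ms.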
 ## The core decision: skip the VRR virtual display
 A VRR panel is not "driven at a framerate" by any API — it follows the presentation cadence. If the
 client presents each frame on arrival, the panel refreshes at the stream's cadence, whatever it is.
 So client VRR needs **frame-driven host emission + present-on-arrival clients**, and no VRR anywhere
 on the host display stack. This sidesteps two otherwise-hard blockers entirely:
 - **IddCx (Windows host) has no VRR support at all** (through 1.10, which pf-vdisplay is built
  against): no VRR DDI, no VRR in the virtual EDID, and GPU control panels don't even list indirect
  displays as VRR-capable. Not fixable by us; the community IDD projects' "can we fake it" issue is
  open and unanswered.
 - **KWin/Mutter/wlroots virtual outputs are fixed-mode** (KWin hardcodes 60 Hz + out-of-band
  `kscreen-doctor` custom modes, `vdisplay/linux/kwin.rs:101,138`; Mutter defaults to 60 Hz with the
  `PUNKTFUNK_MUTTER_VIRTUAL_REFRESH` opt-in, `mutter.rs:244-258`; Sway takes one
  `--custom WxH@Hz`, `wlroots.rs:93`).
 What a true-VRR virtual display *would* add is confined to the source end and amounts to exactly two residuals:
 (1) **sampling quantization → pacing wobble** — the game's output is sampled on the virtual
 display's fixed grid, and the game's true present times never reach the wire (our `pts_ns` is
 stamped at capture, already grid-aligned); (2) **up to one virtual-vblank of host latency** (a
 frame completed just after a composite waits for the next grid tick). Both scale with the grid:
 at 240 Hz the grid is 4.2 ms — pacing error ~±2 ms (below the ~4–5 ms perceptibility threshold)
 and ≤4.2 ms added latency. The high-Hz machinery already exists on every backend, and the Linux
 compositors composite on damage, so a 240 Hz virtual mode costs GPU work proportional to the game's
 actual fps, not 240 composites/s.
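
The way both residuals scale with the grid can be written down directly. A rough sketch, assuming the "~±2 ms" pacing figure is read as half a grid period (that reading is an assumption) and the added latency is bounded by one full period; `GridResiduals` and `grid_residuals` are hypothetical names.

```rust
/// Residuals of sampling on a fixed grid, all in milliseconds.
struct GridResiduals {
    /// Length of one grid tick.
    period_ms: f64,
    /// Assumed reading of the pacing wobble: plus or minus half a period.
    pacing_error_ms: f64,
    /// Worst case: a frame completes just after a tick and waits a full period.
    max_added_latency_ms: f64,
}

fn grid_residuals(grid_hz: f64) -> GridResiduals {
    let period_ms = 1000.0 / grid_hz;
    GridResiduals {
        period_ms,
        pacing_error_ms: period_ms / 2.0,
        max_added_latency_ms: period_ms,
    }
}
```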
 **Negotiation semantics shift**: today the client requests its native WxH@Hz and the mode's Hz means
 "the cadence you'll receive." Under client-VRR the virtual display Hz is the *sampling grid* (pick
 it high), while the client's panel VRR range governs presentation only.
 ## Where the pieces stand (investigation findings)
 ### Wire — already ~90 % ready
 - Every packet carries a wall-clock **capture** timestamp: `PacketHeader.pts_ns` is the first field
  of the 40-byte header (`punktfunk-core/src/packet.rs:52-68`), threaded to `Frame.pts_ns` and ABI
  `PunktfunkFrame.pts_ns`. Epoch = ns since UNIX epoch, stamped host-side via `SystemTime::now()`
  (`punktfunk1.rs:100-105`). Plus a monotonic per-AU `frame_index`.
 - The clock-skew offset is **ABI-exposed**: `punktfunk_connection_clock_offset_ns`
  (`abi.rs:2121-2137`; NTP-style min-RTT estimate, `quic.rs:417-426`). A client can convert host
  capture time to its own clock — the raw material for a timestamp-scheduled presenter, and
  something Moonlight fundamentally lacks (its "frame pacing" guesses; we have a measured offset).
 - **FEC, keepalives, and reordering are rate-agnostic**: FEC is self-describing per packet and adapts
  on loss; QUIC keepalive is 4 s/8 s; the reassembler window is frame-count-based
  (`REORDER_WINDOW = 16`, `packet.rs:47`). Nothing in the data plane divides by fps.
 Missing (all small): a `FLAG_REPEAT` (or `FLAG_NEW`) bit in the already-end-to-end
 `PacketHeader.user_flags` (free bits above `FLAG_PIC/EOF/SOF/PROBE`, `packet.rs:30-36` — no header
 size change); `VIDEO_CAP_VRR = 0x08` in `video_caps` (`quic.rs:107-116`) mirrored to the ABI
 constant with the lockstep assert (`abi.rs:856-864`); an append-only Hello/Welcome trailing field
 for the client's panel refresh range (the same trailing-byte back-compat pattern used 7×). One real
 caveat: **`Reconfigure`/`Reconfigured` are fixed-length, not tail-extensible** (decode requires
 exact lengths, `quic.rs:1029,1057`) — a mid-stream VRR toggle/range change needs a new typed
 control message, not a field append.
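
The missing wire bits could look roughly like the sketch below. Only `VIDEO_CAP_VRR = 0x08` comes from the text above. The `FLAG_REPEAT` bit value, the `u8` widths, the `PanelRange` layout (two little-endian `u16`s) and all function names are assumptions made for illustration, not the real `quic.rs`/`packet.rs` definitions.

```rust
/// Capability bit value proposed in the design text (field width assumed).
const VIDEO_CAP_VRR: u8 = 0x08;
/// Placeholder bit for repeat frames; the real free bit is not specified here.
const FLAG_REPEAT: u8 = 0x10;

/// Hypothetical trailing Hello/Welcome field: the client's panel VRR range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PanelRange {
    min_hz: u16,
    max_hz: u16,
}

/// Appends the range to an already-encoded message (append-only pattern).
fn encode_panel_range(r: PanelRange, out: &mut Vec<u8>) {
    out.extend_from_slice(&r.min_hz.to_le_bytes());
    out.extend_from_slice(&r.max_hz.to_le_bytes());
}

/// Decodes the optional tail. Older peers send no tail, which yields None.
fn decode_panel_range(tail: &[u8]) -> Option<PanelRange> {
    if tail.len() < 4 {
        return None;
    }
    let min_hz = u16::from_le_bytes([tail[0], tail[1]]);
    let max_hz = u16::from_le_bytes([tail[2], tail[3]]);
    // Reject nonsensical ranges instead of trusting the peer.
    if min_hz == 0 || min_hz > max_hz {
        return None;
    }
    Some(PanelRange { min_hz, max_hz })
}

/// VRR is active only when both sides advertise the capability bit.
fn vrr_negotiated(host_caps: u8, client_caps: u8) -> bool {
    (host_caps & client_caps & VIDEO_CAP_VRR) != 0
}

/// True when the packet's flags mark it as a synthetic repeat.
fn is_repeat(user_flags: u8) -> bool {
    (user_flags & FLAG_REPEAT) != 0
}
```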
 ### Host — fixed cadence is a consumer-loop choice, not a capture limitation
 Every capture producer is already push/event-driven: PipeWire delivers a buffer per composite on
 all Linux backends (damage-driven on kwin/mutter/wlroots — a static desktop produces *nothing*;
 gamescope pushes per output frame at its `-r` rate); the pf-vdisplay ring publishes one frame per
 DWM present and signals a frame-ready event, returning `E_PENDING` when DWM composed nothing
 (`swap_chain_processor.rs:306-333`). The fixed cadence is imposed entirely by the encode loops: the
 `next += 1/effective_hz` pacer (`punktfunk1.rs:3336,3398-3401,3606`; GameStream analogue
 `gamestream/stream.rs:805-808`) re-samples via `try_latest()` and **re-encodes the last frame as a
 synthetic repeat** when nothing new arrived (`punktfunk1.rs:3169-3179`) — repeats go on the wire
 indistinguishable from new frames (the `repeat` bool is host-internal stats only).
 Smallest cadence change: block on the existing `next_frame()` (Linux `recv_timeout`, IDD
 `WaitForSingleObject` on the frame-ready event) and submit one encode per delivered frame, keeping
 an **idle-timeout repeat** so a damage-idle desktop still emits keepalive frames. The wire PTS is
 already wall-clock, so timestamps survive unchanged.
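
The frame-driven loop with an idle-timeout repeat can be sketched with a plain `std::sync::mpsc` channel standing in for the capture producer. This is an illustration under stated assumptions: frames are reduced to `u32` ids, `Emit` and `frame_driven_loop` are hypothetical names, and `max_emits` exists only so the sketch terminates.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// What the loop would hand to the encoder.
#[derive(Debug, PartialEq, Eq)]
enum Emit {
    /// A freshly captured frame.
    New(u32),
    /// The last frame again, sent because the source went idle.
    Repeat(u32),
}

/// Blocks on the capture channel and emits one entry per delivered frame,
/// falling back to a repeat of the last frame after `idle` without input.
fn frame_driven_loop(rx: &Receiver<u32>, idle: Duration, max_emits: usize) -> Vec<Emit> {
    let mut out = Vec::new();
    let mut last: Option<u32> = None;
    while out.len() < max_emits {
        match rx.recv_timeout(idle) {
            Ok(frame) => {
                last = Some(frame);
                out.push(Emit::New(frame));
            }
            Err(RecvTimeoutError::Timeout) => {
                // Idle desktop: keep the stream alive with a repeat.
                // Before the first capture there is nothing to repeat yet.
                if let Some(frame) = last {
                    out.push(Emit::Repeat(frame));
                }
            }
            // Producer went away: stop the loop.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    out
}
```

With this shape the emission cadence follows the producer, and the `idle` duration alone sets the stream's minimum rate.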
 The load-bearing fixed-fps assumption is **rate control**: both encoder paths run CBR with a
 ~1-frame VBV sized `bitrate/fps` and feed `frame_idx` as the encoder PTS
 (`encode/linux/mod.rs:280-297` — `time_base(1/fps)`, VBV `bitrate/fps × PUNKTFUNK_VBV_FRAMES`
 default 1; `encode/windows/nvenc.rs:663-672,787-788` — `frameRateNum = fps`, VBV `bitrate/fps`;
 PTS = `frame_idx` at `nvenc.rs:1189-1206` / `mod.rs:167,535`). Variable intervals won't corrupt
 ordering, but a game at 85 fps in a "240 Hz-grid" session drastically undershoots the bitrate
 target and bursts fight the 1-frame VBV. VFR needs: feed the real capture PTS to the encoder
 timeline, and either budget `frameRate` at the *expected* rate with a laxer VBV or move that path
 to VBR/CQ. This is the one real technical knot.
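
The undershoot is easy to quantify with a first-order model. The sketch below assumes the encoder spends roughly `bitrate / configured_fps` bits per frame regardless of when frames arrive, which is a simplification of real CBR behaviour; all three function names are hypothetical.

```rust
/// VBV size in bits when the buffer holds `vbv_frames` frames at the
/// configured rate (the `bitrate/fps` sizing described above).
fn vbv_bits(bitrate_bps: u64, configured_fps: u64, vbv_frames: u64) -> u64 {
    bitrate_bps / configured_fps * vbv_frames
}

/// Approximate bitrate produced when frames arrive at `actual_fps` but each
/// one is budgeted as if the stream ran at `configured_fps`.
fn effective_bitrate_bps(bitrate_bps: u64, configured_fps: u64, actual_fps: u64) -> u64 {
    bitrate_bps * actual_fps / configured_fps
}

/// Encoder PTS taken from real capture time: nanoseconds since the first
/// frame, instead of `frame_idx` on a 1/fps time base.
fn encoder_pts_ns(capture_pts_ns: u64, first_capture_pts_ns: u64) -> u64 {
    capture_pts_ns.saturating_sub(first_capture_pts_ns)
}
```

Under this model a 20 Mbit/s target configured for a 240 Hz grid yields only about 7 Mbit/s when the game runs at 85 fps, roughly 35 % of the budget.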
 ### Clients — all four are vsync-locked newest-wins today
 No client has any tearing/VRR/present-immediate path; `clock_offset_ns` is used only for the
 latency HUD. Queue depth is 1–2 slots newest-wins everywhere; no de-jitter buffer anywhere.
 | Client | Today | Present-on-arrival path |
 |---|---|---|
 | Android | `releaseOutputBuffer(render=true)` immediately on the newest drained buffer (`native/src/decode.rs:274-334`); `setFrameRate` fixed hint (`decode.rs:100`) | **Closest** — present already arrival-driven; switch to the frame-rate change-strategy / seamless APIs so a VRR panel follows |
 | Linux | `set_paintable` on frame arrival; GTK/compositor frame clock scans out (`ui_stream.rs:475-588`) | Arrival side done; needs compositor VRR (GNOME/KDE enable VRR for fullscreen apps — the fullscreen `GtkGraphicsOffload` dmabuf direct-scanout path is exactly the eligible case) |
 | Apple | Main-runloop `CADisplayLink` at fixed display/stream cadence + 1-slot `ReadyRing` (`SessionPresenter.swift:69-76`, `Stage2Pipeline.swift:15-37`); macOS `displaySyncEnabled=false` is *not* tearing — WindowServer still composites at vsync (`MetalVideoPresenter.swift:193-200`) | iOS/iPadOS ProMotion: wide `CAFrameRateRange` + drive render from the decode callback instead of the link. macOS: WindowServer-limited (Moonlight reports VRR-follows-stream fullscreen only) |
 | Windows | Render thread waits the swapchain latency waitable (DWM vblank cadence) then `Present(1)`; no `ALLOW_TEARING` anywhere (`render.rs:157-225`, `present.rs:161-173,540`) | **Hardest** — the composition `SwapChainPanel` swapchain can't tear/independent-flip. Plausible route: arrival-driven presents through DWM's windowed-VRR (windowed G-Sync/FreeSync — DWM composes on demand, panel follows); needs on-glass validation, else a fullscreen HWND swapchain mode |
 ### Client pacing policy: scheduled present, not raw arrival
 Raw present-on-arrival replays network+encode jitter onto the panel. Better: present at
 `pts_ns + clock_offset_ns + D` for a small constant `D` — the shared clock absorbs jitter and
 reproduces the *host-side* cadence exactly (still grid-quantized at the source; see residual (1)).
 `D` is a smoothness-vs-latency knob; on LAN it can be near zero. All the data for this is already
 on the wire today.
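
A minimal sketch of the scheduling rule, assuming `clock_offset_ns` is defined as client clock minus host clock (the actual sign convention of `punktfunk_connection_clock_offset_ns` is not stated here) and that a late frame is presented immediately; both function names are hypothetical.

```rust
/// Target present time on the client clock for a frame captured at `pts_ns`
/// on the host wall clock. Assumes `clock_offset_ns` = client minus host.
fn scheduled_present_ns(pts_ns: u64, clock_offset_ns: i64, d_ns: u64) -> u64 {
    // Widen to i128 so a negative offset cannot underflow the u64 timestamp.
    let local_capture = pts_ns as i128 + clock_offset_ns as i128;
    (local_capture + d_ns as i128).max(0) as u64
}

/// How long to wait before presenting. Zero means the frame is already past
/// its target and should be shown right away.
fn wait_before_present_ns(now_ns: u64, target_ns: u64) -> u64 {
    target_ns.saturating_sub(now_ns)
}
```

Because the offset and `D` are constants across frames, the spacing between two targets equals the spacing between their `pts_ns` values, which is what reproduces the host-side cadence.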
 ## Staging
 1. **Stage A — client-only, no protocol change.** Timestamp-scheduled / present-on-arrival on
   VRR-capable displays. Order: Android + Linux (architecturally ready) → iOS ProMotion → Windows
   (DWM windowed-VRR validation) → macOS fullscreen. Biggest single latency win: removes on average ½,
   at worst 1, client refresh interval (~8.3/16.7 ms at 60 Hz, halved at 120 Hz).
 2. **Stage B — host, native path only.** Frame-driven consumer loop + idle-repeat keepalive;
   real-PTS encoder timeline + VFR-tolerant rate control; `FLAG_REPEAT` on the wire;
   `VIDEO_CAP_VRR` + panel-range negotiation; grid-Hz mode semantics. Kills the capture-side
   quantization down to the grid and stops burning encode on synthetic repeats.
 3. **Stage C — optional gamescope experiment.** gamescope has `--adaptive-sync` and it works even
   nested per upstream #1694; we don't pass it in the headless spawn
   (`vdisplay/linux/gamescope.rs:975-980`), and whether the *headless* backend honors it is
   unverified (untestable until the dev VM's GPU passthrough returns). If it works, it removes even
   the sampling grid on the path that matters most for gaming, at near-zero implementation cost.
   An optimization, not the architecture. KWin/Mutter/wlroots/IddCx true VRR: upstream-blocked,
   do not pursue.
 ## Open questions / risks
 - **VFR rate control per encoder**: exact NVENC/VAAPI/AMF/QSV recipe (real-timestamp `time_base`
  vs max-rate + enlarged VBV vs VBR/CQ); interaction with the 1-frame-VBV latency property we rely
  on. The main Stage-B risk item.
 - **Does gamescope headless honor `--adaptive-sync`?** (Stage C gate; needs the GPU back.)
 - **DWM windowed VRR with a composition swapchain**: does arrival-cadence presenting through the
  XAML `SwapChainPanel` actually drive a G-Sync/FreeSync panel variably? On-glass validation gates
  the Windows-client Stage A entry.
 - **Panel VRR floor / LFC**: the idle-keepalive repeat cadence sets the stream's minimum rate; if
  it sits below a panel's ~48 Hz floor the client compositor/driver's LFC handles doubling —
  verify, and don't park the keepalive interval right at a floor boundary.
 - **Android**: seamless (`CHANGE_FRAME_RATE_ONLY_IF_SEAMLESS`) vs non-seamless switch strategy,
  and real-device VRR panel coverage.
 - **Hello semantics**: how a VRR-capable client picks the grid Hz to request (host advertises its
  max grid? client just asks for 240 Hz and the host clamps, like today's mode ladder?).
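
For the panel-floor question in the list above, a simplified model of low-framerate compensation helps reason about where the keepalive cadence may sit. This sketch assumes LFC repeats each frame a whole number of times to lift the refresh rate into the panel's range; real driver and compositor behaviour may differ, and `lfc_multiplier` is a hypothetical helper.

```rust
/// Smallest whole-number repeat count that lifts `stream_hz` into the panel's
/// VRR window, or None when no multiple fits (simplified LFC model).
fn lfc_multiplier(stream_hz: u32, panel_min_hz: u32, panel_max_hz: u32) -> Option<u32> {
    if stream_hz == 0 || panel_min_hz > panel_max_hz {
        return None;
    }
    // ceil(panel_min_hz / stream_hz), but never fewer than one present.
    let n = ((panel_min_hz + stream_hz - 1) / stream_hz).max(1);
    if n * stream_hz <= panel_max_hz {
        Some(n)
    } else {
        None
    }
}
```

Under this model a 30 Hz keepalive on a 48–144 Hz panel is doubled to 60 Hz, while a 47 Hz stream on a narrow 48–60 Hz panel has no valid multiple, which illustrates why a rate just below the floor is the awkward case.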
 ## External evidence (2026-07-03)
 - gamescope `--adaptive-sync` works in nested mode: [ValveSoftware/gamescope#1694](https://github.com/ValveSoftware/gamescope/issues/1694)
 - IddCx has no VRR path; community "can we fake it" open/unanswered: [Virtual-Display-Driver#24](https://github.com/itsmikethetech/Virtual-Display-Driver/issues/24), [IddCx DDI index](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/)
 - Client VRR panels do follow Moonlight's stream cadence in practice (and it's messy — our shared
  clock is the differentiator): [moonlight-qt#1545](https://github.com/moonlight-stream/moonlight-qt/issues/1545), macOS fullscreen-only [moonlight-qt#1509](https://github.com/moonlight-stream/moonlight-qt/issues/1509)
 - Mutter `RecordVirtual` derives refresh from PipeWire; VRR only on real monitors: [mutter!1154](https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1154)