design(host): add vrr design doc
@@ -24,6 +24,7 @@ holds the full originals.
| [`multi-user-profiles.md`](multi-user-profiles.md) | Multi-user / profiles end to end: map a client to a real host OS user account (own isolated desktop), web-console config, per-profile passcode | **Design, schema-of-record** — not yet implemented |
| [`host-latency-plan.md`](host-latency-plan.md) | Latency under GPU contention — 4-tier plan | **Partly shipped** — superseded by ↓; diagnostics + open tiers kept |
| [`gpu-contention-investigation.md`](gpu-contention-investigation.md) | GPU-contention root-cause + ranked levers (supersedes ↑) | **Active plan** — §5.A shipped; §5.B/C/E/F/G open |
| [`vrr-plan.md`](vrr-plan.md) | VRR over punktfunk/1 — skip the VRR virtual display (client panel follows stream cadence; virtual display Hz = sampling grid); frame-driven host loop + VFR rate control + present-on-arrival clients | **Design** — not yet implemented |
| [`hdr-pipeline-plan.md`](hdr-pipeline-plan.md) | Glass-to-glass HDR | **Steps 0–3 shipped**; Step 4 (Linux) open |
| [`windows-host-rewrite.md`](windows-host-rewrite.md) | **Windows host — the single architecture/status/reference doc** (validated invariants, ops, open work) | **Active reference** |
| [`windows-build-and-packaging.md`](windows-build-and-packaging.md) | How the Windows host is built, signed, packaged (drivers-from-source, Inno, CI) | **Evergreen reference** |
@@ -53,6 +54,7 @@ owning doc.)
- Sub-frame pipelining — overlap encode+transmit within a frame; needs a direct NVENC SDK wrapper (~2–4 ms). → `implementation-plan`, `gamestream-host-plan`
- GPU-contention levers: correct async NVENC pipeline, auto-gated REALTIME GPU priority, clock/P-state pinning, frame-source escape (swapchain-hook/NvFBC/compose-flip), iGPU encode offload, PERF uniq-vs-fps instrumentation. → `gpu-contention-investigation` (§5.B/C/E/F/G), `host-latency-plan` (Tiers 1A/1B/3B/3C/3D/4)
- Apple stage-2 as default (after resolution/HDR checks) + smoothing/pacing policy + glass-to-glass numbers via `tools/latency-probe`. → `apple-stage2-presenter`
- VRR / frame-driven cadence: Stage A client present-on-arrival (Android/Linux → iOS → Windows), Stage B host frame-driven loop + VFR rate control + `FLAG_REPEAT`/`VIDEO_CAP_VRR`, Stage C gamescope `--adaptive-sync` headless experiment. → `vrr-plan`

**HDR**
- Linux 10-bit HDR (Step 4): 8-bit→Main10 shim, true 10-bit PipeWire capture (blocked upstream — gamescope #2126), Linux-client P010 + GTK color management. → `hdr-pipeline-plan`
@@ -0,0 +1,158 @@
# VRR over punktfunk/1 — design

> **Status:** DESIGN. Investigation complete (2026-07-03), nothing implemented. Key architectural
> decision recorded here: **no VRR virtual display is needed**. Client-side VRR is driven purely by
> presentation cadence, so the host's virtual display Hz becomes a *sampling grid* decoupled from the
> client panel. punktfunk/1-native only; GameStream/Moonlight stays fixed-cadence (stock clients).

Goal: end-to-end variable refresh, where the game's real frame pacing reaches the client's VRR panel
instead of being resampled onto a fixed grid twice (host pacer, client vsync). The gains are in both
latency and smoothness. Latency: the fixed-cadence quantization at capture and present is now the
dominant remaining latency term (the Windows-client loopback p50 of ~18 ms is dominated by the
60 Hz virtual-display cadence, while the wire is sub-millisecond). Smoothness: an 85 fps game on a
120 Hz grid presents as an irregular 8.3/16.7 ms alternation, judder baked in at the source that no
client can undo.
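
The 85 fps on 120 Hz figure can be reproduced with a few lines of integer arithmetic. This is a minimal sketch under a simplified model: each frame is assumed to appear on the first grid tick at or after its true present time. `grid_tick_deltas` and `tick_ms` are illustrative names, not punktfunk functions.

```rust
/// Spacing, in grid ticks, between consecutive frames of a game running at
/// `fps` when every frame is snapped up to the next tick of a `grid_hz` grid.
/// Simplified model (an assumption): no compositor or encoder delay.
fn grid_tick_deltas(fps: u64, grid_hz: u64, intervals: u64) -> Vec<u64> {
    // Frame i is presented at i / fps seconds; its tick is ceil(i * grid_hz / fps).
    let tick_of = |i: u64| (i * grid_hz + fps - 1) / fps;
    (0..intervals).map(|i| tick_of(i + 1) - tick_of(i)).collect()
}

/// Length of one grid tick in milliseconds.
fn tick_ms(grid_hz: u64) -> f64 {
    1000.0 / grid_hz as f64
}
```

For `grid_tick_deltas(85, 120, 17)` the spacing is a mix of 1 and 2 ticks (8.3 ms and 16.7 ms), beginning 2, 1, 2, 1. Seven of the seventeen intervals are doubled, which is the judder described above.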

## The core decision: skip the VRR virtual display

A VRR panel is not "driven at a framerate" by any API; it follows the presentation cadence. If the
client presents each frame on arrival, the panel refreshes at the stream's cadence, whatever it is.
So client VRR needs **frame-driven host emission + present-on-arrival clients**, and no VRR anywhere
on the host display stack. This sidesteps two otherwise-hard blockers entirely:

- **IddCx (Windows host) has no VRR support at all** (through 1.10, which pf-vdisplay is built
  against): no VRR DDI, no VRR in the virtual EDID, and GPU control panels don't even list indirect
  displays as VRR-capable. Not fixable by us; the community IDD projects' "can we fake it" issue is
  open and unanswered.
- **KWin/Mutter/wlroots virtual outputs are fixed-mode** (KWin hardcodes 60 Hz + out-of-band
  `kscreen-doctor` custom modes, `vdisplay/linux/kwin.rs:101,138`; Mutter defaults to 60 Hz with the
  `PUNKTFUNK_MUTTER_VIRTUAL_REFRESH` opt-in, `mutter.rs:244-258`; Sway takes one
  `--custom WxH@Hz`, `wlroots.rs:93`).

What a true-VRR virtual display *would* add is confined to the source end, and amounts to exactly
two residuals: (1) **sampling quantization → pacing wobble**: the game's output is sampled on the
virtual display's fixed grid, and the game's true present times never reach the wire (our `pts_ns`
is stamped at capture, already grid-aligned); (2) **up to one virtual-vblank of host latency** (a
frame completed just after a composite waits for the next grid tick). Both scale with the grid:
at 240 Hz the grid is 4.2 ms, giving a pacing error of ~±2 ms (below the ~4–5 ms perceptibility
threshold) and ≤4.2 ms added latency. The high-Hz machinery already exists on every backend, and
the Linux compositors composite on damage, so a 240 Hz virtual mode costs GPU work proportional to
the game's actual fps, not 240 composites/s.
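
Both residuals follow directly from the grid period. The sketch below works through the 240 Hz numbers; the names are hypothetical, and the model assumes a frame is picked up on the first grid tick at or after it completes, with a non-zero `grid_hz`.

```rust
/// Residuals of sampling on a fixed grid, in milliseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct GridResiduals {
    period_ms: f64,
    /// Residual (2): a frame that just missed a tick waits almost one period.
    max_added_latency_ms: f64,
    /// Residual (1): pacing error is within plus or minus half a period.
    pacing_error_ms: f64,
}

fn grid_residuals(grid_hz: f64) -> GridResiduals {
    let period_ms = 1000.0 / grid_hz;
    GridResiduals {
        period_ms,
        max_added_latency_ms: period_ms,
        pacing_error_ms: period_ms / 2.0,
    }
}

/// How long a frame finished at `done_ns` waits for the next grid tick.
/// Assumes `grid_hz` is non-zero.
fn wait_for_next_tick_ns(done_ns: u64, grid_hz: u64) -> u64 {
    const NS: u128 = 1_000_000_000;
    let (done, hz) = (done_ns as u128, grid_hz as u128);
    let tick = (done * hz + NS - 1) / NS; // first tick index at or after done_ns
    let tick_ns = tick * NS / hz;
    (tick_ns - done) as u64
}
```

`grid_residuals(240.0)` gives a period of about 4.17 ms and a pacing error of about 2.08 ms, matching the figures above. A frame that completes 1 ns after a tick waits nearly the whole period.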

**Negotiation semantics shift**: today the client requests its native WxH@Hz and the mode's Hz means
"the cadence you'll receive." Under client-VRR the virtual display Hz is the *sampling grid* (pick
it high), while the client's panel VRR range governs presentation only.

## Where the pieces stand (investigation findings)

### Wire — already ~90 % ready

- Every packet carries a wall-clock **capture** timestamp: `PacketHeader.pts_ns` is the first field
  of the 40-byte header (`punktfunk-core/src/packet.rs:52-68`), threaded to `Frame.pts_ns` and ABI
  `PunktfunkFrame.pts_ns`. Epoch = ns since UNIX epoch, stamped host-side via `SystemTime::now()`
  (`punktfunk1.rs:100-105`). There is also a monotonic per-AU `frame_index`.
- The clock-skew offset is **ABI-exposed**: `punktfunk_connection_clock_offset_ns`
  (`abi.rs:2121-2137`; NTP-style min-RTT estimate, `quic.rs:417-426`). A client can convert host
  capture time to its own clock. That is the raw material for a timestamp-scheduled presenter, and
  something Moonlight fundamentally lacks (its "frame pacing" guesses; we have a measured offset).
- **FEC, keepalives, and reorder are rate-agnostic**: FEC is self-describing per packet and adapts
  on loss; QUIC keepalive is 4 s/8 s; the reassembler window is frame-count-based
  (`REORDER_WINDOW = 16`, `packet.rs:47`). Nothing in the data plane divides by fps.

Missing (all small): a `FLAG_REPEAT` (or `FLAG_NEW`) bit in the already-end-to-end
`PacketHeader.user_flags` (free bits above `FLAG_PIC/EOF/SOF/PROBE`, `packet.rs:30-36`, so no header
size change); `VIDEO_CAP_VRR = 0x08` in `video_caps` (`quic.rs:107-116`) mirrored to the ABI
constant with the lockstep assert (`abi.rs:856-864`); an append-only Hello/Welcome trailing field
for the client's panel refresh range (the same trailing-byte back-compat pattern used 7×). One real
caveat: **`Reconfigure`/`Reconfigured` are fixed-length, not tail-extensible** (decode requires
exact lengths, `quic.rs:1029,1057`), so a mid-stream VRR toggle/range change needs a new typed
control message, not a field append.
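
One plausible shape for the capability bit and the append-only trailing field is sketched below. Only `VIDEO_CAP_VRR = 0x08` comes from the text above. The `u8` width of `video_caps`, the `PanelRange` layout (two little-endian `u16` values in Hz), and the function names are assumptions for illustration, not the actual punktfunk wire format.

```rust
/// Capability bit proposed in this doc; the `u8` width is an assumption.
const VIDEO_CAP_VRR: u8 = 0x08;

/// Hypothetical trailing field: the client's panel VRR range in Hz.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PanelRange {
    min_hz: u16,
    max_hz: u16,
}

/// Decode an optional tail that follows the `fixed_len` bytes older peers send.
/// An absent tail means "old peer"; extra bytes after ours are ignored so
/// later fields can be appended the same way.
fn decode_panel_range(
    msg: &[u8],
    fixed_len: usize,
) -> Result<Option<PanelRange>, &'static str> {
    if msg.len() < fixed_len {
        return Err("short message");
    }
    let tail = &msg[fixed_len..];
    match tail.len() {
        0 => Ok(None),
        1..=3 => Err("truncated panel range"),
        _ => {
            let min_hz = u16::from_le_bytes([tail[0], tail[1]]);
            let max_hz = u16::from_le_bytes([tail[2], tail[3]]);
            if min_hz == 0 || min_hz > max_hz {
                return Err("invalid panel range");
            }
            Ok(Some(PanelRange { min_hz, max_hz }))
        }
    }
}

/// VRR is on only when both sides advertise the bit.
fn vrr_negotiated(host_caps: u8, client_caps: u8) -> bool {
    (host_caps & client_caps & VIDEO_CAP_VRR) != 0
}
```

The fixed-length `Reconfigure`/`Reconfigured` messages cannot use this pattern, since their decoders reject any length other than the exact one.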
### Host — fixed cadence is a consumer-loop choice, not a capture limitation

Every capture producer is already push/event-driven. PipeWire delivers a buffer per composite on
all Linux backends (damage-driven on kwin/mutter/wlroots, so a static desktop produces *nothing*;
gamescope pushes per output frame at its `-r` rate). The pf-vdisplay ring publishes one frame per
DWM present and signals a frame-ready event, returning `E_PENDING` when DWM composed nothing
(`swap_chain_processor.rs:306-333`). The fixed cadence is imposed entirely by the encode loops: the
`next += 1/effective_hz` pacer (`punktfunk1.rs:3336,3398-3401,3606`; GameStream analogue
`gamestream/stream.rs:805-808`) re-samples via `try_latest()` and **re-encodes the last frame as a
synthetic repeat** when nothing new arrived (`punktfunk1.rs:3169-3179`). Those repeats go on the
wire indistinguishable from new frames (the `repeat` bool is host-internal stats only).

Smallest cadence change: block on the existing `next_frame()` (Linux `recv_timeout`, IDD
`WaitForSingleObject` on the frame-ready event) and submit one encode per delivered frame, keeping
an **idle-timeout repeat** so a damage-idle desktop still emits keepalive frames. The wire PTS is
already wall-clock, so timestamps survive unchanged.
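
The frame-driven loop with an idle-timeout repeat can be sketched as follows. `std::sync::mpsc` stands in for the capture channel here; `Frame`, `Emit`, and `run_frame_driven` are hypothetical names, and the real `next_frame()` signature is not shown in this doc. A real loop would probably also restamp the repeat rather than reuse the old `pts_ns`.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
struct Frame {
    pts_ns: u64,
}

#[derive(Debug, PartialEq)]
enum Emit {
    /// A frame the capture source actually delivered.
    New(Frame),
    /// Idle keepalive: the last frame again (would carry `FLAG_REPEAT`).
    Repeat(Frame),
}

/// One emission per delivered frame, plus a repeat whenever the source has
/// been silent for `idle`. Returns when the producer goes away.
fn run_frame_driven(rx: &Receiver<Frame>, idle: Duration, mut submit: impl FnMut(Emit)) {
    let mut last: Option<Frame> = None;
    loop {
        match rx.recv_timeout(idle) {
            Ok(frame) => {
                last = Some(frame.clone());
                submit(Emit::New(frame));
            }
            Err(RecvTimeoutError::Timeout) => {
                // Nothing new: keep the stream alive, but only once a first
                // frame exists to repeat.
                if let Some(f) = &last {
                    submit(Emit::Repeat(f.clone()));
                }
            }
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}
```

Compared with the fixed pacer, the loop has no `next += 1/effective_hz` deadline at all: cadence comes from the producer, and `idle` only bounds the longest gap between emissions.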

The load-bearing fixed-fps assumption is **rate control**: both encoder paths run CBR with a
~1-frame VBV sized `bitrate/fps` and feed `frame_idx` as the encoder PTS
(`encode/linux/mod.rs:280-297`: `time_base(1/fps)`, VBV `bitrate/fps × PUNKTFUNK_VBV_FRAMES`,
default 1; `encode/windows/nvenc.rs:663-672,787-788`: `frameRateNum = fps`, VBV `bitrate/fps`;
PTS = `frame_idx` at `nvenc.rs:1189-1206` / `mod.rs:167,535`). Variable intervals won't corrupt
ordering, but a game at 85 fps in a "240 Hz-grid" session drastically undershoots the bitrate
target (each frame is budgeted for 1/240 s, so 85 frames/s spend roughly a third of the target),
and bursts fight the 1-frame VBV. VFR needs: feed the real capture PTS to the encoder
timeline, and either budget `frameRate` at the *expected* rate with a laxer VBV or move that path
to VBR/CQ. This is the one real technical knot.
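
The undershoot is simple arithmetic. The sketch below assumes an idealized CBR encoder that spends exactly its per-frame budget on every frame; real encoders deviate from this, and both function names are illustrative.

```rust
/// VBV size in bits for a CBR session configured at `fps`, holding
/// `vbv_frames` frames (the doc's default is 1).
fn vbv_bits(bitrate_bps: u64, fps: u64, vbv_frames: u64) -> u64 {
    bitrate_bps / fps * vbv_frames
}

/// Bitrate actually produced when the encoder is configured for
/// `configured_fps` but only `actual_fps` frames per second arrive.
/// Idealized: every frame costs exactly `bitrate / configured_fps` bits.
fn achieved_bps(bitrate_bps: u64, configured_fps: u64, actual_fps: u64) -> u64 {
    bitrate_bps / configured_fps * actual_fps
}
```

With a 24 Mbit/s target on a 240 Hz grid, each frame is budgeted 100,000 bits, so an 85 fps game produces about 8.5 Mbit/s. Budgeting at the expected 85 fps instead would give each frame about 282,000 bits, which is the "expected rate with a laxer VBV" option above.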
### Clients — all four are vsync-locked newest-wins today

No client has any tearing/VRR/present-immediate path; `clock_offset_ns` is used only for the
latency HUD. Queue depth is 1–2 slots newest-wins everywhere; no de-jitter buffer anywhere.

| Client | Today | Present-on-arrival path |
|---|---|---|
| Android | `releaseOutputBuffer(render=true)` immediately on the newest drained buffer (`native/src/decode.rs:274-334`); `setFrameRate` fixed hint (`decode.rs:100`) | **Closest**: present is already arrival-driven; switch to the frame-rate change-strategy / seamless APIs so a VRR panel follows |
| Linux | `set_paintable` on frame arrival; GTK/compositor frame clock scans out (`ui_stream.rs:475-588`) | Arrival side done; needs compositor VRR (GNOME/KDE enable VRR for fullscreen apps, and the fullscreen `GtkGraphicsOffload` dmabuf direct-scanout path is exactly the eligible case) |
| Apple | Main-runloop `CADisplayLink` at fixed display/stream cadence + 1-slot `ReadyRing` (`SessionPresenter.swift:69-76`, `Stage2Pipeline.swift:15-37`); macOS `displaySyncEnabled=false` is *not* tearing: WindowServer still composites at vsync (`MetalVideoPresenter.swift:193-200`) | iOS/iPadOS ProMotion: wide `CAFrameRateRange` + drive render from the decode callback instead of the link. macOS: WindowServer-limited (Moonlight reports VRR-follows-stream fullscreen only) |
| Windows | Render thread waits on the swapchain latency waitable (DWM vblank cadence), then calls `Present(1)`; no `ALLOW_TEARING` anywhere (`render.rs:157-225`, `present.rs:161-173,540`) | **Hardest**: the composition `SwapChainPanel` swapchain can't tear/independent-flip. Plausible route: arrival-driven presents through DWM's windowed-VRR (windowed G-Sync/FreeSync, where DWM composes on demand and the panel follows); needs on-glass validation, else a fullscreen HWND swapchain mode |

### Client pacing policy: scheduled present, not raw arrival

Raw present-on-arrival replays network+encode jitter onto the panel. Better: present at
`pts_ns + clock_offset_ns + D` for a small constant `D`. The shared clock absorbs jitter up to `D`
and reproduces the *host-side* cadence exactly (still grid-quantized at the source; see residual
(1)). `D` is a smoothness-vs-latency knob; on LAN it can be near zero. All the data for this is
already on the wire today.
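
A minimal sketch of the scheduled-present rule follows. It assumes `clock_offset_ns` is signed and means "client clock minus host clock"; the actual sign convention of `punktfunk_connection_clock_offset_ns` is not stated in this doc, so that is an assumption. `Action` and both function names are illustrative.

```rust
/// Target present time on the client clock, in ns.
/// Assumption: `clock_offset_ns` = client clock - host clock.
fn present_deadline_ns(pts_ns: u64, clock_offset_ns: i64, d_ns: u64) -> u64 {
    let t = pts_ns as i128 + clock_offset_ns as i128 + d_ns as i128;
    t.max(0) as u64
}

#[derive(Debug, PartialEq)]
enum Action {
    /// Frame arrived early: hold it for this many ns.
    WaitNs(u64),
    /// Deadline already passed (jitter exceeded `D`): present immediately.
    PresentNow,
}

fn schedule(now_ns: u64, pts_ns: u64, clock_offset_ns: i64, d_ns: u64) -> Action {
    let deadline = present_deadline_ns(pts_ns, clock_offset_ns, d_ns);
    if deadline > now_ns {
        Action::WaitNs(deadline - now_ns)
    } else {
        Action::PresentNow
    }
}
```

Because the deadline depends only on `pts_ns`, two frames captured 11.76 ms apart are presented 11.76 ms apart regardless of when they arrive, as long as each arrives before its deadline.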

## Staging

1. **Stage A: client-only, no protocol change.** Timestamp-scheduled / present-on-arrival on
   VRR-capable displays. Order: Android + Linux (architecturally ready) → iOS ProMotion → Windows
   (DWM windowed-VRR validation) → macOS fullscreen. Biggest single latency win: removes an
   average of ½ and a worst case of 1 client refresh interval (~8.3/16.7 ms at 60 Hz, halved at
   120 Hz).
2. **Stage B: host, native path only.** Frame-driven consumer loop + idle-repeat keepalive;
   real-PTS encoder timeline + VFR-tolerant rate control; `FLAG_REPEAT` on the wire;
   `VIDEO_CAP_VRR` + panel-range negotiation; grid-Hz mode semantics. Kills the capture-side
   quantization down to the grid and stops burning encode on synthetic repeats.
3. **Stage C: optional gamescope experiment.** gamescope has `--adaptive-sync` and it works even
   nested per upstream #1694; we don't pass it in the headless spawn
   (`vdisplay/linux/gamescope.rs:975-980`), and whether the *headless* backend honors it is
   unverified (untestable until the dev VM's GPU passthrough returns). If it works, it removes even
   the sampling grid on the path that matters most for gaming, at near-zero implementation cost.
   An optimization, not the architecture. KWin/Mutter/wlroots/IddCx true VRR: upstream-blocked,
   do not pursue.

## Open questions / risks

- **VFR rate control per encoder**: exact NVENC/VAAPI/AMF-QSV recipe (real-timestamp `time_base`
  vs max-rate + enlarged VBV vs VBR/CQ); interaction with the 1-frame-VBV latency property we rely
  on. The main Stage B risk item.
- **Does gamescope headless honor `--adaptive-sync`?** (Stage C gate; needs the GPU back.)
- **DWM windowed VRR with a composition swapchain**: does arrival-cadence presenting through the
  XAML `SwapChainPanel` actually drive a G-Sync/FreeSync panel variably? On-glass validation gates
  the Windows-client Stage A entry.
- **Panel VRR floor / LFC**: the idle-keepalive repeat cadence sets the stream's minimum rate; if
  it sits below a panel's ~48 Hz floor, the client compositor/driver's LFC (low framerate
  compensation) should handle the doubling. Verify this, and don't park the keepalive interval
  right at a floor boundary.
- **Android**: seamless (`CHANGE_FRAME_RATE_ONLY_IF_SEAMLESS`) vs non-seamless switch strategy,
  and real-device VRR panel coverage.
- **Hello semantics**: how a VRR-capable client picks the grid Hz to request (host advertises its
  max grid? client just asks 240 and the host clamps like today's mode ladder?).
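
For the panel-floor question, the sketch below shows how an LFC-style frame multiplier could be chosen. It is a simplified illustrative model, not a description of any particular driver or compositor, and `lfc_multiplier` is a hypothetical name.

```rust
/// Smallest whole number of scan-outs per frame that lifts `stream_hz` into
/// the panel's VRR window `[floor_hz, max_hz]`. Returns `None` when no whole
/// multiple fits, which can happen on panels whose maximum is less than
/// twice their floor.
fn lfc_multiplier(stream_hz: f64, floor_hz: f64, max_hz: f64) -> Option<u32> {
    if stream_hz <= 0.0 || floor_hz > max_hz {
        return None;
    }
    let n = (floor_hz / stream_hz).ceil().max(1.0);
    if stream_hz * n <= max_hz {
        Some(n as u32)
    } else {
        None
    }
}
```

On a 48–144 Hz panel, a 30 Hz stream is doubled to 60 Hz and a 10 Hz keepalive is shown five times per frame. On a narrow 48–60 Hz panel, a 40 Hz stream has no valid multiple in this model.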

## External evidence (2026-07-03)

- gamescope `--adaptive-sync` works in nested mode: [ValveSoftware/gamescope#1694](https://github.com/ValveSoftware/gamescope/issues/1694)
- IddCx has no VRR path; community "can we fake it" open/unanswered: [Virtual-Display-Driver#24](https://github.com/itsmikethetech/Virtual-Display-Driver/issues/24), [IddCx DDI index](https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/iddcx/)
- Client VRR panels do follow Moonlight's stream cadence in practice (and it's messy; our shared
  clock is the differentiator): [moonlight-qt#1545](https://github.com/moonlight-stream/moonlight-qt/issues/1545), macOS fullscreen-only [moonlight-qt#1509](https://github.com/moonlight-stream/moonlight-qt/issues/1509)
- Mutter `RecordVirtual` derives refresh from PipeWire; VRR only on real monitors: [mutter!1154](https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1154)