69609945a3
Consumes the 0xCF host-timing plane (449a67c) on all four GUI clients: each
keeps a bounded pending ring of receipt samples keyed by pts, matches the
host's per-AU capture→sent reports against it, and the HUD equation becomes

    = host 3.1 + network 6.7 + decode 2.1 + display 2.3

falling back to the combined `= host+network …` term whenever no timing
matched the window (old host / datagram loss): same total, one split
fewer, never a misleading zero. Apple additionally gains the split as the
only equation line under the stage-1 fallback presenter (receipt is
presenter-independent), a `nextHostTiming` wrapper with its own plane lock,
and a unit-tested `HostNetworkSplitter`; Android extends the JNI stats
array 16→18 doubles (0–15 unchanged); Windows/Linux thread the split
through `Stats` into the HUD and the headless/debug logs.

Docs updated: design/stats-unification.md Phase 2 → implemented (wire
format, fallback semantics), and the docs-site matrix's Sunshine "Host
processing latency" row is now a direct match (ours includes the paced
send; avg vs p50).

Verified here: linux client clippy -D warnings green on the live tree,
windows stub check + hand-verified diff, android cargo-ndk arm64 check
green, apple loopback test extended (needs the rebuilt xcframework + swift
test on the mac). On-glass: pending on all platforms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
162 lines
9.5 KiB
Markdown
# Unified streaming stats: one vocabulary across every client

Status: spec agreed 2026-07-03; implementation tracked per client below.
User-facing companion: `docs-site/content/docs/stats.md` (overlay guide + Moonlight matrix).

## Why

Prospective users compare our numbers against Moonlight/Sunshine's overlay. Before this
spec the clients disagreed with each other (same concept, different name, different
measurement point, mean vs median, missing skew flags), and none of them made clear
which numbers are *sequential stages that add up* versus *overlapping absolutes*. The
Apple HUD showed three arrow-notation lines (`capture→client`, `capture→present`,
`decode→present`) that looked like a pipeline but overlapped, left the decode interval
invisible, and mixed two clock bases. Meanwhile Moonlight shows *only* disjoint
client-side segments and has **no end-to-end number at all**, so our headline
end-to-end figure, presented without context, reads "worse" against a Moonlight line
that measures a fraction of the chain.

## The model

Four measurement points per video frame. Every stat on every HUD is a difference of
two of these points, nothing else.

| point | meaning | who stamps it |
|---|---|---|
| **capture** | host capture clock: the per-AU `pts_ns`. Native path anchors at PipeWire frame delivery (host queue age is *inside* it); GameStream uses the RTP 90 kHz clock. | host |
| **received** | AU fully reassembled (post-FEC), handed to the client, before decode | client |
| **decoded** | decoder output frame available | client |
| **displayed** | best-effort presentation instant (see per-client endpoint table) | client |

Three **stages** tile the interval exactly (per frame, same timestamps, no gaps, no
overlap):

- `host+network` = capture → received. Contains the whole host pipeline
  (queue/capture/convert/encode/pace) **plus** wire time and reassembly; it cannot be
  split from client-side timestamps alone (see Phase 2).
- `decode` = received → decoded. Pure client-local; no clock skew involved.
- `display` = decoded → displayed. Pacing/queue wait + render + vsync. Client-local.

Headline: **`end-to-end`** = capture → displayed, **measured directly** (not the sum of
stage percentiles). Per frame the stages sum to it exactly; per window the p50s only
approximately sum (percentiles aren't additive). The doc says so, and the HUD shows the
equation with the directly-measured total on the left.

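As a rough illustration of the tiling property, the sketch below derives the three stages and the headline from the four points of a single frame. The type and field names (`FramePoints`, `Stages`, `stages`) are hypothetical and not taken from the clients; it also assumes all four timestamps have already been brought onto one timeline.

```rust
/// The four measurement points of one video frame, in nanoseconds on a
/// common timeline (skew correction assumed to be applied already).
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug)]
pub struct FramePoints {
    pub capture_ns: u64,
    pub received_ns: u64,
    pub decoded_ns: u64,
    pub displayed_ns: u64,
}

/// Per-frame stage durations plus the directly measured headline.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Stages {
    pub host_network_ns: u64,
    pub decode_ns: u64,
    pub display_ns: u64,
    pub end_to_end_ns: u64,
}

/// Every stat is a difference of two points. Returns `None` when the
/// points are out of order (for example an uncorrected clock skew).
pub fn stages(p: FramePoints) -> Option<Stages> {
    Some(Stages {
        host_network_ns: p.received_ns.checked_sub(p.capture_ns)?,
        decode_ns: p.decoded_ns.checked_sub(p.received_ns)?,
        display_ns: p.displayed_ns.checked_sub(p.decoded_ns)?,
        // Measured directly from the outer points, not summed from stages.
        end_to_end_ns: p.displayed_ns.checked_sub(p.capture_ns)?,
    })
}
```

For any single frame the three stage fields add up to `end_to_end_ns` exactly; that identity does not carry over to per-window p50s, which is why the headline is measured on its own.
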
### Clock rules

- `end-to-end` and `host+network` use client `CLOCK_REALTIME` + `clock_offset_ns`
  (the ClockProbe/ClockEcho skew handshake). Cross-machine valid when the offset was
  measured.
- `decode` and `display` are single-clock client-local (offset irrelevant).
- When `clock_offset_ns == 0` (host didn't answer the handshake / same host), the HUD
  appends **`(same-host clock)`** once, on the end-to-end line: every client, no
  exceptions. Windows and Linux previously showed possibly-uncorrected numbers
  silently; that was a bug in presentation.

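A minimal sketch of how these rules might look in code. The function names are hypothetical, and the sign convention (host time = client `CLOCK_REALTIME` + `clock_offset_ns`) is an assumption read off the first bullet; the real handshake may define the offset the other way round.

```rust
/// Cross-machine `host+network` sample in nanoseconds.
/// ASSUMPTION: host_time = client_realtime + clock_offset_ns.
/// The result is signed so that a bad offset shows up as a negative
/// sample instead of wrapping around.
pub fn host_network_ns(capture_pts_ns: u64, received_realtime_ns: u64, clock_offset_ns: i64) -> i64 {
    let received_on_host_clock = received_realtime_ns as i128 + clock_offset_ns as i128;
    let delta = received_on_host_clock - capture_pts_ns as i128;
    delta.clamp(i64::MIN as i128, i64::MAX as i128) as i64
}

/// Suffix for the end-to-end line: flagged once whenever no offset was
/// measured (host did not answer the handshake, or same host).
pub fn headline_suffix(clock_offset_ns: i64) -> &'static str {
    if clock_offset_ns == 0 {
        " (same-host clock)"
    } else {
        ""
    }
}
```

`decode` and `display` need neither function, since both of their endpoints are stamped by the same client clock.
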
### Aggregation rules

- **Window: 1 s tumbling** (drain and reset), HUD refresh 1×/s. (Moonlight uses a
  1–2 s sliding window, close enough for comparison; documented in the matrix.)
- **Percentiles only, never means**: end-to-end shows `p50` and `p95`; stages show
  `p50`. Windows/Linux previously showed *mean* decode time next to median latencies;
  that is now banned.
- Latency samples clamped to `(0, 10 s)` as before.
- `fps` = AUs received (reassembled) per second, actual-elapsed-time denominator.
- `Mb/s` = received payload bytes × 8 / elapsed (goodput, excludes FEC overhead;
  same basis as Moonlight's bitrate tracker).

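The aggregation rules could be sketched roughly as follows. `LatencyWindow`, `nearest_rank`, `fps` and `mbps` are hypothetical names. Two points are assumptions of this sketch and not stated by the spec: the `(0, 10 s)` clamp is read as discarding out-of-range samples, and percentiles use the nearest-rank method.

```rust
use std::time::Duration;

/// Upper bound of the accepted latency range, exclusive (10 s in µs).
const MAX_SAMPLE_US: u64 = 10_000_000;

/// One tumbling window of latency samples in microseconds.
#[derive(Default)]
pub struct LatencyWindow {
    samples_us: Vec<u64>,
}

impl LatencyWindow {
    /// ASSUMPTION: samples outside (0, 10 s) are dropped, not pinned.
    pub fn record(&mut self, sample_us: u64) {
        if sample_us > 0 && sample_us < MAX_SAMPLE_US {
            self.samples_us.push(sample_us);
        }
    }

    /// Drain and reset: returns (p50, p95) for the window, or `None`
    /// when no sample was recorded. Call once per HUD refresh.
    pub fn drain(&mut self) -> Option<(u64, u64)> {
        if self.samples_us.is_empty() {
            return None;
        }
        self.samples_us.sort_unstable();
        let result = (nearest_rank(&self.samples_us, 50), nearest_rank(&self.samples_us, 95));
        self.samples_us.clear();
        Some(result)
    }
}

/// Nearest-rank percentile on a non-empty sorted slice:
/// 1-based rank = ceil(pct / 100 * n).
fn nearest_rank(sorted: &[u64], pct: usize) -> u64 {
    let rank = (pct * sorted.len() + 99) / 100;
    sorted[rank.max(1) - 1]
}

/// AUs received per second over the actually elapsed time.
pub fn fps(aus_received: u64, elapsed: Duration) -> f64 {
    aus_received as f64 / elapsed.as_secs_f64()
}

/// Goodput in Mb/s: received payload bytes × 8 / elapsed.
pub fn mbps(payload_bytes: u64, elapsed: Duration) -> f64 {
    payload_bytes as f64 * 8.0 / elapsed.as_secs_f64() / 1e6
}
```

Callers would pass the measured elapsed time instead of a nominal one second, so a late HUD tick does not inflate `fps` or `Mb/s`.
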
## Canonical HUD layout

Line order and label strings are normative; platform chips vary.

```
1920×1080@120 · 119 fps · 38.2 Mb/s · HEVC 10-bit HDR · GPU decode
end-to-end 14.2 ms p50 · 19.8 p95 · capture→on-glass
= host+network 9.8 + decode 2.1 + display 2.3
lost 3 (0.1%) · skipped 1 · FEC 12
```

- Line 1: stream facts. `{W}×{H}@{Hz} · {fps} fps · {mbps} Mb/s` plus whatever
  chips the platform already knows (codec, bit depth, HDR, GPU/CPU decode).
- Line 2: the headline. `end-to-end {p50} ms p50 · {p95} p95 · capture→{endpoint}`
  (+ ` (same-host clock)` when applicable). The endpoint suffix is honest per
  platform (see the endpoint table). Never a bare "latency".
- Line 3: the equation. `= host+network {a} + decode {b} + display {c}` (stage p50s,
  ms, one decimal). A platform that can't measure a trailing stage drops the term and
  the headline endpoint moves accordingly; the equation always tiles the headline
  interval.
- Line 4: counters, only rendered when any value is nonzero.
  `lost` = unrecoverable network frame drops in the window (`{n} ({pct}%)`,
  pct = lost/(received+lost)); `skipped` = client-side newest-wins/pacing drops;
  `FEC` = shards recovered this window (proof FEC is earning its keep).

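One way lines 2 to 4 might be produced is sketched below. The function names and signatures are hypothetical; only the label strings follow the layout above. Trailing stages are modelled as `Option`s so that a platform without a display measurement drops the term.

```rust
/// Line 2: the headline, with the per-platform endpoint suffix.
pub fn headline(p50_ms: f64, p95_ms: f64, endpoint: &str, same_host_clock: bool) -> String {
    let suffix = if same_host_clock { " (same-host clock)" } else { "" };
    format!(
        "end-to-end {:.1} ms p50 · {:.1} p95 · capture→{}{}",
        p50_ms, p95_ms, endpoint, suffix
    )
}

/// Line 3: the equation. `display` is only emitted after `decode`,
/// because it is a trailing stage and cannot exist without it.
pub fn equation(host_network_ms: f64, decode_ms: Option<f64>, display_ms: Option<f64>) -> String {
    let mut line = format!("= host+network {:.1}", host_network_ms);
    if let Some(decode) = decode_ms {
        line.push_str(&format!(" + decode {:.1}", decode));
        if let Some(display) = display_ms {
            line.push_str(&format!(" + display {:.1}", display));
        }
    }
    line
}

/// Line 4: counters, or `None` when every value is zero.
/// pct = lost / (received + lost), as a percentage.
pub fn counters(lost: u64, received: u64, skipped: u64, fec: u64) -> Option<String> {
    if lost == 0 && skipped == 0 && fec == 0 {
        return None;
    }
    let total = lost + received;
    let pct = if total == 0 { 0.0 } else { 100.0 * lost as f64 / total as f64 };
    Some(format!("lost {} ({:.1}%) · skipped {} · FEC {}", lost, pct, skipped, fec))
}
```

With the example values from the layout, `equation(9.8, Some(2.1), Some(2.3))` reproduces the third HUD line, and `equation(9.8, None, None)` gives the single-term form a `capture→received` platform would show.
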
### Per-client endpoints (v1)

| client | displayed point | headline reads | equation terms |
|---|---|---|---|
| Apple stage-2 | display-link target present instant | `capture→on-glass` | host+network + decode + display |
| Apple stage-1 (fallback presenter) | n/a (opaque `AVSampleBufferDisplayLayer`) | `capture→received` | host+network only |
| Windows | post-`Present()` on the render thread | `capture→on-glass` | all three |
| Linux GTK | paintable-set (GTK adds ~1 compositor cycle after) | `capture→displayed` | all three |
| Android | MediaCodec output released to surface | `capture→decoded` (v1) | host+network + decode |
| probe (headless log) | n/a | `capture→received` (was "reassembled") | logs p50/p95/p99/max µs, whole-session; measurement tool, exempt from the 1 s window rule but uses the canonical point names |

Host web console (`stats_recorder`) keeps its own additive stage vocabulary
(`queue → capture → submit → encode → send`, p50/p99, 1–2 s windows). It is an operator
deep-dive tool, already additive/stacked, and out of scope here except that its stage
names must stay consistent with the docs. The sum of the host stages is our analogue of
Sunshine's "host processing latency" (capture→send).

## Moonlight comparison (summary; full matrix in docs-site/content/docs/stats.md)

- Moonlight's latency lines are **disjoint client segments** (decode, queue delay,
  render) plus Sunshine's host capture→send; nothing measures the wire, and there is
  **no end-to-end line**. Our `end-to-end` must never be compared against any single
  Moonlight line. Fair approximation:
  `punktfunk end-to-end ≈ ML host processing latency + ~½ RTT + ML decoding time +
  ML frame queue delay + ML rendering time`.
- Moonlight "Average network latency" = **ENet control-channel RTT**, not frame
  latency; we intentionally have no equivalent line.
- Moonlight "Video stream … FPS" *includes inferred-lost frames* (host-rate estimate
  from frame-number gaps); our `fps` counts received only, so the two are equal at
  ~0 loss.
- Moonlight decode/queue/render times are **means**; ours are p50s.

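The fair approximation in the first bullet is plain arithmetic and can be written out as below. The struct and its field names are hypothetical. Which RTT reading to plug in is also an assumption: Moonlight's control-channel RTT is the only RTT its overlay offers, and it is at best a proxy for the video path.

```rust
/// Readings taken from a Moonlight/Sunshine overlay, all in milliseconds.
/// Hypothetical container for illustration only.
pub struct MoonlightReadings {
    pub host_processing_ms: f64,
    pub rtt_ms: f64,
    pub decoding_ms: f64,
    pub frame_queue_delay_ms: f64,
    pub rendering_ms: f64,
}

/// Approximate end-to-end figure comparable to our headline:
/// host processing + ~½ RTT + decoding + frame queue delay + rendering.
pub fn approx_end_to_end_ms(m: &MoonlightReadings) -> f64 {
    m.host_processing_ms
        + m.rtt_ms / 2.0
        + m.decoding_ms
        + m.frame_queue_delay_ms
        + m.rendering_ms
}
```

The result mixes Moonlight's means with a halved round trip, so it is only a ballpark to set against our directly measured p50.
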
## Phase 2 (implemented): split `host+network` via the 0xCF host-timing plane

Not an AU-header change after all: the hardened data-plane format stays untouched.
The host reports its share on the established QUIC side-plane pattern:

- **Cap bit**: `Hello::video_caps` gains `VIDEO_CAP_HOST_TIMING` (0x08). NativeClient
  ORs it in unconditionally; the probe sets it explicitly. Old hosts ignore it.
- **Datagram**: `HOST_TIMING_MAGIC` 0xCF, 13 bytes:
  `[tag][pts_ns u64 LE][host_us u32 LE]` (`quic::HostTiming`). Emitted once per AU by
  the send thread right after the AU's last packet left the socket, so `host_us` =
  capture→fully-sent (capture read/convert, encode, FEC+seal, paced send) against the
  same anchor as the wire pts. Speed-test filler (FLAG_PROBE) is skipped. The synthetic
  host emits it too (loopback protocol tests cover the plane).
- **Client math**: correlate by `pts_ns` (a bounded pending ring of receipt samples),
  `host = host_us`, `network = hostnet_sample − host_us` (saturating), so the two terms
  tile the `host+network` stage per frame by construction. Equation line becomes
  `= host {x} + network {y} + decode + display`; when no 0xCF arrived in the window
  (old host / all datagrams lost) it falls back to the combined `host+network` term.
- **Surfaces**: `NativeClient::next_host_timing()` (Rust clients), C ABI
  `PunktfunkHostTiming` + `punktfunk_connection_next_host_timing` (Apple), probe log
  line `host/network latency split` (host_p50/p95_us · net_p50/p95_us).
- This is our direct analogue of Sunshine's "host processing latency"; ours
  additionally includes the paced send (theirs stops just before the UDP send).

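The datagram layout and the client math above can be sketched as follows. This is not the crate's code: `HostTiming` here is a stand-in for `quic::HostTiming` built only from the byte layout given in the list, and `PendingSplit` with its methods is a hypothetical name for the pending ring. The sketch assumes the receipt sample is recorded before the matching 0xCF datagram arrives.

```rust
use std::collections::VecDeque;

pub const HOST_TIMING_MAGIC: u8 = 0xCF;
pub const HOST_TIMING_LEN: usize = 13;

/// Stand-in for the decoded datagram: `[tag][pts_ns u64 LE][host_us u32 LE]`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HostTiming {
    pub pts_ns: u64,
    pub host_us: u32,
}

/// Parses one host-timing datagram; `None` on wrong length or tag.
pub fn parse_host_timing(datagram: &[u8]) -> Option<HostTiming> {
    if datagram.len() != HOST_TIMING_LEN || datagram[0] != HOST_TIMING_MAGIC {
        return None;
    }
    let mut pts = [0u8; 8];
    pts.copy_from_slice(&datagram[1..9]);
    let mut host = [0u8; 4];
    host.copy_from_slice(&datagram[9..13]);
    Some(HostTiming {
        pts_ns: u64::from_le_bytes(pts),
        host_us: u32::from_le_bytes(host),
    })
}

/// Bounded ring of `(pts_ns, hostnet_us)` receipt samples awaiting a report.
pub struct PendingSplit {
    pending: VecDeque<(u64, u64)>,
    capacity: usize,
}

impl PendingSplit {
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1);
        Self { pending: VecDeque::with_capacity(capacity), capacity }
    }

    /// Records a receipt sample, evicting the oldest entry when full.
    pub fn on_receipt(&mut self, pts_ns: u64, hostnet_us: u64) {
        if self.pending.len() == self.capacity {
            self.pending.pop_front();
        }
        self.pending.push_back((pts_ns, hostnet_us));
    }

    /// Matches a host report by pts and returns `(host_us, network_us)`.
    /// `None` means no receipt sample is pending for that pts.
    pub fn on_host_timing(&mut self, t: HostTiming) -> Option<(u64, u64)> {
        let idx = self.pending.iter().position(|&(pts, _)| pts == t.pts_ns)?;
        let (_, hostnet_us) = self.pending.remove(idx)?;
        let host_us = u64::from(t.host_us);
        // Saturating: a report larger than the sample yields network = 0.
        Some((host_us, hostnet_us.saturating_sub(host_us)))
    }
}
```

A window in which `on_host_timing` never returns `Some` corresponds to the fallback case, where the HUD keeps the combined `host+network` term.
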
Still open for a later phase: surfacing the QUIC path RTT (quinn exposes it) as a
diagnostics line, clearly labelled control-plane RTT.

## Implementation status

- [x] Apple (`StreamHUDView`/`SessionModel`/`Stage2Pipeline` + `LatencyMeter` reuse): 09a5957
- [x] Windows (`app/stream.rs` HUD rows, `session.rs`/`render.rs` meters → p50/p95): 09a5957
- [x] Linux (`ui_stream.rs` OSD, `session.rs` window meters): 09a5957
- [x] Android (`stats.rs`/`decode.rs` stage split, `StatsOverlay.kt`): 09a5957
- [x] probe (rename `capture→reassembled` → `capture→received` in the log line): 09a5957
- [x] docs-site stats page + matrix; link from `moonlight.md`: 09a5957
- [x] Phase 2 wire layer (0xCF + cap bit + NativeClient/ABI + host emission + probe split): 449a67c
- [ ] Phase 2 client HUDs (host/network equation terms on Apple/Windows/Linux/Android)
- [ ] On-glass validation everywhere (Mac swift test + glass, Windows CI + glass, Android device, Linux glass)