feat(clients): unified stats vocabulary across every client + Moonlight comparison docs

One stat model everywhere (design/stats-unification.md): four measurement points (capture/received/decoded/displayed), three stages that tile the interval exactly, and a HUD that shows the addition explicitly — end-to-end 14.2 ms p50 · 19.8 p95 · capture→on-glass = host+network 9.8 + decode 2.1 + display 2.3 replacing each client's ad-hoc mix of overlapping absolutes (the Apple HUD's three arrow lines that looked sequential but weren't), mean-vs-median decode times (Windows/Linux), missing same-host-clock flags (Windows/Linux), and three different names for the same capture→received measurement (probe's "reassembled", Apple/Android's "client", Windows/Linux's post-decode "lat"). Per client: Apple threads receivedNs through the VT decode via the frame refcon bit pattern so the decode stage exists at all (stage-1 fallback honestly degrades to a capture→received headline); Windows carries FrameTimes through the existing frame channel to the render thread and adds e2e p50/p95 post-Present; Linux stamps received at AU pop and rides decoded_ns on DecodedFrame to the paintable-set site; Android pairs receipt stamps with MediaCodec output buffers via the codec's pts round-trip (JNI stats array 14→16 doubles, indexes 0-13 unchanged). fps now uniformly counts received AUs; lost/(received+lost) per window, hidden at zero. docs-site gains "Understanding the Stats Overlay": what each line means, why the equation only approximately sums (percentiles), and a line-by-line Moonlight/Sunshine matrix — including that Moonlight has no end-to-end number and its "network latency" is an ENet control RTT, so punktfunk's headline must not be compared against any single Moonlight line. Verified here: linux client + probe + core check/clippy/fmt green, android native cargo-ndk arm64 check green. Pending: Windows CI + on-glass, swift test on the mac, on-device Android. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 21:01:29 +00:00
parent c7630ff5dc
commit 09a5957c6d
38 changed files with 1122 additions and 380 deletions
@@ -0,0 +1,142 @@
+# Unified streaming stats — one vocabulary across every client
+
+Status: spec agreed 2026-07-03; implementation tracked per client below.
+User-facing companion: `docs-site/content/docs/stats.md` (overlay guide + Moonlight matrix).
+
+## Why
+
+Prospective users compare our numbers against Moonlight/Sunshine's overlay. Before this
+spec the clients disagreed with each other (same concept, different name, different
+measurement point, mean vs median, missing skew flags), and none of them made clear
+which numbers are *sequential stages that add up* versus *overlapping absolutes*. The
+Apple HUD showed three arrow-notation lines (`capture→client`, `capture→present`,
+`decode→present`) that looked like a pipeline but overlapped, left the decode interval
+invisible, and mixed two clock bases. Meanwhile Moonlight shows *only* disjoint
+client-side segments and has **no end-to-end number at all** — so our headline
+end-to-end figure, presented without context, reads "worse" against a Moonlight line
+that measures a fraction of the chain.
+
+## The model
+
+Four measurement points per video frame. Every stat on every HUD is a difference of
+two of these points — nothing else.
+
+| point | meaning | who stamps it |
+|---|---|---|
+| **capture** | host capture clock: the per-AU `pts_ns`. Native path anchors at PipeWire frame delivery (host queue age is *inside* it); GameStream uses the RTP 90 kHz clock. | host |
+| **received** | AU fully reassembled (post-FEC), handed to the client, before decode | client |
+| **decoded** | decoder output frame available | client |
+| **displayed** | best-effort presentation instant (see per-client endpoint table) | client |
+
+Three **stages** tile the interval exactly (per frame, same timestamps, no gaps, no
+overlap):
+
+- `host+network` = capture → received. Contains the whole host pipeline
+  (queue/capture/convert/encode/pace) **plus** wire time and reassembly; it cannot be
+  split client-side today (see Phase 2).
+- `decode` = received → decoded. Pure client-local; no clock skew involved.
+- `display` = decoded → displayed. Pacing/queue wait + render + vsync. Client-local.
+
+Headline: **`end-to-end`** = capture → displayed, **measured directly** (not the sum of
+stage percentiles). Per frame the stages sum to it exactly; per window the p50s only
+approximately sum (percentiles aren't additive) — the doc says so, the HUD shows the
+equation with the directly-measured total on the left.
+
+### Clock rules
+
+- `end-to-end` and `host+network` use client `CLOCK_REALTIME` + `clock_offset_ns`
+  (the ClockProbe/ClockEcho skew handshake). Cross-machine valid when the offset was
+  measured.
+- `decode` and `display` are single-clock client-local (offset irrelevant).
+- When `clock_offset_ns == 0` (host didn't answer the handshake / same host), the HUD
+  appends **`(same-host clock)`** once, on the end-to-end line — every client, no
+  exceptions. Windows and Linux previously showed possibly-uncorrected numbers
+  silently; that was a bug in presentation.
+
+### Aggregation rules
+
+- **Window: 1 s tumbling** (drain and reset), HUD refresh 1×/s. (Moonlight uses a
+  1–2 s sliding window — close enough for comparison; documented in the matrix.)
+- **Percentiles only, never means**: end-to-end shows `p50` and `p95`; stages show
+  `p50`. Windows/Linux previously showed *mean* decode time next to median latencies —
+  banned.
+- Latency samples clamped to `(0, 10 s)` as before.
+- `fps` = AUs received (reassembled) per second, actual-elapsed-time denominator.
+- `Mb/s` = received payload bytes × 8 / elapsed (goodput, excludes FEC overhead —
+  same basis as Moonlight's bitrate tracker).
+
+## Canonical HUD layout
+
+Line order and label strings are normative; platform chips vary.
+
+```
+1920×1080@120 · 119 fps · 38.2 Mb/s · HEVC 10-bit HDR · GPU decode
+end-to-end 14.2 ms p50 · 19.8 p95 · capture→on-glass
+= host+network 9.8 + decode 2.1 + display 2.3
+lost 3 (0.1%) · skipped 1 · FEC 12
+```
+
+- Line 1 — stream facts. `{W}×{H}@{Hz} · {fps} fps · {mbps} Mb/s` plus whatever
+  chips the platform already knows (codec, bit depth, HDR, GPU/CPU decode).
+- Line 2 — the headline. `end-to-end {p50} ms p50 · {p95} p95 · capture→{endpoint}`
+  (+ ` (same-host clock)` when applicable). The endpoint suffix is honest per
+  platform — see the endpoint table. Never a bare "latency".
+- Line 3 — the equation. `= host+network {a} + decode {b} + display {c}` (stage p50s,
+  ms, one decimal). A platform that can't measure a trailing stage drops the term and
+  the headline endpoint moves accordingly — the equation always tiles the headline
+  interval.
+- Line 4 — counters, only rendered when any value is nonzero.
+  `lost` = unrecoverable network frame drops in the window (`{n} ({pct}%)`,
+  pct = lost/(received+lost)); `skipped` = client-side newest-wins/pacing drops;
+  `FEC` = shards recovered this window (proof FEC is earning its keep).
+
+### Per-client endpoints (v1)
+
+| client | displayed point | headline reads | equation terms |
+|---|---|---|---|
+| Apple stage-2 | display-link target present instant | `capture→on-glass` | host+network + decode + display |
+| Apple stage-1 (fallback presenter) | n/a (opaque `AVSampleBufferDisplayLayer`) | `capture→received` | host+network only |
+| Windows | post-`Present()` on the render thread | `capture→on-glass` | all three |
+| Linux GTK | paintable-set (GTK adds ~1 compositor cycle after) | `capture→displayed` | all three |
+| Android | MediaCodec output released to surface | `capture→decoded` (v1) | host+network + decode |
+| probe (headless log) | n/a | `capture→received` (was "reassembled") | logs p50/p95/p99/max µs, whole-session — measurement tool, exempt from the 1 s window rule but uses the canonical point names |
+
+Host web console (`stats_recorder`) keeps its own additive stage vocabulary
+(`queue → capture → submit → encode → send`, p50/p99, 1–2 s windows) — operator
+deep-dive tool, already additive/stacked, out of scope here except that its stage names
+must stay consistent with the docs. The sum of the host stages is our analogue of
+Sunshine's "host processing latency" (capture→send).
+
+## Moonlight comparison (summary — full matrix in docs-site/content/docs/stats.md)
+
+- Moonlight's latency lines are **disjoint client segments** (decode, queue delay,
+  render) plus Sunshine's host capture→send; nothing measures the wire, and there is
+  **no end-to-end line**. Our `end-to-end` must never be compared against any single
+  Moonlight line. Fair approximation:
+  `punktfunk end-to-end ≈ ML host processing latency + ~½ RTT + ML decoding time +
+  ML frame queue delay + ML rendering time`.
+- Moonlight "Average network latency" = **ENet control-channel RTT** — not frame
+  latency; we intentionally have no equivalent line.
+- Moonlight "Video stream … FPS" *includes inferred-lost frames* (host-rate estimate
+  from frame-number gaps); our `fps` counts received only — equal at ~0 loss.
+- Moonlight decode/queue/render times are **means**; ours are p50s.
+
+## Phase 2 (specced, not in v1): split `host+network`
+
+Carry the host's capture→send duration per AU (host stamps it at send, e.g. a
+varint-µs field in the AU header or a 0.1 ms u16 à la Sunshine's frame header). Client
+then displays `host {x} + network {y}` instead of `host+network`, where
+`network = (received − capture) − host_reported` — and the Moonlight matrix gains a
+direct "Host processing latency" counterpart. Requires a core wire/ABI bump
+(`punktfunk_frame` gains `host_latency_us`), trailing-byte back-compat like the
+compositor/gamepad preference bytes. Also consider surfacing the QUIC path RTT
+(quinn exposes it) as a diagnostics line, clearly labelled control-plane RTT.
+
+## Implementation status
+
+- [ ] Apple (`StreamHUDView`/`SessionModel`/`Stage2Pipeline` + `LatencyMeter` reuse)
+- [ ] Windows (`app/stream.rs` HUD rows, `session.rs`/`render.rs` meters → p50/p95)
+- [ ] Linux (`ui_stream.rs` OSD, `session.rs` window meters)
+- [ ] Android (`stats.rs`/`decode.rs` stage split, `StatsOverlay.kt`)
+- [ ] probe (rename `capture→reassembled` → `capture→received` in the log line)
+- [ ] docs-site stats page + matrix; link from `moonlight.md`