Files
punktfunk/design/stats-unification.md
T
enricobuehler 69609945a3 feat(clients): host/network split in every stats HUD (stats phase 2, client side)
Consumes the 0xCF host-timing plane (449a67c) on all four GUI clients: each
keeps a bounded pending ring of receipt samples keyed by pts, matches the
host's per-AU capture→sent reports against it, and the HUD equation becomes

  = host 3.1 + network 6.7 + decode 2.1 + display 2.3

falling back to the combined `= host+network …` term whenever no timing
matched the window (old host / datagram loss) — same total, one split
fewer, never a misleading zero. Apple additionally gains the split as the
only equation line under the stage-1 fallback presenter (receipt is
presenter-independent), a `nextHostTiming` wrapper with its own plane lock,
and a unit-tested `HostNetworkSplitter`; Android extends the JNI stats
array 16→18 doubles (0–15 unchanged); Windows/Linux thread the split
through `Stats` into the HUD and the headless/debug logs.

Docs updated: design/stats-unification.md Phase 2 → implemented (wire
format, fallback semantics), and the docs-site matrix's Sunshine "Host
processing latency" row is now a direct match (ours includes the paced
send; avg vs p50).

Verified here: linux client clippy -D warnings green on the live tree,
windows stub check + hand-verified diff, android cargo-ndk arm64 check
green, apple loopback test extended (needs the rebuilt xcframework + swift
test on the mac). On-glass: pending on all platforms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 21:31:49 +00:00

162 lines
9.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Unified streaming stats — one vocabulary across every client
Status: spec agreed 2026-07-03; implementation tracked per client below.
User-facing companion: `docs-site/content/docs/stats.md` (overlay guide + Moonlight matrix).
## Why
Prospective users compare our numbers against Moonlight/Sunshine's overlay. Before this
spec the clients disagreed with each other (same concept, different name, different
measurement point, mean vs median, missing skew flags), and none of them made clear
which numbers are *sequential stages that add up* versus *overlapping absolutes*. The
Apple HUD showed three arrow-notation lines (`capture→client`, `capture→present`,
`decode→present`) that looked like a pipeline but overlapped, left the decode interval
invisible, and mixed two clock bases. Meanwhile Moonlight shows *only* disjoint
client-side segments and has **no end-to-end number at all** — so our headline
end-to-end figure, presented without context, reads "worse" against a Moonlight line
that measures a fraction of the chain.
## The model
Four measurement points per video frame. Every stat on every HUD is a difference of
two of these points — nothing else.
| point | meaning | who stamps it |
|---|---|---|
| **capture** | host capture clock: the per-AU `pts_ns`. Native path anchors at PipeWire frame delivery (host queue age is *inside* it); GameStream uses the RTP 90 kHz clock. | host |
| **received** | AU fully reassembled (post-FEC), handed to the client, before decode | client |
| **decoded** | decoder output frame available | client |
| **displayed** | best-effort presentation instant (see per-client endpoint table) | client |
Three **stages** tile the interval exactly (per frame, same timestamps, no gaps, no
overlap):
- `host+network` = capture → received. Contains the whole host pipeline
(queue/capture/convert/encode/pace) **plus** wire time and reassembly; it cannot be
split client-side today (see Phase 2).
- `decode` = received → decoded. Pure client-local; no clock skew involved.
- `display` = decoded → displayed. Pacing/queue wait + render + vsync. Client-local.
Headline: **`end-to-end`** = capture → displayed, **measured directly** (not the sum of
stage percentiles). Per frame the stages sum to it exactly; per window the p50s only
approximately sum (percentiles aren't additive) — the doc says so, the HUD shows the
equation with the directly-measured total on the left.
### Clock rules
- `end-to-end` and `host+network` use client `CLOCK_REALTIME` + `clock_offset_ns`
(the ClockProbe/ClockEcho skew handshake). Cross-machine valid when the offset was
measured.
- `decode` and `display` are single-clock client-local (offset irrelevant).
- When `clock_offset_ns == 0` (host didn't answer the handshake / same host), the HUD
appends **`(same-host clock)`** once, on the end-to-end line — every client, no
exceptions. Windows and Linux previously showed possibly-uncorrected numbers
silently; that was a bug in presentation.
### Aggregation rules
- **Window: 1 s tumbling** (drain and reset), HUD refresh 1×/s. (Moonlight uses a
12 s sliding window — close enough for comparison; documented in the matrix.)
- **Percentiles only, never means**: end-to-end shows `p50` and `p95`; stages show
`p50`. Windows/Linux previously showed *mean* decode time next to median latencies —
banned.
- Latency samples clamped to `(0, 10 s)` as before.
- `fps` = AUs received (reassembled) per second, actual-elapsed-time denominator.
- `Mb/s` = received payload bytes × 8 / elapsed (goodput, excludes FEC overhead —
same basis as Moonlight's bitrate tracker).
## Canonical HUD layout
Line order and label strings are normative; platform chips vary.
```
1920×1080@120 · 119 fps · 38.2 Mb/s · HEVC 10-bit HDR · GPU decode
end-to-end 14.2 ms p50 · 19.8 p95 · capture→on-glass
= host+network 9.8 + decode 2.1 + display 2.3
lost 3 (0.1%) · skipped 1 · FEC 12
```
- Line 1 — stream facts. `{W}×{H}@{Hz} · {fps} fps · {mbps} Mb/s` plus whatever
chips the platform already knows (codec, bit depth, HDR, GPU/CPU decode).
- Line 2 — the headline. `end-to-end {p50} ms p50 · {p95} p95 · capture→{endpoint}`
(+ ` (same-host clock)` when applicable). The endpoint suffix is honest per
platform — see the endpoint table. Never a bare "latency".
- Line 3 — the equation. `= host+network {a} + decode {b} + display {c}` (stage p50s,
ms, one decimal). A platform that can't measure a trailing stage drops the term and
the headline endpoint moves accordingly — the equation always tiles the headline
interval.
- Line 4 — counters, only rendered when any value is nonzero.
`lost` = unrecoverable network frame drops in the window (`{n} ({pct}%)`,
pct = lost/(received+lost)); `skipped` = client-side newest-wins/pacing drops;
`FEC` = shards recovered this window (proof FEC is earning its keep).
### Per-client endpoints (v1)
| client | displayed point | headline reads | equation terms |
|---|---|---|---|
| Apple stage-2 | display-link target present instant | `capture→on-glass` | host+network + decode + display |
| Apple stage-1 (fallback presenter) | n/a (opaque `AVSampleBufferDisplayLayer`) | `capture→received` | host+network only |
| Windows | post-`Present()` on the render thread | `capture→on-glass` | all three |
| Linux GTK | paintable-set (GTK adds ~1 compositor cycle after) | `capture→displayed` | all three |
| Android | MediaCodec output released to surface | `capture→decoded` (v1) | host+network + decode |
| probe (headless log) | n/a | `capture→received` (was "reassembled") | logs p50/p95/p99/max µs, whole-session — measurement tool, exempt from the 1 s window rule but uses the canonical point names |
Host web console (`stats_recorder`) keeps its own additive stage vocabulary
(`queue → capture → submit → encode → send`, p50/p99, 12 s windows) — operator
deep-dive tool, already additive/stacked, out of scope here except that its stage names
must stay consistent with the docs. The sum of the host stages is our analogue of
Sunshine's "host processing latency" (capture→send).
## Moonlight comparison (summary — full matrix in docs-site/content/docs/stats.md)
- Moonlight's latency lines are **disjoint client segments** (decode, queue delay,
render) plus Sunshine's host capture→send; nothing measures the wire, and there is
**no end-to-end line**. Our `end-to-end` must never be compared against any single
Moonlight line. Fair approximation:
`punktfunk end-to-end ≈ ML host processing latency + ~½ RTT + ML decoding time +
ML frame queue delay + ML rendering time`.
- Moonlight "Average network latency" = **ENet control-channel RTT** — not frame
latency; we intentionally have no equivalent line.
- Moonlight "Video stream … FPS" *includes inferred-lost frames* (host-rate estimate
from frame-number gaps); our `fps` counts received only — equal at ~0 loss.
- Moonlight decode/queue/render times are **means**; ours are p50s.
## Phase 2 (implemented): split `host+network` via the 0xCF host-timing plane
Not an AU-header change after all — the hardened data-plane format stays untouched.
The host reports its share on the established QUIC side-plane pattern:
- **Cap bit**: `Hello::video_caps` gains `VIDEO_CAP_HOST_TIMING` (0x08). NativeClient
ORs it in unconditionally; the probe sets it explicitly. Old hosts ignore it.
- **Datagram**: `HOST_TIMING_MAGIC` 0xCF, 13 bytes — `[tag][pts_ns u64 LE][host_us
u32 LE]` (`quic::HostTiming`). Emitted once per AU by the send thread right after
the AU's last packet left the socket, so `host_us` = capture→fully-sent (capture
read/convert, encode, FEC+seal, paced send) against the same anchor as the wire
pts. Speed-test filler (FLAG_PROBE) is skipped. The synthetic host emits it too
(loopback protocol tests cover the plane).
- **Client math**: correlate by `pts_ns` (a bounded pending ring of receipt samples),
`host = host_us`, `network = hostnet_sample host_us` (saturating) — the two terms
tile the `host+network` stage per frame by construction. Equation line becomes
`= host {x} + network {y} + decode + display`; when no 0xCF arrived in the window
(old host / all datagrams lost) it falls back to the combined `host+network` term.
- **Surfaces**: `NativeClient::next_host_timing()` (Rust clients), C ABI
`PunktfunkHostTiming` + `punktfunk_connection_next_host_timing` (Apple), probe log
line `host/network latency split` (host_p50/p95_us · net_p50/p95_us).
- This is our direct analogue of Sunshine's "host processing latency" — ours
additionally includes the paced send (theirs stops just before the UDP send).
Still open for a later phase: surfacing the QUIC path RTT (quinn exposes it) as a
diagnostics line, clearly labelled control-plane RTT.
## Implementation status
- [x] Apple (`StreamHUDView`/`SessionModel`/`Stage2Pipeline` + `LatencyMeter` reuse) — 09a5957
- [x] Windows (`app/stream.rs` HUD rows, `session.rs`/`render.rs` meters → p50/p95) — 09a5957
- [x] Linux (`ui_stream.rs` OSD, `session.rs` window meters) — 09a5957
- [x] Android (`stats.rs`/`decode.rs` stage split, `StatsOverlay.kt`) — 09a5957
- [x] probe (rename `capture→reassembled` → `capture→received` in the log line) — 09a5957
- [x] docs-site stats page + matrix; link from `moonlight.md` — 09a5957
- [x] Phase 2 wire layer (0xCF + cap bit + NativeClient/ABI + host emission + probe split) — 449a67c
- [ ] Phase 2 client HUDs (host/network equation terms on Apple/Windows/Linux/Android)
- [ ] On-glass validation everywhere (Mac swift test + glass, Windows CI + glass, Android device, Linux glass)