# Stats capture & graphing — design

Goal: let an operator **enable performance-stats capture from the web console**, play a session,
**stop**, and **review the captured time-series as graphs** in the web console. Captures are
**saved to disk** (browse/compare past sessions; survive host restart) and cover **both** streaming
paths: native punktfunk/1 (`virtual_stream`) and GameStream/Moonlight (`gamestream/stream.rs`).

This builds on the existing per-stage instrumentation (today gated by `PUNKTFUNK_PERF=1`,
stdout-only, read once at startup). We make recording **runtime-toggleable**, route the same
aggregates into a **shared ring → on-disk recording**, and expose it over the mgmt REST API + web
console.

---

## 1. Host: shared `StatsRecorder`

New module `crates/punktfunk-host/src/stats_recorder.rs`. One `Arc<StatsRecorder>` is created once
in the unified host entry (`gamestream::serve`, the `serve` subcommand) alongside
`Arc<NativePairing>`, and shared with **both** the mgmt API (`MgmtState`) and the streaming loops
(threaded through `punktfunk1::serve` → `SessionContext` → `virtual_stream`/`send_loop`, and into
the GameStream encode loop). Mirror the existing `NativePairing` Arc-sharing pattern exactly.

### Data model (serde + utoipa `ToSchema`; this is the wire + on-disk shape)

```rust
/// One pipeline stage's latency in a window (microseconds).
pub struct StageTiming {
    pub name: String, // "capture" | "submit" | "encode" | "packetize" | "send"
    pub p50_us: f32,
    pub p99_us: f32,
}

/// One aggregated sample (~ every 2 s native, ~ every 1 s GameStream).
pub struct StatsSample {
    pub t_ms: u64,                // ms since capture start (monotonic, from a stored Instant)
    pub session_id: u32,          // disambiguates concurrent sessions (usually constant)
    pub stages: Vec<StageTiming>, // ordered pipeline stages for this path
    pub fps: f32,                 // genuine NEW frames/s from the source
    pub repeat_fps: f32,          // re-encoded holds/s (source-starvation indicator)
    pub mbps: f32,                // tx goodput (Mb/s)
    pub bitrate_kbps: u32,        // configured target bitrate
    pub frames_dropped: u32,      // delta in this window
    pub packets_dropped: u32,     // delta (receiver-side / reassembler), where known
    pub send_dropped: u32,        // delta (host send-buffer overflow / EAGAIN)
    pub fec_recovered: u32,       // delta (shards recovered)
}

pub struct CaptureMeta {
    pub id: String,          // "2026-06-26T20-14-03Z_5120x1440" — also the filename stem
    pub started_unix_ms: u64,
    pub duration_ms: u64,
    pub kind: String,        // "native" | "gamestream"
    pub width: u32,
    pub height: u32,
    pub fps: u32,
    pub codec: String,       // "h264" | "hevc" | "av1"
    pub client: String,      // short label / fingerprint prefix, or "" if unknown
    pub sample_count: u32,
}

pub struct Capture {
    pub meta: CaptureMeta,
    pub samples: Vec<StatsSample>,
}

pub struct StatsStatus {
    pub armed: bool,          // capture currently running
    pub sample_count: u32,    // samples in the in-progress capture
    pub started_unix_ms: u64, // 0 if idle
    pub kind: String,         // path of the in-progress capture, "" if idle
}
```

Stage sets per path (ordered, roughly the per-frame critical path so stacking is meaningful):

- **native**: `capture` (try_latest ring read + color convert), `submit` (NVENC enqueue),
  `encode` (lock_bitstream = NVENC schedule + ASIC — the dominant stage under GPU load),
  `send` (paced_submit: seal + FEC + pace + sendmmsg).
- **gamestream**: `capture`, `encode`, `packetize` (poll+FEC+packetize), `send`.

> Native naming: today's vectors are `st_cap`→`capture`, `st_submit`→`submit`,
> `st_wait`→`encode`, `pace_us`→`send`.
> (`encode_us` total ≈ capture+submit+encode; we do not
> emit it as a stage to avoid double-counting — it's implied by the stack.)

### Recorder API

```rust
pub struct StatsRecorder { /* dir, armed: AtomicBool, live: Mutex<Option<Capture>>, next_sid: AtomicU32 */ }

impl StatsRecorder {
    pub fn new(dir: PathBuf) -> Arc<Self>; // creates dir (0700) if missing
    pub fn is_armed(&self) -> bool;        // cheap Relaxed atomic load — called on the hot path

    /// Arm a new capture. No-op if already armed (returns current status).
    pub fn start(&self) -> StatsStatus;

    /// A streaming loop announces itself when it first records while armed.
    /// Seeds CaptureMeta (kind/w/h/fps/codec/client) on the FIRST registration. Returns session_id.
    pub fn register_session(&self, kind: &'static str, w: u32, h: u32, fps: u32,
                            codec: &str, client: &str) -> u32;

    /// Append one aggregated sample (called from the loops' existing ~2 s/~1 s boundary).
    /// Bounded: cap at MAX_SAMPLES (e.g. 5400 ≈ 3 h @ 2 s). On overflow, stop appending and
    /// set a `truncated` flag (DO NOT drop oldest — a saved recording must keep its start).
    pub fn push_sample(&self, session_id: u32, sample: StatsSample);

    /// Disarm + finalize: write <dir>/<id>.json atomically, clear live, return saved meta.
    pub fn stop(&self) -> std::io::Result<Option<CaptureMeta>>;

    pub fn status(&self) -> StatsStatus;
    pub fn live_snapshot(&self) -> Option<Capture>; // clone of the in-progress capture for live graphing
    pub fn list(&self) -> Vec<CaptureMeta>;         // scan dir, parse meta only, newest first
    pub fn load(&self, id: &str) -> std::io::Result<Capture>;
    pub fn delete(&self, id: &str) -> std::io::Result<()>;
}
```

Invariants / safety:

- **No async on the per-frame path.** `is_armed()` is a `Relaxed` atomic load; sample construction
  happens only at the existing 2 s / 1 s aggregation boundary, never per frame.
- **`id` is path-traversal-safe.** `load`/`delete` MUST reject any id not matching
  `^[A-Za-z0-9._-]+$` (no `/`, no `:` — keep it a valid Windows filename) and any id containing
  `..` (which that character class alone would admit), and only ever join `dir/<id>.json`.
  Return NotFound on reject.
  (Endpoints are bearer-authed, but defend in depth.)
- **Bounded memory.** `MAX_SAMPLES` cap; truncate (keep oldest), never unbounded.
- **Atomic disk write.** Write to `<id>.json.tmp` then rename, so a crash mid-write can't leave a
  half file. Pretty-print not required; compact JSON is fine.
- Captures dir: `~/.config/punktfunk/captures/` (next to `cert.pem` etc.). Resolve via the same
  config-dir helper the rest of the host uses.

### Runtime gating change (the key behavioral change)

Today the loops measure per-stage timing only `if perf` (a startup bool). Change the per-frame
**measurement** predicate to `let measure = perf || recorder.is_armed();`, re-evaluated each frame
(cheap atomic). Then at the aggregation boundary:

- if `perf` → keep the existing `tracing::info!` log line (unchanged behavior);
- if `recorder.is_armed()` → also build a `StatsSample` and `push_sample`.

So `PUNKTFUNK_PERF=1` still works exactly as before, AND the web toggle now works at runtime with
zero startup flags.

### Where each loop emits the sample

- **native** (`punktfunk1.rs`): the cap/submit/encode(`st_wait`) splits live in the capture thread;
  `mbps`/`send_dropped`/`bytes` and `session.stats()` live in the send thread. Emit the complete
  sample from **one** place. Cleanest: carry the per-frame `cap_us/submit_us/wait_us` (and a
  `repeat: bool`) on `FrameMsg` to the send thread (it already carries `encode_us`), so `send_loop`
  builds the whole sample at its existing 2 s boundary where `session.stats()` is already read.
  Compute `frames_dropped/packets_dropped/send_dropped/fec_recovered` as deltas vs the previous
  window's `Session::stats()` snapshot (the loop already tracks `last_bytes` /
  `last_send_dropped` — extend that bookkeeping). `register_session` is called once with the
  negotiated mode/codec and the client label.
- **gamestream** (`gamestream/stream.rs`): the encode loop already tracks per-stage max each 1 s.
  Add p50/p99 accumulation (small per-stage `Vec` like the native path) and, when
  `perf || recorder.is_armed()`, emit a `StatsSample` with stages
  `[capture, encode, packetize, send]` + fps (unique new frames) + mbps + whatever loss/byte
  counters that path exposes (use 0 where a counter doesn't exist; do NOT fabricate).
  Call `register_session("gamestream", ...)` with the GameStream-negotiated mode/codec/client.

Threading: add `stats: Arc<StatsRecorder>` to `SessionContext` and the GameStream stream setup; the
standalone `punktfunk1-host` subcommand (no mgmt) passes a fresh recorder (harmless, just unused).

---

## 2. Host: mgmt REST API (`mgmt.rs`)

Add `stats: Arc<StatsRecorder>` to `MgmtState`. Register handlers in `api_router_parts()` via
`routes!()` with `#[utoipa::path]`. All under `/api/v1`, **bearer-token only** (operator actions —
do NOT add them to the mTLS `cert_may_access` read-only allowlist). All bodies/returns derive
`ToSchema`; errors use the `ApiJson`/`ApiError` envelope. Tag every operation `stats`.

| Method & path                          | fn (operationId)         | body → returns                                             |
|----------------------------------------|--------------------------|------------------------------------------------------------|
| POST `/api/v1/stats/capture/start`     | `stats_capture_start`    | — → `StatsStatus`                                          |
| POST `/api/v1/stats/capture/stop`      | `stats_capture_stop`     | — → `CaptureMeta` (200) / 204-ish if nothing was recording |
| GET `/api/v1/stats/capture/status`     | `stats_capture_status`   | → `StatsStatus`                                            |
| GET `/api/v1/stats/capture/live`       | `stats_capture_live`     | → `Capture` (in-progress; 404/empty if idle)               |
| GET `/api/v1/stats/recordings`         | `stats_recordings_list`  | → `Vec<CaptureMeta>`                                       |
| GET `/api/v1/stats/recordings/{id}`    | `stats_recording_get`    | → `Capture`                                                |
| DELETE `/api/v1/stats/recordings/{id}` | `stats_recording_delete` | → `StatsStatus`/204                                        |

Register the new `ToSchema` types with the OpenApi derive's `components(schemas(...))` list.
Then regenerate the checked-in spec:

```
cargo run -p punktfunk-host -- openapi > api/openapi.json
```

CI fails on drift — the regenerated `api/openapi.json` MUST be committed.

---

## 3. Web console (`web/`)

New page **"Performance"** following the established route → section/index (fetch) → section/view
(presentational) pattern, registered in the `NAV` array (`app-shell.tsx`) with a lucide icon
(`Activity` or `LineChart`).

- Route: `web/src/routes/stats.tsx` → `createFileRoute('/stats')` → `SectionStats`.
- Section: `web/src/sections/Stats/index.tsx` (orval hooks) + `view.tsx` (presentational, i18n via
  Paraglide `m.*`). Use `Section`, `QueryState`, `Card`/`CardHeader`/`CardTitle`/`CardContent`,
  `Button`, `Badge` from `web/src/components/ui`.
- Charts: **add `recharts`** to `web/package.json` (no chart lib exists today). Render charts
  **client-only** (a mounted guard) so SSR doesn't choke on `ResponsiveContainer`'s 0-width measure.
  Theme via existing CSS variables / brand violet, dark-mode aware.

Data hooks come from regenerated orval (`bun run api:gen` after the host's openapi.json is updated):
`useStatsCaptureStatus`, `useStatsCaptureStart`, `useStatsCaptureStop`, `useStatsCaptureLive`,
`useStatsRecordingsList`, `useStatsRecordingGet`, `useStatsRecordingDelete` (exact names per orval's
tag/operationId convention — verify against generated output and adjust the view imports to match).

UI layout:

1. **Capture control card** — Start/Stop button (mutations; invalidate status query on success), a
   "Recording…"/"Idle" `Badge`, elapsed time + live sample count (`useStatsCaptureStatus`,
   `refetchInterval: 2000`). On Start, the live chart appears.
2. **Live chart** (visible while armed; `useStatsCaptureLive`, `refetchInterval: 2000`) — the
   latency stage breakdown as a **stacked area** (capture/submit/encode/send in µs, the "where does
   the time go" view), with fps and mbps as secondary line charts.
3. **Recordings card** — table from `useStatsRecordingsList`: time, kind badge, resolution, codec,
   duration, sample count; row actions **View** (select → detail), **Download** (export the
   `Capture` JSON via the recording GET), **Delete** (mutation, confirm).
4. **Recording detail** — when a recording (or the live capture) is selected, render the full graph
   set from its `samples`:
   - Latency stage breakdown (stacked area, µs) — primary bottleneck view; p99 overlay toggle.
   - Throughput: fps (new vs repeat) + mbps.
   - Health: frames_dropped / packets_dropped / send_dropped / fec_recovered over time.

i18n: add keys to `web/messages/en.json` + `de.json` (nav label, titles, button/labels) and
regenerate Paraglide. Keep both locales in sync.

---

## 4. Verification / done-criteria

- `cargo build -p punktfunk-host` (and `--workspace`),
  `cargo clippy --workspace --all-targets -- -D warnings`, `cargo fmt --all --check` — green.
- `cargo run -p punktfunk-host -- openapi > api/openapi.json` — committed, no drift.
- `PUNKTFUNK_PERF=1` stdout behavior unchanged (no regression to the existing perf log).
- Web: orval regen clean, typecheck/build green, charts render client-side.
- CLAUDE.md status note + this plan updated.
- Adversarial review: hot-path stays sync + bounded; `id` path-traversal-safe; OpenAPI/orval no
  drift; SSR-safe charts; both paths actually emit samples.
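
Two of the adversarial-review checks above (a path-traversal-safe `id` and a bounded buffer that keeps the start of a recording) lend themselves to small, unit-testable helpers. The following is a minimal sketch under stated assumptions, not the host's actual code: the names `is_safe_id` and `BoundedSamples` are hypothetical, and rejecting every id that contains `..` is an assumption drawn from the plan's stated intent, since the character class by itself would let such an id through.

```rust
/// Hypothetical helper: true only for ids that are safe to join as `<dir>/<id>.json`.
/// Allows ASCII letters, digits, '.', '_' and '-' (the plan's character class),
/// and additionally rejects the empty string and anything containing "..".
pub fn is_safe_id(id: &str) -> bool {
    !id.is_empty()
        && !id.contains("..")
        && id
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'-'))
}

/// Hypothetical stand-in for the recorder's in-memory sample list.
/// Once `max` samples are stored, further pushes are discarded and
/// `truncated` is set, so the start of the recording is never lost.
pub struct BoundedSamples<T> {
    samples: Vec<T>,
    max: usize,
    pub truncated: bool,
}

impl<T> BoundedSamples<T> {
    pub fn new(max: usize) -> Self {
        Self {
            samples: Vec::new(),
            max,
            truncated: false,
        }
    }

    /// Append while below the cap; otherwise drop the NEW sample and flag truncation.
    pub fn push(&mut self, sample: T) {
        if self.samples.len() < self.max {
            self.samples.push(sample);
        } else {
            self.truncated = true;
        }
    }

    /// Borrow the stored samples in insertion order.
    pub fn as_slice(&self) -> &[T] {
        &self.samples
    }
}
```

A unit test for the recorder could then assert that ids such as `../cert`, `a/b` and `C:evil` are rejected while the documented `2026-06-26T20-14-03Z_5120x1440` form passes, and that pushing past the cap leaves the earliest samples in place with `truncated` set.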