docs(design): trim shipped plans, consolidate cluster, add index
Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).
- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
host-latency, gpu-contention (fixed stale status table), game-library,
linux-setup (fixed m0->spike + stale zero-copy claim),
session-aware-host-followups, windows-client-bootstrap,
windows-dualsense-{scoping,game-detection}, windows-virtual-display,
security-review (per-finding status table; #12 still open),
apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
merged, M4 done); windows-secure-desktop.md archived (now a fallback
behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+43
-238
@@ -1,246 +1,51 @@
# Stats capture & graphing — design

Goal: let an operator **enable performance-stats capture from the web console**, play a
session, **stop**, and **review the captured time-series as graphs** in the web console.
Captures are **saved to disk** (browse/compare past sessions; survive host restart) and
cover **both** streaming paths: native punktfunk/1 (`virtual_stream`) and GameStream/Moonlight
(`gamestream/stream.rs`).
> **Status:** SHIPPED (commit `5bf787e`) — host `crates/punktfunk-host/src/stats_recorder.rs`,
> mgmt endpoints `/api/v1/stats/*` (`mgmt.rs`), web console Performance page
> (`web/src/sections/Stats/`). Implemented; not yet on-glass validated. This doc is trimmed to
> design rationale + open items; the shipped code is the source of truth (data models, recorder
> API, endpoint list, and UI layout all live there).

This builds on the existing per-stage instrumentation (today gated by `PUNKTFUNK_PERF=1`,
stdout-only, read once at startup). We make recording **runtime-toggleable**, route the same
aggregates into a **shared ring → on-disk recording**, and expose it over the mgmt REST API +
web console.
Goal: let an operator **enable performance-stats capture from the web console**, play a session,
**stop**, and **review the captured time-series as graphs**. Captures are **saved to disk**
(browse/compare past sessions; survive host restart) and cover **both** streaming paths: native
punktfunk/1 (`virtual_stream`) and GameStream/Moonlight (`gamestream/stream.rs`).

---
## Why / design rationale

## 1. Host: shared `StatsRecorder`

New module `crates/punktfunk-host/src/stats_recorder.rs`. One `Arc<StatsRecorder>` is created
once in the unified host entry (`gamestream::serve`, the `serve` subcommand) alongside
`Arc<NativePairing>`, and shared with **both** the mgmt API (`MgmtState`) and the streaming
loops (threaded through `punktfunk1::serve` → `SessionContext` → `virtual_stream`/`send_loop`,
and into the GameStream encode loop). Mirror the existing `NativePairing` Arc-sharing pattern
exactly.

### Data model (serde + utoipa `ToSchema`; this is the wire + on-disk shape)

```rust
/// One pipeline stage's latency in a window (microseconds).
pub struct StageTiming {
    pub name: String, // "capture" | "submit" | "encode" | "packetize" | "send"
    pub p50_us: f32,
    pub p99_us: f32,
}

/// One aggregated sample (~ every 2 s native, ~ every 1 s GameStream).
pub struct StatsSample {
    pub t_ms: u64, // ms since capture start (monotonic, from a stored Instant)
    pub session_id: u32, // disambiguates concurrent sessions (usually constant)
    pub stages: Vec<StageTiming>, // ordered pipeline stages for this path
    pub fps: f32, // genuine NEW frames/s from the source
    pub repeat_fps: f32, // re-encoded holds/s (source-starvation indicator)
    pub mbps: f32, // tx goodput (Mb/s)
    pub bitrate_kbps: u32, // configured target bitrate
    pub frames_dropped: u32, // delta in this window
    pub packets_dropped: u32, // delta (receiver-side / reassembler), where known
    pub send_dropped: u32, // delta (host send-buffer overflow / EAGAIN)
    pub fec_recovered: u32, // delta (shards recovered)
}

pub struct CaptureMeta {
    pub id: String, // "2026-06-26T20-14-03Z_5120x1440" — also the filename stem
    pub started_unix_ms: u64,
    pub duration_ms: u64,
    pub kind: String, // "native" | "gamestream"
    pub width: u32,
    pub height: u32,
    pub fps: u32,
    pub codec: String, // "h264" | "hevc" | "av1"
    pub client: String, // short label / fingerprint prefix, or "" if unknown
    pub sample_count: u32,
}

pub struct Capture {
    pub meta: CaptureMeta,
    pub samples: Vec<StatsSample>,
}

pub struct StatsStatus {
    pub armed: bool, // capture currently running
    pub sample_count: u32, // samples in the in-progress capture
    pub started_unix_ms: u64, // 0 if idle
    pub kind: String, // path of the in-progress capture, "" if idle
}
```

Stage sets per path (ordered, roughly the per-frame critical path so stacking is meaningful):
- **native**: `capture` (try_latest ring read + color convert), `submit` (NVENC enqueue),
  `encode` (lock_bitstream = NVENC schedule + ASIC — the dominant stage under GPU load),
  `send` (paced_submit: seal + FEC + pace + sendmmsg).
- **gamestream**: `capture`, `encode`, `packetize` (poll+FEC+packetize), `send`.

> Native naming: today's vectors are `st_cap`→`capture`, `st_submit`→`submit`,
> `st_wait`→`encode`, `pace_us`→`send`. (`encode_us` total ≈ capture+submit+encode; we do not
> emit it as a stage to avoid double-counting — it's implied by the stack.)

### Recorder API

```rust
pub struct StatsRecorder { /* dir, armed: AtomicBool, live: Mutex<Option<Live>>, next_sid: AtomicU32 */ }

impl StatsRecorder {
    pub fn new(dir: PathBuf) -> Arc<Self>; // creates dir (0700) if missing

    pub fn is_armed(&self) -> bool; // cheap Relaxed atomic load — called on the hot path

    /// Arm a new capture. No-op if already armed (returns current status).
    pub fn start(&self) -> StatsStatus;

    /// A streaming loop announces itself when it first records while armed.
    /// Seeds CaptureMeta (kind/w/h/fps/codec/client) on the FIRST registration. Returns session_id.
    pub fn register_session(&self, kind: &'static str, w: u32, h: u32, fps: u32, codec: &str, client: &str) -> u32;

    /// Append one aggregated sample (called from the loops' existing ~2 s/~1 s boundary).
    /// Bounded: cap at MAX_SAMPLES (e.g. 5400 ≈ 3 h @ 2 s). On overflow, stop appending and
    /// set a `truncated` flag (DO NOT drop oldest — a saved recording must keep its start).
    pub fn push_sample(&self, session_id: u32, sample: StatsSample);

    /// Disarm + finalize: write <dir>/<id>.json atomically, clear live, return saved meta.
    pub fn stop(&self) -> std::io::Result<Option<CaptureMeta>>;

    pub fn status(&self) -> StatsStatus;
    pub fn live_snapshot(&self) -> Option<Capture>; // clone of the in-progress capture for live graphing

    pub fn list(&self) -> Vec<CaptureMeta>; // scan dir, parse meta only, newest first
    pub fn load(&self, id: &str) -> std::io::Result<Capture>;
    pub fn delete(&self, id: &str) -> std::io::Result<()>;
}
```
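The overflow rule documented on `push_sample` (stop appending at the cap, set a `truncated` flag, never evict the oldest samples) can be illustrated with a minimal sketch. It is not the shipped recorder: `LiveBuffer`, its fields, and the bare `u64` payload are hypothetical stand-ins for the real in-progress state and `StatsSample`.

```rust
/// Hypothetical stand-in for the in-progress capture; the payload is a bare
/// timestamp here, where the real buffer would hold `StatsSample` values.
pub struct LiveBuffer {
    samples: Vec<u64>,
    truncated: bool,
    max_samples: usize,
}

impl LiveBuffer {
    pub fn new(max_samples: usize) -> Self {
        Self { samples: Vec::new(), truncated: false, max_samples }
    }

    /// Append until the cap is hit. After that, discard the incoming sample
    /// and raise `truncated`, so the start of the recording is never evicted.
    pub fn push(&mut self, sample: u64) {
        if self.samples.len() >= self.max_samples {
            self.truncated = true;
            return;
        }
        self.samples.push(sample);
    }

    pub fn samples(&self) -> &[u64] {
        &self.samples
    }

    pub fn is_truncated(&self) -> bool {
        self.truncated
    }
}
```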

Invariants / safety:
- **Reuse the existing per-stage instrumentation** that was startup-gated by `PUNKTFUNK_PERF=1`
  (stdout-only, read once at startup). The key behavioral change: make the per-frame
  **measurement** predicate `perf || recorder.is_armed()`, re-evaluated each frame via a cheap
  `Relaxed` atomic. `PUNKTFUNK_PERF=1` still emits its `tracing::info!` log line exactly as
  before; the web toggle additionally builds a `StatsSample` at the aggregation boundary — so
  the web toggle works at runtime with **zero startup flags**.
- **No async on the per-frame path.** `is_armed()` is a `Relaxed` atomic load; sample
  construction happens only at the existing 2 s / 1 s aggregation boundary, never per frame.
- **`id` is path-traversal-safe.** `load`/`delete` MUST reject any id not matching
  `^[A-Za-z0-9._-]+$` (no `/`, no `..`, no `:` — keep it a valid Windows filename), and only ever
  join `dir/<id>.json`. Return NotFound on reject. (Endpoints are bearer-authed, but defend in
  depth.)
- **Bounded memory.** `MAX_SAMPLES` cap; truncate (keep oldest), never unbounded.
- **Atomic disk write.** Write to `<id>.json.tmp` then rename, so a crash mid-write can't leave
  a half file. Pretty-print not required; compact JSON is fine.
- Captures dir: `~/.config/punktfunk/captures/` (next to `cert.pem` etc.). Resolve via the same
  config-dir helper the rest of the host uses.
  construction happens only at the existing **~2 s native / ~1 s GameStream** aggregation
  boundary, never per frame. One shared `Arc<StatsRecorder>` is created once in the unified host
  entry and threaded into both streaming loops + `MgmtState`, mirroring the existing
  `Arc<NativePairing>` sharing pattern.
- **Stage sets are the per-frame critical path so stacking is meaningful.** native:
  `capture` / `submit` (NVENC enqueue) / `encode` (`lock_bitstream` = NVENC schedule + ASIC, the
  dominant stage under GPU load) / `send` (paced_submit: seal + FEC + pace + sendmmsg).
  gamestream: `capture` / `encode` / `packetize` / `send`. Native source vectors map
  `st_cap`→`capture`, `st_submit`→`submit`, `st_wait`→`encode`, `pace_us`→`send`; `encode_us`
  total ≈ capture+submit+encode and is **not** emitted as its own stage to avoid double-counting.
- **Gotchas / accepted-risk decisions:**
  - **`id` is path-traversal-safe.** `load`/`delete` reject any id not matching
    `^[A-Za-z0-9._-]+$` (no `/`, no `..`, no `:` — keep it a valid Windows filename) and only ever
    join `dir/<id>.json`. Endpoints are bearer-authed, but defend in depth.
  - **Bounded memory, keep the start.** `MAX_SAMPLES` cap (~5400 ≈ 3 h @ 2 s); on overflow stop
    appending and set a `truncated` flag — **do NOT drop oldest**, a saved recording must keep
    its start.
  - **Atomic disk write.** Write `<id>.json.tmp` then rename so a crash mid-write can't leave a
    half file. Captures dir `~/.config/punktfunk/captures/` (0700), next to `cert.pem`.
  - Counters that a path doesn't expose are recorded as `0` — **do NOT fabricate**.
  - mgmt endpoints are **bearer-token only** (operator actions) — deliberately NOT in the mTLS
    `cert_may_access` read-only allowlist.
  - Charts render **client-only** (mounted guard) so SSR doesn't choke on `ResponsiveContainer`'s
    0-width measure.
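The id check and the tmp-then-rename write described in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `stats_recorder.rs`: the function names `is_safe_id`, `capture_path`, and `write_atomic` are hypothetical, and the character test is a hand-written equivalent of `^[A-Za-z0-9._-]+$` so that no regex crate is needed.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// True only for non-empty ids made of ASCII letters, digits, '.', '_' and '-'
/// (hypothetical helper mirroring the pattern ^[A-Za-z0-9._-]+$).
pub fn is_safe_id(id: &str) -> bool {
    !id.is_empty()
        && id
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'-'))
}

/// Resolve `<dir>/<id>.json`, answering NotFound for any rejected id.
pub fn capture_path(dir: &Path, id: &str) -> io::Result<PathBuf> {
    if !is_safe_id(id) {
        return Err(io::Error::new(io::ErrorKind::NotFound, "no such capture"));
    }
    Ok(dir.join(format!("{}.json", id)))
}

/// Write to `<id>.json.tmp`, flush to disk, then rename over the final name,
/// so a crash mid-write leaves at most a stray .tmp file.
pub fn write_atomic(dir: &Path, id: &str, json: &[u8]) -> io::Result<PathBuf> {
    let final_path = capture_path(dir, id)?;
    let tmp_path = dir.join(format!("{}.json.tmp", id));
    let mut file = fs::File::create(&tmp_path)?;
    file.write_all(json)?;
    file.sync_all()?;
    drop(file);
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}
```

One detail worth noting: a bare `..` passes the character class. In this sketch it stays harmless only because `.json` is always appended and no path separator can appear in an accepted id.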

### Runtime gating change (the key behavioral change)
## Open items

Today the loops measure per-stage timing only `if perf` (a startup bool). Change the per-frame
**measurement** predicate to `let measure = perf || recorder.is_armed();`, re-evaluated each
frame (cheap atomic). Then at the aggregation boundary:
- if `perf` → keep the existing `tracing::info!` log line (unchanged behavior);
- if `recorder.is_armed()` → also build a `StatsSample` and `push_sample`.

So `PUNKTFUNK_PERF=1` still works exactly as before, AND the web toggle now works at runtime
with zero startup flags.
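A minimal sketch of that gating predicate, assuming a recorder that exposes nothing but an `AtomicBool`. The `Recorder` type and the `timed_stage` helper are hypothetical names for illustration; only `perf || recorder.is_armed()` and the `Relaxed` load come from the text above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical stripped-down recorder: just the runtime toggle.
pub struct Recorder {
    armed: AtomicBool,
}

impl Recorder {
    pub fn new() -> Self {
        Self { armed: AtomicBool::new(false) }
    }

    pub fn set_armed(&self, on: bool) {
        self.armed.store(on, Ordering::Relaxed);
    }

    /// Cheap enough to call once per frame on the hot path.
    pub fn is_armed(&self) -> bool {
        self.armed.load(Ordering::Relaxed)
    }
}

/// Run one pipeline stage, timing it only when the startup flag `perf` is set
/// or a capture is currently armed. Returns None when nothing was measured.
pub fn timed_stage<F: FnOnce()>(perf: bool, recorder: &Recorder, work: F) -> Option<Duration> {
    let measure = perf || recorder.is_armed();
    let start = measure.then(Instant::now);
    work();
    start.map(|s| s.elapsed())
}
```

Because the predicate is re-evaluated on every call, arming the recorder takes effect on the next frame without restarting the loop.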

### Where each loop emits the sample

- **native** (`punktfunk1.rs`): the cap/submit/encode(`st_wait`) splits live in the capture
  thread; `mbps`/`send_dropped`/`bytes` and `session.stats()` live in the send thread. Emit the
  complete sample from **one** place. Cleanest: carry the per-frame `cap_us/submit_us/wait_us`
  (and a `repeat: bool`) on `FrameMsg` to the send thread (it already carries `encode_us`), so
  `send_loop` builds the whole sample at its existing 2 s boundary where `session.stats()` is
  already read. Compute `frames_dropped/packets_dropped/send_dropped/fec_recovered` as deltas vs
  the previous window's `Session::stats()` snapshot (the loop already tracks `last_bytes` /
  `last_send_dropped` — extend that bookkeeping). `register_session` is called once with the
  negotiated mode/codec and the client label.
- **gamestream** (`gamestream/stream.rs`): the encode loop already tracks per-stage max each
  1 s. Add p50/p99 accumulation (small per-stage `Vec<u32>` like the native path) and, when
  `perf || recorder.is_armed()`, emit a `StatsSample` with stages
  `[capture, encode, packetize, send]` + fps (unique new frames) + mbps + whatever loss/byte
  counters that path exposes (use 0 where a counter doesn't exist; do NOT fabricate). Call
  `register_session("gamestream", ...)` with the GameStream-negotiated mode/codec/client.
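The per-window bookkeeping described in the two bullets above (counter deltas against the previous snapshot, plus p50/p99 from a small per-stage `Vec<u32>`) might look roughly like this. Everything named here is an assumption for illustration: `Counters` stands in for a cumulative `Session::stats()` snapshot, and the nearest-rank percentile and the `0.0` result for an empty window are choices of this sketch, not confirmed behavior of the host.

```rust
/// Hypothetical cumulative counters, standing in for a stats snapshot.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Counters {
    pub frames_dropped: u64,
    pub send_dropped: u64,
}

/// Per-window deltas as they would be stored in a sample.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct WindowDelta {
    pub frames_dropped: u32,
    pub send_dropped: u32,
}

/// Difference of two cumulative values, clamped so a counter reset yields 0
/// rather than wrapping, and so the result always fits in u32.
fn delta(prev: u64, now: u64) -> u32 {
    now.saturating_sub(prev).min(u32::MAX as u64) as u32
}

pub fn window_delta(prev: Counters, now: Counters) -> WindowDelta {
    WindowDelta {
        frames_dropped: delta(prev.frames_dropped, now.frames_dropped),
        send_dropped: delta(prev.send_dropped, now.send_dropped),
    }
}

/// Nearest-rank percentile over one window of per-frame timings (microseconds).
/// Sorts the slice in place; an empty window reports 0.0.
pub fn percentile_us(samples: &mut [u32], pct: f32) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.sort_unstable();
    let rank = ((pct / 100.0) * samples.len() as f32).ceil() as usize;
    let idx = rank.clamp(1, samples.len()) - 1;
    samples[idx] as f32
}
```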

Threading: add `stats: Arc<StatsRecorder>` to `SessionContext` and the GameStream stream
setup; the standalone `punktfunk1-host` subcommand (no mgmt) passes a fresh recorder (harmless,
just unused).

---

## 2. Host: mgmt REST API (`mgmt.rs`)

Add `stats: Arc<StatsRecorder>` to `MgmtState`. Register handlers in `api_router_parts()` via
`routes!()` with `#[utoipa::path]`. All under `/api/v1`, **bearer-token only** (operator
actions — do NOT add them to the mTLS `cert_may_access` read-only allowlist). All bodies/returns
derive `ToSchema`; errors use the `ApiJson`/`ApiError` envelope. Tag every operation `stats`.

| Method & path | fn (operationId) | body → returns |
|---|---|---|
| POST `/api/v1/stats/capture/start` | `stats_capture_start` | — → `StatsStatus` |
| POST `/api/v1/stats/capture/stop` | `stats_capture_stop` | — → `CaptureMeta` (200) / 204-ish if nothing was recording |
| GET `/api/v1/stats/capture/status` | `stats_capture_status` | → `StatsStatus` |
| GET `/api/v1/stats/capture/live` | `stats_capture_live` | → `Capture` (in-progress; 404/empty if idle) |
| GET `/api/v1/stats/recordings` | `stats_recordings_list` | → `Vec<CaptureMeta>` |
| GET `/api/v1/stats/recordings/{id}` | `stats_recording_get` | → `Capture` |
| DELETE `/api/v1/stats/recordings/{id}` | `stats_recording_delete` | → `StatsStatus`/204 |

Register the new `ToSchema` types with the OpenApi derive's `components(schemas(...))` list.
Then regenerate the checked-in spec:

```
cargo run -p punktfunk-host -- openapi > api/openapi.json
```

CI fails on drift — the regenerated `api/openapi.json` MUST be committed.

---

## 3. Web console (`web/`)

New page **"Performance"** following the established route → section/index (fetch) →
section/view (presentational) pattern, registered in the `NAV` array (`app-shell.tsx`) with a
lucide icon (`Activity` or `LineChart`).

- Route: `web/src/routes/stats.tsx` → `createFileRoute('/stats')` → `SectionStats`.
- Section: `web/src/sections/Stats/index.tsx` (orval hooks) + `view.tsx` (presentational,
  i18n via Paraglide `m.*`). Use `Section`, `QueryState`, `Card`/`CardHeader`/`CardTitle`/
  `CardContent`, `Button`, `Badge` from `web/src/components/ui`.
- Charts: **add `recharts`** to `web/package.json` (no chart lib exists today). Render charts
  **client-only** (a mounted guard) so SSR doesn't choke on `ResponsiveContainer`'s 0-width
  measure. Theme via existing CSS variables / brand violet, dark-mode aware.

Data hooks come from regenerated orval (`bun run api:gen` after the host's openapi.json is
updated): `useStatsCaptureStatus`, `useStatsCaptureStart`, `useStatsCaptureStop`,
`useStatsCaptureLive`, `useStatsRecordingsList`, `useStatsRecordingGet`,
`useStatsRecordingDelete` (exact names per orval's tag/operationId convention — verify against
generated output and adjust the view imports to match).

UI layout:
1. **Capture control card** — Start/Stop button (mutations; invalidate status query on
   success), a "Recording…"/"Idle" `Badge`, elapsed time + live sample count
   (`useStatsCaptureStatus`, `refetchInterval: 2000`). On Start, the live chart appears.
2. **Live chart** (visible while armed; `useStatsCaptureLive`, `refetchInterval: 2000`) — the
   latency stage breakdown as a **stacked area** (capture/submit/encode/send in µs, the
   "where does the time go" view), with fps and mbps as secondary line charts.
3. **Recordings card** — table from `useStatsRecordingsList`: time, kind badge, resolution,
   codec, duration, sample count; row actions **View** (select → detail), **Download** (export
   the `Capture` JSON via the recording GET), **Delete** (mutation, confirm).
4. **Recording detail** — when a recording (or the live capture) is selected, render the full
   graph set from its `samples`:
   - Latency stage breakdown (stacked area, µs) — primary bottleneck view; p99 overlay toggle.
   - Throughput: fps (new vs repeat) + mbps.
   - Health: frames_dropped / packets_dropped / send_dropped / fec_recovered over time.

i18n: add keys to `web/messages/en.json` + `de.json` (nav label, titles, button/labels) and
regenerate Paraglide. Keep both locales in sync.

---

## 4. Verification / done-criteria

- `cargo build -p punktfunk-host` (and `--workspace`),
  `cargo clippy --workspace --all-targets -- -D warnings`, `cargo fmt --all --check`: all green.
- `cargo run -p punktfunk-host -- openapi > api/openapi.json` — committed, no drift.
- `PUNKTFUNK_PERF=1` stdout behavior unchanged (no regression to the existing perf log).
- Web: orval regen clean, typecheck/build green, charts render client-side.
- CLAUDE.md status note + this plan updated.
- Adversarial review: hot-path stays sync + bounded; `id` path-traversal-safe; OpenAPI/orval no
  drift; SSR-safe charts; both paths actually emit samples.
- **On-glass validation.** Implemented but not yet validated on real hardware end-to-end (arm
  from the console, play, stop, review graphs across both native + GameStream paths).