diff --git a/docs/roadmap.md b/docs/roadmap.md index 901be93..4d029a1 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -263,3 +263,43 @@ params (`max_data_per_block`, `shard_payload`), never by widening a bound by han `punktfunk-client-rs --speed-test KBPS:MS`, RELEASE build (debug is CPU-bound ~30 Mbps), watching `packets_send_dropped`. Open Qs: NVENC CBR rate-tracking at 0.5–1 Gbps (no explicit `rc_buffer_size`); LAN/QEMU-NIC jumbo/GSO support; any `web/` bitrate slider hardcoding 500 Mbps. + +## 12. Glass-to-glass latency *(investigated; quick wins landed, bigger bets scoped)* + +A 5-way investigation (2026-06-11) mapped where latency actually lives. The measured "p50 0.83 ms" +is only the same-host **capture-stamp→reassembled** slice (~30–40% of true glass-to-glass) and was +measured with tiny single-chunk frames, so it excludes the pacing tail. The latency that matters, in +priority order: **(1) the host pacing tail** — `paced_submit` used to spread *every* multi-chunk +frame over ~90% of the interval (up to ~7.5 ms@120 / ~15 ms@60); **(2) native-path serialization** — +`virtual_stream` runs capture+encode+seal+paced-send on one thread, so frame N+1 can't start until +frame N's paced tail leaves the wire; **(3) client present** — `AVSampleBufferDisplayLayer` adds +~0.5 refresh (~4 ms@120Hz, ~8 ms@60Hz), the dominant client term at 60 Hz. + +**Already optimal — do NOT touch** (confirmed): NVENC tuning (p1/ull/cbr/bf0/delay0/infinite-GOP + +forced-IDR — `receive_packet` is already same-frame); the device→device copy in `submit_cuda` (avoids +NVENC registration-cache thrash); FEC `max_data_per_block=4096` (every frame incl. a 4 MB IDR is one +block — no multi-block latency); the client reassembler (no jitter buffer, frame emitted on +last-packet arrival, `REORDER_WINDOW` is a dedup bound not a delay) — do **not** add a client jitter +buffer; `sendmmsg`/`recvmmsg` batching; the capture-timestamp anchor placement. + +- **Done & live (`99f60b5`):** **microburst-cap pacing** — a frame ≤ a cap (default 128 KB, + `PUNKTFUNK_PACE_BURST_KB`) bursts out immediately (no pacing tail); only a bigger frame's overflow + (IDR / sustained high bitrate — the bursts that actually froze) is spread. Recovers the tail on the + common case, keeps the freeze fix for the frames that need it; 128 KB is a safe default (well under + the ~150 Mbps@60 frame size where drops began). Plus **per-frame instrumentation** (PUNKTFUNK_PERF): + `encode_us` + `pace_us` p50/p99/max + immediate-vs-paced counts, so the cap is tunable against real + numbers. **Validate with the LAN soak before raising the cap** (`send_dropped` must stay 0). +- **Bigger bets (ordered, deferred — need real-NIC/GPU/Mac validation):** + 1. **Encode|send thread split** on the native path (port GameStream's `spawn_sender` + depth-2 + channel; `seal_frame` stays on the encode thread, `send_sealed` on a send thread) — removes the + serialization (~2–8 ms @60–120 fps), and is the substrate the slice wrapper needs. + 2. **Wall-clock skew handshake + glass-to-glass probe** (`tools/latency-probe`) — measures the two + biggest unmeasured terms (render→capture, decode→present); client present-stamp vs the AU's + `pts_ns` (already attached). + 3. **CUDA stream+event** to drop one of two redundant `cuCtxSynchronize` in `submit_cuda` (keep the + copy) — ~0.1–0.4 ms@720p, ~1 ms@5K; only if per-stage timing proves the sync is on the path. + 4. **Stage-2 Apple presenter** (`VTDecompressionSession` → `CAMetalLayer`, hand-paced) — ~0.5 refresh + off the present tail (biggest client win at 60 Hz); gate on the probe proving present is real. + 5. **NVENC slice-mode wrapper** (roadmap §2 sub-frame pipelining) — per-slice transmit overlaps + encode+send within a frame (~3–6 ms at 4K/5K/IDR); large + driver-ABI-fragile, on top of the + thread split, only after measurement justifies it.