docs(roadmap): §12 glass-to-glass latency — quick wins landed, bigger bets scoped

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 22:54:30 +00:00
parent 99f60b5b08
commit 761ccace25
+40
@@ -263,3 +263,43 @@ params (`max_data_per_block`, `shard_payload`), never by widening a bound by han
`punktfunk-client-rs --speed-test KBPS:MS`, RELEASE build (debug is CPU-bound ~30 Mbps), watching
`packets_send_dropped`. Open Qs: NVENC CBR rate-tracking at 0.5–1 Gbps (no explicit
`rc_buffer_size`); LAN/QEMU-NIC jumbo/GSO support; any `web/` bitrate slider hardcoding 500 Mbps.
## 12. Glass-to-glass latency *(investigated; quick wins landed, bigger bets scoped)*
A 5-way investigation (2026-06-11) mapped where latency actually lives. The measured "p50 0.83 ms"
is only the same-host **capture-stamp→reassembled** slice (~30–40% of true glass-to-glass) and was
measured with tiny single-chunk frames, so it excludes the pacing tail. The latency that matters, in
priority order: **(1) the host pacing tail** — `paced_submit` used to spread *every* multi-chunk
frame over ~90% of the interval (up to ~7.5 ms@120 / ~15 ms@60); **(2) native-path serialization** —
`virtual_stream` runs capture+encode+seal+paced-send on one thread, so frame N+1 can't start until
frame N's paced tail leaves the wire; **(3) client present** — `AVSampleBufferDisplayLayer` adds
~0.5 refresh (~4 ms@120Hz, ~8 ms@60Hz), the dominant client term at 60 Hz.
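The figures quoted above follow from simple frame-interval arithmetic. A minimal sketch, assuming the pacing tail is 90% of one frame interval and the present delay averages half a display refresh; the function names are hypothetical and exist only for this illustration:

```rust
/// Worst-case pacing tail in milliseconds when a frame is spread over
/// `spread` (a fraction, e.g. 0.9) of one frame interval.
fn pacing_tail_ms(fps: f64, spread: f64) -> f64 {
    1000.0 / fps * spread
}

/// Present delay in milliseconds, modelled as a fraction of one display
/// refresh (0.5 = half a refresh on average). This model is an assumption.
fn present_delay_ms(refresh_hz: f64, refresh_fraction: f64) -> f64 {
    1000.0 / refresh_hz * refresh_fraction
}

fn main() {
    for rate in [60.0, 120.0] {
        println!(
            "{rate} Hz: pacing tail up to ~{:.1} ms, present ~{:.1} ms",
            pacing_tail_ms(rate, 0.9),
            present_delay_ms(rate, 0.5)
        );
    }
}
```

At 60 Hz the half-refresh present term (about 8 ms) is roughly twice the 120 Hz value, which is why it dominates the client side at 60 Hz.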
**Already optimal — do NOT touch** (confirmed): NVENC tuning (p1/ull/cbr/bf0/delay0/infinite-GOP +
forced-IDR — `receive_packet` is already same-frame); the device→device copy in `submit_cuda` (avoids
NVENC registration-cache thrash); FEC `max_data_per_block=4096` (every frame incl. a 4 MB IDR is one
block — no multi-block latency); the client reassembler (no jitter buffer, frame emitted on
last-packet arrival, `REORDER_WINDOW` is a dedup bound not a delay) — do **not** add a client jitter
buffer; `sendmmsg`/`recvmmsg` batching; the capture-timestamp anchor placement.
- **Done & live (`99f60b5`):** **microburst-cap pacing** — a frame ≤ a cap (default 128 KB,
`PUNKTFUNK_PACE_BURST_KB`) bursts out immediately (no pacing tail); only a bigger frame's overflow
(IDR / sustained high bitrate — the bursts that actually froze) is spread. Recovers the tail on the
common case, keeps the freeze fix for the frames that need it; 128 KB is a safe default (well under
the ~150 Mbps@60 frame size where drops began). Plus **per-frame instrumentation** (`PUNKTFUNK_PERF`):
`encode_us` + `pace_us` p50/p99/max + immediate-vs-paced counts, so the cap is tunable against real
numbers. **Validate with the LAN soak before raising the cap** (`send_dropped` must stay 0).
- **Bigger bets (ordered, deferred — need real-NIC/GPU/Mac validation):**
1. **Encode|send thread split** on the native path (port GameStream's `spawn_sender` + depth-2
channel; `seal_frame` stays on the encode thread, `send_sealed` on a send thread) — removes the
serialization (~2–8 ms @60–120 fps), and is the substrate the slice wrapper needs.
2. **Wall-clock skew handshake + glass-to-glass probe** (`tools/latency-probe`) — measures the two
biggest unmeasured terms (render→capture, decode→present); client present-stamp vs the AU's
`pts_ns` (already attached).
3. **CUDA stream+event** to drop one of two redundant `cuCtxSynchronize` in `submit_cuda` (keep the
copy), saving ~0.1–0.4 ms@720p, ~1 ms@5K; only if per-stage timing proves the sync is on the path.
4. **Stage-2 Apple presenter** (`VTDecompressionSession` → `CAMetalLayer`, hand-paced) — ~0.5 refresh
off the present tail (biggest client win at 60 Hz); gate on the probe proving present is real.
5. **NVENC slice-mode wrapper** (roadmap §2 sub-frame pipelining) — per-slice transmit overlaps
encode+send within a frame (~3–6 ms at 4K/5K/IDR); large + driver-ABI-fragile, on top of the
thread split, only after measurement justifies it.
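The microburst-cap rule from the "Done & live" item can be sketched as a pure planning function. This is not the actual `paced_submit` code: `PacePlan`, `plan_frame`, and `burst_cap_bytes` are hypothetical names, "128 KB" is taken as 128 × 1024 bytes, and spreading the overflow over 90% of the frame interval is an assumption carried over from the old pacing behaviour.

```rust
use std::time::Duration;

/// Hypothetical per-frame plan: bytes that leave immediately, plus the
/// overflow that is spread over a pacing window.
#[derive(Debug, PartialEq)]
struct PacePlan {
    burst_bytes: usize,
    paced_bytes: usize,
    paced_over: Duration,
}

/// Default cap from the roadmap text (128 KB, read here as 128 * 1024 bytes).
const DEFAULT_BURST_CAP_BYTES: usize = 128 * 1024;

/// Parse a cap given in KB (e.g. the value of PUNKTFUNK_PACE_BURST_KB);
/// fall back to the default when the value is missing or not a number.
fn burst_cap_bytes(env_kb: Option<&str>) -> usize {
    env_kb
        .and_then(|s| s.trim().parse::<usize>().ok())
        .map(|kb| kb.saturating_mul(1024))
        .unwrap_or(DEFAULT_BURST_CAP_BYTES)
}

/// Frames at or under the cap burst out whole; only the overflow of a
/// larger frame is paced (assumed: over 90% of the frame interval).
fn plan_frame(frame_bytes: usize, cap_bytes: usize, frame_interval: Duration) -> PacePlan {
    let burst_bytes = frame_bytes.min(cap_bytes);
    let paced_bytes = frame_bytes - burst_bytes;
    let paced_over = if paced_bytes == 0 {
        Duration::ZERO
    } else {
        frame_interval.mul_f64(0.9)
    };
    PacePlan { burst_bytes, paced_bytes, paced_over }
}

fn main() {
    let cap = burst_cap_bytes(std::env::var("PUNKTFUNK_PACE_BURST_KB").ok().as_deref());
    let interval = Duration::from_secs_f64(1.0 / 60.0);
    println!("{:?}", plan_frame(40 * 1024, cap, interval)); // small frame: no pacing tail
    println!("{:?}", plan_frame(4 * 1024 * 1024, cap, interval)); // IDR-sized: overflow paced
}
```

Keeping the decision separate from the socket code makes the cap easy to unit-test and to tune against the `pace_us` numbers before it is raised.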
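The encode|send thread split in bigger bet 1 can be illustrated with a bounded channel from the standard library. This is a sketch of the general pattern only, not GameStream's `spawn_sender`: the `seal_frame` and `send_sealed` bodies are placeholders, and `SealedFrame` and `run_pipeline` are hypothetical names.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for a sealed (encoded and encrypted) frame.
struct SealedFrame {
    index: u32,
    payload: Vec<u8>,
}

/// Placeholder for capture + encode + seal; stays on the encode thread.
fn seal_frame(index: u32) -> SealedFrame {
    SealedFrame { index, payload: vec![0u8; 1200] }
}

/// Placeholder for the paced send; runs on the send thread.
/// Returns the number of bytes "sent".
fn send_sealed(frame: &SealedFrame) -> usize {
    frame.payload.len()
}

/// Runs `frames` frames through a depth-2 channel and returns
/// (frames sent, bytes sent) as reported by the send thread.
fn run_pipeline(frames: u32) -> (u32, usize) {
    // Depth 2: the encoder can run at most two frames ahead of the sender,
    // which bounds queueing latency while removing the serialization.
    let (tx, rx) = sync_channel::<SealedFrame>(2);

    let sender = thread::spawn(move || {
        let mut sent = 0u32;
        let mut bytes = 0usize;
        for frame in rx {
            // A single producer and a FIFO channel keep frames in order.
            debug_assert_eq!(frame.index, sent);
            bytes += send_sealed(&frame);
            sent += 1;
        }
        (sent, bytes)
    });

    for i in 0..frames {
        // Blocks only when two frames are already queued behind the sender.
        if tx.send(seal_frame(i)).is_err() {
            break; // send thread exited
        }
    }
    drop(tx); // closing the channel ends the sender's loop
    sender.join().expect("send thread panicked")
}

fn main() {
    let (sent, bytes) = run_pipeline(5);
    println!("sent {sent} frames, {bytes} bytes");
}
```

With this shape, frame N+1 can be encoding while frame N's paced tail is still leaving the wire; the bounded depth provides backpressure so a stalled sender cannot build an unbounded queue.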