docs(roadmap): §12 glass-to-glass latency — quick wins landed, bigger bets scoped

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 22:54:30 +00:00
parent 99f60b5b08
commit 761ccace25
1 changed files with 40 additions and 0 deletions
@@ -263,3 +263,43 @@ params (`max_data_per_block`, `shard_payload`), never by widening a bound by han
  `punktfunk-client-rs --speed-test KBPS:MS`, RELEASE build (debug is CPU-bound ~30 Mbps), watching
  `packets_send_dropped`. Open Qs: NVENC CBR rate-tracking at 0.5–1 Gbps (no explicit
  `rc_buffer_size`); LAN/QEMU-NIC jumbo/GSO support; any `web/` bitrate slider hardcoding 500 Mbps.
+
+## 12. Glass-to-glass latency *(investigated; quick wins landed, bigger bets scoped)*
+
+A 5-way investigation (2026-06-11) mapped where latency actually lives. The measured "p50 0.83 ms"
+is only the same-host **capture-stamp→reassembled** slice (~30–40% of true glass-to-glass) and was
+measured with tiny single-chunk frames, so it excludes the pacing tail. The latency that matters, in
+priority order: **(1) the host pacing tail** — `paced_submit` used to spread *every* multi-chunk
+frame over ~90% of the interval (up to ~7.5 ms@120 / ~15 ms@60); **(2) native-path serialization** —
+`virtual_stream` runs capture+encode+seal+paced-send on one thread, so frame N+1 can't start until
+frame N's paced tail leaves the wire; **(3) client present** — `AVSampleBufferDisplayLayer` adds
+~0.5 refresh (~4 ms@120Hz, ~8 ms@60Hz), the dominant client term at 60 Hz.
+
+**Already optimal — do NOT touch** (confirmed): NVENC tuning (p1/ull/cbr/bf0/delay0/infinite-GOP +
+forced-IDR — `receive_packet` is already same-frame); the device→device copy in `submit_cuda` (avoids
+NVENC registration-cache thrash); FEC `max_data_per_block=4096` (every frame incl. a 4 MB IDR is one
+block — no multi-block latency); the client reassembler (no jitter buffer, frame emitted on
+last-packet arrival, `REORDER_WINDOW` is a dedup bound not a delay) — do **not** add a client jitter
+buffer; `sendmmsg`/`recvmmsg` batching; the capture-timestamp anchor placement.
+
+- **Done & live (`99f60b5`):** **microburst-cap pacing** — a frame ≤ a cap (default 128 KB,
+  `PUNKTFUNK_PACE_BURST_KB`) bursts out immediately (no pacing tail); only a bigger frame's overflow
+  (IDR / sustained high bitrate — the bursts that actually froze) is spread. Recovers the tail on the
+  common case, keeps the freeze fix for the frames that need it; 128 KB is a safe default (well under
+  the ~150 Mbps@60 frame size where drops began). Plus **per-frame instrumentation** (PUNKTFUNK_PERF):
+  `encode_us` + `pace_us` p50/p99/max + immediate-vs-paced counts, so the cap is tunable against real
+  numbers. **Validate with the LAN soak before raising the cap** (`send_dropped` must stay 0).
+- **Bigger bets (ordered, deferred — need real-NIC/GPU/Mac validation):**
+  1. **Encode|send thread split** on the native path (port GameStream's `spawn_sender` + depth-2
+     channel; `seal_frame` stays on the encode thread, `send_sealed` on a send thread) — removes the
+     serialization (~2–8 ms @60–120 fps), and is the substrate the slice wrapper needs.
+  2. **Wall-clock skew handshake + glass-to-glass probe** (`tools/latency-probe`) — measures the two
+     biggest unmeasured terms (render→capture, decode→present); client present-stamp vs the AU's
+     `pts_ns` (already attached).
+  3. **CUDA stream+event** to drop one of two redundant `cuCtxSynchronize` in `submit_cuda` (keep the
+     copy) — ~0.1–0.4 ms@720p, ~1 ms@5K; only if per-stage timing proves the sync is on the path.
+  4. **Stage-2 Apple presenter** (`VTDecompressionSession` → `CAMetalLayer`, hand-paced) — ~0.5 refresh
+     off the present tail (biggest client win at 60 Hz); gate on the probe proving present is real.
+  5. **NVENC slice-mode wrapper** (roadmap §2 sub-frame pipelining) — per-slice transmit overlaps
+     encode+send within a frame (~3–6 ms at 4K/5K/IDR); large + driver-ABI-fragile, on top of the
+     thread split, only after measurement justifies it.