docs(roadmap): §12 glass-to-glass latency — quick wins landed, bigger bets scoped
ci / rust (push) Has been cancelled
ci / rust (push) Has been cancelled
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -263,3 +263,43 @@ params (`max_data_per_block`, `shard_payload`), never by widening a bound by han
|
||||
`punktfunk-client-rs --speed-test KBPS:MS`, RELEASE build (debug is CPU-bound ~30 Mbps), watching
|
||||
`packets_send_dropped`. Open Qs: NVENC CBR rate-tracking at 0.5–1 Gbps (no explicit
|
||||
`rc_buffer_size`); LAN/QEMU-NIC jumbo/GSO support; any `web/` bitrate slider hardcoding 500 Mbps.
|
||||
|
||||
## 12. Glass-to-glass latency *(investigated; quick wins landed, bigger bets scoped)*
|
||||
|
||||
A 5-way investigation (2026-06-11) mapped where latency actually lives. The measured "p50 0.83 ms"
|
||||
is only the same-host **capture-stamp→reassembled** slice (~30–40% of true glass-to-glass) and was
|
||||
measured with tiny single-chunk frames, so it excludes the pacing tail. The latency that matters, in
|
||||
priority order: **(1) the host pacing tail** — `paced_submit` used to spread *every* multi-chunk
|
||||
frame over ~90% of the interval (up to ~7.5 ms@120 / ~15 ms@60); **(2) native-path serialization** —
|
||||
`virtual_stream` runs capture+encode+seal+paced-send on one thread, so frame N+1 can't start until
|
||||
frame N's paced tail leaves the wire; **(3) client present** — `AVSampleBufferDisplayLayer` adds
|
||||
~0.5 refresh (~4 ms@120Hz, ~8 ms@60Hz), the dominant client term at 60 Hz.
|
||||
|
||||
**Already optimal — do NOT touch** (confirmed): NVENC tuning (p1/ull/cbr/bf0/delay0/infinite-GOP +
|
||||
forced-IDR — `receive_packet` is already same-frame); the device→device copy in `submit_cuda` (avoids
|
||||
NVENC registration-cache thrash); FEC `max_data_per_block=4096` (every frame incl. a 4 MB IDR is one
|
||||
block — no multi-block latency); the client reassembler (no jitter buffer, frame emitted on
|
||||
last-packet arrival, `REORDER_WINDOW` is a dedup bound not a delay) — do **not** add a client jitter
|
||||
buffer; `sendmmsg`/`recvmmsg` batching; the capture-timestamp anchor placement.
|
||||
|
||||
- **Done & live (`99f60b5`):** **microburst-cap pacing** — a frame ≤ a cap (default 128 KB,
|
||||
`PUNKTFUNK_PACE_BURST_KB`) bursts out immediately (no pacing tail); only a bigger frame's overflow
|
||||
(IDR / sustained high bitrate — the bursts that actually froze) is spread. Recovers the tail on the
|
||||
common case, keeps the freeze fix for the frames that need it; 128 KB is a safe default (well under
|
||||
the ~150 Mbps@60 frame size where drops began). Plus **per-frame instrumentation** (PUNKTFUNK_PERF):
|
||||
`encode_us` + `pace_us` p50/p99/max + immediate-vs-paced counts, so the cap is tunable against real
|
||||
numbers. **Validate with the LAN soak before raising the cap** (`send_dropped` must stay 0).
|
||||
- **Bigger bets (ordered, deferred — need real-NIC/GPU/Mac validation):**
|
||||
1. **Encode|send thread split** on the native path (port GameStream's `spawn_sender` + depth-2
|
||||
channel; `seal_frame` stays on the encode thread, `send_sealed` on a send thread) — removes the
|
||||
serialization (~2–8 ms @60–120 fps), and is the substrate the slice wrapper needs.
|
||||
2. **Wall-clock skew handshake + glass-to-glass probe** (`tools/latency-probe`) — measures the two
|
||||
biggest unmeasured terms (render→capture, decode→present); client present-stamp vs the AU's
|
||||
`pts_ns` (already attached).
|
||||
3. **CUDA stream+event** to drop one of two redundant `cuCtxSynchronize` in `submit_cuda` (keep the
|
||||
copy) — ~0.1–0.4 ms@720p, ~1 ms@5K; only if per-stage timing proves the sync is on the path.
|
||||
4. **Stage-2 Apple presenter** (`VTDecompressionSession` → `CAMetalLayer`, hand-paced) — ~0.5 refresh
|
||||
off the present tail (biggest client win at 60 Hz); gate on the probe proving present is real.
|
||||
5. **NVENC slice-mode wrapper** (roadmap §2 sub-frame pipelining) — per-slice transmit overlaps
|
||||
encode+send within a frame (~3–6 ms at 4K/5K/IDR); large + driver-ABI-fragile, on top of the
|
||||
thread split, only after measurement justifies it.
|
||||
|
||||
Reference in New Issue
Block a user