feat(latency): wall-clock skew handshake for cross-machine latency measurement
ci / rust (push) Has been cancelled

ClockProbe/ClockEcho on the QUIC control stream — 8 NTP-style rounds right after
Start; the min-RTT sample gives the host-client clock offset (clock_offset_ns
estimator in punktfunk-core). The client adds the offset to its receive instant
before differencing against the AU pts_ns, so the capture->reassembled latency
percentiles are valid across machines (skew_corrected=true), not just same-host.
Back-compat: an old host that doesn't answer the probe times out and the client
falls back to a shared-clock assumption (skew_corrected=false).

Host adds one ClockProbe dispatch arm in the control task; the client runs
clock_sync after Start, before the --remode/--speed-test tasks take the stream.

Validated cross-LAN (GNOME box -> dev box): offset ~ -1.57 ms (reproducible),
rtt ~140 us, p50 1.30 ms skew-corrected capture->reassembled — the offset is
exactly the systematic error the handshake removes. Unit tests for the message
codecs and the min-RTT offset estimator.

Roadmap §12: skew handshake done; remaining for true glass-to-glass is the Apple
client present-stamp (decode->present) plus the host render->capture term.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-12 11:20:20 +00:00
parent 50c9db785a
commit 05bc9ab22c
6 changed files with 225 additions and 13 deletions
+13 -6
View File
@@ -295,15 +295,22 @@ buffer; `sendmmsg`/`recvmmsg` batching; the capture-timestamp anchor placement.
`sync_channel(3)` with backpressure. Removes the serialization (~28 ms @60120 fps) and is the
substrate the slice wrapper needs. Real-NIC soak (host on the Ubuntu/GNOME box, client over the
LAN): `send_dropped=0` at 720p60 / 1080p120, and a 1 Gbps probe pushed 625 MB in 5 s clean.
- **Done & live (skew handshake landed 2026-06-12):** **wall-clock skew handshake** — `ClockProbe`/
`ClockEcho` on the control stream (8 NTP-style rounds right after `Start`; min-RTT sample →
hostclient offset; `clock_offset_ns`). The client adds the offset to its receive instant before
differencing against the AU `pts_ns`, so the `capture→reassembled` percentiles are now valid
**across machines** (reported `skew_corrected=true`), not just same-host. Back-compat: an old host
that doesn't answer times out → `skew_corrected=false` (shared-clock assumption, as before).
**Remaining for true glass-to-glass**: the **client present-stamp** (decode→present term) — only
the Apple client presents today, so it needs the connector to expose the offset + an Apple
present-time probe; and the **render→capture** term (compare the PipeWire buffer presentation
timestamp to our capture stamp). `tools/latency-probe` is still the cross-machine orchestrator.
- **Bigger bets (ordered, deferred — need real-NIC/GPU/Mac validation):**
1. **Wall-clock skew handshake + glass-to-glass probe** (`tools/latency-probe`) — measures the two
biggest unmeasured terms (render→capture, decode→present); client present-stamp vs the AU's
`pts_ns` (already attached).
2. **CUDA stream+event** to drop one of two redundant `cuCtxSynchronize` in `submit_cuda` (keep the
1. **CUDA stream+event** to drop one of two redundant `cuCtxSynchronize` in `submit_cuda` (keep the
copy) — ~0.10.4 ms@720p, ~1 ms@5K; only if per-stage timing proves the sync is on the path.
3. **Stage-2 Apple presenter** (`VTDecompressionSession` → `CAMetalLayer`, hand-paced) — ~0.5 refresh
2. **Stage-2 Apple presenter** (`VTDecompressionSession` → `CAMetalLayer`, hand-paced) — ~0.5 refresh
off the present tail (biggest client win at 60 Hz); gate on the probe proving present is real.
4. **NVENC slice-mode wrapper** (roadmap §2 sub-frame pipelining) — per-slice transmit overlaps
3. **NVENC slice-mode wrapper** (roadmap §2 sub-frame pipelining) — per-slice transmit overlaps
encode+send within a frame (~36 ms at 4K/5K/IDR); large + driver-ABI-fragile, on top of the
thread split, only after measurement justifies it.