feat(protocol): per-AU host-timing plane (0xCF) — split host+network latency (stats phase 2)

The unified-stats equation's host+network stage was one opaque number
because the wire carried nothing but pts_ns. Now the host reports its own
share per frame: when the client's Hello sets VIDEO_CAP_HOST_TIMING (0x08),
the send thread emits a 13-byte 0xCF datagram — [tag][pts_ns u64][host_us
u32] — right after the AU's last packet leaves the socket, so host_us =
capture→fully-sent (capture read/convert, encode, FEC+seal, paced send)
against the same anchor the wire pts carries. Clients correlate by pts_ns
and derive network = (received + clock_offset − pts) − host_us; the two
terms tile per frame by construction.
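
A minimal sketch of the per-frame split described above, not the client's actual code. `FrameArrival`, `Split`, `split_frame`, and `correlate` are hypothetical names for illustration; it assumes `clock_offset_ns` maps the client's receive clock onto the host capture clock.

```rust
use std::collections::HashMap;

/// Hypothetical client-side record of one AU's arrival (illustration only).
#[derive(Clone, Copy, Debug)]
pub struct FrameArrival {
    /// Capture stamp carried on the wire (host capture clock), ns.
    pub pts_ns: u64,
    /// Local instant the AU was fully received (client clock), ns.
    pub received_ns: u64,
}

/// The two per-frame terms, both in µs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Split {
    pub host_us: u64,
    pub network_us: u64,
}

/// network = (received + clock_offset - pts) - host_us.
/// Returns None when the terms would go negative (clock skew, bad sample).
pub fn split_frame(a: FrameArrival, clock_offset_ns: i64, host_us: u32) -> Option<Split> {
    let total_ns = a.received_ns as i128 + clock_offset_ns as i128 - a.pts_ns as i128;
    if total_ns < 0 {
        return None;
    }
    let total_us = (total_ns / 1_000) as u64;
    let network_us = total_us.checked_sub(host_us as u64)?;
    Some(Split { host_us: host_us as u64, network_us })
}

/// Correlate arrivals with (pts_ns, host_us) samples by pts_ns.
/// A frame whose 0xCF datagram was lost contributes no sample.
pub fn correlate(
    arrivals: &[FrameArrival],
    timings: &[(u64, u32)],
    clock_offset_ns: i64,
) -> Vec<Split> {
    let by_pts: HashMap<u64, u32> = timings.iter().copied().collect();
    arrivals
        .iter()
        .filter_map(|a| {
            let host_us = *by_pts.get(&a.pts_ns)?;
            split_frame(*a, clock_offset_ns, host_us)
        })
        .collect()
}
```

Because `network_us` is defined as the remainder after `host_us`, the two terms sum to the capture→received total for every frame that yields a sample.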

Back-compat is free in all four combinations: old clients ignore unknown
datagram tags, old hosts ignore unknown cap bits (client keeps the combined
stage). The hardened data-plane format is untouched — this rides the
established QUIC side-plane pattern (0xC8…0xCE). NativeClient ORs the bit
in unconditionally and exposes next_host_timing(); the C ABI gains
PunktfunkHostTiming + punktfunk_connection_next_host_timing (additive).
The synthetic host emits 0xCF too, so pure-loopback protocol tests cover
the plane.
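
The four old/new combinations can be modelled in a few lines. This is an illustrative sketch only: `host_emits` and `client_split_available` are hypothetical helpers, and only the two constant values come from this commit.

```rust
/// Values as stated in this commit.
const VIDEO_CAP_HOST_TIMING: u8 = 0x08;
const HOST_TIMING_MAGIC: u8 = 0xCF;

/// Hypothetical host model: an older build (`knows_bit == false`) ignores the
/// cap bit, a newer one emits 0xCF only when the client's Hello set it.
fn host_emits(knows_bit: bool, hello_video_caps: u8) -> Option<u8> {
    let requested = (hello_video_caps & VIDEO_CAP_HOST_TIMING) != 0;
    (knows_bit && requested).then_some(HOST_TIMING_MAGIC)
}

/// Hypothetical end-to-end check: does the client get a host/network split?
/// An older client neither sets the bit nor understands the tag, so it keeps
/// the combined stage in every case.
fn client_split_available(new_client: bool, new_host: bool) -> bool {
    let caps = if new_client { VIDEO_CAP_HOST_TIMING } else { 0 };
    match host_emits(new_host, caps) {
        Some(tag) => new_client && tag == HOST_TIMING_MAGIC,
        None => false,
    }
}
```

Only the new-client/new-host pairing yields the split; the other three fall back to the combined stage without any negotiation failure.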

The probe reports the split (host_p50/p95_us · net_p50/p95_us) and is our
direct analogue of Sunshine's "host processing latency" — ours additionally
includes the paced send.
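
For readers unfamiliar with the p50/p95 columns, here is one common way to compute them. This is a nearest-rank sketch under the assumption that samples are collected in µs; the probe's actual percentile method is not shown in this commit, and `percentile_us` is a hypothetical name.

```rust
/// Nearest-rank percentile over µs samples (assumed method, for illustration).
/// Returns None for an empty slice or a percentile outside 0..=100.
pub fn percentile_us(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: smallest 1-based rank covering at least p percent of samples.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Percentiles are taken per column, so `host_p50 + net_p50` only approximates the capture→received p50 even though the terms tile exactly per frame.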

Validated on loopback (synthetic host + probe, debug build): 240/240 AUs
matched, host_p50 6.5 ms + net_p50 6.4 ms ≈ capture→received p50 13.0 ms.
Core suite + new 0xCF roundtrip/truncation test green; host+core+probe
clippy clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 21:22:12 +00:00
parent 09a5957c6d
commit 449a67ce8d
6 changed files with 314 additions and 4 deletions
+59
@@ -635,6 +635,22 @@ impl PunktfunkHdrMeta {
}
}
/// One access unit's host-side processing time ([`punktfunk_connection_next_host_timing`]):
/// capture → fully sent, i.e. the whole host pipeline (capture read/convert, encode, FEC+seal,
/// paced send). Correlate to the AU whose `PunktfunkFrame::pts_ns` equals `pts_ns`, then
/// `network = (received_instant + clock_offset - pts_ns) - host_us`: the unified stats HUD's
/// `host` / `network` split (design/stats-unification.md Phase 2). Best-effort: a lost datagram
/// means that frame simply contributes no sample.
#[cfg(feature = "quic")]
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PunktfunkHostTiming {
/// The AU's capture stamp (host capture clock — matches `PunktfunkFrame::pts_ns` exactly).
pub pts_ns: u64,
/// Host capture→sent duration, µs.
pub host_us: u32,
}
/// `PunktfunkRichInput::kind` — a touchpad contact (`finger`/`active`/`x`/`y` valid).
pub const PUNKTFUNK_RICH_TOUCHPAD: u8 = 1;
/// `PunktfunkRichInput::kind` — a motion sample (`gyro`/`accel` valid).
@@ -1759,6 +1775,49 @@ pub unsafe extern "C" fn punktfunk_connection_next_hdr_meta(
})
}
/// Pull the next per-AU host timing (0xCF) into `*out`: the host's capture→sent duration for one
/// access unit, correlated to the AU by `pts_ns` (see [`PunktfunkHostTiming`]).
/// [`PunktfunkStatus::NoFrame`] on timeout, [`PunktfunkStatus::Closed`] once the session ended.
/// A stats consumer drains this non-blockingly (`timeout_ms = 0`) alongside its frame samples;
/// an older host never emits any — keep showing the combined `host+network` stage then. Same
/// threading rules as [`punktfunk_connection_next_rumble`] (one puller, may run alongside the
/// other planes).
///
/// # Safety
/// `c` is a valid connection handle; `out` is writable for one `PunktfunkHostTiming`.
#[cfg(feature = "quic")]
#[no_mangle]
pub unsafe extern "C" fn punktfunk_connection_next_host_timing(
c: *mut PunktfunkConnection,
out: *mut PunktfunkHostTiming,
timeout_ms: u32,
) -> PunktfunkStatus {
guard(|| {
let c = match unsafe { c.as_ref() } {
Some(c) => c,
None => return PunktfunkStatus::NullPointer,
};
if out.is_null() {
return PunktfunkStatus::NullPointer;
}
match c
.inner
.next_host_timing(std::time::Duration::from_millis(timeout_ms as u64))
{
Ok(t) => {
unsafe {
*out = PunktfunkHostTiming {
pts_ns: t.pts_ns,
host_us: t.host_us,
}
};
PunktfunkStatus::Ok
}
Err(e) => e.status(),
}
})
}
/// Read the session's resolved colour signalling + encode bit depth (from the host's Welcome).
/// Each out pointer is filled when non-NULL: `primaries`/`transfer`/`matrix` are CICP code points
/// (BT.709 = 1; BT.2020 = 9; PQ transfer = 16, HLG = 18; BT.2020-NCL matrix = 9), `full_range` is
+37 -2
@@ -140,6 +140,11 @@ const HIDOUT_QUEUE: usize = 32;
/// and low-rate (one on start, re-sent on mastering changes / keyframes); a small ring is ample.
const HDR_META_QUEUE: usize = 8;
/// Host-timing plane depth (0xCF, one datagram per AU). Sized for a 240 fps stream whose stats
/// consumer drains once per second with headroom; overflow drops the newest sample (try_send) —
/// harmless, it's per-frame observability, not state.
const HOST_TIMING_QUEUE: usize = 512;
/// One Opus packet from the host's audio datagram stream (48 kHz stereo, 5 ms frames).
#[derive(Clone, Debug)]
pub struct AudioPacket {
@@ -161,6 +166,9 @@ pub struct NativeClient {
hidout: Mutex<Receiver<HidOutput>>,
/// Inbound static HDR metadata (ST.2086 mastering + content light level) — 0xCE datagrams.
hdr_meta: Mutex<Receiver<HdrMeta>>,
/// Inbound per-AU host capture→send timings — 0xCF datagrams (the client always advertises
/// [`quic::VIDEO_CAP_HOST_TIMING`]; an older host simply never sends any).
host_timing: Mutex<Receiver<crate::quic::HostTiming>>,
input_tx: tokio::sync::mpsc::UnboundedSender<InputEvent>,
/// Outbound mic frames `(seq, pts_ns, opus)` → encoded as 0xCB datagrams by the worker.
mic_tx: tokio::sync::mpsc::UnboundedSender<(u32, u64, Vec<u8>)>,
@@ -315,6 +323,8 @@ impl NativeClient {
let (rumble_tx, rumble_rx) = std::sync::mpsc::sync_channel::<(u16, u16, u16)>(RUMBLE_QUEUE);
let (hidout_tx, hidout_rx) = std::sync::mpsc::sync_channel::<HidOutput>(HIDOUT_QUEUE);
let (hdr_meta_tx, hdr_meta_rx) = std::sync::mpsc::sync_channel::<HdrMeta>(HDR_META_QUEUE);
let (host_timing_tx, host_timing_rx) =
std::sync::mpsc::sync_channel::<crate::quic::HostTiming>(HOST_TIMING_QUEUE);
let (input_tx, input_rx) = tokio::sync::mpsc::unbounded_channel::<InputEvent>();
let (mic_tx, mic_rx) = tokio::sync::mpsc::unbounded_channel::<(u32, u64, Vec<u8>)>();
let (rich_input_tx, rich_input_rx) = tokio::sync::mpsc::unbounded_channel::<RichInput>();
@@ -370,6 +380,7 @@ impl NativeClient {
rumble_tx,
hidout_tx,
hdr_meta_tx,
host_timing_tx,
input_rx,
mic_rx,
rich_input_rx,
@@ -412,6 +423,7 @@ impl NativeClient {
rumble: Mutex::new(rumble_rx),
hidout: Mutex::new(hidout_rx),
hdr_meta: Mutex::new(hdr_meta_rx),
host_timing: Mutex::new(host_timing_rx),
input_tx,
mic_tx,
rich_input_tx,
@@ -715,6 +727,20 @@ impl NativeClient {
}
}
/// Pull the next per-AU host timing (0xCF): the host's capture→sent duration for one access
/// unit, correlated to the AU by `pts_ns`. Feeds the unified stats HUD's `host` / `network`
/// split (`network = (received + clock_offset - pts) - host_us`); a stats consumer should
/// drain this non-blockingly alongside its frame samples. An older host never sends any —
/// the HUD then keeps the combined `host+network` stage. Same timeout/closed semantics as
/// [`NativeClient::next_hidout`].
pub fn next_host_timing(&self, timeout: Duration) -> Result<crate::quic::HostTiming> {
match self.host_timing.lock().unwrap().recv_timeout(timeout) {
Ok(t) => Ok(t),
Err(RecvTimeoutError::Timeout) => Err(PunktfunkError::NoFrame),
Err(RecvTimeoutError::Disconnected) => Err(PunktfunkError::Closed),
}
}
/// Queue one input event for delivery as a QUIC datagram.
pub fn send_input(&self, ev: &InputEvent) -> Result<()> {
self.input_tx.send(*ev).map_err(|_| PunktfunkError::Closed)
@@ -768,6 +794,7 @@ struct WorkerArgs {
rumble_tx: SyncSender<(u16, u16, u16)>,
hidout_tx: SyncSender<HidOutput>,
hdr_meta_tx: SyncSender<HdrMeta>,
host_timing_tx: SyncSender<crate::quic::HostTiming>,
input_rx: tokio::sync::mpsc::UnboundedReceiver<InputEvent>,
mic_rx: tokio::sync::mpsc::UnboundedReceiver<(u32, u64, Vec<u8>)>,
rich_input_rx: tokio::sync::mpsc::UnboundedReceiver<RichInput>,
@@ -803,6 +830,7 @@ async fn worker_main(args: WorkerArgs) {
rumble_tx,
hidout_tx,
hdr_meta_tx,
host_timing_tx,
mut input_rx,
mut mic_rx,
mut rich_input_rx,
@@ -860,8 +888,10 @@ async fn worker_main(args: WorkerArgs) {
launch: launch.clone(),
// The embedder's decode/present caps (e.g. the Windows client advertises
// VIDEO_CAP_10BIT | VIDEO_CAP_HDR). The host only upgrades to a 10-bit / HDR encode
-// when the matching bit is set, so `0` stays an 8-bit BT.709 stream.
-video_caps,
+// when the matching bit is set, so `0` stays an 8-bit BT.709 stream. HOST_TIMING is
+// OR'd in unconditionally: every NativeClient build demuxes the 0xCF plane, and the
+// bit only asks the host for observability datagrams (never changes the encode).
+video_caps: video_caps | crate::quic::VIDEO_CAP_HOST_TIMING,
// Requested surround channel count; the host echoes the resolved value in Welcome.
audio_channels,
// The codecs this client can decode + its soft preference (0 = auto). The host
@@ -1099,6 +1129,11 @@ async fn worker_main(args: WorkerArgs) {
let _ = hdr_meta_tx.try_send(m);
}
}
Some(&crate::quic::HOST_TIMING_MAGIC) => {
if let Some(t) = crate::quic::decode_host_timing_datagram(&d) {
let _ = host_timing_tx.try_send(t);
}
}
_ => {} // unknown tag — a newer host; ignore
}
}
+70
@@ -114,6 +114,13 @@ pub const VIDEO_CAP_HDR: u8 = 0x02;
/// [`Welcome::chroma_format`] reflects the real resolved value. Independent of 10-bit/HDR (4:4:4 is a
/// chroma decision, bit depth is a depth decision; the two may combine where the hardware allows).
pub const VIDEO_CAP_444: u8 = 0x04;
/// [`Hello::video_caps`] bit: the client consumes per-AU host-timing datagrams
/// ([`HOST_TIMING_MAGIC`], 0xCF) — the host's capture→send duration per frame, letting the client
/// split its `host+network` latency stage into `host` and `network`
/// (design/stats-unification.md Phase 2). The host emits 0xCF ONLY when this bit is set (an older
/// host ignores it and simply never sends any); a client that doesn't set it keeps the combined
/// stage. Purely observability — never changes what the host encodes.
pub const VIDEO_CAP_HOST_TIMING: u8 = 0x08;
/// [`Hello::video_codecs`] bit: the client can decode H.264 / AVC. The GPU-less **software**
/// encode path (openh264) emits H.264, so a client that wants to stream from a software host MUST
@@ -1601,6 +1608,50 @@ pub fn decode_hdr_meta_datagram(b: &[u8]) -> Option<HdrMeta> {
})
}
/// Per-AU host-timing datagram tag, host → client (see [`HostTiming`]). Next tag after
/// [`HDR_META_MAGIC`]. Emitted once per access unit, right after its last packet left the host's
/// socket, and only when the client advertised [`VIDEO_CAP_HOST_TIMING`].
pub const HOST_TIMING_MAGIC: u8 = 0xCF;
/// One access unit's host-side processing time: capture → fully sent (the whole host pipeline —
/// capture read/convert, encode, FEC+seal, paced send). The client correlates it to the AU by
/// `pts_ns` (the AU's capture stamp, unique per frame) and derives
/// `network = (received + clock_offset - pts_ns) - host_us`, so the unified-stats equation's
/// `host+network` stage splits into two per-frame-tiling terms. Best-effort like every side-plane
/// datagram: a lost 0xCF just means that frame contributes no host/network sample.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HostTiming {
/// The AU's capture stamp (host capture clock — matches the AU's `pts_ns` exactly).
pub pts_ns: u64,
/// Host capture→sent duration, µs (saturated at `u32::MAX` ≈ 71 min — far past the 10 s
/// client-side sanity clamp anyway).
pub host_us: u32,
}
/// Wire length of a [`HOST_TIMING_MAGIC`] datagram: tag + u64 pts + u32 µs = 13 bytes.
const HOST_TIMING_LEN: usize = 1 + 8 + 4;
/// Encode a [`HostTiming`] into a [`HOST_TIMING_MAGIC`] datagram.
pub fn encode_host_timing_datagram(t: &HostTiming) -> Vec<u8> {
let mut b = Vec::with_capacity(HOST_TIMING_LEN);
b.push(HOST_TIMING_MAGIC);
b.extend_from_slice(&t.pts_ns.to_le_bytes());
b.extend_from_slice(&t.host_us.to_le_bytes());
b
}
/// Parse a [`HOST_TIMING_MAGIC`] datagram → [`HostTiming`]. `None` on bad tag or a short buffer
/// (the fixed length bounds every read before it happens).
pub fn decode_host_timing_datagram(b: &[u8]) -> Option<HostTiming> {
if b.len() < HOST_TIMING_LEN || b[0] != HOST_TIMING_MAGIC {
return None;
}
Some(HostTiming {
pts_ns: u64::from_le_bytes(b[1..9].try_into().unwrap()),
host_us: u32::from_le_bytes(b[9..13].try_into().unwrap()),
})
}
/// Async framed-message IO over a quinn stream (`u16 LE length || payload`).
pub mod io {
/// Read one framed message (bounded at 64 KiB — control messages are tiny).
@@ -2189,6 +2240,25 @@ mod tests {
assert_eq!(decode_hdr_meta_datagram(&bad), None);
}
#[test]
fn host_timing_datagram_roundtrip_and_truncation() {
let t = HostTiming {
pts_ns: 1_751_500_000_123_456_789, // a realistic mid-2025 CLOCK_REALTIME capture stamp
host_us: 4_321,
};
let d = encode_host_timing_datagram(&t);
assert_eq!(d[0], HOST_TIMING_MAGIC);
assert_eq!(d.len(), 13);
assert_eq!(decode_host_timing_datagram(&d), Some(t));
// Truncated buffers and a wrong tag are rejected (never partially read).
for n in 0..d.len() {
assert_eq!(decode_host_timing_datagram(&d[..n]), None);
}
let mut bad = d.clone();
bad[0] = HDR_META_MAGIC;
assert_eq!(decode_host_timing_datagram(&bad), None);
}
#[test]
fn hello_start_roundtrip() {
let h = Hello {