feat(clients): unified stats vocabulary across every client + Moonlight comparison docs
One stat model everywhere (design/stats-unification.md): four measurement points (capture/received/decoded/displayed), three stages that tile the interval exactly, and a HUD that shows the addition explicitly — end-to-end 14.2 ms p50 · 19.8 p95 · capture→on-glass = host+network 9.8 + decode 2.1 + display 2.3 replacing each client's ad-hoc mix of overlapping absolutes (the Apple HUD's three arrow lines that looked sequential but weren't), mean-vs-median decode times (Windows/Linux), missing same-host-clock flags (Windows/Linux), and three different names for the same capture→received measurement (probe's "reassembled", Apple/Android's "client", Windows/Linux's post-decode "lat"). Per client: Apple threads receivedNs through the VT decode via the frame refcon bit pattern so the decode stage exists at all (stage-1 fallback honestly degrades to a capture→received headline); Windows carries FrameTimes through the existing frame channel to the render thread and adds e2e p50/p95 post-Present; Linux stamps received at AU pop and rides decoded_ns on DecodedFrame to the paintable-set site; Android pairs receipt stamps with MediaCodec output buffers via the codec's pts round-trip (JNI stats array 14→16 doubles, indexes 0-13 unchanged). fps now uniformly counts received AUs; lost/(received+lost) per window, hidden at zero. docs-site gains "Understanding the Stats Overlay": what each line means, why the equation only approximately sums (percentiles), and a line-by-line Moonlight/Sunshine matrix — including that Moonlight has no end-to-end number and its "network latency" is an ENet control RTT, so punktfunk's headline must not be compared against any single Moonlight line. Verified here: linux client + probe + core check/clippy/fmt green, android native cargo-ndk arm64 check green. Pending: Windows CI + on-glass, swift test on the mac, on-device Android. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
@@ -1,23 +1,25 @@
|
||||
// Per-frame latency sampler for the live HUD: records capture->client-receipt latency and drains
|
||||
// percentiles on demand. NSLock rather than an actor — the writer is the non-async pump/arrival
|
||||
// path (same pattern as the app's FrameMeter).
|
||||
// Per-frame latency-stage sampler for the live HUD: records one interval per frame (an end
|
||||
// instant minus a start instant, both CLOCK_REALTIME ns) and drains percentiles on demand.
|
||||
// NSLock rather than an actor — the writers are the non-async pump/decode/present paths (same
|
||||
// pattern as the app's FrameMeter).
|
||||
|
||||
import Foundation
|
||||
|
||||
/// Samples the **capture->client-receipt** latency of each access unit and reports percentiles.
|
||||
/// Samples one **latency stage** per frame and reports percentiles. One instance per stage of the
|
||||
/// unified stats model (design/stats-unification.md):
|
||||
///
|
||||
/// The latency is `now - pts_ns`, where `pts_ns` is the host's capture wall clock (the AU's pts) and
|
||||
/// `now` is the client's `CLOCK_REALTIME` instant the AU was received, shifted by the connect-time
|
||||
/// **clock-skew offset** (`PunktfunkConnection.clockOffsetNs`, host minus client) so the difference
|
||||
/// is valid across machines. `offsetNs == 0` means an old host that didn't answer the skew handshake
|
||||
/// (or genuinely synced clocks) — the number is then only meaningful same-host.
|
||||
/// - `host+network` = capture→received: `record(ptsNs:offsetNs:)` at AU receipt.
|
||||
/// - `decode` = received→decoded and `display` = decoded→displayed: client-local single-clock
|
||||
/// stages — `record(ptsNs:atNs:offsetNs:)` with the start instant as `ptsNs` and `offsetNs: 0`.
|
||||
/// - `end-to-end` = capture→displayed, measured directly (never summed from the stages):
|
||||
/// `record(ptsNs:atNs:offsetNs:)` at present.
|
||||
///
|
||||
/// SCOPE (stage-1 presenter): this covers host capture -> encode -> FEC -> network -> reassembly ->
|
||||
/// decrypt -> handed to the presenter. It does **not** include the on-device VideoToolbox decode or
|
||||
/// the `AVSampleBufferDisplayLayer` present — that layer decodes and presents compressed samples
|
||||
/// internally with no per-frame callback. True decode->present (the full glass-to-glass) needs the
|
||||
/// stage-2 presenter (`VTDecompressionSession` decode-completion + `CAMetalLayer`/display-link
|
||||
/// present); this meter is the substrate it will extend.
|
||||
/// For the host-anchored intervals (capture→…) the sample is `end + offset - pts_ns`, where
|
||||
/// `pts_ns` is the host's capture wall clock (the AU's pts) and the connect-time **clock-skew
|
||||
/// offset** (`PunktfunkConnection.clockOffsetNs`, host minus client) makes the difference valid
|
||||
/// across machines. `offsetNs == 0` means an old host that didn't answer the skew handshake (or
|
||||
/// genuinely synced clocks) — the number is then only meaningful same-host, and the HUD tags the
|
||||
/// end-to-end line `(same-host clock)`.
|
||||
public final class LatencyMeter: @unchecked Sendable {
|
||||
private let lock = NSLock()
|
||||
private var samplesUs: [Int64] = []
|
||||
@@ -34,12 +36,16 @@ public final class LatencyMeter: @unchecked Sendable {
|
||||
record(ptsNs: ptsNs, atNs: nowNs, offsetNs: offsetNs)
|
||||
}
|
||||
|
||||
/// Record one frame whose latency is `atNs + offsetNs - ptsNs` — an EXPLICIT client instant
|
||||
/// rather than now. The stage-2 presenter uses this to stamp capture→present at the display
|
||||
/// link's target present time (not the moment the present call ran). All in `CLOCK_REALTIME`.
|
||||
/// Record one frame whose sample is `atNs + offsetNs - ptsNs` — an EXPLICIT end instant
|
||||
/// rather than now. `ptsNs` is the stage's start point: the AU pts for the host-anchored
|
||||
/// intervals, or a client stamp (receivedNs / decodedNs, with `offsetNs: 0`) for the local
|
||||
/// decode/display stages. The stage-2 presenter stamps its present-side samples at the
|
||||
/// display link's target present time (not the moment the present call ran). All in
|
||||
/// `CLOCK_REALTIME`.
|
||||
public func record(ptsNs: UInt64, atNs: Int64, offsetNs: Int64) {
|
||||
let latNs = atNs &+ offsetNs &- Int64(bitPattern: ptsNs)
|
||||
// Drop absurd values (a clock step, a wildly wrong offset, or garbage pts).
|
||||
// Drop absurd values (a clock step, a wildly wrong offset, garbage pts, or a stage whose
|
||||
// start stamp is missing/after its end) — samples are clamped to (0, 10 s).
|
||||
guard latNs > 0, latNs < 10_000_000_000 else { return }
|
||||
lock.lock()
|
||||
samplesUs.append(latNs / 1000)
|
||||
|
||||
@@ -38,8 +38,9 @@ final class SessionPresenter {
|
||||
func start(
|
||||
connection: PunktfunkConnection,
|
||||
baseLayer: AVSampleBufferDisplayLayer,
|
||||
presentMeter: LatencyMeter?,
|
||||
presentTailMeter: LatencyMeter? = nil,
|
||||
endToEndMeter: LatencyMeter?,
|
||||
decodeMeter: LatencyMeter? = nil,
|
||||
displayMeter: LatencyMeter? = nil,
|
||||
makeDisplayLink: (AnyObject, Selector) -> CADisplayLink,
|
||||
onFrame: (@Sendable (AccessUnit) -> Void)?,
|
||||
onSessionEnd: (@Sendable () -> Void)?
|
||||
@@ -59,7 +60,8 @@ final class SessionPresenter {
|
||||
#endif
|
||||
if !forceStage1,
|
||||
let pipeline = Stage2Pipeline(
|
||||
presentMeter: presentMeter, presentTailMeter: presentTailMeter) {
|
||||
endToEndMeter: endToEndMeter, decodeMeter: decodeMeter,
|
||||
displayMeter: displayMeter) {
|
||||
let metal = pipeline.layer
|
||||
// The opaque metal layer composites OVER the AVSampleBufferDisplayLayer base, which
|
||||
// sits idle (un-enqueued) in stage-2. contentsScale + frame are set in layout().
|
||||
|
||||
@@ -1,7 +1,8 @@
|
||||
// Stage-2 presenter orchestrator: a pump thread pulls AUs → VideoDecoder; the decoder's async output
|
||||
// drops the newest decoded frame into a 1-slot ring; the hosting view's display link calls `renderTick`
|
||||
// once per vsync to draw + present the newest ready frame and stamp capture→present. Mirrors
|
||||
// StreamPump's lifecycle (one per start; cancel is permanent).
|
||||
// once per vsync to draw + present the newest ready frame and stamp the unified latency stages
|
||||
// (end-to-end capture→on-glass, plus the decode and display stage terms —
|
||||
// design/stats-unification.md). Mirrors StreamPump's lifecycle (one per start; cancel is permanent).
|
||||
//
|
||||
// Threading: the pump runs on its own thread; the decoder callback on a VT thread; `renderTick` +
|
||||
// `start`/`stop` on the MAIN thread (the view's CADisplayLink fires there). Only the ring (lock-guarded)
|
||||
@@ -40,8 +41,8 @@ public final class Stage2Pipeline {
|
||||
private let ring = ReadyRing()
|
||||
private let presenter: MetalVideoPresenter
|
||||
private let decoder: VideoDecoder
|
||||
private let presentMeter: LatencyMeter?
|
||||
private let presentTailMeter: LatencyMeter?
|
||||
private let endToEndMeter: LatencyMeter?
|
||||
private let displayMeter: LatencyMeter?
|
||||
private let recovery = KeyframeRecovery()
|
||||
private var token = StopFlag()
|
||||
private var offsetNs: Int64 = 0
|
||||
@@ -56,28 +57,41 @@ public final class Stage2Pipeline {
|
||||
/// The Metal layer the hosting view installs + sizes.
|
||||
public var layer: CAMetalLayer { presenter.layer }
|
||||
|
||||
/// `presentMeter` records capture→present (the glass-to-glass term); `presentTailMeter`
|
||||
/// records decode-completion→present (the ring wait + render — the tail stage-2 exists to
|
||||
/// shorten). Both optional: metering never gates the presenter choice. Returns nil if Metal
|
||||
/// can't be set up (headless / no GPU) — caller falls back to the stage-1 presenter.
|
||||
public init?(presentMeter: LatencyMeter?, presentTailMeter: LatencyMeter? = nil) {
|
||||
/// Unified-stats meters (design/stats-unification.md): `endToEndMeter` records the headline
|
||||
/// end-to-end (capture→on-glass, skew-corrected); `decodeMeter` the decode stage
|
||||
/// (received→decoded); `displayMeter` the display stage (decoded→on-glass, the ring wait +
|
||||
/// render + vsync — the tail stage-2 exists to shorten). All optional: metering never gates
|
||||
/// the presenter choice. Returns nil if Metal can't be set up (headless / no GPU) — caller
|
||||
/// falls back to the stage-1 presenter.
|
||||
public init?(
|
||||
endToEndMeter: LatencyMeter?,
|
||||
decodeMeter: LatencyMeter? = nil,
|
||||
displayMeter: LatencyMeter? = nil
|
||||
) {
|
||||
guard let presenter = MetalVideoPresenter.make() else { return nil }
|
||||
self.presenter = presenter
|
||||
self.presentMeter = presentMeter
|
||||
self.presentTailMeter = presentTailMeter
|
||||
self.endToEndMeter = endToEndMeter
|
||||
self.displayMeter = displayMeter
|
||||
let ring = ring
|
||||
let recovery = recovery
|
||||
self.decoder = VideoDecoder(
|
||||
onDecoded: { ring.submit($0) },
|
||||
onDecoded: { frame in
|
||||
// Decode stage = received→decoded, both client CLOCK_REALTIME (offset 0 — no
|
||||
// skew applies). Stamped at decode completion, so it covers every decoded frame,
|
||||
// including ones the newest-wins ring drops before present.
|
||||
decodeMeter?.record(
|
||||
ptsNs: UInt64(frame.receivedNs), atNs: frame.decodedNs, offsetNs: 0)
|
||||
ring.submit(frame)
|
||||
},
|
||||
// Async decode failure (a bad P-frame referencing a lost/corrupt IDR): the pump resets to
|
||||
// re-gate on the next IDR, and we ask the host to send one now (infinite GOP — it wouldn't
|
||||
// otherwise come soon). Throttled in KeyframeRecovery.
|
||||
onDecodeError: { _ in recovery.request() })
|
||||
}
|
||||
|
||||
/// Start pulling AUs into the decoder. MAIN THREAD. `onFrame` fires per AU at receipt (capture→client
|
||||
/// meter, exactly as stage-1); `onSessionEnd` on close. `clockOffsetNs` (host minus client) makes the
|
||||
/// present stamp cross-machine valid.
|
||||
/// Start pulling AUs into the decoder. MAIN THREAD. `onFrame` fires per AU at receipt (the
|
||||
/// host+network / capture→received meter, exactly as stage-1); `onSessionEnd` on close.
|
||||
/// `clockOffsetNs` (host minus client) makes the end-to-end stamp cross-machine valid.
|
||||
public func start(
|
||||
connection: PunktfunkConnection,
|
||||
onFrame: (@Sendable (AccessUnit) -> Void)?,
|
||||
@@ -174,14 +188,16 @@ public final class Stage2Pipeline {
|
||||
public func renderTick(targetPresentNs: Int64) {
|
||||
guard let frame = ring.take() else { return }
|
||||
let offsetNs = offsetNs
|
||||
let presentMeter = presentMeter
|
||||
let presentTailMeter = presentTailMeter
|
||||
let endToEndMeter = endToEndMeter
|
||||
let displayMeter = displayMeter
|
||||
let rendered = presenter.render(frame.pixelBuffer, isHDR: frame.isHDR) { presentedNs in
|
||||
let atNs = presentedNs ?? targetPresentNs
|
||||
presentMeter?.record(ptsNs: frame.ptsNs, atNs: atNs, offsetNs: offsetNs)
|
||||
// Present tail = decode-completion → on-glass. Both instants are client
|
||||
// CLOCK_REALTIME, so no skew offset applies.
|
||||
presentTailMeter?.record(ptsNs: UInt64(frame.decodedNs), atNs: atNs, offsetNs: 0)
|
||||
// End-to-end = capture→on-glass, measured directly (skew-corrected via the
|
||||
// connect-time clock offset) — the HUD headline.
|
||||
endToEndMeter?.record(ptsNs: frame.ptsNs, atNs: atNs, offsetNs: offsetNs)
|
||||
// Display stage = decoded → on-glass. Both instants are client CLOCK_REALTIME,
|
||||
// so no skew offset applies.
|
||||
displayMeter?.record(ptsNs: UInt64(frame.decodedNs), atNs: atNs, offsetNs: 0)
|
||||
}
|
||||
if !rendered { ring.putBack(frame) }
|
||||
}
|
||||
|
||||
@@ -61,7 +61,7 @@ public enum Stage444Probe {
|
||||
guard created == noErr, let session else { return false }
|
||||
defer { VTDecompressionSessionInvalidate(session) }
|
||||
|
||||
let au = AccessUnit(data: data, ptsNs: 0, frameIndex: 0, flags: 0)
|
||||
let au = AccessUnit(data: data, ptsNs: 0, frameIndex: 0, flags: 0, receivedNs: 0)
|
||||
guard let sample = AnnexB.sampleBuffer(au: au, format: format, codec: .hevc) else { return false }
|
||||
|
||||
var produced: OSType = 0
|
||||
|
||||
@@ -15,6 +15,10 @@ import VideoToolbox
|
||||
public struct ReadyFrame: @unchecked Sendable {
|
||||
/// Host capture clock (the AU's pts), in nanoseconds.
|
||||
public let ptsNs: UInt64
|
||||
/// Client `CLOCK_REALTIME` instant the AU was received (`AccessUnit.receivedNs`, threaded
|
||||
/// through the decode via the frame refcon), in nanoseconds. 0 when unknown (a caller that
|
||||
/// didn't stamp receipt) — the decode-stage meter then drops the sample via its sanity guard.
|
||||
public let receivedNs: Int64
|
||||
/// Client `CLOCK_REALTIME` instant decode completed, in nanoseconds.
|
||||
public let decodedNs: Int64
|
||||
/// The decoded image — 8-bit NV12 biplanar (SDR) or 10-bit P010 biplanar (HDR), Metal-compatible.
|
||||
@@ -25,13 +29,16 @@ public struct ReadyFrame: @unchecked Sendable {
|
||||
}
|
||||
|
||||
/// The C output callback can't capture context, so VideoToolbox hands it the refcon we set at
|
||||
/// session creation — a pointer back to the owning `VideoDecoder`.
|
||||
/// session creation — a pointer back to the owning `VideoDecoder`. The per-frame refcon carries
|
||||
/// the AU's `receivedNs` as a pointer bit pattern (a scalar smuggled through the C void*, never
|
||||
/// dereferenced) so the decode stage can be computed against decode-completion.
|
||||
private let decoderOutputCallback: VTDecompressionOutputCallback = {
|
||||
refcon, _, status, _, imageBuffer, pts, _ in
|
||||
refcon, frameRefcon, status, _, imageBuffer, pts, _ in
|
||||
guard let refcon else { return }
|
||||
let receivedNs = frameRefcon.map { Int64(Int(bitPattern: $0)) } ?? 0
|
||||
Unmanaged<VideoDecoder>.fromOpaque(refcon)
|
||||
.takeUnretainedValue()
|
||||
.handleDecoded(status: status, imageBuffer: imageBuffer, pts: pts)
|
||||
.handleDecoded(status: status, imageBuffer: imageBuffer, pts: pts, receivedNs: receivedNs)
|
||||
}
|
||||
|
||||
/// Owns a `VTDecompressionSession` rebuilt whenever the format description changes (every IDR /
|
||||
@@ -112,7 +119,9 @@ public final class VideoDecoder: @unchecked Sendable {
|
||||
session,
|
||||
sampleBuffer: sample,
|
||||
flags: [._EnableAsynchronousDecompression],
|
||||
frameRefcon: nil,
|
||||
// The AU's receipt instant rides through as a bit pattern (nil for 0 — the output
|
||||
// callback maps that back to 0); the callback needs it to stamp the decode stage.
|
||||
frameRefcon: UnsafeMutableRawPointer(bitPattern: Int(au.receivedNs)),
|
||||
infoFlagsOut: &infoOut)
|
||||
lock.unlock()
|
||||
if status != noErr {
|
||||
@@ -218,8 +227,11 @@ public final class VideoDecoder: @unchecked Sendable {
|
||||
return true
|
||||
}
|
||||
|
||||
/// VT thread. Stamp decode-completion and enqueue, or report the error.
|
||||
fileprivate func handleDecoded(status: OSStatus, imageBuffer: CVImageBuffer?, pts: CMTime) {
|
||||
/// VT thread. Stamp decode-completion and enqueue, or report the error. `receivedNs` is the
|
||||
/// AU's receipt instant threaded through the frame refcon (0 = unknown).
|
||||
fileprivate func handleDecoded(
|
||||
status: OSStatus, imageBuffer: CVImageBuffer?, pts: CMTime, receivedNs: Int64
|
||||
) {
|
||||
guard status == noErr, let imageBuffer else {
|
||||
onDecodeError(status)
|
||||
return
|
||||
@@ -242,6 +254,8 @@ public final class VideoDecoder: @unchecked Sendable {
|
||||
|| fmt == kCVPixelFormatType_444YpCbCr10BiPlanarVideoRange
|
||||
|| fmt == kCVPixelFormatType_444YpCbCr10BiPlanarFullRange
|
||||
onDecoded(
|
||||
ReadyFrame(ptsNs: ptsNs, decodedNs: decodedNs, pixelBuffer: imageBuffer, isHDR: isHDR))
|
||||
ReadyFrame(
|
||||
ptsNs: ptsNs, receivedNs: receivedNs, decodedNs: decodedNs,
|
||||
pixelBuffer: imageBuffer, isHDR: isHDR))
|
||||
}
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user