feat(apple): add cursor capture on iPad
9.3 KiB
title, description
| title | description |
|---|---|
| Apple Stage-2 Presenter (handoff) | Design rationale + open items for the explicit VTDecompressionSession → CAMetalLayer presenter. Implementation shipped; this page is trimmed to the why + what's left. |
Status: SHIPPED as the default presenter (stage-1
AVSampleBufferDisplayLayeris the Metal-unavailable / DEBUG fallback). HDR corrected and 4:4:4 added on top of the proven main-thread present path (the hosting view'sCADisplayLinkdrivesrenderper vsync). Code:clients/apple/Sources/PunktfunkKit/{Stage2Pipeline,MetalVideoPresenter,VideoDecoder,Stage444Probe,LatencyMeter}.swift. This doc is trimmed to design rationale + open items — the shipped.swiftcode is the source of truth for the decode/present/measurement walkthrough.HDR (the "too bright" fix). The presenter renders to a separate CAMetalLayer drawable, so the mastering metadata that was attached to the source
CVPixelBufferwas never composited — and with no reference-white anchor the system rendered the PQ signal far too bright. The fix is to keep the PQ-passthrough shader (BT.2020 limited→full → PQ R′G′B′ as-is) and put the anchor on the layer:colorspace = itur_2100_PQ,wantsExtendedDynamicRangeContent = true(on all platforms — the old#if os(macOS)guard left iOS/tvOS EDR half-engaged), andedrMetadata = CAEDRMetadata.hdr10(displayInfo:contentInfo:opticalOutputScale: 203). 203 nits = BT.2408 HDR reference white anchors diffuse white at EDR 1.0; a larger value renders dimmer. The mastering/CLL blobs (host0xCEdatagram) now refineedrMetadata(drained by the pump,setHdrMetahops the layer write to main) rather than being attached to a never-composited source buffer. Needs on-glass validation on a real EDR panel.Mid-session SDR↔HDR. The control-plane colour (
connection.isHDR, from the Welcome) is fixed per session, but the host can re-init its encoder mid-session (a game entering HDR), so the HEVC VUI — and the decoder'sframe.isHDR— flips. The presenter follows the decoded frame, not the latched session flag:rendercalls the idempotentconfigure(hdr:)every frame, so on a flip it reconfigures the layer (per-mode pixel formatbgra8UnormSDR /rgba16FloatHDR, colorspace, EDR) and selects the matching shader — all synchronously on the main thread (the present path is main-thread, so no cross-thread hop is needed). The last0xCEgrade is cached so an SDR→HDR reconfigure re-applies the real mastering metadata instead of the bare anchor. The pump drains0xCEunconditionally (not gated on the Welcome flag) so a session that starts SDR still gets mastering metadata when it goes HDR. A ≤2-frame transition flash on the rare flip is accepted.Pacing. The hosting view owns a main-runloop
CADisplayLink(a weakDisplayLinkProxybreaks the retain cycle) that callsrenderTickonce per vsync.renderTickpops the newest ready frame from the 1-slot ring (older undisplayed frames dropped — lowest latency, no smoothing buffer) and, if there is one, draws it via manuallayer.nextDrawable()and presents at the next vsync; on an idle vsync (no new frame) it does nothing and the compositor holds the last presented drawable (no idle re-render — matters at 5K).drawableSizeis set beforenextDrawable(it doesn't track bounds, defaults to 0), so allocation always uses the decoded size.maximumDrawableCount = 3. macOSdisplaySyncEnabled = **false**: the display link is the single pacing source, so leaving the layer's own vsync wait on would also blockpresent/nextDrawableon the main thread and serialize it to the display — the cause of the fullscreen judder; disabling it lets present return promptly. Present is stamped at the display link'stargetTimestampprojected toCLOCK_REALTIME(the actual on-glass instant, <1 vsync after the draw — accurate for the HUD).(History: an off-main
CAMetalDisplayLinkvariant and an off-main blocking-render present thread were both tried and reverted — both measured slower on macOS and iPad than this main-thread display-link path, whose real judder fix was simplydisplaySyncEnabled = false, not moving present off-thread. Measured ~11 ms p50 on the main-thread path.)4:4:4. Chroma, bit-depth, and colorimetry are orthogonal: the decode pixel format is a 2×2 of
(chroma, HDR)→420v/x420/444v/x444(all biplanar, so the existing shaders sample a full-size chroma plane unchanged); the shader is keyed only on HDR. The client advertisesVIDEO_CAP_444only whenStage444Probeconfirms hardware 4:4:4 decode (a hardware-requiredVTDecompressionSessionover an embedded 256×256 4:4:4 keyframe — software 4:4:4 is too slow for real-time; validated on M3:444v/x444produced). A bounded pump backstop ends a 4:4:4 session that persistently fails to decode (gated to 4:4:4 sessions, so 4:2:0 loss-recovery is untouched).
Why stage 2 (design rationale)
The stage-1 presenter feeds compressed HEVC straight into AVSampleBufferDisplayLayer, which
hardware-decodes and presents internally with no per-frame callback — so we can't stamp decode or
present, and we can't hand-pace. Stage-2 takes explicit control: decode with
VTDecompressionSession, present decoded frames through a CAMetalLayer driven by a display link.
Two wins justify the extra machinery:
- ~0.5 refresh off the present tail — the present tail is the biggest client latency term at 60 Hz; display-link-driven present pops the newest-ready frame each vsync instead of letting the layer present on its own internal schedule.
- True decode→present / glass-to-glass measurement — explicit decode-completion and present
timestamps make
capture→presentmeasurable (modulo the still-unmeasured host render→capture term).
All of this is macOS/iOS/tvOS-only — build + validate on a Mac (swift build && swift test, then
live against a Linux host). The host + connector side is already done:
PunktfunkConnection.clockOffsetNs (the connect-time skew offset, host minus client) is what makes the
present timestamp cross-machine valid. skewCorrected stays false when clockOffsetNs == 0 (old host)
— then the numbers are same-host-only.
Architecture pattern (worth recording)
Async VTDecompressionSession callback → 1-slot newest-ready ring → display-link-driven present:
- VT decode is async; the output callback runs on a VT-managed thread — don't block it, just stamp
decode-completion (
CLOCK_REALTIMEns) + enqueue. Retain theCVPixelBufferuntil presented (the ring owns it). - Each vsync pops the newest ready frame and drops older undisplayed ones — low-latency default, no smoothing buffer.
- Three per-frame instants (all
CLOCK_REALTIMEns, all shifted byclockOffsetNsto the host clock): capture→decoded =decodedNs + offset − pts_ns; decode→present =presentedNs − decodedNs(the tail stage-2 shortens); capture→present =presentedNs + offset − pts_ns— the glass-to-glass number.
Open items
- On-glass HDR validation — eyeball
edrMetadata+opticalOutputScale: 203on a real EDR panel (XDR display) against stage-1 side-by-side: diffuse white should sit at SDR-white level with only highlights climbing. The reference white is a single named constant (hdrReferenceWhiteNits) for easy tuning. (Needs a Windows HDR host; the Linux host is 8-bit SDR only.) - On-glass 4:4:4 validation — confirm a
PUNKTFUNK_444host (RTX box) streams a 4:4:4 session the client decodes in hardware (HUD shows the resolved chroma); verify the resolution-ceiling backstop by forcing a too-large 4:4:4 mode. - Glass-to-glass numbers via
tools/latency-probe— close the still-unmeasured host render→capture term and confirm the main-thread display-link present p50 holds at ~11 ms (and isn't regressed by the per-frameconfigure/ HDR-anchor work). - Smoothing / pacing policy — present newest-ready for lowest latency today; an optional even-pacing
policy (
present(_:afterMinimumDuration:)) can come later if frames look uneven. - 4:4:4 runtime downgrade-reconnect — today a persistently-undecodable 4:4:4 session ends cleanly
(the live 4:4:4 decode requires hardware, so a resolution-ceiling miss fails the session create
synchronously and the pump backstop ends it — no black-screen loop); auto-reconnecting at 4:2:0
(dropping
VIDEO_CAP_444) is a future refinement. - HLG —
isHDR/isHDRFormatfold HLG (transfer 18) in with PQ, but the presenter is PQ-only (itur_2100_PQ+hdr10EDR), so an HLG stream would be mis-toned. Latent — no host emits HLG (the stack is BT.2020 PQ only). A real HLG path (itur_2100_HLG, no PQ reference-white anchor) is future work; until then HLG should be treated as out of scope. - Full-range — the shaders hardcode limited→full expansion and the decoder requests the
*VideoRangeformats regardless ofconnection.colorFullRange; VideoToolbox range-converts a full-range source to video range on decode, so it stays self-consistent (mild level compression on genuinely full-range content, which no host emits). Pre-existing; wirecolorFullRangeinto the range constants eventually.