Files
punktfunk/design/apple-stage2-presenter.md
T
enricobuehler 1c04e77293
apple / screenshots (push) Has been cancelled
apple / swift (push) Has been cancelled
ci / docs-site (push) Has been cancelled
ci / bench (push) Has been cancelled
ci / web (push) Has been cancelled
ci / rust (push) Has been cancelled
android-screenshots / screenshots (push) Successful in 2m16s
deb / build-publish (push) Successful in 3m26s
decky / build-publish (push) Successful in 13s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 6s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 6s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 5s
windows-host / package (push) Successful in 6m48s
release / apple (push) Successful in 7m45s
windows-msix / package (arm64, C:\Users\Public\ffmpeg-arm64, aarch64-pc-windows-msvc, C:\t-a64) (push) Successful in 1m22s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 2m37s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
android / android (push) Successful in 9m35s
windows-msix / package (x64, C:\Users\Public\ffmpeg, x86_64-pc-windows-msvc, C:\t) (push) Successful in 1m32s
linux-client-screenshots / screenshots (push) Successful in 2m31s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 8m53s
web-screenshots / screenshots (push) Successful in 2m32s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 8m37s
flatpak / build-publish (push) Failing after 3m47s
docker / deploy-docs (push) Failing after 1m9s
feat(apple): Improve presenter
feat(apple): add cursor capture on iPad
2026-06-30 01:31:48 +02:00

9.3 KiB
Raw Permalink Blame History

title, description
title description
Apple Stage-2 Presenter (handoff) Design rationale + open items for the explicit VTDecompressionSession → CAMetalLayer presenter. Implementation shipped; this page is trimmed to the why + what's left.

Status: SHIPPED as the default presenter (stage-1 AVSampleBufferDisplayLayer is the Metal-unavailable / DEBUG fallback). HDR corrected and 4:4:4 added on top of the proven main-thread present path (the hosting view's CADisplayLink drives render per vsync). Code: clients/apple/Sources/PunktfunkKit/{Stage2Pipeline,MetalVideoPresenter,VideoDecoder,Stage444Probe,LatencyMeter}.swift. This doc is trimmed to design rationale + open items — the shipped .swift code is the source of truth for the decode/present/measurement walkthrough.

HDR (the "too bright" fix). The presenter renders to a separate CAMetalLayer drawable, so the mastering metadata that was attached to the source CVPixelBuffer was never composited — and with no reference-white anchor the system rendered the PQ signal far too bright. The fix is to keep the PQ-passthrough shader (BT.2020 limited→full → PQ RGB as-is) and put the anchor on the layer: colorspace = itur_2100_PQ, wantsExtendedDynamicRangeContent = true (on all platforms — the old #if os(macOS) guard left iOS/tvOS EDR half-engaged), and edrMetadata = CAEDRMetadata.hdr10(displayInfo:contentInfo:opticalOutputScale: 203). 203 nits = BT.2408 HDR reference white anchors diffuse white at EDR 1.0; a larger value renders dimmer. The mastering/CLL blobs (host 0xCE datagram) now refine edrMetadata (drained by the pump, setHdrMeta hops the layer write to main) rather than being attached to a never-composited source buffer. Needs on-glass validation on a real EDR panel.

Mid-session SDR↔HDR. The control-plane colour (connection.isHDR, from the Welcome) is fixed per session, but the host can re-init its encoder mid-session (a game entering HDR), so the HEVC VUI — and the decoder's frame.isHDR — flips. The presenter follows the decoded frame, not the latched session flag: render calls the idempotent configure(hdr:) every frame, so on a flip it reconfigures the layer (per-mode pixel format bgra8Unorm SDR / rgba16Float HDR, colorspace, EDR) and selects the matching shader — all synchronously on the main thread (the present path is main-thread, so no cross-thread hop is needed). The last 0xCE grade is cached so an SDR→HDR reconfigure re-applies the real mastering metadata instead of the bare anchor. The pump drains 0xCE unconditionally (not gated on the Welcome flag) so a session that starts SDR still gets mastering metadata when it goes HDR. A ≤2-frame transition flash on the rare flip is accepted.

Pacing. The hosting view owns a main-runloop CADisplayLink (a weak DisplayLinkProxy breaks the retain cycle) that calls renderTick once per vsync. renderTick pops the newest ready frame from the 1-slot ring (older undisplayed frames dropped — lowest latency, no smoothing buffer) and, if there is one, draws it via manual layer.nextDrawable() and presents at the next vsync; on an idle vsync (no new frame) it does nothing and the compositor holds the last presented drawable (no idle re-render — matters at 5K). drawableSize is set before nextDrawable (it doesn't track bounds, defaults to 0), so allocation always uses the decoded size. maximumDrawableCount = 3. macOS displaySyncEnabled = **false**: the display link is the single pacing source, so leaving the layer's own vsync wait on would also block present/nextDrawable on the main thread and serialize it to the display — the cause of the fullscreen judder; disabling it lets present return promptly. Present is stamped at the display link's targetTimestamp projected to CLOCK_REALTIME (the actual on-glass instant, <1 vsync after the draw — accurate for the HUD).

(History: an off-main CAMetalDisplayLink variant and an off-main blocking-render present thread were both tried and reverted — both measured slower on macOS and iPad than this main-thread display-link path, whose real judder fix was simply displaySyncEnabled = false, not moving present off-thread. Measured ~11 ms p50 on the main-thread path.)

4:4:4. Chroma, bit-depth, and colorimetry are orthogonal: the decode pixel format is a 2×2 of (chroma, HDR)420v/x420/444v/x444 (all biplanar, so the existing shaders sample a full-size chroma plane unchanged); the shader is keyed only on HDR. The client advertises VIDEO_CAP_444 only when Stage444Probe confirms hardware 4:4:4 decode (a hardware-required VTDecompressionSession over an embedded 256×256 4:4:4 keyframe — software 4:4:4 is too slow for real-time; validated on M3: 444v/x444 produced). A bounded pump backstop ends a 4:4:4 session that persistently fails to decode (gated to 4:4:4 sessions, so 4:2:0 loss-recovery is untouched).

Why stage 2 (design rationale)

The stage-1 presenter feeds compressed HEVC straight into AVSampleBufferDisplayLayer, which hardware-decodes and presents internally with no per-frame callback — so we can't stamp decode or present, and we can't hand-pace. Stage-2 takes explicit control: decode with VTDecompressionSession, present decoded frames through a CAMetalLayer driven by a display link.

Two wins justify the extra machinery:

  • ~0.5 refresh off the present tail — the present tail is the biggest client latency term at 60 Hz; display-link-driven present pops the newest-ready frame each vsync instead of letting the layer present on its own internal schedule.
  • True decode→present / glass-to-glass measurement — explicit decode-completion and present timestamps make capture→present measurable (modulo the still-unmeasured host render→capture term).

All of this is macOS/iOS/tvOS-only — build + validate on a Mac (swift build && swift test, then live against a Linux host). The host + connector side is already done: PunktfunkConnection.clockOffsetNs (the connect-time skew offset, host minus client) is what makes the present timestamp cross-machine valid. skewCorrected stays false when clockOffsetNs == 0 (old host) — then the numbers are same-host-only.

Architecture pattern (worth recording)

Async VTDecompressionSession callback → 1-slot newest-ready ring → display-link-driven present:

  • VT decode is async; the output callback runs on a VT-managed thread — don't block it, just stamp decode-completion (CLOCK_REALTIME ns) + enqueue. Retain the CVPixelBuffer until presented (the ring owns it).
  • Each vsync pops the newest ready frame and drops older undisplayed ones — low-latency default, no smoothing buffer.
  • Three per-frame instants (all CLOCK_REALTIME ns, all shifted by clockOffsetNs to the host clock): capture→decoded = decodedNs + offset pts_ns; decode→present = presentedNs decodedNs (the tail stage-2 shortens); capture→present = presentedNs + offset pts_ns — the glass-to-glass number.

Open items

  • On-glass HDR validation — eyeball edrMetadata + opticalOutputScale: 203 on a real EDR panel (XDR display) against stage-1 side-by-side: diffuse white should sit at SDR-white level with only highlights climbing. The reference white is a single named constant (hdrReferenceWhiteNits) for easy tuning. (Needs a Windows HDR host; the Linux host is 8-bit SDR only.)
  • On-glass 4:4:4 validation — confirm a PUNKTFUNK_444 host (RTX box) streams a 4:4:4 session the client decodes in hardware (HUD shows the resolved chroma); verify the resolution-ceiling backstop by forcing a too-large 4:4:4 mode.
  • Glass-to-glass numbers via tools/latency-probe — close the still-unmeasured host render→capture term and confirm the main-thread display-link present p50 holds at ~11 ms (and isn't regressed by the per-frame configure / HDR-anchor work).
  • Smoothing / pacing policy — present newest-ready for lowest latency today; an optional even-pacing policy (present(_:afterMinimumDuration:)) can come later if frames look uneven.
  • 4:4:4 runtime downgrade-reconnect — today a persistently-undecodable 4:4:4 session ends cleanly (the live 4:4:4 decode requires hardware, so a resolution-ceiling miss fails the session create synchronously and the pump backstop ends it — no black-screen loop); auto-reconnecting at 4:2:0 (dropping VIDEO_CAP_444) is a future refinement.
  • HLGisHDR/isHDRFormat fold HLG (transfer 18) in with PQ, but the presenter is PQ-only (itur_2100_PQ + hdr10 EDR), so an HLG stream would be mis-toned. Latent — no host emits HLG (the stack is BT.2020 PQ only). A real HLG path (itur_2100_HLG, no PQ reference-white anchor) is future work; until then HLG should be treated as out of scope.
  • Full-range — the shaders hardcode limited→full expansion and the decoder requests the *VideoRange formats regardless of connection.colorFullRange; VideoToolbox range-converts a full-range source to video range on decode, so it stays self-consistent (mild level compression on genuinely full-range content, which no host emits). Pre-existing; wire colorFullRange into the range constants eventually.