--- title: "Apple Stage-2 Presenter (handoff)" description: "Design rationale + open items for the explicit VTDecompressionSession → CAMetalLayer presenter. Implementation shipped; this page is trimmed to the why + what's left." --- > **Status:** SHIPPED as the **default** presenter (stage-1 `AVSampleBufferDisplayLayer` is the > Metal-unavailable / DEBUG fallback). HDR corrected and **4:4:4** added on top of the proven > main-thread present path (the hosting view's `CADisplayLink` drives `render` per vsync). Code: > `clients/apple/Sources/PunktfunkKit/{Stage2Pipeline,MetalVideoPresenter,VideoDecoder,Stage444Probe,LatencyMeter}.swift`. > This doc is trimmed to design rationale + open items — the shipped `.swift` code is the source of > truth for the decode/present/measurement walkthrough. > > **HDR (the "too bright" fix).** The presenter renders to a *separate* CAMetalLayer drawable, so the > mastering metadata that was attached to the source `CVPixelBuffer` was never composited — and with no > reference-white anchor the system rendered the PQ signal far too bright. The fix is to keep the > PQ-passthrough shader (BT.2020 limited→full → PQ R′G′B′ as-is) and put the anchor **on the layer**: > `colorspace = itur_2100_PQ`, `wantsExtendedDynamicRangeContent = true` (on **all** platforms — the old > `#if os(macOS)` guard left iOS/tvOS EDR half-engaged), and > `edrMetadata = CAEDRMetadata.hdr10(displayInfo:contentInfo:opticalOutputScale: 203)`. 203 nits = > BT.2408 HDR reference white anchors diffuse white at EDR 1.0; a larger value renders dimmer. The > mastering/CLL blobs (host `0xCE` datagram) now refine `edrMetadata` (drained by the pump, > `setHdrMeta` hops the layer write to main) rather than being attached to a never-composited source > buffer. **Needs on-glass validation on a real EDR panel.** > > **Mid-session SDR↔HDR.** The control-plane colour (`connection.isHDR`, from the Welcome) is fixed per > session, but the host can re-init its encoder mid-session (a game entering HDR), so the HEVC VUI — and > the decoder's `frame.isHDR` — flips. The presenter follows the **decoded frame**, not the latched > session flag: `render` calls the idempotent `configure(hdr:)` every frame, so on a flip it > reconfigures the layer (per-mode pixel format `bgra8Unorm` SDR / `rgba16Float` HDR, colorspace, EDR) > and selects the matching shader — all synchronously on the main thread (the present path is > main-thread, so no cross-thread hop is needed). The last `0xCE` grade is cached so an SDR→HDR > reconfigure re-applies the real mastering metadata instead of the bare anchor. The pump drains `0xCE` > **unconditionally** (not gated on the Welcome flag) so a session that starts SDR still gets mastering > metadata when it goes HDR. A ≤2-frame transition flash on the rare flip is accepted. > > **Pacing.** The hosting view owns a **main-runloop `CADisplayLink`** (a weak `DisplayLinkProxy` > breaks the retain cycle) that calls `renderTick` once per vsync. `renderTick` pops the **newest** > ready frame from the 1-slot ring (older undisplayed frames dropped — lowest latency, no smoothing > buffer) and, if there is one, draws it via **manual `layer.nextDrawable()`** and presents at the next > vsync; on an idle vsync (no new frame) it does nothing and the compositor holds the last presented > drawable (no idle re-render — matters at 5K). `drawableSize` is set **before** `nextDrawable` (it > doesn't track bounds, defaults to 0), so allocation always uses the decoded size. `maximumDrawableCount > = 3`. macOS `displaySyncEnabled = **false**`: the display link is the single pacing source, so leaving > the layer's own vsync wait on would *also* block `present`/`nextDrawable` on the main thread and > serialize it to the display — the cause of the fullscreen judder; disabling it lets present return > promptly. Present is stamped at the display link's `targetTimestamp` projected to `CLOCK_REALTIME` > (the actual on-glass instant, <1 vsync after the draw — accurate for the HUD). > > *(History: an off-main `CAMetalDisplayLink` variant and an off-main blocking-render present thread > were both tried and **reverted** — both measured slower on macOS *and* iPad than this main-thread > display-link path, whose real judder fix was simply `displaySyncEnabled = false`, not moving present > off-thread. Measured ~11 ms p50 on the main-thread path.)* > > **4:4:4.** Chroma, bit-depth, and colorimetry are orthogonal: the decode pixel format is a 2×2 of > `(chroma, HDR)` → `420v/x420/444v/x444` (all biplanar, so the existing shaders sample a full-size > chroma plane unchanged); the shader is keyed only on HDR. The client advertises `VIDEO_CAP_444` only > when `Stage444Probe` confirms **hardware** 4:4:4 decode (a hardware-required `VTDecompressionSession` > over an embedded 256×256 4:4:4 keyframe — software 4:4:4 is too slow for real-time; validated on M3: > `444v`/`x444` produced). A bounded pump backstop ends a 4:4:4 session that persistently fails to > decode (gated to 4:4:4 sessions, so 4:2:0 loss-recovery is untouched). ## Why stage 2 (design rationale) The **stage-1** presenter feeds compressed HEVC straight into `AVSampleBufferDisplayLayer`, which hardware-decodes **and presents internally with no per-frame callback** — so we can't stamp decode or present, and we can't hand-pace. **Stage-2** takes explicit control: decode with `VTDecompressionSession`, present decoded frames through a `CAMetalLayer` driven by a display link. Two wins justify the extra machinery: - **~0.5 refresh off the present tail** — the present tail is the biggest client latency term at 60 Hz; display-link-driven present pops the newest-ready frame each vsync instead of letting the layer present on its own internal schedule. - **True decode→present / glass-to-glass measurement** — explicit decode-completion and present timestamps make `capture→present` measurable (modulo the still-unmeasured host render→capture term). All of this is **macOS/iOS/tvOS-only** — build + validate on a Mac (`swift build && swift test`, then live against a Linux host). The host + connector side is already done: `PunktfunkConnection.clockOffsetNs` (the connect-time skew offset, host minus client) is what makes the present timestamp cross-machine valid. `skewCorrected` stays false when `clockOffsetNs == 0` (old host) — then the numbers are same-host-only. ## Architecture pattern (worth recording) Async `VTDecompressionSession` callback → **1-slot newest-ready ring** → display-link-driven present: - VT decode is **async**; the output callback runs on a VT-managed thread — don't block it, just stamp decode-completion (`CLOCK_REALTIME` ns) + enqueue. Retain the `CVPixelBuffer` until presented (the ring owns it). - Each vsync pops the **newest** ready frame and drops older undisplayed ones — low-latency default, no smoothing buffer. - Three per-frame instants (all `CLOCK_REALTIME` ns, all shifted by `clockOffsetNs` to the host clock): **capture→decoded** = `decodedNs + offset − pts_ns`; **decode→present** = `presentedNs − decodedNs` (the tail stage-2 shortens); **capture→present** = `presentedNs + offset − pts_ns` — the glass-to-glass number. ## Open items - **On-glass HDR validation** — eyeball `edrMetadata` + `opticalOutputScale: 203` on a real EDR panel (XDR display) against stage-1 side-by-side: diffuse white should sit at SDR-white level with only highlights climbing. The reference white is a single named constant (`hdrReferenceWhiteNits`) for easy tuning. (Needs a Windows HDR host; the Linux host is 8-bit SDR only.) - **On-glass 4:4:4 validation** — confirm a `PUNKTFUNK_444` host (RTX box) streams a 4:4:4 session the client decodes in hardware (HUD shows the resolved chroma); verify the resolution-ceiling backstop by forcing a too-large 4:4:4 mode. - **Glass-to-glass numbers via `tools/latency-probe`** — close the still-unmeasured host render→capture term and confirm the main-thread display-link present p50 holds at ~11 ms (and isn't regressed by the per-frame `configure` / HDR-anchor work). - **Smoothing / pacing policy** — present newest-ready for lowest latency today; an optional even-pacing policy (`present(_:afterMinimumDuration:)`) can come later if frames look uneven. - **4:4:4 runtime downgrade-reconnect** — today a persistently-undecodable 4:4:4 session ends cleanly (the live 4:4:4 decode requires hardware, so a resolution-ceiling miss fails the session create *synchronously* and the pump backstop ends it — no black-screen loop); auto-reconnecting at 4:2:0 (dropping `VIDEO_CAP_444`) is a future refinement. - **HLG** — `isHDR`/`isHDRFormat` fold HLG (transfer 18) in with PQ, but the presenter is PQ-only (`itur_2100_PQ` + `hdr10` EDR), so an HLG stream would be mis-toned. Latent — no host emits HLG (the stack is BT.2020 **PQ** only). A real HLG path (`itur_2100_HLG`, no PQ reference-white anchor) is future work; until then HLG should be treated as out of scope. - **Full-range** — the shaders hardcode limited→full expansion and the decoder requests the `*VideoRange` formats regardless of `connection.colorFullRange`; VideoToolbox range-converts a full-range source to video range on decode, so it stays self-consistent (mild level compression on genuinely full-range content, which no host emits). Pre-existing; wire `colorFullRange` into the range constants eventually.