Files
punktfunk/design/apple-stage2-presenter.md
T
enricobuehler 133e25849d feat(apple): gamepad UI v2 — controller settings + add host, aurora, macOS
Sources reorganized (client: Home/Session/Settings/Stores/Support/Trust; kit:
Audio/Connection/Gamepad/Input/Support/Video/Views) with the big files split
along the same seams.

The gamepad mode is couch-complete, and now on macOS too (the living-room
Mac case), not just iOS/iPadOS:

- GamepadSettingsView: a console-style, fully controller-navigable settings
  screen (X from the launcher) — up/down moves focus, left/right steps values
  (clamped, boundary thud), A cycles/toggles, B closes; the focused row shows a
  one-line description. Backed by GamepadMenuList, the vertical sibling of
  GamepadCarousel, and SettingsOptions — the option lists hoisted out of
  SettingsView statics and shared by the touch, tvOS and gamepad settings.
- GamepadAddHostView + GamepadKeyboard: register a host end to end with a pad
  — field rows open an on-screen controller keyboard (dpad grid, A types,
  X backspaces, B done); the launcher carousel ends in an Add Host tile, so
  the dead-end "add one with touch first" empty state is gone.
- Launcher polish: contextual hint bar with the pad's real button glyphs,
  controller name + battery chip, one shared console chrome.
- GamepadScreenBackground: an animated aurora (TimelineView-driven drifting
  blobs in the brand's violet family, breathing radii, slow hue shift,
  legibility scrim; freezes under Reduce Motion). Pure SwiftUI on purpose — a
  .metal library only bundles reliably in one of the two build systems (SPM vs
  the xcodeproj's synced folders) these sources compile under.
- macOS port: settings/add-host/library present as sized sheets (a macOS sheet
  takes its content's IDEAL size, and the GeometryReader-driven screens
  collapsed to nothing), NSScreen-based mode lists, scroll indicators .never
  (the "always show scroll bars" setting overrides .hidden), tray scrims so
  scrolled rows dim under the pinned title/hints, extra title clearance, and a
  PUNKTFUNK_FORCE_GAMEPAD_UI=1 dev hook — launcher/settings/add-host/keyboard/
  library render-verified live on a real Mac + LAN hosts.
- GamepadMenuInput: X button support, and (re)start now snapshots held buttons
  so a controller handoff press never fires twice (the B that closed the
  keyboard no longer also cancels the screen underneath).
- Cleanups: one "Connection failed" alert in ContentView instead of one per
  home screen; HostDiscovery.advertises/unsaved shared by both home screens.
- host: can_encode_444 stub for the non-Linux/Windows host build (the macOS
  synthetic-source loopback used by the Swift tests).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:24:44 +02:00

11 KiB
Raw Permalink Blame History

title, description
title description
Apple Stage-2 Presenter (handoff) Design rationale + open items for the explicit VTDecompressionSession → CAMetalLayer presenter. Implementation shipped; this page is trimmed to the why + what's left.

Status: SHIPPED as the default presenter (stage-1 AVSampleBufferDisplayLayer is the Metal-unavailable / DEBUG fallback). HDR corrected and 4:4:4 added on top of the proven main-thread present path (the hosting view's CADisplayLink drives render per vsync). Code: clients/apple/Sources/PunktfunkKit/Video/{Stage2Pipeline,MetalVideoPresenter,VideoDecoder,SessionPresenter,Stage444Probe,LatencyMeter}.swift. This doc is trimmed to design rationale + open items — the shipped .swift code is the source of truth for the decode/present/measurement walkthrough.

HDR (the "too bright" fix). The presenter renders to a separate CAMetalLayer drawable, so the mastering metadata that was attached to the source CVPixelBuffer was never composited — and with no reference-white anchor the system rendered the PQ signal far too bright. The fix is to keep the PQ-passthrough shader (BT.2020 limited→full → PQ RGB as-is) and put the anchor on the layer: colorspace = itur_2100_PQ, wantsExtendedDynamicRangeContent = true (on all platforms — the old #if os(macOS) guard left iOS/tvOS EDR half-engaged), and edrMetadata = CAEDRMetadata.hdr10(displayInfo:contentInfo:opticalOutputScale: 203). 203 nits = BT.2408 HDR reference white anchors diffuse white at EDR 1.0; a larger value renders dimmer. The mastering/CLL blobs (host 0xCE datagram) now refine edrMetadata (drained by the pump, setHdrMeta hops the layer write to main) rather than being attached to a never-composited source buffer. Needs on-glass validation on a real EDR panel.

Mid-session SDR↔HDR. The control-plane colour (connection.isHDR, from the Welcome) is fixed per session, but the host can re-init its encoder mid-session (a game entering HDR), so the HEVC VUI — and the decoder's frame.isHDR — flips. The presenter follows the decoded frame, not the latched session flag: render calls the idempotent configure(hdr:) every frame, so on a flip it reconfigures the layer (per-mode pixel format bgra8Unorm SDR / rgba16Float HDR, colorspace, EDR) and selects the matching shader — all synchronously on the main thread (the present path is main-thread, so no cross-thread hop is needed). The last 0xCE grade is cached so an SDR→HDR reconfigure re-applies the real mastering metadata instead of the bare anchor. The pump drains 0xCE unconditionally (not gated on the Welcome flag) so a session that starts SDR still gets mastering metadata when it goes HDR. A ≤2-frame transition flash on the rare flip is accepted.

Pacing. The hosting view owns a main-runloop CADisplayLink (a weak DisplayLinkProxy breaks the retain cycle; the shared per-session lifecycle lives in SessionPresenter) that calls renderTick once per vsync. renderTick pops the newest ready frame from the 1-slot ring (older undisplayed frames dropped — lowest latency, no smoothing buffer) and, if there is one, draws it via manual layer.nextDrawable() and presents at the next vsync; a frame that could not be rendered (no drawable yet) is put back into the still-empty ring so the next tick retries it (under the infinite GOP a static scene sends no replacement — losing the frame would freeze stale content). On an idle vsync it does nothing and the compositor holds the last presented drawable (no idle re-render — matters at 5K). drawableSize is set before nextDrawable (it doesn't track bounds, defaults to 0) to the layer's pixel size (bounds × contentsScale), so the shader — not the compositor's bilinear — performs the decoded→on-screen scale (bicubic Catmull-Rom luma + siting-corrected bilinear chroma); a native-mode session is exactly 1:1 (the kernel reduces to the identity texel). maximumDrawableCount = 3. On iOS/tvOS SessionPresenter sets the link's preferredFrameRateRange to the negotiated refresh (+ CADisableMinimumFrameDurationOnPhone in Info.plist) — without it ProMotion devices cap the link at 60 Hz and a 120 fps stream presents at half rate; macOS's NSView.displayLink already tracks its display and is left alone. macOS displaySyncEnabled = **false**: the display link is the single pacing source, so leaving the layer's own vsync wait on would also block present/nextDrawable on the main thread and serialize it to the display — the cause of the fullscreen judder; disabling it lets present return promptly. Present is stamped at the drawable's actual presentedTime (addPresentedHandler, converted to CLOCK_REALTIME), falling back to the display link's targetTimestamp projection when the system reports none (a dropped drawable) — so the HUD numbers reflect glass, and a missed vsync shows up instead of being assumed away. The same stamp feeds decode→present (presentTailMeter → the HUD's "decode→present" line), closing the third instant promised below.

(History: an off-main CAMetalDisplayLink variant and an off-main blocking-render present thread were both tried and reverted — both measured slower on macOS and iPad than this main-thread display-link path, whose real judder fix was simply displaySyncEnabled = false, not moving present off-thread. Measured ~11 ms p50 on the main-thread path.)

4:4:4. Chroma, bit-depth, and colorimetry are orthogonal: the decode pixel format is a 2×2 of (chroma, HDR)420v/x420/444v/x444 (all biplanar, so the existing shaders sample a full-size chroma plane unchanged); the shader is keyed only on HDR. The client advertises VIDEO_CAP_444 only when Stage444Probe confirms hardware 4:4:4 decode (a hardware-required VTDecompressionSession over an embedded 256×256 4:4:4 keyframe — software 4:4:4 is too slow for real-time; validated on M3: 444v/x444 produced). A bounded pump backstop ends a 4:4:4 session that persistently fails to decode (gated to 4:4:4 sessions, so 4:2:0 loss-recovery is untouched).

Why stage 2 (design rationale)

The stage-1 presenter feeds compressed HEVC straight into AVSampleBufferDisplayLayer, which hardware-decodes and presents internally with no per-frame callback — so we can't stamp decode or present, and we can't hand-pace. Stage-2 takes explicit control: decode with VTDecompressionSession, present decoded frames through a CAMetalLayer driven by a display link.

Two wins justify the extra machinery:

  • ~0.5 refresh off the present tail — the present tail is the biggest client latency term at 60 Hz; display-link-driven present pops the newest-ready frame each vsync instead of letting the layer present on its own internal schedule.
  • True decode→present / glass-to-glass measurement — explicit decode-completion and present timestamps make capture→present measurable (modulo the still-unmeasured host render→capture term).

All of this is macOS/iOS/tvOS-only — build + validate on a Mac (swift build && swift test, then live against a Linux host). The host + connector side is already done: PunktfunkConnection.clockOffsetNs (the connect-time skew offset, host minus client) is what makes the present timestamp cross-machine valid. skewCorrected stays false when clockOffsetNs == 0 (old host) — then the numbers are same-host-only.

Architecture pattern (worth recording)

Async VTDecompressionSession callback → 1-slot newest-ready ring → display-link-driven present:

  • VT decode is async; the output callback runs on a VT-managed thread — don't block it, just stamp decode-completion (CLOCK_REALTIME ns) + enqueue. Retain the CVPixelBuffer until presented (the ring owns it).
  • Each vsync pops the newest ready frame and drops older undisplayed ones — low-latency default, no smoothing buffer.
  • Three per-frame instants (all CLOCK_REALTIME ns, all shifted by clockOffsetNs to the host clock): capture→decoded = decodedNs + offset pts_ns; decode→present = presentedNs decodedNs (the tail stage-2 shortens); capture→present = presentedNs + offset pts_ns — the glass-to-glass number.

Open items

  • On-glass HDR validation — eyeball edrMetadata + opticalOutputScale: 203 on a real EDR panel (XDR display) against stage-1 side-by-side: diffuse white should sit at SDR-white level with only highlights climbing. The reference white is a single named constant (hdrReferenceWhiteNits) for easy tuning. (Needs a Windows HDR host; the Linux host is 8-bit SDR only.)
  • On-glass 4:4:4 validation — confirm a PUNKTFUNK_444 host (RTX box) streams a 4:4:4 session the client decodes in hardware (HUD shows the resolved chroma); verify the resolution-ceiling backstop by forcing a too-large 4:4:4 mode.
  • Glass-to-glass numbers via tools/latency-probe — close the still-unmeasured host render→capture term and confirm the main-thread display-link present p50 holds at ~11 ms (and isn't regressed by the per-frame configure / HDR-anchor work). The HUD's new decode→present line + the presentedTime-based stamp make the client-side share directly visible now.
  • On-glass validation of the 2026-07 presenter batch — the shader-side scale (drawable at layer pixel size; bicubic luma + chroma-siting offset — compare a resized/fullscreen-on-larger-panel window against stage-1 for sharpness, and check GPU headroom at 5K HDR), the iOS/tvOS preferredFrameRateRange (a 120 fps stream on a ProMotion iPhone/iPad should now present at ~120 — watch the HUD fps), kVTDecompressionPropertyKey_RealTime, and the zero-copy AnnexB → CMBlockBuffer packing (unit/round-trip tested; confirm live).
  • Smoothing / pacing policy — present newest-ready for lowest latency today; an optional even-pacing policy (present(_:afterMinimumDuration:)) can come later if frames look uneven.
  • 4:4:4 runtime downgrade-reconnect — today a persistently-undecodable 4:4:4 session ends cleanly (the live 4:4:4 decode requires hardware, so a resolution-ceiling miss fails the session create synchronously and the pump backstop ends it — no black-screen loop); auto-reconnecting at 4:2:0 (dropping VIDEO_CAP_444) is a future refinement.
  • HLGisHDR/isHDRFormat fold HLG (transfer 18) in with PQ, but the presenter is PQ-only (itur_2100_PQ + hdr10 EDR), so an HLG stream would be mis-toned. Latent — no host emits HLG (the stack is BT.2020 PQ only). A real HLG path (itur_2100_HLG, no PQ reference-white anchor) is future work; until then HLG should be treated as out of scope.
  • Full-range — the shaders hardcode limited→full expansion and the decoder requests the *VideoRange formats regardless of connection.colorFullRange; VideoToolbox range-converts a full-range source to video range on decode, so it stays self-consistent (mild level compression on genuinely full-range content, which no host emits). Pre-existing; wire colorFullRange into the range constants eventually.