Apple Stage-2 Presenter (handoff)

Implementation plan for the explicit VTDecompressionSession → CAMetalLayer presenter: hand-paced present + true decode→present (glass-to-glass) measurement. Written so a Mac agent can pick it up.

Status update: the stage-2 presenter described here has since been built and live-validated, shipping behind an opt-in flag (AVSampleBufferDisplayLayer remains the default known-good path). This page is preserved as the implementation/handoff record for that work.

This page is the implementation plan for the stage-2 Apple presenter. The stage-1 presenter feeds compressed HEVC straight into AVSampleBufferDisplayLayer, which hardware-decodes and presents internally with no per-frame callback, so we can't stamp decode or present, and we can't hand-pace. Stage-2 takes explicit control: decode with VTDecompressionSession, then present decoded frames through a CAMetalLayer driven by a display link. Two wins: roughly half a refresh interval off the present tail (about 8 ms at 60 Hz, where it is the biggest client latency term) and true decode→present / glass-to-glass numbers.

All of this is macOS/iOS/tvOS-only — build + validate on a Mac (swift build && swift test, then live against a Linux host). The host + connector side is already done: PunktfunkConnection.clockOffsetNs (the connect-time skew offset, host minus client) is what makes the present timestamp cross-machine valid. See Status and roadmap §12.

Where it plugs into the existing code

| Existing (stage-1) | Stage-2 change |
| --- | --- |
| `StreamPump` pulls AUs → `AnnexB.sampleBuffer` → `layer.enqueue` (compressed) | A `Stage2Pump` (or a mode flag on `StreamPump`) feeds AUs to `VTDecompressionSessionDecodeFrame` instead |
| `StreamView`/`StreamViewIOS` host an `AVSampleBufferDisplayLayer` | Host a `CAMetalLayer` (+ a display link); keep the input-capture + HUD overlay unchanged |
| `AnnexB.formatDescription(fromIDR:)` builds the format desc, refreshed on every IDR | Reused: it's the `VTDecompressionSession`'s format description; recreate the session when it changes |
| `LatencyMeter` records capture→client-receipt at `onFrame` | Extend to record decode-completion and present stages (below) |

Keep stage-1 behind a UserDefaults flag (e.g. punktfunk.presenter = "stage1" | "stage2") so a regression can fall back — AVSampleBufferDisplayLayer is the known-good path.
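
A minimal sketch of that fallback flag, assuming the key and values suggested above. `PresenterMode`, `parse`, and `resolve` are hypothetical names, not existing code; anything unset or unrecognised resolves to stage-1, the known-good path.

```swift
import Foundation

/// Which presenter to run. Raw values match the strings suggested above.
enum PresenterMode: String {
    case stage1
    case stage2

    /// UserDefaults key from the plan ("e.g." in the text, so treat it as provisional).
    static let defaultsKey = "punktfunk.presenter"

    /// Unset or unknown values fall back to stage-1.
    static func parse(_ raw: String?) -> PresenterMode {
        guard let raw = raw, let mode = PresenterMode(rawValue: raw) else { return .stage1 }
        return mode
    }

    /// Reads the flag from the given defaults store.
    static func resolve(from defaults: UserDefaults = .standard) -> PresenterMode {
        parse(defaults.string(forKey: defaultsKey))
    }
}
```

With this shape, a regression can be worked around by clearing the key (or setting it back to `stage1`) without a rebuild.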

Decode: VTDecompressionSession

Create the session from the IDR's CMVideoFormatDescription (AnnexB.formatDescription(fromIDR:)):

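var session: VTDecompressionSession?
// callbackRecord: a VTDecompressionOutputCallbackRecord built around the C callback (name is illustrative)
let status = VTDecompressionSessionCreate(
  allocator: nil,
  formatDescription: fmt,
  decoderSpecification: nil,           // hardware by default; no need to force
  imageBufferAttributes: [
    kCVPixelBufferMetalCompatibilityKey: true,
    kCVPixelBufferPixelFormatTypeKey:
      kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange, // 8-bit SDR; 10-bit (…10BiPlanar) for HDR later
  ] as [CFString: Any] as CFDictionary,
  outputCallback: &callbackRecord,
  decompressionSessionOut: &session)
// status != noErr means no session was created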

Per AU: build the same CMSampleBuffer as stage-1 (AnnexB.sampleBuffer(au:format:), PTS = au.ptsNs @ 1e9 timescale) and submit:

VTDecompressionSessionDecodeFrame(session,
  sampleBuffer: sampleBuffer,
  flags: ._EnableAsynchronousDecompression,
  frameRefcon: <pts or a boxed context>, infoFlagsOut: nil)

The output callback delivers (status, infoFlags, imageBuffer: CVImageBuffer?, presentationTimeStamp, …). presentationTimeStamp is au.ptsNs (the host capture clock). Stamp decode-completion here (CLOCK_REALTIME ns), retain the CVPixelBuffer, and push {pts, pixelBuffer, decodedNs} into a small NSLock-guarded ring (the "ready" queue) the display link drains.
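
The "ready" queue can be sketched as below. This is an illustration, not the shipped code: `ReadyFrame` and `ReadyQueue` are hypothetical names, `Payload` stands in for the retained CVPixelBuffer so the sketch builds without CoreVideo, and it assumes frames arrive in presentation order (no B-frame reordering). The drop-older-on-pop policy is the newest-ready default described under Present.

```swift
import Foundation

/// One decoded frame waiting for the display link.
struct ReadyFrame<Payload> {
    let ptsNs: UInt64      // host capture clock (au.ptsNs)
    let decodedNs: UInt64  // client CLOCK_REALTIME at decode completion
    let payload: Payload   // stand-in for the retained CVPixelBuffer
}

/// Small NSLock-guarded queue: the decode callback pushes, the display link pops.
final class ReadyQueue<Payload> {
    private let lock = NSLock()
    private var frames: [ReadyFrame<Payload>] = []
    private var droppedCount = 0
    private let capacity: Int

    init(capacity: Int = 4) {
        precondition(capacity > 0)
        self.capacity = capacity
    }

    /// Decode-callback side: evicts the oldest frame when full, never waits on the display.
    func push(_ frame: ReadyFrame<Payload>) {
        lock.lock(); defer { lock.unlock() }
        if frames.count == capacity {
            frames.removeFirst()
            droppedCount += 1
        }
        frames.append(frame)
    }

    /// Display-link side: returns the newest frame and discards any older undisplayed ones.
    func popNewest() -> ReadyFrame<Payload>? {
        lock.lock(); defer { lock.unlock() }
        guard let newest = frames.popLast() else { return nil }
        droppedCount += frames.count
        frames.removeAll(keepingCapacity: true)
        return newest
    }

    /// Frames that were decoded but never shown.
    var dropped: Int {
        lock.lock(); defer { lock.unlock() }
        return droppedCount
    }
}
```

The dropped counter is not in the plan; it is included because a rising count is a cheap signal that decode is outrunning the display.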
IDR / mode change: when AnnexB.formatDescription yields a new desc, check VTDecompressionSessionCanAcceptFormatDescription; if not, finish-and-recreate the session (same trigger stage-1 uses to refresh format). On decoder error (kVTVideoDecoderBadDataErr, etc.) drop to the next IDR — there's no out-of-band extradata; recovery keyframes re-carry the parameter sets.
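
The drop-to-next-IDR rule is a two-state gate. A sketch under assumptions: `IDRGate` is a hypothetical helper, the caller already knows whether an AU is an IDR, and the gate starts closed because the session is only created from an IDR's format description.

```swift
/// Decides whether an access unit should be submitted to the decoder.
struct IDRGate {
    /// True while everything is dropped until the next IDR arrives.
    private(set) var awaitingIDR = true

    /// Call per AU before submitting. Returns true if the AU should be decoded.
    mutating func shouldSubmit(isIDR: Bool) -> Bool {
        if isIDR { awaitingIDR = false }  // recovery keyframes re-carry the parameter sets
        return !awaitingIDR
    }

    /// Call when the decoder reports an error for a frame.
    mutating func decodeFailed() {
        awaitingIDR = true
    }
}
```

Keeping this separate from the session lifecycle means a format change (recreate the session) and a decode error (skip to the next IDR) can be handled independently.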

Present: CAMetalLayer + display link

CAMetalLayer (device = system default, pixelFormat = .bgra8Unorm, framebufferOnly = true, drawableSize = stream WxH). The view: macOS NSView/iOS UIView whose layerClass/backing layer is the CAMetalLayer (mirror StreamView/StreamViewIOS).
Display link drives present: macOS CVDisplayLink (or CADisplayLink on macOS 14+), iOS/tvOS CADisplayLink. Each callback carries the target present timestamp (CVTimeStamp / targetTimestamp).
Each vsync: pop the newest ready frame (drop older undisplayed ones — low-latency default; no smoothing buffer to start), render a fullscreen quad sampling the biplanar YUV (luma + chroma planes via CVMetalTextureCache) with a BT.709 YUV→RGB fragment shader, then commandBuffer.present(drawable) (or present(drawable, atTime:)). Stamp present time for the frame just shown (use the display link's target timestamp converted to CLOCK_REALTIME).
Colorspace: BT.709 8-bit for now (matches the host's SDR). HDR (BT.2020/PQ, 10-bit …10BiPlanar + EDR CAMetalLayer.wantsExtendedDynamicRangeContent) is a later tie-in with the HDR roadmap (§10).
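
For checking the fragment shader's output, a CPU reference of the BT.709 video-range conversion is handy. This is a sketch, not the shader: the function name is hypothetical, it works on one 8-bit sample at a time, and it assumes video-range input (Y in 16...235, Cb/Cr in 16...240), matching the `…VideoRange` pixel format requested from the decoder.

```swift
/// BT.709 video-range Y'CbCr (8-bit) to full-range R'G'B' (8-bit), one sample.
func bt709VideoRangeToRGB(y: UInt8, cb: UInt8, cr: UInt8) -> (r: UInt8, g: UInt8, b: UInt8) {
    // Expand video range to normalised values: luma 0...1, chroma -0.5...0.5.
    let yN = (Double(y) - 16.0) / 219.0
    let cbN = (Double(cb) - 128.0) / 224.0
    let crN = (Double(cr) - 128.0) / 224.0

    // BT.709 coefficients (Kr = 0.2126, Kb = 0.0722).
    let r = yN + 1.5748 * crN
    let g = yN - 0.1873 * cbN - 0.4681 * crN
    let b = yN + 1.8556 * cbN

    // Clamp to 0...1 and quantise to 8 bits.
    func quantize(_ v: Double) -> UInt8 {
        UInt8((min(max(v, 0.0), 1.0) * 255.0).rounded())
    }
    return (quantize(r), quantize(g), quantize(b))
}
```

A shader doing the same math on the luma and chroma planes should agree with this to within a code value or so; larger differences usually mean a full-range/video-range mix-up.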

Cheaper intermediate (2a) if the Metal path is too big in one step

Decode with VTDecompressionSession (gets the decode-completion timestamp = capture→decoded), then wrap the decoded CVPixelBuffer in a CMSampleBuffer and enqueue it into the existing AVSampleBufferDisplayLayer (it accepts uncompressed pixel buffers too). This yields the decode term without a Metal renderer — but not true present (the layer still presents internally). Ship 2a first if useful; 2b (CAMetalLayer + display link) is required for the on-glass present stamp.

Measurement (the whole point)

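Extend LatencyMeter (or add per-stage meters) so each frame records three instants in CLOCK_REALTIME ns: capture (pts_ns, already on the host clock), decode completion, and present. The two client-side stamps are shifted onto the host clock by connection.clockOffsetNs wherever they are compared against pts_ns, which gives three intervals: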

capture→decoded = decodedNs + offset − pts_ns (VideoToolbox decode latency, cross-machine)
decode→present = presentedNs − decodedNs (the present tail stage-2 shortens)
capture→present = presentedNs + offset − pts_ns — the glass-to-glass number (modulo the host render→capture term, still unmeasured; see roadmap §12)
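
The three formulas above, written out as a sketch. `FrameStamps`, `StageLatencies`, and `stageLatencies` are hypothetical names, and the `Int64` nanosecond representation is an assumption; `clockOffsetNs` is host minus client, as stated earlier.

```swift
/// Raw per-frame stamps, all in nanoseconds.
struct FrameStamps {
    let ptsNs: Int64        // host capture clock (au.ptsNs)
    let decodedNs: Int64    // client CLOCK_REALTIME at decode completion
    let presentedNs: Int64  // client CLOCK_REALTIME at present
}

/// Derived per-stage latencies, in nanoseconds.
struct StageLatencies {
    let captureToDecodedNs: Int64
    let decodeToPresentNs: Int64
    let captureToPresentNs: Int64
}

/// clockOffsetNs is host minus client, so adding it moves a client stamp onto the host clock.
func stageLatencies(_ s: FrameStamps, clockOffsetNs: Int64) -> StageLatencies {
    StageLatencies(
        captureToDecodedNs: s.decodedNs + clockOffsetNs - s.ptsNs,
        decodeToPresentNs: s.presentedNs - s.decodedNs,  // both client-side, offset cancels
        captureToPresentNs: s.presentedNs + clockOffsetNs - s.ptsNs
    )
}
```

Worked example: with pts = 1 000 000 000 ns, offset = +500 000 ns, decoded = 1 002 500 000 ns and presented = 1 010 500 000 ns, the stages come out at 3 ms, 8 ms and 11 ms, and the first two sum to the third.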

Surface capture→present p50/p95 in the HUD (extend the existing model.latency* line in ContentView). skewCorrected stays false when clockOffsetNs == 0 (old host) — then the numbers are same-host-only, as today.
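
One way the HUD numbers could be derived, as a sketch only. `percentile` and `hudLine` are hypothetical helpers, nearest-rank is an assumed percentile definition (the existing LatencyMeter may use another), and the "(same-host only)" suffix is an illustrative way to show `skewCorrected == false`.

```swift
import Foundation

/// Nearest-rank percentile over raw samples (p in 1...100); nil when there are no samples.
func percentile(_ samples: [Int64], _ p: Int) -> Int64? {
    guard !samples.isEmpty, (1...100).contains(p) else { return nil }
    let sorted = samples.sorted()
    let rank = (p * sorted.count + 99) / 100  // ceil(p/100 * n) in integer math
    return sorted[rank - 1]
}

/// Formats the capture→present line; flags the numbers when no skew offset is available.
func hudLine(captureToPresentNs: [Int64], clockOffsetNs: Int64) -> String {
    guard let p50 = percentile(captureToPresentNs, 50),
          let p95 = percentile(captureToPresentNs, 95) else {
        return "capture→present: n/a"
    }
    func ms(_ ns: Int64) -> String {
        String(format: "%.1f", Double(ns) / 1_000_000.0)
    }
    let skewCorrected = clockOffsetNs != 0  // 0 means an old host: same-host-only numbers
    let suffix = skewCorrected ? "" : " (same-host only)"
    return "capture→present p50 \(ms(p50)) ms / p95 \(ms(p95)) ms" + suffix
}
```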

Validation

swift test: add a decode-output test (decode a known IDR built like VideoToolboxRoundTripTests → assert a CVPixelBuffer of the right dimensions + the decode callback fires). Present is display-bound — validate it live via the HUD number.
Live: connect to a Linux host (punktfunk1-host --source virtual on the GNOME box; see Ubuntu — GNOME), confirm capture→present is a few ms over capture→client and that decode→present shrank vs. an AVSampleBufferDisplayLayer baseline.
Compare against the headless reference number: punktfunk-probe reports skew-corrected capture→reassembled (~1.3 ms p50 GNOME box → dev box); capture→present should be that + decode + present.

Gotchas

VT decode is async; the output callback runs on a VT-managed thread. Don't block it, just stamp + enqueue. Retain the CVPixelBuffer until presented (the ring owns it).
VTDecompressionSessionDecodeFrame wants the same CMSampleBuffer shape stage-1 builds (AVCC length-prefixed NALs, in-band parameter sets in the format desc, never as extradata).
CAMetalLayer.drawableSize must track mode changes (the host can Reconfigure mid-stream — watch PunktfunkConnection.mode/the new-IDR dimensions).
Don't add a jitter/smoothing buffer for the first cut — present newest-ready for lowest latency; a pacing policy can come later if frames look uneven.
Keep clients/apple/README.md's "Stage 2" item + Status updated when this lands.
