feat(host): HDR Vulkan layer so Vulkan games get HDR on the virtual display
windows-host / package (push) Failing after 4m16s
ci / rust (push) Failing after 4m56s
ci / web (push) Failing after 22s
ci / docs-site (push) Successful in 1m7s
android / android (push) Successful in 9m19s
ci / bench (push) Successful in 4m47s
decky / build-publish (push) Successful in 11s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 5s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Failing after 3s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 4s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 3s
docker / deploy-docs (push) Has been skipped
deb / build-publish (push) Failing after 6m29s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Failing after 7m4s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Failing after 7m17s
apple / swift (push) Successful in 1m13s
apple / screenshots (push) Successful in 5m27s

NVIDIA/AMD Vulkan ICDs refuse to *advertise* an HDR color space for a surface on an
IddCx indirect/virtual display, so Vulkan games (Doom: The Dark Ages, id Tech, Indiana
Jones, …) report "device does not support HDR" — even though Windows HDR, DWM compose,
and the client PQ stream all work, and the ICD happily *accepts + presents* a forced HDR
swapchain there. The whole gap is enumeration; the community (Apollo/Sunshine/VDD) wrote
this off as kernel-side / unfixable.

Add VK_LAYER_PUNKTFUNK_hdr_inject (packaging/windows/pf-vkhdr-layer/): a standalone
cdylib Vulkan implicit layer that appends {A2B10G10R10, HDR10_ST2084} + {RGBA16F, scRGB}
to vkGetPhysicalDeviceSurfaceFormats[2]KHR (no need to hook vkCreateSwapchainKHR — the
ICD doesn't validate the color space there). Self-gated on the surface monitor's actual
advanced-color state (DisplayConfig GET_ADVANCED_COLOR_INFO), so it is a complete no-op
on SDR sessions and real monitors (dedup). Always-on (registry-discovered) so it works
regardless of how a game is launched — env-scoping silently fails for already-running
Steam. Escape hatches: DISABLE_PF_VKHDR, PF_VKHDR_EXCLUDE, and a built-in kernel-anti-
cheat denylist.

The installer builds/signs/stages it and registers it under
HKLM64\SOFTWARE\Khronos\Vulkan\ImplicitLayers (opt-out "Install the HDR Vulkan layer"
task); windows-host CI fmt+clippy-gates it (msvc-only FFI).

Live-validated on the RTX box: Doom: The Dark Ages enables HDR over the pf-vdisplay
virtual display.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@@ -0,0 +1,126 @@
---
title: "Apple Stage-2 Presenter (handoff)"
description: "Implementation plan for the explicit VTDecompressionSession → CAMetalLayer presenter — hand-paced present + true decode→present (glass-to-glass) measurement. Written so a Mac agent can pick it up."
---

> **Status update:** the stage-2 presenter described here has since been **built and live-validated**,
> shipping behind an opt-in flag (`AVSampleBufferDisplayLayer` remains the default known-good path).
> This page is preserved as the implementation/handoff record for that work.

The implementation plan for the **stage-2 Apple presenter**. The **stage-1** presenter feeds
compressed HEVC straight into `AVSampleBufferDisplayLayer`, which hardware-decodes **and presents
internally with no per-frame callback** — so we can't stamp decode or present, and we can't hand-pace.
Stage-2 takes explicit control: decode with `VTDecompressionSession`, present decoded frames through a
`CAMetalLayer` driven by a display link. Two wins: **~0.5 refresh off the present tail** (the biggest
client latency term at 60 Hz) and **true decode→present / glass-to-glass** numbers.

All of this is **macOS/iOS/tvOS-only** — build + validate on a Mac (`swift build && swift test`, then
live against a Linux host). The host + connector side is already done: `PunktfunkConnection.clockOffsetNs`
(the connect-time skew offset, host minus client) is what makes the present timestamp cross-machine
valid. See [Status](/docs/status) and roadmap §12.

## Where it plugs into the existing code

| Existing (stage-1) | Stage-2 change |
|---|---|
| `StreamPump` pulls AUs → `AnnexB.sampleBuffer` → `layer.enqueue` (compressed) | A `Stage2Pump` (or a mode flag on `StreamPump`) feeds AUs to `VTDecompressionSessionDecodeFrame` instead |
| `StreamView`/`StreamViewIOS` host an `AVSampleBufferDisplayLayer` | Host a `CAMetalLayer` (+ a display link); keep the input-capture + HUD overlay unchanged |
| `AnnexB.formatDescription(fromIDR:)` builds the format desc, refreshed on every IDR | **Reused** — it's the `VTDecompressionSession`'s format description; recreate the session when it changes |
| `LatencyMeter` records capture→client-receipt at `onFrame` | Extend to record **decode-completion** and **present** stages (below) |

Keep stage-1 behind a `UserDefaults` flag (e.g. `punktfunk.presenter = "stage1" | "stage2"`) so a
regression can fall back — `AVSampleBufferDisplayLayer` is the known-good path.
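
A minimal sketch of how such a flag could be read, assuming the key name from the "e.g." above. `PresenterMode` and `presenterMode(from:key:)` are hypothetical names, not part of the existing client; anything missing or unrecognised falls back to stage-1.

```swift
import Foundation

/// Which presenter the client should use. Raw values mirror the strings in the plan.
/// (Hypothetical type, for illustration only.)
enum PresenterMode: String {
    case stage1
    case stage2
}

/// Reads the presenter flag and falls back to stage-1 (the known-good path)
/// when the key is missing or holds an unrecognised value.
func presenterMode(from defaults: UserDefaults = .standard,
                   key: String = "punktfunk.presenter") -> PresenterMode {
    guard let raw = defaults.string(forKey: key),
          let mode = PresenterMode(rawValue: raw) else {
        return .stage1
    }
    return mode
}
```

Defaulting to `.stage1` on any parse failure keeps a typo in the flag from silently selecting the newer path.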

## Decode: VTDecompressionSession

1. Create the session from the IDR's `CMVideoFormatDescription`
   (`AnnexB.formatDescription(fromIDR:)`):
   ```
   VTDecompressionSessionCreate(
       allocator: nil,
       formatDescription: fmt,
       decoderSpecification: nil,            // hardware by default; no need to force
       imageBufferAttributes: [
           kCVPixelBufferMetalCompatibilityKey: true,
           kCVPixelBufferPixelFormatTypeKey:
               kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange,  // 8-bit SDR; 10-bit (…10BiPlanar) for HDR later
       ],
       outputCallback: <C-callback>,
       decompressionSessionOut: &session)
   ```
2. Per AU: build the same `CMSampleBuffer` as stage-1 (`AnnexB.sampleBuffer(au:format:)`, PTS =
   `au.ptsNs` @ 1e9 timescale) and submit:
   ```
   VTDecompressionSessionDecodeFrame(session, sampleBuffer: sampleBuffer,
       flags: ._EnableAsynchronousDecompression,
       frameRefcon: <pts or a boxed context>, infoFlagsOut: nil)
   ```
3. The **output callback** delivers `(status, infoFlags, imageBuffer: CVImageBuffer?, presentationTimeStamp, …)`.
   `presentationTimeStamp` is `au.ptsNs` (the host capture clock). **Stamp decode-completion here**
   (`CLOCK_REALTIME` ns), retain the `CVPixelBuffer`, and push `{pts, pixelBuffer, decodedNs}` into a
   small NSLock-guarded ring (the "ready" queue) the display link drains.
4. **IDR / mode change**: when `AnnexB.formatDescription` yields a new desc, check
   `VTDecompressionSessionCanAcceptFormatDescription`; if not, finish-and-recreate the session (same
   trigger stage-1 uses to refresh `format`). On decoder error (`kVTVideoDecoderBadDataErr`, etc.) drop
   to the next IDR — there's no out-of-band extradata; recovery keyframes re-carry the parameter sets.
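
The "ready" queue from step 3 can be sketched as a small lock-guarded buffer. This is an illustration under stated assumptions, not the shipped implementation: `ReadyFrame` and `ReadyQueue` are hypothetical names, the capacity of 4 is arbitrary, and the generic `Payload` stands in for the retained `CVPixelBuffer` so the sketch compiles without CoreVideo.

```swift
import Foundation

/// One decoded frame waiting for a vsync. `Payload` stands in for the
/// retained CVPixelBuffer (assumption, to keep the sketch self-contained).
struct ReadyFrame<Payload> {
    let ptsNs: Int64      // host capture clock, carried from the AU
    let decodedNs: Int64  // client CLOCK_REALTIME at decode completion
    let payload: Payload
}

/// NSLock-guarded bounded queue: the decoder callback pushes, the display
/// link pops the newest frame and discards anything older.
final class ReadyQueue<Payload> {
    private let lock = NSLock()
    private var frames: [ReadyFrame<Payload>] = []
    private let capacity: Int

    init(capacity: Int = 4) {
        precondition(capacity > 0)
        self.capacity = capacity
    }

    /// Called from the decoder output callback; only appends, so the lock is held briefly.
    func push(_ frame: ReadyFrame<Payload>) {
        lock.lock(); defer { lock.unlock() }
        if frames.count == capacity { frames.removeFirst() }  // evict the oldest
        frames.append(frame)
    }

    /// Called once per vsync: returns the newest frame and drops older undisplayed ones.
    func popNewest() -> ReadyFrame<Payload>? {
        lock.lock(); defer { lock.unlock() }
        let newest = frames.last
        frames.removeAll(keepingCapacity: true)
        return newest
    }
}
```

Because the queue owns each payload until it is popped or evicted, the pixel buffer stays alive for exactly as long as it might still be presented.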

## Present: CAMetalLayer + display link

- `CAMetalLayer` (device = system default, `pixelFormat = .bgra8Unorm`, `framebufferOnly = true`,
  `drawableSize` = stream WxH). The view: macOS `NSView`/iOS `UIView` whose `layerClass`/backing layer
  is the `CAMetalLayer` (mirror `StreamView`/`StreamViewIOS`).
- **Display link** drives present: macOS `CVDisplayLink` (or `CADisplayLink` on macOS 14+),
  iOS/tvOS `CADisplayLink`. Each callback carries the **target present timestamp** (`CVTimeStamp` /
  `targetTimestamp`).
- Each vsync: pop the **newest** ready frame (drop older undisplayed ones — low-latency default; no
  smoothing buffer to start), render a fullscreen quad sampling the **biplanar YUV** (luma +
  chroma planes via `CVMetalTextureCache`) with a BT.709 YUV→RGB fragment shader, then
  `commandBuffer.present(drawable)` (or `present(drawable, atTime:)`). **Stamp present time** for the
  frame just shown (use the display link's target timestamp converted to `CLOCK_REALTIME`).
- Colorspace: BT.709 8-bit for now (matches the host's SDR). HDR (BT.2020/PQ, 10-bit `…10BiPlanar` +
  EDR `CAMetalLayer.wantsExtendedDynamicRangeContent`) is a later tie-in with the HDR roadmap (§10).
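
The arithmetic the BT.709 fragment shader has to perform can be checked on the CPU first. The sketch below is a plain Swift reference for one 8-bit video-range sample (matching the `…8BiPlanarVideoRange` decoder output), using the standard BT.709 coefficients; the function name is hypothetical and the real shader would do the same math per pixel in Metal.

```swift
/// Converts one 8-bit video-range BT.709 sample (Y in 16...235, Cb/Cr in 16...240)
/// to normalised RGB in 0...1. CPU reference for the shader math (hypothetical helper).
func bt709VideoRangeToRGB(y: UInt8, cb: UInt8, cr: UInt8) -> (r: Double, g: Double, b: Double) {
    // Expand video range to normalised luma and centred chroma.
    let yn = (Double(y) - 16.0) / 219.0
    let pb = (Double(cb) - 128.0) / 224.0
    let pr = (Double(cr) - 128.0) / 224.0

    func clamp(_ v: Double) -> Double { min(max(v, 0.0), 1.0) }

    // BT.709 matrix derived from Kr = 0.2126, Kb = 0.0722.
    return (r: clamp(yn + 1.5748 * pr),
            g: clamp(yn - 0.1873 * pb - 0.4681 * pr),
            b: clamp(yn + 1.8556 * pb))
}
```

Comparing a few known colours (reference white, black, a saturated primary) against the shader output is a cheap way to catch a swapped chroma plane or a full-range/video-range mix-up.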

### Cheaper intermediate (2a) if the Metal path is too big in one step

Decode with `VTDecompressionSession` (gets the **decode-completion timestamp** = capture→decoded),
then wrap the decoded `CVPixelBuffer` in a `CMSampleBuffer` and `enqueue` it into the existing
`AVSampleBufferDisplayLayer` (it accepts uncompressed pixel buffers too). This yields the decode term
**without** a Metal renderer — but **not** true present (the layer still presents internally). Ship 2a
first if useful; 2b (CAMetalLayer + display link) is required for the on-glass present stamp.

## Measurement (the whole point)

Extend `LatencyMeter` (or add per-stage meters) so each frame records three instants, all
`CLOCK_REALTIME` ns, all shifted by `connection.clockOffsetNs` to the host clock:

- **capture→decoded** = `decodedNs + offset − pts_ns` (VideoToolbox decode latency, cross-machine)
- **decode→present** = `presentedNs − decodedNs` (the present tail stage-2 shortens)
- **capture→present** = `presentedNs + offset − pts_ns` — **the glass-to-glass number** (modulo the
  host render→capture term, still unmeasured; see roadmap §12)
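
The three formulas above translate directly into code. In this sketch `FrameStamps`, `StageLatencies`, and `stageLatencies(_:clockOffsetNs:)` are hypothetical names; the offset is host minus client, as described for `clockOffsetNs`, so adding it moves a client timestamp onto the host clock.

```swift
/// Per-frame timestamps in nanoseconds (hypothetical container).
struct FrameStamps {
    let ptsNs: Int64        // host capture clock
    let decodedNs: Int64    // client CLOCK_REALTIME at decode completion
    let presentedNs: Int64  // client CLOCK_REALTIME at present
}

/// The three stage latencies from the list above, in nanoseconds.
struct StageLatencies {
    let captureToDecodedNs: Int64
    let decodeToPresentNs: Int64
    let captureToPresentNs: Int64
}

/// `clockOffsetNs` is host minus client; it cancels out of decode→present
/// because both instants are taken on the client clock.
func stageLatencies(_ s: FrameStamps, clockOffsetNs: Int64) -> StageLatencies {
    StageLatencies(
        captureToDecodedNs: s.decodedNs + clockOffsetNs - s.ptsNs,
        decodeToPresentNs: s.presentedNs - s.decodedNs,
        captureToPresentNs: s.presentedNs + clockOffsetNs - s.ptsNs
    )
}
```

A useful sanity check is that capture→decoded plus decode→present always equals capture→present, whatever the offset.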

Surface `capture→present` p50/p95 in the HUD (extend the existing `model.latency*` line in
`ContentView`). `skewCorrected` stays false when `clockOffsetNs == 0` (old host) — then the numbers are
same-host-only, as today.
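
For the p50/p95 figures, one simple option is a nearest-rank percentile over a window of recent samples. This is a sketch only: `percentile(_:p:)` is a hypothetical helper, and how the existing `LatencyMeter` aggregates its samples is not shown in this page.

```swift
/// Nearest-rank percentile over latency samples (nanoseconds).
/// `p` is a whole-number percentile in 0...100; returns nil for empty input
/// or an out-of-range `p`. Integer arithmetic avoids floating-point rounding.
func percentile(_ samples: [Int64], p: Int) -> Int64? {
    guard !samples.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = samples.sorted()
    // rank = ceil(p / 100 * count), computed with integers
    let rank = (p * sorted.count + 99) / 100
    return sorted[max(rank, 1) - 1]
}
```

Calling it with `p: 50` and `p: 95` on the recorded `capture→present` values gives the two numbers the HUD line needs.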

## Validation

- `swift test`: add a decode-output test (decode a known IDR built like
  `VideoToolboxRoundTripTests` → assert a `CVPixelBuffer` of the right dimensions + the
  decode callback fires). Present is display-bound — validate it **live** via the HUD number.
- Live: connect to a Linux host (`punktfunk1-host --source virtual` on the GNOME box; see
  [Ubuntu — GNOME](/docs/ubuntu-gnome)), confirm `capture→present` is a few ms over `capture→client`
  and that `decode→present` shrank vs. an `AVSampleBufferDisplayLayer` baseline.
- Compare against the headless reference number: `punktfunk-probe` reports skew-corrected
  capture→reassembled (~1.3 ms p50 GNOME box → dev box); capture→present should be that **+ decode +
  present**.

## Gotchas

- VT decode is **async**; the output callback runs on a VT-managed thread — don't block it, just stamp
  and enqueue. Retain the `CVPixelBuffer` until presented (the ring owns it).
- `VTDecompressionSessionDecodeFrame` wants the **same** `CMSampleBuffer` shape stage-1 builds (AVCC
  length-prefixed NALs, in-band parameter sets in the format desc, never as extradata).
- `CAMetalLayer.drawableSize` must track mode changes (the host can `Reconfigure` mid-stream — watch
  `PunktfunkConnection.mode`/the new-IDR dimensions).
- Don't add a jitter/smoothing buffer for the first cut — present newest-ready for lowest latency; a
  pacing policy can come later if frames look uneven.
- Keep `clients/apple/README.md`'s "Stage 2" item + [Status](/docs/status) updated when this lands.
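
To make the "AVCC length-prefixed NALs" gotcha concrete, here is a sketch of the Annex B to length-prefixed conversion on raw bytes. It is an illustration, not the project's `AnnexB` helper: `splitAnnexB` and `annexBToLengthPrefixed` are hypothetical names, and a 4-byte big-endian length field is assumed.

```swift
/// Splits an Annex B byte stream into NAL unit payloads with start codes removed.
/// Handles both 3-byte (00 00 01) and 4-byte (00 00 00 01) start codes.
func splitAnnexB(_ bytes: [UInt8]) -> [ArraySlice<UInt8>] {
    var nals: [ArraySlice<UInt8>] = []
    var nalStart: Int? = nil

    // Closes the NAL that began at `start`, trimming zero bytes before `end`
    // (they belong to the next start code or are stream padding).
    func close(_ start: Int, _ end: Int) {
        var e = end
        while e > start, bytes[e - 1] == 0 { e -= 1 }
        if e > start { nals.append(bytes[start..<e]) }
    }

    var i = 0
    while i + 2 < bytes.count {
        if bytes[i] == 0, bytes[i + 1] == 0, bytes[i + 2] == 1 {
            if let s = nalStart { close(s, i) }
            nalStart = i + 3
            i += 3
        } else {
            i += 1
        }
    }
    if let s = nalStart { close(s, bytes.count) }
    return nals
}

/// Re-emits each NAL with a 4-byte big-endian length prefix (assumed field size).
func annexBToLengthPrefixed(_ bytes: [UInt8]) -> [UInt8] {
    var out: [UInt8] = []
    for nal in splitAnnexB(bytes) {
        let n = UInt32(nal.count)
        out.append(contentsOf: [UInt8(n >> 24), UInt8((n >> 16) & 0xFF),
                                UInt8((n >> 8) & 0xFF), UInt8(n & 0xFF)])
        out.append(contentsOf: nal)
    }
    return out
}
```

Parameter-set NALs would be routed into the format description rather than the sample data, which this sketch does not attempt.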