enricobuehler 7b99b41ede docs(design): trim shipped plans, consolidate cluster, add index
Much of design/ described work that has since shipped. Trim each doc to
its durable rationale + still-open items (the code is the source of truth
for shipped detail; git history holds the full originals).

- Shipped plans -> status stubs: stats-capture, gamestream-host-plan,
  apple-stage2-presenter, windows-service.
- Trimmed completed-out / open-kept: implementation-plan, hdr-pipeline,
  host-latency, gpu-contention (fixed stale status table), game-library,
  linux-setup (fixed m0->spike + stale zero-copy claim),
  session-aware-host-followups, windows-client-bootstrap,
  windows-dualsense-{scoping,game-detection}, windows-virtual-display,
  security-review (per-finding status table; #12 still open),
  apollo-comparison (shipped backlog collapsed to one-liners).
- Windows-host cluster consolidated: windows-host.md -> redirect into
  windows-host-rewrite.md (whose stale scorecard is corrected -- goal1 is
  merged, M4 done); windows-secure-desktop.md archived (now a fallback
  behind IDD-push primary).
- Kept evergreen: ci.md, gamescope-multiuser.md, windows-build-and-packaging.md.
- New design/README.md: per-doc status table + consolidated open-items
  roll-up so nothing is tracked in only one buried doc.
- Repoint 5 code comments to the archived secure-desktop doc path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:39:06 +00:00

---
title: "Implementation Plan"
description: "The full design: protocol core, milestones, and architecture."
---
*A ground-up low-latency desktop streaming stack, built Linux-first, with a shared Rust protocol core and native clients per platform.*
> **Status:** SHIPPED. M0-M5 complete, M6 largely shipped. This is the project's canonical design doc; it is trimmed to the load-bearing design (thesis, scope, architecture, protocol strategy, C ABI, virtual-display orchestration, latency budget, risk register) plus still-open items. For current shipped-feature status see CLAUDE.md "Where the work stands"; for build/test/run, repo layout, and next actions see CLAUDE.md. Git history holds the full original milestone acceptance criteria.
> The name `punktfunk` fits the lowercase house style (`unom`, `played`, `remplir`) and reads as "glass-to-glass light," which is the whole point.
---
## 0. The thesis (why this is worth building)
Two concrete gaps justify a new project rather than another fork:
1. **The 1 Gbps wall is a FEC design limit, not a bandwidth limit.** Moonlight/Sunshine protect each frame with Reed-Solomon over GF(2⁸), which caps a block at 255 shards. At 5120×1440@240 that ceiling is hit around 1 Gbps. Switching the erasure code to **Leopard-RS over GF(2¹⁶)** (via the `reed-solomon-simd` crate) raises the per-block shard limit to 65,536 and runs in O(n log n) with SIMD. The wall disappears as a *consequence* of a better core, not as a hack.
2. **Linux software virtual displays are a real, unfilled gap.** The compositor-side capability now exists (Mutter headless virtual monitors since GNOME 40; wlroots headless outputs; KWin virtual outputs in Plasma 6), but no streaming host *drives* those APIs to create a client-sized output on demand, capture it via PipeWire, and route input back via libei. Apollo's virtual display is Windows-only. This is the immediate, shippable win.
**Strategic ordering:** ship the Linux virtual-display host speaking the *existing* Moonlight protocol first (every Moonlight/Artemis client works on day one, no client to write). Only then introduce the new GF(2¹⁶) transport as a negotiated protocol extension with our own clients. Value early, hard parts deferred until de-risked.
---
## 1. Scope & non-goals
**In scope (eventually):**
- Linux streaming host with on-demand software virtual displays (KWin first, then wlroots, then Mutter).
- A shared Rust protocol/transport/FEC core exposed over a stable C ABI.
- A modern transport that removes the 1 Gbps ceiling.
- Native clients: Rust (Linux), Swift (macOS/iOS), Kotlin (Android) — all linking the same core.
**Explicit non-goals (at least at first):**
- Windows *host* support (Sunshine/Apollo already do this well; no gap to fill). *(Note: this non-goal was later reversed — a Windows host shipped; see CLAUDE.md.)*
- Internet/NAT-traversal relay infrastructure (LAN/VPN first; lean on an existing mesh VPN such as Headscale/NetBird/Tailscale).
- Reinventing encoders/decoders (bind to FFmpeg + vendor SDKs; never rewrite codecs).
- A bespoke compositor (drive existing ones; only consider a dedicated headless compositor as a *deployment mode*, see §6).
---
## 2. Architecture overview
```mermaid
flowchart TD
subgraph Host["Linux Host (Rust)"]
VD["Virtual display orchestrator<br/>(KWin / wlroots / Mutter)"]
CAP["Capture<br/>(PipeWire / dmabuf)"]
ENC["Encoder<br/>(VAAPI / NVENC via FFmpeg)"]
VD --> CAP --> ENC
ENC --> COREH
IN_H["Input injector<br/>(libei / uinput)"]
COREH["punktfunk-core (C ABI)<br/>protocol · FEC · pacing · crypto"]
COREH --> IN_H
end
COREH <-->|"UDP+FEC video / QUIC control+audio"| COREC
subgraph Client["Client (Rust / Swift / Kotlin)"]
COREC["punktfunk-core (same crate, C ABI)"]
DEC["Decoder<br/>(VideoToolbox / NVDEC / VAAPI)"]
PRES["Present + frame pacing"]
INP["Input capture"]
COREC --> DEC --> PRES
INP --> COREC
end
```
**The load-bearing decision:** `punktfunk-core` is one crate, compiled once, linked by every host and client through a C ABI. Protocol logic, FEC, packet pacing, jitter buffering, pairing, and crypto live there and exist exactly once. Platform code (capture, encode, decode, present, input, UI) lives outside the core and is written in whatever language suits the platform.
---
## 3. Protocol strategy (three phases)
| Phase | Protocol | Clients that work | Bitrate ceiling | Purpose |
|------|----------|-------------------|-----------------|---------|
| **P1** | GameStream-compatible (existing Moonlight wire format) | All existing Moonlight/Artemis clients | ~1 Gbps (legacy GF(2⁸) FEC) | Ship the Linux virtual-display win with zero client work |
| **P2** | `punktfunk/1` negotiated extension: GF(2¹⁶) FEC, multi-block framing, optional QUIC control | punktfunk clients only; falls back to P1 for others | Multi-Gbps | Break the wall; introduce native clients |
| **P3** | `punktfunk/1` as primary; GameStream kept as compat shim | punktfunk everywhere, Moonlight as fallback | Multi-Gbps | Full control of features (mic passthrough, per-client identity, HDR signalling) |
Negotiation: extend the `serverinfo`/RTSP `SETUP` handshake with a capability flag. Old clients never see the flag and get P1 behavior. This is how Apollo/Artemis diverge cleanly, and it keeps you compatible while you build.
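The fallback rule can be sketched in a few lines. This is a minimal illustration, assuming the capability is carried as a bitmask; `CAP_PUNKTFUNK_1`, `Protocol`, and `negotiate` are hypothetical names and not the actual wire format.

```rust
/// Hypothetical capability bit a punktfunk client would advertise in the handshake.
pub const CAP_PUNKTFUNK_1: u32 = 1 << 0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    /// P1: stock GameStream wire format with the legacy GF(2^8) FEC.
    GameStream,
    /// P2: the negotiated `punktfunk/1` extension.
    Punktfunk1,
}

/// Picks the protocol for a session.
/// `client_caps` is `None` when the client never sent the flag (a stock client).
pub fn negotiate(host_caps: u32, client_caps: Option<u32>) -> Protocol {
    match client_caps {
        // Both sides must advertise the extension before it is used.
        Some(caps) if (caps & host_caps & CAP_PUNKTFUNK_1) != 0 => Protocol::Punktfunk1,
        // Anything else, including a missing flag, gets P1 behavior.
        _ => Protocol::GameStream,
    }
}
```

The point of the shape is that the default arm is the legacy path, so an unknown or absent flag can never select the new transport by accident.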
---
## 4. Tech stack (settled)
**Language split:** Rust for the core and all non-Apple platform code; Swift only for the macOS/iOS client UI + VideoToolbox/Metal; Kotlin for Android UI + MediaCodec. The C ABI is the seam.
**Threading:** native OS threads for the video hot path. `tokio` is allowed *only* for the control plane (pairing, web config, QUIC control stream). The per-frame pipeline must never touch an async runtime.
### Core crate dependencies
| Concern | Crate | Notes |
|--------|-------|-------|
| FEC | `reed-solomon-simd` (v3+) | Leopard/GF(2¹⁶), SIMD, O(n log n) — the wall-breaker |
| QUIC (control/audio) | `quinn` | Datagram ext for audio; reliable streams for control |
| TLS / crypto | `rustls` + `ring` (or `aws-lc-rs`) | Pairing, session keys (AES-GCM to match GameStream in P1) |
| Serialization | `zerocopy` / `bytes` | Wire structs `#[repr(C)]`, zero-copy parse |
| C header gen | `cbindgen` | Generates `punktfunk_core.h` from the ABI module |
| Error/log | `tracing` | Structured; feature-gate off the hot path |
### Linux host dependencies
| Concern | Crate / API | Notes |
|--------|-------------|-------|
| Capture | `pipewire` (pipewire-rs) | ScreenCast portal stream → dmabuf |
| Portal / DBus | `ashpd` + `zbus` | xdg-desktop-portal: ScreenCast, RemoteDesktop |
| Encode | `ffmpeg-next` or `rsmpeg` | VAAPI / NVENC, dmabuf import (zero-copy) |
| Input inject | `reis` (libei) + `input-linux` (uinput fallback) | Wayland-native first, uinput as universal fallback |
| Virtual output | per-compositor (see §6) | KWin DBus / Sway `create_output` / Mutter DBus |
| Web config | `axum` + `tokio` + small Vite/React UI | You own this stack already |
### Apple client (P2+)
Swift + VideoToolbox (decode) + Metal (present) + SwiftUI. Imports `punktfunk_core.h` directly via a module map — no glue layer.
### Ruled out
- **Swift for the host/core:** no Linux Wayland/PipeWire/DRM/VAAPI ecosystem; ARC in hot loops. (Excellent *Apple-client* language, wrong for systems/Linux.)
- **Go:** GC disqualifies the hot path.
- **C++:** throws away the safety/concurrency wins that justified greenfield over forking.
- **Zig:** best-in-class C interop, but pre-1.0 with no Wayland/QUIC ecosystem — too much risk for a multi-month build. Revisit later if desired.
---
## 5. The C ABI boundary
Design it on day one; retrofitting an ABI is painful.
**Principles**
- Opaque handles only across the boundary: `PunktfunkSession*`, never Rust types.
- All cross-boundary structs are `#[repr(C)]`; primitives + pointer/len pairs for buffers.
- Async events via registered C callbacks (`fn ptr` + `void* userdata`).
- Explicit, documented ownership: who frees what, when. Provide `punktfunk_*_free` for every allocation that crosses out.
- Versioned ABI: `uint32_t punktfunk_abi_version(void)` + a `PunktfunkConfig` struct whose first field is its own size for forward-compat.
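As an illustration of the size-first versioning principle, here is a rough sketch of how the core might read a config written by an older caller. The field names, the defaults, and the `read_config` helper are assumptions made for the example; a real export would also carry a `no_mangle` attribute.

```rust
use std::mem::size_of;

/// Hypothetical ABI version constant, bumped on breaking changes.
pub const PUNKTFUNK_ABI_VERSION: u32 = 1;

pub extern "C" fn punktfunk_abi_version() -> u32 {
    PUNKTFUNK_ABI_VERSION
}

/// Illustrative layout: the first field is the caller's own `sizeof` of the struct.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PunktfunkConfig {
    pub struct_size: u32,
    pub width: u32,
    pub height: u32,
    /// Assumed to have been added in a later revision; 0 means "no cap".
    pub max_bitrate_kbps: u32,
}

impl Default for PunktfunkConfig {
    fn default() -> Self {
        Self {
            struct_size: size_of::<Self>() as u32,
            width: 1920,
            height: 1080,
            max_bitrate_kbps: 0,
        }
    }
}

/// Copies only the bytes the caller claims to have written over a defaulted
/// struct, so fields the caller does not know about keep their defaults.
///
/// # Safety
/// `cfg` must be null or point to at least `struct_size` readable bytes.
pub unsafe fn read_config(cfg: *const PunktfunkConfig) -> Option<PunktfunkConfig> {
    if cfg.is_null() {
        return None;
    }
    // Read the size prefix first; nothing past it is trusted yet.
    let claimed = unsafe { std::ptr::read_unaligned(cfg as *const u32) } as usize;
    // Reject sizes that are too small or that end in the middle of a field.
    if claimed < size_of::<u32>() || claimed % size_of::<u32>() != 0 {
        return None;
    }
    let n = claimed.min(size_of::<PunktfunkConfig>());
    let mut out = PunktfunkConfig::default();
    unsafe {
        std::ptr::copy_nonoverlapping(
            cfg as *const u8,
            &mut out as *mut PunktfunkConfig as *mut u8,
            n,
        );
    }
    out.struct_size = n as u32;
    Some(out)
}
```

A caller built against a 12-byte revision of the struct would get `max_bitrate_kbps` at its default, while a newer caller's larger struct is simply truncated to what this build understands.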
**Minimal surface (sketch)**
```c
// lifecycle
PunktfunkSession* punktfunk_session_new(const PunktfunkConfig* cfg);
void punktfunk_session_free(PunktfunkSession*);
// host: feed an encoded access unit (the core does FEC + packetize + pace + send)
int punktfunk_host_submit_frame(PunktfunkSession*, const uint8_t* data, size_t len,
                                uint64_t pts_ns, PunktfunkFrameFlags flags);
// client: pull a reassembled, FEC-recovered access unit ready to decode
int punktfunk_client_poll_frame(PunktfunkSession*, PunktfunkFrame* out /*borrowed until next poll*/);
// input (both directions): client captures, host receives via callback
int punktfunk_send_input(PunktfunkSession*, const PunktfunkInputEvent*);
void punktfunk_set_input_callback(PunktfunkSession*, PunktfunkInputCb, void* user);
// stats for the frame-pacing/quality logic and the web UI
void punktfunk_get_stats(PunktfunkSession*, PunktfunkStats* out);
```
Keep it this small. Everything platform-specific (how you got the encoded bytes, how you decode them) stays on the platform side.
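On the Rust side, the opaque-handle and callback conventions might look roughly like the following. This is a sketch under stated assumptions rather than the shipped ABI: the config argument is omitted for brevity, the `PunktfunkInputEvent` fields and the `deliver_input` and `add_value_cb` helpers are hypothetical, and real exports would also need a `no_mangle` attribute.

```rust
use std::ffi::c_void;

/// Hypothetical event layout; only the `#[repr(C)]` shape matters here.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PunktfunkInputEvent {
    pub kind: u32,
    pub code: u32,
    pub value: i32,
}

/// C callback: a function pointer plus an opaque `userdata` pointer.
pub type PunktfunkInputCb = extern "C" fn(*const PunktfunkInputEvent, *mut c_void);

/// Rust-side state. C only ever holds a pointer to it and never sees the fields.
pub struct PunktfunkSession {
    input_cb: Option<PunktfunkInputCb>,
    user: *mut c_void,
}

/// Allocates a session and hands ownership to the caller as a raw pointer.
pub extern "C" fn punktfunk_session_new() -> *mut PunktfunkSession {
    Box::into_raw(Box::new(PunktfunkSession {
        input_cb: None,
        user: std::ptr::null_mut(),
    }))
}

/// Takes ownership back and frees it. Null is accepted and ignored.
///
/// # Safety
/// `s` must be null or a pointer from `punktfunk_session_new` not yet freed.
pub unsafe extern "C" fn punktfunk_session_free(s: *mut PunktfunkSession) {
    if !s.is_null() {
        drop(unsafe { Box::from_raw(s) });
    }
}

/// # Safety
/// `s` must be null or a live session pointer.
pub unsafe extern "C" fn punktfunk_set_input_callback(
    s: *mut PunktfunkSession,
    cb: Option<PunktfunkInputCb>,
    user: *mut c_void,
) {
    if let Some(session) = unsafe { s.as_mut() } {
        session.input_cb = cb;
        session.user = user;
    }
}

/// Stand-in for the core delivering a received event to the registered callback.
///
/// # Safety
/// `s` must be null or a live session pointer.
pub unsafe fn deliver_input(s: *mut PunktfunkSession, ev: &PunktfunkInputEvent) -> bool {
    if s.is_null() {
        return false;
    }
    let session = unsafe { &*s };
    match session.input_cb {
        Some(cb) => {
            cb(ev as *const PunktfunkInputEvent, session.user);
            true
        }
        None => false,
    }
}

/// Example callback: adds the event value to an `i32` passed as userdata.
pub extern "C" fn add_value_cb(ev: *const PunktfunkInputEvent, user: *mut c_void) {
    if ev.is_null() || user.is_null() {
        return;
    }
    unsafe { *(user as *mut i32) += (*ev).value };
}
```

`Box::into_raw` and `Box::from_raw` make the ownership rule explicit: the caller owns the handle from `new` until it passes it to `free`, and nothing else may free it.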
---
## 6. Virtual display orchestration
This is the differentiator and the most fragmented part. Two deployment models — support both eventually, pick one for the MVP.
**Model A — Attach to the running session.** Create a client-sized virtual output *inside the user's live desktop*, stream it, tear it down on disconnect. This is "add a monitor to my actual PC." Best UX, hardest because it depends on per-compositor runtime APIs.
**Model B — Dedicated headless session.** Spawn a separate headless compositor purely for the stream (e.g. `gnome-shell --headless --virtual-monitor WxH`, or a headless wlroots compositor). Cleaner isolation, sidesteps runtime-output APIs, ideal for "remote second PC." Worse for "mirror/extend my real desktop."
**Per-compositor (Model A) runtime virtual-output creation:**
- **KWin / Plasma 6 (recommended MVP target — a common KDE daily-driver setup, and where the gap is loudest):** KWin can create virtual outputs; KRdp already does this internally for remote sessions. Drive it via the KWin DBus interface; capture via `xdg-desktop-portal-kde` ScreenCast (PipeWire); inject input via the RemoteDesktop portal or `reis`.
- **wlroots (Sway/Hyprland — fastest to *prototype* the pipeline):** enable the headless backend (`WLR_BACKENDS=…,headless`), then `swaymsg create_output` / `hyprctl output create headless`. Capture via `wlr-screencopy` or the portal. Simplest API; good for validating capture→encode→send before fighting KWin/Mutter.
- **Mutter / GNOME:** virtual monitors via the headless backend; runtime creation via Mutter DBus (`org.gnome.Mutter.*` — partly experimental). Capture via `xdg-desktop-portal-gnome` ScreenCast.
**Recommendation:** do a 1-2 day wlroots spike to prove the *pipeline*, then build the real MVP on KWin because that's your deployment target. Abstract virtual-output creation behind a trait so compositors are pluggable:
```rust
trait VirtualDisplay {
    fn create(&self, mode: Mode) -> Result<OutputHandle>;
    fn destroy(&self, h: OutputHandle) -> Result<()>;
}
```
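To show how a pluggable backend and teardown-on-disconnect could fit together, here is a self-contained sketch with an in-memory backend. Everything beyond the trait's two method names is an assumption for illustration: the `Mode` fields, the `String` error type behind `DisplayResult`, the `MockDisplay` backend, and the `ScopedOutput` guard.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Mode {
    pub width: u32,
    pub height: u32,
    pub refresh_mhz: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct OutputHandle(u32);

/// Assumed error type; the design doc does not specify one.
pub type DisplayResult<T> = std::result::Result<T, String>;

pub trait VirtualDisplay {
    fn create(&self, mode: Mode) -> DisplayResult<OutputHandle>;
    fn destroy(&self, h: OutputHandle) -> DisplayResult<()>;
}

/// In-memory backend standing in for a KWin / wlroots / Mutter implementation.
#[derive(Default)]
pub struct MockDisplay {
    outputs: RefCell<HashMap<OutputHandle, Mode>>,
    next_id: Cell<u32>,
}

impl MockDisplay {
    pub fn live_outputs(&self) -> usize {
        self.outputs.borrow().len()
    }
}

impl VirtualDisplay for MockDisplay {
    fn create(&self, mode: Mode) -> DisplayResult<OutputHandle> {
        if mode.width == 0 || mode.height == 0 {
            return Err("mode must have a non-zero size".to_string());
        }
        let handle = OutputHandle(self.next_id.get());
        self.next_id.set(handle.0 + 1);
        self.outputs.borrow_mut().insert(handle, mode);
        Ok(handle)
    }

    fn destroy(&self, h: OutputHandle) -> DisplayResult<()> {
        self.outputs
            .borrow_mut()
            .remove(&h)
            .map(|_| ())
            .ok_or_else(|| format!("unknown output {:?}", h))
    }
}

/// Guard that destroys the output when the session ends, even on early return.
pub struct ScopedOutput<'a, D: VirtualDisplay> {
    display: &'a D,
    handle: OutputHandle,
}

impl<'a, D: VirtualDisplay> ScopedOutput<'a, D> {
    pub fn new(display: &'a D, mode: Mode) -> DisplayResult<Self> {
        let handle = display.create(mode)?;
        Ok(Self { display, handle })
    }

    pub fn handle(&self) -> OutputHandle {
        self.handle
    }
}

impl<D: VirtualDisplay> Drop for ScopedOutput<'_, D> {
    fn drop(&mut self) {
        // Best effort: a failed teardown cannot be reported from `drop`.
        let _ = self.display.destroy(self.handle);
    }
}
```

Tying the output's lifetime to a guard means a dropped connection or an error path cannot leave a stray monitor behind in the user's session.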
---
## 7. The hot path: pipeline & latency budget
Per-frame pipeline, each stage on its own thread, connected by bounded SPSC channels (drop-oldest on overflow, never block the encoder):
```
capture(dmabuf) → encode(NVENC/VAAPI) → core[FEC+packetize+pace+send]
                                                │ network
client: recv → core[reorder+FEC recover+jitter] → decode → present
```
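The drop-oldest rule between stages can be illustrated with a small bounded queue. This sketch uses a mutex and condition variable from the standard library for clarity; it is not a lock-free SPSC ring, and the `DropOldest` type is a hypothetical name, so treat it as a statement of the semantics rather than the hot-path implementation.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

/// Bounded queue that evicts the oldest item when full instead of making
/// the producer wait for free space.
pub struct DropOldest<T> {
    inner: Mutex<VecDeque<T>>,
    ready: Condvar,
    capacity: usize,
}

impl<T> DropOldest<T> {
    pub fn new(capacity: usize) -> Arc<Self> {
        assert!(capacity > 0, "capacity must be at least 1");
        Arc::new(Self {
            inner: Mutex::new(VecDeque::with_capacity(capacity)),
            ready: Condvar::new(),
            capacity,
        })
    }

    /// Producer side: holds the lock briefly and never waits for the consumer.
    /// Returns the evicted item, if the queue was full.
    pub fn push(&self, item: T) -> Option<T> {
        let mut queue = self.inner.lock().unwrap();
        let evicted = if queue.len() == self.capacity {
            queue.pop_front()
        } else {
            None
        };
        queue.push_back(item);
        drop(queue);
        self.ready.notify_one();
        evicted
    }

    /// Consumer side: blocks until an item is available.
    pub fn pop(&self) -> T {
        let mut queue = self.inner.lock().unwrap();
        loop {
            if let Some(item) = queue.pop_front() {
                return item;
            }
            queue = self.ready.wait(queue).unwrap();
        }
    }

    /// Non-blocking variant for a consumer that polls.
    pub fn try_pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}
```

Returning the evicted item lets the producer count drops for the stats surface, which is useful when tuning queue depth against observed stalls.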
**Glass-to-glass budget (LAN, 240 Hz = 4.17 ms/frame):**
| Stage | Target | Notes |
|------|--------|-------|
| Capture latency | ≤ 1 frame | dmabuf, no copy to CPU |
| Encode | 1-4 ms | NVENC low-latency preset; tune lookahead off |
| FEC + packetize | < 1 ms | SIMD RS; pre-allocated shard buffers |
| Network (LAN) | < 1 ms | `sendmmsg` / UDP GSO to cut syscalls |
| Jitter buffer | 0-1 frame | adaptive; minimum that hides observed jitter |
| FEC recover + reassemble | < 1 ms | only when loss occurs |
| Decode | 1-4 ms | hardware decoder |
| Present | ≤ 1 frame | align to client vsync |
**Target: 15-35 ms glass-to-glass on LAN.** The art is *frame pacing*: matching capture/encode cadence to the client's actual refresh and keeping the jitter buffer as small as the link allows. This, not the codec, is what separates good from bad streaming. Budget real time for it.
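One way to size an adaptive jitter buffer is sketched below. It borrows the interarrival jitter estimator from RFC 3550 (a running average with gain 1/16), which is a swapped-in technique and not necessarily what the core uses; the `safety_factor` multiplier and the one-frame cap are assumptions chosen to match the 0-1 frame budget in the table.

```rust
/// Running estimate of interarrival jitter, RFC 3550 style.
pub struct JitterEstimator {
    jitter_us: f64,
    /// Previous packet's (send, receive) timestamps in microseconds.
    prev: Option<(u64, u64)>,
}

impl JitterEstimator {
    pub fn new() -> Self {
        Self { jitter_us: 0.0, prev: None }
    }

    /// Feeds one packet. Sender and receiver clocks may differ by a constant
    /// offset; only the change in transit time between packets is used.
    pub fn observe(&mut self, send_us: u64, recv_us: u64) {
        if let Some((prev_send, prev_recv)) = self.prev {
            let d = (recv_us as i64 - prev_recv as i64) - (send_us as i64 - prev_send as i64);
            // J += (|D| - J) / 16
            self.jitter_us += (d.abs() as f64 - self.jitter_us) / 16.0;
        }
        self.prev = Some((send_us, recv_us));
    }

    pub fn jitter_us(&self) -> f64 {
        self.jitter_us
    }

    /// Buffer target: a multiple of the observed jitter, capped at one frame.
    pub fn target_delay_us(&self, frame_interval_us: f64, safety_factor: f64) -> f64 {
        (self.jitter_us * safety_factor).min(frame_interval_us)
    }
}

/// Frame interval in microseconds for a given refresh rate.
pub fn frame_interval_us(hz: f64) -> f64 {
    1_000_000.0 / hz
}
```

On a perfectly paced link the target stays at zero, so the buffer adds no latency until jitter is actually observed.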
**Throughput math to keep honest:** 5120×1440@240 ≈ 1.77 Gpx/s. At 0.5 bpp that's ~885 Mbps; 0.6 bpp ≈ 1.06 Gbps; 0.8 bpp (4:4:4 headroom) ≈ 1.4 Gbps. The GF(2¹⁶) FEC + multi-block framing must sustain these without the per-frame shard count being the limiter — which it no longer is once you leave GF(2⁸).
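The arithmetic above is easy to re-check in code. The helpers below are a sketch; the 1,024-byte shard payload in the usage note is an assumed figure for illustration, and the count covers data shards only, before any parity is added.

```rust
/// Pixels per second for a given mode.
pub fn pixel_rate(width: u64, height: u64, hz: u64) -> u64 {
    width * height * hz
}

/// Encoded bitrate in bits per second at a given bits-per-pixel figure.
pub fn bitrate_bps(pixel_rate: u64, bits_per_pixel: f64) -> f64 {
    pixel_rate as f64 * bits_per_pixel
}

/// Data shards needed for one average-sized frame at a given shard payload size.
pub fn shards_per_frame(bitrate_bps: f64, hz: u64, shard_payload_bytes: u64) -> u64 {
    let bytes_per_frame = bitrate_bps / 8.0 / hz as f64;
    (bytes_per_frame / shard_payload_bytes as f64).ceil() as u64
}
```

For 5120×1440@240, `pixel_rate` gives 1,769,472,000 px/s and 0.5 bpp gives 884,736,000 bit/s, or 460,800 bytes per average frame. At the assumed 1,024-byte payload that is 450 data shards per frame, already above a single 255-shard GF(2⁸) block and far below the 65,536 limit of GF(2¹⁶).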
---
## 8. Milestones — status
M0-M5 complete; M6 (feature surface) largely shipped. The original per-milestone acceptance criteria (M0 pipeline spike → M1 core+C ABI → M2 P1 host to stock Moonlight → M3 measurement harness → M4 P2 GF(2¹⁶) wall-breaker → M5 Apple client → M6 mic/HDR/per-client identity) are in git history. Live status (what is validated, what is partial) lives in CLAUDE.md "Where the work stands." The bet held: M2 (virtual-display streaming to stock Moonlight on Linux) shipped first as a complete, gap-filling release; the wall-breaking transport, native clients, and mic-done-right were unlocked from that position, resting on a FEC core that makes the 1 Gbps ceiling a thing of the past rather than a thing to hack around.
### Open items (still in flight)
- **Sub-frame pipelining**: overlap encode and transmit within a frame. Requires a direct NVENC SDK wrapper (libavcodec only emits whole AUs). This is the next big latency lever (~2-4 ms at high res).
- **Apple stage-2 presenter as the default** (`VTDecompressionSession` + `CAMetalLayer`, live-validated behind the opt-in `punktfunk.presenter` flag at ~11 ms p50) after a few resolution/HDR checks, plus **iOS/iPadOS/tvOS variants**.
- **Windows client on-glass validation**: D3D11VA zero-copy decode + HDR present + the WinUI GUI polish are written against the windows-rs/reactor APIs but not yet validated on a real display+GPU (the dev VM is headless/Session-0/WARP); needs the RTX box. Then RAWINPUT relative-mouse pointer-lock and a per-host speed test in the UI.
- **Android real-device validation**: gamepad rumble/HID feedback and HDR10 (Main10/BT.2020 PQ) live-verify; presenter/latency polish.
- **gamescope multi-user isolation**: per-session input/audio so concurrent sessions are independent desktops (§8b-2 peer-push approval from a paired device's own app is the related open protocol-growth item).
- **GameStream AV1 + surround audio live confirmation**: both are implemented and unit/live-capture tested but still need a live Moonlight confirmation (select AV1 in a stock client; a real 5.1/7.1 listen including FEC under loss).
---
## 9. Risk register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| KWin runtime virtual-output API is undocumented/unstable | High | High | Spike on wlroots first to de-risk the pipeline; study KRdp's source for the KWin path; keep `VirtualDisplay` pluggable so a stuck compositor doesn't block the project |
| Wayland input injection gaps (libei still evolving) | Med | Med | uinput fallback always available; `reis` for the Wayland-native path |
| dmabuf → encoder zero-copy import quirks per GPU/driver | High | Med | Validate on your actual NVIDIA + AMD hardware early (M0); have a CPU-copy fallback path |
| Encoder/decoder can't sustain 1.77 Gpx/s @ 240 | Med | High | Measure in M0/M4 on real silicon; this is a hardware ceiling no rewrite fixes — discover it before P2, not after |
| Frame pacing eats more time than expected | High | Med | M3 measurement harness first; treat pacing as a first-class subsystem, not a polish step |
| Scope creep into a full Moonlight replacement | High | High | P1 (stock-client compat) is the firewall: it forces you to ship value before writing a client |
| Solo bandwidth vs. other projects | High | Med | M2 is a complete, useful artifact on its own; the plan is safe to pause after any milestone |