bench(ci): report-only regression harness — Tier-1/2 in CI + Tier-3 GPU runner

- scripts/bench/compare.py: diff criterion medians (target/criterion/**/estimates.json) vs a committed baseline, print a markdown table to the job summary, flag >threshold regressions, always exit 0 (shared CI hardware is too noisy to gate on). --update rewrites the baseline. - ci.yml `bench` job: runs Tier-1 (criterion) + Tier-2 (loss-harness FEC recovery) GPU-free in the rust-ci container, then compare.py — report-only visibility per push/PR. - scripts/bench/gpu-stream.sh + bench-gpu.yml: Tier-3 real pipeline (virtual output → zero-copy → NVENC → punktfunk/1 → reassemble) on a self-hosted GPU runner; captures encode_us/tx_mbps/ send_dropped + client capture→reassembled latency, compares to gpu-baseline.json (20% threshold). Needs the dev box registered as a `[self-hosted, gpu]` act_runner (one-time, see the workflow header) — the dedicated hardware makes its absolute baseline meaningful, unlike shared CI. - baseline.json: dev-box Tier-1 numbers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 19:24:52 +00:00
parent 2976daf2e3
commit ec617f9c6b
5 changed files with 259 additions and 0 deletions
@@ -0,0 +1,32 @@
+# Tier-3 real-world GPU benchmark — the actual capture → zero-copy → NVENC → punktfunk/1 → reassemble
+# pipeline, measuring encode time / throughput / end-to-end latency. The GPU-less CI containers
+# (ci.yml `bench` job) can only run the Tier-1/2 GPU-free benchmarks; this runs on a SELF-HOSTED GPU
+# runner — a dev box with an NVIDIA GPU + a KWin session.
+#
+# Runner setup (one-time, on the GPU box): register a Gitea act_runner with the labels below, e.g.
+#   act_runner register --instance https://git.unom.io --token <REPO_RUNNER_TOKEN> \
+#     --labels gpu:host --name <box>-gpu
+# It runs jobs directly on the host (no container) so it can reach the GPU, PipeWire and the
+# compositor. A persistent KWin session helps (else the script brings up a headless one).
+#
+# Report-only: the script flags regressions vs scripts/bench/gpu-baseline.json but never fails the
+# job. Refresh the baseline on the runner with `scripts/bench/gpu-stream.sh <mode> <secs> --update`.
+name: bench-gpu
+
+on:
+  workflow_dispatch:
+    inputs:
+      mode:
+        description: "stream mode WxHxHz"
+        default: "1920x1080x120"
+  schedule:
+    - cron: "0 6 * * *" # nightly
+
+jobs:
+  gpu-stream:
+    runs-on: [self-hosted, gpu]
+    timeout-minutes: 20
+    steps:
+      - uses: actions/checkout@v4
+      - name: Tier-3 GPU stream benchmark
+        run: bash scripts/bench/gpu-stream.sh "${{ inputs.mode || '1920x1080x120' }}" 12
@@ -115,3 +115,25 @@ jobs:
        run: bun run build
      - name: Typecheck
        run: bun run lint
+
+  bench:
+    # Tier-1 (criterion microbenchmarks) + Tier-2 (FEC loss recovery) — GPU-free, so they run here.
+    # Report-only: prints the numbers + a diff vs the committed baseline to the job summary and never
+    # fails the build (shared CI hardware is too noisy to gate on). The tight regression gate + the
+    # real encode/stream path live on the self-hosted GPU runner (Tier 3, bench-gpu.yml).
+    runs-on: ubuntu-24.04
+    container:
+      image: git.unom.io/unom/punktfunk-rust-ci:latest
+    timeout-minutes: 30
+    steps:
+      - uses: actions/checkout@v4
+      - name: Prep
+        run: |
+          git config --global --add safe.directory "$PWD"
+          command -v python3 >/dev/null || { apt-get update && apt-get install -y --no-install-recommends python3; }
+      - name: Tier-1 microbenchmarks (criterion)
+        run: cargo bench -p punktfunk-core --bench pipeline -- --warm-up-time 1 --measurement-time 3
+      - name: Tier-2 FEC loss recovery (loss-harness)
+        run: cargo run -q -p loss-harness
+      - name: Compare vs baseline (report-only)
+        run: python3 scripts/bench/compare.py --threshold 0.5