perf(host): latency hardening for the game-vs-encode GPU contention collapse

Verified, prioritized analysis in docs/host-latency-plan.md (multi-agent
investigation + adversarial verification). Lands the two low-risk tiers:

Tier 2B — Linux scheduling hygiene:
- boost_thread_priority now nices the capture/encode (-10) and send (-5)
  threads on Linux (setpriority, best-effort; no-op without CAP_SYS_NICE),
  and the wrong "gamescope caps the game" doc-comment is corrected.
- CUDA context created with CU_CTX_SCHED_BLOCKING_SYNC (frees a core on the
  shared box instead of busy-spinning on completion).
- Copies moved off the default stream onto a per-thread highest-priority
  CUDA stream (cuStreamCreateWithPriority, graceful NULL-stream fallback)
  with a per-stream sync that no longer blocks on the other worker thread's
  in-flight copies. Stream priority is measure-then-keep (NVIDIA Linux may
  ignore it); never regresses.

Tier 3A — Windows session tuning (new session_tuning.rs, raw C-ABI FFI,
no-op off Windows): once-per-process 1ms timer + DwmEnableMMCSS + HIGH
priority class; per-thread MMCSS "Games" + keep-display-awake. Wired into
both the native (boost_thread_priority) and GameStream (stream.rs) paths.
We had zero session tuning before (Apollo streaming_will_start parity).

Tier 2A (Linux NV12 convert) is specified but intentionally not landed:
it is colour-correctness-critical and needs A/B validation on a GPU box
with a display (green-screen risk). Builds + clippy + fmt green on Linux.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-18 23:05:57 +00:00
parent 16d3b7767e
commit 112a054c35
6 changed files with 472 additions and 23 deletions
@@ -60,6 +60,8 @@ fn run(
force_idr: &AtomicBool,
video_cap: &std::sync::Mutex<Option<Box<dyn Capturer>>>,
) -> Result<()> {
// GameStream capture/encode thread: apply Windows session tuning (no-op off Windows).
crate::session_tuning::on_hot_thread();
// Reject an out-of-range client mode before allocating capture/encode buffers.
encode::validate_dimensions(cfg.codec, cfg.width, cfg.height)
.context("client-requested video mode")?;
@@ -219,6 +221,8 @@ fn spawn_sender(
std::thread::Builder::new()
.name("punktfunk-send".into())
.spawn(move || {
// GameStream send thread: Windows session tuning + MMCSS (no-op off Windows).
crate::session_tuning::on_hot_thread();
// Chunk pacing: 16 packets per burst, bursts spread across the send budget.
const PACE_CHUNK: usize = 16;
let budget = frame_interval.mul_f32(0.75);