perf(host): latency hardening for the game-vs-encode GPU contention collapse

Verified, prioritized analysis in docs/host-latency-plan.md (multi-agent
investigation + adversarial verification). Lands the two low-risk tiers:

Tier 2B — Linux scheduling hygiene:
- boost_thread_priority now nices the capture/encode (-10) and send (-5)
  threads on Linux (setpriority, best-effort; no-op without CAP_SYS_NICE),
  and the wrong "gamescope caps the game" doc-comment is corrected.
- CUDA context created with CU_CTX_SCHED_BLOCKING_SYNC (frees a core on the
  shared box instead of busy-spinning on completion).
- Copies moved off the default stream onto a per-thread highest-priority
  CUDA stream (cuStreamCreateWithPriority, graceful NULL-stream fallback)
  with a per-stream sync that no longer blocks on the other worker thread's
  in-flight copies. Stream priority is measure-then-keep (NVIDIA Linux may
  ignore it); never regresses.

Tier 3A — Windows session tuning (new session_tuning.rs, raw C-ABI FFI,
no-op off Windows): once-per-process 1ms timer + DwmEnableMMCSS + HIGH
priority class; per-thread MMCSS "Games" + keep-display-awake. Wired into
both the native (boost_thread_priority) and GameStream (stream.rs) paths.
We had zero session tuning before (Apollo streaming_will_start parity).

Tier 2A (Linux NV12 convert) is specified but intentionally not landed: it
is colour-correctness-critical and needs A/B validation on a GPU box with a
display (green-screen risk).

Builds + clippy + fmt green on Linux.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -60,6 +60,8 @@ fn run(
     force_idr: &AtomicBool,
     video_cap: &std::sync::Mutex<Option<Box<dyn Capturer>>>,
 ) -> Result<()> {
+    // GameStream capture/encode thread: apply Windows session tuning (no-op off Windows).
+    crate::session_tuning::on_hot_thread();
     // Reject an out-of-range client mode before allocating capture/encode buffers.
     encode::validate_dimensions(cfg.codec, cfg.width, cfg.height)
         .context("client-requested video mode")?;
@@ -219,6 +221,8 @@ fn spawn_sender(
     std::thread::Builder::new()
         .name("punktfunk-send".into())
         .spawn(move || {
+            // GameStream send thread: Windows session tuning + MMCSS (no-op off Windows).
+            crate::session_tuning::on_hot_thread();
             // Chunk pacing: 16 packets per burst, bursts spread across the send budget.
             const PACE_CHUNK: usize = 16;
             let budget = frame_interval.mul_f32(0.75);
@@ -33,6 +33,7 @@ mod punktfunk1;
 mod pwinit;
 #[cfg(target_os = "windows")]
 mod service;
+mod session_tuning;
 mod spike;
 mod vdisplay;
 #[cfg(target_os = "windows")]
@@ -1831,10 +1831,15 @@ struct FrameMsg {
 /// capture/encode/send threads. This matters even though our GPU work is already HIGH priority: the
 /// GPU scheduler can only favour commands we've actually SUBMITTED, so if a normal-priority thread is
 /// descheduled by the game it submits the convert/encode late and the GPU priority never bites. Apollo
-/// does the same (capture thread CRITICAL, encoder ABOVE_NORMAL). Windows-only — the Linux host caps
-/// the game via gamescope, so its threads aren't starved. `critical` → highest non-realtime class
+/// does the same (capture thread CRITICAL, encoder ABOVE_NORMAL). The Linux host needs this too: an
+/// uncapped GPU-saturating title (e.g. CS2 direct on a virtual output, not capped by gamescope) is
+/// also a CPU hog and can deschedule our submit threads. `critical` → highest non-realtime class
 /// (the capture+encode loop); otherwise above-normal (the send/relay thread).
 pub(crate) fn boost_thread_priority(critical: bool) {
+    // Windows host-process/thread session tuning (timer 1ms, DWM MMCSS, HIGH class once; MMCSS +
+    // keep-display-awake per thread). No-op off Windows. Both stream threads call us, so this covers
+    // capture/encode (critical) and send (non-critical).
+    crate::session_tuning::on_hot_thread();
     #[cfg(target_os = "windows")]
     unsafe {
         use windows::Win32::System::Threading::{
@@ -1853,7 +1858,27 @@ pub(crate) fn boost_thread_priority(critical: bool) {
             }
         }
     }
-    #[cfg(not(target_os = "windows"))]
+    #[cfg(target_os = "linux")]
+    {
+        // Best-effort nice of the CALLING thread. On Linux `setpriority(PRIO_PROCESS, 0, …)` acts on
+        // the calling thread (the kernel resolves who==0 to the current task/tid), and both call
+        // sites run inside their worker thread — so this nices exactly the capture/encode (critical)
+        // and send (non-critical) threads, nothing else. Silently no-ops without CAP_SYS_NICE / a
+        // raised RLIMIT_NICE, which is fine. We deliberately do NOT use SCHED_RR/FIFO by default: a
+        // realtime CPU class can preempt the compositor AND the game's own render thread, adding the
+        // very frame-time we refuse to add (opt-in only — see PUNKTFUNK_SCHED_RR).
+        let nice = if critical { -10 } else { -5 };
+        let rc = unsafe { libc::setpriority(libc::PRIO_PROCESS, 0, nice) };
+        if rc == 0 {
+            tracing::debug!(critical, nice, "thread nice raised");
+        } else {
+            tracing::debug!(
+                critical,
+                "setpriority(nice) no-op (needs CAP_SYS_NICE / RLIMIT_NICE)"
+            );
+        }
+    }
+    #[cfg(not(any(target_os = "windows", target_os = "linux")))]
     {
         let _ = critical;
     }
@@ -0,0 +1,90 @@
+//! Windows host-process session tuning — parity with Apollo/Sunshine `streaming_will_start`.
+//!
+//! The default Windows process runs at NORMAL priority and ~15.6 ms timer granularity, and lets the
+//! GPU/display idle. Under a GPU-saturating game that starves our capture/encode/send threads (the
+//! "240→40 fps collapse"), and the coarse timer floors any precise frame pacing. This raises the
+//! process out of the default scheduling class, gives DWM and our hot threads MMCSS priority, drops
+//! the timer to 1 ms, and keeps the (virtual) display awake for the session.
+//!
+//! Raw C-ABI FFI (winmm/kernel32/dwmapi/avrt) rather than the `windows` crate so it builds without
+//! pulling new windows-rs features. No-op on non-Windows. Per-thread effects (MMCSS, execution
+//! state) auto-revert at thread exit (= session end); the process-wide bits revert at process exit.
+//! See `docs/host-latency-plan.md` Tier 3A.
+
+#[cfg(target_os = "windows")]
+mod imp {
+    #![allow(non_snake_case)]
+    use std::ffi::c_void;
+    use std::sync::OnceLock;
+
+    type Handle = *mut c_void;
+    type Bool = i32;
+
+    #[link(name = "winmm")]
+    extern "system" {
+        fn timeBeginPeriod(uPeriod: u32) -> u32;
+    }
+    #[link(name = "kernel32")]
+    extern "system" {
+        fn GetCurrentProcess() -> Handle;
+        fn SetPriorityClass(hProcess: Handle, dwPriorityClass: u32) -> Bool;
+        fn SetThreadExecutionState(esFlags: u32) -> u32;
+    }
+    #[link(name = "dwmapi")]
+    extern "system" {
+        fn DwmEnableMMCSS(fEnableMMCSS: Bool) -> i32; // HRESULT
+    }
+    #[link(name = "avrt")]
+    extern "system" {
+        fn AvSetMmThreadCharacteristicsW(TaskName: *const u16, TaskIndex: *mut u32) -> Handle;
+    }
+
+    const HIGH_PRIORITY_CLASS: u32 = 0x0000_0080;
+    const ES_CONTINUOUS: u32 = 0x8000_0000;
+    const ES_SYSTEM_REQUIRED: u32 = 0x0000_0001;
+    const ES_DISPLAY_REQUIRED: u32 = 0x0000_0002;
+
+    static PROCESS_TUNED: OnceLock<()> = OnceLock::new();
+
+    /// Process-wide tuning, applied exactly once. Reverts at process exit. Best-effort: each call is
+    /// independent and a failure is ignored (e.g. a non-elevated host may not get HIGH class).
+    fn tune_process_once() {
+        PROCESS_TUNED.get_or_init(|| unsafe {
+            // 1 ms timer granularity (default ~15.6 ms) — the floor for precise frame pacing and the
+            // encode|send split's sub-ms sleeps.
+            timeBeginPeriod(1);
+            // Run DWM's compositor work at MMCSS priority — helps the compose-rate ceiling hold up
+            // under a saturating game (capture is bounded by how often DWM composes).
+            DwmEnableMMCSS(1);
+            // Lift the whole host above NORMAL so a CPU-saturating game can't deschedule our
+            // control/capture/encode/send threads on the CPU (Apollo does the same).
+            SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
+            tracing::info!("windows session tuning applied (timer 1ms, DWM MMCSS, HIGH priority)");
+        });
+    }
+
+    /// Call at the start of each capture/encode/send (hot stream) thread. Applies the process-wide
+    /// tuning once, registers the calling thread with MMCSS ("Games"), and asserts the display/system
+    /// must stay awake for as long as this thread lives. The MMCSS handle is intentionally leaked and
+    /// the execution-state assertion is bound to this thread — both are reverted by the OS when the
+    /// thread exits, so a session that ends tears them down without explicit bookkeeping.
+    pub fn on_hot_thread() {
+        tune_process_once();
+        unsafe {
+            SetThreadExecutionState(ES_CONTINUOUS | ES_DISPLAY_REQUIRED | ES_SYSTEM_REQUIRED);
+            let task: Vec<u16> = "Games\0".encode_utf16().collect();
+            let mut idx: u32 = 0;
+            // Leak the handle: these are session/process-lifetime worker threads; the OS reverts the
+            // MMCSS characteristics at thread exit.
+            let _ = AvSetMmThreadCharacteristicsW(task.as_ptr(), &mut idx);
+        }
+    }
+}
+
+#[cfg(target_os = "windows")]
+pub use imp::on_hot_thread;
+
+/// No-op on non-Windows (Linux uses `setpriority` nice + CUDA stream priority instead — see
+/// `punktfunk1::boost_thread_priority` and `zerocopy::cuda`).
+#[cfg(not(target_os = "windows"))]
+pub fn on_hot_thread() {}
@@ -27,6 +27,15 @@ pub type CUexternalMemory = *mut c_void; // opaque CUextMemory_st*
 pub const CU_MEMORYTYPE_DEVICE: c_uint = 2;
 pub const CU_MEMORYTYPE_ARRAY: c_uint = 3;
 
+/// `CUctx_flags` (cuda.h): block the CPU on an OS primitive while waiting for the GPU instead of
+/// busy-spinning. On this shared box (compositor + send thread on the same cores) spinning a core
+/// to detect copy completion steals CPU from the very threads we want scheduled; BLOCKING_SYNC
+/// frees it. Default (`CU_CTX_SCHED_AUTO=0`) heuristically picks SPIN vs YIELD by core count.
+const CU_CTX_SCHED_BLOCKING_SYNC: c_uint = 0x04;
+
+/// `cuStreamCreateWithPriority` flag: don't implicitly synchronize with the legacy NULL stream.
+const CU_STREAM_NON_BLOCKING: c_uint = 0x01;
+
 /// `CUDA_MEMCPY2D` (cuda.h, `_v2` ABI). Field order is load-bearing.
 #[repr(C)]
 #[derive(Default)]
@@ -91,8 +100,15 @@ extern "C" {
         element_size: c_uint,
     ) -> CUresult;
     fn cuMemFree_v2(dptr: CUdeviceptr) -> CUresult;
-    fn cuMemcpy2D_v2(copy: *const CUDA_MEMCPY2D) -> CUresult;
-    fn cuCtxSynchronize() -> CUresult;
+    fn cuMemcpy2DAsync_v2(copy: *const CUDA_MEMCPY2D, stream: CUstream) -> CUresult;
+    fn cuStreamSynchronize(stream: CUstream) -> CUresult;
+    // Greatest/least stream priority the driver exposes (greatest = numerically lowest).
+    fn cuCtxGetStreamPriorityRange(least: *mut c_int, greatest: *mut c_int) -> CUresult;
+    fn cuStreamCreateWithPriority(
+        stream: *mut CUstream,
+        flags: c_uint,
+        priority: c_int,
+    ) -> CUresult;
 
     // GL interop (cudaGL.h) — these symbols have NO `_v2` suffix. `cuGraphicsEGLRegisterImage`
     // is Tegra-only on the desktop driver, so we go EGLImage → GL texture → register the texture.
@@ -162,7 +178,10 @@ pub fn context() -> Result<CUcontext> {
         let mut dev: CUdevice = 0;
         ck(cuDeviceGet(&mut dev, 0), "cuDeviceGet")?;
         let mut ctx: CUcontext = std::ptr::null_mut();
-        ck(cuCtxCreate_v2(&mut ctx, 0, dev), "cuCtxCreate_v2")?;
+        ck(
+            cuCtxCreate_v2(&mut ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev),
+            "cuCtxCreate_v2",
+        )?;
         ctx
     };
     // Racy first-init is fine: the winner's context is used; a loser leaks one context (rare,
@@ -176,6 +195,57 @@ pub fn make_current() -> Result<()> {
     unsafe { ck(cuCtxSetCurrent(ctx), "cuCtxSetCurrent") }
 }
 
+thread_local! {
+    /// Per-thread copy stream. `None` until first use; `Some(null)` means "creation failed, use the
+    /// default (NULL) stream". Per-thread (not shared) so each worker's `cuStreamSynchronize` waits
+    /// only on ITS OWN copies — the old per-frame `cuCtxSynchronize` was context-wide and also
+    /// blocked on the other worker thread's in-flight NULL-stream copies.
+    static COPY_STREAM: std::cell::Cell<Option<CUstream>> = const { std::cell::Cell::new(None) };
+}
+
+/// The calling thread's highest-priority copy stream (lazily created; context must be current).
+/// Carries the greatest stream priority the driver exposes — a scheduler hint that nudges our
+/// copies ahead of the game's queued compute. NOTE: stream priority is an intra-process hint and
+/// NVIDIA's Linux driver may ignore it / not preempt a saturating game's graphics context; this is
+/// "measure-then-keep", and it never regresses (falls back to the NULL stream). The greatest
+/// priority is the numerically-lowest value (`greatest` from `cuCtxGetStreamPriorityRange`).
+fn copy_stream() -> CUstream {
+    COPY_STREAM.with(|cell| {
+        if let Some(s) = cell.get() {
+            return s;
+        }
+        let stream = unsafe {
+            let (mut least, mut greatest) = (0i32, 0i32);
+            if cuCtxGetStreamPriorityRange(&mut least, &mut greatest) != 0 {
+                std::ptr::null_mut()
+            } else {
+                let mut s: CUstream = std::ptr::null_mut();
+                if cuStreamCreateWithPriority(&mut s, CU_STREAM_NON_BLOCKING, greatest) != 0 {
+                    std::ptr::null_mut()
+                } else {
+                    tracing::debug!(
+                        priority = greatest,
+                        "CUDA high-priority copy stream created"
+                    );
+                    s
+                }
+            }
+        };
+        cell.set(Some(stream));
+        stream
+    })
+}
+
+/// Issue `copy` on this thread's priority stream and block until it completes. Replaces the
+/// per-frame `cuMemcpy2D_v2` + context-wide `cuCtxSynchronize` pair: same completion guarantee
+/// (the source dmabuf is safe to recycle once this returns), but the wait is scoped to our own
+/// stream and the copy carries the high priority hint.
+unsafe fn copy_blocking(copy: &CUDA_MEMCPY2D, what: &str) -> Result<()> {
+    let stream = copy_stream();
+    ck(cuMemcpy2DAsync_v2(copy, stream), what)?;
+    ck(cuStreamSynchronize(stream), "cuStreamSynchronize")
+}
+
 /// Allocate one pitched device buffer for `width`x`height` 4-byte pixels; returns `(ptr, pitch)`.
 fn alloc_pitched(width: u32, height: u32) -> Result<(CUdeviceptr, usize)> {
     let mut ptr: CUdeviceptr = 0;
@@ -342,7 +412,8 @@ impl RegisteredTexture {
     }
 
     /// Map the texture for this frame, copy its (already-linear RGBA8) array into `dst`, then
-    /// unmap. The `cuCtxSynchronize` ensures `dst` is ready before the source dmabuf is recycled.
+    /// unmap. The copy is synchronized (on our priority stream) before unmap so `dst` is ready
+    /// before the source dmabuf is recycled. Always unmaps, even if the copy errors.
     pub fn copy_mapped_to(&mut self, dst: &DeviceBuffer) -> Result<()> {
         unsafe {
             ck(
@@ -364,13 +435,10 @@ impl RegisteredTexture {
                 Height: dst.height as usize,
                 ..Default::default()
             };
-            let r = cuMemcpy2D_v2(&copy);
-            let s = cuCtxSynchronize();
+            let res = copy_blocking(&copy, "cuMemcpy2DAsync_v2");
             let _ = cuGraphicsUnmapResources(1, &mut self.resource, std::ptr::null_mut());
-            ck(r, "cuMemcpy2D_v2")?;
-            ck(s, "cuCtxSynchronize")?;
+            res
         }
-        Ok(())
     }
 }
 
@@ -393,11 +461,7 @@ pub fn copy_device_to_device(
         Height: src.height as usize,
         ..Default::default()
     };
-    unsafe {
-        ck(cuMemcpy2D_v2(&copy), "cuMemcpy2D_v2(dev->dev)")?;
-        ck(cuCtxSynchronize(), "cuCtxSynchronize")?;
-    }
-    Ok(())
+    unsafe { copy_blocking(&copy, "cuMemcpy2DAsync_v2(dev->dev)") }
 }
 
 impl Drop for RegisteredTexture {
@@ -500,10 +564,7 @@ pub fn copy_pitched_to_buffer(
         Height: dst.height as usize,
         ..Default::default()
     };
-    unsafe {
-        ck(cuMemcpy2D_v2(&copy), "cuMemcpy2D_v2(ext->dev)")?;
-        // The copy must finish before the dmabuf is requeued to the producer.
-        ck(cuCtxSynchronize(), "cuCtxSynchronize")?;
-    }
-    Ok(())
+    // copy_blocking syncs our priority stream before returning, so the copy is complete before the
+    // dmabuf is requeued to the producer.
+    unsafe { copy_blocking(&copy, "cuMemcpy2DAsync_v2(ext->dev)") }
 }
@@ -0,0 +1,268 @@
+# Host latency & the GPU-contention collapse — analysis + prioritized plan
+
+Scope: Windows + Linux GameStream/punktfunk1 hosts. Priority: **latency**, and specifically the
+"saturating game starves the stream" headache:
+
+> CS2 runs 400+ fps. Client requests 240. In an easy scene the client gets ~200; in a demanding
+> (GPU-100%) scene it collapses to 40-50. Capping the game is **not** an acceptable fix.
+
+This doc is the synthesis of a multi-agent investigation (deep read of our pipeline + the
+[Apollo comparison](apollo-comparison.md) + external NVIDIA/streaming research) followed by an
+**adversarial verification pass** — every candidate fix was attacked, against our actual code, to
+separate real levers from placebo. The "Dropped / why" section exists so we don't re-propose the
+placebos.
+
+## Implementation status (2026-06-18)
+
+- ✅ **Tier 2B — Linux scheduling hygiene**: landed. `boost_thread_priority` now nices the
+  capture/encode + send threads on Linux (`setpriority`, best-effort) and its wrong gamescope
+  doc-comment is fixed; CUDA context uses `CU_CTX_SCHED_BLOCKING_SYNC`; copies run on a per-thread
+  highest-priority CUDA stream (`cuStreamCreateWithPriority`, graceful NULL-stream fallback) with a
+  per-stream sync that no longer blocks on the other worker thread's work. Builds + clippy + fmt
+  green. The stream-priority hint is **measure-then-keep** (NVIDIA Linux may ignore it).
+- ✅ **Tier 3A — Windows session tuning**: landed (`session_tuning.rs`, raw C-ABI FFI, no-op off
+  Windows). Each capture/encode/send thread now applies process-wide tuning once (1 ms timer,
+  `DwmEnableMMCSS`, `HIGH_PRIORITY_CLASS`) and per-thread MMCSS "Games" + keep-display-awake. Wired
+  into both the native (`boost_thread_priority`) and GameStream (`stream.rs`) paths. Linux no-op
+  path builds green; the Windows path is validated by the Windows CI runner / on-box.
+- ⏳ **Tier 2A — Linux NV12 convert**: specified to the code level (below) but **not landed** — it is
+  a ~300-line, colour-correctness-critical change that cannot be A/B-validated on the headless dev
+  VM (no display; the project has already been burned by the exact green-screen failure mode this
+  risks — Steam-Deck `SEPARATE_LAYERS` bug). Execute + A/B it on a GPU box **with a display**.
+
+---
+
+## 0. Three corrections to the mental model (read first)
+
+**(A) "Feed NVENC RGB so the ASIC does the colour-convert" is backwards.**
+NVENC's encode core is YUV-native. RGB input makes the driver insert an **RGB→YUV CSC on the
+SM/3D-compute cores** — the *exact* engine a game saturates. Windows already does the right thing:
+`convert_to_yuv` runs the CSC on the dedicated **VIDEO engine** via `VideoProcessorBlt`
+(`capture/dxgi.rs:1023,1063`), logged as "0% 3D". **Linux still feeds NVENC RGB**
+(`encode/linux.rs:98 nvenc_input` → `RGBZ`/`BGRZ`; the `zerocopy/egl.rs:98` shader is a `.bgra`
+*swizzle*, not a CSC), so it pays NVENC's internal CSC on the SM every frame. That is the single
+biggest, clearly-fixable contention source on Linux, and Windows already eliminated it.
+
+**(B) "More GPU priority so our frames get through" is already maxed on Windows, and hits a
+hardware ceiling.** We ship `D3DKMTSetProcessSchedulingPriorityClass=HIGH(4)` +
+`SetGPUThreadPriority(0x4000001E)` + `SetMaximumFrameLatency(1)` (`capture/dxgi.rs:160-263`). The
+residual ~20 ms `lock_bitstream` wall (documented at `dxgi.rs:155`) is GPU **context-scheduling
+latency**, bounded by **preemption granularity**: NVIDIA preempts *compute* at instruction level
+(~0.1 ms) but *graphics* only at coarse draw/tile/DMA-buffer boundaries (milliseconds out under a
+draw flood). No priority class preempts an in-flight game draw. So the winning strategy is **not
+more priority** — it is (1) do **less work on the contended graphics/3D engine**, and (2) **overlap
+the unavoidable per-frame scheduling wait across frames** to recover throughput.
+
+**(C) A chunk of the collapse is upstream of our encoder — no encode/priority fix can beat it.**
+DXGI Desktop Duplication *and* WGC both capture **from the DWM compositor**, so captured fps is
+hard-ceilinged at the **compose rate**, never the game's 400 fps. Under saturation the *compositor
+itself* is scheduled late → composes fewer unique frames → we starve even though NVENC is idle. And
+borderless/fullscreen games on **Independent/Direct Flip** present straight to scanout, *bypassing
+DWM*, so capture sees ~half the frames (this is the "200 not 240"). The host already paces at
+`target_fps` and **re-encodes held frames**, so *transmitted* fps stays ~240 while *unique* fps
+collapses. **This must be measured before blaming encode.**
+
+> Net: Windows is already near best-in-class (priority + video-engine CSC + encode|send split all
+> shipped); its remaining wins are narrow and partly a hardware/compositor ceiling. **Linux is the
+> least-hardened host and holds most of the headroom.**
+
+---
+
+## Tier 0 — Diagnose first (cheap, decisive, do before writing code)
+
+Everything below is gated on knowing *which* bucket the collapse is in. We already have the tooling.
+
+1. **Run the workload with `PUNKTFUNK_PERF=1` and read `uniq` vs `fps`.** The `uniq` counter
+   (genuinely-new captured frames vs re-encoded holds) already exists
+   (`gamestream/stream.rs:332-336,403`; `wgc_helper.rs:122-183`). Under CS2 at GPU-100%:
+   - **`fps`≈240 but `uniq`→40-50** ⇒ the *source/compositor* only produced 40-50 unique frames.
+     No encode/priority/cadence fix on our side exceeds that — it is the game's effective
+     present-to-compose rate at 100% GPU. The lever there is **reducing our own per-frame GPU
+     steal** (Tier 2) so the game keeps more headroom, plus the cadence work (Tier 1A).
+   - **both `fps` and `uniq`→40-50** ⇒ our capture→convert→encode round-trip is being starved (the
+     `lock_bitstream` scheduling stall). The Tier 1/2 contention levers apply directly.
+2. **Confirm the game's flip mode on Windows.** If the game is on Independent/Direct Flip (MPO),
+   capture is bypassing DWM and seeing half the frames. We already have `capture/composed_flip.rs`
+   — verify ForceComposedFlip is actually engaged on the game path, and watch `cap_us`.
+3. Capture `cap_us` / `enc_us` / `pace_us` p50/p99 alongside, to localise the stall.
+
+Run this on the real-GPU boxes (RTX 4090 Windows host; a Linux NVIDIA box with a real game). This
+headless dev VM cannot reproduce the contention.
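
The `fps` vs `uniq` reading in step 1 amounts to a two-way classification. A minimal sketch of that decision, assuming the two counters are available as plain numbers; `Bottleneck`, `classify`, and the `collapse_ratio` threshold are hypothetical names for illustration, not part of the host code:

```rust
/// Which bucket the collapse falls in, per the Tier 0 reading of `fps` vs `uniq`.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Bottleneck {
    /// Transmitted fps holds near target while unique fps collapses:
    /// the source/compositor is the limit.
    SourceStarved,
    /// Both collapse: our capture -> convert -> encode round trip is starved.
    PipelineStarved,
    /// Neither counter has collapsed.
    Healthy,
}

/// `collapse_ratio` is an assumed threshold (e.g. 0.5 = "below half of target"),
/// not a measured constant.
pub fn classify(target_fps: f64, fps: f64, uniq: f64, collapse_ratio: f64) -> Bottleneck {
    let floor = target_fps * collapse_ratio;
    match (fps < floor, uniq < floor) {
        (false, true) => Bottleneck::SourceStarved,
        // `uniq` cannot exceed `fps`, so a collapsed `fps` implies a starved pipeline.
        (true, _) => Bottleneck::PipelineStarved,
        (false, false) => Bottleneck::Healthy,
    }
}
```

With a 240 fps target, `classify(240.0, 240.0, 45.0, 0.5)` lands in the source-starved bucket (Tier 2 and Tier 1A levers), while `classify(240.0, 45.0, 45.0, 0.5)` lands in the pipeline-starved bucket (Tier 1/2 contention levers).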
+
+---
+
+## Tier 1 — The two under-weighted, cross-platform levers (confirmed by research, not yet done)
+
+### 1A. Capture-source / compose-rate cadence (where "200 not 240" actually lives)
+The capture ceiling is the compositor's compose rate, and under load the compositor gets starved.
+Levers, in order:
+- **Force Composed Flip on Windows** for the game path (defeat MPO/flip-metering frame loss).
+  Machinery exists (`composed_flip.rs`); confirm it engages and measure the unique-frame delta.
+- **Opt-in "double-refresh" virtual output**: create the per-session virtual output at ~2× the
+  client's rate to break the game-present-vs-compose beat (community-validated; cheap for us since
+  we already mint arbitrary-mode virtual outputs). Gate **off** by default and **never** on the
+  gamescope/SudoVDA game-attach path (no DWM beat there; it just adds compose work to the saturated
+  engine). `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER`.
+- **Reflex / render-queue=0 style headroom** (non-capping): documented as the substitute for an fps
+  cap — removes render-queue backpressure so the compositor/capture get scheduled. Investigate what
+  we can influence from the host side.
+
+Risk: the double-refresh trick can be a net regression under saturation (doubles compose + our
+capture work on the saturated engine) — measure (Tier 0) before shipping it on by default.
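
The gating rules for the double-refresh output (off by default, never on the game-attach path) can be sketched as a pure function. This assumes `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER` carries a small integer and that only 1 and 2 are accepted; the value format, the accepted range, and the `output_hz` helper are assumptions for illustration, since the plan only names the variable:

```rust
/// Resolve the refresh rate for the per-session virtual output.
///
/// `raw` is the unparsed value of `PUNKTFUNK_OUTPUT_HZ_MULTIPLIER` (None = unset).
/// Hypothetical helper: integer-only parsing and the 1..=2 range are assumptions.
pub fn output_hz(client_hz: u32, raw: Option<&str>, game_attach_path: bool) -> u32 {
    // Never multiply on the gamescope/SudoVDA game-attach path: there is no DWM
    // beat to break there, only extra compose work on the saturated engine.
    if game_attach_path {
        return client_hz;
    }
    let multiplier = raw
        .and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|m| (1..=2).contains(m))
        .unwrap_or(1); // unset, unparsable, or out of range: feature stays off
    client_hz.saturating_mul(multiplier)
}
```

A caller would pass `std::env::var("PUNKTFUNK_OUTPUT_HZ_MULTIPLIER").ok().as_deref()` as `raw`; keeping the environment read outside the function makes the gating testable without touching process state.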
+
+### 1B. Pin GPU power / clock state for the session (kills the per-frame downclock tax)
+NVIDIA's adaptive P-state downclocks between our small bursty frames and pays a ramp every frame —
+a hidden latency tax, *most visible in easy scenes* (the ~200-should-be-240 case). Sunshine ships
+this as `nvenc_latency_over_power` and calls it decisive. **Neither host does it.**
+- **Windows**: NvAPI **per-application DRS profile** `PREFERRED_PSTATE = PREFER_MAX` scoped to our
+  exe (not a global override). Load `nvapi64.dll` dynamically; treat `NvAPI_Initialize` failure as
+  "no NVIDIA, skip" (covers AMD/Intel + the WARP dev VM). **Crash-safe undo is mandatory**: write
+  an undo record to `%ProgramData%\punktfunk\` *before* applying and revert a stale profile on next
+  startup — a crash must not leave the user's control panel modified.
+- **Linux**: prefer the **root-free** path — disable the CUDA "Force P2 State" downclock that
+  context creation triggers (env/per-context), and `nvidia-smi -pm 1` (persistence) where
+  permitted. `nvmlDeviceSetGpuLockedClocks` needs root/CAP_SYS_ADMIN (our host runs as a normal
+  user → silent no-op) and is brittle across SKUs; if used, query `nvmlDeviceGetMaxClockInfo`, lock
+  to *that*, and restore on teardown **and** via a SIGTERM/panic handler.
+- Gate behind `PUNKTFUNK_PIN_CLOCKS`; **default OFF on battery / Steam Deck** (thermal/power caps
+  make pinning actively harmful there).
+
+Impact: reliable, modest p99 / easy-scene win on both OSes. Does **not** fix the saturated-scene
+collapse (at 100% util the clock is already maxed). Low cost.
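
The "write the undo record before applying, revert a stale one on next startup" rule for the Windows DRS profile is a small write-ahead pattern. A minimal, platform-neutral sketch using only `std`; the record file name, its contents, and both helper functions are hypothetical, and the actual NvAPI calls are represented by caller-supplied closures:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical record location inside the state directory.
fn record_path(dir: &Path) -> PathBuf {
    dir.join("pstate-undo.txt")
}

/// Persist the previous setting BEFORE applying the new one, so a crash at any
/// later point still leaves a record to revert from.
pub fn apply_with_undo(
    dir: &Path,
    previous: &str,
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut file = fs::File::create(record_path(dir))?;
    file.write_all(previous.as_bytes())?;
    file.sync_all()?; // flush to disk before touching the driver profile
    apply()
}

/// Call at startup (stale record from a crash) and at clean teardown.
/// Returns Ok(true) if a record existed and was reverted.
pub fn revert_if_recorded(
    dir: &Path,
    revert: impl FnOnce(&str) -> io::Result<()>,
) -> io::Result<bool> {
    let path = record_path(dir);
    let previous = match fs::read_to_string(&path) {
        Ok(s) => s,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    revert(&previous)?;
    // Only drop the record once the revert has succeeded.
    fs::remove_file(&path)?;
    Ok(true)
}
```

The same `revert_if_recorded` call serves both the clean-shutdown path and the next-startup recovery path, so there is only one revert code path to get right.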
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tier 2 — Linux work-deletion + scheduling hygiene (the biggest in-our-control headroom)

### 2A. Produce **NV12/P010** on Linux and feed it to NVENC native (delete the SM-side CSC)
The strictly-correct version (verified): **extend the existing GL de-tile blit
(`zerocopy/egl.rs`) to emit NV12** instead of swizzled BGRx (multi-render-target: GL_R8 luma
full-res + GL_RG8 chroma half-res, or two passes), applying an **explicit BT.709 limited-range
matrix matching the Windows `VideoConverter`** (`dxgi.rs:957`) so hosts look identical. Then
register `NV_ENC_BUFFER_FORMAT_NV12` with the encoder (teach `encode/linux.rs:98 nvenc_input` an
NV12 case; `CudaHw sw_format` → `AV_PIX_FMT_NV12`).
- Net: today = GL swizzle (3D) **+** NVENC-internal CSC (SM); after = GL CSC (3D, ~same cost as the
  swizzle it replaces) **+ zero NVENC CSC**. Removes one whole CSC pass and removes it from the SM.
- **Do *not* implement this as a standalone CUDA convert kernel on the tiled path**: CUDA can't
  sample a tiled NVIDIA surface (`cuGraphicsEGLRegisterImage` is Tegra-only, `egl.rs:6-12`), so it
  would still need the GL detile, *and* a CUDA kernel runs on the same saturated SM. The CUDA-kernel
  route is only clean on the **LINEAR/Vulkan-bridge (gamescope)** path, where it doubles as the NV12
  producer; do it there if/when that path needs it.
- Pitfalls: pervasive 4-byte-pixel assumptions break with NV12. `cuda.rs` hardcodes
  `WidthInBytes = width*4` (`:363,392,499`), `BufferPool`/`alloc_pitched` assume 4 B/px, GL dst is
  `GL_RGBA8`; all need a plane-aware NV12 variant (luma W·H + chroma W·H/2, two-plane copy) or you
  get the Steam-Deck green-screen class of bug. The HDR/10-bit path needs P010, not NV12.
- Impact: real, **modest, compounding**: a few ms of per-frame GPU time and a shorter time-slice
  need, which stacks with cadence + power-pin. **Not** a standalone cure for the 240→40 collapse
  (external "47→100 fps" numbers are other people's non-zero-copy pipelines; don't promise them).
  Medium cost. Gate behind a `PUNKTFUNK_*` env and A/B `cap→encoded` p50 + the CS2 fps floor.
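
As a CPU-side reference for validating the shader output, the two pieces of arithmetic above (NV12 plane sizes and the BT.709 limited-range conversion) can be sketched as follows. This is an illustrative sketch: it uses the standard BT.709 coefficients and 8-bit limited-range scaling, and whether that matches the Windows `VideoConverter` bit-for-bit is an assumption to verify in the A/B test. The function names are hypothetical and the plane sizes assume a tightly packed frame (pitch equal to width) with even dimensions.

```rust
/// Byte sizes of the NV12 luma and interleaved-chroma planes for a tightly
/// packed frame. Assumes even width and height (4:2:0 subsampling).
fn nv12_plane_sizes(width: usize, height: usize) -> (usize, usize) {
    let luma = width * height; // 1 byte per pixel, full resolution
    let chroma = width * (height / 2); // (W/2 CbCr pairs * 2 bytes) * H/2 rows
    (luma, chroma)
}

/// Standard BT.709 limited-range RGB -> Y'CbCr for one 8-bit pixel.
/// Y' lands in 16..=235 and Cb/Cr in 16..=240.
fn rgb_to_ycbcr_bt709_limited(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    const KR: f32 = 0.2126;
    const KB: f32 = 0.0722;
    const KG: f32 = 1.0 - KR - KB;

    let (r, g, b) = (r as f32 / 255.0, g as f32 / 255.0, b as f32 / 255.0);
    let y = KR * r + KG * g + KB * b; // 0.0..=1.0
    let cb = (b - y) / (2.0 * (1.0 - KB)); // -0.5..=0.5
    let cr = (r - y) / (2.0 * (1.0 - KR)); // -0.5..=0.5

    // Scale into the limited range and round to the nearest code value.
    let quantize = |v: f32| v.round().clamp(0.0, 255.0) as u8;
    (
        quantize(16.0 + 219.0 * y),
        quantize(128.0 + 224.0 * cb),
        quantize(128.0 + 224.0 * cr),
    )
}
```

For a 1920x1080 frame this gives a 2,073,600-byte luma plane and a 1,036,800-byte chroma plane, against 8,294,400 bytes for the 4 B/px layout the current copy code assumes. Comparing a few known pixels (black, white, pure red) between this reference and a readback of the GL output is a cheap way to catch a swapped plane or a full-range matrix before it shows up as a colour shift.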

### 2B. Linux scheduling hygiene (cheap; the priority bits are "measure-then-keep")
Consolidates the genuine parts of several candidates. Mostly unambiguous cleanups + opt-in
priority:
- **Arm the Linux `boost_thread_priority` no-op** (`punktfunk1.rs:1856` cfg branch): best-effort
  `libc::setpriority(PRIO_PROCESS, 0, -10/-5)` on the calling thread (tid 0 = self), log-and-continue
  on failure (Linux reports a missing CAP_SYS_NICE as EACCES, so handle that as well as EPERM).
  **Do not** default to SCHED_RR/FIFO (can starve the compositor and the game's render
  thread; the user refuses to add game frame-time); offer it only behind `PUNKTFUNK_SCHED_RR=1`.
  **Fix the wrong doc-comment** at `punktfunk1.rs:1834-1835` ("the Linux host caps the game via
  gamescope, so its threads aren't starved"), which is false for the uncapped/NVIDIA-direct path.
- **Set CUDA context scheduling deliberately**: `cuCtxCreate` flag `CU_CTX_SCHED_BLOCKING_SYNC` on
  this shared VM (frees a core vs the default AUTO/SPIN). This is a CPU-efficiency fix, not a
  throughput fix.
- **High-priority CUDA stream + EGL context priority** (the missing analogue of the Windows
  hardening): `cuStreamCreateWithPriority(highest from cuCtxGetStreamPriorityRange)` for our copies;
  request `EGL_IMG_context_priority HIGH` (try `EGL_NV_context_priority_realtime`) at
  `egl.rs:332`. **Caveat, load-bearing**: these are intra-process *hints* and NVIDIA's Linux driver
  has been reported to **ignore** context priority (driver 545: high- vs low-priority EGL contexts
  measured identical) and to **deny** realtime Vulkan queues. Implement with graceful fallback,
  gate behind env, and **measure on driver 595**; do not architect around it or credit it before
  measurement.
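
The best-effort shape of the priority boost from the first bullet can be sketched as below. To keep the sketch dependency-free and testable, the syscall is injected as a closure; in the host it would be something like `libc::setpriority(libc::PRIO_PROCESS, 0, nice)` followed by `io::Error::last_os_error()` on failure. The `HotThread` enum and the function names are hypothetical; the -10 / -5 values are the ones given above.

```rust
use std::io;

/// Hypothetical role tag for the two hot threads named in the plan.
#[derive(Clone, Copy, Debug, PartialEq)]
enum HotThread {
    CaptureEncode,
    Send,
}

/// Nice values from the plan: -10 for capture/encode, -5 for send.
fn nice_for(role: HotThread) -> i32 {
    match role {
        HotThread::CaptureEncode => -10,
        HotThread::Send => -5,
    }
}

/// SCHED_RR stays opt-in; "1" is the value shown in the plan.
fn sched_rr_requested(env_value: Option<&str>) -> bool {
    env_value == Some("1")
}

/// Best effort: try to renice the calling thread, log and continue on failure.
/// `set_nice` stands in for the real setpriority call (an assumption of this sketch).
/// Returns whether the boost took effect; it never propagates an error.
fn boost_thread_priority<F>(role: HotThread, set_nice: F) -> bool
where
    F: FnOnce(i32) -> io::Result<()>,
{
    match set_nice(nice_for(role)) {
        Ok(()) => true,
        Err(e) => {
            // Expected without CAP_SYS_NICE; streaming must carry on regardless.
            eprintln!("priority boost skipped for {role:?}: {e}");
            false
        }
    }
}
```

Returning a `bool` rather than a `Result` keeps the caller from ever treating a denied boost as fatal, while still letting the session log record whether the measurement ran with or without it.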

> Explicitly **not** doing on Linux: Vulkan `VK_EXT_global_priority` as "the" lever (it only touches
> the minority gamescope/LINEAR copy, not the convert; likely a silent no-op on consumer NVIDIA).
> Replacing `cuCtxSynchronize` with a per-stream event chain for *contention* reasons (it's
> per-context, never waited on the game's separate context, so a non-fix; keep the full sync where it
> guards dmabuf recycle, `egl.rs:491`).

---

## Tier 3 — Windows parity polish (Windows is already strong)

- **3A. Host-process session tuning (we have *zero* today, verified):** `NtSetTimerResolution(0.5ms)`
  / `timeBeginPeriod(1)` (default 15.6 ms granularity blocks precise pacing), `DwmEnableMMCSS(true)`,
  `SetPriorityClass(HIGH_PRIORITY_CLASS)`, MMCSS-register the capture/encode threads ("Games"/"Pro
  Audio"), `SetThreadExecutionState(ES_CONTINUOUS|ES_DISPLAY_REQUIRED)`. All revert on stop.
  Foundational for any precise frame pacing and the encode|send split. Low cost, low risk.
  (`gamestream/stream.rs` start/stop; Apollo's `streaming_will_start`/`_stopped`.)
- **3B. Auto-gated REALTIME D3DKMT class** instead of fixed HIGH (the realtime opt-in already exists
  at `dxgi.rs:199-207`): probe HAGS (`D3DKMTQueryAdapterInfo` `HwSchEnabled`) **and** VRAM headroom
  (`IDXGIAdapter3::QueryVideoMemoryInfo`, continuously), allow REALTIME(5) only when safe (HAGS off,
  or HAGS on + VRAM comfortably below budget), downgrade to HIGH the moment VRAM pressure rises.
  Sunshine's actual gate avoids the HAGS+near-full-VRAM NVENC freeze/crash. Marginal (one scheduling
  rung, same preemption ceiling), so rank it as cheap parity, not a fix.
- **3C. Cheap experiment: `VideoProcessorBlt` directly from the DDA surface** (skip the same-format
  `gpu_copy` at `dxgi.rs:2375`), then `ReleaseFrame`, *iff* it doesn't re-serialize `AcquireNextFrame`
  (the existing decouple-copy was measured 40-200 fps vs ~60 fps, but that note predates confirming
  the Blt is on the video engine). One-line source-texture change; benchmark only. Do **not** build a
  D3D11↔D3D12 copy-queue offload: the convert is already off-3D, the remaining copy is intra-VRAM
  (~5% 3D, no PCIe), not worth the interop rebuild.
- **3D. Async NVENC + off-thread retrieve (measure-gated, uncertain).** Today retrieve
  (`lock_bitstream`) runs **inline on the submit thread** (`nvenc.rs:524-558`), which is *why*
  `depth>1` was measured to regress (`wgc_helper.rs:111-114`). The NVENC guide mandates submit/retrieve
  on separate threads with completion events + a deep surface pool; doing that *could* let per-frame
  scheduling waits **overlap across frames** and recover *throughput*, at a per-frame *latency* cost
  (depth × frame time). This is the one place the research and our own prior measurement disagree, so
  it is **strictly measure-first**, and it forecloses slice output (`reportSliceOffsets` needs
  `enableEncodeAsync=0`). Treat as a structural experiment, not a committed win.
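
The 3B gate reduces to a small pure decision that can be re-evaluated whenever the VRAM numbers are polled. The sketch below is illustrative only: the 80% "comfortably below budget" threshold, the enum, and the function name are assumptions, and the inputs are expected to come from the HAGS probe and from the `CurrentUsage` / `Budget` fields that `QueryVideoMemoryInfo` reports.

```rust
/// GPU scheduling class we would request (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
enum GpuPriority {
    High,
    Realtime,
}

/// Assumed headroom threshold: usage below 80% of budget counts as comfortable.
/// The plan does not give a number; tune it from measurement.
const VRAM_COMFORT_RATIO: f64 = 0.80;

/// REALTIME only when safe: HAGS off, or HAGS on with VRAM comfortably below
/// budget. Anything uncertain falls back to HIGH.
fn choose_gpu_priority(hags_enabled: bool, vram_usage: u64, vram_budget: u64) -> GpuPriority {
    if !hags_enabled {
        return GpuPriority::Realtime;
    }
    if vram_budget == 0 {
        // No usable budget reading: take the conservative class.
        return GpuPriority::High;
    }
    let ratio = vram_usage as f64 / vram_budget as f64;
    if ratio < VRAM_COMFORT_RATIO {
        GpuPriority::Realtime
    } else {
        GpuPriority::High
    }
}
```

Because the function is stateless, "downgrade the moment VRAM pressure rises" is just a matter of calling it on every poll and re-applying the class when the result changes. A real gate would probably add hysteresis so the class does not flap around the threshold.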

---

## Tier 4 — Deferred 2nd-order latency (not contention fixes; do after Tiers 0-2)

- **GL2: Intra-refresh for RFI/recovery** (`enableIntraRefresh` + recovery-point SEI) instead of a
  forced full-IDR: spreads a moving intra band across N frames, killing the 20-40× keyframe size
  spike and the VBV-overshoot drops it causes. Preconditions (infinite GOP, P-only) already met.
  Medium; needs all 4 clients to trust the recovery-point SEI and stop demanding IDRs. Real p99 win,
  orthogonal to the collapse.
- **GL1 + GL6: Sub-frame slice output + per-slice paced send** (the roadmap's "~2-4 ms lever"):
  `enableSubFrameWrite` + `sliceMode` + transmit each slice as it completes. **Big**: needs the
  direct NVENC SDK on Linux (libavcodec emits whole AUs) **and** a per-slice wire/FEC redesign in
  `punktfunk-core` (today `PacketHeader`/`Packetizer`/reassembler are whole-AU; per-slice FEC blocks
  wreck Leopard efficiency) **and** client slice-granular submit. Gate on
  `NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK` (often absent on consumer GeForce). The paced-send half is
  **already shipped** (`stream.rs spawn_sender`, `punktfunk1.rs paced_submit`); don't re-implement.

---

## Dropped / why (so we don't re-propose placebo)

| Candidate | Verdict | Why |
|---|---|---|
| Feed NVENC ARGB to "offload CSC to ASIC" | ✗ backwards | RGB input forces CSC onto the SM; YUV-native is correct (see §0A). |
| Replace `cuCtxSynchronize` with per-stream event chain *for contention* | ✗ | `cuCtxSynchronize` is per-context, never waited on the game's separate process; single null stream = no overlap to win. Keep the full sync where it guards dmabuf recycle. |
| Vulkan `VK_EXT_global_priority` as the Linux priority lever | ✗ | Touches only the minority gamescope/LINEAR `vkCmdCopyBuffer`, not the convert; consumer NVIDIA denies realtime / ignores it. Retarget to CUDA/EGL priority. |
| Async NVENC as a *throughput/collapse* fix | ✗ (→ measure-gated 3D) | Async is CPU-thread-only (NVIDIA guide); Apollo's own PR #3629 measured no gain; our `depth>1` regressed; Linux-impossible. Kept only as the structural pipelining experiment (§3D). |
| D3D12 copy-queue offload of the DDA copy | ✗ | Convert already off-3D; remaining copy is intra-VRAM ~5%, no PCIe; not worth a D3D11↔D3D12 interop rebuild. |
| Empty-frame (`LastPresentTime==0`) skip | ✗ for this | Static desktop already coalesced via WAIT_TIMEOUT; under a 400 fps game there are no empty frames to skip. |
| GL5: set ULL RC knobs explicitly | ✗ (audit only) | ULL preset already sets `zeroReorderDelay=1`, lookahead/multipass/AQ off; ffmpeg defaults match + we set `bf=0`. Only `lowDelayKeyFrameScale=1` is non-redundant → fold into GL2 (Windows SDK path only). |
| GL3: true ref-frame invalidation | ✗ for this | No lost-range protocol signal (both control planes collapse to a bool/unit); libavcodec exposes no `nvEncInvalidateRefFrames`; deeper DPB adds per-frame cost. Revisit only as loss-recovery robustness. |
| GL4: move input injection off the ENet thread | ✗ for this | CPU-side, orthogonal to GPU contention; the blocking case is a once-per-UAC desktop switch. Demote to control-plane robustness. |
| SCHED_RR/FIFO by default (Linux) | ✗ default | Can preempt the compositor + the game's render thread → adds game frame-time the user refuses. Opt-in only. |

---

## Recommended order of attack

1. **Tier 0 diagnose** on the real boxes: settles whether the collapse is source-ceiling or
   pipeline-starvation, and whether flip-bypass is halving capture.
2. **Tier 2A (Linux NV12)** + **Tier 2B (Linux scheduling hygiene)**: the largest in-our-control
   headroom; Linux is the least-hardened host.
3. **Tier 1B (clock/power pin)** both OSes: cheap, fixes the easy-scene 200-vs-240, crash-safe undo.
4. **Tier 1A (cadence/flip)**: gated on Tier 0 (this is where a big chunk of the collapse may live).
5. **Tier 3 (Windows polish)**: session tuning is the clear win; the rest is parity.
6. **Tier 4**: only after the contention work; intra-refresh first, slice pipelining last.

Honest expectation: with the work-deletion + cadence + power-pin levers stacked, the easy-scene gap
closes and the saturated floor rises, but a residual ceiling remains. At 100% utilization the GPU
physically cannot both render the game *and* compose 240 unique frames, and WDDM/NVIDIA preemption
granularity caps how far priority can claw back. Report that ceiling honestly rather than chasing it
with encoder micro-optimisations.