perf(host/windows): zero-copy NVENC — encode the capturer's texture in place (halve 3D-engine load)
ci / rust (push) Failing after 43s
apple / swift (push) Successful in 53s
ci / web (push) Successful in 35s
android / android (push) Successful in 1m45s
ci / docs-site (push) Successful in 29s
ci / bench (push) Successful in 1m35s
decky / build-publish (push) Successful in 32s
deb / build-publish (push) Successful in 2m21s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 17s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 2m59s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 3m52s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 21s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 2m37s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 5m37s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 5m4s
docker / deploy-docs (push) Successful in 18s
The Windows host pegged the GPU 3D engine at ~97% during high-fps desktop streaming, measured (per-process GPU-engine counters) as OUR process, not DWM.

Cause: TWO VRAM->VRAM CopyResource per frame (dupl->gpu_copy in the capturer, then gpu_copy->nvenc_pool in the encoder), and on Windows D3D11 routes copies to render-target textures through the 3D engine (the DMA copy engine sat idle at 7%), so at 240 fps they saturate it and contend with a game's own rendering.

Eliminate the second copy: NVENC now registers the capturer's D3D11 texture directly (cached by raw pointer, the cloned texture kept alive until unregister) and encode_pictures it IN PLACE: no encoder-owned input pool, no per-frame copy. Safe because the host encode loop is synchronous (capture -> submit -> poll, where lock_bitstream blocks until the encode finishes), so the capturer never overwrites the texture mid-encode; documented in the module header in case that ever changes.

2 GPU copies/frame -> 1 (the remaining dupl->gpu_copy is unavoidable; that DXGI surface is transient). Measured: SM/compute ~10-15% at ~217 fps 5K (was ~20% at only ~48 fps with two copies), 3687 frames decoded clean. Windows-only; Linux/macOS unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
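The "cached by raw pointer, the cloned texture kept alive until unregister" idea can be sketched without any GPU. This is a minimal illustration only, not the code from this commit: `Texture`, `RegHandle` and `RegCache` are hypothetical stand-ins, and `Rc` plays the role of the COM reference that a cloned `ID3D11Texture2D` would hold.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Stand-in for a GPU texture (hypothetical; the real code uses `ID3D11Texture2D`).
struct Texture {
    id: u32,
}

/// Stand-in for an encoder-side registration handle (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
struct RegHandle(u64);

/// Caches one registration per distinct texture, keyed by the texture's address.
struct RegCache {
    regs: HashMap<usize, (RegHandle, Rc<Texture>)>,
    registrations: u64, // how many times the (expensive) register step ran
}

impl RegCache {
    fn new() -> Self {
        Self { regs: HashMap::new(), registrations: 0 }
    }

    /// Returns the cached handle, registering the texture on first sight.
    fn handle_for(&mut self, tex: &Rc<Texture>) -> RegHandle {
        let key = Rc::as_ptr(tex) as usize;
        if let Some((handle, _)) = self.regs.get(&key) {
            return *handle;
        }
        self.registrations += 1;
        let handle = RegHandle(self.registrations);
        // Storing a clone keeps the texture alive, so its address cannot be
        // reused by a different texture while the entry exists.
        self.regs.insert(key, (handle, Rc::clone(tex)));
        handle
    }

    /// Drops every entry (real code would unregister each handle first).
    fn clear(&mut self) {
        self.regs.clear();
    }
}

fn main() {
    let mut cache = RegCache::new();
    let a = Rc::new(Texture { id: 1 });
    let b = Rc::new(Texture { id: 2 });

    let h1 = cache.handle_for(&a);
    let h2 = cache.handle_for(&a); // same texture: cache hit
    let h3 = cache.handle_for(&b); // new texture: registered once

    assert_eq!(h1, h2);
    assert_ne!(h1, h3);
    assert_eq!(cache.registrations, 2);
    assert_eq!(Rc::strong_count(&a), 2); // the cache holds one reference
    cache.clear();
    assert_eq!(Rc::strong_count(&a), 1); // released on teardown
    println!("registered {} textures (ids {} and {})", cache.registrations, a.id, b.id);
}
```

Holding the clone matters because a bare pointer key is only unique while the object it points to is alive. Without the extra reference, a freed texture's address could be handed to a new texture and the cache would return a stale registration.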
@@ -2,26 +2,25 @@
 //!
 //! Drives the raw NVENC API via `nvidia_video_codec_sdk::{sys, ENCODE_API}` (the safe `Encoder`
 //! wrapper is CUDA-only). Opens an encode session bound to the **same** `ID3D11Device` as the DXGI
-//! capturer (the device is carried on `FramePayload::D3d11`), registers a small pool of encoder-owned
-//! BGRA textures once, and per frame `CopyResource`s the captured texture into a pooled one and
-//! `encode_picture`s it. Mirrors the Linux NVENC config: CBR + ultra-low-latency, infinite GOP,
-//! P-frames only, forced-IDR for RFI, in-band SPS/PPS each keyframe.
+//! capturer (the device is carried on `FramePayload::D3d11`), and **encodes the capturer's texture in
+//! place** — it registers each input texture with NVENC once (cached by pointer) and `encode_picture`s
+//! it directly, with NO per-frame `CopyResource`. (That's safe because the host encode loop is
+//! synchronous — capture → submit → poll, where `poll`/`lock_bitstream` blocks until the encode
+//! finishes — so the capturer never overwrites the texture mid-encode; if that loop ever becomes
+//! pipelined, the capturer must hand a ring of textures.) Mirrors the Linux NVENC config: CBR +
+//! ultra-low-latency, infinite GOP, P-frames only, forced-IDR for RFI, in-band SPS/PPS each keyframe.
 //!
 //! Needs a real NVIDIA GPU at runtime (session creation fails otherwise) — compiles GPU-less, but
 //! `open`/`submit` only succeed on a GPU box. The software encoder (`super::sw`) is the fallback.
 
 use super::{Codec, EncodedFrame, Encoder};
 use crate::capture::{CapturedFrame, FramePayload, PixelFormat};
-use anyhow::{anyhow, bail, Context, Result};
-use std::collections::VecDeque;
+use anyhow::{anyhow, bail, Result};
+use std::collections::{HashMap, VecDeque};
 use std::ffi::c_void;
 use std::ptr;
 use windows::core::Interface;
-use windows::Win32::Graphics::Direct3D11::{
-    ID3D11Device, ID3D11DeviceContext, ID3D11Texture2D, D3D11_BIND_RENDER_TARGET,
-    D3D11_TEXTURE2D_DESC, D3D11_USAGE_DEFAULT,
-};
-use windows::Win32::Graphics::Dxgi::Common::{DXGI_FORMAT_B8G8R8A8_UNORM, DXGI_SAMPLE_DESC};
+use windows::Win32::Graphics::Direct3D11::{ID3D11Device, ID3D11Texture2D};
 
 use nvidia_video_codec_sdk::sys::nvEncodeAPI as nv;
 use nvidia_video_codec_sdk::ENCODE_API as API;
@@ -36,14 +35,7 @@ fn codec_guid(codec: Codec) -> nv::GUID {
     }
 }
 
-struct PooledTex {
-    tex: ID3D11Texture2D,
-    reg: nv::NV_ENC_REGISTERED_PTR,
-    map: nv::NV_ENC_INPUT_PTR,
-}
-
 pub struct NvencD3d11Encoder {
-    ctx: Option<ID3D11DeviceContext>,
     encoder: *mut c_void,
     codec_guid: nv::GUID,
     width: u32,
@@ -51,10 +43,14 @@ pub struct NvencD3d11Encoder {
     fps: u32,
     bitrate_bps: u64,
     buffer_fmt: nv::NV_ENC_BUFFER_FORMAT,
-    pool: Vec<PooledTex>,
+    /// Registrations of the capturer's input textures, cached by texture raw pointer — NVENC encodes
+    /// them in place (no per-frame copy). The cloned `ID3D11Texture2D` keeps each alive until we
+    /// unregister it (the capturer may drop its copy on a device recreate before our teardown runs).
+    regs: HashMap<isize, (nv::NV_ENC_REGISTERED_PTR, ID3D11Texture2D)>,
     next: usize,
     bitstreams: Vec<nv::NV_ENC_OUTPUT_PTR>,
-    pending: VecDeque<(nv::NV_ENC_OUTPUT_PTR, usize, u64)>,
+    /// (bitstream, mapped input resource to unmap after retrieval, pts_ns) per in-flight encode.
+    pending: VecDeque<(nv::NV_ENC_OUTPUT_PTR, nv::NV_ENC_INPUT_PTR, u64)>,
     frame_idx: i64,
     force_kf: bool,
     inited: bool,
@@ -77,7 +73,6 @@ impl NvencD3d11Encoder {
         bitrate_bps: u64,
     ) -> Result<Self> {
         Ok(Self {
-            ctx: None,
             encoder: ptr::null_mut(),
             codec_guid: codec_guid(codec),
             width,
@@ -85,7 +80,7 @@ impl NvencD3d11Encoder {
             fps,
             bitrate_bps,
             buffer_fmt: nv::NV_ENC_BUFFER_FORMAT::NV_ENC_BUFFER_FORMAT_ARGB,
-            pool: Vec::new(),
+            regs: HashMap::new(),
             next: 0,
             bitstreams: Vec::new(),
             pending: VecDeque::new(),
@@ -102,21 +97,23 @@ impl NvencD3d11Encoder {
         if self.encoder.is_null() {
             return;
         }
-        for p in &self.pool {
-            if !p.map.is_null() {
-                let _ = (API.unmap_input_resource)(self.encoder, p.map);
+        // Unmap any in-flight inputs, then unregister every cached texture and destroy the bitstreams.
+        for (_, map, _) in &self.pending {
+            if !map.is_null() {
+                let _ = (API.unmap_input_resource)(self.encoder, *map);
             }
-            let _ = (API.unregister_resource)(self.encoder, p.reg);
+        }
+        for (reg, _tex) in self.regs.values() {
+            let _ = (API.unregister_resource)(self.encoder, *reg);
         }
         for &bs in &self.bitstreams {
             let _ = (API.destroy_bitstream_buffer)(self.encoder, bs);
         }
         let _ = (API.destroy_encoder)(self.encoder);
-        self.pool.clear();
+        self.regs.clear(); // drops the texture clones, releasing our refs
         self.bitstreams.clear();
         self.pending.clear();
         self.encoder = ptr::null_mut();
-        self.ctx = None;
         self.inited = false;
         self.next = 0;
     }
@@ -124,12 +121,6 @@ impl NvencD3d11Encoder {
     /// Lazily create the session on the first frame's D3D11 device (so capture + encode share it).
     fn init_session(&mut self, device: &ID3D11Device) -> Result<()> {
         unsafe {
-            self.ctx = Some(
-                device
-                    .GetImmediateContext()
-                    .context("D3D11 immediate context")?,
-            );
-
             // Probe-and-step-down on the bitrate. NVENC rejects `initialize_encoder` with InvalidParam
             // when `averageBitRate` exceeds what the GPU's max codec level can express (e.g. a 1.6 Gbps
             // request on HEVC). Mirror the Linux host's strategy: try the requested rate, and on
@@ -275,48 +266,9 @@ impl NvencD3d11Encoder {
                 );
             }
 
-            // 5. encoder-owned BGRA texture pool, registered once, + one bitstream per slot.
-            let desc = D3D11_TEXTURE2D_DESC {
-                Width: self.width,
-                Height: self.height,
-                MipLevels: 1,
-                ArraySize: 1,
-                Format: DXGI_FORMAT_B8G8R8A8_UNORM,
-                SampleDesc: DXGI_SAMPLE_DESC {
-                    Count: 1,
-                    Quality: 0,
-                },
-                Usage: D3D11_USAGE_DEFAULT,
-                BindFlags: D3D11_BIND_RENDER_TARGET.0 as u32,
-                CPUAccessFlags: 0,
-                MiscFlags: 0,
-            };
+            // 5. one output bitstream per in-flight slot. There is NO encoder-owned input pool: the
+            // capturer's textures are registered on demand in `submit` and encoded in place.
             for _ in 0..POOL {
-                let mut tex: Option<ID3D11Texture2D> = None;
-                device
-                    .CreateTexture2D(&desc, None, Some(&mut tex))
-                    .context("CreateTexture2D(nvenc pool)")?;
-                let tex = tex.context("null pool texture")?;
-                let mut rr = nv::NV_ENC_REGISTER_RESOURCE {
-                    version: nv::NV_ENC_REGISTER_RESOURCE_VER,
-                    resourceType:
-                        nv::NV_ENC_INPUT_RESOURCE_TYPE::NV_ENC_INPUT_RESOURCE_TYPE_DIRECTX,
-                    width: self.width,
-                    height: self.height,
-                    pitch: 0,
-                    resourceToRegister: tex.as_raw(),
-                    bufferFormat: self.buffer_fmt,
-                    bufferUsage: nv::NV_ENC_BUFFER_USAGE::NV_ENC_INPUT_IMAGE,
-                    ..Default::default()
-                };
-                (API.register_resource)(enc, &mut rr)
-                    .result_without_string()
-                    .map_err(|e| anyhow!("register_resource: {e:?}"))?;
-                self.pool.push(PooledTex {
-                    tex,
-                    reg: rr.registeredResource,
-                    map: ptr::null_mut(),
-                });
                 let mut cb = nv::NV_ENC_CREATE_BITSTREAM_BUFFER {
                     version: nv::NV_ENC_CREATE_BITSTREAM_BUFFER_VER,
                     ..Default::default()
@@ -373,18 +325,38 @@ impl Encoder for NvencD3d11Encoder {
         let slot = self.next % POOL;
         self.next += 1;
         unsafe {
-            let ctx = self.ctx.as_ref().context("no D3D11 context")?;
-            ctx.CopyResource(&self.pool[slot].tex, &frame.texture);
+            // Register the capturer's texture with NVENC once (cached by raw pointer), then encode it
+            // IN PLACE — no `CopyResource` into an encoder-owned pool. This is the zero-copy win: the
+            // capturer already produced a stable GPU texture; we just register + map + encode it.
+            let key = frame.texture.as_raw() as isize;
+            if !self.regs.contains_key(&key) {
+                let mut rr = nv::NV_ENC_REGISTER_RESOURCE {
+                    version: nv::NV_ENC_REGISTER_RESOURCE_VER,
+                    resourceType: nv::NV_ENC_INPUT_RESOURCE_TYPE::NV_ENC_INPUT_RESOURCE_TYPE_DIRECTX,
+                    width: self.width,
+                    height: self.height,
+                    pitch: 0,
+                    resourceToRegister: frame.texture.as_raw(),
+                    bufferFormat: self.buffer_fmt,
+                    bufferUsage: nv::NV_ENC_BUFFER_USAGE::NV_ENC_INPUT_IMAGE,
+                    ..Default::default()
+                };
+                (API.register_resource)(self.encoder, &mut rr)
+                    .result_without_string()
+                    .map_err(|e| anyhow!("register_resource: {e:?}"))?;
+                self.regs
+                    .insert(key, (rr.registeredResource, frame.texture.clone()));
+            }
+            let reg = self.regs[&key].0;
 
             let mut mp = nv::NV_ENC_MAP_INPUT_RESOURCE {
                 version: nv::NV_ENC_MAP_INPUT_RESOURCE_VER,
-                registeredResource: self.pool[slot].reg,
+                registeredResource: reg,
                 ..Default::default()
             };
             (API.map_input_resource)(self.encoder, &mut mp)
                 .result_without_string()
                 .map_err(|e| anyhow!("map_input_resource: {e:?}"))?;
-            self.pool[slot].map = mp.mappedResource;
 
             let pts = self.frame_idx as u64;
             self.frame_idx += 1;
@@ -411,7 +383,7 @@ impl Encoder for NvencD3d11Encoder {
                 .result_without_string()
                 .map_err(|e| anyhow!("encode_picture: {e:?}"))?;
             self.pending
-                .push_back((self.bitstreams[slot], slot, captured.pts_ns));
+                .push_back((self.bitstreams[slot], mp.mappedResource, captured.pts_ns));
         }
         Ok(())
     }
@@ -421,7 +393,7 @@ impl Encoder for NvencD3d11Encoder {
     }
 
     fn poll(&mut self) -> Result<Option<EncodedFrame>> {
-        let Some((bs, slot, pts_ns)) = self.pending.pop_front() else {
+        let Some((bs, map, pts_ns)) = self.pending.pop_front() else {
             return Ok(None);
         };
         unsafe {
@@ -445,9 +417,8 @@ impl Encoder for NvencD3d11Encoder {
             (API.unlock_bitstream)(self.encoder, bs)
                 .result_without_string()
                 .map_err(|e| anyhow!("unlock_bitstream: {e:?}"))?;
-            if !self.pool[slot].map.is_null() {
-                let _ = (API.unmap_input_resource)(self.encoder, self.pool[slot].map);
-                self.pool[slot].map = ptr::null_mut();
+            if !map.is_null() {
+                let _ = (API.unmap_input_resource)(self.encoder, map);
             }
             Ok(Some(EncodedFrame {
                 data,