feat(net/mac): default-on recvmsg_x batched Mac recv + GSO host + longer probe
ci / web (push) Successful in 27s
ci / docs-site (push) Successful in 31s
ci / rust (push) Successful in 2m6s
ci / bench (push) Successful in 1m35s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 6s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 4s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 3s
apple / swift (push) Successful in 1m17s
docker / deploy-docs (push) Successful in 17s
deb / build-publish (push) Successful in 2m18s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 4m50s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 4m27s
The Mac/iOS client's wall around ~380 Mbps on a 2.5 G path is the receive drain, not the transport: a loopback speed-test pushes 380/600/1000 Mbps at 0.0% loss, but Darwin has no recvmmsg(2), so the macOS client was doing one recv() syscall per packet (~40-90k syscalls/s on one core). When the recv loop can't drain fast enough the kernel socket buffer backs up and drops, which the client sees as a sustained stream stalling/freezing in the 300-400 Mbps range (and an immediate "session ended" when a 500 Mbps+ first keyframe bursts in).

- core/transport: flip recvmsg_x (the batched Darwin recv, ~30x fewer syscalls) from opt-in to default ON, opt-out via PUNKTFUNK_RECVMSG_X=0. Keeps the auto-fallback to the scalar loop on any unexpected syscall error. The Apple CI swift-test loopback now exercises this path by default.
- packaging/kde host.env: enable PUNKTFUNK_GSO=1 (UDP segmentation offload on the host send path, one sendmsg per ~64 packets), the dominant lever above ~1 Gbps. Already wired (send_sealed -> send_gso) with sendmmsg auto-fallback.
- apple SpeedTestSheet: lengthen the bandwidth probe 2 s -> 5 s so the measured number stops swinging wildly (50 vs 900 Mbps on the same link); 5 s is long enough for steady-state send + recv drain to settle. Matches host MAX_PROBE_MS.
- host capture: PUNKTFUNK_SYNTH_NOISE synthetic high-entropy source for reproducible throughput testing of the encode->FEC->send->recv path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
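The syscall figures in the message can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the 1200-byte datagram payload is an assumption (the commit does not state the packet size), and both function names are hypothetical.

```rust
/// Packets per second needed to carry `mbps` of stream data when every
/// datagram holds `payload_bytes` (an assumed size, not taken from the commit).
fn packets_per_sec(mbps: f64, payload_bytes: f64) -> f64 {
    mbps * 1_000_000.0 / 8.0 / payload_bytes
}

/// Receive syscalls per second when one call drains `batch` packets.
/// `batch = 1.0` models the scalar recv() loop; a larger value models a
/// batched receive such as recvmsg_x.
fn syscalls_per_sec(pps: f64, batch: f64) -> f64 {
    pps / batch
}
```

Under that assumption, `packets_per_sec(380.0, 1200.0)` is about 39,600, which lines up with the low end of the quoted 40-90k syscalls/s for a one-recv-per-packet loop, and `syscalls_per_sec(39_583.0, 30.0)` is about 1,300, the order of magnitude a ~30x reduction would give.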
@@ -23,10 +23,12 @@ private final class ProbeToken: @unchecked Sendable {
 /// What the host is asked to burst: the host's full probe ceiling (it clamps to ≤ 3 Gbps),
 /// so the measurement surfaces the link's real ceiling instead of an artificial cap —
 /// bursting ABOVE what the link can carry is how the probe finds where delivery falls off.
-/// Two seconds rides out scheduler jitter. File-scope so the detached probe task reads them
-/// without crossing into the view's main actor.
+/// Five seconds (was 2 s) averages out the scheduler/recv jitter that made a short probe swing
+/// wildly (50 vs 900 Mbps on the same link) — long enough for the host's steady-state send and
+/// the client's recv drain to settle. File-scope so the detached probe task reads them without
+/// crossing into the view's main actor.
 private let probeTargetKbps: UInt32 = 3_000_000
-private let probeDurationMs: UInt32 = 2_000
+private let probeDurationMs: UInt32 = 5_000
 
 struct SpeedTestSheet: View {
     @Environment(\.dismiss) private var dismiss
@@ -108,10 +108,14 @@ fn send_one_gso(fd: libc::c_int, buf: &[u8], gso_size: u16) -> std::io::Result<(
     Ok(())
 }
 
-/// Apple (macOS/iOS) batched-receive enable state. Darwin has no `recvmmsg(2)`, so our macOS client
-/// does one `recv` per packet (non-allocating, but a syscall each); `recvmsg_x(2)` is the batched
-/// equivalent. Opt-in via `PUNKTFUNK_RECVMSG_X` (it's FFI we can't exercise off-Apple — the scalar
-/// recv-loop is the tested default), with auto-fallback if the syscall ever errors unexpectedly.
+/// Apple (macOS/iOS) batched-receive enable state. Darwin has no `recvmmsg(2)`, so without this our
+/// macOS client does one `recv` syscall per packet — at a few hundred Mbps that's ~40-90k syscalls/s
+/// on one core, and when the recv loop can't drain fast enough the kernel socket buffer backs up and
+/// drops, which the client sees as a sustained stream stalling/freezing around 300-400 Mbps.
+/// `recvmsg_x(2)` is the batched equivalent (the recv counterpart of Linux `recvmmsg`), cutting the
+/// syscall rate ~30x. **Default ON** (the multi-Gbps Mac path); the `swift test` loopback on the
+/// Apple CI runner exercises it, and it auto-falls-back to the scalar loop if the syscall ever errors
+/// unexpectedly. Set `PUNKTFUNK_RECVMSG_X=0` to force the scalar fallback.
 #[cfg(target_vendor = "apple")]
 mod recvx {
     use std::sync::atomic::{AtomicU8, Ordering};
@@ -122,7 +126,10 @@ mod recvx {
             1 => true,
             2 => false,
             _ => {
-                let on = std::env::var_os("PUNKTFUNK_RECVMSG_X").is_some();
+                // On unless explicitly disabled with PUNKTFUNK_RECVMSG_X=0.
+                let on = std::env::var("PUNKTFUNK_RECVMSG_X")
+                    .map(|v| v != "0")
+                    .unwrap_or(true);
                 STATE.store(if on { 1 } else { 2 }, Ordering::Relaxed);
                 on
             }
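For review purposes, the default-on check in the hunk above can be isolated into a small standalone sketch. Only `STATE`, its 0/1/2 encoding, and the env-var expression come from the diff; the function names `default_on`, `enabled`, and `disable` are hypothetical, and `disable` is just one guess at how the auto-fallback might pin the state after an unexpected syscall error.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// 0 = undecided, 1 = enabled, 2 = disabled (same encoding as the diff).
static STATE: AtomicU8 = AtomicU8::new(0);

/// Pure decision rule: on unless the value is exactly "0".
/// `None` models an unset (or non-Unicode) variable, which counts as on.
fn default_on(value: Option<&str>) -> bool {
    value.map(|v| v != "0").unwrap_or(true)
}

/// Reads the environment on first use, then serves the cached answer.
fn enabled() -> bool {
    match STATE.load(Ordering::Relaxed) {
        1 => true,
        2 => false,
        _ => {
            let on = default_on(std::env::var("PUNKTFUNK_RECVMSG_X").ok().as_deref());
            STATE.store(if on { 1 } else { 2 }, Ordering::Relaxed);
            on
        }
    }
}

/// Hypothetical fallback hook: pin the state to "disabled" so every later
/// call to `enabled()` returns false and the caller takes the scalar loop.
fn disable() {
    STATE.store(2, Ordering::Relaxed);
}
```

One consequence of comparing against the literal `"0"`: values such as `false`, `off`, or an empty string leave the batched path on. That matches the documented opt-out (`PUNKTFUNK_RECVMSG_X=0`), but may be worth stating wherever the variable is documented.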
@@ -165,6 +165,12 @@ pub struct FastSyntheticCapturer {
     height: u32,
     frame_idx: u64,
     buf: Vec<u8>,
+    /// PUNKTFUNK_SYNTH_NOISE: every frame is fresh high-entropy noise NVENC can't compress or
+    /// predict, so the encoder hits its (CBR) bitrate target — a throughput test of the real
+    /// encode→FEC→send→recv path. The default flat/band content compresses to ~nothing, so it
+    /// can't generate real Mbps (the encoder is content-driven). xorshift over u64 chunks.
+    noise: bool,
+    rng: u64,
 }
 
 impl FastSyntheticCapturer {
@@ -175,20 +181,38 @@ impl FastSyntheticCapturer {
             height,
             frame_idx: 0,
             buf: vec![0u8; width as usize * height as usize * 4],
+            noise: std::env::var_os("PUNKTFUNK_SYNTH_NOISE").is_some(),
+            rng: 0x9e3779b97f4a7c15,
         }
     }
 }
 
 impl Capturer for FastSyntheticCapturer {
     fn next_frame(&mut self) -> Result<CapturedFrame> {
-        let (w, h) = (self.width as usize, self.height as usize);
-        let row = w * 4;
-        let shade = (self.frame_idx % 256) as u8;
-        self.buf.fill(shade);
-        let band_h = (h / 20).max(1);
-        let band_y = (self.frame_idx as usize * 6) % h;
-        for y in band_y..(band_y + band_h).min(h) {
-            self.buf[y * row..(y + 1) * row].fill(0xff);
+        if self.noise {
+            // Fresh, every-frame-decorrelated noise: reseed from the frame index so consecutive
+            // frames share no structure (forces large P-frames too, not just the keyframe).
+            let mut s = self
+                .rng
+                .wrapping_add(self.frame_idx.wrapping_mul(0x2545F491_4F6CDD1D))
+                | 1;
+            for c in self.buf.chunks_exact_mut(8) {
+                s ^= s << 13;
+                s ^= s >> 7;
+                s ^= s << 17;
+                c.copy_from_slice(&s.to_le_bytes());
+            }
+            self.rng = s;
+        } else {
+            let (w, h) = (self.width as usize, self.height as usize);
+            let row = w * 4;
+            let shade = (self.frame_idx % 256) as u8;
+            self.buf.fill(shade);
+            let band_h = (h / 20).max(1);
+            let band_y = (self.frame_idx as usize * 6) % h;
+            for y in band_y..(band_y + band_h).min(h) {
+                self.buf[y * row..(y + 1) * row].fill(0xff);
+            }
         }
         self.frame_idx += 1;
         Ok(CapturedFrame {
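The noise branch above can be tried outside the capturer with a free function. This is a minimal sketch: the shifts, the multiplier, and the `| 1` come from the diff, while the name `fill_noise` and the signature (state in, state out) are assumptions made so the snippet stands alone.

```rust
/// Fills `buf` with xorshift output, 8 bytes at a time (little-endian), and
/// returns the final generator state so the caller can carry it to the next frame.
///
/// The seed mixes the previous state with the frame index and is forced odd,
/// so it is never zero (a zero xorshift state would stay zero).
fn fill_noise(buf: &mut [u8], rng: u64, frame_idx: u64) -> u64 {
    let mut s = rng.wrapping_add(frame_idx.wrapping_mul(0x2545F491_4F6CDD1D)) | 1;
    for c in buf.chunks_exact_mut(8) {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        c.copy_from_slice(&s.to_le_bytes());
    }
    s
}
```

A call such as `let next = fill_noise(&mut frame, 0x9e3779b97f4a7c15, 0);` mirrors the first frame. Note that `chunks_exact_mut(8)` skips a trailing remainder shorter than 8 bytes: the buffer is `width * height * 4` bytes, so with an odd pixel count the last 4 bytes keep their previous contents. That is harmless for a throughput test, but it is a property of this loop rather than something the commit calls out.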
@@ -10,6 +10,12 @@ PUNKTFUNK_COMPOSITOR=kwin
 PUNKTFUNK_VIDEO_SOURCE=virtual
 PUNKTFUNK_ZEROCOPY=1
 PUNKTFUNK_INPUT_BACKEND=libei
+# UDP Generic Segmentation Offload on the send path: coalesce a frame's equal-size packets into
+# kernel super-buffers (one sendmsg per ~64 packets instead of one per packet) — the dominant
+# lever above ~1 Gbps, where per-packet send syscalls/pps become the host bottleneck. Safe: it
+# auto-falls back to sendmmsg on any kernel/path that rejects UDP_SEGMENT. Set PUNKTFUNK_GSO=0 to
+# force it off if a NIC/middlebox mishandles GSO segments.
+PUNKTFUNK_GSO=1
 # Make the per-session streamed output the SOLE desktop, so plasmashell + windows render on it
 # rather than on the headless session's `kwin --virtual` bootstrap output (without this the client
 # sees only the wallpaper of an empty extended output). KWin re-homes the desktop; the bootstrap is