perf(core): UDP GSO send path (the multi-Gbps lever)

sendmmsg already batches syscalls but still builds one sk_buff per datagram —
the kernel-side wall above ~1 Gbps. UDP Generic Segmentation Offload hands the
kernel one big buffer it splits into gso_size datagrams, building ~1 GSO skb per
≤64 segments. Research (LWN/Cloudflare/Tailscale) measures ~2.4x throughput at
equal CPU and 17-44x fewer syscalls, and that sendmmsg batching alone is
insufficient — you need true segmentation offload.

Adds Transport::send_gso (default = send_batch) + a UdpTransport Linux override:
coalesces a frame's equal-size wire packets (shards are zero-padded to a constant
size, so a whole frame is one gso_size) into ≤64-segment sendmsg(UDP_SEGMENT)
calls. seal/send routes through it. Opt-in via PUNKTFUNK_GSO (new unsafe hot-path
code) with automatic fallback to sendmmsg on any GSO error (unsupported kernel/
path), latched per process. Loopback unit test validates the cmsg segmentation;
full session over loopback streams clean (0% loss). Linux-only; loopback/non-Linux
keep sendmmsg/scalar.
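
A minimal sketch of the coalescing and fallback logic described above, using only the standard library. It is not the code from this commit: `plan_gso`, `GsoBatch`, `send_latched` and `MAX_GSO_SEGMENTS` are hypothetical names, and the actual `sendmsg(UDP_SEGMENT)` call is left out. The sketch enforces only the 64-segment cap stated above; any per-call byte limit the kernel applies is out of scope here.

```rust
use std::io;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical cap mirroring the "≤64 segments" figure in the commit message.
const MAX_GSO_SEGMENTS: usize = 64;

/// One planned GSO send: a contiguous buffer plus the segment size the kernel
/// would use to split it back into datagrams.
#[derive(Debug, PartialEq)]
pub struct GsoBatch {
    pub gso_size: u16,
    pub segments: usize,
    pub buf: Vec<u8>,
}

/// Coalesce equal-size packets into batches of at most MAX_GSO_SEGMENTS segments.
/// Returns None when this sketch considers GSO inapplicable (empty input, unequal
/// sizes, or a size that does not fit a u16), so the caller can fall back.
pub fn plan_gso(packets: &[&[u8]]) -> Option<Vec<GsoBatch>> {
    let seg = packets.first()?.len();
    if seg == 0 || seg > u16::MAX as usize || packets.iter().any(|p| p.len() != seg) {
        return None;
    }
    Some(
        packets
            .chunks(MAX_GSO_SEGMENTS)
            .map(|chunk| GsoBatch {
                gso_size: seg as u16,
                segments: chunk.len(),
                buf: chunk.concat(), // packets laid end to end in one buffer
            })
            .collect(),
    )
}

/// Try the GSO path unless it has already failed once; on the first error, set the
/// latch and use the fallback for this and every later call.
pub fn send_latched(
    gso_failed: &AtomicBool,
    try_gso: impl FnOnce() -> io::Result<usize>,
    fallback: impl FnOnce() -> io::Result<usize>,
) -> io::Result<usize> {
    if !gso_failed.load(Ordering::Relaxed) {
        match try_gso() {
            Ok(n) => return Ok(n),
            Err(_) => gso_failed.store(true, Ordering::Relaxed),
        }
    }
    fallback()
}
```

Under these assumptions, a frame of 130 equal-size packets would be planned as three batches (64, 64 and 2 segments), and a process-wide `static AtomicBool` could serve as the latch.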

Next levers: in-place AES-GCM seal (kill per-packet allocs), UDP GRO on recv,
drop the sleep-pacing in favor of the kernel qdisc, jumbo MTU.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 23:29:51 +00:00
parent 4b1bbfdf0e
commit 448986f41c
3 changed files with 174 additions and 1 deletions
@@ -33,6 +33,18 @@ pub trait Transport: Send + Sync {
        Ok(sent)
    }
    /// Send a frame's equal-size packets using UDP Generic Segmentation Offload where available:
    /// one `sendmsg` hands the kernel a big buffer it splits into `gso_size` UDP datagrams, building
    /// ~1 GSO skb per ≤64 segments instead of one skb per packet. This is the multi-Gbps lever —
    /// research shows ~2.4× throughput at equal CPU and ~40× fewer syscalls, and that `sendmmsg`
    /// batching alone is insufficient (it still builds one skb per datagram). The
    /// [`UdpTransport`](super::UdpTransport) Linux override implements it (opt-in via `PUNKTFUNK_GSO`,
    /// auto-fallback on any GSO error); the default just delegates to [`send_batch`](Self::send_batch),
    /// correct for loopback and non-Linux. Same lossy, FEC-protected short-count contract as `send_batch`.
    fn send_gso(&self, packets: &[&[u8]]) -> std::io::Result<usize> {
        self.send_batch(packets)
    }
    fn recv(&self) -> std::io::Result<Option<Vec<u8>>>;
    /// Receive up to `out.len()` datagrams in as few syscalls as possible, writing each into its