fix(net): don't tear the stream down on a connected-UDP ICMP blip (ECONNREFUSED)
ci / web (push) Successful in 25s
ci / docs-site (push) Successful in 30s
apple / swift (push) Successful in 1m15s
ci / rust (push) Successful in 2m7s
ci / bench (push) Successful in 1m36s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 5s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 4s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
deb / build-publish (push) Successful in 2m17s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 4m50s
docker / deploy-docs (push) Successful in 17s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 4m22s
ci / web (push) Successful in 25s
ci / docs-site (push) Successful in 30s
apple / swift (push) Successful in 1m15s
ci / rust (push) Successful in 2m7s
ci / bench (push) Successful in 1m36s
docker / build-push (--build-arg FEDORA_VERSION=44, ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora44-rpm) (push) Successful in 5s
docker / build-push (., web/Dockerfile, punktfunk-web) (push) Successful in 5s
docker / build-push (ci, ci/fedora-rpm.Dockerfile, punktfunk-fedora-rpm) (push) Successful in 4s
docker / build-push (ci, ci/rust-ci.Dockerfile, punktfunk-rust-ci) (push) Successful in 4s
docker / build-push (docs-site, docs-site/Dockerfile, punktfunk-docs) (push) Successful in 4s
deb / build-publish (push) Successful in 2m17s
rpm / build-publish (bazzite, punktfunk-fedora-rpm) (push) Successful in 4m50s
docker / deploy-docs (push) Successful in 17s
rpm / build-publish (fedora-44, punktfunk-fedora44-rpm) (push) Successful in 4m22s
Root cause of the Mac "session ended" at higher bitrates. The video data plane is
a *connected* UDP socket; with data-plane hole-punching the path can blip and the
kernel surfaces an asynchronous ICMP port-unreachable/reset as ECONNREFUSED /
ECONNRESET on a later send or recv. Both the host send loop and the client
poll_frame treated that as fatal and tore the session down:
ERROR punktfunk_host::m3: send failed — stopping stream
error=send_sealed: Io(ConnectionRefused, code 111) <-- observed live
That also cascades: a transient ICMP makes the client's poll_frame bail and close
its data socket, which makes the host's next send get a *real* ECONNREFUSED, which
tears the host side down too — exactly the "broke at 500 Mbps+" report.
Fix: classify ECONNREFUSED/ECONNRESET alongside WouldBlock as transient (a lossy
drop / "no data this poll"), never a teardown, at every data-path send/recv site
(send, send_batch, send_gso, recv, recv_batch x2, recv_batch_x). FEC + the next
frame/RFI recover; if the peer is genuinely gone the QUIC control plane's
conn.closed() ends the session cleanly (no infinite "stream into the void").
This is the standard connected-UDP rule that ICMP errors are advisory — doubly
true with hole-punching. Adds is_transient_io() + a unit test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -16,6 +16,21 @@ use std::net::UdpSocket;
|
||||
/// silently truncating it.
|
||||
const RECV_BUF: usize = MAX_DATAGRAM_BYTES + 1;
|
||||
|
||||
/// True for transient socket conditions that must be a lossy drop / "no data this poll" — NOT a
|
||||
/// stream teardown. Two cases:
|
||||
/// - `WouldBlock`: the kernel send/recv buffer is momentarily full (a frame burst saturated the tx
|
||||
/// queue — the dominant condition at 1 Gbps+). Drop the packet; FEC + the next frame recover.
|
||||
/// - `ConnectionRefused` / `ConnectionReset`: a *connected* UDP socket received an asynchronous ICMP
|
||||
/// port-unreachable / reset for an *earlier* datagram. With data-plane hole-punching the path
|
||||
/// blips — the peer's data socket briefly gone, a NAT rebind, or a stale ICMP from punch setup —
|
||||
/// so erroring out here kills a stream that the very next packet would resume. If the peer is
|
||||
/// genuinely gone, the QUIC control plane times out and ends the session cleanly instead. (This is
|
||||
/// the classic connected-UDP "ICMP errors are advisory" rule, doubly true with hole-punching.)
|
||||
fn is_transient_io(e: &std::io::Error) -> bool {
|
||||
use std::io::ErrorKind::{ConnectionRefused, ConnectionReset, WouldBlock};
|
||||
matches!(e.kind(), WouldBlock | ConnectionRefused | ConnectionReset)
|
||||
}
|
||||
|
||||
/// Build one `mmsghdr` per `iovec` (each a single-buffer message) for `sendmmsg`/`recvmmsg`. Shared
|
||||
/// by `send_batch` + `recv_batch` so the raw-pointer scaffolding lives in exactly one place.
|
||||
///
|
||||
@@ -347,7 +362,7 @@ impl UdpTransport {
|
||||
};
|
||||
if n < 0 {
|
||||
let err = std::io::Error::last_os_error();
|
||||
if err.kind() == std::io::ErrorKind::WouldBlock {
|
||||
if is_transient_io(&err) {
|
||||
return Ok(0);
|
||||
}
|
||||
return Err(err);
|
||||
@@ -371,7 +386,7 @@ impl Transport for UdpTransport {
|
||||
// latency, and erroring tears the session down. `Ok(false)` surfaces the drop so the
|
||||
// session counts it (`packets_send_dropped`) instead of it being invisible. Mirrors
|
||||
// the `recv` WouldBlock handling above.
|
||||
Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => Ok(false),
|
||||
Err(e) if is_transient_io(&e) => Ok(false),
|
||||
Err(e) => Err(e),
|
||||
}
|
||||
}
|
||||
@@ -401,9 +416,10 @@ impl Transport for UdpTransport {
|
||||
let n = unsafe { libc::sendmmsg(fd, hdrs.as_mut_ptr(), hdrs.len() as libc::c_uint, 0) };
|
||||
if n < 0 {
|
||||
let err = std::io::Error::last_os_error();
|
||||
// Nothing fit in the send buffer — drop this + the remaining chunks (counted by
|
||||
// the caller). A real error (not WouldBlock) still tears the session down.
|
||||
if err.kind() == std::io::ErrorKind::WouldBlock {
|
||||
// Nothing fit in the send buffer (or a stale ICMP from a connected-socket blip) —
|
||||
// drop this + the remaining chunks (counted by the caller). Only a genuine error
|
||||
// tears the session down; transient conditions are lossy drops (see is_transient_io).
|
||||
if is_transient_io(&err) {
|
||||
break;
|
||||
}
|
||||
return Err(err);
|
||||
@@ -449,8 +465,9 @@ impl Transport for UdpTransport {
|
||||
}
|
||||
match send_one_gso(fd, &scratch, seg as u16) {
|
||||
Ok(()) => sent += chunk.len(),
|
||||
// Send buffer momentarily full — drop the rest (counted by the caller), never block.
|
||||
Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => break,
|
||||
// Send buffer momentarily full, or a stale ICMP from a connected-socket blip — drop
|
||||
// the rest (counted by the caller), never block, never tear down (see is_transient_io).
|
||||
Err(e) if is_transient_io(&e) => break,
|
||||
// GSO unsupported on this kernel/path — latch off and finish via sendmmsg.
|
||||
Err(e) if gso_unsupported(&e) => {
|
||||
gso::disable();
|
||||
@@ -472,7 +489,7 @@ impl Transport for UdpTransport {
|
||||
buf.truncate(n);
|
||||
Ok(Some(buf))
|
||||
}
|
||||
Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => Ok(None),
|
||||
Err(e) if is_transient_io(&e) => Ok(None),
|
||||
Err(e) => Err(e),
|
||||
}
|
||||
}
|
||||
@@ -511,7 +528,7 @@ impl Transport for UdpTransport {
|
||||
};
|
||||
if n < 0 {
|
||||
let err = std::io::Error::last_os_error();
|
||||
if err.kind() == std::io::ErrorKind::WouldBlock {
|
||||
if is_transient_io(&err) {
|
||||
return Ok(0);
|
||||
}
|
||||
return Err(err);
|
||||
@@ -557,8 +574,8 @@ impl Transport for UdpTransport {
|
||||
};
|
||||
if r < 0 {
|
||||
let err = std::io::Error::last_os_error();
|
||||
if err.kind() == std::io::ErrorKind::WouldBlock {
|
||||
break; // socket drained
|
||||
if is_transient_io(&err) {
|
||||
break; // socket drained, or a stale connected-socket ICMP — no data this poll
|
||||
}
|
||||
if got > 0 {
|
||||
break; // report what we have; surface the error on the next empty poll
|
||||
@@ -577,6 +594,26 @@ mod tests {
|
||||
use super::*;
|
||||
use crate::transport::Transport;
|
||||
|
||||
/// A connected UDP socket's stale ICMP (ECONNREFUSED/RESET) and a full buffer (EAGAIN) must all
|
||||
/// be classified transient — a lossy drop, never a stream teardown. A real error must not be.
|
||||
#[test]
|
||||
fn transient_io_covers_connected_udp_blips() {
|
||||
use std::io::{Error, ErrorKind};
|
||||
for k in [
|
||||
ErrorKind::WouldBlock,
|
||||
ErrorKind::ConnectionRefused,
|
||||
ErrorKind::ConnectionReset,
|
||||
] {
|
||||
assert!(
|
||||
is_transient_io(&Error::from(k)),
|
||||
"{k:?} should be transient"
|
||||
);
|
||||
}
|
||||
for k in [ErrorKind::PermissionDenied, ErrorKind::AddrInUse] {
|
||||
assert!(!is_transient_io(&Error::from(k)), "{k:?} must stay fatal");
|
||||
}
|
||||
}
|
||||
|
||||
/// `send_one_gso` must split one buffer into N separate UDP datagrams of `gso_size` bytes each
|
||||
/// (the kernel UDP GSO segmentation) — the multi-Gbps send lever. Loopback supports GSO; if the
|
||||
/// CI kernel doesn't, skip rather than fail.
|
||||
|
||||
Reference in New Issue
Block a user