WebRTC is the Problem
OpenAI’s WebRTC Problem

OpenAI posted a technical blog a few days ago. It triggered me more than it should have, and the urge to slap my meaty fingers on the keyboard was too strong to resist. You should NOT copy OpenAI. I don’t think you should use WebRTC for voice AI. WebRTC is the problem.
Me

Like 6 years ago I wrote a WebRTC SFU at Twitch. Originally we used Pion (Go), just like OpenAI, but forked it after benchmarking revealed that it was too slow. I ended up rewriting every protocol, because of course I did! Just a year ago, I was at Discord and rewrote the WebRTC SFU in Rust. Because of course I did! You’re probably noticing a trend.
Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s. And some de-facto standards that are technically drafts (ex. TWCC, REMB). Not a fun fact when you have to implement them all. You should consider me a Certified WebRTC Expert. Which is why I never, never want to use WebRTC again.
Product Fit

I’m going to cheat a little bit and start with the hot takes before they get cold. Don’t worry, we’ll get right back to talking about the OpenAI blog post and load balancing, I promise. WebRTC is a poor fit for Voice AI. But that seems counter-intuitive? WebRTC is for conferencing, and that involves speaking? And robots can speak, right?
WebRTC is too aggressive

Let’s say I pull up my OpenAI app on my phone. I say hi to Scarlett Johansson Sky and then I utter: should I walk or drive to the car wash? WebRTC is designed to degrade and drop my prompt during poor network conditions. wtf my dude. WebRTC aggressively drops audio packets to keep latency low. If you’ve ever heard distorted audio on a conference call, that’s WebRTC baybee.
The idea is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable. …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate. After all, I’m paying good money to boil the ocean, and a garbage prompt means a garbage response. It’s not like LLMs are particularly responsive anyway.
But I’m not allowed to wait. It’s impossible to even retransmit a WebRTC audio packet within a browser; we tried at Discord. The implementation is hard-coded for real-time latency or else. UPDATE: Some WebRTC folks are claiming this is a skill issue. It might be possible to enable audio NACKs, but we couldn’t figure out the correct SDP munging. Either way, the WebRTC jitter buffer is aggressively small. And yes, Voice AI agents will eventually get the latency down to the conversational range. But reducing latency has trade-offs. I’m not even sure that purposely degrading audio prompts will ever be worth it.
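For the record, here’s roughly the closest we got on the server side. A minimal sketch, assuming pion/webrtc v3: it registers NACK feedback for audio instead of just video, but the browser still has to honor the resulting a=rtcp-fb line in the SDP, which is the munging we never cracked.

```go
package sketch

import (
	"github.com/pion/interceptor"
	"github.com/pion/webrtc/v3"
)

// Sketch: ask Pion to negotiate NACK (retransmission) for audio,
// not just video. Assumes pion/webrtc v3. The browser must still
// accept the "a=rtcp-fb:... nack" line on the audio m-section,
// which is where we got stuck.
func newNackPeerConnection() (*webrtc.PeerConnection, error) {
	m := &webrtc.MediaEngine{}
	if err := m.RegisterDefaultCodecs(); err != nil {
		return nil, err
	}

	// Advertise NACK feedback for audio; Pion only does video by default.
	m.RegisterFeedback(webrtc.RTCPFeedback{Type: webrtc.TypeRTCPFBNACK}, webrtc.RTPCodecTypeAudio)

	// Install the NACK generator/responder interceptors.
	i := &interceptor.Registry{}
	if err := webrtc.ConfigureNack(m, i); err != nil {
		return nil, err
	}

	api := webrtc.NewAPI(webrtc.WithMediaEngine(m), webrtc.WithInterceptorRegistry(i))
	return api.NewPeerConnection(webrtc.Configuration{})
}
```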
TTS is faster than real-time

You speak into the microphone, it gets sent to one of OpenAI’s billion servers, and then a GPU pretends to talk to you via text-to-speech. Neato. Let’s say it takes 2s of GPU time to generate 8s of audio. In an ideal world, we would stream the audio as it’s being generated (over 2s) and the client would start playing it back (over 8s). That way, if there’s a network blip, some audio is buffered locally. The user might not even notice the network blip.
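Over a protocol that actually buffers, the ideal behavior is embarrassingly simple. A sketch, where ttsOutput stands in for a hypothetical streaming TTS reader:

```go
package sketch

import (
	"io"
	"net"
)

// Stream TTS output as fast as it's generated and let the client
// buffer. ttsOutput is a hypothetical io.Reader that produces 8s
// of audio in ~2s of wall clock; over TCP or QUIC we just forward
// bytes, and the client's growing buffer absorbs network blips.
func streamGenerated(conn net.Conn, ttsOutput io.Reader) error {
	// No pacing, no sleeps: the ~6s head start IS the buffer.
	_, err := io.Copy(conn, ttsOutput)
	return err
}
```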
But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture. To compensate for this, OpenAI has to make sure packets arrive exactly when they should be rendered. They need to add a sleep in front of every audio packet before sending it. But if there’s network congestion, oops we lost that audio packet and it’ll never be retransmitted. OpenAI is literally introducing artificial latency, and then aggressively dropping packets to “keep latency low”. It’s the equivalent of screen sharing a YouTube video instead of buffering it. The quality will be degraded.
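Here’s a sketch of that pacing loop, assuming 20ms Opus frames and a hypothetical frames channel. Compare it to the copy loop above:

```go
package sketch

import "time"

// The pacing WebRTC forces on a faster-than-real-time source:
// hold every packet until its wall-clock render time. frames is
// a hypothetical channel of pre-encoded 20ms Opus frames, and
// send writes one RTP packet. Any frame that hits congestion is
// simply lost; there's no retransmission to fall back on.
func pacedSend(frames <-chan []byte, send func([]byte) error) error {
	const frameDuration = 20 * time.Millisecond

	ticker := time.NewTicker(frameDuration)
	defer ticker.Stop()

	for frame := range frames {
		<-ticker.C // the artificial latency: sleep before every packet
		if err := send(frame); err != nil {
			return err
		}
	}
	return nil
}
```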
Ports Ports Ports

Okay, but let’s talk about the technical meat of the OpenAI article. We’re no longer on a boat, but let’s talk about ports. When you host a TCP server, you open a port (ex. 443 for HTTPS) and listen for incoming connections. The TCP client randomly selects an ephemeral port to use, and the connection is identified by the source/destination IPs and ports. For example, a connection might be identified as 123.45.67.89:54321 -> 192.168.1.2:443. But there’s a minor problem… client addresses can change. When your phone switches from WiFi to cellular, oops, your IP changes. NATs can also arbitrarily change your source IP/port, because of course they can. Whenever this happens, bye bye connection, it’s time to dial a new one. And that means an expensive TCP + TLS handshake, which takes at least 2-3 RTTs. Users definitely notice the network hiccup when you’re live streaming.
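If you want to see that 4-tuple for yourself, it’s a one-liner:

```go
package main

import (
	"fmt"
	"net"
)

// Print the 4-tuple that identifies a TCP connection. The local
// (ephemeral) half is picked by the OS; if it changes, so does
// the connection's identity, and you're dialing from scratch.
func main() {
	conn, err := net.Dial("tcp", "example.com:443")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// e.g. 192.168.1.2:54321 -> 93.184.216.34:443
	fmt.Printf("%s -> %s\n", conn.LocalAddr(), conn.RemoteAddr())
}
```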
WebRTC tried to solve this issue but made things worse. Seriously. A WebRTC implementation is supposed to allocate an ephemeral port for each connection. That way, a WebRTC session can be identified by the destination IP/port only; the source is irrelevant. If the source IP/port changes, oh hey, that’s still Bob, because the destination port is the same. But as OpenAI corroborates, this causes issues at scale because…

- Servers only have a limited number of ports available.
- Firewalls love to block ephemeral ports.
- Kubernetes lul.

You could probably abuse IPv6 to work around this, but IDK, I never tried. Twitch didn’t even support IPv6…
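To make it concrete, per-spec port allocation looks roughly like this on the server. A sketch of why the ports run out:

```go
package sketch

import (
	"fmt"
	"net"
)

// Per-spec WebRTC port allocation: one ephemeral UDP socket per
// session. Port 0 tells the OS to pick any free port, which is
// exactly what exhausts the ~64k port space at scale and gives
// firewalls something to block.
func allocateSessionPort() (*net.UDPConn, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})
	if err != nil {
		return nil, err
	}

	fmt.Println("session bound to", conn.LocalAddr()) // e.g. [::]:51732
	return conn, nil
}
```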
Hacks by Necessity

So most services end up ignoring the WebRTC specifications. Because of course they do. We mux multiple connections onto a single port.
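Pion even ships the plumbing for this hack. A sketch, assuming pion/webrtc v3 and pion/ice v2, that binds one well-known UDP port (8443 here is arbitrary) and demuxes every session behind it:

```go
package sketch

import (
	"net"

	"github.com/pion/ice/v2"
	"github.com/pion/webrtc/v3"
)

// Mux every WebRTC session onto one well-known UDP port, against
// the spirit of the spec. Sessions are demuxed by ICE ufrag and
// remote address instead of by destination port, so firewalls
// and load balancers only ever see a single port.
func newMuxedAPI() (*webrtc.API, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8443})
	if err != nil {
		return nil, err
	}

	mux := ice.NewUDPMuxDefault(ice.UDPMuxParams{UDPConn: conn})

	se := webrtc.SettingEngine{}
	se.SetICEUDPMux(mux)

	return webrtc.NewAPI(webrtc.WithSettingEngine(se)), nil
}
```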