Realtime deepfake software is a SaaS product now

I’ve been half-following the deepfake-in-the-wild beat for a while. Most of it has been static image stuff—fake profile photos, AI-generated headshots on LinkedIn, that kind of thing. I run suspicious images through AI or Not when something looks off, flag it, move on. But the 404 Media investigation into “HELLO BOSS” software shifted my sense of where the floor actually is. This isn’t someone uploading a faked image. This is a live video call where the person on screen is not the person on screen.

What the pipeline actually looks like

The software they describe isn’t magic; it’s a real-time face swap layer that sits between the camera input and whatever video call software the scammer is using. The rough architecture:

  [Scammer’s webcam]
          ↓
  [Face detection + landmark extraction]
          ↓
  [Target face model (pre-trained on victim’s photos)]
          ↓
  [Rendered output frame]
          ↓
  [Virtual camera driver — OBS, v4l2loopback, etc.]
          ↓
  [Zoom / WhatsApp / Teams / any WebRTC app]

The virtual camera driver is the key piece most people miss. Tools like OBS Virtual Camera on Windows/Mac or v4l2loopback on Linux let you present any video source as a system webcam. The calling app has no idea it’s not getting real hardware input. The face-swap model itself runs inference on every frame—typically 24–30 fps—which used to require a beefy GPU. Consumer-grade hardware can handle it now, and cloud GPU instances are cheap enough that you can rent the compute if you don’t own it.
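
To make the shape of that per-frame loop concrete, here is a minimal Python sketch with the model stages stubbed out. The function names, the 68-point landmark count, and the blank frames are illustrative assumptions, not details from the article; a real pipeline would use an actual detector and a trained swap model, and would hand each output frame to a virtual camera driver (e.g. via pyvirtualcam or v4l2loopback) instead of collecting it in a list.

```python
# Hedged sketch of the camera -> detect -> swap -> output loop.
WIDTH, HEIGHT, FPS = 640, 480, 30

def detect_landmarks(frame):
    """Stub: locate the face and return landmark coordinates."""
    return [(0.0, 0.0)] * 68  # 68-point landmark convention (assumed)

def swap_face(frame, landmarks):
    """Stub: a real model renders the target face over the landmarks."""
    return bytes(frame)

def run_pipeline(frames):
    """Run the swap on every incoming frame, as a virtual camera would."""
    for frame in frames:
        landmarks = detect_landmarks(frame)
        yield swap_face(frame, landmarks)

# One second of blank 640x480 RGB frames standing in for the webcam feed.
webcam = (bytearray(WIDTH * HEIGHT * 3) for _ in range(FPS))
rendered = list(run_pipeline(webcam))
```

The point of the sketch is the structure: the swap is just a per-frame transform sitting in the middle of an ordinary video stream, which is why the calling app downstream can’t tell anything happened.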

“Hello boss” isn’t a technical term; it’s a script

The name comes from one of the primary use cases: impersonating a company executive on a video call to authorize a wire transfer. A subordinate gets a call from what looks like their CEO on screen, hears a voice that’s been cloned or at least pitch-shifted, and gets told to move money somewhere. The “hello boss” phrasing is the greeting on the other end—the scammer picking up a call from someone who thinks they’re reaching their boss, not the scammer calling out while impersonating the boss. Either way, the social engineering depends entirely on the live video being convincing enough to short-circuit skepticism.

This same stack powers romance scams and “pig butchering” investment fraud, where the point is sustained trust over weeks, not a one-time wire transfer. A static fake photo stops working the moment someone asks for a video call. A real-time face swap keeps the fiction going.

Why this breaks video KYC assumptions

A lot of identity verification flows have converged on “liveness check + face match” as the standard:

  1. User holds up ID document → OCR extracts name, DOB, document number
  2. User records short selfie video → liveness detection (blink, turn head)
  3. Face on selfie matches face on document → pass
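
The three-step flow above can be sketched as a single gate function. All three checks are stubbed here, and names like `ocr_extract` and `liveness_check` are placeholders for illustration, not any vendor’s API:

```python
# Minimal sketch of the "liveness check + face match" KYC gate.
from dataclasses import dataclass

@dataclass
class KycResult:
    ok: bool
    reason: str

def ocr_extract(document_image) -> dict:
    """Stub: step 1 — pull name/DOB/document number off the ID."""
    return {"name": "Jane Doe", "dob": "1990-01-01", "doc_no": "X123"}

def liveness_check(selfie_video) -> bool:
    """Stub: step 2 — blink/turn-head challenge. This is the step a
    real-time face swap defeats, since the swap tracks the attacker's
    actual head movements and expressions."""
    return True

def face_match(selfie_video, document_image) -> bool:
    """Stub: step 3 — compare the selfie face to the document photo."""
    return True

def verify(document_image, selfie_video) -> KycResult:
    fields = ocr_extract(document_image)
    if not fields.get("doc_no"):
        return KycResult(False, "unreadable document")
    if not liveness_check(selfie_video):
        return KycResult(False, "liveness failed")
    if not face_match(selfie_video, document_image):
        return KycResult(False, "face mismatch")
    return KycResult(True, "pass")
```

Written this way, the weakness is easy to see: steps 2 and 3 both trust the same selfie video, so one successful swap compromises both at once.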

This pipeline assumes the face in the selfie video is the person’s real face. Real-time face swap defeats step 3 entirely if the attacker pre-trains their model on photos of the person whose identity they’re stealing. It also defeats liveness checks—the swap handles arbitrary head movements and expressions in real time, so asking someone to blink or smile doesn’t help. Some vendors are adding texture analysis, illumination consistency checks, and temporal coherence scoring to catch the artifacts that face swap models still produce at frame boundaries and occlusion edges. But that arms race is already underway and the defenders are not obviously winning.
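
To illustrate what “temporal coherence scoring” can mean at its crudest, here is a toy signal: flag frames whose pixel delta from the previous frame spikes, the kind of discontinuity a rendering glitch at an occlusion edge can produce. The threshold and the frame representation are invented for the example; real detectors use learned features, not raw deltas.

```python
# Toy temporal-coherence signal: frame-to-frame pixel deltas.
def frame_delta(a, b):
    """Mean absolute difference between two equal-length pixel buffers."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def coherence_flags(frames, threshold=40.0):
    """Return indices of frames whose delta from the previous frame
    exceeds the threshold (candidate artifact boundaries)."""
    return [
        i for i in range(1, len(frames))
        if frame_delta(frames[i - 1], frames[i]) > threshold
    ]

smooth = [[10] * 16, [12] * 16, [11] * 16]    # small, steady deltas
glitchy = [[10] * 16, [12] * 16, [200] * 16]  # abrupt jump at frame 2
```

A signal this naive also fires on scene cuts and lighting changes, which is part of why these scores work as triage inputs rather than pass/fail gates.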

What I’d actually do differently if I were building this today

  • Don’t trust video alone. Pair any video verification step with a second channel (SMS OTP, authenticator app, document scan from a different session) so compromising the video doesn’t compromise the whole flow.

  • Log the raw video stream for async review. Real-time detectors aren’t reliable enough to be gatekeepers. Use them as signals, not hard blocks, and let a human or a more thorough model review borderline cases after the fact.

  • Add device fingerprinting. Face swap pipelines route through a virtual camera driver. The camera device name exposed by browser WebRTC APIs (MediaDeviceInfo.label) will often be “OBS Virtual Camera” or similar. That’s not a perfect signal, but it’s a cheap one worth logging.

  • Test your liveness checks against an actual face swap. There are open-source models you can run locally. If a face swap passes your liveness check, you need to know now, not when a fraud team calls you. Assume the video is synthetic and design accordingly. Treat video verification as a corroborating signal, not a root of trust.
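
The device-fingerprint bullet above translates to a few lines of server-side code: log the camera label the client reports via WebRTC (MediaDeviceInfo.label) and flag known virtual-camera names. The marker list and function name here are illustrative, not exhaustive:

```python
# Cheap device-fingerprint signal: flag virtual-camera device labels.
VIRTUAL_CAMERA_MARKERS = (
    "obs virtual camera",
    "v4l2loopback",
    "virtual cam",
    "droidcam",  # phone-as-webcam apps also route through drivers
)

def looks_virtual(camera_label: str) -> bool:
    """True if the client-reported camera label matches a known virtual
    camera driver. A weak signal — labels can be renamed and plenty of
    legitimate users run OBS — so log it, don't hard-block on it."""
    label = camera_label.lower()
    return any(marker in label for marker in VIRTUAL_CAMERA_MARKERS)
```

Note the asymmetry: a virtual-camera label is weak evidence of fraud, but a hardware-looking label is no evidence of authenticity, since the label is self-reported by the client.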

Checking clips yourself

When I see video circulating that seems suspicious—a celebrity endorsing something, an executive making a statement—AI or Not is what I pull up first. It handles video files, not just images, so you can actually run the clip rather than screenshotting a frame and hoping the compression didn’t wash out the artifacts. It’s not a forensic lab, but it’s fast and the confidence scores are useful for triage. The problem the 404 Media story describes is harder to catch after the fact because there usually isn’t a recording—it happened on a live call. But for anything that did get recorded, that kind of tooling matters.

The SaaS part is the part that scales

What makes this story different from “deepfakes exist, film at 11” is the distribution model. The software described in the 404 Media piece is sold through Telegram channels at subscription prices. That means:

  • Low barrier to entry. You don’t need to train a model or write code.
  • Sellers have support channels, update cadences, and refund policies.
  • Supply scales with demand, not with technical skill.

This is the same trajectory malware took. Ransomware-as-a-service normalized the idea that you could be a criminal without being a programmer. Deepfake-as-a-service is doing the same thing for identity fraud. The underlying models will keep improving. The virtual camera trick is already table stakes. At some point, real-time voice cloning (already fairly mature) and real-time video swap running together on consumer hardware will…
