Experimenting with the proposed Cross-Origin Storage API in Transformers.js

(This is a guest post by Developer Relations Engineer Thomas Steiner from the Chrome team at Google.)

Transformers.js provides Web developers with a simple way to use the power of transformers in their Web apps through task-specific pipelines. To run inference in the browser, developers create an instance of pipeline() and specify a task they want to use the pipeline for. As a concrete example, the following snippet shows how to set up an automatic speech recognition (ASR) pipeline.

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.2.0';

const asr = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en',
  { device: 'webgpu' },
);

const result = await asr('jfk.wav');
console.log(result);

The cache challenge

You will notice in the source code that I specified Xenova/whisper-tiny.en as the model, which is a very decent choice for common English automatic speech recognition tasks. In fact, it's even the default model for this task according to the Transformers.js default model resolution, as per the linked excerpt.

Model resources

When you run this example in the browser, Transformers.js automatically takes care of downloading and caching the relevant model resources and Wasm files. The following screenshot shows the Chrome DevTools Cache storage section after visiting the app. When you reload the page, the resources are served from the Cache API, and the model returns results almost instantly.
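To look at the same data from code rather than from DevTools, a minimal sketch along the following lines could list what the Cache API holds. The helper name listCachedUrls is hypothetical and not part of Transformers.js. It takes the CacheStorage object as a parameter (in a page, you would pass the global caches object) and relies only on the standard keys(), open(), and Request.url members.

```javascript
// Hypothetical helper: list the URLs stored in every cache of a CacheStorage.
// Only standard Cache API members are used: keys(), open(), and Request.url.
async function listCachedUrls(cacheStorage) {
  const urlsByCache = {};
  for (const cacheName of await cacheStorage.keys()) {
    const cache = await cacheStorage.open(cacheName);
    const requests = await cache.keys();
    urlsByCache[cacheName] = requests.map((request) => request.url);
  }
  return urlsByCache;
}
```

In the DevTools console of the demo page, console.log(await listCachedUrls(caches)) should print roughly the same entries as the Cache storage section.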

However, because Xenova/whisper-tiny.en is a popular model (and, as mentioned before, even the ASR default model in Transformers.js), you can well imagine that more than just one app that you visit would use it. To simulate this situation, here's the same example app from before, but served from a different origin. When you visit this different origin app, rather than being usable almost instantly, the browser instead has to download and cache all the model resources again, even if they're byte-for-byte the same as before. Even in this toy example, this amounts to 177 MB of duplicate download and storage, as you can examine in the Storage section of the Chrome DevTools Application panel. You can imagine that this quickly adds up.

Wasm runtime resources

But it gets worse. Let's add a second pipeline to the toy example: sentiment analysis. Sentiment analysis by default uses the Xenova/distilbert-base-uncased-finetuned-sst-2-english model. If you don't specify a model, Transformers.js' default model resolution automatically picks it for you.

const classifier = await pipeline('sentiment-analysis');
const sentiment = await classifier(result.text);
pre.append('\n\n' + JSON.stringify(sentiment, null, 2));

These are two entirely different AI models, but they depend on the same 4,733 kB ort-wasm-simd-threaded.asyncify.wasm WebAssembly (Wasm) runtime file from the underlying ONNX Runtime library that Transformers.js is built on top of. Open the extended demo on a different origin, and you will notice in the Network tab how the Wasm runtime also gets downloaded and cached again. So even if you run apps that don't share the same AI models, your browser still makes redundant requests for shared Wasm resources you already have, and on top of that caches them again, which consumes space on your hard disk.

Cache isolation

AI model resources serving

By default, AI model resources come from the Hugging Face Hub, and ultimately the Hugging Face CDN. The browser makes a request for a resource like https://huggingface.co/Xenova/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json which then gets redirected to the final CDN URL.
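The URL pattern in this example can be sketched as a small function. The name hubResolveUrl and its parameters are hypothetical and only mirror the shape of the URL quoted above. This is not the code Transformers.js itself uses, and the redirect target is decided by the server, so it isn't modeled here.

```javascript
// Hypothetical helper mirroring the URL pattern quoted above:
// https://huggingface.co/<model id>/resolve/<revision>/<file path>
function hubResolveUrl(modelId, filePath, revision = 'main') {
  return `https://huggingface.co/${modelId}/resolve/${revision}/${filePath}`;
}

const configUrl = hubResolveUrl(
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
  'config.json',
);
console.log(configUrl);
```

If you pass such a URL to fetch() in the browser, the redirect is followed automatically, and for a readable response, response.url should then report the final URL rather than the one you requested.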

Wasm runtime resources serving

The Wasm runtime resources are served from the jsDelivr CDN by default. For example, at the time of this writing, ort-wasm-simd-threaded.asyncify.wasm comes from https://cdn.jsdelivr.net/npm/onnxruntime-web@1.26.0-dev.20260416-b7804b056c/dist/ort-wasm-simd-threaded.asyncify.wasm.

Now you may say that if different apps, even though they run on different origins, in the end load their resources from the same CDN URLs, caching shouldn't be a problem, as long as the final URLs are the same. Unfortunately, this has not been how caching works in browsers for a long time. The article "Gaining security and privacy by partitioning the cache" goes into all the details, but essentially, caches are isolated by origin to prevent timing attacks: the time a website takes to respond to HTTP requests can reveal that the browser has accessed the same resource in the past, which makes the browser vulnerable to security and privacy leaks.
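To make the effect of this isolation concrete, here is a toy model of a partitioned cache. It is a deliberately simplified sketch: the class name PartitionedCache, the example origins, and the placeholder URL are all made up for illustration, and real browsers build their cache keys differently and with more inputs.

```javascript
// Toy model of a partitioned HTTP cache. Real browsers are more involved;
// here the key is simply "<requesting page's origin> <resource URL>".
class PartitionedCache {
  constructor() {
    this.entries = new Map();
  }

  static key(pageOrigin, resourceUrl) {
    return `${pageOrigin} ${resourceUrl}`;
  }

  // Returns 'hit' if this page origin has requested the URL before. Otherwise
  // stores it and returns 'miss', which stands in for a network download.
  fetch(pageOrigin, resourceUrl) {
    const key = PartitionedCache.key(pageOrigin, resourceUrl);
    if (this.entries.has(key)) return 'hit';
    this.entries.set(key, true);
    return 'miss';
  }
}

const httpCache = new PartitionedCache();
const wasmUrl = 'https://cdn.example/runtime.wasm'; // placeholder URL

const outcomes = [
  httpCache.fetch('https://app-one.example', wasmUrl), // first visit
  httpCache.fetch('https://app-one.example', wasmUrl), // reload, same origin
  httpCache.fetch('https://app-two.example', wasmUrl), // different origin
];
console.log(outcomes);
```

In this model, the reload on the same origin is served from the cache, while the second origin starts from scratch even though the resource URL is identical. That is the behavior you see with the two demo apps.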

Chrome’s implementation

The concrete implementation may vary by browser, but in Chrome, cached resources are keyed using a Network Isolation Key in addition to the resource URL. The Network Isolation Key is composed of the top-level site and the current-frame site. Take the previous toy examples hosted on the origins https://googlechrome.github.io and https://rawcdn.rawgit.net. If they both use the Wasm runtime from https://cdn.jsdelivr.net/npm/onnxruntime-web@1.26.0-dev.20260416-b7804b056c/dist/ort-wasm-simd-threaded.asyncify.wasm, their cache keys will look as shown in the following table.