How to build scalable web apps with OpenAI's Privacy Filter
OpenAI released Privacy Filter on the Hugging Face Hub this week: an open-source detector for personally identifiable information (PII) that labels text across eight categories in a single forward pass over a 128k context. See the model card for details.
We spent a few hours building with it and landed on three apps that each reveal a different slice of what it can do.
Document Privacy Explorer: drop in a PDF or DOCX, read the document back with every PII span highlighted in place.
Image Anonymizer: upload an image, get it back with redacted black bars over names, emails, and account numbers. The image is also editable on a canvas so you can make your own annotations before downloading.
SmartRedact Paste: paste sensitive text, share a public URL that serves the redacted version, keep a private reveal link for yourself.
All three are built on gradio.Server, which lets you pair custom HTML/JS frontends with Gradio's queueing, ZeroGPU allocation, and the gradio_client SDK. In all three apps, gradio.Server plays the same backend role, and that consistency is what makes it powerful.
The model
Privacy Filter is a 1.5B-parameter model with 50M active parameters, permissively licensed under Apache 2.0. It detects eight PII categories: private_person, private_address, private_email, private_phone, private_url, private_date, account_number, and secret. Context is 128,000 tokens. It achieves state-of-the-art performance on the PII-Masking-300k benchmark. Full numbers and methodology are in the official release blog.
1. Document Privacy Explorer
Try it at ysharma/OPF-Document-PII-Explorer.
User problem. You want to read a PII-heavy document (a contract, a resume, an exported chat log) with every detected span highlighted by category, a filter in the sidebar, and a summary dashboard up top. The reading experience should feel like a normal document, not a form.
What Privacy Filter does here. The whole file goes through in a single 128k-context forward pass, so there's no chunking, no stitching, and span offsets line up directly with the rendered text. BIOES decoding keeps span boundaries clean through long, ambiguous runs.
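To make the BIOES step concrete, here is a minimal sketch of turning a per-token tag sequence into spans. It assumes tags are strings such as `B-private_person` or `O`, and it uses a simple greedy pass; the tag format and the `decode_bioes` helper are illustrative assumptions, not the decoder the model actually ships with.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def decode_bioes(tags: List[str]) -> List[Span]:
    """Greedy BIOES decoding (hypothetical helper, assumed tag format "X-label")."""
    spans: List[Span] = []
    start = None   # index where the currently open span began
    label = None   # label of the currently open span
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S":
            # Single-token span: emit immediately.
            spans.append((i, i + 1, name))
            start, label = None, None
        elif prefix == "B":
            # Begin a new multi-token span.
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            # Inside the open span: keep going.
            continue
        elif prefix == "E" and start is not None and name == label:
            # End of the open span: emit it.
            spans.append((start, i + 1, name))
            start, label = None, None
        else:
            # "O" or an inconsistent tag: discard any half-open span.
            start, label = None, None
    return spans


tags = ["O", "B-private_person", "E-private_person", "O", "S-private_email"]
print(decode_bioes(tags))
```

Because a span is only emitted on an explicit `E` or `S` tag, a run that starts but never closes is dropped rather than bleeding into the following text.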
What gr.Server does here. You could wire this up in Blocks with gr.HighlightedText and a sidebar, and it would work. But the reading experience we wanted (serif body, category filters that toggle CSS classes client-side instead of re-running the model, a summary dashboard that doesn't force a page re-render) was easier to hand-author than to compose from components. gr.Server lets us serve the reader view as a single HTML file and expose the model behind one queued endpoint.
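One way to support client-side category filters is to have the backend return HTML in which every span carries a per-category CSS class, so the frontend only has to toggle a class. The sketch below assumes spans arrive as `(start, end, label)` character offsets; the `render_highlighted` helper and the `pii-<label>` class naming are hypothetical, not taken from the app's source.

```python
from html import escape
from typing import List, Tuple


def render_highlighted(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each character span in a <mark> tagged with its category class."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already rendered.
            continue
        # Plain text between the previous span and this one.
        parts.append(escape(text[cursor:start]))
        parts.append(
            f'<mark class="pii pii-{escape(label)}">{escape(text[start:end])}</mark>'
        )
        cursor = end
    # Trailing text after the last span.
    parts.append(escape(text[cursor:]))
    return "".join(parts)


print(render_highlighted("Mail a@b.co now", [(5, 11, "private_email")]))
```

With markup like this, hiding a category is a matter of toggling a class on a parent element in the browser, and the model never has to run again.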
(Code snippet omitted for brevity)
Note the decorator: @server.api(name="analyze_document"), not a plain @server.post. That's the piece that plugs the handler into Gradio's queue, so concurrent uploads are serialized, @spaces.GPU composes correctly on ZeroGPU, and the same endpoint is reachable from both the browser and gradio_client with no duplicated code.
2. Image Anonymizer
Try it at ysharma/OPF-Image-Anonymizer.
User problem. You want to share a screenshot or other image (a Slack thread, a receipt, a Stripe dashboard) with black bars over the PII. You want to toggle bars on and off, drag them to reposition, or draw one by hand for anything the model missed, then export the result.
What Privacy Filter does here. Tesseract runs OCR and returns per-word bounding boxes. The backend reconstructs the full text along with a map from character offsets to boxes, then runs Privacy Filter once over the whole text. Detected character spans are looked up against the word map and joined into one pixel rectangle per line.
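The offset-to-box bookkeeping described above can be sketched as follows. This is an illustration under assumptions: the `Word` record, `build_text`, and `span_to_rects` are hypothetical names, the fields only loosely mirror what an OCR engine reports per word, and words are joined with a single space (or a newline between lines) rather than whatever spacing the real backend uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    """One OCR word with its line index and pixel box (hypothetical record)."""
    text: str
    line: int
    left: int
    top: int
    width: int
    height: int


def build_text(words: List[Word]) -> Tuple[str, List[Tuple[int, int]]]:
    """Rebuild the full text and record each word's character range."""
    pieces: List[str] = []
    ranges: List[Tuple[int, int]] = []
    pos = 0
    for i, w in enumerate(words):
        if i > 0:
            # Newline between OCR lines, single space within a line.
            pieces.append("\n" if w.line != words[i - 1].line else " ")
            pos += 1
        ranges.append((pos, pos + len(w.text)))
        pieces.append(w.text)
        pos += len(w.text)
    return "".join(pieces), ranges


def span_to_rects(words: List[Word], ranges: List[Tuple[int, int]],
                  start: int, end: int) -> List[Rect]:
    """Merge the boxes of all words overlapping [start, end) into one rect per line."""
    by_line: Dict[int, Rect] = {}
    for w, (ws, we) in zip(words, ranges):
        if ws < end and we > start:  # word overlaps the detected span
            box = (w.left, w.top, w.left + w.width, w.top + w.height)
            if w.line in by_line:
                l, t, r, b = by_line[w.line]
                box = (min(l, box[0]), min(t, box[1]), max(r, box[2]), max(b, box[3]))
            by_line[w.line] = box
    return [by_line[k] for k in sorted(by_line)]


words = [
    Word("Contact", 0, 10, 10, 70, 20),
    Word("Jane", 0, 90, 10, 40, 20),
    Word("Doe", 1, 10, 40, 30, 20),
]
text, ranges = build_text(words)
# Suppose the model flags characters 8..16 ("Jane\nDoe") as a private_person span.
print(span_to_rects(words, ranges, 8, 16))
```

Keeping one rectangle per line means a name that wraps across lines yields two bars instead of one box covering everything in between.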
What gr.Server does here. gr.ImageEditor supports layered annotation and is a reasonable starting point for image redaction. The workflow we wanted (per-bar category metadata, toggle all bars in a category at once, client-side PNG export at natural resolution with no server round-trip) was cleaner to build on a custom
(Code snippet omitted for brevity)