Your PDFs Never Leave Your Pocket: Building a 100% Offline RAG App with Gemma 4 + LiteRT-LM
“We’d love to use AI on our internal documents… but legal said no.” If you’ve ever worked with a German Mittelstand company — or honestly, any healthcare provider, law firm, or financial services team anywhere in the EU — you’ve heard a version of this sentence. And legal isn’t being paranoid. They’re being correct. The moment an employee pastes a contract, a payslip, or a patient record into ChatGPT, that document becomes someone else’s processing activity. Under GDPR Article 28, the cloud AI provider becomes a data processor. You stay the controller. If those servers sit in the US, you’ve also tripped Chapter V transfer rules and the ghost of Schrems II. Fines top out at €20 million or 4% of global turnover, and the regulators are warming up.
So here’s the dilemma every European SME is sitting in right now: the productivity gains from “chat with your documents” are real and obvious, but the compliance surface is a minefield. Most teams resolve it the same way — they don’t use AI on their sensitive stuff at all. The data just sits there, unsearchable. I wanted to fix that. Not by writing another DPA template, but by removing the cloud from the equation entirely. This is PocketSage — a fully offline, on-device RAG assistant for Android. You import a PDF, ask it questions, and get streaming answers from Gemma 4 E2B running natively on your phone. No network calls. No API keys. No “your data may be used to improve our services.” The model weights live in your app sandbox; the embeddings live in a Room database; airplane mode works perfectly. Let me walk you through how it’s built.
The Privacy Argument, Stated Plainly 🇪🇺
I want to spend one more paragraph here because this is the whole point of the project, not a footnote. When you build a cloud RAG pipeline for a German enterprise, here’s what your compliance checklist actually looks like:

✅ Sign a Data Processing Agreement (DPA) with your LLM provider (Article 28)
✅ Conduct a Data Protection Impact Assessment (DPIA) for high-risk processing (Article 35)
✅ Document a legal basis under Article 6 for every category of data
✅ Update your Record of Processing Activities (Article 30)
✅ Set up Standard Contractual Clauses for any non-EU sub-processors
✅ Implement PII redaction before vectorization (because the prompt and document data hit a third-party server)
✅ Build a “Right to be Forgotten” mechanism that can purge specific vectors from your store
That’s a six-month project before you write a line of feature code. Now here’s PocketSage’s compliance checklist:

✅ The data doesn’t leave the device.

That’s it. There is no processor because there is no processing happening anywhere except on hardware the user already owns. Article 28 doesn’t apply. Chapter V transfers don’t apply. There’s no DPA to sign because there’s no third party. This is privacy by design in the most literal sense the regulation could possibly mean — the architecture itself makes the violation impossible. For a German SME evaluating “chat with your contracts” tools, this is the difference between a six-month legal review and a one-week pilot.
The Tech Stack 🛠️
PocketSage is a textbook Modern Android Development (MAD Skills) app applied to a non-trivial ML problem. Three layers, clean separation, zero android.* imports in the domain layer.
| Layer | Choice | Why |
|---|---|---|
| UI | Jetpack Compose + Material 3 | Single Activity, dynamic color, recruiter-recognisable |
| Architecture | MVVM, Hilt DI, Navigation Compose | Standard, testable, no surprises |
| Concurrency | Kotlin Coroutines + Flow | Streaming tokens map cleanly onto callbackFlow |
| Persistence | Room (SQLite) | 384-dim embeddings stored as BLOB, cosine in Kotlin |
| Embeddings | LiteRT (all-MiniLM-L6-v2) | 22 MB, well-benchmarked, runs anywhere |
| PDF parsing | pdfbox-android | Mature port, handles most consumer PDFs |
| LLM Inference | LiteRT-LM + gemma-4-E2B-it-litert-lm | Google’s official on-device GenAI orchestration layer |
The whole RAG pipeline is roughly 500 lines of Kotlin once you strip the boilerplate. Honestly the hard part wasn’t the code — it was choosing the right model file format. (More on that nightmare in a moment.)
How RAG Works Here, in Three Paragraphs 📚
When you import a PDF, PocketSage extracts the text with PDFBox, splits it into ~800-character overlapping chunks, embeds each chunk with MiniLM (a tiny BERT-family model), and stores the resulting 384-dimensional vectors as raw bytes in a Room table. One-time per document, runs in the background, progress bar in the UI. Standard stuff.
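The import step above can be sketched in plain Kotlin. `chunkText`, `floatsToBlob`, and the 200-character overlap are illustrative choices of mine, not PocketSage’s actual code — but the shapes (~800-character windows, float vectors packed into a Room BLOB) follow the description.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Split extracted PDF text into overlapping windows. 800 chars matches the
// article; the 200-char overlap is an assumption for illustration.
fun chunkText(text: String, chunkSize: Int = 800, overlap: Int = 200): List<String> {
    require(overlap < chunkSize)
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break
        start = end - overlap   // step back so adjacent chunks share context
    }
    return chunks
}

// Pack a 384-dim embedding into bytes for a Room BLOB column, and back.
fun floatsToBlob(v: FloatArray): ByteArray {
    val buf = ByteBuffer.allocate(v.size * 4).order(ByteOrder.LITTLE_ENDIAN)
    v.forEach { buf.putFloat(it) }
    return buf.array()
}

fun blobToFloats(blob: ByteArray): FloatArray {
    val buf = ByteBuffer.wrap(blob).order(ByteOrder.LITTLE_ENDIAN)
    return FloatArray(blob.size / 4) { buf.float }
}
```

A Room `@TypeConverter` pair wrapping `floatsToBlob`/`blobToFloats` is all the persistence glue you need — no vector database, just a BLOB column.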
When you ask a question, the same embedding model converts your question into a vector. The app computes cosine similarity between the question vector and every stored chunk, takes the top four matches, and stitches them into a prompt template that explicitly tells the LLM: answer only from the supplied context.
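The retrieval step is a brute-force scan — no ANN index needed at this scale, since a few thousand 384-dim vectors score in milliseconds on a phone. A minimal sketch (the `Chunk` class and function names are mine, not PocketSage’s):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two equal-length embedding vectors.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-10f)  // epsilon guards zero vectors
}

data class Chunk(val id: Long, val text: String, val embedding: FloatArray)

// Score every stored chunk against the question vector, keep the best k.
fun topK(query: FloatArray, chunks: List<Chunk>, k: Int = 4): List<Chunk> =
    chunks.sortedByDescending { cosine(query, it.embedding) }.take(k)
```

The top-four chunks then get concatenated into the grounding section of the prompt template.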
The prompt is fed to Gemma 4 E2B running in LiteRT-LM’s Engine runtime, which streams tokens back through a callback. Each token is appended to a StateFlow that the Compose UI collects, so the answer renders on screen word by word as it is generated.
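The callback-to-Flow bridge is the one genuinely fiddly bit of plumbing, so here is a sketch of the pattern, assuming `kotlinx-coroutines`. `TokenListener` and `fakeGenerate` are stand-ins I invented to keep the example self-contained — LiteRT-LM’s real streaming API has its own names and signatures.

```kotlin
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.channels.trySendBlocking
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Hypothetical shape of a streaming LLM callback; the real engine API differs.
fun interface TokenListener {
    fun onToken(token: String, done: Boolean)
}

// Stand-in for the engine: streams each word of a canned answer.
fun fakeGenerate(prompt: String, listener: TokenListener) {
    val tokens = "The notice period is 30 days.".split(" ")
    tokens.forEachIndexed { i, t ->
        listener.onToken(if (i == 0) t else " $t", done = i == tokens.lastIndex)
    }
}

// Bridge the push-style callback into a cold Flow<String> of tokens.
fun streamAnswer(prompt: String): Flow<String> = callbackFlow {
    fakeGenerate(prompt) { token, done ->
        trySendBlocking(token)       // forward each token into the flow
        if (done) close()            // complete the flow when the engine finishes
    }
    awaitClose { /* cancel the in-flight engine call here in a real runner */ }
}

fun main() = runBlocking {
    // In the app, a ViewModel would fold this flow into a StateFlow<String>
    // that the Compose UI collects; here we just join the tokens.
    println(streamAnswer("What is the notice period?").toList().joinToString(""))
}
```

Because `callbackFlow` is cold and cancellable, backing out of the chat screen cancels collection and, via `awaitClose`, gives you a hook to stop the inference call.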
Under the Hood: The LiteRtLmRunner 🔧
This is the piece I’m proudest of, and it’s also the piece that took the longest to get right. LiteRT-LM is Google’s new orchestration layer that sits on top of LiteRT (formerly TensorFlow Lite). It handles KV-cache management, prompt templating, and the streaming token API — all the GenAI-specific plumbing that you used to have to write yourself.