I Built a Personal Knowledge Graph. Apple Had Already Built One on My Laptop.
I’ve been building a system called Nexus for the past few months. It’s a personal data platform: 54 data sources, 250+ tables, 358,000 knowledge graph facts, all running in Postgres 16 on a home server. It ingests everything from iMessage to Gmail to Apple Photos to Spotify to HealthKit. The goal is biographical intelligence: a queryable model of my own life. A few weeks ago, while wiring up some of the Apple-specific data sources, I cracked open the local databases that Apple maintains on macOS. What I found was… familiar.
What Apple stores locally
Apple maintains several overlapping SQLite databases on your Mac. The digital forensics community has mapped these fairly well, even though Apple barely documents them publicly. Sarah Edwards at mac4n6 was among the first to document knowledgeC.db in depth, and her later research on PersonalizationPortrait specifically called out that the database was “loaded with data.” But that work focused on forensic investigation (what can law enforcement extract from a seized device), not architectural analysis. I’m coming at it from the other direction: I built a system that does the same thing, and I want to understand what Apple’s design choices can teach me.
The databases I ended up ingesting into Nexus:
- knowledgeC.db (CoreDuet framework) — this is the big one. Forensic researchers call it “pattern of life” data. App usage, device usage, location context, behavioral timelines, media activity, browsing behavior, communication metadata, charging patterns, movement patterns. On newer macOS/iOS versions, much of this moved into the Biome framework, but the architectural concept stayed the same.
After harvesting, my own aurora_raw_apple_knowledge table has 13,677 records and aurora_raw_biome has 249,776. Here’s what the stream distribution looks like in knowledgeC:
- /app/usage: 8,739
- /app/intents: 3,078
- /app/webUsage: 758
- /display/isBacklit: 338
- /notification/usage: 304
- /bluetooth/isConnected: 300
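A per-stream count like the one above can be reproduced with a single GROUP BY. This is a minimal sketch, assuming the layout described in the forensic literature: a `ZOBJECT` table with a `ZSTREAMNAME` column, and timestamps stored as seconds since 2001-01-01 UTC. The helper names are hypothetical and this is not the actual Nexus harvester.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumption: knowledgeC timestamps count seconds from 2001-01-01 UTC.
MAC_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def mac_time_to_datetime(seconds: float) -> datetime:
    """Convert a 2001-epoch timestamp to a timezone-aware datetime."""
    return MAC_EPOCH + timedelta(seconds=seconds)


def stream_distribution(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count events per stream, largest stream first.

    Assumes a ZOBJECT table with a ZSTREAMNAME column, as documented
    by forensic researchers for knowledgeC.db.
    """
    rows = conn.execute(
        """
        SELECT ZSTREAMNAME, COUNT(*) AS n
        FROM ZOBJECT
        GROUP BY ZSTREAMNAME
        ORDER BY n DESC, ZSTREAMNAME
        """
    ).fetchall()
    return [(name, n) for name, n in rows]
```

It is safer to run this against a copy of the database opened read-only, for example `sqlite3.connect("file:knowledgeC-copy.db?mode=ro", uri=True)`, where the file name is a placeholder.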
And in Biome:
- GenerativeModels.GenerativeFunctions.Instrumentation: 241,138
- Autonaming.Messages.MessageIds: 2,221
- Messages.Read: 2,098
- App.Intent: 1,139
- Siri.Remembers.MessageHistory: 1,051
That first Biome stream alone: 241,138 records of Apple Intelligence function invocations. Every time the generative model runs on your device, that’s a record.
PersonalizationPortrait — this is where it gets architecturally interesting. My aurora_raw_apple_personalization table has 38,775 records. The schema: entity_name (text), entity_type (text), interest_score (real), decay_score (real), topic_category (text), is_significant_contact (boolean).
Apple is running interest scoring with temporal decay on entities it tracks about you. The entity types include significant_contact (3,970 records), topic (2,805 records), loc (2,000 location entities), and a set of opaque ne_* (named entity) categories that appear to map to different entity classes. The topic_category values are Wikidata QIDs. As far as I can tell from the published DFIR literature, nobody has called this out before: Apple is using the Wikidata knowledge graph as its topic ontology.
Q223563 (Google Calendar) is the most frequent topic in my data with 587 records and an average interest score of 0.999. Apple knows I’m very interested in calendaring. The location entities have a mean decay_score of -1.0, which suggests Apple uses negative decay to actively deprecate location relevance over time. Contacts don’t decay. Locations do.
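The per-topic and per-type figures quoted here come down to two aggregations over the harvested table. The sketch below uses an in-memory SQLite table as a stand-in for the Postgres one, with the column list taken from the schema described above. It assumes topic rows carry `entity_type = 'topic'` with the QID in `topic_category`. The function names are hypothetical.

```python
import sqlite3

# Stand-in for the harvested table; columns follow the schema in the text.
SCHEMA = """
CREATE TABLE aurora_raw_apple_personalization (
    entity_name TEXT,
    entity_type TEXT,
    interest_score REAL,
    decay_score REAL,
    topic_category TEXT,
    is_significant_contact BOOLEAN
)
"""


def top_topics(conn: sqlite3.Connection, limit: int = 10):
    """Most frequent topic QIDs with their mean interest score."""
    return conn.execute(
        """
        SELECT topic_category, COUNT(*) AS n, AVG(interest_score) AS mean_interest
        FROM aurora_raw_apple_personalization
        WHERE entity_type = 'topic' AND topic_category IS NOT NULL
        GROUP BY topic_category
        ORDER BY n DESC, topic_category
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def mean_decay_by_type(conn: sqlite3.Connection) -> dict:
    """Mean decay_score for each entity_type."""
    rows = conn.execute(
        """
        SELECT entity_type, AVG(decay_score)
        FROM aurora_raw_apple_personalization
        GROUP BY entity_type
        """
    ).fetchall()
    return dict(rows)


def wikidata_url(qid: str) -> str:
    """Build the public Wikidata page URL for a QID such as 'Q223563'."""
    return f"https://www.wikidata.org/wiki/{qid}"
```

Comparing `mean_decay_by_type` across entity types is what surfaces the contrast between contacts and locations. It does not explain how Apple computes either score.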
Apple Intelligence triples — this one stopped me cold. aurora_raw_apple_intelligence has 40,339 records with this schema: subject (text), predicate (text), object (text), confidence (real). That’s a knowledge graph. Subject-predicate-object triples with confidence scores. Apple is running entity resolution on your device.
nm_hasVisualIdentifier links face embeddings to person entities. nm_entityAliasRelationship with nm_confirmationConfidence is alias resolution: “this name and this face are the same person, with this confidence score.” nm_personType classifies entities into person categories. This is the same architectural pattern I built in Nexus, where aurora_social_identities (7,203 person entities) links to knowledge_facts (358,053 facts) through entity resolution with confidence scoring and alias deduplication. Apple and I arrived at the same design, independently, for the same reasons.
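One way to turn alias triples into deduplicated identities is a union-find pass over the high-confidence alias edges. This is a minimal sketch of that general technique, not Apple's implementation or the actual Nexus code. It assumes each alias triple holds two entity identifiers as subject and object and that the `confidence` column is the value to threshold on. The 0.8 cutoff and the `person:N` identifiers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str
    confidence: float


ALIAS_PREDICATE = "nm_entityAliasRelationship"


def resolve_aliases(triples, min_confidence=0.8):
    """Group entity IDs into identity clusters with union-find.

    Only alias triples at or above min_confidence merge two entities.
    Each cluster is keyed by its lexicographically smallest member.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as root so cluster keys are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    for t in triples:
        if t.predicate != ALIAS_PREDICATE:
            continue
        # Register both entities even if the edge is too weak to merge.
        find(t.subject)
        find(t.object)
        if t.confidence >= min_confidence:
            union(t.subject, t.object)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return clusters
```

Low-confidence alias edges leave both entities as separate clusters rather than discarding them, which keeps the merge decision reversible if the threshold changes later.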
The architectural comparison
Here’s what surprised me about the overlap:
| Capability | Apple (on-device) | Nexus (my system) |
|---|---|---|
| Entity resolution | Face-to-identity linking with confidence | Multi-signal identity merge with confidence |
| Relationship modeling | is_significant_contact boolean + interaction frequency | Weighted edges with temporal validity windows |
| Topic classification | Wikidata QIDs with interest + decay scores | Knowledge graph facts with typed predicates |
| Behavioral timeline | knowledgeC + Biome streams | Unified timeline across 54 sources |
| Location context | 2,000 location entities with decay scoring | Google Timeline + device GPS + travel records |
Where Apple is richer: the ML layer. The visual identifier linking, the interest/decay scoring algorithms, the generative model instrumentation. That’s 241K records of model telemetry I can see but can’t interpret because Apple’s weighting logic is opaque.
Where Nexus is richer: temporal depth (24 years vs. however long your Mac has been running), cross-platform fusion (Apple only sees Apple ecosystem data), and full auditability. I can SELECT * FROM knowledge_facts WHERE subject_id = $person and get every fact I’ve ever recorded. Apple’s system is a black box even to the device owner.
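The audit query can be wrapped as a parameterized function. In this sketch only `knowledge_facts` and `subject_id` come from the text. The `predicate`, `object`, `confidence`, `valid_from` and `valid_to` columns are hypothetical, added to show how a temporal validity window might be filtered. SQLite stands in for Postgres here, where a driver such as psycopg would use `%s` placeholders instead of `?`.

```python
import sqlite3


def facts_for_person(conn: sqlite3.Connection, person_id, as_of=None):
    """Return every fact recorded for one subject.

    If as_of (an ISO 'YYYY-MM-DD' string) is given, keep only facts whose
    hypothetical [valid_from, valid_to) window contains that date.
    A NULL bound is treated as open-ended.
    """
    sql = (
        "SELECT predicate, object, confidence, valid_from, valid_to "
        "FROM knowledge_facts WHERE subject_id = ?"
    )
    params = [person_id]
    if as_of is not None:
        sql += (
            " AND (valid_from IS NULL OR valid_from <= ?)"
            " AND (valid_to IS NULL OR valid_to > ?)"
        )
        params += [as_of, as_of]
    sql += " ORDER BY valid_from, predicate"
    return conn.execute(sql, params).fetchall()
```

Passing the ID as a bound parameter instead of interpolating it into the SQL string keeps the query safe to expose to other tools.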
The part nobody talks about
Apple’s privacy story is “we keep this on-device.” That’s meaningful. It’s genuinely better than sending everything to a cloud. I respect the architectural de…