Data Architectures Powering Agentic AI

From semantic layers and knowledge graphs to vector search, modern data platforms, and real-time pipelines: here’s the infrastructure beneath the intelligence.

The headline of 2025–2026 is not the model. It’s the agent. Large language models proved that machines can reason. Agentic AI proves they can act: plan multi-step tasks, call tools, observe results, and adapt without a human in the loop.

But here’s the architectural truth nobody tweets about: a brilliant agent grounded in bad data is just a confident liar. The data infrastructure beneath an agentic system determines whether it produces trustworthy decisions or expensive hallucinations. Traditional data architectures, built for dashboards and batch queries, are fundamentally ill-equipped for the fluid, latency-sensitive, multi-source demands of autonomous agents. This article breaks down every layer of a production-grade agentic data stack, with reference architectures you can actually build.

What Makes Agentic AI Different

A standard LLM application fires one request and gets one response. An agentic system fires chains of requests, each depending on the last: querying databases, reading APIs, executing code, writing to systems of record, and looping back for context. This changes data infrastructure requirements fundamentally:

  • Latency shifts from “acceptable in seconds” to “must respond in milliseconds”: agents make dozens of data calls per task, so per-call delays compound.
  • Freshness is non-negotiable: a stale risk score or outdated inventory count produces a wrong action, not just a wrong answer.
  • Governance becomes critical: agents act autonomously, so every data access must be scoped, audited, and revocable.
  • Context continuity requires storing and retrieving evolving state across multiple turns and tool calls.

The data stack must stop being passive storage and become an active, governed reasoning substrate.
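
As an illustration of the governance and freshness requirements listed above, here is a minimal sketch of a read path that checks scope, revocation, and data age, and logs every attempt. The names `AccessGrant` and `GovernedReader`, the five-minute freshness window, and the sample datasets are all hypothetical and chosen only for this example; a production system would typically delegate these checks to its catalog or policy engine.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class AccessGrant:
    """Hypothetical grant: which datasets an agent may read, and whether it is still valid."""
    agent_id: str
    allowed_datasets: frozenset
    revoked: bool = False


@dataclass
class GovernedReader:
    """Wraps a data source so every read is scope-checked, freshness-checked, and logged."""
    records: dict            # dataset name -> (value, last_updated)
    max_age: timedelta       # assumed freshness budget for every dataset
    audit_log: list = field(default_factory=list)

    def read(self, grant: AccessGrant, dataset: str, now: datetime):
        # Scope and revocation are checked before any data is touched.
        if grant.revoked or dataset not in grant.allowed_datasets:
            self.audit_log.append((now, grant.agent_id, dataset, "denied"))
            raise PermissionError(f"{grant.agent_id} may not read {dataset}")
        value, last_updated = self.records[dataset]
        # Stale data is rejected rather than silently returned to the agent.
        if now - last_updated > self.max_age:
            self.audit_log.append((now, grant.agent_id, dataset, "stale"))
            raise ValueError(f"{dataset} is older than {self.max_age}")
        self.audit_log.append((now, grant.agent_id, dataset, "ok"))
        return value


# Made-up sample data: one fresh dataset, one stale dataset.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
reader = GovernedReader(
    records={
        "inventory": (42, now - timedelta(seconds=30)),
        "risk_scores": (0.87, now - timedelta(hours=6)),
    },
    max_age=timedelta(minutes=5),
)
grant = AccessGrant("refund-agent", frozenset({"inventory", "risk_scores"}))
inventory_count = reader.read(grant, "inventory", now)
```

Reading `risk_scores` with the same grant raises `ValueError` because the record is six hours old, and setting `grant.revoked = True` makes every later read fail with `PermissionError`. Both outcomes are recorded in `audit_log`, which is the point of the design: the agent never sees data it cannot account for.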


Layer 1 — The Semantic Layer (What Data Means)

Raw databases are unreadable by agents. A column named amt_usd_cr_adj means nothing to an LLM, and if the agent guesses wrong, every downstream action is corrupted. The semantic layer solves this by translating raw data into machine-readable business context: what each field means, how metrics are calculated, which datasets relate to which entities. It maps complex data into familiar business terms (product, customer, revenue, risk), offering a unified view across an organization’s entire data estate.

Key components of a semantic layer for agents:

  • Virtual datasets: Clean, business-aligned views that hide raw table complexity from the agent.
  • Column-level documentation: Human-readable descriptions that LLMs use to understand field semantics.
  • Pre-defined metrics: Aggregations (revenue, DAU, churn rate) that agents invoke by name rather than recalculating each time.
  • Business rules: Hierarchy definitions, relationships, and domain logic made machine-readable.

Without this layer, agents reverse-engineer table semantics from raw column names and data distributions, a brittle approach that produces hallucinations at scale.

# Example: Semantic Layer Metadata (dbt / Dremio style)

table: transactions
columns:
  - name: amt_usd_cr_adj
    description: "Credit-adjusted transaction amount in USD after refunds"
    semantic_type: currency
    metric: true
  - name: user_id
    description: "Unique identifier for the user who initiated the transaction"
    semantic_type: entity_key
    joins_to: users.id
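
One way an agent might consume metadata like the example above is to render it into plain-text context that goes into the prompt before any SQL is generated. The sketch below assumes the metadata has already been loaded into a Python dictionary with the same keys as the example; the function name `describe_table` and the output format are hypothetical, not part of dbt or Dremio.

```python
# The same metadata as the example above, as a Python dictionary.
# In practice it would be loaded from the semantic layer, not hard-coded.
SEMANTIC_METADATA = {
    "table": "transactions",
    "columns": [
        {
            "name": "amt_usd_cr_adj",
            "description": "Credit-adjusted transaction amount in USD after refunds",
            "semantic_type": "currency",
            "metric": True,
        },
        {
            "name": "user_id",
            "description": "Unique identifier for the user who initiated the transaction",
            "semantic_type": "entity_key",
            "joins_to": "users.id",
        },
    ],
}


def describe_table(meta: dict) -> str:
    """Render table metadata as prompt-ready text (hypothetical format)."""
    lines = [f"Table {meta['table']}:"]
    for col in meta["columns"]:
        parts = [f"- {col['name']} ({col['semantic_type']}): {col['description']}"]
        if col.get("metric"):
            parts.append("[metric]")
        if "joins_to" in col:
            parts.append(f"[joins to {col['joins_to']}]")
        lines.append(" ".join(parts))
    return "\n".join(lines)


context = describe_table(SEMANTIC_METADATA)
print(context)
```

With this context in the prompt, the agent no longer has to guess what amt_usd_cr_adj means or how transactions relate to users.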

Layer 2 — Knowledge Graphs (How Data Connects)

If the semantic layer tells an agent what data means, the knowledge graph tells it how everything relates. Knowledge graphs model entities (users, products, transactions, events) as nodes and their relationships as edges, enabling agents to traverse multi-hop reasoning paths that flat tables cannot express.

The key differentiator from a relational database is inference: knowledge graphs built on the W3C’s Resource Description Framework (RDF) stack can derive new facts from existing ones through formal reasoning over OWL ontologies, while SHACL constraints validate that the data conforms to the expected shapes. This makes them well suited as a grounding layer for LLMs, providing structured, verifiable facts that anchor generative responses to reality.
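
To make "deriving new facts from existing ones" concrete, here is a toy forward-chaining sketch in plain Python. It applies only two RDFS-style rules (subclass transitivity and type inheritance) to a set of triples; it is not an OWL reasoner, and the class names and the `infer` function are made up for illustration.

```python
def infer(triples: set) -> set:
    """Apply two RDFS-style rules until no new facts appear.

    Rule 1: A subClassOf B and B subClassOf C  =>  A subClassOf C
    Rule 2: x type A and A subClassOf B        =>  x type B
    """
    facts = set(triples)
    while True:
        new = set()
        for (sub, pred, sup) in facts:
            if pred != "subClassOf":
                continue
            for (s2, p2, o2) in facts:
                if p2 == "subClassOf" and s2 == sup:
                    new.add((sub, "subClassOf", o2))   # Rule 1
                if p2 == "type" and o2 == sub:
                    new.add((s2, "type", sup))         # Rule 2
        if new <= facts:          # fixed point: nothing new was derived
            return facts
        facts |= new


# Hypothetical asserted facts.
asserted = {
    ("Vendor:Acme", "type", "HighRiskVendor"),
    ("HighRiskVendor", "subClassOf", "Vendor"),
    ("Vendor", "subClassOf", "Organization"),
}
derived = infer(asserted) - asserted
```

Here `derived` contains facts nobody wrote down explicitly, such as Vendor:Acme being an Organization. A relational query would return only the three asserted rows unless the same logic were hand-coded into joins.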

GraphRAG combines the best of both approaches: vector-based retrieval finds semantically relevant chunks, while the knowledge graph provides structured, relationship-aware context for precise reasoning.

Research on a hybrid RAG-KG framework (RAG-KG-IL) demonstrated that integrating knowledge graphs with RAG significantly reduces hallucination rates and improves answer completeness and reasoning accuracy compared to RAG-only baselines. In clinical question answering specifically, an ontology-grounded knowledge graph framework achieved 98% accuracy and reduced hallucination rates from ~63% (ChatGPT-4) to 1.7%.

Graph Traversal Example:

User:John → PLACED → Order:4821
Order:4821 → CONTAINS → Product:SKU-991
Product:SKU-991 → MANUFACTURED_BY → Vendor:Acme
Vendor:Acme → IS_FLAGGED → Risk:HIGH

Agent query: “Should I approve John’s refund?”
Graph traversal reveals vendor risk → agent triggers manual review.
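
The traversal above can be sketched as a breadth-first search over an edge list. This is a minimal in-memory illustration, not a graph database query; the function names `find_path` and `refund_decision` and the two decision labels are assumptions made for the example.

```python
from collections import deque

# Edges taken from the example above: (source, relationship, destination).
EDGES = [
    ("User:John", "PLACED", "Order:4821"),
    ("Order:4821", "CONTAINS", "Product:SKU-991"),
    ("Product:SKU-991", "MANUFACTURED_BY", "Vendor:Acme"),
    ("Vendor:Acme", "IS_FLAGGED", "Risk:HIGH"),
]


def find_path(edges, start, target):
    """Breadth-first search; returns the node/relationship path, or None if unreachable."""
    adjacency = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel, dst in adjacency.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [rel, dst]))
    return None


def refund_decision(edges, user):
    """Escalate to a human if any path links the user to a high-risk flag."""
    path = find_path(edges, user, "Risk:HIGH")
    return "manual_review" if path else "auto_approve"


path = find_path(EDGES, "User:John", "Risk:HIGH")
decision = refund_decision(EDGES, "User:John")
```

Returning the full path, rather than a boolean, lets the agent explain why it escalated: the risk flag sits four hops away from the user, on the vendor.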

Graph-based approaches can also deliver large efficiency gains: experiments in financial document retrieval showed an 80% decrease in token usage and a 734-fold reduction in token consumption for contradiction detection compared to conventional RAG methods.


Layer 3 — Vector Search (How Data Is Retrieved)

Not all knowledge fits neatly into a relational schema or a knowledge graph. Unstructured content (documents, emails, support tickets, product descriptions, conversation history) is best represented as embeddings: high-dimensional vectors encoding semantic meaning. Vector search finds the content most semantically similar to a query, enabling agents to retrieve relevant context even when exact keywords don’t match.
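
The core of vector search is a similarity ranking, which can be sketched with cosine similarity over a small in-memory index. The three-dimensional vectors and document IDs below are invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and production systems use approximate nearest-neighbor indexes rather than a full scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the IDs of the k entries most similar to the query (brute-force scan)."""
    scored = ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index)
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Toy index with hand-made vectors (hypothetical, not real embeddings).
INDEX = [
    ("ticket-refund-delay", [0.9, 0.1, 0.0]),
    ("email-shipping-update", [0.2, 0.9, 0.1]),
    ("doc-api-rate-limits", [0.0, 0.1, 0.95]),
]

# Pretend this vector encodes the query "where is my money back?".
query = [0.8, 0.2, 0.0]
results = top_k(query, INDEX, k=2)
```

The query shares no keywords with the refund ticket, yet that ticket ranks first because its vector points in nearly the same direction as the query vector.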

A production vector search pipeline has three phases:

  1. Ingestion and Preprocessing: Chunk large documents into sentence- or paragraph-level units.
  2. Attach metadata (timestamps, source, entity IDs).