Two Patterns for Reducing LLM Costs in Data-Heavy RAG Apps

Two Patterns for Reducing LLM Costs in Data-Heavy RAG Apps

两种降低数据密集型 RAG 应用 LLM 成本的模式

How we cut token usage significantly in an F1 telemetry analyzer by rethinking what goes into the context window — and when. When building RAG applications on top of structured data (databases, APIs, telemetry), the naive approach is to dump everything into the context and let the LLM figure it out. It works, but it’s expensive and slow. 我们通过重新思考上下文窗口的内容及其加载时机,显著降低了一款 F1 遥测分析工具的 Token 使用量。在基于结构化数据(数据库、API、遥测数据)构建 RAG 应用时,一种简单粗暴的方法是将所有内容一股脑塞进上下文,让 LLM 自己去处理。这种方法虽然可行,但既昂贵又缓慢。

After building F1 Analyst Pro — a chat interface for Formula 1 race analysis backed by FastF1 + Supabase + Claude — we refined two patterns that significantly reduced both token usage and API costs without sacrificing response quality. This post covers both patterns with real code from the project. 在开发了 F1 Analyst Pro(一个由 FastF1 + Supabase + Claude 驱动的 F1 赛事分析聊天界面)之后,我们总结了两种模式,在不牺牲响应质量的前提下,显著降低了 Token 使用量和 API 成本。本文将通过项目中的实际代码来介绍这两种模式。

The Problem: Context Bloat

问题:上下文膨胀

A typical race weekend in F1 Analyst Pro has data across multiple sessions: FP1, FP2, FP3, Qualifying, and Race. Each session has lap-by-lap data for 22 drivers — compound, lap time, position, stint, sector times, tyre life. On top of that: qualifying results, race results, stint summaries, pit stop analysis, key moments, race incidents, and journalist notes. 在 F1 Analyst Pro 中,一个典型的比赛周末包含多个环节的数据:FP1、FP2、FP3、排位赛和正赛。每个环节都有 22 位车手的逐圈数据——包括轮胎配方、圈速、名次、停站、分段计时和轮胎寿命。此外还有:排位赛成绩、正赛成绩、停站总结、进站分析、关键时刻、比赛事故以及记者笔记。

If you send all of that to the LLM on every query, two things happen: 如果你在每次查询时都将所有这些数据发送给 LLM,会发生两件事:

  1. Cost explodes. A full context dump for a race weekend easily reaches 8,000-12,000 tokens per query. At $3/MTok for Claude Sonnet input, that’s $0.024-$0.036 per query — fine for one user, painful for many.

  2. 成本激增。 一个比赛周末的完整上下文转储每次查询很容易达到 8,000 到 12,000 个 Token。以 Claude Sonnet 输入端每百万 Token 3 美元的价格计算,每次查询成本为 0.024 到 0.036 美元——对于单个用户尚可,但对于大量用户来说则难以承受。

  3. Quality degrades. LLMs perform worse when the context contains irrelevant information. A question about qualifying doesn’t need race stint summaries — and including them adds noise that can distract the model from what matters.

  4. 质量下降。 当上下文中包含无关信息时,LLM 的表现会变差。关于排位赛的问题不需要正赛的停站总结——包含这些信息会产生干扰,使模型偏离重点。

The solution is two complementary patterns: conditional context and pre-generation. 解决方案是两种互补的模式:条件上下文 (Conditional Context)预生成 (Pre-generation)


Pattern 1: Conditional Context via Intent Detection

模式 1:通过意图识别实现条件上下文

Instead of always building the full context, detect what the user is actually asking about and fetch only the relevant data. 与其总是构建完整的上下文,不如识别用户真正询问的内容,并仅获取相关数据。

Intent Detection

意图识别

import unicodedata

def _detect_intents(prompt: str) -> dict:
    # Normalize accents so "clasificación" matches "clasificacion"
    p = unicodedata.normalize("NFD", prompt.lower())
    p = "".join(c for c in p if unicodedata.category(c) != "Mn")
    
    wants_qualy = any(w in p for w in [ "clasificacion", "qualifying", "q1", "q2", "q3", "pole", "sector" ])
    wants_race = any(w in p for w in [ "carrera", "race", "resultado", "vuelta rapida", "stint", "degradacion" ])
    # ... (other intents)
    
    # load_all is the fallback: fires only when no specific intent is detected
    load_all = not any([ wants_qualy, wants_race, wants_telemetry, wants_undercut, wants_practice, wants_race_sim ])
    
    return { "wants_qualy": wants_qualy, "wants_race": wants_race, ... }

Conditional Context Building

条件上下文构建

Each intent triggers specific SQL queries. Critically, expensive analyses (undercut/overcut, key moments, race incidents) are only fetched when they’re actually relevant: 每个意图都会触发特定的 SQL 查询。关键在于,昂贵的分析(如超车/被超车、关键时刻、比赛事故)仅在真正相关时才会被获取:

def build_context(self, prompt: str, gp_name: str, year: int) -> str:
    intents = _detect_intents(prompt)
    context = ""
    
    # --- QUALIFYING ---
    if intents["wants_qualy"] or intents["load_all"]:
        q_results = self.db.get_qualifying_results(q_id)
        context += "--- QUALIFYING (Q) ---\n" + q_results.to_string() + "\n\n"
        
    # --- RACE ---
    if intents["wants_race"] or intents["wants_undercut"] or intents["load_all"]:
        # Only fetch detailed analysis if relevant
        if intents["wants_undercut"] or intents["wants_race"] or intents["load_all"]:
            pit_df = self.db.get_pit_stop_analysis(r_id)
            context += "--- UNDERCUT/OVERCUT ANALYSIS ---\n" + pit_df.to_string() + "\n\n"
    # ...

The Impact

影响

A query like “¿Cuál fue la pole?” used to pull in race stints, pit stop analysis, key moments, and FP laps — all irrelevant. With intent detection, it only fetches qualifying results: ~400 tokens vs ~6,000 tokens for the same question. That’s a 15x reduction in input tokens for qualifying-specific queries. 像“谁拿了杆位?”这样的查询过去会拉取正赛停站、进站分析、关键时刻和练习赛圈速——这些全都是无关信息。通过意图识别,它现在只获取排位赛结果:对于同一个问题,Token 消耗从约 6,000 降至约 400。这意味着针对排位赛的查询,输入 Token 减少了 15 倍。


Pattern 2: Pre-Generation for Charts

模式 2:图表预生成

The standard approach when a user asks for a chart is to let the LLM generate the plotting code, return it to the client, execute it, and display the result. This sounds reasonable but has three problems in practice: It costs tokens. A matplotlib or Plotly snippet… 当用户要求生成图表时,标准做法是让 LLM 生成绘图代码,返回给客户端执行并显示结果。这听起来很合理,但在实践中有三个问题:它消耗 Token。一段 matplotlib 或 Plotly 代码片段……