jamwithai / production-agentic-rag-course

jamwithai / production-agentic-rag-course

The Mother of AI Project Phase 1 RAG Systems: arXiv Paper Curator

AI 之母项目第一阶段 RAG 系统:arXiv 论文策展人

A Learner-Focused Journey into Production RAG Systems. Learn to build modern AI systems from the ground up through hands-on implementation. Master the most in-demand AI engineering skills: RAG (Retrieval-Augmented Generation). 这是一段以学习者为中心的生产级 RAG 系统探索之旅。通过动手实践,从零开始构建现代 AI 系统,掌握当前最热门的 AI 工程技能:RAG(检索增强生成)。

📖 About This Course

📖 关于本课程

This is a learner-focused project where you’ll build a complete research assistant system that automatically fetches academic papers, understands their content, and answers your research questions using advanced RAG techniques. The arXiv Paper Curator will teach you to build a production-grade RAG system using industry best practices. Unlike tutorials that jump straight to vector search, we follow the professional path: master keyword search foundations first, then enhance with vectors for hybrid retrieval. 这是一个以学习者为导向的项目,你将构建一个完整的科研助手系统,它能自动抓取学术论文、理解内容,并利用先进的 RAG 技术回答你的研究问题。arXiv 论文策展人项目将教你如何使用行业最佳实践构建生产级的 RAG 系统。与那些直接跳到向量搜索的教程不同,我们遵循专业路径:先掌握关键词搜索基础,再通过向量技术增强,实现混合检索。

🎯 The Professional Difference:

🎯 专业之处:

We build RAG systems the way successful companies do - solid search foundations enhanced with AI, not AI-first approaches that ignore search fundamentals. By the end of this course, you’ll have your own AI research assistant and the deep technical skills to build production RAG systems for any domain. 我们以成功企业的方式构建 RAG 系统——以扎实的搜索基础为核心,辅以 AI 增强,而不是忽视搜索基本功的“AI 至上”方法。课程结束时,你将拥有自己的 AI 科研助手,并具备为任何领域构建生产级 RAG 系统的深厚技术能力。

🎓 What You’ll Build

🎓 你将构建的内容

  • Week 1: Complete infrastructure with Docker, FastAPI, PostgreSQL, OpenSearch, and Airflow

  • Week 2: Automated data pipeline fetching and parsing academic papers from arXiv

  • Week 3: Production BM25 keyword search with filtering and relevance scoring

  • Week 4: Intelligent chunking + hybrid search combining keywords with semantic understanding

  • Week 5: Complete RAG pipeline with local LLM, streaming responses, and Gradio interface

  • Week 6: Production monitoring with Langfuse tracing and Redis caching for optimized performance

  • Week 7: Agentic RAG with LangGraph and Telegram Bot for mobile access

  • 第 1 周: 包含 Docker、FastAPI、PostgreSQL、OpenSearch 和 Airflow 的完整基础设施

  • 第 2 周: 从 arXiv 自动抓取并解析学术论文的数据流水线

  • 第 3 周: 具备过滤和相关性评分功能的生产级 BM25 关键词搜索

  • 第 4 周: 智能分块 + 结合关键词与语义理解的混合搜索

  • 第 5 周: 包含本地 LLM、流式响应和 Gradio 界面的完整 RAG 流水线

  • 第 6 周: 带有 Langfuse 追踪和 Redis 缓存的生产监控,以优化性能

  • 第 7 周: 基于 LangGraph 的 Agentic RAG(智能体 RAG)及用于移动端访问的 Telegram 机器人


🏗️ System Architecture Evolution

🏗️ 系统架构演进

Week 7: Agentic RAG & Telegram Bot Integration 第 7 周:Agentic RAG 与 Telegram 机器人集成

Complete Week 7 architecture showing Telegram bot integration with the agentic RAG system. 展示了 Telegram 机器人与 Agentic RAG 系统集成的完整第 7 周架构。

LangGraph Agentic RAG Workflow LangGraph 智能体 RAG 工作流

Detailed LangGraph workflow showing decision nodes, document grading, and adaptive retrieval. 详细的 LangGraph 工作流,展示了决策节点、文档评分和自适应检索过程。

Key Innovations in Week 7: 第 7 周的关键创新:

  • Intelligent Decision-Making: Agents evaluate and adapt retrieval strategies

  • Document Grading: Automatic relevance assessment with semantic evaluation

  • Query Rewriting: Adaptive query refinement when results are insufficient

  • Guardrails: Out-of-domain detection prevents hallucination

  • Mobile Access: Telegram bot for conversational AI on any device

  • Transparency: Full reasoning step tracking for debugging and trust

  • 智能决策: 智能体评估并调整检索策略

  • 文档评分: 通过语义评估进行自动相关性判定

  • 查询重写: 当结果不足时进行自适应查询优化

  • 护栏机制: 领域外检测以防止幻觉

  • 移动访问: 用于在任何设备上进行对话式 AI 的 Telegram 机器人

  • 透明度: 完整的推理步骤追踪,便于调试和建立信任


🚀 Quick Start

🚀 快速开始

📋 Prerequisites

📋 前置要求

  • Docker Desktop (with Docker Compose)

  • Python 3.12+

  • UV Package Manager (Install Guide)

  • 8GB+ RAM and 20GB+ free disk space

  • Docker Desktop (需包含 Docker Compose)

  • Python 3.12+

  • UV 包管理器(安装指南

  • 8GB 以上内存及 20GB 以上可用磁盘空间

⚡ Get Started

⚡ 开始使用

# 1. Clone and setup
git clone <repository-url>
cd arxiv-paper-curator

# 2. Configure environment (IMPORTANT!)
cp .env.example .env
# The .env file contains all necessary configuration for OpenSearch,
# arXiv API, and service connections. Defaults work out of the box.
# You need to add Jina embeddings free api key and langfuse keys (check the blogs)

# 3. Install dependencies
uv sync

# 4. Start all services
docker compose up --build -d

# 5. Verify everything works
curl http://localhost:8000/api/v1/health
# 1. 克隆并设置
git clone <repository-url>
cd arxiv-paper-curator

# 2. 配置环境(重要!)
cp .env.example .env
# .env 文件包含了 OpenSearch、arXiv API 和服务连接所需的所有配置。
# 默认配置即可直接运行。
# 你需要添加 Jina embeddings 的免费 API Key 和 Langfuse Key(请查看博客)。

# 3. 安装依赖
uv sync

# 4. 启动所有服务
docker compose up --build -d

# 5. 验证是否正常工作
curl http://localhost:8000/api/v1/health

📚 Weekly Learning Path

📚 每周学习路径

WeekTopicBlog PostCode Release
0The Mother of AI project - 6 phases--
1Infrastructure FoundationThe Infrastructure That Powers RAG Systemsweek1.0
2Data Ingestion PipelineBuilding Data Ingestion Pipelines for RAGweek2.0
3OpenSearch ingestion & BM25 retrievalThe Search Foundation Every RAG System Needsweek3.0
4Chunking & Hybrid SearchThe Chunking Strategy That Makes Hybrid Search Workweek4.0
5Complete RAG systemThe Complete RAG Systemweek5.0
6Production monitoring & cachingProduction-ready RAG: Monitoring & Cachingweek6.0
7Agentic RAG & Telegram BotAgentic RAG with LangGraph and Telegramweek7.0
周次主题博客文章代码发布
0AI 之母项目 - 6 个阶段--
1基础设施基础驱动 RAG 系统的基础设施week1.0
2数据摄取流水线构建 RAG 的数据摄取流水线week2.0
3OpenSearch 摄取与 BM25 检索每个 RAG 系统都需要的搜索基础week3.0
4分块与混合搜索让混合搜索生效的分块策略week4.0
5完整的 RAG 系统完整的 RAG 系统week5.0
6生产监控与缓存生产就绪的 RAG:监控与缓存week6.0
7智能体 RAG 与 Telegram 机器人基于 LangGraph 和 Telegram 的智能体 RAGweek7.0

📥 Clone a specific week’s release:

📥 克隆特定周的版本:

# Clone a specific week's code
git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator
cd arxiv-paper-curator
uv sync
docker compose down -v
docker compose up --build -d
# Replace <WEEK_TAG> with: week1.0, week2.0, etc.
# 克隆特定周的代码
git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator
cd arxiv-paper-curator
uv sync
docker compose down -v
docker compose up --build -d
# 将 <WEEK_TAG> 替换为:week1.0, week2.0 等。

📊 Access Your Services

📊 访问你的服务

ServiceURLPurpose
API Documentationhttp://localhost:8000/docsInteractive API testing
Gradio RAG Interfacehttp://localhost:7861User-friendly chat interface
Langfuse Dashboardhttp://localhost:3000RAG pipeline monitoring & tracing
Airflow Dashboardhttp://localhost:8080Workflow management
OpenSearch Dashboardshttp://localhost:5601Hybrid search engine UI
服务URL用途
API 文档http://localhost:8000/docs交互式 API 测试
Gradio RAG 界面http://localhost:7861用户友好的聊天界面
Langfuse 面板http://localhost:3000RAG 流水线监控与追踪
Airflow 面板http://localhost:8080工作流管理
OpenSearch 面板http://localhost:5601混合搜索引擎 UI

NOTE: Check airflow/simple_auth_manager_passwords.json.generated for Airflow username and password. 注意:请查看 airflow/simple_auth_manager_passwords.json.generated 获取 Airflow 的用户名和密码。