Streaming Ollama Responses in Next.js: The SSE Pattern That Actually Works
Streaming Ollama Responses in Next.js: The SSE Pattern That Actually Works
在 Next.js 中流式传输 Ollama 响应:真正有效的 SSE 模式
Most Next.js + Ollama tutorials show a single await fetch and call it a day. The user types a question, waits eight seconds, and a wall of text appears. That’s a bad UX. Real LLM apps stream tokens as they’re generated. The user sees a response materialise word by word, just like ChatGPT. This post shows how to build that on Next.js 15 App Router with Ollama as the backend, using Server-Sent Events. Production-ready in under a hundred lines.
大多数 Next.js + Ollama 的教程只是简单地展示一个 await fetch 就草草了事。用户输入问题,等待八秒,然后一大段文字突然出现。这是一种糟糕的用户体验。真正的 LLM 应用会在 Token 生成时进行流式传输。用户可以看到响应像 ChatGPT 一样逐字呈现。本文将展示如何使用 Server-Sent Events (SSE) 在 Next.js 15 App Router 中构建此功能,并以 Ollama 作为后端。不到一百行代码即可实现生产级效果。
Why SSE and not WebSocket
为什么选择 SSE 而不是 WebSocket?
| The tradeoffs | SSE | WebSocket |
|---|---|---|
| One-way (server → client) | ✓ | also bi-directional |
| Auto-reconnect built in | ✓ | implement yourself |
| Plain HTTP, no upgrade | ✓ | requires upgrade handshake |
| Works through proxies | ✓ | sometimes blocked |
| Streaming overhead | minimal | small frame overhead |
| 权衡因素 | SSE | WebSocket |
|---|---|---|
| 单向 (服务器 → 客户端) | ✓ | 也支持双向 |
| 内置自动重连 | ✓ | 需自行实现 |
| 普通 HTTP,无需升级 | ✓ | 需要升级握手 |
| 可穿透代理 | ✓ | 有时会被拦截 |
| 流式传输开销 | 极小 | 较小的帧开销 |
For LLM streaming, you only need server → client. SSE wins on simplicity. WebSocket is overkill until you need bidirectional streaming (voice, real-time collaboration, tool-call dialogues).
对于 LLM 流式传输,你只需要“服务器 → 客户端”的通信。SSE 在简洁性上胜出。除非你需要双向流(如语音、实时协作、工具调用对话),否则 WebSocket 有点大材小用。
The architecture
架构设计
Browser → /api/chat (Next.js Route Handler) → Ollama (localhost:11434)
↑ emits SSE chunks back to the browser as Ollama produces tokens
浏览器 → /api/chat (Next.js 路由处理器) → Ollama (localhost:11434)
↑ 当 Ollama 生成 Token 时,将 SSE 数据块回传给浏览器
Three pieces:
- Server route — pipes Ollama’s stream into the response.
- Client hook — reads the stream and updates state.
- UI — renders the materialising text.
三个部分:
- 服务器路由 — 将 Ollama 的流管道传输到响应中。
- 客户端 Hook — 读取流并更新状态。
- UI — 渲染正在生成的文本。
Server: the route handler (app/api/chat/route.ts)
服务器端:路由处理器 (app/api/chat/route.ts)
export async function POST(request: Request) {
const { message } = await request.json();
const ollama = await fetch("http://localhost:11434/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "qwen2.5:7b",
messages: [{ role: "user", content: message }],
stream: true,
}),
});
if (!ollama.ok || !ollama.body) {
return new Response("upstream error", { status: 502 });
}
const stream = new ReadableStream({
async start(controller) {
const reader = ollama.body!.getReader();
const decoder = new TextDecoder();
const encoder = new TextEncoder();
let buffer = "";
try {
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop() ?? "";
for (const line of lines) {
if (!line.trim()) continue;
try {
const obj = JSON.parse(line);
if (obj.message?.content) {
const sseChunk = `data: ${JSON.stringify({ delta: obj.message.content, })}\n\n`;
controller.enqueue(encoder.encode(sseChunk));
}
if (obj.done) {
controller.enqueue(encoder.encode(`data: ${JSON.stringify({ done: true })}\n\n`));
}
} catch { /* ignore non-JSON lines */ }
}
}
} finally {
controller.close();
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache, no-transform",
Connection: "keep-alive",
"X-Accel-Buffering": "no",
},
});
}
Two details that matter:
stream: truein the Ollama call. Without it, Ollama returns one big response after the whole generation finishes.X-Accel-Buffering: noheader. If you deploy behind nginx or a CDN that buffers responses, this disables it for SSE specifically. Without it, you’ll see chunks arrive in a burst at the end.
两个关键细节:
- Ollama 调用中的
stream: true。如果没有它,Ollama 会在整个生成完成后返回一个巨大的响应。 X-Accel-Buffering: no响应头。如果你部署在 nginx 或 CDN 之后,它们可能会缓冲响应,此设置专门为 SSE 禁用该行为。否则,你会看到数据块在最后一次性涌入。
Client: the hook
客户端:Hook
import { useState } from "react";
export function useChatStream() {
const [response, setResponse] = useState("");
const [loading, setLoading] = useState(false);
async function send(message: string) {
setResponse("");
setLoading(true);
const r = await fetch("/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ message }),
});
if (!r.body) { setLoading(false); return; }
const reader = r.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n\n");
buffer = lines.pop() ?? "";
for (const line of lines) {
if (!line.startsWith("data: ")) continue;
const json = JSON.parse(line.slice(6));
if (json.delta) { setResponse((prev) => prev + json.delta); }
if (json.done) { setLoading(false); }
}
}
}
return { response, loading, send };
}
That’s it for the streaming logic. Calling send("hello") updates response token by token.
流式传输逻辑到此结束。调用 send("hello") 会逐个 Token 更新 response。
UI: the chat box
UI:聊天框
"use client";
import { useState } from "react";
import { useChatStream } from "./useChatStream";
export default function Chat() {
const [input, setInput] = useState("");
const { response, loading, send } = useChatStream();
return (
<div className="max-w-2xl mx-auto p-4 space-y-4">
<div className="min-h-[200px] p-4 border rounded whitespace-pre-wrap">
{response || (loading ? "thinking..." : "ask me anything")}
</div>
<form onSubmit={(e) => { e.preventDefault(); send(input); setInput(""); }} className="flex gap-2">
<input value={input} onChange={(e) => setInput(e.target.value)} className="flex-1 px-3 py-2 border rounded" placeholder="Ask Ollama..." />
<button type="submit" disabled={loading || !input} className="px-4 py-2 bg-blue-600 text-white rounded disabled:opacity-50">
Send
</button>
</form>
</div>
);
}
Run pnpm dev, hit the page, and watch tokens appear in real time.
运行 pnpm dev,打开页面,观察 Token 实时出现。
Production-grade additions
生产级改进建议
The skeleton above works locally. To ship it:
- Authentication. Add an auth check in the route handler before opening the upstream stream. Otherwise anyone with your URL can burn your local CPU.
- Conversation history. The handler above takes a single message. Real chat sends the full history each time. Pass
messages: ChatMessage[]and forward to Ollama. - Cancellation. When the user navigates away, abort the upstream fetch. Pass an
AbortController.signaland callcontroller.abort()on disconnect. - Backpressure. If your client is slow, the controller’s queue grows. Use
controller.desiredSizeto detect this and pause reading from Ollama. - Vercel deployment. Edge Runtime works for this pattern but has a 30-second function timeout. For longer generations, use Node Runtime or self-host. Local models running on your dev machine are obviously not callable from Vercel — for production, you’d swap.
上述骨架代码适用于本地开发。若要上线:
- 身份验证:在打开上游流之前,在路由处理器中添加身份验证检查。否则,任何拥有你 URL 的人都可以耗尽你的本地 CPU。
- 对话历史:上述处理器只处理单条消息。真正的聊天每次都会发送完整历史记录。请传递
messages: ChatMessage[]并转发给 Ollama。 - 取消请求:当用户离开页面时,中止上游 fetch。传递
AbortController.signal并在断开连接时调用controller.abort()。 - 背压处理:如果客户端处理缓慢,控制器的队列会增长。使用
controller.desiredSize检测此情况并暂停从 Ollama 读取数据。 - Vercel 部署:Edge Runtime 支持此模式,但有 30 秒的函数超时限制。对于更长的生成任务,请使用 Node Runtime 或自托管。在开发机上运行的本地模型显然无法从 Vercel 调用——生产环境中,你需要替换为远程模型服务。