Streaming Ollama Responses in Next.js: The SSE Pattern That Actually Works

在 Next.js 中流式传输 Ollama 响应：真正有效的 SSE 模式

Most Next.js + Ollama tutorials show a single await fetch and call it a day. The user types a question, waits eight seconds, and a wall of text appears. That’s a bad UX. Real LLM apps stream tokens as they’re generated. The user sees a response materialise word by word, just like ChatGPT. This post shows how to build that on Next.js 15 App Router with Ollama as the backend, using Server-Sent Events. Production-ready in under a hundred lines.

大多数 Next.js + Ollama 的教程只是简单地展示一个 await fetch 就草草了事。用户输入问题，等待八秒，然后一大段文字突然出现。这是一种糟糕的用户体验。真正的 LLM 应用会在 Token 生成时进行流式传输。用户可以看到响应像 ChatGPT 一样逐字呈现。本文将展示如何使用 Server-Sent Events (SSE) 在 Next.js 15 App Router 中构建此功能，并以 Ollama 作为后端。不到一百行代码即可实现生产级效果。

Why SSE and not WebSocket

为什么选择 SSE 而不是 WebSocket？

The tradeoffs	SSE	WebSocket
One-way (server → client)	✓	also bi-directional
Auto-reconnect built in	✓	implement yourself
Plain HTTP, no upgrade	✓	requires upgrade handshake
Works through proxies	✓	sometimes blocked
Streaming overhead	minimal	small frame overhead

权衡因素	SSE	WebSocket
单向 (服务器 → 客户端)	✓	也支持双向
内置自动重连	✓	需自行实现
普通 HTTP，无需升级	✓	需要升级握手
可穿透代理	✓	有时会被拦截
流式传输开销	极小	较小的帧开销

For LLM streaming, you only need server → client. SSE wins on simplicity. WebSocket is overkill until you need bidirectional streaming (voice, real-time collaboration, tool-call dialogues).

对于 LLM 流式传输，你只需要“服务器 → 客户端”的通信。SSE 在简洁性上胜出。除非你需要双向流（如语音、实时协作、工具调用对话），否则 WebSocket 有点大材小用。

The architecture

架构设计

Browser → /api/chat (Next.js Route Handler) → Ollama (localhost:11434) ↑ emits SSE chunks back to the browser as Ollama produces tokens

浏览器 → /api/chat (Next.js 路由处理器) → Ollama (localhost:11434) ↑ 当 Ollama 生成 Token 时，将 SSE 数据块回传给浏览器

Three pieces:

Server route — pipes Ollama’s stream into the response.
Client hook — reads the stream and updates state.
UI — renders the materialising text.

三个部分：

服务器路由 — 将 Ollama 的流管道传输到响应中。
客户端 Hook — 读取流并更新状态。
UI — 渲染正在生成的文本。

Server: the route handler (`app/api/chat/route.ts`)

服务器端：路由处理器 (app/api/chat/route.ts)

export async function POST(request: Request) {
  const { message } = await request.json();
  const ollama = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:7b",
      messages: [{ role: "user", content: message }],
      stream: true,
    }),
  });

  if (!ollama.ok || !ollama.body) {
    return new Response("upstream error", { status: 502 });
  }

  const stream = new ReadableStream({
    async start(controller) {
      const reader = ollama.body!.getReader();
      const decoder = new TextDecoder();
      const encoder = new TextEncoder();
      let buffer = "";
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split("\n");
          buffer = lines.pop() ?? "";
          for (const line of lines) {
            if (!line.trim()) continue;
            try {
              const obj = JSON.parse(line);
              if (obj.message?.content) {
                const sseChunk = `data: ${JSON.stringify({ delta: obj.message.content, })}\n\n`;
                controller.enqueue(encoder.encode(sseChunk));
              }
              if (obj.done) {
                controller.enqueue(encoder.encode(`data: ${JSON.stringify({ done: true })}\n\n`));
              }
            } catch { /* ignore non-JSON lines */ }
          }
        }
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
      "X-Accel-Buffering": "no",
    },
  });
}

Two details that matter:

stream: true in the Ollama call. Without it, Ollama returns one big response after the whole generation finishes.
X-Accel-Buffering: no header. If you deploy behind nginx or a CDN that buffers responses, this disables it for SSE specifically. Without it, you’ll see chunks arrive in a burst at the end.

两个关键细节：

Ollama 调用中的 stream: true。如果没有它，Ollama 会在整个生成完成后返回一个巨大的响应。
X-Accel-Buffering: no 响应头。如果你部署在 nginx 或 CDN 之后，它们可能会缓冲响应，此设置专门为 SSE 禁用该行为。否则，你会看到数据块在最后一次性涌入。

Client: the hook

客户端：Hook

import { useState } from "react";

export function useChatStream() {
  const [response, setResponse] = useState("");
  const [loading, setLoading] = useState(false);

  async function send(message: string) {
    setResponse("");
    setLoading(true);
    const r = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message }),
    });

    if (!r.body) { setLoading(false); return; }

    const reader = r.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n\n");
      buffer = lines.pop() ?? "";
      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        const json = JSON.parse(line.slice(6));
        if (json.delta) { setResponse((prev) => prev + json.delta); }
        if (json.done) { setLoading(false); }
      }
    }
  }
  return { response, loading, send };
}

That’s it for the streaming logic. Calling send("hello") updates response token by token.

流式传输逻辑到此结束。调用 send("hello") 会逐个 Token 更新 response。

UI: the chat box

UI：聊天框

"use client";
import { useState } from "react";
import { useChatStream } from "./useChatStream";

export default function Chat() {
  const [input, setInput] = useState("");
  const { response, loading, send } = useChatStream();

  return (
    <div className="max-w-2xl mx-auto p-4 space-y-4">
      <div className="min-h-[200px] p-4 border rounded whitespace-pre-wrap">
        {response || (loading ? "thinking..." : "ask me anything")}
      </div>
      <form onSubmit={(e) => { e.preventDefault(); send(input); setInput(""); }} className="flex gap-2">
        <input value={input} onChange={(e) => setInput(e.target.value)} className="flex-1 px-3 py-2 border rounded" placeholder="Ask Ollama..." />
        <button type="submit" disabled={loading || !input} className="px-4 py-2 bg-blue-600 text-white rounded disabled:opacity-50">
          Send
        </button>
      </form>
    </div>
  );
}

Run pnpm dev, hit the page, and watch tokens appear in real time.

运行 pnpm dev，打开页面，观察 Token 实时出现。

Production-grade additions

生产级改进建议

The skeleton above works locally. To ship it:

Authentication. Add an auth check in the route handler before opening the upstream stream. Otherwise anyone with your URL can burn your local CPU.
Conversation history. The handler above takes a single message. Real chat sends the full history each time. Pass messages: ChatMessage[] and forward to Ollama.
Cancellation. When the user navigates away, abort the upstream fetch. Pass an AbortController.signal and call controller.abort() on disconnect.
Backpressure. If your client is slow, the controller’s queue grows. Use controller.desiredSize to detect this and pause reading from Ollama.
Vercel deployment. Edge Runtime works for this pattern but has a 30-second function timeout. For longer generations, use Node Runtime or self-host. Local models running on your dev machine are obviously not callable from Vercel — for production, you’d swap.

上述骨架代码适用于本地开发。若要上线：

身份验证：在打开上游流之前，在路由处理器中添加身份验证检查。否则，任何拥有你 URL 的人都可以耗尽你的本地 CPU。
对话历史：上述处理器只处理单条消息。真正的聊天每次都会发送完整历史记录。请传递 messages: ChatMessage[] 并转发给 Ollama。
取消请求：当用户离开页面时，中止上游 fetch。传递 AbortController.signal 并在断开连接时调用 controller.abort()。
背压处理：如果客户端处理缓慢，控制器的队列会增长。使用 controller.desiredSize 检测此情况并暂停从 Ollama 读取数据。
Vercel 部署：Edge Runtime 支持此模式，但有 30 秒的函数超时限制。对于更长的生成任务，请使用 Node Runtime 或自托管。在开发机上运行的本地模型显然无法从 Vercel 调用——生产环境中，你需要替换为远程模型服务。