Streaming Ollama Responses in Next.js: The SSE Pattern That Actually Works

Most Next.js + Ollama tutorials show a single await fetch and call it a day. The user types a question, waits eight seconds, and a wall of text appears. That’s a bad UX. Real LLM apps stream tokens as they’re generated. The user sees a response materialise word by word, just like ChatGPT. This post shows how to build that on Next.js 15 App Router with Ollama as the backend, using Server-Sent Events. Production-ready in under a hundred lines.

Why SSE and not WebSocket

The tradeoffs:

SSE                          WebSocket
One-way (server → client)    Also bi-directional
Auto-reconnect built in      Implement yourself
Plain HTTP, no upgrade       Requires upgrade handshake
Works through proxies        Sometimes blocked
Minimal streaming overhead   Small frame overhead

For LLM streaming, you only need server → client. SSE wins on simplicity. WebSocket is overkill until you need bidirectional streaming (voice, real-time collaboration, tool-call dialogues).
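
Part of that simplicity is the wire format: an SSE event is a line starting with data: followed by a blank line. A minimal sketch, assuming JSON payloads and a hypothetical helper called formatSseEvent (it is not part of any library):

```typescript
// Hypothetical helper: serialise one payload as a single SSE event.
// JSON.stringify (without an indent argument) never emits raw newlines,
// so a single "data:" line is enough for any JSON payload.
function formatSseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// One token delta, framed the same way the route handler in this post frames it.
const frame = formatSseEvent({ delta: "Hello" });
```

The trailing blank line is what tells the receiver that the event is complete, which is why the client later splits on "\n\n".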

The architecture

Browser → /api/chat (Next.js Route Handler) → Ollama (localhost:11434)
          ↑
          emits SSE chunks back to the browser as Ollama produces tokens

Three pieces:

  1. Server route — pipes Ollama’s stream into the response.
  2. Client hook — reads the stream and updates state.
  3. UI — renders the materialising text.

Server: the route handler (app/api/chat/route.ts)

export async function POST(request: Request) {
  const { message } = await request.json();
  const ollama = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:7b",
      messages: [{ role: "user", content: message }],
      stream: true,
    }),
  });

  if (!ollama.ok || !ollama.body) {
    return new Response("upstream error", { status: 502 });
  }

  const stream = new ReadableStream({
    async start(controller) {
      const reader = ollama.body!.getReader();
      const decoder = new TextDecoder();
      const encoder = new TextEncoder();
      let buffer = "";
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split("\n");
          buffer = lines.pop() ?? "";
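          for (const line of lines) {
            if (!line.trim()) continue;
            // Parse on its own so that only malformed JSON is ignored;
            // errors thrown by enqueue() still end the loop.
            let obj: { message?: { content?: string }; done?: boolean };
            try {
              obj = JSON.parse(line);
            } catch {
              continue; // ignore non-JSON lines
            }
            if (obj.message?.content) {
              // One SSE event per token delta: "data: <json>" plus a blank line.
              const sseChunk = `data: ${JSON.stringify({ delta: obj.message.content })}\n\n`;
              controller.enqueue(encoder.encode(sseChunk));
            }
            if (obj.done) {
              controller.enqueue(encoder.encode(`data: ${JSON.stringify({ done: true })}\n\n`));
            }
          }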
        }
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
      "X-Accel-Buffering": "no",
    },
  });
}

Two details that matter:

  • stream: true in the Ollama call. Ollama’s API streams by default, so this only makes the intent explicit; with stream: false, Ollama returns one big response after the whole generation finishes.
  • X-Accel-Buffering: no header. nginx buffers proxied responses by default, and this header turns buffering off for this response only. Other proxies and CDNs have their own buffering switches. With buffering on, you’ll see chunks arrive in a burst at the end.
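
A third detail hides in the handler’s buffer variable: network chunks do not line up with Ollama’s line boundaries, so a JSON object can arrive split across two reads. The sketch below isolates that buffering step as a pure function. The name drainNdjson and the OllamaChunk type are illustrative assumptions that cover only the two fields this post reads:

```typescript
// Only the fields this post uses; real Ollama responses carry more.
interface OllamaChunk {
  message?: { content?: string };
  done?: boolean;
}

// Hypothetical helper: append newly decoded text to the carried-over buffer,
// return every complete line as parsed JSON, and hand back the unfinished tail.
function drainNdjson(
  buffer: string,
  text: string,
): { chunks: OllamaChunk[]; rest: string } {
  const lines = (buffer + text).split("\n");
  const rest = lines.pop() ?? ""; // the last element is a partial line (or empty)
  const chunks: OllamaChunk[] = [];
  for (const line of lines) {
    if (!line.trim()) continue;
    try {
      chunks.push(JSON.parse(line) as OllamaChunk);
    } catch {
      // skip lines that are not valid JSON
    }
  }
  return { chunks, rest };
}

// First read ends mid-object: nothing is emitted, the fragment is kept.
const first = drainNdjson("", '{"message":{"content":"Hel');
// Second read completes that object and delivers the final one.
const second = drainNdjson(first.rest, 'lo"}}\n{"done":true}\n');
```

After the second call, second.chunks holds both objects and second.rest is empty, which mirrors what the while loop in the handler does on each iteration.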

Client: the hook

import { useState } from "react";

export function useChatStream() {
  const [response, setResponse] = useState("");
  const [loading, setLoading] = useState(false);

  async function send(message: string) {
    setResponse("");
    setLoading(true);
    try {
      const r = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ message }),
      });

      // A non-2xx reply (e.g. the handler's 502) carries no SSE events.
      if (!r.ok || !r.body) return;

      const reader = r.body.getReader();
      const decoder = new TextDecoder();
      let buffer = ""; // holds a trailing partial event between reads
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const events = buffer.split("\n\n"); // each SSE event ends with a blank line
        buffer = events.pop() ?? "";
        for (const event of events) {
          if (!event.startsWith("data: ")) continue;
          const json = JSON.parse(event.slice(6));
          if (json.delta) setResponse((prev) => prev + json.delta);
        }
      }
    } finally {
      // Runs on normal completion, early return, and thrown errors alike,
      // so the UI never gets stuck in the loading state.
      setLoading(false);
    }
  }
  return { response, loading, send };
}

That’s it for the streaming logic. Calling send("hello") updates response token by token.
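
One small option in both the handler and the hook deserves a closer look: decoder.decode(value, { stream: true }). A multi-byte UTF-8 character can be cut in half by a chunk boundary, and the streaming flag tells the decoder to hold the incomplete bytes until the next chunk arrives. A minimal sketch, where decodeChunks is a hypothetical helper written only for this comparison:

```typescript
// Decode a sequence of byte chunks into one string.
// streaming = true reuses one decoder with { stream: true };
// streaming = false decodes every chunk in isolation.
function decodeChunks(chunks: Uint8Array[], streaming: boolean): string {
  if (!streaming) {
    return chunks.map((c) => new TextDecoder().decode(c)).join("");
  }
  const decoder = new TextDecoder();
  let text = "";
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush any bytes still held back
}

// "你好" is six bytes in UTF-8; cut it after byte four, in the middle of "好".
const bytes = new TextEncoder().encode("你好");
const pieces = [bytes.slice(0, 4), bytes.slice(4)];

const intact = decodeChunks(pieces, true); // both characters survive the split
const damaged = decodeChunks(pieces, false); // contains U+FFFD replacement characters
```

With English-only output you may never notice the difference, but any model that emits CJK text or emoji will eventually hit a split character.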

UI: the chat box

"use client";
import { useState } from "react";
import { useChatStream } from "./useChatStream";

export default function Chat() {
  const [input, setInput] = useState("");
  const { response, loading, send } = useChatStream();

  return (
    <div className="max-w-2xl mx-auto p-4 space-y-4">
      <div className="min-h-[200px] p-4 border rounded whitespace-pre-wrap">
        {response || (loading ? "thinking..." : "ask me anything")}
      </div>
      <form onSubmit={(e) => { e.preventDefault(); send(input); setInput(""); }} className="flex gap-2">
        <input value={input} onChange={(e) => setInput(e.target.value)} className="flex-1 px-3 py-2 border rounded" placeholder="Ask Ollama..." />
        <button type="submit" disabled={loading || !input} className="px-4 py-2 bg-blue-600 text-white rounded disabled:opacity-50">
          Send
        </button>
      </form>
    </div>
  );
}

Run pnpm dev, hit the page, and watch tokens appear in real time.

Production-grade additions

The skeleton above works locally. To ship it:

  • Authentication. Add an auth check in the route handler before opening the upstream stream. Otherwise anyone with your URL can burn your local CPU.
  • Conversation history. The handler above takes a single message. Real chat sends the full history each time. Pass messages: ChatMessage[] and forward to Ollama.
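  • Cancellation. When the user navigates away, abort the upstream fetch so Ollama stops generating. Pass an AbortSignal to that fetch: either the incoming request.signal, or the signal of your own AbortController, whose abort() you call on disconnect.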
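  • Backpressure. If your client is slow, the stream controller’s queue grows. Check controller.desiredSize (zero or negative means the queue is full) and pause reading from Ollama, or read inside pull() instead of start() so that reads happen only when the consumer wants more.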
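  • Vercel deployment. Edge Runtime works for this pattern, but Vercel caps function duration (about 30 seconds here; the exact limit depends on your plan and runtime, so check the current docs). For longer generations, use Node Runtime or self-host. Local models running on your dev machine are obviously not callable from Vercel, so for production you’d swap in a remotely hosted model service.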
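
Three of those additions (history, cancellation, backpressure) fit in one sketch. Treat it as a starting point under stated assumptions, not a finished handler: parseHistory and handleChat are hypothetical names, the request body is assumed to be { messages: ChatMessage[] }, request.signal is assumed to fire when the client disconnects in your Next.js version, and the NDJSON-to-SSE conversion shown earlier is left out, so bytes pass through unchanged.

```typescript
type Role = "system" | "user" | "assistant";
interface ChatMessage { role: Role; content: string }

// Hypothetical validator: returns the history if the body matches the
// assumed shape { messages: ChatMessage[] }, otherwise null.
function parseHistory(body: unknown): ChatMessage[] | null {
  if (typeof body !== "object" || body === null) return null;
  const messages = (body as { messages?: unknown }).messages;
  if (!Array.isArray(messages) || messages.length === 0) return null;
  const roles: readonly string[] = ["system", "user", "assistant"];
  const valid = messages.every(
    (m) =>
      typeof m === "object" && m !== null &&
      roles.includes((m as ChatMessage).role) &&
      typeof (m as ChatMessage).content === "string",
  );
  return valid ? (messages as ChatMessage[]) : null;
}

// In app/api/chat/route.ts this would be exported under the name POST.
async function handleChat(request: Request): Promise<Response> {
  const history = parseHistory(await request.json().catch(() => null));
  if (!history) return new Response("bad request", { status: 400 });

  const upstream = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "qwen2.5:7b", messages: history, stream: true }),
    signal: request.signal, // cancellation: abort the Ollama call with the request
  });
  if (!upstream.ok || !upstream.body) {
    return new Response("upstream error", { status: 502 });
  }

  const reader = upstream.body.getReader();
  const stream = new ReadableStream<Uint8Array>({
    // Backpressure: pull() is called only when the consumer has room,
    // so a slow client pauses reads from Ollama instead of growing a queue.
    async pull(controller) {
      const { done, value } = await reader.read();
      if (done) controller.close();
      else controller.enqueue(value);
    },
    // If the consumer cancels the stream, stop reading from Ollama too.
    cancel(reason) {
      return reader.cancel(reason);
    },
  });

  // Raw NDJSON passthrough in this sketch; re-frame as SSE as shown earlier.
  return new Response(stream, {
    headers: { "Content-Type": "application/x-ndjson" },
  });
}
```

Validating the history before opening the upstream connection means a malformed request costs nothing on the Ollama side, and it is also the natural place to put the authentication check.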
