Time-Series LLMs, Explained with t0-alpha

A practical walkthrough of time-series foundation models: how they forecast with patches and quantiles, where t0-alpha fits on GIFT-Eval, and what its reproducible result says about the field.

I wanted a concrete way to understand the new time-series foundation models, so I picked a recent one I could run. t0-alpha is a 102M-parameter probabilistic forecaster from The Forecasting Company, released in June 2026. The company published the weights under Apache-2.0, which is what makes this reproduction possible: the model is small enough to run on accessible hardware, and it ships with GIFT-Eval results that can be checked outside the original lab.

The model shows the basic recipe behind many current time-series LLMs. It cuts a numerical sequence into patches, processes those patches with a causal transformer, and emits quantiles rather than a single future line. That is close enough to language modelling to make the analogy useful, but different enough that the details matter.

I also re-ran the benchmark. On GIFT-Eval, t0-alpha reproduced its reported headline numbers exactly: CRPS 0.4941 and MASE 0.7240.

Figure 1: Accuracy versus size on GIFT-Eval. CRPS on GIFT-Eval plotted against parameter count. At 102M parameters, t0-alpha sits in the competitive cluster of models not flagged for leakage, although TiRex is both smaller and slightly more accurate. Hollow markers indicate models GIFT-Eval flags for test-data leakage. Lower CRPS is better; the vertical axis is inverted so better models appear higher.

This post uses t0-alpha to explain how time-series foundation models work, how they are evaluated, where they beat classical baselines, and where they still fail. It also argues that the next useful gains may come from calibration, routing, leakage control, stronger baselines, and domain-specific estimators rather than another small transformer variation.

How the model turns a time series into something a transformer can read

A language model starts with tokens. A time-series foundation model has to make tokens out of numbers. t0-alpha does this by cutting the input into fixed windows of 32 time steps. Each window becomes a patch. The model embeds those patches, passes them through a decoder-style transformer, and predicts future quantiles.
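
The patching step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not t0-alpha's actual preprocessing: the function name `to_patches` is hypothetical, and left-padding short inputs with `None` is my own choice for handling series whose length is not a multiple of 32. Only the patch length of 32 comes from the description above.

```python
from typing import List, Optional

PATCH_LEN = 32  # patch length stated in the text


def to_patches(series: List[float],
               patch_len: int = PATCH_LEN) -> List[List[Optional[float]]]:
    """Split a series into fixed-length patches.

    Assumption: the series is left-padded with None so that the most
    recent observations always fill the final patch completely.
    """
    remainder = len(series) % patch_len
    pad = (patch_len - remainder) % patch_len
    padded: List[Optional[float]] = [None] * pad + list(series)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]


# A toy series of 70 steps becomes 3 patches of 32 steps each.
series = [float(t) for t in range(70)]
patches = to_patches(series)
print(len(patches), len(patches[0]))
```

Each patch would then be mapped to an embedding vector, which plays the role a token embedding plays in a language model.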

The causal part matters. When t0-alpha forecasts the next window, it can only attend to the past. It does not see the answer window during generation. The quantile part matters too. The model is not just drawing one expected future line. It emits a set of quantiles, which represent a forecast distribution. In my run I used nine quantile levels, from 0.1 to 0.9. That is why CRPS is a useful metric here. It rewards a model both for being accurate and for putting the right amount of uncertainty around the forecast.
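
To make the link between quantiles and CRPS concrete, here is a minimal sketch that scores a quantile forecast with the pinball (quantile) loss and averages it over the levels. Twice the average pinball loss over a grid of levels is a common discrete approximation to CRPS. Whether it matches the exact scoring code behind the numbers above is an assumption on my part, and so is the even 0.1 spacing of the nine levels. The function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: nine evenly spaced levels, 0.1 through 0.9.
LEVELS = [i / 10 for i in range(1, 10)]


def pinball_loss(y: float, q_pred: float, level: float) -> float:
    """Quantile loss for one observation and one predicted quantile."""
    diff = y - q_pred
    return max(level * diff, (level - 1.0) * diff)


def approx_crps(y: float, forecast: List[Tuple[float, float]]) -> float:
    """Approximate CRPS as twice the mean pinball loss over the levels.

    `forecast` is a list of (level, predicted_quantile) pairs.
    """
    losses = [pinball_loss(y, q, level) for level, q in forecast]
    return 2.0 * sum(losses) / len(losses)


def centred_forecast(centre: float, width: float) -> List[Tuple[float, float]]:
    """Toy forecast: quantiles spread linearly around a centre value."""
    return [(lv, centre + (lv - 0.5) * width) for lv in LEVELS]


y_true = 10.0
sharp_and_right = approx_crps(y_true, centred_forecast(10.0, 2.0))
too_wide = approx_crps(y_true, centred_forecast(10.0, 20.0))
sharp_but_wrong = approx_crps(y_true, centred_forecast(13.0, 2.0))
print(sharp_and_right, too_wide, sharp_but_wrong)
```

In this toy case the forecast that is both centred on the truth and narrow gets the lowest score. Widening the interval or shifting the centre both raise it, which is the behaviour described above.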

Figure 2: t0-alpha architecture. t0-alpha is a decoder-style patch transformer for probabilistic time-series forecasting. Raw series are split into 32-step patches, embedded, processed through causal-time-attention and group-attention layers, and decoded into future quantiles rather than a single point forecast.
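
The causal constraint on the time axis can be pictured as a lower-triangular mask over patches. The sketch below is generic and illustrates only the idea of causal attention. It says nothing about how t0-alpha implements its causal-time-attention or group-attention layers, and the function name is hypothetical.

```python
from typing import List


def causal_patch_mask(num_patches: int) -> List[List[bool]]:
    """mask[i][j] is True when patch i may attend to patch j.

    A patch may attend to itself and to earlier patches, never to later ones.
    """
    return [[j <= i for j in range(num_patches)] for i in range(num_patches)]


mask = causal_patch_mask(4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Row i shows what patch i can see: everything up to and including itself, and nothing from the future.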

Two kinds of time-series LLM

The phrase “time-series LLM” is used for two different things. The first kind is trained natively on time-series data. These models turn numerical sequences into patches or tokens, train a transformer on many forecasting datasets, and produce forecasts directly. t0-alpha, TimesFM, Toto, Chronos, TiRex and Moirai are broadly in this group, although their architectures differ.

The second kind starts with a pretrained text LLM and adapts it to forecasting. These systems reprogram, prompt, or wrap a language model so that it can process numerical sequences. Time-LLM is a representative example of this direction. This article is about the first kind.

GIFT-Eval benchmark setup

I use GIFT-Eval as the main benchmark in this article. It has 97 task configurations from 55 datasets across seven domains. It includes short and long horizons, frequencies from secondly to yearly, univariate and multivariate series, and probabilistic scoring. It also includes several datasets that appear repeatedly in forecasting work, including M4, ETT, and a large slice of the Monash archive. That breadth makes it a useful benchmark for comparing general-purpose forecasting models.

The two headline metrics are MASE and CRPS. Both scores are normalised against Seasonal Naive, the baseline that repeats the previous season. A score of 1.000 means the model matches that baseline. Scores below 1.000 are better, and scores above 1.000 are worse.
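
A small worked example shows what these numbers mean. The sketch below builds a Seasonal Naive forecast, computes MASE using its standard definition (forecast error scaled by the in-sample seasonal-naive error), and then expresses a model's per-task scores relative to the baseline. Aggregating the per-task ratios with a geometric mean is my assumption about how the headline figure is formed. The text above only says the scores are normalised against Seasonal Naive. All function names are hypothetical.

```python
import math
from typing import List


def seasonal_naive(history: List[float], horizon: int, season: int) -> List[float]:
    """Forecast by repeating the last observed season."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def mase(history: List[float], actual: List[float],
         forecast: List[float], season: int) -> float:
    """Mean absolute error scaled by the in-sample seasonal-naive error."""
    in_sample = [abs(history[t] - history[t - season])
                 for t in range(season, len(history))]
    scale = sum(in_sample) / len(in_sample)
    error = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return error / scale


def relative_score(model: List[float], baseline: List[float]) -> float:
    """Geometric mean of per-task model/baseline ratios (assumed aggregation)."""
    logs = [math.log(m / b) for m, b in zip(model, baseline)]
    return math.exp(sum(logs) / len(logs))


# Toy quarterly-style series: a repeating pattern that rises by 2 each season.
history = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]
actual = [16, 26, 36, 46]

baseline_forecast = seasonal_naive(history, horizon=4, season=4)
model_forecast = [15, 27, 36, 45]  # made-up model output for illustration

baseline_mase = mase(history, actual, baseline_forecast, season=4)
model_mase = mase(history, actual, model_forecast, season=4)
print(baseline_mase, model_mase, relative_score([model_mase], [baseline_mase]))
```

On this toy task the made-up model scores below 1.0 relative to Seasonal Naive, which is the sense in which a reported MASE of 0.7240 indicates an improvement over the baseline.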