Featuring Every Eval Ever Results on Hugging Face Model Pages

Every Eval Ever (EEE) and Hugging Face Community Evals are now interoperable. We support cross-posting and interpreting evaluation results, with links to open models, leaderboards, and a unified, standardized metadata store.

EEE launched in February 2026 as a project of the EvalEval Coalition, the first cross-institutional effort to improve how AI evaluation results get reported by both first- and third-party evaluators. Hugging Face launched Community Evals in February 2026 to decentralize how benchmark scores get reported on the Hub. Combined, they patch gaps in how users, researchers, and policymakers trust, understand, and choose evaluations and models.

Evaluation results are how we measure model capabilities, compare models against each other, and reason about safety and governance, and yet they are scattered and hard to compare. They live in papers, leaderboards, blog posts, and harness logs, among others, each in its own format. The same model on the same benchmark often returns different scores depending on who ran it and how; LLaMA 65B, for one, has been reported at both 63.7 and 48.8 on MMLU. Such gaps can arise from evaluation settings that, as we found, commonly go unreported.

EEE is our fix for the reporting side. It’s one JSON schema for an evaluation result that records who ran it, which model, how it was accessed, the generation settings, what the metric actually means, and (recommended) a companion JSONL file for per-sample outputs. The schema was built with feedback from researchers and policy researchers, and it takes in results from any source, so harness logs, leaderboard scrapes, and paper numbers all end up in the same shape. The GitHub repository has the converters, examples, and a contributor guide.
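To make the shape of such a record concrete, here is a minimal Python sketch. It is not the actual EEE schema: the class `EvalRecord`, every field name, and all values are hypothetical placeholders chosen to mirror the list above (who ran it, which model, how it was accessed, generation settings, metric meaning, and an optional per-sample JSONL file). The real field names are defined in the schema in the GitHub repository.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """Hypothetical, simplified stand-in for one EEE result (not the real schema)."""
    evaluator: str            # who ran the evaluation
    model_id: str             # which model was evaluated
    access_method: str        # how the model was accessed, e.g. "api" or "local"
    generation_config: dict   # generation settings such as temperature
    metric_name: str
    metric_description: str   # what the metric actually means
    score: float
    samples_file: Optional[str] = None  # optional companion JSONL file


def record_to_json(record: EvalRecord) -> str:
    """Serialize one record as a JSON document."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def samples_to_jsonl(samples: list) -> str:
    """Serialize per-sample outputs as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(sample, sort_keys=True) for sample in samples)


# Placeholder values only; none of these are real results.
record = EvalRecord(
    evaluator="example-lab",
    model_id="example-org/example-model",
    access_method="api",
    generation_config={"temperature": 0.0, "max_tokens": 256},
    metric_name="accuracy",
    metric_description="Fraction of questions answered exactly right",
    score=0.5,
    samples_file="samples.jsonl",
)
print(record_to_json(record))
print(samples_to_jsonl([{"id": 1, "correct": True}, {"id": 2, "correct": False}]))
```

Keeping the aggregate record in JSON and the per-sample outputs in JSONL means the summary stays small while the instance-level data can be streamed line by line.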

Since launching, the datastore on Hugging Face has grown to around 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, pulled from 31 different reporting formats. Reproducing just those runs from scratch would cost somewhere in the hundreds of thousands of dollars, which is a reasonable argument for not letting the data scatter once someone has paid to generate it. Learn more about the schema and how to contribute here.

EEE now comes with better integration and attribution. Contributors can send EEE results to Hugging Face Community Evals. We built a converter that takes your EEE records and writes the small YAML files Hugging Face expects, so you don’t have to keep the same result in two formats by hand. This is new functionality for everyone who reports or reads evaluations, not only existing EEE contributors.
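The idea behind such a converter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the converter we ship: the function names `yaml_scalar` and `eee_to_eval_yaml`, the input keys (`benchmark_id`, `score`, `record_url`), and the YAML layout are all hypothetical. Check the Community Evals documentation for the layout Hugging Face actually expects.

```python
import json


def yaml_scalar(value) -> str:
    """Render a scalar for YAML: booleans and numbers bare, strings double-quoted."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # A JSON string literal is also a valid YAML double-quoted scalar.
    return json.dumps(str(value))


def eee_to_eval_yaml(record: dict) -> str:
    """Turn one (hypothetical) EEE-style record into a small YAML snippet.

    The keys read from `record` and the YAML layout written here are
    assumptions for illustration, not the real EEE or Hugging Face formats.
    """
    lines = [
        "- dataset:",
        f"    id: {yaml_scalar(record['benchmark_id'])}",
        f"  value: {yaml_scalar(record['score'])}",
        "  source:",
        f"    url: {yaml_scalar(record['record_url'])}",
        f"    name: {yaml_scalar('Every Eval Ever')}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder record; the identifiers and URL are made up.
example = {
    "benchmark_id": "example-org/example-benchmark",
    "score": 0.5,
    "record_url": "https://example.org/eee/records/1",
}
print(eee_to_eval_yaml(example))
```

The point of the design is that the YAML carries only the score and a pointer back to the source, while the full record stays in EEE.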

First-party evaluators reporting on their own models and third-party evaluators reporting on someone else’s can both submit to Community Evals and to EEE, and anyone browsing the Hub gets results that trace back to a full record. When you submit your data through your organization’s official Hugging Face account, your results show up with a verified checkmark on EvalEval, a signal to readers that the numbers come straight from the source. The rest of this post walks through what Community Evals are and what the converter does.

How Hugging Face Community Evals works together with EvalEval

Hugging Face Community Evals has two sides. A benchmark lives in a dataset repo that registers itself by adding an eval.yaml. Once registered, that dataset page collects and displays a leaderboard of every score reported against it across the Hub. The list of official benchmarks grows over time.

A model’s scores live in .eval_results/*.yaml inside the model repo. They show up on the model card and feed into the matching benchmark leaderboard. Both the model author’s own results and results submitted by anyone else through a pull request get aggregated, and each score carries a badge saying whether it was author-submitted, community-submitted, or independently verified. Anyone can add a score to any model by opening a PR with the right YAML file, and the model author can close PRs or hide results on their own repo.
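The three badge types can be pictured with a small Python sketch. The Hub’s actual rules for assigning badges are not spelled out in this post, so everything here is an assumption for illustration: the `Submission` class, its fields, and the rule that a submitter matching the repo owner counts as the author are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical view of one score submitted against a model repo."""
    model_repo: str         # e.g. "example-org/example-model"
    submitted_by: str       # account or organization that submitted the score
    score: float
    verified: bool = False  # assumption: set by some independent verification step


def badge_for(sub: Submission) -> str:
    """Pick one of the three badge labels (illustrative rule, not the Hub's logic)."""
    if sub.verified:
        return "verified"
    owner = sub.model_repo.split("/", 1)[0]
    if sub.submitted_by == owner:
        return "author-submitted"
    return "community-submitted"


# Placeholder submissions; none of these are real results.
submissions = [
    Submission("example-org/example-model", "example-org", 0.5),
    Submission("example-org/example-model", "third-party-lab", 0.48),
    Submission("example-org/example-model", "third-party-lab", 0.49, verified=True),
]
for sub in submissions:
    print(sub.submitted_by, sub.score, badge_for(sub))
```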

This is where EEE and Community Evals fit together. When you send a result to both, two things happen. First, your score appears on the Hugging Face model page and gets pulled into the benchmark’s leaderboard. Second, it carries a source badge that links straight back to the full EEE record, where the generation config, the harness version, the reproducibility notes, and any instance-level data live.

The two destinations do different jobs toward the same goal. Hugging Face puts your result where people look at models, with a link back to the source. EEE keeps the full structured record that makes the result interpretable, and powers Eval Cards on top of it. Send your data to both and the same evaluation ends up visible and legible at once, which is the point of reporting an evaluation at all.