Engineering CellFateBench: A Reproducible Python Benchmark for Single-Cell Genomics Reasoning

CellFateBench is a scientific software and benchmark-engineering project for evaluating reasoning over single-cell genomics workflows. It was designed around a practical question: how can single-cell analysis outputs be turned into reproducible benchmark tasks with public prompts, hidden answer keys, oracle outputs, scoring, calibration, Docker validation, and CI?

What CellFateBench does: Single-cell genomics workflows often produce outputs such as clusters, embeddings, marker tables, pseudotime summaries, spatial patterns, topology summaries, and RNA velocity layers. Those outputs still require interpretation. A solver may need to decide which state is likely upstream, whether a branch is terminal, whether a spatial pattern is meaningful, whether a ring-like pattern supports a cyclic claim, or whether RNA velocity evidence is strong enough to support a directionality statement.

CellFateBench focuses on that reasoning layer. It converts single-cell analysis contexts into structured benchmark assets: public benchmark tasks, hidden answer keys, oracle outputs, deterministic validators, scoring outputs, calibration logs, difficulty rebalancing outputs, reproducible pipelines, Docker validation, and GitHub Actions CI. The project is not a notebook-only analysis; it is structured as a reproducible scientific software repository.

Repository architecture: The repository keeps source code, workflow scripts, tests, benchmark assets, documentation, and generated outputs in separate locations.

cellfatebench-single-cell-analysis/
├── benchmark_tasks/
│   ├── public/
│   ├── hidden/
│   ├── oracle_outputs/
│   └── calibration_logs/
├── configs/
├── data/
│   ├── raw/
│   ├── processed/
│   ├── reference/
│   └── synthetic/
├── docs/
├── results/
│   ├── figures/
│   ├── reports/
│   └── tables/
├── sample_solver_answers/
├── scripts/
├── src/cellfatebench/
├── tests/
├── Dockerfile
├── Makefile
├── environment.yml
└── README.md

The key design decision is that benchmark assets are explicit and inspectable. Public tasks are never mixed with hidden answers, oracle outputs are stored separately, scoring code is separate from task-generation code, and pipelines are exposed through Makefile commands. That structure makes the project easier to review, test, and extend.
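The separation between public tasks and hidden answers can be checked mechanically. The sketch below shows one way such a check might look. The field names (`task_id`, `prompt`, `expected_answer`, `scoring_evidence`, `oracle_rationale`) and the `find_leaks` helper are hypothetical illustrations, not the schema or code used in the CellFateBench repository.

```python
# Hypothetical field names; the real CellFateBench task schema may differ.
HIDDEN_ONLY_FIELDS = {"expected_answer", "scoring_evidence", "oracle_rationale"}


def find_leaks(public_tasks):
    """Return (task_id, field) pairs where a public task exposes a hidden-only field."""
    leaks = []
    for task in public_tasks:
        # Iterating a dict yields its keys, so this intersects field names.
        for field in sorted(HIDDEN_ONLY_FIELDS.intersection(task)):
            leaks.append((task.get("task_id", "<unknown>"), field))
    return leaks


public_tasks = [
    {"task_id": "traj_001", "prompt": "Which state is the most likely root?"},
    {
        "task_id": "spat_002",
        "prompt": "Is the marker spatially variable?",
        "expected_answer": "yes",  # deliberately leaky example
    },
]

leaks = find_leaks(public_tasks)
print(leaks)
```

A check of this kind would fit naturally in the test suite, so that a leaked answer field fails CI before a task set is published.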

Two benchmark layers: v1 and v2. CellFateBench currently has two layers, summarised in the table below.

Layer	Purpose	Description
v1	Controlled benchmark	Synthetic single-cell data with known hidden truth for trajectory, spatial, and topology reasoning
v2	Public RNA velocity extension	Public scVelo pancreas dataset layer with RNA velocity reasoning tasks, solver evaluation, empirical calibration, and difficulty rebalancing

This design lets the project balance two needs: controlled hidden truth for deterministic scoring, and public dataset context for biological realism.

v1: controlled synthetic benchmark. The v1 layer uses controlled synthetic single-cell data. This matters because benchmark scoring requires known answers. In many real datasets, the true biological state, lineage structure, or spatial domain assignment may be uncertain. Synthetic data lets the benchmark define hidden truth and use it for deterministic evaluation.

The v1 dataset includes 900 synthetic cells and 60 genes, with a designed root or progenitor state, a transition state, terminal states, branch labels, pseudotime values, spatial coordinates, spatial domains, and a designed topology.
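To make the idea of designed hidden truth concrete, here is a minimal sketch of a seeded generator for a bifurcating toy dataset of the same size. It uses only the standard library. The state names, pseudotime cut-offs, expression model, and the `make_synthetic_cells` function are assumptions made for illustration and do not describe how CellFateBench generates its data.

```python
import random


def make_synthetic_cells(n_cells=900, n_genes=60, seed=0):
    """Generate toy cell metadata and expression with a known bifurcating design."""
    rng = random.Random(seed)  # a fixed seed keeps the output reproducible
    cells, expression = [], []
    for i in range(n_cells):
        pseudotime = rng.random()  # uniform in [0, 1)
        if pseudotime < 0.3:  # assumed cut-off for the root state
            state, branch = "root", "trunk"
        elif pseudotime < 0.5:  # assumed cut-off for the transition state
            state, branch = "transition", "trunk"
        else:
            branch = rng.choice(["A", "B"])
            state = f"terminal_{branch}"
        cells.append({
            "cell_id": f"cell_{i:04d}",
            "state": state,
            "branch": branch,
            "pseudotime": round(pseudotime, 4),
            "x": rng.uniform(0, 100),  # toy spatial coordinates
            "y": rng.uniform(0, 100),
        })
        row = []
        for g in range(n_genes):
            if g < n_genes // 3:
                mean = 5.0 * (1.0 - pseudotime)  # early genes fall along pseudotime
            elif g < 2 * n_genes // 3:
                mean = 5.0 * pseudotime if branch == "A" else 0.5  # rise on branch A only
            else:
                mean = 5.0 * pseudotime if branch == "B" else 0.5  # rise on branch B only
            row.append(max(0.0, rng.gauss(mean, 0.5)))  # clip negatives to zero
        expression.append(row)
    hidden_truth = {
        "root_state": "root",
        "terminal_states": ["terminal_A", "terminal_B"],
        "n_branches": 2,
    }
    return cells, expression, hidden_truth


cells, expression, hidden_truth = make_synthetic_cells()
print(len(cells), len(expression[0]), hidden_truth["n_branches"])
```

Because the design is fixed in code and the generator is seeded, the hidden truth is known by construction and the same inputs always produce the same dataset.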

Generated synthetic outputs include:

data/synthetic/synthetic_cell_metadata.csv
data/synthetic/synthetic_expression_matrix.csv
data/synthetic/synthetic_gene_metadata.csv
data/synthetic/synthetic_hidden_truth.json

The synthetic hidden-truth file is central to the v1 benchmark: it allows tasks to be scored against known answers.

v1 task families: The v1 benchmark contains three task families.

Trajectory and pseudotime reasoning: These tasks test reasoning about root-state inference, terminal-state inference, transition-state placement, early-to-late pseudotime ordering, and masked terminal-state recovery.
Spatial pattern reasoning: These tasks test reasoning about spatially variable genes, domain-specific marker enrichment, masked spatial-domain recovery, and unsupported spatial claims.
Topological persistence reasoning: These tasks test reasoning about bifurcating structure, branch count, ring-like spatial signals, false-positive loop claims, and the difference between spatial topology and cell-fate topology. The topology layer uses GUDHI-based summaries to support topology-aware benchmark tasks.
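As an illustration of how an early-to-late pseudotime ordering task could be scored deterministically, the sketch below computes the fraction of state pairs that a solver orders consistently with hidden pseudotime values. This pairwise-concordance rule, the state names, and the pseudotime values are assumptions chosen for the example; the source does not specify the scoring rule CellFateBench actually uses. The sketch also assumes the hidden pseudotime values are distinct.

```python
from itertools import combinations


def pairwise_order_score(solver_order, hidden_pseudotime):
    """Fraction of state pairs whose relative order matches the hidden pseudotime."""
    rank = {state: i for i, state in enumerate(solver_order)}
    pairs = list(combinations(hidden_pseudotime, 2))
    if not pairs:
        return 0.0
    concordant = 0
    for a, b in pairs:
        if a not in rank or b not in rank:
            continue  # a state missing from the answer counts as a miss
        hidden_earlier = hidden_pseudotime[a] < hidden_pseudotime[b]
        solver_earlier = rank[a] < rank[b]
        if hidden_earlier == solver_earlier:
            concordant += 1
    return concordant / len(pairs)


# Hypothetical hidden mean pseudotime per state.
hidden = {"root": 0.05, "transition": 0.40, "terminal_A": 0.90}

perfect = pairwise_order_score(["root", "transition", "terminal_A"], hidden)
swapped = pairwise_order_score(["transition", "root", "terminal_A"], hidden)
print(perfect, swapped)
```

A pairwise score gives partial credit for mostly correct orderings, which an exact-match comparison would not.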

Public tasks, hidden answers, and oracle outputs: A key benchmark-design pattern in CellFateBench is the separation between public prompts and hidden answers. Public tasks are solver-facing. Hidden answers contain expected outputs and scoring-relevant evidence. Oracle outputs show reference-style answers with rationale, confidence, and supporting evidence. This structure helps prevent answer leakage and makes the benchmark easier to review.

A simplified benchmark structure looks like this:

public task
    |
    |  visible to solver
    v
solver answer
    |
    |  compared privately
    v
hidden answer key
    |
    |  scored by validators
    v
score report

Oracle outputs provide a human-readable reference, but they are not used as public prompts.
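The flow from public task to score report can be sketched as a small deterministic validator. This is a simplified illustration under stated assumptions: exact-match scoring after text normalisation, and hypothetical field names (`task_id`, `answer`, `expected_answer`) and function names. It is not the validator shipped in the repository.

```python
import json


def normalise(text):
    """Lower-case and collapse whitespace so formatting differences do not affect scoring."""
    return " ".join(str(text).strip().lower().split())


def score_task(task_id, solver_answer, hidden_key):
    """Compare one solver answer with its hidden key (hypothetical exact-match rule)."""
    expected = hidden_key["expected_answer"]
    correct = normalise(solver_answer.get("answer", "")) == normalise(expected)
    return {"task_id": task_id, "correct": correct, "score": 1.0 if correct else 0.0}


def build_score_report(public_tasks, solver_answers, hidden_keys):
    """Score every public task; unanswered tasks score zero."""
    results = []
    for task in public_tasks:
        task_id = task["task_id"]
        answer = solver_answers.get(task_id, {})
        results.append(score_task(task_id, answer, hidden_keys[task_id]))
    total = sum(r["score"] for r in results)
    mean_score = total / len(results) if results else 0.0
    return {"n_tasks": len(results), "mean_score": mean_score, "results": results}


# Toy inputs: the solver sees only public_tasks; hidden_keys stay private.
public_tasks = [
    {"task_id": "traj_001", "prompt": "Which state is the most likely root?"},
    {"task_id": "topo_002", "prompt": "How many terminal branches are supported?"},
]
hidden_keys = {
    "traj_001": {"expected_answer": "root"},
    "topo_002": {"expected_answer": "2"},
}
solver_answers = {
    "traj_001": {"answer": "  Root "},
    "topo_002": {"answer": "3"},
}

report = build_score_report(public_tasks, solver_answers, hidden_keys)
print(json.dumps(report, indent=2, sort_keys=True))
```

Keeping the comparison a pure function of the answer and the key means the same inputs always produce the same report, which is what makes the scoring reproducible in Docker and CI.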

v2: public RNA velocity extension. The v2 layer adds a public RNA velocity benchmark extension…