StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Abstract: Automatic generation of RTL code for digital hardware designs remains challenging because of long-horizon reasoning, multi-step dependencies, and the strict correctness constraints of Verilog and VHDL.

We present StepPRM-RTL, a framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to improve both the functional correctness and the reasoning fidelity of LLM-based RTL code generation.

StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and an incremental code modification.
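
A stepwise trajectory of this kind can be pictured as a list of (rationale, code modification) pairs. The sketch below is illustrative only: the names `Step`, `Trajectory`, `code_delta`, and the append-only edit model are assumptions made for this example, not the representation used by StepPRM-RTL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    rationale: str   # why this modification is made
    code_delta: str  # RTL text appended at this step (assumed append-only edits)


@dataclass
class Trajectory:
    spec: str                                   # natural-language design spec
    steps: List[Step] = field(default_factory=list)

    def code_after(self, k: int) -> str:
        """Return the partial RTL obtained by applying the first k steps."""
        return "\n".join(s.code_delta for s in self.steps[:k])

    def final_code(self) -> str:
        """Return the RTL obtained by applying every step."""
        return self.code_after(len(self.steps))


# Hypothetical trajectory for a small Verilog counter.
traj = Trajectory(
    spec="4-bit synchronous up-counter with active-high reset",
    steps=[
        Step("Declare the module interface.",
             "module counter(input clk, input rst, output reg [3:0] q);"),
        Step("Add the clocked process with a synchronous reset.",
             "  always @(posedge clk)\n"
             "    if (rst) q <= 4'd0;\n"
             "    else     q <= q + 4'd1;"),
        Step("Close the module.", "endmodule"),
    ],
)

print(traj.final_code())
```

Keeping each partial program recoverable through `code_after(k)` is what would let an evaluator score intermediate steps rather than only the final design.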

A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning.

Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories.
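
The abstract does not say which selection rule the search uses. As a minimal sketch, assuming the common UCT rule and a scalar reward per rollout, the selection and backpropagation steps of MCTS over candidate reasoning steps might look like the following. `Node`, `uct_score`, and the exploration constant `c` are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    step: str                       # candidate reasoning step at this node
    visits: int = 0                 # number of rollouts through this node
    value_sum: float = 0.0          # sum of rewards backed up through this node
    children: List["Node"] = field(default_factory=list)


def uct_score(parent_visits: int, child: Node, c: float = 1.4) -> float:
    """UCT: mean reward plus an exploration bonus for rarely visited children."""
    if child.visits == 0:
        return math.inf             # always try unvisited children first
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(parent.visits, ch))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one rollout's reward to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward


# Hypothetical tree: two alternative next steps under a shared root.
a = Node("use a blocking assignment", visits=8, value_sum=4.0)
b = Node("use a non-blocking assignment", visits=2, value_sum=1.6)
root = Node("root", visits=10, value_sum=5.6, children=[a, b])

chosen = select_child(root)
backpropagate([root, chosen], reward=1.0)
print(chosen.step, chosen.visits)
```

In this example the less-visited child is selected because it has both the higher mean reward and the larger exploration bonus.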

This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training.
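
One simple way to combine the two signals is a convex mix of the mean per-step PRM score and a binary outcome reward. This is a sketch under stated assumptions: the weighting form, the parameter `lam`, and the function name `combined_reward` are not taken from the paper, which may aggregate the rewards differently.

```python
from typing import Sequence


def combined_reward(step_scores: Sequence[float],
                    outcome_pass: bool,
                    lam: float = 0.5) -> float:
    """Mix dense process feedback with a sparse outcome signal.

    step_scores  : per-step PRM scores, assumed to lie in [0, 1]
    outcome_pass : whether the final RTL passed its functional checks
    lam          : weight on the process term (assumed hyperparameter)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = 1.0 if outcome_pass else 0.0
    return lam * process + (1.0 - lam) * outcome


# A trajectory with one weak step whose final design still passes.
r = combined_reward([1.0, 0.5], outcome_pass=True, lam=0.5)
print(r)
```

With `lam = 0` this reduces to purely outcome-based training, and with `lam = 1` it uses only the stepwise feedback.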

Experimental evaluation on benchmark Verilog and VHDL datasets shows that StepPRM-RTL outperforms the best prior methods by over 10% on functional correctness and reasoning fidelity metrics.

Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance.

StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.