StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
Abstract: Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL.
We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation.
StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, in which each step pairs a rationale with an incremental code modification.
A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning.
Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories.
This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training.
Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics.
Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance.
StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
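
As a rough illustration of the ideas summarized above, the following Python sketch shows one way a stepwise trajectory (a rationale plus an incremental code modification per step) might be represented, and how per-step PRM scores could be blended with an outcome reward. The abstract does not specify the data layout or the reward formula, so the names `Step`, `assemble`, and `trajectory_reward`, the weight `alpha`, and the simple weighted average are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step (hypothetical layout, not taken from the paper)."""
    rationale: str    # why this modification is made
    code_delta: str   # incremental RTL text added at this step
    prm_score: float  # assumed PRM score for the step, in [0, 1]


def assemble(steps: List[Step]) -> str:
    """Concatenate the incremental code edits into the final RTL text."""
    return "\n".join(s.code_delta for s in steps)


def trajectory_reward(steps: List[Step], outcome: float, alpha: float = 0.5) -> float:
    """Blend the mean stepwise score with an outcome reward.

    The weighted average and the default alpha are illustrative assumptions.
    """
    if not steps:
        # No process signal available: only the outcome term contributes.
        return (1.0 - alpha) * outcome
    process = sum(s.prm_score for s in steps) / len(steps)
    return alpha * process + (1.0 - alpha) * outcome


# A toy three-step trajectory for a 2-input AND gate in Verilog.
steps = [
    Step("Declare the module interface.",
         "module and2(input a, input b, output y);", 0.5),
    Step("Drive the output combinationally.",
         "  assign y = a & b;", 0.75),
    Step("Close the module.",
         "endmodule", 1.0),
]

# outcome = 1.0 stands in for "the assembled design passed its testbench".
reward = trajectory_reward(steps, outcome=1.0, alpha=0.5)
print(assemble(steps))
print(reward)
```

In this sketch the outcome term stands in for a functional-correctness check of the assembled design, while the process term rewards well-justified intermediate steps; a search procedure such as MCTS could use the same scalar to rank candidate trajectories.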