Legal Domain Adaptation of Modern BERT Models

We investigate the domain adaptation of modern BERT models to the legal domain. Specifically, we further pre-train ModernBERT on all US court opinions using the masked language modeling (MLM) objective.
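As a rough illustration of the masked language modeling objective, the sketch below shows how a training example could be built from a tokenized sentence. It is a simplified sketch, not the training pipeline used here: the `mask_tokens` helper, the `[MASK]` symbol, whitespace tokenization, and the default masking probability of 0.3 are all illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.3, seed=0):
    """Build one masked language modeling example (hypothetical helper).

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the positions the model has to reconstruct.
    mask_prob=0.3 is an illustrative default, not a value from the text.
    """
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # keep the original as the target
        else:
            masked.append(tok)
            labels.append(None)        # position is ignored by the loss
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer.
tokens = "the court held that the statute was unconstitutional".split()
masked, labels = mask_tokens(tokens)
```

During further pre-training, the model would be asked to predict the entries of `labels` that are not `None` from the surrounding unmasked context, which is how it absorbs the vocabulary and phrasing of court opinions.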

Although ModernBERT was trained on roughly 500x more data than the original BERT, we find that it still benefits from further pre-training and domain adaptation in the legal domain: we report significant improvements over vanilla ModernBERT on all datasets connected to US court opinions.

The gains are similar to those reported in early work on domain adaptation of BERT-like models. However, in our experiments, pre-training from scratch does not match the performance of further pre-training an existing ModernBERT checkpoint.

The resulting models can process sequences of up to 8,192 tokens. They can be used to compute meaningful embeddings of legal passages, or to quickly rerank hundreds of legal passages for a given search query. We release all model checkpoints publicly.
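One way such embeddings could support reranking is to score each passage by its cosine similarity to the query embedding. The sketch below assumes this embedding-based setup; the text does not specify how reranking is performed. The `cosine` and `rerank` helpers are hypothetical, and the three-dimensional vectors are toy stand-ins for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def rerank(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for embeddings of a query and three legal passages.
query = [1.0, 0.0, 1.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned with the query
]
order = rerank(query, passages)  # [1, 2, 0]
```

In practice the vectors would come from the model itself, for example by pooling its token representations for each passage, and the same scoring loop would apply to hundreds of candidates.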