NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

Abstract: High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms.

NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment and classification tasks using standardized datasets such as ASSIN 2 and PLUE.

On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (~0.904) among all encoders considered, although Albertina-900M and BERTimbau-large retain an advantage in other respects.
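
For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how accuracy and binary F1 are typically computed from gold labels and predictions. It is not the evaluation code used for NorBERTo: the function names and the toy labels are hypothetical, and treating F1 as the score of the positive class (label 1) is an assumption about the benchmark setup.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class: 2*TP / (2*TP + FP + FN).

    This is algebraically the harmonic mean of precision and recall.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # With no positive gold labels and no positive predictions, F1 is
    # undefined; returning 0.0 is a common convention (an assumption here).
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical toy labels (1 = paraphrase / entailment, 0 = not).
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1]

acc = accuracy(gold, pred)   # 6 of 8 correct
f1 = binary_f1(gold, pred)   # TP=4, FP=1, FN=1
print(f"accuracy={acc:.4f}  f1={f1:.4f}")
```

Because F1 ignores true negatives, it can diverge from accuracy on imbalanced label distributions, which is one reason a paraphrase task such as MRPC is commonly reported with F1.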

To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources in size. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.
