FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals


Abstract: SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods.

We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples.
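
To illustrate what a ratio-based feature looks like, the following is a minimal sketch in Python. The function name `ratio_features`, the four feature names, and the comment-prefix heuristic are all assumptions made for illustration; they are not the feature set used in the paper. Each value is a count divided by a size measure of the same snippet, so repeating a snippet leaves the values unchanged.

```python
import re


def ratio_features(snippet: str) -> dict[str, float]:
    """Length-normalized stylometric ratios (illustrative only)."""
    lines = snippet.splitlines()
    n_lines = max(len(lines), 1)    # guard against empty input
    n_chars = max(len(snippet), 1)

    stripped = [ln.strip() for ln in lines]
    blank = sum(1 for ln in stripped if not ln)
    # Assumption: '#' and '//' prefixes roughly approximate comment lines
    # across languages; a real system would rely on a parser instead.
    comment = sum(1 for ln in stripped if ln.startswith(("#", "//")))
    whitespace = sum(1 for ch in snippet if ch.isspace())

    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", snippet)
    # Assumption: 8+ characters is an arbitrary cutoff for a "long" identifier.
    long_ids = sum(1 for tok in identifiers if len(tok) >= 8)

    return {
        "blank_line_ratio": blank / n_lines,
        "comment_line_ratio": comment / n_lines,
        "whitespace_char_ratio": whitespace / n_chars,
        "long_identifier_ratio": long_ids / max(len(identifiers), 1),
    }


sample = "# add\n\nx = 1\ny = 2\n"
short = ratio_features(sample)
doubled = ratio_features(sample * 2)  # same style, twice the length
```

In this toy case `short` and `doubled` are identical, which is the property that makes ratios preferable to raw counts when snippet lengths vary widely.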

We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. The approach is computationally efficient: training requires only CPU resources and inference is near-instant, making it a lightweight alternative to large pretrained models.
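
One way such a combination could be wired together is sketched below in dependency-free Python. This is not the authors' implementation: the tiny Gini-based tree learner, the rule-first ordering, the toy data, and the example rule are all assumptions chosen to keep the sketch self-contained. Labels are assumed to be integers, with 0 for human-written and 1 for machine-generated.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_tree(rows, labels, depth=0, max_depth=2):
    """Fit a small binary tree; a leaf is an int label, an inner node is
    a tuple (feature_index, threshold, left_subtree, right_subtree)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None  # (weighted_impurity, feature_index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # a split must separate the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        return majority
    _, j, t = best
    lo = [i for i, r in enumerate(rows) if r[j] <= t]
    hi = [i for i, r in enumerate(rows) if r[j] > t]
    return (
        j,
        t,
        fit_tree([rows[i] for i in lo], [labels[i] for i in lo], depth + 1, max_depth),
        fit_tree([rows[i] for i in hi], [labels[i] for i in hi], depth + 1, max_depth),
    )


def tree_predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def predict(tree, row, rules):
    """Assumption: heuristic rules are checked first and override the tree;
    a rule returns a label, or None when it does not apply."""
    for rule in rules:
        verdict = rule(row)
        if verdict is not None:
            return verdict
    return tree_predict(tree, row)


# Hypothetical toy data: [comment_line_ratio, blank_line_ratio] per sample.
rows = [[0.05, 0.10], [0.02, 0.20], [0.30, 0.10], [0.40, 0.15]]
labels = [0, 0, 1, 1]
tree = fit_tree(rows, labels, max_depth=2)

# Hypothetical rule: very sparse layouts are labeled human-written.
rules = [lambda row: 0 if row[1] > 0.5 else None]
```

Because the tree is only a few levels deep and the rules are plain comparisons, both training and inference in a setup like this run comfortably on a CPU.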
