FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

Abstract: SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods.

We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples.
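
To illustrate why ratio-based features are less sensitive to snippet length, the following minimal sketch computes a few length-normalized signals. The function name `ratio_features` and the specific features (blank-line, comment-line, and whitespace ratios) are assumptions chosen for illustration; they are not the feature set used in our system.

```python
def ratio_features(snippet: str) -> dict[str, float]:
    """Compute length-normalized stylometric ratios (illustrative features only)."""
    lines = snippet.splitlines()
    n_lines = max(len(lines), 1)    # guard against empty input
    n_chars = max(len(snippet), 1)

    stripped = [ln.strip() for ln in lines]
    blank = sum(1 for ln in stripped if not ln)
    # Assumption: only '#' and '//' line comments are recognized in this sketch.
    comment = sum(1 for ln in stripped if ln.startswith(("#", "//")))
    whitespace = sum(1 for ch in snippet if ch.isspace())

    return {
        "blank_line_ratio": blank / n_lines,
        "comment_line_ratio": comment / n_lines,
        "whitespace_char_ratio": whitespace / n_chars,
    }


if __name__ == "__main__":
    sample = "# add two numbers\nx = 1\n\ny = 2\n"
    print(ratio_features(sample))
    # Repeating the snippet doubles its length but leaves every ratio unchanged.
    print(ratio_features(sample * 2))
```

Because each value is a count divided by the total number of lines or characters, concatenating a snippet with itself changes the raw counts but not the ratios.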

We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and delivers near-instant inference, offering a lightweight alternative to large pretrained models.
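
One way such a combination could work is sketched below: heuristic rules are checked first, and the shallow tree decides whenever no rule fires. Everything specific here is hypothetical and serves only as an illustration: the feature names, the thresholds, the hand-written depth-2 tree, the `text_line_ratio` rule and its direction, and the label convention (1 for machine-generated, 0 for human-written). In practice the tree would be learned from training data, and the order in which rules and tree are applied is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

Features = dict[str, float]


@dataclass
class Node:
    """Internal node of a binary decision tree; leaves are plain int labels."""
    feature: str
    threshold: float
    left: Union["Node", int]    # followed when value <= threshold
    right: Union["Node", int]   # followed when value > threshold


def tree_predict(node: Union[Node, int], feats: Features) -> int:
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(node, Node):
        node = node.left if feats[node.feature] <= node.threshold else node.right
    return node


# Hypothetical depth-2 tree with made-up thresholds (not learned from data).
TREE = Node(
    "comment_line_ratio", 0.20,
    left=Node("blank_line_ratio", 0.10, left=0, right=1),
    right=1,
)


def rule_embedded_text(feats: Features) -> Optional[int]:
    """Hypothetical rule: fire when most lines look like natural language."""
    return 1 if feats.get("text_line_ratio", 0.0) > 0.5 else None


RULES: list[Callable[[Features], Optional[int]]] = [rule_embedded_text]


def predict(feats: Features) -> int:
    """Apply rules first; fall back to the tree if no rule fires."""
    for rule in RULES:
        label = rule(feats)
        if label is not None:
            return label
    return tree_predict(TREE, feats)


if __name__ == "__main__":
    example = {"comment_line_ratio": 0.05, "blank_line_ratio": 0.05,
               "text_line_ratio": 0.0}
    print(predict(example))
```

Inference in this sketch costs a handful of comparisons per sample, which is consistent with the CPU-only, near-instant behavior described above.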
