Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

Abstract: Large language models (LLMs) show extraordinary abilities, but they remain prone to hallucinations, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations specific to academic writing.

We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated each model using a 0-5 rubric covering factual accuracy, reference validity, coherence, style consistency, and academic tone. We also introduced a novel weighted metric, the Hallucination Index (HI), to measure hallucination in the responses generated by the models.
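The abstract does not spell out how the HI aggregates the rubric scores; a minimal sketch, assuming equal weights over the five rubric dimensions and a 0-5 scale normalized to [0, 1] (with higher values indicating fewer hallucinations, consistent with the scores reported below), might look like:

```python
# Hypothetical sketch of a weighted Hallucination Index (HI).
# The paper's actual weights and aggregation rule are not given in
# the abstract; equal weights and 0-5 rubric normalization are assumed.

RUBRIC_DIMENSIONS = [
    "factual_accuracy",
    "reference_validity",
    "coherence",
    "style_consistency",
    "academic_tone",
]

def hallucination_index(scores, weights=None):
    """Aggregate 0-5 rubric scores into a [0, 1] index.

    scores:  dict mapping each rubric dimension to a 0-5 score.
    weights: optional dict of per-dimension weights (equal by default).
    """
    if weights is None:
        weights = {d: 1.0 for d in RUBRIC_DIMENSIONS}
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[d] * scores[d] for d in RUBRIC_DIMENSIONS)
    return weighted_sum / (5.0 * total_weight)  # rescale 0-5 to [0, 1]

# Example: a response scoring 3.5 on every dimension
example = {d: 3.5 for d in RUBRIC_DIMENSIONS}
print(round(hallucination_index(example), 2))  # 0.7
```

The weight dictionary makes it easy to penalize, say, reference validity more heavily than style consistency, which is presumably the point of a weighted rather than uniform metric.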

We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively.

Gemini and ChatGPT, in contrast, exhibit stronger tone control but fall short on factual tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not only on model architecture but also on the task type and the prompting conditions. We believe this work opens new research dimensions for future researchers.
