Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

Abstract: Quantized Large Language Models (LLMs) are increasingly used in qualitative analysis because they run fast and need fewer computing resources. This study examines how low-bit quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis.
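To make the effect of bit width concrete, the following minimal sketch applies symmetric round-to-nearest quantization to synthetic weights and reports the reconstruction error at each level. It is an illustration only: the weight distribution, the per-tensor scaling, and the helper names `quantize_dequantize` and `mean_abs_error` are assumptions, and the quantization types evaluated in the study may use more elaborate schemes.

```python
import random


def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization to `bits` bits, mapped back to floats."""
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 127 for 8-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale for the whole tensor
    restored = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Synthetic stand-in for a weight tensor (assumed Gaussian, not taken from the model).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 3, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

In this toy setting the error grows as the bit width shrinks, which is consistent with the intuition that 3-bit and 2-bit models lose more information than 8-bit ones.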

The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce more hallucinations and less stable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations: it removes unreliable content and, after verification, passes the results on to the next transcript, improving accuracy.
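The verification loop described above might be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not specify the individual passes, so the sketch assumes an extraction pass followed by a grounding check that keeps a theme only if its supporting quote actually occurs in the transcript. The names `multi_pass_analysis`, `is_grounded`, and `stub_extract` are hypothetical, and `stub_extract` stands in for a call to a quantized LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: given a transcript and the themes verified so far,
# return candidate (theme, supporting_quote) pairs.
Extractor = Callable[[str, List[str]], List[Tuple[str, str]]]


def is_grounded(quote: str, transcript: str) -> bool:
    """Verification pass: accept a quote only if it occurs in the transcript."""
    normalized_quote = " ".join(quote.lower().split())
    normalized_text = " ".join(transcript.lower().split())
    return bool(normalized_quote) and normalized_quote in normalized_text


def multi_pass_analysis(
    transcripts: List[str], extract: Extractor
) -> Tuple[Dict[int, List[str]], List[str]]:
    verified_themes: List[str] = []          # carried forward between transcripts
    per_transcript: Dict[int, List[str]] = {}
    for idx, transcript in enumerate(transcripts):
        # Pass 1: candidate extraction, conditioned on previously verified themes.
        candidates = extract(transcript, verified_themes)
        # Pass 2: drop candidates whose evidence cannot be found in the text.
        kept: List[str] = []
        for theme, quote in candidates:
            if is_grounded(quote, transcript) and theme not in kept:
                kept.append(theme)
        per_transcript[idx] = kept
        # Only verified themes are passed on to the next transcript.
        for theme in kept:
            if theme not in verified_themes:
                verified_themes.append(theme)
    return per_transcript, verified_themes


def stub_extract(transcript: str, known_themes: List[str]) -> List[Tuple[str, str]]:
    """Toy extractor that also emits one ungrounded (hallucinated) theme."""
    candidates = []
    if "cost" in transcript.lower():
        candidates.append(("cost concerns", "the cost is too high"))
    candidates.append(("data privacy", "I worry about my data being sold"))
    return candidates


transcripts = [
    "Honestly, the cost is too high for our team.",
    "We mostly struggle with training time.",
]
results, themes = multi_pass_analysis(transcripts, stub_extract)
print(results)  # {0: ['cost concerns'], 1: []}
print(themes)   # ['cost concerns']
```

The exact-substring check is deliberately strict. A practical system would likely need a softer match (for example, a second model prompt that judges whether the quote supports the theme), at the price of reintroducing some risk of hallucination in the verifier itself.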

To validate performance, human coders analyzed the transcripts using NVivo and BF16 LLaMA-3.1. The BF16 model produced high-precision output but showed semantic drift and hallucination; these errors were corrected manually. The corrected BF16 output and the NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis.
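As an illustration of how a quantized model's output could be compared against the GSGT, the sketch below scores theme extraction with precision, recall, and F1, and scores frequency analysis with a mean absolute count difference. The abstract does not state which metrics the study uses, so these are assumed, commonly used choices. The theme names and counts are toy values, and `theme_scores` and `frequency_error` are hypothetical helpers.

```python
from collections import Counter


def theme_scores(predicted, gold):
    """Precision, recall, and F1 over the sets of extracted theme labels."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def frequency_error(predicted, gold):
    """Mean absolute difference in per-theme counts over the union of themes."""
    all_themes = set(predicted) | set(gold)
    if not all_themes:
        return 0.0
    # Counter returns 0 for missing themes, so absent themes count as zero mentions.
    return sum(abs(predicted[t] - gold[t]) for t in all_themes) / len(all_themes)


# Toy data for illustration only; not results from the study.
gsgt = Counter({"cost": 5, "training": 3, "trust": 2})
model_output = Counter({"cost": 4, "training": 3, "privacy": 1})

precision, recall, f1 = theme_scores(model_output, gsgt)
freq_err = frequency_error(model_output, gsgt)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"mean absolute frequency error={freq_err:.2f}")
```

Here the spurious "privacy" theme lowers precision and the missed "trust" theme lowers recall, while the frequency error captures over- and under-counting of themes.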

The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.
