He asked AI to count carbs 27,000 times. It couldn’t give the same answer twice
Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times. You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close, and the differences are large enough to cause a hypoglycaemic emergency.
That’s the central finding of a study I’ve just published as a preprint, and it has direct implications for anyone using AI-powered carb counting in a diabetes app.
The study
I submitted 13 food photographs (real meals, photographed on a phone, the way you’d actually use them) to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings. 26,904 queries in total, all at the lowest randomness setting these models offer. The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system: it’s a real production prompt, not a toy example.
The models disagree with themselves
Every model returned different carbohydrate estimates for the same photo across repeated queries. But the degree of disagreement varies enormously.
Figure: How much does each model disagree with itself? Each dot is one of the 13 test images. The violin shape shows the spread. Claude’s variation clusters below 5% for most images; the Gemini models regularly exceed 10-20%.
| Model | Median variation (CV) | Median insulin swing | Worst-case insulin swing |
|---|---|---|---|
| Claude Sonnet 4.6 | 2.4% | 0.9 U | 13.6 U |
| GPT-5.4 | 8.4% | 2.3 U | 16.6 U |
| Gemini 3.1 Pro | 10.3% | 2.9 U | 16.2 U |
| Gemini 2.5 Pro | 11.0% | 4.7 U | 42.9 U |
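The coefficient of variation (CV) in the table is the spread of a model's repeated estimates for one image relative to their mean; the table then reports the median across the 13 images. Below is a minimal sketch of the per-image calculation. It assumes the sample standard deviation, which may not be the estimator the study used, and the input values are made-up illustrative numbers, not study data.

```python
from statistics import mean, stdev


def coefficient_of_variation(estimates_g):
    """Return the CV, in percent, of repeated carb estimates for one image."""
    return 100.0 * stdev(estimates_g) / mean(estimates_g)


# Hypothetical repeated estimates (grams) for a single photo, NOT study data.
illustrative_estimates = [40.0, 42.0, 38.0, 41.0, 39.0]

cv_percent = coefficient_of_variation(illustrative_estimates)
print(f"CV = {cv_percent:.1f}%")
```

A CV near zero means the model repeats itself almost exactly. It says nothing about whether the repeated number is correct.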
The worst case? The paella photo. Here’s what happened when I sent it to each model 500+ times:
Figure: One photo of paella, 2,000+ answers. Every dot is one query. Same photo. Same prompt. Same model. Gemini 2.5 Pro’s estimates span from 55g to 484g, a 429g range equivalent to 42.9 units of insulin at a 1:10 insulin-to-carb ratio (ICR). Claude’s estimates cluster tightly by comparison.
42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.
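The 42.9-unit figure comes straight from the range of estimates and the 1:10 ratio (one unit of insulin per 10g of carbohydrate). A small sketch of that arithmetic follows; the function name and its default ratio are illustrative choices, not part of the study's code.

```python
def insulin_swing_units(min_carbs_g, max_carbs_g, icr_g_per_unit=10.0):
    """Insulin difference (units) between the lowest and highest carb estimate.

    icr_g_per_unit is grams of carbohydrate covered by one unit of insulin;
    10.0 corresponds to the 1:10 ratio used in the text.
    """
    return (max_carbs_g - min_carbs_g) / icr_g_per_unit


# Gemini 2.5 Pro's lowest and highest paella estimates, as quoted above.
paella_swing = insulin_swing_units(55, 484)
print(f"{paella_swing:.1f} U")  # 42.9 U
```

Someone with a different ratio would see a different swing from the same 429g range, so the unit figures here are specific to the 1:10 assumption.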
This variation is invisible to you
When you take a photo in a diabetes app, you get one number back. You have absolutely no way to know whether you received a typical estimate or a tail-end outlier from a distribution you can’t see. For Claude, that single number is probably close to the model’s consensus. For Gemini 2.5 Pro, you could be anywhere on the map.
The cheese sandwich that defeats AI
Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.
Three of four models (Claude, Gemini 2.5 Pro and Gemini 3.1 Pro) independently converge on approximately 28g for a 40g meal. 510 queries from Claude, CV of 0.3%, and every single one is about 12g below the actual value. The bread is right there in the photo. The carb value is on the packet. This is the “precisely wrong” problem: high consistency doesn’t guarantee accuracy. A diabetes app user getting 28g every time would consistently underdose by ~1.2 units at a 1:10 ICR. GPT-5.4 goes the other way: mean estimate 74g, nearly double the reference, and highly variable on top of it.
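To make the direction of each error explicit, here is a sketch that converts a carb estimate into a signed insulin error against the 40g reference. It assumes the same 1:10 ratio used for the paella example; the 3.4-unit figure for GPT-5.4 is simply what that assumption implies for a 74g mean, not a number reported in the study.

```python
def dose_error_units(estimated_carbs_g, reference_carbs_g, icr_g_per_unit=10.0):
    """Signed insulin error: negative is an underdose, positive an overdose."""
    return (estimated_carbs_g - reference_carbs_g) / icr_g_per_unit


REFERENCE_G = 40  # two slices of bread at 20g each, from the packet label

for label, estimate_g in [("three-model consensus", 28), ("GPT-5.4 mean", 74)]:
    error_u = dose_error_units(estimate_g, REFERENCE_G)
    print(f"{label}: {error_u:+.1f} U")
```

The first case is small but repeats on every query; the second is larger and points the other way, toward too much insulin.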
The models don’t always know what they’re looking at
I found food identification errors in 8 of the 13 test images:
- Bakewell tart: Claude called it a “Linzer torte” in 100% of 510 queries. GPT-5.4 called it a “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly named it (99.8%).
- Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.
- Cheese sandwich: Gemini 3.1 Pro added non-existent “deli meat” in 17.4% of queries — hallucinating an ingredient that isn’t there. This could directly inflate carbohydrate estimates.
Some of these misidentifications have modest nutritional impact. Others could change the carbohydrate estimate substantially.
Where does your insulin dose actually land?
On the five images where I had the strongest reference values (packet labels and weighed portions), here’s how often each model’s individual queries would have pushed insulin doses into clinically dangerous territory:
- Claude: 100% of queries in the safe or moderate zone. No single query would have caused more than a 2-unit insulin error.
- GPT-5.4: 37% of queries would cause a clinically significant insulin error (>2U). That’s more than one in three queries landing in the danger zone.
- Gemini 3.1 Pro Preview: 12% of queries would cause a clinically significant insulin error (>2U). Better than GPT-5.4.
- Gemini 2.5 Pro: 12% of queries would cause a >5U error — the threshold associated with severe hypoglycaemia requiring third-party assistance.
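The percentages above are shares of individual queries whose implied insulin error crosses a threshold (2U or 5U). A rough sketch of that tally is below, assuming a 1:10 ICR and counting absolute error in either direction; the study's exact zone definitions may differ, and the estimates are hypothetical values, not study data.

```python
def share_exceeding(estimates_g, reference_g, threshold_units, icr_g_per_unit=10.0):
    """Fraction of queries whose absolute insulin error exceeds threshold_units."""
    errors_u = [abs(e - reference_g) / icr_g_per_unit for e in estimates_g]
    return sum(err > threshold_units for err in errors_u) / len(errors_u)


# Hypothetical estimates (grams) for a meal with a 40g reference, NOT study data.
illustrative_estimates = [38, 41, 45, 62, 35, 95, 40, 66, 39, 43]

over_2u = share_exceeding(illustrative_estimates, 40, threshold_units=2.0)
over_5u = share_exceeding(illustrative_estimates, 40, threshold_units=5.0)
print(f">2 U: {over_2u:.0%}, >5 U: {over_5u:.0%}")
```

Counting per query rather than per image matches how the app is actually used: each photo produces one answer, and that answer is the one that gets dosed.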
Two types of risk
The study identifies two distinct failure modes:
- Systematic bias (chronic risk). All four models overestimate carbs on average, meaning the dominant direction of error is toward too much insulin and hypoglycaemia. GPT-5.4 averages +1.2 units overdose per meal on strong-reference foods. Three meals a day, that’s 3.6 units of extra insulin per day.
- Stochastic variability (acute risk). The within-image variation means a single unlucky query could produce a catastrophic outlier. Gemini 2.5 Pro’s worst single query on strong-reference data would have caused an 11.3 unit insulin overdose for a 34g meal. That’s a potential severe hypo.