My AI didn't misread a receipt. It fabricated one from scratch.

1 point by Raywob, 4 days ago
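I pointed a vision model at a grocery receipt. It returned a store name, an item list, and a total. None of it was on the paper.

This wasn't an OCR error. The model didn't confuse a "7" for a "1." It generated a plausible-looking receipt from scratch: different store, different items, different prices. If I hadn't been holding the original, I might not have caught it.

Same image, different model (same parameter count, same hardware), five seconds later: every item correct, store name right, total accurate to the penny.

The models: minicpm-v 8B (fabricated) vs qwen3-vl 8B (accurate). Both open source, both ~6GB VRAM, both running locally via Ollama on an RTX 5080.

What I learned:

1. Vision model hallucination is qualitatively different from text hallucination. A text model gives you a wrong answer to a real question. A vision model gives you a confident answer to an image it didn't process. The second is harder to detect.

2. Model selection matters more than prompt engineering for vision. Same prompt, same image: one model fabricated, one read accurately. No prompt optimization fixes a model that invents data.

3. Confidence scoring is mandatory. I added a reconciliation check: do the extracted items sum to roughly the stated total? This catches fabrication that looks plausible at the individual line-item level.

4. The fix wasn't more money or a bigger model. Same size (8B), same hardware, same cost ($0). Just a different architecture that actually reads pixels instead of generating plausible text about them.

Full writeup with the pipeline architecture and code patterns: https://dev.to/rayne_robinson_e479bf0f26/my-ai-read-a-receipt-wrong-it-didnt-misread-it-it-made-one-up-4f5n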
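For anyone who wants to try the reconciliation check from point 3, here is a minimal sketch in Python. It is not the code from the linked writeup. The `LineItem` type, the `reconcile` function, and the 5% tolerance are all assumptions made for illustration, and it presumes the model's output has already been parsed into item prices and a stated total.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    """One extracted receipt line (hypothetical shape, not from the post)."""
    name: str
    price: Decimal  # line price as read by the vision model


def reconcile(items, stated_total, tolerance=Decimal("0.05")):
    """Check whether the extracted items sum to roughly the stated total.

    Returns (ok, gap). `tolerance` is a fraction of the stated total; the
    5% default is an arbitrary assumption meant to leave room for tax or
    discounts that are not itemized.
    """
    # An empty item list or a non-positive total cannot be reconciled.
    if not items or stated_total <= 0:
        return False, stated_total
    items_sum = sum((item.price for item in items), Decimal("0"))
    gap = abs(items_sum - stated_total)
    return gap <= stated_total * tolerance, gap


# Made-up example values, not data from the receipt in the post.
items = [
    LineItem("milk", Decimal("3.49")),
    LineItem("bread", Decimal("2.99")),
    LineItem("eggs", Decimal("4.29")),
]
consistent_ok, consistent_gap = reconcile(items, Decimal("10.77"))
mismatch_ok, mismatch_gap = reconcile(items, Decimal("23.10"))
print(consistent_ok, consistent_gap)
print(mismatch_ok, mismatch_gap)
```

`Decimal` is used instead of `float` so that cent amounts add up exactly. A check like this only flags extractions whose numbers are internally inconsistent. A fabricated receipt whose invented prices happen to sum to its invented total would still pass, so it works as one signal, not as proof that the model read the image.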