


I pointed a vision model at a grocery receipt. It returned a store name, item list, and total. None of it was on the paper.

This wasn't OCR error. The model didn't confuse a "7" for a "1." It generated a plausible-looking receipt from scratch: different store, different items, different prices. If I hadn't been holding the original, I might not have caught it.

Same image, different model (same parameter count, same hardware), five seconds later: every item correct, store name right, total accurate to the penny.

The models: minicpm-v 8B (fabricated) vs qwen3-vl 8B (accurate). Both open source, both ~6GB VRAM, both running locally via Ollama on an RTX 5080.

What I learned:

1. Vision model hallucination is qualitatively different from text hallucination. A text model gives you a wrong answer to a real question. A vision model gives you a confident answer to an image it didn't process. The second is harder to detect.

2. Model selection matters more than prompt engineering for vision. Same prompt, same image: one model fabricated, one read accurately. No prompt optimization fixes a model that invents data.

3. Confidence scoring is mandatory. I added a reconciliation check: do the extracted items sum to roughly the stated total? This catches fabrication that looks plausible at the individual line-item level.

4. The fix wasn't more money or a bigger model. Same size (8B), same hardware, same cost ($0). Just a different architecture that actually reads pixels instead of generating plausible text about them.

Full writeup with the pipeline architecture and code patterns: https://dev.to/rayne_robinson_e479bf0f26/my-ai-read-a-receipt-wrong-it-didnt-misread-it-it-made-one-up-4f5n
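
The reconciliation check from point 3 could look roughly like the sketch below. This is not the code from the linked writeup. The function name `reconcile`, the `(name, price)` item shape, and the 10% relative tolerance (meant to leave room for tax or rounding) are all assumptions made for illustration.

```python
from decimal import Decimal


def reconcile(items, stated_total, rel_tolerance="0.10"):
    """Check whether extracted line items roughly sum to the stated total.

    items: list of (name, price) pairs as extracted by the model (assumed shape).
    stated_total: the total the model reported for the receipt.
    rel_tolerance: allowed gap as a fraction of the stated total (assumed 10%).
    Returns (ok, items_sum).
    """
    total = Decimal(str(stated_total))
    items_sum = sum((Decimal(str(price)) for _name, price in items), Decimal("0"))
    if total <= 0:
        # A zero or negative total cannot be reconciled; flag it.
        return False, items_sum
    gap = abs(items_sum - total) / total
    return gap <= Decimal(rel_tolerance), items_sum


# Hypothetical extraction result, not data from the original post.
items = [("milk", "3.49"), ("bread", "2.50"), ("eggs", "4.01")]
ok, items_sum = reconcile(items, "10.80")
print(ok, items_sum)
```

A failed check would presumably send the receipt to manual review or to a second model. Note that a check like this only catches fabrications whose items and total disagree; an invented receipt that happens to be internally consistent would still pass.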

My AI didn't misread a receipt. It fabricated one from scratch.