Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction
Hi HN, we're Sid and Ritvik, co-founders of Pulse. Pulse is a document extraction system that produces LLM-ready text. We built Pulse after realizing that although modern vision language models are very good at producing plausible text, that very plausibility makes them risky for OCR and data ingestion at scale.
When we started working on document extraction, we assumed what many teams assume today: foundation models were improving quickly and multimodal systems appeared to read documents well. For small or clean inputs, that assumption often held. The limitations showed up once we began processing real documents in volume. Long PDFs, dense tables, mixed layouts, low-fidelity scans, and financial or operational data exposed errors that were subtle, hard to detect, and expensive to correct. Outputs often looked reasonable while containing small but meaningful mistakes, especially in tables and numeric fields.
Much of our work since then has been applied research. We run controlled evaluations on complex documents, fine-tune vision models, and build labeled datasets where ground truth actually matters. There have been many nights when our team stayed up hand-annotating pages, drawing bounding boxes around tables, labeling charts point by point, or debating whether a number was unreadable or simply poorly scanned. That process shaped our intuition far more than benchmarks alone.
One thing became clear quickly: the core challenge was not extraction itself, but confidence. Vision language models embed document images into high-dimensional representations optimized for semantic understanding rather than precise transcription. That process is inherently lossy. When uncertainty appears, models tend to resolve it using learned priors instead of surfacing ambiguity. This behavior can be helpful in consumer settings; in production pipelines, it creates verification problems that do not scale.
Pulse grew out of trying to close this gap through system design rather than prompting alone. Instead of treating document understanding as a single generative step, the system separates layout analysis from language modeling. Documents are normalized into structured representations that preserve hierarchy and tables before schema mapping occurs. Extraction is constrained by schemas defined ahead of time, and extracted values are tied back to source locations so uncertainty can be inspected rather than guessed away. In practice, this results in a hybrid approach combining traditional computer vision techniques, layout models, and vision language models, because no single approach handled these cases reliably on its own.
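To make that concrete, here is a rough sketch of the output shape we are describing. The names and fields below are illustrative only, not our production API; the point is what it means for a value to carry its schema field, source location, and confidence:

    from dataclasses import dataclass

    # Illustrative only -- not Pulse's actual API. Every extracted value is
    # tied to a schema field, a source location in the document, and a
    # confidence score, so uncertainty can be inspected instead of being
    # silently resolved by a learned prior.

    @dataclass
    class ExtractedValue:
        field: str         # schema field this value maps to
        value: str         # transcribed text, before type coercion
        page: int          # page the value was read from
        bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page
        confidence: float  # 0.0-1.0 transcription confidence

    # The schema is fixed ahead of time and constrains what may be extracted.
    SCHEMA = {"net_revenue": "number", "fiscal_year": "integer"}

    def needs_review(values: list[ExtractedValue], threshold: float = 0.9) -> list[ExtractedValue]:
        """Return values uncertain enough to route to a human reviewer."""
        return [v for v in values if v.field in SCHEMA and v.confidence < threshold]

    results = [
        ExtractedValue("net_revenue", "1,482,305", 12, (73.0, 410.2, 188.5, 424.9), 0.98),
        ExtractedValue("fiscal_year", "2023", 1, (301.4, 96.0, 341.8, 110.3), 0.61),
    ]
    for v in needs_review(results):
        print(f"review {v.field}={v.value!r} on p.{v.page} at {v.bbox} (conf {v.confidence:.2f})")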
We are intentionally sharing a few documents that reflect the kinds of inputs that motivated this work. They are representative of cases where we saw generic OCR or VLM-based pipelines struggle.
Here is a financial 10-K:
<a href="https://platform.runpulse.com/dashboard/examples/example1">https://platform.runpulse.com/dashboard/examples/example1</a>
Here is a newspaper:
<a href="https://platform.runpulse.com/dashboard/examples/example2">https://platform.runpulse.com/dashboard/examples/example2</a>
Here is a rent roll:
<a href="https://platform.runpulse.com/dashboard/examples/example3">https://platform.runpulse.com/dashboard/examples/example3</a>
Pulse is not perfect, particularly on highly degraded scans or uncommon handwriting, and there is still room for improvement. The goal is not to eliminate errors entirely, but to make them visible, auditable, and easier to reason about.
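As a small example of what "visible and auditable" means for numeric fields, a check like the following (purely illustrative, not our actual tooling) catches a table whose rows no longer sum to their stated total, which is exactly the class of plausible-looking error described above:

    def audit_table_total(rows, amount_field, stated_total, tolerance=0.01):
        """Flag extracted tables whose line items don't sum to the stated total."""
        computed = sum(float(r[amount_field].replace(",", "")) for r in rows)
        if abs(computed - stated_total) > tolerance:
            return f"total mismatch: stated {stated_total:,.2f}, computed {computed:,.2f}"
        return None

    # One transposed digit ("980.50" read as "908.50") is enough to trip the check.
    issue = audit_table_total(
        rows=[{"amount": "1,250.00"}, {"amount": "908.50"}],
        amount_field="amount",
        stated_total=2230.50,
    )
    print(issue or "consistent")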
Pulse is available via usage-based access to the API and platform. You can try it here and access the API docs here.
Demo link here:
<a href="https://video.runpulse.com/video/pulse-platform-walkthrough-69f9">https://video.runpulse.com/video/pulse-platform-walkthrough-...</a>
We're interested in hearing how others here evaluate correctness for document extraction, which failure modes you have seen in practice, and what signals you rely on to decide whether an output can be trusted. We will be around to answer questions and are happy to run additional documents if people want to share examples.