Vectorless RAG: Reasoning-Based RAG with PageIndex
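Traditional vector-based retrieval-augmented generation (RAG) often struggles with retrieval accuracy because it optimizes for similarity, not relevance. What retrieval actually needs is relevance, and judging relevance requires reasoning. On professional documents that demand domain expertise and multi-step reasoning, vector-based RAG and similarity search often fall short.
So we started exploring a more reasoning-driven approach to RAG. Reasoning-based RAG lets large language models (LLMs) think and reason their way to the most relevant document sections. Inspired by AlphaGo, we propose using tree search to perform structured document retrieval.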
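To make the tree-search idea concrete, here is a minimal sketch of a top-down search over a document's section tree. It is only an illustration under stated assumptions, not PageIndex's implementation: `Section`, `tree_search`, and `keyword_judge` are hypothetical names, and the word-overlap judge is a stand-in for the step where an LLM would reason about which child section is most relevant.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Section:
    """Hypothetical section node: a title, a short summary, and child sections."""
    title: str
    summary: str = ""
    children: List["Section"] = field(default_factory=list)


def keyword_judge(query: str, section: Section) -> int:
    """Stand-in relevance judge: counts query words found in title + summary.
    In a reasoning-based system this is where an LLM call would go (assumption)."""
    words = lambda text: set(re.findall(r"\w+", text.lower()))
    return len(words(query) & words(section.title + " " + section.summary))


def tree_search(root: Section, query: str,
                judge: Callable[[str, Section], float],
                beam: int = 1) -> List[Section]:
    """Walk down from the root, following only the `beam` best-rated children
    at each level, and return the leaf sections that are reached."""
    frontier, leaves = [root], []
    while frontier:
        node = frontier.pop()
        if not node.children:
            leaves.append(node)
            continue
        ranked = sorted(node.children, key=lambda c: judge(query, c), reverse=True)
        frontier.extend(ranked[:beam])
    return leaves


# Made-up example document, for illustration only.
report = Section("Annual report", children=[
    Section("Business overview", "products, markets and strategy", [
        Section("Products", "product lines"),
        Section("Markets", "regions served"),
    ]),
    Section("Financial statements", "income, balance sheet and cash flow figures", [
        Section("Income statement", "revenue and net income"),
        Section("Cash flow statement", "cash from operations, investing, financing"),
    ]),
])

hits = tree_search(report, "cash flow from operations", keyword_judge)
print([s.title for s in hits])
```

The judge is passed in as a function so the toy scorer can be swapped for a model-backed one without touching the search loop; raising `beam` explores more than one branch per level.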
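We open-sourced one of the key components: PageIndex. PageIndex is a hierarchical indexing system that builds search tree structures from long documents (such as financial reports, regulatory documents, or textbooks), making them ready for reasoning-based RAG.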
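Some highlights:
- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees, like a smart table of contents.
- Precise Referencing: Each node includes a summary and exact physical page numbers.
- Natural Segmentation: Nodes align with document sections, preserving context, with no arbitrary chunking.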
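As a rough illustration of the highlights above, the sketch below shows what a node carrying a summary and a physical page range might look like, and how such a tree can be flattened into a table-of-contents-style outline for a prompt. The field names (`start_page`, `end_page`, and so on), the helper functions, and the sample document are all assumptions made for this example; they are not PageIndex's actual schema or output.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical index node: one document section with its summary and pages."""
    title: str
    summary: str
    start_page: int  # first physical page of the section (assumed field name)
    end_page: int    # last physical page of the section (assumed field name)
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one per section, like a table of contents."""
    line = f"{'  ' * depth}- {node.title} (pp. {node.start_page}-{node.end_page}): {node.summary}"
    lines = [line]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def pages_for(node: Node, title: str) -> Optional[Tuple[int, int]]:
    """Depth-first lookup of a section's page range by exact title, or None."""
    if node.title == title:
        return (node.start_page, node.end_page)
    for child in node.children:
        found = pages_for(child, title)
        if found is not None:
            return found
    return None


# Made-up filing, for illustration only.
doc = Node("10-K filing", "annual report", 1, 120, [
    Node("Risk factors", "material risks to the business", 12, 30),
    Node("Financial statements", "audited statements and notes", 60, 110, [
        Node("Balance sheet", "assets, liabilities and equity", 62, 63),
    ]),
])

print("\n".join(outline(doc)))
print(pages_for(doc, "Balance sheet"))
```

Because each node maps to a whole section rather than a fixed-size chunk, the page range returned for a selected node can be used to pull that section's text intact.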
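We've used PageIndex for financial document analysis with reasoning-based RAG and saw significant improvements in retrieval accuracy compared to vector-based systems.
Would love any feedback, especially thoughts on reasoning-based RAG or ideas for where PageIndex could be applied!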