Vectorless RAG – PageIndex: Reasoning-Based Document Indexing

6 points | by vectify_AI | 7 months ago
We were frustrated by vector-based RAG systems that rely on semantic similarity and often fail on long, domain-specific documents. In these contexts, domain-specific terminology tends to be semantically similar, making it hard to retrieve the exact content users need. It's also difficult to incorporate expert knowledge or user preferences effectively. So we started exploring a more reasoning-driven approach to RAG. Inspired by the tree search algorithm in AlphaGo, we came up with a reasoning-based RAG system that uses tree search to guide retrieval.

We open-sourced one of the key components: PageIndex, a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.

Some highlights:

- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees, like a smart table of contents.
- Precise Referencing: Each node includes a summary and exact physical page numbers.
- Natural Segmentation: Nodes align with document sections, preserving context, with no arbitrary chunking.

We've used PageIndex for financial document analysis with reasoning-based RAG and saw significant improvements in retrieval accuracy compared to vector-based systems.

Would love any feedback, especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!
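For anyone who wants to picture the tree-plus-search idea, here is a rough sketch in Python. It is not PageIndex's actual API or schema: the `Node` fields, `keyword_score`, and `tree_search` are hypothetical names made up for illustration. It also swaps in two simplifications: a plain greedy descent instead of an AlphaGo-style search, and a keyword-overlap function standing in for the LLM's relevance judgment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's)."""
    title: str
    summary: str
    start_page: int  # physical page where the section begins
    end_page: int    # physical page where the section ends (inclusive)
    children: List["Node"] = field(default_factory=list)


def keyword_score(query: str, node: Node) -> float:
    """Stand-in for an LLM relevance judgment: count shared lowercase words."""
    query_words = set(query.lower().split())
    node_words = set((node.title + " " + node.summary).lower().split())
    return float(len(query_words & node_words))


def tree_search(root: Node, query: str,
                score: Callable[[str, Node], float] = keyword_score) -> Node:
    """Greedy descent: at each level, move to the best-scoring child.

    Stops at a leaf, or earlier if no child looks relevant (score <= 0).
    """
    current = root
    while current.children:
        best = max(current.children, key=lambda child: score(query, child))
        if score(query, best) <= 0:
            break
        current = best
    return current


# Toy tree that mirrors a report's table of contents (made-up content).
report = Node("Annual Report", "Full-year results", 1, 120, [
    Node("Risk Factors", "market, credit and liquidity risk", 10, 35, [
        Node("Liquidity Risk", "funding and liquidity risk exposure", 20, 27),
        Node("Credit Risk", "counterparty credit risk exposure", 28, 35),
    ]),
    Node("Financial Statements", "balance sheet and income statement", 60, 110),
])

hit = tree_search(report, "what is the liquidity risk exposure")
print(hit.title, hit.start_page, hit.end_page)  # Liquidity Risk 20 27
```

The point of the sketch is the shape of the data: because each node carries its own page range, the search result maps straight back to physical pages, and the `score` argument is where a real system would plug in an LLM call instead of keyword overlap.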