Show HN: I built a proxy that hides PII while keeping RAG working

3 points by rohansx about 2 months ago
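Hey HN,

When you send real documents or customer data to LLMs, you face a painful tradeoff:

- Send raw text → privacy disaster
- Redact with [REDACTED] → embeddings break, RAG retrieval fails, multi-turn chats become useless, and the model often refuses to answer questions about the redacted entities

The practical solution is consistent pseudonymization: the same real entity always maps to the same token (e.g. "Tata Motors" → ORG_7 everywhere). This preserves semantic meaning for vector search and reasoning while the provider never sees actual names, numbers or addresses. You then rehydrate the response on the way back.

I got fed up fighting this with Presidio + custom glue (truncated RAG chunks, declension in Indian languages, fuzzy merging for typos/siblings, LLM confusion, percentages breaking math). So I built Cloakpipe, a tiny single-binary Rust proxy.

It does:

- Multi-layer detection (regex + financial rules + optional GLiNER2 ONNX NER + custom TOML)
- Consistent reversible mapping in an AES-256-GCM encrypted vault (memory zeroized)
- Smart rehydration that survives truncated chunks like [[ADDRESS:A00
- Built-in fuzzy resolution for typos and similar names
- Numeric reasoning mode so percentages still work for calculations

Fully open source (MIT), zero Python dependencies, <5 ms overhead.

Repo: https://github.com/rohansx/cloakpipe
Demo & quick start: https://app.cloakpipe.co/demo

Would love feedback from anyone who has audited their RAG data flow or is struggling with the redaction-vs-semantics problem, especially in legal, fintech, or non-English workflows.

What approaches have you landed on?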
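
For readers who want to see the core idea in code, here is a minimal sketch of consistent pseudonymization and rehydration. It is not Cloakpipe's implementation. The `PseudonymVault` class, the `LABEL_n` token format, and the hand-written entity dictionary are all assumptions made for illustration. The sketch also skips real detection, encryption, fuzzy matching, and truncated-token handling.

```python
import re


class PseudonymVault:
    """Toy in-memory mapping (hypothetical; unencrypted, unlike the vault described in the post)."""

    def __init__(self):
        self._to_token = {}   # (label, lowercased entity) -> token
        self._to_entity = {}  # token -> original entity text
        self._counters = {}   # label -> last index handed out

    def token_for(self, label, entity):
        # The same entity always gets the same token, however often it appears.
        key = (label, entity.lower())
        if key not in self._to_token:
            n = self._counters.get(label, 0) + 1
            self._counters[label] = n
            token = f"{label}_{n}"
            self._to_token[key] = token
            self._to_entity[token] = entity
        return self._to_token[key]

    def pseudonymize(self, text, entities):
        # entities: {surface form: label}. A hand-written dictionary stands in
        # for a real detection layer here (assumption).
        if not entities:
            return text
        # Longest surface forms first, so a longer name wins over a shorter one it contains.
        surfaces = sorted(entities, key=len, reverse=True)
        pattern = re.compile(
            "|".join(f"(?P<e{i}>{re.escape(s)})" for i, s in enumerate(surfaces)),
            re.IGNORECASE,
        )

        def swap(match):
            # The named group that matched tells us which entity was found.
            surface = surfaces[int(match.lastgroup[1:])]
            return self.token_for(entities[surface], surface)

        return pattern.sub(swap, text)

    def rehydrate(self, text):
        # Put original entities back; tokens the vault does not know are left untouched.
        return re.sub(
            r"\b[A-Z]+_\d+\b",
            lambda m: self._to_entity.get(m.group(0), m.group(0)),
            text,
        )


vault = PseudonymVault()
entities = {"Tata Motors": "ORG", "Pune": "LOC"}
masked_doc = vault.pseudonymize("Tata Motors opened a plant in Pune.", entities)
masked_question = vault.pseudonymize("What did tata motors announce?", entities)
answer = vault.rehydrate("ORG_1 announced a new plant in LOC_1.")
```

Because the document and the follow-up question both map "Tata Motors" to the same token, the masked text still lines up for retrieval and multi-turn context, and only the local vault can turn the model's answer back into real names.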