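Show HN: I built a proxy that hides PII while keeping RAG working
Hey HN,
When you send real documents or customer data to LLMs, you face a painful tradeoff:
- Send raw text → privacy disaster
- Redact with [REDACTED] → embeddings break, RAG retrieval fails, multi-turn chats become useless, and the model often refuses to answer questions about the redacted entities.
The practical solution is consistent pseudonymization: the same real entity always maps to the same token (e.g. "Tata Motors" → ORG_7 everywhere). This preserves semantic meaning for vector search and reasoning. The provider never sees actual names, numbers, or addresses, and you rehydrate the response on the way back.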
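To make the idea concrete, here is a minimal Python sketch of consistent pseudonymization with rehydration. It is an illustration only, not Cloakpipe's code: the `Pseudonymizer` class, its method names, and the `ORG_1`-style numbering are assumptions, and entity detection is taken as already done.

```python
import re


class Pseudonymizer:
    """Toy consistent pseudonymizer (hypothetical, not Cloakpipe's API)."""

    def __init__(self):
        self.forward = {}   # (category, real value) -> token
        self.reverse = {}   # token -> real value
        self.counters = {}  # category -> last number handed out

    def token_for(self, value, category):
        # The same entity always gets the same token back.
        key = (category, value)
        if key not in self.forward:
            n = self.counters.get(category, 0) + 1
            self.counters[category] = n
            token = f"{category}_{n}"
            self.forward[key] = token
            self.reverse[token] = value
        return self.forward[key]

    def cloak(self, text, entities):
        # entities: (value, category) pairs from some upstream detector.
        # Longer values go first so a short name cannot split a longer one.
        for value, category in sorted(entities, key=lambda e: -len(e[0])):
            text = text.replace(value, self.token_for(value, category))
        return text

    def rehydrate(self, text):
        # Swap known tokens back; unknown tokens are left untouched.
        pattern = re.compile(r"\b[A-Z]+_\d+\b")
        return pattern.sub(
            lambda m: self.reverse.get(m.group(0), m.group(0)), text
        )


p = Pseudonymizer()
entities = [("Tata Motors", "ORG")]
doc = p.cloak("Tata Motors beat estimates.", entities)
question = p.cloak("Did Tata Motors raise guidance?", entities)
answer = p.rehydrate("ORG_1 raised guidance.")
```

Because the document and the question both carry the same token, retrieval and multi-turn reasoning still line up, which is exactly what a blanket [REDACTED] loses.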
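I got fed up fighting this with Presidio plus custom glue (truncated RAG chunks, declension in Indian languages, fuzzy merging for typos/siblings, LLM confusion, percentages breaking math). So I built Cloakpipe, a tiny single-binary Rust proxy.
It does:
- Multi-layer detection (regex + financial rules + optional GLiNER2 ONNX NER + custom TOML)
- Consistent reversible mapping in an AES-256-GCM encrypted vault (memory zeroized)
- Smart rehydration that survives truncated chunks like [[ADDRESS:A00
- Built-in fuzzy resolution for typos and similar names
- Numeric reasoning mode so percentages still work for calculations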
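As an illustration of the truncated-chunk case in the list above, here is one way rehydration could cope with a token cut off mid-way. This is a sketch under assumptions, not Cloakpipe's actual algorithm: the `[[CATEGORY:ID]]` token shape is inferred from the `[[ADDRESS:A00` example, and the vault contents and the unique-prefix rule are made up for the demo.

```python
import re

# Hypothetical vault contents: token -> real value.
vault = {
    "[[ADDRESS:A001]]": "12 MG Road, Pune",
    "[[ORG:O007]]": "Tata Motors",
}

# A complete token anywhere in the text.
FULL = re.compile(r"\[\[[A-Z]+:[A-Z0-9]+\]\]")
# A token prefix dangling at the very end of a chunk.
PARTIAL = re.compile(r"\[\[[A-Z]*(?::[A-Z0-9]*)?\]?$")


def rehydrate(chunk, vault):
    # First restore every complete, known token.
    out = FULL.sub(lambda m: vault.get(m.group(0), m.group(0)), chunk)
    # Then look for a cut-off token at the end of the chunk.
    m = PARTIAL.search(out)
    if m:
        prefix = m.group(0)
        candidates = [t for t in vault if t.startswith(prefix)]
        # Only resolve when exactly one vault token fits the prefix.
        if len(candidates) == 1:
            out = out[: m.start()] + vault[candidates[0]]
    return out


full = rehydrate("[[ORG:O007]] is at [[ADDRESS:A001]]", vault)
cut = rehydrate("Ship to [[ADDRESS:A00", vault)
ambiguous = rehydrate("Ship to [[", vault)
```

Refusing to guess when several tokens share the prefix is a deliberate choice in this sketch: a wrong address is worse than an unresolved placeholder.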
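Fully open source (MIT), zero Python dependencies, <5 ms overhead.
Repo: [https://github.com/rohansx/cloakpipe](https://github.com/rohansx/cloakpipe)
Demo & quick start: [https://app.cloakpipe.co/demo](https://app.cloakpipe.co/demo)
Would love feedback from anyone who has audited their RAG data flow or is struggling with the redaction-vs-semantics problem, especially in legal, fintech, or non-English workflows.
What approaches have you landed on?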