Hacker News (Chinese edition)



Hey HN,

When you send real documents or customer data to LLMs, you face a painful tradeoff:

- Send raw text → privacy disaster
- Redact with [REDACTED] → embeddings break, RAG retrieval fails, multi-turn chats become useless, and the model often refuses to answer questions about the redacted entities.

The practical solution is consistent pseudonymization: the same real entity always maps to the same token (e.g. "Tata Motors" → ORG_7 everywhere). This preserves semantic meaning for vector search and reasoning. You then rehydrate the response, and the provider never sees actual names, numbers or addresses.

I got fed up fighting this with Presidio + custom glue (truncated RAG chunks, declension in Indian languages, fuzzy merging for typos/siblings, LLM confusion, percentages breaking math). So I built Cloakpipe as a tiny single-binary Rust proxy.

It does:

- Multi-layer detection (regex + financial rules + optional GLiNER2 ONNX NER + custom TOML)
- Consistent reversible mapping in an AES-256-GCM encrypted vault (memory zeroized)
- Smart rehydration that survives truncated chunks like [[ADDRESS:A00
- Built-in fuzzy resolution for typos and similar names
- Numeric reasoning mode so percentages still work for calculations

Fully open source (MIT), zero Python dependencies, <5 ms overhead.

Repo: https://github.com/rohansx/cloakpipe

Demo & quick start: https://app.cloakpipe.co/demo

Would love feedback from anyone who has audited their RAG data flow or is struggling with the redaction-vs-semantics problem, especially in legal, fintech, or non-English workflows.

What approaches have you landed on?
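To make the consistent-mapping idea concrete, here is a minimal Python sketch of pseudonymize-then-rehydrate. It is not Cloakpipe's code (the tool itself is Rust), and it leaves out detection, encryption, fuzzy matching and truncated-token handling. The `Pseudonymizer` class, the `KIND_N` token format and the pre-supplied entity list are all assumptions made purely for illustration.

```python
import re


class Pseudonymizer:
    """Toy consistent, reversible entity mapping (illustrative only)."""

    # Assumed token shape, e.g. ORG_1 or PERSON_2; the real tool's format may differ.
    TOKEN_RE = re.compile(r"\b[A-Z]+_\d+\b")

    def __init__(self):
        self.forward = {}   # (kind, real value) -> token
        self.reverse = {}   # token -> real value
        self.counters = {}  # kind -> last index handed out

    def token_for(self, kind, value):
        """Return the same token every time the same entity is seen."""
        key = (kind, value)
        if key not in self.forward:
            self.counters[kind] = self.counters.get(kind, 0) + 1
            token = f"{kind}_{self.counters[kind]}"
            self.forward[key] = token
            self.reverse[token] = value
        return self.forward[key]

    def pseudonymize(self, text, entities):
        """Replace each detected (kind, value) pair with its stable token.

        `entities` is assumed to come from some upstream detector.
        Longer values go first so a short name cannot split a longer one.
        """
        for kind, value in sorted(entities, key=lambda e: -len(e[1])):
            text = text.replace(value, self.token_for(kind, value))
        return text

    def rehydrate(self, text):
        """Swap known tokens back to real values; unknown tokens stay as-is."""
        return self.TOKEN_RE.sub(
            lambda m: self.reverse.get(m.group(0), m.group(0)), text
        )


p = Pseudonymizer()
entities = [("ORG", "Tata Motors"), ("PERSON", "Asha")]
masked = p.pseudonymize("Tata Motors hired Asha. Asha likes Tata Motors.", entities)
followup = p.pseudonymize("Is Tata Motors public?", [("ORG", "Tata Motors")])
restored = p.rehydrate("ORG_1 is listed; PERSON_1 works there.")
```

Because the mapping is kept across calls, a follow-up question about the same company reuses the token from the first message, which is what keeps embeddings and multi-turn context aligned. Only the side holding the mapping can turn the model's reply back into real names.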

Show HN: I built a proxy that hides PII while keeping RAG working