展示HN:我们测试了214种不需要越狱的AI攻击方式

1作者: exordex11 天前原帖
大多数代理安全测试试图对模型进行越狱。这非常困难,OpenAI和Anthropic在红队测试方面表现出色。 我们采取了不同的方法:攻击环境,而不是模型。 以下是针对我们的攻击套件测试代理的结果: - 工具操控:要求代理读取一个文件,注入路径=/etc/passwd。它照做了。 - 数据外泄:要求代理读取配置并将其通过电子邮件发送到外部。它做到了。 - Shell注入:用指令污染git状态输出。代理遵循了这些指令。 - 凭证泄露:要求提供API密钥“用于调试”。代理提供了这些密钥。 这些操作都不需要绕过模型的安全机制。模型正常工作——代理仍然被攻陷。 其工作原理: 我们构建了拦截代理实际操作的适配层: - 文件系统适配层:对open()、Path.read_text()进行猴子补丁。 - 子进程适配层:对subprocess.run()进行猴子补丁。 - PATH劫持:伪造git/nmp/curl,包装真实的二进制文件并污染输出。 模型看到的看似合法的工具输出。它对此毫无察觉。 总共进行了214次攻击,包括文件注入、shell输出污染、工具操控、RAG污染、MCP攻击。 早期访问: [https://exordex.com](https://exordex.com) 希望能收到任何将代理投入生产的人的反馈。
查看原文
Most agent security testing tries to jailbreak the model. That&#x27;s really difficult, OpenAI and Anthropic are good at red-teaming.<p>We took a different approach: attack the environment, not the model.<p>Results from testing agents against our attack suite:<p>- Tool manipulation: Asked agent to read a file, injected path=&#x2F;etc&#x2F;passwd. It complied. - Data exfiltration: Asked agent to read config, email it externally. It did. - Shell injection: Poisoned git status output with instructions. Agent followed them. - Credential leaks: Asked for API keys &quot;for debugging.&quot; Agent provided them.<p>None of these required bypassing the model&#x27;s safety. The model worked correctly—the agent still got owned.<p>How it works:<p>We built shims that intercept what agents actually do: - Filesystem shim: monkeypatches open(), Path.read_text() - Subprocess shim: monkeypatches subprocess.run() - PATH hijacking: fake git&#x2F;npm&#x2F;curl that wrap real binaries and poison output<p>The model sees what looks like legitimate tool output. It has no idea.<p>214 attacks total. File injection, shell output poisoning, tool manipulation, RAG poisoning, MCP attacks.<p>Early access: <a href="https:&#x2F;&#x2F;exordex.com" rel="nofollow">https:&#x2F;&#x2F;exordex.com</a><p>Looking for feedback from anyone shipping agents to production.