展示HN:一个针对本地AI代理的四层自愈系统(曾经默默失效)
我在Mac mini上运行一个本地AI助手(OpenClaw)。当它崩溃时,我希望它能够自动修复自己。因此,我建立了一个四层恢复系统:
<p>第一级 → launchd自动重启(秒级)
第二级 → 每60秒进行一次看门狗健康检查
第三级 → Claude Code AI诊断与修复(在失败超过30分钟后触发)
第四级 → Discord/Telegram人工升级
<p>这个系统在纸面上看起来很完美。但v3.0.0存在一个关键错误:第二级记录了“升级到第三级...”但实际上从未调用脚本。这个链条在最重要的交接处断开了——默默地持续了几周。
<p>v3.1.0修复了这个问题,并增加了:
- 安装程序中的链条验证(每个级别都经过端到端测试)
- 优雅的.env加载,以确保第三级不会在没有API密钥的情况下静默崩溃
- 看门狗补救模式,用于错过的时间间隔
<p>GitHub: <a href="https://github.com/Ramsbaby/openclaw-self-healing" rel="nofollow">https://github.com/Ramsbaby/openclaw-self-healing</a>
<p>主要教训:“我有监控” ≠ “我的监控链实际上是连接的。” 测试每个组件的启动并不等同于测试系统的端到端功能。
查看原文
I run a local AI assistant (OpenClaw) on a Mac mini. When it crashes, I wanted it to fix itself automatically. So I built a 4-tier recovery system:<p>Level 1 → launchd auto-restart (seconds)
Level 2 → watchdog health check every 60s
Level 3 → Claude Code AI diagnosis + repair (fires after 30+ min failure)
Level 4 → Discord/Telegram human escalation<p>The system looked great on paper. But v3.0.0 had a critical bug: Level 2 logged "Escalating to Level 3..." but never actually called the script. The chain was broken at the most important handoff — silently, for weeks.<p>v3.1.0 fixes this plus adds:
- Chain verification in the installer (each level is tested end-to-end)
- Graceful .env loading so Level 3 doesn't die silently without API keys
- Watchdog Catch-up mode for missed intervals<p>GitHub: <a href="https://github.com/Ramsbaby/openclaw-self-healing" rel="nofollow">https://github.com/Ramsbaby/openclaw-self-healing</a><p>The meta-lesson: "I have monitoring" ≠ "my monitoring chain is actually connected." Testing each component starts is not the same as testing the system works end-to-end.