Ask HN: How do you keep system context from rotting over time?

3 points by kennethops 16 days ago
Former SRE here, looking for advice.

I know there are a lot of tools focused on root cause analysis after things break. Cool, but that’s not what’s wearing me down. What actually hurts is the constant context switching while trying to understand how a system fits together, what depends on what, and what changed recently.

As systems grow, this feels like it gets exponentially harder. Add logs and now you’ve created a million new events to reason about. Add another database and suddenly you’re dealing with subnet constraints or a DB choice that’s expensive as hell, and no one noticed until later. Everyone knows their slice, but the full picture lives nowhere, so bit rot just keeps creeping in.

This feels even worse now that AI agents are pushing large amounts of code and config changes quickly. Things move faster, but shared understanding falls behind even faster.

I’m honestly stuck on how people handle this well in practice. For folks dealing with real production systems, what’s actually helped? Diagrams, docs, tribal knowledge, tooling, something else? Where does it break down?