Ask HN: Has anyone lost sleep over retry storms or partial API outages?

By rjpruitt16, about 7 hours ago
I'm working on infrastructure to solve retry storms and outages. Before I go further, I want to understand what people are actually doing today, compare solutions, and maybe help someone spot a potential fix along the way.

The problems:

- Retry storms: an API fails, your entire fleet retries independently, and the thundering herd makes things worse.

- Partial outages: the API is "up" but degraded (slow responses, intermittent 500s). Health checks pass, but requests suffer.

What I'm curious about:

- What's your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
- How well does it work? What are the gaps?
- What scale are you at? (company size, # of instances, requests/sec)

I'd love to hear what's working, what isn't, and what you wish existed.
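To make the two failure modes concrete, here is a minimal client-side sketch in Python (not from the original post; the names, thresholds, and the flaky_call stand-in are illustrative assumptions): capped exponential backoff with full jitter, so a fleet of clients does not retry in lockstep, plus a consecutive-failure circuit breaker that stops hammering a degraded upstream even while its health checks still pass.

import random
import time

class CircuitOpenError(Exception):
    # Raised when the breaker is open and calls are short-circuited.
    pass

class CircuitBreaker:
    # Minimal breaker: trips after N consecutive failures, then rejects
    # calls until a cooldown elapses and a trial ("half-open") call is allowed.
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, skipping call")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result

def retry_with_full_jitter(fn, attempts=5, base_delay=0.1, max_delay=10.0):
    # Capped exponential backoff with full jitter (sleep a uniform amount
    # up to the cap) spreads retries out instead of synchronizing the fleet.
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # don't keep retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))

if __name__ == "__main__":
    breaker = CircuitBreaker()

    def flaky_call():
        # Stand-in for a real API call that intermittently returns 500s.
        if random.random() < 0.5:
            raise RuntimeError("upstream returned 500")
        return "ok"

    try:
        print(retry_with_full_jitter(lambda: breaker.call(flaky_call)))
    except Exception as exc:
        print("gave up:", exc)

Client-side backoff and breakers blunt the thundering herd, but they don't coordinate across instances, which is presumably the gap that the coordination-style approaches mentioned in the question are trying to fill.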