Ask HN: Has anyone lost sleep over retry storms or partial API outages?

By rjpruitt16, about 7 hours ago
I'm working on infrastructure to solve retry storms and outages. Before I go further, I want to understand what people are actually doing today, compare solutions, and maybe help someone spot a potential fix along the way.

The problems:

- Retry storms: an API fails, your entire fleet retries independently, and the thundering herd makes things worse.

- Partial outages: the API is "up" but degraded (slow responses, intermittent 500s). Health checks pass, but requests suffer.

What I'm curious about:

- What's your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
- How well does it work? What are the gaps?
- What scale are you at? (company size, # of instances, requests/sec)

I'd love to hear what's working, what isn't, and what you wish existed.
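To make the two failure modes concrete, here is a minimal client-side sketch in Python (not from the original post; the names, thresholds, and the flaky_call stand-in are illustrative assumptions): capped exponential backoff with full jitter, so a fleet of clients does not retry in lockstep, plus a consecutive-failure circuit breaker that stops hammering a degraded upstream even while its health checks still pass.

import random
import time

class CircuitOpenError(Exception):
    # Raised when the breaker is open and calls are short-circuited.
    pass

class CircuitBreaker:
    # Minimal breaker: trips after N consecutive failures, then rejects
    # calls until a cooldown elapses and a trial ("half-open") call is allowed.
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, skipping call")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result

def retry_with_full_jitter(fn, attempts=5, base_delay=0.1, max_delay=10.0):
    # Capped exponential backoff with full jitter (sleep a uniform amount
    # up to the cap) spreads retries out instead of synchronizing the fleet.
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # don't keep retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))

if __name__ == "__main__":
    breaker = CircuitBreaker()

    def flaky_call():
        # Stand-in for a real API call that intermittently returns 500s.
        if random.random() < 0.5:
            raise RuntimeError("upstream returned 500")
        return "ok"

    try:
        print(retry_with_full_jitter(lambda: breaker.call(flaky_call)))
    except Exception as exc:
        print("gave up:", exc)

Client-side backoff and breakers blunt the thundering herd, but they don't coordinate across instances, which is presumably the gap that the coordination-style approaches mentioned in the question are trying to fill.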