Ask HN: Debugging failures in large, interconnected backend systems
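I'm trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks).
In practice, when something breaks, the workflow usually seems to be:
- an alert fires (Datadog/Sentry/CloudWatch/etc.)
- or a customer complains
- engineers then start checking logs, traces, and dashboards across multiple systems
- and eventually they manually reconstruct what happened across services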
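To make the last step concrete, here is a minimal sketch of what that manual reconstruction often amounts to. It assumes every service writes JSON log lines carrying a shared `correlation_id` and an ISO 8601 `ts` field. Those field names, the `stitch_timeline` helper, and the sample data are all hypothetical, chosen for illustration rather than taken from any particular tool.

```python
import json
from datetime import datetime


def stitch_timeline(log_sources, correlation_id):
    """Merge JSON log lines from several services into one ordered timeline.

    log_sources maps a service name to an iterable of raw log lines.
    Assumes (hypothetically) that each line is a JSON object with
    "ts", "correlation_id", and "msg" fields.
    """
    events = []
    for service, lines in log_sources.items():
        for line in lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not JSON
            if not isinstance(record, dict):
                continue  # skip JSON values that are not objects
            if record.get("correlation_id") != correlation_id:
                continue  # belongs to a different request
            timestamp = datetime.fromisoformat(record["ts"])
            events.append((timestamp, service, record.get("msg", "")))
    # Order by timestamp so the cross-service sequence becomes readable.
    events.sort(key=lambda event: event[0])
    return events


# Made-up sample logs from three hypothetical services.
sample_logs = {
    "api": [
        '{"ts": "2024-05-01T12:00:00+00:00", "correlation_id": "req-1", "msg": "checkout started"}',
        '{"ts": "2024-05-01T12:00:05+00:00", "correlation_id": "req-2", "msg": "other request"}',
    ],
    "payments": [
        '{"ts": "2024-05-01T12:00:02+00:00", "correlation_id": "req-1", "msg": "charge failed"}',
        "plain text line that is not JSON",
    ],
    "worker": [
        '{"ts": "2024-05-01T12:00:01+00:00", "correlation_id": "req-1", "msg": "job enqueued"}',
    ],
}

timeline = stitch_timeline(sample_logs, "req-1")
for timestamp, service, msg in timeline:
    print(timestamp.isoformat(), service, msg)
```

This only works when the ID actually survives every hop. Queues, webhooks, and third-party callbacks are where it tends to get dropped, and clock skew between hosts can make the sorted order misleading.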
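What I'm curious about:
- How do you actually trace a single failed request or transaction across multiple services today?
- What tools do you rely on most in practice (not in theory)?
- Where does it usually break down: logs, tracing, instrumentation, or just missing context?
- How long does it typically take to go from "something is wrong" to "we know exactly why it broke"?
- What part of this is still mostly manual stitching together of information?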
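I'm trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.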