Ask HN: Debugging failures in large interconnected backend systems

1 point by Ifedayo_s 20 days ago
I'm trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks, etc.).

In practice, when something breaks, it seems like the workflow is usually:

- an alert fires (Datadog/Sentry/CloudWatch/etc.)
- or a customer complains
- engineers then start checking logs, traces, dashboards across multiple systems
- and eventually manually reconstruct what happened across services

What I'm curious about:

- How do you actually trace a single failed request or transaction across multiple services today?
- What tools do you rely on most in practice (not in theory)?
- Where does it usually break down: logs, tracing, instrumentation, or just missing context?
- How long does it typically take to go from "something is wrong" → "we know exactly why it broke"?
- What part of this is still mostly manual stitching together of information?

Trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.
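To make the "manual stitching" step concrete, here is a rough sketch of what I mean, assuming every service logs a shared request ID. The log records, the field names (`request_id`, `ts`, `service`, `msg`), and the `build_timelines` helper are all hypothetical and not taken from any particular tool; it only illustrates rebuilding one request's timeline from logs pulled out of several systems.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records exported from several systems.
# Service names, field names, and IDs are made up for illustration.
LOGS = [
    {"service": "api", "ts": "2024-01-01T12:00:00+00:00",
     "request_id": "req-1", "msg": "POST /checkout"},
    {"service": "worker", "ts": "2024-01-01T12:00:02+00:00",
     "request_id": "req-1", "msg": "charge failed: card_declined"},
    {"service": "api", "ts": "2024-01-01T12:00:01+00:00",
     "request_id": "req-2", "msg": "GET /health"},
    {"service": "queue", "ts": "2024-01-01T12:00:01+00:00",
     "request_id": "req-1", "msg": "enqueued charge job"},
    {"service": "webhook", "ts": "2024-01-01T12:00:03+00:00",
     "request_id": None, "msg": "stripe event received"},
]


def build_timelines(records):
    """Group records by request_id and sort each group by timestamp.

    Records without a request_id are returned separately, since they
    cannot be attached to any request without extra context.
    """
    timelines = defaultdict(list)
    orphans = []
    for rec in records:
        if rec.get("request_id"):
            timelines[rec["request_id"]].append(rec)
        else:
            orphans.append(rec)
    for events in timelines.values():
        events.sort(key=lambda r: datetime.fromisoformat(r["ts"]))
    return dict(timelines), orphans


if __name__ == "__main__":
    timelines, orphans = build_timelines(LOGS)
    for event in timelines["req-1"]:
        print(event["ts"], event["service"], event["msg"])
    print("records with no request ID:", len(orphans))
```

In this sketch the webhook record has no request ID, so it ends up in the orphan list and has to be matched by hand. That is the kind of missing context I'm asking about.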