Hacker News (Chinese edition)


I’m trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks).

In practice, when something breaks, it seems like the workflow is usually:

- an alert fires (Datadog/Sentry/CloudWatch/etc.)
- or a customer complains
- engineers then start checking logs, traces, and dashboards across multiple systems
- and eventually manually reconstruct what happened across services

What I’m curious about:

- How do you actually trace a single failed request or transaction across multiple services today?
- What tools do you rely on most in practice (not in theory)?
- Where does it usually break down: logs, tracing, instrumentation, or just missing context?
- How long does it typically take to go from “something is wrong” to “we know exactly why it broke”?
- What part of this is still mostly manual stitching together of information?

Trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.

Ask HN: Debugging failures in large interconnected backend systems