Ask HN: How do you schedule stateful nodes when mmap distorts memory accounting?

7 points · by leo_e · 1 day ago
We're hitting a classic distributed systems wall and I'm looking for war stories or "least worst" practices.

The Context: We maintain a distributed stateful engine (think search/analytics). The architecture is standard: a Control Plane (Coordinator) assigns data segments to Worker Nodes. The workload involves heavy use of mmap and lazy loading for large datasets.

The Incident: We had a cascading failure where the Coordinator got stuck in a loop, DDoS-ing a specific node.

The Signal: The Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."

The Action: The Coordinator attempts to rebalance/load new segments onto Node A.

The Reality: Node A is actually sitting at 197 GB RAM usage (near OOM). The data on it happens to be extremely wide (fat rows, huge blobs), so its logical row count is low, but its physical footprint is massive.

The Loop: Node A rejects the load (or times out). The Coordinator ignores the backpressure, sees the low row count again, and retries immediately.

The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.

Now we are staring at mmap. Because the OS manages the page cache, application-level RSS is noisy and doesn't strictly distinguish "required" memory from "reclaimable" cache.

The Question: Attempting to enumerate every resource variable (CPU, IOPS, RSS, disk, logical count) into a single scoring function feels like an NP-hard trap.

How do you handle placement in systems where memory usage is opaque/dynamic?

Dumb Coordinator, Smart Nodes: Should we just let the Coordinator blind-fire based on disk space, and rely 100% on the Node to return a hard 429 Too Many Requests based on local pressure?

Cost Estimation: Do we try to build a synthetic "cost model" per segment (e.g., predicted memory footprint) and schedule based on credits, ignoring actual OS metrics?

Control Plane Decoupling: Separate storage balancing (disk) from query balancing (mem)?

Feels like we are reinventing the wheel. References to papers or similar architecture post-mortems appreciated.
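To make the "Dumb Coordinator, Smart Nodes" option concrete: since RSS can't separate required memory from reclaimable mmap cache, one signal that can is Linux PSI (`/proc/pressure/memory`, kernel ≥ 4.20), which reports how much time tasks actually stall on memory. A minimal sketch of node-side admission control on top of it — the threshold and function names are hypothetical, not anything from the post:

```python
import re

PSI_PATH = "/proc/pressure/memory"  # Linux PSI; format: "some avg10=... / full avg10=..."

def memory_pressure_avg10(psi_text: str) -> float:
    """Extract the 'full' avg10 value: % of the last 10s in which *all*
    tasks were stalled on memory. Reclaimable page cache doesn't cause
    stalls, so this ignores the mmap noise that pollutes RSS."""
    m = re.search(r"^full avg10=([\d.]+)", psi_text, re.MULTILINE)
    return float(m.group(1)) if m else 0.0

def admit_segment(psi_text: str, threshold: float = 5.0) -> int:
    """Node-local admission check for a segment-load request.
    Returns an HTTP status: 200 to accept, 429 to tell the
    Coordinator to back off (ideally with a Retry-After header)."""
    return 429 if memory_pressure_avg10(psi_text) >= threshold else 200
```

The coordinator side then only needs one rule: treat 429 as a hard hold-off for that node, instead of re-deriving "underutilized" from logical row counts on the next tick.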
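And for the Cost Estimation option, a toy sketch of credit-based placement under an assumed cost model (every name and the `heat` factor here are hypothetical illustrations, not the engine's actual fields): predict each segment's resident footprint from its on-disk size, give each node a fixed memory budget, and let the coordinator spend credits against that budget instead of reading OS metrics at all.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    id: str
    disk_bytes: int
    heat: float  # assumed fraction of the segment that ends up resident (0..1)

    def predicted_rss(self) -> int:
        # Toy cost model: resident footprint ~ on-disk size * expected hot fraction.
        # A real model would be fitted per segment type (wide rows cost more).
        return int(self.disk_bytes * self.heat)

@dataclass
class Node:
    id: str
    mem_budget: int   # bytes of "credits" the scheduler may spend on this node
    committed: int = 0

    def try_place(self, seg: Segment) -> bool:
        """Spend credits if the predicted footprint fits; refuse otherwise."""
        cost = seg.predicted_rss()
        if self.committed + cost > self.mem_budget:
            return False  # out of credits: coordinator must look elsewhere
        self.committed += cost
        return True
```

The appeal is that the loop from the incident can't happen: a node full of fat, low-row-count segments has no credits left, regardless of what its logical row count says. The cost is that the model drifts from reality, so in practice it's usually paired with a node-side veto (the 429 path) as a backstop.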