HackerNews中文版

我在一台 H100 PCIe 80GB 节点上进行了 A/B 基准测试。连续的内存拷贝（memcpy）在基线和优化运行中均保持约 1.86 TB/s，显示没有额外开销。对于跨步（strided）和未对齐访问，基线速度约为 230 GB/s，而优化版本达到了约 1.86 TB/s，约提升了 8 倍。大负载（8–24 GB）同样保持在约 1.86 TB/s。典型的 CUDA 核心，如内存拷贝、跨步访问、KV 缓存和层归一化（LayerNorm），从约 220–330 GB/s 提升至约 1.8–1.86 TB/s，速度提高了 7–8 倍，且抖动非常低。使用简单的 LLM 解码成本模型（BPT = 1.13 MB/token），吞吐量从约 161.9k token/s 提升至约 225.1k token/s（≈1.39 倍）。这表明，像 KV 缓存和跨步加载这样的内存绑定操作可以更接近于带宽上限，从而直接影响解码吞吐量。我对这样的内存绑定优化在 LLM 训练与推理中的影响，以及接下来测试哪些好的公共长上下文（8k–32k）基准感兴趣，欢迎反馈。

查看原文

I ran A/B benchmarks on an H100 PCIe 80GB node. Contiguous memcpy sustained ~1.86 TB/s in both baseline and optimized runs, showing no overhead. For strided and misaligned access the baseline was ~230 GB/s, while the optimized version reached ~1.86 TB/s, about an 8× improvement. Large 8–24 GB payloads sustained ~1.86 TB/s as well. Canonical CUDA kernels such as memcpy, strided access, KV-cache, and LayerNorm improved from ~220–330 GB/s to ~1.8–1.86 TB/s, around 7–8× faster with very low jitter.<p>Using a simple LLM decode cost model (BPT = 1.13 MB/token), throughput improved from ~161.9k tok/s to ~225.1k tok/s (≈1.39×). This suggests memory-bound operations like KV-cache and strided loads can be lifted closer to roofline bandwidth, with direct impact on decode throughput.<p>I’m interested in feedback on how such memory-bound optimizations might affect LLM training versus inference, and what good public long-context (8k–32k) benchmarks would be to test next ?

H100 PCIe – 1.86 TB/s 的内存拷贝上限和 8 倍提升