H100 PCIe – 1.86 TB/s 的内存拷贝上限和 8 倍提升

1作者: GPUrouter3 个月前原帖
我在一台 H100 PCIe 80GB 节点上进行了 A/B 基准测试。连续的内存拷贝(memcpy)在基线和优化运行中均保持约 1.86 TB/s,显示没有额外开销。对于跨步(strided)和未对齐访问,基线速度约为 230 GB/s,而优化版本达到了约 1.86 TB/s,约提升了 8 倍。大负载(8–24 GB)同样保持在约 1.86 TB/s。典型的 CUDA 核心,如内存拷贝、跨步访问、KV 缓存和层归一化(LayerNorm),从约 220–330 GB/s 提升至约 1.8–1.86 TB/s,速度提高了 7–8 倍,且抖动非常低。 使用简单的 LLM 解码成本模型(BPT = 1.13 MB/token),吞吐量从约 161.9k token/s 提升至约 225.1k token/s(≈1.39 倍)。这表明,像 KV 缓存和跨步加载这样的内存绑定操作可以更接近于带宽上限,从而直接影响解码吞吐量。 我对这样的内存绑定优化在 LLM 训练与推理中的影响,以及接下来测试哪些好的公共长上下文(8k–32k)基准感兴趣,欢迎反馈。
查看原文
I ran A&#x2F;B benchmarks on an H100 PCIe 80GB node. Contiguous memcpy sustained ~1.86 TB&#x2F;s in both baseline and optimized runs, showing no overhead. For strided and misaligned access the baseline was ~230 GB&#x2F;s, while the optimized version reached ~1.86 TB&#x2F;s, about an 8× improvement. Large 8–24 GB payloads sustained ~1.86 TB&#x2F;s as well. Canonical CUDA kernels such as memcpy, strided access, KV-cache, and LayerNorm improved from ~220–330 GB&#x2F;s to ~1.8–1.86 TB&#x2F;s, around 7–8× faster with very low jitter.<p>Using a simple LLM decode cost model (BPT = 1.13 MB&#x2F;token), throughput improved from ~161.9k tok&#x2F;s to ~225.1k tok&#x2F;s (≈1.39×). This suggests memory-bound operations like KV-cache and strided loads can be lifted closer to roofline bandwidth, with direct impact on decode throughput.<p>I’m interested in feedback on how such memory-bound optimizations might affect LLM training versus inference, and what good public long-context (8k–32k) benchmarks would be to test next ?