Fixed a llama.cpp bug that silently disabled the Vulkan GPU on all 32-bit ARM devices

2 points | by perinban | 8 days ago
While running llama.cpp on a Samsung Galaxy Watch 4 Classic (armeabi-v7a, Mali G68), I noticed the Vulkan backend was rejecting every quantized MUL_MAT operation, despite the loader reporting "33/33 layers offloaded to GPU".

Root cause: a missing block-size division in the tensor stride calculation inside create_tensor() in llama-model-loader.cpp. The wrong stride cascades into a ggml_nbytes() overflow, exceeding max_buffer_size on 32-bit platforms, where size_t is 32 bits.

On 64-bit devices the overflow is silently masked: the value is wrong but still within GPU memory limits, so nobody noticed. The bug has likely been there for years.

Fix and context: https://github.com/Perinban/llama.cpp/tree/axon-dev