Fixed a llama.cpp bug that silently disabled the Vulkan GPU on all 32-bit ARM devices

2 points | by perinban | 8 days ago
While running llama.cpp on a Samsung Galaxy Watch 4 Classic (armeabi-v7a, Mali G68), I noticed the Vulkan backend was rejecting every quantized MUL_MAT operation, despite the loader reporting "33/33 layers offloaded to GPU".

Root cause: a missing block-size division in the tensor stride calculation inside create_tensor() in llama-model-loader.cpp. The wrong stride cascades into a ggml_nbytes() overflow, exceeding max_buffer_size on 32-bit platforms, where size_t is 32 bits.

On 64-bit devices the overflow is silently masked: the value is wrong but still within GPU memory limits, so nobody noticed. The bug has likely been there for years.

Fix and context: https://github.com/Perinban/llama.cpp/tree/axon-dev