Running 35B LLMs on dual Pascal GPUs with QLoRA

Author: rickesh_tn · about 2 months ago
Hi HN,

I built a system that runs 35B-parameter language models on older Pascal GPUs (P100 + GTX 1080 Ti) using multi-GPU memory spillover.

Problem: most LLM inference tools (Ollama, LM Studio) are limited to a single GPU's VRAM (roughly 13B models max on a 16GB card). If you have multiple older GPUs, the second one sits idle.

Solution: multi-GPU + CPU memory spillover with QLoRA 4-bit quantization. The system automatically distributes layers across GPU0 → GPU1 → CPU RAM, so 35B models run on hardware that normally tops out around 13B.

Benchmarks (P100 16GB + GTX 1080 Ti 11GB):

- Qwen-14B: 13.7 tokens/sec (9.4GB VRAM)
- OPT-30B: 5.4 tokens/sec (15.2GB VRAM)
- CodeLlama-34B: 0.8 tokens/sec (16.7GB VRAM)

Quick start:

```
docker pull rickeshtn/large-model-international_release:latest
docker run -it --rm --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=268435456 \
  -v $(pwd):/workspace -e HF_HOME=/workspace/model_cache \
  rickeshtn/large-model-international_release:latest \
  python /app/interactive_chat.py --model-name Qwen/Qwen2.5-14B-Instruct
```

Technical details:

- QLoRA 4-bit NF4 quantization (75% memory reduction)
- HuggingFace Transformers + Accelerate + bitsandbytes
- Automatic device mapping with CPU offload (see the loading sketch below)
- Interactive chat with conversation persistence

GitHub: [https://github.com/rickeshtn/locallm-pascal](https://github.com/rickeshtn/locallm-pascal)
Docker Hub: [https://hub.docker.com/r/rickeshtn/large-model-international_release](https://hub.docker.com/r/rickeshtn/large-model-international_release)

34 users are already running it. Happy to answer technical questions!
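For readers curious how the spillover is wired, here is a minimal sketch of the loading path using Transformers + Accelerate + bitsandbytes, the stack listed above. The NF4 quantization matches the technical details; the max_memory budgets, double-quant flag, and fp16 compute dtype are illustrative assumptions, not the project's exact settings (the project's interactive_chat.py may differ).

```
# Minimal sketch: 4-bit NF4 loading with multi-GPU + CPU spillover.
# Memory budgets below are illustrative, not the project's exact config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen2.5-14B-Instruct"   # model from the quick-start command above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA-style 4-bit weights
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # assumption: double quantization enabled
    bnb_4bit_compute_dtype=torch.float16,   # Pascal has no bf16 support, so fp16 compute
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                      # Accelerate places layers GPU0 -> GPU1 -> CPU
    max_memory={0: "15GiB", 1: "10GiB", "cpu": "48GiB"},  # illustrative budgets
)
print(model.hf_device_map)                  # inspect which layers spilled where
```

With device_map="auto" and per-device max_memory limits, Accelerate fills GPU 0 first, spills remaining layers to GPU 1, and offloads the rest to CPU RAM, which is the GPU0 → GPU1 → CPU ordering described above.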
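And a hypothetical sketch of the "interactive chat with conversation persistence" part, continuing from the loading sketch above: the loop structure, decoding settings, and exit commands are placeholders, not the actual interactive_chat.py.

```
# Hypothetical chat loop with conversation persistence;
# assumes `model` and `tokenizer` from the loading sketch above.
messages = []
while True:
    user_input = input("you> ")
    if user_input.strip() in {"exit", "quit"}:
        break
    messages.append({"role": "user", "content": user_input})
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    reply = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    print(reply)
    messages.append({"role": "assistant", "content": reply})  # persist the turn
```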