Show HN: PyTorch K-Means, GPU-friendly, single file, hierarchical and resampling
I built a small, self-contained K-Means implementation in pure PyTorch: https://gitlab.com/hassonofer/pt_kmeans

I was working on dataset sampling and approximate nearest neighbor search, and tried several existing libraries for large-scale K-Means. I couldn't find one that was fast, simple, and ran comfortably on my own workstation without hitting memory limits. Maybe I missed an existing solution, but I ended up writing one that fit my needs.

The core insight: keep your data on the CPU (where you have more RAM) and intelligently move only the necessary chunks to the GPU for computation during the iterative steps. Results always come back to the CPU for easy post-processing.
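To make the chunking idea concrete, here is a minimal sketch of an assignment step that keeps the points on the CPU and streams chunks through the GPU. This is an illustration of the technique, not pt_kmeans's actual code; `assign_chunked` and its parameters are hypothetical names.

```python
import torch

def assign_chunked(x_cpu, centers_cpu, chunk_size=65536, device=None):
    """Assign each CPU-resident point to its nearest center,
    moving one chunk at a time to the accelerator.
    Illustrative sketch only, not the pt_kmeans internals."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    centers = centers_cpu.to(device)
    labels = torch.empty(x_cpu.shape[0], dtype=torch.long)
    for start in range(0, x_cpu.shape[0], chunk_size):
        chunk = x_cpu[start:start + chunk_size].to(device)
        d = torch.cdist(chunk, centers)  # (chunk, k) pairwise L2 distances
        # results come straight back to CPU, so GPU memory stays bounded
        labels[start:start + chunk_size] = d.argmin(dim=1).cpu()
    return labels
```

Peak GPU memory is then governed by `chunk_size * k`, not by the dataset size, which is what lets the whole thing run on a single workstation.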
(Note: for K-Means++ initialization when computing on GPU, the full dataset still needs to fit on the GPU.)

It offers a few practical features:

- Chunked computations: memory-efficient processing of large datasets by moving only the necessary data chunks to the GPU, preventing out-of-memory errors
- Cluster splitting: refine existing clusters by splitting a single cluster into multiple sub-clusters
- Zero dependencies: single file, only requires PyTorch. Copy-paste into any project
- Advanced clustering: hierarchical K-Means with optional resampling (following recent research), plus cluster-splitting utilities
- Device flexibility: explicit device control. Data can live anywhere; computation happens where you specify (any accelerator PyTorch supports)
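The cluster-splitting feature above can be approximated with plain Lloyd iterations restricted to one cluster's members. The sketch below is an assumed rendering of that idea, not the library's API; `split_cluster` and its signature are hypothetical.

```python
import torch

def split_cluster(x, labels, cluster_id, n_sub, n_iters=20):
    """Split one existing cluster into n_sub sub-clusters by running
    plain Lloyd's K-Means on its members only.
    Hypothetical sketch; not the pt_kmeans API."""
    members = x[labels == cluster_id]
    # seed sub-centers from random members of the cluster
    idx = torch.randperm(members.shape[0])[:n_sub]
    centers = members[idx].clone()
    for _ in range(n_iters):
        d = torch.cdist(members, centers)       # distances to sub-centers
        assign = d.argmin(dim=1)                # nearest sub-center per point
        for j in range(n_sub):
            pts = members[assign == j]
            if pts.shape[0] > 0:                # skip empty sub-clusters
                centers[j] = pts.mean(dim=0)
    return centers, assign
```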
Future plans:

- Add support for memory-mapped files to handle even bigger datasets
- Explore PyTorch distributed for multi-node K-Means
The implementation handles both L2 and cosine distances, and includes K-Means++ initialization.

Available on PyPI (`pip install pt_kmeans`); the full implementation is at https://gitlab.com/hassonofer/pt_kmeans

Would love feedback on the approach and any use cases I might have missed!
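For reference, K-Means++ seeding picks the first center uniformly and each subsequent center with probability proportional to the squared distance from the nearest center chosen so far. The sketch below shows the standard algorithm in PyTorch (not pt_kmeans's exact implementation); note that it computes distances over the full dataset at once, which is why the whole dataset must fit on the GPU when initializing there.

```python
import torch

def kmeans_pp_init(x, k, generator=None):
    """Standard K-Means++ seeding; illustrative sketch,
    not the pt_kmeans implementation."""
    n = x.shape[0]
    first = torch.randint(n, (1,), generator=generator)
    centers = [x[first.item()]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = torch.cdist(x, torch.stack(centers)).min(dim=1).values ** 2
        probs = d2 / d2.sum()
        # sample the next center proportionally to d2
        nxt = torch.multinomial(probs, 1, generator=generator)
        centers.append(x[nxt.item()])
    return torch.stack(centers)
```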