Smart-KNN: a production-oriented, feature-weighted KNN optimized for CPU
Hi HN,

I've been working on SmartKNN, a nearest-neighbor system designed specifically for production deployment rather than academic experimentation.

The goal was not to slightly tweak classical KNN, but to restructure it into a deployable, latency-aware system while preserving interpretability.
**What it does differently**

Traditional KNN is simple and interpretable, but in practice it struggles with:

- Inference latency that grows with dataset size
- Equal treatment of all features
- Fixed distance metrics
- Unpredictable performance under load

SmartKNN addresses these issues through:
1. **Learned Feature Weighting**

Feature importance is learned automatically and incorporated into the distance computation. This reduces noise and improves neighbor quality without manual tuning.
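The general technique can be sketched as a weighted Euclidean distance whose per-feature weights come from a simple relevance estimate such as mutual information. This is an illustration of the idea only, not SmartKNN's actual weighting scheme:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate per-feature relevance, then normalize into weights summing to 1.
mi = mutual_info_classif(X, y, random_state=0)
w = mi / mi.sum()

def weighted_distance(a, b, w):
    """Weighted Euclidean distance: low-relevance (noisy) features
    contribute less to the neighbor ranking."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

d = weighted_distance(X[0], X[1], w)
```

Any monotone relevance score (gain from a tree model, correlation, a learned metric) could replace mutual information here; the point is that the weights are fitted from data rather than hand-tuned.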
2. **Adaptive Distance Behavior**

Distance computation adapts to learned feature relevance instead of relying on a fixed metric such as plain Euclidean distance.
3. **Backend Selection**

SmartKNN supports both brute-force and approximate nearest-neighbor strategies:

- Small datasets → brute-force search
- Larger datasets → approximate candidate retrieval

Approximate search is used only to retrieve candidates; the final prediction always uses the learned distance function.
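That two-stage pattern — cheap candidate retrieval followed by exact re-ranking with the learned metric — can be sketched in a few lines. This is a toy illustration (plain unweighted distance stands in for an ANN index), not SmartKNN's actual backend:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))   # indexed dataset
w = rng.random(8)
w /= w.sum()                       # stand-in for learned feature weights
q = rng.normal(size=8)             # query point

# Stage 1: cheap candidate retrieval. A real system would use an ANN
# index here; only a small candidate pool survives this stage.
coarse = np.linalg.norm(X - q, axis=1)
candidates = np.argsort(coarse)[:100]

# Stage 2: exact re-ranking of candidates with the learned weighted metric.
fine = np.sqrt(((X[candidates] - q) ** 2 * w).sum(axis=1))
k = 5
neighbors = candidates[np.argsort(fine)[:k]]
```

The design keeps the approximate structure out of the prediction path: the ANN index only bounds how many points the exact, learned distance must score.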
4. **CPU-Focused Design**

The system is optimized for predictable CPU inference performance rather than GPU-heavy workflows. The focus is on stable latency characteristics suitable for production workloads.
5. **Unified API**

Supports both classification and regression through a scikit-learn-compatible interface.
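In practice, "scikit-learn compatible" means following the standard estimator contract (`fit`/`predict`/`get_params`), so the model drops into pipelines and model-selection tools unchanged. The snippet below demonstrates that contract with scikit-learn's own `KNeighborsClassifier` as a stand-in; consult the repository for SmartKNN's actual class names:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Any estimator implementing the scikit-learn contract can be swapped in
# here and reused with pipelines and cross-validation unchanged.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5)
```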
**Performance**

On structured/tabular datasets with strong local structure, SmartKNN achieves accuracy competitive with tree-based models.

It does not aim to replace tree models or neural networks universally; it performs best where neighborhood structure is meaningful and interpretability is desired.
**Limitations**

- Requires the dataset to remain in memory
- High-dimensional dense data can still challenge nearest-neighbor methods
- No online/incremental updates yet
- Backend preparation adds setup time for large datasets
**Project Status**

- Public release: 0.2.2
- Stable API
- Open source
- CPU-optimized core
Repository:
https://github.com/thatipamula-jashwanth/smart-knn

I'd appreciate feedback, especially from people who have deployed nearest-neighbor systems in production.

Thanks.

- Jashwanth