Smart-KNN: a production-oriented, feature-weighted KNN, optimized for CPU.

Author: Jashwanth01, 3 months ago
Hi HN,

I've been working on SmartKNN, a nearest-neighbor system designed specifically for production deployment rather than academic experimentation.

The goal was not to slightly tweak classical KNN, but to restructure it into a deployable, latency-aware system while preserving interpretability.

**What it does differently**

Traditional KNN is simple and interpretable, but in practice it struggles with:

- Inference latency as datasets grow
- Equal treatment of all features
- Fixed distance metrics
- Unpredictable performance under load

SmartKNN addresses these issues through:

1. **Learned feature weighting.** Feature importance is learned automatically and incorporated into the distance computation. This reduces noise and improves neighbor quality without manual tuning.

2. **Adaptive distance behavior.** Distance computation adapts to learned feature relevance instead of relying on a fixed metric like plain Euclidean.

3. **Backend selection.** SmartKNN supports both brute-force and approximate nearest-neighbor strategies:
   - small datasets → brute force
   - larger datasets → approximate candidate retrieval

   Approximate search is used only to retrieve candidates. The final prediction always uses the learned distance function.

4. **CPU-focused design.** The system is optimized for predictable CPU inference performance rather than GPU-heavy workflows. The focus is stable latency characteristics suitable for production workloads.

5. **Unified API.** Supports both classification and regression through a scikit-learn-compatible interface.

**Performance**

On structured/tabular datasets with strong local structure, SmartKNN achieves competitive accuracy against tree-based models. It does not aim to replace tree models or neural networks universally; it performs best where neighborhood structure is meaningful and interpretability is desired.

**Limitations**

- Requires the dataset to remain in memory
- High-dimensional dense data can still challenge nearest-neighbor methods
- No online/incremental updates yet
- Backend preparation adds setup time for large datasets

**Project status**

- Public release: 0.2.2
- Stable API
- Open source
- CPU-optimized core

Repository: https://github.com/thatipamula-jashwanth/smart-knn

I'd appreciate feedback, especially from people who have deployed nearest-neighbor systems in production.

Thanks.

- Jashwanth
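The learned feature weighting described above boils down to a weighted distance. A minimal sketch of the idea (the `weighted_distance` helper and the relevance scores here are illustrative, not Smart-KNN's actual API or learning procedure):

```python
import numpy as np

def weighted_distance(x, y, w):
    """Euclidean distance with a non-negative weight per feature."""
    d = x - y
    return float(np.sqrt(np.sum(w * d * d)))

# Two candidate neighbors and a query; feature 0 is assumed to be
# the relevant one, feature 1 mostly noise.
X = np.array([[0.1, 2.0],
              [1.0, 0.1]])
q = np.array([0.0, 0.0])

relevance = np.array([9.0, 1.0])                  # hypothetical learned scores
w = relevance / relevance.sum() * relevance.size  # normalize: mean weight = 1

plain = int(np.argmin([weighted_distance(q, x, np.ones(2)) for x in X]))
weighted = int(np.argmin([weighted_distance(q, x, w) for x in X]))
# Plain Euclidean picks row 1; the weighted metric picks row 0,
# because row 0 is much closer on the relevant feature.
```

The point of the weighting is exactly this disagreement: down-weighting a noisy feature changes which neighbor wins, without any manual tuning of the metric.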
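The backend-selection scheme (approximate retrieval for candidates, learned distance for the final ranking) can be sketched as a two-stage search. In this toy version the coarse stage is just an unweighted distance standing in for a real ANN index, and the weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # in-memory dataset, as in SmartKNN
w = np.ones(8)
w[0] = 5.0                       # hypothetical learned feature weights

def topk(d, k):
    """Indices of the k smallest distances, sorted ascending."""
    idx = np.argpartition(d, k)[:k]
    return idx[np.argsort(d[idx])]

# Query very close to row 42.
q = X[42] + 0.01 * rng.normal(size=8)

# Stage 1: cheap coarse distance retrieves a candidate pool
# (a stand-in for an approximate index such as HNSW).
d_coarse = np.linalg.norm(X - q, axis=1)
candidates = topk(d_coarse, 50)

# Stage 2: exact rerank of only the candidates with the learned
# weighted metric -- the final prediction never trusts stage 1 alone.
d_fine = np.sqrt(((X[candidates] - q) ** 2 * w).sum(axis=1))
neighbors = candidates[np.argsort(d_fine)][:5]
```

Only the candidate pool pays the approximate-search cost; the learned metric is evaluated on 50 rows instead of 1000, which is what keeps CPU latency predictable as the dataset grows.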
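As for the scikit-learn-compatible interface: a stripped-down estimator along these lines (not Smart-KNN's actual code) shows one way to fold learned weights into a standard KNN. It estimates weights via mutual information, then rescales features by the square root of the weights so that plain Euclidean distance in the scaled space equals the weighted metric:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

class WeightedKNNClassifier(BaseEstimator, ClassifierMixin):
    """Toy feature-weighted KNN with a fit/predict interface.

    Weights come from mutual information; sqrt(w)-scaling makes
    ordinary Euclidean distance act as the weighted metric.
    """
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        w = mutual_info_classif(X, y, random_state=0)
        self.scale_ = np.sqrt(w / (w.sum() + 1e-12) * X.shape[1])
        self.knn_ = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        self.knn_.fit(X * self.scale_, y)
        return self

    def predict(self, X):
        return self.knn_.predict(np.asarray(X, dtype=float) * self.scale_)

# Usage on a toy dataset where only feature 0 carries signal:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
clf = WeightedKNNClassifier().fit(X, y)
train_acc = float((clf.predict(X) == y).mean())
```

Because the estimator follows the fit/predict convention, it drops into pipelines and cross-validation like any other scikit-learn model, which is presumably the appeal of the unified API.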