Smart-KNN: a production-oriented, feature-weighted KNN, optimized for CPU.

Author: Jashwanth01, 3 months ago
Hi HN,

I've been working on SmartKNN, a nearest-neighbor system designed specifically for production deployment rather than academic experimentation.

The goal was not to slightly tweak classical KNN, but to restructure it into a deployable, latency-aware system while preserving interpretability.

**What it does differently**

Traditional KNN is simple and interpretable, but in practice it struggles with:

- Inference latency as datasets grow
- Equal treatment of all features
- Fixed distance metrics
- Unpredictable performance under load

SmartKNN addresses these issues through:

1. **Learned feature weighting.** Feature importance is learned automatically and incorporated into the distance computation. This reduces noise and improves neighbor quality without manual tuning.

2. **Adaptive distance behavior.** Distance computation adapts to learned feature relevance instead of relying on a fixed metric like plain Euclidean.

3. **Backend selection.** SmartKNN supports both brute-force and approximate nearest-neighbor strategies:
   - small datasets → brute force
   - larger datasets → approximate candidate retrieval

   Approximate search is used only to retrieve candidates. The final prediction always uses the learned distance function.

4. **CPU-focused design.** The system is optimized for predictable CPU inference performance rather than GPU-heavy workflows. The focus is stable latency characteristics suitable for production workloads.

5. **Unified API.** Supports both classification and regression through a scikit-learn-compatible interface.

**Performance**

On structured/tabular datasets with strong local structure, SmartKNN achieves competitive accuracy against tree-based models. It does not aim to replace tree models or neural networks universally; it performs best where neighborhood structure is meaningful and interpretability is desired.

**Limitations**

- Requires the dataset to remain in memory
- High-dimensional dense data can still challenge nearest-neighbor methods
- No online/incremental updates yet
- Backend preparation adds setup time for large datasets

**Project status**

- Public release: 0.2.2
- Stable API
- Open source
- CPU-optimized core

Repository: https://github.com/thatipamula-jashwanth/smart-knn

I'd appreciate feedback, especially from people who have deployed nearest-neighbor systems in production.

Thanks.

- Jashwanth
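The learned feature weighting described above boils down to a weighted distance. A minimal sketch of the idea (the `weighted_distance` helper and the relevance scores here are illustrative, not Smart-KNN's actual API or learning procedure):

```python
import numpy as np

def weighted_distance(x, y, w):
    """Euclidean distance with a non-negative weight per feature."""
    d = x - y
    return float(np.sqrt(np.sum(w * d * d)))

# Two candidate neighbors and a query; feature 0 is assumed to be
# the relevant one, feature 1 mostly noise.
X = np.array([[0.1, 2.0],
              [1.0, 0.1]])
q = np.array([0.0, 0.0])

relevance = np.array([9.0, 1.0])                  # hypothetical learned scores
w = relevance / relevance.sum() * relevance.size  # normalize: mean weight = 1

plain = int(np.argmin([weighted_distance(q, x, np.ones(2)) for x in X]))
weighted = int(np.argmin([weighted_distance(q, x, w) for x in X]))
# Plain Euclidean picks row 1; the weighted metric picks row 0,
# because row 0 is much closer on the relevant feature.
```

The point of the weighting is exactly this disagreement: down-weighting a noisy feature changes which neighbor wins, without any manual tuning of the metric.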
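The backend-selection scheme (approximate retrieval for candidates, learned distance for the final ranking) can be sketched as a two-stage search. In this toy version the coarse stage is just an unweighted distance standing in for a real ANN index, and the weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # in-memory dataset, as in SmartKNN
w = np.ones(8)
w[0] = 5.0                       # hypothetical learned feature weights

def topk(d, k):
    """Indices of the k smallest distances, sorted ascending."""
    idx = np.argpartition(d, k)[:k]
    return idx[np.argsort(d[idx])]

# Query very close to row 42.
q = X[42] + 0.01 * rng.normal(size=8)

# Stage 1: cheap coarse distance retrieves a candidate pool
# (a stand-in for an approximate index such as HNSW).
d_coarse = np.linalg.norm(X - q, axis=1)
candidates = topk(d_coarse, 50)

# Stage 2: exact rerank of only the candidates with the learned
# weighted metric -- the final prediction never trusts stage 1 alone.
d_fine = np.sqrt(((X[candidates] - q) ** 2 * w).sum(axis=1))
neighbors = candidates[np.argsort(d_fine)][:5]
```

Only the candidate pool pays the approximate-search cost; the learned metric is evaluated on 50 rows instead of 1000, which is what keeps CPU latency predictable as the dataset grows.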
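As for the scikit-learn-compatible interface: a stripped-down estimator along these lines (not Smart-KNN's actual code) shows one way to fold learned weights into a standard KNN. It estimates weights via mutual information, then rescales features by the square root of the weights so that plain Euclidean distance in the scaled space equals the weighted metric:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

class WeightedKNNClassifier(BaseEstimator, ClassifierMixin):
    """Toy feature-weighted KNN with a fit/predict interface.

    Weights come from mutual information; sqrt(w)-scaling makes
    ordinary Euclidean distance act as the weighted metric.
    """
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        w = mutual_info_classif(X, y, random_state=0)
        self.scale_ = np.sqrt(w / (w.sum() + 1e-12) * X.shape[1])
        self.knn_ = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        self.knn_.fit(X * self.scale_, y)
        return self

    def predict(self, X):
        return self.knn_.predict(np.asarray(X, dtype=float) * self.scale_)

# Usage on a toy dataset where only feature 0 carries signal:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
clf = WeightedKNNClassifier().fit(X, y)
train_acc = float((clf.predict(X) == y).mean())
```

Because the estimator follows the fit/predict convention, it drops into pipelines and cross-validation like any other scikit-learn model, which is presumably the appeal of the unified API.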