Show HN: Detecting coordinated financial narratives with embeddings and AVX2
I built an open-source system called Horaculo that analyzes coordination and divergence across financial news sources. The goal is to quantify narrative alignment, entropy shifts, and historical source reliability.
**Pipeline**
1. Fetch 50–100 articles (NewsAPI)
2. Extract claims (NLP preprocessing)
3. Generate sentence embeddings (HuggingFace)
4. Compute cosine similarity in C++ (AVX2 + INT8 quantization)
5. Cluster narratives
6. Compute entropy and coordination metrics
7. Weight results by historical source credibility
8. Output structured JSON signals
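As a rough illustration of steps 3–4, the code below builds a pairwise cosine-similarity matrix over claim embeddings with numpy (in the actual pipeline this step runs in C++ with AVX2 and INT8; random vectors stand in for real HuggingFace embeddings here):

```python
import numpy as np

def cosine_matrix(embeddings):
    """Pairwise cosine similarity between claim embeddings.

    embeddings: (n_claims, dim) float32 array, e.g. from a
    HuggingFace sentence-embedding model.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

rng = np.random.default_rng(42)
sims = cosine_matrix(rng.normal(size=(8, 384)).astype(np.float32))
print(sims.shape)  # (8, 8)
```

The resulting matrix is what the clustering step (step 5) consumes.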
**Example output (query: "oil")**
```json
{
  "verdict": {
    "winner_source": "Reuters",
    "intensity": 0.85,
    "entropy": 1.92
  },
  "psychology": {
    "mood": "Fear",
    "is_trap": true,
    "coordination_score": 0.72
  }
}
```
**What it measures**
- Intensity → narrative divergence
- Entropy → informational disorder
- Coordination score → cross-source alignment
- Credibility weighting → historical consensus accuracy per source
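For reference, a minimal sketch of how narrative entropy could be computed over the cluster distribution (plain Shannon entropy in nats; Horaculo's exact formulation may differ):

```python
import math
from collections import Counter

def narrative_entropy(cluster_labels):
    """Shannon entropy (nats) of the narrative-cluster distribution.

    Higher values mean claims are spread across many competing
    narratives; 0 means every claim fell into a single cluster.
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    probs = (c / total for c in counts.values())
    return -sum(p * math.log(p) for p in probs)

# Three narratives covering 5, 3, and 2 of 10 extracted claims
print(round(narrative_entropy([0] * 5 + [1] * 3 + [2] * 2), 3))  # 1.03
```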
**Performance**
- 1.4 s per query (~10 sources)
- ~100 queries/min
- ~150 MB memory footprint
- Python-only version took ~12 s
- C++ optimizations:
  - INT8 embedding quantization (4x size reduction)
  - AVX2 SIMD vectorized cosine similarity
  - PyBind11 integration layer
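The quantization idea is straightforward to show in numpy: symmetric INT8 codes with one scale per vector, with cosine similarity recovered from an integer dot product (the int32 accumulation is what the AVX2 kernel vectorizes). This is an illustrative sketch, not the shipped C++ code:

```python
import numpy as np

def quantize_int8(emb):
    """Symmetric INT8 quantization: int8 codes plus one float scale,
    giving a 4x size reduction over float32."""
    scale = float(np.abs(emb).max()) / 127.0
    q = np.round(emb / scale).astype(np.int8)
    return q, scale

def cosine_int8(qa, sa, qb, sb):
    """Cosine similarity from INT8 codes and per-vector scales."""
    dot = float(np.dot(qa.astype(np.int32), qb.astype(np.int32))) * sa * sb
    na = np.linalg.norm(qa.astype(np.float32)) * sa
    nb = np.linalg.norm(qb.astype(np.float32)) * sb
    return dot / (na * nb)

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
exact = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
approx = cosine_int8(qa, sa, qb, sb)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

At 384 dimensions the quantization error in the cosine score is typically well below 0.01.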
**Storage**
- SQLite (local, in-memory)
- Optional Postgres
Each source builds a rolling credibility profile:
```json
{
  "source": "Reuters",
  "total_scans": 342,
  "consensus_hits": 289,
  "credibility": 0.85
}
```
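A rolling profile like this can be maintained with a simple counter update after each scan. The field names below follow the JSON example; the prior counts are hypothetical and the real update logic may differ:

```python
def update_credibility(profile, agreed_with_consensus):
    """Increment scan counters and recompute the consensus hit ratio."""
    profile["total_scans"] += 1
    if agreed_with_consensus:
        profile["consensus_hits"] += 1
    profile["credibility"] = round(
        profile["consensus_hits"] / profile["total_scans"], 2
    )
    return profile

# Hypothetical state one scan before the example JSON above
p = {"source": "Reuters", "total_scans": 341,
     "consensus_hits": 288, "credibility": 0.84}
update_credibility(p, agreed_with_consensus=True)
print(p)  # total_scans=342, consensus_hits=289, credibility=0.85
```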
**Open source (MIT license)**
GitHub: [https://github.com/ANTONIO34346/HORACULO](https://github.com/ANTONIO34346/HORACULO)
I'm particularly interested in feedback on:
- The entropy modeling approach
- The coordination detection methodology
- Whether FAISS would be a better fit than the current SIMD engine
- Scalability strategies for 100k+ embeddings