Aura-State: A formally verified state machine compiler for LLM workflows

Posted by rohanmunshi08 · 2 months ago
I noticed a pattern: every LLM framework today lets the AI manage state and do math. Then we wonder why pipelines hallucinate numbers and break at 3 AM.

I took a different approach and built Aura-State, an open-source Python framework that compiles LLM workflows into formally verified state machines.

Instead of hoping the AI figures it out, I brought in real algorithms from hardware verification and statistical learning:

- CTL Model Checking: the same technique used to verify flight control systems, now applied to LLM workflow graphs. Proves safety properties before execution.
- Z3 Theorem Prover: every LLM extraction gets formally proven against business constraints. If the total ≠ price × quantity, Z3 catches it with a counterexample.
- Conformal Prediction: distribution-free 95% confidence intervals on every extracted field. Not just "the LLM said $450k" but "95% CI: [$448k, $452k]."
- MCTS Routing: Monte Carlo Tree Search (the algorithm behind AlphaGo) scores ambiguous state transitions mathematically.
- Sandboxed Math: English math rules compile to Python AST. Zero-hallucination calculations.

I ran a live benchmark against 10 real-estate sales transcripts using GPT-4o-mini:

→ 100% budget extraction accuracy ($0 mean error)
→ 20/20 Z3 proof obligations passed
→ 3/3 temporal safety properties proven
→ 65 automated tests passing

The gap between "it usually works" and "it provably works" is smaller than people think.

Would love feedback from anyone building production LLM systems; what would you want formally verified?

https://github.com/munshi007/Aura-State