Aura-State: A formally verified state machine compiler for LLM workflows

Posted by rohanmunshi08 · 2 months ago
I noticed a pattern: every LLM framework today lets the AI manage state and do math. Then we wonder why pipelines hallucinate numbers and break at 3 AM.

I took a different approach and built Aura-State, an open-source Python framework that compiles LLM workflows into formally verified state machines.

Instead of hoping the AI figures it out, I brought in real algorithms from hardware verification and statistical learning:

- CTL Model Checking: the same technique used to verify flight control systems, now applied to LLM workflow graphs. Proves safety properties before execution.
- Z3 Theorem Prover: every LLM extraction gets formally proven against business constraints. If the total ≠ price × quantity, Z3 catches it with a counterexample.
- Conformal Prediction: distribution-free 95% confidence intervals on every extracted field. Not just "the LLM said $450k" but "95% CI: [$448k, $452k]."
- MCTS Routing: Monte Carlo Tree Search (the algorithm behind AlphaGo) scores ambiguous state transitions mathematically.
- Sandboxed Math: English math rules compile to Python AST. Zero-hallucination calculations.

I ran a live benchmark against 10 real-estate sales transcripts using GPT-4o-mini:

→ 100% budget extraction accuracy ($0 mean error)
→ 20/20 Z3 proof obligations passed
→ 3/3 temporal safety properties proven
→ 65 automated tests passing

The gap between "it usually works" and "it provably works" is smaller than people think.

Would love feedback from anyone building production LLM systems; what would you want formally verified?

https://github.com/munshi007/Aura-State