Amazon shopping automation without vision: verification gates + a local model (3B)
A common approach to automating Amazon shopping, or similarly complex websites, is to reach for a large cloud model (often vision-capable). I wanted to test a contradiction: can a ~3B-parameter local LLM complete the flow using only structured page data (the DOM) plus deterministic assertions?
This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1 and 2 are intermediate controls.
More technical detail (architecture, code excerpts, additional log snippets):
https://www.sentienceapi.com/blog/verification-layer-amazon-case-study
Demo 0 vs Demo 3:
Demo 0 (cloud, GLM-4.6 + structured snapshots)
- success: 1/1 run
- tokens: 19,956 (~43% reduction vs a ~35k estimate)
- time: ~60,000 ms
- cost: cloud API (varies)
- vision: not required
Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor)
- success: 7/7 steps (on a re-run)
- tokens: 11,114
- time: 405,740 ms
- cost: $0.00 incremental (local inference)
- vision: not required
Latency note: the local stack is slower end-to-end largely because inference runs on local hardware (a Mac Studio with an M4); the cloud baseline benefits from hosted inference but carries a per-token API cost.
Architecture
This worked because we changed the control plane and added a verification loop.
1) Constrain what the model sees (DOM pruning).
We don’t feed in the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).
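A minimal sketch of what such a pruning pass might look like (pure Python for illustration; the real pass runs in WASM, and the node fields and keep-rules here are assumptions):
```python
from dataclasses import dataclass

@dataclass
class RawNode:
    id: int
    role: str                            # e.g. "textbox", "button", "link"
    text: str                            # visible text
    bbox: tuple[int, int, int, int]      # (x, y, w, h) geometry

# Roles worth keeping; everything else gets pruned (often ~95% of nodes).
INTERACTIVE_ROLES = {"textbox", "button", "link", "combobox", "checkbox"}

def semantic_snapshot(raw: list[RawNode], max_text: int = 80) -> list[dict]:
    """Keep only interactive or text-bearing nodes, truncating long text."""
    return [
        {"id": n.id, "role": n.role, "text": n.text.strip()[:max_text], "bbox": n.bbox}
        for n in raw
        if n.role in INTERACTIVE_ROLES or n.text.strip()
    ]
```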
2) Split reasoning from acting (planner vs. executor).
- Planner (reasoning): DeepSeek R1 (local) generates the step intent plus what must be true afterward.
- Executor (action): Qwen ~3B (local) selects concrete DOM actions such as CLICK(id) / TYPE(text); see the sketch after this list.
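To make the split concrete, here is a hypothetical shape for the planner→executor contract (the field and helper names are my assumptions, not the actual message format):
```python
from typing import TypedDict

class PlanStep(TypedDict):
    intent: str                  # e.g. "open the first search result"
    postconditions: list[str]    # what must be true after the action

# The planner (DeepSeek R1, local) would emit something like:
step: PlanStep = {
    "intent": "open the first search result",
    "postconditions": ["url_changed", "exists('role=button')"],
}

def execute(step: PlanStep, snapshot: list[dict]) -> str:
    """The executor (Qwen ~3B, local) grounds the intent in a concrete DOM action."""
    links = [n for n in snapshot if n["role"] == "link"]
    return f"CLICK({links[0]['id']})" if links else "FAIL"
```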
3) Gate every step with Jest-style verification.
After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.
Minimal shape:
```python
ok = await runtime.check(
exists("role=textbox"),
label="search_box_visible",
required=True,
).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)
```
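A check like this plausibly reduces to a predicate plus a poll-until-timeout loop over fresh snapshots. A sketch of both, assuming `snapshot_fn` returns the pruned node list described above (the actual runtime internals are not shown in the post):
```python
import asyncio
import time

def exists(selector: str):
    """Predicate factory: does any snapshot node match a 'key=value' selector?"""
    key, _, value = selector.partition("=")
    return lambda snapshot: any(str(n.get(key)) == value for n in snapshot)

async def eventually(predicate, snapshot_fn, timeout_s=10.0, poll_s=0.25,
                     max_snapshot_attempts=3) -> bool:
    """Re-snapshot and re-check until the predicate holds or a bound is hit."""
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while time.monotonic() < deadline and attempts < max_snapshot_attempts:
        attempts += 1
        snapshot = await snapshot_fn()    # fresh semantic snapshot of the page
        if predicate(snapshot):
            return True                   # assertion holds -> step passes
        await asyncio.sleep(poll_s)
    return False                          # required check failed -> step fails
```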
What changed between “agents that look smart” and agents that work
Two examples from the logs:
- A deterministic override to enforce the “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”
- Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”
The important point is that these are not post-hoc analytics. They are inline gates: the system either proves it made progress, or it stops and recovers.
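A sketch of what the first log line’s override might look like, assuming product links are identifiable in the snapshot (the `href` field and the selection rule are guesses on my part):
```python
def enforce_first_result(proposed: str, snapshot: list[dict]) -> str:
    """Deterministically force the 'first result' intent, ignoring the
    executor's proposal whenever it would click anything else."""
    product_links = [n for n in snapshot
                     if n["role"] == "link" and "/dp/" in n.get("href", "")]
    if product_links:
        first = min(product_links, key=lambda n: n["bbox"][1])  # topmost on page
        return f"CLICK({first['id']})"  # logged as "[override] ... -> CLICK(1022)"
    return proposed
```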
Takeaway
If you’re trying to make browser agents reliable, the highest-leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.
Reliability in agents comes from verification (assertions on structured snapshots), not just from scaling model size.