We're the team behind Webhound (<a href="https://webhound.ai">https://webhound.ai</a>), an AI agent that builds datasets from the web based on natural language prompts. You describe what you're trying to find. The agent figures out how to structure the data and where to look, then searches, extracts the results, and outputs everything in a CSV you can export.

We've set up a special no-signup version for the HN community at <a href="https://hn.webhound.ai">https://hn.webhound.ai</a> - just click "Continue as Guest" to try it without signing up.

Here's a demo: <a href="https://youtu.be/fGaRfPdK1Sk" rel="nofollow">https://youtu.be/fGaRfPdK1Sk</a>

We started building it after getting tired of doing this kind of research manually. Open 50 tabs, copy everything into a spreadsheet, realize it's inconsistent, start over. It felt like something an LLM should be able to handle.

Some examples of how people have used it in the past month:

Competitor analysis: "Create a comparison table of internal tooling platforms (Retool, Appsmith, Superblocks, UI Bakery, BudiBase, etc) with their free plan limits, pricing tiers, onboarding experience, integrations, and how they position themselves on their landing pages." (<a href="https://www.webhound.ai/dataset/c67c96a6-9d17-4c91-b9a0-ff6927c44f80">https://www.webhound.ai/dataset/c67c96a6-9d17-4c91-b9a0-ff69...</a>)

Lead generation: "Find Shopify stores launched recently that sell skincare products. I want the store URLs, founder names, emails, Instagram handles, and product categories." (<a href="https://www.webhound.ai/dataset/b63d148a-8895-4aab-ac34-455e341c67c8">https://www.webhound.ai/dataset/b63d148a-8895-4aab-ac34-455e...</a>)

Pricing tracking: "Track how the free and paid plans of note-taking apps have changed over the past 6 months using official sites and changelogs. List each app with a timeline of changes and the source for each." (<a href="https://www.webhound.ai/dataset/c17e6033-5d00-4e54-baf6-8deab09e85d7">https://www.webhound.ai/dataset/c17e6033-5d00-4e54-baf6-8dea...</a>)

Investor mapping: "Find VCs who led or participated in pre-seed or seed rounds for browser-based devtools startups in the past year. Include the VC name, relevant partners, contact info, and portfolio links for context." (<a href="https://www.webhound.ai/dataset/1480c053-d86b-40ce-a620-37fda3444340">https://www.webhound.ai/dataset/1480c053-d86b-40ce-a620-37fd...</a>)

Research collection: "Get a list of recent arXiv papers on weak supervision in NLP. For each, include the abstract, citation count, publication date, and a GitHub repo if available." (<a href="https://www.webhound.ai/dataset/e274ca26-0513-4296-85a5-2b7b7c423ce2">https://www.webhound.ai/dataset/e274ca26-0513-4296-85a5-2b7b...</a>)

Hypothesis testing: "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site and show the most relevant posts with timestamps and engagement metrics." (<a href="https://www.webhound.ai/dataset/42b2de49-acbf-4851-bbb7-080b66e845cd">https://www.webhound.ai/dataset/42b2de49-acbf-4851-bbb7-080b...</a>)

The first version of Webhound was a single agent running on Claude 4 Sonnet. It worked, but sessions routinely cost over $1100 and it would often get lost in infinite loops. We knew that wasn't sustainable, so we started building around smaller models.

That meant adding more structure. We introduced a multi-agent system to keep it reliable and accurate. There's a main agent, a set of search agents that run subtasks in parallel, a critic agent that keeps things on track, and a validator that double-checks extracted data before saving it. We also gave it a notepad for long-term memory, which helps avoid duplicates and keeps track of what it's already seen.

After switching to Gemini 2.5 Flash and layering in the agent system, we were able to cut costs by more than 30x while also improving speed and output quality.

The system runs in two phases. First is planning, where it decides the schema, how to search, what sources to use, and how to know when it's done. Then comes extraction, where it executes the plan and gathers the data.

It uses a text-based browser we built that renders pages as markdown and extracts content directly. We tried full browser use but it was slower and less reliable. Plain text still works better for this kind of task.

We also built scheduled refreshes to keep datasets up to date and an API so you can integrate the data directly into your workflows.

Right now, everything stays in the agent's context during a run. It starts to break down around 1000-5000 rows depending on the number of attributes. We're working on a better architecture for scaling past that.

We'd love feedback, especially from anyone who's tried solving this problem or built similar tools. Happy to answer anything in the thread.

Thanks! Moe
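
To make the notepad and validator steps described in the post more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not Webhound's actual implementation: the `Notepad` class, the `validate` and `extract` functions, the "every schema column must be non-empty" rule, and the placeholder rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Notepad:
    """Hypothetical long-term memory: remembers which entities were already saved."""

    seen: set[str] = field(default_factory=set)

    def is_new(self, key: str) -> bool:
        # Normalize so "Retool" and "retool " count as the same entity.
        normalized = key.strip().lower()
        if normalized in self.seen:
            return False
        self.seen.add(normalized)
        return True


def validate(row: dict[str, str], schema: list[str]) -> bool:
    """Hypothetical validator: every schema column must be present and non-empty."""
    return all(row.get(col, "").strip() for col in schema)


def extract(
    candidates: list[dict[str, str]],
    schema: list[str],
    key_column: str,
    notepad: Notepad,
) -> list[dict[str, str]]:
    """Keep only rows that pass validation and were not saved before."""
    saved = []
    for row in candidates:
        # Validate first, so an incomplete row never marks its entity as seen.
        if not validate(row, schema):
            continue
        if not notepad.is_new(row[key_column]):
            continue
        saved.append({col: row[col] for col in schema})
    return saved


# Assumed output of the planning phase: a schema and a key column.
schema = ["name", "url"]

# Placeholder candidates standing in for what search agents might return.
candidates = [
    {"name": "Retool", "url": "https://example.com/retool"},
    {"name": "retool ", "url": "https://example.com/retool-pricing"},  # duplicate
    {"name": "Appsmith", "url": ""},  # incomplete, rejected by the validator
    {"name": "Appsmith", "url": "https://example.com/appsmith"},
]

rows = extract(candidates, schema, key_column="name", notepad=Notepad())
for row in rows:
    print(row)
```

Validating before consulting the notepad is a deliberate choice in this sketch: a rejected row does not block a later, complete row for the same entity.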

Launch HN: Webhound (YC S23) – Research assistant that builds datasets from the web