请问HN:有没有搜索引擎全面使用Common Crawl?

1作者: n1xis10t大约 16 小时前原帖
Common Crawl大约包含3000亿个页面,如果将所有内容以提取文本格式下载,压缩后仅占约816 TB。如果有人利用这些数据创建一个搜索引擎,我认为它会比Bing更全面,可能与Google相似。我所知道的基于Common Crawl的搜索引擎仅使用了它们可用数据的一小部分。你知道有没有使用全部数据的搜索引擎吗?
查看原文
The Common Crawl has about 300 billion pages in it, and if you downloaded all of it in extracted text format it would only take up about 816 TB compressed. If someone were to make a search engine with this I think it would be more comprehensive than Bing, and possibly pretty similar to Google. The only CC based search engines that I know of use a tiny fraction of what they have available. Do you know of any that use the whole thing?