I've been scraping 241 UK council planning portals: 2.6M decisions so far

20 points by mebkorea, about 8 hours ago
UK planning data is technically public. In practice it's locked behind 400+ different council portals, some still running bespoke ASP.NET that looks like it dates from 2004, some behind AWS WAF, all with subtly different schemas. I've spent four months scraping them. I'm now at 241 councils and 2.6 million decisions across England, Scotland and Wales.

The scraping problem

Most UK councils run one of a handful of portal systems, Idox being the most common. In theory this makes things easy. In practice every council has configured theirs differently, some block non-browser requests via TLS fingerprinting, some have rate limits that will get you banned inside 10 minutes, and a handful are running the aforementioned bespoke ASP.NET.

I ended up writing several scrapers: a standard requests-based one, a Playwright-based one for councils that block anything that doesn't look like a real browser, and a curl_cffi one for TLS fingerprinting. Some councils I still can't get. Liverpool's portal sits behind AWS WAF with a JavaScript challenge. I have a working Playwright-based scraper that solves the challenge once and reuses cookies, but the WAF rate-limits the IP after about 10 requests and then blocks me for a day. So I have 60k Liverpool decisions from an old scrape and no easy way to add more.

What I found

The approval rate stuff is what most people come for. Nationally it's around 88%, but it varies wildly by ward within a council, not just between councils.

The more interesting finding came from the time-to-decision data. Across 119 English and Welsh councils, 36.5% of home extension applications missed the statutory 8-week target in 2025, up from 27.9% in 2019. Guildford is the worst at scale: 66% of decisions over target, averaging 13.3 weeks.

What it is now

A postcode checker (free) and paid PDF reports (£19/£79). Zero paying customers so far, which is fine. I've been heads down on data quality and coverage.

Site is planninglens.co.uk if you want to poke around. AMA on the scraping side, that's where the interesting problems are.
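
For anyone curious how the three-scraper setup described under "The scraping problem" might fit together, here is a minimal sketch of one possible arrangement: try the cheapest client first and escalate only when a council appears to block it. This is not the code behind the site. The `TieredScraper` class, the `Fetcher` interface, and the choice of 403/429 as "blocked" signals are all assumptions made for illustration, and the real fetchers (requests, curl_cffi, Playwright) are left as pluggable callables so the sketch stays dependency-free.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: a fetcher takes a URL and returns (status_code, body).
Fetcher = Callable[[str], Tuple[int, str]]

# Assumption: these statuses mean "blocked, a heavier client might get through".
BLOCK_STATUSES = {403, 429}


@dataclass
class TieredScraper:
    # Ordered cheapest first, e.g. plain HTTP, then TLS-impersonating, then a real browser.
    tiers: List[Tuple[str, Fetcher]]
    # Remembers which tier last worked per council, so later calls skip known failures.
    preferred: Dict[str, int] = field(default_factory=dict)

    def fetch(self, council: str, url: str) -> Optional[Tuple[str, str]]:
        """Return (tier_name, body) from the first tier that succeeds, else None."""
        start = self.preferred.get(council, 0)
        for index in range(start, len(self.tiers)):
            name, fetcher = self.tiers[index]
            status, body = fetcher(url)
            if status == 200:
                self.preferred[council] = index
                return name, body
            if status not in BLOCK_STATUSES:
                # e.g. 404 or 500: a heavier client will not help, so give up here.
                return None
        return None  # every remaining tier was blocked
```

In practice each tier would wrap a real client, and block detection would need to be per-portal, since some WAFs answer with a challenge page rather than an error status. A per-council delay between requests would also belong here, given the rate limits mentioned above.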