I've been scraping 241 UK council planning portals: 2.6M decisions so far
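UK planning data is technically public. In practice it's locked behind 400+ different council portals: some still run bespoke ASP.NET that looks like it dates from 2004, some sit behind AWS WAF, and all have subtly different schemas. I've spent four months scraping them. I'm now at 241 councils and 2.6 million decisions across England, Scotland and Wales.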
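The scraping problem
Most UK councils run one of a handful of portal systems, Idox being the most common. In theory this makes things easy. In practice every council has configured theirs differently: some block non-browser requests via TLS fingerprinting, some have rate limits that will get you banned inside 10 minutes, and a handful are running the aforementioned bespoke ASP.NET.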
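To make the rate-limit point concrete, here is a minimal sketch of a per-host throttle that keeps a crawler under a fixed request rate. It is an illustration only, not the code behind the site: `HostThrottle`, its parameters and the example interval are hypothetical, and a safe interval for any real portal would have to be found by trial.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum gap between requests to the same host (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until this URL's host may be hit again; return the seconds slept."""
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval_s - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay
```

Calling `throttle.wait(url)` before every request spaces out traffic per council, so a slow portal does not hold up the others.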
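I ended up writing several scrapers: a standard requests-based one, a Playwright-based one for councils that block anything that doesn't look like a real browser, and a curl_cffi one for councils that use TLS fingerprinting. Some councils I still can't get. Liverpool's portal sits behind AWS WAF with a JavaScript challenge. I have a working Playwright-based scraper that solves the challenge once and reuses the cookies, but the WAF rate-limits the IP after about 10 requests and then blocks me for a day. So I have 60k Liverpool decisions from an old scrape and no easy way to add more.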
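One way to combine several scrapers like these is a fallback chain that tries the cheapest strategy first and escalates only when a portal refuses it. The sketch below shows that idea under stated assumptions. `Blocked`, `fetch_with_fallback` and the fetcher names are hypothetical, the fetchers are plain callables standing in for real requests, curl_cffi or Playwright code, and the post does not say the scrapers are actually wired together this way.

```python
class Blocked(Exception):
    """Raised by a fetcher when the portal refuses the request (hypothetical)."""


def fetch_with_fallback(url, fetchers, preferred=None):
    """Try each strategy in order; return (name, body) for the first that works.

    `fetchers` is an ordered list of (name, callable) pairs, cheapest first.
    `preferred` is the name that worked last time for this council; it is
    tried first, and the remaining strategies keep their original order.
    """
    # sorted() is stable, so only the preferred entry moves to the front.
    ordered = sorted(fetchers, key=lambda pair: pair[0] != preferred)
    errors = []
    for name, fetch in ordered:
        try:
            return name, fetch(url)
        except Blocked as exc:
            errors.append(f"{name}: {exc}")
    raise Blocked("all strategies failed: " + "; ".join(errors))
```

Storing the returned name per council and passing it back as `preferred` avoids spending requests on strategies already known to fail, which matters when a portal allows only a handful of requests before blocking.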
What I found
The approval rate stuff is what most people come for. Nationally it's around 88%, but it varies wildly by ward within a council, not just between councils.
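A ward-level breakdown like this comes down to a grouped count. The sketch below assumes each decision is a dict with hypothetical `council`, `ward` and `outcome` keys. The real schema is not described in the post, and the `min_decisions` cut-off is an added assumption to keep tiny wards from producing noisy rates.

```python
from collections import defaultdict


def approval_rate_by_ward(decisions, min_decisions=1):
    """Return {(council, ward): approval rate} for groups with enough decisions."""
    counts = defaultdict(lambda: [0, 0])  # (council, ward) -> [approved, total]
    for d in decisions:
        key = (d["council"], d["ward"])
        counts[key][1] += 1
        if d["outcome"] == "approved":  # assumed label; real portals word this differently
            counts[key][0] += 1
    return {
        key: approved / total
        for key, (approved, total) in counts.items()
        if total >= min_decisions
    }
```

Keying on the (council, ward) pair rather than the ward name alone avoids merging wards that share a name across councils.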
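The more interesting finding came from the time-to-decision data. Across 119 English and Welsh councils, 36.5% of home extension applications missed the statutory 8-week target in 2025, up from 27.9% in 2019. Guildford is the worst at scale: 66% of decisions over target, averaging 13.3 weeks.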
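For readers wondering how a figure like this is derived, here is a rough sketch of the calculation. It rests on assumptions the post does not confirm: that the clock runs from a `validated` date to a `decided` date (both hypothetical field names), that the 8 weeks are counted as 56 calendar days, and that agreed extensions of time are ignored.

```python
from datetime import date

TARGET_DAYS = 8 * 7  # the 8-week target, counted here in calendar days (assumption)


def missed_target_share(applications, target_days=TARGET_DAYS):
    """Return (share of decisions over target, mean weeks to decision).

    Applications with no decision date yet are skipped. Returns (None, None)
    when there is nothing to measure.
    """
    durations = [
        (a["decided"] - a["validated"]).days
        for a in applications
        if a.get("decided") is not None
    ]
    if not durations:
        return None, None
    over = sum(1 for days in durations if days > target_days)
    mean_weeks = sum(durations) / len(durations) / 7
    return over / len(durations), mean_weeks
```

Different choices for the start date or for handling extensions would shift the percentages, so the definition needs to be fixed before comparing councils or years.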
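What it is now
A postcode checker (free) and paid PDF reports (£19/£79). Zero paying customers so far, which is fine. I've been heads down on data quality and coverage.
Site is planninglens.co.uk if you want to poke around. AMA on the scraping side; that's where the interesting problems are.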