Show HN: Kreuzberg v3.0 – Modern Python Document Extraction

2 points | by nhirschfeld | 26 days ago
I'm excited to announce Kreuzberg v3.0, which was released yesterday.

Kreuzberg is an MIT-licensed Python library that extracts text from a wide range of documents (PDFs, images, office files, etc.) without depending on external APIs.

It differs from other libraries and commercial offerings in this space by being designed to be (1) lightweight, (2) CPU-oriented, (3) simple to use, and (4) built with async support as a first-class citizen.

The v3.0 release completely reworks the architecture for extensibility. Kreuzberg now supports:

- Multiple OCR backends (Tesseract, PaddleOCR, EasyOCR), with OCR itself being completely optional.
- Custom extractors and overriding of built-in extractors.
- Post-processing and validation hooks.
- Extensive PDF metadata extraction.
- Optional semantic chunking.

There is also a brand new documentation site at https://goldziher.github.io/kreuzberg.

I also published a roadmap for the project, which you can see here: https://github.com/Goldziher/kreuzberg/discussions/24

You can see the repo at https://github.com/Goldziher/kreuzberg. Please star it if you find it valuable, since this motivates me!
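
To make the extensibility points above a bit more concrete, here is a minimal, standard-library-only sketch of the general pattern: an async extractor registry where built-in extractors can be overridden, followed by validation and post-processing hooks. This is not Kreuzberg's actual API. Every name in it (`Registry`, `Result`, `plain_text`, `shouting_text`, `require_content`, `strip_whitespace`, `build_registry`) is hypothetical and chosen for illustration only, and running validators before post-processors is an assumption of the sketch; see the documentation site for the real interfaces.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class Result:
    """Hypothetical extraction result (not Kreuzberg's own result type)."""
    content: str
    mime_type: str
    metadata: dict = field(default_factory=dict)


Extractor = Callable[[bytes], Awaitable[Result]]


class Registry:
    """Maps MIME types to async extractors and runs hooks on their output."""

    def __init__(self) -> None:
        self._extractors: dict[str, Extractor] = {}
        self.validators: list[Callable[[Result], None]] = []
        self.post_processors: list[Callable[[Result], Result]] = []

    def register(self, mime_type: str, extractor: Extractor) -> None:
        # Registering a MIME type again overrides the earlier extractor.
        self._extractors[mime_type] = extractor

    async def extract(self, data: bytes, mime_type: str) -> Result:
        extractor = self._extractors.get(mime_type)
        if extractor is None:
            raise ValueError(f"no extractor registered for {mime_type}")
        result = await extractor(data)
        for validate in self.validators:  # validators raise on bad results
            validate(result)
        for process in self.post_processors:  # each hook returns a new Result
            result = process(result)
        return result


async def plain_text(data: bytes) -> Result:
    """Stand-in for a built-in extractor."""
    return Result(data.decode("utf-8"), "text/plain")


async def shouting_text(data: bytes) -> Result:
    """Stand-in for a custom extractor that replaces the built-in one."""
    return Result(data.decode("utf-8").upper(), "text/plain")


def require_content(result: Result) -> None:
    """Validation hook: reject results that contain only whitespace."""
    if not result.content.strip():
        raise ValueError("empty extraction result")


def strip_whitespace(result: Result) -> Result:
    """Post-processing hook: trim leading and trailing whitespace."""
    return Result(result.content.strip(), result.mime_type, result.metadata)


def build_registry() -> Registry:
    registry = Registry()
    registry.register("text/plain", plain_text)
    registry.validators.append(require_content)
    registry.post_processors.append(strip_whitespace)
    return registry


if __name__ == "__main__":
    registry = build_registry()
    result = asyncio.run(registry.extract(b"  hello  ", "text/plain"))
    print(result.content)  # prints "hello"
```

Calling `registry.register("text/plain", shouting_text)` afterwards swaps the extractor for that MIME type without touching the hooks, which is the kind of override the list above describes.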