[HN Gopher] Show HN: Kreuzberg - Modern async Python library for...
___________________________________________________________________
Show HN: Kreuzberg - Modern async Python library for document text
extraction
I'm excited to showcase Kreuzberg! Kreuzberg is a modern Python
library built from the ground up with async/await, type hints, and
optimized I/O handling. It provides a unified interface for
extracting text from documents (PDFs, images, office files) without
external API dependencies.

Key technical features:
- Built with modern Python best practices (async/await, type hints,
  functional-first)
- Optimized async I/O with anyio for multi-loop compatibility
- Smart worker process pool for CPU-bound tasks (OCR, doc conversion)
- Efficient batch processing with concurrent extractions
- Clean error handling with context-rich exceptions

I built this after struggling with existing solutions that were
either synchronous-only, required complex deployments, or had poor
async support. The goal was to create something that works well in
modern async Python applications, can be easily dockerized or used
in serverless contexts, and relies only on permissive OSS.

Key advantages over alternatives:
- True async support with optimized I/O
- Minimal dependencies (much smaller than alternatives)
- Perfect for serverless and async web apps
- Local processing without API calls
- Built for modern Python codebases with rigorous typing and testing

I would love feedback! The library is MIT licensed and open to
contributions. Here is the repo:
https://github.com/Goldziher/kreuzberg

Starring is caring
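
The "batch processing with concurrent extractions" idea can be
sketched with the standard library alone. This is only an
illustration under stated assumptions, not Kreuzberg's API or
implementation: extract_stub, extract_batch and run_batch are
hypothetical names, and the stub returns a placeholder string instead
of doing real extraction.

```python
import asyncio


async def extract_stub(name: str) -> str:
    """Hypothetical stand-in for an async extraction call (not Kreuzberg's API)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"text of {name}"


async def extract_batch(names: list[str], max_concurrency: int = 4) -> list[str]:
    """Run extractions concurrently, at most max_concurrency at a time."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(name: str) -> str:
        async with limit:
            return await extract_stub(name)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(n) for n in names))


def run_batch(names: list[str]) -> list[str]:
    """Sync entry point for callers that are not already in an event loop."""
    return asyncio.run(extract_batch(names))
```

The semaphore caps how many extractions are in flight, which matters
once the stub is replaced by something that holds file handles or
worker slots.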
Author : nhirschfeld
Score : 139 points
Date : 2025-02-15 10:07 UTC (12 hours ago)
| pseudony wrote:
| Interesting, thanks for sharing :)
|
| Can you speak to how this differs in PDF extraction from, say,
| pymupdf, pdfplumber, unsloth and so on ?
|
| I know the async part is probably a thing, but when building a
| RAG I would be brutally focused on the quality of text
| extraction. Have you noticed an ability to do better than others
| ?
| nhirschfeld wrote:
| So, for PDF we need to distinguish between two types of text
| extraction-
|
| 1. Text extraction from a searchable PDF.
|
| 2. OCR.
|
| For 1. Kreuzberg uses pypdfium2, which is a Python binding for
| pdfium - the Chromium PDF engine. In this regard Kreuzberg has
| top notch performance. Much faster than pdfminer.six, pdfplumber
| etc.
|
| Note PyMuPDF has top notch performance but also an AGPL
| license, and is almost unusable because of this without paying.
|
| For 2. Kreuzberg uses Tesseract, which is very solid.
| Performance is good, and Kreuzberg utilizes async worker
| processes to optimize concurrency.
|
| OCR though is a complex world. If what you need is to extract
| text from standard text documents (broadly speaking), Tesseract
| and hence Kreuzberg are a good choice.
|
| If what you need is things like layout extraction, handwriting
| recognition, complete bounding box metadata etc., then you need
| to use an alternative - probably a commercial one.
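
The two extraction paths described above (embedded text layer versus
OCR) can be sketched as a small dispatch function. This is a minimal
sketch, not Kreuzberg's logic: the name extract_page_text, the
min_chars threshold and the "too little text means scanned page"
heuristic are all assumptions made for illustration. In practice
text_layer would come from a PDF engine such as pypdfium2 and run_ocr
would wrap an OCR engine such as Tesseract.

```python
from typing import Callable


def extract_page_text(
    text_layer: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # assumed threshold, tune for real documents
) -> tuple[str, str]:
    """Return (method, text): use the text layer if it has enough content, else OCR."""
    cleaned = text_layer.strip()
    if len(cleaned) >= min_chars:
        # Searchable PDF: the embedded text is cheap and exact.
        return "text-layer", cleaned
    # Little or no embedded text: treat the page as an image and run OCR.
    return "ocr", run_ocr()
```

Passing run_ocr as a callable keeps the expensive OCR step lazy, so
it only runs for pages that actually need it.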
| dleeftink wrote:
| An oldy but goody for layout extraction is Cermine by
| Dominika Tkaczyk and colleagues[0]. Java required.
|
| [0]: http://cermine.ceon.pl/about.html
| nhirschfeld wrote:
| Didn't know this!
| mdaniel wrote:
| Also AGPLv3 https://github.com/CeON/CERMINE/blob/cermine-
| parent-1.13/LIC...
| ilaksh wrote:
| PaddleOCR layout works, and so do some open source large
| language vision models
| tomcam wrote:
| What is a RAG?
| nhirschfeld wrote:
| Retrieval Augmented Generation. It's a class of techniques for
| generating content using LLMs. I'd recommend Googling this.
| richrichardsson wrote:
| What led to the name choice?
| nhirschfeld wrote:
| That's my neighborhood in Berlin, which I love
| richrichardsson wrote:
| Ah, cool. I have a friend who lives there, so knew the name
| from that.
| jacomoRodriguez wrote:
| amazing that half of the comments revolve around the name and
| the Neighbourhood. But I also clicked the topic because of
| the name, hello neighbour :)
|
| jokes aside, really cool library. I'm currently working in a
| bigger project where we build a data lake with a wide variety
| of input sources and formats - this could be quite
| interesting for us.
| nhirschfeld wrote:
| Amazing, would be interested in reading your experience
| eamag wrote:
| Love the name!
|
| OCR was discussed here lately several times
| (https://news.ycombinator.com/item?id=42952605 and
| https://news.ycombinator.com/item?id=42871143), and some cool
| projects like https://github.com/Future-House/paper-
| qa?tab=readme-ov-file#... are using PyMuPDF. My experience with
| Tesseract is pretty sad, it's usually not good enough and modern
| LLMs are better.
| nhirschfeld wrote:
| Thanks, I'll check these links.
|
| In my tests I found tesseract quite good for regular text
| documents. For other kinds of texts it's not great.
|
| As for using models - there are some good small language models
| as well, and of course LLMs.
|
| I sorta feel though that if one needs complex OCR, or a vision
| model for layout, one should opt for either a commercial
| solution that abstracts the deployment and GPU management, or
| bake one's own system.
|
| For most use cases involving text documents though, my
| subjective opinion is that tesseract is sufficient.
| leif_lundberg wrote:
| Very cool, we've been using https://github.com/DS4SD/docling in
| our project, but will give this a try :)
| odiroot wrote:
| Do you have to watch your pockets when using this library?
| nhirschfeld wrote:
| lol ;).
|
| But seriously, in 13 years living here, only one guy tried to
| pick pocket me.
| tymm wrote:
| I've lived in 36 for 15 years or so. Wasn't as lucky as you :)
| nhirschfeld wrote:
| Sorry to hear...
| madisonmay wrote:
| pypdfium2 is a great choice and a solid piece of software!
|
| You might want to look into https://github.com/VikParuchuri/surya
| as an alternative to tesseract. Yes, it's associated with a
| commercial company, but as long as you aren't a company with
| $5M in ARR or $5M in funding it's free to use.
| nhirschfeld wrote:
| interesting!
| pzo wrote:
| This still seems to be GPL. Another OCR engine worth considering is
| EasyOCR [0] (Apache license). AFAIK there is no layout detection,
| but they do provide bounding boxes, support many languages, and
| can detect text on many different real-world objects in images
| (signposts, etc.).
|
| [0] https://github.com/JaidedAI/EasyOCR
| nhirschfeld wrote:
| Yup, EasyOCR is good.
|
| My reasons for using Tesseract - EasyOCR is larger, and it
| has a significant cold start.
|
| It benchmarks better for many OCR tasks though, so I'm
| thinking of adding it as an alternative backend.
| alex_suzuki wrote:
| Any experience with Paddle OCR?
| https://github.com/PaddlePaddle/PaddleOCR
|
| Personally I've used Tesseract before but the results were
| underwhelming, so I'm curious how Paddle OCR performs in
| comparison.
| nhirschfeld wrote:
| I haven't, testing it out is on my todo list for sure
| cdrini wrote:
| Where did you find benchmarks for OCR tools? There have
| been so many OCR engines coming lately, I would love to see
| benchmarks!
| nhirschfeld wrote:
| I googled this for a while...
| rednafi wrote:
| Gotta write something named Wedding, Schoneberg, or Pankow. Kewt
| names.
| a012 wrote:
| Don't forget Neukolln
| martin_balsam wrote:
| Garbage collect module (cfr. Neukollner for the past 12
| years)
| rednafi wrote:
| But multicultural. So I don't mind.
| socksy wrote:
| Not sure I would trust a garbage collector called Neukolln
| nhirschfeld wrote:
| I'm actually considering another library with optional API
| called `Kreuzkoln` - probably without the Umlaut!
| jacomoRodriguez wrote:
| Mitte?
| herval wrote:
| Too gentrified for Python
| jenadine wrote:
| Neuhohenschonhausen?
| rednafi wrote:
| Imagine having to import this or some nightmare like
| Hausvogteiplatz or Schlesisches Tor. Not German, and I
| wanna cry every time I have to pronounce these :v
| ant6n wrote:
| Python Zoo, Python Tiergarten...
| rednafi wrote:
| Python dependencies are tear garden for sure.
| flessner wrote:
| Moabit - maybe a name for a new crypto currency?
| diarrhea wrote:
| I'm curious about the async aspect of this. I was under the
| impression PDF processing like OCR is purely CPU bound. OS file
| I/O interfaces are sync, so async does not help. With GIL, so
| single threaded Python, I can't see how async improves
| performance for the PDF use case. Only parallelism helps, and
| concurrency doesn't. When would it yield back to the event loop
| when it's busy number crunching?
| nurettin wrote:
| It just litters perfectly reasonable python code with
| async/await. Maybe they are preparing for something we don't
| know, like a parallel async executor which can be set up to use
| native threads without changing code and somehow protects you
| if it detects shared state.
| diarrhea wrote:
| > It just litters perfectly reasonable python code with
| async/await
|
| Yeah. As an API consumer I would not expect a PDF API to do IO,
| and hence be async. Have the library be sans-io, the interfaces
| sync, and let callers from async code handle IO on their end,
| offloading to IO threads.
|
| Async is also referred to as "best practice", but it's just a
| tool, for specific use cases. And I say that as an "async
| fan"!
|
| That said, perhaps it's easier nowadays to just do async by
| default, as you say. The real world is async anyway, so why
| not program closer to that reality.
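
The sans-io design suggested above can be sketched as follows. This
is an illustration of the pattern only, under stated assumptions:
count_words stands in for a real parsing routine, and the names
read_bytes, count_words_in_file and demo are hypothetical. The core
function is synchronous and touches no files; the async caller moves
the blocking read onto a worker thread itself.

```python
import asyncio
import os
import tempfile


def count_words(data: bytes) -> int:
    """Sync, sans-io core: takes bytes, does no file or network access."""
    return len(data.decode("utf-8", errors="ignore").split())


def read_bytes(path: str) -> bytes:
    """Plain blocking file read."""
    with open(path, "rb") as fh:
        return fh.read()


async def count_words_in_file(path: str) -> int:
    """Async caller: offload the blocking read, then call the sync core."""
    data = await asyncio.to_thread(read_bytes, path)
    return count_words(data)


def demo() -> int:
    """Write a small temp file and count its words through the async path."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as fh:
            fh.write(b"one two three")
        return asyncio.run(count_words_in_file(path))
```

With this split, the library never needs to know which event loop, if
any, its caller is using.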
| nhirschfeld wrote:
| That's why Kreuzberg also exposes a sync API for you to
| consume.
| ismailmaj wrote:
| It is probably not worth the complexity currently but
| considering they are using small local CPU models for OCR like
| tesseract, if they add the support of reading files on the web
| then I wouldn't be so sure of the CPU bound aspect.
| nhirschfeld wrote:
| Thanks for asking!
|
| It's both. The OCR part is ofc CPU bound, but the entire text
| extraction involves reading files, or writing and then reading
| files.
|
| Without async, these simply block.
|
| As for efficiency - if you're working in an async application
| context you have to "asyncify" these operations or suffer the
| consequences.
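
What "asyncifying" a CPU-bound step can look like, as a minimal
sketch with the standard library. This is not Kreuzberg's code:
fake_ocr is a hypothetical stand-in for an OCR call, and ocr_many is
a hypothetical helper. With executor=None the work runs on the
event loop's default thread pool; for genuinely CPU-bound work one
would pass a concurrent.futures.ProcessPoolExecutor so the GIL does
not serialize it.

```python
import asyncio
from concurrent.futures import Executor
from typing import Optional


def fake_ocr(image: bytes) -> str:
    """Hypothetical CPU-bound stand-in for OCR: returns a 2-digit hex checksum."""
    return f"{sum(image) % 256:02x}"


async def ocr_many(
    images: list[bytes], executor: Optional[Executor] = None
) -> list[str]:
    """Run fake_ocr for each image off the event loop and collect the results."""
    loop = asyncio.get_running_loop()
    # Each call is scheduled on the executor; the event loop stays free
    # to serve other tasks while the workers are busy.
    futures = [loop.run_in_executor(executor, fake_ocr, img) for img in images]
    return await asyncio.gather(*futures)
```

Calling fake_ocr directly inside a coroutine would block every other
task until it returned, which is the cost referred to above.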
| v3ss0n wrote:
| We are building something similar and waiting for my partners'/clients'
| approval to open-source it. Looks like we should join forces.
| ideashower wrote:
| Is there something like this for handwritten documents? I know
| newer models have been really good at handwriting transcription.
| m00dy wrote:
| good naming, it feels so warm that I feel like home :)
| taosx wrote:
| I know this is contrary to popular opinion but I wish people
| would slowly move away from python. I've wasted so much time in
| understanding, integrating or just making python projects work
| that at this point I'm just avoiding anything python. The best
| python projects that I can confidently say are high quality are
| the ones where a lot of the code is c,c++ or rust and python is
| just a high level wrapper.
| d0mine wrote:
| "python is a high level wrapper"
|
| is Python used as intended. Being executable pseudo-code and a
| glue language is its selling point. When has it ever been any
| different?
|
| I'm not sure C++/Rust projects are easier to understand though.
| RNCTX wrote:
| Awesome.
|
| I modified a library card software (Blacklight) into a searchable
| PDF industrial manual system awhile back on a one-off basis. It
| couldn't go any further than a contract project that delivered
| the source code because it's hard to do anything programmatically
| (at the time) to a PDF without Ghostscript.
|
| I've often thought of rewriting it with Python (and Postgres, to
| get rid of Solr or Elastic as the search backend), maybe now's
| the time...
|
| I trust you long enough for a second look because I ctrl-f'd the
| readme and found "pdfium" so I know I don't have to retread old
| ground in your github issues about how there's really only a
| couple of ways to parse a PDF with a semblance of reliability,
| lol...
|
| (for anyone else reading this getting started with documents..
| Adobe and Chrome are really the only PDF rendering libraries that
| work. PDF.js aka Firefox has always been broken, and Apple's is
| problematic as well, in both cases rearing their heads in terms
| of incorrect word / letter spacing).
| coderstartup wrote:
| That's Great.
| maleldil wrote:
| The API is pretty nice and easy to get started, but I couldn't
| get good results with parsing scientific paper PDFs,
| unfortunately (including OCR). Are there plans to use other
| backends? Docling works alright, and LLMs like Gemini Flash are
| interesting too.
___________________________________________________________________
(page generated 2025-02-15 23:00 UTC)