[HN Gopher] Show HN: Open-source Rule-based PDF parser for RAG
___________________________________________________________________
Show HN: Open-source Rule-based PDF parser for RAG
The PDF parser is a rule-based parser that uses text coordinates
(bounding boxes), graphics, and font data. It works off the text
layer and also offers an OCR option that automatically applies OCR
to scanned pages in your PDFs. The OCR feature is based on a
modified version of Tika that uses Tesseract underneath. The
PDF parser offers the following features:
  * Sections and subsections along with their levels.
  * Paragraphs - combines lines.
  * Links between sections and paragraphs.
  * Tables along with the section the tables are found in.
  * Lists and nested lists.
  * Joining of content spread across pages.
  * Removal of repeating headers and footers.
  * Watermark removal.
  * OCR with bounding boxes.
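
A minimal sketch of driving this server through the companion
llmsherpa client (the endpoint path and method names follow the
llmsherpa README - treat them as assumptions and verify against
both repos):

    from llmsherpa.readers import LayoutPDFReader

    # Local nlm-ingestor server, e.g. started from the Docker image.
    llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all"

    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf("example.pdf")  # path or URL

    # Layout-aware chunks; each chunk carries its section context.
    for chunk in doc.chunks():
        print(chunk.to_context_text())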
Author : jnathsf
Score : 253 points
Date : 2024-01-24 05:31 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| guidedlight wrote:
| How does this differ from Azure Document Intelligence, or are
| they effectively the same thing?
| StrauXX wrote:
| Last I used it, Azure Document Intelligence wasn't all that
| smart about choosing split points. This seems to implement
| better heuristics.
| infecto wrote:
| What is a split point? I use Textract a lot and, from my
| testing, it always beats out any of the open-source tooling at
| extracting information. That could also be highly dependent on
| the document format.
| batch12 wrote:
| I think it is a reference to the place where a larger document
| is split into chunks for computing embeddings and storage.
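|
| (For illustration, a minimal fixed-size splitter in Python - the
| naive kind of split point that smarter, layout-aware heuristics
| improve on; the function and its parameters are made up:)
|
|     def naive_chunks(text, size=1000, overlap=100):
|         """Split on raw character offsets, ignoring sections and
|         paragraphs - split points can land mid-sentence."""
|         step = size - overlap
|         return [text[i:i + size] for i in range(0, len(text), step)]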
| asukla wrote:
| I wrote about split points and the need for including section
| hierarchy in this post:
| https://ambikasukla.substack.com/p/efficient-rag-with-
| docume...
|
| All this is automated in the llmsherpa parser
| https://github.com/nlmatics/llmsherpa, which you can use as an
| API on top of this library.
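|
| As a rough sketch of what section hierarchy looks like through
| the llmsherpa API (method names follow its README - verify them
| against the repo):
|
|     from llmsherpa.readers import LayoutPDFReader
|
|     reader = LayoutPDFReader(
|         "http://localhost:5001/api/parseDocument?renderFormat=all")
|     doc = reader.read_pdf("report.pdf")
|     for section in doc.sections():
|         # Section title plus all nested subsections/paragraphs.
|         print(section.title)
|         print(section.to_text(include_children=True, recurse=True))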
| cdolan wrote:
| I am also curious about this. ADI is reliable but does have
| edge case issues on malformed PDFs.
|
| I fear Tesseract OCR is a potential limitation though. I've
| seen it make so many mistakes.
| asukla wrote:
| No, we are not doing the same thing. Most cloud parsers use a
| vision model; they are a lot slower and more expensive, and you
| need to write code on top of them to extract good chunks.
|
| You can use the llmsherpa library -
| https://github.com/nlmatics/llmsherpa - with this server to get
| nice layout-friendly chunks for your LLM/RAG project.
| ramoz wrote:
| There's no OCR or AI involved here (other than the standard
| fallback).
|
| What this library - and something like fitz/PyMuPDF - allows
| you to do is extract the text straight from the PDF, using
| rules about how to parse and structure it. (For most modern
| PDFs you can extract text without OCR.)
|
| - much cheaper obviously, but doesn't scale well across dynamic
| layouts, so you are likely using this when you can configure
| around a standard structure. I have found rule-based text
| extraction to work fairly dynamically, though, for things like
| scientific PDFs.
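|
| A minimal sketch of that rule-based approach with fitz/PyMuPDF
| (the font-size threshold is a made-up heuristic, not anything
| built into the library):
|
|     import fitz  # PyMuPDF
|
|     doc = fitz.open("paper.pdf")
|     for page in doc:
|         # "dict" output exposes blocks/lines/spans with bounding
|         # boxes, fonts and sizes - the raw material for rules.
|         for block in page.get_text("dict")["blocks"]:
|             for line in block.get("lines", []):  # skip image blocks
|                 for span in line["spans"]:
|                     kind = "heading" if span["size"] > 14 else "body"
|                     print(kind, span["text"])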
| jvdvegt wrote:
| Do you have any examples? There doesn't seem to be a single PDF
| file in the repo.
| asukla wrote:
| You can see examples in the llmsherpa project -
| https://github.com/nlmatics/llmsherpa. This project,
| nlm-ingestor, provides the backend for llmsherpa. The llmsherpa
| library is very convenient for extracting nice chunks for your
| LLM/RAG project.
| epaga wrote:
| This looks like it could be very helpful. The company I work for
| has a PDF comparison tool called "PDFC" which can read PDFs and
| runs comparisons of semantic differences.
| https://www.inetsoftware.de/products/pdf-content-comparer
|
| Parsing PDFs can be quite the headache because the format is so
| complex. We support most of these features already but there are
| always so many edge cases that additional angles can be very
| helpful.
| firtoz wrote:
| Thank you for sharing. Are there example input/output pairs
| somewhere?
| asukla wrote:
| You can use the library in conjunction with llmsherpa's
| LayoutPDFReader.
|
| Some examples with a notebook are here:
| https://github.com/nlmatics/llmsherpa. Here's another notebook
| from the repo with examples:
| https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...
| dmezzetti wrote:
| Nice project! I've long used Tika for document parsing given
| its maturity and the wide number of formats it supports. The
| XHTML output helps with chunking documents for RAG.
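|
| For reference, pulling that XHTML out of Tika from Python looks
| roughly like this (using the tika-python client; the per-page
| <div class="page"> wrapper is what makes page-level chunking
| straightforward):
|
|     from tika import parser
|     from bs4 import BeautifulSoup
|
|     parsed = parser.from_file("doc.pdf", xmlContent=True)
|     soup = BeautifulSoup(parsed["content"], "lxml")
|     # Tika wraps each PDF page in <div class="page">.
|     pages = [div.get_text() for div in soup.find_all("div", "page")]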
|
| Here's a couple examples:
|
| - https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
|
| - https://neuml.hashnode.dev/extract-text-from-documents
|
| Disclaimer: I'm the primary author of txtai
| (https://github.com/neuml/txtai).
| mpeg wrote:
| Off-topic, but do you know how Tika compares to other PDF
| parsing libraries? I was very unimpressed by pdfminer.six (what
| unstructured uses), as the layout detection seems pretty basic:
| it fails to parse multi-column text, whereas MuPDF does it
| perfectly.
|
| Currently I'm using a mix of MuPDF + AWS Textract (for tables,
| mostly), but I'd love to understand what other people are doing.
| dmezzetti wrote:
| I don't have scientific metrics, but I've found the quality
| much better than most. It does a pretty good job of pulling
| data from text and tables.
| jahewson wrote:
| Tika uses PDFBox under the hood, using its built-in text
| extractor (which is "ok"). If you're looking for table
| extraction specifically, check out Tabula
| (https://tabula.technology) which is also built on top of
| PDFBox and has some contributions from the same maintainers.
| PDFBox actually exposes a lower-level API for text extraction
| (I wrote it!) than the one Tabula uses, allowing you to roll
| your own extractor - but that's where dragons live, trust me
| :)
| lmeyerov wrote:
| Tesseract OCR fallback sounds great!
|
| There are now a lot of file loaders for RAG (LangChain,
| LlamaIndex, unstructured, ...) - any reasons, like a leading
| benchmark score, to prefer this one?
| mpeg wrote:
| I couldn't try this tool as it doesn't build on Apple silicon
| (and there's no ARM docker image).
|
| However, I have a PDF parsing use case that I tried those RAG
| tools for, and the output they give me is pretty low quality.
| It kinda works for RAG, as the LLM can work around the issues,
| but if you want higher quality responses with proper references
| and such, I think the best way is to write your own rule-based
| parser, which is what I ended up doing (based on MuPDF though,
| not Tika).
|
| Maybe that's what the authors of this tool were thinking too.
| asukla wrote:
| To run the docker image on Apple silicon, you can use the
| following command to pull - it will be slower but works:
|
|     docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest
| mpeg wrote:
| Thanks, I always forget I can do that! I've given it a go
| and it's really impressive - the default chunker is very
| smart and manages to keep most of the chunk context
| together.
|
| The table parser in particular is really good. Is the trick
| that you draw some guide lines and rectangles around
| tables? I'm trying to understand the
| GraphicsStreamProcessor class, as I'm not familiar with
| Tika - how does it know where to draw in the first place?
| ramoz wrote:
| For me, PyMuPDF/fitz has been the best way to retain natural
| reading order and set rules dynamic enough to extract text in
| complex layouts.
|
| None of the mentioned tools did this out of the box, none
| seemed easy to configure, and all are definitely hyped and
| marketed way beyond fitz.
| mpeg wrote:
| Same here, fitz is great. It does well enough out of the
| box that I can apply some simple heuristics - for things like
| joining/splitting paragraphs where it makes a mistake -
| extract drawings and such, and get pretty close to 100%
| accuracy on the output.
|
| The only thing it doesn't do is table detection (neither
| does pdfminer.six), but there are plenty of other ways to
| handle them.
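|
| (A toy version of such a paragraph-joining heuristic - the gap
| threshold and function are made up for illustration:)
|
|     import fitz  # PyMuPDF
|
|     def joined_paragraphs(page, max_gap=4.0):
|         """Merge consecutive text blocks whose vertical gap is
|         small - a crude stand-in for paragraph-joining rules."""
|         paras, prev_bottom = [], None
|         for x0, y0, x1, y1, text, *_ in page.get_text("blocks"):
|             if prev_bottom is not None and y0 - prev_bottom < max_gap:
|                 paras[-1] += " " + text.strip()
|             else:
|                 paras.append(text.strip())
|             prev_bottom = y1
|         return paras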
| rmsaksida wrote:
| Last time I tried LangChain (admittedly, that was ~6 months
| ago), the implementations for content extraction from PDFs and
| HTML files were very basic - enough to get a prototype RAG
| solution going, but not enough to build anything reliable. This
| looks like a much more battle-tested implementation.
| dmezzetti wrote:
| One additional library to add, if you're working with scientific
| papers: https://github.com/kermitt2/grobid. I use this with
| paperetl (https://github.com/neuml/paperetl).
| ilaksh wrote:
| How does this compare to PaddleOCR?
|
| Looks like Apache 2 license which is nice.
| asukla wrote:
| Thanks for the post. Please use this server with the llmsherpa
| LayoutPDFReader to get optimal chunks for your LLM/RAG project:
| https://github.com/nlmatics/llmsherpa. See examples and notebook
| in the repo.
| xfalcox wrote:
| We've been looking for something exactly like this, thanks for
| sharing!
| huqedato wrote:
| I tried to parse a few hundred PDFs with it. The results are
| pretty decent. If this were developed in Julia, it would be ten
| times faster (at least).
| mistrial9 wrote:
| Great effort and very interesting. However, I go to GitHub and
| I see "This organization has no public members"... I do not
| know who you are at all, or what else might be part of this
| without disclosure.
|
| Overall, I believe there has to be some middle ground for
| identification and trust building over time, between a "hidden
| group with no names on $CORP secure site" and other traditional
| means of introduction and trust building.
|
| Thanks for posting this interesting and relevant work.
___________________________________________________________________
(page generated 2024-01-24 23:01 UTC)