[HN Gopher] Show HN: Open-source Rule-based PDF parser for RAG
       ___________________________________________________________________
        
       Show HN: Open-source Rule-based PDF parser for RAG
        
       The PDF parser is a rule-based parser which uses text
       coordinates (bounding boxes), graphics and font data. The PDF
       parser works off the text layer and also offers an OCR option
       to automatically use OCR if there are scanned pages in your
       PDFs. The OCR feature is based on a modified version of Tika,
       which uses Tesseract underneath. The PDF parser offers the
       following features:
        
       * Sections and subsections along with their levels.
       * Paragraphs - combines lines.
       * Links between sections and paragraphs.
       * Tables along with the section the tables are found in.
       * Lists and nested lists.
       * Joining of content spread across pages.
       * Removal of repeating headers and footers.
       * Watermark removal.
       * OCR with bounding boxes.
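        
       A minimal sketch of consuming the parsed output through the
       llmsherpa LayoutPDFReader client; the API URL is the hosted one
       from the llmsherpa README (point it at your own nlm-ingestor
       server if self-hosting), and the file name is a placeholder:
        
         from llmsherpa.readers import LayoutPDFReader
        
         api_url = ("https://readers.llmsherpa.com/api/document/"
                    "developer/parseDocument?renderFormat=all")
         reader = LayoutPDFReader(api_url)
         doc = reader.read_pdf("my_document.pdf")  # path or URL
        
         # layout-aware chunks, each carrying its section context
         for chunk in doc.chunks():
             print(chunk.to_context_text())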
        
       Author : jnathsf
       Score  : 253 points
       Date   : 2024-01-24 05:31 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | guidedlight wrote:
       | How does this differ from Azure Document Intelligence, or are
       | they effectively the same thing?
        
         | StrauXX wrote:
         | Last I used it, Azure Document Intelligence wasn't all that
         | smart about choosing split points. This seems to implement
         | better heuristics.
        
           | infecto wrote:
            | What is a split point? I use Textract a lot and, from my
            | testing, it always beats out the open-source tooling at
            | extracting information. That could also be highly
            | dependent on the document format.
        
             | batch12 wrote:
              | I think it refers to the places where a larger document
              | is split into chunks before calculating embeddings and
              | storing them.
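              | 
              | Roughly, the contrast is (a toy sketch in plain Python;
              | the helper names are made up for illustration):
              | 
              |   # naive fixed-size split: may cut sentences or
              |   # sections in half
              |   def fixed_chunks(text, size=1000):
              |       return [text[i:i + size]
              |               for i in range(0, len(text), size)]
              | 
              |   # structure-aware split: break only at section
              |   # boundaries; "sections" is assumed to be a list of
              |   # (title, body) pairs from a layout-aware parser
              |   def section_chunks(sections, max_size=1000):
              |       chunks = []
              |       for title, body in sections:
              |           for para in body.split("\n\n"):
              |               chunks.append((title + "\n" + para)[:max_size])
              |       return chunks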
        
           | asukla wrote:
           | I wrote about split points and the need for including section
           | hierarchy in this post:
            | https://ambikasukla.substack.com/p/efficient-rag-with-docume...
           | 
           | All this is automated in the llmsherpa parser
           | https://github.com/nlmatics/llmsherpa which you can use as an
           | API over this library.
        
         | cdolan wrote:
          | I am also curious about this. ADI is reliable but does have
          | edge-case issues on malformed PDFs.
          | 
          | I fear Tesseract OCR is a potential limitation though. I've
          | seen it make so many mistakes.
        
         | asukla wrote:
          | No, we are not doing the same thing. Most cloud parsers use
          | a vision model; they are a lot slower and more expensive,
          | and you need to write code on top of them to extract good
          | chunks.
          | 
          | You can use the llmsherpa library -
          | https://github.com/nlmatics/llmsherpa - with this server to
          | get nice layout-friendly chunks for your LLM/RAG project.
        
         | ramoz wrote:
          | There's no OCR or AI involved here (other than the standard
          | fallback).
          | 
          | What this library, and something like fitz/PyMuPDF, allows
          | you to do is extract the text straight from the PDF, using
          | rules about how to parse and structure it. (For most modern
          | PDFs you can extract text without OCR.)
          | 
          | It's much cheaper, obviously, but doesn't scale well across
          | dynamic layouts, so you'll likely use this when you can
          | configure around a standard structure. I have found
          | rule-based text extraction to work fairly dynamically,
          | though, for things like scientific PDFs - see the sketch
          | below.
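          | 
          | A minimal sketch of that rule-based idea with PyMuPDF (the
          | size threshold is an assumption you'd tune per document
          | set):
          | 
          |   import fitz  # PyMuPDF
          | 
          |   doc = fitz.open("paper.pdf")
          |   for page in doc:
          |       for block in page.get_text("dict")["blocks"]:
          |           for line in block.get("lines", []):
          |               for span in line["spans"]:
          |                   # each span carries text plus font name,
          |                   # size and bbox - enough for simple rules
          |                   if span["size"] > 14:  # assumed cutoff
          |                       print("HEADING:", span["text"])
          |                   else:
          |                       print("BODY:", span["text"])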
        
       | jvdvegt wrote:
        | Do you have any examples? There doesn't seem to be a single
        | PDF file in the repo.
        
         | asukla wrote:
          | You can see examples in the llmsherpa project -
          | https://github.com/nlmatics/llmsherpa. This project,
          | nlm-ingestor, provides the backend for llmsherpa. The
          | llmsherpa library is very convenient for extracting nice
          | chunks for your LLM/RAG project.
        
       | epaga wrote:
        | This looks like it could be very helpful. The company I work
        | for has a PDF comparison tool called "PDFC" which reads PDFs
        | and runs comparisons of semantic differences.
       | https://www.inetsoftware.de/products/pdf-content-comparer
       | 
       | Parsing PDFs can be quite the headache because the format is so
       | complex. We support most of these features already but there are
       | always so many edge cases that additional angles can be very
       | helpful.
        
       | firtoz wrote:
        | Thank you for sharing. Are there some example input/output
        | pairs somewhere?
        
         | asukla wrote:
          | You can use the library in conjunction with llmsherpa's
          | LayoutPDFReader.
          | 
          | Some examples with notebooks are here:
          | https://github.com/nlmatics/llmsherpa. Here's another
          | notebook from the repo with examples:
          | https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...
        
       | dmezzetti wrote:
        | Nice project! I've long used Tika for document parsing given
        | its maturity and the wide number of formats it supports. The
        | XHTML output helps with chunking documents for RAG.
       | 
       | Here's a couple examples:
       | 
       | - https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
       | 
       | - https://neuml.hashnode.dev/extract-text-from-documents
       | 
       | Disclaimer: I'm the primary author of txtai
       | (https://github.com/neuml/txtai).
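        | 
        | For reference, a minimal Textractor sketch (the file name is
        | a placeholder):
        | 
        |   from txtai.pipeline import Textractor
        | 
        |   # Textractor uses Apache Tika under the hood, so the same
        |   # call handles PDF, Word, HTML and more
        |   textractor = Textractor(paragraphs=True)
        |   for paragraph in textractor("document.pdf"):
        |       print(paragraph)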
        
         | mpeg wrote:
          | Off-topic, but do you know how Tika compares to other PDF
          | parsing libraries? I was very unimpressed by pdfminer.six
          | (what unstructured uses): the layout detection seems pretty
          | basic, and it fails to parse multi-column text, whereas
          | MuPDF handles it perfectly.
          | 
          | Currently I'm using a mix of MuPDF + AWS Textract (for
          | tables, mostly) but I'd love to understand what other
          | people are doing.
        
           | dmezzetti wrote:
            | I don't have scientific metrics but I've found the
            | quality much better than most. It does a pretty good job
            | of pulling data from text and tables.
        
           | jahewson wrote:
           | Tika uses PDFBox under the hood, using its built-in text
           | extractor (which is "ok"). If you're looking for table
           | extraction specifically, check out Tabula
           | (https://tabula.technology) which is also built on top of
           | PDFBox and has some contributions from the same maintainers.
           | PDFBox actually exposes a lower-level API for text extraction
           | (I wrote it!) than the one Tabula uses, allowing you to roll
           | your own extractor - but that's where dragons live, trust me
           | :)
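            | 
            | If you'd rather stay in Python, tabula-py wraps Tabula; a
            | minimal sketch (the file name is a placeholder):
            | 
            |   import tabula  # tabula-py; needs a Java runtime,
            |                  # since Tabula is Java underneath
            | 
            |   # returns a list of pandas DataFrames, one per table
            |   tables = tabula.read_pdf("report.pdf", pages="all")
            |   for df in tables:
            |       print(df.head())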
        
       | lmeyerov wrote:
       | Tesseract OCR fallback sounds great!
       | 
        | There are now a lot of file loaders for RAG (LangChain,
        | LlamaIndex, unstructured, ...); any reason, like a leading
        | benchmark score, to prefer this one?
        
         | mpeg wrote:
          | I couldn't try this tool as it doesn't build on Apple
          | silicon (and there's no ARM docker image).
          | 
          | However, I have a PDF parsing use case that I tried those
          | RAG tools for, and the output they give me is pretty low
          | quality. It kind of works for RAG, since the LLM can work
          | around the issues, but if you want higher-quality responses
          | with proper references and such, I think the best way is to
          | write your own rule-based parser, which is what I ended up
          | doing (based on MuPDF though, not Tika).
          | 
          | Maybe that's what the authors of this tool were thinking
          | too.
        
           | asukla wrote:
            | To run the docker image on Apple silicon, you can use the
            | following command to pull - it will be slower but works:
            | 
            |   docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest
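            | 
            | After pulling, a sketch of wiring it up from Python (the
            | port is an assumption taken from the repo README - verify
            | against the docs):
            | 
            |   # docker run -p 5010:5010 \
            |   #     ghcr.io/nlmatics/nlm-ingestor:latest
            |   from llmsherpa.readers import LayoutPDFReader
            | 
            |   reader = LayoutPDFReader(
            |       "http://localhost:5010/api/parseDocument"
            |       "?renderFormat=all")
            |   doc = reader.read_pdf("my_document.pdf")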
        
             | mpeg wrote:
             | Thanks, I always forget I can do that! I've given it a go
             | and it's really impressive - the default chunker is very
             | smart and manages to keep most of the chunk context
             | together
             | 
             | The table parser in particular is really good. Is the trick
             | that you draw some guide lines and rectangles around
             | tables? I'm trying to understand the
             | GraphicsStreamProcessor class as I'm not familiar with
             | Tika, how does it know where to draw in the first place?
        
           | ramoz wrote:
            | For me, PyMuPDF/fitz has been the best way to retain
            | natural reading order and set dynamic enough rules to
            | extract text from complex layouts.
            | 
            | None of the mentioned tools did this out of the box, and
            | none seemed easy to configure - all are definitely hyped
            | and marketed way beyond fitz though.
        
             | mpeg wrote:
              | Same here, fitz is great. It does well enough out of
              | the box that I can apply some simple heuristics - for
              | things like joining/splitting paragraphs where it makes
              | a mistake - and extract drawings and such, getting
              | pretty close to 100% accuracy on the output.
              | 
              | The only thing it doesn't do is table detection
              | (neither does pdfminer.six), but there are plenty of
              | other ways to handle them; one sketch follows below.
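              | 
              | For example, a minimal table-extraction sketch with
              | pdfplumber (the file name is a placeholder):
              | 
              |   import pdfplumber
              | 
              |   # pdfplumber infers tables from ruling lines and
              |   # text alignment
              |   with pdfplumber.open("report.pdf") as pdf:
              |       for page in pdf.pages:
              |           for table in page.extract_tables():
              |               for row in table:
              |                   print(row)  # row = list of cells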
        
         | rmsaksida wrote:
         | Last time I tried Langchain (admittedly, that was ~6 months
         | ago) the implementations for content extraction from PDFs and
         | HTML files were very basic. Enough to get a prototype RAG
         | solution going, but not enough to build anything reliable. This
         | looks like a much more battle-tested implementation.
        
       | dmezzetti wrote:
       | One additional library to add, if you're working with scientific
       | papers: https://github.com/kermitt2/grobid. I use this with
       | paperetl (https://github.com/neuml/paperetl).
        
       | ilaksh wrote:
       | How does this compare to PaddleOCR?
       | 
       | Looks like Apache 2 license which is nice.
        
       | asukla wrote:
       | Thanks for the post. Please use this server with the llmsherpa
       | LayoutPDFReader to get optimal chunks for your LLM/RAG project:
       | https://github.com/nlmatics/llmsherpa. See examples and notebook
       | in the repo.
        
       | xfalcox wrote:
       | We've been looking for something exactly like this, thanks for
       | sharing!
        
       | huqedato wrote:
        | I tried to parse a few hundred PDFs with it. The results are
        | pretty decent. If this were developed in Julia, it would be
        | ten times faster (at least).
        
       | mistrial9 wrote:
        | Great effort and very interesting. However, I go to GitHub
        | and I see "This organization has no public members". I do not
        | know who you are at all, or what else might be part of this
        | without disclosure.
        | 
        | Overall, I believe there has to be some middle ground for
        | identification and trust building over time, between "hidden
        | group with no names on $CORP secure site" and other
        | traditional means of introduction and trust building.
        | 
        | Thanks for posting this interesting and relevant work.
        
       ___________________________________________________________________
       (page generated 2024-01-24 23:01 UTC)