[HN Gopher] Show HN: Beyond text splitting - improved file parsing for LLMs
       ___________________________________________________________________
        
       Show HN: Beyond text splitting - improved file parsing for LLMs
        
       Author : serjester
       Score  : 181 points
       Date   : 2024-04-08 05:41 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | skeptrune wrote:
        | I need to spend more time figuring out how the layout detection
        | works under the hood, but if it's not an NC model then this
        | could be really good.
        
       | Oras wrote:
       | How accurate is table detection/parsing in PDFs? I found this
       | part the most challenging, and none of the open-source PDF
       | parsers worked well.
        
         | saliagato wrote:
         | worked 100% of the time for me
        
           | filkin wrote:
           | which software?
        
         | passion__desire wrote:
          | Have you checked Surya?
        
           | Oras wrote:
            | I did, and I had issues when tables had mixed text and
            | numbers.
            | 
            | Example:
            | 
            | £243,234 would come out as £234,
            | 
            | or £243 234,
            | 
            | or £243,234 (correct).
            | 
            | Some cells weren't even detected.
        
         | verdverm wrote:
          | I've been using Camelot, which builds on the lower-level
          | Python PDF libraries, to extract tables from PDFs. I haven't
          | tried anything exotic, but it seems to work. The tables I
          | parse tend to be full-page or the most dominant element.
         | 
         | https://camelot-py.readthedocs.io/en/master/
         | 
          | I like Camelot because it gives me back pandas dataframes. I
          | don't want markdown; I can make that from a dataframe if
          | needed.
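          | 
          | A minimal sketch of that flow (file name and page number are
          | placeholders):
          | 
          |   import camelot
          |   
          |   # "lattice" suits ruled tables; "stream" suits
          |   # whitespace-separated ones.
          |   tables = camelot.read_pdf("report.pdf", pages="1",
          |                             flavor="lattice")
          |   # Each result wraps a pandas DataFrame.
          |   df = tables[0].df
          |   print(df.head())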
        
         | serjester wrote:
          | Author here. We optionally support UniTable, which represents
          | the current state of the art in table parsing. Camelot /
          | Tabula use much simpler, traditional extraction techniques.
          | 
          | UniTable itself has shockingly good accuracy, although we're
          | still working on better table detection, which sometimes
          | negatively affects results.
        
       | zby wrote:
        | What I want is dynamic chunking - I want to search a document
        | for a word, and then get the largest chunk that fits into my
        | limits and contains the found word. Has anyone worked on such a
        | thing?
        
         | dleeftink wrote:
         | Do you need to find the longest common substring? Because there
         | are several methods to accomplish that.
         | 
         | [0]: https://en.m.wikipedia.org/wiki/Longest_common_substring
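          | 
          | Python's standard library can do this directly (example
          | strings made up):
          | 
          |   from difflib import SequenceMatcher
          |   
          |   a = "chunking strategies for documents"
          |   b = "dynamic chunking strategies"
          |   m = SequenceMatcher(None, a, b).find_longest_match(
          |       0, len(a), 0, len(b))
          |   print(a[m.a:m.a + m.size])  # "chunking strategies"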
        
         | snorkel wrote:
          | OpenSearch, perhaps? The search query returns a list of hits
          | (matches), each with a text_entry field containing the
          | matching excerpt from the source doc.
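          | 
          | Roughly (index and field names are placeholders; the
          | highlight block is what returns the excerpt):
          | 
          |   from opensearchpy import OpenSearch
          |   
          |   client = OpenSearch(hosts=["http://localhost:9200"])
          |   resp = client.search(index="docs", body={
          |       "query": {"match": {"text_entry": "chunking"}},
          |       # Ask for the excerpt around each match.
          |       "highlight": {"fields": {"text_entry": {}}},
          |   })
          |   for hit in resp["hits"]["hits"]:
          |       print(hit["highlight"]["text_entry"])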
        
         | Y_Y wrote:
         | grep -C $n word document
         | 
         | will get you $n lines of context on either side of the matching
         | lines.
        
           | zby wrote:
           | Yeah - the idea is simple - but there are so many variations
           | as to what makes a good chunk. If it is a program - then
           | lines are good, but maybe you'd like to set the boundaries at
           | block endings or something. And for regular text - then maybe
           | sentences would be better than lines? Or paragraphs. And
           | maybe it should not go beyond a boundary for a text section
           | or chapter. And then there might also be tables. With tables
           | - the good solution would be to fit some rows - but maybe the
           | headers should also be copied together with the rows in the
           | middle? But if a previous chunk with the headers was already
           | loaded - then maybe not duplicate the headers?
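            | 
            | For the plain-text case, the greedy version I have in mind
            | is roughly this (a rough sketch - paragraph boundaries and
            | a character budget are just one possible choice):
            | 
            |   def largest_chunk(text, word, budget):
            |       # Split on blank lines as a crude paragraph boundary.
            |       paras = text.split("\n\n")
            |       # Index of the first paragraph containing the word.
            |       hit = next(i for i, p in enumerate(paras) if word in p)
            |       lo, hi, chunk = hit, hit, paras[hit]
            |       # Greedily grow the window in both directions while
            |       # the result still fits the budget.
            |       grew = True
            |       while grew:
            |           grew = False
            |           for nxt in (lo - 1, hi + 1):
            |               if 0 <= nxt < len(paras):
            |                   cand = "\n\n".join(
            |                       paras[min(lo, nxt):max(hi, nxt) + 1])
            |                   if len(cand) <= budget:
            |                       lo, hi = min(lo, nxt), max(hi, nxt)
            |                       chunk, grew = cand, True
            |       return chunk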
        
             | c_moscardi wrote:
              | Figures, too! Yeah, you could write some logic on top of a
              | library like this and tune it by optimizing for some
              | notion of recall (grab more surrounding context) and
              | precision (only the direct context around the word, e.g.
              | the paragraph or 5 surrounding table rows) for your
              | specific application needs.
              | 
              | Using the models underlying a library like this, there's
              | maybe room for fine-tuning as well, if you have a set of
              | documents with specific semantic boundaries that current
              | approaches don't capture (and you spend an hour drawing
              | bounding boxes to make that happen).
        
       | marban wrote:
        | The recent Real Python podcast has some anecdotal insights from
        | a real-world project about dealing with decades-old
        | unstructured PDFs. https://realpython.com/podcasts/rpp/199/
        
       | d-z-m wrote:
       | Very cool!
       | 
       | I see this in the README under the "How is this different from
       | other layout parsers" section.
       | 
       | > Commercial Solutions: Requires sharing your data with a vendor.
       | 
       | But I also see that to use the Semantic Processing example, you
       | have to have an OpenAI API key. Are there any plans to support
       | locally hosted embedding models for this kind of processing?
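        | 
        | For reference, a locally hosted embedding model can be a near
        | drop-in for that step - e.g. with sentence-transformers (the
        | model name is just an example):
        | 
        |   from sentence_transformers import SentenceTransformer
        |   
        |   # Runs fully locally - no API key, no data leaves the box.
        |   model = SentenceTransformer("all-MiniLM-L6-v2")
        |   chunks = ["First chunk of text.", "Second chunk of text."]
        |   embeddings = model.encode(chunks)  # shape: (2, 384)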
        
         | willj wrote:
          | Relatedly, the OCR component relies on PyMuPDF, which has an
          | AGPL license that requires releasing source code - not
          | possible for most commercial applications. Is there any plan
          | to move away from PyMuPDF, or is there a way to use an
          | alternative?
        
           | kkielhofner wrote:
            | FWIW, PyMuPDF doesn't do OCR. It extracts embedded text
            | from a PDF, which in some cases is either non-existent or
            | was produced by poor-quality OCR (like some random
            | implementation from whatever device it was scanned with).
            | 
            | This implementation bolts on Tesseract, which IME is
            | typically not the best available.
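            | 
            | To illustrate the distinction, PyMuPDF just pulls whatever
            | text layer already exists in the file - a minimal sketch:
            | 
            |   import fitz  # PyMuPDF
            |   
            |   doc = fitz.open("scanned.pdf")
            |   for page in doc:
            |       # Returns the embedded text layer; a pure image
            |       # scan with no OCR layer comes back empty.
            |       print(page.get_text())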
        
             | serjester wrote:
              | Author here. I'm very open to alternatives to PyMuPDF /
              | Tesseract because I agree the OCR results are suboptimal
              | and the license is restrictive. I tried the basic
              | alternatives and found the results to be poor.
        
               | mcbetz wrote:
               | This article compares multiple solutions and recommends
               | docTR (Apache License 2.0):
               | https://source.opennews.org/articles/our-search-best-ocr-
               | too...
        
         | serjester wrote:
         | Coming soon!
        
       | cpursley wrote:
       | Neat and timely. My biggest challenge is tables contained in
       | PDFs.
       | 
       | Are there any similar projects that are lower level (for those of
       | us not using Python)? Something in Rust that I could call out to,
       | for example?
        
       | edshiro wrote:
       | This looks great! And incredibly timely too!
       | 
        | I finished watching a video today where the host and guests
        | were discussing challenges in RAG pipelines, and chunking
        | documents the right way is certainly still one of the hardest.
        | Video:
        | https://www.youtube.com/watch?v=Y9qn4XGH1TI&ab_channel=Prole... .
        | 
        | I was already scratching my head over how I was going to tackle
        | this challenge... It seems your library addresses this problem.
       | 
       | Thanks for the good work.
        
       | deoxykev wrote:
       | How does this compare to LayoutLMv3? Was it trained on forms at
       | all?
        
       | constantinum wrote:
        | Folks who want to extract data from really complex documents -
        | not just complex tables, but also checkboxes and tables
        | spanning multiple pages - should try LLMWhisperer.
        | https://llmwhisperer.unstract.com/
        
       | _pdp_ wrote:
        | My $0.02: correct chunking can improve accuracy, but it does not
        | change the fact that retrieval is still a single-shot operation.
        | I have commented on this before, so I am repeating myself, but
        | what RAGs are trying to do is the equivalent of looking up some
        | information (let's say via a search engine) and happening to
        | have the correct answer in the first 5 results - not the links,
        | but the actual excerpts from the crawled pages. You don't need
        | many evals to figure out that this will only work some of the
        | time. So chunking improves performance as long as the search
        | phrase can discover the correct information, but it does not
        | account for the search itself being wrong or requiring further
        | evaluation. Add to the mix that vectorisation of the records
        | does not work well for non-tokens, made-up words, foreign
        | languages, etc., and you start getting an idea of the
        | complexity involved. This is why more context is better, but
        | only up to a limit.
        | 
        | IMHO, in most use cases chunking optimisation strategies will
        | not substantially improve performance. What I think might
        | improve performance is running N search strategies with
        | multiple variations of the search phrase and picking the best
        | answer. But this is currently expensive and slow.
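        | 
        | A sketch of what I mean - rephrase, search and score are all
        | hypothetical placeholders here:
        | 
        |   def multi_query_search(question, rephrase, search, score,
        |                          n=5):
        |       # Generate n phrasings of the same question.
        |       variants = [question] + rephrase(question, n - 1)
        |       candidates = []
        |       for q in variants:           # n retrieval passes:
        |           candidates += search(q)  # expensive and slow
        |       # Keep the best-scoring excerpt.
        |       return max(candidates, key=score)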
       | 
        | Having developed a RAG platform over a year and a half ago, I
        | find many of these challenges strikingly familiar.
        
         | serjester wrote:
          | There's far more to a RAG pipeline than chunking documents;
          | chunking is just one way to interface with a file. In our
          | case we use query decomposition, document summaries and
          | chunking to achieve strong results.
          | 
          | You're right that chunking is just one piece of this. But
          | without quality chunks you're either going to miss context
          | come query time (bad chunks) or use 100X the tokens (full
          | file context).
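          | 
          | At a high level, decomposition means splitting a compound
          | question into independently answerable sub-queries before
          | retrieval. A generic sketch (not our exact prompt or
          | pipeline; llm is a hypothetical completion callable):
          | 
          |   def decompose(question, llm):
          |       prompt = ("Split this question into independent "
          |                 "sub-questions, one per line:\n" + question)
          |       # Each sub-query is then retrieved separately.
          |       return [q.strip() for q in llm(prompt).splitlines()
          |               if q.strip()]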
        
           | register wrote:
            | Can you describe in a bit more detail what your strategy
            | for query decomposition is?
        
         | mistermann wrote:
         | > What I think might improve performance is running N search
         | strategies with multiple variations of the search phrase and
         | picking up the best answer. But this is currently expensive and
         | slow.
         | 
          | Eerily similar to Thinking, Fast and Slow, and it may help
          | explain (when combined with biological and social
          | evolutionary theory) why people have such a strong aversion
          | to System 2 thinking.
        
           | _pdp_ wrote:
           | Ha, never thought of that. Thank you :)
        
             | mistermann wrote:
             | It'd be funny if humanity was permanently stalled at the
             | stage AI is currently at. Well, funny if one could watch it
             | remotely like on The Truman Show instead of being trapped
             | within it.
        
       | vikp wrote:
       | This looks great! You might be interested in surya -
       | https://github.com/VikParuchuri/surya (I'm the author). It does
       | OCR (much more accurate than tesseract), layout analysis, and
       | text detection.
       | 
        | The OCR is slow on CPU (working on it), but on GPU it's faster
        | than tesseract (which is CPU-only).
       | 
       | You could probably replace pymupdf, tesseract, and some layout
       | heuristics with this.
       | 
       | Happy to discuss more, feel free to email me (in profile).
        
         | nicklo wrote:
         | OP: please don't poison your MIT license w/ surya's GPL license
        
       | Kkoala wrote:
       | Is this only for PDFs? Or does it support other formats too? E.g.
       | markdown, text, docx etc.
        
       | jsf01 wrote:
       | Is this limited to PDFs or could the same chunking and parsing be
       | applied to plain text, html, and other input file types?
        
       | mind-blight wrote:
        | One thing I've noticed with pdfminer is that it can have
        | horrible load times for some PDFs. I've seen 20-page PDFs take
        | upwards of 45 seconds due to layout analysis. Its analysis
        | engine is also decent, but it sometimes takes newlines into
        | account in weird ways - especially if you're asking for
        | vertical text analysis.
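        | 
        | If it helps anyone, layout analysis is tunable via LAParams -
        | vertical text detection is opt-in, and leaving it off avoids
        | that newline weirdness:
        | 
        |   from pdfminer.high_level import extract_text
        |   from pdfminer.layout import LAParams
        |   
        |   # detect_vertical=False (the default) skips vertical-text
        |   # analysis; tighter margins can also speed things up.
        |   text = extract_text("slow.pdf",
        |                       laparams=LAParams(detect_vertical=False))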
        
       ___________________________________________________________________
       (page generated 2024-04-08 23:01 UTC)