[HN Gopher] Show HN: Beyond text splitting - improved file parsing for LLMs
___________________________________________________________________
Show HN: Beyond text splitting - improved file parsing for LLMs
Author : serjester
Score : 181 points
Date : 2024-04-08 05:41 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| skeptrune wrote:
 | I need to spend more time figuring out how the layout detection
 | works under the hood, but if it's not a non-commercial (NC)
 | licensed model then this could be really good.
| Oras wrote:
| How accurate is table detection/parsing in PDFs? I found this
| part the most challenging, and none of the open-source PDF
| parsers worked well.
| saliagato wrote:
| worked 100% of the time for me
| filkin wrote:
| which software?
| passion__desire wrote:
 | Have you checked Surya?
| Oras wrote:
| I did and I had issues when tables had mixed text and
| numbers.
|
| Example:
|
 | £243,234 would be £234,
 |
 | Or £243 234
 |
 | Or £243,234 (correct).
|
| Some cells weren't even detected.
| verdverm wrote:
 | I've been using camelot, which builds on the lower-level Python
 | PDF libraries, to extract tables from PDFs. Haven't tried
 | anything exotic, but it seems to work. The tables I parse tend
 | to be full page or the most dominant element.
|
| https://camelot-py.readthedocs.io/en/master/
|
 | I like Camelot because it gives me back pandas dataframes. I
 | don't want markdown; I can make that from a dataframe if needed.
| serjester wrote:
 | Author here. Optionally we implement unitable, which represents
 | the current state of the art in table parsing. Camelot /
 | Tabula use much simpler, traditional extraction techniques.
 |
 | Unitable itself has shockingly good accuracy, although we're
 | still working on better table detection, which sometimes
 | negatively affects results.
| zby wrote:
 | What I want is dynamic chunking - I want to search a document
 | for a word, and then get the largest chunk that fits into my
 | limits and contains the found word. Has anyone worked on such a
 | thing?
| dleeftink wrote:
| Do you need to find the longest common substring? Because there
| are several methods to accomplish that.
|
| [0]: https://en.m.wikipedia.org/wiki/Longest_common_substring
| snorkel wrote:
 | OpenSearch perhaps? The search query returns a list of hits
 | (matches), each with a text_entry field that holds the matching
 | excerpt from the source doc.
| Y_Y wrote:
| grep -C $n word document
|
| will get you $n lines of context on either side of the matching
| lines.
| zby wrote:
| Yeah - the idea is simple - but there are so many variations
| as to what makes a good chunk. If it is a program - then
| lines are good, but maybe you'd like to set the boundaries at
| block endings or something. And for regular text - then maybe
| sentences would be better than lines? Or paragraphs. And
| maybe it should not go beyond a boundary for a text section
| or chapter. And then there might also be tables. With tables
| - the good solution would be to fit some rows - but maybe the
| headers should also be copied together with the rows in the
| middle? But if a previous chunk with the headers was already
| loaded - then maybe not duplicate the headers?
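 |
 | A rough sketch of the sentence-based variant (character budget
 | rather than tokens; purely illustrative):
 |
 |     import re
 |
 |     def chunk_around(text, word, limit=1000):
 |         # Split on sentence-ish boundaries.
 |         sents = re.split(r'(?<=[.!?])\s+', text)
 |         hit = next((i for i, s in enumerate(sents)
 |                     if word in s), None)
 |         if hit is None:
 |             return None
 |         lo, hi = hit, hit + 1
 |         # Grow the window one sentence at a time until
 |         # nothing more fits the budget.
 |         while True:
 |             cur = len(" ".join(sents[lo:hi]))
 |             if lo > 0 and cur + 1 + len(sents[lo - 1]) <= limit:
 |                 lo -= 1
 |             elif (hi < len(sents)
 |                   and cur + 1 + len(sents[hi]) <= limit):
 |                 hi += 1
 |             else:
 |                 return " ".join(sents[lo:hi])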
| c_moscardi wrote:
| Figures, too! Yeah you could write some logic essentially
| on top of a library like this, and tune based on optimizing
| for some notion of recall (grab more surrounding context)
| and precision (direct context around the word, e.g. only
| the paragraph or 5 surrounding table rows) for your
| specific application needs.
|
| Using the models underlying a library like this, there's
| maybe room for fine-tuning as well if you have a set of
| documents with specific semantic boundaries that current
| approaches don't capture. (And you spend an hour drawing
| bounding boxes to make that happen).
| marban wrote:
 | The recent Real Python podcast has some anecdotal insights from
 | a real-world project on dealing with decades-old unstructured
 | PDFs. https://realpython.com/podcasts/rpp/199/
| d-z-m wrote:
| Very cool!
|
| I see this in the README under the "How is this different from
| other layout parsers" section.
|
| > Commercial Solutions: Requires sharing your data with a vendor.
|
| But I also see that to use the Semantic Processing example, you
| have to have an OpenAI API key. Are there any plans to support
| locally hosted embedding models for this kind of processing?
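 |
 | Not the author, but a locally hosted swap-in could be as simple
 | as this (sentence-transformers and the model name are just one
 | example, not this library's API):
 |
 |     from sentence_transformers import SentenceTransformer
 |
 |     # Any local embedding model works; MiniLM is a small,
 |     # permissively licensed default.
 |     model = SentenceTransformer("all-MiniLM-L6-v2")
 |     vectors = model.encode(["chunk one", "chunk two"])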
| willj wrote:
 | Relatedly, the OCR component relies on PyMuPDF, whose AGPL
 | license requires releasing source code - something that isn't
 | possible for most commercial applications. Is there any plan to
 | move away from PyMuPDF, or is there a way to use an
 | alternative?
| kkielhofner wrote:
 | FWIW PyMuPDF doesn't do OCR. It extracts embedded text from a
 | PDF, which in some cases is non-existent or was produced by
 | poor-quality OCR (like whatever random implementation the
 | scanning software used).
 |
 | This implementation bolts on Tesseract, which IME is typically
 | not the best available.
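 |
 | To illustrate the distinction (the file name is made up):
 |
 |     import fitz  # PyMuPDF
 |
 |     doc = fitz.open("scan.pdf")
 |     for page in doc:
 |         text = page.get_text()  # embedded text layer only
 |         if not text.strip():
 |             # No text layer: the page is likely a scan and
 |             # needs a real OCR pass (Tesseract, docTR, ...).
 |             print(page.number, "needs OCR")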
| serjester wrote:
 | Author here. I'm very open to alternatives to PyMuPDF /
 | tesseract because I agree the OCR results are suboptimal and
 | PyMuPDF has a restrictive license. I tried the basic
 | alternatives and found the results to be poor.
| mcbetz wrote:
| This article compares multiple solutions and recommends
| docTR (Apache License 2.0):
| https://source.opennews.org/articles/our-search-best-ocr-
| too...
| serjester wrote:
| Coming soon!
| cpursley wrote:
| Neat and timely. My biggest challenge is tables contained in
| PDFs.
|
| Are there any similar projects that are lower level (for those of
| us not using Python)? Something in Rust that I could call out to,
| for example?
| edshiro wrote:
| This looks great! And incredibly timely too!
|
| I finished watching this video today where the host and guests
| were discussing challenges in a RAG pipeline, and certainly
| chunking documents the right way is still very challenging.
| Video:
| https://www.youtube.com/watch?v=Y9qn4XGH1TI&ab_channel=Prole... .
|
 | I was already scratching my head over how I was going to tackle
 | this challenge... It seems your library addresses this problem.
|
| Thanks for the good work.
| deoxykev wrote:
| How does this compare to LayoutLMv3? Was it trained on forms at
| all?
| constantinum wrote:
 | Folks who want to extract data from really complex documents -
 | not just complex tables, but also checkboxes and tables
 | spanning multiple pages - should try LLMWhisperer.
 | https://llmwhisperer.unstract.com/
| _pdp_ wrote:
 | My $0.02: correct chunking can improve accuracy, but it does not
 | change the fact that it is still a single-shot operation. I have
 | commented on this before, so I am repeating myself, but what RAGs
 | are trying to do is the equivalent of looking up some information
 | (let's say via a search engine) and hoping you happen to have the
 | correct answer in the first 5 results - not the links but the
 | actual excerpts from the crawled pages. You don't need many evals
 | to figure out that this will only sometimes work. So chunking
 | improves performance as long as the search phrase can discover
 | the correct information, but it does not account for the fact
 | that the search itself could be wrong or require more evaluation.
 | Add to the mix that vectorisation of the records does not work
 | well for non-tokens, made-up words, foreign languages, etc, and
 | you start to get an idea of the complexity involved. This is why
 | more context is better - but only up to a limit.
|
 | IMHO, in most use cases, chunking optimisation strategies will
 | not substantially improve performance. What I think might
 | improve performance is running N search strategies with
 | multiple variations of the search phrase and picking the best
 | answer. But this is currently expensive and slow.
|
 | Having developed a RAG platform over a year and a half ago, I
 | find many of these challenges strikingly familiar.
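 |
 | A rough sketch of that N-variations idea (every name here is
 | illustrative - the rephrase/search/score functions are whatever
 | your stack provides):
 |
 |     def multi_query_search(question, rephrase, search, score,
 |                            n=4):
 |         # Generate n phrasings of the query (e.g. via an LLM),
 |         # retrieve candidates for each, keep the best-scoring
 |         # excerpt.
 |         variants = [question]
 |         variants += [rephrase(question) for _ in range(n - 1)]
 |         hits = [h for v in variants for h in search(v)]
 |         return max(hits, key=score)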
| serjester wrote:
 | There's far more to a RAG pipeline than chunking documents;
 | chunking is just one way to interface with a file. In our case
 | we use query decomposition, document summaries and chunking to
 | achieve strong results.
 |
 | You're right that chunking is just one piece of this. But without
 | quality chunks you're either going to miss context come query
 | time (bad chunks) or use 100X the tokens (full file context).
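 |
 | In its simplest form, query decomposition is just prompting
 | (this sketch is a generic illustration, not our actual
 | implementation):
 |
 |     def decompose(question, llm):
 |         # Ask an LLM to split a complex question into
 |         # independent sub-questions, one per line.
 |         prompt = ("Break this question into independent "
 |                   "sub-questions, one per line:\n" + question)
 |         return [q.strip() for q in llm(prompt).splitlines()
 |                 if q.strip()]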
| register wrote:
 | Can you describe in a bit more detail what your strategy is for
 | query decomposition?
| mistermann wrote:
 | > What I think might improve performance is running N search
 | strategies with multiple variations of the search phrase and
 | picking the best answer. But this is currently expensive and
 | slow.
|
| Eerily similar to Thinking Fast and Slow, and may help explain
| (when combined with biological and social evolutionary theory)
| why people have such a strong aversion to System 2 thinking.
| _pdp_ wrote:
| Ha, never thought of that. Thank you :)
| mistermann wrote:
| It'd be funny if humanity was permanently stalled at the
| stage AI is currently at. Well, funny if one could watch it
| remotely like on The Truman Show instead of being trapped
| within it.
| vikp wrote:
| This looks great! You might be interested in surya -
| https://github.com/VikParuchuri/surya (I'm the author). It does
| OCR (much more accurate than tesseract), layout analysis, and
| text detection.
|
 | The OCR is slow on CPU (working on it), but on GPU it's faster
 | than tesseract (which is CPU-only).
|
| You could probably replace pymupdf, tesseract, and some layout
| heuristics with this.
|
| Happy to discuss more, feel free to email me (in profile).
| nicklo wrote:
| OP: please don't poison your MIT license w/ surya's GPL license
| Kkoala wrote:
| Is this only for PDFs? Or does it support other formats too? E.g.
| markdown, text, docx etc.
| jsf01 wrote:
| Is this limited to PDFs or could the same chunking and parsing be
| applied to plain text, html, and other input file types?
| mind-blight wrote:
 | One thing I've noticed with pdfminer is that it can have
 | horrible load times for some PDFs. I've seen 20-page PDFs take
 | upwards of 45 seconds due to layout analysis. Its analysis
 | engine is also decent, but it sometimes takes newlines into
 | account in weird ways - especially if you're asking for
 | vertical text analysis.
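 |
 | One knob that sometimes helps (a sketch, not a guaranteed fix;
 | the file name is made up):
 |
 |     from pdfminer.high_level import extract_text
 |     from pdfminer.layout import LAParams
 |
 |     # boxes_flow=None skips the expensive reading-order
 |     # analysis; leaving detect_vertical at its False default
 |     # avoids the slow vertical-text pass.
 |     text = extract_text("slow.pdf",
 |                         laparams=LAParams(boxes_flow=None))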
___________________________________________________________________
(page generated 2024-04-08 23:01 UTC)