[HN Gopher] RD-TableBench - Accurately evaluating table extraction
___________________________________________________________________
RD-TableBench - Accurately evaluating table extraction
Hey HN! A ton of document parsing solutions have been coming out
lately, each claiming SOTA with little evidence. A lot of these
turned out to be LLM or LVM wrappers that hallucinate frequently on
complex tables. We just released RD-TableBench, an open benchmark
to help teams evaluate extraction performance for complex tables.
The benchmark includes a variety of challenging scenarios including
scanned tables, handwriting, language detection, merged cells, and
more. We employed an independent team of PhD-level human labelers
who manually annotated 1000 complex table images from a diverse set
of publicly available documents. Alongside the benchmark, we're also
releasing a new bioinformatics-inspired algorithm for grading table
similarity.
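
To give a flavor of the grading approach: think Needleman-Wunsch-style
sequence alignment from bioinformatics, applied to table cells instead
of base pairs. Here's a simplified Python sketch; it's illustrative
only, and the gap penalty and cell-similarity function are
placeholders rather than the benchmark's exact scoring:

    # Simplified sketch: Needleman-Wunsch-style global alignment
    # over table cells. Placeholder scoring, not the real grader.
    from difflib import SequenceMatcher

    GAP = -0.5  # placeholder penalty for an unmatched cell

    def cell_sim(a, b):
        # Similarity in [0, 1] between two cell strings.
        return SequenceMatcher(None, a, b).ratio()

    def align_row(pred, gold):
        # Global alignment score between a predicted and a gold row.
        n, m = len(pred), len(gold)
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = dp[i - 1][0] + GAP
        for j in range(1, m + 1):
            dp[0][j] = dp[0][j - 1] + GAP
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = max(
                    dp[i - 1][j - 1] + cell_sim(pred[i - 1], gold[j - 1]),
                    dp[i - 1][j] + GAP,  # extra predicted cell
                    dp[i][j - 1] + GAP,  # missed gold cell
                )
        return dp[n][m]

    def table_score(pred_rows, gold_rows):
        # Rows zipped in order for brevity; a full version would
        # align rows with the same dynamic program.
        total = sum(align_row(p, g) for p, g in zip(pred_rows, gold_rows))
        n_cells = sum(len(g) for g in gold_rows) or 1
        return max(0.0, total / n_cells)
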
Would love to hear any feedback! -Raunak
Author : raunakchowdhuri
Score : 25 points
Date : 2024-11-05 18:46 UTC (4 hours ago)
(HTM) web link (reducto.ai)
(TXT) w3m dump (reducto.ai)
| nparsan wrote:
| This is great, but are there datasets for this already? I know
| PubTables-1M has something like 1M labeled data points. Also, how
| important are table schemas as a % of overall unstructured
| documents?
| raunakchowdhuri wrote:
| Love the PubTables work! It's a really useful dataset. Their
| data comes from existing annotations in scientific papers, so in
| our experience it doesn't include many of the hardest cases that
| methods fail on today. The annotations are computer-generated
| rather than manually labeled, so you don't get things like
| scanned and rotated images or much diversity in languages.
|
| I'd encourage you to take a look at some of our data points to
| compare for yourself! Link:
| huggingface.co/spaces/reducto/rd_table_bench
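|
| If it's easier to poke at programmatically, and the set is
| mirrored as a Hugging Face dataset (the ID and split below are
| assumptions, so check the page above for the canonical ones),
| the datasets library can pull it:
|
|     # Dataset ID and split are assumptions; verify on the HF page.
|     from datasets import load_dataset
|     ds = load_dataset("reducto/rd-tablebench", split="train")
|     print(ds[0])  # one table image plus its ground-truth annotation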
|
| In terms of the overall importance of table extraction, we've
| found it to be a key bottleneck for folks looking to do
| document parsing. It's up there amongst the hardest problems in
| the space alongside complex form region parsing. I don't have
| the exact statistics handy, but I'd estimate that ~25% of the
| pages we parse have some hairy tables in them!
| michaefe wrote:
| Not surprising to see Reducto at the top; it's by far the best
| option we've tried.
| adit_a wrote:
| Part of the goal with releasing the dataset is to highlight how
| hard PDF parsing can be. Reducto models are SOTA, but they aren't
| perfect.
|
| We constantly see alternatives show one ideal table to claim
| they're accurate. Parsing some tables is not hard.
|
| What happens when it has merged cells, dense text, rotations, or
| no gridlines? Will your table outputs be the same when a user
| uploads a document twice?
|
| Our team is relentlessly focused on solving for the full range of
| scenarios so our customers don't have to. Excited to share more
| about our next-gen models soon.
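|
| If you want to sanity-check that last point against any parser, a
| quick determinism test is easy to write (parse_table below is a
| hypothetical stand-in for whatever extraction call you use):
|
|     # Hypothetical check: parse the same file several times and
|     # compare stable hashes of the structured output.
|     import hashlib, json
|
|     def fingerprint(table):
|         # table: list of rows of cell strings (any JSON-able form)
|         blob = json.dumps(table, sort_keys=True).encode()
|         return hashlib.sha256(blob).hexdigest()
|
|     def is_deterministic(parse_table, path, runs=3):
|         results = {fingerprint(parse_table(path)) for _ in range(runs)}
|         return len(results) == 1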
| gregw2 wrote:
| I have real-world bank statements for which I've been unable to
| find any PDF/AI extractor that does a good job.
|
| (To summarize, the core challenge appears to be recognizing
| nested columnar layout formats combined with odd line wrapping
| within those columns.)
|
| Is there anyone I can submit an example few pages to for
| consideration in some benchmark?
| adit_a wrote:
| happy to add them to future iterations of this dataset if you
| want to send a few pages over!
___________________________________________________________________
(page generated 2024-11-05 23:01 UTC)