[HN Gopher] RD-TableBench - Accurately evaluating table extraction
       ___________________________________________________________________
        
       RD-TableBench - Accurately evaluating table extraction
        
       Hey HN!  A ton of document parsing solutions have been coming out
       lately, each claiming SOTA with little evidence. A lot of these
       turned out to be LLM or LVM wrappers that hallucinate frequently on
       complex tables.  We just released RD-TableBench, an open benchmark
       to help teams evaluate extraction performance for complex tables.
       The benchmark includes a variety of challenging scenarios including
       scanned tables, handwriting, language detection, merged cells, and
       more.  We employed an independent team of PhD-level human labelers
       who manually annotated 1000 complex table images from a diverse set
       of publicly available documents.  Alongside this, we're also
       releasing a new bioinformatics-inspired algorithm for grading
       table similarity.
       Would love to hear any feedback!  -Raunak
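        
       The bioinformatics-inspired grading mentioned above can be thought
       of as sequence alignment over table cells. The sketch below is not
       Reducto's actual grader, just a minimal Python illustration of the
       idea: align each predicted row against its ground-truth row with
       Needleman-Wunsch dynamic programming, reward similar cells, and
       penalize inserted or dropped cells. The cell-similarity metric,
       gap penalty, and normalization are assumptions made for the
       example.
        
       # Minimal, assumption-laden sketch of an alignment-based table
       # similarity score (NOT the RD-TableBench grader).
       from difflib import SequenceMatcher

       def cell_score(a: str, b: str) -> float:
           # Similarity of two cell strings in [0, 1]; metric is an assumption.
           return SequenceMatcher(None, a.strip(), b.strip()).ratio()

       def align_row(pred: list[str], truth: list[str], gap: float = -0.5) -> float:
           # Needleman-Wunsch over the cells of one row: matches score by
           # cell similarity, inserted/dropped cells pay a gap penalty.
           n, m = len(pred), len(truth)
           dp = [[0.0] * (m + 1) for _ in range(n + 1)]
           for i in range(1, n + 1):
               dp[i][0] = i * gap
           for j in range(1, m + 1):
               dp[0][j] = j * gap
           for i in range(1, n + 1):
               for j in range(1, m + 1):
                   dp[i][j] = max(
                       dp[i - 1][j - 1] + cell_score(pred[i - 1], truth[j - 1]),
                       dp[i - 1][j] + gap,
                       dp[i][j - 1] + gap,
                   )
           return dp[n][m]

       def table_similarity(pred: list[list[str]], truth: list[list[str]]) -> float:
           # Pair rows in order and normalize by a perfect score (every
           # ground-truth cell matched exactly). Rows themselves could also
           # be aligned with the same DP; kept simple here.
           rows = min(len(pred), len(truth))
           raw = sum(align_row(pred[i], truth[i]) for i in range(rows))
           ideal = sum(len(r) for r in truth)
           return max(0.0, raw / ideal) if ideal else 0.0

       With identical tables this returns 1.0, while merged, split, or
       dropped cells pull the score toward 0.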
        
       Author : raunakchowdhuri
       Score  : 25 points
       Date   : 2024-11-05 18:46 UTC (4 hours ago)
        
 (HTM) web link (reducto.ai)
 (TXT) w3m dump (reducto.ai)
        
       | nparsan wrote:
       | This is great, but are there datasets for this already? I know
       | PubTables is like 1M labeled data points. Also, how important
       | are table schemas as a % of overall unstructured documents?
        
         | raunakchowdhuri wrote:
         | Love the PubTables work! It's a really useful dataset. Their
         | data comes from existing annotations in scientific papers, so
         | in our experience it doesn't include many of the hardest cases
         | that a lot of methods fail on today. The annotations are
         | computer-generated instead of manually labeled, so you don't
         | have things like scanned and rotated images or much diversity
         | in languages.
         | 
         | I'd encourage you to take a look at some of our data points to
         | compare for yourself! Link:
         | huggingface.co/spaces/reducto/rd_table_bench
         | 
         | In terms of the overall importance of table extraction, we've
         | found it to be a key bottleneck for folks looking to do
         | document parsing. It's up there amongst the hardest problems in
         | the space alongside complex form region parsing. I don't have
         | the exact statistics handy, but I'd estimate that ~25% of the
         | pages we parse have some hairy tables in them!
        
       | michaefe wrote:
       | Not surprising to see Reducto at the top; it's by far the best
       | option we've tried.
        
       | adit_a wrote:
       | Part of the goal with releasing the dataset is to highlight how
       | hard PDF parsing can be. Reducto models are SOTA, but they aren't
       | perfect.
       | 
       | We constantly see alternatives show one ideal table to claim
       | they're accurate. Being able to parse some tables is not hard.
       | 
       | What happens when a table has merged cells, dense text,
       | rotations, or no gridlines? Will your outputs be the same when
       | a user uploads the same document twice?
       | 
       | Our team is relentlessly focused on solving for the true range
       | of scenarios so our customers don't have to. Excited to share
       | more about our next-gen models soon.
        
       | gregw2 wrote:
       | I have real-world bank statements for which I've been unable to
       | find any PDF/AI extractor that does a good job.
       | 
       | (To summarize, the core challenge appears to be recognizing
       | nested columnar layout formats combined with odd line wrapping
       | within those columns.)
       | 
       | Is there anyone I can submit a few example pages to for
       | consideration in some benchmark?
        
         | adit_a wrote:
         | Happy to add examples to future iterations of this dataset if
         | you want to send some over!
        
       ___________________________________________________________________
       (page generated 2024-11-05 23:01 UTC)