hngopher.com

       [HN Gopher] Show HN: Extract Table from Image
       ___________________________________________________________________
        
       Show HN: Extract Table from Image
        
       Author : v3gas
       Score  : 115 points
       Date   : 2021-09-28 07:15 UTC (15 hours ago)
        
 (HTM) web link (extract-table.com)
 (TXT) w3m dump (extract-table.com)
        
       | nanis wrote:
       | With this image[1] from this question on SO[2], the output[3] is
       | missing the last row. FWIW, I've had the occasional miraculous-
       | looking results from AWS Textract, but you do need to keep an eye
       | on what's happening.
       | 
       | Update: I just checked a bit more carefully, and this example[4]
       | is also missing the last row.
       | 
       | Also, the Danish ø seems problematic on your web page, whereas the
       | CSV has the right UTF-8 encoded bytes.
       | 
       | [1]: https://i.stack.imgur.com/y7Zrt.png
       | 
       | [2]: https://stackoverflow.com/q/69363708/100754
       | 
       | [3]: https://results.extract-table.com/8d4818867ad604792819e98808...
       | 
       | [4]: https://results.extract-table.com/254d95722a2c2b1df72fc26b59...
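
The symptom described above (correct UTF-8 bytes in the CSV, a broken "ø" on the web page) is typical of bytes being decoded with the wrong charset. The thread does not confirm that this is the cause for extract-table.com; the sketch below only assumes a Latin-1 misread, and the sample name is illustrative.

```python
# Sketch: correctly encoded UTF-8 bytes become mojibake when a consumer
# decodes them with the wrong charset (Latin-1 is an assumption here).
name = "Røkke"

utf8_bytes = name.encode("utf-8")        # the bytes a correct CSV would hold
garbled = utf8_bytes.decode("latin-1")   # wrong charset assumed by the reader
restored = garbled.encode("latin-1").decode("utf-8")  # reverse the mistake

print(utf8_bytes)
print(garbled)
print(restored)
```

If this is what is happening, declaring the charset explicitly on the page (for example via the `Content-Type` header) is usually enough for browsers to decode the bytes correctly.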
        
       | whirlwin wrote:
       | Nice. Fun fact: The third example table is an ordered list of
       | Norway's richest people (according to net worth, I think)
        
       | BrandiATMuhkuh wrote:
       | This is really awesome. I have tried to solve that many times. I
       | got close, with OpenCV and Azure ML. I have even tried AWS
       | Textract (~2 years ago). But this is the best implementation I
       | have seen so far. Congratulations.
       | 
       | I'm not sure what application you are thinking of. But the
       | reason I'm following this problem is UX. Years ago, I worked on a
       | project where anyone can add product prices into a DB. They do
       | that by typing their receipt (line items) into the DB. The major
       | issue was, the UX was horrible.
       | 
       | With an API like yours, this is super simple. One photo. That's
       | all.
       | 
       | Maybe I'll revisit it as a side project.
        
       | w-m wrote:
       | I'm answering questions about Pandas (the Python data analysis
       | framework) on StackOverflow from time to time. It's an exercise
       | in patience, because many people will post screenshots of their
       | data instead of a reproducible code example. You'll have to point
       | about every other newcomer to the documentation on how to write a
       | proper question that one can actually answer.
       | 
       | I'd imagine other areas around StackOverflow (SQL, R?) are
       | fighting similar issues. I've just tried it with a question (sure
       | enough the second newest Pandas tagged question had a table as an
       | image), and your tool produced a nice .csv.
       | 
       | It would be a godsend to have a button on StackOverflow that
       | would replace a user-uploaded image of a table with some Pandas
       | code that constructs the same DataFrame. Currently I would have
       | to download the image, upload it to extract-table.com, download
       | the .csv, load it into Python, run some code to create the code-
       | based DataFrame.
       | 
       | I'd consider sending people on StackOverflow to your tool if you
       | cut down some of the steps: (1) allowing users to paste in the URL
       | of an image, and (2) producing Pandas code output that can be
       | directly copy/pasted from the site (not having to download a csv).
       | 
       | For illustration: here's what the Pandas code would look like for
       | the first example of extract-table.com:
       | 
       |     df = pd.DataFrame(
       |         {'Name': {0: 'David', 1: 'Jessica', 2: 'Warren'},
       |          'Gender': {0: 'Male', 1: 'Female', 2: 'Male'},
       |          'Age': {0: 23, 1: 47, 2: 12}}
       |     )
        
         | MattGaiser wrote:
         | Could do it with a Chrome extension. Add a button to the right
         | click context menu and get the tabular data in the popup.
        
         | pietrovismara wrote:
         | Off topic funny story: My highest voted answer on SO is a very
         | basic one about Pandas, from 7 years ago. It's funny that I've
         | only used Pandas for a few weeks, years ago (I would need to
         | relearn it from scratch now), but 90% of my SO score comes from
         | that answer and I still get more points almost daily. In fact
         | I'm in the top 6% of SO mostly thanks to that answer.
        
           | belval wrote:
           | I'm in the same boat, 95% of my SO points come from an answer
           | that was basically a copy-pasted script to fix an obscure
           | VMware error with Ubuntu. Turns out a lot of people had the
           | same issue that day.
        
             | w-m wrote:
             | Since all votes have the same weight I guess it makes sense
             | that the answers to most basic questions or highly common
             | problems will get the most points. Maybe SO should have a
             | button to donate points to an answer that really saved your
             | bacon, a super-upvote if you will. (I know you can attach
             | bounties to questions, but that's not really feasible when
             | you come across something that has already been answered).
             | 
             | But yeah, crowd behavior is fun. I have the feeling I can
             | time when some computer vision courses (or the semester)
             | start, as suddenly there are many upvotes on my basic answer
             | explaining BGR/RGB color space confusion with OpenCV, the
             | computer vision library :)
        
               | naberhausj wrote:
               | Funny that this is brought up. As an undergraduate in a
               | Data Science class, we did an analysis of the SO dataset
               | (we processed the whole thing using RStudio running on a
               | big EC2 instance). I found that roughly 1,000 users who
               | have made fewer than fifty posts have moderator
               | privileges. In that report, I suggested that they should
               | give users quality points (Upvotes / # Page Views) rather
               | than straight reputation points.
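
The proposed metric (upvotes divided by page views) is easy to sketch. The function name and all numbers below are hypothetical and only illustrate how a heavily viewed basic answer could score lower than a niche one; returning 0.0 for posts with no views is an arbitrary choice for this sketch.

```python
def quality_score(upvotes: int, page_views: int) -> float:
    """Upvotes per page view; 0.0 when the post has no recorded views."""
    if page_views <= 0:
        return 0.0
    return upvotes / page_views


# Hypothetical numbers, not taken from the SO dataset.
popular = quality_score(upvotes=500, page_views=1_000_000)
niche = quality_score(upvotes=20, page_views=400)
print(popular, niche)
```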
        
         | unwind wrote:
         | People post images of C code too. Best are the ones that post a
         | link to the image on some external image host. Gaaah.
        
       | z3t4 wrote:
       | You should make it into a browser plugin; it's so annoying when
       | websites have tables in images.
        
       | mzs wrote:
       | https://github.com/vegarsti/extract-table
        
       | greaterweb wrote:
       | Nice work putting together this tool. Have you seen either Spark
       | OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They
       | both do a pretty good job of data extraction from tables as well.
       | 
       | [1] https://www.johnsnowlabs.com/spark-ocr/
       | 
       | [2] https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.ht...
        
       | BillSaysThis wrote:
       | Really nice but... wondering how long this will last as a free
       | tool given AWS fees.
        
       | pveierland wrote:
       | Neat tool! There appear to be two minor issues in the last
       | example. There is an encoding issue with "ø" characters
       | ("RÃ¸kke"), and a column split appears to be missing between the
       | closely spaced numbers ("33 300 22 700" vs "33 300,22 700").
       | Possible, though possibly non-trivial, improvement: harmonize
       | formatting within the same column to avoid mixed occurrences of
       | "7800" / "7 800".
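
The harmonization suggested above could start from something like the following minimal sketch. It assumes thousands are grouped with a plain space or a no-break space, as in the quoted "7 800", and it leaves every other cell untouched. `normalize_number` and `normalize_column` are hypothetical names, not part of extract-table, and the sketch does not attempt the harder problem of splitting two fused numbers into separate columns.

```python
import re

# One number whose digits are grouped in threes by a space or no-break
# space, e.g. "7 800" or "33 300".
_GROUPED = re.compile(r"\d{1,3}(?:[ \u00a0]\d{3})+")


def normalize_number(cell: str) -> str:
    """Drop thousands separators when the whole cell is one grouped number."""
    text = cell.strip()
    if _GROUPED.fullmatch(text):
        return re.sub(r"[ \u00a0]", "", text)
    return text


def normalize_column(cells):
    """Apply normalize_number to every cell of a column."""
    return [normalize_number(cell) for cell in cells]


print(normalize_column(["7800", "7 800", "33 300", "n/a"]))
```

Because the pattern requires every group after the first to have exactly three digits, a cell such as "12 34" is left alone rather than being merged into a single number.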
        
       | howmayiannoyyou wrote:
       | Nice job. Actually though, what the world really needs is ML that
       | divines the trend and perhaps indices/values from images of
       | charts.
        
         | plaidfuji wrote:
         | This has been my pet side project for many years. What use case
         | would you apply it to?
        
       | MattGaiser wrote:
       | Pair this with a snipping tool and all sorts of people in banking
       | would use it for a few hours a day, especially if it could paste
       | to Excel or at least fill the clipboard in a way pastable to
       | Excel.
       | 
       | I used to work for a bank on their innovation team and pitched
       | basically this, but as an intern I had neither the skill nor time
       | to do it. But it was certainly something a bunch of people
       | internally wanted.
        
       | eihli wrote:
       | Nice. I worked on something similar but far less robust:
       | https://github.com/eihli/image-table-ocr. It fails to find the
       | tables on the example images at extract-table.com, but the code
       | is heavily commented at
       | https://eihli.github.io/image-table-ocr/pdf_table_extraction...
       | so there's high visibility into what's going on and what needs to
       | change to get it to work with images of different sizes/fonts.
        
       | jnsie wrote:
       | Really cool. I'm interested to hear your plans for this. Are you
       | planning to offer it as a service/open source/etc.?
        
       ___________________________________________________________________
       (page generated 2021-09-28 23:00 UTC)