[HN Gopher] Show HN: Extract Table from Image
___________________________________________________________________
Show HN: Extract Table from Image
Author : v3gas
Score : 115 points
Date : 2021-09-28 07:15 UTC (15 hours ago)
(HTM) web link (extract-table.com)
(TXT) w3m dump (extract-table.com)
| nanis wrote:
| With this image[1] from this question on SO[2], the output[3] is
| missing the last row. FWIW, I've had the occasional miraculous-
| looking results from AWS Textract, but you do need to keep an eye
| on what's happening.
|
| Update: I just checked a bit more carefully, and this example[4]
| is also missing the last row.
|
| Also, Danish ø seems problematic on your web page whereas the CSV
| has the right UTF-8 encoded bytes.
|
| [1]: https://i.stack.imgur.com/y7Zrt.png
|
| [2]: https://stackoverflow.com/q/69363708/100754
|
| [3]: https://results.extract-table.com/8d4818867ad604792819e98808...
|
| [4]: https://results.extract-table.com/254d95722a2c2b1df72fc26b59...
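|
| A minimal sketch of one common way the symptom above can arise
| (correct UTF-8 bytes in the CSV, a garbled character on the page).
| It assumes the bytes are decoded as Latin-1 somewhere along the
| way; the thread does not say what the site actually does, so this
| is an illustration, not a diagnosis. The sample word is arbitrary.

```python
# Illustration only: correct UTF-8 bytes can still display wrongly
# if something decodes them with the wrong character set.
text = "rødgrød"

utf8_bytes = text.encode("utf-8")        # what a correct CSV file contains
garbled = utf8_bytes.decode("latin-1")   # wrong charset assumed when displaying

# The damage is reversible as long as no bytes were dropped.
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes)
print(garbled)
print(repaired)
```

| If this is the cause, the usual fix is to declare the charset
| explicitly wherever the text is served, not to change the CSV.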
| whirlwin wrote:
| Nice. Fun fact: The third example table is an ordered list of
| Norway's richest people (according to net worth, I think)
| BrandiATMuhkuh wrote:
| This is really awesome. I have tried to solve that many times. I
| got close, with OpenCV and Azure ML. I have even tried AWS
| Textract (~2 years ago). But this is the best implementation I
| have seen so far. Congratulations.
|
| I'm not sure what application you are thinking of. But the
| reason I'm following this problem is UX. Years ago, I worked on a
| project where anyone can add product prices into a DB. They do
| that by typing their receipt (line items) into the DB. The major
| issue was, the UX was horrible.
|
| With an API like yours, this is super simple. One photo. That's
| all.
|
| Maybe I'll revisit it as a side project.
| w-m wrote:
| I'm answering questions about Pandas (the Python data analysis
| framework) on StackOverflow from time to time. It's an exercise
| in patience, because many people will post screenshots of their
| data instead of a reproducible code example. You'll have to point
| about every other newcomer to the documentation on how to write a
| proper question that one can actually answer.
|
| I'd imagine other areas around StackOverflow (SQL, R?) are
| fighting similar issues. I've just tried it with a question (sure
| enough the second newest Pandas tagged question had a table as an
| image), and your tool produced a nice .csv.
|
| It would be a godsend to have a button on StackOverflow that
| would replace a user-uploaded image of a table with some Pandas
| code that constructs the same DataFrame. Currently I would have
| to download the image, upload it to extract-table.com, download
| the .csv, load it into Python, run some code to create the code-
| based DataFrame.
|
| I'd consider sending people on StackOverflow to your tool if you
| cut down some of the steps: (1) allow pasting in the URL of an
| image, and (2) produce Pandas code output that can be directly
| copy/pasted from the site (not having to download a csv).
|
| For illustration: here's what the Pandas code would look like for
| the first example of extract-table.com:
|
|     df = pd.DataFrame(
|         {'Name': {0: 'David', 1: 'Jessica', 2: 'Warren'},
|          'Gender': {0: 'Male', 1: 'Female', 2: 'Male'},
|          'Age': {0: 23, 1: 47, 2: 12}}
|     )
| MattGaiser wrote:
| Could do it with a Chrome extension. Add a button to the right
| click context menu and get the tabular data in the popup.
| pietrovismara wrote:
| Off topic funny story: My highest voted answer on SO is a very
| basic one about Pandas, from 7 years ago. It's funny that I've
| only used Pandas for a few weeks, years ago (I would need to
| relearn it from scratch now), but 90% of my SO score comes from
| that answer and I still get more points almost daily. In fact
| I'm in the top 6% of SO mostly thanks to that answer.
| belval wrote:
| I'm in the same boat, 95% of my SO points come from an answer
| that was basically a copy pasted script to fix an obscure
| VMWare error with Ubuntu. Turns out a lot of people had the
| same issue that day.
| w-m wrote:
| Since all votes have the same weight I guess it makes sense
| that the answers to most basic questions or highly common
| problems will get the most points. Maybe SO should have a
| button to donate points to an answer that really saved your
| bacon, a super-upvote if you will. (I know you can attach
| bounties to questions, but that's not really feasible when
| you come across something that has already been answered).
|
| But yeah, crowd behavior is fun. I have the feeling I can
| tell when some computer vision courses (or the semester)
| start, as suddenly there are many upvotes on my basic answer
| explaining BGR/RGB color space confusion with OpenCV, the
| computer vision library :)
| naberhausj wrote:
| Funny that this is brought up. As an undergraduate in a
| Data Science class we did analysis on the SO dataset
| (we processed the whole thing using RStudio running on a
| big EC2 instance). I found that roughly 1,000 users who
| have made fewer than fifty posts have moderator
| privileges. In that report, I suggested that they should
| give users quality points (Upvotes / # Page Views) rather
| than straight reputation points.
| unwind wrote:
| People post images of C code too. Best are the ones that post a
| link to the image on some external image host. Gaaah.
| z3t4 wrote:
| Should make it into a browser plugin, so annoying when web sites
| have tables in images.
| mzs wrote:
| https://github.com/vegarsti/extract-table
| greaterweb wrote:
| Nice work putting together this tool. Have you seen either Spark
| OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They
| both do a pretty good job of data extraction from tables as well.
|
| [1] https://www.johnsnowlabs.com/spark-ocr/
|
| [2] https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.ht...
| BillSaysThis wrote:
| Really nice but... wondering how long this will last as a free
| tool given AWS fees.
| pveierland wrote:
| Neat tool! There appear to be two minor issues in the last
| example. There is an encoding issue with "ø" characters ("RÃ¸kke"),
| and a column split appears to be missing between the closely
| spaced numbers ("33 300 22 700" vs "33 300,22 700"). A possible
| (possibly non-trivial) improvement: harmonize formatting within the
| same column to avoid mixed occurrences of "7800" / "7 800".
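|
| A small sketch of the harmonization idea above, assuming the only
| variation within a column is a single space used as a thousands
| separator. The function name harmonize_column is hypothetical.
| Cells that do not look like one spaced number are left untouched,
| which includes merged cells such as "33 300 22 700".

```python
import re
from typing import List

# One number written with single spaces as thousands separators,
# e.g. "7 800" or "1 234 567".
_SPACED_NUMBER = re.compile(r"\d{1,3}(?: \d{3})+")


def harmonize_column(cells: List[str]) -> List[str]:
    """Remove thousands-separator spaces so numbers in a column share one style."""
    result = []
    for cell in cells:
        stripped = cell.strip()
        if _SPACED_NUMBER.fullmatch(stripped):
            stripped = stripped.replace(" ", "")
        result.append(stripped)
    return result


print(harmonize_column(["7800", "7 800", "33 300", "n/a", "33 300 22 700"]))
```

| Splitting a merged cell back into two columns needs layout
| information (pixel gaps) that the text alone does not carry, so it
| is deliberately out of scope here.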
| howmayiannoyyou wrote:
| Nice job. Actually though, what the world really needs is ML that
| divines the trend and perhaps indices/values from images of
| charts.
| plaidfuji wrote:
| This has been my pet side project for many years. What use case
| would you apply it to?
| MattGaiser wrote:
| Pair this with a snipping tool and all sorts of people in banking
| would use it for a few hours a day, especially if it could paste
| to Excel or at least fill the clipboard in a way pastable to
| Excel.
|
| I used to work for a bank on their innovation team and pitched
| basically this, but as an intern I had neither the skill nor time
| to do it. But it was certainly something a bunch of people
| internally wanted.
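|
| A sketch of the "pastable to Excel" part, assuming the tool's CSV
| output is the starting point. Tab-separated text on the clipboard
| is split into cells when pasted into Excel, so the conversion
| below is the core step. The function name csv_to_tsv and the
| sample data are made up, and putting the text on the clipboard is
| platform-specific and left out.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Re-encode CSV text as tab-separated text, one line per row."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# A quoted comma inside a cell survives as plain text in one cell.
sample = 'Name,Amount\n"Doe, Jane",7 800\n'
print(csv_to_tsv(sample))
```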
| eihli wrote:
| Nice. I worked on something similar but far less robust:
| https://github.com/eihli/image-table-ocr. It fails to find the
| tables on the example images at extract-table.com, but the code
| is heavily commented at
| https://eihli.github.io/image-table-ocr/pdf_table_extraction...
| so there's high visibility into
| what's going on and what needs to change to get it to work with
| images of different sizes/fonts.
| jnsie wrote:
| Really cool. I'm interested to hear your plans for this. Are you
| planning to offer it as a service/open source/etc.?
___________________________________________________________________
(page generated 2021-09-28 23:00 UTC)