[HN Gopher] Show HN: I made a tool to convert images of tables t...
___________________________________________________________________
Show HN: I made a tool to convert images of tables to CSV
Author : aperrin
Score : 104 points
Date : 2021-03-09 19:58 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| luplex wrote:
| This is similar to WebPlotDigitizer, which helps you extract data
| from graphs:
|
| https://automeris.io/WebPlotDigitizer/index.html
| aperrin wrote:
| Hi! Thank you for sharing this. It's a great tool that I came
| across when searching for an image-to-CSV converter, but it seems
| to work only with graphs, if I'm not mistaken.
| luplex wrote:
| Yes, your tool is a welcome addition!
| ohazi wrote:
| I had been meaning to find or write a tool like this for ages --
| oftentimes the only place where you can find pinout information
| for a chip is a table buried on page 7xx of a massive PDF
| datasheet. Trying to create a symbol for, e.g., a 200+ ball BGA is
| _awful_.
| aperrin wrote:
| Hi! I couldn't find a tool like that when I needed it, so I made
| this one as a Python beginner's project. I hope you'll find it
| useful. :-)
| roussanoff wrote:
| A similar tool:
|
| https://github.com/eihli/image-table-ocr
| vmchale wrote:
| That's pretty neat.
| cosmotic wrote:
| How fast is it? Does it work with rotated images? How about
| multiple tables per image?
| cosmotic wrote:
| What about hand writing?
| aperrin wrote:
| The program runs on Python and Tesseract. It is quite fast
| (less than one second for a table of 100 numbers), though I
| never tested it with larger tables. It detects numbers in an
| image of a table, which is expected to be unrotated and cropped
| so that only the table is visible. So, to process multiple
| tables per image, one needs to create a separate image for each
| table. This program is rather simple, I must say. ;-)
|
| As for handwriting, I think Tesseract can handle the
| recognition if the writing is neat, but the table still needs to
| meet the assumptions above. Also, the pre-processing can't
| remove much noise, so that can be a problem too!
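| A minimal sketch of a Python and Tesseract pipeline of the kind
| described above. This is not the tool's actual code: the function
| names are hypothetical, it assumes the pytesseract and Pillow
| packages plus a local Tesseract install, and it assumes the OCR
| text comes back as one table row per line with cells separated by
| whitespace.

```python
import csv
import io


def ocr_text_to_csv(text: str) -> str:
    """Turn whitespace-separated OCR output into CSV text.

    Assumes one table row per non-empty line (hypothetical helper).
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def image_to_csv(image_path: str) -> str:
    """OCR a cropped, unrotated table image and return CSV text."""
    # Imported lazily so the parsing helper works without OCR installed.
    from PIL import Image
    import pytesseract

    # "--psm 6" asks Tesseract to treat the image as one uniform block of text.
    text = pytesseract.image_to_string(Image.open(image_path), config="--psm 6")
    return ocr_text_to_csv(text)
```

| For an image holding several tables, each table would first be
| cropped into its own file and passed to image_to_csv separately.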
| technicolorwhat wrote:
| Is there also a solution for automatic border detection? Last
| year I tried reading bank statements, which were scanned slips.
| Unfortunately they didn't have any borders, which made it super
| difficult to extract the content. It would be cool if someone
| could make something for this :) I thought it would be easy, but
| I racked my brain over it for several days until I gave up.
| [deleted]
| boogies wrote:
| https://github.com/eihli/image-table-ocr seems to automatically
| find tables within larger images, IDK if it works without
| borders though.
| eihli wrote:
| The logic for detecting a table is to get rid of everything
| but vertical lines over a certain length, save that in one
| image, then get rid of everything but horizontal lines of a
| certain length, save that image. Then overlay the two and
| take the bounding rectangle. So you don't need the table to
| have a border as long as you have vertical and horizontal
| lines and they extend far enough to encompass all the data
| you need.
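| The steps above can be sketched in pure Python on a small binary
| grid (1 = ink, 0 = background). This only illustrates the idea and
| is not the project's code: the function names and the min_len
| parameter are hypothetical, and a real implementation would more
| likely use an image library's morphological operations.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # 1 = ink, 0 = background


def keep_long_runs(grid: Grid, min_len: int) -> Grid:
    """Keep only horizontal runs of ink that are at least min_len long."""
    out = [[0] * len(row) for row in grid]
    for y, row in enumerate(grid):
        x = 0
        while x < len(row):
            if not row[x]:
                x += 1
                continue
            start = x
            while x < len(row) and row[x]:
                x += 1
            if x - start >= min_len:
                for i in range(start, x):
                    out[y][i] = 1
    return out


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns so vertical runs become horizontal ones."""
    return [list(column) for column in zip(*grid)]


def table_bounding_box(grid: Grid, min_len: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the long lines, or None."""
    horizontal = keep_long_runs(grid, min_len)
    vertical = transpose(keep_long_runs(transpose(grid), min_len))
    # Overlay the two line images and collect every surviving pixel.
    points = [
        (x, y)
        for y, row in enumerate(grid)
        for x in range(len(row))
        if horizontal[y][x] or vertical[y][x]
    ]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)
```

| Short runs such as text strokes or specks are dropped by the
| length filter, so only the ruling lines decide the rectangle.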
| adflux wrote:
| Azure Form Recognizer API
| spudwaffle wrote:
| It would be cool if you could add a license to this!
| aperrin wrote:
| Done, thank you for the tip! ;-)
| leeoniya wrote:
| also https://github.com/tabulapdf/tabula-java
| arathore wrote:
| Great project! I've had success using camelot-py
| (https://camelot-py.readthedocs.io) to extract tabular data from
| PDFs (for images, I use ImageMagick to convert them to PDF). If
| your table has borders, the default method (lattice) works quite
| well. For non-bordered tables there is the 'stream' option, but
| it usually requires a bit more preprocessing to get usable
| results.
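| A short sketch of the lattice/stream choice described above,
| assuming camelot-py is installed and the PDF contains selectable
| text. The helper names, the output file naming, and the choice of
| page 1 are all hypothetical.

```python
from typing import List


def flavor_for(bordered: bool) -> str:
    """Pick the camelot parsing method: lattice for ruled tables, else stream."""
    return "lattice" if bordered else "stream"


def pdf_tables_to_csv(pdf_path: str, out_prefix: str, bordered: bool = True) -> List[str]:
    """Write each table found on page 1 to its own CSV file."""
    # Imported lazily so flavor_for can be used without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="1", flavor=flavor_for(bordered))
    written = []
    for index, table in enumerate(tables):
        csv_path = f"{out_prefix}-{index}.csv"
        table.to_csv(csv_path)
        written.append(csv_path)
    return written
```

| With stream, inspecting each table's df attribute before export
| is a reasonable way to see how much cleanup the result needs.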
___________________________________________________________________
(page generated 2021-03-09 23:00 UTC)