[HN Gopher] Demystifying Text Data with the Unstructured Python ...
___________________________________________________________________
Demystifying Text Data with the Unstructured Python Library
Author : saeedesmaili
Score : 100 points
Date : 2023-07-06 14:47 UTC (8 hours ago)
(HTM) web link (saeedesmaili.com)
(TXT) w3m dump (saeedesmaili.com)
| yuppiepuppie wrote:
| It would help this article's quality if the author had included
| an output example from each code snippet. As a reader I'm left to
| imagine what the output looks like.
| saeedesmaili wrote:
| Author here. That's a good point. I'll add output examples.
| agadius wrote:
| If you're willing to run Java, Apache Tika is extremely good at
| parsing content (https://tika.apache.org/).
| convivialdingo wrote:
| I second this suggestion. I tested numerous Python tools to
| extract text - nothing matches Tika for general extraction of
| just about any data format.
|
| However - if you can expect a certain format beforehand - then
| Python is better since you can extract higher-quality data
| (tables, lists) with the appropriate tool.
| saeedesmaili wrote:
| Do you have any suggestions for Python libraries (other than
| what's mentioned in the post)?
| convivialdingo wrote:
| I've had good luck with python-docx for reading Word
| documents (typically specifications). Tables are supported
| - but it's not obvious where the table comes from in the
| document and I had to come up with a hacky way to read image
| captions.
|
| PDF has been hit or miss, but pypdf has improved in the
| last couple of years. Depending on the document you'll
| sometimes get random spaces or nospacesatall.
| saeedesmaili wrote:
| I tried python-docx with a bunch of docx files
| (downloaded from Google Docs). It returns empty strings
| for hyperlinks and I couldn't manage to fix this. So if
| there is a sentence like "This is an important link to
| another doc or url." and the "link" is a hyperlink,
| python-docx returns "This is an important to another doc
| or url."
| icegreentea2 wrote:
| Heh, I got a bit into hacking on python-docx last year
| (the original author seems to be focusing on other things
| than python-docx now) - I have a fork/branch where I
| tried to more properly implement external hyperlink
| functionality
| (https://github.com/icegreentea/python-docx/pull/7)
|
| I realize now staring at this, that I might have broken
| API a little. You can't do "text = paragraph.text"
| anymore, but you can do "text = ''.join([run.text for run
| in paragraph.runs])" instead.
|
| If you're curious at all why it breaks, it's because in
| the OOXML spec paragraphs are made up of an ordered list
| of runs or hyperlinks (and hyperlinks can then contain
| additional runs). The master branch just implements
| paragraphs as an ordered list of runs (and ignores all
| hyperlinks).
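
To make the structure described above concrete, here is a minimal sketch using only Python's standard library. It does not use python-docx or the fork; the XML is a hand-written stand-in for a single paragraph from `word/document.xml`, and the helper names (`run_text`, `direct_runs_text`, `all_runs_text`) are made up for illustration. A real hyperlink element would normally also carry a relationship id attribute, which is left out here.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written example paragraph (an assumption, not taken from a real file):
# two plain runs with a hyperlink between them, and the hyperlink wraps a run.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">This is an important </w:t></w:r>
  <w:hyperlink>
    <w:r><w:t>link</w:t></w:r>
  </w:hyperlink>
  <w:r><w:t xml:space="preserve"> to another doc or url.</w:t></w:r>
</w:p>
"""


def run_text(run):
    """Concatenate the text of every <w:t> directly inside one run."""
    return "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))


def direct_runs_text(paragraph):
    """Join only the runs that are direct children of the paragraph."""
    return "".join(run_text(r) for r in paragraph.findall(f"{{{W}}}r"))


def all_runs_text(paragraph):
    """Join every run in document order, including runs nested in hyperlinks."""
    return "".join(run_text(r) for r in paragraph.iter(f"{{{W}}}r"))


paragraph = ET.fromstring(PARAGRAPH_XML)
print(direct_runs_text(paragraph))  # the word "link" is missing
print(all_runs_text(paragraph))     # the word "link" is present
```

Reading only the direct child runs reproduces the symptom reported earlier in the thread (the linked word disappears), while walking nested runs keeps it.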
| convivialdingo wrote:
| Hey, that's fantastic. I'll definitely check that out.
| saeedesmaili wrote:
| This sounds amazing! Thanks for sharing it, I will try it
| to see if it can replace the main python-docx. For my use
| case, it's enough to get the full text of each paragraph
| and heading (hyperlinks included), while still being able
| to separate them when needed.
| icegreentea2 wrote:
| Actually, I just realized that I had provided a 'one-off'
| hack to a similarish situation here:
| https://github.com/python-openxml/python-docx/issues/1123#is...
|
| Replace the `qn("w:ins")` in the example with
| `qn("w:hyperlink")` and that should hopefully work?
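
The snippet in the linked issue is not reproduced here. As a rough illustration of what swapping the qualified name changes, the sketch below uses only the standard library. The `qn` function is a small stand-in for python-docx's helper of the same name (it expands a prefixed tag such as `w:hyperlink` into `{namespace}hyperlink` form), and `paragraph_text`, its `wrappers` parameter, and the sample XML are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the WordprocessingML prefix is needed for this sketch.
NSMAP = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def qn(tag):
    """Stand-in for python-docx's qn(): 'w:r' -> '{namespace-uri}r'."""
    prefix, local = tag.split(":")
    return f"{{{NSMAP[prefix]}}}{local}"


def paragraph_text(paragraph, wrappers=("w:hyperlink",)):
    """Collect run text in order, also descending into the given wrapper tags."""
    wrapper_tags = {qn(name) for name in wrappers}
    parts = []
    for child in paragraph:
        if child.tag == qn("w:r"):
            runs = [child]                        # plain run
        elif child.tag in wrapper_tags:
            runs = child.findall(qn("w:r"))       # runs nested in a wrapper
        else:
            runs = []                             # anything else is skipped
        for run in runs:
            parts.extend(t.text or "" for t in run.findall(qn("w:t")))
    return "".join(parts)


# Hand-written sample: a plain run, a hyperlink, and a tracked insertion.
SAMPLE = (
    f'<w:p xmlns:w="{NSMAP["w"]}">'
    "<w:r><w:t>See </w:t></w:r>"
    "<w:hyperlink><w:r><w:t>the spec</w:t></w:r></w:hyperlink>"
    "<w:ins><w:r><w:t> (draft)</w:t></w:r></w:ins>"
    "</w:p>"
)

p = ET.fromstring(SAMPLE)
print(paragraph_text(p, wrappers=()))                        # plain runs only
print(paragraph_text(p))                                     # plus hyperlink runs
print(paragraph_text(p, wrappers=("w:hyperlink", "w:ins")))  # plus inserted runs
```

Which wrapper tags are descended into decides which text survives, which is why changing the name passed to `qn` may be enough to adapt the workaround from tracked insertions to hyperlinks.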
| oersted wrote:
| I have been using it extensively during the last few weeks. I'm
| very thankful for such a clean and practical API, and I think it
| will become the central solution for ingesting heterogeneous text
| in the Python ecosystem.
|
| However, I'm afraid it is not there yet. Other libraries like
| PDFMiner give higher quality outputs and specialized libraries
| like Camelot are still needed to extract tables as reasonably
| well-formatted text. It also needs a lot of extra tooling for web
| scraping. Sure, it can read plain HTML from a URL, but it cannot
| run JavaScript or control things like the User-Agent. It could be
| argued that such features are not within the scope, but it is
| rather bothersome for a library that presents a magic `partition`
| function for most standard text sources.
|
| I'm sure it will get there soon though. It shouldn't be hard to
| integrate with state-of-the-art parsers and tooling, and the
| simple API undoubtedly brings a lot of peace of mind.
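
One possible workaround for the User-Agent limitation mentioned above is to fetch the page yourself and hand the HTML text to the parser afterwards. The sketch below uses only the standard library; `build_request` and `fetch_html` are hypothetical helper names, and it does nothing about JavaScript, which would still need a browser-based tool.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent, timeout=10.0):
    """Download a page and decode it using the charset the server declares."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (performs a network request, so it is left commented out):
# html = fetch_html("https://example.com/", "my-reader/0.1")
# The resulting string can then be passed to whichever HTML parser you use.
```

Keeping the download step separate from parsing also makes it easier to add retries, caching, or a headless browser later without touching the extraction code.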
| froggychairs wrote:
| API looks very clean :) I've also been avoiding LangChain since
| it just seems too big for my tastes. Will give this a shot
| CShorten wrote:
| Awesome!
| dogline wrote:
| I hadn't seen the unstructured python library before. Seems handy
| to parse personal text, like the author is doing.
___________________________________________________________________
(page generated 2023-07-06 23:01 UTC)