hngopher.com

       [HN Gopher] Demystifying Text Data with the Unstructured Python ...
       ___________________________________________________________________
        
       Demystifying Text Data with the Unstructured Python Library
        
       Author : saeedesmaili
       Score  : 100 points
       Date   : 2023-07-06 14:47 UTC (8 hours ago)
        
 (HTM) web link (saeedesmaili.com)
        
       | yuppiepuppie wrote:
       | It would help this article's quality if the author had included
       | an output example from each code snippet. As a reader I'm left to
       | imagine what the output looks like.
        
         | saeedesmaili wrote:
         | Author here. That's a good point. I'll add output examples.
        
       | agadius wrote:
       | If you're OK with running Java, Apache Tika is extremely good at
       | parsing content (https://tika.apache.org/)
        
         | convivialdingo wrote:
         | I second this suggestion. I tested numerous Python tools to
         | extract text - nothing matches Tika for general extraction of
         | just about any data format.
         | 
         | However - if you can expect a certain format beforehand - then
         | Python is better since you can extract higher-quality data
         | (tables, lists) with the appropriate tool.
        
           | saeedesmaili wrote:
           | Do you have any suggestions for Python libraries (other than
           | what's mentioned in the post)?
        
             | convivialdingo wrote:
             | I've had good luck with python-docx for reading Word
             | documents (typically specifications). Tables are supported,
             | but it's not obvious where a table sits in the document,
             | and I had to come up with a hacky way to read image
             | captions.
             | 
             | PDF has been hit or miss, but pypdf has improved in the
             | last couple of years. Depending on the document you'll
             | sometimes get random spaces or nospacesatall.
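A minimal sketch of the kind of cleanup the "random spaces" case might call for, assuming the problem is runs of extra whitespace in already-extracted text. The helper name `normalize_spaces` is hypothetical and not part of pypdf. It does nothing for words that were fused together with no spaces at all, nor for stray spaces inside a word.

```python
import re


def normalize_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs on each line and trim line ends.

    Hypothetical post-processing helper; line breaks are preserved.
    """
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(normalize_spaces("Hello   world \n  foo\tbar"))
```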
        
               | saeedesmaili wrote:
               | I tried python-docx with a bunch of docx files
               | (downloaded from Google Docs). It returns empty strings
               | for hyperlinks and I couldn't manage to fix this. So if
               | there is a sentence like "This is an important link to
               | another doc or url." and the "link" is a hyperlink,
               | python-docx returns "This is an important to another doc
               | or url."
        
               | icegreentea2 wrote:
               | Heh, I got a bit into hacking on python-docx last year
               | (the original author seems to be focusing on other things
               | than python-docx now) - I have a fork/branch where I
               | tried to more properly implement external hyperlink
               | functionality
               | (https://github.com/icegreentea/python-docx/pull/7)
               | 
               | I realize now, staring at this, that I might have broken
               | the API a little. You can't do "text = paragraph.text"
               | anymore, but you can do "text = ''.join([run.text for run
               | in paragraph.runs])" instead.
               | 
               | If you're curious at all why it breaks, it's because in
               | the OOXML spec paragraphs are made up of an ordered list
               | of runs or hyperlinks (and hyperlinks can then contain
               | additional runs). The master branch just implements
               | paragraphs as an ordered list of runs (and ignores all
               | hyperlinks).
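To make the runs-versus-hyperlinks point concrete, here is a rough standard-library sketch that walks a hand-written WordprocessingML paragraph. It does not use python-docx or the fork mentioned above. The `qn` helper here is a local stand-in that only mimics the idea of expanding a `w:` prefix, the sample XML is made up for illustration, and only `w:t` text inside runs is collected (tabs, breaks and other run content are ignored).

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used by the "w:" prefix in .docx XML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def qn(tag: str) -> str:
    """Expand 'w:name' to Clark notation (local illustrative helper)."""
    prefix, name = tag.split(":")
    if prefix != "w":
        raise ValueError(f"unsupported prefix: {prefix}")
    return f"{{{W_NS}}}{name}"


def run_text(run: ET.Element) -> str:
    """Concatenate the w:t text nodes that are direct children of a run."""
    return "".join(t.text or "" for t in run.findall(qn("w:t")))


def runs_only_text(p: ET.Element) -> str:
    """Text from direct child runs only, so hyperlink text is dropped."""
    return "".join(run_text(r) for r in p.findall(qn("w:r")))


def paragraph_text(p: ET.Element) -> str:
    """Text from runs and from runs nested in hyperlinks, in document order."""
    parts = []
    for child in p:
        if child.tag == qn("w:r"):
            parts.append(run_text(child))
        elif child.tag == qn("w:hyperlink"):
            parts.extend(run_text(r) for r in child.findall(qn("w:r")))
    return "".join(parts)


# Made-up paragraph: run, hyperlink containing a run, run.
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:t xml:space="preserve">This is an important </w:t></w:r>
  <w:hyperlink><w:r><w:t>link</w:t></w:r></w:hyperlink>
  <w:r><w:t xml:space="preserve"> to another doc or url.</w:t></w:r>
</w:p>"""

paragraph = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(repr(runs_only_text(paragraph)))
    print(repr(paragraph_text(paragraph)))
```

The first function loses the word "link" because it only looks at direct child runs; the second keeps it by descending into `w:hyperlink` children as well.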
        
               | convivialdingo wrote:
               | Hey, that's fantastic. I'll definitely check that out.
        
               | saeedesmaili wrote:
               | This sounds amazing! Thanks for sharing it. I'll try it to
               | see if I can use it in place of the main python-docx. For
               | my use case it's enough to get the full text of each
               | paragraph and heading (even if it includes a hyperlink),
               | while still being able to separate them when needed.
        
               | icegreentea2 wrote:
               | Actually, I just realized that I had provided a 'one-off'
               | hack to a similarish situation here:
               | https://github.com/python-openxml/python-docx/issues/1123#is...
               | 
               | Replace the `qn("w:ins")` in the example with
               | `qn("w:hyperlink")` and that should hopefully work?
        
       | oersted wrote:
       | I have been using it extensively during the last few weeks. I'm
       | very thankful for such a clean and practical API, and I think it
       | will become the central solution for ingesting heterogeneous text
       | in the Python ecosystem.
       | 
       | However, I'm afraid it is not there yet. Other libraries like
       | PDFMiner give higher-quality output, and specialized libraries
       | like Camelot are still needed to extract tables as reasonably
       | well-formatted text. It also needs a lot of extra tooling for web
       | scraping. Sure, it can read plain HTML from a URL, but it cannot
       | run JavaScript or control things like the User-Agent header. It
       | could be argued that such features are out of scope, but their
       | absence is rather bothersome in a library that presents a magic
       | `partition` function for most standard text sources.
       | 
       | I'm sure it will get there soon though. It shouldn't be hard to
       | integrate with state-of-the-art parsers and tooling, and the
       | simple API undoubtedly brings a lot of peace of mind.
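One possible workaround for the User-Agent limitation described above is to fetch the page yourself and hand the HTML string to whatever parsing step you use afterwards. The sketch below uses only the Python standard library; the function names `build_request` and `fetch_html` and the example User-Agent value are assumptions for illustration. It does not execute JavaScript, so pages rendered client-side would still need a headless browser or similar.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Download a page and decode it using the charset the server declares.

    Falls back to UTF-8 when no charset is given. No JavaScript is run.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Example values only; replace with a real URL and an honest User-Agent.
    html = fetch_html("https://example.com/", "my-text-ingest/0.1")
    print(html[:200])
```

The returned string can then be passed to the HTML parsing step of your choice instead of letting the library fetch the URL on its own.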
        
       | froggychairs wrote:
       | API looks very clean :) I've also been avoiding LangChain since
       | it just seems too big for my tastes. Will give this a shot
        
       | CShorten wrote:
       | Awesome!
        
       | dogline wrote:
       | I hadn't seen the Unstructured Python library before. Seems handy
       | for parsing personal text, like the author is doing.
        
       ___________________________________________________________________
       (page generated 2023-07-06 23:01 UTC)