[HN Gopher] Transform DOCX into LLM-ready data
       ___________________________________________________________________
        
       Transform DOCX into LLM-ready data
        
       Author : sergiishcherbak
       Score  : 7 points
       Date   : 2025-05-04 22:42 UTC (2 days ago)
        
 (HTM) web link (contextgem.dev)
 (TXT) w3m dump (contextgem.dev)
        
       | sergiishcherbak wrote:
       | As part of work on my open-source project ContextGem, I've built
       | a native, zero-dependency DOCX converter that transforms Word
       | documents into LLM-ready data.
       | 
       | This custom-built converter processes Word XML directly. It
       | provides comprehensive content extraction and covers what other
       | open-source tools often miss or lack support for:
       | 
       | - Rich paragraph and sentence metadata for enhanced context
       | 
       | - Misaligned tables
       | 
       | - Comments, footnotes, and textboxes
       | 
       | - Embedded images
       | 
       | The converted document can then be easily used in ContextGem's
       | LLM extraction workflows.
       | 
       | Perfect for developers building contract intelligence
       | applications where precision matters. The converter preserves
       | document structure and relationships, empowering LLMs to better
       | understand and analyze document content.
       | 
       | Try it, or share it with your dev team, and see the difference in
       | your document processing pipeline!
       | 
       | GitHub: https://github.com/shcherbak-ai/contextgem
       | 
       | All DocxConverter features:
       | https://contextgem.dev/converters/docx.html
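
For readers unfamiliar with what "directly processes Word XML" involves: a DOCX file is a ZIP archive whose main body lives in word/document.xml. The sketch below is not ContextGem's DocxConverter and says nothing about how it is implemented; it only illustrates, using the Python standard library, the general idea of reading paragraph text from that part without third-party dependencies. The function name extract_paragraphs is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the text of each non-empty paragraph in word/document.xml.

    `docx_file` may be a path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # <w:p> is a paragraph; its visible text is split across <w:t> runs.
    # iter() also descends into table cells, so their paragraphs are included.
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling extract_paragraphs("contract.docx") (file name assumed) returns a list of strings. This sketch ignores everything the post lists as hard: comments, footnotes, text boxes, images, and table structure all live in other elements or other parts of the archive.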
        
         | WalterGR wrote:
         | _zero-dependency DOCX converter_
         | 
         | I've read that there are a lot of OpenXML elements that are
         | pretty opaque. They appear to basically be XML-esque
         | representations of binary, in-memory structs used internally by
         | Office. (Maybe this has changed over time.)
         | 
         | How much OpenXML does this actually handle?
         | 
         |  _Extracts information that other open-source tools often do
         | not capture: misaligned tables_
         | 
         | Could you expand on what you mean by misaligned tables? Are
         | these tables that appear as separate 'table nodes' in the XML,
         | or ones that appear as a single node but have wonky formatting?
        
         | obeavs wrote:
         | Hey! This is really awesome. Do you intend to support analysis
         | of redlining/tracked changes? That's where it would become very
         | useful for my use cases.
        
           | eightysixfour wrote:
           | Yes, this is the one that always gets me in the MS ecosystem.
           | Would make a few of my workflows so much better.
        
         | TiredOfLife wrote:
         | How does it compare to https://github.com/microsoft/markitdown?
        
       ___________________________________________________________________
       (page generated 2025-05-06 23:01 UTC)