[HN Gopher] Transform DOCX into LLM-ready data
___________________________________________________________________
Transform DOCX into LLM-ready data
Author : sergiishcherbak
Score : 7 points
Date : 2025-05-04 22:42 UTC (2 days ago)
(HTM) web link (contextgem.dev)
(TXT) w3m dump (contextgem.dev)
| sergiishcherbak wrote:
| As part of work on my open-source project ContextGem, I've built
| a native, zero-dependency DOCX converter that transforms Word
| documents into LLM-ready data.
|
| This custom-built converter processes Word XML directly,
| extracts content comprehensively, and covers what other
| open-source tools often miss or lack support for:
|
| - Rich paragraph and sentence metadata for enhanced context
|
| - Misaligned tables
|
| - Comments, footnotes, and textboxes
|
| - Embedded images
|
| The converted document can then be easily used in ContextGem's
| LLM extraction workflows.
|
| Perfect for developers building contract intelligence
| applications where precision matters. The converter preserves
| document structure and relationships, empowering LLMs to better
| understand and analyze document content.
|
| Try it, or share it with your dev team, and see the difference
| in your document processing pipeline!
|
| GitHub: https://github.com/shcherbak-ai/contextgem
|
| All DocxConverter features:
| https://contextgem.dev/converters/docx.html
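For readers wondering what "directly processes Word XML" can look like with no third-party dependencies, here is a minimal sketch using only the Python standard library. It is not ContextGem's implementation or API. It relies only on the fact that a DOCX file is a ZIP archive whose main body lives in `word/document.xml`; the helper names `extract_paragraphs` and `make_sample_docx` are hypothetical, and the sample archive is a stripped-down fixture rather than a complete DOCX that Word would open.

```python
from __future__ import annotations

import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_bytes: bytes) -> list[str]:
    """Return the text of each non-empty w:p paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs (w:r), each holding w:t nodes.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a tiny in-memory archive with only word/document.xml (test fixture)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buf.getvalue()


if __name__ == "__main__":
    for line in extract_paragraphs(make_sample_docx()):
        print(line)
```

A real converter has to go well beyond this: comments, footnotes, textboxes, and images live in other parts of the archive or in other elements, which this sketch does not read.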
| WalterGR wrote:
| _zero-dependency DOCX converter_
|
| I've read that there are a lot of OpenXML elements that are
| pretty opaque. They appear to basically be XML-esque
| representations of binary, in-memory structs used internally by
| Office. (Maybe this has changed over time.)
|
| How much OpenXML does this actually handle?
|
| _Extracts information that other open-source tools often do
| not capture: misaligned tables_
|
| Could you expand on what you mean by misaligned tables? Are
| these tables that appear as separate 'table nodes' in the XML,
| or ones that appear as a single node but have wonky formatting?
| obeavs wrote:
| Hey! This is really awesome. Do you intend to support analysis
| on redlining/tracked changes? That's where it would become very
| useful for my use cases.
| eightysixfour wrote:
| Yes, this is the one that always gets me in the MS ecosystem.
| Would make a few of my workflows so much better.
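On the tracked-changes question: the thread does not say whether ContextGem handles redlines, so the following is only a rough standard-library sketch of how such revisions appear in WordprocessingML, not a description of the project. Inserted runs are wrapped in `w:ins` and deleted runs in `w:del`, with deleted text stored in `w:delText`. The function name `list_tracked_changes` and the sample XML are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag: str) -> str:
    """Qualify a local tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def list_tracked_changes(document_xml: str) -> list[tuple[str, str, str]]:
    """Return (kind, author, text) for each w:ins / w:del element, in document order."""
    root = ET.fromstring(document_xml)
    changes = []
    for el in root.iter():
        if el.tag == w("ins"):
            # Inserted text sits in ordinary w:t nodes inside the w:ins wrapper.
            text = "".join(t.text or "" for t in el.iter(w("t")))
            changes.append(("insert", el.get(w("author"), ""), text))
        elif el.tag == w("del"):
            # Deleted text is stored in w:delText rather than w:t.
            text = "".join(t.text or "" for t in el.iter(w("delText")))
            changes.append(("delete", el.get(w("author"), ""), text))
    return changes


SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>The fee is </w:t></w:r>"
    '<w:del w:id="1" w:author="Alice"><w:r><w:delText>100</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Alice"><w:r><w:t>250</w:t></w:r></w:ins>'
    "</w:p></w:body></w:document>"
)

if __name__ == "__main__":
    for kind, author, text in list_tracked_changes(SAMPLE_XML):
        print(kind, author, repr(text))
```

This ignores other revision markup such as moves and formatting changes, and it reports empty text for `w:ins` or `w:del` markers that only flag a paragraph mark.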
| TiredOfLife wrote:
| How does it compare to https://github.com/microsoft/markitdown?
___________________________________________________________________
(page generated 2025-05-06 23:01 UTC)