[HN Gopher] MarkItDown: Python tool for converting files and off...
___________________________________________________________________
MarkItDown: Python tool for converting files and office documents
to Markdown
Author : Handy-Man
Score : 176 points
Date : 2024-12-13 18:02 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ezxs wrote:
| it would be cool if Word just had that implemented inside the
| product like Google Docs does.
| benatkin wrote:
| Nary a mention of LLMs in the readme. That was an unexpected but
| pleasant surprise, when the idea of converting something to
| markdown for LLMs is floated as if it's new and the greatest
| thing since sliced bread.
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
|
| It's interesting to read the code. It's mostly glue code, and
| most of it is in a single 1101-line file. But it does indeed say
| what the README says it does. Here is the _special handling for
| Wikipedia_ :
| https://github.com/microsoft/markitdown/blob/main/src/markit...
|
| Edit: good to see the one from yesterday flagged. I tried to
| assume good intent, but also wondered if it was a place to draw a
| line in the sand. https://news.ycombinator.com/item?id=42405758
|
| Edit 2: ah, it came down to simple violation of the Show HN
| rules. I didn't notice, but yeah, that's definitely the case.
| zamadatix wrote:
| > Nary a mention of LLMs in the readme. That was an unexpected
| but pleasant surprise
|
| No surprise it has still managed to come up in the comments in
| spite of that!
| benatkin wrote:
| Yep, and that's fine. It's just that there are a lot of false
| assumptions and magical thinking going around about LLMs and
| Markdown and I was glad to not find any in the README.
| irskep wrote:
| I worked on an in-house version of this feature for my employer
| (turning files into LLM-friendly text). After reading the source
| code, I can say this is a pretty reasonable implementation of
| this type of thing. But I would avoid using it for images, since
| the LLM providers let you just pass images directly, and I would
| also avoid using it for spreadsheets, since LLMs are very bad at
| interpreting Markdown tables.
|
| There are a lot of random startups and open source projects who
| try to make this space sound fancy, but I really hope the end
| state is a simple project like this, easy to understand and easy
| to deploy.
|
| I do wish it had a knob to turn for "how much processing do you
| want me to do." For PDF specifically, you either have to get a
| crappy version of the plain text using heuristics in a way that
| is very sensitive to how the PDF is exported, or you have to go
| full OCR, and it's annoying when a project locks you into one or
| the other. I'm also not sure I'd want to use the speech-to-text
| features here since they might have very different performance
| characteristics than the text-to-text stuff.
| cosmie wrote:
| From your experience, what would be the best way to handle
| spreadsheets?
| simonw wrote:
| I don't think tabular data of any sort is a particularly good
| fit for LLMs at the moment. What are you trying to do with
| it?
|
| If you want to answer questions like "how many students does
| Everglade High School have?" and you have a spreadsheet of
| schools where one of the columns is "number of students" I
| guess you could feed that into an LLM, but it doesn't feel
| like a great tool for the job.
|
| I'd instead use systems like ChatGPT Code Interpreter where
| the LLM gets to load up that data programmatically and answer
| questions by running code against it. Text-to-SQL systems
| could work well for that too.
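
A minimal sketch of the "answer by running code" idea from the comment above, using Python's standard `sqlite3` module. The table name, column names, and all rows are made up for illustration; in a Text-to-SQL setup the LLM would be asked to produce the `SELECT` statement rather than read the table itself.

```python
import sqlite3

# Hypothetical rows standing in for a spreadsheet of schools.
ROWS = [
    ("Everglade High School", 1240),
    ("Cypress Middle School", 610),
    ("Palmetto Elementary", 385),
]


def load_schools(rows):
    """Load the rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schools (name TEXT, number_of_students INTEGER)")
    conn.executemany("INSERT INTO schools VALUES (?, ?)", rows)
    return conn


def student_count(conn, school_name):
    """Return the student count for one school, or None if it is not listed."""
    row = conn.execute(
        "SELECT number_of_students FROM schools WHERE name = ?",
        (school_name,),
    ).fetchone()
    return row[0] if row else None


conn = load_schools(ROWS)
print(student_count(conn, "Everglade High School"))
```

The lookup is done by the database, so an off-by-one row or column misread cannot happen at answer time; the remaining risk is the model writing the wrong query.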
| btown wrote:
| This is an active area of research:
| https://github.com/SpursGoZmy/Awesome-Tabular-LLMs is a
| good starting point!
| cosmie wrote:
| For me personally, a lot of times it's for table
| augmentation purposes. Appending additional columns to a
| dataset, such as a cleaned/standardized version of another
| field, extracting a value from another field, or appending
| categorization attributes (sometimes pre-seeded and
| sometimes just giving it general direction).
|
| Or sometimes I'll manually curate a field like that, and
| then ask it to generate an Excel function that can be used
| to produce as similar a result as possible for automated
| categorization in the future.
|
| So in most cases I both want to provide it with tabular
| data, and also want tabular data back out. In general I've
| gotten decent results for these sorts of use cases, but
| when it falls down it's almost always addressable by
| tinkering with the formatting related instructions -
| sometimes by tweaking the input and sometimes by tweaking
| the instructions for the desired output.
| danielmarkbruce wrote:
| Many LLMs are ok with json and html tables. Not perfect, but
| not terrible.
| simonw wrote:
| I've seen enough examples of an LLM misinterpreting a
| column or row - resulting in returning the incorrect answer
| to a question because it was off by one in one of the
| directions - that I'm nervous about trusting them for this.
|
| JSON objects are different - there the key/value
| relationship is closer in the set of tokens which usually
| makes it more reliable.
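
One way to read the point about JSON objects, as a sketch: repeat the column name next to every value so each key/value pair sits together in the token stream. The function name and sample data are hypothetical, and this is not claimed to be what any particular tool does.

```python
import json


def table_to_records(header, rows):
    """Turn a header row plus data rows into one dict per data row."""
    return [dict(zip(header, row)) for row in rows]


header = ["school", "number_of_students"]
rows = [
    ["Everglade High School", 1240],
    ["Cypress Middle School", 610],
]

records = table_to_records(header, rows)
# Each record carries its own column names, unlike a Markdown table row.
print(json.dumps(records, indent=2))
```

The cost is size: every column name is repeated once per row, so wide tables use noticeably more tokens than the Markdown form.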
| danielmarkbruce wrote:
| yeah... so, you want to two step it. Parse the table into
| something structured, then answer the question. For a lot
| of LLM "problems", it's about the same as teaching a kid
| a multi-step problem in math - if you try to do it in one
| step, you are going to have a hard time.
| irskep wrote:
| The only reason I'm not immediately answering is because I
| need to check whether it's a trade secret. We do our own
| thing that I haven't seen anywhere else and works super well.
| Sorry for being mysterious, I'll try to get an OK to share.
|
| Edit: yeah I can't talk about it, sorry
| nprateem wrote:
| Give it the data as separate columns. For each cell give it
| the row index and the data.
|
| That way it's just working with lists but can easily key that
| eg all this data is in row 3, etc. Tell it to correlate data
| by the first value in the pair like that.
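
A rough sketch of the layout described above, under the assumption that "separate columns" means one list per column, with each cell stored as a (row index, value) pair. The function names and the prompt format are illustrative only.

```python
def columns_with_row_index(header, rows, start=1):
    """Return {column_name: [(row_index, value), ...]} for a table."""
    columns = {name: [] for name in header}
    for row_index, row in enumerate(rows, start=start):
        for name, value in zip(header, row):
            columns[name].append((row_index, value))
    return columns


def format_for_prompt(columns):
    """Render each column as a labelled list of [row_index, value] pairs."""
    lines = []
    for name, cells in columns.items():
        lines.append(f"{name}:")
        lines.extend(f"  [{row_index}, {value!r}]" for row_index, value in cells)
    return "\n".join(lines)


header = ["school", "number_of_students"]
rows = [["Everglade High School", 1240], ["Cypress Middle School", 610]]
columns = columns_with_row_index(header, rows)
print(format_for_prompt(columns))
```

The prompt would then tell the model to correlate values across columns by the first element of each pair.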
| layer8 wrote:
| Markdown isn't suitable for most spreadsheets in the first
| place, IMO.
| themanmaran wrote:
| The reason there's a lot of startups in the OCR space (us being
| one of them) is the classic 80/20 rule. Any solution that's 80%
| accurate just doesn't work for most applications.
|
| Converting a clean .docx into markdown is 10 lines of python.
| But what about the same document with a screenshot of an excel
| file? Or complex table layouts? The .NORM files that people
| actually use. Definitely agree with having a toggle between
| rules-based/ocr. But if you're looking at company wide docs,
| you won't always know which to pick.
|
| Example with one of our test files:
|
| Input: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/Omni...
|
| MarkItDown: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/mark...
|
| Ours: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/omni...
|
| The response from MarkItDown seems pretty barebones. I expected
| it to convert the clean pdf table element into a markdown
| table, but it just pulls the plaintext, which drops the
| header/column relationship.
| irskep wrote:
| > Any solution that's 80% accurate just doesn't work for most
| applications.
|
| And yet people use LLMs, for which "80% accuracy" is still
| mostly an aspiration. :-)
|
| I think it's reasonably likely most companies end up
| using open source libraries, at least partly because it lets
| them avoid adding another GDPR sub-processor.
| Unstructured.io, one of your competitors, goes as far as
| having an AWS Marketplace setup so customers can use their
| own infrastructure but still pay them.
|
| LLMs might get better at consuming badly-formatted data, so
| the data only needs to meet that minimum bar, vs the
| admittedly very nice output you showed.
| themanmaran wrote:
| > LLMs might get better at consuming badly-formatted data
|
| Oh agreed. There's definitely a meeting in the middle
| between better ingestion and smarter models. LLMs are
| already a great fuzzing layer for that type of
| interpretation. And even with a perfect WYSIWYG text
| extraction, you're still limited by how coherent the
| original document was in the first place.
| fritzo wrote:
| Converters like this are much more useful if they are
| bi-directional, even if the two directions aren't exactly inverses.
| theanonymousone wrote:
| Why is the repository 95% "HTML" code?
| sphars wrote:
| There are some very large HTML files in the test directory,
| including an offline version of the Microsoft Wikipedia page.
| valbaca wrote:
| tests
| markhneedham wrote:
| Quite curious how this compares to docling -
| https://github.com/DS4SD/docling
|
| docling uses an LLM IIRC, so that's already a difference in
| approach
| phren0logy wrote:
| In my use, docling has not involved an LLM. There are a few
| choices for OCR, but I don't think a vision model is one of
| them.
|
| It's certainly touted as a solution to digest documents into
| plain text for LLM use, but (unless I just haven't run into
| that part of it) it does not employ an LLM for its functions.
| simonw wrote:
| If you have uv installed you can run this against a file without
| first installing anything like this:
|
|     uvx markitdown path-to-file.pdf
|
| (This will cache the necessary packages the first time you run
| it, then reuse those cached packages on future invocations.)
|
| I've tried it against HTML and PDFs so far and it seems pretty
| decent.
| wrboyce wrote:
| Is uvx just part of uv? I keep a few python packages around via
| pipx (itself via homebrew) but am a big fan of uv for python
| projects... Do I just need to install uv globally (via brew?)
| to do this? Is there a mechanism to also have the installed
| utils available in my PATH (so I can invoke them without a uvx
| prefix)?
| karl42 wrote:
| You can install to your path with 'uv tool install'.
|
| uvx is just an alias for 'uv tool run'.
| wrboyce wrote:
| Thank you! I should explore the uv docs properly.
| buibuibui wrote:
| Wow that is magic! I just installed uv because of your comment.
| figomore wrote:
| Pandoc (https://pandoc.org) can be used to convert a .docx file
| to markdown and other file formats like djot and typst. I don't
| think pandoc can convert powerpoint and excel files.
| disgruntledphd2 wrote:
| Yeah that was the interesting part to me, at least. Plus, it's
| Microsoft so hopefully it will work for their files.
| LordDragonfang wrote:
| ...I did not catch that it was from Microsoft. I was
| wondering why a random markdown converter was so notable.
| zamadatix wrote:
| The hard part about document conversion is not finding a tool
| which can convert the formats but the tool which does it best.
| I wonder how MarkItDown ranks for the tasks for the various
| types.
| jez wrote:
| The README of MarkItDown mentions "indexing and text
| analysis" as the two motivating features, whereas Pandoc is
| more interested in document preparation via conversion that
| maintains rich text formatting.
|
| Since my personal use leans towards the latter, I'm hesitant
| to believe that this tool will work better for me but others
| may have other priorities.
| LittleTimothy wrote:
| This is... interesting. From my understanding - and people can
| correct me if I'm wrong, but didn't Microsoft spend an extremely
| large amount of effort essentially trying to screw people who
| made things like this in the 2000s? Interoperability and the Open
| Office movement were pretty hard fought. It's kind of crazy to see
| MSFT do this today. Did I just misunderstand and the underlying
| formats (docx etc) were actually pretty friendly, or have the
| formats evolved a lot since then? Or is it more a case of "It
| doesn't matter if it looks terrible because we're feeding it to
| the AI beast anyway"
|
| A cynic might say it became suddenly easy when MSFT had a reason
| to allow you to generate markdown to feed into its AI?
| dmonitor wrote:
| I don't think that's a cynical take considering the description
|
| > (e.g., for indexing, text analysis, etc.)
| btown wrote:
| For PDFs it's entirely a wrapper around
| https://pdfminersix.readthedocs.io/en/latest/tutorial/highle... -
| https://github.com/microsoft/markitdown/blob/main/src/markit...
|
| So if that's your use case, PDFMiner might be better to integrate
| with directly!
| persedes wrote:
| or just use pymupdf
| kepano wrote:
| Never thought I'd see the day. Yet... not surprising because
| plain text is the ideal format for analysis, LLM training, etc.
|
| The question businesses will start to ask is why are we putting
| our data into .docx files in the first place?
| mdaniel wrote:
| I can't tell if you're trolling or what but the idea of most
| business users (a) knowing markdown (b) reverting to html for
| the damn near _infinite_ layout and/or styling things that
| markdown doesn't support (c) ignoring _mail merge_ (d) wanting
| change tracking ... makes your comment laughable
| throwaway81523 wrote:
| Why not Pandoc?
| ulrischa wrote:
| I wonder how a powerpoint can be converted to markdown
| poidos wrote:
| Very timely, thanks!
|
| Was just yesterday working on chaining together `xlsx` and
| `tablemark` to accomplish this. I found
| `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md`
| to be just what I needed to get my spreadsheet into a
| reasonably-legible markdown table when rendered by GitLab.
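
For anyone who would rather do the same cleanup in Python, here is a hedged sketch that is slightly stricter than the sed substitution: it blanks only table cells whose entire content is `NaN`, and leaves other text alone. It assumes table rows start with `|` and that cells contain no escaped pipes; `blank_nan_cells` is a made-up helper name, not part of MarkItDown.

```python
def blank_nan_cells(markdown):
    """Blank out Markdown table cells whose entire content is NaN."""
    cleaned = []
    for line in markdown.splitlines():
        # Only touch lines that look like table rows.
        if line.lstrip().startswith("|"):
            cells = line.split("|")
            cells = [" " if cell.strip() == "NaN" else cell for cell in cells]
            line = "|".join(cells)
        cleaned.append(line)
    return "\n".join(cleaned)


sample = "| name | qty |\n| --- | --- |\n| bolt | NaN |\nNaN in prose stays"
print(blank_nan_cells(sample))
```

The converted Markdown could be piped through this on stdin, or the function could be applied to the converter's text output before writing the file.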
| constantinum wrote:
| I will try it with some complex layout PDFs or documents with
| tables. These documents have real business use cases for
| automation -- insurance, banking, etc.
|
| Anyone here who wants to convert PDF documents or scanned images
| as-is, preserving the layout, do try LLMWhisperer -
| https://unstract.com/llmwhisperer/
| starkparker wrote:
| I index a lot of tabletop RPG books in PDF format, which often
| have complex visual layouts and many tables that parsers
| typically have difficulty with. If this is just a wrapper around
| PDFMiner, as noted in another comment, I don't see any value
| added by this tool.
|
| This handles them... fine. It either doesn't recognize or never
| attempts to handle tables, which makes it fundamentally a non-
| starter for my typical usage, but to its credit it seems to have
| at least some sense of table cells; it organizes columns in a
| manner that isn't fully readable but isn't as broken as some
| other solutions, either.
|
| It otherwise handles text that's in variable-width columns or
| wrapped in complex ways around art work rather well. It inserts
| extraneous spaces on fully justified text, which is frustrating
| but not unusual, and sometimes adds extraneous line breaks on
| mid-sentence column breaks.
|
| The biggest miss, though, is how it completely misses headings!
| This seems fundamental for any use case, including grooming
| sources for LLM training. It doesn't identify a single heading in
| any PDF I've thrown at it so far.
| hks0 wrote:
| This is amazing and really useful, love the idea; but let me tell
| you a story, it's a bit of a tangent but relevant enough:
|
| In an online language class we were sending the assignments to
| our teacher via slack, the teacher would then mark our mistakes
| and send it back.
|
| I, as a true hater of all the heavy weight text formats for
| everyday communications, autonomously fired up the terminal,
| wrote my assignment in my_name.md and happily sent it without
| giving it any thought. This is what I hear the next session:
|
| "... and everybody did a great job! Although someone just sent me
| their assignment in a stupid format. I don't know what it was! I
| could neither highlight it or make the text bold or anything.
| Don't do that to me again please".
|
| Before that I never dreamed of meeting someone who preferred a
| word document _after_ opening a .md file, and I also learned if I
| had chosen product design as a career, everyone would've suffered
| immensely (or maybe not, I would've just ended up jobless).
| EasyMark wrote:
| If you are talking about an online language class as in "I'm
| learning Yiddish" then I don't understand why it would be
| confusing that someone who isn't a coder or writer (and that's a
| big if) doesn't know what the heck markdown is and hence
| wouldn't want to deal with it, since they're used to MS Word or
| another word processor app. That's probably like 95% of the
| population at least.
| yawnxyz wrote:
| anyone get the Bing search DocumentConverter working? It keeps
| getting me null results
| sneak wrote:
| I wish we had a markdown equivalent for spreadsheets. Markdown
| tables ain't it.
___________________________________________________________________
(page generated 2024-12-13 23:00 UTC)