[HN Gopher] MarkItDown: Python tool for converting files and off...
       ___________________________________________________________________
        
       MarkItDown: Python tool for converting files and office documents
       to Markdown
        
       Author : Handy-Man
       Score  : 176 points
       Date   : 2024-12-13 18:02 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ezxs wrote:
       | it would be cool if Word just had that implemented inside the
       | product like Google Docs does.
        
       | benatkin wrote:
       | Nary a mention of LLMs in the readme. That was an unexpected but
       | pleasant surprise, when the idea of converting something to
       | markdown for LLMs is floated as if it's new and the greatest
       | thing since sliced bread.
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
       | 
       | It's interesting to read the code. It's mostly glue code, and
       | most of it is in a single 1101-line file. But it does indeed say
       | what the README says it does. Here is the _special handling for
       | Wikipedia_ :
       | https://github.com/microsoft/markitdown/blob/main/src/markit...
       | 
       | Edit: good to see the one from yesterday flagged. I tried to
       | assume good intent, but also wondered if it was a place to draw a
       | line in the sand. https://news.ycombinator.com/item?id=42405758
       | 
       | Edit 2: ah, it came down to simple violation of the Show HN
       | rules. I didn't notice, but yeah, that's definitely the case.
        
         | zamadatix wrote:
         | > Nary a mention of LLMs in the readme. That was an unexpected
         | but pleasant surprise
         | 
         | No surprise it has still managed to come up in the comments in
         | spite of that!
        
           | benatkin wrote:
           | Yep, and that's fine. It's just that there are a lot of false
           | assumptions and magical thinking going around about LLMs and
           | Markdown and I was glad to not find any in the README.
        
       | irskep wrote:
       | I worked on an in-house version of this feature for my employer
       | (turning files into LLM-friendly text). After reading the source
       | code, I can say this is a pretty reasonable implementation of
       | this type of thing. But I would avoid using it for images, since
       | the LLM providers let you just pass images directly, and I would
       | also avoid using it for spreadsheets, since LLMs are very bad at
       | interpreting Markdown tables.
       | 
       | There are a lot of random startups and open source projects who
       | try to make this space sound fancy, but I really hope the end
       | state is a simple project like this, easy to understand and easy
       | to deploy.
       | 
       | I do wish it had a knob to turn for "how much processing do you
       | want me to do." For PDF specifically, you either have to get a
       | crappy version of the plain text using heuristics in a way that
       | is very sensitive to how the PDF is exported, or you have to go
       | full OCR, and it's annoying when a project locks you into one or
       | the other. I'm also not sure I'd want to use the speech-to-text
       | features here since they might have very different performance
       | characteristics than the text-to-text stuff.
        
         | cosmie wrote:
         | From your experience, what would be the best way to handle
         | spreadsheets?
        
           | simonw wrote:
           | I don't think tabular data of any sort is a particularly good
           | fit for LLMs at the moment. What are you trying to do with
           | it?
           | 
           | If you want to answer questions like "how many students does
           | Everglade High School have?" and you have a spreadsheet of
           | schools where one of the columns is "number of students" I
           | guess you could feed that into an LLM, but it doesn't feel
           | like a great tool for the job.
           | 
           | I'd instead use systems like ChatGPT Code Interpreter where
           | the LLM gets to load up that data programmatically and answer
           | questions by running code against it. Text-to-SQL systems
           | could work well for that too.
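
A minimal sketch of the "run code against the data" idea, using Python's built-in sqlite3 module. The table name, column names, and rows are made up for illustration, and the SQL string is hard-coded here where a text-to-SQL system would have the model generate it.

```python
import sqlite3

# Hypothetical rows standing in for a spreadsheet of schools.
ROWS = [
    ("Everglade High School", 1250),
    ("Pine Ridge Middle School", 640),
]


def load_schools(rows):
    """Load (name, number_of_students) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schools (name TEXT, number_of_students INTEGER)")
    conn.executemany("INSERT INTO schools VALUES (?, ?)", rows)
    return conn


def run_query(conn, sql, params=()):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


conn = load_schools(ROWS)
# In a real setup the model would write this query from the user's question.
sql = "SELECT number_of_students FROM schools WHERE name = ?"
print(run_query(conn, sql, ("Everglade High School",)))
```

The point of the design is that the model only has to produce a short query; the lookup itself is done by the database, so there is no column or row for the model to miscount.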
        
             | btown wrote:
             | This is an active area of research:
             | https://github.com/SpursGoZmy/Awesome-Tabular-LLMs is a
             | good starting point!
        
             | cosmie wrote:
             | For me personally, a lot of times it's for table
             | augmentation purposes. Appending additional columns to a
             | dataset, such as a cleaned/standardized version of another
             | field, extracting a value from another field, or appending
             | categorization attributes (sometimes pre-seeded and
             | sometimes just giving it general direction).
             | 
             | Or sometimes I'll manually curate a field like that, and
             | then ask it to generate an Excel function that can be used
             | to produce as similar a result as possible for automated
             | categorization in the future.
             | 
             | So in most cases I both want to provide it with tabular
             | data, and also want tabular data back out. In general I've
             | gotten decent results for these sorts of use cases, but
             | when it falls down it's almost always addressable by
             | tinkering with the formatting related instructions -
             | sometimes by tweaking the input and sometimes by tweaking
             | the instructions for the desired output.
        
           | danielmarkbruce wrote:
           | Many LLMs are ok with json and html tables. Not perfect, but
           | not terrible.
        
             | simonw wrote:
             | I've seen enough examples of an LLM misinterpreting a
             | column or row - resulting in returning the incorrect answer
             | to a question because it was off by one in one of the
             | directions - that I'm nervous about trusting them for this.
             | 
             | JSON objects are different - there the key/value
             | relationship is closer in the set of tokens which usually
             | makes it more reliable.
        
               | danielmarkbruce wrote:
               | yeah... so, you want to two step it. Parse the table into
               | something structured, then answer the question. For a lot
               | of LLM "problems", it's about the same as teaching a kid
               | a multi-step problem in math - if you try to do it in one
               | step, you are going to have a hard time.
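
One way the first of those two steps might look: turn a Markdown table into a list of JSON objects, so that each value sits next to its column name before any question is asked. This is only a sketch; the function name is hypothetical, and it assumes a simple pipe table with a header row, a separator row, and no escaped pipes inside cells.

```python
import json


def parse_markdown_table(text):
    """Parse a simple Markdown pipe table into a list of dicts keyed by header."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]

    def cells(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at lines[2].
    return [dict(zip(header, cells(line))) for line in lines[2:]]


table = """
| school | students |
|--------|----------|
| A      | 10       |
| B      | 20       |
"""
records = parse_markdown_table(table)
print(json.dumps(records, indent=2))
```

The JSON output is what would be passed to the model in the second step, in place of the original table.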
        
           | irskep wrote:
           | The only reason I'm not immediately answering is because I
           | need to check whether it's a trade secret. We do our own
           | thing that I haven't seen anywhere else and works super well.
           | Sorry for being mysterious, I'll try to get an OK to share.
           | 
           | Edit: yeah I can't talk about it, sorry
        
           | nprateem wrote:
           | Give it the data as separate columns. For each cell give it
           | the row index and the data.
           | 
           | That way it's just working with lists but can easily work out
           | that, e.g., all this data is in row 3. Tell it to correlate data
           | by the first value in each pair.
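
A rough sketch of that layout, assuming the spreadsheet has already been read into a header list and a list of rows. The function names and the exact text format are illustrative choices, not something specified in the comment.

```python
def table_to_indexed_columns(header, rows):
    """Return {column name: [(row_index, value), ...]} with 1-based row indices."""
    columns = {name: [] for name in header}
    for row_index, row in enumerate(rows, start=1):
        for name, value in zip(header, row):
            columns[name].append((row_index, value))
    return columns


def render_columns(columns):
    """Render each column as its own list of 'row_index: value' lines."""
    lines = []
    for name, cells in columns.items():
        lines.append(f"{name}:")
        lines.extend(f"  {row_index}: {value}" for row_index, value in cells)
    return "\n".join(lines)


columns = table_to_indexed_columns(["a", "b"], [[1, 2], [3, 4]])
print(render_columns(columns))
```

Every cell carries its row index explicitly, so matching values across columns becomes a lookup on that index instead of a count of positions.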
        
           | layer8 wrote:
           | Markdown isn't suitable for most spreadsheets in the first
           | place, IMO.
        
         | themanmaran wrote:
         | The reason there's a lot of startups in the OCR space (us being
         | one of them) is the classic 80/20 rule. Any solution that's 80%
         | accurate just doesn't work for most applications.
         | 
         | Converting a clean .docx into markdown is 10 lines of python.
         | But what about the same document with a screenshot of an excel
         | file? Or complex table layouts? The .NORM files that people
         | actually use. Definitely agree with having a toggle between
         | rules-based/ocr. But if you're looking at company wide docs,
         | you won't always know which to pick.
         | 
         | Example with one of our test files:
         | 
         | Input: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/Omni...
         | 
         | MarkItDown: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/mark...
         | 
         | Ours: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/omni...
         | 
         | The response from MarkItDown seems pretty barebones. I expected
         | it to convert the clean pdf table element into a markdown
         | table, but it just pulls the plaintext, which drops the
         | header/column relationship.
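
For the "clean .docx" case mentioned above, a standard-library sketch of roughly what those few lines of Python could look like. It is not MarkItDown's implementation. It assumes heading paragraphs use style IDs of the form "Heading1", "Heading2", and so on (the usual default in English-language Word documents), and it flattens tables, images, and lists, which is exactly where the hard cases start.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_markdown(source):
    """Convert the paragraphs of a .docx (path or file-like object) to Markdown."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    blocks = []
    for para in root.iter(W + "p"):
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        if not text.strip():
            continue
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        # Assumption: heading styles are named Heading1, Heading2, ...
        if style_id.startswith("Heading") and style_id[7:].isdigit():
            text = "#" * min(int(style_id[7:]), 6) + " " + text
        blocks.append(text)
    return "\n\n".join(blocks)


# Build a tiny in-memory .docx containing only the part this sketch reads.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>Title</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", SAMPLE_XML)
sample.seek(0)
print(docx_to_markdown(sample))
```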
        
           | irskep wrote:
           | > Any solution that's 80% accurate just doesn't work for most
           | applications.
           | 
           | And yet people use LLMs, for which "80% accuracy" is still
           | mostly an aspiration. :-)
           | 
           | I think it's reasonably likely most companies end up
           | using open source libraries, at least partly because it lets
           | them avoid adding another GDPR sub-processor.
           | Unstructured.io, one of your competitors, goes as far as
           | having an AWS Marketplace setup so customers can use their
           | own infrastructure but still pay them.
           | 
           | LLMs might get better at consuming badly-formatted data, so
           | the data only needs to meet that minimum bar, vs the
           | admittedly very nice output you showed.
        
             | themanmaran wrote:
             | > LLMs might get better at consuming badly-formatted data
             | 
             | Oh agreed. There's definitely a meeting in the middle
             | between better ingestion and smarter models. LLMs are
             | already a great fuzzing layer for that type of
             | interpretation. And even with a perfect WYSIWYG text
             | extraction, you're still limited by how coherent the
             | original document was in the first place.
        
       | fritzo wrote:
       | Converters like this are much more useful if they are bi-
       | directional, even if the two directions aren't exactly inverses.
        
       | theanonymousone wrote:
       | Why is the repository 95% "HTML" code?
        
         | sphars wrote:
         | There are some very large HTML files in the test directory,
         | including an offline version of the Microsoft Wikipedia page
        
         | valbaca wrote:
         | tests
        
       | markhneedham wrote:
       | Quite curious how this compares to docling -
       | https://github.com/DS4SD/docling
       | 
       | docling uses an LLM IIRC, so that's already a difference in
       | approach
        
         | phren0logy wrote:
         | In my use, docling has not involved an LLM. There are a few
         | choices for OCR, but I don't think a vision model is one of
         | them.
         | 
         | It's certainly touted as a solution to digest documents into
         | plain text for LLM use, but (unless I just haven't run into
         | that part of it) it does not employ an LLM for its functions.
        
       | simonw wrote:
       | If you have uv installed you can run this against a file without
       | first installing anything, like this:
       | 
       |     uvx markitdown path-to-file.pdf
       | 
       | (This will cache the necessary packages the first time you run
       | it, then reuse those cached packages on future invocations.)
       | 
       | I've tried it against HTML and PDFs so far and it seems pretty
       | decent.
        
         | wrboyce wrote:
         | Is uvx just part of uv? I keep a few python packages around via
         | pipx (itself via homebrew) but am a big fan of uv for python
         | projects... Do I just need to install uv globally (via brew?)
         | to do this? Is there a mechanism to also have the installed
         | utils available in my PATH (so I can invoke them without a uvx
         | prefix)?
        
           | karl42 wrote:
           | You can install to your path with 'uv tool install'.
           | 
           | uvx is just an alias for 'uv tool run'.
        
             | wrboyce wrote:
             | Thank you! I should explore the uv docs properly.
        
         | buibuibui wrote:
         | Wow that is magic! I just installed uv because of your comment.
        
       | figomore wrote:
       | Pandoc (https://pandoc.org) can be used to convert a .docx file
       | to markdown and other file formats like djot and typst. I don't
       | think pandoc can convert powerpoint and excel files.
        
         | disgruntledphd2 wrote:
         | Yeah that was the interesting part to me, at least. Plus, it's
         | Microsoft so hopefully it will work for their files.
        
           | LordDragonfang wrote:
           | ...I did not catch that it was from Microsoft. I was
           | wondering why a random markdown converter was so notable.
        
         | zamadatix wrote:
         | The hard part about document conversion is not finding a tool
         | which can convert the formats but the tool which does it best.
         | I wonder how MarkItDown ranks for the tasks for the various
         | types.
        
           | jez wrote:
           | The README of MarkItDown mentions "indexing and text
           | analysis" as the two motivating features, whereas Pandoc is
           | more interested in document preparation via conversion that
           | maintains rich text formatting.
           | 
           | Since my personal use leans towards the latter, I'm hesitant
           | to believe that this tool will work better for me but others
           | may have other priorities.
        
       | LittleTimothy wrote:
       | This is... interesting. From my understanding - and people can
       | correct me if I'm wrong, but didn't Microsoft spend an extremely
       | large amount of effort essentially trying to screw people who
       | made things like this in the 2000s? Interoperability and the Open
       | Office movement were pretty hard fought. It's kind of crazy to see
       | MSFT do this today. Did I just misunderstand and the underlying
       | formats (docx etc) were actually pretty friendly, or have the
       | formats evolved a lot since then? Or is it more a case of "It
       | doesn't matter if it looks terrible because we're feeding it to
       | the AI beast anyway"
       | 
       | A cynic might say it became suddenly easy when MSFT had a reason
       | to allow you to generate markdown to feed into its AI?
        
         | dmonitor wrote:
         | I don't think that's a cynical take considering the description
         | 
         | > (e.g., for indexing, text analysis, etc.)
        
       | btown wrote:
       | For PDFs it's entirely a wrapper around
       | https://pdfminersix.readthedocs.io/en/latest/tutorial/highle... -
       | https://github.com/microsoft/markitdown/blob/main/src/markit...
       | 
       | So if that's your use case, PDFMiner might be better to integrate
       | with directly!
        
         | persedes wrote:
         | or just use pymupdf
        
       | kepano wrote:
       | Never thought I'd see the day. Yet... not surprising because
       | plain text is the ideal format for analysis, LLM training, etc.
       | 
       | The question businesses will start to ask is why are we putting
       | our data into .docx files in the first place?
        
         | mdaniel wrote:
         | I can't tell if you're trolling or what but the idea of most
         | business users (a) knowing markdown (b) reverting to html for
         | the damn near _infinite_ layout and/or styling things that
         | markdown doesn't support (c) ignoring _mail merge_ (d) wanting
         | change tracking ... makes your comment laughable
        
       | throwaway81523 wrote:
       | Why not Pandoc?
        
       | ulrischa wrote:
       | I wonder how a powerpoint can be converted to markdown
        
       | poidos wrote:
       | Very timely, thanks!
       | 
       | Was just yesterday working on chaining together `xlsx` and
       | `tablemark` to accomplish this. I found `uvx markitdown
       | my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just
       | what I needed to get my spreadsheet into a reasonably-legible
       | markdown table when rendered by GitLab.
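
The NaN cleanup in that pipeline can also be done in Python. A small sketch that mirrors the sed expression `s/ NaN/ /g`; the function name is made up for illustration, and like the sed version it will also blank out any legitimate cell text that begins with " NaN".

```python
def blank_out_nan(markdown_text):
    """Replace every ' NaN' with a single space, like sed 's/ NaN/ /g'."""
    return markdown_text.replace(" NaN", " ")


table = "| name | score |\n| --- | --- |\n| a | NaN |\n| b | 3 |"
print(blank_out_nan(table))
```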
        
       | constantinum wrote:
       | I will try it with some complex layout PDFs or documents with
       | tables. These documents have real business use cases for
       | automation -- insurance, banking, etc.
       | 
       | Anyone here who wants to convert PDF documents or scanned images
       | as-is, preserving the layout, do try LLMWhisperer -
       | https://unstract.com/llmwhisperer/
        
       | starkparker wrote:
       | I index a lot of tabletop RPG books in PDF format, which often
       | have complex visual layouts and many tables that parsers
       | typically have difficulty with. If this is just a wrapper around
       | PDFMiner, as noted in another comment, I don't see any value
       | added by this tool.
       | 
       | This handles them... fine. It either doesn't recognize or never
       | attempts to handle tables, which makes it fundamentally a non-
       | starter for my typical usage, but to its credit it seems to have
       | at least some sense of table cells; it organizes columns in a
       | manner that isn't fully readable but isn't as broken as some
       | other solutions, either.
       | 
       | It otherwise handles text that's in variable-width columns or
       | wrapped in complex ways around art work rather well. It inserts
       | extraneous spaces on fully justified text, which is frustrating
       | but not unusual, and sometimes adds extraneous line breaks on
       | mid-sentence column breaks.
       | 
       | The biggest miss, though, is how it completely misses headings!
       | This seems fundamental for any use case, including grooming
       | sources for LLM training. It doesn't identify a single heading in
       | any PDF I've thrown at it so far.
        
       | hks0 wrote:
       | This is amazing and really useful, love the idea; but let me tell
       | you a story, it's a bit of a tangent but relevant enough:
       | 
       | In an online language class we were sending the assignments to
       | our teacher via slack, the teacher would then mark our mistakes
       | and send it back.
       | 
       | I, as a true hater of all the heavy weight text formats for
       | everyday communications, autonomously fired up the terminal,
       | wrote my assignment in my_name.md and happily sent it without
       | giving it any thought. This is what I hear the next session:
       | 
       | "... and everybody did a great job! Although someone just sent me
       | their assignment in a stupid format. I don't know what it was! I
       | could neither highlight it nor make the text bold or anything.
       | Don't do that to me again please".
       | 
       | Before that I never dreamed of meeting someone who preferred a
       | word document _after_ opening a .md file, and I also learned if I
       | had chosen product design as a career, everyone would've suffered
       | immensely (or maybe not, I would've just ended up jobless).
        
         | EasyMark wrote:
         | If you are talking about an online language class as in "I'm
         | learning Yiddish" then I don't understand why it would be
         | confusing that someone who isn't a coder or writer (and that's a
         | big if) doesn't know what the heck markdown is and hence
         | wouldn't want to deal with it, since they're used to MS Word or
         | another word processor app. That's probably like 95% of the
         | population at least.
        
       | yawnxyz wrote:
       | anyone get the Bing search DocumentConverter working? It keeps
       | getting me null results
        
       | sneak wrote:
       | I wish we had a markdown equivalent for spreadsheets. Markdown
       | tables ain't it.
        
       ___________________________________________________________________
       (page generated 2024-12-13 23:00 UTC)