hngopher.com

       [HN Gopher] MarkItDown: Python tool for converting files and off...
       ___________________________________________________________________
        
       MarkItDown: Python tool for converting files and office documents
       to Markdown
        
       Author : Handy-Man
       Score  : 291 points
       Date   : 2024-12-13 18:02 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ezxs wrote:
       | it would be cool if Word just had that implemented inside the
       | product like Google Docs does.
        
       | benatkin wrote:
       | Nary a mention of LLMs in the readme. That was an unexpected but
       | pleasant surprise, when the idea of converting something to
       | markdown for LLMs is floated as if it's new and the greatest
       | thing since sliced bread.
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
       | 
       | It's interesting to read the code. It's mostly glue code, and
       | most of it is in a single 1101-line file. But it does indeed do
       | what the README says it does. Here is the _special handling for
       | Wikipedia_:
       | https://github.com/microsoft/markitdown/blob/main/src/markit...
       | 
       | Edit: good to see the one from yesterday flagged. I tried to
       | assume good intent, but also wondered if it was a place to draw a
       | line in the sand. https://news.ycombinator.com/item?id=42405758
       | 
       | Edit 2: ah, it came down to simple violation of the Show HN
       | rules. I didn't notice, but yeah, that's definitely the case.
        
         | zamadatix wrote:
         | > Nary a mention of LLMs in the readme. That was an unexpected
         | but pleasant surprise
         | 
         | No surprise it has still managed to come up in the comments in
         | spite of that!
        
           | benatkin wrote:
           | Yep, and that's fine. It's just that there are a lot of false
           | assumptions and magical thinking going around about LLMs and
           | Markdown and I was glad to not find any in the README.
        
       | irskep wrote:
       | I worked on an in-house version of this feature for my employer
       | (turning files into LLM-friendly text). After reading the source
       | code, I can say this is a pretty reasonable implementation of
       | this type of thing. But I would avoid using it for images, since
       | the LLM providers let you just pass images directly, and I would
       | also avoid using it for spreadsheets, since LLMs are very bad at
       | interpreting Markdown tables.
       | 
       | There are a lot of random startups and open source projects that
       | try to make this space sound fancy, but I really hope the end
       | state is a simple project like this, easy to understand and easy
       | to deploy.
       | 
       | I do wish it had a knob to turn for "how much processing do you
       | want me to do." For PDF specifically, you either have to get a
       | crappy version of the plain text using heuristics in a way that
       | is very sensitive to how the PDF is exported, or you have to go
       | full OCR, and it's annoying when a project locks you into one or
       | the other. I'm also not sure I'd want to use the speech-to-text
       | features here since they might have very different performance
       | characteristics than the text-to-text stuff.
        
         | cosmie wrote:
         | From your experience, what would be the best way to handle
         | spreadsheets?
        
           | simonw wrote:
           | I don't think tabular data of any sort is a particularly good
           | fit for LLMs at the moment. What are you trying to do with
           | it?
           | 
           | If you want to answer questions like "how many students does
           | Everglade High School have?" and you have a spreadsheet of
           | schools where one of the columns is "number of students" I
           | guess you could feed that into an LLM, but it doesn't feel
           | like a great tool for the job.
           | 
           | I'd instead use systems like ChatGPT Code Interpreter where
           | the LLM gets to load up that data programmatically and answer
           | questions by running code against it. Text-to-SQL systems
           | could work well for that too.
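A minimal sketch of the "run code against the data" idea above, assuming the spreadsheet has been exported to CSV. The school names, student counts, column names, and the `schools` table name are all made up for illustration; in a Text-to-SQL setup the model would write the `SELECT` and a real database would execute it, so the answer comes from the query rather than from the model reading cells.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; names and numbers are invented.
CSV_TEXT = """school,number_of_students
Everglade High School,1240
Cypress Middle School,610
"""


def load_csv_into_sqlite(csv_text: str, table: str = "schools") -> sqlite3.Connection:
    """Load CSV text into an in-memory SQLite table and return the connection."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return conn


def student_count(conn: sqlite3.Connection, school: str):
    """Answer the question with SQL; returns None if the school is not listed."""
    row = conn.execute(
        "SELECT number_of_students FROM schools WHERE school = ?", (school,)
    ).fetchone()
    return None if row is None else int(row[0])


conn = load_csv_into_sqlite(CSV_TEXT)
print(student_count(conn, "Everglade High School"))  # prints 1240
```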
        
             | btown wrote:
             | This is an active area of research:
             | https://github.com/SpursGoZmy/Awesome-Tabular-LLMs is a
             | good starting point!
        
             | cosmie wrote:
             | For me personally, a lot of times it's for table
             | augmentation purposes. Appending additional columns to a
             | dataset, such as a cleaned/standardized version of another
             | field, extracting a value from another field, or appending
             | categorization attributes (sometimes pre-seeded and
             | sometimes just giving it general direction).
             | 
             | Or sometimes I'll manually curate a field like that, and
             | then ask it to generate an Excel function that can be used
             | to produce as similar a result as possible for automated
             | categorization in the future.
             | 
             | So in most cases I both want to provide it with tabular
             | data, and also want tabular data back out. In general I've
             | gotten decent results for these sorts of use cases, but
             | when it falls down it's almost always addressable by
             | tinkering with the formatting related instructions -
             | sometimes by tweaking the input and sometimes by tweaking
             | the instructions for the desired output.
        
           | danielmarkbruce wrote:
           | Many LLMs are ok with json and html tables. Not perfect, but
           | not terrible.
        
             | simonw wrote:
             | I've seen enough examples of an LLM misinterpreting a
             | column or row - resulting in returning the incorrect answer
             | to a question because it was off by one in one of the
             | directions - that I'm nervous about trusting them for this.
             | 
             | JSON objects are different - there the key/value
             | relationship is closer in the set of tokens which usually
             | makes it more reliable.
        
               | danielmarkbruce wrote:
               | yeah... so, you want to two step it. Parse the table into
               | something structured, then answer the question. For a lot
               | of LLM "problems", it's about the same as teaching a kid
               | a multi-step problem in math - if you try to do it in one
               | step, you are going to have a hard time.
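One way to sketch the first of those two steps, assuming the input is a simple Markdown pipe table: turn each row into a key/value record so the column name sits next to its value, then hand the records (for example as JSON) to the model for the question-answering step. The function name and sample table are hypothetical, and the parser deliberately ignores escaped pipes and empty trailing cells.

```python
import json


def markdown_table_to_records(table: str) -> list[dict[str, str]]:
    """Parse a simple Markdown pipe table into a list of {column: value} dicts."""
    lines = [ln.strip() for ln in table.strip().splitlines() if ln.strip()]

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split the remaining cells.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at lines[2].
    return [dict(zip(header, split_row(line))) for line in lines[2:]]


# Hypothetical sample data.
TABLE = """
| school | students |
|---|---|
| Everglade High School | 1240 |
| Cypress Middle School | 610 |
"""

records = markdown_table_to_records(TABLE)
print(json.dumps(records, indent=2))
```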
        
           | irskep wrote:
           | The only reason I'm not immediately answering is because I
           | need to check whether it's a trade secret. We do our own
           | thing that I haven't seen anywhere else and works super well.
           | Sorry for being mysterious, I'll try to get an OK to share.
           | 
           | Edit: yeah I can't talk about it, sorry
        
           | nprateem wrote:
           | Give it the data as separate columns. For each cell give it
           | the row index and the data.
           | 
           | That way it's just working with lists but can easily key on
           | the row index, e.g. all this data is in row 3, etc. Tell it to
           | correlate data by the first value in each pair.
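A rough sketch of that layout, assuming the sheet is already available as a header plus a list of rows. The function names, the 1-based row index, and the sample data are assumptions for illustration; the comment does not specify an exact prompt format.

```python
def rows_to_indexed_columns(header, rows):
    """Turn row-major data into one list per column of (row_index, value) pairs."""
    columns = {name: [] for name in header}
    for row_index, row in enumerate(rows, start=1):
        for name, value in zip(header, row):
            columns[name].append((row_index, value))
    return columns


def render_for_prompt(columns):
    """Render each column as a named list, one 'row_index: value' entry per line."""
    lines = []
    for name, cells in columns.items():
        lines.append(f"{name}:")
        lines.extend(f"  {row_index}: {value}" for row_index, value in cells)
    return "\n".join(lines)


# Hypothetical sample data.
header = ["school", "students"]
rows = [["Everglade High School", 1240], ["Cypress Middle School", 610]]

columns = rows_to_indexed_columns(header, rows)
print(render_for_prompt(columns))
```

Every value carries its own row index, so correlating cells across columns is a matter of matching the first element of each pair instead of counting positions in a table.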
        
           | layer8 wrote:
           | Markdown isn't suitable for most spreadsheets in the first
           | place, IMO.
        
           | __mharrison__ wrote:
           | LLMs are decent at Pandas.
           | 
           | I say "decent" because most of the available training data
           | for Pandas does things in a naive way.
           | 
           | OTOH, they are horrible at Polars. (I figure this is mostly a
           | lack of training data.)
        
             | disgruntledphd2 wrote:
             | > I say "decent" because most of the available training
             | data for Pandas does things in a naive way.
             | 
             | They're around the level of the median user, which is
             | pretty bad as pandas is a big and complicated API with many
             | different approaches available (as is base R, in case
             | people think I'm just hating on pandas).
        
         | themanmaran wrote:
         | The reason there are a lot of startups in the OCR space (us being
         | one of them) is the classic 80/20 rule. Any solution that's 80%
         | accurate just doesn't work for most applications.
         | 
         | Converting a clean .docx into markdown is 10 lines of python.
         | But what about the same document with a screenshot of an excel
         | file? Or complex table layouts? The .NORM files that people
         | actually use. Definitely agree with having a toggle between
         | rules-based/OCR. But if you're looking at company-wide docs,
         | you won't always know which to pick.
         | 
         | Example with one of our test files:
         | 
         | Input:
         | https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/Omni...
         | 
         | MarkItDown:
         | https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/mark...
         | 
         | Ours:
         | https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/omni...
         | 
         | The response from MarkItDown seems pretty barebones. I expected
         | it to convert the clean pdf table element into a markdown
         | table, but it just pulls the plaintext, which drops the
         | header/column relationship.
        
           | irskep wrote:
           | > Any solution that's 80% accurate just doesn't work for most
           | applications.
           | 
           | And yet people use LLMs, for which "80% accuracy" is still
           | mostly an aspiration. :-)
           | 
           | I think it's reasonably likely most companies end up
           | using open source libraries, at least partly because it lets
           | them avoid adding another GDPR sub-processor.
           | Unstructured.io, one of your competitors, goes as far as
           | having an AWS Marketplace setup so customers can use their
           | own infrastructure but still pay them.
           | 
           | LLMs might get better at consuming badly-formatted data, so
           | the data only needs to meet that minimum bar, vs the
           | admittedly very nice output you showed.
        
             | themanmaran wrote:
             | > LLMs might get better at consuming badly-formatted data
             | 
             | Oh agreed. There's definitely a meeting in the middle
             | between better ingestion and smarter models. LLMs are
             | already a great fuzzing layer for that type of
             | interpretation. And even with a perfect WYSIWYG text
             | extraction, you're still limited by how coherent the
             | original document was in the first place.
        
         | dragonwriter wrote:
         | LLM providers also let you send PDFs directly, too.
         | 
         | OTOH, sometimes _you_ are the LLM provider, and you may not be
         | using a multimodal LLM. (Or, even though feeding an LLM is a
         | common use, you may be using the markdown for another purpose.)
        
         | Ambix wrote:
         | > LLMs are very bad at interpreting Markdown tables
         | 
         | Which table format is better for LLMs? Do you have some
         | insights there?
        
       | fritzo wrote:
       | Converters like this are much more useful if they are bi-
       | directional, even if the two directions aren't exactly inverses.
        
       | theanonymousone wrote:
       | Why is the repository 95% "HTML" code?
        
         | sphars wrote:
         | There are some very large HTML files in the test directory,
         | including an offline version of the Microsoft Wikipedia page
        
         | valbaca wrote:
         | tests
        
       | markhneedham wrote:
       | Quite curious how this compares to docling -
       | https://github.com/DS4SD/docling
       | 
       | docling uses an LLM IIRC, so that's already a difference in
       | approach
        
         | phren0logy wrote:
         | In my use, docling has not involved an LLM. There are a few
         | choices for OCR, but I don't think a vision model is one of
         | them.
         | 
         | It's certainly touted as a solution to digest documents into
         | plain text for LLM use, but (unless I just haven't run into
         | that part of it) it does not employ an LLM for its functions.
        
         | ekianjo wrote:
         | docling does not use LLMs...
        
       | simonw wrote:
       | If you have uv installed you can run this against a file without
       | first installing anything, like this:
       | 
       |     uvx markitdown path-to-file.pdf
       | 
       | (This will cache the necessary packages the first time you run
       | it, then reuse those cached packages on future invocations.)
       | 
       | I've tried it against HTML and PDFs so far and it seems pretty
       | decent.
        
         | wrboyce wrote:
         | Is uvx just part of uv? I keep a few python packages around via
         | pipx (itself via homebrew) but am a big fan of uv for python
         | projects... Do I just need to install uv globally (via brew?)
         | to do this? Is there a mechanism to also have the installed
         | utils available in my PATH (so I can invoke them without a uvx
         | prefix)?
        
           | karl42 wrote:
           | You can install to your path with 'uv tool install'.
           | 
           | uvx is just an alias for 'uv tool run'.
        
             | wrboyce wrote:
             | Thank you! I should explore the uv docs properly.
        
         | buibuibui wrote:
         | Wow that is magic! I just installed uv because of your comment.
        
       | figomore wrote:
       | Pandoc (https://pandoc.org) can be used to convert a .docx file
       | to markdown and other file formats like djot and typst. I don't
       | think pandoc can convert powerpoint and excel files.
        
         | disgruntledphd2 wrote:
         | Yeah that was the interesting part to me, at least. Plus, it's
         | Microsoft so hopefully it will work for their files.
        
           | LordDragonfang wrote:
           | ...I did not catch that it was from Microsoft. I was
           | wondering why a random markdown converter was so notable.
        
           | _rs wrote:
           | That was the first thing I checked, and it looks like they're
           | using some existing python package to parse docx files. I
           | wonder if they contributed to it or vetted it strongly
        
             | disgruntledphd2 wrote:
             | Wow, I dunno if that's good or bad, certainly it's not what
             | I expected.
        
               | wis wrote:
               | Looking at the code, it looks like they used existing
               | Python packages to read and parse MS Office formats, not
               | what I expected, seeing that the repo is in Microsoft's
               | org on GitHub I expected them to have used Microsoft's
               | "official" libraries for parsing these formats, through
               | Component Object Model (COM).
               | 
               | They used Mammoth for docx (Word) [1][2], python-pptx for
               | pptx (PowerPoint) [3][4], and pandas for XLSX (Excel) [5].
               | 
               | [1] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
               | [2] https://pypi.org/project/mammoth/
               | [3] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
               | [4] https://pypi.org/project/python-pptx/
               | [5] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
        
               | jamwil wrote:
               | COM requires you to interact with the files through the
               | associated MS Office applications, whereas these libs
               | parse the ooxml file format directly.
        
         | zamadatix wrote:
         | The hard part about document conversion is not finding a tool
         | which can convert the formats but the tool which does it best.
         | I wonder how MarkItDown ranks for the tasks for the various
         | types.
        
           | jez wrote:
           | The README of MarkItDown mentions "indexing and text
           | analysis" as the two motivating features, whereas Pandoc is
           | more interested in document preparation via conversion that
           | maintains rich text formatting.
           | 
           | Since my personal use leans towards the latter, I'm hesitant
           | to believe that this tool will work better for me but others
           | may have other priorities.
        
             | gbraad wrote:
             | MarkItDown feels like running strings; the output is great
             | for text extraction and processing, not for reading by
             | humans
        
       | LittleTimothy wrote:
       | This is... interesting. From my understanding - and people can
       | correct me if I'm wrong, but didn't Microsoft spend an extremely
       | large amount of effort essentially trying to screw people who
       | made things like this in the 2000s? Interoperability and the Open
       | Office movement were pretty hard fought. It's kind of crazy to see
       | MSFT do this today. Did I just misunderstand and the underlying
       | formats (docx etc) were actually pretty friendly, or have the
       | formats evolved a lot since then? Or is it more a case of "It
       | doesn't matter if it looks terrible because we're feeding it to
       | the AI beast anyway"
       | 
       | A cynic might say it became suddenly easy when MSFT had a reason
       | to allow you to generate markdown to feed into its AI?
        
         | dmonitor wrote:
         | I don't think that's a cynical take considering the description
         | 
         | > (e.g., for indexing, text analysis, etc.)
        
         | badlibrarian wrote:
         | Microsoft filed a covenant not to sue and made all the formats
         | open ~20 years ago. A lot of people bitched at the time but
         | there's a long list of software that supports the format now.
         | It is complicated because the apps themselves are complicated
         | and decades old, and imperfect because the format or app you're
         | converting to likely doesn't support all of the features and
         | certainly none of the quirks.
         | 
         | https://en.wikipedia.org/wiki/Office_Open_XML
         | 
         | It took browsers 15 years just to render HTML whitespace nearly
         | consistently, so keep that in mind as you read that history.
        
       | btown wrote:
       | For PDFs it's entirely a wrapper around
       | https://pdfminersix.readthedocs.io/en/latest/tutorial/highle... -
       | https://github.com/microsoft/markitdown/blob/main/src/markit...
       | 
       | So if that's your use case, PDFMiner might be better to integrate
       | with directly!
        
         | persedes wrote:
         | or just use pymupdf
        
           | E_Bfx wrote:
           | pymupdf has a commercial licence that could be a problem if
           | used in a company.
        
       | kepano wrote:
       | Never thought I'd see the day. Yet... not surprising because
       | plain text is the ideal format for analysis, LLM training, etc.
       | 
       | The question businesses will start to ask is why are we putting
       | our data into .docx files in the first place?
        
         | mdaniel wrote:
         | I can't tell if you're trolling or what but the idea of most
         | business users (a) knowing markdown (b) reverting to html for
         | the damn near _infinite_ layout and/or styling things that
         | markdown doesn't support (c) ignoring _mail merge_ (d) wanting
         | change tracking ... makes your comment laughable
        
       | throwaway81523 wrote:
       | Why not Pandoc?
        
         | johannesrexx wrote:
         | Pandoc does not have a PDF reader.
        
       | ulrischa wrote:
       | I wonder how a powerpoint can be converted to markdown
        
       | poidos wrote:
       | Very timely, thanks!
       | 
       | Was just yesterday working on chaining together `xlsx` and
       | `tablemark` to accomplish this. I found `uvx markitdown
       | my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I
       | needed to get my spreadsheet into a reasonably-legible markdown
       | table when rendered by GitLab.
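The same cleanup can be sketched in Python for cases where `sed` is not available. This assumes, as the comment implies, that empty spreadsheet cells show up as `NaN` in the Markdown table output. The function name is hypothetical, and this variant is slightly stricter than the `sed` expression: it only blanks cells whose entire content is `NaN`, and only on lines that look like table rows.

```python
def blank_nan_cells(markdown: str) -> str:
    """Blank out table cells whose whole content is 'NaN'; leave other lines alone."""
    cleaned = []
    for line in markdown.splitlines():
        if line.lstrip().startswith("|"):
            cells = line.split("|")
            # Replace a cell only when NaN is its entire (trimmed) content.
            cells = [" " if cell.strip() == "NaN" else cell for cell in cells]
            line = "|".join(cells)
        cleaned.append(line)
    return "\n".join(cleaned)


# Hypothetical sample resembling a converted spreadsheet.
sample = "| item | note |\n| --- | --- |\n| widget | NaN |\nNaN outside a table stays."
print(blank_nan_cells(sample))
```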
        
       | constantinum wrote:
       | I will try it with some complex layout PDFs or documents with
       | tables. These documents have real business use cases for
       | automation -- insurance, banking, etc.
       | 
       | Anyone here who wants to convert PDF documents or scanned images
       | as-is, preserving the layout, do try LLMWhisperer -
       | https://unstract.com/llmwhisperer/
        
       | starkparker wrote:
       | I index a lot of tabletop RPG books in PDF format, which often
       | have complex visual layouts and many tables that parsers
       | typically have difficulty with. If this is just a wrapper around
       | PDFMiner, as noted in another comment, I don't see any value
       | added by this tool.
       | 
       | This handles them... fine. It either doesn't recognize or never
       | attempts to handle tables, which makes it fundamentally a non-
       | starter for my typical usage, but to its credit it seems to have
       | at least some sense of table cells; it organizes columns in a
       | manner that isn't fully readable but isn't as broken as some
       | other solutions, either.
       | 
       | It otherwise handles text that's in variable-width columns or
       | wrapped in complex ways around art work rather well. It inserts
       | extraneous spaces on fully justified text, which is frustrating
       | but not unusual, and sometimes adds extraneous line breaks on
       | mid-sentence column breaks.
       | 
       | The biggest miss, though, is how it completely misses headings!
       | This seems fundamental for any use case, including grooming
       | sources for LLM training. It doesn't identify a single heading in
       | any PDF I've thrown at it so far.
        
       | hks0 wrote:
       | This is amazing and really useful, love the idea; but let me tell
       | you a story, it's a bit of a tangent but relevant enough:
       | 
       | In an online language class we were sending the assignments to
       | our teacher via slack, the teacher would then mark our mistakes
       | and send it back.
       | 
       | I, as a true hater of all the heavyweight text formats for
       | everyday communications, autonomously fired up the terminal,
       | wrote my assignment in my_name.md and happily sent it without
       | giving it any thought. This is what I hear the next session:
       | 
       | "... and everybody did a great job! Although someone just sent me
       | their assignment in a stupid format. I don't know what it was! I
       | could neither highlight it or make the text bold or anything.
       | Don't do that to me again please".
       | 
       | Before that I never dreamed of meeting someone who preferred a
       | word document _after_ opening a .md file, and I also learned if I
       | had chosen product design as a career, everyone would've suffered
       | immensely (or maybe not, I would've just ended up jobless).
        
         | EasyMark wrote:
         | If you are talking about an online language class as in "I'm
         | learning Yiddish" then I don't understand why it would be
         | confusing that someone who isn't a coder or writer (and that's
         | a big if) doesn't know what the heck markdown is and hence
         | wouldn't want to deal with it, since they're used to MS Word or
         | another word processor app. That's probably like 95% of the
         | population at least.
        
           | hks0 wrote:
           | It doesn't confuse anyone, quite the opposite. The irony for
           | me was my own isolation with the non-tech folks.
        
         | powersnail wrote:
         | > Before that I never dreamed of meeting someone who preferred
         | a word document _after_ opening a .md file
         | 
         | That's like 90% of the people I know outside of
         | computer/engineering circles. Most people probably have never
         | opened a plaintext file in their life. They would have no idea
         | what to do with a `.md` file.
         | 
         | In fact, some older engineers would not know what markdown is
         | either, since it's only been around for two decades or so, but
         | they can probably work with it anyway (the strength of plain
         | text).
        
           | hks0 wrote:
           | Exactly! Hence the "please don't try product design role"
           | advice for me. I seem to live in an all-engineers bubble.
        
             | zelphirkalt wrote:
             | Engineers are people too. Engineers use products as well.
             | Maybe you would have gone into a saner direction than most
             | products go.
        
       | yawnxyz wrote:
       | anyone get the Bing search DocumentConverter working? It keeps
       | getting me null results
        
       | sneak wrote:
       | I wish we had a markdown equivalent for spreadsheets. Markdown
       | tables ain't it.
        
         | acrophiliac wrote:
         | Your comment has me very curious what exactly you are looking
         | for in a "markdown equivalent" for spreadsheets. Do you want
         | Excel to be able to export the spreadsheet in a Markdown-like
         | format (including formulas, etc)? Or do you want to build the
         | spreadsheet in a text editor using Markdown++ syntax and then
         | use some GUI application to render it? Or do you simply want an
         | ASCII version of Excel that works in a terminal?
        
           | sneak wrote:
           | the second one.
           | 
           | I want a human editable plain text spreadsheet format. CSV
           | and TSV ain't it, not the least of which is because they
           | don't have formulae.
        
         | zelphirkalt wrote:
         | Org-mode. Emacs Org mode has tables with formulas, being able
         | to make use of many programming languages. By default Calc (I
         | believe GNU Calc) and Elisp. However using sbe you can make it
         | use code blocks written in any language that you have support
         | for using org-babel. For example I have time tracking
         | spreadsheets using source blocks of GNU Guile code for time
         | calculation.
         | 
         | Of course you can put that under version control easily, since
         | it is just a text document.
        
       | einpoklum wrote:
       | This is BS, it doesn't support Office documents, it supports only
       | Microsoft's broken office documents which don't obey their own
       | custom specs. Why doesn't this work on ODF files?
        
       | lbrunson wrote:
       | Are there any good libraries for the opposite, going from
       | markdown to pdf or docx? Pandoc gets most of the way there but
       | struggles with certain things like tables.
        
       | roamerz wrote:
       | Since it's Microsoft maybe it will do a half decent job on
       | Outlook HTML and .docx. I have evaluated most of them out there,
       | paid included and haven't found one that I thought was good
       | enough to run in production. Definitely will be giving this a
       | try.
        
       | be_erik wrote:
       | Oh thank god. I can finally retire my docx to pandoc to markdown
       | tool chain. I can't believe M$ was the big one to go first. Good
       | on ya.
        
       | toastal wrote:
       | So we convert from rich formats with metadata & advanced features
       | to a format without the former & severely lacking at the latter.
        
       | konfekt wrote:
       | Though it promises to convert everything to Markdown, it seems to
       | be a worse version of what the already existing tools such as
       | PDFtotext, docx2txt, pptx2md, ... collected [here] do without
       | even pretending to export to Markdown. Looking at its [source],
       | it indeed seems to be a wrapper to python variants of those.
       | Making the pool smaller can hardly improve the output.
       | 
       | [here] https://github.com/Konfekt/vim-office
       | [source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py
        
       | SuperHeavy256 wrote:
       | I don't think it works if you try installing it using pip. Can
       | anyone confirm? I ended up downloading it manually, making a
       | venv, and then running it.
        
       | ekianjo wrote:
       | any idea how it compares to Docling?
        
       | zelphirkalt wrote:
       | If the source document is anything half decent, this would serve
       | to lose information, as markdown is far from flexible and
       | powerful enough to represent all kinds of formatting and layout
       | present in source documents. If all you need is the text
       | information, then that might be just what you want, lossily
       | compressing documents.
        
       ___________________________________________________________________
       (page generated 2024-12-14 23:02 UTC)