[HN Gopher] PDF to Text, a challenging problem
___________________________________________________________________
PDF to Text, a challenging problem
Author : ingve
Score : 204 points
Date : 2025-05-13 15:01 UTC (7 hours ago)
(HTM) web link (www.marginalia.nu)
(TXT) w3m dump (www.marginalia.nu)
| rad_gruchalski wrote:
| So many of these problems have been solved by mozilla pdf.js
| together with its viewer implementation:
| https://mozilla.github.io/pdf.js/.
| zzleeper wrote:
| Any sense on how PDF.js compares against other tools such as
| pdfminer?
| rad_gruchalski wrote:
| I don't know. I use pdf.js for everything PDF.
| favorited wrote:
| I did some very broad testing of several PDF text extraction
| tools recently, and PDF.js was one of the slowest.
|
| My use-case was specifically testing their performance as
| command-line tools, so that will skew the results to an
| extent. For example, PDFBox was very slow because you're
| paying the JVM startup cost with each invocation.
|
| Poppler's pdftotext utility and pdfminer.six were generally
| the fastest. Both produced serviceable plain-text versions of
| the PDFs, with minor differences in where they placed
| paragraph breaks.
|
| I also wrote a small program which extracted text using
| Chrome's PDFium, which also performed well, but building that
| project can be a nightmare unless you're Google. IBM's
| Docling project, which uses ML models, produced by far the
| best formatting, preserving much of the document's original
| structure - but it was, of course, enormously slower and more
| energy-hungry.
|
| Disclaimer: I was testing specific PDF files that are
| representative of the kind of documents my software produces.
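 |
 | (For reference, the library-level equivalent of what these CLI
 | tools do is roughly a one-liner; a minimal sketch with
 | pdfminer.six's high-level API, where the file name is a
 | stand-in:)
 |
 |   from pdfminer.high_level import extract_text
 |
 |   # extracts the whole document as plain text; paragraph breaks
 |   # are approximated from the layout
 |   text = extract_text("paper.pdf")
 |   print(text[:500])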
| egnehots wrote:
 | I don't think so, pdf.js is able to _render_ PDF content.
 |
 | Which is different from extracting "text". Text in a PDF can be
 | encoded in many ways: in an actual image, in shapes (think
 | segments, quadratic Bezier curves...), or in an XML format
 | (really easy to process).
 |
 | PDF viewers are able to render text the way a printer would,
 | processing commands to put pixels on the screen at the end.
 |
 | But often paragraphs, text layout, columns, and tables are lost
 | in the process. You can still see them when rendered: so close,
 | yet so far. That is why AI is quite strong at this task.
| lionkor wrote:
 | Correct me if I'm wrong, but pdf.js actually has a lot of
| methods to manipulate PDFs, no?
| rad_gruchalski wrote:
| Yes, pdf.js can do that: https://github.com/mozilla/pdf.js/
| blob/master/web/viewer.htm....
|
| The purpose of my original comment was to simply say:
| there's an existing implementation so if you're building a
| pdf file viewer/editor, and you need inspiration, have a
| look. One of the reasons why mozilla is doing this is to be
| a reference implementation. I'm not sure why people are
| upset with this. Though, I could have explained it better.
| rad_gruchalski wrote:
| You are wrong. Pdf.js can extract text and has all facilities
| required to render and extract formatting. The latest version
| can also edit PDF files. It's basically the same engine as
| the Firefox PDF viewer. Which also has a document outline,
| search, linking, print preview, scaling, scripting sandbox...
 | it does not simply "render" a file.
|
| Regarding tables, this here
| https://www.npmjs.com/package/pdf-table-extractor does a very
| good job at table interpretation and works on top of pdf.js.
|
| I also didn't say what works better or worse, neither do I go
| into PDF being good or bad.
|
 | I simply said that a ton of problems have been covered by
 | pdf.js already.
| iAMkenough wrote:
| A good PDF reader makes the problems easier to deal with, but
| does not solve the underlying issue.
|
| The PDF itself is still flawed, even if pdf.js interprets it
| perfectly, which is still a problem for non-pdf.js viewers and
| tasks where "viewing" isn't the primary goal.
| rad_gruchalski wrote:
| Yeah. What I'm saying: pdf.js seems to have some of these
| solved. All I'm suggesting is have a look at it. I get it
| that for some PDF is a broken format.
| bartread wrote:
| Yeah, getting text - even structured text - out of PDFs is no
| picnic. Scraping a table out of an HTML document is often
| straightforward even on sites that use the "everything's a <div>"
| (anti-)pattern, and especially on sites that use more
| semantically useful elements, like <table>.
|
| Not so PDFs.
|
| I'm far from an expert on the format, so maybe there is some
| semantic support in there, but I've seen plenty of PDFs where
 | tables are simply a loose assemblage of graphical and text
| elements that, only when rendered, are easily discernible as a
| table because they're positioned in such a way that they render
| as a table.
|
 | I've actually had decent luck extracting tabular data from PDFs
 | by converting the PDFs to HTML using the Poppler PDF utils, then
 | finding the expected table header, and then using the
 | x-coordinate of the HTML elements for each value within the table
 | to work out columns and extract values for each row.
 |
 | It's kind of grotty but it seems reliable for what I need.
 | Certainly much more so than going via formatted plaintext, which
 | has issues with inconsistent spacing, and the insertion of
 | newlines into the middle of rows.
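 |
 | (A minimal sketch of that x-coordinate approach, assuming
 | Poppler's pdftohtml is installed; file names and the column
 | bucketing are illustrative, and locating the table header is
 | left out:)
 |
 |   import subprocess
 |   import xml.etree.ElementTree as ET
 |   from collections import defaultdict
 |
 |   # writes report.xml with one <text top= left= width= height=>
 |   # element per text run
 |   subprocess.run(["pdftohtml", "-xml", "report.pdf", "report"],
 |                  check=True)
 |
 |   columns = defaultdict(list)   # bucketed left x-coord -> cells
 |   for node in ET.parse("report.xml").iter("text"):
 |       value = "".join(node.itertext()).strip()
 |       if value:
 |           # round so slightly misaligned cells share a column
 |           columns[round(int(node.attrib["left"]), -1)].append(value)
 |
 |   for x in sorted(columns):
 |       print(x, columns[x][:5])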
| j45 wrote:
| PDFs inherently are a markup / xml format, the standard is
| available to learn from.
|
| It's possible to create the same PDF in many, many, many ways.
|
| Some might lean towards exporting a layout containing text and
| graphics from a graphics suite.
|
| Others might lean towards exporting text and graphics from a
| word processor, which is words first.
|
| The lens of how the creating app deals with information is
| often something that has input on how the PDF is output.
|
| If you're looking for an off the shelf utility that is
| surprisingly decent at pulling structured data from PDFs, tools
| like cisdem have already solved enough of it for local users.
 | Lots of tools like this are out there; many promise structured
 | data support, but it needs to match what you're up to.
| layer8 wrote:
| > PDFs inherently are a markup / xml format
|
| This is false. PDFs are an object graph containing
| imperative-style drawing instructions (among many other
| things). There's a way to add structural information on top
| (akin to an HTML document structure), but that's completely
| optional and only serves as auxiliary metadata, it's not at
| the core of the PDF format.
| davidthewatson wrote:
| Thanks for your comment.
|
| Indeed. Therein lies the rub.
|
| Why?
|
| Because no matter the fact that I've spent several years of
| my latent career crawling and parsing and outputting PDF
 | data, I see now that pointing my LLM stack at a directory
| of *.pdf just makes the invisible encoding of the object
| graph visible. It's a skeptical science.
|
| The key transclusion may be to move from imperative to
| declarative tools or conditional to probabilistic tools, as
| many areas have in the last couple decades.
|
| I've been following John Sterling's ocaml work for a while
| on related topics and the ideas floating around have been a
| good influence on me in forests and their forester which I
| found resonant given my own experience:
|
| https://www.jonmsterling.com/index/index.xml
|
| https://github.com/jonsterling/forest
|
| I was gonna email john and ask whether it's still being
| worked on as I hope so, but I brought it up this morning as
| a way out of the noise that imperative programming PDF has
| been for a decade or more where turtles all the way down to
| the low-level root cause libraries mean that the high level
| imperative languages often display the exact same bugs
| despite significant differences as to what's being intended
| in the small on top of the stack vs the large on the bottom
| of the stack. It would help if "fitness for a particular
| purpose" decisions were thoughtful as to publishing and
| distribution but as the CFO likes to say, "Dave, that ship
| has already sailed." Sigh.
|
 | ¯\_(ツ)_/¯
| j45 wrote:
| I appreciate the clarification. Should have been more
| precise with my terminology.
|
| That being said, I think I'm talking about the forest of
| PDFs.
|
| When I said PDFs have a "markup-like structure," I was
| talking from my experience manually writing PDFs from
| scratch using Adobe's spec.
|
| PDFs definitely have a structured, hierarchical format with
| nested elements that looks a lot like markup languages
| conceptually.
|
| The objects have a structure comparable to DOM-like
| structures - there's clear parent-child relationships just
| like in markup languages. Working with tags like "<<" and
| ">>" feels similar to markup tags when hand coding them.
|
| This is an article that highlights what I have seen (much
| cleaner PDF code): "The Structure of a PDF File"
| (https://medium.com/@jberkenbilt/the-structure-of-a-pdf-
| file-...) which says:
|
| "There are several types of objects. If you are familiar
| with JSON, YAML, or the object model in any reasonably
| modern programming language, this will seem very familiar
| to you... A PDF object may have one of the following types:
| String, Number, Boolean, Null, Name, Array, Dictionary..."
|
| This structure with dictionaries in "<<" and ">>" and
| arrays in brackets really gave me markup vibes when coding
| to the spec (https://opensource.adobe.com/dc-acrobat-sdk-
| docs/pdfstandard...).
|
| While PDFs are an object graph with drawing instructions
| like you said, the structure itself looks a lot like markup
| formats.
|
| Might be just a difference in choosing to focus on the
| forest vs the trees.
|
| That hierarchical structure is why different PDF creation
| methods can make such varied document structures, which is
| exactly why text extraction is so tricky.
|
| Learning to hand code PDFs in many ways, lets you learn to
| read and unravel them a little differently, maybe even a
| bit easier.
| bartread wrote:
| > but that's completely optional and only serves as
| auxiliary metadata, it's not at the core of the PDF format.
|
| This is what I kind of suspected but, as I said in my
| original comment, I'm not an expert and for the PDFs I'm
| reading I didn't need to delve further because that
| metadata simply isn't in there (although, boy do I wish it
| was) so I needed to use a different approach. As soon as I
| realised what I had was purely presentation I knew it was
| going to be a bit grim.
| yxhuvud wrote:
 | My favorite is (official, governmental) documents that have one
 | set of text that is rendered, and a totally different set of
 | text that you get if you extract the text the normal way.
| hermitcrab wrote:
| I am hoping at some point to be able to extract tabular data
| from PDFs for my data wrangling software. If anyone knows of a
 | library that can extract tables from PDFs, can be integrated
| into a C++ app and is free or less than a few hundred $, please
| let me know!
| j45 wrote:
| Part of a problem being challenging is recognizing if it's new,
| or just new to us.
|
| We get to learn a lot when something is new to us.. at the same
| time the untouchable parts of PDF to Text are largely being
| solved with the help of LLMs.
|
 | I built a tool to extract information from PDFs a long time ago,
 | and the breakthrough was having no ego or attachment to any one
 | way of doing it.
 |
 | Different solutions and approaches offered different depth or
 | quality of results, and organizing them to work together, in
 | addition to anything I built myself, provided what was needed:
 | one place where more things work than not.
| xnx wrote:
| Weird that there's no mention of LLMs in this article even though
| the article is very recent. LLMs haven't solved every
| OCR/document data extraction problem, but they've dramatically
| improved the situation.
| j45 wrote:
 | LLMs are definitely helping approach some problems that
 | couldn't be approached to date.
| simonw wrote:
| I've had great results against PDFs from recent vision models.
| Gemini, OpenAI and Claude can all accept PDFs directly now and
| treat them as image input.
|
| For longer PDFs I've found that breaking them up into images
 | per page and treating each page separately works well - feeding
| a thousand page PDF to even a long context model like Gemini
| 2.5 Pro or Flash still isn't reliable enough that I trust it.
|
| As always though, the big challenge of using vision LLMs for
| OCR (or audio transcription) tasks is the risk of accidental
| instruction following - even more so if there's a risk of
| deliberately malicious instructions in the documents you are
| processing.
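 |
 | (A minimal sketch of the page-splitting step, assuming the
 | pdf2image package plus a local Poppler install; the actual
 | model call is left out:)
 |
 |   from pdf2image import convert_from_path
 |
 |   pages = convert_from_path("long-document.pdf", dpi=200)
 |   for i, page in enumerate(pages, start=1):
 |       # each page is a PIL image; send them to the model one at
 |       # a time rather than the whole thousand-page PDF at once
 |       page.save(f"page-{i:04d}.png")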
| marginalia_nu wrote:
| Author here: LLMs are definitely the new gold standard for
| smaller bodies of shorter documents.
|
| The article is in the context of an internet search engine, the
| corpus to be converted is of order 1 TB. Running that amount of
| data through an LLM would be extremely expensive, given the
| relatively marginal improvement in outcome.
| mediaman wrote:
| Corpus size doesn't mean much in the context of a PDF, given
| how variable that can be per page.
|
| I've found Google's Flash to cut my OCR costs by about 95+%
| compared to traditional commercial offerings that support
| structured data extraction, and I still get tables, headers,
| etc from each page. Still not perfect, but per page costs
 | were less than one tenth of a cent per page, and 100 GB
 | collections of PDFs ran to a few hundred dollars.
| noosphr wrote:
 | A PDF corpus with a size of 1 TB can mean anything from 10,000
 | really poorly scanned documents to 1,000,000,000 nicely
 | generated LaTeX PDFs. What matters is the number of
| documents, and the number of pages per document.
|
| For the first I can run a segmentation model + traditional
| OCR in a day or two for the cost of warming my office in
| winter. For the second you'd need a few hundred dollars and a
| cloud server.
|
| Feel free to reach out. I'd be happy to have a chat and do
| some pro-bono work for someone building a open source tool
| chain and index for the rest of us.
| constantinum wrote:
| True indeed, but there are a few problems -- hallucinations and
 | trusting the output (validation). More here
| https://unstract.com/blog/why-llms-struggle-with-unstructure...
| svat wrote:
| One thing I wish someone would write is something like the
| browser's developer tools ("inspect elements") for PDF -- it
| would be great to be able to "view source" a PDF's content
| streams (the BT ... ET operators that enclose text, each Tj
| operator for setting down text in the currently chosen font,
| etc), to see how every "pixel" of the PDF is being
| specified/generated. I know this goes against the current trend /
| state-of-the-art of using vision models to basically "see" the
| PDF like a human and "read" the text, but it would be really nice
| to be able to actually understand what a PDF file contains.
|
| There are a few tools that allow inspecting a PDF's contents
| (https://news.ycombinator.com/item?id=41379101) but they stop at
| the level of the PDF's objects, so entire content streams are
| single objects. For example, to use one of the PDFs mentioned in
| this post, the file https://bfi.uchicago.edu/wp-
| content/uploads/2022/06/BFI_WP_2... has, corresponding to page
| number 6 (PDF page 8), a content stream that starts like (some
 | newlines added by me):
 |
 |   0 g 0 G
 |   0 g 0 G
 |   BT
 |   /F19 10.9091 Tf
 |   88.936 709.041 Td
 |   [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
 |   -16.936 -21.922 Td
 |   [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
 |   0 -21.923 Td
|
| and it would be really cool to be able to see the above "source"
| and the rendered PDF side-by-side, hover over one to see the
| corresponding region of the other, etc, the way we can do for a
| HTML page.
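 |
 | (A rough sketch of at least dumping those operators from Python
 | with pikepdf -- the file name and page index here are assumptions
 | based on the example above, and it still lacks the live
 | hover/selection linkage I'm describing:)
 |
 |   import pikepdf
 |
 |   pdf = pikepdf.open("BFI_WP_2022.pdf")   # hypothetical local copy
 |   page = pdf.pages[7]                     # PDF page 8, zero-indexed
 |   for operands, operator in pikepdf.parse_content_stream(page):
 |       # one instruction per line, e.g. Tf [/F19, 10.9091], Td, TJ
 |       print(str(operator), [str(op) for op in operands])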
| whenc wrote:
| Try with cpdf (disclaimer, wrote it): cpdf
| -output-json -output-json-parse-content-streams in.pdf -o
| out.json
|
| Then you can play around with the JSON, and turn it back to PDF
| with cpdf -j out.json -o out.pdf
|
| No live back-and-forth though.
| svat wrote:
| The live back-and-forth is the main point of what I'm asking
| for -- I tried your cpdf (thanks for the mention; will add it
| to my list) and it too doesn't help; all it does is,
| somewhere 9000-odd lines into the JSON file, turn the part of
| the content stream corresponding to what I mentioned in the
 | earlier comment into:
 |
 |   [
 |     [ { "F": 0.0 }, "g" ],
 |     [ { "F": 0.0 }, "G" ],
 |     [ { "F": 0.0 }, "g" ],
 |     [ { "F": 0.0 }, "G" ],
 |     [ "BT" ],
 |     [ "/F19", { "F": 10.9091 }, "Tf" ],
 |     [ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
 |     [ [ "Subsequen", { "F": 28.0 }, "t", { "F": -374.0 },
 |         "to", { "F": -373.0 }, "the", { "F": -373.0 },
 |         "p", { "F": -28.0 }, "erio", { "F": -28.0 },
 |         "d", { "F": -373.0 }, "analyzed", { "F": -373.0 },
 |         "in", { "F": -374.0 }, "our", { "F": -373.0 },
 |         "study", { "F": 83.0 }, ",", { "F": -383.0 },
 |         "Bridge's", { "F": -373.0 }, "paren", { "F": 27.0 },
 |         "t", { "F": -373.0 }, "compan", { "F": 28.0 },
 |         "y", { "F": -373.0 }, "Ne", { "F": -1.0 },
 |         "wGlob", { "F": -27.0 }, "e", { "F": -374.0 },
 |         "reduced" ], "TJ" ],
 |     [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
|
| This is just a more verbose restatement of what's in the PDF
| file; the real questions I'm asking are:
|
| - How can a user get to this part, from viewing the PDF file?
| (Note that the PDF page objects are not necessarily a flat
| list; they are often nested at different levels of "kids".)
|
| - How can a user understand these instructions, and "see" how
| they correspond to what is visually displayed on the PDF
| file?
| IIAOPSW wrote:
| This might actually be something very valuable to me.
|
| I have a bunch of documents right now that are annual
| statutory and financial disclosures of a large institute, and
| they are just barely differently organized from each year to
| the next to make it too tedious to cross compare them
| manually. I've been looking around for a tool that could
| break out the content and let me reorder it so that the same
| section is on the same page for every report.
|
| This might be it.
| dleeftink wrote:
| Have a look at this notebook[0], not exactly what you're
| looking for but does provide a 'live' inspector of the various
| drawing operations contained in a PDF.
|
| [0]: https://observablehq.com/@player1537/pdf-utilities
| svat wrote:
| Thanks, but I was not able to figure out how to get any use
| out of the notebook above. In what sense is it a 'live'
| inspector? All it seems to do is to just decompose the PDF
| into separate "ops" and "args" arrays (neither of which is
| meaningful without the other), but it does not seem "live" in
| any sense -- how can one find the ops (and args)
| corresponding to a region of the PDF page, or vice-versa?
| dleeftink wrote:
| You can load up your own PDF and select a page up front
| after which it will display the opcodes for this page.
| Operations are not structurally grouped, but decomposed in
| three aligned arrays which can be grouped to your liking
| based on opcode or used as coordinates for intersection
| queries (e.g. combining the ops and args arrays).
|
| The 'liveness' here is that you can derive multiple
| downstream cells (e.g. filters, groupings, drawing
| instructions) from the initial parsed PDF, which will
| update as you swap out the PDF file.
| kccqzy wrote:
 | When you use PDF.js from Mozilla to render a PDF file in the DOM, I
| think you might actually get something pretty close. For
| example I suppose each Tj becomes a <span> and each TJ becomes
| a collection of <span>s. (I'm fairly certain it doesn't use
| <canvas>.) And I suppose it must be very faithful to the
| original document to make it work.
| chaps wrote:
| Indeed! I've used it to parse documents I've received through
| FOIA -- sometimes it's just easier to write beautifulsoup
| code compared to having to deal with PDF's oddities.
| wrs wrote:
| Since these are statistical classification problems, it seems
| like it would be worth trying some old-school machine learning
| (not an LLM, just an NN) to see how it compares with these manual
| heuristics.
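 |
 | (A toy sketch of the idea -- a linear classifier stands in for
 | the NN here, and the per-line features and training rows are
 | entirely made up, just to show the shape of the approach:)
 |
 |   import numpy as np
 |   from sklearn.linear_model import LogisticRegression
 |
 |   # per-line features: [font size, is bold, word count]
 |   X = np.array([[18, 1, 3], [10, 0, 24], [14, 1, 5],
 |                 [10, 0, 30], [16, 1, 2], [10, 0, 18]])
 |   y = np.array([1, 0, 1, 0, 1, 0])   # 1 = heading, 0 = body text
 |
 |   clf = LogisticRegression().fit(X, y)
 |   print(clf.predict([[15, 1, 4], [10, 0, 27]]))   # expect [1 0]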
| marginalia_nu wrote:
| I imagine that would work pretty well given an adequate and
| representative body of annotated sample data. Though that is
| also not easy to come by.
| ted_dunning wrote:
| Actually, it is easy to come up with reasonably decent
| heuristics that can auto-tag a corpus. From that you can look
| for anomalies and adjust your tagging system.
|
| The problem of getting a representative body is
| (surprisingly) much harder than the annotation. I know. I
| spent quite some time years ago doing this.
| andrethegiant wrote:
| Cloudflare's ai.toMarkdown() function available in Workers AI can
 | handle PDFs pretty easily. Judging from speed alone, it seems
 | they're parsing the actual content rather than shoving it into
 | OCR/LLM.
|
| Shameless plug: I use this under the hood when you prefix any PDF
| URL with https://pure.md/ to convert to raw text.
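 |
 | (Usage is just the URL prefix, e.g. from Python -- the PDF URL
 | below is a placeholder:)
 |
 |   import requests
 |
 |   pdf_url = "https://example.com/report.pdf"
 |   text = requests.get("https://pure.md/" + pdf_url).text
 |   print(text[:500])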
| burkaman wrote:
| If you're looking for test cases, this is the first thing I
| tried and the result is very bad:
| https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...
| andrethegiant wrote:
| Apart from lacking newlines, how is the result bad? It
| extracts the text for easy piping into an LLM.
| burkaman wrote:
| - Most of the titles have incorrectly split words, for
| example "P ART 2--R EPEAL OF EPA R ULE R ELATING TO M ULTI
| -P OLLUTANT E MISSION S TANDARDS". I know LLMs are
| resilient against typos and mistakes like this, but it
| still seems not ideal.
|
| - The header is parsed in a way that I suspect would
| mislead an LLM: "BRETT GUTHRIE, KENTUCKY FRANK PALLONE,
| JR., NEW JERSEY CHAIRMAN RANKING MEMBER ONE HUNDRED
| NINETEENTH CONGRESS". Guthrie is the chairman and Pallone
| is the ranking member, but that isn't implied in the text.
| In this particular case an LLM might already know that from
| other sources, but in more obscure contexts it will just
| have to rely on the parsed text.
|
| - It isn't converted into Markdown at all, the structure is
| completely lost. If you only care about text then I guess
| that's fine, and in this case an LLM might do an ok job at
| identifying some of the headers, but in the context of this
| discussion I think ai.toMarkdown() did a bad job of
| converting to Markdown and a just ok job of converting to
| text.
|
| I would have considered this a fairly easy test case, so it
| would make me hesitant to trust that function for general
| use if I were trying to solve the challenges described in
| the submitted article (Identifying headings, Joining
| consecutive headings, Identifying Paragraphs).
|
| I see that you are trying to minimize tokens for LLM input,
| so I realize your goals are probably not the same as what
| I'm talking about.
|
| Edit: Another test case, it seems to crash on any Arxiv
| PDF. Example:
| https://pure.md/https://arxiv.org/pdf/2411.12104.
| andrethegiant wrote:
| > it seems to crash on any Arxiv PDF
|
| Fixed, thanks for reporting :-)
| marginalia_nu wrote:
| That PDF actually has some weird corner cases.
|
| First it's all the same font size everywhere, it's also got
| bolded "headings" with spaces that are not bolded. Had to fix
| my own handling to get it to process well.
|
| This is the search engine's view of the document as of those
| fixes: https://www.marginalia.nu/junk/congress.html
|
| Still far from perfect...
| mdaniel wrote:
| > That PDF actually has some weird corner cases.
|
| Heh, in my experience with PDFs that's a tautology
| _boffin_ wrote:
| You're aware that PDFs are containers that can hold various
| formats, which can be interlaced in different ways, such as on
| top, throughout, or in unexpected and unspecified ways that
| aren't "parsable," right?
|
| I would wager that they're using OCR/LLM in their pipeline.
| andrethegiant wrote:
| Could be. But their pricing for the conversion is free, which
| leads me to believe LLMs are not involved.
| cpursley wrote:
 | How does their function do on complex data tables, charts and
 | that sort of stuff?
| bambax wrote:
 | It doesn't seem to handle multi-column PDFs well?
| bob1029 wrote:
| When accommodating the general case, solving PDF-to-text is
| approximately equivalent to solving JPEG-to-text.
|
| The only PDF parsing scenario I would consider putting my name on
| is scraping AcroForm field values from standardized documents.
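 |
 | (A minimal sketch of that AcroForm scraping with pypdf -- the
 | form file name is a stand-in:)
 |
 |   from pypdf import PdfReader
 |
 |   reader = PdfReader("application.pdf")
 |   for name, field in (reader.get_fields() or {}).items():
 |       print(name, "=", field.get("/V"))   # /V holds the value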
| kapitalx wrote:
 | This is approximately the approach we're also taking at
 | https://doctly.ai. Add to that a "multiple experts" approach
 | for analyzing the image (for our 'ultra' version), and we get
 | really good results. And we're making it better constantly.
| layer8 wrote:
| If you assume standardized documents, you can impose the use of
| Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/
| dwheeler wrote:
| The better solution is to embed, in the PDF, the editable source
| document. This is easily done by LibreOffice. Embedding it takes
| very little space in general (because it compresses well), and
| then you have MUCH better information on what the text is and its
| meaning. It works just fine with existing PDF readers.
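 |
 | (LibreOffice does this natively when exporting; a rough sketch
 | of attaching a source file after the fact with pypdf -- file
 | names are placeholders, and this is a plain embedded file
 | rather than LibreOffice's hybrid-PDF mechanism:)
 |
 |   from pypdf import PdfReader, PdfWriter
 |
 |   writer = PdfWriter()
 |   writer.append(PdfReader("report.pdf"))    # keep rendered pages
 |
 |   with open("report.odt", "rb") as src:     # the editable source
 |       writer.add_attachment("report.odt", src.read())
 |
 |   with open("report-with-source.pdf", "wb") as out:
 |       writer.write(out)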
| layer8 wrote:
| That's true, but it also opens up the vulnerability of the
| source document being arbitrarily different from the rendered
| PDF content.
| kerkeslager wrote:
| That's true, but it's dependent on the creator of the PDF
| having aligned incentives with the consumer of the PDF.
|
| In the e-Discovery field, it's commonplace for those providing
| evidence to dump it into a PDF purely so that it's harder for
| the opposing side's lawyers to consume. If both sides have lots
| of money this isn't a barrier, but for example public defenders
| don't have funds to hire someone (me!) to process the PDFs into
| a readable format, so realistically they end up taking much
| longer to process the data, which takes a psychological toll on
| the defendant. And that's if they process the data at all.
|
| The solution is to make it illegal to do this: wiretap data,
| for example, should be provided in a standardized machine-
| readable format. There's no ethical reason for simple technical
| friction to be affecting the outcomes of criminal proceedings.
| giovannibonetti wrote:
| I wonder if AI will solve that
| GaggiX wrote:
 | There are specialized models, but even generic ones like
 | Gemini 2.0 Flash are really good and cheap; you can use
 | them and embed the OCR inside the PDF to index the
 | original content.
| kerkeslager wrote:
| This fundamentally misunderstands the problem. Effective
| OCR predates the popularity of ChatGPT and e-Discovery
| folks were already using it--AI in the modern sense adds
| nothing to this. Indexing the resulting text was also
| already possible--again AI adds nothing. The problem is
| that the resultant text lacks structure: being able to
| sort/filter wiretap data by date/location, for example,
| isn't inherently possible because you've obtained text or
| indexed it. AI accuracy simply isn't high enough to solve
| this problem without specialized training--off the shelf
| models simply won't work accurately enough even if you
| can get around the legal problems of feeding potentially-
| sensitive information into a model. AI models trained on
| a large enough domain-specific dataset might work, but
| the existing off-the-shelf models certainly are not
| accurate enough. And there are a lot of subdomains--
| wiretap data, cell phone GPS data, credit card data,
| email metadata, etc., which would each require model
| training.
|
| Fundamentally, the solution to this problem is to not
| create it in the first place. There's no reason for there
| to be a structured data -> PDF -> AI -> structured data
| pipeline when we can just force people providing evidence
| to provide the structured data.
| carabiner wrote:
| I bet 90% of the problem space is legacy PDFs. My company has
| thousands of these. Some are crappy scans. Some have Adobe's
| OCR embedded, but most have none at all.
| lelandfe wrote:
| The better solution to a search engine extracting text from
| existing PDFs is to provide advice on how to author PDFs?
|
 | What's the timeline for this solution to pay off?
| chaps wrote:
| Microsoft is one of the bigger contributors to this. Like --
 | why does Excel have a feature to export to PDF, but not a
| feature to do the opposite? That export functionality really
| feels like it was given to a summer intern who finished it in
| two weeks and never had to deal with it ever again.
| mattigames wrote:
 | Because then we would have 2 formats: "PDFs generated by
 | Excel" and "real PDFs" with the same extension, and that
 | would be its own can of worms for Microsoft and for
 | everyone else.
| yxhuvud wrote:
| Sure, and if you have access to the source document the pdf was
| generated from, then that is a good thing to do.
|
| But generally speaking, you don't have that control.
| Obscurity4340 wrote:
| Is this what GoodPDF does?
| reify wrote:
| https://github.com/jalan/pdftotext
|
| pdftotext -layout input.pdf output.txt
|
| pip install pdftotext
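 |
 | The same library from Python (a short sketch; physical=True is
 | my assumption for the -layout equivalent):
 |
 |   import pdftotext
 |
 |   with open("input.pdf", "rb") as f:
 |       pdf = pdftotext.PDF(f, physical=True)
 |   print("\n\n".join(pdf))   # one string per page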
| EmilStenstrom wrote:
| I think using Gemma3 in vision mode could be a good use-case for
| converting PDF to text. It's downloadable and runnable on a local
| computer, with decent memory requirements depending on which size
| you pick. Did anyone try it?
| CaptainFever wrote:
| Kind of unrelated, but Gemma 3's weights are unfree, so perhaps
| LLaVA (https://ollama.com/library/llava) would be a good
| alternative.
| ljlolel wrote:
 | Mistral OCR has best-in-class document understanding.
| https://mistral.ai/news/mistral-ocr
| ted_dunning wrote:
| One of my favorite documents for highlighting the challenges
| described here is the PDF for this article:
|
| https://academic.oup.com/auk/article/126/4/717/5148354
|
| The first page is classic with two columns of text, centered
| headings, a text inclusion that sits between the columns and
| changes the line lengths and indentations for the columns. Then
| we get the fun of page headers that change between odd and even
| pages and section header conventions that vary drastically.
|
 | Oh... to make things even better, paragraphs don't always get
 | extra spacing and don't always have an indented first line.
|
| Some of everything.
| JKCalhoun wrote:
| The API in CoreGraphics (MacOS) for PDF, at a basic level,
| simply presented the text, per page, in the order in which it
| was encoded in the dictionaries. And 95% of the time this was
| pretty good -- and when working with PDFKit and Preview on the
| Mac, we got by with it for years.
|
| If you stepped back you could imagine the app that originally
| had captured/produced the PDF -- perhaps a word processor -- it
| was likely rendering the text into the PDF context in some
 | reasonable order from its own text buffer(s). So even for two
| columns, you rather expect, and often found, that the text
| _flowed_ correctly from the left column to the right. The text
| was therefore already in the correct order within the PDF
| document.
|
| Now, footers, headers on the page -- that would be anyone's
| guess as to what order the PDF-producing app dumped those into
| the PDF context.
| devrandoom wrote:
| I currently use ocrmypdf for my private library. Then Recoll to
| index and search. Is there a better solution I'm missing?
| constantinum wrote:
| PDF parsing is hell indeed, with all sorts of edge cases that
 | break business workflows; more on that here
| https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
| gibsonf1 wrote:
 | We[1] create "Units of Thought" from PDFs and then work with
 | those for further discovery, where a "Unit of Thought" is any
 | paragraph, title, note heading - something that stands on its own
 | semantically. We then create a hierarchy of objects from that PDF
 | in the database for search and conceptual search - all at scale.
|
| [1] https://graphmetrix.com/trinpod-server https://trinapp.com
| IIAOPSW wrote:
| I'm tempted to try it. My use case right now is a set of
| documents which are annual financial and statutory disclosures
| of a large institution. Every year they are formatted /
| organized slightly differently which makes it enormously
| tedious to manually find and compare the same basic section
| from one year to another, but they are consistent enough to
| recognize analogous sections from different years due to often
| reusing verbatim quotes or highly specific key words each time.
|
| What I really want to do is take all these docs and just
| reorder all the content such that I can look at page n (or
| section whatever) scrolling down and compare it between
| different years by scrolling horizontally. Ideally with changes
| from one year to the next highlighted.
|
| Can your product do this?
| kbyatnal wrote:
| "PDF to Text" is a bit simplified IMO. There's actually a few
| class of problems within this category:
|
| 1. reliable OCR from documents (to index for search, feed into a
| vector DB, etc)
|
| 2. structured data extraction (pull out targeted values)
|
| 3. end-to-end document pipelines (e.g. automate mortgage
| applications)
|
| Marginalia needs to solve problem #1 (OCR), which is luckily
| getting commoditized by the day thanks to models like Gemini
| Flash. I've now seen multiple companies replace their OCR
| pipelines with Flash for a fraction of the cost of previous
| solutions, it's really quite remarkable.
|
| Problems #2 and #3 are much more tricky. There's still a large
| gap for businesses in going from raw OCR outputs --> document
| pipelines deployed in prod for mission-critical use cases. LLMs
| and VLMs aren't magic, and anyone who goes in expecting 100%
| automation is in for a surprise.
|
| You still need to build and label datasets, orchestrate pipelines
| (classify -> split -> extract), detect uncertainty and correct
| with human-in-the-loop, fine-tune, and a lot more. You can
| certainly get close to full automation over time, but it's going
| to take time and effort. The future is definitely moving in this
| direction though.
|
| Disclaimer: I started a LLM doc processing company to help
| companies solve problems in this space (https://extend.ai)
| varunneal wrote:
| I've been hacking away at trying to process PDFs into Markdown,
| having encountered similar obstacles to OP regarding header
| detection (and many other issues). OCR is fantastic these days
| but maintaining a global structure to the document is much
| trickier. Consistent HTML seems still out of reach for large
| documents. I'm having half-decent results with Markdown using
| multiple passes of an LLM to extract document structure and
 | feeding it in contextually for page-by-page extraction.
| miki123211 wrote:
| There's also #4, reliable OCR and semantics extraction that
| works across many diverse classes of documents, which is
| relevant for accessibility.
|
| This is hard because:
|
| 1. Unlike a business workflow which often only deals with a few
| specific kinds of documents, you never know what the user is
| going to get. You're making an abstract PDF reader, not an app
| that can process court documents in bankruptcy cases in
| Delaware.
|
| 2. You don't just need the text (like in traditional OCR), you
| need to recognize tables, page headers and footers, footnotes,
| headings, mathematics etc.
|
| 3. Because this is for human consumption, you want to minimize
| errors as much as possible, which means _not_ using OCR when
| not needed, and relying on the underlying text embedded within
| the PDF while still extracting semantics. This means you
| essentially need two different paths, when the PDF only
| consists of images and when there are content streams you can
| get some information from.
|
| 3.1. But the content streams may contain different text from
| what's actually on the page, e.g. white-on-white text to hide
| information the user isn't supposed to see, or diacritics
| emulation with commands that manually draw acute accents
| instead of using proper unicode diacritics (LaTeX works that
| way).
|
| 4. You're likely running as a local app on the user's (possibly
| very underpowered) device, and likely don't have an associated
| server and subscription, so you can't use any cloud AI models.
|
| 5. You need to support forms. Since the user is using
| accessibility software, presumably they can't print and use a
| pen, so you need to handle the ones meant for printing too, not
| just the nice, spec-compatible ones.
|
| This is very much an open problem and is not even remotely
| close to being solved. People have been taking stabs at it for
| years, but all current solutions suck in some way, and there's
| no single one that solves all 5 points correctly.
| noosphr wrote:
| >replace their OCR pipelines with Flash for a fraction of the
| cost of previous solutions, it's really quite remarkable.
|
| As someone who had to build custom tools because VLMs are so
| unreliable: anyone that uses VLMs for unprocessed images is in
| for more pain than all the providers which let LLMs without
| guard rails interact directly with consumers.
|
| They are very good at image labeling. They are ok at very
| simple documents, e.g. single column text, centered single
| level of headings, one image or table per page, etc. (which is
| what all the MVP demos show). They need another trillion
| parameters to become bad at complex documents with tables and
| images.
|
| Right now they hallucinate so badly that you simply _can't_ use
| them for something as simple as a table with a heading at the
| top, data in the middle and a summary at the bottom.
| anonu wrote:
 | They should have called it NDF - Non-Portable Document Format.
| dobraczekolada wrote:
| Reminds me of github.com/docwire/docwire
| 90s_dev wrote:
| Have any of you ever thought to yourself, this is new and
| interesting, and then vaguely remembered that you spent months or
| years becoming an expert at it earlier in life but entirely
| forgot it? And in fact large chunks of the very interesting
| things you've done just completely flew out of your mind long
| ago, to the point where you feel absolutely new at life, like
| you've accomplished relatively nothing, until something like this
| jars you out of that forgetfulness?
|
| I definitely vaguely remember doing some incredibly cool things
| with PDFs and OCR about 6 or 7 years ago. Some project comes to
| mind... google tells me it was "tesseract" and that sounds
| familiar.
| downboots wrote:
| No different than a fire ant whose leaf got knocked over by the
| wind and it moved on to the next.
| 90s_dev wrote:
| Well I sure do _feel_ different than a fire ant.
| downboots wrote:
| anttention is all we have
| 90s_dev wrote:
| Not true, I also have a nice cigar waiting for the rain
| to go away.
| 90s_dev wrote:
| Hmm, it's gone now. Well I used to have one anyway.
| korkybuchek wrote:
| Not that I'm privy to your mind, but it probably _was_
| tesseract (and this is my exact experience too...although for
| me it was about 12 years ago).
| bazzargh wrote:
| Back in... 2006ish? I got annoyed with being unable to copy
| text from multicolumn scientific papers on my iRex (an early
| ereader that was somewhat hackable) so dug a bit into why that
| was. Under the hood, the pdf reader used poppler, so I modified
| poppler to infer reading order in multicolumn documents using
 | algorithms that tesseract's author (Thomas Breuel) had
| published for OCR.
|
| It was a bit of a heuristic hack; it was 20 years ago but as I
| recall poppler's ancient API didn't really represent text runs
| in a way you'd want for an accessibility API. A version of the
| multicolumn select made it in but it was a pain to try to
| persuade poppler's maintainer that subsequent suggestions to
| improve performance were ok - because they used slightly
| different heuristics so had different text selections in some
| circumstances. There was no 'right' answer, so wanting the
| results to match didn't make sense.
|
| And that's how kpdf got multicolumn select, of a sort.
|
 | Using tesseract directly for this has probably made more sense
| for some years now.
| steeeeeve wrote:
| I too went down that rabbithole. Haha. Anything around that
| time to get an edge in a fantasy football league. I found a
| bunch of historical NFL stats pdfs and it took forever to
| make usable data out of them.
| pimlottc wrote:
| This is life. So many times I've finished a project and thought
| to myself: "Now I am an expert at doing this. Yet I probably
 | won't ever do this again." Because the next thing will be
 | completely in a different subject area and I'll start again
| from the basics.
| didericis wrote:
| I built an auto HQ solver with tesseract when HQ was blowing up
| over thanksgiving (HQ was the gameshow by the vine people with
| live hosts). I would take a screenshot of the app during a
| question, share it/send it to a little local api, do a google
| query for the question, see how many times each answer on the
| first page appeared in the results, then rank the answers by
| probability.
|
| Didn't work well/was a very naive way to search for answers
| (which is prob good/idk what kind of trouble I'd have gotten in
| if it let me or anyone else who used it win all the time), but
| it was fun to build.
| anon373839 wrote:
| Tesseract was the best open-source OCR for a long time. But I'd
| argue that docTR is better now, as it's more accurate out of
| the box and GPU accelerated. It implements a variety of
| different text detection and recognition model architectures
| that you can combine in a modular pipeline. And you can train
| or fine-tune in PyTorch or TensorFlow to get even better
| performance on your domain.
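 |
 | (The basic docTR usage, for anyone curious -- a minimal sketch,
 | with the file name as a placeholder:)
 |
 |   from doctr.io import DocumentFile
 |   from doctr.models import ocr_predictor
 |
 |   model = ocr_predictor(pretrained=True)   # detection + recognition
 |   doc = DocumentFile.from_pdf("scan.pdf")  # rasterizes the pages
 |   result = model(doc)
 |   print(result.render())                   # plain-text export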
| PeterStuer wrote:
 | I guess I'm lucky the PDFs I need to process are mostly rather
 | dull, unadventurous layouts. So far I've had great success using
| docling.
| keybored wrote:
| For people who want people to read their documents[1] they should
| have their PDF point to a more digital-friendly format, an alt
| document.
|
| _Looks like you've found my PDF. You might want this version
| instead:_
|
| PDFs are often subpar. Just see the first example: standard Latex
| serif section title. I mean, PDFs often aren't even well-typeset
| for what they are (dead-tree simulations).
|
| [1] No sarcasm or truism. Some may just want to submit a paper to
| whatever publisher and go through their whole laundry list of
 | what a paper ought to be. Wide dissemination is not the point.
| 1vuio0pswjnm7 wrote:
| Below is a PDF. It is a .txt file. I can save it with a .pdf
| extension and open it in a PDF viewer. I can make changes in a
| text editor. For example, by editing this text file, I can change
| the text displayed on the screen when the PDF is opened, the
| font, font size, line spacing, the maximum characters per line,
| number of lines per page, the paper width and height, as well as
 | portrait versus landscape mode.
 |
 |   %PDF-1.4
 |   1 0 obj
 |   << /CreationDate (D:2025) /Producer >>
 |   endobj
 |   2 0 obj
 |   << /Type /Catalog /Pages 3 0 R >>
 |   endobj
 |   4 0 obj
 |   << /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Times-Roman >>
 |   endobj
 |   5 0 obj
 |   << /Font << /F1 4 0 R >> /ProcSet [ /PDF /Text ] >>
 |   endobj
 |   6 0 obj
 |   << /Type /Page /Parent 3 0 R /Resources 5 0 R /Contents 7 0 R >>
 |   endobj
 |   7 0 obj
 |   << /Length 8 0 R >>
 |   stream
 |   BT
 |   /F1 50 Tf
 |   1 0 0 1 50 752 Tm
 |   54 TL
 |   (PDF is)'
 |   ((a) a text format)'
 |   ((b) a graphics format)'
 |   ((c) (a) and (b).)'
 |   ()'
 |   ET
 |   endstream
 |   endobj
 |   8 0 obj
 |   53
 |   endobj
 |   3 0 obj
 |   << /Type /Pages /Count 1 /MediaBox [ 0 0 612 792 ] /Kids [ 6 0 R ] >>
 |   endobj
 |   xref
 |   0 9
 |   0000000000 65535 f
 |   0000000009 00000 n
 |   0000000113 00000 n
 |   0000000514 00000 n
 |   0000000162 00000 n
 |   0000000240 00000 n
 |   0000000311 00000 n
 |   0000000391 00000 n
 |   0000000496 00000 n
 |   trailer
 |   << /Size 9 /Root 2 0 R /Info 1 0 R >>
 |   startxref
 |   599
 |   %%EOF
| swsieber wrote:
| It can also have embedded binary streams. It was not made for
| text. It was made for layout and graphics. You give nice
| examples, but each of those lines could have been broken up
| into one call per character, or per word, even out of order.
| 1vuio0pswjnm7 wrote:
| "PDF" is an acronym for for "Portable Document Format"
|
| "2.3.2 Portability
|
| A PDF file is a 7-bit ASCII file, which means PDF files use
| only the printable subset of the ASCII character set to
| describe documents even those with images and special
| characters. As a result, PDF files are extremely portable
| across diverse hardware and operating system environments."
|
| https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
| ljlolel wrote:
 | Mistral OCR has best-in-class document understanding.
|
| https://mistral.ai/news/mistral-ocr
| nicodjimenez wrote:
| Check out mathpix.com. We handle complex tables, complex math,
| diagrams, rotated tables, and much more, extremely accurately.
|
| Disclaimer: I'm the founder.
| bickfordb wrote:
| Maybe it's time for new document formats and browsers that neatly
| separate content, presentation and UI layers? PDF and HTML are
| 20+ years old and it's often difficult to extract information
 | from either, let alone author a browser.
| rrr_oh_man wrote:
| Yes, but I'm sure they're out there somewhere
|
| (https://xkcd.com/927/)
| TRiG_Ireland wrote:
| Open XML Paper Specification is an XML-based format intended
| to compete with PDF. Unlike PDF, it is purely static: no
| scripting.
|
| Also unlike PDF, I've never seen it actually used in the
| wild.
| smcleod wrote:
 | Definitely recommend docling for this:
 | https://docling-project.github.io/docling/
| elpalek wrote:
 | Recently tested (non-English) PDF OCR with Gemini 2.5 Pro.
 | First, I directly asked it to extract text from the PDF. Result:
 | a random text blob, not usable.
 |
 | Second, I converted the PDF into one JPG per page. Gemini
 | performed exceptionally well: near-perfect text extraction with
 | intact formatting in Markdown.
 |
 | Maybe there's an internal difference when processing PDF vs. JPG
 | inside the model.
| jagged-chisel wrote:
| Model isn't rendering the PDF probably, just looking in the
| file for text.
| noosphr wrote:
| I've worked on this in my day job: extracting _all_ relevant
| information from a financial services PDF for a bert based search
| engine.
|
| The only way to solve that is with a segmentation model followed
| by a regular OCR model and whatever other specialized models you
 | need to extract other types of data. VLMs aren't ready for prime
 | time and won't be for a decade or more.
|
| What worked was using doclaynet trained YOLO models to get the
| areas of the document that were text, images, tables or formulas:
| https://github.com/DS4SD/DocLayNet if you don't care about
| anything but text you can feed the results into tesseract
| directly (but for the love of god read the manual).
| Congratulations, you're done.
|
| Here's some pre-trained models that work OK out of the box:
| https://github.com/ppaanngggg/yolo-doclaynet I found that we
| needed to increase the resolution from ~700px to ~2100px
| horizontal for financial data segmentation.
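 |
 | (A rough sketch of that segmentation + OCR hand-off, assuming
 | one of those YOLO DocLayNet checkpoints plus pytesseract -- the
 | weights file name is a placeholder and the label filtering is
 | simplified:)
 |
 |   from PIL import Image
 |   import pytesseract
 |   from ultralytics import YOLO
 |
 |   model = YOLO("yolov8-doclaynet.pt")   # placeholder checkpoint
 |   page = Image.open("page-0001.png")    # one rasterized page
 |
 |   for box in model(page)[0].boxes:
 |       label = model.names[int(box.cls)]
 |       if label in ("Text", "Section-header", "Title"):
 |           x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
 |           crop = page.crop((x1, y1, x2, y2))
 |           print(label, pytesseract.image_to_string(crop))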
|
| VLMs on the other hand still choke on long text and hallucinate
 | unpredictably. Worse, they can't understand nested data. If you
 | give _any_ current model something as simple as three nested
 | rectangles with text under each, it will not extract the text
 | correctly. Given that nested rectangles describe every table, no
| VLM can currently extract data from anything but the most
| straightforward of tables. But it will happily lie to you that it
| did - after all a mining company should own a dozen bulldozers
| right? And if they each cost $35.000 it must be an amazing deal
| they got, right?
| Sharlin wrote:
| Some of the unsung heroes of the modern age are the programmers
| who, through what must have involved a lot of weeping and
| gnashing of teeth, have managed to implement the find, select,
| and copy operations in PDF readers.
| patrick41638265 wrote:
| Good old https://linux.die.net/man/1/pdftotext and a little
| Python on top of its output will get you a long way if your
| documents are not too crazy. I use it to parse all my bank
| statements into an sqlite database for analysis.
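 |
 | (A rough sketch of that pipeline -- the line regex is specific
 | to my statements' layout, so treat it as an assumption:)
 |
 |   import re, sqlite3, subprocess
 |
 |   text = subprocess.run(
 |       ["pdftotext", "-layout", "statement.pdf", "-"],
 |       capture_output=True, text=True, check=True).stdout
 |
 |   # date, description, amount at end of line
 |   row = re.compile(r"(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")
 |
 |   con = sqlite3.connect("statements.db")
 |   con.execute("CREATE TABLE IF NOT EXISTS tx"
 |               " (date TEXT, descr TEXT, amount REAL)")
 |   for line in text.splitlines():
 |       m = row.search(line.rstrip())
 |       if m:
 |           con.execute("INSERT INTO tx VALUES (?, ?, ?)",
 |                       (m.group(1), m.group(2),
 |                        float(m.group(3).replace(",", ""))))
 |   con.commit()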
| coolcase wrote:
| Tried extracting data from a newspaper. It is really hard. What
| is a headline and which headline belongs to which paragraphs?
| Harder than you think! And chucking it as is into OpenAI was no
| good at all. Manually dealing with coordinates from OCR was
| better but not perfect.
| rekoros wrote:
| I've been using Azure's "Document Intelligence" thingy (prebuilt
| "read" model) to extract text from PDFs with pretty good results
| [1]. Their terminology is so bad, it's easy to dismiss the whole
| thing for another Microsoft pile, but it actually, like, for
| real, works.
|
| [1] https://learn.microsoft.com/en-us/azure/ai-
| services/document...
___________________________________________________________________
(page generated 2025-05-13 23:00 UTC)