[HN Gopher] PDF to Text, a challenging problem
___________________________________________________________________
PDF to Text, a challenging problem
Author : ingve
Score : 204 points
Date : 2025-05-13 15:01 UTC (7 hours ago)
(HTM) web link (www.marginalia.nu)
(TXT) w3m dump (www.marginalia.nu)
| rad_gruchalski wrote:
| So many of these problems have been solved by mozilla pdf.js
| together with its viewer implementation:
| https://mozilla.github.io/pdf.js/.
| zzleeper wrote:
| Any sense on how PDF.js compares against other tools such as
| pdfminer?
| rad_gruchalski wrote:
| I don't know. I use pdf.js for everything PDF.
| favorited wrote:
| I did some very broad testing of several PDF text extraction
| tools recently, and PDF.js was one of the slowest.
|
| My use-case was specifically testing their performance as
| command-line tools, so that will skew the results to an
| extent. For example, PDFBox was very slow because you're
| paying the JVM startup cost with each invocation.
|
| Poppler's pdftotext utility and pdfminer.six were generally
| the fastest. Both produced serviceable plain-text versions of
| the PDFs, with minor differences in where they placed
| paragraph breaks.
|
| I also wrote a small program which extracted text using
| Chrome's PDFium, which also performed well, but building that
| project can be a nightmare unless you're Google. IBM's
| Docling project, which uses ML models, produced by far the
| best formatting, preserving much of the document's original
| structure - but it was, of course, enormously slower and more
| energy-hungry.
|
| Disclaimer: I was testing specific PDF files that are
| representative of the kind of documents my software produces.
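 |
 | (For reference, the library-level equivalent of what these CLI
 | tools do is roughly a one-liner; a minimal sketch with
 | pdfminer.six's high-level API, where the file name is a
 | stand-in:)
 |
 |   from pdfminer.high_level import extract_text
 |
 |   # extracts the whole document as plain text; paragraph breaks
 |   # are approximated from the layout
 |   text = extract_text("paper.pdf")
 |   print(text[:500])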
| egnehots wrote:
 | I don't think so, pdf.js is able to _render_ PDF content.
 |
 | Which is different from extracting "text". Text in a PDF can be
 | encoded in many ways: in an actual image, in shapes (think
 | segments, quadratic Bezier curves...), or in an XML format
 | (really easy to process).
 |
 | PDF viewers are able to render text the way a printer would,
 | processing commands to put pixels on the screen at the end.
 |
 | But often paragraphs, text layout, columns, and tables are lost
 | in the process. You can still see them when rendered: so close,
 | yet so far. That is why AI is quite strong at this task.
| lionkor wrote:
 | Correct me if I'm wrong, but pdf.js actually has a lot of
| methods to manipulate PDFs, no?
| rad_gruchalski wrote:
| Yes, pdf.js can do that: https://github.com/mozilla/pdf.js/
| blob/master/web/viewer.htm....
|
| The purpose of my original comment was to simply say:
| there's an existing implementation so if you're building a
| pdf file viewer/editor, and you need inspiration, have a
| look. One of the reasons why mozilla is doing this is to be
| a reference implementation. I'm not sure why people are
| upset with this. Though, I could have explained it better.
| rad_gruchalski wrote:
| You are wrong. Pdf.js can extract text and has all facilities
| required to render and extract formatting. The latest version
| can also edit PDF files. It's basically the same engine as
| the Firefox PDF viewer. Which also has a document outline,
| search, linking, print preview, scaling, scripting sandbox...
 | it does not simply "render" a file.
|
| Regarding tables, this here
| https://www.npmjs.com/package/pdf-table-extractor does a very
| good job at table interpretation and works on top of pdf.js.
|
| I also didn't say what works better or worse, neither do I go
| into PDF being good or bad.
|
 | I simply said that a ton of problems have been covered by
 | pdf.js already.
| iAMkenough wrote:
| A good PDF reader makes the problems easier to deal with, but
| does not solve the underlying issue.
|
| The PDF itself is still flawed, even if pdf.js interprets it
| perfectly, which is still a problem for non-pdf.js viewers and
| tasks where "viewing" isn't the primary goal.
| rad_gruchalski wrote:
| Yeah. What I'm saying: pdf.js seems to have some of these
| solved. All I'm suggesting is have a look at it. I get it
| that for some PDF is a broken format.
| bartread wrote:
| Yeah, getting text - even structured text - out of PDFs is no
| picnic. Scraping a table out of an HTML document is often
| straightforward even on sites that use the "everything's a <div>"
| (anti-)pattern, and especially on sites that use more
| semantically useful elements, like <table>.
|
| Not so PDFs.
|
| I'm far from an expert on the format, so maybe there is some
| semantic support in there, but I've seen plenty of PDFs where
 | tables are simply a loose assemblage of graphical and text
| elements that, only when rendered, are easily discernible as a
| table because they're positioned in such a way that they render
| as a table.
|
 | I've actually had decent luck extracting tabular data from PDFs
 | by converting the PDFs to HTML using the Poppler PDF utils, then
 | finding the expected table header, and then using the
 | x-coordinate of the HTML elements for each value within the table
 | to work out columns and extract values for each row.
 |
 | It's kind of grotty but it seems reliable for what I need.
 | Certainly much more so than going via formatted plaintext, which
 | has issues with inconsistent spacing, and the insertion of
 | newlines into the middle of rows.
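 |
 | (A minimal sketch of that x-coordinate approach, assuming
 | Poppler's pdftohtml is installed; file names and the column
 | bucketing are illustrative, and locating the table header is
 | left out:)
 |
 |   import subprocess
 |   import xml.etree.ElementTree as ET
 |   from collections import defaultdict
 |
 |   # writes report.xml with one <text top= left= width= height=>
 |   # element per text run
 |   subprocess.run(["pdftohtml", "-xml", "report.pdf", "report"],
 |                  check=True)
 |
 |   columns = defaultdict(list)   # bucketed left x-coord -> cells
 |   for node in ET.parse("report.xml").iter("text"):
 |       value = "".join(node.itertext()).strip()
 |       if value:
 |           # round so slightly misaligned cells share a column
 |           columns[round(int(node.attrib["left"]), -1)].append(value)
 |
 |   for x in sorted(columns):
 |       print(x, columns[x][:5])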
| j45 wrote:
| PDFs inherently are a markup / xml format, the standard is
| available to learn from.
|
| It's possible to create the same PDF in many, many, many ways.
|
| Some might lean towards exporting a layout containing text and
| graphics from a graphics suite.
|
| Others might lean towards exporting text and graphics from a
| word processor, which is words first.
|
| The lens of how the creating app deals with information is
| often something that has input on how the PDF is output.
|
| If you're looking for an off the shelf utility that is
| surprisingly decent at pulling structured data from PDFs, tools
| like cisdem have already solved enough of it for local users.
 | Lots of tools like this are out there; many promise structured
 | data support, but it needs to match what you're up to.
| layer8 wrote:
| > PDFs inherently are a markup / xml format
|
| This is false. PDFs are an object graph containing
| imperative-style drawing instructions (among many other
| things). There's a way to add structural information on top
| (akin to an HTML document structure), but that's completely
| optional and only serves as auxiliary metadata, it's not at
| the core of the PDF format.
| davidthewatson wrote:
| Thanks for your comment.
|
| Indeed. Therein lies the rub.
|
| Why?
|
| Because no matter the fact that I've spent several years of
| my latent career crawling and parsing and outputting PDF
 | data, I see now that pointing my LLM stack at a directory
| of *.pdf just makes the invisible encoding of the object
| graph visible. It's a skeptical science.
|
| The key transclusion may be to move from imperative to
| declarative tools or conditional to probabilistic tools, as
| many areas have in the last couple decades.
|
| I've been following John Sterling's ocaml work for a while
| on related topics and the ideas floating around have been a
| good influence on me in forests and their forester which I
| found resonant given my own experience:
|
| https://www.jonmsterling.com/index/index.xml
|
| https://github.com/jonsterling/forest
|
| I was gonna email john and ask whether it's still being
| worked on as I hope so, but I brought it up this morning as
| a way out of the noise that imperative programming PDF has
| been for a decade or more where turtles all the way down to
| the low-level root cause libraries mean that the high level
| imperative languages often display the exact same bugs
| despite significant differences as to what's being intended
| in the small on top of the stack vs the large on the bottom
| of the stack. It would help if "fitness for a particular
| purpose" decisions were thoughtful as to publishing and
| distribution but as the CFO likes to say, "Dave, that ship
| has already sailed." Sigh.
|
 | ¯\_(ツ)_/¯
| j45 wrote:
| I appreciate the clarification. Should have been more
| precise with my terminology.
|
| That being said, I think I'm talking about the forest of
| PDFs.
|
| When I said PDFs have a "markup-like structure," I was
| talking from my experience manually writing PDFs from
| scratch using Adobe's spec.
|
| PDFs definitely have a structured, hierarchical format with
| nested elements that looks a lot like markup languages
| conceptually.
|
| The objects have a structure comparable to DOM-like
| structures - there's clear parent-child relationships just
| like in markup languages. Working with tags like "<<" and
| ">>" feels similar to markup tags when hand coding them.
|
| This is an article that highlights what I have seen (much
| cleaner PDF code): "The Structure of a PDF File"
| (https://medium.com/@jberkenbilt/the-structure-of-a-pdf-
| file-...) which says:
|
| "There are several types of objects. If you are familiar
| with JSON, YAML, or the object model in any reasonably
| modern programming language, this will seem very familiar
| to you... A PDF object may have one of the following types:
| String, Number, Boolean, Null, Name, Array, Dictionary..."
|
| This structure with dictionaries in "<<" and ">>" and
| arrays in brackets really gave me markup vibes when coding
| to the spec (https://opensource.adobe.com/dc-acrobat-sdk-
| docs/pdfstandard...).
|
| While PDFs are an object graph with drawing instructions
| like you said, the structure itself looks a lot like markup
| formats.
|
| Might be just a difference in choosing to focus on the
| forest vs the trees.
|
| That hierarchical structure is why different PDF creation
| methods can make such varied document structures, which is
| exactly why text extraction is so tricky.
|
| Learning to hand code PDFs in many ways, lets you learn to
| read and unravel them a little differently, maybe even a
| bit easier.
| bartread wrote:
| > but that's completely optional and only serves as
| auxiliary metadata, it's not at the core of the PDF format.
|
| This is what I kind of suspected but, as I said in my
| original comment, I'm not an expert and for the PDFs I'm
| reading I didn't need to delve further because that
| metadata simply isn't in there (although, boy do I wish it
| was) so I needed to use a different approach. As soon as I
| realised what I had was purely presentation I knew it was
| going to be a bit grim.
| yxhuvud wrote:
 | My favorite is (official, governmental) documents that have one
 | set of text that is rendered, and a totally different set of
 | text that you get if you extract the text the normal way.
| hermitcrab wrote:
| I am hoping at some point to be able to extract tabular data
| from PDFs for my data wrangling software. If anyone knows of a
 | library that can extract tables from PDFs, can be integrated
| into a C++ app and is free or less than a few hundred $, please
| let me know!
| j45 wrote:
| Part of a problem being challenging is recognizing if it's new,
| or just new to us.
|
| We get to learn a lot when something is new to us.. at the same
| time the untouchable parts of PDF to Text are largely being
| solved with the help of LLMs.
|
 | I built a tool to extract information from PDFs a long time ago,
 | and the breakthrough was having no ego or attachment to any one
 | way of doing it.
 |
 | Different solutions and approaches offered different depth or
 | quality of results, and organizing them to work together, in
 | addition to anything I built myself, provided what was needed:
 | one place where more things work than not.
| xnx wrote:
| Weird that there's no mention of LLMs in this article even though
| the article is very recent. LLMs haven't solved every
| OCR/document data extraction problem, but they've dramatically
| improved the situation.
| j45 wrote:
 | LLMs are definitely helping approach some problems that
 | couldn't be approached to date.
| simonw wrote:
| I've had great results against PDFs from recent vision models.
| Gemini, OpenAI and Claude can all accept PDFs directly now and
| treat them as image input.
|
| For longer PDFs I've found that breaking them up into images
 | per page and treating each page separately works well - feeding
| a thousand page PDF to even a long context model like Gemini
| 2.5 Pro or Flash still isn't reliable enough that I trust it.
|
| As always though, the big challenge of using vision LLMs for
| OCR (or audio transcription) tasks is the risk of accidental
| instruction following - even more so if there's a risk of
| deliberately malicious instructions in the documents you are
| processing.
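 |
 | (A minimal sketch of the page-splitting step, assuming the
 | pdf2image package plus a local Poppler install; the actual
 | model call is left out:)
 |
 |   from pdf2image import convert_from_path
 |
 |   pages = convert_from_path("long-document.pdf", dpi=200)
 |   for i, page in enumerate(pages, start=1):
 |       # each page is a PIL image; send them to the model one at
 |       # a time rather than the whole thousand-page PDF at once
 |       page.save(f"page-{i:04d}.png")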
| marginalia_nu wrote:
| Author here: LLMs are definitely the new gold standard for
| smaller bodies of shorter documents.
|
| The article is in the context of an internet search engine, the
| corpus to be converted is of order 1 TB. Running that amount of
| data through an LLM would be extremely expensive, given the
| relatively marginal improvement in outcome.
| mediaman wrote:
| Corpus size doesn't mean much in the context of a PDF, given
| how variable that can be per page.
|
| I've found Google's Flash to cut my OCR costs by about 95+%
| compared to traditional commercial offerings that support
| structured data extraction, and I still get tables, headers,
| etc from each page. Still not perfect, but per page costs
 | were less than one tenth of a cent per page, and 100 GB
 | collections of PDFs ran to a few hundred dollars.
| noosphr wrote:
 | A PDF corpus with a size of 1 TB can mean anything from 10,000
 | really poorly scanned documents to 1,000,000,000 nicely
 | generated LaTeX PDFs. What matters is the number of
| documents, and the number of pages per document.
|
| For the first I can run a segmentation model + traditional
| OCR in a day or two for the cost of warming my office in
| winter. For the second you'd need a few hundred dollars and a
| cloud server.
|
| Feel free to reach out. I'd be happy to have a chat and do
| some pro-bono work for someone building a open source tool
| chain and index for the rest of us.
| constantinum wrote:
| True indeed, but there are a few problems -- hallucinations and
 | trusting the output (validation). More here
| https://unstract.com/blog/why-llms-struggle-with-unstructure...
| svat wrote:
| One thing I wish someone would write is something like the
| browser's developer tools ("inspect elements") for PDF -- it
| would be great to be able to "view source" a PDF's content
| streams (the BT ... ET operators that enclose text, each Tj
| operator for setting down text in the currently chosen font,
| etc), to see how every "pixel" of the PDF is being
| specified/generated. I know this goes against the current trend /
| state-of-the-art of using vision models to basically "see" the
| PDF like a human and "read" the text, but it would be really nice
| to be able to actually understand what a PDF file contains.
|
| There are a few tools that allow inspecting a PDF's contents
| (https://news.ycombinator.com/item?id=41379101) but they stop at
| the level of the PDF's objects, so entire content streams are
| single objects. For example, to use one of the PDFs mentioned in
| this post, the file https://bfi.uchicago.edu/wp-
| content/uploads/2022/06/BFI_WP_2... has, corresponding to page
| number 6 (PDF page 8), a content stream that starts like (some
 | newlines added by me):
 |
 |   0 g 0 G
 |   0 g 0 G
 |   BT
 |   /F19 10.9091 Tf
 |   88.936 709.041 Td
 |   [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
 |   -16.936 -21.922 Td
 |   [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
 |   0 -21.923 Td
|
| and it would be really cool to be able to see the above "source"
| and the rendered PDF side-by-side, hover over one to see the
| corresponding region of the other, etc, the way we can do for a
| HTML page.
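 |
 | (A rough sketch of at least dumping those operators from Python
 | with pikepdf -- the file name and page index here are assumptions
 | based on the example above, and it still lacks the live
 | hover/selection linkage I'm describing:)
 |
 |   import pikepdf
 |
 |   pdf = pikepdf.open("BFI_WP_2022.pdf")   # hypothetical local copy
 |   page = pdf.pages[7]                     # PDF page 8, zero-indexed
 |   for operands, operator in pikepdf.parse_content_stream(page):
 |       # one instruction per line, e.g. Tf [/F19, 10.9091], Td, TJ
 |       print(str(operator), [str(op) for op in operands])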
| whenc wrote:
| Try with cpdf (disclaimer, wrote it): cpdf
| -output-json -output-json-parse-content-streams in.pdf -o
| out.json
|
| Then you can play around with the JSON, and turn it back to PDF
| with cpdf -j out.json -o out.pdf
|
| No live back-and-forth though.
| svat wrote:
| The live back-and-forth is the main point of what I'm asking
| for -- I tried your cpdf (thanks for the mention; will add it
| to my list) and it too doesn't help; all it does is,
| somewhere 9000-odd lines into the JSON file, turn the part of
| the content stream corresponding to what I mentioned in the
 | earlier comment into:
 |
 |   [
 |     [ { "F": 0.0 }, "g" ],
 |     [ { "F": 0.0 }, "G" ],
 |     [ { "F": 0.0 }, "g" ],
 |     [ { "F": 0.0 }, "G" ],
 |     [ "BT" ],
 |     [ "/F19", { "F": 10.9091 }, "Tf" ],
 |     [ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
 |     [ [ "Subsequen", { "F": 28.0 }, "t", { "F": -374.0 },
 |         "to", { "F": -373.0 }, "the", { "F": -373.0 },
 |         "p", { "F": -28.0 }, "erio", { "F": -28.0 },
 |         "d", { "F": -373.0 }, "analyzed", { "F": -373.0 },
 |         "in", { "F": -374.0 }, "our", { "F": -373.0 },
 |         "study", { "F": 83.0 }, ",", { "F": -383.0 },
 |         "Bridge's", { "F": -373.0 }, "paren", { "F": 27.0 },
 |         "t", { "F": -373.0 }, "compan", { "F": 28.0 },
 |         "y", { "F": -373.0 }, "Ne", { "F": -1.0 },
 |         "wGlob", { "F": -27.0 }, "e", { "F": -374.0 },
 |         "reduced" ], "TJ" ],
 |     [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
|
| This is just a more verbose restatement of what's in the PDF
| file; the real questions I'm asking are:
|
| - How can a user get to this part, from viewing the PDF file?
| (Note that the PDF page objects are not necessarily a flat
| list; they are often nested at different levels of "kids".)
|
| - How can a user understand these instructions, and "see" how
| they correspond to what is visually displayed on the PDF
| file?
| IIAOPSW wrote:
| This might actually be something very valuable to me.
|
| I have a bunch of documents right now that are annual
| statutory and financial disclosures of a large institute, and
| they are just barely differently organized from each year to
| the next to make it too tedious to cross compare them
| manually. I've been looking around for a tool that could
| break out the content and let me reorder it so that the same
| section is on the same page for every report.
|
| This might be it.
| dleeftink wrote:
| Have a look at this notebook[0], not exactly what you're
| looking for but does provide a 'live' inspector of the various
| drawing operations contained in a PDF.
|
| [0]: https://observablehq.com/@player1537/pdf-utilities
| svat wrote:
| Thanks, but I was not able to figure out how to get any use
| out of the notebook above. In what sense is it a 'live'
| inspector? All it seems to do is to just decompose the PDF
| into separate "ops" and "args" arrays (neither of which is
| meaningful without the other), but it does not seem "live" in
| any sense -- how can one find the ops (and args)
| corresponding to a region of the PDF page, or vice-versa?
| dleeftink wrote:
| You can load up your own PDF and select a page up front
| after which it will display the opcodes for this page.
| Operations are not structurally grouped, but decomposed in
| three aligned arrays which can be grouped to your liking
| based on opcode or used as coordinates for intersection
| queries (e.g. combining the ops and args arrays).
|
| The 'liveness' here is that you can derive multiple
| downstream cells (e.g. filters, groupings, drawing
| instructions) from the initial parsed PDF, which will
| update as you swap out the PDF file.
| kccqzy wrote:
 | When you use PDF.js from Mozilla to render a PDF file in the DOM, I
| think you might actually get something pretty close. For
| example I suppose each Tj becomes a <span> and each TJ becomes
| a collection of <span>s. (I'm fairly certain it doesn't use
| <canvas>.) And I suppose it must be very faithful to the
| original document to make it work.
| chaps wrote:
| Indeed! I've used it to parse documents I've received through
| FOIA -- sometimes it's just easier to write beautifulsoup
| code compared to having to deal with PDF's oddities.
| wrs wrote:
| Since these are statistical classification problems, it seems
| like it would be worth trying some old-school machine learning
| (not an LLM, just an NN) to see how it compares with these manual
| heuristics.
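 |
 | (A toy sketch of the idea -- a linear classifier stands in for
 | the NN here, and the per-line features and training rows are
 | entirely made up, just to show the shape of the approach:)
 |
 |   import numpy as np
 |   from sklearn.linear_model import LogisticRegression
 |
 |   # per-line features: [font size, is bold, word count]
 |   X = np.array([[18, 1, 3], [10, 0, 24], [14, 1, 5],
 |                 [10, 0, 30], [16, 1, 2], [10, 0, 18]])
 |   y = np.array([1, 0, 1, 0, 1, 0])   # 1 = heading, 0 = body text
 |
 |   clf = LogisticRegression().fit(X, y)
 |   print(clf.predict([[15, 1, 4], [10, 0, 27]]))   # expect [1 0]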
| marginalia_nu wrote:
| I imagine that would work pretty well given an adequate and
| representative body of annotated sample data. Though that is
| also not easy to come by.
| ted_dunning wrote:
| Actually, it is easy to come up with reasonably decent
| heuristics that can auto-tag a corpus. From that you can look
| for anomalies and adjust your tagging system.
|
| The problem of getting a representative body is
| (surprisingly) much harder than the annotation. I know. I
| spent quite some time years ago doing this.
| andrethegiant wrote:
| Cloudflare's ai.toMarkdown() function available in Workers AI can
 | handle PDFs pretty easily. Judging from speed alone, it seems
 | they're parsing the actual content rather than shoving it into
 | OCR/LLM.
|
| Shameless plug: I use this under the hood when you prefix any PDF
| URL with https://pure.md/ to convert to raw text.
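 |
 | (Usage is just the URL prefix, e.g. from Python -- the PDF URL
 | below is a placeholder:)
 |
 |   import requests
 |
 |   pdf_url = "https://example.com/report.pdf"
 |   text = requests.get("https://pure.md/" + pdf_url).text
 |   print(text[:500])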
| burkaman wrote:
| If you're looking for test cases, this is the first thing I
| tried and the result is very bad:
| https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...
| andrethegiant wrote:
| Apart from lacking newlines, how is the result bad? It
| extracts the text for easy piping into an LLM.
| burkaman wrote:
| - Most of the titles have incorrectly split words, for
| example "P ART 2--R EPEAL OF EPA R ULE R ELATING TO M ULTI
| -P OLLUTANT E MISSION S TANDARDS". I know LLMs are
| resilient against typos and mistakes like this, but it
| still seems not ideal.
|
| - The header is parsed in a way that I suspect would
| mislead an LLM: "BRETT GUTHRIE, KENTUCKY FRANK PALLONE,
| JR., NEW JERSEY CHAIRMAN RANKING MEMBER ONE HUNDRED
| NINETEENTH CONGRESS". Guthrie is the chairman and Pallone
| is the ranking member, but that isn't implied in the text.
| In this particular case an LLM might already know that from
| other sources, but in more obscure contexts it will just
| have to rely on the parsed text.
|
| - It isn't converted into Markdown at all, the structure is
| completely lost. If you only care about text then I guess
| that's fine, and in this case an LLM might do an ok job at
| identifying some of the headers, but in the context of this
| discussion I think ai.toMarkdown() did a bad job of
| converting to Markdown and a just ok job of converting to
| text.
|
| I would have considered this a fairly easy test case, so it
| would make me hesitant to trust that function for general
| use if I were trying to solve the challenges described in
| the submitted article (Identifying headings, Joining
| consecutive headings, Identifying Paragraphs).
|
| I see that you are trying to minimize tokens for LLM input,
| so I realize your goals are probably not the same as what
| I'm talking about.
|
| Edit: Another test case, it seems to crash on any Arxiv
| PDF. Example:
| https://pure.md/https://arxiv.org/pdf/2411.12104.
| andrethegiant wrote:
| > it seems to crash on any Arxiv PDF
|
| Fixed, thanks for reporting :-)
| marginalia_nu wrote:
| That PDF actually has some weird corner cases.
|
| First it's all the same font size everywhere, it's also got
| bolded "headings" with spaces that are not bolded. Had to fix
| my own handling to get it to process well.
|
| This is the search engine's view of the document as of those
| fixes: https://www.marginalia.nu/junk/congress.html
|
| Still far from perfect...
| mdaniel wrote:
| > That PDF actually has some weird corner cases.
|
| Heh, in my experience with PDFs that's a tautology
| _boffin_ wrote:
| You're aware that PDFs are containers that can hold various
| formats, which can be interlaced in different ways, such as on
| top, throughout, or in unexpected and unspecified ways that
| aren't "parsable," right?
|
| I would wager that they're using OCR/LLM in their pipeline.
| andrethegiant wrote:
| Could be. But their pricing for the conversion is free, which
| leads me to believe LLMs are not involved.
| cpursley wrote:
 | How does their function do on complex data tables, charts and
 | that sort of stuff?
| bambax wrote:
 | It doesn't seem to handle multi-column PDFs well?
| bob1029 wrote:
| When accommodating the general case, solving PDF-to-text is
| approximately equivalent to solving JPEG-to-text.
|
| The only PDF parsing scenario I would consider putting my name on
| is scraping AcroForm field values from standardized documents.
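 |
 | (A minimal sketch of that AcroForm scraping with pypdf -- the
 | form file name is a stand-in:)
 |
 |   from pypdf import PdfReader
 |
 |   reader = PdfReader("application.pdf")
 |   for name, field in (reader.get_fields() or {}).items():
 |       print(name, "=", field.get("/V"))   # /V holds the value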
| kapitalx wrote:
 | This is approximately the approach we're also taking at
 | https://doctly.ai. Add to that a "multiple experts" approach
 | for analyzing the image (for our 'ultra' version), and we get
 | really good results. And we're making it better constantly.
| layer8 wrote:
| If you assume standardized documents, you can impose the use of
| Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/
| dwheeler wrote:
| The better solution is to embed, in the PDF, the editable source
| document. This is easily done by LibreOffice. Embedding it takes
| very little space in general (because it compresses well), and
| then you have MUCH better information on what the text is and its
| meaning. It works just fine with existing PDF readers.
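 |
 | (LibreOffice does this natively when exporting; a rough sketch
 | of attaching a source file after the fact with pypdf -- file
 | names are placeholders, and this is a plain embedded file
 | rather than LibreOffice's hybrid-PDF mechanism:)
 |
 |   from pypdf import PdfReader, PdfWriter
 |
 |   writer = PdfWriter()
 |   writer.append(PdfReader("report.pdf"))    # keep rendered pages
 |
 |   with open("report.odt", "rb") as src:     # the editable source
 |       writer.add_attachment("report.odt", src.read())
 |
 |   with open("report-with-source.pdf", "wb") as out:
 |       writer.write(out)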
| layer8 wrote:
| That's true, but it also opens up the vulnerability of the
| source document being arbitrarily different from the rendered
| PDF content.
| kerkeslager wrote:
| That's true, but it's dependent on the creator of the PDF
| having aligned incentives with the consumer of the PDF.
|
| In the e-Discovery field, it's commonplace for those providing
| evidence to dump it into a PDF purely so that it's harder for
| the opposing side's lawyers to consume. If both sides have lots
| of money this isn't a barrier, but for example public defenders
| don't have funds to hire someone (me!) to process the PDFs into
| a readable format, so realistically they end up taking much
| longer to process the data, which takes a psychological toll on
| the defendant. And that's if they process the data at all.
|
| The solution is to make it illegal to do this: wiretap data,
| for example, should be provided in a standardized machine-
| readable format. There's no ethical reason for simple technical
| friction to be affecting the outcomes of criminal proceedings.
| giovannibonetti wrote:
| I wonder if AI will solve that
| GaggiX wrote:
 | There are specialized models, but even generic ones like
 | Gemini 2.0 Flash are really good and cheap; you can use
 | them and embed the OCR inside the PDF to index the
 | original content.
| kerkeslager wrote:
| This fundamentally misunderstands the problem. Effective
| OCR predates the popularity of ChatGPT and e-Discovery
| folks were already using it--AI in the modern sense adds
| nothing to this. Indexing the resulting text was also
| already possible--again AI adds nothing. The problem is
| that the resultant text lacks structure: being able to
| sort/filter wiretap data by date/location, for example,
| isn't inherently possible because you've obtained text or
| indexed it. AI accuracy simply isn't high enough to solve
| this problem without specialized training--off the shelf
| models simply won't work accurately enough even if you
| can get around the legal problems of feeding potentially-
| sensitive information into a model. AI models trained on
| a large enough domain-specific dataset might work, but
| the existing off-the-shelf models certainly are not
| accurate enough. And there are a lot of subdomains--
| wiretap data, cell phone GPS data, credit card data,
| email metadata, etc., which would each require model
| training.
|
| Fundamentally, the solution to this problem is to not
| create it in the first place. There's no reason for there
| to be a structured data -> PDF -> AI -> structured data
| pipeline when we can just force people providing evidence
| to provide the structured data.
| carabiner wrote:
| I bet 90% of the problem space is legacy PDFs. My company has
| thousands of these. Some are crappy scans. Some have Adobe's
| OCR embedded, but most have none at all.
| lelandfe wrote:
| The better solution to a search engine extracting text from
| existing PDFs is to provide advice on how to author PDFs?
|
 | What's the timeline for this solution to pay off?
| chaps wrote:
| Microsoft is one of the bigger contributors to this. Like --
 | why does Excel have a feature to export to PDF, but not a
| feature to do the opposite? That export functionality really
| feels like it was given to a summer intern who finished it in
| two weeks and never had to deal with it ever again.
| mattigames wrote:
 | Because then we would have 2 formats: "PDFs generated by
 | Excel" and "real PDFs" with the same extension, and that
 | would be its own can of worms for Microsoft and for
 | everyone else.
| yxhuvud wrote:
| Sure, and if you have access to the source document the pdf was
| generated from, then that is a good thing to do.
|
| But generally speaking, you don't have that control.
| Obscurity4340 wrote:
| Is this what GoodPDF does?
| reify wrote:
| https://github.com/jalan/pdftotext
|
| pdftotext -layout input.pdf output.txt
|
| pip install pdftotext
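 |
 | The same library from Python (a short sketch; physical=True is
 | my assumption for the -layout equivalent):
 |
 |   import pdftotext
 |
 |   with open("input.pdf", "rb") as f:
 |       pdf = pdftotext.PDF(f, physical=True)
 |   print("\n\n".join(pdf))   # one string per page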
| EmilStenstrom wrote:
| I think using Gemma3 in vision mode could be a good use-case for
| converting PDF to text. It's downloadable and runnable on a local
| computer, with decent memory requirements depending on which size
| you pick. Did anyone try it?
| CaptainFever wrote:
| Kind of unrelated, but Gemma 3's weights are unfree, so perhaps
| LLaVA (https://ollama.com/library/llava) would be a good
| alternative.
| ljlolel wrote:
 | Mistral OCR has best-in-class document understanding.
| https://mistral.ai/news/mistral-ocr
| ted_dunning wrote:
| One of my favorite documents for highlighting the challenges
| described here is the PDF for this article:
|
| https://academic.oup.com/auk/article/126/4/717/5148354
|
| The first page is classic with two columns of text, centered
| headings, a text inclusion that sits between the columns and
| changes the line lengths and indentations for the columns. Then
| we get the fun of page headers that change between odd and even
| pages and section header conventions that vary drastically.
|
 | Oh... to make things even better, paragraphs don't always get
 | extra spacing and don't always have an indented first line.
|
| Some of everything.
| JKCalhoun wrote:
| The API in CoreGraphics (MacOS) for PDF, at a basic level,
| simply presented the text, per page, in the order in which it
| was encoded in the dictionaries. And 95% of the time this was
| pretty good -- and when working with PDFKit and Preview on the
| Mac, we got by with it for years.
|
| If you stepped back you could imagine the app that originally
| had captured/produced the PDF -- perhaps a word processor -- it
| was likely rendering the text into the PDF context in some
 | reasonable order from its own text buffer(s). So even for two
| columns, you rather expect, and often found, that the text
| _flowed_ correctly from the left column to the right. The text
| was therefore already in the correct order within the PDF
| document.
|
| Now, footers, headers on the page -- that would be anyone's
| guess as to what order the PDF-producing app dumped those into
| the PDF context.
| devrandoom wrote:
| I currently use ocrmypdf for my private library. Then Recoll to
| index and search. Is there a better solution I'm missing?
| constantinum wrote:
| PDF parsing is hell indeed, with all sorts of edge cases that
 | break business workflows; more on that here
| https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
| gibsonf1 wrote:
 | We[1] create "Units of Thought" from PDFs and then work with
 | those for further discovery, where a "Unit of Thought" is any
 | paragraph, title, note heading - something that stands on its own
 | semantically. We then create a hierarchy of objects from that PDF
 | in the database for search and conceptual search - all at scale.
|
| [1] https://graphmetrix.com/trinpod-server https://trinapp.com
| IIAOPSW wrote:
| I'm tempted to try it. My use case right now is a set of
| documents which are annual financial and statutory disclosures
| of a large institution. Every year they are formatted /
| organized slightly differently which makes it enormously
| tedious to manually find and compare the same basic section
| from one year to another, but they are consistent enough to
| recognize analogous sections from different years due to often
| reusing verbatim quotes or highly specific key words each time.
|
| What I really want to do is take all these docs and just
| reorder all the content such that I can look at page n (or
| section whatever) scrolling down and compare it between
| different years by scrolling horizontally. Ideally with changes
| from one year to the next highlighted.
|
| Can your product do this?
| kbyatnal wrote:
| "PDF to Text" is a bit simplified IMO. There's actually a few
| class of problems within this category:
|
| 1. reliable OCR from documents (to index for search, feed into a
| vector DB, etc)
|
| 2. structured data extraction (pull out targeted values)
|
| 3. end-to-end document pipelines (e.g. automate mortgage
| applications)
|
| Marginalia needs to solve problem #1 (OCR), which is luckily
| getting commoditized by the day thanks to models like Gemini
| Flash. I've now seen multiple companies replace their OCR
| pipelines with Flash for a fraction of the cost of previous
| solutions, it's really quite remarkable.
|
| Problems #2 and #3 are much more tricky. There's still a large
| gap for businesses in going from raw OCR outputs --> document
| pipelines deployed in prod for mission-critical use cases. LLMs
| and VLMs aren't magic, and anyone who goes in expecting 100%
| automation is in for a surprise.
|
| You still need to build and label datasets, orchestrate pipelines
| (classify -> split -> extract), detect uncertainty and correct
| with human-in-the-loop, fine-tune, and a lot more. You can
| certainly get close to full automation over time, but it's going
| to take time and effort. The future is definitely moving in this
| direction though.
|
| Disclaimer: I started a LLM doc processing company to help
| companies solve problems in this space (https://extend.ai)
| varunneal wrote:
| I've been hacking away at trying to process PDFs into Markdown,
| having encountered similar obstacles to OP regarding header
| detection (and many other issues). OCR is fantastic these days
| but maintaining a global structure to the document is much
| trickier. Consistent HTML seems still out of reach for large
| documents. I'm having half-decent results with Markdown using
| multiple passes of an LLM to extract document structure and
 | feeding it in contextually for page-by-page extraction.
| miki123211 wrote:
| There's also #4, reliable OCR and semantics extraction that
| works across many diverse classes of documents, which is
| relevant for accessibility.
|
| This is hard because:
|
| 1. Unlike a business workflow which often only deals with a few
| specific kinds of documents, you never know what the user is
| going to get. You're making an abstract PDF reader, not an app
| that can process court documents in bankruptcy cases in
| Delaware.
|
| 2. You don't just need the text (like in traditional OCR), you
| need to recognize tables, page headers and footers, footnotes,
| headings, mathematics etc.
|
| 3. Because this is for human consumption, you want to minimize
| errors as much as possible, which means _not_ using OCR when
| not needed, and relying on the underlying text embedded within
| the PDF while still extracting semantics. This means you
| essentially need two different paths, when the PDF only
| consists of images and when there are content streams you can
| get some information from.
|
| 3.1. But the content streams may contain different text from
| what's actually on the page, e.g. white-on-white text to hide
| information the user isn't supposed to see, or diacritics
| emulation with commands that manually draw acute accents
| instead of using proper unicode diacritics (LaTeX works that
| way).
|
| 4. You're likely running as a local app on the user's (possibly
| very underpowered) device, and likely don't have an associated
| server and subscription, so you can't use any cloud AI models.
|
| 5. You need to support forms. Since the user is using
| accessibility software, presumably they can't print and use a
| pen, so you need to handle the ones meant for printing too, not
| just the nice, spec-compatible ones.
|
| This is very much an open problem and is not even remotely
| close to being solved. People have been taking stabs at it for
| years, but all current solutions suck in some way, and there's
| no single one that solves all 5 points correctly.
| noosphr wrote:
| >replace their OCR pipelines with Flash for a fraction of the
| cost of previous solutions, it's really quite remarkable.
|
| As someone who had to build custom tools because VLMs are so
| unreliable: anyone that uses VLMs for unprocessed images is in
| for more pain than all the providers which let LLMs without
| guard rails interact directly with consumers.
|
| They are very good at image labeling. They are ok at very
| simple documents, e.g. single column text, centered single
| level of headings, one image or table per page, etc. (which is
| what all the MVP demos show). They need another trillion
| parameters to become bad at complex documents with tables and
| images.
|
| Right now they hallucinate so badly that you simply _can't_ use
| them for something as simple as a table with a heading at the
| top, data in the middle and a summary at the bottom.
| anonu wrote:
 | They should have called it NDF - Non-Portable Document Format.
| dobraczekolada wrote:
| Reminds me of github.com/docwire/docwire
| 90s_dev wrote:
| Have any of you ever thought to yourself, this is new and
| interesting, and then vaguely remembered that you spent months or
| years becoming an expert at it earlier in life but entirely
| forgot it? And in fact large chunks of the very interesting
| things you've done just completely flew out of your mind long
| ago, to the point where you feel absolutely new at life, like
| you've accomplished relatively nothing, until something like this
| jars you out of that forgetfulness?
|
| I definitely vaguely remember doing some incredibly cool things
| with PDFs and OCR about 6 or 7 years ago. Some project comes to
| mind... google tells me it was "tesseract" and that sounds
| familiar.
| downboots wrote:
| No different than a fire ant whose leaf got knocked over by the
| wind and it moved on to the next.
| 90s_dev wrote:
| Well I sure do _feel_ different than a fire ant.
| downboots wrote:
| anttention is all we have
| 90s_dev wrote:
| Not true, I also have a nice cigar waiting for the rain
| to go away.
| 90s_dev wrote:
| Hmm, it's gone now. Well I used to have one anyway.
| korkybuchek wrote:
| Not that I'm privy to your mind, but it probably _was_
| tesseract (and this is my exact experience too...although for
| me it was about 12 years ago).
| bazzargh wrote:
| Back in... 2006ish? I got annoyed with being unable to copy
| text from multicolumn scientific papers on my iRex (an early
| ereader that was somewhat hackable) so dug a bit into why that
| was. Under the hood, the pdf reader used poppler, so I modified
| poppler to infer reading order in multicolumn documents using
 | algorithms that tesseract's author (Thomas Breuel) had
| published for OCR.
|
| It was a bit of a heuristic hack; it was 20 years ago but as I
| recall poppler's ancient API didn't really represent text runs
| in a way you'd want for an accessibility API. A version of the
| multicolumn select made it in but it was a pain to try to
| persuade poppler's maintainer that subsequent suggestions to
| improve performance were ok - because they used slightly
| different heuristics so had different text selections in some
| circumstances. There was no 'right' answer, so wanting the
| results to match didn't make sense.
|
| And that's how kpdf got multicolumn select, of a sort.
|
 | Using tesseract directly for this has probably made more sense
| for some years now.
| steeeeeve wrote:
| I too went down that rabbithole. Haha. Anything around that
| time to get an edge in a fantasy football league. I found a
| bunch of historical NFL stats pdfs and it took forever to
| make usable data out of them.
| pimlottc wrote:
| This is life. So many times I've finished a project and thought
| to myself: "Now I am an expert at doing this. Yet I probably
 | won't ever do this again." Because the next thing will be
 | completely in a different subject area and I'll start again
| from the basics.
| didericis wrote:
| I built an auto HQ solver with tesseract when HQ was blowing up
| over thanksgiving (HQ was the gameshow by the vine people with
| live hosts). I would take a screenshot of the app during a
| question, share it/send it to a little local api, do a google
| query for the question, see how many times each answer on the
| first page appeared in the results, then rank the answers by
| probability.
|
| Didn't work well/was a very naive way to search for answers
| (which is prob good/idk what kind of trouble I'd have gotten in
| if it let me or anyone else who used it win all the time), but
| it was fun to build.
| anon373839 wrote:
| Tesseract was the best open-source OCR for a long time. But I'd
| argue that docTR is better now, as it's more accurate out of
| the box and GPU accelerated. It implements a variety of
| different text detection and recognition model architectures
| that you can combine in a modular pipeline. And you can train
| or fine-tune in PyTorch or TensorFlow to get even better
| performance on your domain.
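 |
 | (The basic docTR usage, for anyone curious -- a minimal sketch,
 | with the file name as a placeholder:)
 |
 |   from doctr.io import DocumentFile
 |   from doctr.models import ocr_predictor
 |
 |   model = ocr_predictor(pretrained=True)   # detection + recognition
 |   doc = DocumentFile.from_pdf("scan.pdf")  # rasterizes the pages
 |   result = model(doc)
 |   print(result.render())                   # plain-text export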
| PeterStuer wrote:
 | I guess I'm lucky the PDFs I need to process are mostly rather
 | dull, unadventurous layouts. So far I've had great success using
| docling.
| keybored wrote:
| For people who want people to read their documents[1] they should
| have their PDF point to a more digital-friendly format, an alt
| document.
|
| _Looks like you've found my PDF. You might want this version
| instead:_
|
| PDFs are often subpar. Just see the first example: standard Latex
| serif section title. I mean, PDFs often aren't even well-typeset
| for what they are (dead-tree simulations).
|
| [1] No sarcasm or truism. Some may just want to submit a paper to
| whatever publisher and go through their whole laundry list of
 | what a paper ought to be. Wide dissemination is not the point.
| 1vuio0pswjnm7 wrote:
| Below is a PDF. It is a .txt file. I can save it with a .pdf
| extension and open it in a PDF viewer. I can make changes in a
| text editor. For example, by editing this text file, I can change
| the text displayed on the screen when the PDF is opened, the
| font, font size, line spacing, the maximum characters per line,
| number of lines per page, the paper width and height, as well as
 | portrait versus landscape mode.
 |
 |   %PDF-1.4
 |   1 0 obj
 |   << /CreationDate (D:2025) /Producer >>
 |   endobj
 |   2 0 obj
 |   << /Type /Catalog /Pages 3 0 R >>
 |   endobj
 |   4 0 obj
 |   << /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Times-Roman >>
 |   endobj
 |   5 0 obj
 |   << /Font << /F1 4 0 R >> /ProcSet [ /PDF /Text ] >>
 |   endobj
 |   6 0 obj
 |   << /Type /Page /Parent 3 0 R /Resources 5 0 R /Contents 7 0 R >>
 |   endobj
 |   7 0 obj
 |   << /Length 8 0 R >>
 |   stream
 |   BT
 |   /F1 50 Tf
 |   1 0 0 1 50 752 Tm
 |   54 TL
 |   (PDF is)'
 |   ((a) a text format)'
 |   ((b) a graphics format)'
 |   ((c) (a) and (b).)'
 |   ()'
 |   ET
 |   endstream
 |   endobj
 |   8 0 obj
 |   53
 |   endobj
 |   3 0 obj
 |   << /Type /Pages /Count 1 /MediaBox [ 0 0 612 792 ] /Kids [ 6 0 R ] >>
 |   endobj
 |   xref
 |   0 9
 |   0000000000 65535 f
 |   0000000009 00000 n
 |   0000000113 00000 n
 |   0000000514 00000 n
 |   0000000162 00000 n
 |   0000000240 00000 n
 |   0000000311 00000 n
 |   0000000391 00000 n
 |   0000000496 00000 n
 |   trailer
 |   << /Size 9 /Root 2 0 R /Info 1 0 R >>
 |   startxref
 |   599
 |   %%EOF
| swsieber wrote:
| It can also have embedded binary streams. It was not made for
| text. It was made for layout and graphics. You give nice
| examples, but each of those lines could have been broken up
| into one call per character, or per word, even out of order.
| 1vuio0pswjnm7 wrote:
| "PDF" is an acronym for for "Portable Document Format"
|
| "2.3.2 Portability
|
| A PDF file is a 7-bit ASCII file, which means PDF files use
| only the printable subset of the ASCII character set to
| describe documents even those with images and special
| characters. As a result, PDF files are extremely portable
| across diverse hardware and operating system environments."
|
| https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
| ljlolel wrote:
 | Mistral OCR has best-in-class document understanding.
|
| https://mistral.ai/news/mistral-ocr
| nicodjimenez wrote:
| Check out mathpix.com. We handle complex tables, complex math,
| diagrams, rotated tables, and much more, extremely accurately.
|
| Disclaimer: I'm the founder.
| bickfordb wrote:
| Maybe it's time for new document formats and browsers that neatly
| separate content, presentation and UI layers? PDF and HTML are
| 20+ years old and it's often difficult to extract information
 | from either, let alone author a browser.
| rrr_oh_man wrote:
| Yes, but I'm sure they're out there somewhere
|
| (https://xkcd.com/927/)
| TRiG_Ireland wrote:
| Open XML Paper Specification is an XML-based format intended
| to compete with PDF. Unlike PDF, it is purely static: no
| scripting.
|
| Also unlike PDF, I've never seen it actually used in the
| wild.
| smcleod wrote:
 | Definitely recommend docling for this:
 | https://docling-project.github.io/docling/
| elpalek wrote:
 | Recently tested (non-English) PDF OCR with Gemini 2.5 Pro.
 | First, I directly asked it to extract text from the PDF. Result:
 | a random text blob, not usable.
 |
 | Second, I converted the PDF into one JPG per page. Gemini
 | performed exceptionally well: near-perfect text extraction with
 | intact formatting in Markdown.
 |
 | Maybe there's an internal difference when processing PDF vs. JPG
 | inside the model.
| jagged-chisel wrote:
| Model isn't rendering the PDF probably, just looking in the
| file for text.
| noosphr wrote:
| I've worked on this in my day job: extracting _all_ relevant
| information from a financial services PDF for a bert based search
| engine.
|
| The only way to solve that is with a segmentation model followed
| by a regular OCR model and whatever other specialized models you
 | need to extract other types of data. VLMs aren't ready for prime
 | time and won't be for a decade or more.
|
| What worked was using doclaynet trained YOLO models to get the
| areas of the document that were text, images, tables or formulas:
| https://github.com/DS4SD/DocLayNet if you don't care about
| anything but text you can feed the results into tesseract
| directly (but for the love of god read the manual).
| Congratulations, you're done.
|
| Here's some pre-trained models that work OK out of the box:
| https://github.com/ppaanngggg/yolo-doclaynet I found that we
| needed to increase the resolution from ~700px to ~2100px
| horizontal for financial data segmentation.
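 |
 | (A rough sketch of that segmentation + OCR hand-off, assuming
 | one of those YOLO DocLayNet checkpoints plus pytesseract -- the
 | weights file name is a placeholder and the label filtering is
 | simplified:)
 |
 |   from PIL import Image
 |   import pytesseract
 |   from ultralytics import YOLO
 |
 |   model = YOLO("yolov8-doclaynet.pt")   # placeholder checkpoint
 |   page = Image.open("page-0001.png")    # one rasterized page
 |
 |   for box in model(page)[0].boxes:
 |       label = model.names[int(box.cls)]
 |       if label in ("Text", "Section-header", "Title"):
 |           x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
 |           crop = page.crop((x1, y1, x2, y2))
 |           print(label, pytesseract.image_to_string(crop))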
|
| VLMs on the other hand still choke on long text and hallucinate
 | unpredictably. Worse, they can't understand nested data. If you
 | give _any_ current model something as simple as three nested
 | rectangles with text under each, it will not extract the text
 | correctly. Given that nested rectangles describe every table, no
| VLM can currently extract data from anything but the most
| straightforward of tables. But it will happily lie to you that it
| did - after all a mining company should own a dozen bulldozers
| right? And if they each cost $35.000 it must be an amazing deal
| they got, right?
| Sharlin wrote:
| Some of the unsung heroes of the modern age are the programmers
| who, through what must have involved a lot of weeping and
| gnashing of teeth, have managed to implement the find, select,
| and copy operations in PDF readers.
| patrick41638265 wrote:
| Good old https://linux.die.net/man/1/pdftotext and a little
| Python on top of its output will get you a long way if your
| documents are not too crazy. I use it to parse all my bank
| statements into an sqlite database for analysis.
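 |
 | (A rough sketch of that pipeline -- the line regex is specific
 | to my statements' layout, so treat it as an assumption:)
 |
 |   import re, sqlite3, subprocess
 |
 |   text = subprocess.run(
 |       ["pdftotext", "-layout", "statement.pdf", "-"],
 |       capture_output=True, text=True, check=True).stdout
 |
 |   # date, description, amount at end of line
 |   row = re.compile(r"(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")
 |
 |   con = sqlite3.connect("statements.db")
 |   con.execute("CREATE TABLE IF NOT EXISTS tx"
 |               " (date TEXT, descr TEXT, amount REAL)")
 |   for line in text.splitlines():
 |       m = row.search(line.rstrip())
 |       if m:
 |           con.execute("INSERT INTO tx VALUES (?, ?, ?)",
 |                       (m.group(1), m.group(2),
 |                        float(m.group(3).replace(",", ""))))
 |   con.commit()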
| coolcase wrote:
| Tried extracting data from a newspaper. It is really hard. What
| is a headline and which headline belongs to which paragraphs?
| Harder than you think! And chucking it as is into OpenAI was no
| good at all. Manually dealing with coordinates from OCR was
| better but not perfect.
| rekoros wrote:
| I've been using Azure's "Document Intelligence" thingy (prebuilt
| "read" model) to extract text from PDFs with pretty good results
| [1]. Their terminology is so bad, it's easy to dismiss the whole
| thing for another Microsoft pile, but it actually, like, for
| real, works.
|
| [1] https://learn.microsoft.com/en-us/azure/ai-
| services/document...
___________________________________________________________________
(page generated 2025-05-13 23:00 UTC)