[HN Gopher] So you want to modify the text of a PDF by hand (2020)
___________________________________________________________________
So you want to modify the text of a PDF by hand (2020)
Author : mutant_glofish
Score : 113 points
Date : 2023-09-03 06:24 UTC (1 days ago)
(HTM) web link (gist.github.com)
| rogeliodh wrote:
| LibreOffice can open and edit PDFs. Last time I tried, it was
| really good. Not sure what the limitations are.
| lucb1e wrote:
| For me it always seems to change the font from whatever was
| built into the PDF (rendered just fine in any PDF reader) to a
| random system font which completely breaks the spacing, making
| different parts of the document overflow into each other
| ks2048 wrote:
| This seems to be missing an important point: at the end of a PDF is
| a table ("cross-reference" table) that stores the BYTE-OFFSET to
| different objects in the file.
|
| If you modify things within the file, typically these offsets
| will change and the file will be corrupt. It looks like in this
| article, maybe they were only interested in changing one number
| to another, so none of the positions change.
|
| But, generally, adding/removing/modifying things in the middle of
| the file requires recomputing the xref table, so it becomes much
| easier to use a library than to edit the text directly.
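The byte-offset problem described above can be reproduced with a toy file. The following is a minimal Python sketch, not a full PDF writer: `build_pdf` and `stale_entries` are hypothetical helper names, the three objects are made up, and the file is only assumed to be "valid enough" to show how a classic xref table goes stale.

```python
import re

def build_pdf(objects):
    """Serialize toy object bodies into a minimal PDF with a classic xref table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))              # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off      # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def stale_entries(pdf):
    """Return object numbers whose xref offset no longer points at 'N 0 obj'."""
    table = pdf[pdf.rfind(b"\nxref\n"):]      # locate the table by keyword, not by startxref
    entries = re.findall(rb"(\d{10}) \d{5} n", table)
    return [n for n, off in enumerate(entries, start=1)
            if not pdf.startswith(b"%d 0 obj" % n, int(off))]

pdf = build_pdf([
    b"<< /Type /Catalog /Pages 3 0 R >>",
    b"(Total: 100)",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
same_length = pdf.replace(b"(Total: 100)", b"(Total: 999)")
one_byte_longer = pdf.replace(b"(Total: 100)", b"(Total: 1000)")

print(stale_entries(pdf), stale_entries(same_length), stale_entries(one_byte_longer))
```

A same-length edit leaves every offset intact, while a single extra byte shifts every object that follows it. In the longer file the `startxref` value is off by one as well, which this sketch sidesteps by searching for the `xref` keyword.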
| userbinator wrote:
| That's the weirdest part of the PDF spec IMHO. It's a mix of
| both binary and text, with text-specified byte offsets. It
| would be very interesting to read about why the format became
| like that, if its authors would ever talk about it. My guess is
| that it was meant to be completely textual at first (but then
| requiring the xref table to have fixed-length entries is odd),
| and then they decided binary would be more efficient.
| Someone wrote:
| > My guess is that it was meant to be completely textual at
| first
|
| It indeed started life as "not Turing complete postscript
| with an index" (that makes it easy to render just the third
| page of a PDF file, something that's impossible in postscript
| without rendering the first and second pages first). Like
| postscript, that was a pure text format.
|
| One nice feature is that you can append a few pieces and a
| new index to an existing PDF file and get a new valid PDF
| file (which would still contain its old index as a piece of
| "junk DNA")
|
| I think compression was added because users complained about
| file sizes. Ascii85 (https://en.m.wikipedia.org/wiki/Ascii85)
| grows binary data by 25%.
|
| > but then requiring the xref table to have fixed-length
| entries is odd
|
| My guess is that made it easier to hack together a tool to
| convert PDF to postscript.
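The "append a few pieces and a new index" behaviour means one file can hold several revisions, each ending in its own `%%EOF` marker. Here is a rough Python sketch of splitting a file at those markers. `split_revisions` is a hypothetical helper and the sample bytes are placeholders rather than a valid PDF. It also assumes `%%EOF` never occurs inside stream data, which real files do not guarantee.

```python
def split_revisions(pdf: bytes):
    """Return one prefix of the file per %%EOF marker, oldest revision first."""
    marker = b"%%EOF"
    revisions = []
    pos = pdf.find(marker)
    while pos != -1:
        end = pos + len(marker)
        revisions.append(pdf[:end])           # a reader that stops here sees this revision
        pos = pdf.find(marker, end)
    return revisions

# Placeholder content: an original body plus one appended update section.
original = b"%PDF-1.4\n1 0 obj\n(draft)\nendobj\nxref\n(table omitted)\n%%EOF\n"
update = b"1 0 obj\n(final)\nendobj\nxref\n(new table omitted)\n%%EOF\n"

revisions = split_revisions(original + update)
print(len(revisions), b"(final)" in revisions[0], b"(final)" in revisions[1])
```

Each prefix is what the file looked like before the next update was appended, which is why the older index survives as "junk DNA" in the final file.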
| detourdog wrote:
| I actually was at an Acrobat/PDF launch event in midtown NYC.
| It was an embedded file type that could be generated at the
| time of publishing and all dependencies could either be
| embedded or not.
|
| This made a coherent point in a digital workflow that could
| be saved and reprinted with ease. This was a big deal before
| the portable document format came to be.
|
| I once made a workflow that took pdf files from Word,
| filemaker, excel, and mini-cad. This all got combined into a
| single 9,000 page pdf. The final pdf had coherent
| thumbnails, page numbers, headers and footers.
|
| Only took a couple of hours to get the final document after
| pushing the go button.
| pmarreck wrote:
| The roots of PDF are PostScript, which is like Forth, and is
| text-based, so that's why
| bena wrote:
| Ah. So it's a lot like editing compiled binaries.
|
| You can modify binaries all you want as long as you preserve
| the length of everything.
|
| Some piece of software we had authenticated against a server,
| but everything was done on the client. The client executed SQL
| against the server directly, etc. Basically, the server checked
| to see if this client would put you over the number of licenses
| you purchased and that's it.
|
| I had run it against a disassembler, found the part where it
| performed the check, and was able to change it to a straight
| JMP and then pad the rest of the space with NOPs.
| gpvos wrote:
| That's why they decode it with qpdf and re-encode it again
| afterwards, so qpdf takes care of that. qpdf reconstructs the
| original PDF structure, and I think it even tries to keep the
| object numbers the same, but the offsets are recalculated
| completely.
| aidos wrote:
| In my experience it's easiest just to break the xref table and
| run something like "mutool clean" to fix it again. It can be
| completely derived from the content so it's safe to do.
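To illustrate why the xref table can be derived from the content, here is a simplified Python sketch that scans for `N 0 obj` headers and writes a fresh table. It is not how mutool or qpdf do it. `rebuild_xref` is a hypothetical name, and the sketch assumes uncompressed objects numbered 1..N, generation 0, `\n` line endings, and no stream data that happens to look like an object header.

```python
import re

OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)

def rebuild_xref(pdf: bytes) -> bytes:
    """Drop the old xref section and append one computed from the object positions."""
    cut = pdf.rfind(b"\nxref\n")
    body = pdf[:cut + 1] if cut != -1 else pdf
    offsets = {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(body)}
    if sorted(offsets) != list(range(1, len(offsets) + 1)):
        raise ValueError("sketch assumes objects are numbered 1..N without gaps")
    # Assumption: the document root is the object that declares /Type /Catalog.
    root = next(n for n, off in offsets.items()
                if b"/Type /Catalog" in body[off:body.index(b"endobj", off)])
    size = len(offsets) + 1
    table = b"xref\n0 %d\n0000000000 65535 f \n" % size
    table += b"".join(b"%010d 00000 n \n" % offsets[n] for n in range(1, size))
    trailer = b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root)
    return body + table + trailer + b"startxref\n%d\n%%%%EOF\n" % len(body)

# A toy file whose xref entries and startxref value are deliberately wrong.
broken = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
          b"xref\n0 3\n0000000000 65535 f \n0000000000 00000 n \n"
          b"0000000000 00000 n \ntrailer\n<< /Size 3 /Root 1 0 R >>\n"
          b"startxref\n0\n%%EOF\n")
fixed = rebuild_xref(broken)
print(fixed.decode("ascii"))
```

Real repair tools also have to cope with object streams, free entries, and incremental updates, so this only shows the core idea.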
| Const-me wrote:
| > I didn't see an obvious open-source tool that lets you dig into
| PDF internals
|
| That's a matter of the toolset. I program C#, and I have good
| experience with that open source library:
| https://www.nuget.org/packages/iTextSharp-LGPL/ It's a decade old
| by now, but PDF ain't exactly a new format. That library is not
| terribly bad for many practical use cases. Particularly good when
| you only need to create the documents as opposed to editing them,
| because for that use case you'd want to use an old version of the
| format anyway, for optimal compatibility.
| jl6 wrote:
| This seems to be missing an important step in the use of qpdf's
| --qdf mode: after you've finished editing, you need to run the
| file through the fix-qdf utility to recalculate all the object
| offsets and rebuild the cross-reference table that lives at the
| end of the file (unless you only change bytes in-place rather
| than adding or removing bytes).
|
| My top 3 fun PDF facts:
|
| 1) Although PDF documents are typically 8-bit binary files, you
| can make one that is valid UTF-8 "plain text", even including
| images, through the use of the ASCII85 filter.[0]
|
| 2) PDF allows an incredible variety of freaky features (3D
| objects, JavaScript, movies in an embedded flash object,
| invisible annotations...). PDF/A is a much saner, safer subset.
|
| 3) The PDF spec allows you to write widgets (e.g. form controls)
| using "rich text", which is a subset of XHTML and CSS - but this
| feature is very sparsely supported outside the official Adobe
| Reader.
|
| [0] For example: https://lab6.com/2
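Fun fact 1 relies on Ascii85 turning arbitrary bytes into printable ASCII at a cost of five characters per four bytes. Below is a small Python sketch using the standard library's `base64.a85encode`. The 200 input bytes are a stand-in for image or stream data, and appending `~>` follows the end-of-data marker PDF's ASCII85Decode filter expects (an assumption to check against the spec before relying on it).

```python
import base64

raw = bytes(range(1, 201))                   # stand-in for 200 bytes of binary stream data
encoded = base64.a85encode(raw) + b"~>"      # "~>" marks end of data for the PDF filter
text = encoded.decode("ascii")               # pure ASCII, so also valid UTF-8

payload = len(encoded) - 2                   # ignore the two-byte end marker
print(len(raw), payload, payload / len(raw))

# Round trip: strip the marker and decode back to the original bytes.
assert base64.a85decode(encoded[:-2]) == raw
```

Runs of four zero bytes are abbreviated to a single `z`, so the overhead can be slightly lower than 25% for some inputs.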
| gpvos wrote:
| After you've finished editing, just run it through qpdf without
| parameters, as explained in the beginning of the article, and
| it will recompress the data and recreate the xref table. No
| need for yet another tool.
| jl6 wrote:
| I guess you could, but this is the source of the errors
| (actually warnings) that the article mentions. Probably best
| to fix the file with the provided tool (fix-qdf is
| distributed with qpdf) rather than get in the habit of
| ignoring warnings.
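The two workflows discussed in this sub-thread can be scripted. This is a hedged Python sketch that only wraps the command lines: the function names and file names are made up, `--object-streams=disable` is an optional extra that keeps objects individually editable, and `fix-qdf` is assumed to write the repaired file to standard output.

```python
import shutil
import subprocess

def expand_cmd(src, dst):
    """Command that rewrites src as an editable QDF file."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]

def fix_cmd(edited):
    """Command that recomputes offsets and the xref table of a hand-edited QDF file."""
    return ["fix-qdf", edited]

def recompress_cmd(src, dst):
    """Command that rewrites (and recompresses) a PDF with default settings."""
    return ["qpdf", src, dst]

def finish_edit(edited, final):
    """Repair a hand-edited QDF file with fix-qdf, then recompress it with qpdf."""
    for tool in ("qpdf", "fix-qdf"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    repaired = subprocess.run(fix_cmd(edited), check=True, capture_output=True).stdout
    repaired_path = edited + ".fixed"
    with open(repaired_path, "wb") as handle:
        handle.write(repaired)
    subprocess.run(recompress_cmd(repaired_path, final), check=True)

print(" ".join(expand_cmd("in.pdf", "editable.pdf")))
```

Skipping the `fix-qdf` step and running plain `qpdf` on the edited file is the shortcut gpvos describes. qpdf then repairs the table itself and reports warnings, which `check=True` would surface as a non-zero exit status.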
| LispSporks22 wrote:
| As I recall, words aren't even necessarily made up of contiguous
| characters. Especially true in OCRed documents in PDF.
| aleden wrote:
| I'm surprised no one has mentioned qpdf.
|
| https://qpdf.readthedocs.io/en/stable/overview.html
|
| It turns a PDF (typically everything in it is compressed binary
| blobs) into a mixed binary/ASCII file (which itself is a PDF)
| that can be edited with vim.
| rhaway84773 wrote:
| It's mentioned in the gist
|
| > To view the compressed data, you can use a command line tool
| called qpdf.
| chrnola wrote:
| The linked article literally mentions qpdf within the first few
| paragraphs.
| gpvos wrote:
| I'm not sure what you were reading, but the fine article is
| centred around using qpdf.
| seszett wrote:
| Although this is an interesting dive into the PDF format, just
| opening the PDF in Libreoffice or Inkscape usually works fine to
| modify its text.
| gcanyon wrote:
| I'm interested in extracting the contents of a pdf form -- many
| individual text boxes. You're saying libre office would likely
| be able to parse that pdf into a usable format?
| anon____ wrote:
| With LibreOffice Draw you can edit the PDF (modify the text,
| move or change images, etc), then save as pdf, but it can't
| parse and save it as .odt, .doc, .html or similar.
| ShadowBanThis01 wrote:
| LibreOffice has some really perplexing functionality gaps.
|
| The one that baffles me is that it doesn't understand its
| own graphics format, so you have to export drawings to TIFF
| or something (if I remember correctly).
| pikrzyszto wrote:
| Poppler ( https://poppler.freedesktop.org/ ) handles this for
| you with its pdftotext utility. It also ships with a bunch of other
| utilities for working with PDFs.
| desgeeko wrote:
| If you want to continue this journey and learn more about PDF,
| you can read the anatomy of a file I documented recently:
| https://pdfsyntax.dev/introduction_pdf_syntax.html
| enriquto wrote:
| You can do this:
|
|     pdf2ps a.pdf   # convert to postscript "a.ps"
|     vim a.ps       # edit postscript by hand
|     ps2pdf a.ps    # convert back to pdf
|
| Some complex PDFs (with embedded javascript, animations, etc) fail
| to work correctly after this back and forth. Yet for "plain"
| documents this works alright. You can easily remove watermarks,
| change some words and numbers, etc. Spacing is harder to modify.
| Of course you need to know some postscript.
| jordann wrote:
| If you don't mind using java, you can use the open source Apache
| PDFBox library
|
| https://pdfbox.apache.org/
|
| It's relatively performant and it's a mature and supported
| codebase that can accomplish most pdf tasks.
| aidos wrote:
| This topic comes up periodically as most people think PDFs are
| some impenetrable binary format, but they're really not.
|
| They are a graph of objects of different types. The types
| themselves are well described in the official spec (I'm a sadist,
| I read it for fun).
|
| My advice is always to convert the pdf to a version without
| compressed data like the author here has. My tool of choice is
| mutool (mutool clean -d in.pdf out.pdf). Then just have a
| rummage. You'll be surprised by how much you can follow.
|
| In the article the author missed a step where you look at the
| page object to see the resources. That's where the mapping from
| the font name used in the content stream to the underlying object
| is made.
|
| There's also another important bit missing - most fonts are
| subset into the pdf, i.e. only the glyphs that are needed are
| maintained in the font. I think that's often where the re-
| encoding happens. ToUnicode is maintained to allow you to copy
| text (or search in a PDF). It's a nice to have for users (in my
| experience it's normally there and correct though).
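The lookup chain described above (content stream font name, then page resources, then font object, then ToUnicode) can be modelled with plain dictionaries. This Python sketch is a toy and not a PDF parser. The object numbers, the `ABCDEF+SomeFont` subset name, and the two-byte glyph codes are all invented for illustration, and the ToUnicode CMap is assumed to be already parsed into a dict.

```python
import re

# Toy object graph: ("ref", n) stands for an indirect reference "n 0 R".
objects = {
    3: {"Type": "Page",
        "Resources": {"Font": {"F1": ("ref", 5)}},
        "Contents": ("ref", 4)},
    4: "BT /F1 12 Tf 72 720 Td <00010002> Tj ET",     # decompressed content stream
    5: {"Type": "Font", "BaseFont": "ABCDEF+SomeFont", "ToUnicode": ("ref", 6)},
    6: {0x0001: "H", 0x0002: "i"},                    # parsed ToUnicode mapping
}

def resolve(ref):
    """Follow an indirect reference to the object it names."""
    return objects[ref[1]]

def decode_page_text(page_number):
    """Decode the first hex string shown on a page via its font's ToUnicode map."""
    page = objects[page_number]
    content = resolve(page["Contents"])
    font_name = re.search(r"/(\w+) [\d.]+ Tf", content).group(1)
    font = resolve(page["Resources"]["Font"][font_name])   # the step via page resources
    to_unicode = resolve(font["ToUnicode"])
    hex_string = re.search(r"<([0-9A-Fa-f]+)> Tj", content).group(1)
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    return "".join(to_unicode[code] for code in codes)

print(decode_page_text(3))
```

With a subset font the codes in the content stream are glyph indices rather than character codes, so without the ToUnicode entry the text cannot be recovered this way.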
| esafak wrote:
| It is a shame Adobe designed a format so hard to work with that
| people are amazed when someone accomplishes what should be a
| basic task with it.
|
| Their design philosophy of creating a read-only format was
| flawed to begin with. What's the first feature people are going
| to ask for??
| pwg wrote:
| > It is a shame Adobe designed a format so hard to work with
|
| PDF was not designed to be editable, nor for anyone to "work
| with" it in that way.
|
| It was designed (at least the original purpose circa 1989) to
| represent printed pages electronically in a format that would
| view and print identically everywhere. In fact, the initial
| advertising for the "value" of the PDF format was exactly
| this: no matter where a recipient viewed your PDF output, it
| would look, and print, identically to everywhere else.
|
| It was originally meant to be "electronic paper".
| dylan604 wrote:
| Wasn't the PDF format based on the Illustrator format?
|
| The weird thing to me is people using a distribution format
| as an original source. It's right up there with video
| cameras shooting an acquisition source as an MP4 and all of
| the negative baggage that comes with that.
| userbinator wrote:
| I believe Illustrator format is very similar to
| PostScript.
| mistrial9 wrote:
| .. waves to Leonard Rosenthol
| lucascacho wrote:
| Every time I read about the hardships of interacting with the PDF
| format, I gain more respect for Photopea, which has full PDF
| editing support.
| blincoln wrote:
| The PDF specification is wild. My current favourite trivia is
| that it supports all of Photoshop's layer blend modes for
| rendering overlapping elements.[1] My second-favourite is that it
| supports appended content that modifies earlier content, so one
| should always look for forensic evidence in all distinct versions
| represented in a given file.[2]
|
| It's also a fun example of the futility of DRM. The spec includes
| password-based encryption, and allows for different "owner" and
| "user" passwords. There's a bitfield with options for things like
| "prevent printing", "prevent copying text", and so forth,[3] but
| because reading the document necessarily involves decrypting it,
| one can use the "user" password to open an encrypted PDF in a
| non-compliant tool,[4] then save the unencrypted version to get
| an editable equivalent.
|
| [1] "More than just transparency" section of
| https://blog.adobe.com/en/publish/2022/01/31/20-years-of-tra...
|
| [2] https://blog.didierstevens.com/2008/05/07/solving-a-
| little-p...
|
| [3] Page 61 of https://opensource.adobe.com/dc-acrobat-sdk-
| docs/pdfstandard...
|
| [4] For example, a script that uses the pypdf library.
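The permissions bitfield mentioned above is stored as a signed 32-bit integer (the `/P` entry of the encryption dictionary). Here is a small Python sketch for decoding it. The bit positions are my reading of the standard security handler table in PDF 1.7 and should be checked against the spec, and `decode_permissions` is a hypothetical helper name.

```python
# Bit positions are 1-based, counted from the low-order bit (assumed from PDF 1.7).
PERMISSION_BITS = {
    3: "print",
    4: "modify",
    5: "copy or extract text",
    6: "annotate or fill forms",
    9: "fill forms",
    10: "extract for accessibility",
    11: "assemble pages",
    12: "print at high quality",
}

def decode_permissions(p: int):
    """List the operations a /P value grants."""
    unsigned = p & 0xFFFFFFFF                 # reinterpret the signed value as 32 bits
    return [name for bit, name in PERMISSION_BITS.items()
            if unsigned >> (bit - 1) & 1]

for value in (-4, -1852, -3904):
    print(value, decode_permissions(value))
```

The flags are only a request to the viewing software. Nothing in the file format enforces them once the content has been decrypted, which is the point made above.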
| aidos wrote:
| To be fair, if you wanted to stop copying of text it would be
| easiest just to drop the ToUnicode mapping against the fonts
| and then it's a manual process for people to recreate them.
| miki123211 wrote:
| That also breaks search (and more importantly screen reader
| accessibility), and if you're professionally required to
| specifically produce PDFs with these security features
| enabled, you're pretty likely to be working in a context
| where that would be illegal.
| userbinator wrote:
| In the context of a format that was originally proprietary and
| not widely available to everyone, and conceived in an era where
| encryption was strongly controlled by export law, that sort of
| security-by-obscurity was very common. Incidentally, a popular
| cracking tutorial back then was to de-DRM the official reader
| by patching the function that checks those permissions.
| aardvark179 wrote:
| Aren't the blend modes supported just the Porter-Duff
| compositing modes? You might think that's overkill, but it's a
| really good mapping of what other rendering pipelines offer and
| it can really help reduce the work to produce a PDF.
| pavlov wrote:
| The original Porter-Duff compositing operators don't cover
| Photoshop-style blending. Here's a link with pictures:
|
| http://ssp.impulsetrain.com/porterduff.html
|
| The Porter-Duff operators are appealingly rigorous and easy
| to implement because they're simply the possible combinations
| of a simple formula. But many of these operators are not very
| useful either.
|
| The Photoshop blending modes are practically the opposite:
| they are not derived from anything mathematically appealing,
| it's really just a collection of algorithms that Photoshop's
| designers originally found useful. They reflect the
| limitations of their early 1990s desktop computer
| implementations (for example, no attempt is made to account
| for gamma correction when combining the layers, which makes
| many of these operations behave very differently from actual
| light that they mean to emulate).
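The contrast above can be made concrete. In this Python sketch the Porter-Duff operators all come from one formula, `result = src * Fa + dst * Fb` on premultiplied colours, and differ only in the two factors. The blend modes are separate per-channel functions. Only a few operators and two blend modes are shown, and the function names are my own.

```python
# Each operator maps (source alpha, destination alpha) to the factors (Fa, Fb).
OPERATORS = {
    "clear": lambda sa, da: (0.0, 0.0),
    "src":   lambda sa, da: (1.0, 0.0),
    "dst":   lambda sa, da: (0.0, 1.0),
    "over":  lambda sa, da: (1.0, 1.0 - sa),
    "in":    lambda sa, da: (da, 0.0),
    "out":   lambda sa, da: (1.0 - da, 0.0),
    "atop":  lambda sa, da: (da, 1.0 - sa),
    "xor":   lambda sa, da: (1.0 - da, 1.0 - sa),
}

def composite(op, src, dst):
    """Combine premultiplied (r, g, b, a) tuples with a Porter-Duff operator."""
    fa, fb = OPERATORS[op](src[3], dst[3])
    return tuple(s * fa + d * fb for s, d in zip(src, dst))

def multiply(backdrop, source):
    """Photoshop-style 'multiply' on one colour channel in the range 0..1."""
    return backdrop * source

def screen(backdrop, source):
    """Photoshop-style 'screen' on one colour channel in the range 0..1."""
    return backdrop + source - backdrop * source

half_red = (0.5, 0.0, 0.0, 0.5)               # 50% opaque red, premultiplied
blue = (0.0, 0.0, 1.0, 1.0)
print(composite("over", half_red, blue), multiply(0.5, 0.5), screen(0.5, 0.5))
```

Both blend functions operate directly on the stored channel values, so they inherit the gamma issue mentioned above.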
| crtified wrote:
| This brings back horrible memories of working with large complex
| maps back in the 2000s. Having various CAD and GIS applications
| generate messy, inefficient spaghetti-coded PDF outputs - then
| bouncing those PDFs around the Adobe apps of the time, to add
| effects and other prettifications not available in the mapping
| apps.
|
| It would reach the point where things would start to break, and
| .... "good times were had, by all".
| schlowmo wrote:
| PDF is such a weird format. Not so long ago I had to write some
| Java code for manipulating PDFs: find a string, remove it and
| place an image at the former string position. I should have known
| better as I thought "Well, how hard can that be?"
|
| What followed was a deep dive down the rabbit hole, a lot of
| fiddling with the same tools the author of this gist is using
| trying to make sense of it all.
|
| The final solution worked better than I thought while at the same
| time felt incredibly wrong.
|
| I'm very thankful for all the (probably painful) work that went
| into those open source PDF tools.
| miki123211 wrote:
| What people often miss about PDF is that it's closer to an image
| format in some ways than to a Word document. Word documents, PDFs
| and images are in document editing what DAW projects, midis and
| mp3 files are in music and what Java source code, JVM bytecode
| and pure x86 machine code are in software.
|
| The primary purpose of a PDF file is to tell you what to display
| (or print), with perfect clarity, in much fewer bytes than an
| actual image would take. It exploits the fact that the document
| creator knows about patterns in the document structure that, if
| expressed properly, make the document much more compressible than
| anything that an actual image compression algorithm could
| accomplish. For example, if you have access to the actual font,
| it's better to say "put these characters at these coordinates
| with that much spacing between them" than to include every
| occurrence of every character as a part of the image, hoping that
| the compression algorithm notices and compresses away the
| repetitions. Things like what character is part of what word, or
| even what unicode codepoint is mapped to which font glyph are
| basically unimportant if all you're after is efficiently
| transferring the image of a document.
|
| If you have an editable document, you care a lot more about the
| semantics of the content, not just about its presentation. It
| matters to you whether a particular break in the text is supposed
| to be multiple spaces, the next column in a table or just a weird
| page layout caused by an image being present. If you have some
| text at the bottom of each page, you care whether that text was
| put there by the document author multiple times, or whether it
| was entered once and set as a footer. If you add a new paragraph
| and have to change page layout, it matters to you that the last
| paragraph on this page is a footnote and should not be moved to
| the next one. If a section heading moves to another page, you
| care about the fact that the table of contents should update
| automatically and isn't just some text that the author has
| manually entered. If you're a printer or screen, you care about
| none of these things, you just print or display whatever you're
| told to print or display. For a PDF, footnotes, section headings,
| footers or tables of contents don't have to be special, they can
| just be text with some meaningless formatting applied to it. This
| is why making PDF work for any purpose which isn't displaying or
| printing is never going to be 100% accurate. Of course, there are
| efforts to remedy this, and a PDF-creating program is free to
| include any metadata it sees fit, but it's by no means required
| to do so.
|
| This isn't necessarily the mental model that the PDF authors had
| in mind, but it's a useful way to look at PDF and understand why
| it is the way it is.
| eschaton wrote:
| Anybody trying to do this is missing the point of PDF: It's a
| _page-description format_ and therefore only represents the
| _marks on a page_, not _document structure_.
|
| One should not attempt to edit a PDF, one should edit the
| document from which the PDF is generated.
| lucb1e wrote:
| I'll stop trying to edit PDFs when people stop sending me PDFs
| that I want to edit.
|
| Somehow it became "unprofessional" to just send meant-to-be-
| editable documents around for everyone to enjoy, so this is
| where we end up...
| louthy wrote:
| > It's a page-description format and therefore only represents
| the marks on a page, not document structure
|
| Maybe they should have called it 'Page Description Format'
| then? Instead of 'Portable _Document_ Format'
| layer8 wrote:
| PDF does support incorporating information about the logical
| document structure, aka Tagged PDF. It's optional, but
| recommended for accessibility (e.g. PDF/UA). See chapters
| 14.7-14.8 in [1].
|
| [1] https://opensource.adobe.com/dc-acrobat-sdk-
| docs/pdfstandard...
| o1y32 wrote:
| "should not" is meaningless here, because in the real world
| there are tons of situations where people _want_ you to edit
| PDF, one way or another
| yair99dd wrote:
| Inkscape 1.2 multipage support is great for editing graphics and
| text in PDFs