[HN Gopher] Diff-pdf: tool to visually compare two PDFs
       ___________________________________________________________________
        
       Diff-pdf: tool to visually compare two PDFs
        
       Author : Olshansky
       Score  : 433 points
       Date   : 2024-07-02 07:26 UTC (15 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jaustin wrote:
       | We've been using this in the Micro:bit Educational Foundation
       | (microbit.org) to fill a gap in hardware design tooling, and get
       | visual diffs of our schematics and gerbers during PCB design
       | iterations. It's kinda wild that's what we ended up doing, but if
       | you want to be sure your radio layout didn't change at all when
       | you're making a minor revision to a different part of the board,
       | visual diffs are perfect.
       | 
       | That said, next project we want to try something more integrated
       | with EDA tools. If anyone else has followed this path, we'd love
       | to know.
        
       | rawbert wrote:
       | We use this tool in our team regularly for comparison of PDFs we
       | obtain from third party services that might have changed after
       | code-changes on our side. Big thanks to the author <3
        
       | canistel wrote:
       | Interestingly, Github thinks the project is 46% shell, due to the
       | fairly huge wxwin.m4.
        
         | infecto wrote:
         | I noticed this a while back with a private project of mine. The
         | Github languages breakdown seems broken. Mine is a Python
         | project with a handful of Jupyter notebooks but many many
         | python files. The LOC must be 80% python files but Github sees
         | the project as 50% Jupyter.
        
           | badlibrarian wrote:
           | You can tweak/exclude with .gitattributes
           | 
           | https://github.com/github-linguist/linguist/blob/master/docs...
        
       | Levitating wrote:
       | No screenshots?
        
         | colddevil wrote:
         | Here is one: https://vslavik.github.io/diff-pdf/
        
           | Tryk wrote:
           | Can anyone explain how to interpret that screenshot? It just
           | looks like a very blurry text to me.
        
             | pimlottc wrote:
             | It's showing you both PDFs overlaid on each other. The main
             | window looks blurry because the main text has shifted
             | vertically slightly. The regions that have changes are
             | highlighted in the thumbnails on the left.
             | 
             | I agree it's not the best initial example to demonstrate
             | the tool, but it does show how it can be used to detect
             | even minor spacing changes.
        
             | ska wrote:
             | If it was sharp, they would be identical. The "blurriness"
             | is doubling, where the lines are not quite aligned. Red
             | text there shows you content that is in one and not the
             | other.
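
One way to picture the overlay described above is a minimal sketch, assuming each page is already rasterized to a small grayscale grid (0 for ink, 255 for paper). The channel assignment here is an assumption chosen for illustration, not necessarily what diff-pdf does internally, and `overlay` and `classify` are hypothetical names.

```python
# Hypothetical sketch: pages are grayscale grids (0 = ink, 255 = paper).
Gray = list[list[int]]
RGB = tuple[int, int, int]


def overlay(page_a: Gray, page_b: Gray) -> list[list[RGB]]:
    """Put page A in the red channel and page B in green and blue."""
    return [
        [(a, b, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(page_a, page_b)
    ]


def classify(pixel: RGB) -> str:
    """Name what a single overlaid pixel tells us."""
    r, g, _ = pixel
    if r == g:
        return "same"  # neutral gray: both pages agree here
    # With this channel choice, ink only in B renders red, ink only in A cyan.
    return "only in B" if r > g else "only in A"


page_a = [[255, 0], [255, 255]]
page_b = [[255, 0], [0, 255]]
merged = overlay(page_a, page_b)
print([[classify(px) for px in row] for row in merged])
```

Where both pages agree the result stays gray and sharp; slightly misaligned text produces colored fringes, which is the "doubling" effect.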
        
       | asah wrote:
       | Crazy, I'd have thought that modern multi-modal LLMs can do this,
       | but when I tried Gemini, ChatGPT-4o and Claude they all pooped
       | out:
       | 
       | - Gemini at first only diff'd the text, and then when pushed it
       | identified the items in the images and then hallucinated the
       | differences between the versions. It could not produce an image
       | output.
       | 
       | - Claude only diff'd the text and refused to believe that there
       | were images in the PDFs.
       | 
       | - ChatGPT attempted to write and execute python code for this,
       | which errored out.
        
         | infecto wrote:
         | This is definitely not a strength for multi-modal LLM. Multi-
         | modal capabilities are still too flaky especially when looking
         | at a page of a PDF which can have multiple areas of focus.
        
         | ertgbnm wrote:
         | This may be the type of thing that LLMs are currently the worst
         | at. I'm not surprised at all.
        
         | ale42 wrote:
         | Visually comparing two PDFs is something a PC can do
         | deterministically without any resource (and energy) intensive
         | LLMs. People will soon use LLMs for things they are not
         | especially good or efficient at, like computing the sum of
         | numbers in an Excel table... (or are they doing it already?).
        
           | B1FF_PSUVM wrote:
           | As a bonus, they'll get a result that looks likely.
        
         | pmarreck wrote:
         | I would fully expect an LLM to not get natively good at this
         | but to know how to reach out to another tool in order to get
         | good at this
        
       | thibaut_barrere wrote:
       | I have been using this in a CI pipeline to maintain a business-
       | critical PDF generation (healthcare) app (started circa 2010 I
       | think), here are the RSpec helpers I'm using:
       | 
       | https://gist.github.com/thbar/d1ce2afef68bf6089aeae8d9ddc05d...
       | 
       | The code contains git-stored reference PDFs, and the test suite
       | regenerates them and asserts that nothing has changed.
       | 
       | Helped a lot to audit visual changes, or PDF library upgrades!
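
The same reference-PDF check can be sketched outside RSpec. The following is a minimal Python sketch, not the helpers from the gist: it assumes diff-pdf is on the PATH and relies on its documented behavior of exiting with 0 when the PDFs render identically and 1 when they differ. `pdfs_match` and the injectable `run` parameter are hypothetical names added for illustration and testing.

```python
import subprocess
from pathlib import Path


def pdfs_match(reference: Path, candidate: Path, diff_out: Path,
               run=subprocess.run) -> bool:
    """Return True when diff-pdf reports no visual differences.

    --output-diff writes a PDF that highlights the changes, so a failing
    test leaves something to review.
    """
    result = run(
        ["diff-pdf", f"--output-diff={diff_out}", str(reference), str(candidate)]
    )
    if result.returncode not in (0, 1):
        # Anything other than "same" or "different" is treated as a tool error.
        raise RuntimeError(f"diff-pdf failed with exit code {result.returncode}")
    return result.returncode == 0
```

In a test suite this might be used as `assert pdfs_match(Path("spec/ref/invoice.pdf"), Path("tmp/invoice.pdf"), Path("tmp/invoice-diff.pdf"))`, where the paths are placeholders.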
        
         | pmarreck wrote:
         | could you not just compare the source (or perhaps even the
         | hash) of the PDF and assert on that?
        
           | alexdoesh wrote:
           | Hashes can change regularly due to metadata. Source checks
           | may also require some filtration or preprocessing before
           | comparison. Visual comparison is the best option here,
           | especially if you have a complex document with multiple
           | third-party components that may change both the hash and
           | source but keep the visual appearance the same.
        
             | thibaut_barrere wrote:
             | In this case, we indeed have multiple components (although
             | not third-party), and being able to refactor those without
             | risk is quite nice.
        
           | jabroni_salad wrote:
           | If the document is a legally required disclosure (like a
           | bank's fee schedule for example) then you need to grade that
           | document directly rather than its source code. PDFs are
           | horrible and there is a lot that can go wrong with making
           | them between writing and publishing.
        
           | ydant wrote:
           | I use some custom tools for PDF comparison (visual, textual,
           | and perceptual hash) for my personal records/accounting
           | purposes.
           | 
           | A number of the financial and medical institutions I deal
           | with re-generate PDFs every time you request them, but the
           | content is 99-100% identical. Sometimes just a date changes.
           | So I use a perceptual hash and content comparison to automate
           | detecting truly new documents vs. ones that are only slightly
           | changed.
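
A perceptual hash of the kind mentioned above can be illustrated with the simple "average hash" technique. This is a sketch of that general technique, not the commenter's tooling, and it assumes the page has already been rasterized and downscaled to a small grayscale grid. `average_hash` and `hamming` are hypothetical helper names.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (p > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance means 'nearly the same page'."""
    return bin(a ^ b).count("1")


original = [[0, 0, 255, 255], [0, 0, 255, 255]]
new_date = [[0, 0, 255, 255], [0, 0, 255, 0]]  # one small region changed
print(hamming(average_hash(original), average_hash(new_date)))
```

A threshold on the Hamming distance then separates regenerated copies from genuinely new documents; the right threshold depends on the grid size and the documents.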
        
           | thibaut_barrere wrote:
           | the source sometimes changed for small internal reasons in
           | the library generating the PDF (prawn). So just comparing the
           | source would not give a clear cut answer. A visual comparison
           | has helped quite nicely over time.
        
           | knallfrosch wrote:
           | What should I do when the assertion fails - inspect the PDF
           | with my sad little caveman eyeball?
        
         | tylerflick wrote:
         | Are you using signed digests in the PDFs?
        
       | poidos wrote:
       | Reminds me of the tool Bob Nystrom wrote to help himself out when
       | working on the physical edition of Crafting Interpreters:
       | https://journal.stuffwithstuff.com/2020/04/05/crafting-craft...
       | 
       | Whole article is worth reading, but if you want the relevant bits
       | search for " I wrote a Dart script that would take a PDF of the
       | book".
        
       | jgalt212 wrote:
       | Thanks. I'll give this a shot to see if any counterparties try to
       | sneak in any last second changes to the executable version of the
       | doc.
        
       | akasakahakada wrote:
       | Use this to compare university textbook edition 8 and 9 before
       | buying.
        
         | ant6n wrote:
         | Uh how can you compare without buying? Or put another way, why
         | buy if you can compare?
        
           | cocodill wrote:
           | time machine research
        
           | akasakahakada wrote:
           | libgen exist bro
        
             | N0b8ez wrote:
             | But then why would you need to buy it?
        
               | Foobar8568 wrote:
               | Because for textbooks, paper is often superior.
        
       | smartmic wrote:
       | I like this tool better: https://www.qtrac.eu/diffpdf.html
       | 
       | It shows the differences in the GUI side-by-side instead of
       | overlaid.
        
         | invalidlogin wrote:
         | I use BeyondCompare 5 for this.
        
         | Tryk wrote:
         | From the github:
         | 
         | Another option is to compare the two files visually in a simple
         | GUI, using the --view argument:
         | 
         | $ diff-pdf --view a.pdf b.pdf
         | 
         | This opens a window that lets you view the files' pages and
         | zoom in on details. It is also possible to shift the two pages
         | relatively to each other using Ctrl-arrows (Cmd-arrows on
         | MacOS). This is useful for identifying translation-only
         | differences.
        
           | yencabulator wrote:
           | Shifting the offset is _very_ far from the experience of a
           | side-by-side diff, and more useful for nudging the images to
           | align them.
        
         | justinnk wrote:
         | There is also an open-source/free version of this [1], which I
         | use regularly. You can install it, e.g., in Fedora, with the
         | 'diffpdf' package. It is no longer maintained but works very
         | well, has a nice GUI with a side-by-side view, drag&drop
         | support, and both text and visual modes.
         | 
         | [1] https://www.qtrac.eu/diffpdf-foss.html
        
       | atum47 wrote:
       | back when I was writing my final paper I faced a similar issue,
       | needed to de-duplicate a bunch of PDF's, so I came up with a
       | simple solution
       | 
       | https://github.com/victorqribeiro/dtf
        
       | sva_ wrote:
       | Coincidentally I downloaded and tried using this just a while
       | ago. I was trying to see if it can identify an Elsevier
       | fingerprint between two pdfs. It can't, it only compares visible
       | things.
       | 
       | I used vbindiff instead.
        
       | deckar01 wrote:
       | I wrote a pixel-based visual diffing algorithm long ago that was
       | intended for a CI tool that finds all of the UI changes in a PR.
       | I broke the layout of a page I didn't even know existed as an
       | intern at Inkling and have had this idea in my head ever since.
       | 
       | https://github.com/deckar01/narcis
        
       | redman25 wrote:
       | I created a similar in-browser version a while back with
       | mozilla's pdf-js. The diff rendering is all run client side.
       | 
       | https://www.parepdf.com
       | 
       | The diff-pdf project was my inspiration but I wanted to create a
       | version that was distributable to non-programmers.
        
       | ck_one wrote:
       | Can anyone recommend a method to deduplicate pdfs? The hash is
       | often different but the content and metadata are 99.99% the same.
        
         | strangus wrote:
         | cp?
        
         | pixelmonkey wrote:
         | You might want to strip metadata before doing a comparison, using
         | exiftool. Even though exiftool was originally written for EXIF
         | metadata on JPGs, these days, it supports a lot of metadata
         | standards, including PDF. This command will do it assuming you
         | set filename=`basename your.pdf .pdf`:
         | 
         |     exiftool -all= -o ${filename}.stripped.pdf ${filename}.pdf
         | 
         | That won't help you with small differences in the contents, but
         | might help with small differences in metadata. Running `md5sum`
         | on the stripped PDF should give more reliable dedupe results.
         | 
         | I was recently working on a similar problem for JPG, RAW, and
         | MP4 files (photo/video backup) so it is fresh in my mind.
        
         | bob1029 wrote:
         | I would consider rasterizing the PDFs and then hashing the
         | resulting bitmaps.
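
The rasterize-then-hash idea could look roughly like the sketch below. It assumes poppler's `pdftoppm` is installed and that rendering is deterministic for a fixed tool version and resolution. `rasterize` and `content_digest` are hypothetical names, and the 72 dpi default is an arbitrary choice.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def rasterize(pdf: Path, dpi: int = 72) -> list[bytes]:
    """Render each page to a grayscale PGM file and return the raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-gray", str(pdf), str(Path(tmp) / "page")],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        return [p.read_bytes() for p in sorted(Path(tmp).glob("page-*.pgm"))]


def content_digest(pages: list[bytes]) -> str:
    """Hash the page bitmaps in order, so PDF metadata never enters the digest."""
    h = hashlib.sha256()
    for page in pages:
        h.update(len(page).to_bytes(8, "big"))  # length prefix keeps page boundaries distinct
        h.update(page)
    return h.hexdigest()
```

Two files are then duplicates when `content_digest(rasterize(a)) == content_digest(rasterize(b))`. This only catches pixel-identical renderings; tolerating small changes needs a perceptual hash instead.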
        
       | mikeyinternews wrote:
       | You can do this with Beyond Compare (it's not free, but not very
       | expensive either) https://www.scootersoftware.com/
        
         | Rinzler89 wrote:
         | Beyond Compare is one of those priceless tools I pay for myself
         | instead of waiting for my employer to pay for it.
         | Price/functionality wise it's worth its weight in gold, it's
         | cross platform, and its licensing is very liberal. There's just
         | no FOSS compare tools out there that can match BC.
        
           | hipnoizz wrote:
           | What are BC features that you find to be so great?
           | 
           | I'm genuinely curious - I've heard a lot about BC being 'the
           | tool' for diffing. I'm used to Meld, but my current employer
           | has a pretty strict policy on which tools can be used, so at
           | some point I managed to get a licence for an older version of
           | BC. But for some reason I found its UI and the way it works a
           | bit less optimal than what I was accustomed to. Since I'm
           | mostly doing text diffs these days, I usually use the diff tool
           | from IntelliJ IDEA (I have IDEA open all the time).
        
             | netol wrote:
             | In comparison, Meld is not stable, nor fast, especially for
             | big diffs. The UI is also more limited. Araxis Merge and
             | WinMerge are good alternatives
        
       | tomwheeler wrote:
       | In a previous job, I had to validate the output of an unreliable
       | production publishing system, so I tested dozens of PDF
       | comparison tools available at the time. The best I found was
       | called Delta Walker. It was proprietary commercial Mac-only
       | software, but reasonably inexpensive, accurate, and could handle
       | long PDFs with lots of graphics well.
       | 
       | I remember evaluating this diff-pdf tool and finding that it fell
       | short in some way, although it's been so long that I don't recall
       | the specifics. Most of them failed to identify changes or
       | reported false positives. I also remember being disappointed
       | since this one was open source and could easily be scripted.
        
         | pivo wrote:
         | It looks like Delta Walker's added Windows and Linux support:
         | https://www.deltawalker.com/download
        
         | mksreddy wrote:
         | Wouldn't exporting pages to images and using pixel diff
         | accurately identify differences in PDF's?
        
           | netsharc wrote:
           | I guess it depends on the use case. Imagine adding an extra
           | sentence in the second PDF, and this causes the paragraph to
           | have 6 instead of 5 lines, and the next paragraph begins a
           | line further down, and the last paragraph of that page ends
           | up in the next page, etc...
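
The reflow problem described above can be shown with a toy model in which each string stands in for one rendered line of a page. This is an illustrative sketch with made-up data: a position-by-position comparison (what a pixel diff effectively does) flags everything after the insertion, while a sequence-aware diff reports a single insert.

```python
import difflib


def changed_rows(page_a: list[str], page_b: list[str]) -> list[int]:
    """Indices whose rows differ when compared position by position."""
    height = max(len(page_a), len(page_b))
    return [
        i for i in range(height)
        if (page_a[i] if i < len(page_a) else None)
        != (page_b[i] if i < len(page_b) else None)
    ]


before = ["para 1", "para 2", "para 3", "para 4"]
after = ["para 1", "extra line", "para 2", "para 3", "para 4"]

# Positional comparison: every row from the insertion onward counts as changed.
print(changed_rows(before, after))  # → [1, 2, 3, 4]

# Sequence-aware comparison: only the inserted row is reported.
edits = [op for op in difflib.SequenceMatcher(None, before, after).get_opcodes()
         if op[0] != "equal"]
print(edits)
```

So a pixel diff is accurate about what moved, but it cannot tell one added sentence from a page that was rewritten from that point on.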
        
       | strangus wrote:
       | https://10052.ai has a tool that will visually compare
       | documents (PDFs, doc, image, etc.) and cluster them together. It
       | works amazingly well.
        
       | ydant wrote:
       | Related - this might be helpful to someone.
       | 
       | ImageMagick can do a visual PDF compare:
       | 
       |     magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"
       | 
       | (density = 100, $1 and $2 are the filenames to compare, $TMP the
       | output file)
       | 
       | You need to do some work to support multiple pages, so I use this
       | script:
       | 
       | https://gist.github.com/mbafford/7e6f3bef20fc220f68e467589bb...
       | 
       | This also uses `imgcat` to show the difference directly in the
       | terminal.
       | 
       | You can also use ImageMagick to get a perceptual hash difference
       | using something like:
       | 
       |     convert -metric phash "$1" null: "$2" -compose Difference -layers composite -format '%[fx:mean]\n' info:
       | 
       | I use the fact you can configure git to use custom diff tools and
       | take advantage of this with the following in my .gitconfig:
       | 
       |     [diff "pdf"]
       |         command = ~/bin/git-diff-pdf
       | 
       | And in my .gitattributes I enable the above with:
       | 
       |     *.pdf binary diff=pdf
       | 
       | ~/bin/git-diff-pdf does a diff of the output of `pdftotext
       | -layout` (from poppler) and also runs pdf-compare-phash.
       | 
       | To use this custom diff with `git show`, you need to add an extra
       | argument (`git show --ext-diff`), but it uses it automatically if
       | running `git diff`.
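
For readers who want to try the git integration, here is a minimal sketch of what a script in the role of `~/bin/git-diff-pdf` might contain. It is not the commenter's script: it covers only the `pdftotext -layout` text diff, skips the perceptual-hash step, and assumes poppler's `pdftotext` is installed. `pdf_text` and `text_diff` are hypothetical names.

```python
#!/usr/bin/env python3
import difflib
import subprocess
import sys


def pdf_text(path: str) -> list[str]:
    """Extract layout-preserving text; '-' makes pdftotext write to stdout."""
    if path == "/dev/null":  # git passes /dev/null for added or deleted files
        return []
    out = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines(keepends=True)


def text_diff(old: list[str], new: list[str], name: str) -> str:
    """Unified diff of the extracted text, labeled with the repository path."""
    return "".join(difflib.unified_diff(old, new, f"a/{name}", f"b/{name}"))


if __name__ == "__main__":
    # Git calls an external diff driver with seven arguments:
    # path old-file old-hex old-mode new-file new-hex new-mode
    if len(sys.argv) == 8:
        path, old_file, new_file = sys.argv[1], sys.argv[2], sys.argv[5]
        sys.stdout.write(text_diff(pdf_text(old_file), pdf_text(new_file), path))
```

The script needs to be executable and referenced from the `[diff "pdf"]` section for git to pick it up.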
        
         | bigfatfrock wrote:
         | Next level, especially with the git attribute calls, well
         | played.
         | 
         | I'm still blown away how powerful imagemagick is after using it
         | for a decade or two, what an inspiring piece of open source
         | software.
        
           | Bluestein wrote:
           | imagemagick really is magical.-
        
       | fwn wrote:
       | I really like the overlay view and that it is not cloud based.
       | Will try to test it at work.
       | 
       | I rely heavily on PDF comparison via PDF-XChange Editor, which is
       | accurate for text, but often has trouble highlighting visual
       | changes correctly.
        
       | TacticalCoder wrote:
       | This reminds me of a book author who posted here IIRC. He had a
       | little tool allowing him to quickly compare two revisions of his
       | book. For example, to make sure fixed typos didn't wreak havoc.
       | I remember his tool would show in red what had changed on pages
       | thumbnails.
        
       | riedel wrote:
       | I always used DiffPDF, only to read on their website [1]:
       | 
       | > in the view of the EU's Cyber Resilience Act and an abundance of
       | > caution, we have withdrawn all our free software
       | 
       | Good to see post-cyberresilience alternatives :)
       | 
       | PDF diffs are really great for versioning/comparing PCB-Designs.
       | (The only real use case I had 15 yrs back)
       | 
       | [1] http://www.qtrac.eu/diffpdf-foss.html
        
         | yencabulator wrote:
         | What a convenient excuse for them to try to get people to
         | switch to their proprietary fork.
         | 
         | I genuinely need a side-by-side PDF comparison tool, and the
         | diff-pdf tool linked from the main link doesn't do that. Any
         | thoughts?
        
       | simonw wrote:
       | This inspired me to have Claude 3.5 Sonnet knock out a quick web
       | page prototype for me, using PDF.js to load and render the PDFs
       | to canvas elements and then display visual diffs between their
       | pages.
       | 
       | Two prompts:
       | 
       |     Build a tool where I can drag and drop on two PDF files and
       |     it uses PDF.js to turn each of their pages into canvas
       |     elements and then displays those pages side by side with a
       |     third image that highlights any differences between them, if
       |     any differences exist
       | 
       |     rewrite that code to not use React at all
       | 
       | Here's the result: https://tools.simonwillison.net/compare-pdfs
       | 
       | It actually works quite well! Screenshot here:
       | https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5...
        
         | radicality wrote:
         | What's the best way to effectively make use of Claude 3.5 ? I
         | signed up a few days ago for the api access. Besides
         | console.anthropic.com , do you recommend any other tools that I
         | can run locally to give it api key and use claude effectively ?
        
           | simonw wrote:
           | The web interface gives you access to the Artifacts feature
           | where it can build SPAs and render them in the browser.
           | 
           | For terminal access I like using my own
           | https://llm.datasette.io/ tool with the
           | https://github.com/simonw/llm-claude-3 plugin
           | 
           | For Python library access I recommend checking out Claudette:
           | https://www.answer.ai/posts/2024-06-21-claudette.html
        
         | kenjackson wrote:
         | How much additional work did you have to do to get it into this
         | form? That's impressive work using those prompts.
        
           | simonw wrote:
           | Almost no extra work at all. You can see the full transcript
           | here: https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5...
           | - it really was just those two prompts, then I
           | copied the result out into a document to test it.
           | 
           | I modified the HTML a tiny bit before publishing it - I set
           | the font to Helvetica and added the note at the bottom of the
           | page showing the prompt I used.
           | 
           | The whole project took less than 5 minutes - then another 10
           | to write it up.
        
       | v101 wrote:
       | Typical software dude: A wall of text, not one screenshot of a
       | generated diff PDF.
        
       | crocal wrote:
       | I will just chime in to mention Draftable
       | (https://www.draftable.com/compare). It really works well. It's
       | not so easy to have a visually comfortable diff of two PDFs.
        
       ___________________________________________________________________
       (page generated 2024-07-02 23:00 UTC)