[HN Gopher] Diff-pdf: tool to visually compare two PDFs
___________________________________________________________________
Diff-pdf: tool to visually compare two PDFs
Author : Olshansky
Score : 433 points
Date : 2024-07-02 07:26 UTC (15 hours ago)
(HTM) web link (github.com)
| jaustin wrote:
| We've been using this in the Micro:bit Educational Foundation
| (microbit.org) to fill a gap in hardware design tooling, and get
| visual diffs of our schematics and gerbers during PCB design
| iterations. It's kinda wild that's what we ended up doing, but if
| you want to be sure your radio layout didn't change at all when
| you're making a minor revision to a different part of the board,
| visual diffs are perfect.
|
| That said, next project we want to try something more integrated
| with EDA tools. If anyone else has followed this path, we'd love
| to know.
| rawbert wrote:
| We use this tool in our team regularly for comparison of PDFs we
| obtain from third party services that might have changed after
| code-changes on our side. Big thanks to the author <3
| canistel wrote:
| Interestingly, Github thinks the project is 46% shell, due to the
| fairly huge wxwin.m4.
| infecto wrote:
| I noticed this a while back with a private project of mine. The
| Github languages breakdown seems broken. Mine is a Python
| project with a handful of Jupyter notebooks but many many
| python files. The LOC must be 80% python files but Github sees
| the project as 50% Jupyter.
| badlibrarian wrote:
| You can tweak/exclude with .gitattributes
|
| https://github.com/github-linguist/linguist/blob/master/docs...
| Levitating wrote:
| No screenshots?
| colddevil wrote:
| Here is one: https://vslavik.github.io/diff-pdf/
| Tryk wrote:
| Can anyone explain how to interpret that screenshot? It just
| looks like a very blurry text to me.
| pimlottc wrote:
| It's showing you both PDFs overlaid on each other. The main
| window looks blurry because the main text has shifted
| vertically slightly. The regions that have changes are
| highlighted in the thumbnails on the left.
|
| I agree it's not the best initial example to demonstrate
| the tool, but it does show how it can be used to detect
| even minor spacing changes.
| ska wrote:
| If it was sharp, they would be identical. The "blurriness"
| is doubling, where the lines are not quite aligned. Red
| text there shows you content that is in one and not the
| other.
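|
| A minimal sketch of the overlay idea described above, not
| diff-pdf's actual code. It assumes each page is already
| rasterized to a grayscale grid (0 = ink, 255 = paper) and puts
| one page in the red channel and the other in green and blue.
| Where the pages agree the pixel stays neutral; where they differ
| it turns red or cyan. The names `overlay` and `describe` are
| made up for illustration.

```python
# Illustrative only: pages are tiny grayscale grids, 0 = ink, 255 = paper.

def overlay(page_a, page_b):
    """Combine two same-sized pages: A drives red, B drives green and blue."""
    return [
        [(a, b, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(page_a, page_b)
    ]

def describe(pixel):
    """Classify one overlaid RGB pixel."""
    r, g, _ = pixel
    if r == g:
        return "same"  # neutral gray: both pages agree here
    # Red means A is paper and B is ink; cyan means the reverse.
    return "only in B" if r > g else "only in A"

# The same one-pixel "glyph", shifted down by one row in page B.
page_a = [[255, 0, 255],
          [255, 255, 255]]
page_b = [[255, 255, 255],
          [255, 0, 255]]

merged = overlay(page_a, page_b)
print(describe(merged[0][1]))  # only in A
print(describe(merged[1][1]))  # only in B
print(describe(merged[0][0]))  # same
```

| A small vertical shift therefore shows up as two coloured
| copies of every glyph, which is the "doubling" in the
| screenshot.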
| asah wrote:
| Crazy, I'd have thought that modern multi-modal LLMs can do this,
| but when I tried Gemini, ChatGPT-4o and Claude they all pooped
| out:
|
| - Gemini at first only diff'd the text, and then when pushed it
| identified the items in the images and then hallucinated the
| differences between the versions. It could not produce an image
| output.
|
| - Claude only diff'd the text and refused to believe that there
| were images in the PDFs.
|
| - ChatGPT attempted to write and execute python code for this,
| which errored out.
| infecto wrote:
| This is definitely not a strength of multi-modal LLMs. Multi-
| modal capabilities are still too flaky, especially when looking
| at a page of a PDF, which can have multiple areas of focus.
| ertgbnm wrote:
| This may be the type of thing that LLMs are currently the worst
| at. I'm not surprised at all.
| ale42 wrote:
| Visually comparing two PDFs is something a PC can do
| deterministically without any resource (and energy) intensive
| LLMs. People will soon use LLMs for things they are not
| especially good or efficient at, like computing the sum of
| numbers in an Excel table... (or are they doing it already?).
| B1FF_PSUVM wrote:
| As a bonus, they'll get a result that looks likely.
| pmarreck wrote:
| I would fully expect an LLM to not get natively good at this
| but to know how to reach out to another tool in order to get
| good at this
| thibaut_barrere wrote:
| I have been using this in a CI pipeline to maintain a business-
| critical PDF generation (healthcare) app (started circa 2010 I
| think), here are the RSpec helpers I'm using:
|
| https://gist.github.com/thbar/d1ce2afef68bf6089aeae8d9ddc05d...
|
| The code contains git-stored reference PDFs, and the test suite
| re-generates them and asserts that nothing has changed.
|
| Helped a lot to audit visual changes, or PDF library upgrades!
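|
| A hedged sketch of the same CI check in Python rather than
| RSpec (this is not the code from the gist). It assumes a
| `diff-pdf` binary on PATH and relies on two things from the
| diff-pdf README: the exit code is 0 when the PDFs match and 1
| when they differ, and `--output-diff=FILE` writes a PDF that
| highlights the differences. The function name `pdfs_match` and
| the injectable `run` parameter are illustrative choices.

```python
import subprocess

def pdfs_match(reference, candidate, diff_out=None, run=subprocess.run):
    """Return True when diff-pdf reports no visual differences.

    Any non-zero exit code (a difference, or an error such as a
    missing file) is treated as "does not match".
    """
    cmd = ["diff-pdf"]
    if diff_out is not None:
        # Ask diff-pdf to write a highlighted PDF for manual review.
        cmd.append(f"--output-diff={diff_out}")
    cmd += [reference, candidate]
    return run(cmd).returncode == 0

# Hypothetical usage inside a test:
#   assert pdfs_match("spec/reference/invoice.pdf", "tmp/invoice.pdf",
#                     diff_out="tmp/invoice.diff.pdf")
```

| When the assertion fails, the file passed as `diff_out` is the
| thing to open and inspect.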
| pmarreck wrote:
| could you not just compare the source (or perhaps even the
| hash) of the PDF and assert on that?
| alexdoesh wrote:
| Hashes can change regularly due to metadata. Source checks
| may also require some filtration or preprocessing before
| comparison. Visual comparison is the best option here,
| especially if you have a complex document with multiple
| third-party components that may change both the hash and
| source but keep the visual appearance the same.
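|
| A toy illustration of the hash problem, using made-up byte
| strings rather than real PDFs: two "documents" with identical
| content but a different creation date hash differently, and
| only match after the metadata line is filtered out. The
| `content_only` filter is a stand-in for whatever preprocessing a
| real pipeline would need.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def content_only(data: bytes) -> bytes:
    """Drop the (hypothetical) metadata line before hashing."""
    return b"\n".join(
        line for line in data.split(b"\n")
        if not line.startswith(b"/CreationDate")
    )

# Same page content, generated on two different days.
body = b"BT /F1 12 Tf (Fee schedule) Tj ET"
pdf_monday = b"/CreationDate (D:20240701)\n" + body
pdf_tuesday = b"/CreationDate (D:20240702)\n" + body

print(digest(pdf_monday) == digest(pdf_tuesday))  # False
print(digest(content_only(pdf_monday)) == digest(content_only(pdf_tuesday)))  # True
```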
| thibaut_barrere wrote:
| In this case, we indeed have multiple components (although
| not third-party), and being able to refactor those without
| risk is quite nice.
| jabroni_salad wrote:
| If the document is a legally required disclosure (like a
| bank's fee schedule for example) then you need to grade that
| document directly rather than its source code. PDFs are
| horrible and there is a lot that can go wrong with making
| them between writing and publishing.
| ydant wrote:
| I use some custom tools for PDF comparison (visual, textual,
| and perceptual hash) for my personal records/accounting
| purposes.
|
| A number of the financial and medical institutions I deal
| with re-generate PDFs every time you request them, but the
| content is 99-100% identical. Sometimes just a date changes.
| So I use a perceptual hash and content comparison to automate
| detecting truly new documents vs. ones that are only slightly
| changed.
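|
| For readers unfamiliar with perceptual hashes, here is a
| minimal sketch of one simple variant (an "average hash"); it is
| not the custom tooling described above. It assumes the page has
| already been rasterized and shrunk to a small grayscale grid.
| Each cell becomes one bit (brighter than the mean or not), and
| two pages are compared by counting differing bits.

```python
def average_hash(pixels):
    """One bit per cell: 1 if the cell is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))

page = [[255, 255, 0, 0],
        [255, 255, 0, 0],
        [0, 0, 255, 255],
        [0, 0, 255, 255]]

# Re-generated copy: slight rendering noise plus one changed cell (say, a date).
regenerated = [row[:] for row in page]
regenerated[0][0] = 250
regenerated[0][2] = 200

# A genuinely different document (here simply the inverse image).
unrelated = [[255 - p for p in row] for row in page]

print(hamming(average_hash(page), average_hash(regenerated)))  # 1
print(hamming(average_hash(page), average_hash(unrelated)))    # 16
```

| A small distance suggests "same document, slightly changed"; the
| threshold for "truly new" is a tuning choice.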
| thibaut_barrere wrote:
| the source sometimes changed for small internal reasons in
| the library generating the PDF (prawn). So just comparing the
| source would not give a clear cut answer. A visual comparison
| has helped quite nicely over time.
| knallfrosch wrote:
| What should I do when the assertion fails - inspect the PDF
| with my sad little caveman eyeball?
| tylerflick wrote:
| Are you using signed digests in the PDFs?
| poidos wrote:
| Reminds me of the tool Bob Nystrom wrote to help himself out when
| working on the physical edition of Crafting Interpreters:
| https://journal.stuffwithstuff.com/2020/04/05/crafting-craft...
|
| Whole article is worth reading, but if you want the relevant bits
| search for "I wrote a Dart script that would take a PDF of the
| book".
| jgalt212 wrote:
| Thanks. I'll give this a shot to see if any counterparties try to
| sneak in any last second changes to the executable version of the
| doc.
| akasakahakada wrote:
| Use this to compare university textbook edition 8 and 9 before
| buying.
| ant6n wrote:
| Uh how can you compare without buying? Or put another way, why
| buy if you can compare?
| cocodill wrote:
| time machine research
| akasakahakada wrote:
| libgen exist bro
| N0b8ez wrote:
| But then why would you need to buy it?
| Foobar8568 wrote:
| Because for textbooks, paper is often superior.
| smartmic wrote:
| I like this tool better: https://www.qtrac.eu/diffpdf.html
|
| It shows the differences in the GUI side-by-side instead of
| overlaid.
| invalidlogin wrote:
| I use BeyondCompare 5 for this.
| Tryk wrote:
| From the github:
|
| Another option is to compare the two files visually in a simple
| GUI, using the --view argument:
|
| $ diff-pdf --view a.pdf b.pdf
|
| This opens a window that lets you view the files' pages and
| zoom in on details. It is also possible to shift the two pages
| relatively to each other using Ctrl-arrows (Cmd-arrows on
| MacOS). This is useful for identifying translation-only
| differences.
| yencabulator wrote:
| Shifting the offset is _very_ far from the experience of a
| side-by-side diff, and more useful for nudging the images to
| align them.
| justinnk wrote:
| There is also an open-source/free version of this [1], which I
| use regularly. You can install it, e.g., in Fedora, with the
| 'diffpdf' package. It is no longer maintained but works very
| well, has a nice GUI with a side-by-side view, drag&drop
| support, and both text and visual modes.
|
| [1] https://www.qtrac.eu/diffpdf-foss.html
| atum47 wrote:
| Back when I was writing my final paper I faced a similar issue:
| I needed to de-duplicate a bunch of PDFs, so I came up with a
| simple solution:
|
| https://github.com/victorqribeiro/dtf
| sva_ wrote:
| Coincidentally I downloaded and tried using this just a while
| ago. I was trying to see if it can identify an Elsevier
| fingerprint between two pdfs. It can't, it only compares visible
| things.
|
| I used vbindiff instead.
| deckar01 wrote:
| I wrote a pixel-based visual diffing algorithm long ago that was
| intended for a CI tool that finds all of the UI changes in a PR.
| I broke the layout of a page I didn't even know existed as an
| intern at Inkling and have had this idea in my head ever since.
|
| https://github.com/deckar01/narcis
| redman25 wrote:
| I created a similar in-browser version a while back with
| mozilla's pdf-js. The diff rendering is all run client side.
|
| https://www.parepdf.com
|
| The diff-pdf project was my inspiration but I wanted to create a
| version that was distributable to non-programmers.
| ck_one wrote:
| Can anyone recommend a method to deduplicate PDFs? The hash is
| often different but the content and metadata are 99.99% the same.
| strangus wrote:
| cp?
| pixelmonkey wrote:
| You might want to strip metadata before doing a comparison, using
| exiftool. Even though exiftool was originally written for EXIF
| metadata on JPGs, these days, it supports a lot of metadata
| standards, including PDF. This command will do it assuming you
| set filename=`basename your.pdf .pdf`:
| exiftool -all= -o ${filename}.stripped.pdf ${filename}.pdf
|
| That won't help you with small differences in the contents, but
| might help with small differences in metadata. Running `md5sum`
| on the stripped PDF should give more reliable dedupe results.
|
| I was recently working on a similar problem for JPG, RAW, and
| MP4 files (photo/video backup) so it is fresh in my mind.
| bob1029 wrote:
| I would consider rasterizing the PDFs and then hashing the
| resulting bitmaps.
| mikeyinternews wrote:
| You can do this with Beyond Compare (it's not free, but not very
| expensive either) https://www.scootersoftware.com/
| Rinzler89 wrote:
| Beyond Compare is one of those priceless tools I pay for myself
| instead of waiting for my employer to pay for it.
| Price/functionality wise it's worth its weight in gold, it's
| cross platform, and its licensing is very liberal. There's just
| no FOSS compare tools out there that can match BC.
| hipnoizz wrote:
| What are BC features that you find to be so great?
|
| I'm genuinely curious - I've heard a lot about BC being 'the
| tool' for diffing. I'm used to Meld, but my current employer has
| a pretty strict policy on which tools can be used, so at some
| point I managed to get a licence for an older version of BC.
| But for some reason I found its UI and the way it works a bit
| less optimal than what I was accustomed to. Since I'm using it
| primarily for text diffs these days, I usually use the diff tool
| from IntelliJ IDEA (I have IDEA open all the time).
| netol wrote:
| In comparison, Meld is not stable, nor fast, especially for
| big diffs. The UI is also more limited. Araxis Merge and
| WinMerge are good alternatives
| tomwheeler wrote:
| In a previous job, I had to validate the output of an unreliable
| production publishing system, so I tested dozens of PDF
| comparison tools available at the time. The best I found was
| called Delta Walker. It was proprietary commercial Mac-only
| software, but reasonably inexpensive, accurate, and could handle
| long PDFs with lots of graphics well.
|
| I remember evaluating this diff-pdf tool and finding that it fell
| short in some way, although it's been so long that I don't recall
| the specifics. Most of them failed to identify changes or
| reported false positives. I also remember being disappointed
| since this one was open source and could easily be scripted.
| pivo wrote:
| It looks like Delta Walker's added Windows and Linux support:
| https://www.deltawalker.com/download
| mksreddy wrote:
| Wouldn't exporting pages to images and using pixel diff
| accurately identify differences in PDFs?
| netsharc wrote:
| I guess it depends on the use case. Imagine adding an extra
| sentence in the second PDF, and this causes the paragraph to
| have 6 instead of 5 lines, and the next paragraph begins a
| line further down, and the last paragraph of that page ends
| up in the next page, etc...
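|
| The reflow problem can be shown on plain text, as a rough
| analogy for page images. In this sketch (illustrative names,
| standard-library `difflib` only), a position-by-position
| comparison flags everything after the insertion, while an
| alignment-based diff reports just the one added line.

```python
import difflib

def positional_changes(old, new):
    """Compare item i with item i, the way a pixel diff compares pages."""
    mismatches = sum(a != b for a, b in zip(old, new))
    return mismatches + abs(len(old) - len(new))

def aligned_changes(old, new):
    """Align the sequences first and keep only the non-equal opcodes."""
    matcher = difflib.SequenceMatcher(None, old, new)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

old = ["Intro", "Para one", "Para two", "Para three"]
new = ["Intro", "Added sentence", "Para one", "Para two", "Para three"]

print(positional_changes(old, new))  # 4
print(aligned_changes(old, new))     # [('insert', 1, 1, 1, 2)]
```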
| strangus wrote:
| https://10052.ai has a tool that will visually compare
| documents (PDFs, docs, images, etc.) and cluster them together. It
| works amazingly well.
| ydant wrote:
| Related - this might be helpful to someone.
|
| ImageMagick can do a visual PDF compare:
|     magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"
|
| (density = 100, $1 and $2 are the filenames to compare, $TMP the
| output file)
|
| You need to do some work to support multiple pages, so I use this
| script:
|
| https://gist.github.com/mbafford/7e6f3bef20fc220f68e467589bb...
|
| This also uses `imgcat` to show the difference directly in the
| terminal.
|
| You can also use ImageMagick to get a perceptual hash difference
| using something like:
|     convert -metric phash "$1" null: "$2" -compose Difference -layers composite -format '%[fx:mean]\n' info:
|
| I use the fact you can configure git to use custom diff tools and
| take advantage of this with the following in my .gitconfig:
|     [diff "pdf"]
|         command = ~/bin/git-diff-pdf
|
| And in my .gitattributes I enable the above with:
| *.pdf binary diff=pdf
|
| ~/bin/git-diff-pdf does a diff of the output of `pdftotext
| -layout` (from poppler) and also runs pdf-compare-phash.
|
| To use this custom diff with `git show`, you need to add an extra
| argument (`git show --ext-diff`), but it uses it automatically if
| running `git diff`.
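|
| A hypothetical sketch of what a script in the role of
| ~/bin/git-diff-pdf could look like, written in Python; it is
| not the script described above and it skips the perceptual-hash
| step. It assumes poppler's `pdftotext` is installed and relies
| on git's documented calling convention for external diff
| commands (seven arguments: path, old-file, old-hex, old-mode,
| new-file, new-hex, new-mode).

```python
#!/usr/bin/env python3
import difflib
import subprocess
import sys

def pdf_text(path):
    """Extract layout-preserving text; "-" makes pdftotext write to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def text_diff(old_lines, new_lines, name):
    """Unified diff of two lists of lines, labelled like git's a/ and b/."""
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=f"a/{name}", tofile=f"b/{name}", lineterm="",
    ))

def main(argv):
    # argv: script, path, old-file, old-hex, old-mode, new-file, new-hex, new-mode
    path, old_file, new_file = argv[1], argv[2], argv[5]
    # Limitation of this sketch: added or deleted files (where git passes
    # /dev/null for one side) are not handled.
    print("\n".join(text_diff(pdf_text(old_file), pdf_text(new_file), path)))

if __name__ == "__main__" and len(sys.argv) == 8:
    main(sys.argv)
```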
| bigfatfrock wrote:
| Next level, especially with the git attribute calls, well
| played.
|
| I'm still blown away how powerful imagemagick is after using it
| for a decade or two, what an inspiring piece of open source
| software.
| Bluestein wrote:
| imagemagick really is magical.-
| fwn wrote:
| I really like the overlay view and that it is not cloud based.
| Will try to test it at work.
|
| I rely heavily on PDF comparison via PDF-XChange Editor, which is
| accurate for text, but often has trouble highlighting visual
| changes correctly.
| TacticalCoder wrote:
| This reminds me of a book author who posted here IIRC. He had a
| little tool allowing him to quickly compare two revisions of his
| book, for example to make sure that fixing typos didn't wreak
| havoc. I remember his tool would show in red what had changed on
| page thumbnails.
| riedel wrote:
| I always used DiffPDF, only to read on their website:
|
| > in the view of the EU's Cyber Resilience Act and an abundance of
| > caution, we have withdrawn all our free software
|
| [1]
|
| Good to see post-cyberresilience alternatives :)
|
| PDF diffs are really great for versioning/comparing PCB-Designs.
| (The only real use case I had 15 yrs back)
|
| [1] http://www.qtrac.eu/diffpdf-foss.html
| yencabulator wrote:
| What a convenient excuse for them to try to get people to
| switch to their proprietary fork.
|
| I genuinely need a side-by-side PDF comparison tool, and the
| diff-pdf tool linked from the main link doesn't do that. Any
| thoughts?
| simonw wrote:
| This inspired me to have Claude 3.5 Sonnet knock out a quick web
| page prototype for me, using PDF.js to load and render the PDFs
| to canvas elements and then display visual diffs between their
| pages.
|
| Two prompts:
|
|     Build a tool where I can drag and drop on two PDF files and
|     it uses PDF.js to turn each of their pages into canvas
|     elements and then displays those pages side by side with a
|     third image that highlights any differences between them, if
|     any differences exist
|
|     rewrite that code to not use React at all
|
| Here's the result: https://tools.simonwillison.net/compare-pdfs
|
| It actually works quite well! Screenshot here:
| https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5...
| radicality wrote:
| What's the best way to make effective use of Claude 3.5? I
| signed up a few days ago for API access. Besides
| console.anthropic.com, do you recommend any other tools that I
| can run locally, give an API key, and use Claude effectively?
| simonw wrote:
| The web interface gives you access to the Artifacts feature
| where it can build SPAs and render them in the browser.
|
| For terminal access I like using my own
| https://llm.datasette.io/ tool with the
| https://github.com/simonw/llm-claude-3 plugin
|
| For Python library access I recommend checking out Claudette:
| https://www.answer.ai/posts/2024-06-21-claudette.html
| kenjackson wrote:
| How much additional work did you have to do to get it into this
| form? That's impressive work using those prompts.
| simonw wrote:
| Almost no extra work at all. You can see the full transcript
| here: https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5...
| It really was just those two prompts, then I copied the result
| out into a document to test it.
|
| I modified the HTML a tiny bit before publishing it - I set
| the font to Helvetica and added the note at the bottom of the
| page showing the prompt I used.
|
| The whole project took less than 5 minutes - then another 10
| to write it up.
| v101 wrote:
| Typical software dude: A wall of text, not one screenshot of a
| generated diff PDF.
| crocal wrote:
| I will just chime in to mention Draftable
| (https://www.draftable.com/compare). It really works well. It's
| not so easy to have a visually comfortable diff of two PDFs.
___________________________________________________________________
(page generated 2024-07-02 23:00 UTC)