https://eclecticlight.co/2021/08/14/how-to-compare-two-pdf-documents/

The Eclectic Light Company
Macs, painting, and more

hoakley
August 14, 2021
General, Language, Macs, Technology

How to compare two PDF documents

There are some fundamental tasks we need to do with most if not all documents. One of them is to compare two versions of what are essentially the same document. These might be legal agreements, or revisions of a report, which are quite likely now to come in PDF format. This article explores how you can compare the contents of two PDF files, or perhaps why you can't.

Comparing PDFs isn't a feature you're likely to find in apps which otherwise have rich support for the document format. It's more likely that they'll offer some form of redaction but not the ability to make any comparison between two documents. Try Adobe Acrobat Reader, and the tool will be offered, but the only way to obtain it is to upgrade to the full Adobe Acrobat DC, on a monthly subscription. That's an offer that most will wisely refuse.

Compare text

A free solution is to export each of the documents in the form of text, and use a powerful text editor like BBEdit to compare those text documents. If you have Apple's free Xcode installed, you could use its FileMerge app, which is hidden away inside the app bundle and accessed through the Open Developer Tool command in the Xcode menu, but I prefer BBEdit's Find Differences... command in its Search menu.

You'll then discover how variable the text exported from PDF files can be. One experiment worth trying is to make a copy of a text-rich PDF document and open and save it a few times using different apps, but without changing any of its content. This can move chunks of text around, even though when you view the PDF it clearly hasn't changed at all.
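To illustrate why moved chunks matter when comparing exported text, here is a minimal Python sketch. It assumes both PDFs have already been exported to plain text by some other tool, and the function names `sentences` and `content_changes` are hypothetical, chosen only for this example. It collapses whitespace, splits the text into sentences, and compares the two sets without regard to position, so a sentence that has merely moved isn't reported. It isn't what BBEdit or FileMerge do, and it will still be fooled when an export breaks a sentence across separate chunks.

```python
import re
from collections import Counter


def sentences(text: str) -> list[str]:
    """Split exported text into whitespace-normalised sentences."""
    # Collapse line breaks and runs of spaces, which vary between exports.
    flat = " ".join(text.split())
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # abbreviations and numbered clauses will confuse it).
    return [part for part in re.split(r"(?<=[.!?])\s+", flat) if part]


def content_changes(old_text: str, new_text: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) sentences, ignoring where each one sits."""
    old_counts = Counter(sentences(old_text))
    new_counts = Counter(sentences(new_text))
    # Counter subtraction keeps only positive counts, so sentences present
    # in both texts cancel out even if they appear in a different order.
    removed = list((old_counts - new_counts).elements())
    added = list((new_counts - old_counts).elements())
    return removed, added


old = "First clause.  Second\nclause. Third clause."
new = "Third clause. First clause. Second clause changed."
removed, added = content_changes(old, new)
print("removed:", removed)  # removed: ['Second clause.']
print("added:", added)      # added: ['Second clause changed.']
```

Comparing without regard to order trades one problem for another: it hides text that has genuinely been moved, which in a legal agreement may itself be a significant change.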
So, although you should be able to find all the content, you're likely to have plenty of false positives, where there are differences between exported text, but not in what you see in the documents themselves.

Paid-for Acrobat

As far as I can see, the only 'serious' feature which can compare PDF files is that in the paid-for version of Adobe Acrobat DC. Reaching for my copy, I put it through its paces and discovered that it too is of only limited use for this task. Apart from its standard Martian interface, which is thankfully peculiar to Acrobat, small differences between PDFs often trigger hundreds of differences that are reported by Acrobat. If you've got all day to work through each page, it might be just the job, but if you want a clean and simple list of differences, you're likely to be out of luck.

To test this, I took a text document with numbered lines, as is common with many legal documents, and printed it to PDF. I then made a handful of small changes to it, turned that into PDF, and compared the two results.

Because Acrobat has no sense of any underlying structure, where the minor changes in the text had caused renumbering of lines, Acrobat flagged every single line as being different. It also picked up all changes in page layout which didn't involve any change in content: the removal of a single line on the first page of a document thus effectively made the rest of the document a long and tedious series of changes too.

One strength, though, is that Acrobat is reliable at reporting when documents haven't changed, even though text exported from them has changed in its structure. Beyond that, I didn't find Acrobat much help, as it just overwhelmed me with irrelevant differences.

Room for improvement?

Given the popularity of PDF documents, you'd imagine there's strong demand for something better for comparisons.
However, any solution is doomed to fail unless it can overcome a fundamental design limitation of the PDF format: it doesn't store content in any form of semantic structure, but merely what's needed to make each page look right. You can alter that by manually flowing each block of text together, a procedure necessary for some types of PDF which need to be compatible with text readers, for instance, but hardly anyone bothers to do that, and it's exceptional to discover documents which have been so structured.

Within a PDF file are as many as tens of thousands of objects, each of which contains the code to generate part of a page. If you were to set one word in a paragraph and style it using a different font and weight, the PDF engine may decide to split it out as another object to be placed on that page. But there's no semantic link between those objects, and individual PDF writers can even place each word on a page independently, as a separate object. Working out how those words assemble into the text would then be a very difficult task even for "an AI".

Not only that, but being such an old file format, it allows editors to tack objects on at the end of the file, to save having to write the whole file again. Sometimes a PDF engine will 'flatten' all those appended changes, which can completely restructure the objects.

The sorry truth is that the PDF format was never designed to provide access to its contents, except to display them correctly on the screen or in a page image for printing. Despite that, the whole world is busy storing millions of its most important documents every day as PDFs. Does that seem ever so slightly crazy?

I'm grateful to Paul for opening this Pandora's box.

Posted in General, Language, Macs, Technology and tagged Acrobat, Adobe, BBEdit, FileMerge, PDF, text, Xcode.
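The article's point about editors tacking objects on at the end of the file can be checked roughly without a PDF library. The Python sketch below rests on one assumption: that each saved revision of a PDF normally ends with an `%%EOF` marker, so a file containing several of them has probably been updated by appending. The function names are hypothetical, and the count is only a heuristic, as the same bytes may also occur for other reasons, for example in linearised files or inside embedded data.

```python
from pathlib import Path

# Marker that normally closes each saved revision of a PDF file.
EOF_MARKER = b"%%EOF"


def count_eof_markers(data: bytes) -> int:
    """Count end-of-file markers in the raw bytes of a PDF."""
    return data.count(EOF_MARKER)


def likely_appended_updates(data: bytes) -> int:
    """Estimate how many times changes were appended after the first save.

    Heuristic only: one marker is expected for the original file, and each
    further marker is taken to indicate one appended update.
    """
    return max(count_eof_markers(data) - 1, 0)


def report(path: str) -> str:
    """Summarise a PDF on disk; `path` is whatever file you want to inspect."""
    data = Path(path).read_bytes()
    return f"{path}: about {likely_appended_updates(data)} appended update(s)"


# Synthetic bytes standing in for a file saved once and then appended to twice.
sample = b"%PDF-1.4\n...\n%%EOF\n...\n%%EOF\n...\n%%EOF\n"
print(likely_appended_updates(sample))  # 2
```

Two copies of the same document that give different counts have probably been through different apps, which is one reason their exported text, and any comparison based on it, can differ even when the pages look identical.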
1 blackxacto on August 14, 2021 at 11:09 am
Is there a difference in display text and text displayed? Trying to understand what PDF considers components of a doc.

+ 2 hoakley on August 14, 2021 at 1:49 pm
As I wrote, a PDF document consists of objects. Objects can be as small as a single character, as is often seen with drop caps, or as large as the whole text on one page. That depends entirely on the software that's generating the PDF. For example, if you take a single-page document with a single text object and divide that up to move a paragraph within that text, you could end up with the text on that page taking three or more objects. All the PDF renderer does is render each of the objects on each page. It doesn't know how they relate to one another, or whether their content is even connected. I hope that's clearer.
Howard.

o 3 blackxacto on August 14, 2021 at 3:33 pm
"doesn't know how they relate to one another, or whether their content is even connected." Thank you.

4 DaveG on August 14, 2021 at 2:57 pm
Good article and certainly points out a weakness of our present document archival approach. I wonder if pdf -> image -> ocr provides any value in relating two documents by doing a new pass at objectification. Clearly, it is likely to introduce some, but maybe not a lot of, OCR errors.

+ 5 hoakley on August 14, 2021 at 8:09 pm
Thank you. No, OCR destructures documents even more, I'm afraid, and makes it impossible to distinguish things like headers and footers, another common feature of legal and similar documents.
Howard.

6 John on August 14, 2021 at 4:07 pm
In the legal world from my experience at least (20+ years as a lawyer working in and with other big international firms), everyone uses specialised document comparison software.
Unfortunately, there are fewer than a handful of products on the market for this purpose (such as Litera Compare - https://www.litera.com/products/store/litera-compare/ ), they're priced accordingly and all of them are Windows only (it's the main reason why I have VMware Fusion and Windows installed on my Mac at home). Such software can compare PDFs but we would only use this as a last resort because the results have all of the issues mentioned above.

When sending revised documents to clients, other lawyers and others, people will without exception send either (i) a marked-up compare (in Word or PDF format) generated, always, from the old and new Word versions of the amended document or (ii) a Word document with track changes. I think Word (on Windows at least) has a function to compare two documents, but it doesn't have all of the features of the proper compare programs (e.g. the compare programs will identify text which has been moved from one place to another in a document, and changes in a table, but I don't think the built-in Word function can handle that).

+ 7 hoakley on August 14, 2021 at 8:10 pm
Thank you. That's invaluable experience.
Howard.

8 btown on August 14, 2021 at 5:48 pm
https://draftable.com/compare is by far the best solution I've found for this, and it's a shame it's not more widely known about. It's not open-source, and their offline app is Windows only, but its ability to handle multi-page relayouts is far and above Acrobat's diff functionality, and there's a free online version that's reasonably secure so long as you don't share the secret URL around.

+ 9 hoakley on August 14, 2021 at 8:22 pm
Thank you. I'm afraid that I refuse to run up a VM to compare PDFs using software which is almost as expensive as Acrobat, and subscription-only.
Howard.
10 jeffsyrop on August 14, 2021 at 7:37 pm
This is a really good article and I thank you for it. I've been struggling with this problem for years, both as a tech writer and an on-line political writer (on Quora). I love your point, "Despite that, the whole world is busy storing millions of its most important documents every day as PDFs. Does that seem ever so slightly crazy?" OMG! Is that ever true!!

And now let me share something that is not THE ANSWER, but has been INCREDIBLY useful to me:

1. Display in 2 separate windows, one on top of the other, 2 versions of a page -- the original and an updated page on which a few changes have been made -- and display both windows and both pages IN the windows identically, starting at the same line of text.
2. Alt-~(tilde) back and forth between pages, and you can instantly see where changes are. Even if changes have pushed text down the page so it no longer lines up with the text on the other page, you can simply scroll up or down in each document so that the next group of text in question is perfectly aligned.

Alt-Tabbing quickly, if in the next paragraph there was a doubled word, e.g., "the the", by switching back and forth you'll see the word "the" jumping around and know exactly where the difference lies.

+ 11 hoakley on August 14, 2021 at 8:24 pm
Thank you. That makes excellent sense, tragic though it is to admit that we've got to compare by eye.
Howard.