[HN Gopher] Show HN: PDFs that are readable by human eyes only
       ___________________________________________________________________
        
       Hi, OP here. A friend was involved in a custody battle and was
       afraid his ex was going to leak all of his discovery documents on
       the internet and he asked if there was something I could do to make
       it harder for bots/crawlers to find sensitive information.
       Originally I was going to turn all of his docs to image based PDFs,
       but those get large fast and are easy to OCR.  So I found a post
       musing about altering fonts/glyphs so that it _looks_ like english,
       but the actual character being seen by the pdf reader is a non-
       english character. As such, when you try to OCR these files, it
       doesn't see any images and can't convert it.  I figured it had
       some potential uses and maybe you fine folks can identify other use
       cases. I'll be monitoring this post most of the day.
        
       Author : viggity
       Score  : 86 points
       Date   : 2022-07-06 15:38 UTC (7 hours ago)
        
 (HTM) web link (humaneyesonly.com)
 (TXT) w3m dump (humaneyesonly.com)
        
       | O__________O wrote:
       | You asked about use case ideas, while I personally strongly
       | dislike them, there are a number of sites, including online testing
       | apps, that try to remove copy-and-paste. Not sure how valuable it
       | would be, but it's for sure a use case.
        
       | anyfactor wrote:
       | I am on an older phone.
       | 
       | What your eyes see is identical to what the computer sees. They
       | are both gibberish. Also the email is gibberish.
       | 
       | What am I missing here?
        
         | solardev wrote:
         | Your PDF reader probably doesn't support the particular font
         | this is using. Try it on your computer.
        
           | anyfactor wrote:
           | > Have a good use case? Email me: yu-
           | yehou@[?]lmn'eu[?]ou'uorehaS.NgeY
           | 
           | This is what I am getting on the browser.
        
             | forgotpwd16 wrote:
             | You maybe don't allow pages to choose their own font? From
             | what I understand it works by making a font that maps Latin
             | characters to random Unicode points. So, unless for a
             | specific text its respective font is used, text will appear
             | gibberish.
             | 
             | edit: Saw you said you use an old phone. Found those fonts
             | are in woff format. Older Android phones don't have support
             | for it.
        
             | solardev wrote:
             | Mm, might have to use a different browser/phone? Not sure
             | what else there is to do here... it's a proof of concept
             | that doesn't work on all devices.
        
               | viggity wrote:
               | it probably depends on browser font support. PDF works
               | with embedded fonts and so it is less likely to be an
               | issue when viewing the PDFs there.
        
               | salawat wrote:
               | Ironically, that undermines the entire point of PDF
        
         | viggity wrote:
         | I'm wondering if your browser isn't showing woff fonts for some
         | reason. The "your eyes see" textbox on a modern browser shows:
         | 
         | Name: Satoshi Nakamoto
         | 
         | DOB: 1982-06-05
         | 
         | SSN: 958-20-3141
         | 
         | Cell Phone: 514-867-5309
        
       | hexo wrote:
       | Impressive, the example web text area is very lovely too
        
       | dawnerd wrote:
       | This is terrible for accessibility, please don't do this.
        
         | vehemenz wrote:
         | Isn't that the point though?
         | 
         | If you make a sensitive document unindexable (assuming this
         | works), then effectively no one can find it.
         | 
         | The intent here is not to restrict the document to sighted
         | users but to hide the document from _everyone_ , which includes
         | sighted and blind users searching for keywords. The fact that
         | blind users can't read the text at all without screen grabbing
         | is just a bonus.
        
           | dawnerd wrote:
           | Treating blind people that way will 100% lead you towards a
           | lawsuit. So have fun with that I guess.
        
       | jffry wrote:
       | I think people will get tripped up by you saying it "can't" be
       | OCR'd or that it is difficult to do so, and will end up looking
       | past a pretty elegant solution in the process.
       | 
       | This seems like a nicely clever way to trip up non-targeted
       | scrapers which might attempt to OCR any images they encounter,
       | but which will ignore what looks like random gibberish
       | codepoints. It doesn't eliminate the ability to index this data
       | but I can see how it might greatly reduce it.
       | 
       | Obviously you could still convert these PDFs to an image and OCR
       | them, but that's not the thing being defended against here.
        
       | josephcsible wrote:
       | This is basically just really weak DRM and is just as evil.
        
         | ck2 wrote:
         | Is it even DRM or more closely just child-like ROT13?
        
         | maxbond wrote:
         | I agree that OCR is an important tool for end users, especially
         | those with accessibility needs, and that we shouldn't use
         | something like this lightly, but the context is completely
         | different here. If I want to send you a PDF from my gmail, and
         | want to make it difficult for Google to leverage that data -
         | that's completely different than if I were a giant media
         | company, gatekeeping to ensure a huge portion of culture flows
         | through me, which I then claim as my own and charge exorbitant
         | rents for, enforced by DRM.
         | 
         | The problem with DRM is not that someone is trying to control
         | what happens to a string of bits, it's that it props up an
         | institution which is harmful.
        
           | withinboredom wrote:
           | The problem is that people won't "use it lightly" and not
           | everyone speaks the same language. Being able to copy/paste
           | text into a translation tool (probably Google Translate which
           | is kinda ironic in the case you mentioned) to understand what
           | the document is about is super important when in another
           | country or communicating with someone in another country.
        
             | maxbond wrote:
             | Being able to copy/paste is an important option that
             | empowers users.
             | 
             | Being able to selectively defeat copy paste is an
             | additional option that additionally empowers users.
             | 
             | I don't anticipate this tool being used very widely. If it
             | became the default, I would have a problem with it for the
             | reasons you highlight, among others.
        
           | kolanos wrote:
           | > I agree that OCR is an important tool for end users,
           | especially those with accessibility needs, and that we
           | shouldn't use something like this lightly, but the context is
           | completely different here. If I want to send you a PDF from
           | my gmail, and want to make it difficult for Google to
           | leverage that data - that's completely different than if I
           | were a giant media company, gatekeeping to ensure a huge
           | portion of culture flows through me, which I then claim as my
           | own and charge exorbitant rents for, enforced by DRM.
           | 
           | There are easier ways to do this, such as encrypting the PDF.
           | It is trivially easy to password protect a PDF as well [0],
           | it is even a part of the PDF spec. It isn't ironclad, but it
           | will defeat Gmail's indexer.
           | 
           | [0]: https://digify.com/blog/protect-pdf-with-password/
        
       | Komodai wrote:
       | "As such, when you try to OCR these files, it doesn't see any
       | images and can't convert it."
       | 
       | Bullshit
       | 
       | 1. Screenshot
       | 
       | 2. OCR
       | 
       | 3. Profit
        
       | bscphil wrote:
       | This is broken in multiple ways, some obvious, some not.
       | 
       | 1. Obviously most people jumped directly to OCR, and that works
       | of course. So counter to the OP, you can trivially render the
       | first page to a high resolution PNG and then OCR that with what
       | will probably be 100% accurate results. Sample image:
       | https://i.imgur.com/hyJOSjY.jpg
       | 
       | 2. This is just messing with glyphs in fonts, so one trivial way
       | of undoing the changes _losslessly_ (not even requiring OCR) is
       | to create a mapping between each font glyph and the original
       | character. For example, I was able to extract the font used for
       | the text "CONTENTS" near the beginning of the sample document.
       | It is named "SecureFont-1845559949-FranklinGothic-Demi", as
       | extracted by mutool. In the PDF "CONTENTS" is made up of eight
       | Unicode characters, which render as "CONTENTS" in that font.
       | 
       | 3. Even if the first two methods somehow failed, the same
       | character in a given font is repeatedly used to render the same
       | character in English. That makes the approach similar to a
       | substitution cipher [1] which is trivially broken with frequency
       | analysis. You could literally just copy / paste the fake "text"
       | out of the PDF and with an analysis tool derive the original
       | text. This isn't really significant since the PDF can be read by
       | sight anyway, but it's worth pointing out.
       | 
       | [1] https://en.wikipedia.org/wiki/Substitution_cipher
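Point 3 above can be illustrated with a rough first pass at frequency analysis. This is only a sketch: the function names and the frequency-ordering string are assumptions made for illustration, and a real attack would need a longer sample plus digram statistics or a dictionary check. It also assumes a strict one-symbol-per-letter substitution, so a one-to-many scheme would blunt this simple ranking.

```python
from collections import Counter

# Approximate ordering of English letters from most to least common.
# (An assumption for illustration; published tables differ slightly.)
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"


def guess_mapping(ciphertext):
    """Pair the most frequent symbols with the most frequent English letters."""
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    ranked = [symbol for symbol, _ in counts.most_common()]
    # zip stops at the shorter sequence, so extra symbols stay unmapped.
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))


def apply_mapping(ciphertext, mapping):
    """Replace every symbol that has a guess; leave everything else alone."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext)


if __name__ == "__main__":
    sample = "#### @@@ %%"
    guess = guess_mapping(sample)
    print(guess)
    print(apply_mapping(sample, guess))
```

The output of a ranking like this is only a starting guess that a person or a dictionary-based scorer would then refine by swapping letters.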
        
         | rmbyrro wrote:
         | I believe this can be taken as an axiom: there's nothing a
         | human eye can see that a machine cannot.
        
         | ChristopherDrum wrote:
         | To point 1, it was trivial for me to open the PDF in Preview on
         | macOS, save it out as a multi-page PNG, open that PNG in
         | Preview and macOS itself just let me copy out the text without
         | error (in my limited test).
        
         | viggity wrote:
         | so one thing that I do to counter point 3 is by having multiple
         | non-english characters map to the same english character and I
         | pick a random one each time. Depending on the input, there can
         | be ~10 or so characters mapping to any single english letter.
         | If you're advanced enough to know about a substitution cipher,
         | you'll figure out how to convert to an image based PDF and then
         | use OCR on that. The reason I have the multiple mappings is so
         | that if a layperson was trying to find all instances of "Billy"
         | they could copy those characters and then search for
         | "Chz'n'mZbe", but the other instances of "Billy" might have the
         | codepoints "Chtiivu".
         | 
         | Again, it's resistant to built in PDF Reader OCR, not
         | bulletproof. I'm trying to thwart a crawler or a script kiddie,
         | or a 50 year old divorce attorney. Not the denizens of HN.
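The one-to-many idea in this comment can be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the choice of 10 variants per letter, and the use of the CJK Unified Ideographs block starting at U+4E00 as the pool of substitute codepoints. The actual tool also has to build a font that draws each substitute as the original Latin glyph, which this sketch leaves out.

```python
import random
import string


def build_homophonic_mapping(alphabet=string.ascii_letters,
                             variants=10,
                             first_codepoint=0x4E00):
    """Give every letter its own disjoint list of substitute codepoints."""
    mapping = {}
    next_cp = first_codepoint
    for ch in alphabet:
        mapping[ch] = [chr(next_cp + i) for i in range(variants)]
        next_cp += variants
    return mapping


def obfuscate(text, mapping, rng):
    """Pick a random substitute for each mapped letter, every time it occurs."""
    return "".join(
        rng.choice(mapping[ch]) if ch in mapping else ch for ch in text
    )


def deobfuscate(text, mapping):
    """Invert the mapping; possible for anyone who can recover it from the font."""
    reverse = {sub: ch for ch, subs in mapping.items() for sub in subs}
    return "".join(reverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    rng = random.Random()
    mapping = build_homophonic_mapping()
    first = obfuscate("Billy", mapping, rng)
    second = obfuscate("Billy", mapping, rng)
    # The two encodings will usually differ, so searching for one
    # copied string does not find the other occurrences.
    print(first, second, deobfuscate(first, mapping))
```

Because each occurrence is encoded independently, a copied string only matches other occurrences that happened to draw the same variants.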
        
           | bscphil wrote:
           | That's a fair point. I was trying to point out an interesting
           | relationship between your approach and various forms of
           | cryptography, but obviously this would not be the first line
           | solution to the problem. I think the resistance to crawlers
           | will come down to whether they implement OCR or not. I
           | suspect some of the more sophisticated ones do. (A naive one
           | might also be fooled by the fact that real glyphs are used
           | and not attempt to OCR the text.)
           | 
           | BTW, you probably shouldn't say it's resistant to PDF reader
           | OCR because most PDF readers don't _have_ OCR, AFAIK. They
           | just pull the text from the document, that's not OCR.
           | Software that _has_ OCR like Adobe Acrobat will not be fooled
           | by your obfuscation if you render it to bitmap or textless
           | vector first. If OCR doesn't work on the document as is,
           | it's only because the presence of text glyphs fools it into
           | thinking there's nothing there to perform _optical_ character
           | recognition on.
        
       | mmastrac wrote:
       | Funny enough, this is because the PDF spec literally allows you
       | to map glyphs like that. Some properly-produced PDFs are broken
       | like this, but it's been less common in recent years.
       | 
       | You're supposed to provide mapping tables for text extraction but
       | they are optional.
       | 
       | This fails pretty bad for security because you can detect the
       | glyphs themselves in the font tables and provide a mapping
       | yourself
        
         | layer8 wrote:
         | It's because PDF was designed before Unicode became viable, and
         | was designed to be flexible regarding character sets, hence you
         | can basically define your own encoding.
        
       | [deleted]
        
       | maybeiambatman wrote:
       | Fascinating. How does this work?
        
         | viggity wrote:
         | It takes the embedded font out of your PDF, and then maps non-
         | latin characters (japanese, cyrillic, etc) to render as if they
         | looked like a latin character. So in the example on the site.
         | "ae" will render as a "D" using my special font. And "j" will
         | draw the "B" glyph. Then I do a replacement on the underlying
         | text so all "B" are replaced with "j". It is more complicated
         | than that, but that's the gist.
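A minimal sketch of the text-replacement half of this explanation, assuming a simple one-to-one table. The names and the choice of Cyrillic codepoints starting at U+0410 are illustrative assumptions, not the tool's real mapping, and the font surgery (making each substitute codepoint draw the original Latin glyph) is not shown.

```python
import string

# Assumption: substitutes are taken from the Cyrillic block, one per character.
CYRILLIC_START = 0x0410


def build_mapping(alphabet=string.ascii_letters + string.digits):
    """Map each Latin letter or digit to one substitute codepoint."""
    return {ch: chr(CYRILLIC_START + i) for i, ch in enumerate(alphabet)}


def obfuscate(text, mapping):
    """Rewrite the underlying text; the custom font would restore its look."""
    return "".join(mapping.get(ch, ch) for ch in text)


def deobfuscate(text, mapping):
    """Undo the substitution using the same table the font encodes."""
    reverse = {sub: ch for ch, sub in mapping.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = build_mapping()
    hidden = obfuscate("DOB: 1982-06-05", table)
    print(hidden)                      # what copy/paste or a crawler would get
    print(deobfuscate(hidden, table))  # what the font makes the eye see
```

Since the font itself has to carry the table, anyone who extracts the font can rebuild `deobfuscate`, which is the weakness raised elsewhere in the thread.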
        
           | bambax wrote:
           | So basically it's a type of Caesar cipher where letters are
           | mapped to something else one-to-one. Very easy to decrypt /
           | reverse. If this tool ever became popular there would be
           | hundreds of scripts to defeat it.
           | 
           | And as it is, it does not prevent "OCR", only copy-paste.
        
             | smegsicle wrote:
             | looks like it's actually one-to-many across unicode, if so
             | then you could think of it as approaching one-time-pad
             | encryption, with the key being the font
             | 
             | if the generator crafted a new font every time, never used
             | the same codepoint twice, and kept the font separate from
             | the document (pre-shared by being installed on the intended
             | receiver's machine) then it'd be uncrackable!
        
       | hk1337 wrote:
       | I thought this was going to be some style guide on how to make
       | the PDF document easy on the eyes to read.
        
       | thrown_22 wrote:
       | Cybersecurity Incident & Vulnerability Response Playbooks
       | 
       | Operational Procedures for Planning and Conducting Cybersecurity
       | Incident and Vulnerability Response Activities in FCEB
       | Information Systems
       | 
       | Publication: November 2021
       | 
       | Cybersecurity and Infrastructure Security Agency
       | 
       | DISCLAIMER: This document is marked TLP:WHITE. Disclosure is not
       | limited. Sources may use TLP:WHITE when information carries
       | minimal or no foreseeable risk of misuse, in accordance with
       | applicable rules and procedures for public release. Subject to
       | standard copyrght rules, TLP:WHITE information may be distributed
       | without restriction. For more information on the Traffic Light
       | Protocol, see
       | 
       | ---
       | 
       | Converting the first page of the sample PDF file to a tiff file
       | using ghost script and running tesseract OCR without any special
       | filters.
       | 
       | >Resistant to Optical Character Recognition (OCR), most laypeople
       | will need to print+rescan to OCR
       | 
       | This is not OCR resistant, I used the same two liner I used to
       | get my textbooks scanned at university 20 years ago.
        
         | viggity wrote:
         | You specifically are technologically proficient. Not everybody
         | knows how to export to a tiff and then OCR. 98% don't. When I
         | say "OCR Resistant", I mean that I haven't found PDF software
         | with built in OCR that has managed to extract the english text
         | back out.
        
           | kube-system wrote:
           | PDF software doesn't attempt to OCR text because it is
           | already text. This fools PDF software not to even attempt to
           | OCR rather than defeating OCR.
           | 
           | What you're resisting here is the ability for other
           | applications to scrape the already text-format text.
        
           | thrown_22 wrote:
           | That's like saying a lock is pick resistant because you
           | haven't been able to open it with a dead fish.
           | 
           | Words mean things, if what you did can't stand up to 20 year
           | old technology then it's basically useless. Remove the claim
           | that it resists OCR and just call it copy/paste proof and
           | unsearchable.
        
             | smegsicle wrote:
             | if a lock convinces most popular lock-picking devices to
             | use the ineffective dead fish technique then it's something
             | at least
        
       | blitz_skull wrote:
       | I literally used an OCR tool to grab the text directly out of the
       | first box. I think this is meant to be guarding against
       | copy/pasting--not OCR.
        
         | jfk13 wrote:
         | Yeah, that was not an accurate choice of terminology... as it
         | says in the "more info" box,
         | 
         | > Resistant to Optical Character Recognition (OCR), most
         | laypeople will need to print+rescan to OCR
         | 
         | If print+rescan (or equivalently, screen-grab+OCR) works, which
         | it will, then it's hardly OCR-resistant!
         | 
         | The only thing this "blocks" is text extraction from the PDF
         | with things like copy/paste or pdftotext/html/whatever
         | conversion tools, which will "see" the codepoints used rather
         | than the glyph images.
        
         | tyingq wrote:
         | And tools like pdftotext...it effectively breaks that.
        
         | viggity wrote:
         | So this is interesting. I guess I didn't realize that there are
         | (common?) tools to OCR screenshots. And to that end, there
         | probably isn't a whole lot I can do to stop it. But when you're
         | looking at a huge tax return, or sworn testimony, or just a
         | dump of 3000 emails, you're not gonna screenshot each one.
         | You're going to want to automate the OCR, which most PDF
         | readers (at least the commercial ones) will let you do. It is
         | against that type of OCR that my app is resistant to. They look
         | for image data within the PDF and OCR that. They bypass my text
         | because to the pdf reader, it already is in a text format.
         | 
         | I'm 1000% sure there are gurus who could whip up a script to
         | overcome this. But it's kind of one of those things where you
         | don't have to outrun the bear, you have to outrun your friend
         | running next to you. It makes your sensitive documents just
         | that much less likely to be scanned/found.
        
           | Semaphor wrote:
           | I think the main issue is that you have no idea what OCR is:
           | https://en.wikipedia.org/wiki/Optical_character_recognition
        
           | vehemenz wrote:
           | macOS does it by default now. I've found it to be very useful
           | at times.
           | 
           | Any image you open in Safari, Preview, etc. (official Apple
           | programs) will be OCR'd automatically, allowing you to
           | extract the text with copy+paste. I think it works with PDFs,
           | but I haven't tested it.
           | 
           | https://support.apple.com/guide/preview/interact-with-
           | text-i...
        
           | gundamdoubleO wrote:
           | I'm fairly sure you can just open up any image (not sure if
           | there's a limit on size or complexity) on a Mac and use the
           | select all shortcut to grab all the text to use for whatever
           | you'd like.
        
           | bil7 wrote:
           | > I didn't realize that there are (common?) tools to OCR
           | screenshots.
           | 
           | This seems like quite the oversight to me...
        
             | viggity wrote:
             | But if you have a huge tax document, you're likely not
             | going to screenshot page by page. Yes, there are ways to
             | automate this. But if you're a 50 year old divorce attorney,
             | you're going to click on the "OCR" button in your PDF
             | reader and it will not work.
        
               | mbreese wrote:
               | You don't have to screen shot every page... convert the
               | PDF to a PNG/TIFF image for every page, and OCR those.
               | This is _very_ easy to automate. If this is working with
               | Unicode code points, you're not blocking OCR, you're
               | obfuscating text. Anything that renders the PDF to a
               | raster format will produce an OCR-able document.
               | 
               | If you're a divorce attorney who used this to convert
               | documents in response to a discovery request, and the
               | opposing side had a valid reason for needing the
               | unobfuscated text, then you're probably going to end up
               | having a nice conversation with the judge about
               | acceptable formats.
               | 
               | Sending compressed TIFFs would probably be just as good.
               | A bit larger file sizes, but it would be just as
               | effective as stopping automated scraping of text. Also,
               | less likely to piss off a judge. Any opposing firm that
               | would be sophisticated enough to automate scraping the
               | text from a normal PDF would be able to OCR these files
               | just as easily.
               | 
               | Or maybe you have a second site that sells the decoder,
               | so you get to sell to both sides. Not a bad business
               | model, if you can work it.
        
               | bambax wrote:
               | I don't know why you think divorce attorneys are stupid.
               | Some are probably very well versed in tech; those who
               | aren't know others who are. They won't simply sit there
               | and think "oh, for some reason I can't copy-paste from
               | that PDF, better give up the case then".
               | 
               | ... And most attorneys simply print documents. Once the
               | PDF is on paper, OCR-ing it back into text is just one
               | scanner away.
        
               | Semaphor wrote:
               | I'm not sure what reader you are talking about, but that
               | button is most certainly not doing any kind of OCR if
               | your technique stops it.
        
               | bil7 wrote:
               | I understand that. And I understand your personal use
               | case was valid. But I think your "Human Eyes Only" name
               | and domain is a little deceptive.
        
           | boloust wrote:
           | > I didn't realize that there are (common?) tools to OCR
           | screenshots.
           | 
           | Retrieving text from images is literally the definition of
           | OCR.
        
           | solardev wrote:
           | It's actually much easier than you think.
           | 
           | You don't need any scripts, just Acrobat itself (or any
           | comparable PDF viewer) can do this. Export the PDF to images,
           | make a new PDF out of the images, scan the text, done.
           | 
           | Example (took your example and did just that with it, now
           | everything can be copied & pasted as normal text):
           | https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf
           | 
           | In general, if it LOOKS like text, SOMETHING can OCR it.
           | That's the whole point of OCR. If you want to try to block
           | OCR, you need something like CAPTCHAs, and that's getting
           | less and less effective every day. In fact many are already
           | more easily solved by computers than humans.
        
             | bambax wrote:
             | Of course. The OP doesn't understand what "OCR" actually
             | means.
        
           | 3jckd wrote:
           | Nothing that you said is wrong but it doesn't make the
           | situation better.
           | 
           | 1) As many people pointed out, this doesn't prevent OCR, it
           | just prevents copying strings (e.g. with crawlers).
           | 
           | 2) The majority of OCR doesn't deal with PDFs produced from a
           | text source but either from a) jpg-scans of documents b) pdfs
           | produced from those jpg-scans.
           | 
           | 3) The first thing I tried was OCR with my iPhone and it
           | obviously worked. As someone else said, there're solutions
           | that let you batch process many documents.
           | 
           | Don't get me wrong, your stuff works for what you designed it
           | to. However, it provides <false sense of security> by
           | <falsely> claiming that it prevents OCR; which in turn, can
           | lead to more harm[1].
           | 
           | [1] - e.g., it may convince people to share stuff that they
           | wouldn't otherwise.
        
           | thrown_22 wrote:
           | >I'm 1000% sure there are gurus who could whip up a script to
           | overcome this. But it's kind of one of those things where you
           | don't have to outrun the bear, you have to outrun your friend
           | running next to you. It makes your sensitive documents just
           | that much less likely to be scanned/found.
           | 
           | Security through obscurity is stupid:
           | 
           |     gs -sDEVICE=tiffscaled24 -dNOPAUSE -dBATCH -dSAFER \
           |        -sOutputFile=filename.tiff \
           |        filename.pdf
           |     tesseract filename.tiff filename.txt
           | 
           | All you need is ghostscript and tesseract. Both are an apt-
           | get away.
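For a whole folder of documents, the same two commands can be wrapped in a short script. This is a sketch under assumptions: the helper names `ocr_commands` and `ocr_folder` are hypothetical, Ghostscript (`gs`) and `tesseract` are assumed to be on the PATH, and the flags are the ones quoted in the comment above. One detail differs from the comment: tesseract treats its second argument as an output base and appends `.txt` itself, so the base name is passed without an extension.

```python
import subprocess
from pathlib import Path


def ocr_commands(pdf_path):
    """Build the Ghostscript and tesseract command lines for one PDF."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    out_base = pdf.with_suffix("")  # tesseract appends ".txt" to this base
    gs_cmd = [
        "gs", "-sDEVICE=tiffscaled24", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        f"-sOutputFile={tiff}", str(pdf),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base)]
    return [gs_cmd, tesseract_cmd]


def ocr_folder(folder):
    """Rasterize and OCR every PDF in a folder, one after another."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        for cmd in ocr_commands(pdf):
            # check=True stops the run if either tool reports an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    ocr_folder(".")
```

Rendering to a raster first is what makes this work regardless of which codepoints the PDF stores, since the OCR step only ever sees pixels.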
        
           | SigmundA wrote:
           | >It is against that type of OCR that my app is resistant to.
           | 
           | There is no form of OCR this is resistant to. Simply change
           | the description to be accurate and remove references to
           | being OCR resistant, as this is false.
        
       | ctoth wrote:
       | I am a blind user using an extension to my screen reader which
       | (under the covers) uses the Windows 10 built-in OCR. Your sample
       | document gives me:
       | 
       | INTRODUCTION The Cybersecurity and Infrastructure Security Agency
       | (CISA) is committed to leading the response to cybersecurity
       | incidents and vulnerabilities to safeguard the nation's critical
       | assets. Section 6 of Executive Order 14028 directed DHS, via
       | CISA, to "develop a standard set of operational procedures
       | (playbook) to be used in planning and conducting cybersecurity
       | vulnerability and incident response activity respecting Federal
       | Civilian Executive Branch (FCEB) Information Systems." I Overview
       | This document presents two playbooks: one for incident response
       | and one for vulnerability response. These playbooks provide FCEB
       | agencies with a standard set of procedures to identify,
       | coordinate, remediate, recover, and track successful mitigations
       | from incidents and vulnerabilities affecting FCEB systems, data,
       | and networks. In addition, future iterations of these playbooks
       | may be useful for organizations outside of the FCEB to
       | standardize incident response practices. Working together across
       | all federal government organizations has proven to be an
       | effective model for addressing vulnerabilities and incidents.
       | Building on lessons learned from previous incidents and
       | incorporating industry best practices, CISA intends for these
       | playbooks to evolve the federal government's practices for
       | cybersecurity response through standardizing shared practices
       | that bring together the best people and processes to drive
       | coordinated actions.
       | 
       | Pretty sure this doesn't actually work.
        
         | viggity wrote:
         | very interesting. the windows 10 screen reader consumes the
         | raster data on PDFs to OCR and not the code point data embedded
         | within the PDF. People here have been on my ass about saying
         | "OCR resistant" and I get where they are coming from. I've
         | primarily been testing the various "OCR" functionalities built
         | within the various PDF readers out there. The "OCR" that 98% of
         | laypeople are going to rely on. I always knew that exporting to
         | an image based PDF wouldn't be defeated. If a human can read
         | it, a machine can read it. Just most PDF readers aren't set up
         | to do it. Out of curiosity, when you use your screen reader on
         | my website, does the <textarea> read and/or start with "Name:
         | Satoshi Nakamoto"?
        
           | ctoth wrote:
           | I can see the content in the textareas are a bunch of Unicode
           | glyphs that aren't mapped to speakable characters and when I
           | perform a "read all" action mostly render as questionmarks.
        
       | ksaj wrote:
       | I'm not very convinced by the PDF idea, but web fonts done this
       | way would be great for the parts of web pages you don't want
       | scraped or collected by search engines, if it is on pages where
       | you do want at least some of the content available to search
       | engines.
        
         | viggity wrote:
         | the antiscraping thing is a good idea. Hell, you could poison
         | it very lightly using homoglyphs (greek capital Epsilon instead
         | of just an E) just to see where else on the internet your data
         | ends up, too.
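The homoglyph idea could be sketched like this. Encoding a per-recipient number is an assumption that goes a step beyond the comment, as are the function names and the three-letter homoglyph table; the Greek capitals Epsilon, Alpha and Omicron are used because they look like Latin E, A and O in most fonts.

```python
# Latin capital -> visually similar Greek capital.
HOMOGLYPHS = {"E": "\u0395", "A": "\u0391", "O": "\u039f"}


def watermark(text, recipient_id, bits=8):
    """Hide recipient_id in the first `bits` swappable letters (1 = swap)."""
    out = []
    bit_index = 0
    for ch in text:
        if ch in HOMOGLYPHS and bit_index < bits:
            if (recipient_id >> bit_index) & 1:
                ch = HOMOGLYPHS[ch]
            bit_index += 1
        out.append(ch)
    return "".join(out)


def read_watermark(text, bits=8):
    """Recover the hidden number from a copy found somewhere else."""
    greek = set(HOMOGLYPHS.values())
    value = 0
    bit_index = 0
    for ch in text:
        if bit_index >= bits:
            break
        if ch in HOMOGLYPHS:       # unswapped carrier letter: bit is 0
            bit_index += 1
        elif ch in greek:          # swapped carrier letter: bit is 1
            value |= 1 << bit_index
            bit_index += 1
    return value


if __name__ == "__main__":
    original = "EVERY ONE OF THESE AREAS HAS AN OPEN EXAMPLE"
    marked = watermark(original, 178)
    print(marked == original, read_watermark(marked))
```

The text needs at least `bits` carrier letters for the full number to survive, and any scraper that normalizes homoglyphs would erase the mark.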
        
       | solardev wrote:
       | Hmm... interesting in theory, but take a screenshot and it's
       | trivially bypassed. Try it yourself here:
       | http://www.structurise.com/screenshot-ocr/
        
         | viggity wrote:
         | I posted some more info here:
         | https://news.ycombinator.com/item?id=32003066
        
         | rst wrote:
         | Which is not what I'd expect from anything that claims to be
         | "OCR resistant". It's not at all clear what they mean by that.
        
           | solardev wrote:
           | I think the OP, while well-intentioned, did not really
           | understand how OCR works. Follow-up convo in a separate
           | thread here: https://news.ycombinator.com/item?id=32003066
           | 
           | What this blocks is not OCR but casual copy & pasting (and
           | search engine indexing)
        
         | RajT88 wrote:
         | I think it works for the use case - where documents can be
         | provided for discovery, but if posted online won't have the
         | content indexed by search engines.
         | 
         | The various legal teams involved are unlikely to ever be the
         | wiser. Or will they?
         | 
         | Won't this print out a pile of gibberish? Hard copies are
         | rather important in the courts. Somebody is going to complain
         | about what was provided in that case.
        
       | londons_explore wrote:
       | > his ex was going to leak all of his discovery documents on the
       | internet
       | 
       | If that happens, I suspect he'd have a very strong case to win
       | custody...
        
         | viggity wrote:
         | he got what he wanted out of the case, so good for him. The
         | problem is that he just didn't know if it was going to happen.
         | And she could definitely get sanctioned for it. But his info
         | would still be out there.
        
           | mbreese wrote:
           | To me, this falls under the category of -- you can't have a
           | technical solution to a societal problem. Yes, technology may
           | have made your friend feel better. But the actual thing that
           | protected him was the law, not the obfuscated PDFs.
           | 
           | But, if the judge didn't care and it made your friend feel
           | better, who am I to judge? But this isn't a great protection
           | scheme... it just adds a few extra technical hurdles that are
           | easy to get around.
        
       | forgotpwd16 wrote:
       | Not sure OCR is the correct term here. OCR specifically means
       | extracting text from an image. This approach doesn't protect
       | against that. Some possibly better terms would be "machine
       | obfuscated" or "scrape resistant".
        
       | donkarma wrote:
       | took me a while to realise that my font settings on firefox break
       | this
        
       | waynesonfire wrote:
       | brilliant solution.
        
       | oxff wrote:
       | Just give me readable papers instead. Such a pain in the dick
       | format yet it's the only thing there is.
        
       | dodo6502 wrote:
       | As an author of a PDF library this is hilarious, because the
       | number of bugs I have received over the years where this is
       | unintentionally happening is quite high.
        
       | solardev wrote:
       | > As such, when you try to OCR these files, it doesn't see any
       | images and can't convert it.
       | 
       | That isn't true. Acrobat might skip parts of the PDF that it
       | thinks are already text/glyphs, but it's trivial to get around
       | that by either using other OCR software or just printing the PDF
       | to a raster image first. Example:
       | https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf
       | 
       | Still, though, for the purposes of obscuring these from
       | bots/crawlers... a lot better than nothing!
        
       | s1mon wrote:
       | MacOS does OCR on this just fine. Screenshot, open in Preview,
       | and select, copy, paste:
       | 
       | Name: Satoshi Nakamoto DOB: 1982-06-05 SSN: 958-20-3141 Cell
       | Phone: 514-867-5309
        
       | layer8 wrote:
       | You could use the ZXX typeface to defy OCR:
       | https://walkerart.org/magazine/sang-mun-defiant-typeface-nsa...
       | 
       | It's probably still not ML-resistant.
        
         | rafram wrote:
         | You could just train an OCR engine on that typeface. IIRC
         | training Tesseract for a new font is quite trivial.
        
       | peanut_worm wrote:
       | This seems like it would just be annoying and would not even work
       | for most purposes. Kind of neat though.
        
       | nemothekid wrote:
       | I thought this was going to be some adversarial neural network.
       | 
       | Unfortunately OP, I don't think your solution even works for your
       | intended use case; Google already does OCR (actual OCR, not just
       | parsing text) in Images. I use it in Gmail quite often.
       | Regardless the implementation is quite neat and will surely
       | thwart less advanced indexers.
        
         | social_quotient wrote:
         | Does it ocr on live text pdfs or just pdfs that have text in
         | images/flattened?
        
         | ben_w wrote:
         | Likewise iOS, text in screenshots is selectable and in this
         | case it is recognised correctly.
        
         | nmstoker wrote:
         | Adding to this, it's trivial to get the human readable text on
         | a Google phone:
         | 
         | switch between apps and pick the app with the text but don't
         | jump back into the app yet, select text and you can immediately
         | copy the text out.
         | 
         | It's yielding the visible text, so must be OCR'ing the image
         | (works offline too).
         | 
         | When I copy directly from the example on the web page, the
         | non-OCR way, I do get the messed-up text, but not when it's
         | done the way above.
         | 
         | " Phone: 514-867-5309" was copied out easily (can't be bothered
         | to go back and get the "Cell" bit; I was just inaccurate when
         | copying, I'm sure it works!)
        
         | viggity wrote:
         | thank you to everyone in this thread who realizes it was never
         | meant to be perfect and appreciates it for what it does do!
        
           | SilasX wrote:
           | The point isn't that it has flaws, but that its description
           | is wrong. "Non-human eyes" -- normally understood to be OCR
           | -- read it just fine. I think most of us were expecting
           | something that disrupts "computer eyes" (e.g. because of
           | deceiving overly narrow "tricks" that neural networks use to
           | identify characters) but left it readable for the typical
           | human (like an easy Captcha).
           | 
           | A more accurate (and helpful!) description of the problem
           | you're solving is that this disrupts _text parsers_. That is,
           | any program that just reads this in as text won't see the
           | "real" letters (unless it's been pre-programmed with a
           | specific reverser, etc.) and thus will frustrate, say, text
           | search.
           | 
           | Which, on that note, I notice elsewhere you mention this
           | being a solution applied to document submission in legal
           | proceedings. In that case, the assumption might be that one
           | side wishes to run text searches and assume it's compatible
           | with that. In that case, this could be viewed as non-
           | compliance with a judge's orders, so FYI.
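
The text-parser point above can be sketched as follows. This assumes the stored characters are Private Use Area code points that a custom font draws as ordinary letters; the actual code points the site uses are not stated in the thread, and `build_mapping`, `scramble`, and `unscramble` are hypothetical names.

```python
import string

# Assumption: each letter or digit is stored as a Private Use Area code
# point, and a custom font draws that code point with the original shape.
PUA_START = 0xE000
ALPHABET = string.ascii_letters + string.digits


def build_mapping() -> dict[str, str]:
    """Map every letter and digit to its own Private Use Area character."""
    return {ch: chr(PUA_START + i) for i, ch in enumerate(ALPHABET)}


def scramble(text: str, mapping: dict[str, str]) -> str:
    """Produce the text layer a parser would see; other characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)


def unscramble(stored: str, mapping: dict[str, str]) -> str:
    """The "specific reverser": invert the mapping to recover the real text."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in stored)


mapping = build_mapping()
visible = "SSN: 958-20-3141"
stored = scramble(visible, mapping)

# A plain text search over the stored layer misses the visible digits.
found_by_search = "958" in stored
recovered = unscramble(stored, mapping)
```

Here `found_by_search` is false while `recovered` equals the visible text, which is the gap between "defeats text search" and "defeats anyone who knows the mapping" (or who simply OCRs the rendered page).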
        
       | peetah wrote:
       | funny, I did this for the web, something like 8 or 7 years ago,
       | under the name "cprotext", but was unable to find a way to sell
       | this as SaaS :)
       | 
       | There should be a WordPress plugin floating around somewhere
       | called wp-cprotext and maybe one or two demo websites whose
       | URLs I can't even remember.
       | 
       | People came with the same criticisms we can read here: evil DRM,
       | accessibility nightmare and easily bypassed by OCR. All in all, I
       | came to be quite convinced by these criticisms, especially the
       | first and second, and shut it down completely.
       | 
       | I would genuinely be interested to see how you'll succeed where I
       | failed! Good luck!
        
       ___________________________________________________________________
       (page generated 2022-07-06 23:01 UTC)