[HN Gopher] Show HN: PDFs that are readable by human eyes only
___________________________________________________________________
Show HN: PDFs that are readable by human eyes only
Hi, OP here. A friend was involved in a custody battle and was
afraid his ex was going to leak all of his discovery documents on
the internet and he asked if there was something I could do to make
it harder for bots/crawlers to find sensitive information.
Originally I was going to turn all of his docs into image-based
PDFs, but those get large fast and are easy to OCR. So I found a
post musing about altering fonts/glyphs so that it _looks_ like
English, but the actual character being seen by the PDF reader is a
non-English character. As such, when you try to OCR these files, it
doesn't see any images and can't convert it. I figured it had
some potential uses and maybe you fine folks can identify other use
cases. I'll be monitoring this post most of the day.
Author : viggity
Score : 86 points
Date : 2022-07-06 15:38 UTC (7 hours ago)
(HTM) web link (humaneyesonly.com)
(TXT) w3m dump (humaneyesonly.com)
| O__________O wrote:
| You asked about use case ideas. While I personally strongly
| dislike them, there are a number of sites, including online
| testing apps, that try to remove copy-and-paste. Not sure how
| valuable it would be, but it's for sure a use case.
| anyfactor wrote:
| I am on an older phone.
|
| What your eyes see is identical to what the computer sees. They
| are both gibberish. Also the email is gibberish.
|
| What am I missing here?
| solardev wrote:
| Your PDF reader probably doesn't support the particular font
| this is using. Try it on your computer.
| anyfactor wrote:
| > Have a good use case? Email me: yu-
| yehou@[?]lmn'eu[?]ou'uorehaS.NgeY
|
| This is what I am getting on the browser.
| forgotpwd16 wrote:
| Maybe you don't allow pages to choose their own fonts? From
| what I understand, it works by making a font that draws Latin
| glyphs for random Unicode code points. So unless the matching
| font is applied to a given piece of text, that text will
| appear as gibberish.
|
| edit: Saw you said you use an old phone. Found that those fonts
| are in WOFF format. Older Android phones don't have support
| for it.
| solardev wrote:
| Mm, might have to use a different browser/phone? Not sure
| what else there is to do here... it's a proof of concept
| that doesn't work on all devices.
| viggity wrote:
| it probably depends on browser font support. PDF works
| with embedded fonts and so it is less likely to be an
| issue when viewing the PDFs there.
| salawat wrote:
| Ironically, that undermines the entire point of PDF
| viggity wrote:
| I'm wondering if your browser isn't showing woff fonts for some
| reason. The "your eyes see" textbox on a modern browser shows:
|
| Name: Satoshi Nakamoto
|
| DOB: 1982-06-05
|
| SSN: 958-20-3141
|
| Cell Phone: 514-867-5309
| hexo wrote:
| Impressive, the example web text area is very lovely too
| dawnerd wrote:
| This is terrible for accessibility, please don't do this.
| vehemenz wrote:
| Isn't that the point though?
|
| If you make a sensitive document unindexable (assuming this
| works), then effectively no one can find it.
|
| The intent here is not to restrict the document to sighted
| users but to hide the document from _everyone_, which includes
| sighted and blind users searching for keywords. The fact that
| blind users can't read the text at all without screen grabbing
| is just a bonus.
| dawnerd wrote:
| Treating blind people that way will 100% lead you towards a
| lawsuit. So have fun with that I guess.
| jffry wrote:
| I think people will get tripped up by you saying it "can't" be
| OCR'd or that it is difficult to do so, and will end up looking
| past a pretty elegant solution in the process.
|
| This seems like a nicely clever way to trip up non-targeted
| scrapers which might attempt to OCR any images they encounter,
| but which will ignore what looks like random gibberish
| codepoints. It doesn't eliminate the ability to index this data
| but I can see how it might greatly reduce it.
|
| Obviously you could still convert these PDFs to an image and OCR
| them, but that's not the thing being defended against here.
| josephcsible wrote:
| This is basically just really weak DRM and is just as evil.
| ck2 wrote:
| Is it even DRM or more closely just child-like ROT13?
| maxbond wrote:
| I agree that OCR is an important tool for end users, especially
| those with accessibility needs, and that we shouldn't use
| something like this lightly, but the context is completely
| different here. If I want to send you a PDF from my gmail, and
| want to make it difficult for Google to leverage that data -
| that's completely different than if I were a giant media
| company, gatekeeping to ensure a huge portion of culture flows
| through me, which I then claim as my own and charge exorbitant
| rents for, enforced by DRM.
|
| The problem with DRM is not that someone is trying to control
| what happens to a string of bits, it's that it props up an
| institution which is harmful.
| withinboredom wrote:
| The problem is that people won't "use it lightly" and not
| everyone speaks the same language. Being able to copy/paste
| text into a translation tool (probably Google Translate which
| is kinda ironic in the case you mentioned) to understand what
| the document is about is super important when in another
| country or communicating with someone in another country.
| maxbond wrote:
| Being able to copy/paste is an important option that
| empowers users.
|
| Being able to selectively defeat copy paste is an
| additional option that additionally empowers users.
|
| I don't anticipate this tool being used very widely. If it
| became the default, I would have a problem with it for the
| reasons you highlight, among others.
| kolanos wrote:
| > I agree that OCR is an important tool for end users,
| especially those with accessibility needs, and that we
| shouldn't use something like this lightly, but the context is
| completely different here. If I want to send you a PDF from
| my gmail, and want to make it difficult for Google to
| leverage that data - that's completely different than if I
| were a giant media company, gatekeeping to ensure a huge
| portion of culture flows through me, which I then claim as my
| own and charge exorbitant rents for, enforced by DRM.
|
| There are easier ways to do this, such as encrypting the PDF.
| It is trivially easy to password protect a PDF as well [0],
| it is even a part of the PDF spec. It isn't ironclad, but it
| will defeat Gmail's indexer.
|
| [0]: https://digify.com/blog/protect-pdf-with-password/
| Komodai wrote:
| "As such, when you try to OCR these files, it doesn't see any
| images and can't convert it."
|
| Bullshit
|
| 1. Screenshot
|
| 2. OCR
|
| 3. Profit
| bscphil wrote:
| This is broken in multiple ways, some obvious, some not.
|
| 1. Obviously most people jumped directly to OCR, and that works
| of course. So counter to the OP, you can trivially render the
| first page to a high resolution PNG and then OCR that with what
| will probably be 100% accurate results. Sample image:
| https://i.imgur.com/hyJOSjY.jpg
|
| 2. This is just messing with glyphs in fonts, so one trivial way
| of undoing the changes _losslessly_ (not even requiring OCR) is
| to create a mapping between each font glyph and the original
| character. For example, I was able to extract the font used for
| the text "CONTENTS" near the beginning of the sample document.
| It is named "SecureFont-1845559949-FranklinGothic-Demi", as
| extracted by mutool. In the PDF "CONTENTS" is made up of eight
| Unicode characters, which render as "CONTENTS" in that font.
|
| 3. Even if the first two methods somehow failed, the same
| character in a given font is repeatedly used to render the same
| character in English. That makes the approach similar to a
| substitution cipher [1] which is trivially broken with frequency
| analysis. You could literally just copy / paste the fake "text"
| out of the PDF and with an analysis tool derive the original
| text. This isn't really significant since the PDF can be read by
| sight anyway, but it's worth pointing out.
|
| [1] https://en.wikipedia.org/wiki/Substitution_cipher
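
The frequency analysis mentioned in point 3 can be sketched roughly as
follows. This is a minimal illustration, not a tool from the thread:
`guess_mapping` and `ENGLISH_BY_FREQUENCY` are hypothetical names, the
letter ordering is only a commonly cited approximation, and the idea
assumes each obfuscated code point always stands for the same letter.

```python
from collections import Counter

# Approximate ordering of English letters from most to least frequent.
# This is an assumption for illustration; published orderings differ slightly.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"


def guess_mapping(ciphertext):
    """Pair the most frequent symbols with the most frequent English letters."""
    # Count every non-whitespace symbol in the copied "gibberish" text.
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    # most_common() sorts by count, highest first.
    ranked = [symbol for symbol, _ in counts.most_common()]
    # zip() stops at the shorter sequence, so extra symbols stay unmapped.
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))


def apply_guess(ciphertext, mapping):
    """Replace each symbol with its guessed letter, leaving unknowns as-is."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext)
```

On short texts the first guess will be rough and needs manual correction.
A scheme that maps several code points to one letter flattens these counts,
which is one reason rendering to an image and running OCR is usually the
more practical route.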
| rmbyrro wrote:
| I believe this can be taken as an axiom: there's nothing a
| human eye can see that a machine cannot.
| ChristopherDrum wrote:
| To point 1, it was trivial for me to open the PDF in Preview on
| macOS, save it out as a multi-page PNG, open that PNG in
| Preview and macOS itself just let me copy out the text without
| error (in my limited test).
| viggity wrote:
| One thing I do to counter point 3 is to have multiple
| non-English characters map to the same English character, and I
| pick a random one each time. Depending on the input, there can
| be ~10 or so characters mapping to any single English letter.
| If you're advanced enough to know about a substitution cipher,
| you'll figure out how to convert to an image-based PDF and then
| use OCR on that. The reason I have the multiple mappings is so
| that if a layperson was trying to find all instances of "Billy"
| they could copy those characters and then search for
| "Chz'n'mZbe", but the other instances of "Billy" might have the
| codepoints "Chtiivu".
|
| Again, it's resistant to built-in PDF reader OCR, not
| bulletproof. I'm trying to thwart a crawler or a script kiddie,
| or a 50-year-old divorce attorney. Not the denizens of HN.
| bscphil wrote:
| That's a fair point. I was trying to point out an interesting
| relationship between your approach and various forms of
| cryptography, but obviously this would not be the first line
| solution to the problem. I think the resistance to crawlers
| will come down to whether they implement OCR or not. I
| suspect some of the more sophisticated ones do. (A naive one
| might also be fooled by the fact that real glyphs are used
| and not attempt to OCR the text.)
|
| BTW, you probably shouldn't say it's resistant to PDF reader
| OCR because most PDF readers don't _have_ OCR, AFAIK. They
| just pull the text from the document, that's not OCR.
| Software that _has_ OCR like Adobe Acrobat will not be fooled
| by your obfuscation if you render it to bitmap or textless
| vector first. If OCR doesn't work on the document as is,
| it's only because the presence of text glyphs fools it into
| thinking there's nothing there to perform _optical_ character
| recognition on.
| mmastrac wrote:
| Funny enough, this is because the PDF spec literally allows you
| to map glyphs like that. Some properly-produced PDFs are broken
| like this, but it's been less common in recent years.
|
| You're supposed to provide mapping tables for text extraction but
| they are optional.
|
| This fails pretty bad for security because you can detect the
| glyphs themselves in the font tables and provide a mapping
| yourself
| layer8 wrote:
| It's because PDF was designed before Unicode became viable, and
| was designed to be flexible regarding character sets, hence you
| can basically define your own encoding.
| [deleted]
| maybeiambatman wrote:
| Fascinating. How does this work?
| viggity wrote:
| It takes the embedded font out of your PDF, and then maps
| non-Latin characters (Japanese, Cyrillic, etc.) so that they
| render as if they were Latin characters. So in the example on
| the site, "ae" will render as a "D" using my special font, and
| "j" will draw the "B" glyph. Then I do a replacement on the
| underlying text so all "B" are replaced with "j". It is more
| complicated than that, but that's the gist.
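
The text-replacement half of that description can be sketched as below.
This is only an illustration under stated assumptions, not the author's
implementation: it uses Private Use Area code points starting at U+E000
instead of Japanese or Cyrillic characters, all function names are
hypothetical, and the font side (drawing the Latin glyph for each
substitute code point) is not shown.

```python
import random


def build_mapping(alphabet, variants_per_char=10, first_code_point=0xE000):
    """Assign each visible character several substitute code points."""
    mapping = {}
    next_cp = first_code_point  # assumption: Private Use Area, for illustration
    for ch in alphabet:
        mapping[ch] = [chr(next_cp + i) for i in range(variants_per_char)]
        next_cp += variants_per_char
    return mapping


def obfuscate(text, mapping, rng):
    """Swap each mapped character for one of its substitutes, chosen at random."""
    return "".join(
        rng.choice(mapping[ch]) if ch in mapping else ch for ch in text
    )


def invert(mapping):
    """Build the reverse table: substitute code point -> visible character."""
    return {sub: ch for ch, subs in mapping.items() for sub in subs}


def deobfuscate(text, reverse):
    """Recover the visible text using the reverse table."""
    return "".join(reverse.get(ch, ch) for ch in text)
```

A call such as `obfuscate("Billy", build_mapping("Bily"), random.Random())`
yields a different string on most runs, which is what defeats naive
search. Anyone who can read the embedded font can rebuild the table that
`invert` produces, so the scheme hides text from copy/paste and indexing
rather than from a determined reader.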
| bambax wrote:
| So basically it's a type of Caesar cipher where letters are
| mapped to something else one-to-one. Very easy to decrypt /
| reverse. If this tool ever became popular there would be
| hundreds of scripts to defeat it.
|
| And as it is, it does not prevent "OCR", only copy-paste.
| smegsicle wrote:
| looks like it's actually one-to-many across unicode, if so
| then you could think of it as approaching one-time-pad
| encryption, with the key being the font
|
| if the generator crafted a new font every time, never used
| the same codepoint twice, and kept the font separate from
| the document (pre-shared by being installed on the intended
| receiver's machine) then it'd be uncrackable!
| hk1337 wrote:
| I thought this was going to be some style guide on how to make
| the PDF document easy on the eyes to read.
| thrown_22 wrote:
| Cybersecurity Incident & Vulnerability Response Playbooks
|
| Operational Procedures for Planning and Conducting Cybersecurity
| Incident and Vulnerability Response Activities in FCEB
| Information Systems
|
| Publication: November 2021
|
| Cybersecurity and Infrastructure Security Agency
|
| DISCLAIMER: This document is marked TLP:WHITE. Disclosure is not
| limited. Sources may use TLP:WHITE when information carries
| minimal or no foreseeable risk of misuse, in accordance with
| applicable rules and procedures for public release. Subject to
| standard copyrght rules, TLP:WHITE information may be distributed
| without restriction. For more information on the Traffic Light
| Protocol, see
|
| ---
|
| That is the output of converting the first page of the sample
| PDF file to a TIFF file using Ghostscript and running Tesseract
| OCR on it without any special filters.
|
| >Resistant to Optical Character Recognition (OCR), most laypeople
| will need to print+rescan to OCR
|
| This is not OCR resistant, I used the same two liner I used to
| get my textbooks scanned at university 20 years ago.
| viggity wrote:
| You specifically are technologically proficient. Not everybody
| knows how to export to a TIFF and then OCR. 98% don't. When I
| say "OCR Resistant", I mean that I haven't found PDF software
| with built-in OCR that has managed to extract the English text
| back out.
| kube-system wrote:
| PDF software doesn't attempt to OCR text because it is
| already text. This fools PDF software into not even attempting
| OCR, rather than defeating OCR.
|
| What you're resisting here is the ability of other
| applications to scrape text that is already in text format.
| thrown_22 wrote:
| That's like saying a lock is pick resistant because you
| haven't been able to open it with a dead fish.
|
| Words mean things. If what you did can't stand up to
| 20-year-old technology then it's basically useless. Remove the
| claim that it resists OCR and just call it copy/paste proof
| and unsearchable.
| smegsicle wrote:
| if a lock convinces most popular lock-picking devices to
| use the ineffective dead fish technique then it's something
| at least
| blitz_skull wrote:
| I literally used an OCR tool to grab the text directly out of the
| first box. I think this is meant to be guarding against
| copy/pasting--not OCR.
| jfk13 wrote:
| Yeah, that was not an accurate choice of terminology... as it
| says in the "more info" box,
|
| > Resistant to Optical Character Recognition (OCR), most
| laypeople will need to print+rescan to OCR
|
| If print+rescan (or equivalently, screen-grab+OCR) works, which
| it will, then it's hardly OCR-resistant!
|
| The only thing this "blocks" is text extraction from the PDF
| with things like copy/paste or pdftotext/html/whatever
| conversion tools, which will "see" the codepoints used rather
| than the glyph images.
| tyingq wrote:
| And tools like pdftotext...it effectively breaks that.
| viggity wrote:
| So this is interesting. I guess I didn't realize that there are
| (common?) tools to OCR screenshots. And to that end, there
| probably isn't a whole lot I can do to stop it. But when you're
| looking at a huge tax return, or sworn testimony, or just a
| dump of 3000 emails, you're not gonna screenshot each one.
| You're going to want to automate the OCR, which most PDF
| readers (at least the commercial ones) will let you do. It is
| that type of OCR that my app is resistant to. They look
| for image data within the PDF and OCR that. They bypass my text
| because to the PDF reader, it already is in a text format.
|
| I'm 1000% sure there are gurus who could whip up a script to
| overcome this. But it's kind of one of those things where you
| don't have to outrun the bear, you have to outrun your friend
| running next to you. It makes your sensitive documents just
| that much less likely to be scanned/found.
| Semaphor wrote:
| I think the main issue is that you have no idea what OCR is:
| https://en.wikipedia.org/wiki/Optical_character_recognition
| vehemenz wrote:
| macOS does it by default now. I've found it to be very useful
| at times.
|
| Any image you open in Safari, Preview, etc. (official Apple
| programs) will be OCR'd automatically, allowing you to
| extract the text with copy+paste. I think it works with PDFs,
| but I haven't tested it.
|
| https://support.apple.com/guide/preview/interact-with-
| text-i...
| gundamdoubleO wrote:
| I'm fairly sure you can just open up any image (not sure if
| there's a limit on size or complexity) on a Mac and use the
| select all shortcut to grab all the text to use for whatever
| you'd like.
| bil7 wrote:
| > I didn't realize that there are (common?) tools to OCR
| screenshots.
|
| This seems like quite the oversight to me...
| viggity wrote:
| But if you have a huge tax document, you're likely not
| going to screenshot page by page. Yes, there are ways to
| automate this. But if you're a 50-year-old divorce attorney,
| you're going to click on the "OCR" button in your PDF
| reader and it will not work.
| mbreese wrote:
| You don't have to screenshot every page... convert the
| PDF to a PNG/TIFF image for every page, and OCR those.
| This is _very_ easy to automate. If this is working with
| Unicode code points, you're not blocking OCR, you're
| obfuscating text. Anything that renders the PDF to a
| raster format will produce an OCR-able document.
|
| If you're a divorce attorney who used this to convert
| documents in response to a discovery request, and the
| opposing side had a valid reason for needing the
| unobfuscated text, then you're probably going to end up
| having a nice conversation with the judge about
| acceptable formats.
|
| Sending compressed TIFFs would probably be just as good.
| A bit larger file sizes, but it would be just as
| effective at stopping automated scraping of text. Also,
| less likely to piss off a judge. Any opposing firm that
| would be sophisticated enough to automate scraping the
| text from a normal PDF would be able to OCR these files
| just as easily.
|
| Or maybe you have a second site that sells the decoder,
| so you get to sell to both sides. Not a bad business
| model, if you can work it.
| bambax wrote:
| I don't know why you think divorce attorneys are stupid.
| Some are probably very well versed in tech; those who
| aren't know others who are. They won't simply sit there
| and think "oh, for some reason I can't copy-paste from
| that PDF, better give up the case then".
|
| ... And most attorneys simply print documents. Once the
| PDF is on paper, OCR-ing it back into text is just one
| scanner away.
| Semaphor wrote:
| I'm not sure what reader you are talking about, but that
| button is most certainly not doing any kind of OCR if
| your technique stops it.
| bil7 wrote:
| I understand that. And I understand your personal use
| case was valid. But I think your "Human Eyes Only" name
| and domain is a little deceptive.
| boloust wrote:
| > I didn't realize that there are (common?) tools to OCR
| screenshots.
|
| Retrieving text from images is literally the definition of
| OCR.
| solardev wrote:
| It's actually much easier than you think.
|
| You don't need any scripts, just Acrobat itself (or any
| comparable PDF viewer) can do this. Export the PDF to images,
| make a new PDF out of the images, scan the text, done.
|
| Example (took your example and did just that with it, now
| everything can be copied & pasted as normal text):
| https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf
|
| In general, if it LOOKS like text, SOMETHING can OCR it.
| That's the whole point of OCR. If you want to try to block
| OCR, you need something like CAPTCHAs, and that's getting
| less and less effective every day. In fact many are already
| more easily solved by computers than humans.
| bambax wrote:
| Of course. The OP doesn't understand what "OCR" actually
| means.
| 3jckd wrote:
| Nothing that you said is wrong but it doesn't make the
| situation better.
|
| 1) As many people pointed out, this doesn't prevent OCR, it
| just prevents copying strings (e.g. with crawlers).
|
| 2) The majority of OCR doesn't deal with PDFs produced from a
| text source, but with either a) JPG scans of documents or b)
| PDFs produced from those JPG scans.
|
| 3) The first thing I tried was OCR with my iPhone, and it
| obviously worked. As someone else said, there are solutions
| that let you batch process many documents.
|
| Don't get me wrong, your stuff works for what you designed it
| to. However, it provides <false sense of security> by
| <falsely> claiming that it prevents OCR; which in turn, can
| lead to more harm[1].
|
| [1] - e.g., it may convince people to share stuff that they
| wouldn't otherwise.
| thrown_22 wrote:
| >I'm 1000% sure there are gurus who could whip up a script to
| overcome this. But it's kind of one of those things where you
| don't have to outrun the bear, you have to outrun your friend
| running next to you. It makes your sensitive documents just
| that much less likely to be scanned/found.
|
| Security through obscurity is stupid:
|
|     gs -sDEVICE=tiffscaled24 -dNOPAUSE -dBATCH -dSAFER \
|        -sOutputFile=filename.tiff \
|        filename.pdf
|     tesseract filename.tiff filename.txt
|
| All you need is Ghostscript and Tesseract. Both are an
| apt-get away.
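
For a multi-page document, the same two tools can be driven from a short
script. The sketch below is an assumption-laden illustration rather than
anything posted in the thread: the function names, the `page-%03d.tiff`
naming pattern, and the 300 dpi default are hypothetical choices, and it
expects `gs` and `tesseract` to be on the PATH of a POSIX-style system.

```python
import subprocess
from pathlib import Path


def render_command(pdf, out_pattern, dpi=300):
    """Build the Ghostscript call that rasterizes each page to its own TIFF."""
    return [
        "gs", "-sDEVICE=tiffscaled24", f"-r{dpi}",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        f"-sOutputFile={out_pattern}",  # %03d in the pattern becomes the page number
        str(pdf),
    ]


def ocr_command(image):
    """Build the Tesseract call; it appends ".txt" to the output base itself."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix(""))]


def ocr_pdf(pdf, workdir):
    """Rasterize every page of `pdf` into `workdir`, then OCR each page image."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, str(workdir / "page-%03d.tiff")), check=True)
    for image in sorted(workdir.glob("page-*.tiff")):
        subprocess.run(ocr_command(image), check=True)
```

Calling `ocr_pdf("filename.pdf", "out")` would leave one text file per page
next to the page images. Because the glyphs are rasterized before
recognition, the substituted code points in the PDF never come into play.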
| SigmundA wrote:
| >It is that type of OCR that my app is resistant to.
|
| There is no form of OCR this is resistant to. Simply change
| the description to be accurate and remove the references to
| being OCR resistant, as they are false.
| ctoth wrote:
| I am a blind user using an extension to my screen reader which
| (under the covers) uses the Windows 10 built-in OCR. Your sample
| document gives me:
|
| INTRODUCTION The Cybersecurity and Infrastructure Security Agency
| (CISA) is committed to leading the response to cybersecurity
| incidents and vulnerabilities to safeguard the nation's critical
| assets. Section 6 of Executive Order 14028 directed DHS, via
| CISA, to "develop a standard set of operational procedures
| (playbook) to be used in planning and conducting cybersecurity
| vulnerability and incident response activity respecting Federal
| Civilian Executive Branch (FCEB) Information Systems." I Overview
| This document presents two playbooks: one for incident response
| and one for vulnerability response. These playbooks provide FCEB
| agencies with a standard set of procedures to identify,
| coordinate, remediate, recover, and track successful mitigations
| from incidents and vulnerabilities affecting FCEB systems, data,
| and networks. In addition, future iterations of these playbooks
| may be useful for organizations outside of the FCEB to
| standardize incident response practices. Working together across
| all federal government organizations has proven to be an
| effective model for addressing vulnerabilities and incidents.
| Building on lessons learned from previous incidents and
| incorporating industry best practices, CISA intends for these
| playbooks to evolve the federal government's practices for
| cybersecurity response through standardizing shared practices
| that bring together the best people and processes to drive
| coordinated actions.
|
| Pretty sure this doesn't actually work.
| viggity wrote:
| Very interesting. The Windows 10 screen reader consumes the
| raster data of PDFs for OCR and not the code point data embedded
| within the PDF. People here have been on my ass about saying
| "OCR resistant" and I get where they are coming from. I've
| primarily been testing the various "OCR" functionalities built
| into the various PDF readers out there. The "OCR" that 98% of
| laypeople are going to rely on. I always knew that exporting to
| an image-based PDF wouldn't be defeated. If a human can read
| it, a machine can read it. It's just that most PDF readers
| aren't set up to do it. Out of curiosity, when you use your
| screen reader on my website, does the <textarea> read and/or
| start with "Name: Satoshi Nakamoto"?
| ctoth wrote:
| I can see that the content in the textareas is a bunch of Unicode
| glyphs that aren't mapped to speakable characters and, when I
| perform a "read all" action, mostly render as question marks.
| ksaj wrote:
| I'm not very convinced by the PDF idea, but web fonts done this
| way would be great for the parts of web pages you don't want
| scraped or collected by search engines, if it is on pages where
| you do want at least some of the content available to search
| engines.
| viggity wrote:
| The anti-scraping thing is a good idea. Hell, you could poison
| it very lightly using homoglyphs (Greek capital Epsilon instead
| of just an E) just to see where else on the internet your data
| ends up, too.
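
The homoglyph idea can be sketched in a few lines. This is a hedged
illustration only: `HOMOGLYPHS`, `watermark`, and `find_marks` are
hypothetical names, and the table is limited to three Greek capitals that
look like their Latin counterparts in most fonts.

```python
# Latin capital -> visually similar Greek capital (Epsilon, Alpha, Omicron).
HOMOGLYPHS = {"E": "\u0395", "A": "\u0391", "O": "\u039f"}


def watermark(text, positions):
    """Swap the characters at the given indexes for their lookalikes.

    Positions that are out of range or hold a character with no lookalike
    in HOMOGLYPHS are left untouched.
    """
    chars = list(text)
    for i in positions:
        if 0 <= i < len(chars) and chars[i] in HOMOGLYPHS:
            chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


def find_marks(text):
    """Return the indexes where a lookalike character appears."""
    lookalikes = set(HOMOGLYPHS.values())
    return [i for i, ch in enumerate(text) if ch in lookalikes]
```

Giving each recipient a different set of positions would let a leaked copy
be traced back by running `find_marks` on it, as long as the copy was taken
from the text layer rather than re-typed or OCR'd.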
| solardev wrote:
| Hmm... interesting in theory, but take a screenshot and it's
| trivially bypassed. Try it yourself here:
| http://www.structurise.com/screenshot-ocr/
| viggity wrote:
| I posted some more info here:
| https://news.ycombinator.com/item?id=32003066
| rst wrote:
| Which is not what I'd expect from anything that claims to be
| "OCR resistant". It's not at all clear what they mean by that.
| solardev wrote:
| I think the OP, while well-intentioned, did not really
| understand how OCR works. Follow-up convo in a separate
| thread here: https://news.ycombinator.com/item?id=32003066
|
| What this blocks is not OCR but casual copy & pasting (and
| search engine indexing)
| RajT88 wrote:
| I think it works for the use case - where documents can be
| provided for discovery, but if posted online won't have the
| content indexed by search engines.
|
| The various legal teams involved are unlikely to ever be the
| wiser. Or will they?
|
| Won't this print out a pile of gibberish? Hard copies are
| rather important in the courts. Somebody is going to complain
| about what was provided in that case.
| londons_explore wrote:
| > his ex was going to leak all of his discovery documents on the
| internet
|
| If that happens, I suspect he'd have a very strong case to win
| custody...
| viggity wrote:
| he got what he wanted out of the case, so good for him. The
| problem is that he just didn't know if it was going to happen.
| And she could definitely get sanctioned for it. But his info
| would still be out there.
| mbreese wrote:
| To me, this falls under the category of -- you can't have a
| technical solution to a societal problem. Yes, technology may
| have made your friend feel better. But the actual thing that
| protected him was the law, not the obfuscated PDFs.
|
| But, if the judge didn't care and it made your friend feel
| better, who am I to judge? But this isn't a great protection
| scheme... it just adds a few extra technical hurdles that are
| easy to get around.
| forgotpwd16 wrote:
| Not sure OCR is the correct term here. OCR specifically means
| extracting text from an image. This approach doesn't protect
| against that. Better options might be "machine obfuscated" or
| "scrape resistant".
| donkarma wrote:
| took me a while to realise that my font settings on firefox break
| this
| waynesonfire wrote:
| brilliant solution.
| oxff wrote:
| Just give me readable papers instead. Such a pain in the dick
| format, yet it's the only thing there is.
| dodo6502 wrote:
| As an author of a PDF library this is hilarious, because the
| number of bugs I have received over the years where this is
| unintentionally happening is quite high.
| solardev wrote:
| > As such, when you try to OCR these files, it doesn't see any
| images and can't convert it.
|
| That isn't true. Acrobat might skip parts of the PDF that it
| thinks are already text/glyphs, but it's trivial to get around
| that by either using other OCR software or just printing the PDF
| to a raster image first. Example:
| https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf
|
| Still, though, for the purposes of obscuring these from
| bots/crawlers... a lot better than nothing!
| s1mon wrote:
| MacOS does OCR on this just fine. Screenshot, open in Preview,
| and select, copy, paste:
|
| Name: Satoshi Nakamoto DOB: 1982-06-05 SSN: 958-20-3141 Cell
| Phone: 514-867-5309
| layer8 wrote:
| You could use the ZXX typeface to defy OCR:
| https://walkerart.org/magazine/sang-mun-defiant-typeface-nsa...
|
| It's probably still not ML-resistant.
| rafram wrote:
| You could just train an OCR engine on that typeface. IIRC
| training Tesseract for a new font is quite trivial.
| peanut_worm wrote:
| This seems like it would just be annoying and would not even work
| for most purposes. Kind of neat though.
| nemothekid wrote:
| I thought this was going to be some adversarial neural network.
|
| Unfortunately OP, I don't think your solution even works for your
| intended use case; Google already does OCR (actual OCR, not just
| parsing text) in Images. I use it in Gmail quite often.
| Regardless the implementation is quite neat and will surely
| thwart less advanced indexers.
| social_quotient wrote:
| Does it ocr on live text pdfs or just pdfs that have text in
| images/flattened?
| ben_w wrote:
| Likewise iOS, text in screenshots is selectable and in this
| case it is recognised correctly.
| nmstoker wrote:
| Adding to this, it's trivial to get the human readable text on
| a Google phone:
|
| switch between apps and pick the app with the text but don't
| jump back into the app yet, select text and you can immediately
| copy the text out.
|
| It's yielding the visible text, so it must be OCR'ing the image
| (works offline too).
|
| When I copy directly from the example on the web page, using
| the non-OCR method, that does give the messed-up text, but not
| when done the way above.
|
| " Phone: 514-867-5309" was copied out easily (can't be bothered
| to go back and get the "Cell" bit; I was just inaccurate when
| copying, I'm sure it works!)
| viggity wrote:
| Thank you to everyone in this thread who realizes it was never
| meant to be perfect and appreciates it for what it does do!
| SilasX wrote:
| The point isn't that it has flaws, but that its description
| is wrong. "Non-human eyes" -- normally understood to be OCR
| -- read it just fine. I think most of us were expecting
| something that disrupts "computer eyes" (e.g. because of
| deceiving overly narrow "tricks" that neural networks use to
| identify characters) but left it readable for the typical
| human (like an easy Captcha).
|
| A more accurate (and helpful!) description of the problem
| you're solving is that this disrupts _text parsers_. That is,
| any program that just reads this in as text won't see the
| "real" letters (unless it's been pre-programmed with a
| specific reverser, etc.) and thus will frustrate, say, text
| search.
|
| Which, on that note, I notice elsewhere you mention this
| being a solution applied to document submission in legal
| proceedings. In that case, one side might wish to run text
| searches and assume the documents are compatible with that,
| so this could be viewed as non-compliance with a judge's
| orders. FYI.
| peetah wrote:
| Funny, I did this for the web something like 7 or 8 years ago,
| under the name "cprotext", but was unable to find a way to sell
| it as SaaS :)
|
| There should be a WordPress plugin floating around somewhere
| called wp-cprotext, and maybe one or two demo websites whose
| URLs I can't even remember.
|
| People came with the same criticisms we can read here: evil
| DRM, accessibility nightmare, and easily bypassed by OCR. All
| in all, I came to be quite convinced by these criticisms,
| especially the first and second, and shut it down completely.
|
| I would genuinely be interested to see how you'll succeed where I
| failed ! good luck !
___________________________________________________________________
(page generated 2022-07-06 23:01 UTC)