[HN Gopher] Unredacter: Never use pixelation as a redaction tech...
       ___________________________________________________________________
        
       Unredacter: Never use pixelation as a redaction technique
        
       Author : linker3000
       Score  : 159 points
       Date   : 2022-12-17 19:58 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | danbruc wrote:
       | This also seems like something a [convolutional] neural net
       | should be able to learn pretty well and generating training data
       | could be fully automated. Throw everything at it - different
       | fonts, different sizes, bold, italic, different colors, different
       | anti-aliasing methods, sub pixel offsets, pixelation sizes and
       | offsets, JPEG noise, rotations, ... - and maybe also give it some
       | language model to better reason about plausible letter
       | combinations. I would love to see, how good this could be.
        
         | buildbot wrote:
         | Already been done! https://github.com/HypoX64/DeepMosaics
        
       | dang wrote:
       | Related:
       | 
       |  _Don't use text pixelation to redact sensitive information_ -
       | https://news.ycombinator.com/item?id=30350626 - Feb 2022 (163
       | comments)
        
       | epgui wrote:
       | I think the animation makes it pretty clear how this works... But
       | I do wonder how the baseline coordinates are established?
       | 
       | In more realistic applications you'd have to deal with things
       | like text that is not exactly aligned with pixel boundaries, and
       | different anti-aliasing methods.
        
         | kleiba wrote:
         | And different fonts.
        
           | kamray23 wrote:
           | With such heavy pixellation being overcome, you'd have to
           | have a _seriously_ different font to avoid the general
           | patterns arising from basic letter shapes. I think simply
           | guessing  "times" vs "helvetica" brings you close enough to
           | guess the letters. Optimize score by shuffling the baselines
           | around a bit and go through all the letters to find the right
           | ones and you should be good. Work in blocks of 3-5 letters
           | while shuffling them around and stretching the spacing, you
           | might even have a fairly good guess.
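
The guess-and-score loop described above can be sketched on a toy scale. This is a minimal illustration, not the linked tool's code: the 3x4 bitmap font, the four-letter alphabet, and the names `render`, `pixelate`, `score` and `unredact` are all made up for the example, and it assumes the font, block size and alignment are already known.

```python
from itertools import product

# Hypothetical 3x4 bitmap font for a four-letter alphabet (illustration only).
GLYPHS = {
    "a": [".#.", "#.#", "###", "#.#"],
    "b": ["##.", "###", "#.#", "##."],
    "c": [".##", "#..", "#..", ".##"],
    "d": ["##.", "#.#", "#.#", "##."],
}
HEIGHT = 4


def render(text):
    """Render text to rows of 0/1 pixels, with one blank column after each glyph."""
    rows = [[] for _ in range(HEIGHT)]
    for ch in text:
        for y, line in enumerate(GLYPHS[ch]):
            rows[y].extend(1 if px == "#" else 0 for px in line)
            rows[y].append(0)
    return rows


def pixelate(image, block=2):
    """Downsample: one value per block x block tile, the mean of its pixels."""
    out = []
    for y0 in range(0, len(image), block):
        out_row = []
        for x0 in range(0, len(image[0]), block):
            tile = [px for row in image[y0:y0 + block] for px in row[x0:x0 + block]]
            out_row.append(sum(tile) / len(tile))
        out.append(out_row)
    return out


def score(a, b):
    """Sum of squared differences between two pixelated images (lower is better)."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def unredact(target, length, block=2):
    """Render and pixelate every candidate string, and keep the closest match."""
    candidates = ("".join(c) for c in product(GLYPHS, repeat=length))
    return min(candidates, key=lambda s: score(pixelate(render(s), block), target))


secret_image = pixelate(render("dcab"))
print(unredact(secret_image, length=4))  # prints: dcab
```

Exhaustive search only works here because the alphabet and the string are tiny. With a real font, the baseline shuffling and spacing adjustments mentioned above would become extra search dimensions.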
        
             | [deleted]
        
         | _aavaa_ wrote:
         | I think the answer is that in most applications you'd have a
         | much larger screenshot from which you can glean this
         | information.
         | 
         | E.g. a page to do font recognition on, or even to figure out
         | what software it is (e.g. Gmail) and look up what font that
         | software uses.
        
       | sly010 wrote:
       | Is it just me, or should this have been called "enhance"?
        
         | pvorb wrote:
         | I don't get it, so it might be just you.
        
           | nodja wrote:
           | https://www.youtube.com/watch?v=hHwjceFcF2Q
        
             | pvorb wrote:
             | Ok, telling from the downvotes, everyone knows this scene
             | from Blade Runner. So it's just me then, I guess.
        
               | roywiggins wrote:
               | In that case, I have a tremendous supercut for you:
               | https://www.youtube.com/watch?v=LhF_56SxrGk
        
               | Anarch157a wrote:
               | Nothing beats this one:
               | https://www.youtube.com/watch?v=6i3NWKbBaaU
        
               | danuker wrote:
               | Amazing!
        
               | danuker wrote:
               | It's a common trope in a lot of movies/series, as the
               | other comments illustrated.
        
             | Zizizizz wrote:
             | https://youtu.be/gF_qQYrCcns
        
               | MonkeyMalarky wrote:
               | https://youtu.be/oeEvZ8WHSvY
        
       | thomasqbrady wrote:
       | Looking closely, this only works if you know the font name, size,
       | and weight used, or at least can guess it, manually, before
       | feeding the pixelated version into the tool? Still quite fun, but
       | not as scary as the headline made it sound...
        
         | if_by_whisky wrote:
         | Guessing is actually easy. For the kinds of files that end up
         | as redacted pdfs (legal, government, etc), there's probably 5-8
         | font options that make up 98% of documents. Sizes and weights
         | are immediately recognizable to the slightly trained eye. I'm
         | pretty sure I could guess all 3 attributes at a glance.
        
           | happyopossum wrote:
           | Or just look at the unredacted text around it and use that.
           | Nobody is changing fonts on text before pixelation.
        
         | mmoskal wrote:
         | Often only parts of text are pixelated.
        
       | JJMcJ wrote:
       | Speaking of pixelation.
       | 
       | If you're nearsighted taking your glasses off can blur the
       | pixelation of a face so that you can get a pretty good idea of
       | the person's appearance, especially if you are just trying to
       | guess if it's really a specific person you are already familiar
       | with.
        
         | sircastor wrote:
         | I worry this often leads to a sort of observation bias, where
         | people assign their expectation to the situation. If a video is
         | blurry, it's not hard to convince an audience that the face in
         | the video is a particular person. Human brains are really great
         | at filling in details, whether they're accurate or not.
        
           | ductsurprise wrote:
           | No doubt...
           | 
           | Now, pair that with a Prosopagnosia disorder... Right back
           | where you started?
        
           | matsemann wrote:
           | We have this advent calendar at work where one person on the
           | team posts a pixelated scene from a Christmas movie and we
           | guess which one it is. Very hard, but when they later post
           | the non-pixelated version it looks so obvious. Then if I look
           | at the pixelated version again I can now "clearly" see what
           | it is.
        
       | TedDoesntTalk wrote:
       | Can the same be done to blurred text?
        
         | ruuda wrote:
         | Yes, the process is called deconvolution, and it works even
         | better than trying to bruteforce the input image of pixelated
         | text.
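
As a rough illustration of why a known blur is reversible, here is a one-dimensional sketch. It assumes the blur kernel is known exactly and the blurred data is noise-free, which real screenshots are not. The function names are made up for the example. Practical deconvolution uses regularized methods such as Wiener or Richardson-Lucy deconvolution instead of this naive division.

```python
def convolve(signal, kernel):
    """Full 1-D convolution: every input sample is smeared across the kernel."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def deconvolve(blurred, kernel):
    """Undo convolve() by solving for one sample at a time (needs kernel[0] != 0)."""
    n = len(blurred) - len(kernel) + 1
    signal = []
    for i in range(n):
        acc = blurred[i]
        # Subtract the contributions of the samples already recovered.
        for j in range(1, len(kernel)):
            if i - j >= 0:
                acc -= kernel[j] * signal[i - j]
        signal.append(acc / kernel[0])
    return signal


kernel = [0.25, 0.5, 0.25]          # small blur kernel, assumed known
original = [0.0, 1.0, 1.0, 0.0, 1.0]
blurred = convolve(original, kernel)
recovered = deconvolve(blurred, kernel)
print(recovered)
```

The step-by-step division amplifies any noise in the input, which is one reason this stays a sketch and is not a usable attack on real images.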
        
       | d--b wrote:
       | I use pixelation, but replace the text with dirty words before
       | pixelating. Just in case someone uses one of these tools.
        
         | akiselev wrote:
         | I have a script to replace them with bits of H.P. Lovecraft so
         | that anyone who unredacts the document can peer into the depths
         | of madness.
        
       | gareth_untether wrote:
       | There's a lot of pixelated photos and documents on the net just
       | waiting to be viewed with a fresh light.
        
         | CyborgCabbage wrote:
         | Photos won't work with this, it works here because text is
         | discrete and provides a limited search space.
        
           | teeray wrote:
           | "Attacks always get better, they never get worse" --Bruce
           | Schneier
        
             | bawolff wrote:
             | Yes, but that doesn't mean this particular attack is
             | relevant to the problem space.
        
           | zmgsabst wrote:
           | What makes it possible to search over font characters but
           | not, eg, convolution tiles?
           | 
           | Photo identification is done with a limited search space of
           | "tiles" that the image is decomposed into, for convolutional
           | NNs.
        
         | weego wrote:
         | The obvious problem is the relative scale of the pixelation to
         | the underlying redacted material. The demo works because the
         | scale encodes enough data of each character and its
         | relationship to its neighbours. My guess would be even a 25%
         | larger pixelation block would scrub too much for it to be at
         | all reliable.
        
       | bscphil wrote:
       | You could probably do the same for Gaussian blur too, right? At
       | least if you could get a reasonable guess for the parameters of
       | the blur. Any case of obfuscation where plaintext and ciphertext
       | characters have a 1:1 relationship should be trivial to undo.
       | 
       | If you need redaction, black it out completely.
        
       | radarsat1 wrote:
       | You can just randomize the string and pixelize that, it should
       | look roughly the same aesthetically without leaking any
       | information.
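
One way to read the suggestion above is to swap the secret for a random string of the same shape before pixelating. The sketch below is only an assumption about what "randomize" could mean. The helper name `random_placeholder` is made up, and it keeps the length, letter case, and punctuation, so those properties of the original still leak.

```python
import secrets
import string


def random_placeholder(text):
    """Return a random string with the same length and character classes as text."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(secrets.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(secrets.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(secrets.choice(string.digits))
        else:
            # Keep spaces and punctuation so the pixelated layout looks similar.
            out.append(ch)
    return "".join(out)


print(random_placeholder("Hunter2 is my password!"))
```

Pixelating the placeholder in place of the real text means a recovery tool can at best recover the placeholder.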
        
         | jahnu wrote:
         | Or write something funny/rude/misleading
        
           | teaearlgraycold wrote:
           | hunter2
        
             | Murfalo wrote:
             | What is funny or rude about *******? Is this supposed to be
             | some 7-letter curse word or something? I don't get it.
        
               | icepat wrote:
               | HN automatically hides your password when you input it
               | into the comments, like mine is ******, to me it shows up
               | as ******, but to you you see ******.
        
               | eitland wrote:
               | It is a joke to see who will post their passwords.
        
               | cafeinux wrote:
               | So if I type Meatmybeat*123 you just see asterisks? That's
               | nice.
        
               | smilebot wrote:
               | Yep, all I see is *******
        
               | sircastor wrote:
               | Does it do this predictively or intuitively somehow? How
               | does it know your password? Is it keeping it as clear
               | text on the client side?
        
               | brokensegue wrote:
               | it hashes all the substrings you type to check for
               | collisions, e.g. *****
        
               | thedorkknight wrote:
               | It probably checks each token's hash for a match. Hold
               | on, trying it now: *****
        
             | [deleted]
        
             | loloquwowndueo wrote:
             | http://bash.org/?244321 Lol
        
               | godsfshrmn wrote:
               | Oh man. Top 100 link at bottom is great. I just spent a
               | good 15 minutes laughing
        
         | bagels wrote:
         | By randomize, I assume you don't mean shuffle the characters,
         | but replace it with an entirely different string?
        
       | poglet wrote:
       | I wonder if this could be used to reveal text that is too small
       | or out of focus.
       | 
       | Also the example seems to go through one letter at a time, once
       | the pixels become larger it might be required to cycle through
       | entire words.
        
         | lucgommans wrote:
         | > once the pixels become larger it might be required to cycle
         | through entire words.
         | 
         | That's a clever idea for speeding up this process actually.
         | Instead of guessing a-z for every position, only guess valid
         | word completions. If it doesn't end up giving a good score or
         | the human judges it to be nonsensical, it can fall back to a-z
         | guessing for that word.
         | 
         | A bit similar to my hangman solver
         | (https://lucgommans.nl/p/hangman-solver/), which looks for the
         | only words still possible with the given letters already known,
         | but simpler because you only need prefix matching.
         | 
         | I know you meant it for very strong blurs, and there it is not
         | actually an advantage because you need to go through thousands
         | of words before guessing one (instead of 26x5≈130 guesses,
         | assuming an average word length of ~5), but yeah there you'd
         | have no other choice.
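
The prefix-matching idea can be sketched in a few lines. The word list and the name `next_letters` are illustrative assumptions: given the letters recovered so far, only letters that continue some dictionary word are tried first.

```python
def next_letters(prefix, words):
    """Letters that can follow prefix in at least one word of the list."""
    return sorted(
        {w[len(prefix)] for w in words if w.startswith(prefix) and len(w) > len(prefix)}
    )


words = ["cat", "car", "cart", "dog"]
print(next_letters("ca", words))   # prints: ['r', 't']

# Per-letter guessing for a 5-letter word tries at most 26 * 5 candidates,
# while whole-word guessing may have to try every word in the list.
print(26 * 5)                      # prints: 130
```

If no candidate scores well, the caller would fall back to trying all of a-z for that position, as described above.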
        
       | morpheuskafka wrote:
       | Is there any approach to guessing a single redacted word/phrase
       | where the length of the word is revealed? (Such as in a PDF where
       | the rectangle is automatically drawn around the characters.)
       | 
       | I've wanted to build something that would if nothing else run
       | through a list of guesses (assuming the font is the same as
       | surrounding text) and see if any of them could match size-wise,
       | but not sure of an easy way to deal with the PDF part of it.
        
         | lucb1e wrote:
         | > not sure of an easy way to deal with the PDF part of it.
         | 
         | If you aren't doing this so often that you need automation, you
         | could sidestep that issue by just taking screenshots at 400%
         | zoom or so and accurately measure how many pixels each letter
         | (a-z) takes, as well as how many pixels a space is (the
         | censored part might extend into the spacing on each side of the
         | word), and measure how many pixels the gap is between the words
         | surrounding the censored part, then
         | 
         |     for word in wordlist do
         |       # Start by accounting for the spaces
         |       wordsize = charwidths[' '] * 2
         |       for char in word do
         |         wordsize += charwidths[char]
         |       done
         |       if wordsize == gap_size then
         |         print("Possible word: " + word)
         |       endif
         |     done
         | 
         | Probably want to do the gap_size +/- 1 or so, but that's how
         | I'd approach this for a given document. A starting point for a
         | wordlist on many linux systems could be
         | `/usr/share/dict/words`.
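
A runnable Python version of the pseudocode above might look like the following. The character widths and the gap size are made-up numbers for illustration, where real ones would come from measuring the screenshot. The `tolerance` parameter implements the suggested +/- 1 slack.

```python
def matching_words(wordlist, charwidths, gap_size, tolerance=1):
    """Return the words whose rendered width (plus two spaces) fits the gap."""
    matches = []
    for word in wordlist:
        # Skip words containing characters that were never measured.
        if any(ch not in charwidths for ch in word):
            continue
        # Start with the two spaces around the word, then add each letter.
        width = 2 * charwidths[" "] + sum(charwidths[ch] for ch in word)
        if abs(width - gap_size) <= tolerance:
            matches.append(word)
    return matches


# Hypothetical pixel widths measured at high zoom (not from any real font).
charwidths = {" ": 4, "a": 7, "i": 3, "l": 3, "m": 11, "t": 5}
wordlist = ["mat", "lit", "tilt", "mama", "xyz"]
print(matching_words(wordlist, charwidths, gap_size=31))  # prints: ['mat']
```

This ignores kerning and sub-pixel positioning, so treat the output as a shortlist, not an answer.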
        
       | crazygringo wrote:
       | First of all, very cool little tool. I always suspected you could
       | do this but it's really neat to watch it go.
       | 
       | But now just in response to the title... _does_ anybody ever use
       | pixelation as a redaction technique?
       | 
       | I feel like I've only ever seen pixelation for censoring nudity
       | or a brand name or face or something. Actual redaction of text
       | truly meant to be kept secret, I've only ever seen as black bars.
       | 
       | Are there notable cases where people have genuinely tried to
       | redact something secret using pixelation?
        
         | czx4f4bd wrote:
         | Bear in mind that "redaction" just means any effort to protect
         | sensitive information by obscuring text. I don't think I've
         | seen pixelation used to redact text in any
         | corporate/governmental context, but I've definitely seen people
         | use it on sites like Reddit or Twitter to try to hide their
         | name or other info.
         | 
         | I've also seen people try to redact sensitive information by
         | poorly scribbling it out using the pencil tool on their phone,
         | leaving enough parts of letters visible to guess what was
         | originally there, or try to "black out" text using a brush that
         | wasn't fully opaque, allowing it to be revealed by adjusting
         | the contrast. Basically, a lot of people don't know how to
         | safely redact stuff.
         | 
         | I don't think pixelation is commonly used because most people
         | probably don't know how to use it, but my Samsung phone's
         | built-in image editor also has a pixelation feature, so I
         | honestly expect to see it pop up more often in the future.
        
           | zmgsabst wrote:
           | This is somewhere the simplicity of old MS paint and BMPs was
           | great:
           | 
           | A colored square is easy -- and destructively removes what
           | was there, by setting that part of the image to a chosen
           | color.
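
The destructive fill described above can be shown on a plain pixel grid. This is a sketch with a made-up `black_out` helper that works on nested lists. A real image library would be used in practice, but the point is the same: the old pixel values are overwritten and nothing derived from them remains in the image data.

```python
def black_out(image, left, top, right, bottom, color=0):
    """Overwrite the rectangle [left, right) x [top, bottom) in place."""
    for y in range(top, bottom):
        for x in range(left, right):
            image[y][x] = color
    return image


image = [
    [1, 2, 3],
    [4, 5, 6],
]
black_out(image, left=1, top=0, right=3, bottom=1)
print(image)  # prints: [[1, 0, 0], [4, 5, 6]]
```

This only holds if the result is saved as a flat image. Formats that keep layers or edit history may still carry the original pixels.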
        
       ___________________________________________________________________
       (page generated 2022-12-17 23:00 UTC)