[HN Gopher] Unredacter: Never use pixelation as a redaction tech...
___________________________________________________________________
Unredacter: Never use pixelation as a redaction technique
Author : linker3000
Score : 159 points
Date : 2022-12-17 19:58 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| danbruc wrote:
| This also seems like something a [convolutional] neural net
| should be able to learn pretty well and generating training data
| could be fully automated. Throw everything at it - different
| fonts, different sizes, bold, italic, different colors, different
| anti-aliasing methods, sub pixel offsets, pixelation sizes and
| offsets, JPEG noise, rotations, ... - and maybe also give it some
| language model to better reason about plausible letter
| combinations. I would love to see, how good this could be.
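A minimal sketch of the "generating training data could be fully automated" idea: sample a randomized description of each training example and hand it to a renderer. Everything here is an assumption for illustration (the `SampleSpec` fields, the `FONTS` list, the parameter ranges); the renderer that would turn a spec into a pixelated image is left out, and the text itself is the label.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleSpec:
    text: str            # ground-truth label for the example
    font: str
    size_pt: int
    bold: bool
    italic: bool
    block_px: int        # pixelation block size
    offset_x: int        # where the pixelation grid starts, in pixels
    offset_y: int
    jpeg_quality: int
    rotation_deg: float

# Assumed font list; a real generator would enumerate installed fonts.
FONTS = ["Times New Roman", "Helvetica", "Arial", "Courier New"]

def sample_spec(rng, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Draw one random combination of text and rendering parameters."""
    block = rng.choice([4, 6, 8, 12, 16])
    return SampleSpec(
        text="".join(rng.choice(alphabet) for _ in range(rng.randint(4, 12))),
        font=rng.choice(FONTS),
        size_pt=rng.randint(8, 24),
        bold=rng.random() < 0.2,
        italic=rng.random() < 0.2,
        block_px=block,
        offset_x=rng.randrange(block),   # grid offset is always smaller than a block
        offset_y=rng.randrange(block),
        jpeg_quality=rng.randint(30, 95),
        rotation_deg=rng.uniform(-2.0, 2.0),
    )

specs = [sample_spec(random.Random(seed)) for seed in range(3)]
```

Seeding each `random.Random` makes the dataset reproducible, which helps when comparing models trained on the same synthetic examples.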
| buildbot wrote:
| Already been done! https://github.com/HypoX64/DeepMosaics
| dang wrote:
| Related:
|
| _Don't use text pixelation to redact sensitive information_ -
| https://news.ycombinator.com/item?id=30350626 - Feb 2022 (163
| comments)
| epgui wrote:
| I think the animation makes it pretty clear how this works... But
| I do wonder how the baseline coordinates are established?
|
| In more realistic applications you'd have to deal with things
| like text that is not exactly aligned with pixel boundaries, and
| different anti-aliasing methods.
| kleiba wrote:
| And different fonts.
| kamray23 wrote:
| With such heavy pixellation being overcome, you'd have to
| have a _seriously_ different font to avoid the general
| patterns arising from basic letter shapes. I think simply
| guessing "times" vs "helvetica" brings you close enough to
| guess the letters. Optimize score by shuffling the baselines
| around a bit and go through all the letters to find the right
| ones and you should be good. Work in blocks of 3-5 letters
| while shuffling them around and stretching the spacing, you
| might even have a fairly good guess.
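The guess-and-score loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not the Unredacter code: `GLYPHS` is a made-up 3x5 bitmap font with three letters, the pixelation grid offset is assumed known (zero), and the search is exhaustive over a short chunk of known length. Font, size and baseline offsets would be extra loops around the same scoring function.

```python
from itertools import product

# Hypothetical 3x5 bitmap font (1 = ink), standing in for a real font renderer.
GLYPHS = {
    "a": ["010", "101", "111", "101", "101"],
    "b": ["110", "101", "110", "101", "110"],
    "c": ["011", "100", "100", "100", "011"],
}

def render(text, scale=4):
    """Draw text as a grayscale grid (1.0 = ink), each font pixel scaled up."""
    rows = []
    for r in range(5):
        line = []
        for ch in text:
            for bit in GLYPHS[ch][r] + "0":   # one blank column after each glyph
                line.extend([float(bit)] * scale)
        rows.extend(list(line) for _ in range(scale))
    return rows

def pixelate(img, block):
    """Return the mean of every block x block cell, scanning row by row."""
    h, w = len(img), len(img[0])
    means = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cell = [img[y][x]
                    for y in range(by, min(by + block, h))
                    for x in range(bx, min(bx + block, w))]
            means.append(sum(cell) / len(cell))
    return means

def score(guess, target, block):
    """Sum of squared differences between two pixelations (0 = identical)."""
    mine = pixelate(render(guess), block)
    return sum((a - b) ** 2 for a, b in zip(mine, target))

def best_guess(target, length, block):
    """Try every string of the given length and keep the closest match."""
    candidates = ("".join(p) for p in product(sorted(GLYPHS), repeat=length))
    return min(candidates, key=lambda g: score(g, target, block))

target = pixelate(render("cab"), block=8)   # the "redacted" image
recovered = best_guess(target, length=3, block=8)
```

In this toy font the three glyphs still average to distinct values at `block=8`, so the search has a unique zero-score answer. At `block=16` the glyphs for "a" and "b" average to the same values, so several candidates tie, which is the point where pixelation starts to destroy information rather than merely obscure it.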
| [deleted]
| _aavaa_ wrote:
| I think the answer is that in most applications you'd have a
| much larger screenshot from which you can glean this
| information.
|
| E.g. a page to do font recognition on, or even to figure out
| what software (e.g. Gmail) and to look up what font the
| software uses.
| sly010 wrote:
| Is it just me, or should this have been called "enhance"?
| pvorb wrote:
| I don't get it, so it might be just you.
| nodja wrote:
| https://www.youtube.com/watch?v=hHwjceFcF2Q
| pvorb wrote:
| Ok, telling from the downvotes, everyone knows this scene
| from Blade Runner. So it's just me then, I guess.
| roywiggins wrote:
| In that case, I have a tremendous supercut for you:
| https://www.youtube.com/watch?v=LhF_56SxrGk
| Anarch157a wrote:
| Nothing beats this one:
| https://www.youtube.com/watch?v=6i3NWKbBaaU
| danuker wrote:
| Amazing!
| danuker wrote:
| It's a common trope in a lot of movies/series, as the
| other comments illustrated.
| Zizizizz wrote:
| https://youtu.be/gF_qQYrCcns
| MonkeyMalarky wrote:
| https://youtu.be/oeEvZ8WHSvY
| thomasqbrady wrote:
| Looking closely, this only works if you know the font name, size,
| and weight used, or at least can guess it, manually, before
| feeding the pixelated version into the tool? Still quite fun, but
| not as scary as the headline made it sound...
| if_by_whisky wrote:
| Guessing is actually easy. For the kinds of files that end up
| as redacted pdfs (legal, government, etc), there's probably 5-8
| font options that make up 98% of documents. Sizes and weights
| are immediately recognizable to the slightly trained eye. I'm
| pretty sure I could guess all 3 attributes at a glance.
| happyopossum wrote:
| Or just look at the unredacted text around it and use that.
| Nobody is changing fonts on text before pixelation.
| mmoskal wrote:
| Often only parts of text are pixelated.
| JJMcJ wrote:
| Speaking of pixelation.
|
| If you're nearsighted taking your glasses off can blur the
| pixelation of a face so that you can get a pretty good idea of
| the person's appearance, especially if you are just trying to
| guess if it's really a specific person you are already familiar
| with.
| sircastor wrote:
| I worry this often leads to a sort of observation bias, where
| people assign their expectation to the situation. If a video is
| blurry, it's not hard to convince an audience that the face in
| the video is a particular person. Human brains are really great
| at filling in details, whether they're accurate or not.
| ductsurprise wrote:
| No doubt...
|
| Now, pair that with a Prosopagnosia disorder... Right back
| where you started?
| matsemann wrote:
| We have this advent calendar at work where one person on the
| team posts a pixelated scene from a christmas movie and we
| guess which one it is. Very hard, but when they later post
| the non-pixelated version it looks so obvious. Then if I look
| at the pixelated version again I can now "clearly" see what
| it is.
| TedDoesntTalk wrote:
| Can the same be done to blurred text?
| ruuda wrote:
| Yes, the process is called deconvolution, and it works even
| better than trying to bruteforce the input image of pixelated
| text.
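A minimal sketch of what deconvolution means, in one dimension and under idealized assumptions: the blur kernel is known exactly, there is no noise, and values are not rounded to 8 bits. The function names are illustrative. Real screenshots violate all three assumptions, which is why practical tools use regularized methods rather than the naive inversion shown here.

```python
def convolve(signal, kernel):
    """Full discrete convolution: every input sample is smeared by the kernel."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def deconvolve(blurred, kernel):
    """Undo a full convolution by forward substitution (needs kernel[0] != 0)."""
    n = len(blurred) - len(kernel) + 1
    signal = []
    for i in range(n):
        acc = blurred[i]
        # Subtract the contribution of samples that were already recovered.
        for j in range(1, len(kernel)):
            if i - j >= 0:
                acc -= kernel[j] * signal[i - j]
        signal.append(acc / kernel[0])
    return signal

row = [0, 0, 255, 255, 0, 255, 0, 0]     # one scanline of crisp text
kernel = [0.25, 0.5, 0.25]               # assumed, known blur kernel
blurred = convolve(row, kernel)
recovered = deconvolve(blurred, kernel)
```

Because blurring is a linear operation, knowing the kernel turns "unblurring" into solving a system of linear equations; nothing has to be guessed letter by letter.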
| d--b wrote:
| I use pixelation, but replace the text with dirty words before
| pixelating. Just in case someone uses one of these tools.
| akiselev wrote:
| I have a script to replace them with bits of H.P. Lovecraft so
| that anyone who unredacts the document can peer into the depths
| of madness.
| gareth_untether wrote:
| There's a lot of pixelated photos and documents on the net just
| waiting to be viewed with a fresh light.
| CyborgCabbage wrote:
| Photos won't work with this, it works here because text is
| discrete and provides a limited search space.
| teeray wrote:
| "Attacks always get better, they never get worse" --Bruce
| Schneier
| bawolff wrote:
| Yes, but that doesn't mean this particular attack is
| relevant to the problem space.
| zmgsabst wrote:
| What makes it possible to search over font characters but
| not, eg, convolution tiles?
|
| Photo identification is done with a limited search space of
| "tiles" that the image is decomposed into, for convolutional
| NNs.
| weego wrote:
| The obvious problem is the relative scale of the pixelation to
| the underlying redacted material. The demo works because the
| scale encodes enough data of each character and its
| relationship to its neighbours. My guess would be even a 25%
| larger pixelation block would scrub too much for it to be at
| all reliable.
| bscphil wrote:
| You could probably do the same for Gaussian blur too, right? At
| least if you could get a reasonable guess for the parameters of
| the blur. Any case of obfuscation where plaintext and ciphertext
| characters have a 1:1 relationship should be trivial to undo.
|
| If you need redaction, black it out completely.
| radarsat1 wrote:
| You can just randomize the string and pixelize that, it should
| look roughly the same aesthetically without leaking any
| information.
| jahnu wrote:
| Or write something funny/rude/misleading
| teaearlgraycold wrote:
| hunter2
| Murfalo wrote:
| What is funny or rude about *******? Is this supposed to be
| some 7-letter curse word or something? I don't get it.
| icepat wrote:
| HN automatically hides your password when you input it
| into the comments, like mine is ******, to me it shows up
| as ******, but to you you see ******.
| eitland wrote:
| It is a joke to see who will post their passwords.
| cafeinux wrote:
| So if I type Meatmybeat*123 you just see asterisks? That's
| nice.
| smilebot wrote:
| Yep, all I see is *******
| sircastor wrote:
| Does it do this predictively or intuitively somehow? How
| does it know your password? Is it keeping it as clear
| text on the client side?
| brokensegue wrote:
| it hashes all the substrings you type to check for
| collisions. E.g. *****
| thedorkknight wrote:
| It probably checks each token's hash for a match. Hold
| on, trying it now: *****
| [deleted]
| loloquwowndueo wrote:
| http://bash.org/?244321 Lol
| godsfshrmn wrote:
| Oh man. Top 100 link at bottom is great. I just spent a
| good 15 minutes laughing
| bagels wrote:
| By randomize, I assume you don't mean shuffle the characters,
| but replace it with an entirely different string?
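One way to read "randomize the string" is to replace the secret with an unrelated random string of the same shape before pixelating, roughly like this. `decoy_like` is a hypothetical helper; keeping length, case and punctuation positions is a design choice made here so the pixelated result looks similar, and it means those properties of the original are still visible.

```python
import secrets
import string

def decoy_like(secret):
    """Return a random string with the same length and character classes."""
    out = []
    for ch in secret:
        if ch.isdigit():
            out.append(secrets.choice(string.digits))
        elif ch.isupper():
            out.append(secrets.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(secrets.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # spaces and punctuation are kept as-is
    return "".join(out)

decoy = decoy_like("Hunter2 is my password")
```

Pixelate `decoy` instead of the real text. In a proportional font the decoy will usually not be exactly as wide as the original, so the replaced region should be sized to the decoy rather than to the secret.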
| poglet wrote:
| I wonder if this could be used to reveal text that is too small
| or out of focus.
|
| Also the example seems to go through one letter at a time, once
| the pixels become larger it might be required to cycle through
| entire words.
| lucgommans wrote:
| > once the pixels become larger it might be required to cycle
| through entire words.
|
| That's a clever idea for speeding up this process actually.
| Instead of guessing a-z for every position, only guess valid
| word completions. If it doesn't end up giving a good score or
| the human judges it to be nonsensical, it can fall back to a-z
| guessing for that word.
|
| A bit similar to my hangman solver
| (https://lucgommans.nl/p/hangman-solver/), which looks for the
| only words still possible with the given letters already known,
| but simpler because you only need prefix matching.
|
| I know you meant it for very strong blurs, and there it is not
| actually an advantage because you need to go through thousands
| of words before guessing one (instead of 26x5≈130 guesses,
| assuming an average word length of ~5), but yeah there you'd
| have no other choice.
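The prefix-matching idea can be sketched as below: given the letters recovered so far, only try letters that continue some dictionary word, and fall back to the whole alphabet when nothing matches. The function names and the tiny word list are made up for illustration.

```python
import string

def next_letters(prefix, words):
    """Letters that extend `prefix` toward at least one word in `words`."""
    return sorted({w[len(prefix)] for w in words
                   if w.startswith(prefix) and len(w) > len(prefix)})

def candidates(prefix, words):
    """Dictionary-guided guesses, falling back to a-z for unknown prefixes."""
    return next_letters(prefix, words) or list(string.ascii_lowercase)

words = ["secret", "secure", "second", "select", "server"]
after_se = candidates("se", words)     # ['c', 'l', 'r']
after_sec = candidates("sec", words)   # ['o', 'r', 'u']
fallback = candidates("xq", words)     # all 26 letters
```

With a real word list the first letter or two still need close to 26 guesses, but the candidate set tends to shrink quickly after that.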
| morpheuskafka wrote:
| Is there any approach to guessing a single redacted word/phrase
| where the length of the word is revealed? (Such as in a PDF where
| the rectangle is automatically drawn around the characters.)
|
| I've wanted to build something that would if nothing else run
| through a list of guesses (assuming the font is the same as
| surrounding text) and see if any of them could match size-wise,
| but not sure of an easy way to deal with the PDF part of it.
| lucb1e wrote:
| > not sure of an easy way to deal with the PDF part of it.
|
| If you aren't doing this so often that you need automation, you
| could sidestep that issue by just taking screenshots at 400%
| zoom or so and accurately measure how many pixels each letter
| (a-z) takes, as well as how many pixels a space is (the
| censored part might be into the spacing on each side of the
| word), and measure how many pixels the gap is between the words
| surrounding the censored part, then
|
|     for word in wordlist do
|       # Start by accounting for the spaces
|       wordsize = charwidths[' '] * 2
|       for char in word do
|         wordsize += charwidths[char]
|       done
|       if wordsize == gap_size then
|         print("Possible word: " + word)
|       endif
|     done
|
| Probably want to do the gap_size +/- 1 or so, but that's how
| I'd approach this for a given document. A starting point for a
| wordlist on many linux systems could be
| `/usr/share/dict/words`.
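A runnable version of the pseudocode above, with the suggested +/- 1 pixel tolerance. The character widths and word list are invented numbers for illustration; in practice they would come from measuring the zoomed screenshot and from a file such as `/usr/share/dict/words`.

```python
def matching_words(wordlist, charwidths, gap_size, tolerance=1):
    """Return words whose rendered width (plus a space on each side) fits the gap."""
    matches = []
    for word in wordlist:
        if any(ch not in charwidths for ch in word):
            continue  # skip words containing characters we did not measure
        # Start by accounting for the spaces, then add each letter's width.
        width = charwidths[" "] * 2 + sum(charwidths[ch] for ch in word)
        if abs(width - gap_size) <= tolerance:
            matches.append(word)
    return matches

# Hypothetical measurements in pixels at a fixed zoom level.
charwidths = {" ": 4, "a": 9, "c": 8, "d": 9, "i": 4,
              "l": 4, "m": 14, "o": 9, "t": 5}
words = ["cat", "dot", "mild", "atom", "loom", "it"]

loose = matching_words(words, charwidths, gap_size=45)               # ['atom', 'loom']
exact = matching_words(words, charwidths, gap_size=45, tolerance=0)  # ['atom']
```

The tolerance trades false negatives for false positives: measurement error of a pixel is easy to make, but each extra pixel of slack lets more words through. Kerning between specific letter pairs is ignored here.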
| crazygringo wrote:
| First of all, very cool little tool. I always suspected you could
| do this but it's really neat to watch it go.
|
| But now just in response to the title... _does_ anybody ever use
| pixelation as a redaction technique?
|
| I feel like I've only ever seen pixelation for censoring nudity
| or a brand name or face or something. Actual redaction of text
| truly meant to be kept secret, I've only ever seen as black bars.
|
| Are there notable cases where people have genuinely tried to
| redact something secret using pixelation?
| czx4f4bd wrote:
| Bear in mind that "redaction" just means any effort to protect
| sensitive information by obscuring text. I don't think I've
| seen pixelation used to redact text in any
| corporate/governmental context, but I've definitely seen people
| use it on sites like Reddit or Twitter to try to hide their
| name or other info.
|
| I've also seen people try to redact sensitive information by
| poorly scribbling it out using the pencil tool on their phone,
| leaving enough parts of letters visible to guess what was
| originally there, or try to "black out" text using a brush that
| wasn't fully opaque, allowing it to be revealed by adjusting
| the contrast. Basically, a lot of people don't know how to
| safely redact stuff.
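Why a not-quite-opaque brush fails can be shown with a few grayscale pixels. This is a sketch with assumed numbers (an 80% opaque black brush, 8-bit grayscale); `overlay_black` and `stretch_contrast` are illustrative names, not any image editor's API.

```python
def overlay_black(pixel, opacity):
    """Alpha-blend a black brush over a grayscale pixel (0..255)."""
    return round(pixel * (1 - opacity))

def stretch_contrast(pixels):
    """Linearly rescale so the darkest value maps to 0 and the brightest to 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0] * len(pixels)   # flat region: nothing left to amplify
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]

row = [255, 255, 0, 0, 255, 0, 255, 255]          # white paper, dark text
covered = [overlay_black(p, 0.8) for p in row]    # looks nearly black
revealed = stretch_contrast(covered)              # text is back

opaque = [overlay_black(p, 1.0) for p in row]     # fully opaque brush
nothing = stretch_contrast(opaque)                # all zeros
```

The semi-transparent brush only scales the pixel values down, so the difference between paper and ink survives and can be amplified again. A fully opaque fill maps every pixel to the same value, and no contrast adjustment can undo that.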
|
| I don't think pixelation is commonly used because most people
| probably don't know how to use it, but my Samsung phone's
| built-in image editor also has a pixelation feature, so I
| honestly expect to see it pop up more often in the future.
| zmgsabst wrote:
| This is somewhere the simplicity of old MS paint and BMPs was
| great:
|
| A colored square is easy -- and destructively removes what
| was there, by setting that part of the image to a chosen
| color.
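The "colored square" approach amounts to overwriting pixel values, as in this small sketch on a flat grayscale bitmap (`black_out` is a hypothetical helper). It assumes a single-layer raster image; formats that keep the rectangle as a separate object on top of the original content do not behave this way.

```python
def black_out(img, x0, y0, x1, y1, color=0):
    """Overwrite the rectangle [x0, x1) x [y0, y1) in place with one color."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            img[y][x] = color
    return img

# Two different 4x6 "documents" with different content in the same region.
doc_a = [[255, 0, 255, 0, 255, 255] for _ in range(4)]
doc_b = [[255, 255, 0, 0, 0, 255] for _ in range(4)]

black_out(doc_a, 1, 1, 5, 3)
black_out(doc_b, 1, 1, 5, 3)

region_a = [row[1:5] for row in doc_a[1:3]]
region_b = [row[1:5] for row in doc_b[1:3]]
```

After the fill, `region_a` and `region_b` are identical, so the output carries no trace of which content was underneath; only the size and position of the rectangle remain.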
___________________________________________________________________
(page generated 2022-12-17 23:00 UTC)