[HN Gopher] The Problem with Perceptual Hashes
___________________________________________________________________
The Problem with Perceptual Hashes
Author : rivo
Score : 171 points
Date : 2021-08-06 19:29 UTC (3 hours ago)
(HTM) web link (rentafounder.com)
(TXT) w3m dump (rentafounder.com)
| stickfigure wrote:
| I've also implemented perceptual hashing algorithms for use in
| the real world. The article is correct: there really is no way to
| eliminate false positives while still catching minor changes
| (say, resizing, cropping, or watermarking).
|
| I'm sure I'm not the only person with naked pictures of my wife.
| Do you really want a false positive to result in your intimate
| moments getting shared around some outsourced boiler room for
| laughs?
| jjtheblunt wrote:
| Why would other people have a naked picture of your wife?
| planb wrote:
| I fully agree with you. But while scrolling to the next comment,
| a question came to my mind: would it really bother me if some
| person who does not know my name, has never met me in real life,
| and never will, looked at my pictures without me ever knowing
| about it? To be honest, I'm not sure if I'd care. Because for
| all I know, that might be happening right now...
| zxcvbn4038 wrote:
| Rookie mistake.
|
| Three rules to live by:
|
| 1) Always pay your taxes
|
| 2) Don't talk to the police
|
| 3) Don't take photographs with your clothes off
| jimmygrapes wrote:
| I might amend #2 a bit to read "Be friends with the police"
| as that has historically been more beneficial to those who
| are.
| vineyardmike wrote:
| > Do you really want a false positive to result in your
| intimate moments getting shared around some outsourced boiler
| room for laughs?
|
| These people also have no incentive to find you innocent for
| innocent photos. If they err on the side of a false negative,
| they might find themselves at the wrong end of a criminal
| search ("why didn't you catch this?"), but if they produce a
| false positive, at worst they ruin a random person's life.
| jdavis703 wrote:
| Even so, this has to go to the FBI or another law enforcement
| agency, then it's passed on to a prosecutor, and finally a
| jury will evaluate it. I have a tough time believing that false
| positives would slip through that many layers.
|
| That isn't to say CSAM scanning or any other type of dragnet
| is OK. But I'm not concerned about a perceptual hash ruining
| someone's life, just like I'm not concerned about a botched
| millimeter wave scan ruining someone's life for weapons
| possession.
| gambiting wrote:
| >>I have a tough time believing that false positives would
| slip through that many layers.
|
| I don't, not in the slightest. Back in the days when Geek
| Squad had to report any suspicious images found during
| routine computer repairs, a guy got reported to the police
| for having child porn, arrested, fired from his job, and
| named in the local newspaper as a pedophile, all before the
| prosecutor was actually persuaded by the defense attorney
| to look at these "disgusting pictures"... which turned out
| to be his own grandchildren in a pool. Of course he was
| immediately released, but not before the damage to his life
| was done.
|
| >>But I'm not concerned about a perceptual hash ruining
| someone's life
|
| I'm incredibly concerned about this; I don't see how you
| can not be.
| zimpenfish wrote:
| > Do you really want a false positive to result in your
| intimate moments getting shared around some outsourced boiler
| room for laughs?
|
| You'd have to have several positive matches against the
| specific hashes of CSAM from NCMEC before they'd be flagged
| for human review, right? Which presumably lowers the chance
| of accidental false positives quite a bit?
| mjlee wrote:
| > I'm sure I'm not the only person with naked pictures of my
| wife.
|
| I'm not completely convinced that says what you want it to.
| enedil wrote:
| Didn't she possibly have previous partners?
| iratewizard wrote:
| I don't even have nude photos of my wife. The only person
| who might would be the NSA contractor assigned to watch
| her.
| websites2023 wrote:
| Presumably she wasn't his wife then. But also people have
| various arrangements so I'm not here to shame.
| nine_k wrote:
| Buy a subcompact camera. Never upload such photos to any cloud.
| Use your local NAS / external disk / your Linux laptop's
| encrypted hard drive.
|
| Unless you prefer to live dangerously, of course.
| ohazi wrote:
| Consumer NAS boxes like the ones from Synology or QNAP have
| "we update your box at our whim" cloud software running on
| them and are effectively subject to the same risks, even if
| you try to turn off all of the cloud options. I probably
| wouldn't include a NAS on this list unless you built it
| yourself.
|
| It looks like you've updated your comment to clarify _Linux_
| laptop's encrypted hard drive, and I agree with your line of
| thinking. Modern Windows and macOS are effectively cloud
| operating systems where more or less anything can be pushed
| at you at any time.
| derefr wrote:
| With Synology's DSM, at least, there's no "firmware" per
| se; it's just a regular Linux install that you have sudo(1)
| privileges on, so you can just SSH in and modify the OS as
| you please (e.g. removing/disabling the update service.)
| cm2187 wrote:
| At least you can deny the NAS access to the WAN by blocking
| it on the router or not configuring the right gateway.
| marcinzm wrote:
| Given all the zero day exploits on iOS I wonder if it's now going
| to be viable to hack someone's phone and upload child porn to
| their account. Apple will happily flag the photos and then,
| likely, get those people arrested. Now they have to, in practice,
| prove they were hacked, which might be impossible. It will either
| ruin their reputation or put them in jail for a long time. Given
| past witch hunts, it could be decades before people get
| exonerated.
| new_realist wrote:
| This is already possible using other services (Google Drive,
| gmail, Instagram, etc.) that already scan for CP.
| toxik wrote:
| This is a really difficult problem to solve, I think. However, I
| think most people who are prosecuted for CP distribution are
| hoarding it by the terabyte. It's hard to claim that you were
| unaware of that. A couple of gigabytes, though? Plausible. And
| that's what this CSAM scanner thing is going to find on phones.
| emodendroket wrote:
| A couple gigabytes is a lot of photos... and they'd all be
| showing up in your camera roll. Maybe possible but stretching
| the bounds of plausibility.
| danachow wrote:
| A couple gigabytes is enough to ruin someone's day but not
| a lot to surreptitiously transfer; it's literally seconds.
| Just backdate them and they may very well go unnoticed.
| [deleted]
| yellow_lead wrote:
| Regarding false positives re:Apple, the Ars Technica article
| claims
|
| > Apple offers technical details, claims 1-in-1 trillion chance
| of false positives.
|
| There are two ways to read this, but I'm assuming it means, for
| each scan, there is a 1-in-1 trillion chance of a false positive.
|
| Apple has over 1 billion devices. Assuming ten scans per device
| per day, you would reach one trillion scans in ~100 days. Okay,
| but not all the devices will be on the latest iOS, not all are
| active, etc. But this is all under the assumption that those
| numbers are accurate. I imagine reality will be much worse. And I
| don't think the police will be very understanding. Maybe you will
| get off, but you'll be in huge debt from your legal defense. Or
| maybe you'll be in jail, because the police threw the book at
| you.
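|
| A back-of-the-envelope version of that arithmetic (illustrative
| only, using the assumed device count, scan rate, and quoted
| false-positive rate above):
|
|     # rough scaling of a per-scan false-positive rate
|     devices, scans_per_day, fp_rate = 10**9, 10, 1e-12
|     daily_scans = devices * scans_per_day        # 1e10 scans/day
|     days_per_false_positive = 1 / (daily_scans * fp_rate)
|     print(days_per_false_positive)               # ~100 days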
| klodolph wrote:
| > Even at a Hamming Distance threshold of 0, that is, when both
| hashes are identical, I don't see how Apple can avoid tons of
| collisions...
|
| You'd want to look at the particular perceptual hash
| implementation. There is no reason to expect, without knowing the
| hash function, that you would end up with tons of collisions at
| distance 0.
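|
| As a toy illustration (assuming an idealized, uniformly
| distributed k-bit hash, which a real perceptual hash is not):
|
|     # expected number of exact-collision pairs among n random
|     # k-bit hashes (birthday bound); 96 bits is an arbitrary
|     # example size, not Apple's
|     def expected_collisions(n: int, k: int) -> float:
|         return n * (n - 1) / 2 / 2**k
|
|     print(expected_collisions(10**9, 96))  # ~6e-12 for 1e9 images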
| SavantIdiot wrote:
| This article covers three methods, all of which just look for
| alterations of a source image to find a fast match (in fact,
| that's the paper referenced). It is still a "squint to see if it
| is similar" test. I was under the impression there were more
| sophisticated methods that looked for _types_ of images, not just
| altered known images. Am I misunderstanding?
| jbmsf wrote:
| I am fairly ignorant of this space. Do any of the standard
| methods use multiple hash functions vs. just one?
| jdavis703 wrote:
| Yes, I worked on such a product. Users had several hashing
| algorithms they could choose from, and the ability to create
| custom ones if they wanted.
| heavyset_go wrote:
| I've built products that utilize different phash algorithms at
| once, and it's entirely possible, and quite common, to get
| false positives across hashing algorithms.
| lordnacho wrote:
| Why wouldn't the algo check that one image has a face while the
| other doesn't? That would remove this particular false positive,
| though I'm not sure what new ones it might cause.
| PUSH_AX wrote:
| Because where do you draw the line with classifying arbitrary
| features in the images? The concept is that it should work with
| an image of anything.
| rustybolt wrote:
| > an Apple employee will then look at your (flagged) pictures.
|
| This means that there will be people paid to look at child
| pornography and probably a lot of private nude pictures as well.
| emodendroket wrote:
| And what do you think the content moderation teams employed by
| Facebook, YouTube, et al. do all day?
| mattigames wrote:
| Yeah, we obviously needed one more company doing it as well,
| and I'm sure having more positions in the job market which
| pretty much could be described as "Get paid to watch
| pedophilia all day long" will not backfire in any way.
| emodendroket wrote:
| You could say there are harmful effects of these jobs but
| probably not in the sense you're thinking.
| https://www.wired.com/2014/10/content-moderation/
| josephcsible wrote:
| They look at content that people actively and explicitly
| chose to share with wider audiences.
| [deleted]
| Spivak wrote:
| Yep! I guess this announcement is when everyone is collectively
| finding out how this has, apparently quietly, worked for years.
|
| It's a "killing floor" type job where you're limited in how
| long you're allowed to do it in a lifetime.
| varjag wrote:
| There are people who are paid to do that already, just
| generally not in corporate employment.
| pkulak wrote:
| Apple, with all those Apple == Privacy billboards plastered
| everywhere, is going to have a full-time staff of people with
| the job of looking through its customers' private photos.
| mattigames wrote:
| I'm sure that's the dream position for most pedophiles: watching
| child porn fully legally and being paid for it, plus being on
| the record as someone who helps destroy it. And given that CP
| will exist for as long as human beings do, there will be no
| shortage, no matter how much they help capture other
| pedophiles.
| ivalm wrote:
| I am not exactly buying the premise here: if you train a CNN on
| useful semantic categories, then the representations it generates
| will be semantically meaningful (so the error shown in the blog
| wouldn't occur).
|
| I dislike the general idea of iCloud having back doors but I
| don't think the criticism in this blog is entirely valid.
|
| Edit: it was pointed out that Apple's classifier isn't
| semantically meaningful, so the blog post's criticism is valid.
| jeffbee wrote:
| I agree the article is a straw-man argument and is not
| addressing the system that Apple actually describes.
| SpicyLemonZest wrote:
| Apple's description of the training process
| (https://www.apple.com/child-
| safety/pdf/CSAM_Detection_Techni...) sounds like they're just
| training it to recognize some representative perturbations, not
| useful semantic categories.
| ivalm wrote:
| Ok, good point, thanks.
| ajklsdhfniuwehf wrote:
| WhatsApp and other apps place pictures from group chats in
| folders deep in your iOS gallery.
|
| Swatting will be a problem all over again.... wait, did it ever
| stop being a problem?
| karmakaze wrote:
| It really all comes down to whether Apple has, and is willing to
| maintain, the effort of human evaluation prior to taking action
| on potential false positives:
|
| > According to Apple, a low number of positives (false or not)
| will not trigger an account to be flagged. But again, at these
| numbers, I believe you will still get too many situations where
| an account has multiple photos triggered as a false positive.
| (Apple says that probability is "1 in 1 trillion" but it is
| unclear how they arrived at such an estimate.) These cases will
| be manually reviewed.
|
| At scale, even human classification of cases which ought to be
| clear will fail: reviewers will accidentally click 'not ok' when
| they saw something they thought was 'ok'. It will be interesting
| to see what happens then.
| jdavis703 wrote:
| Then law enforcement, a prosecutor and a jury would get
| involved. Hopefully law enforcement would be the first and
| final stage if it was merely the case that a person pressed
| "ok" by accident.
| at_a_remove wrote:
| I do not know as much about perceptual hashing as I would like,
| but have considered it for a little project of my own.
|
| Still, I know it has been floating around in the wild. I recently
| came across it on Discord when I attempted to push an ancient
| image, from the 4chan of old, to a friend, which mysteriously
| wouldn't send. Saved it as a PNG, no dice. This got me
| interested. I stripped the EXIF data off of the original JPEG. I
| resized it slightly. I trimmed some edges. I adjusted colors. I
| did a one degree rotation. Only after a reasonably complete
| combination of those factors would the image make it through. How
| interesting!
|
| I just don't know how well this little venture of Apple's will
| scale, and I wonder if it won't end up being easy enough to
| bypass in a variety of ways. I think the tradeoff will do very
| little, as stated, but is probably a glorious opportunity for
| black-suited goons of state agencies across the globe.
|
| We're going to find out in a big, big way soon.
|
| * The image is of the back half of a Sphynx cat atop a CRT. From
| the angle of the dangle, the presumably cold, man-made feline is
| draping his unexpectedly large testicles across the similarly
| man-made device to warm them, suggesting that people create
| problems and also their solutions, or that, in the Gibsonian
| sense, the street finds its own uses for things. I assume that
| the image was blacklisted, although I will allow for the somewhat
| baffling concept of a highly-specialized scrotal matching neural-
| net that overreached a bit or a byte on species, genus, family,
| and order.
| judge2020 wrote:
| AFAIK Discord's NSFW filter is not a perceptual hash nor uses
| the NCMEC database (although that might indeed be in the
| pipeline elsewhere) but instead uses an ML classifier (I'm
| certain it doesn't use perceptual hashes, as Discord doesn't
| have a catalogue of NSFW image hashes to compare against). I've
| guessed it's either open_nsfw[0] or Google's Cloud Vision, since
| the rest of Discord's infrastructure uses Google Cloud VMs.
| There's a web demo of this API[1]; Discord probably pulls the
| SafeSearch classifications for determining NSFW.
|
| 0: https://github.com/yahoo/open_nsfw
|
| 1: https://cloud.google.com/vision#section-2
| a_t48 wrote:
| Adding your friend as a "friend" on discord should disable the
| filter.
| ttul wrote:
| Apple would not be so naive as to roll out a solution to child
| abuse images that has a high false positive rate. They do test
| things prior to release...
| bjt wrote:
| I'm guessing you don't remember all the errors in the initial
| launch of Apple Maps.
| smlss_sftwr wrote:
| ah yes, from the same company that shipped this:
| https://medium.com/hackernoon/new-macos-high-sierra-vulnerab...
|
| and this:
| https://www.theverge.com/2017/11/6/16611756/ios-11-bug-lette...
| celeritascelery wrote:
| Test it... how exactly? This is detecting illegal material that
| they can't use to test against.
| bryanrasmussen wrote:
| I don't know anything about it, but I suppose various
| governmental agencies maintain corpora of nasty stuff, and
| that you can say to them: hey, we want to roll out anti-nasty-
| stuff functionality in our service, therefore we need access
| to the corpora for testing. At which point there is probably a
| pretty involved process that also requires governmental access,
| to make sure things work and are not misused. Otherwise -
|
| how does anyone ever actually fight the nasty stuff? This
| problem structure of "how do I catch examples of A if examples
| of A are illegal" must apply in many places and ways.
| vineyardmike wrote:
| Test it against innocent data sets, then in prod swap it
| for the opaque gov db of nasty stuff and hope the gov was
| honest about what is in it :)
|
| They don't need to train a model to detect the actual data
| set. They need to train a model to follow a pre-defined
| algorithm.
| zimpenfish wrote:
| > This is detecting illegal material that they can't use to
| test against.
|
| But they can because they're matching the hashes to the ones
| provided by NCMEC, not directly against CSAM itself (which
| presumably stays under some kind of lock and key at NCMEC.)
|
| Same as you can test whether you get false positives against
| a bunch of MD5 hashes that Fred provides without knowing the
| contents of his documents.
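|
| A minimal sketch of that kind of test (the digest here is made
| up, not a real NCMEC or "Fred" hash):
|
|     import hashlib
|
|     # digests supplied by a third party; the underlying files
|     # are never seen by the tester
|     provided = {"9e107d9d372bb6826bd81d3542a419d6"}
|
|     def matches(path: str) -> bool:
|         with open(path, "rb") as f:
|             return hashlib.md5(f.read()).hexdigest() in provided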
| ben_w wrote:
| While I don't have any inside knowledge at all, I would
| expect a company as big as Apple to be able to ask law
| enforcement to run Apple's algorithm on data sets Apple
| themselves don't have access to and report the result.
|
| No idea if they did (or will), but I do expect it's possible.
| zimpenfish wrote:
| > ask law enforcement to run Apple's algorithm on data sets
| Apple themselves don't have access to
|
| Sounds like that's what they did since they say they're
| matching against hashes provided by NCMEC generated from
| their 200k CSAM corpus.
|
| [edit: Ah, in the PDF someone else linked, "First, Apple
| receives the NeuralHashes corresponding to known CSAM from
| the above child-safety organizations."]
| IfOnlyYouKnew wrote:
| They want to avoid false positives, so you would test for
| that by running it over innocuous photos anyway.
| [deleted]
| IfOnlyYouKnew wrote:
| Apple's documents say they require multiple hits before anything
| happens, as the article notes. They can adjust (and have
| adjusted) that number to any desired balance of false positives
| to negatives.
|
| How can they say it's 1 in a trillion? You test the algorithm on
| a bunch of random negatives, see how many positives you get, and
| do one division and one multiplication. This isn't rocket
| science.
|
| So, while there are many arguments against this program, this
| isn't one of them. It's also somewhat strange to believe that the
| idea of collisions in hashes far smaller than the images they are
| run on somehow escaped Apple, or really anyone mildly
| competent.
| bt1a wrote:
| That would not be a good way to arrive at an accurate estimate.
| Would you not need dozens of trillions of photos to begin with
| in order to get an accurate estimate when the occurrence rate
| is so small?
| KarlKemp wrote:
| What? No...
|
| Or, more accurately: if you need "dozens of trillions" that
| implies a false positive rate so low, it's practically of no
| concern.
|
| You'd want to look up the Poisson distribution for this. But,
| to get at this intuitively: say you have a bunch of eggs,
| some of which may be spoiled. How many would you have to
| crack open, to get a meaningful idea of how many are still
| fine, and how many are not?
|
| The absolute number depends on the fraction that are off. But
| independent of that, you'd usually start trusting your sample
| when you've seen 5 to 10 spoiled ones.
|
| So Apple runs the hash algorithm on random photos. They find
| 20 false positives in the first million. Given that error
| rate, how many positives would it require in the average
| photo collection of 10,000 to be certain, at a one-in-a-
| trillion level, that it's not just coincidence?
|
| Throw it into, for example,
| https://keisan.casio.com/exec/system/1180573179 with lambda =
| 0.2 (you're expecting one false positive for every 50,000 at
| the error rate we assumed, or 0.2 for 10,000), and n = 10
| (we've found 10 positives in this photo library) to see the
| chances of that: 2.35x10^-14, or 2.35 / 100 trillion.
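|
| That tail probability is easy to sanity-check (same assumed
| numbers: lambda = 0.2 expected false positives per 10,000-photo
| library, 10 observed):
|
|     from math import exp, factorial
|
|     lam, n = 0.2, 10
|     # P(X >= 10) for a Poisson(0.2) variable
|     tail = 1 - sum(exp(-lam) * lam**k / factorial(k)
|                    for k in range(n))
|     print(tail)   # ~2.35e-14, matching the figure above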
| mrtksn wrote:
| The technical challenges aside, I'm very disturbed that my device
| will be reporting me to the authorities.
|
| That's very different from authorities taking a sneak peek into
| my stuff.
|
| That's like the theological concept of always being watched.
|
| It starts with child pornography, but the technology is
| indifferent to it; it can be anything.
|
| It's always about the children, because we all want to save the
| children. Soon they will start asking you to start saving your
| country. Depending on your location, they will start checking
| for sins against religion, race, family values, or political
| activities.
|
| I bet you, after the next election in the US your device will be
| reporting you for spreading far right or deep state lies,
| depending on who wins.
| baggy_trough wrote:
| Totally agree. This is very sinister indeed. Horrible idea,
| Apple.
| zionic wrote:
| So what are we going to _do_ about it?
|
| I have a large user base on iOS. Considering a blackout
| protest.
| drzoltar wrote:
| The other issue with these hashes is non-robustness to
| adversarial attacks. Simply rotating the image by a few degrees,
| or slightly translating/shearing it will move the hash well
| outside the threshold. The only way to combat this would be to
| use a face bounding box algorithm to somehow manually realign the
| image.
| foobarrio wrote:
| In my admittedly limited experience with image hashing, you
| typically extract some basic feature and transform the image
| before hashing (e.g. put the darkest corner in the upper left,
| or look for verticals/horizontals and align; see the sketch at
| the end of this comment). You also take multiple hashes of the
| image to handle various crops, black and white vs. color, etc.
| This increases robustness a bit, but overall, yeah, you can
| always transform the image in such a way as to come up with a
| different enough hash. One thing that would be hard to catch is
| if you do something like a swirl and the consumers of that
| content then use a plugin or something to "deswirl" the image.
|
| There's also something like the Scale Invariant Feature
| Transform that would protect against all affine transformations
| (scale, rotate, translate, skew).
|
| I believe one thing that's done is that whenever any CP is found,
| the hashes of all images in the "collection" are added to the DB,
| whether or not they actually contain abuse. So if there are any
| common transforms of existing images, then those also now have
| their hashes added to the DB. The idea is that a high percentage
| of hits, even from the benign hashes, indicates the presence
| of the same "collection".
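|
| A toy version of the "normalize, then hash" idea, using a plain
| average hash and rotating so the darkest corner sits top-left
| (not any production algorithm, just the shape of the approach):
|
|     from PIL import Image
|
|     def ahash(img, size=8):
|         # downscale, grayscale, threshold each pixel at the mean
|         px = list(img.convert("L").resize((size, size)).getdata())
|         mean = sum(px) / len(px)
|         return sum(1 << i for i, p in enumerate(px) if p > mean)
|
|     def canonical_ahash(img):
|         # rotate in 90-degree steps so the darkest corner of an
|         # 8x8 thumbnail ends up top-left, then hash
|         px = list(img.convert("L").resize((8, 8)).getdata())
|         corners = [px[0], px[7], px[63], px[56]]  # TL, TR, BR, BL
|         turns = corners.index(min(corners))       # 90-deg CCW turns
|         return ahash(img.rotate(90 * turns, expand=True))
|
|     # usage: canonical_ahash(Image.open("photo.jpg"))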
| ris wrote:
| I agree with the article in general except part of the final
| conclusion
|
| > The simple fact that image data is reduced to a small number of
| bits leads to collisions and therefore false positives
|
| Our experience with regular hashes suggests this is not the
| underlying problem. SHA-256 hashes have 256 bits and still there
| are _no known_ collisions, even with people deliberately trying
| to find them. SHA-1 has only 160 bits to play with and it's
| still hard enough to find collisions. MD5 collisions are easier
| to find, but at 128 bits people still don't come across them
| by chance.
|
| I think the actual issue is that perceptual hashes tend to be
| used with this "nearest neighbour" comparison scheme which is
| clearly needed to compensate for the inexactness of the whole
| problem.
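|
| A minimal sketch of that nearest-neighbour comparison scheme
| (the 64-bit hash values and the threshold here are made up):
|
|     def hamming(a: int, b: int) -> int:
|         return bin(a ^ b).count("1")
|
|     known = {0x8F3B6A1D99C0E472, 0x1234ABCD5678EF00}
|
|     def is_match(h: int, threshold: int = 10) -> bool:
|         # exact hashing matches only at distance 0; perceptual
|         # schemes accept near misses, which is where most false
|         # positives creep in
|         return any(hamming(h, k) <= threshold for k in known)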
| marcinzm wrote:
| > an Apple employee will then look at your (flagged) pictures.
|
| Always fun when unknown strangers get to look at your potentially
| sensitive photos with probably no notice given to you.
| judge2020 wrote:
| They already do this for photodna-matched iCloud Photos (and
| Google Photos, Flickr, Imgur, etc), perceptual hashes do not
| change that.
| version_five wrote:
| I'm not familiar with iPhone picture storage. Are the
| pictures automatically sync'ed with cloud storage? I would
| assume (even if I don't like it) that cloud providers may be
| scanning my data. But I would not expect anyone to be able to
| see or scan what is stored on my phone.
|
| Incidentally, I work in computer vision and handle
| proprietary images. I would be violating client agreements if
| I let anyone else have access to them. This is a concern I've
| had in the past e.g. with Office365 (the gold standard in
| disregarding privacy) that defaults to sending pictures in
| word documents to Microsoft servers for captioning, etc. I
| use a Mac now for work, but if somehow this snooping applies
| to computers as well I can't keep doing so while respecting
| the privacy of my clients.
|
| I echo the comment on another post: Apple is an entertainment
| company; I don't know why we all started using their products
| for business applications.
| Asdrubalini wrote:
| You can disable automatic backups; this way your photos
| won't ever be uploaded to iCloud.
| abawany wrote:
| By default it is enabled. One has to go through Settings to
| turn off the default iCloud upload, afaik.
| starkd wrote:
| The method Apple is using looks more like a cryptographic hash.
| That's entirely different from (and more secure than) a
| perceptual hash.
|
| From https://www.apple.com/child-safety/
|
| "Before an image is stored in iCloud Photos, an on-device
| matching process is performed for that image against the known
| CSAM hashes. This matching process is powered by a cryptographic
| technology called private set intersection, which determines if
| there is a match without revealing the result. The device creates
| a cryptographic safety voucher that encodes the match result
| along with additional encrypted data about the image. This
| voucher is uploaded to iCloud Photos along with the image."
|
| Elsewhere, it does explain the use of NeuralHashes, which I take
| to be the perceptual hash part of it.
|
| I did some work on a similar attempt a while back. I also have a
| way to store hashes and find similar images. Here's my blog post;
| I'm currently working on a full site.
|
| http://starkdg.github.io/posts/concise-image-descriptor
___________________________________________________________________
(page generated 2021-08-06 23:00 UTC)