[HN Gopher] Megaface
       ___________________________________________________________________
        
       Megaface
        
       Author : gennarro
       Score  : 187 points
       Date   : 2023-01-02 01:09 UTC (21 hours ago)
        
 (HTM) web link (exposing.ai)
 (TXT) w3m dump (exposing.ai)
        
       | hrkucuk wrote:
       | Are all these faces white?
        
         | Zamicol wrote:
         | No.
        
       | jonplackett wrote:
        | Does anyone have any stats on the ethnicities and genders of the
        | people in this dataset?
       | 
       | Is this still widely used to test face recognition?
        
         | wongarsu wrote:
         | There is the DiveFace dataset/metadata [1], which is a subset
         | of the Megaface dataset with six equally sized groups: three
         | ethnicities times two genders.
         | 
         | 1: https://github.com/BiDAlab/DiveFace
        
       | sschueller wrote:
        | This dataset's usage and creation violate Swiss law [1]. Any
        | person in Switzerland has the right to their face in any
        | picture, taken now or at any time in the future, even if taken
        | by someone else. Without a person's explicit consent, their face
        | may not be used or published in any way or form. There are only
        | a few exceptions, such as for public figures and celebrities,
        | but even then they retain a right to privacy.
       | 
        | SRF once did a segment about face recognition and public photos
        | from social media. Under strict supervision and journalistic
        | protections they created a dataset and showed what was possible.
        | The dataset and code were then destroyed. [2]
       | 
       | Similar laws exist in EU states as well.
       | 
        | [1] https://www.edoeb.admin.ch/edoeb/de/home/datenschutz/Interne...
       | 
        | [2] https://www.srf.ch/news/schweiz/automatische-gesichtserkennu...
        
         | sacrosancty wrote:
         | [dead]
        
       | gardenhedge wrote:
        | It strikes me as sad that people's photos have been taken and
        | used to train a technology for corporate profit. People just
        | wanted to share their wedding photos.
        
         | sacrosancty wrote:
         | Replace "corporate profit" with "social good", which is what it
         | generally comes from, and then is it still sad?
         | 
         | You seem to imply there's something wrong with corporate
          | profit. We as a society want and encourage corporate profit
         | because we want the social good that corporations provide and
         | the profit incentivizes them to do it. Profit is a rough
         | measure of how much good they do for people.
         | 
         | Profit is like salary for investors. Salary is fine for doctors
          | and teachers, isn't it? It's also fine for investors, who do
         | the useful and difficult job of deciding which companies are
         | doing the most good, then encouraging them to do more of it by
         | investing money.
        
         | fleddr wrote:
         | I agree, but why would one share wedding photos using an open
         | license like Creative Commons?
        
           | jefftk wrote:
            | In this case, because Flickr used to waive hosting fees for
           | people who chose a Creative Commons license.
        
           | tomrod wrote:
           | Because regulations are generally unknown to folks who don't
           | spend their time solving tech problems.
           | 
           | People simply assumed they could share it easily with friends
           | and family.
        
             | fleddr wrote:
             | This doesn't match my experience, and I run a photo
             | community myself.
             | 
             | You're absolutely right that people are generally fairly
             | clueless about licenses, especially in the amateur domain.
             | And the main implication of that is that they don't bother
             | with it at all and leave it at whatever the default is,
             | which typically is "copyrighted, all rights reserved".
             | 
             | Those explicitly tinkering with licenses, which is a
             | purposeful action, tend to actually know (somewhat) what
             | they are doing.
             | 
             | Further, if you leave a photo's license to its default,
             | copyrighted, absolutely nothing stops you from sharing it
             | with friends and family. What would happen? You share it
             | with them and then sue yourself?
             | 
             | Similarly, somebody you don't even know could use your
             | copyrighted image and post it on social media. Again,
              | nothing happens, as this is widespread behavior casually
              | called "fair use", even though legally it absolutely
              | isn't. But nobody
             | cares, as nobody will sue over it unless there is a case of
             | vast commercial usage.
        
         | ironmagma wrote:
          | Not to worry: someday that data will be so ubiquitous and
          | well-studied that it won't even be profitable; it will just
          | be trivial to construct or deconstruct any face.
        
           | [deleted]
        
       | samwillis wrote:
        | One of the difficulties with these training datasets is the
        | currently understood rules around web scraping. The current
        | legal precedent [0] is that web scraping is perfectly legal,
        | regardless of what is in the website's terms of service,
        | "licence", or robots.txt. If a human can navigate to it freely,
        | you can scrape it using automated means.
       | 
       | What you can't do with scraped data is republish it verbatim.
       | Doing a data analysis on scraped data is permitted by law, and
       | you can publish your analysis of that data.
       | 
       | The question is, is an AI model trained on scraped data a derived
       | analysis that is therefore legal? Or is it republishing of the
       | original data? We need a test case to find out.
       | 
        | In the case of this dataset, I don't think the CC license
        | applies to people using it. It "may" apply to redistribution of
        | it for free. If the dataset were sold, that would be a
        | violation. I suspect that (once tested in court) a model trained
        | on this dataset would be allowed despite the CC license on the
        | photos.
       | 
        | Personally, in this case I think the university's ethics
        | committee should have put up barriers to the project. The morals
        | of this are questionable at best.
       | 
       | 0: https://techcrunch.com/2022/04/18/web-scraping-legal-court/
        
         | pbhjpbhj wrote:
         | >Or is it republishing of the original data?
         | 
         | If it's publishing _data_ then you're fine under regular
         | copyright as it only protects artistic works and not things
          | like data. You might fall afoul of other IP legislation, but
          | not copyright.
         | 
         |  _YMMV, this is not legal advice and represents my personal
         | opinion unrelated to my employment._
        
           | fragmede wrote:
           | The CFAA would be the thing to look out for.
        
           | cmeacham98 wrote:
           | The "data" here is photographs, which all jurisdictions I'm
           | aware of treat as coprightable.
        
             | Mtinie wrote:
              | Which makes this case even more interesting to me. Some
              | percentage of those photos' copyrights are owned by
              | corporations rather than the pictured individuals.
             | 
              | If it were simply a large group of selfies, I wouldn't
              | expect much legal challenge from the allegedly aggrieved.
              | But when companies with legal counsel get involved...
        
         | KRAKRISMOTT wrote:
         | You can do the scraping in a jurisdiction where it is legal.
        
           | traceroute66 wrote:
           | > You can do the scraping in a jurisdiction where it is
           | legal.
           | 
           | No such thing with GDPR.
           | 
            | Why do you think so many US websites take the lazy-ass
            | approach and block EU visitors to their websites?
            | 
            | Simple: it's because either you comply with GDPR or you
            | don't process the information of citizens of GDPR-covered
            | countries. End of story.
        
             | pixl97 wrote:
              | If I'm in China and I scrape/collect data, I don't think
              | the GDPR is going to do anything to me. This really only
              | affects businesses that the EU has some means of reaching.
        
             | [deleted]
        
             | laingc wrote:
             | Well, no, only if you're under the jurisdiction of the EU
             | courts. They can rule against you as much as they like, but
             | it's not enforceable outside of the EU or a jurisdiction
             | that chooses to enforce EU judgements.
        
           | pbhjpbhj wrote:
            | Importing (in the geographical sense) the data would still
            | be infringing; you've just scraped it in a convoluted way --
            | legal systems, in my limited experience, take account of
            | such things.
        
         | JumpCrisscross wrote:
         | Beyond copyright, how would these requirements work with
         | Illinois' biometrics law?
        
         | the_duke wrote:
         | It's not as easy as that.
         | 
         | Pictures are clearly personally identifiable data, so storing
         | them violates the GDPR if you don't have permission to do so.
         | 
         | Some "data analysis company" got fined a hefty sum for doing so
         | with EU citizens.
         | 
         | I forgot the name, but they were recently in the news for
         | helping Ukraine identify Russian soldiers by picture.
         | 
         | Of course they were also aggregating other data including
         | names, so just pictures might be a more complicated case, but
         | as a company with EU exposure I wouldn't do it. It's pretty
         | clearly against the law.
        
           | samwillis wrote:
           | You are quite right, forgot that one.
           | 
            | The point is, though, that we need a test case to go through
            | the courts to clarify all of this. There are companies
            | betting billions that they are OK to do what they are doing.
        
           | fleddr wrote:
           | "Pictures are clearly personally identifiable data, so
           | storing them violates the GDPR if you don't have permission
           | to do so."
           | 
           | Wouldn't a Creative Commons license express this permission?
        
             | kixiQu wrote:
             | IANAL, but I believe no; the CC license handles the rights
             | that a photographer can hand out, but doesn't come with any
             | kind of model release guarantees.
        
               | fleddr wrote:
                | Model release is a good point, but in many situations
                | where people are photographed it does not apply. When
                | you take photos of yourself or your family, or even of
                | people in public spaces, you do not require a model
                | release. And I imagine this to be the main input of this
                | training set.
               | 
                | When you hire a model, photograph the person, and then
                | use these photos for promotional or commercial
                | activities, you do require a model release. But in that
                | case it would be absurdly weird to publish such
                | commercial material as CC NC on Flickr; it makes no
                | sense.
        
         | satvikpendem wrote:
          | There was an updated ruling in November 2022: hiQ was ruled
          | against and reached a settlement with LinkedIn, so I'm not
          | sure that web scraping is entirely legal.
         | 
         | https://www.natlawreview.com/article/hiq-and-linkedin-reach-...
        
           | EMIRELADERO wrote:
            | That was because LinkedIn added a no-scraping clause to their
            | ToS and also put up a login wall for _viewing profiles_ in
            | the first place.
           | 
           | If you scraped from a web page without actually signing up
           | for an account you wouldn't be accepting the terms and would
           | thus be legally in the clear.
        
         | charcircuit wrote:
          | An ML model would be considered transformative.
        
       | Jerry2 wrote:
       | It's unfortunate they removed it. Is there a public
       | mirror/torrent of it by any chance?
        
       | colesantiago wrote:
       | "All photos included a Creative Commons licenses, but most were
       | not licensed for commercial use."
       | 
        | I wonder what the implications are for Stable Diffusion, DALL-E,
        | and Midjourney, since art images on the internet are copyrighted
        | by default.
       | 
       | Even with a fair use argument, there are examples in cases where
       | AI was generating art that included the signatures of artists.
       | 
       | https://nwn.blogs.com/nwn/2022/12/lensa-ai-art-images-withou...
        
         | gwern wrote:
         | This is apples and oranges. SD et al are defended on the
         | grounds of being transformative use
         | (https://en.wikipedia.org/wiki/Transformative_use): they do not
         | distribute (ie _copy_ ) the original training images, and they
         | are not a derivative work due to transformativeness, so the
         | license of the original images is completely irrelevant.
         | (Details like 'signatures' are also irrelevant: if I write a
         | style parody of William Shakespeare and add a '--Willy
         | Shakespeare' at the end to round it off, have I revealed that I
         | have secretly copied his work? Of course not. It's just
         | plausible that there would be a name there, so I came up with a
         | name.)
         | 
         | The criticism here is that distributing (copying) the original
         | image violates the non-commercial clause of the original images
         | because someone, somewhere, might somehow have made money in
         | some way because the dataset exists; but as they somewhat
         | lamely acknowledge later, what counts as 'commercial' has never
          | been clearly defined, and it probably _can't_ be defined
         | (because for most people 'commercial' seems to be defined by
         | 'ewww'), and this is why CC-NC licenses are heavily discouraged
         | by WMF and other FLOSS groups and weren't part of FLOSS from
         | the beginning even though Stallman was in large part attacking
         | commercial exploitation.
        
           | schemescape wrote:
           | Does anyone know if attempts have been made to trick these ML
           | models into reproducing original copyrighted inputs verbatim
           | (edit: or close enough)?
           | 
           | Edit: Asking about verbatim copies wasn't really a great
           | question. I should have asked about producing things that are
           | "close enough to cause legal trouble" (whether that be due to
           | copyright, trademark, or something else).
        
             | sdenton4 wrote:
              | That's not really how memorization in neural networks
             | works. For classifiers, memorization is more like learning
             | a hash function and a lookup table; no need to store the
             | full image at all. Even for very large models, the weights
             | are a tiny fraction of the size of the original data.
             | 
             | It's probably helpful to think of embeddings for generative
             | models in a similar way; it's a very specific embedding
              | function, like a locality-sensitive hash, which doesn't
             | require actually storing the data.
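              | 
              | As a minimal sketch of the locality-sensitive-hash idea
              | (the dimensions and NumPy usage here are assumed, purely
              | illustrative): similar inputs collide into the same short
              | signature without the inputs themselves ever being stored.
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     # 16 random hyperplanes in a 512-dim feature space
              |     planes = rng.normal(size=(16, 512))
              | 
              |     def lsh_signature(embedding):
              |         # Sign pattern against the hyperplanes: a 16-bit code
              |         return tuple((planes @ embedding > 0).astype(int))
              | 
              |     a = rng.normal(size=512)
              |     b = a + 0.01 * rng.normal(size=512)  # tiny perturbation
              |     # Nearby points usually land on the same signature
              |     print(lsh_signature(a) == lsh_signature(b))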
        
               | schemescape wrote:
               | Thanks. Yes, I shouldn't have asked about "verbatim"
               | copies -- I should have asked about something more like
               | "close enough to cause legal trouble". Obviously copying
               | verbatim is a violation of copyright, but there must be
               | some threshold of "close enough" that is still
               | problematic. E.g. compressed MP3s of copyrighted songs
               | aren't a verbatim reproduction, but as far as I'm aware
               | they're still covered by copyright.
               | 
               | Trademarks are even broader.
        
             | polygamous_bat wrote:
             | One, the diffusion model's possible output space contains
             | every RGB image ever. But two, it cannot ever possibly
             | contain the original inputs verbatim, because (the size of
             | the model)/(the size of the training set) comes out to be
              | something like 0.2 KB per image. Unless it's an incredible
              | compression algorithm, the diffusion model must necessarily
              | have learned something from the input rather than
              | copy-pasting things, as claimed upthread.
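              | 
              | As a back-of-envelope version of that ratio (the numbers
              | are assumed round figures; the exact result depends on
              | which checkpoint and training subset you count):
              | 
              |     model_bytes = 4e9  # assumed: ~4 GB of weights on disk
              |     num_images = 2e9   # assumed: ~2 billion training images
              |     # Model capacity available per training image, in bytes:
              |     print(model_bytes / num_images)  # ~2.0
              |     # A typical JPEG is ~100 KB, i.e. tens of thousands of
              |     # times larger, so verbatim storage is implausible.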
        
               | schemescape wrote:
                | I edited my post a while ago, but I shouldn't have asked
                | about "verbatim" copies. See my reply to the sibling for
                | a more interesting question.
        
             | gwern wrote:
             | There's been a lot of work on memorization, yes, and you
             | can also do nearest-neighbor lookups in the original data
             | to gutcheck 'memorization'. As usual, the answer is "it's
             | complicated" but for most practical purposes, the answer is
             | 'no': you will get the Mona Lisa if you ask for it,
              | absolutely, but the odds of a randomly generated image
              | being a doppelganger are near-zero. (If you've seen stuff
              | on social media to the contrary, then you may have been
              | misled by various people peddling img2img or 'variation'
             | functions, or prompting for it, or other ways of
             | lying/ignorance.)
             | 
              | But you certainly can get things like watermarks without
              | any real memorization. Watermarks have been a nuisance in
              | GANs, for example: the StyleGAN CATS model was filled with
              | watermarks and attempted meme-text captions. The cats were
              | so nightmarish that they obviously weren't 'plagiarized',
              | so nobody made a big deal about it back then; people
              | understood the GAN had simply learned that watermarks were
              | a thing in many real images and would try to imitate them
              | where plausible in a sample.
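              | 
              | A minimal sketch of that nearest-neighbor gut-check (it
              | assumes you already have embedding vectors, e.g. from a
              | CLIP-style encoder, for both training images and generated
              | samples; the names here are illustrative):
              | 
              |     import numpy as np
              | 
              |     def nearest_neighbor_sims(generated, training):
              |         # Cosine similarity of each generated embedding to
              |         # its single closest training embedding
              |         g = generated / np.linalg.norm(generated, axis=1,
              |                                        keepdims=True)
              |         t = training / np.linalg.norm(training, axis=1,
              |                                       keepdims=True)
              |         return (g @ t.T).max(axis=1)
              | 
              |     # Similarities near 1.0 flag candidate memorization:
              |     # sims = nearest_neighbor_sims(gen_embs, train_embs)
              |     # suspects = np.nonzero(sims > 0.95)[0]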
        
             | Scaevolus wrote:
              | That's a known failure mode called "overfitting" or
              | "memorization", where a specific training input is very
              | accurately reproduced.
              | 
              | I'm not aware of it occurring for any copyrighted inputs,
              | but it occurs for many famous artworks -- it's nearly
              | impossible to convince Stable Diffusion to restyle the
              | "Mona Lisa" at all.
        
           | theptip wrote:
           | > distributing the original image violates the non-commercial
           | clause of the original images because someone, somewhere,
           | might somehow have made money
           | 
           | I agree with the rest of your post, but this point seems a
           | bit uncharitable.
           | 
           | I think the claims would be:
           | 
            | 1. It's a breach of copyright for Megaface to share the
            | images in any case without attribution & replicating the
            | CC-NC license. It would (presumably) be OK assuming Megaface
            | were to correctly apply the CC-NC licenses to the dataset.
           | 
           | 2. It's a breach of copyright for anyone consuming Megaface
           | (e.g. Google) to use those images for commercial purposes.
           | 
            | And your argument for SD applies to 2.: regardless of
            | license, it's OK to create a transformative work. But it
            | still doesn't get Megaface off the hook for 1.: distributing
            | those images without the license.
        
           | pbhjpbhj wrote:
            | > if I write a style parody of William Shakespeare and add a
            | '--Willy Shakespeare' at the end to round it off, have I
            | revealed that I have secretly copied his work?
           | 
            | I doubt you're suggesting SD, DALL-E, etc., are producing
            | parodies, so bringing in parody considerations muddies the
            | water a lot. Also, Shakespeare's works are out of copyright.
           | 
           | If you sell a painting signed with a [facsimile] signature of
           | Dali then it's pretty hard to say you didn't copy the
            | signature, as a minimum. That's likely to be a trademark
           | violation too. Now, suppose you include aspects in the image
           | specifically associated with the artist, and a signature, ...
           | there's no way to genuinely deny that is a derivative.
        
             | theptip wrote:
             | > a painting signed with a [facsimile] signature of Dali
             | 
             | That's not what's happening here though.
             | 
              | If you look at the original tweet
              | (https://twitter.com/LaurynIpsum/status/1599953586699767808)
              | it seems that the complaint is about the "mangled remains
              | of an artist's signature". I don't see any examples where
              | it's actually copying the signature of a specific artist.
             | 
             | (Please do share an example of that if there is one.)
        
               | return_to_monke wrote:
               | I do respect artist's concerns. I have a hard time
               | getting this one though. The ai learned that humans
               | usually put squiggly lines in the corners, and it does,
               | too. What is wrong with this?
        
         | polygamous_bat wrote:
         | > Even with a fair use argument, there are examples in cases
         | where AI was generating art that included the signatures of
         | artists.
         | 
         | I went through the post, and I am not sure whether I agree with
         | the analysis of the examples. Diffusion models are conceptual
          | parrots, and it is possible that "25% of images contain a
          | scribble in the bottom right corner, so the model will make a
          | scribble in the corner" is what is being construed as a
          | signature in this post.
         | 
          | I think a large part of the outrage from the artists about
          | diffusion models "stealing" art comes from a place of
          | disbelief
         | that machines can be this good without "stealing", and it's
         | perfectly natural. In fact, it's unnatural to me how good
         | machines have gotten in image generation, and it is a field
         | I've been following for five years now. However, because I
          | understand the model and can implement it myself, I can
          | convince myself it doesn't need to steal; it just needs to be
          | able to model correlations at some ungodly level.
        
         | ilikehurdles wrote:
         | > Stability AI is happy to follow copyright laws for their
         | music model, because they know that music labels will hold them
         | accountable. So this seems like a good time to point out to
         | larger companies like @WaltDisneyCo that their copyrighted
         | material is being stolen and used too
         | 
         | I mean this is a pretty good point. If they're so sure this is
         | legal, then train on copyrighted audio+video media as they
         | already do with copyrighted visual media.
        
           | zarzavat wrote:
           | Avoiding doing something because you don't want to get sued
           | and subjected to a lengthy court battle is completely
           | rational and it doesn't mean that doing that thing is
           | illegal.
           | 
            | For example, for decades many TV shows came up with their own
            | lyrics for the "Happy Birthday" song, even though it was well
           | that the song wasn't copyrighted, because nobody wanted to
           | get sued and fight _that_ battle. Easier to just change a few
           | words in the script.
        
       | rsync wrote:
        | What's going to happen when (not if) it becomes cheap and simple
        | to mock up your own head and you "present" that in multiple
        | locations, simultaneously?
       | 
       | It's interesting to think about how these systems (and their
       | human operators) will react when their system recognizes, with
       | certainty, that X is in two places (or 15) at once ...
       | 
       | ... or if X is recorded somewhere (Zurich) and then two hours
       | later at an impossible distance (San Francisco) ...
       | 
        | In a way, it's the _opposite_ of the "Sigil" plot device in
        | Gibson's _Zero History_, wherein the wearer was invisible to
        | security camera networks. [1] Instead, the operator of this
        | network of clones aspires to be on _as many cameras as possible_.
       | 
       | [1] https://en.wikipedia.org/wiki/Zero_History
        
       | cshimmin wrote:
        | Hmmm...
        | 
        |     June 11, 2020: MegaFace dataset is now decommissioned.
        |     University of Washington has ceased distributing the MegaFace
        |     dataset citing the challenge has concluded and that
        |     maintenance of their platform would be too burdensome.
        
       | Imnimo wrote:
       | If I understand correctly, this dataset isn't even being used to
        | train commercial facial recognition models; it's just being used
       | to benchmark them? The implication seems to be that it should be
       | illegal to even apply an algorithm (of any sort) to an image that
       | you don't have a commercial license for?
        
       | georgeglue1 wrote:
       | Are there any licenses that are generally permissive, but
       | prohibit certain programmatic, law enforcement, government, etc.
        | use cases?
       | 
       | It'd be interesting legal territory if someone has tried this
       | already.
        
         | 542458 wrote:
         | IANAL.
         | 
          | I don't think you can prevent scraping or use in ML corpora in
          | this way. Copyright prevents the creation of non-transformative
          | copies of a work, aside from some protected use cases (parody,
          | education, etc.). All OSS licenses do is provide a right to
          | copy a work provided certain conditions (attribution, copyleft)
          | are met. But the general legal consensus, as far as I know, is
          | that most ML models meet the threshold for being a new
          | transformative work, so copyright doesn't apply. Accordingly,
          | you can't use copyright to prevent something from being part of
          | an ML corpus.
         | 
          | That said, if your question is broader than the article... if
          | you're just talking about _non-transformative uses_ (i.e., just
          | using open source software) I don't see any reason why you
          | couldn't create a license that doesn't allow software to be
          | deployed into certain environments. Some examples:
         | 
         | https://www.cs.ucdavis.edu/~rogaway/ocb/license2.pdf
         | 
         | https://www.linux.com/news/open-source-project-adds-no-milit...
         | 
         | No idea how these would do in court though.
        
           | dragonwriter wrote:
           | > Copyright prevents the creation of non-transformative
           | copies of a work
           | 
           | It also prevents transformative derivatives.
           | 
           | Both nontransformative copies and transformative derivative
            | works may meet (in the US) the exception for _fair use_,
           | which is the usual argument for nonlicensed use in ML
           | training.
        
           | AlexandrB wrote:
           | > But the general legal consensus as far as I know is that
           | most ML models meet the threshold for being a new
           | transformative work, so copyright doesn't apply.
           | 
           | Has this been tested in court yet?
        
             | jefftk wrote:
             | It hasn't yet. I think this is the central claim of the
              | GitHub Copilot suit.
             | 
             | There's a prediction market on whether the suit will be
             | successful, which is currently at 43%:
              | https://manifold.markets/JeffKaufman/will-the-github-copilot...
        
       | hk__2 wrote:
       | (2021)
       | 
       | > Last updated: Jan 23, 2021
        
       | [deleted]
        
       | MuffinFlavored wrote:
        | Slightly unrelated: how many "distinguishable/unique" faces or
        | facial styles are there, in terms of broad categories?
       | 
        | Obviously hard to define, but as somebody who moved around a lot
        | growing up, I would (quite often) catch my brain thinking I'd
        | recognized somebody, only to remember I was in a totally
        | different state than the one where the person I thought I was
        | recognizing lived.
        
       ___________________________________________________________________
       (page generated 2023-01-02 23:00 UTC)