[HN Gopher] Megaface
___________________________________________________________________
Megaface
Author : gennarro
Score : 187 points
Date : 2023-01-02 01:09 UTC (21 hours ago)
(HTM) web link (exposing.ai)
(TXT) w3m dump (exposing.ai)
| hrkucuk wrote:
| Are all these faces white?
| Zamicol wrote:
| No.
| jonplackett wrote:
| Does anyone have any stats on the ethnicities and genders of the
| people in this dataset?
|
| Is this still widely used to test face recognition?
| wongarsu wrote:
| There is the DiveFace dataset/metadata [1], which is a subset
| of the Megaface dataset with six equally sized groups: three
| ethnicities times two genders.
|
| 1: https://github.com/BiDAlab/DiveFace
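|
| (If you want to check the balance yourself, here is a minimal
| Python sketch -- the annotation filename and the
| "ethnicity"/"gender" column names are hypothetical, so adapt
| them to the repo's actual format:
|
|     import csv
|     from collections import Counter
|
|     counts = Counter()
|     with open("diveface_annotations.csv", newline="") as f:
|         for row in csv.DictReader(f):
|             counts[(row["ethnicity"], row["gender"])] += 1
|
|     for group, n in sorted(counts.items()):
|         print(group, n)
|
| A balanced subset like DiveFace should print six roughly equal
| counts.)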
| sschueller wrote:
| This dataset's creation and usage violate Swiss law [1]. Any
| person in Switzerland has the right to their face in any picture
| taken now and at any time in the future, even if taken by someone
| else. Without the explicit consent of a person, their face may
| not be used or published in any way or form. There are only a few
| exceptions, such as for public figures and celebrities, but even
| then they retain a right to privacy.
|
| SRF once did a segment about face recognition and public photos
| from social media. Under strict supervision and journalistic
| protection they created a dataset and showed what was possible.
| The dataset and code were then destroyed. [2]
|
| Similar laws exist in EU states as well.
|
| [1]
| https://www.edoeb.admin.ch/edoeb/de/home/datenschutz/Interne...
|
| [2] https://www.srf.ch/news/schweiz/automatische-gesichtserkennu...
| sacrosancty wrote:
| [dead]
| gardenhedge wrote:
| It strikes me as sad that people's photos have been taken and
| used to train a technology for corporate profit. People just
| wanted to share their wedding photos.
| sacrosancty wrote:
| Replace "corporate profit" with "social good", which is what it
| generally comes from, and then is it still sad?
|
| You seem to imply there's something wrong with corporate
| profit. We as society want and encourage corporate profit
| because we want the social good that corporations provide and
| the profit incentivizes them to do it. Profit is a rough
| measure of how much good they do for people.
|
| Profit is like salary for investors. Salary is fine for doctors
| and teachers, isn't it? It's also fine for investors, who do
| the useful and difficult job of deciding which companies are
| doing the most good, then encouraging them to do more of it by
| investing money.
| fleddr wrote:
| I agree, but why would one share wedding photos using an open
| license like Creative Commons?
| jefftk wrote:
| In this case, because Flickr used to waive hosting fees for
| people who chose a Creative Commons license.
| tomrod wrote:
| Because regulations are generally unknown to folks who don't
| spend their time solving tech problems.
|
| People simply assumed they could share it easily with friends
| and family.
| fleddr wrote:
| This doesn't match my experience, and I run a photo
| community myself.
|
| You're absolutely right that people are generally fairly
| clueless about licenses, especially in the amateur domain.
| And the main implication of that is that they don't bother
| with it at all and leave it at whatever the default is,
| which typically is "copyrighted, all rights reserved".
|
| Those explicitly tinkering with licenses, which is a
| purposeful action, tend to actually know (somewhat) what
| they are doing.
|
| Further, if you leave a photo's license to its default,
| copyrighted, absolutely nothing stops you from sharing it
| with friends and family. What would happen? You share it
| with them and then sue yourself?
|
| Similarly, somebody you don't even know could use your
| copyrighted image and post it on social media. Again, nothing
| happens; this widespread behavior gets called "fair use", which
| it legally absolutely isn't. But nobody cares, as nobody will
| sue over it unless there is a case of vast commercial usage.
| ironmagma wrote:
| It's no worry: someday that data will be so ubiquitous and
| well-studied that it won't even be profitable; it will just be
| trivial to construct or deconstruct any face.
| [deleted]
| samwillis wrote:
| One of the difficulties with these training datasets lies in the
| currently understood rules around web scraping. The current legal
| precedent [0] is that web scraping is perfectly legal, despite
| what is in the website's terms of service, "licence" or
| robots.txt. If a human can navigate to it freely, you can scrape
| it using automated means.
|
| What you can't do with scraped data is republish it verbatim.
| Doing a data analysis on scraped data is permitted by law, and
| you can publish your analysis of that data.
|
| The question is, is an AI model trained on scraped data a derived
| analysis that is therefore legal? Or is it republishing of the
| original data? We need a test case to find out.
|
| In the case of this dataset, I don't think the CC license applies
| to people using it. It "may" apply to redistribution of it for
| free. If the dataset was sold, that would be a violation. I
| suspect (once tested in court) a model trained on this dataset
| would be allowed despite the CC license on the photos.
|
| Personally, in this case I think the ethics committee of the
| University should have put up barriers to the project. The morals
| of this are questionable at best.
|
| 0: https://techcrunch.com/2022/04/18/web-scraping-legal-court/
| pbhjpbhj wrote:
| >Or is it republishing of the original data?
|
| If it's publishing _data_ then you're fine under regular
| copyright, as it only protects artistic works and not things
| like data. You might fall afoul of other IP legislation, but not
| copyright.
|
| _YMMV, this is not legal advice and represents my personal
| opinion unrelated to my employment._
| fragmede wrote:
| The CFAA would be the thing to look out for.
| cmeacham98 wrote:
| The "data" here is photographs, which all jurisdictions I'm
| aware of treat as coprightable.
| Mtinie wrote:
| Which makes this case even more interesting to me. Some
| percentage of those photos' copyrights are owned by
| corporations rather than the pictured individuals.
|
| If it was simply a large group of selfies, I don't expect
| much legal challenge from the allegedly aggrieved. But when
| companies with legal counsel get involved...
| KRAKRISMOTT wrote:
| You can do the scraping in a jurisdiction where it is legal.
| traceroute66 wrote:
| > You can do the scraping in a jurisdiction where it is
| legal.
|
| No such thing with GDPR.
|
| Why do you think so many US websites take the lazy-ass
| approach and block EU visitors to their websites?
|
| Simple: it's because either you comply with GDPR or you don't
| process the information of citizens of GDPR-covered
| countries. End of story.
| pixl97 wrote:
| If I'm in China and I scrape/collect data, I don't think the
| GDPR is going to do anything to me. This really only
| affects businesses that the EU has some means of
| reaching.
| [deleted]
| laingc wrote:
| Well, no, only if you're under the jurisdiction of the EU
| courts. They can rule against you as much as they like, but
| it's not enforceable outside of the EU or a jurisdiction
| that chooses to enforce EU judgements.
| pbhjpbhj wrote:
| Importing (in the geographical sense) the data would still be
| infringing; you've just scraped it in a convoluted way. Legal
| systems, in my limited experience, take account of such
| things.
| JumpCrisscross wrote:
| Beyond copyright, how would these requirements work with
| Illinois' biometrics law?
| the_duke wrote:
| It's not as easy as that.
|
| Pictures are clearly personally identifiable data, so storing
| them violates the GDPR if you don't have permission to do so.
|
| Some "data analysis company" got fined a hefty sum for doing so
| with EU citizens.
|
| I forgot the name, but they were recently in the news for
| helping Ukraine identify Russian soldiers by picture.
|
| Of course they were also aggregating other data including
| names, so just pictures might be a more complicated case, but
| as a company with EU exposure I wouldn't do it. It's pretty
| clearly against the law.
| samwillis wrote:
| You are quite right, forgot that one.
|
| Point is though, we need a test case to go through the courts
| to clarify all of this. There are companies betting billions
| on the outcome that they are ok to do what they are doing.
| fleddr wrote:
| "Pictures are clearly personally identifiable data, so
| storing them violates the GDPR if you don't have permission
| to do so."
|
| Wouldn't a Creative Commons license express this permission?
| kixiQu wrote:
| IANAL, but I believe no; the CC license handles the rights
| that a photographer can hand out, but doesn't come with any
| kind of model release guarantees.
| fleddr wrote:
| Model release is a good point, but in many situations
| where people are photographed it does not apply. When you
| take photos of yourself, your family, or even of people
| in public spaces, you do not require a model release. And
| I imagine this to be the main input of this training set.
| When you hire a model, photograph the person and then use
| these photos for promotion or commercial activities, you
| do require a model release. But in that case it would be
| absurdly weird to publish such commercial material as CC
| NC on Flickr; it makes no sense.
| satvikpendem wrote:
| There was an updated ruling in November 2022 in which hiQ was
| ruled against and reached a settlement with LinkedIn, so I'm
| not sure that web scraping is entirely legal.
|
| https://www.natlawreview.com/article/hiq-and-linkedin-reach-...
| EMIRELADERO wrote:
| That was because Linkedin added a no-scraping clause to their
| ToS and also put up a login wall for _viewing profiles_ in
| the first place.
|
| If you scraped from a web page without actually signing up
| for an account you wouldn't be accepting the terms and would
| thus be legally in the clear.
| charcircuit wrote:
| An ML model would be considered transformational.
| Jerry2 wrote:
| It's unfortunate they removed it. Is there a public
| mirror/torrent of it by any chance?
| colesantiago wrote:
| "All photos included a Creative Commons licenses, but most were
| not licensed for commercial use."
|
| I wonder what the implications are for Stable Diffusion, DALL-E
| and Midjourney, given that art images on the internet are
| copyrighted by default.
|
| Even with a fair use argument, there are examples in cases where
| AI was generating art that included the signatures of artists.
|
| https://nwn.blogs.com/nwn/2022/12/lensa-ai-art-images-withou...
| gwern wrote:
| This is apples and oranges. SD et al are defended on the
| grounds of being transformative use
| (https://en.wikipedia.org/wiki/Transformative_use): they do not
| distribute (i.e. _copy_) the original training images, and they
| are not a derivative work due to transformativeness, so the
| license of the original images is completely irrelevant.
| (Details like 'signatures' are also irrelevant: if I write a
| style parody of William Shakespeare and add a '--Willy
| Shakespeare' at the end to round it off, have I revealed that I
| have secretly copied his work? Of course not. It's just
| plausible that there would be a name there, so I came up with a
| name.)
|
| The criticism here is that distributing (copying) the original
| image violates the non-commercial clause of the original images
| because someone, somewhere, might somehow have made money in
| some way because the dataset exists; but as they somewhat
| lamely acknowledge later, what counts as 'commercial' has never
| been clearly defined, and it probably _can't_ be defined
| (because for most people 'commercial' seems to be defined by
| 'ewww'), and this is why CC-NC licenses are heavily discouraged
| by WMF and other FLOSS groups and weren't part of FLOSS from
| the beginning even though Stallman was in large part attacking
| commercial exploitation.
| schemescape wrote:
| Does anyone know if attempts have been made to trick these ML
| models into reproducing original copyrighted inputs verbatim
| (edit: or close enough)?
|
| Edit: Asking about verbatim copies wasn't really a great
| question. I should have asked about producing things that are
| "close enough to cause legal trouble" (whether that be due to
| copyright, trademark, or something else).
| sdenton4 wrote:
| That's not really how memorization in neural networks
| works. For classifiers, memorization is more like learning
| a hash function and a lookup table; no need to store the
| full image at all. Even for very large models, the weights
| are a tiny fraction of the size of the original data.
|
| It's probably helpful to think of embeddings for generative
| models in a similar way; it's a very specific embedding
| function, like a locality sensitive hash, which doesn't
| require actually storing the data.
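|
| To make the picture concrete, here is a toy Python sketch of
| "memorization as a hash function plus lookup table"; the hashing
| and byte sizes are illustrative, not how any real network is
| implemented:
|
|     import hashlib
|
|     def digest(image_bytes):
|         # Stand-in for a learned embedding: maps an image to a
|         # short fixed-size code.
|         return hashlib.sha256(image_bytes).hexdigest()[:8]
|
|     class LookupClassifier:
|         def __init__(self):
|             self.table = {}  # 8-char digest -> label
|
|         def fit(self, images, labels):
|             for img, y in zip(images, labels):
|                 self.table[digest(img)] = y
|
|         def predict(self, img):
|             return self.table.get(digest(img), "unknown")
|
|     clf = LookupClassifier()
|     clf.fit([b"cat-photo-bytes", b"dog-photo-bytes"],
|             ["cat", "dog"])
|     print(clf.predict(b"cat-photo-bytes"))  # -> "cat"
|
| The table reproduces every training label exactly while keeping
| only a few bytes per image; nothing resembling the original
| images is stored.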
| schemescape wrote:
| Thanks. Yes, I shouldn't have asked about "verbatim"
| copies -- I should have asked about something more like
| "close enough to cause legal trouble". Obviously copying
| verbatim is a violation of copyright, but there must be
| some threshold of "close enough" that is still
| problematic. E.g. compressed MP3s of copyrighted songs
| aren't a verbatim reproduction, but as far as I'm aware
| they're still covered by copyright.
|
| Trademarks are even broader.
| polygamous_bat wrote:
| One, the diffusion model's possible output space contains
| every RGB image ever. But two, it cannot possibly store
| the original inputs verbatim, because (the size of
| the model)/(the size of the training set) comes out to
| something like 0.2 KB per image. Unless it's an incredible
| compression algorithm, diffusion must necessarily have
| learned something from the input rather than copy-pasting
| things, as claimed upthread.
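|
| As a back-of-the-envelope check (the figures below are assumed
| round numbers, not exact counts for any particular model or
| dataset snapshot):
|
|     params = 0.9e9       # ~0.9B parameters, roughly SD-scale
|     bytes_per_param = 4  # fp32 weights
|     model_bytes = params * bytes_per_param  # ~3.6 GB
|
|     n_images = 2e9       # order of a LAION-scale training set
|     print(model_bytes / n_images)  # ~1.8 bytes per image
|
| Even assuming a training subset 100x smaller, the budget is only
| ~0.2 KB per image, orders of magnitude below what verbatim
| storage of a photo would need.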
| schemescape wrote:
| I edited my post a while ago, but I shouldn't have asked
| about "verbatim" copies. See my reply to the sibling for a
| more interesting question.
| gwern wrote:
| There's been a lot of work on memorization, yes, and you
| can also do nearest-neighbor lookups in the original data
| to gutcheck 'memorization'. As usual, the answer is "it's
| complicated" but for most practical purposes, the answer is
| 'no': you will get the Mona Lisa if you ask for it,
| absolutely, but the odds of a randomly generated image
| being a doppelganger are near-zero. (If you've seen stuff on
| social media to the contrary, then you may have been misled
| by various people peddling img2img or 'variation'
| functions, or prompting for it, or other ways of
| lying/ignorance.)
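|
| For anyone who wants to run the nearest-neighbor gutcheck, a
| minimal sketch; the embeddings here are random stand-ins (in
| practice you'd use a perceptual embedding such as CLIP features
| and calibrate the threshold on known-duplicate pairs):
|
|     import numpy as np
|
|     def nearest_neighbor(query, train_vecs):
|         dists = np.linalg.norm(train_vecs - query, axis=1)
|         i = int(np.argmin(dists))
|         return i, float(dists[i])
|
|     rng = np.random.default_rng(0)
|     train_vecs = rng.normal(size=(10_000, 512))  # training set
|     query = rng.normal(size=512)  # a generated image's embedding
|
|     i, dist = nearest_neighbor(query, train_vecs)
|     THRESHOLD = 5.0  # assumed; calibrate on known duplicates
|     if dist < THRESHOLD:
|         print(f"suspiciously close to item {i}: {dist:.2f}")
|     else:
|         print(f"nearest item {i} at {dist:.2f}; looks novel")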
|
| But you certainly can get things like watermarks without
| any real memorization. Watermarks have been a nuisance in
| GANs, for example - the StyleGAN CATS model was filled with
| watermarks and attempted meme text captions, even though the
| cats were so nightmarish that they obviously weren't
| 'plagiarized'. So nobody made a big deal about it back then;
| people understood the GAN had simply learned that watermarks
| were a thing in many real images and would try to imitate
| them where plausible in a sample.
| Scaevolus wrote:
| That's a known failure mode called "overfitting" or
| "memorization", where a specific training input is very
| accurately reproduced.
|
| I'm not aware of it occurring for any copyrighted inputs,
| but it occurs for many famous artworks -- it's nearly
| impossible to convince Stable Diffusion to restyle the "Mona
| Lisa" at all; it reproduces the original instead.
| theptip wrote:
| > distributing the original image violates the non-commercial
| clause of the original images because someone, somewhere,
| might somehow have made money
|
| I agree with the rest of your post, but this point seems a
| bit uncharitable.
|
| I think the claims would be:
|
| 1. It's a breach of copyright for Megaface to share the
| images in any case without attribution & replicating the
| CC-NC license. It would (presumably) be OK assuming Megaface
| were to correctly apply the CC-NC licenses to the dataset.
|
| 2. It's a breach of copyright for anyone consuming Megaface
| (e.g. Google) to use those images for commercial purposes.
|
| And your argument for SD applies to 2. that regardless of
| license, it's OK to create a transformative work. But it
| still doesn't get Megaface off the hook for 1. - distributing
| those images without the license.
| pbhjpbhj wrote:
| >if I write a style parody of William Shakespeare and add a '
| --Willy Shakespeare' at the end to round it off, have I
| revealed that I have secretly copied his work? //
|
| I doubt you're suggesting SD, Dall-E, etc., are producing
| parodies so bringing in parody considerations muddies the
| water a lot. Also, Shakespeare's works are out of copyright.
|
| If you sell a painting signed with a [facsimile] signature of
| Dali then it's pretty hard to say you didn't copy the
| signature, as a minimum. That's likely to be a trademark
| violation too. Now, suppose you include aspects in the image
| specifically associated with the artist, and a signature, ...
| there's no way to genuinely deny that is a derivative.
| theptip wrote:
| > a painting signed with a [facsimile] signature of Dali
|
| That's not what's happening here though.
|
| If you look at the original tweet
| (https://twitter.com/LaurynIpsum/status/1599953586699767808)
| it seems that the
| complaint is about the "mangled remains of an artist's
| signature". I don't see any examples where it's actually
| copying the signature of a specific artist.
|
| (Please do share an example of that if there is one.)
| return_to_monke wrote:
| I do respect artists' concerns. I have a hard time
| getting this one, though. The AI learned that humans
| usually put squiggly lines in the corners, and it does,
| too. What is wrong with this?
| polygamous_bat wrote:
| > Even with a fair use argument, there are examples in cases
| where AI was generating art that included the signatures of
| artists.
|
| I went through the post, and I am not sure whether I agree with
| the analysis of the examples. Diffusion models are conceptual
| parrots, and it is possible that "25% of images contain a scribble
| in the bottom right corner, so the model will make a scribble
| in the corner" is what is being construed as a signature in
| this post.
|
| I think a large part of outrage from the artists about
| diffusion model "stealing" art comes from a place of disbelief
| that machines can be this good without "stealing", and it's
| perfectly natural. In fact, it's unnatural to me how good
| machines have gotten in image generation, and it is a field
| I've been following for five years now. However, because I
| understand the model and can implement it myself, I can
| convince myself it doesn't need to steal, just needs to be able
| to model correlations at some ungodly level.
| ilikehurdles wrote:
| > Stability AI is happy to follow copyright laws for their
| music model, because they know that music labels will hold them
| accountable. So this seems like a good time to point out to
| larger companies like @WaltDisneyCo that their copyrighted
| material is being stolen and used too
|
| I mean this is a pretty good point. If they're so sure this is
| legal, then train on copyrighted audio+video media as they
| already do with copyrighted visual media.
| zarzavat wrote:
| Avoiding doing something because you don't want to get sued
| and subjected to a lengthy court battle is completely
| rational and it doesn't mean that doing that thing is
| illegal.
|
| For example, for decades many TV shows came up with their own
| lyrics for the "Happy Birthday" song, even though it was well
| known that the song's copyright claim was dubious, because
| nobody wanted to get sued and fight _that_ battle. Easier to
| just change a few words in the script.
| rsync wrote:
| What's going to happen when (not if) it becomes cheap and simple
| to mock up your own head and you "present" that in multiple
| locations, simultaneously ?
|
| It's interesting to think about how these systems (and their
| human operators) will react when their system recognizes, with
| certainty, that X is in two places (or 15) at once ...
|
| ... or if X is recorded somewhere (Zurich) and then two hours
| later at an impossible distance (San Francisco) ...
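|
| The second case is at least mechanically easy to flag. A minimal
| sketch of such an "impossible travel" check (the coordinates and
| speed threshold are illustrative assumptions):
|
|     from math import radians, sin, cos, asin, sqrt
|
|     def haversine_km(lat1, lon1, lat2, lon2):
|         # Great-circle distance between two points, in km.
|         dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
|         a = (sin(dlat / 2) ** 2
|              + cos(radians(lat1)) * cos(radians(lat2))
|              * sin(dlon / 2) ** 2)
|         return 2 * 6371 * asin(sqrt(a))
|
|     def impossible_travel(a, b, max_kmh=900):  # ~airliner speed
|         (lat1, lon1, t1), (lat2, lon2, t2) = a, b
|         hours = abs(t2 - t1) / 3600
|         if hours == 0:
|             return True  # two places at the same instant
|         km = haversine_km(lat1, lon1, lat2, lon2)
|         return km / hours > max_kmh
|
|     zurich = (47.37, 8.54, 0)
|     san_francisco = (37.77, -122.42, 2 * 3600)  # 2 hours later
|     print(impossible_travel(zurich, san_francisco))  # True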
|
| In a way, it's the _opposite_ of the "Sigil" plot device in
| Gibson's _Zero History_, wherein the wearer was invisible to
| security camera networks.[1] Instead, the operator of this
| network of clones aspires to be on _as many cameras as possible_.
|
| [1] https://en.wikipedia.org/wiki/Zero_History
| cshimmin wrote:
| Hmmm... June 11, 2020: MegaFace dataset is now
| decommissioned. University of Washington has ceased distributing
| the MegaFace dataset citing the challenge has concluded and that
| maintenance of their platform would be too burdensome.
| Imnimo wrote:
| If I understand correctly, this dataset isn't even being used to
| train commercial facial recognition models, it's just being used
| to benchmark them? The implication seems to be that it should be
| illegal to even apply an algorithm (of any sort) to an image that
| you don't have a commercial license for?
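|
| For reference, the published MegaFace protocol was identification
| against a gallery padded with up to a million "distractor" faces.
| A minimal sketch of that style of benchmark, with random vectors
| standing in for a real recognition model's embeddings:
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|     dim, n_ids, n_distractors = 128, 100, 10_000
|
|     enrolled = rng.normal(size=(n_ids, dim))  # one per identity
|     probes = enrolled + 0.1 * rng.normal(size=(n_ids, dim))
|     distractors = rng.normal(size=(n_distractors, dim))
|     gallery = np.vstack([enrolled, distractors])
|
|     def rank1_accuracy(probes, gallery):
|         # Probe i is a hit if its nearest gallery row is row i.
|         hits = 0
|         for i, p in enumerate(probes):
|             d = np.linalg.norm(gallery - p, axis=1)
|             hits += int(np.argmin(d) == i)
|         return hits / len(probes)
|
|     print(f"rank-1: {rank1_accuracy(probes, gallery):.2%}")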
| georgeglue1 wrote:
| Are there any licenses that are generally permissive, but
| prohibit certain programmatic, law enforcement, government,
| etc. use cases?
|
| It'd be interesting legal territory if someone has tried this
| already.
| 542458 wrote:
| IANAL.
|
| I don't think you can prevent scraping or use in ML corpuses in
| this way. Copyright prevents the creation of non-transformative
| copies of a work other than some protected use cases (parody,
| education, etc). All OSS licenses do is provide a right to copy
| a work provided certain conditions (attribution, copyleft) are
| met. But the general legal consensus as far as I know is that
| most ML models meet the threshold for being a new
| transformative work, so copyright doesn't apply. Accordingly,
| you can't use copyright to prevent something from being part of
| a ML corpus.
|
| That said, if your question is broader than the article... if
| you're just talking about _non-transformative uses_ (i.e., just
| using open source software) I don't see any reason why you
| couldn't create a license that doesn't allow software to be
| deployed into certain environments. Some examples:
|
| https://www.cs.ucdavis.edu/~rogaway/ocb/license2.pdf
|
| https://www.linux.com/news/open-source-project-adds-no-milit...
|
| No idea how these would do in court though.
| dragonwriter wrote:
| > Copyright prevents the creation of non-transformative
| copies of a work
|
| It also prevents transformative derivatives.
|
| Both nontransformative copies and transformative derivative
| works may meet (in the US) the exception for _fair use_,
| which is the usual argument for nonlicensed use in ML
| training.
| AlexandrB wrote:
| > But the general legal consensus as far as I know is that
| most ML models meet the threshold for being a new
| transformative work, so copyright doesn't apply.
|
| Has this been tested in court yet?
| jefftk wrote:
| It hasn't yet. I think this is the central claim of the
| GitHub Copilot suit.
|
| There's a prediction market on whether the suit will be
| successful, which is currently at 43%:
| https://manifold.markets/JeffKaufman/will-the-github-copilot...
| hk__2 wrote:
| (2021)
|
| > Last updated: Jan 23, 2021
| [deleted]
| MuffinFlavored wrote:
| Slightly unrelated: how many "distinguishable/unique" faces or
| facial styles are there, in terms of "broad categories"?
|
| Obviously hard to define but as somebody who moved around a lot
| growing up, I would catch my brain thinking I'd recognize
| somebody (quite often) only to remember I was in a totally
| different state than where the person I thought I was recognizing
| lived.
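|
| One hedged way to put a number on "broad categories" would be to
| cluster face embeddings and see how distinct the clusters look.
| Everything in this sketch (the choice of k, the random vectors
| standing in for real face embeddings) is an illustrative
| assumption:
|
|     import numpy as np
|
|     def kmeans(x, k, iters=50, seed=0):
|         rng = np.random.default_rng(seed)
|         centers = x[rng.choice(len(x), k, replace=False)]
|         for _ in range(iters):
|             # Assign each embedding to its nearest center.
|             d = np.linalg.norm(x[:, None] - centers[None], axis=2)
|             labels = d.argmin(axis=1)
|             # Move each center to the mean of its members.
|             for j in range(k):
|                 if (labels == j).any():
|                     centers[j] = x[labels == j].mean(axis=0)
|         return labels, centers
|
|     faces = np.random.default_rng(1).normal(size=(2000, 128))
|     labels, _ = kmeans(faces, k=20)
|     print(np.bincount(labels))  # rough "category" sizes
|
| On real embeddings, the number of clusters that looks "natural"
| is exactly the sort of question your experience suggests has no
| crisp answer.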
___________________________________________________________________
(page generated 2023-01-02 23:00 UTC)