[HN Gopher] Extracting AI models from mobile apps
___________________________________________________________________
Extracting AI models from mobile apps
Author : smoser
Score : 222 points
Date : 2025-01-05 13:19 UTC (9 hours ago)
(HTM) web link (altayakkus.substack.com)
(TXT) w3m dump (altayakkus.substack.com)
| do_not_redeem wrote:
| Can anyone explain that resize_to_320.tflite file? Surely they
| aren't using an AI model to resize images? Right?
| smitop wrote:
| tflite files can contain a ResizeOp that resizes the image:
| https://ai.google.dev/edge/api/tflite/java/org/tensorflow/li...
|
| The file is only 7.7kb, so it couldn't contain many weights
| anyways.
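|
| For illustration, a file like that could plausibly contain
| nothing but a resize op and zero learned weights -- a minimal
| Python sketch (my guess at how such a file could be produced,
| not the app's actual pipeline):
|
|     import tensorflow as tf
|
|     # A "model" that only resizes its input to 320x320 -- no weights.
|     @tf.function(input_signature=[tf.TensorSpec([1, 640, 640, 3], tf.float32)])
|     def resize_to_320(image):
|         return tf.image.resize(image, [320, 320])
|
|     converter = tf.lite.TFLiteConverter.from_concrete_functions(
|         [resize_to_320.get_concrete_function()])
|
|     with open("resize_to_320.tflite", "wb") as f:
|         f.write(converter.convert())  # a few kB: just graph structure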
| raydiak wrote:
| Exactly. Put another way, tensorflow is not an AI. You can
| build an AI in tensorflow. You can also resize images in
| tensorflow (using the traditional algorithms, not AI). I am
| not an expert, but as I understand it, it is common for
| vision models to require a fixed-resolution input, and it is
| common for that resolution to be quite low due to resource
| constraints.
| koe123 wrote:
| Probably not what you're alluding to, but AI upscaling of
| images is definitely a thing.
| JTyQZSnP3cQGa8B wrote:
| > Keep in mind that AI models [...] are considered intellectual
| property
|
| Is it ironic or missing a /s? I can't really tell here.
| Freak_NL wrote:
| Standard disclaimer. Like sprinkling 'hypothetically'
| through a comment telling someone where to find a piece of
| abandoned media, where using an unsanctioned channel would
| entail infringing upon someone's intellectual property.
| SunlitCat wrote:
| To be honest, that was my first thought on reading that
| headline as well. Given that especially those large companies
| (but who knows how smaller ones got their training data) got a
| huge amount of backlash for their unprecedented collection of
| data all over the web and not just there but everywhere else,
| it's kinda ironic to talk about intellectual property.
|
| If you use one of those AI model as a basis for your AI model
| the real danger could be that the owners of the originating
| data are going after you at some point as well.
| ToucanLoucan wrote:
| Standard corporate hypocrisy. "Rules for thee, not for me."
|
| If you actually expected anything to be open about OpenAI's
| products, please get in touch, I have an incredible business
| opportunity for you in the form of a bridge in New York.
| npteljes wrote:
| I think it's both. It's:
|
| 1. the current, unproven-in-court legal understanding,
|
| 2. a standard disclaimer to cover OP's ass, and
|
| 3. a tongue-in-cheek reference to the prevalent argument
| that training AI on data, and then offering it via AI, is
| being a parasite on that original data.
| wat10000 wrote:
| " Keep in mind that AI models, like most things, are considered
| intellectual property. Before using or modifying any extracted
| models, you need the explicit permission of their owner."
|
| That's not true, is it? It would be a copyright violation to
| distribute an extracted model, but you can do what you want with
| it yourself.
| wslh wrote:
| It's also worth noting that there is still no legal clarity on
| these issues, even if a license claims to provide specific
| permissions.
|
| Additionally, the debate around the sources companies use to
| train their models remains unresolved, raising ethical and
| legal questions about data ownership and consent.
| jdietrich wrote:
| Circumventing a copy-prevention system without a valid
| exemption is a crime, even if you don't make unlawful copies.
| Copyright covers the right to make copies, not the right to
| distribute; "doing what you want with it yourself" may or may
| not be covered by fair use. Whether or not model weights are
| copyrightable remains an open question.
|
| https://www.law.cornell.edu/uscode/text/17/1201
| mcny wrote:
| >Circumventing a copy-prevention system without a valid
| exemption is a crime, even if you don't make unlawful copies.
| Copyright covers the right to make copies, not the right to
| distribute; "doing what you want with it yourself" may or may
| not be covered by fair use. Whether or not model weights are
| copyrightable remains an open question.
|
| If that is the law, it is a defect that we need to fix. Laws
| do not come down from heaven in the form of commandments. We,
| humans, write laws. If there is a defect in the laws, we
| should fix it.
|
| If this is the law, time shifting and format shifting are
| unlawful as well, which to me is unacceptable.
|
| Disclaimer: As usual, IANAL.
| dialup_sounds wrote:
| Time shifting is protected by 40 years of judicial
| precedent establishing it as fair use.
| nadermx wrote:
| This is being tested in the courts currently:
| https://torrentfreak.com/appeals-court-hears-riaa-and-yout-i...
| kmeisthax wrote:
| DMCA 1201 is written so broadly that _any_ feature of a
| product or service can be construed to prevent copying,
| and thus gain 1201 protection.
|
| I don't think YouTube intended regular uploads to have
| DRM, if only because they support Creative Commons
| metadata on uploads, and Creative Commons specifically
| forbids the use of technical protection measures on CC-
| licensed content[0]. On a less moralistic note, applying
| encryption to all YouTube videos would be prohibitively
| expensive because DRM vendors charge $$$ for the tech.
|
| But the RIAA wants DRM because, well, they don't want
| people taking what they have rightfully stolen. So
| YouTube engineered a weak form of URL obfuscation that
| would only stop very basic scrapers[1]. DMCA 1201 doesn't
| care about encryption or obfuscation, though. What it
| does care about is whether something was _intended_ to stop
| copying, and if so, whether the defendant's product was
| designed to defeat that thing.
|
| There's an interesting wrinkle in DMCA 1201 in that
| merely being able to defeat DRM does not make something
| illegal. Defeating DRM has to be the tool's _only
| function_ [2], or you have to advertise the tool as being
| able to defeat DRM[3], in order to actually violate DMCA
| 1201. DRM vendors _usually_ resort to encryption, because
| it makes the circumvention tools specialized enough that
| they have no other purpose and thus fall afoul of DMCA
| 1201. But there's nothing stopping you from using really
| basic schemes (ROT-13 your DVDs!) and still getting to
| sue for 1201.
|
| Going back to the AI ripping question, this blog post is
| probably not in and of itself a circumvention tool[4],
| but anyone implementing it is very much making
| circumvention tools, which are illegal to distribute.
| Circumvention itself is also illegal, but only when
| there's an underlying copyright infringement. i.e. you
| can't just encrypt something that's public domain or
| uncopyrightable and sue anyone who decrypts it.
|
| So the next question is: is an AI model copyrightable? And can you
| sue for 1201 circumvention for something that is
| fundamentally composed of someone else's copyrighted work
| that you don't own and haven't licensed?
|
| [0] Additionally, there is a very large repository of CC-
| BY music from Kevin MacLeod that is used all over YouTube
| that would have to be removed or relicensed if the RIAA
| were to prevail in this case.
|
| I have no idea if Kevin actually intends to enforce the
| no-DRM clause in this way, though. Kevin actually has a
| fairly loose interpretation of CC-BY. For example,
| _nobody_ attributes his music correctly, either the way
| the license requires, or with Kevin's (legally
| insufficient) recommended attribution strings. He does
| sell commercial (non-attribution) licenses but I've yet
| to hear of any enforcement actions from him.
|
| [1] To be clear, without DRM encryption, _any_ video can
| be ripped by hooking standard HTML5 video APIs using an
| extension.
|
| [2] Things with "limited commercial purposes" beyond
| breaking DRM may also be construed as circumvention tools
| under DMCA 1201.
|
| [3] My favorite example: someone tried selling a VGA-to-
| composite adapter as a way to copy movies off Netflix.
| That is illegal under DMCA 1201.
|
| [4] To be clear, this is NOT settled law, this is "get
| sued and find out if the Supreme Court likes you that
| day" law.
| dialup_sounds wrote:
| Not really. The fair use status of time shifting isn't in
| question there by either party.
| rpdillon wrote:
| Your comment confused me, but I'm very interested in what
| you're getting at.
|
| > Circumventing a copy-prevention system without a valid
| exemption is a crime, even if you don't make unlawful copies.
|
| Yep, this is the DMCA section 1201. Late '90s law in the US.
|
| > Copyright covers the right to make copies, not the right to
| distribute
|
| This is where I got confused. Copyright covers four rights:
| copying, distribution, creation of derivative works, and
| public performance. So I'm not sure what you were getting at
| with the copy/distribute dichotomy.
|
| But here's a question I'm curious about: Can DMCA apply to a
| copy-protection mechanism that's being applied to non-
| copyrightable work? Based on my reading of
| https://www.copyright.gov/dmca/:
|
| > First, it prohibits circumventing technological protection
| measures (or TPMs) used by copyright owners to control access
| to their works.
|
| That's an overview rather than the letter of the law, but
| it does seem to suggest you can't bring a DMCA 1201 claim
| against someone circumventing copy-protection for
| uncopyrightable works.
|
| > Whether or not model weights are copyrightable remains an
| open question.
|
| And this is where the interaction with the wording of 1201
| gets interesting, in my (non-professional) opinion!
| lcnPylGDnU4H9OF wrote:
| Here is the relevant text in the law:
|
| > No person shall circumvent a technological measure that
| effectively controls access to a work protected under this
| title.
|
| The inclusion of "work protected under this title" makes it
| clear in the law, though I doubt a judge would rule
| otherwise without that line. (Otherwise, I'd wonder if I
| could claim damages that Google et al. are violating the
| technological measures I've put in place to protect the
| specificity of my interests, because it wouldn't matter
| that such is not protected by copyright law.)
|
| Also not an attorney, for what it's worth.
| roywiggins wrote:
| It seems clear from this definition especially:
|
| > (A) to "circumvent a technological measure" means to
| descramble a scrambled work, to decrypt an encrypted
| work, or otherwise to avoid, bypass, remove, deactivate,
| or impair a technological measure, _without the authority
| of the copyright owner_
|
| In this case there _is_ no copyright owner.
| lcnPylGDnU4H9OF wrote:
| Right, that's what I was getting at with my
| parenthetical. Obviously the work has to have an owned
| copyright in order to be protected by copyright law.
| roywiggins wrote:
| sorry, yes, reread your comment and dirty-edited mine
| rusk wrote:
| This is interesting. I wonder whether you could use it as
| a basis for "legally" circumventing a technology by
| applying it to non-copyrighted works.
| lcnPylGDnU4H9OF wrote:
| If you mean that you might be able to decrypt a
| copyrighted work because you used that same encryption
| method on a non-copyrighted work, then definitely not.
| The work under protection will be considered. (Otherwise,
| I am unsure what you meant.)
| rusk wrote:
| From what I recall, it was the actual protection method
| that was protected by the DMCA. When DVD protection was
| cracked, it was forbidden to distribute a particular
| section of code, so they just printed it on a T-shirt to
| troll the powers that be.
| jazzyjackson wrote:
| Better yet just print the colors that represent the
| number, see
| https://en.m.wikipedia.org/wiki/Illegal_number
|
| But then again, knowing the number is a far cry from
| using that number to circumvent DRM
| lcnPylGDnU4H9OF wrote:
| Presuming you are referring to this:
| https://en.wikipedia.org/wiki/AACS_encryption_key_controvers...
|
| > Outside the Internet and the mass media, the key has
| appeared in or on T-shirts, poetry, songs and music
| videos, illustrations and other graphic artworks, tattoos
| and body art, and comic strips.
|
| Using the encryption key to decrypt the data on a DVD is
| illegal "circumvention" per DMCA 1201, if it's done
| without authorization from the copyright owner of the
| data on the DVD. If it were really illegal to simply
| publish the key on a website, then printing it on
| clothing that they sold instead would not be a viable
| loophole.
|
| I'm glad it is still referred to as a controversy that
| they were issuing cease-and-desist letters for publishing
| information when the actual crime they had in mind, which
| was not alleged in the letters, was using the information
| to decrypt a DVD.
| nadermx wrote:
| Actually, in terms of copyright control "The Federal Circuit
| went on to clarify the nature of the DMCA's anti-
| circumvention provisions. The DMCA established causes of
| action for liability and did not establish a property right.
| Therefore, circumvention is not infringement in itself."[1]
|
| https://en.m.wikipedia.org/wiki/Chamberlain_Group,_Inc._v._S...
| bitwize wrote:
| Circumvention is not infringement, but the DMCA makes it a
| separate crime punishable by up to 5 years in prison.
| BadHumans wrote:
| Is using millions of copyrighted works to train your AI a
| valid exemption? Asking for a few billionaire friends.
| larodi wrote:
| just imagine, like just for a second, that it becomes
| illegal to publicly use or distribute anything trained
| without a copyright token which is both in the training
| set - to mark it - and in the output - to recognize it.
|
| so this is where it all goes in several years, if i were
| the gov.
| Lerc wrote:
| I'm not even sure if even the first part is true. Has it
| been determined that AI models are intellectual property?
| Machine-generated content may not be copyrightable, and it
| isn't just the output of generative AI that falls under
| this; arguably the models themselves do too.
|
| Can you copyright a set of coefficients for a formula? With
| a JPEG, it's the image being reproduced that holds the
| copyright. Being the first to run the calculations that
| produce a compressed version of that data should not grant
| you any special rights to the compressed form.
|
| An AI model is just a form of that writ large. When models
| generalize and create new content, it is hard to see how
| either the output or the model that generated it could be
| considered someone's property.
|
| People possess models, I'm not sure if they own them.
|
| There are however billions of dollars at play here and enough
| money can buy you whichever legal opinion you want.
| echelon wrote:
| > AI models are intellectual property
|
| If companies train on data they don't own and expect to own
| their model weights, that's hypocritical.
|
| Model weights shouldn't be copyrightable if the training data
| was pilfered.
|
| But this hasn't been tested because models are locked away in
| data centers as trade secrets. There's no opportunity to
| observe or copy them outside of using their outputs as
| synthetic data.
|
| On that subject, training on model outputs should be fair
| use, and an area where we should use legislation to defend
| access (similar to web-scraping provisions).
| ANewFormation wrote:
| Now that you mention it, I'm quite surprised that none of
| the typical fanatical IP lawsuiters has sued, arguing
| (reasonably I think) that the output of the LLMs is
| strongly suggestive that they have been trained on
| copyrighted materials. Get the lawsuit to discovery, and
| those data centers become fair game.
|
| Perhaps 'strongly suggestive' isn't enough.
| Onawa wrote:
| Wasn't that the goal of both the New York Times lawsuit
| and other class action lawsuits from authors?
|
| https://harvardlawreview.org/blog/2024/04/nyt-v-openai-the-t...
|
| https://www.publishersweekly.com/pw/by-topic/industry-news/p...
| cabalamat wrote:
| > strongly suggestive that they have been trained on
| copyrighted materials
|
| Given that everything -- including this comment -- is
| copyrighted unless it is (1) old or (2) deliberately put
| into the public domain, this is almost certainly true.
| rusk wrote:
| Isn't this comment in the public domain? I presume that's
| what I'm doing when I'm posting on a forum. If somebody
| copied and pasted something I wrote on here could I in
| theory use copyright law to restrict distribution? I
| think the law would say I published it on a public forum
| and thus it is in the public domain.
| taormina wrote:
| Why would it be in the public domain? Anything you
| create, under US copyright law, is the opposite of being
| in the public domain; it's yours. According to the
| legalese of YC, you are granting YC and YC alone a
| license to use the UGC you submitted to their website,
| but if anything, the YC agreement DEMANDS that you own
| the copyright to the comment you are posting.
|
| > User Content Transmitted Through the Site: With respect
| to the content or other materials you upload through the
| Site or share with other users or recipients
| (collectively, "User Content"), you represent and warrant
| that you own all right, title and interest in and to such
| User Content, including, without limitation, all
| copyrights and rights of publicity contained therein. By
| uploading any User Content you hereby grant and will
| grant Y Combinator and its affiliated companies a
| nonexclusive, worldwide, royalty free, fully paid up,
| transferable, sublicensable, perpetual, irrevocable
| license to copy, display, upload, perform, distribute,
| store, modify and otherwise use your User Content for any
| Y Combinator-related purpose in any form, medium or
| technology now known or later developed. However, please
| review the Privacy Policy located here for more
| information on how we treat information included in
| applications submitted to us.
|
| > You acknowledge and agree that any questions, comments,
| suggestions, ideas, feedback or other information about
| the Site ("Submissions") provided by you to Y Combinator
| are non-confidential and Y Combinator will be entitled to
| the unrestricted use and dissemination of these
| Submissions for any purpose, without acknowledgment or
| compensation to you.
| ANewFormation wrote:
| Another example of this is people putting code, intended
| to be shared, up on e.g. Github without a licence.
|
| Many people seem to think that no licence = public
| domain, but it's still under strong copyright protection.
| This is the point of things like the Unlicense license.
| graemep wrote:
| > If somebody copied and pasted something I wrote on here
| could I in theory use copyright law to restrict
| distribution?
|
| Yes, you could, unless you agreed to forum terms that said
| otherwise (fair use aside). It's the same in most
| jurisdictions.
| dragonwriter wrote:
| > Now that you mention it, I'm quite surprised that none
| of the typical fanatical IP lawsuiters has sued, arguing
| (reasonably I think) that the output of the LLMs is
| strongly suggestive that they have been trained on
| copyrighted materials. Get the lawsuit to discovery, and
| those data centers become fair game.
|
| No, there have been lawsuits, and the data centers have
| not been fair game because whether or not the models were
| trained on copyright-protected works is not generally in
| dispute. Discovery only applies to evidence relevant to
| facts in dispute.
| dragonwriter wrote:
| > If companies train on data they don't own and expect to
| own their model weights, that's hypocritical.
|
| It's not hypocritical to follow a line of legal analysis
| which holds that copying material in the course of training
| AI on it is outside the scope of copyright protection (as,
| e.g., fair use in the US), but that the model weights
| resulting from the training are protected by copyright.
|
| It may be wrong, and it may be convenient for the interests
| of the firms involved, but it is not self-inconsistent in
| the way required for it to be hypocrisy.
| Mordisquitos wrote:
| If the resulting AI models are protected by copyright,
| that invalidates the claim that training AI models on
| copyrighted materials is fair use analogous to human
| beings becoming educated by exposure to copyrighted
| materials.
|
| Educated human beings are not protected by copyright,
| hence neither should trained AI models. Conversely, if a
| copyrightable work is produced based on work which itself
| is copyrighted, the resulting work needs the consent of
| the original authors of the prior work.
|
| AI models can't have their (c)ake and eat it.
| drdeca wrote:
| If I take 1000 books and count the distributions of the
| lengths of the words, and the covariance between the
| lengths of one word and the next word for each book, and
| how much this covariance matrix tends to vary across the
| different books, and other things like this, and publish
| these summaries, it seems fairly clear to me that this
| should count as fair use.
|
| (Such a model/statistical-summary, along with a
| dictionary, could be used to generate nonsensical texts
| which have similar patterns in terms of just word
| lengths.)
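|
| (A toy Python version of the kind of summary I mean -- my
| own sketch, not anything from the article:)
|
|     import numpy as np
|
|     def word_length_summary(text):
|         # Summarize a text purely by the lengths of its words.
|         lengths = np.array([len(w) for w in text.split()])
|         return {
|             "mean_length": lengths.mean(),
|             # distribution of word lengths
|             "histogram": np.bincount(lengths),
|             # covariance between one word's length and the next's
|             "lag1_covariance": np.cov(lengths[:-1], lengths[1:])[0, 1],
|         }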
|
| Should the resulting work be protected by copyright? I'm
| not entirely sure...
|
| I guess one thing is, the specific numbers I obtain by
| doing this are not a consequence of any creative decision
| making on my part, which I think in some jurisdictions (I
| don't remember which) plays a role in whether a work is
| copyrightable. (I will use "copyrightable" as an
| abbreviation for "protected by copyright"; I don't mean
| to imply a requirement that someone specifically
| registers for copyright. IIRC this makes it so phone
| books are copyrightable in some jurisdictions but not
| others?)
|
| The particular choice of statistical analysis does seem
| like it may involve creative decision making, but that
| would just be about like, what analysis I describe, and
| how the numbers I publish are to be interpreted, not what
| the numbers are? (Analogous to the source code of an ML
| model, not the parameters.)
|
| Here is another question: suppose there is a method of
| producing a data artifact which would be genuinely (and
| economically) useful, and which does not rely on taking
| in any copyrighted input, but requires a large
| (expensive) amount of compute to produce, and which also
| uses a lot of randomness so that the result would be
| different each time it was done (but suppose also that
| there isn't much point doing it multiple times at the
| same scale, as having two of this kind of data artifact
| wouldn't be much more valuable than having one).
|
| Should such data artifacts be protected by copyright or
| something like it?
|
| Well, if copyright requires creative human decision
| making, then they wouldn't be.
|
| It seems like it would make sense to economically
| incentivize the creation of such data artifacts at larger
| sizes (to a point, of course: only as much as is justified
| by the value produced by their being available).
|
| If such data artifacts can always be distributed without
| restriction, then ones that are publicly available would
| be public goods, and I guess only ones that are trade
| secrets would be private goods? It seems to me like
| having some mechanism to incentivize their creation and
| being-eventually-freely-distributed would be beneficial?
|
| But maybe copyright isn't the best way to do that? Idk.
| _DeadFred_ wrote:
| 'Should the resulting work be protected by copyright? I'm
| not entirely sure...'
|
| This has already been settled, hasn't it? Don't companies
| have to introduce 'flaws' in order for data sets to be
| 'protected'? Merely compiled lists of facts can't be
| protected, which is why election-result companies have to
| rely on NDAs rather than copyright to protect their
| services on election night.
| Mordisquitos wrote:
| > _suppose there is a method of producing a data artifact
| which would be genuinely (and economically) useful, and
| which does not rely on taking in any copyrighted input,
| [...] It seems like it would make sense to want it to be
| economically incentivized to create such data artifacts
| of higher sizes [...] But maybe copyright isn't the best
| way to do that? Idk._
|
| Exactly. It would be patents, not copyright.
| tjr wrote:
| I think it is fair to say that existing copyright law was
| not written to handle all of this. It was written for
| people who created works, and for other people who were
| using those works.
|
| To substitute either party with a computer system and
| assume that the existing law still makes sense may be
| assuming too much.
| OkayPhysicist wrote:
| The model weights are the result of an automated process,
| by definition, and thus not protected by copyright.
|
| In my unusually well-informed on copyright but not a
| lawyer opinion, without any new legislation on the
| subject, I suspect that the most likely scenario for
| intellectual property rights surrounding AI is that using
| other people's works for training probably falls under
| fair use, since it's extremely transformative (an AI that
| makes text and a textual work are very different things)
| and it's extremely difficult to argue that the AI, as it
| exists today, directly impacts the value of the original
| work.
|
| The list of what training data to use is probably
| protected by copyright if hand-picked; otherwise the
| protectable thing is just whatever web crawler they wrote
| to gather it.
|
| The AI models, as in, the inference and training
| applications are protected by copyright, like any other
| application.
|
| The architecture of a particular AI model can be
| protected by patents.
|
| The weights, as the result of an automated process, are
| probably not protected by copyright.
| rsynnott wrote:
| There are certainly publicly available weights with
| restrictive licenses (eg some of the StableDiffusion
| stuff). I'd agree that it'd seem fairly perverse to say
| "our process for making this by slurping in a ton of
| copyright content was not copyright theft, but your use of
| it outside our restrictive license is", but then I'm not a
| lawyer.
| mikewarot wrote:
| >models are locked away in data centers as trade secrets
|
| The architecture and the weights in a model are the secret
| process used to make a commercially valuable output. It
| makes the most sense to treat them as a trade secret, in a
| court of law.
| slowmovintarget wrote:
| Datasets want to be free.
| bayindirh wrote:
| Just ask the owner of the data for their consent before
| adding to a dataset which wants to be free.
| cle wrote:
| The main disagreement is who the "owner" is in the first
| place.
| baryphonic wrote:
| Going a step further, weights, i.e. coefficients, aren't
| produced by a person at all - they're produced by machine
| algorithms. Because a human did not create the weights, the
| weights have no author. Thus they are ineligible for
| copyright in the first place and are in the public domain.
| Whether the model architecture is copyrightable is more of an
| open question, but I think a solid argument could be that the
| model architecture is simply a mathematical expression -
| albeit a complex one -, though Python or other source code is
| almost certainly copyrighted. But I imagine clean-room
| methods could avoid problems there, and with much less effort
| than most software.
|
| IANAL, but I have serious doubts about the applicability of
| current copyright law to existing AI models. I imagine the
| courts will decide the same.
| jonpo wrote:
| You can say the same about compiled executable code though.
| baryphonic wrote:
| Each compiled executable has a one-to-one relation with
| its source code, which has an author (except for LLM code
| and/or infinite monkeys). Thus compiled executables are
| derivative works.
|
| There is an argument also that LLMs are derivative works
| of the training data, which I'm somewhat sympathetic to,
| though clearly there's a difference and lots of ambiguity
| about which contributions to which weights correspond to
| any particular source work.
|
| Again IANAL, and this is my opinion based on reading the
| law & precedents. Consult a real copyright attorney for
| real advice.
| nullc wrote:
| The weights are a product of a mechanical process, 5 years
| ago it would be generally uncontroversial that they would be
| not subject to copyright in the US... but 'industry' has done
| a tremendous job of spreading confusion.
| cle wrote:
| I think you have to distinguish between a model, its
| implementation, and its weights/parameters. AFAIU:
|
| - Models are processes/concepts, thus not copyrightable, but
| _are_ subject to trade secret law, contract and license
| restrictions, patents, etc.
|
| - Concrete implementations may be copyrighted like any code.
|
| - Parameters are "facts", thus not copyrightable, but are
| similarly subject to trade secret and contract law.
|
| IANAL, not legal advice, yadda yadda yadda.
| rileymat2 wrote:
| I doubt the models are copyrighted; aren't works created by
| machines ineligible? Otherwise you get into cases like
| autogenerating all possible musical note combinations and
| claiming ownership of them.
| hnlmorg wrote:
| It's hard to say because, as far as I know, this stuff
| hasn't been definitively tested in any court - at least not
| in Europe; I can't speak for America.
|
| AI models are generally regarded as a company's asset (like a
| customer database would also be), and rightly so given the cost
| required to generate one. But that's a different matter
| entirely to copyright.
| dragonwriter wrote:
| No, copyright violation occurs at the first unauthorized
| copying or creation of a derivative work or exercise of any of
| the other exclusive rights of the copyright holder (that does
| not fall into an exception like that for fair use.) That
| distribution is required for a copyright violation is a
| persistent myth. Distribution is a means by which a violation
| becomes more likely to be detected and also more likely to
| involve significant liability for damages.
|
| (OTOH, whether models, as the output of a mechanical process,
| are subject to copyright is a matter of some debate. The firms
| training models tend to treat the models as if they were
| protected by copyright but also tend to treat the source works
| as if copying for the purpose of training AI were within a
| copyright exception; why each of those positions is in their
| interest is obvious, but neither is well-established.)
| larodi wrote:
| it's insane to state it tbh
| 23B1 wrote:
| > hoarding data
|
| Laundering IP. FTFY.
| jonpo wrote:
| Well done you seem to have liberated an open model trained on
| open data for blind and visually impaired people.
|
| Paper: https://arxiv.org/pdf/2204.03738
|
| Code: https://github.com/microsoft/banknote-net
|
| Training data:
| https://raw.githubusercontent.com/microsoft/banknote-net/ref...
|
| Model:
| https://github.com/microsoft/banknote-net/blob/main/models/b...
|
| Kinda easier to download it straight from github.
|
| It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
|
| But let's not let that get in the way of hating on AI, shall we?
| cess11 wrote:
| [flagged]
| jonpo wrote:
| Yes nothing wrong with cool software or showing people how to
| use it for useful things.
|
| Sorry, I'm just kind of sick of the whole 'kool aid', 'rage
| against AI' thing a lot of people seem to have going on, and
| the way it's presented in the post. I have family members
| with vision impairment helped by this particular app, so
| it's a bit personal.
|
| Nothing against opening stuff up and understanding how it
| works etc. I'd just rather see people build/train useful new
| models and stuff with the open datasets / models already
| available.
|
| I guess AI kind of does pay my bills in a roundabout way.
| a2128 wrote:
| Sadly companies will hoard datasets and model research in
| the name of competitive advantage. Obviously with this
| specific model Microsoft chose to make it open, but this is
| not always the case, and it's not uncommon to read papers
| or technical reports saying they trained on an "internal
| dataset"
| jonpo wrote:
| Companies do have a lot of data, and some of that data
| might be useful for training AI, but >99% of it isn't.
| When companies release a cool model or paper without open
| data (for competitive, privacy, or other reasons, as you
| point out), people can then help build or collect similar
| open datasets. Unfortunately, companies generally don't
| owe you their data, and if they are in the business of
| making models they probably won't share the model either;
| the situation is similar to source code for proprietary
| LoB applications. Fortunately, the best AI researchers
| mostly do like to share their knowledge, and because
| companies want to attract the best AI researchers they
| generally allow them to publish if it's not too
| commercially sensitive. It could be worse: while the
| competitive situation has reduced some visibility into the
| cutting-edge science, lots of datasets and papers are
| still published.
| cess11 wrote:
| In my view there was almost nothing like that in this
| article, besides the first sentence it went right into the
| technical stuff, which I liked. Compared to a lot of
| articles linked here it felt almost free from the battles
| between "AI" fashions.
|
| It seems dang thinks I mistreated you somehow; if you
| agree, I'm sorry - it wasn't my intention.
| dang wrote:
| Can you please edit swipes out of your HN comments? Your post
| would be fine with just the first sentence.
|
| This is in the site guidelines:
| https://news.ycombinator.com/newsguidelines.html.
| cess11 wrote:
| What do you mean, "swipe"? The other person agreed they'd
| misjudged the article and apologised several hours before
| you wrote this.
| llama_drama wrote:
| If this is exactly the same model then what's the point of
| encrypting it?
| TechDebtDevin wrote:
| [flagged]
| timewizard wrote:
| [flagged]
| TechDebtDevin wrote:
| [flagged]
| rob_c wrote:
| I am groot
| dang wrote:
| Please don't respond to a bad comment by breaking the site
| guidelines yourself. That only makes things worse.
|
| https://news.ycombinator.com/newsguidelines.html
| dang wrote:
| Please don't cross into personal attack or otherwise break
| the site guidelines when posting here. Your post would be
| fine with just the first sentence.
|
| https://news.ycombinator.com/newsguidelines.html
| rob_c wrote:
| Really... Some people do need to be taken down a peg here
| at times though.
| DoctorOetker wrote:
| Don't you think it's intentional, so as not to demonstrate
| the technique on potentially copyrighted data?
| dang wrote:
| > But let's not let that get in the way of hating on AI,
| shall we?
|
| Can you please edit this kind of thing out of your HN comments?
| (This is in the site guidelines:
| https://news.ycombinator.com/newsguidelines.html.)
|
| It leads to a downward spiral, as one can see in the
| progression to https://news.ycombinator.com/item?id=42604422
| and https://news.ycombinator.com/item?id=42604728. That's what
| we're trying to avoid here.
|
| Your post is informative and would be just fine without the
| last sentence (well, plus the snarky first two words).
| Lerc wrote:
| Can you clarify this a bit. I presume you are talking about
| the tone more than the implied statement.
|
| If the last sentence were explicit rather than implied, for
| instance
|
| _This article seems to be serving the growing prejudice
| against AI_
|
| Is that better? It is still likely to be controversial and
| the accuracy debatable, but it is at least sincere and could
| be the start of a reasonable conversation, provided the
| responders behave accordingly.
|
| I would like people to talk about controversial things here
| if they do so in a considerate manner.
|
| I'd also like to personally acknowledge how much work you do
| to defuse situations on HN. You represent an excellent
| example of how to behave. Even when the people you are
| talking to assume bad faith you hold your composure.
| jonpo wrote:
| I don't seem to be able to edit it; apologies, I will try
| not to let this type of thing get to me in future.
|
| I would also like to point out that this is a fine-tuned
| vision classifier based on MobileNetV2, not an LLM.
| rob_c wrote:
| ... Because if he did this with a model that's not open,
| that's sure going to keep everyone happy and not result in
| lawsuit(s)...
|
| The same method/strategy applies to closed tools and models
| too, although you should probably be careful if you've
| handed over a credit card for a decryption key to a service
| and then try this ;)
| nthingtohide wrote:
| One thing I noticed in Gboard is that it uses homeomorphic
| encryption to do federated learning of common words used by
| the public, to make encrypted suggestions.
|
| E.g. there are two common spellings of bizarre which are
| popular on Gboard: bizzare and bizarre.
|
| Can something similar help in model encryption?
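|
| (The appeal of homomorphic encryption is that you can compute
| on ciphertexts. A toy illustration of the idea using textbook
| RSA's multiplicative homomorphism -- insecure, and far simpler
| than anything Gboard would actually use:)
|
|     # Textbook RSA: enc(a) * enc(b) decrypts to a * b.
|     p, q, e = 61, 53, 17
|     n = p * q
|     d = pow(e, -1, (p - 1) * (q - 1))  # private exponent
|
|     enc = lambda m: pow(m, e, n)
|     dec = lambda c: pow(c, d, n)
|
|     a, b = 6, 7
|     assert dec(enc(a) * enc(b) % n) == a * b  # computed while encrypted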
| antman wrote:
| Had to look it up, this seems to be the paper
| https://research.google/pubs/federated-learning-for-mobile-k...
| umeshunni wrote:
| Homomorphic, not homeomorphic
| vlovich123 wrote:
| In theory yes, in practice right now no. Homomorphic encryption
| is too computationally expensive.
| boothby wrote:
| If I understand the position of major players in this field,
| downloading models in bulk and training an ML model on that
| corpus shouldn't violate anybody's IP.
| zitterbewegung wrote:
| IANAL, but this is not true: the model would be a piece of
| the software, and if there is a copyright on the app itself
| it would extend to the model. Models also carry licenses of
| their own; for example, LLaMA is released under this
| license [1].
|
| [1] https://github.com/meta-llama/llama/blob/main/LICENSE
| blitzar wrote:
| If I understand the position of major players in this field,
| copyright itself is optional (for them at least).
| rusk wrote:
| They claim "safe harbour" - if nobody complains it's fair
| game
| zitterbewegung wrote:
| True, I think there has to be a case that sets precedent
| for this issue.
| Drakim wrote:
| Is there a material difference between the copyright laws for
| software and the copyright laws for images and text?
| boothby wrote:
| LLMs are trained on works -- software, graphics and text --
| covered by my copyright. What's the difference?
| dragonwriter wrote:
| The fact that model creators _assert_ that they are
| protected by copyright and offer licenses does not mean:
|
| (1) That they are actually protected by copyright in the
| first place, or
|
| (2) That the particular act described does not fall into an
| exception to copyright like fair use, exactly as many model
| creators assert that the exact same act done with the
| materials models are trained on does, rendering the
| restrictions of the license offered moot for that purpose.
| _DeadFred_ wrote:
| Yeah no.
|
| An example for legal reference might be convolution reverb.
| Basically, it's a way to record what a fancy reverb machine
| does (using copyrighted, complex math algorithms) and
| cheaply recreate the reverb on my computer. It seems
| companies can do this as long as they distribute the
| protected reverbs separately from the commercial
| application. So Liquidsonics
| (https://www.liquidsonics.com/software/) sells reverb
| software but offers the 'protected' convolution reverbs as
| free downloads, specifically the Bricasti ones in dispute
| (https://www.liquidsonics.com/fusion-ir/reverberate-3/).
|
| Also, while a SQL server can be copyright-protected, that
| protection doesn't extend ownership of a SQL database to
| the SQL server software's creators.
| avg_dev wrote:
| pretty cool; that frida tool seems really nice.
| https://frida.re/docs/home/
|
| (and a bunch of people seem to be interested in the "IP"
| note, but I took it as just trying not to run into legal
| trouble for advertising "here's how you can 'steal'
| models!")
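|
| For a flavor of what the article does with frida, a minimal
| Python sketch -- the module and function names here are
| hypothetical, not the app's real ones:
|
|     import frida
|
|     # Hook a (hypothetical) native decryption routine and log its result.
|     SCRIPT = """
|     Interceptor.attach(Module.getExportByName("libmodel.so", "decrypt_model"), {
|         onLeave(retval) {
|             send("decrypt_model returned: " + retval);
|         }
|     });
|     """
|
|     session = frida.get_usb_device().attach("com.example.app")
|     script = session.create_script(SCRIPT)
|     script.on("message", lambda msg, data: print(msg))
|     script.load()
|     input("Hooks installed; press Enter to detach.")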
| frogsRnice wrote:
| frida is an amazing tool - it has empowered me to do things
| that would have otherwise taken weeks or even months. This
| video is a little old, but the creator is also cracked:
| https://www.youtube.com/watch?v=CLpW1tZCblo
|
| It's supposed to be "free-IDA" and the work put in by the
| developers and maintainers is truly phenomenal.
|
| EDIT: This isn't really an attack imo. If you are going to
| take "secrets" and shove them into a mobile app, they can't
| really be considered secret. I suppose it's a tradeoff - if
| you want to do this kind of thing client-side, the secret
| sauce isn't so secret.
| Polizeiposaune wrote:
| You wouldn't train a LLM on a corpus containing copyrighted works
| without ensuring you had the necessary rights to the works, would
| you?
| sharkest wrote:
| Would.
| deadbabe wrote:
| Fair use.
| griomnib wrote:
| Easy to claim, harder to justify once you start charging
| money for your subsequent creation.
|
| Unless all LLMs are a ruthless parody of human intelligence,
| which they may be, the legal issues will continue.
| dijksterhuis wrote:
| *only available in the USA, terms and conditions apply.
|
| most other places use fair dealing which is more restrictive
| https://en.m.wikipedia.org/wiki/Fair_dealing
| bayindirh wrote:
| The moment you earn money from it, it's not fair use
| anymore. When I last checked, unlimited access to said
| models was not free, plus it's not "research" anymore.
|
| - Addenda -
|
| For the interested parties, the law states the following [0].
|
| Notwithstanding the provisions of 17 U.S.C. § 106 and 17
| U.S.C. § 106A, the fair use of a copyrighted work,
| including such use by reproduction in copies or
| phonorecords or by any other means specified by that
| section, for purposes such as criticism, comment, news
| reporting, teaching (including multiple copies for
| classroom use), scholarship, or research, is not an
| infringement of copyright. In determining whether the use
| made of a work in any particular case is a fair use the
| factors to be considered shall include:
|
| 1. the purpose and character of the use, including whether
| such use is of a commercial nature or is for nonprofit
| educational purposes;
|
| 2. the nature of the copyrighted work;
|
| 3. the amount and substantiality of the portion used in
| relation to the copyrighted work as a whole; and
|
| 4. the effect of the use upon the potential market for or
| value of the copyrighted work.
|
| The fact that a work is unpublished shall not itself bar a
| finding of fair use if such finding is made upon
| consideration of all the above factors.
|
| So, if you say that these factors can be flexed depending
| on the defendant, and can just be waved away to protect the
| wealthy, then it becomes _something else_, but given these
| factors, and how damaging this "fair use" is, I can
| certainly say that training AI models on a copyrighted
| corpus is not fair use in any way.
|
| Of course, at the end of the day, IANAL & IANAJ. However,
| my moral compass directly bars the use of a copyrighted
| corpus in publicly accessible, for-profit models which
| deprive many people of their livelihoods.
|
| From my perspective, people can whitewash AI training as they
| see fit to sleep sound at night, but this doesn't change
| anything from my PoV.
|
| [0]:
| https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors
| xvector wrote:
| You can absolutely monetize works altered under fair use.
| bayindirh wrote:
| Any examples sans current AI models? I have not seen any,
| or have failed to find any, to be precise.
| xvector wrote:
| Basically any YouTube video that shows another YouTube
| video, song, movie, etc. as part of something else (eg a
| voiceover.)
| o11c wrote:
| "Making money" does not immediately invalidate fair use,
| but it does wave a big red flag in the courts' faces.
| bayindirh wrote:
| So you're saying that every law is a suggestion depending
| on who's being tried?
| o11c wrote:
| Er, what? I'm speaking directly from the law, 17 U.S.C.
| § 107. It's deliberately written in terms of "factors to
| consider", rather than absolutes.
|
| > In determining whether the use made of a work in any
| particular case is a fair use the factors to be
| considered shall include:
|
| > * the purpose and character of the use, including
| whether such use is of a commercial nature or is for
| nonprofit educational purposes;
|
| > * the nature of the copyrighted work;
|
| > * the amount and substantiality of the portion used in
| relation to the copyrighted work as a whole; and
|
| > * the effect of the use upon the potential market for
| or value of the copyrighted work.
| FloorEgg wrote:
| I really don't think it's that simple. I can read books and
| then earn money from applying what I learned in them. I can
| also study art and then make original art in the same or
| similar styles. If a person was doing this there would be
| no one claiming copyright infringement. The only difference
| is it's a machine doing it and not a person.
|
| The nature of copyright and plagiarism boils down to
| paraphrasing, and so long as LLMs sufficiently paraphrase
| the content, whether that's copyright infringement is an
| open question requiring new law/precedent.
|
| So the fact they are earning money is a red herring unless
| they are reproducing the exact same content without
| paraphrasing (with an exception for commentary). E.g. they
| can quote part of a work while commenting on it.
|
| Where they have gotten into trouble with e.g. NYT afaik is
| when the LLM reproduced a whole article word for word. I
| think they have all tried hard to prevent the LLM from ever
| doing that to avoid that legal risk.
| bayindirh wrote:
| > I can read books and then earn money from applying what
| I learned in them.
|
| How many books can you read, understand, and memorize in a
| given time T, and how many books can an AI ingest in that
| same time?
|
| If we're down to paraphrasing, watch this video [1], and
| think again.
|
| Many models, given that you ask the correct questions,
| reproduce their training set with great accuracy, and
| this is only prevented with monkey patching, IIUC.
|
| So, it's still a big mess, even if we don't add
| copyrighted corpus to the mix. Oh, BTW, datasets like
| "The Stack" are not clean as they claim. I have seen at
| least two non-permissively licensed code repositories
| inside that dataset.
|
| [1]: https://youtu.be/LrkAORPiaEA
| yieldcrv wrote:
| and therefore everyone has the necessary rights to read
| works, the necessary rights to critique the works
| (including for commercial purposes), and the necessary
| rights to derivative works (including for commercial
| purposes)
| tomjen3 wrote:
| You wouldn't read a book and teach others its lessons without a
| derived license, would you?
| dijksterhuis wrote:
| copyright refers to the act of copying the material at hand
| (including distribution, reproduction, performance) etc.
|
| as an example: saying "i really like james holden's
| inheritors album for the rough and dissonant sounds" isn't
| covered by copyright.
|
| if i reproduced it verbatim using my mouth, or created a
| derived work which is noticeably similar to the original,
| that's a different question though.
|
| in your example, a derivative work example could be akin to
| only quoting from the book for the audience and modifying a
| word of each quote.
|
| "derived" works are always a grey area, especially around
| generative machine learning right now.
| ben_w wrote:
| When I was at school, we were sometimes all sat down in front
| of a TV to watch some movie on VHS tape (it was the 90s).
|
| At the start of the tape, there was a copyright notice
| forbidding the VHS tape from being played at, amongst other
| places, schools.
|
| Copyright rules are a strange thing.
| philwelch wrote:
| You're applying a double standard to LLMs and human creators.
| Any human writer or artist or filmmaker or musician will be
| influenced by other people's works, even while those works are
| still under copyright.
| cmiles74 wrote:
| I don't see how this is a double standard. A person
| interacting with their culture is not comparable in any way
| to training an LLM. IMHO, it's kind of a wacky argument to
| make.
| crazygringo wrote:
| Can you elaborate on how it's not comparable? It seems
| obvious to me that it is -- they both learn and then create
| -- so what's the difference?
|
| If I can hire an employee who draws on knowledge they
| learned from copyrighted textbooks, why can't I hire an AI
| which draws on knowledge it learned from copyrighted
| textbooks? What makes that argument "wacky" in your eyes?
| tekno45 wrote:
| you're asking why you have to treat people differently
| than you treat tools and machines.
| crazygringo wrote:
| Well obviously not in general. But when it comes to
| copyright law specifically, yes absolutely. That is the
| question I'm asking.
| salawat wrote:
| You're not going to get an answer you find agreeable,
| because you're hoping for an answer that allows you to
| continue to treat the tool as chattel, without conferring
| on it the excess baggage of being an individuated
| entity/laborer.
|
| You're either going to get: it's a technological,
| infinitely scalable process, and the training data should
| be considered what it is, which is intellectual property
| that should be licensed before being used.
|
| ...or... It actually is the same as human learning, and
| it's time we started loading these things up with other
| baggage to be attached to persons if we're going to
| accept it's possible for a machine to learn like a human.
|
| There isn't a reasonable middle ground due to the
| magnitude of social disruption a chattel quasi-human
| technological human replacement would cause.
| philwelch wrote:
| No, you're missing the point of copyright. The point of
| copyright is to protect an exclusive right to copy, not
| the right to produce original works influenced by
| previous works. If an LLM produces original works that
| are influenced by the training data, that is not a
| violation of copyright. If it reproduces the training
| data verbatim, it is.
| dijksterhuis wrote:
| i weirdly agree with you, but also want to point out that
| "influenced by the training data" is doing some very
| heavy lifting there.
|
| exactly how the new work is created is important when it
| comes to derivative works.
|
| does it use a copy of the original work to create it, or
| a vague idea/memory of the original work's composition?
|
| when i make music it's usually vague memories. i'd argue
| that LLMs have an encoded representation of the original
| work in their weights (along with all the other stuff).
|
| but that's the legal grey area bit. is the "mush" of
| model weights an encoded representation of works, or
| vague memories?
| philwelch wrote:
| I don't really think it matters because you can just
| compare the output to the input and apply the same
| standard, treating the process between the two as a black
| box.
| aidenn0 wrote:
| Aren't animals a current example of a middle ground? They
| are incapable of authoring copyrightable works under
| current US law.
| cmiles74 wrote:
| It has never been argued that copyright law should apply
| to information that people learn, whether that be from
| reading books or newspapers, watching television, or
| appreciating art like paintings or photographs.
|
| Unlike a person, a large language model is a product built
| and sold by a company. While I am not a lawyer, I believe
| much of the copyright argument around LLM training revolves
| around the idea that copyrighted content should be licensed
| by the company training the LLM. In much the same way that
| people are not allowed to scrape the content of the New
| York Times website and then pass it off as their own
| content, so should OpenAI be barred from scraping the New
| York Times website to train ChatGPT and then selling the
| service without providing some dollars back to the New York
| Times.
| _DeadFred_ wrote:
| One is a collection of highly dithered data generated by
| machines, paid for by a business in order to gain
| financially from the copyrighted works and to replace any
| future need for copyrighted textbooks.
|
| The other is a person learning from a copyrighted textbook
| in the legally protected manner - the person the textbook
| was written for.
| groby_b wrote:
| Unless you are making an argument for personhood, one is
| a machine, the other is a human. Different standards
| apply, end of discussion.
| homarp wrote:
| most probably your employee actually 'paid' for their
| textbook.
| cmiles74 wrote:
| I don't think this question really makes any sense... In
| my opinion, it's kind of mish-mashing several things
| together.
|
| "Can you elaborate on how it's not comparable?"
|
| The process of individual people interacting with their
| culture is a vastly different process than that used to
| train large language models. In what ways do you think
| these processes have anything in common?
|
| "It seems obvious to me that it is -- they both learn and
| then create -- so what's the difference?"
|
| This doesn't seem obvious to me (obviously)! Maybe you
| can argue that an LLM "learns" during training, but that
| ceases once training is complete. For sure, there are
| work-arounds that meet certain goals (RAG, fine-tuning);
| maybe your already vague definition of "learning" could
| be stretched to include these? Still, comparing this to
| how people learn is pretty far-fetched. AFAICT, there's
| no literature supporting the view that there's any
| commonality here; if you have some I would be very
| interested to read it. :-)
|
| Do they both create? I suspect not; an LLM is parroting
| back data from its training set. We've seen many studies
| showing that tested LLMs perform poorly on novel problem
| sets. This article was posted just this week:
|
| https://news.ycombinator.com/item?id=42565606
|
| The jury is still out on the copyright issue; from the
| perspective of US law we'll have to wait and see. Still,
| it's clear that an LLM can't "create" in any meaningful
| way.
|
| And so on and so forth. How is hiring an employee at all
| similar to subscribing to an OpenAI ChatGPT plan? Wacky
| indeed!
| rob_c wrote:
| That's a little simplistic. You're almost trying to say
| black and white and grey can't be compared, which is a bit
| weird.
|
| Strangely like the situation itself.
|
| The question really comes down to: how can we guarantee a
| model is influenced by an input rather than memorising it?
|
| And then, is a human who is influenced simply relying on a
| faulty or less-than-perfect memory?
| dijksterhuis wrote:
| as a human being, and one that does music stuff, i don't
| download terabytes of other peoples works from the internet
| directly into my brain. i don't have verbatim reproductions
| of people's work sitting around on a hard disk in my
| stomach/lungs/head/feet.
|
| LLMs are not humans. They're essentially a probabilistic
| compression algorithm (encode data into model weights/decode
| with prompt to retrieve data).
| philwelch wrote:
| Do you ever listen to music? Is your music ever influenced
| by the music that you listen to? How do you imagine that
| works, in an information-theoretical sense, that
| fundamentally differs from an LLM?
|
| Depending on how much music you've listened to, you very
| well may have "downloaded terabytes" of it into your brain.
| Your argument is specious.
| dijksterhuis wrote:
| i do listen to music.
|
| i listen to it on apple music.
|
| i pay money to apple for this.
|
| some of that money that i pay to apple goes to the rights
| holders of that music for the copying and performance of
| their work through my speakers.
|
| that's a pretty big difference to how most LLMs are
| trained right there! i actually pay original creators
| some money.
|
| -
|
| i am a human being. you cannot reduce me down to some
| easy information theory.
|
| an LLM is a tool. an algorithm. with the same random seed
| etc etc it will get the same results. it is not human.
|
| you put me in the same room as yesterday i'll behave
| completely differently.
|
| -
|
| i have listened to way more than terabytes of music in my
| life. doesn't mean i have the ability to regurgitate any
| of it verbatim though. i'm crap at that stuff.
|
| LLMs seem to be really good at it though.
| cmiles74 wrote:
| Information on how large language models are trained is
| not hard to come by; there are numerous articles that
| cover this material.
| material will make it clear that the training of large
| language models is materially different in almost every
| way from how human beings "learn" and build knowledge.
| There are still many open questions around the process of
| how humans collect, store, retrieve and synthesize
| information.
|
| There is little mystery to how large language models
| function, and it's clear that their output is parroting
| back portions of their training data; the quality of
| output degrades greatly when novel input is provided.
| your argument that people fundamentally function in the
| same way? That would be a bold and novel assertion!
| _DeadFred_ wrote:
| Human creators don't store that 'influence' in a digital,
| machine-accessible format generated directly from the
| copyrighted content though.
|
| Although with the 'good news everyone, we built the torment
| nexus' trajectory of AI my guess is at this point AI
| companies would just incorporate actual human brains instead
| of digital storage if that was the requirement.
| galangalalgol wrote:
| Does that imply that if we invent brain upload technology,
| my weights carry every conflicting license and patent
| for everything I can quote or create? I don't like that
| precedent. I have complete rights over my noggin's
| contents. If I do quote a NYT article in its entirety,
| _that_ would be infringement, but not copying my brain
| itself.
| philwelch wrote:
| Your argument boils down to "we don't know how brains
| work", and it is a non-sequitur. It isn't a violation of
| copyright law to create original works under the creative
| influence of works still under copyright.
| Workaccount2 wrote:
| LLMs are not massive archives of data. They are a tiny fraction
| of a fraction of a percent of the size of their training set.
|
| And before you knee-jerk "it's a compression algo!", I invite
| you to archive all your data with an LLM's "compression algo".
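|
| Rough public numbers for one open model make the point (a
| back-of-the-envelope sketch; the byte counts are ballpark
| assumptions, not exact figures):
|
|   # Llama-2-7B: ~7e9 params at 2 bytes each (fp16), trained
|   # on a reported ~2e12 tokens at roughly 4 bytes of raw
|   # text per token.
|   model_bytes = 7e9 * 2    # ~14 GB of weights
|   corpus_bytes = 2e12 * 4  # ~8 TB of training text
|   print(f"{model_bytes / corpus_bytes:.2%}")  # ~0.18%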
| int_19h wrote:
| It doesn't matter. It's still a derived work.
| baxtr wrote:
| Well what isn't in this world?
|
| Would Einstein have been possible without Newton?
| thedailymail wrote:
| Newton was public domain by Einstein's time.
| jampekka wrote:
| Indeed. Copyright was introduced in 1710; Principia was
| published in 1687.
| yieldcrv wrote:
| and even with our current copyright laws providing for
| long-dated protection, it would still have been in the
| public domain
| jampekka wrote:
| It's hard to say what the current laws actually imply.
| Steamboat Willie was originally meant to be in the public
| domain in 1955. Got there in 2024.
| BobbyTables2 wrote:
| Copying a single sentence verbatim from a 1000 page book is
| still plagiarism.
|
| And is technically copyright infringement outside fair use
| exceptions.
| concerndc1tizen wrote:
| And similarly, translating those sentences into data points
| is still a derivative work, like transcribing music and
| then making a new recording is still derivative.
| jpollock wrote:
| derivative works still tend to be copyright violations.
| concerndc1tizen wrote:
| Yes, that's what I'm saying. An LLM washing machine
| doesn't get rid of the copyright.
| timewizard wrote:
| > LLMs are not massive archives of data.
|
| Neither am I, yet I am still capable of reproducing
| copyrighted works to a level that most would describe as
| illegal.
|
| > And before you knee-jerk "it's a compression algo!"
|
| It's literally a fundamental part of the technology so I
| can't see how you call it a "knee jerk." It's lossy
| compression, the same way a JPEG might be, and simply
| recompressing your picture to a lower resolution does not at
| all obviate your copyright.
|
| > I invite you to archive all your data with an LLM's
| "compression algo".
|
| As long as we agree it is _my data_ and not yours.
| Isamu wrote:
| > It's lossy compression, the same way a JPEG might be
|
| Compression yes, but this is co-mingling as well. The
| entire corpus is compressed together, which identifies
| common patterns, and in the model they are essentially now
| overlapping.
|
| The original document is represented statistically in the
| final model, but you've lost the ability to extract it
| closely. Instead you gain the ability to generate something
| statistically similar to a large number of original
| documents that are related or are structurally similar.
|
| I'm just commenting, not disputing any argument about fair
| use.
| ipsum2 wrote:
| This is cool, but it's only the first part of extracting
| an ML model for usage. The second part is reverse
| engineering the tokenizer and input transformations that
| are needed before passing the data to the model, and
| converting the output to a human-readable format.
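|
| For the tflite case in the article, the interpreter will at
| least tell you the expected shapes, dtypes, and quantization
| parameters (a minimal sketch; "model.tflite" stands in for
| an extracted file):
|
|   import tensorflow as tf
|
|   interp = tf.lite.Interpreter(model_path="model.tflite")
|   interp.allocate_tensors()
|   # shape/dtype/quantization of each input and output --
|   # the same information Netron shows graphically
|   print(interp.get_input_details())
|   print(interp.get_output_details())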
| refulgentis wrote:
| This is a good comment, but only in the sense that it
| documents that a model file doesn't run the model by
| itself.
|
| An analogous situation is seeing a blog that purports to
| "show you code", where the code returns an object, and
| commenting "This is cool, but doesn't show you how to turn
| a function return value into a human-readable format".
| More noise than signal.
|
| The techniques in the article are trivially understood to also
| apply to discovering the input tokenization format, and Netron
| shows you the types of inputs and outputs.
|
| Thanks for the article OP, really fascinating.
| ipsum2 wrote:
| Just having the shape of the input and output is not
| sufficient; the image (in this example) needs to be
| normalized. It's presumably not difficult to find the exact
| numbers, but it is a source of errors when reverse
| engineering an ML model.
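|
| The usual suspects are worth trying first (a hedged guess
| list of common conventions, not the app's actual constants):
|
|   import numpy as np
|
|   def candidate_inputs(img_uint8):
|       f = img_uint8.astype(np.float32)
|       yield f / 255.0        # scale to [0, 1]
|       yield f / 127.5 - 1.0  # scale to [-1, 1]
|       # ImageNet mean/std, a common default in vision models
|       mean = np.array([0.485, 0.456, 0.406], np.float32)
|       std = np.array([0.229, 0.224, 0.225], np.float32)
|       yield (f / 255.0 - mean) / std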
| rob_c wrote:
| If you can't fix this with a little help from ChatGPT or
| Google, you frankly shouldn't be building the models, let
| alone mucking with other people's...
| mentalgear wrote:
| Would be interesting if someone could detail the approach
| to decoding the pre- and post-processing steps around the
| model, and how to find the correct input encoding.
| 1vuio0pswjnm7 wrote:
| "Keep in mind that AI models, like most things, are considered
| intellectual property. Before using or modifying any extracted
| models, you need the explicit permission of their owner."
|
| If the weights and biases contained in "AI models" are
| proprietary, then for one model owner to detect
| infringement by another model owner, it may be necessary
| to download and extract them.
| kittikitti wrote:
| This was a great article and I really appreciate it!
| Fragoel2 wrote:
| There's an interesting research paper from a few years ago that
| extracted models from Android apps on a large scale:
| https://impillar.github.io/files/ccs2022advdroid.pdf
| janalsncm wrote:
| I'm a huge fan of ML on device. It's a big improvement in privacy
| for the user. That said, there's always a chance for the user to
| extract your model, so on-device models will need to be fairly
| generic.
| Ekaros wrote:
| Can you launder an AI model by feeding it to some other
| model or training process? After all, that is how it was
| originally created. So it cannot be any less legal...
| benreesman wrote:
| There is a family of techniques, often called something like
| "distillation". There are also various synthetic training data
| strategies, it's a very active area of research.
|
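| The core trick is simple: the student is trained to match
| the teacher's output distribution, never seeing its weights.
| A minimal sketch with toy linear "models" (temperature T per
| Hinton et al. 2015; real distillation uses full networks):
|
|   import torch
|   import torch.nn.functional as F
|
|   teacher = torch.nn.Linear(16, 100)  # stands in for the big model
|   student = torch.nn.Linear(16, 100)  # model being trained
|   opt = torch.optim.Adam(student.parameters(), lr=1e-3)
|   T = 2.0  # temperature softens the teacher's distribution
|
|   for _ in range(100):
|       x = torch.randn(32, 16)  # stand-in inputs
|       with torch.no_grad():
|           t_logits = teacher(x)
|       # KL divergence between the softened distributions
|       loss = F.kl_div(
|           F.log_softmax(student(x) / T, dim=-1),
|           F.softmax(t_logits / T, dim=-1),
|           reduction="batchmean",
|       ) * (T * T)
|       opt.zero_grad(); loss.backward(); opt.step()
|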
| As for the copyright treatment? As far as I know it's a bit up
| in the air at the moment. I suspect that the major frontier
| vendors would mostly contend that training data is fair use but
| weights are copyrighted. But that's because they're bad people.
| qup wrote:
| The weights are my training data. I scraped them from the
| internet
| amolgupta wrote:
| For app developers considering tflite, a safer way would be
| to host the models on Firebase and delete them when their
| job is done. It also comes with features like versioning
| for model updates, A/B tests, lower APK size, etc.
| https://firebase.google.com/docs/ml/manage-hosted-models
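|
| From the linked docs, the hosting side is managed with the
| firebase-admin SDK; roughly (a sketch from memory of those
| docs -- bucket and file names are placeholders, check the
| docs for exact usage):
|
|   import firebase_admin
|   from firebase_admin import ml
|
|   firebase_admin.initialize_app(
|       options={"storageBucket": "your-project.appspot.com"})
|   # upload the .tflite file and register it as a hosted model
|   src = ml.TFLiteGCSModelSource.from_tflite_model_file(
|       "model.tflite")
|   model = ml.Model(display_name="my_model",
|                    model_format=ml.TFLiteFormat(model_source=src))
|   created = ml.create_model(model)
|   ml.publish_model(created.model_id)  # make it downloadable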
___________________________________________________________________
(page generated 2025-01-05 23:00 UTC)