[HN Gopher] Extracting AI models from mobile apps
       ___________________________________________________________________
        
       Extracting AI models from mobile apps
        
       Author : smoser
       Score  : 222 points
       Date   : 2025-01-05 13:19 UTC (9 hours ago)
        
 (HTM) web link (altayakkus.substack.com)
 (TXT) w3m dump (altayakkus.substack.com)
        
       | do_not_redeem wrote:
       | Can anyone explain that resize_to_320.tflite file? Surely they
       | aren't using an AI model to resize images? Right?
        
         | smitop wrote:
         | tflite files can contain a ResizeOp that resizes the image:
         | https://ai.google.dev/edge/api/tflite/java/org/tensorflow/li...
         | 
          | The file is only 7.7 kB, so it couldn't contain many weights
          | anyway.
        
           | raydiak wrote:
           | Exactly. Put another way, tensorflow is not an AI. You can
           | build an AI in tensorflow. You can also resize images in
           | tensorflow (using the traditional algorithms, not AI). I am
           | not an expert, but as I understand, it is common for vision
           | models to require a fixed resolution input, and it is common
           | for that resolution to be quite low due to resource
           | constraints.
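            | 
            | As a rough sketch of how such a file could be produced
            | (assuming TF/Keras; I'm only guessing that this is how the
            | file in the article was made):
            | 
            |     import tensorflow as tf
            |     
            |     # A "model" that only resizes: no trained weights.
            |     resize = tf.keras.Sequential([
            |         tf.keras.Input(shape=(480, 640, 3)),
            |         tf.keras.layers.Resizing(320, 320),
            |     ])
            |     
            |     converter = tf.lite.TFLiteConverter.from_keras_model(resize)
            |     tflite_bytes = converter.convert()
            |     open("resize_to_320.tflite", "wb").write(tflite_bytes)
            | 
            | The resulting file is tiny: graph structure only, no
            | weights, consistent with the 7.7 kB noted above.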
        
         | koe123 wrote:
          | Probably not what you're alluding to, but AI upscaling of
          | images is definitely a thing.
        
       | JTyQZSnP3cQGa8B wrote:
       | > Keep in mind that AI models [...] are considered intellectual
       | property
       | 
       | Is it ironic or missing a /s? I can't really tell here.
        
         | Freak_NL wrote:
          | Standard disclaimer. Like inserting a bunch of
          | 'hypothetically' into a comment telling someone where to
          | find some piece of abandoned media, when using an
          | unsanctioned channel would entail infringing upon someone's
          | intellectual property.
        
         | SunlitCat wrote:
          | To be honest, that was my first thought on reading that
          | headline as well. Given the huge amount of backlash those
          | large companies got (though who knows how smaller ones got
          | their training data) for their unprecedented collection of
          | data all over the web, and not just there but everywhere
          | else, it's kinda ironic to talk about intellectual property.
          | 
          | If you use one of those AI models as a basis for your own AI
          | model, the real danger could be that the owners of the
          | originating data come after you at some point as well.
        
           | ToucanLoucan wrote:
           | Standard corporate hypocrisy. "Rules for thee, not for me."
           | 
           | If you actually expected anything to be open about OpenAI's
           | products, please get in touch, I have an incredible business
           | opportunity for you in the form of a bridge in New York.
        
         | npteljes wrote:
          | I think it's both. It's:
          | 
          | 1. the current, unproven-in-court legal understanding,
          | 2. a standard disclaimer to cover OP's ass, and
          | 3. a tongue-in-cheek reference to the prevalent argument
          |    that training AI on data, and then offering it via AI,
          |    is being a parasite on that original data.
        
       | wat10000 wrote:
       | " Keep in mind that AI models, like most things, are considered
       | intellectual property. Before using or modifying any extracted
       | models, you need the explicit permission of their owner."
       | 
       | That's not true, is it? It would be a copyright violation to
       | distribute an extracted model, but you can do what you want with
       | it yourself.
        
         | wslh wrote:
         | It's also worth noting that there is still no legal clarity on
         | these issues, even if a license claims to provide specific
         | permissions.
         | 
         | Additionally, the debate around the sources companies use to
         | train their models remains unresolved, raising ethical and
         | legal questions about data ownership and consent.
        
         | jdietrich wrote:
         | Circumventing a copy-prevention system without a valid
         | exemption is a crime, even if you don't make unlawful copies.
         | Copyright covers the right to make copies, not the right to
         | distribute; "doing what you want with it yourself" may or may
         | not be covered by fair use. Whether or not model weights are
         | copyrightable remains an open question.
         | 
         | https://www.law.cornell.edu/uscode/text/17/1201
        
           | mcny wrote:
           | >Circumventing a copy-prevention system without a valid
           | exemption is a crime, even if you don't make unlawful copies.
           | Copyright covers the right to make copies, not the right to
           | distribute; "doing what you want with it yourself" may or may
           | not be covered by fair use. Whether or not model weights are
           | copyrightable remains an open question.
           | 
           | If that is the law, it is a defect that we need to fix. Laws
           | do not come down from heaven in the form of commandments. We,
           | humans, write laws. If there is a defect in the laws, we
           | should fix it.
           | 
            | If this is the law, time shifting and format shifting are
            | unlawful as well, which to me is unacceptable.
            | 
            | Disclaimer: As usual, IANAL.
        
             | dialup_sounds wrote:
             | Time shifting is protected by 40 years of judicial
             | precedent establishing it as fair use.
        
               | nadermx wrote:
               | This is being tested in the courts currently,
               | https://torrentfreak.com/appeals-court-hears-riaa-and-
               | yout-i...
        
               | kmeisthax wrote:
               | DMCA 1201 is written so broadly that _any_ feature of a
               | product or service can be construed to prevent copying,
               | and thus gain 1201 protection.
               | 
               | I don't think YouTube intended regular uploads to have
               | DRM, if only because they support Creative Commons
               | metadata on uploads, and Creative Commons specifically
               | forbids the use of technical protection measures on CC-
               | licensed content[0]. On a less moralistic note, applying
               | encryption to all YouTube videos would be prohibitively
               | expensive because DRM vendors charge $$$ for the tech.
               | 
               | But the RIAA wants DRM because, well, they don't want
               | people taking what they have rightfully stolen. So
               | YouTube engineered a weak form of URL obfuscation that
               | would only stop very basic scrapers[1]. DMCA 1201 doesn't
               | care about encryption or obfuscation, though. What it
               | does care about is if something was _intended_ to stop
                | copying, and if so, if the defendant's product was
               | designed to defeat that thing.
               | 
               | There's an interesting wrinkle in DMCA 1201 in that
               | merely being able to defeat DRM does not make something
               | illegal. Defeating DRM has to be the tool's _only
                | function_[2], or you have to advertise the tool as being
               | able to defeat DRM[3], in order to actually violate DMCA
               | 1201. DRM vendors _usually_ resort to encryption, because
               | it makes the circumvention tools specialized enough that
               | they have no other purpose and thus fall afoul of DMCA
                | 1201. But there's nothing stopping you from using really
               | basic schemes (ROT-13 your DVDs!) and still getting to
               | sue for 1201.
               | 
               | Going back to the AI ripping question, this blog post is
               | probably not in and of itself a circumvention tool[4],
               | but anyone implementing it is very much making
               | circumvention tools, which are illegal to distribute.
               | Circumvention itself is also illegal, but only when
               | there's an underlying copyright infringement. i.e. you
               | can't just encrypt something that's public domain or
               | uncopyrightable and sue anyone who decrypts it.
               | 
               | So the next question is: is AI copyrightable? And can you
               | sue for 1201 circumvention for something that is
               | fundamentally composed of someone else's copyrighted work
               | that you don't own and haven't licensed?
               | 
               | [0] Additionally, there is a very large repository of CC-
               | BY music from Kevin MacLeod that is used all over YouTube
               | that would have to be removed or relicensed if the RIAA
               | were to prevail on this case.
               | 
               | I have no idea if Kevin actually intends to enforce the
               | no-DRM clause in this way, though. Kevin actually has a
               | fairly loose interpretation of CC-BY. For example,
               | _nobody_ attributes his music correctly, either the way
                | the license requires, or with Kevin's (legally
               | insufficient) recommended attribution strings. He does
               | sell commercial (non-attribution) licenses but I've yet
               | to hear of any enforcement actions from him.
               | 
               | [1] To be clear, without DRM encryption, _any_ video can
               | be ripped by hooking standard HTML5 video APIs using an
               | extension.
               | 
               | [2] Things with "limited commercial purposes" beyond
               | breaking DRM may also be construed as circumvention tools
               | under DMCA 1201.
               | 
               | [3] My favorite example: someone tried selling a VGA-to-
               | composite adapter as a way to copy movies off Netflix.
               | That is illegal under DMCA 1201.
               | 
               | [4] To be clear, this is NOT settled law, this is "get
               | sued and find out if the Supreme Court likes you that
               | day" law.
        
               | dialup_sounds wrote:
               | Not really. The fair use status of time shifting isn't in
               | question there by either party.
        
           | rpdillon wrote:
           | Your comment confused me, but I'm very interested in what
           | you're getting at.
           | 
           | > Circumventing a copy-prevention system without a valid
           | exemption is a crime, even if you don't make unlawful copies.
           | 
           | Yep, this is the DMCA section 1201. Late '90s law in the US.
           | 
           | > Copyright covers the right to make copies, not the right to
           | distribute
           | 
           | This is where I got confused. Copyright covers four rights:
           | copying, distribution, creation of derivative works, and
           | public performance. So I'm not sure what you were getting at
           | with the copy/distribute dichotomy.
           | 
           | But here's a question I'm curious about: Can DMCA apply to a
           | copy-protection mechanism that's being applied to non-
           | copyrightable work? Based on my reading of
           | https://www.copyright.gov/dmca/:
           | 
           | > First, it prohibits circumventing technological protection
           | measures (or TPMs) used by copyright owners to control access
           | to their works.
           | 
            | That's an overview rather than the letter of the law, but
            | it does seem to suggest you can't bring a DMCA 1201 claim
            | against someone circumventing copy-protection for
            | uncopyrightable works.
           | 
           | > Whether or not model weights are copyrightable remains an
           | open question.
           | 
           | And this is where the interaction with the wording of 1201
           | gets interesting, in my (non-professional) opinion!
        
             | lcnPylGDnU4H9OF wrote:
             | Here is the relevant text in the law:
             | 
             | > No person shall circumvent a technological measure that
             | effectively controls access to a work protected under this
             | title.
             | 
             | The inclusion of "work protected under this title" makes it
             | clear in the law, though I doubt a judge would rule
             | otherwise without that line. (Otherwise, I'd wonder if I
             | could claim damages that Google et al. are violating the
             | technological measures I've put in place to protect the
             | specificity of my interests, because it wouldn't matter
             | that such is not protected by copyright law.)
             | 
             | Also not an attorney, for what it's worth.
        
               | roywiggins wrote:
               | It seems clear from this definition especially:
               | 
               | > (A) to "circumvent a technological measure" means to
               | descramble a scrambled work, to decrypt an encrypted
               | work, or otherwise to avoid, bypass, remove, deactivate,
               | or impair a technological measure, _without the authority
               | of the copyright owner_
               | 
               | In this case there _is_ no copyright owner.
        
               | lcnPylGDnU4H9OF wrote:
               | Right, that's what I was getting at with my
               | parenthetical. Obviously the work has to have an owned
               | copyright in order to be protected by copyright law.
        
               | roywiggins wrote:
               | sorry, yes, reread your comment and dirty-edited mine
        
               | rusk wrote:
                | This is interesting. I wonder if you could use it as a
                | basis for "legally" circumventing a technology by
                | applying it to non-copyrighted works.
        
               | lcnPylGDnU4H9OF wrote:
               | If you mean that you might be able to decrypt a
               | copyrighted work because you used that same encryption
               | method on a non-copyrighted work, then definitely not.
               | The work under protection will be considered. (Otherwise,
               | I am unsure what you meant.)
        
               | rusk wrote:
                | From what I recall, it was the actual protection method
                | that was protected by the DMCA - when DVD protection
                | was cracked it was forbidden to distribute a particular
                | section of code, so they just printed it on a T-shirt
                | to troll the powers that be.
        
               | jazzyjackson wrote:
               | Better yet just print the colors that represent the
               | number, see
               | https://en.m.wikipedia.org/wiki/Illegal_number
               | 
               | But then again, knowing the number is a far cry from
               | using that number to circumvent DRM
        
               | lcnPylGDnU4H9OF wrote:
                | Presuming you are referring to this:
                | https://en.wikipedia.org/wiki/AACS_encryption_key_controvers...
               | 
               | > Outside the Internet and the mass media, the key has
               | appeared in or on T-shirts, poetry, songs and music
               | videos, illustrations and other graphic artworks, tattoos
               | and body art, and comic strips.
               | 
               | Using the encryption key to decrypt the data on a DVD is
               | illegal "circumvention" per DMCA 1201, if it's done
               | without authorization from the copyright owner of the
               | data on the DVD. If it were really illegal to simply
               | publish the key on a website, then printing it on
               | clothing that they sold instead would not be a viable
               | loophole.
               | 
                | I'm glad it is still referred to as a controversy: they
                | were issuing cease and desist letters for publishing
                | information, when the actual crime they had in mind,
                | which was not alleged in the letters, was using the
                | information to decrypt a DVD.
        
           | nadermx wrote:
           | Actually, in terms of copyright control "The Federal Circuit
           | went on to clarify the nature of the DMCA's anti-
           | circumvention provisions. The DMCA established causes of
           | action for liability and did not establish a property right.
           | Therefore, circumvention is not infringement in itself."[1]
           | 
            | https://en.m.wikipedia.org/wiki/Chamberlain_Group,_Inc._v._S...
        
             | bitwize wrote:
             | Circumvention is not infringement, but the DMCA makes it a
             | separate crime punishable by up to 5 years in prison.
        
           | BadHumans wrote:
           | Is using millions of copyrighted works to train your AI a
           | valid exemption? Asking for a few billionaire friends.
        
           | larodi wrote:
            | Just imagine, just for a second, that it becomes illegal
            | to train anything that does not then afterwards produce,
            | if publicly used or distributed, a copyright token which
            | is both in the training set (to mark it) and in the output
            | (to recognize it).
            | 
            | So this is where it all goes in several years, if I were
            | the gov.
        
         | Lerc wrote:
          | I'm not sure if even the first part is true. Has it been
          | determined that AI models are intellectual property?
          | Machine-generated content may not be copyrightable, and it
          | isn't just the output of generative AI that falls under
          | this; the models themselves are machine-generated.
         | 
         | Can you copyright a set of coefficients for a formula? In the
         | sense of a JPEG it would be considered that the image being
         | reproduced is the thing that has the copyright. Being the first
         | to run the calculations that produces a compressed version of
         | that data should not grant you any special rights to that
         | compressed form.
         | 
          | An AI model is just a form of that writ large. When the
          | models generalize and create new content, it is hard to see
          | how either the output or the model that generated it could
          | be considered someone's property.
         | 
         | People possess models, I'm not sure if they own them.
         | 
         | There are however billions of dollars at play here and enough
         | money can buy you whichever legal opinion you want.
        
           | echelon wrote:
           | > AI models are intellectual property
           | 
           | If companies train on data they don't own and expect to own
           | their model weights, that's hypocritical.
           | 
           | Model weights shouldn't be copyrightable if the training data
           | was pilfered.
           | 
           | But this hasn't been tested because models are locked away in
           | data centers as trade secrets. There's no opportunity to
           | observe or copy them outside of using their outputs as
           | synthetic data.
           | 
           | On that subject, training on model outputs should be fair
           | use, and an area we should use legislation to defend access
           | to (similar to web scraping provisions).
        
             | ANewFormation wrote:
             | Now that you mention it, I'm quite surprised that none of
              | the typical fanatical IP lawsuiters has sued arguing
             | (reasonably I think) that the output of the LLMs is
             | strongly suggestive that they have been trained on
             | copyrighted materials. Get the lawsuit to discovery, and
             | those data centers become fair game.
             | 
             | Perhaps 'strongly suggestive' isn't enough.
        
               | Onawa wrote:
               | Wasn't that the goal of both the New York Times lawsuit
               | and other class action lawsuits from authors?
               | 
               | https://harvardlawreview.org/blog/2024/04/nyt-v-openai-
               | the-t...
               | 
               | https://www.publishersweekly.com/pw/by-topic/industry-
               | news/p...
        
               | cabalamat wrote:
               | > strongly suggestive that they have been trained on
               | copyrighted materials
               | 
               | Given that everything -- including this comment -- is
               | copyrighted unless it is (1) old or (2) deliberately put
               | into the public domain, this is almost certainly true.
        
               | rusk wrote:
               | Isn't this comment in the public domain? I presume that's
               | what I'm doing when I'm posting on a forum. If somebody
               | copied and pasted something I wrote on here could I in
               | theory use copyright law to restrict distribution? I
               | think the law would say I published it on a public forum
               | and thus it is in the public domain.
        
               | taormina wrote:
               | Why would it be in the public domain? Anything you
               | create, under US copyright law, is the opposite of being
                | in the public domain: it's yours. According to the
               | legalese of YC, you are granting YC and YC alone a
               | license to use the UGC you submitted to their website,
               | but if anything, the YC agreement DEMANDS that you own
               | the copyright to the comment you are posting.
               | 
               | > User Content Transmitted Through the Site: With respect
               | to the content or other materials you upload through the
               | Site or share with other users or recipients
               | (collectively, "User Content"), you represent and warrant
               | that you own all right, title and interest in and to such
               | User Content, including, without limitation, all
               | copyrights and rights of publicity contained therein. By
               | uploading any User Content you hereby grant and will
               | grant Y Combinator and its affiliated companies a
               | nonexclusive, worldwide, royalty free, fully paid up,
               | transferable, sublicensable, perpetual, irrevocable
               | license to copy, display, upload, perform, distribute,
               | store, modify and otherwise use your User Content for any
               | Y Combinator-related purpose in any form, medium or
               | technology now known or later developed. However, please
               | review the Privacy Policy located here for more
               | information on how we treat information included in
               | applications submitted to us.
               | 
               | > You acknowledge and agree that any questions, comments,
               | suggestions, ideas, feedback or other information about
               | the Site ("Submissions") provided by you to Y Combinator
               | are non-confidential and Y Combinator will be entitled to
               | the unrestricted use and dissemination of these
               | Submissions for any purpose, without acknowledgment or
               | compensation to you.
        
               | ANewFormation wrote:
               | Another example of this is people putting code, intended
               | to be shared, up on e.g. Github without a licence.
               | 
               | Many people seem to think that no licence = public
               | domain, but it's still under strong copyright protection.
                | This is the point of things like the Unlicense.
        
               | graemep wrote:
               | > If somebody copied and pasted something I wrote on here
               | could I in theory use copyright law to restrict
               | distribution?
               | 
               | Yes you could, unless you agreed to forum terms that said
                | otherwise, fair use aside. It's the same in most
                | jurisdictions.
        
               | dragonwriter wrote:
               | > Now that you mention it, I'm quite surprised that none
                | of the typical fanatical IP lawsuiters has sued arguing
               | (reasonably I think) that the output of the LLMs is
               | strongly suggestive that they have been trained on
               | copyrighted materials. Get the lawsuit to discovery, and
               | those data centers become fair game.
               | 
               | No, there have been lawsuits, and the data centers have
               | not been fair game because whether or not the models were
               | trained on copyright-protected works is not generally in
               | dispute. Discovery only applies to evidence relevant to
               | facts in dispute.
        
             | dragonwriter wrote:
             | > If companies train on data they don't own and expect to
             | own their model weights, that's hypocritical.
             | 
                | It's not hypocritical to follow a line of legal
                | analysis which holds that copying material in the
                | course of training AI on it is outside the scope of
                | copyright protection (as, e.g., fair use in the US),
                | but that the model weights resulting from the training
                | are protected by copyright.
                | 
                | It may be wrong, and it may be convenient for the
                | interests of the firms involved, but it is not self-
                | inconsistent in the way required for it to be hypocrisy.
        
               | Mordisquitos wrote:
                | If the resulting AI models are protected by copyright,
                | that invalidates the claim that training AI models on
                | copyrighted materials is fair use analogous to human
                | beings becoming educated by exposure to copyrighted
                | materials.
               | 
               | Educated human beings are not protected by copyright,
               | hence neither should trained AI models. Conversely, if a
               | copyrightable work is produced based on work which itself
               | is copyrighted, the resulting work needs the consent of
               | the original authors of the prior work.
               | 
               | AI models can't have their (c)ake and eat it.
        
               | drdeca wrote:
               | If I take 1000 books and count the distributions of the
               | lengths of the words, and the covariance between the
               | lengths of one word and the next word for each book, and
               | how much this covariance matrix tends to vary across the
               | different books, and other things like this, and publish
               | these summaries, it seems fairly clear to me that this
               | should count as fair use.
               | 
               | (Such a model/statistical-summary, along with a
               | dictionary, could be used to generate nonsensical texts
               | which have similar patterns in terms of just word
               | lengths.)
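                | 
                | (A toy sketch of that kind of summary, assuming NumPy,
                | just to make the hypothetical concrete:)
                | 
                |     import numpy as np
                |     
                |     def word_length_summary(text):
                |         lengths = [len(w) for w in text.split()]
                |         # Covariance between one word's length and
                |         # the next word's length, for one book.
                |         pairs = np.array([lengths[:-1], lengths[1:]])
                |         return np.mean(lengths), np.cov(pairs)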
               | 
               | Should the resulting work be protected by copyright? I'm
               | not entirely sure...
               | 
               | I guess one thing is, the specific numbers I obtain by
               | doing this are not a consequence of any creative decision
               | making on my part, which I think in some jurisdictions (I
               | don't remember which) plays a role in whether a work is
               | copyrightable (I will use "copyrightable" as an
               | abbreviation for "protected by copyright". I don't mean
               | to imply a requirement that someone specifically
               | registers for copyright.). (Iirc this makes it so phone
               | books are copyrightable in some jurisdictions but not
               | others?)
               | 
               | The particular choice of statistical analysis does seem
               | like it may involve creative decision making, but that
               | would just be about like, what analysis I describe, and
               | how the numbers I publish are to be interpreted, not what
               | the numbers are? (Analogous to the source code of an ML
               | model, not the parameters.)
               | 
               | Here is another question: suppose there is a method of
               | producing a data artifact which would be genuinely (and
               | economically) useful, and which does not rely on taking
               | in any copyrighted input, but requires a large
               | (expensive) amount of compute to produce, and which also
               | uses a lot of randomness so that the result would be
               | different each time it was done (but suppose also that
               | there isn't much point doing it multiple times at the
               | same scale, as having two of this kind of data artifact
               | wouldn't be much more valuable than having one).
               | 
               | Should such data artifacts be protected by copyright or
               | something like it?
               | 
               | Well, if copyright requires creative human decision
               | making, then they wouldn't be.
               | 
                | It seems like it would make sense for the creation of
                | such data artifacts at larger sizes to be economically
                | incentivized (to a point, of course; only as much as
                | is justified by the value produced by their being
                | available).
               | 
               | If such data artifacts can always be distributed without
               | restriction, then ones that are publicly available would
               | be public goods, and I guess only ones that are trade
               | secrets would be private goods? It seems to me like
               | having some mechanism to incentivize their creation and
               | being-eventually-freely-distributed would be beneficial?
               | 
               | But maybe copyright isn't the best way to do that? Idk.
        
               | _DeadFred_ wrote:
               | 'Should the resulting work be protected by copyright? I'm
               | not entirely sure...'
               | 
                | Hasn't this already been settled? Don't companies have
                | to introduce 'flaws' in order for data sets to be
                | 'protected'? Merely compiled lists of facts can't be
                | protected, which is why election-result companies have
                | to rely on NDAs rather than copyright protection to
                | protect their services on election night.
        
               | Mordisquitos wrote:
               | > _suppose there is a method of producing a data artifact
               | which would be genuinely (and economically) useful, and
               | which does not rely on taking in any copyrighted input,
               | [...] It seems like it would make sense to want it to be
               | economically incentivized to create such data artifacts
               | of higher sizes [...] But maybe copyright isn't the best
               | way to do that? Idk._
               | 
               | Exactly. It would be patents, not copyright.
        
               | tjr wrote:
               | I think it is fair to say that existing copyright law was
               | not written to handle all of this. It was written for
               | people who created works, and for other people who were
               | using those works.
               | 
               | To substitute either party with a computer system and
               | assume that the existing law still makes sense may be
               | assuming too much.
        
               | OkayPhysicist wrote:
               | The model weights are the result of an automated process,
               | by definition, and thus not protected by copyright.
               | 
               | In my unusually well-informed on copyright but not a
               | lawyer opinion, without any new legislation on the
               | subject, I suspect that the most likely scenario for
               | intellectual property rights surrounding AI is that using
               | other people's works for training probably falls under
               | fair use, since it's extremely transformative (an AI that
               | makes text and a textual work are very different things)
               | and it's extremely difficult to argue that the AI, as it
               | exists today, directly impacts the value of the original
               | work.
               | 
                | The list of what training data to use is probably
                | protected by copyright if hand-picked; otherwise it's
                | just the output of whatever web-crawler they wrote to
                | gather it.
               | 
               | The AI models, as in, the inference and training
               | applications are protected by copyright, like any other
               | application.
               | 
               | The architecture of a particular AI model can be
               | protected by patents.
               | 
               | The weights, as the result of an automated process, are
               | probably not protected by copyright.
        
             | rsynnott wrote:
             | There are certainly publicly available weights with
             | restrictive licenses (eg some of the StableDiffusion
             | stuff). I'd agree that it'd seem fairly perverse to say
             | "our process for making this by slurping in a ton of
             | copyright content was not copyright theft, but your use of
             | it outside our restrictive license is", but then I'm not a
             | lawyer.
        
             | mikewarot wrote:
             | >models are locked away in data centers as trade secrets
             | 
             | The architecture and the weights in a model are the secret
             | process used to make a commercially valuable output. It
             | makes the most sense to treat them as a trade secret, in a
             | court of law.
        
           | slowmovintarget wrote:
           | Datasets want to be free.
        
             | bayindirh wrote:
             | Just ask the owner of the data for their consent before
             | adding to a dataset which wants to be free.
        
               | cle wrote:
               | The main disagreement is who the "owner" is in the first
               | place.
        
           | baryphonic wrote:
           | Going a step further, weights, i.e. coefficients, aren't
           | produced by a person at all - they're produced by machine
           | algorithms. Because a human did not create the weights, the
           | weights have no author. Thus they are ineligible for
           | copyright in the first place and are in the public domain.
           | Whether the model architecture is copyrightable is more of an
            | open question, but I think a solid argument could be made
            | that the model architecture is simply a mathematical
            | expression (albeit a complex one), though Python or other
            | source code is almost certainly copyrighted. But I imagine
            | clean-room methods could avoid problems there, and with
            | much less effort than for most software.
           | 
           | IANAL, but I have serious doubts about the applicability of
           | current copyright law to existing AI models. I imagine the
           | courts will decide the same.
        
             | jonpo wrote:
             | You can say the same about compiled executable code though.
        
               | baryphonic wrote:
               | Each compiled executable has a one-to-one relation with
               | its source code, which has an author (except for LLM code
               | and/or infinite monkeys). Thus compiled executables are
               | derivative works.
               | 
               | There is an argument also that LLMs are derivative works
               | of the training data, which I'm somewhat sympathetic to,
               | though clearly there's a difference and lots of ambiguity
               | about which contributions to which weights correspond to
               | any particular source work.
               | 
               | Again IANAL, and this is my opinion based on reading the
               | law & precedents. Consult a real copyright attorney for
               | real advice.
        
           | nullc wrote:
            | The weights are the product of a mechanical process; 5
            | years ago it would have been generally uncontroversial
            | that they are not subject to copyright in the US... but
            | 'industry' has done a tremendous job of spreading
            | confusion.
        
           | cle wrote:
           | I think you have to distinguish between a model, its
           | implementation, and its weights/parameters. AFAIU:
           | 
           | - Models are processes/concepts, thus not copyrightable, but
           | _are_ subject to trade secret law, contract and license
           | restrictions, patents, etc.
           | 
           | - Concrete implementations may be copyrighted like any code.
           | 
           | - Parameters are "facts", thus not copyrightable, but are
           | similarly subject to trade secret and contract law.
           | 
           | IANAL, not legal advice, yadda yadda yadda.
        
         | rileymat2 wrote:
          | I doubt the models are copyrighted; aren't works created by
          | machine ineligible? Otherwise you get into cases of
          | autogenerating and claiming ownership of all possible
          | musical note combinations.
        
         | hnlmorg wrote:
          | It's hard to say, because as far as I know this stuff hasn't
          | been definitively tested in any courts, in Europe or
          | America.
         | 
         | AI models are generally regarded as a company's asset (like a
         | customer database would also be), and rightly so given the cost
         | required to generate one. But that's a different matter
         | entirely to copyright.
        
         | dragonwriter wrote:
         | No, copyright violation occurs at the first unauthorized
         | copying or creation of a derivative work or exercise of any of
         | the other exclusive rights of the copyright holder (that does
         | not fall into an exception like that for fair use.) That
         | distribution is required for a copyright violation is a
         | persistent myth. Distribution is a means by which a violation
         | becomes more likely to be detected and also more likely to
         | involve significant liability for damages.
         | 
         | (OTOH, whether models, as the output of a mechanical process,
         | are subject to copyright is a matter of some debate. The firms
         | training models tend to treat the models as if they were
         | protected by copyright but also tend to treat the source works
         | as if copying for the purpose of training AI were within a
         | copyright exception; why each of those positions is in their
         | interest is obvious, but neither is well-established.)
        
         | larodi wrote:
          | It's insane to state it, tbh.
        
       | 23B1 wrote:
       | > hoarding data
       | 
       | Laundering IP. FTFY.
        
       | jonpo wrote:
        | Well done, you seem to have liberated an open model trained on
        | open data for blind and visually impaired people.
       | 
       | Paper: https://arxiv.org/pdf/2204.03738
       | 
        | Code: https://github.com/microsoft/banknote-net
        | 
        | Training data:
        | https://raw.githubusercontent.com/microsoft/banknote-net/ref...
        | 
        | Model:
        | https://github.com/microsoft/banknote-net/blob/main/models/b...
       | 
        | Kinda easier to download it straight from GitHub.
        | 
        | It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
        | 
        | But let's not let that get in the way of hating on AI, shall
        | we?
        
         | cess11 wrote:
         | [flagged]
        
           | jonpo wrote:
           | Yes nothing wrong with cool software or showing people how to
           | use it for useful things.
           | 
            | Sorry, I'm just kind of sick of the whole 'kool aid',
            | 'rage against AI' thing a lot of people seem to have going
            | on and the way it is presented in the post. I have family
            | members with vision impairment helped by this particular
            | app, so it's a bit personal.
           | 
           | Nothing against opening stuff up and understanding how it
           | works etc. I'd just rather see people build/train useful new
           | models and stuff with the open datasets / models already
           | available.
           | 
            | I guess AI kind of does pay my bills in a roundabout way.
        
             | a2128 wrote:
             | Sadly companies will hoard datasets and model research in
             | the name of competitive advantage. Obviously with this
             | specific model Microsoft chose to make it open, but this is
             | not always the case, and it's not uncommon to read papers
             | or technical reports saying they trained on an "internal
             | dataset"
        
               | jonpo wrote:
                | Companies do have a lot of data, and some of that data
                | might be useful for training AI, but >99% isn't. When
                | companies do release a cool model or paper that
                | doesn't have open data (as you point out, for
                | competitive or other reasons, e.g. privacy), people
                | can then help build/collect similar open datasets.
                | Unfortunately, companies generally don't owe you their
                | data, and if they are in the business of making models
                | they probably won't share the model either; the
                | situation is similar to source code for proprietary
                | LoB applications. Fortunately, the best AI researchers
                | mostly do like to share their knowledge, and because
                | companies want to attract the best AI researchers,
                | they seem to generally allow researchers to publish if
                | it's not too commercially sensitive. It could be
                | worse: while the competitive situation has reduced
                | some visibility of the cutting-edge science, lots of
                | datasets and papers are still published.
        
             | cess11 wrote:
             | In my view there was almost nothing like that in this
             | article, besides the first sentence it went right into the
             | technical stuff, which I liked. Compared to a lot of
             | articles linked here it felt almost free from the battles
             | between "AI" fashions.
             | 
              | It seems dang thinks I mistreated you somehow; if you
              | agree, I'm sorry, it wasn't my intention.
        
           | dang wrote:
           | Can you please edit swipes out of your HN comments? Your post
           | would be fine with just the first sentence.
           | 
           | This is in the site guidelines:
           | https://news.ycombinator.com/newsguidelines.html.
        
             | cess11 wrote:
             | What do you mean, "swipe"? The other person agreed they'd
             | misjudged the article and apologised several hours before
             | you wrote this.
        
         | llama_drama wrote:
         | If this is exactly the same model then what's the point of
         | encrypting it?
        
         | TechDebtDevin wrote:
         | [flagged]
        
           | timewizard wrote:
           | [flagged]
        
             | TechDebtDevin wrote:
             | [flagged]
        
               | rob_c wrote:
               | I am groot
        
             | dang wrote:
             | Please don't respond to a bad comment by breaking the site
             | guidelines yourself. That only makes things worse.
             | 
             | https://news.ycombinator.com/newsguidelines.html
        
           | dang wrote:
           | Please don't cross into personal attack or otherwise break
           | the site guidelines when posting here. Your post would be
           | fine with just the first sentence.
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
             | rob_c wrote:
             | Really... Some people do need to be taken down a peg here
             | at times though.
        
         | DoctorOetker wrote:
          | Don't you think it's intentional, so as not to demonstrate
          | the technique on potentially copyrighted data?
        
         | dang wrote:
         | > But lets not let that get in the way of hating on AI shall
         | we?
         | 
         | Can you please edit this kind of thing out of your HN comments?
         | (This is in the site guidelines:
         | https://news.ycombinator.com/newsguidelines.html.)
         | 
         | It leads to a downward spiral, as one can see in the
         | progression to https://news.ycombinator.com/item?id=42604422
         | and https://news.ycombinator.com/item?id=42604728. That's what
         | we're trying to avoid here.
         | 
         | Your post is informative and would be just fine without the
         | last sentence (well, plus the snarky first two words).
        
           | Lerc wrote:
           | Can you clarify this a bit. I presume you are talking about
           | the tone more than the implied statement.
           | 
           | If the last sentence were explicit rather than implied, for
           | instance
           | 
           |  _This article seems to be serving the growing prejudice
           | against AI_
           | 
           | Is that better? It is still likely to be controversial and
           | the accuracy debatable, but it is at least sincere and could
           | be the start of a reasonable conversation, provided the
           | responders behave accordingly.
           | 
           | I would like people to talk about controversial things here
           | if they do so in a considerate manner.
           | 
           | I'd also like to personally acknowledge how much work you do
           | to defuse situations on HN. You represent an excellent
           | example of how to behave. Even when the people you are
           | talking to assume bad faith you hold your composure.
        
           | jonpo wrote:
            | I don't seem to be able to edit it; apologies, I will try
            | not to let this type of thing get to me in future.
            | 
            | I would also like to point out that this is a fine-tuned
            | classifier vision model based on MobileNetV2, and not an
            | LLM.
        
         | rob_c wrote:
         | ... Because if he did this with a model that's not open that's
         | sure going to keep everyone happy and not result in
         | lawsuit(s)...
         | 
         | The same method/strategy applies to closed tools and models
         | too, although you should probably be careful if you've handed
         | over a credit card for a decryption key to a service and try
         | this ;)
        
       | nthingtohide wrote:
        | One thing I noticed in Gboard is that it uses homeomorphic
        | encryption to do federated learning of common words used
        | amongst the public, to make encrypted suggestions.
        | 
        | E.g. there are two common spellings of bizarre which are
        | popular on Gboard: bizzare and bizarre.
       | 
       | Can something similar help in model encryption?
        
         | antman wrote:
         | Had to look it up, this seems to be the paper
         | https://research.google/pubs/federated-learning-for-mobile-k...
        
         | umeshunni wrote:
         | Homomorphic, not homeomorphic
        
         | vlovich123 wrote:
         | In theory yes, in practice right now no. Homomorphic encryption
         | is too computationally expensive.
        
       | boothby wrote:
       | If I understand the position of major players in this field,
       | downloading models in bulk and training a ML model on that corpus
       | shouldn't violate anybody's IP.
        
         | zitterbewegung wrote:
          | IANAL, but this is not true: the model would be a piece of
          | the software. If there is a copyright on the app itself it
          | would extend to the model. Even models have licenses; for
          | example, LLaMA is released under this license [1]
         | 
         | [1] https://github.com/meta-llama/llama/blob/main/LICENSE
        
           | blitzar wrote:
           | If I understand the position of major players in this field,
           | copyright itself is optional (for them at least).
        
             | rusk wrote:
             | They claim "safe harbour" - if nobody complains it's fair
             | game
        
             | zitterbewegung wrote:
             | True, I think there has to be a case that sets precedent
             | for this issue.
        
           | Drakim wrote:
           | Is there a material difference between the copyright laws for
           | software and the copyright laws for images and text?
        
           | boothby wrote:
           | LLMs are trained on works -- software, graphics and text --
           | covered by my copyright. What's the difference?
        
           | dragonwriter wrote:
            | The fact that model creators _assert_ that they are
            | protected by copyright and offer licenses does not mean:
           | 
           | (1) That they are actually protected by copyright in the
           | first place, or
           | 
           | (2) That the particular act described does not fall into an
           | exception to copyright like fair use, exactly as many model
           | creators assert that the exact same act done with the
           | materials models are trained on does, rendering the
           | restrictions of the license offered moot for that purpose.
        
           | _DeadFred_ wrote:
           | Yeah no.
           | 
            | An example for legal reference might be convolution
            | reverb. Basically, it's a way to record what a fancy
            | reverb machine does (using copyrighted complex math
            | algorithms) and cheaply recreate the reverb on my
            | computer. It seems like companies can do this as long as
            | they distribute the protected reverbs separately from the
            | commercial application. So Liquidsonics
            | (https://www.liquidsonics.com/software/) sells reverb
            | software but offers for free download the 'protected'
            | convolution reverbs, specifically the Bricasti ones in
            | dispute (https://www.liquidsonics.com/fusion-ir/reverberate-3/).
            | 
            | Also, while a SQL server can be copyright protected, that
            | doesn't by extension give the SQL server software's
            | creators copyright protection/ownership of a SQL database.
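            | 
            | A toy sketch of the idea (assuming NumPy/SciPy; this is a
            | simplification, not how Liquidsonics actually does it):
            | 
            |     import numpy as np
            |     from scipy.signal import fftconvolve
            |     
            |     def convolution_reverb(dry, impulse_response):
            |         # The "recording" of the hardware unit is its
            |         # impulse response; applying the reverb is just
            |         # convolving the dry signal with it.
            |         wet = fftconvolve(dry, impulse_response)
            |         return wet / np.max(np.abs(wet))  # normalize peak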
        
       | avg_dev wrote:
       | pretty cool; that frida tool seems really nice.
       | https://frida.re/docs/home/
       | 
        | (and a bunch of people seem to be interested in the "IP" note,
        | but I took it as just trying not to run into legal trouble for
        | advertising "here's how you can 'steal' models!")
        
         | frogsRnice wrote:
          | frida is an amazing tool - it has empowered me to do things
          | that would have otherwise taken weeks or even months. This
          | video is a little old, but the creator is also cracked:
          | https://www.youtube.com/watch?v=CLpW1tZCblo
         | 
         | It's supposed to be "free-IDA" and the work put in by the
         | developers and maintainers is truly phenomenal.
         | 
          | EDIT: This isn't really an attack imo. If you are going to
          | take "secrets" and shove them into a mobile app, they can't
          | really be considered secret. I suppose it's a tradeoff: if
          | you want to do this kind of thing client-side, the secret
          | sauce isn't so secret.
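          | 
          | As a rough illustration of the kind of thing frida makes
          | easy (a sketch only; the package name is made up, and this
          | is just the classic Interceptor pattern from frida's docs):
          | 
          |     import frida
          |     
          |     # JS payload: log every open() of a .tflite file, to
          |     # spot where the app reads its bundled model from.
          |     js = """
          |     Interceptor.attach(Module.getExportByName(null, 'open'), {
          |       onEnter(args) {
          |         const path = args[0].readUtf8String();
          |         if (path && path.endsWith('.tflite')) {
          |           console.log('model opened: ' + path);
          |         }
          |       }
          |     });
          |     """
          |     
          |     device = frida.get_usb_device()
          |     session = device.attach("com.example.app")  # hypothetical
          |     script = session.create_script(js)
          |     script.load()
          |     input("hooked; press Enter to detach\n")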
        
       | Polizeiposaune wrote:
       | You wouldn't train a LLM on a corpus containing copyrighted works
       | without ensuring you had the necessary rights to the works, would
       | you?
        
         | sharkest wrote:
         | Would.
        
         | deadbabe wrote:
         | Fair use.
        
           | griomnib wrote:
           | Easy to claim, harder to justify once you start charging
           | money for your subsequent creation.
           | 
            | Unless all LLMs are a ruthless parody of human
            | intelligence, which they may be, the legal issues will
            | continue.
        
           | dijksterhuis wrote:
           | *only available in the USA, terms and conditions apply.
           | 
           | most other places use fair dealing which is more restrictive
           | https://en.m.wikipedia.org/wiki/Fair_dealing
        
           | bayindirh wrote:
            | The moment you earn money from it, that's not fair use
            | anymore. When I last checked, unlimited access to said
            | models was not free; plus, it's not "research" anymore.
           | 
           | - Addenda -
           | 
           | For the interested parties, the law states the following [0].
           | 
            | Notwithstanding the provisions of sections 17 U.S.C. § 106
            | and 17 U.S.C. § 106A, the fair use of a copyrighted work,
            | including such use by reproduction in copies or phonorecords
            | or by any other means specified by that section, for purposes
            | such as criticism, comment, news reporting, teaching
            | (including multiple copies for classroom use), scholarship,
            | or research, is not an infringement of copyright. In
            | determining whether the use made of a work in any particular
            | case is a fair use the factors to be considered shall
            | include:
            | 
            |   1. the purpose and character of the use, including
            |      whether such use is of a commercial nature or is for
            |      nonprofit educational purposes;
            |   2. the nature of the copyrighted work;
            |   3. the amount and substantiality of the portion used in
            |      relation to the copyrighted work as a whole; and
            |   4. the effect of the use upon the potential market for
            |      or value of the copyrighted work.
            | 
            | The fact that a work is unpublished shall not itself bar a
            | finding of fair use if such finding is made upon
            | consideration of all the above factors.
           | 
           | So, if you say that these factors can be flexed depending on
           | the defendant, and can be just waved away to protect the
           | wealthy, then it becomes _something else_ , but given these
           | factors, and how damaging this "fair use" is, I can certainly
           | say that training AI models with copyrighted corpus is not
           | fair use in any way.
           | 
           | Of course at the end of the day, IANAL & IANAJ. However, my
            | moral compass directly bars the use of copyrighted corpora in
            | publicly accessible, for-profit models which deprive many
            | people of their livelihoods.
           | 
           | From my perspective, people can whitewash AI training as they
           | see fit to sleep sound at night, but this doesn't change
           | anything from my PoV.
           | 
           | [0]:
           | https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors
        
             | xvector wrote:
             | You can absolutely monetize works altered under fair use.
        
               | bayindirh wrote:
                | Any examples sans current AI models? I have not seen any,
                | or have failed to find any, to be precise.
        
               | xvector wrote:
               | Basically any YouTube video that shows another YouTube
               | video, song, movie, etc. as part of something else (eg a
               | voiceover.)
        
             | o11c wrote:
             | "Making money" does not immediately invalidate fair use,
             | but it does wave a big red flag in the courts' faces.
        
               | bayindirh wrote:
                | So you're saying that every law is a suggestion, depending
                | on who's being tried?
        
               | o11c wrote:
                | Er, what? I'm speaking directly from the law, 17 U.S.C.
                | § 107. It's deliberately written in terms of "factors to
               | consider", rather than absolutes.
               | 
               | > In determining whether the use made of a work in any
               | particular case is a fair use the factors to be
               | considered shall include:
               | 
               | > * the purpose and character of the use, including
               | whether such use is of a commercial nature or is for
               | nonprofit educational purposes;
               | 
               | > * the nature of the copyrighted work;
               | 
               | > * the amount and substantiality of the portion used in
               | relation to the copyrighted work as a whole; and
               | 
               | > * the effect of the use upon the potential market for
               | or value of the copyrighted work.
        
             | FloorEgg wrote:
             | I really don't think it's that simple. I can read books and
             | then earn money from applying what I learned in them. I can
             | also study art and then make original art in the same or
             | similar styles. If a person was doing this there would be
             | no one claiming copyright infringement. The only difference
             | is it's a machine doing it and not a person.
             | 
             | The nature of copyright and plagiarism boils down to
             | paraphrasing, and so long as LLMs sufficiently paraphrase
             | the content it's an open question whether it's copyright
             | infringement and requires new law/precedent.
             | 
             | So the fact they are earning money is a red herring unless
             | they are reproducing the exact same content without
              | paraphrasing (with the exception of commentary). E.g. they can
             | quote part of a work while commenting on it.
             | 
             | Where they have gotten into trouble with e.g. NYT afaik is
             | when the LLM reproduced a whole article word for word. I
             | think they have all tried hard to prevent the LLM from ever
             | doing that to avoid that legal risk.
        
               | bayindirh wrote:
               | > I can read books and then earn money from applying what
               | I learned in them.
               | 
                | How many books can you read, understand, and memorize in
                | a given time T, and how many books can an AI ingest in
                | that same time?
               | 
               | If we're down to paraphrasing, watch this video [1], and
               | think again.
               | 
               | Many models, given that you ask the correct questions,
               | reproduce their training set with great accuracy, and
               | this is only prevented with monkey patching, IIUC.
               | 
               | So, it's still a big mess, even if we don't add
               | copyrighted corpus to the mix. Oh, BTW, datasets like
               | "The Stack" are not clean as they claim. I have seen at
               | least two non-permissively licensed code repositories
               | inside that dataset.
               | 
               | [1]: https://youtu.be/LrkAORPiaEA
        
         | yieldcrv wrote:
         | and therefore everyone has the necessary rights to read works,
         | the necessary rights to critique the works, including for
         | commercial purposes, and the necessary rights to derivative
         | works including for commercial purposes
        
         | tomjen3 wrote:
         | You wouldn't read a book and teach others its lessons without a
         | derived license, would you?
        
           | dijksterhuis wrote:
            | copyright refers to the act of copying the material at hand
            | (including distribution, reproduction, performance, etc.).
           | 
           | as an example: saying "i really like james holden's
           | inheritors album for the rough and dissonant sounds" isn't
           | covered by copyright.
           | 
           | if i reproduced it verbatim using my mouth, or created a
           | derived work which is noticeably similar to the original,
           | that's a different question though.
           | 
            | in your example, a derivative work could be akin to only
            | quoting from the book for the audience and modifying a word
            | of each quote.
           | 
           | "derived" works are always a grey area, especially around
           | generative machine learning right now.
        
           | ben_w wrote:
           | When I was at school, we were sometimes all sat down in front
           | of a TV to watch some movie on VHS tape (it was the 90s).
           | 
           | At the start of the tape, there was a copyright notice
           | forbidding the VHS tape from being played at, amongst other
           | places, schools.
           | 
           | Copyright rules are a strange thing.
        
         | philwelch wrote:
         | You're applying a double standard to LLM's and human creators.
         | Any human writer or artist or filmmaker or musician will be
         | influenced by other people's works, even while those works are
         | still under copyright.
        
           | cmiles74 wrote:
            | I don't see how this is a double standard. A person
            | interacting with their culture is not comparable to a machine
            | in any way. IMHO, it's kind of a wacky argument to make.
        
             | crazygringo wrote:
             | Can you elaborate on how it's not comparable? It seems
             | obvious to me that it is -- they both learn and then create
             | -- so what's the difference?
             | 
             | If I can hire an employee who draws on knowledge they
             | learned from copyrighted textbooks, why can't I hire an AI
             | which draws on knowledge it learned from copyrighted
             | textbooks? What makes that argument "wacky" in your eyes?
        
               | tekno45 wrote:
               | you're asking why you have to treat people differently
               | than you treat tools and machines.
        
               | crazygringo wrote:
               | Well obviously not in general. But when it comes to
               | copyright law specifically, yes absolutely. That is the
               | question I'm asking.
        
               | salawat wrote:
               | You're not going to get an answer you find agreeable,
               | because you're hoping for an answer that allows you to
                | continue to treat the tool as chattel, without conferring
                | on it the excess baggage of being an individuated
               | entity/laborer.
               | 
               | You're either going to get: it's a technological,
               | infinitely scalable process, and the training data should
               | be considered what it is, which is intellectual property
                | that should be licensed before being used.
               | 
               | ...or... It actually is the same as human learning, and
               | it's time we started loading these things up with other
               | baggage to be attached to persons if we're going to
               | accept it's possible for a machine to learn like a human.
               | 
               | There isn't a reasonable middle ground due to the
               | magnitude of social disruption a chattel quasi-human
               | technological human replacement would cause.
        
               | philwelch wrote:
               | No, you're missing the point of copyright. The point of
               | copyright is to protect an exclusive right to copy, not
               | the right to produce original works influenced by
               | previous works. If an LLM produces original works that
               | are influenced by the training data, that is not a
               | violation of copyright. If it reproduces the training
               | data verbatim, it is.
        
               | dijksterhuis wrote:
               | i weirdly agree with you, but also want to point out that
               | "influenced by the training data" is doing some very
               | heavy lifting there.
               | 
               | exactly how the new work is created is important when it
               | comes to derivative works.
               | 
               | does it use a copy of the original work to create it, or
               | a vague idea/memory of the original work's composition?
               | 
               | when i make music it's usually vague memories. i'd argue
               | that LLMs have an encoded representation of the original
               | work in their weights (along with all the other stuff).
               | 
               | but that's the legal grey area bit. is the "mush" of
               | model weights an encoded representation of works, or
               | vague memories?
        
               | philwelch wrote:
               | I don't really think it matters because you can just
               | compare the output to the input and apply the same
               | standard, treating the process between the two as a black
               | box.
        
               | aidenn0 wrote:
               | Aren't animals a current example of a middle ground? They
               | are incapable of authoring copyrightable works under
               | current US law.
        
               | cmiles74 wrote:
                | It has never been argued that copyright law should apply
                | to information that people learn, whether that be from
               | reading books or newspapers, watching television or
               | appreciating art like paintings or photographs.
               | 
                | Unlike a person, a large language model is a product
                | built and sold by a company. While I am not a
               | lawyer, I believe much of the copyright arguments around
               | LLM training revolve around the idea that copyrighted
               | content should be licensed by the company training the
               | LLM. In much the same way that people are not allowed to
               | scrape the content of the New York Time website and then
               | pass it off as their own content, so should OpenAI be
               | barred from scraping the New York Times website to train
               | ChatGPT and then sell the service without providing some
               | dollars back to the New York Times.
        
               | _DeadFred_ wrote:
                | One is a collection of highly dithered data generated by
                | machines, paid for by a business in order to financially
                | gain from the copyrighted works and to replace any
                | future need for copyrighted textbooks.
                | 
                | The other is a person learning from a copyrighted
                | textbook in the legally protected manner, and for whose
                | use the textbook was written.
        
               | groby_b wrote:
               | Unless you are making an argument for personhood, one is
               | a machine, the other is a human. Different standards
               | apply, end of discussion.
        
               | homarp wrote:
               | most probably your employee actually 'paid' for their
               | textbook.
        
               | cmiles74 wrote:
               | I don't think this question really makes any sense... In
               | my opinion, it's kind of mish-mashing several things
               | together.
               | 
               | "Can you elaborate on how it's not comparable?"
               | 
               | The process of individual people interacting with their
               | culture is a vastly different process than that used to
                | train large language models. In what ways do you think
               | these processes have anything in common?
               | 
               | "It seems obvious to me that it is -- they both learn and
               | then create -- so what's the difference?"
               | 
               | This doesn't seem obvious to me (obviously)! Maybe you
               | can argue that an LLM "learns" during training, but that
               | ceases once training is complete. For sure, there are
               | work-arounds that meet certain goals (RAG, fine-tuning);
               | maybe your already vague definition of "learning" could
               | be stretched to include these? Still, comparing this to
               | how people learn is pretty far-fetched. AFAICT, there's
               | no literature supporting the view that there's any
               | commonality here; if you have some I would be very
               | interested to read it. :-)
               | 
                | Do they both create? I suspect not; an LLM is parroting
                | back data from its training set. We've seen many studies
               | showing that tested LLMs perform poorly on novel problem
               | sets. This article was posted just this week:
               | 
               | https://news.ycombinator.com/item?id=42565606
               | 
                | The jury is still out on the copyright issue; from the
                | perspective of US law we'll have to wait on this one.
               | Still, it's clear that an LLM can't "create" in any
               | meaningful way.
               | 
               | And so on and so forth. How is hiring an employee at all
               | similar to subscribing to an OpenAI ChatGPT plan? Wacky
               | indeed!
        
             | rob_c wrote:
              | That's a little simplistic. You're almost trying to say
              | black and white, sans gray, can't be compared, which is a
              | bit weird.
             | 
             | Strangely like the situation itself.
             | 
              | The question just boils down to: how can we guarantee a
              | model is influenced rather than memorising an input?
             | 
             | And then is a human who is influenced simply relying on a
             | faulty or less than perfect memory?
        
           | dijksterhuis wrote:
           | as a human being, and one that does music stuff, i don't
           | download terabytes of other peoples works from the internet
           | directly into my brain. i don't have verbatim reproductions
           | of people's work sitting around on a hard disk in my
           | stomach/lungs/head/feet.
           | 
           | LLMs are not humans. They're essentially a probabilistic
           | compression algorithm (encode data into model weights/decode
           | with prompt to retrieve data).
        
             | philwelch wrote:
             | Do you ever listen to music? Is your music ever influenced
             | by the music that you listen to? How do you imagine that
             | works, in an information-theoretical sense, that
             | fundamentally differs from an LLM?
             | 
             | Depending on how much music you've listened to, you very
             | well may have "downloaded terabytes" of it into your brain.
             | Your argument is specious.
        
               | dijksterhuis wrote:
               | i do listen to music.
               | 
               | i listen to it on apple music.
               | 
               | i pay money to apple for this.
               | 
               | some of that money that i pay to apple goes to the rights
               | holders of that music for the copying and performance of
               | their work through my speakers.
               | 
               | that's a pretty big difference to how most LLMs are
               | trained right there! i actually pay original creators
               | some money.
               | 
               | -
               | 
               | i am a human being. you cannot reduce me down to some
               | easy information theory.
               | 
               | an LLM is a tool. an algorithm. with the same random seed
               | etc etc it will get the same results. it is not human.
               | 
               | you put me in the same room as yesterday i'll behave
               | completely differently.
               | 
               | -
               | 
               | i have listened to way more than terabytes of music in my
               | life. doesn't mean i have the ability to regurgitate any
               | of it verbatim though. i'm crap at that stuff.
               | 
               | LLMs seem to be really good at it though.
        
               | cmiles74 wrote:
               | Information on how large language models are trained is
               | not hard to come by, there are numerous articles that
               | cover this material. Even a brief skimming of this
               | material will make it clear that the training of large
               | language models is materially different in almost every
               | way from how human beings "learn" and build knowledge.
               | There are still many open questions around the process of
               | how humans collect, store, retrieve and synthesize
               | information.
               | 
                | There is little mystery to how large language models
                | function, and it's clear that their output is parroting
                | back portions of their training data; the quality of
                | output degrades greatly when novel input is provided. Is
               | your argument that people fundamentally function in the
               | same way? That would be a bold and novel assertion!
        
           | _DeadFred_ wrote:
           | Human creators don't store that 'influence' in a digital
           | machine accessible format generated directly from the
           | copyrighted content though.
           | 
            | Although with the 'good news everyone, we built the torment
            | nexus' trajectory of AI, my guess is at this point AI
            | companies would just incorporate actual human brains instead
            | of digital storage if that were the requirement.
        
             | galangalalgol wrote:
              | Does that imply that if we invent brain-upload technology,
              | my weights carry every conflicting license and patent
              | for everything I can quote or create? I don't like that
              | precedent. I have complete rights over my noggin's
              | contents. If I do quote a NYT article in its entirety,
              | _that_ would be infringement, but not copying my brain
              | itself.
        
             | philwelch wrote:
             | Your argument boils down to "we don't know how brains
             | work", and it is a non-sequitur. It isn't a violation of
             | copyright law to create original works under the creative
             | influence of works still under copyright.
        
         | Workaccount2 wrote:
         | LLMs are not massive archives of data. They are a tiny fraction
         | of a fraction of a percent of the size of their training set.
         | 
         | And before you knee-jerk "it's a compression algo!", I invite
         | you to archive all your data with an LLMs "compression algo".
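         | 
         | For a rough sense of scale (illustrative numbers, not
         | measurements of any particular model):
         | 
         |     train_tokens = 15e12     # ~15T training tokens
         |     bytes_per_token = 4      # ballpark for English text
         |     params = 8e9             # an 8B-parameter model
         |     bytes_per_param = 2      # 16-bit weights
         | 
         |     corpus = train_tokens * bytes_per_token  # ~60 TB of text
         |     model = params * bytes_per_param         # ~16 GB of weights
         |     print(model / corpus)                    # ~0.00027, i.e. ~0.03%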
        
           | int_19h wrote:
           | It doesn't matter. It's still a derived work.
        
             | baxtr wrote:
             | Well what isn't in this world?
             | 
              | Would Einstein have been possible without Newton?
        
               | thedailymail wrote:
               | Newton was public domain by Einstein's time.
        
               | jampekka wrote:
               | Indeed. Copyright was introduced in 1710, Principia was
               | published in 1687.
        
               | yieldcrv wrote:
                | and even with our current copyright laws providing for
                | long-dated protection, it would still have been in the
                | public domain
        
               | jampekka wrote:
               | It's hard to say what the current laws actually imply.
               | Steamboat Willie was originally meant to be in the public
               | domain in 1955. Got there in 2024.
        
           | BobbyTables2 wrote:
           | Copying a single sentence verbatim from a 1000 page book is
           | still plagiarism.
           | 
           | And is technically copyright infringement outside fair use
           | exceptions.
        
             | concerndc1tizen wrote:
             | And similarly, translating those sentences into data points
             | is still a derivative work, like transcribing music and
             | then making a new recording is still derivative.
        
               | jpollock wrote:
               | derivative works still tend to be copyright violations.
        
               | concerndc1tizen wrote:
               | Yes, that's what I'm saying. An LLM washing machine
               | doesn't get rid of the copyright.
        
           | timewizard wrote:
           | > LLMs are not massive archives of data.
           | 
            | Neither am I, yet I am still capable of reproducing
            | copyrighted works to a level that most would describe as
            | illegal.
           | 
           | > And before you knee-jerk "it's a compression algo!"
           | 
           | It's literally a fundamental part of the technology so I
           | can't see how you call it a "knee jerk." It's lossy
           | compression, the same way a JPEG might be, and simply
           | recompressing your picture to a lower resolution does not at
           | all obviate your copyright.
           | 
           | > I invite you to archive all your data with an LLMs
           | "compression algo".
           | 
           | As long as we agree it is _my data_ and not yours.
        
             | Isamu wrote:
             | > It's lossy compression, the same way a JPEG might be
             | 
             | Compression yes, but this is co-mingling as well. The
             | entire corpus is compressed together, which identifies
             | common patterns, and in the model they are essentially now
             | overlapping.
             | 
             | The original document is represented statistically in the
             | final model, but you've lost the ability to extract it
             | closely. Instead you gain the ability to generate something
             | statistically similar to a large number of original
             | documents that are related or are structurally similar.
             | 
             | I'm just commenting, not disputing any argument about fair
             | use.
        
       | ipsum2 wrote:
        | This is cool, but it's only the first part of extracting an ML
        | model for usage. The second part is reverse engineering the
        | tokenizer and input transformations that are needed before
        | passing the data to the model, and outputting a human-readable
        | format.
        
         | refulgentis wrote:
         | This is a good comment, but only in the sense it documents a
         | model file doesn't run the model by itself.
         | 
          | An analogous situation is seeing a blog that purports to "show
          | you code", where the code returns an object, and commenting
          | "This is cool, but doesn't show you how to turn a function
          | return value into a human-readable format". More noise than
          | signal.
         | 
         | The techniques in the article are trivially understood to also
         | apply to discovering the input tokenization format, and Netron
         | shows you the types of inputs and outputs.
         | 
         | Thanks for the article OP, really fascinating.
        
           | ipsum2 wrote:
            | Just having the shape of the input and output is not
            | sufficient; the image (in this example) needs to be
            | normalized. It's presumably not difficult to find the exact
            | numbers, but it is a source of errors when reverse
            | engineering an ML model.
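            | 
            | As a sketch of what recovering those numbers can look like:
            | the resize is dictated by the input shape, and quantization
            | metadata sometimes hints at the expected range, but the
            | float normalization below (a common [-1, 1] convention) is
            | a guess that has to be verified per model:
            | 
            |     import numpy as np
            |     import tensorflow as tf
            |     from PIL import Image
            | 
            |     interpreter = tf.lite.Interpreter(model_path="model.tflite")
            |     interpreter.allocate_tensors()
            |     inp = interpreter.get_input_details()[0]
            |     # 'quantization' is a (scale, zero_point) pair;
            |     # (0.0, 0) means the input is unquantized.
            |     print(inp["shape"], inp["dtype"], inp["quantization"])
            | 
            |     _, h, w, _ = inp["shape"]  # usually [1, height, width, 3]
            |     img = np.asarray(Image.open("photo.jpg").resize((w, h)),
            |                      dtype=np.float32)
            |     img = img / 127.5 - 1.0  # guessed normalization
            | 
            |     interpreter.set_tensor(inp["index"], img[None, ...])
            |     interpreter.invoke()
            |     out_ix = interpreter.get_output_details()[0]["index"]
            |     out = interpreter.get_tensor(out_ix)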
        
         | rob_c wrote:
          | If you can't fix this with a little help from ChatGPT or
          | Google, you frankly shouldn't be building models, let alone
          | mucking with other people's...
        
         | mentalgear wrote:
          | It would be interesting if someone could detail the approach
          | to decoding the pre-/post-processing steps before data enters
          | the model, and how to find the correct input encoding.
        
       | 1vuio0pswjnm7 wrote:
       | "Keep in mind that AI models, like most things, are considered
       | intellectual property. Before using or modifying any extracted
       | models, you need the explicit permission of their owner."
       | 
        | If the weights and biases contained in "AI models" are
        | proprietary, then for one model owner to detect infringement by
        | another model owner, it may be necessary to download and
        | extract.
        
       | kittikitti wrote:
       | This was a great article and I really appreciate it!
        
       | Fragoel2 wrote:
       | There's an interesting research paper from a few years ago that
       | extracted models from Android apps on a large scale:
       | https://impillar.github.io/files/ccs2022advdroid.pdf
        
       | janalsncm wrote:
       | I'm a huge fan of ML on device. It's a big improvement in privacy
       | for the user. That said, there's always a chance for the user to
       | extract your model, so on-device models will need to be fairly
       | generic.
        
       | Ekaros wrote:
        | Can you launder an AI model by feeding it to some other model
        | or training process? After all, that is how it was originally
        | created, so it cannot be any less legal...
        
         | benreesman wrote:
          | There is a family of techniques, often called something like
          | "distillation". There are also various synthetic-training-data
          | strategies; it's a very active area of research.
         | 
         | As for the copyright treatment? As far as I know it's a bit up
         | in the air at the moment. I suspect that the major frontier
         | vendors would mostly contend that training data is fair use but
         | weights are copyrighted. But that's because they're bad people.
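         | 
         | The classic formulation (Hinton et al.'s knowledge
         | distillation) is simple enough to sketch: soften both output
         | distributions with a temperature and push the student toward
         | the teacher. A minimal PyTorch version, assuming you already
         | have logits from both models:
         | 
         |     import torch.nn.functional as F
         | 
         |     def distillation_loss(student_logits, teacher_logits, T=2.0):
         |         # Temperature-softened teacher probabilities vs.
         |         # student log-probabilities.
         |         p_teacher = F.softmax(teacher_logits / T, dim=-1)
         |         log_p_student = F.log_softmax(student_logits / T, dim=-1)
         |         # KL divergence, rescaled by T^2 so gradient magnitudes
         |         # stay comparable across temperatures.
         |         return F.kl_div(log_p_student, p_teacher,
         |                         reduction="batchmean") * T * T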
        
           | qup wrote:
           | The weights are my training data. I scraped them from the
           | internet
        
       | amolgupta wrote:
        | For app developers considering tflite, a safer way would be to
        | host the models on Firebase and delete them when their job is
        | done. It comes with other features like versioning for model
        | updates, A/B tests, a lower APK size, etc.
        | https://firebase.google.com/docs/ml/manage-hosted-models
        
       ___________________________________________________________________
       (page generated 2025-01-05 23:00 UTC)