[HN Gopher] AI and Open Source in 2023
       ___________________________________________________________________
        
       AI and Open Source in 2023
        
       Author : belter
       Score  : 42 points
       Date   : 2023-11-04 18:50 UTC (4 hours ago)
        
 (HTM) web link (magazine.sebastianraschka.com)
 (TXT) w3m dump (magazine.sebastianraschka.com)
        
       | gumballindie wrote:
        | As soon as DRM for text and images is implemented, companies such
        | as OpenAI will be in for a ride. Unfortunately open source models
        | will be sacrificed in the process, but we need some means of
        | protecting against the rampant IP theft AI companies engage in.
        
         | artninja1988 wrote:
         | No such thing as IP theft
        
           | gumballindie wrote:
            | Let me guess - you think IP and copyright are "rent seeking"?
            | What a weird age we live in, where people defend corporations
            | that steal our work. Quite a shift from the reverse.
        
           | minimaxir wrote:
           | It's entirely possible to steal IP, but the "AI art is theft"
           | part of it is still legally up in the air.
        
             | gumballindie wrote:
              | There are all sorts of things that are legal yet immoral or
              | disagreeable, so even if AI art theft is legalised it's
              | still theft if the author doesn't want their work used
              | that way. It seems like "AI" is quite reliant on ingesting
              | and storing massive amounts of proprietary data to emulate
              | "intelligence" - and that's equivalent to people
              | downloading and storing movies and music, a thing we are
              | not permitted to do by the same corporations that you wish
              | to help.
        
             | jrm4 wrote:
             | I think what OP is referring to is the entirely reasonable
             | legal argument that IP infringement is not actually "theft"
             | 
             | The idea being: "Theft" isn't about "you get something you
             | don't own," it means "you deprive someone else of THEIR
             | property."
        
         | minimaxir wrote:
         | Which means that companies will just license the data used to
         | train models because they have the money to do so, or use their
         | own data instead. That's how Adobe's Firefly works right now,
         | and OpenAI just signed a licensing agreement with Shutterstock:
         | https://venturebeat.com/ai/shutterstock-signs-6-year-trainin...
         | 
          | Even if it became impossible to train AI on internet-accessible
          | data, that wouldn't slow the proliferation of generative AI; it
          | would just keep it entrenched and centralized in the hands of a
          | few players. Nor would it stop AI from potentially taking jobs
          | from artists, other than making it _harder_ for artists to
          | compete due to the lack of open-source alternatives.
        
           | gumballindie wrote:
            | No problem then: people willing to make their content
            | available to AI can do so by using such websites, and people
            | who value their work can use something else.
        
             | ben_w wrote:
             | That has the same vibe as responding to the invention of
             | the Jacquard loom by saying: "No problem then, people
             | willing to make their designs available to automation can
             | do so by using such punched cards, people that value their
             | work can use something else."
             | 
             | Home weaving does still exist. Not a very big employer any
             | more, though.
        
               | LastTrain wrote:
                | All analogies are fraught but this one takes the cake. A
                | more apt one is not wanting the Jacquard loom people to
                | steal my designs.
        
         | jrm4 wrote:
          | You're probably getting downvoted because "DRM" was nearly a
          | complete technical failure already, and there's no reason to
          | believe it would be different for AI?
        
           | gumballindie wrote:
            | Normally I wouldn't advocate for DRM, but there needs to be
            | a way to protect our content from this madness. I understand
            | the backlash though, and I am not worried about downvotes.
        
             | Krasnol wrote:
             | Your content was never protected in the sense you want it
             | to be protected.
             | 
             | Since the moment you put it up online for people to see and
             | hear, they were able to move on and create something else
             | based upon this. Most of the time unconsciously. This is
             | how humanity works. This is the reason we're still on this
             | planet. AI accelerates the process like any other tool
             | we've come up with since we climbed down the trees.
             | 
              | You can complain and scream as much as you want, but it
              | won't change, even if you manage to regulate the whole
              | western part of the internet. The rest of the world is
              | bigger and won't sleep.
        
           | ls612 wrote:
            | Unfortunately I think you are wrong about this. DRM schemes
            | are evolving to be nearly unbreakable, thanks to the
            | widespread adoption of security processors in everything.
           | 
            | As long as there is a massive fundamental asymmetry between
            | assembling a chip with a small amount of ROM and
            | disassembling & reading that ROM while keeping the chip
            | usable, DRM schemes using PKI methods will become widespread
            | and nigh unbreakable.
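            | 
            | A minimal sketch of that challenge-response pattern, assuming
            | Python's "cryptography" package (real schemes add device
            | certificates, revocation, and hardware key storage; the names
            | here are hypothetical):
            | 
            |     # Sketch only: the private key would live in tamper-
            |     # resistant ROM, not in Python memory.
            |     from cryptography.hazmat.primitives.asymmetric.ed25519 \
            |         import Ed25519PrivateKey
            | 
            |     # At manufacture: burn a per-device key into ROM and
            |     # register the public half with the license server.
            |     device_key = Ed25519PrivateKey.generate()
            |     registered_pubkey = device_key.public_key()
            | 
            |     # To obtain a content key, the device signs a fresh
            |     # server challenge; the server releases the key only if
            |     # the signature verifies for a non-revoked device.
            |     challenge = b"nonce-from-license-server"
            |     signature = device_key.sign(challenge)
            | 
            |     # Raises InvalidSignature if the key doesn't match.
            |     registered_pubkey.verify(signature, challenge)
            |     print("device attested; release the content key")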
        
             | ben_w wrote:
             | Point [camera/microphone/eyeball] at [video/audio/text],
             | [press record/press record/start writing down what you
             | see].
        
         | candiddevmike wrote:
          | IMO, the entire "train on as much data as possible" approach
          | is nearing its end. There are diminishing returns, and it
          | seems like a dead-end strategy.
        
         | babyshake wrote:
          | Watermarking images, particularly very high resolution images,
          | I can understand, but I fail to see how you would watermark
          | text in a way that provides sufficient evidence it has been
          | used as training data, unless the model is just quoting it at
          | length.
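          | 
          | The closest thing I know of is the "canary string" trick from
          | the membership-inference literature: plant a unique random
          | string in your published text, then check whether a model is
          | suspiciously good at predicting it. A rough sketch, assuming
          | the "transformers" package, with gpt2 standing in for the
          | model under suspicion:
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     tok = AutoTokenizer.from_pretrained("gpt2")
          |     model = AutoModelForCausalLM.from_pretrained("gpt2")
          | 
          |     def avg_nll(text):
          |         """Average per-token negative log-likelihood."""
          |         ids = tok(text, return_tensors="pt").input_ids
          |         with torch.no_grad():
          |             return model(ids, labels=ids).loss.item()
          | 
          |     canary = "canary: qz7-vhl-03k-xxr-p2m"   # planted
          |     control = "canary: ab3-kfn-99d-wwt-u5c"  # never published
          | 
          |     # Much lower loss on the planted string than on fresh
          |     # controls is evidence (not proof) of training on it.
          |     print(avg_nll(canary), avg_nll(control))
          | 
          | It only works if you planted the canary before the training
          | run, which is arguably the point.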
        
       | andy99 wrote:
        | Most importantly, 2023 was the year when "open source" got
        | watered down to mean "you can look at the source code / weights
        | if you agree with a bunch of stuff". Most of the models
        | referenced, like Stable Diffusion (RAIL license) and Llama &
        | derivatives (proprietary Facebook license with conditions on
        | uses and some commercial terms), are not open source in the
        | sense that it was understood a year ago. People protested a bit
        | when the terminology started being abused; that's mostly died
        | down now, and people call these restrictive licenses open
        | source. This (plus ongoing regulatory capture) is going to be
        | the wedge that destroys software freedom and brings us back to
        | a regime where a few companies dictate how computers can be
        | used.
        
         | Der_Einzige wrote:
          | In practice this matters less than you think. You can't easily
          | prove that any given output was generated by a particular
          | model, so any user can simply ignore your licenses and do as
          | they please.
          | 
          | I know it rustles purist feathers, but I don't understand why
          | we live in this pretend world that assumes that folks
          | particularly care about respecting licenses. Consider how
          | little success the GNU folks have had with using the courts
          | to enforce their licenses, and that's by Stallman's own
          | admission.
          | 
          | AI is itself a subversive technology, whose current versions
          | rely on subversive training techniques. Why should we expect
          | everyone to suddenly want to follow the rules when they read a
          | poorly written restrictive open source license?
        
           | andy99 wrote:
            | For personal or noncommercial use I agree the restrictions
            | are meaningless, as they are for "bad actors" who would
            | potentially abuse the tools in contravention of the license.
            | But the license terms are a risk for commercial users,
            | especially when dealing with a big company like Meta. These
            | risks weren't previously there in, say, PyTorch, which is
            | BSD licensed. The ironic thing with these licenses is that
            | they are least enforceable on those who would be most likely
            | to abuse them: https://katedowninglaw.com/2023/07/13/ai-
            | licensing-cant-bala...
            | 
            | Re success of free licenses: Linux (other than a few
            | arguable abuses) has remained free and unencumbered thanks
            | to GPL licensing.
        
           | nologic01 wrote:
            | Somehow the "AI defense" (namely, that it is not possible to
            | "prove" anything was used illegally) will open a Pandora's
            | box of viable channels for whitewashing outright theft.
            | Steal anything proprietary, run it through an AI filter that
            | mixes it with other stuff, and claim it as your own.
        
         | ebalit wrote:
          | Mistral 7B [1] and many models stemming from it are released
          | under the permissive Apache license.
          | 
          | Some might argue that a "pure" open-source release would also
          | require the dataset and the training "recipe", since those are
          | needed to reproduce the training, but reproducing it would be
          | so expensive that most people wouldn't be able to do much with
          | it anyway.
          | 
          | IMO, a release with open weights but without the "source" is
          | much better than the opposite, a release with open source and
          | no trained weights.
          | 
          | And it's not like there was no progress on the open dataset
          | front:
          | 
          | - Together just released RedPajama V2 [2], with enough tokens
          | to train a very sizeable base model.
          | 
          | - Tsinghua released UltraFeedback, which allowed more people
          | to align models using RLHF methods (like the Zephyr models
          | from Hugging Face).
          | 
          | - And many, many others.
          | 
          | [1] https://mistral.ai/news/announcing-mistral-7b/
          | 
          | [2] https://github.com/togethercomputer/RedPajama-Data
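          | 
          | As a concrete illustration of what the permissive license buys
          | you, here is a minimal sketch of pulling the Mistral 7B
          | weights from the Hugging Face Hub with "transformers" (needs
          | roughly 15 GB of GPU memory in fp16; quantized variants exist
          | for smaller machines):
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     name = "mistralai/Mistral-7B-v0.1"
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         name, torch_dtype=torch.float16, device_map="auto"
          |     )
          | 
          |     prompt = "Open weights mean"
          |     ids = tok(prompt, return_tensors="pt").to(model.device)
          |     out = model.generate(**ids, max_new_tokens=40)
          |     print(tok.decode(out[0], skip_special_tokens=True))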
        
         | seydor wrote:
          | Mistral appears to be quite open, and even better than Llama
          | imho
        
         | emadm wrote:
          | Check out our recent fully open 3B model, which outperforms
          | most 7B models and runs on an iPhone/CPU - fully open
          | including data and details.
          | 
          | Tuned versions outperform 13B Vicuna, Wizard, etc.
          | 
          | https://stability.wandb.io/stability-llm/stable-lm/reports/S...
        
       | nologic01 wrote:
        | Is there a truly open source effort in the LLM space? Like a
        | collaborative, crowd-sourced effort (possibly with academic
        | institutions playing a major role) that relies on Creative
        | Commons licensed or otherwise open data and produces a public
        | good as its final outcome?
        | 
        | There is this ridiculous idea of AI moats and other machinations
        | for the next big VC thing (god bless them, people have spent
        | their energy on worse pursuits), but in a fundamental sense
        | there is public-good-type infrastructure crying out to be
        | developed for each major linguistic domain.
       | 
       | Maybe such an effort would not be cutting edge enough to power
       | the next corporate chatbot that will eliminate 99% of all jobs,
       | but it would be a significant step up in our ability to process
       | text.
        
         | vinni2 wrote:
          | I think OpenAssistant is the closest to what you are
          | describing, but their models are not yet that great.
          | https://open-assistant.io/
        
           | nulld3v wrote:
           | Open Assistant just shut down:
           | https://www.youtube.com/watch?v=gqtmUHhaplo
           | 
           | Cited reasons: Lack of resources, lack of maintainer time and
           | there being many new good alternatives.
        
         | dartos wrote:
          | RWKV is fully open source and even part of the Linux
          | Foundation.
          | 
          | Idk why nobody ever talks about it.
        
         | TheCaptain4815 wrote:
          | EleutherAI fits that, I believe. In the olden days (1.5 years
          | ago) they probably had the best open source model with their
          | NeoX model, but it's since been eclipsed by Llama and other
          | "open source" models. They still have an active Discord with
          | a great community pushing things forward.
        
         | emadm wrote:
          | We back RWKV, EleutherAI and others at Stability AI.
          | 
          | We also have our carper.ai lab for the RL bits.
          | 
          | We are rolling out open language models and datasets for a
          | number of languages soon too - see our recent Japanese
          | language models, for example.
          | 
          | Got some big plans coming; we have funded it all ourselves,
          | but I'm sure others would like to help.
        
       | seydor wrote:
        | $NVDA went to the moon, and AI stocks skyrocketed, including any
        | beer with "AI" in its name. The rest of the story is typical by
        | now: VC money flows, companies hide their trade secrets
        | (prompts), public research is derailed. It's all very premature;
        | LLMs were not the end of the road.
        
         | brrrrrm wrote:
         | Why do you say "prompts" is the canonical trade secret?
        
       | jimmySixDOF wrote:
        | Taking a retrospective look at the state of AI one year this
        | month into LLMs post-ChatGPT, I would like to single out Simon
        | Willison as the MVP for open AI tooling contributions. His
        | Datasette project is a great work in progress, and the
        | prodigious blog posts and TIL snippets are state of the art.
        | Great onboarding to the whole ecosystem. I find myself using
        | something he has produced in some way every day.
       | 
       | https://simonwillison.net/
        
       | raincole wrote:
        | I think open models are more like closed source freemium
        | applications. You get the weights, which are "compiled" from the
        | source material. You're free to use them, but you can't, for
        | example, remove one piece of source material from them.
        
       ___________________________________________________________________
       (page generated 2023-11-04 23:00 UTC)