[HN Gopher] RedPajama: Reproduction of LLaMA with friendly license
___________________________________________________________________
RedPajama: Reproduction of LLaMA with friendly license
Author : tim_sw
Score : 534 points
Date : 2023-04-17 14:05 UTC (8 hours ago)
(HTM) web link (www.together.xyz)
(TXT) w3m dump (www.together.xyz)
| bmc7505 wrote:
| @dang The title should be changed from MILA to Mila/IQIA.
| hsuduebc2 wrote:
| I'm somewhat scared and somewhat amazed by the speed of this
| progress.
| martythemaniak wrote:
| This is cool; now we just need to locate 1,000,000 A100-80GB-
| equivalent GPU-hours. If we had a SETI@Home-type project set up
| for this, it would be straightforward - only $50K worth of
| electricity for the 65B model.
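|
| (Back-of-the-envelope, with assumed numbers: at ~400W per A100
| and $0.12/kWh, 1,000,000 h x 0.4 kW x $0.12/kWh = $48,000 of
| electricity, which is roughly where that $50K figure lands.)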
|
| Given the immense momentum behind LLaMA, I'm pretty disappointed
| that Meta won't just open-source it, but I guess reproducing it
| is better long-term.
| miohtama wrote:
| They missed the chance to call it OpenPajama
| wongarsu wrote:
| Calling next month's headline: "OpenPajama: RedPajama weights
| fine-tuned on literotica and fanfiction.net"
| omneity wrote:
| An actually open-source LLM would be a game changer. We might
| need a new license that covers both model usage and training,
| something GPL-like whereby distributing a retrained model
| requires contributing data back or making it public, but not if
| you use it privately.
|
| This will definitely accelerate progress in LLM research,
| productization and safety. Alpaca, Vicuna, gpt4all and others
| are sporadic representations of this that could become a
| continuous improvement process were the LLM and its license
| truly open source.
|
| An interesting possible side effect of a GPL-like license is that
| AIs become unlikely to be trained on private data, the usual moat
| that big tech wouldn't want/just can't make public if it were to
| use those GPL-like licensed models.
| jupp0r wrote:
| As with the original GPL, this would be almost useless in a
| commercial context.
| e12e wrote:
| There are commercial devices that ship with a Linux kernel?
| Bjartr wrote:
| Basically every Android device for starters.
| ijustlovemath wrote:
| I think they mean in terms of enforcement when there's a
| violation
| jupp0r wrote:
| But do they train the Linux kernel with their customers'
| data?
| sp332 wrote:
| Using a Linux kernel doesn't mean you have to make your
| whole project GPL, unless your project is specifically
| kernel code.
| wongarsu wrote:
| Neither would the proposed model license. Just like the
| kernel's GPL stops at the userspace boundary, the
| proposed license would only cover the model definition
| and weights.
| ipsum2 wrote:
| Huh? There are plenty of open-source LLMs. Pythia, GPT-NeoX,
| GPT-J, GPT-2, and BLOOM-176 are ones I can think of off the top
| of my head. Pythia is the best-performing one IIRC.
| buzzscale wrote:
| Dolly 2.0 is fully open, Apache License and the tuning dataset
| is employee generated:
|
| https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
| almost_usual wrote:
| The name is obviously inspired by the Anna Dewdney children's
| books.
| michael_j_ward wrote:
| My kids love that book, and my oldest had me read it to his
| preschool class earlier this year.
|
| Here is a much more creative reading by Ludacris [0]
|
| [0] https://www.youtube.com/watch?v=PFtHeo7oMSU
| quickthrower2 wrote:
| As I understand it they have the input data, but next up they are
| creating the model. I could make a joke about drawing an owl ...
| but that would be a bit mean. I am really glad people are working
| on this.
|
| I wonder... who is paying? Will there be restrictions like
| ethics clauses and suchlike? Not necessarily a bad thing if they
| do. Will there be restrictions on commercial use?
| HopenHeyHi wrote:
| Having reproduced the pre-training data, the next step is to
| train a strong base model. As part of the INCITE program, with
| support from Oak Ridge Leadership Computing Facility (OLCF), we
| are training a full suite of models, with the first becoming
| available in the coming weeks. With a strong base
| model in hand, we are excited to instruction tune the models.
| Alpaca illustrated the power of instruction tuning - with merely
| 50K high-quality, diverse instructions, it was able to unlock
| dramatically improved capabilities. Via OpenChatKit, we received
| hundreds of thousands of high-quality natural user instructions,
| which will be used to release instruction-tuned versions of the
| RedPajama models.
|
| Excellent. Sam Altman can blow it out his ass. :)
| rafaelero wrote:
| That's awesome! Are people thinking about training it for more
| than just 1 epoch? I believe Galactica showed that training for
| even 4 epochs is ok. Also, how amazing would it be if the next
| gen of open-source LLMs increased the context window, like
| adding 8k more tokens? That's probably expensive, but totally
| doable.
| sp332 wrote:
| It includes Common Crawl data five times; does that count?
| [deleted]
| Jayakumark wrote:
| This is huge. I was just checking today on what it would take
| someone to get a model similar to LLaMA, since Meta did not
| share the training code or dataset. Looks like they have figured
| out how to make the dataset; the main problem there is pre-
| processing it. The second step is to write the code to train the
| model, and the final one is to do it cheaply.
| brucethemoose2 wrote:
| Maybe they should use whatever Cerebras used. The whole point
| of their own LLM release was as a maximum compute/$
| demonstration on their platform.
|
| Surely there is a better alternative than a bunch of A100s on
| AWS...
| mgaunard wrote:
| Pyjama singular actually works, but I'm not sure Pajamas can be
| singular.
| FranchuFranchu wrote:
| I think that at this point, LLM etymology is way more
| interesting than LLMs themselves.
| [deleted]
| local_crmdgeon wrote:
| So how do I use this? As someone new to the domain.
| tinco wrote:
| You download the 2.67TB of data. Then you run it through
| LLaMA's training script for a couple of months on 40 NVIDIA
| A100s, and you should have yourself a pretty fine large
| language model you could use to host your own ChatGPT service.
| It'll be significantly worse than ChatGPT for reasons that
| aren't yet fully clear, because OpenAI switched its mission
| from protecting the earth from nefarious AI developments to
| itself being the origin of possibly nefarious AI developments.
| DigitalDopamine wrote:
| Renting 40 NVIDIA A100s is around $70k per month (on
| Vultr, I see). So this would only cost $420k for 6 months.
| Seems doable.
|
| Is 40 a100s enough though? I am interested in what this would
| cost.
| mlboss wrote:
| It would be great if this could be done on 3090s. A used 3090
| usually costs $500-1000 to buy.
| skybrian wrote:
| You don't, since they're not done yet. Someone will come up
| with a way to use it when they're done.
| thrtythreeforty wrote:
| I'm very glad people are starting to push back against claims of
| various LLMs being open source. I was beginning to be worried
| that the term would be forcefully redefined in the ML space to
| mean "weights available." With the kickoff of projects like this
| and Databricks' Dolly, I'm heartened to see the community saying
| "no, we are willing to spend the compute to make actually open
| models."
|
| (While it's true that the actual model code of Llama is properly
| open source, it's also useless for inference by itself. Claiming
| these models are open source seems like having your cake and
| eating it too - you get accolades for "open sourcing" but still
| get to control what happens with it.)
| jrm4 wrote:
| Lawyer here, still trying to wrap my head around all of it --
| but it seems as if what may be different here is the extent to
| which all of this is _practically_ "open-source" or even
| "literally free, as in freedom and cost etc" (i.e. generally
| and widely available REGARDLESS of what the law says)
|
| And then coming second appears to be "companies and whoever who
| seek to make money, and intend to make some sort of legal
| restriction part of the biz model."
|
| I have no answers or even predictions here except "this is
| gonna be interesting."
| nickcw wrote:
| To make an analogy with Linux, the weights are (up until now) a
| very large closed source firmware blob.
| ninjin wrote:
| I can only agree. The number of times we have seen corporations
| abuse "open source" and "open science" in the context of large
| language models has been baffling: OPT/LLaMA disallowing
| commercial usage, BLOOM having an ethical non-open license, GLM
| having a clause not to "undermine [the People's Republic of
| China's] national security and national unity", etc. Every
| single one of these models has been happy to ride on the
| coattails of the hard work of the open movements by calling
| itself open, while only paying lip service to the ideals
| and definitions underpinning them.
|
| While RedPajama has yet to commit to a license (from what I can
| see, it is late at night...), they are making _all_ the right
| noises. I am hopeful that we are about to see the floodgates of
| _truly_ open models blow open, and that OpenAI's "moat" will
| prove to be a lot shallower than what they and many others have
| made us believe over the last six months.
| vipulved wrote:
| Hi, this is Vipul, I am a co-founder of Together. We plan to
| release the model weights under Apache 2.0. The amount of
| creativity that Stable Diffusion unleashed for instance is
| only really possible with permissive licenses!
| Taek wrote:
| Are you working at all with Stability, Eleuther, or LAION?
| There have been some rumors that they are doing something
| similar to this and I'm wondering if this is a duplicated
| effort.
|
| Either way, huge fan, it would be awesome to have a LLaMA
| set of weights that are fully open.
| yieldcrv wrote:
| > not to undermine the national security and national unity
|
| this is a required statement to conform with China's
| constitution, or the superseding authoritative social
| contract there.
|
| think of it like if the Patriot Act was an article of the
| constitution instead of a random law subservient to the
| constitution, it would negate other parts of the constitution
| that we hold near and dear.
|
| this is a useful similarity as both constitutions have
| assurances of free speech
|
| just one has a fatal heavily leveraged clause that undermines
| all other parts of that constitution and dictates all facets
| of life
| ninjin wrote:
| This is interesting, thank you. But then how can _any_
| entity in the PRC contribute to open source? Alibaba,
| Baidu, etc. have released plenty of machine learning code
| under _proper_ open licenses in the past (not to mention
| that we have hardware vendors in the PRC contributing to
| say Linux). The story I heard about GLM was that they were
| a high enough public profile project that it caught the
| attention of PRC bureaucrats that pushed for the clause to
| be included.
|
| Regardless of the cause though, the clause runs afoul of
| any definition of open out there.
| yieldcrv wrote:
| simplest answer is that Alibaba and Baidu have more party
| members as stakeholders
|
| but it's not likely that any uncontrollable LLM could start
| spitting out accurate answers, or things unhelpful to Beijing's
| ethos there, and be allowed to operate
|
| the model or the service filtering the model has to be
| controlled
| nacs wrote:
| > this is a required statement to conform with China's
| constitution
|
| But doesn't this mean the model training data also excludes
| anything critical of China?
|
| For example, does their training data include things like this:
| https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests... ?
| [deleted]
| danShumway wrote:
| My only caveat here is that I'm actually really curious to see
| a ruling about whether model weights can be copyrighted.
|
| I don't think the "Open Source" label people are using is
| accurate, and I _heavily_ agree that a common thing that
| companies seem to be trying to do in this space is release what
| are essentially closed models while calling them open, and it's
| a really dangerous direction for AI to go. So nothing in
| your comment is wrong.
|
| But it also feels a little bit like ceding ground to just
| assume that Llama can't be used commercially just because
| Facebook says it can't. I never signed a EULA with them, that
| claim depends entirely on whether or not model weights are
| under copyright (or under some similar form of IP protection,
| some people have brought up trade secrets).
|
| And I don't have a super-strong opinion necessarily, but I'm
| not sure that's a safe assumption for people to make, and I
| kind of think it might be good to throw an asterisk next to
| "can't be used for commercial projects" whenever we talk about
| Llama's restrictions.
|
| But again, I agree with you, it's not the same as saying Llama
| is Open Source. Even if it does get ruled as having weaker
| protections, I don't think the term would really apply.
| jupp0r wrote:
| I haven't done so, but don't you sign an agreement when you
| ask Facebook for a link to download the weights for LLaMA,
| which is currently the only officially supported way of
| getting those weights
| (https://github.com/facebookresearch/llama/tree/main#llama)?
| danShumway wrote:
| I haven't used Llama for anything other than playing around
| to test its capabilities, so I feel fairly comfortable
| admitting publicly that when I did that testing, I did not
| download it from Facebook using an official portal, and I
| didn't sign any agreement about it.
|
| On that subject, to the best of my knowledge, I also
| haven't signed any kind of agreement with OpenAI. I've done
| all of my GPT testing through 3rd-party services or portals
| that don't require signing EULAs to use.
| Ajedi32 wrote:
| Why would you bother using an "officially supported" way of
| downloading the weights if they aren't copyrightable
| anyway?
| worldsayshi wrote:
| > GitHub: GitHub data, filtered by licenses and quality
|
| Does anyone know which licenses are filtered into the dataset?
| mananaysiempre wrote:
| The description on the linked HuggingFace page[1] says MIT, BSD
| and Apache.
|
| [1]
| https://huggingface.co/datasets/togethercomputer/RedPajama-D...
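|
| For anyone curious what such a filter looks like mechanically,
| here is a minimal sketch (the metadata field and file name are
| hypothetical, not the project's actual pipeline):
|
|     import json
|
|     ALLOWED = {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0"}
|
|     def keep(record):
|         # "license" is a hypothetical per-repo metadata field
|         return record.get("license", "").lower() in ALLOWED
|
|     with open("github_dump.jsonl") as f:
|         kept = [r for r in map(json.loads, f) if keep(r)]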
| asddubs wrote:
| it's better than laundering gpl code, but it still breaks the
| licensing terms of those licenses as well, namely attribution
| worldsayshi wrote:
| I guess that could potentially be fixed if emitting
| citations can somehow be implemented in the model, which
| seems at least feasible?
| MangezBien wrote:
| Definitely thought this was about the kid's book.
| franzypants wrote:
| It might be a little late, but I hope datasets start
| incorporating patent texts as well:
|
| 1. It's a large corpus of technical knowledge;
|
| 2. The language is written by experts in a field and reviewed
| many times; and
|
| 3. They have technical drawings with labels and references in
| the text.
|
| The only downside I suppose is that sometimes patents are written
| with "just enough knowledge" to get it granted but not too much
| to give away the secret sauce. That's not really that different
| from many scholarly papers though.
|
| To give a sense of scale, the granted patent texts of 2020
| (without images) are about 160 GB of data, and we have digitized
| grants going back to at least 1970.
| seunosewa wrote:
| You wouldn't want chatbots to answer you with the kind of
| language used in patent texts.
| MayeulC wrote:
| Now, I don't know if I would rely on it, but I've certainly
| thought about asking an LLM to write my patent text for me,
| provided with a technical description.
| sp332 wrote:
| LLMs are actually pretty good at translating info in one form
| into another form.
| return_to_monke wrote:
| with both this and https://Open-Assistant.io, I believe we have
| entered the Stable Diffusion era of large language models
| bugglebeetle wrote:
| Only if they actually start performing at the level of OpenAI's
| models. I'm not a fan of StableDiffusion, but at least their
| models work at general parity with private offerings. All the
| LLaMA derivatives and OpenAssistant stuff perform far below
| GPT-3.5 for everything I've tested.
| jokethrowaway wrote:
| I don't think there is a ready-made alternative to
| Midjourney.
|
| Midjourney is way more versatile than SD. If you start
| getting some fine-tuned models on civitai, trained to do
| some specific tasks well, you can get comparable quality, but
| I haven't seen a single model which is able to replace
| Midjourney.
|
| LLaMA is no different: it has ok performance on generic
| queries but is still far from GPT-3.5. If you start fine-
| tuning you can get good perf on specific tasks.
| htaunay wrote:
| Midjourney to me feels like bowling with bumpers.
|
| Sure, it's very easy to get good results fast, but the
| tuning that avoids "uglier" images is the same tuning that
| removes a lot of versatility compared to SD.
|
| Also, controlnet is a killer feature.
| og_kalu wrote:
| You're 100 percent right. People will say control bla bla
| bla, and that's certainly true: you can get a lot more
| control with Stable Diffusion. But something like 99% of
| digital comics created with AI art use Midjourney. Comics are
| one of the most control- and versatility-demanding use cases
| of generated art, and Midjourney is still easily winning.
| There's a reason for that.
| bugglebeetle wrote:
| SD with ControlNet and some other open source plugins is
| far more flexible than MidJourney. It just has all the
| typical hurdles of OSS vs. commercial offerings. Default
| image quality in Midjourney is better in terms of its
| pedestrian aesthetic biases, but it's not very interesting
| as an actual artistic tool. And I say this as someone who
| doesn't like either service and used to be a commercial
| illustrator before moving into Data Science.
| asynchronous wrote:
| Midjourney also doesn't have controlnet functionality like
| Stable Diffusion now does, which gives SD a huge edge for
| specific posing of a scene.
|
| They're very similar offerings if you're willing to put in
| the work on SD.
| GaggiX wrote:
| >I'm not a fan of StableDiffusion
|
| For some technical reason?
| bugglebeetle wrote:
| No, technically it's all very impressive. My displeasure
| with them was their doing a Napster-style maneuver to force
| artists into accepting AI art generation.
| CuriouslyC wrote:
| The training was legal, and artists don't have a say
| under the current law, so your analogy doesn't hold.
| bugglebeetle wrote:
| Neither of these claims have been truly tested in court
| and vary at the national level, so no, not really.
| grumbel wrote:
| LAION is a German company and what StableDiffusion is
| doing seems to be covered under UrhG § 44b. If artists
| don't want their work inspected by bots they have the
| option to put a robots.txt on their site.
|
| https://www.gesetze-im-internet.de/urhg/__44b.html
|
| https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...
| bobwaycott wrote:
| While this may very well be covered, I think the general
| problem in meatspace is that there was no advance notice
| given to exercise the option to place the proper
| robots.txt directives to opt out of having one's artwork
| collected for model training _before it happened_ , while
| still preserving the ability to have one's artwork
| findable by search engines and the like. I'm sure there
| are more than a handful of people who have no idea that a
| robots.txt file can be used to prevent AI data collection
| --and some may even be surprised to learn the file that's
| been used for search engine crawlers is also going to
| double for AI crawlers.
|
| To push a bit further, there's something that just feels
| particularly _off_ about assuming everyone's content is
| up for grabs unless _the producers_ do the work to _opt
| out_. I think there's an especially palpable bit of irony
| looking at it from the EU's perspective--where _cookies_
| must be _opt-in_ , but grabbing all your copyrighted
| material so companies can do whatever they like with it
| places the burden on the owner to _opt-out_. It just
| feels backward. Perhaps one should have to expressly
| _opt-in_ to allowing their work to be accessible as
| training data. At least then there will be a clear signal
| that the producer of the work can't later complain, as
| they willingly granted permission.
| Karunamon wrote:
| I wonder if these authors would have complained so loudly
| if they had known that other artists were looking at
| their output to learn how to create their own work?
| Absolutely none of them sprung from the womb, tablet in
| hand, to create their work ex nihilo, based on nothing
| other than their own entirely original thoughts.
| bugglebeetle wrote:
| None of this voids the terms of international copyright
| agreements, and someone on Hacker News should know better
| than to claim that a robots.txt on a personal site would
| cover all instances of an image being scraped. I'm not
| saying that artists will necessarily come out on the
| winning end of this battle, but it's also specious to
| claim that because a company says what they did is legal,
| it therefore is.
| pluijzer wrote:
| Do you mean the uncredited use of artists' artwork
| without paying royalties in the training set, or AI art
| generation in general?
| bugglebeetle wrote:
| What I mean is releasing a free service out into the
| world that allows anyone to effectively pirate an
| artist's work. Their intention was obviously to be
| rewarded by established players for doing this bit of
| dirty work, forcing artists to accept terms they wouldn't
| have otherwise.
| moffkalast wrote:
| > not a fan of StableDiffusion, but at least their models
| work at general parity with private offerings
|
| I think you're being a bit generous there. Either I'm using
| it seriously wrong or SD can only generate vague blobs while
| Midjourney can make some proper stuff. It's a larger
| difference than GPT 3.5 vs GPT 4.
| dragonwriter wrote:
| > Either I'm using it seriously wrong or SD can only
| generate vague blobs
|
| You are definitely using it wrong, if the alternative is
| "SD can only generate vague blobs". Even the base SD models
| are _much_ better than that (though, the strength of the SD
| ecosystem is the availability of custom checkpoints,
| hypernetworks, LoRAs, embeddings, ControlNet, etc., not
| just the base models.)
| CuriouslyC wrote:
| LLaMA itself performs comparably to GPT-3.5 (at least the
| 33/65B models), but the RLHF of ChatGPT is much better than
| what the community has produced thus far, and it's tuned to
| work well without tinkering. There will be open-source models
| with that level of fine-tuning in the near future, at which
| point GPT-4 will mainly be superior for stuff like code that
| needs the best possible cohesion and accuracy.
| og_kalu wrote:
| SD isn't comparable to Midjourney. 99% of comics created with
| AI art use Midjourney. It's one of the most glaring use cases
| needing control, and still nothing. There's a reason for that.
| GaggiX wrote:
| I have seen really convincing comics made with SD, much
| more convincing than any comics made with MJ, and the
| reason is really obvious. Models and LoRAs on CivitAI and
| Huggingface are really good, and the fact that MJ can
| generate slightly better images does not justify the total
| lack of control.
| og_kalu wrote:
| Never said you couldn't make impressive stuff with SD but
| feel free to share those comics.
|
| Models on CivitAI are okay. Cool if you're looking for a
| certain style and/or want to create something that looks
| like the training images but style isn't everything.
|
| Midjourney generates much better than "slightly better
| images" and the very fact you say this just tells me
| you've not even used the thing in any real capacity.
| GaggiX wrote:
| I am very familiar with MJ and know very well how SD can
| be used to generate images.
|
| I am the author of submissions such as:
| https://news.ycombinator.com/item?id=35181433, and I am
| one of the people responsible for the enthusiasm behind
| the performance of MJ v5.
|
| But no, MJ is not much better if you know how to use SD,
| although if what you did with SD was just put a prompt in
| a huggingface space, I can understand why you say that.
|
| >I never said you can't do impressive things with SD, but
| feel free to share these comics.
|
| I am arguing that they are better than any comics made
| with MJ, not that they are simply impressive, that's
| really the entire point. I know some on Pixiv, you can
| look them up if you want; I am not linking them for
| obvious reasons (to say they are NSFW is putting it
| mildly).
| og_kalu wrote:
| >But no, MJ is not much better if you know how to use SD,
| although if what you did with SD was just put a prompt in
| a huggingface space, I can understand why you say that.
|
| I'm the person behind these -
| https://huggingface.co/ogkalu - so I think it's safe to say I
| know something about SD's capabilities.
|
| >I am arguing that they are better than any comics made
| with MJ, not that they are simply impressive, that's
| really the entire point.
|
| Sure, that's why I'm asking you to link these comics that
| are supposedly better than anything Midjourney has ever
| produced. With a claim like that, I'm sure you understand
| wanting to see results.
|
| >You can go look them up on Pixiv if you want, they host
| some; I am not linking them for obvious reasons (to say
| they are NSFW is putting it mildly).
|
| So you can't link anything that isn't NSFW on Pixiv? Lol,
| that just solidifies my point. Frankly, if the best you
| can come up with is pseudo porn (or maybe not pseudo, lol)
| on Pixiv (I don't imagine any readers of that will care
| about the things I'm looking for), then that's not a very
| good look.
| GaggiX wrote:
| You seem surprised that porn brings innovation, but you
| shouldn't be: if anyone is obsessed with creating the best
| possible illustration, it is indeed a Pixiv user, or more
| generally a user who wants to create porn of their favorite
| character. Moreover, I know these comics not because I have
| a weird obsession with going to read comics that were
| created by an AI; I know them because they are good enough
| to have gone on trend as NSFW comics, whereas the comics
| made by MJ are known not because they are good comics but
| because they are made by MJ (so it's cool, I guess). So I
| don't see how it can solidify your point of view, ahah. If
| you can't control the generation, every panel will look
| different - a collage of images - and that's why the comics
| made by MJ seem to be known just because they are made by
| MJ, and not because they are of interest to other
| communities like the NSFW comics on Pixiv. Also for this
| reason, I have not saved links to these posts; I found them
| randomly while browsing Pixiv, which is another reason why
| you should look for them yourself.
| EveYoung wrote:
| In my experience, the threshold to be useful is much lower
| than GPT-3.5. These smaller models can "easily" be finetuned
| to achieve comparable performance on a specific task. For
| example, I've achieved promising results for data
| summarisation and image captioning (BLIP2-based) using
| Alpaca.
|
| Also, server/hardware costs are still a limiting factor for
| running and finetuning the larger 33/65B Llama models,
| especially if they can only be used for personal toy
| projects.
| bugglebeetle wrote:
| I don't use LLMs for anything image related, so I can't
| speak to their value there, but almost all simpler NLP
| tasks are IMO better handled using other techniques that
| predate them. I've yet to see an example where fine-tuning
| is cheaper/more efficient/better performing than older
| solutions to these problems.
| EveYoung wrote:
| If older techniques work for you, there is of course no
| reason to switch to LLMs besides general curiosity or to
| explore what's already possible. That said, in my case I
| was able to generate much more engaging text summaries
| of tabular data using a Llama derivative.
| idle_zealot wrote:
| Didn't Open Assistant just announce that they weren't releasing
| their model weights due to safety concerns? Seems like another
| "Open" AI initiative.
| circuit10 wrote:
| Unless something changed, I thought it was that they
| literally cannot legally release the weights that are based
| on LLaMA (except maybe with an xor thing) so they're going to
| train it based on something else
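|
| The "xor thing", roughly: ship only delta = base XOR tuned, so
| that anyone who already has the base LLaMA weights can recover
| the tuned ones, while the delta alone is useless. A minimal
| sketch, with hypothetical file names, assuming both checkpoints
| are raw tensors of identical shape:
|
|     import numpy as np
|
|     base = np.fromfile("llama_layer.bin", dtype=np.uint8)
|     delta = np.fromfile("xor_delta.bin", dtype=np.uint8)
|     tuned = np.bitwise_xor(base, delta)  # undoes tuned XOR base
|     tuned.tofile("tuned_layer.bin")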
| mindcrime wrote:
| Is any of the Open Assistant stuff based on LLaMA? I
| thought they released (at least some version) before LLaMA
| even dropped?
| circuit10 wrote:
| Yes, there's also something based on Pythia but it's a
| smaller model
| selfhoster11 wrote:
| IIRC, the video said they will train it on a properly open-
| source model as well.
| akiselev wrote:
| That was a joke in the release video. The Pythia model is
| already released at [1] and the deltas for the LLaMa model
| should be up here [2] in the next few days.
|
| [1] https://huggingface.co/OpenAssistant/oasst-
| sft-4-pythia-12b-...
|
| [2] https://huggingface.co/OpenAssistant/oasst-llama-based-
| model...
| RandomBK wrote:
| Unfortunately [2] is just a placeholder for now, but it
| does look like the intent is to publish the weights.
| Taek wrote:
| It's also relatively cheap to make your own llama-30B
| weights; the real value of OpenAssistant is in the
| training data, and all of that data has been made
| available.
|
| The OpenAssistant effort gets an A+ for open source
| contributions.
| fortyseven wrote:
| There was a dumb joke along those lines in an announcement
| video, meant as a jab at OpenAI. It's easy to miss the "just
| kidding". (I did, initially.)
| detrites wrote:
| The announcement video by Yannic contained a (lengthy) gag
| to that effect; has it been taken out of context, or did
| something actually happen now?
|
| https://youtube.com/watch?v=ddG2fM9i4Kk&t=132
|
| It's easy to miss, but after the negative build-up he says:
| "and... I'm kidding!"
| ricardobeat wrote:
| Dangerous gag, he said "I'm joking" so quickly it's very
| easy to miss. I imagine the commenter is not alone in
| having that wrong impression.
| idle_zealot wrote:
| Oh, ha, yeah this is exactly the gag I fell for. I just
| noped out of the video and wrote off the project as this
| was the first I ever heard of them, and their website just
| has a signup and no downloads I could see.
|
| Too bad my original comment is too old to edit.
| [deleted]
| [deleted]
| Tepix wrote:
| Great initiative. Next, we need a lot of compute! Perhaps
| Tenstorrent wants to make a good impression?
| rnosov wrote:
| > we are training a full suite of models, with the first
| becoming available in the coming weeks.
|
| Sounds like they already have the compute and began training.
| DogTweezers wrote:
| [flagged]
| sytelus wrote:
| Great to see this, but the dataset is the trickiest part. There
| is no way to confirm whether this is a good dataset unless a
| model is actually trained on it. To reproduce LLaMA, you need
| $2M of compute.
| Robotbeat wrote:
| Do you have a calculation that shows where that $2M number
| comes from, EXACTLY?
| eiz wrote:
| https://arxiv.org/pdf/2302.13971.pdf table 15: 1,770,394
| A100-80GB hours to train the entire model suite. At the going
| rate for cloud 8xA100-80GB nodes (~$12/hr, if you could
| actually get capacity) that is ~$2.6M, under extremely
| optimistic assumptions. YMMV on bulk pricing ;) "the more you
| buy the more you save"
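|
| Spelled out (assuming that $12/hr rate covers a whole 8-GPU
| node):
|
|     gpu_hours = 1_770_394        # A100-80GB hours, table 15
|     node_hours = gpu_hours / 8   # 8 GPUs per node
|     print(node_hours * 12)       # 2655591.0, i.e. ~$2.6M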
| Robotbeat wrote:
| Hmmm... the values in the 7B model seem feasible. An order
| of magnitude lower GPU hours, plus presumably the lower
| parameter count means it probably could fit on a 24GB
| Radeon RX 7900 XTX, which has higher single precision flops
| than the A100 and costs $1000 instead of $15,000.
|
| An order of magnitude lower GPU-hour time, plus if you
| train it for 210 days instead of 21 days, means you could
| do a 7B model with 20 consumer GPUs which are $1000 apiece.
| $20k, not counting mainboard, etc. Really not bad. Might
| even be doable as a volunteer project.
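|
| (Checking that against the paper's numbers: the 7B run took
| 82,432 A100-hours per table 15. If a 7900 XTX matched an A100
| hour-for-hour - a generous assumption - then 82,432 hours / 20
| GPUs / 24 h per day is ~172 days, the same ballpark as the 210
| days above.)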
| sp332 wrote:
| Page 4 https://arxiv.org/abs/2302.13971
|
| _When training a 65B-parameter model, our code processes
| around 380 tokens /sec/GPU on 2048 A100 GPU with 80GB of RAM.
| This means that training over our dataset containing 1.4T
| tokens takes approximately 21 days._
|
| At $4/GPU-hour per A100 80GB GPU, that's $4 * 2,048 * 21 * 24
| = $4,128,768.
| Robotbeat wrote:
| Hmmm... so a 7 billion parameter model could probably be
| trained on consumer GPUs for one or two orders of magnitude
| lower cost, particularly if you didn't go well beyond
| Chinchilla-optimal training time.
| t00 wrote:
| Obligatory Dall-E version
| https://labs.openai.com/s/Httd7N2ZF5kynUnzp0vCVOjN
| [deleted]
| [deleted]
| [deleted]
| dwheeler wrote:
| Has anyone investigated to see if OpenCyc can be converted to
| natural language (presumably English) and then ingested into
| this? Cyc made an attempt years ago to "encode common sense" and
| a subset called OpenCyc was released. That might be a great way
| to kickstart information representation of the real world. The
| latest version of Cyc is proprietary, but I think OpenCyc is an
| open subset (though I'm having trouble confirming that, so the
| licensing may not be good).
|
| Some links: https://github.com/bovlb/opencyc
| https://github.com/asanchez75/opencyc
| sp332 wrote:
| I've been wondering this for a while now. Cyc has tons of
| knowledge in a white-box, formal system. If it just had a
| front-end that could convert from natural language to Cyc
| knowledge queries and back, we wouldn't have to worry so much
| about hallucinations, catastrophic forgetting, or trying to fit
| the entire database in VRAM.
| speed_spread wrote:
| My understanding is that LLM and Cyc are fundamentally
| different forms of AI. Even if you could turn OpenCyc into text
| rules, once ingested it would just dissolve into the ocean of
| training text data and would not significantly gain more
| apparent "common sense" than it already had. Maybe a more
| interesting combination could be to have both Cyc and LLM
| working side by side and comparing notes before agreeing on a
| result.
| piannucci wrote:
| If the name is a reference to Ogden Nash's poem then I am very
| tickled:
| https://www.madisonpubliclibrary.org/engagement/poetry/poem-...
| ricketycricket wrote:
| I'd guess it's the book Llama Llama Red Pajama:
| https://openlibrary.org/books/OL24377652M/Llama_Llama_Red_Pa...
| FloatArtifact wrote:
| Code generation: I wonder about the difference in output given
| the order of operations in training and fine-tuning. What if the
| model was trained on the documentation and the code base for
| Python, as an example, and then fine-tuning came from training
| on actual Python code from GitHub?
|
| If the model understands the Python documentation and the
| standard library/interpreter implementation, is there then a
| reduction in the data needed for context in other code?
| nailer wrote:
| Someone on HN made the point that weights can't even be under
| copyright - they lack two of the requirements for being
| copyrightable:
|
| https://news.ycombinator.com/item?id=35508651
| bobwernstein1 wrote:
| When will the first code-writing-specific model arrive?
| smrtinsert wrote:
| So is the next step for someone to come in and fine-tune on top
| of it in order to make it a Vicuna? Or can current Vicuna deltas
| be applied?
| rafaelero wrote:
| Yeah, it's pretty trivial to change the base model from LLaMA
| to this next one. You just have to finetune it with the same
| data used previously to train Vicuna.
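|
| In HuggingFace terms, that job might look roughly like this
| sketch (the checkpoint name and data file are hypothetical
| placeholders, not released artifacts; it also assumes the
| tokenizer defines a pad token):
|
|     from datasets import load_dataset
|     from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                               DataCollatorForLanguageModeling,
|                               Trainer, TrainingArguments)
|
|     base = "togethercomputer/redpajama-7b"  # hypothetical
|     tokenizer = AutoTokenizer.from_pretrained(base)
|     model = AutoModelForCausalLM.from_pretrained(base)
|
|     # Vicuna was tuned on ShareGPT conversations
|     data = load_dataset("json", data_files="sharegpt.json")["train"]
|     data = data.map(lambda ex: tokenizer(ex["text"],
|                                          truncation=True,
|                                          max_length=2048))
|
|     trainer = Trainer(
|         model=model,
|         args=TrainingArguments(output_dir="vicuna-style",
|                                num_train_epochs=3,
|                                per_device_train_batch_size=1),
|         train_dataset=data,
|         # collator copies input_ids to labels for next-token loss
|         data_collator=DataCollatorForLanguageModeling(tokenizer,
|                                                       mlm=False),
|     )
|     trainer.train()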
| wesleychen wrote:
| There's no model yet, only a dataset.
| omneity wrote:
| My understanding is that LLaMa's architecture is open, so the
| most difficult part is:
|
| 1. Getting data of equal or better quality
|
| 2. Securing the funding/hardware required for training
|
| 3. Learning/figuring out the training challenges needed to
| tune the process (the PhD part)
|
| It seems #1 is relatively the lowest-hanging fruit and a
| prerequisite for the other two, and that's what the project
| is (rightfully) tackling at this stage. #2 could be solved in
| many ways, and doesn't require much innovation if the project
| and the team are solid. Which takes me to #3, which on the
| other hand seems to be the make-or-break part of the project.
|
| I'm not one to doubt the technical prowess of the
| RedPajama team and their contributors; I rather see it
| economically. How can an AI open-source project compete with
| big tech in attracting the brilliant minds of our generation?
| It's enough to look at levels.fyi to see the battle is not
| ... level.
|
| There's a serious economic challenge here to having any
| sort of sustainable open-source initiative in AI.
| Havoc wrote:
| Love this - I'll happily accept a bit of a quality trade-off for
| a pure open model. It's a bit like how I'm willing to accept
| trade-offs to ensure my IoT gear is local-only, even if that
| means losing some cloud convenience.
| simonw wrote:
| The training data - all 1.2 trillion tokens - can be downloaded
| by grabbing each of the 2,084 URLs listed here:
| https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
|
| I ran a HEAD request against them all to sum up the total file
| size, and it's 2.67TB total.
|
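| A minimal sketch of that measurement (assuming every URL
| answers a HEAD request with a Content-Length header):
|
|     import urllib.request
|
|     manifest = ("https://data.together.xyz/"
|                 "redpajama-data-1T/v1.0.0/urls.txt")
|     urls = urllib.request.urlopen(manifest).read().decode().split()
|
|     total = 0
|     for url in urls:
|         req = urllib.request.Request(url, method="HEAD")
|         with urllib.request.urlopen(req) as resp:
|             total += int(resp.headers["Content-Length"])
|
|     print(f"{total / 1e12:.2f} TB")  # ~2.67
|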
| Here's a Datasette Lite URL that lets you explore the size
| metadata about those files:
| https://lite.datasette.io/?json=https://gist.github.com/simo...
|
| And a SQL query that shows the breakdown across the different
| sources:
|
| https://lite.datasette.io/?json=https://gist.github.com/simo...
|
| Sizes here are in GB:
|
|     common_crawl     1341.62
|     c4                806.77
|     github            212.18
|     wikipedia         111.89
|     book              100.43
|     arxiv              87.35
|     stackexchange      74.55
|
| Common Crawl is in there a few times - they have the following
| folders:
|
|     common_crawl/2020-05   198 files
|     common_crawl/2021-04   176 files
|     common_crawl/2023-06   175 files
|     common_crawl/2022-05   157 files
|     common_crawl/2019-30   153 files
|
| And then C4 as well, which is "a colossal, cleaned version of
| Common Crawl's web crawl corpus. It was based on Common Crawl
| dataset": https://paperswithcode.com/dataset/c4
| afro88 wrote:
| Interesting they're allowed to use stackexchange. I don't know
| much about the legalities of scraping. Was this an agreement
| between them, or is it simply ok to scrape and use the data in
| a model?
| progbits wrote:
| https://stackoverflow.com/help/licensing
|
| Doesn't this imply the produced model has to be CC-BY-SA too?
| wongarsu wrote:
| By that line of reasoning, GitHub copilot would have to be
| GPL. Until somebody fights about this in court we don't
| really know. But even in the worst case the CC-BY-SA is one
| of the easier licenses to fulfill, not much worse than the
| MIT-licensed code contained in the dataset.
| gattilorenz wrote:
| Welcome to this can of worms.
|
| CC-BY-SA content needs attribution too, but I don't see
| the(se) model(s) in their current state being able to do so.
|
| I imagine we're gonna see the IBM PC BIOS/Unix/ReactOS
| "tainted code" arguments again in court, except this time it's
| not a human who is more-or-less knowingly responsible for
| sneaking in copyrighted code.
| doctoboggan wrote:
| I am a little concerned that they have only about 60% of the
| code tokens (GitHub and Stack Exchange). Given that so far the
| only concrete use case I have for LLMs is coding assistance, I
| wouldn't want this open-source model to be any lower quality in
| that area.
|
| In your opinion, do you think this will hamper the model at all?
| Or is it still more than enough to get good coding assistance?
| csris wrote:
| Nice catch! We sampled the github dataset to match the total
| # tokens seen by LLaMA during training: ~64B tokens (they
| only pass through 0.64 of their total Github dataset
| according to the paper). We have a lot of Github data and
| will make them available soon. Note, we also have not built
| this for compute optimal training. We are following LLaMA's
| lead and are training on more data for longer to optimize for
| quality, not compute.
| doctoboggan wrote:
| Very good to hear that you are optimizing for inference
| rather than training. I've tried LLaMA and its various
| instruction-tuned siblings and have yet to get equivalent
| performance to GPT-3.5 on coding tasks. Seeing how the base
| model performed relative to GPT-3 on the various benchmarks
| gives me hope that the difference is just in RLHF or other
| fine-tuning steps. I really hope the community is able to
| get there, especially if the resulting model is able to be
| quantized with minimal loss.
| jstx1 wrote:
| Smaller % of training data doesn't necessarily mean lower
| quality.
| sp332 wrote:
| As mentioned in the post, the smaller models are trained well
| past "compute-optimal" amounts of data and I would expect are
| well into diminishing returns. On the other hand, large
| models are good one-shot and few-shot learners, and might be
| able to pick up enough context from your prompt alone to be
| useable, even if it wasn't specifically trained on your use
| case.
| [deleted]
| Minus0 wrote:
| In this context compute optimal isn't quite the same as
| diminishing returns. If you look at the loss graphs in the
| Llama paper, you can see that even the curves for the
| smaller models were still going down at the time they
| stopped training and weren't anywhere near plateauing yet.
| LLMs are notoriously data hungry and will take a long time
| to reach convergence.
|
| Compute optimal here means the point at which it makes
| sense to move from a smaller to a larger model assuming
| that: (a) you have a fixed compute budget of FLOPs, and (b)
| you want to train the best model possible. The problem is
| that this applies only to training and assumes nothing
| about the cost of inference. If you actually need to deploy
| these trained models and support them long-term for
| hundreds, thousands, even millions of people to use, would
| you rather deploy a 13B model or a 30B model at the same
| level of quality, even if the 13B model would be more
| costly to train?
|
| There is going to be a point at which these models plateau
| and further improvement will not be possible without moving
| to a larger model, but Llama doesn't get there quite yet.
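|
| (Rough numbers, using the Chinchilla paper's ~20 tokens per
| parameter as the compute-optimal rule of thumb:
|
|     params = 7e9
|     optimal_tokens = 20 * params          # ~140B tokens
|     llama_tokens = 1e12                   # LLaMA-7B saw 1T
|     print(llama_tokens / optimal_tokens)  # ~7.1
|
| so the 7B model was trained roughly 7x past the compute-optimal
| point, and its loss was still falling.)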
| bkm wrote:
| Relevant:
| https://twitter.com/abacaj/status/1647999551964323844
| totoglazer wrote:
| This tweet is misunderstanding the papers.
| t3estabc wrote:
| [dead]
| simonw wrote:
| No idea!
|
| I wonder how hard it would be to fine-tune something built on
| RedPajama on further code examples to improve performance
| there.
| fnands wrote:
| Nice. Thanks for the summary.
|
| So ~4x the size of the Pile, any idea how it stacks up in terms
| of quality to other big datasets?
| rcpt wrote:
| I'm kind of surprised how small that dataset is
| simonw wrote:
| Wrote this up as a blog post:
| https://simonwillison.net/2023/Apr/17/redpajama-data/
| csris wrote:
| Hi! I'm the VP of Engineering at Together. Thanks for writing
| up these instructions! FYI, you can also download all the
| files with one wget command: wget -i
| https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
|
| This is also mentioned on the dataset card for redpajama-
| data-1T on Huggingface [1].
|
| [1]: https://huggingface.co/datasets/togethercomputer/RedPaja
| ma-D...
| simonw wrote:
| I made sure to include that in my blog post - along with a
| note that you need 2.67TB of disk space first!
| macinjosh wrote:
| This guy has kids, so we all know he.. nevermind. I love the
| name, being a parent myself.
___________________________________________________________________
(page generated 2023-04-17 23:00 UTC)