[HN Gopher] Meta AI releases Code Llama 70B
___________________________________________________________________
Meta AI releases Code Llama 70B
Author : albert_e
Score : 422 points
Date : 2024-01-29 17:11 UTC (5 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| turnsout wrote:
| Given how good some of the smaller code models are (such as
| Deepseek Coder at 6.7B), I'll be curious to see what this 70B
| model is capable of!
| jasonjmcghee wrote:
| My personal experience is that Deepseek far exceeds Code Llama
| at the same size, but Code Llama was released quite a while
| ago.
| turnsout wrote:
| Agreed--I hope Meta studied Deepseek's approach. The idea of
| a Deepseek Coder at 70B would be exciting.
| whimsicalism wrote:
| The "approach" is likely just training on more tokens.
| imjonse wrote:
| It's the quality of the data and the training method, not
| just the number of tokens (per their paper released a few
| days ago).
| hackerlight wrote:
| There's a DeepSeek Coder around 33B, and it has almost
| identical performance to the 6.7B on benchmarks.
| ignoramous wrote:
| _AlphaCodium_ is the newest kid on the block that's SoTA
| _pass@5_ on coding tasks (authors claim at least 2x better than
| GPT4): https://github.com/Codium-ai/AlphaCodium
|
| As for small models, Microsoft has been making noise with the
| unreleased _WaveCoder-Ultra-6.7b_
| (https://arxiv.org/abs/2312.14187).
| eurekin wrote:
| Are weights available?
| moyix wrote:
| AlphaCodium is more of a prompt engineering / flow
| engineering strategy, so it can be used with existing
| models.
| passion__desire wrote:
| AlphaCodium author says he should have used DSPy
|
| https://twitter.com/talrid23/status/1751663363216580857
| hackerlight wrote:
| Is this better than GPT4's Grimoire?
| CrypticShift wrote:
| Phind [1] uses the larger 34B Model. Still, I'm also curious
| what they are gonna do with this one.
|
| [1] https://news.ycombinator.com/item?id=38088538
| hackerlight wrote:
| Seems worse according to this benchmark:
|
| https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...
| colesantiago wrote:
| Llama is getting better and better; I've heard this and Llama 3
| will start to be as good as GPT-4.
|
| Who would have thought that Meta, which has been chucking
| billions at the metaverse, would be at the forefront of open
| source AI.
|
| Not to mention their stock is up and they are worth $1TN, again.
|
| Not sure how I feel about this given all the scandals that have
| plagued them: the massive 1BN fine from the EU, Cambridge
| Analytica, and, last of all, causing a genocide in Myanmar.
|
| Goes to show that nobody cares about any of these scandals and
| everyone just moves on to the future, allowing Facebook to keep
| collecting all this data for their models.
|
| If any other startup or mid sized company had at least _two_ of
| these large scandals, they would be dead in the water.
| austinpena wrote:
| I'm really curious what their goal is
| Ruq wrote:
| Zuck just wants another robot to talk to.
| throwup238 wrote:
| Poor guy just wants a friend that won't sell him out to the
| FTC or some movie producers.
| hendersoon wrote:
| Their goals are clear: dominance and stockholder value. What
| I'm curious about is how they plan to _monetize it_.
| refulgentis wrote:
| Use it in products, ex. the chatbots
| regimeld wrote:
| Commoditize your complements.
| refulgentis wrote:
| Same as MS, in the game, in the conversation, and ensuring
| next-gen search margins approximate 0.
| Arainach wrote:
| Disclaimer: I do not work at Meta, but I work at a large tech
| company which competes with them. I don't work in AI,
| although if my VP asks don't tell them I said that or they
| might lay me off.
|
| Multiple of their major competitors/other large tech
| companies are trying to monetize LLMs. OpenAI maneuvering its
| early lead into a dominant position would create another
| potential major competitor. If releasing these models slows
| or hurts them, that is in and of itself a benefit.
| michaelt wrote:
| Why?
|
| What benefit is there to grabbing market share from your
| competitors... in a business you don't even want to be in?
|
| By that logic you could justify any bizarre business
| decision. Should Google launch a social network, to hurt
| their competitor Facebook? Should Facebook, Amazon and
| Microsoft each launch a phone?
| Filligree wrote:
| > Should Google launch a social network, to hurt their
| competitor Facebook?
|
| I mean, Google _did_ launch a social network, to hurt
| their competitor Facebook. It was a whole thing. It was
| even a really nice system, eventually.
| wanderingstan wrote:
| And it turned out that Facebook had quite a moat with
| network effects. OpenAI doesn't have such a moat, which
| may be what Meta is wanting to expose.
| esafak wrote:
| Google botched the launch, and they never nurture
| products after launch anyway. Google+ could have been
| more successful.
| zaat wrote:
| I enjoyed using Google+ more than any other social
| network, and managed to create new connections and/or
| have standard, authentic, real conversations with people
| I didn't know - most of them ordinary people with shared
| interests whom I probably wouldn't have met otherwise,
| some of them people I can't believe I could have
| connected with directly in any other way: newspaper and
| news site editors, major SDK developers, and even Kevin
| Kelly.
| Arainach wrote:
| >Should Facebook, Amazon and Microsoft each launch a
| phone?
|
| * https://www.lifewire.com/whatever-happened-to-the-
| facebook-p...
|
| * https://en.wikipedia.org/wiki/Fire_Phone
|
| * https://en.wikipedia.org/wiki/Windows_Phone
| Arainach wrote:
| Who says they don't want to be in the market? Facebook
| has one product. Their income is entirely determined by
| ads on social media. That's a perilous position subject
| to being disrupted. Meta desperately wants to diversify
| its product offerings - that's why they've been throwing
| so much at VR.
| tsunamifury wrote:
| Their goal is to counter the competition. You should rarely
| pick the exact same strategy as your competitor and count on
| outgunning them; rather, you should counter them. OpenAI is
| ironically closed, so Meta will be open. If you can't beat
| them, you should try to degrade the competitor's value case.
|
| It's a smart move IMO.
| dsabanin wrote:
| I think Meta's goal is to subvert Google, MS and OpenAI,
| after realizing it's not positioned well to compete with them
| commercially.
| api wrote:
| Could also be that these smaller models are a loss leader
| or advertisement for a future product or service... like a
| big brother to Llama3 that's commercial.
| idkyall wrote:
| I believe there were rumors they are developing a
| commercial model: e.g. https://www.ft.com/content/01fd640
| e-0c6b-4542-b82b-20afb203f...
| akozak wrote:
| I would guess mindshare in a crowded field, ie discussion
| threads just like this one that help with recruiting and tech
| reputation after a bummer ~8 years. (It's best not to
| overestimate the complexity/# of layers in a bigco's
| strategic thinking.)
| jedberg wrote:
| Commoditizing your complement. If all your competitors need a
| key technology to get ahead, you make a cheap/free version of
| it so that they can't use it as a competitive advantage.
| nindalf wrote:
| The complement being the metaverse. You can't handcraft the
| metaverse because it would be infeasible. If LLMs are a
| commodity that everyone has access to, then it can be done
| on the cheap.
|
| Put another way - if OpenAI were the only game in town how
| much would they be charging for their product? They're
| competing on price because competitors exist. Now imagine
| the price if a hypothetical high-quality open source model
| existed that customers can use for "free".
|
| That's the future Meta wants. They weren't getting rich
| selling shovels like cloud providers are, they want
| everyone digging. And everyone digs when the shovels are
| free.
| aantix wrote:
| To undermine the momentum of OpenAI.
|
| If Meta were at the forefront, these models would not be
| openly available.
|
| They are scrambling.
| ppsreejith wrote:
| If you go by what Zuck says, he calls this out in previous
| earnings reports and interviews[1]. It mainly boils down to 2
| things:
|
| 1. Similar to other initiatives (mainly opencompute but also
| PyTorch, React etc), community improvements help them improve
| their own infra and helps attract talent.
|
| 2. Helping people create better content ultimately improves
| quality of content on their platforms (Both FoA & RL)
|
| Sources:
|
| [1]Interview with verge:
| https://www.theverge.com/23889057/mark-zuckerberg-meta-ai-
| el... . Search for "regulatory capture right now with AI"
|
| > Zuck: ... And we believe that it's generally positive to
| open-source a lot of our infrastructure for a few reasons.
| One is that we don't have a cloud business, right? So it's
| not like we're selling access to the infrastructure, so
| giving it away is fine. And then, when we do give it away, we
| generally benefit from innovation from the ecosystem, and
| when other people adopt the stuff, it increases volume and
| drives down prices.
|
| > Interviewer: Like PyTorch, for example?
|
| > Zuck: When I was talking about driving down prices, I was
| thinking about stuff like Open Compute, where we open-sourced
| our server designs, and now the factories that are making
| those kinds of servers can generate way more of them because
| other companies like Amazon and others are ordering the same
| designs, that drives down the price for everyone, which is
| good.
| PheonixPharts wrote:
| I imagine their goal is to simultaneously show that Meta is
| still SotA when it comes to AI and at the same time feed a
| community of people who will work for free to essentially
| undermine OpenAI's competitive advantage and make life worse
| for Google since at the very least LLMs tend to be a better
| search engine for most topics.
|
| There's far more risk if Meta were to try to directly compete
| with OpenAI and Microsoft on this. They'd have to manage the
| infra, work to acquire customers, etc, etc on top of building
| these massive models. If it's not a space they really want to
| be in, it's a space they can easily disrupt.
|
| Meta's late game realization was that Google owned the web
| via search and Apple took over a lot of the mobile space with
| their walled garden. I suspect Meta's view now is that it's
| much easier to just prevent something like this from
| happening with AI early on.
| blackoil wrote:
| Devil's advocate: they have to build it anyway for the
| metaverse and in general. Management has no interest in going
| into the cloud business. They had Parse a long time back, but
| that is done. So why not release it? They get
| goodwill/mindshare, may set the industry standard, and get
| community benefit. It isn't very different from React, Torch,
| etc.
| ttul wrote:
| If you want to employ the top ML researchers, you have to
| give them what they want, which is often the ability to share
| their discoveries with the world. Making Llama-N open may not
| be Zuckerberg's preference. It's possible the researchers
| demanded it.
| wrsh07 wrote:
| Rule 5: commodify your complement
|
| Content generation is complementary to most of meta's apps
| and projects
| elorant wrote:
| Prevent OpenAI from dominating the market, and at the same
| time have the research community enhance your models and
| identify key use cases.
| pennomi wrote:
| Meta basically got a ton of free R&D that directly applies
| to their model architecture. Their next generation AIs will
| always benefit from the techniques/processes developed by
| the clever researchers and hobbyists out there.
| megaman821 wrote:
| They were going to make most of this anyway for Instagram
| filters, chat stickers, internal coding tools, VR world
| generation, content moderation, etc. Might as well do a
| little extra work to open source it since it doesn't
| really compete with anything Meta is selling.
| make3 wrote:
| With PyTorch (& so many open publications), Meta has made an
| unimaginably strong contribution to AI for a while now.
| fragmede wrote:
| _Model available_, not open source. These models aren't open
| source because we don't have access to the datasets or the
| full training code, so we couldn't recreate the models even
| if we had the GPU time available.
| colesantiago wrote:
| Everyone using AI in production is using PyTorch by Meta.
|
| Which is open source.
|
| I do not know anybody important in the AI space apart from
| Google using TensorFlow.
| Philpax wrote:
| That may be true, but it's largely irrelevant. The ML
| framework in use has no bearing on whether or not you have
| the data required to reproduce the model being trained with
| that framework.
| colesantiago wrote:
| Do you and the GP have 350K GPUs and quality data to
| reproduce 1:1 whatever Facebook releases in their repos?
|
| Even if you want to reproduce the model and they give you
| the data, you would need to do this at Facebook scale, so
| you and the GP are just making moot points all around.
|
| https://about.fb.com/news/2023/05/metas-infrastructure-
| for-a...
|
| https://www.theregister.com/2024/01/20/metas_ai_plans/
|
| The fact that these models are coming from Meta in the
| open, rather than Google, which releases only papers with
| _no_ model, tells me that Meta's models are open enough
| for everyone to use.
|
| Besides, everyone using the Pytorch framework benefits
| Meta in the same way they were originally founded as a
| company:
|
| Network effects
|
| It's relevant.
| Philpax wrote:
| There are organisations that are capable of reproduction
| (e.g. EleutherAI), but yes, you're right, not having the
| data is largely irrelevant for most users.
|
| The thing that bothers me more is that it's not actually
| an open-source licence; there are restrictions on what
| you can do with it, and whatever you do with the model is
| subject to those restrictions. It's still very useful and
| I'm not opposed to them releasing it under that licence
| (they need to recoup the costs somehow), but "open-
| source" (or even "open") it is not.
| WhackyIdeas wrote:
| Giving so much to the open source LLM scene does seem like the
| nicest thing Facebook has ever done. I know it might have been
| started by a leaker, but they have given so much voluntarily.
| I mean, don't get me wrong, I don't like the company, but I do
| really like some of the choices they have made recently.
|
| But in the back of my mind I do wonder why. I should be
| suspicious of their angle, and I will keep thinking about it.
| Is it paranoid to think that maybe their angle is embedding
| something like metadata - a style of generated code unique to
| each machine - so they can trace generated code back to
| different people? Is that their angle, or am I biased by
| remembering who they have been for the past decade?
| cm2012 wrote:
| Cambridge Analytica is not a real scandal (did not affect any
| elections), and FB did not cause a genocide in Myanmar (they
| were a popular message board during a genocide, which is not
| the same thing)
| ramshanker wrote:
| Are these trained on internal codebases or just public
| repositories?
| albertzeyer wrote:
| Without knowing any details, I'm almost sure that they did not
| train it on internal code, as it might be possible to reveal
| that code otherwise (given the right prompt, or just looking at
| the weights).
| make3 wrote:
| people are able to extract training data from llms with
| different methods, so there's no way this was trained on
| internal code
| eigenvalue wrote:
| Would be a really bad idea to train on internal code I would
| think. Besides, there is no shortage of open source code (even
| open source created by Meta) out there.
| changoplatanero wrote:
| Correct that it's a bad idea to train on internal code.
| However surprisingly there is a shortage of open source code.
| These models are trained on substantially all the available
| open source code that these companies can get their hands on.
| andy99 wrote:
| The github [0] hasn't been fully updated, but it links to a
| paper [1] that describes how the smaller code llama models were
| trained. It would be a good guess that this model is similar.
|
| [0] https://github.com/facebookresearch/codellama [1]
| https://ai.meta.com/research/publications/code-llama-open-fo...
| Ninjinka wrote:
| Benchmarks?
| Havoc wrote:
| Not sure who this is aimed at - the average programmer probably
| doesn't have the gear on hand to run this at the required pace.
|
| Cool nonetheless.
| Spivak wrote:
| People who want to host the models, presumably; AWS Bedrock
| will definitely include it.
| blackoil wrote:
| How feasible would it be to fine-tune it on internal code and
| have an enterprise copilot?
| keriati1 wrote:
| We actually already run an in-house Ollama server prototype
| for coding assistance with DeepSeek Coder, and it is pretty
| good. Now if we could get a model for this that is on GPT-4
| level, I would be super happy.
| eurekin wrote:
| Did you finetune a model?
| keriati1 wrote:
| No, we went with a RAG pipeline approach, as we assume
| things change too fast.
| eurekin wrote:
| Thanks! Any details how you chunk and find the relevant
| code?
|
| Or how you deal with context length? I.e. do you send
| anything other than the current file? How is the prompt
| constructed?
| lozenge wrote:
| Considering a number of SaaS products offer this service, I'd
| say it's feasible.
| jejeyyy77 wrote:
| already been done
| connorgutman wrote:
| This is targeted towards GPU rental services like RunPod as
| well as API providers such as together AI. Together.ai is
| charging $0.90/1M tokens at 70B parameters.
| https://www.together.ai/pricing
| ttul wrote:
| Yeah, but if your company wants to rent an H100, you can deploy
| this for your developers for much less than the cost of a
| developer...
| moyix wrote:
| You can run it on a Macbook M1/M2 with 64GB of RAM.
| 2OEH8eoCRo0 wrote:
| How? It's larger than 64GB.
| coder543 wrote:
| Quantization is highly effective at reducing memory and
| storage requirements, and it barely has any impact on
| quality (unless you take it to the extreme). Approximately
| no one should ever be running the full fat fp16 models
| during inference of any of these LLMs. That would be
| _incredibly_ inefficient.
|
| I run 33B parameter models on my RTX 3090 (24GB VRAM) no
| problem. 70B should _easily_ fit into 64GB of RAM.
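|
| Rough back-of-the-envelope math, counting only the weights
| and ignoring KV cache and runtime overhead (so treat these
| as lower bounds):
|
|     # Memory for the weights of a 70B-parameter model at
|     # different quantization levels (weights only).
|     params = 70e9
|     for bits in (16, 8, 4, 2):
|         gib = params * bits / 8 / 2**30
|         print(f"{bits}-bit: ~{gib:.0f} GiB")
|     # 16-bit: ~130, 8-bit: ~65, 4-bit: ~33, 2-bit: ~16 GiB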
| 2OEH8eoCRo0 wrote:
| I'm aware but is it still LLaMA 70B at that point?
| coder543 wrote:
| Yes. Quantization does not reduce the number of
| parameters. It does not re-train the model.
| andy99 wrote:
| It's a legit question, the model will be worse in some
| way... I've seen it discussed that all things being equal
| more parameters is better (meaning it's better to take a
| big model and quantize it to fit in memory than use a
| smaller unquantized model that fits), but a quantized
| model wouldn't be expected to run identically to or as
| well as the full model.
| coder543 wrote:
| You don't stop being andy99 just because you're a little
| tired, do you? Being tired makes everyone a little less
| capable at most things. Sometimes, a lot less capable.
|
| In traditional software, the same program compiled for
| 32-bit and 64-bit architectures won't be able to handle
| all of the same inputs, because the 32-bit version is
| limited by the available address space. It's still the
| same program.
|
| If we're not willing to declare that you are a completely
| separate person when you're tired, or that 32-bit and
| 64-bit versions are completely different programs, then I
| don't think it's worth getting overly philosophical about
| quantization. A quantized model is still the same model.
|
| The quality loss from using 4+ bit quantization is
| minimal, in my experience.
|
| Yes, it has a _small_ impact on accuracy, but with
| massive efficiency gains. I don't really think anyone
| should be running the full models outside of research in
| the first place. If anything, the quantized models should
| be considered the "real" models, and the full fp16/fp32
| model should just be considered a research artifact
| distinct from the model. But this philosophical rabbit
| hole doesn't seem to lead anywhere interesting to me.
|
| Various papers have shown that 4-bit quantization is a
| great balance. One example:
| https://arxiv.org/pdf/2212.09720.pdf
| cjbprime wrote:
| I don't like the metaphor: when I'm tired, I will be
| alert again later. Quantization is lossy compression: the
| human equivalent would be more like a traumatic brain
| injury affecting recall, especially of fine details.
|
| The question of whether I am still me after a traumatic
| brain injury is philosophically unclear, and likely
| depends on specifics about the extent of the deficits.
| coder543 wrote:
| The impact on accuracy is somewhere in the single-digit
| percentages at 4-bit quantization, from what I've been
| able to gather. Very small impact. To draw the analogy
| out further, if the model was able to get an A on a test
| before quantization, it would likely still get a B at
| worst afterwards, given a drop in the score of less than
| 10%. Depending on the task, the measured impact could
| even be negligible.
|
| It's _far_ more similar to the model being perpetually
| tired than it is to a TBI.
|
| You may nitpick the analogy, but analogies are _never_
| exact. You also ignore the other piece that I pointed
| out, which is how we treat other software that comes in
| multiple slightly different forms.
| berniedurfee wrote:
| Reminds me of the never ending MP3 vs FLAC argument.
|
| The difference can be measured empirically, but is it
| noticeable in real-world usage?
| jameshart wrote:
| But we're talking about a coding LLM here. A single digit
| percentage reduction in accuracy means, what, one or two
| times in a hundred, it writes == instead of !=?
| coder543 wrote:
| I think that's too simplified. The best LLMs will still
| frequently make mistakes. Meta is advertising a HumanEval
| score of 67.8%. In a third of cases, the code generated
| still doesn't satisfactorily solve the problem in that
| automated benchmark. The additional errors that
| quantization would introduce would only be a very small
| percentage of the overall errors, making the quantized
| and unquantized models practically indistinguishable to a
| human observer. Beyond that, lower accuracy can manifest
| in many ways, and "do the opposite" seems unlikely to be
| the most common way. There might be a dozen correct ways
| to solve a problem. The quantized model might choose a
| different path that still turns out to work, it's just
| not exactly the same path.
|
| As someone else pointed out, FLAC is objectively more
| accurate than mp3, but how many people can really tell?
| Is it worth 3x the data to store/stream music in FLAC?
|
| The quantized model would run at probably 4x the speed of
| the unquantized model, assuming you had enough memory to
| choose between them. Is speed worth nothing? If I have to
| wait all day for the LLM to respond, I can probably do
| the work faster myself without its help. Is being able to
| fit the model onto the hardware you have worth nothing?
|
| In essence, quantization here is a 95% "accurate"
| implementation of a 67% accurate model, which yields a
| 300% increase in speed while using just 25% of the RAM.
| All numbers are approximate, even the HumanEval benchmark
| should be taken with a large grain of salt.
|
| If you have a very opulent computational experience, you
| can enjoy the luxury of the full 67.8% accurate model,
| but that just feels both wasteful and like a bad user
| experience.
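|
| Putting rough numbers on that framing (illustrative only;
| the ~5% relative drop is an assumed figure in line with the
| quantization papers above, not a measurement for this
| model):
|
|     base = 0.678        # advertised HumanEval pass rate
|     rel_drop = 0.05     # assumed quantization penalty
|     print(base * (1 - rel_drop))   # ~0.644 vs 0.678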
| manmal wrote:
| Sure, quantization reduces information stored for each
| parameter, not the parameter count.
| sbrother wrote:
| Can I ask how many tok/s you're getting on that setup?
| I'm trying to decide whether to invest in a high-end
| NVIDIA setup or a Mac Studio with llama.cpp for the
| purposes of running LLMs like this one locally.
| coder543 wrote:
| On a 33B model at q4_0 quantization, I'm seeing about 36
| tokens/s on the RTX 3090 with all layers offloaded to the
| GPU.
|
| Mixtral runs at about 43 tokens/s at q3_K_S with all
| layers offloaded. I normally avoid going below 4-bit
| quantization, but Mixtral doesn't seem fazed. I'm not
| sure if the MoE just makes it more resilient to
| quantization, or what the deal is. If I run it at q4_0,
| then it runs at about 24 tokens/s, with 26 out of 33
| layers offloaded, which is still perfectly usable, but I
| don't usually see the need with Mixtral.
|
| Ollama dynamically adjusts the layers offloaded based on
| the model and context size, so if I need to run with a
| larger context window, that reduces the number of layers
| that will fit on the GPU and that impacts performance,
| but things generally work well.
| sbrother wrote:
| Thanks! That is really fast for personal use.
| sorenjan wrote:
| What's the power consumption and fan noise like when
| doing that? I assume you're running the model doing
| inference in the background for the whole coding session,
| i.e. hours at a time?
| coder543 wrote:
| I don't use local LLMs for CoPilot-like functionality,
| but I have toyed with the concept.
|
| There are a few things to keep in mind: no programmer
| that I know is sitting there typing code for hours at a
| time without stopping. There's a lot more to being a
| developer than just typing, whether it is debugging,
| thinking, JIRA, Slack, or whatever else. These CoPilot-
| like tools will only activate after you type something,
| then stop for a defined timeout period. While you're
| typing, they do nothing. After they generate, they do
| nothing.
|
| I would honestly be surprised if the GPU active time was
| more than 10% averaged over an hour. When actively
| working on a large LLM, the RTX 3090 is drawing close to
| 400W in my desktop. At a 10% duty cycle (active time),
| that would be 40W on average, which would be 320Wh over
| the course of a full 8-hour day of crazy productivity. My
| electric rate is about 15¢/kWh, so that would be about 5¢
| per day. It is _absolutely not_ running at a 100% duty
| cycle, and it's absurd to even do the math for that, but
| we can multiply by 10 and say that if you're somehow a
| mythical "10x developer" then it would be 50¢/day in
| electricity here. I think 5¢/day to 10¢/day is closer
| to reality. Either way, the cost is marginal at the scale
| of a software developer's salary.
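|
| If you want to plug in your own rate and duty cycle, the
| back-of-the-envelope math is just:
|
|     watts = 400        # RTX 3090 under load
|     duty_cycle = 0.10  # fraction of the day it's busy
|     hours = 8
|     rate = 0.15        # $/kWh
|     kwh = watts * duty_cycle * hours / 1000   # 0.32 kWh
|     print(f"${kwh * rate:.3f}/day")           # ~$0.048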
| sorenjan wrote:
| That sounds perfectly reasonable. I'm more worried about
| noise and heat than the cost though, but I guess that's
| not too bad either then. What's the latency like? When
| I've used generative image models the programs unload the
| model after they're done, so it takes a while to generate
| the next image. Is the model sitting in VRAM when it's
| idle?
| coder543 wrote:
| Fan noise isn't very much, and you can always limit the
| max clockspeeds on a GPU (and/or undervolt it) to be
| quieter and more efficient at a cost of a small amount of
| performance. The RTX 3090 still _seems_ to be faster than
| the M3 Max for LLMs that fit on the 3090, so giving up a
| little performance for near-silent operation wouldn't be
| a big loss.
|
| Ollama caches the last used model in memory for a few
| minutes, then unloads it if it hasn't been used in that
| time to free up VRAM. I think they're working on making
| this period configurable.
|
| Latency is very good in my experience, but I haven't used
| the local code completion stuff much, just a few quick
| experiments on personal projects, so my experience with
| that aspect is limited. If I ever have a job that
| encourages me to use my own LLM server, I would certainly
| consider using it more for that.
| nullstyle wrote:
| Here's an example of megadolphin running on my m2 ultra
| setup: https://gist.github.com/nullstyle/a9b68991128fd4be
| 84ffe8435f...
| int_19h wrote:
| I run LLaMA 70B and 120B (frankenmerges) locally on a
| 2022 Mac Studio with M1 Ultra and 128GB RAM. It gives ~7
| tok/s for 120B and ~9.5 tok/s for 70B.
|
| Note that M1/M2 Ultra is quite a bit faster than M3 Max,
| mostly due to 800 GB/s vs 400 GB/s memory bandwidth.
| rgbrgb wrote:
| Quantization can take it under 30GB (with quality
| degradation).
|
| For example, take a look at the GGUF file sizes here:
| https://huggingface.co/TheBloke/Llama-2-70B-GGUF
| reddit_clone wrote:
| I am not too familiar with LLMs and GPUs (Not a gamer
| either). But want to learn.
|
| Could you please expand on what else would be capable of
| running such models locally?
|
| How about a linux laptop/desktop with specific hardware
| configuration?
| MeImCounting wrote:
| It pretty much comes down to 2 factors which is memory
| bandwidth and compute. You need a high enough memory
| bandwidth to be able to "feed" the compute and you need
| beefy enough compute to be able to keep up with the data
| that is being fed in by the memory. In theory a single
| Nvidia 4090 would be able to run a 70b model with
| quantization at "useable" speeds. The reason mac hardware
| is so capable in AI is because of the unified architecture
| meaning the memory is shared across the GPU and CPU. There
| are other factors but it essentially comes down to tokens
| per second advantages. You could run one of these models on
| an old GPU with low memory bandwidth just fine but your
| tokens per second would be far too slow for what most
| people consider "useable" and the quantization necessary
| might star noticeably effecting the quality.
| int_19h wrote:
| A single RTX 4090 can run at most 34b models with 4-bit
| quantization. You'd need 2-bit for 70b, and at that point
| quality plummets.
|
| Compute is actually not that big of a deal once
| generation is ongoing, compared to memory bandwidth. But
| the initial prompt processing can easily be an order of
| magnitude slower on CPU, so for large prompts (which
| would be the case for code completion), acceleration is
| necessary.
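|
| A crude way to see why bandwidth dominates during
| generation: single-stream decoding has to read every weight
| roughly once per token, so bandwidth divided by model size
| gives a rough ceiling on tokens/s (real numbers land below
| this because of overhead):
|
|     def max_tok_per_s(bandwidth_gb_s, model_gb):
|         # Ceiling: all weights read once per token.
|         return bandwidth_gb_s / model_gb
|
|     print(max_tok_per_s(936, 17))  # 3090, 33B@4bit ~55
|     print(max_tok_per_s(800, 40))  # M1/M2 Ultra, 70B ~20
|     print(max_tok_per_s(400, 40))  # M3 Max, 70B ~10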
| svara wrote:
| It's aimed at OpenAI's moat. Making sure they don't accumulate
| too much of one. No one actually has to use this, it just needs
| to be clear that LLM as a service won't be super high margin
| because competition can simply start building on Meta's open
| source releases.
| FrustratedMonky wrote:
| So, strange as it seems, is Meta being more 'open' than
| OpenAI, which was created to be the 'open' option to fight
| off Meta and Google?
| holoduke wrote:
| Meta is becoming the good guy. It's actually a smart move.
| Some extra reputation points won't hurt Meta.
| chasd00 wrote:
| If Meta can turn the money making sauce in GenAI from
| model+data to just data then it's in a very good position.
| Meta has tons of data.
| adventured wrote:
| The moat is all but guaranteed to be the scale of the GPUs
| required to operate these for a lot of users as they get ever
| larger, specifically the extreme cost that is going along
| with that.
|
| Anybody have $10 billion sitting around to deploy that
| gigantic open source set-up for millions of users? There's
| your moat and only a relatively few companies will be able to
| do it.
|
| One of Google's moats is, has been, and will always be the
| scale required to just get into the search game and the tens
| of billions of dollars you need to compete in search
| effectively (and that's before you get to competing with
| their brand). Microsoft has spent over a hundred billion
| dollars trying to compete with Google, and there's little
| evidence anybody else has done better anywhere (Western
| Europe hasn't done anything in search, there's Baidu out of
| China, and Yandex out of Russia).
|
| VRAM isn't moving nearly as fast as the models are
| progressing in size. And it's never going to. The cost will
| get ever greater to operate these at scale.
|
| Unless someone sees a huge paradigm change for cheaper,
| consumer accessible GPUs in the near future (Intel? AMD?
| China?). As it is, Nvidia owns the market and they're part of
| the moat cost problem.
| staticman2 wrote:
| >The moat is all but guaranteed to be the scale of the GPUs
| required to operate these
|
| You don't have to run them locally.
| brucethemoose2 wrote:
| > VRAM isn't moving nearly as fast as the models are
| progressing in size. And it's never going to. The cost will
| get ever greater to operate these at scale.
|
| It will be, in 2025 at least. AMD (and maybe Intel) will have
| M-Pro-esque APUs that can run a 70B model at very
| reasonable speeds.
|
| I am pretty sure Intel is going to rock the VRAM boat on
| desktops as well. They literally have no market to lose,
| unlike AMD which infuriatingly still artificially segments
| their high VRAM cards.
| dragonwriter wrote:
| > VRAM isn't moving nearly as fast as the models are
| progressing in size.
|
| Models of any given quality are declining in size (both
| number of parameters, and also VRAM required for inference
| per parameter because quantization methods are improving.)
| alfalfasprout wrote:
| and this is why the LLM arms race for ultra high parameter
| count models will stagnate. It's all well and good that
| we're developing interesting new models. But once you
| factor cost into the equation it does severely limit what
| applications justify the cost.
|
| Raw FLOPs may increase each generation but VRAM becomes a
| limiting factor. And fast VRAM is expensive.
|
| I do expect to see incremental innovation in reducing the
| size of foundational models.
| KaiserPro wrote:
| > The moat is all but guaranteed to be the scale of the
| GPUs required to operate these for a lot of users
|
| for end users, yes. For small companies that want to
| finetune, evaluate and create derivatives, it reduces the
| cost by millions.
| kungfupawnda wrote:
| I got it to build and run the example app on my M3 Max with
| 36 GB RAM. Memory pressure was around 32 GB.
| dimask wrote:
| Did you quantise it? At what level and what was your
| impression compared to other recent smaller models at that
| quantisation, if so?
| kungfupawnda wrote:
| No I just ran it out of the box but I had to modify the
| source code to run for Mac.
|
| Instructions here:
| https://github.com/facebookresearch/llama/pull/947/
| dimask wrote:
| There are companies like phind that offer copilot-like services
| using finetuned versions of CodeLlama-34B, which imo are
| actually good. But I do not know if such a large model is
| gonna be used in such a context.
| b33j0r wrote:
| There is a bait and switch going on, and Sam Altman or Mark
| Zuckerberg would be the first to tell you.
|
| "No one can compete with us, but it's cute to try! Make
| applications though" --almost direct quote from Sam Altman.
|
| I have 64gb and an RTX 3090 and a macbook M3, and I already can't
| run a lot of the newest models even in their quantized form.
|
| The business model requires this to be a subscription service. At
| least as of today...
| kuczmama wrote:
| Realistically, what hardware would be required to run this? I
| assumed a RTX 3090 would be enough?
| summarity wrote:
| A Studio Mac with an M1 Ultra (about 2800 USD used) is
| actually a really cost-effective way to run it. Its total
| system power consumption is really low, even spitting out
| tokens at full tilt (<250W).
| kkzz99 wrote:
| RTX 3090 has 24GB of memory, a quantized llama70b takes
| around 60GB of memory. You can offload a few layers on the
| gpu, but most of them will run on the CPU with terrible
| speeds.
| nullc wrote:
| You're not required to put the whole model in a single GPU.
|
| You can buy a 24GB gpu for $150-ish (P40).
| kuczmama wrote:
| Wow that's a really good idea. I could potentially buy 4
| Nvidia P40's for the same price as a 3090 and run
| inference on pretty much any model I want.
| eurekin wrote:
| Just make sure you're comfortable with manually compiling
| bitsandbytes and generally combining a software stack of
| almost-out-of-date libraries.
| kuczmama wrote:
| That's a good point. Are you referring to the out of date
| cuda libraries?
| eurekin wrote:
| I don't remember exactly (either cuda directly or the
| cudnn version used by the flashattention)... Anyway,
| /r/localLlama has a few instances of such builds. Might be
| really worthwhile looking that up before buying
| nullc wrote:
| P40 still works with 12.2 at the moment. I used to use
| K80s (which I think I paid like $50 for!) which turned
| into a huge mess to deal with older libraries, especially
| since essentially all ML stuff is on a crazy upgrade
| cadence with everything constantly breaking even without
| having to deal with orphaned old software.
|
| You can get GPU server chassis that have 10 PCIe slots
| too, for around $2k on eBay. But note that there is a
| hardware limitation on the PCIe cards such that each
| card can only directly communicate with 8 others at a
| time. Beware, they're LOUD even by the standards of server
| hardware.
|
| Oh, also the Nvidia Tesla power connectors have CPU-
| connector-like polarity instead of PCIe, so at least in
| my chassis I needed to adapt them.
|
| Also keep in mind that if you aren't using a special gpu
| chassis, the tesla cards don't have fans, so you have to
| provide cooling.
| trentnelson wrote:
| Can that be split across multiple GPUs? i.e. what if I have
| 4xV100-DGXS-32GBs?
| michaelt wrote:
| You can run a similarly sized model - Llama 2 70B - at the
| 'Q4_K_M' quantisation level, with 44 GB of memory [1]. So you
| can just about fit it on 2x RTX 3090 (which you can buy,
| used, for around $1100 each)
|
| Of course, you can buy quite a lot of hosted model API access
| or cloud GPU time for that money.
|
| [1] https://huggingface.co/TheBloke/Llama-2-70B-GGUF
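|
| To make "quite a lot" concrete (rough math, using the
| Together pricing quoted elsewhere in this thread, and
| ignoring electricity):
|
|     gpu_cost = 2 * 1100      # two used RTX 3090s, USD
|     per_m_tokens = 0.90      # USD per 1M tokens, 70B model
|     print(gpu_cost / per_m_tokens)  # ~2,444M tokens (~2.4B)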
| ttul wrote:
| A 70B model is quite accessible; just rent a data center GPU
| hourly. There are easy deployment services that are getting
| better all the time. Smaller models can be derived from the big
| ones to run on a MacBook running Apple Silicon. While the
| compute won't be a match for Nvidia hardware, a MacBook can
| pack 128GB of RAM and run enormous models - albeit slowly.
| moyix wrote:
| My Macbook has a mere 64GB and that's plenty to run 70B
| models at 4-bit :) LM Studio is very nice for this.
| b33j0r wrote:
| Ok, well now that we've downvoted me below the visibility
| threshold, I was being sincere. And Altman did say that. I am
| not a hater.
|
| So. Maybe we could help other people figure out why VRAM is
| maxing out. I think it has to do with various new platforms
| leaking memory.
|
| In my case, I suspect ollama and diffusers are not actually
| evicting VRAM. nvidia-smi shows it in one case, but I haven't
| figured it out yet.
|
| Hey, my point remains. The models are going to get too
| expensive for me, personally, to run locally. I suspect we'll
| default into subscriptions to APIs because the upgrade slope
| is too steep.
| YetAnotherNick wrote:
| > At least as of today...
|
| This is the exact opposite of bait and switch. The current
| model couldn't be un-opensourced and over time it will just
| become easier to run it.
|
| Also, unless there is reason to believe that prompt engineering
| of different model families is very different (which honestly I
| don't believe), there is no baiting effect. I believe it will
| always be the case that the best 2-3 models will be closed
| weights.
| whimsicalism wrote:
| the connection to the classic bait-and-switch seems tenuous at
| best
| bk146 wrote:
| Can someone explain Meta's strategy with the open source models
| here? Genuine question, I don't fully understand.
|
| (Please don't say "commoditize your complement" without
| explaining what exactly they're commoditizing...)
| pyinstallwoes wrote:
| To be crowned the harbinger of AGI.
| apples_oranges wrote:
| OT: You "don't" or you "don't fully" understand? ;)
|
| (I try to train myself to say it right ..)
| eurekin wrote:
| Total speculation: Yann LeCun is there and he is really
| passionate about the technology and openness
| observationist wrote:
| The faux-open models mean the models can't be used in
| competing products. The open code base means enthusiasts and
| amateurs and other people hack on Meta projects and
| contribute improvements.
|
| They get free R&D and suppress competition, while looking
| like they have principles. Yann is clueless about open source
| principles, or the models would have been Apache or some
| other comparably open license. It's all ruthless corporate
| strategy, regardless of the mouth noises coming out of
| various meta employees.
| importantbrian wrote:
| Meta's choice of license doesn't indicate that Yann is
| clueless about open-source principles. I don't know about
| Meta specifically, but in most companies choosing the
| license for open source projects involves working with a
| lot of different stakeholders. He very easily could have
| pushed for Apache or MIT and some other interest group
| within Meta vetoed it.
| sangnoir wrote:
| > The faux-open models mean the models can't be used in
| competing products.
|
| Just because certain entities can't profitably use a
| product or obtain a license doesn't make it not-open. AGPL
| is open, for an extreme example.
|
| This argument is also subjective, and not new - "Which is
| more open, BSD-style licenses or GPL?" has been a guaranteed
| flamewar starter for decades.
| skottenborg wrote:
| I doubt personal passions would merit the company funding
| required for such big models.
| eurekin wrote:
| Given how megacorps spend millions on a whim (Disney with
| all recent flops) or, when just a single person wants it
| (MS Flight Simulator?) - I wouldn't be surprised, to be
| honest...
|
| But sure, sounds more reasonable
| og_kalu wrote:
| Disney didn't spend millions on a whim. It's just the
| reality of box office that even millions in investment
| are no guarantee for returns.
| simonw wrote:
| AI seems like the Next Big Thing. Meta have put themselves at
| the center of the most exciting growth area in technology by
| releasing models they have trained.
|
| They've gained an incredible amount of influence and mindshare.
| bryan_w wrote:
| Part of it is that they already had this developed for years
| (see alt text on uploaded images for example), and they want to
| ensure that new regulations don't hamper any of their future
| plans.
|
| It costs them nothing to open it up, so why not. Kinda like all
| the rest of their GitHub repos.
| gen220 wrote:
| They're commoditizing the ability to generate viral content,
| which is the carrot that keeps peoples' eyeballs on the hedonic
| treadmill. More eyeball-time = more ad placements = more money.
|
| On the advertiser side, they're commoditizing the ability for
| companies to write more persuasively-targeted ads. Higher
| click-through rates = more money.
|
| [edit]: For models that generate code instead of content (TFA),
| it's obviously a different story. I don't have a good grip on
| that story, beyond "they're using their otherwise-idle GPU
| farms to buy goodwill and innovate on training methods".
| esafak wrote:
| That stuff ultimately drives people away. Who thinks "I need
| my daily fix of genAI memes, let me head to Facebook!"?
| Philpax wrote:
| Aside from the "positive" explanations offered in the sibling
| comments, there's also a "negative" one: other AI companies
| that try to enter the fray will not be able to compete with
| Meta's open offerings. After all, why would you pay a company
| to undertake R&D on building their own models when you can just
| finetune a Llama?
| pchristensen wrote:
| Meta doesn't have an AI "product" competing with OpenAI,
| Google's Bard, etc. But they use AI extensively internally.
| This is roughly a byproduct of their internal AI work that
| they're already doing, and fostering open source AI development
| puts incredible pressure on the AI products and their owners.
|
| If Meta can help prevent there from being an AI monopoly
| company, but rather an ecosystem of comparable products, then
| they avoid having another threatening tech giant competitor, as
| well as preventing their own AI work and products from being
| devalued.
|
| Think of it like Google releasing a web browser.
| IshKebab wrote:
| Google releasing a (very popular) web browser gives them
| direct control of web standards. What does this give
| Facebook?
| eganist wrote:
| OP already mentioned that it adds additional hurdles for
| possible future tech giants to have to cross on their
| quest.
|
| It's akin to a Great Filter, if such an analogy helps. If
| Meta's open models make a company's closed models
| uneconomical for others to consume, then the business case
| for those models is compromised and the odds of them
| growing to a size where they can compete with Meta in other
| ways is mitigated a bit.
| patapong wrote:
| I think we should not underestimate the strategic talent
| acquisition value as well. Many top-tier AI engineers may
| appreciate the openness and choose to join meta, which
| could be very valuable in the long run.
| jwkane wrote:
| Excellent point -- goodwill in a hyper-high demand dev
| community is invaluable.
| fngjdflmdflg wrote:
| Web standards are probably the last thing Google cares
| about with Chrome. Much more important is being the default
| search engine and making sure data collection isn't
| interrupted by a potential privacy minded browser.
| andy99 wrote:
| I think a big part of it is just because they have a big AI
| lab. I don't know the genesis of that, but it has for years
| been a big contributor, see pytorch, models like SEER, as well
| as being one of the dominant publishers at big conferences.
|
| Maybe now their leadership wants to push for practicality so
| they don't end up like Google (also a research powerhouse but
| failing to convert to popular advances) so they are publicly
| pushing strong LLMs.
| Lerc wrote:
| If they hadn't opened the models the llama series would just be
| a few sub-GPT4 models. Opening the models has created a wealth
| of development that has built upon those models.
|
| Alone, it was unlikely they would become a major player in a
| field that might be massively important. With a large community
| building upon their base they have a chance to influence the
| direction of development and possibly prevent a proprietary
| monopoly in the hands of another company.
| datadrivenangel wrote:
| AI puts pressure on search, cutting into Google's ad revenue.
| Meta's properties are less vulnerable to that pressure.
| theGnuMe wrote:
| Bill Gurley has a good perspective on it.
|
| Essentially, you mitigate IP claims and reduce vendor
| dependency.
|
| https://eightify.app/summary/technology-and-software/the-imp...
| crowcroft wrote:
| Meta's end goal is to have better AI than everyone else; in the
| medium term that means they want to have the best foundational
| models. How does this help?
|
| 1. They become an attractive place for AI researchers to work,
| and can bring in better staff.
|
| 2. They make it less appealing for startups to enter the space
| and build large foundation models (Meta would prefer 1,000
| startups pop up and play around with other people's models
| than 1,000 startups popping up and trying to build better
| foundational models).
|
| 3. They put cost pressure on AI-as-a-service providers. When
| LLaMA exists it's harder for companies to make a profit just
| selling access to models. Along with 2, this further limits
| the possibility of startups entering the foundational model
| space, because the path to monetization/breakeven is more
| difficult.
|
| Essentially this puts Meta, Google, and OpenAI/Microsoft
| (Anthropic/Amazon as a number four maybe) as the only real
| players in the cutting edge foundational model space. Worst
| case scenario they maintain their place in the current tech
| hegemony as newcomers are blocked from competing.
| siquick wrote:
| > Essentially this puts Meta, Google, and OpenAI/Microsoft
| (Anthropic/Amazon as a number four maybe) as the only real
| players in the cutting edge foundational model space.
|
| Mistral is right up there.
| yodsanklai wrote:
| Mistral has ~20 employees. I'm sure they have good
| researchers, but don't they lack the computing and
| engineering resources the big actors have?
| crowcroft wrote:
| I'm curious to see how they go, I might have a limited
| understanding. From what I can tell they do a good job in
| terms of value and efficiency with 'lighter' models, but I
| don't put them in the same category as the others in the
| sense that they aren't producing the massive absolute best
| in class LLMs.
|
| Hopefully they can prove me wrong though!
| Calvin02 wrote:
| Controversial take:
|
| Meta sees this as the way to improve their AI offerings faster
| than others and, eventually, better than others.
|
| Instead of a small group of engineers working on this inside
| Meta, the Open Source community helps improve it.
|
| They have a history of this with React, PyTorch, HHVM, etc. All
| of these have gotten better, faster, as open source projects
| than Meta alone would have managed.
| emporas wrote:
| Yann LeCun has talked about Meta's strategy with open source.
| The general idea is that the smartest people in the world do
| not work for you. No company can replicate innovation from open
| source internally.
| yodsanklai wrote:
| > The general idea is that the smartest people in the world
| do not work for you
|
| Most likely, they work for your competitors. They may not be
| working to improve your system for free.
|
| > No company can replicate innovation from open source
| internally.
|
| Lot of innovation does come from companies.
| flir wrote:
| Really enjoying how many different answers you got.
|
| (My theory: _if_ there's an AI pot of gold, what megacorp can
| risk one of the others getting to it first?)
| Too wrote:
| Meta still sits on all the juicy user data that they want to
| use AI on, but they don't know how. They are crowdsourcing the
| development of applications and tooling.
|
| Meta releases model. Joe builds a cool app with it, earns some
| internet points and if lucky a few hundred bucks. Meta copies
| app, multiply Joes success story with 1 billion users and earn
| a few million bucks.
|
| Joe is happy, Meta is happy. Everybody is happy.
| chasd00 wrote:
| My opinion is Meta is taking the model out of the secret sauce
| formula. That leaves hardware and data for training as the
| barrier to entry. If you don't need to develop your own model
| then all you need is data and hardware which lowers the barrier
| to entry. The lower the barrier the more GenAI startups and the
| more potential data customers for Meta since they certainly
| have large, curated, datasets for sale.
| edweis wrote:
| How come a company as big as Meta still uses bit.ly?
| nemothekid wrote:
| What else would they use?
| 3pt14159 wrote:
| Something like meta.com/our_own_tech_handles_this
| nemothekid wrote:
| Not sure it's preferable to hire people at FB salaries to
| maintain a link shortener rather than just use a reputable
| free one.
| junon wrote:
| Every big company has one of these anyway, and usually
| more involved (internal DNS, VPN, etc). A link shortener
| is like an interview question.
| transcriptase wrote:
| fb.com seems like a reasonable choice.
| Cthulhu_ wrote:
| Their own?
| huac wrote:
| their own shortener, e.g. fb.me, presumably
| geor9e wrote:
| Ironically it doesn't help to use link shorteners on twitter
| anyway - all URLs posted to twitter count as 23 characters. The
| hypertext is the truncated original URL string, and the URL is
| actually a t.co link.
| esafak wrote:
| Because this is a marketing channel. They handle tracking of
| FB/IG messages by other means, intended for engineers.
| kmeisthax wrote:
| Not only that, the announcement is on Twitter, a company that
| at least _used_ to be their biggest competitor. Old habits die
| hard, huh?
| simonw wrote:
| Here's the model on Hugging Face:
| https://huggingface.co/codellama/CodeLlama-70b-hf
| israrkhan wrote:
| I hope someone will soon post a quantized version that I can
| run on my macbook pro.
| theLiminator wrote:
| Curious what's the current SOTA local copilot model? Are there
| any extensions in vscode that give you a similar experience? I'd
| love something more powerful than copilot for local use (I have a
| 4090, so I should be able to run a decent number of models).
| Eisenstein wrote:
| When this 70b model gets quantized you should be able to run it
| fine on your 4090. Check out 'TheBloke' on Hugging Face and
| llama.cpp to run the GGUF files.
| coder543 wrote:
| I think your take is a bit optimistic. I like quantization as
| much as the next person, but even the 2-bit model won't fit
| entirely on a 4090:
| https://huggingface.co/TheBloke/Llama-2-70B-GGUF
|
| I would be uncomfortable recommending less than 4-bit
| quantization on a non-MoE model, which is ~40GB on a 70B
| model.
| nox101 wrote:
| Fortunately it will run on my UMA Mac. It's made me curious
| what the trade-offs are. Would I be better off with a 4090
| or a Mac with 128+ GB of UMA memory?
| coder543 wrote:
| Even the M3 Max seems to be slower than my 3090 for LLMs
| that fit onto the 3090, but it's hard to find
| comprehensive numbers. The primary advantage is that you
| can spec out more memory with the M3 Max to fit larger
| models, but with the exception of CodeLlama-70B today, it
| really seems like the trend is for models to be getting
| smaller and better, not bigger. Mixtral runs circles
| around Llama2-70B and arguably ChatGPT-3.5. Mistral-7B
| often seems fairly close to Llama2-70B.
|
| Microsoft accidentally leaked that ChatGPT-3.5-Turbo is
| apparently only 20B parameters.
|
| 24GB of VRAM is enough to run ~33B parameter models, and
| enough to run Mixtral (which is a MoE, which makes direct
| comparisons to "traditional" LLMs a little more
| confusing.)
|
| I don't think there's a clear answer of what hardware
| someone should get. It depends. Should you give up
| performance on the models most people run locally in
| hopes of running very large models, or give up the
| ability to run very large models in favor of prioritizing
| performance on the models that are popular and proven
| today?
| int_19h wrote:
| M3 Max is actually less than ideal because it peaks at
| 400 GB/s for memory. What you really want is M1 or M2
| Ultra, which offers up to 800 GB/s (for comparison, RTX
| 3090 runs at 936 GB/s). A Mac Studio suitable for running
| 70B models with speeds fast enough for realtime chat can
| be had for ~$3K
|
| The downside of Apple's hardware at the moment is that
| the training ecosystem is very much focused on CUDA;
| llama.cpp has an open issue about Metal-accelerated
| training:
| https://github.com/ggerganov/llama.cpp/issues/3799 - but
| no work on it so far. This is likely because training at
| any significant sizes requires enough juice that it's
| pretty much always better to do it in the cloud
| currently, where, again, CUDA is the well-established
| ecosystem, and it's cheaper and easier for datacenter
| operators to scale. But, in principle, much faster
| training on Apple hardware should be possible, and
| eventually someone will get it done.
| zten wrote:
| Well, the workstation-class equivalent of a 4090 -- RTX
| 6000 Ada -- has enough RAM to work with a quantized
| model, but it'll blow away anyone's budget at anywhere
| between $7,000 and $10,000.
| Eisenstein wrote:
| The great thing about gguf is that it will cross to system
| RAM if there isn't enough VRAM. It will be slower, but
| waiting a couple minutes for a prompt response isn't the
| worst thing if you are the type that would get use out of a
| local 70b parameter model. Then again, one could have
| grabbed 2x 3090s for the price of a 4090 and ended up with
| 48gb of VRAM in exchange for a very tolerable performance
| hit.
| coder543 wrote:
| > The great thing about gguf is that it will cross to
| system RAM if there isn't enough VRAM.
|
| No... that's not such a great thing. Helpful in a pinch,
| but if you're not running at least 70% of your layers on
| the GPU, then you barely get any benefit from the GPU in
| my experience. The vast gulf in performance between the
| CPU and GPU means that the GPU is just spinning its
| wheels waiting on the CPU. Running half of a model on the
| GPU is not useful.
|
| > Then again, one could have grabbed 2x 3090s for the
| price of a 4090 and ended up with 48gb of VRAM in
| exchange for a very tolerable performance hit.
|
| I agree with this, if someone has a desktop that can fit
| two GPUs.
| sp332 wrote:
| The main benefit of a GPU in that case is much faster
| prompt processing. That could be useful for Code Llama
| cases where you want the model to read a lot of code and
| then write a line or part of a line.
| dimask wrote:
| > The great thing about gguf is that it will cross to
| system RAM if there isn't enough VRAM.
|
| Then you can just run it entirely on the CPU. There is no
| point in buying an expensive GPU to run LLMs only to be
| bottlenecked by your CPU in the first place. That's why I
| don't get so excited about these huge models: they gain
| less traction since fewer people can run them locally, and
| finetuning is probably more costly too.
| int_19h wrote:
| GGUF is just a file format. The ability to offload some
| layers to CPU is not specific to it nor to llama.cpp in
| general - indeed, it was available before llama.cpp was
| even a thing.
| sfsylvester wrote:
| This is a completely fair but open-ended question. Not to be a
| typical HN user, but when you say SOTA local, the real question
| is which benchmarks you care about when evaluating: size,
| operability, complexity, explainability, etc.
|
| Working out which copilot models perform best has been a deep
| exercise for me, and it has really made me examine my own
| coding style, what I find important, and what I look for when
| investigating models and evaluating interview candidates.
|
| I think the three benchmarks & leaderboards most people go to
| are:
|
| https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...
| - which is the best-understood, broad language-capability
| leaderboard, relying on well-understood evaluations and
| benchmarks.
|
| https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...
| - Also comprehensive, but primarily assesses Python and
| JavaScript.
|
| https://evalplus.github.io/leaderboard.html - which I think is
| a better take on comparing models you intend to run locally as
| you can evaluate performance, operability and size in one
| visualisation.
|
| Best of luck and I would love to know which models & benchmarks
| you choose and why.
| theLiminator wrote:
| I'm honestly more interested in anecdotes and I'm just
| seeking anything that can be a drop-in copilot replacement
| (that's subjectively better). Perhaps one major thing I'd
| look for is improved understanding of the code in my own
| workspace.
|
| I honestly don't know what benchmarks to look at or even what
| questions to be asking.
| hackerlight wrote:
| > https://huggingface.co/spaces/mike-ravkine/can-ai-code-
| resul... - Also comprehensive, but primarily assesses Python
| and JavaScript.
|
| I wonder why they didn't use DeepSeek under the "senior"
| interview test. I am curious to see how it stacks up there.
| ahmednazir wrote:
| Can you explain why big tech companies are racing to release
| open-source models? If a model is free and open source, how
| will they make money, and how will they compete with others?
| jampekka wrote:
| Commoditize your complement?
| stainablesteel wrote:
| they want to incentivize dependency
| martingoodson wrote:
| Baptiste Roziere gave a great talk about Code Llama at our meetup
| recently: https://m.youtube.com/watch?v=_mhMi-7ONWQ
|
| I highly recommend watching it.
| doctoboggan wrote:
| Is there a quantized version available for ollama or is it too
| early for that?
| coder543 wrote:
| Already there, it looks like:
| https://ollama.ai/library/codellama
|
| (Look at "tags" to see the different quantizations)
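|
| e.g., something like:
|
|     ollama run codellama:70b
|
| with the more specific quantization tags (q4_K_M, q2_K,
| etc.) listed on that page.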
| robin-whg wrote:
| This is a prime example of the positive aspects of capitalism.
| Meta has its own interests, of course, but as a side effect, this
| greatly benefits consumers.
| LVB wrote:
| I'm not very plugged into how to use these models, but I do love
| and pay for both ChatGPT and GitHub Copilot. How does one take a
| model like this (or a smaller version) and leverage it in VS
| Code? There's a dizzying array of GPT wrapper extensions for VS
| Code, many of which either seem like kind of junk (10 d/ls, no
| updates in a year), or just lead to another paid plan, at which
| point I might as well just keep my GH Copilot. Curious what
| others are doing here for Copilot-esque code completion without
| Copilot.
| petercooper wrote:
| https://continue.dev/ is a good place to start.
| sestinj wrote:
| beat me to the punch : )
| speedgoose wrote:
| Continue doesn't support tab completion like Copilot yet.
|
| A pull/merge request is being worked on:
| https://github.com/continuedev/continue/pull/758
| sestinj wrote:
| Release coming later this week!
| jondwillis wrote:
| Bonus points for being able to use local models!
| israrkhan wrote:
| This looks really good..
| dan_can_code wrote:
| It's great. It's super easy: install ollama locally, run
| `ollama run <preferred model>`, change the Continue config to
| point to it, and it just works. It even has an offline option
| if you disable telemetry.
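|
| For example (assuming Ollama's default port; the model tag
| and prompt are just illustrative):
|
|     ollama pull codellama:13b
|     # sanity-check the local endpoint Continue will talk to
|     curl http://localhost:11434/api/generate \
|       -d '{"model": "codellama:13b", "prompt": "fizzbuzz in go"}'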
| sestinj wrote:
| I've been working on continue.dev, which is completely free to
| use with your own Ollama instance / TogetherAI key, or for a
| while with ours.
|
| Was testing with Codellama-70b this morning and it's clearly
| a step up from other open-source models.
| dan_can_code wrote:
| How do you test a 70B model locally? I've tried to query, but
| the response is super slow.
| sestinj wrote:
| Personally I was testing with TogetherAI because I don't
| have the specs for a local 70B. Using quantized versions
| helps (Ollama downloads 4-bit by default, and you can go
| down to 2-bit), but it would still require a higher-end Mac.
| Highly recommend Together; it runs quite quickly and costs
| $0.9/million tokens.
| SparkyMcUnicorn wrote:
| There are some projects that let you run a self-hosted Copilot
| server, then you set a proxy for the official Copilot
| extension.
|
| https://github.com/fauxpilot/fauxpilot
|
| https://github.com/danielgross/localpilot
| israrkhan wrote:
| I tried fauxpilot, hoping to make it work with my own
| llama.cpp instance, but it didn't work out of the box. I
| filed a GitHub issue but got no traction and eventually gave
| up. That was around 5 months ago; things might have improved
| by now.
| water-data-dude wrote:
| When I was setting up a local LLM to play with, I stood up my
| own OpenAI-API-compatible server using llama-cpp-python. I
| installed the Copilot extension and set OverrideProxyUrl in
| the advanced settings to point to my local server, but
| Copilot obstinately refused to let me do anything until I'd
| signed in to GitHub to prove that I had a subscription.
|
| I don't _believe_ that either of these lets you bypass that
| restriction (although I'd love to be proven wrong), so if you
| don't want to sign up for a subscription you'll need to use
| something like Continue.
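|
| For anyone curious, standing up that server is roughly (the
| model filename is just an example):
|
|     pip install 'llama-cpp-python[server]'
|     python -m llama_cpp.server --model codellama-13b.Q4_K_M.gguf
|
| It then serves an OpenAI-compatible completions API on
| localhost.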
| marinhero wrote:
| You can download it and run it with
| [this](https://github.com/oobabooga/text-generation-webui).
| There's an API mode that you could leverage from your VS Code
| extension.
| cmgriffing wrote:
| I've been using Cody by Sourcegraph and liking it so far.
|
| https://sourcegraph.com/cody
| apapapa wrote:
| Free Bard is better than free ChatGPT... Not sure about paid
| versions
| chrishare wrote:
| Credit where credit is due, Meta has had a fantastic commitment
| towards open source ML. You love to see it.
| joshspankit wrote:
| Yes but: if the commitment is driven by internal researchers
| and coders standing firm about making their work open source (a
| rumour I've heard a couple times), most of the credit goes to
| them.
| anonymousDan wrote:
| Can anyone tell me what kind of hardware setup would be needed to
| fine-tune something like this? Would you need a cluster of GPUs?
| What kind of size + GPU spec do you think is reasonable (e.g.
| w.r.t. VRAM per GPU, etc.)?
| pandominium wrote:
| Everyone is mentioning using a 4090 and a smaller model, but I
| rarely see an analysis that factors in energy consumption.
|
| I think Copilot is already highly subsidized by Microsoft.
|
| Let's say you use Copilot around 30% of your daily work hours.
| How many kWh would an open-source 7B or 13B model then use in a
| month on one 4090?
|
| EDIT:
|
| I think for a 13B at 30% use per day it comes to around $30/mo
| on the energy bill.
|
| So an even smaller but still capable model could probably beat
| the Copilot monthly subscription.
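|
| Napkin math, if you want to plug in your own numbers (the
| draw, hours, and workdays below are just assumptions; a 4090
| peaks around 450 W):
|
|     echo "0.45 * 8 * 0.30 * 22" | bc -l   # ~24 kWh/month
|     # multiply by your local price per kWh to get the bill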
| fullspectrumdev wrote:
| This looks potentially interesting if it can be run locally
| on, say, an M2 Max or similar - and if there's an IDE plugin
| to do the Copilot thing.
|
| Anything that saves me time writing "boilerplate" or figuring out
| the boring problems on projects is welcome - so I can expend the
| organic compute cycles on solving the more difficult software
| engineering tasks :)
| siilats wrote:
| We made a JetBrains plugin called CodeGPT to run this locally:
| https://plugins.jetbrains.com/plugin/21056-codegpt
___________________________________________________________________
(page generated 2024-01-29 23:01 UTC)