[HN Gopher] Meta AI releases Code Llama 70B
       ___________________________________________________________________
        
       Meta AI releases Code Llama 70B
        
       Author : albert_e
       Score  : 422 points
       Date   : 2024-01-29 17:11 UTC (5 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | turnsout wrote:
       | Given how good some of the smaller code models are (such as
       | Deepseek Coder at 6.7B), I'll be curious to see what this 70B
       | model is capable of!
        
         | jasonjmcghee wrote:
         | My personal experience is that Deepseek far exceeds code llama
         | of the same size, but it was released quite a while ago.
        
           | turnsout wrote:
           | Agreed--I hope Meta studied Deepseek's approach. The idea of
           | a Deepseek Coder at 70B would be exciting.
        
             | whimsicalism wrote:
             | The "approach" is likely just training on more tokens.
        
               | imjonse wrote:
               | It's the quality of data and the method of training, not
                | just the amount of tokens (per their paper released a few
                | days ago).
        
             | hackerlight wrote:
             | There's a deepseek coder around 30-35b and it has almost
             | identical performance to the 7b on benchmarks.
        
         | ignoramous wrote:
          | _AlphaCodium_ is the newest kid on the block that's SoTA
         | _pass@5_ on coding tasks (authors claim at least 2x better than
         | GPT4): https://github.com/Codium-ai/AlphaCodium
         | 
         | As for small models, Microsoft has been making noise with the
         | unreleased _WaveCoder-Ultra-6.7b_
         | (https://arxiv.org/abs/2312.14187).
        
           | eurekin wrote:
           | Are weights available?
        
             | moyix wrote:
             | AlphaCodium is more of a prompt engineering / flow
             | engineering strategy, so it can be used with existing
             | models.
        
           | passion__desire wrote:
           | AlphaCodium author says he should have used DSPy
           | 
           | https://twitter.com/talrid23/status/1751663363216580857
        
           | hackerlight wrote:
           | Is this better than GPT4's Grimoire?
        
         | CrypticShift wrote:
         | Phind [1] uses the larger 34B Model. Still, I'm also curious
         | what they are gonna do with this one.
         | 
         | [1] https://news.ycombinator.com/item?id=38088538
        
         | hackerlight wrote:
         | Seems worse according to this benchmark:
         | 
         | https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...
        
       | colesantiago wrote:
       | Llama is getting better and better; I heard this and Llama 3 will
       | start to be as good as GPT-4.
       | 
       | Who would have thought that Meta, which has been chucking billions
       | at the metaverse, would be at the forefront of open-source AI.
       | 
       | Not to mention their stock is up and they are worth $1TN, again.
       | 
       | Not sure how I feel about this given all the scandals that have
       | plagued them: the massive 1BN fine from the EU, Cambridge
       | Analytica, and, last of all, causing a genocide in Myanmar.
       | 
       | Goes to show that nobody cares about any of these scandals and
       | just moves on to the future, allowing Facebook to still collect
       | all this data for their models.
       | 
       | If any other startup or mid-sized company had at least _two_ of
       | these large scandals, they would be dead in the water.
        
         | austinpena wrote:
         | I'm really curious what their goal is
        
           | Ruq wrote:
           | Zuck just wants another robot to talk to.
        
             | throwup238 wrote:
             | Poor guy just wants a friend that won't sell him out to the
             | FTC or some movie producers.
        
           | hendersoon wrote:
           | Their goals are clear: dominance and stockholder value. What
           | I'm curious about is how they plan to _monetize it_.
        
             | refulgentis wrote:
             | Use it in products, e.g. the chatbots
        
           | regimeld wrote:
           | Commoditize your complements.
        
           | refulgentis wrote:
           | Same as MS, in the game, in the conversation, and ensuring
           | next-gen search margins approximate 0.
        
           | Arainach wrote:
           | Disclaimer: I do not work at Meta, but I work at a large tech
           | company which competes with them. I don't work in AI,
           | although if my VP asks don't tell them I said that or they
           | might lay me off.
           | 
           | Multiple of their major competitors/other large tech
           | companies are trying to monetize LLMs. OpenAI maneuvering an
           | early lead into a dominant position would be another
           | potential major competitor. If releasing these models slows
           | or hurts them, that is in and of itself a benefit.
        
             | michaelt wrote:
             | Why?
             | 
             | What benefit is there to grabbing market share from your
             | competitors... in a business you don't even want to be in?
             | 
             | By that logic you could justify any bizarre business
             | decision. Should Google launch a social network, to hurt
             | their competitor Facebook? Should Facebook, Amazon and
             | Microsoft each launch a phone?
        
               | Filligree wrote:
               | > Should Google launch a social network, to hurt their
               | competitor Facebook?
               | 
               | I mean, Google _did_ launch a social network, to hurt
               | their competitor Facebook. It was a whole thing. It was
               | even a really nice system, eventually.
        
               | wanderingstan wrote:
               | And it turned out that Facebook had quite a moat with
               | network effects. OpenAI doesn't have such a moat, which
               | may be what Meta wants to expose.
        
               | esafak wrote:
               | Google botched the launch, and they never nurture
               | products after launch anyway. Google+ could have been
               | more successful.
        
               | zaat wrote:
               | I enjoyed using Google Plus more than any other social
               | network, and managed to create new connections and have
               | standard, authentic, real conversations with people I
               | didn't know - most of them ordinary people with shared
               | interests whom I probably wouldn't have met otherwise,
               | some of them people I can't believe I could connect with
               | directly in any other way: newspaper and news site
               | editors, major SDK developers. And even with Kevin
               | Kelly.
        
               | Arainach wrote:
               | >Should Facebook, Amazon and Microsoft each launch a
               | phone?
               | 
               | * https://www.lifewire.com/whatever-happened-to-the-
               | facebook-p...
               | 
               | * https://en.wikipedia.org/wiki/Fire_Phone
               | 
               | * https://en.wikipedia.org/wiki/Windows_Phone
        
               | Arainach wrote:
               | Who says they don't want to be in the market? Facebook
               | has one product. Their income is entirely determined by
               | ads on social media. That's a perilous position subject
               | to being disrupted. Meta desperately wants to diversify
               | its product offerings - that's why they've been throwing
               | so much at VR.
        
           | tsunamifury wrote:
           | Their goal is to counter the competition. You rarely should
           | pick the exact same strategy as your competitor and count on
           | outgunning them; rather, you should counter them. OpenAI is
           | ironically closed, so Meta will be open. If you can't beat
           | them, you should try to degrade the competitor's value case.
           | 
           | It's a smart move IMO
        
           | dsabanin wrote:
           | I think Meta's goal is to subvert Google, MS and OpenAI,
           | after realizing it's not positioned well to compete with them
           | commercially.
        
             | api wrote:
             | Could also be that these smaller models are a loss leader
             | or advertisement for a future product or service... like a
             | big brother to Llama3 that's commercial.
        
               | idkyall wrote:
               | I believe there were rumors they are developing a
               | commercial model: e.g. https://www.ft.com/content/01fd640
               | e-0c6b-4542-b82b-20afb203f...
        
           | akozak wrote:
           | I would guess mindshare in a crowded field, i.e. discussion
           | threads just like this one that help with recruiting and tech
           | reputation after a bummer ~8 years. (It's best not to
           | overestimate the complexity/# of layers in a bigco's
           | strategic thinking.)
        
           | jedberg wrote:
           | Commoditizing your complement. If all your competitors need a
           | key technology to get ahead, you make a cheap/free version of
           | it so that they can't use it as a competitive advantage.
        
             | nindalf wrote:
             | The complement being the metaverse. You can't handcraft the
             | metaverse because it would be infeasible. If LLMs are a
             | commodity that everyone has access to, then it can be done
             | on the cheap.
             | 
             | Put another way - if OpenAI were the only game in town how
             | much would they be charging for their product? They're
             | competing on price because competitors exist. Now imagine
             | the price if a hypothetical high-quality open source model
             | existed that customers can use for "free".
             | 
             | That's the future Meta wants. They weren't getting rich
             | selling shovels like cloud providers are, they want
             | everyone digging. And everyone digs when the shovels are
             | free.
        
           | aantix wrote:
           | To undermine the momentum of OpenAI.
           | 
           | If Meta were at the forefront, these models would not be
           | openly available.
           | 
           | They are scrambling.
        
           | ppsreejith wrote:
           | If you go by what Zuck says, he calls this out in previous
           | earnings reports and interviews[1]. It mainly boils down to 2
           | things:
           | 
           | 1. Similar to other initiatives (mainly Open Compute but also
           | PyTorch, React etc.), community improvements help them improve
           | their own infra and help attract talent.
           | 
           | 2. Helping people create better content ultimately improves
           | the quality of content on their platforms (both FoA & RL).
           | 
           | Sources:
           | 
           | [1]Interview with verge:
           | https://www.theverge.com/23889057/mark-zuckerberg-meta-ai-
           | el... . Search for "regulatory capture right now with AI"
           | 
           | > Zuck: ... And we believe that it's generally positive to
           | open-source a lot of our infrastructure for a few reasons.
           | One is that we don't have a cloud business, right? So it's
           | not like we're selling access to the infrastructure, so
           | giving it away is fine. And then, when we do give it away, we
           | generally benefit from innovation from the ecosystem, and
           | when other people adopt the stuff, it increases volume and
           | drives down prices.
           | 
           | > Interviewer: Like PyTorch, for example?
           | 
           | > Zuck: When I was talking about driving down prices, I was
           | thinking about stuff like Open Compute, where we open-sourced
           | our server designs, and now the factories that are making
           | those kinds of servers can generate way more of them because
           | other companies like Amazon and others are ordering the same
           | designs, that drives down the price for everyone, which is
           | good.
        
           | PheonixPharts wrote:
           | I imagine their goal is to simultaneously show that Meta is
           | still SotA when it comes to AI and feed a community of people
           | who will work for free to essentially undermine OpenAI's
           | competitive advantage and make life worse for Google, since
           | at the very least LLMs tend to be a better search engine for
           | most topics.
           | 
           | There's far more risk if Meta were to try to directly compete
           | with OpenAI and Microsoft on this. They'd have to manage the
           | infra, work to acquire customers, etc, etc on top of building
           | these massive models. If it's not a space they really want to
           | be in, it's a space they can easily disrupt.
           | 
           | Meta's late game realization was that Google owned the web
           | via search and Apple took over a lot of the mobile space with
           | their walled garden. I suspect Meta's view now is that it's
           | much easier to just prevent something like this from
           | happening with AI early on.
        
           | blackoil wrote:
           | Devil's advocate: they have to build it anyway for the
           | metaverse and in general. Management has no interest in going
           | into the cloud business. They had Parse a long time back but
           | that is done. So why not release it? They are getting
           | goodwill/mindshare, may set an industry standard, and get
           | community benefit. It isn't very different from React, Torch
           | etc.
        
           | ttul wrote:
           | If you want to employ the top ML researchers, you have to
           | give them what they want, which is often the ability to share
           | their discoveries with the world. Making Llama-N open may not
           | be Zuckerberg's preference. It's possible the researchers
           | demanded it.
        
           | wrsh07 wrote:
           | Rule 5: commodify your complement
           | 
           | Content generation is complementary to most of meta's apps
           | and projects
        
           | elorant wrote:
           | Prevent OpenAI from dominating the market, and at the same
           | time have the research community enhance your models and
           | identify key use cases.
        
             | pennomi wrote:
             | Meta basically got a ton of free R&D that directly applies
             | to their model architecture. Their next generation AIs will
             | always benefit from the techniques/processes developed by
             | the clever researchers and hobbyists out there.
        
           | megaman821 wrote:
           | They were going to make most of this anyway for Instagram
           | filters, chat stickers, internal coding tools, VR world
           | generation, content moderation, etc. Might as well do a
           | little bit extra work to open source it since it doesn't
           | really compete with anything Meta is selling.
        
         | make3 wrote:
         | with pytorch (& so many open publications), Meta has had an
         | unimaginably strong contribution to ai for a while
        
         | fragmede wrote:
         | _Model available_, not open source. These models aren't open
         | source because we don't have access to the data sets, nor the
         | full code to train them, so we can't recreate the models even
         | if we had the GPU time available to recreate them.
        
           | colesantiago wrote:
           | Everyone using AI in production is using Pytorch by Meta.
           | 
           | Which is open source.
           | 
           | I do not know anybody important in the AI space apart from
           | Google using TensorFlow.
        
             | Philpax wrote:
             | That may be true, but it's largely irrelevant. The ML
             | framework in use has no bearing on whether or not you have
             | the data required to reproduce the model being trained with
             | that framework.
        
               | colesantiago wrote:
               | Do you and the GP have 350K GPUs and quality data to
               | reproduce 1:1 whatever Facebook releases in their repos?
               | 
               | Even if you want to reproduce the model and they give you
               | the data, you would need to do this at Facebook scale, so
               | you and the GP are just making moot points all around.
               | 
               | https://about.fb.com/news/2023/05/metas-infrastructure-
               | for-a...
               | 
               | https://www.theregister.com/2024/01/20/metas_ai_plans/
               | 
               | The fact that these models are coming from Meta in the
               | open, rather than Google which releases only papers with
               | _no_ model, tells me that Meta's models are open enough
               | for everyone to use.
               | 
               | Besides, everyone using the Pytorch framework benefits
               | Meta in the same way they were originally founded as a
               | company:
               | 
               | Network effects
               | 
               | It's relevant.
        
               | Philpax wrote:
               | There are organisations that are capable of reproduction
               | (e.g. EleutherAI), but yes, you're right, not having the
               | data is largely irrelevant for most users.
               | 
               | The thing that bothers me more is that it's not actually
               | an open-source licence; there are restrictions on what
               | you can do with it, and whatever you do with the model is
               | subject to those restrictions. It's still very useful and
               | I'm not opposed to them releasing it under that licence
               | (they need to recoup the costs somehow), but "open-
               | source" (or even "open") it is not.
        
         | WhackyIdeas wrote:
         | Giving so much to the open source LLM scene does seem like the
         | nicest thing Facebook have ever done. I know that it might have
         | been started by a leaker, but they have given so much
         | voluntarily. I mean, don't get me wrong, I don't like the
         | company, but I do really like some of the choices they have
         | made recently.
         | 
         | But I do wonder in the back of my mind why. I should be
         | suspicious of their angle and I will keep thinking about it. Is
         | it paranoid to think that maybe their angle is embedding almost
         | a kind of metadata - a style of generated code unique to
         | different machines - so that they can trace generated code to
         | different people? Is that their angle, or am I biased by
         | remembering who they have been for the past decade?
        
         | cm2012 wrote:
         | Cambridge Analytica is not a real scandal (did not affect any
         | elections), and FB did not cause a genocide in Myanmar (they
         | were a popular message board during a genocide, which is not
         | the same thing)
        
       | ramshanker wrote:
       | Are these trained on internal code bases or just the public
       | repositories?
        
         | albertzeyer wrote:
         | Without knowing any details, I'm almost sure that they did not
         | train it on internal code, as it might be possible to reveal
         | that code otherwise (given the right prompt, or just looking at
         | the weights).
        
         | make3 wrote:
         | people are able to extract training data from llms with
         | different methods, so there's no way this was trained on
         | internal code
        
         | eigenvalue wrote:
         | Would be a really bad idea to train on internal code I would
         | think. Besides, there is no shortage of open source code (even
         | open source created by Meta) out there.
        
           | changoplatanero wrote:
           | Correct that it's a bad idea to train on internal code.
           | However, surprisingly, there is a shortage of open source code.
           | These models are trained on substantially all the available
           | open source code that these companies can get their hands on.
        
         | andy99 wrote:
         | The github [0] hasn't been fully updated, but it links to a
         | paper [1] that describes how the smaller code llama models were
         | trained. It would be a good guess that this model is similar.
         | 
         | [0] https://github.com/facebookresearch/codellama [1]
         | https://ai.meta.com/research/publications/code-llama-open-fo...
        
       | Ninjinka wrote:
       | Benchmarks?
        
       | Havoc wrote:
       | Not sure who this is aimed at? The avg programmer probably
       | doesn't have the gear on hand to run this at the required pace
       | 
       | Cool nonetheless
        
         | Spivak wrote:
         | People who want to host the models, presumably; AWS Bedrock will
         | def include it.
        
         | blackoil wrote:
         | How feasible would it be to fine-tune using internal code and
         | have an enterprise copilot?
        
           | keriati1 wrote:
           | We actually already run an in-house Ollama server prototype
           | for coding assistance with DeepSeek Coder, and it is pretty
           | good. Now if we could get a model for this that is on GPT-4
           | level, I would be super happy.
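           | 
           | Roughly, the pattern looks like this - a minimal sketch, not
           | our exact config; the model tag, host and prompt are just
           | examples:
           | 
           |     # Query a self-hosted Ollama server for a completion.
           |     import requests
           | 
           |     resp = requests.post(
           |         "http://localhost:11434/api/generate",
           |         json={
           |             "model": "deepseek-coder:6.7b",
           |             "prompt": "# Parse an ISO-8601 date in Python\n",
           |             "stream": False,
           |         },
           |         timeout=120,
           |     )
           |     print(resp.json()["response"])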
        
             | eurekin wrote:
             | Did you finetune a model?
        
               | keriati1 wrote:
               | No, we went with a RAG pipeline approach, as we assume
               | things change too fast.
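               | 
               | In broad strokes it's the usual pattern: chunk the repo,
               | embed the chunks, retrieve the closest ones and put them
               | into the prompt. A toy sketch (illustrative only, not our
               | actual pipeline; the chunk size, embedder and paths are
               | made up):
               | 
               |     from pathlib import Path
               |     import numpy as np
               |     from sentence_transformers import SentenceTransformer
               | 
               |     embedder = SentenceTransformer("all-MiniLM-L6-v2")
               | 
               |     # 1. Chunk: naive fixed-size chunks per source file.
               |     chunks = []
               |     for path in Path("src").rglob("*.py"):
               |         text = path.read_text(errors="ignore")
               |         chunks += [text[i:i + 1200]
               |                    for i in range(0, len(text), 1200)]
               | 
               |     # 2. Embed once; retrieve nearest chunks per query.
               |     vecs = embedder.encode(chunks,
               |                            normalize_embeddings=True)
               | 
               |     def build_prompt(question, k=4):
               |         q = embedder.encode([question],
               |                             normalize_embeddings=True)[0]
               |         top = np.argsort(vecs @ q)[-k:]
               |         ctx = "\n\n".join(chunks[i] for i in top)
               |         return ("Use this code as context:\n" + ctx +
               |                 "\n\nQuestion: " + question)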
        
               | eurekin wrote:
               | Thanks! Any details on how you chunk and find the relevant
               | code?
               | 
               | Or how you deal with context length? I.e. do you send
               | anything other than the current file? How is the prompt
               | constructed?
        
           | lozenge wrote:
           | Considering a number of SaaS products offer this service, I'd
           | say it's feasible.
        
           | jejeyyy77 wrote:
           | already been done
        
         | connorgutman wrote:
         | This is targeted towards GPU rental services like RunPod as
         | well as API providers such as Together AI. Together.ai is
         | charging $0.90/1M tokens at 70B parameters.
         | https://www.together.ai/pricing
        
         | ttul wrote:
         | Yeah, but if your company wants to rent an H100, you can deploy
         | this for your developers for much less than the cost of a
         | developer...
        
         | moyix wrote:
         | You can run it on a Macbook M1/M2 with 64GB of RAM.
        
           | 2OEH8eoCRo0 wrote:
           | How? It's larger than 64GB.
        
             | coder543 wrote:
             | Quantization is highly effective at reducing memory and
             | storage requirements, and it barely has any impact on
             | quality (unless you take it to the extreme). Approximately
             | no one should ever be running the full fat fp16 models
             | during inference of any of these LLMs. That would be
             | _incredibly_ inefficient.
             | 
             | I run 33B parameter models on my RTX 3090 (24GB VRAM) no
             | problem. 70B should _easily_ fit into 64GB of RAM.
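             | 
             | For example, with llama.cpp's Python bindings - a sketch,
             | assuming you've already downloaded a 4-bit GGUF; the file
             | name and layer count are placeholders:
             | 
             |     # Run a quantized model, offloading what fits in VRAM.
             |     from llama_cpp import Llama
             | 
             |     llm = Llama(
             |         model_path="codellama-70b-instruct.Q4_K_M.gguf",
             |         n_gpu_layers=35,  # the rest stays in system RAM
             |         n_ctx=4096,
             |     )
             |     out = llm("Write a C function that reverses a string.",
             |               max_tokens=256)
             |     print(out["choices"][0]["text"])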
        
               | 2OEH8eoCRo0 wrote:
               | I'm aware but is it still LLaMA 70B at that point?
        
               | coder543 wrote:
               | Yes. Quantization does not reduce the number of
               | parameters. It does not re-train the model.
        
               | andy99 wrote:
               | It's a legit question, the model will be worse in some
               | way... I've seen it discussed that all things being equal
               | more parameters is better (meaning it's better to take a
               | big model and quantize it to fit in memory than use a
               | smaller unquantized model that fits), but a quantized
               | model wouldn't be expected to run identically to or as
               | well as the full model.
        
               | coder543 wrote:
               | You don't stop being andy99 just because you're a little
               | tired, do you? Being tired makes everyone a little less
               | capable at most things. Sometimes, a lot less capable.
               | 
               | In traditional software, the same program compiled for
               | 32-bit and 64-bit architectures won't be able to handle
               | all of the same inputs, because the 32-bit version is
               | limited by the available address space. It's still the
               | same program.
               | 
               | If we're not willing to declare that you are a completely
               | separate person when you're tired, or that 32-bit and
               | 64-bit versions are completely different programs, then I
               | don't think it's worth getting overly philosophical about
               | quantization. A quantized model is still the same model.
               | 
               | The quality loss from using 4+ bit quantization is
               | minimal, in my experience.
               | 
               | Yes, it has a _small_ impact on accuracy, but with
               | massive efficiency gains. I don't really think anyone
               | should be running the full models outside of research in
               | the first place. If anything, the quantized models should
               | be considered the "real" models, and the full fp16/fp32
               | model should just be considered a research artifact
               | distinct from the model. But this philosophical rabbit
               | hole doesn't seem to lead anywhere interesting to me.
               | 
               | Various papers have shown that 4-bit quantization is a
               | great balance. One example:
               | https://arxiv.org/pdf/2212.09720.pdf
        
               | cjbprime wrote:
               | I don't like the metaphor: when I'm tired, I will be
               | alert again later. Quantization is lossy compression: the
               | human equivalent would be more like a traumatic brain
               | injury affecting recall, especially of fine details.
               | 
               | The question of whether I am still me after a traumatic
               | brain injury is philosophically unclear, and likely
               | depends on specifics about the extent of the deficits.
        
               | coder543 wrote:
               | The impact on accuracy is somewhere in the single-digit
               | percentages at 4-bit quantization, from what I've been
               | able to gather. Very small impact. To draw the analogy
               | out further, if the model was able to get an A on a test
               | before quantization, it would likely still get a B at
               | worst afterwards, given a drop in the score of less than
               | 10%. Depending on the task, the measured impact could
               | even be negligible.
               | 
               | It's _far_ more similar to the model being perpetually
               | tired than it is to a TBI.
               | 
               | You may nitpick the analogy, but analogies are _never_
               | exact. You also ignore the other piece that I pointed
               | out, which is how we treat other software that comes in
               | multiple slightly different forms.
        
               | berniedurfee wrote:
               | Reminds me of the never ending MP3 vs FLAC argument.
               | 
               | The difference can be measured empirically, but is it
               | noticeable in real-world usage?
        
               | jameshart wrote:
               | But we're talking about a coding LLM here. A single digit
               | percentage reduction in accuracy means, what, one or two
               | times in a hundred, it writes == instead of !=?
        
               | coder543 wrote:
               | I think that's too simplified. The best LLMs will still
               | frequently make mistakes. Meta is advertising a HumanEval
               | score of 67.8%. In a third of cases, the code generated
               | still doesn't satisfactorily solve the problem in that
               | automated benchmark. The additional errors that
               | quantization would introduce would only be a very small
               | percentage of the overall errors, making the quantized
               | and unquantized models practically indistinguishable to a
               | human observer. Beyond that, lower accuracy can manifest
               | in many ways, and "do the opposite" seems unlikely to be
               | the most common way. There might be a dozen correct ways
               | to solve a problem. The quantized model might choose a
               | different path that still turns out to work, it's just
               | not exactly the same path.
               | 
               | As someone else pointed out, FLAC is objectively more
               | accurate than mp3, but how many people can really tell?
               | Is it worth 3x the data to store/stream music in FLAC?
               | 
               | The quantized model would run at probably 4x the speed of
               | the unquantized model, assuming you had enough memory to
               | choose between them. Is speed worth nothing? If I have to
               | wait all day for the LLM to respond, I can probably do
               | the work faster myself without its help. Is being able to
               | fit the model onto the hardware you have worth nothing?
               | 
               | In essence, quantization here is a 95% "accurate"
               | implementation of a 67% accurate model, which yields a
               | 300% increase in speed while using just 25% of the RAM.
               | All numbers are approximate, even the HumanEval benchmark
               | should be taken with a large grain of salt.
               | 
               | If you have a very opulent computational experience, you
               | can enjoy the luxury of the full 67.8% accurate model,
               | but that just feels both wasteful and like a bad user
               | experience.
        
               | manmal wrote:
               | Sure, quantization reduces information stored for each
               | parameter, not the parameter count.
        
               | sbrother wrote:
               | Can I ask how many tok/s you're getting on that setup?
               | I'm trying to decide whether to invest in a high-end
               | NVIDIA setup or a Mac Studio with llama.cpp for the
               | purposes of running LLMs like this one locally.
        
               | coder543 wrote:
               | On a 33B model at q4_0 quantization, I'm seeing about 36
               | tokens/s on the RTX 3090 with all layers offloaded to the
               | GPU.
               | 
               | Mixtral runs at about 43 tokens/s at q3_K_S with all
               | layers offloaded. I normally avoid going below 4-bit
               | quantization, but Mixtral doesn't seem fazed. I'm not
               | sure if the MoE just makes it more resilient to
               | quantization, or what the deal is. If I run it at q4_0,
               | then it runs at about 24 tokens/s, with 26 out of 33
               | layers offloaded, which is still perfectly usable, but I
               | don't usually see the need with Mixtral.
               | 
               | Ollama dynamically adjusts the layers offloaded based on
               | the model and context size, so if I need to run with a
               | larger context window, that reduces the number of layers
               | that will fit on the GPU and that impacts performance,
               | but things generally work well.
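               | 
               | If you want to measure this yourself, Ollama reports the
               | generation stats in its API response - a quick sketch
               | (the model tag is just an example):
               | 
               |     # Compute tokens/s from Ollama's generate endpoint.
               |     import requests
               | 
               |     r = requests.post(
               |         "http://localhost:11434/api/generate",
               |         json={"model": "codellama:70b",
               |               "prompt": "Explain quicksort.",
               |               "stream": False},
               |         timeout=600,
               |     ).json()
               | 
               |     # eval_duration is reported in nanoseconds
               |     print(r["eval_count"] / (r["eval_duration"] / 1e9),
               |           "tokens/s")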
        
               | sbrother wrote:
               | Thanks! That is really fast for personal use.
        
               | sorenjan wrote:
               | What's the power consumption and fan noise like when
               | doing that? I assume you're running the model doing
               | inference in the background for the whole coding session,
               | i.e. hours at a time?
        
               | coder543 wrote:
               | I don't use local LLMs for CoPilot-like functionality,
               | but I have toyed with the concept.
               | 
               | There are a few things to keep in mind: no programmer
               | that I know is sitting there typing code for hours at a
               | time without stopping. There's a lot more to being a
               | developer than just typing, whether it is debugging,
               | thinking, JIRA, Slack, or whatever else. These CoPilot-
               | like tools will only activate after you type something,
               | then stop for a defined timeout period. While you're
               | typing, they do nothing. After they generate, they do
               | nothing.
               | 
               | I would honestly be surprised if the GPU active time was
               | more than 10% averaged over an hour. When actively
               | working on a large LLM, the RTX 3090 is drawing close to
               | 400W in my desktop. At a 10% duty cycle (active time),
               | that would be 40W on average, which would be 320Wh over
               | the course of a full 8-hour day of crazy productivity. My
               | electric rate is about 15¢/kWh, so that would be about
               | 5¢ per day. It is _absolutely not_ running at a 100%
               | duty cycle, and it's absurd to even do the math for that,
               | but we can multiply by 10 and say that if you're somehow
               | a mythical "10x developer" then it would be 50¢/day in
               | electricity here. I think 5¢/day to 10¢/day is closer
               | to reality. Either way, the cost is marginal at the scale
               | of a software developer's salary.
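               | 
               | Back-of-the-envelope, with the same numbers as above:
               | 
               |     # Rough daily electricity cost for local completions.
               |     gpu_watts = 400      # RTX 3090 under load
               |     duty_cycle = 0.10    # fraction of the day it's busy
               |     hours = 8
               |     rate_per_kwh = 0.15  # dollars
               | 
               |     kwh = gpu_watts * duty_cycle * hours / 1000  # 0.32
               |     print(f"${kwh * rate_per_kwh:.3f} per day")  # ~$0.05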
        
               | sorenjan wrote:
               | That sounds perfectly reasonable. I'm more worried about
               | noise and heat than the cost though, but I guess that's
               | not too bad either then. What's the latency like? When
               | I've used generative image models the programs unload the
               | model after they're done, so it takes a while to generate
               | the next image. Is the model sitting in VRAM when it's
               | idle?
        
               | coder543 wrote:
               | Fan noise isn't very much, and you can always limit the
               | max clockspeeds on a GPU (and/or undervolt it) to be
               | quieter and more efficient at a cost of a small amount of
               | performance. The RTX 3090 still _seems_ to be faster than
               | the M3 Max for LLMs that fit on the 3090, so giving up a
               | little performance for near-silent operation wouldn't be
               | a big loss.
               | 
               | Ollama caches the last used model in memory for a few
               | minutes, then unloads it if it hasn't been used in that
               | time to free up VRAM. I think they're working on making
               | this period configurable.
               | 
               | Latency is very good in my experience, but I haven't used
               | the local code completion stuff much, just a few quick
               | experiments on personal projects, so my experience with
               | that aspect is limited. If I ever have a job that
               | encourages me to use my own LLM server, I would certainly
               | consider using it more for that.
        
               | nullstyle wrote:
               | Here's an example of megadolphin running on my m2 ultra
               | setup: https://gist.github.com/nullstyle/a9b68991128fd4be
               | 84ffe8435f...
        
               | int_19h wrote:
               | I run LLaMA 70B and 120B (frankenmerges) locally on a
               | 2022 Mac Studio with an M1 Ultra and 128GB RAM. It gives ~7
               | tok/s for 120B and ~9.5 tok/s for 70B.
               | 
               | Note that the M1/M2 Ultra is quite a bit faster than the
               | M3 Max, mostly due to 800 GB/s vs 400 GB/s memory
               | bandwidth.
        
             | rgbrgb wrote:
             | Quantization can take it under 30GB (with quality
             | degradation).
             | 
             | For example, take a look at the GGUF file sizes here:
             | https://huggingface.co/TheBloke/Llama-2-70B-GGUF
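             | 
             | A sketch for grabbing one of those files programmatically
             | (the exact filename is an assumption - check the repo's
             | file list):
             | 
             |     # Pull a single quantized GGUF from the repo above.
             |     from huggingface_hub import hf_hub_download
             | 
             |     path = hf_hub_download(
             |         repo_id="TheBloke/Llama-2-70B-GGUF",
             |         filename="llama-2-70b.Q4_K_M.gguf",  # ~41GB, 4-bit
             |     )
             |     print(path)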
        
           | reddit_clone wrote:
           | I am not too familiar with LLMs and GPUs (Not a gamer
           | either). But want to learn.
           | 
           | Could you please expand on what else would be capable of
           | running such models locally?
           | 
           | How about a linux laptop/desktop with specific hardware
           | configuration?
        
             | MeImCounting wrote:
             | It pretty much comes down to two factors: memory bandwidth
             | and compute. You need high enough memory bandwidth to be
             | able to "feed" the compute, and you need beefy enough
             | compute to be able to keep up with the data that is being
             | fed in by the memory. In theory a single Nvidia 4090 would
             | be able to run a 70b model with quantization at "useable"
             | speeds. The reason Mac hardware is so capable in AI is the
             | unified memory architecture, meaning the memory is shared
             | across the GPU and CPU. There are other factors but it
             | essentially comes down to tokens-per-second advantages. You
             | could run one of these models on an old GPU with low memory
             | bandwidth just fine, but your tokens per second would be
             | far too slow for what most people consider "useable" and
             | the quantization necessary might start noticeably affecting
             | the quality.
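             | 
             | A rough way to see why bandwidth dominates: each generated
             | token has to stream essentially the whole (quantized) model
             | through memory once, so bandwidth divided by model size is
             | a decent upper bound on tokens per second. Illustrative
             | numbers only:
             | 
             |     # Upper bound on generation speed from memory bandwidth.
             |     model_bytes = 70e9 * 0.5   # 70B params at ~4 bits each
             | 
             |     for name, bw in [("RTX 4090 (~1000 GB/s)", 1000e9),
             |                      ("M2 Ultra (~800 GB/s)", 800e9),
             |                      ("Desktop DDR5 (~80 GB/s)", 80e9)]:
             |         print(f"{name}: ~{bw / model_bytes:.0f} tok/s max")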
        
               | int_19h wrote:
               | A single RTX 4090 can run at most 34b models with 4-bit
               | quantization. You'd need 2-bit for 70b, and at that point
               | quality plummets.
               | 
               | Compute is actually not that big of a deal once
               | generation is ongoing, compared to memory bandwidth. But
               | the initial prompt processing can easily be an order of
               | magnitude slower on CPU, so for large prompts (which
               | would be the case for code completion), acceleration is
               | necessary.
        
         | svara wrote:
         | It's aimed at OpenAI's moat. Making sure they don't accumulate
         | too much of one. No one actually has to use this, it just needs
         | to be clear that LLM as a service won't be super high margin
         | because competition can simply start building on Meta's open
         | source releases.
        
           | FrustratedMonky wrote:
           | So, strange as it seems, is Meta being more 'open' than
           | OpenAI, which was created to be the 'open' option to fight
           | off Meta and Google?
        
             | holoduke wrote:
             | Meta is becoming the good guy. It's actually a smart move.
             | Some extra reputation points won't hurt Meta.
        
             | chasd00 wrote:
             | If Meta can turn the money making sauce in GenAI from
             | model+data to just data then it's in a very good position.
             | Meta has tons of data.
        
           | adventured wrote:
           | The moat is all but guaranteed to be the scale of the GPUs
           | required to operate these for a lot of users as they get ever
           | larger, specifically the extreme cost that is going along
           | with that.
           | 
           | Anybody have $10 billion sitting around to deploy that
           | gigantic open source set-up for millions of users? There's
           | your moat and only a relatively few companies will be able to
           | do it.
           | 
           | One of Google's moats is, has been, and will always be the
           | scale required to just get into the search game and the tens
           | of billions of dollars you need to compete in search
           | effectively (and that's before you get to competing with
           | their brand). Microsoft has spent over a hundred billion
           | dollars trying to compete with Google, and there's little
           | evidence anybody else has done better anywhere (Western
           | Europe hasn't done anything in search, there's Baidu out of
           | China, and Yandex out of Russia).
           | 
           | VRAM isn't moving nearly as fast as the models are
           | progressing in size. And it's never going to. The cost will
           | get ever greater to operate these at scale.
           | 
           | Unless someone sees a huge paradigm change for cheaper,
           | consumer accessible GPUs in the near future (Intel? AMD?
           | China?). As it is, Nvidia owns the market and they're part of
           | the moat cost problem.
        
             | staticman2 wrote:
             | >The moat is all but guaranteed to be the scale of the GPUs
             | required to operate these
             | 
             | You don't have to run them locally.
        
             | brucethemoose2 wrote:
             | > VRAM isn't moving nearly as fast as the models are
             | progressing in size. And it's never going to. The cost will
             | get ever greater to operate these at scale.
             | 
             | It will be, at least by 2025: AMD (and maybe Intel) will
             | have M-Pro-esque APUs that can run a 70B model at very
             | reasonable speeds.
             | 
             | I am pretty sure Intel is going to rock the VRAM boat on
             | desktops as well. They literally have no market to lose,
             | unlike AMD which infuriatingly still artificially segments
             | their high VRAM cards.
        
             | dragonwriter wrote:
             | > VRAM isn't moving nearly as fast as the models are
             | progressing in size.
             | 
             | Models of any given quality are declining in size (both
             | number of parameters, and also VRAM required for inference
             | per parameter because quantization methods are improving.)
        
             | alfalfasprout wrote:
             | and this is why the LLM arms race for ultra high parameter
             | count models will stagnate. It's all well and good that
             | we're developing interesting new models. But once you
             | factor cost into the equation it does severely limit what
             | applications justify the cost.
             | 
             | Raw FLOPs may increase each generation but VRAM becomes a
             | limiting factor. And fast VRAM is expensive.
             | 
             | I do expect to see incremental innovation in reducing the
             | size of foundational models.
        
             | KaiserPro wrote:
             | > The moat is all but guaranteed to be the scale of the
             | GPUs required to operate these for a lot of users
             | 
             | for end users, yes. For small companies that want to
             | finetune, evaluate and create derivatives, it reduces the
             | cost by millions.
        
         | kungfupawnda wrote:
         | I got it to build and run the example app on my M3 Max with
         | 36GB RAM. Memory pressure was around 32GB.
        
           | dimask wrote:
           | Did you quantise it? If so, at what level, and what was your
           | impression compared to other recent smaller models at that
           | quantisation?
        
             | kungfupawnda wrote:
             | No I just ran it out of the box but I had to modify the
             | source code to run for Mac.
             | 
             | Instructions here:
             | https://github.com/facebookresearch/llama/pull/947/
        
         | dimask wrote:
         | There are companies like Phind that offer copilot-like services
         | using finetuned versions of CodeLlama-34B, which imo are
         | actually good. But I do not know if a model this large is gonna
         | be used in such a context.
        
       | b33j0r wrote:
       | There is a bait and switch going on, and Sam Altman or Mark
       | Zuckerberg would be the first to tell you.
       | 
       | "No one can compete with us, but it's cute to try! Make
       | applications though" --almost direct quote from Sam Altman.
       | 
       | I have 64GB and an RTX 3090 and a MacBook M3, and I already can't
       | run a lot of the newest models even in their quantized form.
       | 
       | The business model requires this to be a subscription service. At
       | least as of today...
        
         | kuczmama wrote:
         | Realistically, what hardware would be required to run this? I
         | assumed an RTX 3090 would be enough?
        
           | summarity wrote:
           | A Mac Studio with an M1 Ultra (about 2800 USD used) is
           | actually a really cost-effective way to run it. Its total
           | system power consumption is really low, even spitting out
           | tokens at full tilt (<250W).
        
           | kkzz99 wrote:
           | The RTX 3090 has 24GB of memory; a quantized llama70b takes
           | around 60GB of memory. You can offload a few layers onto the
           | GPU, but most of them will run on the CPU at terrible
           | speeds.
        
             | nullc wrote:
             | You're not required to put the whole model in a single GPU.
             | 
             | You can buy a 24GB gpu for $150-ish (P40).
        
               | kuczmama wrote:
               | Wow that's a really good idea. I could potentially buy 4
               | Nvidia P40's for the same price as a 3090 and run
               | inference on pretty much any model I want.
        
               | eurekin wrote:
               | Just make sure you're comfortable with manually compiling
               | bitsandbytes and generally combining a software stack of
               | almost-out-of-date libraries.
        
               | kuczmama wrote:
               | That's a good point. Are you referring to the out of date
               | cuda libraries?
        
               | eurekin wrote:
               | I don't remember exactly (either CUDA directly or the
               | cuDNN version used by FlashAttention)... Anyway,
               | /r/localLlama has a few instances of such builds. Might
               | be really worthwhile looking that up before buying.
        
               | nullc wrote:
               | The P40 still works with CUDA 12.2 at the moment. I used
               | to use K80s (which I think I paid like $50 for!), which
               | turned into a huge mess of dealing with older libraries,
               | especially since essentially all ML stuff is on a crazy
               | upgrade cadence with everything constantly breaking even
               | without having to deal with orphaned old software.
               | 
               | You can get GPU server chassis that have 10 PCIe slots
               | too, for around $2k on eBay. But note that there is a
               | hardware limitation on the PCIe cards such that each
               | card can only directly communicate with 8 others at a
               | time. Beware, they're LOUD even by the standards of
               | server hardware.
               | 
               | Oh, also the Nvidia Tesla power connectors have CPU-
               | connector-like polarity instead of PCIe, so at least in
               | my chassis I needed to adapt them.
               | 
               | Also keep in mind that if you aren't using a special GPU
               | chassis, the Tesla cards don't have fans, so you have to
               | provide cooling.
        
             | trentnelson wrote:
             | Can that be split across multiple GPUs? i.e. what if I have
             | 4xV100-DGXS-32GBs?
        
           | michaelt wrote:
           | You can run a similarly sized model - Llama 2 70B - at the
           | 'Q4_K_M' quantisation level, with 44 GB of memory [1]. So you
           | can just about fit it on 2x RTX 3090 (which you can buy,
           | used, for around $1100 each)
           | 
           | Of course, you can buy quite a lot of hosted model API access
           | or cloud GPU time for that money.
           | 
           | [1] https://huggingface.co/TheBloke/Llama-2-70B-GGUF
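           | 
           | With llama.cpp you can split the weights across the two
           | cards - a sketch; the filename and the even split are
           | placeholders:
           | 
           |     # Spread a Q4_K_M 70B model across two 24GB GPUs.
           |     from llama_cpp import Llama
           | 
           |     llm = Llama(
           |         model_path="llama-2-70b.Q4_K_M.gguf",
           |         n_gpu_layers=-1,          # offload every layer
           |         tensor_split=[0.5, 0.5],  # split across GPU 0 and 1
           |         n_ctx=4096,
           |     )
           |     out = llm("def fib(n):", max_tokens=64)
           |     print(out["choices"][0]["text"])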
        
         | ttul wrote:
         | A 70B model is quite accessible; just rent a data center GPU
         | hourly. There are easy deployment services that are getting
         | better all the time. Smaller models can be derived from the big
         | ones to run on a MacBook running Apple Silicon. While the
         | compute won't be a match for Nvidia hardware, a MacBook can
         | pack 128GB of RAM and run enormous models - albeit slowly.
        
           | moyix wrote:
           | My Macbook has a mere 64GB and that's plenty to run 70B
           | models at 4-bit :) LM Studio is very nice for this.
        
           | b33j0r wrote:
           | Ok, well now that we've downvoted me below the visibility
           | threshold, I was being sincere. And Altman did say that. I am
           | not a hater.
           | 
           | So. Maybe we could help other people figure out why VRAM is
           | maxing out. I think it has to do with various new platforms
           | leaking memory.
           | 
           | In my case, I suspect ollama and diffusers are not actually
           | evicting VRAM. nvidia-smi shows it in one case, but I haven't
           | figured it out yet.
           | 
           | Hey, my point remains. The models are going to get too
           | expensive for me, personally, to run locally. I suspect we'll
           | default into subscriptions to APIs because the upgrade slope
           | is too steep.
        
         | YetAnotherNick wrote:
         | > At least as of today...
         | 
         | This is the exact opposite of a bait and switch. The current
         | model can't be un-open-sourced, and over time it will only
         | become easier to run it.
         | 
         | Also, unless there is reason to believe that prompt engineering
         | of different model families is very different (which honestly I
         | don't believe), there is no baiting effect. I believe it will
         | always be the case that the best 2-3 models will be closed
         | weights.
        
         | whimsicalism wrote:
         | the connection to the classic bait-and-switch seems tenuous at
         | best
        
       | bk146 wrote:
       | Can someone explain Meta's strategy with the open source models
       | here? Genuine question, I don't fully understand.
       | 
       | (Please don't say "commoditize your complement" without
       | explaining what exactly they're commoditizing...)
        
         | pyinstallwoes wrote:
         | To be crowned the harbinger of AGI.
        
         | apples_oranges wrote:
         | OT: You "don't" or you "don't fully" understand? ;)
         | 
         | (I try to train myself to say it right ..)
        
         | eurekin wrote:
         | Total speculation: Yann LeCun is there and he is really
         | passionate about the technology and openness
        
           | observationist wrote:
           | The faux-open models mean the models can't be used in
           | competing products. The open code base means enthusiasts and
           | amateurs and other people hack on Meta projects and
           | contribute improvements.
           | 
           | They get free R&D and suppress competition, while looking
           | like they have principles. Yann is clueless about open source
           | principles, or the models would have been Apache or some
           | other comparably open license. It's all ruthless corporate
           | strategy, regardless of the mouth noises coming out of
           | various meta employees.
        
             | importantbrian wrote:
             | Meta's choice of license doesn't indicate that Yann is
             | clueless about open-source principles. I don't know about
             | Meta specifically, but in most companies choosing the
             | license for open source projects involves working with a
             | lot of different stakeholders. He very easily could have
             | pushed for Apache or MIT and some other interest group
             | within Meta vetoed it.
        
             | sangnoir wrote:
             | > The faux-open models mean the models can't be used in
             | competing products.
             | 
             | Just because certain entities can't profitably use a
             | product or obtain a license doesn't make it not-open. AGPL
             | is open, for an extreme example.
             | 
             | This argument is also subjective, and not new - "Which is
             | more open, BSD-style licenses or GPL?" has been a
             | guaranteed flamewar starter for decades.
        
           | skottenborg wrote:
           | I doubt personal passions would merit the company funding
           | required for such big models.
        
             | eurekin wrote:
               | Given how megacorps spend millions on a whim (Disney with
               | all the recent flops), or when just a single person wants
               | it (MS Flight Simulator?), I wouldn't be surprised, to be
               | honest...
               | 
               | But sure, that sounds more reasonable.
        
               | og_kalu wrote:
               | Disney didn't spend millions on a whim. It's just the
               | reality of box office that even millions in investment
               | are no guarantee for returns.
        
         | simonw wrote:
         | AI seems like the Next Big Thing. Meta have put themselves at
         | the center of the most exciting growth area in technology by
         | releasing models they have trained.
         | 
         | They've gained an incredible amount of influence and mindshare.
        
         | bryan_w wrote:
         | Part of it is that they already had this developed for years
         | (see alt text on uploaded images for example), and they want to
         | ensure that new regulations don't hamper any of their future
         | plans.
         | 
         | It costs them nothing to open it up, so why not. Kinda like all
         | the rest of their GitHub repos.
        
         | gen220 wrote:
         | They're commoditizing the ability to generate viral content,
         | which is the carrot that keeps people's eyeballs on the hedonic
         | treadmill. More eyeball-time = more ad placements = more money.
         | 
         | On the advertiser side, they're commoditizing the ability for
         | companies to write more persuasively-targeted ads. Higher
         | click-through rates = more money.
         | 
         | [edit]: For models that generate code instead of content (TFA),
         | it's obviously a different story. I don't have a good grip on
         | that story, beyond "they're using their otherwise-idle GPU
         | farms to buy goodwill and innovate on training methods".
        
           | esafak wrote:
           | That stuff ultimately drives people away. Who thinks "I need
           | my daily fix of genAI memes, let me head to Facebook!"?
        
         | Philpax wrote:
         | Aside from the "positive" explanations offered in the sibling
         | comments, there's also a "negative" one: other AI companies
         | that try to enter the fray will not be able to compete with
         | Meta's open offerings. After all, why would you pay a company
         | to undertake R&D on building their own models when you can just
         | finetune a Llama?
        
         | pchristensen wrote:
         | Meta doesn't have an AI "product" competing with OpenAI,
         | Google's Bard, etc. But they use AI extensively internally.
         | This is roughly a byproduct of their internal AI work that
         | they're already doing, and fostering open source AI development
         | puts incredible pressure on the AI products and their owners.
         | 
          | If Meta can help prevent an AI monopoly and instead foster an
          | ecosystem of comparable products, then they avoid having
          | another threatening tech giant competitor, as well as
          | preventing their own AI work and products from being devalued.
         | 
         | Think of it like Google releasing a web browser.
        
           | IshKebab wrote:
           | Google releasing a (very popular) web browser gives them
           | direct control of web standards. What does this give
           | Facebook?
        
             | eganist wrote:
              | OP already mentioned that it adds additional hurdles for
              | possible future tech giants to clear.
              | 
              | It's akin to a Great Filter, if such an analogy helps. If
              | Meta's open models make a company's closed models
              | uneconomical for others to consume, then the business case
              | for those models is compromised and the odds of them
              | growing to a size where they can compete with Meta in other
              | ways are reduced a bit.
        
             | patapong wrote:
             | I think we should not underestimate the strategic talent
             | acquisition value as well. Many top-tier AI engineers may
              | appreciate the openness and choose to join Meta, which
             | could be very valuable in the long run.
        
               | jwkane wrote:
               | Excellent point -- goodwill in a hyper-high demand dev
               | community is invaluable.
        
             | fngjdflmdflg wrote:
             | Web standards are probably the last thing Google cares
             | about with Chrome. Much more important is being the default
             | search engine and making sure data collection isn't
             | interrupted by a potential privacy minded browser.
        
         | andy99 wrote:
         | I think a big part of it is just because they have a big AI
         | lab. I don't know the genesis of that, but it has for years
          | been a big contributor - see PyTorch, models like SEER - as well
         | as being one of the dominant publishers at big conferences.
         | 
          | Maybe now their leadership wants to push for practicality so
          | they don't end up like Google (also a research powerhouse, but
          | one failing to convert its research into popular advances),
          | which is why they are publicly pushing strong LLMs.
        
         | Lerc wrote:
          | If they hadn't opened the models, the Llama series would just be
          | a few sub-GPT-4 models. Opening the models has created a wealth
         | of development that has built upon those models.
         | 
         | Alone, it was unlikely they would become a major player in a
         | field that might be massively important. With a large community
         | building upon their base they have a chance to influence the
         | direction of development and possibly prevent a proprietary
         | monopoly in the hands of another company.
        
         | datadrivenangel wrote:
          | AI puts pressure on search, cutting into Google's ad revenue.
          | Meta's properties are less exposed to that kind of pressure.
        
         | theGnuMe wrote:
          | Bill Gurley has a good perspective on it.
         | 
         | Essentially, you mitigate IP claims and reduce vendor
         | dependency.
         | 
         | https://eightify.app/summary/technology-and-software/the-imp...
        
         | crowcroft wrote:
          | Meta's end goal is to have better AI than everyone else; in the
          | medium term that means they want to have the best foundational
          | models. How does this help?
          | 
          | 1. They become an attractive place for AI researchers to work,
          | and can bring in better staff.
          | 
          | 2. They make it less appealing for startups to enter the space
          | and build large foundation models (Meta would prefer 1,000
          | startups pop up and play around with other people's models
          | than 1,000 startups popping up and trying to build better
          | foundational models).
          | 
          | 3. They put cost pressure on AI-as-a-service providers. When
          | Llama exists, it's harder for companies to make a profit just
          | selling access to models. Along with 2, this further limits
          | the possibility of startups entering the foundational model
          | space, because the path to monetization/breakeven is more
          | difficult.
         | 
         | Essentially this puts Meta, Google, and OpenAI/Microsoft
         | (Anthropic/Amazon as a number four maybe) as the only real
         | players in the cutting edge foundational model space. Worst
         | case scenario they maintain their place in the current tech
         | hegemony as newcomers are blocked from competing.
        
           | siquick wrote:
           | > Essentially this puts Meta, Google, and OpenAI/Microsoft
           | (Anthropic/Amazon as a number four maybe) as the only real
           | players in the cutting edge foundational model space.
           | 
           | Mistral is right up there.
        
             | yodsanklai wrote:
             | Mistral has ~20 employees. I'm sure they have good
             | researchers, but don't they lack the computing and
             | engineering resources the big actors have?
        
             | crowcroft wrote:
              | I'm curious to see how they go; I might have a limited
              | understanding. From what I can tell they do a good job in
              | terms of value and efficiency with 'lighter' models, but I
              | don't put them in the same category as the others, in the
              | sense that they aren't producing the absolute best-in-class
              | massive LLMs.
             | 
             | Hopefully they can prove me wrong though!
        
         | Calvin02 wrote:
         | Controversial take:
         | 
         | Meta sees this as the way to improve their AI offerings faster
         | than others and, eventually, better than others.
         | 
          | Instead of a small group of engineers working on this inside
          | Meta, the open-source community helps improve it.
          | 
          | They have a history of this with React, PyTorch, HHVM, etc. All
          | of these have gotten better as open-source projects faster than
          | Meta alone would have managed.
        
         | emporas wrote:
          | Yann LeCun has talked about Meta's strategy with open source.
          | The general idea is that the smartest people in the world do
          | not work for you. No company can replicate innovation from open
          | source internally.
        
           | yodsanklai wrote:
            | > The general idea is that the smartest people in the world
            | do not work for you
           | 
           | Most likely, they work for your competitors. They may not be
           | working to improve your system for free.
           | 
           | > No company can replicate innovation from open source
           | internally.
           | 
            | A lot of innovation does come from companies.
        
         | flir wrote:
         | Really enjoying how many different answers you got.
         | 
         | (My theory: _if_ there 's an AI pot of gold, what megacorp can
         | risk one of the others getting to it first?)
        
         | Too wrote:
          | Meta still sits on all the juicy user data that they want to
          | use AI on but don't know how. They are crowdsourcing the
          | development of applications and tooling.
          | 
          | Meta releases a model. Joe builds a cool app with it, earns
          | some internet points and, if lucky, a few hundred bucks. Meta
          | copies the app, multiplies Joe's success story across 1 billion
          | users, and earns a few million bucks.
         | 
         | Joe is happy, Meta is happy. Everybody is happy.
        
         | chasd00 wrote:
          | My opinion is that Meta is taking the model out of the
          | secret-sauce formula. That leaves hardware and training data as
          | the barriers to entry. If you don't need to develop your own
          | model, then all you need is data and hardware, which lowers the
          | barrier. The lower the barrier, the more GenAI startups, and
          | the more potential data customers for Meta, since they
          | certainly have large, curated datasets for sale.
        
       | edweis wrote:
        | How come a company as big as Meta still uses bit.ly?
        
         | nemothekid wrote:
         | What else would they use?
        
           | 3pt14159 wrote:
           | Something like meta.com/our_own_tech_handles_this
        
             | nemothekid wrote:
              | Not sure it's preferable to hire people at FB salaries to
              | maintain a link shortener rather than just use a reputable
              | free one.
        
               | junon wrote:
               | Every big company has one of these anyway, and usually
               | more involved (internal DNS, VPN, etc). A link shortener
               | is like an interview question.
        
           | transcriptase wrote:
           | fb.com seems like a reasonable choice.
        
           | Cthulhu_ wrote:
           | Their own?
        
           | huac wrote:
           | their own shortener, e.g. fb.me, presumably
        
         | geor9e wrote:
         | Ironically it doesn't help to use link shorteners on twitter
         | anyway - all URLs posted to twitter count as 23 characters. The
         | hypertext is the truncated original URL string, and the URL is
         | actually a t.co link.
        
         | esafak wrote:
         | Because this is a marketing channel. They handle tracking of
         | FB/IG messages by other means, intended for engineers.
        
         | kmeisthax wrote:
         | Not only that, the announcement is on Twitter, a company that
         | at least _used_ to be their biggest competitor. Old habits die
         | hard, huh?
        
       | simonw wrote:
       | Here's the model on Hugging Face:
       | https://huggingface.co/codellama/CodeLlama-70b-hf
        
         | israrkhan wrote:
         | I hope someone will soon post a quantized version that I can
         | run on my macbook pro.
        
       | theLiminator wrote:
       | Curious what's the current SOTA local copilot model? Are there
       | any extensions in vscode that give you a similar experience? I'd
       | love something more powerful than copilot for local use (I have a
       | 4090, so I should be able to run a decent number of models).
        
         | Eisenstein wrote:
          | When this 70B model gets quantized you should be able to run it
          | fine on your 4090. Check out 'TheBloke' on Hugging Face for the
          | quantized weights and llama.cpp to run the GGUF files.
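          | 
          | As a rough sketch (assuming the llama-cpp-python bindings and
          | one of TheBloke's quantized GGUF files; the file name and the
          | layer count below are just illustrative):
          | 
          |     # pip install llama-cpp-python  (built with CUDA support)
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="codellama-70b-instruct.Q4_K_M.gguf",
          |         n_gpu_layers=40,  # offload whatever fits in 24 GB VRAM
          |         n_ctx=4096,
          |     )
          |     out = llm("Write a Python function that reverses a string.",
          |               max_tokens=128)
          |     print(out["choices"][0]["text"])
          | 
          | The remaining layers run on the CPU, so it works, just slowly.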
        
           | coder543 wrote:
           | I think your take is a bit optimistic. I like quantization as
           | much as the next person, but even the 2-bit model won't fit
           | entirely on a 4090:
           | https://huggingface.co/TheBloke/Llama-2-70B-GGUF
           | 
           | I would be uncomfortable recommending less than 4-bit
           | quantization on a non-MoE model, which is ~40GB on a 70B
           | model.
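            | 
            | The back-of-envelope math (the bits-per-weight figures are
            | approximate, since llama.cpp quant formats carry metadata
            | overhead, and the KV cache needs a few more GB on top):
            | 
            |     params = 70e9
            |     for name, bpw in [("Q2_K", 3.4), ("Q4_K_M", 4.8),
            |                       ("Q8_0", 8.5)]:
            |         print(f"{name}: ~{params * bpw / 8 / 1e9:.0f} GB")
            |     # Q2_K: ~30 GB, Q4_K_M: ~42 GB, Q8_0: ~74 GB
            | 
            | So even the 2-bit variant is several GB over a 4090's 24 GB
            | before you even count the KV cache.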
        
             | nox101 wrote:
              | Fortunately it will run on my UMA Mac. It's made me curious
              | what the trade-offs are. Would I be better off with a 4090
              | or a Mac with 128+ GB of unified memory?
        
               | coder543 wrote:
               | Even the M3 Max seems to be slower than my 3090 for LLMs
               | that fit onto the 3090, but it's hard to find
               | comprehensive numbers. The primary advantage is that you
               | can spec out more memory with the M3 Max to fit larger
               | models, but with the exception of CodeLlama-70B today, it
               | really seems like the trend is for models to be getting
               | smaller and better, not bigger. Mixtral runs circles
               | around Llama2-70B and arguably ChatGPT-3.5. Mistral-7B
               | often seems fairly close to Llama2-70B.
               | 
               | Microsoft accidentally leaked that ChatGPT-3.5-Turbo is
               | apparently only 20B parameters.
               | 
               | 24GB of VRAM is enough to run ~33B parameter models, and
               | enough to run Mixtral (which is a MoE, which makes direct
               | comparisons to "traditional" LLMs a little more
               | confusing.)
               | 
               | I don't think there's a clear answer of what hardware
               | someone should get. It depends. Should you give up
               | performance on the models most people run locally in
               | hopes of running very large models, or give up the
               | ability to run very large models in favor of prioritizing
               | performance on the models that are popular and proven
               | today?
        
               | int_19h wrote:
                | The M3 Max is actually less than ideal because it peaks
                | at 400 GB/s of memory bandwidth. What you really want is
                | an M1 or M2 Ultra, which offers up to 800 GB/s (for
                | comparison, an RTX 3090 runs at 936 GB/s). A Mac Studio
                | suitable for running 70B models at speeds fast enough for
                | realtime chat can be had for ~$3K.
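                | 
                | For single-stream generation the rough ceiling is just
                | memory bandwidth divided by model size, since every
                | generated token has to stream all of the weights once
                | (real throughput lands below this):
                | 
                |     model_gb = 40  # ~4-bit 70B
                |     for name, bw in [("M3 Max", 400), ("M2 Ultra", 800),
                |                      ("RTX 3090", 936)]:
                |         print(f"{name}: ~{bw / model_gb:.0f} tok/s max")
                |     # M3 Max: ~10, M2 Ultra: ~20, RTX 3090: ~23
                |     # (a 3090 can't actually hold a 40 GB model, though)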
               | 
               | The downside of Apple's hardware at the moment is that
               | the training ecosystem is very much focused on CUDA;
               | llama.cpp has an open issue about Metal-accelerated
               | training:
               | https://github.com/ggerganov/llama.cpp/issues/3799 - but
               | no work on it so far. This is likely because training at
               | any significant sizes requires enough juice that it's
               | pretty much always better to do it in the cloud
               | currently, where, again, CUDA is the well-established
               | ecosystem, and it's cheaper and easier for datacenter
               | operators to scale. But, in principle, much faster
               | training on Apple hardware should be possible, and
               | eventually someone will get it done.
        
               | zten wrote:
               | Well, the workstation-class equivalent of a 4090 -- RTX
               | 6000 Ada -- has enough RAM to work with a quantized
                | model, but it'll blow anyone's budget at anywhere
               | between $7,000 and $10,000.
        
             | Eisenstein wrote:
             | The great thing about gguf is that it will cross to system
             | RAM if there isn't enough VRAM. It will be slower, but
             | waiting a couple minutes for a prompt response isn't the
             | worst thing if you are the type that would get use out of a
             | local 70b parameter model. Then again, one could have
             | grabbed 2x 3090s for the price of a 4090 and ended up with
             | 48gb of VRAM in exchange for a very tolerable performance
             | hit.
        
               | coder543 wrote:
               | > The great thing about gguf is that it will cross to
               | system RAM if there isn't enough VRAM.
               | 
               | No... that's not such a great thing. Helpful in a pinch,
               | but if you're not running at least 70% of your layers on
               | the GPU, then you barely get any benefit from the GPU in
               | my experience. The vast gulf in performance between the
               | CPU and GPU means that the GPU is just spinning its
               | wheels waiting on the CPU. Running half of a model on the
               | GPU is not useful.
               | 
               | > Then again, one could have grabbed 2x 3090s for the
               | price of a 4090 and ended up with 48gb of VRAM in
               | exchange for a very tolerable performance hit.
               | 
               | I agree with this, if someone has a desktop that can fit
               | two GPUs.
        
               | sp332 wrote:
               | The main benefit of a GPU in that case is much faster
               | prompt reading. Could be useful for Code Llama cases
               | where you want the model to read a lot of code and then
               | write a line or part of a line.
        
               | dimask wrote:
               | > The great thing about gguf is that it will cross to
               | system RAM if there isn't enough VRAM.
               | 
                | Then you can just run it entirely on the CPU. There is no
                | point in buying an expensive GPU to run LLMs only to be
                | bottlenecked by your CPU in the first place. Which is why
                | I don't get so excited about these huge models: they gain
                | less traction because fewer people can run them locally,
                | and finetuning is probably more costly too.
        
               | int_19h wrote:
               | GGUF is just a file format. The ability to offload some
               | layers to CPU is not specific to it nor to llama.cpp in
               | general - indeed, it was available before llama.cpp was
               | even a thing.
        
         | sfsylvester wrote:
          | This is a completely fair but open-ended question. Not to be a
          | typical HN user, but when you say "SOTA local", the question is
          | really which benchmarks you care about for evaluation: size,
          | operability, complexity, explainability, etc.
          | 
          | Working out which copilot models perform best has been a deep
          | exercise for me, and it has really made me examine my own
          | coding style, what I find important, and the things I look for
          | when investigating models and evaluating interview candidates.
         | 
          | I think the three benchmarks & leaderboards most people go to
          | are:
         | 
         | https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...
          | - which is the most understood, broad language-capability
          | leaderboard; it relies on well-understood evaluations and
          | benchmarks.
         | 
         | https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...
         | - Also comprehensive, but primarily assesses Python and
         | JavaScript.
         | 
         | https://evalplus.github.io/leaderboard.html - which I think is
         | a better take on comparing models you intend to run locally as
         | you can evaluate performance, operability and size in one
         | visualisation.
         | 
         | Best of luck and I would love to know which models & benchmarks
         | you choose and why.
        
           | theLiminator wrote:
           | I'm honestly more interested in anecdotes and I'm just
           | seeking anything that can be a drop-in copilot replacement
           | (that's subjectively better). Perhaps one major thing I'd
           | look for is improved understanding of the code in my own
           | workspace.
           | 
           | I honestly don't know what benchmarks to look at or even what
           | questions to be asking.
        
           | hackerlight wrote:
           | > https://huggingface.co/spaces/mike-ravkine/can-ai-code-
           | resul... - Also comprehensive, but primarily assesses Python
           | and JavaScript.
           | 
           | I wonder why they didn't use DeepSeek under the "senior"
           | interview test. I am curious to see how it stacks up there.
        
       | ahmednazir wrote:
        | Can someone explain why the big tech companies are racing to
        | release open-source models? If a model is free and open source,
        | then how will they earn money and how will they compete with
        | others?
        
         | jampekka wrote:
         | Commoditize your complement?
        
         | stainablesteel wrote:
         | they want to incentivize dependency
        
       | martingoodson wrote:
       | Baptiste Roziere gave a great talk about Code Llama at our meetup
       | recently: https://m.youtube.com/watch?v=_mhMi-7ONWQ
       | 
       | I highly recommend watching it.
        
       | doctoboggan wrote:
       | Is there a quantized version available for ollama or is it too
       | early for that?
        
         | coder543 wrote:
         | Already there, it looks like:
         | https://ollama.ai/library/codellama
         | 
         | (Look at "tags" to see the different quantizations)
        
       | robin-whg wrote:
       | This is a prime example of the positive aspects of capitalism.
       | Meta has its own interests, of course, but as a side effect, this
       | greatly benefits consumers.
        
       | LVB wrote:
       | I'm not very plugged into how to use these models, but I do love
       | and pay for both ChatGPT and GitHub Copilot. How does one take a
       | model like this (or a smaller version) and leverage it in VS
       | Code? There's a dizzying array of GPT wrapper extensions for VS
       | Code, many of which either seem like kind of junk (10 d/ls, no
       | updates in a year), or just lead to another paid plan, at which
       | point I might as well just keep my GH Copilot. Curious what
       | others are doing here for Copilot-esque code completion without
       | Copilot.
        
         | petercooper wrote:
         | https://continue.dev/ is a good place to start.
        
           | sestinj wrote:
           | beat me to the punch : )
        
           | speedgoose wrote:
           | Continue doesn't support tab completion like Copilot yet.
           | 
           | A pull/merge request is being worked on:
           | https://github.com/continuedev/continue/pull/758
        
             | sestinj wrote:
             | Release coming later this week!
        
           | jondwillis wrote:
           | Bonus points for being able to use local models!
        
           | israrkhan wrote:
           | This looks really good..
        
             | dan_can_code wrote:
             | It's great. It's super easy to install ollama locally,
             | `ollama run <preferred model>`, change the continue config
             | to point to it, and it just works. It even has an offline
             | option by disabling telemetry.
        
         | sestinj wrote:
         | I've been working on continue.dev, which is completely free to
         | use with your own Ollama instance / TogetherAI key, or for a
         | while with ours.
         | 
          | Was testing with CodeLlama-70B this morning and it's clearly a
          | step up from other open-source models.
        
           | dan_can_code wrote:
            | How do you test a 70B model locally? I've tried querying, but
           | the response is super slow.
        
             | sestinj wrote:
             | Personally I was testing with TogetherAI because I don't
             | have the specs for a local 70b. Using quantized versions
             | helps (Ollama's downloads 4-bit by default, you can get
             | down to 2), but it would still require a higher-end Mac.
             | Highly recommend Together, it runs quite quickly and is
             | $0.9/million tokens
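              | 
              | For scale, at that price GitHub Copilot's ~$10/month
              | individual plan buys roughly:
              | 
              |     price_per_m = 0.9  # $ per million tokens
              |     print(f"{10 / price_per_m:.1f}M tokens")  # ~11.1M
              | 
              | per month, which is a lot of completions, so the hosted
              | route can easily come out cheaper than owning the hardware.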
        
         | SparkyMcUnicorn wrote:
         | There are some projects that let you run a self-hosted Copilot
         | server, then you set a proxy for the official Copilot
         | extension.
         | 
         | https://github.com/fauxpilot/fauxpilot
         | 
         | https://github.com/danielgross/localpilot
        
           | israrkhan wrote:
            | I tried fauxpilot to make it work with my own llama.cpp
            | instance, but it didn't work out of the box. I filed a GitHub
            | issue, but it did not get any traction, and I eventually gave
            | up on it. This was around 5 months ago; things might have
            | improved by now.
        
           | water-data-dude wrote:
            | When I was setting up a local LLM to play with, I stood up my
            | own OpenAI-API-compatible server using llama-cpp-python. I
            | installed the Copilot extension and set OverrideProxyUrl in
           | the advanced settings to point to my local server, but
           | CoPilot obstinately refused to let me do anything until I'd
           | signed in to GitHub to prove that I had a subscription.
           | 
           | I don't _believe_ that either of these lets you bypass that
           | restriction (although I'd love to be proven wrong), so if you
           | don't want to sign up for a subscription you'll need to use
           | something like Continue.
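            | 
            | For what it's worth, the llama-cpp-python server speaks the
            | OpenAI wire format, so you can sanity-check it with the stock
            | client before pointing an extension at it; a minimal sketch
            | (model path and port are whatever you started the server
            | with):
            | 
            |     # python -m llama_cpp.server --model ./codellama.gguf
            |     from openai import OpenAI
            | 
            |     client = OpenAI(base_url="http://localhost:8000/v1",
            |                     api_key="unused")  # ignored locally
            |     resp = client.chat.completions.create(
            |         model="local",  # typically ignored with one model
            |         messages=[{"role": "user",
            |                    "content": "Explain Python decorators."}])
            |     print(resp.choices[0].message.content)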
        
         | marinhero wrote:
         | You can download it and run it with
         | [this](https://github.com/oobabooga/text-generation-webui).
         | There's an API mode that you could leverage from your VS Code
         | extension.
        
         | cmgriffing wrote:
         | I've been using Cody by Sourcegraph and liking it so far.
         | 
         | https://sourcegraph.com/cody
        
         | apapapa wrote:
         | Free Bard is better than free ChatGPT... Not sure about paid
         | versions
        
       | chrishare wrote:
       | Credit where credit is due, Meta has had a fantastic commitment
       | towards open source ML. You love to see it.
        
         | joshspankit wrote:
         | Yes but: if the commitment is driven by internal researchers
         | and coders standing firm about making their work open source (a
         | rumour I've heard a couple times), most of the credit goes to
         | them.
        
       | anonymousDan wrote:
        | Can anyone tell me what kind of hardware setup would be needed to
        | fine-tune something like this? Would you need a cluster of GPUs?
        | What kind of cluster size + GPU spec would you think is
        | reasonable (e.g. w.r.t. VRAM per GPU, etc.)?
        
       | pandominium wrote:
        | Everyone is mentioning using a 4090 and a smaller model, but I
        | rarely see an analysis that takes energy consumption into
        | account.
        | 
        | I think Copilot is already highly subsidized by Microsoft.
        | 
        | Let's say you use Copilot around 30% of your daily work hours.
        | How many kWh does an open-source 7B or 13B model then use in a
        | month on one 4090?
       | 
       | EDIT:
       | 
        | I think for a 13B at 30% use per day it comes to around $30/mo on
        | the energy bill.
        | 
        | So an even smaller but still capable model can probably beat the
        | Copilot monthly subscription.
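        | 
        | The arithmetic is easy to redo with your own numbers (the
        | wattage, duty cycle and price below are just assumptions):
        | 
        |     gpu_kw = 0.35          # sustained draw while generating
        |     hours = 8 * 0.30 * 22  # 30% of an 8h day, 22 workdays
        |     price = 0.30           # $/kWh
        |     kwh = gpu_kw * hours
        |     print(f"{kwh:.0f} kWh -> ${kwh * price:.2f}/month")
        |     # ~18 kWh -> ~$5.54/month with these inputs; heavier use or
        |     # pricier electricity pushes it toward the $30 figure above.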
        
       | fullspectrumdev wrote:
        | This looks potentially interesting if it can be run locally on,
        | say, an M2 Max or similar - and if there's an IDE plugin to do
        | the Copilot thing.
       | 
       | Anything that saves me time writing "boilerplate" or figuring out
       | the boring problems on projects is welcome - so I can expend the
       | organic compute cycles on solving the more difficult software
       | engineering tasks :)
        
       | siilats wrote:
        | We made a JetBrains plugin called CodeGPT to run this locally:
       | https://plugins.jetbrains.com/plugin/21056-codegpt
        
       ___________________________________________________________________
       (page generated 2024-01-29 23:01 UTC)