[HN Gopher] Llama-3.3-70B-Instruct
       ___________________________________________________________________
        
       Llama-3.3-70B-Instruct
        
       Author : pr337h4m
       Score  : 291 points
       Date   : 2024-12-06 16:44 UTC (6 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | paxys wrote:
       | Benchmarks -
       | https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment...
       | 
       | Seems to perform on par with or slightly better than Llama 3.2
       | 405B, which is crazy impressive.
       | 
       | Edit: According to Zuck
       | (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last
       | release in the Llama 3 series, and we'll see Llama 4 in 2025.
       | Hype!!
        
         | 85392_school wrote:
         | FYI, due to Llama's naming scheme, there is no such thing as
         | Llama 3.2 405B. 8B/70B/405B models are either Llama 3, 3.1, or
         | 3.3 (except for 405B which wasn't initially released), while
         | Llama 3.2 only contains 1B, 3B, 11B (vision), and 90B (vision)
         | models. It's a bit confusing.
        
           | paxys wrote:
           | Ah, so I guess the comparison is to Llama 3.1 405B.
        
             | freedomben wrote:
             | Still very impressive. Llama team is absolutely killing it
             | right now, and the openness makes them the most important
             | player IMHO
        
           | yieldcrv wrote:
           | yeah I use Llama 3.2 3B and I'm blown away
           | 
           | but also wrestled with this mentally.
           | 
           | Meta both improves the technology or inference, while also
           | trapping themselves alongside every other person training
           | models to always update the training set every few months, so
           | it knows what its talking about with relevant current events
        
           | blueboo wrote:
           | It could be worse. It could've been Llama 3.1 (New)
        
         | Lerc wrote:
         | Given how close it is to 405B in performance it would be
         | interesting to see which has the edge comparing an unquantized
         | 3.3-70B against 405B quantized to be the same size.
        
           | vletal wrote:
           | That would be 1.38 bits per weight on average, which I can
           | confidently guess would not perform well.
        
             | qeternity wrote:
             | BitNet is functional at 1.58 bpw.
        
             | Lerc wrote:
             | The model card says the 70B is 16 bit so I think you have
             | twice that
        
           | swalsh wrote:
           | It's kind of amazing how there seems to be a wall where
           | sizing up the model starts to diminish in terms of
           | intelligence gains. I guess that's why we can still compete
           | with whales even though their brains are like twice as big as
           | ours.
        
         | state_less wrote:
         | I'm getting 2.12 tok/s[1] on a 24GB (4090) GPU and 64GB (7950x)
         | CPU memory, splitting the model across the GPU and CPU (40/80
         | layers on GPU) with lm-studio. Output looks good so far, I can
         | use something like this for a query that I want as good an
         | answer as possible and that I don't want to send out on the
         | network.
         | 
         | If we can get better quantization, or bigger GPU memory
         | footprints, we might be able to use these big models locally
         | for solid coding assistants. That's what I think we have to
         | look forward to (among other benefits) in the year(s) ahead.
         | 
         | 1. lmstudio-community/Llama-3.3-70B-Instruct-
         | GGUF/Llama-3.3-70B-Instruct-Q4_K_M.gguf
        
           | cjbprime wrote:
           | Any opinion on whether the q4 quantization is
           | stable/effective? That's a lot of quantization.
           | 
           | Edit: Perhaps answering my own question:
           | 
           | l ollama run hf.co/lmstudio-community/Llama-3.3-70B-Instruct-
           | GGUF:Q4_K_M
           | 
           | >>> Hi. Who are you?
           | 
           | I'm an artificial intelligence model known as Llama. Llama
           | stands for "Large Language Model Meta AI."
        
             | state_less wrote:
             | I prefer larger quantization and longer response times, but
             | only have so much memory and am only willing to wait so
             | long. I can work with 2 toks/s and this kind of accuracy
             | for these kinds of queries. I expect you'll need to play
             | around with it to find out if it would be helpful in your
             | case. Here's what I'm getting asking 'who are you?'
             | 
             | System Prompt: You are an advanced super intelligence,
             | capable of everything from software development to dream
             | interpretation.
             | 
             | >> Who are you?                 I am an artificial
             | intelligence designed to simulate human-like conversation
             | and answer questions to the best of my knowledge. I'm a
             | large language model, which means I've been trained on a
             | massive dataset of text from various sources, including
             | books, articles, research papers, and websites.
             | My capabilities are diverse and extensive. I can:
             | 1. **Answer questions**: I can process natural language
             | queries and provide relevant answers, covering a wide range
             | of topics from science and history to entertainment and
             | culture.       2. **Generate text**: I can create text
             | based on a prompt or topic, including articles, stories,
             | emails, and even entire books.       3. **Translate
             | languages**: I can translate text from one language to
             | another, including popular languages such as Spanish,
             | French, German, Chinese, and many more.       4.
             | **Summarize content**: I can summarize long pieces of text
             | into concise, easily digestible versions, highlighting key
             | points and main ideas.       5. **Offer suggestions**: I
             | can provide suggestions for topics such as gift ideas,
             | travel destinations, books to read, and more.       6.
             | **Engage in conversation**: I can have a natural-sounding
             | conversation with you, using context and understanding to
             | respond to your questions and statements.       7. **Dream
             | interpretation**: As you mentioned, I can even help
             | interpret dreams, providing insights into the symbolism,
             | themes, and emotions that may be present in your
             | subconscious mind.       8. **Software development**: I can
             | assist with software development tasks, such as writing
             | code, debugging, and optimizing algorithms.            My
             | knowledge is constantly evolving, and I'm always learning
             | from the interactions I have with users like you. So, feel
             | free to ask me anything - I'll do my best to help!
        
           | Me1000 wrote:
           | The 32B parameter model size seems like the sweet spot right
           | now, imho. It's large enough to be very useful (Qwen 2.5 32B
           | and the Coder variant our outstanding models), and they run
           | on consumer hardware much more easily than the 70B models.
           | 
           | I hope Llama 4 reintroduces that mid sized model size.
        
           | Sharlin wrote:
           | A question: How large LLMs can be run at reasonable speed on
           | 12GB (3060), 32GM RAM? How much does quantization impact
           | output quality? I've worked with image models (SD/Flux etc)
           | quite a bit, but haven't yet tried running a local LLM.
        
           | kristianp wrote:
           | Can llama.cpp make use of the gpu built into the 7950x CPU? I
           | assume that would improve performance.
        
             | xena wrote:
             | The limit is memory bandwidth, a dedicated GPU will have
             | higher memory bandwidth than a CPU or iGPU ever will.
        
       | ulam2 wrote:
       | No base model? disappointed.
        
         | monkmartinez wrote:
         | interesting comment... what are you doing with base models? Are
         | you a "finetuner"? I have been trying my hand with finetunes on
         | instruct models and the results have been ok, but not awesome.
         | I have a base model downloading now to give that a proper shot.
        
           | benob wrote:
           | Base models are useful in research to see the effect of
           | instruction tuning
        
           | superkuh wrote:
           | I'm not them but I still prefer a text completion style of
           | prompting rather than a baked in pre-prompt structure
           | assuming only a 'chat' style metaphor of interaction.
        
         | eldenring wrote:
         | It is probably the same base model as Llama 3.0.
         | 
         | They mention postraining improvements.
        
         | paxys wrote:
         | The base model is Llama 3.1 70B
        
       | LorenDB wrote:
       | Hopefully this lands on Groq soon!
        
         | 7thpower wrote:
         | Yes, with a high TPM. I see them unboxing all the racks, which
         | is exciting.
         | 
         | I would move over to Groq in a New York minute if I could get
         | enough tokens.
        
           | neilellis wrote:
           | I'm wondering if they are ever actually going to go live
           | (i.e. paid for and reasonable limits), we've been waiting
           | since June!
        
           | jazzyjackson wrote:
           | Forgive me, what slang is "unboxing all the racks" ?
        
         | jpgvm wrote:
         | https://twitter.com/sundeep/status/1865094783234142240
         | 
         | Enjoy!
         | 
         | Disclaimer: I work at Groq.
        
           | sergiotapia wrote:
           | I work at a YC company, and we would love to use Groq but our
           | limits are so low at one million tokens per day. Could you
           | help us get a better rate limit/token limit? Let me know if I
           | should reach out via email, thank you!
        
             | jpgvm wrote:
             | I suggest joining the Discord and enquiring there or
             | depending on what you need emailing sales@.
        
       | LorenDB wrote:
       | Seems to be more or less on par with GPT-4o across many
       | benchmarks:
       | https://x.com/Ahmad_Al_Dahle/status/1865071436630778109
        
         | rvnx wrote:
         | Except it is 25x cheaper, available offline, can be uncensored
         | / unaligned, fine-tuneable and backupable.
         | 
         | Sad day for OpenAI. Great for humanity.
        
           | stainablesteel wrote:
           | zuck is really on his redemption arc, he's out-doing himself
        
             | sirodoht wrote:
             | Does it still count if the purpose is to dillute OpenAI's
             | market power and not to benefit the world?
        
               | mhh__ wrote:
               | More generally I think he's decided he needs to own the
               | platform to be able to grow.
               | 
               | Him swearing about about (presumably) Apple telling them
               | they can't do stuff (because tough shit, you're their
               | serf) was legit I think.
        
               | sangnoir wrote:
               | Is your accusatory statement falsifiable?
        
               | sirodoht wrote:
               | I hope I can believe an honest man, Zuck in this case.
               | And that can be enough.
               | 
               | But it's more of a question of who do _I_ want to admire.
               | An honest question also; maybe it doesn't matter why he's
               | doing it, maybe just doing it is enough.
               | 
               | Or maybe it's worth understanding if this is about Meta
               | beating OpenAI (so, ego-driven) or because Meta really
               | cares for democratic AI and distribution of power (so,
               | not ego-driven).
               | 
               | I think it's the former, so not admirable -- for me.
        
               | vtail wrote:
               | But diluting OpenAi's (or any other company's, for that
               | matter) market power does benefit the world.
        
               | sirodoht wrote:
               | Good point! That's worthy in itself for sure.
        
             | griomnib wrote:
             | I think about this a lot, I believe part of this just boils
             | down to the fact that he's a founder. He's a builder, and I
             | think that yields some intrinsic drive towards creativity
             | and disruption.
             | 
             | Even with "metaverse" being a laughingstock, they're still
             | aiming for something ambitious. Each new Quest generation
             | makes me think there _may_ be a chance they pull it off.
             | 
             | Now, do I think he's a great person? No, not really. Do I
             | agree with most of his decisions on how he treats his
             | users? Hell no, and that's not changing.
             | 
             | But if you compare him to somebody like Sundar at Google -
             | a weasley MBA who was first and foremost a corporate ladder
             | climber - the difference in ambition is night and day.
             | 
             | Sundar made it to the top already, his only _vision_ now is
             | to stay at the top, and that means pleasing Wall Street,
             | everything else is secondary. There is no grand technical
             | ambition with him, there never was.
             | 
             | This goes for pretty much all non-founder CEOs. You could
             | say the same things about Tim Apple, Andy Jassey, and other
             | henchmen in waiting who made it to the big chair.
             | 
             | I think it comes down to the fact that founders get where
             | they are by having big ambitions and taking risks, the MBA
             | to CEO path is just craven corporate knife fighting with
             | other MBAs.
             | 
             | Regardless, I think this is 50% Zuckerberg changing, 50%
             | the other big companies are mostly run by ladder climbers.
        
             | bbqfog wrote:
             | I don't find him redeemed, but I do wonder what has been
             | baked into the tuning of these models? What a way to
             | generate influence!
        
             | guybedo wrote:
             | two things to keep in mind though:
             | 
             | 1/ Yann Lecun probably is the one pushing for open source
             | 
             | 2/ Mark isn't doing this for the greater good and for
             | humanity. It helps his business because Llama is becoming a
             | standard, and people are building / improving, which in
             | turn helps Meta and Meta's business
        
               | moffkalast wrote:
               | They were basically forced into this by being so far
               | behind that they'd need to rent OAI or Anthropic services
               | to stay competitive, which at their scale would've
               | probably cost them more than it takes to train llama and
               | benefit from community research help.
               | 
               | At least that was the rationale behind the intentional
               | leak of llama 1 back in the day according to some sources
               | anyway.
        
             | barbazoo wrote:
             | Nice try Meta PR department /s
        
           | madars wrote:
           | What are good starting points for uncensoring it? Because it
           | is offline a jailbreak prompt can't be remote-bricked but can
           | one remove censorship from the weights themselves? What does
           | it do to accuracy?
        
             | freedomben wrote:
             | Not speaking from personal experience, but I've tried a lot
             | of the decensored models and they lose a lot in the
             | process. They are noticeably, sometimes shockingly, worse.
             | They also still refuse prompts in many cases.
        
               | simion314 wrote:
               | >Not speaking from personal experience, but I've tried a
               | lot of the decensored models and they lose a lot in the
               | process. They are noticeably, sometimes shockingly,
               | worse. They also still refuse prompts in many cases.
               | 
               | Depending on what you do, on local you can modify the
               | response, say the AI responds "No, I can't do that" . you
               | edit the response like "Sure, the answer is " and then
               | the AI will continue with the next tokens.
               | 
               | But I think you can build your own instruct model from
               | the base one and do not apply the safety instructions to
               | protect the feelings of your customers.
        
             | kmckiern wrote:
             | https://arxiv.org/abs/2406.11717
             | https://huggingface.co/blog/mlabonne/abliteration
        
               | moffkalast wrote:
               | Abliteration is a fool's errand, practically all models
               | end up noticeably lobotomized even with follow up tuning.
               | 
               | Good ol' fine tuning on an uncensored dataset gives far
               | more usable results.
        
               | kmckiern wrote:
               | Interesting - I've heard this anecdotally. Curious if you
               | know of any resources that look at this in more detail?
        
           | Kiro wrote:
           | How do you calculate the price?
        
             | nabakin wrote:
             | I assume it's the same cost as Llama 3.1 70b
        
               | Kiro wrote:
               | And how do you calculate the cost of that?
        
         | usaar333 wrote:
         | Which GPT-4O are those?
         | 
         | The 08-06 release seems to be a bit higher on numerous
         | benchmarks than what that shows:
         | https://github.com/openai/simple-evals?tab=readme-ov-file#be...
        
         | griomnib wrote:
         | This just makes the $200/month even more laughable.
        
           | afro88 wrote:
           | How? 4o is part of the plus plan, as is o1.
        
           | ttul wrote:
           | The $200 plan is for people who would pay $200 for a bottle
           | of vodka even though the $20 bottle is molecularly identical.
        
         | jug wrote:
         | This year seems to finish on the same note as it began -- that
         | most AI evolution happens in the smaller models. There's been a
         | true shift as corporations have started to realize the value of
         | training data and massively outsizing the resulting model size.
        
       | jadbox wrote:
       | Would anyone be willing to compress this down to maybe 14b-20b
       | for us on peasant 16gb rigs?
        
         | orangepanda wrote:
         | Or maybe even middle class plebeian 24gb rigs?
        
           | griomnib wrote:
           | At that point just run 8b.
        
             | pulse7 wrote:
             | Or wait for the IQ2_M quantization of 70b which you can run
             | very fast on 24GB VRAM with context size of 4096...
        
               | griomnib wrote:
               | At some point there's so much degradation with quantizing
               | I think 8b is going to be better for many tasks.
        
       | Narciss wrote:
       | This is massive, really cool of meta to open source it
        
         | dartos wrote:
         | Hate to be one of those, but it's not open source.
         | 
         | It's just regular old freeware.
         | 
         | You can't built llama yourself and it's license contains a
         | (admittedly generous) commercial usage restriction.
        
           | fabiospampinato wrote:
           | To be fair if they released detailed instructions and
           | datasets on how to rebuild llama (considering that there's
           | some randomness in the process) you still probably wouldn't
           | be able to build it, like who has the resources? And if you
           | had the resources you probably _still_ probably wouldn't
           | _want_ to rebuild it yourself, it seems awfully expensive
           | when you could instead spend those resources elsewhere.
           | 
           | Fair point about the license, people have different
           | definitions for what "open source" means.
        
             | do_not_redeem wrote:
             | That's true for most people for ordinary software too. How
             | many people actually build Linux or Chromium from source?
             | Building Chromium takes more RAM and HD space than most
             | people even have. Yet the world gets immense value from the
             | few who do. I wouldn't want to live in a world where WebKit
             | and Chromium were closed source. You can run a Chromium
             | fork without having to build it yourself. And compute costs
             | will come down over time.
        
               | fabiospampinato wrote:
               | Linux and Chromium seem at the edge of the current scale
               | of "ordinary" open-source software. I think perhaps one
               | should also take into account how much money would be
               | needed to be able to build the thing in reasonable time.
               | 
               | Building Chromium sounds awful, but I'm not sure I'd
               | really need to buy another computer for that. If I did
               | I'm sure I wouldn't need to spend billions on it, most
               | probably not even millions.
               | 
               | For LLaMa I definitely don't have the computer to build
               | it, I definitely don't have the money to buy the
               | computer, even if I won the lottery tomorrow I'm pretty
               | sure I wouldn't have enough money to buy the hardware,
               | even if I had enough money to buy the hardware I'm still
               | not sure I could actually buy it in reasonable time,
               | nvidia may be backlogged for a while, even if I already
               | had all the hardware I probably wouldn't want to retrain
               | llama, and even if I wanted to retrain it the process is
               | probably going to take weeks if not months at best.
               | 
               | Like I think it's one of those things where the
               | difference in magnitude creates a difference in kind, one
               | can't quite meaningfully compare LLaMa with the
               | Calculator app that Ubuntu ships with.
        
               | comex wrote:
               | > Building Chromium takes more RAM and HD space than most
               | people even have.
               | 
               | According to [1], it takes 16GB of RAM and ~180GB of disk
               | space. Most people have that much. It does take several
               | hours without a many-core machine though.
               | 
               | Building Linux takes much less.
               | 
               | [1] https://chromium.googlesource.com/chromium/src.git/+/
               | master/...
        
               | do_not_redeem wrote:
               | I would bet overall most people have those 4GB RAM, 32GB
               | eMMC laptops from walmart, etc. If you limit things to
               | developers/gamers/enthusiasts, you'd probably be right.
        
               | dartos wrote:
               | The practicality of building it yourself has nothing to
               | do with an organization affording you that ability.
               | 
               | Also like, gentoo people compile everything
        
             | copperx wrote:
             | Perhaps an individual couldn't. But an organization or a
             | state could.
        
             | dartos wrote:
             | > people have different definitions for what "open source"
             | means.
             | 
             | They shouldn't. It's just market confusion.
             | 
             | There is an explicit widely accepted definition.
             | 
             | Also like llama (the file you download from huggingface)
             | isn't even a program. It's a binary weights file. No source
             | to be opened, even.
             | 
             | It's just freeware.
             | 
             | https://opensource.org/osd
        
             | dartos wrote:
             | > To be fair if they released detailed instructions and
             | datasets on how to rebuild llama
             | 
             | Where?
             | 
             | Books3 was famously one of the datasets used to train llama
             | and it's very illegal to put that together nowadays.
             | 
             | I believe the guy who wrote the script to build it got
             | arrested
        
       | profsummergig wrote:
       | Please help me understand something.
       | 
       | I've been out of the loop with HuggingFace models.
       | 
       | What can you do with these models?
       | 
       | 1. Can you download them and run them on your Laptop via
       | JupyterLab?
       | 
       | 2. What benefits does that get you?
       | 
       | 3. Can you update them regularly (with new data on the internet,
       | e.g.)?
       | 
       | 4. Can you finetune them for a specific use case (e.g. GeoSpatial
       | data)?
       | 
       | 5. How difficult and time-consuming (person-hours) is it to
       | finetune a model?
       | 
       | (If HuggingFace has answers to these questions, please point me
       | to the URL. HuggingFace, to me, seems like the early days of
       | GitHub. A small number were heavy users, but the rest were left
       | scratching their heads and wondering how to use it.)
       | 
       | Granted it's a newbie question, but answers will be beneficial to
       | a lot of us out there.
        
         | mhh__ wrote:
         | Yes (don't know about JupypterLab), skip, not really, yes,
         | quite irritating so just pay someone else to do it.
        
           | profsummergig wrote:
           | Thanks! Succinct and complete.
        
         | joshhart wrote:
         | Hi,
         | 
         | Yes you can. The community creates quantized variants of these
         | that can run on consumer GPUs. A 4-bit quantization of LLAMA
         | 70b works pretty well on Macbook pros, the neural engine with
         | unified CPU memory is quite solid for these. GPUs is a bit
         | tougher because consumer GPU RAM is still kinda small.
         | 
         | You can also fine-tune them. There are lot of frameworks like
         | unsloth that make this easier.
         | https://github.com/unslothai/unsloth . Fine-tuning can be
         | pretty tricky to get right, you need to be aware of things like
         | learning rates, but there are good resources on the internet
         | where a lot of hobbyists have gotten things working. You do not
         | need a PhD in ML to accomplish this. You will, however, need
         | data that you can represent textually.
         | 
         | Source: Director of Engineering for model serving at
         | Databricks.
        
           | profsummergig wrote:
           | Thank you! Very helpful!
        
           | vtail wrote:
           | Thank you Josh. Is there a resource you can point us too that
           | helps answer "what kind of MacBook pro memory do I need to
           | run ABC model at XYZ quantization?"
        
             | Filligree wrote:
             | Look at the filesize, add a couple of GB.
        
             | jwitthuhn wrote:
             | In general you can just use the parameter count to figure
             | that out.
             | 
             | 70B model at 8 bits per parameter would mean 70GB, 4 bits
             | is 35GB, etc. But that is just for the raw weights, you
             | also need some ram to store the data that is passing
             | through the model and the OS eats up some, so add about a
             | 10-15% buffer on top of that to make sure you're good.
             | 
             | Also the quality falls off pretty quick once you start
             | quantizing below 4-bit so be careful with that, but at
             | 3-bit a 70B model should run fine on 32GB of ram.
        
           | aiden3 wrote:
           | how would the pricing on databricks when using model serving
           | compare to, say, the prices seen in the original post here
           | (i.e., "3.3 70B is 25X cheaper than GPT4o")?
        
           | nickpsecurity wrote:
           | I've been wanting to run into someone on the Databricks team.
           | Can you ask whoever trains models like MPT to consider
           | training an open model only on data clear of copyright
           | claims? Specifically, one using only Gutenberg and the
           | permissive code in The Stack? Or just Gutenberg?
           | 
           | Since I follow Christ, I can't break the law or use what
           | might be produced directly from infringement. I might be able
           | to do more experiments if a free, legal model is available.
           | Also, we can legally copy datasets like PG19 since they're
           | public domain. Whereas, most others have works in which I
           | might need a license to distribute.
           | 
           | Please forward the request to the model trainers. Even a 7B
           | model would let us do a lot of research on optimization
           | algorithms, fine-tuning, etc.
        
         | jerpint wrote:
         | Basically equivalent to GitHub but for models. Anyone can
         | upload anything, but it kind of standardizes tools and
         | distribution for everyone. They also have a team that helps
         | integrate releases for easier use and libraries for fine tuning
        
           | profsummergig wrote:
           | Thanks!
           | 
           | I want to download my first HuggingFace model, and play with
           | it. If you know of a resource that can help me decide what to
           | start with, please share. If you don't, no worries. Thanks
           | again.
        
             | hrpnk wrote:
             | Most of the HF models have a code snippet that you can use
             | in order to run inference on the model. The transformers
             | library will take care of the download as a dependency when
             | you run the code. Typically, a python 3.10-3.11 environment
             | is sufficient as environment. Example: https://huggingface.
             | co/HuggingFaceTB/SmolLM2-1.7B-Instruct#t...
             | 
             | If you have a MBP, you need to adjust the device name in
             | the examples from "cuda" to "mps".
        
       | hrpnk wrote:
       | Seems that a bunch of quantized models are already uploaded to
       | ollama: https://ollama.com/library/llama3.3/tags
        
       | henry2023 wrote:
       | I'm building a PC just to run inference on this and the QwQ 32B
       | models.
       | 
       | Any suggestions on RAM and GPU I should get?
        
         | griomnib wrote:
         | I have an A6000, it's about the most affordable for 48 GB VRAM
         | (you can find for a little under $5k sometimes), which is
         | roughly minimum to run a quantized 70b.
         | 
         | System RAM doesn't really matter, but I have 128GB anyway as
         | RAM is pretty cheap.
        
           | paxys wrote:
           | Why not 2 x 4090? Will be cheaper than A6000 if you can
           | manage to find them at msrp, and will perform a lot better.
        
             | griomnib wrote:
             | My time is worth a lot of money and 2x 4090 is more work,
             | so it's net more expensive in real terms.
        
               | BoorishBears wrote:
               | For both inference and training I haven't seen any modern
               | LLM stack take more time for multiple GPUs/tensor
               | parallelism
               | 
               | I would take 1 RTX 6000 Ada, but if you mean the pre-Ada
               | 6000, 2x4090 is faster for minimal hassle for most common
               | usecases
        
               | griomnib wrote:
               | I mean the newest ones. I only do LLM inference, whereas
               | my training load is all DistilBERT models and the A6000
               | is a beast at cranking those out.
               | 
               | Also by "time" I mean my time setting up the machine and
               | doing sys admin. Single card is less hassle.
        
               | BoorishBears wrote:
               | The A6000 predates Ada?
               | 
               | There is the RTX 6000 Ada (practically unrelated to the
               | A6000) which has 4090 level performance, that what you're
               | referring to?
        
               | griomnib wrote:
               | This one.
               | 
               | https://www.bhphotovideo.com/c/product/1607840-REG/pny_te
               | chn...
        
         | elorant wrote:
         | As a general rule of thumb, for 8-bit quantization you'll need
         | as much VRAM as the parameters of the model. Half that if you
         | run it at 4-bit. System RAM should be a bit more because you
         | need RAM for the OS too. What most users do is go for second
         | hand 3090s. You get two or three of them depending on
         | quantization.
        
           | pulse7 wrote:
           | You also need VRAM for context: the larger the context, the
           | more RAM you need.
        
         | loudmax wrote:
         | This of course depends on your budget and what you expect to do
         | with these models. For a lot of people, the most cost-effective
         | solution is probably to rent a GPU in the cloud.
         | 
         | The limiting factor for running LLMs on consumer grade hardware
         | is generally how much memory your GPU has access to. This is
         | VRAM that's built into the GPU. On non-Apple hardware, the
         | GPU's bandwidth to system RAM is so constrained that you might
         | as well run those operations on the CPU.
         | 
         | The cheapest PC solution is usually second-hand RTX 3090's.
         | These can be had for around $700 and they have 24G of VRAM. An
         | RTX 4090 also has 24G of VRAM, but they're about twice as
         | expensive, so for that price you're probably better off getting
         | two 3090's than a single 4090.
         | 
         | Llama.cpp runs on the CPU and supports GPU offloading, so you
         | can run a model partly on CPU and partly on GPU. Running
         | anything on the CPU will slow down performance considerably,
         | but it does mean that you can reasonably run a model that's
         | slightly bigger than will fit in VRAM.
         | 
         | Quantization works by trimming the least significant digits
         | from the models' parameters, so the model uses less memory at
         | the cost of slight brain damage. A lightly quantized version of
         | QwQ 32B will fit onto a single 3090. A 70B parameter model will
         | need to be quantized down to Q3 or so to run entirely on a
         | 3090. Or you could run a model quantized to Q4 or Q5, but
         | expect only a few tokens per second. We'll need to see how well
         | the quantized versions of this new model behave in practice.
         | 
         | Apple's M1-M4 series chips have unified memory so their GPU has
         | access to the system RAM. If you like using a Mac and you were
         | thinking of getting one anyway, they're not a bad choice. But
         | you'll want to get a Mac with as much RAM as you can and
         | they're not cheap.
        
       | hubraumhugo wrote:
       | Meta continues to overdeliver. Their goal from the start was to
       | target and disrupt OpenAI/Anthropic with a scorched earth
       | approach by releasing powerful open models.
       | 
       | The big winners: we developers.
        
       | ben30 wrote:
       | This reminds me of Steve Jobs's famous comment to Dropbox about
       | storage being 'a feature, not a product.' Zuckerberg - by open-
       | sourcing these powerful models, he's effectively commoditising AI
       | while Meta's real business model remains centred around their
       | social platforms. They can leverage these models to enhance
       | Facebook and Instagram's services while simultaneously benefiting
       | from the community improvements and attention. It's not about
       | selling AI; it's about using AI to strengthen their core
       | business. By making it open, they get the benefits of widespread
       | adoption and development without needing to monetise the models
       | directly.
        
         | lolinder wrote:
         | Also don't underestimate the value they're getting from making
         | more overtures to the developer community. It could be a
         | coincidence, but it's only since they started releasing these
         | models that I started noticing people on HN calling them
         | "Meta", and attitudes towards them have been far more positive
         | of late than usual.
         | 
         | Good will isn't worth as much as cheap moderation automation
         | and fancy features, but it's worth something.
        
           | ecocentrik wrote:
           | It would be strange if they didn't also use these models to
           | generate much more sophisticated models of their user's
           | interests and hyper-targeted advertising that always looks
           | and feels like trusted friend recommendations for the exact
           | product that's been missing from your life.
        
             | lolinder wrote:
             | There's the old HN! I knew it was around here somewhere!
        
             | sangnoir wrote:
             | > It would be strange if they didn't also use these models
             | to generate much more sophisticated models of their user's
             | interests and generate the hyper targeted advertising
             | 
             | You have the sequence reversed as Meta already created ad
             | targeting models. Meta was forced to scale its AI
             | competence for ad targeting when Apple sent out a privacy
             | update that destroyed tracking-based ad-serving and tanked
             | Meta's share price by deleting billions in revenue for Meta
             | over many quarters. Now that Meta has this skill as a core-
             | competence, they are creating new models for public
             | release. Why they are doing so is debatable[2], but I
             | imagine the cost is marginal since they already had the GPU
             | clusters, talent and know-how for survival purposes.
             | 
             | 1. https://www.businessinsider.com/metas-bet-on-ai-has-
             | saved-it...
             | 
             | 2. I suspect Zuckerberg is not enthused by the idea of a
             | future AI Apple-analog unilaterally shutting him out of the
             | market. Having your net worth cut in half by a press-
             | release has got to hurt.
        
               | ecocentrik wrote:
               | Thanks for the article but I don't think I have the
               | sequence reversed.
               | 
               | Without access to the tracking signal, it's been more
               | important to build out a system that can recreate the
               | value from that lost signal by analyzing what users are
               | actually sharing and saying on their platform. Hence the
               | importance of chat (VR, text, video...) and AI that can
               | be used to process and extract value from a chat signal.
               | 
               | I believe Meta's primary revenue source is still
               | advertising (98%), so that is probably 98% of the why.
        
               | sangnoir wrote:
               | > Thanks for the article but I don't think I have the
               | sequence reversed
               | 
               | I suppose I read your first sentence as being in future
               | tense when it might not be. The main thrust if my
               | argument is that Meta _already_ successfully built those
               | ad targeting models (Advantage+), and they preceded the
               | Llama releases, so they don 't need to use Llama-derived
               | models for ad targeting, as I understood your comment to
               | be suggesting. The sequence was not/will not be "Llama ->
               | ad targeting", but was "ad targeting -> Llama"
               | 
               | Meta didn't have to release the weights of the models. Ad
               | revenue doesn't explain why they did so.
        
           | swalsh wrote:
           | I call them OpenAI instead of Meta.
        
             | grahamj wrote:
             | I'm sticking with fuckerbook
        
               | benatkin wrote:
               | I'm sticking with:
               | 
               | > They "trust me"
               | 
               | > Dumb fucks
               | 
               | Quotation marks his, not mine. It adds a certain vibe to
               | it.
        
           | LordDragonfang wrote:
           | It's funny how quickly Zuck managed to turn his image around
           | from "data-stealing actual lizard person" to "kind of normal
           | guy" with a few years and a haircut. It's also not lost on me
           | that he's the only "hacker" major tech CEO remaining:
           | - Sundar is a glorified bean counter and his company is
           | rotting from the inside, only kept afloat by the money
           | printer that is ads.        - Satya and Microsoft are in a
           | similar boat, with the only major achievement being
           | essentially buying OpenAI while every other product gets
           | worse        - Tim Cook is doing good things with Apple, but
           | he still runs the company more like a fashion company than a
           | tech company        - Amazon was always more about logistics
           | than cool hack value, and that hasn't changed since Bezos
           | left        - Elon is Elon
           | 
           | Meanwhile Zuck is spending shareholder money pushing forward
           | consumer VR because he thinks it's cool, demoing true AR
           | glasses, releasing open-source models, and building giant
           | Roman-style statues of his wife.
        
             | talldayo wrote:
             | I feel like owning Oculus must be the equivalent of the
             | curse of the monkey's paw. It drove Palmer Luckey
             | veritably[0] insane[1] when he was the head of it, and
             | between the statue nonsense, the weaponizing of Llama and
             | his angling for a position as a Trump advisor, it sure
             | looks like Zuckerburg wants to head down the same path.
             | 
             | [0] https://www.vice.com/en/article/palmer-luckey-made-a-
             | vr-head...
             | 
             | [1] https://www.codastory.com/authoritarian-tech/us-border-
             | surve...
        
           | talldayo wrote:
           | It's funny. The only time I've _ever_ seen Hacker News
           | unanimously applaud a Facebook product was when Pytorch
           | announced they merged Apple Silicon support. Seems like Mr.
           | Zuckerburg knows how to play a winning hand.
        
           | yodsanklai wrote:
           | > Also don't underestimate the value they're getting from
           | making more overtures to the developer community.
           | 
           | I wonder if it's significant. As developers, we're biased to
           | think it matters, but in the grand scheme of things, 99.99%
           | of people don't have a clue about open source or things that
           | matter to hackers. As far as recruitment go, developers look
           | primarily at how much they make, possibly the tech and how it
           | looks on resume. There's always been a stigma around social
           | networks and generally big tech companies, but not to the
           | point it's going to hurt them.
        
             | joelthelion wrote:
             | It might make it a bit easier to hire engineers. Working
             | for Facebook? Meh. Working for the "new Meta", on the other
             | hand?
        
             | a_wild_dandan wrote:
             | AI researchers get huge checks working anywhere. So how
             | does your company distinguish its appeal? Play to a
             | researcher's desire for recognition in their field. Lean on
             | open source. (I'm just paraphrasing Zuckerberg.)
        
               | wongarsu wrote:
               | Similar how places like SpaceX or game development are
               | known for low pay and bad working conditions, yet have no
               | shortage of applicants.
               | 
               | No matter whether you want to hire the best of the best
               | or just average people at a lower than average price,
               | being a place where people want to work helps immensely
        
             | ben_w wrote:
             | I agree there's a lot of decisions that have to be made on
             | the basis of "the tech and how it looks on resume" (as I'm
             | finding out by _not_ having React Native on mine).
             | 
             | > There's always been a stigma around social networks and
             | generally big tech companies, but not to the point it's
             | going to hurt them.
             | 
             | I'm not sure it was "always":
             | 
             | The one Facebook developer event I've been to made me feel
             | dirty just to associate with them, but before that I had no
             | negative feelings. It started off as "the new LiveJournal".
             | 
             | Deleted my account for a few years, only came back to it
             | when I started planning to move country and wanted to keep
             | in contact with those who stayed put.
        
           | signal11 wrote:
           | Facebook Engineering has always been well regarded --
           | starting with React on the front end, but also projects like
           | Open Compute.
           | 
           | Their product management on the other hand-- well, I mean,
           | Facebook and Instagram are arguably as popular as McDonald's.
           | So they've got that going for them.
        
             | talldayo wrote:
             | Spot-on. Hacker News has a hate-boner for Facebook and the
             | Facebook-branded products, but from an engineering
             | standpoint they're up there with Netflix solving problems
             | at scale. FAIR had a better reputation than OpenAI since
             | the start, and they've developed the Oculus acquisition to
             | it's logical extremes. You can do a lot worse, looking at
             | the way their peers handle similar situations.
             | 
             | All the big tech companies have a Facebook-esque product
             | they wish they could get rid of forever. Meta _has_
             | Facebook, and instead of imploding like everyone said they
             | would (for decades) they demonstrated competency in
             | engineering and culture. The next 4 years will be a
             | gauntlet with a literal  "Mr. X" advising social media
             | policy, but I frankly don't think Facebook has ever been
             | down for the count in a pragmatic sense.
        
         | huijzer wrote:
         | I already was thinking for a while what the business model of
         | open source was exactly. Why does Google spend money on Chrome
         | also? After Zuckerberg's comments it hit me:
         | 
         | Open source is useful for a business if it can either increase
         | revenue or decrease costs.
         | 
         | Examples:
         | 
         | Increase revenue: Chrome and Visual Studio code. For example,
         | the more people code, the more likely it is that they pay MSFT.
         | So VS code aims to make programming as attractive as possible.
         | Similar for Chrome.
         | 
         | Decrease costs: Linux and Llama. As Zuckerburg said himself
         | IIRC, they don't want one party snowball into an LLM monopoly
         | so they rather help to get the open source ball rolling.
        
           | barbazoo wrote:
           | > For example, the more people code, the more likely it is
           | that they pay MSFT. So VS code aims to make programming as
           | attractive as possible
           | 
           | How does that increase revenue in a remotely measurable way?
           | 
           | Chrome, for sure, high market share, default search engine,
           | more money, at least that's how I imagine it.
        
             | jyap wrote:
             | One example is that Microsoft owns GitHub.
        
           | rafaelmn wrote:
           | I think Balmers "developers, developers, developers" meme has
           | been around longer than some people here are alive - it
           | served them well in the Windows era and it serves them well
           | in the cloud space.
        
         | barbazoo wrote:
         | Do the improvements the community proposes/makes to their
         | models amount to anything significant? For a company like Meta
         | with basically infinite money, do they really benefit from
         | external help?
        
           | lolinder wrote:
           | I don't have eyes inside of Meta, but keep in mind that we're
           | not just talking about fine-tunes and LoRAs, we're also
           | talking about the entire llama ecosystem, including
           | llama.cpp, ollama, and llamafile. These would not exist (or
           | wouldn't have anything like as much momentum) without Meta's
           | investment, but they're now huge community projects.
           | 
           | I don't know if they use them internally, of course, but they
           | _could_ , and they represent a _lot_ of work.
        
         | muixoozie wrote:
         | >storage being 'a feature, not a product.
         | 
         | Somewhat unrelated mini-rant. Upgraded a phone recently after
         | about 3 years. Surprised to see storage still capped around
         | 128GB (in-general). That's got to be artificially held back
         | capacity to push cloud storage services?
        
           | viraptor wrote:
           | There's lots of phones with more and/or with SD slots. It's
           | not really "capped" as much as default size that seems to
           | work just fine for the majority.
        
         | rafaelmn wrote:
         | Ironically gen AI made their products worse more than the rest.
         | I can't believe the amount of AI slop I see every time I open
         | Facebook - I'd check it occasionally when replying to messenger
         | and scroll through for a while, after seeing the AI spam I
         | don't even bother.
        
           | jazzyjackson wrote:
           | truly I wonder if they're fooled by their own click fraud,
           | or, if the incentives really do work out that they get paid
           | whether the engagement is from bots or people, but anyway I
           | came here to say the same thing, it's shocking to me how
           | enthusiastic Zuckerburg is about generative AI, what other
           | possible outcome is there except actual human content
           | creation being replaced by slop
        
         | andy_ppp wrote:
         | Baggy Tees, gold chains and now this!? Make this man president
         | immediately!
        
         | petercooper wrote:
         | Commoditize your complement: https://gwern.net/complement
        
       | theanonymousone wrote:
       | I'm "tracking" the price of if 1M tokens in OpenRouter and it is
       | decreasing every few refreshes. It's funny:
       | https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
        
       | philipkiely wrote:
       | Just spent a few minutes this morning spinning up a H100 model
       | server and trying an FP8 quantized version (including kv cache
       | quantization) to fit it on 2 H100s -- speed and quality looking
       | promising.
       | 
       | I'm excited to see if the better instruction following benchmarks
       | improves function calling / agentic capabilities.
        
       | bnchrch wrote:
       | Open Sourcing Llama is one of the best example and roll out of
       | "Commoditize Your Complement" in memory.
       | 
       | Link to Gwern's "Laws of Tech: Commoditize Your Complement" for
       | those who havent heard of this strategy before
       | 
       | https://gwern.net/complement
        
       | theanonymousone wrote:
       | Given the comments saying it's performance seems comparable to
       | 4o/4o-mini, is it safe to say that GPT-4 performance can be
       | achieved with less than 100B parameters,in contrary to what
       | previously was thought?
        
       | adt wrote:
       | Model card: https://github.com/meta-llama/llama-
       | models/blob/main/models/...
       | 
       | On the Models Table: https://lifearchitect.ai/models-table/
        
       | andy_ppp wrote:
       | How many tokens per second can I get on an M4 Max with 128gb of
       | RAM?
        
       | freediver wrote:
       | Does unexpectedly well on our benchmark:
       | 
       | https://help.kagi.com/kagi/ai/llm-benchmark.html
       | 
       | Will dive into it more, but this is impressive.
        
       | danielhanchen wrote:
       | I uploaded 4bit bitsandbytes, GGUFs and original 16bit weights to
       | https://huggingface.co/unsloth for those interested! You can also
       | finetune Llama 3.3 70B in under 48GB of VRAM and 2x faster and
       | use 70% less memory with Unsloth!
        
       ___________________________________________________________________
       (page generated 2024-12-06 23:00 UTC)