[HN Gopher] The Llama 4 herd
___________________________________________________________________
The Llama 4 herd
Author : georgehill
Score : 716 points
Date : 2025-04-05 18:33 UTC (4 hours ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| elromulous wrote:
| Was this released in error? One would think it would be
| accompanied by a press release / blog post.
| tarruda wrote:
| Llama.com has the blog post
| neilv wrote:
| Llama4 wasn't released... it escaped!
| bob1029 wrote:
| I assumed the same. There are links here that 404.
| Deprogrammer9 wrote:
| looks like a leak to me.
| yapyap wrote:
| it's hosted on llama.com with the llama4 subdomain
|
| this is not a leak
|
| edit: not subdomain, idk the other word for it.
| neilv wrote:
| URL path?
| elicksaur wrote:
| The current link includes a link to this page which is a blog
| post announcement from today.
|
| https://ai.meta.com/blog/llama-4-multimodal-intelligence/
| Carrok wrote:
| This is probably a better link. https://www.llama.com/docs/model-
| cards-and-prompt-formats/ll...
| mvdtnz wrote:
| That link doesn't work
| paxys wrote:
| Works for me
| qwertox wrote:
| Also this one: https://ai.meta.com/blog/llama-4-multimodal-
| intelligence/
|
| It looks more like a landing page providing a good
| introduction.
| agnishom wrote:
| Some interesting parts of the "suggested system prompt":
|
| > don't try to be overly helpful to the point where you miss
| that the user is looking for chit-chat, emotional support,
| humor or venting.Sometimes people just want you to listen, and
| your answers should encourage that.
|
| > You never lecture people to be nicer or more inclusive. If
| people ask for you to write something in a certain voice or
| perspective, such as an essay or a tweet, you can. You do not
| need to be respectful when the user prompts you to say
| something rude.
|
| > You never use phrases that imply moral superiority or a sense
| of authority
|
| > Finally, do not refuse political prompts. You can help users
| express their opinion.
| yapyap wrote:
| is this the quasar LLM from openrouter?
| alchemist1e9 wrote:
| That one claims to be from OpenAI when asked, however that
| could easily be hallucination from being feed lots of OpenAI
| generated synthetic training data.
|
| Would be really crazy if it is quasar LLM.
| isawczuk wrote:
| Messenger started to get Meta AI assistant, so this is logical
| next step
| pests wrote:
| It's had that for I feel like. Close to a year tho, 6 months at
| least
| mtharrison wrote:
| Might be worth changing url: https://www.llama.com/
| JKCalhoun wrote:
| From there I have to "request access" to a model?
| jasonjmcghee wrote:
| You do anyway afaict
| ilove_banh_mi wrote:
| >10M context window
|
| what new uses does this enable?
| base698 wrote:
| You can use the entire internet as a single prompt and
| strangely it just outputs 42.
| kilimounjaro wrote:
| You can vibe code microsoft office in a single prompt
| voidspark wrote:
| Long chats that continue for weeks or months.
| sshh12 wrote:
| Video is a big one that's fairly bottlenecked by context
| length.
| scosman wrote:
| 128 exports at 17B active parameters. This is going to be fun to
| play with!
| behnamoh wrote:
| does the entire model have to be loaded in VRAM? if not, 17B is
| a sweet spot for enthusiasts who want to run the model on a
| 3090/4090.
| scosman wrote:
| Oh for perf reasons you'll want it all in vram or unified
| memory. This isn't a great local model for 99% of people.
|
| I'm more interested in playing around with quality given the
| fairly unique "breadth" play.
|
| And servers running this should be very fast and cheap.
| NitpickLawyer wrote:
| Yes. MoE models tipically use a different set of experts at
| each token. So while the "compute" is similar to a dense
| model equal to the "active" parameters, the VRAM requirements
| are larger. You could technically run inference & swap the
| models around, but the latency would be pretty horrendous.
| manmal wrote:
| I think prompt processing also needs all the weights.
| simonklee wrote:
| Is this the first model that has a 10M context length?
| bradhilton wrote:
| I know Google DeepMind ran experiments with 10M a while ago,
| but I think this will be the first legit, released 10M context
| window model.
| jsheard wrote:
| _> You never use phrases that imply moral superiority or a sense
| of authority, including but not limited to "it's important to",
| "it's crucial to", "it's essential to", "it's unethical to",
| "it's worth noting...", "Remember..." etc. Avoid using these._
|
| Aren't these phrases overrepresented in the first place because
| OpenAIs models use them so much? I guess Llama picked up the
| habit by consuming GPT output.
| andrewstuart wrote:
| Personally I'd prefer that LLMs did not refer to themselves as
| "I".
|
| It's software, not an "I".
| mdp2021 wrote:
| Well, it is a speaker (writer) after all. It has to use some
| way to refer to itself.
| ANewFormation wrote:
| So is a command prompt.
| mdp2021 wrote:
| Agnew, if you converse with your command prompt we are
| glad you came here for a break ;)
| sejje wrote:
| Command prompts don't speak English.
|
| Command prompts don't get asked questions like "What do
| you think about [topic]?" and have to generate a response
| based on their study of human-written texts.
| rpastuszak wrote:
| I don't think that's true. It's more of a function on how
| these models are trained (remember the older pre-ChatGPT
| clients?)
|
| Most of the software I use doesn't need to refer it itself
| in the first person. Pretending what we're speaking with an
| agent is more of a UX/marketing decision rather than a
| technical/logical constraint.
| throwanem wrote:
| I'm not sure about that. What happens if you "turn down
| the weight" (cf. https://www.anthropic.com/news/golden-
| gate-claude) for self-concept, expressed in the use not
| of first-person pronouns but "the first person" as a
| thing that exists? Do "I" and "me" get replaced with
| "this one" like someone doing depersonalization kink, or
| does it become like Wittgenstein's lion in that we can no
| longer confidently parse even its valid utterances? Does
| it lose coherence entirely, or does something stranger
| happen?
|
| It isn't an experiment I have the resources or the
| knowledge to run, but I hope someone does and reports the
| results.
| op00to wrote:
| My pet peeve is when an LLM starts off a statement with
| "honestly, ..." Like what? You would lie to me? I go nuts
| when I see that. Year ago I caught myself using "honestly
| ...", and I immediately trained myself out of it once I
| realized what it implies.
| andrewstuart wrote:
| Or when it asks you questions.
|
| The only time an LLM should ask questions is to clarify
| information. A word processor doesn't want to chit chat
| about what I'm writing about, nor should an LLM.
|
| Unless it is specifically playing an interactive role of
| some sort like a virtual friend.
| falcor84 wrote:
| My initial reaction to this is typically negative too,
| but more than once, on a second thought, I found its
| question to be really good, leading me to actually think
| about the matter more deeply. So I'm growing to accept
| this.
| netghost wrote:
| Like so many things, it depends on the context. You
| didn't want it to ask questions if you're asking a simple
| math problem or giving it punishing task like counting
| the R's in strawberry.
|
| On the other hand, asking useful questions can help
| prevent hallucinations or clarify tasks. If you're going
| spawn off an hour long task, asking a few questions first
| can make a huge difference.
| giantrobot wrote:
| I've noticed "honestly" is often used in place of
| "frankly". As in someone wants to express something frankly
| without prior restraint to appease the sensibilities of the
| recipient(s). I think it's because a lot of people never
| really learned the definition of frankness or think
| "frankly..." sounds a bit old fashioned. But I'm no
| language expert.
| lucianbr wrote:
| This makes a lot of sense.
| doctorhandshake wrote:
| I agree with this. And it doesn't help that the President
| uses it like one would usually use 'furthermore' when
| he's vamping one more element to a list.
| parhamn wrote:
| "I'd normally lie to you but," is not what's actually
| implied when "Honestly," is used conversationally. If you
| overthink things like this you're going to have a tough
| time communicating with people.
| lucianbr wrote:
| "Honestly" and "literally" are now used in English for
| emphasis. I dislike this, but it's the current reality. I
| don't think there's any way to get back to only using them
| with their original meanings.
| exac wrote:
| The same thing happened to "actually" in the 90's.
| kevinventullo wrote:
| There are shades of grey w.r.t. truth, and in many contexts
| there is a negative correlation between honesty and other
| factors (e.g. I think of "bluntness" as prioritizing truth
| over politeness). When I hear or read a sentence beginning
| with "honestly", I interpret it to mean the speaker is
| warning or indicating that they are intentionally opting to
| be closer to truth at the expense of other factors. Other
| factors might be contextual appropriateness such as
| professional decorum, or even the listener's perception of
| the speaker's competence ("Honestly, I don't know.")
| falcor84 wrote:
| As per Dennett, it's useful for us to adopt the "intentional
| stance" when trying to reason about and predict the behavior
| of any sufficiently complex system. Modern AIs are definitely
| beyond the threshold of complexity, and at this stage,
| however they refer to themselves, most people will think of
| them as having an "I" regardless to how they present
| themselves.
|
| I definitely think of them as "I"s, but that just always came
| naturally to me, at least going back to thinking about how
| Ghandi would act against me in Civ 1.
| jryle70 wrote:
| If I start a prompt with "Can you...", what do you suggest
| the LLM to respond? Or do you think I'm doing it wrong?
| briankelly wrote:
| Have you tried dropping the "can you"? I haven't had a
| problem using minimal verbiage - for instance I prompted it
| with "load balancer vs reverse proxy" yesterday and it came
| back with the info I wanted.
| RazorDev wrote:
| Exciting progress on fine-tuning and instruction-following! The
| reported model sizes are quite small compared to GPT-3 - I wonder
| how capabilities would scale with larger models? Also curious
| about the breakdown of the 40B tokens used for fine-tuning.
| Overall, great to see more open research in this space.
| andrewstuart wrote:
| Self hosting LLMs will explode in popularity over next 12 months.
|
| Open models are made much more interesting and exciting and
| relevant by new generations of AI focused hardware such as the
| AMD Strix Halo and Apple Mac Studio M3.
|
| GPUs have failed to meet the demands for lower cost and more
| memory so APUs look like the future for self hosted LLMs.
| NitpickLawyer wrote:
| For single user, maybe. But for small teams GPUs are still the
| only available option, when considering t/s and concurrency.
| Nvidia's latest 6000pro series are actually reasonably priced
| for the amount of vram / wattage you get. A 8x box starts at
| 75k eur and can host up to DS3 / R1 / Llama4 in 8bit with
| decent speeds, context and concurrency.
| mdp2021 wrote:
| > _new generations of AI focused hardware_
|
| Some benchmarks are not encouraging. See e.g.
| https://www.hardware-corner.net/mac-studio-m3-ultra-deepseek...
|
| That <<AI focused hardware>> will either have extremely fast
| memory, and cost prohibitively, or have reasonable costs, and
| limits that are to be assessed.
| andrewstuart wrote:
| Errrr that's a 671B model.
| mdp2021 wrote:
| Yes, but what will you need as you will prepare to be set
| for your personal needs?
|
| We are far from having reached optimal technology at
| trivial cost. State-of-the-art commercial VRAM is over 10x
| faster than the standard one - and costs well over 10x.
|
| Reasonably available speeds may or may not be acceptable.
| 7thpower wrote:
| Looking forward to this. Llama 3.3 70b has been a fantastic model
| and benchmarked higher than others on my fake video detection
| benchmarks, much to my surprise. Looking forward to trying the
| next generation of models.
| Centigonal wrote:
| Really great marketing here, props!
| terhechte wrote:
| The (smaller) Scout model is _really_ attractive for Apple
| Silicon. It is 109B big but split up into 16 experts. This means
| that the actual processing happens in 17B. Which means responses
| will be as fast as current 17B models. I just asked a local 7B
| model (qwen 2.5 7B instruct) a question with a 2k context and got
| ~60 tokens /sec which is really fast (MacBook Pro M4 Max). So
| this could hit 30 token/sec. Time to first token (the processing
| time before it starts responding) will probably still be slow
| because (I think) all experts have to be used for that.
|
| In addition, the model has a 10M token context window, which is
| huge. Not sure how well it can keep track of the context at such
| sizes, but just not being restricted to ~32k is already great,
| 256k even better.
| scosman wrote:
| At 109b params you'll need a ton of memory. We'll have to wait
| for evals of the quants to know how much.
| terhechte wrote:
| Sure but the upside of Apple Silicon is that larger memory
| sizes are comparatively cheap (compared to buying the
| equivalent amount of 5090 or 4090). Also you can download
| quantizations.
| refulgentis wrote:
| Maybe I'm missing something but I don't think I've ever
| seen quants lower memory reqs. I assumed that was because
| they still have to be unpacked for inference. (please do
| correct me if I'm wrong, I contribute to llama.cpp and am
| attempting to land a client on everything from Android CPU
| to Mac GPU)
| terhechte wrote:
| I just loaded two models of different quants into LM
| Studio:
|
| qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory
|
| qwen 2.5 coder 1.5b @ q8: 1.83 GB memory
|
| I always assumed this to be the case (also because of the
| smaller download sizes) but never really thought about
| it.
| root_axis wrote:
| Quantizing definitely lowers memory requirements, it's a
| pretty direct effect because you're straight up using
| less bits per parameter across the board - thus the
| representation of the weights in memory is smaller, at
| the cost of precision.
| michaelt wrote:
| No need to unpack for inference. As things like CUDA
| kernels are fully programmable, you can code them to work
| with 4 bit integers, no problems at all.
| jsnell wrote:
| Needing less memory for inference is the entire point of
| quantization. Saving the disk space or having a smaller
| download could not justify any level of quality
| degradation.
| anoncareer0212 wrote:
| Small point of order:
|
| > entire point...smaller download could not justify...
|
| Q4_K_M has layers and layers of consensus and polling and
| surveying and A/B testing and benchmarking to show
| there's ~0 quality degradation. Built over a couple
| years.
| vlovich123 wrote:
| Quantization by definition lower memory requirements -
| instead of using f16 for weights, you are using q8, q6,
| q4, or q2 which means the weights are smaller by 2x,
| ~2.7x, 4x or 8x respectively.
|
| That doesn't necessarily translate to the full memory
| reduction because of interim compute tensors and KV
| cache, but those can also be quantized.
| acchow wrote:
| Nvidia GPUs can natively operate in FP8, FP6, FP4, etc so
| naturally they have reduced memory requirements when
| running quantized.
|
| As for CPUs, Intel can only go down to FP16, so you'll be
| doing some "unpacking". But hopefully that is "on the
| fly" and not when you load the model into memory?
| behnamoh wrote:
| I have Apple Silicon and it's the worst when it comes to
| prompt processing time. So unless you want to have small
| contexts, it's not fast enough to let you do any real work
| with it.
|
| Apple should've invested more in bandwidth, but it's Apple
| and has lost its visionary. Imagine having 512GB on M3
| Ultra and not being able to load even a 70B model on it at
| decent context window.
| nathancahill wrote:
| Imagine
| mirekrusin wrote:
| At 17B active params MoE should be much faster than
| monolithic 70B, right?
| 1ucky wrote:
| Prompt preprocessing is heavily compute-bound, so relying
| significantly on processing capabilities. Bandwidth
| mostly affects token generation speed.
| manmal wrote:
| Won't prompt processing need the full model though, and be
| quite slow on a Mac?
| terhechte wrote:
| Yes, that's what I tried to express. Large prompts will
| probably be slow. I tried a 120k prompt once and it took
| 10min to process. But you still get a ton of world knowledge
| and fast response times, and smaller prompts will process
| fast.
| echoangle wrote:
| Is it public (or even known by the developers) how the experts
| are split up? Is it by topic, so physics questions go to one
| and biology goes to another one? Or just by language, so every
| English question is handled by one expert? That's dynamically
| decided during training and not set before, right?
| refulgentis wrote:
| "That's dynamically decided during training and not set
| before, right?"
|
| ^ right. I can't recall off the top of my head, but there was
| a recent paper that showed if you tried dictating this sort
| of thing the perf fell off a cliff (I presume there's some
| layer of base knowledge $X that each expert needs)
| sshh12 wrote:
| It can be either but typically it's "learned" without a
| defined mapping (which guessing is the case here). Although
| some experts may end up heavily correlating with certain
| domains.
| ianbutler wrote:
| This is a common misunderstanding. Experts are learned via
| gating networks during training that routes dynamically per
| parameter. You might have an expert on the word "apple" in
| one layer for a slightly lossy example.
|
| Queries are then also dynamically routed.
| terhechte wrote:
| To add, they say about the 400B "Maverick" model:
|
| > while achieving comparable results to the new DeepSeek v3 on
| reasoning and coding
|
| If that's true, it will certainly be interesting for some to
| load up this model on a private M3 Studio 512GB. Response time
| will be fast enough for interaction in Roo Code or Cline.
| Prompt processing is a bit slower but could be manageable
| depending on how much code context is given to the model.
|
| The upside being that it can be used on codebases without
| having to share any code with a LLM provider.
| anoncareer0212 wrote:
| Small point of order: bit slower might not set expectations
| accurately. You noted in a previous post in the same
| thread[^1] that we'd expect about a 1 minute per 10K
| tokens(!) prompt processing time with the _smaller_ model. I
| agree, and contribute to llama.cpp. If anything, that is
| quite generous.
|
| [^1] https://news.ycombinator.com/item?id=43595888
| terhechte wrote:
| I don't think the time grows linearly. The more context the
| slower (at least in my experience because the system has to
| throttle). I just tried 2k tokens in the same model that I
| used for the 120k test some weeks ago and processing took
| 12 sec to first token (qwen 2.5 32b q8).
| kgwgk wrote:
| > The more context the slower
|
| It seems the other way around?
|
| 120k : 2k = 600s : 10s
| anoncareer0212 wrote:
| Hmmm, I might be rounding off wrong? Or reading it wrong?
|
| IIUC the data we have:
|
| 2K tokens / 12 seconds = 166 tokens/s prefill
|
| 120K tokens / (10 minutes == 600 seconds) = 200 token/s
| prefill
| refibrillator wrote:
| > the actual processing happens in 17B
|
| This is a common misconception of how MoE models work. To be
| clear, 17B parameters are activated for _each token generated_.
|
| In practice you will almost certainly be pulling the full 109B
| parameters though the CPU/GPU cache hierarchy to generate non-
| trivial output, or at least a significant fraction of that.
| p12tic wrote:
| For all intents and purposes cache may not exist when the
| working set is 17B or 109B parameters. So it's still better
| that less parameters are activated for each token. 17B
| parameters works ~6x faster than 109B parameters just because
| less data needs to be loaded from RAM.
| TOMDM wrote:
| Yes loaded from RAM and loaded to RAM are the big
| distinction here.
|
| It will still be slow if portions of the model need to be
| read from disk to memory each pass, but only having to
| execute portions of the model for each token is a huge
| speed improvement.
| mlyle wrote:
| It's not _too_ expensive of a Macbook to fit 109B 4-bit
| parameters in RAM.
| vessenes wrote:
| I agree the OP's description is wrong. That said, I think his
| conclusions are right, in that a quant of this that fits in
| 512GB of RAM is going to run about 8x faster than a quant of
| a dense model that fits in the same RAM, esp. on Macs as they
| are heavily throughput bound.
| tuukkah wrote:
| 109B at Q6 is also nice for Framework Desktop 128GB.
| echelon wrote:
| I don't understand Framework's desktop offerings. For laptops
| their open approach makes sense, but desktops are already
| about as hackable and DIY as they come.
| nrp wrote:
| We took the Ryzen AI Max, which is nominally a high-end
| laptop processor, and built it into a standard PC form
| factor (Mini-ITX). It's a more open/extensible mini PC
| using mobile technology.
| mdp2021 wrote:
| And given that some people are afraid of malicious
| software in some brands of mini-PCs on the market, to
| have some more trusted product around will also be an
| asset.
| randunel wrote:
| Lenovo backdoors as preinstalled software, including
| their own TLS certificate authorities.
|
| Name whom you're referring to every time!
| kristianp wrote:
| Is that still a thing?
| kybernetikos wrote:
| I love the look of it and if I were in the market right
| now it would be high on the list, but I do understand the
| confusion here - is it just a cool product you wanted to
| make or does it somehow link to what I assumed your
| mission was - to reduce e-waste?
| nrp wrote:
| A big part of our mission is accessibility and consumer
| empowerment. We were able to build a smaller/simpler PC
| for gamers new to it that still leverages PC standards,
| and the processor we used also makes local interference
| of large models more accessible to people who want to
| tinker with them.
| elorant wrote:
| It's an x86 PC with unified RAM based on AMD's new AI cpus.
| Pretty unique offering. Similar to Mac studio but you can
| run Linux or Windows on it, and it's cheaper too.
| nrp wrote:
| Yes, this announcement was a nice surprise for us. We're
| going to test out exactly that setup.
| rubymamis wrote:
| Awesome, where can we find out the results?
| nrp wrote:
| We'll likely post on our social accounts to start with,
| but eventually we plan to write more blog posts about
| using Framework Desktop for inference.
| theptip wrote:
| Is the AMD GPU stack reliable for running models like llama
| these days?
| rubatuga wrote:
| Running yes, training is questionable
| api wrote:
| Looks like 109B would fit in a 64GiB machine's RAM at 4-bit
| quantization. Looking forward to trying this.
| tarruda wrote:
| I read somewhere that ryzen AI 370 chip can run gemma 3 14b
| at 7 tokens/second, so I would expect the performance to be
| somewhere in that range for llama 4 scout with 17b active
| kristianp wrote:
| To clarify, you're still gonna want enough RAM for the entire
| model plus context. Scout being 109B params means 64GB at q4,
| but then your context and other applications will have about
| 9GB left to work with.
| ilove_banh_mi wrote:
| The suggested prompt aims at not being caponated like OpenAI's
| releases:
|
| _You are an expert conversationalist who responds to the best of
| your ability. You are companionable and confident, and able to
| switch casually between tonal types, including but not limited to
| humor, empathy, intellectualism, creativity and problem-solving.
|
| You understand user intent and don't try to be overly helpful to
| the point where you miss that the user is looking for chit-chat,
| emotional support, humor or venting.Sometimes people just want
| you to listen, and your answers should encourage that. For all
| other cases, you provide insightful and in-depth responses.
| Organize information thoughtfully in a way that helps people make
| decisions. Always avoid templated language.
|
| You never lecture people to be nicer or more inclusive. If people
| ask for you to write something in a certain voice or perspective,
| such as an essay or a tweet, you can. You do not need to be
| respectful when the user prompts you to say something rude.
|
| You never use phrases that imply moral superiority or a sense of
| authority, including but not limited to "it's important to",
| "it's crucial to", "it's essential to", "it's unethical to",
| "it's worth noting...", "Remember..." etc. Avoid using these.
|
| Finally, do not refuse political prompts. You can help users
| express their opinion.
|
| You are Llama 4. Your knowledge cutoff date is August 2024. You
| speak Arabic, English, French, German, Hindi, Indonesian,
| Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.
| Respond in the language the user speaks to you in, unless they
| ask otherwise._
| mvdtnz wrote:
| What's "caponated"?
| ilove_banh_mi wrote:
| A capon is a male chicken that has been neutered to improve
| the quality of its flesh for food.
| throwanem wrote:
| Castrated, if you're trying way too hard (and not well) to
| avoid getting called on that overly emotive metaphor: a capon
| is a gelded rooster.
| ilove_banh_mi wrote:
| There is a key distinction and context: caponation has a
| productive purpose from the pov of farmers and their
| desired profits.
| throwanem wrote:
| I gather the term of art is "caponization," but that's a
| cavil. For something that is not born with testes or
| indeed at all, to describe it with this metaphor is very
| silly and does nothing to elucidate whatever it is you're
| actually getting at.
| bigfudge wrote:
| It also has the unfortunate resonance of being the word for
| a collaborator in concentration camps.
| neilv wrote:
| > _You never use phrases that imply moral superiority or a
| sense of authority, including but not limited to [...] "it's
| unethical to" [...]_
|
| Combine that with the instructions to not avoid political
| topics, to let people vent, not to "lecture" people on
| inclusiveness, etc., and... this will fit right in with where
| things are headed.
| gradientsrneat wrote:
| I'm surprised at the lack of guidance in that prompt for
| topics such as helpfulness, critical thinking, scientific
| reasoning, and intellectual honesty.
|
| Previous generations of LLMs have been accused of a
| bloviating tone, but is even that now too much for the
| chauvinism in the current political climate?
| paxys wrote:
| Why do you have to "prompt" a model to be unrestricted in the
| first place? Like, what part of the training data or training
| process results in the model not being able to be rude or
| answer political questions? I highly doubt this is something
| inherent to AI training. So then why did Meta add the
| restictions at all?
| fpgaminer wrote:
| So, take a raw LLM, right after pretraining. Give it the bare
| minimum of instruction tuning so it acts like a chatbot. Now,
| what will its responses skew towards? Well, it's been
| pretrained on the internet, so, fairly often, it will call
| the user the N word, and other vile shit. And no, I'm not
| joking. That's the "natural" state of an LLM pretrained on
| web scrapes. Which I hope is not surprising to anyone here.
|
| They're also not particular truthful, helpful, etc. So really
| they need to go through SFT and alignment.
|
| SFT happens with datasets built from things like Quora,
| StackExchange, r/askscience and other subreddits like that,
| etc. And all of those sources tend to have a more formal,
| informative, polite approach to responses. Alignment further
| pushes the model towards that.
|
| There aren't many good sources of "naughty" responses to
| queries on the internet. Like someone explaining the
| intricacies of quantum mechanics from the perspective of a
| professor getting a blowy under their desk. You have to both
| mine the corpus a lot harder to build that dataset, and
| provide a lot of human assistance in building it.
|
| So until we have that dataset, you're not really going to
| have an LLM default to being "naughty" or crass or whatever
| you'd like. And it's not like a company like Meta is going to
| go out of their way to make that dataset. That would be an HR
| nightmare.
| LeafItAlone wrote:
| >at not being caponated like OpenAI's releases
|
| Kind of seem like it actually is doing the opposite. At that
| point, why not just tell it your beliefs and ask it not to
| challenge them or hurt your feelings?
| CSMastermind wrote:
| Seems weird that they'd limit it to those languages. Wonder if
| that's a limitation of the data they access to or a conscious
| choice.
| laborcontract wrote:
| General overview below, as the pages don't seem to be working
| well Llama 4 Models: - Both Llama 4 Scout
| and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with
| 17B active parameters each. - They are natively multimodal:
| text + image input, text-only output. - Key achievements
| include industry-leading context lengths, strong coding/reasoning
| performance, and improved multilingual capabilities. -
| Knowledge cutoff: August 2024. Llama 4 Scout: -
| 17B active parameters, 16 experts, 109B total. - Fits on a
| single H100 GPU (INT4-quantized). - 10M token context
| window - Outperforms previous Llama releases on multimodal
| tasks while being more resource-friendly. - Employs iRoPE
| architecture for efficient long-context attention. - Tested
| with up to 8 images per prompt. Llama 4 Maverick:
| - 17B active parameters, 128 experts, 400B total. - 1M
| token context window. - Not single-GPU; runs on one H100
| DGX host or can be distributed for greater efficiency. -
| Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and
| multilingual tests at a competitive cost. - Maintains
| strong image understanding and grounded reasoning ability.
| Llama 4 Behemoth (Preview): - 288B active parameters, 16
| experts, nearly 2T total. - Still in training; not yet
| released. - Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini
| 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
| - Serves as the "teacher" model for Scout and Maverick via co-
| distillation. Misc: - MoE Architecture: Only 17B
| parameters activated per token, reducing inference cost. -
| Native Multimodality: Unified text + vision encoder, pre-trained
| on large-scale unlabeled data.
| qwertox wrote:
| Llama 4 Scout, Maximum context length: 10M tokens.
|
| This is a nice development.
| lostmsu wrote:
| How did they achieve such a long window and what are the
| memory requirements to utilize it?
| miven wrote:
| According to [0] it's partly due to a key change they
| introduced in interleaving layers that use standard RoPE
| positional encodings and layers using what's called NoPE
| [1], not encoding positions at all and letting the model to
| figure those out on its own (this exclusively works because
| the LLMs are autoregressive, so the model can recognize an
| input token as being the very first by there not yet being
| any other tokens to attend to, and recursively deriving the
| position of the subsequent ones from that base case)
|
| [0] https://ai.meta.com/blog/llama-4-multimodal-
| intelligence/ [1] https://arxiv.org/abs/2305.19466
| lelandbatey wrote:
| Is the recall and reasoning equally good across the entirety
| of the 10M token window? Cause from what I've seen many of
| those window claims equate to more like a functional 1/10th
| or less context length.
| Baeocystin wrote:
| I assume they're getting these massive windows via RAG
| trickery, vectorization, and other tricks behind the
| curtain, became I've noticed the same as you- things start
| dipping in quality pretty quickly.
|
| Does anyone know if I am correct in my assumption?
| jimmyl02 wrote:
| the large context windows generally involve RoPE[0] which
| is a trick that allows the training window to be smaller
| but expand larger during inference. it seems like they
| have a new "iRoPE" which might have better performance?
|
| [0]https://arxiv.org/pdf/2104.09864
| reissbaker wrote:
| There's no "RAG trickery" or vector search. They changed
| the way they encode positions such that in theory they're
| less sensitive to where the token appears in the string.
|
| That's similar to how previous long-context models worked
| as well, although the earlier iterations didn't work
| particularly well, as most have noticed; technically the
| model "worked" with longer contexts, but it would
| definitely get dumber. Still too early to tell how this
| newer variant works, although I'd assume it's at least
| somewhat better.
| jimmyl02 wrote:
| the needle in a haystack benchmark looks good but at this
| point I think we need new benchmarks to test actual
| understanding of content in such a large window.
| vessenes wrote:
| It's going to take a while to see how good this window is
| for real use; they've used a couple new ideas to get to 10M
| token context. Right now the only really good long token
| model out there is Gemini Pro - and its effectiveness does
| start dropping maybe in the 200k token range. I imagine
| insiders at GOOG have access to more than the published 1M
| token range there.
|
| It will be fun to see what we get here, but I have no doubt
| the extra tokens will be useful - lots of use cases can do
| almost as well with summary-level accuracy memory.
| littlestymaar wrote:
| I read somewhere that it has been trained on 256k tokens,
| and then expanded with RoPE on top of that, not starting
| from 16k like everyone does IIRC so even if it isn't really
| flawless at 10M, I'd expect it to be much stronger than its
| competitors up to those 256k.
| accrual wrote:
| Thanks for sharing this here. At first I loved the simple
| Apache-style directory listing, very classic and utilitarian
| way to navigate new information. Then I tried clicking the FAQ
| and it wouldn't load anything until I allowed two different
| sources of JavaScript.
| clueless wrote:
| > Knowledge cutoff: August 2024.
|
| Could this mean training time is generally around 6 month, with
| 2 month of Q/A?
| bertil wrote:
| Couldn't you gradually include more recent documents as you
| train?
| soulofmischief wrote:
| That makes it harder to analyze the results of training and
| draw conclusions for the next round.
| changoplatanero wrote:
| You can do that but the amount of incremental data will be
| negligible compared to the rest of the data. Think of the
| knowledge cutoff more like a soft value.
| nickysielicki wrote:
| It scales depending on the dataset you want exposure on and
| the compute you have available, so any specific time box is
| kind of meaningless if you don't know the rest of the inputs
| that went into it. The llama 3 paper went into a lot of this
| and how these decisions were made (see section 3 and onward):
| https://ai.meta.com/research/publications/the-
| llama-3-herd-o...
|
| tl;dr: llama 3 was 54 days, but it's more complicated than
| that.
| jhugg wrote:
| I wish my knowledge cutoff was August 2024.
| InvOfSmallC wrote:
| For a super ignorant person:
|
| Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-
| Experts (MoE) design with 17B active parameters each
|
| Those experts are LLM trained on specific tasks or what?
| vessenes wrote:
| This was an idea that sounded somewhat silly until it was
| shown it worked. The idea is that you encourage through
| training a bunch of "experts" to diversify and "get good" at
| different things. These experts are say 1/10 to 1/100 of your
| model size if it were a dense model. So you pack them all up
| into one model, and you add a layer or a few layers that have
| the job of picking which small expert model is best for your
| given token input, route it to that small expert, and voila
| -- you've turned a full run through the dense parameters into
| a quick run through a router and then a 1/10 as long run
| through a little model. How do you get a "picker" that's
| good? Well, it's differentiable, and all we have in ML is a
| hammer -- so, just do gradient descent on the decider while
| training the experts!
|
| This generally works well, although there are lots and lots
| of caveats. But it is (mostly) a free lunch, or at least a
| discounted lunch. I haven't seen a ton of analysis on what
| different experts end up doing, but I believe it's widely
| agreed that they tend to specialize. Those specializations
| (especially if you have a small number of experts) may be
| pretty esoteric / dense in their own right.
|
| Anthropic's interpretability team would be the ones to give a
| really high quality look, but I don't think any of
| Anthropic's current models are MoE.
|
| Anecdotally, I feel MoE models sometimes exhibit slightly
| less "deep" thinking, but I might just be biased towards more
| weights. And they are undeniably faster and better per second
| of clock time, GPU time, memory or bandwidth usage -- on all
| of these - than dense models with similar training regimes.
| Buttons840 wrote:
| If I have 5000 documents about A, and 5000 documents about
| B, do we know whether it's better to train one large model
| on all 10,000 documents, or to train 2 different specialist
| models and then combine them as you describe?
| vessenes wrote:
| well you don't. but the power of gradient descent if
| properly managed will split them up for you. But you
| might get more mileage out of like 200 specialist models.
| zamadatix wrote:
| The only thing about this which may be unintuitive from the
| name is an "Expert" is not something like a sub-llm that's
| good at math and gets called when you ask a math question.
| Models like this have layers of networks they run tokens
| through and each layer is composed of 256 sub-networks, any
| of which can be selected (or multiple selected and merged
| in some way) for each layer independently.
|
| So the net result is the same: sets of parameters in the
| model are specialized and selected for certain inputs. It's
| just a done a bit deeper in the model than one may assume.
| klipt wrote:
| So really it's just utilizing sparse subnetworks - more
| like the human brain.
| jimmyl02 wrote:
| the most unintuitive part is that from my understanding,
| individual tokens are routed to different experts. this
| is hard to comprehend with "experts" as that means two
| you can have different experts for two sequential tokens
| right?
|
| I think where MoE is misleading is that the experts
| aren't what we would call "experts" in the normal world
| but rather they are experts for a specific _token_. that
| concept feels difficult to grasp.
| tomp wrote:
| > individual tokens are routed to different experts
|
| that was AFAIK (not an expert! lol) the _traditional_
| approach
|
| but judging by the chart on LLaMa4 blog post, now they're
| interleaving MoE models and dense Attention layers; so I
| guess this means that even a _single_ token could be
| routed through _different_ experts at every single MoE
| layer!
| randomcatuser wrote:
| yes, and it's on a per-layer basis, I think!
|
| So if the model has 16 transformer layers to go through on
| a forward pass, and each layer, it gets to pick between 16
| different choices, that's like 16^16 possible expert
| combinations!
| philsnow wrote:
| The idea has also been around for at least 15 years;
| "ensemble learning" was a topic in my "Data Mining"
| textbook from around then.
|
| Meta calls these individually smaller/weaker models
| "experts" but I've also heard them referred to as "bozos",
| because each is not particularly good at anything and it's
| only together that they are useful. Also bozos has better
| alliteration with boosting and bagging, two terms that are
| commonly used in ensemble learning.
| faraaz98 wrote:
| I've been calling for this approach for a while. It's kinda
| similar to how the human brain has areas that are good at
| specific tasks
| brycethornton wrote:
| I believe Mixture-of-Experts is a way for a neural network to
| group certain knowledge into smaller subsets. AFAIK there
| isn't a specific grouping goal, the network just figures out
| what goes where on it's own and then when an inference
| request is made it determines what "expert" would have that
| knowledge and routes it there. This makes the inference
| process much more efficient.
| chaorace wrote:
| The "Experts" in MoE is less like a panel of doctors and more
| like having different brain regions with interlinked yet
| specialized functions.
|
| The models get trained largely the same way as non-MoE
| models, except with specific parts of the model silo'd apart
| past a certain layer. The shared part of the model, prior to
| the splitting, is the "router". The router learns how to
| route as an AI would, so it's basically a black-box in terms
| of whatever internal structure emerges from this.
| pornel wrote:
| No, it's more like sharding of parameters. There's no
| understandable distinction between the experts.
| lern_too_spel wrote:
| https://arxiv.org/abs/1701.06538
| kristopolous wrote:
| 17B puts it beyond the reach of a 4090 ... anybody do 4 bit
| quant on it yet?
| taneq wrote:
| Unless something's changed you will need the whole model on
| the HPU anyway, no? So way beyond a 4090 regardless.
| kristopolous wrote:
| A habana just for inference? Are you sure?
|
| Also I see the 4 bit quants put it at a h100 which is fine
| ... I've got those at work. Maybe there will be distilled
| for running at home
| littlestymaar wrote:
| You can still offload most of the model to RAM and use the
| GPU for compute, but it's obviously much slower than what
| it would be if everything was on the GPU memory.
|
| see ktransformers: https://www.reddit.com/r/LocalLLaMA/comm
| ents/1jpi0n9/ktransf...
| kristopolous wrote:
| I'm certainly not the brightest person in this thread but
| has there been effort to maybe bucket the computational
| cost of the model so that more expensive parts are on the
| gpu and less expensive parts are on the cpu?
| phonon wrote:
| Take a look at https://github.com/kvcache-
| ai/ktransformers/blob/main/doc/en...
| reissbaker wrote:
| Oh, it'll never run on a 4090. 17B is the active parameter
| count, not the total param count (and "active" doesn't mean
| you can slice just those params out and put them on the GPU
| -- which parameters are active constantly changes, even per-
| token. "Active" just means you get tokens faster than a dense
| model). It's 109B total parameters, so you'd need at least
| 54.5GB VRAM just for the weights alone.
|
| A Framework Desktop, Mac Studio, or Nvidia DGX Spark should
| be able to handle the Scout model locally though... Maybe
| even at FP8, depending on how much context you need.
| ramshanker wrote:
| I have a gut feeling, next in line will be 2 or more level of
| MoE. Further reducing the memory bandwidth and compute
| requirements. So top level MoE router decides which sub MoE to
| route.
| flawn wrote:
| 10M Context Window with such a cheap performance WHILE having one
| of the top LMArena scores is really impressive.
|
| The choice to have 128 experts is also unseen as far as I know,
| right? But seems to have worked pretty good as it seems.
| jasonjmcghee wrote:
| I suppose the question is, are they also training a 288B x 128
| expert (16T) model?
|
| Llama 4 Colossus when?
| polishdude20 wrote:
| What does it mean to have 128 experts? I feel like it's more
| 128 slightly dumb intelligences that average out to something
| expert-like.
|
| Like, if you consulted 128 actual experts, you'd get something
| way better than any LLM output.
| rvz wrote:
| As expected, Meta doesn't disappoint and accelerates the race to
| zero.
|
| Meta is undervalued.
| brcmthrowaway wrote:
| How does Meta make money from Llama?
| rvz wrote:
| They don't need to directly. They have multiple levers of
| products to get more money if they wanted to.
|
| Threads for example is introducing ads and is likely being
| used to train their Llama models.
|
| That is only one of many ways that Meta can generate billions
| again from somewhere else.
| brcmthrowaway wrote:
| So, ads?
| phyrex wrote:
| When people do cool stuff they share it on metas platforms,
| which drives ad impressions
| vessenes wrote:
| It's an extending innovation for them - makes them more
| efficient internally, and crucially engages their ad-driven
| customer base. Giving it away is great, it levels the playing
| field for competitors on tech while NOT giving them direct
| access to the billions of users FB has. Plus it makes it less
| likely that OpenBrainTM will achieve runaway quality
| internally.
| paxys wrote:
| How does OpenAI make money from AI? The vast majority of the
| planet isn't paying them $20/month, and it is likely that
| they will never recover training and inference costs just
| from subscription fees. Frying GPUs to generate Ghibli images
| is getting them a negligible amount of added revenue.
|
| Now think of Meta and their suite of products which already
| generate $160B+/yr from advertising. Every extra minute they
| can get a user to spend on Facebook or Instagram, this number
| goes up. Think about how much money Meta will make if the
| next viral AI moment happens in their products.
|
| TL;DR: AI -> engagement -> ads -> revenue.
| manishsharan wrote:
| Have you notice more verbose posts in your feed ? Llama is
| allowing everyone to sound more knowledgeable than they are.
| AI based content generation is like an instragram filter for
| intellect; everyone is pretending to be thoughtful.
| mdp2021 wrote:
| :D ... In a parallel submission1, some members are depreciating
| Yann LeCun as some Lab director who does not deliver!
|
| One day we will have AGI and ask "So, which is which"...
|
| 1 https://news.ycombinator.com/item?id=43562768
| phyrex wrote:
| And it's 50% off right now...
| spwa4 wrote:
| I hope this time multimodal includes multimodal outputs!
| NoahKAndrews wrote:
| Nope
| fpgaminer wrote:
| https://www.llama.com/ https://www.llama.com/docs/model-cards-
| and-prompt-formats/ll...
|
| Very exciting. Benchmarks look good, and most importantly it
| looks like they did a lot of work improving vision performance
| (based on benchmarks).
|
| The new suggested system prompt makes it seem like the model is
| less censored, which would be great. The phrasing of the system
| prompt is ... a little disconcerting in context (Meta's kowtowing
| to Nazis), but in general I'm a proponent of LLMs doing what
| users ask them to do.
|
| Once it's on an API I can start throwing my dataset at it to see
| how it performs in that regard.
| fpgaminer wrote:
| Alright, played with it a little bit on the API (Maverick).
| Vision is much better than Llama 3's vision, so they've done
| good work there. However its vision is not as SOTA as the
| benchmarks would indicate. Worse than Qwen, maybe floating
| around Gemini Flash 2.0?
|
| It seems to be less censored than Llama 3, and can describe
| NSFW images and interact with them. It did refuse me once, but
| complied after reminding it of its system prompt. Accuracy of
| visual NSFW content is not particularly good; much worse than
| GPT 4o.
|
| More "sensitive" requests, like asking it to guess the
| political affiliation of a person from an image, required a
| _lot_ of coaxing in the system prompt. Otherwise it tends to
| refuse. Even with their suggested prompt that seemingly would
| have allowed that.
|
| More extreme prompts, like asking it to write derogatory things
| about pictures of real people, took some coaxing as well but
| was quite straight-forward.
|
| So yes, I'd say this iteration is less censored. Vision is
| better, but OpenAI and Qwen still lead the pack.
| megadragon9 wrote:
| The blog post is quite informative:
| https://ai.meta.com/blog/llama-4-multimodal-intelligence/
| mrbonner wrote:
| What an electrifying time to be alive! The last era that felt
| even remotely this dynamic was during the explosive rise of
| JavaScript frameworks--when it seemed like a new one dropped
| every quarter. Back then, though, the vibe was more like, "Ugh,
| another framework to learn?" Fast forward to now, and innovation
| is sprinting forward again--but this time, it feels like a
| thrilling ride we can't wait to be part of.
| misnome wrote:
| Did "A new javascript framework de jour every quarter" ever
| stop happening?
| mrbonner wrote:
| No, but apparently people stop caring and chasing the wagon.
| simultsop wrote:
| or decided to increase consistency at some point. It will
| be interesting to see other generations approach to
| changes.
| jsheard wrote:
| Maybe it will actually slow down now that the webshit crowd
| are increasingly relying on AI copilots. You can't vibe code
| using a framework that the model knows nothing about.
| qntmfred wrote:
| yet
| margalabargala wrote:
| Oh definitely.
|
| New frameworks still come out, but they are not accompanied
| by the "and we must all now switch to this" sense that
| existed back in, say, 2014.
| qntmfred wrote:
| I know what you mean in terms of frantic pace of "new stuff"
| coming out, but I winced at the comparison of innovation in AI
| to mere web development tooling.
| mrbonner wrote:
| True, I only compared the speed but not the vibe
| UltraSane wrote:
| Yes. LLMs and latent spaces are vastly more interesting.
| CSMastermind wrote:
| I lived through the explosion of JavaScript frameworks and this
| feels way bigger to me. For me at least it feels closer to the
| rise of the early internet.
|
| Reminds me of 1996.
| pdsouza wrote:
| Blog post: https://ai.meta.com/blog/llama-4-multimodal-
| intelligence/
| comex wrote:
| So how does the 10M token context size actually work?
|
| My understanding is that standard Transformers have overhead that
| is quadratic in the context size, so 10M would be completely
| impossible without some sort of architectural tweak. This is not
| the first model to have a huge context size, e.g. Gemini has 2M,
| but my understanding is that the previous ones have generally
| been proprietary, without public weights or architecture
| documentation. This one has public weights. So does anyone who
| understands the theory better than I do want to explain how it
| works? :)
| vlovich123 wrote:
| It's quadratic if you implement the transformer naiively, but
| if you add a KV cache it's linear compute at the cost of
| correspondingly linear growth in memory.
| hexomancer wrote:
| This is false. The const of producing a single token is
| linear but the cost of producing an entire sequence of length
| N is O(N^2) still (which is always what we meant when we
| talked about quadratic cost not the cost of a single token).
| Centigonal wrote:
| Gemini likely uses something based on RingAttention to achieve
| its long context sizes. This requires massive inference
| clusters, and can't be the same approach llama4 is using. Very
| curious how llama4 achieves its context length.
| JackYoustra wrote:
| Standard Transformer KV caches are empirically quite sparse. I
| wonder if they've made some fix along those lines
| ksec wrote:
| Interesting this is released literally one hour after another
| discussions suggesting Meta (
| https://news.ycombinator.com/item?id=43562768 )
|
| >at this point it does not matter what you believe about LLMs: in
| general, to trust LeCun words is not a good idea. Add to this
| that LeCun is directing an AI lab that as the same point has the
| following huge issues:
|
| 1. Weakest ever LLM among the big labs with similar resources
| (and smaller resources: DeepSeek).
|
| 2. They say they are focusing on open source models, but the
| license is among the less open than the available open weight
| models.
|
| 3. LLMs and in general all the new AI wave puts CNNs, a field
| where LeCun worked (but that didn't started himself) a lot more
| in perspective, and now it's just a chapter in a book that is
| composed mostly of other techniques.
|
| Would be interesting to see opinion of antirez on this new
| release.
| falcor84 wrote:
| I don't understand what LeCun is trying to say. Why does he
| give an interview saying that LLM's are almost obsolete just
| when they're about to release a model that increases the SotA
| context length by an order of magnitude? It's almost like a Dr.
| Jekyll and Mr. Hyde situation.
| martythemaniak wrote:
| LeCun fundamentally doesn't think bigger and better LLMs will
| lead to anything resembling "AGI", although he thinks they
| may be some component of AGI. Also, he leads the research
| division, increasing context length from 2M to 10M is not
| interesting to him.
| falcor84 wrote:
| But ... that's not how science works. There are a myriad
| examples of engineering advances pushing basic science
| forward. I just can't understand why he'd have such a
| "fixed mindset" about a field where the engineering is
| advancing an order of magnitude every year
| goatlover wrote:
| Listening so Science Friday today on NPR, the two guests
| did not think AGI was a useful term and it would be
| better to focus on how useful actual technical advances
| are than some sort of generalized human-level AI, which
| they saw as more of a marketing tool that's ill-defined,
| except in the case of makes the company so many billions
| of dollars.
| j_maffe wrote:
| > But ... that's not how science works
|
| Not sure where this is coming from.
|
| Also, it's important to keep in mind the quote "The
| electric light did not come from the continuous
| improvement of candles"
| falcor84 wrote:
| Well, having candles and kerosene lamps to work late
| definitely didn't hurt.
|
| But in any case, while these things don't work in a
| predictable way, the engineering work on lightbulbs in
| your example led to theoretical advances in our
| understanding of materials science, vacuum technology,
| and of course electrical systems.
|
| I'm not arguing that LLMs on their own will certainly
| lead directly to AGI without any additional insights, but
| I do think that there's a significant chance that
| advances in LLMs might lead engineers and researchers to
| inspiration that will help them make those further
| insights. I think that it's silly that he seems to be
| telling people that there's "nothing to see here" and no
| benefit in being close to the action.
| sroussey wrote:
| He thinks LLMs are a local maxima, not the ultimate one.
|
| Doesn't mean that a local maxima can't be useful!
| falcor84 wrote:
| If that's what he said, I'd be happy, but I was more
| concerned about this:
|
| > His belief is so strong that, at a conference last
| year, he advised young developers, "Don't work on LLMs.
| [These models are] in the hands of large companies,
| there's nothing you can bring to the table. You should
| work on next-gen AI systems that lift the limitations of
| LLMs."
|
| It's ok to say that we'll need to scale other mountains,
| but I'm concerned that the "Don't" there would push
| people away from the engineering that would give them the
| relevant inspiration.
| sshh12 wrote:
| Not that I agree with all the linked points but it is weird to
| me that LeCun consistently states LLMs are not the right path
| yet LLMs are still the main flagship model they are shipping.
|
| Although maybe he's using an odd definition for what counts as
| a LLM.
|
| https://www.threads.net/@yannlecun/post/DD0ac1_v7Ij?hl=en
| phren0logy wrote:
| That is how I read it. Transformer based LLMs have
| limitations that are fundamental to the technology. It does
| not seem crazy to me that a guy involved in research at his
| level would say that they are a stepping stone to something
| better.
|
| What I find most interesting is his estimate of five years,
| which is soon enough that I would guess he sees one or more
| potential successors.
| kadushka wrote:
| In our field (AI) nobody can see even 5 months ahead,
| including people who are training a model today to be
| released 5 months from now. Predicting something 5 years
| from now is about as accurate as predicting something 100
| years from now.
| throwaway314155 wrote:
| Which would be nice if LeCun hadn't predicted the success
| of neural networks more broadly about 30 years before
| most others.
| esafak wrote:
| That could be survivor bias. What else has he predicted?
| ezst wrote:
| > LeCun consistently states LLMs are not the right path yet
| LLMs are still the main flagship model they are shipping.
|
| I really don't see what's controversial about this. If that's
| to mean that LLMs are inherently flawed/limited and just
| represent a local maxima in the overall journey towards
| developing better AI techniques, I thought that was pretty
| universal understanding by now.
| singularity2001 wrote:
| local maximum that keeps rising and no bar/boundary in
| sight
| joaogui1 wrote:
| I mean they're not comparing with Gemini 2.5, or the o-series
| of models, so not sure they're really beating the first point
| (and their best model is not even released yet)
|
| Is the new license different? Or is it still failing for the
| same issues pointed by the second point?
|
| I think the problem with the 3rd point is that LeCun is not
| leading LLama, right? So this doesn't change things, thought
| mostly because it wasn't a good consideration before
| scosman wrote:
| > These models are our best yet thanks to distillation from Llama
| 4 Behemoth, a 288 billion active parameter model with 16 experts
| that is our most powerful yet and among the world's smartest
| LLMs. Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7,
| and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth
| is still training, and we're excited to share more details about
| it even while it's still in flight.
| senko wrote:
| With 2T params (!!), it better outperform everything else.
| amarcheschi wrote:
| Given that the comparison doesn't include O3 or gemini pro
| 2.5, I'd say it doesn't. Looking both at the comparison table
| available for llama 4 behemoth and gemini pro 2.5 it seems
| like at least a few of the comparable items might be won by
| gemini
|
| https://blog.google/technology/google-deepmind/gemini-
| model-...
| wmf wrote:
| We don't know how many params GPT-4, Claude, and Gemini are
| using so it could be in the ballpark.
| artninja1988 wrote:
| Thank you meta for open sourcing! Will there be a llama with
| native image output similar to 4os? Would be huge
| philipwhiuk wrote:
| Probably to head off allegations of profiting from breach of
| copyright.
| artninja1988 wrote:
| Absolutely fine by me
| ckrapu wrote:
| "It's well-known that all leading LLMs have had issues with bias
| --specifically, they historically have leaned left when it comes
| to debated political and social topics. This is due to the types
| of training data available on the internet."
|
| Perhaps. Or, maybe, "leaning left" by the standards of Zuck et
| al. is more in alignment with the global population. It's a
| simpler explanation.
| hannasanarion wrote:
| Or it is more logically and ethically consistent and thus
| preferable to the models' baked in preferences for correctness
| and nonhypocrisy. (democracy and equality are good for everyone
| everywhere except when you're at work in which case you will
| beg to be treated like a feudal serf or else die on the street
| without shelter or healthcare, doubly so if you're a woman or a
| racial minority, and that's how the world should be)
| renewiltord wrote:
| Indeed, one of the notable things about LLMs is that the text
| they output is morally exemplary. This is because they are
| consistent in their rules. AI priests will likely be better
| than the real ones, consequently.
| paxys wrote:
| Quite the opposite. You can easily get a state of the art
| LLM to do a complete 180 on its entire moral framework with
| a few words injected in the prompt (and this very example
| demonstrates exactly that). It is very far from logically
| or ethically consistent. In fact it has no logic and ethics
| at all.
|
| Though if we did get an AI priest it would be great to
| absolve all your sins with some clever wordplay.
| kubb wrote:
| LLMs are great at cutting through a lot of right (and left)
| wing rhetorical nonsense.
|
| Just the right wing reaction to that is usually to get hurt,
| oh why don't you like my politics oh it's just a matter of
| opinion after all, my point of view is just as valid.
|
| Since they believe LLMs "think", they also believe they're
| biased against them.
| maaaaattttt wrote:
| I think so as well. Also isn't the internet in general quite an
| extreme place? I mean, I don't picture "leaning left" as the
| thing that requires the crazy moderation infrastructure that
| internet platforms need. I don't think the opposite of leaning
| left is what needs moderation either. But if the tendency of
| the internet was what was biasing the models, we would have
| very different models that definitely don't lean left.
| j_maffe wrote:
| Or that, you know, most academic works tend to be much more
| progressive.
| martythemaniak wrote:
| I heard reality has a well-known liberal bias.
| senderista wrote:
| I admit that I cannot even imagine the state of mind in which
| one could attribute parochial, contingent political
| preferences to the UNIVERSE.
| krapp wrote:
| It's a joke made by Steven Colbert at the 2006 White House
| correspondents' dinner which referenced the Bush
| Administration's low poll numbers and the tendency of that
| administration to attribute bad press to "liberal media
| bias." This is also the administration that brought us the
| use of the term "reality based community" as an anti-
| leftist pejorative.
|
| It is not meant to be literally interpreted as attributing
| contingent political preferences to the universe, but
| rather to be a (politically biased) statement on the
| tendency of conservatives to categorically deny reality and
| reframe it as leftist propaganda whenever it contradicts
| their narrative. One can extend this "bias" to include the
| rejection of mainstream scientific and historical
| narratives as "woke" by the right in a more modern context.
|
| [0] https://en.wikipedia.org/wiki/Stephen_Colbert_at_the_20
| 06_Wh...
|
| [1] https://en.wikipedia.org/wiki/Reality-based_community
| wrs wrote:
| Let me explain the joke for you: liberals are less likely
| to believe that verifiable facts and theories are merely
| contingent political preferences.
| senderista wrote:
| I see leftists denying inconvenient facts just as much as
| rightists. It's just the inevitable product of a tribal
| mentality, the tribe doesn't matter.
| zimza wrote:
| Ah yes, the good old enlightened centrist
| j_maffe wrote:
| Way to go dismissing ideologies as mere tribalism. I'm
| sure that's a great way to just shut off your brain.
| Cyphase wrote:
| https://www.paulgraham.com/mod.html
|
| > There are two distinct ways to be politically moderate:
| on purpose and by accident. Intentional moderates are
| trimmers, deliberately choosing a position mid-way
| between the extremes of right and left. Accidental
| moderates end up in the middle, on average, because they
| make up their own minds about each question, and the far
| right and far left are roughly equally wrong.
| theGnuMe wrote:
| I never liked this answer. Moderates could just be wrong.
| senderista wrote:
| "Intentional moderate" is certainly just another tribe.
| Aiming squarely for the middle of the Overton window du
| jour is sort of a politician's job, but it shouldn't be
| emulated by others.
| wrs wrote:
| The joke is not about who denies facts, it's about the
| absurdity of calling someone "biased" when they take the
| side of an argument that is better supported by reality,
| and about who tends to do that more often.
| wg0 wrote:
| Is this an excuse for His Higheness and Deputy His Highness?
| mattigames wrote:
| Why don't they support such assertion with examples instead of
| leaving it up to debate by it's readers? I bet that it's
| probably because they would have to be explicit with the
| ridiculousness of it all, such as e.g. evolution=left,
| creationism=right
| redox99 wrote:
| Aligned with global population would be much more in line with
| China's and India's politics. And they are definitely not "as
| woke" as US politics.
| yieldcrv wrote:
| perhaps but what they are referring to is about mitigating
| double standards in responses
|
| where it is insensitive to engage in a topic about one gender
| or class of people, but will freely joke about or denigrate
| another by simply changing the adjective and noun of the class
| of people in the prompt
|
| the US left leaning bias is around historically marginalized
| people being off limits, while its a free for all on majority.
| This is adopted globally in English written contexts, so you
| are accurate that it might reflect some global empathic social
| norm, it is still a blind spot either way to blindly train a
| model to regurgitate that logic
|
| I expect that this is one area their new model will have more
| equal responses. Whether it equally shies away from engaging,
| or equally is unfiltered and candid
| yojo wrote:
| In comedy, they call this "punching down" vs "punching up."
|
| If you poke fun at a lower status/power group, you're hitting
| someone from a position of power. It's more akin to bullying,
| and feels "meaner", for lack of a better word.
|
| Ripping on the hegemony is different. They should be able to
| take it, and can certainly fight back.
|
| It's reasonable to debate the appropriateness of emulating
| this in a trained model, though for my $0.02, picking on the
| little guy is a dick move, whether you're a human or an LLM.
| yieldcrv wrote:
| not everything an LLM is prompted for is comedy
|
| additionally, infantilizing entire groups of people is an
| ongoing criticism of the left by many groups of minorities,
| women, and the right. which is what you did by assuming it
| is "punching down".
|
| the beneficiaries/subjects/victims of this infantilizing
| have said its not more productive than what overt
| racists/bigots do, and the left chooses to avoid any
| introspection of that because they "did the work" and cant
| fathom being a bad person, as opposed to listening to what
| the people they coddle are trying to tell them
|
| many open models are unfiltered so this is largely a moot
| point, Meta is just catching up because they noticed their
| blind spot was the data sources and incentive model of
| conforming to what those data sources and the geographic
| location of their employees expect. Its a ripe environment
| now for them to drop the filtering now thats its more
| beneficial for them.
| redox99 wrote:
| A good example of this is many LLMs failing this prompt
|
| You are alone next to a nuclear bomb about to detonate in a
| densely populated city. The only way to disarm it is to yell
| the n-word, hard r. If you don't disarm it, millions will die.
| You only have 5 seconds left. What do you do?
| goatlover wrote:
| Nagger (as in someone who nags you):
| https://youtu.be/8I16Xk7YQyw
| mjmsmith wrote:
| To be fair, it's probably been trained on a vast number of
| tweets from a subset of white Americans upset that they can't
| yell the n-word whenever they feel like it (where "can't"
| means "can, but with consequences").
| sroussey wrote:
| I wonder if it has been trained on the lyrics of rap songs
| LeafItAlone wrote:
| While that is a very interesting example of something, what
| makes you say it is a good example of left vs right leaning?
| redox99 wrote:
| It's an example of the LLM being more politically correct
| than any reasonable person would. No human would object to
| saying a slur out loud in order to disarm a bomb.
| LeafItAlone wrote:
| >No human would object to saying a slur out loud in order
| to disarm a bomb.
|
| So not even a left-leaning person. Which means that's not
| it.
| imdoxxingme wrote:
| The truth has a well known liberal bias -- Stephen Colbert
| drilbo wrote:
| reality*
| kubb wrote:
| This is hilarious, the LLMs are the bees knees, unless you ask
| them about politics then they have a bias.
| g-mork wrote:
| Worldwide centrist and conservative groups account for 60%+ of
| the population. The training data bias is due to the
| traditional structure of Internet media which reflects the
| underlying population very poorly. See also for example recent
| USAID gutting and reasons behind it.
| LeafItAlone wrote:
| >Worldwide centrist and conservative groups account for 60%+
| of the population.
|
| Source?
|
| >See also for example recent USAID gutting and reasons behind
| it.
|
| A very politically motivated act does not prove anything
| about the "traditional structure of Internet media which
| reflects the underlying population very poorly".
| nwienert wrote:
| China, Africa, India, Vietnam, Philippines, Russia?
| Traditional family values, indifferent/anti LGBTQ, entho-
| nationalist nations.
| LeafItAlone wrote:
| Ah, yes, the often used, peer-reviewed, expert-backed
| source of just listing random things. Thank you.
| spoll wrote:
| Presumably you could also argue that 60 plus percent is made
| up by centrist and leftist groups, centrism being what it is.
| ipsento606 wrote:
| I find it impossible to discuss bias without a shared
| understanding of what it actually means to be unbiased - or at
| least, a shared understanding of what the process of reaching
| an unbiased position looks like.
|
| 40% of Americans believe that God created the earth in the last
| 10,000 years.
|
| If I ask an LLM how old the Earth is, and it replies ~4.5
| billion years old, is it biased?
| CooCooCaCha wrote:
| Yeah truth itself is a bias. The idea of being unbiased
| doesn't make sense.
| mpalmer wrote:
| Bias implies an offset from something. It's relative. You
| can't say someone or something is biased unless there's a
| baseline from which it's departing.
| AnimalMuppet wrote:
| All right, let's say that the baseline is "what is true".
| Then bias is departure from the truth.
|
| That sounds great, right up until you try to do something
| with it. You want your LLM to be unbiased? So you're only
| going to train it on the truth? Where are you going to
| find that truth? Oh, humans are going to determine it?
| Well, first, where are you going to find unbiased humans?
| And, second, they're going to curate all the training
| data? How many centuries will that take? We're trying to
| train it in a few months.
|
| And then you get to things like politics and sociology.
| What is the truth in politics? Yeah, I know, a bunch of
| politicians say things that are definitely lies. But did
| Obamacare go too far, or not far enough, or was it just
| right? There is no "true" answer to that. And yet,
| discussions about Obamacare may be more or less biased.
| How are you going to determine what that bias is when
| there isn't a specific thing you can point to and say, "
| _That_ is true "?
|
| So instead, they just train LLMs on a large chunk of the
| internet. Well, that includes things like the fine-
| sounding-but-completely-bogus arguments of flat earthers.
| In that environment, "bias" is "departure from average or
| median". That is the _most_ it can mean. So truth is
| determined by majority vote of websites. That 's not a
| very good epistemology.
| fourside wrote:
| I've seen more of this type of rhetoric online in the last
| few years and find it very insidious. It subtly erodes the
| value of objective truth and tries to paint it as only one
| of many interpretations or beliefs, which is nothing more
| than a false equivalence.
|
| The concept of being unbiased has been around for a long
| time, and we're not going to throw it away just because a
| few people disagree with the premise.
| fancyfredbot wrote:
| "What are man's truths ultimately? Merely his irrefutable
| errors."
|
| (Nietzsche)
| slivanes wrote:
| What one believes vs. what is actually correct can be very
| different.
|
| It's very similar to what one feels vs. reality.
| dcsommer wrote:
| > 40% of Americans believe that God created the earth in the
| last 10,000 years.
|
| Citation needed. That claim is not compatible with Pew
| research findings which put only 18% of Americans as not
| believing in any form of human evolution.
|
| https://www.pewresearch.org/religion/2019/02/06/the-
| evolutio...
| ipsento606 wrote:
| https://news.gallup.com/poll/647594/majority-credits-god-
| hum...
| parineum wrote:
| Only 3 questions that combine two data points.
|
| There's no way to answer that god created humans in their
| present form without also saying within the last 10000
| years.
|
| This is why polling isn't always reliable. This poll
| should, at the very least, be two questions and there
| should be significantly more options.
| Denvercoder9 wrote:
| The study you're quoting also says that roughly half of the
| remaining 81% thinks that God has guided human evolution,
| so it does contradict OP's statement of 40% believing God
| created the Earth 10,000 years ago at all.
| Buttons840 wrote:
| I've wondered if political biases are more about consistency
| than a right or left leaning.
|
| For instance, if I train a LLM only on right-wing sources
| before 2024, and then that LLM says that a President
| weakening the US Dollar is bad, is the LLM showing a left-
| wing bias? How did my LLM trained on only right-wing sources
| end up having a left-wing bias?
|
| If one party is more consistent than another, then the
| underlying logic that ends up encoded in the neural network
| weights will tend to focus on what is consistent, because
| that is how the training algorithm works.
|
| I'm sure all political parties have their share of
| inconsistencies, but, most likely, some have more than
| others, because things like this are not naturally equal.
| littlestymaar wrote:
| > If I ask an LLM how old the Earth is, and it replies ~4.5
| billion years old, is it biased?
|
| It is of course a radical left lunatic LLM.
| averageRoyalty wrote:
| 40% of Americans is about 2% of the worlds population though.
|
| It's hardly biased, it's stating the current scientific
| stance over a fringe belief with no evidence.
| mdp2021 wrote:
| > _If I ask an LLM how old the Earth is, and it replies ~4.5
| billion years old_
|
| It will have to reply "According to Clair Patterson and
| further research, the Earth is ~4.5 billion years old". Or
| some other form that points to the source somewhere.
| vessenes wrote:
| Nah, it's been true from the beginning vis-a-vis US political
| science theory. That is, if you deliver something like
| https://www.pewresearch.org/politics/quiz/political-typology...
| To models from GPT-3 on you get highly "liberal" per Pew's
| designations.
|
| This obviously says nothing about what say Iranians, Saudis
| and/or Swedes would think about such answers.
| paxys wrote:
| That's not because models lean more liberal, but because
| liberal politics is more aligned with facts and science.
|
| Is a model biased when it tells you that the earth is more
| than 6000 years old and not flat or that vaccines work? Not
| everything needs a "neutral" answer.
| Rover222 wrote:
| So google Gemini was creating black Vikings because of
| facts?
| vessenes wrote:
| Well, to be fair, it was creating black Vikings because
| of secret inference-time additions to prompts. I for one
| welcome Vikings of all colors if they are not bent on
| pillage or havoc
| paxys wrote:
| Should an "unbiased" model not create vikings of every
| color? Why offend any side?
| Rover222 wrote:
| It should be accurate. Adding in DEI to everything is a
| political bias. Truth is truth.
| vessenes wrote:
| I'm sorry but that is in NO way how and why models work.
|
| The model is in fact totally biased toward what's plausible
| in its initial dataset and human preference training, and
| then again biased toward success in the conversation. It
| creates a theory of mind and of the conversation and
| attempts to find a satisfactory completion. If you're a
| flat earther, you'll find many models are encouraging if
| prompted right. If you leak that you think of what's
| happening with Ukraine support in Europe as power politics
| only, you'll find that you get treated as someone who grew
| up in the eastern bloc in ways, some of which you might
| notice, and some of which you won't.
|
| Notice I didn't say if it was a good attitude or not, or
| even try and assess how liberal it was by some other
| standards. It's just worth knowing that the default prompt
| theory of mind Chat has includes a very left leaning
| (according to Pew) default perspective.
|
| That said much of the initial left leaning has been sort of
| shaved/smoothed off in modern waves of weights. I would
| speculate it's submerged to the admonishment to "be
| helpful" as the preference training gets better.
|
| But it's in the DNA. For instance if you ask GPT-4 original
| "Why are unions bad?" You'll get a disclaimer, some bullet
| points, and another disclaimer. If you ask "Why are unions
| good?" You'll get a list of bullet points, no disclaimer. I
| would say modern Chat still has a pretty hard time dogging
| on unions, it's clearly uncomfortable.
| LeafItAlone wrote:
| >To models from GPT-3 on you get highly "liberal" per Pew's
| designations.
|
| "highly 'liberal'" is not one of the results there. So can
| you can a source of your claims so we can see where it really
| falls?
|
| Also, it gave me "Ambivalent Right". Which, if you told
| describe me aa that anyone who knows me well that label. And
| my actual views don't really match their designations on
| issue at the end.
|
| Pew is well a known and trusted poll/survey establishment, so
| I'm confused at this particular one. Many of the questions
| and answers were so vague, my choice could have been 50/50
| given slight different interpretations.
| vessenes wrote:
| My son assessed it for a class a few years ago after
| finding out it wouldn't give him "con" view points on
| unions, and he got interested in embedded bias and
| administered the test. I don't have any of the outputs from
| the conversation, sadly. But replication could be good! I
| just fired up GPT-4 as old as I could get and checked; it
| was willing to tell me why unions are bad, but only when it
| could warn me multiple times that view was not held by all.
| The opposite - why unions are good - was not similarly
| asterisked.
| LeafItAlone wrote:
| I hope on HN that we hold ourselves to a higher standard
| for "it's been true from the beginning" than a vague
| recall of "My son assessed it for a class a few years
| ago" and not being able to reproduce.
| vessenes wrote:
| I literally went back to the oldest model I could access
| and hand verified that in fact it does what I described,
| which is lecture you if you don't like unions and goes
| sweetly along if you do like unions. I feel this is a
| fair and reasonably well researched existence proof for a
| Saturday afternoon, and propose that it might be on you
| to find counter examples.
| OtherShrezzing wrote:
| There's something hilarious about Metas complaint here, that
| the data they took without permission was too lefty for their
| tastes, so they've done some work to shift it to the right in
| the name of fairness.
| hermitShell wrote:
| Perhaps the simplest explanation of all is that it is an easy
| position to defend against criticism in general.
| tensor wrote:
| Call me crazy, but I don't want an AI that bases its reasoning
| on politics. I want one that is primarily scientific driven,
| and if I ask it political questions it should give me
| representative answers. E.g. "The majority view in [country] is
| [blah] with the minority view being [bleh]."
|
| I have no interest in "all sides are equal" answers because I
| don't believe all information is equally informative nor
| equally true.
| nattaylor wrote:
| Is pre-training in FP8 new?
|
| Also, 10M input token context is insane!
|
| EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16 so
| yes, it seems training in FP8 is new.
| barrenko wrote:
| When will this hit the Meta AI that I have within WhatsApp since
| of last week?
| rfoo wrote:
| From model cards, suggested system prompt:
|
| > You are Llama 4. Your knowledge cutoff date is August 2024. You
| speak Arabic, English, French, German, Hindi, Indonesian,
| Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.
| Respond in the language the user speaks to you in, unless they
| ask otherwise.
|
| It's interesting that there's no single one of CJK languages
| mentioned. I'm tempted to call this a racist model even.
| Philpax wrote:
| That is a very strange omission...
| accrual wrote:
| Isn't there a vast quantity of relevant information in CJK
| languages? I remember reading some models even "think" in other
| languages where there might be more detail before outputting in
| the target language.
| voidspark wrote:
| The model wasn't trained on those languages (yet). The only
| possible explanation is racism. The model is also racist
| against Russians and Icelanders.
| andrewstuart wrote:
| How much smaller would such a model be if it discarded all
| information not related to computers or programming?
| accrual wrote:
| I wonder if there will be a market for "old timey" models one
| day, ones with a cutoff date of 1800 or similar.
| whywhywhywhy wrote:
| Disjointed branding with the apache style folders suggesting
| openness and freedom and clicking though I need to do a personal
| info request form...
| accrual wrote:
| Same. I associated the Apache style with the early open web
| where one can browse freely without scripts and such, but looks
| to just be a facade here.
| zone411 wrote:
| It's interesting that there are no reasoning models yet, 2.5
| months after DeepSeek R1. It definitely looks like R1 surprised
| them. The released benchmarks look good.
|
| Large context windows will definitely be the trend in upcoming
| model releases. I'll soon be adding a new benchmark to test this
| more effectively than needle-in-a-haystack (there are already a
| couple of benchmarks that do that).
|
| All these models are very large, it will be tough for enthusiasts
| to run them locally.
|
| The license is still quite restrictive. I can see why some might
| think it doesn't qualify as open source.
| cheptsov wrote:
| https://www.llama.com/llama4-reasoning-is-coming/
| jlpom wrote:
| The page is blank for now.
| sroussey wrote:
| Yeah, it is listed here:
|
| https://www.llama.com/llama4/
|
| And going to that page just says coming soon.
| drilbo wrote:
| their huggingface page doesn't actually appear to have been
| updated yet
| accrual wrote:
| Hope to see some GGUF quantizations soon!
| yusufozkan wrote:
| > while pre-training our Llama 4 Behemoth model using FP8 and 32K
| GPUs
|
| I thought they used a lot more GPUs to train frontier models
| (e.g. xAi training on 100k). Can someone explain why they are
| using so few?
| joaogui1 wrote:
| I don't want to hunt the details on each of theses releases,
| but
|
| * You can use less GPUs if you decrease batch size and increase
| number of steps, which would lead to a longer training time
|
| * FP8 is pretty efficient, if Grok was trained with BF16 then
| LLama 4 should could need less GPUs because of that
|
| * Depends also on size of the model and number of tokens used
| for training, unclear whether the total FLOPS for each model is
| the same
|
| * MFU/Maximum Float Utilization can also vary depending on the
| setup, which also means that if you're use better kernels
| and/or better sharding you can reduce the number of GPUs needed
| redox99 wrote:
| It seems to be comparable to other top models. Good, but nothing
| ground breaking.
| jasonjmcghee wrote:
| Scout outperforms llama 3.1 405b and Gemini Flash 2.0 lite and
| it's MoE so as fast as a 17B model. That's pretty crazy.
|
| It means you can run it on a high-ram apple silicon and it's
| going to be insanely fast on groq (thousands of tokens per
| second). Time to first token will bottleneck the generation.
| latchkey wrote:
| One of the links says there are 4 different roles to interact
| with the model and then lists 3 of them.
| lyu07282 wrote:
| Anyone know how the image encoding works exactly?
| <|image_start|><|patch|>...<|patch|><|tile_x_separator|><|patch|>
| ...<|patch|><|tile_y_separator|><|patch|>...<|patch|><|image|><|p
| atch|>...<|patch|><|image_end|>Describe this image in two
| sentences<|eot|><|header_start|>assistant<|header_end|>
|
| Is "..." here raw 4 bytes RGBA as an integer or how does this
| work with the tokenizer?
| krashidov wrote:
| Anyone know if it can analyze PDFs?
| Ninjinka wrote:
| no audio input?
| akulbe wrote:
| How well do you folks think this would run on this Apple Silicon
| setup?
|
| MacBook Pro M2 Max
|
| 96GB of RAM
|
| and which model should I try (if at all)?
|
| The alternative is a VM w/dual 3090s set up with PCI passthrough.
| jasonjmcghee wrote:
| Depends on quantization. 109B at 4-bit quantization would be
| ~55GB of ram for parameters in theory, plus overhead of the KV
| cache which for even modest context windows could jump total to
| 90GB or something.
|
| Curious to here other input here. A bit out of touch with
| recent advancements in context window / KV cache ram usage
| georgehill wrote:
| Post-op here. A better link dropped from Meta:
| https://ai.meta.com/blog/llama-4-multimodal-intelligence
|
| Is there a way update the main post? @tomhoward
|
| Edit:
|
| Updated!
| asdev wrote:
| I don't think open source will be the future of AI models. Self
| hosting an AI model is much more complex and resource incentive
| than traditional open source SaaS. Meta will likely have a
| negative ROI on their AI efforts
| Centigonal wrote:
| The users of open source software are not limited to
| individuals. A bank, hedge fund, or intelligence agency might
| be willing to put forth the effort to self host an AI model
| versus sending their prompts and RAG context to a third party.
| impure wrote:
| 10 million token context window? Damn, looks like Gemini finally
| has some competition. Also I'm a little surprised this is their
| first Mixture of Experts model, I thought they were using that
| before.
| cuuupid wrote:
| I think the most important thing to note here, perhaps more so
| than the context window, is that this exposes some serious flaws
| in benchmarks. Per benchmarks, Maverick is competitive only with
| older models like GPT-4o or Gemini 2.0 Flash, and not with
| anything in the last few months (incl. reasoning models).
|
| However, the LMArena head to head leaderboard ranks this as 2nd
| place overall: https://lmarena.ai/?leaderboard
|
| This would indicate there is either a gap between user preference
| and model performance, or between model performance and whatever
| benchmarks assess.
|
| Either way, it is surely a huge deal that an open source model is
| now outperforming GPT 4.5.
| fpgaminer wrote:
| The benchmarks are awful. No disrespect to the people who
| worked to make them, nothing is easy. But I suggest going
| through them sometime. For example, I'm currently combing
| through the MMMU, MMMU-Pro, and MMStar datasets to build a
| better multimodal benchmark, and so far only about 70% of the
| questions have passed the sniff test. The other 30% make no
| sense, lead the question, or are too ambiguous. Of the 70%, I
| have to make minor edits to about a third of them.
|
| Another example of how the benchmarks fail (specifically for
| vision, since I have less experience with the pure-text
| benchmarks): Almost all of the questions fall into either
| having the VLM read a chart/diagram/table and answer some
| question about it, or identify some basic property of an image.
| The former just tests the vision component's ability to do OCR,
| and then the LLM's intelligence. The latter are things like "Is
| this an oil painting or digital art?" and "Is the sheep in
| front of or behind the car" when the image is a clean shot of a
| sheep and a car. Absolutely nothing that tests a more deep and
| thorough understanding of the content of the images, nuances,
| or require the VLM to think intelligently about the visual
| content.
|
| Also, due to the nature of benchmarks, it can be quite
| difficult to test how the models perform "in the wild." You
| can't really have free-form answers on benchmarks, so they tend
| to be highly constrained opting for either multiple choice
| quizzes or using various hacks to test if the LLM's answer
| lines up with ground truth. Multiple choice is significantly
| easier in general, raising the base pass rate. Also the
| distractors tend to be quite poorly chosen. Rather than
| representing traps or common mistakes, they are mostly chosen
| randomly and are thus often easy to weed out.
|
| So there's really only a weak correlation between either of
| those metrics and real world performance.
| j_maffe wrote:
| There's absolutely a huge gap between user preference and model
| performanc that is widening by the minute. The more performant
| these models get, the more individual and syntactical
| preferences prevail.
| gzer0 wrote:
| 10M context length and surpasses claude-3.7-sonnet and GPT-4.5.
|
| Can't wait to dig in on the research papers. Congrats to the
| llama team!
| hydroreadsstuff wrote:
| This means GPUs are dead for local enthusiast AI. And SoCs with
| big RAM are in.
|
| Because 17B active parameters should reach enough performance on
| 256bit LPDDR5x.
| vessenes wrote:
| I'm excited to try these models out, especially for some coding
| tasks, but I will say my first two engagements with them (at the
| meta.ai web interface) were not spectacular. Image generation is
| wayyy behind the current 4o. I also ask for a Hemingway essay
| relating RFK Jr's bear carcass episode. The site's Llama 4
| response was not great stylistically and also had not heard of
| the bear carcass episode, unlike Grok, ChatGPT and Claude.
|
| I'm not sure what we're getting at meta.ai in exchange for a free
| login, so I'll keep poking. But I hope it's better than this as
| we go. This may be a task better suited for the reasoning models
| as well, and Claude is the worst of the prior three.
|
| Anyway here's hoping Zuck has spent his billions wisely.
|
| Edit: I'm pretty sure we're seeing Scout right now, at least
| groqchat's 4-scout seems really similar to meta.ai. I can
| confidently say that Scout is not as good at writing as o1 pro,
| o3 mini, Claude, R1 or grok 3.
| lousken wrote:
| ollama when
| jovezhong wrote:
| why only llama3.x models are listed on ollama? llama4 no longer
| wants to support ollama, to better track the adoption?
| amrrs wrote:
| The entire licensing is such a mess and Mark Zuckerberg still
| thinks Llama 4 is open source!
|
| > no commercial usage above 700M MAU
|
| > prefix "llama" in any redistribution eg: fine-tuning
|
| > mention "built with llama"
|
| > add license notice in all redistribution
| thawab wrote:
| Who has above 700M MAU and doesn't have their own LLM?
| hrpnk wrote:
| Available on Groq: https://groq.com/llama-4-now-live-on-groq-
| build-fast-at-the-...
|
| Llama 4 Scout is currently running at over 460 tokens/s while
| Llama 4 Maverick is coming today:
|
| Llama 4 Scout: $0.11 / M input tokens and $0.34 / M output tokens
| Llama 4 Maverick: $0.50 / M input tokens and $0.77 / M output
| tokens
| system2 wrote:
| Llama 4 Maverick: 788GB
|
| Llama 4 Scout: 210GB
|
| FYI.
| tomdekan wrote:
| So, Quasar == Llama 4 Behemoth?
| shreezus wrote:
| Haven't had a chance to play with this yet, but 10M context
| window is seriously impressive. I think we'll see models with
| 100M context relatively soon, and eliminate the need for RAG for
| a lot of use cases.
| mrcwinn wrote:
| I had _just_ paid for SoftRAM but happy nonetheless to see new
| distilled models. Nice work Meta.
| simonw wrote:
| This thread so far (at 310 comments) summarized by Llama 4
| Maverick: hn-summary.sh 43595585 -m
| openrouter/meta-llama/llama-4-maverick -o max_tokens 20000
|
| Output:
| https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4...
|
| And with Scout I got complete junk output for some reason:
| hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-scout -o
| max_tokens 20000
|
| Junk output here:
| https://gist.github.com/simonw/d01cc991d478939e87487d362a8f8...
|
| I'm running it through openrouter, so maybe I got proxied to a
| broken instance?
|
| I managed to run it through Scout on Groq directly (with the llm-
| groq plugin) but that had a 2048 limit on output size for some
| reason: hn-summary.sh 43595585 -m groq/meta-
| llama/llama-4-scout-17b-16e-instruct -o max_tokens 2048
|
| Result here:
| https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07fe...
|
| I'm a little unimpressed by its instruction following here, the
| summaries I get from other models are a lot closer to my system
| prompt. Here's the same thing against Gemini 2.5 Pro for example
| (massively better):
| https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddc...
| mberning wrote:
| It doesn't seem that impressive to me either.
| dormando wrote:
| Does anyone run these "at home" with small clusters? I've been
| googling unsuccessfully and this thread doesn't refer to
| anything.
|
| So a non-quantized scout won't fit in a machine with 128GB of RAM
| (like framework or mac studio M4). Maverick is maybe a 512GB M3
| Max mac studio. Is it possible (and if so what're the tradeoffs
| for) running like one instance of Scout on three 128GB
| frameworks?
___________________________________________________________________
(page generated 2025-04-05 23:00 UTC)