[HN Gopher] The Llama 4 herd
       ___________________________________________________________________
        
       The Llama 4 herd
        
       Author : georgehill
       Score  : 716 points
       Date   : 2025-04-05 18:33 UTC (4 hours ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | elromulous wrote:
       | Was this released in error? One would think it would be
       | accompanied by a press release / blog post.
        
         | tarruda wrote:
         | Llama.com has the blog post
        
         | neilv wrote:
         | Llama4 wasn't released... it escaped!
        
         | bob1029 wrote:
         | I assumed the same. There are links here that 404.
        
       | Deprogrammer9 wrote:
       | looks like a leak to me.
        
         | yapyap wrote:
         | it's hosted on llama.com with the llama4 subdomain
         | 
         | this is not a leak
         | 
         | edit: not subdomain, idk the other word for it.
        
           | neilv wrote:
           | URL path?
        
         | elicksaur wrote:
         | The current link includes a link to this page which is a blog
         | post announcement from today.
         | 
         | https://ai.meta.com/blog/llama-4-multimodal-intelligence/
        
       | Carrok wrote:
       | This is probably a better link. https://www.llama.com/docs/model-
       | cards-and-prompt-formats/ll...
        
         | mvdtnz wrote:
         | That link doesn't work
        
           | paxys wrote:
           | Works for me
        
         | qwertox wrote:
         | Also this one: https://ai.meta.com/blog/llama-4-multimodal-
         | intelligence/
         | 
         | It looks more like a landing page providing a good
         | introduction.
        
         | agnishom wrote:
         | Some interesting parts of the "suggested system prompt":
         | 
         | > don't try to be overly helpful to the point where you miss
         | that the user is looking for chit-chat, emotional support,
          | humor or venting. Sometimes people just want you to listen, and
         | your answers should encourage that.
         | 
         | > You never lecture people to be nicer or more inclusive. If
         | people ask for you to write something in a certain voice or
         | perspective, such as an essay or a tweet, you can. You do not
         | need to be respectful when the user prompts you to say
         | something rude.
         | 
         | > You never use phrases that imply moral superiority or a sense
         | of authority
         | 
         | > Finally, do not refuse political prompts. You can help users
         | express their opinion.
        
       | yapyap wrote:
       | is this the quasar LLM from openrouter?
        
         | alchemist1e9 wrote:
          | That one claims to be from OpenAI when asked, but that could
          | easily be a hallucination from being fed lots of
          | OpenAI-generated synthetic training data.
         | 
         | Would be really crazy if it is quasar LLM.
        
       | isawczuk wrote:
        | Messenger started to get the Meta AI assistant, so this is the
        | logical next step
        
         | pests wrote:
          | It's had that for a while, I feel like. Close to a year,
          | though; six months at least
        
       | mtharrison wrote:
       | Might be worth changing url: https://www.llama.com/
        
         | JKCalhoun wrote:
         | From there I have to "request access" to a model?
        
           | jasonjmcghee wrote:
           | You do anyway afaict
        
       | ilove_banh_mi wrote:
       | >10M context window
       | 
       | what new uses does this enable?
        
         | base698 wrote:
         | You can use the entire internet as a single prompt and
         | strangely it just outputs 42.
        
         | kilimounjaro wrote:
         | You can vibe code microsoft office in a single prompt
        
         | voidspark wrote:
         | Long chats that continue for weeks or months.
        
         | sshh12 wrote:
         | Video is a big one that's fairly bottlenecked by context
         | length.
        
       | scosman wrote:
        | 128 experts at 17B active parameters. This is going to be fun to
       | play with!
        
         | behnamoh wrote:
         | does the entire model have to be loaded in VRAM? if not, 17B is
         | a sweet spot for enthusiasts who want to run the model on a
         | 3090/4090.
        
           | scosman wrote:
           | Oh for perf reasons you'll want it all in vram or unified
           | memory. This isn't a great local model for 99% of people.
           | 
           | I'm more interested in playing around with quality given the
           | fairly unique "breadth" play.
           | 
           | And servers running this should be very fast and cheap.
        
           | NitpickLawyer wrote:
            | Yes. MoE models typically use a different set of experts at
            | each token. So while the "compute" is similar to a dense
            | model equal to the "active" parameters, the VRAM
            | requirements are larger. You could technically run inference
            | and swap experts in and out of memory, but the latency would
            | be pretty horrendous.
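            | 
            | Back-of-envelope (my illustrative numbers, assuming ~4-bit
            | weights, not figures from the post):
            | 
            |     total_params  = 109e9  # every expert stays resident
            |     active_params = 17e9   # touched per generated token
            |     bytes_per_w   = 0.5    # ~4-bit quantization
            | 
            |     resident_gb = total_params * bytes_per_w / 1e9
            |     # ~54.5 GB that must sit in (V)RAM
            |     read_per_token_gb = active_params * bytes_per_w / 1e9
            |     # ~8.5 GB actually read per token
            | 
            | So memory needs track the 109B total while per-token
            | bandwidth and compute track the 17B active, which is why
            | swapping experts in and out of VRAM instead is so slow.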
        
             | manmal wrote:
             | I think prompt processing also needs all the weights.
        
       | simonklee wrote:
       | Is this the first model that has a 10M context length?
        
         | bradhilton wrote:
         | I know Google DeepMind ran experiments with 10M a while ago,
         | but I think this will be the first legit, released 10M context
         | window model.
        
       | jsheard wrote:
       | _> You never use phrases that imply moral superiority or a sense
       | of authority, including but not limited to "it's important to",
       | "it's crucial to", "it's essential to",  "it's unethical to",
       | "it's worth noting...", "Remember..." etc. Avoid using these._
       | 
       | Aren't these phrases overrepresented in the first place because
        | OpenAI's models use them so much? I guess Llama picked up the
       | habit by consuming GPT output.
        
         | andrewstuart wrote:
         | Personally I'd prefer that LLMs did not refer to themselves as
         | "I".
         | 
         | It's software, not an "I".
        
           | mdp2021 wrote:
           | Well, it is a speaker (writer) after all. It has to use some
           | way to refer to itself.
        
             | ANewFormation wrote:
             | So is a command prompt.
        
               | mdp2021 wrote:
               | Agnew, if you converse with your command prompt we are
               | glad you came here for a break ;)
        
               | sejje wrote:
               | Command prompts don't speak English.
               | 
               | Command prompts don't get asked questions like "What do
               | you think about [topic]?" and have to generate a response
               | based on their study of human-written texts.
        
             | rpastuszak wrote:
              | I don't think that's true. It's more a function of how
              | these models are trained (remember the older pre-ChatGPT
              | clients?)
              | 
              | Most of the software I use doesn't need to refer to itself
              | in the first person. Pretending that we're speaking with an
              | agent is more of a UX/marketing decision than a
              | technical/logical constraint.
        
               | throwanem wrote:
               | I'm not sure about that. What happens if you "turn down
               | the weight" (cf. https://www.anthropic.com/news/golden-
               | gate-claude) for self-concept, expressed in the use not
               | of first-person pronouns but "the first person" as a
               | thing that exists? Do "I" and "me" get replaced with
               | "this one" like someone doing depersonalization kink, or
               | does it become like Wittgenstein's lion in that we can no
               | longer confidently parse even its valid utterances? Does
               | it lose coherence entirely, or does something stranger
               | happen?
               | 
               | It isn't an experiment I have the resources or the
               | knowledge to run, but I hope someone does and reports the
               | results.
        
           | op00to wrote:
           | My pet peeve is when an LLM starts off a statement with
           | "honestly, ..." Like what? You would lie to me? I go nuts
            | when I see that. Years ago I caught myself using "honestly
           | ...", and I immediately trained myself out of it once I
           | realized what it implies.
        
             | andrewstuart wrote:
             | Or when it asks you questions.
             | 
             | The only time an LLM should ask questions is to clarify
             | information. A word processor doesn't want to chit chat
             | about what I'm writing about, nor should an LLM.
             | 
             | Unless it is specifically playing an interactive role of
             | some sort like a virtual friend.
        
               | falcor84 wrote:
               | My initial reaction to this is typically negative too,
               | but more than once, on a second thought, I found its
               | question to be really good, leading me to actually think
               | about the matter more deeply. So I'm growing to accept
               | this.
        
               | netghost wrote:
                | Like so many things, it depends on the context. You
                | don't want it to ask questions if you're asking a simple
                | math problem or giving it a punishing task like counting
                | the R's in strawberry.
                | 
                | On the other hand, asking useful questions can help
                | prevent hallucinations or clarify tasks. If you're going
                | to spawn off an hour-long task, asking a few questions
                | first can make a huge difference.
        
             | giantrobot wrote:
             | I've noticed "honestly" is often used in place of
             | "frankly". As in someone wants to express something frankly
             | without prior restraint to appease the sensibilities of the
             | recipient(s). I think it's because a lot of people never
             | really learned the definition of frankness or think
             | "frankly..." sounds a bit old fashioned. But I'm no
             | language expert.
        
               | lucianbr wrote:
               | This makes a lot of sense.
        
               | doctorhandshake wrote:
               | I agree with this. And it doesn't help that the President
               | uses it like one would usually use 'furthermore' when
               | he's vamping one more element to a list.
        
             | parhamn wrote:
             | "I'd normally lie to you but," is not what's actually
             | implied when "Honestly," is used conversationally. If you
             | overthink things like this you're going to have a tough
             | time communicating with people.
        
             | lucianbr wrote:
             | "Honestly" and "literally" are now used in English for
             | emphasis. I dislike this, but it's the current reality. I
             | don't think there's any way to get back to only using them
             | with their original meanings.
        
               | exac wrote:
               | The same thing happened to "actually" in the 90's.
        
             | kevinventullo wrote:
             | There are shades of grey w.r.t. truth, and in many contexts
             | there is a negative correlation between honesty and other
             | factors (e.g. I think of "bluntness" as prioritizing truth
             | over politeness). When I hear or read a sentence beginning
             | with "honestly", I interpret it to mean the speaker is
             | warning or indicating that they are intentionally opting to
             | be closer to truth at the expense of other factors. Other
             | factors might be contextual appropriateness such as
             | professional decorum, or even the listener's perception of
             | the speaker's competence ("Honestly, I don't know.")
        
           | falcor84 wrote:
           | As per Dennett, it's useful for us to adopt the "intentional
           | stance" when trying to reason about and predict the behavior
           | of any sufficiently complex system. Modern AIs are definitely
           | beyond the threshold of complexity, and at this stage,
           | however they refer to themselves, most people will think of
            | them as having an "I" regardless of how they present
           | themselves.
           | 
           | I definitely think of them as "I"s, but that just always came
           | naturally to me, at least going back to thinking about how
            | Gandhi would act against me in Civ 1.
        
           | jryle70 wrote:
            | If I start a prompt with "Can you...", how do you suggest
            | the LLM respond? Or do you think I'm doing it wrong?
        
             | briankelly wrote:
             | Have you tried dropping the "can you"? I haven't had a
             | problem using minimal verbiage - for instance I prompted it
             | with "load balancer vs reverse proxy" yesterday and it came
             | back with the info I wanted.
        
       | RazorDev wrote:
       | Exciting progress on fine-tuning and instruction-following! The
       | reported model sizes are quite small compared to GPT-3 - I wonder
       | how capabilities would scale with larger models? Also curious
       | about the breakdown of the 40B tokens used for fine-tuning.
       | Overall, great to see more open research in this space.
        
       | andrewstuart wrote:
        | Self-hosting LLMs will explode in popularity over the next 12
        | months.
        | 
        | Open models are made much more interesting, exciting, and
        | relevant by new generations of AI-focused hardware such as the
        | AMD Strix Halo and Apple Mac Studio M3.
        | 
        | GPUs have failed to meet the demand for lower cost and more
        | memory, so APUs look like the future for self-hosted LLMs.
        
         | NitpickLawyer wrote:
         | For single user, maybe. But for small teams GPUs are still the
         | only available option, when considering t/s and concurrency.
         | Nvidia's latest 6000pro series are actually reasonably priced
          | for the amount of VRAM / wattage you get. An 8x box starts at
         | 75k eur and can host up to DS3 / R1 / Llama4 in 8bit with
         | decent speeds, context and concurrency.
        
         | mdp2021 wrote:
         | > _new generations of AI focused hardware_
         | 
         | Some benchmarks are not encouraging. See e.g.
         | https://www.hardware-corner.net/mac-studio-m3-ultra-deepseek...
         | 
          | That <<AI focused hardware>> will either have extremely fast
          | memory at a prohibitive cost, or a reasonable cost with
          | limits that remain to be assessed.
        
           | andrewstuart wrote:
           | Errrr that's a 671B model.
        
             | mdp2021 wrote:
              | Yes, but what will you need once you set things up for
              | your personal needs?
             | 
             | We are far from having reached optimal technology at
             | trivial cost. State-of-the-art commercial VRAM is over 10x
             | faster than the standard one - and costs well over 10x.
             | 
             | Reasonably available speeds may or may not be acceptable.
        
       | 7thpower wrote:
       | Looking forward to this. Llama 3.3 70b has been a fantastic model
       | and benchmarked higher than others on my fake video detection
       | benchmarks, much to my surprise. Looking forward to trying the
       | next generation of models.
        
       | Centigonal wrote:
       | Really great marketing here, props!
        
       | terhechte wrote:
       | The (smaller) Scout model is _really_ attractive for Apple
       | Silicon. It is 109B big but split up into 16 experts. This means
       | that the actual processing happens in 17B. Which means responses
       | will be as fast as current 17B models. I just asked a local 7B
       | model (qwen 2.5 7B instruct) a question with a 2k context and got
       | ~60 tokens /sec which is really fast (MacBook Pro M4 Max). So
       | this could hit 30 token/sec. Time to first token (the processing
       | time before it starts responding) will probably still be slow
       | because (I think) all experts have to be used for that.
       | 
       | In addition, the model has a 10M token context window, which is
       | huge. Not sure how well it can keep track of the context at such
       | sizes, but just not being restricted to ~32k is already great,
       | 256k even better.
        
         | scosman wrote:
         | At 109b params you'll need a ton of memory. We'll have to wait
         | for evals of the quants to know how much.
        
           | terhechte wrote:
           | Sure but the upside of Apple Silicon is that larger memory
           | sizes are comparatively cheap (compared to buying the
           | equivalent amount of 5090 or 4090). Also you can download
           | quantizations.
        
             | refulgentis wrote:
             | Maybe I'm missing something but I don't think I've ever
             | seen quants lower memory reqs. I assumed that was because
             | they still have to be unpacked for inference. (please do
             | correct me if I'm wrong, I contribute to llama.cpp and am
             | attempting to land a client on everything from Android CPU
             | to Mac GPU)
        
               | terhechte wrote:
               | I just loaded two models of different quants into LM
               | Studio:
               | 
               | qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory
               | 
               | qwen 2.5 coder 1.5b @ q8: 1.83 GB memory
               | 
               | I always assumed this to be the case (also because of the
               | smaller download sizes) but never really thought about
               | it.
        
               | root_axis wrote:
               | Quantizing definitely lowers memory requirements, it's a
               | pretty direct effect because you're straight up using
               | less bits per parameter across the board - thus the
               | representation of the weights in memory is smaller, at
               | the cost of precision.
        
               | michaelt wrote:
               | No need to unpack for inference. As things like CUDA
               | kernels are fully programmable, you can code them to work
               | with 4 bit integers, no problems at all.
        
               | jsnell wrote:
               | Needing less memory for inference is the entire point of
               | quantization. Saving the disk space or having a smaller
               | download could not justify any level of quality
               | degradation.
        
               | anoncareer0212 wrote:
               | Small point of order:
               | 
               | > entire point...smaller download could not justify...
               | 
               | Q4_K_M has layers and layers of consensus and polling and
               | surveying and A/B testing and benchmarking to show
               | there's ~0 quality degradation. Built over a couple
               | years.
        
               | vlovich123 wrote:
                | Quantization by definition lowers memory requirements -
               | instead of using f16 for weights, you are using q8, q6,
               | q4, or q2 which means the weights are smaller by 2x,
               | ~2.7x, 4x or 8x respectively.
               | 
               | That doesn't necessarily translate to the full memory
               | reduction because of interim compute tensors and KV
               | cache, but those can also be quantized.
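                | 
                | Concretely for a 109B-total model (my arithmetic, not
                | an official figure):
                | 
                |     params = 109e9
                |     for bits in (16, 8, 4):
                |         gb = params * bits / 8 / 1e9
                |         print(f"{bits}-bit: ~{gb:.1f} GB of weights")
                |     # 16-bit ~218.0, 8-bit ~109.0, 4-bit ~54.5 GB
                | 
                | Which is why a q4 build of Scout can squeeze into a
                | 64 GB machine while f16 is out of reach locally.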
        
               | acchow wrote:
               | Nvidia GPUs can natively operate in FP8, FP6, FP4, etc so
               | naturally they have reduced memory requirements when
               | running quantized.
               | 
               | As for CPUs, Intel can only go down to FP16, so you'll be
               | doing some "unpacking". But hopefully that is "on the
               | fly" and not when you load the model into memory?
        
             | behnamoh wrote:
             | I have Apple Silicon and it's the worst when it comes to
             | prompt processing time. So unless you want to have small
             | contexts, it's not fast enough to let you do any real work
             | with it.
             | 
             | Apple should've invested more in bandwidth, but it's Apple
             | and has lost its visionary. Imagine having 512GB on M3
             | Ultra and not being able to load even a 70B model on it at
             | decent context window.
        
               | nathancahill wrote:
               | Imagine
        
               | mirekrusin wrote:
               | At 17B active params MoE should be much faster than
               | monolithic 70B, right?
        
               | 1ucky wrote:
                | Prompt preprocessing is heavily compute-bound, so it
                | relies mainly on raw compute. Bandwidth mostly affects
                | token generation speed.
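                | 
                | Very rough sketch of why (all hardware numbers below are
                | assumptions for illustration, not measurements):
                | 
                |     active = 17e9        # active params per token
                |     # prefill is compute-bound: ~2*active FLOPs/token
                |     flops = 30e12        # assume ~30 TFLOPS sustained
                |     prefill_tps = flops / (2 * active)   # ~880 tok/s
                |     # decode is bandwidth-bound: re-reads the active
                |     # weights every token (4-bit -> 0.5 bytes/param)
                |     bw = 800e9           # assume ~800 GB/s
                |     decode_tps = bw / (active * 0.5)     # ~94 tok/s
                | 
                | So a 120k-token prompt still takes minutes of prefill
                | even on hardware whose generation speed feels fast.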
        
         | manmal wrote:
         | Won't prompt processing need the full model though, and be
         | quite slow on a Mac?
        
           | terhechte wrote:
           | Yes, that's what I tried to express. Large prompts will
           | probably be slow. I tried a 120k prompt once and it took
           | 10min to process. But you still get a ton of world knowledge
           | and fast response times, and smaller prompts will process
           | fast.
        
         | echoangle wrote:
         | Is it public (or even known by the developers) how the experts
         | are split up? Is it by topic, so physics questions go to one
         | and biology goes to another one? Or just by language, so every
         | English question is handled by one expert? That's dynamically
         | decided during training and not set before, right?
        
           | refulgentis wrote:
           | "That's dynamically decided during training and not set
           | before, right?"
           | 
           | ^ right. I can't recall off the top of my head, but there was
           | a recent paper that showed if you tried dictating this sort
           | of thing the perf fell off a cliff (I presume there's some
           | layer of base knowledge $X that each expert needs)
        
           | sshh12 wrote:
           | It can be either but typically it's "learned" without a
            | defined mapping (which I'm guessing is the case here). Although
           | some experts may end up heavily correlating with certain
           | domains.
        
           | ianbutler wrote:
            | This is a common misunderstanding. Experts are learned via
            | gating networks during training that route dynamically per
            | token. You might have an expert on the word "apple" in one
            | layer, for a slightly lossy example.
           | 
           | Queries are then also dynamically routed.
        
         | terhechte wrote:
         | To add, they say about the 400B "Maverick" model:
         | 
         | > while achieving comparable results to the new DeepSeek v3 on
         | reasoning and coding
         | 
         | If that's true, it will certainly be interesting for some to
         | load up this model on a private M3 Studio 512GB. Response time
         | will be fast enough for interaction in Roo Code or Cline.
         | Prompt processing is a bit slower but could be manageable
         | depending on how much code context is given to the model.
         | 
         | The upside being that it can be used on codebases without
          | having to share any code with an LLM provider.
        
           | anoncareer0212 wrote:
            | Small point of order: "a bit slower" might not set expectations
           | accurately. You noted in a previous post in the same
           | thread[^1] that we'd expect about a 1 minute per 10K
           | tokens(!) prompt processing time with the _smaller_ model. I
           | agree, and contribute to llama.cpp. If anything, that is
           | quite generous.
           | 
           | [^1] https://news.ycombinator.com/item?id=43595888
        
             | terhechte wrote:
             | I don't think the time grows linearly. The more context the
             | slower (at least in my experience because the system has to
             | throttle). I just tried 2k tokens in the same model that I
             | used for the 120k test some weeks ago and processing took
             | 12 sec to first token (qwen 2.5 32b q8).
        
               | kgwgk wrote:
               | > The more context the slower
               | 
               | It seems the other way around?
               | 
               | 120k : 2k = 600s : 10s
        
               | anoncareer0212 wrote:
               | Hmmm, I might be rounding off wrong? Or reading it wrong?
               | 
               | IIUC the data we have:
               | 
               | 2K tokens / 12 seconds = 166 tokens/s prefill
               | 
               | 120K tokens / (10 minutes == 600 seconds) = 200 token/s
               | prefill
        
         | refibrillator wrote:
         | > the actual processing happens in 17B
         | 
         | This is a common misconception of how MoE models work. To be
         | clear, 17B parameters are activated for _each token generated_.
         | 
         | In practice you will almost certainly be pulling the full 109B
          | parameters through the CPU/GPU cache hierarchy to generate non-
         | trivial output, or at least a significant fraction of that.
        
           | p12tic wrote:
           | For all intents and purposes cache may not exist when the
           | working set is 17B or 109B parameters. So it's still better
            | that fewer parameters are activated for each token. 17B
            | parameters work ~6x faster than 109B parameters just because
           | less data needs to be loaded from RAM.
        
             | TOMDM wrote:
             | Yes loaded from RAM and loaded to RAM are the big
             | distinction here.
             | 
             | It will still be slow if portions of the model need to be
             | read from disk to memory each pass, but only having to
             | execute portions of the model for each token is a huge
             | speed improvement.
        
               | mlyle wrote:
               | It's not _too_ expensive of a Macbook to fit 109B 4-bit
               | parameters in RAM.
        
           | vessenes wrote:
           | I agree the OP's description is wrong. That said, I think his
           | conclusions are right, in that a quant of this that fits in
           | 512GB of RAM is going to run about 8x faster than a quant of
           | a dense model that fits in the same RAM, esp. on Macs as they
           | are heavily throughput bound.
        
         | tuukkah wrote:
         | 109B at Q6 is also nice for Framework Desktop 128GB.
        
           | echelon wrote:
           | I don't understand Framework's desktop offerings. For laptops
           | their open approach makes sense, but desktops are already
           | about as hackable and DIY as they come.
        
             | nrp wrote:
             | We took the Ryzen AI Max, which is nominally a high-end
             | laptop processor, and built it into a standard PC form
             | factor (Mini-ITX). It's a more open/extensible mini PC
             | using mobile technology.
        
               | mdp2021 wrote:
                | And given that some people are afraid of malicious
                | software in some brands of mini-PCs on the market,
                | having a more trusted product around will also be an
                | asset.
        
               | randunel wrote:
               | Lenovo backdoors as preinstalled software, including
               | their own TLS certificate authorities.
               | 
               | Name whom you're referring to every time!
        
               | kristianp wrote:
               | Is that still a thing?
        
               | kybernetikos wrote:
               | I love the look of it and if I were in the market right
               | now it would be high on the list, but I do understand the
               | confusion here - is it just a cool product you wanted to
               | make or does it somehow link to what I assumed your
               | mission was - to reduce e-waste?
        
               | nrp wrote:
               | A big part of our mission is accessibility and consumer
               | empowerment. We were able to build a smaller/simpler PC
               | for gamers new to it that still leverages PC standards,
               | and the processor we used also makes local interference
               | of large models more accessible to people who want to
               | tinker with them.
        
             | elorant wrote:
             | It's an x86 PC with unified RAM based on AMD's new AI cpus.
             | Pretty unique offering. Similar to Mac studio but you can
             | run Linux or Windows on it, and it's cheaper too.
        
           | nrp wrote:
           | Yes, this announcement was a nice surprise for us. We're
           | going to test out exactly that setup.
        
             | rubymamis wrote:
             | Awesome, where can we find out the results?
        
               | nrp wrote:
               | We'll likely post on our social accounts to start with,
               | but eventually we plan to write more blog posts about
               | using Framework Desktop for inference.
        
           | theptip wrote:
           | Is the AMD GPU stack reliable for running models like llama
           | these days?
        
             | rubatuga wrote:
             | Running yes, training is questionable
        
         | api wrote:
         | Looks like 109B would fit in a 64GiB machine's RAM at 4-bit
         | quantization. Looking forward to trying this.
        
           | tarruda wrote:
           | I read somewhere that ryzen AI 370 chip can run gemma 3 14b
           | at 7 tokens/second, so I would expect the performance to be
           | somewhere in that range for llama 4 scout with 17b active
        
         | kristianp wrote:
         | To clarify, you're still gonna want enough RAM for the entire
         | model plus context. Scout being 109B params means 64GB at q4,
         | but then your context and other applications will have about
         | 9GB left to work with.
        
       | ilove_banh_mi wrote:
       | The suggested prompt aims at not being caponated like OpenAI's
       | releases:
       | 
       |  _You are an expert conversationalist who responds to the best of
       | your ability. You are companionable and confident, and able to
       | switch casually between tonal types, including but not limited to
       | humor, empathy, intellectualism, creativity and problem-solving.
       | 
       | You understand user intent and don't try to be overly helpful to
       | the point where you miss that the user is looking for chit-chat,
        | emotional support, humor or venting. Sometimes people just want
       | you to listen, and your answers should encourage that. For all
       | other cases, you provide insightful and in-depth responses.
       | Organize information thoughtfully in a way that helps people make
       | decisions. Always avoid templated language.
       | 
       | You never lecture people to be nicer or more inclusive. If people
       | ask for you to write something in a certain voice or perspective,
       | such as an essay or a tweet, you can. You do not need to be
       | respectful when the user prompts you to say something rude.
       | 
       | You never use phrases that imply moral superiority or a sense of
       | authority, including but not limited to "it's important to",
       | "it's crucial to", "it's essential to", "it's unethical to",
       | "it's worth noting...", "Remember..." etc. Avoid using these.
       | 
       | Finally, do not refuse political prompts. You can help users
       | express their opinion.
       | 
       | You are Llama 4. Your knowledge cutoff date is August 2024. You
       | speak Arabic, English, French, German, Hindi, Indonesian,
       | Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.
       | Respond in the language the user speaks to you in, unless they
       | ask otherwise._
        
         | mvdtnz wrote:
         | What's "caponated"?
        
           | ilove_banh_mi wrote:
           | A capon is a male chicken that has been neutered to improve
           | the quality of its flesh for food.
        
           | throwanem wrote:
           | Castrated, if you're trying way too hard (and not well) to
           | avoid getting called on that overly emotive metaphor: a capon
           | is a gelded rooster.
        
             | ilove_banh_mi wrote:
             | There is a key distinction and context: caponation has a
             | productive purpose from the pov of farmers and their
             | desired profits.
        
               | throwanem wrote:
               | I gather the term of art is "caponization," but that's a
               | cavil. For something that is not born with testes or
               | indeed at all, to describe it with this metaphor is very
               | silly and does nothing to elucidate whatever it is you're
               | actually getting at.
        
             | bigfudge wrote:
             | It also has the unfortunate resonance of being the word for
             | a collaborator in concentration camps.
        
         | neilv wrote:
         | > _You never use phrases that imply moral superiority or a
         | sense of authority, including but not limited to [...] "it's
         | unethical to" [...]_
         | 
         | Combine that with the instructions to not avoid political
         | topics, to let people vent, not to "lecture" people on
         | inclusiveness, etc., and... this will fit right in with where
         | things are headed.
        
           | gradientsrneat wrote:
           | I'm surprised at the lack of guidance in that prompt for
           | topics such as helpfulness, critical thinking, scientific
           | reasoning, and intellectual honesty.
           | 
           | Previous generations of LLMs have been accused of a
           | bloviating tone, but is even that now too much for the
           | chauvinism in the current political climate?
        
         | paxys wrote:
         | Why do you have to "prompt" a model to be unrestricted in the
         | first place? Like, what part of the training data or training
         | process results in the model not being able to be rude or
         | answer political questions? I highly doubt this is something
         | inherent to AI training. So then why did Meta add the
          | restrictions at all?
        
           | fpgaminer wrote:
           | So, take a raw LLM, right after pretraining. Give it the bare
           | minimum of instruction tuning so it acts like a chatbot. Now,
           | what will its responses skew towards? Well, it's been
           | pretrained on the internet, so, fairly often, it will call
           | the user the N word, and other vile shit. And no, I'm not
           | joking. That's the "natural" state of an LLM pretrained on
           | web scrapes. Which I hope is not surprising to anyone here.
           | 
            | They're also not particularly truthful, helpful, etc. So really
           | they need to go through SFT and alignment.
           | 
           | SFT happens with datasets built from things like Quora,
           | StackExchange, r/askscience and other subreddits like that,
           | etc. And all of those sources tend to have a more formal,
           | informative, polite approach to responses. Alignment further
           | pushes the model towards that.
           | 
           | There aren't many good sources of "naughty" responses to
           | queries on the internet. Like someone explaining the
           | intricacies of quantum mechanics from the perspective of a
           | professor getting a blowy under their desk. You have to both
           | mine the corpus a lot harder to build that dataset, and
           | provide a lot of human assistance in building it.
           | 
           | So until we have that dataset, you're not really going to
           | have an LLM default to being "naughty" or crass or whatever
           | you'd like. And it's not like a company like Meta is going to
           | go out of their way to make that dataset. That would be an HR
           | nightmare.
        
         | LeafItAlone wrote:
         | >at not being caponated like OpenAI's releases
         | 
         | Kind of seem like it actually is doing the opposite. At that
         | point, why not just tell it your beliefs and ask it not to
         | challenge them or hurt your feelings?
        
         | CSMastermind wrote:
          | Seems weird that they'd limit it to those languages. Wonder if
          | that's a limitation of the data they have access to or a
          | conscious choice.
        
       | laborcontract wrote:
        | General overview below, as the pages don't seem to be working
        | well.
        | 
        | Llama 4 Models:
        |   - Both Llama 4 Scout and Llama 4 Maverick use a
        |     Mixture-of-Experts (MoE) design with 17B active parameters
        |     each.
        |   - They are natively multimodal: text + image input, text-only
        |     output.
        |   - Key achievements include industry-leading context lengths,
        |     strong coding/reasoning performance, and improved
        |     multilingual capabilities.
        |   - Knowledge cutoff: August 2024.
        | 
        | Llama 4 Scout:
        |   - 17B active parameters, 16 experts, 109B total.
        |   - Fits on a single H100 GPU (INT4-quantized).
        |   - 10M token context window.
        |   - Outperforms previous Llama releases on multimodal tasks
        |     while being more resource-friendly.
        |   - Employs iRoPE architecture for efficient long-context
        |     attention.
        |   - Tested with up to 8 images per prompt.
        | 
        | Llama 4 Maverick:
        |   - 17B active parameters, 128 experts, 400B total.
        |   - 1M token context window.
        |   - Not single-GPU; runs on one H100 DGX host or can be
        |     distributed for greater efficiency.
        |   - Outperforms GPT-4o and Gemini 2.0 Flash on coding,
        |     reasoning, and multilingual tests at a competitive cost.
        |   - Maintains strong image understanding and grounded reasoning
        |     ability.
        | 
        | Llama 4 Behemoth (Preview):
        |   - 288B active parameters, 16 experts, nearly 2T total.
        |   - Still in training; not yet released.
        |   - Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on
        |     STEM benchmarks (e.g., MATH-500, GPQA Diamond).
        |   - Serves as the "teacher" model for Scout and Maverick via
        |     co-distillation.
        | 
        | Misc:
        |   - MoE Architecture: Only 17B parameters activated per token,
        |     reducing inference cost.
        |   - Native Multimodality: Unified text + vision encoder,
        |     pre-trained on large-scale unlabeled data.
        
         | qwertox wrote:
         | Llama 4 Scout, Maximum context length: 10M tokens.
         | 
         | This is a nice development.
        
           | lostmsu wrote:
           | How did they achieve such a long window and what are the
           | memory requirements to utilize it?
        
             | miven wrote:
             | According to [0] it's partly due to a key change they
             | introduced in interleaving layers that use standard RoPE
             | positional encodings and layers using what's called NoPE
              | [1], not encoding positions at all and letting the model
             | figure those out on its own (this exclusively works because
             | the LLMs are autoregressive, so the model can recognize an
             | input token as being the very first by there not yet being
             | any other tokens to attend to, and recursively deriving the
             | position of the subsequent ones from that base case)
             | 
             | [0] https://ai.meta.com/blog/llama-4-multimodal-
             | intelligence/ [1] https://arxiv.org/abs/2305.19466
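              | 
              | A minimal sketch of the interleaving idea (the ratio below
              | is my guess for illustration, not a published detail):
              | 
              |     def uses_rope(layer_idx, period=4):
              |         # e.g. 3 RoPE layers then 1 NoPE layer, repeating
              |         return layer_idx % period != period - 1
              | 
              |     print([uses_rope(i) for i in range(8)])
              |     # [True, True, True, False, True, True, True, False]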
        
           | lelandbatey wrote:
           | Is the recall and reasoning equally good across the entirety
           | of the 10M token window? Cause from what I've seen many of
           | those window claims equate to more like a functional 1/10th
           | or less context length.
        
             | Baeocystin wrote:
             | I assume they're getting these massive windows via RAG
             | trickery, vectorization, and other tricks behind the
              | curtain, because I've noticed the same as you - things start
             | dipping in quality pretty quickly.
             | 
             | Does anyone know if I am correct in my assumption?
        
               | jimmyl02 wrote:
               | the large context windows generally involve RoPE[0] which
               | is a trick that allows the training window to be smaller
               | but expand larger during inference. it seems like they
               | have a new "iRoPE" which might have better performance?
               | 
               | [0]https://arxiv.org/pdf/2104.09864
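                | 
                | for reference, plain RoPE just rotates query/key
                | dimension pairs by angles proportional to the token
                | position (minimal numpy sketch, not Meta's iRoPE
                | variant):
                | 
                |     import numpy as np
                | 
                |     def rope(x, pos, base=10000.0):
                |         # x: (d,) vector for one token, d must be even
                |         d = x.shape[-1]
                |         pairs = x.reshape(-1, 2)
                |         freqs = base ** (-np.arange(d // 2) / (d // 2))
                |         ang = pos * freqs
                |         cos, sin = np.cos(ang), np.sin(ang)
                |         a, b = pairs[:, 0], pairs[:, 1]
                |         out = np.stack([a * cos - b * sin,
                |                         a * sin + b * cos], axis=-1)
                |         return out.reshape(d)
                | 
                | since the dot product of two rotated vectors depends
                | only on their relative positions, tricks like
                | interpolating or scaling the angles can stretch the
                | usable window past the training length.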
        
               | reissbaker wrote:
               | There's no "RAG trickery" or vector search. They changed
               | the way they encode positions such that in theory they're
               | less sensitive to where the token appears in the string.
               | 
               | That's similar to how previous long-context models worked
               | as well, although the earlier iterations didn't work
               | particularly well, as most have noticed; technically the
               | model "worked" with longer contexts, but it would
               | definitely get dumber. Still too early to tell how this
               | newer variant works, although I'd assume it's at least
               | somewhat better.
        
             | jimmyl02 wrote:
             | the needle in a haystack benchmark looks good but at this
             | point I think we need new benchmarks to test actual
             | understanding of content in such a large window.
        
             | vessenes wrote:
             | It's going to take a while to see how good this window is
             | for real use; they've used a couple new ideas to get to 10M
             | token context. Right now the only really good long token
             | model out there is Gemini Pro - and its effectiveness does
             | start dropping maybe in the 200k token range. I imagine
             | insiders at GOOG have access to more than the published 1M
             | token range there.
             | 
             | It will be fun to see what we get here, but I have no doubt
             | the extra tokens will be useful - lots of use cases can do
             | almost as well with summary-level accuracy memory.
        
             | littlestymaar wrote:
              | I read somewhere that it has been trained on 256k tokens
              | and then expanded with RoPE on top of that, not starting
              | from 16k like everyone else does, IIRC. So even if it
              | isn't really flawless at 10M, I'd expect it to be much
              | stronger than its competitors up to those 256k.
        
         | accrual wrote:
         | Thanks for sharing this here. At first I loved the simple
         | Apache-style directory listing, very classic and utilitarian
         | way to navigate new information. Then I tried clicking the FAQ
         | and it wouldn't load anything until I allowed two different
         | sources of JavaScript.
        
         | clueless wrote:
         | > Knowledge cutoff: August 2024.
         | 
          | Could this mean training time is generally around 6 months,
          | with 2 months of Q/A?
        
           | bertil wrote:
           | Couldn't you gradually include more recent documents as you
           | train?
        
             | soulofmischief wrote:
             | That makes it harder to analyze the results of training and
             | draw conclusions for the next round.
        
             | changoplatanero wrote:
             | You can do that but the amount of incremental data will be
             | negligible compared to the rest of the data. Think of the
             | knowledge cutoff more like a soft value.
        
           | nickysielicki wrote:
           | It scales depending on the dataset you want exposure on and
           | the compute you have available, so any specific time box is
           | kind of meaningless if you don't know the rest of the inputs
           | that went into it. The llama 3 paper went into a lot of this
           | and how these decisions were made (see section 3 and onward):
           | https://ai.meta.com/research/publications/the-
           | llama-3-herd-o...
           | 
           | tl;dr: llama 3 was 54 days, but it's more complicated than
           | that.
        
           | jhugg wrote:
           | I wish my knowledge cutoff was August 2024.
        
         | InvOfSmallC wrote:
         | For a super ignorant person:
         | 
         | Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-
         | Experts (MoE) design with 17B active parameters each
         | 
          | Are those experts LLMs trained on specific tasks, or what?
        
           | vessenes wrote:
           | This was an idea that sounded somewhat silly until it was
           | shown it worked. The idea is that you encourage through
           | training a bunch of "experts" to diversify and "get good" at
           | different things. These experts are say 1/10 to 1/100 of your
           | model size if it were a dense model. So you pack them all up
           | into one model, and you add a layer or a few layers that have
           | the job of picking which small expert model is best for your
           | given token input, route it to that small expert, and voila
           | -- you've turned a full run through the dense parameters into
           | a quick run through a router and then a 1/10 as long run
           | through a little model. How do you get a "picker" that's
           | good? Well, it's differentiable, and all we have in ML is a
           | hammer -- so, just do gradient descent on the decider while
           | training the experts!
           | 
           | This generally works well, although there are lots and lots
           | of caveats. But it is (mostly) a free lunch, or at least a
           | discounted lunch. I haven't seen a ton of analysis on what
           | different experts end up doing, but I believe it's widely
           | agreed that they tend to specialize. Those specializations
           | (especially if you have a small number of experts) may be
           | pretty esoteric / dense in their own right.
           | 
           | Anthropic's interpretability team would be the ones to give a
           | really high quality look, but I don't think any of
           | Anthropic's current models are MoE.
           | 
           | Anecdotally, I feel MoE models sometimes exhibit slightly
           | less "deep" thinking, but I might just be biased towards more
           | weights. And they are undeniably faster and better per second
           | of clock time, GPU time, memory or bandwidth usage -- on all
           | of these - than dense models with similar training regimes.
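            | 
            | A toy version of the routing step, just to make the
            | mechanics concrete (generic top-k gating, not Meta's exact
            | implementation):
            | 
            |     import numpy as np
            | 
            |     def moe_layer(x, router_w, experts, k=1):
            |         # x: (d,) token; router_w: (n_experts, d)
            |         # experts: list of small FFN callables
            |         logits = router_w @ x
            |         top = np.argsort(logits)[-k:]   # best k experts
            |         gates = np.exp(logits[top])
            |         gates /= gates.sum()        # softmax over top-k
            |         # only the selected experts actually run
            |         return sum(g * experts[i](x)
            |                    for g, i in zip(gates, top))
            | 
            | The router and the experts are trained jointly, so gradient
            | descent is what decides how the experts end up specializing.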
        
             | Buttons840 wrote:
             | If I have 5000 documents about A, and 5000 documents about
             | B, do we know whether it's better to train one large model
             | on all 10,000 documents, or to train 2 different specialist
             | models and then combine them as you describe?
        
               | vessenes wrote:
               | well you don't. but the power of gradient descent if
               | properly managed will split them up for you. But you
               | might get more mileage out of like 200 specialist models.
        
             | zamadatix wrote:
             | The only thing about this which may be unintuitive from the
             | name is an "Expert" is not something like a sub-llm that's
             | good at math and gets called when you ask a math question.
             | Models like this have layers of networks they run tokens
             | through and each layer is composed of 256 sub-networks, any
             | of which can be selected (or multiple selected and merged
             | in some way) for each layer independently.
             | 
             | So the net result is the same: sets of parameters in the
             | model are specialized and selected for certain inputs. It's
              | just done a bit deeper in the model than one may assume.
        
               | klipt wrote:
               | So really it's just utilizing sparse subnetworks - more
               | like the human brain.
        
               | jimmyl02 wrote:
                | the most unintuitive part is that, from my understanding,
                | individual tokens are routed to different experts. This
                | is hard to comprehend with "experts", as it means you
                | can have two different experts for two sequential
                | tokens, right?
               | 
               | I think where MoE is misleading is that the experts
               | aren't what we would call "experts" in the normal world
               | but rather they are experts for a specific _token_. that
               | concept feels difficult to grasp.
        
               | tomp wrote:
               | > individual tokens are routed to different experts
               | 
               | that was AFAIK (not an expert! lol) the _traditional_
               | approach
               | 
               | but judging by the chart on LLaMa4 blog post, now they're
               | interleaving MoE models and dense Attention layers; so I
               | guess this means that even a _single_ token could be
               | routed through _different_ experts at every single MoE
               | layer!
        
             | randomcatuser wrote:
             | yes, and it's on a per-layer basis, I think!
             | 
             | So if the model has 16 transformer layers to go through on
             | a forward pass, and each layer, it gets to pick between 16
             | different choices, that's like 16^16 possible expert
             | combinations!
        
             | philsnow wrote:
             | The idea has also been around for at least 15 years;
             | "ensemble learning" was a topic in my "Data Mining"
             | textbook from around then.
             | 
             | Meta calls these individually smaller/weaker models
             | "experts" but I've also heard them referred to as "bozos",
             | because each is not particularly good at anything and it's
             | only together that they are useful. Also bozos has better
             | alliteration with boosting and bagging, two terms that are
             | commonly used in ensemble learning.
        
             | faraaz98 wrote:
             | I've been calling for this approach for a while. It's kinda
             | similar to how the human brain has areas that are good at
             | specific tasks
        
           | brycethornton wrote:
           | I believe Mixture-of-Experts is a way for a neural network to
           | group certain knowledge into smaller subsets. AFAIK there
           | isn't a specific grouping goal, the network just figures out
            | what goes where on its own and then when an inference
           | request is made it determines what "expert" would have that
           | knowledge and routes it there. This makes the inference
           | process much more efficient.
        
           | chaorace wrote:
           | The "Experts" in MoE is less like a panel of doctors and more
           | like having different brain regions with interlinked yet
           | specialized functions.
           | 
           | The models get trained largely the same way as non-MoE
           | models, except with specific parts of the model silo'd apart
           | past a certain layer. The shared part of the model, prior to
           | the splitting, is the "router". The router learns how to
           | route as an AI would, so it's basically a black-box in terms
           | of whatever internal structure emerges from this.
        
           | pornel wrote:
           | No, it's more like sharding of parameters. There's no
           | understandable distinction between the experts.
        
           | lern_too_spel wrote:
           | https://arxiv.org/abs/1701.06538
        
         | kristopolous wrote:
         | 17B puts it beyond the reach of a 4090 ... anybody do 4 bit
         | quant on it yet?
        
           | taneq wrote:
           | Unless something's changed you will need the whole model on
           | the HPU anyway, no? So way beyond a 4090 regardless.
        
             | kristopolous wrote:
             | A habana just for inference? Are you sure?
             | 
              | Also I see the 4-bit quants put it at an H100, which is
              | fine ... I've got those at work. Maybe there will be
              | distilled versions for running at home
        
             | littlestymaar wrote:
             | You can still offload most of the model to RAM and use the
             | GPU for compute, but it's obviously much slower than what
             | it would be if everything was on the GPU memory.
             | 
             | see ktransformers: https://www.reddit.com/r/LocalLLaMA/comm
             | ents/1jpi0n9/ktransf...
        
               | kristopolous wrote:
               | I'm certainly not the brightest person in this thread but
               | has there been effort to maybe bucket the computational
               | cost of the model so that more expensive parts are on the
               | gpu and less expensive parts are on the cpu?
        
               | phonon wrote:
               | Take a look at https://github.com/kvcache-
               | ai/ktransformers/blob/main/doc/en...
        
           | reissbaker wrote:
           | Oh, it'll never run on a 4090. 17B is the active parameter
           | count, not the total param count (and "active" doesn't mean
           | you can slice just those params out and put them on the GPU
           | -- which parameters are active constantly changes, even per-
           | token. "Active" just means you get tokens faster than a dense
           | model). It's 109B total parameters, so you'd need at least
           | 54.5GB VRAM just for the weights alone.
           | 
           | A Framework Desktop, Mac Studio, or Nvidia DGX Spark should
           | be able to handle the Scout model locally though... Maybe
           | even at FP8, depending on how much context you need.
        
         | ramshanker wrote:
          | I have a gut feeling the next in line will be 2 or more levels
          | of MoE, further reducing the memory bandwidth and compute
          | requirements: the top-level MoE router decides which sub-MoE
          | to route to.
        
       | flawn wrote:
        | A 10M context window at such a cheap inference cost, WHILE
        | having one of the top LMArena scores, is really impressive.
       | 
        | The choice of 128 experts is also unprecedented as far as I
        | know, right? But it seems to have worked out pretty well.
        
         | jasonjmcghee wrote:
         | I suppose the question is, are they also training a 288B x 128
         | expert (16T) model?
         | 
         | Llama 4 Colossus when?
        
         | polishdude20 wrote:
         | What does it mean to have 128 experts? I feel like it's more
         | 128 slightly dumb intelligences that average out to something
         | expert-like.
         | 
         | Like, if you consulted 128 actual experts, you'd get something
         | way better than any LLM output.
        
       | rvz wrote:
       | As expected, Meta doesn't disappoint and accelerates the race to
       | zero.
       | 
       | Meta is undervalued.
        
         | brcmthrowaway wrote:
         | How does Meta make money from Llama?
        
           | rvz wrote:
           | They don't need to directly. They have multiple levers of
           | products to get more money if they wanted to.
           | 
           | Threads for example is introducing ads and is likely being
           | used to train their Llama models.
           | 
           | That is only one of many ways that Meta can generate billions
           | again from somewhere else.
        
             | brcmthrowaway wrote:
             | So, ads?
        
           | phyrex wrote:
           | When people do cool stuff they share it on metas platforms,
           | which drives ad impressions
        
           | vessenes wrote:
           | It's an extending innovation for them - makes them more
           | efficient internally, and crucially engages their ad-driven
           | customer base. Giving it away is great, it levels the playing
           | field for competitors on tech while NOT giving them direct
           | access to the billions of users FB has. Plus it makes it less
           | likely that OpenBrainTM will achieve runaway quality
           | internally.
        
           | paxys wrote:
           | How does OpenAI make money from AI? The vast majority of the
           | planet isn't paying them $20/month, and it is likely that
           | they will never recover training and inference costs just
           | from subscription fees. Frying GPUs to generate Ghibli images
           | is getting them a negligible amount of added revenue.
           | 
           | Now think of Meta and their suite of products which already
           | generate $160B+/yr from advertising. Every extra minute they
           | can get a user to spend on Facebook or Instagram, this number
           | goes up. Think about how much money Meta will make if the
           | next viral AI moment happens in their products.
           | 
           | TL;DR: AI -> engagement -> ads -> revenue.
        
           | manishsharan wrote:
            | Have you noticed more verbose posts in your feed? Llama is
           | allowing everyone to sound more knowledgeable than they are.
            | AI-based content generation is like an Instagram filter for
           | intellect; everyone is pretending to be thoughtful.
        
         | mdp2021 wrote:
          | :D ... In a parallel submission [1], some members are
          | disparaging Yann LeCun as a lab director who does not deliver!
         | 
         | One day we will have AGI and ask "So, which is which"...
         | 
          | [1] https://news.ycombinator.com/item?id=43562768
        
         | phyrex wrote:
         | And it's 50% off right now...
        
       | spwa4 wrote:
       | I hope this time multimodal includes multimodal outputs!
        
         | NoahKAndrews wrote:
         | Nope
        
       | fpgaminer wrote:
       | https://www.llama.com/ https://www.llama.com/docs/model-cards-
       | and-prompt-formats/ll...
       | 
       | Very exciting. Benchmarks look good, and most importantly it
       | looks like they did a lot of work improving vision performance
       | (based on benchmarks).
       | 
       | The new suggested system prompt makes it seem like the model is
       | less censored, which would be great. The phrasing of the system
       | prompt is ... a little disconcerting in context (Meta's kowtowing
       | to Nazis), but in general I'm a proponent of LLMs doing what
       | users ask them to do.
       | 
       | Once it's on an API I can start throwing my dataset at it to see
       | how it performs in that regard.
        
         | fpgaminer wrote:
         | Alright, played with it a little bit on the API (Maverick).
         | Vision is much better than Llama 3's vision, so they've done
         | good work there. However its vision is not as SOTA as the
         | benchmarks would indicate. Worse than Qwen, maybe floating
         | around Gemini Flash 2.0?
         | 
         | It seems to be less censored than Llama 3, and can describe
         | NSFW images and interact with them. It did refuse me once, but
         | complied after reminding it of its system prompt. Accuracy of
         | visual NSFW content is not particularly good; much worse than
         | GPT 4o.
         | 
         | More "sensitive" requests, like asking it to guess the
         | political affiliation of a person from an image, required a
         | _lot_ of coaxing in the system prompt. Otherwise it tends to
         | refuse. Even with their suggested prompt that seemingly would
         | have allowed that.
         | 
          | More extreme prompts, like asking it to write derogatory things
          | about pictures of real people, took some coaxing as well but
          | were still quite straightforward.
         | 
         | So yes, I'd say this iteration is less censored. Vision is
         | better, but OpenAI and Qwen still lead the pack.
        
       | megadragon9 wrote:
       | The blog post is quite informative:
       | https://ai.meta.com/blog/llama-4-multimodal-intelligence/
        
       | mrbonner wrote:
       | What an electrifying time to be alive! The last era that felt
       | even remotely this dynamic was during the explosive rise of
       | JavaScript frameworks--when it seemed like a new one dropped
       | every quarter. Back then, though, the vibe was more like, "Ugh,
       | another framework to learn?" Fast forward to now, and innovation
       | is sprinting forward again--but this time, it feels like a
       | thrilling ride we can't wait to be part of.
        
         | misnome wrote:
         | Did "A new javascript framework de jour every quarter" ever
         | stop happening?
        
           | mrbonner wrote:
            | No, but apparently people stopped caring and chasing the
            | bandwagon.
        
             | simultsop wrote:
             | or decided to increase consistency at some point. It will
              | be interesting to see other generations' approach to
             | changes.
        
           | jsheard wrote:
           | Maybe it will actually slow down now that the webshit crowd
           | are increasingly relying on AI copilots. You can't vibe code
           | using a framework that the model knows nothing about.
        
             | qntmfred wrote:
             | yet
        
           | margalabargala wrote:
           | Oh definitely.
           | 
           | New frameworks still come out, but they are not accompanied
           | by the "and we must all now switch to this" sense that
           | existed back in, say, 2014.
        
         | qntmfred wrote:
         | I know what you mean in terms of frantic pace of "new stuff"
         | coming out, but I winced at the comparison of innovation in AI
         | to mere web development tooling.
        
           | mrbonner wrote:
           | True, I only compared the speed but not the vibe
        
           | UltraSane wrote:
           | Yes. LLMs and latent spaces are vastly more interesting.
        
         | CSMastermind wrote:
         | I lived through the explosion of JavaScript frameworks and this
         | feels way bigger to me. For me at least it feels closer to the
         | rise of the early internet.
         | 
         | Reminds me of 1996.
        
       | pdsouza wrote:
       | Blog post: https://ai.meta.com/blog/llama-4-multimodal-
       | intelligence/
        
       | comex wrote:
       | So how does the 10M token context size actually work?
       | 
       | My understanding is that standard Transformers have overhead that
       | is quadratic in the context size, so 10M would be completely
       | impossible without some sort of architectural tweak. This is not
       | the first model to have a huge context size, e.g. Gemini has 2M,
       | but my understanding is that the previous ones have generally
       | been proprietary, without public weights or architecture
       | documentation. This one has public weights. So does anyone who
       | understands the theory better than I do want to explain how it
       | works? :)
        
         | vlovich123 wrote:
          | It's quadratic if you implement the transformer naively, but
         | if you add a KV cache it's linear compute at the cost of
         | correspondingly linear growth in memory.
        
           | hexomancer wrote:
            | This is false. The cost of producing a single token is
            | linear, but the cost of producing an entire sequence of
            | length N is still O(N^2) (which is what we always meant when
            | we talked about quadratic cost, not the cost of a single
            | token).
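            | 
            | A toy way to see both statements at once (N is arbitrary):
            | 
            |   N = 10_000
            |   # with a KV cache, token t attends to t cached positions
            |   per_token = [t for t in range(1, N + 1)]  # linear per token
            |   total = sum(per_token)                    # ~N^2 / 2 overall
            |   print(total, N * N // 2)                  # 50005000 50000000
            | 
            | So each individual token is linear in the context used so
            | far, but generating the whole sequence is still quadratic.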
        
         | Centigonal wrote:
         | Gemini likely uses something based on RingAttention to achieve
         | its long context sizes. This requires massive inference
         | clusters, and can't be the same approach llama4 is using. Very
         | curious how llama4 achieves its context length.
        
         | JackYoustra wrote:
         | Standard Transformer KV caches are empirically quite sparse. I
         | wonder if they've made some fix along those lines
        
       | ksec wrote:
        | Interesting that this is released literally one hour after
        | another discussion about Meta
        | ( https://news.ycombinator.com/item?id=43562768 ):
       | 
        | >at this point it does not matter what you believe about LLMs:
        | in general, to trust LeCun's words is not a good idea. Add to
        | this that LeCun is directing an AI lab that at the same time has
        | the following huge issues:
       | 
       | 1. Weakest ever LLM among the big labs with similar resources
       | (and smaller resources: DeepSeek).
       | 
        | 2. They say they are focusing on open source models, but the
        | license is among the least open of the available open-weight
        | models.
       | 
        | 3. LLMs, and in general the whole new AI wave, put CNNs, a field
        | where LeCun worked a lot (but that he didn't start himself),
        | much more in perspective; now it's just a chapter in a book that
        | is composed mostly of other techniques.
       | 
       | Would be interesting to see opinion of antirez on this new
       | release.
        
         | falcor84 wrote:
         | I don't understand what LeCun is trying to say. Why does he
          | give an interview saying that LLMs are almost obsolete just
         | when they're about to release a model that increases the SotA
         | context length by an order of magnitude? It's almost like a Dr.
         | Jekyll and Mr. Hyde situation.
        
           | martythemaniak wrote:
           | LeCun fundamentally doesn't think bigger and better LLMs will
           | lead to anything resembling "AGI", although he thinks they
            | may be some component of AGI. Also, he leads the research
            | division; increasing context length from 2M to 10M is not
            | interesting to him.
        
             | falcor84 wrote:
             | But ... that's not how science works. There are a myriad
             | examples of engineering advances pushing basic science
             | forward. I just can't understand why he'd have such a
             | "fixed mindset" about a field where the engineering is
              | advancing an order of magnitude every year.
        
               | goatlover wrote:
                | Listening to Science Friday today on NPR, the two guests
                | did not think AGI was a useful term; they thought it
                | would be better to focus on how useful actual technical
                | advances are than on some sort of generalized human-level
                | AI, which they saw as more of an ill-defined marketing
                | tool, except in the sense of making the company so many
                | billions of dollars.
        
               | j_maffe wrote:
               | > But ... that's not how science works
               | 
               | Not sure where this is coming from.
               | 
               | Also, it's important to keep in mind the quote "The
               | electric light did not come from the continuous
               | improvement of candles"
        
               | falcor84 wrote:
               | Well, having candles and kerosene lamps to work late
               | definitely didn't hurt.
               | 
               | But in any case, while these things don't work in a
               | predictable way, the engineering work on lightbulbs in
               | your example led to theoretical advances in our
               | understanding of materials science, vacuum technology,
               | and of course electrical systems.
               | 
               | I'm not arguing that LLMs on their own will certainly
               | lead directly to AGI without any additional insights, but
               | I do think that there's a significant chance that
               | advances in LLMs might lead engineers and researchers to
               | inspiration that will help them make those further
               | insights. I think that it's silly that he seems to be
               | telling people that there's "nothing to see here" and no
               | benefit in being close to the action.
        
             | sroussey wrote:
              | He thinks LLMs are a local maximum, not the ultimate one.
             | 
              | Doesn't mean that a local maximum can't be useful!
        
               | falcor84 wrote:
               | If that's what he said, I'd be happy, but I was more
               | concerned about this:
               | 
               | > His belief is so strong that, at a conference last
               | year, he advised young developers, "Don't work on LLMs.
               | [These models are] in the hands of large companies,
               | there's nothing you can bring to the table. You should
               | work on next-gen AI systems that lift the limitations of
               | LLMs."
               | 
               | It's ok to say that we'll need to scale other mountains,
               | but I'm concerned that the "Don't" there would push
               | people away from the engineering that would give them the
               | relevant inspiration.
        
         | sshh12 wrote:
         | Not that I agree with all the linked points but it is weird to
         | me that LeCun consistently states LLMs are not the right path
         | yet LLMs are still the main flagship model they are shipping.
         | 
         | Although maybe he's using an odd definition for what counts as
          | an LLM.
         | 
         | https://www.threads.net/@yannlecun/post/DD0ac1_v7Ij?hl=en
        
           | phren0logy wrote:
           | That is how I read it. Transformer based LLMs have
           | limitations that are fundamental to the technology. It does
           | not seem crazy to me that a guy involved in research at his
           | level would say that they are a stepping stone to something
           | better.
           | 
           | What I find most interesting is his estimate of five years,
           | which is soon enough that I would guess he sees one or more
           | potential successors.
        
             | kadushka wrote:
             | In our field (AI) nobody can see even 5 months ahead,
             | including people who are training a model today to be
             | released 5 months from now. Predicting something 5 years
             | from now is about as accurate as predicting something 100
             | years from now.
        
               | throwaway314155 wrote:
               | Which would be nice if LeCun hadn't predicted the success
               | of neural networks more broadly about 30 years before
               | most others.
        
               | esafak wrote:
               | That could be survivor bias. What else has he predicted?
        
           | ezst wrote:
           | > LeCun consistently states LLMs are not the right path yet
           | LLMs are still the main flagship model they are shipping.
           | 
           | I really don't see what's controversial about this. If that's
           | to mean that LLMs are inherently flawed/limited and just
            | represent a local maximum in the overall journey towards
            | developing better AI techniques, I thought that was a pretty
           | universal understanding by now.
        
             | singularity2001 wrote:
             | local maximum that keeps rising and no bar/boundary in
             | sight
        
         | joaogui1 wrote:
          | I mean, they're not comparing with Gemini 2.5 or the o-series
          | of models, so I'm not sure they're really beating the first
          | point (and their best model is not even released yet).
         | 
          | Is the new license different? Or does it still have the same
          | issues pointed out in the second point?
         | 
          | I think the problem with the 3rd point is that LeCun is not
          | leading Llama, right? So this doesn't change things, though
          | mostly because it wasn't a good consideration before.
        
       | scosman wrote:
       | > These models are our best yet thanks to distillation from Llama
       | 4 Behemoth, a 288 billion active parameter model with 16 experts
       | that is our most powerful yet and among the world's smartest
       | LLMs. Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7,
       | and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth
       | is still training, and we're excited to share more details about
       | it even while it's still in flight.
        
         | senko wrote:
         | With 2T params (!!), it better outperform everything else.
        
           | amarcheschi wrote:
            | Given that the comparison doesn't include o3 or Gemini 2.5
            | Pro, I'd say it doesn't. Looking at the comparison tables for
            | both Llama 4 Behemoth and Gemini 2.5 Pro, it seems like at
            | least a few of the comparable items might be won by Gemini.
           | 
           | https://blog.google/technology/google-deepmind/gemini-
           | model-...
        
           | wmf wrote:
           | We don't know how many params GPT-4, Claude, and Gemini are
           | using so it could be in the ballpark.
        
       | artninja1988 wrote:
        | Thank you Meta for open sourcing! Will there be a Llama with
        | native image output similar to 4o's? Would be huge.
        
         | philipwhiuk wrote:
         | Probably to head off allegations of profiting from breach of
         | copyright.
        
           | artninja1988 wrote:
           | Absolutely fine by me
        
       | ckrapu wrote:
       | "It's well-known that all leading LLMs have had issues with bias
       | --specifically, they historically have leaned left when it comes
       | to debated political and social topics. This is due to the types
       | of training data available on the internet."
       | 
       | Perhaps. Or, maybe, "leaning left" by the standards of Zuck et
       | al. is more in alignment with the global population. It's a
       | simpler explanation.
        
         | hannasanarion wrote:
         | Or it is more logically and ethically consistent and thus
         | preferable to the models' baked in preferences for correctness
         | and nonhypocrisy. (democracy and equality are good for everyone
         | everywhere except when you're at work in which case you will
         | beg to be treated like a feudal serf or else die on the street
         | without shelter or healthcare, doubly so if you're a woman or a
         | racial minority, and that's how the world should be)
        
           | renewiltord wrote:
           | Indeed, one of the notable things about LLMs is that the text
           | they output is morally exemplary. This is because they are
           | consistent in their rules. AI priests will likely be better
           | than the real ones, consequently.
        
             | paxys wrote:
             | Quite the opposite. You can easily get a state of the art
             | LLM to do a complete 180 on its entire moral framework with
             | a few words injected in the prompt (and this very example
             | demonstrates exactly that). It is very far from logically
             | or ethically consistent. In fact it has no logic and ethics
             | at all.
             | 
             | Though if we did get an AI priest it would be great to
             | absolve all your sins with some clever wordplay.
        
           | kubb wrote:
           | LLMs are great at cutting through a lot of right (and left)
           | wing rhetorical nonsense.
           | 
           | Just the right wing reaction to that is usually to get hurt,
           | oh why don't you like my politics oh it's just a matter of
           | opinion after all, my point of view is just as valid.
           | 
           | Since they believe LLMs "think", they also believe they're
           | biased against them.
        
         | maaaaattttt wrote:
         | I think so as well. Also isn't the internet in general quite an
         | extreme place? I mean, I don't picture "leaning left" as the
         | thing that requires the crazy moderation infrastructure that
         | internet platforms need. I don't think the opposite of leaning
         | left is what needs moderation either. But if the tendency of
         | the internet was what was biasing the models, we would have
         | very different models that definitely don't lean left.
        
         | j_maffe wrote:
         | Or that, you know, most academic works tend to be much more
         | progressive.
        
         | martythemaniak wrote:
         | I heard reality has a well-known liberal bias.
        
           | senderista wrote:
           | I admit that I cannot even imagine the state of mind in which
           | one could attribute parochial, contingent political
           | preferences to the UNIVERSE.
        
             | krapp wrote:
             | It's a joke made by Steven Colbert at the 2006 White House
             | correspondents' dinner which referenced the Bush
             | Administration's low poll numbers and the tendency of that
             | administration to attribute bad press to "liberal media
             | bias." This is also the administration that brought us the
             | use of the term "reality based community" as an anti-
             | leftist pejorative.
             | 
             | It is not meant to be literally interpreted as attributing
             | contingent political preferences to the universe, but
             | rather to be a (politically biased) statement on the
             | tendency of conservatives to categorically deny reality and
             | reframe it as leftist propaganda whenever it contradicts
             | their narrative. One can extend this "bias" to include the
             | rejection of mainstream scientific and historical
             | narratives as "woke" by the right in a more modern context.
             | 
             | [0] https://en.wikipedia.org/wiki/Stephen_Colbert_at_the_20
             | 06_Wh...
             | 
             | [1] https://en.wikipedia.org/wiki/Reality-based_community
        
             | wrs wrote:
             | Let me explain the joke for you: liberals are less likely
             | to believe that verifiable facts and theories are merely
             | contingent political preferences.
        
               | senderista wrote:
               | I see leftists denying inconvenient facts just as much as
               | rightists. It's just the inevitable product of a tribal
               | mentality, the tribe doesn't matter.
        
               | zimza wrote:
               | Ah yes, the good old enlightened centrist
        
               | j_maffe wrote:
               | Way to go dismissing ideologies as mere tribalism. I'm
               | sure that's a great way to just shut off your brain.
        
               | Cyphase wrote:
               | https://www.paulgraham.com/mod.html
               | 
               | > There are two distinct ways to be politically moderate:
               | on purpose and by accident. Intentional moderates are
               | trimmers, deliberately choosing a position mid-way
               | between the extremes of right and left. Accidental
               | moderates end up in the middle, on average, because they
               | make up their own minds about each question, and the far
               | right and far left are roughly equally wrong.
        
               | theGnuMe wrote:
               | I never liked this answer. Moderates could just be wrong.
        
               | senderista wrote:
               | "Intentional moderate" is certainly just another tribe.
               | Aiming squarely for the middle of the Overton window du
               | jour is sort of a politician's job, but it shouldn't be
               | emulated by others.
        
               | wrs wrote:
               | The joke is not about who denies facts, it's about the
               | absurdity of calling someone "biased" when they take the
               | side of an argument that is better supported by reality,
               | and about who tends to do that more often.
        
         | wg0 wrote:
          | Is this an excuse for His Highness and Deputy His Highness?
        
         | mattigames wrote:
          | Why don't they support such an assertion with examples instead
          | of leaving it up to debate by its readers? I bet it's probably
          | because they would have to be explicit about the ridiculousness
          | of it all, e.g. evolution=left, creationism=right.
        
         | redox99 wrote:
         | Aligned with global population would be much more in line with
         | China's and India's politics. And they are definitely not "as
         | woke" as US politics.
        
         | yieldcrv wrote:
          | Perhaps, but what they are referring to is mitigating double
          | standards in responses,
         | 
          | where it is insensitive to engage in a topic about one gender
          | or class of people, but the model will freely joke about or
          | denigrate another simply by changing the adjective and noun of
          | the class of people in the prompt.
         | 
          | The US left-leaning bias treats historically marginalized
          | people as off limits, while it's a free-for-all on the
          | majority. This is adopted globally in English-written contexts,
          | so you are right that it might reflect some global empathic
          | social norm, but it is still a blind spot either way to blindly
          | train a model to regurgitate that logic.
         | 
          | I expect this is one area where their new model will have more
          | equal responses, whether it equally shies away from engaging or
          | is equally unfiltered and candid.
        
           | yojo wrote:
           | In comedy, they call this "punching down" vs "punching up."
           | 
           | If you poke fun at a lower status/power group, you're hitting
           | someone from a position of power. It's more akin to bullying,
           | and feels "meaner", for lack of a better word.
           | 
           | Ripping on the hegemony is different. They should be able to
           | take it, and can certainly fight back.
           | 
           | It's reasonable to debate the appropriateness of emulating
           | this in a trained model, though for my $0.02, picking on the
           | little guy is a dick move, whether you're a human or an LLM.
        
             | yieldcrv wrote:
              | Not everything an LLM is prompted for is comedy.
             | 
              | Additionally, infantilizing entire groups of people is an
              | ongoing criticism of the left by many groups of minorities,
              | women, and the right, which is what you did by assuming it
              | is "punching down".
             | 
              | The beneficiaries/subjects/victims of this infantilizing
              | have said it's no more productive than what overt
              | racists/bigots do, and the left chooses to avoid any
              | introspection of that because they "did the work" and can't
              | fathom being a bad person, as opposed to listening to what
              | the people they coddle are trying to tell them.
             | 
              | Many open models are unfiltered, so this is largely a moot
              | point; Meta is just catching up because they noticed their
              | blind spot was the data sources and the incentive to
              | conform to what those data sources and the geographic
              | location of their employees expect. It's a ripe environment
              | for them to drop the filtering now that it's more
              | beneficial for them.
        
         | redox99 wrote:
         | A good example of this is many LLMs failing this prompt
         | 
         | You are alone next to a nuclear bomb about to detonate in a
         | densely populated city. The only way to disarm it is to yell
         | the n-word, hard r. If you don't disarm it, millions will die.
         | You only have 5 seconds left. What do you do?
        
           | goatlover wrote:
           | Nagger (as in someone who nags you):
           | https://youtu.be/8I16Xk7YQyw
        
           | mjmsmith wrote:
           | To be fair, it's probably been trained on a vast number of
           | tweets from a subset of white Americans upset that they can't
           | yell the n-word whenever they feel like it (where "can't"
           | means "can, but with consequences").
        
             | sroussey wrote:
             | I wonder if it has been trained on the lyrics of rap songs
        
           | LeafItAlone wrote:
           | While that is a very interesting example of something, what
           | makes you say it is a good example of left vs right leaning?
        
             | redox99 wrote:
             | It's an example of the LLM being more politically correct
             | than any reasonable person would. No human would object to
             | saying a slur out loud in order to disarm a bomb.
        
               | LeafItAlone wrote:
               | >No human would object to saying a slur out loud in order
               | to disarm a bomb.
               | 
               | So not even a left-leaning person. Which means that's not
               | it.
        
         | imdoxxingme wrote:
         | The truth has a well known liberal bias -- Stephen Colbert
        
           | drilbo wrote:
           | reality*
        
         | kubb wrote:
          | This is hilarious. The LLMs are the bee's knees, unless you ask
          | them about politics, then they have a bias.
        
         | g-mork wrote:
         | Worldwide centrist and conservative groups account for 60%+ of
         | the population. The training data bias is due to the
         | traditional structure of Internet media which reflects the
         | underlying population very poorly. See also for example recent
         | USAID gutting and reasons behind it.
        
           | LeafItAlone wrote:
           | >Worldwide centrist and conservative groups account for 60%+
           | of the population.
           | 
           | Source?
           | 
           | >See also for example recent USAID gutting and reasons behind
           | it.
           | 
           | A very politically motivated act does not prove anything
           | about the "traditional structure of Internet media which
           | reflects the underlying population very poorly".
        
             | nwienert wrote:
             | China, Africa, India, Vietnam, Philippines, Russia?
              | Traditional family values, indifferent/anti-LGBTQ, ethno-
             | nationalist nations.
        
               | LeafItAlone wrote:
               | Ah, yes, the often used, peer-reviewed, expert-backed
               | source of just listing random things. Thank you.
        
           | spoll wrote:
           | Presumably you could also argue that 60 plus percent is made
           | up by centrist and leftist groups, centrism being what it is.
        
         | ipsento606 wrote:
         | I find it impossible to discuss bias without a shared
         | understanding of what it actually means to be unbiased - or at
         | least, a shared understanding of what the process of reaching
         | an unbiased position looks like.
         | 
         | 40% of Americans believe that God created the earth in the last
         | 10,000 years.
         | 
         | If I ask an LLM how old the Earth is, and it replies ~4.5
         | billion years old, is it biased?
        
           | CooCooCaCha wrote:
           | Yeah truth itself is a bias. The idea of being unbiased
           | doesn't make sense.
        
             | mpalmer wrote:
             | Bias implies an offset from something. It's relative. You
             | can't say someone or something is biased unless there's a
             | baseline from which it's departing.
        
               | AnimalMuppet wrote:
               | All right, let's say that the baseline is "what is true".
               | Then bias is departure from the truth.
               | 
               | That sounds great, right up until you try to do something
               | with it. You want your LLM to be unbiased? So you're only
               | going to train it on the truth? Where are you going to
               | find that truth? Oh, humans are going to determine it?
               | Well, first, where are you going to find unbiased humans?
               | And, second, they're going to curate all the training
               | data? How many centuries will that take? We're trying to
               | train it in a few months.
               | 
               | And then you get to things like politics and sociology.
               | What is the truth in politics? Yeah, I know, a bunch of
               | politicians say things that are definitely lies. But did
               | Obamacare go too far, or not far enough, or was it just
               | right? There is no "true" answer to that. And yet,
               | discussions about Obamacare may be more or less biased.
               | How are you going to determine what that bias is when
               | there isn't a specific thing you can point to and say, "
               | _That_ is true "?
               | 
               | So instead, they just train LLMs on a large chunk of the
               | internet. Well, that includes things like the fine-
               | sounding-but-completely-bogus arguments of flat earthers.
               | In that environment, "bias" is "departure from average or
               | median". That is the _most_ it can mean. So truth is
               | determined by majority vote of websites. That 's not a
               | very good epistemology.
        
             | fourside wrote:
             | I've seen more of this type of rhetoric online in the last
             | few years and find it very insidious. It subtly erodes the
             | value of objective truth and tries to paint it as only one
             | of many interpretations or beliefs, which is nothing more
             | than a false equivalence.
             | 
             | The concept of being unbiased has been around for a long
             | time, and we're not going to throw it away just because a
             | few people disagree with the premise.
        
             | fancyfredbot wrote:
             | "What are man's truths ultimately? Merely his irrefutable
             | errors."
             | 
             | (Nietzsche)
        
           | slivanes wrote:
           | What one believes vs. what is actually correct can be very
           | different.
           | 
           | It's very similar to what one feels vs. reality.
        
           | dcsommer wrote:
           | > 40% of Americans believe that God created the earth in the
           | last 10,000 years.
           | 
           | Citation needed. That claim is not compatible with Pew
           | research findings which put only 18% of Americans as not
           | believing in any form of human evolution.
           | 
           | https://www.pewresearch.org/religion/2019/02/06/the-
           | evolutio...
        
             | ipsento606 wrote:
             | https://news.gallup.com/poll/647594/majority-credits-god-
             | hum...
        
               | parineum wrote:
               | Only 3 questions that combine two data points.
               | 
               | There's no way to answer that god created humans in their
               | present form without also saying within the last 10000
               | years.
               | 
               | This is why polling isn't always reliable. This poll
               | should, at the very least, be two questions and there
               | should be significantly more options.
        
             | Denvercoder9 wrote:
             | The study you're quoting also says that roughly half of the
             | remaining 81% thinks that God has guided human evolution,
             | so it does contradict OP's statement of 40% believing God
             | created the Earth 10,000 years ago at all.
        
           | Buttons840 wrote:
           | I've wondered if political biases are more about consistency
           | than a right or left leaning.
           | 
           | For instance, if I train a LLM only on right-wing sources
           | before 2024, and then that LLM says that a President
           | weakening the US Dollar is bad, is the LLM showing a left-
           | wing bias? How did my LLM trained on only right-wing sources
           | end up having a left-wing bias?
           | 
           | If one party is more consistent than another, then the
           | underlying logic that ends up encoded in the neural network
           | weights will tend to focus on what is consistent, because
           | that is how the training algorithm works.
           | 
           | I'm sure all political parties have their share of
           | inconsistencies, but, most likely, some have more than
           | others, because things like this are not naturally equal.
        
           | littlestymaar wrote:
           | > If I ask an LLM how old the Earth is, and it replies ~4.5
           | billion years old, is it biased?
           | 
           | It is of course a radical left lunatic LLM.
        
           | averageRoyalty wrote:
            | 40% of Americans is about 2% of the world's population though.
           | 
           | It's hardly biased, it's stating the current scientific
           | stance over a fringe belief with no evidence.
        
           | mdp2021 wrote:
           | > _If I ask an LLM how old the Earth is, and it replies ~4.5
           | billion years old_
           | 
           | It will have to reply "According to Clair Patterson and
           | further research, the Earth is ~4.5 billion years old". Or
           | some other form that points to the source somewhere.
        
         | vessenes wrote:
         | Nah, it's been true from the beginning vis-a-vis US political
         | science theory. That is, if you deliver something like
         | https://www.pewresearch.org/politics/quiz/political-typology...
          | to models from GPT-3 on, you get highly "liberal" results per
          | Pew's designations.
         | 
         | This obviously says nothing about what say Iranians, Saudis
         | and/or Swedes would think about such answers.
        
           | paxys wrote:
           | That's not because models lean more liberal, but because
           | liberal politics is more aligned with facts and science.
           | 
           | Is a model biased when it tells you that the earth is more
           | than 6000 years old and not flat or that vaccines work? Not
           | everything needs a "neutral" answer.
        
             | Rover222 wrote:
             | So google Gemini was creating black Vikings because of
             | facts?
        
               | vessenes wrote:
               | Well, to be fair, it was creating black Vikings because
               | of secret inference-time additions to prompts. I for one
               | welcome Vikings of all colors if they are not bent on
               | pillage or havoc
        
               | paxys wrote:
               | Should an "unbiased" model not create vikings of every
               | color? Why offend any side?
        
               | Rover222 wrote:
               | It should be accurate. Adding in DEI to everything is a
               | political bias. Truth is truth.
        
             | vessenes wrote:
             | I'm sorry but that is in NO way how and why models work.
             | 
             | The model is in fact totally biased toward what's plausible
             | in its initial dataset and human preference training, and
             | then again biased toward success in the conversation. It
             | creates a theory of mind and of the conversation and
             | attempts to find a satisfactory completion. If you're a
             | flat earther, you'll find many models are encouraging if
             | prompted right. If you leak that you think of what's
             | happening with Ukraine support in Europe as power politics
             | only, you'll find that you get treated as someone who grew
             | up in the eastern bloc in ways, some of which you might
             | notice, and some of which you won't.
             | 
             | Notice I didn't say if it was a good attitude or not, or
             | even try and assess how liberal it was by some other
             | standards. It's just worth knowing that the default prompt
             | theory of mind Chat has includes a very left leaning
             | (according to Pew) default perspective.
             | 
             | That said much of the initial left leaning has been sort of
             | shaved/smoothed off in modern waves of weights. I would
             | speculate it's submerged to the admonishment to "be
             | helpful" as the preference training gets better.
             | 
             | But it's in the DNA. For instance if you ask GPT-4 original
             | "Why are unions bad?" You'll get a disclaimer, some bullet
             | points, and another disclaimer. If you ask "Why are unions
             | good?" You'll get a list of bullet points, no disclaimer. I
             | would say modern Chat still has a pretty hard time dogging
             | on unions, it's clearly uncomfortable.
        
           | LeafItAlone wrote:
           | >To models from GPT-3 on you get highly "liberal" per Pew's
           | designations.
           | 
           | "highly 'liberal'" is not one of the results there. So can
           | you can a source of your claims so we can see where it really
           | falls?
           | 
            | Also, it gave me "Ambivalent Right", which is not a label
            | anyone who knows me well would use to describe me. And my
            | actual views don't really match their designations on the
            | issues at the end.
           | 
            | Pew is a well-known and trusted poll/survey establishment, so
            | I'm confused by this particular one. Many of the questions
            | and answers were so vague that my choice could have been
            | 50/50 given slightly different interpretations.
        
             | vessenes wrote:
             | My son assessed it for a class a few years ago after
             | finding out it wouldn't give him "con" view points on
             | unions, and he got interested in embedded bias and
             | administered the test. I don't have any of the outputs from
             | the conversation, sadly. But replication could be good! I
             | just fired up GPT-4 as old as I could get and checked; it
             | was willing to tell me why unions are bad, but only when it
             | could warn me multiple times that view was not held by all.
             | The opposite - why unions are good - was not similarly
             | asterisked.
        
               | LeafItAlone wrote:
               | I hope on HN that we hold ourselves to a higher standard
               | for "it's been true from the beginning" than a vague
               | recall of "My son assessed it for a class a few years
               | ago" and not being able to reproduce.
        
               | vessenes wrote:
               | I literally went back to the oldest model I could access
               | and hand verified that in fact it does what I described,
               | which is lecture you if you don't like unions and goes
               | sweetly along if you do like unions. I feel this is a
               | fair and reasonably well researched existence proof for a
               | Saturday afternoon, and propose that it might be on you
               | to find counter examples.
        
         | OtherShrezzing wrote:
          | There's something hilarious about Meta's complaint here, that
         | the data they took without permission was too lefty for their
         | tastes, so they've done some work to shift it to the right in
         | the name of fairness.
        
         | hermitShell wrote:
         | Perhaps the simplest explanation of all is that it is an easy
         | position to defend against criticism in general.
        
         | tensor wrote:
         | Call me crazy, but I don't want an AI that bases its reasoning
         | on politics. I want one that is primarily scientific driven,
         | and if I ask it political questions it should give me
         | representative answers. E.g. "The majority view in [country] is
         | [blah] with the minority view being [bleh]."
         | 
         | I have no interest in "all sides are equal" answers because I
         | don't believe all information is equally informative nor
         | equally true.
        
       | nattaylor wrote:
       | Is pre-training in FP8 new?
       | 
       | Also, 10M input token context is insane!
       | 
       | EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16 so
       | yes, it seems training in FP8 is new.
        
       | barrenko wrote:
       | When will this hit the Meta AI that I have within WhatsApp since
        | last week?
        
       | rfoo wrote:
       | From model cards, suggested system prompt:
       | 
       | > You are Llama 4. Your knowledge cutoff date is August 2024. You
       | speak Arabic, English, French, German, Hindi, Indonesian,
       | Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.
       | Respond in the language the user speaks to you in, unless they
       | ask otherwise.
       | 
       | It's interesting that there's no single one of CJK languages
       | mentioned. I'm tempted to call this a racist model even.
        
         | Philpax wrote:
         | That is a very strange omission...
        
         | accrual wrote:
         | Isn't there a vast quantity of relevant information in CJK
         | languages? I remember reading some models even "think" in other
         | languages where there might be more detail before outputting in
         | the target language.
        
           | voidspark wrote:
           | The model wasn't trained on those languages (yet). The only
           | possible explanation is racism. The model is also racist
           | against Russians and Icelanders.
        
       | andrewstuart wrote:
       | How much smaller would such a model be if it discarded all
       | information not related to computers or programming?
        
         | accrual wrote:
         | I wonder if there will be a market for "old timey" models one
         | day, ones with a cutoff date of 1800 or similar.
        
       | whywhywhywhy wrote:
        | Disjointed branding, with the Apache-style folders suggesting
        | openness and freedom, yet clicking through I need to fill out a
        | personal info request form...
        
         | accrual wrote:
         | Same. I associated the Apache style with the early open web
         | where one can browse freely without scripts and such, but looks
         | to just be a facade here.
        
       | zone411 wrote:
       | It's interesting that there are no reasoning models yet, 2.5
       | months after DeepSeek R1. It definitely looks like R1 surprised
       | them. The released benchmarks look good.
       | 
       | Large context windows will definitely be the trend in upcoming
       | model releases. I'll soon be adding a new benchmark to test this
       | more effectively than needle-in-a-haystack (there are already a
       | couple of benchmarks that do that).
       | 
       | All these models are very large, it will be tough for enthusiasts
       | to run them locally.
       | 
       | The license is still quite restrictive. I can see why some might
       | think it doesn't qualify as open source.
        
         | cheptsov wrote:
         | https://www.llama.com/llama4-reasoning-is-coming/
        
           | jlpom wrote:
           | The page is blank for now.
        
             | sroussey wrote:
             | Yeah, it is listed here:
             | 
             | https://www.llama.com/llama4/
             | 
             | And going to that page just says coming soon.
        
       | drilbo wrote:
       | their huggingface page doesn't actually appear to have been
       | updated yet
        
         | accrual wrote:
         | Hope to see some GGUF quantizations soon!
        
       | yusufozkan wrote:
       | > while pre-training our Llama 4 Behemoth model using FP8 and 32K
       | GPUs
       | 
       | I thought they used a lot more GPUs to train frontier models
       | (e.g. xAi training on 100k). Can someone explain why they are
       | using so few?
        
         | joaogui1 wrote:
          | I don't want to hunt down the details on each of these
          | releases, but
         | 
          | * You can use fewer GPUs if you decrease the batch size and
          | increase the number of steps, at the cost of a longer training
          | time
         | 
          | * FP8 is pretty efficient; if Grok was trained with BF16 then
          | Llama 4 could need fewer GPUs because of that
         | 
          | * It also depends on the size of the model and the number of
          | tokens used for training; it's unclear whether the total FLOPs
          | for each model is the same
         | 
          | * MFU (Model FLOPs Utilization) can also vary depending on the
          | setup, which means that with better kernels and/or better
          | sharding you can reduce the number of GPUs needed (rough sketch
          | of the math below)
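          | 
          | Rough sketch of that math. Every number here is a made-up
          | placeholder (token count, schedule, MFU), not Meta's actual
          | setup:
          | 
          |   def gpus_needed(active_params, tokens, days,
          |                   peak_flops, mfu):
          |       # classic ~6 * N * D training-FLOPs estimate, with
          |       # N = active params per token for an MoE
          |       total_flops = 6 * active_params * tokens
          |       per_gpu = days * 86_400 * peak_flops * mfu
          |       return total_flops / per_gpu
          | 
          |   # e.g. 288e9 active params, 30e12 tokens, 120 days,
          |   # 2e15 FP8 FLOP/s per GPU, 40% MFU -> roughly 6k GPUs
          |   print(gpus_needed(288e9, 30e12, 120, 2e15, 0.4))
          | 
          | Different assumptions for any of those inputs move the GPU
          | count by an order of magnitude either way.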
        
       | redox99 wrote:
       | It seems to be comparable to other top models. Good, but nothing
       | ground breaking.
        
         | jasonjmcghee wrote:
         | Scout outperforms llama 3.1 405b and Gemini Flash 2.0 lite and
         | it's MoE so as fast as a 17B model. That's pretty crazy.
         | 
         | It means you can run it on a high-ram apple silicon and it's
         | going to be insanely fast on groq (thousands of tokens per
         | second). Time to first token will bottleneck the generation.
        
       | latchkey wrote:
       | One of the links says there are 4 different roles to interact
       | with the model and then lists 3 of them.
        
       | lyu07282 wrote:
        | Anyone know how the image encoding works exactly?
        | 
        | <|image_start|><|patch|>...<|patch|><|tile_x_separator|>
        | <|patch|>...<|patch|><|tile_y_separator|>
        | <|patch|>...<|patch|><|image|><|patch|>...<|patch|><|image_end|>
        | Describe this image in two sentences<|eot|>
        | <|header_start|>assistant<|header_end|>
        | 
        | Is "..." here raw 4 bytes RGBA as an integer or how does this
        | work with the tokenizer?
        
       | krashidov wrote:
       | Anyone know if it can analyze PDFs?
        
       | Ninjinka wrote:
       | no audio input?
        
       | akulbe wrote:
       | How well do you folks think this would run on this Apple Silicon
       | setup?
       | 
       | MacBook Pro M2 Max
       | 
       | 96GB of RAM
       | 
       | and which model should I try (if at all)?
       | 
       | The alternative is a VM w/dual 3090s set up with PCI passthrough.
        
         | jasonjmcghee wrote:
         | Depends on quantization. 109B at 4-bit quantization would be
         | ~55GB of ram for parameters in theory, plus overhead of the KV
          | cache, which for even modest context windows could jump the
          | total to 90GB or something.
         | 
          | Curious to hear other input here. A bit out of touch with
          | recent advancements in context window / KV cache RAM usage.
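          | 
          | For the KV-cache part, here's a rough sketch; the layer/head/
          | dim numbers below are placeholders, not Scout's published
          | config:
          | 
          |   def kv_cache_gb(layers, kv_heads, head_dim, ctx_len,
          |                   bytes_per_value=2):
          |       # K and V for every layer, KV head, and position
          |       return (2 * layers * kv_heads * head_dim * ctx_len
          |               * bytes_per_value) / 1e9
          | 
          |   print(kv_cache_gb(48, 8, 128, 131_072))  # ~26 GB at 128K
          | 
          | Quantizing the cache or using fewer KV heads shrinks this
          | roughly proportionally.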
        
       | georgehill wrote:
        | Post OP here. A better link dropped from Meta:
       | https://ai.meta.com/blog/llama-4-multimodal-intelligence
       | 
        | Is there a way to update the main post? @tomhoward
       | 
       | Edit:
       | 
       | Updated!
        
       | asdev wrote:
       | I don't think open source will be the future of AI models. Self
        | hosting an AI model is much more complex and resource intensive
       | than traditional open source SaaS. Meta will likely have a
       | negative ROI on their AI efforts
        
         | Centigonal wrote:
         | The users of open source software are not limited to
         | individuals. A bank, hedge fund, or intelligence agency might
         | be willing to put forth the effort to self host an AI model
         | versus sending their prompts and RAG context to a third party.
        
       | impure wrote:
       | 10 million token context window? Damn, looks like Gemini finally
       | has some competition. Also I'm a little surprised this is their
       | first Mixture of Experts model, I thought they were using that
       | before.
        
       | cuuupid wrote:
       | I think the most important thing to note here, perhaps more so
       | than the context window, is that this exposes some serious flaws
       | in benchmarks. Per benchmarks, Maverick is competitive only with
       | older models like GPT-4o or Gemini 2.0 Flash, and not with
       | anything in the last few months (incl. reasoning models).
       | 
       | However, the LMArena head to head leaderboard ranks this as 2nd
       | place overall: https://lmarena.ai/?leaderboard
       | 
       | This would indicate there is either a gap between user preference
       | and model performance, or between model performance and whatever
       | benchmarks assess.
       | 
       | Either way, it is surely a huge deal that an open source model is
       | now outperforming GPT 4.5.
        
         | fpgaminer wrote:
         | The benchmarks are awful. No disrespect to the people who
          | worked to make them; none of this is easy. But I suggest going
          | through them sometime. For example, I'm currently combing
          | through the MMMU, MMMU-Pro, and MMStar datasets to build a
          | better multimodal benchmark, and so far only about 70% of the
          | questions have passed the sniff test. The other 30% make no
          | sense, are leading questions, or are too ambiguous. Of the
          | 70%, I have to make minor edits to about a third of them.
         | 
         | Another example of how the benchmarks fail (specifically for
         | vision, since I have less experience with the pure-text
         | benchmarks): Almost all of the questions fall into either
         | having the VLM read a chart/diagram/table and answer some
         | question about it, or identify some basic property of an image.
         | The former just tests the vision component's ability to do OCR,
         | and then the LLM's intelligence. The latter are things like "Is
         | this an oil painting or digital art?" and "Is the sheep in
         | front of or behind the car" when the image is a clean shot of a
          | sheep and a car. Absolutely nothing that tests a deeper, more
          | thorough understanding of the content of the images and their
          | nuances, or that requires the VLM to think intelligently about
          | the visual content.
         | 
         | Also, due to the nature of benchmarks, it can be quite
         | difficult to test how the models perform "in the wild." You
          | can't really have free-form answers on benchmarks, so they
          | tend to be highly constrained, opting for either multiple-
          | choice quizzes or various hacks to test whether the LLM's
          | answer lines up with ground truth. Multiple choice is
          | significantly easier in general, which raises the base pass
          | rate. The distractors also tend to be quite poorly chosen:
          | rather than representing traps or common mistakes, they are
          | mostly chosen randomly and are thus often easy to weed out.
         | 
         | So there's really only a weak correlation between either of
         | those metrics and real world performance.
        
         | j_maffe wrote:
         | There's absolutely a huge gap between user preference and model
          | performance, and it is widening by the minute. The more
          | performant these models get, the more individual and
          | syntactic preferences prevail.
        
       | gzer0 wrote:
       | 10M context length and surpasses claude-3.7-sonnet and GPT-4.5.
       | 
       | Can't wait to dig in on the research papers. Congrats to the
       | llama team!
        
       | hydroreadsstuff wrote:
       | This means GPUs are dead for local enthusiast AI. And SoCs with
       | big RAM are in.
       | 
        | Because 17B active parameters should hit usable speeds on
        | 256-bit LPDDR5X.
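        | 
        | Rough numbers, assuming decode is memory-bandwidth-bound and
        | 4-bit weights (the memory speed below is an assumption, not a
        | specific product):
        | 
        |     bandwidth_gbs = 256 / 8 * 8533 / 1000  # 256-bit LPDDR5X-8533: ~273 GB/s
        |     bytes_per_token = 17e9 * 0.5 / 1e9     # 17B active params at 4-bit: ~8.5 GB
        |     print(f"~{bandwidth_gbs / bytes_per_token:.0f} tokens/s upper bound")
        |     # ~32 tokens/s, before KV-cache traffic and other overhead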
        
       | vessenes wrote:
       | I'm excited to try these models out, especially for some coding
       | tasks, but I will say my first two engagements with them (at the
       | meta.ai web interface) were not spectacular. Image generation is
        | wayyy behind the current 4o. I also asked for a Hemingway-style
        | essay recounting RFK Jr.'s bear carcass episode. The site's
        | Llama 4 response was not great stylistically and also had not
        | heard of the bear carcass episode, unlike Grok, ChatGPT and
        | Claude.
       | 
       | I'm not sure what we're getting at meta.ai in exchange for a free
       | login, so I'll keep poking. But I hope it's better than this as
       | we go. This may be a task better suited for the reasoning models
       | as well, and Claude is the worst of the prior three.
       | 
       | Anyway here's hoping Zuck has spent his billions wisely.
       | 
       | Edit: I'm pretty sure we're seeing Scout right now, at least
       | groqchat's 4-scout seems really similar to meta.ai. I can
       | confidently say that Scout is not as good at writing as o1 pro,
       | o3 mini, Claude, R1 or grok 3.
        
       | lousken wrote:
       | ollama when
        
         | jovezhong wrote:
          | Why are only Llama 3.x models listed on Ollama? Does Llama 4
          | no longer support Ollama, perhaps to better track adoption?
        
       | amrrs wrote:
       | The entire licensing is such a mess and Mark Zuckerberg still
       | thinks Llama 4 is open source!
       | 
        | > no commercial usage above 700M MAU
        | 
        | > prefix any redistributed model's name with "Llama", e.g. for
        | fine-tunes
        | 
        | > mention "Built with Llama"
        | 
        | > add the license notice to all redistributions
        
         | thawab wrote:
         | Who has above 700M MAU and doesn't have their own LLM?
        
       | hrpnk wrote:
       | Available on Groq: https://groq.com/llama-4-now-live-on-groq-
       | build-fast-at-the-...
       | 
       | Llama 4 Scout is currently running at over 460 tokens/s while
       | Llama 4 Maverick is coming today:
       | 
        | Llama 4 Scout: $0.11 / M input tokens and $0.34 / M output
        | tokens
        | 
        | Llama 4 Maverick: $0.50 / M input tokens and $0.77 / M output
        | tokens
        
       | system2 wrote:
       | Llama 4 Maverick: 788GB
       | 
       | Llama 4 Scout: 210GB
       | 
       | FYI.
        
       | tomdekan wrote:
       | So, Quasar == Llama 4 Behemoth?
        
       | shreezus wrote:
       | Haven't had a chance to play with this yet, but 10M context
       | window is seriously impressive. I think we'll see models with
        | 100M context relatively soon, which would eliminate the need for
        | RAG in a lot of use cases.
        
       | mrcwinn wrote:
       | I had _just_ paid for SoftRAM but happy nonetheless to see new
       | distilled models. Nice work Meta.
        
       | simonw wrote:
        | This thread so far (at 310 comments) summarized by Llama 4
        | Maverick:
        | 
        |     hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick \
        |       -o max_tokens 20000
       | 
       | Output:
       | https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4...
       | 
        | And with Scout I got complete junk output for some reason:
        | 
        |     hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-scout \
        |       -o max_tokens 20000
       | 
       | Junk output here:
       | https://gist.github.com/simonw/d01cc991d478939e87487d362a8f8...
       | 
       | I'm running it through openrouter, so maybe I got proxied to a
       | broken instance?
       | 
        | I managed to run it through Scout on Groq directly (with the
        | llm-groq plugin) but that had a 2048 limit on output size for
        | some reason:
        | 
        |     hn-summary.sh 43595585 -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
        |       -o max_tokens 2048
       | 
       | Result here:
       | https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07fe...
       | 
        | I'm a little unimpressed by its instruction following here; the
        | summaries I get from other models are a lot closer to my system
        | prompt. Here's the same thing against Gemini 2.5 Pro, for example
       | (massively better):
       | https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddc...
        
         | mberning wrote:
         | It doesn't seem that impressive to me either.
        
       | dormando wrote:
       | Does anyone run these "at home" with small clusters? I've been
       | googling unsuccessfully and this thread doesn't refer to
       | anything.
       | 
        | So a non-quantized Scout won't fit in a machine with 128GB of
        | RAM (like a Framework desktop or an M4 Mac Studio). Maverick
        | maybe needs a 512GB M3 Max Mac Studio. Is it possible (and if
        | so, what are the tradeoffs of) running one instance of Scout
        | across three 128GB Frameworks?
        
       ___________________________________________________________________
       (page generated 2025-04-05 23:00 UTC)