[HN Gopher] Meta got caught gaming AI benchmarks
       ___________________________________________________________________
        
       Meta got caught gaming AI benchmarks
        
       Author : pseudolus
       Score  : 267 points
       Date   : 2025-04-08 11:29 UTC (11 hours ago)
        
 (HTM) web link (www.theverge.com)
 (TXT) w3m dump (www.theverge.com)
        
       | bn-l wrote:
       | I believe this was designed to flatter the prompter more / be
       | more ingratiating. Which is a worry if true (what it says about
       | the people doing the comparing).
        
         | add-sub-mul-div wrote:
         | There's no end to the possible vectors of human manipulation
         | with this "open-weight" black box.
        
       | deckar01 wrote:
       | The top of that leaderboard is filled with closed weight
       | experimental models.
        
       | etamponi wrote:
       | Meta got caught _first_.
        
         | davidcbc wrote:
         | Not even first, OpenAI got caught a while back
        
           | Mond_ wrote:
           | Do you have a source for this? That's interesting (if true).
        
             | davidcbc wrote:
             | They got the dataset from Epoch AI for one of the
             | benchmarks and pinky swore that they wouldn't train on it
             | 
             | https://techcrunch.com/2025/01/19/ai-benchmarking-
             | organizati...
        
               | tananaev wrote:
               | I don't see anything in the article about being caught.
               | Maybe I missed something?
        
               | tedsanders wrote:
               | davidcbc is spreading fake rumors.
               | 
               | OpenAI was never caught cheating on it, because we didn't
               | cheat on it.
               | 
               | As with any eval, you have to take our word for it, but
               | I'm not sure what more we can do. Personally, if I
                | learned that an OpenAI researcher purposely or
               | accidentally trained on it, and we didn't quickly
               | disclose this, I'd quit on the spot and disclose it
               | myself.
               | 
               | (I work at OpenAI.)
               | 
               | Generally, I don't think anyone here is cheating and I
               | think we're relatively diligent with our evals. The gray
               | zone where things could go wrong is differing levels of
               | care used in scrubbing training data of equivalent or
               | similar problems. At some point the line between learning
               | and memorizing becomes blurry. If an MMLU question asks
               | about an abstract algebra proof, is it cheating to have
               | trained on papers about abstract algebra?
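                | 
                | For what it's worth, the scrubbing step in question is often
                | an n-gram overlap check between training documents and eval
                | items. A minimal sketch of that idea (the 13-gram window and
                | the function names are illustrative assumptions, not a
                | description of any lab's actual pipeline):
                | 
                |     def ngrams(text, n=13):
                |         # lowercased word n-grams for fuzzy overlap matching
                |         w = text.lower().split()
                |         return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
                | 
                |     def is_contaminated(doc, eval_items, n=13):
                |         # flag a training doc sharing any n-gram with an eval item
                |         grams = ngrams(doc, n)
                |         return any(grams & ngrams(item, n) for item in eval_items)
                | 
                | Documents flagged this way get dropped or masked before
                | training; the judgment call is how aggressive to make n and
                | what counts as "equivalent" rather than verbatim overlap.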
        
               | suddenlybananas wrote:
               | >They got the dataset from Epoch AI for one of the
               | benchmarks and pinky swore that they wouldn't train on it
               | 
               | Is anything here actually false or do you not like the
               | conclusions that people may draw from it?
        
               | tucnak wrote:
               | Why are you being disingenuous? Simply having access to
               | the eval in question is already enough for your
               | synthetics guys to match the distribution, and of course
               | you don't contaminate directly on train, that would be
               | stupid, and you would get caught, but if it does inform
               | the reward, the result is the same. You _should_ quit,
               | but you wouldn't because you'd already convinced yourself
               | you're doing RL God's work, not sleight of hand.
               | 
               | > If an MMLU question asks about an abstract algebra
               | proof, is it cheating to have trained on papers about
               | abstract algebra?
               | 
               | This kind of disingenuous bullshit is exactly why people
               | call you cheaters.
               | 
               | > Generally, I don't think anyone here is cheating and I
               | think we're relatively diligent with our evals.
               | 
               | You guys should follow Apple's cult guidelines: never
               | stand out. Think different
        
             | wongarsu wrote:
             | It happens with basically all papers on all topics.
             | Benchmarks are useful when they are first introduced and
             | used to measure things that were released before the
             | benchmark. After that their usefulness rapidly declines.
        
           | hooloovoo_zoo wrote:
           | People have been gaming ML benchmarks as long as there have
           | been ML benchmarks. That's why it's better to see if other
           | researchers are incorporating a technique into their actual
           | models rather than 'is this paper the bold entry in a
           | benchmark table'. But it takes longer.
        
             | jandrese wrote:
             | When a measure becomes a target it is no longer a good
             | measure.
             | 
             | These ML benchmarks were never going to last very long.
             | There is far too much pressure to game them, even
             | unintentionally.
        
         | mkolodny wrote:
         | "Got caught" is a misleading way to present what happened.
         | 
         | According to the article, Meta publicly stated, right below the
         | benchmark comparison, that the version of Llama on LMArena was
         | the experimental chat version:
         | 
         | > According to Meta's own materials, it deployed an
         | "experimental chat version" of Maverick to LMArena that was
         | specifically "optimized for conversationality"
         | 
         | The AI benchmark in question, LMArena, compares Llama 4
         | experimental to closed models like ChatGPT 4o latest, and Llama
         | performs better (https://lmarena.ai/?leaderboard).
        
       | JKCalhoun wrote:
       | Is LMArena junk now?
       | 
       | I thought there was an aspect where you run two models on the
       | same user-supplied query. Surely this can't be gamed?
       | 
       | > "optimized for conversationality"
       | 
       | I don't understand what that means - how it gives it an LMArena
       | advantage.
        
         | light_hue_1 wrote:
         | LMArena was always junk. I work in this space and while the
         | media takes it seriously most scientists don't.
         | 
         | Random people ask random stuff and then it measures how good
         | they feel. This is only a worthwhile evaluation if you're
          | Google or Meta or OpenAI and you need to make a chatbot that
         | keeps people coming back. It doesn't measure anything else
         | useful.
        
           | HPsquared wrote:
           | Conversation is a two-way street. A good conversation
           | mechanic could elicit better interaction from the users and
           | result in better answers. Stands to reason, anyway.
        
           | genewitch wrote:
           | I hear AI news from time to time from the M5M in the US - and
           | the only place I've ever seen "LMArena" is on HN and in the
           | LM studio discord. At a ratio of 5:1 at least.
        
             | brandall10 wrote:
             | It's mentioned quite a bit in the LLM related subreddits.
        
           | cma wrote:
            | Llama 1 derived models on it were beating gpt 3.5 by having
            | fewer refusals.
        
         | gwern wrote:
         | It can be easily gamed. The users are self-selected, and they
         | have zero incentive to be honest or rigorous or provide good
         | responses. Some have incentives the opposite way. (There was a
         | report of a prediction market user who said they had won a
         | market on Gemini models by manipulating the votes; LMArena
         | swore furiously there had definitely been no manipulation but
         | was conspicuously silent on any details.) And the release of
         | more LMArena responses has shown that a lot of the user ratings
         | are blatantly wrong: either they're basically fraudulent, or
         | LMArena's current users are people whose ratings you should be
         | optimizing against because they are so ignorant, lazy, and
         | superficial.
         | 
         | At this point, when I look at the output from my Gemini-2.5-pro
         | sessions, they are so high quality, and take so long to read,
         | and check, and have an informed opinion on, I just can't trust
         | the slapdash approach of LMArena in assuming that careless
         | driveby maybe-didn't-even-read-the-responses-ain't-no-one-got-
         | time-for-that-nerd-shit ratings mean much of anything. There
         | have been red flags in the past and I've been taking them ever
         | less seriously even as one of many benchmarks since early last
         | year, but maybe this is the biggest backfire yet. And it's only
         | going to get worse. At this rate, without major overhaul, you
         | should take being #1 on LMArena seriously as useful and
         | important news - as a reason to _not_ use a model.
         | 
         | It's past time for LMArena people to sit down and have some
         | thorough reflection on whether it is still worth running at
         | all, and at what point they are doing more harm than good. No
         | benchmark lives forever and it is normal and healthy to shut
         | them down at some point after having been saturated, but some
         | manage to live a lot longer than they should have...
        
         | jxjnskkzxxhx wrote:
          | In one of Karpathy's videos he said that he was a bit suspicious
          | that the models that score the highest in LMArena aren't the
         | ones that people use the most to solve actual day to day
         | problems.
        
         | jsnell wrote:
         | There are almost certainly ways to fine-tune the model in ways
         | that make it perform better on the Arena, but perform worse in
         | other benchmarks or in practice. Usually that's not a good
         | trade-off. What's being suggested here is that Meta is running
         | such a fine-tuned version on the Arena (and reporting those
         | numbers) while running models with different fine-tuning on
         | other benchmarks (and reporting those numbers), while giving
         | the appearance that those are actually the same models.
        
         | consumer451 wrote:
         | Tangent, but does anyone know why links to lmarena.ai are
         | banned on Reddit, site-wide?
         | 
         | (Last I checked 1 month ago)
        
           | swyx wrote:
            | evidence for this assertion please? it's highly unlikely and
           | maybe a bug
        
             | consumer451 wrote:
             | I should have re-tested prior to posting. It is fixed now.
             | I just tried posting a comment with the link and it was not
             | [removed by reddit].
             | 
             | It was a really strange situation that lasted for months.
             | Posts or comments with a direct link were removed like they
             | were the worst of the worst.
             | 
             | I tried posting to r/bugs and it was downvoted immediately.
              | I eventually contacted the lmarena folks, so maybe they
             | resolved it with reddit.
        
       | Mond_ wrote:
       | The Llama 4 launch looks like a real debacle for Meta. The model
       | doesn't look great. All the coverage I've seen has been negative.
       | 
       | This is about what I expected, but it makes you wonder what
       | they're going to do next. At this point it looks like they are
       | falling behind the other open models, and made an ambitious bet
       | on MoEs, without this paying off.
       | 
       | Did Zuck push for the release? I'm sure they knew it wasn't ready
       | yet.
        
         | bko wrote:
         | I remember reading that they were in panic mode when the
         | DeepSeek model came out so they must have scrambled and had to
         | re-work a lot of things since DeepSeek was so competitive and
         | open source as well
        
           | blueboo wrote:
           | Fear of R2 looms large as well. I suspect they succumbed to
           | the nuance collapse along the lines of "Is double checking
           | results worth it if DeepSeek eats our lunch?"
        
         | nightski wrote:
         | Do you know that they made a bet on MoE? Meaning they
          | abandoned dense models? I doubt that is the case. Just
         | releasing MoE Llama 4 does not constitute a "bet" without
         | further information.
         | 
         | Also from what I can tell this performs better than models with
         | parameter counts equal to one expert, and worse than fully
         | dense models equal to total parameter count. Isn't that kind of
          | what we'd expect? In what way is that a failure?
         | 
         | Maybe I am missing some details. But it feels like you have an
         | axe to grind.
        
           | genewitch wrote:
           | A 4x8 MOE performs better than an 8B but worse than a 32B, is
           | your statement?
           | 
           | My response would be, "so why bother with MOE?"
           | 
           | However deepseek r1 is MOE from my understanding, but the "E"
            | are all >=32B parameters. There's > 20 experts. I could be
           | misinformed; however, even so, I'd say a MOE with 32B or even
            | 70B experts will outperform (define this!) models with equal
           | parameter counts, because deepseek outperforms (define?)
           | ChatGPT et al.
        
             | nightski wrote:
             | Easy, vastly improved inference performance on machines
             | with larger RAM but lower bandwidth/compute. These are
             | becoming more popular such as Apple's M series chips, AMD's
             | strix halo series, and the upcoming DGX Spark from Nvidia.
        
               | genewitch wrote:
               | yes i understand all that. I was saying the claim is
               | incorrect. My understanding of deepseek is mechanically
               | correct but apparently they use 3B models as experts, per
               | your sibling comment. I don't buy it, regardless of what
               | they put in the paper - 3B models are pretty dumb, and R1
               | isn't dumb. No amount of shuffling between "dumb" experts
                | will make the output _not_ dumb. It's more likely 32x32B
               | experts, based on the quant sizes i've seen.
               | 
               | A deepseek employee is welcome to correct me.
        
             | zamadatix wrote:
             | DeepSeek V3/R1 are MoE with 256 experts per layer, actively
             | using 1 shared expert and 8 routed experts per layer https:
             | //arxiv.org/html/2412.19437v1#S2:~:text=with%20MoE%20l...
             | so you can't just take the active parameters and assume
             | that's close to the size of a single expert (ignoring
             | experts are per layer anyways and that there are still
             | dense parameters to count).
             | 
              | Despite the connotations of specialized intelligence that the
              | term "expert" provokes, it's really mostly about the
              | scalability/efficiency of running large models. By splitting
              | up sections of the layers and not activating all of them for
              | each pass, a single query takes less bandwidth, can be
              | distributed across compute, and can be parallelized with
              | other queries on the same nodes.
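              | 
              | A minimal sketch of that per-layer routing (numpy, with
              | softmax-over-top-k gating; the names are illustrative and the
              | exact gating differs from DeepSeek's published recipe):
              | 
              |     import numpy as np
              | 
              |     def moe_layer(x, shared, experts, router_w, k=8):
              |         # x: (d,) token state; router_w: (n_experts, d)
              |         scores = router_w @ x            # token-to-expert affinities
              |         top = np.argsort(scores)[-k:]    # indices of the k best experts
              |         gates = np.exp(scores[top])
              |         gates /= gates.sum()             # normalize over chosen experts
              |         out = shared(x)                  # shared expert sees every token
              |         for g, i in zip(gates, top):
              |             out = out + g * experts[i](x)  # only k experts do any work
              |         return out
              | 
              | Total parameters scale with the number of experts, but compute
              | and bandwidth per token scale with the k you actually
              | activate, which is the efficiency win described above.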
        
         | throwaway25251 wrote:
         | I don't know about Llama 4. Competition is intense in this
         | field so you can't expect everybody to be number 1. However, I
         | think the performance culture at Meta is counterproductive.
         | Incentives are misaligned, I hope leadership will try to
         | improve it.
         | 
         | Employees are encouraged to ship half-baked features and move
         | to another project. Quality isn't rewarded at all. The recent
         | layoffs have made things even worse. Skilled people were fired,
         | slowing down teams. I assume the goal was to push remaining
         | employees to work even more, but I doubt this is working.
         | 
         | I haven't worked in enough companies of this size to be able to
         | tell if alternatives are better, but it's very clear to me that
         | Meta doesn't get the best from their employees.
        
           | berkes wrote:
           | I've never liked it, but
           | 
           | > Move fast and break things
           | 
           | is really a bad concept in this space, where you get limited
           | shots at releasing something that generates interest.
           | 
           | > Employees are encouraged to ship half-baked features
           | 
           | And this is why I never liked that motto and have always
           | pushed back at startups where I was hired that embraced this
           | line of thought. Quality matters. It's context-dependent, so
           | sometimes it matters a lot, and sometimes hardly. But "moving
           | fast and breaking things" should be a deliberate choice, made
           | for every feature, module, sprint, story all over again, IMO.
           | If at all.
        
             | Rebuff5007 wrote:
              | I'd argue it's a bad concept in any space that involves
             | teams of people working together and deliverables that
             | enter the real world.
        
             | diggan wrote:
             | > is really a bad concept in this space, where you get
             | limited shots at releasing something that generates
             | interest.
             | 
              | Sure, but long-term effects depend more on the actual
              | performance of the model than anything else.
             | 
             | Say they launch a model that is hyped to be the best, but
             | when people try it, it's worse than other models. People
             | will quickly forget about it, unless it's particularly good
             | at something.
             | 
             | Alternatively, say they launch a model that doesn't even
             | get a press release, or any benchmark results published
             | ahead of launch, but the model actually rocks at a bunch of
             | use cases. People will start using it regardless of the
              | initial release, and continue to do so as long as it's one of
              | the best models.
        
             | lenerdenator wrote:
             | > is really a bad concept in this space, where you get
             | limited shots at releasing something that generates
             | interest.
             | 
             | It's a really bad concept in _any_ space.
             | 
             | We would be living in a better world if Zuck had, at least
             | once, thought "Maybe we shouldn't do that".
        
               | avs733 wrote:
               | >It's a really bad concept in any space.
               | 
               | I struggle with this because it feels like so many
               | 'rules' in the world where the important half remains
                | unsaid. That unsaid portion is then mediated by Goodhart's
                | law.
               | 
                | If the other half is 'then slow down and learn something',
                | it's really not that bad; nothing is sacred, we try, we
                | fail, we learn, we (critically) don't repeat the mistake.
                | That's human learning - we don't learn from mistakes, we
               | learn from reflecting on mistakes.
               | 
                | But if learning isn't part of the loop - if it's a self-
                | justifying defense for fuckups, if the unsaid remains
                | unsaid, it's a disaster waiting to happen.
               | 
               | The difference is usually in what you reward. If you
                | reward _ship_, you get the defensive version - and you
               | will ship crap. If you reward institutional knowledge
               | building you don't. Engineers are often taught that
                | 'good, fast, or cheap: pick 2'. The reality is it's usually
                | closer to 1 or 1.5. If you pick fast... you get fast.
        
           | _bin_ wrote:
           | It's also terrible output, even before you consider what
           | looks like catastrophic forgetting from crappy RL. The emoji
           | use and writing style make me want to suck-start a revolver.
           | I don't know how they expect anyone to actually use it.
        
           | spaceywilly wrote:
           | I agree. I think of it like a car engine. You can push it up
           | to a certain RPM and it will keep making more and more power.
           | Above that RPM, the engine starts to produce less power and
           | eventually blows a gasket.
           | 
           | I think the performance-based management worked for a while
           | because there were some gains to be had by pushing people
           | harder. However, they've gone past that and are now pushing
           | people too hard and getting worse results. Every machine has
           | its operating limits and an area where it operates most
           | efficiently. A company is no different.
        
             | hq123 wrote:
             | very nice analogy!
        
             | lenerdenator wrote:
             | The problem is, there's always some engine pushing the
             | power envelope, or a person pushing their performance
             | harder. And then the rest of them have to keep up.
        
           | a4isms wrote:
           | For those who haven't heard of it, "The Hawthorne Effect" is
            | the name given to a phenomenon in which, when a person or group
            | being studied is aware they are being studied, their
            | performance goes up by as much as 50% for 4-8 weeks, then
            | regresses to its norm.
           | 
           | This is true if they are just being observed, or if some
            | new processes are introduced. If the new things are
           | beneficial, the performance rises for 4-8 weeks as usual, but
           | when it regresses it regresses to a higher performance
           | reflecting the value of the new process.
           | 
           | But when poor management introduce a counter-productive
           | change, the Hawthorne Effect makes it look like a resounding
           | success for 4-8 weeks. Then the effect fades, and performance
           | drops below the original level. Sufficiently devious managers
           | either move on to new projects or blame the workers for
           | failing to maintain the new higher pace of performance.
           | 
           | This explains a lot of the incentive for certain types of
           | leaders to champion arbitrary changes, take a victory lap,
           | and then disassociate themselves from accountability for the
           | long-term success or failure of their initiative.
           | 
           | (There is quite a bit of controversy over what the mechanisms
           | for the Hawthorne Effect are, and whether change alone can
            | introduce it or whether participants need to feel they are
           | being observed, but the model as I see it fits my anecdotal
           | experience where new processes are always accompanied by
           | attempts to meet new performance goals, and everyone is
           | extremely aware that the outcome is being measured.)
        
             | pixl97 wrote:
             | I mean, it sounds like we should add in the McNamara
             | fallacy also.
        
           | hintymad wrote:
           | > Employees are encouraged to ship half-baked features and
           | move to another project
           | 
           | Maybe there is more to that. It's been more than a year since
           | Llama 3 was released. That should be enough time for Meta to
            | release something with significant improvement. Or do you mean
           | quarter by quarter the engineers had to show that they were
           | making impact in their perf review, which could be
           | detrimental to the Llama 4 project?
           | 
           | Another thing that puzzles me is that again and again we see
           | that the quality of a model can improve if we have more high-
            | quality data, yet can't Meta manage to secure a massive amount
           | of new high-quality data to boost their model performance?
        
             | magixx wrote:
             | > Or you mean quarter by quarter the engineers had to show
             | that they were making impact in their perf review
             | 
             | This is what I think they were referencing. Launching
             | things looks nice in review packets and few to none are
             | going to look into the quality of the output. Submitting
             | your own self review means that you can cherry pick
             | statistics and how you present them. That's why that
             | culture incentivizes launching half baked products and
             | moving on to something else because it's smart and
             | profitable (launch yet another half baked project) to
             | distance yourself from the half baked project you started.
        
               | hintymad wrote:
               | I like how Netflix set up its incentive systems years
               | ago. Essentially they told the employees that all they
               | needed to do is deliver what the company wanted. It was
               | perfectly okay that an employee did their job and didn't
               | move up or do more. Per their chief talent officer
               | McCord, "a manager's job is all about setting the
               | context" and the employees were let loose to deliver.
               | This method puts a really high bar on the managers, as
               | the entire report chain must know clearly what they want
               | delivered. Their expectation must be high enough to move
                | the company forward, but not so ridiculous as to turn
               | Netflix into a burnout factory.
        
               | magixx wrote:
               | Unfortunately I wasn't able to get an interview with
               | Netflix.
               | 
               | > employees that all they needed to do is deliver what
               | the company wanted
               | 
               | How did this work out in practice and across teams? My
               | experience at Meta within my team was that it would be
               | almost impossible to determine what the company actually
               | wanted from our team in a year. Goals kept changing and
               | the existing incentive system works against this since
               | other teams are trying to come up with their own
               | solutions to things which may impact your team.
               | 
               | > an employee did their job and didn't move up
               | 
               | Does Netflix cull employees if they haven't reached a
               | certain IC level? I know at Meta SWEs need to reach IC5
               | after a while or risk being culled.
        
           | philjohn wrote:
           | It's 100% PSC (their "Performance Culture")
           | 
           | You're not encouraged per se to ship half-baked features, but
           | if you don't have enough "impact" at the end of the half (for
           | mid cycle checkin) or year (for full PSC cycle) then you're
           | going to get "Below Expectations" and then "Meets Most" (or
           | worse) and with the current environment a swift offboarding.
           | 
           | When I was there (working in integrity) our group of staff+
           | engineers opined how it led to perverse incentives - and
           | whilst you can work there and do great work, and get good
           | ratings, I saw too many examples of "optimizing for PSC"
           | (otherwise known as PSC hacking).
        
         | qoez wrote:
         | > it makes you wonder what they're going to do next
         | 
         | They're just gonna keep throwing money at it. This is a hobby
          | and talent magnet for them; Instagram is the money printer.
          | They've been working on VR for like a decade with barely any
          | results in terms of users (compared to costs). This will be no
         | different.
        
           | wongarsu wrote:
            | Both are also decent long-term bets. Being VR market leader
           | now means they will be VR market-leader with plenty of
           | inhouse talent and IP when the technology matures and the
           | market grows. Being in the AI race, even if they are not
           | leading, means they have in-house talent and technology to be
           | able to react to wherever the market is going with AI. They
           | have one of the biggest messengers and one of the biggest
           | image-posting sites, there is a decent chance AI will become
           | important to them in some not-yet-obvious way.
           | 
           | One of Meta's biggest strengths is Zuckerberg being able to
           | play these kinds of bets. Those bets being great for PR and
           | talent acquisition is the cherry on top
        
             | TheOtherHobbes wrote:
             | This assumes no upstart will create a game changing
             | innovation which upends everything.
             | 
             | Companies become complacent and confused when they get too
             | big. Employees become trapped in a maze of performative
             | careerism, and customer focus and a realistic understanding
             | of threats from potential competitors both disappear.
             | 
             | It's been a consistent pattern since the earliest days of
             | computing.
             | 
             | Nothing coming out of Big Tech at the moment is encouraging
             | me to revise that assumption.
        
         | ntlm1686 wrote:
         | "made an ambitious bet on MoEs"? No, DeepSeek is MoE, and they
         | succeeded. Meta is not betting on MoE, it just does what other
         | people have done.
        
           | antirez wrote:
           | Llama4 seems in many ways a cut and paste of DeepSeek.
           | Including the shared expert and the high sparsity. It's a
           | DeepSeek that does not work well.
        
         | thefourthchime wrote:
         | I mean, there's a reason they released it on a Saturday.
        
         | root_axis wrote:
         | It's not a big deal. Llama 4 feels like a flop because the
         | expectations are really high based on their previous releases
         | and the sense of momentum in the ecosystem because of DeepSeek.
         | At the end of the day, LLama 4 didn't meet the elevated
         | expectations, but they're fine. They'll continue to improve and
         | iterate and maybe the next one will be more hype worthy, or
         | maybe expectations will be readjusted as the specter of
         | diminishing returns continues to creep in.
        
           | lerchmo wrote:
           | The switching costs are so low (zero) that anyone using these
           | models just jumps to the best performer. I also agree that
           | this is not a brand or narrative sensitive project.
        
           | int_19h wrote:
           | It feels like a flop because it is objectively worse than
            | models many times smaller that shipped some time ago. In
            | fact, it is worse than earlier LLaMA releases on many points.
           | It's so bad that people who initially ran into it assumed
           | that the downloaded weights must be corrupted somehow.
        
         | agilob wrote:
          | They knew 2 months ago that they couldn't beat DeepSeek
         | https://old.reddit.com/r/LocalLLaMA/comments/1i88g4y/meta_pa...
        
       | seydor wrote:
       | tech companies competing over something that is losing them money
       | is the most bizarre spectacle yet.
        
         | abc-1 wrote:
         | Is it really losing them money if investors throw fistfuls of
         | cash at them for it
        
         | FL33TW00D wrote:
         | Plot big tech stock valuations with markers for successful OS
         | model releases.
        
         | fullshark wrote:
         | Come on, you can do the critical thinking here to understand
         | why these companies would want the best in class (open/closed)
         | weight LLMs.
        
           | seydor wrote:
           | then why would they cheat?
        
             | fullshark wrote:
              | Well, we'll see if they suffer consequences of this and
              | whether they cheated too hard, but being perceived as best in class is
             | arguably worth even more than being the best in class,
             | especially if differences in performance are hard to
             | perceive anecdotally.
             | 
             | The goal is long term control over a technology's
             | marketshare, as winner take all dynamics are in play here.
        
             | baby wrote:
             | they're all cheating, see grok
        
               | nomel wrote:
               | Are you referring to this [1]?
               | 
               | > Critics have pointed out that xAI's approach involves
               | running Grok 3 multiple times and cherry-picking the best
               | output while comparing it against single runs of
               | competitor models.
               | 
               | [1] https://medium.com/@cognidownunder/the-hype-machine-
               | gpt-4-5-...
        
             | SubiculumCode wrote:
             | I didn't see evidence of cheating in the article. Having a
             | slightly differently tuned version of 4 is not the most
             | dastardly thing that can be done. Everything else is
             | insinuation.
        
         | bko wrote:
         | I think Meta sees AI and VR/AR as a platform. They got left
         | behind on the mobile platform and forever have to contend with
          | Apple's semi-monopoly. They have no control and little influence
         | over the ecosystem. It's an existential threat to them.
         | 
         | They have vowed not to make that mistake again so are pushing
         | for an open future that won't be dominated by a few companies
         | that could arbitrarily hurt Meta's business.
         | 
         | That's the stated rationale at least and I think it more or
         | less makes sense
        
           | fullshark wrote:
           | Makes sense except for the fact that they leaked the llama
           | weights by accident and needed to reverse engineer that
           | explanation.
        
           | jsheard wrote:
           | I wouldn't call what Meta is doing with VR/AR an "open
           | future", it's pretty much the exact same playbook that Google
           | and Apple used for their platforms. The only difference is
           | Meta gets to be the landlord this time.
        
           | alex1138 wrote:
           | I'm in favor of whatever semi-monopoly enables fine grained
           | permissions so Facebook can't en masse slurp Whatsapp
           | (antitrust?) contacts
        
           | esafak wrote:
           | They stated that?
        
         | asveikau wrote:
         | Feels very late 90s.
         | 
         | The old joke is they're losing money on every sale but they'll
         | make up for it in volume.
        
           | dfedbeef wrote:
           | _chef kiss_ perfect
        
         | diggan wrote:
         | Borderline conspiracy theory with an ounce of truth:
         | 
          | None of the models Meta put out are actually open source (by
          | any measure), and everyone who is redistributing Llama models
          | or any derivatives, or using Llama models for their business, is
          | on the hook for getting sued in the future based on the terms
          | and conditions people have been explicitly/implicitly agreeing
          | to when they use/redistribute these models.
         | 
          | If you start depending on these Llama models, which have
          | unfavorable proprietary terms today, the fact that Meta doesn't
          | act on those terms now doesn't mean they won't in the future.
          | Maybe this has all been a play to get people into this position,
          | so Meta can start charging for them later or something else.
        
           | recursive wrote:
           | You never go full Oracle.
        
         | timewizard wrote:
         | This is a signal that the sector is heavily monopolized.
         | 
         | This has happened many times before in US history.
        
         | tananaev wrote:
         | The reason is simple. All tech companies have very high
         | valuations. They have to sell investors a dream to justify that
         | valuation. They have to convince people that they have the next
         | big thing around the corner.
        
         | roughly wrote:
         | https://en.wikipedia.org/wiki/Dollar_auction
        
       | gessha wrote:
        | Next on Matt Levine's newsletter: Is Meta fudging
        | stock-valuation-correlated metrics? Is this securities fraud?
        
         | myrmidon wrote:
         | If that's what it takes to get some honesty out of corporate,
         | is it such a bad idea? Why?
        
         | benhill70 wrote:
         | Sarcasm doesn't translate well in text. Please, elaborate.
        
           | camjw wrote:
           | Matt Levine has a common refrain which is that basically
           | everything is securities fraud. If you were an investor who
           | invested on the premise that Meta was good at AI and Zuck
           | knowingly put out a bad model, is that securities fraud? Matt
           | Levine will probably argue that it could be in a future
           | edition of Money Stuff (his very good newsletter).
        
             | NickC25 wrote:
             | is it securities fraud? sort of.
             | 
             | If Mark, both through Meta and through his own resources,
             | has the capital to hire and retain the best AI researchers
             | / teams, and claims he's doing so, but puts out a model
             | that sucks, he's liable. It's probably not directly fraud,
             | but if he claims he's trying to compete with Google or
             | Microsoft or Apple or whoever, yet doesn't adequately
             | deploy a comparable amount of resources, capital, people,
             | whatever, and doesn't explain why, it _could_ (stretch) be
             | securities fraud....I think.
        
               | genewitch wrote:
               | And the fine for that? Probably 0.001% of revenue. If
               | that.
        
             | jjani wrote:
             | The "everything is securities fraud" meme is really
             | unfortunate, not quite as bad as the "fiduciary duty means
             | execs have to chase short-term profit" myth, but still
             | harmful.
             | 
             | It's only because lying ("puffery") about everything has
             | become the norm in corporate America that indeed, almost
             | all listed companies commit securities fraud. If they'd go
             | back to being honest businessmen, no more securities fraud.
             | Just stop claiming things that aren't true. This is a very
             | real option they could take. If they don't, then they're
              | willingly and knowingly committing securities fraud. But the
             | meme makes it sound to people as if it's unavoidable, when
             | it's anything but.
        
         | grvdrm wrote:
         | Nailed it!
        
       | goldchainposse wrote:
       | I wonder how much the current work environment contributed to
       | this. There's immense pressure to deliver, so it's not surprising
       | to see this.
        
       | ekojs wrote:
       | I think it's most illustrative to see the sample battles (H2H)
        | that LMArena released [1]. The outputs of Meta's model are too
        | verbose and too 'yappy' IMO. And looking at the verdicts, it's no
        | wonder people are discounting LMArena rankings.
       | 
       | [1]: https://huggingface.co/spaces/lmarena-
       | ai/Llama-4-Maverick-03...
        
         | smeeth wrote:
         | In fairness, 4o was like this until very recently. I suspect it
         | comes from training on COT data from larger models.
        
         | ed wrote:
         | Yep, it's clear that many wins are due to Llama 4's lowered
          | refusal rate, which is an effective form of Elo hacking.
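          | 
          | For context on why that moves the leaderboard: arena votes feed
          | an Elo-style rating update, so any blanket bump in win rate
          | (e.g. answering prompts a rival refuses) compounds across
          | thousands of battles. A minimal sketch of the usual formula
          | (LMArena's actual aggregation differs in the details):
          | 
          |     def elo_update(r_a, r_b, a_won, k=32):
          |         # rating moves by k * (actual result - expected result)
          |         expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
          |         delta = k * ((1.0 if a_won else 0.0) - expected_a)
          |         return r_a + delta, r_b - delta
          | 
          | Even a few extra percentage points of wins from never refusing
          | is enough to reorder a leaderboard where the top models sit
          | within tens of Elo points of each other.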
        
       | goldchainposse wrote:
       | In other news, the head of AI research just left
       | 
       | https://www.cnbc.com/2025/04/01/metas-head-of-ai-research-an...
        
         | kridsdale3 wrote:
         | I would have thought that title would belong to Yann.
        
           | yodsanklai wrote:
           | TBH I'm very surprised Yann Le Cun is still there. He looks
           | to me like a free thinker and an independent person. I don't
           | think he buys into the Trump agenda and US nationalistic
           | anti-Europe speech like Zuck does. He may be giving Zuck the
           | benefit of the doubt, and probably is grateful that Zuck gave
           | him a chance when nobody else did.
        
             | lenerdenator wrote:
             | > TBH I'm very surprised Yann Le Cun is still there. He
             | looks to me like a free thinker and an independent person.
             | I don't think he buys into the Trump agenda and US
             | nationalistic anti-Europe speech like Zuck does. He may be
             | giving Zuck the benefit of the doubt, and probably is
             | grateful that Zuck gave him a chance when nobody else did.
             | 
             | Zuck doesn't buy it, either. He just knows what's good for
             | business right now.
             | 
             | In an example of the worst person you know making a great
             | point, Josh Hawley said "What really struck me is that they
             | can read an election return." [0].
             | 
             | Though it's worth remembering, it's _very_ difficult to
             | accumulate the volume of data necessary to do the current
             | kind of AI training while sticking to the strictest
             | interpretations of EU privacy law. Social media companies
              | aren't just feeding the user data into marketing
             | algorithms, they're feeding them into AI models. If you're
             | a leading researcher in that field - Like Le Cun - and the
             | current state-of-the-art means getting as much data as
             | possible, you might not appreciate the regulatory
             | environment of the EU.
             | 
             | [0] https://www.npr.org/2025/02/27/nx-s1-5302712/senator-
             | josh-ha...
        
           | brandall10 wrote:
           | It's a misnomer - the VP left. Yann is the Chief Scientist,
           | which I imagine most would agree would be the 'head' of a
           | research division.
        
         | nailer wrote:
         | A lower level employee also resigned specifically about this:
         | 
         | https://x.com/arjunaaqa/status/1909174905549042085?s=46
        
       | ignoramous wrote:
        | Ahmad al-Dahle, who leads "Gen AI" at Meta, wrote this on
        | Twitter:
        | 
        |   ... We're also hearing some reports of mixed quality across
        |   different services ...
        | 
        |   We've also heard claims that we trained on test sets -- that's
        |   simply not true and we would never do that. Our best
        |   understanding is that the variable quality people are seeing is
        |   due to needing to stabilize implementations.
        | 
        |   We believe the Llama 4 models are a significant advancement and
        |   we're looking forward to working with the community to unlock
        |   their value.
       | 
       | https://x.com/Ahmad_Al_Dahle/status/1909302532306092107 /
       | https://archive.vn/JzONp
        
         | SubiculumCode wrote:
          | There seems to be a lot of hullabaloo, accusations, and rumors,
          | but little meat to any of them. Maybe they rushed the release,
          | were unsure of which one to go with, and did some moderate rule
          | bending in terms of which tune got sent to the arena, but I have
          | seen no hard evidence of real underhandedness.
        
       | labrador wrote:
       | Meta does themselves a disservice by having such a crappy public
       | facing AI for people to try (meta.ai). I regularly use the web
        | versions of GPT 4o, DeepSeek, Grok, and Google Gemini 2.5.
       | 
       | Meta is always the worst so I don't even bother anymore.
        
       | alittletooraph2 wrote:
       | I tried to make Studio Ghibli inspired images using presumably
       | their new models. It was ass.
        
         | echelon wrote:
          | GPT 4o image generation is the future of all image gen.
         | 
         | Every other player: Black Forest Labs' Flux, Stability.ai's
         | Stable Diffusion, and even closed models like Ideogram and
         | Midjourney, are all on the path to extinction.
         | 
         | Image generation and editing _must_ be multimodal. Full stop.
         | 
         | Google Imagen will probably be the first model to match the
         | capabilities of 4o. I'm hoping one of the open weights labs or
         | Chinese AI giants will release a model that demonstrates
         | similar capabilities soon. That'll keep the race neck and neck.
        
           | minimaxir wrote:
           | One very important distinction between image models is the
            | implementation: 4o is autoregressive, slow, and _extremely_
           | expensive.
           | 
           | Although the Ghibli trend is market validation, I suspect
           | that competitors may not want to copy it just yet.
        
             | echelon wrote:
              | > 4o is autoregressive, slow, and extremely expensive.
             | 
             | If you factor in the amount of time wasted with prompting
             | and inpainting, it's extremely well worth it.
        
             | JamesBarney wrote:
              | Extremely expensive in what sense? In that it costs $.03
             | instead of $.00003c? Yeah it's relatively far more
             | expensive than other solutions, but from an absolute
             | standpoint still very cheap for the vast majority of use
             | cases. And it's a LOT better.
        
               | svachalek wrote:
               | Dall-E is already 4-8 cents per image. Afaik this is not
               | in the API yet but I wouldn't be surprised if it's $1 or
               | more.
        
         | simonw wrote:
         | Llama is not an image generating model. Any interface that uses
         | Llama and generates images is calling out to a separate image
         | generator as a tool, like OpenAI used to do with ChatGPT and
         | DALL-E up until a couple of weeks ago:
         | https://simonwillison.net/2023/Oct/26/add-a-walrus/
        
       | arcastroe wrote:
       | https://archive.is/Ec6V1
        
       | codingwagie wrote:
        | The truth is that the vast majority of FAANG engineers making
        | high six figures are only good at deterministic work. They can't
        | produce new things, and so Meta and Google are struggling to
        | compete when actual merit matters, and they can't just brute
        | force the solutions. Inside these companies, the massive tech
        | systems they've built are actually generally terrible, but they
        | pile on legions of engineers to fix the problems.
       | 
        | This is the culture of Meta hurting them: they are paying "AI
        | VPs" millions of dollars to go to status meetings to get dates
        | for when these models will be done. Meanwhile, the DeepSeek team
        | behind R1 has a flat hierarchy with engineers that actually
        | understand low-level computing.
       | 
        | It's making a mockery of big tech, and is why startups exist.
        | Big company employees rise through the ranks by building skill
        | sets other than producing true economic value.
        
         | jjani wrote:
          | > They can't produce new things, and so Meta and Google are
          | struggling to compete when actual merit matters, and they can't
          | just brute force the solutions.
         | 
         | You haven't been keeping up. Less than 2 weeks ago, Google
         | released a model that has crushed the competition, clearly
         | being SotA while currently effectively free for personal use.
         | 
         | Gemini 2.0 was already good, people just weren't paying
         | attention. In fact 1.5 pro was already good, and ironically
         | remains the #1 model at certain very specific tasks, despite
         | being set for deprecation in September.
         | 
         | Google just suffered from their completely botched initial
         | launch way back when (remember Bard?), rushed before the
          | product was anywhere near ready, making them look like a bunch
         | of clowns compared to e.g. OpenAI. That left a lasting
         | impression on those who don't devote significant time to
         | keeping up with newer releases.
        
           | codingwagie wrote:
            | gemini 2.5 pro isn't good, and if you think it is, you aren't
           | using LLMs correctly. The model gets crushed by o1 pro and
           | sonnet 3.7 thinking. Build a large contextual prompt ( > 50k
           | tokens) with a ton of code, and see how bad it is. I
           | cancelled my gemini subscription
        
             | lerchmo wrote:
             | https://aider.chat/docs/leaderboards/ your experience
             | doesn't align with my experience or this benchmark. o1 pro
             | is good but I would rather do 20 cycles on gemini 2.5
              | than wait for Pro to return.
        
             | jjani wrote:
             | I have, dozens of times, and it's generally better than
             | 3.7. Especially with more context it's less forgetful.
             | o1-pro is absurdly expensive and slow, good luck using that
             | with tools. Virtually all benchmarks, including less gamed
             | ones such as Aider's, show the same. WebLM still has 3.7
             | ahead, with Sonnet always having been particularly strong
             | at web development, but even on there 2.5 Pro is miles in
             | front of any OpenAI model.
             | 
             | Gemini subscription? Surely if you're "using LLMs
             | correctly" you'd have been using the APIs for everything
             | anyway. Subscriptions are generally for non-techy
             | consumers.
             | 
             | In any case, just straight up saying "it isn't good" is
             | absurd, even if you personally prefer others.
        
             | int_19h wrote:
             | I have just watched Sonnet 3.7 vs Gemini 2.5 solving the
             | same task (fix a bug end-to-end) side by side, and Sonnet
             | hallucinated far worse and repeatedly got stuck in dead-
             | ends requiring manual rescue. OTOH Gemini understood the
             | problem based on bug description and code from the get go,
             | and required minimal guidance to come up with a decent
             | solution and implement it.
        
         | SubiculumCode wrote:
         | A whole lot of opinion there, not a whole lot of evidence.
        
           | codingwagie wrote:
           | Evidence is a decade inside these companies, watching the
           | circus
        
             | danjl wrote:
             | "I'm not bitter! No chip on my shoulder."
        
               | codingwagie wrote:
               | bitter about what? I'm a long time employee
        
               | brcmthrowaway wrote:
               | What company?
        
         | kylebyte wrote:
         | The problem is less that those high level engineers are only
         | good at deterministic work and more that they're only rewarded
         | for deterministic work.
         | 
         | There is no system to pitch an idea as opening new frontiers -
         | all ideas must be able to optimize some number that leadership
         | has already been tricked into believing is important.
        
       | dvcky_db wrote:
        | This should surprise no one. Also, Goodhart's law strikes again.
        
       | HunterX1 wrote:
       | Impressive results from Meta's Llama adapting to various
       | benchmarks. However, gaming performance seems lackluster compared
       | to specialized models like Alpaca. It raises questions about the
       | viability of large language models for complex, interactive tasks
       | like gaming without more targeted fine-tuning. Exciting progress
       | nonetheless!
        
       | LetsGetTechnicl wrote:
       | I'm just _shocked_ that the companies who stole all kinds of
       | copyrighted material would again do something unethical to keep
       | the bubble and gravy train going...
        
         | antonkar wrote:
         | Yes, their worst fear is people figuring out that an AI chatbot
         | is a strict librarian that spits out quotes but doesn't let you
         | enter the library (the AI model itself). Because with 3D game-
         | like UIs people can enter the library and see all their stolen
          | personal photos (if they were ever online), all kinds of
         | monsters. It'll be all over YouTube.
         | 
         | Imagine this but you remove the noise and can walk like in an
         | art gallery (it's a diffusion model but LLMs can be loosely
         | converted into 3D maps with objects, too):
         | https://writings.stephenwolfram.com/2023/07/generative-ai-sp...
        
       | moralestapia wrote:
       | LeCun making up results ... well he comes from Academia, so ...
        
       | kittikitti wrote:
       | For me at least, the 10M context window is a big deal and as long
       | as it's decent, I'm going to use it instead. I'm running Scout
       | locally and my chat history can get very long. I'm very
       | frustrated when the context window runs out. I haven't been able
       | to fully test the context length but at least that one isn't
       | fudged.
        
       ___________________________________________________________________
       (page generated 2025-04-08 23:01 UTC)