[HN Gopher] Meta got caught gaming AI benchmarks
___________________________________________________________________
Meta got caught gaming AI benchmarks
Author : pseudolus
Score : 267 points
Date : 2025-04-08 11:29 UTC (11 hours ago)
(HTM) web link (www.theverge.com)
(TXT) w3m dump (www.theverge.com)
| bn-l wrote:
| I believe this was designed to flatter the prompter more / be
| more ingratiating. Which is a worry if true (what it says about
| the people doing the comparing).
| add-sub-mul-div wrote:
| There's no end to the possible vectors of human manipulation
| with this "open-weight" black box.
| deckar01 wrote:
| The top of that leaderboard is filled with closed weight
| experimental models.
| etamponi wrote:
| Meta got caught _first_.
| davidcbc wrote:
| Not even first, OpenAI got caught a while back
| Mond_ wrote:
| Do you have a source for this? That's interesting (if true).
| davidcbc wrote:
| They got the dataset from Epoch AI for one of the
| benchmarks and pinky swore that they wouldn't train on it
|
| https://techcrunch.com/2025/01/19/ai-benchmarking-
| organizati...
| tananaev wrote:
| I don't see anything in the article about being caught.
| Maybe I missed something?
| tedsanders wrote:
| davidcbc is spreading fake rumors.
|
| OpenAI was never caught cheating on it, because we didn't
| cheat on it.
|
| As with any eval, you have to take our word for it, but
| I'm not sure what more we can do. Personally, if I
| learned that an OpenAI researcher purposely or
| accidentally trained on it, and we didn't quickly
| disclose this, I'd quit on the spot and disclose it
| myself.
|
| (I work at OpenAI.)
|
| Generally, I don't think anyone here is cheating and I
| think we're relatively diligent with our evals. The gray
| zone where things could go wrong is differing levels of
| care used in scrubbing training data of equivalent or
| similar problems. At some point the line between learning
| and memorizing becomes blurry. If an MMLU question asks
| about an abstract algebra proof, is it cheating to have
| trained on papers about abstract algebra?
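|
| To make the "scrubbing" concrete: a first-pass decontamination
| check can be as simple as flagging any training document that
| shares a long word n-gram with an eval question. A toy sketch
| of that generic technique (illustrative names and thresholds,
| not a description of any lab's actual pipeline):
|
|     from typing import Iterable, Set
|
|     def ngrams(text: str, n: int = 13) -> Set[tuple]:
|         # Sliding window of n-word shingles, lowercased.
|         words = text.lower().split()
|         return {tuple(words[i:i + n])
|                 for i in range(len(words) - n + 1)}
|
|     def contaminated(doc: str, evals: Iterable[str],
|                      n: int = 13) -> bool:
|         # Flag a doc sharing an n-gram with any eval item.
|         grams = ngrams(doc, n)
|         return any(grams & ngrams(q, n) for q in evals)
|
|     train_docs = ["..."]   # corpus shards (placeholder)
|     eval_items = ["..."]   # benchmark questions (placeholder)
|     clean = [d for d in train_docs
|              if not contaminated(d, eval_items)]
|
| Exact-match checks like this still miss paraphrases and merely
| similar problems, which is exactly the gray zone above.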
| suddenlybananas wrote:
| >They got the dataset from Epoch AI for one of the
| benchmarks and pinky swore that they wouldn't train on it
|
| Is anything here actually false or do you not like the
| conclusions that people may draw from it?
| tucnak wrote:
| Why are you being disingenuous? Simply having access to
| the eval in question is already enough for your
| synthetics guys to match the distribution, and of course
| you don't contaminate directly on train, that would be
| stupid, and you would get caught, but if it does inform
| the reward, the result is the same. You _should_ quit,
| but you wouldn't because you'd already convinced yourself
| you're doing RL God's work, not sleight of hand.
|
| > If an MMLU question asks about an abstract algebra
| proof, is it cheating to have trained on papers about
| abstract algebra?
|
| This kind of disingenuous bullshit is exactly why people
| call you cheaters.
|
| > Generally, I don't think anyone here is cheating and I
| think we're relatively diligent with our evals.
|
| You guys should follow Apple's cult guidelines: never
| stand out. Think different
| wongarsu wrote:
| It happens with basically all papers on all topics.
| Benchmarks are useful when they are first introduced and
| used to measure things that were released before the
| benchmark. After that their usefulness rapidly declines.
| hooloovoo_zoo wrote:
| People have been gaming ML benchmarks as long as there have
| been ML benchmarks. That's why it's better to see if other
| researchers are incorporating a technique into their actual
| models rather than 'is this paper the bold entry in a
| benchmark table'. But it takes longer.
| jandrese wrote:
| When a measure becomes a target it is no longer a good
| measure.
|
| These ML benchmarks were never going to last very long.
| There is far too much pressure to game them, even
| unintentionally.
| mkolodny wrote:
| "Got caught" is a misleading way to present what happened.
|
| According to the article, Meta publicly stated, right below the
| benchmark comparison, that the version of Llama on LMArena was
| the experimental chat version:
|
| > According to Meta's own materials, it deployed an
| "experimental chat version" of Maverick to LMArena that was
| specifically "optimized for conversationality"
|
| The AI benchmark in question, LMArena, compares Llama 4
| experimental to closed models like ChatGPT 4o latest, and Llama
| performs better (https://lmarena.ai/?leaderboard).
| JKCalhoun wrote:
| Is LMArena junk now?
|
| I thought there was an aspect where you run two models on the
| same user-supplied query. Surely this can't be gamed?
|
| > "optimized for conversationality"
|
| I don't understand what that means - how it gives it an LMArena
| advantage.
| light_hue_1 wrote:
| LMArena was always junk. I work in this space and while the
| media takes it seriously most scientists don't.
|
| Random people ask random stuff and then it measures how good
| they feel. This is only a worthwhile evaluation if you're
| Google or Meta or OpenAI and you need to make a chatbot that
| keeps people coming back. It doesn't measure anything else
| useful.
| HPsquared wrote:
| Conversation is a two-way street. A good conversation
| mechanic could elicit better interaction from the users and
| result in better answers. Stands to reason, anyway.
| genewitch wrote:
| I hear AI news from time to time from the M5M in the US - and
| the only place I've ever seen "LMArena" is on HN and in the
| LM studio discord. At a ratio of 5:1 at least.
| brandall10 wrote:
| It's mentioned quite a bit in the LLM related subreddits.
| cma wrote:
| Llama 1-derived models on it were beating GPT-3.5 by having
| fewer refusals.
| gwern wrote:
| It can be easily gamed. The users are self-selected, and they
| have zero incentive to be honest or rigorous or provide good
| responses. Some have incentives the opposite way. (There was a
| report of a prediction market user who said they had won a
| market on Gemini models by manipulating the votes; LMArena
| swore furiously there had definitely been no manipulation but
| was conspicuously silent on any details.) And the release of
| more LMArena responses has shown that a lot of the user ratings
| are blatantly wrong: either they're basically fraudulent, or
| LMArena's current users are people whose ratings you should be
| optimizing against because they are so ignorant, lazy, and
| superficial.
|
| At this point, when I look at the output from my Gemini-2.5-pro
| sessions, they are so high quality, and take so long to read,
| and check, and have an informed opinion on, I just can't trust
| the slapdash approach of LMArena in assuming that careless
| driveby maybe-didn't-even-read-the-responses-ain't-no-one-got-
| time-for-that-nerd-shit ratings mean much of anything. There
| have been red flags in the past and I've been taking them ever
| less seriously even as one of many benchmarks since early last
| year, but maybe this is the biggest backfire yet. And it's only
| going to get worse. At this rate, without major overhaul, you
| should take being #1 on LMArena seriously as useful and
| important news - as a reason to _not_ use a model.
|
| It's past time for LMArena people to sit down and have some
| thorough reflection on whether it is still worth running at
| all, and at what point they are doing more harm than good. No
| benchmark lives forever and it is normal and healthy to shut
| them down at some point after having been saturated, but some
| manage to live a lot longer than they should have...
| jxjnskkzxxhx wrote:
| In one of Karpathy's videos he said that he was a bit suspicious
| that the models that score the highest in LMarena aren't the
| ones that people use the most to solve actual day to day
| problems.
| jsnell wrote:
| There are almost certainly ways to fine-tune the model in ways
| that make it perform better on the Arena, but perform worse in
| other benchmarks or in practice. Usually that's not a good
| trade-off. What's being suggested here is that Meta is running
| such a fine-tuned version on the Arena (and reporting those
| numbers) while running models with different fine-tuning on
| other benchmarks (and reporting those numbers), while giving
| the appearance that those are actually the same models.
| consumer451 wrote:
| Tangent, but does anyone know why links to lmarena.ai are
| banned on Reddit, site-wide?
|
| (Last I checked 1 month ago)
| swyx wrote:
| Evidence for this assertion, please? It's highly unlikely and
| maybe a bug.
| consumer451 wrote:
| I should have re-tested prior to posting. It is fixed now.
| I just tried posting a comment with the link and it was not
| [removed by reddit].
|
| It was a really strange situation that lasted for months.
| Posts or comments with a direct link were removed like they
| were the worst of the worst.
|
| I tried posting to r/bugs and it was downvoted immediately.
| I eventually contacted the lmarena folks, so maybe they
| resolved it with reddit.
| Mond_ wrote:
| The Llama 4 launch looks like a real debacle for Meta. The model
| doesn't look great. All the coverage I've seen has been negative.
|
| This is about what I expected, but it makes you wonder what
| they're going to do next. At this point it looks like they are
| falling behind the other open models, having made an ambitious
| bet on MoEs without it paying off.
|
| Did Zuck push for the release? I'm sure they knew it wasn't ready
| yet.
| bko wrote:
| I remember reading that they were in panic mode when the
| DeepSeek model came out, so they must have scrambled and had
| to re-work a lot of things, since DeepSeek was so competitive
| and open source as well.
| blueboo wrote:
| Fear of R2 looms large as well. I suspect they succumbed to
| the nuance collapse along the lines of "Is double checking
| results worth it if DeepSeek eats our lunch?"
| nightski wrote:
| Do you know that they made a bet on MoE? Meaning they
| abandoned dense models? I doubt that is the case. Just
| releasing MoE Llama 4 does not constitute a "bet" without
| further information.
|
| Also from what I can tell this performs better than models with
| parameter counts equal to one expert, and worse than fully
| dense models equal to total parameter count. Isn't that kind of
| what we'd expect? In what way is that a failure?
|
| Maybe I am missing some details. But it feels like you have an
| axe to grind.
| genewitch wrote:
| A 4x8 MOE performs better than an 8B but worse than a 32B, is
| your statement?
|
| My response would be, "so why bother with MOE?"
|
| However, DeepSeek R1 is MoE from my understanding, but the "E"
| are all >=32B parameters. There are > 20 experts. I could be
| misinformed; however, even so, I'd say an MoE with 32B or even
| 70B experts will outperform (define this!) models with equal
| parameter counts, because DeepSeek outperforms (define?)
| ChatGPT et al.
| nightski wrote:
| Easy, vastly improved inference performance on machines
| with larger RAM but lower bandwidth/compute. These are
| becoming more popular, such as Apple's M-series chips, AMD's
| Strix Halo series, and the upcoming DGX Spark from Nvidia.
| genewitch wrote:
| yes i understand all that. I was saying the claim is
| incorrect. My understanding of deepseek is mechanically
| correct but apparently they use 3B models as experts, per
| your sibling comment. I don't buy it, regardless of what
| they put in the paper - 3B models are pretty dumb, and R1
| isn't dumb. No amount of shuffling between "dumb" experts
| will make the output _not_ dumb. it 's more likely 32x32B
| experts, based on the quant sizes i've seen.
|
| A DeepSeek employee is welcome to correct me.
| zamadatix wrote:
| DeepSeek V3/R1 are MoE with 256 experts per layer, actively
| using 1 shared expert and 8 routed experts per layer
| (https://arxiv.org/html/2412.19437v1#S2:~:text=with%20MoE%20l...),
| so you can't just take the active parameters and assume that's
| close to the size of a single expert (ignoring that experts
| are per layer anyway and that there are still dense parameters
| to count).
|
| Despite the connotations of specialized intelligence that the
| term "expert" evokes, it's really mostly about the
| scalability/efficiency of running large models. By splitting
| up sections of the layers and not activating all of them for
| each pass, a single query takes less bandwidth, can be
| distributed across compute, and can be parallelized with other
| queries on the same nodes.
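|
| Mechanically, the routing idea looks roughly like the toy
| sketch below (illustrative PyTorch, not DeepSeek's actual
| implementation: real MoE layers use FFN experts, batched
| dispatch kernels, and load-balancing losses, and the sizes
| here are scaled way down):
|
|     import torch
|     import torch.nn as nn
|
|     class ToyMoE(nn.Module):
|         # 1 always-on shared expert plus top-k routed experts
|         # per token (DeepSeek V3: 256 routed experts, k=8).
|         def __init__(self, d=64, n_experts=8, k=2):
|             super().__init__()
|             self.shared = nn.Linear(d, d)
|             self.experts = nn.ModuleList(
|                 nn.Linear(d, d) for _ in range(n_experts))
|             self.router = nn.Linear(d, n_experts)
|             self.k = k
|
|         def forward(self, x):          # x: [tokens, d]
|             probs = self.router(x).softmax(dim=-1)
|             w, idx = probs.topk(self.k, dim=-1)
|             out = []
|             for t in range(x.size(0)): # naive loop for clarity
|                 tok = self.shared(x[t])
|                 for j in range(self.k):
|                     expert = self.experts[int(idx[t, j])]
|                     tok = tok + w[t, j] * expert(x[t])
|                 out.append(tok)
|             return torch.stack(out)
|
|     y = ToyMoE()(torch.randn(4, 64))   # 4 tokens in, 4 out
|
| Total parameters scale with the number of experts, but each
| token only pays for the shared expert plus k routed ones,
| which is where the bandwidth/compute savings come from.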
| throwaway25251 wrote:
| I don't know about Llama 4. Competition is intense in this
| field so you can't expect everybody to be number 1. However, I
| think the performance culture at Meta is counterproductive.
| Incentives are misaligned; I hope leadership will try to
| improve it.
|
| Employees are encouraged to ship half-baked features and move
| to another project. Quality isn't rewarded at all. The recent
| layoffs have made things even worse. Skilled people were fired,
| slowing down teams. I assume the goal was to push remaining
| employees to work even more, but I doubt this is working.
|
| I haven't worked in enough companies of this size to be able to
| tell if alternatives are better, but it's very clear to me that
| Meta doesn't get the best from their employees.
| berkes wrote:
| I've never liked it, but
|
| > Move fast and break things
|
| is really a bad concept in this space, where you get limited
| shots at releasing something that generates interest.
|
| > Employees are encouraged to ship half-baked features
|
| And this is why I never liked that motto and have always
| pushed back at startups where I was hired that embraced this
| line of thought. Quality matters. It's context-dependent, so
| sometimes it matters a lot, and sometimes hardly. But "moving
| fast and breaking things" should be a deliberate choice, made
| for every feature, module, sprint, story all over again, IMO.
| If at all.
| Rebuff5007 wrote:
| I'd argue it's a bad concept in any space that involves
| teams of people working together and deliverables that
| enter the real world.
| diggan wrote:
| > is really a bad concept in this space, where you get
| limited shots at releasing something that generates
| interest.
|
| Sure, but long-term effects depend more on the actual
| performance of the model than on anything else.
|
| Say they launch a model that is hyped to be the best, but
| when people try it, it's worse than other models. People
| will quickly forget about it, unless it's particularly good
| at something.
|
| Alternatively, say they launch a model that doesn't even
| get a press release, or any benchmark results published
| ahead of launch, but the model actually rocks at a bunch of
| use cases. People will start using it regardless of the
| initial release, and continue to do so as long as it remains
| among the best models.
| lenerdenator wrote:
| > is really a bad concept in this space, where you get
| limited shots at releasing something that generates
| interest.
|
| It's a really bad concept in _any_ space.
|
| We would be living in a better world if Zuck had, at least
| once, thought "Maybe we shouldn't do that".
| avs733 wrote:
| >It's a really bad concept in any space.
|
| I struggle with this because it feels like so many 'rules'
| in the world: the important half remains unsaid, and that
| unsaid portion is then mediated by Goodhart's law.
|
| If the other half is 'then slow down and learn something',
| it's really not that bad: nothing is sacred, we try, we fail,
| we learn, and we (critically) don't repeat the mistake. That's
| human learning - we don't learn from mistakes, we learn from
| reflecting on mistakes.
|
| But if learning isn't part of the loop - if it's a self-
| justifying defense for fuckups, if the unsaid remains unsaid -
| it's a disaster waiting to happen.
|
| The difference is usually in what you reward. If you reward
| _ship_, you get the defensive version - and you will ship
| crap. If you reward institutional knowledge-building, you
| don't. Engineers are often taught 'good, fast, or cheap: pick
| 2'. The reality is it's usually closer to 1 or 1.5. If you
| pick fast... you get fast.
| _bin_ wrote:
| It's also terrible output, even before you consider what
| looks like catastrophic forgetting from crappy RL. The emoji
| use and writing style make me want to suck-start a revolver.
| I don't know how they expect anyone to actually use it.
| spaceywilly wrote:
| I agree. I think of it like a car engine. You can push it up
| to a certain RPM and it will keep making more and more power.
| Above that RPM, the engine starts to produce less power and
| eventually blows a gasket.
|
| I think the performance-based management worked for a while
| because there were some gains to be had by pushing people
| harder. However, they've gone past that and are now pushing
| people too hard and getting worse results. Every machine has
| its operating limits and an area where it operates most
| efficiently. A company is no different.
| hq123 wrote:
| very nice analogy!
| lenerdenator wrote:
| The problem is, there's always some engine pushing the
| power envelope, or a person pushing their performance
| harder. And then the rest of them have to keep up.
| a4isms wrote:
| For those who haven't heard of it, "The Hawthorne Effect" is
| the name given to a phenomenon in which a person or group that
| is aware of being studied improves its performance by as much
| as 50% for 4-8 weeks, then regresses to its norm.
|
| This is true if they are just being observed, or if some
| novel processes are introduced. If the new things are
| beneficial, the performance rises for 4-8 weeks as usual, but
| when it regresses it regresses to a higher performance
| reflecting the value of the new process.
|
| But when poor management introduce a counter-productive
| change, the Hawthorne Effect makes it look like a resounding
| success for 4-8 weeks. Then the effect fades, and performance
| drops below the original level. Sufficiently devious managers
| either move on to new projects or blame the workers for
| failing to maintain the new higher pace of performance.
|
| This explains a lot of the incentive for certain types of
| leaders to champion arbitrary changes, take a victory lap,
| and then disassociate themselves from accountability for the
| long-term success or failure of their initiative.
|
| (There is quite a bit of controversy over what the mechanisms
| for the Hawthorne Effect are, and whether change alone can
| introduce it, or whether participants need to feel they are
| being observed, but the model as I see it fits my anecdotal
| experience where new processes are always accompanied by
| attempts to meet new performance goals, and everyone is
| extremely aware that the outcome is being measured.)
| pixl97 wrote:
| I mean, it sounds like we should add in the McNamara
| fallacy also.
| hintymad wrote:
| > Employees are encouraged to ship half-baked features and
| move to another project
|
| Maybe there is more to that. It's been more than a year since
| Llama 3 was released. That should be enough time for Meta to
| release something significantly improved. Or do you mean that
| quarter by quarter the engineers had to show that they were
| making an impact in their perf reviews, which could be
| detrimental to the Llama 4 project?
|
| Another thing that puzzles me is that again and again we see
| that the quality of a model can improve if we have more high-
| quality data, yet can't Meta manage to secure a massive amount
| of new high-quality data to boost their model performance?
| magixx wrote:
| > Or you mean quarter by quarter the engineers had to show
| that they were making impact in their perf review
|
| This is what I think they were referencing. Launching
| things looks nice in review packets and few to none are
| going to look into the quality of the output. Submitting
| your own self review means that you can cherry pick
| statistics and how you present them. That's why that culture
| incentivizes launching half-baked products and moving on to
| something else: it's smart and profitable (launch yet another
| half-baked project) to distance yourself from the half-baked
| project you started.
| hintymad wrote:
| I like how Netflix set up its incentive systems years
| ago. Essentially, they told the employees that all they
| needed to do was deliver what the company wanted. It was
| perfectly okay that an employee did their job and didn't
| move up or do more. Per their chief talent officer
| McCord, "a manager's job is all about setting the
| context" and the employees were let loose to deliver.
| This method puts a really high bar on the managers, as
| the entire report chain must know clearly what they want
| delivered. Their expectations must be high enough to move
| the company forward, but not so ridiculous as to turn
| Netflix into a burnout factory.
| magixx wrote:
| Unfortunately I wasn't able to get an interview with
| Netflix.
|
| > employees that all they needed to do was deliver what
| the company wanted
|
| How did this work out in practice and across teams? My
| experience at Meta within my team was that it would be
| almost impossible to determine what the company actually
| wanted from our team in a year. Goals kept changing and
| the existing incentive system works against this since
| other teams are trying to come up with their own
| solutions to things which may impact your team.
|
| > an employee did their job and didn't move up
|
| Does Netflix cull employees if they haven't reached a
| certain IC level? I know at Meta SWEs need to reach IC5
| after a while or risk being culled.
| philjohn wrote:
| It's 100% PSC (their "Performance Culture")
|
| You're not encouraged per se to ship half-baked features, but
| if you don't have enough "impact" at the end of the half (for
| the mid-cycle check-in) or the year (for the full PSC cycle),
| then you're going to get "Below Expectations" and then "Meets
| Most" (or worse), and in the current environment, a swift
| offboarding.
|
| When I was there (working in integrity) our group of staff+
| engineers opined how it led to perverse incentives - and
| whilst you can work there and do great work, and get good
| ratings, I saw too many examples of "optimizing for PSC"
| (otherwise known as PSC hacking).
| qoez wrote:
| > it makes you wonder what they're going to do next
|
| They're just gonna keep throwing money at it. This is a hobby
| and talent magnet for them; Instagram is the money printer.
| They've been working on VR for like a decade with barely any
| results in terms of users (compared to costs). This will be no
| different.
| wongarsu wrote:
| Both are also decent long-term bets. Being the VR market
| leader now means they will be the VR market leader with plenty
| of in-house talent and IP when the technology matures and the
| market grows. Being in the AI race, even if they are not
| leading, means they have in-house talent and technology to be
| able to react to wherever the market is going with AI. They
| have one of the biggest messengers and one of the biggest
| image-posting sites; there is a decent chance AI will become
| important to them in some not-yet-obvious way.
|
| One of Meta's biggest strengths is Zuckerberg being able to
| play these kinds of bets. Those bets being great for PR and
| talent acquisition is the cherry on top
| TheOtherHobbes wrote:
| This assumes no upstart will create a game changing
| innovation which upends everything.
|
| Companies become complacent and confused when they get too
| big. Employees become trapped in a maze of performative
| careerism, and customer focus and a realistic understanding
| of threats from potential competitors both disappear.
|
| It's been a consistent pattern since the earliest days of
| computing.
|
| Nothing coming out of Big Tech at the moment is encouraging
| me to revise that assumption.
| ntlm1686 wrote:
| "made an ambitious bet on MoEs"? No, DeepSeek is MoE, and they
| succeeded. Meta is not betting on MoE, it just does what other
| people have done.
| antirez wrote:
| Llama4 seems in many ways a cut and paste of DeepSeek.
| Including the shared expert and the high sparsity. It's a
| DeepSeek that does not work well.
| thefourthchime wrote:
| I mean, there's a reason they released it on a Saturday.
| root_axis wrote:
| It's not a big deal. Llama 4 feels like a flop because the
| expectations are really high based on their previous releases
| and the sense of momentum in the ecosystem because of DeepSeek.
| At the end of the day, Llama 4 didn't meet the elevated
| expectations, but they're fine. They'll continue to improve and
| iterate and maybe the next one will be more hype worthy, or
| maybe expectations will be readjusted as the specter of
| diminishing returns continues to creep in.
| lerchmo wrote:
| The switching costs are so low (zero) that anyone using these
| models just jumps to the best performer. I also agree that
| this is not a brand or narrative sensitive project.
| int_19h wrote:
| It feels like a flop because it is objectively worse than
| models many times smaller that shipped some time ago. In
| fact, it is worse than earlier LLaMA releases on many points.
| It's so bad that people who initially ran into it assumed
| that the downloaded weights must be corrupted somehow.
| agilob wrote:
| They knew two months ago that they couldn't beat DeepSeek:
| https://old.reddit.com/r/LocalLLaMA/comments/1i88g4y/meta_pa...
| seydor wrote:
| Tech companies competing over something that is losing them
| is the most bizarre spectacle yet.
| abc-1 wrote:
| Is it really losing them money if investors throw fistfuls of
| cash at them for it?
| FL33TW00D wrote:
| Plot big tech stock valuations with markers for successful OS
| model releases.
| fullshark wrote:
| Come on, you can do the critical thinking here to understand
| why these companies would want the best in class (open/closed)
| weight LLMs.
| seydor wrote:
| then why would they cheat?
| fullshark wrote:
| Well, we'll see if they suffer consequences for this and
| whether they cheated too hard, but being perceived as best in
| class is
| arguably worth even more than being the best in class,
| especially if differences in performance are hard to
| perceive anecdotally.
|
| The goal is long term control over a technology's
| marketshare, as winner take all dynamics are in play here.
| baby wrote:
| They're all cheating; see Grok.
| nomel wrote:
| Are you referring to this [1]?
|
| > Critics have pointed out that xAI's approach involves
| running Grok 3 multiple times and cherry-picking the best
| output while comparing it against single runs of
| competitor models.
|
| [1] https://medium.com/@cognidownunder/the-hype-machine-
| gpt-4-5-...
| SubiculumCode wrote:
| I didn't see evidence of cheating in the article. Having a
| slightly differently tuned version of 4 is not the most
| dastardly thing that can be done. Everything else is
| insinuation.
| bko wrote:
| I think Meta sees AI and VR/AR as a platform. They got left
| behind on the mobile platform and forever have to contend with
| Apple semi-monopoly. They have no control and little influence
| over the ecosystem. It's an existential threat to them.
|
| They have vowed not to make that mistake again so are pushing
| for an open future that won't be dominated by a few companies
| that could arbitrarily hurt Meta's business.
|
| That's the stated rationale, at least, and I think it more or
| less makes sense.
| fullshark wrote:
| Makes sense, except for the fact that they leaked the Llama
| weights by accident and needed to reverse-engineer that
| explanation.
| jsheard wrote:
| I wouldn't call what Meta is doing with VR/AR an "open
| future", it's pretty much the exact same playbook that Google
| and Apple used for their platforms. The only difference is
| Meta gets to be the landlord this time.
| alex1138 wrote:
| I'm in favor of whatever semi-monopoly enables fine-grained
| permissions so Facebook can't en masse slurp WhatsApp
| (antitrust?) contacts.
| esafak wrote:
| They stated that?
| asveikau wrote:
| Feels very late 90s.
|
| The old joke is they're losing money on every sale but they'll
| make up for it in volume.
| dfedbeef wrote:
| _chef kiss_ perfect
| diggan wrote:
| Borderline conspiracy theory with an ounce of truth:
|
| None of the models Meta put out are actually open source (by
| any measure), and everyone who redistributes Llama models or
| any derivatives, or uses Llama models for their business, is
| on the hook for getting sued in the future based on the terms
| and conditions people have been explicitly/implicitly agreeing
| to when they use/redistribute these models.
|
| If you start depending on these Llama models, which have
| unfavorable proprietary terms, the fact that Meta doesn't act
| on those terms today doesn't mean they won't in the future.
| Maybe this has all been a play to get people into this
| position, so Meta can later start charging for them or
| something else.
| recursive wrote:
| You never go full Oracle.
| timewizard wrote:
| This is a signal that the sector is heavily monopolized.
|
| This has happened many times before in US history.
| tananaev wrote:
| The reason is simple. All tech companies have very high
| valuations. They have to sell investors a dream to justify that
| valuation. They have to convince people that they have the next
| big thing around the corner.
| roughly wrote:
| https://en.wikipedia.org/wiki/Dollar_auction
| gessha wrote:
| Next on Matt Levine's newsletter: Is Meta fudging with stock
| evaluation-correlated metrics? Is this securities fraud?
| myrmidon wrote:
| If that's what it takes to get some honesty out of corporate,
| is it such a bad idea? Why?
| benhill70 wrote:
| Sarcasm doesn't translate well in text. Please, elaborate.
| camjw wrote:
| Matt Levine has a common refrain which is that basically
| everything is securities fraud. If you were an investor who
| invested on the premise that Meta was good at AI and Zuck
| knowingly put out a bad model, is that securities fraud? Matt
| Levine will probably argue that it could be in a future
| edition of Money Stuff (his very good newsletter).
| NickC25 wrote:
| Is it securities fraud? Sort of.
|
| If Mark, both through Meta and through his own resources,
| has the capital to hire and retain the best AI researchers
| / teams, and claims he's doing so, but puts out a model
| that sucks, he's liable. It's probably not directly fraud,
| but if he claims he's trying to compete with Google or
| Microsoft or Apple or whoever, yet doesn't adequately
| deploy a comparable amount of resources, capital, people,
| whatever, and doesn't explain why, it _could_ (stretch) be
| securities fraud....I think.
| genewitch wrote:
| And the fine for that? Probably 0.001% of revenue. If
| that.
| jjani wrote:
| The "everything is securities fraud" meme is really
| unfortunate, not quite as bad as the "fiduciary duty means
| execs have to chase short-term profit" myth, but still
| harmful.
|
| It's only because lying ("puffery") about everything has
| become the norm in corporate America that indeed, almost
| all listed companies commit securities fraud. If they'd go
| back to being honest businessmen, no more securities fraud.
| Just stop claiming things that aren't true. This is a very
| real option they could take. If they don't, then they're
| willingly and knowingly committing securities fraud. But the
| meme makes it sound to people as if it's unavoidable, when
| it's anything but.
| grvdrm wrote:
| Nailed it!
| goldchainposse wrote:
| I wonder how much the current work environment contributed to
| this. There's immense pressure to deliver, so it's not surprising
| to see this.
| ekojs wrote:
| I think it's most illustrative to look at the sample battles
| (H2H) that LMArena released [1]. The outputs of Meta's model
| are too verbose and too 'yappy' IMO. And looking at the
| verdicts, it's no wonder people are discounting LMArena
| rankings.
|
| [1]: https://huggingface.co/spaces/lmarena-
| ai/Llama-4-Maverick-03...
| smeeth wrote:
| In fairness, 4o was like this until very recently. I suspect it
| comes from training on CoT data from larger models.
| ed wrote:
| Yep, it's clear that many wins are due to Llama 4's lowered
| refusal rate, which is an effective form of Elo hacking.
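|
| A rough back-of-envelope illustration (toy numbers and plain
| Elo, not LMArena's exact scoring): if a refusal tends to read
| as a loss and an answer tends to read as a win, even a modest
| swing in head-to-head win rate buys a sizable rating gap.
|
|     def expected(ra, rb):
|         # Standard Elo expected score for player A vs B.
|         return 1 / (1 + 10 ** ((rb - ra) / 400))
|
|     def update(ra, rb, score_a, k=32):
|         # score_a: 1 = A wins, 0.5 = tie, 0 = A loses
|         # (or an average score over a batch of battles).
|         delta = k * (score_a - expected(ra, rb))
|         return ra + delta, rb - delta
|
|     model, rival = 1200.0, 1200.0
|     for _ in range(200):            # 60% average win rate
|         model, rival = update(model, rival, 0.6)
|     print(round(model - rival))     # settles near ~70 points
|
| So anything that nudges raw win rate (fewer refusals, a more
| flattering tone) converts directly into rating, without the
| model getting any better at the underlying tasks.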
| goldchainposse wrote:
| In other news, the head of AI research just left
|
| https://www.cnbc.com/2025/04/01/metas-head-of-ai-research-an...
| kridsdale3 wrote:
| I would have thought that title would belong to Yann.
| yodsanklai wrote:
| TBH I'm very surprised Yann Le Cun is still there. He looks
| to me like a free thinker and an independent person. I don't
| think he buys into the Trump agenda and US nationalistic
| anti-Europe speech like Zuck does. He may be giving Zuck the
| benefit of the doubt, and probably is grateful that Zuck gave
| him a chance when nobody else did.
| lenerdenator wrote:
| > TBH I'm very surprised Yann Le Cun is still there. He
| looks to me like a free thinker and an independent person.
| I don't think he buys into the Trump agenda and US
| nationalistic anti-Europe speech like Zuck does. He may be
| giving Zuck the benefit of the doubt, and probably is
| grateful that Zuck gave him a chance when nobody else did.
|
| Zuck doesn't buy it, either. He just knows what's good for
| business right now.
|
| In an example of the worst person you know making a great
| point, Josh Hawley said "What really struck me is that they
| can read an election return." [0].
|
| Though it's worth remembering, it's _very_ difficult to
| accumulate the volume of data necessary to do the current
| kind of AI training while sticking to the strictest
| interpretations of EU privacy law. Social media companies
| aren't just feeding user data into marketing algorithms,
| they're feeding it into AI models. If you're
| a leading researcher in that field - Like Le Cun - and the
| current state-of-the-art means getting as much data as
| possible, you might not appreciate the regulatory
| environment of the EU.
|
| [0] https://www.npr.org/2025/02/27/nx-s1-5302712/senator-
| josh-ha...
| brandall10 wrote:
| It's a misnomer - the VP left. Yann is the Chief Scientist,
| which I imagine most would agree would be the 'head' of a
| research division.
| nailer wrote:
| A lower level employee also resigned specifically about this:
|
| https://x.com/arjunaaqa/status/1909174905549042085?s=46
| ignoramous wrote:
| Ahmad al-Dahle, who leads "Gen AI" at Meta, wrote this on
| Twitter:
|
| > ... We're also hearing some reports of mixed quality across
| different services ... We've also heard claims that we trained
| on test sets -- that's simply not true and we would never do
| that. Our best understanding is that the variable quality
| people are seeing is due to needing to stabilize
| implementations. We believe the Llama 4 models are a
| significant advancement and we're looking forward to working
| with the community to unlock their value.
|
| https://x.com/Ahmad_Al_Dahle/status/1909302532306092107 /
| https://archive.vn/JzONp
| SubiculumCode wrote:
| There seems to be a lot of hullabaloo, accusations, and
| rumors, but little meat to any of them. Maybe they rushed the
| release, were unsure which one to go with, and did some
| moderate rule-bending in terms of which tune got sent to the
| arena, but I have seen no hard evidence of real
| underhandedness.
| labrador wrote:
| Meta does themselves a disservice by having such a crappy
| public-facing AI for people to try (meta.ai). I regularly use
| the web versions of GPT-4o, DeepSeek, Grok, and Google Gemini
| 2.5.
|
| Meta is always the worst, so I don't even bother anymore.
| alittletooraph2 wrote:
| I tried to make Studio Ghibli-inspired images using,
| presumably, their new models. It was ass.
| echelon wrote:
| GPT-4o images are the future of all image gen.
|
| Every other player: Black Forest Labs' Flux, Stability.ai's
| Stable Diffusion, and even closed models like Ideogram and
| Midjourney, are all on the path to extinction.
|
| Image generation and editing _must_ be multimodal. Full stop.
|
| Google Imagen will probably be the first model to match the
| capabilities of 4o. I'm hoping one of the open weights labs or
| Chinese AI giants will release a model that demonstrates
| similar capabilities soon. That'll keep the race neck and neck.
| minimaxir wrote:
| One very important distinction between image models is the
| implementation: 4o is autoregressive, slow, and _extremely_
| expensive.
|
| Although the Ghibli trend is market validation, I suspect
| that competitors may not want to copy it just yet.
| echelon wrote:
| > 4o is autoregressive, slow, and extremely expensive.
|
| If you factor in the amount of time wasted with prompting
| and inpainting, it's extremely well worth it.
| JamesBarney wrote:
| Extremely expensive in what sense? In that it costs $0.03
| instead of $0.00003? Yeah, it's relatively far more
| expensive than other solutions, but from an absolute
| standpoint still very cheap for the vast majority of use
| cases. And it's a LOT better.
| svachalek wrote:
| DALL-E is already 4-8 cents per image. AFAIK this is not
| in the API yet but I wouldn't be surprised if it's $1 or
| more.
| simonw wrote:
| Llama is not an image generating model. Any interface that uses
| Llama and generates images is calling out to a separate image
| generator as a tool, like OpenAI used to do with ChatGPT and
| DALL-E up until a couple of weeks ago:
| https://simonwillison.net/2023/Oct/26/add-a-walrus/
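|
| Schematically, the pattern looks like the toy sketch below
| (made-up function names, no real vendor API): the language
| model never produces pixels itself, it just emits a tool call
| that a separate image model fulfils.
|
|     def llm(prompt: str) -> dict:
|         # Stand-in for a chat model deciding to call a tool.
|         return {"tool": "generate_image",
|                 "args": {"prompt": prompt}}
|
|     def generate_image(prompt: str) -> str:
|         # Stand-in for a separate diffusion/image model.
|         return f"<image bytes for: {prompt}>"
|
|     TOOLS = {"generate_image": generate_image}
|
|     call = llm("a walrus in a library")
|     result = TOOLS[call["tool"]](**call["args"])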
| arcastroe wrote:
| https://archive.is/Ec6V1
| codingwagie wrote:
| The truth is that the vast majority of FAANG engineers making
| high six figures are only good at deterministic work. They
| can't produce new things, so Meta and Google are struggling
| to compete when actual merit matters and they can't just
| brute-force the solutions. Inside these companies, the massive
| tech systems they've built are actually generally terrible,
| but they pile on legions of engineers to fix the problems.
|
| This is the culture of Meta hurting them: they are paying "AI
| VPs" millions of dollars to go to status meetings to get dates
| for when these models will be done. Meanwhile, DeepSeek has a
| flat hierarchy with engineers who actually understand
| low-level computing.
|
| It's making a mockery of big tech, and it's why startups
| exist. Big-company employees rise through the ranks by
| building skill sets other than producing true economic value.
| jjani wrote:
| > They can't produce new things, so Meta and Google are
| struggling to compete when actual merit matters and they can't
| just brute-force the solutions.
|
| You haven't been keeping up. Less than 2 weeks ago, Google
| released a model that has crushed the competition, clearly
| being SotA while currently effectively free for personal use.
|
| Gemini 2.0 was already good, people just weren't paying
| attention. In fact 1.5 pro was already good, and ironically
| remains the #1 model at certain very specific tasks, despite
| being set for deprecation in September.
|
| Google just suffered from their completely botched initial
| launch way back when (remember Bard?), rushed before the
| product was anywhere near ready, making them look like a bunch
| of clowns compared to e.g. OpenAI. That left a lasting
| impression on those who don't devote significant time to
| keeping up with newer releases.
| codingwagie wrote:
| Gemini 2.5 Pro isn't good, and if you think it is, you aren't
| using LLMs correctly. The model gets crushed by o1 pro and
| Sonnet 3.7 thinking. Build a large contextual prompt (> 50k
| tokens) with a ton of code and see how bad it is. I cancelled
| my Gemini subscription.
| lerchmo wrote:
| https://aider.chat/docs/leaderboards/ - your experience
| doesn't align with my experience or this benchmark. o1 pro is
| good, but I would rather do 20 cycles on Gemini 2.5 than wait
| for Pro to return.
| jjani wrote:
| I have, dozens of times, and it's generally better than
| 3.7. Especially with more context it's less forgetful.
| o1-pro is absurdly expensive and slow, good luck using that
| with tools. Virtually all benchmarks, including less gamed
| ones such as Aider's, show the same. WebLM still has 3.7
| ahead, with Sonnet always having been particularly strong
| at web development, but even on there 2.5 Pro is miles in
| front of any OpenAI model.
|
| Gemini subscription? Surely if you're "using LLMs
| correctly" you'd have been using the APIs for everything
| anyway. Subscriptions are generally for non-techy
| consumers.
|
| In any case, just straight up saying "it isn't good" is
| absurd, even if you personally prefer others.
| int_19h wrote:
| I have just watched Sonnet 3.7 vs Gemini 2.5 solving the
| same task (fix a bug end-to-end) side by side, and Sonnet
| hallucinated far worse and repeatedly got stuck in dead-
| ends requiring manual rescue. OTOH Gemini understood the
| problem based on the bug description and code from the get-go,
| and required minimal guidance to come up with a decent
| solution and implement it.
| SubiculumCode wrote:
| A whole lot of opinion there, not a whole lot of evidence.
| codingwagie wrote:
| Evidence is a decade inside these companies, watching the
| circus
| danjl wrote:
| "I'm not bitter! No chip on my shoulder."
| codingwagie wrote:
| bitter about what? I'm a long time employee
| brcmthrowaway wrote:
| What company?
| kylebyte wrote:
| The problem is less that those high-level engineers are only
| good at deterministic work and more that they're only rewarded
| for deterministic work.
|
| There is no system to pitch an idea as opening new frontiers -
| all ideas must be able to optimize some number that leadership
| has already been tricked into believing is important.
| dvcky_db wrote:
| This should surprise no one. Also, Goodhart's law strikes again.
| HunterX1 wrote:
| Impressive results from Meta's Llama adapting to various
| benchmarks. However, gaming performance seems lackluster compared
| to specialized models like Alpaca. It raises questions about the
| viability of large language models for complex, interactive tasks
| like gaming without more targeted fine-tuning. Exciting progress
| nonetheless!
| LetsGetTechnicl wrote:
| I'm just _shocked_ that the companies who stole all kinds of
| copyrighted material would again do something unethical to keep
| the bubble and gravy train going...
| antonkar wrote:
| Yes, their worst fear is people figuring out that an AI chatbot
| is a strict librarian that spits out quotes but doesn't let you
| enter the library (the AI model itself). Because with 3D game-
| like UIs people can enter the library and see all their stolen
| personal photos (if they were ever online), all kinds of
| monsters. It'll be all over YouTube.
|
| Imagine this but you remove the noise and can walk like in an
| art gallery (it's a diffusion model but LLMs can be loosely
| converted into 3D maps with objects, too):
| https://writings.stephenwolfram.com/2023/07/generative-ai-sp...
| moralestapia wrote:
| LeCun making up results ... well he comes from Academia, so ...
| kittikitti wrote:
| For me at least, the 10M context window is a big deal and as long
| as it's decent, I'm going to use it instead. I'm running Scout
| locally and my chat history can get very long. I'm very
| frustrated when the context window runs out. I haven't been able
| to fully test the context length but at least that one isn't
| fudged.
___________________________________________________________________
(page generated 2025-04-08 23:01 UTC)