[HN Gopher] Quantitative AI progress needs accurate and transparent evaluation
       ___________________________________________________________________
        
       Quantitative AI progress needs accurate and transparent evaluation
        
       Author : bertman
       Score  : 193 points
       Date   : 2025-07-25 06:47 UTC (16 hours ago)
        
 (HTM) web link (mathstodon.xyz)
 (TXT) w3m dump (mathstodon.xyz)
        
       | NitpickLawyer wrote:
       | The problem with benchmarks is that they are really useful for
       | honest researchers, but extremely toxic if used for marketing,
       | clout, etc. Something something, every measure that becomes a
       | target sucks.
       | 
       | It's really hard to trust anything public (for obvious reasons of
       | dataset contamination), but also some private ones (for the
       | obvious reasons that providers do get most/all of the questions
       | over time, and they can do sneaky things with them).
       | 
       | The only true tests are the ones you write yourself, never
       | publish, and run only against open models. If you want to test
       | commercial SotA models from time to time, you need to consider
       | those tests "burned" and come up with new ones.
        
         | antupis wrote:
         | Also, even if you want to be honest, at this point, probably
         | every public or semipublic benchmark is part of CommonCrawl.
        
           | NitpickLawyer wrote:
           | True. And it's even worse than that, because each test
           | probably gets "talked about" a lot in various places. And
           | people come up with variants. And _those_ variants get
           | ingested. And then the whole thing becomes a mess.
           | 
           | This was noticeable with the early Phi models. They were
           | originally trained fully on synthetic data (cool experiment
           | tbh), but the downside was that GPT-3/4 was "distilling"
           | benchmark "hacks" into them. It became apparent when new
           | benchmarks were released after the models' publication
           | date, and one of them measured "contamination" of about
           | 20+%. Just from distillation.
        
         | rachofsunshine wrote:
         | What makes Goodhart's Law so interesting is that you
         | transition smoothly between two entirely different problems
         | the more strongly people want to optimize for your metric.
         | 
         | One is a measurement problem, a statement about the world as it
         | is: an engineer who can finish such-and-such many steps of this
         | coding task in such-and-such time has such-and-such chance of
         | getting hired. The thing you're measuring isn't running away
         | from you or trying to hide itself, because facts aren't
         | conscious agents with the goal of misleading you. Measurement
         | problems are problems of statistics and optimization, and their
         | goal is a function f: states -> predictions. Your problems are
         | usually problems of inputs, not problems of mathematics.
         | 
         | But the larger you get, and the more valuable gaming your test
         | is, the more you leave that measurement problem and find an
         | _adversarial_ problem. Adversarial problems are _at least_ as
         | difficult as your adversary is intelligent, and they can
         | sometimes be even worse by making your adversary the invisible
         | hand of the market. You don't live in the world of gradient
         | descent anymore, because the landscape is no longer fixed. You
         | now live in the world of game theory, and your goal is a
         | function f: (state) x (time) x (adversarial capability) x
         | (history of your function f) -> predictions.
         | 
         | It's that last, recursive bit that really makes adversarial
         | problems brutal. Very simple functions can rapidly result in
         | extremely deep chaotic dynamics once you allow even the
         | slightest bit of recursion - even very simple functions like
         | the logistic map f(x) = 3.9x(1-x) become writhing ergodic
         | masses of confusion.
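         | 
         | As a toy illustration of that last point, here is a minimal
         | sketch (Python; nothing assumed beyond the map named above):
         | iterate the logistic map from two nearly identical starting
         | points and watch the gap blow up.
         | 
         |     # Sensitivity to initial conditions in the logistic map
         |     # f(x) = r*x*(1-x) in its chaotic regime (r = 3.9).
         |     def f(x, r=3.9):
         |         return r * x * (1 - x)
         | 
         |     a, b = 0.5, 0.5 + 1e-9   # two nearly identical starts
         |     for _ in range(60):
         |         a, b = f(a), f(b)
         |     print(abs(a - b))        # the 1e-9 gap is now comparable
         |                              # to the attractor itself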
        
           | visarga wrote:
           | Well said. The problem with recursion is that it constructs
           | its own context as it goes and rewrites its rules; you
           | cannot predict it statically, without forward execution.
           | It's why we have the halting problem. Recursion is
           | irreducible. A benchmark is a static dataset; it does not
           | capture the self-constructive nature of recursion.
        
           | bwfan123 wrote:
           | Nice comment. This is one reason ML approaches may struggle
           | in trading markets, where other agents are also competing
           | with you, possibly using similar algos, or in self-driving,
           | which involves other agents who could be adversarial. Just
           | training on past data is not sufficient, as existing edges
           | are competed away and new edges keep arising out of nowhere.
        
           | pixl97 wrote:
           | I would also assume Russell's paradox needs to be added in
           | here too. Humans can and do hold sets of conflicting
           | information; it is my theory that these conflicts have an
           | informational/processing cost to manage. In benchmark gaming
           | you can optimize processing speed by removing the
           | conflicting information, but you lose real-world reliability
           | metrics.
        
         | klingon-3 wrote:
         | > It's really hard to trust anything public
         | 
         | Just feed it into an LLM, unintentionally hint at your bias,
         | and voila, it will use research and the latest or generated
         | metrics to prove whatever you'd like.
         | 
         | > The only true tests are the ones you write yourself, never
         | publish, and only work 100% on open models.
         | 
         | This may be good enough, and that's fine if it is.
         | 
         | But, if you do it in-house in a closet with open models, you
         | will have your own biases.
         | 
         | No tests are valid if all that ever mattered was the argument
         | and perhaps curated evidence.
         | 
         | All tests, private and public, have historically "proved"
         | theories that turned out to be flawed.
         | 
         | Truth has always been elusive and under siege.
         | 
         | People will always just believe things. Data is just a
         | foundation for pre-existing or fabricated beliefs. It's the
         | best rationale for faith, because in the end, faith is
         | everything. Without it, there is nothing.
        
         | mmcnl wrote:
         | Yes, I ignore every news article about LLM benchmarks. "GPT
         | 7.3o first to reach >50% score in X2FGT AGI benchmark" - ok
         | thanks for the info?
        
         | crocowhile wrote:
         | There is also a social issue that has to do with
         | accountability. If you claim your model is the best and then it
         | turns out you overfitted the benchmarks and it's actually 68th,
         | your reputation should suffer considerably for cheating. If it
         | does not, we have a deeper problem than the benchmarks.
        
         | ACCount36 wrote:
         | Your options for evaluating AI performance are: benchmarks or
         | vibes.
         | 
         | Benchmarks are a really good option to have.
        
       | ozgrakkurt wrote:
       | Off topic, but just opening the link and actually being able to
       | read the posts and go to the profile in a browser, without an
       | account, feels really good. Opening a Mastodon profile, fk
       | Twitter.
        
         | ipnon wrote:
         | Stallman was right all along.
        
       | ipnon wrote:
       | Tao's commentary is more practical and insightful than all of the
       | "rationalist" doomers put together.
        
         | Quekid5 wrote:
         | That seems like a low bar :)
        
           | ipnon wrote:
           | My priors do not allow the existence of bars. Your move.
        
             | tempodox wrote:
             | You would have felt right at home in the time of the
             | Prohibition.
        
         | jmmcd wrote:
         | (a) no it's not
         | 
         | (b) your comment is miles off-topic, as he is not addressing
         | doom in any sense
        
         | ks2048 wrote:
         | I agree about Tao in general, but here,
         | 
         | > AI technology is now rapidly approaching the point of
         | transition from qualitative to quantitative achievement.
         | 
         | I don't get it. The whole history of deep learning was driven
         | by quantitative achievement on benchmarks.
         | 
         | I guess the rest of the post is about adding emphasis on costs
         | in addition to overall performance. But, I don't see how that
         | is a shift from qualitative to quantitative.
        
           | raincole wrote:
           | He means people in this AI hype wave have mostly focused on
           | "now AI can do a task that was impossible a mere 5 years
           | ago", but we will gradually change our perception of AI to
           | "how much energy/hardware does it cost to complete this
           | task, and does it really benefit us?"
           | 
           | (My interpretation, obviously)
        
       | kingstnap wrote:
       | My own thoughts on it are that it's entirely crazy that we focus
       | so much on "real world" fixed benchmarks.
       | 
       | I should write an article on it sometime, but I think the
       | incessant focus on data someone collected from the mystical
       | "real world", over well-designed synthetic data from a properly
       | understood algorithm, is really damaging to proper
       | understanding.
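       | 
       | A minimal sketch of the alternative (Python; the task format is
       | made up for illustration): draw eval items from a generator you
       | fully understand, so ground truth is exact and fresh items can
       | be produced at will, with no contamination question.
       | 
       |     # Synthetic eval from a fully understood generator.
       |     import random
       | 
       |     def make_item(rng):
       |         xs = [rng.randint(0, 999) for _ in range(7)]
       |         k = rng.randint(1, 5)
       |         prompt = f"What is the {k}-th smallest number in {xs}?"
       |         return prompt, sorted(xs)[k - 1]
       | 
       |     rng = random.Random(0)   # fixed seed -> reproducible eval
       |     items = [make_item(rng) for _ in range(100)]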
        
       | paradite wrote:
       | I believe everyone should run their own evals on their own tasks
       | or use cases.
       | 
       | Shameless plug, but I made a simple app for anyone to create
       | their own evals locally:
       | 
       | https://eval.16x.engineer/
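       | 
       | If you'd rather keep it to a script, a minimal local harness is
       | just a loop (Python; `query_model` is a placeholder for however
       | you call the model under test):
       | 
       |     # Private eval: score exact-answer items against a model.
       |     def run_eval(items, query_model):
       |         hits = 0
       |         for prompt, expected in items:
       |             reply = query_model(prompt)
       |             hits += int(str(expected) in reply)
       |         return hits / len(items)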
        
       | pu_pe wrote:
       | > For instance, if a cutting-edge AI tool can expend $1000 worth
       | of compute resources to solve an Olympiad-level problem, but its
       | success rate is only 20%, then the actual cost required to solve
       | the problem (assuming for simplicity that success is independent
       | across trials) becomes $5000 on the average (with significant
       | variability). If only the 20% of trials that were successful were
       | reported, this would give a highly misleading impression of the
       | actual cost required (which could be even higher than this, if
       | the expense of verifying task completion is also non-trivial, or
       | if the failures to solve the goal were correlated across
       | iterations).
       | 
       | This is a very valid point. Google and OpenAI announced they got
       | the gold medal with specialized models, but what exactly does
       | that entail? If one of them used a billion dollars in compute and
       | the other a fraction of that, we should know about it. Error
       | rates are equally important. Since there are conflicts of
       | interest here, academia would be best suited for producing
       | reliable benchmarks, but they would need access to closed models.
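       | 
       | The arithmetic in the quoted passage, as a minimal sketch
       | (Python; independent trials assumed, as in the quote):
       | 
       |     # Expected number of independent trials at success rate p
       |     # is 1/p, so the expected cost is cost_per_attempt / p.
       |     cost_per_attempt = 1000.0   # USD
       |     p_success = 0.20
       |     print(cost_per_attempt / p_success)   # 5000.0 on average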
        
         | JohnKemeny wrote:
         | Don't put Google and OpenAI in the same category here. Google
         | cooperated with the organizers, at least.
        
           | spuz wrote:
           | Could you clarify what you mean by this?
        
             | raincole wrote:
             | Google's answers were judged by IMO. OpenAI's were judged
             | by themselves internally. Whether it matters is up to the
             | reader.
        
             | EnnEmmEss wrote:
             | TheZvi had a summary of this here:
             | https://thezvi.substack.com/i/168895545/not-announcing-so-
             | fa...
             | 
             | In short (there is nuance), Google cooperated with the IMO
             | team while OpenAI didn't, which is why OpenAI announced
             | before Google.
        
           | ml-anon wrote:
           | Also, neither got a gold medal. Both solved enough problems
           | to meet the threshold at which a human contestant gets a
           | gold medal, but it's like saying an F1 car got a gold medal
           | in the 100m sprint at the Olympics.
        
             | vdfs wrote:
             | "Google F1 Preview Experimental beat the record of the
             | fastest man on earth Usain Bolt"
        
             | nmca wrote:
             | Indeed, it's like saying a jet plane can fly!
        
             | bwfan123 wrote:
             | The Popular Science headline was funnier, with a pun on
             | "mathed" [1]
             | 
             | "Human teens beat AI at an international math competition
             | Google and OpenAI earned gold medals, but were still out-
             | mathed by students."
             | 
             | [1] https://www.popsci.com/technology/ai-math-competition/
        
         | moffkalast wrote:
         | > with specialized models
         | 
         | > what exactly does that entail
         | 
         | Overfitting on the test set with models that are useless for
         | anything else, that's what.
        
         | sojuz151 wrote:
         | Compute has been getting cheaper and models more optimised. So
         | if models can do something, it will not be long until they can
         | do it cheaply.
        
           | EvgeniyZh wrote:
           | GPU compute per watt has grown by a factor of 2 in the last
           | 5 years.
        
       | BrenBarn wrote:
       | I like Tao, but it's always so sad to me to see people talk in
       | this detached rational way about "how" to do AI without even
       | mentioning the ethical and social issues involved. It's like
       | pondering what's the best way to burn down the Louvre.
        
         | spuz wrote:
         | Do you not think social and ethical issues can be approached
         | rationally? To me it sounds like Tao is concerned about the
         | cost of running AI powered solutions and I can quite easily see
         | how the ethical and social costs fit under that umbrella along
         | with monetary and environmental costs.
        
         | bubblyworld wrote:
         | I don't think everybody has to pay lip service to this stuff
         | every time they talk about AI. Many people (myself included)
         | acknowledge these issues but have nothing to add to the
         | conversation that hasn't been said a million times already. Tao
         | is a mathematician - I think it's completely fine that he's
         | focused on the quantitative aspects of this stuff, as that is
         | where his expertise is most relevant.
        
         | Karrot_Kream wrote:
         | I feel like your comment could be more clear and less
         | hyperbolic or inflammatory by saying something like: "I like
         | Tao but the ethical and social issues surrounding AI are much
         | more important to me than discussing its specifics."
        
           | rolandog wrote:
           | I don't agree; portraying it as an opinion has the risk of
           | continuing to erode the world with moral relativism.
           | 
           | The tech -- despite being sometimes impressive -- is
           | objectively inefficient, expensive, and harmful: to the
           | environment (excessive use of energy and water for cooling),
           | to the people located near the data centers (by stochastic
           | leaching of coolants into the water table, IIRC), and,
           | economically, to the hundreds of millions of people whose
           | data was involuntarily used for training.
        
             | Karrot_Kream wrote:
             | For the claim to be objective, I believe it needs objective
             | substance to discuss. I saw none of that. I would
             | like to see numbers, results, or something of that nature.
             | It's fine to have subjective feelings as well but I feel
             | it's important to clarify one's feelings especially because
             | I see online discussion on forums become so heated so
             | quickly which I feel degrades discussion quality.
        
               | rolandog wrote:
               | Let's not shift the burden of proof so irresponsibly.
               | 
               | We've all seen the bad faith actors that questioned, for
               | example, studies on the efficacy of wearing masks in
               | reducing chance of transmission of airborne diseases
               | because the study combined wearing masks AND washing
               | hands... Those people would gladly hand wipe without
               | toilet paper to "own the libs" or whatever hate-filled
               | mental gymnastics strokes their ego.
               | 
               | With that in mind, let's call things for what they are:
               | there are multiple companies that are salivating at the
               | prospects of being able to make the working class
               | obsolete. There's trillions to be made in their mind.
               | 
               | > I would like to see numbers, results, or something of
               | that nature
               | 
               | I would like the same thing! So far, we have seen that a
               | very big company that had pledged, IIRC, to remain not-
               | for-profit for the benefit of humanity sold out at the
               | drop of a hat the moment they were able to hint at
               | Zombocom levels of possibility to investors.
        
           | calf wrote:
           | I find it extremist and inflammatory, this recurring--
           | frankly conservative--tendency on HN to police any strong
           | polemic criticism as "hyperbole" and "inflammatory". People
           | should learn to take criticism in stride; not every strongly
           | critical comment ought to be socially censored by tone-
           | policing it. The comparison to the Louvre was a funny
           | comment, and if people didn't get that, perhaps it is not
           | too far-fetched to suggest improving basic literary-device
           | literacy skills.
        
         | rolandog wrote:
         | > what's the best way to burn down the Louvre.
         | 
         | "There are two schools of thought, you see..."
         | 
         | Joking aside, I think that's a very valid point; not sure what
         | would be the nonreligious term for the amorality of "sins of
         | omission"... But, in essence, one can clearly be unethical by
         | ignoring the social responsibility we have to study who is
         | affected by our actions.
         | 
         | Corporations can't really play dumb there, since they have to
         | weigh the impacts for every project they undertake.
         | 
         | Also, side note... It's very telling how little control we
         | (commoners?) have as a global society that -- collectively --
         | we're throwing mountains of cash at weapons and AI, which would
         | directly move us closer to oblivion and further the effects of
         | climate change (despite the majority of people not wanting wars
         | nor being replaced by a chatbot). I would instead favor world
         | peace; ending poverty, famine, and genocide; and, preventing
         | further global warming.
        
         | blitzar wrote:
         | It's always so sad to me to see people banging on about the
         | ethical and social issues involved without quantifying
         | anything, or using dodgy projections - "at this rate it will
         | kill 100 billion people by the end of the year".
        
         | ACCount36 wrote:
         | And I am tired of "mentioning the ethical and social issues".
         | 
         | If the best you can do is bring up this garbage, then you have
         | nothing of value to say.
        
         | benlivengood wrote:
         | I think using LLMs/AI for pure mathematics is one of the very
         | least ethically fraught use-cases. Creative works aren't being
         | imitated, people aren't being deceived by hallucinations
         | (literally by design; formal proof systems prevent it), from a
         | safety perspective even a superintelligent agent that was truly
         | limited to producing true theorems would be dramatically safer
         | than other kinds of interactions with the world, etc.
        
       | fsh wrote:
       | I believe that it may be misguided to focus on compute that much,
       | and it would be more instructive to consider the effort that went
       | into curating the training set. The easiest way of solving math
       | problems with an LLM is to make sure that _very similar_ problems
       | are included in the training set. Many of the AI achievements
       | would probably look a lot less miraculous if one could check the
       | training data. The most crass example is OpenAI paying off the
       | FrontierMath creators last year to get exclusive secret access to
       | the problems before the evaluation [1]. Even without resorting to
       | cheating, competition formats are vulnerable to this. It is
       | extremely difficult to come up with truly original questions, so
       | by spending significant resources on re-hashing all kinds of
       | permutations of previous questions, one will probably end up
       | _very close_ to the actual competition set. The first rule I
       | learned about training neural networks is to make damn sure
       | there is no overlap between the training and validation sets. It
       | is interesting that this rule has gone completely out the window
       | in the age of LLMs.
       | 
       | [1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-
       | lesso...
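       | 
       | A minimal sketch of the kind of overlap check implied here
       | (Python; crude exact-match n-gram matching only, which real
       | decontamination pipelines go well beyond):
       | 
       |     # Flag eval questions that share long word n-grams with
       |     # the training corpus (exact matches only).
       |     def ngrams(text, n=8):
       |         w = text.lower().split()
       |         return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
       | 
       |     def contaminated(eval_items, train_docs, n=8):
       |         seen = set()
       |         for doc in train_docs:
       |             seen |= ngrams(doc, n)
       |         return [q for q in eval_items if ngrams(q, n) & seen]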
        
         | OtherShrezzing wrote:
         | > The easiest way of solving math problems with an LLM is to
         | make sure that very similar problems are included in the
         | training set. Many of the AI achievements would probably look a
         | lot less miraculous if one could check the training data
         | 
         | I'm fairly certain this phenomenon is responsible for LLM
         | capabilities on GeoGuessr-type games. They have unreasonably
         | good performance - for example, being able to identify obscure
         | locations from featureless/foggy pictures of a bench.
         | GeoGuessr's entire dataset, including GPS metadata, is
         | definitely included in all of the frontier model training
         | datasets - so it should be unsurprising that they have
         | excellent performance in that domain.
        
           | YetAnotherNick wrote:
           | > GeoGuessr's entire dataset
           | 
           | No, it is not included; however, there must be quite a lot
           | of pictures on the internet for most cities. GeoGuessr's
           | data is the same as Google's Street View data, and it
           | probably contains billions of 360-degree photos.
        
             | ivape wrote:
             | I just saw a video on Reddit where a woman still managed to
             | take a selfie while being literally face to face with a
             | black bear. There's definitely way too much video training
             | data out there for everything.
        
               | lutusp wrote:
               | > I just saw a video on Reddit where a woman still
               | managed to take a selfie while being literally face to
               | face with a black bear.
               | 
               | This is not uncommon. Bears aren't always tearing people
               | apart, that's a movie trope with little connection to
               | reality. Black bears in particular are smart and social
               | enough to befriend their food sources.
               | 
               | But a hungry bear, or a bear with cubs, that's a
               | different story. Even then bears may surprise you. Once
               | in Alaska, a mama bear got me to babysit her cubs while
               | she went fishing -- link:
               | https://arachnoid.com/alaska2018/bears.html .
        
             | suddenlybananas wrote:
             | Why do you say it's not included? Why wouldn't they
             | include it?
        
               | sebzim4500 wrote:
               | If every photo in Street View were included in the
               | training data of a multimodal LLM, it would be like
               | 99.9999% of the training data/resource costs.
               | 
               | It just isn't plausible that anyone has actually done
               | that. I'm sure some people include a small sample of
               | them, though.
        
               | bluefirebrand wrote:
               | Why would every photo in streetview be required in order
               | to have Geoguessr's dataset in the training data?
        
               | bee_rider wrote:
               | I'm pretty sure they are saying that Geoguessr just
               | pulls directly from Google Street View. There isn't a
               | separate Geoguessr dataset; it just pulls from Google's
               | API (at least that's what Wikipedia says).
        
               | bluefirebrand wrote:
               | I suspect that Geoguessr's dataset is a subset of Google
               | Streetview, but maybe it really is just pulling
               | everything directly
        
               | bee_rider wrote:
               | My guess would be that they pull directly from street-
               | view, maybe with some extra filtering for interesting
               | locations.
               | 
               | Why bother to create a copy, if it can be avoided, right?
        
           | ACCount36 wrote:
           | People tried VLMs on "closed set" GeoGuessr-type tasks - i.e.
           | non-Street View photos in similar style, not published
           | anywhere.
           | 
           | They still kicked ass.
           | 
           | It seems like those AIs just have an awful lot of location
           | familiarity. They've seen enough tagged photos to be able to
           | pick up on the patterns, and generalize that to kicking ass
           | at GeoGuessr.
        
         | astrange wrote:
         | > The easiest way of solving math problems with an LLM is to
         | make sure that very similar problems are included in the
         | training set.
         | 
         | An irony here is that math blogs like Tao's might not be in LLM
         | training data, for the same reason they aren't accessible to
         | screen readers - they're full of math, and the math is rendered
         | as images, so it's nonsense if you can't read the images.
         | 
         | (The images on his blog do have alt text, but it's just the
         | LaTeX code, which isn't much better.)
        
           | prein wrote:
           | What would be a better alternative to LaTeX for the alt
           | text? I can't think of a solution that makes more sense; it
           | provides an unambiguous representation of what's depicted.
           | 
           | I wouldn't think an LLM would have any issue with that. I
           | can see how a screen reader might, but it seems like the
           | same problem faced by a screen reader with any piece of
           | code, not just LaTeX.
        
           | QuesnayJr wrote:
           | LLMs understand LaTeX extraordinarily well.
        
           | MengerSponge wrote:
           | LLMs are decent with LaTeX! It's just markup code after all.
           | I've heard from some colleagues that they can do decent image
           | to code conversion for a picture of an equation or even some
           | handwritten ones.
        
           | alansammarone wrote:
           | As others have pointed out, LLMs have no trouble with LaTeX.
           | I can see why one might think they're not - in fact, I made
           | the same assumption myself some time ago. LLMs, via
           | transformers, are exceptionally good at _any_ sequential or
           | one-dimensional data. One very interesting (to me anyway)
           | example is base64 - pick some not-huge sentence (say, 10
           | words), base64-encode it, and just paste it into any LLM you
           | want, and it will be able to understand it. Same works with
           | hex, ascii representation, or binary_. Here's a sample if
           | you want to try: aWYgYWxsIEEncyBhcmUgQidzLCBidXQgb25seSBzb21l
           | IEIncyBhcmUgQydzLCBhcmUgYWxsIEEncyBDJ3M/IEFuc3dlciBpbiBiYXNlN
           | jQu
           | 
           | _ I remember running this experiment some time ago in a
           | context where I was certain there was no possibility of tool
           | use to encode/decode. Nowadays, it can be hard to be certain
           | whether there is any tool use or not; in some cases, such as
           | Mistral, the response is quick enough to make it unlikely
           | there's any tool use.
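           | 
           | (For anyone curious, the sample above decodes with a couple
           | of lines of Python; the point, of course, is that the LLM
           | manages it without any such tool.)
           | 
           |     import base64
           |     s = ("aWYgYWxsIEEncyBhcmUgQidzLCBidXQgb25seSBzb21l"
           |          "IEIncyBhcmUgQydzLCBhcmUgYWxsIEEncyBDJ3M/"
           |          "IEFuc3dlciBpbiBiYXNlNjQu")
           |     print(base64.b64decode(s).decode())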
        
             | throwanem wrote:
             | I've just tried it, in the form of your base64 prompt and
             | no other context, with a local Qwen-3 30b instance that I'm
             | entirely certain is not actually performing tool use. It
             | produced a correct answer ("Tm8="), which in a moment of
             | accidental comedy it spontaneously formatted with LaTeX.
             | But it did talk about invoking an online decoder, just
             | before the first appearance of the (nearly) complete
             | decoded string in its CoT.
             | 
             | It "left out" the A in its decode and still correctly
             | answered the proposition, either out of reflexive
             | familiarity with the form or via metasyntactic reasoning
             | over an implicit anaphor; I believe I recall this to be a
             | formulation of one of the elementary axioms of set theory,
             | though you will excuse me for omitting its name before
             | coffee, which makes the pattern matching possibility seem
             | somewhat more feasible. ('Seem' may work a little too hard
             | there. But a minimally more novel challenge I think would
             | be needed to really see more.)
             | 
             | There's lots of text in lots of languages about using an
             | online base64 decoder, and nearly none at all about
             | decoding the representation "in your head," which for
             | humans would be a party trick akin to that one fellow who
             | could see a city from a helicopter for 30 seconds and then
             | perfectly reproduce it on paper from memory. It makes sense
             | to me that a model trained on the Internet would "invent"
             | the "metaphor" of an online decoder here, I think. What in
             | its "experience" serves better as a description?
        
           | constantcrying wrote:
           | >(The images on his blog do have alt text, but it's just the
           | LaTeX code, which isn't much better.)
           | 
           | LLMs are extremely good at outputting LaTeX, ChatGPT will
           | output LaTeX, which the website will render as such. Why do
           | you think LLMs have trouble understanding it?
        
             | astrange wrote:
             | I don't think LLMs will have trouble understanding it. I
             | think people using screen readers will. ...oh I see, I
             | accidentally deleted the part of the comment about that.
             | 
             | But the people writing the web page extraction pipelines
             | also have to handle the alt text properly.
        
           | mbowcut2 wrote:
           | LLMs are better at LaTeX than humans. ChatGPT often writes
           | LaTeX responses.
        
             | neutronicus wrote:
             | Yeah, it's honestly one of the things they're best at!
             | 
             | I've been working on implementing some E&M simulations with
             | Claude Code and it's so-so on the C++ and TERRIBLE at the
             | actual math (multiplying a couple 6x6 matrix differential
             | operators is beyond it).
             | 
             | But I can dash off some notes and tell Claude to TeXify and
             | the output is great.
        
         | disruptbro wrote:
         | Language modeling is compression: whittle down the graph to
         | reduce duplication and data with little relationship:
         | https://arxiv.org/abs/2309.10668
         | 
         | Let's say everyone agrees to refer to one hosted copy of a
         | token "cat", and instead generate a unique vector to represent
         | their reference to "cat".
         | 
         | Blam. Endless unique vectors which are nice and precise for
         | parsing. No endless copies of arbitrary text like "cat".
         | 
         | Now make that your globally distributed database to bootstrap
         | AI chips from. The data-driven programming dream, where other
         | machines on the network feed new machines as they bootstrap.
         | 
         | The American tech industry is IBM now: stuck on the recent
         | success of web SaaS and way behind on its plans for AI.
        
       | mhl47 wrote:
       | Side note: what is going on with these comments on Mathstodon?
       | From moon landing denials to insults to allegations that he used
       | AI to write this ... almost all of them are to some degree
       | insane.
        
         | nurettin wrote:
         | That is what peak humanity looks like.
        
         | Karrot_Kream wrote:
         | I find the same kind of behavior on bigger Bluesky AI threads.
         | I don't use Mathstodon (or actively follow folks on it) but I
         | certainly feel sad to see similar replies there too. I
         | speculate that folks opposed to AI are angry and take it out by
         | writing these sorts of comments, but this is just my hunch.
         | That's as much as I feel I should write about this without
         | feeling guilty for derailing the discussion.
        
           | ACCount36 wrote:
           | No wonder. Bluesky is where insane Twitter people go when
           | they get too insane for Twitter.
        
         | dash2 wrote:
         | Almost everywhere on the internet is like this. It's HN that
         | is (mostly!) exceptional.
        
           | f1shy wrote:
           | The "mostly" there is so important! But also HN suffers from
           | other problems (see in this thread the discussion about over
           | policing comments, and calling fast hyperbolic and
           | inflammatory).
           | 
           | And don't get me started in the decline on depth in technical
           | topics and soaring in political discussions. I came to HN for
           | the first, not the second.
           | 
           | So we are humans, there will never be a perfect forum.
        
             | frumiousirc wrote:
             | > So we are humans, there will never be a perfect forum.
             | 
             | Perfect is in the eye of the moderator.
        
         | hshshshshsh wrote:
         | The truth is, both deniers and believers are operating on
         | belief. Only those who actually went to the Moon know
         | firsthand. The rest of us trust information we've received --
         | filtered through media, education, or bias. That makes us not
         | fundamentally different from deniers; we just think our belief
         | is more justified.
        
           | fc417fc802 wrote:
           | Just to carry this line of reasoning out to the extreme for
           | entertainment purposes (and to illustrate for everyone how
           | misguided it is): even if you perform a task firsthand, at
           | the end of the day you're just trusting your memory of
           | having done so. You feel that your trust in your memory is
           | justified, but fundamentally that isn't any different from
           | the deniers either.
        
             | hshshshshsh wrote:
             | This is actually true. Plenty of accidents have happened
             | because of this.
             | 
             | I am not saying trusting your memory is always false or
             | true. Most of the time it might be true. It's a heuristic.
             | 
             | But if someone comes and denies what you did, the best
             | course of action would be to consider the evidence they
             | have and not assume they are stupid because they believe
             | differently.
             | 
             | Let's be honest: you have not personally gone and verified
             | that the rocks belong to the Moon. Nor were you tracking
             | the telemetry data on your computer when the rocket was
             | going to the Moon.
             | 
             | I also believe we went to the Moon.
             | 
             | But all I have is beliefs.
             | 
             | Everyone believed the Earth was flat thousands of years
             | back as well. They had solid evidence.
             | 
             | But humility is accepting that you don't know and that you
             | are believing, not pretending you are above others who
             | believe the exact opposite.
        
               | fc417fc802 wrote:
               | It's a misguided line of reasoning because the "belief"
               | thing is a red herring. Nearly everything comes down to
               | belief at a low level. The differences lie in the
               | justifications.
               | 
               | As you say, you should have the humility to consider the
               | evidence that others provide that you might be wrong. The
               | thing with the various popular conspiracy theories is
               | that the evidence is conspicuously missing when any
               | competent good faith actor would be presenting it front
               | and center.
        
           | esafak wrote:
           | Some beliefs are more supported by evidence than others. To
           | ignore this is to make the concept of belief practically
           | useless.
        
             | hshshshshsh wrote:
             | Yeah. My point is you have not seen any of the evidence.
             | You just have a belief that evidence exists - which is a
             | belief, not evidence.
        
               | esafak wrote:
               | Yes, we have seen evidence: videos, pictures and other
               | artifacts of the landing.
               | 
               | I think you don't know what evidence means. You want
               | _proof_, and that's for mathematics.
               | 
               | You don't _know_ that you exist. You could be a
               | simulation.
        
         | andrepd wrote:
         | Have you opened a Twitter thread? People are insane on social
         | media; why should open-source social media be substantially
         | different? x)
        
           | f1shy wrote:
           | I refrain from all of those - X, Mastodon, etc. - so let me
           | ask a question:
           | 
           | Are they all equally bad? Or equally bad, but in different
           | ways? E.g. I often read here that X has more disinformation
           | and right-wing propaganda, while Mastodon was called out
           | here on another topic.
           | 
           | Maybe somebody active in different networks can answer that.
        
             | fc417fc802 wrote:
             | Moderation and the algorithms used to generate user feeds
             | both have strong impacts. In the case of mastodon (ie
             | activitypub) moderation varies wildly between different
             | domains.
             | 
             | But in general, I'd say that the microblogging format as a
             | whole encourages a number of toxic behaviors and
             | interaction patterns.
        
             | miltonlost wrote:
             | X doesn't let you use trans as a word and has Grok spewing
             | right-wing propaganda (mechahitler?). That self-selects
             | into the most horrible people being on X now.
        
       | stared wrote:
       | I agree that once a challenge shows something can be done at all
       | (heavier-than-air flight, the Moon landing, a gold medal at the
       | IMO), the next question is whether it makes sense economically.
       | 
       | I like the ARC-AGI approach because it shows both axes - score
       | and price - and places a human benchmark on them.
       | 
       | https://arcprize.org/leaderboard
        
       | pama wrote:
       | This sounds very reasonable to me.
       | 
       | When considering top tier labs that optimize inference and own
       | the GPUs: the electricity cost of USD 5000 at a data center with
       | 4 cents per kWh (which may be possible to arrange or beat in some
       | counties in the US with special industrial contracts) can produce
       | about 2 trillion tokens for the R1-0528 model using 120kW draw
       | for the B200 NVL72 hardware and the (still to be fully optimized)
       | sglang inference pipeline:
       | https://lmsys.org/blog/2025-06-16-gb200-part-1/
       | 
       | Although 2T tokens is not unreasonable for being able to get high
       | precision answers to challenging math questions, such a very high
       | token number would strongly suggest there are lots of unknown
       | techniques deployed at these labs.
       | 
       | If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU,
       | then the number of tokens for 5k USD shrinks dramatically to only
       | 66B tokens, which is still high for usual techniques that try to
       | optimize for a best single answer in the end, but perhaps
       | plausible if the vast majority of these are intermediate thinking
       | tokens and a lot of the value comes from LLM-based verification.
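       | 
       | A minimal sketch of that arithmetic (Python; all figures are the
       | ones quoted above, and the implied per-GPU throughput is just
       | the back-of-envelope consequence of them):
       | 
       |     electricity_usd = 5000.0
       |     usd_per_kwh = 0.04
       |     rack_kw = 120.0                   # B200 NVL72 draw
       |     hours = electricity_usd / usd_per_kwh / rack_kw  # ~1042 h
       |     tok_per_s = 2e12 / (hours * 3600) # ~5.3e5 rack, ~7.4e3/GPU
       | 
       |     rack_usd_per_h = 72 * 2.0         # 2 USD/h/GPU rental
       |     rental_h = 5000.0 / rack_usd_per_h            # ~34.7 h
       |     print(rental_h * 3600 * tok_per_s)            # ~66B tokens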
        
       | iloveoof wrote:
       | Moore's Law for AI Progress: AI metrics will double every two
       | years whether the AI gets smarter or not.
        
       | akomtu wrote:
       | Benchmarks should really add a data compression test.
       | Intelligence is mostly about discovering the underlying
       | principles, the ability to see simple rules behind complex
       | behaviors, and data compression captures this well. For example,
       | if you can look at a dataset of planetary and stellar motions and
       | compress it into a simple equation, you'd be considered wildly
       | intelligent. If you can't remember and reproduce a simple
       | checkerboard pattern, you'd be considered dumb. Another example
       | is drawing a duck in SVG - another form of data compression. Data
       | extrapolation, on the other hand, is the opposite problem, which
       | can be solved by imitation or by understanding the rules
       | producing the data. Only the latter deserves to be called
       | intelligence. Note, though, that understanding the rules isn't
       | always a superior method. When we are driving, we drive by
       | imitation based on our extensive experience with similar
       | situations, hardly understanding the physics of driving.
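       | 
       | A crude sketch of the compression intuition (Python, with zlib
       | standing in for a learned compressor): a rule-governed pattern
       | compresses to almost nothing, while noise of the same length
       | does not.
       | 
       |     import os, zlib
       | 
       |     checker = bytes((i + i // 8) % 2 for i in range(4096))
       |     noise = os.urandom(4096)
       |     print(len(zlib.compress(checker)))  # tiny: rule was found
       |     print(len(zlib.compress(noise)))    # ~4096: nothing to find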
        
       | js8 wrote:
       | LLMs could be very useful in formalizing the problem and
       | assumptions (conversion from natural language), but once problem
       | is described in a formal way (it can be described in some fuzzy
       | logic), then more reliable AI techniques should be applied.
       | 
       | Interestingly, Tao mentions
       | https://teorth.github.io/equational_theories/, and I believe this
       | is better progress than LLMs doing math. I believe enhancing Lean
       | with more tactics and formalizing those in Lean itself is a more
       | fruitful avenue for AI in math.
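       | 
       | A toy example of what "formalizing tactics in Lean itself" can
       | look like (Lean 4; a one-line macro tactic, nothing like the
       | search tactics actually needed, just the shape of the idea):
       | 
       |     -- A tiny custom tactic defined inside Lean itself.
       |     macro "comm_nat" : tactic => `(tactic| rw [Nat.add_comm])
       | 
       |     example (a b : Nat) : a + b = b + a := by
       |       comm_nat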
        
         | agentcoops wrote:
         | I used to work quite extensively with Isabelle and as a
         | developer on Sledgehammer [1]. There are well-known results,
         | most obviously the halting problem, that mean fully-automated
         | logical methods applied to a formalism with any expressive
         | capability, i.e. that can be used to formalize non-trivial
         | problems, simply can never fulfill the role you seem to be
         | suggesting. The proofs that are actually generated in that way
         | are, anyway, horrendous -- in fact, the problem I used to work
         | on was using graph algorithms to try and simplify computer-
         | generated proofs for human comprehension. That's the very
         | reason that all the serious work has previously been on proof
         | /assistants/ and formal validation.
         | 
         | LLMs, especially in /conjunction/ with Lean for formal
         | validation, are really an exciting new frontier in mathematics
         | and it's a mistake to see that as just "unreliable" versus
         | "reliable" symbolic AI etc. The OP Terence Tao has been pushing
         | the edge here since day one and providing, I think, the most
         | unbiased perspective on where things stand today, strengths as
         | much as limitations.
         | 
         | [1] https://isabelle.in.tum.de/website-
         | Isabelle2009-1/sledgehamm...
        
           | js8 wrote:
           | LLMs (as well as humans) are algorithms like anything else,
           | and so they are also subject to the halting problem. I don't
           | see what LLMs do that couldn't in principle be formalized as
           | a Lean tactic. (IMHO LLMs are just learning rules - theorems
           | of some kind of fuzzy logic - and then trying to apply them
           | using heuristic search to satisfy the goal. Unfortunately,
           | the rules learned are likely not fully consistent, and so
           | you get reasoning errors.)
        
       | data_maan wrote:
       | The concept of a pre-registered eval (by analogy with a pre-
       | registered study) will go a long way towards fixing this.
       | 
       | More information:
       | 
       | https://mathstodon.xyz/@friederrr/114881863146859839
        
       | kristianp wrote:
       | It's going to take a large step up in transparency for AI
       | companies to do this. It was back in the GPT-4 days that OpenAI
       | stopped reporting model size, for example, and the others
       | followed suit.
        
       ___________________________________________________________________
       (page generated 2025-07-25 23:01 UTC)