[HN Gopher] Quantitative AI progress needs accurate and transpar...
___________________________________________________________________
Quantitative AI progress needs accurate and transparent evaluation
Author : bertman
Score : 193 points
Date : 2025-07-25 06:47 UTC (16 hours ago)
(HTM) web link (mathstodon.xyz)
(TXT) w3m dump (mathstodon.xyz)
| NitpickLawyer wrote:
| The problem with benchmarks is that they are really useful for
| honest researchers, but extremely toxic if used for marketing,
| clout, etc. Something something, every measure that becomes a
| target sucks.
|
| It's really hard to trust anything public (for obvious reasons of
| dataset contamination), but also some private ones (for the
| obvious reasons that providers do get most/all of the questions
| over time, and they can do sneaky things with them).
|
| The only true tests are the ones you write yourself, never
| publish, and only work 100% on open models. If you want to test
| commercial SotA models from time to time you need to consider
| them "burned", and come up with more tests.
| antupis wrote:
| Also, even if you want to be honest, at this point, probably
| every public or semipublic benchmark is part of CommonCrawl.
| NitpickLawyer wrote:
| True. And it's even worse than that, because each test
| probably gets "talked about" a lot in various places. And
| people come up with variants. And _those_ variants get
| ingested. And then the whole thing becomes a mess.
|
| This was noticeable with the early Phi models. They were
| originally trained fully on synthetic data (cool experiment
| tbh), but the downside was that GPT-3/4 was "distilling"
| benchmark "hacks" into them. It became apparent when new
| benchmarks were released after the models' publication date,
| and one of them measured "contamination" of about 20+%. Just
| from distillation.
| rachofsunshine wrote:
| What makes Goodhart's Law so interesting is that you transition
| smoothly between two entirely different problems the more
| strongly people want to optimize for your metric.
|
| One is a measurement problem, a statement about the world as it
| is: an engineer who can finish such-and-such many steps of this
| coding task in such-and-such time has such-and-such chance of
| getting hired. The thing you're measuring isn't running away
| from you or trying to hide itself, because facts aren't
| conscious agents with the goal of misleading you. Measurement
| problems are problems of statistics and optimization, and their
| goal is a function f: states -> predictions. Your problems are
| usually problems of inputs, not problems of mathematics.
|
| But the larger you get, and the more valuable gaming your test
| is, the more you leave that measurement problem and find an
| _adversarial_ problem. Adversarial problems are _at least_ as
| difficult as your adversary is intelligent, and they can
| sometimes be even worse by making your adversary the invisible
| hand of the market. You don't live in the world of gradient
| descent anymore, because the landscape is no longer fixed. You
| now live in the world of game theory, and your goal is a
| function f: (state) x (time) x (adversarial capability) x
| (history of your function f) -> predictions.
|
| It's that last, recursive bit that really makes adversarial
| problems brutal. Very simple functions can rapidly result in
| extremely deep chaotic dynamics once you allow even the
| slightest bit of recursion - even very nice functions like f(x)
| = 3.5x(1-x) become writhing ergodic masses of confusion.
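|
| A minimal sketch of that sensitive-dependence point (in Python;
| note that r = 3.5 actually sits in a periodic window of the
| logistic family, so r = 3.9 is used here to show the chaotic
| regime the comment is gesturing at):
|
|     def step(x, r=3.9):
|         # one iteration of the logistic map x -> r*x*(1-x)
|         return r * x * (1.0 - x)
|
|     a, b = 0.4, 0.4 + 1e-9   # two almost-identical starting states
|     for i in range(1, 61):
|         a, b = step(a), step(b)
|         if i in (10, 30, 60):
|             print(i, abs(a - b))  # gap grows by orders of magnitude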
| visarga wrote:
| Well said. The problem with recursion is that it constructs
| its own context as it goes and rewrites its rules; you cannot
| predict it statically, without forward execution. It's why we
| have the halting problem. Recursion is irreducible. A
| benchmark is a static dataset; it does not capture the self-
| constructive nature of recursion.
| bwfan123 wrote:
| Nice comment. A reason why ML approaches may struggle in
| trading markets, where other agents are also competing with
| you, possibly using similar algos, or in self-driving, which
| involves other agents who could be adversarial: just training
| on past data is not sufficient, as existing edges are competed
| away and new edges keep arising out of nowhere.
| pixl97 wrote:
| I would also assume Russell's paradox needs to be added in
| here too. Humans can and do hold sets of conflicting
| information; it is my theory that such conflicts have an
| informational/processing cost to manage. In benchmark gaming
| you can optimize the processing speed by removing the
| conflicting information, but you lose real-world reliability
| metrics.
| klingon-3 wrote:
| > It's really hard to trust anything public
|
| Just feed it into an LLM, unintentionally hint at your bias,
| and voila, it will use research and the latest or generated
| metrics to prove whatever you'd like.
|
| > The only true tests are the ones you write yourself, never
| publish, and only work 100% on open models.
|
| This may be good enough, and that's fine if it is.
|
| But, if you do it in-house in a closet with open models, you
| will have your own biases.
|
| No tests are valid if all that ever mattered was the argument
| and perhaps curated evidence.
|
| All tests, private and public, have proved flawed theories
| historically.
|
| Truth has always been elusive and under siege.
|
| People will always just believe things. Data is just foundation
| for pre-existing or fabricated beliefs. It's the best rationale
| for faith, because in the end, faith is everything. Without it,
| there is nothing.
| mmcnl wrote:
| Yes, I ignore every news article about LLM benchmarks. "GPT
| 7.3o first to reach >50% score in X2FGT AGI benchmark" - ok
| thanks for the info?
| crocowhile wrote:
| There is also a social issue that has to do with
| accountability. If you claim your model is the best and then it
| turns out you overfitted the benchmarks and it's actually 68th,
| your reputation should suffer considerably for cheating. If it
| does not, we have a deeper problem than the benchmarks.
| ACCount36 wrote:
| Your options for evaluating AI performance are: benchmarks or
| vibes.
|
| Benchmarks are a really good option to have.
| ozgrakkurt wrote:
| Off topic, but just opening the link and actually being able
| to read the posts and go to the profile in a browser, without
| an account, feels really good. Opening a Mastodon profile, fk
| twitter
| ipnon wrote:
| Stallman was right all along.
| ipnon wrote:
| Tao's commentary is more practical and insightful than all of the
| "rationalist" doomers put together.
| Quekid5 wrote:
| That seems like a low bar :)
| ipnon wrote:
| My priors do not allow the existence of bars. Your move.
| tempodox wrote:
| You would have felt right at home in the time of the
| Prohibition.
| jmmcd wrote:
| (a) no it's not
|
| (b) your comment is miles off-topic, as he is not addressing
| doom in any sense
| ks2048 wrote:
| I agree about Tao in general, but here,
|
| > AI technology is now rapidly approaching the point of
| transition from qualitative to quantitative achievement.
|
| I don't get it. The whole history of deep learning was driven
| by quantitative achievement on benchmarks.
|
| I guess the rest of the post is about adding emphasis on costs
| in addition to overall performance. But, I don't see how that
| is a shift from qualitative to quantitative.
| raincole wrote:
| He means that people in this AI hype cycle have mostly focused
| on "now AI can do a task that was impossible a mere 5 years
| ago", but we will gradually change our perception of AI to
| "how much energy/hardware does it cost to complete this task,
| and does it really benefit us?"
|
| (My interpretation, obviously)
| kingstnap wrote:
| My own thoughts on it are that it's entirely crazy that we focus
| so much on "real world" fixed benchmarks.
|
| I should write an article on it sometime, but I think the
| incessant focus on data someone collected from the mystical
| "real world", over well-designed synthetic data from a
| properly understood algorithm, is really damaging to proper
| understanding.
| paradite wrote:
| I believe everyone should run their own evals on their own tasks
| or use cases.
|
| Shameless plug, but I made a simple app for anyone to create
| their own evals locally:
|
| https://eval.16x.engineer/
| pu_pe wrote:
| > For instance, if a cutting-edge AI tool can expend $1000 worth
| of compute resources to solve an Olympiad-level problem, but its
| success rate is only 20%, then the actual cost required to solve
| the problem (assuming for simplicity that success is independent
| across trials) becomes $5000 on the average (with significant
| variability). If only the 20% of trials that were successful were
| reported, this would give a highly misleading impression of the
| actual cost required (which could be even higher than this, if
| the expense of verifying task completion is also non-trivial, or
| if the failures to solve the goal were correlated across
| iterations).
|
| This is a very valid point. Google and ChatGPT announced they got
| the gold medal with specialized models, but what exactly does
| that entail? If one of them used a billion dollars in compute and
| the other a fraction of that, we should know about it. Error
| rates are equally important. Since there are conflicts of
| interest here, academia would be best suited for producing
| reliable benchmarks, but they would need access to closed models.
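|
| A quick sanity check of the expected-cost arithmetic in the
| quoted passage (a sketch in Python, assuming independent $1000
| trials with a 20% success rate, so the expected number of trials
| is 1/0.2 = 5):
|
|     import random
|
|     def cost_until_success(p=0.2, cost_per_trial=1000):
|         # pay for attempts until the first success (geometric trials)
|         trials = 1
|         while random.random() >= p:
|             trials += 1
|         return trials * cost_per_trial
|
|     runs = [cost_until_success() for _ in range(100_000)]
|     print(sum(runs) / len(runs))  # ~5000 on average, with wide spread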
| JohnKemeny wrote:
| Don't put Google and ChatGPT in the same category here. Google
| cooperated with the organizers, at least.
| spuz wrote:
| Could you clarify what you mean by this?
| raincole wrote:
| Google's answers were judged by IMO. OpenAI's were judged
| by themselves internally. Whether it matters is up to the
| reader.
| EnnEmmEss wrote:
| TheZvi had a summarization of this here:
| https://thezvi.substack.com/i/168895545/not-announcing-so-
| fa...
|
| In short (there is nuance), Google cooperated with the IMO
| team while OpenAI didn't, which is why OpenAI announced
| before Google.
| ml-anon wrote:
| Also neither got a gold medal. Both solved problems to meet
| the threshold for a human child getting a gold medal but it's
| like saying an F1 car got a gold medal in the 100m sprint at
| the Olympics.
| vdfs wrote:
| "Google F1 Preview Experimental beat the record of the
| fastest man on earth Usain Bolt"
| nmca wrote:
| Indeed, it's like saying a jet plane can fly!
| bwfan123 wrote:
| The popular science title was funnier with a pun on
| "mathed" [1]
|
| "Human teens beat AI at an international math competition
| Google and OpenAI earned gold medals, but were still out-
| mathed by students."
|
| [1] https://www.popsci.com/technology/ai-math-competition/
| moffkalast wrote:
| > with specialized models
|
| > what exactly does that entail
|
| Overfitting on the test set with models that are useless for
| anything else, that's what.
| sojuz151 wrote:
| Compute has been getting cheaper and models more optimised, so
| if models can do something, it will not be long until they can
| do it cheaply.
| EvgeniyZh wrote:
| GPU compute per watt has grown by a factor of 2 in the last 5
| years
| BrenBarn wrote:
| I like Tao, but it's always so sad to me to see people talk in
| this detached rational way about "how" to do AI without even
| mentioning the ethical and social issues involved. It's like
| pondering what's the best way to burn down the Louvre.
| spuz wrote:
| Do you not think social and ethical issues can be approached
| rationally? To me it sounds like Tao is concerned about the
| cost of running AI powered solutions and I can quite easily see
| how the ethical and social costs fit under that umbrella along
| with monetary and environmental costs.
| bubblyworld wrote:
| I don't think everybody has to pay lip service to this stuff
| every time they talk about AI. Many people (myself included)
| acknowledge these issues but have nothing to add to the
| conversation that hasn't been said a million times already. Tao
| is a mathematician - I think it's completely fine that he's
| focused on the quantitative aspects of this stuff, as that is
| where his expertise is most relevant.
| Karrot_Kream wrote:
| I feel like your comment could be more clear and less
| hyperbolic or inflammatory by saying something like: "I like
| Tao but the ethical and social issues surrounding AI are much
| more important to me than discussing its specifics."
| rolandog wrote:
| I don't agree; portraying it as an opinion has the risk of
| continuing to erode the world with moral relativism.
|
| The tech -- despite being sometimes impressive -- is
| objectively inefficient, expensive, and harmful: to the
| environment (excessive use of energy and water for cooling),
| to the people located near the data centers (by stochastic
| leaching of coolants into the water table, IIRC), and
| economically to the hundreds of millions of people whose data
| was involuntarily used for training.
| Karrot_Kream wrote:
| For the claim to be objective then I believe it needs
| objective substance to discuss. I saw none of that. I would
| like to see numbers, results, or something of that nature.
| It's fine to have subjective feelings as well, but I feel
| it's important to clarify them as such, especially because
| I see online discussion on forums become so heated so
| quickly, which I feel degrades discussion quality.
| rolandog wrote:
| Let's not shift the burden of proof so irresponsibly.
|
| We've all seen the bad faith actors that questioned, for
| example, studies on the efficacy of wearing masks in
| reducing chance of transmission of airborne diseases
| because the study combined wearing masks AND washing
| hands... Those people would gladly hand wipe without
| toilet paper to "own the libs" or whatever hate-filled
| mental gymnastics strokes their ego.
|
| With that in mind, let's call things for what they are:
| there are multiple companies that are salivating at the
| prospects of being able to make the working class
| obsolete. There's trillions to be made in their mind.
|
| > I would like to see numbers, results, or something of
| that nature
|
| I would like the same thing! So far, we have seen that a
| very big company that had pledged, IIRC, to remain not-
| for-profit for the benefit of humanity sold out at the
| drop of a hat the moment they were able to hint Zombocom
| levels of possibility to investors.
| calf wrote:
| I find this recurring -- frankly conservative -- tendency on
| HN to police any strong polemical criticism as "hyperbole" and
| "inflammatory" to be extremist and inflammatory itself. People
| should learn to take criticism in stride; not every strongly
| critical comment ought to be socially censored by tone-
| policing it. The comparison to the Louvre was a funny comment,
| and if people didn't get that, perhaps it is not too far-
| fetched to suggest improving on basic literary-device literacy
| skills.
| rolandog wrote:
| > what's the best way to burn down the Louvre.
|
| "There are two schools of thought, you see..."
|
| Joking aside, I think that's a very valid point; not sure what
| would be the nonreligious term for the amorality of "sins of
| omission"... But, in essence, one can clearly be unethical by
| ignoring the social responsibility we have to study who is
| affected by our actions.
|
| Corporations can't really play dumb there, since they have to
| weigh the impacts for every project they undertake.
|
| Also, side note... It's very telling how little control we
| (commoners?) have as a global society that -- collectively --
| we're throwing mountains of cash at weapons and AI, which would
| directly move us closer to oblivion and further the effects of
| climate change (despite the majority of people not wanting wars
| nor being replaced by a chatbot). I would instead favor world
| peace; ending poverty, famine, and genocide; and, preventing
| further global warming.
| blitzar wrote:
| It's always so sad to me to see people banging on about the
| ethical and social issues involved without quantifying
| anything, or using dodgy projections - "at this rate it will
| kill 100 billion people by the end of the year".
| ACCount36 wrote:
| And I am tired of "mentioning the ethical and social issues".
|
| If the best you can do is bring up this garbage, then you have
| nothing of value to say.
| benlivengood wrote:
| I think using LLMs/AI for pure mathematics is one of the very
| least ethically fraught use-cases. Creative works aren't being
| imitated, people aren't being deceived by hallucinations
| (literally by design; formal proof systems prevent it), from a
| safety perspective even a superintelligent agent that was truly
| limited to producing true theorems would be dramatically safer
| than other kinds of interactions with the world, etc.
| fsh wrote:
| I believe that it may be misguided to focus on compute that much,
| and it would be more instructive to consider the effort that went
| into curating the training set. The easiest way of solving math
| problems with an LLM is to make sure that _very similar_ problems
| are included in the training set. Many of the AI achievements
| would probably look a lot less miraculous if one could check the
| training data. The most crass example is OpenAI paying off the
| FrontierMath creators last year to get exclusive secret access to
| the problems before the evaluation [1]. Even without resorting to
| cheating, competition formats are vulnerable to this. It is
| extremely difficult to come up with truly original questions, so
| by spending significant resources on re-hashing all kinds of
| permutations of previous questions, one will probably end up
| _very close_ to the actual competition set. The first rule I
| learned about training neural networks is to make damn sure
| there is no overlap between the training and validation sets.
| It is interesting that this rule has gone completely out of
| the window in the age of LLMs.
|
| [1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-
| lesso...
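|
| A minimal sketch of the kind of train/eval overlap check implied
| by the no-overlap rule above (hypothetical helpers, in Python;
| real decontamination pipelines are far more involved than exact
| n-gram matching):
|
|     def ngrams(text, n=8):
|         toks = text.lower().split()
|         return {tuple(toks[i:i + n])
|                 for i in range(len(toks) - n + 1)}
|
|     def contamination_rate(train_docs, eval_docs, n=8):
|         # share of eval docs with any n-gram seen in training
|         seen = set()
|         for doc in train_docs:
|             seen |= ngrams(doc, n)
|         hits = sum(1 for doc in eval_docs if ngrams(doc, n) & seen)
|         return hits / max(len(eval_docs), 1)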
| OtherShrezzing wrote:
| > The easiest way of solving math problems with an LLM is to
| make sure that very similar problems are included in the
| training set. Many of the AI achievements would probably look a
| lot less miraculous if one could check the training data
|
| I'm fairly certain this phenomenon is responsible for LLM
| capabilities on GeoGuessr-type games. They have unreasonably
| good performance. For example, being able to identify obscure
| locations from featureless/foggy pictures of a bench.
| GeoGuessr's entire dataset, including GPS metadata, is
| definitely included in all of the frontier model training
| datasets - so it should be unsurprising that they have
| excellent performance in that domain.
| YetAnotherNick wrote:
| > GeoGuessr's entire dataset
|
| No, it is not included; however, there must be quite a lot of
| pictures on the internet for most cities. GeoGuessr's data is
| the same as Google's Street View data, and it probably
| contains billions of 360-degree photos.
| ivape wrote:
| I just saw a video on Reddit where a woman still managed to
| take a selfie while being literally face to face with a
| black bear. There's definitely way too much video training
| data out there for everything.
| lutusp wrote:
| > I just saw a video on Reddit where a woman still
| managed to take a selfie while being literally face to
| face with a black bear.
|
| This is not uncommon. Bears aren't always tearing people
| apart, that's a movie trope with little connection to
| reality. Black bears in particular are smart and social
| enough to befriend their food sources.
|
| But a hungry bear, or a bear with cubs, that's a
| different story. Even then bears may surprise you. Once
| in Alaska, a mama bear got me to babysit her cubs while
| she went fishing -- link:
| https://arachnoid.com/alaska2018/bears.html .
| suddenlybananas wrote:
| Why do you say it's not included? Why wouldn't they include
| it?
| sebzim4500 wrote:
| If every photo in streetview was included in the training
| data of a multimodal LLM it would be like 99.9999% of the
| training data/resource costs.
|
| It just isn't plausible that anyone has actually done
| that. I'm sure some people include a small sample of
| them, though.
| bluefirebrand wrote:
| Why would every photo in streetview be required in order
| to have Geoguessr's dataset in the training data?
| bee_rider wrote:
| I'm pretty sure they are saying that Geoguessr's just
| pulls directly from Google Streetview. There isn't a
| separate Geoguessr dataset, it just pulls from Google's
| API (at least that's what Wikipedia says).
| bluefirebrand wrote:
| I suspect that Geoguessr's dataset is a subset of Google
| Streetview, but maybe it really is just pulling
| everything directly
| bee_rider wrote:
| My guess would be that they pull directly from street-
| view, maybe with some extra filtering for interesting
| locations.
|
| Why bother to create a copy, if it can be avoided, right?
| ACCount36 wrote:
| People tried VLMs on "closed set" GeoGuessr-type tasks - i.e.
| non-Street View photos in similar style, not published
| anywhere.
|
| They still kicked ass.
|
| It seems like those AIs just have an awful lot of location
| familiarity. They've seen enough tagged photos to be able to
| pick up on the patterns, and generalize that to kicking ass
| at GeoGuessr.
| astrange wrote:
| > The easiest way of solving math problems with an LLM is to
| make sure that very similar problems are included in the
| training set.
|
| An irony here is that math blogs like Tao's might not be in LLM
| training data, for the same reason they aren't accessible to
| screen readers - they're full of math, and the math is rendered
| as images, so it's nonsense if you can't read the images.
|
| (The images on his blog do have alt text, but it's just the
| LaTeX code, which isn't much better.)
| prein wrote:
| What would be a better alternative to LaTeX for the alt text?
| I can't think of a solution that makes more sense; it provides
| an unambiguous representation of what's depicted.
|
| I wouldn't think an LLM would have any issue with that at all.
| I can see how a screen reader might, but it seems like the
| same problem a screen reader faces with any piece of code, not
| just LaTeX.
| QuesnayJr wrote:
| LLMs understand LaTeX extraordinarily well.
| MengerSponge wrote:
| LLMs are decent with LaTeX! It's just markup code after all.
| I've heard from some colleagues that they can do decent image
| to code conversion for a picture of an equation or even some
| handwritten ones.
| alansammarone wrote:
| As others have pointed out, LLMs have no trouble with LaTeX.
| I can see why one might think they're not - in fact, I made
| the same assumption myself some time ago. LLMs, via
| transformers, are exceptionally good at _any_ sequence or
| one-dimensional data. One very interesting (to me anyway)
| example is base64 - pick some not-huge sentence (say, 10
| words), base64-encode it, and just paste it in any LLM you
| want, and it will be able to understand it. Same works with
| hex, ascii representation, or binary _. Here's a sample if
| you want to try: aWYgYWxsIEEncyBhcmUgQidzLCBidXQgb25seSBzb21l
| IEIncyBhcmUgQydzLCBhcmUgYWxsIEEncyBDJ3M/IEFuc3dlciBpbiBiYXNlN
| jQu
|
| _ I remember running this experiment some time ago in a
| context where I was certain there was no possibility of tool
| use to encode/decode. Nowadays, it can be hard to be certain
| whether there is any tool use or not; in some cases, such as
| Mistral, the response is quick enough to make it unlikely
| there's any tool use.
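|
| For anyone who wants to repeat the experiment, a minimal sketch
| of building such a prompt (in Python; use any short question of
| the same flavor):
|
|     import base64
|
|     question = ("If all A's are B's, but only some B's are C's, "
|                 "are all A's C's? Answer in base64.")
|     print(base64.b64encode(question.encode()).decode())
|     # paste the output into a chat with no other context, then
|     # check the reply with base64.b64decode(reply).decode()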
| throwanem wrote:
| I've just tried it, in the form of your base64 prompt and
| no other context, with a local Qwen-3 30b instance that I'm
| entirely certain is not actually performing tool use. It
| produced a correct answer ("Tm8="), which in a moment of
| accidental comedy it spontaneously formatted with LaTeX.
| But it did talk about invoking an online decoder, just
| before the first appearance of the (nearly) complete
| decoded string in its CoT.
|
| It "left out" the A in its decode and still correctly
| answered the proposition, either out of reflexive
| familiarity with the form or via metasyntactic reasoning
| over an implicit anaphor; I believe I recall this to be a
| formulation of one of the elementary axioms of set theory,
| though you will excuse me for omitting its name before
| coffee, which makes the pattern matching possibility seem
| somewhat more feasible. ('Seem' may work a little too hard
| there. But a minimally more novel challenge I think would
| be needed to really see more.)
|
| There's lots of text in lots of languages about using an
| online base64 decoder, and nearly none at all about
| decoding the representation "in your head," which for
| humans would be a party trick akin to that one fellow who
| could see a city from a helicopter for 30 seconds and then
| perfectly reproduce it on paper from memory. It makes sense
| to me that a model trained on the Internet would "invent"
| the "metaphor" of an online decoder here, I think. What in
| its "experience" serves better as a description?
| constantcrying wrote:
| >(The images on his blog do have alt text, but it's just the
| LaTeX code, which isn't much better.)
|
| LLMs are extremely good at outputting LaTeX, ChatGPT will
| output LaTeX, which the website will render as such. Why do
| you think LLMs have trouble understanding it?
| astrange wrote:
| I don't think LLMs will have trouble understanding it. I
| think people using screen readers will. ...oh I see, I
| accidentally deleted the part of the comment about that.
|
| But the people writing the web page extraction pipelines
| also have to handle the alt text properly.
| mbowcut2 wrote:
| LLMs are better at LaTeX than humans. ChatGPT often writes
| LaTeX responses.
| neutronicus wrote:
| Yeah, it's honestly one of the things they're best at!
|
| I've been working on implementing some E&M simulations with
| Claude Code and it's so-so on the C++ and TERRIBLE at the
| actual math (multiplying a couple 6x6 matrix differential
| operators is beyond it).
|
| But I can dash off some notes and tell Claude to TeXify and
| the output is great.
| disruptbro wrote:
| Language modeling is compression: whittle down the graph to
| reduce duplication and data with little relationship:
| https://arxiv.org/abs/2309.10668
|
| Let's say everyone agrees to refer to one hosted copy of the
| token "cat", and instead generates a unique vector to
| represent their reference to "cat".
|
| Blam. Endless unique vectors which are nice and precise for
| parsing. No endless copies of arbitrary text like "cat".
|
| Now make that your globally distributed database to bootstrap
| AI chips from. The data-driven programming dream, where other
| machines on the network feed new machines as they bootstrap.
|
| The American tech industry is IBM now: stuck on the recent
| success of web SaaS and way behind on its AI plans.
| mhl47 wrote:
| Side note: What is going on with these comments on Mathstodon?
| From moon landing denials, to insults, allegations that he used
| AI to write this ... almost all of them are, to some degree,
| insane.
| nurettin wrote:
| That is what peak humanity looks like.
| Karrot_Kream wrote:
| I find the same kind of behavior on bigger Bluesky AI threads.
| I don't use Mathstodon (or actively follow folks on it) but I
| certainly feel sad to see similar replies there too. I
| speculate that folks opposed to AI are angry and take it out by
| writing these sorts of comments, but this is just my hunch.
| That's as much as I feel I should write about this without
| feeling guilty for derailing the discussion.
| ACCount36 wrote:
| No wonder. Bluesky is where insane Twitter people go when
| they get too insane for Twitter.
| dash2 wrote:
| Almost everywhere on the internet is like this. It's hn that is
| (mostly!) exceptional.
| f1shy wrote:
| The "mostly" there is so important! But also HN suffers from
| other problems (see in this thread the discussion about over
| policing comments, and calling fast hyperbolic and
| inflammatory).
|
| And don't get me started in the decline on depth in technical
| topics and soaring in political discussions. I came to HN for
| the first, not the second.
|
| So we are humans, there will never be a perfect forum.
| frumiousirc wrote:
| > So we are humans, there will never be a perfect forum.
|
| Perfect is in the eye of the moderator.
| hshshshshsh wrote:
| The truth is, both deniers and believers are operating on
| belief. Only those who actually went to the Moon know
| firsthand. The rest of us trust information we've received --
| filtered through media, education, or bias. That makes us not
| fundamentally different from deniers; we just think our belief
| is more justified.
| fc417fc802 wrote:
| Just to carry this line of reasoning out to the extreme for
| entertainment purposes (and to illustrate for everyone how
| misguided it is). Even if you perform a task firsthand, at
| the end of the day you're just trusting your memory of having
| done so. You feel that your trust in your memory is justified
| but fundamentally that isn't any different from the deniers
| either.
| hshshshshsh wrote:
| This is actually true. Plenty of accidents have happened
| because of this.
|
| I am not saying trusting your memory is always false or
| true. Most of the time it might be true. It's a heuristic.
|
| But if someone comes and denies what you did, the best course
| of action would be to consider the evidence they have and
| not assume they are stupid because they believe
| differently.
|
| Let's be honest, you have not personally gone and verified
| that the rocks belong to the Moon. Nor were you tracking the
| telemetry data on your computer when the rocket was going to
| the Moon.
|
| I also believe we went to the Moon.
|
| But all I have is beliefs.
|
| Everyone believed the Earth was flat thousands of years ago
| as well. They had solid evidence.
|
| But humility is accepting that you don't know, that you are
| believing, and not pretending you are above others who
| believe the exact opposite.
| fc417fc802 wrote:
| It's a misguided line of reasoning because the "belief"
| thing is a red herring. Nearly everything comes down to
| belief at a low level. The differences lie in the
| justifications.
|
| As you say, you should have the humility to consider the
| evidence that others provide that you might be wrong. The
| thing with the various popular conspiracy theories is
| that the evidence is conspicuously missing when any
| competent good faith actor would be presenting it front
| and center.
| esafak wrote:
| Some beliefs are more supported by evidence than others. To
| ignore this is to make the concept of belief practically
| useless.
| hshshshshsh wrote:
| Yeah. My point is you have not seen any of the evidence.
| You just have belief that evidence exists. Which is a
| belief and not evidence.
| esafak wrote:
| Yes, we have seen evidence: videos, pictures and other
| artifacts of the landing.
|
| I think you don't know what evidence means. You want
| _proof_ and that's for mathematics.
|
| You don't _know_ that you exist. You could be a
| simulation.
| andrepd wrote:
| Have you opened a twitter thread? People are insane on social
| media, why should open source social media be substantially
| different? x)
| f1shy wrote:
| I refrain from any of those (X, Mastodon, etc.), so let me ask
| a question:
|
| are they all equally bad? Or equally bad, but in different
| ways? E.g. I often read here that X has more disinformation
| and right-wing propaganda, while Mastodon was called out here
| on another topic.
|
| Maybe somebody active in different networks can answer that.
| fc417fc802 wrote:
| Moderation and the algorithms used to generate user feeds
| both have strong impacts. In the case of mastodon (ie
| activitypub) moderation varies wildly between different
| domains.
|
| But in general, I'd say that the microblogging format as a
| whole encourages a number of toxic behaviors and
| interaction patterns.
| miltonlost wrote:
| X doesn't let you use trans as a word and has Grok spewing
| right-wing propaganda (mechahitler?). That self-selects
| into the most horrible people being on X now.
| stared wrote:
| I agree that once a challenge shows something can be done at
| all (heavier-than-air flight, Moon landing, a gold medal at
| the IMO), the next question is whether it makes sense
| economically.
|
| I like the ARC-AGI approach for the reason that it shows both
| axes - score and price - and places a human benchmark on them.
|
| https://arcprize.org/leaderboard
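|
| The same idea in miniature (a sketch in Python with made-up
| numbers, not ARC-AGI data): keep only the models that no other
| model beats on both score and cost at once.
|
|     def pareto(models):
|         # models: list of (name, score, usd_cost_per_task)
|         keep = []
|         for name, s, c in models:
|             dominated = any(s2 >= s and c2 <= c and (s2, c2) != (s, c)
|                             for _, s2, c2 in models)
|             if not dominated:
|                 keep.append((name, s, c))
|         return keep
|
|     points = [("A", 0.60, 2.0), ("B", 0.55, 3.5), ("C", 0.75, 9.0)]
|     print(pareto(points))   # B is dominated by A and drops out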
| pama wrote:
| This sounds very reasonable to me.
|
| When considering top tier labs that optimize inference and own
| the GPUs: the electricity cost of USD 5000 at a data center with
| 4 cents per kWh (which may be possible to arrange or beat in some
| counties in the US with special industrial contracts) can produce
| about 2 trillion tokens for the R1-0528 model using 120kW draw
| for the B200 NVL72 hardware and the (still to be fully optimized)
| sglang inference pipeline:
| https://lmsys.org/blog/2025-06-16-gb200-part-1/
|
| Although 2T tokens is not unreasonable for being able to get high
| precision answers to challenging math questions, such a very high
| token number would strongly suggest there are lots of unknown
| techniques deployed at these labs.
|
| If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU,
| then the number of tokens for 5k USD shrinks dramatically to only
| 66B tokens, which is still high for usual techniques that try to
| optimize for a best single answer in the end, but perhaps
| plausible if the vast majority of these are intermediate thinking
| tokens and a lot of the value comes from LLM-based verification.
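|
| For reference, a sketch of the arithmetic behind the 2T and 66B
| figures, using only the numbers quoted above (the rental charge
| dominates, so electricity is ignored in the second step):
|
|     budget = 5000.0                      # USD
|     kwh_price, draw_kw = 0.04, 120.0     # 4 cents/kWh, NVL72 draw
|
|     hours_elec = budget / kwh_price / draw_kw    # ~1040 hours
|     tok_per_s = 2e12 / (hours_elec * 3600)       # ~5.3e5 tok/s/rack
|     print(tok_per_s / 72)                        # ~7.4e3 tok/s/GPU
|
|     rack_rate = 2.0 * 72                 # $2/h/GPU across 72 GPUs
|     hours_rent = budget / rack_rate              # ~35 hours
|     print(hours_rent * 3600 * tok_per_s)         # ~6.7e10 tokens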
| iloveoof wrote:
| Moore's Law for AI Progress: AI metrics will double every two
| years whether the AI gets smarter or not.
| akomtu wrote:
| The benchmarks should really add the test of data compression.
| Intelligence is mostly about discovering the underlying
| principles, the ability to see simple rules behind complex
| behaviors, and data compression captures this well. For example,
| if you can look at a dataset of planetary and stellar motions and
| compress it into a simple equation, you'd be considered wildly
| intelligent. If you can't remember and reproduce a simple
| checkerboard pattern, you'd be considered dumb. Another example
| is drawing a duck in SVG - another form of data compression. Data
| extrapolation, on the other hand, is the opposite problem, which
| can be solved by imitation or by understanding the rules
| producing the data. Only the latter deserves to be called
| intelligence. Note, though, that understanding the rules isn't
| always a superior method. When we are driving, we drive by
| imitation based on our extensive experience with similar
| situations, hardly understanding the physics of driving.
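|
| A crude illustration of that compression framing (a sketch in
| Python, with zlib standing in for "finding the rule behind the
| data"):
|
|     import random
|     import zlib
|
|     board = bytes((x + y) % 2 for y in range(64) for x in range(64))
|     noise = bytes(random.getrandbits(8) for _ in range(64 * 64))
|
|     for name, data in (("checkerboard", board), ("noise", noise)):
|         ratio = len(zlib.compress(data, 9)) / len(data)
|         print(name, round(ratio, 3))
|     # the checkerboard collapses to a tiny rule; the noise does not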
| js8 wrote:
| LLMs could be very useful in formalizing the problem and
| assumptions (conversion from natural language), but once the
| problem is described in a formal way (it can be described in
| some fuzzy logic), more reliable AI techniques should be
| applied.
|
| Interestingly, Tao mentions
| https://teorth.github.io/equational_theories/, and I believe this
| is better progress than LLMs doing math. I believe enhancing Lean
| with more tactics and formalizing those in Lean itself is a more
| fruitful avenue for AI in math.
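|
| As a toy illustration of that "formalize, then verify" split (a
| Lean 4 sketch, not taken from the linked project): the natural-
| language claim "adding zero changes nothing" becomes a statement
| the checker can certify independently of whoever proposed it.
|
|     -- informal claim: "n + 0 = n for every natural number n";
|     -- an LLM's job would be to produce the statement, and the
|     -- checker certifies the proof regardless of who proposed it.
|     theorem add_zero_example (n : Nat) : n + 0 = n := rfl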
| agentcoops wrote:
| I used to work quite extensively with Isabelle and as a
| developer on Sledgehammer [1]. There are well-known results,
| most obviously the halting problem, that mean fully-automated
| logical methods applied to a formalism with any expressive
| capability, i.e. that can be used to formalize non-trivial
| problems, simply can never fulfill the role you seem to be
| suggesting. The proofs that are actually generated in that way
| are, anyway, horrendous -- in fact, the problem I used to work
| on was using graph algorithms to try and simplify computer-
| generated proofs for human comprehension. That's the very
| reason that all the serious work has previously been on proof
| /assistants/ and formal validation.
|
| LLMs, especially in /conjunction/ with Lean for formal
| validation, are really an exciting new frontier in mathematics
| and it's a mistake to see that as just "unreliable" versus
| "reliable" symbolic AI etc. The OP Terence Tao has been pushing
| the edge here since day one and providing, I think, the most
| unbiased perspective on where things stand today, strengths as
| much as limitations.
|
| [1] https://isabelle.in.tum.de/website-
| Isabelle2009-1/sledgehamm...
| js8 wrote:
| LLMs (as well as humans) are algorithms like anything else
| and so they are also subject to the halting problem. I don't see
| what LLMs do that couldn't be in principle formalized as a
| Lean tactic. (IMHO LLMs are just learning rules - theorems of
| some kind of fuzzy logic - and then try to apply them using
| heuristic search to satisfy the goal. Unfortunately the rules
| learned are likely not fully consistent and so you get
| reasoning errors.)
| data_maan wrote:
| The concept of a pre-registered eval (analogous to a pre-
| registered study) will go a long way towards fixing this.
|
| More information
|
| https://mathstodon.xyz/@friederrr/114881863146859839
| kristianp wrote:
| It's going to take a large step up in transparency for AI
| companies to do this. It was back in the GPT-4 days that
| OpenAI stopped reporting model size, for example, and the
| others followed suit.
___________________________________________________________________
(page generated 2025-07-25 23:01 UTC)