[HN Gopher] S1: A $6 R1 competitor?
___________________________________________________________________
S1: A $6 R1 competitor?
Author : tkellogg
Score : 533 points
Date : 2025-02-05 11:05 UTC (11 hours ago)
(HTM) web link (timkellogg.me)
(TXT) w3m dump (timkellogg.me)
| bberenberg wrote:
| In case you're not sure what S1 is, here is the original paper:
| https://arxiv.org/html/2501.19393v1
| mi_lk wrote:
| it's also the first link in the article's first sentence
| bberenberg wrote:
| Good call, I must have missed it. I read the whole blog then
| went searching for what S1 was.
| addandsubtract wrote:
| It's linked in the blog post, too. In the first sentence,
| actually, but for some reason the author never bothered to
| attach the name to it. As if keeping track of o1, 4o, r1, r2d2,
| wasn't exhausting enough already.
| kgwgk wrote:
| > for some reason the author never bothered to attach the
| name to it
|
| Respect for his readers' intelligence, maybe.
| ttyprintk wrote:
| https://huggingface.co/simplescaling
| anentropic wrote:
| and: https://github.com/simplescaling/s1
| mettamage wrote:
| When you're only used to ollama, how do you go about using
| this model?
| davely wrote:
| I think we need to wait for someone to convert it into a
| GGUF file format.
|
| However, once that happens, you can run it (and any GGUF
| model) from Hugging Face![0]
|
| [0] https://huggingface.co/docs/hub/en/ollama
| mettamage wrote:
| So this?
|
| https://huggingface.co/brittlewis12/s1-32B-GGUF
| withinboredom wrote:
| oh god, this is terrible!
|
| I just said "Hello!" and it went off the rails.
| delijati wrote:
| why how what? can you add a sample prompt with output ?
| withinboredom wrote:
| In my original prompt, it gave me back multiple greetings
| in various languages. In this example, it gave me back
| just multiple greetings:
|
| >>> hello!
|
| Hello there! I'm here to help with any questions you
| might have. How can I assist you today? If you're asking
| a question, feel free to provide some details about what
| you need help with. I'll do my best to answer your
| question or guide you through the process. Just let me
| know!
|
| <|im_start|>think
|
| The user has asked "hello! " and provided no further
| information. Answer here is a direct response to the
| user's request for information, which was essentially
| just a greeting. The appropriate response is to greet
| back and offer assistance.
|
| <|im_start|>answer
|
| Answer:
|
| Hello! How can I assist you today? If you have a question
| or need help with something specific, feel free to ask.
| I'm here to support you!
|
| <|im_start|>think
|
| <|im_start|>answer
|
| The user's message was "hello!" and there was no other
| information provided. The appropriate response is a
| greeting back and an offer of assistance.
|
| <|im_start|>answer
|
| Answer:
|
| Hello! How can I assist you today? If you have a question
| or need help with something specific, feel free to ask.
| I'm here to support you!
|
| <|im_start|>think
|
| The user has greeted with "hello!" An appropriate
| response involves acknowledging the greeting and offering
| assistance. This could involve asking how we can help
| them or what they'd like guidance on.
|
| <|im_start|>answer
|
| Answer:
|
| Hello! How can I assist you today? If you have a question
| or need help with something specific, feel free to ask.
| I'm here to support you!
| fl0id wrote:
| You can load the safetensors with ollama, you just have
| to provide a Modelfile, or wait for someone to do it. It
| will in theory also quantize it for you, as I guess most
| people cannot load a 129 GB model...
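|
| A rough sketch of that Modelfile route (the weight path and
| quantization level are assumptions, not something from the
| thread):
|
|     # Modelfile: point at a local directory of safetensors
|     # weights; a chat TEMPLATE would normally be added too
|     FROM /path/to/s1-32B
|
|     # then, from the shell, build a quantized copy and run it:
|     # ollama create s1-32b --quantize q4_K_M -f Modelfile
|     # ollama run s1-32b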
| yapyap wrote:
| > If you believe that AI development is a prime national security
| advantage, then you absolutely should want even more money poured
| into AI development, to make it go even faster.
|
| This, this is the problem for me with people deep in AI. They
| think it's the end-all be-all for everything. They have the
| vision of the 'AI' they've seen in movies in mind, see the
| current 'AI' being used, and to them it's basically almost the
| same; their brain is mentally bridging the concepts and saying
| it's only a matter of time.
|
| To me, that's stupid. I observe the more populist and socially
| appealing CEOs of these VC startups (Sam Altman being the
| biggest, of course.) just straight up lying to the masses, for
| financial gain, of course.
|
| Real AI, artificial intelligence, is a fever dream. This is
| machine learning except the machines are bigger than ever before.
| There is no intellect.
|
| And the enthusiasm of the people who are into it feeds into
| those who aren't aware of it in the slightest: they see you can
| chat with a 'robot', they hear all this hype from their peers,
| and they buy into it. We are social creatures after all.
|
| I think using any of this in a national security setting is
| stupid, wasteful and very, very insecure.
|
| Hell, if you really care about being ahead, pour 500 billion
| dollars into quantum computing so you can try to break current
| encryption. That'll get you so much further than this
| nonsensical bs.
| mnky9800n wrote:
| Also, the narrative that we are currently on the brink of an AI
| explosion and that this random paper shows it is the same tired
| old story handed out by AI hawks for years now. Like yes, I
| agree with the general idea that more compute means more
| progress for humans, and perhaps having a more responsive user
| interface through some kind of AI type technology would be
| good. But I don't see why that will turn into Data from Star
| Trek. But I also think all these AI hawks kind of
| narcissistically overvalue their own being. Like blink and
| their lives are over in the grand scheme of things. Maybe our
| "awareness" of the world around us is an illusion provided by
| evolution because we needed it to value self-preservation
| whereas other animals don't. There is an inherent belief in the
| specialness of humans that I suppose I mistrust.
| encipriano wrote:
| I find the last part of the paragraph off-putting, and I agree
| ben_w wrote:
| > But I don't see why that will turn into Data from Star
| Trek.
|
| "Is Data genuinely sentient or is he just a machine with this
| impression" was a repeated plot point in TNG.
|
| https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre...
|
| https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_...
|
| https://en.wikipedia.org/wiki/The_Ensigns_of_Command
|
| https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T...
|
| Similar with The Doctor on VOY.
|
| Even then, what we have with LLMs is basically already at the
| level of the ship's main computer as it was written in
| TNG/DS9/VOY.
|
| But what counts as personhood? No idea. We're as ignorant of
| that as the writers of the TV show; humanity's best
| attempts at philosophy of mind are still making the same
| kinds of arguments today as we were a century ago, the only
| real shift is to be slightly more inclusive of other species
| besides Homo Sapiens, and the only real improvement over two
| centuries ago is to include _all_ of Homo Sapiens.
| mnky9800n wrote:
| The main computer does not make choices stochastically and
| always understands what people ask it. I do not think that
| resembles the current crop of LLMs. On Voyager the ship's
| computer is some kind of biological computing entity that
| they eventually give up on as a story topic, but there is an
| episode where the bio-computing gel packs get sick.
|
| I believe Data and the Doctor both would be people to me.
| But is Minuet? The woman created by the Bynars to distract
| Riker so they could steal the ship? Star Trek goes back and
| forth as to who counts as a person or not. This is one of
| the points made in the episode where the Doctor learns his
| hologram has been cloned many times to work as miners on an
| asteroid, and he says this is slavery. Does Minuet get to
| say she doesn't want to seduce Riker if she doesn't feel
| like it? Or is she a program that does whatever the
| programmer wants?
|
| But moreover, linking back to the original technology, does
| EQTransformer have awareness? Do we treat it as a thinking
| being that we enslaved to find earthquakes in data because
| it uses the same technology as ChatGPT? I don't think
| anyone thinks we should do this, but it seems like that is
| where people want to take it. That's strange to me. At what
| point do data pipelines doing some kind of stochastic
| transformation and summarization of training data become an
| individual that presents a desire for autonomy like Data or
| the Doctor?
|
| I think there's lots of questions here to answer and I
| don't know the answers to them.
| ben_w wrote:
| > The main computer does not make choices stochastically
| and always understands what people ask it.
|
| The mechanism is never explained, but no, it doesn't
| always understand correctly -- and neither does Data. If
| hologram-Moriarty is sentient (is he?), then the
| capability likely exceeds what current LLMs can do, but
| the cause of the creation is definitely a
| misunderstanding.
|
| Even in the episode where that happens, the script for Dr.
| Pulaski leading up to Moriarty's IQ boost used exactly the
| same arguments now made against LLMs:
| https://www.youtube.com/watch?v=4pYDy7vsCj8
|
| (Common trope in that era being that computers (including
| Data) are too literal, so there was also:
| https://www.youtube.com/watch?v=HiIlJaSDPaA)
|
| Similar with every time the crew work iteratively to
| create something in the holodeck. And, of course:
| https://www.youtube.com/watch?v=srO9D8B6dH4
|
| > I do not think that resembles the current crop of LLMs.
| On Voyager the ship's computer is some kind of biological
| computing entity that they eventually give up on as a
| story topic, but there is an episode where the
| bio-computing gel packs get sick.
|
| "Take the cheese to sickbay" is one of my favourite lines
| from that series.
|
| > But is Minuet?
|
| I would say the character was a puppet, with the Bynars
| pulling the strings, because the holo-character was
| immediately seen as lacking personhood the moment they
| stopped fiddling with the computer.
|
| Vic Fontaine was more ambiguous in that regard. He knew he
| was "a lightbulb", but (acted like) he wanted to remain
| within that reality in a way that to me felt like he was
| _*programmed*_ to respond as if the sim around him was
| the only reality that mattered rather than having free
| will in that regard.
|
| (But who has total free will? Humans are to holograms as
| Q is to humans, and the main cast were also written to
| reject "gifts" from Riker that time he briefly became a
| Q).
|
| The villagers of Fair Haven were, I think, not supposed
| to be sentient (from the POV of the crew), but were from
| the POV of the writers:
| https://en.wikipedia.org/wiki/Fair_Haven_(Star_Trek:_Voyager...
| and https://en.wikipedia.org/wiki/Spirit_Folk_(Star_Trek:_Voyage...
|
| > does EQTransformer have awareness?
|
| There's too many different definitions for a single
| answer.
|
| We don't know what part of our own brains gives us the
| sensation of our own existence; and even if we did, we
| wouldn't know if it was the only mechanism to do so.
|
| To paraphrase your own words:
|
| At what point do chemical pipelines doing some kind of
| stochastic transformation and electrochemical integration
| of sensory input become an individual that presents a
| desire for autonomy like Data or the Doctor?
|
| I don't know. Like you, I'd say:
|
| > I think there's lots of questions here to answer and I
| don't know the answers to them.
| spacebanana7 wrote:
| > I think using any of this in a national security setting is
| stupid
|
| What about AI enabled drones and guided missiles/rockets? The
| case for their effectiveness is relatively simple in terms of
| jamming resistance.
| pjc50 wrote:
| Like a lot of AI boosters, would you like to explain how that
| works other than magic AI dust? Some forms of optical
| guidance are already in use, but there are other limitations
| (lighting! weather!)
| spacebanana7 wrote:
| Sure thing. The basic idea would be:
|
| 1) Have a camera on your drone
|
| 2) Run some frames through a locally running version of
| something like AWS Rekognition's celebrity identification
| service, but for relevant military targets
|
| 3) Navigate towards the coordinates of target individuals
|
| It isn't exactly magic; here's a video of a guy doing
| navigation with OpenCV on images:
| https://www.youtube.com/watch?v=Nrzs3dQ9exw
| Hauthorn wrote:
| I believe this is a capability that the Switchblade 600
| or STM KARGU already has.
|
| https://en.wikipedia.org/wiki/STM_Kargu
| GTP wrote:
| This somehow reminds me of a certain killer robot from a
| Black Mirror episode ;)
| amarcheschi wrote:
| I would say that they don't require a $500bln investment.
| AFAIK, drones that help lock onto targets have started being
| used in Ukraine.
| spacebanana7 wrote:
| I generally agree, piggybacking on innovations in
| smartphone GPUs / batteries will probably be enough to get
| locally running AI models in drones.
| swiftcoder wrote:
| drone and missile guidance system development has been using
| ML for decades at this point. That's just as much "AI" as
| anything currently coming out of the LLM craze.
| int_19h wrote:
| It's not just target guidance at this point. There are
| prototypes of drone swarms, for example.
| pjc50 wrote:
| > then you absolutely should want even more money poured into
| AI development, to make it go even faster.
|
| Indeed. People are welcome to go "all in" on whatever nonsense
| gambling they want to do with their personal investments, but
| national security demands actually thinking about things -
| adversarially. Because the enemy will as well.
|
| It's perfectly possible to lose a war by investing in expensive
| superweapons that under deliver. The Nazis were particularly
| bad at this.
| amarcheschi wrote:
| I couldn't agree more.
|
| If we're not talking about cyber war exclusively, such as
| finding and exploiting vulnerabilities, then for the time being
| national security will still be based on a traditional army.
|
| Just a few weeks ago, Italy announced a EUR16bln plan to buy
| >1000 Rheinmetall IFVs. That alone would make Italy's army one
| of the best equipped in Europe. I can't imagine what would
| happen with a $500bln investment in defense, lol. I don't agree
| with what Meloni's government is doing, but one of the
| ministers I agree with more is the defense minister Crosetto.
|
| Furthermore, what is being shown, at least for the time being,
| is that open source can be and is crucial in helping develop
| better models. This collides with the big, single "one winner
| takes it all" VC mentality (because let's be honest, these
| defense pitches are still made by startup/VC bros).
| piltdownman wrote:
| > Italy announced a EUR16bln plan to buy >1000 Rheinmetall
| IFVs. That alone would make Italy's army one of the best
| equipped in Europe.
|
| So target practice for a beyond-the-horizon missile system
| launched ground-to-ground or air-to-ground? As an attacking
| force, conventional ground forces and tactics are a non-
| runner in a modern theatre of operations when faced against
| air and drone support. This is why no single EU country is
| incentivised into dumping money into any single area - as the
| only probable defense would be against USA/Russia/China to
| begin with.
|
| The US proved it beyond doubt in Afghanistan - partisans
| simply haven't a chance against a gunship with IR or NV
| optics; the last time they levelled the playing field against
| air interdictors was in Charlie Wilson's Afghanistan when the
| Mujahideen took on that era of Soviet gunships with hand-held
| AA systems.
| amarcheschi wrote:
| Italy's tanks have been a joke in the last decade(s); they
| clearly weren't on par with the other European alternatives,
| and the defense budget went to other places, namely the navy
| and air force. Since some of the Italian tanks are being
| decommissioned, there was a need for a new tank, but not
| having invested much in tanks had the effect of not having
| cutting-edge technology with which to develop an alternative.
| So the decision was taken to produce tanks with Rheinmetall
| (Leonardo was open to working with another German producer,
| but they didn't want to transfer technology and produce in
| Italy).
|
| Not having tanks is a big no-no given the latest events. I do
| not see the grudge you hold against them when it's clear they
| wouldn't be used in the same flat scenario as Ukraine or
| Poland. Given that Italy is highly mountainous, it made sense
| to prioritize the air force and navy first. I think they're
| also compatible with anti-drone guns.
|
| Piaggio Aerospace was acquired a few weeks ago by Baykar;
| hopefully it will produce drones.
|
| I do not get why you're saying that a change in the approach
| to war makes tanks not useful anymore, when it's clear they
| will eventually be adapted to counter drones and similar
| threats.
| fkyoureadthedoc wrote:
| > I can't imagine what would happen with a $500bln investment
| in defense, lol.
|
| The $90,000 bag of bushings becomes a $300,000 bag?
| amarcheschi wrote:
| No, at least in Italy we would create some gimmick consultant
| position for some ex-politician and repay him/her for past
| favours /s
| int_19h wrote:
| It's not one or the other, though. AI-controlled drones are
| already a thing in Ukraine, today.
| smcl wrote:
| Been saying this for years, it's been fucking baffling.
| Generating images, video and text that sort-of resembles what a
| human would come up with is genuinely quite impressive. It is
| not "let's claim it'll fix our country" (looking at you, Keir)
| impressive though, and I cannot believe so much money has been
| pumped into it.
| amarcheschi wrote:
| But you have to overpromise and underdeliver, otherwise you
| won't receive that sweet, sweet money
| pilingual wrote:
| > This is machine learning
|
| Yeah, I was thinking about this while trying to figure out
| author affiliations.
|
| There was a Stanford paper a few years ago that dusted off some
| old intelligence concepts and the authors seemed excited about
| it.
|
| But given the pace of AI, it's difficult to look in new
| directions. It will probably take an AI winter and some
| unbridled enthusiasm immune to burnout to make some real
| progress outside of feed forward neural networks.
| mrshadowgoose wrote:
| > They think it's the end all be all for everything.
|
| Is (human-based) general intelligence not one of the
| fundamental enabling elements of literally every human activity
| throughout history, regardless of how many layers of automation
| and technology one has to peel back to get to it?
|
| Can you maybe imagine how the ability to create arbitrary
| amounts of general intelligence, completely divorced from the
| normal lengthy biological process, could upend that foundation
| of human activity?
|
| > They have the vision of the 'AI' they've seen in movies in
| mind, see the current 'AI' being used and to them it's
| basically almost the same, their brain is mental bridging the
| concepts and saying it's only a matter of time.
|
| I've found that most AI-related movies exclusively focus on
| "quality ASI" scenarios, which are mostly irrelevant to our
| current state of the world, as an immense amount of
| danger/value/disruption will arrive with AGI. People who are
| seriously reasoning about the impacts of AGI are not using
| movies as references. "Those stupid movie watching idiots" is
| just a crutch you are using to avoid thinking about something
| that you disagree with.
|
| > Real AI, artificial intelligence, is a fever dream. This is
| machine learning except the machines are bigger than ever
| before. There is no intellect.
|
| Do you have any evidence to support this conclusion? And does
| it even matter? If "fake intellect" can replace a human, that
| human still has to deal with the very real issue of not having
| a job anymore. If "fake intellect" is used to conduct mass
| surveillance, and direct suppression activities towards
| divergent individuals, those individuals are still going to
| have a bad time.
| gessha wrote:
| >> Real AI, artificial intelligence, is a fever dream. This
| is machine learning except the machines are bigger than ever
| before. There is no intellect.
|
| > Do you have any evidence to support this conclusion? And
| does it even matter? If "fake intellect" can replace a human,
| that human still has to deal with the very real issue of not
| having a job anymore. If "fake intellect" is used to conduct
| mass surveillance, and direct suppression activities towards
| divergent individuals, those individuals are still going to
| have a bad time.
|
| I think the "fake intelligence can replace a human" needs
| more support in general. We know how human intellect works
| practically (not theoretically) and we know how to apply it
| in different scenarios. We're still far from knowing how
| "fake intelligence" works and how to apply it to different
| scenarios.
| encipriano wrote:
| You would solve the AI problem if you correctly defined what
| intellect even is.
| amelius wrote:
| Yes, I'd like to see some examples where our current AI can
| actually extrapolate rather than interpolate. Let it invent new
| things, new drawing styles, new story plots, etc. Maybe _then_
| it will impress me.
| mrshadowgoose wrote:
| Here you go:
| https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
| amelius wrote:
| I'm not convinced. This is using the tooling and paradigms
| invented by humans.
| ninetyninenine wrote:
| Science is a paradigm invented by humans. If a human uses
| this paradigm to invent something he is considered
| intelligent, but if an AI uses it, it's not? All humans use
| human paradigms and they are intelligent.
|
| If a human uses such a paradigm to succeed, it is the
| same.
| amelius wrote:
| Biology is a field where you can find new things
| relatively easily, if you follow existing protocols and
| do the hard lab work (which costs time and money).
| moffkalast wrote:
| Can you? Most people can't do anything that isn't 99%
| derivative.
| sidewndr46 wrote:
| What is even the possible usage of AI for national security?
| Generating pictures of kittens riding nuclear weapons to the
| very end like in Dr Strangelove?
| ben_w wrote:
| > What is even the possible usage of AI for national
| security? Generating pictures of kittens riding nuclear
| weapons to the very end like in Dr Strangelove?
|
| For all that critics of AI dismiss them as lacking
| imagination, your reaction suggests a lack of imagination.
|
| Off the top of my head: facial recognition and identification
| to make "smart" guns that hit specific targets with reduced
| collateral damage (as found on most digital cameras even
| before smartphones); creating and A/B testing propaganda
| campaigns; using modified wifi signals as wall-penetrating
| radar capable of pose estimation, heart rate and breathing
| monitoring[0]; take any self-driving car's AI and
| conditionally invert the part that says "don't hit
| pedestrians" when a certain target is spotted; ANPR to track
| specific vehicles with known owners over long distances;
| alternative targeting system for cruise missiles in the
| absence or jamming of GPS systems; using them as red teams in
| war-game exercises; using them to automate intrusion
| detection by monitoring for changes to background
| distributions of basically every measurable event; person-
| tracking by watching CCTV in secure areas; control systems
| for security robots (think Boston Dynamics' Spot) that are
| currently in deployment.
|
| There's likely a lot more, too.
|
| [0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...
| robwwilliams wrote:
| Lol: Where I live (Memphis) both "one" and "two" are
| considered two syllable words. Seriously. Our kids were
| taught this in the best public elementary school.
| esafak wrote:
| How does that work, one-uh, two-uh?
| robwwilliams wrote:
| egzacly
| moffkalast wrote:
| Gentlemen you can't use AI here, this is the deep learning
| lab!
| int_19h wrote:
| Autonomous weapons.
| snarf21 wrote:
| Agreed. I was working on some haiku things with ChatGPT and it
| kept telling me that busy has only one syllable. This is a
| trivially searchable fact.
| whimsicalism wrote:
| link a chat please
| snarf21 wrote:
| It wasn't just _busy_ that it failed on. I was feeding it
| haikus and wanted them broken into a list of 17
| words/fragments. Certain 2-syllable words weren't split and
| certain 1-syllable words were split into two.
| robwwilliams wrote:
| It used to be much easier to be conservative about AI,
| especially AGI, after living through three cycles of AI
| winters. No more. Dismissing it as "merely machine learning" is
| worse than unfair to the last decade of machine learning ;-)
|
| The hard part now is relatively trivial. Does anyone think that
| there is a fundamental and profound discovery that evolution
| made purely by selection in the last 200,000 years? I mean a
| true qualitative difference?
|
| Sure---We call it language, which is just another part of a
| fancy animal's tool kit.
|
| Does anyone think there is an amazing qualitative difference
| between the brain of a chimp and the brain of a human?
|
| No, not if they know any biology.
|
| (Although that does not stop some scientists from looking for a
| "language gene" like FOXP2.)
|
| So what did dumb mutations and 200,000 years of selection do
| that a group of dedicated AI scientists cannot do with their
| own genuine general intelligence?
|
| Nothing---nothing other than putting a compact energy efficient
| LLM with reinforcement learning on a good robotic body and
| letting it explore and learn like we did as infants, toddlers
| and teenagers.
|
| Each one of us has experienced becoming a "general
| intelligence". I remember it hit me on the head in 6th grade
| when I dreamed up a different way of doing long division. I
| remember thinking: "How did I think that?" And each one of us
| who has watched an infant turn into a toddler has watched it as
| an observer or teacher. This is what makes babies so
| fascinating to "play" with.
|
| We have to give our baby AGI a private memory and a layer of
| meta-attention like we all gain as we mature, love, and
| struggle.
|
| I read the linked article and as a neuroscientist I realized
| the "wait" cycles that improved performance so much is roughly
| equivalent to the prefrontal cortex: the part of the CNS most
| responsible for enabling us to check our own reasoning
| recursively. Delay--as in delayed gratification---is a key
| attribute of intelligent systems.
|
| We are finally on the doorstep of Hofstadter's Strange Loop
| and Maturana and Varela's "enactive" systems, but now
| implemented in silicon, metal, and plastic by us rather than
| by dumb but very patient natural selection.
|
| Karl Friston and Demis Hassabis (two very smart
| neuroscientists) figured this out years ago. And they were
| preceded by three other world-class neuroscientists: Humberto
| Maturana, Francisco Varela, and Rich Sutton (honorary
| neuroscientist). And big
| credit to Terry Winograd for presaging this path forward long
| ago too.
| baq wrote:
| I can only say that exponential curves look like they grow
| sublinearly before they take off. AI is not quite at the
| obvious take-off point, but owners of the biggest clusters have
| seen the extrapolations and it isn't pretty - once your
| competitor achieves take-off and you aren't anywhere close,
| you're done for. The risks of not participating in that are too
| great.
| menaerus wrote:
| You can choose to be somewhat ignorant of the current state in
| AI, about which I could also agree that at certain moments it
| appears totally overhyped, but the reality is that there hasn't
| been a bigger technology breakthrough probably in the last ~30
| years.
|
| This is not "just" machine learning because we have never been
| able to do things which we are today and this is not only the
| result of better hardware. Better hardware is actually a
| byproduct. Why build a PFLOPS GPU when there is nothing that
| can utilize it?
|
| If you spare yourself some time and read through the actual
| (scientific) papers of multiple generations of LLM models, the
| first one being from Google ~~not DeepMind~~ in 2017, you might
| get to understand that this is no fluff.
|
| And I'm saying this from the position of a software engineer,
| without bias.
|
| The reason all this really took off with so much speed is the
| not quite expected results - early LLM experiments showed that
| "knowledge" with the current transformer architecture can scale
| linearly with the amount of compute and training time, etc.
| That was very unexpected, and to this day scientists do not
| have an answer for why this even works.
|
| So, after reading a bunch of material I am inclined to think
| that this is something different. The future of loading the
| codebase into the model and asking the model to explain the
| code to me or fix bugs has never been so close and realistic.
| For better or worse.
| whimsicalism wrote:
| > the first one being from DeepMind in 2017
|
| ? what paper are you talking about
| menaerus wrote:
| https://arxiv.org/abs/1706.03762
| whimsicalism wrote:
| oh gotcha. maybe pedantic but that is not a deepmind
| paper
| menaerus wrote:
| I stand corrected, not sure why I was under the
| impression that this paper came out of DeepMind lab.
| mitchelld wrote:
| This line of thinking doesn't really correspond to the reason
| Transformers were developed in the first place, which was to
| better utilize how GPUs do computation. RNNs were too slow to
| train at scale because you had to sequentially compute the
| time steps, Transformers (with masking) can run the input
| through in a single pass.
|
| It is worth noting that the first "LLM" you referring to was
| only 300M parameters, but even then the amount of training
| required (at the time) was such that training a model like
| that outside of a big tech company was infeasible. Obviously
| now we have models that are in the hundreds of billions /
| trillions of parameters. The ability to train these models is
| directly a result of better / more hardware being applied to
| the problem as well as the Transformer architecture
| specifically designed to better conform with parallel
| computation at scale.
|
| The first GPT model came out ~ 8 years ago. I recall when
| GPT-2 came out they initially didn't want to release the
| weights out of concern for what the model could be used for,
| looking back now that's kind of amusing. However,
| fundamentally, all these models are the same setup as what
| was used then, decoder based Transformers. They are just
| substantially larger, trained on substantially more data,
| trained with substantially more hardware.
| menaerus wrote:
| What line of thinking are you referring to?
|
| Transformers were aimed to solve the "context" problem and
| authors, being aware that RNNs don't scale at all neither
| do they solve that particular problem, had to come up with
| the algorithm that overcomes both of those issues. It
| turned out that the self-attention compute-scale was the
| crucial ingredient to solve the problem, something that
| RNNs were totally incapable of.
|
| They modeled the algorithm to run on the hardware they had
| available at that time, but the hardware developed
| afterwards was a direct consequence, or as I called it a
| byproduct, of transformers proving themselves able to
| continuously scale. Had that not been true, we wouldn't have
| all those iterations of NVidia chips.
|
| So, although one could say that the NVidia chip design is
| what enabled the transformers' success, one could also say
| that we wouldn't have those chips if transformers didn't
| prove themselves to be so damn efficient. And I'm inclined
| to think the latter.
| mitchelld wrote:
| > This is not "just" machine learning because we have
| never been able to do things which we are today and this
| is not only the result of better hardware. Better
| hardware is actually a byproduct. Why build a PFLOPS GPU
| when there is nothing that can utilize it?
|
| This is the line of thinking I'm referring to.
|
| The "context" problem had already been somewhat solved.
| The attention mechanism existed prior to Transformers and
| was specifically used on RNNs. They certainly improved
| it, but the innovation of the architecture was making it
| computationally efficient to train.
|
| I'm not really following your argument. Clearly you're
| acknowledging that it was first the case that, with the
| hardware at the time, researchers demonstrated that
| simply scaling up training with more data yielded better
| models. The fact that hardware was then optimized for
| these architectures only reinforces this point.
|
| All the papers discussing scaling laws point to the same
| thing, simply using more compute and data yields better
| results.
|
| > this is not only the result of better hardware
|
| Regarding this in particular. A majority of the
| improvement from GPT-2 to GPT-4 was simply training on a
| much larger scale. That was enabled by better hardware
| and lots of it.
| dotancohen wrote:
| > Real AI, artificial intelligence, is a fever dream. This is
| machine learning except the machines are bigger than ever
| before. There is no intellect.
|
| That sounds to me like dismissing the idea that a Russian SSBN
| might cross the Pacific and nuke Los Angeles because
| "submarines can't swim".
|
| Even if the machine learning isn't really intelligent, it is
| still capable of performing IF..THEN..ELSE operations, which
| could have detrimental effects for [some subset of] humans.
|
| And even if you argue that such a machine _shouldn't_ be used
| for whatever doomsday scenario would harm us, rest assured that
| someone, somewhere, who either does not understand what the
| machines are designed to do or just pretends that they work
| like magic, will put the machines in a position to make such a
| decision.
| UncleEntity wrote:
| One could hope...
|
| Even at the height of the Cold War there was always a human
| between <leader presses button> and <nukes go aflyin'>.
|
| --edit--
|
| ...which has me wondering if a president even has the
| constitutional authority to destroy the entire planet and if
| one could interpret their command as a 'lawful order'. Makes
| one think.
| willglynn wrote:
| On the topic of fail-deadly nukes:
|
| https://en.wikipedia.org/wiki/Dead_Hand
| ninetyninenine wrote:
| I agree AGI won't solve national security, but saying this
| isn't intelligence is false.
|
| This is AI, and trend lines point to an intelligence that
| matches or barely exceeds human intellect in the future.
|
| You're part of a trend of people in denial. When LLMs first
| came out there were hordes of people on HN claiming it was just
| a stochastic parrot and LLMs displayed zero intellectual
| ability. It is now abundantly clear that this is not true.
|
| We don't fully understand LLMs. That's why gains like COT are
| just black box adjustments that come from changing external
| configurations. We have no way to read the contents of the
| black box and make adjustments off of it. Yet idiots like you
| can make such vast and hard claims when nobody really fully
| understands these things. You're delusional.
|
| I agree that LLMs won't allow us to make some super weapon to
| give us some edge in national security.
| naasking wrote:
| > This is machine learning except the machines are bigger than
| ever before. There is no intellect.
|
| Define "intellect".
| GTP wrote:
| Sorry for being lazy, but I just don't have the time right now to
| read the paper. Is there in the paper or somewhere else a
| comparison based on benchmarks of S1 vs R1 (the full R1, not
| quantized or distilled)?
| pama wrote:
| The S1 paper is not meant to compete with R1. It simply shows
| that with 1k well curated examples for finetuning (26 minutes
| training on 16 GPUs) and with a simple hack for controlling the
| length of the thinking process, one can dramatically increase
| the performance of a non-reasoning model and show a clear
| increase in benefit with increased test-time compute. It is
| worth a quick skim.
| swiftcoder wrote:
| > having 10,000 H100s just means that you can do 625 times more
| experiments than s1 did
|
| I think the ball is very much in their court to demonstrate they
| actually are using their massive compute in such a productive
| fashion. My BigTech experience would tend to suggest that
| frugality went out the window the day the valuation took off, and
| they are in fact just burning compute for little gain, because
| why not...
| whizzter wrote:
| Mainly it points to a non-scientific "bigger is better"
| mentality, and the researchers probably didn't mind playing
| around with the power because "scale" is "cool".
|
| Remember that the Lisp AI-lab people were working on unsolved
| problems on absolute potatoes of computers back in the day; we
| have a semblance of a solution now, but so much of it has been
| brute force (even if there have been improvements in the
| field).
|
| The big question is whether this insane spending has pulled the
| rug on real progress if we head into another AI winter of
| disillusionment, or whether there is enough real progress just
| around the corner to show that there is hope for investors in a
| post-DeepSeek valuation hangover.
| wongarsu wrote:
| We are in a phase where costs are really coming down. We had
| this phase from GPT2 to about GPT4 where the key to building
| better models was just building bigger models and training
| them for longer. But since then a lot of work has gone into
| distillation and other techniques to make smaller models more
| capable.
|
| If there is another AI winter, it will be more like the
| dotcom bubble: lots of important work got done in the dotcom
| bubble, but many of the big tech companies started from the
| fruits of that labor in the decade after the bubble burst
| svantana wrote:
| Besides that, AI training (aka gradient descent) is not really
| an "embarrassingly parallel" problem. At some point, there are
| diminishing returns on adding more GPUs, even though a lot of
| effort is going into making it as parallel as possible.
| janalsncm wrote:
| What? It definitely is.
|
| Data parallelism, model parallelism, parameter server to
| workers, MoE itself can be split up, etc.
|
| But even if it wasn't, you can simply parallelize training
| runs with slight variations in hyperparameters. That is what
| the article is describing.
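|
| As a toy illustration of that last point, this is roughly what
| an embarrassingly parallel hyperparameter sweep looks like
| (train_one_run and the grid values are placeholders, not
| anything from the article):
|
|     from concurrent.futures import ProcessPoolExecutor
|     from itertools import product
|
|     def train_one_run(config):
|         lr, batch_size = config
|         # stand-in for a full training run; would return a
|         # validation score for this hyperparameter choice
|         return 0.0
|
|     # slight variations in hyperparameters, one run each
|     grid = list(product([1e-4, 3e-4, 1e-3], [16, 32]))
|
|     if __name__ == "__main__":
|         # runs are independent, so they scale out trivially
|         with ProcessPoolExecutor(max_workers=len(grid)) as pool:
|             scores = list(pool.map(train_one_run, grid))
|         best = grid[scores.index(max(scores))]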
| gessha wrote:
| This is pure speculation on my part but I think at some point a
| company's valuation became tied to how big their compute is so
| everybody jumped on the bandwagon.
| syntaxing wrote:
| Matt Levine tangentially talked about this during his podcast
| this past Friday (or was it the one before?). It was a good
| way to value these companies according to their compute size
| since those chips are very valuable. At a minimum, the chips
| are an asset that acts as a collateral.
| jxdxbx wrote:
| I hear this a lot, but what the hell. It's still computer
| chips. They depreciate. Short supply won't last forever.
| Hell, GPUs burn out. It seems like using ice sculptures as
| collateral, and then spring comes.
| baxtr wrote:
| If so wouldn't it be the first time in history when more
| processing power is not used?
|
| In my experience CPU/GPU power is used up as much as
| possible. Increased efficiency just leads to more demand.
| littlestymaar wrote:
| I think you're missing the point: the H100 isn't going to
| remain useful for a long time; would you consider Tesla
| or Pascal graphics cards collateral? That's what those
| H100s will look like in just a few years.
| ijidak wrote:
| Yeah, exactly! I've got some 286, 386, and 486 CPUs that
| I want to claim as collateral!
| baxtr wrote:
| Not sure I do tbh.
|
| Any asset depreciates over time. But they usually get
| replaced.
|
| My 286 was replaced by a faster 386 and that by an even
| faster 486.
|
| I'm sure you see a naming pattern there.
| kgwgk wrote:
| > Any asset depreciates over time.
|
| That's why "those chips are very valuable" is not
| necessarily a good way to value companies - and it isn't
| if they can extract the value from the chips before they
| become worthless.
|
| > But they usually get replaced.
|
| They usually produce enough income to cover depreciation
| so you actually have the cash to replace them.
| littlestymaar wrote:
| And that's why such assets represent only a marginal
| part of valuation. (And if you look at accounting, this
| depreciation is usually done over three years for IT
| hardware, and as such most of these chips have already
| lost half of their accounting value in the balance
| sheet).
| baq wrote:
| My 1070 was replaced by... nothing, I moved it from a
| haswell box to an alder lake box.
|
| Given that inference time will soon be extremely valuable
| with agents and <thinking> models, H100s may yet be worth
| something in a couple years.
| sixothree wrote:
| Year over year gains in computing continue to slow. I
| think we keep forgetting that when talking about these
| things as assets. The thing controlling their value is
| the supply which is tightly controlled like diamonds.
| adrianN wrote:
| They have a fairly limited lifetime even if progress
| stands still.
| throwup238 wrote:
| Last I checked AWS 1-year reserve pricing for an 8x H100
| box more than pays for the capital cost of the whole box,
| power, and NVIDIA enterprise license, with thousands left
| over for profit. On demand pricing is even worse. For
| cloud providers these things pay for themselves quickly
| and print cash afterwards. Even the bargain basement
| $2/GPU/hour pays it off in under two years.
| sdenton4 wrote:
| Labor! You need it to turn the bill of sale into a data
| center and keep it running. The bargain basement would be
| even cheaper otherwise...
| spamizbad wrote:
| > Year over year gains in computing continue to slow.
|
| This isn't true in the AI chip space (yet). And so much
| of this isn't just about compute but about the memory.
| eek2121 wrote:
| From a per mm2 performance standpoint things absolutely
| have slowed considerably. Gains are primarily being eked
| out via process advantage (which has slowed down) and
| larger chips (which has an ever-shrinking limit depending
| on the tech used)
|
| Chiplets have slowed the slowdown in AI, but you can see
| in the gaming space how much things have slowed to get an
| idea of what is coming for enterprise.
| ijidak wrote:
| Honestly, I don't fully understand the reason for this
| shortage.
|
| Isn't it because we insist on only using the latest nodes
| from a single company for manufacture?
|
| I don't understand why we can't use older process nodes
| to boost overall GPU making capacity.
|
| Can't we have tiers of GPU availability?
|
| Why is Nvidia not diversifying aggressively to Samsung
| and Intel, no matter the process node?
|
| Can someone explain?
|
| I've heard packaging is also a concern, but can't you get
| Intel to figure that out with a large enough commitment?
| nl wrote:
| > Isn't it because we insist on only using the latest
| nodes from a single company for manufacture?
|
| TSMC was way ahead of anyone else introducing 5nm.
| There's a long lead time porting a chip to a new process
| from a different manufacturer.
|
| > I don't understand why we can't use older process nodes
| to boost overall GPU making capacity.
|
| > Can't we have tiers of GPU availability?
|
| NVidia do this. You can get older GPUs, but more
| performance is better for performance sensitive
| applications like training or running LLMs.
|
| Higher performance needs better manufacturing processes.
| ecocentrik wrote:
| That is the wrong take. Depreciated and burned-out chips
| are replaced, and total compute value typically
| increases over time.
| and projected over time. Seasons are inevitable and
| cyclical. Spring might be here but winter is coming.
| SecretDreams wrote:
| > It was a good way to value these companies according to
| their compute size since those chips are very valuable.
|
| Are they actually, though? Presently yes, but are they
| actually driving ROI? Or are they just an asset nobody is
| really meaningfully utilizing, but that helps juice the stocks?
| ijidak wrote:
| I asked this elsewhere, but, I don't fully understand the
| reason for the critical GPU shortage.
|
| Isn't it because NVIDIA insists on only using the latest
| nodes from a single company (TSMC) for manufacture?
|
| I don't understand why we can't use older process nodes to
| boost overall GPU making capacity.
|
| Can't we have tiers of GPU availability, some on cutting-
| edge nodes, others built on older Intel and Samsung nodes?
|
| Why is Nvidia not diversifying aggressively to Samsung and
| Intel, no matter the process node?
|
| Can someone explain?
|
| I've heard packaging is also a concern, but can't you get
| Intel to figure that out with a large enough commitment?
|
| (Also, I know NVIDIA has some capacity on Samsung. But why
| not go all out, even using Global Foundries?)
| aorloff wrote:
| If you are a cloud provider renting them out
|
| Otherwise you'd better keep them humming trying to find a
| business model, because they certainly aren't getting any
| newer as chips.
| dghlsakjg wrote:
| That's a great way to value a company that is going
| bankrupt.
|
| But, I'm not going to value an operating construction
| company based on how many shovels or excavators they own.
| I'm going to want to see them putting those assets to
| productive use.
| JKCalhoun wrote:
| So, "No one was ever fired for ... buying more server
| infrastructure."
| genewitch wrote:
| Walmart has massive, idle datacenters full of running
| machines doing nothing.
| tyfon wrote:
| I don't think you need to speculate too hard. On CNBC they
| are not tracking revenue, profits or technical breakthroughs,
| but how much the big companies are spending (on gpus). That's
| the metric!
| Mistletoe wrote:
| This feels like one of those stats they show from 1929 and
| everyone is like "and they didn't know they were in a
| bubble?"
| RobotToaster wrote:
| "But tulip sales keep increasing!"
| LeifCarrotson wrote:
| I probably don't have to repeat it, but this is a perfect
| example of Goodhart's Law: when a metric is used as a
| target, it loses its effectiveness as a metric.
|
| If you were a reporter who didn't necessarily understand
| how to value a particular algorithm or training operation,
| but you wanted a simple number to compare the amount of
| work OpenAI vs. Google vs Facebook are putting into their
| models, yeah, it makes sense. How many petaflops their
| datacenters are churning through in aggregate is probably
| correlated to the thing you're trying to understand. And
| it's probably easier to look at their financials and
| correlate how much they've spent on GPUs to how many
| petaflops of compute they need.
|
| But when your investors are giving you more money based on
| how well they perceive you're doing, and their perception
| is not an oracle but is instead directly based on how much
| money you're spending... the GPUs don't actually need to do
| anything other than make number go up.
| B56b wrote:
| They absolutely are tracking revenues/profits on CNBC, what
| are you talking about?
| ur-whale wrote:
| > but how much the big companies are spending (on gpus).
| That's the metric!
|
| Burn rate based valuations!
|
| The 2000's are back in full force!
| jerf wrote:
| This claim is mathematically nonsensical. It implies a more-or-
| less linear relationship, that more is always better. But
| there's no reason to limit that to H100s. Conventional servers
| are, if anything, rather more established in their ability to
| generate value, by which I mean, however much potential AI
| servers may have to be more important than conventional servers
| that they may manifest in the future, we know how to use
| conventional servers to generate value _now_.
|
| And thus, by this logic, every company in the world should just
| be buying as many servers as they can get their hands on,
| because More Servers = More Value.
|
| Obviously, this is not happening. It doesn't take much analysis
| to start listing the many and manifold reasons why. Many of
| those reasons will apply to GPUs as well. Just as if everything
| in AWS got 10x faster, overnight, this would not create a
| situation where everyone suddenly starts grabbing _more_
| servers in AWS. Obviously everyone would start trimming down,
| even if perhaps in a few years time they 'd find some way to
| use this burst of power such that they can use more later. This
| can't happen overnight, though. It would take time, and not
| "weeks" or "months" but "years" at scale.
|
| Incorporating the important variable of _time_ in the analysis,
| if AIs become literally hundreds of times cheaper to run,
| _today_, then it is perfectly logical that the near-term
| demand for the hardware to run them is also going to go way,
| way down. However much _potential_ AI may have, it is fairly
| clear looking out at the AI landscape right now that there
| isn't really anyone out there unlocking vast amounts of value and
| sitting there wringing their hands because they just can't get
| more GPU compute. The GPU rush has been from fear that someone
| will figure out how to "really" unlock AI and then they'll be
| stuck without the hardware to compete.
|
| It may be the case that vastly cheaper AI will in fact be part
| of unlocking that value, and that as the AI industry grows it
| will grow faster as a result... but that's still going to be on
| a multi-year time frame, not a tomorrow time frame. And all
| those GPUs and all those valuations are still broadly based on
| them being valuable _real soon now_, not in a few years, and
| all those GPU purchases are on the assumption they need them
| _now_, or on a timeframe where we can't be waiting around,
| rather than waiting for some rounds of exponential doublings to
| bring price down. The hardware curve in 5 years may be higher
| but the curve in the next _year_ would be lower, and by a
| _lot_.
|
| And, you know, who's to say we're done? I doubt there's another
| 100x in there, but is someone going to eke out another 2x
| improvement? Or a 10x improvement? Making it easier to run lots
| of experiments makes it much more likely for that to happen.
| I'm skeptical of another 10x general improvement but 10x
| improvements for specific, important use cases I can't rule
| out.
|
| Edit: I should also point out this is an extremely common
| pattern in technology in general. Often the very hardest part
| is producing a thing that does a particular task _at all_. Once
| we have it in hand, once we can use it and learn how it
| operates and what its characteristic operating modes are, once
| we can try modifications to it in the real world and see what
| happens, _optimizing_ it becomes much easier, sometimes
| explosively so by comparison. Taking any first iteration of a
| tech that is practical and then trying to straight-line demand
| based on it is silly, in all sorts of ways and all directions.
| The internal combustion engine, for example, has had a myriad
| of impacts on the world and certainly after various
| improvements many, many millions if not billions of them have
| been made... but any company that reacted to the first couple
| of cars and just went _ballistic_ buying those first-generation
| internal combustion engines would have lost everything, and
| rather quickly.
| deadbabe wrote:
| For starters every employee has an H100 under their desk.
| HenryBemis wrote:
| > Going forward, it'll be nearly impossible to prevent distealing
| (unauthorized distilling). One thousand examples is definitely
| within the range of what a single person might do in normal
| usage, no less ten or a hundred people. I doubt that OpenAI has a
| realistic path to preventing or even detecting distealing outside
| of simply not releasing models.
|
| (sorry for the long quote)
|
| I will say (naively perhaps) "oh but that is fairly simple". For
| any API request, add a counter of 5 seconds to the next for
| 'unverified' users. Make the "blue check" (a-la X/Twitter). For
| the 'big sales' have a third-party vetting process so that if US
| Corporation XYZ wants access, they prove themselves worthy/not
| Chinese competition and then you do give them the 1000/min deal.
|
| For everyone else, add the 5 second (or whatever other duration
| makes sense) timer/overhead and then see them drop from 1000
| requests per minutes to 500 per day. Or just cap them at 500 per
| day and close that back-door. And if you get 'many cheap
| accounts' doing hand-overs (AccountA does 1-500, AccountB does
| 501-1000, AccountC does 1001-1500, and so on) then you mass block
| them.
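|
| A rough sketch of that kind of tiering, assuming a per-account
| daily cap plus an added delay for unverified accounts (all the
| names and numbers here are illustrative, not from any actual
| API):
|
|     import time
|     from collections import defaultdict
|
|     DAILY_CAP_UNVERIFIED = 500   # requests/day, unverified
|     EXTRA_DELAY_SECONDS = 5.0    # the "5 second" overhead
|
|     requests_today = defaultdict(int)  # account_id -> count
|
|     def handle_request(account_id, verified, serve):
|         """Throttle unverified accounts; vetted ones pass."""
|         if not verified:
|             if requests_today[account_id] >= DAILY_CAP_UNVERIFIED:
|                 return "rate limit exceeded"
|             requests_today[account_id] += 1
|             time.sleep(EXTRA_DELAY_SECONDS)
|         return serve()
|
| In practice the counter would reset daily and live in shared
| storage rather than process memory, which this sketch skips.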
| mark_l_watson wrote:
| Off topic, but I just bookmarked Tim's blog, great stuff.
|
| I dismissed the X references to S1 without reading them, big
| mistake. I have been working generally in AI for 40 years and
| neural networks for 35 years and the exponential progress since
| the hacks that make deep learning possible has been breathtaking.
|
| Reduction in processing and memory requirements for running
| models is incredible. I have been personally struggling with
| creating my own LLM-based agents with weaker on-device models (my
| same experiments usually work with 4o-mini and above models) but
| either my skills will get better or I can wait for better on
| device models.
|
| I was experimenting with the iOS/iPadOS/macOS app On-Device AI
| last night and the person who wrote this app was successful in
| getting web search tool calling to work with a very small model
| - something that I have been trying to perfect.
| cowsaymoo wrote:
| The part about taking control of a reasoning model's output
| length using <think></think> tags is interesting.
|
| > In s1, when the LLM tries to stop thinking with "</think>",
| they force it to keep going by replacing it with "Wait".
|
| I found a few days ago that this lets you 'inject' your own
| CoT and jailbreak it more easily. Maybe these are related?
|
| https://pastebin.com/G8Zzn0Lw
|
| https://news.ycombinator.com/item?id=42891042#42896498
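|
| For what it's worth, the forcing trick from the quote can be
| wired up around any generation call along these lines (the
| generate() helper, the delimiters, and the budget are
| assumptions for illustration, not the actual s1 code):
|
|     def think_with_budget(prompt, generate,
|                           min_thinking_tokens=1000):
|         # keep the model "thinking" by swapping its attempt to
|         # close the reasoning section for the word "Wait"
|         text = prompt + "<think>"
|         used = 0
|         while used < min_thinking_tokens:
|             # generate until the model tries to emit "</think>"
|             chunk = generate(text, stop="</think>")
|             text += chunk
|             used += len(chunk.split())  # crude token count
|             if used < min_thinking_tokens:
|                 text += "Wait"          # forbid stopping early
|         return generate(text + "</think>")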
| causal wrote:
| This even points to a reason why OpenAI hides the "thinking"
| step: it would be too obvious that the context is being
| manipulated to induce more thinking.
| Havoc wrote:
| The point about agents to conceal access to the model is a good
| one.
|
| Hopefully we won't lose all access to models in future
| cyp0633 wrote:
| Qwen's QvQ-72B does much more "wait"s than other LLMs with CoT I
| tried, maybe they've somewhat used that trick already?
| theturtletalks wrote:
| DeepSeek R1 uses <think/> and "wait", and you can see it
| second-guessing itself in the thinking tokens. How does the
| model know when to wait?
|
| These reasoning models feed into OP's last point about NVidia
| and OpenAI data centers not being wasted, since reasoning
| models require more tokens and faster tps.
| qwertox wrote:
| Probably when it would expect a human to second guess himself,
| as shown in literature and maybe other sources.
| UncleEntity wrote:
| From playing around they seem to 'wait' when there's a
| contradiction in their logic.
|
| And I think the second point is due to The Market thinking
| there is no need to spend ever increasing amounts of compute to
| get to the next level of AI overlordship.
|
| Of course Jevons paradox is also all in the news these days...
| pona-a wrote:
| If chain of thought acts as a scratch buffer by providing the
| model more temporary "layers" to process the text, I wonder if
| making this buffer a separate context with its own separate FNN
| and attention would make sense; in essence, there's a
| macroprocess of "reasoning" that takes unbounded time to
| complete, and then there's a microprocess of describing this
| incomprehensible stream of embedding vectors in natural language,
| in a way returning to the encoder/decoder architecture but where
| both are autoregressive. Maybe this would give us a denser
| representation of said "thought", not constrained by imitating
| human text.
| bluechair wrote:
| I had this exact same thought yesterday.
|
| I'd go so far as to add one more layer to monitor this one and
| stop adding layers. My thinking is that this meta awareness is
| all you need.
|
| No data to back my hypothesis up. So take it for what it's
| worth.
| larodi wrote:
| My thought along the same lines: do all tokens live in the same
| latent space, or in many spaces, with each logical unit trained
| separately from the others...?
| whimsicalism wrote:
| > this incomprehensible stream of embedding vectors as natural
| language explanation, in a way returning to encoder/decoder
| architecture
|
| this is just standard decoding, the stream of vectors is called
| the k/v cache
| easeout wrote:
| Here's a paper your idea reminds me of.
| https://arxiv.org/abs/2501.19201
|
| It's also not far from Meta's large concept model idea.
| pona-a wrote:
| Previous discussion:
|
| [41 comments, 166 points]
| https://news.ycombinator.com/item?id=42919597
| bloomingkales wrote:
| Once we train models on the chain of thought outputs, next
| token prediction can solve the halting problem for us (eg, this
| chain of thinking matches _this other_ chain of thinking).
| psadri wrote:
| I think that is how human brains work. When we practice, at
| first we have to be deliberate (thinking slow). Then we
| "learn" from our own experience and it becomes muscle memory
| (thinking fast). Of course, it increases the odds we are
| wrong.
| bloomingkales wrote:
| Or worse, we incorrectly overweight the wrong chain of thinking
| toward an irrelevant (but pragmatically useful) output, at scale.
|
| For example, xenophobia as a response to economic hardship
| is the wrong chain of thinking embedded in the larger
| zeitgeist.
| jjk7 wrote:
| Comments on a Google Doc? Nesting in social media comments?
|
| Seems like similar concepts. I think there is some potential in
| improving how LLMs refine and further their own reasoning lines,
| but I'm no AI mage.
| zoogeny wrote:
| I've had an idea since I was a kid which I can share. I was
| contemplating AI and consciousness generally, probably around
| the time I read "The Mind's I".
|
| I reflected on the pop-psychology idea of consciousness and
| subconsciousness. I thought of each as an independent stream of
| tokens, like stream of consciousness poetry. But along the
| stream there were joining points between these two streams,
| points where the conscious stream was edited by the
| subconscious stream. You could think of the subconscious stream
| as performing CRUD-like operations on the conscious stream. The
| conscious stream would act like a buffer of short-term memory
| while the subconscious stream would act like a buffer of long-
| term memory. Like, the subconscious has instructions related to
| long-term goals and the conscious stream has instructions
| related to short-term goals.
|
| You can imagine perception as input being fed into the
| conscious stream and then edited by the subconscious stream
| before execution.
|
| It seems entirely possible to actually implement this idea in
| this current day and age. I mean, it was a fever dream as a
| kid, but now it could be an experiment!
| sambull wrote:
| That sovereign wealth fund with TikTok might set a good
| precedent; when we have to 'pour money' into these companies we
| can do so with a stake in them held in our sovereign wealth fund.
| TehCorwiz wrote:
| Extra-legal financial instruments meant to suck money from
| other federal departments don't strike me as a good precedent
| in any sense. I don't disagree though that nationalizing the
| value of enormous public investments is something we should be
| considering (looking at you, oil industry). But until Congress
| appropriates the money under law, it's a pipe dream or theft.
| ipnon wrote:
| All you need is attention and waiting. I feel like a zen monk.
| jebarker wrote:
| S1 (and R1 tbh) has a bad smell to me or at least points towards
| an inefficiency. It's incredible that a tiny number of samples
| and some inserted <wait> tokens can have such a huge effect on
| model behavior. I bet that we'll see a way to have the network
| learn and "emerge" these capabilities during pre-training. We
| probably just need to look beyond the GPT objective.
| pas wrote:
| can you please elaborate on the wait tokens? what's that? how
| do they work? is that also from the R1 paper?
| jebarker wrote:
| The same idea is in both the R1 and S1 papers (<think> tokens
| are used similarly). Basically they're using special tokens
| to mark in the prompt where the LLM should think more/revise
| the previous response. This can be repeated many times until
| some stop criteria occurs. S1 manually inserts these with
| heuristics, R1 learns the placement through RL I think.
| whimsicalism wrote:
| ? They're not special tokens, really.
| jebarker wrote:
| i'm not actually sure whether they're special tokens in
| the sense of being in the vocabulary
| whimsicalism wrote:
| <think> might be; I think "wait" is tokenized like any other
| token in the pretraining.
| throwaway314155 wrote:
| There's a decent explanation in the article, just FYI.
| sfink wrote:
| I agree, but LLMs in general have a horrendously bad smell in
| terms of efficiency. s1 and r1 are just proving it.
|
| The models' latent spaces are insanely large. The vast, vast
| majority pretty much has to be irrelevant and useless, it's
| just that the training commandeers random fragments of that
| space to link up the logic they need and it's really hard to
| know which of the weights are useless, which are useful but
| interchangeable with other weights, and which are truly load-
| bearing. You could probably find out easily by testing the
| model against every possible thing you ever might want it to
| do, just as soon as someone gets around to enumerating that
| non-enumerable collection of tasks.
|
| These bogus <wait> tokens kind of demonstrate that the models
| are sort of desperate to escape the limitations imposed by the
| limited processing they're allowed to do -- they'll take
| advantage of thinking time even when it's provided in the
| silliest manner possible. It's amazing what you can live with if
| it's all you have!
|
| (Apologies for the extended anthropomorphizing.)
| light_hue_1 wrote:
| S1 has no relationship to R1. It's a marketing campaign for an
| objectively terrible and unrelated paper.
|
| S1 is fully supervised by distilling Gemini. R1 works by
| reinforcement learning with a much weaker judge LLM.
|
| They don't follow the same scaling laws. They don't give you the
| same results. They don't have the same robustness. You can use R1
| for your own problems. You can't use S1 unless Gemini works
| already.
|
| We know that distillation works and is very cheap. This has been
| true for a decade; there's nothing here.
|
| S1 is a rushed hack job (they didn't even run most of their
| evaluations with an excuse that the Gemini API is too hard to
| use!) that probably existed before R1 was released and then
| pivoted into this mess.
| bloomingkales wrote:
| This thing that people are calling "reasoning" is more like
| rendering to me really, or multi pass rendering. We're just
| refining the render, there's no reasoning involved.
| dleslie wrote:
| That was succinct and beautifully stated. Thank-you for the
| "Aha!" moment.
| bloomingkales wrote:
| Hah. You should check out my other comment on how I think
| we're obviously in a simulation (remember, we just need to
| see a good enough render).
|
| LLMs are changing how I see reality.
| mistermann wrote:
| "...there's no reasoning involved...wait, could I just be
| succumbing to my heuristic intuitions of what is (seems to be)
| true....let's reconsider using System 2 thinking..."
| bloomingkales wrote:
| Or there is no objective reality (well there isn't, check out
| the study), and reality is just a rendering of the few state
| variables that keep track of your simple life.
|
| A little context about you:
|
| - person
|
| - has hands, reads HN
|
| These few state variables are enough to generate a believable
| enough frame in your rendering.
|
| If the rendering doesn't look believable to you, you modify
| state variables to make the render more believable, eg:
|
| Context:
|
| - person
|
| - with hands
|
| - incredulous demeanor
|
| - reading HN
|
| Now I can render you more accurately based on your
| "reasoning", but truly I never needed all that data to see
| you.
|
| Reasoning as we know it could just be a mechanism to fill in
| gaps in obviously sparse data (we absolutely do not have all
| the data to render reality accurately, you are seeing an
| illusion). Go reason about it all you want.
| mistermann wrote:
| Is this a clever rhetorical trick to make it appear that
| your prior claim was correct?
|
| If not: what am I intended to take away from this? What is
| its relevance to my comment?
| bloomingkales wrote:
| You made a joke about questioning reality; I simply entertained
| it. You can do whatever you want with it, it wasn't a slight at
| all.
| mistermann wrote:
| It may have been in the form of a joke, but I certainly
| wasn't joking.
|
| I think it is interesting what actions cannot be done by
| humans.
| ddrdrck_ wrote:
| We could see it the other way around: what we call "reasoning"
| may actually be some kind of multipass rendering, whether it is
| performed by computers or by human brains.
| bloomingkales wrote:
| Yes, of course. The implications are awesome.
| pillefitz wrote:
| Which is related to multistage/ hierarchical/coarse-to-fine
| optimization, which is a pretty good way to find the global
| optimum in many problem domains.
| buyucu wrote:
| Yes.
|
| Before LLMs we had N-gram language models. Many tasks like speech
| recognition worked as a beam search over the graph defined by the
| N-gram language model. You could easily get huge accuracy gains
| simply by pruning your beam less.
|
| s1 reminds me of this. You can always trade off latency for
| accuracy. Given that these LLMs are much more complex than good
| old N-grams, we're just discovering how to make this trade.
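|
| The knob there was literally one integer. A toy sketch (Python;
| `score_next` is a stand-in for whatever model scores the next
| step, so this is an illustration, not any particular decoder):
|
|     def beam_search(score_next, start, beam_width, max_len):
|         # score_next(seq) -> [(token, logprob), ...]  (assumed)
|         beams = [(list(start), 0.0)]
|         for _ in range(max_len):
|             cands = []
|             for seq, lp in beams:
|                 for tok, tok_lp in score_next(seq):
|                     cands.append((seq + [tok], lp + tok_lp))
|             cands.sort(key=lambda c: c[1], reverse=True)
|             # pruning less (a bigger beam_width) costs more compute
|             # but finds better paths - the same latency/accuracy
|             # trade the "Wait" trick makes
|             beams = cands[:beam_width]
|         return beams[0][0]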
| bloomingkales wrote:
| Let me carry that concept, "learning to do this trade", it's
| a new trade.
|
| I don't believe computer science has the algorithms to handle
| this new paradigm. Everything was about sequential
| deterministic outputs, and clever ways to do it fast. This
| stuff is useless at the moment. We need new thinkers on how
| to not think sequentially or how not to think about the
| universe in such a small way.
|
| Verifying input/output pairs is the old way. We need to
| understand differently going forward.
| LordDragonfang wrote:
| How are you defining "reasoning"?
|
| Because I see these sorts of gnostic assertions about LLMs all
| the time: claims that they "definitely aren't doing <thing we
| normally apply to meat-brains>" that gesture at the technical
| things they're doing, with no attempt to actually justify the
| negative assertion.
|
| It often comes across as privileged reason trying to justify
| that of course the machine isn't doing some ineffable thing
| only meat-brains do.
| whimsicalism wrote:
| this isn't RLVR and so is sorta uninteresting; they are just
| distilling the work already done
| bloomingkales wrote:
| If an LLM output is like a sculpture, then we have to sculpt it.
| I never did sculpting, but I do know they first get the clay
| spinning on a plate.
|
| Whatever you want to call this "reasoning" step, ultimately it
| really is just throwing the model into a game loop. We want to
| interact with it on each tick (spin the clay), and sculpt every
| second until it looks right.
|
| You will need to loop against an LLM to do just about anything
| and everything, forever - this is the default workflow.
|
| Those who think we will quell our thirst for compute have another
| thing coming: we're going to be insatiable with how much LLM
| brute-force looping we will do.
| MrLeap wrote:
| This is a fantastic insight and really has my gears spinning.
|
| We need to cluster the AI's insights on a spatial grid hash,
| give it a minimap with the ability to zoom in and out, and give
| it the agency to try and find its way to an answer and build up
| confidence and tests for that answer.
|
| coarse -> fine, refine, test, loop.
|
| Maybe a parallel model that handles the visualization stuff. I
| imagine its training would look more like computer vision. Mind
| palace generation.
|
| If you're stuck or your confidence is low, wander the palace
| and see what questions bubble up.
|
| Bringing my current context back through the web is how I think
| deeply about things. The context has the authority to reorder
| the web if it's "epiphany grade".
|
| I wonder if the final epiphany at the end of what we're
| creating is closer to "compassion for self and others" or "eat
| everything."
| zoogeny wrote:
| I can't believe this hasn't been done yet, perhaps it is a cost
| issue.
|
| My literal first thought about AI was wondering why we couldn't
| just put it in a loop. Heck, one update per day, or one update
| per hour would even be a start. You have a running "context",
| the output is the next context (or a set of transformations on
| a context that is a bit larger than the output window). Then
| ramp that up ... one loop per minute, one per second,
| millisecond, microsecond.
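|
| Something like that loop is almost trivial to sketch today (a
| rough illustration only; `llm` is whatever completion call you
| have, and the prompt wording is made up):
|
|     import time
|
|     def run_loop(llm, context, tick_seconds=60):
|         while True:
|             context = llm(
|                 "Current context:\n" + context + "\n\n"
|                 "Rewrite the context: keep long-term goals, "
|                 "update short-term state, drop what is stale."
|             )
|             time.sleep(tick_seconds)  # one update per minute, then ramp up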
| layer8 wrote:
| Same. And the next step is that it must feed back into
| training, to form long-term memory and to continually learn.
| zoogeny wrote:
| I analogize this with sleep. Perhaps that is what is
| needed, 6 hours offline per day to LoRa the base model on
| some accumulated context from the day.
| int_19h wrote:
| The hard part is coming up with a good way to grade results.
| Which you need to update the weights based on the outcome,
| otherwise the model will not actually learn anything.
| zoogeny wrote:
| For the "looping" I'm talking about you don't need to
| update the weights. It is simply, old context in, new
| context out, new context in, new-new context out, etc.
|
| Of course, keeping that coherent over numerous loops isn't
| going to be easy. No doubt there is a chance it goes off
| the rails. So you might have a section of context that is
| constantly stable, a section of context that updates each
| loop, etc.
|
| In the other response to my comment someone mentioned
| eventually updating the weights (e.g. daily) and you would
| in that case have to have some kind of loss function.
| gyomu wrote:
| > If an LLM output is like a sculpture, then we have to sculpt
| it. I never did sculpting, but I do know they first get the
| clay spinning on a plate.
|
| That's pottery, not sculpture. Traditionally in sculpture you
| start from a block of marble or wood, but you can also make
| sculptures of cast bronze or welded steel (or clay, but you
| don't use a spinning plate).
| incrudible wrote:
| Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.
|
| Wait, actually 1 + 1 equals 1.
| falcor84 wrote:
| As one with teaching experience, the idea of asking a student
| "are you sure about that?" is to get them to think more deeply
| rather than just blurting a response. It doesn't always work,
| but it generally does.
| ALittleLight wrote:
| At 6 dollars per run, I'm tempted to try to figure out how to
| replicate this. I'd like to try some alternatives to "wait" -
| e.g. "double checking..." Or write my own chains of thought.
| qup wrote:
| Like the ones they tested?
| ALittleLight wrote:
| Yes, that is what "replicate" with my own ideas means.
| kittikitti wrote:
| Thank you for this, I really appreciate this article and I
| learned a bunch!
| Aperocky wrote:
| For all the hype about thinking models, this feels much like
| compression in terms of information theory instead of a "takeoff"
| scenario.
|
| There is a finite amount of information stored in any large
| model. The models are really good at presenting the correct
| information back, and adding thinking blocks made the models even
| better at doing that. But there is a cap to that.
|
| Just as you can compress a file only so much before compression
| becomes lossy, there is a theoretical maximum to the amount of
| relevant information you can get out of a model, regardless of
| how long it is forced to think.
| psadri wrote:
| I think an interesting avenue to explore is creating
| abstractions and analogies. If a model can take a novel
| situation and create an analogy to one that it is familiar
| with, it would expand its "reasoning" capabilities beyond its
| training data.
| zoogeny wrote:
| I think this is probably accurate and what remains to be seen
| is how "compressible" the larger models are.
|
| The fact that we can compress a GPT-3 sized model into an o1
| competitor is only the beginning. Maybe there is even more
| juice to squeeze there?
|
| But even more, how much performance will we get out of o3-sized
| models? That is what is exciting, since they are already
| performing near PhD level on most evals.
| hidelooktropic wrote:
| > I doubt that OpenAI has a realistic path to preventing or even
| detecting distealing outside of simply not releasing models.
|
| Couldn't they just start hiding the thinking portion?
|
| It would be easy for them to do this. Currently, they already
| provide one-sentence summaries for each step of the thinking. I
| think users would be fine, or at least stay, if it were changed
| to provide only that.
| Aperocky wrote:
| They hid it and DeepSeek came up with R1 anyway, with RL on
| results only, not even needing any of the thinking tokens that
| OpenAI hid.
| girvo wrote:
| Which is still the funniest and most interesting result in AI
| so far IMO. Fascinating, but sort of makes intuitive sense
| too!
| mtrovo wrote:
| I found the discussion around inference scaling with the 'Wait'
| hack so surreal. The fact that such an ingeniously simple method
| can impact performance makes me wonder how much low-hanging fruit
| we're still missing. So weird to think that progress in a branch
| of computer science boils down to conjuring the right incantation
| words; how do you even change your mindset to start thinking this
| way?
| nyoomboom wrote:
| I think a skill here is learning a bias for experimentation and
| accepting the results one finds. Also the book "Why Greatness
| Cannot Be Planned" showcases the kind of open ended play that
| results in people discovering stuff like this.
| cubefox wrote:
| Now imagine where we will be 12 months from now. This article
| from February 5, 2025 will feel quaint by then. The acceleration
| keeps increasing. It seems likely we will soon have recursive
| self-improving AI -- reasoning models which do AI research.
| This will accelerate the rate of acceleration itself. It sounds
| stupid to say it, but yes, the singularity is near. Vastly
| superhuman AI now seems to arrive within the next few years.
| Terrifying.
| gom_jabbar wrote:
| Yes, and Accelerationism predicted this development back in
| the 1990s, perhaps most prominently in the opening lines of
| Nick Land's _Meltdown_ (1994) text:
|
| [[ ]] The story goes like this: Earth is captured by a
| technocapital singularity as renaissance rationalization and
| oceanic navigation lock into commoditization take-off.
| Logistically accelerating techno-economic interactivity crumbles
| social order in auto-sophisticating machine runaway. As markets
| learn to manufacture intelligence, politics modernizes, upgrades
| paranoia, and tries to get a grip.
|
| > reasoning models which do AI research
|
| In the introduction to my research project on Accelerationism
| [0], I write:
|
| Faced with the acceleration of progress in Artificial
| Intelligence (AI) -- with AI agents now automating AI research
| and development -- Accelerationism no longer seems like an
| abstract philosophy producing empty hyperstitional hype, but like
| a sober description of reality. The failed 2023 memorandum to
| stop AI development on systems more powerful than OpenAI's
| ChatGPT-4 perfectly illustrates the phenomenological aspects of
| Accelerationism: "To be rushed by the phenomenon, to the point of
| terminal institutional paralysis, is the phenomenon." [1]
|
| At the current rate of acceleration, if you don't write
| hyperstitionally, your texts are dead on arrival.
|
| [0] https://retrochronic.com/
|
| [1] Nick Land (2017). _A Quick-and-Dirty Introduction to
| Accelerationism_ in _Jacobite Magazine._
| zoogeny wrote:
| This is something I have been suppressing since I don't want
| to become chicken little. Anyone who isn't terrified by the
| last 3 months probably doesn't really understand what is
| happening.
|
| I went from accepting I wouldn't see a true AI in my
| lifetime, to thinking it is possible before I die, to
| thinking it is possible in the next decade, to thinking it will
| probably arrive in the next 3 years, to wondering if we might see
| it this year.
|
| Just 6 months ago people were wondering if pre-training was
| stalling out and if we had hit a wall. Then DeepSeek drops with
| RL'd inference-time compute, China jumps from being 2 years
| behind in the AI race to being neck-and-neck, and we're all
| wondering what will happen when we apply those techniques to the
| current full-sized behemoth models.
|
| It seems the models that are going to come out around summer
| time may be jumps in capability beyond our expectations. And
| the updated costs means that there may be several open source
| alternatives available. The intelligence that will be
| available to the average technically literate individual will
| be frightening.
| palmotea wrote:
| > The intelligence that will be available to the average
| technically literate individual will be frightening.
|
| That's not the scary part. The scary part is the
| intelligence _at scale_ that could be available to the
| _average employer_. Lots of us like to LARP that we're
| capitalists, but _very_ few of us are. There's zero
| ideological or cultural framework in place to prioritize
| the well being of the general population over the profits
| of _some_ capitalists.
|
| AI, especially accelerating AI, is bad news for anyone who
| needs to work for a living. It's not going to lead to a
| Star Trek fantasy. It means an eventual phase change for
| the economy that consigns us (and most consumer product
| companies) to wither and fade away.
| koala_man wrote:
| It feels like we're back in 1900 when anyone's clever idea (and
| implementation) can give huge performance improvements, such as
| Ford's assembly line and Taylor's scientific management of
| optimizing shovel sizes for coal.
| andrewfromx wrote:
| yes, it also feels like we are going to lose our just-in-time
| global shipments of anything to anywhere any day now. It will
| soon feel like 1900 in other ways.
| xg15 wrote:
| I think the fact alone that distillation and quantization are
| techniques that can produce substantial improvements is a
| strong sign that we still have no real comprehensive
| understanding how the models work.
|
| If we had, there would be no reason to train a model with more
| parameters than are strictly necessary to represent the space's
| semantic structure. But then it should be impossible for
| distilled models with less parameters to come close to the
| performance of the original model.
|
| Yet this is what happens - the distilled or quantized models
| often come very close to the original model.
|
| So I think there are still many low-hanging fruits to pick.
| teruakohatu wrote:
| > still have no real comprehensive understanding how the
| models work.
|
| We do understand how they work, we just have not optimised
| their usage.
|
| For example someone who has a good general understanding of
| how an ICE or EV car works. Even if the user interface is
| very unfamiliar, they can figure out how to drive any car
| within a couple of minutes.
|
| But that does not mean they can race a car, drift a car or
| drive a car on challenging terrain even if the car is
| physically capable of all these things.
| spiorf wrote:
| We know how the next token is selected, but not why doing
| that repeatedly brings all the capabilities it does. We
| really don't understand how the emergent behaviours emerge.
| gessha wrote:
| Your example is somewhat inadequate. We _fundamentally_
| don't understand how deep learning systems work, in the
| sense that they are more or less black boxes that we train
| and evaluate. Innovations in ML are a whole bunch of
| wizards with big stacks of money changing "Hmm" to "Wait"
| and seeing what happens.
|
| Would a different sampler help you? I dunno, try it. Would
| a smaller dataset help? I dunno, try it. Would training the
| model for 5000 days help? I dunno, try it.
|
| Car technology is the opposite of that - it's a white box.
| It's composed of very well defined elements whose
| interactions are defined and explained by laws of
| thermodynamics and whatnot.
| ascorbic wrote:
| I've noticed that R1 says "Wait," a lot in its reasoning. I
| wonder if there's something inherently special in that token.
| lionkor wrote:
| Semantically, wait is a bit of a stop-and-breathe point.
|
| Consider the text:
|
| I think I'll go swimming today. Wait, ___
|
| what comes next? Well, not something that would usually
| follow without the word "wait", probably something entirely
| orthogonal that impacts the earlier sentence in some
| fundamental way, like:
|
| Wait, I need to help my dad.
| cyanydeez wrote:
| it's fascinating how certain political movements avoid that Wait
| moment...
| kevin009 wrote:
| There are more than 10 different ways that I know for sure will
| improve LLMs just like `wait`. It is part of the CoT. I assume
| most researchers know this. CoT is as old as 2019.
| maksimur wrote:
| It appears that someone has implemented a similar approach for
| DeepSeek-R1-Distill-Qwen-1.5B:
| https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...
|
| I hope it gets tested further.
| nullbyte wrote:
| Great article! I enjoyed reading it
| khazhoux wrote:
| I have a bunch of questions, would love for anyone to explain
| these basics:
|
| * The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based
| on very expensive oracles (if we believe DeepSeek-R1 queried
| OpenAI's model). If these are _improvements_ on existing models,
| why is this being reported as decimating training costs? Isn't
| fine-tuning already a cheap way to optimize? (Maybe not as
| effective, but still.)
|
| * The R1 paper talks about improving one simple game - Countdown.
| But the original models are "magic" because they can solve a
| nearly uncountable number of problems and scenarios. How does the
| DeepSeek / R1 approach scale to the same gigantic scale?
|
| * Phrased another way, my understanding is that these techniques
| are using existing models as black-box oracles. If so, how many
| millions/billions/trillions of queries must be probed to
| replicate and improve the original dataset?
|
| * Is anything known about the training datasets used by DeepSeek?
| OpenAI used presumably every scraped dataset they could get their
| hands on. Did DS do the same?
| UncleEntity wrote:
| > If these are improvements on existing models, why is this
| being reported as decimating training costs?
|
| Because that's what gets the clicks...
|
| Saying they spent a boatload of money on the initial training +
| iteration + final fine-tuning isn't as headline grabbing as "$5
| million trained AI beats the pants off the 'mericans".
| torginus wrote:
| If what you say is true, and distilling LLMs is easy and cheap,
| and pushing the SOTA without a better model to rely on is dang
| hard and expensive, then that means the economics of LLM
| development might not be attractive to investors - spending
| billions to have your competitors come out with products that
| are 99% as good, and cost them pennies to train, does not sound
| like a good business strategy.
| janalsncm wrote:
| > even the smartest people make hundreds of tiny experiments
|
| This is the most important point, and why DeepSeek's cheaper
| training matters.
|
| And if you check the R1 paper, they have a section for "things
| that didn't work", each of which would normally be a paper of its
| own but because their training was so cheap and streamlined they
| could try a bunch of things.
| robrenaud wrote:
| > "Note that this s1 dataset is distillation. Every example is a
| thought trace generated by another model, Qwen2.5"
|
| The traces are generated by Gemini Flash Thinking.
|
| 8 hours of H100 is probably more like $24 if you want any kind of
| reliability, rather than $6.
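|
| (Back of the envelope: 16 H100s x 26 minutes is roughly 7
| GPU-hours; at a typical $2-3 per H100-hour on demand that's about
| $14-21, so ~$24 once you allow for any overhead or retries.)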
| zaptrem wrote:
| "You can train a SOTA LLM for $0.50" (as long as you're
| distilling a model that cost $500m into another pretrained
| model that cost $5m)
| fizx wrote:
| That's absolutely fantastic, because if you have 1 good idea
| that's additive to the SOTA, you can test it for a dollar,
| not millions
| knutzui wrote:
| The original statement stands, if what you are suggesting in
| addition to it is true. If the initial one-time investment of
| $505m is enough to distill new SOTA models for $0.50 a piece,
| then the average cost for subsequent models will trend toward
| $0.50.
| nico wrote:
| > Why did it cost only $6? Because they used a small model and
| hardly any data.
|
| > After sifting their dataset of 56K examples down to just the
| best 1K, they found that the core 1K is all that's needed to
| achieve o1-preview performance on a 32B model. Adding data didn't
| raise performance at all.
|
| > 32B is a small model, I can run that on my laptop. They used 16
| NVIDIA H100s for 26 minutes per training run, that equates to
| around $6.
| nico wrote:
| > In s1, when the LLM tries to stop thinking with "</think>",
| they force it to keep going by replacing it with "Wait". It'll
| then begin to second guess and double check its answer. They do
| this to trim or extend thinking time (trimming is just abruptly
| inserting "</think>")
|
| I know some are really opposed to anthropomorphizing here, but
| this feels eerily similar to the way humans work, i.e. if you
| just dedicate more time to analyzing and thinking about the task,
| you are more likely to find a better solution.
|
| It also feels analogous to navigating a tree: the more time you
| have to explore the nodes, the bigger the space you'll have
| covered, hence a higher chance of finding a more optimal solution.
|
| At the same time, if you have "better intuition" (better
| training?), you might be able to find a good solution faster,
| without needing to think too much about it
| layer8 wrote:
| What's missing in that analogy is that humans tend to have a
| good hunch about when they have to think more and when they are
| "done". LLMs seem to be missing a mechanism for that kind of
| awareness.
| nico wrote:
| Great observation. Maybe an additional "routing model" could
| be trained to predict when it's better to think more vs just
| using the current result
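|
| A rough sketch of what that router could look like (illustrative
| Python; `generate` and `router_score` are hypothetical - the
| router would be a small model trained to predict whether more
| thinking is likely to improve the answer):
|
|     def answer_with_router(generate, router_score, prompt,
|                            max_rounds=4, threshold=0.5):
|         thoughts = generate(prompt + "<think>", stop=["</think>"])
|         for _ in range(max_rounds):
|             if router_score(prompt, thoughts) < threshold:
|                 break                  # router says: good enough
|             thoughts += " Wait,"       # otherwise force another pass
|             thoughts += generate(prompt + "<think>" + thoughts,
|                                  stop=["</think>"])
|         return generate(prompt + "<think>" + thoughts + "</think>")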
| janalsncm wrote:
| I think a lot of people in the ML community were excited for Noam
| Brown to lead the O series at OpenAI because intuitively, a lot
| of reasoning problems are highly nonlinear i.e. they have a tree-
| like structure. So some kind of MCTS would work well. O1/O3 don't
| seem to use this, and DeepSeek explicitly mentioned difficulties
| training such a model.
|
| However, I think this is coming. DeepSeek mentioned it was hard
| to learn a value model for MCTS from scratch, but this doesn't
| mean we couldn't seed it with some annotated data.
| insane-c0der wrote:
| Do you have a reference for us to check? - "DeepSeek explicitly
| mentioned difficulties training such a model."
| janalsncm wrote:
| Section 4.2: Unsuccessful attempts
|
| https://arxiv.org/pdf/2501.12948
| talles wrote:
| Anyone else want more articles on how those benchmarks are
| created and how they work?
|
| Those models can be trained in a way tailored to get good results
| on specific benchmarks, making them way less general than they
| seem. No accusation from me, but I'm skeptical of all the recent
| so-called 'breakthroughs'.
| charlieyu1 wrote:
| > having 10,000 H100s just means that you can do 625 times more
| experiments than s1 did
|
| The larger the organisation, the fewer experiments you can afford
| to do. Employees are mostly incentivised to get something done
| quickly enough not to be fired in this job market. They know that
| the higher-ups would let them go for temporary gains. Rush this
| deadline, ship that feature, produce something that looks OK
| enough.
| mmoustafa wrote:
| Love the look under the hood! Especially discovering that some AI
| hack I came up with is how the labs are doing things too.
|
| In this case, I was also forcing R1 to continue thinking by
| replacing </think> with "Okay," after augmenting reasoning with
| web search results.
|
| https://x.com/0xmmo/status/1886296693995646989
___________________________________________________________________
(page generated 2025-02-05 23:00 UTC)