[HN Gopher] S1: A $6 R1 competitor?
       ___________________________________________________________________
        
       S1: A $6 R1 competitor?
        
       Author : tkellogg
       Score  : 533 points
       Date   : 2025-02-05 11:05 UTC (11 hours ago)
        
 (HTM) web link (timkellogg.me)
 (TXT) w3m dump (timkellogg.me)
        
       | bberenberg wrote:
       | In case you're not sure what S1 is, here is the original paper:
       | https://arxiv.org/html/2501.19393v1
        
         | mi_lk wrote:
         | it's also the first link in the article's first sentence
        
           | bberenberg wrote:
           | Good call, I must have missed it. I read the whole blog then
           | went searching for what S1 was.
        
         | addandsubtract wrote:
         | It's linked in the blog post, too. In the first sentence,
         | actually, but for some reason the author never bothered to
         | attach the name to it. As if keeping track of o1, 4o, r1, r2d2,
         | wasn't exhausting enough already.
        
           | kgwgk wrote:
           | > for some reason the author never bothered to attach the
           | name to it
           | 
           | Respect for his readers' intelligence, maybe.
        
       | ttyprintk wrote:
       | https://huggingface.co/simplescaling
        
         | anentropic wrote:
         | and: https://github.com/simplescaling/s1
        
           | mettamage wrote:
           | If I'm only used to ollama, how do I go about using this
           | model?
        
             | davely wrote:
             | I think we need to wait for someone to convert it into a
             | GGUF file format.
             | 
             | However, once that happens, you can run it (and any GGUF
             | model) from Hugging Face![0]
             | 
             | [0] https://huggingface.co/docs/hub/en/ollama
        
               | mettamage wrote:
               | So this?
               | 
               | https://huggingface.co/brittlewis12/s1-32B-GGUF
        
               | withinboredom wrote:
               | oh god, this is terrible!
               | 
               | I just said "Hello!" and it went off the rails.
        
               | delijati wrote:
               | Why, how, what? Can you add a sample prompt with output?
        
               | withinboredom wrote:
               | In my original prompt, it gave me back multiple greetings
               | in various languages. In this example, it gave me back
               | just multiple greetings:
               | 
               | >>> hello!
               | 
               | Hello there! I'm here to help with any questions you
               | might have. How can I assist you today? If you're asking
               | a question, feel free to provide some details about what
               | you need help with. I'll do my best to answer your
               | question or guide you through the process. Just let me
               | know!
               | 
               | <|im_start|>think
               | 
               | The user has asked "hello! " and provided no further
               | information. Answer here is a direct response to the
               | user's request for information, which was essentially
               | just a greeting. The appropriate response is to greet
               | back and offer assistance.
               | 
               | <|im_start|>answer
               | 
               | Answer:
               | 
               | Hello! How can I assist you today? If you have a question
               | or need help with something specific, feel free to ask.
               | I'm here to support you!
               | 
               | <|im_start|>think
               | 
               | <|im_start|>answer
               | 
               | The user's message was "hello!" and there was no other
               | information provided. The appropriate response is a
               | greeting back and an offer of assistance.
               | 
               | <|im_start|>answer
               | 
               | Answer:
               | 
               | Hello! How can I assist you today? If you have a question
               | or need help with something specific, feel free to ask.
               | I'm here to support you!
               | 
               | <|im_start|>think
               | 
               | The user has greeted with "hello!" An appropriate
               | response involves acknowledging the greeting and offering
               | assistance. This could involve asking how we can help
               | them or what they'd like guidance on.
               | 
               | <|im_start|>answer
               | 
               | Answer:
               | 
               | Hello! How can I assist you today? If you have a question
               | or need help with something specific, feel free to ask.
               | I'm here to support you!
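
For anyone hitting the same looping output, here is a minimal post-processing sketch. It assumes the model separates reasoning and reply with the `<|im_start|>think` and `<|im_start|>answer` markers seen in the transcript above. The function name and the "keep only the first answer" policy are illustrative assumptions, not part of the s1 release, and this does not address why generation keeps going.

```python
THINK = "<|im_start|>think"
ANSWER = "<|im_start|>answer"


def first_answer(raw: str) -> str:
    """Return the first answer segment of s1-style output.

    If no answer marker is present, fall back to the text that
    precedes the first think marker (or the whole string).
    """
    start = raw.find(ANSWER)
    if start == -1:
        return raw.split(THINK, 1)[0].strip()

    body = raw[start + len(ANSWER):]

    # Cut the segment at whichever marker comes next, if any.
    cut = len(body)
    for marker in (THINK, ANSWER):
        idx = body.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    text = body[:cut].strip()

    # The transcript shows a literal "Answer:" label; drop it if present.
    if text.startswith("Answer:"):
        text = text[len("Answer:"):].strip()
    return text
```

Applied to output shaped like the transcript, this keeps a single greeting and discards the repeated think/answer rounds.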
        
               | fl0id wrote:
               | You can load the safetensors with ollama; you just have
               | to provide a Modelfile. Or wait for someone to do it. It
               | will in theory also quantize it for you, as I guess most
               | people cannot load a 129 GB model...
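
A rough back-of-the-envelope sketch of why quantization matters here. It assumes about 32 billion parameters (inferred only from the "32B" in the model name) and counts the weights alone, ignoring KV cache and runtime overhead. The bits-per-weight values are nominal, since real quantized formats carry some extra metadata.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 32e9  # assumption: taken from the "32B" in the model name

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_size_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions, 32-bit weights come to roughly 128 GB, in the same range as the figure quoted above, while a 4-bit quantization is closer to 16 GB.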
        
       | yapyap wrote:
       | > If you believe that AI development is a prime national security
       | advantage, then you absolutely should want even more money poured
       | into AI development, to make it go even faster.
       | 
       | This, this is the problem for me with people deep in AI. They
       | think it's the end all be all for everything. They have the
       | vision of the 'AI' they've seen in movies in mind, see the
       | current 'AI' being used and to them it's basically almost the
       | same, their brain is mentally bridging the concepts and saying it's
       | only a matter of time.
       | 
       | To me, that's stupid. I observe the more populist and socially
       | appealing CEOs of these VC startups (Sam Altman being the
       | biggest, of course.) just straight up lying to the masses, for
       | financial gain, of course.
       | 
       | Real AI, artificial intelligence, is a fever dream. This is
       | machine learning except the machines are bigger than ever before.
       | There is no intellect.
       | 
       | And the enthusiasm of these people that are into it feeds into
       | those who aren't aware of it in the slightest, they see you can
       | chat with a 'robot', they hear all this hype from their peers and
       | they buy into it. We are social creatures after all.
       | 
       | I think using any of this in a national security setting is
       | stupid, wasteful and very, very insecure.
       | 
       | Hell, if you really care about being ahead, pour 500 billion
       | dollars into quantum computing so you can try to break current
       | encryption. That'll get you so much further than this nonsensical
       | bs.
        
         | mnky9800n wrote:
         | Also, the narrative that we are currently on the brink of an AI
         | explosion, and that this random paper shows it, has been the same
         | tired old story handed out by AI hawks for years now. Like yes,
         | I agree with the general idea that more compute means more
         | progress for humans and perhaps having a more responsive user
         | interface through some kind of ai type technology would be
         | good. But I don't see why that will turn into Data from Star
         | Trek. But I also think all these ai hawks kind of
         | narcissistically over value their own being. Like blink and
         | their lives are over in the grand scheme of things. Maybe our
         | "awareness" of the world around us is an illusion provided by
         | evolution because we needed it to value self preservation
         | whereas other animals don't. There is an inherent belief in the
         | specialness of humans that I suppose I mistrust.
        
           | encipriano wrote:
           | I find the last part of the paragraph offputting and I agree
        
           | ben_w wrote:
           | > But I don't see why that will turn into Data from Star
           | Trek.
           | 
           | "Is Data genuinely sentient or is he just a machine with this
           | impression" was a repeated plot point in TNG.
           | 
           | https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre...
           | 
           | https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_...
           | 
           | https://en.wikipedia.org/wiki/The_Ensigns_of_Command
           | 
           | https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T...
           | 
           | Similar with The Doctor on VOY.
           | 
           | Even then, what we have with LLMs is basically already at the
           | level of the ship's main computer as it was written in
           | TNG/DS9/VOY.
           | 
           | But what counts as personhood? No idea. We're as ignorant
           | of that as the writers of the TV show; humanity's best
           | attempts at philosophy of mind are still making the same
           | kinds of arguments today as we were a century ago, the only
           | real shift is to be slightly more inclusive of other species
           | besides Homo Sapiens, and the only real improvement over two
           | centuries ago is to include _all_ of Homo Sapiens.
        
             | mnky9800n wrote:
             | The main computer does not make choices stochastically and
             | always understands what people ask it. I do not think that
             | resembles the current crop of LLMs. On Voyager the ship's
             | computer is some kind of biological computing entity that
             | they eventually give up on as a story topic but there is an
             | episode where the bio computing gel packs get sick.
             | 
             | I believe Data and the Doctor both would be people to me.
             | But is Minuet? The woman created by the Bynars to distract
             | Riker so they could steal the ship? Star Trek goes back and
             | forth as to who counts as a person or not. This is one of
             | the points that is made in the episode where the Doctor
             | learns his hologram is cloned many times to work as miners
             | on an asteroid and he says this is slavery. Does Minuet get
             | to say she doesn't want to seduce Riker if she doesn't feel
             | like it? Or is she a program that does whatever the
             | programmer wants?
             | 
             | But moreover linking back to the original technology, does
             | eqtransformer have awareness? Do we treat it as a thinking
             | being that we enslaved to find earthquakes in data because
             | it uses the same technology as ChatGPT? I don't think
             | anyone thinks we should do this but it seems like that is
             | where people want to take it. That's strange to me. At what
             | point do data pipelines doing some kind of stochastic
             | transformation and summarization of training data become an
             | individual that presents a desire for autonomy like Data or
             | the Doctor?
             | 
             | I think there's lots of questions here to answer and I
             | don't know the answers to them.
        
               | ben_w wrote:
               | > The main computer does not make choices stochastically
               | and always understands what people ask it.
               | 
               | The mechanism is never explained, but no, it doesn't
               | always understand correctly -- and neither does Data. If
               | hologram-Moriarty is sentient (is he?), then the
               | capability likely exceeds what current LLMs can do, but
               | the cause of the creation is definitely a
               | misunderstanding.
               | 
               | Even the episode where that happens, the script for Dr.
               | Pulaski leading up to Moriarty's IQ boost was exactly the
               | same arguments used against LLMs:
               | https://www.youtube.com/watch?v=4pYDy7vsCj8
               | 
               | (Common trope in that era being that computers (including
               | Data) are too literal, so there was also:
               | https://www.youtube.com/watch?v=HiIlJaSDPaA)
               | 
               | Similar with every time the crew work iteratively to
               | create something in the holodeck. And, of course:
               | https://www.youtube.com/watch?v=srO9D8B6dH4
               | 
               | > I do not think that resembles the current crop of LLMs.
               | On voyager the ships computer is some kind of biological
               | computing entity that they eventually give up on as a
               | story topic but there is an episode where the bio
               | computing gel packs get sick.
               | 
               | "Take the cheese to sickbay" is one of my favourite lines
               | from that series.
               | 
               | > But is minuet?
               | 
               | I would say the character was a puppet, with the Bynars
               | pulling the strings, because the holo-character was
               | immediately seen as lacking personhood the moment they
               | stopped fiddling with the computer.
               | 
               | Vic Fontaine was more ambiguous in that regard. Knew he
               | was "a lightbulb", but (acted like) he wanted to remain
               | within that reality in a way that to me felt like he was
               | _*programmed*_ to respond as if the sim around him was
               | the only reality that mattered rather than having free
               | will in that regard.
               | 
               | (But who has total free will? Humans are to holograms as
               | Q is to humans, and the main cast were also written to
               | reject "gifts" from Riker that time he briefly became a
               | Q).
               | 
               | The villagers of Fair Haven were, I think, not supposed
               | to be sentient (from the POV of the crew), but were from
               | the POV of the writers:
               | https://en.wikipedia.org/wiki/Fair_Haven_(Star_Trek:_Voyager...
               | and
               | https://en.wikipedia.org/wiki/Spirit_Folk_(Star_Trek:_Voyage...
               | 
               | > does eqtransformer have awareness?
               | 
               | There's too many different definitions for a single
               | answer.
               | 
               | We don't know what part of our own brains gives us the
               | sensation of our own existence; and even if we did, we
               | wouldn't know if it was the only mechanism to do so.
               | 
               | To paraphrase your own words:
               | 
               | At what point do chemical pipelines doing some kind of
               | stochastic transformation and electrochemical integration
               | of sensory input become an individual that presents a
               | desire for autonomy like Data or the Doctor?
               | 
               | I don't know. Like you, I'd say:
               | 
               | > I think there's lots of questions here to answer and I
               | don't know the answers to them.
        
         | spacebanana7 wrote:
         | > I think using any of this in a national security setting is
         | stupid
         | 
         | What about AI enabled drones and guided missiles/rockets? The
         | case for their effectiveness is relatively simple in terms of
         | jamming resistance.
        
           | pjc50 wrote:
           | Like a lot of AI boosters, would you like to explain how that
           | works other than magic AI dust? Some forms of optical
           | guidance are already in use, but there's other limitations
           | (lighting! weather!)
        
             | spacebanana7 wrote:
             | Sure thing. The basic idea would be:
             | 
             | 1) Have a camera on your drone
             | 2) Run some frames through a locally running version of
             |    something like AWS Rekognition's celebrity identification
             |    service but for relevant military targets.
             | 3) Navigate towards coordinates of target individuals
             | 
             | It isn't exactly magic, here's a video of a guy doing
             | navigation with openCV on images:
             | https://www.youtube.com/watch?v=Nrzs3dQ9exw
        
               | Hauthorn wrote:
               | I believe this is a capability that the Switchblade 600
               | or STM KARGU already has.
               | 
               | https://en.wikipedia.org/wiki/STM_Kargu
        
           | GTP wrote:
           | This somehow reminds me of a certain killer robot from a
           | Black Mirror episode ;)
        
           | amarcheschi wrote:
           | I would say that they don't require a $500bn investment.
           | AFAIK, drones that help lock onto targets have started being
           | used in Ukraine
        
             | spacebanana7 wrote:
             | I generally agree, piggybacking on innovations in
             | smartphone GPUs / batteries will probably be enough to get
             | locally running AI models in drones.
        
           | swiftcoder wrote:
           | drone and missile guidance system development has been using
           | ML for decades at this point. That's just as much "AI" as
           | anything currently coming out of the LLM craze.
        
             | int_19h wrote:
             | It's not just target guidance at this point. There are
             | prototypes of drone swarms, for example.
        
         | pjc50 wrote:
         | > then you absolutely should want even more money poured into
         | AI development, to make it go even faster.
         | 
         | Indeed. People are welcome to go "all in" on whatever nonsense
         | gambling they want to do with their personal investments, but
         | national security demands actually thinking about things -
         | adversarially. Because the enemy will as well.
         | 
         | It's perfectly possible to lose a war by investing in expensive
         | superweapons that underdeliver. The Nazis were particularly
         | bad at this.
        
         | amarcheschi wrote:
         | I couldn't agree more.
         | 
         | If we're not talking about cyber war exclusively, such as
         | finding and exploiting vulnerabilities, for the time being
         | national security will still be based on a traditional army.
         | 
         | Just a few weeks ago, italy announced a 16blnEUR plan to buy
         | >1000 rheinmetall ifv vehicles. That alone would make italy's
         | army one of the most equipped in Europe. I can't imagine what
         | would happen with a 500$bln investment in defense,lol. I don't
         | agree with what Meloni's government is doing, but one of the
         | ministers I agree more with is the defense minister Crosetto
         | 
         | Furthermore, what is being shown, at least for the time being,
         | is that open source can be and is crucial in helping develop
         | better models. This collides with the big, single "one
         | winner takes it all" VC mentality (because let's be honest,
         | these defense pitches are still made by startup/VC bros)
        
           | piltdownman wrote:
           | >italy announced a 16blnEUR plan to buy >1000 rheinmetall ifv
           | vehicles. That alone would make italy's army one of the most
           | equipped in Europe.
           | 
           | So target practice for a beyond-the-horizon missile system
           | launched ground-to-ground or air-to-ground? As an attacking
           | force, conventional ground forces and tactics are a non-
           | runner in a modern theatre of operations when faced against
           | air and drone support. This is why no single EU country is
           | incentivised into dumping money into any single area - as the
           | only probable defense would be against USA/Russia/China to
           | begin with.
           | 
           | The US proved it beyond doubt in Afghanistan - partisans
           | simply haven't a chance against a gunship with IR or NV
           | optics; the last time they levelled the playing field against
           | air interdictors was in Charlie Wilson's Afghanistan when the
           | Mujahideen took on that era of Soviet gunships with hand-held
           | AA systems.
        
             | amarcheschi wrote:
             | Italy's tanks have been a joke for the last decade(s): they
             | clearly weren't on par with the other European alternatives,
             | and the defense budget went to other places, namely navy
             | and air force. Since some of the Italian tanks are being
             | decommissioned, there was the need for a new tank, but not
             | having invested much in tanks had the effect of not having
             | cutting edge technology to develop an alternative. So the
             | decision was taken to produce tanks from Rheinmetall
             | (Leonardo was open to another German producer, but
             | they didn't want to transfer technology and produce in
             | Italy).
             | 
             | Not having tanks is a big no-no given the latest events. I do
             | not see the grudge you hold against them when it's clear
             | they wouldn't be used in the same flat scenario as Ukraine
             | or Poland. Given that Italy is highly mountainous, it made
             | sense to prioritize air force and navy first. I think
             | they're also compatible with anti-drone guns.
             | 
             | Piaggio Aerospace was acquired a few weeks ago by Baykar,
             | hopefully it will produce drones.
             | 
             | I do not get why you're saying that a change in the
             | approach to war makes tanks not useful anymore, when it's
             | clear they will eventually be adapted to counter drones and
             | similar threats
        
           | fkyoureadthedoc wrote:
           | > I can't imagine what would happen with a 500$bln investment
           | in defense,lol.
           | 
           | The $90,000 bag of bushings becomes a $300,000 bag?
        
             | amarcheschi wrote:
             | No, at least in italy we would create some gimmick
             | consultant position for some ex politician and repay
             | him/her for past favours /s
        
           | int_19h wrote:
           | It's not one or the other, though. AI-controlled drones are
           | already a thing in Ukraine, today.
        
         | smcl wrote:
         | Been saying this for years, it's been fucking baffling.
         | Generating images, video and text that sort-of resembles what a
         | human would come up with is genuinely quite impressive. It is
         | not "let's claim it'll fix our country" (looking at you, Keir)
         | impressive though, and I cannot believe so much money has been
         | pumped into it.
        
           | amarcheschi wrote:
           | But you have to over promise and under deliver, otherwise you
           | won't receive that sweet, sweet money
        
         | pilingual wrote:
         | > This is machine learning
         | 
         | Yeah, I was thinking about this while trying to figure out
         | author affiliations.
         | 
         | There was a Stanford paper a few years ago that dusted off some
         | old intelligence concepts and the authors seemed excited about
         | it.
         | 
         | But given the pace of AI, it's difficult to look in new
         | directions. It will probably take an AI winter and some
         | unbridled enthusiasm immune to burnout to make some real
         | progress outside of feed forward neural networks.
        
         | mrshadowgoose wrote:
         | > They think it's the end all be all for everything.
         | 
         | Is (human-based) general intelligence not one of the
         | fundamental enabling elements of literally every human activity
         | throughout history, regardless of how many layers of automation
         | and technology one has to peel back to get to it?
         | 
         | Can you maybe imagine how the ability to create arbitrary
         | amounts of general intelligence, completely divorced from the
         | normal lengthy biological process, could upend that foundation
         | of human activity?
         | 
         | > They have the vision of the 'AI' they've seen in movies in
         | mind, see the current 'AI' being used and to them it's
         | basically almost the same, their brain is mentally bridging the
         | concepts and saying it's only a matter of time.
         | 
         | I've found that most AI-related movies exclusively focus on
         | "quality ASI" scenarios, which are mostly irrelevant to our
         | current state of the world, as an immense amount of
         | danger/value/disruption will arrive with AGI. People who are
         | seriously reasoning about the impacts of AGI are not using
         | movies as references. "Those stupid movie watching idiots" is
         | just a crutch you are using to avoid thinking about something
         | that you disagree with.
         | 
         | > Real AI, artificial intelligence, is a fever dream. This is
         | machine learning except the machines are bigger than ever
         | before. There is no intellect.
         | 
         | Do you have any evidence to support this conclusion? And does
         | it even matter? If "fake intellect" can replace a human, that
         | human still has to deal with the very real issue of not having
         | a job anymore. If "fake intellect" is used to conduct mass
         | surveillance, and direct suppression activities towards
         | divergent individuals, those individuals are still going to
         | have a bad time.
        
           | gessha wrote:
           | >> Real AI, artificial intelligence, is a fever dream. This
           | is machine learning except the machines are bigger than ever
           | before. There is no intellect.
           | 
           | > Do you have any evidence to support this conclusion? And
           | does it even matter? If "fake intellect" can replace a human,
           | that human still has to deal with the very real issue of not
           | having a job anymore. If "fake intellect" is used to conduct
           | mass surveillance, and direct suppression activities towards
           | divergent individuals, those individuals are still going to
           | have a bad time.
           | 
           | I think the "fake intelligence can replace a human" needs
           | more support in general. We know how human intellect works
           | practically (not theoretically) and we know how to apply it
           | in different scenarios. We're still far from knowing how
           | "fake intelligence" works and how to apply it to different
           | scenarios.
        
         | encipriano wrote:
         | You would solve the ai problem if you correctly defined what
         | intellect even is.
        
         | amelius wrote:
         | Yes, I'd like to see some examples where our current AI can
         | actually extrapolate rather than interpolate. Let it invent new
         | things, new drawing styles, new story plots, etc. Maybe _then_
         | it will impress me.
        
           | mrshadowgoose wrote:
           | Here you go:
           | https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
        
             | amelius wrote:
             | I'm not convinced. This is using the tooling and paradigms
             | invented by humans.
        
               | ninetyninenine wrote:
               | Science is a paradigm invented by humans. If a human uses
               | this paradigm to invent something he is considered
               | intelligent but if an ai uses it it's not? All humans use
               | human paradigms and they are intelligent.
               | 
               | If a human uses such a paradigm successfully it is the
               | same.
        
               | amelius wrote:
               | Biology is a field where you can find new things
               | relatively easily, if you follow existing protocols and
               | do the hard lab work (which costs time and money).
        
           | moffkalast wrote:
           | Can you? Most people can't do anything that isn't 99%
           | derivative.
        
         | sidewndr46 wrote:
         | What is even the possible usage of AI for national security?
         | Generating pictures of kittens riding nuclear weapons to the
         | very end like in Dr Strangelove?
        
           | ben_w wrote:
           | > What is even the possible usage of AI for national
           | security? Generating pictures of kittens riding nuclear
           | weapons to the very end like in Dr Strangelove?
           | 
           | For all that critics of AI dismiss them as lacking
           | imagination, your reaction suggests a lack of imagination.
           | 
           | Off the top of my head: facial recognition and identification
           | to make "smart" guns that hit specific targets with reduced
           | collateral damage (as found on most digital cameras even
           | before smartphones); creating and A/B testing propaganda
           | campaigns; using modified wifi signals as wall-penetrating
           | radar capable of pose estimation, heart rate and breathing
           | monitoring[0]; take any self-driving car's AI and
           | conditionally invert the part that says "don't hit
           | pedestrians" when a certain target is spotted; ANPR to track
           | specific vehicles with known owners over long distances;
           | alternative targeting system for cruise missiles in the
           | absence or jamming of GPS systems; using them as red teams in
           | war-game exercises; using them to automate intrusion
           | detection by monitoring for changes to background
           | distributions of basically every measurable event; person-
           | tracking by watching CCTV in secure areas; control systems
           | for security robots (think Boston Dynamics' Spot) that are
           | currently in deployment.
           | 
           | There's likely a lot more, too.
           | 
           | [0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...
        
           | robwwilliams wrote:
           | Lol: Where I live (Memphis) both "one" and "two" are
           | considered two syllable words. Seriously. Our kids were
           | taught this in the best public elementary school.
        
             | esafak wrote:
             | How does that work, one-uh, two-uh?
        
               | robwwilliams wrote:
               | egzacly
        
           | moffkalast wrote:
           | Gentlemen you can't use AI here, this is the deep learning
           | lab!
        
           | int_19h wrote:
           | Autonomous weapons.
        
         | snarf21 wrote:
         | Agreed. I was working on some haiku things with ChatGPT and it
         | kept telling me that busy has only one syllable. This is a
         | trivially searchable fact.
        
           | whimsicalism wrote:
           | link a chat please
        
             | snarf21 wrote:
             | It wasn't just _busy_ that it failed on. I was feeding it
             | haikus and wanted them broken into a list of 17
             | words/fragments. Certain 2 syllable words weren't split and
             | certain 1 syllable words were split into two.
        
         | robwwilliams wrote:
         | It used to be much easier to be conservative about AI,
         | especially AGI, after living through three cycles of AI
         | winters. No more. Dismissing it as "merely machine learning" is
         | worse than unfair to the last decade of machine learning ;-)
         | 
         | The hard part now is relatively trivial. Does anyone think that
         | there is a fundamental and profound discovery that evolution
         | made purely by selection in the last 200,000 years? I mean a
         | true qualitative difference?
         | 
         | Sure---We call it language, which is just another part of a
         | fancy animal's tool kit.
         | 
         | Does anyone think there is an amazing qualitative difference
         | between the brain of a chimp and the brain of a human?
         | 
         | No, not if they know any biology.
         | 
         | (Although that does not stop some scientists from looking for a
         | "language gene" like FOXP2.)
         | 
         | So what did dumb mutations and 200,000 years of selection do
         | that a group of dedicated AI scientists cannot do with their
         | own genuine general intelligence?
         | 
         | Nothing---nothing other than putting a compact energy efficient
         | LLM with reinforcement learning on a good robotic body and
         | letting it explore and learn like we did as infants, toddlers
         | and teenagers.
         | 
         | Each one of us has experienced becoming a "general
         | intelligence". I remember it hit me on the head in 6th grade
         | when I dreamed up a different way of doing long division. I
         | remember thinking: "How did I think that?" And each one of us
         | who has watched an infant turn into a toddler has watched it as
         | an observer or teacher. This is what makes babies so
         | fascinating to "play" with.
         | 
         | We have to give our baby AGI a private memory and a layer of
         | meta-attention like we all gain as we mature, love, and
         | struggle.
         | 
         | I read the linked article and as a neuroscientist I realized
         | the "wait" cycles that improved performance so much are roughly
         | equivalent to the prefrontal cortex: the part of the CNS most
         | responsible for enabling us to check our own reasoning
         | recursively. Delay--as in delayed gratification---is a key
         | attribute of intelligent systems.
         | 
         | We are finally on the doorstep of Hofstadter's Strange Loop
         | and Maturana's and Varela's "enactive" systems, but now
         | implemented in silicon, metal, and plastic by us rather than
         | dumb but very patient natural selection.
         | 
         | Karl Friston and Demis Hassabis (two very smart neuroscientists)
         | figured this out years ago. And they were preceded by three
         | other world class neuroscientists: Humberto Maturana, Francisco
         | Varela, and Rich Sutton (honorary neuroscientist). And big
         | credit to Terry Winograd for presaging this path forward long
         | ago too.
        
         | baq wrote:
         | I can only say that exponential curves grow nominally
         | sublinearly before they take off. AI is not quite at the
         | obvious take off point, but owners of the biggest clusters have
         | seen the extrapolations and it isn't pretty - once your
         | competitor achieves take off and you aren't anywhere close,
         | you're done for. The risk of not participating in that is too
         | great.
        
         | menaerus wrote:
         | You can choose to be somewhat ignorant of the current state in
         | AI, about which I could also agree that at certain moments it
         | appears totally overhyped, but the reality is that there hasn't
         | been a bigger technology breakthrough probably in the last ~30
         | years.
         | 
         | This is not "just" machine learning because we have never been
         | able to do things which we are today and this is not only the
         | result of better hardware. Better hardware is actually a
         | byproduct. Why build a PFLOPS GPU when there is nothing that
         | can utilize it?
         | 
         | If you spare yourself some time and read through the actual
         | (scientific) papers of multiple generations of LLM models, the
         | first one being from Google ~~not DeepMind~~ in 2017, you might
         | get to understand that this is no fluff.
         | 
         | And I'm speaking this from a position of a software engineer,
         | without bias.
         | 
         | The reason why all this really took off with so much hi-speed
         | is because of the not quite expected results - early LLM
         | experiments have shown that "knowledge" with current
         | transformers architecture can linearly scale with regards to
         | the amount of compute and training time etc. That was very
         | unexpected and to this day scientists do not have an answer why
         | this even works.
         | 
         | So, after reading a bunch of material I am inclined to think
         | that this is something different. The future of loading the
         | codebase into the model and asking the model to explain the code
         | to me or fix bugs has never been so close and realistic. For
         | better or worse.
        
           | whimsicalism wrote:
           | > the first one being from DeepMind in 2017
           | 
           | ? what paper are you talking about
        
             | menaerus wrote:
             | https://arxiv.org/abs/1706.03762
        
               | whimsicalism wrote:
               | oh gotcha. maybe pedantic but that is not a deepmind
               | paper
        
               | menaerus wrote:
               | I stand corrected, not sure why I was under the
               | impression that this paper came out of DeepMind lab.
        
           | mitchelld wrote:
           | This line of thinking doesn't really correspond to the reason
           | Transformers were developed in the first place, which was to
           | better utilize how GPUs do computation. RNNs were too slow to
           | train at scale because you had to sequentially compute the
           | time steps, Transformers (with masking) can run the input
           | through in a single pass.
           | 
           | It is worth noting that the first "LLM" you are referring to was
           | only 300M parameters, but even then the amount of training
           | required (at the time) was such that training a model like
           | that outside of a big tech company was infeasible. Obviously
           | now we have models that are in the hundreds of billions /
           | trillions of parameters. The ability to train these models is
           | directly a result of better / more hardware being applied to
           | the problem as well as the Transformer architecture
           | specifically designed to better conform with parallel
           | computation at scale.
           | 
           | The first GPT model came out ~ 8 years ago. I recall when
           | GPT-2 came out they initially didn't want to release the
           | weights out of concern for what the model could be used for,
           | looking back now that's kind of amusing. However,
           | fundamentally, all these models are the same setup as what
           | was used then, decoder based Transformers. They are just
           | substantially larger, trained on substantially more data,
           | trained with substantially more hardware.
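           | 
           | The RNN-versus-Transformer contrast above can be sketched with
           | a toy example. This is only an illustration under loud
           | assumptions: inputs are scalars, there are no learned
           | projections (queries, keys and values are all the raw inputs),
           | and `rnn_states` and `causal_attention` are hypothetical names,
           | not anything from the paper or a library.

```python
import math


def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: each state needs the previous one, so steps run in order."""
    h, out = 0.0, []
    for x in xs:  # inherently sequential: step t waits for step t-1
        h = math.tanh(w_h * h + w_x * x)
        out.append(h)
    return out


def causal_attention(xs):
    """Toy single-head self-attention over scalars with a causal mask.

    Row t depends only on xs[:t+1] and on no other output row, so every row
    could be computed at the same time. A plain loop is used for readability.
    """
    out = []
    for t, q in enumerate(xs):
        visible = xs[: t + 1]                 # causal mask: drop positions > t
        scores = [q * k for k in visible]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, visible)) / total)
    return out


a = causal_attention([1.0, 2.0, 3.0])
b = causal_attention([1.0, 2.0, 9.0])   # only the last ("future") token differs
print(a[:2] == b[:2])                   # earlier rows ignore the future token
```

           | The point of the sketch is the dependency structure: the RNN
           | loop cannot start step t before step t-1 finishes, while each
           | attention row is independent of the others, which is what lets
           | real implementations batch all positions into one matrix
           | multiply on a GPU.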
        
             | menaerus wrote:
             | What line of thinking are you referring to?
             | 
             | Transformers were meant to solve the "context" problem, and
             | the authors, aware that RNNs neither scale nor solve that
             | particular problem, had to come up with an algorithm that
             | overcomes both of those issues. It
             | turned out that the self-attention compute-scale was the
             | crucial ingredient to solve the problem, something that
             | RNNs were totally incapable of.
             | 
             | They modeled the algorithm to run on the hardware they had
             | at that time available but hardware developed afterwards
             | was a direct consequence, or how I called it a byproduct,
             | of transformers proving themselves to be able to
             | continuously scale. Had that not been true, we wouldn't have
             | all those iterations of NVidia chips.
             | 
             | So, although one could say that the NVidia chip design is
             | what enabled the transformers success, one could also say
             | that we wouldn't have those chips if transformers didn't
             | prove themselves to be so damn efficient. And I'm inclined
             | to think the latter.
        
               | mitchelld wrote:
               | > This is not "just" machine learning because we have
               | never been able to do things which we are today and this
               | is not only the result of better hardware. Better
               | hardware is actually a byproduct. Why build a PFLOPS GPU
               | when there is nothing that can utilize it?
               | 
               | This is the line of thinking I'm referring to.
               | 
               | The "context" problem had already been somewhat solved.
               | The attention mechanism existed prior to Transformers and
               | was specifically used on RNNs. They certainly improved
               | it, but innovation of the architecture was making it
               | computation efficient to train.
               | 
               | I'm not really following your argument. Clearly you're
               | acknowledging that it was first the case that with the
               | hardware at the time, researchers demonstrated that
               | simply scaling up training with more data yielded better
               | models. The fact that hardware was then optimized for
               | these architectures only reinforces this point.
               | 
               | All the papers discussing scaling laws point to the same
               | thing, simply using more compute and data yields better
               | results.
               | 
               | > this is not only the result of better hardware
               | 
               | Regarding this in particular. A majority of the
               | improvement from GPT-2 to GPT-4 was simply training on a
               | much larger scale. That was enabled by better hardware
               | and lots of it.
        
         | dotancohen wrote:
         | > Real AI, artificial intelligence, is a fever dream. This is
         | machine learning except the machines are bigger than ever
         | before. There is no intellect.
         | 
         | That sounds to me like dismissing the idea that a Russian SSBN
         | might cross the Pacific and nuke Los Angeles because
         | "submarines can't swim".
         | 
         | Even if the machine learning isn't really intelligent, it is
         | still capable of performing IF..THEN..ELSE operations, which
         | could have detrimental effects for [some subset of] humans.
         | 
         | And even if you argue that such a machine _shouldn't_ be used
         | for whatever doomsday scenario would harm us, rest assured that
         | someone, somewhere, who either does not understand what the
         | machines are designed to do or just pretends that they work
         | like magic, will put the machines in a position to make such a
         | decision.
        
           | UncleEntity wrote:
           | One could hope...
           | 
           | Even at the height of the Cold War there was always a human
           | between <leader presses button> and <nukes go aflyin'>.
           | 
           | --edit--
           | 
           | ...which has me wondering if a president even has the
           | constitutional authority to destroy the entire planet and if
           | one could interpret their command as a 'lawful order'. Makes
           | one think.
        
             | willglynn wrote:
             | On the topic of fail-deadly nukes:
             | 
             | https://en.wikipedia.org/wiki/Dead_Hand
        
         | ninetyninenine wrote:
         | I agree AGI won't solve national security but saying this isn't
         | intelligence is false.
         | 
         | This is ai and trend lines point to an intelligence that
         | matches or barely exceeds human intellect in the future.
         | 
         | You're part of a trend of people in denial. When LLMs first
         | came out there were hordes of people on HN claiming it was just
         | a stochastic parrot and LLMs displayed zero intellectual
         | ability. It is now abundantly clear that this is not true.
         | 
         | We don't fully understand LLMs. That's why gains like COT are
         | just black box adjustments that come from changing external
         | configurations. We have no way to read the contents of the
         | black box and make adjustments off of it. Yet idiots like you
         | can make such vast and hard claims when nobody really fully
         | understands these things. You're delusional.
         | 
         | I agree that LLMs won't allow us to make some super weapon to
         | give us some edge in national security.
        
         | naasking wrote:
         | > This is machine learning except the machines are bigger than
         | ever before. There is no intellect.
         | 
         | Define "intellect".
        
       | GTP wrote:
       | Sorry for being lazy, but I just don't have the time right now to
       | read the paper. Is there in the paper or somewhere else a
       | comparison based on benchmarks of S1 vs R1 (the full R1, not
       | quantized or distilled)?
        
         | pama wrote:
         | The S1 paper is not meant to compete with R1. It simply shows
         | that with 1k well curated examples for finetuning (26 minutes
         | training on 16 GPU) and with a simple hack for controlling the
         | length of the thinking process, one can dramatically increase
         | the performance of a non-reasoning model and show a clear
         | increase in benefit with increased test-time compute. It is
         | worth a quick skim.
        
       | swiftcoder wrote:
       | > having 10,000 H100s just means that you can do 625 times more
       | experiments than s1 did
       | 
       | I think the ball is very much in their court to demonstrate they
       | actually are using their massive compute in such a productive
       | fashion. My BigTech experience would tend to suggest that
       | frugality went out the window the day the valuation took off, and
       | they are in fact just burning compute for little gain, because
       | why not...
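       | 
       | For reference, the arithmetic behind the quoted 625x figure can
       | be sketched as below. The inputs are only the numbers mentioned
       | in this thread (16 GPUs and 26 minutes for s1, a 10,000-GPU
       | cluster); the function names are made up for illustration, and
       | the calculation assumes every run needs the same 16 GPUs.

```python
# Back-of-the-envelope numbers taken from the thread; nothing here is measured.
S1_GPUS = 16               # GPUs used for the s1 finetune
S1_MINUTES = 26            # wall-clock training time of that finetune
BIG_CLUSTER_GPUS = 10_000  # cluster size in the quoted sentence


def gpu_hours(gpus: int, minutes: float) -> float:
    """Total GPU time consumed by one run, in GPU-hours."""
    return gpus * minutes / 60


def parallel_runs(cluster_gpus: int, gpus_per_run: int) -> int:
    """How many same-sized runs fit on the cluster at once."""
    return cluster_gpus // gpus_per_run


s1_cost = gpu_hours(S1_GPUS, S1_MINUTES)
runs = parallel_runs(BIG_CLUSTER_GPUS, S1_GPUS)
print(f"{s1_cost:.2f} GPU-hours per run, {runs} runs at once")
```

       | So the 625 is simply 10,000 / 16, and one s1-sized run is a bit
       | under 7 GPU-hours. Whether the extra runs are spent productively
       | is a separate question.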
        
         | whizzter wrote:
         | Mainly it points to a non-scientific "bigger is better"
         | mentality, and the researchers probably didn't mind playing
         | around with the power because "scale" is "cool".
         | 
         | Remember that the Lisp AI-labs people were working on non-
         | solved problems on absolute potatoes of computers back in the
         | day, we have a semblance of progress solution but so much of it
         | has been brute-force (even if there has been improvements in
         | the field).
         | 
         | The big question is whether this insane spending has pulled the
         | rug out from under real progress as we head into another AI
         | winter of disillusionment, or whether there is enough real
         | progress just around the corner to show that there is hope for
         | investors in a post-deepseek valuation hangover.
        
           | wongarsu wrote:
           | We are in a phase where costs are really coming down. We had
           | this phase from GPT2 to about GPT4 where the key to building
           | better models was just building bigger models and training
           | them for longer. But since then a lot of work has gone into
           | distillation and other techniques to make smaller models more
           | capable.
           | 
           | If there is another AI winter, it will be more like the
           | dotcom bubble: lots of important work got done in the dotcom
           | bubble, but many of the big tech companies started from the
           | fruits of that labor in the decade after the bubble burst
        
         | svantana wrote:
         | Besides that, AI training (aka gradient descent) is not really
         | an "embarrassingly parallel" problem. At some point, there are
         | diminishing returns on adding more GPUs, even though a lot of
         | effort is going into making it as parallel as possible.
        
           | janalsncm wrote:
           | What? It definitely is.
           | 
           | Data parallelism, model parallelism, parameter server to
           | workers, MoE itself can be split up, etc.
           | 
           | But even if it wasn't, you can simply parallelize training
           | runs with slight variations in hyperparameters. That is what
           | the article is describing.
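           | 
           | A minimal sketch of that last point, independent runs that
           | differ only in a hyperparameter. Everything here is a toy
           | assumption: `train` is a hypothetical stand-in that does
           | gradient descent on f(w) = (w - 3)^2, and threads stand in for
           | what would really be separate GPUs or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def train(learning_rate, steps=50):
    """Toy 'training run': gradient descent on f(w) = (w - 3)^2."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)
        w -= learning_rate * grad
    return learning_rate, (w - 3) ** 2   # (hyperparameter, final loss)


def sweep(learning_rates):
    # The runs share no state, so each one can be handed to its own worker.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train, learning_rates))


results = sweep([0.001, 0.01, 0.1, 0.5])
best_lr, best_loss = min(results, key=lambda r: r[1])
print(best_lr, best_loss)
```

           | Because no run waits on another, this kind of sweep scales with
           | the number of workers in a way that a single gradient-descent
           | run does not, which is the distinction the two comments above
           | are arguing over.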
        
         | gessha wrote:
         | This is pure speculation on my part but I think at some point a
         | company's valuation became tied to how big their compute is so
         | everybody jumped on the bandwagon.
        
           | syntaxing wrote:
           | Matt Levine tangentially talked about this during his podcast
           | this past Friday (or was it the one before?). It was a good
           | way to value these companies according to their compute size
           | since those chips are very valuable. At a minimum, the chips
           | are an asset that acts as collateral.
        
             | jxdxbx wrote:
             | I hear this a lot, but what the hell. It's still computer
             | chips. They depreciate. Short supply won't last forever.
             | Hell, GPUs burn out. It seems like using ice sculptures as
             | collateral, and then spring comes.
        
               | baxtr wrote:
               | If so wouldn't it be the first time in history when more
               | processing power is not used?
               | 
               | In my experience CPU/GPU power is used up as much as
               | possible. Increased efficiency just leads to more demand.
        
               | littlestymaar wrote:
               | I think you're missing the point: H100 isn't going to
               | remain useful for a long time. Would you consider Tesla
               | or Pascal graphics cards collateral? That's what those
               | H100s will look like in just a few years.
        
               | ijidak wrote:
               | Yeah, exactly! I've got some 286, 386, and 486 CPUs that
               | I want to claim as collateral!
        
               | baxtr wrote:
               | Not sure I do tbh.
               | 
               | Any asset depreciates over time. But they usually get
               | replaced.
               | 
               | My 286 was replaced by a faster 386 and that by an even
               | faster 486.
               | 
               | I'm sure you see a naming pattern there.
        
               | kgwgk wrote:
               | > Any asset depreciates over time.
               | 
               | That's why "those chips are very valuable" is not
               | necessarily a good way to value companies - and it isn't
               | if they can extract the value from the chips before they
               | become worthless.
               | 
               | > But they usually get replaced.
               | 
               | They usually produce enough income to cover depreciation
               | so you actually have the cash to replace them.
        
               | littlestymaar wrote:
               | And that's why such assets represent only a marginal
               | part of valuation. (And if you look at accounting, this
               | depreciation is usually done over three years for IT
               | hardware, and as such most of these chips have already
               | lost half of their accounting value in the balance
               | sheet).
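               | 
               | The "lost half" claim follows if the three-year schedule
               | is straight-line, which is an assumption here (the
               | comment only says "over three years"). A sketch, with a
               | made-up purchase price and a hypothetical `book_value`
               | helper:

```python
def book_value(purchase_price, age_years, useful_life_years=3.0, salvage=0.0):
    """Straight-line book value of an asset at a given age (assumed schedule)."""
    if age_years <= 0:
        return purchase_price
    if age_years >= useful_life_years:
        return salvage
    depreciable = purchase_price - salvage
    return purchase_price - depreciable * age_years / useful_life_years


# Hypothetical 30,000 purchase price, chosen only to make the numbers round.
for age in (0, 1.5, 3):
    print(age, book_value(30_000, age))
```

               | Under these assumptions a chip bought 18 months ago is
               | already carried at half its purchase price, regardless of
               | what it would fetch on the market.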
        
               | baq wrote:
               | My 1070 was replaced by... nothing, I moved it from a
               | haswell box to an alder lake box.
               | 
               | Given that inference time will soon be extremely valuable
               | with agents and <thinking> models, H100s may yet be worth
               | something in a couple years.
        
               | sixothree wrote:
               | Year over year gains in computing continue to slow. I
               | think we keep forgetting that when talking about these
               | things as assets. The thing controlling their value is
               | the supply which is tightly controlled like diamonds.
        
               | adrianN wrote:
               | They have a fairly limited lifetime even if progress
               | stands still.
        
               | throwup238 wrote:
               | Last I checked AWS 1-year reserve pricing for an 8x H100
               | box more than pays for the capital cost of the whole box,
               | power, and NVIDIA enterprise license, with thousands left
               | over for profit. On demand pricing is even worse. For
               | cloud providers these things pay for themselves quickly
               | and print cash afterwards. Even the bargain basement
               | $2/GPU/hour pays it off in under two years.
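               | 
               | The break-even implied by that last figure can be checked
               | with simple arithmetic. The sketch below assumes 100%
               | utilization unless stated otherwise and does not assume
               | any particular box price; `gross_revenue` and
               | `payback_years` are illustrative names.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def gross_revenue(gpus, price_per_gpu_hour, years, utilization=1.0):
    """Rental income before any operating costs."""
    return gpus * price_per_gpu_hour * HOURS_PER_YEAR * years * utilization


def payback_years(all_in_cost, gpus, price_per_gpu_hour, utilization=1.0):
    """Years of rental income needed to cover a given all-in cost."""
    return all_in_cost / gross_revenue(gpus, price_per_gpu_hour, 1, utilization)


two_year_gross = gross_revenue(8, 2.0, 2)
print(two_year_gross)                          # income from 8 GPUs at $2/hour over 2 years
print(payback_years(two_year_gross, 8, 2.0, utilization=0.5))
```

               | In other words, "under two years" at $2/GPU/hour holds
               | only if the all-in cost of the box stays below roughly
               | $280k and the GPUs are rented nearly all the time; at
               | half utilization the same cost takes four years to
               | recover.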
        
               | sdenton4 wrote:
               | Labor! You need it to turn the bill of sale into a data
               | center and keep it running. The bargain basement would be
               | even cheaper otherwise...
        
               | spamizbad wrote:
               | > Year over year gains in computing continue to slow.
               | 
               | This isn't true in the AI chip space (yet). And so much
               | of this isn't just about compute but about the memory.
        
               | eek2121 wrote:
               | From a per mm2 performance standpoint things absolutely
               | have slowed considerably. Gains are primarily being eked
               | out via process advantage (which has slowed down) and
               | larger chips (which has an ever-shrinking limit depending
               | on the tech used)
               | 
               | Chiplets have slowed the slowdown in AI, but you can see
               | in the gaming space how much things have slowed to get an
               | idea of what is coming for enterprise.
        
               | ijidak wrote:
               | Honestly, I don't fully understand the reason for this
               | shortage.
               | 
               | Isn't it because we insist on only using the latest nodes
               | from a single company for manufacture?
               | 
               | I don't understand why we can't use older process nodes
               | to boost overall GPU making capacity.
               | 
               | Can't we have tiers of GPU availability?
               | 
               | Why is Nvidia not diversifying aggressively to Samsung
               | and Intel no matter the process node.
               | 
               | Can someone explain?
               | 
               | I've heard packaging is also a concern, but can't you get
               | Intel to figure that out with a large enough commitment?
        
               | nl wrote:
               | > Isn't it because we insist on only using the latest
               | nodes from a single company for manufacture?
               | 
               | TSMC was way ahead of anyone else introducing 5nm.
               | There's a long lead time porting a chip to a new process
               | from a different manufacturer.
               | 
               | > I don't understand why we can't use older process nodes
               | to boost overall GPU making capacity.
               | 
               | > Can't we have tiers of GPU availability?
               | 
               | NVidia do this. You can get older GPUs, but more
               | performance is better for performance sensitive
               | applications like training or running LLMs.
               | 
               | Higher performance needs better manufacturing processes.
        
               | ecocentrik wrote:
               | That is the wrong take. Depreciated and burned out chips
               | are replaced and a total compute value is typically
               | increased over time. Efficiency gains are also calculated
               | and projected over time. Seasons are inevitable and
               | cyclical. Spring might be here but winter is coming.
        
             | SecretDreams wrote:
             | > It was a good way to value these companies according to
             | their compute size since those chips are very valuable.
             | 
             | Are they actually, though? Presently yes, but are they
             | actually driving ROI? Or just an asset nobody really is
             | meaningfully utilizing, but helps juice the stocks?
        
             | ijidak wrote:
             | I asked this elsewhere, but, I don't fully understand the
             | reason for the critical GPU shortage.
             | 
             | Isn't it because NVIDIA insists on only using the latest
             | nodes from a single company (TSMC) for manufacture?
             | 
             | I don't understand why we can't use older process nodes to
             | boost overall GPU making capacity.
             | 
             | Can't we have tiers of GPU availability, some on cutting
             | edge nodes, others built on older Intel and Samsung nodes?
             | 
             | Why is Nvidia not diversifying aggressively to Samsung and
             | Intel no matter the process node.
             | 
             | Can someone explain?
             | 
             | I've heard packaging is also a concern, but can't you get
             | Intel to figure that out with a large enough commitment?
             | 
             | (Also, I know NVIDIA has some capacity on Samsung. But why
             | not go all out, even using Global Foundries?)
        
             | aorloff wrote:
             | If you are a cloud provider renting them out
             | 
             | Otherwise you better keep them humming trying to find a
             | business model because they certainly aren't getting any
             | newer as chips
        
             | dghlsakjg wrote:
             | That's a great way to value a company that is going
             | bankrupt.
             | 
             | But, I'm not going to value an operating construction
             | company based on how many shovels or excavators they own.
             | I'm going to want to see them putting those assets to
             | productive use.
        
           | JKCalhoun wrote:
           | So, "No one was ever fired for ... buying more server
           | infrastructure."
        
             | genewitch wrote:
             | Walmart has massive, idle datacenters full of running
             | machines doing nothing.
        
           | tyfon wrote:
           | I don't think you need to speculate too hard. On CNBC they
           | are not tracking revenue, profits or technical breakthroughs,
           | but how much the big companies are spending (on gpus). That's
           | the metric!
        
             | Mistletoe wrote:
             | This feels like one of those stats they show from 1929 and
             | everyone is like "and they didn't know they were in a
             | bubble?"
        
             | RobotToaster wrote:
             | "But tulip sales keep increasing!"
        
             | LeifCarrotson wrote:
             | I probably don't have to repeat it, but this is a perfect
             | example of Goodhart's Law: when a metric is used as a
             | target, it loses its effectiveness as a metric.
             | 
             | If you were a reporter who didn't necessarily understand
             | how to value a particular algorithm or training operation,
             | but you wanted a simple number to compare the amount of
             | work OpenAI vs. Google vs Facebook are putting into their
             | models, yeah, it makes sense. How many petaflops their
             | datacenters are churning through in aggregate is probably
             | correlated to the thing you're trying to understand. And
             | it's probably easier to look at their financials and
             | correlate how much they've spent on GPUs to how many
             | petaflops of compute they need.
             | 
             | But when your investors are giving you more money based on
             | how well they perceive you're doing, and their perception
             | is not an oracle but is instead directly based on how much
             | money you're spending... the GPUs don't actually need to do
             | anything other than make number go up.
        
             | B56b wrote:
             | They absolutely are tracking revenues/profits on CNBC, what
             | are you talking about?
        
             | ur-whale wrote:
             | > but how much the big companies are spending (on gpus).
             | That's the metric!
             | 
             | Burn rate based valuations!
             | 
             | The 2000's are back in full force!
        
         | jerf wrote:
         | This claim is mathematically nonsensical. It implies a more-or-
         | less linear relationship, that more is always better. But
         | there's no reason to limit that to H100s. Conventional servers
         | are, if anything, rather more established in their ability to
         | generate value, by which I mean, however much potential AI
         | servers may have to be more important than conventional servers
         | that they may manifest in the future, we know how to use
         | conventional servers to generate value _now_.
         | 
         | And thus, by this logic, every company in the world should just
         | be buying as many servers as they can get their hands on,
         | because More Servers = More Value.
         | 
         | Obviously, this is not happening. It doesn't take much analysis
         | to start listing the many and manifold reasons why. Many of
         | those reasons will apply to GPUs as well. Just as if everything
         | in AWS got 10x faster, overnight, this would not create a
         | situation where everyone suddenly starts grabbing _more_
         | servers in AWS. Obviously everyone would start trimming down,
         | even if perhaps in a few years' time they'd find some way to
         | use this burst of power such that they can use more later. This
         | can't happen overnight, though. It would take time, and not
         | "weeks" or "months" but "years" at scale.
         | 
         | Incorporating the important variable of _time_ in the analysis,
         | if AIs become literally hundreds of times cheaper to run,
         | _today_, then it is perfectly logical that the near-term
         | demand for the hardware to run them is also going to go way,
         | way down. However much _potential_ AI may have, it is fairly
         | clear looking out at the AI landscape right now that there isn't
         | really anyone out there unlocking vast amounts of value and
         | sitting there wringing their hands because they just can't get
         | more GPU compute. The GPU rush has been from fear that someone
         | will figure out how to "really" unlock AI and then they'll be
         | stuck without the hardware to compete.
         | 
         | It may be the case that vastly cheaper AI will in fact be part
         | of unlocking that value, and that as the AI industry grows it
         | will grow faster as a result... but that's still going to be on
         | a multi-year time frame, not a tomorrow time frame. And all
         | those GPUs and all those valuations are still broadly based on
         | them being valuable _real soon now_, not in a few years, and
         | all those GPU purchases are on the assumption they need them
         | _now_, or on a timeframe where we can't be waiting around,
         | rather than waiting for some rounds of exponential doublings to
         | bring price down. The hardware curve in 5 years may be higher
         | but the curve in the next _year_ would be lower, and by a
         | _lot_.
         | 
         | And, you know, who's to say we're done? I doubt there's another
         | 100x in there, but is someone going to eke out another 2x
         | improvement? Or a 10x improvement? Making it easier to run lots
         | of experiments makes it much more likely for that to happen.
         | I'm skeptical of another 10x general improvement but 10x
         | improvements for specific, important use cases I can't rule
         | out.
         | 
         | Edit: I should also point out this is an extremely common
         | pattern in technology in general. Often the very hardest part
         | is producing a thing that does a particular task _at all_. Once
         | we have it in hand, once we can use it and learn how it
         | operates and what its characteristic operating modes are, once
         | we can try modifications to it in the real world and see what
         | happens, _optimizing_ it becomes much easier, sometimes
         | explosively so by comparison. Taking any first iteration of a
         | tech that is practical and then trying to straight-line demand
         | based on it is silly, in all sorts of ways and all directions.
         | The internal combustion engine, for example, has had a myriad
         | of impacts on the world and certainly after various
         | improvements many, many millions if not billions of them have
         | been made... but any company that reacted to the first couple
         | of cars and just went _ballistic_ buying those first-generation
         | internal combustion engines would have lost everything, and
         | rather quickly.
        
         | deadbabe wrote:
         | For starters every employee has an H100 under their desk.
        
       | HenryBemis wrote:
       | > Going forward, it'll be nearly impossible to prevent distealing
       | (unauthorized distilling). One thousand examples is definitely
       | within the range of what a single person might do in normal
       | usage, no less ten or a hundred people. I doubt that OpenAI has a
       | realistic path to preventing or even detecting distealing outside
       | of simply not releasing models.
       | 
       | (sorry for the long quote)
       | 
       | I will say (naively perhaps) "oh but that is fairly simple". For
       | any API request, add a counter of 5 seconds to the next for
       | 'unverified' users. Make the "blue check" (a-la X/Twitter). For
       | the 'big sales' have a third-party vetting process so that if US
       | Corporation XYZ wants access, they prove themselves worthy/not
       | Chinese competition and then you do give them the 1000/min deal.
       | 
       | For everyone else, add the 5 second (or whatever other duration
       | makes sense) timer/overhead and then see them drop from 1000
       | requests per minute to 500 per day. Or just cap them at 500 per
       | day and close that back-door. And if you get 'many cheap
       | accounts' doing hand-overs (AccountA does 1-500, AccountB does
       | 501-1000, AccountC does 1001-1500, and so on) then you mass block
       | them.
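
The throttling idea in this comment can be sketched roughly as follows. This is a minimal illustration, not any provider's actual policy: the class names, the 5-second cooldown, and the 500-per-day cap are assumptions taken from the numbers in the comment, and the sketch does not try to detect the "many cheap accounts" hand-over pattern.

```python
from dataclasses import dataclass, field


@dataclass
class AccountState:
    verified: bool = False      # True for vetted ("blue check") customers
    day: int = -1               # day index of the last counted request
    used_today: int = 0         # requests counted on that day
    next_allowed: float = 0.0   # earliest time (seconds) the next request may run


@dataclass
class Throttle:
    delay_s: float = 5.0        # assumed cooldown for unverified accounts
    daily_cap: int = 500        # assumed daily cap for unverified accounts
    accounts: dict = field(default_factory=dict)

    def allow(self, account_id: str, now: float) -> bool:
        """Return True if the request may proceed at time `now` (seconds)."""
        acct = self.accounts.setdefault(account_id, AccountState())
        if acct.verified:
            return True  # vetted customers skip both limits
        day = int(now // 86400)
        if day != acct.day:
            # First request of a new day: reset the daily counter.
            acct.day, acct.used_today = day, 0
        if now < acct.next_allowed or acct.used_today >= self.daily_cap:
            return False
        acct.used_today += 1
        acct.next_allowed = now + self.delay_s
        return True
```

Note that a 5-second spacing on its own still permits 86400 / 5 = 17,280 requests per day, so in this sketch it is the explicit daily cap, not the delay, that holds an unverified account to 500.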
        
       | mark_l_watson wrote:
       | Off topic, but I just bookmarked Tim's blog, great stuff.
       | 
       | I dismissed the X references to S1 without reading them, big
       | mistake. I have been working generally in AI for 40 years and
       | neural networks for 35 years and the exponential progress since
       | the hacks that make deep learning possible has been breathtaking.
       | 
       | Reduction in processing and memory requirements for running
       | models is incredible. I have been personally struggling with
       | creating my own LLM-based agents with weaker on-device models (my
       | same experiments usually work with 4o-mini and above models) but
       | either my skills will get better or I can wait for better on
       | device models.
       | 
       | I was experimenting with the iOS/iPadOS/macOS app On-Device AI
       | last night and the person who wrote this app was successful in
       | combining web search tool calling working with a very small model
       | - something that I have been trying to perfect.
        
       | cowsaymoo wrote:
       | The part about taking control of a reasoning model's output
       | length using <think></think> tags is interesting.
       | 
       | > In s1, when the LLM tries to stop thinking with "</think>",
       | they force it to keep going by replacing it with "Wait".
       | 
       | I had found a few days ago that this let you 'inject' your own
       | CoT and jailbreak it easier. Maybe these are related?
       | 
       | https://pastebin.com/G8Zzn0Lw
       | 
       | https://news.ycombinator.com/item?id=42891042#42896498
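
For readers who want to see the quoted trick concretely, here is a minimal sketch of suppressing an early "</think>" and appending "Wait". It is an illustration under stated assumptions, not the s1 authors' code: `generate` is a hypothetical stand-in for a model call, `toy_model` is invented, and a word count stands in for a token count.

```python
from typing import Callable

END_THINK = "</think>"


def think_with_budget(generate: Callable[[str], str], prompt: str,
                      min_words: int, max_forced: int = 2,
                      max_calls: int = 16) -> str:
    """Extend a reasoning trace until it has at least `min_words` words.

    `generate` receives the text so far and returns a continuation that
    may end with END_THINK when the model wants to stop thinking.
    """
    trace, forced = "", 0
    for _ in range(max_calls):
        chunk = generate(prompt + trace)
        if not chunk.endswith(END_THINK):
            trace += chunk  # model is still thinking; keep accumulating
            continue
        body = chunk[: -len(END_THINK)]
        too_short = len((trace + body).split()) < min_words
        if too_short and forced < max_forced:
            # Drop the end marker and nudge the model to keep going.
            trace += body + " Wait,"
            forced += 1
            continue
        return trace + chunk
    return trace + END_THINK  # call budget exhausted: close the trace ourselves


def toy_model(text: str) -> str:
    # Invented behavior: every call emits one short step and tries to stop.
    return " 2 + 2 = 4." + END_THINK
```

With `min_words=10`, the toy model's first attempt to stop (5 words) is overridden once, and the second attempt is accepted. The `max_forced` limit keeps a model that always stops early from looping indefinitely.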
        
         | causal wrote:
         | This even points to a reason why OpenAI hides the "thinking"
         | step: it would be too obvious that the context is being
         | manipulated to induce more thinking.
        
       | Havoc wrote:
       | The point about agents to conceal access to the model is a good
       | one.
       | 
       | Hopefully we won't lose all access to models in future
        
       | cyp0633 wrote:
       | Qwen's QvQ-72B does many more "wait"s than other LLMs with CoT I
       | tried; maybe they've somewhat used that trick already?
        
       | theturtletalks wrote:
       | Deepseek R1 uses <think/> and wait and you can see it in the
       | thinking tokens second guessing itself. How does the model know
       | when to wait?
       | 
       | These reasoning models are feeding more to OP's last point about
       | NVidia and OpenAI data centers not being wasted since reasoning
       | models require more tokens and faster tps.
        
         | qwertox wrote:
         | Probably when it would expect a human to second guess himself,
         | as shown in literature and maybe other sources.
        
         | UncleEntity wrote:
         | From playing around they seem to 'wait' when there's a
         | contradiction in their logic.
         | 
         | And I think the second point is due to The Market thinking
         | there is no need to spend ever increasing amounts of compute to
         | get to the next level of AI overlordship.
         | 
         | Of course Jevons paradox is also all in the news these days.
        
       | pona-a wrote:
       | If chain of thought acts as a scratch buffer by providing the
       | model more temporary "layers" to process the text, I wonder if
       | making this buffer a separate context with its own separate FFN
       | and attention would make sense; in essence, there's a
       | macroprocess of "reasoning" that takes unbounded time to
       | complete, and then there's a microprocess of describing this
       | incomprehensible stream of embedding vectors in natural language,
       | in a way returning to the encoder/decoder architecture but where
       | both are autoregressive. Maybe this would give us a denser
       | representation of said "thought", not constrained by imitating
       | human text.
        
         | bluechair wrote:
         | I had this exact same thought yesterday.
         | 
         | I'd go so far as to add one more layer to monitor this one and
         | stop adding layers. My thinking is that this meta awareness is
         | all you need.
         | 
         | No data to back my hypothesis up. So take it for what it's
         | worth.
        
           | larodi wrote:
           | My thought on the same guess being - all tokens live in same
           | latent space or in many spaces and each logical units train
           | separate of each other...?
        
         | whimsicalism wrote:
         | > this incomprehensible stream of embedding vectors as natural
         | language explanation, in a way returning to encoder/decoder
         | architecture
         | 
         | this is just standard decoding, the stream of vectors is called
         | the k/v cache
        
         | easeout wrote:
         | Here's a paper your idea reminds me of.
         | https://arxiv.org/abs/2501.19201
         | 
         | It's also not so far from Meta's large concept model idea.
        
           | pona-a wrote:
           | Previous discussion:
           | 
           | [41 comments, 166 points]
           | https://news.ycombinator.com/item?id=42919597
        
         | bloomingkales wrote:
         | Once we train models on the chain of thought outputs, next
         | token prediction can solve the halting problem for us (eg, this
         | chain of thinking matches _this other_ chain of thinking).
        
           | psadri wrote:
           | I think that is how human brains work. When we practice, at
           | first we have to be deliberate (thinking slow). Then we
           | "learn" from our own experience and it becomes muscle memory
           | (thinking fast). Of course, it increases the odds we are
           | wrong.
        
             | bloomingkales wrote:
             | Or worse, we incorrectly overweight the wrong chain of
             | thinking to an irrelevant output (but pragmatically useful
             | output), at scale.
             | 
             | For example, xenophobia as a response to economic hardship
             | is the wrong chain of thinking embedded in the larger
             | zeitgeist.
        
         | jjk7 wrote:
         | Comments on a google doc? Nesting in social media comments?
         | 
         | Seems like similar concepts. I think there is some potential to
         | improving how LLMs improve and further their own reasoning
         | lines, but I'm no AI mage.
        
         | zoogeny wrote:
         | I've had an idea since I was a kid which I can share. I was
         | contemplating AI and consciousness generally, probably around
         | the time I read "The Mind's I".
         | 
         | I reflected on the pop-psychology idea of consciousness and
         | subconsciousness. I thought of each as an independent stream of
         | tokens, like stream of consciousness poetry. But along the
         | stream there were joining points between these two streams,
         | points where the conscious stream was edited by the
         | subconscious stream. You could think of the subconscious stream
         | as performing CRUD like operations on the conscious stream. The
         | conscious stream would act like a buffer of short-term memory
         | while the subconscious stream would act like a buffer of long-
         | term memory. Like, the subconscious has instructions related to
         | long-term goals and the conscious stream has instructions
         | related to short-term goals.
         | 
         | You can imagine perception as input being fed into the
         | conscious stream and then edited by the subconscious stream
         | before execution.
         | 
         | It seems entirely possible to actually implement this idea in
         | this current day and age. I mean, it was a fever dream as a
         | kid, but now it could be an experiment!
        
       | sambull wrote:
       | That sovereign wealth fund with tik tok might set a good
       | precedent; when we have to 'pour money' into these companies we
       | can do so with stake in them held in our sovereign wealth fund.
        
         | TehCorwiz wrote:
         | Extra-legal financial instruments meant to suck money from
         | other federal departments don't strike me as a good precedent
         | in any sense. I don't disagree though that nationalizing the
         | value of enormous public investments is something we should be
         | considering, looking at you oil industry. But until congress
         | appropriates the money under law it's a pipe dream or theft.
        
       | ipnon wrote:
       | All you need is attention and waiting. I feel like a zen monk.
        
       | jebarker wrote:
       | S1 (and R1 tbh) has a bad smell to me or at least points towards
       | an inefficiency. It's incredible that a tiny number of samples
       | and some inserted <wait> tokens can have such a huge effect on
       | model behavior. I bet that we'll see a way to have the network
       | learn and "emerge" these capabilities during pre-training. We
       | probably just need to look beyond the GPT objective.
        
         | pas wrote:
         | can you please elaborate on the wait tokens? what's that? how
         | do they work? is that also from the R1 paper?
        
           | jebarker wrote:
           | The same idea is in both the R1 and S1 papers (<think> tokens
           | are used similarly). Basically they're using special tokens
           | to mark in the prompt where the LLM should think more/revise
           | the previous response. This can be repeated many times until
           | some stop criteria occurs. S1 manually inserts these with
           | heuristics, R1 learns the placement through RL I think.
        
             | whimsicalism wrote:
             | ? they're not special tokens really
        
               | jebarker wrote:
               | i'm not actually sure whether they're special tokens in
               | the sense of being in the vocabulary
        
               | whimsicalism wrote:
               | <think> might be; i think "wait" is tokenized like any
               | other in the pretraining
        
           | throwaway314155 wrote:
           | There's a decent explanation in the article, just FYI.
        
         | sfink wrote:
         | I agree, but LLMs in general have a horrendously bad smell in
         | terms of efficiency. s1 and r1 are just proving it.
         | 
         | The models' latent spaces are insanely large. The vast, vast
         | majority pretty much has to be irrelevant and useless, it's
         | just that the training commandeers random fragments of that
         | space to link up the logic they need and it's really hard to
         | know which of the weights are useless, which are useful but
         | interchangeable with other weights, and which are truly load-
         | bearing. You could probably find out easily by testing the
         | model against every possible thing you ever might want it to
         | do, just as soon as someone gets around to enumerating that
         | non-enumerable collection of tasks.
         | 
         | These bogus <wait> tokens kind of demonstrate that the models
         | are sort of desperate to escape the limitations imposed by the
         | limited processing they're allowed to do -- they'll take
         | advantage of thinking time even when it's provided in the
         | silliest manner possible. It's amazing what you can live on if
         | it's all you have!
         | 
         | (Apologies for the extended anthropomorphizing.)
        
       | light_hue_1 wrote:
       | S1 has no relationship to R1. It's a marketing campaign for an
       | objectively terrible and unrelated paper.
       | 
       | S1 is fully supervised by distilling Gemini. R1 works by
       | reinforcement learning with a much weaker judge LLM.
       | 
       | They don't follow the same scaling laws. They don't give you the
       | same results. They don't have the same robustness. You can use R1
       | for your own problems. You can't use S1 unless Gemini works
       | already.
       | 
       | We know that distillation works and is very cheap. This has been
       | true for a decade; there's nothing here.
       | 
       | S1 is a rushed hack job (they didn't even run most of their
       | evaluations with an excuse that the Gemini API is too hard to
       | use!) that probably existed before R1 was released and then
       | pivoted into this mess.
        
       | bloomingkales wrote:
       | This thing that people are calling "reasoning" is more like
       | rendering to me really, or multi pass rendering. We're just
       | refining the render, there's no reasoning involved.
        
         | dleslie wrote:
         | That was succinct and beautifully stated. Thank you for the
         | "Aha!" moment.
        
           | bloomingkales wrote:
           | Hah. You should check out my other comment on how I think
           | we're obviously in a simulation (remember, we just need to
           | see a good enough render).
           | 
           | LLMs are changing how I see reality.
        
         | mistermann wrote:
         | "...there's no reasoning involved...wait, could I just be
         | succumbing to my heuristic intuitions of what is (seems to be)
         | true....let's reconsider using System 2 thinking..."
        
           | bloomingkales wrote:
           | Or there is no objective reality (well there isn't, check out
           | the study), and reality is just a rendering of the few state
           | variables that keep track of your simple life.
           | 
           | A little context about you:
           | 
           | - person
           | 
           | - has hands, reads HN
           | 
           | These few state variables are enough to generate a believable
           | enough frame in your rendering.
           | 
           | If the rendering doesn't look believable to you, you modify
           | state variables to make the render more believable, eg:
           | 
           | Context:
           | 
           | - person
           | 
           | - with hands
           | 
           | - incredulous demeanor
           | 
           | - reading HN
           | 
           | Now I can render you more accurately based on your
           | "reasoning", but truly I never needed all that data to see
           | you.
           | 
           | Reasoning as we know it could just be a mechanism to fill in
           | gaps in obviously sparse data (we absolutely do not have all
           | the data to render reality accurately, you are seeing an
           | illusion). Go reason about it all you want.
        
             | mistermann wrote:
             | Is this a clever rhetorical trick to make it appear that
             | your prior claim was correct?
             | 
             | If not: what am I intended to take away from this? What is
             | its relevance to my comment?
        
               | bloomingkales wrote:
               | You made a joke about questioning reality, I simply
               | entertained it. You can do whatever you want with it,
               | wasn't a slight at all.
        
               | mistermann wrote:
               | It may have been in the form of a joke, but I certainly
               | wasn't joking.
               | 
               | I think it is interesting what actions cannot be done by
               | humans.
        
         | ddrdrck_ wrote:
         | We could see it the other way around: what we call "reasoning"
         | may actually be some kind of multipass rendering, whether it
         | is performed by computers or human brains.
        
           | bloomingkales wrote:
           | Yes, of course. The implications are awesome.
        
         | pillefitz wrote:
         | Which is related to multistage/ hierarchical/coarse-to-fine
         | optimization, which is a pretty good way to find the global
         | optimum in many problem domains.
        
         | buyucu wrote:
         | Yes.
         | 
         | Before LLMs we had N-Gram language models. Many tasks like
         | speech recognition worked as beam search in the graph defined
         | by the ngram language model. You could easily get huge accuracy
         | gains simply by pruning your beam less.
         | 
         | s1 reminds me of this. You can always trade off latency for
         | accuracy. Given these LLMs are much more complex than good old
         | N-Grams, we're just discovering how to do this trade.
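
The "prune your beam less" trade-off can be illustrated with a tiny beam search. This is a toy sketch: the bigram table `LOGP` and its probabilities are invented for illustration and have nothing to do with any real speech recognizer.

```python
import math

# Hypothetical bigram model: log-probability of the next symbol given the
# previous one. The numbers are made up so that the greedy path is not the best.
LOGP = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"x": math.log(0.9), "y": math.log(0.1)},
    "x": {},
    "y": {},
}


def beam_search(width: int, steps: int = 2):
    """Return the best path found and its probability for a given beam width."""
    beams = [(0.0, ["<s>"])]
    for _ in range(steps):
        candidates = [
            (score + lp, path + [sym])
            for score, path in beams
            for sym, lp in LOGP[path[-1]].items()
        ]
        # Pruning step: keep only the `width` best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    score, path = beams[0]
    return path, math.exp(score)
```

With `width=1` the search commits to "a" and ends at probability 0.30; with `width=2` it keeps "b" alive and finds the path through "b" at 0.36. The wider beam does twice the work per step, which is the latency-for-accuracy trade the comment describes.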
        
           | bloomingkales wrote:
           | Let me carry that concept, "learning to do this trade", it's
           | a new trade.
           | 
           | I don't believe computer science has the algorithms to handle
           | this new paradigm. Everything was about sequential
           | deterministic outputs, and clever ways to do it fast. This
           | stuff is useless at the moment. We need new thinkers on how
           | to not think sequentially or how not to think about the
           | universe in such a small way.
           | 
           | Verifying input/output pairs is the old way. We need to
           | understand differently going forward.
        
         | LordDragonfang wrote:
         | How are you defining "reasoning"?
         | 
         | Because I see these sorts of gnostic assertions about LLMs all
         | the time about how they "definitely aren't doing <thing we
         | normally apply to meat-brains>" by gesturing at the technical
         | things it's doing, with no attempts to actually justify the
         | negative assertion.
         | 
         | It often comes across as privileged reason trying to justify
         | that of course the machine isn't doing some ineffable thing
         | only meat-brains do.
        
       | whimsicalism wrote:
       | this isn't rlvr and so sorta uninteresting, they are just
       | distilling the work already done
        
       | bloomingkales wrote:
       | If an LLM output is like a sculpture, then we have to sculpt it.
       | I never did sculpting, but I do know they first get the clay
       | spinning on a plate.
       | 
       | Whatever you want to call this "reasoning" step, ultimately it
       | really is just throwing the model into a game loop. We want to
       | interact with it on each tick (spin the clay), and sculpt every
       | second until it looks right.
       | 
       | You will need to loop against an LLM to do just about anything
       | and everything, forever - this is the default workflow.
       | 
       | Those who think we will quell our thirst for compute have another
       | thing coming, we're going to be insatiable with how much LLM
       | brute force looping we will do.
        
         | MrLeap wrote:
         | This is a fantastic insight and really has my gears spinning.
         | 
         | We need to cluster the AI's insights on a spatial grid hash,
         | give it a minimap with the ability to zoom in and out, and give
         | it the agency to try and find its way to an answer and build up
         | confidence and tests for that answer.
         | 
         | coarse -> fine, refine, test, loop.
         | 
         | Maybe a parallel model that handles the visualization stuff. I
         | imagine its training would look more like computer vision. Mind
         | palace generation.
         | 
         | If you're stuck or your confidence is low, wander the palace
         | and see what questions bubble up.
         | 
         | Bringing my current context back through the web is how I think
         | deeply about things. The context has the authority to reorder
         | the web if it's "epiphany grade".
         | 
         | I wonder if the final epiphany at the end of what we're
         | creating is closer to "compassion for self and others" or "eat
         | everything."
        
         | zoogeny wrote:
         | I can't believe this hasn't been done yet, perhaps it is a cost
         | issue.
         | 
         | My literal first thought about AI was wondering why we couldn't
         | just put it in a loop. Heck, one update per day, or one update
         | per hour would even be a start. You have a running "context",
         | the output is the next context (or a set of transformations on
         | a context that is a bit larger than the output window). Then
         | ramp that up ... one loop per minute, one per second,
         | millisecond, microsecond.
        
           | layer8 wrote:
           | Same. And the next step is that it must feed back into
           | training, to form long-term memory and to continually learn.
        
             | zoogeny wrote:
             | I analogize this with sleep. Perhaps that is what is
             | needed, 6 hours offline per day to LoRA the base model on
             | some accumulated context from the day.
        
           | int_19h wrote:
           | The hard part is coming up with a good way to grade results.
           | Which you need to update the weights based on the outcome,
           | otherwise the model will not actually learn anything.
        
             | zoogeny wrote:
             | For the "looping" I'm talking about you don't need to
             | update the weights. It is simply, old context in, new
             | context out, new context in, new-new context out, etc.
             | 
             | Of course, keeping that coherent over numerous loops isn't
             | going to be easy. No doubt there is a chance it goes off
             | the rails. So you might have a section of context that is
             | constantly stable, a section of context that updates each
             | loop, etc.
             | 
             | In the other response to my comment someone mentioned
             | eventually updating the weights (e.g. daily) and you would
             | in that case have to have some kind of loss function.
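
A minimal sketch of the "old context in, new context out" loop described here, with a fixed section and a section that is rewritten on every tick. The `step` callable is a hypothetical stand-in for a model call and `toy_step` is invented; no weights are updated.

```python
from typing import Callable, List


def run_loop(step: Callable[[str, str], str], stable: str,
             scratch: str, ticks: int) -> List[str]:
    """Feed each output back in as the next input, keeping `stable` fixed.

    `stable` plays the role of the section of context that never changes;
    `scratch` is the section that is replaced on every loop iteration.
    """
    history = [scratch]
    for _ in range(ticks):
        scratch = step(stable, scratch)  # old context in, new context out
        history.append(scratch)
    return history


def toy_step(stable: str, scratch: str) -> str:
    # Stand-in for a model call: bumps a counter held in the mutable section.
    return str(int(scratch) + 1)
```

Calling `run_loop(toy_step, "goal: count", "0", 3)` yields the successive scratch states. Keeping the history around is one simple way to notice when a real loop starts going off the rails.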
        
         | gyomu wrote:
         | > If an LLM output is like a sculpture, then we have to sculpt
         | it. I never did sculpting, but I do know they first get the
         | clay spinning on a plate.
         | 
         | That's pottery, not sculpture. Traditionally in sculpture you
         | start from a block of marble or wood, but you can also make
         | sculptures of cast bronze or welded steel (or clay, but you
         | don't use a spinning plate).
        
       | incrudible wrote:
       | Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.
       | 
       | Wait, actually 1 + 1 equals 1.
        
         | falcor84 wrote:
         | As one with teaching experience, the idea of asking a student
         | "are you sure about that?" is to get them to think more deeply
         | rather than just blurting a response. It doesn't always work,
         | but it generally does.
        
       | ALittleLight wrote:
       | At 6 dollars per run, I'm tempted to try to figure out how to
       | replicate this. I'd like to try some alternatives to "wait" -
       | e.g. "double checking..." Or write my own chains of thought.
        
         | qup wrote:
         | Like the ones they tested?
        
           | ALittleLight wrote:
           | Yes, that is what "replicate" with my own ideas means.
        
       | kittikitti wrote:
       | Thank you for this, I really appreciate this article and I
       | learned a bunch!
        
       | Aperocky wrote:
       | For all the hype about thinking models, this feels much like
       | compression in terms of information theory instead of a "takeoff"
       | scenario.
       | 
       | There is a finite amount of information stored in any large
       | model, the models are really good at presenting the correct
       | information back, and adding thinking blocks made the models even
       | better at doing that. But there is a cap to that.
       | 
       | Just like how you can compress a file by a lot, there is a
       | theoretical maximum to the amount of compression before it starts
       | becoming lossy. There is also a theoretical maximum of relevant
       | information from a model regardless of how long it is forced to
       | think.
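
The lossless-compression limit mentioned above can be made concrete with Shannon entropy. The sketch below uses a deliberately simple assumption, that bytes are independent draws from their observed frequencies; real compressors exploit context and can do better than this figure on repetitive data, so treat it as an illustration of the idea of a floor, not as a bound on any particular tool. The function name is hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data: bytes) -> float:
    """Zero-order Shannon entropy of `data`, in bits per byte.

    Under the assumption that bytes are independent, no lossless
    symbol-by-symbol code can average fewer bits per byte than this.
    """
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

Four equally likely symbols need 2 bits each and a constant stream needs none; once the model of the data is fixed, squeezing below the entropy means throwing information away.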
        
         | psadri wrote:
         | I think an interesting avenue to explore is creating
         | abstractions and analogies. If a model can take a novel
         | situation and create an analogy to one that it is familiar
         | with, it would expand its "reasoning" capabilities beyond its
         | training data.
        
         | zoogeny wrote:
         | I think this is probably accurate and what remains to be seen
         | is how "compressible" the larger models are.
         | 
         | The fact that we can compress a GPT-3 sized model into an o1
         | competitor is only the beginning. Maybe there is even more
         | juice to squeeze there?
         | 
         | But even more, how much performance will we get out of o3 sized
         | models? That is what is exciting since they are already
         | performing near PhD levels on most evals.
        
       | hidelooktropic wrote:
       | > I doubt that OpenAI has a realistic path to preventing or even
       | detecting distealing outside of simply not releasing models.
       | 
       | Couldn't they just start hiding the thinking portion?
       | 
       | It would be easy for them to do this. Currently, they already
       | provide one sentence summaries for each step of the thinking. I
       | think users would be fine, or at least stay, if it were changed to
       | provide only that.
        
         | Aperocky wrote:
         | They hid it and deepseek came up with R1 anyway, with RL on
         | only results and not even needing any of the thinking tokens
         | that OpenAI hid.
        
           | girvo wrote:
           | Which is still the funniest and most interesting result in AI
           | so far IMO. Fascinating, but sort of makes intuitive sense
           | too!
        
       | mtrovo wrote:
       | I found the discussion around inference scaling with the 'Wait'
       | hack so surreal. The fact such an ingeniously simple method can
       | impact performance makes me wonder how many low-hanging fruit
       | we're still missing. So weird to think that improvements on a
       | branch of computer science is boiling down to conjuring the right
       | incantation words, how you even change your mindset to start
       | thinking this way?
        
         | nyoomboom wrote:
         | I think a skill here is learning a bias for experimentation and
         | accepting the results one finds. Also the book "Why Greatness
         | Cannot Be Planned" showcases the kind of open ended play that
         | results in people discovering stuff like this.
        
         | cubefox wrote:
         | Now imagine where we are in 12 months from now. This article
         | from February 5 2025 will feel quaint by then. The acceleration
         | keeps increasing. It seems likely we will soon have recursive
         | self-improving AI -- reasoning models which do AI research.
         | This will accelerate the rate of acceleration itself. It sounds
         | stupid to say it, but yes, the singularity is near. Vastly
         | superhuman AI now seems to arrive within the next few years.
         | Terrifying.
        
           | gom_jabbar wrote:
           | Yes, and Accelerationism predicted this development back in
           | the 1990s, perhaps most prominently in the opening lines of
           | Nick Land's _Meltdown_ (1994) text:                 [[ ]] The
           | story goes like this: Earth is captured by a technocapital
           | singularity as renaissance rationalization and oceanic
           | navigation lock into commoditization take-off. Logistically
           | accelerating techno-economic interactivity crumbles social
           | order in auto-sophisticating machine runaway. As markets
           | learn to manufacture intelligence, politics modernizes,
           | upgrades paranoia, and tries to get a grip.
           | 
           | > reasoning models which do AI research
           | 
           | In the introduction to my research project on Accelerationism
           | [0], I write:                 Faced with the acceleration of
           | progress in Artificial Intelligence (AI) -- with AI agents
           | now automating AI research and development --,
           | Accelerationism no longer seems like an abstract philosophy
           | producing empty hyperstitional hype, but like a sober
           | description of reality. The failed 2023 memorandum to stop AI
           | development on systems more powerful than OpenAI's ChatGPT-4
           | perfectly illustrates the phenomenological aspects of
           | Accelerationism: "To be rushed by the phenomenon, to the
           | point of terminal institutional paralysis, is the
           | phenomenon." [1]
           | 
           | At the current rate of acceleration, if you don't write
           | hyperstitionally, your texts are dead on arrival.
           | 
           | [0] https://retrochronic.com/
           | 
           | [1] Nick Land (2017). _A Quick-and-Dirty Introduction to
           | Accelerationism_ in _Jacobite Magazine._
        
           | zoogeny wrote:
           | This is something I have been suppressing since I don't want
           | to become chicken little. Anyone who isn't terrified by the
           | last 3 months probably doesn't really understand what is
           | happening.
           | 
           | I went from accepting I wouldn't see a true AI in my
           | lifetime, to thinking it is possible before I die, to
           | thinking it is possible in the next decade, to thinking it
           | is probably in the next 3 years to wondering if we might see
           | it this year.
           | 
           | Just 6 months ago people were wondering if pre-training was
           | stalling out and if we hit a wall. Then deepseek drops with
           | RL'd inference time compute, China jumps from being 2 years
           | behind in the AI race to being neck-and-neck and we're all
           | wondering what will happen when we apply those techniques to
           | the current full-sized behemoth models.
           | 
           | It seems the models that are going to come out around summer
           | time may be jumps in capability beyond our expectations. And
           | the updated costs mean that there may be several open source
           | alternatives available. The intelligence that will be
           | available to the average technically literate individual will
           | be frightening.
        
             | palmotea wrote:
             | > The intelligence that will be available to the average
             | technically literate individual will be frightening.
             | 
             | That's not the scary part. The scary part is the
             | intelligence _at scale_ that could be available to the
             | _average employer_. Lots of us like to LARP that we're
             | capitalists, but _very_ few of us are. There's zero
             | ideological or cultural framework in place to prioritize
             | the well being of the general population over the profits
             | of _some_ capitalists.
             | 
             | AI, especially accelerating AI, is bad news for anyone who
             | needs to work for a living. It's not going to lead to a
             | Star Trek fantasy. It means an eventual phase change for
             | the economy that consigns us (and most consumer product
             | companies) to wither and fade away.
        
         | koala_man wrote:
         | It feels like we're back in 1900 when anyone's clever idea (and
         | implementation) can give huge performance improvements, such as
         | Ford's assembly line and Taylor's scientific management of
         | optimizing shovel sizes for coal.
        
           | andrewfromx wrote:
           | yes, it also feels like we are going to lose our just-in-time
           | global shipments of anything to anywhere any day now. It will
           | soon feel like 1900 in other ways.
        
         | xg15 wrote:
         | I think the fact alone that distillation and quantization are
         | techniques that can produce substantial improvements is a
         | strong sign that we still have no real comprehensive
         | understanding how the models work.
         | 
         | If we had, there would be no reason to train a model with more
         | parameters than are strictly necessary to represent the space's
         | semantic structure. But then it should be impossible for
         | distilled models with fewer parameters to come close to the
         | performance of the original model.
         | 
         | Yet this is what happens - the distilled or quantized models
         | often come very close to the original model.
         | 
         | So I think there are still many low-hanging fruits to pick.
        
           | teruakohatu wrote:
           | > still have no real comprehensive understanding how the
           | models work.
           | 
           | We do understand how they work, we just have not optimised
           | their usage.
           | 
           | For example, take someone who has a good general
           | understanding of how an ICE or EV car works. Even if the user
           | interface is very unfamiliar, they can figure out how to drive
           | any car within a couple of minutes.
           | 
           | But that does not mean they can race a car, drift a car or
           | drive a car on challenging terrain even if the car is
           | physically capable of all these things.
        
             | spiorf wrote:
             | We know how the next token is selected, but not why doing
             | that repeatedly brings all the capabilities it does. We
             | really don't understand how the emergent behaviours emerge.
        
             | gessha wrote:
             | Your example is somewhat inadequate. We _fundamentally_
             | don't understand how deep learning systems work in the
             | sense that they are more or less black boxes that we train
             | and evaluate. Innovations in ML are a whole bunch of
             | wizards with big stacks of money changing "Hmm" to "Wait"
             | and seeing what happens.
             | 
             | Would a different sampler help you? I dunno, try it. Would
             | a smaller dataset help? I dunno, try it. Would training the
             | model for 5000 days help? I dunno, try it.
             | 
             | Car technology is the opposite of that - it's a white box.
             | It's composed of very well defined elements whose
             | interactions are defined and explained by laws of
             | thermodynamics and whatnot.
        
         | ascorbic wrote:
         | I've noticed that R1 says "Wait," a lot in its reasoning. I
         | wonder if there's something inherently special in that token.
        
           | lionkor wrote:
           | Semantically, wait is a bit of a stop-and-breathe point.
           | 
           | Consider the text:
           | 
           | I think I'll go swimming today. Wait, ___
           | 
           | what comes next? Well, not something that would usually
           | follow without the word "wait", probably something entirely
           | orthogonal that impacts the earlier sentence in some
           | fundamental way, like:
           | 
           | Wait, I need to help my dad.
        
         | cyanydeez wrote:
         | it's fascinating how certain political movements avoid that Wait
         | moment...
        
         | kevin009 wrote:
         | There are more than 10 different ways that I know for sure will
         | improve LLMs just like `wait`. It is part of the CoT. I assume
         | most researchers know this. CoT is as old as 2019.
        
       | maksimur wrote:
       | It appears that someone has implemented a similar approach for
       | DeepSeek-R1-Distill-Qwen-1.5B:
       | https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...
       | 
       | I hope it gets tested further.
        
       | nullbyte wrote:
       | Great article! I enjoyed reading it
        
       | khazhoux wrote:
       | I have a bunch of questions, would love for anyone to explain
       | these basics:
       | 
       | * The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based
       | on very expensive oracles (if we believe DeepSeek-R1 queried
       | OpenAI's model). If these are _improvements_ on existing models,
       | why is this being reported as decimating training costs? Isn't
       | fine-tuning already a cheap way to optimize? (maybe not as
       | effective, but still)
       | 
       | * The R1 paper talks about improving one simple game - Countdown.
       | But the original models are "magic" because they can solve a
       | nearly uncountable number of problems and scenarios. How does the
       | DeepSeek / R1 approach scale to the same gigantic scale?
       | 
       | * Phrased another way, my understanding is that these techniques
       | are using existing models as black-box oracles. If so, how many
       | millions/billions/trillions of queries must be probed to
       | replicate and improve the original dataset?
       | 
       | * Is anything known about the training datasets used by DeepSeek?
       | OpenAI used presumably every scraped dataset they could get their
       | hands on. Did DS do the same?
        
         | UncleEntity wrote:
         | > If these are improvements on existing models, why is this
         | being reported as decimating training costs?
         | 
         | Because that's what gets the clicks...
         | 
         | Saying they spent a boatload of money on the initial training +
         | iteration + final fine-tuning isn't as headline grabbing as "$5
         | million trained AI beats the pants off the 'mericans".
        
         | torginus wrote:
         | If what you say is true, and distilling LLMs is easy and cheap,
         | and pushing the SOTA without a better model to rely on is dang
         | hard and expensive, then that means the economics of LLM
         | development might not be attractive to investors - spending
         | billions to have your competitors come out with products that
         | are 99% as good, and cost them pennies to train, does not sound
         | like a good business strategy.
        
       | janalsncm wrote:
       | > even the smartest people make hundreds of tiny experiments
       | 
       | This is the most important point, and why DeepSeek's cheaper
       | training matters.
       | 
       | And if you check the R1 paper, they have a section for "things
       | that didn't work", each of which would normally be a paper of its
       | own but because their training was so cheap and streamlined they
       | could try a bunch of things.
        
       | robrenaud wrote:
       | > "Note that this s1 dataset is distillation. Every example is a
       | thought trace generated by another model, Qwen2.5"
       | 
       | The traces are generated by Gemini Flash Thinking.
       | 
       | 8 hours of H100 is probably more like $24 if you want any kind of
       | reliability, rather than $6.
        
         | zaptrem wrote:
         | "You can train a SOTA LLM for $0.50" (as long as you're
         | distilling a model that cost $500m into another pretrained
         | model that cost $5m)
        
           | fizx wrote:
           | That's absolutely fantastic, because if you have 1 good idea
           | that's additive to the SOTA, you can test it for a dollar,
           | not millions
        
           | knutzui wrote:
           | The original statement stands, if what you are suggesting in
           | addition to it is true. If the initial one-time investment of
           | $505m is enough to distill new SOTA models for $0.50 a piece,
           | then the average cost for subsequent models will trend toward
           | $0.50.
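           | 
           | A quick sketch of that amortization argument, using only the
           | $505m and $0.50 figures from this subthread. The function
           | name and the model counts are made up for illustration.

```python
def average_cost(fixed_usd: float, marginal_usd: float, n_models: int) -> float:
    """Average cost per model once the one-time investment is spread out."""
    return (fixed_usd + marginal_usd * n_models) / n_models

# Figures from the thread: $505m up front, $0.50 per distilled model.
# The model counts below are hypothetical.
for n in (1, 10**6, 10**9, 10**12):
    print(f"{n:>16,d} models -> ${average_cost(505e6, 0.50, n):,.6f} each")
```

           | The average does approach $0.50, but only once the number of
           | distilled models is very large compared with the up-front
           | spend.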
        
       | nico wrote:
       | > Why did it cost only $6? Because they used a small model and
       | hardly any data.
       | 
       | > After sifting their dataset of 56K examples down to just the
       | best 1K, they found that the core 1K is all that's needed to
       | achieve o1-preview performance on a 32B model. Adding data didn't
       | raise performance at all.
       | 
       | > 32B is a small model, I can run that on my laptop. They used 16
       | NVIDIA H100s for 26 minutes per training run, that equates to
       | around $6.
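       | 
       | A back-of-envelope check of that figure. The 16 GPUs and 26
       | minutes come from the quote. The $3/hour price is only an
       | assumption for comparison, not a quoted rate.

```python
def gpu_hours(num_gpus: int, minutes: float) -> float:
    """Total GPU-hours consumed by one training run."""
    return num_gpus * minutes / 60.0

def run_cost(num_gpus: int, minutes: float, usd_per_gpu_hour: float) -> float:
    """Cost of one run at a given hourly price per GPU."""
    return gpu_hours(num_gpus, minutes) * usd_per_gpu_hour

hours = gpu_hours(16, 26)       # 16 H100s for 26 minutes, per the quote
implied_rate = 6.0 / hours      # hourly price that would make the run cost $6
print(f"GPU-hours per run: {hours:.2f}")
print(f"Implied price for a $6 run: ${implied_rate:.2f} per GPU-hour")
# ASSUMPTION: $3.00 per GPU-hour is an illustrative on-demand price.
print(f"Cost at $3.00 per GPU-hour: ${run_cost(16, 26, 3.00):.2f}")
```

       | So the $6 number only holds at well under a dollar per GPU-hour.
       | At a higher hourly price the same run costs proportionally more.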
        
       | nico wrote:
       | > In s1, when the LLM tries to stop thinking with "</think>",
       | they force it to keep going by replacing it with "Wait". It'll
       | then begin to second guess and double check its answer. They do
       | this to trim or extend thinking time (trimming is just abruptly
       | inserting "</think>")
       | 
       | I know some are really opposed to anthropomorphizing here, but
       | this feels eerily similar to the way humans work, ie. if you just
       | dedicate more time to analyzing and thinking about the task, you
       | are more likely to find a better solution
       | 
       | It also feels analogous to navigating a tree, the more time you
       | have to explore the nodes, the bigger the space you'll have
       | covered, hence higher chance of getting a more optimal solution
       | 
       | At the same time, if you have "better intuition" (better
       | training?), you might be able to find a good solution faster,
       | without needing to think too much about it
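       | 
       | A minimal sketch of the "Wait" trick as described in the quote
       | above, not the s1 authors' actual code. `next_token` and
       | `toy_model` are hypothetical stand-ins for a real decoder, and
       | tokens are plain strings to keep it self-contained.

```python
from typing import Callable, List

END_THINK = "</think>"

def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Return the thinking tokens, extended or trimmed to fit the budget."""
    thoughts: List[str] = []
    while len(thoughts) < max_tokens:
        token = next_token(prompt + thoughts)
        if token == END_THINK:
            if len(thoughts) >= min_tokens:
                break               # enough thinking: let the model stop
            token = "Wait"          # too early: swap the stop for "Wait"
        thoughts.append(token)
    # Trimming: whether the model stopped or the budget ran out,
    # close the thinking block explicitly.
    return thoughts + [END_THINK]

def toy_model(context: List[str]) -> str:
    """Hypothetical stand-in for an LLM that tries to stop every third token."""
    return END_THINK if len(context) % 3 == 2 else "step"

extended = budget_forced_thinking(toy_model, ["Q"], min_tokens=5, max_tokens=8)
trimmed = budget_forced_thinking(toy_model, ["Q"], min_tokens=10, max_tokens=4)
print(extended)
print(trimmed)
```

       | Raising `min_tokens` extends the thinking and lowering
       | `max_tokens` cuts it short, which is the knob being turned here.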
        
         | layer8 wrote:
         | What's missing in that analogy is that humans tend to have a
         | good hunch about when they have to think more and when they are
         | "done". LLMs seem to be missing a mechanism for that kind of
         | awareness.
        
           | nico wrote:
           | Great observation. Maybe an additional "routing model" could
           | be trained to predict when it's better to think more vs just
           | using the current result
        
       | janalsncm wrote:
       | I think a lot of people in the ML community were excited for Noam
       | Brown to lead the O series at OpenAI because intuitively, a lot
       | of reasoning problems are highly nonlinear i.e. they have a tree-
       | like structure. So some kind of MCTS would work well. O1/O3 don't
       | seem to use this, and DeepSeek explicitly mentioned difficulties
       | training such a model.
       | 
       | However, I think this is coming. DeepSeek mentioned it was hard
       | to learn a value model for MCTS from scratch, but this doesn't
       | mean we couldn't seed it with some annotated data.
        
         | insane-c0der wrote:
         | Do you have a reference for us to check? - "DeepSeek explicitly
         | mentioned difficulties training such a model."
        
           | janalsncm wrote:
           | Section 4.2: Unsuccessful attempts
           | 
           | https://arxiv.org/pdf/2501.12948
        
       | talles wrote:
       | Does anyone else want more articles on how those benchmarks are
       | created and how they work?
       | 
       | Those models can be trained in a way tailored to have good
       | results on specific benchmarks, making them way less general than
       | they seem. No accusation from me, but I'm skeptical of all the
       | recent so-called 'breakthroughs'.
        
       | charlieyu1 wrote:
       | > having 10,000 H100s just means that you can do 625 times more
       | experiments than s1 did
       | 
       | The larger the organisation, the fewer experiments you can afford
       | to do. Employees are mostly incentivised by getting something
       | done quickly enough not to be fired in this job market. They
       | know that the higher-ups would get them off for temporary gains.
       | Rush this deadline, ship that feature, produce something that
       | looks OK enough.
        
       | mmoustafa wrote:
       | Love the look under the hood! Especially discovering that some AI
       | hack I came up with is how the labs are doing things too.
       | 
       | In this case, I was also forcing R1 to continue thinking by
       | replacing </think> with "Okay," after augmenting reasoning with
       | web search results.
       | 
       | https://x.com/0xmmo/status/1886296693995646989
        
       ___________________________________________________________________
       (page generated 2025-02-05 23:00 UTC)