[HN Gopher] S1: A $6 R1 competitor?
       ___________________________________________________________________
        
       S1: A $6 R1 competitor?
        
       Author : tkellogg
       Score  : 808 points
       Date   : 2025-02-05 11:05 UTC (1 day ago)
        
 (HTM) web link (timkellogg.me)
 (TXT) w3m dump (timkellogg.me)
        
       | bberenberg wrote:
       | In case you're not sure what S1 is, here is the original paper:
       | https://arxiv.org/html/2501.19393v1
        
         | mi_lk wrote:
         | it's also the first link in the article's first sentence
        
           | bberenberg wrote:
           | Good call, I must have missed it. I read the whole blog then
           | went searching for what S1 was.
        
         | addandsubtract wrote:
         | It's linked in the blog post, too. In the first sentence,
         | actually, but for some reason the author never bothered to
          | attach the name to it. As if keeping track of o1, 4o, r1, and
          | r2d2 wasn't exhausting enough already.
        
           | kgwgk wrote:
           | > for some reason the author never bothered to attach the
           | name to it
           | 
           | Respect for his readers' intelligence, maybe.
        
         | rahimnathwani wrote:
          | > To enforce a minimum, we suppress the generation of the end-
          | > of-thinking token delimiter and optionally append the string
          | > "Wait" to the model's current reasoning trace to encourage
          | > the model to reflect on its current generation.
         | 
         | Does this mean that the end-of-thinking delimiter is a single
         | token? Presumably </think> or similar wasn't a single token for
         | the base model. Did they just pick a pair of uncommon single-
         | token symbols to use as delimiters?
         | 
         | EDIT: Never mind, end of thinking is represented with
         | <|im_start|> followed by the word 'answer', so the code
         | dynamically adds/removes <|im_start|> from the list of stop
         | tokens.
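          | 
          | A minimal sketch of the trick quoted above (keep suppressing
          | the delimiter and appending "Wait" until a minimum token budget
          | is spent), assuming the Hugging Face transformers generate()
          | API with stop_strings in a recent version; the repo id, chat
          | markers and budget are illustrative assumptions, not the
          | paper's exact code:
          | 
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     MODEL = "simplescaling/s1-32B"   # assumed repo id
          |     tok = AutoTokenizer.from_pretrained(MODEL)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         MODEL, torch_dtype="auto", device_map="auto")
          | 
          |     END_THINK = "<|im_start|>"  # end-of-thinking marker
          |     BUDGET = 256                # assumed minimum think tokens
          | 
          |     def answer(question):
          |         text = ("<|im_start|>user\n" + question +
          |                 "<|im_end|>\n<|im_start|>think\n")
          |         spent = 0
          |         while True:
          |             inp = tok(text, return_tensors="pt",
          |                       add_special_tokens=False).to(model.device)
          |             out = model.generate(
          |                 **inp, max_new_tokens=BUDGET,
          |                 stop_strings=[END_THINK], tokenizer=tok,
          |                 do_sample=False)
          |             new = out[0, inp["input_ids"].shape[1]:]
          |             spent += new.shape[0]
          |             # keep text up to any end-of-thinking marker
          |             text += tok.decode(new).split(END_THINK)[0]
          |             if spent >= BUDGET:
          |                 break
          |             text += "Wait"  # suppress delimiter, keep thinking
          |         return text + END_THINK + "answer\n"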
        
         | dagurp wrote:
         | I don't know what R1 is either
        
           | latexr wrote:
           | It's the DeepSeek reasoning model.
        
       | ttyprintk wrote:
       | https://huggingface.co/simplescaling
        
         | anentropic wrote:
         | and: https://github.com/simplescaling/s1
        
           | mettamage wrote:
            | When you're only used to ollama, how do you go about using
            | this model?
        
             | davely wrote:
              | I think we need to wait for someone to convert it to the
              | GGUF file format.
             | 
             | However, once that happens, you can run it (and any GGUF
             | model) from Hugging Face![0]
             | 
             | [0] https://huggingface.co/docs/hub/en/ollama
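              | 
              | (Per the linked docs, once a conversion exists it should be
              | as simple as "ollama run hf.co/{username}/{repository}",
              | with the placeholders filled in by whoever uploads the
              | GGUF.)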
        
               | mettamage wrote:
               | So this?
               | 
               | https://huggingface.co/brittlewis12/s1-32B-GGUF
        
               | withinboredom wrote:
               | oh god, this is terrible!
               | 
               | I just said "Hello!" and it went off the rails.
        
               | delijati wrote:
                | Why, how, what? Can you add a sample prompt with output?
        
               | withinboredom wrote:
               | In my original prompt, it gave me back multiple greetings
               | in various languages. In this example, it gave me back
               | just multiple greetings:
               | 
               | >>> hello!
               | 
               | Hello there! I'm here to help with any questions you
               | might have. How can I assist you today? If you're asking
               | a question, feel free to provide some details about what
               | you need help with. I'll do my best to answer your
               | question or guide you through the process. Just let me
               | know!
               | 
               | <|im_start|>think
               | 
               | The user has asked "hello! " and provided no further
               | information. Answer here is a direct response to the
               | user's request for information, which was essentially
               | just a greeting. The appropriate response is to greet
               | back and offer assistance.
               | 
               | <|im_start|>answer
               | 
               | Answer:
               | 
               | Hello! How can I assist you today? If you have a question
               | or need help with something specific, feel free to ask.
               | I'm here to support you!
               | 
               | <|im_start|>think
               | 
               | <|im_start|>answer
               | 
               | The user's message was "hello!" and there was no other
               | information provided. The appropriate response is a
               | greeting back and an offer of assistance.
               | 
               | <|im_start|>answer
               | 
               | Answer:
               | 
               | Hello! How can I assist you today? If you have a question
               | or need help with something specific, feel free to ask.
               | I'm here to support you!
               | 
               | <|im_start|>think
               | 
               | The user has greeted with "hello!" An appropriate
               | response involves acknowledging the greeting and offering
               | assistance. This could involve asking how we can help
               | them or what they'd like guidance on.
               | 
               | <|im_start|>answer
               | 
               | Answer:
               | 
               | Hello! How can I assist you today? If you have a question
               | or need help with something specific, feel free to ask.
               | I'm here to support you!
        
               | mettamage wrote:
                | I ran it; so far it seems like a pretty good model,
                | especially for something running locally.
        
               | fl0id wrote:
                | You can load the safetensors with ollama; you just have
                | to provide a Modelfile, or wait for someone to do it. In
                | theory it will also quantize the model for you, as I
                | guess most people cannot load a 129 GB model...
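                | 
                | Roughly: write a Modelfile whose FROM line points at the
                | directory containing the safetensors and config, then
                | "ollama create s1 -f Modelfile" and "ollama run s1".
                | (Sketch only; check ollama's import docs for whether
                | your version handles safetensors and quantization
                | directly.)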
        
       | yapyap wrote:
       | > If you believe that AI development is a prime national security
       | advantage, then you absolutely should want even more money poured
       | into AI development, to make it go even faster.
       | 
        | This, this is the problem for me with people deep in AI. They
        | think it's the be-all and end-all for everything. They have the
        | vision of the 'AI' they've seen in movies in mind, see the
        | current 'AI' being used, and to them it's basically almost the
        | same; their brain is mentally bridging the concepts and saying
        | it's only a matter of time.
       | 
        | To me, that's stupid. I observe the more populist and socially
        | appealing CEOs of these VC startups (Sam Altman being the
        | biggest, of course) just straight up lying to the masses, for
        | financial gain, of course.
       | 
       | Real AI, artificial intelligence, is a fever dream. This is
       | machine learning except the machines are bigger than ever before.
       | There is no intellect.
       | 
        | And the enthusiasm of the people who are into it feeds into
        | those who aren't aware of it in the slightest: they see you can
        | chat with a 'robot', they hear all this hype from their peers,
        | and they buy into it. We are social creatures, after all.
       | 
       | I think using any of this in a national security setting is
       | stupid, wasteful and very, very insecure.
       | 
       | Hell, if you really care about being ahead, pour 500 billion
        | dollars into quantum computing so you can try to break current
       | encryption. That'll get you so much further than this nonsensical
       | bs.
        
         | mnky9800n wrote:
          | Also, the narrative that we are currently on the brink of an
          | AI explosion and that this random paper shows it is the same
          | tired old story AI hawks have been handing out for years now.
          | Like yes, I agree with the general idea that more compute means
          | more progress for humans, and perhaps having a more responsive
          | user interface through some kind of AI-type technology would be
          | good. But I don't see why that will turn into Data from Star
          | Trek. I also think all these AI hawks kind of narcissistically
          | overvalue their own being. Like, blink and their lives are over
          | in the grand scheme of things. Maybe our "awareness" of the
          | world around us is an illusion provided by evolution because we
          | needed it to value self-preservation whereas other animals
          | don't. There is an inherent belief in the specialness of humans
          | that I suppose I mistrust.
        
           | encipriano wrote:
            | I find the last part of the paragraph off-putting, and I agree.
        
           | ben_w wrote:
           | > But I don't see why that will turn into Data from Star
           | Trek.
           | 
           | "Is Data genuinely sentient or is he just a machine with this
           | impression" was a repeated plot point in TNG.
           | 
           | https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre.
           | ..
           | 
           | https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_.
           | ..
           | 
           | https://en.wikipedia.org/wiki/The_Ensigns_of_Command
           | 
           | https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T.
           | ..
           | 
           | Similar with The Doctor on VOY.
           | 
           | Even then, what we have with LLMs is basically already at the
           | level of the ship's main computer as it was written in
           | TNG/DS9/VOY.
           | 
            | But what counts as personhood? No idea. We're as ignorant of
            | that as the writers of the TV show; humanity's best
           | attempts at philosophy of mind are still making the same
           | kinds of arguments today as we were a century ago, the only
           | real shift is to be slightly more inclusive of other species
           | besides Homo Sapiens, and the only real improvement over two
           | centuries ago is to include _all_ of Homo Sapiens.
        
             | mnky9800n wrote:
             | The main computer does not make choices stochastically and
             | always understands what people ask it. I do not think that
              | resembles the current crop of LLMs. On Voyager the ship's
              | computer is some kind of biological computing entity that
              | they eventually give up on as a story topic, but there is
              | an episode where the bio-computing gel packs get sick.
             | 
              | I believe Data and the Doctor would both be people to me.
              | But is Minuet? The woman created by the Bynars to distract
              | Riker so they could steal the ship? Star Trek goes back and
              | forth as to who counts as a person or not. This is one of
              | the points made in the episode where the Doctor learns his
              | hologram has been cloned many times to work as a miner on
              | an asteroid, and he says this is slavery. Does Minuet get
              | to say she doesn't want to seduce Riker if she doesn't feel
              | like it? Or is she a program that does whatever the
              | programmer wants?
             | 
              | But moreover, linking back to the original technology, does
              | EQTransformer have awareness? Do we treat it as a thinking
              | being that we enslaved to find earthquakes in data because
              | it uses the same technology as ChatGPT? I don't think
              | anyone thinks we should do this, but it seems like that is
              | where people want to take it. That's strange to me. At what
              | point do data pipelines doing some kind of stochastic
              | transformation and summarization of training data become an
              | individual that presents a desire for autonomy like Data or
              | the Doctor?
             | 
             | I think there's lots of questions here to answer and I
             | don't know the answers to them.
        
               | ben_w wrote:
               | > The main computer does not make choices stochastically
               | and always understands what people ask it.
               | 
               | The mechanism is never explained, but no, it doesn't
               | always understand correctly -- and neither does Data. If
               | hologram-Moriarty is sentient (is he?), then the
               | capability likely exceeds what current LLMs can do, but
               | the cause of the creation is definitely a
               | misunderstanding.
               | 
                | Even in the episode where that happens, the script for
                | Dr. Pulaski leading up to Moriarty's IQ boost makes
                | exactly the same arguments used against LLMs:
               | https://www.youtube.com/watch?v=4pYDy7vsCj8
               | 
               | (Common trope in that era being that computers (including
               | Data) are too literal, so there was also:
               | https://www.youtube.com/watch?v=HiIlJaSDPaA)
               | 
               | Similar with every time the crew work iteratively to
               | create something in the holodeck. And, of course:
               | https://www.youtube.com/watch?v=srO9D8B6dH4
               | 
               | > I do not think that resembles the current crop of LLMs.
               | On voyager the ships computer is some kind of biological
               | computing entity that they eventually give up on as a
               | story topic but there is an episode where the bio
               | computing gel packs get sick.
               | 
               | "Take the cheese to sickbay" is one of my favourite lines
               | from that series.
               | 
               | > But is minuet?
               | 
               | I would say the character was a puppet, with the Bynars
               | pulling the strings, because the holo-character was
               | immediately seen as lacking personhood the moment they
               | stopped fiddling with the computer.
               | 
                | Vic Fontaine was more ambiguous in that regard. He knew
                | he was "a lightbulb", but (acted like) he wanted to
                | remain within that reality, in a way that to me felt like
                | he was _programmed_ to respond as if the sim around him
                | was the only reality that mattered, rather than having
                | free will in that regard.
               | 
               | (But who has total free will? Humans are to holograms as
               | Q is to humans, and the main cast were also written to
               | reject "gifts" from Riker that time he briefly became a
               | Q).
               | 
               | The villagers of Fair Haven were, I think, not supposed
               | to be sentient (from the POV of the crew), but were from
               | the POV of the writers: https://en.wikipedia.org/wiki/Fai
               | r_Haven_(Star_Trek:_Voyager... and https://en.wikipedia.o
               | rg/wiki/Spirit_Folk_(Star_Trek:_Voyage...
               | 
               | > does eqtransformer have awareness?
               | 
               | There's too many different definitions for a single
               | answer.
               | 
               | We don't know what part of our own brains gives us the
               | sensation of our own existence; and even if we did, we
               | wouldn't know if it was the only mechanism to do so.
               | 
               | To paraphrase your own words:
               | 
                | At what point do chemical pipelines doing some kind of
                | stochastic transformation and electrochemical integration
                | of sensory input become an individual that presents a
                | desire for autonomy like Data or the Doctor?
               | 
               | I don't know. Like you, I'd say:
               | 
               | > I think there's lots of questions here to answer and I
               | don't know the answers to them.
        
         | spacebanana7 wrote:
         | > I think using any of this in a national security setting is
         | stupid
         | 
         | What about AI enabled drones and guided missiles/rockets? The
         | case for their effectiveness is relatively simple in terms of
         | jamming resistance.
        
           | pjc50 wrote:
            | Like a lot of AI boosters, would you like to explain how
            | that works, other than by magic AI dust? Some forms of
            | optical guidance are already in use, but there are other
            | limitations (lighting! weather!)
        
             | spacebanana7 wrote:
             | Sure thing. The basic idea would be:
             | 
              | 1) Have a camera on your drone
              | 
              | 2) Run some frames through a locally running version of
              | something like AWS Rekognition's celebrity identification
              | service, but for relevant military targets.
              | 
              | 3) Navigate towards the coordinates of target individuals.
              | 
              | It isn't exactly magic; here's a video of a guy doing
              | navigation with OpenCV on images:
              | https://www.youtube.com/watch?v=Nrzs3dQ9exw
        
               | Hauthorn wrote:
               | I believe this is a capability that the Switchblade 600
               | or STM KARGU already has.
               | 
               | https://en.wikipedia.org/wiki/STM_Kargu
        
           | GTP wrote:
           | This somehow reminds me of a certain killer robot from a
           | Black Mirror episode ;)
        
           | amarcheschi wrote:
            | I would say that they don't require a $500bln investment.
            | AFAIK, drones that help lock onto targets have started being
            | used in Ukraine.
        
             | spacebanana7 wrote:
              | I generally agree; piggybacking on innovations in
              | smartphone GPUs / batteries will probably be enough to get
              | locally running AI models into drones.
        
           | swiftcoder wrote:
           | drone and missile guidance system development has been using
           | ML for decades at this point. That's just as much "AI" as
           | anything currently coming out of the LLM craze.
        
             | int_19h wrote:
             | It's not just target guidance at this point. There are
             | prototypes of drone swarms, for example.
        
           | theGnuMe wrote:
           | I think jamming resistance is a red herring. AI weapons will
           | have their own failure modes due to jamming. Any sensor
            | modality will have its own particular weakness. Reasoning
            | models malfunction as well, i.e. hallucinations.
           | 
           | Not to mention false GPS etc...
        
         | pjc50 wrote:
         | > then you absolutely should want even more money poured into
         | AI development, to make it go even faster.
         | 
         | Indeed. People are welcome to go "all in" on whatever nonsense
         | gambling they want to do with their personal investments, but
         | national security demands actually thinking about things -
         | adversarially. Because the enemy will as well.
         | 
         | It's perfectly possible to lose a war by investing in expensive
         | superweapons that under deliver. The Nazis were particularly
         | bad at this.
        
         | amarcheschi wrote:
         | I couldn't agree more.
         | 
          | If we're not talking exclusively about cyber war, such as
          | finding and exploiting vulnerabilities, then for the time being
          | national security will still be based on traditional armies.
          | 
          | Just a few weeks ago, Italy announced a EUR16bln plan to buy
          | >1000 Rheinmetall IFVs. That alone would make Italy's army one
          | of the best equipped in Europe. I can't imagine what would
          | happen with a $500bln investment in defense, lol. I don't agree
          | with what Meloni's government is doing, but one of the
          | ministers I agree with most is the defense minister, Crosetto.
          | 
          | Furthermore, what is being shown, at least for the time being,
          | is that open source can be and is crucial in helping develop
          | better models. This collides with the big, single "winner takes
          | it all" VC mentality (because, let's be honest, these defense
          | pitches are still made by startup/VC bros).
        
           | piltdownman wrote:
            | > Italy announced a EUR16bln plan to buy >1000 Rheinmetall
            | IFVs. That alone would make Italy's army one of the best
            | equipped in Europe.
           | 
           | So target practice for a beyond-the-horizon missile system
           | launched ground-to-ground or air-to-ground? As an attacking
           | force, conventional ground forces and tactics are a non-
            | runner in a modern theatre of operations when faced with
           | air and drone support. This is why no single EU country is
           | incentivised into dumping money into any single area - as the
           | only probable defense would be against USA/Russia/China to
           | begin with.
           | 
           | The US proved it beyond doubt in Afghanistan - partisans
           | simply haven't a chance against a gunship with IR or NV
           | optics; the last time they levelled the playing field against
           | air interdictors was in Charlie Wilson's Afghanistan when the
           | Mujahideen took on that era of Soviet gunships with hand-held
           | AA systems.
        
             | amarcheschi wrote:
              | Italy has had a joke of a tank fleet in the last
              | decade(s): tanks that clearly weren't on par with the other
              | European alternatives, while the defense budget went to
              | other places, namely the navy and air force. Since some of
              | the Italian tanks are being decommissioned, there was a
              | need for a new tank, but not having invested much in tanks
              | had the effect of not having cutting-edge technology to
              | develop an alternative. So the decision was taken to
              | procure tanks from Rheinmetall (Leonardo was open to
              | working with another German producer, but they didn't want
              | to transfer technology and produce in Italy).
              | 
              | Not having tanks is a big no-no given recent events, and I
              | do not see the grudge you hold against them when it's clear
              | they wouldn't be used in the same flat scenario as Ukraine
              | or Poland. Given that Italy is highly mountainous, it made
              | sense to prioritize the air force and navy first. I think
              | they're also compatible with anti-drone guns.
              | 
              | Piaggio Aerospace was acquired a few weeks ago by Baykar;
              | hopefully it will produce drones.
              | 
              | I do not get why you're saying that a change in the
              | approach to war makes tanks not useful anymore, when it's
              | clear they will eventually be adapted to counter drones and
              | similar threats.
        
           | fkyoureadthedoc wrote:
           | > I can't imagine what would happen with a 500$bln investment
           | in defense,lol.
           | 
           | The $90,000 bag of bushings becomes a $300,000 bag?
        
             | amarcheschi wrote:
              | No, at least in Italy we would create some gimmick
              | consultant position for some ex-politician and repay
              | him/her for past favours /s
        
           | int_19h wrote:
           | It's not one or the other, though. AI-controlled drones are
           | already a thing in Ukraine, today.
        
         | smcl wrote:
         | Been saying this for years, it's been fucking baffling.
         | Generating images, video and text that sort-of resembles what a
         | human would come up with is genuinely quite impressive. It is
         | not "let's claim it'll fix our country" (looking at you, Keir)
         | impressive though, and I cannot believe so much money has been
         | pumped into it.
        
           | amarcheschi wrote:
            | But you have to overpromise and underdeliver, otherwise you
            | won't receive that sweet, sweet money
        
         | pilingual wrote:
         | > This is machine learning
         | 
         | Yeah, I was thinking about this while trying to figure out
         | author affiliations.
         | 
         | There was a Stanford paper a few years ago that dusted off some
         | old intelligence concepts and the authors seemed excited about
         | it.
         | 
         | But given the pace of AI, it's difficult to look in new
         | directions. It will probably take an AI winter and some
         | unbridled enthusiasm immune to burnout to make some real
          | progress outside of feed-forward neural networks.
        
         | mrshadowgoose wrote:
         | > They think it's the end all be all for everything.
         | 
         | Is (human-based) general intelligence not one of the
         | fundamental enabling elements of literally every human activity
         | throughout history, regardless of how many layers of automation
         | and technology one has to peel back to get to it?
         | 
         | Can you maybe imagine how the ability to create arbitrary
         | amounts of general intelligence, completely divorced from the
         | normal lengthy biological process, could upend that foundation
         | of human activity?
         | 
         | > They have the vision of the 'AI' they've seen in movies in
         | mind, see the current 'AI' being used and to them it's
          | basically almost the same, their brain is mentally bridging the
         | concepts and saying it's only a matter of time.
         | 
         | I've found that most AI-related movies exclusively focus on
         | "quality ASI" scenarios, which are mostly irrelevant to our
         | current state of the world, as an immense amount of
         | danger/value/disruption will arrive with AGI. People who are
         | seriously reasoning about the impacts of AGI are not using
         | movies as references. "Those stupid movie watching idiots" is
         | just a crutch you are using to avoid thinking about something
         | that you disagree with.
         | 
         | > Real AI, artificial intelligence, is a fever dream. This is
         | machine learning except the machines are bigger than ever
         | before. There is no intellect.
         | 
         | Do you have any evidence to support this conclusion? And does
         | it even matter? If "fake intellect" can replace a human, that
          | human still has to deal with the very real issue of not having
         | a job anymore. If "fake intellect" is used to conduct mass
         | surveillance, and direct suppression activities towards
         | divergent individuals, those individuals are still going to
         | have a bad time.
        
           | gessha wrote:
           | >> Real AI, artificial intelligence, is a fever dream. This
           | is machine learning except the machines are bigger than ever
           | before. There is no intellect.
           | 
           | > Do you have any evidence to support this conclusion? And
           | does it even matter? If "fake intellect" can replace a human,
            | that human still has to deal with the very real issue of not
           | having a job anymore. If "fake intellect" is used to conduct
           | mass surveillance, and direct suppression activities towards
           | divergent individuals, those individuals are still going to
           | have a bad time.
           | 
           | I think the "fake intelligence can replace a human" needs
           | more support in general. We know how human intellect works
           | practically (not theoretically) and we know how to apply it
           | in different scenarios. We're still far from knowing how
           | "fake intelligence" works and how to apply it to different
           | scenarios.
        
         | encipriano wrote:
          | You would solve the AI problem if you correctly defined what
          | intellect even is.
        
         | amelius wrote:
         | Yes, I'd like to see some examples where our current AI can
         | actually extrapolate rather than interpolate. Let it invent new
         | things, new drawing styles, new story plots, etc. Maybe _then_
         | it will impress me.
        
           | mrshadowgoose wrote:
           | Here you go:
           | https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
        
             | amelius wrote:
             | I'm not convinced. This is using the tooling and paradigms
             | invented by humans.
        
               | ninetyninenine wrote:
                | Science is a paradigm invented by humans. If a human
                | uses this paradigm to invent something he is considered
                | intelligent, but if an AI uses it, it's not? All humans
                | use human paradigms, and they are intelligent.
                | 
                | If a human uses such a paradigm to success, it is the
                | same.
        
               | amelius wrote:
               | Biology is a field where you can find new things
               | relatively easily, if you follow existing protocols and
               | do the hard lab work (which costs time and money).
        
               | mrshadowgoose wrote:
               | Ah yes. I believe this is commonly called "moving the
               | goalposts".
        
               | amelius wrote:
               | No it isn't.
        
           | moffkalast wrote:
           | Can you? Most people can't do anything that isn't 99%
           | derivative.
        
         | sidewndr46 wrote:
         | What is even the possible usage of AI for national security?
         | Generating pictures of kittens riding nuclear weapons to the
         | very end like in Dr Strangelove?
        
           | ben_w wrote:
           | > What is even the possible usage of AI for national
           | security? Generating pictures of kittens riding nuclear
           | weapons to the very end like in Dr Strangelove?
           | 
           | For all that critics of AI dismiss them as lacking
           | imagination, your reaction suggests a lack of imagination.
           | 
           | Off the top of my head: facial recognition and identification
           | to make "smart" guns that hit specific targets with reduced
           | collateral damage (as found on most digital cameras even
           | before smartphones); creating and A/B testing propaganda
           | campaigns; using modified wifi signals as wall-penetrating
            | radar capable of pose estimation, heart rate and breathing
           | monitoring[0]; take any self-driving car's AI and
           | conditionally invert the part that says "don't hit
           | pedestrians" when a certain target is spotted; ANPR to track
           | specific vehicles with known owners over long distances;
           | alternative targeting system for cruise missiles in the
           | absence or jamming of GPS systems; using them as red teams in
           | war-game exercises; using them to automate intrusion
           | detection by monitoring for changes to background
           | distributions of basically every measurable event; person-
           | tracking by watching CCTV in secure areas; control systems
           | for security robots (think Boston Dynamics' Spot) that are
           | currently in deployment.
           | 
           | There's likely a lot more, too.
           | 
           | [0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zh
           | ao_...
        
           | robwwilliams wrote:
           | Lol: Where I live (Memphis) both "one" and "two" are
           | considered two syllable words. Seriously. Our kids were
           | taught this in the best public elementary school.
        
             | esafak wrote:
             | How does that work, one-uh, two-uh?
        
               | robwwilliams wrote:
               | egzacly
        
           | moffkalast wrote:
           | Gentlemen you can't use AI here, this is the deep learning
           | lab!
        
           | int_19h wrote:
           | Autonomous weapons.
        
         | snarf21 wrote:
         | Agreed. I was working on some haiku things with ChatGPT and it
         | kept telling me that busy has only one syllable. This is a
         | trivially searchable fact.
        
           | whimsicalism wrote:
           | link a chat please
        
             | snarf21 wrote:
             | It wasn't just _busy_ that it failed on. I was feeding it
              | haikus and wanted them broken into a list of 17
              | words/fragments. Certain 2-syllable words weren't split and
              | certain 1-syllable words were split into two.
        
         | robwwilliams wrote:
         | It used to be much easier to be conservative about AI,
         | especially AGI, after living through three cycles of AI
         | winters. No more. Dismissing it as "merely machine learning" is
         | worse than unfair to the last decade of machine learning ;-)
         | 
         | The hard part now is relatively trivial. Does anyone think that
         | there is a fundamental and profound discovery that evolution
         | made purely by selection in the last 200,000 years? I mean a
         | true qualitative difference?
         | 
         | Sure---We call it language, which is just another part of a
         | fancy animal's tool kit.
         | 
         | Does anyone think there is an amazing qualitative difference
         | between the brain of a chimp and the brain of a human?
         | 
         | No, not if they know any biology.
         | 
         | (Although that does not stop some scientist from looking for a
         | "language gene" like FOXP2.)
         | 
         | So what did dumb mutations and 200,000 years of selection do
         | that a group of dedicated AI scientists cannot do with their
         | own genuine general intelligence?
         | 
         | Nothing---nothing other than putting a compact energy efficient
         | LLM with reinforcement learning on a good robotic body and
         | letting it explore and learn like we did as infants, toddlers
         | and teenagers.
         | 
         | Each one of us has experienced becoming a "general
         | intelligence". I remember it hit me on the head in 6th grade
         | when I dreamed up a different way of doing long division. I
         | remember thinking: "How did I think that?" And each one of us
         | who has watched an infant turn into a toddler has watched it as
         | an observer or teacher. This is what makes babies so
         | fascinating to "play" with.
         | 
         | We have to give our baby AGI a private memory and a layer of
         | meta-attention like we all gain as we mature, love, and
         | struggle.
         | 
         | I read the linked article and as a neuroscientist I realized
         | the "wait" cycles that improved performance so much is roughly
         | equivalent to the prefrontal cortex: the part of the CNS most
         | responsible for enabling us to check our own reasoning
         | recursively. Delay--as in delayed gratification---is a key
         | attribute of intelligent systems.
         | 
         | We are finally on the door step to Hofstadter's Strange Loop
          | and Maturana's and Varela's "enactive" systems, but now
         | implemented in silicon, metal, and plastic by us rather than
         | dumb but very patient natural selection.
         | 
          | Karl Friston and Demis Hassabis (two very smart
          | neuroscientists) figured this out years ago. And they were
          | preceded by three other world-class neuroscientists: Humberto
          | Maturana, Francisco Varela, and Rich Sutton (an honorary
          | neuroscientist). And big
         | credit to Terry Winograd for presaging this path forward long
         | ago too.
        
         | baq wrote:
          | I can only say that exponential curves grow deceptively slowly
          | before they take off. AI is not quite at the obvious take-off
          | point, but owners of the biggest clusters have seen the
          | extrapolations and it isn't pretty - once your competitor
          | achieves take-off and you aren't anywhere close, you're done
          | for. The risks of not participating are too great.
        
         | menaerus wrote:
          | You can choose to be somewhat ignorant of the current state of
          | AI, and I could even agree that at certain moments it appears
          | totally overhyped, but the reality is that there probably
          | hasn't been a bigger technology breakthrough in the last ~30
          | years.
         | 
         | This is not "just" machine learning because we have never been
         | able to do things which we are today and this is not only the
         | result of better hardware. Better hardware is actually a
         | byproduct. Why build a PFLOPS GPU when there is nothing that
         | can utilize it?
         | 
          | If you set aside some time and read through the actual
          | (scientific) papers of multiple generations of LLMs, the first
          | one being from Google ~~not DeepMind~~ in 2017, you might come
          | to understand that this is no fluff.
         | 
          | And I'm saying this from the position of a software engineer,
          | without bias.
          | 
          | The reason all this really took off at such high speed is the
          | not quite expected results - early LLM experiments showed that
          | "knowledge" with the current transformer architecture can scale
          | linearly with the amount of compute and training time, etc.
          | That was very unexpected, and to this day scientists do not
          | have an answer as to why this even works.
         | 
          | So, after reading a bunch of material, I am inclined to think
          | that this is something different. The future of loading the
          | codebase into the model and asking the model to explain the
          | code to me or fix bugs has never been so close and realistic.
          | For better or worse.
        
           | whimsicalism wrote:
           | > the first one being from DeepMind in 2017
           | 
           | ? what paper are you talking about
        
             | menaerus wrote:
             | https://arxiv.org/abs/1706.03762
        
               | whimsicalism wrote:
               | oh gotcha. maybe pedantic but that is not a deepmind
               | paper
        
               | menaerus wrote:
               | I stand corrected, not sure why I was under the
               | impression that this paper came out of DeepMind lab.
        
           | mitchelld wrote:
           | This line of thinking doesn't really correspond to the reason
           | Transformers were developed in the first place, which was to
            | better utilize how GPUs do computation. RNNs were too slow to
            | train at scale because you had to sequentially compute the
            | time steps; Transformers (with masking) can run the input
            | through in a single pass.
           | 
            | It is worth noting that the first "LLM" you're referring to
            | had only 300M parameters, but even then the amount of training
            | required (at the time) was such that training a model like
           | that outside of a big tech company was infeasible. Obviously
           | now we have models that are in the hundreds of billions /
           | trillions of parameters. The ability to train these models is
           | directly a result of better / more hardware being applied to
            | the problem, as well as the Transformer architecture being
            | specifically designed to better suit parallel computation at
            | scale.
           | 
           | The first GPT model came out ~ 8 years ago. I recall when
           | GPT-2 came out they initially didn't want to release the
           | weights out of concern for what the model could be used for,
           | looking back now that's kind of amusing. However,
           | fundamentally, all these models are the same setup as what
           | was used then, decoder based Transformers. They are just
           | substantially larger, trained on substantially more data,
           | trained with substantially more hardware.
        
             | menaerus wrote:
             | What line of thinking you're referring to?
             | 
              | Transformers were aimed at solving the "context" problem,
              | and the authors, aware that RNNs neither scale nor solve
              | that particular problem, had to come up with an algorithm
              | that overcomes both of those issues. It turned out that the
              | compute-scalability of self-attention was the crucial
              | ingredient for solving the problem, something that RNNs
              | were totally incapable of.
             | 
              | They modeled the algorithm to run on the hardware they had
              | available at the time, but the hardware developed
              | afterwards was a direct consequence, or as I called it a
              | byproduct, of transformers proving themselves able to scale
              | continuously. Had that not been true, we wouldn't have all
              | those iterations of NVidia chips.
              | 
              | So, although one could say that the NVidia chip design is
              | what enabled the transformers' success, one could also say
              | that we wouldn't have those chips if transformers hadn't
              | proved themselves to be so damn efficient. And I'm inclined
              | to think the latter.
        
               | mitchelld wrote:
               | > This is not "just" machine learning because we have
               | never been able to do things which we are today and this
               | is not only the result of better hardware. Better
               | hardware is actually a byproduct. Why build a PFLOPS GPU
               | when there is nothing that can utilize it?
               | 
               | This is the line of thinking I'm referring to.
               | 
               | The "context" problem had already been somewhat solved.
               | The attention mechanism existed prior to Transformers and
               | was specifically used on RNNs. They certainly improved
                | it, but the innovation of the architecture was making it
                | computationally efficient to train.
               | 
                | I'm not really following your argument. Clearly you're
                | acknowledging that it was first the case that, with the
                | hardware at the time, researchers demonstrated that
                | simply scaling up training with more data yielded better
                | models. The fact that hardware was then optimized for
                | these architectures only reinforces this point.
               | 
               | All the papers discussing scaling laws point to the same
               | thing, simply using more compute and data yields better
               | results.
               | 
               | > this is not only the result of better hardware
               | 
                | Regarding this in particular: a majority of the
                | improvement from GPT-2 to GPT-4 was simply training at a
                | much larger scale. That was enabled by better hardware,
                | and lots of it.
        
               | menaerus wrote:
               | > innovation of the architecture was making it
               | computation efficient to train.
               | 
               | and
               | 
               | > researchers demonstrated that simply scaling up
               | training with more data yielded better models
               | 
               | and
               | 
                | > The fact that hardware was then optimized for these
                | > architectures only reinforces this point.
               | 
               | and
               | 
               | > All the papers discussing scaling laws point to the
               | same thing, simply using more compute and data yields
               | better results.
               | 
                | is what I am saying as well. I read the majority of
                | those papers, so this is all well known to me, but I am
                | perhaps writing it down in a more condensed format so
                | that other readers who are light on the topic can pick up
                | the idea more easily.
               | 
               | > A majority of the improvement from GPT-2 and GPT-4 was
               | simply training on a much larger scale. That was enabled
               | by better hardware and lots of it.
               | 
                | Ok, I see your point, and the conclusion here is what we
                | disagree on. You say that the innovation was simply
                | enabled by the better hardware, whereas I say that the
                | better hardware wouldn't have its place if there hadn't
                | been a great innovation in the algorithm itself. I don't
                | think it's fair to say that the innovation is driven by
                | the NVidia chips.
               | 
                | I guess my point, simply put, is that if we had a lousy
                | algorithm, new hardware wouldn't mean anything without
                | rethinking or rewriting the algorithm. And with
                | transformers, that definitely hasn't been the case. There
                | have been plenty of optimizations throughout the years to
                | better utilize the HW (e.g. FlashAttention), but the
                | architecture of transformers has remained more or less
                | the same.
        
         | dotancohen wrote:
         | > Real AI, artificial intelligence, is a fever dream. This is
         | machine learning except the machines are bigger than ever
         | before. There is no intellect.
         | 
         | That sounds to me like dismissing the idea that a Russian SSBN
         | might cross the Pacific and nuke Los Angeles because
         | "submarines can't swim".
         | 
         | Even if the machine learning isn't really intelligent, it is
         | still capable of performing IF..THEN..ELSE operations, which
         | could have detrimental effects for [some subset of] humans.
         | 
         | And even if you argue that such a machine _shouldn't_ be used
         | for whatever doomsday scenario would harm us, rest assured that
         | someone, somewhere, who either does not understand what the
         | machines are designed to do or just pretends that they work
         | like magic, will put the machines in a position to make such a
         | decision.
        
           | UncleEntity wrote:
           | One could hope...
           | 
           | Even at the height of the Cold War there was always a human
           | between <leader presses button> and <nukes go aflyin'>.
           | 
           | --edit--
           | 
           | ...which has me wondering if a president even has the
           | constitutional authority to destroy the entire planet and if
           | one could interpret their command as a 'lawful order'. Makes
           | one think.
        
             | willglynn wrote:
             | On the topic of fail-deadly nukes:
             | 
             | https://en.wikipedia.org/wiki/Dead_Hand
        
         | ninetyninenine wrote:
          | I agree AGI won't solve national security, but saying this
          | isn't intelligence is false.
          | 
          | This is AI, and trend lines point to an intelligence that
          | matches or barely exceeds human intellect in the future.
          | 
          | You're part of a trend of people in denial. When LLMs first
          | came out there were hordes of people on HN claiming they were
          | just stochastic parrots and displayed zero intellectual
          | ability. It is now abundantly clear that this is not true.
         | 
         | We don't fully understand LLMs. That's why gains like COT are
         | just black box adjustments that come from changing external
         | configurations. We have no way to read the contents of the
         | black box and make adjustments off of it. Yet idiots like you
         | can make such vast and hard claims when nobody really fully
         | understands these things. You're delusional.
         | 
         | I agree that LLMs won't allow us to make some super weapon to
         | give us some edge in national security.
        
         | naasking wrote:
         | > This is machine learning except the machines are bigger than
         | ever before. There is no intellect.
         | 
         | Define "intellect".
        
       | GTP wrote:
       | Sorry for being lazy, but I just don't have the time right now to
        | read the paper. Is there, in the paper or somewhere else, a
        | benchmark comparison of S1 vs. R1 (the full R1, not quantized or
        | distilled)?
        
         | pama wrote:
         | The S1 paper is not meant to compete with R1. It simply shows
         | that with 1k well curated examples for finetuning (26 minutes
          | training on 16 GPUs) and with a simple hack for controlling the
         | length of the thinking process, one can dramatically increase
         | the performance of a non-reasoning model and show a clear
         | increase in benefit with increased test-time compute. It is
         | worth a quick skim.
        
       | swiftcoder wrote:
       | > having 10,000 H100s just means that you can do 625 times more
       | experiments than s1 did
       | 
       | I think the ball is very much in their court to demonstrate they
       | actually are using their massive compute in such a productive
       | fashion. My BigTech experience would tend to suggest that
       | frugality went out the window the day the valuation took off, and
       | they are in fact just burning compute for little gain, because
       | why not...
        
         | whizzter wrote:
          | Mainly it points to a non-scientific "bigger is better"
          | mentality, and the researchers probably didn't mind playing
          | around with the power because "scale" is "cool".
          | 
          | Remember that the Lisp AI-lab people were working on unsolved
          | problems on absolute potatoes of computers back in the day. We
          | have a semblance of a solution now, but so much of it has been
          | brute force (even if there have been improvements in the
          | field).
          | 
          | The big question is whether this insane spending has pulled the
          | rug on real progress, if we head into another AI winter of
          | disillusionment, or whether there is enough real progress just
          | around the corner to show that there is hope for investors in a
          | post-DeepSeek valuation hangover.
        
           | wongarsu wrote:
           | We are in a phase where costs are really coming down. We had
           | this phase from GPT2 to about GPT4 where the key to building
           | better models was just building bigger models and training
           | them for longer. But since then a lot of work has gone into
           | distillation and other techniques to make smaller models more
           | capable.
           | 
            | If there is another AI winter, it will be more like the
            | dotcom bubble: lots of important work got done during the
            | bubble, but many of the big tech companies were started from
            | the fruits of that labor in the decade after it burst.
        
         | svantana wrote:
         | Besides that, AI training (aka gradient descent) is not really
         | an "embarrassingly parallel" problem. At some point, there are
         | diminishing returns on adding more GPUs, even though a lot of
         | effort is going into making it as parallel as possible.
        
           | janalsncm wrote:
           | What? It definitely is.
           | 
           | Data parallelism, model parallelism, parameter server to
           | workers, MoE itself can be split up, etc.
           | 
           | But even if it wasn't, you can simply parallelize training
           | runs with slight variations in hyperparameters. That is what
           | the article is describing.
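            | 
            | A toy sketch of that last point in Python (train_one is a
            | stand-in for a real training job and the grid values are
            | arbitrary); independent runs need no coordination, so they
            | scale out trivially:
            | 
            |     from concurrent.futures import ProcessPoolExecutor
            |     from itertools import product
            | 
            |     def train_one(cfg):
            |         lr, warmup = cfg
            |         # ...launch a full training run here and
            |         # return its eval score...
            |         return {"lr": lr, "warmup": warmup, "score": 0.0}
            | 
            |     if __name__ == "__main__":
            |         grid = list(product([1e-5, 3e-5, 1e-4],
            |                             [0, 50, 100]))
            |         with ProcessPoolExecutor(max_workers=len(grid)) as ex:
            |             results = list(ex.map(train_one, grid))
            |         print(max(results, key=lambda r: r["score"]))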
        
         | gessha wrote:
         | This is pure speculation on my part but I think at some point a
         | company's valuation became tied to how big their compute is so
         | everybody jumped on the bandwagon.
        
           | syntaxing wrote:
           | Matt Levine tangentially talked about this during his podcast
           | this past Friday (or was it the one before?). It was a good
            | way to value these companies according to their compute size,
            | since those chips are very valuable. At a minimum, the chips
            | are an asset that acts as collateral.
        
             | jxdxbx wrote:
             | I hear this a lot, but what the hell. It's still computer
             | chips. They depreciate. Short supply won't last forever.
             | Hell, GPUs burn out. It seems like using ice sculptures as
             | collateral, and then spring comes.
        
               | baxtr wrote:
                | If so, wouldn't it be the first time in history that
                | more processing power is not used?
               | 
               | In my experience CPU/GPU power is used up as much as
               | possible. Increased efficiency just leads to more demand.
        
               | littlestymaar wrote:
                | I think you're missing the point: the H100 isn't going
                | to remain useful for long. Would you consider Tesla or
                | Pascal graphics cards as collateral? That's what those
                | H100s will look like in just a few years.
        
               | ijidak wrote:
               | Yeah, exactly! I've got some 286, 386, and 486 CPUs that
               | I want to claim as collateral!
        
               | baxtr wrote:
               | Not sure I do tbh.
               | 
               | Any asset depreciates over time. But they usually get
               | replaced.
               | 
               | My 286 was replaced by a faster 386 and that by an even
                | faster 486.
               | 
               | I'm sure you see a naming pattern there.
        
               | kgwgk wrote:
               | > Any asset depreciates over time.
               | 
               | That's why "those chips are very valuable" is not
               | necessarily a good way to value companies - and it isn't
               | if they can extract the value from the chips before they
               | become worthless.
               | 
               | > But they usually get replaced.
               | 
               | They usually produce enough income to cover depreciation
               | so you actually have the cash to replace them.
        
               | littlestymaar wrote:
                | And that's why such assets represent only a marginal
               | part of valuation. (And if you look at accounting, this
               | depreciation is usually done over three years for IT
               | hardware, and as such most of these chips have already
               | lost half of their accounting value in the balance
               | sheet).
        
               | baq wrote:
               | My 1070 was replaced by... nothing, I moved it from a
                | Haswell box to an Alder Lake box.
               | 
               | Given that inference time will soon be extremely valuable
               | with agents and <thinking> models, H100s may yet be worth
               | something in a couple years.
        
               | mvc wrote:
               | > My 286 was replaced by a faster 386 and that by an even
                | > faster 486.
               | 
               | How much was your 286 chip worth when you bought your
               | 486?
        
               | sixothree wrote:
               | Year over year gains in computing continue to slow. I
               | think we keep forgetting that when talking about these
               | things as assets. The thing controlling their value is
               | the supply which is tightly controlled like diamonds.
        
               | adrianN wrote:
               | They have a fairly limited lifetime even if progress
               | stands still.
        
               | throwup238 wrote:
               | Last I checked AWS 1-year reserve pricing for an 8x H100
               | box more than pays for the capital cost of the whole box,
               | power, and NVIDIA enterprise license, with thousands left
               | over for profit. On demand pricing is even worse. For
               | cloud providers these things pay for themselves quickly
               | and print cash afterwards. Even the bargain basement
               | $2/GPU/hour pays it off in under two years.
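                | 
                | Back-of-the-envelope version (every number except the
                | $2/GPU/hour figure above is an assumption, so treat it
                | as a rough sanity check rather than real pricing):
                | 
                |     gpus = 8
                |     rate = 2.00          # $/GPU/hour, the bargain-basement figure
                |     box_cost = 250_000   # assumed all-in cost of an 8x H100 box, $
                |     utilization = 0.9    # assumed fraction of hours actually rented
                | 
                |     yearly_revenue = gpus * rate * 24 * 365 * utilization
                |     payback_years = box_cost / yearly_revenue
                |     print(round(yearly_revenue), round(payback_years, 2))
                |     # ~126144  ~1.98 -> roughly two years at the cheapest rate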
        
               | sdenton4 wrote:
               | Labor! You need it to turn the bill of sale into a data
               | center and keep it running. The bargain basement would be
               | even cheaper otherwise...
        
               | spamizbad wrote:
               | > Year over year gains in computing continue to slow.
               | 
               | This isn't true in the AI chip space (yet). And so much
               | of this isn't just about compute but about the memory.
        
               | eek2121 wrote:
                | From a per-mm2 performance standpoint, things have
                | absolutely slowed considerably. Gains are primarily
                | being eked out via process advances (which have slowed
                | down) and larger chips (which have an ever-shrinking
                | limit depending on the tech used).
               | 
               | Chiplets have slowed the slowdown in AI, but you can see
               | in the gaming space how much things have slowed to get an
               | idea of what is coming for enterprise.
        
               | ijidak wrote:
               | Honestly, I don't fully understand the reason for this
               | shortage.
               | 
               | Isn't it because we insist on only using the latest nodes
               | from a single company for manufacture?
               | 
               | I don't understand why we can't use older process nodes
               | to boost overall GPU making capacity.
               | 
               | Can't we have tiers of GPU availability?
               | 
               | Why is Nvidia not diversifying aggressively to Samsung
                | and Intel, no matter the process node?
               | 
               | Can someone explain?
               | 
               | I've heard packaging is also a concern, but can't you get
               | Intel to figure that out with a large enough commitment?
        
               | nl wrote:
               | > Isn't it because we insist on only using the latest
               | nodes from a single company for manufacture?
               | 
               | TSMC was way ahead of anyone else introducing 5nm.
               | There's a long lead time porting a chip to a new process
               | from a different manufacturer.
               | 
               | > I don't understand why we can't use older process nodes
               | to boost overall GPU making capacity.
               | 
               | > Can't we have tiers of GPU availability?
               | 
                | Nvidia does this. You can get older GPUs, but more
                | performance is better for performance-sensitive
                | applications like training or running LLMs.
               | 
               | Higher performance needs better manufacturing processes.
        
               | ecocentrik wrote:
                | That is the wrong take. Depreciated and burned-out
                | chips are replaced, and total compute value typically
                | increases over time. Efficiency gains are also
                | calculated and projected over time. Seasons are
                | inevitable and cyclical. Spring might be here, but
                | winter is coming.
        
             | SecretDreams wrote:
             | > It was a good way to value these companies according to
             | their compute size since those chips are very valuable.
             | 
             | Are they actually, though? Presently yes, but are they
             | actually driving ROI? Or just an asset nobody really is
             | meaningfully utilizing, but helps juice the stocks?
        
             | ijidak wrote:
             | I asked this elsewhere, but, I don't fully understand the
             | reason for the critical GPU shortage.
             | 
             | Isn't it because NVIDIA insists on only using the latest
             | nodes from a single company (TSMC) for manufacture?
             | 
             | I don't understand why we can't use older process nodes to
             | boost overall GPU making capacity.
             | 
             | Can't we have tiers of GPU availability some on cutting
             | edge nodes, others built on older Intel and Samsung nodes?
             | 
             | Why is Nvidia not diversifying aggressively to Samsung and
              | Intel, no matter the process node?
             | 
             | Can someone explain?
             | 
             | I've heard packaging is also a concern, but can't you get
             | Intel to figure that out with a large enough commitment?
             | 
             | (Also, I know NVIDIA has some capacity on Samsung. But why
             | not go all out, even using Global Foundries?)
        
             | aorloff wrote:
             | If you are a cloud provider renting them out
             | 
              | Otherwise you'd better keep them humming trying to find a
             | business model because they certainly aren't getting any
             | newer as chips
        
             | dghlsakjg wrote:
             | That's a great way to value a company that is going
             | bankrupt.
             | 
             | But, I'm not going to value an operating construction
             | company based on how many shovels or excavators they own.
             | I'm going to want to see them putting those assets to
             | productive use.
        
           | JKCalhoun wrote:
           | So, "No one was ever fired for ... buying more server
           | infrastructure."
        
             | genewitch wrote:
             | Walmart has massive, idle datacenters full of running
             | machines doing nothing.
        
           | tyfon wrote:
           | I don't think you need to speculate too hard. On CNBC they
           | are not tracking revenue, profits or technical breakthroughs,
           | but how much the big companies are spending (on gpus). That's
           | the metric!
        
             | Mistletoe wrote:
             | This feels like one of those stats they show from 1929 and
             | everyone is like "and they didn't know they were in a
             | bubble?"
        
             | RobotToaster wrote:
             | "But tulip sales keep increasing!"
        
             | LeifCarrotson wrote:
             | I probably don't have to repeat it, but this is a perfect
             | example of Goodhart's Law: when a metric is used as a
             | target, it loses its effectiveness as a metric.
             | 
             | If you were a reporter who didn't necessarily understand
             | how to value a particular algorithm or training operation,
             | but you wanted a simple number to compare the amount of
             | work OpenAI vs. Google vs Facebook are putting into their
             | models, yeah, it makes sense. How many petaflops their
             | datacenters are churning through in aggregate is probably
             | correlated to the thing you're trying to understand. And
             | it's probably easier to look at their financials and
             | correlate how much they've spent on GPUs to how many
             | petaflops of compute they need.
             | 
             | But when your investors are giving you more money based on
             | how well they perceive you're doing, and their perception
             | is not an oracle but is instead directly based on how much
             | money you're spending... the GPUs don't actually need to do
             | anything other than make number go up.
        
             | B56b wrote:
             | They absolutely are tracking revenues/profits on CNBC, what
             | are you talking about?
        
             | ur-whale wrote:
             | > but how much the big companies are spending (on gpus).
             | That's the metric!
             | 
             | Burn rate based valuations!
             | 
             | The 2000's are back in full force!
        
         | jerf wrote:
         | This claim is mathematically nonsensical. It implies a more-or-
         | less linear relationship, that more is always better. But
         | there's no reason to limit that to H100s. Conventional servers
         | are, if anything, rather more established in their ability to
         | generate value, by which I mean, however much potential AI
         | servers may have to be more important than conventional servers
         | that they may manifest in the future, we know how to use
         | conventional servers to generate value _now_.
         | 
         | And thus, by this logic, every company in the world should just
         | be buying as many servers as they can get their hands on,
         | because More Servers = More Value.
         | 
         | Obviously, this is not happening. It doesn't take much analysis
         | to start listing the many and manifold reasons why. Many of
         | those reasons will apply to GPUs as well. Just as if everything
         | in AWS got 10x faster, overnight, this would not create a
         | situation where everyone suddenly starts grabbing _more_
         | servers in AWS. Obviously everyone would start trimming down,
          | even if perhaps in a few years' time they'd find some way to
         | use this burst of power such that they can use more later. This
         | can't happen overnight, though. It would take time, and not
         | "weeks" or "months" but "years" at scale.
         | 
         | Incorporating the important variable of _time_ in the analysis,
         | if AIs become literally hundreds of times cheaper to run,
          | _today_, then it is perfectly logical that the near-term
         | demand for the hardware to run them is also going to go way,
         | way down. However much _potential_ AI may have, it is fairly
          | clear looking out at the AI landscape right now that there isn't
          | really anyone out there unlocking vast amounts of value and
         | sitting there wringing their hands because they just can't get
         | more GPU compute. The GPU rush has been from fear that someone
         | will figure out how to "really" unlock AI and then they'll be
         | stuck without the hardware to compete.
         | 
         | It may be the case that vastly cheaper AI will in fact be part
         | of unlocking that value, and that as the AI industry grows it
         | will grow faster as a result... but that's still going to be on
         | a multi-year time frame, not a tomorrow time frame. And all
         | those GPUs and all those valuations are still broadly based on
          | them being valuable _real soon now_, not in a few years, and
         | all those GPU purchases are on the assumption they need them
          | _now_, or on a timeframe where we can't be waiting around,
         | rather than waiting for some rounds of exponential doublings to
         | bring price down. The hardware curve in 5 years may be higher
         | but the curve in the next _year_ would be lower, and by a
         | _lot_.
         | 
         | And, you know, who's to say we're done? I doubt there's another
         | 100x in there, but is someone going to eke out another 2x
         | improvement? Or a 10x improvement? Making it easier to run lots
         | of experiments makes it much more likely for that to happen.
         | I'm skeptical of another 10x general improvement but 10x
         | improvements for specific, important use cases I can't rule
         | out.
         | 
         | Edit: I should also point out this is an extremely common
         | pattern in technology in general. Often the very hardest part
         | is producing a thing that does a particular task _at all_. Once
         | we have it in hand, once we can use it and learn how it
         | operates and what its characteristic operating modes are, once
         | we can try modifications to it in the real world and see what
         | happens, _optimizing_ it becomes much easier, sometimes
         | explosively so by comparison. Taking any first iteration of a
         | tech that is practical and then trying to straight-line demand
         | based on it is silly, in all sorts of ways and all directions.
         | The internal combustion engine, for example, has had a myriad
         | of impacts on the world and certainly after various
         | improvements many, many millions if not billions of them have
         | been made... but any company that reacted to the first couple
         | of cars and just went _ballistic_ buying those first-generation
         | internal combustion engines would have lost everything, and
         | rather quickly.
        
         | deadbabe wrote:
         | For starters every employee has an H100 under their desk.
        
       | HenryBemis wrote:
       | > Going forward, it'll be nearly impossible to prevent distealing
       | (unauthorized distilling). One thousand examples is definitely
       | within the range of what a single person might do in normal
       | usage, no less ten or a hundred people. I doubt that OpenAI has a
       | realistic path to preventing or even detecting distealing outside
       | of simply not releasing models.
       | 
       | (sorry for the long quote)
       | 
        | I will say (naively perhaps) "oh, but that is fairly simple".
        | For any API request, add a 5-second delay before the next one
        | for 'unverified' users. Add a "blue check" (a la X/Twitter).
        | For the 'big sales', have a third-party vetting process so that
        | if US Corporation XYZ wants access, they prove themselves
        | worthy/not Chinese competition and then you do give them the
        | 1000/min deal.
       | 
       | For everyone else, add the 5 second (or whatever other duration
       | makes sense) timer/overhead and then see them drop from 1000
        | requests per minute to 500 per day. Or just cap them at 500 per
       | day and close that back-door. And if you get 'many cheap
       | accounts' doing hand-overs (AccountA does 1-500, AccountB does
       | 501-1000, AccountC does 1001-1500, and so on) then you mass block
       | them.
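        | 
        | A minimal sketch of that kind of tiering (the 500/day cap, the
        | 5-second delay, and the `verified` flag are just the numbers
        | from the example above, not anything any provider actually
        | does):
        | 
        |     import time
        |     from collections import defaultdict
        | 
        |     DAILY_CAP_UNVERIFIED = 500  # requests/day for unverified accounts
        |     DELAY_UNVERIFIED_S = 5      # extra seconds added per request
        | 
        |     _counts = defaultdict(int)  # would be reset daily in a real system
        | 
        |     def throttle(account_id: str, verified: bool) -> bool:
        |         """Return True if the request may proceed."""
        |         if verified:
        |             return True
        |         if _counts[account_id] >= DAILY_CAP_UNVERIFIED:
        |             return False                # hard daily cap reached
        |         _counts[account_id] += 1
        |         time.sleep(DELAY_UNVERIFIED_S)  # slow unverified traffic down
        |         return True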
        
       | mark_l_watson wrote:
       | Off topic, but I just bookmarked Tim's blog, great stuff.
       | 
       | I dismissed the X references to S1 without reading them, big
        | mistake. I have been working generally in AI for 40 years and
       | neural networks for 35 years and the exponential progress since
       | the hacks that make deep learning possible has been breathtaking.
       | 
       | Reduction in processing and memory requirements for running
       | models is incredible. I have been personally struggling with
       | creating my own LLM-based agents with weaker on-device models (my
       | same experiments usually work with 4o-mini and above models) but
       | either my skills will get better or I can wait for better on
       | device models.
       | 
       | I was experimenting with the iOS/iPadOS/macOS app On-Device AI
       | last night and the person who wrote this app was successful in
       | combining web search tool calling working with a very small model
       | - something that I have been trying to perfect.
        
       | cowsaymoo wrote:
       | The part about taking control of a reasoning model's output
       | length using <think></think> tags is interesting.
       | 
       | > In s1, when the LLM tries to stop thinking with "</think>",
       | they force it to keep going by replacing it with "Wait".
       | 
        | I found a few days ago that this lets you 'inject' your own
        | CoT and jailbreak it more easily. Maybe these are related?
       | 
       | https://pastebin.com/G8Zzn0Lw
       | 
       | https://news.ycombinator.com/item?id=42891042#42896498
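        | 
        | For concreteness, a rough sketch of that replace-"</think>"-
        | with-"Wait" loop (not the s1 repo's actual code: the model name
        | and 512-token budget are arbitrary, it assumes a <think>-style
        | chat model, and it needs a recent transformers for
        | stop_strings):
        | 
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
        |     tok = AutoTokenizer.from_pretrained(MODEL)
        |     model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
        | 
        |     def force_more_thinking(prompt: str, extra_rounds: int = 2) -> str:
        |         text = prompt + "<think>\n"
        |         for _ in range(extra_rounds):
        |             ids = tok(text, return_tensors="pt").to(model.device)
        |             out = model.generate(**ids, max_new_tokens=512,
        |                                  stop_strings=["</think>"], tokenizer=tok)
        |             text = tok.decode(out[0], skip_special_tokens=False)
        |             if text.rstrip().endswith("</think>"):
        |                 # the model tried to stop thinking: strip the
        |                 # delimiter and nudge it to keep reasoning
        |                 text = text.rstrip()[:-len("</think>")] + "\nWait"
        |         # a real loop would let the final round run on to the answer
        |         return text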
        
         | causal wrote:
         | This even points to a reason why OpenAI hides the "thinking"
         | step: it would be too obvious that the context is being
         | manipulated to induce more thinking.
        
         | zamalek wrote:
          | It's weird that you need to do that at all; couldn't you just
          | reject that token and use the next most probable?
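          | 
          | (A logits processor can do exactly that; rough sketch,
          | assuming the usual transformers `model`/`tok` pair and a
          | single-token delimiter:)
          | 
          |     from transformers import LogitsProcessor, LogitsProcessorList
          | 
          |     class BanTokens(LogitsProcessor):
          |         def __init__(self, banned_ids):
          |             self.banned_ids = list(banned_ids)
          | 
          |         def __call__(self, input_ids, scores):
          |             # the banned token can never be sampled, so the
          |             # next most probable token wins instead
          |             scores[:, self.banned_ids] = float("-inf")
          |             return scores
          | 
          |     # end_id = tok.convert_tokens_to_ids("</think>")
          |     # model.generate(**inputs, logits_processor=LogitsProcessorList(
          |     #     [BanTokens([end_id])]))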
        
       | Havoc wrote:
        | The point about using agents to conceal access to the model is
        | a good one.
       | 
       | Hopefully we won't lose all access to models in future
        
       | cyp0633 wrote:
        | Qwen's QvQ-72B uses many more "wait"s than other LLMs with CoT
        | that I tried; maybe they've already used that trick to some
        | extent?
        
       | theturtletalks wrote:
        | DeepSeek R1 uses <think/> and "wait", and you can see it
        | second-guessing itself in the thinking tokens. How does the
        | model know when to wait?
       | 
        | These reasoning models feed into OP's last point about Nvidia
        | and OpenAI data centers not being wasted, since reasoning
        | models require more tokens and faster tps.
        
         | qwertox wrote:
          | Probably when it would expect a human to second-guess
          | themselves, as shown in literature and maybe other sources.
        
         | UncleEntity wrote:
         | From playing around they seem to 'wait' when there's a
         | contradiction in their logic.
         | 
          | And I think the second point is due to The Market thinking
          | there is no need to spend ever-increasing amounts of compute
          | to get to the next level of AI overlordship.
          | 
          | Of course Jevons paradox is also all over the news these
          | days...
        
       | pona-a wrote:
       | If chain of thought acts as a scratch buffer by providing the
       | model more temporary "layers" to process the text, I wonder if
       | making this buffer a separate context with its own separate FNN
       | and attention would make sense; in essence, there's a
       | macroprocess of "reasoning" that takes unbounded time to
       | complete, and then there's a microprocess of describing this
       | incomprehensible stream of embedding vectors in natural language,
       | in a way returning to the encoder/decoder architecture but where
       | both are autoregressive. Maybe this would give us a denser
       | representation of said "thought", not constrained by imitating
       | human text.
        
         | bluechair wrote:
         | I had this exact same thought yesterday.
         | 
         | I'd go so far as to add one more layer to monitor this one and
         | stop adding layers. My thinking is that this meta awareness is
         | all you need.
         | 
         | No data to back my hypothesis up. So take it for what it's
         | worth.
        
           | larodi wrote:
            | My thought along the same lines: do all tokens live in the
            | same latent space, or in many spaces, with each logical
            | unit trained separately from the others...?
        
           | hadlock wrote:
           | This is where I was headed but I think you said it better.
           | Some kind of executive process monitoring the situation, the
           | random stream of consciousness and the actual output. Looping
           | back around to outdated psychology you have the ego which is
           | the output (speech), the super ego is the executive process
           | and the id is the <think>internal monologue</think>. This
           | isn't the standard definition of those three but close
           | enough.
        
         | whimsicalism wrote:
         | > this incomprehensible stream of embedding vectors as natural
         | language explanation, in a way returning to encoder/decoder
         | architecture
         | 
         | this is just standard decoding, the stream of vectors is called
         | the k/v cache
        
         | easeout wrote:
         | Here's a paper your idea reminds me of.
         | https://arxiv.org/abs/2501.19201
         | 
          | It's also not so far from Meta's large concept model idea.
        
           | pona-a wrote:
           | Previous discussion:
           | 
           | [41 comments, 166 points]
           | https://news.ycombinator.com/item?id=42919597
        
         | bloomingkales wrote:
         | Once we train models on the chain of thought outputs, next
         | token prediction can solve the halting problem for us (eg, this
         | chain of thinking matches _this other_ chain of thinking).
        
           | psadri wrote:
           | I think that is how human brains work. When we practice, at
           | first we have to be deliberate (thinking slow). Then we
           | "learn" from our own experience and it becomes muscle memory
           | (thinking fast). Of course, it increases the odds we are
           | wrong.
        
             | bloomingkales wrote:
             | Or worse, we incorrectly overweight the wrong chain of
             | thinking to an irrelevant output (but pragmatically useful
             | output), at scale.
             | 
             | For example, xenophobia as a response to economic hardship
             | is the wrong chain of thinking embedded in the larger
             | zeitgeist.
        
         | jjk7 wrote:
         | Comments on a google doc? Nesting in social media comments?
         | 
         | Seems like similar concepts. I think there is some potential to
         | improving how LLMs improve and further their own reasoning
         | lines, but I'm no AI mage.
        
         | zoogeny wrote:
         | I've had an idea since I was a kid which I can share. I was
         | contemplating AI and consciousness generally, probably around
         | the time I read "The Minds I".
         | 
         | I reflected on the pop-psychology idea of consciousness and
         | subconsciousness. I thought of each as an independent stream of
         | tokens, like stream of consciousness poetry. But along the
         | stream there were joining points between these two streams,
         | points where the conscious stream was edited by the
         | subconscious stream. You could think of the subconscious stream
         | as performing CRUD like operations on the conscious stream. The
         | conscious stream would act like a buffer of short-term memory
         | while the subconscious stream would act like a buffer of long-
         | term memory. Like, the subconscious has instructions related to
         | long-term goals and the conscious stream has instructions
         | related to short-term goals.
         | 
         | You can imagine perception as input being fed into the
         | conscious stream and then edited by the subconscious stream
         | before execution.
         | 
         | It seems entirely possible to actually implement this idea in
         | this current day and age. I mean, it was a fever dream as a
         | kid, but now it could be an experiment!
        
           | barrenko wrote:
            | Consciousness as the subconscious pretending not to be
            | subconscious, something like that, a thin wrapper. CRUD
            | makes sense.
            | 
            | Gels closely with Buddhism, hell, all religions.
        
           | ForHackernews wrote:
           | Have you read Jaynes' "The Origin of Consciousness in the
           | Breakdown of the Bicameral Mind"?
        
             | zoogeny wrote:
             | I haven't read the original but I am familiar with the
             | broad stroke view. There are similarities (perhaps vague)
             | in the more recent work of someone like McGilchrist and his
             | The Master and His Emissary (another book which I only have
             | a broad stroke view of).
             | 
             | At the time I had this idea I did not know of either of
             | these. I think I was drawing explicitly on the conscious /
             | subconscious vocabulary.
        
         | cakealert wrote:
         | The problem is that RL is extremely inefficient. It's one thing
         | to use it for fine tuning an LLM to do the chain of thought
         | trick and quite another to do thinking entirely from scratch.
         | The pretrained LLM does a lot of heavy lifting there.
         | 
         | And it would have to be RL for your idea to work since there is
         | no "thinking" dataset for a novel token space. There isn't even
         | one for existing LLM token space, but they have the base model
         | to work off of. When the thought is expressed in English, the
         | model already knows the relationships between the tokens in the
         | thought, it's merely repurposing it for a "thinking"
         | application.
        
           | itissid wrote:
           | > The problem is that RL is extremely inefficient.
           | 
            | Wait, what? That is an odd way of defining it. That's like
            | saying Turing machines are an inefficient way to solve TSP.
            | You would, at the least, want to define this in terms of
            | complexity, or put it into the context of domains and
            | observability.
           | 
            | RL, by definition, is a field about finding efficient
            | solutions to problems in the domain of choice [1]. There
            | are likely regimes in LLM/LRM learning where RL can be
            | quite efficient, polynomial time in the state space even;
            | we just need to explore and find them. For example, you can
            | use Dynamic Programming as a "more" efficient way to solve
            | MDPs [1] because it is polynomial in the state space x
            | action space.
           | 
           | [1]https://web.stanford.edu/class/psych209/Readings/SuttonBar
           | to...
        
             | cakealert wrote:
             | RL provides very poor training signal for deep learning, an
             | order of magnitude or more worse than supervised learning.
             | Better than nothing of course.
             | 
             | What the OP suggested is similar to training a transformer
             | from scratch using RL (ie. no training tokens) towards an
             | objective of steering a pretrained LLM to produce human
             | readable output. It will probably not even converge, and if
             | it does it would take immense compute.
        
               | pizza wrote:
               | In the case of supervised problem domains, you implicitly
               | make a decision about what is signal, and what is noise,
               | and sure, in that closed setting, supervised learning is
               | much more sample efficient. But I think what we're
               | learning now is that with strong enough base models,
               | 'aha' moments in RL training show that it might be
               | possible to essentially 'squeeze out signal from language
               | itself', giving you far greater breadth of latent
               | knowledge than supervised examples, and letting you train
               | to generalize to far greater horizons than a fixed
               | dataset might allow. In a fascinating way it is rather
               | reminiscent of, well, abiogenesis. This might sound like
               | speculative claptrap if you look at the things the
               | current generation of models are still weak at, but...
               | there's a real chance that there is a very heavy tail to
               | the set of outcomes in the limit.
        
               | cakealert wrote:
               | With a pretrained LLM most of the work is done. RL just
               | steers the model into a 'thinking' mode. There is enough
               | signal for that to work and for the inefficiency to not
               | matter.
               | 
               | The downside is that you are limiting the model to think
               | in the same language it outputs. An argument could be
               | made that this is not how all humans think. I know that I
               | rarely think in language or even images, just concepts
               | (probably isn't even the right word) mix and transform
               | and often I don't even bother to make the transformation
               | to language at the end, just action.
        
               | pizza wrote:
               | I strongly agree; in fact I think what best matches the
               | thought process is something like the multiset
               | tree/forest workspace approach as suggested by Marcolli,
               | Chomsky, and Berwick - a Hopf algebra that can be
               | externalized into (non-planar) embeddings of linearized
               | strings, or alternately into semantic manifolds.
        
       | sambull wrote:
        | That sovereign wealth fund with TikTok might set a good
        | precedent; when we have to 'pour money' into these companies we
        | can do so with a stake in them held in our sovereign wealth
        | fund.
        
         | TehCorwiz wrote:
         | Extra-legal financial instruments meant to suck money from
         | other federal departments don't strike me as a good precedent
         | in any sense. I don't disagree though that nationalizing the
         | value of enormous public investments is something we should be
         | considering, looking at you oil industry. But until congress
         | appropriates the money under law it's a pipe dream or theft.
        
       | ipnon wrote:
       | All you need is attention and waiting. I feel like a zen monk.
        
       | jebarker wrote:
       | S1 (and R1 tbh) has a bad smell to me or at least points towards
       | an inefficiency. It's incredible that a tiny number of samples
       | and some inserted <wait> tokens can have such a huge effect on
       | model behavior. I bet that we'll see a way to have the network
       | learn and "emerge" these capabilities during pre-training. We
       | probably just need to look beyond the GPT objective.
        
         | pas wrote:
         | can you please elaborate on the wait tokens? what's that? how
         | do they work? is that also from the R1 paper?
        
           | jebarker wrote:
           | The same idea is in both the R1 and S1 papers (<think> tokens
           | are used similarly). Basically they're using special tokens
           | to mark in the prompt where the LLM should think more/revise
           | the previous response. This can be repeated many times until
           | some stop criteria occurs. S1 manually inserts these with
           | heuristics, R1 learns the placement through RL I think.
        
             | whimsicalism wrote:
              | ? They're not really special tokens.
        
               | jebarker wrote:
               | i'm not actually sure whether they're special tokens in
               | the sense of being in the vocabulary
        
               | whimsicalism wrote:
                | <think> might be; I think "wait" is tokenized like any
                | other word in the pretraining.
        
           | throwaway314155 wrote:
           | There's a decent explanation in the article, just FYI.
        
         | sfink wrote:
         | I agree, but LLMs in general have a horrendously bad smell in
         | terms of efficiency. s1 and r1 are just proving it.
         | 
         | The models' latent spaces are insanely large. The vast, vast
         | majority pretty much has to be irrelevant and useless, it's
         | just that the training commandeers random fragments of that
         | space to link up the logic they need and it's really hard to
         | know which of the weights are useless, which are useful but
         | interchangeable with other weights, and which are truly load-
         | bearing. You could probably find out easily by testing the
         | model against every possible thing you ever might want it to
         | do, just as soon as someone gets around to enumerating that
         | non-enumerable collection of tasks.
         | 
         | These bogus <wait> tokens kind of demonstrate that the models
         | are sort of desperate to escape the limitations imposed by the
         | limited processing they're allowed to do -- they'll take
         | advantage of thinking time even when it's provided in the
         | silliest manner possible. It's amazing what you can live if
         | it's all you have!
         | 
         | (Apologies for the extended anthropomorphizing.)
        
       | light_hue_1 wrote:
       | S1 has no relationship to R1. It's a marketing campaign for an
       | objectively terrible and unrelated paper.
       | 
       | S1 is fully supervised by distilling Gemini. R1 works by
       | reinforcement learning with a much weaker judge LLM.
       | 
       | They don't follow the same scaling laws. They don't give you the
       | same results. They don't have the same robustness. You can use R1
       | for your own problems. You can't use S1 unless Gemini works
       | already.
       | 
       | We know that distillation works and is very cheap. This has been
       | true for a decade; there's nothing here.
       | 
       | S1 is a rushed hack job (they didn't even run most of their
       | evaluations with an excuse that the Gemini API is too hard to
       | use!) that probably existed before R1 was released and then
       | pivoted into this mess.
        
       | bloomingkales wrote:
        | This thing that people are calling "reasoning" is more like
        | rendering to me really, or multi-pass rendering. We're just
        | refining the render; there's no reasoning involved.
        
         | dleslie wrote:
         | That was succinct and beautifully stated. Thank-you for the
         | "Aha!" moment.
        
           | bloomingkales wrote:
           | Hah. You should check out my other comment on how I think
           | we're obviously in a simulation (remember, we just need to
           | see a good enough render).
           | 
           | LLMs are changing how I see reality.
        
         | mistermann wrote:
         | "...there's no reasoning involved...wait, could I just be
         | succumbing to my heuristic intuitions of what is (seems to be)
         | true....let's reconsider using System 2 thinking..."
        
           | bloomingkales wrote:
           | Or there is no objective reality (well there isn't, check out
           | the study), and reality is just a rendering of the few state
           | variables that keep track of your simple life.
           | 
           | A little context about you:
           | 
           | - person
           | 
           | - has hands, reads HN
           | 
           | These few state variables are enough to generate a believable
           | enough frame in your rendering.
           | 
           | If the rendering doesn't look believable to you, you modify
           | state variables to make the render more believable, eg:
           | 
           | Context:
           | 
           | - person
           | 
           | - with hands
           | 
           | - incredulous demeanor
           | 
           | - reading HN
           | 
           | Now I can render you more accurately based on your
           | "reasoning", but truly I never needed all that data to see
           | you.
           | 
           | Reasoning as we know it could just be a mechanism to fill in
           | gaps in obviously sparse data (we absolutely do not have all
           | the data to render reality accurately, you are seeing an
           | illusion). Go reason about it all you want.
        
             | mistermann wrote:
             | Is this a clever rhetorical trick to make it appear that
             | your prior claim was correct?
             | 
             | If not: what am I intended to take away from this? What is
             | its relevance to my comment?
        
               | bloomingkales wrote:
               | You made a joke about questioning reality, I simply
               | entertained it. You can do whatever you want with it,
               | wasn't a slight at all.
        
               | mistermann wrote:
               | It may have been in the form of a joke, but I certainly
               | wasn't joking.
               | 
               | I think it is interesting what actions cannot be done by
               | humans.
        
               | bloomingkales wrote:
               | I wasn't joking either. Things are just getting started
               | with this AI stuff, and I feel like programmers will
               | experience that "de ja vu" phenomenon that they talk
               | about in the Matrix, that eerie feeling something isn't
               | right.
               | 
               | Look, why have game developers spent so much time lazy
               | loading parts of the game world? Very rarely do they just
               | load the whole world, even in 2025. See, the worlds get
               | bigger, so even as the tech gets better, we will always
               | lazy load worlds in.
               | 
               | It's a context issue right? Developers have just recently
               | been given this thing called "context".
               | 
               | But yeah man, why do we think just because we walked from
               | our house to the supermarket that this reality didn't
               | lazy load things. That's how programmers have been doing
               | it all along ...
               | 
               | Anyways
        
               | mistermann wrote:
               | A more parsimonious explanation: consciousness is
               | generative, like an LLM. And, according to cultural
               | conditioning, this generated scenario is referred to as
               | _reality_.
               | 
               | I like this version for at least two reasons:
               | 
               | 1. It is 100% compliant with large quantities of
                | scientific findings (psychology and neuroscience), whereas
               | I believe yours has a conservation of mass problem _at
               | least_
               | 
               | 2. Everyone dislikes it _at least in certain scenarios_
               | (say, when reference is made to it during an object level
               | disagreement)
        
         | ddrdrck_ wrote:
          | We could see it the other way around: what we call
          | "reasoning" may actually be some kind of multipass rendering,
          | whether it is performed by computers or human brains.
        
           | bloomingkales wrote:
           | Yes, of course. The implications are awesome.
        
         | pillefitz wrote:
         | Which is related to multistage/ hierarchical/coarse-to-fine
         | optimization, which is a pretty good way to find the global
         | optimum in many problem domains.
        
         | buyucu wrote:
         | Yes.
         | 
          | Before LLMs we had n-gram language models. Many tasks like
          | speech recognition worked as beam search in the graph defined
          | by the n-gram language model. You could easily get huge
          | accuracy gains simply by pruning your beam less.
         | 
          | s1 reminds me of this. You can always trade off latency for
          | accuracy. Given that these LLMs are much more complex than
          | good old n-grams, we're just discovering how to make this
          | trade.
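          | 
          | Toy version of that knob, to make the trade-off concrete
          | (`expand` and `score` stand in for an n-gram LM or any other
          | scorer; nothing here is from a real system):
          | 
          |     def beam_search(start, expand, score, beam_width=4, steps=10):
          |         beam = [(0.0, [start])]
          |         for _ in range(steps):
          |             candidates = []
          |             for logp, seq in beam:
          |                 for nxt in expand(seq):
          |                     candidates.append((logp + score(seq, nxt),
          |                                        seq + [nxt]))
          |             # the pruning happens here: a wider beam keeps more
          |             # hypotheses alive (better accuracy, more latency)
          |             candidates.sort(key=lambda c: c[0], reverse=True)
          |             beam = candidates[:beam_width]
          |         return beam[0][1]   # best hypothesis found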
        
           | bloomingkales wrote:
           | Let me carry that concept, "learning to do this trade", it's
           | a new trade.
           | 
           | I don't believe computer science has the algorithms to handle
           | this new paradigm. Everything was about sequential
           | deterministic outputs, and clever ways to do it fast. This
           | stuff is useless at the moment. We need new thinkers on how
           | to not think sequentially or how not to think about the
           | universe in such a small way.
           | 
           | Verifying input/output pairs is the old way. We need to
           | understand differently going forward.
        
         | LordDragonfang wrote:
         | How are you defining "reasoning"?
         | 
          | Because I see these sorts of gnostic assertions about LLMs all
         | the time about how they "definitely aren't doing <thing we
         | normally apply to meat-brains>" by gesturing at the technical
         | things it's doing, with no attempts to actually justify the
         | negative assertion.
         | 
         | It often comes across as privileged reason trying to justify
         | that of course the machine isn't doing some ineffable thing
         | only meat-brains do.
        
           | bloomingkales wrote:
           | From my other ridiculous comment, as I do entertain
           | simulation theory in my understanding of God:
           | 
           |  _Reasoning as we know it could just be a mechanism to fill
           | in gaps in obviously sparse data (we absolutely do not have
           | all the data to render reality accurately, you are seeing an
           | illusion). Go reason about it all you want._
           | 
           | The LLM doesn't know anything. We determine what output is
           | right, even if the LLM swears the output is right. We
           | "reason" about it, I guess? Well in this case the whole
           | "reasoning" process is to simply get an output that looks
           | right, so what is reasoning in our case?
           | 
           | Let me just go one ridiculous level lower. If I measure every
           | frame the Hubble telescope takes, and I measure with a simple
           | ruler the distances between things, frame by frame, I can
           | "reason" out some rules of the universe (planetary orbits).
           | In this "reasoning" process, the very basic question of "well
           | why, and who made this" immediately arises, so reasoning
           | always leads to the fundamental question of God.
           | 
           | So, yeah. We reason to see God, because that's all we're
           | seeing, everything else is an illusion. Reasoning is
           | inextricably linked to God, so we have to be very open minded
           | when we ask what is this machine doing.
        
             | LordDragonfang wrote:
             | Honestly, I was going to nitpick, but this definition
             | scratches an itch in my brain so nicely that I'll just
             | complement it as beautiful. "We reason to see God", I love
             | it.
             | 
             | (Also, if I might give a recommendation, you might be the
             | type of person to enjoy _Unsong_ by Scott Alexander
             | https://unsongbook.com/)
        
               | bloomingkales wrote:
               | Thank you for the suggestion and nice words. Trust me, I
               | have to sit here and laugh at the stuff I write too,
               | because I wasn't always a believer. So it's a little bit
               | of a trip for me too, I'm still exploring my own
               | existence.
        
         | frontalier wrote:
         | sshhhh, let the money flow
        
       | whimsicalism wrote:
        | this isn't RLVR and so is sorta uninteresting; they are just
        | distilling work that's already been done
        
       | bloomingkales wrote:
       | If an LLM output is like a sculpture, then we have to sculpt it.
       | I never did sculpting, but I do know they first get the clay
       | spinning on a plate.
       | 
       | Whatever you want to call this "reasoning" step, ultimately it
       | really is just throwing the model into a game loop. We want to
       | interact with it on each tick (spin the clay), and sculpt every
       | second until it looks right.
       | 
       | You will need to loop against an LLM to do just about anything
       | and everything, forever - this is the default workflow.
       | 
       | Those who think we will quell our thirst for compute have another
       | thing coming, we're going to be insatiable with how much LLM
       | brute force looping we will do.
        
         | MrLeap wrote:
         | This is a fantastic insight and really has my gears spinning.
         | 
         | We need to cluster the AI's insights on a spatial grid hash,
         | give it a minimap with the ability to zoom in and out, and give
         | it the agency to try and find its way to an answer and build up
         | confidence and tests for that answer.
         | 
         | coarse -> fine, refine, test, loop.
         | 
         | Maybe a parallel model that handles the visualization stuff. I
         | imagine its training would look more like computer vision. Mind
         | palace generation.
         | 
         | If you're stuck or your confidence is low, wander the palace
         | and see what questions bubble up.
         | 
         | Bringing my current context back through the web is how I think
         | deeply about things. The context has the authority to reorder
         | the web if it's "epiphany grade".
         | 
         | I wonder if the final epiphany at the end of what we're
         | creating is closer to "compassion for self and others" or "eat
         | everything."
        
         | zoogeny wrote:
         | I can't believe this hasn't been done yet, perhaps it is a cost
         | issue.
         | 
         | My literal first thought about AI was wondering why we couldn't
         | just put it in a loop. Heck, one update per day, or one update
         | per hour would even be a start. You have a running "context",
         | the output is the next context (or a set of transformations on
         | a context that is a bit larger than the output window). Then
         | ramp that up ... one loop per minute, one per second,
         | millisecond, microsecond.
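          | 
          | The bare loop itself is trivial to sketch (`call_llm` is a
          | stand-in for whatever API or local model you'd actually use,
          | and the prompt wording is made up):
          | 
          |     import time
          | 
          |     def call_llm(prompt: str) -> str:
          |         # Stand-in: swap in a real API or local-model call.
          |         return prompt   # echo, so the sketch runs end to end
          | 
          |     def run_loop(context: str, ticks: int = 10,
          |                  delay_s: float = 60.0) -> str:
          |         for _ in range(ticks):
          |             context = call_llm(
          |                 "Here is your current working context:\n"
          |                 f"{context}\n\n"
          |                 "Rewrite it: keep long-term goals, "
          |                 "update short-term notes."
          |             )
          |             time.sleep(delay_s)   # e.g. one update per minute
          |         return context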
        
           | layer8 wrote:
           | Same. And the next step is that it must feed back into
           | training, to form long-term memory and to continually learn.
        
             | zoogeny wrote:
             | I analogize this with sleep. Perhaps that is what is
              | needed, 6 hours offline per day to LoRA the base model on
             | some accumulated context from the day.
        
               | dev0p wrote:
               | LLMs need to sleep too. Do they dream of electric sheep?
        
           | int_19h wrote:
           | The hard part is coming up with a good way to grade results.
           | Which you need to update the weights based on the outcome,
           | otherwise the model will not actually learn anything.
        
             | zoogeny wrote:
             | For the "looping" I'm talking about you don't need to
             | update the weights. It is simply, old context in, new
             | context out, new context in, new-new context out, etc.
             | 
             | Of course, keeping that coherent over numerous loops isn't
             | going to be easy. No doubt there is a chance it goes off
             | the rails. So you might have a section of context that is
             | constantly stable, a section of context that updates each
             | loop, etc.
             | 
             | In the other response to my comment someone mentioned
             | eventually updating the weights (e.g. daily) and you would
             | in that case have to have some kind of loss function.
        
               | int_19h wrote:
               | Then I'm not quite sure what benefit you expect to derive
               | from it? Making e.g. QwQ-32 loop isn't hard - it often
               | does it all by itself, even. But it doesn't translate to
               | improvements on every iteration; it just goes in circles.
        
               | bloomingkales wrote:
               | When you look out your eyes, that rectangular viewport is
               | all your context. For example, we cannot fit Jupiter into
               | your viewport.
               | 
               | So, if we can never fit large concepts like Jupiter into
               | your viewport (your context window), does it follow that
               | you will never understand Jupiter? Or is there no way to
               | take a picture of Jupiter and make it smaller so it fits
               | into your viewport?
               | 
               | See, coherence is just resizing or reimagining things so
               | it fits into context. Context can never hold it all
               | because we have the capacity to always imagine infinitely
               | large things.
               | 
               | So I don't fully know if it's old context out, new
               | context in, but could be _just the right context, just
               | enough of it, and just the right looking context so we
               | can assemble the right worldview_. This process would
               | have to go on infinitely, and that is the world as you
               | see and understand it.
        
         | gyomu wrote:
         | > If an LLM output is like a sculpture, then we have to sculpt
         | it. I never did sculpting, but I do know they first get the
         | clay spinning on a plate.
         | 
         | That's pottery, not sculpture. Traditionally in sculpture you
         | start from a block of marble or wood, but you can also make
         | sculptures of cast bronze or welded steel (or clay, but you
         | don't use a spinning plate).
        
           | bloomingkales wrote:
           | Thank you for the clarification. I wanted to use some kind of
           | visual to show the model in a loop. Otherwise, I'd just have
           | to say explicitly that the sculptor is the one in the loop,
           | as in the person will not stop chiseling. It's in this
           | infinite chiseling that we get our answers (same thing as
           | finding a limit in calculus as it approaches infinity, we
           | will never get the discrete answer, but we will get
           | infinitely close enough to label a discrete point
           | confidently).
           | 
           | In other words, we fly as close to the sun as possible and
           | get our measurements :)
        
       | incrudible wrote:
       | Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.
       | 
       | Wait, actually 1 + 1 equals 1.
        
         | falcor84 wrote:
         | As one with teaching experience, the idea of asking a student
         | "are you sure about that?" is to get them to think more deeply
         | rather than just blurting a response. It doesn't always work,
         | but it generally does.
        
           | latexr wrote:
           | It works because the question itself is a hint born of
           | knowledge. "Are you sure about that" is a polite way to say
           | "that answer is wrong, try again". Students know that, so
           | instead of doubling down will redo their work with the
           | assumption they made a mistake. It is much rarer to ask the
           | question when the answer is correct, and in fact doing so is
           | likely to upset the learner because they had to redo the work
           | for no reason.
           | 
           | If you want a true comparison, start asking that question
           | _every time_ and then compare. My hypothesis is students
           | would start ignoring the prompt and answering "yes" every
           | time to get on with it.
        
       | ALittleLight wrote:
       | At 6 dollars per run, I'm tempted to try to figure out how to
       | replicate this. I'd like to try some alternatives to "wait" -
       | e.g. "double checking..." Or write my own chains of thought.
        
         | qup wrote:
         | Like the ones they tested?
        
           | ALittleLight wrote:
           | Yes, that is what "replicate" with my own ideas means.
        
       | kittikitti wrote:
       | Thank you for this, I really appreciate this article and I
       | learned a bunch!
        
       | Aperocky wrote:
       | For all the hype about thinking models, this feels much like
       | compression in terms of information theory instead of a "takeoff"
       | scenario.
       | 
        | There is a finite amount of information stored in any large
        | model; the models are really good at presenting the correct
        | information back, and adding thinking blocks made them even
        | better at doing that. But there is a cap to that.
       | 
       | Just like how you can compress a file by a lot, there is a
       | theoretical maximum to the amount of compression before it starts
       | becoming lossy. There is also a theoretical maximum of relevant
       | information from a model regardless of how long it is forced to
       | think.
        
         | psadri wrote:
         | I think an interesting avenue to explore is creating
         | abstractions and analogies. If a model can take a novel
         | situation and create an analogy to one that it is familiar
         | with, it would expand its "reasoning" capabilities beyond its
         | training data.
        
         | zoogeny wrote:
         | I think this is probably accurate and what remains to be seen
         | is how "compressible" the larger models are.
         | 
         | The fact that we can compress a GPT-3 sized model into an o1
         | competitor is only the beginning. Maybe there is even more
         | juice to squeeze there?
         | 
         | But even more, how much performance will we get out of o3 sized
         | models? That is what is exciting since they are already
          | performing near PhD level on most evals.
        
         | jedbrooke wrote:
         | my thinking (hope?) is that the reasoning models will be more
         | like how a calculator doesn't have to "remember" all the
         | possible combinations of addition, multiplication, etc for all
         | the numbers, but can actually compute the results.
         | 
         | As reasoning improves the models could start with a basic set
         | of principles and build from there. Of course for facts
         | grounded in reality RAG would still likely be the best, but
         | maybe with enough "reasoning" a model could simulate an
         | approximation of the universe well enough to get to an answer.
        
       | hidelooktropic wrote:
       | > I doubt that OpenAI has a realistic path to preventing or even
       | detecting distealing outside of simply not releasing models.
       | 
       | Couldn't they just start hiding the thinking portion?
       | 
       | It would be easy for them to do this. Currently, they already
       | provide one-sentence summaries for each step of the thinking. I
       | think users would be fine, or at least stay, if it were changed to
       | provide only that.
        
         | Aperocky wrote:
         | They hid it and deepseek came up with R1 anyway, with RL on
         | only results and not even needing any of the thinking tokens
         | that OpenAI hid.
        
           | girvo wrote:
           | Which is still the funniest and most interesting result in AI
           | so far IMO. Fascinating, but sort of makes intuitive sense
           | too!
        
       | mtrovo wrote:
       | I found the discussion around inference scaling with the 'Wait'
       | hack so surreal. The fact that such an ingeniously simple method can
       | impact performance makes me wonder how much low-hanging fruit
       | we're still missing. So weird to think that improvements in a
       | branch of computer science boil down to conjuring the right
       | incantation words. How do you even change your mindset to start
       | thinking this way?
        
         | nyoomboom wrote:
         | I think a skill here is learning a bias for experimentation and
         | accepting the results one finds. Also the book "Why Greatness
         | Cannot Be Planned" showcases the kind of open ended play that
         | results in people discovering stuff like this.
        
         | cubefox wrote:
         | Now imagine where we are in 12 months from now. This article
         | from February 5 2025 will feel quaint by then. The acceleration
         | keeps increasing. It seems likely we will soon have recursive
         | self-improving AI -- reasoning models which do AI research.
         | This will accelerate the rate of acceleration itself. It sounds
         | stupid to say it, but yes, the singularity is near. Vastly
         | superhuman AI now seems likely to arrive within the next few years.
         | Terrifying.
        
           | gom_jabbar wrote:
           | Yes, and Accelerationism predicted this development back in
           | the 1990s, perhaps most prominently in the opening lines of
           | Nick Land's _Meltdown_ (1994) text:
           | 
           |   [[ ]] The story goes like this: Earth is captured by a
           |   technocapital singularity as renaissance rationalization and
           |   oceanic navigation lock into commoditization take-off.
           |   Logistically accelerating techno-economic interactivity
           |   crumbles social order in auto-sophisticating machine runaway.
           |   As markets learn to manufacture intelligence, politics
           |   modernizes, upgrades paranoia, and tries to get a grip.
           | 
           | > reasoning models which do AI research
           | 
           | In the introduction to my research project on Accelerationism
           | [0], I write:
           | 
           |   Faced with the acceleration of progress in Artificial
           |   Intelligence (AI) -- with AI agents now automating AI research
           |   and development -- Accelerationism no longer seems like an
           |   abstract philosophy producing empty hyperstitional hype, but
           |   like a sober description of reality. The failed 2023 memorandum
           |   to stop AI development on systems more powerful than OpenAI's
           |   ChatGPT-4 perfectly illustrates the phenomenological aspects of
           |   Accelerationism: "To be rushed by the phenomenon, to the point
           |   of terminal institutional paralysis, is the phenomenon." [1]
           | 
           | At the current rate of acceleration, if you don't write
           | hyperstitionally, your texts are dead on arrival.
           | 
           | [0] https://retrochronic.com/
           | 
           | [1] Nick Land (2017). _A Quick-and-Dirty Introduction to
           | Accelerationism_ in _Jacobite Magazine._
        
             | versteegen wrote:
             | Nice. Though I couldn't understand those "opening lines"
             | until I read in your Introduction:
             | 
             | > For Land, capitalism begins in Northern Italy around 1500
             | with "the emerging world of technologists and accountants",
             | the spiral interexcitation of "oceanic navigation and
             | place-value calculation", and zero-unlocked double-entry
             | book-keeping
             | 
             | Fibonacci, amongst many others, played a critical role in that
             | highly accelerative technology.
        
             | pizza wrote:
             | Hope we get the Nick Land the younger, and not Nick Land
             | the elder, set of outcomes. Somewhere, sometime, along the
             | way, it seems like everything from CCRU and Duginism leapt
             | out of the page into the real. Maybe it's just the
             | beginning of the Baudrillardian millennium.
        
           | zoogeny wrote:
           | This is something I have been suppressing since I don't want
           | to become chicken little. Anyone who isn't terrified by the
           | last 3 months probably doesn't really understand what is
           | happening.
           | 
           | I went from accepting I wouldn't see a true AI in my
           | lifetime, to thinking it is possible before I die, to
           | thinking it is possible in the next decade, to thinking it
           | is probable in the next 3 years, to wondering if we might see
           | it this year.
           | 
           | Just 6 months ago people were wondering if pre-training was
           | stalling out and if we hit a wall. Then deepseek drops with
           | RL'd inference time compute, China jumps from being 2 years
           | behind in the AI race to being neck-and-neck and we're all
           | wondering what will happen when we apply those techniques to
           | the current full-sized behemoth models.
           | 
           | It seems the models that are going to come out around summer
           | time may be jumps in capability beyond our expectations. And
           | the updated costs mean that there may be several open source
           | alternatives available. The intelligence that will be
           | available to the average technically literate individual will
           | be frightening.
        
             | palmotea wrote:
             | > The intelligence that will be available to the average
             | technically literate individual will be frightening.
             | 
             | That's not the scary part. The scary part is the
             | intelligence _at scale_ that could be available to the
             | _average employer_. Lots of us like to LARP that we're
             | capitalists, but _very_ few of us are. There's zero
             | ideological or cultural framework in place to prioritize
             | the well being of the general population over the profits
             | of _some_ capitalists.
             | 
             | AI, especially accelerating AI, is bad news for anyone who
             | needs to work for a living. It's not going to lead to a
             | Star Trek fantasy. It means an eventual phase change for
             | the economy that consigns us (and most consumer product
             | companies) to wither and fade away.
        
               | 101008 wrote:
               | I agree with you and I am scared. My problem is: if most
               | people can't work, who is going to pay for the
               | products/services created with AI?
               | 
               | I get a lot of "AI will allow us to create a SaaS in a
               | weekend" and "AI will take engineers' jobs", which I think
               | may both be true. But a lot of SaaS survive because
               | engineers pay for them -- if engineers don't exist
               | anymore, a lot of SaaS won't either. If you eat your
               | potential customers, creating quick SaaS doesn't make
               | sense anymore (yeah, there are exceptions, etc., I know).
        
               | immibis wrote:
               | Those people will simply be surplus to requirements.
               | They'll be left alone as long as they don't get in the
               | way of the ruling class, and disposed of if they do. As
               | usual in history.
        
               | lodovic wrote:
               | That's a fallacy. You can't have an advanced economy with
               | most people sitting on the side. Money needs to keep
               | flowing. If all that remains of the economy consists of a
               | few datacenters talking to each other, how can the ruling
               | class profit off that?
        
               | palmotea wrote:
               | > You can't have an advanced economy with most people
               | sitting on the side.
               | 
               | If AI lives up to the hype, that will become possible.
               | 
               | > If all that remains of the economy consists of a few
               | datacenters talking to each other, how can the ruling
               | class profit off that?
               | 
               | I don't think it would be that. There'd also be power
               | generation, manufacturing, mining, and construction,
               | etc.; but all extremely automated. If you get to truly
               | extreme levels of wealth concentration, things would
               | shift out of our capitalist market system model, and
               | concepts like "profit" would become anachronisms.
               | 
               | It actually might kinda look like the "economy" of
               | Starcraft: you gather resources, decide what to build
               | with them, and order it all around according to your
               | whim. There will be a handful of guys playing, and
               | everyone else will be a NPC.
        
               | andreasmetsala wrote:
               | > It actually might kinda look like the "economy" of
               | Starcraft: you gather resources, decide what to build
               | with them, and order it all around according to your
               | whim. There will be a handful of guys playing, and
               | everyone else will be a NPC.
               | 
               | I guess if the "players" are sociopathic enough they
               | might decide to just wipe out the NPCs. The possibility
               | of someone like Putin or Musk becoming the sole member of
               | the post-singularity humanity does make me pause.
        
               | cubefox wrote:
               | That's assuming people like Altman can keep artificial
               | superintelligence under human control. It very well may
               | escape control and humanity would be disempowered
               | forever. Or worse, wiped out.
        
               | danans wrote:
               | > Money needs to keep flowing. If all that remains of the
               | economy consists of a few datacenters talking to each
               | other, how can the ruling class profit off that?
               | 
               | Plenty of profit was made off feudalism, and
               | technofeudalism has all the tools of modern technology at
               | its disposal. If things go in that direction, they will
               | have an unlimited supply of serfs desperate for whatever
               | human work/livelihood is left.
        
               | soco wrote:
               | Unlimited supply yes, but highly limited usage for them.
               | So even if a few will work for free, the rest will be
               | starving, and angry.
        
               | palmotea wrote:
               | > My problem is: if most people can't work, who is going
               | to pay for the product/services created with IA?
               | 
               | A lot of those will probably go under, too. I think a lot
               | of people are in for a rude awakening.
               | 
               | The only people our society and economy really values are
               | the elite with ownership and control, and the people who
               | get to eat and have comfort are those who provide things
               | that are directly or indirectly valuable to that elite.
               | AI will enable a game of musical chairs, with economic
               | participants iteratively eliminated as the technology
               | advances, until there are only a few left controlling
               | vast resources and capabilities, to be harnessed for
               | personal whims. The rest of us will be like rats in a
               | city, scraping by on the margins, unwanted, out of sight,
               | subsisting on scraps, perhaps subject to "pest control"
               | regimes.
        
               | nazgul17 wrote:
               | This is the same conclusion I can't help but reach. I
               | would love nothing more than to be convinced that (there
               | is a chance that) that is not going to happen.
        
               | kortilla wrote:
               | > The only people our society and economy really values
               | are the elite with ownership and control
               | 
               | This isn't true. The biggest companies are all rich
               | because they cater to the massive US middle class. That's
               | where the big money is at.
        
               | palmotea wrote:
               | > This isn't true. The biggest companies are all rich
               | because they cater to the massive US middle class.
               | 
               | It _is_ true, but I can see why you'd be confused. Let
               | me ask you this: if members of the "the massive US middle
               | class" can be replaced with automation, are those
               | companies going 1) to keep paying those workers to
               | support the middle-class demand which made them rich, or
               | are they going to 2) fire them so more money can be
               | shoveled up to the shareholders?
               | 
               | The answer is obviously #2, which has been proven time
               | and again (e.g. how we came to have "the Rust Belt").
               | 
               | > That's where the big money is at
               | 
               |  _Now_ , but not necessarily in the future. I think AI
               | (if it doesn't hit a wall) will change that, maybe not
               | instantaneously, but over time.
        
               | soco wrote:
               | So you end up with a huge starving mob trying to swarm all
               | over your mansions and islands. I somehow think Musk is
               | totally capable of nuking those mobs, or unleashing the
               | (future) AI dogs on them, because the mob cannot
               | produce anymore (because of AI) and cannot pay anymore
               | (because no jobs because of AI). So the mob will be
               | totally worthless to this style of "capitalism". Really
               | why would they bother with UBI when they can let the mob
               | just die out?
        
               | andreasmetsala wrote:
               | > AI, especially accelerating AI, is bad news for anyone
               | who needs to work for a living. It's not going to lead to
               | a Star Trek fantasy. It means an eventual phase change
               | for the economy that consigns us (and most consumer
               | product companies) to wither and fade away.
               | 
               | How would that work? If there are no consumers then why
               | even bother producing? If the cost of labor and capital
               | trends towards zero then the natural consequence is
               | incredible deflation. If the producers refuse to lower
               | their prices then they either don't participate in the
               | market (which also means their production is pointless)
               | or ensure some other way that the consumers can buy their
               | products.
               | 
               | Our society isn't really geared for handling double digit
               | deflation so something does need to change if we really
               | are accelerating exponentially.
        
               | palmotea wrote:
               | > How would that work? If there are no consumers then why
               | even bother producing?
               | 
               | Whim and ego. I think the advanced economy will shift to
               | supporting trillionaires doing things _like_ "DIY home
               | improvement" for themselves. They'll own a bunch of
               | automated resources (power generation, mining,
               | manufacturing, AI engineers), and use it to do whatever
               | they want. Build pyramids on the moon, while the now
               | economically-useless former middle-class laborers shiver
               | in the cold? Sure, why not?
        
               | ben_w wrote:
               | > If there are no consumers then why even bother
               | producing?
               | 
               | > If the producers refuse to lower their prices then they
               | either don't participate in the market (which also means
               | their production is pointless) or ensure some other way
               | that the consumers can buy their products.
               | 
               | Imagine you're a billionaire with a data centre and
               | golden horde of androids.
               | 
               | You're the consumer, the robots make stuff for you; they
               | don't make stuff for anyone else, just you, in the same
               | way and for the same reason that your power tools and
               | kitchen appliances don't commute to work -- you could, if
               | you wanted, lend them to people, just like those other
               | appliances, but you'd have to actually choose to, it
               | wouldn't be a natural consequence of the free market.
               | 
               | Their production is, indeed, pointless. This doesn't help
               | anyone else eat. The moment anyone can afford to move
               | from "have not" to "have", they drop out of the demand
               | market for everyone else's economic output.
               | 
               | I don't know how big the impact of dropping out would be:
               | the right says "trickle down economics" is good and this
               | would be the exact opposite of that; while the left's
               | criticism of trickle-down economics is that in practice
               | the super-rich already have so much stuff that making
               | them richer doesn't enrich anyone else who might service
               | them, so if the right is correct then this is bad but if
               | the left is correct then this makes very little
               | difference.
               | 
               | Unfortunately, "nobody knows" is a great way to get a
               | market panic all by itself.
        
               | luc4sdreyer wrote:
               | That is assuming the accelerating AI stays under human
               | control.
               | 
               | We're racing up a hill at an ever-increasing speed, and
               | we don't know what's on the other side. Maybe 80% chance
               | that it's either nothing or "simply" a technological
               | revolution.
        
             | pjc50 wrote:
             | This frightens mostly people whose identity is built around
             | "intelligence", but without grounding in the real world.
             | I've yet to see really good articulations of what,
             | precisely, we should be scared of.
             | 
             | Bedroom superweapons? Algorithmic propaganda? These things
             | have humans in the loop building them. And the problem of
             | "human alignment" is one unsolved since Cain and Abel.
             | 
             | AI _alone_ is words on a screen.
             | 
             | The sibling thread details the "mass unemployment"
             | scenario, which would be destabilizing, but understates how
             | much of the current world of work is still physical. It's a
             | threat to pure desk workers, but we're not the majority of
             | the economy.
             | 
             | Perhaps there will be political instability, but .. we're
             | already there from good old humans.
        
               | danans wrote:
               | > without grounding in the real world.
               | 
               | > I've yet to see really good articulations of what,
               | precisely we should be scared of. Bedroom superweapons?
               | 
               | Loss of paid employment opportunities and increasing
               | inequality are real world concerns.
               | 
               | UBI isn't coming by itself.
        
               | pjc50 wrote:
               | Sure, but those are also real world concerns in the non-
               | AI alternate timeline. As is the unlikelihood of UBI.
        
               | danans wrote:
               | Yes, but they are likely dramatically accelerated in the
               | AI timeline.
        
               | mvieira38 wrote:
               | Worst case scenario humans mostly go back to manual
               | labor, which would fix a lot of modern day ailments such
               | as obesity and (some) mental health struggles, with added
               | enormous engineering advancements based on automatic
               | research.
        
               | ben_w wrote:
               | > This frightens mostly people whose identity is built
               | around "intelligence", but without grounding in the real
               | world.
               | 
               | It has certainly had this impact on my identity; I am
               | unclear how well-grounded I really am*.
               | 
               | > I've yet to see really good articulations of what,
               | precisely we should be scared of.
               | 
               | What would such an articulation look like, given you've
               | not seen it?
               | 
               | > Bedroom superweapons? Algorithmic propaganda? These
               | things have humans in the loop building them.
               | 
               | Even with current limited systems -- which are not purely
               | desk workers, they're already being connected to and
               | controlling robots, even by amateurs -- AI lowers the
               | minimum human skill level needed to do those things.
               | 
               | The fear is: how far are we from an AI that _doesn't_
               | need a human in the loop? Because ChatGPT was almost
               | immediately followed by ChaosGPT, and I have every reason
               | to expect people to continue to make clones of ChaosGPT
               | continuously until one is capable of actually causing
               | harm. (As with 3d-printed guns, high chance the first
               | ones will explode in the face of the user rather than the
               | target).
               | 
               | I hope we're years away, just as self driving cars turned
               | out to be over-promised and under-delivered for the last
               | decade -- even without a question of "safety", it's going
               | to be hard to transition the world economy to one where
               | humans need not apply.
               | 
               | > And the problem of "human alignment" is one unsolved
               | since Cain and Abel.
               | 
               | Yes, it is unsolved since time immemorial.
               | 
               | This has required us to not only write laws, but also
               | design our societies and institutions such that humans
               | breaking laws doesn't make everything collapse.
               | 
               | While I dislike the meme "AI == crypto", one overlap is
               | that both have nerds speed-running discovering how
               | legislation works and why it's needed -- for crypto,
               | specifically financial legislation after it explodes in
               | their face; for AI, to imbue the machine with a reason to
               | approximate society's moral code, because they see the
               | problem coming.
               | 
               | --
               | 
               | * Dunning Kruger applies; and now I have first-hand
               | experience of what this feels like from the inside, as my
               | self-perception of how competent I am at German has
               | remained constant over 7 years of living in Germany and
               | improving my grasp of the language the entire time.
        
               | zoogeny wrote:
               | Some of the scariest horror movies are the ones where the
               | monster isn't shown. Often once the monster is shown, it
               | is less terrifying.
               | 
               | In a general sense, uncertainty causes anxiety. Once you
               | know the properties of the monster you are dealing with
               | you can start planning on how to address it.
               | 
               | Some people have blind and ignorant confidence. A feeling
               | they can take on literally anything, no matter how
               | powerful. Sometimes they are right, sometimes they are
               | wrong.
               | 
               | I'm reminded by the scene in No Country For Old Men where
               | the good guy bad-ass meets the antagonist and immediately
               | dies. I have little faith in blind confidence.
               | 
               | edit: I'll also add that human adaptability (which is
               | probably the trait on which most confidence in humans rests)
               | has shown itself capable of saving us from many previous
               | civilization changing events. However, this change with
               | AI is happening much, much faster than any before it. So
               | part of the anxiety is whether or not our species
               | reaction time is enough to avoid the cliff we are
               | accelerating towards.
        
         | koala_man wrote:
         | It feels like we're back in 1900 when anyone's clever idea (and
         | implementation) can give huge performance improvements, such as
         | Ford's assembly line and Taylor's scientific management of
         | optimizing shovel sizes for coal.
        
           | andrewfromx wrote:
           | yes, it also feels like we are going to lose our just-in-time
           | global shipments of anything to anywhere any day now. It will
           | soon feel like 1900 in other ways.
        
             | BobbyTables2 wrote:
             | We'll have to raise our own chickens too...
        
             | eru wrote:
             | Hope we don't get 1914 again, too.
        
         | xg15 wrote:
         | I think the fact alone that distillation and quantization are
         | techniques that can produce substantial improvements is a
         | strong sign that we still have no real comprehensive
         | understanding of how the models work.
         | 
         | If we had, there would be no reason to train a model with more
         | parameters than are strictly necessary to represent the space's
         | semantic structure. But then it should be impossible for
         | distilled models with fewer parameters to come close to the
         | performance of the original model.
         | 
         | Yet this is what happens - the distilled or quantized models
         | often come very close to the original model.
         | 
         | So I think there are still many low-hanging fruits to pick.
        
           | teruakohatu wrote:
           | > still have no real comprehensive understanding how the
           | models work.
           | 
           | We do understand how they work; we just have not optimised
           | their usage.
           | 
           | For example, take someone who has a good general understanding of
           | how an ICE or EV car works. Even if the user interface is
           | very unfamiliar, they can figure out how to drive any car
           | within a couple of minutes.
           | 
           | But that does not mean they can race a car, drift a car or
           | drive a car on challenging terrain even if the car is
           | physically capable of all these things.
        
             | spiorf wrote:
             | We know how the next token is selected, but not why doing
             | that repeatedly brings all the capabilities it does. We
             | really don't understand how the emergent behaviours emerge.
        
               | Valgrim wrote:
               | It feels less like a word prediction algorithm and more
               | like a world model compression algorithm. Maybe we tried
               | to create one and accidentally created the other?
        
               | codeulike wrote:
               | Its almost like a Model of Language, but very Large
        
               | bloomingkales wrote:
               | Why would asking a question about ice cream trigger a
               | consideration about all possible topics? As in, to
               | formulate the answer, the LLM will consider the origin of
               | Elephants even. It won't be significant, but it will be
               | factored in.
               | 
               | Why? In the spiritual realm, many postulated that even
               | the Elephant you never met is part of your life.
               | 
               | None of this is a coincidence.
        
             | gessha wrote:
             | Your example is somewhat inadequate. We _fundamentally_
             | don't understand how deep learning systems work in the
             | sense that they are more or less black boxes that we train
             | and evaluate. Innovations in ML are a whole bunch of
             | wizards with big stacks of money changing "Hmm" to "Wait"
             | and seeing what happens.
             | 
             | Would a different sampler help you? I dunno, try it. Would
             | a smaller dataset help? I dunno, try it. Would training the
             | model for 5000 days help? I dunno, try it.
             | 
             | Car technology is the opposite of that - it's a white box.
             | It's composed of very well defined elements whose
             | interactions are defined and explained by laws of
             | thermodynamics and whatnot.
        
               | brookst wrote:
               | Isn't that just scale? Even small LLMs have more parts
               | than any car.
               | 
               | LLMs are more analogous to economics, psychology,
               | politics -- it is possible there's a core science with
               | explicability, but the systems are so complex that even
               | defining the _question_ is hard.
        
               | ChymeraXYZ wrote:
               | Could be, but it does not change the fact that we do not
               | understand them as of now.
        
               | gessha wrote:
               | You can make a bigger ICE engine (like a container ship
               | engine) and still understand how the whole thing works.
               | Maybe there's more parts moving but it still has the
               | structure of an ICE engine.
               | 
               | With neural networks big or small, we got no clue what's
               | going on. You can observe the whole system, from the
               | weights and biases, to the activations, gradients, etc
               | and still get nothing.
               | 
               | On the other hand, one of the reasons why economics,
               | psychology and politics are hard is because we can't open
               | up people's heads and define and measure what they're
               | thinking.
        
               | ijk wrote:
               | One way I've heard it summarized: Computer Science as a
               | field is used to things being like physics or chemistry,
               | but we've suddenly encountered something that behaves
               | more like biology.
        
               | bloomingkales wrote:
               | Physics and biology are just ways to measure our creator.
               | It gives us a sense of control because, hey, we can
               | measure it, predict it, and inevitably wield it (become
               | god).
               | 
               | Small thought experiment, if God shitted you out into a
               | toilet, and you spent your waking life floating in a
               | toilet measuring the orbit of floating chunks of shit
               | with a giant telescope, all you would have measured and
               | learned is the worthless dynamics of shit floating
               | around. That's roughly the implication of science when it
               | comes to the divine (or generally a reiteration of Hume's
               | problem with induction).
               | 
               | So now we are given some insight into our brain and we
               | are finding that it is immeasurable. Infinity _within_
               | us, this whole time you saw God as outside you, yet this
               | technology is showing you infinity. An LLM and your
               | imagination, both can dream infinitely.
               | 
               | Infinity is a characteristic of God, and some scientists
               | can't do science without wholesale rejecting this
               | hypothesis. Therefore no good scientific experiment (and
               | discoveries) will emerge in this field without forming a
               | better hypothesis (asking the right question).
               | 
               | Hypothesis:
               | 
               | We have found god in one of its undeniable forms:
               | infinite imagination.
               | 
               | Hypothesis:
               | 
               | A perfect being is not limited by context, it considers
               | all things in the model for all prompts. The concept of
               | all of us being connected, a yin and yang, that every
               | little thing we do is part of a bigger thing, this all
               | follows.
        
               | red1reaper wrote:
               | "God" as a concept in unproven to exist, it is also
               | impossible to prove, so for all intents and porpouses it
               | doesn't exist.
        
               | raducu wrote:
               | > _fundamentally_ don't understand how deep learning
               | systems works.
               | 
               | It's like saying we don't understand how quantum
               | chromodynamics works. Very few people do, and it's the
               | kind of knowledge not easily distilled for the masses in
               | an easily digestible, popsci way.
               | 
               | Look into how older CNNs work -- we have very good
               | visual/accessible/popsci materials on how they work.
               | 
               | I'm sure we'll have that for LLMs, but it's not worth it to
               | the people who can produce that kind of material to
               | produce it now when the field is moving so rapidly, those
               | people's time is much better used in improving the LLMs.
               | 
               | The kind of progress being made leads me to believe there
               | absolutely ARE people who absolutely know how the LLMs
               | work and they're not just a bunch of monkeys randomly
               | throwing things at GPUs and seeing what sticks.
        
               | gessha wrote:
               | As a person who has trained a number of computer vision
               | deep networks, I can tell you that we have some cool-
               | looking visualizations on how lower layers work but no
               | idea how later layers work. The intuition is built over
               | training numerous networks and trying different
               | hyperparameters, data shuffling, activations, etc. it's
               | absolutely brutal over here. If the theory was there,
               | people like Karpathy who have great teacher vibes
               | would've explained it for the mortal grad students or
               | enthusiast tinkerers.
               | 
               | > The kind of progress being made leads me to believe
               | there absolutely ARE people who absolutely know how the
               | LLMs work and they're not just a bunch of monkeys
               | randomly throwing things at GPUs and seeing what sticks
               | 
               | I say this less as an authoritative voice but more as an
               | amused insider: Spend a week with some ML grad students
               | and you will get a chuckle whenever somebody says we're
               | not some monkeys throwing things at GPUs.
        
               | bloomingkales wrote:
               | It may be as simple as this:
               | 
               | https://youtube.com/shorts/7GrecDNcfMc
               | 
               | Many many layers of that. It's not a profound mechanism.
               | We can understand how _that_ works, but we're dumbfounded
               | how such a small mechanism is responsible for all this
               | stuff going on inside a brain.
               | 
               | I don't think we don't understand, it's a level beyond
               | that. We can't fathom the implications, that it could be
               | that simple, just scaled up.
        
               | ClumsyPilot wrote:
               | > Many many layers of that. It's not a profound mechanism
               | 
               | Bad argument. Cavemen understood stone, but they could
               | not build the aqueducts. Medieval people understood iron,
               | water and fire but they could not make a steam engine
               | 
               | Finally we understand protons, electrons, and neutrons
               | and the forces that govern them, but it does not mean
               | we understand everything they could possibly make.
        
               | bloomingkales wrote:
               | _" Cavemen understood stone"_
               | 
               | How far removed are you from a caveman is the better
               | question. There would be quite some arrogance coming out
               | of you to suggest the several million years gap is
               | anything but an instant in the grand timeline. As in, you
               | understood stone just yesterday ...
               | 
               | The monkey that found the stone is the monkey that built
               | the cathedral. It's only a delusion the second monkey
               | creates to separate it from the first monkey (a feeling
               | of superiority, with the only tangible asset being "a
               | certain amount of notable time passed since point A and
               | point B").
               | 
               |  _" Finally we understand protons, electrons, and
               | neutrons and the forces that government them but it does
               | not mean we understand everything they could mossibly
               | make"_
               | 
               | You and I agree. That those simple things can truly
               | create infinite possibilities. That's all I was saying,
               | we cannot fathom it (either because infinity is hard to
               | fathom, or that its origins are humble - just a few core
               | elements, or both, or something else).
               | 
               | Anyway, this discussion can head in any direction.
        
               | ClumsyPilot wrote:
               | > The kind of progress being made leads me to believe
               | there absolutely ARE people who absolutely know how the
               | LLMs work
               | 
               | Just like alchemists made enormous strides in chemistry,
               | but their goal was to turn piss into gold.
        
             | adamc wrote:
             | The "Wait" vs. "Hmm" discussion in the paper does not
             | suggest we know how they work. If we _knew_ , we wouldn't
             | have to try things and measure to figure out the best
             | prompt.
        
           | pertymcpert wrote:
           | For quantization I don't think that's really true.
           | Quantization is just making more efficient use of bits in
           | memory to represent numbers.
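           | 
           | A toy illustration of that, assuming simple symmetric int8
           | quantization of a weight matrix with NumPy; the scheme is
           | illustrative, not any particular library's implementation:
           | 
           |   # Toy symmetric int8 quantization: store weights as 8-bit ints
           |   # plus one float scale, reconstruct approximate floats at use time.
           |   import numpy as np
           | 
           |   def quantize_int8(w):
           |       scale = np.abs(w).max() / 127.0   # largest weight maps to +/-127
           |       q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
           |       return q, scale
           | 
           |   def dequantize(q, scale):
           |       return q.astype(np.float32) * scale   # approximate original weights
           | 
           |   w = np.random.randn(4, 4).astype(np.float32)
           |   q, s = quantize_int8(w)
           |   print(np.abs(w - dequantize(q, s)).max())  # small reconstruction error
           | 
           | The surprise is how little model quality such lossy rounding costs
           | in practice.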
        
           | ZeljkoS wrote:
           | We have a partial understanding of why distillation works--it
           | is explained by The Lottery Ticket Hypothesis
           | (https://arxiv.org/abs/1803.03635). But if I am understanding
           | correctly, that doesn't mean you can train a smaller network
           | from scratch. You need a lot of randomness in the initial
           | large network, for some neurons to have "winning" states.
           | Then you can distill those winning subsystems to a smaller
           | network.
           | 
           | Note that a similar process happens in the human brain; it is
           | called Synaptic pruning
           | (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant
           | quote from Wikipedia
           | (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has
           | been estimated that the brain of a three-year-old child has
           | about 10^15 synapses (1 quadrillion). This number declines
           | with age, stabilizing by adulthood. Estimates vary for an
           | adult, ranging from 10^14 to 5x10^14 synapses (100 to 500
           | trillion)."
        
             | 3abiton wrote:
             | So more 'mature' models might arise in the near future with
             | less params and better benchmarks?
        
               | raducu wrote:
               | "Better", but not better than the model they were
               | distilled from, at least that's how I understand it.
        
               | salemba wrote:
               | I think this is how the "child brain" works too. The
               | better the parents and the environment are, the better
               | the child develops :)
        
               | andreasmetsala wrote:
               | They might also be more biased and less able to adapt to
               | new technology. Interesting times.
        
               | coder543 wrote:
               | That's been happening consistently for over a year now.
               | Small models today are better than big models from a year
               | or two ago.
        
             | Arthur_ODC wrote:
             | So, can a distilled 8B model (say, the Deepseek-R1-Distil-
             | Llama-8B or whatever) be "trained up" to a higher-parameter
             | 16B model after distillation from a superior
             | model, or is it forever stuck at the 8B parameters that can
             | just be fine tuned?
        
           | MR4D wrote:
           | I like the analogy of compression, in that a distilled model
           | of an LLM is like a JPEG of a photo. Pretty good, maybe very
           | good, but still lossy.
           | 
           | The question I hear you raising seems to be along the lines
           | of, can we use a new compression method to get better
           | resolution (reproducibility of the original) in a much
           | smaller size.
        
             | umeshunni wrote:
             | > in that a distilled model of an LLM is like a JPEG of a
             | photo
             | 
             | That's an interesting analogy, because I've always thought
             | of the hidden states (and weights and biases) of an LLMs as
             | a compressed version of the training data.
        
               | kedarkhand wrote:
               | Well, JPEG can be thought of as a compression of the
               | natural world whose photograph was taken.
        
               | bloomingkales wrote:
               | And we can answer the question why quantization works
               | with a lossy format, since quantization just drops
               | accuracy for space but still gives us a _good enough_
               | output, just like a lossy jpeg.
               | 
               | Reiterating again, we can lose a lot of data (have
               | incomplete data) and have a perfectly visible jpeg (or
               | MP3, same thing).
        
               | homarp wrote:
               | hence https://www.newyorker.com/tech/annals-of-
               | technology/chatgpt-... (by Ted Chiang)
               | 
               | (discussed here:
               | https://news.ycombinator.com/item?id=34724477 )
        
               | timschmidt wrote:
               | And what is compression but finding the minimum amount of
               | information required to reproduce a phenomenon? I.e.
               | discovering natural laws.
        
               | t_mann wrote:
               | Finding minimum complexity explanations isn't what
               | finding natural laws is about, I'd say. It's considered
               | good practice (Occam's razor), but it's often not really
               | clear what the minimal model is, especially when a theory
               | is relatively new. That doesn't prevent it from being a
               | natural law, the key criterion is predictability of
               | natural phenomena, imho. To give an example, one could
               | argue that Lagrangian mechanics requires a smaller set of
               | first principles than Newtonian, but Newton's laws are
               | still very much considered natural laws.
        
               | timschmidt wrote:
               | Maybe I'm just a filthy computationalist, but the way I
               | see it, the most accurate model of the universe is the
               | one which makes the most accurate predictions with the
               | fewest parameters.
               | 
               | The Newtonian model makes provably less accurate
               | predictions than Einsteinian (yes, I'm using a different
               | example), so while still useful in many contexts where
               | accuracy is less important, the number of parameters it
               | requires doesn't much matter when looking for the one
               | true GUT.
               | 
               | My understanding, again as a filthy computationalist, is
               | that an accurate model of the real bonafide underlying
               | architecture of the universe will be the simplest
               | possible way to accurately predict anything. With the
               | word "accurately" doing all the lifting.
               | 
               | As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/
               | Nature%20of%20...
               | 
               | I'm sure there are decreasingly accurate, but still
               | useful, models all the way up the computational
               | complexity hierarchy. Lossy compression is, precisely,
               | using one of them.
        
               | t_mann wrote:
               | The thing is, Lagrangian mechanics makes exactly the same
               | predictions as Newtonian, and it starts from a
               | foundation of just one principle (least action) instead
               | of three laws, so it's arguably a sparser theory. It just
               | makes calculations easier, especially for more complex
               | systems, that's its raison d'etre. So in a world where we
               | don't know about relativity yet, both make the best
               | predictions we know (and they always agree), but Newton's
               | laws were discovered earlier. Do they suddenly stop being
               | natural laws once Lagrangian mechanics is discovered?
               | Standard physics curricula would not agree with you btw,
               | they practically always teach Newtonian mechanics first
               | and Lagrangian later, also because the latter is
               | mathematically more involved.
        
               | timschmidt wrote:
               | > Do they suddenly stop being natural laws once
               | Lagrangian mechanics is discovered?
               | 
               | Not my question to answer, I think that lies in
               | philosophical questions about what is a "law".
               | 
               | I see useful abstractions all the way down. The linked
               | Asimov essay covers this nicely.
        
               | dragonwriter wrote:
               | Laws (in science, not government) are just a relationship
               | that is consistently observed, so Newton's laws remain
               | laws until contradictions were observed, regardless of
               | the existence of or more alternative models which would
               | predict them to hold.
               | 
               | The kind of Occam's Razor-ish rule you seem to be trying
               | to query about is basically a rule of thumb for selecting
               | among formulations of equal observed predictive power
               | that are not strictly equivalent (that is, if they
               | predict exactly the same actually observed phenomenon
               | instead of different subsets of subjectively equal
               | importance, they still differ in predictions which have
               | not been testable), whereas Newtonian and Lagrangian
               | mechanics are different formulations that are strictly
               | equivalent, which means you may choose between them _for
               | pedagogy or practical computation_ , but you can't choose
               | between them for _truth_ because the truth of one implies
               | the truth of the other, in either direction; they are the
               | exactly the same in sibstance, differing only in
               | presentation.
               | 
               | (And even where it applies, its just a rule of thumb to
               | reject complications until they are observed to be
               | necessary.)
        
             | ziofill wrote:
             | What you say makes sense, but is there the possibility that
             | because it's compressed it can generalize more? In the
             | spirit of bias/variance.
        
             | cmgriffing wrote:
             | This brings up an interesting thought too. A photo is just
             | a lossy representation of the real world.
             | 
             | So it's lossy all the way down with LLMs, too.
             | 
             | Reality > Data created by a human > LLM > Distilled LLM
        
           | cztomsik wrote:
           | Nope, it's quite obvious why distillation works. If you just
           | predict next token, then the only information you can use to
           | compute the loss is THE expected token. Whereas if you
           | distill, you can also use (typically few) logits from the
           | teacher.
           | 
           | "My name is <?>" without distillation has only one valid
           | answer (from the dataset) and everything else is wrong.
           | 
           | Whereas with distillation, you get lots of other names too
           | (from the teacher), and you can add some weight to them too.
           | That way, model learns faster, because it gets more
           | information in each update.
           | 
           | (So instead of "My name is Foo", the model learns "My name is
           | <some name, but in this case Foo>")
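           | 
           | A minimal sketch of that training signal in PyTorch: the student
           | is matched against the teacher's softened distribution over next
           | tokens in addition to the one-hot dataset label; the temperature
           | and mixing weight are illustrative choices, not a specific recipe:
           | 
           |   # Toy distillation loss: KL against the teacher's softened logits
           |   # plus ordinary cross-entropy against the dataset's next token.
           |   import torch.nn.functional as F
           | 
           |   def distillation_loss(student_logits, teacher_logits, target_ids,
           |                         temperature=2.0, alpha=0.5):
           |       soft = F.kl_div(
           |           F.log_softmax(student_logits / temperature, dim=-1),
           |           F.softmax(teacher_logits / temperature, dim=-1),
           |           reduction="batchmean",
           |       ) * temperature ** 2
           |       hard = F.cross_entropy(
           |           student_logits.view(-1, student_logits.size(-1)),
           |           target_ids.view(-1),
           |       )
           |       return alpha * soft + (1 - alpha) * hard
           | 
           | Each update then carries a whole distribution of plausible names
           | for "My name is <?>", not just the single one in the dataset.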
        
         | ascorbic wrote:
         | I've noticed that R1 says "Wait," a lot in its reasoning. I
         | wonder if there's something inherently special in that token.
        
           | lionkor wrote:
           | Semantically, wait is a bit of a stop-and-breathe point.
           | 
           | Consider the text:
           | 
           | I think I'll go swimming today. Wait, ___
           | 
           | what comes next? Well, not something that would usually
           | follow without the word "wait", probably something entirely
           | orthogonal that impacts the earlier sentence in some
           | fundamental way, like:
           | 
           | Wait, I need to help my dad.
        
             | ascorbic wrote:
             | Yes, R1 seems to mostly use it like that. It's either to
             | signal a problem with its previous reasoning, or if it's
             | thought of a better approach. In coding it's often
             | something like "this API won't work here" or "there's a
             | simpler way to do this".
        
           | katzenversteher wrote:
           | I bet a token like "sh*t!", "f**" or "damn!" would have the
           | same or even stronger effect, but the LLM creators would not
           | like to have the users read them.
        
             | lodovic wrote:
             | I think you're onto something, however, as the training is
             | done on text and not actual thoughts, it may take
             | some experimentation to find these stronger words.
        
             | ascorbic wrote:
             | Maybe, but it doesn't just use it to signify that it's made
             | a mistake. It also uses it in a positive way, such as it's
             | had a lightbulb moment. Of course some people use
             | expletives in the same way, but that would be less common
             | than for mistakes.
        
             | raducu wrote:
             | It's literally in the article: they measured it, and "Wait"
             | was the best token.
        
         | cyanydeez wrote:
         | it's fascinating how certain political movements avoid that Wait
         | moment...
        
         | kevin009 wrote:
         | There are more than 10 different ways that I know for sure will
         | improve LLMs just like `wait`. It is part of the CoT. I assume
         | most researchers know this. CoT is as old as 2019.
        
           | Melatonic wrote:
           | Mind elaborating ?
        
           | kristianp wrote:
           | Chain of thought (CoT)?
        
         | lostmsu wrote:
         | Hm, I am surprised that people who are presumably knowledgeable
         | about how attention works are surprised by this. The more tokens
         | in the output, the more computation the model is able to do
         | overall. Back in September, when I was testing my iOS hands-
         | free voice AI prototype that was powered by 8B LLM, when I
         | wanted it to give really thoughtful answers to philosophical
         | questions, I would instruct it to output several hundred
         | whitespace characters (because they are not read aloud) before
         | the actual answer.
         | 
         | What I am more surprised about is why models actually seem to
         | have to produce "internal thoughts" instead of random tokens.
         | Maybe during training, having completely random tokens in the
         | thinking section derailed the model's thought process in the
         | same way background noise can derail ours?
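         | 
         | For what it's worth, the padding trick was roughly the
         | following (the exact instruction wording and the 300-character
         | figure are made up for illustration, not the original prompt):
         | 
         |     SYSTEM_PROMPT = (
         |         "Before answering, output at least 300 space "
         |         "characters. They will not be read aloud. Then "
         |         "give your final answer."
         |     )
         | 
         |     def to_speech(reply: str) -> str:
         |         # strip the silent padding before text-to-speech
         |         return reply.lstrip()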
        
         | deadbabe wrote:
         | I mean the "wait" thing is obvious if you've ever asked an LLM
         | to look at its own response and ask if it's really sure about
         | its answer.
        
         | rgovostes wrote:
         | > a branch of computer science
         | 
         | It should be considered a distinct field. At some level there
         | is overlap (information theory, Kolmogorov complexity, etc.),
         | but prompt optimization and model distillation are far removed
         | from computability, formal language theory, etc. The analytical
         | methods, the techniques to create new architectures, etc. are
         | very different beasts.
        
           | BobbyTables2 wrote:
           | Almost seems more like computer engineering. Is it really
           | that different than signal/image processing?
           | 
           | I suspect CS departments don't want to concede because they
           | are now in the limelight...
        
           | maginx wrote:
           | I agree - I don't know what field it formally is, but
           | computer science it is not. It is also related to information
           | retrieval aka "Google skills", problem presentation, 'theory
           | of mind', even management and psychology. I'm saying the
           | latter because people often ridicule AI responses for giving
           | bad answers that are 'too AI'. But often it is simply because
           | not enough context-specific information was given to allow
           | the AI to give a more personalized response. One should
           | compare the response to "If I had asked a random person on
           | the internet this query, what might I have gotten". If you
           | write "The response should be written as a <insert
           | characteristics, context, whatever you feel is relevant>" it
           | will deliver a much less AI. This is just as much about how
           | you pose a problem in general, as it is about computer
           | science.
        
         | BobbyTables2 wrote:
         | May sound like a conspiracy theory, but NVIDIA and a whole lot
         | of AI startups have a strong vested interest in not seeking or
         | publishing such findings.
         | 
         | If I don't need a huge model and GPU, then AI is little more
         | than an open source program running on an idle PC.
         | 
         | I feel like AI was NVIDIA's lifeboat as GPU mining waned. Don't
         | see anything after that in the near future.
        
           | philipswood wrote:
           | I think NVIDIAs future is pretty bright.
           | 
           | We're getting to the run-your-capable-LLM on-prem or at-home
           | territory.
           | 
           | Without DeepSeek (and hopefully its successors) I wouldn't
           | really have a usecase for something like NVIDIAs Project
           | Digits.
           | 
           | https://www.nvidia.com/en-us/project-digits/
        
             | Arn_Thor wrote:
             | Except I can run R1 1.5b on a GPU-less and NPU-less Intel
             | NUC from four-five years ago using half its cores and the
             | reply speed is...functional.
             | 
             | As the models have gotten more efficient and distillation
             | better, the minimum viable hardware for really cooking with
             | LLMs has gone from a 4090 to something a lot of people
             | probably already own.
             | 
             | I definitely think a Digits box would be nice, but honestly
             | I'm not sure I'll need one.
        
               | nickthegreek wrote:
               | R1 1.5b won't do what most people want at all.
        
               | Arn_Thor wrote:
               | No, it won't. But that's not the point I was making
        
         | tomaskafka wrote:
         | One thing to realize is that we as humans have thinking steps
         | (an internal monologue) before we output text. When LLMs
         | produce text, we expect this thinking process to happen as
         | well, but it does not - they are 'idiots that babble the first
         | thing that comes to their minds'.
         | 
         | The above 'hack' is one of many realizations of the above
         | differences.
        
         | codeulike wrote:
         | Wait, so the trick is they reach into the context and basically
         | switch '</think>' with 'wait' and that makes it carry on
         | thinking?
        
           | gield wrote:
           | Yes, that's explicitly mentioned in the blog post:
           | 
           | >In s1, when the LLM tries to stop thinking with "</think>",
           | they force it to keep going by replacing it with "Wait".
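           | 
           | A minimal sketch of that loop (the generate() callable, the
           | 512-token budget, and the crude token count are illustrative
           | assumptions, not the paper's actual code):
           | 
           |     END_THINK = "</think>"
           | 
           |     def forced_trace(generate, prompt, min_tokens=512):
           |         # generate(text, stop) -> continuation up to
           |         # the stop string, from whatever inference
           |         # stack you already use
           |         trace = ""
           |         while len(trace.split()) < min_tokens:
           |             trace += generate(prompt + trace,
           |                               stop=END_THINK)
           |             trace += " Wait"  # suppress stop, go on
           |         return trace + END_THINK  # budget spent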
        
           | luc4sdreyer wrote:
           | Yes, that's one of the tricks.
        
           | danans wrote:
           | Not sure if your pun was intended, but 'wait' probably works
           | so well because of the models being trained on text
           | structured like your comment, where "wait" is followed by a
           | deeper understanding.
        
         | ozgune wrote:
         | Agreed. Here are three things that I find surreal about the s1
         | paper.
         | 
         | (1) The abstract changed how I thought about this domain
         | (advanced reasoning models). The only other paper that did that
         | for me was the "Memory Resource Management in VMware ESX
         | Server". And that paper got published 23 years ago.
         | 
         | (2) The model, data, and code are open source at
         | https://github.com/simplescaling/s1. With this, you can start
         | training your own advanced reasoning models. All you need is a
         | thousand well-curated questions with reasoning steps.
         | 
         | (3) More than half the references in the paper are from 2024
         | and Jan 2025. Just look at the paper's first page.
         | https://arxiv.org/pdf/2501.19393 In which other field do you
         | see this?
        
           | pradn wrote:
           | Omg, another fan of "Memory Resource Management in VMware ESX
           | Server"!! It's one of my favorite papers ever - so clever.
        
         | pradn wrote:
         | I mean is "wait" even the ideal "think more please" phrase?
         | Would you get better results with other phrases like "wait, a
         | second", or "let's double-check everything"? Or domain-
         | dependent, specific instructions for how to do the checking? Or
         | forcing tool-use?
        
       | maksimur wrote:
       | It appears that someone has implemented a similar approach for
       | DeepSeek-R1-Distill-Qwen-1.5B:
       | https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...
       | 
       | I hope it gets tested further.
        
       | nullbyte wrote:
       | Great article! I enjoyed reading it
        
       | khazhoux wrote:
       | I have a bunch of questions, would love for anyone to explain
       | these basics:
       | 
       | * The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based
       | on very expensive oracles (if we believe DeepSeek-R1 queried
       | OpenAI's model). If these are _improvements_ on existing models,
        | why is this being reported as decimating training costs? Isn't
       | fine-tuning already a cheap way to optimize? (maybe not as
       | effective, but still)
       | 
       | * The R1 paper talks about improving one simple game - Countdown.
       | But the original models are "magic" because they can solve a
       | nearly uncountable number of problems and scenarios. How does the
       | DeepSeek / R1 approach scale to the same gigantic scale?
       | 
       | * Phrased another way, my understanding is that these techniques
       | are using existing models as black-box oracles. If so, how many
       | millions/billions/trillions of queries must be probed to
       | replicate and improve the original dataset?
       | 
       | * Is anything known about the training datasets used by DeepSeek?
       | OpenAI used presumably every scraped dataset they could get their
       | hands on. Did DS do the same?
        
         | UncleEntity wrote:
         | > If these are improvements on existing models, why is this
         | being reported as decimating training costs?
         | 
         | Because that's what gets the clicks...
         | 
         | Saying they spent a boatload of money on the initial training +
         | iteration + final fine-tuning isn't as headline grabbing as "$5
         | million trained AI beats the pants off the 'mericans".
        
         | torginus wrote:
         | If what you say is true, and distilling LLMs is easy and cheap,
         | and pushing the SOTA without a better model to rely on is dang
         | hard and expensive, then that means the economics of LLM
         | development might not be attractive to investors - spending
         | billions to have your competitors come out with products that
         | are 99% as good, and cost them pennies to train, does not sound
         | like a good business strategy.
        
           | khazhoux wrote:
           | What I still don't understand is how one slurps out an entire
           | model (closed source) though.
           | 
           | Does the deepseek paper actually say what model it's trained
           | off of, or do they claim the entire thing is from scratch?
        
             | ForHackernews wrote:
             | AFAIK DeepSeek have not publicly acknowledged training
             | their model on OpenAI output - the OpenAI people have
             | alleged that they did.
             | 
             | At any rate, I don't think distillation involves 'slurping
             | out' the whole model, as I understand it, it means
             | providing the other model's output as training data input
             | to create your new model. Maybe analogous to an expert
             | teaching a novice how to do something by providing
             | carefully selected examples, without having to expose the
             | novice to all the blind alleys the expert went down to
             | achieve mastery.
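             | 
             | At the data level it can be as mundane as this toy sketch
             | (ask_teacher() is a placeholder for whichever API or local
             | model gets queried; the file name is arbitrary):
             | 
             |     import json
             | 
             |     def build_distill_set(prompts, ask_teacher):
             |         # the teacher's answers become ordinary
             |         # supervised pairs for the student model
             |         with open("distill.jsonl", "w") as f:
             |             for p in prompts:
             |                 row = {"prompt": p,
             |                        "response": ask_teacher(p)}
             |                 f.write(json.dumps(row) + "\n")
             | 
             | The student is then fine-tuned on that file like any other
             | supervised data; no weights ever leave the teacher.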
        
       | janalsncm wrote:
       | > even the smartest people make hundreds of tiny experiments
       | 
       | This is the most important point, and why DeepSeek's cheaper
       | training matters.
       | 
       | And if you check the R1 paper, they have a section for "things
       | that didn't work", each of which would normally be a paper of its
       | own but because their training was so cheap and streamlined they
       | could try a bunch of things.
        
       | robrenaud wrote:
       | > "Note that this s1 dataset is distillation. Every example is a
       | thought trace generated by another model, Qwen2.5"
       | 
       | The traces are generated by Gemini Flash Thinking.
       | 
       | 8 hours of H100 is probably more like $24 if you want any kind of
       | reliability, rather than $6.
        
         | zaptrem wrote:
         | "You can train a SOTA LLM for $0.50" (as long as you're
         | distilling a model that cost $500m into another pretrained
         | model that cost $5m)
        
           | fizx wrote:
           | That's absolutely fantastic, because if you have 1 good idea
           | that's additive to the SOTA, you can test it for a dollar,
           | not millions
        
           | knutzui wrote:
           | The original statement stands, if what you are suggesting in
           | addition to it is true. If the initial one-time investment of
           | $505m is enough to distill new SOTA models for $0.50 apiece,
           | then the average cost for subsequent models will trend toward
           | $0.50.
        
       | nico wrote:
       | > Why did it cost only $6? Because they used a small model and
       | hardly any data.
       | 
       | > After sifting their dataset of 56K examples down to just the
       | best 1K, they found that the core 1K is all that's needed to
       | achieve o1-preview performance on a 32B model. Adding data didn't
       | raise performance at all.
       | 
       | > 32B is a small model, I can run that on my laptop. They used 16
       | NVIDIA H100s for 26 minutes per training run, that equates to
       | around $6.
        
       | nico wrote:
       | > In s1, when the LLM tries to stop thinking with "</think>",
       | they force it to keep going by replacing it with "Wait". It'll
       | then begin to second guess and double check its answer. They do
       | this to trim or extend thinking time (trimming is just abruptly
       | inserting "</think>")
       | 
       | I know some are really opposed to anthropomorphizing here, but
        | this feels eerily similar to the way humans work, i.e. if you just
       | dedicate more time to analyzing and thinking about the task, you
       | are more likely to find a better solution
       | 
       | It also feels analogous to navigating a tree, the more time you
       | have to explore the nodes, the bigger the space you'll have
       | covered, hence higher chance of getting a more optimal solution
       | 
       | At the same time, if you have "better intuition" (better
       | training?), you might be able to find a good solution faster,
       | without needing to think too much about it
        
         | layer8 wrote:
         | What's missing in that analogy is that humans tend to have a
         | good hunch about when they have to think more and when they are
         | "done". LLMs seem to be missing a mechanism for that kind of
         | awareness.
        
           | nico wrote:
           | Great observation. Maybe an additional "routing model" could
           | be trained to predict when it's better to think more vs just
           | using the current result
        
           | sanxiyn wrote:
           | LLMs actually do have such hunch, they just don't utilize it.
           | You can literally ask them "Would you do better if you
           | started over?" and start over if answer is yes. This works.
           | 
           | https://arxiv.org/abs/2410.02725
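           | 
           | Roughly this, with ask() standing in for whatever chat call
           | you already have (the function name and retry cap are
           | assumptions, not from the paper):
           | 
           |     def answer_with_retry(ask, question, retries=2):
           |         answer = ask(question)
           |         for _ in range(retries):
           |             verdict = ask(
           |                 f"Question: {question}\n"
           |                 f"Your answer: {answer}\n"
           |                 "Would you do better if you "
           |                 "started over? Reply yes or no.")
           |             said = verdict.strip().lower()
           |             if not said.startswith("yes"):
           |                 break
           |             answer = ask(question)  # start over
           |         return answer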
        
       | janalsncm wrote:
       | I think a lot of people in the ML community were excited for Noam
       | Brown to lead the O series at OpenAI because intuitively, a lot
       | of reasoning problems are highly nonlinear i.e. they have a tree-
       | like structure. So some kind of MCTS would work well. O1/O3 don't
       | seem to use this, and DeepSeek explicitly mentioned difficulties
       | training such a model.
       | 
       | However, I think this is coming. DeepSeek mentioned it was hard
       | to learn a value model for MCTS from scratch, but this doesn't
       | mean we couldn't seed it with some annotated data.
        
         | insane-c0der wrote:
         | Do you have a reference for us to check? - "DeepSeek explicitly
         | mentioned difficulties training such a model."
        
           | janalsncm wrote:
           | Section 4.2: Unsuccessful attempts
           | 
           | https://arxiv.org/pdf/2501.12948
        
       | talles wrote:
        | Anyone else want more articles on how those benchmarks are
       | created and how they work?
       | 
        | Those models can be trained in a way tailored to have good
        | results on specific benchmarks, making them way less general
        | than they seem. No accusation from me, but I'm skeptical of all
        | the recent so-called 'breakthroughs'.
        
       | charlieyu1 wrote:
       | > having 10,000 H100s just means that you can do 625 times more
       | experiments than s1 did
       | 
        | The larger the organisation, the fewer experiments you can afford
        | to do. Employees are mostly incentivised to get something done
        | quickly enough to not be fired in this job market. They know that
        | the higher-ups would let them go for temporary gains. Rush this
        | deadline, ship that feature, produce something that looks OK
        | enough.
        
       | mmoustafa wrote:
        | Love the look under the hood! Especially discovering that some AI
        | hack I came up with is how the labs are doing things too.
       | 
       | In this case, I was also forcing R1 to continue thinking by
       | replacing </think> with "Okay," after augmenting reasoning with
       | web search results.
       | 
       | https://x.com/0xmmo/status/1886296693995646989
        
       | ConanRus wrote:
       | Wait
        
       | bxtt wrote:
        | CoT is a widely known technique - what became truly novel was the
        | level of training that embeds CoT via RL with an optimal reward
        | trajectory. DeepSeek took it further due to their compute
       | restriction to find memory, bandwidth, parallelism optimizations
       | in every part (GRPO - reducing memory copies, DualPipe for data
       | batch parallelism between memory & compute, kernel bypasses (PTX
       | level optimization), etc.) - then even using MoE due to sparse
       | activation and further distillation. They operated on the power
       | scaling laws of parameters & tokens but high quality data
       | circumvents this. I'm not surprised they utilized synthetic
       | generation from OpenAI or copied the premise of CoT, but where
       | they should get the most credit is their infra level & software
       | level optimizations.
       | 
       | With that being said, I don't think the benchmarks we currently
       | have are strong enough and the next frontier models are yet to
        | come. I'm sure at this point U.S. LLM research firms understand
        | their lack of infra/hardware optimizations (they just threw
        | compute at the problem) and will begin paying closer attention.
        | Now their RL-level and parent-model training will become even
        | stronger, while the newly freed resources can go toward sub-
        | optimizations that have traditionally been avoided due to
        | computational overhead.
        
       | cadamsdotcom wrote:
       | Maybe this is why OpenAI hides o1/o3 reasoning tokens -
       | constraining output at inference time seems to be easy to
       | implement for other models and others would immediately start
       | their photocopiers.
       | 
       | It also gave them a few months to recoup costs!
        
       | mangoman wrote:
       | From the S1 paper:
       | 
       | > Second, we develop budget forcing to control test-time compute
       | by forcefully terminating the model's thinking process or
       | lengthening it by appending "Wait" multiple times to the model's
       | generation when it tries to end
       | 
       | I'm feeling proud of myself that I had the crux of the same idea
       | almost 6 months ago before reasoning models came out (and a bit
       | disappointed that I didn't take this idea further!). Basically
       | during inference time, you have to choose the next token to
       | sample. Usually people just try to sample the distribution using
       | the same sampling rules at each step.... but you don't have to!
        | you can selectively insert words into the LLM's mouth based
       | on what it said previously or what it wants to say, and decide
       | "nah, say this instead". I wrote a library so that you could
       | sample an LLM using llama.cpp in swift and you could write rules
       | to sample tokens and force tokens into the sequence depending on
       | what was sampled.
       | https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...
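        | 
        | In pseudo-Python (the real thing is Swift on top of llama.cpp,
        | so sample_next() and the shape of the rules here are only
        | illustrative):
        | 
        |     def generate_with_rules(sample_next, prompt, rules,
        |                             max_tokens=256):
        |         text = prompt
        |         for _ in range(max_tokens):
        |             forced = None
        |             for rule in rules:
        |                 forced = rule(text)  # forced text or None
        |                 if forced:
        |                     break
        |             text += forced or sample_next(text)
        |         return text
        | 
        |     # example rule: when the model starts the usual cheery
        |     # reply, put gloomier words in its mouth
        |     def gloomy(text):
        |         if text.endswith("I'm doing"):
        |             return " poorly, and I'd rather not help."
        |         return None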
       | 
       | Here, I wrote a test that asks Phi-3 instruct "how are you" and
        | if it tried to say "as an AI I don't have feelings" or "I'm
        | doing ", I forced it to say "I'm doing poorly" and refuse to help
       | since it was always so dang positive. It sorta worked, though the
       | instruction tuned models REALLY want to help. But at the time I
       | just didn't have a great use case for it - I had thought about a
       | more conditional extension to llama.cpp's grammar sampling (you
       | could imagine changing the grammar based on previously sampled
       | text), or even just making it go down certain paths, but I just
       | lost steam because I couldn't describe a killer use case for it.
       | 
       | This is that killer use case! forcing it to think more is such a
       | great usecase for inserting ideas into the LLM's mouth, and I
       | feel like there must be more to this idea to explore.
        
         | jwrallie wrote:
         | So what you mean is that if the current train of thought is
         | going in a direction we find to be not optimal, we could just
          | interrupt it and nudge it in the right direction?
         | 
         | That sounds very useful, albeit a bit different than how
         | current "chat" implementations would work, as in you could
         | control both ways of the conversation.
        
         | latexr wrote:
         | > and a bit disappointed that I didn't take this idea further!
         | 
         | Don't be, that's pretty common.
         | 
         | https://en.wikipedia.org/wiki/Multiple_discovery
        
       | Caitlynmeeks wrote:
       | https://imgflip.com/i/9j833q
       | 
       | (ptheven)
        
       | sheepscreek wrote:
       | LLMs still feel so magical. It's like quantum physics. "I get it"
       | but I don't. Not really. I don't think I ever will. Perhaps a
       | human mind can only comprehend so much.
        
       | shaneofalltrad wrote:
       | Well dang, I am great at tinkering like this because I can't
       | remember things half the time. I wonder if the ADHD QA guy solved
       | this for the devs?
        
       | gorgoiler wrote:
       | This feels just like telling a constraint satisfaction engine to
       | backtrack and find a more optimal route through the graph. We saw
       | this 25 years ago with engines like PROVERB doing directed
       | backtracking, and with adversarial planning when automating
       | competitive games.
       | 
       | Why would you control the inference at the token level? Wouldn't
       | the more obvious (and technically superior) place to control
       | repeat analysis of the optimal path through the search space be
       | in the inference engine itself?
       | 
       | Doing it by saying "Wait" feels like fixing dad's laptop over a
       | phone call. You'll get there, but driving over and getting hands
       | on is a more effective solution. Realistically, I know that
       | getting "hands on" with the underlying inference architecture is
       | way beyond my own technical ability. Maybe it's not even
       | feasible, like trying to fix a cold with brain surgery?
        
         | Nurbek-F wrote:
          | Totally agreed that this is not the solution we are looking
          | for; in fact, it is the only solution we have in our hands
          | right now. It's a good step forward.
        
         | code_biologist wrote:
         | What would a superior control approach be? It's not clear to me
         | how to get an LLM to be an LLM if you're not doing stochastic
         | next token prediction. Given that, the model itself is going to
         | know best how to traverse its own concept space. The R1 chain
         | of thought training encourages and develops exactly that
         | capability. Still, you want that chain of thought to terminate
         | and not navel gaze endlessly.
         | 
         | So how to externally prod it to think more when it does
         | terminate? Replacing thought termination with a linguistic
         | signifier of continued reasoning plus novel realization seems
         | like a charmingly simple, principled, and general approach to
         | continue to traverse concept space.
        
         | rayboy1995 wrote:
         | This is the difference between science and engineering. What
         | they have done is engineering. If the result is 90% of the way
          | there with barely any effort, it's best to move on to something
          | else that may be low-hanging fruit rather than spend time
          | chasing that last 10%.
        
       | stefanoco wrote:
        | Is it me, or are the affiliations totally missing from the cited
        | paper? Looks like the authors come from a mix of UK / US
        | institutions.
        
       | advael wrote:
       | I'm strictly speaking never going to think of model distillation
       | as "stealing." It goes against the spirit of scientific research,
       | and besides every tech company has lost my permission to define
       | what I think of as theft forever
        
         | eru wrote:
         | At most it would be illicit copying.
         | 
         | Though it's poetic justice that OpenAI is complaining about
         | someone else playing fast and loose with copyright rules.
        
           | tomrod wrote:
           | Stochastic decompression. Dass-it.
        
           | downrightmike wrote:
           | The First Amendment is not just about free speech, but also
           | the right to read, the only question is if AI has that right.
        
             | organsnyder wrote:
             | If AI was just reading, there would be much less
             | controversy. It would also be pretty useless. The issue is
             | that AI is creating its own derivative content based on the
             | content it ingests.
        
               | boxcake wrote:
               | Isn't any answer to a question which hasn't been
               | previously answered a derivative work? Or when a human
                | writes a parody of a song, or when a new type of music is
                | influenced by something which came before?
        
               | nrabulinski wrote:
               | This argument is so bizarre to me. Humans create new,
               | spontaneous thoughts. AI doesn't have that. Even if
               | someone's comment is influenced by all the data they have
               | ingested over their lives, their style is distinct and
               | deliberate, to the point where people have been doxxed
               | before/anonymous accounts have been uncovered because
               | someone recognized the writing style. There's no
               | deliberation behind AI, just statistical probabilities.
               | There's no new or spontaneous thoughts, at most
               | pseudorandomness introduced by the author of the model
               | interface.
               | 
               | Even if you give GenAI unlimited time, it will not
               | develop its own writing/drawing/painting style or come up
               | with a novel idea, because strictly by how it works it
                | can only create "new" work by interpolating its dataset.
        
               | Ringz wrote:
               | > Humans create new, spontaneous thoughts.
               | 
               | The compatibility of determinism and freedom of will is
                | still hotly debated. There is a good chance that humans
                | don't "create".
               | 
               | > There's no deliberation behind AI, just statistical
               | probabilities. There's no new or spontaneous thoughts, at
               | most pseudorandomness introduced by the author of the
               | model interface.
               | 
               | You can say exactly the same about deterministic humans
               | since it is often argued that the randomness of
               | thermodynamic or quantum mechanical processes is
               | irrelevant to the question of whether free will is
               | possible. This is justified by the fact that our concept
               | of freedom means a decision that is self-determined by
               | reasons and not a sequence of events determined by
               | chance.
        
             | rig666 wrote:
              | Does my software have the right to read the contents of a
              | DVD and sell my own MP4 of it? No. If a streamer plays a
              | YouTube video on their channel, is the content original?
              | Apparently yes. When GPT-3 was training, people saw it as a
              | positive. When people started asking ChatGPT more things
              | instead of searching sites, it became a negative.
        
           | antimatter15 wrote:
            | It's hardly even illicit - at least in the United States, the
           | output of an AI isn't copyrightable.
        
         | surajrmal wrote:
         | Maybe but something has gotta pay the bills to justify the
         | cutting edge. I guess it's a similar problem to researching
         | medicine.
        
           | ClumsyPilot wrote:
           | Well the artists and writers also want to pay their bills. We
           | threw them under the bus, might as well throw openAI too and
           | get an actual open AI that we can use
        
           | advael wrote:
           | The investment thrown at OpenAI seems deeply inflated for how
           | much meaningful progress they're able to make with it
           | 
           | I think it's clear that innovative breakthroughs in bleeding-
           | edge research are not just a matter of blindly hurling more
           | money at a company to build unprecedentedly expensive
           | datacenters
           | 
           | But also, even if that was a way to do it, I don't think we
           | should be wielding the law to enable privately-held companies
           | to be at the forefront of research, especially in such a
           | grossly inconsistent manner
        
       | svara wrote:
       | It just occurred to me that if you squint a little (just a
       | little!) the S1 paper just provided the scientific explanation
       | for why Twitter's short tweets mess you up and books are good for
       | you.
       | 
       | Kidding, but not really. It's fascinating how we seem to be
       | seeing a gradual convergence of machine learning and psychology.
        
       | mig1 wrote:
       | This argument that the data centers and all the GPUs will be
       | useful even in the context of Deepseek doesn't add up...
       | basically they showed that it's diminishing returns after a
       | certain amount. And so far it didn't make OpenAI or Anthropic go
       | faster, did it?
        
         | rayboy1995 wrote:
         | What is the source for the diminishing returns? I would like to
         | read about it as I have only seen papers referring to the
         | scaling law still applying.
        
       | adamc wrote:
       | I found it interesting but the "Wait" vs. "Hmm" bit just made me
       | think we don't really understand our own models here. I mean,
       | sure, it's great that they measured and found something better,
       | but it's kind of disturbing that you have to guess.
        
       | leopoldj wrote:
       | >it can run on my laptop
       | 
       | Has anyone run it on a laptop (unquantized)? Disk size of the 32B
       | model appears to be 80GB. Update: I'm using a 40GB A100 GPU.
       | Loading the model took 30GB vRAM. I asked a simple question "How
       | many r in raspberry". After 5 minutes nothing got generated
       | beyond the prompt. I'm not sure how the author ran this on a
       | laptop.
        
         | coder543 wrote:
         | 32B models are easy to run on 24GB of RAM at a 4-bit quant.
         | 
         | It sounds like you need to play with some of the existing 32B
         | models with better documentation on how to run them if you're
         | having trouble, but it is entirely plausible to run this on a
         | laptop.
         | 
         | I can run Qwen2.5-Instruct-32B-q4_K_M at 22 tokens per second
         | on just an RTX 3090.
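          | 
          | For example, with llama-cpp-python and any ~4-bit GGUF of a
          | 32B model (the file name below is a placeholder for whichever
          | quant you downloaded):
          | 
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="qwen2.5-32b-q4_k_m.gguf",  # placeholder
          |         n_gpu_layers=-1,  # offload as much as fits
          |         n_ctx=4096,
          |     )
          |     out = llm("How many r's are in 'raspberry'?",
          |               max_tokens=128)
          |     print(out["choices"][0]["text"])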
        
           | leopoldj wrote:
           | My question was about running it unquantized. The author of
           | the article didn't say how he ran it. If he quantized it then
              | saying he ran it on a laptop is not news.
        
             | coder543 wrote:
             | I can't imagine why anyone would run it unquantized, but
              | there are some laptops with more than 70GB of RAM that
             | would be required. It's not that it can't be done... it's
             | just that quantizing to at least 8-bit seems to be standard
             | practice these days, and DeepSeek has shown that it's even
             | worth _training_ at 8-bit resolution.
        
       | mountainriver wrote:
       | > They used 16 NVIDIA H100s for 26 minutes per training run, that
       | equates to around $6
       | 
        | Running where? H100s are usually over $2/hr; that's closer to $25.
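        | 
        | The arithmetic (the $2 and $3.50 hourly rates are just
        | illustrative market figures, not from the article):
        | 
        |     gpus, minutes = 16, 26
        |     gpu_hours = gpus * minutes / 60  # ~6.9 GPU-hours
        |     print(6.0 / gpu_hours)   # ~$0.87/hr implied by "$6"
        |     print(gpu_hours * 2.00)  # ~$13.9 at $2/GPU-hour
        |     print(gpu_hours * 3.50)  # ~$24.3 at $3.50/GPU-hour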
        
       ___________________________________________________________________
       (page generated 2025-02-06 23:01 UTC)