[HN Gopher] S1: A $6 R1 competitor?
___________________________________________________________________
S1: A $6 R1 competitor?
Author : tkellogg
Score : 808 points
Date : 2025-02-05 11:05 UTC (1 day ago)
(HTM) web link (timkellogg.me)
(TXT) w3m dump (timkellogg.me)
| bberenberg wrote:
| In case you're not sure what S1 is, here is the original paper:
| https://arxiv.org/html/2501.19393v1
| mi_lk wrote:
| it's also the first link in the article's first sentence
| bberenberg wrote:
| Good call, I must have missed it. I read the whole blog then
| went searching for what S1 was.
| addandsubtract wrote:
| It's linked in the blog post, too. In the first sentence,
| actually, but for some reason the author never bothered to
| attach the name to it. As if keeping track of o1, 4o, r1, r2d2,
| wasn't exhausting enough already.
| kgwgk wrote:
| > for some reason the author never bothered to attach the
| name to it
|
| Respect for his readers' intelligence, maybe.
| rahimnathwani wrote:
| To enforce a minimum, we suppress the generation of the end-of-
| thinking token delimiter and optionally append the string
| "Wait" to the model's current reasoning trace to encourage the
| model to reflect on its current generation.
|
| Does this mean that the end-of-thinking delimiter is a single
| token? Presumably </think> or similar wasn't a single token for
| the base model. Did they just pick a pair of uncommon single-
| token symbols to use as delimiters?
|
| EDIT: Never mind, end of thinking is represented with
| <|im_start|> followed by the word 'answer', so the code
| dynamically adds/removes <|im_start|> from the list of stop
| tokens.
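|
| In pseudo-Python the mechanism reads roughly like this (a
| sketch built around a hypothetical generate() helper, not the
| actual s1 code):
|
|     # Enforce a minimum amount of "thinking" by treating the
|     # end-of-thinking delimiter as a stop string and appending
|     # "Wait" instead of letting the model emit it.
|     # generate(prompt, stop) is a hypothetical sampler that
|     # halts just before producing any of the stop strings.
|     THINK_END = "<|im_start|>"   # followed by "answer" in s1
|
|     def reason(prompt, min_waits=2):
|         trace = prompt + "<|im_start|>think\n"
|         for _ in range(min_waits):
|             trace += generate(trace, stop=[THINK_END])
|             trace += "Wait"      # nudge the model to reflect
|         # budget spent: allow the delimiter and finish normally
|         return trace + generate(trace, stop=[])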
| dagurp wrote:
| I don't know what R1 is either
| latexr wrote:
| It's the DeepSeek reasoning model.
| ttyprintk wrote:
| https://huggingface.co/simplescaling
| anentropic wrote:
| and: https://github.com/simplescaling/s1
| mettamage wrote:
| When you're only used to ollama, how do you go about using
| this model?
| davely wrote:
| I think we need to wait for someone to convert it into a
| GGUF file format.
|
| However, once that happens, you can run it (and any GGUF
| model) from Hugging Face![0]
|
| [0] https://huggingface.co/docs/hub/en/ollama
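|
| Rough sketch of what that looks like once a conversion
| exists, assuming the ollama Python client and whatever
| repo/quant tag actually shows up on Hugging Face:
|
|     # pip install ollama; pulls a GGUF repo straight from HF
|     import ollama
|
|     resp = ollama.chat(
|         model="hf.co/brittlewis12/s1-32B-GGUF",
|         messages=[{"role": "user", "content": "hello!"}],
|     )
|     print(resp["message"]["content"])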
| mettamage wrote:
| So this?
|
| https://huggingface.co/brittlewis12/s1-32B-GGUF
| withinboredom wrote:
| oh god, this is terrible!
|
| I just said "Hello!" and it went off the rails.
| delijati wrote:
| why how what? can you add a sample prompt with output ?
| withinboredom wrote:
| In my original prompt, it gave me back multiple greetings
| in various languages. In this example, it gave me back
| just multiple greetings:
|
| >>> hello!
|
| Hello there! I'm here to help with any questions you
| might have. How can I assist you today? If you're asking
| a question, feel free to provide some details about what
| you need help with. I'll do my best to answer your
| question or guide you through the process. Just let me
| know!
|
| <|im_start|>think
|
| The user has asked "hello! " and provided no further
| information. Answer here is a direct response to the
| user's request for information, which was essentially
| just a greeting. The appropriate response is to greet
| back and offer assistance.
|
| <|im_start|>answer
|
| Answer:
|
| Hello! How can I assist you today? If you have a question
| or need help with something specific, feel free to ask.
| I'm here to support you!
|
| <|im_start|>think
|
| <|im_start|>answer
|
| The user's message was "hello!" and there was no other
| information provided. The appropriate response is a
| greeting back and an offer of assistance.
|
| <|im_start|>answer
|
| Answer:
|
| Hello! How can I assist you today? If you have a question
| or need help with something specific, feel free to ask.
| I'm here to support you!
|
| <|im_start|>think
|
| The user has greeted with "hello!" An appropriate
| response involves acknowledging the greeting and offering
| assistance. This could involve asking how we can help
| them or what they'd like guidance on.
|
| <|im_start|>answer
|
| Answer:
|
| Hello! How can I assist you today? If you have a question
| or need help with something specific, feel free to ask.
| I'm here to support you!
| mettamage wrote:
| I ran it, so far it seems like a pretty good model,
| especially locally.
| fl0id wrote:
| You can load the safetensors with ollama, you just have to
| provide a Modelfile, or wait for someone to do it. It will
| in theory also quantize it for you, as I guess most people
| cannot load a 129 GB model...
| yapyap wrote:
| > If you believe that AI development is a prime national security
| advantage, then you absolutely should want even more money poured
| into AI development, to make it go even faster.
|
| This, this is the problem for me with people deep in AI. They
| think it's the end-all be-all for everything. They have the
| vision of the 'AI' they've seen in movies in mind, see the
| current 'AI' being used, and to them it's basically almost the
| same; their brain is mentally bridging the concepts and saying
| it's only a matter of time.
|
| To me, that's stupid. I observe the more populist and socially
| appealing CEOs of these VC startups (Sam Altman being the
| biggest, of course.) just straight up lying to the masses, for
| financial gain, of course.
|
| Real AI, artificial intelligence, is a fever dream. This is
| machine learning except the machines are bigger than ever before.
| There is no intellect.
|
| And the enthusiasm of these people who are into it feeds into
| those who aren't aware of it in the slightest, they see you can
| chat with a 'robot', they hear all this hype from their peers and
| they buy into it. We are social creatures after all.
|
| I think using any of this in a national security setting is
| stupid, wasteful and very, very insecure.
|
| Hell, if you really care about being ahead, pour 500 billion
| dollars into quantum computing so you can try to break current
| encryption. That'll get you so much further than this
| nonsensical BS.
| mnky9800n wrote:
| Also, the narrative that we are currently on the brink of an
| AI explosion, and that this random paper shows it, has been
| the same tired old story handed out by AI hawks for years now.
| Like yes,
| I agree with the general idea that more compute means more
| progress for humans and perhaps having a more responsive user
| interface through some kind of ai type technology would be
| good. But I don't see why that will turn into Data from Star
| Trek. But I also think all these AI hawks kind of
| narcissistically overvalue their own being. Like blink and
| their lives are over in the grand scheme of things. Maybe our
| "awareness" of the world around us is an illusion provided by
| evolution because we needed it to value self preservation
| whereas other animals don't. There is an inherent belief in the
| specialness of humans that I suppose I mistrust.
| encipriano wrote:
| I find the last part of the paragraph off-putting, and I agree.
| ben_w wrote:
| > But I don't see why that will turn into Data from Star
| Trek.
|
| "Is Data genuinely sentient or is he just a machine with this
| impression" was a repeated plot point in TNG.
|
| https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre...
|
| https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_...
|
| https://en.wikipedia.org/wiki/The_Ensigns_of_Command
|
| https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T...
|
| Similar with The Doctor on VOY.
|
| Even then, what we have with LLMs is basically already at the
| level of the ship's main computer as it was written in
| TNG/DS9/VOY.
|
| But what counts as personhood? No idea. We're as ignorant of
| that as the writers of the TV show; humanity's best
| attempts at philosophy of mind are still making the same
| kinds of arguments today as we were a century ago, the only
| real shift is to be slightly more inclusive of other species
| besides Homo Sapiens, and the only real improvement over two
| centuries ago is to include _all_ of Homo Sapiens.
| mnky9800n wrote:
| The main computer does not make choices stochastically and
| always understands what people ask it. I do not think that
| resembles the current crop of LLMs. On Voyager the ship's
| computer is some kind of biological computing entity that
| they eventually give up on as a story topic, but there is an
| episode where the bio-computing gel packs get sick.
|
| I believe Data and the Doctor both would be people to me.
| But is Minuet? The woman created by the Bynars to distract
| Riker so they could steal the ship? Star Trek goes back and
| forth as to who counts as a person or not. This is one of
| the points made in the episode where the Doctor learns his
| hologram is cloned many times to work as miners on an
| asteroid and he says this is slavery. Does Minuet get to
| say she doesn't want to seduce Riker if she doesn't feel
| like it? Or is she a program that does whatever the
| programmer wants?
|
| But moreover, linking back to the original technology, does
| EQTransformer have awareness? Do we treat it as a thinking
| being that we enslaved to find earthquakes in data because
| it uses the same technology as ChatGPT? I don't think
| anyone thinks we should do this, but it seems like that is
| where people want to take it. That's strange to me. At what
| point do data pipelines doing some kind of stochastic
| transformation and summarization of training data become an
| individual that presents a desire for autonomy like Data or
| the Doctor?
|
| I think there's lots of questions here to answer and I
| don't know the answers to them.
| ben_w wrote:
| > The main computer does not make choices stochastically
| and always understands what people ask it.
|
| The mechanism is never explained, but no, it doesn't
| always understand correctly -- and neither does Data. If
| hologram-Moriarty is sentient (is he?), then the
| capability likely exceeds what current LLMs can do, but
| the cause of the creation is definitely a
| misunderstanding.
|
| Even in the episode where that happens, the script for Dr.
| Pulaski leading up to Moriarty's IQ boost uses exactly the
| same arguments as are used against LLMs:
| https://www.youtube.com/watch?v=4pYDy7vsCj8
|
| (Common trope in that era being that computers (including
| Data) are too literal, so there was also:
| https://www.youtube.com/watch?v=HiIlJaSDPaA)
|
| Similar with every time the crew work iteratively to
| create something in the holodeck. And, of course:
| https://www.youtube.com/watch?v=srO9D8B6dH4
|
| > I do not think that resembles the current crop of LLMs.
| On voyager the ships computer is some kind of biological
| computing entity that they eventually give up on as a
| story topic but there is an episode where the bio
| computing gel packs get sick.
|
| "Take the cheese to sickbay" is one of my favourite lines
| from that series.
|
| > But is minuet?
|
| I would say the character was a puppet, with the Bynars
| pulling the strings, because the holo-character was
| immediately seen as lacking personhood the moment they
| stopped fiddling with the computer.
|
| Vic Fontaine was more ambiguous in that regard. Knew he
| was "a lightbulb", but (acted like) he wanted to remain
| within that reality in a way that to me felt like he was
| _*programmed*_ to respond as if the sim around him was
| the only reality that mattered rather than having free
| will in that regard.
|
| (But who has total free will? Humans are to holograms as
| Q is to humans, and the main cast were also written to
| reject "gifts" from Riker that time he briefly became a
| Q).
|
| The villagers of Fair Haven were, I think, not supposed
| to be sentient (from the POV of the crew), but were from
| the POV of the writers:
| https://en.wikipedia.org/wiki/Fair_Haven_(Star_Trek:_Voyager...
| and
| https://en.wikipedia.org/wiki/Spirit_Folk_(Star_Trek:_Voyage...
|
| > does eqtransformer have awareness?
|
| There's too many different definitions for a single
| answer.
|
| We don't know what part of our own brains gives us the
| sensation of our own existence; and even if we did, we
| wouldn't know if it was the only mechanism to do so.
|
| To paraphrase your own words:
|
| At what point do chemical pipelines doing some kind of
| stochastic transformation and electrochemical integration
| of sensory input become an individual that presents a
| desire for autonomy like Data or the Doctor?
|
| I don't know. Like you, I'd say:
|
| > I think there's lots of questions here to answer and I
| don't know the answers to them.
| spacebanana7 wrote:
| > I think using any of this in a national security setting is
| stupid
|
| What about AI enabled drones and guided missiles/rockets? The
| case for their effectiveness is relatively simple in terms of
| jamming resistance.
| pjc50 wrote:
| Like a lot of AI boosters, would you like to explain how that
| works other than magic AI dust? Some forms of optical
| guidance are already in use, but there are other limitations
| (lighting! weather!)
| spacebanana7 wrote:
| Sure thing. The basic idea would be:
|
| 1) Have a camera on your drone.
| 2) Run some frames through a locally running version of
| something like AWS Rekognition's celebrity identification
| service, but for relevant military targets.
| 3) Navigate towards the coordinates of target individuals.
|
| It isn't exactly magic, here's a video of a guy doing
| navigation with openCV on images:
| https://www.youtube.com/watch?v=Nrzs3dQ9exw
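|
| A toy sketch of steps 2 and 3, where detect_target() stands
| in for whatever recognition model would actually run on
| board (a placeholder, not a real library call):
|
|     import cv2
|
|     def steer_towards(frame, detect_target):
|         # Return (yaw, pitch) offsets towards a detected box,
|         # normalized to the image size.
|         h, w = frame.shape[:2]
|         box = detect_target(frame)      # (x, y, bw, bh) or None
|         if box is None:
|             return 0.0, 0.0             # no target: hold course
|         x, y, bw, bh = box
|         cx, cy = x + bw / 2, y + bh / 2
|         return (cx - w / 2) / w, (cy - h / 2) / h
|
|     cap = cv2.VideoCapture(0)
|     ok, frame = cap.read()
|     if ok:
|         print(steer_towards(frame, lambda f: None))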
| Hauthorn wrote:
| I believe this is a capability that the Switchblade 600
| or STM KARGU already has.
|
| https://en.wikipedia.org/wiki/STM_Kargu
| GTP wrote:
| This somehow reminds me of a certain killer robot from a
| Black Mirror episode ;)
| amarcheschi wrote:
| I would say that they don't require a $500bn investment.
| AFAIK, drones that help lock onto targets have already started
| being used in Ukraine.
| spacebanana7 wrote:
| I generally agree, piggybacking on innovations in
| smartphone GPUs / batteries will probably be enough to get
| locally running AI models in drones.
| swiftcoder wrote:
| drone and missile guidance system development has been using
| ML for decades at this point. That's just as much "AI" as
| anything currently coming out of the LLM craze.
| int_19h wrote:
| It's not just target guidance at this point. There are
| prototypes of drone swarms, for example.
| theGnuMe wrote:
| I think jamming resistance is a red herring. AI weapons will
| have their own failure modes due to jamming. Any sensor
| modality will have its own particular weaknesses, and
| reasoning models malfunction as well, i.e. hallucinations.
|
| Not to mention spoofed GPS, etc...
| pjc50 wrote:
| > then you absolutely should want even more money poured into
| AI development, to make it go even faster.
|
| Indeed. People are welcome to go "all in" on whatever nonsense
| gambling they want to do with their personal investments, but
| national security demands actually thinking about things -
| adversarially. Because the enemy will as well.
|
| It's perfectly possible to lose a war by investing in expensive
| superweapons that underdeliver. The Nazis were particularly
| bad at this.
| amarcheschi wrote:
| I couldn't agree more.
|
| If we're not talking exclusively about cyber war, such as
| finding and exploiting vulnerabilities, then for the time
| being national security will still be based on traditional
| armed forces.
|
| Just a few weeks ago, Italy announced a €16bn plan to buy
| >1000 Rheinmetall IFVs. That alone would make Italy's army
| one of the best equipped in Europe. I can't imagine what
| would happen with a $500bn investment in defense, lol. I
| don't agree with what Meloni's government is doing, but one
| of the ministers I agree with most is the defense minister,
| Crosetto.
|
| Furthermore, what is being shown, at least for the time being,
| is that open source can be and is crucial in helping develop
| better models. This collides with the big, single "winner
| takes all" VC mentality (because let's be honest, these
| defense pitches are still made by startup/VC bros).
| piltdownman wrote:
| >Italy announced a €16bn plan to buy >1000 Rheinmetall IFVs.
| That alone would make Italy's army one of the best equipped
| in Europe.
|
| So target practice for a beyond-the-horizon missile system
| launched ground-to-ground or air-to-ground? As an attacking
| force, conventional ground forces and tactics are a non-
| runner in a modern theatre of operations when faced with
| air and drone support. This is why no single EU country is
| incentivised into dumping money into any single area - as the
| only probable defense would be against USA/Russia/China to
| begin with.
|
| The US proved it beyond doubt in Afghanistan - partisans
| simply haven't a chance against a gunship with IR or NV
| optics; the last time they levelled the playing field against
| air interdictors was in Charlie Wilson's Afghanistan when the
| Mujahideen took on that era of Soviet gunships with hand-held
| AA systems.
| amarcheschi wrote:
| Italy has had a joke of a tank fleet in the last decade(s),
| clearly not on par with the other European alternatives; the
| defense budget went to other places, namely the navy and air
| force. Since some of the Italian tanks are being
| decommissioned, there was a need for a new tank, but not
| having invested much in tanks meant not having cutting-edge
| technology to develop an alternative. So the decision was
| taken to produce tanks with Rheinmetall (Leonardo also had
| another German producer available, but they didn't want to
| transfer technology and produce in Italy).
|
| Not having tanks is a big no-no given the latest events; I do
| not see the grudge you hold against them when it's clear they
| wouldn't be used in the same flat scenario as Ukraine or
| Poland. Given that Italy is highly mountainous, it made sense
| to prioritize the air force and navy first. I think they're
| also compatible with anti-drone guns.
|
| Piaggio Aerospace was acquired a few weeks ago by Baykar;
| hopefully it will produce drones.
|
| I do not get why you're saying that a change in the approach
| to war makes tanks not useful anymore, when it's clear they
| will eventually be adapted to counter drones and similar
| threats.
| fkyoureadthedoc wrote:
| > I can't imagine what would happen with a 500$bln investment
| in defense,lol.
|
| The $90,000 bag of bushings becomes a $300,000 bag?
| amarcheschi wrote:
| No, at least in Italy we would create some gimmick
| consultant position for some ex-politician and repay
| him/her for past favours /s
| int_19h wrote:
| It's not one or the other, though. AI-controlled drones are
| already a thing in Ukraine, today.
| smcl wrote:
| Been saying this for years, it's been fucking baffling.
| Generating images, video and text that sort-of resembles what a
| human would come up with is genuinely quite impressive. It is
| not "let's claim it'll fix our country" (looking at you, Keir)
| impressive though, and I cannot believe so much money has been
| pumped into it.
| amarcheschi wrote:
| But you have to overpromise and underdeliver, otherwise you
| won't receive that sweet, sweet money.
| pilingual wrote:
| > This is machine learning
|
| Yeah, I was thinking about this while trying to figure out
| author affiliations.
|
| There was a Stanford paper a few years ago that dusted off some
| old intelligence concepts and the authors seemed excited about
| it.
|
| But given the pace of AI, it's difficult to look in new
| directions. It will probably take an AI winter and some
| unbridled enthusiasm immune to burnout to make some real
| progress outside of feed forward neural networks.
| mrshadowgoose wrote:
| > They think it's the end all be all for everything.
|
| Is (human-based) general intelligence not one of the
| fundamental enabling elements of literally every human activity
| throughout history, regardless of how many layers of automation
| and technology one has to peel back to get to it?
|
| Can you maybe imagine how the ability to create arbitrary
| amounts of general intelligence, completely divorced from the
| normal lengthy biological process, could upend that foundation
| of human activity?
|
| > They have the vision of the 'AI' they've seen in movies in
| mind, see the current 'AI' being used and to them it's
| basically almost the same, their brain is mental bridging the
| concepts and saying it's only a matter of time.
|
| I've found that most AI-related movies exclusively focus on
| "quality ASI" scenarios, which are mostly irrelevant to our
| current state of the world, as an immense amount of
| danger/value/disruption will arrive with AGI. People who are
| seriously reasoning about the impacts of AGI are not using
| movies as references. "Those stupid movie watching idiots" is
| just a crutch you are using to avoid thinking about something
| that you disagree with.
|
| > Real AI, artificial intelligence, is a fever dream. This is
| machine learning except the machines are bigger than ever
| before. There is no intellect.
|
| Do you have any evidence to support this conclusion? And does
| it even matter? If "fake intellect" can replace a human, that
| human still has to deal with the very real issue of not
| a job anymore. If "fake intellect" is used to conduct mass
| surveillance, and direct suppression activities towards
| divergent individuals, those individuals are still going to
| have a bad time.
| gessha wrote:
| >> Real AI, artificial intelligence, is a fever dream. This
| is machine learning except the machines are bigger than ever
| before. There is no intellect.
|
| > Do you have any evidence to support this conclusion? And
| does it even matter? If "fake intellect" can replace a human,
| that human still has to deal with the very real issue of not
| having a job anymore. If "fake intellect" is used to conduct
| mass surveillance, and direct suppression activities towards
| divergent individuals, those individuals are still going to
| have a bad time.
|
| I think the "fake intelligence can replace a human" needs
| more support in general. We know how human intellect works
| practically (not theoretically) and we know how to apply it
| in different scenarios. We're still far from knowing how
| "fake intelligence" works and how to apply it to different
| scenarios.
| encipriano wrote:
| You would solve the AI problem if you correctly defined what
| intellect even is.
| amelius wrote:
| Yes, I'd like to see some examples where our current AI can
| actually extrapolate rather than interpolate. Let it invent new
| things, new drawing styles, new story plots, etc. Maybe _then_
| it will impress me.
| mrshadowgoose wrote:
| Here you go:
| https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
| amelius wrote:
| I'm not convinced. This is using the tooling and paradigms
| invented by humans.
| ninetyninenine wrote:
| Science is a paradigm invented by humans. If a human uses
| this paradigm to invent something he is considered
| intelligent, but if an AI uses it, it's not? All humans use
| human paradigms and they are intelligent.
|
| If a human uses such a paradigm to succeed, it is the
| same.
| amelius wrote:
| Biology is a field where you can find new things
| relatively easily, if you follow existing protocols and
| do the hard lab work (which costs time and money).
| mrshadowgoose wrote:
| Ah yes. I believe this is commonly called "moving the
| goalposts".
| amelius wrote:
| No it isn't.
| moffkalast wrote:
| Can you? Most people can't do anything that isn't 99%
| derivative.
| sidewndr46 wrote:
| What is even the possible usage of AI for national security?
| Generating pictures of kittens riding nuclear weapons to the
| very end like in Dr Strangelove?
| ben_w wrote:
| > What is even the possible usage of AI for national
| security? Generating pictures of kittens riding nuclear
| weapons to the very end like in Dr Strangelove?
|
| For all that critics of AI dismiss them as lacking
| imagination, your reaction suggests a lack of imagination.
|
| Off the top of my head: facial recognition and identification
| to make "smart" guns that hit specific targets with reduced
| collateral damage (as found on most digital cameras even
| before smartphones); creating and A/B testing propaganda
| campaigns; using modified wifi signals as wall-penetrating
| radar capable of pose estimation, heart rate and breathing
| monitoring[0]; take any self-driving car's AI and
| conditionally invert the part that says "don't hit
| pedestrians" when a certain target is spotted; ANPR to track
| specific vehicles with known owners over long distances;
| alternative targeting system for cruise missiles in the
| absence or jamming of GPS systems; using them as red teams in
| war-game exercises; using them to automate intrusion
| detection by monitoring for changes to background
| distributions of basically every measurable event; person-
| tracking by watching CCTV in secure areas; control systems
| for security robots (think Boston Dynamics' Spot) that are
| currently in deployment.
|
| There's likely a lot more, too.
|
| [0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...
| robwwilliams wrote:
| Lol: Where I live (Memphis) both "one" and "two" are
| considered two syllable words. Seriously. Our kids were
| taught this in the best public elementary school.
| esafak wrote:
| How does that work, one-uh, two-uh?
| robwwilliams wrote:
| egzacly
| moffkalast wrote:
| Gentlemen you can't use AI here, this is the deep learning
| lab!
| int_19h wrote:
| Autonomous weapons.
| snarf21 wrote:
| Agreed. I was working on some haiku things with ChatGPT and it
| kept telling me that busy has only one syllable. This is a
| trivially searchable fact.
| whimsicalism wrote:
| link a chat please
| snarf21 wrote:
| It wasn't just _busy_ that it failed on. I was feeding it
| haikus and wanted them broken into a list of 17
| words/fragments. Certain 2-syllable words weren't split and
| certain 1-syllable words were split into two.
| robwwilliams wrote:
| It used to be much easier to be conservative about AI,
| especially AGI, after living through three cycles of AI
| winters. No more. Dismissing it as "merely machine learning" is
| worse than unfair to the last decade of machine learning ;-)
|
| The hard part now is relatively trivial. Does anyone think that
| there is a fundamental and profound discovery that evolution
| made purely by selection in the last 200,000 years? I mean a
| true qualitative difference?
|
| Sure---We call it language, which is just another part of a
| fancy animal's tool kit.
|
| Does anyone think there is an amazing qualitative difference
| between the brain of a chimp and the brain of a human?
|
| No, not if they know any biology.
|
| (Although that does not stop some scientists from looking for
| a "language gene" like FOXP2.)
|
| So what did dumb mutations and 200,000 years of selection do
| that a group of dedicated AI scientists cannot do with their
| own genuine general intelligence?
|
| Nothing---nothing other than putting a compact energy efficient
| LLM with reinforcement learning on a good robotic body and
| letting it explore and learn like we did as infants, toddlers
| and teenagers.
|
| Each one of us has experienced becoming a "general
| intelligence". I remember it hit me on the head in 6th grade
| when I dreamed up a different way of doing long division. I
| remember thinking: "How did I think that?" And each one of us
| who has watched an infant turn into a toddler has watched it as
| an observer or teacher. This is what makes babies so
| fascinating to "play" with.
|
| We have to give our baby AGI a private memory and a layer of
| meta-attention like we all gain as we mature, love, and
| struggle.
|
| I read the linked article, and as a neuroscientist I realized
| that the "wait" cycles that improved performance so much are
| roughly equivalent to the prefrontal cortex: the part of the
| CNS most responsible for enabling us to check our own
| reasoning recursively. Delay, as in delayed gratification, is
| a key attribute of intelligent systems.
|
| We are finally on the doorstep of Hofstadter's Strange Loop
| and Maturana and Varela's "enactive" systems, but now
| implemented in silicon, metal, and plastic by us rather than
| by dumb but very patient natural selection.
|
| Karl Friston and Demis Hassabis (two very smart
| neuroscientists) figured this out years ago. And they were
| preceded by three other world-class neuroscientists: Humberto
| Maturana, Francisco Varela, and Rich Sutton (honorary
| neuroscientist). And big credit to Terry Winograd for
| presaging this path forward long ago too.
| baq wrote:
| I can only say that exponential curves look deceptively slow
| before they take off. AI is not quite at the obvious take-off
| point, but owners of the biggest clusters have seen the
| extrapolations, and it isn't pretty: once your competitor
| achieves take-off and you aren't anywhere close, you're done
| for. The risks of not participating are too great.
| menaerus wrote:
| You can choose to be somewhat ignorant of the current state of
| AI, about which I could also agree that at certain moments it
| appears totally overhyped, but the reality is that there
| hasn't been a bigger technology breakthrough in probably the
| last ~30 years.
|
| This is not "just" machine learning because we have never been
| able to do things which we are today and this is not only the
| result of better hardware. Better hardware is actually a
| byproduct. Why build a PFLOPS GPU when there is nothing that
| can utilize it?
|
| If you spare yourself some time and read through the actual
| (scientific) papers of multiple generations of LLM models, the
| first one being from Google ~~not DeepMind~~ in 2017, you might
| get to understand that this is no fluff.
|
| And I'm speaking this from a position of a software engineer,
| without bias.
|
| The reason all this took off at such high speed is the not
| quite expected results: early LLM experiments showed that
| "knowledge" with the current transformer architecture scales
| predictably with the amount of compute and training time,
| etc. That was very unexpected, and to this day scientists do
| not have an answer for why this even works.
|
| So, after reading a bunch of material, I am inclined to think
| that this is something different. The future of loading a
| codebase into the model and asking the model to explain the
| code to me or fix bugs has never been so close and realistic.
| For better or worse.
| whimsicalism wrote:
| > the first one being from DeepMind in 2017
|
| ? what paper are you talking about
| menaerus wrote:
| https://arxiv.org/abs/1706.03762
| whimsicalism wrote:
| oh gotcha. maybe pedantic but that is not a deepmind
| paper
| menaerus wrote:
| I stand corrected, not sure why I was under the
| impression that this paper came out of DeepMind lab.
| mitchelld wrote:
| This line of thinking doesn't really correspond to the reason
| Transformers were developed in the first place, which was to
| better utilize how GPUs do computation. RNNs were too slow to
| train at scale because you had to sequentially compute the
| time steps, Transformers (with masking) can run the input
| through in a single pass.
|
| It is worth noting that the first "LLM" you're referring to was
| only 300M parameters, but even then the amount of training
| required (at the time) was such that training a model like
| that outside of a big tech company was infeasible. Obviously
| now we have models that are in the hundreds of billions /
| trillions of parameters. The ability to train these models is
| directly a result of better / more hardware being applied to
| the problem as well as the Transformer architecture
| specifically designed to better conform with parallel
| computation at scale.
|
| The first GPT model came out ~ 8 years ago. I recall when
| GPT-2 came out they initially didn't want to release the
| weights out of concern for what the model could be used for,
| looking back now that's kind of amusing. However,
| fundamentally, all these models are the same setup as what
| was used then, decoder based Transformers. They are just
| substantially larger, trained on substantially more data,
| trained with substantially more hardware.
| menaerus wrote:
| What line of thinking are you referring to?
|
| Transformers were aimed at solving the "context" problem, and
| the authors, aware that RNNs neither scale at all nor solve
| that particular problem, had to come up with an algorithm
| that overcomes both of those issues. It turned out that the
| compute scalability of self-attention was the crucial
| ingredient for solving the problem, something that RNNs were
| totally incapable of.
|
| They modeled the algorithm to run on the hardware they had
| available at that time, but the hardware developed afterwards
| was a direct consequence, or as I called it a byproduct, of
| transformers proving themselves able to continuously scale.
| Had that not been true, we wouldn't have all those iterations
| of NVIDIA chips.
|
| So, although one could say that the NVIDIA chip design is
| what enabled the transformers' success, one could also say
| that we wouldn't have those chips if transformers hadn't
| proved themselves to be so damn efficient. And I'm inclined
| to think the latter.
| mitchelld wrote:
| > This is not "just" machine learning because we have
| never been able to do things which we are today and this
| is not only the result of better hardware. Better
| hardware is actually a byproduct. Why build a PFLOPS GPU
| when there is nothing that can utilize it?
|
| This is the line of thinking I'm referring to.
|
| The "context" problem had already been somewhat solved.
| The attention mechanism existed prior to Transformers and
| was specifically used on RNNs. They certainly improved
| it, but the innovation of the architecture was making it
| computationally efficient to train.
|
| I'm not really following your argument. Clearly you're
| acknowledging that it was first the case that, with the
| hardware of the time, researchers demonstrated that simply
| scaling up training with more data yielded better models.
| The fact that hardware was then optimized for these
| architectures only reinforces this point.
|
| All the papers discussing scaling laws point to the same
| thing, simply using more compute and data yields better
| results.
|
| > this is not only the result of better hardware
|
| Regarding this in particular: a majority of the
| improvement from GPT-2 to GPT-4 was simply training on a
| much larger scale. That was enabled by better hardware
| and lots of it.
| menaerus wrote:
| > the innovation of the architecture was making it
| computationally efficient to train.
|
| and
|
| > researchers demonstrated that simply scaling up
| training with more data yielded better models
|
| and
|
| > The fact that hardware was then optimized for these
| architectures only reinforces this point.
|
| and
|
| > All the papers discussing scaling laws point to the
| same thing, simply using more compute and data yields
| better results.
|
| is what I am saying as well. I read the majority of those
| papers, so this is all well known to me, but I am perhaps
| writing it down in a more condensed format so that other
| readers who are light on the topic can pick up the idea
| more easily.
|
| > A majority of the improvement from GPT-2 to GPT-4 was
| simply training on a much larger scale. That was enabled
| by better hardware and lots of it.
|
| Ok, I see your point, and the conclusion here is where we
| disagree. You say that the innovation was simply enabled by
| the better hardware, whereas I say that the better hardware
| wouldn't have its place if there hadn't been a great
| innovation in the algorithm itself. I don't think it's fair
| to say that the innovation is driven by the NVIDIA chips.
|
| I guess my point, put simplistically, is that if we had a
| lousy algorithm, new hardware wouldn't mean anything without
| rethinking or rewriting the algorithm. And with transformers,
| this definitely hasn't been the case. There have been plenty
| of optimizations over the years to better utilize the HW
| (e.g. flash attention), but the architecture of transformers
| has remained more or less the same.
| dotancohen wrote:
| > Real AI, artificial intelligence, is a fever dream. This is
| machine learning except the machines are bigger than ever
| before. There is no intellect.
|
| That sounds to me like dismissing the idea that a Russian SSBN
| might cross the Pacific and nuke Los Angeles because
| "submarines can't swim".
|
| Even if the machine learning isn't really intelligent, it is
| still capable of performing IF..THEN..ELSE operations, which
| could have detrimental effects for [some subset of] humans.
|
| And even if you argue that such a machine _shouldn't_ be used
| for whatever doomsday scenario would harm us, rest assured that
| someone, somewhere, who either does not understand what the
| machines are designed to do or just pretends that they work
| like magic, will put the machines in a position to make such a
| decision.
| UncleEntity wrote:
| One could hope...
|
| Even at the height of the Cold War there was always a human
| between <leader presses button> and <nukes go aflyin'>.
|
| --edit--
|
| ...which has me wondering if a president even has the
| constitutional authority to destroy the entire planet and if
| one could interpret their command as a 'lawful order'. Makes
| one think.
| willglynn wrote:
| On the topic of fail-deadly nukes:
|
| https://en.wikipedia.org/wiki/Dead_Hand
| ninetyninenine wrote:
| I agree AGI won't solve national security, but saying this
| isn't intelligence is false.
|
| This is AI, and trend lines point to an intelligence that
| matches or barely exceeds human intellect in the future.
|
| You're part of a trend of people in denial. When LLMs first
| came out there were hordes of people on HN claiming they were
| just stochastic parrots and displayed zero intellectual
| ability. It is now abundantly clear that this is not true.
|
| We don't fully understand LLMs. That's why gains like CoT are
| just black-box adjustments that come from changing external
| configurations. We have no way to read the contents of the
| black box and make adjustments off of it. Yet idiots like you
| can make such vast and hard claims when nobody really fully
| understands these things. You're delusional.
|
| I agree that LLMs won't allow us to make some super weapon to
| give us some edge in national security.
| naasking wrote:
| > This is machine learning except the machines are bigger than
| ever before. There is no intellect.
|
| Define "intellect".
| GTP wrote:
| Sorry for being lazy, but I just don't have the time right now to
| read the paper. Is there in the paper or somewhere else a
| comparison based on benchmarks of S1 vs R1 (the full R1, not
| quantized or distilled)?
| pama wrote:
| The S1 paper is not meant to compete with R1. It simply shows
| that with 1k well curated examples for finetuning (26 minutes
| training on 16 GPUs) and with a simple hack for controlling the
| length of the thinking process, one can dramatically increase
| the performance of a non-reasoning model and show a clear
| increase in benefit with increased test-time compute. It is
| worth a quick skim.
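|
| For intuition, the fine-tuning half really is just vanilla
| supervised fine-tuning on ~1k (question, reasoning trace,
| answer) strings; a generic sketch with a tiny stand-in model
| and fake data, not the actual s1 training script:
|
|     from datasets import Dataset
|     from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                               DataCollatorForLanguageModeling,
|                               Trainer, TrainingArguments)
|
|     name = "Qwen/Qwen2.5-0.5B"   # stand-in; s1 tuned a 32B Qwen
|     tok = AutoTokenizer.from_pretrained(name)
|     tok.pad_token = tok.pad_token or tok.eos_token
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     rows = [{"text": "Q: 2+2?\nthink: add them. Wait, check."
|                      "\nanswer: 4"}]
|     ds = Dataset.from_list(rows).map(
|         lambda e: tok(e["text"], truncation=True), batched=True)
|
|     Trainer(
|         model=model,
|         args=TrainingArguments("s1-sft", num_train_epochs=1,
|                                per_device_train_batch_size=1),
|         train_dataset=ds,
|         data_collator=DataCollatorForLanguageModeling(tok,
|                                                        mlm=False),
|     ).train()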
| swiftcoder wrote:
| > having 10,000 H100s just means that you can do 625 times more
| experiments than s1 did
|
| I think the ball is very much in their court to demonstrate they
| actually are using their massive compute in such a productive
| fashion. My BigTech experience would tend to suggest that
| frugality went out the window the day the valuation took off, and
| they are in fact just burning compute for little gain, because
| why not...
| whizzter wrote:
| Mainly it points to a non-scientific "bigger is better"
| mentality, and the researchers probably didn't mind playing
| around with the power because "scale" is "cool".
|
| Remember that the Lisp AI-lab people were working on unsolved
| problems on absolute potatoes of computers back in the day.
| We have a semblance of progress now, but so much of it has
| been brute force (even if there have been improvements in
| the field).
|
| The big question is whether this insane spending has pulled
| the rug out from under real progress and we head into another
| AI winter of disillusionment, or whether there is enough real
| progress just around the corner to show investors there is
| hope in a post-DeepSeek valuation hangover.
| wongarsu wrote:
| We are in a phase where costs are really coming down. We had
| this phase from GPT2 to about GPT4 where the key to building
| better models was just building bigger models and training
| them for longer. But since then a lot of work has gone into
| distillation and other techniques to make smaller models more
| capable.
|
| If there is another AI winter, it will be more like the
| dotcom bubble: lots of important work got done in the dotcom
| bubble, but many of the big tech companies started from the
| fruits of that labor in the decade after the bubble burst
| svantana wrote:
| Besides that, AI training (aka gradient descent) is not really
| an "embarrassingly parallel" problem. At some point, there are
| diminishing returns on adding more GPUs, even though a lot of
| effort is going into making it as parallel as possible.
| janalsncm wrote:
| What? It definitely is.
|
| Data parallelism, model parallelism, parameter server to
| workers, MoE itself can be split up, etc.
|
| But even if it wasn't, you can simply parallelize training
| runs with slight variations in hyperparameters. That is what
| the article is describing.
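|
| The hyperparameter-sweep kind of parallelism in particular is
| trivially exploitable; a sketch where train_one() is a
| placeholder for a single short fine-tuning run:
|
|     from concurrent.futures import ProcessPoolExecutor
|     from itertools import product
|
|     def train_one(lr, warmup):
|         # placeholder: would launch one independent training run
|         return {"lr": lr, "warmup": warmup, "score": 0.0}
|
|     grid = list(product([1e-5, 3e-5, 1e-4], [0, 50, 100]))
|     with ProcessPoolExecutor(max_workers=8) as pool:
|         results = list(pool.map(train_one, *zip(*grid)))
|     print(max(results, key=lambda r: r["score"]))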
| gessha wrote:
| This is pure speculation on my part but I think at some point a
| company's valuation became tied to how big their compute is so
| everybody jumped on the bandwagon.
| syntaxing wrote:
| Matt Levine tangentially talked about this during his podcast
| this past Friday (or was it the one before?). It was a good
| way to value these companies according to their compute size
| since those chips are very valuable. At a minimum, the chips
| are an asset that acts as a collateral.
| jxdxbx wrote:
| I hear this a lot, but what the hell. It's still computer
| chips. They depreciate. Short supply won't last forever.
| Hell, GPUs burn out. It seems like using ice sculptures as
| collateral, and then spring comes.
| baxtr wrote:
| If so, wouldn't it be the first time in history that more
| processing power goes unused?
|
| In my experience CPU/GPU power is used up as much as
| possible. Increased efficiency just leads to more demand.
| littlestymaar wrote:
| I think you're missing the point: the H100 isn't going to
| remain useful for long. Would you consider Tesla or Pascal
| graphics cards collateral? That's what those H100s will look
| like in just a few years.
| ijidak wrote:
| Yeah, exactly! I've got some 286, 386, and 486 CPUs that
| I want to claim as collateral!
| baxtr wrote:
| Not sure I do tbh.
|
| Any asset depreciates over time. But they usually get
| replaced.
|
| My 286 was replaced by a faster 386, and that by an even
| faster 486.
|
| I'm sure you see a naming pattern there.
| kgwgk wrote:
| > Any asset depreciates over time.
|
| That's why "those chips are very valuable" is not
| necessarily a good way to value companies - and it isn't
| if they can extract the value from the chips before they
| become worthless.
|
| > But they usually get replaced.
|
| They usually produce enough income to cover depreciation
| so you actually have the cash to replace them.
| littlestymaar wrote:
| And that's why such assets represent only a marginal part
| of a valuation. (And if you look at accounting, this
| depreciation is usually done over three years for IT
| hardware, and as such most of these chips have already
| lost half of their accounting value in the balance
| sheet).
| baq wrote:
| My 1070 was replaced by... nothing; I moved it from a
| Haswell box to an Alder Lake box.
|
| Given that inference time will soon be extremely valuable
| with agents and <thinking> models, H100s may yet be worth
| something in a couple years.
| mvc wrote:
| > My 286 was replaced by a faster 386 and that by an even
| faster 468.
|
| How much was your 286 chip worth when you bought your
| 486?
| sixothree wrote:
| Year over year gains in computing continue to slow. I
| think we keep forgetting that when talking about these
| things as assets. The thing controlling their value is
| the supply which is tightly controlled like diamonds.
| adrianN wrote:
| They have a fairly limited lifetime even if progress
| stands still.
| throwup238 wrote:
| Last I checked AWS 1-year reserve pricing for an 8x H100
| box more than pays for the capital cost of the whole box,
| power, and NVIDIA enterprise license, with thousands left
| over for profit. On demand pricing is even worse. For
| cloud providers these things pay for themselves quickly
| and print cash afterwards. Even the bargain basement
| $2/GPU/hour pays it off in under two years.
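|
| Back-of-the-envelope with assumed numbers (both figures are
| guesses for illustration, not quotes):
|
|     box_cost = 250_000          # assumed 8x H100 server, USD
|     rate = 2.0                  # assumed USD per GPU-hour
|     gpus, hours = 8, 24 * 365
|     yearly = gpus * rate * hours      # ~$140k per year
|     print(box_cost / yearly)          # ~1.8 years to pay off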
| sdenton4 wrote:
| Labor! You need it to turn the bill of sale into a data
| center and keep it running. The bargain basement would be
| even cheaper otherwise...
| spamizbad wrote:
| > Year over year gains in computing continue to slow.
|
| This isn't true in the AI chip space (yet). And so much
| of this isn't just about compute but about the memory.
| eek2121 wrote:
| From a per-mm2 performance standpoint, things absolutely
| have slowed considerably. Gains are primarily being eked
| out via process advantages (which have slowed down) and
| larger chips (which have an ever-shrinking limit depending
| on the tech used).
|
| Chiplets have slowed the slowdown in AI, but you can see
| in the gaming space how much things have slowed to get an
| idea of what is coming for enterprise.
| ijidak wrote:
| Honestly, I don't fully understand the reason for this
| shortage.
|
| Isn't it because we insist on only using the latest nodes
| from a single company for manufacture?
|
| I don't understand why we can't use older process nodes
| to boost overall GPU making capacity.
|
| Can't we have tiers of GPU availability?
|
| Why is Nvidia not diversifying aggressively to Samsung
| and Intel, no matter the process node?
|
| Can someone explain?
|
| I've heard packaging is also a concern, but can't you get
| Intel to figure that out with a large enough commitment?
| nl wrote:
| > Isn't it because we insist on only using the latest
| nodes from a single company for manufacture?
|
| TSMC was way ahead of anyone else introducing 5nm.
| There's a long lead time porting a chip to a new process
| from a different manufacturer.
|
| > I don't understand why we can't use older process nodes
| to boost overall GPU making capacity.
|
| > Can't we have tiers of GPU availability?
|
| NVidia do this. You can get older GPUs, but more
| performance is better for performance sensitive
| applications like training or running LLMs.
|
| Higher performance needs better manufacturing processes.
| ecocentrik wrote:
| That is the wrong take. Depreciated and burned-out chips
| are replaced, and total compute value typically increases
| over time. Efficiency gains are also calculated
| and projected over time. Seasons are inevitable and
| cyclical. Spring might be here but winter is coming.
| SecretDreams wrote:
| > It was a good way to value these companies according to
| their compute size since those chips are very valuable.
|
| Are they actually, though? Presently yes, but are they
| actually driving ROI? Or are they just an asset nobody is
| really utilizing meaningfully, but which helps juice the
| stocks?
| ijidak wrote:
| I asked this elsewhere, but, I don't fully understand the
| reason for the critical GPU shortage.
|
| Isn't it because NVIDIA insists on only using the latest
| nodes from a single company (TSMC) for manufacture?
|
| I don't understand why we can't use older process nodes to
| boost overall GPU making capacity.
|
| Can't we have tiers of GPU availability some on cutting
| edge nodes, others built on older Intel and Samsung nodes?
|
| Why is Nvidia not diversifying aggressively to Samsung and
| Intel, no matter the process node?
|
| Can someone explain?
|
| I've heard packaging is also a concern, but can't you get
| Intel to figure that out with a large enough commitment?
|
| (Also, I know NVIDIA has some capacity on Samsung. But why
| not go all out, even using Global Foundries?)
| aorloff wrote:
| If you are a cloud provider renting them out
|
| Otherwise you'd better keep them humming while trying to find
| a business model, because they certainly aren't getting any
| newer as chips.
| dghlsakjg wrote:
| That's a great way to value a company that is going
| bankrupt.
|
| But, I'm not going to value an operating construction
| company based on how many shovels or excavators they own.
| I'm going to want to see them putting those assets to
| productive use.
| JKCalhoun wrote:
| So, "No one was ever fired for ... buying more server
| infrastructure."
| genewitch wrote:
| Walmart has massive, idle datacenters full of running
| machines doing nothing.
| tyfon wrote:
| I don't think you need to speculate too hard. On CNBC they
| are not tracking revenue, profits or technical breakthroughs,
| but how much the big companies are spending (on gpus). That's
| the metric!
| Mistletoe wrote:
| This feels like one of those stats they show from 1929 and
| everyone is like "and they didn't know they were in a
| bubble?"
| RobotToaster wrote:
| "But tulip sales keep increasing!"
| LeifCarrotson wrote:
| I probably don't have to repeat it, but this is a perfect
| example of Goodhart's Law: when a metric is used as a
| target, it loses its effectiveness as a metric.
|
| If you were a reporter who didn't necessarily understand
| how to value a particular algorithm or training operation,
| but you wanted a simple number to compare the amount of
| work OpenAI vs. Google vs Facebook are putting into their
| models, yeah, it makes sense. How many petaflops their
| datacenters are churning through in aggregate is probably
| correlated to the thing you're trying to understand. And
| it's probably easier to look at their financials and
| correlate how much they've spent on GPUs to how many
| petaflops of compute they need.
|
| But when your investors are giving you more money based on
| how well they perceive you're doing, and their perception
| is not an oracle but is instead directly based on how much
| money you're spending... the GPUs don't actually need to do
| anything other than make number go up.
| B56b wrote:
| They absolutely are tracking revenues/profits on CNBC, what
| are you talking about?
| ur-whale wrote:
| > but how much the big companies are spending (on gpus).
| That's the metric!
|
| Burn rate based valuations!
|
| The 2000's are back in full force!
| jerf wrote:
| This claim is mathematically nonsensical. It implies a more-or-
| less linear relationship, that more is always better. But
| there's no reason to limit that to H100s. Conventional servers
| are, if anything, rather more established in their ability to
| generate value, by which I mean, however much potential AI
| servers may have to be more important than conventional servers
| that they may manifest in the future, we know how to use
| conventional servers to generate value _now_.
|
| And thus, by this logic, every company in the world should just
| be buying as many servers as they can get their hands on,
| because More Servers = More Value.
|
| Obviously, this is not happening. It doesn't take much analysis
| to start listing the many and manifold reasons why. Many of
| those reasons will apply to GPUs as well. Just as if everything
| in AWS got 10x faster, overnight, this would not create a
| situation where everyone suddenly starts grabbing _more_
| servers in AWS. Obviously everyone would start trimming down,
| even if perhaps in a few years' time they'd find some way to
| use this burst of power such that they can use more later. This
| can't happen overnight, though. It would take time, and not
| "weeks" or "months" but "years" at scale.
|
| Incorporating the important variable of _time_ in the analysis,
| if AIs become literally hundreds of times cheaper to run,
| _today_ , then it is perfectly logical that the near-term
| demand for the hardware to run them is also going to go way,
| way down. However much _potential_ AI may have, it is fairly
| clear, looking out at the AI landscape right now, that there
| isn't really anyone out there unlocking vast amounts of value and
| sitting there wringing their hands because they just can't get
| more GPU compute. The GPU rush has been from fear that someone
| will figure out how to "really" unlock AI and then they'll be
| stuck without the hardware to compete.
|
| It may be the case that vastly cheaper AI will in fact be part
| of unlocking that value, and that as the AI industry grows it
| will grow faster as a result... but that's still going to be on
| a multi-year time frame, not a tomorrow time frame. And all
| those GPUs and all those valuations are still broadly based on
| them being valuable _real soon now_ , not in a few years, and
| all those GPU purchases are on the assumption they need them
| _now_ , or on a timeframe where we can't be waiting around,
| rather than waiting for some rounds of exponential doublings to
| bring price down. The hardware curve in 5 years may be higher
| but the curve in the next _year_ would be lower, and by a
| _lot_.
|
| And, you know, who's to say we're done? I doubt there's another
| 100x in there, but is someone going to eke out another 2x
| improvement? Or a 10x improvement? Making it easier to run lots
| of experiments makes it much more likely for that to happen.
| I'm skeptical of another 10x general improvement but 10x
| improvements for specific, important use cases I can't rule
| out.
|
| Edit: I should also point out this is an extremely common
| pattern in technology in general. Often the very hardest part
| is producing a thing that does a particular task _at all_. Once
| we have it in hand, once we can use it and learn how it
| operates and what its characteristic operating modes are, once
| we can try modifications to it in the real world and see what
| happens, _optimizing_ it becomes much easier, sometimes
| explosively so by comparison. Taking any first iteration of a
| tech that is practical and then trying to straight-line demand
| based on it is silly, in all sorts of ways and all directions.
| The internal combustion engine, for example, has had a myriad
| of impacts on the world and certainly after various
| improvements many, many millions if not billions of them have
| been made... but any company that reacted to the first couple
| of cars and just went _ballistic_ buying those first-generation
| internal combustion engines would have lost everything, and
| rather quickly.
| deadbabe wrote:
| For starters every employee has an H100 under their desk.
| HenryBemis wrote:
| > Going forward, it'll be nearly impossible to prevent distealing
| (unauthorized distilling). One thousand examples is definitely
| within the range of what a single person might do in normal
| usage, no less ten or a hundred people. I doubt that OpenAI has a
| realistic path to preventing or even detecting distealing outside
| of simply not releasing models.
|
| (sorry for the long quote)
|
| I will say (naively perhaps) "oh, but that is fairly simple".
| For any API request, add a 5-second delay before the next one
| for 'unverified' users. Make the "blue check" (a la
| X/Twitter). For the 'big sales', have a third-party vetting
| process so that if US Corporation XYZ wants access, they prove
| themselves worthy/not Chinese competition, and then you do
| give them the 1000/min deal.
|
| For everyone else, add the 5-second (or whatever other
| duration makes sense) timer/overhead and then see them drop
| from 1000 requests per minute to 500 per day. Or just cap
| them at 500 per
| day and close that back-door. And if you get 'many cheap
| accounts' doing hand-overs (AccountA does 1-500, AccountB does
| 501-1000, AccountC does 1001-1500, and so on) then you mass block
| them.
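|
| i.e. nothing fancier than a per-account daily counter; a
| sketch of the cap described above:
|
|     import time
|     from collections import defaultdict
|
|     DAILY_CAP = 500
|     usage = defaultdict(lambda: [0, 0])  # account -> [day, count]
|
|     def allow(account, verified):
|         if verified:
|             return True          # vetted accounts: no cap
|         day = int(time.time() // 86400)
|         rec = usage[account]
|         if rec[0] != day:
|             rec[0], rec[1] = day, 0
|         rec[1] += 1
|         return rec[1] <= DAILY_CAP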
| mark_l_watson wrote:
| Off topic, but I just bookmarked Tim's blog, great stuff.
|
| I dismissed the X references to S1 without reading them, big
| mistake. I have been working generally in AI for 40 years and
| in neural networks for 35 years, and the exponential progress
| since the hacks that made deep learning possible has been
| breathtaking.
|
| Reduction in processing and memory requirements for running
| models is incredible. I have been personally struggling with
| creating my own LLM-based agents with weaker on-device models (my
| same experiments usually work with 4o-mini and above models) but
| either my skills will get better, or I can wait for better
| on-device models.
|
| I was experimenting with the iOS/iPadOS/macOS app On-Device AI
| last night and the person who wrote this app was successful in
| getting web search tool calling to work with a very small model
| - something that I have been trying to perfect.
| cowsaymoo wrote:
| The part about taking control of a reasoning model's output
| length using <think></think> tags is interesting.
|
| > In s1, when the LLM tries to stop thinking with "</think>",
| they force it to keep going by replacing it with "Wait".
|
| I had found a few days ago that this lets you 'inject' your own
| CoT and jailbreak it more easily. Maybe these are related?
|
| https://pastebin.com/G8Zzn0Lw
|
| https://news.ycombinator.com/item?id=42891042#42896498
| causal wrote:
| This even points to a reason why OpenAI hides the "thinking"
| step: it would be too obvious that the context is being
| manipulated to induce more thinking.
| zamalek wrote:
| It's weird that you need to do that at all, couldn't you just
| reject that token and use the next most probable?
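|
| As far as I can tell, that is close to what the trick amounts to
| at sampling time. A toy sketch of "reject that token and use the
| next most probable" (plain Python, made-up token ids, greedy
| decoding only):
|
|     END_OF_THINKING = 3   # made-up id for the end-of-thinking token
|
|     def pick_token(logits: dict[int, float], ban_end: bool) -> int:
|         """Greedy pick, optionally masking the end-of-thinking id."""
|         candidates = dict(logits)
|         if ban_end:
|             candidates.pop(END_OF_THINKING, None)  # reject it
|         return max(candidates, key=candidates.get)
|
|     logits = {0: 1.2, 1: 0.7, 2: 2.5, 3: 3.1}  # end token is top-1
|     print(pick_token(logits, ban_end=False))   # -> 3, stops thinking
|     print(pick_token(logits, ban_end=True))    # -> 2, keeps thinking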
| Havoc wrote:
| The point about agents to conceal access to the model is a good
| one.
|
| Hopefully we won't lose all access to models in the future.
| cyp0633 wrote:
| Qwen's QvQ-72B produces many more "wait"s than other LLMs with
| CoT that I tried; maybe they've already used that trick to some
| extent?
| theturtletalks wrote:
| DeepSeek R1 uses <think> tags and "wait", and you can see it in
| the thinking tokens second-guessing itself. How does the model
| know when to wait?
| when to wait?
|
| These reasoning models feed into OP's last point about NVidia
| and OpenAI data centers not being wasted, since reasoning models
| require more tokens and faster tps.
| qwertox wrote:
| Probably when it would expect a human to second-guess themselves,
| as shown in literature and maybe other sources.
| UncleEntity wrote:
| From playing around they seem to 'wait' when there's a
| contradiction in their logic.
|
| And I think the second point is due to The Market thinking
| there is no need to spend ever increasing amounts of compute to
| get to the next level of AI overlordship.
|
| Of course Jevons' paradox is also all in the news these days...
| pona-a wrote:
| If chain of thought acts as a scratch buffer by providing the
| model more temporary "layers" to process the text, I wonder if
| making this buffer a separate context with its own separate FNN
| and attention would make sense; in essence, there's a
| macroprocess of "reasoning" that takes unbounded time to
| complete, and then there's a microprocess of describing this
| incomprehensible stream of embedding vectors in natural language,
| in a way returning to the encoder/decoder architecture but where
| both are autoregressive. Maybe this would give us a denser
| representation of said "thought", not constrained by imitating
| human text.
| bluechair wrote:
| I had this exact same thought yesterday.
|
| I'd go so far as to add one more layer to monitor this one and
| stop adding layers. My thinking is that this meta awareness is
| all you need.
|
| No data to back my hypothesis up. So take it for what it's
| worth.
| larodi wrote:
| My thought along the same lines: do all tokens live in the same
| latent space, or in many spaces, with each logical unit trained
| separately from the others...?
| hadlock wrote:
| This is where I was headed but I think you said it better.
| Some kind of executive process monitoring the situation, the
| random stream of consciousness and the actual output. Looping
| back around to outdated psychology you have the ego which is
| the output (speech), the super ego is the executive process
| and the id is the <think>internal monologue</think>. This
| isn't the standard definition of those three but close
| enough.
| whimsicalism wrote:
| > this incomprehensible stream of embedding vectors as natural
| language explanation, in a way returning to encoder/decoder
| architecture
|
| this is just standard decoding, the stream of vectors is called
| the k/v cache
| easeout wrote:
| Here's a paper your idea reminds me of.
| https://arxiv.org/abs/2501.19201
|
| It's also not so far from Meta's large concept model idea.
| pona-a wrote:
| Previous discussion:
|
| [41 comments, 166 points]
| https://news.ycombinator.com/item?id=42919597
| bloomingkales wrote:
| Once we train models on the chain of thought outputs, next
| token prediction can solve the halting problem for us (eg, this
| chain of thinking matches _this other_ chain of thinking).
| psadri wrote:
| I think that is how human brains work. When we practice, at
| first we have to be deliberate (thinking slow). Then we
| "learn" from our own experience and it becomes muscle memory
| (thinking fast). Of course, it increases the odds we are
| wrong.
| bloomingkales wrote:
| Or worse, we incorrectly overweight the wrong chain of
| thinking to an irrelevant output (but pragmatically useful
| output), at scale.
|
| For example, xenophobia as a response to economic hardship
| is the wrong chain of thinking embedded in the larger
| zeitgeist.
| jjk7 wrote:
| Comments on a google doc? Nesting in social media comments?
|
| Seems like similar concepts. I think there is some potential for
| improving how LLMs improve and further their own reasoning
| lines, but I'm no AI mage.
| zoogeny wrote:
| I've had an idea since I was a kid which I can share. I was
| contemplating AI and consciousness generally, probably around
| the time I read "The Mind's I".
|
| I reflected on the pop-psychology idea of consciousness and
| subconsciousness. I thought of each as an independent stream of
| tokens, like stream of consciousness poetry. But along the
| stream there were joining points between these two streams,
| points where the conscious stream was edited by the
| subconscious stream. You could think of the subconscious stream
| as performing CRUD like operations on the conscious stream. The
| conscious stream would act like a buffer of short-term memory
| while the subconscious stream would act like a buffer of long-
| term memory. Like, the subconscious has instructions related to
| long-term goals and the conscious stream has instructions
| related to short-term goals.
|
| You can imagine perception as input being fed into the
| conscious stream and then edited by the subconscious stream
| before execution.
|
| It seems entirely possible to actually implement this idea in
| this current day and age. I mean, it was a fever dream as a
| kid, but now it could be an experiment!
| barrenko wrote:
| Consciousness as the subconscious pretending not to be the
| subconscious, something like that, a thin wrapper. CRUD makes
| sense.
|
| Gels closely with Buddhism, hell, all religions.
| ForHackernews wrote:
| Have you read Jaynes' "The Origin of Consciousness in the
| Breakdown of the Bicameral Mind"?
| zoogeny wrote:
| I haven't read the original but I am familiar with the
| broad stroke view. There are similarities (perhaps vague)
| in the more recent work of someone like McGilchrist and his
| The Master and His Emissary (another book which I only have
| a broad stroke view of).
|
| At the time I had this idea I did not know of either of
| these. I think I was drawing explicitly on the conscious /
| subconscious vocabulary.
| cakealert wrote:
| The problem is that RL is extremely inefficient. It's one thing
| to use it for fine tuning an LLM to do the chain of thought
| trick and quite another to do thinking entirely from scratch.
| The pretrained LLM does a lot of heavy lifting there.
|
| And it would have to be RL for your idea to work since there is
| no "thinking" dataset for a novel token space. There isn't even
| one for existing LLM token space, but they have the base model
| to work off of. When the thought is expressed in English, the
| model already knows the relationships between the tokens in the
| thought, it's merely repurposing it for a "thinking"
| application.
| itissid wrote:
| > The problem is that RL is extremely inefficient.
|
| Wait, what? That is an odd way of defining it. That's like
| saying Turing machines are an inefficient way to solve TSP. You
| would, at the least, want to define this in terms of
| complexity or put this into context of domains and
| observability.
|
| RL, by definition, is a field about finding efficient solutions
| to problems in the domain of choice [1]. There are likely regimes
| in LLM/LRM learning where RL can be quite efficient, polynomial
| time even in the state space; we just need to explore and find
| them. For example, you can use Dynamic Programming as a "more"
| efficient way to solve MDPs [1] because it is polynomial in the
| state space × action space.
|
| [1]https://web.stanford.edu/class/psych209/Readings/SuttonBar
| to...
| cakealert wrote:
| RL provides very poor training signal for deep learning, an
| order of magnitude or more worse than supervised learning.
| Better than nothing of course.
|
| What the OP suggested is similar to training a transformer
| from scratch using RL (ie. no training tokens) towards an
| objective of steering a pretrained LLM to produce human
| readable output. It will probably not even converge, and if
| it does it would take immense compute.
| pizza wrote:
| In the case of supervised problem domains, you implicitly
| make a decision about what is signal, and what is noise,
| and sure, in that closed setting, supervised learning is
| much more sample efficient. But I think what we're
| learning now is that with strong enough base models,
| 'aha' moments in RL training show that it might be
| possible to essentially 'squeeze out signal from language
| itself', giving you far greater breadth of latent
| knowledge than supervised examples, and letting you train
| to generalize to far greater horizons than a fixed
| dataset might allow. In a fascinating way it is rather
| reminiscent of, well, abiogenesis. This might sound like
| speculative claptrap if you look at the things the
| current generation of models are still weak at, but...
| there's a real chance that there is a very heavy tail to
| the set of outcomes in the limit.
| cakealert wrote:
| With a pretrained LLM most of the work is done. RL just
| steers the model into a 'thinking' mode. There is enough
| signal for that to work and for the inefficiency to not
| matter.
|
| The downside is that you are limiting the model to think
| in the same language it outputs. An argument could be
| made that this is not how all humans think. I know that I
| rarely think in language or even images, just concepts
| (probably isn't even the right word) mix and transform
| and often I don't even bother to make the transformation
| to language at the end, just action.
| pizza wrote:
| I strongly agree; in fact I think what best matches the
| thought process is something like the multiset
| tree/forest workspace approach as suggested by Marcolli,
| Chomsky, and Berwick - a Hopf algebra that can be
| externalized into (non-planar) embeddings of linearized
| strings, or alternately into semantic manifolds.
| sambull wrote:
| That sovereign wealth fund with TikTok might set a good
| precedent; when we have to 'pour money' into these companies we
| can do so with a stake in them held in our sovereign wealth fund.
| TehCorwiz wrote:
| Extra-legal financial instruments meant to suck money from
| other federal departments don't strike me as a good precedent
| in any sense. I don't disagree though that nationalizing the
| value of enormous public investments is something we should be
| considering (looking at you, oil industry). But until Congress
| appropriates the money under law, it's a pipe dream or theft.
| ipnon wrote:
| All you need is attention and waiting. I feel like a zen monk.
| jebarker wrote:
| S1 (and R1 tbh) has a bad smell to me or at least points towards
| an inefficiency. It's incredible that a tiny number of samples
| and some inserted <wait> tokens can have such a huge effect on
| model behavior. I bet that we'll see a way to have the network
| learn and "emerge" these capabilities during pre-training. We
| probably just need to look beyond the GPT objective.
| pas wrote:
| can you please elaborate on the wait tokens? what's that? how
| do they work? is that also from the R1 paper?
| jebarker wrote:
| The same idea is in both the R1 and S1 papers (<think> tokens
| are used similarly). Basically they're using special tokens
| to mark in the prompt where the LLM should think more/revise
| the previous response. This can be repeated many times until
| some stop criteria occurs. S1 manually inserts these with
| heuristics, R1 learns the placement through RL I think.
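|
| A rough sketch of that forced-thinking loop as I understand it
| (the "</think>" stop string and the appended "Wait" follow what's
| described above; fake_generate and everything else here is a
| made-up stand-in, not the authors' code):
|
|     END_OF_THINKING = "</think>"   # end-of-thinking delimiter
|
|     def fake_generate(trace: str, stop: str | None) -> str:
|         """Stand-in for a real LLM call that halts at `stop`."""
|         return " ...some reasoning... "
|
|     def think_with_budget(prompt: str, min_rounds: int = 2) -> str:
|         trace = prompt
|         for _ in range(min_rounds):
|             trace += fake_generate(trace, stop=END_OF_THINKING)
|             # The model tried to stop thinking: drop the delimiter
|             # and nudge it to continue by appending "Wait".
|             trace += "Wait"
|         # Budget spent: allow the delimiter and produce the answer.
|         trace += END_OF_THINKING
|         return trace + fake_generate(trace, stop=None)
|
|     print(think_with_budget("How many r's are in raspberry?"))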
| whimsicalism wrote:
| ? they're not special tokens really
| jebarker wrote:
| i'm not actually sure whether they're special tokens in
| the sense of being in the vocabulary
| whimsicalism wrote:
| <think> might be; I think "wait" is tokenized like any
| other token in the pretraining
| throwaway314155 wrote:
| There's a decent explanation in the article, just FYI.
| sfink wrote:
| I agree, but LLMs in general have a horrendously bad smell in
| terms of efficiency. s1 and r1 are just proving it.
|
| The models' latent spaces are insanely large. The vast, vast
| majority pretty much has to be irrelevant and useless, it's
| just that the training commandeers random fragments of that
| space to link up the logic they need and it's really hard to
| know which of the weights are useless, which are useful but
| interchangeable with other weights, and which are truly load-
| bearing. You could probably find out easily by testing the
| model against every possible thing you ever might want it to
| do, just as soon as someone gets around to enumerating that
| non-enumerable collection of tasks.
|
| These bogus <wait> tokens kind of demonstrate that the models
| are sort of desperate to escape the limitations imposed by the
| limited processing they're allowed to do -- they'll take
| advantage of thinking time even when it's provided in the
| silliest manner possible. It's amazing what you can live if
| it's all you have!
|
| (Apologies for the extended anthropomorphizing.)
| light_hue_1 wrote:
| S1 has no relationship to R1. It's a marketing campaign for an
| objectively terrible and unrelated paper.
|
| S1 is fully supervised by distilling Gemini. R1 works by
| reinforcement learning with a much weaker judge LLM.
|
| They don't follow the same scaling laws. They don't give you the
| same results. They don't have the same robustness. You can use R1
| for your own problems. You can't use S1 unless Gemini works
| already.
|
| We know that distillation works and is very cheap. This has been
| true for a decade; there's nothing here.
|
| S1 is a rushed hack job (they didn't even run most of their
| evaluations, with the excuse that the Gemini API is too hard to
| use!) that probably existed before R1 was released and then
| pivoted into this mess.
| bloomingkales wrote:
| This thing that people are calling "reasoning" is more like
| rendering to me really, or multi pass rendering. We're just
| refining the render, there's no reasoning involved.
| dleslie wrote:
| That was succinct and beautifully stated. Thank-you for the
| "Aha!" moment.
| bloomingkales wrote:
| Hah. You should check out my other comment on how I think
| we're obviously in a simulation (remember, we just need to
| see a good enough render).
|
| LLMs are changing how I see reality.
| mistermann wrote:
| "...there's no reasoning involved...wait, could I just be
| succumbing to my heuristic intuitions of what is (seems to be)
| true....let's reconsider using System 2 thinking..."
| bloomingkales wrote:
| Or there is no objective reality (well there isn't, check out
| the study), and reality is just a rendering of the few state
| variables that keep track of your simple life.
|
| A little context about you:
|
| - person
|
| - has hands, reads HN
|
| These few state variables are enough to generate a believable
| enough frame in your rendering.
|
| If the rendering doesn't look believable to you, you modify
| state variables to make the render more believable, eg:
|
| Context:
|
| - person
|
| - with hands
|
| - incredulous demeanor
|
| - reading HN
|
| Now I can render you more accurately based on your
| "reasoning", but truly I never needed all that data to see
| you.
|
| Reasoning as we know it could just be a mechanism to fill in
| gaps in obviously sparse data (we absolutely do not have all
| the data to render reality accurately, you are seeing an
| illusion). Go reason about it all you want.
| mistermann wrote:
| Is this a clever rhetorical trick to make it appear that
| your prior claim was correct?
|
| If not: what am I intended to take away from this? What is
| its relevance to my comment?
| bloomingkales wrote:
| You made a joke about questioning reality, I simply
| entertained it. You can do whatever you want with it,
| wasn't a slight at all.
| mistermann wrote:
| It may have been in the form of a joke, but I certainly
| wasn't joking.
|
| I think it is interesting what actions cannot be done by
| humans.
| bloomingkales wrote:
| I wasn't joking either. Things are just getting started
| with this AI stuff, and I feel like programmers will
| experience that "déjà vu" phenomenon that they talk
| about in the Matrix, that eerie feeling something isn't
| right.
|
| Look, why have game developers spent so much time lazy
| loading parts of the game world? Very rarely do they just
| load the whole world, even in 2025. See, the worlds get
| bigger, so even as the tech gets better, we will always
| lazy load worlds in.
|
| It's a context issue right? Developers have just recently
| been given this thing called "context".
|
| But yeah man, why do we think just because we walked from
| our house to the supermarket that this reality didn't
| lazy load things. That's how programmers have been doing
| it all along ...
|
| Anyways
| mistermann wrote:
| A more parsimonious explanation: consciousness is
| generative, like an LLM. And, according to cultural
| conditioning, this generated scenario is referred to as
| _reality_.
|
| I like this version for at least two reasons:
|
| 1. It is 100% compliant with large quantities of
| scientific findings (psychology and neuroscience), whereas
| I believe yours has a conservation of mass problem _at
| least_
|
| 2. Everyone dislikes it _at least in certain scenarios_
| (say, when reference is made to it during an object level
| disagreement)
| ddrdrck_ wrote:
| We could see it the other way around : what we call "reasoning"
| may actually be some kind of multipass rendering, whether it
| is performed by computers or human brains.
| bloomingkales wrote:
| Yes, of course. The implications are awesome.
| pillefitz wrote:
| Which is related to multistage/ hierarchical/coarse-to-fine
| optimization, which is a pretty good way to find the global
| optimum in many problem domains.
| buyucu wrote:
| Yes.
|
| Before LLMs we had N-Gram language models. Many tasks like
| speech recognition worked as beam search in the graph defined
| by the n-gram language model. You could easily get huge accuracy
| gains simply by pruning your beam less.
|
| s1 reminds me of this. You can always trade off latency for
| accuracy. Given that these LLMs are much more complex than good old
| N-Grams, we're just discovering how to do this trade.
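|
| A toy illustration of that dial (pure Python, made-up scores; the
| only point is that pruning the beam less costs more work per step
| but can find better hypotheses):
|
|     VOCAB = ["a", "b", "c"]
|
|     def score(seq):
|         # Toy scorer with a delayed payoff that greedy pruning
|         # misses: starting with "b", "b" is worth a lot.
|         if seq[:2] == ("b", "b"):
|             return 10.0
|         return float(seq.count("a"))
|
|     def beam_search(steps, width):
|         beams = [()]
|         for _ in range(steps):
|             cands = [b + (t,) for b in beams for t in VOCAB]
|             beams = sorted(cands, key=score, reverse=True)[:width]
|         return beams[0], score(beams[0])
|
|     print(beam_search(steps=3, width=1))  # heavy pruning, score 3.0
|     print(beam_search(steps=3, width=3))  # wider beam, score 10.0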
| bloomingkales wrote:
| Let me carry that concept, "learning to do this trade", it's
| a new trade.
|
| I don't believe computer science has the algorithms to handle
| this new paradigm. Everything was about sequential
| deterministic outputs, and clever ways to do it fast. This
| stuff is useless at the moment. We need new thinkers on how
| to not think sequentially or how not to think about the
| universe in such a small way.
|
| Verifying input/output pairs is the old way. We need to
| understand differently going forward.
| LordDragonfang wrote:
| How are you defining "reasoning"?
|
| Because I see these sorts of gnostic assertions about LLMs all
| the time about how they "definitely aren't doing <thing we
| normally apply to meat-brains>" by gesturing at the technical
| things it's doing, with no attempts to actually justify the
| negative assertion.
|
| It often comes across as privileged reason trying to justify
| that of course the machine isn't doing some ineffable thing
| only meat-brains do.
| bloomingkales wrote:
| From my other ridiculous comment, as I do entertain
| simulation theory in my understanding of God:
|
| _Reasoning as we know it could just be a mechanism to fill
| in gaps in obviously sparse data (we absolutely do not have
| all the data to render reality accurately, you are seeing an
| illusion). Go reason about it all you want._
|
| The LLM doesn't know anything. We determine what output is
| right, even if the LLM swears the output is right. We
| "reason" about it, I guess? Well in this case the whole
| "reasoning" process is to simply get an output that looks
| right, so what is reasoning in our case?
|
| Let me just go one ridiculous level lower. If I measure every
| frame the Hubble telescope takes, and I measure with a simple
| ruler the distances between things, frame by frame, I can
| "reason" out some rules of the universe (planetary orbits).
| In this "reasoning" process, the very basic question of "well
| why, and who made this" immediately arises, so reasoning
| always leads to the fundamental question of God.
|
| So, yeah. We reason to see God, because that's all we're
| seeing, everything else is an illusion. Reasoning is
| inextricably linked to God, so we have to be very open minded
| when we ask what is this machine doing.
| LordDragonfang wrote:
| Honestly, I was going to nitpick, but this definition
| scratches an itch in my brain so nicely that I'll just
| compliment it as beautiful. "We reason to see God", I love
| it.
|
| (Also, if I might give a recommendation, you might be the
| type of person to enjoy _Unsong_ by Scott Alexander
| https://unsongbook.com/)
| bloomingkales wrote:
| Thank you for the suggestion and nice words. Trust me, I
| have to sit here and laugh at the stuff I write too,
| because I wasn't always a believer. So it's a little bit
| of a trip for me too, I'm still exploring my own
| existence.
| frontalier wrote:
| sshhhh, let the money flow
| whimsicalism wrote:
| this isn't rlvr and so sorta uninteresting, they are just
| distilling the work already done
| bloomingkales wrote:
| If an LLM output is like a sculpture, then we have to sculpt it.
| I never did sculpting, but I do know they first get the clay
| spinning on a plate.
|
| Whatever you want to call this "reasoning" step, ultimately it
| really is just throwing the model into a game loop. We want to
| interact with it on each tick (spin the clay), and sculpt every
| second until it looks right.
|
| You will need to loop against an LLM to do just about anything
| and everything, forever - this is the default workflow.
|
| Those who think we will quell our thirst for compute have another
| thing coming, we're going to be insatiable with how much LLM
| brute force looping we will do.
| MrLeap wrote:
| This is a fantastic insight and really has my gears spinning.
|
| We need to cluster the AI's insights on a spatial grid hash,
| give it a minimap with the ability to zoom in and out, and give
| it the agency to try and find its way to an answer and build up
| confidence and tests for that answer.
|
| coarse -> fine, refine, test, loop.
|
| Maybe a parallel model that handles the visualization stuff. I
| imagine its training would look more like computer vision. Mind
| palace generation.
|
| If you're stuck or your confidence is low, wander the palace
| and see what questions bubble up.
|
| Bringing my current context back through the web is how I think
| deeply about things. The context has the authority to reorder
| the web if it's "epiphany grade".
|
| I wonder if the final epiphany at the end of what we're
| creating is closer to "compassion for self and others" or "eat
| everything."
| zoogeny wrote:
| I can't believe this hasn't been done yet, perhaps it is a cost
| issue.
|
| My literal first thought about AI was wondering why we couldn't
| just put it in a loop. Heck, one update per day, or one update
| per hour would even be a start. You have a running "context",
| the output is the next context (or a set of transformations on
| a context that is a bit larger than the output window). Then
| ramp that up ... one loop per minute, one per second,
| millisecond, microsecond.
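|
| A minimal sketch of that loop (fake_llm is a made-up placeholder
| for any chat-completion call; the prompt wording is just one way
| to keep the running context bounded):
|
|     def fake_llm(prompt: str) -> str:
|         """Stand-in for a real model call; returns canned text."""
|         return "goals: ... / notes: ... / next step: ..."
|
|     def run_loop(initial_context: str, ticks: int) -> str:
|         context = initial_context
|         for _ in range(ticks):
|             # Old context in, new context out: the output becomes
|             # the next iteration's working memory.
|             context = fake_llm(
|                 "Current context:\n" + context +
|                 "\nRewrite it: keep long-term goals, update "
|                 "short-term ones, drop anything stale."
|             )
|         return context
|
|     print(run_loop("goal: learn about s1", ticks=3))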
| layer8 wrote:
| Same. And the next step is that it must feed back into
| training, to form long-term memory and to continually learn.
| zoogeny wrote:
| I analogize this with sleep. Perhaps that is what is
| needed, 6 hours offline per day to LoRA the base model on
| some accumulated context from the day.
| dev0p wrote:
| LLMs need to sleep too. Do they dream of electric sheep?
| int_19h wrote:
| The hard part is coming up with a good way to grade results.
| Which you need to update the weights based on the outcome,
| otherwise the model will not actually learn anything.
| zoogeny wrote:
| For the "looping" I'm talking about you don't need to
| update the weights. It is simply, old context in, new
| context out, new context in, new-new context out, etc.
|
| Of course, keeping that coherent over numerous loops isn't
| going to be easy. No doubt there is a chance it goes off
| the rails. So you might have a section of context that is
| constantly stable, a section of context that updates each
| loop, etc.
|
| In the other response to my comment someone mentioned
| eventually updating the weights (e.g. daily) and you would
| in that case have to have some kind of loss function.
| int_19h wrote:
| Then I'm not quite sure what benefit you expect to derive
| from it? Making e.g. QwQ-32 loop isn't hard - it often
| does it all by itself, even. But it doesn't translate to
| improvements on every iteration; it just goes in circles.
| bloomingkales wrote:
| When you look out your eyes, that rectangular viewport is
| all your context. For example, we cannot fit Jupiter into
| your viewport.
|
| So, if we can never fit large concepts like Jupiter into
| your viewport (your context window), does it follow that
| you will never understand Jupiter? Or is there no way to
| take a picture of Jupiter and make it smaller so it fits
| into your viewport?
|
| See, coherence is just resizing or reimagining things so
| it fits into context. Context can never hold it all
| because we have the capacity to always imagine infinitely
| large things.
|
| So I don't fully know if it's old context out, new
| context in, but could be _just the right context, just
| enough of it, and just the right looking context so we
| can assemble the right worldview_. This process would
| have to go on infinitely, and that is the world as you
| see and understand it.
| gyomu wrote:
| > If an LLM output is like a sculpture, then we have to sculpt
| it. I never did sculpting, but I do know they first get the
| clay spinning on a plate.
|
| That's pottery, not sculpture. Traditionally in sculpture you
| start from a block of marble or wood, but you can also make
| sculptures of cast bronze or welded steel (or clay, but you
| don't use a spinning plate).
| bloomingkales wrote:
| Thank you for the clarification. I wanted to use some kind of
| visual to show the model in a loop. Otherwise, I'd just have
| to say explicitly that the sculptor is the one in the loop,
| as in the person will not stop chiseling. It's in this
| infinite chiseling that we get our answers (same thing as
| finding a limit in calculus as it approaches infinity, we
| will never get the discrete answer, but we will get
| infinitely close enough to label a discrete point
| confidently).
|
| In other words, we fly as close to the sun as possible and
| get our measurements :)
| incrudible wrote:
| Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.
|
| Wait, actually 1 + 1 equals 1.
| falcor84 wrote:
| As one with teaching experience, the idea of asking a student
| "are you sure about that?" is to get them to think more deeply
| rather than just blurting a response. It doesn't always work,
| but it generally does.
| latexr wrote:
| It works because the question itself is a hint born of
| knowledge. "Are you sure about that" is a polite way to say
| "that answer is wrong, try again". Students know that, so
| instead of doubling down will redo their work with the
| assumption they made a mistake. It is much rarer to ask the
| question when the answer is correct, and in fact doing so is
| likely to upset the learner because they had to redo the work
| for no reason.
|
| If you want a true comparison, start asking that question
| _every time_ and then compare. My hypothesis is students
| would start ignoring the prompt and answering "yes" every
| time to get on with it.
| ALittleLight wrote:
| At 6 dollars per run, I'm tempted to try to figure out how to
| replicate this. I'd like to try some alternatives to "wait" -
| e.g. "double checking..." Or write my own chains of thought.
| qup wrote:
| Like the ones they tested?
| ALittleLight wrote:
| Yes, that is what "replicate" with my own ideas means.
| kittikitti wrote:
| Thank you for this, I really appreciate this article and I
| learned a bunch!
| Aperocky wrote:
| For all the hype about thinking models, this feels much like
| compression in terms of information theory instead of a "takeoff"
| scenario.
|
| There is a finite amount of information stored in any large
| model, the models are really good at presenting the correct
| information back, and adding thinking blocks made the models even
| better at doing that. But there is a cap to that.
|
| Just like how you can compress a file by a lot, there is a
| theoretical maximum to the amount of compression before it starts
| becoming lossy. There is also a theoretical maximum of relevant
| information from a model regardless of how long it is forced to
| think.
| psadri wrote:
| I think an interesting avenue to explore is creating
| abstractions and analogies. If a model can take a novel
| situation and create an analogy to one that it is familiar
| with, it would expand its "reasoning" capabilities beyond its
| training data.
| zoogeny wrote:
| I think this is probably accurate and what remains to be seen
| is how "compressible" the larger models are.
|
| The fact that we can compress a GPT-3 sized model into an o1
| competitor is only the beginning. Maybe there is even more
| juice to squeeze there?
|
| But even more, how much performance will we get out of o3 sized
| models? That is what is exciting since they are already
| performing near PhD level on most evals.
| jedbrooke wrote:
| my thinking (hope?) is that the reasoning models will be more
| like how a calculator doesn't have to "remember" all the
| possible combinations of addition, multiplication, etc for all
| the numbers, but can actually compute the results.
|
| As reasoning improves the models could start with a basic set
| of principles and build from there. Of course for facts
| grounded in reality RAG would still likely be the best, but
| maybe with enough "reasoning" a model could simulate an
| approximation of the universe well enough to get to an answer.
| hidelooktropic wrote:
| > I doubt that OpenAI has a realistic path to preventing or even
| detecting distealing outside of simply not releasing models.
|
| Couldn't they just start hiding the thinking portion?
|
| It would be easy for them to do this. Currently, they already
| provide one-sentence summaries for each step of the thinking. I
| think users would be fine, or at least stay, if it were changed
| to provide only that.
| Aperocky wrote:
| They hid it and DeepSeek came up with R1 anyway, with RL on
| only results and not even needing any of the thinking tokens
| that OpenAI hid.
| girvo wrote:
| Which is still the funniest and most interesting result in AI
| so far IMO. Fascinating, but sort of makes intuitive sense
| too!
| mtrovo wrote:
| I found the discussion around inference scaling with the 'Wait'
| hack so surreal. The fact that such an ingeniously simple method
| can impact performance makes me wonder how much low-hanging fruit
| we're still missing. So weird to think that improvements in a
| branch of computer science boil down to conjuring the right
| incantation words. How do you even change your mindset to start
| thinking this way?
| nyoomboom wrote:
| I think a skill here is learning a bias for experimentation and
| accepting the results one finds. Also the book "Why Greatness
| Cannot Be Planned" showcases the kind of open ended play that
| results in people discovering stuff like this.
| cubefox wrote:
| Now imagine where we will be 12 months from now. This article
| from February 5 2025 will feel quaint by then. The acceleration
| keeps increasing. It seems likely we will soon have recursive
| self-improving AI -- reasoning models which do AI research.
| This will accelerate the rate of acceleration itself. It sounds
| stupid to say it, but yes, the singularity is near. Vastly
| superhuman AI now seems to arrive within the next few years.
| Terrifying.
| gom_jabbar wrote:
| Yes, and Accelerationism predicted this development back in
| the 1990s, perhaps most prominently in the opening lines of
| Nick Land's _Meltdown_ (1994) text: [[ ]] The
| story goes like this: Earth is captured by a technocapital
| singularity as renaissance rationalization and oceanic
| navigation lock into commoditization take-off. Logistically
| accelerating techno-economic interactivity crumbles social
| order in auto-sophisticating machine runaway. As markets
| learn to manufacture intelligence, politics modernizes,
| upgrades paranoia, and tries to get a grip.
|
| > reasoning models which do AI research
|
| In the introduction to my research project on Accelerationism
| [0], I write: Faced with the acceleration of
| progress in Artificial Intelligence (AI) -- with AI agents
| now automating AI research and development --,
| Accelerationism no longer seems like an abstract philosophy
| producing empty hyperstitional hype, but like a sober
| description of reality. The failed 2023 memorandum to stop AI
| development on systems more powerful than OpenAI's ChatGPT-4
| perfectly illustrates the phenomenological aspects of
| Accelerationism: "To be rushed by the phenomenon, to the
| point of terminal institutional paralysis, is the
| phenomenon." [1]
|
| At the current rate of acceleration, if you don't write
| hyperstitionally, your texts are dead on arrival.
|
| [0] https://retrochronic.com/
|
| [1] Nick Land (2017). _A Quick-and-Dirty Introduction to
| Accelerationism_ in _Jacobite Magazine._
| versteegen wrote:
| Nice. Though I couldn't understand those "opening lines"
| until I read in your Introduction:
|
| > For Land, capitalism begins in Northern Italy around 1500
| with "the emerging world of technologists and accountants",
| the spiral interexcitation of "oceanic navigation and
| place-value calculation", and zero-unlocked double-entry
| book-keeping
|
| Fibonacci, amongst many others, played a critical role in that
| highly accelerative technology.
| pizza wrote:
| Hope we get the Nick Land the younger, and not Nick Land
| the elder, set of outcomes. Somewhere, sometime, along the
| way, it seems like everything from CCRU and Duginism leapt
| out of the page into the real. Maybe it's just the
| beginning of the Baudrillardian millennium.
| zoogeny wrote:
| This is something I have been suppressing since I don't want
| to become chicken little. Anyone who isn't terrified by the
| last 3 months probably doesn't really understand what is
| happening.
|
| I went from accepting I wouldn't see a true AI in my
| lifetime, to thinking it is possible before I die, to
| thinking it is possible in the next decade, to thinking it
| is probably in the next 3 years to wondering if we might see
| it this year.
|
| Just 6 months ago people were wondering if pre-training was
| stalling out and if we hit a wall. Then deepseek drops with
| RL'd inference time compute, China jumps from being 2 years
| behind in the AI race to being neck-and-neck and we're all
| wondering what will happen when we apply those techniques to
| the current full-sized behemoth models.
|
| It seems the models that are going to come out around summer
| time may be jumps in capability beyond our expectations. And
| the updated costs means that there may be several open source
| alternatives available. The intelligence that will be
| available to the average technically literate individual will
| be frightening.
| palmotea wrote:
| > The intelligence that will be available to the average
| technically literate individual will be frightening.
|
| That's not the scary part. The scary part is the
| intelligence _at scale_ that could be available to the
| _average employer_. Lots of us like to LARP that we're
| capitalists, but _very_ few of us are. There's zero
| ideological or cultural framework in place to prioritize
| the well being of the general population over the profits
| of _some_ capitalists.
|
| AI, especially accelerating AI, is bad news for anyone who
| needs to work for a living. It's not going to lead to a
| Star Trek fantasy. It means an eventual phase change for
| the economy that consigns us (and most consumer product
| companies) to wither and fade away.
| 101008 wrote:
| I agree with you and I am scared. My problem is: if most
| people can't work, who is going to pay for the
| product/services created with AI?
|
| I get a lot of "AI will allow us to create SaaS in a
| weekend" and "AI will take engineers' jobs", which I think
| may both be true. But a lot of SaaS survive because
| engineers pay for them -- if engineers don't exist
| anymore, a lot of SaaS won't either. If you eat your
| potential customers, creating quick SaaS doesn't make
| sense anymore (yeah, there are exceptions, etc., I know).
| immibis wrote:
| Those people will simply be surplus to requirements.
| They'll be left alone as long as they don't get in the
| way of the ruling class, and disposed of if they do. As
| usual in history.
| lodovic wrote:
| That's a fallacy. You can't have an advanced economy with
| most people sitting on the side. Money needs to keep
| flowing. If all that remains of the economy consists of a
| few datacenters talking to each other, how can the ruling
| class profit off that?
| palmotea wrote:
| > You can't have an advanced economy with most people
| sitting on the side.
|
| If AI lives up to the hype, that will become possible.
|
| > If all that remains of the economy consists of a few
| datacenters talking to each other, how can the ruling
| class profit off that?
|
| I don't think it would be that. There'd also be power
| generation, manufacturing, mining, and construction,
| etc.; but all extremely automated. If you get to truly
| extreme levels of wealth concentration, things would
| shift out of our capitalist market system model, and
| concepts like "profit" would become anachronisms.
|
| It actually might kinda look like the "economy" of
| Starcraft: you gather resources, decide what to build
| with them, and order it all around according to your
| whim. There will be a handful of guys playing, and
| everyone else will be a NPC.
| andreasmetsala wrote:
| > It actually might kinda look like the "economy" of
| Starcraft: you gather resources, decide what to build
| with them, and order it all around according to your
| whim. There will be a handful of guys playing, and
| everyone else will be a NPC.
|
| I guess if the "players" are sociopathic enough they
| might decide to just wipe out the NPCs. The possibility
| of someone like Putin or Musk becoming the sole member of
| the post-singularity humanity does make me pause.
| cubefox wrote:
| That's assuming people like Altman can keep artificial
| superintelligence under human control. It very well may
| escape control and humanity would be disempowered
| forever. Or worse, wiped out.
| danans wrote:
| > Money needs to keep flowing. If all that remains of the
| economy consists of a few datacenters talking to each
| other, how can the ruling class profit off that?
|
| Plenty of profit was made off feudalism, and
| technofeudalism has all the tools of modern technology at
| its disposal. If things go in that direction, they will
| have an unlimited supply of serfs desperate for whatever
| human work/livelihood is left.
| soco wrote:
| Unlimited supply yes, but highly limited usage for them.
| So even if a few will work for free, the rest will be
| starving, and angry.
| palmotea wrote:
| > My problem is: if most people can't work, who is going
| to pay for the product/services created with IA?
|
| A lot of those will probably go under, too. I think a lot
| of people are in for a rude awakening.
|
| The only people our society and economy really values are
| the elite with ownership and control, and the people who
| get to eat and have comfort are those who provide things
| that are directly or indirectly valuable to that elite.
| AI will enable a game of musical chairs, with economic
| participants iteratively eliminated as the technology
| advances, until there are only a few left controlling
| vast resources and capabilities, to be harnessed for
| personal whims. The rest of us will be like rats in a
| city, scraping by on the margins, unwanted, out of sight,
| subsisting on scraps, perhaps subject to "pest control"
| regimes.
| nazgul17 wrote:
| This is the same conclusion I can't help but reach. I
| would love nothing more than to be convinced that (there
| is a chance that) that is not going to happen.
| kortilla wrote:
| > The only people our society and economy really values
| are the elite with ownership and control
|
| This isn't true. The biggest companies are all rich
| because they cater to the massive US middle class. That's
| where the big money is at.
| palmotea wrote:
| > This isn't true. The biggest companies are all rich
| because they cater to the massive US middle class..
|
| It _is_ true, but I can see why you'd be confused. Let
| me ask you this: if members of the "the massive US middle
| class" can be replaced with automation, are those
| companies going 1) to keep paying those workers to
| support the middle-class demand which made them rich, or
| are they going to 2) fire them so more money can be
| shoveled up to the shareholders?
|
| The answer is obviously #2, which has been proven time
| and again (e.g. how we came to have "the Rust Belt").
|
| > That's where the big money is at
|
| _Now_ , but not necessarily in the future. I think AI
| (if it doesn't hit a wall) will change that, maybe not
| instantaneously, but over time.
| soco wrote:
| So you end up with a huge starving mob trying to come all
| over your mansions and islands. I somehow think Musk is
| totally capable of nuking those mobs, or unleashing the
| (future) AI dogs over them, because the mob cannot
| produce anymore (because of AI) and cannot pay anymore
| (because no jobs because of AI). So the mob will be
| totally worthless to this style of "capitalism". Really
| why would they bother with UBI when they can let the mob
| just die out?
| andreasmetsala wrote:
| > AI, especially accelerating AI, is bad news for anyone
| who needs to work for a living. It's not going to lead to
| a Star Trek fantasy. It means an eventual phase change
| for the economy that consigns us (and most consumer
| product companies) to wither and fade away.
|
| How would that work? If there are no consumers then why
| even bother producing? If the cost of labor and capital
| trends towards zero then the natural consequence is
| incredible deflation. If the producers refuse to lower
| their prices then they either don't participate in the
| market (which also means their production is pointless)
| or ensure some other way that the consumers can buy their
| products.
|
| Our society isn't really geared for handling double digit
| deflation so something does need to change if we really
| are accelerating exponentially.
| palmotea wrote:
| > How would that work? If there are no consumers then why
| even bother producing?
|
| Whim and ego. I think the advanced economy will shift to
| supporting trillionaires doing things _like_ "DIY home
| improvement" for themselves. They'll own a bunch of
| automated resources (power generation, mining,
| manufacturing, AI engineers), and use it to do whatever
| they want. Build pyramids on the moon, while the now
| economically-useless former middle-class laborers shiver
| in the cold? Sure, why not?
| ben_w wrote:
| > If there are no consumers then why even bother
| producing?
|
| > If the producers refuse to lower their prices then they
| either don't participate in the market (which also means
| their production is pointless) or ensure some other way
| that the consumers can buy their products.
|
| Imagine you're a billionaire with a data centre and
| golden horde of androids.
|
| You're the consumer, the robots make stuff for you; they
| don't make stuff for anyone else, just you, in the same
| way and for the same reason that your power tools and
| kitchen appliances don't commute to work -- you could, if
| you wanted, lend them to people, just like those other
| appliances, but you'd have to actually choose to, it
| wouldn't be a natural consequence of the free market.
|
| Their production is, indeed, pointless. This doesn't help
| anyone else eat. The moment anyone can afford to move
| from "have not" to "have", they drop out of the demand
| market for everyone else's economic output.
|
| I don't know how big the impact of dropping out would be:
| the right says "trickle down economics" is good and this
| would be the exact opposite of that; while the left
| criticism's of trickle-down economics is that in practice
| the super-rich already have so much stuff that making
| them richer doesn't enrich anyone else who might service
| them, so if the right is correct then this is bad but if
| the left is correct then this makes very little
| difference.
|
| Unfortunately, "nobody knows" is a great way to get a
| market panic all by itself.
| luc4sdreyer wrote:
| That is assuming the accelerating AI stays under human
| control.
|
| We're racing up a hill at an ever-increasing speed, and
| we don't know what's on the other side. Maybe 80% chance
| that it's either nothing or "simply" a technological
| revolution.
| pjc50 wrote:
| This frightens mostly people whose identity is built around
| "intelligence", but without grounding in the real world.
| I've yet to see really good articulations of what,
| precisely, we should be scared of.
|
| Bedroom superweapons? Algorithmic propaganda? These things
| have humans in the loop building them. And the problem of
| "human alignment" is one unsolved since Cain and Abel.
|
| AI _alone_ is words on a screen.
|
| The sibling thread details the "mass unemployment"
| scenario, which would be destabilizing, but understates how
| much of the current world of work is still physical. It's a
| threat to pure desk workers, but we're not the majority of
| the economy.
|
| Perhaps there will be political instability, but .. we're
| already there from good old humans.
| danans wrote:
| > without grounding in the real world.
|
| > I've yet to see really good articulations of what,
| precisely we should be scared of. Bedroom superweapons?
|
| Loss of paid employment opportunities and increasing
| inequality are real world concerns.
|
| UBI isn't coming by itself.
| pjc50 wrote:
| Sure, but those are also real world concerns in the non-
| AI alternate timeline. As is the unlikelihood of UBI.
| danans wrote:
| Yes, but they are likely dramatically accelerated in the
| AI timeline.
| mvieira38 wrote:
| Worst case scenario humans mostly go back to manual
| labor, which would fix a lot of modern day ailments such
| as obesity and (some) mental health struggles, with added
| enormous engineering advancements based on automatic
| research.
| ben_w wrote:
| > This frightens mostly people whose identity is built
| around "intelligence", but without grounding in the real
| world.
|
| It has certainly had this impact on my identity; I am
| unclear how well-grounded I really am*.
|
| > I've yet to see really good articulations of what,
| precisely we should be scared of.
|
| What would such an articulation look like, given you've
| not seen it?
|
| > Bedroom superweapons? Algorithmic propaganda? These
| things have humans in the loop building them.
|
| Even with current limited systems -- which are not purely
| desk workers, they're already being connected to and
| controlling robots, even by amateurs -- AI lowers the
| minimum human skill level needed to do those things.
|
| The fear is: how far are we from an AI that _doesn't_
| need a human in the loop? Because ChatGPT was almost
| immediately followed by ChaosGPT, and I have every reason
| to expect people to continue to make clones of ChaosGPT
| continuously until one is capable of actually causing
| harm. (As with 3d-printed guns, high chance the first
| ones will explode in the face of the user rather than the
| target).
|
| I hope we're years away, just as self driving cars turned
| out to be over-promised and under-delivered for the last
| decade -- even without a question of "safety", it's going
| to be hard to transition the world economy to one where
| humans need not apply.
|
| > And the problem of "human alignment" is one unsolved
| since Cain and Abel.
|
| Yes, it is unsolved since time immemorial.
|
| This has required us to not only write laws, but also
| design our societies and institutions such that humans
| breaking laws doesn't make everything collapse.
|
| While I dislike the meme "AI == crypto", one overlap is
| that both have nerds speed-running discovering how
| legislation works and why it's needed -- for crypto,
| specifically financial legislation after it explodes in
| their face; for AI, to imbue the machine with a reason to
| approximate society's moral code, because they see the
| problem coming.
|
| --
|
| * Dunning Kruger applies; and now I have first-hand
| experience of what this feels like from the inside, as my
| self-perception of how competent I am at German has
| remained constant over 7 years of living in Germany and
| improving my grasp of the language the entire time.
| zoogeny wrote:
| Some of the scariest horror movies are the ones where the
| monster isn't shown. Often once the monster is shown, it
| is less terrifying.
|
| In a general sense, uncertainty causes anxiety. Once you
| know the properties of the monster you are dealing with
| you can start planning on how to address it.
|
| Some people have blind and ignorant confidence. A feeling
| they can take on literally anything, no matter how
| powerful. Sometimes they are right, sometimes they are
| wrong.
|
| I'm reminded by the scene in No Country For Old Men where
| the good guy bad-ass meets the antagonist and immediately
| dies. I have little faith in blind confidence.
|
| edit: I'll also add that human adaptability (which is
| probably the trait on which most confidence in humans rests)
| has shown itself capable of saving us from many previous
| civilization changing events. However, this change with
| AI is happening much, much faster than any before it. So
| part of the anxiety is whether or not our species
| reaction time is enough to avoid the cliff we are
| accelerating towards.
| koala_man wrote:
| It feels like we're back in 1900 when anyone's clever idea (and
| implementation) can give huge performance improvements, such as
| Ford's assembly line and Taylor's scientific management of
| optimizing shovel sizes for coal.
| andrewfromx wrote:
| yes, it also feels like we are going to lose our just-in-time
| global shipments of anything to anywhere any day now. It will
| soon feel like 1900 in other ways.
| BobbyTables2 wrote:
| We'll have to raise our own chickens too...
| eru wrote:
| Hope we don't get 1914 again, too.
| xg15 wrote:
| I think the fact alone that distillation and quantization are
| techniques that can produce substantial improvements is a
| strong sign that we still have no real comprehensive
| understanding how the models work.
|
| If we had, there would be no reason to train a model with more
| parameters than are strictly necessary to represent the space's
| semantic structure. But then it should be impossible for
| distilled models with fewer parameters to come close to the
| performance of the original model.
|
| Yet this is what happens - the distilled or quantized models
| often come very close to the original model.
|
| So I think there are still many low-hanging fruits to pick.
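|
| For what it's worth, the core distillation objective is tiny. A
| sketch of the usual soft-label KL loss between a teacher and a
| smaller student (plain Python over toy probabilities; temperature
| scaling and the real training loop are omitted):
|
|     import math
|
|     def kl_divergence(p, q):
|         """KL(p || q) for two distributions over the same vocab."""
|         return sum(pi * math.log(pi / qi)
|                    for pi, qi in zip(p, q) if pi > 0)
|
|     # Toy next-token distributions over a 4-token vocabulary.
|     teacher = [0.70, 0.20, 0.05, 0.05]  # big model's soft labels
|     student = [0.40, 0.35, 0.15, 0.10]  # small model's current guess
|
|     # The distillation loss pushes the student toward the teacher's
|     # whole distribution, not just its argmax -- that extra signal
|     # is part of why a smaller model can get surprisingly close.
|     print(kl_divergence(teacher, student))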
| teruakohatu wrote:
| > still have no real comprehensive understanding how the
| models work.
|
| We do understand how they work, we just have not optimised
| their usage.
|
| For example someone who has a good general understanding of
| how an ICE or EV car works. Even if the user interface is
| very unfamiliar, they can figure out how to drive any car
| within a couple of minutes.
|
| But that does not mean they can race a car, drift a car or
| drive a car on challenging terrain even if the car is
| physically capable of all these things.
| spiorf wrote:
| We know how the next token is selected, but not why doing
| that repeatedly brings all the capabilities it does. We
| really don't understand how the emergent behaviours emerge.
| Valgrim wrote:
| It feels less like a word prediction algorithm and more
| like a world model compression algorithm. Maybe we tried
| to create one and accidentally created the other?
| codeulike wrote:
| Its almost like a Model of Language, but very Large
| bloomingkales wrote:
| Why would asking a question about ice cream trigger a
| consideration about all possible topics? As in, to
| formulate the answer, the LLM will consider the origin of
| Elephants even. It won't be significant, but it will be
| factored in.
|
| Why? In the spiritual realm, many postulated that even
| the Elephant you never met is part of your life.
|
| None of this is a coincidence.
| gessha wrote:
| Your example is somewhat inadequate. We _fundamentally_
| don't understand how deep learning systems work in the
| sense that they are more or less black boxes that we train
| and evaluate. Innovations in ML are a whole bunch of
| wizards with big stacks of money changing "Hmm" to "Wait"
| and seeing what happens.
|
| Would a different sampler help you? I dunno, try it. Would
| a smaller dataset help? I dunno, try it. Would training the
| model for 5000 days help? I dunno, try it.
|
| Car technology is the opposite of that - it's a white box.
| It's composed of very well defined elements whose
| interactions are defined and explained by laws of
| thermodynamics and whatnot.
| brookst wrote:
| Isn't that just scale? Even small LLMs have more parts
| than any car.
|
| LLMs are more analogous to economics, psychology,
| politics -- it is possible there's a core science with
| explicability, but the systems are so complex that even
| defining the _question_ is hard.
| ChymeraXYZ wrote:
| Could be, but it does not change the fact that we do not
| understand them as of now.
| gessha wrote:
| You can make a bigger ICE engine (like a container ship
| engine) and still understand how the whole thing works.
| Maybe there's more parts moving but it still has the
| structure of an ICE engine.
|
| With neural networks big or small, we got no clue what's
| going on. You can observe the whole system, from the
| weights and biases, to the activations, gradients, etc
| and still get nothing.
|
| On the other hand, one of the reasons why economics,
| psychology and politics are hard is because we can't open
| up people's heads and define and measure what they're
| thinking.
| ijk wrote:
| One way I've heard it summarized: Computer Science as a
| field is used to things being like physics or chemistry,
| but we've suddenly encountered something that behaves
| more like biology.
| bloomingkales wrote:
| Physics and biology are just ways to measure our creator.
| It gives us a sense of control because, hey, we can
| measure it, predict it, and inevitably wield it (become
| god).
|
| Small thought experiment, if God shitted you out into a
| toilet, and you spent your waking life floating in a
| toilet measuring the orbit of floating chunks of shit
| with a giant telescope, all you would have measured and
| learned is the worthless dynamics of shit floating
| around. That's roughly the implication of science when it
| comes to the divine (or generally a reiteration of Hume's
| problem with induction).
|
| So now we are given some insight into our brain and we
| are finding that it is immeasurable. Infinity _within_
| us, this whole time you saw God as outside you, yet this
| technology is showing you infinity. An LLM and your
| imagination, both can dream infinitely.
|
| Infinity is a characteristic of God, and some scientists
| can't do science without wholesale rejecting this
| hypothesis. Therefore no good scientific experiment (and
| discoveries) will emerge in this field without forming a
| better hypothesis (asking the right question).
|
| Hypothesis:
|
| We have found god in one of its undeniable forms:
| infinite imagination.
|
| Hypothesis:
|
| A perfect being is not limited by context, it considers
| all things in the model for all prompts. The concept of
| all of us being connected, a yin and yang, that every
| little thing we do is part of a bigger thing, this all
| follows.
| red1reaper wrote:
| "God" as a concept in unproven to exist, it is also
| impossible to prove, so for all intents and porpouses it
| doesn't exist.
| raducu wrote:
| > _fundamentally_ don't understand how deep learning
| systems work.
|
| It's like saying we don't understand how quantum
| chromodynamics works. Very few people do, and it's the
| kind of knowledge not easily distilled for the masses in
| an easily digestible, popsci way.
|
| Look into how older CNNs work -- we have very good
| visual/accessible/popsci materials on how they work.
|
| I'm sure we'll have that for LLMs, but it's not worth it to
| the people who can produce that kind of material to
| produce it now when the field is moving so rapidly, those
| people's time is much better used in improving the LLMs.
|
| The kind of progress being made leads me to believe there
| absolutely ARE people who absolutely know how the LLMs
| work and they're not just a bunch of monkeys randomly
| throwing things at GPUs and seeing what sticks.
| gessha wrote:
| As a person who has trained a number of computer vision
| deep networks, I can tell you that we have some cool-
| looking visualizations on how lower layers work but no
| idea how later layers work. The intuition is built over
| training numerous networks and trying different
| hyperparameters, data shuffling, activations, etc. It's
| absolutely brutal over here. If the theory was there,
| people like Karpathy who have great teacher vibes
| would've explained it for the mortal grad students or
| enthusiast tinkerers.
|
| > The kind of progress being made leads me to believe
| there absolutely ARE people who absolutely know how the
| LLMs work and they're not just a bunch of monkeys
| randomly throwing things at GPUs and seeing what sticks
|
| I say this less as an authoritative voice but more as an
| amused insider: Spend a week with some ML grad students
| and you will get a chuckle whenever somebody says we're
| not some monkeys throwing things at GPUs.
| bloomingkales wrote:
| It may be as simple as this:
|
| https://youtube.com/shorts/7GrecDNcfMc
|
| Many many layers of that. It's not a profound mechanism.
| We can understand how _that_ works, but we're dumbfounded
| how such a small mechanism is responsible for all this
| stuff going on inside a brain.
|
| I don't think it's that we don't understand; it's a level
| beyond that. We can't fathom the implications: that it
| could be that simple, just scaled up.
| ClumsyPilot wrote:
| > Many many layers of that. It's not a profound mechanism
|
| Bad argument. Cavemen understood stone, but they could
| not build the aqueducts. Medieval people understood iron,
| water and fire, but they could not make a steam engine.
|
| Finally, we understand protons, electrons, and neutrons
| and the forces that govern them, but it does not mean
| we understand everything they could possibly make.
| bloomingkales wrote:
| _" Cavemen understood stone"_
|
| How far removed are you from a caveman is the better
| question. There would be quite some arrogance coming out
| of you to suggest the several million years gap is
| anything but an instant in the grand timeline. As in, you
| understood stone just yesterday ...
|
| The monkey that found the stone is the monkey that built
| the cathedral. It's only a delusion the second monkey
| creates to separate it from the first monkey (a feeling
| of superiority, with the only tangible asset being "a
| certain amount of notable time passed since point A and
| point B").
|
| _" Finally we understand protons, electrons, and
| neutrons and the forces that government them but it does
| not mean we understand everything they could mossibly
| make"_
|
| You and I agree. That those simple things can truly
| create infinite possibilities. That's all I was saying,
| we cannot fathom it (either because infinity is hard to
| fathom, or that its origins are humble - just a few core
| elements, or both, or something else).
|
| Anyway, this discussion can head in any direction.
| ClumsyPilot wrote:
| > The kind of progress being made leads me to believe
| there absolutely ARE people who absolutely know how the
| LLMs work
|
| Just like alchemists made enormous strides in chemistry,
| but their goal was to turn piss into gold.
| adamc wrote:
| The "Wait" vs. "Hmm" discussion in the paper does not
| suggest we know how they work. If we _knew_ , we wouldn't
| have to try things and measure to figure out the best
| prompt.
| pertymcpert wrote:
| For quantization I don't think that's really true.
| Quantization is just making more efficient use of bits in
| memory to represent numbers.
| ZeljkoS wrote:
| We have a partial understanding of why distillation works--it
| is explained by The Lottery Ticket Hypothesis
| (https://arxiv.org/abs/1803.03635). But if I am understanding
| correctly, that doesn't mean you can train a smaller network
| from scratch. You need a lot of randomness in the initial
| large network, for some neurons to have "winning" states.
| Then you can distill those winning subsystems to a smaller
| network.
|
| Note that a similar process happens in the human brain; it
| is called synaptic pruning
| (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant
| quote from Wikipedia
| (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has
| been estimated that the brain of a three-year-old child has
| about 10^15 synapses (1 quadrillion). This number declines
| with age, stabilizing by adulthood. Estimates vary for an
| adult, ranging from 10^14 to 5x10^14 synapses (100 to 500
| trillion)."
| 3abiton wrote:
| So more 'mature' models might arise in the near future with
| fewer params and better benchmarks?
| raducu wrote:
| "Better", but not better than the model they were
| distilled from, at least that's how I understand it.
| salemba wrote:
| I think this is how the "child brain" works too. The
| better the parents and the environment are, the better
| the child's development is :)
| andreasmetsala wrote:
| They might also be more biased and less able to adapt to
| new technology. Interesting times.
| coder543 wrote:
| That's been happening consistently for over a year now.
| Small models today are better than big models from a year
| or two ago.
| Arthur_ODC wrote:
| So, can a distilled 8B model (say, the Deepseek-R1-Distil-
| Llama-8B or whatever) be "trained up" to a higher-parameter
| 16B model after distillation from a superior model, or is
| it forever stuck at the 8B parameters, which can only be
| fine-tuned?
| MR4D wrote:
| I like the analogy of compression, in that a distilled model
| of an LLM is like a JPEG of a photo. Pretty good, maybe very
| good, but still lossy.
|
| The question I hear you raising seems to be along the lines
| of, can we use a new compression method to get better
| resolution (reproducibility of the original) in a much
| smaller size.
| umeshunni wrote:
| > in that a distilled model of an LLM is like a JPEG of a
| photo
|
| That's an interesting analogy, because I've always thought
| of the hidden states (and weights and biases) of an LLM as
| a compressed version of the training data.
| kedarkhand wrote:
| Well, a JPEG can be thought of as a compression of the
| natural world whose photograph was taken
| bloomingkales wrote:
| And we can answer the question of why quantization works
| with a lossy format, since quantization just drops
| accuracy for space but still gives us a _good enough_
| output, just like a lossy jpeg.
|
| To reiterate, we can lose a lot of data (have
| incomplete data) and have a perfectly visible jpeg (or
| MP3, same thing).
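|
| A toy sketch of that trade-off (symmetric int8 quantization, not
| any particular library's scheme): keep one float scale per tensor,
| round everything else down to a byte, and accept a small error.
|
|     import numpy as np
|
|     def quantize_int8(w):
|         scale = np.abs(w).max() / 127.0           # one float kept
|         q = np.round(w / scale).astype(np.int8)   # 1 byte per weight
|         return q, scale
|
|     w = np.random.randn(4096).astype(np.float32)
|     q, s = quantize_int8(w)
|     err = np.abs(w - q.astype(np.float32) * s).mean()
|     print(f"4 bytes -> 1 byte per weight, mean abs error ~{err:.4f}")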
| homarp wrote:
| hence https://www.newyorker.com/tech/annals-of-
| technology/chatgpt-... (by Ted Chiang)
|
| (discussed here:
| https://news.ycombinator.com/item?id=34724477 )
| timschmidt wrote:
| And what is compression but finding the minimum amount of
| information required to reproduce a phenomenon? I.e.
| discovering natural laws.
| t_mann wrote:
| Finding minimum complexity explanations isn't what
| finding natural laws is about, I'd say. It's considered
| good practice (Occam's razor), but it's often not really
| clear what the minimal model is, especially when a theory
| is relatively new. That doesn't prevent it from being a
| natural law, the key criterion is predictability of
| natural phenomena, imho. To give an example, one could
| argue that Lagrangian mechanics requires a smaller set of
| first principles than Newtonian, but Newton's laws are
| still very much considered natural laws.
| timschmidt wrote:
| Maybe I'm just a filthy computationalist, but the way I
| see it, the most accurate model of the universe is the
| one which makes the most accurate predictions with the
| fewest parameters.
|
| The Newtonian model makes provably less accurate
| predictions than Einsteinian (yes, I'm using a different
| example), so while still useful in many contexts where
| accuracy is less important, the number of parameters it
| requires doesn't much matter when looking for the one
| true GUT.
|
| My understanding, again as a filthy computationalist, is
| that an accurate model of the real bonafide underlying
| architecture of the universe will be the simplest
| possible way to accurately predict anything. With the
| word "accurately" doing all the lifting.
|
| As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/
| Nature%20of%20...
|
| I'm sure there are decreasingly accurate, but still
| useful, models all the way up the computational
| complexity hierarchy. Lossy compression is, precisely,
| using one of them.
| t_mann wrote:
| The thing is, Lagrangian mechanics makes exactly the same
| predictions as Newtonian, and it starts from a
| foundation of just one principle (least action) instead
| of three laws, so it's arguably a sparser theory. It just
| makes calculations easier, especially for more complex
| systems, that's its raison d'etre. So in a world where we
| don't know about relativity yet, both make the best
| predictions we know (and they always agree), but Newton's
| laws were discovered earlier. Do they suddenly stop being
| natural laws once Lagrangian mechanics is discovered?
| Standard physics curricula would not agree with you btw,
| they practically always teach Newtonian mechanics first
| and Lagrangian later, also because the latter is
| mathematically more involved.
| timschmidt wrote:
| > Do they suddenly stop being natural laws once
| Lagrangian mechanics is discovered?
|
| Not my question to answer, I think that lies in
| philosophical questions about what is a "law".
|
| I see useful abstractions all the way down. The linked
| Asimov essay covers this nicely.
| dragonwriter wrote:
| Laws (in science, not government) are just a relationship
| that is consistently observed, so Newton's laws remain
| laws until contradictions are observed, regardless of
| the existence of one or more alternative models which
| would predict them to hold.
|
| The kind of Occam's Razor-ish rule you seem to be trying
| to query about is basically a rule of thumb for selecting
| among formulations of equal observed predictive power
| that are not strictly equivalent (that is, if they
| predict exactly the same actually observed phenomenon
| instead of different subsets of subjectively equal
| importance, they still differ in predictions which have
| not been testable), whereas Newtonian and Lagrangian
| mechanics are different formulations that are strictly
| equivalent, which means you may choose between them _for
| pedagogy or practical computation_ , but you can't choose
| between them for _truth_ because the truth of one implies
| the truth of the other, in either direction; they are
| exactly the same in substance, differing only in
| presentation.
|
| (And even where it applies, its just a rule of thumb to
| reject complications until they are observed to be
| necessary.)
| ziofill wrote:
| What you say makes sense, but is there the possibility that
| because it's compressed it can generalize more? In the
| spirit of bias/variance.
| cmgriffing wrote:
| This brings up an interesting thought too. A photo is just
| a lossy representation of the real world.
|
| So it's lossy all the way down with LLMs, too.
|
| Reality > Data created by a human > LLM > Distilled LLM
| cztomsik wrote:
| Nope, it's quite obvious why distillation works. If you just
| predict the next token, then the only information you can use to
| compute the loss is THE expected token. Whereas if you
| distill, you can also use (typically few) logits from the
| teacher.
|
| "My name is <?>" without distillation has only one valid
| answer (from the dataset) and everything else is wrong.
|
| Whereas with distillation, you get lots of other names too
| (from the teacher), and you can add some weight to them too.
| That way, the model learns faster, because it gets more
| information in each update.
|
| (So instead of "My name is Foo", the model learns "My name is
| <some name, but in this case Foo>")
| ascorbic wrote:
| I've noticed that R1 says "Wait," a lot in its reasoning. I
| wonder if there's something inherently special in that token.
| lionkor wrote:
| Semantically, wait is a bit of a stop-and-breathe point.
|
| Consider the text:
|
| I think I'll go swimming today. Wait, ___
|
| what comes next? Well, not something that would usually
| follow without the word "wait", probably something entirely
| orthogonal that impacts the earlier sentence in some
| fundamental way, like:
|
| Wait, I need to help my dad.
| ascorbic wrote:
| Yes, R1 seems to mostly use it like that. It's either to
| signal a problem with its previous reasoning, or if it's
| thought of a better approach. In coding it's often
| something like "this API won't work here" or "there's a
| simpler way to do this".
| katzenversteher wrote:
| I bet a token like "sh*t!", "f**k!" or "damn!" would have the
| same or even stronger effect, but the LLM creators would not
| like to have the users read them
| lodovic wrote:
| I think you're onto something, however, as the training is
| done on text and not on actual thoughts, it may take
| some experimentation to find these stronger words.
| ascorbic wrote:
| Maybe, but it doesn't just use it to signify that it's made
| a mistake. It also uses it in a positive way, such as it's
| had a lightbulb moment. Of course some people use
| expletives in the same way, but that would be less common
| than for mistakes.
| raducu wrote:
| It's literally in the article: they measured it, and "Wait"
| was the best token.
| cyanydeez wrote:
| it's fascinating how certain political movements avoid that Wait
| moment...
| kevin009 wrote:
| There are more than 10 different ways that I know for sure will
| improve LLMs just like `wait`. It is part of the CoT. I assume
| most researchers know this. CoT is as old as 2019.
| Melatonic wrote:
| Mind elaborating ?
| kristianp wrote:
| Chain of thought (CoT)?
| lostmsu wrote:
| Hm, I am surprised that people who are presumably knowledgeable
| about how attention works are surprised by this. The more tokens
| in the output, the more computation the model is able to do
| overall. Back in September, when I was testing my iOS hands-
| free voice AI prototype that was powered by an 8B LLM, when I
| wanted it to give really thoughtful answers to philosophical
| questions, I would instruct it to output several hundred
| whitespace characters (because they are not read aloud) before
| the actual answer.
|
| What I am more surprised about is why models actually seem to
| have to produce "internal thoughts" instead of random tokens.
| Maybe during training, having completely random tokens in the
| thinking section derailed the model's thought process in the same
| way background noise can derail ours?
| deadbabe wrote:
| I mean the "wait" thing is obvious if you've ever asked an LLM
| to look at its own response and ask if it's really sure about
| its answer.
| rgovostes wrote:
| > a branch of computer science
|
| It should be considered a distinct field. At some level there
| is overlap (information theory, Kolmogorov complexity, etc.),
| but prompt optimization and model distillation are far removed
| from computability, formal language theory, etc. The analytical
| methods, the techniques to create new architectures, etc. are
| very different beasts.
| BobbyTables2 wrote:
| Almost seems more like computer engineering. Is it really
| that different than signal/image processing?
|
| I suspect CS departments don't want to concede because they
| are now in the limelight...
| maginx wrote:
| I agree - I don't know what field it formally is, but
| computer science it is not. It is also related to information
| retrieval aka "Google skills", problem presentation, 'theory
| of mind', even management and psychology. I'm saying the
| latter because people often ridicule AI responses for giving
| bad answers that are 'too AI'. But often it is simply because
| not enough context-specific information was given to allow
| the AI to give a more personalized response. One should
| compare the response to "If I had asked a random person on
| the internet this query, what might I have gotten". If you
| write "The response should be written as a <insert
| characteristics, context, whatever you feel is relevant>" it
| will deliver a much less AI-sounding response. This is just
| as much about how you pose a problem in general as it is
| about computer science.
| BobbyTables2 wrote:
| May sound like a conspiracy theory, but NVIDIA and a whole lot
| of AI startups have a strong vested interest in not
| seeking+publishing such findings.
|
| If I don't need a huge model and GPU, then AI is little more
| than an open source program running on an idle PC.
|
| I feel like AI was NVIDIA's lifeboat as GPU mining waned. Don't
| see anything after that in the near future.
| philipswood wrote:
| I think NVIDIA's future is pretty bright.
|
| We're getting to the run-your-capable-LLM on-prem or at-home
| territory.
|
| Without DeepSeek (and hopefully its successors) I wouldn't
| really have a use case for something like NVIDIA's Project
| Digits.
|
| https://www.nvidia.com/en-us/project-digits/
| Arn_Thor wrote:
| Except I can run R1 1.5b on a GPU-less and NPU-less Intel
| NUC from four-five years ago using half its cores and the
| reply speed is...functional.
|
| As the models have gotten more efficient and distillation
| better, the minimum viable hardware for really cooking with
| LLMs has gone from a 4090 to suddenly something a lot of
| people already probably own.
|
| I definitely think a Digits box would be nice, but honestly
| I'm not sure I'll need one.
| nickthegreek wrote:
| R1 1.5b won't do what most people want at all.
| Arn_Thor wrote:
| No, it won't. But that's not the point I was making
| tomaskafka wrote:
| One thing is to realize that we as humans have thinking steps
| (an internal monologue) before we output text. When LLMs
| produce text, we expect this thinking process to happen as
| well, but it does not - they are 'idiots that babble the first
| thing that comes to their minds'.
|
| The 'hack' above is one of many realizations of that
| difference.
| codeulike wrote:
| Wait, so the trick is they reach into the context and basically
| switch '</think>' with 'wait' and that makes it carry on
| thinking?
| gield wrote:
| Yes, that's explicitly mentioned in the blog post:
|
| >In s1, when the LLM tries to stop thinking with "</think>",
| they force it to keep going by replacing it with "Wait".
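|
| In code, the loop is roughly this (a hypothetical sketch with a
| made-up next_token() interface, not the actual s1 implementation):
|
|     END_THINK = "</think>"    # end-of-thinking delimiter
|     MIN_THINK_TOKENS = 512    # assumed thinking budget
|
|     def generate_with_budget(prompt, next_token):
|         trace, n = prompt, 0
|         while True:
|             tok = next_token(trace)
|             if tok == END_THINK and n < MIN_THINK_TOKENS:
|                 trace += "Wait"  # suppress the stop, force a re-check
|                 n += 1
|                 continue
|             trace += tok
|             n += 1
|             if tok == END_THINK:
|                 return trace     # thinking over; answer follows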
| luc4sdreyer wrote:
| Yes, that's one of the tricks.
| danans wrote:
| Not sure if your pun was intended, but 'wait' probably works
| so well because of the models being trained on text
| structured like your comment, where "wait" is followed by a
| deeper understanding.
| ozgune wrote:
| Agreed. Here are three things that I find surreal about the s1
| paper.
|
| (1) The abstract changed how I thought about this domain
| (advanced reasoning models). The only other paper that did that
| for me was the "Memory Resource Management in VMware ESX
| Server". And that paper got published 23 years ago.
|
| (2) The model, data, and code are open source at
| https://github.com/simplescaling/s1. With this, you can start
| training your own advanced reasoning models. All you need is a
| thousand well-curated questions with reasoning steps.
|
| (3) More than half the references in the paper are from 2024
| and Jan 2025. Just look at the paper's first page.
| https://arxiv.org/pdf/2501.19393 In which other field do you
| see this?
| pradn wrote:
| Omg, another fan of "Memory Resource Management in VMware ESX
| Server"!! It's one of my favorite papers ever - so clever.
| pradn wrote:
| I mean is "wait" even the ideal "think more please" phrase?
| Would you get better results with other phrases like "wait, a
| second", or "let's double-check everything"? Or domain-
| dependent, specific instructions for how to do the checking? Or
| forcing tool-use?
| maksimur wrote:
| It appears that someone has implemented a similar approach for
| DeepSeek-R1-Distill-Qwen-1.5B:
| https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...
|
| I hope it gets tested further.
| nullbyte wrote:
| Great article! I enjoyed reading it
| khazhoux wrote:
| I have a bunch of questions, would love for anyone to explain
| these basics:
|
| * The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based
| on very expensive oracles (if we believe DeepSeek-R1 queried
| OpenAI's model). If these are _improvements_ on existing models,
| why is this being reported as decimating training costs? Isn't
| fine-tuning already a cheap way to optimize? (maybe not as
| effective, but still)
|
| * The R1 paper talks about improving one simple game - Countdown.
| But the original models are "magic" because they can solve a
| nearly uncountable number of problems and scenarios. How does the
| DeepSeek / R1 approach scale to the same gigantic scale?
|
| * Phrased another way, my understanding is that these techniques
| are using existing models as black-box oracles. If so, how many
| millions/billions/trillions of queries must be probed to
| replicate and improve the original dataset?
|
| * Is anything known about the training datasets used by DeepSeek?
| OpenAI used presumably every scraped dataset they could get their
| hands on. Did DS do the same?
| UncleEntity wrote:
| > If these are improvements on existing models, why is this
| being reported as decimating training costs?
|
| Because that's what gets the clicks...
|
| Saying they spent a boatload of money on the initial training +
| iteration + final fine-tuning isn't as headline grabbing as "$5
| million trained AI beats the pants off the 'mericans".
| torginus wrote:
| If what you say is true, and distilling LLMs is easy and cheap,
| and pushing the SOTA without a better model to rely on is dang
| hard and expensive, then that means the economics of LLM
| development might not be attractive to investors - spending
| billions to have your competitors come out with products that
| are 99% as good, and cost them pennies to train, does not sound
| like a good business strategy.
| khazhoux wrote:
| What I still don't understand is how one slurps out an entire
| model (closed source) though.
|
| Does the deepseek paper actually say what model it's trained
| off of, or do they claim the entire thing is from scratch?
| ForHackernews wrote:
| AFAIK DeepSeek have not publicly acknowledged training
| their model on OpenAI output - the OpenAI people have
| alleged that they did.
|
| At any rate, I don't think distillation involves 'slurping
| out' the whole model, as I understand it, it means
| providing the other model's output as training data input
| to create your new model. Maybe analogous to an expert
| teaching a novice how to do something by providing
| carefully selected examples, without having to expose the
| novice to all the blind alleys the expert went down to
| achieve mastery.
| janalsncm wrote:
| > even the smartest people make hundreds of tiny experiments
|
| This is the most important point, and why DeepSeek's cheaper
| training matters.
|
| And if you check the R1 paper, they have a section for "things
| that didn't work", each of which would normally be a paper of its
| own but because their training was so cheap and streamlined they
| could try a bunch of things.
| robrenaud wrote:
| > "Note that this s1 dataset is distillation. Every example is a
| thought trace generated by another model, Qwen2.5"
|
| The traces are generated by Gemini Flash Thinking.
|
| 8 hours of H100 is probably more like $24 if you want any kind of
| reliability, rather than $6.
| zaptrem wrote:
| "You can train a SOTA LLM for $0.50" (as long as you're
| distilling a model that cost $500m into another pretrained
| model that cost $5m)
| fizx wrote:
| That's absolutely fantastic, because if you have 1 good idea
| that's additive to the SOTA, you can test it for a dollar,
| not millions
| knutzui wrote:
| The original statement stands, if what you are suggesting in
| addition to it is true. If the initial one-time investment of
| $505m is enough to distill new SOTA models for $0.50 a piece,
| then the average cost for subsequent models will trend toward
| $0.50.
| nico wrote:
| > Why did it cost only $6? Because they used a small model and
| hardly any data.
|
| > After sifting their dataset of 56K examples down to just the
| best 1K, they found that the core 1K is all that's needed to
| achieve o1-preview performance on a 32B model. Adding data didn't
| raise performance at all.
|
| > 32B is a small model, I can run that on my laptop. They used 16
| NVIDIA H100s for 26 minutes per training run, that equates to
| around $6.
| nico wrote:
| > In s1, when the LLM tries to stop thinking with "</think>",
| they force it to keep going by replacing it with "Wait". It'll
| then begin to second guess and double check its answer. They do
| this to trim or extend thinking time (trimming is just abruptly
| inserting "</think>")
|
| I know some are really opposed to anthropomorphizing here, but
| this feels eerily similar to the way humans work, ie. if you just
| dedicate more time to analyzing and thinking about the task, you
| are more likely to find a better solution
|
| It also feels analogous to navigating a tree, the more time you
| have to explore the nodes, the bigger the space you'll have
| covered, hence higher chance of getting a more optimal solution
|
| At the same time, if you have "better intuition" (better
| training?), you might be able to find a good solution faster,
| without needing to think too much about it
| layer8 wrote:
| What's missing in that analogy is that humans tend to have a
| good hunch about when they have to think more and when they are
| "done". LLMs seem to be missing a mechanism for that kind of
| awareness.
| nico wrote:
| Great observation. Maybe an additional "routing model" could
| be trained to predict when it's better to think more vs just
| using the current result
| sanxiyn wrote:
| LLMs actually do have such hunch, they just don't utilize it.
| You can literally ask them "Would you do better if you
| started over?" and start over if answer is yes. This works.
|
| https://arxiv.org/abs/2410.02725
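|
| A hypothetical sketch of that loop, with llm() standing in for
| any chat-completion call:
|
|     def answer_with_restart(question, llm, max_restarts=2):
|         answer = llm(question)
|         for _ in range(max_restarts):
|             check = (f"Question: {question}\nYour answer: {answer}\n"
|                      "Would you do better if you started over? "
|                      "Reply yes or no.")
|             if not llm(check).strip().lower().startswith("yes"):
|                 break
|             answer = llm(question)  # start over
|         return answer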
| janalsncm wrote:
| I think a lot of people in the ML community were excited for Noam
| Brown to lead the O series at OpenAI because intuitively, a lot
| of reasoning problems are highly nonlinear i.e. they have a tree-
| like structure. So some kind of MCTS would work well. O1/O3 don't
| seem to use this, and DeepSeek explicitly mentioned difficulties
| training such a model.
|
| However, I think this is coming. DeepSeek mentioned it was hard
| to learn a value model for MCTS from scratch, but this doesn't
| mean we couldn't seed it with some annotated data.
| insane-c0der wrote:
| Do you have a reference for us to check? - "DeepSeek explicitly
| mentioned difficulties training such a model."
| janalsncm wrote:
| Section 4.2: Unsuccessful attempts
|
| https://arxiv.org/pdf/2501.12948
| talles wrote:
| Anyone else want more articles on how those benchmarks are
| created and how they work?
|
| Those models can be trained in a way tailored to have good results
| on specific benchmarks, making them way less general than it
| seems. No accusation from me, but I'm skeptical on all the recent
| so called 'breakthroughs'.
| charlieyu1 wrote:
| > having 10,000 H100s just means that you can do 625 times more
| experiments than s1 did
|
| The larger the organisation, the fewer experiments you can afford
| to do. Employees are mostly incentivised to get something done
| quickly enough not to be fired in this job market. They know
| that the higher-ups would get them off for temporary gains.
| Rush this deadline, ship that feature, produce something that
| looks OK enough.
| mmoustafa wrote:
| Love the look under the hood! Especially discovering that some AI
| hack I came up with is how the labs are doing things too.
|
| In this case, I was also forcing R1 to continue thinking by
| replacing </think> with "Okay," after augmenting reasoning with
| web search results.
|
| https://x.com/0xmmo/status/1886296693995646989
| ConanRus wrote:
| Wait
| bxtt wrote:
| CoT is a widely known technique - what became fully novel was the
| level of training that embeds CoT via RL with an optimal reward
| trajectory. DeepSeek took it further due to their compute
| restriction to find memory, bandwidth, parallelism optimizations
| in every part (GRPO - reducing memory copies, DualPipe for data
| batch parallelism between memory & compute, kernel bypasses (PTX
| level optimization), etc.) - then even using MoE due to sparse
| activation and further distillation. They operated on the power
| scaling laws of parameters & tokens but high quality data
| circumvents this. I'm not surprised they utilized synthetic
| generation from OpenAI or copied the premise of CoT, but where
| they should get the most credit is their infra level & software
| level optimizations.
|
| With that being said, I don't think the benchmarks we currently
| have are strong enough and the next frontier models are yet to
| come. I'm sure at this point U.S. LLM research firms now
| understand their lack of infra/hardware optimizations (they just
| threw compute at the problem), and they will begin paying closer
| attention. Now their RL-level and parent training will become
| even greater, whilst the newly freed resources go toward the sub-
| optimizations that have traditionally been avoided due to
| computational overhead.
| cadamsdotcom wrote:
| Maybe this is why OpenAI hides o1/o3 reasoning tokens -
| constraining output at inference time seems to be easy to
| implement for other models and others would immediately start
| their photocopiers.
|
| It also gave them a few months to recoup costs!
| mangoman wrote:
| From the S1 paper:
|
| > Second, we develop budget forcing to control test-time compute
| by forcefully terminating the model's thinking process or
| lengthening it by appending "Wait" multiple times to the model's
| generation when it tries to end
|
| I'm feeling proud of myself that I had the crux of the same idea
| almost 6 months ago before reasoning models came out (and a bit
| disappointed that I didn't take this idea further!). Basically
| during inference time, you have to choose the next token to
| sample. Usually people just try to sample the distribution using
| the same sampling rules at each step.... but you don't have to!
| you can selectively insert words into the LLM's mouth based
| on what it said previously or what it wants to say, and decide
| "nah, say this instead". I wrote a library so that you could
| sample an LLM using llama.cpp in swift and you could write rules
| to sample tokens and force tokens into the sequence depending on
| what was sampled.
| https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...
|
| Here, I wrote a test that asks Phi-3 instruct "how are you", and
| if it tried to say "as an AI I don't have feelings" or "I'm
| doing ", I forced it to say "I'm doing poorly" and refuse to help,
| since it was always so dang positive. It sorta worked, though the
| instruction-tuned models REALLY want to help. But at the time I
| just didn't have a great use case for it - I had thought about a
| more conditional extension to llama.cpp's grammar sampling (you
| could imagine changing the grammar based on previously sampled
| text), or even just making it go down certain paths, but I just
| lost steam because I couldn't describe a killer use case for it.
|
| This is that killer use case! forcing it to think more is such a
| great usecase for inserting ideas into the LLM's mouth, and I
| feel like there must be more to this idea to explore.
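|
| The rule-based idea boils down to something like this (a Python
| sketch with a made-up sample_token() interface; the linked library
| does this in Swift against llama.cpp):
|
|     def generate(prompt, sample_token, max_tokens=256):
|         text = prompt
|         for _ in range(max_tokens):
|             tok = sample_token(text)        # ordinary sampling step
|             text += tok
|             if text.endswith("I'm doing"):  # rule fires on the output
|                 text += " poorly"           # forced continuation
|             if tok == "<eos>":
|                 break
|         return text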
| jwrallie wrote:
| So what you mean is that if the current train of thought is
| going in a direction we find to be not optimal, we could just
| interrupt it and hint it into the right direction?
|
| That sounds very useful, albeit a bit different than how
| current "chat" implementations would work, as in you could
| control both ways of the conversation.
| latexr wrote:
| > and a bit disappointed that I didn't take this idea further!
|
| Don't be, that's pretty common.
|
| https://en.wikipedia.org/wiki/Multiple_discovery
| Caitlynmeeks wrote:
| https://imgflip.com/i/9j833q
|
| (ptheven)
| sheepscreek wrote:
| LLMs still feel so magical. It's like quantum physics. "I get it"
| but I don't. Not really. I don't think I ever will. Perhaps a
| human mind can only comprehend so much.
| shaneofalltrad wrote:
| Well dang, I am great at tinkering like this because I can't
| remember things half the time. I wonder if the ADHD QA guy solved
| this for the devs?
| gorgoiler wrote:
| This feels just like telling a constraint satisfaction engine to
| backtrack and find a more optimal route through the graph. We saw
| this 25 years ago with engines like PROVERB doing directed
| backtracking, and with adversarial planning when automating
| competitive games.
|
| Why would you control the inference at the token level? Wouldn't
| the more obvious (and technically superior) place to control
| repeat analysis of the optimal path through the search space be
| in the inference engine itself?
|
| Doing it by saying "Wait" feels like fixing dad's laptop over a
| phone call. You'll get there, but driving over and getting hands
| on is a more effective solution. Realistically, I know that
| getting "hands on" with the underlying inference architecture is
| way beyond my own technical ability. Maybe it's not even
| feasible, like trying to fix a cold with brain surgery?
| Nurbek-F wrote:
| Totally agreed, this is not the solution we are looking for; in
| fact, it is the only solution we have in our hands right now.
| It's a good step forward.
| code_biologist wrote:
| What would a superior control approach be? It's not clear to me
| how to get an LLM to be an LLM if you're not doing stochastic
| next token prediction. Given that, the model itself is going to
| know best how to traverse its own concept space. The R1 chain
| of thought training encourages and develops exactly that
| capability. Still, you want that chain of thought to terminate
| and not navel gaze endlessly.
|
| So how to externally prod it to think more when it does
| terminate? Replacing thought termination with a linguistic
| signifier of continued reasoning plus novel realization seems
| like a charmingly simple, principled, and general approach to
| continue to traverse concept space.
| rayboy1995 wrote:
| This is the difference between science and engineering. What
| they have done is engineering. If the result is 90% of the way
| there with barely any effort, it's best to move on to something
| else that may be low-hanging fruit rather than spend time
| chasing that 10%.
| stefanoco wrote:
| Is it me, or are the affiliations totally missing from the cited
| paper? Looks like the authors come from a mix of UK / US
| institutions.
| advael wrote:
| I'm strictly speaking never going to think of model distillation
| as "stealing." It goes against the spirit of scientific research,
| and besides every tech company has lost my permission to define
| what I think of as theft forever
| eru wrote:
| At most it would be illicit copying.
|
| Though it's poetic justice that OpenAI is complaining about
| someone else playing fast and loose with copyright rules.
| tomrod wrote:
| Stochastic decompression. Dass-it.
| downrightmike wrote:
| The First Amendment is not just about free speech, but also
| the right to read, the only question is if AI has that right.
| organsnyder wrote:
| If AI was just reading, there would be much less
| controversy. It would also be pretty useless. The issue is
| that AI is creating its own derivative content based on the
| content it ingests.
| boxcake wrote:
| Isn't any answer to a question which hasn't been
| previously answered a derivative work? Or when a human
| writes a parody of a song, or when a new type of music is
| influenced by something which came before.
| nrabulinski wrote:
| This argument is so bizarre to me. Humans create new,
| spontaneous thoughts. AI doesn't have that. Even if
| someone's comment is influenced by all the data they have
| ingested over their lives, their style is distinct and
| deliberate, to the point where people have been doxxed
| before/anonymous accounts have been uncovered because
| someone recognized the writing style. There's no
| deliberation behind AI, just statistical probabilities.
| There's no new or spontaneous thoughts, at most
| pseudorandomness introduced by the author of the model
| interface.
|
| Even if you give GenAI unlimited time, it will not
| develop its own writing/drawing/painting style or come up
| with a novel idea, because strictly by how it works it
| can only create "new" work by interpolating its dataset.
| Ringz wrote:
| > Humans create new, spontaneous thoughts.
|
| The compatibility of determinism and freedom of will is
| still controversially debated. There is a good chance
| that humans don't "create".
|
| > There's no deliberation behind AI, just statistical
| probabilities. There's no new or spontaneous thoughts, at
| most pseudorandomness introduced by the author of the
| model interface.
|
| You can say exactly the same about deterministic humans
| since it is often argued that the randomness of
| thermodynamic or quantum mechanical processes is
| irrelevant to the question of whether free will is
| possible. This is justified by the fact that our concept
| of freedom means a decision that is self-determined by
| reasons and not a sequence of events determined by
| chance.
| rig666 wrote:
| Does my software have the right to read the contents of a
| DVD and sell my own MP4 of it? No. If a streamer plays
| a YouTube video on their channel, is the content original?
| Yes. When GPT-3 was training, people saw it as a
| positive. When people started asking ChatGPT more things
| than they searched sites, it became a negative.
| antimatter15 wrote:
| It's hardly even illicit - at least in the United States, the
| output of an AI isn't copyrightable.
| surajrmal wrote:
| Maybe but something has gotta pay the bills to justify the
| cutting edge. I guess it's a similar problem to researching
| medicine.
| ClumsyPilot wrote:
| Well the artists and writers also want to pay their bills. We
| threw them under the bus, might as well throw openAI too and
| get an actual open AI that we can use
| advael wrote:
| The investment thrown at OpenAI seems deeply inflated for how
| much meaningful progress they're able to make with it
|
| I think it's clear that innovative breakthroughs in bleeding-
| edge research are not just a matter of blindly hurling more
| money at a company to build unprecedentedly expensive
| datacenters
|
| But also, even if that was a way to do it, I don't think we
| should be wielding the law to enable privately-held companies
| to be at the forefront of research, especially in such a
| grossly inconsistent manner
| svara wrote:
| It just occurred to me that if you squint a little (just a
| little!) the S1 paper just provided the scientific explanation
| for why Twitter's short tweets mess you up and books are good for
| you.
|
| Kidding, but not really. It's fascinating how we seem to be
| seeing a gradual convergence of machine learning and psychology.
| mig1 wrote:
| This argument that the data centers and all the GPUs will be
| useful even in the context of Deepseek doesn't add up...
| basically they showed that it's diminishing returns after a
| certain amount. And so far it didn't make OpenAI or Anthropic go
| faster, did it?
| rayboy1995 wrote:
| What is the source for the diminishing returns? I would like to
| read about it as I have only seen papers referring to the
| scaling law still applying.
| adamc wrote:
| I found it interesting but the "Wait" vs. "Hmm" bit just made me
| think we don't really understand our own models here. I mean,
| sure, it's great that they measured and found something better,
| but it's kind of disturbing that you have to guess.
| leopoldj wrote:
| >it can run on my laptop
|
| Has anyone run it on a laptop (unquantized)? Disk size of the 32B
| model appears to be 80GB. Update: I'm using a 40GB A100 GPU.
| Loading the model took 30GB vRAM. I asked a simple question "How
| many r in raspberry". After 5 minutes nothing got generated
| beyond the prompt. I'm not sure how the author ran this on a
| laptop.
| coder543 wrote:
| 32B models are easy to run on 24GB of RAM at a 4-bit quant.
|
| It sounds like you need to play with some of the existing 32B
| models with better documentation on how to run them if you're
| having trouble, but it is entirely plausible to run this on a
| laptop.
|
| I can run Qwen2.5-Instruct-32B-q4_K_M at 22 tokens per second
| on just an RTX 3090.
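|
| Back-of-the-envelope math on why 4-bit fits (weights only,
| ignoring KV cache and runtime overhead):
|
|     params = 32e9
|     sizes = {"fp16": 2, "int8": 1, "4-bit": 0.5}  # bytes per weight
|     for fmt, b in sizes.items():
|         print(f"{fmt}: ~{params * b / 1e9:.0f} GB of weights")
|     # fp16: ~64 GB, int8: ~32 GB, 4-bit: ~16 GB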
| leopoldj wrote:
| My question was about running it unquantized. The author of
| the article didn't say how he ran it. If he quantized it then
| saying he ran it on a laptop is not news.
| coder543 wrote:
| I can't imagine why anyone would run it unquantized, but
| there are some laptops with the more than 70GB of RAM that
| would be required. It's not that it can't be done... it's
| just that quantizing to at least 8-bit seems to be standard
| practice these days, and DeepSeek has shown that it's even
| worth _training_ at 8-bit resolution.
| mountainriver wrote:
| > They used 16 NVIDIA H100s for 26 minutes per training run, that
| equates to around $6
|
| Running where? H100s are usually over $2/hr, that's closer to $25
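|
| The arithmetic, with the hourly rate as the assumption (the
| quoted 16 H100s x 26 minutes is the only given):
|
|     gpu_hours = 16 * 26 / 60         # ~6.9 H100-hours per run
|     for rate in (0.85, 2.00, 3.50):  # assumed $/H100-hour
|         print(f"${rate:.2f}/hr -> ${gpu_hours * rate:.0f}")
|     # roughly $6, $14 and $24 - the headline assumes the cheap end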
___________________________________________________________________
(page generated 2025-02-06 23:01 UTC)