[HN Gopher] Coconut by Meta AI - Better LLM Reasoning with Chain...
___________________________________________________________________
Coconut by Meta AI - Better LLM Reasoning with Chain of Continuous
Thought?
Author : TaurenHunter
Score : 317 points
Date : 2024-12-31 00:54 UTC (22 hours ago)
(HTM) web link (aipapersacademy.com)
(TXT) w3m dump (aipapersacademy.com)
| fosterfriends wrote:
| Once again, we see Meta being more open than OpenAI. I'm loving
| that their business incentive is aligned with open sourcing and
| commodifying state-of-the-art LLM technology. Keep em coming
| fragmede wrote:
| don't buy their bullshit. it's not open source.
| speedgoose wrote:
| Yes it's more about open weights. I also think that you would
| need the training data to consider it open source.
|
| Open weights is still appreciated and they probably train on
| data they don't have the license to open source.
| astrange wrote:
| I'm not sure open source is a useful concept for something
| that takes millions of dollars to compile from its source.
| blackeyeblitzar wrote:
| I mean they have no way to monetize LLMs as well as others, so
| they're working on it and giving it away to not look irrelevant
| and to weaken anyone who may make money off this tech and
| threaten them in the future. Meanwhile there is a danger they
| impose their long standing invisible "moderation" on everyone
| else once they starve all the startups of revenue by giving
| this away. We'll just be left with the same big tech overlords
| to choose from.
|
| Oh and it still isn't open source even though people like Yann
| LeCun dishonestly claim it is. Only OLMo is truly open source
| among competitive models, as far as I know:
| https://allenai.org/blog/olmo2
| spencerflem wrote:
| Facebook would rather do no moderation, it's an expense for
| them.
|
| They do it to make the platform more pleasant so that people
| stay on it
| graemep wrote:
| > They do it to make the platform more pleasant so that
| people stay on it
|
| Almost everything unpleasant I see on FB is stuff that the
| FB algorithm shows me - not things posted by FB friends, or
| pages I follow or groups I am in.
| nightski wrote:
| Everything you see on FB is what the algorithm shows you,
| unpleasant or not. So it's a tautology that everything
| unpleasant would be from the algorithm.
| blackeyeblitzar wrote:
| No they do it to support their owners' and employees'
| biases. It doesn't make the platform more pleasant for the
| half that gets censored. That's leaving aside the feed not
| remembering the choice to view chronologically ordered
| posts, the inability to easily track actual people in my
| life, the addictive algorithms, the clickbait that causes
| mental health issues for teens, etc.
| creato wrote:
| 99% of FB's moderation has nothing to do with "biases",
| unless you think FB is biased against spam, scams, and
| all the other dregs of the internet that incessantly pop
| up anywhere users can post content.
| TheOtherHobbes wrote:
| Quite a few people left Threads for Bluesky because
| progressive posts were being removed while far-right,
| antivax, etc content was allowed to stand even though it
| was reported.
|
| At best the algo is imperfect. At worst it really does
| seem oddly selective.
| dudeinjapan wrote:
| I am a humble Cialis salesman, like my father and
| grandfather before me. I confirm Facebook is biased
| against our profession. (My grandfather also moonlighted
| as a Barrister representing the estates of deceased
| African royalty--it was always so difficult to track down
| their heirs.)
| roywiggins wrote:
| The stuff that Facebook moderators are actually tasked
| with removing is really awful, bad enough to produce
| severe psychological effects in the moderators.
|
| Facebook pays people to look at and remove this stuff
| because the platform would not survive if it wasn't
| removed before you or I saw it. Do they also enforce
| other corporate values? Yeah, probably. That doesn't seem
| to be the main job though, they have their hands full
| dealing with the worst content in the world.
|
| https://amp-theguardian-
| com.cdn.ampproject.org/v/s/amp.thegu...
|
| > The images and videos including necrophilia, bestiality
| and self-harm caused some moderators to faint, vomit,
| scream and run away from their desks...
|
| > Some reported marriage breakdown and the collapse of
| desire for sexual intimacy, and losing connection with
| their families. Some whose job was to remove videos
| uploaded by terrorist and rebel groups were afraid they
| were being watched and targeted, and that if they
| returned home they would be hunted and killed.
| cess11 wrote:
| It's more likely they do it to keep their people from being
| coerced to visit the Hague. What they did in Myanmar got a
| lot of press and a case at the ICJ, and similar launches of
| 'free internet' elsewhere had similar results.
| rlupi wrote:
| (tongue in cheek comment) I wonder if FB moderation now or
| eventually will be just a prompt to a sufficiently evolved
| and unhinged AI model:
|
| > FB or 4chan?
| BoorishBears wrote:
| > they have no way to monetize LLMs as well as others
|
| Random nobodies are putting together companies to monetize
| generative AI and getting bought out a couple of years later,
| you think Meta couldn't figure out how to deploy their own
| models to an API and stick up a billing interface if they
| really wanted to? (or even buy a company that does already?)
|
| > they starve all the startups of revenue by giving this away
|
| Would you say startups like Deepseek have been hurt or helped
| by their (even partial) openness?
|
| In fact, how does this track with your first statement?
| They're not monetizing this, so their startup competition can
| actually serve their models to gain revenue _which they then
| turn around and use to train competitor models_ (we've already
| seen this with Fireworks.ai)
|
| You seem to underestimate how much of the value in LLMs is
| _productizing_ them. The margins on per-token usage are
| insane, Meta not taking that margin is creating a huge
| opportunity for a wave of startups in so many directions...
|
| > Only OLMo is truly open source among competitive models
|
| Synthetic data from competitor models was a huge part of
| that. It would seem no one is fighting the startups as hard
| as you're claiming they are.
| bongodongobob wrote:
| All the LLM companies are going to eat those "product
| companies" lunch in a few years. Why would I use product X
| when it's inevitably going to be baked into the actual
| tech itself? Those product companies are just wrappers and
| have even less of a moat than the LLM companies. The very
| fact that random nobodies are doing this should signal
| there isn't a lot of real value there. Yes, there is some
| money to be made right now but it reminds me a lot of the
| videogame bust and dotcom bust. A LOT of companies are
| wasting a crazy amount of money on "solutions" that will be
| obsolete in a few years.
| BoorishBears wrote:
| Productization in this context is creating APIs for
| Meta's models.
|
| Fireworks.ai, Together.ai, and literal boatloads of other
| startups are making real money just efficiently serving
| up these models that Meta is supposedly using to... choke
| out startups.
|
| The comment I replied to is under the mistaken idea that
| the presence of free models from Meta has a chilling
| effect on startups trying to build their own models, but
| right now the biggest barriers are capital and data.
|
| Meta updated Llama to allow for synthetic generation, and
| they're even partnering with these startups to give them
| distribution and day-0 access to the models.
|
| -
|
| If anything I'd say Meta is actively fighting against the
| big tech overlords the comment thinks they're trying to
| join. Even before Ilya mentioned it, it was clear to me
| that the power of post-training was going to become more
| and more important (I've literally built a business on
| it).
|
| Llama represents a real ongoing chance for tiny startups
| with minimal resources to get into the fray very
| affordably (through either offering inference, or post-
| training for a specific task, etc.), scale revenue, and
| then start to compete against much larger, resource rich
| companies.
| jayd16 wrote:
| Is there any vendor lock-in with this conspiracy? Even if
| startups are pushed out of the spotlight, what stops them
| from competing? If the meta model is bad, won't it be even
| easier to make an alternative in the future?
| scarface_74 wrote:
| They are definitely making some money off of their licensing
| to AWS as part of the bedrock offering. Facebook's licensing
| is such that they aren't going to let happen to them what
| happened to ElasticSearch, Redis, etc.
|
| I'm okay with that.
| rlupi wrote:
| In the agentic era, the new Ads eyeballs are the LLMs
| training corpus (IMHO).
| throwup238 wrote:
| Master coconut! I don't know if that's an Archer reference or a
| Frisky Dingo reference.
|
| It's fascinating how fast the competitors are catching up to each
| other. Can't wait for seven different SkyNets to compete for
| dominance.
| yard2010 wrote:
| Both! And/or, either
| throwaway314155 wrote:
| A little column a, a little column b.
| Klathmon wrote:
| So is the big improvement here simply skipping the
| unembedding/embedding step for internal thoughts? Or is it mainly
| in the training methods to teach the CoT and how to switch
| between "latent thought" and text output?
|
| It's really interesting that a fixed number of "latent thoughts"
| performed as well as a binary classifier! I didn't expect that at
| all; the way OpenAI talks about CoT, it seems the ability to let
| it "keep thinking" lets them continually score higher on
| benchmarks while throwing eye-watering amounts of compute at
| inference.
| Crye wrote:
| It mentioned not penalizing/rewarding the model for thoughts,
| only rewarding the answer after the thought. I am curious how
| backpropagation works then.
| yorwba wrote:
| The tokens of the answer depend on the preceding continuous
| thought vectors, which you can backprop through in the usual
| way.
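|
| A minimal sketch of why that works, using a toy PyTorch
| recurrence as a stand-in for the transformer (every size and
| name below is made up for illustration):
|
|     import torch, torch.nn as nn
|
|     cell = nn.GRUCell(16, 16)       # toy stand-in for the decoder
|     to_logits = nn.Linear(16, 100)  # made-up vocab of 100
|
|     h = torch.zeros(1, 16)
|     x = torch.randn(1, 16)          # embedding of last question token
|
|     # Two latent "thought" steps: the hidden state is fed back as
|     # the next input, nothing is sampled and no loss applies here.
|     for _ in range(2):
|         h = cell(x, h)
|         x = h                       # continuous thought = hidden state
|
|     # Loss only on the answer token; gradients still flow back
|     # through the latent steps, since they are ordinary tensor ops.
|     logits = to_logits(cell(x, h))
|     loss = nn.functional.cross_entropy(logits, torch.tensor([42]))
|     loss.backward()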
| lovasoa wrote:
| The researchers leverage existing language Chain-of-Thought
| data, where each sample consists of a question, reasoning
| steps, and the final answer. At stage 0, the model does not
| generate any thought tokens, and is just trained to yield the
| reasoning traces and correct answers for the Chain-of-Thought
| samples. In the subsequent stages, at each stage, we remove
| one reasoning step from the sample, and instead add thought
| tokens. In the illustration above, a single thought token is
| added in each stage, instead of a single reasoning step, but
| this is controlled by a hyperparameter 'c'.
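|
| A rough sketch of that schedule (the token names and stage
| handling are illustrative, not the authors' code); "<thought>"
| just marks positions where the hidden state will be fed back:
|
|     def build_sample(question, steps, answer, stage, c=1):
|         if stage == 0:               # plain chain-of-thought
|             return [question] + steps + [answer]
|         latent = ["<bot>"] + ["<thought>"] * (stage * c) + ["<eot>"]
|         return [question] + latent + steps[stage:] + [answer]
|
|     q, steps, a = "2+3*4=?", ["3*4=12", "2+12=14"], "14"
|     for stage in range(3):
|         print(stage, build_sample(q, steps, a, stage))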
| viraptor wrote:
| I was waiting for something like that to happen! Next step -
| creating a human-language-free representation. I believe that
| once a group of llms can communicate only in embeddings tuned
| without any human text input, we're going to open a completely
| new chapter in AI.
| bboygravity wrote:
| How does a group help anything?
|
| If you put 1000 dumb people together, they don't magically
| become smart?
| sunshinerag wrote:
| Wait what ... how does democracy work then?
| nfw2 wrote:
| the benefit of democracy is primarily that it prevents
| governments from doing bad things, less so that it empowers
| more effective governance
| mathgeek wrote:
| It can do either, and can fail to do either. It's the
| people having power that enables the outcomes, not the
| system itself. Democracy just grants the power to a
| broader set of people.
| optimalsolver wrote:
| It doesn't.
| coldtea wrote:
| Democracy is not about being smart or dumb.
|
| It's about everybody having a say in decisions of
| government that affect them.
|
| The failure of democracy as a system is not when people
| make dumb decisions (experts and high-IQ people have made
| some of the most stupid and catastrophic decisions in
| history), but when people's collective decisions are not
| being respected.
| IshKebab wrote:
| If you put 1000 people who can't talk together they will
| create language so they can communicate. He's saying if we
| put LLMs together and don't force them to use English to
| communicate then they'll create their own language which may
| be superior for LLMs to English.
|
| May be true but who knows.
|
| I wonder if anyone has somehow tested the Sapir-Whorf
| hypothesis for LLMs by training them on different
| languages and comparing task performance. I guess it's too
| difficult to get a large equivalent training set in different
| languages.
| wodderam wrote:
| It feels like an exercise in anthropomorphization to me.
|
| Sapir-Whorf hypothesis is generally not considered to be
| reality. It makes intuitive sense but is wrong.
|
| There are hours of podcasts with Chomsky talking about
| LLMs, the gist of which is that LLMs are extracting surface-
| level statistical structure of language that will be good
| for routine coding and not much else. It is easy to infer
| that Chomsky would believe this idea to be utter nonsense.
|
| I believe even the idea of getting a 1000 people together
| and we agree to label a rock "rock", a tree "tree", a bird
| "bird" is not even how human language works. Something that
| is completely counter intuitive.
|
| Reading the paper, no one believes a hidden markov model is
| creating some kind of new thought process in the hidden
| state.
|
| I could certainly, though, have no idea what I am talking
| about with all this and have pieced together parts that make
| no sense, while this is in fact a breakthrough path to AGI.
| pjerem wrote:
| Well maybe not 1000 people but to our knowledge, the
| human brain is actually made of physically independent
| zones that barely communicate with each other except via the
| zone that takes all the outputs together and tries to do
| something coherent with all the garbage.
|
| Idk if this could work with LLMs, especially because all
| the brain zones are somehow specialized into something
| while two LLMs are just identical machines. But we also
| know that the specialization isn't that hardcoded: we
| know that people losing half their brain (after a stroke)
| can still relearn things that were managed in the "dead"
| part.
|
| I don't know, please correct my errors, I was just
| thinking aloud to say that multiple independent agents
| working together may be how "intelligence" already works
| in the biological world so why not for AIs ?
| PittleyDunkin wrote:
| > Sapir-Whorf hypothesis is generally not considered to
| be reality.
|
| This is true only in the strictest terms of the
| hypothesis, i.e. linguistic determinism. Language still
| encodes a lot of culture (& hence norms and values) in
| its grammar & diction--this isn't very controversial.
|
| Granted, I don't think this is that related to the topic
| at hand. There's bias all over the decisions in how to
| train and what to train on; choice of language is just
| one facet of that.
| coldtea wrote:
| > _Sapir-Whorf hypothesis is generally not considered to
| be reality. It makes intuitive sense but is wrong_
|
| Strong S-W (full determinism) might not be, but there's
| hardly a clear cut consensus on the general case.
|
| And the whole "scientific field" is more like psychology,
| with people exchanging and shooting down ideas, and less
| like Math and Physics, so any consensus is equally likely
| to be a trend rather than reflecting some hard measurable
| understanding.
|
| I'd say the idea that S-W is not true to at least some
| degree is naive.
| digbybk wrote:
| > There are hours of podcasts with Chomsky talking about
| LLMs
|
| I'm not an expert, but it seems like Chomsky's views have
| pretty much been falsified at this point. He's been
| saying for a long time that neural networks are a dead
| end. But there hasn't been anything close to a working
| implementation of his theory of language, and meanwhile
| the learning approach has proven itself to be effective
| beyond any reasonable doubt. I've been interested in
| Chomsky for a long time but when I hear him say "there's
| nothing interesting to learn from artificial neural
| networks" it just sounds like a man that doesn't want to
| admit he's been wrong all this time. There is _nothing_
| for a linguist to learn from an actually working
| artificial language model? How can that possibly be?
| There were two approaches - rule-based vs learning - and
| who came out on top is pretty damn obvious at this point.
| jokethrowaway wrote:
| What can you learn from something parroting data we
| already have?
|
| Similarly, we are now finding that training on synthetic
| data is not helpful.
|
| What would have happened if we invested 1/100 of what we
| spent on LLM on the rule based approach?
| stingraycharles wrote:
| Is everything in LLMs translated back to English before
| interpretation?
|
| It works fairly well in my native language, I'm surprised
| to learn that things get translated back.
| astrange wrote:
| LLMs have no fixed internal representation - they barely
| have internal anything - so no, there is no translation.
|
| But there's also no guarantee any particular query
| generalizes (vs is memorized), so it might only be able
| to answer some queries in some languages.
| littlestymaar wrote:
| > If you put 1000 dumb people together, they don't magically
| become smart?
|
| 1000 is probably too high, but groups of people are in fact
| more intelligent than individuals (though for humans it is
| likely because recognizing a correct answer is easier than
| finding it in the first place)
| nfw2 wrote:
| depends on the circumstances. lin-manuel miranda can
| probably write a better musical by himself than a team of
| 20 people with equal input would.
|
| also, the bottlenecks that teamwork helps solve (eg the
| high cost of gaining expertise and low throughput of
| reasoning capacity) may not be that relevant in the ai age
| littlestymaar wrote:
| > by himself than a team of 20 people with equal input
| would.
|
| Sure, but the result would still be far better than the
| average of the output of the 20 individuals taken alone.
|
| > also, the bottlenecks that teamwork helps solve (eg the
| high cost of gaining expertise and low throughput of
| reasoning capacity) may not be that relevant in the ai
| age
|
| It's always tempting to anthropomorphize these systems
| and conclude that what works for us would work for them,
| but yes we don't really know if it would bring anything
| to AI.
| TheOtherHobbes wrote:
| _Functional_ groups which work well together, include open
| sharing of research and ideas, persistence of best output,
| are dedicated to realism, and are more focussed on problem
| solving than status display, will be smarter. The group
| works like a filter which generates multiple solutions and
| selects, remembers, and abstracts the best.
|
| Dysfunctional groups which do the opposite will be
| catastrophically stupid.
|
| There have been plenty of dysfunctional groups in history.
| JFingleton wrote:
| > If you put 1000 dumb people together, they don't magically
| become smart?
|
| Do they not become smart*er* though?
| computably wrote:
| "Smarter" is too vague. A group can compensate for
| individual weaknesses or even converge on a hard-to-make
| prediction given sufficiently uncorrelated outputs;
| basically the idea behind ensemble models / wisdom of the
| crowds. But a group of 1000 dumb apes would never achieve
| categorically-above-ape intelligence, probably not even
| "genius" ape intelligence. Groups of unintelligent agents
| come with downsides as well, like the ant death spiral.
| coldtea wrote:
| > _But a group of 1000 dumb apes would never achieve
| categorically-above-ape intelligence_
|
| And yet, here we are.
|
| A group of 1000 apes is large enough to have offspring
| and, given time, go through evolution.
| mromanuk wrote:
| Because group estimation is superior to individual
| estimations: The phenomenon is called wisdom of the crowds.
| When a group of people independently estimate something,
| individual errors tend to cancel each other out, leading to a
| surprisingly accurate collective result. This works because
| of:
|
| Diversity of opinions: Different perspectives bring a range
| of estimates.
|
| Independence: Errors aren't systematically biased as long as
| individuals estimate without external influence.
|
| Error averaging: Overestimations and underestimations balance
| out when averaged.
|
| Law of large numbers: More participants increase accuracy by
| minimizing random errors.
|
| It was demonstrated by Francis Galton in 1906, where a crowd's
| average guess of a bull's weight was almost spot-on. (Estimates
| must be independent and reasonably informed for this to work.)
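|
| A quick illustration of the averaging effect (toy numbers;
| independent, unbiased errors assumed):
|
|     import random
|
|     truth = 1000                   # e.g. the bull's weight in lbs
|     guesses = [truth + random.gauss(0, 200) for _ in range(1000)]
|
|     mean = sum(guesses) / len(guesses)
|     avg_err = sum(abs(g - truth) for g in guesses) / len(guesses)
|     print(f"typical individual error: {avg_err:.0f}")
|     print(f"error of the crowd's mean: {abs(mean - truth):.0f}")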
| senectus1 wrote:
| They kinda do... It's how cities work.
|
| People learn by being around others being both successful and
| unsuccessful.
| coldtea wrote:
| Isn't that the very case behind the "wisdom of crowds" thing?
| amelius wrote:
| Looking at the current state of democracies around the
| world, my hopes are not on "wisdom of the crowds".
| bee_rider wrote:
| If you think the democracies are doing bad, you should
| see the autocracies!
| amelius wrote:
| You mean the thing democracies are turning into, thanks
| to social (crowd wisdom) media?
| bee_rider wrote:
| I don't think social media really is crowd wisdom at all.
| It is built to pander to our worst impulses (I think,
| knowingly and openly, right? The algorithm selects for
| engagement, not learning and growing), and I'd be
| surprised if it isn't producing a feedback loop as well
| (perhaps as an unintentional side effect). The wisdom of
| the crowds hypothesis relies on a random sampling, we're
| intentionally applying a skew toward the angry and
| shallow.
| konart wrote:
| Not magically. Our great ancestors were pretty dumb, but they
| were getting smarter and better because of sharing their
| knowledge.
| ulbu wrote:
| they were not one bit dumber than you.
| EliBullockPapa wrote:
| Average intelligence measures have risen substantially
| since early 1900s
|
| https://en.wikipedia.org/wiki/Flynn_effect
| pigpop wrote:
| yes they got "smarter" by compiling a corpus of knowledge
| which future generations could train on.
|
| sarcasm aside, throwing away the existing corpus in favor
| of creating a new one from scratch seems misguided.
|
| this paper isn't about creating a new language, they are
| omitting the sampler that chooses a single token in favor
| of sending the entire end state back in to the model like a
| superposition of tokens. that's the breadth first search
| part, they don't collapse the choice down to a single token
| before continuing so it effectively operates on all of the
| possible tokens each step until it decides it's done.
|
| it would be interesting to try this with similar models
| that had slightly different post training if you could
| devise a good way to choose the best answer or combine the
| outputs effectively or feed the output of a downstream
| model back into the initial model, etc., but I'm not sure
| if there'd necessarily be any benefit to this over using a
| single specialized model.
| jkingsman wrote:
| How does one impart textual knowledge discovered by humans
| without language?
| thelittleone wrote:
| Couldn't we use an AI model trained on historical text data
| (up to today) to predict likely events for tomorrow? Taking
| this further, a sufficiently advanced AI system could
| potentially analyze human-generated text up to any given
| point in history to understand patterns of human thought and
| behavior, then project those patterns forward. This speaks to
| your point about human language - while we need text data for
| initial training, the AI's internal representations and
| predictions could potentially transcend human language
| constraints.
| viraptor wrote:
| The training of the LLM itself would still use the human
| language. But you could add an extra channel that's never
| given any text or direct dataset training. Keep it purely a
| connection between hidden layers of different instances of
| LLM and train using the usual loss of perplexity or similar
| metric.
|
| The interesting thing then would be - does it converge to
| similar embedding space as the input, or can LLMs create a
| more efficient "language".
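|
| A very loose sketch of that setup, with toy GRU cells standing
| in for two LLM instances and a learned linear "bridge" as the
| text-free channel (everything here is invented for illustration):
|
|     import torch, torch.nn as nn
|
|     hid = 32                         # made-up width
|     llm_a, llm_b = nn.GRUCell(hid, hid), nn.GRUCell(hid, hid)
|     bridge = nn.Linear(hid, hid)     # trained, but never sees text
|
|     ha, hb = torch.zeros(1, hid), torch.zeros(1, hid)
|     x = torch.randn(1, hid)          # embedding of some text input
|     for _ in range(3):
|         ha = llm_a(x, ha)
|         hb = llm_b(bridge(ha), hb)   # hidden-to-hidden exchange
|         x = hb                       # B's state feeds A's next step
|     # A next-token loss on the final output would train `bridge`
|     # end to end; whether it converges to anything language-like
|     # is the open question above.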
| wruza wrote:
| I thought about it too (layman). When I learned about
| embeddings it almost immediately clicked as a sort of an
| ascended language, not sure why no one seems to talk about it.
| Exchanging embeddings must be so much "wider" communication
| channel than speaking real language. And in contrast to a
| language embeddings are (iiuc) continuous, i.e. you can rotate
| a vector continuously and it will smoothly trace the changes
| between A and B. I can picture communicating in something like
| https://www.google.com/search?q=charlie+conspiracy+meme&udm=...
| - embedding difference vectors, but it's all crystal clear and
| is a natural language for an llm, cause any vector combination
| points to a correct "inner screen" image/concept/younameit.
|
| Or maybe this is my own ignorant confabulation, so nvm.
| mckirk wrote:
| This is actually something you probably want to avoid, if at
| all possible, because it makes it very hard to maintain insight
| into what the AIs are communicating among them. But that
| insight is crucial to stay informed about their progress in
| taking over the world, etc.
| dwohnitmok wrote:
| Yes! We should be extremely cautious about embracing
| approaches that make LLMs even more inscrutable. Having CoT,
| however unreliable it is, is nonetheless a huge boon for
| model evaluation that we should not give up so lightly.
| blizdiddy wrote:
| That came out a few weeks ago from meta. Large Concept Models
|
| https://ai.meta.com/research/publications/large-concept-mode...
| davidclark wrote:
| Is this article AI-generated? This website appears to do a lot of
| "diving in".
| zombiwoof wrote:
| Will this allow Facebook new user base of AI generated characters
| to interact with themselves better?
| behnamoh wrote:
| There was no reason to call it something it's not ("chain of
| cont. thought" ≠ coconut).
| CGamesPlay wrote:
| Is your complaint here that the paper is not discussing a
| literal coconut?
| ripped_britches wrote:
| We desperately need more literal coconut coverage here on HN
| BoorishBears wrote:
| Not just any regular old coconuts "Coconut by Meta AI -
| Better LLM Reasoning with Chain of Continuous Thought?"
| coconuts
|
| (Sometimes acronyms in titles are vague/misleading... this
| was not one of those times)
| layer8 wrote:
| To be fair, it's not even a metaphorical coconut. ;)
| gloosx wrote:
| for sure, chocothot aligns better with letters
| CGamesPlay wrote:
| Paper: https://arxiv.org/abs/2412.06769
|
| The link is in the OP, hidden away in an image caption for some
| reason.
| MarkMarine wrote:
| This ought to make it much easier for the AI to lie to us without
| getting caught.
|
| https://www.transformernews.ai/p/openais-new-model-tried-to-...
| h8hawk wrote:
| The article is pure nonsense, and even the author added an
| update admitting it missed important context.
| throw-qqqqq wrote:
| Have you read the paper discussed in the article?
|
| https://static1.squarespace.com/static/6593e7097565990e65c88...
|
| From the abstract:
|
| > We study whether models have the capability to scheme in
| pursuit of a goal that we provide in-context and instruct the
| model to strongly follow.
|
| > Analysis of the models' chains-of-thought reveals that models
| explicitly reason about these deceptive strategies, providing
| evidence that the scheming behavior is not accidental.
|
| ---
|
| It is impressive work for sure, but it comes down to prompting
| IMHO. They almost instruct the AI to plot and scheme here, with
| the goal of finding out if it CAN do this at all.
|
| The AI is not very covert about it either! It directly admits
| its scheming in the "thinking" outputs :D
|
| IMO this is just another issue of alignment.
| opdahl wrote:
| This is super cool and something I've been waiting on. Would be
| interesting to intersperse these thinking steps into token
| generation. What would be the effect of adding, let's say, 5
| thinking <<thoughts>> for every 50 generated tokens?
| jkelleyrtp wrote:
| I think this might be the "it" moment for AI/LLMs. I was hiking
| with a friend recently and we talked about this at length.
|
| The arc-AGI results from O3 are apparently a result of chain of
| thought given enough time to explore a solution space. Reasoning
| might be simply a higher-dimensional form of Rubik's cube solving.
| BFS, search, back-tracking, etc. It seems unlikely that humans
| think in "tokens" so why do LLMs?
|
| By staying in latent space, the models are free to describe an
| "idea" in higher resolution than what language allows. English is
| coarse, granular. Latent space is a much finer representation of
| ideas and their interplay.
|
| Latent space is also much cheaper to execute in. The model can
| think without the language encoding/decoding step. This lets it
| branch out hundreds of ideas and explore only the most useful
| ones in a fraction of time that reasoning "out-loud" would take.
|
| The states also don't need to be tied to language. Feed in a
| robot's state, time series data, or any abstract data. Reason in
| category theory or linear algebra or complex analysis. Humans are
| hard wired for one set of math - an abstract latent space can
| represent anything.
|
| I'm a bit disappointed OpenAI didn't stumble on this first. I've
| been skeptical of LLMs since their big debut last year. LLMs seem
| like a great way of solving language, but reasoning is much more
| complex. Once you grok the math behind the current models, you
| immediately question why the encoding/decoding step is there.
| Diffusion models are incredible but it felt that LLMs lacked the
| same creativity. Encoding/decoding forces a token-based
| discretization and therefore a loss of complexity.
|
| With the byte-latent paper it was quite clear we'd see this
| paper. This truly might be the "it" moment.
| otikik wrote:
| > It seems unlikely that humans think in "tokens" so why do
| LLMs?
|
| I can think of one reason: scrutability. It's going to be even
| harder to understand how a response gets produced if there
| isn't even a text-based representation to help the human
| understand
| IshKebab wrote:
| I think we're already way beyond the point where anyone
| really understands how a response is produced, even without
| this.
| nfw2 wrote:
| the token generation part isn't well understood, but the
| output "chain-of-thought" used to produce the final answer
| can be scrutinized for correctness with a traditional CoT
| model (although this would require model providers to not
| hide reasoning tokens)
| anon373839 wrote:
| Indeed. Even if an LLM tells you its "reasoning" process
| step by step, it's not actually an exposition of the
| model's internal decision process. It's just more text
| that, when generated, improves the chances of a good final
| output.
| pigpop wrote:
| you can save the hidden states and convert them into a more
| interpretable format. it's still recorded and you could make
| modifications at different steps to see how that would change
| the conclusion.
| rlupi wrote:
| IMHO The problem (for us) with this approach are the logical
| consequences:
|
| 1) If large AI models become more powerful by avoiding language,
| embeddings of AI state become even more tied to the model they
| originate from than they are now.
|
| Consequence: AI progress stalls, as AI-using companies need to
| invest increasing amounts of money to reindex their growing
| corpora.
|
| This is already a problem; it becomes more of a lock-in
| mechanism.
|
| If this is overcome...
|
| 2) Embeddings become a viral mechanism: it makes sense for a
| large company that commands a market to require its suppliers
| to use the same AI models, because they can transfer state via
| embeddings rather than external formats.
|
| This allows cutting down decision mechanisms that otherwise
| require expensive coordination.
|
| Something similar will happen within companies IMHO:
| https://rlupi.com/okr-planning-as-belief-revision
|
| 3) Eventually this potentially results in another exponential
| growth and lock-in mechanism, also at the expense of most tech
| people as more and more is done outside our interface with AI
| (i.e. programming and software architecture improvements will
| themselves move below the language level, and we'll have to
| reverse engineer increasingly opaque improvements).
|
| 4) It ends with the impossibility of AI alignment.
|
| ---
|
| I have written a bit about it in the past at the start of the
| year, when I had a burnout. So, I deleted those confused
| ramblings. You can still find them on archive.org:
| https://web.archive.org/web/20240714153146/https://rlupi.com...
| layer8 wrote:
| IMO we won't have the "it" moment until we have continuous
| learning (training) in some fashion.
| mattxxx wrote:
| ^ This and we need to be continually learning on an energy
| budget similar to how much a human spends per hour.
| rlupi wrote:
| The main reason why we can't do that now is because we
| require models to be digitally reproducible (IMHO, but also
| read Geoffrey Hinton's mortal computing).
|
| The energy cost comes from error correction as much as
| training algorithms.
| pigpop wrote:
| I think this is a step in the right direction but not the end.
| it takes the sampler out of the equation during most of the
| reasoning process but it is still important for the "show your
| work" aspects of reasoning or solving a problem. balancing when
| to think against when to write down or commit to certain
| thoughts is important. there are many more pieces to the
| puzzle.
| jokethrowaway wrote:
| This sounds like brute forcing a solution to make up for lack
| of intelligence.
|
| In an IQ test, like the one in the arc agi test, a human sees
| the pattern instantly and effortlessly. o3 tries N paths until
| it stumbles on the right one and assess that there is a
| pattern.
|
| I think we need a radically different architecture, this is a
| gimmick.
| jeswin wrote:
| Interesting. Due to its emphasis on BFS, it's the opposite of
| something I've been trying (I named it the "Tree of failures").
|
| My assumption was that humans don't try a breadth-first approach.
| Instead, we split a task into a short-step (instinct and
| intuition selected), and long-step that summarizes/stores the
| next steps. The key idea is to recursively evaluate a task as a
| short-step (high-res - gets executed) and a long-step (lower-res
| - is just stored), until it succeeds or fails. If it fails, we
| must walk back keeping a summarized tree of failures in state so
| that we can exclude them in future selections.
|
| The effectiveness of instinct has a steep fall-off at longer
| distances - so it's better not to chart out a long series of steps.
| When we do BFS, we drive down the value of instinct in favor of
| compute. I guess ultimately, it depends on the type of problem
| you want to solve.
|
| Reach out to me if you want to prototype it with me.
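|
| For what it's worth, a loose sketch of the described approach
| (the names and the callbacks `propose`, `execute`, `done` are
| hypothetical, supplied by the caller):
|
|     def solve(task, failures, propose, execute, done, depth=0):
|         if done(task):
|             return task
|         if depth > 10:                    # arbitrary cutoff
|             return None
|         for step in propose(task, exclude=failures):
|             result = execute(task, step)  # high-res short step
|             found = solve(result, failures, propose, execute,
|                           done, depth + 1)
|             if found is not None:
|                 return found
|             failures.append((task, step)) # remember the dead end
|         return None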
| katamari-damacy wrote:
| that's more fit for agents, no?
| jeswin wrote:
| You're right that it's technically orthogonal to what's in
| the paper. I was trying to model the "reasoning process",
| which has general applicability depending on how/where it's
| implemented.
| viraptor wrote:
| Reminds me of what plandex does. https://plandex.ai/ It already
| does the automatic "does this need splitting into subtasks, or
| can it be solved immediately" processing.
| cube2222 wrote:
| I think the problem with long chains of steps on their own
| (without the bfs stuff) is that your failure probability
| quickly grows to unreasonable levels.
|
| Basically, if each step has a 97% chance of being completed
| correctly and your task requires 10 steps one after the other,
| the chance of success falls to 0.97^10 ≈ 74%
|
| If I understand correctly, part of the point of the BFS is to
| throw compute at it, in order to lower the failure rates. Kind
| of a "run many times in parallel and pick the best one". This
| can be effective, but also quite expensive, as seen in the
| costs OpenAI had to pay for their ARC-AGI benchmarking runs.
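|
| The compounding, spelled out:
|
|     p_step = 0.97
|     for n in (1, 5, 10, 20):
|         print(n, f"{p_step ** n:.0%}")   # 97%, 86%, 74%, 54%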
| dietr1ch wrote:
| I feel humans like doing something in between, maybe a bit like
| A* would do sometimes. I wouldn't call it A* because of the lack
| of a consistent heuristic and also the lack of strictly numeric
| evaluation, but it's in between DFS and BFS for sure (as is
| every tree search algorithm?).
|
| We go deep while we think it's a good lead, because so far
| things make sense and it'll be less work, but at some point we
| start questioning our decisions early in the descent and try
| alternatives.
| verdverm wrote:
| You may find Prioritized Grammar Enumeration as an
| interesting in-between DFS/BFS algorithm
|
| https://seminars.math.binghamton.edu/ComboSem/worm-
| chiu.pge_...
| wafflemaker wrote:
| How do you understand instinct?
|
| I bought a new SSD drive for an old laptop to avoid buying a
| new one (the x230 has an amazing keyboard), but left for another
| country for Christmas. My intuition told me to take it with me,
| but logical sense said there would be no time for such things as
| moving the OS to a new drive.
|
| My flight back to the work country got cancelled due to fog and
| I ended up spending a week longer at in-laws place, with plenty
| free time. A new 512GB drive would help me studying, giving
| plenty space for school VMs.
| kurthr wrote:
| The classic thing people say is "asking the right question"
| gets you half way there. Your approach sounds like something I
| call "getting to No" for a problem. It's sort of a combination
| of "getting to know" and the opposite of the salesman's
| "getting to Yes". When it works, it's the fastest way to prune
| off obligations.
|
| The goal is to figure out why some particular problem: isn't
| really a problem, doesn't need to be solved, can't be solved
| that way, can't really be solved (because of physics or it's
| really a different problem). As you define the problem better,
| you can rule each one out to find, the "real" problem, that you
| CAN solve, and at least one path forward. There's still many
| ways that it might not be the optimal path, but you know
| roughly how to get to somewhere better. It also trains you to
| see around obstacles to success.
|
| I've found that some of the best work I've done (especially on
| acquisitions) was in defining why NOT to do something that
| looked like a good idea (or particularly interesting to work
| on) from the outset, but was destined to fail or required
| unknown HW technology. Frankly, looking >5 years out feels like
| a coin flip, because some other competing technology could come
| along before you can get to production.
| torginus wrote:
| I don't get why you need tree search at all? What does it give
| you over a pure LLM trained to do CoT in a tree-like manner? If
| the context window's long enough, it can generate the
| reasoning-tree just by pure next-token prediction, and rather
| than BFS, it can guide the tree search with its own value
| function (which is part of the LLM itself) instead of sticking
| to hard algos like BFS and DFS.
|
| By the way, BFS sounds like it will give you thorough results,
| at the cost of increased compute. Useful for beating
| benchmarks, but probably yields marginal improvement for
| massively increased compute.
|
| Still, the improved quality could be meaningful, if it's used
| for generating training data for Llama4
| galaxyLogic wrote:
| The thing about "thinking" in problem solving, I think, is that
| thoughts often produce new questions which then guide the overall
| problem solving. I wonder, is this something like that?
| smusamashah wrote:
| I believe this was shared and discussed here a while ago and this
| article looks LLM generated. It keeps doing "let's start...".
| Either it's LLM fluff or very poor writing.
| hadjian wrote:
| If this site didn't appear here, I'd think it's a scam:
|
| - site claims to simplify papers, but the videos look AI generated
|
| - full of ads
|
| - Can't find "Coconut" on the official Meta FAIR page
|
| Is this the best site to link to?
| davidhowlett wrote:
| The official pdf for the paper is at
| https://arxiv.org/pdf/2412.06769
|
| I can find "Coconut" 54 times in the PDF. The movie does not
| look made up.
| hadjian wrote:
| I was referring to aipapersacademy and not the arxiv link.
|
| Also I didn't mean the occurrence of the word "coconut" in
| the paper, but thanks for counting.
|
| I meant their publication site: https://ai.meta.com/results/?
| content_types%5B0%5D=publicatio...
|
| The video is something I'd expect from AI.
| cornel_io wrote:
| So, what's happening here on the surface is that it's an
| optimization (fairly meaningful, from the looks of it) aimed at
| doing roughly the same things we could already do with chain-of-
| thought (CoT), but IMO the downstream effects of this sort of
| optimization could be much more meaningful.
|
| LLMs can already do a decent amount of "processing" in a single
| token generation because of the number of layers they have. The
| layers have separate weights, so it's not exactly like they're
| a recurrent network doing multiple steps, but they are layering
| sequences of context-dependent transformations on top of each
| other; no matter how you cut it, if getting to a problem's answer
| requires 100 steps, you won't be able to do it in a single token
| output from a 20 layer LLM. To some approximation, CoT is just a
| way to give the network more chances to transform the data than
| there are layers in the network - each additional token of output
| gives a shot to bake another vector the size of the token
| embedding into each layer's state in the network, enriching what
| it's computed so far.
|
| The problem with chain of thought is that as you add each new
| token, at the input level of the network, your computation is
| basically starting from scratch against the raw text, just with
| one additional token. You don't even have access to all the stuff
| you already figured out in the deepest layers of the network
| during the previous step! If you were processing "All wunguses
| are glurgles, and Joe is a wungus", then somewhere in those
| deepest layers as you're generating the next token you've almost
| certainly got some vector that basically represents "therefore
| Joe is a glurgle", but with chain of thought you've got to first
| output "t", then "h", then "e", and so on (I know those aren't
| tokens, let's pretend letter == token for argument's sake), and
| during that process almost _ALL_ of the work being done by the
| network is mere bookkeeping, slowly dumping that thought into the
| output stream. Only once you get the whole sentence out can you
| _start_ processing the next token at the first layer with the
| information that Joe is, in fact, a glurgle, in hand. Which is a
| damn shame, because it's been sitting right there in the deeper
| layers of the network parallel to previous tokens this whole
| time, it just wasn't available for the shallow layers to process
| directly because you were casting most of the info away and
| "rounding" to a single token.
|
| With Coconut's approach, you don't need to output "therefore Joe
| is a glurgle" token by token to continue the train of thought,
| you can essentially pass the entire thought through as a single
| uber-token, and the next pass can generate a new entire thought,
| and so on.
|
| It's a pretty straightforward idea, IMO the neat bit is that they
| were able to train the network to work well in this way by
| leveraging CoT. I'm guessing you probably don't need to act as if
| these are two distinct modes of operation, you could instead
| _always_ have this side channel of "continuous thought" running,
| even when you have generated a normal token, coming through as a
| separate input to the first attention block. You still might want
| to have a "thinking" token when you need to sit there and let the
| thing do more work, but you'd generally increase the information
| flow from time step to time step, which would allow the net to
| keep thinking in the background even as it's doing the gruntwork
| of outputting whatever its current "buffered" thought is.
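|
| A sketch of that "always-on side channel" variant, with a toy
| recurrent cell standing in for the transformer (all names and
| sizes are invented):
|
|     import torch, torch.nn as nn
|
|     d = 64                              # made-up model width
|     embed = nn.Embedding(1000, d)
|     carry = nn.Linear(d, d)             # hypothetical side channel
|     cell = nn.GRUCell(d, d)             # stand-in for the decoder
|     head = nn.Linear(d, 1000)
|
|     h = torch.zeros(1, d)
|     tok = torch.tensor([1])
|     for _ in range(5):
|         # normal token embedding, plus the previous step's full
|         # hidden state carried over instead of being rounded to
|         # a single token
|         x = embed(tok) + carry(h)
|         h = cell(x, h)
|         tok = head(h).argmax(-1)        # greedy decode, for brevity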
| astrange wrote:
| Why is it "continuous" thought? I don't see what is continuous -
| the values inside an LLM are discrete even if they're floating
| point.
|
| Hmm, I guess you could evaluate it at any given finite precision,
| but it would be surprising to me if that made it more accurate.
| mkl wrote:
| It's far more continuous than constantly jumping to the nearest
| token vector. The fact that real numbers are approximated by
| floating point isn't really relevant.
| layer8 wrote:
| If you are continuously complaining, does it mean you do it
| non-discretely and with infinite precision?
| astrange wrote:
| It apparently uses the same iteration strategy as tokenized
| thinking, so that's not it.
|
| > Since both strategies provided comparable results, the
| researchers opted for using a constant number of thoughts for
| simplicity.
| HarHarVeryFunny wrote:
| > the values inside an LLM are discrete even if they're
| floating point.
|
| If that were true they'd never be able to learn anything -
| neural nets depend on continuous gradients to learn. Weights
| get updated by incremental/continuous amounts based on
| gradients.
|
| Even at the output of an LLM, where the internal embeddings
| have been mapped to token probabilities, those probabilities
| are also continuous. It's only when you sample from the model
| that a continuous probability becomes a discrete chosen token.
| unsupp0rted wrote:
| I'm excited for this to filter down to the Rayban Meta glasses.
| Right now the AI is about as helpful as Siri (i.e. it can tell me
| the weather 6 times out of 10)
| mattfrommars wrote:
| Wondering about folks who keep up to date with the industry,
|
| Does anyone use specific keywords or tools to get latest LLM
| research and their ideas?
|
| Something like Google Scholar + keyword "LLM"?
| hrtk wrote:
| I read hacker news daily
| Agentus wrote:
| Yeah, what is a general tutorial for this? Is there a website
| that keeps track of which keywords to follow, or a website
| that summarizes core NN tech and frontier stuff that's
| promising?
| melvinmelih wrote:
| You can also subscribe to arxiv email notifications directly,
| but since there's 20-30 AI papers coming out per day, it can be
| a bit overwhelming.
|
| Instructions: https://info.arxiv.org/help/subscribe.html
| maxrmk wrote:
| As much as I hate it, I use twitter to follow a bunch of people
| who work at fair/openai/etc and that's been a pretty good
| source. There's also a "daily papers" newsletter from
| huggingface, but it's pretty hit or miss.
| barrenko wrote:
| Yes, it's all definitely X first of all.
| jokethrowaway wrote:
| Definitely Twitter.
|
| Some linkedin too.
| marojejian wrote:
| Dupe from 20 days ago:
| https://news.ycombinator.com/item?id=42385412
| t0lo wrote:
| Heh that's me, guess they weren't ready for it. Also the
| decoder (where I linked) is one of the best AI-only news sites
| I've found.
| ttul wrote:
| TL;DR: Meta started with a pre-trained language model. They then
| fine-tuned it on step-by-step reasoning examples as you would do
| if you wanted your model to become particularly good at chain of
| thought reasoning.
|
| However, they also introduced a couple of new tokens. The <bot>
| token tells the model to go into latent space thought mode
| ("beginning of thought"). The <eot> token ends latent space
| thought mode. While in this mode, the model auto-regressively
| iterates by copying its final hidden layer back onto its input
| layer, obviously generating new tokens at the output with each
| inference step as it always does.
|
| The idea is that by passing the final hidden layer back through a
| few times, the model can squeeze more insight from the context.
| And that's precisely what they found was true.
|
| Training involves progressively replacing language reasoning
| steps with latent space auto-regression steps. So for instance,
| you might have a math problem in the training data and at first
| the model is fed all of the steps of the math problem in language
| form. But in later iterations of training, step one is replaced
| with latent space auto-regression. And then step two as well,
| then also step three, etc...
|
| Eventually, the model learns to enable latent space thinking mode
| by itself by generating the <bot> token and to end it by
| generating the <eot> token.
|
| Pretty ingenious!
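|
| A sketch of the latent-mode inference loop, assuming an HF-style
| causal LM that accepts `inputs_embeds` and returns hidden states
| (<bot>/<eot> handling and the final language-mode decode are
| omitted; this is not the authors' code):
|
|     def latent_thoughts(model, prompt_ids, num_thoughts=3):
|         out = model(input_ids=prompt_ids, use_cache=True,
|                     output_hidden_states=True)
|         for _ in range(num_thoughts):
|             h = out.hidden_states[-1][:, -1:, :]  # last hidden state
|             # fed back as the next input "token", nothing sampled
|             out = model(inputs_embeds=h,
|                         past_key_values=out.past_key_values,
|                         use_cache=True, output_hidden_states=True)
|         return out  # out.logits resumes normal decoding after <eot>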
| avodonosov wrote:
| Thank you for the summary, useful for me as I only managed to
| skim through the first half.
|
| But one correction, probably, regarding this bit:
|
| > While in this [latent space thought] mode, the model auto-
| regressive iterates by copying its final hidden layer back onto
| its input layer, obviously generating new tokens at the output
| with each inference step as it always does.
|
| I have the impression that output tokens are not generated while in
| the latent thought mode.
| treprinum wrote:
| Would that mean that, at some point in the future, we would
| need to exchange latent "embeddings" between various "reasoning"
| models to emulate thinking, and an LLM will just be about
| converting to/from human language when interfacing with mere
| humans?
___________________________________________________________________
(page generated 2024-12-31 23:00 UTC)