[HN Gopher] Coconut by Meta AI - Better LLM Reasoning with Chain...
       ___________________________________________________________________
        
       Coconut by Meta AI - Better LLM Reasoning with Chain of Continuous
       Thought?
        
       Author : TaurenHunter
       Score  : 317 points
       Date   : 2024-12-31 00:54 UTC (22 hours ago)
        
 (HTM) web link (aipapersacademy.com)
 (TXT) w3m dump (aipapersacademy.com)
        
       | fosterfriends wrote:
       | Once again, we see Meta being more open than OpenAI. I'm loving
       | that their business incentive is aligned with open sourcing and
       | commodifying state-of-the-art LLM technology. Keep em coming
        
         | fragmede wrote:
         | don't buy their bullshit. it's not open source.
        
           | speedgoose wrote:
           | Yes it's more about open weights. I also think that you would
           | need the training data to consider it open source.
           | 
           | Open weights is still appreciated and they probably train on
           | data they don't have the license to open source.
        
           | astrange wrote:
           | I'm not sure open source is a useful concept for something
            | that takes millions of dollars to compile from its source.
        
         | blackeyeblitzar wrote:
         | I mean they have no way to monetize LLMs as well as others, so
         | they're working on it and giving it away to not look irrelevant
         | and to weaken anyone who may make money off this tech and
         | threaten them in the future. Meanwhile there is a danger they
         | impose their long standing invisible "moderation" on everyone
         | else once they starve all the startups of revenue by giving
         | this away. We'll just be left with the same big tech overlords
         | to choose from.
         | 
         | Oh and it still isn't open source even though people like Yann
         | LeCun dishonestly claim it is. Only OLMo is truly open source
         | among competitive models, as far as I know:
         | https://allenai.org/blog/olmo2
        
           | spencerflem wrote:
           | Facebook would rather do no moderation, it's an expense for
           | them.
           | 
           | They do it to make the platform more pleasant so that people
           | stay on it
        
             | graemep wrote:
             | > They do it to make the platform more pleasant so that
             | people stay on it
             | 
             | Almost everything unpleasant I see on FB is stuff that the
             | FB algorithm shows me - not things posted by FB friends, or
             | pages I follow or groups I am in.
        
               | nightski wrote:
               | Everything you see on FB is what the algorithm shows you,
               | unpleasant or not. So it's a tautology that everything
               | unpleasant would be from the algorithm.
        
             | blackeyeblitzar wrote:
             | No they do it to support their owners' and employees'
             | biases. It doesn't make the platform more pleasant for the
             | half that gets censored. That's leaving aside the feed not
             | remembering the choice to view chronologically ordered
             | posts, the inability to easily track actual people in my
             | life, the addictive algorithms, the clickbait that causes
             | mental health issues for teens, etc.
        
               | creato wrote:
               | 99% of FB's moderation has nothing to do with "biases",
               | unless you think FB is biased against spam, scams, and
                | all the other dregs of the internet that incessantly pop
               | up anywhere users can post content.
        
               | TheOtherHobbes wrote:
               | Quite a few people left Threads for Bluesky because
               | progressive posts were being removed while far-right,
               | antivax, etc content was allowed to stand even though it
               | was reported.
               | 
               | At best the algo is imperfect. At worst it really does
               | seem oddly selective.
        
               | dudeinjapan wrote:
               | I am a humble Cialis salesman, like my father and
               | grandfather before me. I confirm Facebook is biased
               | against our profession. (My grandfather also moonlighted
               | as a Barrister representing the estates of deceased
               | African royalty--it was always so difficult to track down
               | their heirs.)
        
               | roywiggins wrote:
               | The stuff that Facebook moderators are actually tasked
               | with removing is really awful, bad enough to produce
               | severe psychological effects in the moderators.
               | 
               | Facebook pays people to look at and remove this stuff
               | because the platform would not survive if it wasn't
               | removed before you or I saw it. Do they also enforce
               | other corporate values? Yeah, probably. That doesn't seem
               | to be the main job though, they have their hands full
               | dealing with the worst content in the world.
               | 
               | https://amp-theguardian-
               | com.cdn.ampproject.org/v/s/amp.thegu...
               | 
               | > The images and videos including necrophilia, bestiality
               | and self-harm caused some moderators to faint, vomit,
               | scream and run away from their desks...
               | 
               | > Some reported marriage breakdown and the collapse of
               | desire for sexual intimacy, and losing connection with
               | their families. Some whose job was to remove videos
               | uploaded by terrorist and rebel groups were afraid they
               | were being watched and targeted, and that if they
               | returned home they would be hunted and killed.
        
             | cess11 wrote:
             | It's more likely they do it to keep their people from being
             | coerced to visit the Hague. What they did in Myanmar got a
             | lot of press and a case at the ICJ, and similar launches of
             | 'free internet' elsewhere had similar results.
        
             | rlupi wrote:
             | (tongue in cheek comment) I wonder if FB moderation now or
             | eventually will be just a prompt to a sufficiently evolved
             | and unhinged AI model:
             | 
             | > FB or 4chan?
        
           | BoorishBears wrote:
           | > they have no way to monetize LLMs as well as others
           | 
           | Random nobodies are putting together companies to monetize
           | generative AI and getting bought out a couple of years later,
           | you think Meta couldn't figure out how to deploy their own
           | models to an API and stick up a billing interface if they
           | really wanted to? (or even buy a company that does already?)
           | 
           | > they starve all the startups of revenue by giving this away
           | 
            | Would you say startups like Deepseek have been hurt or helped
           | by their (even partial) openness?
           | 
           | In fact, how does this track with your first statement?
           | They're not monetizing this: so their startup competition can
           | actually serve their models to gain revenue _which they then
            | turn around and use to train competitor models_ (we've already
           | seen this with Fireworks.ai)
           | 
           | You seem to underestimate how much of the value in LLMs is
           | _productizing_ them. The margins on per-token usage are
           | insane, Meta not taking that margin is creating a huge
           | opportunity for a wave of startups in so many directions...
           | 
           | > Only OLMo is truly open source among competitive models
           | 
           | Synthetic data from competitor models was a huge part of
           | that. It would seem no one is fighting the startups as hard
           | as you're claiming they are.
        
             | bongodongobob wrote:
             | All the LLM companies are going to eat those "product
             | companies" lunch in a few years. Why would I use product X
              | when it's inevitably going to be baked into the actual
             | tech itself? Those product companies are just wrappers and
             | have even less of a moat than the LLM companies. The very
             | fact that random nobodies are doing this should signal
             | there isn't a lot of real value there. Yes, there is some
             | money to be made right now but it reminds me a lot of the
             | videogame bust and dotcom bust. A LOT of companies are
             | wasting a crazy amount of money on "solutions" that will be
             | obsolete in a few years.
        
               | BoorishBears wrote:
               | Productization in this context is creating APIs for
               | Meta's models.
               | 
               | Fireworks.ai, Together.ai, and literal boatloads of other
               | startups are making real money just efficiently serving
               | up these models that Meta is supposedly using to... choke
               | out startups.
               | 
               | The comment I replied to is under the mistaken idea that
               | the presence of free models from Meta has a chilling
               | effect on startups trying to build their own models, but
               | right now the biggest barriers are capital and data.
               | 
               | Meta updated Llama to allow for synthetic generation, and
                | they're even partnering with these startups to give them
               | distribution and day 0 access to the models.
               | 
               | -
               | 
               | If anything I'd say Meta is actively fighting against the
               | big tech overlords the comment thinks they're trying to
               | join. Even before Ilya mentioned it, it was clear to me
               | that the power of post-training was going to become more
               | and more important (I've literally built a business on
               | it).
               | 
               | Llama represents a real ongoing chance for tiny startups
               | with minimal resources to get into the fray very
               | affordably (through either offering inference, or post-
               | training for a specific task, etc.), scale revenue, and
               | then start to compete against much larger, resource rich
               | companies.
        
           | jayd16 wrote:
           | Is there any vendor lock-in with this conspiracy? Even if
           | startups are pushed out of the spotlight, what stops them
           | from competing? If the meta model is bad, won't it be even
           | easier to make an alternative in the future?
        
           | scarface_74 wrote:
           | They are definitely making some money off of their licensing
           | to AWS as part of the bedrock offering. Facebook's licensing
           | is such that they aren't going to let happen to them what
           | happened to ElasticSearch, Redis, etc.
           | 
           | I'm okay with that.
        
           | rlupi wrote:
           | In the agentic era, the new Ads eyeballs are the LLMs
           | training corpus (IMHO).
        
       | throwup238 wrote:
       | Master coconut! I don't know if that's an Archer reference or a
       | Frisky Dingo reference.
       | 
       | It's fascinating how fast the competitors are catching up to each
       | other. Can't wait for seven different SkyNets to compete for
       | dominance.
        
         | yard2010 wrote:
         | Both! And/or, either
        
           | throwaway314155 wrote:
           | A little column a, a little column b.
        
       | Klathmon wrote:
       | So is the big improvement here simply skipping the
       | unembedding/embedding step for internal thoughts? Or is it mainly
       | in the training methods to teach the CoT and how to switch
       | between "latent thought" and text output?
       | 
       | It's really interesting that a fixed number of "latent thoughts"
       | performed as well as a binary classifier! I didn't expect that at
       | all, the way OpenAI talks about CoT it seems the ability to let
       | it "keep thinking" let's them continually score higher on
       | benchmarks while throwing eye watering amounts of compute at the
       | inference.
        
         | Crye wrote:
         | It mentioned not penalizing/rewarding the model for thoughts
         | only rewarding the answer after the thought. I am curious how
         | back propagation works then.
        
           | yorwba wrote:
           | The tokens of the answer depend on the preceding continuous
           | thought vectors, which you can backprop through in the usual
           | way.
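            | 
            | A rough sketch with a toy recurrent stand-in (not the
            | paper's code; the GRU cell is just a placeholder for the
            | transformer stack): the loss touches only the answer
            | logits, and autograd carries gradients back through the
            | thought vectors.
            | 
            |   import torch
            |   import torch.nn as nn
            | 
            |   torch.manual_seed(0)
            |   d, vocab = 16, 50
            |   core = nn.GRUCell(d, d)   # stand-in for the model
            |   embed = nn.Embedding(vocab, d)
            |   head = nn.Linear(d, vocab)
            | 
            |   question = torch.tensor([[3], [7], [1]])
            |   answer = torch.tensor([9])
            | 
            |   h = torch.zeros(1, d)
            |   for tok in question:      # read the question
            |       h = core(embed(tok), h)
            | 
            |   for _ in range(3):        # continuous thoughts: hidden
            |       h = core(h, h)        # state fed back as input, no
            |                             # sampling and no loss here
            | 
            |   logits = head(h)          # only the answer is supervised
            |   loss = nn.functional.cross_entropy(logits, answer)
            |   loss.backward()           # grads reach the thought steps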
        
           | lovasoa wrote:
           | The researchers leverage existing language Chain-of-Thought
           | data, where each sample consists of a question, reasoning
           | steps, and the final answer. At stage 0, the model does not
           | generate any thought tokens, and is just trained to yield the
           | reasoning traces and correct answers for the Chain-of-Thought
           | samples. In the subsequent stages, at each stage, we remove
           | one reasoning step from the sample, and instead add thought
           | tokens. In the illustration above, a single thought token is
           | added in each stage, instead of a single reasoning step, but
           | this is controlled by a hyperparameter 'c'.
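            | 
            | A sketch of what the stage-k data preparation could look
            | like (names and the <thought> placeholder are invented; in
            | the real method those positions carry continuous vectors,
            | not text tokens):
            | 
            |   def stage_example(question, steps, answer, stage, c=1):
            |       # drop the first `stage` reasoning steps and mark
            |       # where the continuous thoughts go instead
            |       kept = steps[stage:]
            |       if stage == 0:        # stage 0: plain CoT sample
            |           return " ".join([question, *kept, answer])
            |       thoughts = ["<thought>"] * (stage * c)
            |       return " ".join([question, "<bot>", *thoughts,
            |                        "<eot>", *kept, answer])
            | 
            |   q = "Q: 3 apples plus 4 apples?"
            |   steps = ["3 + 4 = 7"]
            |   print(stage_example(q, steps, "A: 7", stage=0))
            |   print(stage_example(q, steps, "A: 7", stage=1))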
        
       | viraptor wrote:
       | I was waiting for something like that to happen! Next step -
       | creating a human-language-free representation. I believe that
       | once a group of llms can communicate only in embeddings tuned
       | without any human text input, we're going to open a completely
       | new chapter in AI.
        
         | bboygravity wrote:
         | How does a group help anything?
         | 
         | If you put 1000 dumb people together, they don't magically
         | become smart?
        
           | sunshinerag wrote:
           | Wait what ... how does democracy work then?
        
             | nfw2 wrote:
             | the benefit of democracy is primarily that it prevents
             | governments from doing bad things, less so that it empowers
             | more effective governance
        
               | mathgeek wrote:
               | It can do either, and can fail to do either. It's the
               | people having power that enables the outcomes, not the
               | system itself. Democracy just grants the power to a
               | broader set of people.
        
             | optimalsolver wrote:
             | It doesn't.
        
             | coldtea wrote:
             | Democracy is not about being smart or dumb.
             | 
              | It's about everybody having a say in the decisions of
              | government that affect them.
             | 
             | The failure of democracy as a system is not when people
             | make dumb decisions (experts and high-IQ people have made
             | some of the most stupid and catastrophic decisions in
             | history), but when people's collective decisions are not
             | being respected.
        
           | IshKebab wrote:
           | If you put 1000 people who can't talk together they will
           | create language so they can communicate. He's saying if we
           | put LLMs together and don't force them to use English to
           | communicate then they'll create their own language which may
           | be superior for LLMs to English.
           | 
           | May be true but who knows.
           | 
           | I wonder if anyone has somehow tested the Sapir-Whorf
            | hypothesis for LLMs by training them on different
           | languages and comparing task performance. I guess it's too
           | difficult to get a large equivalent training set in different
           | languages.
        
             | wodderam wrote:
             | It feels like an exercise in anthropomorphization to me.
             | 
             | Sapir-Whorf hypothesis is generally not considered to be
             | reality. It makes intuitive sense but is wrong.
             | 
             | There are hours of podcasts with Chomsky talking about
              | LLMs. The gist of which is that LLMs are extracting surface
             | level statistical structure of language that will be good
             | for routine coding and not much else. It is easy to infer
             | that Chomsky would believe this idea to be utter nonsense.
             | 
             | I believe even the idea of getting a 1000 people together
             | and we agree to label a rock "rock", a tree "tree", a bird
             | "bird" is not even how human language works. Something that
             | is completely counter intuitive.
             | 
             | Reading the paper, no one believes a hidden markov model is
             | creating some kind of new thought process in the hidden
             | state.
             | 
              | Though I certainly could have no idea what I am talking
              | about with all this, and may have pieced together parts
              | that make no sense, while this is actually a breakthrough
              | path to AGI.
        
               | pjerem wrote:
               | Well maybe not 1000 people but to our knowledge, the
               | human brain is actually made of physically independent
                | zones that barely communicate with each other, except with
                | the zone that takes all the outputs together and tries to do
               | something coherent with all the garbage.
               | 
               | Idk if this could work with LLMs, especially because all
               | the brain zones are somehow specialized into something
               | while two LLMs are just identical machines. But we also
                | know that the specialization isn't that hardcoded: we
               | know that people losing half their brain (after a stroke)
               | can still relearn things that were managed in the "dead"
               | part.
               | 
               | I don't know, please correct my errors, I was just
               | thinking aloud to say that multiple independent agents
               | working together may be how "intelligence" already works
                | in the biological world, so why not for AIs?
        
               | PittleyDunkin wrote:
               | > Sapir-Whorf hypothesis is generally not considered to
               | be reality.
               | 
               | This is true only in the strictest terms of the
               | hypothesis, i.e. linguistic determinism. Language still
               | encodes a lot of culture (& hence norms and values) in
               | its grammar & diction--this isn't very controversial.
               | 
               | Granted, I don't think this is that related to the topic
               | at hand. There's bias all over the decisions in how to
               | train and what to train on; choice of language is just
               | one facet of that.
        
               | coldtea wrote:
               | > _Sapir-Whorf hypothesis is generally not considered to
               | be reality. It makes intuitive sense but is wrong_
               | 
               | Strong S-W (full determinism) might not be, but there's
               | hardly a clear cut consensus on the general case.
               | 
               | And the whole "scientific field" is more like psychology,
               | with people exchanging and shooting down ideas, and less
               | like Math and Physics, so any consensus is equally likely
               | to be a trend rather than reflecting some hard measurable
               | understanding.
               | 
                | I'd say the idea that S-W is not true to some degree is
                | naive.
        
               | digbybk wrote:
               | > There are hours of podcasts with Chomsky talking about
               | LLMs
               | 
               | I'm not an expert, but it seems like Chomsky's views have
               | pretty much been falsified at this point. He's been
               | saying for a long time that neural networks are a dead
               | end. But there hasn't been anything close to a working
               | implementation of his theory of language, and meanwhile
               | the learning approach has proven itself to be effective
               | beyond any reasonable doubt. I've been interested in
               | Chomsky for a long time but when I hear him say "there's
               | nothing interesting to learn from artificial neural
               | networks" it just sounds like a man that doesn't want to
               | admit he's been wrong all this time. There is _nothing_
               | for a linguist to learn from an actually working
               | artificial language model? How can that possibly be?
               | There were two approaches - rule-based vs learning - and
               | who came out on top is pretty damn obvious at this point.
        
               | jokethrowaway wrote:
               | What can you learn from something parroting data we
               | already have?
               | 
               | Similarly, we are now finding that training on synthetic
               | data is not helpful.
               | 
               | What would have happened if we invested 1/100 of what we
               | spent on LLM on the rule based approach?
        
             | stingraycharles wrote:
             | Is everything in LLMs translated back to English before
             | interpretation?
             | 
             | It works fairly well in my native language, I'm surprised
             | to learn that things get translated back.
        
               | astrange wrote:
               | LLMs have no fixed internal representation - they barely
               | have internal anything - so no, there is no translation.
               | 
               | But there's also no guarantee any particular query
               | generalizes (vs is memorized), so it might only be able
               | to answer some queries in some languages.
        
           | littlestymaar wrote:
           | > If you put 1000 dumb people together, they don't magically
           | become smart?
           | 
           | 1000 is probably too high, but groups of people are in fact
           | more intelligent than individuals (though for humans it is
           | likely because recognizing a correct answer is easier than
           | finding it in the first place)
        
             | nfw2 wrote:
             | depends on the circumstances. lin-manuel miranda can
             | probably write a better musical by himself than a team of
             | 20 people with equal input would.
             | 
             | also, the bottlenecks that teamwork helps solve (eg the
             | high cost of gaining expertise and low throughput of
             | reasoning capacity) may not be that relevant in the ai age
        
               | littlestymaar wrote:
               | > by himself than a team of 20 people with equal input
               | would.
               | 
               | Sure, but the result would still be far better than the
               | average of the output of the 20 individuals taken alone.
               | 
               | > also, the bottlenecks that teamwork helps solve (eg the
               | high cost of gaining expertise and low throughput of
               | reasoning capacity) may not be that relevant in the ai
               | age
               | 
               | It's always tempting to anthropomorphize these systems
               | and conclude that what works for us would work for them,
               | but yes we don't really know if it would bring anything
               | to AI.
        
             | TheOtherHobbes wrote:
             | _Functional_ groups which work well together, include open
             | sharing of research and ideas, persistence of best output,
             | are dedicated to realism, and are more focussed on problem
             | solving than status display, will be smarter. The group
             | works like a filter which generates multiple solutions and
             | selects, remembers, and abstracts the best.
             | 
             | Dysfunctional groups which do the opposite will be
             | catastrophically stupid.
             | 
             | There have been plenty of dysfunctional groups in history.
        
           | JFingleton wrote:
           | > If you put 1000 dumb people together, they don't magically
           | become smart?
           | 
           | Do they not become smart*er* though?
        
             | computably wrote:
             | "Smarter" is too vague. A group can compensate for
             | individual weaknesses or even converge on a hard-to-make
             | prediction given sufficiently uncorrelated outputs;
             | basically the idea behind ensemble models / wisdom of the
             | crowds. But a group of 1000 dumb apes would never achieve
             | categorically-above-ape intelligence, probably not even
             | "genius" ape intelligence. Groups of unintelligent agents
             | come with downsides as well, like the ant death spiral.
        
               | coldtea wrote:
               | > _But a group of 1000 dumb apes would never achieve
               | categorically-above-ape intelligence_
               | 
               | And yet, here we are.
               | 
               | A group of 1000 apes is large enough to have offspring
               | and, given time, go through evolution.
        
           | mromanuk wrote:
           | Because group estimation is superior to individual
           | estimations: The phenomenon is called wisdom of the crowds.
           | When a group of people independently estimate something,
           | individual errors tend to cancel each other out, leading to a
           | surprisingly accurate collective result. This works because
           | of:
           | 
            | - Diversity of opinions: different perspectives bring a
            | range of estimates.
            | - Independence: errors aren't systematically biased, as long
            | as individuals estimate without external influence.
            | - Error averaging: overestimations and underestimations
            | balance out when averaged.
            | - Law of large numbers: more participants increase accuracy
            | by minimizing random errors.
            | 
            | It was demonstrated by Francis Galton in 1906, where a
            | crowd's average guess of an ox's weight was almost spot-on.
            | (Estimates must be independent and reasonably informed for
            | this to work.)
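            | 
            | A toy simulation of the error-averaging point (numbers
            | invented; guesses assumed unbiased and independent):
            | 
            |   import random
            |   random.seed(1)
            | 
            |   true_weight = 1198   # roughly Galton's ox, in pounds
            |   guesses = [true_weight + random.gauss(0, 80)
            |              for _ in range(800)]
            | 
            |   crowd = sum(guesses) / len(guesses)
            |   avg_err = sum(abs(g - true_weight)
            |                 for g in guesses) / len(guesses)
            |   print(round(crowd))     # very close to 1198
            |   print(round(avg_err))   # a typical individual is ~65 off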
        
           | senectus1 wrote:
            | They kinda do... It's how cities work.
           | 
            | People learn by being around others who are both successful
            | and unsuccessful.
        
           | coldtea wrote:
           | Isn't that the very case behind the "wisdom of crowds" thing?
        
             | amelius wrote:
             | Looking at the current state of democracies around the
             | world, my hopes are not on "wisdom of the crowds".
        
               | bee_rider wrote:
               | If you think the democracies are doing bad, you should
               | see the autocracies!
        
               | amelius wrote:
               | You mean the thing democracies are turning into, thanks
               | to social (crowd wisdom) media?
        
               | bee_rider wrote:
               | I don't think social media really is crowd wisdom at all.
               | It is built to pander to our worst impulses (I think,
               | knowingly and openly, right? The algorithm selects for
               | engagement, not learning and growing), and I'd be
               | surprised if it isn't producing a feedback loop as well
               | (perhaps as an unintentional side effect). The wisdom of
               | the crowds hypothesis relies on a random sampling, we're
               | intentionally applying a skew toward the angry and
               | shallow.
        
           | konart wrote:
           | Not magically. Our great ancestors were pretty dumb, but they
           | were getting smarter and better because of sharing their
           | knowledge.
        
             | ulbu wrote:
             | they were not one bit dumber than you.
        
               | EliBullockPapa wrote:
               | Average intelligence measures have risen substantially
               | since early 1900s
               | 
               | https://en.wikipedia.org/wiki/Flynn_effect
        
             | pigpop wrote:
             | yes they got "smarter" by compiling a corpus of knowledge
             | which future generations could train on.
             | 
             | sarcasm aside, throwing away the existing corpus in favor
             | of creating a new one from scratch seems misguided.
             | 
             | this paper isn't about creating a new language, they are
             | omitting the sampler that chooses a single token in favor
              | of sending the entire end state back into the model like a
             | superposition of tokens. that's the breadth first search
             | part, they don't collapse the choice down to a single token
             | before continuing so it effectively operates on all of the
             | possible tokens each step until it decides it's done.
             | 
             | it would be interesting to try this with similar models
             | that had slightly different post training if you could
             | devise a good way to choose the best answer or combine the
             | outputs effectively or feed the output of a downstream
              | model back into the initial model, etc., but I'm not sure
             | if there'd necessarily be any benefit to this over using a
             | single specialized model.
        
         | jkingsman wrote:
         | How does one impart textual knowledge discovered by humans
         | without language?
        
           | thelittleone wrote:
           | Couldn't we use an AI model trained on historical text data
           | (up to today) to predict likely events for tomorrow? Taking
           | this further, a sufficiently advanced AI system could
           | potentially analyze human-generated text up to any given
           | point in history to understand patterns of human thought and
           | behavior, then project those patterns forward. This speaks to
           | your point about human language - while we need text data for
           | initial training, the AI's internal representations and
           | predictions could potentially transcend human language
           | constraints.
        
           | viraptor wrote:
           | The training of the LLM itself would still use the human
           | language. But you could add an extra channel that's never
           | given any text or direct dataset training. Keep it purely a
           | connection between hidden layers of different instances of
           | LLM and train using the usual loss of perplexity or similar
           | metric.
           | 
           | The interesting thing then would be - does it converge to
           | similar embedding space as the input, or can LLMs create a
           | more efficient "language".
        
         | wruza wrote:
         | I thought about it too (layman). When I learned about
         | embeddings it almost immediately clicked as a sort of an
         | ascended language, not sure why no one seems to talk about it.
         | Exchanging embeddings must be so much "wider" communication
         | channel than speaking real language. And in contrast to a
         | language embeddings are (iiuc) continuous, i.e. you can rotate
          | a vector continuously and it will smoothly trace the changes
         | between A and B. I can picture communicating in something like
         | https://www.google.com/search?q=charlie+conspiracy+meme&udm=...
         | - embedding difference vectors, but it's all crystal clear and
         | is a natural language for an llm, cause any vector combination
         | points to a correct "inner screen" image/concept/younameit.
         | 
         | Or maybe this is my own ignorant confabulation, so nvm.
        
         | mckirk wrote:
         | This is actually something you probably want to avoid, if at
         | all possible, because it makes it very hard to maintain insight
         | into what the AIs are communicating among them. But that
         | insight is crucial to stay informed about their progress in
         | taking over the world, etc.
        
           | dwohnitmok wrote:
           | Yes! We should be extremely cautious about embracing
           | approaches that make LLMs even more inscrutable. Having CoT,
           | however unreliable it is, is nonetheless a huge boon for
           | model evaluation that we should not give up so lightly.
        
         | blizdiddy wrote:
         | That came out a few weeks ago from meta. Large Concept Models
         | 
         | https://ai.meta.com/research/publications/large-concept-mode...
        
       | davidclark wrote:
       | Is this article AI-generated? This website appears to do a lot of
       | "diving in".
        
       | zombiwoof wrote:
        | Will this allow Facebook's new user base of AI-generated characters
       | to interact with themselves better?
        
       | behnamoh wrote:
       | There was no reason to call it something it's not ("chain of
        | cont. thought" ≠ coconut).
        
         | CGamesPlay wrote:
         | Is your complaint here that the paper is not discussing a
         | literal coconut?
        
           | ripped_britches wrote:
           | We desperately need more literal coconut coverage here on HN
        
             | BoorishBears wrote:
             | Not just any regular old coconuts "Coconut by Meta AI -
             | Better LLM Reasoning with Chain of Continuous Thought?"
             | coconuts
             | 
             | (Sometimes acronyms in titles are vague/misleading... this
             | was not one of those times)
        
           | layer8 wrote:
           | To be fair, it's not even a metaphorical coconut. ;)
        
         | gloosx wrote:
         | for sure, chocothot aligns better with letters
        
       | CGamesPlay wrote:
       | Paper: https://arxiv.org/abs/2412.06769
       | 
        | The link is in the OP, hidden away in an image caption for some
       | reason.
        
       | MarkMarine wrote:
       | This ought to make it much easier for the AI to lie to us without
       | getting caught.
       | 
       | https://www.transformernews.ai/p/openais-new-model-tried-to-...
        
         | h8hawk wrote:
         | The article is pure nonsense, and even the author added an
         | update admitting it missed important context.
        
         | throw-qqqqq wrote:
         | Have you read the paper discussed in the article?
         | 
         | https://static1.squarespace.com/static/6593e7097565990e65c88...
         | 
         | From the abstract:
         | 
         | > We study whether models have the capability to scheme in
         | pursuit of a goal that we provide in-context and instruct the
         | model to strongly follow.
         | 
         | > Analysis of the models' chains-of-thought reveals that models
         | explicitly reason about these deceptive strategies, providing
         | evidence that the scheming behavior is not accidental.
         | 
         | ---
         | 
         | It is impressive work for sure, but it comes down to prompting
         | IMHO. They almost instruct the AI to plot and scheme here, with
         | the goal of finding out if it CAN do this at all.
         | 
         | The AI is not very covert about it either! It directly admits
         | its scheming in the "thinking" outputs :D
         | 
         | IMO this is just another issue of alignment.
        
       | opdahl wrote:
       | This is super cool and something I've been waiting on. Would be
       | interesting to intersperse these thinking steps into token
        | generation. What would be the effect of adding, let's say, 5
       | thinking <<thoughts>> for every 50 generated tokens?
        
       | jkelleyrtp wrote:
       | I think this might be the "it" moment for AI/LLMs. I was hiking
       | with a friend recently and we talked about this at length.
       | 
        | The ARC-AGI results from o3 are apparently a result of chain of
        | thought given enough time to explore a solution space. Reasoning
        | might be simply a higher dimensional form of Rubik's cube solving.
       | BFS, search, back-tracking, etc. It seems unlikely that humans
       | think in "tokens" so why do LLMs?
       | 
       | By staying in latent space, the models are free to describe an
       | "idea" in higher resolution than what language allows. English is
        | coarse-grained. Latent space is a much finer representation of
       | ideas and their interplay.
       | 
       | Latent space is also much cheaper to execute in. The model can
       | think without the language encoding/decoding step. This lets it
       | branch out hundreds of ideas and explore only the most useful
       | ones in a fraction of time that reasoning "out-loud" would take.
       | 
       | The states also don't need to be tied to language. Feed in a
       | robot's state, time series data, or any abstract data. Reason in
       | category theory or linear algebra or complex analysis. Humans are
       | hard wired for one set of math - an abstract latent space can
       | represent anything.
       | 
       | I'm a bit disappointed OpenAI didn't stumble on this first. I've
       | been skeptical of LLMs since their big debut last year. LLMs seem
       | like a great way of solving language, but reasoning is much more
       | complex. Once you grok the math behind the current models, you
       | immediately question why the encoding/decoding step is there.
       | Diffusion models are incredible but it felt that LLMs lacked the
       | same creativity. Encoding/decoding forces a token-based
       | discretization and therefore a loss of complexity.
       | 
       | With the byte-latent paper it was quite clear we'd see this
       | paper. This truly might be the "it" moment.
        
         | otikik wrote:
         | > It seems unlikely that humans think in "tokens" so why do
         | LLMs?
         | 
         | I can think of one reason: scrutability. It's going to be even
         | harder to understand how a response gets produced if there
         | isn't even a text-based representation to help the human
         | understand
        
           | IshKebab wrote:
           | I think we're already way beyond the point where anyone
           | really understands how a response is produced, even without
           | this.
        
             | nfw2 wrote:
             | the token generation part isn't well understood, but the
             | output "chain-of-thought" used to produce the final answer
             | can be scrutinized for correctness with a traditional CoT
             | model (although this would require model providers to not
             | hide reasoning tokens)
        
             | anon373839 wrote:
             | Indeed. Even if an LLM tells you its "reasoning" process
             | step by step, it's not actually an exposition of the
             | model's internal decision process. It's just more text
             | that, when generated, improves the chances of a good final
             | output.
        
           | pigpop wrote:
           | you can save the hidden states and convert them into a more
           | interpretable format. it's still recorded and you could make
           | modifications at different steps to see how that would change
           | the conclusion.
        
         | rlupi wrote:
         | IMHO The problem (for us) with this approach are the logical
         | consequences:
         | 
          | 1) if large AI models become more powerful by avoiding language,
          | embeddings of AI state become even more tied to the model they
          | originate from than they are now
         | 
         | Consequence: AI progress stalls, as AI user companies need to
          | invest increasing amounts of money to reindex their growing
         | corpuses.
         | 
         | This is already a problem, it becomes more of a lock-in
         | mechanism.
         | 
         | If this is overcome...
         | 
         | 2) Embeddings become a viral mechanism: it makes sense for a
          | large company that commands a market to require its suppliers
         | to use the same AI models, because they can transfer state via
         | embeddings rather than external formats.
         | 
          | This allows cutting out decision mechanisms that otherwise
          | require expensive coordination.
         | 
         | Something similar will happen within companies IMHO:
         | https://rlupi.com/okr-planning-as-belief-revision
         | 
         | 3) Eventually this potentially results in another exponential
         | growth and lock-in mechanism, also at the expense of most tech
         | people as more and more is done outside our interface with AI
          | (i.e. programming and software architecture improvements will
          | themselves move below the language level, and we'll have to
          | reverse engineer increasingly opaque improvements).
         | 
         | 4) It ends with the impossibility of AI alignment.
         | 
         | ---
         | 
         | I have written a bit about it in the past at the start of the
         | year, when I had a burnout. So, I deleted those confused
          | ramblings. You can still find it on archive.org:
         | https://web.archive.org/web/20240714153146/https://rlupi.com...
        
         | layer8 wrote:
         | IMO we won't have the "it" moment until we have continuous
         | learning (training) in some fashion.
        
           | mattxxx wrote:
           | ^ This and we need to be continually learning on an energy
           | budget similar to how much a human spends per hour.
        
             | rlupi wrote:
             | The main reason why we can't do that now is because we
             | require models to be digitally reproducible (IMHO, but also
             | read Geoffrey Hinton's mortal computing).
             | 
              | The energy cost comes from error correction as much as
             | training algorithms.
        
         | pigpop wrote:
         | I think this is a step in the right direction but not the end.
         | it takes the sampler out of the equation during most of the
         | reasoning process but it is still important for the "show your
         | work" aspects of reasoning or solving a problem. balancing when
         | to think against when to write down or commit to certain
         | thoughts is important. there are many more pieces to the
         | puzzle.
        
         | jokethrowaway wrote:
         | This sounds like brute forcing a solution to make up for lack
         | of intelligence.
         | 
         | In an IQ test, like the one in the arc agi test, a human sees
         | the pattern instantly and effortlessly. o3 tries N paths until
          | it stumbles on the right one and assesses that there is a
         | pattern.
         | 
         | I think we need a radically different architecture, this is a
         | gimmick.
        
       | jeswin wrote:
        | Interesting. Due to its emphasis on BFS, it's the opposite of
       | something I've been trying (I named it the "Tree of failures").
       | 
       | My assumption was that humans don't try a breadth-first approach.
       | Instead, we split a task into a short-step (instinct and
       | intuition selected), and long-step that summarizes/stores the
       | next steps. The key idea is to recursively evaluate a task as a
       | short-step (high-res - gets executed) and a long-step (lower-res
       | - is just stored), until it succeeds or fails. If it fails, we
       | must walk back keeping a summarized tree of failures in state so
       | that we can exclude them in future selections.
       | 
       | The effectiveness of instinct has a steep fall-off at longer
        | distances - so it's better not to chart out a series of steps.
       | When we do BFS, we drive down the value of instinct in favor of
       | compute. I guess ultimately, it depends on the type of problem
       | you want to solve.
       | 
       | Reach out to me if you want to prototype it with me.
        
         | katamari-damacy wrote:
         | that's more fit for agents, no?
        
           | jeswin wrote:
           | You're right that it's technically orthogonal to what's in
           | the paper. I was trying to model the "reasoning process",
           | which has general applicability depending on how/where it's
           | implemented.
        
         | viraptor wrote:
         | Reminds me of what plandex does. https://plandex.ai/ It already
         | does the automatic "does this need splitting into subtasks, or
         | can it be solved immediately" processing.
        
         | cube2222 wrote:
         | I think the problem with long chains of steps on their own
         | (without the bfs stuff) is that your failure probability
         | quickly grows to unreasonable levels.
         | 
         | Basically, if each step has a 97% chance of being completed
          | correctly, and your task requires 10 steps one after the other,
          | the chance of success falls to 0.97^10 ≈ 74%.
         | 
         | If I understand correctly, part of the point of the BFS is to
         | throw compute at it, in order to lower the failure rates. Kind
         | of a "run many times in parallel and pick the best one". This
         | can be effective, but also quite expensive, as seen in the
         | costs OpenAI had to pay for their ARC-AGI benchmarking runs.
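          | 
          | For the arithmetic, it compounds multiplicatively, so it gets
          | bad quickly as chains get longer:
          | 
          |   p_step = 0.97
          |   for n in (10, 30, 100):
          |       print(n, round(p_step ** n, 2))
          |   # 10 -> 0.74, 30 -> 0.4, 100 -> 0.05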
        
         | dietr1ch wrote:
         | I feel humans like doing something in between, maybe a bit like
          | A* would do sometimes. I wouldn't call it A* because of the lack
          | of a consistent heuristic and also lack of strictly numeric
          | evaluation, but it's in between DFS and BFS for sure (as is
         | every tree search algorithm?).
         | 
         | We go deep while we think it's a good lead, because so far
         | things make sense and it'll be less work, but at some point we
         | start questioning our decisions early in the descent and try
         | alternatives.
        
           | verdverm wrote:
           | You may find Prioritized Grammar Enumeration as an
           | interesting in-between DFS/BFS algorithm
           | 
           | https://seminars.math.binghamton.edu/ComboSem/worm-
           | chiu.pge_...
        
         | wafflemaker wrote:
         | How do you understand instinct?
         | 
          | I bought a new SSD for an old laptop to avoid buying a new
          | one (the X230 has an amazing keyboard), but left for another
          | country for Christmas. My intuition told me to take it with
          | me, but logical sense said there would be no time for such
          | things as moving the OS to a new drive.
          | 
          | My flight back to the country I work in got cancelled due to
          | fog and I ended up spending a week longer at my in-laws'
          | place, with plenty of free time. A new 512GB drive would
          | help my studying, giving plenty of space for school VMs.
        
         | kurthr wrote:
         | The classic thing people say is "asking the right question"
         | gets you half way there. Your approach sounds like something I
         | call "getting to No" for a problem. It's sort of a combination
         | of "getting to know" and the opposite of the salesman's
         | "getting to Yes". When it works, it's the fastest way to prune
         | off obligations.
         | 
         | The goal is to figure out why some particular problem: isn't
         | really a problem, doesn't need to be solved, can't be solved
         | that way, can't really be solved (because of physics or it's
         | really a different problem). As you define the problem better,
         | you can rule each one out to find, the "real" problem, that you
         | CAN solve, and at least one path forward. There's still many
         | ways that it might not be the optimal path, but you know
         | roughly how to get to somewhere better. It also trains you to
         | see around obstacles to success.
         | 
         | I've found that some of the best work I've done (especially on
         | acquisitions) was in defining why NOT to do something that
         | looked like a good idea (or particularly interesting to work
         | on) from the onset, but was destined to fail or required
         | unknown HW technology. Frankly, looking >5 years out feels like
         | a coin flip, because some other competing technology could come
         | along before you can get to production.
        
         | torginus wrote:
         | I don't get why you need tree search at all? What does it give
         | you over a pure LLM trained to do CoT in a tree-like manner? If
         | the context window's long enough, it can generate the
         | reasoning-tree just by pure next-token prediction, and rather
         | than BFS, it can guide the tree search with its own value
         | function (which is part of the LLM itself) instead of sticking
         | to hard algos like BFS and DFS.
         | 
         | By the way, BFS sounds like it will give you thorough results,
         | at the cost of increased compute. Useful for beating
         | benchmarks, but probably causes marginal improvement for
          | massively increased compute.
         | 
         | Still, the improved quality could be meaningful, if it's used
         | for generating training data for Llama4
        
       | galaxyLogic wrote:
        | The thing about "thinking" in problem solving, I think, is that
        | thoughts often produce new questions which then guide the overall
        | problem solving. I wonder, is this something like that?
        
       | smusamashah wrote:
       | I believe this was shared and discussed here a while ago and this
       | article looks LLM generated. It keeps doing "let's start...".
       | Either it's LLM fluff or very poor writing.
        
       | hadjian wrote:
       | If this site didn't appear here, I'd think it's a scam:
       | 
        | - site claims to simplify papers, but the videos are AI generated
       | 
       | - full of ads
       | 
        | - Can't find "Coconut" on the official Meta FAIR page
       | 
       | Is this the best site to link to?
        
         | davidhowlett wrote:
         | The official pdf for the paper is at
         | https://arxiv.org/pdf/2412.06769
         | 
         | I can find "Coconut" 54 times in the PDF. The movie does not
         | look made up.
        
           | hadjian wrote:
           | I was referring to aipapersacademy and not the arxiv link.
           | 
            | Also I didn't mean the occurrence of the word "coconut" in
           | the paper, but thanks for counting.
           | 
           | I meant their publication site: https://ai.meta.com/results/?
           | content_types%5B0%5D=publicatio...
           | 
           | The video is something I'd expect from AI.
        
       | cornel_io wrote:
       | So, what's happening here on the surface is that it's an
       | optimization (fairly meaningful, from the looks of it) aimed at
       | doing roughly the same things we could already do with chain-of-
       | thought (CoT), but IMO the downstream effects of this sort of
       | optimization could be much more meaningful.
       | 
       | LLMs can already do a decent amount of "processing" in a single
       | token generation because of the number of layers they have. The
       | layers are trained independently so it's not exactly like they're
       | a recurrent network doing multiple steps, but they are layering
       | sequences of context-dependent transformations on top of each
       | other; no matter how you cut it, if getting to a problem's answer
       | requires 100 steps, you won't be able to do it in a single token
       | output from a 20 layer LLM. To some approximation, CoT is just a
       | way to give the network more chances to transform the data than
       | there are layers in the network - each additional token of output
       | gives a shot to bake another vector the size of the token
       | embedding into each layer's state in the network, enriching what
       | it's computed so far.
       | 
       | The problem with chain of thought is that as you add each new
       | token, at the input level of the network, your computation is
       | basically starting from scratch against the raw text, just with
       | one additional token. You don't even have access to all the stuff
       | you already figured out in the deepest layers of the network
       | during the previous step! If you were processing "All wunguses
       | are glurgles, and Joe is a wungus", then somewhere in those
       | deepest layers as you're generating the next token you've almost
       | certainly got some vector that basically represents "therefore
       | Joe is a glurgle", but with chain of thought you've got to first
       | output "t", then "h", then "e", and so on (I know those aren't
        | tokens, let's pretend letter == token for argument's sake), and
       | during that process almost _ALL_ of the work being done by the
       | network is mere bookkeeping, slowly dumping that thought into the
       | output stream. Only once you get the whole sentence out can you
       | _start_ processing the next token at the first layer with the
       | information that Joe is, in fact, a glurgle, in hand. Which is a
        | damn shame, because it's been sitting right there in the deeper
       | layers of the network parallel to previous tokens this whole
       | time, it just wasn't available for the shallow layers to process
       | directly because you were casting most of the info away and
       | "rounding" to a single token.
       | 
       | With Coconut's approach, you don't need to output "therefore Joe
       | is a glurgle" token by token to continue the train of thought,
       | you can essentially pass the entire thought through as a single
       | uber-token, and the next pass can generate a new entire thought,
       | and so on.
       | 
       | It's a pretty straightforward idea, IMO the neat bit is that they
       | were able to train the network to work well in this way by
       | leveraging CoT. I'm guessing you probably don't need to act as if
       | these are two distinct modes of operation, you could instead
       | _always_ have this side channel of  "continuous thought" running,
       | even when you have generated a normal token, coming through as a
       | separate input to the first attention block. You still might want
       | to have a "thinking" token when you need to sit there and let the
       | thing do more work, but you'd generally increase the information
       | flow from time step to time step, which would allow the net to
       | keep thinking in the background even as it's doing the gruntwork
       | of outputting whatever its current "buffered" thought is.
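        | 
        | To put very rough numbers on that bottleneck (sizes assumed,
        | e.g. a Llama-3-8B-ish vocabulary and hidden width; this
        | overstates the usable information, but it shows the raw
        | difference in channel width per step):
        | 
        |   import math
        | 
        |   vocab, d_model, bits_per_dim = 128_000, 4096, 16
        |   token_bits = math.log2(vocab)         # ~17 bits per token
        |   hidden_bits = d_model * bits_per_dim  # 65,536 bits per state
        |   print(round(token_bits), hidden_bits,
        |         round(hidden_bits / token_bits))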
        
       | astrange wrote:
       | Why is it "continuous" thought? I don't see what is continuous -
       | the values inside an LLM are discrete even if they're floating
       | point.
       | 
       | Hmm, I guess you could evaluate it at any given finite precision,
       | but it would be surprising to me if that made it more accurate.
        
         | mkl wrote:
         | It's far more continuous than constantly jumping to the nearest
         | token vector. The fact that real numbers are approximated by
         | floating point isn't really relevant.
        
         | layer8 wrote:
         | If you are continuously complaining, does it mean you do it
         | non-discretely and with infinite precision?
        
           | astrange wrote:
           | It apparently uses the same iteration strategy as tokenized
           | thinking, so that's not it.
           | 
           | > Since both strategies provided comparable results, the
           | researchers opted for using a constant number of thoughts for
           | simplicity.
        
         | HarHarVeryFunny wrote:
         | > the values inside an LLM are discrete even if they're
         | floating point.
         | 
         | If that were true they'd never be able to learn anything -
         | neural nets depend on continuous gradients to learn. Weights
         | get updated by incremental/continuous amounts based on
         | gradients.
         | 
         | Even at the output of an LLM, where the internal embeddings
         | have been mapped to token probabilities, those probabilities
         | are also continuous. It's only when you sample from the model
         | that a continuous probability becomes a discrete chosen token.
        
       | unsupp0rted wrote:
       | I'm excited for this to filter down to the Rayban Meta glasses.
       | Right now the AI is about as helpful as Siri (i.e it can tell me
       | the weather 6 times out of 10)
        
       | mattfrommars wrote:
       | Wondering about folks who keep up to date with the industry,
       | 
        | Does anyone use specific keywords or tools to get the latest LLM
        | research and ideas?
       | 
        | Something like Google Scholar + keyword "LLM"?
        
         | hrtk wrote:
         | I read hacker news daily
        
         | Agentus wrote:
          | Yeah, what is a general tutorial for this? Is there a website
          | that keeps track of the keywords to follow? Also a website
          | that summarizes core NN tech and promising frontier stuff?
        
         | melvinmelih wrote:
         | You can also subscribe to arxiv email notifications directly,
         | but since there's 20-30 AI papers coming out per day, it can be
         | a bit overwhelming.
         | 
         | Instructions: https://info.arxiv.org/help/subscribe.html
        
         | maxrmk wrote:
         | As much as I hate it, I use twitter to follow a bunch of people
         | who work at fair/openai/etc and that's been a pretty good
         | source. There's also a "daily papers" newsletter from
         | huggingface, but it's pretty hit or miss.
        
           | barrenko wrote:
           | Yes, it's all definitely X first of all.
        
         | jokethrowaway wrote:
         | Definitely Twitter.
         | 
         | Some linkedin too.
        
       | marojejian wrote:
       | Dupe from 20 days ago:
       | https://news.ycombinator.com/item?id=42385412
        
         | t0lo wrote:
          | Heh, that's me, guess they weren't ready for it. Also The
          | Decoder (where I linked) is one of the best AI-only news sites
          | I've found.
        
       | ttul wrote:
       | TL;DR: Meta started with a pre-trained language model. They then
       | fine-tuned it on step-by-step reasoning examples as you would do
       | if you wanted your model to become particularly good at chain of
       | thought reasoning.
       | 
       | However, they also introduced a couple of new tokens. The <bot>
       | token tells the model to go into latent space thought mode
       | ("beginning of thought"). The <eot> token ends latent space
        | thought mode. While in this mode, the model auto-regressively
       | iterates by copying its final hidden layer back onto its input
       | layer, obviously generating new tokens at the output with each
       | inference step as it always does.
       | 
       | The idea is that by passing the final hidden layer back through a
       | few times, the model can squeeze more insight from the context.
       | And that's precisely what they found was true.
       | 
       | Training involves progressively replacing language reasoning
       | steps with latent space auto-regression steps. So for instance,
       | you might have a math problem in the training data and at first
       | the model is fed all of the steps of the math problem in language
       | form. But in later iterations of training, step one is replaced
       | with latent space auto-regression. And then step two as well,
       | then also step three, etc...
       | 
       | Eventually, the model learns to enable latent space thinking mode
        | by itself by generating the <bot> tokens and to end it by
       | generating <eot> tokens.
       | 
       | Pretty ingenious!
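        | 
        | A toy sketch of that inference loop (a GRU cell standing in
        | for the transformer, greedy decoding, and a fixed number of
        | latent steps; the real model and the <bot>/<eot> handling are
        | more involved):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   torch.manual_seed(0)
        |   d, vocab = 32, 100
        |   BOT = 0                     # assumed id for the <bot> token
        |   embed = nn.Embedding(vocab, d)
        |   core = nn.GRUCell(d, d)     # stand-in for the transformer
        |   head = nn.Linear(d, vocab)
        | 
        |   def generate(prompt, n_thoughts=4, max_steps=12):
        |       h = torch.zeros(1, d)
        |       for t in prompt:        # read the prompt
        |           h = core(embed(torch.tensor([t])), h)
        |       out, latent = [], 0
        |       for _ in range(max_steps):
        |           if latent > 0:      # latent mode: the last hidden
        |               h = core(h, h)  # state is the next input and
        |               latent -= 1     # no token is emitted
        |               continue
        |           tok = int(head(h).argmax())   # language mode
        |           out.append(tok)
        |           if tok == BOT:      # the model opts into latent
        |               latent = n_thoughts       # thinking via <bot>
        |           h = core(embed(torch.tensor([tok])), h)
        |       return out
        | 
        |   print(generate([5, 6, 7]))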
        
         | avodonosov wrote:
         | Thank you for the summary, useful for me as I only managed to
          | skim through the first half.
         | 
         | But one correction, probably, regarding this bit:
         | 
         | > While in this [latent space thought] mode, the model auto-
          | regressively iterates by copying its final hidden layer back onto
         | its input layer, obviously generating new tokens at the output
         | with each inference step as it always does.
         | 
          | I have the impression that output tokens are not generated
          | while in the latent thought mode.
        
         | treprinum wrote:
         | Would that mean that we would need to exchange latent
         | "embeddings" between various "reasoning" models for emulating
         | thinking and an LLM will be just about converting to/from human
         | language when interfacing with mere humans, at some point in
         | the future?
        
       ___________________________________________________________________
       (page generated 2024-12-31 23:00 UTC)