[HN Gopher] Large Concept Models: Language modeling in a sentenc...
       ___________________________________________________________________
        
       Large Concept Models: Language modeling in a sentence
       representation space
        
       Author : batata_frita
       Score  : 152 points
       Date   : 2025-01-01 02:38 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | inshard wrote:
       | This is interesting. I wonder if such a project could dive into
       | lower-level concepts, those akin to prime numbers. The atoms from
       | which all other concepts are built.
        
       | benreesman wrote:
       | Between this and learned patches and ModernBERT and DeepSeek?
       | 
       | I think it's time to read up.
        
       | lern_too_spel wrote:
       | This is like going back to CNNs. Attention is all you need.
        
         | zed1726 wrote:
          | Quantum states are all one really needs, but it turns out
          | that it's way too computationally expensive to simulate all
          | that just for the purpose of AI applications - so instead we
          | have to go to higher levels of construction. Attention is
          | surely _just about_ on the cusp of what is computationally
          | reasonable, which means that it's not all we need; we need
          | more efficient and richer constructions.
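          | 
          | A rough sketch of why attention sits near that limit - the
          | score matrix alone grows quadratically with sequence length
          | (toy numbers, plain self-attention, nothing specific to any
          | one model):
          | 
          |     # entries in the n x n attention score
          |     # matrices, one per head
          |     def attn_score_entries(n_tokens, n_heads=16):
          |         return n_heads * n_tokens * n_tokens
          | 
          |     for n in (1_000, 10_000, 100_000):
          |         print(n, f"{attn_score_entries(n):.2e}")
          |     # 10x more context => ~100x more entries
          | 
          | So richer constructions that avoid paying that quadratic
          | cost on every token do seem like the natural next step.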
        
           | katamari-damacy wrote:
           | Yes, just spray Quantum on it
        
             | chronic4948412 wrote:
             | > Yes, just spray Quantum on it
             | 
             | Careful, don't give Sam Altman any ideas.
             | 
              | Once OpenAI cannot raise enough capital, he will aim for
              | quantum AGI.
        
           | mdp2021 wrote:
            | We do not need quantum states to build (arithmetic)
            | calculators. Nor, very probably, do we need them for far
            | more complex ones.
        
         | snake_doc wrote:
         | Attention is just communication? It's orthogonal to the space
         | of the representation.
        
       | mdp2021 wrote:
       | > _Current best practice for large scale language modeling is to
       | operate at the token level, i.e. to learn to predict the next
       | tokens given a sequence of preceding tokens. There is a large
       | body of research on improvements of LLMs, but most works
       | concentrate on incremental changes and do not question the main
       | underlying architecture. In this paper, we have proposed a new
       | architecture,_
       | 
        | For some, 2024 may have ended badly,
       | 
       | but reading the lines above shines a great light of hope for the
       | new year.
        
       | stravant wrote:
       | This feels like a failure to learn the bitter lesson: You're just
       | taking the translation to concepts that the LLM is certainly
       | already doing and trying to make it explicitly forced.
        
         | mdp2021 wrote:
          | That should be proven. The two approaches - predicting
          | tokens vs predicting "sentences" - should be compared to see
          | how much their outputs differ in terms of quality.
         | 
         | Edit2: ...and both (and their variants) be compared to other
         | ideas such as "multi-token prediction"...
         | 
          | Edit: or, the appropriateness of the approach should be
          | demonstrated once we have acquired "transparency" into how
          | LLMs actually work internally. I am not aware of studies
          | that make the inner workings of LLMs adequately clear.
         | 
          | Edit3: In substance, the architecture should be as solid as
          | possible (and results should reflect that).
        
           | blackeyeblitzar wrote:
           | Isn't "sentence prediction" roughly the same as multi token
           | prediction of sufficient length? In the end are we just
           | talking about a change to hyper parameters or maybe a new
           | hyper parameter that controls the granularity of "prediction
           | length"?
        
             | mdp2021 wrote:
              | > _multi-token prediction of sufficient length_
              | 
              | Is multi-token prediction the same as predicting the
              | embedding of a complex token (the articulation of those
              | input tokens in a sentence)?
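              | 
              | A rough sketch of the difference as I understand it
              | (shapes only, hypothetical sizes, not the paper's actual
              | code):
              | 
              |     import torch
              |     import torch.nn as nn
              | 
              |     d, vocab, k = 1024, 32000, 4  # assumed
              |     h = torch.randn(1, d)  # last hidden state
              | 
              |     # multi-token prediction: k separate
              |     # distributions over the token vocabulary
              |     mtp = nn.Linear(d, k * vocab)
              |     logits = mtp(h).view(1, k, vocab)
              | 
              |     # concept prediction: one continuous vector
              |     # in a sentence-embedding space, no softmax
              |     # over a vocabulary at all
              |     concept_head = nn.Linear(d, d)
              |     next_sentence_vec = concept_head(h)
              | 
              | The first still commits to discrete tokens, just several
              | at once; the second stays in the embedding space until a
              | decoder turns the vector back into text.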
        
               | blackeyeblitzar wrote:
               | To be honest I don't know. Maybe the only way to know is
               | to build and measure all these variations.
        
         | anon373839 wrote:
         | The bitter lesson isn't a law of nature, though. And as GPT-
         | style LLMs appear to be at the foot of a scaling wall, I
         | personally think inductive bias is due for a comeback.
        
           | Der_Einzige wrote:
            | Everyone keeps claiming this, but we have zero evidence of
            | any kind of scaling wall whatsoever. Oh, you mean data?
            | Synthetic Data, Agents, and Digitization solve that.
        
             | anon373839 wrote:
             | I disagree, but I also wasn't referring to the exhaustion
             | of training materials. I am referring to the fact that
             | exponentially more compute is required to achieve linear
             | gains in performance. At some point, it just won't be
             | feasible to do $50B training runs, you know?
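              | 
              | With a toy power-law curve (made-up exponent, purely
              | illustrative) the shape of that argument looks like
              | this:
              | 
              |     # toy scaling law: loss = C ** -0.05
              |     def compute_needed(loss):
              |         return loss ** (-1 / 0.05)
              | 
              |     losses = [0.30, 0.29, 0.28, 0.27]
              |     prev = compute_needed(losses[0])
              |     for loss in losses[1:]:
              |         cur = compute_needed(loss)
              |         print(loss, round(cur / prev, 1))
              |         prev = cur
              |     # each equal loss step costs ~2x the
              |     # compute of the one before: linear
              |     # gains, geometric cost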
        
               | throw5959 wrote:
                | $50B still seems reasonable compared to the revenue of
                | the Big AI companies.
        
               | mentalgear wrote:
                | What revenues? If by big AI companies you mean LLM
                | service providers (OpenAI, ...), their revenues are
                | far from high, and they are far from profitable.
               | https://www.cnbc.com/2024/09/27/openai-sees-5-billion-
               | loss-t...
               | 
                | Maybe Nvidia, but they are a chip / hardware maker
                | first. And even for them, a $50B training run with no
                | exponential gains seems unreasonable.
               | 
               | Better to optimize the architecture / approach first,
               | which also is what most companies are doing now before
               | scaling out.
        
             | cubefox wrote:
             | There were multiple reports confirming that OpenAI's Orion
             | (planned to be GPT-5) yielded unexpectedly weak results.
        
               | pegasus wrote:
                | And OpenAI is not the only one facing this problem;
                | Anthropic and Google are as well.
        
               | UltraSane wrote:
               | And costs $500 million per training run.
        
               | Der_Einzige wrote:
                | So DeepSeek V3 did nothing to show you how wrong this
                | take is?
        
             | UltraSane wrote:
              | There seems to be a wall on affordable scaling.
        
         | mdp2021 wrote:
         | It is explicitly stated in the paper that
         | 
         | > _One may argue that LLMs are implicitly learning a
         | hierarchical representation, but we stipulate that models with
         | an explicit hierarchical architecture are better suited to
         | create coherent long-form output_
         | 
         | And the problem remains that (text surrounding the above):
         | 
         | > _Despite the undeniable success of LLMs and continued
         | progress, all current LLMs miss a crucial characteristic of
         | human intelligence: explicit reasoning and planning at multiple
         | levels of abstraction. The human brain does not operate at the
         | word level only. We usually have a top-down process to solve a
         | complex task or compose a long document: we first plan at a
         | higher level the overall structure, and then step-by-step, add
         | details at lower levels of abstraction. [...] Imagine a
         | researcher giving a fifteen-minute talk. In such a situation,
         | researchers do not usually prepare detailed speeches by writing
         | out every single word they will pronounce. Instead, they
         | outline a flow of higher-level ideas they want to communicate.
         | Should they give the same talk multiple times, the actual words
         | being spoken may differ, the talk could even be given in
         | different languages, but the flow of higher-level abstract
         | ideas will remain the same. Similarly, when writing a research
         | paper or essay on a specific topic, humans usually start by
         | preparing an outline that structures the whole document into
         | sections, which they then refine iteratively. Humans also
         | detect and remember dependencies between the different parts of
         | a longer document at an abstract level. If we expand on our
         | previous research writing example, keeping track of
         | dependencies means that we need to provide results for each of
          | the experiments mentioned in the introduction. Finally, when
         | processing and analyzing information, humans rarely consider
         | every single word in a large document. Instead, we use a
         | hierarchical approach: we remember which part of a long
         | document we should search to find a specific piece of
         | information. To the best of our knowledge, this explicit
         | hierarchical structure of information processing and
         | generation, at an abstract level, independent of any
         | instantiation in a particular language or modality, cannot be
         | found in any of the current LLMs_
        
           | motoboi wrote:
            | I suppose humans need high-level concepts because we can
            | only hold 7[*] things in working memory. Computers don't
            | have that limitation.
           | 
           | Also, humans cannot iterate over thousands of possibilities
           | in a second, like computers do.
           | 
           | And finally, animal brains are severely limited by heat
           | dissipation and energy input flow.
           | 
            | Based on that, artificial intelligence may arise from
            | unexpectedly simple strategies, given the fundamental
            | differences in scale and structure from animal brains.
           | 
            | [*] - where 7 is whatever number is the correct number
            | nowadays.
        
           | dr_dshiv wrote:
           | I just don't understand that -- I thought deep neural nets
           | were inherently hierarchical. Or at least emergently
           | hierarchical?
        
             | mdp2021 wrote:
              | Neural nets can be made hierarchical - perhaps the most
              | notable example is the Convolutional Neural Network so
              | successfully promoted by Yann LeCun.
              | 
              | But the issue with the LLM architectures in place is the
              | idea of "predicting the next token", which clashes with
              | the exercise of intelligence - where we search instead
              | for the "neighbouring fitting ideas".
             | 
             | So, "hierarchical" in this context is there to express that
             | it is typical of natural intelligence to refine an idea -
             | formulating an hypothesis and improving its form (hence its
             | expression) step after step of pondering. The issue of
             | transparency in current LLMs, and the idea of "predicting
             | the next token", do not help in having the idea of typical
             | natural intelligence mechanism and the tentative
             | interpretation of LLM internals match.
        
               | nightski wrote:
               | Is that true? There are many attention/mlp layers stacked
               | on top of each other. Higher level layers aren't
               | performing attention on input tokens, but instead on the
               | output of the previous layer.
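                | 
                | i.e. something like this toy stack (standard PyTorch
                | blocks, nothing model-specific): only layer 0 ever
                | sees the token embeddings directly.
                | 
                |     import torch
                |     import torch.nn as nn
                | 
                |     block = nn.TransformerEncoderLayer(
                |         d_model=64, nhead=4,
                |         batch_first=True)
                |     stack = nn.TransformerEncoder(
                |         block, num_layers=6)
                | 
                |     x = torch.randn(2, 16, 64)  # embeddings
                |     # each layer attends over the previous
                |     # layer's outputs, not the raw tokens
                |     y = stack(x)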
        
               | mdp2021 wrote:
               | > _Is that true_
               | 
                | Well, if you are referring to <<The issue of
                | transparency in current LLMs>>, I have not read an
                | essay that satisfactorily explains the inner process
                | and world modelling inside LLMs. Some pieces say
                | (guess?) that the engine has no idea what the whole
                | concept in the reply will be before outputting all the
                | tokens; others swear it seems impossible that it has
                | no such idea before formulation...
        
               | throwawaymaths wrote:
                | There is a way in which "predicting the next token" is
                | roughly an append-only Turing machine. Obviously the
                | tokens we're using might be suboptimal for whatever
                | goalpost "AGI" is at any given time, but the
                | structure/strategies of LLMs are probably not far from
                | a really good one, modulo refactoring for efficiency
                | like Mamba (but still doing token-stream prediction,
                | especially during inference).
        
         | Jensson wrote:
         | > You're just taking the translation to concepts that the LLM
         | is certainly already doing and trying to make it explicitly
         | forced.
         | 
          | That is what tokens are doing in the first place though, and
          | you get better results with tokens than with letters.
        
           | mdp2021 wrote:
            | Well, individual letters in the languages in use* do not
            | convey specific meaning, while individual tokens do - so
            | you cannot really construct a ladder that would go from
            | letter to token, then from token to sentence.
            | 
            | This said, researching whether the search for concepts (in
            | the solution space) works better than the search for
            | tokens seems absolutely worth pursuing, in the absence of
            | a solid theory showing otherwise.
            | 
            | (*Sounds convey their own meaning e.g. in Proto-Indo-
            | European according to some interpretations, but that
            | becomes too remote in the current descendants - you cannot
            | reconstruct the implicit sound-token in English words
            | directly from the spelling.)
        
           | IanCal wrote:
            | Is that true? I thought there was a desire to move towards
            | byte-level work rather than tokens, and that the benefit
            | of tokens was more that you are reducing the context size
            | for the same input.
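            | 
            | The context-size part at least is easy to see with a rough
            | count (toy example; a generic BPE tokenizer is assumed at
            | roughly 1.3 tokens per English word, so the exact numbers
            | vary):
            | 
            |     text = ("Large Concept Models operate on "
            |             "sentence embeddings. Tokens are "
            |             "a middle ground.")
            | 
            |     n_bytes = len(text.encode("utf-8"))
            |     n_words = len(text.split())
            |     n_tokens = round(1.3 * n_words)  # approx.
            |     n_sents = text.count(". ") + 1
            | 
            |     print(n_bytes, n_tokens, n_sents)
            |     # bytes >> tokens >> sentences: each step
            |     # up shrinks the sequence the model has to
            |     # attend over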
        
             | fngjdflmdflg wrote:
             | >there was a desire to move towards byte level work rather
             | than tokens
             | 
              | Yeah, the latest work on this is from Meta last month.
              | [0] It showed good results.
             | 
             | [0] https://ai.meta.com/research/publications/byte-latent-
             | transf... (https://news.ycombinator.com/item?id=42415122)
        
         | macawfish wrote:
         | At a performance boost of 10-100x :)
        
       | attentionmech wrote:
        | I like the idea of "concept"... you can represent a concept
        | with language, visuals, etc., but it isn't any of those. Those
        | are symbols used to communicate a concept or give it a
        | representation, but at the core concepts are just connections
        | between other concepts. The closest thing I can think of is
        | categories in category theory.
        
         | layer8 wrote:
         | Concepts need to be linked to reality somehow in order to carry
         | any meaning. They are thus not just relations between
         | themselves.
        
         | dr_dshiv wrote:
         | Platonic forms?
        
           | attentionmech wrote:
           | interesting concept they are.
        
       | YeGoblynQueenne wrote:
       | From the paper:
       | 
       | >> In this paper, we present an attempt at an architecture which
       | operates on an explicit higher-level semantic representation,
       | which we name a "concept".
       | 
       | I wonder if the many authors of the paper know that what they
       | call "concept" is what all of machine learning and AI has also
       | called a "concept" for many decades, and not a new thing that
       | they have just named from scratch.
       | 
        | For instance, classes of "concepts" are the target of learning
        | in Leslie Valiant's "A Theory of the Learnable", the paper
        | that introduced Probably Approximately Correct Learning (PAC-
        | Learning). Quoting from its abstract:
        | 
        |     ABSTRACT: Humans appear to be able to learn new concepts
        |     without needing to be programmed explicitly in any
        |     conventional sense. In this paper we regard learning as
        |     the phenomenon of knowledge acquisition in the absence of
        |     explicit programming. We give a precise methodology for
        |     studying this phenomenon from a computational viewpoint.
        |     It consists of choosing an appropriate information
        |     gathering mechanism, the learning protocol, and exploring
        |     the class of concepts that can be learned using it in a
        |     reasonable (polynomial) number of steps. Although inherent
        |     algorithmic complexity appears to set serious limits to
        |     the range of concepts that can be learned, we show that
        |     there are some important nontrivial classes of
        |     propositional concepts that can be learned in a realistic
        |     sense.
       | 
       | From: https://web.mit.edu/6.435/www/Valiant84.pdf
       | 
        | Or take this Introduction to Chapter 2 in Tom Mitchell's
        | "Machine Learning" (the original ML textbook, published 1997):
        | 
        |     This chapter considers concept learning: acquiring the
        |     definition of a general category given a sample of
        |     positive and negative training examples of the category.
        | 
        | From: https://www.cs.cmu.edu/~tom/mlbook.html (click link in
        | "the book").
       | 
        | I mean, I really wonder sometimes what is going on here. There
        | have been decades of research in AI and machine learning, but
        | recent papers look like their authors have landed in an
        | undiscovered country and are having to invent everything from
        | scratch. That's not good. There are pitfalls that all the
        | previous generations have explored thoroughly by falling into
        | them time and again. Those who don't remember those lessons
        | will have to find that out the hard way.
        
         | mdp2021 wrote:
          | I am not sure that fits the point, YGQ:
          | 
          | it seems to me the concept of <<concept>> in the paper is
          | "the embedding vector we get from systems like SONAR (which
          | we could use to generalize ordered sets of tokens into more
          | complex ideas)". That's pretty specific, and only marginally
          | related to the past usage you mention.
        
           | YeGoblynQueenne wrote:
           | That's only the representation of a concept. Different
           | systems and different approaches will have different
           | representations but that doesn't change the fact of what is
           | being represented.
        
             | mdp2021 wrote:
              | But if the issue is that "research in AI has had to deal
              | with the concept of 'concept' since its inception" (and
              | of course it has), the contribution of this paper is to
              | try an operational implementation that could bear fruit
              | and possibly fix architectural shortcomings of the
              | mainstream effort.
              | 
              | (It is not separate from the context of LLMs.)
        
               | YeGoblynQueenne wrote:
                | Right, but there have been many operationalisations
                | before. That's what's not new. Tom Mitchell's textbook
                | has plenty of examples. Basically all of machine
                | learning is about learning concepts - in practice as
                | well as in theory. That's the whole point.
        
       | upghost wrote:
        | Aside from using the word "concept" instead of "language", I
        | don't see how this is different from an LLM. It's still doing
        | next-token prediction. This is like in D&D where you have two
        | swords with wildly different flavor text, but ultimately they
        | both do 1d6+1 damage.
        | 
        | What am I missing, aside from the marketing? Is there
        | something architecturally different, or what? It looks like a
        | regular autoregressive sequence transformer to me.
        
         | tantalor wrote:
         | (Guessing here) It does tokenization and prediction for a whole
         | sentence, not fragments of words.
         | 
         | I like this idea because that's how humans think. We mentally
         | formulate a whole sentence, then say it. People who don't do
         | this speak in run-ons and word salad.
        
           | upghost wrote:
           | oh interesting. concepts as tokens. Yeah I'd buy that. They
           | do something similar with transformers in robotics, except
           | they use tokens as actions instead of word chunks. Good eye.
        
           | botanical76 wrote:
           | I would be interested to know how many people do formulate a
           | whole sentence before saying it. "Think before you speak" as
           | they say. I feel I do not have the cognitive window or
           | processing speed to do this; instead, I formulate a concept
           | of how I would like to respond abstractly, and then think of
           | and say phrases of several words one at a time until the
           | sentence ends itself. The latter process leans heavily on
           | some kind of next word anticipation.
        
         | mdp2021 wrote:
         | > _something architecturally different_
         | 
          | An embedding-space engine accepting sentences (SONAR) fits
          | in, so that the tokens of this architecture are complex sets
          | of the tokens of past architectures.
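          | 
          | A minimal sketch of how I read that pipeline (the encoder
          | and decoder below are hash-based stand-ins for a SONAR-like
          | model, not the real API; only the shape of the interface
          | matters):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     d = 1024  # assumed sentence-embedding width
          | 
          |     def encode(sentence: str) -> torch.Tensor:
          |         # stand-in for a real sentence encoder
          |         seed = hash(sentence) % (2 ** 31)
          |         g = torch.Generator().manual_seed(seed)
          |         return torch.randn(d, generator=g)
          | 
          |     def decode(vec: torch.Tensor) -> str:
          |         # stand-in for a real sentence decoder
          |         return "<nearest sentence to this vector>"
          | 
          |     # the sequence model runs over sentence
          |     # vectors instead of subword tokens
          |     block = nn.TransformerEncoderLayer(
          |         d_model=d, nhead=8, batch_first=True)
          |     head = nn.Linear(d, d)
          | 
          |     def next_sentence(sentences):
          |         vecs = [encode(s) for s in sentences]
          |         h = block(torch.stack(vecs).unsqueeze(0))
          |         return decode(head(h[0, -1]))
          | 
          |     print(next_sentence(
          |         ["It rains.", "We stay in."]))
          | 
          | Everything between encode and decode works with one vector
          | per sentence, which as I understand it is the architectural
          | change.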
        
       | nutanc wrote:
        | This maps a little to research we are doing on what we are
        | calling the shape of stories [1].
       | 
       | We can clearly see in 2D space itself how different "concepts"
       | are explored.
       | 
       | Using the shape of stories for semantic chunking we can clearly
       | see in multiple articles how we can chunk by "concepts". [2]
       | 
        | Now we are trying to see if we can just use these chunks and
        | train a next-"chunk" predictor instead of a next-word
        | predictor (rough sketch after the links below).
        | 
        | In the paper, they take a sentence to mean a concept. We
        | believe that a "semantic chunk" is better suited to represent
        | a concept than a sentence is.
       | 
       | [1] https://gpt3experiments.substack.com/p/the-shape-of-
       | stories-...
       | 
       | [2]https://gpt3experiments.substack.com/p/a-new-chunking-
       | approa...
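        | 
        | A rough sketch of the kind of chunking meant above (the
        | encoder here is a toy stand-in, not our actual pipeline; real
        | sentence embeddings would make the boundaries meaningful):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     def embed(sentence: str) -> torch.Tensor:
        |         # toy stand-in for a sentence encoder
        |         seed = hash(sentence) % (2 ** 31)
        |         g = torch.Generator().manual_seed(seed)
        |         return torch.randn(384, generator=g)
        | 
        |     def chunk(sentences, threshold=0.3):
        |         vecs = [embed(s) for s in sentences]
        |         chunks, current = [], [sentences[0]]
        |         for i in range(1, len(sentences)):
        |             sim = F.cosine_similarity(
        |                 vecs[i - 1], vecs[i], dim=0)
        |             if sim < threshold:  # new "concept"
        |                 chunks.append(current)
        |                 current = []
        |             current.append(sentences[i])
        |         chunks.append(current)
        |         return chunks
        | 
        | A next-"chunk" predictor would then be trained over the
        | embeddings of these chunks rather than over word tokens.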
        
         | Lerc wrote:
         | Can you spot conceptually similar stories by their shape?
         | 
          | For instance, what is the shape of The Ugly Duckling
          | compared to Rudolph the Red-Nosed Reindeer? They are
          | essentially the same story, so presumably on some dimension
          | you should be able to spot them in a group of unrelated
          | stories.
        
       | rxm wrote:
       | What used to be feature engineering a decade or more ago now
       | seems to have shifted to developing distributed representations.
       | LLMs use word tokens (for words or the entities in images). But
       | there are many more. The 3D Fields (or whatever they have evolved
       | to) developed by Fei-Fei Li's group represent visual information
       | in a way better suited for geometrical tasks. Wav2Vec, the
       | convolutional features for YOLO and friends, and these sentence
       | representations are other examples. I would love to read a review
       | of this circle of ideas.
        
       ___________________________________________________________________
       (page generated 2025-01-01 23:01 UTC)