[HN Gopher] Large Concept Models: Language modeling in a sentenc...
___________________________________________________________________
Large Concept Models: Language modeling in a sentence
representation space
Author : batata_frita
Score : 152 points
Date : 2025-01-01 02:38 UTC (20 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| inshard wrote:
| This is interesting. I wonder if such a project could dive into
| lower-level concepts, those akin to prime numbers. The atoms from
| which all other concepts are built.
| benreesman wrote:
| Between this and learned patches and ModernBERT and DeepSeek?
|
| I think it's time to read up.
| lern_too_spel wrote:
| This is like going back to CNNs. Attention is all you need.
| zed1726 wrote:
| Quantum states are all one really needs, but it turns out that
| it's way too computationally expensive to simulate all that
| just for the purpose of AI applications - so instead we have to
| go to higher levels of construction. Attention is surely _just
| about_ on the cusp of what is computationally reasonable, which
| means that it's not all we need; we need more efficient and
| richer constructions.
| katamari-damacy wrote:
| Yes, just spray Quantum on it
| chronic4948412 wrote:
| > Yes, just spray Quantum on it
|
| Careful, don't give Sam Altman any ideas.
|
| Once OpenAI can no longer raise enough capital, he will
| aim for quantum AGI.
| mdp2021 wrote:
| We do not need quantum states to build (arithmetic)
| calculators. Nor, very probably, to build complex and much
| more complex calculators.
| snake_doc wrote:
| Attention is just communication? It's orthogonal to the space
| of the representation.
| mdp2021 wrote:
| > _Current best practice for large scale language modeling is to
| operate at the token level, i.e. to learn to predict the next
| tokens given a sequence of preceding tokens. There is a large
| body of research on improvements of LLMs, but most works
| concentrate on incremental changes and do not question the main
| underlying architecture. In this paper, we have proposed a new
| architecture,_
|
| For some, 2024 may have ended badly,
|
| but reading the lines above shines a great light of hope for the
| new year.
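|
| For reference, what "predict the next tokens given a sequence
| of preceding tokens" boils down to in code (an illustrative
| PyTorch sketch of the standard objective, not the paper's code;
| the model is assumed to return per-position vocabulary logits):
|
|     import torch.nn.functional as F
|
|     def next_token_loss(model, token_ids):
|         # token_ids: (batch, seq_len) integers
|         logits = model(token_ids[:, :-1])   # predict from each prefix
|         targets = token_ids[:, 1:]          # the "next token" labels
|         return F.cross_entropy(
|             logits.reshape(-1, logits.size(-1)),  # (batch*seq, vocab)
|             targets.reshape(-1))                  # (batch*seq,)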
| stravant wrote:
| This feels like a failure to learn the bitter lesson: You're just
| taking the translation to concepts that the LLM is certainly
| already doing and trying to make it explicitly forced.
| mdp2021 wrote:
| That should be proven. The two approaches - predicting tokens
| vs predicting "sentences" - should be compared to see how much
| their outputs differ in terms of quality.
|
| Edit2: ...and both (and their variants) be compared to other
| ideas such as "multi-token prediction"...
|
| Edit: or, the appropriateness of the approach should be
| demonstrated after "transparency" of how LLMs effectively
| work internally has been acquired. I am not aware of studies
| that make the inner workings of LLMs adequately clear.
|
| Edit3: Substantially, the architecture should be as solid as
| possible (and results should reflect that).
| blackeyeblitzar wrote:
| Isn't "sentence prediction" roughly the same as multi token
| prediction of sufficient length? In the end are we just
| talking about a change to hyper parameters or maybe a new
| hyper parameter that controls the granularity of "prediction
| length"?
| mdp2021 wrote:
| > _multi token prediction of sufficient length_
|
| Is multi token prediction the same as predicting the
| embedding of a complex token (the articulation of those
| input tokens in a sentence)?
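|
| A rough sketch of the two objectives side by side (hypothetical
| heads for illustration only, not code from the paper):
| multi-token prediction still scores k discrete tokens against a
| vocabulary, while "concept" prediction regresses a single
| continuous sentence vector.
|
|     import torch
|     import torch.nn as nn
|
|     hidden, vocab, k, concept_dim = 512, 32000, 4, 1024
|
|     # (a) multi-token prediction: k softmax heads over the vocab
|     multi_token_heads = nn.ModuleList(
|         nn.Linear(hidden, vocab) for _ in range(k))
|
|     # (b) "concept" prediction: one regression head into an
|     #     embedding space (e.g. that of a sentence encoder)
|     concept_head = nn.Linear(hidden, concept_dim)
|
|     h = torch.randn(1, hidden)     # last hidden state of a prefix
|     token_logits = [head(h) for head in multi_token_heads]
|     concept_vec = concept_head(h)  # one continuous vector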
| blackeyeblitzar wrote:
| To be honest I don't know. Maybe the only way to know is
| to build and measure all these variations.
| anon373839 wrote:
| The bitter lesson isn't a law of nature, though. And as GPT-
| style LLMs appear to be at the foot of a scaling wall, I
| personally think inductive bias is due for a comeback.
| Der_Einzige wrote:
| Everyone keeps claiming this, but we have zero evidence of
| any kind of scaling wall whatsoever. Oh, you mean data?
| Synthetic data, agents, and digitization solve that.
| anon373839 wrote:
| I disagree, but I also wasn't referring to the exhaustion
| of training materials. I am referring to the fact that
| exponentially more compute is required to achieve linear
| gains in performance. At some point, it just won't be
| feasible to do $50B training runs, you know?
| throw5959 wrote:
| $50B still seems reasonable compared to the revenues of the
| big AI companies.
| mentalgear wrote:
| What revenues? If by big AI companies you mean LLM
| service providers (OpenAI, ...), their revenues are far
| from high, and they are far from profitable.
| https://www.cnbc.com/2024/09/27/openai-sees-5-billion-
| loss-t...
|
| Maybe Nvidia, but they are a chip / hardware maker first.
| And even for them, a $50B training run with no exponential
| gains seems unreasonable.
|
| Better to optimize the architecture / approach first,
| which also is what most companies are doing now before
| scaling out.
| cubefox wrote:
| There were multiple reports confirming that OpenAI's Orion
| (planned to be GPT-5) yielded unexpectedly weak results.
| pegasus wrote:
| And not just OpenAI is facing this problem. Anthropic and
| Google as well.
| UltraSane wrote:
| And costs $500 million per training run.
| Der_Einzige wrote:
| So DeepSeek V3 did nothing to show you how wrong this
| take is?
| UltraSane wrote:
| There seems to be an affordability scaling wall.
| mdp2021 wrote:
| It is explicitly stated in the paper that
|
| > _One may argue that LLMs are implicitly learning a
| hierarchical representation, but we stipulate that models with
| an explicit hierarchical architecture are better suited to
| create coherent long-form output_
|
| And the problem remains that (text surrounding the above):
|
| > _Despite the undeniable success of LLMs and continued
| progress, all current LLMs miss a crucial characteristic of
| human intelligence: explicit reasoning and planning at multiple
| levels of abstraction. The human brain does not operate at the
| word level only. We usually have a top-down process to solve a
| complex task or compose a long document: we first plan at a
| higher level the overall structure, and then step-by-step, add
| details at lower levels of abstraction. [...] Imagine a
| researcher giving a fifteen-minute talk. In such a situation,
| researchers do not usually prepare detailed speeches by writing
| out every single word they will pronounce. Instead, they
| outline a flow of higher-level ideas they want to communicate.
| Should they give the same talk multiple times, the actual words
| being spoken may differ, the talk could even be given in
| different languages, but the flow of higher-level abstract
| ideas will remain the same. Similarly, when writing a research
| paper or essay on a specific topic, humans usually start by
| preparing an outline that structures the whole document into
| sections, which they then refine iteratively. Humans also
| detect and remember dependencies between the different parts of
| a longer document at an abstract level. If we expand on our
| previous research writing example, keeping track of
| dependencies means that we need to provide results for each of
| the experiments mentioned in the introduction. Finally, when
| processing and analyzing information, humans rarely consider
| every single word in a large document. Instead, we use a
| hierarchical approach: we remember which part of a long
| document we should search to find a specific piece of
| information. To the best of our knowledge, this explicit
| hierarchical structure of information processing and
| generation, at an abstract level, independent of any
| instantiation in a particular language or modality, cannot be
| found in any of the current LLMs_
| motoboi wrote:
| I suppose humans need high level concepts because we can only
| hold 7[*] things in working memory. Computers don't have
| that limitation.
|
| Also, humans cannot iterate over thousands of possibilities
| in a second, like computers do.
|
| And finally, animal brains are severely limited by heat
| dissipation and energy input flow.
|
| Based on that, artificial intelligence may arise from
| unexpectedly simple strategies, given the fundamental
| differences in scale and structure from animal brains.
|
| * - where 7 is whatever the correct number is nowadays.
| dr_dshiv wrote:
| I just don't understand that -- I thought deep neural nets
| were inherently hierarchical. Or at least emergently
| hierarchical?
| mdp2021 wrote:
| Neural nets can be made to be hierarchical - I would say the
| most notable example is the Convolutional Neural Network so
| successfully promoted by Yann LeCun.
|
| But the issue with the LLM architectures in place is the idea
| of "predicting the next token", which is so at odds with the
| exercise of intelligence - where we search instead for the
| "neighbouring fitting ideas".
|
| So, "hierarchical" in this context expresses that it is
| typical of natural intelligence to refine an idea -
| formulating a hypothesis and improving its form (hence its
| expression) step after step of pondering. The lack of
| transparency in current LLMs, and the idea of "predicting
| the next token", do not help in matching that typical
| mechanism of natural intelligence with a tentative
| interpretation of LLM internals.
| nightski wrote:
| Is that true? There are many attention/MLP layers stacked
| on top of each other. Higher-level layers aren't
| performing attention on input tokens, but instead on the
| output of the previous layer.
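|
| In code the stacking looks roughly like this (a toy PyTorch
| sketch, not any particular model): only the first block ever
| sees the token embeddings; every later block attends over the
| previous block's output.
|
|     import torch.nn as nn
|
|     class TinyTransformer(nn.Module):
|         def __init__(self, d_model=512, n_heads=8, n_layers=6):
|             super().__init__()
|             self.blocks = nn.ModuleList(
|                 nn.TransformerEncoderLayer(d_model, n_heads,
|                                            batch_first=True)
|                 for _ in range(n_layers))
|
|         def forward(self, x):          # x: token embeddings
|             for block in self.blocks:  # each layer sees only the
|                 x = block(x)           # previous layer's output
|             return x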
| mdp2021 wrote:
| > _Is that true_
|
| Well, if you are referring to <<The issue of transparency
| in current LLMs>>, I have not read an essay that
| satisfactorily explains the inner processes and world
| modelling inside LLMs. Some pieces say (or guess?) that the
| engine has no idea what the whole concept of the reply will
| be before outputting all the tokens; others swear it seems
| impossible that it has no such idea before formulation...
| throwawaymaths wrote:
| There is a sense in which "predicting the next token" is
| ~an append-only Turing machine. Obviously the tokens we're
| using might be suboptimal for whatever goalpost "agi" is
| at any given time, but the structure/strategies of LLMs
| are probably not far from a really good one, modulo
| refactoring for efficiency like Mamba (but still doing
| token stream prediction, esp. during inference).
| Jensson wrote:
| > You're just taking the translation to concepts that the LLM
| is certainly already doing and trying to make it explicitly
| forced.
|
| That is what tokens are doing in the first place though, and
| you get better results with tokens than with letters.
| mdp2021 wrote:
| Well, individual letters in the languages in use* do not
| convey specific meaning, while individual tokens do - so you
| cannot really construct a ladder that would go from letter to
| token, then from token to sentence.
|
| This said, researching whether the search for concepts (in
| the solution space) works better than the search for tokens
| seems absolutely due, in the absence of a solid theory that
| shows otherwise.
|
| (*Sounds convey their own meaning e.g. in Proto-Indo-European
| according to some interpretations, but that becomes too
| remote in the current descendants - you cannot reconstruct
| the implicit sound-token in English words directly, just
| from the spelling.)
| IanCal wrote:
| Is that true? I thought there was a desire to move towards
| byte-level work rather than tokens, and that the benefit of
| tokens was more that you are reducing the context size for
| the same input.
| fngjdflmdflg wrote:
| >there was a desire to move towards byte level work rather
| than tokens
|
| Yeah, the latest work on this is from Meta, last month. [0]
| It showed good results.
|
| [0] https://ai.meta.com/research/publications/byte-latent-
| transf... (https://news.ycombinator.com/item?id=42415122)
| macawfish wrote:
| At a performance boost of 10-100x :)
| attentionmech wrote:
| I like the idea of "concept"... you can represent a concept with
| language, visuals, etc., but it isn't any of those. Those are
| symbols used to communicate a concept or give representation to
| it, but at the core concepts are just connections between other
| concepts. The closest thing I feel to this is categories in
| category theory.
| layer8 wrote:
| Concepts need to be linked to reality somehow in order to carry
| any meaning. They are thus not just relations between
| themselves.
| dr_dshiv wrote:
| Platonic forms?
| attentionmech wrote:
| interesting concept they are.
| YeGoblynQueenne wrote:
| From the paper:
|
| >> In this paper, we present an attempt at an architecture which
| operates on an explicit higher-level semantic representation,
| which we name a "concept".
|
| I wonder if the many authors of the paper know that what they
| call "concept" is what all of machine learning and AI has also
| called a "concept" for many decades, and not a new thing that
| they have just named from scratch.
|
| For instance, classes of "concepts" are the target of learning in
| Leslie Valiant's "A Theory of the Learnable", the paper that
| introduced Probably Approximately Correct Learning
| (PAC-Learning). Quoting from its abstract:
| Humans appear to be able to learn new concepts without
| needing to be programmed explicitly in any conventional
| sense. In this paper we regard learning as the phenomenon
| of knowledge acquisition in the absence of explicit
| programming. We give a precise methodology for studying
| this phenomenon from a computational viewpoint. It consists
| of choosing an appropriate information gathering mechanism,
| the learning protocol, and exploring the class of concepts
| that can be learned using it in a reasonable (polynomial)
| number of steps. Although inherent algorithmic complexity
| appears to set serious limits to the range of concepts that
| can be learned, we show that there are some important
| nontrivial classes of propositional concepts that can be
| learned in a realistic sense.
|
| From: https://web.mit.edu/6.435/www/Valiant84.pdf
|
| Or take this Introduction to Chapter 2 in Tom Mitchell's "Machine
| Learning" (the original ML textbook, published 1997):
| This chapter considers concept learning: acquiring the definition
| of a general category given a sample of positive and
| negative training examples of the category.
|
| From: https://www.cs.cmu.edu/~tom/mlbook.html (click the link in
| "the book").
|
| I mean, I really wonder sometimes what is going on here. There
| have been decades of research in AI and machine learning, but
| recent papers look like their authors have landed in an
| undiscovered country and are having to invent everything from
| scratch. That's not good. There are pitfalls that all the
| previous generations have explored thoroughly by falling into
| them time and again. Those who don't remember those lessons
| will have to find them out the hard way.
| mdp2021 wrote:
| I am not sure that fits the point, YGQ:
|
| it seems to me the concept of <<concept>> in the paper is "the
| embedding vector we get in systems like SONAR (which we could
| use to generalize ordered sets of tokens into more complex
| ideas)". That's pretty specific, only marginally related to
| past handling as mentioned.
| YeGoblynQueenne wrote:
| That's only the representation of a concept. Different
| systems and different approaches will have different
| representations but that doesn't change the fact of what is
| being represented.
| mdp2021 wrote:
| But if the issue is that "research in AI has had to deal
| with the concept of 'concept' since its inception" (and of
| course it had to), the contribution of this paper is to try
| an operational implementation that could bear fruit and
| possibly fix architectural shortcomings of the mainstream
| effort.
|
| (It is not separate from the context of LLMs.)
| YeGoblynQueenne wrote:
| Right, but there have been many operationalisations before.
| That's what's not new. Tom Mitchell's textbook has
| plenty of examples. Basically all of machine learning is
| about learning concepts - in practice as well as in
| theory. That's the whole point.
| upghost wrote:
| Aside from using the word "concept" instead of "language", I
| don't see how this is different from an LLM. It's still doing
| next token prediction. This is like in D&D where you have two
| swords with wildly different flavor text but ultimately they both
| do 1d6+1 damage.
|
| What am I missing -- aside from the marketing? Is there something
| architecturally different or what? Looks like regular
| autoregressive sequence transformer to me.
| tantalor wrote:
| (Guessing here) It does tokenization and prediction for a whole
| sentence, not fragments of words.
|
| I like this idea because that's how humans think. We mentally
| formulate a whole sentence, then say it. People who don't do
| this speak in run-ons and word salad.
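|
| A rough sketch of how I imagine the generation loop would look
| (placeholder encoder/decoder interfaces, not the paper's actual
| SONAR code): the "token" becomes a whole sentence, represented
| by a fixed-size embedding.
|
|     from typing import Callable, List
|     import numpy as np
|
|     Encoder = Callable[[str], np.ndarray]   # sentence -> vector
|     Decoder = Callable[[np.ndarray], str]   # vector -> sentence
|
|     def generate_next_sentence(context: List[str],
|                                encode: Encoder,
|                                decode: Decoder,
|                                predict_next) -> str:
|         # Embed each context sentence into the concept space.
|         embeddings = np.stack([encode(s) for s in context])
|         # The model's job: predict the *next sentence embedding*.
|         next_embedding = predict_next(embeddings)
|         # Turn the predicted concept back into words.
|         return decode(next_embedding)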
| upghost wrote:
| oh interesting. concepts as tokens. Yeah I'd buy that. They
| do something similar with transformers in robotics, except
| they use tokens as actions instead of word chunks. Good eye.
| botanical76 wrote:
| I would be interested to know how many people do formulate a
| whole sentence before saying it. "Think before you speak" as
| they say. I feel I do not have the cognitive window or
| processing speed to do this; instead, I formulate a concept
| of how I would like to respond abstractly, and then think of
| and say phrases of several words one at a time until the
| sentence ends itself. The latter process leans heavily on
| some kind of next word anticipation.
| mdp2021 wrote:
| > _something architecturally different_
|
| An embedding-space engine accepting sentences (SONAR) fits in,
| so that the "tokens" of this architecture are complex sets of
| the tokens of past architectures.
| nutanc wrote:
| This maps a little to the research we are doing on what we are
| calling the shape of stories [1].
|
| We can clearly see, in 2D space itself, how different "concepts"
| are explored.
|
| Using the shape of stories for semantic chunking, we can clearly
| see in multiple articles how we can chunk by "concepts". [2]
|
| Now we are trying to see if we can just use these chunks and
| train a next "chunk" predictor instead of a next word predictor.
|
| In the paper, they take a sentence to mean a concept. We believe
| that a "semantic chunk" is better suited for a concept instead of
| a sentence.
|
| [1] https://gpt3experiments.substack.com/p/the-shape-of-
| stories-...
|
| [2]https://gpt3experiments.substack.com/p/a-new-chunking-
| approa...
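|
| For the curious, a rough illustration of one common way to
| chunk semantically (a generic sketch of the idea, not our
| actual shape-of-stories code): embed each sentence and start a
| new chunk whenever consecutive sentences drift apart.
|
|     from typing import Callable, List
|     import numpy as np
|
|     def semantic_chunks(sentences: List[str],
|                         encode: Callable[[str], np.ndarray],
|                         threshold: float = 0.7):
|         if not sentences:
|             return []
|         chunks, current = [], [sentences[0]]
|         prev = encode(sentences[0])
|         for s in sentences[1:]:
|             vec = encode(s)
|             sim = float(np.dot(prev, vec) /
|                         (np.linalg.norm(prev) * np.linalg.norm(vec)))
|             if sim < threshold:   # topic shift: close current chunk
|                 chunks.append(current)
|                 current = []
|             current.append(s)
|             prev = vec
|         chunks.append(current)
|         return chunks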
| Lerc wrote:
| Can you spot conceptually similar stories by their shape?
|
| For instance, what is the shape of The Ugly Duckling compared to
| Rudolph the Red-Nosed Reindeer? They are essentially the same
| story, so presumably on some dimension you should be able to
| spot them in a group of unrelated stories.
| rxm wrote:
| What used to be feature engineering a decade or more ago now
| seems to have shifted to developing distributed representations.
| LLMs use word tokens (for words or the entities in images). But
| there are many more. The 3D Fields (or whatever they have evolved
| to) developed by Fei-Fei Li's group represent visual information
| in a way better suited for geometrical tasks. Wav2Vec, the
| convolutional features for YOLO and friends, and these sentence
| representations are other examples. I would love to read a review
| of this circle of ideas.
___________________________________________________________________
(page generated 2025-01-01 23:01 UTC)