[HN Gopher] Reasoning in Large Language Models: A Geometric Pers...
       ___________________________________________________________________
        
       Reasoning in Large Language Models: A Geometric Perspective
        
       Author : belter
       Score  : 80 points
       Date   : 2024-07-07 18:09 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | dr_dshiv wrote:
       | What does reasoning have to do with geometry? Is this like the
       | idea that different concepts have inherent geometrical forms? A
       | Platonic or noetic take on the geometries of reason? (I struggled
       | to understand much of this paper...)
        
         | qorrect wrote:
          | I think they are talking about word embeddings, where words
          | and their context are embedded into a high-dimensional
          | geometric space (one direction might capture how 'feminine' a
          | word is, or how 'blue' it is).
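          | 
          | A minimal sketch of that intuition (purely illustrative: the
          | toy vectors and the "feminine" direction below are made up,
          | not taken from any real embedding model):
          | 
          |     import numpy as np
          | 
          |     # Toy 3-d "embeddings"; real ones have hundreds of dims.
          |     emb = {
          |         "queen": np.array([0.9, 0.1, 0.4]),
          |         "king":  np.array([0.1, 0.2, 0.5]),
          |         "sky":   np.array([0.0, 0.9, 0.1]),
          |     }
          | 
          |     # Hypothetical "how feminine" direction, taken here as the
          |     # difference between two gendered word vectors.
          |     axis = emb["queen"] - emb["king"]
          |     axis /= np.linalg.norm(axis)
          | 
          |     for word, v in emb.items():
          |         # Projection onto the axis = strength of the attribute.
          |         print(word, round(float(v @ axis), 2))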
        
           | cubefox wrote:
           | Which word embeddings get their own dimension though, and
           | which don't? ("feminine" and "blue" are words like any other)
        
             | exe34 wrote:
              | maybe it's like a PCA/Huffman deal, where the more
              | frequently useful concepts get to be the eigenvectors.
        
           | exe34 wrote:
            | ooooh what if qualia are just embeddings? some philosophers
            | would get their togas in a twist!
        
         | cornholio wrote:
          | I think the connection is that the authors could convincingly
          | write a paper on this connection, thus inflating the AI
          | publication bubble, burnishing their academic credentials and
          | improving their chances of getting research grants or selective
          | jobs in the field. Some other interests of the authors seem to
         | be detecting exoplanets using AI and detecting birds through
         | audio analysis.
         | 
          | Since nobody can really say what a good AI department does,
          | companies seem to be driven by credentialism: they load up on
          | machine learning PhDs and master's graduates so they can show
          | their board and investors that they are ready for the AI
          | revolution. This creates economic pressure to write such
          | papers, the vast majority of which will amount to nothing.
        
           | techbro92 wrote:
            | I think a lot of the time you would be correct. But this is
            | published on arXiv, so it's not peer reviewed and doesn't
            | boost the authors' credentials. It could be designed to
            | attract attention to the company they work at. Or it could
            | just be a cool idea the author wanted to share.
        
         | magicalhippo wrote:
         | Modern neural networks make heavy use of linear algebra, in
         | particular the transformer[1] architecture that powers modern
         | LLMs.
         | 
         | Since linear algebra is closely related to geometry[2], it
         | seems quite reasonable that there are some geometric aspects
         | that define their capabilities and performance.
         | 
         | Specifically, in this paper they're considering the intrinsic
         | dimension[3] of the attention layers, and seeing how it
         | correlates with the performance of LLMs.
         | 
         | [1]:
         | https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...
         | 
         | [2]:
         | https://en.wikipedia.org/wiki/Linear_algebra#Relationship_wi...
         | 
         | [3]: https://en.wikipedia.org/wiki/Intrinsic_dimension
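          | 
          | For a rough feel of what "intrinsic dimension" means, here is
          | a minimal sketch (not the estimator used in the paper, just a
          | PCA-style proxy on made-up data that lives on a 2-D plane
          | inside a 50-D space):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          | 
          |     # 1000 points that only vary along 2 directions, embedded
          |     # in a 50-dimensional ambient space (plus a little noise).
          |     latent = rng.normal(size=(1000, 2))
          |     X = latent @ rng.normal(size=(2, 50))
          |     X += 0.01 * rng.normal(size=(1000, 50))
          | 
          |     # Proxy: how many directions carry 99% of the variance?
          |     X -= X.mean(axis=0)
          |     var = np.linalg.svd(X, compute_uv=False) ** 2
          |     ratio = np.cumsum(var) / var.sum()
          |     print("ambient dim:", X.shape[1])                   # 50
          |     print("intrinsic dim (proxy):",
          |           int(np.searchsorted(ratio, 0.99)) + 1)        # ~2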
        
       | lifeisstillgood wrote:
        | But I understand there are two sides to the discussion: that by
        | ingesting huge amounts of text these models have somehow built
        | reasoning capabilities (language, then reasoning), or that the
        | reasoning was done by humans and then written down, so as long
        | as you ask something like "should Romeo find another love after
        | Juliet" there is a set of reasoning reflected in a billion
        | English literature essays and the model just reflects those
        | answers.
       | 
       | Am I missing something?
        
         | magicalhippo wrote:
          | Skimming through the paper, it seems they're noting this issue
          | but kinda skipping over it:
         | 
         |  _In fact, it is clear that approximation capabilities and
         | generalization are not equivalent notions. However, it is not
         | yet determined that the reasoning capabilities of LLMs are tied
         | to their generalization. While these notions are still hard to
         | pinpoint, we will focus in this experimental section on the
         | relationship between intrinsic dimension, thus expressive
         | power, and reasoning capabilities._
        
           | cowsaymoo wrote:
           | Right, they never claimed to have found a roadmap to AGI,
           | they just found a cool geometric tool to describe how LLMs
           | reason through approximation. Sounds like a handy tool if you
           | want to discover things about approximation or
           | generalization.
        
         | Fripplebubby wrote:
         | > the model just reflects those answers
         | 
         | I think there is a lot happening in the word "reflects"! Is it
         | so simple?
         | 
         | Does this mean that the model takes on the opinion of a
         | specific lit crit essay it has "read"? Does that mean it takes
         | on some kind of "average" opinion from everything? How would
         | you define the "average" opinion on a topic, anyway?
         | 
         | Anyway, although I think this is really interesting stuff and
         | cuts to the core of what an LLM is, this paper isn't where
         | you're going to get the answer to that, because it is much more
         | focused and narrow.
        
         | nshm wrote:
          | It is actually pretty straightforward why these models
          | "reason" or, to be more exact, can operate on complex
          | concepts. By processing huge amounts of text they build an
          | internal representation where those concepts are represented
          | as simple nodes (neurons or groups of them). So they really do
          | distill knowledge. Alternatively you can think of it as a very
          | good principal component analysis that extracts many important
          | aspects. Or like a semantic graph built automatically.
          | 
          | Once knowledge is distilled you can build on top of it easily,
          | for example by merging concepts.
          | 
          | So no secret here.
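          | 
          | A loose illustration of the PCA analogy (toy co-occurrence
          | counts, not how an LLM actually builds its representation):
          | 
          |     import numpy as np
          | 
          |     # Word-by-context co-occurrence counts (made-up numbers).
          |     words = ["cat", "dog", "car", "truck"]
          |     counts = np.array([
          |         [9., 8., 1., 0.],   # cat: "fur" "pet" "road" "engine"
          |         [8., 9., 1., 1.],   # dog
          |         [0., 1., 9., 8.],   # car
          |         [1., 0., 8., 9.],   # truck
          |     ])
          | 
          |     # The first principal component comes out as an
          |     # "animal vs. vehicle" concept axis.
          |     X = counts - counts.mean(axis=0)
          |     _, _, vt = np.linalg.svd(X)
          |     for w, score in zip(words, X @ vt[0]):
          |         print(w, round(float(score), 2))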
        
           | lifeisstillgood wrote:
            | Do they distill knowledge, or distill the relationships
            | between words (that describe knowledge)?
            | 
            | I know it seems like dancing on the head of a pin, but...
        
         | wongarsu wrote:
          | To me those seem like two sides of the same coin. LLMs are
         | fundamentally trained to complete text. The training just tries
         | to find the most effective way to do that within the given
         | model architecture and parameter count.
         | 
         | Now if we start by "LLMs ingest huge amounts of text", then a
         | simple model would complete text by simple memorization. But
         | correctly completing "234 * 452 =" is a lot simpler to do by
         | doing math than by having memorized all possible
          | multiplications. Similarly, understanding the world and being
          | able to reason about it helps you correctly complete human-
          | written sentences. Thus a sufficiently well-trained model that
         | has enough parameters to do this but not so many that it simply
         | overfits should be expected to develop some reasoning ability.
         | 
         | If you start with "the training set contains a lot of
         | reasoning" you can get something that looks like reasoning in
          | the memorization stage. But the same argument for why the
          | model would develop actual reasoning still works, and is even
          | stronger: if you have to complete someone's argument, that's a
          | lot easier if you can follow their train of thought.
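          | 
          | To put rough numbers on that (a back-of-the-envelope sketch,
          | nothing to do with how an actual LLM stores anything):
          | 
          |     # Memorizing every product of two 3-digit numbers means
          |     # storing 900 * 900 = 810,000 separate facts, and it still
          |     # fails on the first 4-digit operand. The multiplication
          |     # algorithm is a small, fixed rule that covers all of them.
          |     pairs = 900 * 900
          |     print(pairs)          # 810000 facts to memorize
          |     print(234 * 452)      # 105768 -- one lookup, or one
          |                           # application of the algorithm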
        
         | godelski wrote:
         | I think you're close enough that the differences probably
         | aren't too important. But if you want a bit more nuance, then
         | read on. For disclosure, I'm in the second camp here. But I'll
         | also say that I have a lot of very strong evidence to support
         | this position, and that I do this from the perspective of a
         | researcher.
         | 
          | There are a few big problems with making any definite claims
          | about either side. First, we need to know what data the machine
          | is processing when training. I think we all understand that if
          | the data is in training, then testing is not actually testing a
          | model's ability to generalize, but a model's ability to recall.
          | Second, we need to recognize the amount of duplication in the
          | data, both exact and semantic.
         | 
          | 1) We have no idea, because these datasets are proprietary.
          | While LLaMA is more open than GPT, we don't know all the data
          | that went into it (last I checked). Thus, you can't say "this
          | isn't in the data."[0] But we do know some things that are in
          | the data,
         | though we don't know exactly what was filtered out. We're all
         | pretty online people here and I'm sure many people have seen
         | some of the depths of places like Reddit, Medium, or even
         | Hacker News. These are all in the (unfiltered) training data!
         | There's even a large number of arxiv papers, books,
         | publications, and so much more. So you have to ask yourself
         | this: "Are we confident that what we're asking the model to do
         | is not in the data we trained on?" Almost certainly it is, so
         | then the question moves to "Are we confident that what we're
         | asking the model to do was adequately filtered out during
         | training so we can have a fair test?" Regardless of what your
         | position is, I think you can see how such a question is
          | incredibly important and how it would be easy to mess up. And
          | it only gets easier to mess up the more data we train on, since
          | it's so incredibly hard to process that data.[1] I think you
          | can see some
         | concerning issues with this filtering method and how it can
         | create a large number of false negatives. They explicitly
         | ignore answers, which is important for part 2. IIRC the GPT-3
         | paper also used an ngram model to check for dupes. But the most
          | concerning line to me was this one:
          | 
          |     > As can be seen in tables 9 and 10, contamination overall
          |     has very little effect on the reported results.
         | 
          | There is a concerning way to read this that serves as a valid
          | explanation for the results: that the data is so contaminated
          | that the filtering process does not meaningfully remove the
          | contamination, and thus does not significantly change the
          | results. If introducing contamination into your data does not
          | change your results, you either have a model that has learned
          | the function of the data VERY well and has an extremely
          | impressive form of generalization, OR your data is contaminated
          | in ways you aren't aware of (there are other explanations too,
          | btw). There's a clearly simpler answer here.
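          | 
          | For concreteness, here is a minimal sketch of the kind of
          | n-gram overlap check being discussed (a generic illustration,
          | not the exact procedure from the GPT-3 or GPT-4 reports):
          | 
          |     def ngrams(text, n):
          |         toks = text.lower().split()
          |         return {tuple(toks[i:i + n])
          |                 for i in range(len(toks) - n + 1)}
          | 
          |     def flagged(train_doc, test_item, n=3):
          |         # Flag the test item if it shares any n-gram with the
          |         # training text.
          |         return bool(ngrams(train_doc, n) & ngrams(test_item, n))
          | 
          |     train = "the quick brown fox jumps over the lazy dog"
          |     test = "a fast brown fox leaps over a sleepy dog"
          |     # A paraphrase slips straight through the exact-match net:
          |     print(flagged(train, test))   # False -- not caught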
         | 
          | The second issue is about semantic information and
          | contamination[2]. This is when data has the same effective
          | meaning but uses different ways to express it. "This is a cat"
          | and "este es un gato" are semantically the same but share no
          | similar words. So are "I think there's data spoilage" and
          | "There are some concerning issues left to be resolved that
          | bring into question the potential for information leakage."
          | These will not be caught by substrings or n-grams. Yet,
          | training on one will be no
         | different than training on the other once we consider RLHF. The
         | thing here is that in high dimensions, data is very confusing
         | and does not act the way you might expect when operating in 2D
         | and 3D. A mean between two values may or may not be
         | representative depending on the type of distribution (uniform
         | and gaussian, respectively), and we don't have a clue what that
         | is (it is intractable!). The curse of dimensionality is about
         | how it is difficult to distinguish a nearest neighboring point
         | from the furthest neighboring point, because our concept of a
          | metric degrades as we increase dimensionality (just like we
          | lose algebraic structure when going from C (complex) -> H
          | (quaternions) -> O (octonions): first commutativity, then
          | associativity)[3]. Some of this may be uninteresting in the
         | mathematical sense but some does matter too. But because of
         | this, we need to rethink our previous questions carefully. Now
         | we need to ask: "Are we confident that we have filtered out
         | data that is not sufficiently meaningfully different from that
         | in the test data?" Given the complexity of semantic similarity
         | and the fact that "sufficiently" is not well defined, I think
         | this should make anybody uneasy. If you are absolutely
         | confident the answer is "yes, we have filtered it" I would
         | think you a fool. It is so incredibly easy to fool ourselves
         | that any good researcher needs to have a constant amount of
          | doubt (though confidence is needed too!). But neither should
          | our lack of a definite answer here stop progress; it should
          | just make us more careful about what claims we do make. And we
          | need
         | to be clear about this or else conmen have an easy time
         | convincing others.
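          | 
          | To make "semantically identical, lexically disjoint" concrete,
          | a toy word-overlap measure (standing in for whatever surface-
          | level filter was actually used):
          | 
          |     def jaccard(a, b):
          |         sa, sb = set(a.lower().split()), set(b.lower().split())
          |         return len(sa & sb) / len(sa | sb)
          | 
          |     # Same meaning, zero shared words: the filter sees nothing.
          |     print(jaccard("This is a cat", "este es un gato"))   # 0.0
          | 
          |     # Different meaning, high overlap: it can over-trigger too.
          |     print(jaccard("the cat sat on the mat",
          |                   "the dog sat on the mat"))             # ~0.67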
         | 
          | To me, the common line of research is wrong. Until we know the
          | data, and many people have processed it looking for sources of
          | contamination, results like these are not meaningful. They rely
          | on a shaky foundation and are often looking more for evidence
          | to prove reasoning than considering that it might not be there.
         | 
         | But for me, I think the conversations about a lot of this are
         | quite strange. Does it matter that LLMs can't reason? I mean in
         | some sense yes, but the lack of this property does not make
         | them any less powerful of a tool. If all they are is a lossy
          | compression of the majority of human knowledge with a built-in
          | human interface, that sounds like an incredible achievement and
         | a very useful tool. Even Google is fuzzy! But this also tells
          | us what the tool is good for and isn't, and that puts bounds
          | on what we should rely on it for and what we can trust it to do
          | with and without human intervention. I think some are afraid
         | that if LLMs aren't reasoning, then that means we won't get
         | AGI. But at the same time, if they don't reason, then we need
         | to find out why and how to make machines reason if we are to
         | get there. So ignoring potential pitfalls hinders this
         | progress. I'm not suggesting that we should stop using or
         | studying LLMs (we should continue to), but rather that we need
         | to stop putting alternatives down. We need to stop comparing
         | alternatives one-to-one to models that took millions of dollars
         | to do a single training and have been studied by thousands of
          | people for several years against things cobbled together by
          | small labs on a shoestring budget. We'll never be able to
          | advance if the goalpost is that you can't make incremental
          | steps along the way. Otherwise how do you proceed? You have to
          | create something new without testing it, convince someone to
          | give you millions of dollars to train it, and then millions
          | more to debug your mistakes and the things you've learned
          | along the way? Very inefficient. We can take small steps. I
          | think this goalpost results in obfuscation: because the bar is
          | set so high, strong claims need to be made for these works to
          | be published. So we have to ask ourselves the deeper questions:
          | "Why are we doing this?"[4]
         | 
         | [0] This might seem backwards but the creation of the model
         | implicitly claims that the test data and training data are
         | segregated. "Show me this isn't in training" is a request for
         | validation.
         | 
         | [1] https://arxiv.org/abs/2303.08774
         | 
          | [2] If you're interested, Meta put out work on semantic
          | deduplication last year. They mostly focused on vision, but it
          | still shows the importance of what's being argued here. It is
          | probably easier to verify that images are semantically similar
          | than that sentences are, since language is more abstract.
          | Pixels can be wildly different while the result is visually
          | identical; how does this concept translate to language?
         | https://arxiv.org/abs/2303.09540
         | 
         | [3] https://math.stackexchange.com/questions/641809/what-
         | specifi...
         | 
         | [4] I think if our answer is just "to make money" (or anything
         | semantically similar like "increase share value") then we are
         | doomed to mediocrity and will stagnate. But I think if we're
         | doing these things to better human lives, to understand the
         | world and how things work (I'd argue building AI is, even if a
         | bit abstract), or to make useful and meaningful things, then
         | the money will follow. But I think that many of us and many
         | leading teams and businesses have lost focus on the journey
         | that has led to profits and are too focused on the end result.
          | And I do not think this is isolated to CEOs; I think the same
          | short-sighted thinking is repeated all the way down the
          | corporate ladder, from the manager focusing on what their
          | bosses explicitly ask for (rather than the intent) to the
          | employee who knows that this is not the right thing to do but
          | does it anyway (often because they know the manager will be
          | unhappy; and this repeats all the way up). All life, business,
          | technology, and creation have immense amounts of complexity,
          | which we obviously want to simplify as much as possible. But
          | when we hyper-focus on any set of rules, no matter how
          | complex, we will be doomed to fail, because the environment is
          | always changing and we will never be able to instantly adapt
          | (this is the nature of chaos, where small perturbations have
          | large effects on the outcome). That doesn't mean we shouldn't
         | try to make rules, but rather it means that rules are to be
         | broken. It's just a matter of knowing when. In the end, this is
         | an example of what it means to be able to reason. So we should
         | be careful to ensure that we create AGI by making machines able
         | to reason and think (to make them "more human") rather than by
         | making humans into unthinking machines. I worry that the latter
         | looks more likely, given that it is a much easier task to
         | accomplish.
        
       | ChicagoDave wrote:
       | LLMs do not have the technology to iteratively solve a complex
       | problem.
       | 
       | This is a fact. No graph will change this.
       | 
        | If you want "reasoning," then you need to invent a new technology to
       | iterate, validate, experiment, validate, query external
       | expertise, and validate again. When we get that technology, then
       | AI will become resilient in solving complex problems.
        
         | sigmoid10 wrote:
          | That's false. It has been shown that LLMs can perform e.g.
          | gradient descent internally [1], which can explain why they are
          | so good at few-shot prompting. The universal approximation
          | theorem already tells us that a single hidden layer is enough
          | to approximate any continuous function (on a compact domain),
          | so it should come as no surprise that modern deep networks
          | with many layers can perform iterative optimisations.
         | 
         | [1] https://arxiv.org/abs/2212.10559
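          | 
          | For readers unfamiliar with the claim: "performing gradient
          | descent internally" means the forward pass through stacked
          | attention layers can mimic update steps like the explicit loop
          | below. This is just ordinary gradient descent on a least-
          | squares problem, shown for reference; it is not code from [1]:
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.normal(size=(32, 4))            # few-shot inputs
          |     w_true = np.array([1.0, -2.0, 0.5, 3.0])
          |     y = X @ w_true                          # their labels
          | 
          |     w = np.zeros(4)                         # initial guess
          |     for _ in range(200):                    # iterative updates
          |         grad = X.T @ (X @ w - y) / len(y)   # LSQ gradient
          |         w -= 0.1 * grad                     # one descent step
          |     print(np.round(w, 2))                   # close to w_true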
        
       | jens-c wrote:
       | Not an expert, but to me the paper reads like it was written _by_
       | an LLM.
       | 
       | The first paragraph of the introduction is fine, but then it
       | kinda turns into gobbledygook...
        
         | belter wrote:
         | Comes from this company: https://www.tenyx.com/about-us
        
       | magicalhippo wrote:
        | I'm not into AI, but I like to watch from the sidelines. Here's
        | my non-AI summary of the paper after skimming through it
        | (corrections appreciated):
       | 
        | The multilayer perceptron (MLP)[1] layers used in modern neural
        | networks, like LLMs, essentially partition the input space into
        | multiple regions. The authors show that the number of regions a
        | single MLP layer can partition the input into grows
        | exponentially with the intrinsic dimension[2] of the input, and
        | a larger number of regions/partitions means greater
        | approximation power for the MLP layer.
        | 
        | Thus you can significantly increase the approximation power of
        | an MLP layer without increasing the number of neurons, by
        | essentially "distilling" the input to it.
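        | 
        | A hands-on way to see the "regions" idea (a toy ReLU layer where
        | we count distinct on/off activation patterns, which is what the
        | linear regions correspond to; this is my illustration, not the
        | paper's experiment):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     W = rng.normal(size=(16, 8))   # 16 ReLU units, 8-d input
        |     b = rng.normal(size=16)
        | 
        |     def regions_hit(points):
        |         # Each distinct on/off pattern is one linear region.
        |         return len({tuple((p @ W.T + b > 0).astype(int))
        |                     for p in points})
        | 
        |     # Inputs confined to a 2-d subspace vs. spread over all 8:
        |     low = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 8))
        |     high = rng.normal(size=(5000, 8))
        |     print(regions_hit(low), "<", regions_hit(high))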
       | 
        | In the transformer architecture, the inputs to the MLP layers
        | are the outputs of the self-attention layers[3]. The authors
        | then show that the graph density of a self-attention layer
        | correlates strongly with the intrinsic dimension of its output.
        | Thus a denser self-attention layer means the MLP can do a better
        | job.
        | 
        | One way of increasing the density of the attention layers is to
        | add more context. (edited, see comment) They show that
        | prepending tokens as context to a question makes the LLM perform
        | better when doing so increases the intrinsic dimension at the
        | final layer.
       | 
       | They also note that the transformer architecture is susceptible
       | to compounding approximation errors, and that the much more
       | precise partitioning provided by the MLP layers when fed with
       | high intrinsic-dimensional input can help with this. However the
       | impact of this on generalization remains to be explored further.
       | 
       | If the results hold up it does seem like this paper provides nice
       | insight into how to better optimize LLMs and similar neural
       | networks.
       | 
       | [1]: https://en.wikipedia.org/wiki/Multilayer_perceptron
       | 
       | [2]: https://en.wikipedia.org/wiki/Intrinsic_dimension
       | 
       | [3]:
       | https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...
        
         | Fripplebubby wrote:
          | Awesome summary by someone who read and actually understood
          | the paper.
         | 
         | > One way of increasing the density of the attention layers is
         | to add more context. They show that simply prepending any token
         | as context to a question makes the LLM perform better. Adding
         | relevant context makes it even better.
         | 
          | Right, I think a more intuitive way to think about this is to
          | define density: the number of _edges_ in the self-attention
          | graph connecting tokens. Maybe a simpler explanation: the
          | number of times a token had some connection to another token,
          | divided by the number of tokens. So, tokens which actually
          | relate to one another and provide information are good, and
          | non sequitur tokens don't help. Except that you say:
         | 
         | > They show that simply prepending any token as context to a
         | question makes the LLM perform better.
         | 
         | I think this is not quite right. What they found was:
         | 
         | > pre-pending the question at hand with any type of token does
         | increase the intrinsic dimension at the first layer
         | 
         | > however, this increase is not necessarily correlated with the
         | reasoning capability of the model
         | 
         | but it is only
         | 
         | > when the pre-pended tokens lead to an increase in the
         | intrinsic dimension at the *final layer* of the model, the
         | reasoning capabilities of the LLM improve significantly.
         | 
         | (emphasis mine)
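          | 
          | To make "density" concrete, a minimal sketch (thresholding a
          | softmax attention matrix to build the graph is my own
          | simplification; the paper's exact construction may differ):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     T = 6                                  # number of tokens
          |     logits = rng.normal(size=(T, T))
          |     attn = np.exp(logits)
          |     attn /= attn.sum(axis=1, keepdims=True)
          | 
          |     # Edge i -> j if token i attends to token j above the
          |     # uniform-attention baseline of 1/T.
          |     edges = int((attn > 1.0 / T).sum())
          |     print("edges:", edges, "density:", round(edges / T, 2))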
        
           | magicalhippo wrote:
           | Thanks, good catch, got distracted by the editing flaws at
           | the end there (they rewrote a section without removing the
           | old one).
        
       | bastien2 wrote:
       | You can't "enhance" from zero. LLMs _by design_ are not capable
       | of reason.
       | 
       | We can observe LLM-like behaviour in humans: all those
       | reactionaries who just parrot whatever catchphrases mass media
       | programmed into them. LLMs are just the computer version of that
       | uncle who thinks Fox News is true and is the reason your nieces
       | have to wear long pants at family gatherings.
       | 
       | He doesn't understand the catchphrases he parrots any more than
       | the chatbots do.
       | 
       | Actual AI will require a kind of modelling that as yet does not
       | exist.
        
         | belter wrote:
         | > LLMs by design are not capable of reason.
         | 
          | It is not as clear cut. The argument is that the patterns
          | they learn from text encode several layers of abstraction,
          | one of them being _some_ reasoning, as it is encoded in the
          | discourse.
        
           | wizzwizz4 wrote:
           | They are capable of picking up incredibly crude, noisy
           | versions of first-order symbolic reasoning, and specific,
           | commonly-used arguments, and the context for when those might
           | be applied.
           | 
           | Taken together and iterated, you get something vaguely
           | resembling a reasoning algorithm, but your average
           | schoolchild with an NLP library and regular expressions could
           | make a better reasoning algorithm. (While I've been calling
           | these "reasoning algorithms" for analogy's sake, they don't
           | actually behave how we expect reasoning to behave.)
           | 
           | The language model predicts what reasoning might look like.
           | But it doesn't actually _do the reasoning_ , so (unless it
           | has something capable of reasoning to guide it), it's not
           | going to correctly derive conclusions from premises.
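            | 
            | In the spirit of the "schoolchild with regular expressions"
            | remark, a toy rule-applier (nothing to do with how LLMs work
            | internally; it just shows how little machinery explicit
            | rule-following needs):
            | 
            |     import re
            | 
            |     facts = {"socrates is a man"}
            |     # One crude modus-ponens-style rewrite rule.
            |     rules = [(r"^(\w+) is a man$", "{0} is mortal")]
            | 
            |     changed = True
            |     while changed:      # repeat until nothing new derives
            |         changed = False
            |         for pattern, template in rules:
            |             for fact in list(facts):
            |                 m = re.match(pattern, fact)
            |                 if not m:
            |                     continue
            |                 new = template.format(*m.groups())
            |                 if new not in facts:
            |                     facts.add(new)
            |                     changed = True
            | 
            |     # ['socrates is a man', 'socrates is mortal']
            |     print(sorted(facts))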
        
         | cowsaymoo wrote:
         | The vocabulary used here doesn't have sufficient intrinsic
         | dimension to partition the input into a low loss prediction.
         | Improvement is promising with larger context or denser
         | attention.
        
         | bl0rg wrote:
         | Can you explain what it means to reason about something? Since
         | you are so confident I'm guessing you'll find it easy to come
         | up with a non-contrived definition that'll clearly include
         | humans and future "actual AI" but exclude LLMs.
        
           | stoperaticless wrote:
            | Not the parent, but there are a couple of things current AI
            | lacks:
            | 
            | - learning from a single article/book with lasting effect
            | (accumulation of knowledge)
            | 
            | - arithmetic without unexpected errors
            | 
            | - gauging the reliability of the information it's producing
            | 
            | BTW, I doubt that you'll get a satisfactory definition of
            | "able to reason" (or "conscious" or "alive" or "chair"), as
            | they define more of an end or direction of a spectrum than
            | an exact cutoff point.
            | 
            | Current LLMs are impressive and useful, but given how often
            | they spout nonsense, it is hard to put them in the "able to
            | reason" category.
        
         | p1esk wrote:
         | LLMs are trained to predict the next word in a sequence. As a
         | result of this training they developed reasoning abilities.
         | Currently these reasoning abilities are roughly at human level,
          | but next-gen models (GPT-5) should be superior to humans at
          | any reasoning task.
        
         | Kiro wrote:
         | Go look at the top comment of this thread:
         | https://news.ycombinator.com/item?id=40900482
         | 
         | That's the kind of stuff I want to see when opening a thread on
          | HN, but most of the time we get shallow snark like yours
         | instead. It's a shame.
        
       ___________________________________________________________________
       (page generated 2024-07-07 23:00 UTC)