[HN Gopher] Reasoning in Large Language Models: A Geometric Pers...
___________________________________________________________________
Reasoning in Large Language Models: A Geometric Perspective
Author : belter
Score : 80 points
Date : 2024-07-07 18:09 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| dr_dshiv wrote:
| What does reasoning have to do with geometry? Is this like the
| idea that different concepts have inherent geometrical forms? A
| Platonic or noetic take on the geometries of reason? (I struggled
| to understand much of this paper...)
| qorrect wrote:
| I think they are talking about the word embeddings, where
| context is embedded in a high-dimensional geometric space (one
| dimension might capture how 'feminine' a word is, or how 'blue'
| it is).
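|
| (A minimal sketch of the idea, using made-up 4-d vectors and a
| hypothetical "feminine" axis; real embeddings have hundreds of
| dimensions and their directions are nowhere near this clean.)
|
|   import numpy as np
|
|   # Toy, hand-made 4-d "embeddings" -- purely illustrative.
|   emb = {
|       "king":  np.array([0.9, 0.1, 0.3, 0.0]),
|       "queen": np.array([0.9, 0.8, 0.3, 0.0]),
|       "man":   np.array([0.2, 0.1, 0.1, 0.0]),
|       "woman": np.array([0.2, 0.8, 0.1, 0.0]),
|   }
|
|   # Treat (woman - man) as a hypothetical "feminine" direction.
|   fem_axis = emb["woman"] - emb["man"]
|   fem_axis /= np.linalg.norm(fem_axis)
|
|   for word, v in emb.items():
|       score = v @ fem_axis  # projection = how "feminine" the toy vector is
|       print(f"{word:>6}: {score:+.2f}")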
| cubefox wrote:
| Which word embeddings get their own dimension though, and
| which don't? ("feminine" and "blue" are words like any other)
| exe34 wrote:
| maybe it's like a PCA/Huffman deal, where the more
| regularly useful ones get to be the eigenvectors.
| exe34 wrote:
| ooooh what if qualia are just embeddings? some philosophers
| would get their togas in a twist!
| cornholio wrote:
| I think the connection is that the authors could convincingly
| write a paper on this connection, thus inflating the AI
| publication bubble, furthering their academic careers and
| improving their chances of getting research grants or selective
| jobs in the field. Some other interests of the authors seem to
| be detecting exoplanets using AI and detecting birds through
| audio analysis.
|
| Since nobody can really say what a good AI department does,
| companies seem to be driven by credentialism, loading up on
| machine-learning PhDs and master's graduates so they can show
| their board and investors that they are ready for the AI
| revolution. This creates economic pressure to write such
| papers, the vast majority of which will amount to nothing.
| techbro92 wrote:
| I think a lot of the time you would be correct. But this is
| published to arXiv, so it's not peer reviewed and doesn't boost
| the authors' credentials. It could be designed to attract
| attention to the company they work at. Or it could just be a
| cool idea the authors wanted to share.
| magicalhippo wrote:
| Modern neural networks make heavy use of linear algebra, in
| particular the transformer[1] architecture that powers modern
| LLMs.
|
| Since linear algebra is closely related to geometry[2], it
| seems quite reasonable that there are some geometric aspects
| that define their capabilities and performance.
|
| Specifically, in this paper they're considering the intrinsic
| dimension[3] of the attention layers, and seeing how it
| correlates with the performance of LLMs.
|
| [1]:
| https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...
|
| [2]:
| https://en.wikipedia.org/wiki/Linear_algebra#Relationship_wi...
|
| [3]: https://en.wikipedia.org/wiki/Intrinsic_dimension
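|
| If it helps build intuition, here's a rough sketch of one
| simple way to estimate intrinsic dimension (a PCA-style
| estimate of my own, not necessarily the estimator the paper
| uses): generate data that really lives on a low-dimensional
| subspace of a higher-dimensional space, then count how many
| principal components carry most of the variance.
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|
|   # Data that lives on a 3-d subspace of a 50-d ambient space.
|   latent = rng.normal(size=(1000, 3))
|   x = latent @ rng.normal(size=(3, 50))
|   x += 0.01 * rng.normal(size=x.shape)  # a little noise
|
|   # Count components needed to explain 95% of the variance.
|   x -= x.mean(axis=0)
|   var = np.linalg.svd(x, compute_uv=False) ** 2
|   ratio = np.cumsum(var) / var.sum()
|   print("estimated intrinsic dimension:",
|         int(np.searchsorted(ratio, 0.95)) + 1)  # prints 3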
| lifeisstillgood wrote:
| But I understand there are two sides to the discussion: either,
| by ingesting huge amounts of text, these models have somehow
| built reasoning capabilities (language, then reasoning), or the
| reasoning was done by humans and then written down, so as long
| as you ask something like "should Romeo find another love after
| Juliet?" there is a set of reasoning reflected in a billion
| English literature essays, and the model just reflects those
| answers.
|
| Am I missing something?
| magicalhippo wrote:
| Glossing through the paper, it seems they're noting this issue
| but kinda skipping over it:
|
| _In fact, it is clear that approximation capabilities and
| generalization are not equivalent notions. However, it is not
| yet determined that the reasoning capabilities of LLMs are tied
| to their generalization. While these notions are still hard to
| pinpoint, we will focus in this experimental section on the
| relationship between intrinsic dimension, thus expressive
| power, and reasoning capabilities._
| cowsaymoo wrote:
| Right, they never claimed to have found a roadmap to AGI,
| they just found a cool geometric tool to describe how LLMs
| reason through approximation. Sounds like a handy tool if you
| want to discover things about approximation or
| generalization.
| Fripplebubby wrote:
| > the model just reflects those answers
|
| I think there is a lot happening in the word "reflects"! Is it
| so simple?
|
| Does this mean that the model takes on the opinion of a
| specific lit crit essay it has "read"? Does that mean it takes
| on some kind of "average" opinion from everything? How would
| you define the "average" opinion on a topic, anyway?
|
| Anyway, although I think this is really interesting stuff and
| cuts to the core of what an LLM is, this paper isn't where
| you're going to get the answer to that, because it is much more
| focused and narrow.
| nshm wrote:
| It is actually pretty straightforward why these models "reason"
| or, to be more exact, can operate on complex concepts. By
| processing huge amounts of text they build an internal
| representation where those concepts are represented as simple
| nodes (neurons or groups of neurons). So they really do distill
| knowledge. Alternatively, you can think of it as a very good
| principal component analysis that can extract many important
| aspects, or as a semantic graph built automatically.
|
| Once knowledge is distilled you can build on top of it easily,
| for example by merging concepts.
|
| So no secret here.
| lifeisstillgood wrote:
| Do they distill knowledge, or distill the relationships between
| words (that describe knowledge)?
|
| I know it seems like dancing on the head of a pin, but ...
| wongarsu wrote:
| To me those seem like two sides of the same coin. LLMs are
| fundamentally trained to complete text. The training just tries
| to find the most effective way to do that within the given
| model architecture and parameter count.
|
| Now if we start by "LLMs ingest huge amounts of text", then a
| simple model would complete text by simple memorization. But
| correctly completing "234 * 452 =" is a lot simpler to do by
| doing math than by having memorized all possible
| multiplications. Similarly, understanding the world and being
| able to reason about it helps you correctly complete human-
| written sentences. Thus a sufficiently well-trained model that
| has enough parameters to do this but not so many that it simply
| overfits should be expected to develop some reasoning ability.
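|
| As a back-of-the-envelope illustration (my numbers, just to
| make the point concrete): memorizing every 3-digit-by-3-digit
| product means storing hundreds of thousands of separate facts,
| while "doing the math" is one short procedure that covers them
| all.
|
|   # Memorization: every 3-digit * 3-digit product is its own fact.
|   print("facts to memorize:", 900 * 900)  # 810000
|
|   # Doing the math: one schoolbook procedure covers all of them.
|   def long_multiply(a: int, b: int) -> int:
|       total = 0
|       for shift, digit in enumerate(reversed(str(b))):
|           total += a * int(digit) * 10 ** shift
|       return total
|
|   print("234 * 452 =", long_multiply(234, 452))  # 105768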
|
| If you start with "the training set contains a lot of
| reasoning" you can get something that looks like reasoning in
| the memorization stage. But the same argument why the model
| would develop actual reasoning still works and is even
| stronger: if you have to complete someone's argument that's a
| lot easier if you can follow their train of thought.
| godelski wrote:
| I think you're close enough that the differences probably
| aren't too important. But if you want a bit more nuance, then
| read on. For disclosure, I'm in the second camp here. But I'll
| also say that I have a lot of very strong evidence to support
| this position, and that I do this from the perspective of a
| researcher.
|
| There are a few big problems with making any definite claims
| about either side. First, we need to know what data the machine
| is processing when training. I think we all understand that if
| the data is in training, then testing is not actually testing a
| model's ability to generalize, but a model's ability to recall.
| Second, we need to recognize the amount of duplication in the
| data, both exact and semantic.
|
| 1) We have no idea, because these datasets are proprietary.
| While LLaMA is more open than GPT, we don't know all the data
| that went into it (last I checked). Thus, you can't say "this
| isn't in the data."[0] But we do know some things that are in
| the data, though we don't know exactly what was filtered out.
| We're all pretty online people here and I'm sure many people
| have seen some of the depths of places like Reddit, Medium, or
| even Hacker News. These are all in the (unfiltered) training
| data! There's even a large number of arxiv papers, books,
| publications, and so much more. So you have to ask yourself:
| "Are we confident that what we're asking the model to do is not
| in the data we trained on?" Almost certainly it is, so the
| question becomes "Are we confident that what we're asking the
| model to do was adequately filtered out during training so we
| can have a fair test?" Regardless of your position, I think you
| can see how such a question is incredibly important and how
| easy it would be to mess up. And it only gets easier to mess up
| the more data we train on, since it's so incredibly hard to
| process that data.[1] I think you can see some concerning
| issues with this filtering method and how it can create a large
| number of false negatives. They explicitly ignore answers,
| which is important for part 2. IIRC the GPT-3 paper also used
| an n-gram check for duplicates. But the most concerning line to
| me was this one:
|
| > As can be seen in tables 9 and 10, contamination overall has
| very little effect on the reported results.
|
| There is a concerning way to read the data here that serves as
| a valid explanation for the results: that the data is so
| contaminated that the filtering process does not meaningfully
| remove the contamination, and thus does not significantly
| change the results. If filtering contamination out of your data
| does not change your results, you either have a model that has
| learned the function of the data VERY well and has an extremely
| impressive form of generalization, OR your data is contaminated
| in ways you aren't aware of (there are other explanations too,
| btw). There's a clearly simpler answer here.
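|
| For anyone who hasn't seen this kind of filtering, a toy n-gram
| overlap check looks roughly like the sketch below (my own toy
| version; the real GPT-3/GPT-4 procedures differ in n,
| normalization, and what gets discarded). Note how easily a
| paraphrase slips through, which is exactly the next problem:
|
|   def ngrams(text, n=8):
|       words = text.lower().split()
|       return {tuple(words[i:i + n])
|               for i in range(len(words) - n + 1)}
|
|   def looks_contaminated(test_item, training_doc, n=8):
|       # Flag if any n-word run is shared with the training doc.
|       return bool(ngrams(test_item, n) & ngrams(training_doc, n))
|
|   test = ("the quick brown fox jumps over the lazy dog "
|           "by the river bank")
|   train = ("a speedy brown fox leaps over a sleepy dog "
|            "near the riverbank")
|   print(looks_contaminated(test, train))  # False: paraphrase missed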
|
| The second issue is about semantic information and
| contamination.[2] This is when data has the same effective
| meaning but uses different ways to express it. "This is a cat"
| and "este es un gato" are semantically the same but share no
| similar words. So are "I think there's data spoilage" and
| "There are some concerning issues left to be resolved that
| bring into question the potential for information leakage."
| These will not be caught by substring or n-gram checks. Yet
| training on one will be no different than training on the other
| once we consider RLHF. The thing here is that in high
| dimensions, data is very confusing and does not act the way you
| might expect when operating in 2D and 3D. A mean between two
| values may or may not be representative depending on the type
| of distribution (uniform and Gaussian, respectively), and we
| don't have a clue what that is (it is intractable!). The curse
| of dimensionality is about how it is difficult to distinguish a
| nearest neighboring point from the furthest neighboring point,
| because our concept of a metric degrades as we increase
| dimensionality (just like we lose algebraic structure when
| going from C (complex) -> H (quaternions) -> O (octonions):
| first commutativity, then associativity[3]). Some of this may
| be uninteresting in the mathematical sense, but some of it does
| matter. But because of this, we need to rethink our previous
| questions carefully. Now we need to ask: "Are we confident that
| we have filtered out data that is not sufficiently meaningfully
| different from that in the test data?" Given the complexity of
| semantic similarity and the fact that "sufficiently" is not
| well defined, I think this should make anybody uneasy. If you
| are absolutely confident the answer is "yes, we have filtered
| it," I would think you a fool. It is so incredibly easy to fool
| ourselves that any good researcher needs to have a constant
| amount of doubt (though confidence is needed too!). But neither
| should our lack of a definite answer here stop progress; it
| should just make us more careful about what claims we do make.
| And we need to be clear about this, or else con men have an
| easy time convincing others.
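|
| The distance-concentration point is easy to see numerically.
| Here's a quick Monte Carlo sketch (mine, only to illustrate the
| claim): as the dimension grows, the nearest and furthest
| neighbors of a query point end up at nearly the same distance.
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   for d in (2, 10, 100, 1000, 10000):
|       points = rng.uniform(size=(2000, d))
|       query = rng.uniform(size=d)
|       dists = np.linalg.norm(points - query, axis=1)
|       # Relative gap between furthest and nearest shrinks with d.
|       contrast = (dists.max() - dists.min()) / dists.min()
|       print(f"d={d:>5}  (max - min) / min = {contrast:.3f}")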
|
| To me, this common line of research is wrong. Until we know the
| data and have processed it with many people looking for signs
| of contamination, results like these are not meaningful. They
| rely on a shaky foundation and often look more for evidence to
| prove reasoning than consider that it might not be there.
|
| But for me, the conversations about a lot of this are quite
| strange. Does it matter that LLMs can't reason? In some sense
| yes, but the lack of this property does not make them any less
| powerful a tool. If all they are is a lossy compression of the
| majority of human knowledge with a built-in human interface,
| that sounds like an incredible achievement and a very useful
| tool. Even Google is fuzzy! But this also tells us what the
| tool is and isn't good for, and it puts bounds on what we
| should rely on it for and what we can trust it to do with and
| without human intervention. I think some are afraid that if
| LLMs aren't reasoning, then that means we won't get AGI. But at
| the same time, if they don't reason, then we need to find out
| why, and how to make machines reason, if we are to get there.
| So ignoring potential pitfalls hinders this progress. I'm not
| suggesting that we should stop using or studying LLMs (we
| should continue to), but rather that we need to stop putting
| alternatives down. We need to stop comparing, one-to-one,
| things scrambled together by small labs on a shoestring budget
| against models that took millions of dollars for a single
| training run and have been studied by thousands of people for
| several years. We'll never be able to advance if the goalpost
| is that you can't make incremental steps along the way.
| Otherwise how do you? You'd have to create something new
| without testing it, convince someone to give you millions of
| dollars to train it, and then millions more to debug the
| mistakes and lessons learned along the way? Very inefficient.
| We can take small steps. I think this goalpost results in
| obfuscation: because the bar is set so high, strong claims need
| to be made for these works to be published. So we have to ask
| ourselves the deeper question: "Why are we doing this?"[4]
|
| [0] This might seem backwards but the creation of the model
| implicitly claims that the test data and training data are
| segregated. "Show me this isn't in training" is a request for
| validation.
|
| [1] https://arxiv.org/abs/2303.08774
|
| [2] If you're interested, Meta put out work on semantic
| deduplication last year. They mostly focused on vision, but it
| still shows the importance of what's being argued here. It is
| probably easier to verify that images are semantically similar
| than sentences, since language is more abstract: pixels can be
| wildly different while the result is visually identical. How
| does this concept translate to language?
| https://arxiv.org/abs/2303.09540
|
| [3] https://math.stackexchange.com/questions/641809/what-
| specifi...
|
| [4] I think if our answer is just "to make money" (or anything
| semantically similar, like "increase share value"), then we are
| doomed to mediocrity and will stagnate. But if we're doing
| these things to better human lives, to understand the world and
| how things work (I'd argue building AI is that, even if a bit
| abstract), or to make useful and meaningful things, then the
| money will follow. I think that many of us, and many leading
| teams and businesses, have lost focus on the journey that has
| led to profits and are too focused on the end result. And I do
| not think this is isolated to CEOs; the same short-sighted
| thinking repeats all the way down the corporate ladder, from
| the manager focusing on what their bosses explicitly ask for
| (rather than the intent) to the employee who knows this is not
| the right thing to do but does it anyway (often because they
| know the manager will be unhappy; and this repeats all the way
| up). All life, business, technology, and creation have immense
| amounts of complexity to them, which we obviously want to
| simplify as much as possible. But when we hyper-focus on any
| set of rules, no matter how complex, we are doomed to fail,
| because the environment is always changing and we will never be
| able to adapt instantly (this is the nature of chaos, where
| small perturbations have large effects on the outcome). That
| doesn't mean we shouldn't try to make rules, but rather that
| rules are to be broken; it's just a matter of knowing when. In
| the end, this is an example of what it means to be able to
| reason. So we should be careful to ensure that we create AGI by
| making machines able to reason and think (to make them "more
| human") rather than by making humans into unthinking machines.
| I worry that the latter looks more likely, given that it is a
| much easier task to accomplish.
| ChicagoDave wrote:
| LLMs do not have the technology to iteratively solve a complex
| problem.
|
| This is a fact. No graph will change this.
|
| You want "reasoning," then you need to invent a new technology to
| iterate, validate, experiment, validate, query external
| expertise, and validate again. When we get that technology, then
| AI will become resilient in solving complex problems.
| sigmoid10 wrote:
| That's false. It has been shown that LLMs can perform e.g.
| gradient descent internally [1], which can explain why they are
| so good at few-shot prompting. The universal approximation
| theorem already tells us that a single hidden layer with enough
| neurons can approximate any continuous function, so it should
| come as no surprise that modern deep networks with many layers
| can perform iterative optimisations.
|
| [1] https://arxiv.org/abs/2212.10559
| jens-c wrote:
| Not an expert, but to me the paper reads like it was written _by_
| an LLM.
|
| The first paragraph of the introduction is fine, but then it
| kinda turns into gobbledygook...
| belter wrote:
| Comes from this company: https://www.tenyx.com/about-us
| magicalhippo wrote:
| I'm not into AI, but I like to watch from the sidelines. Here's
| my non-AI summary of the paper after glossing through
| (corrections appreciated):
|
| The multilayer perceptron (MLP)[1] layers used in modern neural
| networks, like LLMs, essentially partition the input into
| multiple regions. They show that the number of regions a single
| MLP layer can partition its input into depends exponentially on
| the intrinsic dimension[2] of that input. More regions means
| more approximation power for the MLP layer.
|
| Thus you can significantly increase the approximation power of
| an MLP layer without increasing the number of neurons, by
| essentially "distilling" the input to it.
|
| In the transformer architecture, the inputs to the MLP layers
| are the outputs of the self-attention layers[3]. The authors
| then show that the graph density of a self-attention layer
| correlates strongly with its intrinsic dimension. Thus a denser
| self-attention layer means the MLP can do a better job.
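|
| I'm guessing at the exact construction, but the "graph density"
| idea can be sketched roughly like this: treat every attention
| weight above some threshold as an edge between two tokens and
| count edges per token. (The threshold and normalization are my
| own placeholders, not necessarily the paper's definition.)
|
|   import numpy as np
|
|   def attention_graph_density(attn, threshold=0.05):
|       # attn: (n_tokens, n_tokens) row-softmaxed attention weights
|       edges = (attn > threshold).sum()  # links strong enough to count
|       return edges / attn.shape[0]      # edges per token
|
|   rng = np.random.default_rng(0)
|   logits = rng.normal(size=(16, 16))
|   attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
|   print("edges per token:", attention_graph_density(attn))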
|
| One way of increasing the density of the attention layers is to
| add more context. (Edited, see comment.) They show that
| prepending tokens as context to a question makes the LLM
| perform better when doing so increases the intrinsic dimension
| at the final layer.
|
| They also note that the transformer architecture is susceptible
| to compounding approximation errors, and that the much more
| precise partitioning provided by the MLP layers when fed with
| high intrinsic-dimensional input can help with this. However the
| impact of this on generalization remains to be explored further.
|
| If the results hold up it does seem like this paper provides nice
| insight into how to better optimize LLMs and similar neural
| networks.
|
| [1]: https://en.wikipedia.org/wiki/Multilayer_perceptron
|
| [2]: https://en.wikipedia.org/wiki/Intrinsic_dimension
|
| [3]:
| https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...
| Fripplebubby wrote:
| Awesome summarization by someone who read and actually
| understood the paper.
|
| > One way of increasing the density of the attention layers is
| to add more context. They show that simply prepending any token
| as context to a question makes the LLM perform better. Adding
| relevant context makes it even better.
|
| Right, I think a more intuitive way to think about this is to
| define density: the number of _edges_ in the self-attention
| graph connecting tokens. Or, a simpler explanation: the number
| of times a token had some connection to another token, divided
| by the number of tokens. So tokens which actually relate to one
| another and provide information are good; non sequitur tokens
| don't help. Except that you say:
|
| > They show that simply prepending any token as context to a
| question makes the LLM perform better.
|
| I think this is not quite right. What they found was:
|
| > pre-pending the question at hand with any type of token does
| increase the intrinsic dimension at the first layer
|
| > however, this increase is not necessarily correlated with the
| reasoning capability of the model
|
| but it is only
|
| > when the pre-pended tokens lead to an increase in the
| intrinsic dimension at the *final layer* of the model, the
| reasoning capabilities of the LLM improve significantly.
|
| (emphasis mine)
| magicalhippo wrote:
| Thanks, good catch, got distracted by the editing flaws at
| the end there (they rewrote a section without removing the
| old one).
| bastien2 wrote:
| You can't "enhance" from zero. LLMs _by design_ are not capable
| of reason.
|
| We can observe LLM-like behaviour in humans: all those
| reactionaries who just parrot whatever catchphrases mass media
| programmed into them. LLMs are just the computer version of that
| uncle who thinks Fox News is true and is the reason your nieces
| have to wear long pants at family gatherings.
|
| He doesn't understand the catchphrases he parrots any more than
| the chatbots do.
|
| Actual AI will require a kind of modelling that as yet does not
| exist.
| belter wrote:
| > LLMs by design are not capable of reason.
|
| It is not as clear cut. The argument is that the patterns they
| learn from text encode several layers of abstraction, one of
| them being _some_ reasoning, as it is encoded in the
| discourse.
| wizzwizz4 wrote:
| They are capable of picking up incredibly crude, noisy
| versions of first-order symbolic reasoning, and specific,
| commonly-used arguments, and the context for when those might
| be applied.
|
| Taken together and iterated, you get something vaguely
| resembling a reasoning algorithm, but your average
| schoolchild with an NLP library and regular expressions could
| make a better reasoning algorithm. (While I've been calling
| these "reasoning algorithms" for analogy's sake, they don't
| actually behave how we expect reasoning to behave.)
|
| The language model predicts what reasoning might look like.
| But it doesn't actually _do the reasoning_, so (unless it has
| something capable of reasoning to guide it), it's not going to
| correctly derive conclusions from premises.
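|
| In the spirit of that claim, here is a deliberately crude
| sketch of a rule engine that actually derives conclusions from
| premises rather than predicting what a derivation might look
| like (forward chaining over hand-written if/then rules; a toy,
| not a serious proposal):
|
|   # Apply if/then rules until no new facts appear.
|   rules = [
|       ({"socrates is a man"}, "socrates is mortal"),
|       ({"socrates is mortal", "mortals die"}, "socrates dies"),
|   ]
|   facts = {"socrates is a man", "mortals die"}
|
|   changed = True
|   while changed:
|       changed = False
|       for premises, conclusion in rules:
|           if premises <= facts and conclusion not in facts:
|               facts.add(conclusion)
|               changed = True
|
|   print(facts)  # includes "socrates dies" -- derived, not predicted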
| cowsaymoo wrote:
| The vocabulary used here doesn't have sufficient intrinsic
| dimension to partition the input into a low loss prediction.
| Improvement is promising with larger context or denser
| attention.
| bl0rg wrote:
| Can you explain what it means to reason about something? Since
| you are so confident I'm guessing you'll find it easy to come
| up with a non-contrived definition that'll clearly include
| humans and future "actual AI" but exclude LLMs.
| stoperaticless wrote:
| Not the parent, but there are a couple of things current AI
| lacks:
|
| - learning from a single article/book with lasting effect
| (accumulation of knowledge)
|
| - arithmetic without unexpected errors
|
| - gauging the reliability of the information it's producing
|
| BTW, I doubt you'll get a satisfactory definition of "able to
| reason" (or "conscious" or "alive" or "chair"), as these define
| an end or direction of a spectrum more than an exact cut-off
| point.
|
| Current LLMs are impressive and useful, but given how often
| they spout nonsense, it is hard to put them in the "able to
| reason" category.
| p1esk wrote:
| LLMs are trained to predict the next word in a sequence. As a
| result of this training they have developed reasoning
| abilities. Currently these reasoning abilities are roughly at
| human level, but next-gen models (GPT-5) should be superior to
| humans at any reasoning task.
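|
| For anyone newer to this, the training objective itself is
| simple to sketch: a toy bigram "model" that predicts the next
| word from counts (purely illustrative; real LLMs use neural
| networks over subword tokens, not count tables).
|
|   from collections import Counter, defaultdict
|
|   corpus = "the cat sat on the mat . the dog sat on the rug .".split()
|
|   # "Training": count which word follows which.
|   following = defaultdict(Counter)
|   for prev, nxt in zip(corpus, corpus[1:]):
|       following[prev][nxt] += 1
|
|   def predict_next(word):
|       # Most frequent continuation seen in training.
|       return following[word].most_common(1)[0][0]
|
|   print(predict_next("sat"))  # "on": both times "sat" was followed by "on"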
| Kiro wrote:
| Go look at the top comment of this thread:
| https://news.ycombinator.com/item?id=40900482
|
| That's the kind of stuff I want to see when opening a thread on
| HN, but most of the time we get shallow snark like yours
| instead. It's a shame.
___________________________________________________________________
(page generated 2024-07-07 23:00 UTC)