[HN Gopher] Kolmogorov-Arnold networks may make neural networks ...
___________________________________________________________________
Kolmogorov-Arnold networks may make neural networks more
understandable
Author : isaacfrond
Score : 207 points
Date : 2024-09-12 10:14 UTC (12 hours ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| itsthecourier wrote:
| TL;DR: they are talking about KAN (Kolmogorov-Arnold networks)
| weberer wrote:
| Yeah. Thankfully, HN updated the title to be more descriptive.
| (Old title was "Novel Architecture Makes Neural Networks More
| Understandable")
| RustySpottedCat wrote:
| Can someone explain exactly what the "unknown" of neural
| networks is? We built them; we know what they're made of and how
| they work. Yes, we can't map out every single connection between
| nodes in this "multilayer perceptron", but don't we know how these
| connections are formed?
| taneq wrote:
| There's a ton of research going into analysing and reverse
| engineering NNs, this "they're mysterious black boxes and
| forever inscrutable" narrative is outdated.
| lupire wrote:
| We don't know what each connection means, what information is
| encoded in each weight. We don't know how it would behave
| differently if each of the million or trillion weights was
| changed.
|
| Compare this to a dictionary, where it's obvious what information
| is on each page and each line.
| wslh wrote:
| The brain serves as a useful analogy, even though LLMs are not
| brains. Just as we can't fully understand how we think by
| merely examining all of our neurons, understanding LLMs
| requires more than analyzing their individual components,
| though decoding LLMs is most likely easier, which doesn't mean
| easy.
| og_kalu wrote:
| SOTA LLMs like GPT-4o can natively understand b64-encoded text.
| Now, we have algorithms that can decode and encode b64 text. Is
| that what GPT-4o is doing? Did training learn that algorithm?
| Clearly not, or at least not completely, because typos in b64
| that would destroy any chance of our algorithms extracting
| meaning from the original text are barely an inconvenience for
| 4o.
|
| So how is it decoding b64 then? We have no idea.
|
| We don't build neural networks. Not really. We build
| architectures and then train them. Whatever they learn is
| outside the scope of human action beyond supplying the training
| data.
|
| What they learn is largely unknown beyond trivial toy examples.
|
| We know connections form, we can see the weights, we can even
| see the matrices multiplying. We don't know what any of those
| calculations are doing. We don't know what they mean.
|
| Would an alien understand C code just because he could see it
| executing?
| HarHarVeryFunny wrote:
| Base64 encoding is very simple - it just takes each 6 bits
| of the input and encodes (replaces) them as one of the 64
| (2^6) characters A-Za-z0-9+/. If the input is 8-bit ASCII
| text, then each 3 input characters will be encoded as 4
| Base64 characters (3 * 8 = 24 bits = 4 * 6-bit Base64
| chunks).
|
| So, this is very similar to an LLM having to deal with
| tokenized input, but instead of sequences of tokens
| representing words you've got sequences of Base64 characters
| representing words.
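|
| A minimal sketch of that 3-bytes-to-4-characters mapping, using
| Python's standard base64 module (toy input of my own):
|
|     import base64
|
|     plain = b"Hi!"                      # 3 bytes = 24 bits
|     encoded = base64.b64encode(plain)   # 4 characters, 6 bits each
|     print(encoded)                      # b'SGkh'
|     print(base64.b64decode(encoded))    # b'Hi!'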
| og_kalu wrote:
| It's not about how simple b64 is or isn't. In fact, I chose a
| simple problem we've already solved algorithmically on purpose.
| It's that everything you've just said, reasonable as it may
| sound, is entirely speculation.
|
| Maybe "no idea" was a bit much for this example, but any idea
| certainly didn't come from watching the matrices themselves fly.
| HarHarVeryFunny wrote:
| Huh? I just pointed out what Base64 encoding actually is
| - not some complex algorithm, but effectively just a
| tokenization scheme.
|
| This isn't speculation - I've implemented Base64
| decode/encode myself, and you can google for the
| definition if you don't believe I've accurately described
| it!
| og_kalu wrote:
| The speculation here is not about what b64 text is. It's
| about how the LLM has learnt to process it.
|
| Edit: Basically, for all anyone knows, it treats b64 as another
| language entirely, and decoding it is, inside the network, more
| akin to translating French than to the very simple swapping
| you've just described.
| HarHarVeryFunny wrote:
| LLMs, just like all modern neural nets, are trained via
| gradient descent which means following the most direct
| path (steepest gradient on the error surface) to reduce
| the error, with no more changes to weights once the error
| gradient is zero.
|
| Complexity builds upon simplicity, and the LLM will begin
| by noticing the direct (and repeated without variation)
| predictive relationship between Base64 encoded text and
| corresponding plain text in the training set. Having
| learnt this simple way to predict Base64
| decoding/encoding, there is simply no mechanism whereby
| it could change to a more complex "like translating
| French" way of doing it. Once the training process has
| discovered that Base64 text decoding can be PERFECTLY
| predicted by a simple mapping, then the training error
| will be zero and no more changes (unnecessary
| complexification) will take place.
| og_kalu wrote:
| Modern neural networks are by no means guaranteed to converge on
| _the_ simplest solution, and examples abound of NNs discovered to
| have learned weird, esoteric algorithms when simpler ones exist.
| The reason why is kind of obvious: the "simplest" solution (the
| one you're alluding to) is, from the perspective of training,
| simply whatever works best first.
|
| It's no secret that the order of data has an impact on what
| the network learns and how quickly; it's just not feasible to
| police that for these giant trillion-token datasets.
|
| If an NN learns a more complex solution that works
| perfectly for a less complex subset it meets later on,
| there is little pressure to switch to the simpler solution.
| Especially when the more complex solution might be more
| robust to any weird permutations it might meet on the
| internet. E.g. there is probably a simpler way to translate
| text that never has typos, and an LLM will never converge
| on it.
|
| Decoding/encoding b64 is not the first thing it will
| learn. It will first learn to predict it as it predicts
| any other language-carrying sequence. Then it will learn
| to translate it, most likely long after learning how to
| translate other languages. All of that will have some
| impact on the exact process it carries out with b64.
|
| And like I said, we already know for a fact it's not just
| doing naive substitution, because it can recover corrupted
| b64 text wholesale in a way our substitution algorithms
| cannot.
| drdeca wrote:
| Isn't the gradient descent used, stochastic gradient
| descent? I think that could matter a little bit.
|
| Also, when the base model is responding to base64 text, most
| of the time the next token is also part of the base64
| text, right? So presumably the first thing to learn would
| be, like, predicting how some base64 text continues,
| which, when the base64 text is an encoding of some ASCII
| text, seems like it would involve picking up on the
| patterns for that?
|
| I would think that there would be both those cases, and
| cases where the plaintext is present before or after.
| kevindamm wrote:
| That's not entirely true in the case of base64 because of
| how statistical patterns within natural languages work.
| For example, you can use frequency analysis to decrypt a
| monoalphabetic substitution cipher on pretty much any
| language if you have a frequency table for character
| n-grams of the language, even with small numbers for n.
| This is much shallower statistical processing than
| what's going on within an LLM, so I don't think many were
| surprised that a transformer stack and attention heads
| could decode base64. Especially if there were also
| examples of base64-encoding in the training data (even
| without parallel corpora for their encodings).
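|
| A minimal sketch of that kind of shallow statistic in Python
| (toy ciphertext of my own, a Caesar shift of 3):
|
|     from collections import Counter
|
|     ciphertext = "wkh txlfn eurzq ira mxpsv ryhu wkh odcb grj"
|     counts = Counter(c for c in ciphertext if c.isalpha())
|     print(counts.most_common(5))
|     # On enough English text, the top cipher symbols line up with
|     # e/t/a/o..., which is all you need to invert the substitution.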
|
| It doesn't explain higher level generalizations like
| being a transpiler between different programming
| languages that didn't have any side-by-side examples in
| the training data. Or giving an answer in the voice of
| some celebrity. Or being able to find entire rhyming word
| sequences across languages. These are probably more like
| the kind of unexplainable generalizations that you were
| referring to.
|
| I think it may be better to frame it in terms of accuracy
| vs precision. Many people can explain accurately what an
| LLM is doing under all those matrix multiplies, both
| during training and inference. But, precisely why an
| input leads to the resulting output is not explainable.
| Being able to do that would involve "seeing" the shape of
| the hypersurface of the entire language model, which as
| sibling commenters have mentioned is quite difficult even
| when aided by probing tools.
| mapt wrote:
| Our DNA didn't build our brain. Not really. Our DNA coded for
| a loose trainable architecture with a lot of features that
| result from emergent design, constraints of congenital
| development, et cetera. Even if you include our full exome, a
| bunch of environmental factors in your simulation, and are
| examining a human with obscenely detailed tools at autopsy,
| you're never going to be able to tell me with any certainty
| whether a given subject possesses the skill 'skateboarding'.
| drdeca wrote:
| I find this analogy kind of confusing? Wouldn't the
| analogous thing be to say that our DNA doesn't understand,
| uh, how we are able to skateboard? But like, we generally
| don't regard DNA as understanding anything, so that's not
| unexpected.
|
| Where does "we can't tell whether a person possesses the
| skill of 'skateboarding'?" fit in with DNA not encoding
| anything specific to skateboarding? It isn't as if we
| designed our genome and therefore if our genome did hard-
| code skateboarding skill that we would therefore (as
| designers of our genome) have full understanding of how
| skateboarding skill works at the neuron level.
|
| I recognize that a metaphor/analogy/whatever does not have
| to extend to all parts of something, and indeed most
| metaphors/analogies/whatever fail at some point if pushed
| too far. But, I don't understand how the commonalities you
| are pointing to between [NN architecture : full NN network
| with the specific weights] and [human genome : the whole
| behavior of a person's brain including all the facts,
| behaviors, etc. that they've learned throughout their life]
| are supposed to apply to the example of _knowing_that_ a
| person knows how to skateboard?
|
| It is quite possible that I'm being dense.
|
| Could you please elaborate on the analogy / the point you
| are making with the analogy?
| spencerchubb wrote:
| We know the process to train a model, but when a model makes a
| prediction we don't know exactly "how" it predicts the way it
| does.
|
| We can use the economy as an analogy. No single person really
| understands the whole supply chain. But we know that each
| person in the supply chain is trying to maximize their own
| profit, and that ultimately delivers goods and services to a
| consumer.
| Lerc wrote:
| We know how they are formed (and how to form them); we don't
| know why forming in that particular way solves the problem at
| hand.
|
| Even this characterization is not strictly valid anymore, there
| is a great deal of research into what's going on inside the
| black box. The problem was never that it was a black box (we can
| look inside at any time), but that it was hard to understand.
| KANs help some of that be placed into mathematical formulation.
| Generating mappings of activations over data similarly grants
| insight.
| mjburgess wrote:
| * Given the training data, and the architecture of the network,
| why does SGD with backprop find the given f? vs. any other of
| an infinite set.
|
| * Why is there a set of f, each with 0-loss, that work?
|
| * Given the weight space, and an f within it, why/when is a
| task/skill defined as a subset of that space covered by f?
|
| I think a major reason why these are hard to answer is that it's
| assumed NNs are operating within an inferential statistical
| context (i.e., reversing some latent structure in the data). But
| they're really bad at that. In my view, they are just
| representation-builders that find proxy representations in a
| proxy "task" space (def, approx., proxy = "shadow of some real
| structure, as captured in an unrelated space").
| _navierstokes wrote:
| Skipping some detail: the model applies many high-dimensional
| functions to the input, and we don't know the reasoning for why
| these functions solve the problem. Reducing the dimension of
| the weights to human-readable values is non-trivial, and
| multiple neurons interact in unpredictable ways.
|
| Interpretability research has resulted in many useful results
| and pretty visualizations[1][2], and there are many efforts to
| understand Transformers[3][4] but we're far from being able to
| completely explain the large models currently in use.
|
| [1] - https://distill.pub/2018/building-blocks/
|
| [2] - https://distill.pub/2019/activation-atlas/
|
| [3] - https://transformer-circuits.pub/
|
| [4] - https://arxiv.org/pdf/2407.02646
| xiaodai wrote:
| It doesn't; that's the problem.
| mansoor_ wrote:
| Not really. For a trivial function-fitting problem, a KAN will
| let you visualise the contribution of each basis function to the
| next layer of your network. Still, these trivial shallow networks
| are the ones nobody needs to introspect. A deep NN will not be
| explainable using this approach.
| Taikonerd wrote:
| Yeah. I'm not sure if anything with millions or billions of
| parameters will ever be "explainable" in the way we want.
|
| I mean, imagine a regular multivariable function with billions
| of terms, written out on a (very big) whiteboard. Are we ever
| really going to understand why it produces the numbers it does?
|
| KANs may have an order of magnitude fewer parameters, but the
| basic problem is still the same.
| afiori wrote:
| I found these articles very interesting in the context of
| future ways to understand LLM/AIs
|
| https://www.astralcodexten.com/p/the-road-to-honest-ai
|
| https://www.astralcodexten.com/p/god-help-us-lets-try-to-
| und...
| etiam wrote:
| Good points.
|
| Personally I'm still basically with Geoff Hinton's early
| conjecture that people will have to choose whether they want
| a model that's easy to explain or one that actually works as
| well as it could.
|
| I'd imagine the really big whiteboard would often be
| understandable in principle, but most people wouldn't be very
| satisfied at having the model go "Jolly good. Set aside the
| next 25 years in your calendar then, and tell me when you're
| ready to start on practicing the prerequisites!".
|
| On the other hand, one might question how often we really
| understand something complex ostensibly "explained" to us,
| rather than just gloss over real understanding. A lot of the
| time people seem to act as if they don't care about really
| knowing it, and just (hopefully!) want to get an inkling
| what's involved and make sure that the process could be
| demonstrated not to be seriously flawed.
|
| The models are being held to standards that are typically not
| applied to people nor to most traditional software. But sure,
| there are also some real issues about reliability, trust and
| bureaucratic certifications.
| scarmig wrote:
| I came across "Learning XOR: exploring the space of a classic
| problem" other day:
| https://www.maths.stir.ac.uk/~kjt/techreps/pdf/TR148.pdf
|
| Even something with three units and two inputs is nontrivial
| to understand on a deep level.
| crazygringo wrote:
| > _Are we ever really going to understand why it produces the
| numbers it does?_
|
| I would expect so, because we can categorize things
| hierarchically.
|
| A medium-sized library contains many billions of words, but
| even with just a Dewey decimal system and a card catalog you
| could find information relatively quickly.
|
| There's no inherent difficulty in understanding what a
| billion terms do, if you're able to just drill down using
| some basic hierarchies. It's just about finding the right
| algorithms to identify and describe the best set of
| hierarchies. Which is difficult, but there's no reason to
| think it won't be solvable in the near term.
| thesz wrote:
| KANs have an O(N^(-4)) scaling law, where N is the number of
| parameters. MLPs have O(N^(-1)) scaling or worse.
|
| Where you would need an MLP with tens of billions of parameters,
| you may need a KAN with only thousands.
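|
| As a rough reading of those exponents (ignoring the constants,
| which differ in practice): under loss ~ N^(-4), cutting the loss
| by 16x costs a 2x larger KAN (2^4 = 16), while under loss ~
| N^(-1) the same reduction costs a 16x larger MLP.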
| stefanpie wrote:
| The main author of KANs gave a tutorial session yesterday at
| MLCAD, an academic conference focused on the intersection of
| hardware / semiconductor design and ML / deep learning. It was
| super fascinating, and KANs seem really good for what they are
| advertised for: gaining insight and interpretability for physical
| systems (symbolic expressions, conserved quantities, symmetries).
| For science and mathematics this can be useful, but for
| engineering this might not be the main priority of an ML / deep
| learning approach (to some extent).
|
| There are still unknowns around learning hard tasks and learning
| capacity on harder problems. Even choices like the basis function
| used for the KAN "activations", and which other architectures
| these layers can be plugged into with some gain, are still
| unexplored. I think as people mess around with KANs we'll get
| better answers to these questions.
| notpublic wrote:
| Presentation by the same author made 2 months back:
|
| https://www.youtube.com/watch?v=FYYZZVV5vlY
| abhgh wrote:
| Is there a publicly available version of the session?
| light_hue_1 wrote:
| They cannot.
|
| Just because one internal operation is understandable, doesn't
| imply that the whole network is understandable.
|
| Take even something much simpler: decision trees. Textbooks give
| these as an example of understandable systems: a tree where you
| make one decision based on one feature at a time, then at the
| leaves you output something. Like a bunch of if statements. And
| in the 90s, when computers were slow and trees were small, this
| was true.
|
| Today massive decision trees and approaches like random forests
| can create trees with millions of nodes. Nothing is interpretable
| about them.
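|
| A minimal scikit-learn sketch (toy numbers of my own, assuming
| sklearn is installed) of how quickly the node counts blow up:
|
|     from sklearn.datasets import make_classification
|     from sklearn.ensemble import RandomForestClassifier
|
|     X, y = make_classification(n_samples=50_000, n_features=40,
|                                random_state=0)
|     forest = RandomForestClassifier(n_estimators=100,
|                                     random_state=0).fit(X, y)
|     # Total number of if-statement nodes across the ensemble:
|     print(sum(t.tree_.node_count for t in forest.estimators_))
|     # typically runs to hundreds of thousands of nodes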
|
| We have a basic math gap when it comes to understanding complex
| systems. Yet another network type solves nothing.
| ImHereToVote wrote:
| A formula or equation that enables you to reason about complex
| systems might simply not exist. It could very well be that
| reasoning about complexity forces you to actually do the
| complexity.
| empath75 wrote:
| Even extremely complicated decision trees are interpretable to
| some extent because you can just walk through the tree and
| answer questions like: "If this had not been true, would the
| result have been different?". It may not be possible to hold
| the entire tree in your head at once, but it's certainly
| possible to investigate the tree as needed to understand the
| path that was taken through it.
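|
| A minimal sketch (mine, using scikit-learn) of "walking the tree
| as needed" for a single prediction:
|
|     from sklearn.datasets import load_iris
|     from sklearn.tree import DecisionTreeClassifier
|
|     X, y = load_iris(return_X_y=True)
|     clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
|
|     sample = X[:1]
|     for node in clf.decision_path(sample).indices:
|         feat = clf.tree_.feature[node]
|         if feat >= 0:  # internal node; leaves are marked with -2
|             thresh = clf.tree_.threshold[node]
|             went_left = sample[0, feat] <= thresh
|             print(f"x[{feat}]={sample[0, feat]:.2f} "
|                   f"{'<=' if went_left else '>'} {thresh:.2f}")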
| svboese wrote:
| But couldn't the same be said about standard MLPs or NNs in
| general?
| empath75 wrote:
| _Sometimes_, and people do find features in neural networks
| by tweaking stuff and seeing how the neurons activate, but
| in general, no. Any given weight or layer or perceptron or
| whatever can be reused for multiple purposes and it's
| extremely difficult to say "this is responsible for that",
| and if you do find parts of the network responsible for a
| particular task, you don't know if it's _also_ responsible
| for something else. Whereas with a decision tree it's
| pretty simple to trace causality and tweak things without
| changing unrelated parts of the tree. Changing weights in a
| neural network leads to unpredictable results.
| tomhallett wrote:
| If a KAN has multiple layers, would tweaking the
| equations of a KAN be more similar to tweaking the
| weights in an MLP/NN, or more similar to tweaking a
| decision tree?
|
| EDIT: I gave the above thread (light_hue_1 > empath75 >
| svboese > empath75) to chatgpt and had it write a
| question to learn more, and it gave me "How do KAN
| networks compare to decision trees or neural networks
| when it comes to tracing causality and making
| interpretability more accessible, especially in large,
| complex models?". Either shows me and ai are on the right
| track, or i'm as dumb as a statistical token guessing
| machine....
|
| https://imgur.com/3dSNZrG
| Scene_Cast2 wrote:
| LIME (local linear approximation basically) is one popular
| technique to do so. Still has flaws (such as not being
| close to a decision boundary).
| pkage wrote:
| LIME and other post-hoc explanatory techniques (deepshap,
| etc.) only give an explanation for a singular inference,
| but aren't helpful for the model as a whole. In other
| words, you can make a reasonable guess as to why a
| specific prediction was made but you have no idea how the
| model will behave in the general case, even on similar
| inputs.
| Narhem wrote:
| The purpose of post-prediction explanations would be to
| increase a practitioner's confidence in using said
| inference.
|
| It's a disconnect between finding a real-life "AI" and
| trying to find something which works and which you can
| place a degree of trust in.
| ljosifov wrote:
| You are right and IDK why you are downvoted. Few units of
| perceptrons, few nodes in a decision tree, few of anything
| - they are "interpretable". Billions of the sames - are not
| interpretable any more. This b/c our understanding of
| "interpretable" is "an array of symbols that can fit a page
| or a white board". But there is no reason to think that all
| the rules of our world would be such that they can be
| expressed that way. Some maybe, others maybe not.
| Interpretable is another platitudinous term that seems
| appealing at 1st sight, only to be found to not be that
| great after all. We humans are not interpretable, we can't
| explain how we come up with the actions we take, yet we
| don't say "now don't move, do nothing, until you are
| interpretable". So - much ado about little.
| t_mann wrote:
| I think of it as "Could Newton have used this to find the
| expressions for the forces he was analyzing (eg gravitational
| force = G m_1 m_2 / d^2)?". I once asked a physics prof whether
| that was conceivable in principle, and he said yes. It seems to
| me like KANs should be able to find expressions like these
| given experimental data. If that was true, then I don't see how
| that wouldn't deserve being called interpretability.
| fjkdlsjflkds wrote:
| > It seems to me like KANs should be able to find expressions
| like these given experimental data.
|
| Perhaps, but this is not something unique to KANs: any
| symbolic regression method can (at least in theory) find such
| simple expressions. Here is an example of such type of work
| (using non-KAN neural networks):
| https://www.science.org/doi/10.1126/sciadv.aay2631
|
| Rephrasing: just because you can reach simple expressions
| with symbolic regression methods based on neural networks (or
| KANs) does not necessarily imply that neural networks (or
| KANs) are inherently interpretable (particularly once you
| start stacking multiple layers).
| nathan_compton wrote:
| Just giving the force law hardly counts as interpretability.
| You probably know that the 1/r^2 in the force law comes from
| the dimensionality of space. That is the interpretation.
| baq wrote:
| yeah. you can run SHAP[0] on your xgboosted trees, results are
| kinda interesting, but it doesn't actually explain anything
| IME.
|
| [0] https://shap.readthedocs.io/en/latest/index.html
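|
| For context, the kind of call being described is roughly (a
| sketch, assuming shap and xgboost are installed; not a claim
| about any particular model):
|
|     import shap, xgboost
|     from sklearn.datasets import make_classification
|
|     X, y = make_classification(n_samples=2000, n_features=10,
|                                random_state=0)
|     model = xgboost.XGBClassifier(n_estimators=200).fit(X, y)
|     explainer = shap.TreeExplainer(model)
|     shap_values = explainer.shap_values(X)  # per-feature attributions
|     shap.summary_plot(shap_values, X)       # the usual overview plot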
| cubefox wrote:
| No wonder. "Shapley values" have the problem that they assume
| all necessary conditions are equally important. Say a
| successful surgery needs both a surgeon and a nurse,
| otherwise the patient dies. Shapley values will then assume
| that both have contributed equally to the successful surgery.
| Which isn't true, because surgeons are much less available
| (less replaceable) than nurses. If the nurse gets ill, a
| different nurse could probably do the task, while if the
| surgeon gets ill, the surgery may well have to be postponed.
| So the surgeons are more important for (contribute more to) a
| successful surgery.
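|
| The surgeon/nurse game worked out explicitly (a toy computation,
| my numbers): v({}) = 0, v({surgeon}) = 0, v({nurse}) = 0,
| v({surgeon, nurse}) = 1.
|
|     from itertools import permutations
|
|     players = ("surgeon", "nurse")
|     def v(coalition):  # surgery succeeds only if both are present
|         return 1.0 if set(coalition) == set(players) else 0.0
|
|     shapley = {p: 0.0 for p in players}
|     orders = list(permutations(players))
|     for order in orders:
|         seen = []
|         for p in order:
|             before = v(seen)
|             seen.append(p)
|             shapley[p] += (v(seen) - before) / len(orders)
|     print(shapley)  # {'surgeon': 0.5, 'nurse': 0.5}
|     # Strict joint necessity splits credit evenly, regardless of
|     # how replaceable each player is -- which is the objection above.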
| adammarples wrote:
| Clearly both are equally important, 100% necessary. This
| doesn't account for rarity, nor does it account for wages,
| agreeability, smell or any of the other things it isn't
| trying to measure. You'll need a different metric for that
| and if you want to take both into account you should.
| cubefox wrote:
| Shapley values try to measure importance of
| contributions, and for this, bare necessity isn't a
| sufficient indicator. I think it comes down to
| probability. The task of the surgeon is, from a prior
| perspective, less likely to be fulfilled because it is
| harder to get hold of a surgeon.
|
| Similarly: What was the main cause of the match
| getting lit? The match being struck? Or the atmosphere
| containing oxygen? Both are necessary in the sense that
| if either hadn't occurred the match wouldn't be lit. But
| it seems clear that the main cause was the match being
| struck, because matches being struck is relatively rare,
| and hence unlikely, while the atmosphere contains oxygen
| pretty much always.
|
| So I think the contributions calculated for Shapley
| values should be weighted by the inverse of their prior
| probabilities. Though it is possible that such
| probabilities are not typically available in the machine
| learning context in which SHAP operates.
| empath75 wrote:
| I have a question, which might not even be related to this -- one
| of the keys to the power of neural networks is exploiting the
| massive parallelism enabled by GPUs, but are we leaving some
| compute on the table by using just scalar weights? What if,
| instead of matrices of weights, they were matrices of functions?
| mglz wrote:
| GPUs are optimized for matrices of floating point values, so
| current neural networks use this as a basis (with matrices
| containing the scalar weights).
| immibis wrote:
| Each row/column (I always forget which way around matrices go)
| of weights followed by a nonlinearity is a learnable function.
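|
| A one-liner version of that (toy numpy sketch, shapes of my own):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
|     f = lambda x: np.maximum(W @ x + b, 0.0)  # one learnable function R^3 -> R^4
|     print(f(np.array([1.0, -0.5, 2.0])))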
| dahart wrote:
| The way to think about NNs is that they are already made of
| functions; groups of layered nodes become complex nonlinear
| functions. For example, a small 3-layer network can learn to
| model a cubic spline function. The internals of the function
| are learned at every step of the way, every addition and
| multiplication. You can assume the number of functions in a
| network is a fraction of the number of weights. This makes the
| NN theoretically more flexible and powerful than modeling it
| using more complex functions, because it learns and adapts each
| and every function during training.
|
| I would assume it's possible that using certain functions to,
| say, stand in for a small fixed-function MLP could result in more
| efficient training, if we know the right functions to use. But
| you could end up losing perf too if not careful. I'd guess the
| main problems are that we don't know what functions to use, and
| adding nonlinear functions might come with added difficulty
| wrt performance and precision and new modes of initialization
| and normalization. Linear math is easy and powerful and already
| capable of modeling complex functions, but nonlinear math might
| be useful I'd guess... needs more study! ;)
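|
| A minimal PyTorch sketch (toy of my own, assuming torch is
| installed) of that cubic example -- a small stack of linear+tanh
| layers fitting y = x^3:
|
|     import torch
|     import torch.nn as nn
|
|     x = torch.linspace(-2, 2, 256).unsqueeze(1)
|     y = x ** 3
|     net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
|                         nn.Linear(32, 32), nn.Tanh(),
|                         nn.Linear(32, 1))
|     opt = torch.optim.Adam(net.parameters(), lr=1e-2)
|     for _ in range(2000):
|         opt.zero_grad()
|         loss = ((net(x) - y) ** 2).mean()
|         loss.backward()
|         opt.step()
|     print(loss.item())  # small: the learned sub-functions compose into ~x^3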
| ocular-rockular wrote:
| What you're describing is very similar to deep Gaussian
| processes.
| esafak wrote:
| Recently discussed in
| https://news.ycombinator.com/item?id=40219205
| IWeldMelons wrote:
| Fad.
| CamperBob2 wrote:
| What evidence would change your mind?
| throwaway2562 wrote:
| The point of interpretability in scientific applications is
| symbolic regression - MLPs cannot always spit out an equation for
| some data set; KANs can.
| buildbot wrote:
| I thought that MLPs are universal function approximators?
| https://en.wikipedia.org/wiki/Universal_approximation_theore...
| js8 wrote:
| I don't know what KANs are, but from the informal description in
| the article ("turn a function of many variables into many
| functions of a single variable"), it sounds reminiscent of lambda
| calculus.
| samus wrote:
| Nope, that's just currying and/or partial application.
| triclops200 wrote:
| The (semi-)automatic simplification algorithm provided in the
| paper for KANs seems, to me, like it's solving a similar problem
| to https://arxiv.org/pdf/2112.04035, but with the additional
| constraint of forward functional interpretability as the goal
| instead of just a generalized abstraction compressor.
___________________________________________________________________
(page generated 2024-09-12 23:00 UTC)