[HN Gopher] A new type of neural network is more interpretable
___________________________________________________________________
A new type of neural network is more interpretable
Author : pseudolus
Score : 157 points
Date : 2024-08-05 16:15 UTC (6 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| smusamashah wrote:
| > One downside of KANs is that they take longer per parameter to
| train--in part because they can't take advantage of GPUs. But
| they need fewer parameters. Liu notes that even if KANs don't
| replace giant CNNs and transformers for processing images and
| language, training time won't be an issue at the smaller scale of
| many physics problems.
|
| They don't even say that it might be possible to take advantage
| of GPUs in the future. Reads like a fundamental problem with
| these.
| scotty79 wrote:
| I wonder what's the issue ... GPUs can do very complex stuff
| raidicy wrote:
| From my limited understanding: No one has written GPU code
| for it yet.
| UncleOxidant wrote:
| There's not a lot of details there, but GPUs tend to not like
| code with a lot of branching. I'm guessing that's probably
| the issue.
| MattPalmer1086 wrote:
| I suspect it is because they have different activation
| functions on each edge, rather than using the same one over
| lots of data.
| XMPPwocky wrote:
| Are the activation functions truly different, or just
| different parameter values to one underlying function?
| hansvm wrote:
| The usual problem is that GPUs don't branch on instructions
| efficiently. The next most likely problem is that they don't
| branch on data efficiently. Ideas that fundamentally require
| either are hard to port efficiently.
|
| A simple example of something hard to port to a GPU is a deep
| (24 levels) binary tree with large leaf sizes (4kb). Particular
| trees can be optimized further, particular operations on trees
| might have further optimizations, and trees with nicer
| dimensionality might have tricks available, but solving that
| problem in the abstract is 32x slower on a GPU than "good"
| GPU problems. That's not a death knell, but it substantially
| narrows the set of conditions under which a GPU is a better
| fit than a CPU.
|
| Instruction branching, when required, is much worse: runtime
| is exponential.
|
| As far as KANs are concerned, the problem is more with data
| branching. Each spline computation requires its own set of
| data and is only used once. The math being done on the
| aggregate computations is non-negligible, but fast relative
| to the memory loads. You quickly enter a regime where (1)
| you're bottlenecked on RAM bandwidth, and (2) for a given RAM
| load you can't efficiently use the warp allocated to it.
|
| You can tweak the parameters a bit to alleviate that problem
| (smaller splines allow you to load and parallelize a few at
| once, larger ones allow you to do more work at once), but
| it's a big engineering challenge to fully utilize a GPU for
| that architecture. Your best bets are (1) observing something
| clever that allows you to represent the same result with
| different computations, and (2) a related idea: constructing a
| different KAN-inspired algorithm with similar expressivity
| that is more amenable to acceleration. My gut says (2) is more
| likely, but we'll see.
|
| More succinctly: The algorithm as written is not a good fit
| for the GPU primitives we have. It might be possible to
| bridge that gap, but that isn't guaranteed.
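|
| To make the data-branching point concrete, here is a toy sketch
| (not from the paper; the Gaussian basis is just a stand-in for a
| real B-spline, and all names are made up) of a naive KAN layer.
| Every edge loads its own K coefficients and uses them exactly
| once, instead of reusing one shared weight matrix:
|
|     import numpy as np
|
|     def kan_layer_naive(x, coeffs, grid):
|         # x      : (batch, n_in) inputs
|         # coeffs : (n_out, n_in, K) per-edge basis coefficients
|         # grid   : (K,) knot centers, shared here for simplicity
|         batch, n_in = x.shape
|         n_out, _, K = coeffs.shape
|         y = np.zeros((batch, n_out))
|         for o in range(n_out):
|             for i in range(n_in):
|                 # Each edge evaluates its own 1-D function; its
|                 # coefficients are loaded once and never reused.
|                 basis = np.exp(-(x[:, i:i+1] - grid[None, :]) ** 2)
|                 y[:, o] += basis @ coeffs[o, i]
|         return y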
| earthnail wrote:
| What about cards with higher memory bandwidth, like Groq's
| LPUs? Would that help with data branching?
| hansvm wrote:
| Data branching in general, no (pulling from 32 places is
| still 32x as expensive in that architecture, but you
| might be able to load bigger chunks in each place). For a
| KAN, a bit (it shifts the constants involved when I was
| talking about smaller vs bigger splines above -- sparsity
| and dropout will tend to push the GPU toward that
| worst case, though). You still have the problem that
| you're heavily underutilizing the GPU's compute.
| scotty79 wrote:
| What if instead of splines there were Fourier series or
| something like that? Would that be easier to infer and
| learn on GPU if it was somehow teachable?
|
| EDIT: FourierKAN exists https://arxiv.org/html/2406.01034v1
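|
| For intuition, a single FourierKAN-style edge might look like
| the minimal sketch below (names and shapes are mine, not the
| paper's): the spline basis is swapped for sines and cosines,
| i.e. dense trig evaluation plus a small matrix product.
|
|     import numpy as np
|
|     def fourier_edge(x, a, b):
|         # x    : (batch,) one scalar input feature
|         # a, b : (K,) learnable cosine / sine coefficients
|         k = np.arange(1, a.shape[0] + 1)   # frequencies 1..K
|         return (np.cos(np.outer(x, k)) @ a
|                 + np.sin(np.outer(x, k)) @ b)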
| hansvm wrote:
| Fourier subcomponents are definitely teachable in
| general. I'd expect FourierKAN to have a similar
| runtime to a normal KAN, only really benefitting from a
| GPU on datasets where you get better predictive
| performance than a normal KAN.
| johnsutor wrote:
| From the KAN repo itself, it appears they already have GPU
| support
| https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
| nickpsecurity wrote:
| I've seen neural nets combined with decision trees. There are a
| few ways to do such hybrids. One style essentially uses the
| accurate, GPU-trained networks to push the interpretable
| networks to higher accuracy.
|
| Do any of you think that can be done cost-effectively with
| KANs? Especially using pre-trained language models like
| Llama 3 to train the interpretable models?
| BenoitP wrote:
| I wonder if a set of learned functions (can|does) reproduce
| the truth tables from First Order Logic.
|
| I think it'd be easy to check.
|
| ----
|
| Anyway, that's great news for differentiability. For now, 'if'
| conditions expressed in JAX are tricky (at least for me), and are
| de facto an optimization barrier. If they're learnable and
| already built into the network, I'd say that's a great thing.
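|
| (For reference, the usual workaround today is jax.lax.cond or
| jnp.where, which stay differentiable but have to be written out
| explicitly; a minimal sketch, with made-up functions:)
|
|     import jax
|     import jax.numpy as jnp
|
|     def branchy(x):
|         # A Python `if` on a traced value fails under jit;
|         # lax.cond expresses the branch as data flow and stays
|         # differentiable end to end.
|         return jax.lax.cond(x.sum() > 0.0,
|                             lambda v: v * 2.0,   # predicate true
|                             lambda v: -v,        # predicate false
|                             x)
|
|     grad_fn = jax.grad(lambda x: branchy(x).sum())
|     print(grad_fn(jnp.array([1.0, -3.0])))       # [-1., -1.]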
| zeknife wrote:
| It is easy to construct an MLP that implements any basic logic
| function. But XOR requires at least one hidden layer.
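|
| A minimal sketch with hand-picked weights and a step activation
| (just to illustrate the point, not a trained network):
|
|     import numpy as np
|
|     def step(z):
|         return (z > 0).astype(float)
|
|     def xor_mlp(x1, x2):
|         # Hidden layer: h[0] fires on OR, h[1] fires on AND.
|         h = step(np.array([x1 + x2 - 0.5, x1 + x2 - 1.5]))
|         # Output: OR and not AND == XOR.
|         return step(h[0] - h[1] - 0.5)
|
|     for a in (0, 1):
|         for b in (0, 1):
|             print(a, b, int(xor_mlp(a, b)))   # 0, 1, 1, 0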
| yorwba wrote:
| Previous discussion of Kolmogorov-Arnold networks:
| https://news.ycombinator.com/item?id=40219205
| Bluestein wrote:
| (I am wondering if there might not be a perverse incentive _not_
| to improve on interpretability for major incumbents ...
|
| ... given how, what you can "see" (ie. have visibility into) is
| something that regulatory stakeholders can ask you to exercise
| control over, or for oversight or information about ...
|
| ... whereas a "black box" they have trained and control - but few
| understand - can perhaps give you "plausible deniability" of the
| "we don't know how it works either" type.-
| thomasahle wrote:
| KANs can be modeled as just another activation architecture in
| normal MLPs, which is of course not surprising, since they are
| very flexible. I made a chart of different types of architectures
| here: https://x.com/thomasahle/status/1796902311765434694
|
| Curiously, KANs are not very efficient when implemented with
| normal matrix multiplications in PyTorch, say. But with a custom
| CUDA kernel, or using torch.compile, they can be very fast:
| https://x.com/thomasahle/status/1798408687981297844
| byteknight wrote:
| Side question:
|
| Can people this deep in the field read that visualization with
| all the formulas and actually grok what's going on? I'm trying
| to understand just how far behind I am from the average math
| person (obviously very very very far, but quantifiable lol)
| Mc91 wrote:
| I'm not deep in the field at all, I did about four hours of
| Andrew Ng's deep learning course, and have played around a
| little bit with Pytorch and Python (although more to install
| LLMs and Stable Diffusion than to do Pytorch directly,
| although I did that a little too). I also did a little more
| reading and playing with it all, but not that much.
|
| Do I understand the Python? Somewhat. I know a relu is a
| rectified linear unit, which is a type of activation
| function. I have seen einsum before but forget what it is.
|
| For the classical diagram I know what the nodes, edges and
| weights are. I have some idea what the formulas do, but not
| totally.
|
| I'm unfamiliar with tensor diagrams.
|
| So I have very little knowledge of this field, and I have a
| decent grasp of some of what it means, a vague grasp on other
| parts, and tensor diagrams I have little to no familiarity
| with.
| thomasahle wrote:
| The tensor diagrams are not quite standard (yet). That's why
| I also include more "classical" neural network diagrams next
| to them.
|
| I've recently been working on a library for doing automatic
| manipulation and differentiation of tensor diagrams
| (https://github.com/thomasahle/tensorgrad), and to me they
| are clearly a cleaner notation.
|
| For a beautiful introduction to tensor networks, see also
| Jordan Taylor's blog post (https://www.lesswrong.com/posts/BQ
| KKQiBmc63fwjDrj/graphical-...)
| Krei-se wrote:
| You don't need to be better at math than you were in high
| school. AI is a chain of functions, and you differentiate
| through that chain to get the gradient of the loss function,
| which tells you which parameters to change to get a better
| result (simplified!).
|
| Now this structure of functions is different in each
| implementation, but the types of function are quite similar -
| even though a large model will combine billions of those
| nodes and weights. Those visualizations tell you, e.g., that
| some models connect neurons back to ones earlier in the chain
| to better remember a state. But the activation function is
| usually just a weight and a threshold.
|
| KAN changes the functions on the edges to more sophisticated
| ones than just "multiply by 0.x" and uses known physical
| formulas that you can actually explain to a human, instead of
| the result coming from 100x different weights which tell you
| nothing.
|
| The language models we use currently may map how your brain
| works, but how strongly the neurons are connected, and to
| which others, does not tell you anything. Instead a computer
| can chain different functions like you would chain a normal
| work task, explain each step to you, and combine those
| learned routines on different tasks.
|
| I am by no means an expert in this field, but I do a lot of
| category theory, especially because I wanted a more
| explainable neural network. So take my pov with a grain of
| salt, but please don't be discouraged from learning this. If
| you can program a little and remember some calculus, you can
| definitely grasp these concepts after learning the
| vocabulary!
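|
| As a concrete (and heavily simplified) illustration of the
| "chain of functions plus gradient" picture above, here is a toy
| fit of a single weight by gradient descent:
|
|     import numpy as np
|
|     x = np.array([1.0, 2.0, 3.0])
|     y = 2.0 * x                  # the "true" slope is 2
|
|     w, lr = 0.0, 0.05
|     for _ in range(200):
|         pred = w * x
|         loss = ((pred - y) ** 2).mean()
|         grad = (2 * (pred - y) * x).mean()   # d(loss)/dw, chain rule
|         w -= lr * grad                       # nudge w downhill
|     print(w)                                 # approximately 2.0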
| godelski wrote:
| > You don't need to be better at math than you were in high
| school.
|
| I'm very tired of this... it needs to stop as it literally
| hinders ML progress
|
| 1) I know one (ONE) person who took multivariate calculus
| in high school. They did so by going to the local community
| college. I know zero people who took linear algebra. I just
| checked the listing of my old high school. Over a decade
| later neither multivariate calculus nor linear algebra is
| offered.
|
| 2) There's something I like to tell my students:
| "You don't need math to train a good model, but you do need
| to know math to know why your model is wrong."
|
| I'm sure many here recognize the reference[0], but being
| able to make a model that performs successfully on a test
| set[1] is not always meaningful. For example, about a year
| ago I was working at a very big tech firm and increased their
| model's capacity on customer data by over 200% with a model
| that performed worse on their "test set". No additional
| data was used, nor did I make any changes to the
| architecture. Figure that out without math. (note, I was
| able to predict poor generalization performance PRIOR to my
| changes and accurately predict my model's significantly
| higher generalization performance)
|
| 3) Math isn't just writing calculations down. That's part
| of it -- a big part -- but the concepts are critical. And
| to truly understand those concepts, you at some point need
| to do these calculations. Because at the end of the day,
| math is a language[2].
|
| 4) Just because the simplified view is not mathematically
| intensive does not mean math isn't important nor does it
| mean there isn't extremely complex mathematics under the
| hood. You're only explaining the mathematics in a simple
| way that is only about the updating process. There's a lot
| more to ML. And this should obviously be true since we
| consider them "black boxes"[3]. A lack of interpretability
| is not due to an immutable law, but due to our lack of
| understanding of a highly complex system. Yes, maybe each
| action in that system is simple, but if that meant the
| system as a whole was simple then I welcome you to develop
| a TOE for physics. Emergence is useful but also a pain in
| the ass[4].
|
| [0] https://en.wikipedia.org/wiki/All_models_are_wrong
|
| [1] For one, this is more accurately called a validation
| set. Test sets are held out. No more tuning. You're done.
| This is self-referential to my point.
|
| [2] If you want to fight me on this, at least demonstrate
| to me you have taken an abstract algebra course and
| understand ideals and rings. Even better if axioms and set
| theory. I accept other positions, but too many argue from
| the basis of physics without understanding the difference
| between a physics model and physics itself. Just because math
| is the language of physics does not mean math (or even
| physics) is inherently an objective principle (physics is a
| model).
|
| [3] I hate this term. They are not black, but they are
| opaque. Which is to say that there is _some_ transparency.
|
| [4] I am using the term "emergence" in the way a physicist
| would, not what you've seen in an ML paper. Why? Well read
| point 4 again starting at footnote [3].
| Onavo wrote:
| > _1) I know one (ONE) person who took multivariate
| calculus in high school._
|
| Unless you are specifically dealing with intractable
| Bayesian integral problems, the multivariate calculus
| involved in NNs is primarily differentiation, not
| integration. The fun problems like boundary conditions
| and Stokes/Green that make up the meat of multivariable
| calculus don't truly apply when you are dealing with
| differentiation only. In other words, you only need the
| parts of calc 2/3 that can be taught in an afternoon, not
| the truly difficult parts.
|
| > _I'm sure many here recognize the reference[0], but
| being able to make a model that performs successfully on
| a test set[1] is not always meaningful. (sic) ...[2] If
| you want to fight me on this, at least demonstrate to me
| you have taken an abstract algebra course and understand
| ideals and rings. Even better if axioms and set theory._
|
| Doesn't matter, if it creates value, it is sufficiently
| correct for all intents and purposes. Pray tell me how
| discrete math and abstract algebra have anything to do
| with day-to-day ML research. If you want to appeal to
| physics sure, plenty of Ising models, energy functions,
| and belief propagation in ML but you have lost all
| credibility bringing up discrete math.
|
| Again those correlation tests you use to fact check your
| model are primarily linear frequentist models. Most
| statistics practitioners outside of graduate research
| will just be plugging formulas, not doing research level
| proofs.
|
| > _Just because the simplified view is not mathematically
| intensive does not mean math isn't important nor does it
| mean there isn't extremely complex mathematics under the
| hood. You're only explaining the mathematics in a simple
| way that is only about the updating process. There's a
| lot more to ML._
|
| Are you sure? The traditional linear algebra (and
| similar) models never (or rarely) outperformed neural
| networks, except perhaps on efficiency, absent hardware
| acceleration and all other things being equal. A flapping
| bird wing is beautiful from a bioengineering point of
| view but the aerospace industry is powered by dumb
| (mostly) static airfoils. Just because something is
| elegant doesn't mean it solves problems. A scaled-up CNN
| is about as boring as a NN can get, yet it beats the pants
| off all those traditional computer vision algorithms that
| I am sure contain way more "discrete math and abstract
| algebra".
|
| That being said, more knowledge is always a good thing,
| but I am not naive enough to believe that ML research can
| only be advanced by people with "mathematical maturity".
| It's still in the highly empirical stage where
| experimentation (regardless of whether it's guided by
| mathematical intuition) dominates. I have seen plenty of
| interesting ML results from folks who don't know what
| ELBOs and KL divergences are.
| danielmarkbruce wrote:
| Yes. But it's not difficult math in 99% of cases, it's just
| notation. It may as well be written in Japanese.
| kherud wrote:
| Interesting, thanks for sharing! Do you have an explanation or
| idea why compilation slows some architectures down?
| thomasahle wrote:
| Consider the function:
|
|     relu(np.outer(x, y)) @ z
|
| This takes n^2 time and memory in the naive implementation.
| But clearly, the memory could be reduced to O(n) with the
| right "fusing" of the operations.
|
| KANs are similar. This is the forward code for KANs:
|
|     x = einsum("bi,oik->boik", x, w1) + b1
|     x = einsum("boik,oik->bo", relu(x), w2) + b2
|
| This is the forward code for an Expansion / Inverse Bottleneck
| MLP:
|
|     x = einsum("bi,iok->bok", x, w1) + b1
|     x = einsum("bok,okp->bp", relu(x), w2) + b2
|
| Both take nd^2 time, but Inverse Bottleneck only takes nd
| memory. For KANs to match the memory usage, the two einsums
| must be fused.
|
| It's actually quite similar to flash-attention.
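|
| As a rough, self-contained sketch of the snippet above (shapes
| and names are made up; whether torch.compile actually fuses the
| intermediate depends on the backend):
|
|     import torch
|
|     def kan_style_forward(x, w1, b1, w2, b2):
|         # Eager PyTorch materializes the (batch, out, in, k)
|         # intermediate; a fused kernel can avoid that memory
|         # traffic.
|         h = torch.einsum("bi,oik->boik", x, w1) + b1
|         return torch.einsum("boik,oik->bo", torch.relu(h), w2) + b2
|
|     fused = torch.compile(kan_style_forward)
|
|     b, n_in, n_out, k = 32, 64, 64, 8
|     x = torch.randn(b, n_in)
|     w1 = torch.randn(n_out, n_in, k)
|     b1 = torch.randn(n_out, n_in, k)
|     w2 = torch.randn(n_out, n_in, k)
|     b2 = torch.randn(n_out)
|     print(fused(x, w1, b1, w2, b2).shape)   # torch.Size([32, 64])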
| godelski wrote:
| Which is to say, a big part is lack of optimization.
|
| Personally, I think this is fine in context: it is a new
| formulation, and the optimizations are neither easy nor
| obvious. It shouldn't be expected that every researcher can
| recognize and solve every optimization problem.
| jcims wrote:
| I can find descriptions at one level or another (eg RNN vs CNN)
| but is there a deeper kingdom/phylum/class type taxonomy of
| neural network architectures that can help a layman understand
| how they differ and how they align, ideally with specific
| references to contemporary ones in use or being researched?
|
| I don't know why I'm interested because I'm not planning to
| actually do any work in the space, but I always struggle to
| understand when some new architecture is announced if it's a
| fundamental shift or if it's an optimization.
| kens wrote:
| You might find "The neural network zoo" helpful; it's a chart
| showing the different types of neural networks, along with a
| brief discussion of each type:
| https://www.asimovinstitute.org/neural-network-zoo/
| jcims wrote:
| Perfect!!! Thank you!
| zygy wrote:
| Naive question: what's the intuition for how this is different
| from increasing the number of learnable parameters on a regular
| MLP?
| slashdave wrote:
| Orthogonality ensures that each weight has its own, individual
| importance. In a regular MLP, the weights are naturally
| correlated.
| Ameo wrote:
| I've tried out and written about[1] KANs on some small-scale
| modeling, comparing them to vanilla neural networks, as
| previously discussed here:
| https://news.ycombinator.com/item?id=40855028.
|
| My main finding was that KANs are very tricky to train compared
| to NNs. It's usually possible to get per-parameter loss roughly
| on par with NNs, but it requires a lot of hyperparameter tuning
| and extra tricks in the KAN architecture. In comparison, vanilla
| NNs were much easier to train and worked well under a much
| broader set of conditions.
|
| Some people commented that we've invested an incredible amount of
| effort into getting really good at training NNs efficiently, and
| many of the things in ML libraries (optimizers like Adam, for
| example) are designed and optimized specifically for NNs. For
| that reason, it's not really a good apples-to-apples comparison.
|
| I think there's definitely potential in KANs, but they aren't a
| magic bullet. I'm also a bit dubious about interpretability
| claims; the splines that are usually used for KANs don't really
| offer much more insight to me than just analyzing the output of a
| neuron in a lower layer of a NN.
|
| [1] https://cprimozic.net/blog/trying-out-kans/
| smus wrote:
| Not just the optimizers, but the initialization schemes for
| neural networks have been explicitly tuned for stable training
| of neural nets with traditional activation functions. I'm not
| sure as much work has gone into initialization for KANs.
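|
| (For context, "initialization scheme" here means things like
| Kaiming/He init, which scales random weights by the fan-in so
| activations keep roughly unit variance layer after layer; a
| minimal sketch:)
|
|     import numpy as np
|
|     def kaiming_init(fan_in, fan_out,
|                      rng=np.random.default_rng(0)):
|         # Variance 2/fan_in keeps relu activations from
|         # exploding or vanishing as depth grows.
|         return rng.normal(0.0, np.sqrt(2.0 / fan_in),
|                           size=(fan_in, fan_out))
|
|     w = kaiming_init(512, 512)
|     print(w.std())   # about sqrt(2/512) ~= 0.0625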
|
| I 100% agree with the idea that these won't be any more
| interpretable and I've never understood the argument that they
| would be. Sure, if the NN was a single neuron I can see it, but
| as soon as you start composing these things you lose all
| interpretability imo
| xg15 wrote:
| > _Then they could summarize the entire KAN in an intuitive one-
| line function (including all the component activation functions),
| in some cases perfectly reconstructing the physics function that
| created the dataset._
|
| The idea of KANs sounds really exciting, but just to nitpick, you
| could also write any traditional NN as a closed-form "one line"
| expression - the line will just become very very long. I don't
| see how the expression itself would become less complex if you
| used splines instead of weights (even if this resulted in fewer
| neurons for the same decision boundary).
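|
| (To make the nitpick concrete, a tiny sketch with toy weights,
| just to show that an ordinary MLP is also "one expression",
| merely an unenlightening one:)
|
|     import sympy as sp
|
|     x1, x2 = sp.symbols("x1 x2")
|     relu = lambda z: sp.Max(z, 0)
|
|     # A 2-2-1 MLP written out as a single closed-form expression.
|     h1 = relu(0.7 * x1 - 1.3 * x2 + 0.1)
|     h2 = relu(-0.4 * x1 + 0.9 * x2 - 0.2)
|     y = 1.5 * h1 - 2.0 * h2 + 0.3
|     print(y)   # one line, but it grows quickly with width/depth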
| asdfman123 wrote:
| Can someone ELI5 this for me?
|
| I understand how neural networks try to reduce their loss
| function to get the best result. But what's actually different
| about the KANs?
___________________________________________________________________
(page generated 2024-08-05 23:00 UTC)