[HN Gopher] A new type of neural network is more interpretable
       ___________________________________________________________________
        
       A new type of neural network is more interpretable
        
       Author : pseudolus
       Score  : 157 points
       Date   : 2024-08-05 16:15 UTC (6 hours ago)
        
 (HTM) web link (spectrum.ieee.org)
 (TXT) w3m dump (spectrum.ieee.org)
        
       | smusamashah wrote:
       | > One downside of KANs is that they take longer per parameter to
       | train--in part because they can't take advantage of GPUs. But
       | they need fewer parameters. Liu notes that even if KANs don't
       | replace giant CNNs and transformers for processing images and
       | language, training time won't be an issue at the smaller scale of
       | many physics problems.
       | 
        | They don't even say that it might be possible to take advantage
        | of GPUs in the future. That reads like a fundamental problem
        | with these.
        
         | scotty79 wrote:
         | I wonder what's the issue ... GPUs can do very complex stuff
        
           | raidicy wrote:
           | From my limited understanding: No one has written GPU code
           | for it yet.
        
           | UncleOxidant wrote:
           | There's not a lot of details there, but GPUs tend to not like
           | code with a lot of branching. I'm guessing that's probably
           | the issue.
        
           | MattPalmer1086 wrote:
           | I suspect it is because they have different activation
           | functions on each edge, rather than using the same one over
           | lots of data.
        
             | XMPPwocky wrote:
             | Are the activation functions truly different, or just
             | different parameter values to one underlying function?
        
           | hansvm wrote:
           | A usual problem is that GPUs don't branch on instructions
           | efficiently. A next most likely problem is that they don't
           | branch on data efficiently. Ideas fundamentally requiring the
           | former or the latter are hard to port efficiently.
           | 
           | A simple example of something hard to port to a GPU is a deep
           | (24 lvls) binary tree with large leaf sizes (4kb). Particular
           | trees can be optimized further, particular operations on
           | trees might have further optimizations, and trees with nicer
           | dimensionality might have tricks available, but solving that
            | problem in the abstract is 32x slower on a GPU than "good"
           | GPU problems. That's not a death knell, but it cuts down
           | substantially the constraints which would make a GPU a better
           | fit than a CPU.
           | 
           | Instruction branching is much worse, when required. Runtime
           | is exponential.
           | 
           | As far as KANs are concerned, the problem is more with data
           | branching. Each spline computation requires its own set of
           | data and is only used once. The math being done on the
           | aggregate computations is non-negligible, but fast relative
           | to the memory loads. You quickly enter a regime where (1)
           | you're bottlenecked on RAM bandwidth, and (2) for a given RAM
           | load you can't efficiently use the warp allocated to it.
           | 
           | You can tweak the parameters a bit to alleviate that problem
           | (smaller splines allow you to load and parallelize a few at
           | once, larger ones allow you to do more work at once), but
           | it's a big engineering challenge to fully utilize a GPU for
           | that architecture. Your best bets are (1) observing something
           | clever allowing you to represent the same result with
           | different computations, and (2) a related idea, construct a
           | different KAN-inspired algorithm with similar expressivity
           | and more amenable to acceleration. My gut says (2) is more
           | likely, but we'll see.
           | 
           | More succinctly: The algorithm as written is not a good fit
           | for the GPU primitives we have. It might be possible to
           | bridge that gap, but that isn't guaranteed.
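            | 
            | To make the data-branching point concrete, here's a toy numpy
            | sketch (my own illustration, not the pykan code) of a KAN-
            | style layer where every edge (i -> o) carries its own
            | coefficient vector. The whole coeffs tensor gets touched once
            | per forward pass with almost no reuse, which is where the
            | memory-bandwidth bottleneck comes from:
            | 
            |     import numpy as np
            | 
            |     def kan_layer(x, coeffs, grid):
            |         # x: (n_in,), coeffs: (n_out, n_in, n_basis)
            |         # stand-in radial basis instead of B-splines, just to
            |         # keep the sketch short
            |         B = np.exp(-(x[:, None] - grid[None, :]) ** 2)
            |         # each output gathers its own per-edge coefficients;
            |         # there is no shared weight matrix to reuse
            |         return np.einsum("oik,ik->o", coeffs, B)
            | 
            |     x = np.random.randn(64)
            |     coeffs = np.random.randn(128, 64, 8)  # one block per edge
            |     out = kan_layer(x, coeffs, np.linspace(-2, 2, 8))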
        
             | earthnail wrote:
             | What about cards with higher memory bandwidth, like Groq's
             | LPUs? Would that help with data branching?
        
               | hansvm wrote:
               | Data branching in general, no (pulling from 32 places is
               | still 32x as expensive in that architecture, but you
               | might be able to load bigger chunks in each place). For a
               | KAN, a bit (it shifts the constants involved when I was
               | talking about smaller vs bigger splines above -- sparsity
                | and dropout will tend to push the GPU toward that
                | worst case, though). You still have the problem that
               | you're heavily underutilizing the GPU's compute.
        
             | scotty79 wrote:
             | What if instead of splines there were Fourier series or
             | something like that? Would that be easier to infer and
             | learn on GPU if it was somehow teachable?
             | 
             | EDIT: FourierKAN exists https://arxiv.org/html/2406.01034v1
        
               | hansvm wrote:
               | Fourier subcomponents are definitely teachable in
                | general. I'd expect FourierKAN to have a similar
                | runtime to a normal KAN, only really benefitting from a
               | GPU on datasets where you get better predictive
               | performance than a normal KAN.
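                | 
                | Roughly, the idea (my paraphrase of the linked paper,
                | not its code) is to swap the per-edge spline basis for
                | a fixed sin/cos basis with learnable coefficients:
                | 
                |     import torch
                | 
                |     def fourier_edge(x, a, b):
                |         # x: (batch,), a, b: (K,) learnable coefficients
                |         k = torch.arange(1, a.numel() + 1, dtype=x.dtype)
                |         return (a * torch.cos(k * x[:, None])
                |                 + b * torch.sin(k * x[:, None])).sum(-1)
                | 
                | The per-edge memory-access pattern stays essentially the
                | same, which is why I'd expect the runtime to be similar.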
        
         | johnsutor wrote:
         | From the KAN repo itself, it appears they already have GPU
         | support
         | https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
        
         | nickpsecurity wrote:
         | I've seen neural nets combined with decision trees. There's a
         | few ways to do such hybrids. One style essentially uses the
         | accurate, GPU-trained networks to push the interpretable
         | networks to higher accuracy.
         | 
          | Do any of you think that can be done cost-effectively with
          | KANs? Especially using pre-trained language models like
          | LlaMa-3 to train the interpretable models?
        
       | BenoitP wrote:
        | I wonder if a set of learned functions (can|does) reproduce the
       | truth tables from First Order Logic.
       | 
       | I think it'd be easy to check.
       | 
       | ----
       | 
        | Anyway, that's great news for differentiability. For now, 'if'
        | conditions expressed in JAX are tricky (at least for me), and are
        | de facto an optimization barrier. If they're learnable and
        | already baked into the network, I'd say that's a great thing.
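        | 
        | (For reference, a minimal sketch of what makes 'if' awkward
        | under jit: a plain Python branch on a traced value fails, so
        | data-dependent branches have to go through something like
        | lax.cond or jnp.where instead.)
        | 
        |     import jax
        | 
        |     @jax.jit
        |     def f(x):
        |         # `if x > 0:` would raise a tracer error here;
        |         # lax.cond evaluates a functional conditional instead
        |         return jax.lax.cond(x > 0, lambda v: v ** 2,
        |                             lambda v: -v, x)
        | 
        |     print(f(2.0), f(-2.0))  # 4.0 2.0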
        
         | zeknife wrote:
         | It is easy to construct an MLP that implements any basic logic
         | function. But XOR requires at least one hidden layer.
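          | 
          | For instance (one of many possible weight choices), two ReLU
          | hidden units are enough for XOR:
          | 
          |     def xor_mlp(x1, x2):
          |         relu = lambda v: max(v, 0)
          |         h1 = relu(x1 + x2)        # hidden unit 1
          |         h2 = relu(x1 + x2 - 1)    # hidden unit 2
          |         return h1 - 2 * h2        # output: 0, 1, 1, 0
          | 
          |     for a in (0, 1):
          |         for b in (0, 1):
          |             print(a, b, xor_mlp(a, b))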
        
       | yorwba wrote:
       | Previous discussion of Kolmogorov-Arnold networks:
       | https://news.ycombinator.com/item?id=40219205
        
       | Bluestein wrote:
       | (I am wondering if there might not be a perverse incentive _not_
       | to improve on interpretability for major incumbents ...
       | 
       | ... given how, what you can  "see" (ie. have visibility into) is
       | something that regulatory stakeholders can ask you to exercise
       | control over, or for oversight or information about ...
       | 
       | ... whereas a "black box" they have trained and control - but few
       | understand - can perhaps give you "plausible deniability" of the
       | "we don't know how it works either" type.-
        
       | thomasahle wrote:
       | KANs can be modeled as just another activation architecture in
       | normal MLPs, which is of course not surprising, since they are
       | very flexible. I made a chart of different types of architectures
       | here: https://x.com/thomasahle/status/1796902311765434694
       | 
        | Curiously, KANs are not very efficient when implemented with
        | normal matrix multiplications in PyTorch, say. But with a custom
        | CUDA kernel, or using torch.compile, they can be very fast:
       | https://x.com/thomasahle/status/1798408687981297844
        
         | byteknight wrote:
         | Side question:
         | 
         | Can people this deep in the field read that visualization with
         | all the formulas and actually grok what's going on? I'm trying
         | to understand just how far behind I am from the average math
         | person (obviously very very very far, but quantifiable lol)
        
           | Mc91 wrote:
           | I'm not deep in the field at all, I did about four hours of
           | Andrew Ng's deep learning course, and have played around a
           | little bit with Pytorch and Python (although more to install
           | LLMs and Stable Diffusion than to do Pytorch directly,
            | though I did that a little too). I also did a little more
           | reading and playing with it all, but not that much.
           | 
           | Do I understand the Python? Somewhat. I know a relu is a
           | rectified linear unit, which is a type of activation
           | function. I have seen einsum before but forget what it is.
           | 
           | For the classical diagram I know what the nodes, edges and
           | weights are. I have some idea what the formulas do, but not
           | totally.
           | 
           | I'm unfamiliar with tensor diagrams.
           | 
           | So I have very little knowledge of this field, and I have a
           | decent grasp of some of what it means, a vague grasp on other
           | parts, and tensor diagrams I have little to no familiarity
           | with.
        
           | thomasahle wrote:
           | The tensor diagrams are not quite standard (yet). That's why
           | I also include more "classical" neural network diagrams next
           | to them.
           | 
           | I've recently been working on a library for doing automatic
           | manipulation and differentiation of tensor diagrams
           | (https://github.com/thomasahle/tensorgrad), and to me they
           | are clearly a cleaner notation.
           | 
           | For a beautiful introduction to tensor networks, see also
           | Jordan Taylor's blog post (https://www.lesswrong.com/posts/BQ
           | KKQiBmc63fwjDrj/graphical-...)
        
           | Krei-se wrote:
            | You don't need to be better at math than you were in high
            | school. An AI model is a chain of functions, and you
            | differentiate through it to get the gradient of the loss
            | function, which tells you which parameters to change to get
            | a better result (simplified!).
           | 
            | Now this structure of functions is different in each
            | implementation, but the types of functions are quite similar
            | - even though a large model will combine billions of those
            | nodes and weights. Those visualizations tell you, for
            | example, that some models connect neurons back to ones
            | earlier in the chain to better remember a state. But the
            | activation is usually just a weighted sum pushed through a
            | threshold-like function.
           | 
            | KANs change the functions on the edges to more sophisticated
            | ones than just "multiply by 0.x", and can recover known
            | physical formulas that you can actually explain to a human,
            | instead of the result coming from 100x different weights
            | which tell you nothing.
           | 
            | The language models we use currently may map how your brain
            | works, but how strongly the neurons are connected, and to
            | which others, does not tell you anything. Instead, a
            | computer can chain different functions like you would chain
            | the steps of a normal work task, explain each step to you,
            | and combine those learned routines across different tasks.
           | 
            | I am by no means an expert in this field, but I do a lot of
            | category theory, largely because I wanted a more explainable
            | neural network. So take my pov with a grain of salt, but
            | please don't be discouraged from learning this. If you can
            | program a little and remember some calculus, you can
            | definitely grasp these concepts after learning the
            | vocabulary!
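            | 
            | A minimal PyTorch sketch of that loop (illustrative numbers,
            | nothing specific to any real model): a tiny chain of
            | functions w * x + b, a loss, and gradients telling us which
            | way to nudge each parameter:
            | 
            |     import torch
            | 
            |     w = torch.randn(1, requires_grad=True)
            |     b = torch.zeros(1, requires_grad=True)
            |     x = torch.linspace(-1, 1, 100)
            |     y = 3 * x + 0.5                  # target to recover
            | 
            |     for _ in range(200):
            |         loss = ((w * x + b - y) ** 2).mean()
            |         loss.backward()              # d(loss)/dw, d(loss)/db
            |         with torch.no_grad():
            |             w -= 0.1 * w.grad
            |             b -= 0.1 * b.grad
            |             w.grad.zero_()
            |             b.grad.zero_()
            | 
            |     print(w.item(), b.item())        # close to 3.0 and 0.5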
        
             | godelski wrote:
             | > You don't need to be more good in math than in high
             | school.
             | 
             | I'm very tired of this... it needs to stop as it literally
             | hinders ML progress
             | 
             | 1) I know one (ONE) person who took multivariate calculus
             | in high school. They did so by going to the local community
             | college. I know zero people who took linear algebra. I just
             | checked the listing of my old high school. Over a decade
             | later neither multivariate calculus nor linear algebra is
             | offered.
             | 
              | 2) There's something I like to tell my students:
              | "You don't need math to train a good model, but you do
              | need to know math to know why your model is wrong."
             | 
             | I'm sure many here recognize the reference[0], but being
             | able to make a model that performs successfully on a test
             | set[1] is not always meaningful. For example, about a year
              | ago I was working at a very big tech firm and increased their
             | model's capacity on customer data by over 200% with a model
             | that performed worse on their "test set". No additional
             | data was used, nor did I make any changes to the
             | architecture. Figure that out without math. (note, I was
             | able to predict poor generalization performance PRIOR to my
             | changes and accurately predict my model's significantly
             | higher generalization performance)
             | 
             | 3) Math isn't just writing calculations down. That's part
             | of it -- a big part -- but the concepts are critical. And
             | to truly understand those concepts, you at some point need
             | to do these calculations. Because at the end of the day,
             | math is a language[2].
             | 
             | 4) Just because the simplified view is not mathematically
             | intensive does not mean math isn't important nor does it
             | mean there isn't extremely complex mathematics under the
             | hood. You're only explaining the mathematics in a simple
             | way that is only about the updating process. There's a lot
             | more to ML. And this should obviously be true since we
             | consider them "black boxes"[3]. A lack of interpretability
             | is not due to an immutable law, but due to our lack of
             | understanding of a highly complex system. Yes, maybe each
             | action in that system is simple, but if that meant the
             | system as a whole was simple then I welcome you to develop
             | a TOE for physics. Emergence is useful but also a pain in
             | the ass[4].
             | 
             | [0] https://en.wikipedia.org/wiki/All_models_are_wrong
             | 
             | [1] For one, this is more accurately called a validation
             | set. Test sets are held out. No more tuning. You're done.
             | This is self-referential to my point.
             | 
             | [2] If you want to fight me on this, at least demonstrate
             | to me you have taken an abstract algebra course and
             | understand ideals and rings. Even better if axioms and set
             | theory. I accept other positions, but too many argue from
             | the basis of physics without understanding the difference
              | between a physics model and physics. Just because math is the
             | language of physics does not mean math (or even physics) is
             | inherently an objective principle (physics is a model).
             | 
             | [3] I hate this term. They are not black, but they are
             | opaque. Which is to say that there is _some_ transparency.
             | 
             | [4] I am using the term "emergence" in the way a physicist
             | would, not what you've seen in an ML paper. Why? Well read
             | point 4 again starting at footnote [3].
        
               | Onavo wrote:
               | > _1) I know one (ONE) person who took multivariate
               | calculus in high school._
               | 
               | Unless you are specifically dealing with intractable
               | Bayesian integral problems, the multivariate calculus
               | involved in NNs are primarily differentiation, not
               | integration. The fun problems like boundary conditions
               | and Stokes/Green that makes up the meat of multivariable
               | calculus don't truly apply when you are dealing with
               | differentiation only. In other words you only need the
               | parts of calc 2/3 that can be taught in an afternoon, not
               | the truly difficult parts.
               | 
               | > _I 'm sure many here recognize the reference[0], but
               | being able to make a model that performs successfully on
               | a test set[1] is not always meaningful. (sic) ...[2] If
               | you want to fight me on this, at least demonstrate to me
               | you have taken an abstract algebra course and understand
               | ideals and rings. Even better if axioms and set theory._
               | 
               | Doesn't matter, if it creates value, it is sufficiently
               | correct for all intents and purposes. Pray tell me how
                | discrete math and abstract algebra have anything to do
                | with day-to-day ML research. If you want to appeal to
               | physics sure, plenty of Ising models, energy functions,
               | and belief propagation in ML but you have lost all
               | credibility bringing up discrete math.
               | 
               | Again those correlation tests you use to fact check your
               | model are primarily linear frequentist models. Most
               | statistics practitioners outside of graduate research
               | will just be plugging formulas, not doing research level
               | proofs.
               | 
               | > _Just because the simplified view is not mathematically
               | intensive does not mean math isn 't important nor does it
               | mean there isn't extremely complex mathematics under the
               | hood. You're only explaining the mathematics in a simple
               | way that is only about the updating process. There's a
               | lot more to ML._
               | 
               | Are you sure? The traditional linear algebra (and
               | similar) models never (or rarely) outperformed neural
               | networks, except perhaps on efficiency, absent hardware
               | acceleration and all other things being equal. A flapping
               | bird wing is beautiful from a bioengineering point of
               | view but the aerospace industry is powered by dumb
               | (mostly) static airfoils. Just because something is
               | elegant doesn't mean it solves problems. A scaled up CNN
               | is about as boring a NN can get, yet it beats the pants
               | off all those traditional computer vision algorithms that
               | I am sure contain way more "discrete math and abstract
               | algebra".
               | 
               | That being said, more knowledge is always a good thing,
               | but I am not naive enough to believe that ML research can
               | only be advanced by people with "mathematical maturity".
                | It's still in the highly empirical stage where
               | experimentation (regardless of whether it's guided by
               | mathematical intuition) dominates. I have seen plenty of
               | interesting ML results from folks who don't know what
               | ELBOs and KL divergences are.
        
           | danielmarkbruce wrote:
           | Yes. But it's not difficult math in 99% of cases, it's just
           | notation. It may as well be written in Japanese.
        
         | kherud wrote:
         | Interesting, thanks for sharing! Do you have an explanation or
         | idea why compilation slows some architectures down?
        
           | thomasahle wrote:
            | Consider the function:
            | 
            |     relu(np.outer(x, y)) @ z
            | 
            | This takes n^2 time and memory in the naive implementation.
            | But clearly, the memory could be reduced to O(n) with the
            | right "fusing" of the operations.
            | 
            | KANs are similar. This is the forward code for KANs:
            | 
            |     x = einsum("bi,oik->boik", x, w1) + b1
            |     x = einsum("boik,oik->bo", relu(x), w2) + b2
            | 
            | This is the forward code for an Expansion / Inverse
            | Bottleneck MLP:
            | 
            |     x = einsum("bi,iok->bok", x, w1) + b1
            |     x = einsum("bok,okp->bp", relu(x), w2) + b2
           | 
           | Both take nd^2 time, but Inverse Bottleneck only takes nd
           | memory. For KANs to match the memory usage, the two einsums
           | must be fused.
           | 
           | It's actually quite similar to flash-attention.
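            | 
            | For example (a sketch, not my actual benchmark code), the
            | two-einsum KAN forward above can be handed to torch.compile,
            | which gets a chance to fuse away the big (batch, out, in, k)
            | intermediate:
            | 
            |     import torch
            | 
            |     def kan_forward(x, w1, b1, w2, b2):
            |         h = torch.einsum("bi,oik->boik", x, w1) + b1
            |         return torch.einsum("boik,oik->bo",
            |                             torch.relu(h), w2) + b2
            | 
            |     kan_forward_c = torch.compile(kan_forward)
            | 
            |     x = torch.randn(32, 64)                   # (batch, in)
            |     w1 = torch.randn(16, 64, 4)               # (out, in, k)
            |     b1 = torch.randn(16, 64, 4)
            |     w2 = torch.randn(16, 64, 4)
            |     b2 = torch.randn(16)
            |     out = kan_forward_c(x, w1, b1, w2, b2)    # (batch, out)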
        
             | godelski wrote:
             | Which is to say, a big part is lack of optimization.
             | 
              | Personally, I think this is fine in context: it's a new
              | formulation, and the optimizations are neither easy nor
              | obvious. It shouldn't be expected that every researcher
              | can recognize and solve all optimization problems.
        
       | jcims wrote:
       | I can find descriptions at one level or another (eg RNN vs CNN)
       | but is there a deeper kingdom/phylum/class type taxonomy of
       | neural network architectures that can help a layman understand
       | how they differ and how they align, ideally with specific
       | references to contemporary ones in use or being researched?
       | 
       | I don't know why I'm interested because I'm not planning to
        | actually do any work in the space, but when some new
        | architecture is announced I always struggle to tell whether
        | it's a fundamental shift or just an optimization.
        
         | kens wrote:
         | You might find "The neural network zoo" helpful; it's a chart
         | showing the different types of neural networks, along with a
         | brief discussion of each type:
         | https://www.asimovinstitute.org/neural-network-zoo/
        
           | jcims wrote:
           | Perfect!!! Thank you!
        
       | zygy wrote:
       | Naive question: what's the intuition for how this is different
       | from increasing the number of learnable parameters on a regular
       | MLP?
        
         | slashdave wrote:
         | Orthogonality ensures that each weight has its own, individual
         | importance. In a regular MLP, the weights are naturally
         | correlated.
        
       | Ameo wrote:
       | I've tried out and written about[1] KANs on some small-scale
       | modeling, comparing them to vanilla neural networks, as
       | previously discussed here:
       | https://news.ycombinator.com/item?id=40855028.
       | 
       | My main finding was that KANs are very tricky to train compared
       | to NNs. It's usually possible to get per-parameter loss roughly
       | on par with NNs, but it requires a lot of hyperparameter tuning
       | and extra tricks in the KAN architecture. In comparison, vanilla
       | NNs were much easier to train and worked well under a much
       | broader set of conditions.
       | 
       | Some people commented that we've invested an incredible amount of
       | effort into getting really good at training NNs efficiently, and
       | many of the things in ML libraries (optimizers like Adam, for
       | example) are designed and optimized specifically for NNs. For
       | that reason, it's not really a good apples-to-apples comparison.
       | 
       | I think there's definitely potential in KANs, but they aren't a
       | magic bullet. I'm also a bit dubious about interpretability
       | claims; the splines that are usually used for KANs don't really
       | offer much more insight to me than just analyzing the output of a
       | neuron in a lower layer of a NN.
       | 
       | [1] https://cprimozic.net/blog/trying-out-kans/
        
         | smus wrote:
         | Not just the optimizers, but the initialization schemes for
         | neural networks have been explicitly tuned for stable training
         | of neural nets with traditional activation functions. I'm not
          | sure as much work has gone into initialization for KANs
         | 
         | I 100% agree with the idea that these won't be any more
         | interpretable and I've never understood the argument that they
         | would be. Sure, if the NN was a single neuron I can see it, but
         | as soon as you start composing these things you lose all
         | interpretability imo
        
       | xg15 wrote:
       | > _Then they could summarize the entire KAN in an intuitive one-
       | line function (including all the component activation functions),
       | in some cases perfectly reconstructing the physics function that
       | created the dataset._
       | 
       | The idea of KANs sounds really exciting, but just to nitpick, you
       | could also write any traditional NN as a closed-form "one line"
       | expression - the line will just become very very long. I don't
       | see how the expression itself would become less complex if you
        | used splines instead of weights (even if this resulted in fewer
        | neurons for the same decision boundary).
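        | 
        | E.g. a two-layer MLP is already a closed-form "one line" (toy
        | numpy sketch below); it's just not a short or readable line,
        | and every extra layer makes it longer:
        | 
        |     import numpy as np
        | 
        |     # y = W2 @ relu(W1 @ x + b1) + b2, as a single expression
        |     f = lambda x, W1, b1, W2, b2: (
        |         W2 @ np.maximum(W1 @ x + b1, 0) + b2)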
        
       | asdfman123 wrote:
        | Can someone ELI5 this for me?
       | 
       | I understand how neural networks try to reduce their loss
       | function to get the best result. But what's actually different
       | about the KANs?
        
       ___________________________________________________________________
       (page generated 2024-08-05 23:00 UTC)