[HN Gopher] Trying Kolmogorov-Arnold Networks in Practice
       ___________________________________________________________________
        
       Trying Kolmogorov-Arnold Networks in Practice
        
       Author : Ameo
       Score  : 108 points
       Date   : 2024-07-02 09:54 UTC (13 hours ago)
        
 (HTM) web link (cprimozic.net)
 (TXT) w3m dump (cprimozic.net)
        
       | anonymousDan wrote:
       | 'hacky feeling techniques' - as opposed to the rest of DNN
       | research??! More seriously, I wonder if some kind of hybrid
       | approach could be possible/beneficial (e.g. KANs for some
       | layers?)
        
         | janalsncm wrote:
         | That's a fair criticism but it should count for something that
         | we've spent the last 20 years ironing out standard "hacks" to
         | get NNs working. Someday we will hopefully have some
         | theoretical grounding for basic things like network shape and
         | size, optimizer, learning rate schedule, etc.
         | 
         | Until that time we will be stuck using empirical methods (read:
         | trial and error) and KANs are at best just another thing to
         | try.
        
       | abetusk wrote:
       | Here are what I think are the main conclusions of the article:
       | 
       | """ ... the most significant factor controlling performance is
       | just parameter count. """
       | 
       | """ No matter what I did, the most simple neural network was
       | still outperforming the fanciest KAN-based model I tried. """
       | 
        | I suspected this was the case when I first heard about KANs.
        | It's nice to see someone diving into it a bit more, even if it
        | is just anecdotal.
        
       | thesz wrote:
        | The original paper [1] used LBFGS [2], which is a quasi-second-
        | order optimization algorithm.
        | 
        | [1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are
        | trained with LBFGS for 1800 steps in total."
        | 
        | [2] https://en.wikipedia.org/wiki/Limited-memory_BFGS
       | 
        | (Quasi-)Newton methods approximate the learning rate using the
        | local curvature, which plain gradient-descent methods do not.
       | 
        | The post relies on Tinygrad because it is familiar to the
        | author, and the author tinkers with the batch size and learning
        | rate, but not with the optimizer itself.
       | 
        | I think that even a line search for the minimum along the
        | direction of the batch gradient can provide most of the
        | benefits of LBFGS. It is easy to implement.
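        | 
        | For concreteness, a rough NumPy sketch of that idea (a
        | backtracking/Armijo line search along the negative batch
        | gradient; loss_fn and params are hypothetical stand-ins for the
        | model's batch loss and flattened parameters, not anything from
        | the post or tinygrad):
        | 
        |   import numpy as np
        | 
        |   def armijo_step(loss_fn, params, grad,
        |                   t0=1.0, beta=0.5, c=1e-4, tries=20):
        |       # Shrink the step until the decrease in loss is at least
        |       # a c-fraction of what the local slope predicts.
        |       f0 = loss_fn(params)
        |       d = -grad                   # search direction
        |       slope = np.dot(grad, d)     # = -||grad||^2
        |       t = t0
        |       for _ in range(tries):
        |           if loss_fn(params + t * d) <= f0 + c * t * slope:
        |               break               # sufficient decrease reached
        |           t *= beta               # otherwise shrink the step
        |       return params + t * d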
        
         | osipov wrote:
          | What's your basis for claiming that Tinygrad can't compute
          | 2nd-order partial derivatives (i.e. Hessians) needed for
          | LBFGS? Tinygrad, like PyTorch, uses automatic differentiation,
          | which has no problem supporting nth-order derivatives.
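          | 
          | For what it's worth, in PyTorch this really is a one-liner (I
          | haven't checked whether tinygrad exposes an equivalent
          | helper):
          | 
          |   import torch
          |   from torch.autograd.functional import hessian
          | 
          |   def f(x):
          |       return (x ** 3).sum()  # toy scalar function
          | 
          |   x = torch.tensor([1.0, 2.0])
          |   H = hessian(f, x)  # tensor([[6., 0.], [0., 12.]])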
        
           | fjkdlsjflkds wrote:
            | OP does not (seemingly) claim that tinygrad can't compute
            | Hessians, only that a first-order optimization method was
            | the only thing tried.
            | 
            | Also, as a quasi-Newton method, L-BFGS does not require
            | explicit (pre-)computation of the Hessian (it implicitly and
            | iteratively estimates its inverse in an online manner).
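            | 
            | For reference, that implicit estimate is the classic two-
            | loop recursion; a plain NumPy sketch (s_hist/y_hist hold
            | the recent parameter and gradient differences, oldest
            | first):
            | 
            |   import numpy as np
            | 
            |   def lbfgs_dir(g, s_hist, y_hist):
            |       # Apply the implicit inverse-Hessian estimate to g.
            |       q = g.copy()
            |       rhos = [1.0 / np.dot(y, s)
            |               for s, y in zip(s_hist, y_hist)]
            |       alphas = [0.0] * len(s_hist)
            |       for i in reversed(range(len(s_hist))):
            |           alphas[i] = rhos[i] * np.dot(s_hist[i], q)
            |           q -= alphas[i] * y_hist[i]
            |       if s_hist:  # diagonal initial inverse-Hessian guess
            |           q *= (np.dot(s_hist[-1], y_hist[-1])
            |                 / np.dot(y_hist[-1], y_hist[-1]))
            |       for i in range(len(s_hist)):
            |           b = rhos[i] * np.dot(y_hist[i], q)
            |           q += (alphas[i] - b) * s_hist[i]
            |       return -q  # approx. -H^{-1} g: the search direction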
        
           | thesz wrote:
            | As someone with a highly unpronounceable nickname said, my
            | only complaint is that only first-order methods were used.
           | 
           | Second order methods are fun, actually. I like them. ;)
        
         | abetusk wrote:
         | You're saying that LBFGS is fundamental to the success of KANs?
         | Why is this so?
        
           | buildbot wrote:
            | And if it is, good luck scaling LBFGS to anything usable,
            | like VGG-16 scale... let alone a 7B-parameter LLM.
            | 
            | Back in grad school I tried to use LBFGS to optimize a small
            | LeNet network. I want to say it used over 128GB before
            | OOMing.
        
             | thesz wrote:
             | This is why I mentioned batch gradient line search. You can
             | combine it with conjugate gradient.
             | 
              | And a small LeNet (I think it was the first convolutional
              | network to obtain a good score on MNIST) is orders of
              | magnitude bigger than the KANs in the original paper. And
              | it will remain so, if we believe the scaling claims from
              | the KAN paper.
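              | 
              | For the record, the conjugate-gradient combination is
              | just the standard nonlinear CG update (a NumPy sketch;
              | grad/prev_grad are the batch gradients at the new and
              | previous iterates):
              | 
              |   import numpy as np
              | 
              |   def pr_direction(grad, prev_grad, prev_dir):
              |       # Polak-Ribiere: blend the new steepest-descent
              |       # direction with the previous search direction.
              |       beta = (np.dot(grad, grad - prev_grad)
              |               / np.dot(prev_grad, prev_grad))
              |       beta = max(beta, 0.0)  # PR+ restart
              |       return -grad + beta * prev_dir
              | 
              | Each resulting direction would then be handed to a line
              | search of the kind described above.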
        
           | thesz wrote:
            | I can only state that the KAN paper used LBFGS and that
            | LBFGS is remarkably different from SGD.
            | 
            | One of the differences is a dynamic learning rate guided by
            | an approximation of the local curvature.
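            | 
            | A toy 1-D picture of what "guided by the local curvature"
            | buys you (plain Python, nothing from the paper):
            | 
            |   # f(w) = 3w^2 - 2w: gradient 6w - 2, curvature 6
            |   a, b, w = 3.0, -2.0, 5.0
            |   grad = 2 * a * w + b     # 28.0 at w = 5
            |   curv = 2 * a             # 6.0, constant here
            |   w_new = w - grad / curv  # learning rate = 1/curvature
            |   # lands exactly on the minimum w* = -b / (2 * a) = 1/3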
        
           | adw wrote:
           | Will deal better with less well-conditioned parameters,
           | maybe? That's a bit of a wild-ass guess, but it's not an
           | entirely uninformed one.
        
       | gwern wrote:
        | Web design note for OP: you designed your site for dark mode,
        | and your initial SVGs are correct, but it clashes with your
        | graphs, which are all light mode. You can invert them in CSS,
        | and they'll look a lot better.
       | 
       | And you can choose which ones to invert _automatically_ using the
       | free+Free https://invertornot.com/ API - IoN will correctly
       | return that eg https://i.ameo.link/caa.png (and the other two)
       | should be inverted.
        
         | Ameo wrote:
         | Very good suggestion, thanks. I'll fix that
        
       | davesque wrote:
       | KANs seem like a great tool for the right job. However, based on
       | my understanding of how they work, my intuitions tell me that
       | they would be awful at image processing, which I think was one of
       | the author's test beds.
        
         | ein0p wrote:
            | On what do you base your intuitions? The way I see it,
            | either they are good universal approximators (like ANNs),
            | and then they're usable for everything, or they aren't, and
            | then they're at best very narrow in application. I've yet to
            | see any evidence for the latter.
        
           | eximius wrote:
           | Uninformed response: if two different universal approximators
           | can have different modes/rates of learning, then even if
           | they're usable in theory, they could be less good in
           | practice.
        
           | davesque wrote:
           | Then why would you choose to use a CNN over a fully connected
           | neural network for visual data? For the same reason you'd
           | choose a KAN over a traditional neural network if you were
           | trying to fit a continuous function that can easily be
           | modeled as a combination of b-splines. Machine learning
           | models aren't magic and their internal function very much
           | determines the kinds of problems to which they are well
           | applied.
           | 
            | My intuition about KANs and visual data comes from an
            | impression that it would be hard for a decision boundary on
            | visual data to behave nicely if it could only be built from
            | b-splines.
           | 
           | Judging the usefulness of a machine learning architecture is
           | not a matter of determining which architecture will perform
           | the best in all scenarios.
        
           | nyrikki wrote:
            | Universal approximation just indicates the existence of a
            | suitable sequence, not that it is findable, and it doesn't
            | apply with a finite number of neurons.
            | 
            | But MLPs are not good for everything. Where simulated
            | annealing works better than auto-diff is the classic
            | example, and one that is easier to visualize, at least for
            | me.
            | 
            | Even if the sequence 'exists', finding it is the problem; it
            | doesn't matter whether a method can represent an unfindable
            | sequence.
            | 
            | That said, IMHO, MLP vs. KAN is probably safer to think of
            | as horses for courses: they are better at different things.
            | 
            | At least while your definition of 'usable' remains
            | undefined.
       | dahart wrote:
       | > And here's a neural network/multi-layer perceptron with the
       | same number of layers and nodes: One big difference to note is
       | that there are far fewer connections between nodes in KANs
       | compared to neural networks/MLPs.
       | 
        | I think it's probably worth clarifying a little here that a
        | B-spline is essentially a little MLP, where, at least for
        | uniform B-splines, the depth is equal to the polynomial degree
        | of the spline. (That's also the width of the first layer.)
       | 
       | So those two network diagrams are only superficially similar, but
       | the KAN is actually a much bigger network if degree > 1 for the
       | splines. I wonder if that contributed to the difficulty of
       | training it. It is possible some of the "code smell" you noticed
       | and got rid of is relatively important for achieving good
       | results. I'd guess the processes for normalizing inputs and
       | layers of a KAN need to be a bit different than for standard
       | nets.
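        | 
        | To illustrate, de Boor's algorithm for evaluating one degree-k
        | spline (the textbook recursion, not code from the post) is k
        | successive rounds of linear blending, which is the sense in
        | which each spline edge hides a small network of its own:
        | 
        |   def de_boor(x, i, k, knots, coeffs):
        |       # i is the index of the knot interval containing x,
        |       # i.e. knots[i] <= x < knots[i+1]; the knot vector is
        |       # assumed to be padded as usual.
        |       d = [coeffs[j + i - k] for j in range(k + 1)]
        |       for r in range(1, k + 1):  # depth == degree
        |           for j in range(k, r - 1, -1):
        |               num = x - knots[j + i - k]
        |               den = knots[j + 1 + i - r] - knots[j + i - k]
        |               a = num / den
        |               d[j] = (1 - a) * d[j - 1] + a * d[j]
        |       return d[k]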
        
       | slashdave wrote:
        | It is important to remember that the optimizers used in
        | mainstream deep learning models have all been developed and
        | fine-tuned to work well with classic NN architectures. There is
        | no such thing as a generic optimization algorithm.
        
       ___________________________________________________________________
       (page generated 2024-07-02 23:00 UTC)