[HN Gopher] Trying Kolmogorov-Arnold Networks in Practice
___________________________________________________________________
Trying Kolmogorov-Arnold Networks in Practice
Author : Ameo
Score : 108 points
Date : 2024-07-02 09:54 UTC (13 hours ago)
(HTM) web link (cprimozic.net)
(TXT) w3m dump (cprimozic.net)
| anonymousDan wrote:
| 'hacky feeling techniques' - as opposed to the rest of DNN
| research??! More seriously, I wonder if some kind of hybrid
| approach could be possible/beneficial (e.g. KANs for some
| layers?)
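| Roughly what I'm picturing, as a toy sketch: each edge gets its
| own learnable 1-D function, here a sum of fixed Gaussian bumps
| with learnable coefficients standing in for the B-splines real
| KANs use. ToyKANLayer and all of its details are made up for
| illustration.
|
|     import torch
|     import torch.nn as nn
|
|     class ToyKANLayer(nn.Module):
|         # each input->output edge gets its own learnable 1-D
|         # function: a linear combination of fixed Gaussian bumps
|         def __init__(self, in_dim, out_dim, num_basis=8):
|             super().__init__()
|             self.register_buffer(
|                 "centers", torch.linspace(-1.0, 1.0, num_basis))
|             self.coef = nn.Parameter(
|                 0.1 * torch.randn(out_dim, in_dim, num_basis))
|
|         def forward(self, x):                   # x: (batch, in)
|             d = x.unsqueeze(-1) - self.centers  # (batch, in, basis)
|             basis = torch.exp(-(4.0 * d) ** 2)
|             # output_o = sum_i phi_{o,i}(x_i)
|             return torch.einsum("bik,oik->bo", basis, self.coef)
|
|     # hypothetical hybrid: KAN-ish first layer, plain MLP head
|     model = nn.Sequential(ToyKANLayer(16, 32), nn.ReLU(),
|                           nn.Linear(32, 1))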
| janalsncm wrote:
| That's a fair criticism but it should count for something that
| we've spent the last 20 years ironing out standard "hacks" to
| get NNs working. Someday we will hopefully have some
| theoretical grounding for basic things like network shape and
| size, optimizer, learning rate schedule, etc.
|
| Until that time we will be stuck using empirical methods (read:
| trial and error) and KANs are at best just another thing to
| try.
| abetusk wrote:
| Here are what I think are the main conclusions of the article:
|
| """ ... the most significant factor controlling performance is
| just parameter count. """
|
| """ No matter what I did, the most simple neural network was
| still outperforming the fanciest KAN-based model I tried. """
|
| I suspected this was the case when I first heard about KANs. It's
| nice to see someone digging into it a bit more, even if it is
| just anecdotal.
| thesz wrote:
| The original paper [1] used LBFGS [2], which is a
| quasi-second-order optimization algorithm.
|
| [1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are
| trained with LBFGS for 1800 steps in total."
| [2] https://en.wikipedia.org/wiki/Limited-memory_BFGS
|
| (Quasi-)Newton methods choose their step sizes using an
| approximation of the local curvature, which plain gradient-based
| methods do not do.
|
| The post relies on Tinygrad because it is familiar to the
| author, and the author tinkers with batch size and learning
| rate, but not with the optimizer itself.
|
| I think that even a line search for the minimum along the
| direction of the batch gradient can provide most of the benefits
| of LBFGS, and it is easy to implement.
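| Something like this (untested sketch, PyTorch-style; loss_fn is
| assumed to recompute the current batch's loss and params to be
| the list of trainable tensors):
|
|     import torch
|
|     def line_search_step(params, loss_fn, max_step=1.0,
|                          shrink=0.5, max_tries=20):
|         # steepest descent with a backtracking line search along
|         # the batch gradient, instead of a fixed learning rate
|         loss = loss_fn()
|         grads = torch.autograd.grad(loss, params)
|         step = max_step
|         with torch.no_grad():
|             for _ in range(max_tries):
|                 for p, g in zip(params, grads):
|                     p -= step * g          # try a step downhill
|                 if loss_fn() < loss:       # accept improvement
|                     return
|                 for p, g in zip(params, grads):
|                     p += step * g          # undo, then shrink
|                 step *= shrink
|
| Call it once per batch; swapping in conjugate-gradient directions
| instead of the raw gradient is the obvious next refinement.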
| osipov wrote:
| What's your basis for claiming that Tinygrad can't compute 2nd
| order partial derivatives (i.e. Hessians) needed for LBFGS?
| Tinygrad, like PyTorch, uses automatic differentiation, which
| has no problem supporting nth-order derivatives.
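| For what it's worth, the double-backward pattern in PyTorch is
| just the following (shown in PyTorch; I haven't checked how
| tinygrad spells the equivalent):
|
|     import torch
|
|     x = torch.tensor([2.0], requires_grad=True)
|     y = x ** 3
|
|     # keep the graph so the gradient can be differentiated again
|     (dy,) = torch.autograd.grad(y, x, create_graph=True)
|     (d2y,) = torch.autograd.grad(dy, x)
|     print(dy.item(), d2y.item())   # 3x^2 = 12.0, 6x = 12.0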
| fjkdlsjflkds wrote:
| OP does not (seemingly) claim that tinygrad can't compute
| hessians, only that a first-order optimization method was the
| only thing tried.
|
| Also, as a quasi-Newton method, L-BFGS does not require explicit
| (pre-)computation of the Hessian; it implicitly estimates its
| inverse, iteratively and in an online manner.
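| Which is also why torch.optim.LBFGS only asks for a closure that
| recomputes the loss, nothing Hessian-shaped. Rough usage sketch
| (model, loss_fn, inputs, targets are placeholders):
|
|     import torch
|
|     opt = torch.optim.LBFGS(model.parameters(), lr=1.0,
|                             max_iter=20, history_size=10)
|
|     def closure():
|         # LBFGS may re-evaluate the loss several times per step
|         opt.zero_grad()
|         loss = loss_fn(model(inputs), targets)
|         loss.backward()
|         return loss
|
|     for _ in range(100):
|         opt.step(closure)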
| thesz wrote:
| As someone with a highly unpronounceable nickname said, my only
| complaint is that only first-order methods are used.
|
| Second order methods are fun, actually. I like them. ;)
| abetusk wrote:
| You're saying that LBFGS is fundamental to the success of KANs?
| Why is this so?
| buildbot wrote:
| And if it is, good luck scaling LBFGS to anything usable, like
| VGG-16 scale... let alone a 7B-param LLM.
|
| Back in grad school I tried to use LBFGS to optimize a small
| LeNet network. I want to say it used over 128GB before OOMing.
| thesz wrote:
| This is why I mentioned batch gradient line search. You can
| combine it with conjugate gradient.
|
| And a small LeNet (I think it was the first convolutional
| network to get a good score on MNIST) is orders of magnitude
| bigger than the KANs in the original paper. And it will stay
| that way, if we believe the scaling claims from the KAN paper.
| thesz wrote:
| I can only state that the KAN paper used LBFGS and that LBFGS is
| remarkably different from SGD.
|
| One of the differences is a dynamic learning rate guided by an
| approximation of the local curvature.
| adw wrote:
| Will deal better with less well-conditioned parameters,
| maybe? That's a bit of a wild-ass guess, but it's not an
| entirely uninformed one.
| gwern wrote:
| Web design note for OP: you designed your site for dark-mode, and
| your initial SVGs are correct, but then it clashes with your
| graphs which are all light-mode. You can invert them in CSS, and
| they'll look a lot better.
|
| And you can choose which ones to invert _automatically_ using the
| free+Free https://invertornot.com/ API - IoN will correctly
| return that eg https://i.ameo.link/caa.png (and the other two)
| should be inverted.
| Ameo wrote:
| Very good suggestion, thanks. I'll fix that
| davesque wrote:
| KANs seem like a great tool for the right job. However, based on
| my understanding of how they work, my intuitions tell me that
| they would be awful at image processing, which I think was one of
| the author's test beds.
| ein0p wrote:
| On what do you base your intuitions? The way I see it, either
| they are good universal approximators (like ANNs), and then
| they're usable for everything, or they aren't, and then they're
| at best very narrow in application. I've yet to see any evidence
| for the latter.
| eximius wrote:
| Uninformed response: if two different universal approximators
| can have different modes/rates of learning, then even if
| they're usable in theory, they could be less good in
| practice.
| davesque wrote:
| Then why would you choose to use a CNN over a fully connected
| neural network for visual data? For the same reason you'd
| choose a KAN over a traditional neural network if you were
| trying to fit a continuous function that can easily be
| modeled as a combination of B-splines. Machine learning models
| aren't magic, and their internal workings very much determine
| the kinds of problems to which they are well suited.
|
| My intuitions about KANs and visual data come from an impression
| that it would be hard for a decision boundary on visual data to
| behave nicely if it could only be built from B-splines.
|
| Judging the usefulness of a machine learning architecture is
| not a matter of determining which architecture will perform
| the best in all scenarios.
| nyrikki wrote:
| Universal approximation just indicates the existence of a
| suitable sequence, not that it is findable, and it doesn't apply
| with a finite number of neurons.
|
| But MLPs are not good for everything. The cases where simulated
| annealing works better than auto-diff are the classic example,
| and the easiest one for me to visualize (toy sketch below).
|
| Even if the sequence 'exists', finding it is the problem; it
| doesn't matter whether a method can represent an unfindable
| sequence.
|
| That said, IMHO, MLP vs. KAN is probably safer to think of as
| horses for courses: they are better at different things.
|
| At least as long as your definition of 'usable' stays undefined.
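| (On the simulated annealing point, I mean something like this
| toy 1-D comparison: gradient steps stall in the nearest dip of a
| bumpy loss, while an annealer can still hop out. Everything here
| is made up purely for illustration.)
|
|     import math, random
|
|     def loss(x):
|         # many local minima around a single global basin
|         return x * x + 10.0 * math.sin(5.0 * x)
|
|     def simulated_annealing(x=4.0, temp=5.0, cooling=0.995,
|                             steps=5000):
|         best_x, best_l = x, loss(x)
|         for _ in range(steps):
|             cand = x + random.gauss(0.0, 0.5)
|             delta = loss(cand) - loss(x)
|             # always accept improvements; sometimes go uphill
|             p_accept = 1.0 if delta < 0 else math.exp(-delta / temp)
|             if random.random() < p_accept:
|                 x = cand
|                 if loss(x) < best_l:
|                     best_x, best_l = x, loss(x)
|             temp *= cooling
|         return best_x, best_l
|
|     print(simulated_annealing())  # tends to find the global basin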
| dahart wrote:
| > And here's a neural network/multi-layer perceptron with the
| same number of layers and nodes: One big difference to note is
| that there are far fewer connections between nodes in KANs
| compared to neural networks/MLPs.
|
| I think it's probably worth clarifying a little here that a
| B-spline is essentially a little MLP, where, at least for
| uniform B-splines, the depth is equal to the polynomial degree
| of the spline. (That's also the width of the first layer; see
| the little recursion sketch at the end of this comment.)
|
| So those two network diagrams are only superficially similar, but
| the KAN is actually a much bigger network if degree > 1 for the
| splines. I wonder if that contributed to the difficulty of
| training it. It is possible some of the "code smell" you noticed
| and got rid of is relatively important for achieving good
| results. I'd guess the processes for normalizing inputs and
| layers of a KAN need to be a bit different than for standard
| nets.
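| To make the "little MLP" picture concrete, here is the Cox-de
| Boor recursion for a single B-spline basis function; each level
| of the recursion is one "layer", so a cubic (degree-3) spline is
| effectively three levels deep. Plain Python, illustrative only:
|
|     def bspline_basis(i, k, t, knots):
|         # i-th B-spline basis function of degree k at t, via the
|         # Cox-de Boor recursion (recursion depth == degree)
|         if k == 0:
|             return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
|         left = right = 0.0
|         d1 = knots[i + k] - knots[i]
|         d2 = knots[i + k + 1] - knots[i + 1]
|         if d1 > 0:
|             left = ((t - knots[i]) / d1
|                     * bspline_basis(i, k - 1, t, knots))
|         if d2 > 0:
|             right = ((knots[i + k + 1] - t) / d2
|                      * bspline_basis(i + 1, k - 1, t, knots))
|         return left + right
|
|     # a KAN edge function is then sum_i c_i * bspline_basis(...)
|     # with the coefficients c_i as the learnable parameters
|     knots = list(range(8))                 # uniform knot vector
|     print(bspline_basis(0, 3, 1.5, knots)) # cubic basis at t=1.5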
| slashdave wrote:
| It is important to remember that the optimizers used in
| mainstream deep learning models have all been developed and
| fine-tuned to work well with classic NN architectures. There is
| no such thing as a generic optimization algorithm.
___________________________________________________________________
(page generated 2024-07-02 23:00 UTC)