[HN Gopher] Kolmogorov-Arnold Networks
___________________________________________________________________
Kolmogorov-Arnold Networks
Author : sumo43
Score : 376 points
Date : 2024-05-01 03:30 UTC (19 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| nico wrote:
| Looks super interesting
|
| I wonder how many more new architectures are going to be found in
| the next few years
| Maro wrote:
| Interesting!
|
| Would this approach (with non-linear learning) still be able to
| utilize GPUs to speed up training?
| diwank wrote:
 | Seconded. I'm guessing you could create an implementation that
 | is able to do that and then write optimised triton/cuda kernels
 | to accelerate them, but I need to investigate further
| diwank wrote:
| It'd be really cool to see a transformer with the MLP layers
| swapped for KANs and then compare its scaling properties with
| vanilla transformers
| gautam5669 wrote:
 | This was the first thought that came to my mind too.
|
 | Given it's sparse, will this just be a replacement for MoE?
| mp187 wrote:
 | Why was this your first thought? Is the MLP layer a limiting
 | factor for transformers? I thought the bottleneck was in the
| renormalization part.
| brrrrrm wrote:
 | At small input size, yes, the MLP dominates compute. At large
 | input sizes attention matters more: per layer, the MLP costs
 | roughly 16*n*d^2 FLOPs versus ~4*n^2*d for the attention map,
 | so attention takes over once the sequence length n exceeds a
 | few multiples of the model width d.
| bart1ett wrote:
| After trying this out with the fourier implementation above,
| swapping MLP/Attention Linear layers for KANs (all, or even a
| few layers) produces diverging loss. KANs don't require
| normalization for good forward pass dynamics, but may be
| trickier to train in a deep net.
| cbsmith wrote:
| Feels like someone stuffed splines into decision trees.
| baq wrote:
| It's Monte Carlo all the way down
| xpe wrote:
| splines, yes.
|
| I'm not seeing decision trees, though. Am I missing something?
|
| > "KANs' nodes simply sum incoming signals without applying any
| non-linearities." (page 2 of the PDF)
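 |
 | i.e., roughly, in the paper's notation a KAN node just computes
 |
 |     x_{l+1,j} = \sum_i \phi_{l,j,i}(x_{l,i})
 |
 | with all of the nonlinearity living on the edges.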
| cbsmith wrote:
| I definitely think I'm projecting and maybe seeing things
| that aren't there. If you replaced splines with linear
| weights, it kind of looks like a decision tree to me.
| ALittleLight wrote:
| I can't assess this, but I do worry that overnight some
| algorithmic advance will enhance LLMs by orders of magnitude and
| the next big model to get trained is suddenly 10,000x better than
| GPT-4 and nobody's ready for it.
| DeathArrow wrote:
| >some algorithmic advance will enhance LLMs by orders of
| magnitude
|
 | I would worry if I owned Nvidia shares.
| krasin wrote:
 | Actually, that would be fantastic for NVIDIA shares:
|
| 1. A new architecture would make all/most of these upcoming
| Transformer accelerators obsolete => back to GPUs.
|
| 2. Higher performance LLMs on GPUs => we can speed up LLMs
 | with 1T+ parameters. So LLMs become more useful, and more
 | GPUs would be purchased.
| mindcrime wrote:
| _1. A new architecture would make all /most of these
| upcoming Transformer accelerators obsolete => back to
| GPUs._
|
| There's no guarantee that that is what would happen. The
| right (or wrong, depending on your POV) algorithmic
 | breakthrough might make GPUs obsolete for AI, by making
 | CPUs (or analog computing units, or DSPs, or "other") the
| preferred platform to run AI.
| 6mian wrote:
 | What is there to be worried about? Technical progress will happen,
| sometimes by sudden jumps. Some company will become a leader,
| competitors will catch up after a while.
| cess11 wrote:
| "Technical progress" has been destroying our habitat for
| centuries, causing lots of other species to go extinct.
| Pretty much the entire planet surface has been 'technically
| progressed', spreading plastics, climate change and whatnot
| over the entirety of it.
|
| Are you assuming that this particular "progress" would be
| relatively innocent?
| 6mian wrote:
| On the other hand, the same "technical progress" (if we're
| putting machine learning, deforestation, and mining in the
 | same bag) gave you medicine, which turns many otherwise
 | deadly diseases into inconveniences, and means that, in a
 | large portion of the world, you no longer work 12 hrs a day,
 | 7 days a week just to not die from hunger. A few hundred years ago,
| unless you were born into the lucky 0.01% of the ruling
| population, working from dawn to sunset was the norm for a
| lot more people than now.
|
| I'm not assuming that something 10k x better than GPT-4
| will be good or bad; I don't know. I was just curious what
| exactly to be worried about. I think in the current state,
| LLMs are already advanced enough for bad uses like article
| generation for SEO, spam, scams, etc., and I wonder if an
| order of magnitude better model would allow for something
| worse.
| cess11 wrote:
| Where did you learn that history?
|
| What do you mean by "better"?
| 6mian wrote:
| I had a European peasant in the 1600-1700s in mind when I
| wrote about the amount of work. During the season, they
| worked all day; off-season, they had "free time" that
| went into taking care of the household, inventory, etc.,
 | so it's still work. I can't quickly find a reliable source
 | in English to link, so I may be wrong here.
|
| "Better" was referring to what OP wrote in the top
| comment. I guess 10x faster, 10x longer context, and 100x
| less prone to hallucinations would make a good "10k x
| better" than GPT-4.
| cess11 wrote:
 | Sorry, I can't square that with what you wrote earlier: "12
 | hrs/7 days per week to not die from hunger".
|
 | Those peasants paid taxes, i.e. some of their work was
 | exploited by an army or a priest rather than by hunger, and
| as you mention, they did not work "12 hrs/7 days per
| week".
|
| Do you have a better example?
| vladms wrote:
| Many species went extinct during Earth's history. Evolution
| requires quite aggressive competition.
|
| The way the habitat got destroyed by humans is stupid
| because it might put us in danger. You can call me
| "speciesist" but I do care more for humans rather than for
| a particular other specie.
|
| So I think progress should be geared towards human species
| survival and if possible preventing other species
| extinction. Some of the current developments are a bit too
| much on the side of "I don't care about anyone's survival"
| (which is stupid and inefficient).
| lccerina wrote:
| If other species die, we follow shortly. This
 | anthropocentric view really ignores how much of our food
| chain exists because of other animals surviving despite
| human activities.
| cess11 wrote:
 | Evolution is the result of catastrophes and atrocities.
| You use the word as if it has positive connotations,
| which I find weird.
|
| How do you come to the conclusion "stupid" rather than
| evil? Aren't we very aware of the consequences of how we
| are currently organising human societies, and have been
| for a long time?
| snewman wrote:
| I think this is unlikely. There has never (in the visible
| fossil record) been a mutation that suddenly made tigers an
| order of magnitude stronger and faster, or humans an order of
| magnitude more intelligent. It's been a long time (if ever?)
| since chip transistor density made a multiple-order-of-
| magnitude leap. Any complex optimized system has many limiting
| factors and it's unlikely that all of them would leap forward
 | at once. The current generation of LLMs is not as complex or
 | optimized as tigers or humans, but it's far enough along
| that changing one thing is unlikely to result in a giant leap.
|
| If and when something radically better comes along, say an
| alternative to back-propagation that is more like the way our
| brains learn, it will need a lot of scaling and refinement to
| catch up with the then-current LLM.
| the8472 wrote:
| Comparing it to evolution and SNPs isn't really a good
| analogy. Novel network architectures are much larger changes,
| maybe comparable to new organelles or metabolic pathways? And
| those have caused catastrophic changes. Evolution also
| operates on much longer time-scales due to its blind parallel
| search.
|
| https://en.wikipedia.org/wiki/Oxygen_catastrophe
| mxwsn wrote:
 | From the preprint - 100 input dimensions is considered "high",
 | and most problems studied have 5 or fewer input dimensions.
 | This is typical of the physics-inspired settings I've seen in
 | ML. The next step would be demonstrating them on MNIST, which,
 | at 784 dimensions, is tiny by modern standards.
| wongarsu wrote:
| In actual business processes there are lots of ML problems with
| fewer than 100 input dimensions. But for most of them decision
| trees are still competitive with neural networks or even
| outperform them.
| galangalalgol wrote:
| The aid to explainability seems at least somewhat compelling.
| Understanding what a random forest did isn't always easy. And
| if what you want isn't the model but the closed form of what
| the model does, this could be quite useful. When those
| hundred input dimensions interact nonlinearly in a million
 | ways, that's nice. Or more likely I'd use it when I don't want
| to find a pencil to derive the closed form of what I'm trying
| to do.
| trwm wrote:
| Business processes don't need deep learning in the first
 | place. It is just there because of hype.
| JoshCole wrote:
| Competent companies tend to put a lot of effort into
| building data analysis tools. There will often be A/B or
| QRT frameworks in place allowing deployment of two models,
| for example, the new deep learning model, and the old rule
| based system. By using the results from these experiments
| in conjunction with typical offline and online evaluation
| metrics one can begin to make statements about the impact
| of model performance on revenue. Naturally model
| performance is tracked through many offline and online
| metrics. So people can and do say things like "if this
| model is x% more accurate then that translates to $y
| million dollars in monthly revenue" with great confidence.
|
 | Let's call someone working at such a company Bob.
|
| A restatement of your claim is that Bob decided to launch a
| model to live because of hype rather than because he could
| justify his promotion by pointing to the millions of
| dollars in increased revenue his switch produced. Bob of
| course did not make his decision based on hype. He made his
| decision because there were evaluation criteria in place
| for the launch. He was literally not allowed to launch
| things that didn't improve the system according to the
| evaluation criteria. As Bob didn't want to be fired for not
| doing anything at the company, he was forced to use a tool
| that worked to improve the evaluation according to the
 | criteria that were specified. So he used the tool that
| worked. Hype might provide motivation to experiment, but it
| doesn't justify a launch.
|
 | I say this as someone who's literally seen transitions from
| decision trees to deep learning models on < 100 feature
| models which had multi-million dollar monthly revenue
| impacts.
| krasin wrote:
| I've spent some time playing with their Jupyter notebooks. The
| most useful (to me, anyway) is their
| Example_3_classfication.ipynb ([1]).
|
| It works as advertised with the parameters selected by the
 | authors, but if we modify the network shape in the second half
 | of the tutorial (Classification formulation) from (2, 2) to (2,
 | 2, 2), it fails to generalize. The training loss gets down to
 | 1e-9, while the test loss stays around 3e-1. Moving to larger
 | network sizes does not help either.
|
| I would really like to see a bigger example with many more
| parameters and more data complexity and if it could be trained at
| all. MNIST would be a good start.
|
| Update: I increased the training dataset size 100x, and that
| helps with the overfitting, but now I can't get training loss
 | below 1e-2. Still iterating on it; GPU acceleration would
| really help - right now, my progress is limited by the speed of
| my CPU.
|
| 1.
| https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
| krasin wrote:
| Update2: got it to 100% training accuracy, 99% test accuracy
| with (2, 2, 2) shape.
|
| Changes:
|
| 1. Increased the training set from 1000 to 100k samples. This
| solved overfitting.
|
| 2. In the dataset generation, slightly reduced noise (0.1 ->
| 0.07) so that classes don't overlap. With an overlap,
| naturally, it's impossible to hit 100%.
|
| 3. Most important & specific to KANs: train for 30 steps with
| grid=5 (5 segments for each activation function), then 30 steps
| with grid=10 (and initializing from the previous model), and
| then 30 steps with grid=20. This is idiomatic to KANs and
| covered in the Example_1_function_fitting.ipynb:
| https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
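 |
 | In code, that schedule looks roughly like this (my sketch, from
 | memory of the tutorial API - KAN(), .train(), and
 | initialize_from_another_model(); check the repo for the exact
 | signatures):
 |
 |     from kan import KAN
 |
 |     model = KAN(width=[2, 2, 2], grid=5, k=3)
 |     model.train(dataset, opt="LBFGS", steps=30)
 |     for grid in (10, 20):
 |         # re-fit a finer-grid model to the coarse one, keep training
 |         model = KAN(width=[2, 2, 2], grid=grid, k=3) \
 |             .initialize_from_another_model(model, dataset['train_input'])
 |         model.train(dataset, opt="LBFGS", steps=30)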
|
| Overall, my impressions are:
|
| - it works!
|
| - the reference implementation is very slow. A GPU
| implementation is dearly needed.
|
 | - it feels like it's a bit too non-linear, and training is not
 | as stable as it is with an MLP + ReLU.
|
 | - Scaling is not guaranteed to work well. We really need to see
 | whether MNIST can be solved with this approach.
|
| I will definitely keep an eye on this development.
| thom wrote:
| This makes me wonder what you could achieve if instead of
| iteratively growing the grid, or worrying about pruning or
| regularization, you governed network topology with some sort
| of evolutionary algorithm.
| gxyt6gfy5t wrote:
| Believe there is a Google paper out there that tried that
| verdverm wrote:
| 1000s, there is a whole field and set of conferences. You
| can find more by searching "Genetic Programming" or
| "Symbolic Regression"
|
| KAN, with the library of variables and math operators,
| very much resembles this family of algos, problems, and
| limitations. The lowest hanging fruit they usually leave
| on the proverbial tree is that you can use fast
| regression techniques for the constants and coefficients.
| No need to leave it up to random perturbations or
| gradient descent. What you really need to figure out is
| the form or shape of the model, rather than leaving it up
 | to the human (as in KAN)
| verdverm wrote:
| You can do much better by growing an AST with memoization
 | and non-linear regression. So much so that the EVO folks gave a
 | best-paper award to a non-EVO, deterministic algorithm at their
| conference
|
 | https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_... (author)
| godelski wrote:
| > Increased the training set from 1000 to 100k samples. This
| solved overfitting.
|
 | Solved overfitting or created more? Even if your sets are
 | completely disjoint, with something like two moons the more
 | data you have, the lower the variance.
| londons_explore wrote:
| > I would really like to see a bigger example
|
| This. I don't think toy examples are useful for modern ML
| techniques. If you tested big ideas in ML (transformers,
 | LSTMs, ADAM) on a training dataset of 50 numbers trying to fit
| a y=sin(x) curve, I think you'd wrongly throw these ideas out.
| arianvanp wrote:
 | This really reminds me of Petri nets, but an analog version? But
| instead of places and discrete tokens we have activation
| functions and signals. You can only trigger a transition if an
| activation function (place) has the right signal (tokens).
| montebicyclelo wrote:
| The success we're seeing with neural networks is tightly coupled
| with the ability to scale - the algorithm itself works at scale
 | (more layers), but it also scales well with hardware (neural
| nets mostly consist of matrix multiplications, and GPUs have
| specialised matrix multiplication acceleration) - one of the most
| impactful neural network papers, AlexNet, was impactful because
| it showed that NNs could be put on the GPU, scaled and
| accelerated, to great effect.
|
| It's not clear from the paper how well this algorithm will scale,
| both in terms of the algorithm itself (does it still train well
 | with more layers?), and its ability to make use of hardware
 | acceleration (e.g. it's not clear to me that the structure, with
| its per-weight activation functions, can make use of fast matmul
| acceleration).
|
| It's an interesting idea, that seems to work well and have nice
| properties on a smaller scale; but whether it's a good
| architecture for imagenet, LLMs, etc. is not clear at this stage.
| dist-epoch wrote:
| > with its per-weight activation functions
|
| Sounds like something which could be approximated by a DCT
| (discrete cosine transform). JPEG compression does this, and
| there are hardware accelerations for it.
|
| > can make use of fast matmul acceleration
|
| Maybe not, but matmul acceleration was done in hardware because
| it's useful for some problems (graphics initially).
|
 | So if these per-weight activation functions really work,
 | people will be quick to figure out how to run them in hardware.
| WithinReason wrote:
| Looks very interesting, but my guess would be that this would run
| into the problem of exploding/vanishing gradients at larger
 | depths, just like tanh or sigmoid networks do.
| yobbo wrote:
| https://kindxiaoming.github.io/pykan/intro.html
|
| At the end of this example, they recover the symbolic formula
 | that generated their training set: exp(x2^2 + sin(3.14*x1)).
|
| It's like a computation graph with a library of "activation
| functions" that is optimised, and then pruned. You can recover
| good symbolic formulas from the pruned graph.
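 |
 | In the pykan intro that last step looks roughly like this (my
 | recollection of the API - prune / auto_symbolic /
 | symbolic_formula; treat the names as approximate):
 |
 |     model = model.prune()   # drop low-importance edges and nodes
 |     model.auto_symbolic(lib=['x', 'x^2', 'exp', 'sin'])
 |     formula = model.symbolic_formula()[0][0]
 |     # e.g. exp(x2^2 + sin(3.14*x1))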
|
| Maybe not meaningful for MNIST.
| beagle3 wrote:
| I wonder if Breiman's ACE (alternating conditional expectation)
| is useful as a building block here.
|
| It will easily recover this formula, because it is separable
| under the log transformation (which ACE recovers as well).
|
 | But ACE doesn't work well on non-separable problems - not sure
| how well KAN will.
| GistNoesis wrote:
| I quickly skimmed the paper, got inspired to simplify it, and
 | created a PyTorch layer:
|
| https://github.com/GistNoesis/FourierKAN/
|
| The core is really just a few lines.
|
 | In the paper they use spline interpolation to represent the 1d
 | functions that they sum. Their code seemed aimed at smaller sizes.
 | Instead I chose a different representation, aka Fourier
| coefficients that are used to interpolate the functions of
| individual coordinates.
|
 | It should give an idea of Kolmogorov-Arnold networks'
 | representational power. It should probably converge more easily
 | than their spline version, but the spline version has fewer
 | operations.
|
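 | The gist of the layer (a simplified sketch of the idea, not the
 | code in the repo), with einsum doing the sum-product reductions:
 |
 |     import torch
 |     import torch.nn as nn
 |
 |     class NaiveFourierKANLayer(nn.Module):
 |         # Each edge (input i -> output o) learns a 1d function
 |         # of x_i as a truncated Fourier series; a node's output
 |         # is the sum of its incoming edge functions.
 |         def __init__(self, in_dim, out_dim, num_freqs=8):
 |             super().__init__()
 |             self.register_buffer(
 |                 "freqs", torch.arange(1, num_freqs + 1).float())
 |             scale = 1.0 / (in_dim * num_freqs) ** 0.5
 |             self.cos_coef = nn.Parameter(
 |                 scale * torch.randn(out_dim, in_dim, num_freqs))
 |             self.sin_coef = nn.Parameter(
 |                 scale * torch.randn(out_dim, in_dim, num_freqs))
 |             self.bias = nn.Parameter(torch.zeros(out_dim))
 |
 |         def forward(self, x):  # x: (batch, in_dim)
 |             ang = x.unsqueeze(-1) * self.freqs  # (batch, in, F)
 |             y = torch.einsum("bif,oif->bo",
 |                              torch.cos(ang), self.cos_coef)
 |             y = y + torch.einsum("bif,oif->bo",
 |                                  torch.sin(ang), self.sin_coef)
 |             return y + self.bias
 |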
| Of course, if my code doesn't work, it doesn't mean theirs
| doesn't.
|
| Feel free to experiment and publish paper if you want.
| itsthecourier wrote:
| you really are a pragmatic programmer, Noesis
| GistNoesis wrote:
| Thanks. I like simple things.
|
| Sums and products can get you surprisingly far.
|
| Conceptually it's simpler to think about and optimize. But
 | you can also write it using einsum to do the sum-product
 | reductions (I've updated some comments to show how) to use
| less memory, but it's more intimidating.
|
 | You can probably use the KeOps library to fuse it further (einsum
| would get in the way).
|
| But the best is probably a custom kernel. Once you have
 | written it as sums and products, it's just iterating. Like the
| core is 5 lines, but you have to add roughly 500 lines of
| low-level wrapping code to do cuda parallelisation, c++ to
| python, various types, manual derivatives. And then you have
| to add various checks so that there are no buffer overflows.
| And then you can optimize for special hardware operations
| like tensor cores. Making sure along the way that no
 | numerical errors were introduced.
|
 | So there is a lot more effort involved, and it's usually
| only worth it if the layer is promising, but hopefully AI
| should be able to autocomplete these soon.
| agnosticmantis wrote:
| How GPU-friendly is this class of models?
| ubj wrote:
 | Very interesting! Kolmogorov neural networks can represent
| discontinuous functions [1], but I've wondered about how
| practically applicable they are. This repo seems to show that
| they have some use after all.
|
| [1]: https://arxiv.org/abs/2311.00049
| reynoldss wrote:
| Perhaps a hasty comment but linear combinations of B-splines are
| yet another (higher-degree) B-spline. Isn't this simply fitting
| high degree B-splines to functions?
| Lichtso wrote:
 | That would be true for a single node / single layer. But once
 | the output of one layer is fed into the input of the next, it is
 | not just a linear combination of splines anymore: composing
 | splines raises the degree and moves the breakpoints, so the
 | result is generally not a spline on any fixed basis.
| cs702 wrote:
| It's so _refreshing_ to come across new AI research different
| from the usual "we modified a transformer in this and that way
| and got slightly better results on this and that benchmark." All
| those new papers proposing incremental improvements are
| important, but... everyone is getting a bit tired of them. Also,
| anecdotal evidence and recent work suggest we're starting to run
| into fundamental limits inherent to transformers, so we may well
| need new alternatives.[a]
|
| The best thing about this new work is that it's not an either/or
| proposition. The proposed "learnable spline interpolations as
| activation functions" can be used _in conventional DNNs_ , to
| improve their expressivity. Now we just have to test the stuff to
| see if it really works better.
|
| Very nice. Thank you for sharing this work here!
|
| ---
|
| [a] https://news.ycombinator.com/item?id=40179232
| glebnovikov wrote:
| > Everyone is getting tired of those papers.
|
| This is science as is :)
|
 | 95% will produce mediocre-to-nice improvements to what we
 | already have, so that there are researchers who eventually grow
 | up and do something really exciting
| godelski wrote:
 | Nothing wrong with incremental improvements. Giant leaps
 | (almost always) only look like giant leaps because of a lack of
 | niche domain expertise. And I mean niche niche
| godelski wrote:
 | There's a ton actually. They just tend to go through extra
 | rounds of review (or never make it...) and never make it to HN
 | unless there are special circumstances (this one is MIT and CIT).
 | Unfortunately we've let PR become a very powerful force (it's
 | always been a thing, but it seems more influential now). We can
 | fight against this by upvoting things like this and, if you're
 | a reviewer, by not focusing on SOTA (it's clearly been gamed and
 | is clearly leading us in the wrong direction)
| cs702 wrote:
| > never make it to HN unless there's special circumstances
|
| Yes, I agree. The two most common patterns I've noticed in
| research that does show up on HN are: 1) It outright
| improves, or has the potential to improve, applications
| currently used in production by many HN readers. In other
| words, it's not just navel-gazing. 2) The authors and/or
| their organizations are well-known, as you suggest.
| godelski wrote:
 | What bothers me the most is that comments will float to the
 | top of a link to an arxiv paper or uni press release where
 | people will talk about how something is still in a
 | prototype stage and not in production yet / has a ways to go
 | to production. While this is fine, that's also the context of
 | works like these. But it is the same thing that I see in
 | reviews. I've had my own work killed because reviewers
 | treat the paper as a product rather than... you know...
 | research.
| abhgh wrote:
 | Yes, seconding this. If you want a broad view of ML, IMHO the
| best places to look at are conference proceedings. The
| typical review process is imperfect so that still doesn't
| show you all the interesting work out there (which you
| mention), but it is still a start wrt diversity of research.
| I follow LLMs closely but then going through proceedings
| means I come across exciting research like these [1],[2],[3].
|
| References:
|
| [1] A grad.-based way to optimize axis-parallel and oblique
| decision trees: the _Tree Alternating Optimization (TAO)_
 | algorithm https://proceedings.neurips.cc/paper_files/paper/2018/file/1....
 | An extension was the _softmax tree_
| https://aclanthology.org/2021.emnlp-main.838/.
|
| [2] XAI explains models, but can you recommend corrective
 | actions? _FACE: Feasible and Actionable Counterfactual
| Explanations_ https://arxiv.org/pdf/1909.09369, _Algorithmic
| Recourse: from Counterfactual Explanations to Interventions_
| https://arxiv.org/pdf/2002.06278
|
| [3] _OBOE: Collaborative Filtering for AutoML Model
| Selection_ https://arxiv.org/abs/1808.03233
| godelski wrote:
| Honestly, these days I just rely on arxiv. The conferences
| are so noisy that it is hard to really tell what's useful
 | and what's crap. Twitter is a bit better but still a
 | crapshoot. So as far as I can tell, there's no real good
| signal to use to differentiate. And what's the point of
| journals/conferences if not to provide some reasonable
| signal? If it is a slot machine, it is useless.
|
| And I feel like we're far too dismissive of instances we
| see where good papers get rejected. We're too dismissive of
| the collusion rings. What am I putting in all this time to
| write and all this time to review (and be an emergency
| reviewer) if we aren't going to take some basic steps
| forward? Fuck, I've saved a Welling paper from rejection
| from two reviewers who admitted to not knowing PDEs, and
| this was a workshop (should have been accepted into the
| main conference). I think review works for those already
 | successful, who can "perform more experiments when
 | requested" their way out of review hell, but we're ignoring
 | a lot of good work simply for lack of money and compute. It
| slows down our progress to reach AGI.
| abhgh wrote:
| Yes arxiv is a good first source too. I mentioned
| conferences as a way to get exposed to diversity, but not
| necessarily (sadly) merit. It has been my experience as
| an author and reviewer both that review quality has
| plummeted over the years for the most part. As a reviewer
| I had to struggle with the ills of "commission and
| omission" both, i.e., (a) convince other reviewers to see
| an idea (from a trendy area such as in-context learning)
| as not novel (because it has been done before, even in
| the area of LLMs), and (b) see an idea as novel, which
 | wouldn't have seemed so initially because some
| reviewers weren't aware of the background or impact of
| anything non-LLM, or god forbid, non-DL. As an author
| this has personally affected me because I had to work on
| my PhD remotely, so I didn't have access to a bunch of
| compute and I deliberately picked a non-DL area, and I
| had to pay the price for that in terms of multiple
| rejections, reviewer ghosting, journals not responding
| for years (yes, years).
| godelski wrote:
| I've stopped considering novelty at all. The only thing I
| now consider is if the precise technique has been done
| before. If not, well I've seen pretty small things change
| results dramatically. The pattern I've seen that scares
| me more is that when authors do find simple but effective
| changes, they end up convoluting the ideas because
 | simplicity and clarity are often mistaken for a lack of novelty.
| And honestly, revisiting ideas is useful as our
| environments change. So I don't want to discourage this
| type of work.
|
| Personally, this has affected me as a late PhD student.
| Late in the literal sense as I'm not getting my work
| pushed out (even some SOTA stuff) because of factors like
| these and my department insists something is wrong with
| me but will not read my papers, the reviews, or suggest
| what I need to do besides "publish more." (Literally told
| to me, "try publishing 5 papers a year, one should get
| in.") You'll laugh at this, I pushed a paper into a
| workshop and a major complaint was that I didn't give
| enough background on StyleGAN because "not everyone would
| be familiar with the architecture." (while I can
| understand the comment, 8 pages is not much room when you
| gotta show pictures on several datasets. My appendix was
| quite lengthy and included all requested information). We
| just used a GAN as a proxy because diffusion is much more
| expensive to train (most common complaints are "not
| enough datasets" and "how's it scale"). I think this is
| the reason so many universities use pretrained networks
| instead of training things from scratch, which just
| railroads research.
|
| (I also got a paper double desk rejected. First because
| it was "already published." Took a 2 months for them to
| realize it was arxiv only. Then they fixed that and
| rejected again because "didn't cite relevant works" with
| no mention of what those works were... I've obviously
| lost all faith in the review process)
| beagle3 wrote:
 | I read a book on NNs by Robert Hecht-Nielsen in 1989, during
| the NN hype of the time (I believe it was the 2nd hype cycle,
| the first beginning with Rosenblatt's original hardware
 | perceptron and dying with Minsky and Papert's "Perceptrons"
 | book a decade or two earlier).
|
| Everything described was laughably basic by modern standards,
| but the motivation given in that book was the Kolmogorov
 | representation theorem: a modest 3-layer network with the
| right activation function _can_ represent any continuous m-to-n
| function.
|
 | Most research back then focused on 3-layer networks, possibly
| for that reason. Sigmoid activation was king, and vanishing
| gradients the main issue. It took 2 decades until AlexNet
 | brought NN research back from the AI winter of the 1990s.
| keynesyoudigit wrote:
| Eli5: why aren't these more popular and broadly used?
| OisinMoran wrote:
| Because they have just been invented!
| wbeckler wrote:
| 60 years ago
| SpaceManNabs wrote:
| How does back propagation work now? Do these suffer from
| vanishing or exploding gradients?
| nextaccountic wrote:
 | On page 6 it explains how they did back propagation
 | https://arxiv.org/pdf/2404.19756 (and on page 2 it says that
 | previous efforts to leverage the Kolmogorov-Arnold representation
 | failed to use backpropagation), so maybe using backpropagation
| to train multilayer networks with this architecture is their
| main contribution?
|
| > Unsurprisingly, the possibility of using Kolmogorov-Arnold
 | representation theorem to build neural networks has been studied
| [8, 9, 10, 11, 12, 13]. However, most work has stuck with the
| original depth-2 width-(2n + 1) representation, and did not
| have the chance to leverage more modern techniques (e.g., back
| propagation) to train the networks. Our contribution lies in
| generalizing the original Kolmogorov-Arnold representation to
| arbitrary widths and depths, revitalizing and contextualizing
| it in today's deep learning world, as well as using extensive
| empirical experiments to highlight its potential role as a
| foundation model for AI + Science due to its accuracy and
| interpretability.
| goggy_googy wrote:
| No, the activations are a combination of the basis function and
| the spline function. It's a little unclear to me still how the
 | grid works, but it seems like this shouldn't suffer any more
 | than a generic ReLU MLP.
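 |
 | For reference, the preprint (if I'm reading it right)
 | parameterizes each edge as
 |
 |     \phi(x) = w\,(b(x) + \mathrm{spline}(x)), \qquad
 |     b(x) = \mathrm{silu}(x) = x / (1 + e^{-x})
 |
 | so the silu branch acts like a residual path that keeps
 | gradients flowing where the spline is badly initialized.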
| Lichtso wrote:
 | 1. Interestingly the foundations of this approach and MLP were
 | invented / discovered around the same time, about 66 years ago
 | (the representation itself is spelled out below, after point 3):
|
| 1957:
| https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...
|
| 1958: https://en.wikipedia.org/wiki/Multilayer_perceptron
|
| 2. Another advantage of this approach is that it has only one
| class of parameters (the coefficients of the local activation
| functions) as opposed to MLP which has three classes of
| parameters (weights, biases, and the globally uniform activation
| function).
|
 | 3. Everybody is talking about transformers. I want to see
 | diffusion models with this approach.
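 |
 | (Re 1: the 1957 theorem states that any continuous
 | f : [0,1]^n -> R can be written as
 |
 |     f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\Big(
 |         \sum_{p=1}^{n} \phi_{q,p}(x_p) \Big)
 |
 | i.e. a sum of univariate functions of sums of univariate
 | functions.)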
| trwm wrote:
 | Biases are just weights on an always-on input.
|
| There isn't much difference between weights of a linear sum and
| coefficients of a spline.
| Lichtso wrote:
| > Biases are just weights on an always on input.
|
 | Granted; however, this approach does not require that
| constant-one input either.
|
| > There isn't much difference between weights of a linear sum
| and coefficients of a function.
|
| Yes, the trained function coefficients of this approach are
 | the equivalent of the trained weights of an MLP. Still, this
| approach does not require the globally uniform activation
| function of MLP.
| trwm wrote:
| At this point this is a distinction without a difference.
|
 | The only question is whether splines are more efficient than
 | lines at describing general functions at the billion to
| trillion parameter count.
| kolinko wrote:
 | I may be wrong, but with modern LLMs biases aren't really used
 | any more.
| tripplyons wrote:
 | From what I remember, larger LLMs like PaLM drop biases for
 | training stability, but smaller ones tend to still use
 | them.
| xpe wrote:
| Yes, #2 is a difference. But what makes it an advantage?
|
| One might argue this via parsimony (Occam's razor). Is this
| your thinking? / Anything else?
| tripplyons wrote:
| To your 3rd point, most diffusion models already use a
| transformer-based architecture (U-Net with self attention and
| cross attention, Vision Transformer, Diffusion Transformer,
| etc.).
| brrrrrm wrote:
| doesn't KA representation require continuous univariate
| functions? do B-splines actually cover the space of all
| continuous functions? wouldn't... MLPs be better for the
| learnable activation functions?
| ComplexSystems wrote:
| Very interesting! Could existing MLP-style neural networks be put
| into this form?
| kevmo314 wrote:
| This seems very similar in concept to the finite element method.
| Nice to see patterns across fields like that.
___________________________________________________________________
(page generated 2024-05-01 23:00 UTC)