[HN Gopher] Kolmogorov-Arnold Networks
___________________________________________________________________
Kolmogorov-Arnold Networks
Author : sumo43
Score : 376 points
Date : 2024-05-01 03:30 UTC (19 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| nico wrote:
| Looks super interesting
|
| I wonder how many more new architectures are going to be found in
| the next few years
| Maro wrote:
| Interesting!
|
| Would this approach (with non-linear learning) still be able to
| utilize GPUs to speed up training?
| diwank wrote:
 | Seconded. I'm guessing you could create an implementation that
 | is able to do that and then write optimised triton/cuda kernels
 | to accelerate them, but I need to investigate further
| diwank wrote:
| It'd be really cool to see a transformer with the MLP layers
| swapped for KANs and then compare its scaling properties with
| vanilla transformers
| gautam5669 wrote:
 | This was the first thought that came to my mind too.
|
 | Given it's sparse, will this just be a replacement for MoE?
| mp187 wrote:
 | Why was this your first thought? Is the MLP layer a limiting
 | factor for transformers? I thought the bottleneck was in the
| renormalization part.
| brrrrrm wrote:
 | At small input size, yes, the MLP dominates compute. At large
 | input sizes attention matters more: per layer, the MLP costs
 | roughly 16*n*d^2 FLOPs versus ~4*n^2*d for the attention map,
 | so attention takes over once the sequence length n exceeds a
 | few multiples of the model width d.
| bart1ett wrote:
| After trying this out with the fourier implementation above,
| swapping MLP/Attention Linear layers for KANs (all, or even a
| few layers) produces diverging loss. KANs don't require
| normalization for good forward pass dynamics, but may be
| trickier to train in a deep net.
| cbsmith wrote:
| Feels like someone stuffed splines into decision trees.
| baq wrote:
| It's Monte Carlo all the way down
| xpe wrote:
| splines, yes.
|
| I'm not seeing decision trees, though. Am I missing something?
|
| > "KANs' nodes simply sum incoming signals without applying any
| non-linearities." (page 2 of the PDF)
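 |
 | i.e., roughly, in the paper's notation a KAN node just computes
 |
 |     x_{l+1,j} = \sum_i \phi_{l,j,i}(x_{l,i})
 |
 | with all of the nonlinearity living on the edges.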
| cbsmith wrote:
| I definitely think I'm projecting and maybe seeing things
| that aren't there. If you replaced splines with linear
| weights, it kind of looks like a decision tree to me.
| ALittleLight wrote:
| I can't assess this, but I do worry that overnight some
| algorithmic advance will enhance LLMs by orders of magnitude and
| the next big model to get trained is suddenly 10,000x better than
| GPT-4 and nobody's ready for it.
| DeathArrow wrote:
| >some algorithmic advance will enhance LLMs by orders of
| magnitude
|
 | I would worry if I owned Nvidia shares.
| krasin wrote:
 | Actually, that would be fantastic for NVIDIA shares:
|
| 1. A new architecture would make all/most of these upcoming
| Transformer accelerators obsolete => back to GPUs.
|
| 2. Higher performance LLMs on GPUs => we can speed up LLMs
 | with 1T+ parameters. So LLMs become more useful, and more
 | GPUs would be purchased.
| mindcrime wrote:
| _1. A new architecture would make all /most of these
| upcoming Transformer accelerators obsolete => back to
| GPUs._
|
| There's no guarantee that that is what would happen. The
| right (or wrong, depending on your POV) algorithmic
 | breakthrough might make GPUs obsolete for AI, by making
 | CPUs (or analog computing units, or DSPs, or "other") the
| preferred platform to run AI.
| 6mian wrote:
 | What is there to be worried about? Technical progress will happen,
| sometimes by sudden jumps. Some company will become a leader,
| competitors will catch up after a while.
| cess11 wrote:
| "Technical progress" has been destroying our habitat for
| centuries, causing lots of other species to go extinct.
| Pretty much the entire planet surface has been 'technically
| progressed', spreading plastics, climate change and whatnot
| over the entirety of it.
|
| Are you assuming that this particular "progress" would be
| relatively innocent?
| 6mian wrote:
| On the other hand, the same "technical progress" (if we're
| putting machine learning, deforestation, and mining in the
 | same bag) gave you medicine, which turns many otherwise
 | deadly diseases into inconveniences, and means that, in a
 | large portion of the world, you no longer work 12 hrs a day,
 | 7 days a week just to not die from hunger. A few hundred years ago,
| unless you were born into the lucky 0.01% of the ruling
| population, working from dawn to sunset was the norm for a
| lot more people than now.
|
| I'm not assuming that something 10k x better than GPT-4
| will be good or bad; I don't know. I was just curious what
| exactly to be worried about. I think in the current state,
| LLMs are already advanced enough for bad uses like article
| generation for SEO, spam, scams, etc., and I wonder if an
| order of magnitude better model would allow for something
| worse.
| cess11 wrote:
| Where did you learn that history?
|
| What do you mean by "better"?
| 6mian wrote:
| I had a European peasant in the 1600-1700s in mind when I
| wrote about the amount of work. During the season, they
| worked all day; off-season, they had "free time" that
| went into taking care of the household, inventory, etc.,
 | so it's still work. I can't quickly find a reliable source
 | in English to link, so I may be wrong here.
|
| "Better" was referring to what OP wrote in the top
| comment. I guess 10x faster, 10x longer context, and 100x
| less prone to hallucinations would make a good "10k x
| better" than GPT-4.
| cess11 wrote:
 | Sorry, I can't square that with what you wrote earlier: "12
 | hrs/7 days per week to not die from hunger".
|
 | Those peasants paid taxes, i.e. some of their work was
 | exploited by an army or a priest rather than by hunger, and
| as you mention, they did not work "12 hrs/7 days per
| week".
|
| Do you have a better example?
| vladms wrote:
| Many species went extinct during Earth's history. Evolution
| requires quite aggressive competition.
|
| The way the habitat got destroyed by humans is stupid
| because it might put us in danger. You can call me
| "speciesist" but I do care more for humans rather than for
| a particular other specie.
|
| So I think progress should be geared towards human species
| survival and if possible preventing other species
| extinction. Some of the current developments are a bit too
| much on the side of "I don't care about anyone's survival"
| (which is stupid and inefficient).
| lccerina wrote:
| If other species die, we follow shortly. This
 | anthropocentric view really ignores how much of our food
| chain exists because of other animals surviving despite
| human activities.
| cess11 wrote:
 | Evolution is the result of catastrophes and atrocities.
| You use the word as if it has positive connotations,
| which I find weird.
|
| How do you come to the conclusion "stupid" rather than
| evil? Aren't we very aware of the consequences of how we
| are currently organising human societies, and have been
| for a long time?
| snewman wrote:
| I think this is unlikely. There has never (in the visible
| fossil record) been a mutation that suddenly made tigers an
| order of magnitude stronger and faster, or humans an order of
| magnitude more intelligent. It's been a long time (if ever?)
| since chip transistor density made a multiple-order-of-
| magnitude leap. Any complex optimized system has many limiting
| factors and it's unlikely that all of them would leap forward
 | at once. The current generation of LLMs is not as complex or
 | optimized as tigers or humans, but it's far enough along
| that changing one thing is unlikely to result in a giant leap.
|
| If and when something radically better comes along, say an
| alternative to back-propagation that is more like the way our
| brains learn, it will need a lot of scaling and refinement to
| catch up with the then-current LLM.
| the8472 wrote:
| Comparing it to evolution and SNPs isn't really a good
| analogy. Novel network architectures are much larger changes,
| maybe comparable to new organelles or metabolic pathways? And
| those have caused catastrophic changes. Evolution also
| operates on much longer time-scales due to its blind parallel
| search.
|
| https://en.wikipedia.org/wiki/Oxygen_catastrophe
| mxwsn wrote:
 | From the preprint - 100 input dimensions is considered "high",
 | and most problems studied have 5 or fewer input dimensions.
 | This is typical of the physics-inspired settings I've seen in
 | ML. The next step would be demonstrating them on MNIST, which,
 | at 784 dimensions, is tiny by modern standards.
| wongarsu wrote:
| In actual business processes there are lots of ML problems with
| fewer than 100 input dimensions. But for most of them decision
| trees are still competitive with neural networks or even
| outperform them.
| galangalalgol wrote:
| The aid to explainability seems at least somewhat compelling.
| Understanding what a random forest did isn't always easy. And
| if what you want isn't the model but the closed form of what
| the model does, this could be quite useful. When those
| hundred input dimensions interact nonlinearly in a million
 | ways, that's nice. Or more likely I'd use it when I don't want
| to find a pencil to derive the closed form of what I'm trying
| to do.
| trwm wrote:
| Business processes don't need deep learning in the first
 | place. It is just there because of hype.
| JoshCole wrote:
| Competent companies tend to put a lot of effort into
| building data analysis tools. There will often be A/B or
| QRT frameworks in place allowing deployment of two models,
| for example, the new deep learning model, and the old rule
| based system. By using the results from these experiments
| in conjunction with typical offline and online evaluation
| metrics one can begin to make statements about the impact
| of model performance on revenue. Naturally model
| performance is tracked through many offline and online
| metrics. So people can and do say things like "if this
| model is x% more accurate then that translates to $y
| million dollars in monthly revenue" with great confidence.
|
 | Let's call someone working at such a company Bob.
|
| A restatement of your claim is that Bob decided to launch a
| model to live because of hype rather than because he could
| justify his promotion by pointing to the millions of
| dollars in increased revenue his switch produced. Bob of
| course did not make his decision based on hype. He made his
| decision because there were evaluation criteria in place
| for the launch. He was literally not allowed to launch
| things that didn't improve the system according to the
| evaluation criteria. As Bob didn't want to be fired for not
| doing anything at the company, he was forced to use a tool
| that worked to improve the evaluation according to the
 | criteria that were specified. So he used the tool that
| worked. Hype might provide motivation to experiment, but it
| doesn't justify a launch.
|
 | I say this as someone who's literally seen transitions from
| decision trees to deep learning models on < 100 feature
| models which had multi-million dollar monthly revenue
| impacts.
| krasin wrote:
| I've spent some time playing with their Jupyter notebooks. The
| most useful (to me, anyway) is their
| Example_3_classfication.ipynb ([1]).
|
| It works as advertised with the parameters selected by the
 | authors, but if we modify the network shape in the second half
 | of the tutorial (Classification formulation) from (2, 2) to (2,
 | 2, 2), it fails to generalize. The training loss gets down to
 | 1e-9, while the test loss stays around 3e-1. Moving to larger
 | network sizes does not help either.
|
| I would really like to see a bigger example with many more
| parameters and more data complexity and if it could be trained at
| all. MNIST would be a good start.
|
| Update: I increased the training dataset size 100x, and that
| helps with the overfitting, but now I can't get training loss
 | below 1e-2. Still iterating on it; GPU acceleration would
| really help - right now, my progress is limited by the speed of
| my CPU.
|
| 1.
| https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
| krasin wrote:
| Update2: got it to 100% training accuracy, 99% test accuracy
| with (2, 2, 2) shape.
|
| Changes:
|
| 1. Increased the training set from 1000 to 100k samples. This
| solved overfitting.
|
| 2. In the dataset generation, slightly reduced noise (0.1 ->
| 0.07) so that classes don't overlap. With an overlap,
| naturally, it's impossible to hit 100%.
|
| 3. Most important & specific to KANs: train for 30 steps with
| grid=5 (5 segments for each activation function), then 30 steps
| with grid=10 (and initializing from the previous model), and
| then 30 steps with grid=20. This is idiomatic to KANs and
| covered in the Example_1_function_fitting.ipynb:
| https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
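 |
 | In code, that schedule looks roughly like this (my sketch, from
 | memory of the tutorial API - KAN(), .train(), and
 | initialize_from_another_model(); check the repo for the exact
 | signatures):
 |
 |     from kan import KAN
 |
 |     model = KAN(width=[2, 2, 2], grid=5, k=3)
 |     model.train(dataset, opt="LBFGS", steps=30)
 |     for grid in (10, 20):
 |         # re-fit a finer-grid model to the coarse one, keep training
 |         model = KAN(width=[2, 2, 2], grid=grid, k=3) \
 |             .initialize_from_another_model(model, dataset['train_input'])
 |         model.train(dataset, opt="LBFGS", steps=30)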
|
| Overall, my impressions are:
|
| - it works!
|
| - the reference implementation is very slow. A GPU
| implementation is dearly needed.
|
 | - it feels like it's a bit too non-linear, and training is not
 | as stable as it is with an MLP + ReLU.
|
 | - Scaling is not guaranteed to work well. We really need to see
 | whether MNIST can be solved with this approach.
|
| I will definitely keep an eye on this development.
| thom wrote:
| This makes me wonder what you could achieve if instead of
| iteratively growing the grid, or worrying about pruning or
| regularization, you governed network topology with some sort
| of evolutionary algorithm.
| gxyt6gfy5t wrote:
| Believe there is a Google paper out there that tried that
| verdverm wrote:
| 1000s, there is a whole field and set of conferences. You
| can find more by searching "Genetic Programming" or
| "Symbolic Regression"
|
| KAN, with the library of variables and math operators,
| very much resembles this family of algos, problems, and
| limitations. The lowest hanging fruit they usually leave
| on the proverbial tree is that you can use fast
| regression techniques for the constants and coefficients.
| No need to leave it up to random perturbations or
| gradient descent. What you really need to figure out is
| the form or shape of the model, rather than leaving it up
 | to the human (as in KAN)
| verdverm wrote:
| You can do much better by growing an AST with memoization
 | and non-linear regression. So much so that the EVO folks gave a
 | best-paper award to a non-EVO, deterministic algorithm at their
| conference
|
 | https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_... (author)
| godelski wrote:
| > Increased the training set from 1000 to 100k samples. This
| solved overfitting.
|
 | Solved overfitting or created more? Even if your sets are
 | completely disjoint, with something like two moons the more
 | data you have, the lower the variance.
| londons_explore wrote:
| > I would really like to see a bigger example
|
| This. I don't think toy examples are useful for modern ML
| techniques. If you tested big ideas in ML (transformers,
 | LSTMs, ADAM) on a training dataset of 50 numbers trying to fit
| a y=sin(x) curve, I think you'd wrongly throw these ideas out.
| arianvanp wrote:
 | This really reminds me of Petri nets, but an analog version? But
| instead of places and discrete tokens we have activation
| functions and signals. You can only trigger a transition if an
| activation function (place) has the right signal (tokens).
| montebicyclelo wrote:
| The success we're seeing with neural networks is tightly coupled
| with the ability to scale - the algorithm itself works at scale
 | (more layers), but it also scales well with hardware (neural
| nets mostly consist of matrix multiplications, and GPUs have
| specialised matrix multiplication acceleration) - one of the most
| impactful neural network papers, AlexNet, was impactful because
| it showed that NNs could be put on the GPU, scaled and
| accelerated, to great effect.
|
| It's not clear from the paper how well this algorithm will scale,
| both in terms of the algorithm itself (does it still train well
 | with more layers?), and its ability to make use of hardware
 | acceleration (e.g. it's not clear to me that the structure, with
| its per-weight activation functions, can make use of fast matmul
| acceleration).
|
| It's an interesting idea, that seems to work well and have nice
| properties on a smaller scale; but whether it's a good
| architecture for imagenet, LLMs, etc. is not clear at this stage.
| dist-epoch wrote:
| > with its per-weight activation functions
|
| Sounds like something which could be approximated by a DCT
| (discrete cosine transform). JPEG compression does this, and
| there are hardware accelerations for it.
|
| > can make use of fast matmul acceleration
|
| Maybe not, but matmul acceleration was done in hardware because
| it's useful for some problems (graphics initially).
|
 | So if these per-weight activation functions really work,
 | people will be quick to figure out how to run them in hardware.
| WithinReason wrote:
| Looks very interesting, but my guess would be that this would run
| into the problem of exploding/vanishing gradients at larger
 | depths, just like tanh or sigmoid networks do.
| yobbo wrote:
| https://kindxiaoming.github.io/pykan/intro.html
|
| At the end of this example, they recover the symbolic formula
 | that generated their training set: exp(x2^2 + sin(3.14*x1)).
|
| It's like a computation graph with a library of "activation
| functions" that is optimised, and then pruned. You can recover
| good symbolic formulas from the pruned graph.
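 |
 | In the pykan intro that last step looks roughly like this (my
 | recollection of the API - prune / auto_symbolic /
 | symbolic_formula; treat the names as approximate):
 |
 |     model = model.prune()   # drop low-importance edges and nodes
 |     model.auto_symbolic(lib=['x', 'x^2', 'exp', 'sin'])
 |     formula = model.symbolic_formula()[0][0]
 |     # e.g. exp(x2^2 + sin(3.14*x1))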
|
| Maybe not meaningful for MNIST.
| beagle3 wrote:
| I wonder if Breiman's ACE (alternating conditional expectation)
| is useful as a building block here.
|
| It will easily recover this formula, because it is separable
| under the log transformation (which ACE recovers as well).
|
 | But ACE doesn't work well on non-separable problems - not sure
| how well KAN will.
| GistNoesis wrote:
| I quickly skimmed the paper, got inspired to simplify it, and
 | created a PyTorch layer:
|
| https://github.com/GistNoesis/FourierKAN/
|
| The core is really just a few lines.
|
 | In the paper they use spline interpolation to represent the 1d
 | functions that they sum. Their code seemed aimed at smaller sizes.
 | Instead I chose a different representation, aka Fourier
| coefficients that are used to interpolate the functions of
| individual coordinates.
|
 | It should give an idea of Kolmogorov-Arnold networks'
 | representational power. It should probably converge more easily
 | than their spline version, but the spline version has fewer
 | operations.
|
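 | The gist of the layer (a simplified sketch of the idea, not the
 | code in the repo), with einsum doing the sum-product reductions:
 |
 |     import torch
 |     import torch.nn as nn
 |
 |     class NaiveFourierKANLayer(nn.Module):
 |         # Each edge (input i -> output o) learns a 1d function
 |         # of x_i as a truncated Fourier series; a node's output
 |         # is the sum of its incoming edge functions.
 |         def __init__(self, in_dim, out_dim, num_freqs=8):
 |             super().__init__()
 |             self.register_buffer(
 |                 "freqs", torch.arange(1, num_freqs + 1).float())
 |             scale = 1.0 / (in_dim * num_freqs) ** 0.5
 |             self.cos_coef = nn.Parameter(
 |                 scale * torch.randn(out_dim, in_dim, num_freqs))
 |             self.sin_coef = nn.Parameter(
 |                 scale * torch.randn(out_dim, in_dim, num_freqs))
 |             self.bias = nn.Parameter(torch.zeros(out_dim))
 |
 |         def forward(self, x):  # x: (batch, in_dim)
 |             ang = x.unsqueeze(-1) * self.freqs  # (batch, in, F)
 |             y = torch.einsum("bif,oif->bo",
 |                              torch.cos(ang), self.cos_coef)
 |             y = y + torch.einsum("bif,oif->bo",
 |                                  torch.sin(ang), self.sin_coef)
 |             return y + self.bias
 |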
| Of course, if my code doesn't work, it doesn't mean theirs
| doesn't.
|
| Feel free to experiment and publish paper if you want.
| itsthecourier wrote:
| you really are a pragmatic programmer, Noesis
| GistNoesis wrote:
| Thanks. I like simple things.
|
| Sums and products can get you surprisingly far.
|
| Conceptually it's simpler to think about and optimize. But
 | you can also write it using einsum to do the sum-product
 | reductions (I've updated some comments to show how) to use
| less memory, but it's more intimidating.
|
 | You can probably use the KeOps library to fuse it further (einsum
| would get in the way).
|
| But the best is probably a custom kernel. Once you have
 | written it as sums and products, it's just iterating. Like the
| core is 5 lines, but you have to add roughly 500 lines of
| low-level wrapping code to do cuda parallelisation, c++ to
| python, various types, manual derivatives. And then you have
| to add various checks so that there are no buffer overflows.
| And then you can optimize for special hardware operations
| like tensor cores. Making sure along the way that no
 | numerical errors were introduced.
|
 | So there is a lot more effort involved, and it's usually
| only worth it if the layer is promising, but hopefully AI
| should be able to autocomplete these soon.
| agnosticmantis wrote:
| How GPU-friendly is this class of models?
| ubj wrote:
 | Very interesting! Kolmogorov neural networks can represent
| discontinuous functions [1], but I've wondered about how
| practically applicable they are. This repo seems to show that
| they have some use after all.
|
| [1]: https://arxiv.org/abs/2311.00049
| reynoldss wrote:
| Perhaps a hasty comment but linear combinations of B-splines are
| yet another (higher-degree) B-spline. Isn't this simply fitting
| high degree B-splines to functions?
| Lichtso wrote:
 | That would be true for a single node / single layer. But once
 | the output of one layer is fed into the input of the next, it is
 | not just a linear combination of splines anymore: composing
 | splines raises the degree and moves the breakpoints, so the
 | result is generally not a spline on any fixed basis.
| cs702 wrote:
| It's so _refreshing_ to come across new AI research different
| from the usual "we modified a transformer in this and that way
| and got slightly better results on this and that benchmark." All
| those new papers proposing incremental improvements are
| important, but... everyone is getting a bit tired of them. Also,
| anecdotal evidence and recent work suggest we're starting to run
| into fundamental limits inherent to transformers, so we may well
| need new alternatives.[a]
|
| The best thing about this new work is that it's not an either/or
| proposition. The proposed "learnable spline interpolations as
| activation functions" can be used _in conventional DNNs_ , to
| improve their expressivity. Now we just have to test the stuff to
| see if it really works better.
|
| Very nice. Thank you for sharing this work here!
|
| ---
|
| [a] https://news.ycombinator.com/item?id=40179232
| glebnovikov wrote:
| > Everyone is getting tired of those papers.
|
| This is science as is :)
|
 | 95% will produce mediocre-to-nice improvements to what we
 | already have, so that there are researchers who eventually grow
 | up and do something really exciting
| godelski wrote:
 | Nothing wrong with incremental improvements. Giant leaps
 | (almost always) only look like giant leaps because of a lack of
 | niche domain expertise. And I mean niche niche
| godelski wrote:
 | There's a ton actually. They just tend to go through extra
 | rounds of review (or never make it...) and never make it to HN
 | unless there are special circumstances (this one is MIT and CIT).
 | Unfortunately we've let PR become a very powerful force (it's
 | always been a thing, but it seems more influential now). We can
 | fight against this by upvoting things like this and, if you're
 | a reviewer, by not focusing on SOTA (it's clearly been gamed and
 | is clearly leading us in the wrong direction)
| cs702 wrote:
| > never make it to HN unless there's special circumstances
|
| Yes, I agree. The two most common patterns I've noticed in
| research that does show up on HN are: 1) It outright
| improves, or has the potential to improve, applications
| currently used in production by many HN readers. In other
| words, it's not just navel-gazing. 2) The authors and/or
| their organizations are well-known, as you suggest.
| godelski wrote:
 | What bothers me the most is that comments will float to the
 | top of a link to an arxiv paper or uni press release where
 | people will talk about how something is still in a
 | prototype stage and not in production yet / has a ways to go
 | to production. While this is fine, that's also the context of
 | works like these. But it is the same thing that I see in
 | reviews. I've had my own work killed because reviewers
 | treat the paper as a product rather than... you know...
 | research.
| abhgh wrote:
 | Yes, seconding this. If you want a broad view of ML, IMHO the
| best places to look at are conference proceedings. The
| typical review process is imperfect so that still doesn't
| show you all the interesting work out there (which you
| mention), but it is still a start wrt diversity of research.
| I follow LLMs closely but then going through proceedings
| means I come across exciting research like these [1],[2],[3].
|
| References:
|
| [1] A grad.-based way to optimize axis-parallel and oblique
| decision trees: the _Tree Alternating Optimization (TAO)_
 | algorithm https://proceedings.neurips.cc/paper_files/paper/2018/file/1....
 | An extension was the _softmax tree_
| https://aclanthology.org/2021.emnlp-main.838/.
|
| [2] XAI explains models, but can you recommend corrective
 | actions? _FACE: Feasible and Actionable Counterfactual
| Explanations_ https://arxiv.org/pdf/1909.09369, _Algorithmic
| Recourse: from Counterfactual Explanations to Interventions_
| https://arxiv.org/pdf/2002.06278
|
| [3] _OBOE: Collaborative Filtering for AutoML Model
| Selection_ https://arxiv.org/abs/1808.03233
| godelski wrote:
| Honestly, these days I just rely on arxiv. The conferences
| are so noisy that it is hard to really tell what's useful
 | and what's crap. Twitter is a bit better but still a
 | crapshoot. So as far as I can tell, there's no real good
| signal to use to differentiate. And what's the point of
| journals/conferences if not to provide some reasonable
| signal? If it is a slot machine, it is useless.
|
| And I feel like we're far too dismissive of instances we
| see where good papers get rejected. We're too dismissive of
| the collusion rings. What am I putting in all this time to
| write and all this time to review (and be an emergency
| reviewer) if we aren't going to take some basic steps
| forward? Fuck, I've saved a Welling paper from rejection
| from two reviewers who admitted to not knowing PDEs, and
| this was a workshop (should have been accepted into the
| main conference). I think review works for those already
 | successful, who can "perform more experiments when
 | requested" their way out of review hell, but we're ignoring
 | a lot of good work simply for lack of money and compute. It
| slows down our progress to reach AGI.
| abhgh wrote:
| Yes arxiv is a good first source too. I mentioned
| conferences as a way to get exposed to diversity, but not
| necessarily (sadly) merit. It has been my experience as
| an author and reviewer both that review quality has
| plummeted over the years for the most part. As a reviewer
| I had to struggle with the ills of "commission and
| omission" both, i.e., (a) convince other reviewers to see
| an idea (from a trendy area such as in-context learning)
| as not novel (because it has been done before, even in
| the area of LLMs), and (b) see an idea as novel, which
 | wouldn't have seemed so initially because some
| reviewers weren't aware of the background or impact of
| anything non-LLM, or god forbid, non-DL. As an author
| this has personally affected me because I had to work on
| my PhD remotely, so I didn't have access to a bunch of
| compute and I deliberately picked a non-DL area, and I
| had to pay the price for that in terms of multiple
| rejections, reviewer ghosting, journals not responding
| for years (yes, years).
| godelski wrote:
| I've stopped considering novelty at all. The only thing I
| now consider is if the precise technique has been done
| before. If not, well I've seen pretty small things change
| results dramatically. The pattern I've seen that scares
| me more is that when authors do find simple but effective
| changes, they end up convoluting the ideas because
 | simplicity and clarity are often mistaken for a lack of novelty.
| And honestly, revisiting ideas is useful as our
| environments change. So I don't want to discourage this
| type of work.
|
| Personally, this has affected me as a late PhD student.
| Late in the literal sense as I'm not getting my work
| pushed out (even some SOTA stuff) because of factors like
| these and my department insists something is wrong with
| me but will not read my papers, the reviews, or suggest
| what I need to do besides "publish more." (Literally told
| to me, "try publishing 5 papers a year, one should get
| in.") You'll laugh at this, I pushed a paper into a
| workshop and a major complaint was that I didn't give
| enough background on StyleGAN because "not everyone would
| be familiar with the architecture." (while I can
| understand the comment, 8 pages is not much room when you
| gotta show pictures on several datasets. My appendix was
| quite lengthy and included all requested information). We
| just used a GAN as a proxy because diffusion is much more
| expensive to train (most common complaints are "not
| enough datasets" and "how's it scale"). I think this is
| the reason so many universities use pretrained networks
| instead of training things from scratch, which just
| railroads research.
|
| (I also got a paper double desk rejected. First because
| it was "already published." Took a 2 months for them to
| realize it was arxiv only. Then they fixed that and
| rejected again because "didn't cite relevant works" with
| no mention of what those works were... I've obviously
| lost all faith in the review process)
| beagle3 wrote:
 | I read a book on NNs by Robert Hecht-Nielsen in 1989, during
| the NN hype of the time (I believe it was the 2nd hype cycle,
| the first beginning with Rosenblatt's original hardware
 | perceptron and dying with Minsky and Papert's "Perceptrons"
 | book a decade or two earlier).
|
| Everything described was laughably basic by modern standards,
| but the motivation given in that book was the Kolmogorov
 | representation theorem: a modest 3-layer network with the
| right activation function _can_ represent any continuous m-to-n
| function.
|
 | Most research back then focused on 3-layer networks, possibly
| for that reason. Sigmoid activation was king, and vanishing
| gradients the main issue. It took 2 decades until AlexNet
 | brought NN research back from the AI winter of the 1990s.
| keynesyoudigit wrote:
| Eli5: why aren't these more popular and broadly used?
| OisinMoran wrote:
| Because they have just been invented!
| wbeckler wrote:
| 60 years ago
| SpaceManNabs wrote:
| How does back propagation work now? Do these suffer from
| vanishing or exploding gradients?
| nextaccountic wrote:
 | On page 6 it explains how they did back propagation
 | https://arxiv.org/pdf/2404.19756 (and on page 2 it says that
 | previous efforts to leverage the Kolmogorov-Arnold representation
 | failed to use backpropagation), so maybe using backpropagation
| to train multilayer networks with this architecture is their
| main contribution?
|
| > Unsurprisingly, the possibility of using Kolmogorov-Arnold
 | representation theorem to build neural networks has been studied
| [8, 9, 10, 11, 12, 13]. However, most work has stuck with the
| original depth-2 width-(2n + 1) representation, and did not
| have the chance to leverage more modern techniques (e.g., back
| propagation) to train the networks. Our contribution lies in
| generalizing the original Kolmogorov-Arnold representation to
| arbitrary widths and depths, revitalizing and contextualizing
| it in today's deep learning world, as well as using extensive
| empirical experiments to highlight its potential role as a
| foundation model for AI + Science due to its accuracy and
| interpretability.
| goggy_googy wrote:
| No, the activations are a combination of the basis function and
| the spline function. It's a little unclear to me still how the
 | grid works, but it seems like this shouldn't suffer any more
 | than a generic ReLU MLP.
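 |
 | For reference, the preprint (if I'm reading it right)
 | parameterizes each edge as
 |
 |     \phi(x) = w\,(b(x) + \mathrm{spline}(x)), \qquad
 |     b(x) = \mathrm{silu}(x) = x / (1 + e^{-x})
 |
 | so the silu branch acts like a residual path that keeps
 | gradients flowing where the spline is badly initialized.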
| Lichtso wrote:
 | 1. Interestingly the foundations of this approach and MLP were
 | invented / discovered around the same time, about 66 years ago
 | (the representation itself is spelled out below, after point 3):
|
| 1957:
| https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...
|
| 1958: https://en.wikipedia.org/wiki/Multilayer_perceptron
|
| 2. Another advantage of this approach is that it has only one
| class of parameters (the coefficients of the local activation
| functions) as opposed to MLP which has three classes of
| parameters (weights, biases, and the globally uniform activation
| function).
|
 | 3. Everybody is talking about transformers. I want to see
 | diffusion models with this approach.
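 |
 | (Re 1: the 1957 theorem states that any continuous
 | f : [0,1]^n -> R can be written as
 |
 |     f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\Big(
 |         \sum_{p=1}^{n} \phi_{q,p}(x_p) \Big)
 |
 | i.e. a sum of univariate functions of sums of univariate
 | functions.)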
| trwm wrote:
 | Biases are just weights on an always-on input.
|
| There isn't much difference between weights of a linear sum and
| coefficients of a spline.
| Lichtso wrote:
| > Biases are just weights on an always on input.
|
 | Granted; however, this approach does not require that
| constant-one input either.
|
| > There isn't much difference between weights of a linear sum
| and coefficients of a function.
|
| Yes, the trained function coefficients of this approach are
 | the equivalent of the trained weights of an MLP. Still, this
| approach does not require the globally uniform activation
| function of MLP.
| trwm wrote:
| At this point this is a distinction without a difference.
|
 | The only question is whether splines are more efficient than
 | lines at describing general functions at the billion to
| trillion parameter count.
| kolinko wrote:
 | I may be wrong, but with modern LLMs biases aren't really used
 | any more.
| tripplyons wrote:
 | From what I remember, larger LLMs like PaLM drop biases for
 | training stability, but smaller ones tend to still use
 | them.
| xpe wrote:
| Yes, #2 is a difference. But what makes it an advantage?
|
| One might argue this via parsimony (Occam's razor). Is this
| your thinking? / Anything else?
| tripplyons wrote:
| To your 3rd point, most diffusion models already use a
| transformer-based architecture (U-Net with self attention and
| cross attention, Vision Transformer, Diffusion Transformer,
| etc.).
| brrrrrm wrote:
| doesn't KA representation require continuous univariate
| functions? do B-splines actually cover the space of all
| continuous functions? wouldn't... MLPs be better for the
| learnable activation functions?
| ComplexSystems wrote:
| Very interesting! Could existing MLP-style neural networks be put
| into this form?
| kevmo314 wrote:
| This seems very similar in concept to the finite element method.
| Nice to see patterns across fields like that.
___________________________________________________________________
(page generated 2024-05-01 23:00 UTC)