[HN Gopher] The Fourier transform is a neural network
       ___________________________________________________________________
        
       The Fourier transform is a neural network
        
       Author : montebicyclelo
       Score  : 255 points
       Date   : 2021-04-29 12:15 UTC (10 hours ago)
        
 (HTM) web link (sidsite.com)
 (TXT) w3m dump (sidsite.com)
        
       | cl3misch wrote:
       | > We can consider the the discrete Fourier transform (DFT) to be
       | an artificial neural network: it is a single layer network, with
       | no bias, no activation function, and particular values for the
       | weights.
       | 
       | So this post essentially shows that the Fourier transform is a
       | linear combination.
        
         | mhh__ wrote:
         | Hammer meet "nail"
        
       | threatripper wrote:
       | The Fourier transformation is a special case of a linear
       | transformation. Linear neural networks can represent any linear
       | transformation. Both can also be expressed as matrix
       | multiplication.
       | 
       | It's not really all that surprising if you know some of the math
       | behind neural networks, matrix algebra, linear transformations or
       | the Fourier transformation.
        
         | amelius wrote:
         | Yes but this article shows that the matrix coefficients of the
         | Fourier transform can be learned.
        
           | threatripper wrote:
           | This is also not really surprising for a single layer neural
           | network.
        
             | visarga wrote:
             | It might not be surprising when they implemented it by
             | supervised learning, but when they also solved it by
             | reconstructing the input it was like a rediscovery.
        
           | madhadron wrote:
           | That's because "learning" in neural networks is a fancy way
           | of saying "curve fitting."
        
         | amcoastal wrote:
         | I dont think its surprising but its definitely a cool little
         | exercise. Somewhat aside of the content of the post, I could
         | see some use cases where specifically implementing DFT layers
         | in your architectures could lead to improved performance over
         | using just the raw inputs or activations. Noisy datasets, or
         | synthetic data based transfer learning come to mind as
         | potential uses for the DFT as a step in your ML pipeline.
        
           | threatripper wrote:
           | There is the FFT in TensorFlow already:
           | https://www.tensorflow.org/api_docs/python/tf/signal/fft
        
             | amelius wrote:
             | And FFT ( _Fast_ Fourier Transform) is more efficient than
             | matrix multiplication (EDIT: by a matrix with Fourier
             | elements).
        
               | bigbillheck wrote:
               | You know, I'll go out on the limb and say (unhelpfully)
               | that a FFT is more efficient than any general dense
               | matvec.
        
               | touisteur wrote:
               | Bah, when you've maxxed out your cuda cores, who's to
               | prevent you from reaching for those sweet tensor core
               | 100+ tflops mat-mul now?
        
               | mochomocha wrote:
               | This is analogous to saying "an umbrella is better than a
               | toaster". They are two different things.
        
               | Joker_vD wrote:
               | Why do people insist on comparing only things that are
               | _same_? The same things are same, no point whatsoever
               | comparing them: they 're same! No, you should only
               | compare things that are different, that way you learn
               | more about their differences.
               | 
               | And what's better, an umbrella or a toaster, depends on
               | the one's goals: if one'd like to fry some bread, toaster
               | is more useful; if one'd like to cover from rain,
               | umbrella is far superior, although toaster could used be
               | too; if one'd like to drive a nail into the wall, toaster
               | again is better; etc.
        
               | mochomocha wrote:
               | My statement was as approximate as the one I was replying
               | to. If the original statement had been "you can use FFT
               | to implement some operations involving matrix
               | multiplication faster" I wouldn't have taken issue with
               | it. But saying "FFT is faster than matrix multiplication"
               | has little meaning.
        
               | vladTheInhaler wrote:
               | The naive implementation of the DFT in terms of a matrix-
               | vector multiplication has time complexity O(n^2) [1]. The
               | point is that the FFT approach is much faster than that
               | implementation.
               | 
               | [1] https://en.wikipedia.org/wiki/DFT_matrix
        
               | amelius wrote:
               | Not if you perform these operations in order to obtain a
               | Fourier transform.
        
               | ok123456 wrote:
               | You can implement a DFT with matrix multiplication using
               | the vandermonde matrix with roots of the unitary
               | polynomial as the basis.
        
               | cdavid wrote:
               | Fourier transform, more exactly DFT, is a special case of
               | matrix multiplication, when seen as a linear operator.
               | The fact that you can do operations faster than the usual
               | N*2 matrix / vector operation is due to the specific form
               | of the underlying matrix (Van Der Mont matrix).
               | 
               | So they are not that different: the specific structure of
               | the Fourier basis allows for more efficient matrix *
               | vector operation.
        
               | layoutIfNeeded wrote:
               | Vandermonde. He was not Dutch, he was French.
               | https://en.wikipedia.org/wiki/Alexandre-
               | Theophile_Vandermond...
        
               | [deleted]
        
               | hamiltonkibbe wrote:
               | They are different things, but you can use both to
               | protect you from the rain, and an umbrella does that more
               | efficiently than a toaster. The drawback is that the
               | umbrella doesn't generalize to the broader field of bagel
               | toasting
        
         | amatic wrote:
         | On the other hand, if you don't yet know a lot of NN math, like
         | me, it is super surprising. Definitely a cool way to connect
         | FFT and NNs.
        
         | thearn4 wrote:
         | I've come to internalize the Fourier transform as a type of
         | sparse matrix algorithm. I.e. something that implements a
         | particular matrix-vector product without requiring explicit
         | construction of said matrix.
        
           | enchiridion wrote:
           | Do you mean the FFT? I've been trying to wrap my head around
           | the Fourier transform lately, and I can see the matrix
           | connection to an FFT, but not the Fourier transform in
           | general.
        
             | bonoboTP wrote:
             | The Fourier transform performs a change of basis into the
             | eigenbasis of convolution.
             | 
             | It's linear and so in the discrete case can be expressed as
             | matrix multiplication (every discrete linear operation is
             | expressible as matmul).
        
           | srean wrote:
           | And pushing the parantheses to factor out common
           | subexpressions. Essentially exploiting the distributive law
        
       | peter_retief wrote:
       | https://math.stackexchange.com/questions/2569814/fourier-tra...
        
       | layoutIfNeeded wrote:
       | Ummmm... sure. A layer of a neural network is a linear operator
       | followed by an arbitrary (nonlinear in practice) mapping. Which
       | means that neural networks can trivially implement linear
       | operators (like the Fourier transform) by making the second
       | mapping to be identity. QED
        
       | Vhano wrote:
       | this is why linear dynamics should be a basic course before
       | ML/DL. Next week OP discovers Kalman filter... and surprise,
       | surprise!
        
       | ajross wrote:
       | Not to be too glib, but as a non-ML person with decent numerics
       | ability:
       | 
       | ... did this post just spend a ton of boring math explaining that
       | a neural network has the ability to detect whether an image is
       | bumpy or smooth?
       | 
       | I mean, Fourier analysis has a very straightforward "intuitive"
       | interpretation. That's the point of its value in applied domains,
       | after all. Straightforward intuitive things are "neural" by their
       | nature, no?
        
       | cblconfederate wrote:
       | If the fourier transform is written as two real parts, the
       | universal approximation theorem says it can be used to
       | approximate any function, the same way NNs do
        
       | cantagi wrote:
       | Great post!
       | 
       | I once had to port a siamese neural network from Tensorflow to
       | Apple's CoreML to make it run on an iPhone. Siamese neural
       | networks have a cross convolution step which wasn't something
       | CoreML could handle. But CoreML could multiply two layer outputs
       | element-wise.
       | 
       | I implemented it using a fourier transform (not a fast fourier
       | transform), with separate re and im parts, since a fourier
       | transform is just a matrix multiplication, and convolution is
       | element-wise multiplication in the fourier domain. Unsurprisingly
       | it was very slow.
        
       | Gravityloss wrote:
       | Of course. What about the Haar transform? Or digital filters?
       | It's very much all about the same thing.
       | 
       | You compare the original signal with some reference signals, look
       | at how much it correlates. Then you save those correlations only.
       | With the correlations and the reference signals, you can roughly
       | reproduce the original signal. (Extremely paraphrased.)
        
       | 6gvONxR4sf7o wrote:
       | > [with regards to y = A x:] This should look familiar, because
       | it is a neural network layer with no activation function and no
       | bias.
       | 
       | This line makes me think this might be satire. Are there really
       | people who see y = A x and think neural networks?
       | 
       | Otherwise, a better title would be "The Fourier transform is a
       | linear operation. (neural networks can do linear operations too)"
       | ... which makes it pretty boring.
        
       | zeroxfe wrote:
       | Taking this one more step forward, generating spectrograms with
       | neural networks: https://0xfe.blogspot.com/2020/03/generating-
       | spectrograms-wi...
        
       | photonemitter wrote:
       | I'll leave these here:
       | https://en.wikipedia.org/wiki/Multiresolution_analysis
       | https://en.wikipedia.org/wiki/Discrete_wavelet_transform
       | 
       | And as a lot of people have mentioned in here, DFT is pretty much
       | implicated in neural networks already because of the mathematics
       | (especially in convolutional/correlational neural networks, which
       | often make use of the convolution theorem (which is "just"
       | fourier coefficient multiplication) to do the convolution)
       | 
       | Extending this post it seems more interesting to look more
       | generally at the correspondence with wavelet-transforms.
        
       | phonebucket wrote:
       | That was a fun read. While it might be unsurprising to some, it's
       | a testament to breadth of applications of modern machine learning
       | frameworks.
       | 
       | I enjoyed this sentence in particular.
       | 
       | > This should look familiar, because it is a neural network layer
       | with no activation function and no bias.
       | 
       | I thought it should look familiar because it's matrix
       | multiplication. That it looks like a neural network layer first
       | and foremost to some is maybe a sign of the times.
        
         | 6gvONxR4sf7o wrote:
         | > it's a testament to breadth of applications of modern machine
         | learning frameworks.
         | 
         | More like a testament to the the breadth of applications of
         | linear algebra. It is absolutely remarkable what we're able to
         | compute and analyze in the form of y = A x (hilbert spaces are
         | a wild invention).
         | 
         | But it really isn't a testament to modern ML frameworks in any
         | way. The fourier transform has been easy to compute/fit in this
         | exact way (fourier = linear problem + solving linear problems
         | by optimization) by modern-at-the-time frameworks for over two
         | centuries.
        
       | slver wrote:
       | Technically physical reality is also a neural network.
        
       | ktln2 wrote:
       | Just another example of
       | 
       | https://en.wikipedia.org/wiki/Universal_approximation_theore...
        
       | [deleted]
        
       | throwawayffffas wrote:
       | I have always thought of it the other way around, neural networks
       | are a kind of transform similar to the Fourier transform. But you
       | know, instead of a sum of sinusoid components you get a sum of
       | weighted <insert your activation function>.
        
       | RocketSyntax wrote:
       | I think the author's point is that every data transformation can
       | be represented as a neural network. Neural nets are just many
       | degree polynomials.
        
       | TaupeRanger wrote:
       | "Multiplication of matrices is a neural network"
        
       | omarhaneef wrote:
       | Here is the special sauce: "We can consider the the discrete
       | Fourier transform (DFT) to be an artificial neural network: it is
       | a single layer network, with no bias, no activation function, and
       | particular values for the weights. The number of output nodes is
       | equal to the number of frequencies we evaluate."
       | 
       | A single layer neural network is a sum of products, the basic
       | Fourier equation is a sum of products.
       | 
       | In this view there are lots of single layer neural networks out
       | there. For me, it's the training algorithm (backprop) that sets
       | apart the neural net.
        
         | nerdponx wrote:
         | In my mind, the _layers_ make the network  "non-trivial". A
         | Fourier transform is only a neural network in a trivial,
         | definitional sense.
        
           | hyperman1 wrote:
           | The activation function is what keeps layers separated. When
           | it isn't there, a pair of layers devolves to a matrix
           | multiplication, which can be replaced by its resulting
           | matrix, a single layer.
        
           | omarhaneef wrote:
           | That is a thoughtful distinction, too. I think you're right.
        
         | 29athrowaway wrote:
         | Biological neural networks do not use backpropagation, or at
         | least, there's no evidence of that yet.
         | 
         | Backpropagation is a placeholder for stuff we don't understand.
        
           | etienne618 wrote:
           | It seems that backpropagation might be a good abstraction for
           | biological learing rules after all: "local predictive coding
           | converges asymptotically (and in practice rapidly) to exact
           | backprop gradients on arbitrary computation graphs using only
           | local learning rules" from https://arxiv.org/abs/2006.04182
        
           | ta988 wrote:
           | Not exactly the same kind of backpropagation but signals fly
           | the other way in neurons as well
           | https://en.wikipedia.org/wiki/Neural_backpropagation
        
           | [deleted]
        
           | cl3misch wrote:
           | Isn't "backpropagation" just a synonym for the partial
           | derivative of the scalar cost function with respect to the
           | weights? And in the mathematical formulation of biological
           | neural networks these derivatives can't be computed
           | analytically. Your comment sounds like "backpropagation" is
           | some kind of natural phenomenon.
        
             | 29athrowaway wrote:
             | > Your comment sounds like "backpropagation" is some kind
             | of natural phenomenon.
             | 
             | No it's not, read again.
        
       | leecarraher wrote:
       | Alt title, can a neural network learn the fast fourier transform?
        
         | SuchAnonMuchWow wrote:
         | the discrete Fourier transform, not the fast fourrier
         | transform: the learned algorithm runs in O(n2) and not O(n*log
         | n)
        
           | leecarraher wrote:
           | oh then this is just approximating the values of the linear
           | transform. I figured at least they'd use mlp to learn each
           | layer of the FFT butterfly network, and the corresponding
           | weights. What would be of interest would be if you could
           | extend this for n length fft using an RNN network
        
       | mikewarot wrote:
       | I have zero doubt that given sufficient resources ANY function
       | can be learned by a neural network.
       | 
       | Why not just initialize part of the layers in a network to do the
       | FFT, or DFT when you're setting up a system to learn, that it
       | doesn't have to waste megawatt hours of power? Consider it a form
       | of discretized / explicit transfer learning.
       | 
       | Why not use the fact that you could just set the coefficients in
       | a neural network to do DFTs or FFTs to utilize neural network
       | chips as ersatz DSP chips, without learning?
       | 
       | Just like when GPUs turned out to be useful for neural networks,
       | the TPUs will turn out to be useful for DSP and other things.
       | These neural network chips showing up can do a lot of cool things
       | unrelated to AI, if we use them properly.
        
       | ellisv wrote:
       | Neural networks can learn many things but I can't imagine a
       | practical use-case of this.
        
         | SuchAnonMuchWow wrote:
         | Here is an alternative idea around this:
         | 
         | Winograd transform can be seen as a depthwise convolution with
         | fixed weights.
         | 
         | And similarly to DTF, winograd transform can be used to speed
         | up convolutions: a 3x3 convolution can be implemented with a
         | winograd transform + a 1x1 convolution + inverse winograd
         | transform. This basically reduce the number of operations
         | performed by 2.
         | 
         | Now, if a neural network is able to learn the weights of a
         | depthwise convolution to match the coefficients of a winograd
         | transform, that means you should be able to do this
         | transformation when describing the architecture of your
         | network. That way, you don't rely on your deep learning
         | framework to do the transformation for you.
         | 
         | And more importantly you might end up with something more
         | generic: perhaps the weights in the depthwise convolutions will
         | not converge to the weights of a winograd transform, but to
         | weight better suited to what you are trying to learn.
         | 
         | This is the same ideas which lead to the development of CNN:
         | replace the fixed weights of convolutions used in traditional
         | computer vision algorithm with weights that are learned by the
         | network.
        
         | killerstorm wrote:
         | I guess it is mostly an illustration that time-domain =>
         | frequency-domain transformation is trivial for ANNs. And e.g.
         | it won't make any sense to pre-process your signal with FFT
         | before training ANN since it will do it itself if needed.
        
           | raphaelj wrote:
           | Well, you can still probably reduce the training
           | computational cost by feeding the additional FFT features to
           | your NN.
        
             | ellisv wrote:
             | The Fourier transform layer might not be the immediate
             | hidden layer.
        
           | ellisv wrote:
           | That's a fair point. My assumption was that the neural
           | network isn't going to be as efficient at the FFT but the
           | article doesn't include that type of assessment.
        
       | meiji163 wrote:
       | > What is the explanation for the flexibility in the weights? How
       | much flexibility is there?
       | 
       | most of the terms are on the order of 1/N so on these would be
       | negligible
        
       | jhidding wrote:
       | Next exercise: implement the _fast_ Fourier transform as a multi-
       | layered network.
        
         | mensetmanusman wrote:
         | This is a great idea, I bet one could find a bit-faster fftw
         | that would be close enough for many applications, but it would
         | apply a statistical approach to each weight.
        
         | wiggle_woof wrote:
         | Nice idea. Turns out, doing this works well and has some nice
         | properties https://dawn.cs.stanford.edu/2019/06/13/butterfly/
        
         | threatripper wrote:
         | This could be highly interesting. Especially when starting with
         | the FFT butterfly structure for connections and the
         | corresponding weights as network initialization and then
         | adapting that to real data via training.
        
       | Aardwolf wrote:
       | So is the identity transform...
        
         | srean wrote:
         | I wish your comment is voted up. Where do we go next ? -- look
         | addition is a neural network, look weighted average is a neural
         | network, look linear regression, logistic regression, Poisson
         | regression are neural networks ...
        
           | Certhas wrote:
           | I have definitely seen regressions referred to as linear
           | machine learning. Not even kidding.
        
             | bonoboTP wrote:
             | So? Machine learning is not some buzzword fad. Regression
             | is part of ML for sure.
             | 
             | Some people seem to think ML means fancy deep learning
             | GPT-3 transformer running on a TPU farm. Actually ML is a
             | discipline that has existed for several decades and has
             | various theoretical results too, including VC theory etc.
             | 
             | It is also not the same as statistics. They are adjacent
             | but different fields.
        
               | Certhas wrote:
               | Yeah, but no. It definitely is a buzzword that randomly
               | gets attached to every method of data analysis these
               | days. If you disagree, then please give me a definition
               | that is both distinct from statistics but also means that
               | Regression is an ML technique.
               | 
               | BTW, that doesn't mean there isn't also very substantial
               | work done under the umbrella term.
               | 
               | Regression analysis was first introduced by Legendre in
               | the early 19th century. I never would have dreamed to
               | pretend it is Machine Learning 10 years ago when the term
               | ML was still used somewhat more sparingly, and carried a
               | bit more meaning...
               | 
               | But "I will use ML to analyse the data" gets more funding
               | than "I will run a regression on the data".
        
               | nikk1 wrote:
               | I agree ML is a buzzword and using a sentence like that
               | would sound silly but I can elaborate. First, though,
               | regression does not need a separate ML definition to make
               | it distinct from statistics... there's tons of overlap in
               | the field of mathematics.
               | 
               | Machine learning is the study of computer algorithms that
               | improve automatically through experience and by the use
               | of data.
               | 
               | This definition would include something as simple as
               | linear regression.
               | 
               | Linear regression attempts to model the relationship
               | between two variables by fitting a linear equation to
               | observed data.
               | 
               | The purpose of a neural network is exactly the same as a
               | line of best fit. That is, you are just approximating an
               | unknown function based on input/output data. The only
               | difference is that a NN can better approximate a
               | nonlinear function.
        
               | nonameiguess wrote:
               | The generally accepted definition in the community is Tom
               | Mitchell's:
               | 
               | > A computer program is said to learn from experience E
               | with respect to some task T and performance measure P, if
               | its performance at task T, as measured by P, improves
               | with experience E.
               | 
               | Statistical estimation methods are one way to achieve
               | this, but not the only way, especially when an exact
               | function can be learned, i.e. a typical layer 2 switch is
               | a learning device. You don't program in the mapping from
               | connected device MAC addresses to switch port, the switch
               | itself learns this from receiving a message from a MAC on
               | a port and then records the mapping. That is a very
               | simple form of non-statistical machine learning.
               | 
               | I'm not really sure how you can start here but then say
               | regression is not a form of machine learning.
               | "Regression" is a pretty broad class of techniques that
               | just means using labeled examples to estimate a function
               | with a continuous response, typically contrasted with
               | "classification," where the response is discrete. The
               | method you use to do the function approximation may or
               | may not be statistical. A genetic algorithm is not, for
               | instance. I'm not sure least squares, which is what
               | Legendre invented, should really be considered
               | statistical, either. The original purpose was for
               | approximating the solution to an overdetermined system of
               | equations, before statistics was even formalized. It
               | certainly became a preferred method in mathematical
               | statistics, but mathematical statistics wasn't even
               | formalized until later. It didn't start being called
               | "regression" until Galton published his paper showing
               | children of exceptionally tall or short people tended to
               | regress to the mean, which was 80 years later. But you're
               | performing regression analysis whether you use the normal
               | equations, LU decomposition, QR decomposition, a genetic
               | algorithm, gradient descent, stochastic gradient descent,
               | stochastic gradient descent with dropout. Doesn't matter.
               | As long as you're doing function approximation with a
               | continuous response, it's still regression. Whether or
               | not it can also be considered "machine" learning just
               | depends on whether you're doing it by hand or via
               | machine.
               | 
               | Though sure, typically people tend to imagine the more
               | exotic and newer techniques that scale to very large data
               | sets and reduce overfitting and deal with noise
               | automatically and involve hyperparameters, i.e. not least
               | squares.
        
               | bonoboTP wrote:
               | I won't type up a chapter about this now but ML is about
               | learning from data. It is concerned with
               | training/validation/test data, the generalization error
               | etc.
               | 
               | Statisticians care more about "modeling", actually
               | estimating parameters that stand in for something real,
               | to check assumptions and hypotheses. The cultures are
               | very different. What makes total sense to one may baffle
               | the other as lunacy (there's more convergence now though,
               | by realizing the complementary nature of the two
               | approaches). The terminology is different too: in stats,
               | they call logistic regression regression, while in ML it
               | would have been called classification (but the name is
               | now stuck). ML also tends to be more Bayesian than
               | frequentist, as opposed to classical stats.
               | 
               | I could write more but having taken ML courses before the
               | big hype times, I can assure you ML doesn't just mean
               | hyped up startup fluff vaporware.
               | 
               | Open up Kevin Murphy's ML book and compare the TOC with a
               | Statistics textbook. There is overlap but it's certainly
               | different in its goals and mentality.
               | 
               | It seems like some bitter bickering from the side of
               | stats people that they didn't manage to hype up their
               | field as much. Yeah they did have many of the same tools
               | at their disposal but didn't use them in this way.
        
               | 6gvONxR4sf7o wrote:
               | You're setting up a strawman of statistics. You're
               | describing a small part of a large field. Stats has been
               | more than what you're describing for longer than ML has
               | been a phrase.
               | 
               | The only real useful definition these days is that stats
               | is learning from data that happens in the stats
               | department and ML is learning from data that happens in
               | the CS department.
        
               | bonoboTP wrote:
               | I'm fine with that. I would also hope for more collab
               | cross-departments, but the incentives in academia are
               | unfortunately not exactly set up the right way. People
               | become protective of "their turf".
               | 
               | I wrote a bit more here:
               | https://news.ycombinator.com/item?id=26984234
               | 
               | I think it's good if people are aware of the origins of
               | the methods they use. Regression wasn't invented under
               | the umbrella of ML, but is analyzed from a particular
               | angle in ML more than in stats.
        
               | 6gvONxR4sf7o wrote:
               | > Regression is part of ML for sure.
               | 
               | > ML is a discipline that has existed for several decades
               | 
               | If you're going to say regression is part of ML, you
               | can't say ML has only existed for decades. If you define
               | it with that expansive scope, it's existed for centuries.
        
               | bonoboTP wrote:
               | ML uses regression in a particular way with particular
               | aims and has developed prior approaches like OLS and
               | Ridge regression, lasso, kriging etc. into a particular
               | theoretical framework and analyses them from the angle of
               | learning, i.e. heavy focus on generalization from
               | training data to test data.
               | 
               | The truth is, these are tools that many communities have
               | used in parallel and these communities are in constant
               | flux regarding what promising directions they find and
               | exploit and when they make a big splash in one place,
               | others take notice and incorporate those ideas etc. There
               | are no rigidly predefined disciplines and fields over
               | long timescales.
               | 
               | When electrical engineers and signal processing
               | communities worked on similar things they called it
               | "pattern recognition".
               | 
               | Machine learning as a field started out with more
               | ambition than mere regression and classification (and
               | indeed covers a lot more ground), but it turns out that
               | supervised learning has had the most practical success.
               | But it's a research program and community that goes
               | beyond that.
               | 
               | Similarly, there are parallels and similar equations
               | between control theory and reinforcement learning. And
               | indeed some controllers can be expressed as reinforcement
               | learning agents. But the aims of the two communities are
               | not the same.
               | 
               | Maybe people would be happier if "statistical learning"
               | (which is also used) was used more instead of "machine
               | learning"? But indeed as another comment points out, ML
               | as a paradigm does not necessarily require learning of a
               | statistical kind.
               | 
               | Labels grow and wane in popularity, it doesn't mean it's
               | the same thing repackaged, rather that the aims and the
               | focus changes depending on what we find most productive
               | and fruitful.
               | 
               | For example many of these things were also called "soft
               | computing" a few years ago, but that term is rarely seen
               | nowadays.
        
             | dunefox wrote:
             | You mean as in linear and logistic regression, two popular
             | classical ML models whose parameters can be optimised using
             | gradient descent?
        
             | aulin wrote:
             | why linear though
        
             | nikk1 wrote:
             | Regression is an approach to machine learning
             | 
             | https://en.wikipedia.org/wiki/Machine_learning#Regression_a
             | n...
        
           | visarga wrote:
           | > Where do we go next ?
           | 
           | Neural nets are all you need[*].
           | 
           | [*] if what you need is non-robust black boxes
        
       | enriquto wrote:
       | of course (it's a single fully-connected layer, aka a linear
       | map), but more interestingly, the FFT algorithm can be
       | implemented as a very sparse network of logarithmic depth.
        
       | ondrysak wrote:
       | Given the fact that fourier basis, can be seen as polynomials on
       | unit circle, this may be intresting for some of you.
       | 
       | https://arxiv.org/abs/2008.07669
        
       | AndrewGYork wrote:
       | One of the questions raised at the end of the article:
       | 
       | "Do we ever benefit from explicitly putting Fourier layers into
       | our models?"
       | 
       | Has a simple, partial answer here:
       | 
       | https://docs.scipy.org/doc/scipy/reference/generated/scipy.l...
       | 
       | "...multiplying a vector by the matrix returned by dft is
       | mathematically equivalent to (but much less efficient than) the
       | calculation performed by scipy.fft.fft."
       | 
       | fft-as-a-matrix-multiply is _much_ slower to compute than a
       | standard fft, especially on large input.
        
         | modeless wrote:
         | You can do the linear parts of neural nets in the frequency
         | domain, but AFAIK you can't do the nonlinearity, so you have to
         | inverse transform back to the spatial domain for that. The
         | nonlinearity is an absolutely essential part of pretty much
         | every neural net layer, so there is no big win to be had
         | unfortunately. For convolutional nets in particular there are
         | other ways of going faster than a naive matrix multiply, e.g.
         | winograd convolution.
        
           | aesthesia wrote:
           | But if your network contains a layer whose linear part
           | performs an (approximate) DFT, you will get an efficiency
           | gain by replacing it with an exact FFT.
           | 
           | You wouldn't want to use an FFT for most CNNs anyway because
           | the kernels have very small support. Convolution with them is
           | O(n) in the spatial domain as long as you recognize the
           | sparsity.
        
           | evancox100 wrote:
           | Why can't you apply the nonlinear activation functions in the
           | frequency domain? What's stopping this or making it not work?
        
       | pontus wrote:
       | Is there anything deep here? There's a parametrized
       | representation of a function and it's being interpreted as a
       | neural network, why is this surprising? It's like saying that
       | Newton's second law is a neural network: F=ma can be written
       | log(F) = log(m) + log(a). Aha, this is a neural network! The
       | inputs are m and a, the first layer is sparsely connected with
       | log activation functions, the second layer is fully connected
       | with an exponential activation function:
       | 
       | F = exp(c1 * o1 + c2 * o2) o1 = log(c3 * a + c4 * m) o2 = log(c5
       | * a + c6 * m)
       | 
       | If you feed it enough data you'll find c1=c2=c3=c6=1 and c4=c5=0.
       | 
       | But saying that Newton's second law is a Neural Network, while
       | correct, seems a bit deceptive in that it's not a deep idea at
       | all.
        
         | korijn wrote:
         | Perhaps it is the compsci glasses talking, but this is just one
         | very specific instance where someone figured out a way to map
         | the DFT problem to a neural net. I agree that it is unfortunate
         | that it is being presented as some kind of big discovery, and
         | that the fundamental lesson is either still undiscovered to the
         | author or just unclearly communicated, but there is still good
         | intention in there (sharing something you've learned and
         | eliciting feedback).
        
         | haecceity wrote:
         | Can be represented as a neural network is not the same as is a
         | neural network??
        
         | candiodari wrote:
         | Well the point is that it does not make sense to DFT data
         | before feeding it into a multilayered neural network (or
         | calculating the force generated by mass and acceleration as you
         | point out). Those formulas make no sense: the network can just
         | learn them on the fly.
         | 
         | In fact you'll find that this does not just work for the
         | Fourier transform, but for any FIR filter (and some other
         | classes), and therefore neural networks can deal with signals
         | and construct low-pass filters, high-pass filters, bandgap
         | filters, ... as required for the task at hand without the
         | network (or it's designer) having any idea at all what is
         | happening.
         | 
         | I mean there's some basic assumptions these reasonings make
         | (main one is that you need to feed many discretized values from
         | a time window).
         | 
         | Of course, a problem remains: local optima. Just because a
         | neural network _can_ construct a filterbank or do a DFT, doesn
         | 't mean that it _will_ actually do it when the situation
         | warrants it. If there 's a local optimum without filters ...
         | well, you may get unlucky. It there's many local optima without
         | filters ... sucks to be you.
        
           | ska wrote:
           | > Those formulas make no sense: the network can just learn
           | them on the fly.
           | 
           | > Just because [...] can construct a filterbank or do a DFT,
           | doesn't mean that it will actually do it when the situation
           | warrants it.
           | 
           | These statements seem in conflict, no?
        
           | nonameiguess wrote:
           | I don't think I can agree with that. You do a Fourier
           | transform when the data you're working with doesn't admit
           | easily or at all certain operations in the time domain but it
           | does in frequency domain. If you already know that to be the
           | case, preprocessing with a FFT is a better idea than hoping a
           | neural network with enough layers uses a few of those to much
           | less efficiently perform a DFT. Always take advantage of pre-
           | existing knowledge of structure in your data. With the FFT
           | especially, depending on how your data is being ingested, you
           | might be able to use specialized DSPs that implement the FFT
           | directly in hardware. These are cheap and easy to find since
           | they're used in frequency-division multiplexing.
        
             | dkarras wrote:
             | Practically, yes, absolutely! Theoretically, we want the
             | "black box" learning system to do such feature extraction
             | itself. At least, that is the goal.
        
               | whimsicalism wrote:
               | > does not make sense to DFT data before feeding it into
               | a multilayered neural network
               | 
               | is false.
               | 
               | "It would be nice if we didn't have to DFT data before
               | feeding it into a multilayered neural network" is true,
               | but a completely different statement.
        
           | monocasa wrote:
           | Which is weird, because biological neural nets seem to have
           | some evolutionary pressure to do hardware fourier transforms
           | before neurons even get involved.
           | 
           | You can see this most clearly in the auditory system where
           | the incoming signal is transformed into the frequency domain
           | by the cochlea before the signal is received by the
           | epithelial cells.
           | 
           | Neurons absolutely love working in the frequency domain, but
           | they seem to prefer to not be the ones to do the binning in
           | the first place.
        
             | philipswood wrote:
             | This is probably because the neurons involved can't switch
             | fast enough. Our neural hardware is simply too slow to work
             | in the time domain directly.
        
               | monocasa wrote:
               | They do just fine with acoustic frequency ranges.
        
               | philipswood wrote:
               | Firing rates for neurons are in the order of 10hz - under
               | the minimum acoustic frequencies of about 20hz.
               | 
               | Peak neuron firing rates basically only touch the lower
               | bounds of frequencies that need to be processed.
               | 
               | Without the "hardware acceleration" of the cochlea that
               | preprocesses the time domain signals to frequency domain
               | first, basically our whole audio sensorium is out of
               | processing range.
               | 
               | Normal nervous signal transmission speeds are in the
               | order of the speed of sound, switching rates are in Hz
               | range. Voltages are in the 10s of millivolt range.
        
               | nomel wrote:
               | Are you suggesting that anything over 10Hz is perceived
               | with the representation of a rate, rather than direct
               | stimulus?
        
               | monocasa wrote:
               | Neuron spiking isn't rate encoded, it's temporal encoded.
               | Ie. the distance from the next expected regular interval
               | is the signal, not the spiking interval itself.
               | 
               | Additionally there's large variations in neuron spiking
               | rates.
        
               | saltcured wrote:
               | I think I understand your sentiment, but consider our
               | tactile sense of vibrations. We are able to distinguish
               | different frequencies with the nerves in the skin,
               | without cochlea doing the frequency domain transform. I
               | think there are very different spectral limits and
               | aliasing artifacts, but it's not like we can only sense a
               | 5 Hz vibration either...
        
         | windsignaling wrote:
         | It's a stretch to call it a neural network. It was already
         | well-known that the Fourier transform can be seen as a matrix
         | multiply which minimizes some least squares problem.
         | 
         | This can be found somewhere in S.M. Kay's Fundamentals of
         | Statistical Signal Processing: Estimation Theory (Vol 1),
         | Detection Theory (Vol 2).
        
         | montebicyclelo wrote:
         | Op here, IMO the "deepest" bit is [1] - the network learns the
         | DFT in order to reconstruct the signal, and is not explicitly
         | trained on the DFT values from the FFT.
         | 
         | Admittedly, I should have mentioned that any linear transform
         | can be considered to be a single layer neural network (if you
         | want to see the world through a neural network lens), and will
         | add this to the post at some point.
         | 
         | In fact, I have a series of posts planned, which will reveal
         | that well known algorithms/models are actually neural
         | networks...
         | 
         | [1] https://sidsite.com/posts/fourier-nets/#learning-the-
         | fourier...
        
           | mochomocha wrote:
           | You might be interested in this line of work:
           | https://eng.uber.com/neural-networks-jpeg/
           | 
           | Training straight from DCT coefficients to avoid spending
           | time learning a similar representation in the bottom layers
           | of the net. I've personally toyed with something similar on
           | GANs to gauge the computational benefits of not doing
           | convolutions in the bottom layers of a net but learning
           | directly in a FFT-like compressed space instead.
        
             | touisteur wrote:
             | Isn't there something like Fouriernets and spdnets already,
             | and stuff on Riemannian manifolds and my head is spinning?
             | Followed Daniel Brook's work for some time, e.g.
             | https://hal.archives-ouvertes.fr/hal-02290838
        
           | hctaw wrote:
           | > I should have mentioned that any linear transform can be
           | considered to be a single layer neural network
           | 
           | I think this should be turned around. A single layer neural
           | network can be considered a linear mapping (but not
           | necessarily an orthogonal transform or change of basis, like
           | the DFT).
           | 
           | A more clear example of this are adaptive filters, which are
           | trained in real time using gradient descent.
           | 
           | This is an important distinction because thinking of "X is a
           | Neural Net" doesn't provide meaningful insight, whereas
           | "Neural Nets with X properties are a case of linear dynamic
           | systems, here's an example of how we can equate one linear
           | transformation to a neural net" leads you to deeper
           | conclusions on the analysis and synthesis of ANNs in the
           | context of dynamics - which encompasses a much larger surface
           | area than the DFT.
        
             | hcrisp wrote:
             | I've sometimes wondered how many signal processing kernels
             | or filter architectures you could learn via SGD given the
             | right data and search strategy. Maybe using something like
             | Neural Architecture Search, so not just the weights /
             | coefficients but the architecture too.
             | 
             | Could it discover an interpolation filter like upfirdn? An
             | IIR Butterworth filter (it would have to be recurrent)? A
             | frequency transfer function derived from two signals?
             | 
             | I imagined that the solutions it found wouldn't be "clean"
             | but have other non-essential operators bloating them. Could
             | it find new architectures that haven't been created from
             | first principles?
        
               | hctaw wrote:
               | I think a big question here is, "what is the goal?" A key
               | distinction is whether the NN is intended to handle the
               | signal directly (the NN _is_ the filter) or if it 's just
               | a technique for finding filter coefficients (the NN
               | _designs_ the filter for the signal).
               | 
               | For some tasks, like system identification (used in echo
               | cancellation), the NN _is the filter_ - aformentioned
               | adaptive filters are used for this case right now. It can
               | also be used for black box modeling, which has numerous
               | application in real time or otherwise.
               | 
               | For others like Butterworth (and other classic designs)
               | there's not really a good reason to use a NN. Butterworth
               | (and Chebychev I/II, elliptical, optimum-L, and others)
               | are filter design formulae with a closed form (for a
               | given filter order) that yield roots of transfer
               | functions that have desirable properties - I'm not sure
               | how a learning approach can beat a formulae that are
               | derived from the properties of the filter they design.
               | 
               | There are iterative design algorithms that do not have a
               | closed form, like Parks-McClellan. It is however _quite_
               | good at what it does - it would be interesting to compare
               | these methods against some design tricks to reduce filter
               | order.
               | 
               | There are some applications of filter design that NNs can
               | do that we don't have good solutions for with traditional
               | algorithms, like system identification, another is in
               | optimization (in terms of filter order) and iteratively
               | designing stable IIR filters to fit arbitrary frequency
               | curves (for FIR it's a solved problem, at significant
               | extra cost compared to IIR).
               | 
               | As for topologies of the filter itself that is an
               | interesting angle. Topologies affect quantization effects
               | (if you wanted an FPGA to realize the filter with the
               | fewest number of bits, topologies matter) and for time
               | variant filters (like an adaptive IIR) there are
               | _drastic_ differences between the behavior of various
               | topologies. I 'm not sure how it would manifest with a NN
               | designing the filter.
        
               | hcrisp wrote:
               | I meant learning the topology (I called it architecture)
               | from scratch, or minimally from heuristics. What would I
               | discover? Something that we recognize as an IIR filter?
               | Probably not, but if it did, what would it discover that
               | we didn't recognize over years of academic thought but
               | _could_ have from first principles? To me that 's the
               | intriguing angle.
        
               | hctaw wrote:
               | I don't think I follow. These designs _do_ come from
               | first principles. In fact the reason we 've had so much
               | academic thought put into it is because 100 years ago we
               | designed systems empirically and a few smart people went
               | back to first principles to find the fundamental
               | constraints on filter networks.
        
           | suvakov wrote:
           | Actually, "learns" here means to fit reverse linear
           | transformation a.k.a. inverse matrix. He defines reverse FT
           | matrix and using gradient descent numerically converges to
           | inverse matrix. Nothing more than inefficient way to solve
           | system of linear equations.
        
           | aesthesia wrote:
           | As far as I can tell, you're still using a fixed inverse DFT
           | as the reconstruction layer, so it's not just rediscovering
           | the DFT on its own. Instead of learning a linear
           | transformation from input-output pairs, it's learning the
           | inverse of a linear transformation when that transformation
           | is given as an oracle. It's not terribly surprising that this
           | works, although there are probably some interesting issues of
           | numerical conditioning in the general case.
        
         | volta83 wrote:
         | > But saying that Newton's second law is a Neural Network,
         | while correct, seems a bit deceptive in that it's not a deep
         | idea at all.
         | 
         | I guess the point is that neither is the idea of a Neural
         | Network.
        
           | Nasrudith wrote:
           | A neural network can also be set to learn unconditionally
           | return a fixed value with no learning feedback. I don't think
           | lower bounds on capabilities are very informative. So could
           | many arbitrarily complex arrangements that do massive amounts
           | of work only to discard it and return a constant. An upper
           | bound of what an approach is capable of is more useful. Say
           | no matter how vast a look up table is it will never return a
           | different value for the same input regardless of prior
           | sequence.
        
         | Imnimo wrote:
         | I think this is the sort of thing that is very obvious if you
         | are already comfortable with neural networks and the FFT. But
         | if you're only comfortable with neural networks, and the FFT
         | feels like arcane magic, this exercise might be very
         | instructive.
        
         | xyzzy21 wrote:
         | The "Deep" part ia to realize what this really means in terms
         | of limitations of ML/NN!
        
       | akrymski wrote:
       | I'm really interested in training NNs in the frequency domain:
       | 
       | - a ton of data (JPEGs, MPEGs, MFCCs) exists in the compressed
       | form of FFT-ed features
       | 
       | - FFTs are hardware encoded/decoded & optimised
       | 
       | - FFTs are great at separating signal from noise, leading to a
       | huge reduction in input features (eg JPEG encoded) as compared to
       | raw pixel values
       | 
       | - convolutions in the time domain are just multiplications in the
       | frequency domain
       | 
       | - converting to log polar coordinates turns them into additions
       | 
       | - distance between two log polar coordinates could be the loss
       | function
       | 
       | However, I've not had much luck getting a frequency domain
       | network like this to converge. Anyone tried something similar
       | with success?
        
         | ashertrockman wrote:
         | The choice of activation function isn't entirely clear to me,
         | but I think it's definitely possible to make a network that
         | operates entirely in the frequency domain. It would probably be
         | pretty easy to start experimenting with such a thing with the
         | nice complex number and FFT support in PyTorch 1.8. :)
         | 
         | Like you said, there's already a significant connection between
         | convolutional networks and the Fourier domain (the convolution
         | theorem).
         | 
         | Tangentially, I've recently worked on a project that focused on
         | implementing convolution in the Fourier domain, and how that
         | allows one to control useful properties of convolutions (like
         | "smoothness" and orthogonality).
         | 
         | I made a demonstration of convolution in the Fourier domain in
         | PyTorch, which you might find interesting:
         | https://nbviewer.jupyter.org/github/locuslab/orthogonal-conv...
         | 
         | More generally, you could look here for more code and the
         | corresponding paper: https://github.com/locuslab/orthogonal-
         | convolutions
        
         | akrymski wrote:
         | I don't see why processing JPEG coefficients directly as input
         | features shouldn't have similar accuracy to a much larger NN
         | using raw pixel data.
         | 
         | In particular I'm interested in the efficiency gains from
         | avoiding convolutions and the possibility of running a
         | compressed frequency-domain NN on CPUs.
        
         | mochomocha wrote:
         | I have trained GANs on raw JPEG coefficients with moderate
         | success as a pet project. Read the raw DCT coefficients without
         | decompressing and train in this space. JPEG-decompress the
         | output of the net to reconstruct an image in the pixel space.
         | There are few papers doing similar things for supervised
         | learning tasks iirc
        
           | akrymski wrote:
           | Using a standard loss function like MSE?
           | 
           | Yeah it kinda works when you feed JPEG coefficients into a
           | typical time-domain CNN, but mathematically it seems that if
           | you're using frequencies as inputs, your convolutions should
           | become simple multiplications. Am I wrong?
        
             | mochomocha wrote:
             | Yep, you can successfully do it without convolutions. Here
             | is a pointer if you want to dig deeper:
             | https://eng.uber.com/neural-networks-jpeg/ (there's prior
             | art to that, but this one is well written)
        
       | [deleted]
        
       | jl2718 wrote:
       | A 'neural' network is tautologically nonlinear.
        
       | master_yoda_1 wrote:
       | My fart molecule also have non linear interaction with other
       | atom. So my farting is a neural network too. #bullshitArticle
        
       | turtletontine wrote:
       | I don't see anything here beyond "neural networks generalize
       | linear operators." (But of course they're more interesting for
       | nonlinear applications)
       | 
       | I think neutral nets are such a generalized thing that tons of
       | stuff can be shown to map to them. I saw a talk once about
       | evolution being equivalent to neural networks: consider a genome
       | as a set of neuron weights, and your iterative refinement as new
       | generations in evolution. Maybe there's something deep there, but
       | maybe if you make your model arbitrarily complicated enough you
       | can make it look like any system you want. Which is the whole
       | idea with neural nets.
        
       | chobytes wrote:
       | ...This could be a brilliant technique for collecting funding.
        
         | srean wrote:
         | Mono layer DNNs with a block-chained backpropagated backend
        
       | n_kr wrote:
       | People may find this interesting too:
       | https://ocw.mit.edu/courses/mechanical-engineering/2-71-opti...
        
       | cochne wrote:
       | This is kind of silly. Neural networks are universal function
       | approximators, meaning any function "is" (more accurately, "can
       | be represented by") a neural network. In the DFT case, it is a
       | linear function so we could get an exact representation. Though
       | there is also nothing stopping you from just saying, f(x) is a
       | neural network, because I can choose my activation function to be
       | f.
        
         | TheRealPomax wrote:
         | If silly but interesting things weren't allowed, there would be
         | no point to having a website like this. The article is a neat
         | exploration of two things that many, _many_ people don 't
         | realise are related because they simply don't know the maths
         | involved in either of the two subjects.
        
           | vletal wrote:
           | Maybe, if the title wan not so overblown the pushback againt
           | the articles premise would not be so hard.
        
         | visarga wrote:
         | > Neural networks are universal function approximators
         | 
         | What about discontinuous functions?
        
           | hyperbovine wrote:
           | Continuous functions are dense in <pick your favorite
           | function space> so yes.
        
           | estebarb wrote:
           | Every function can be made continuous if you add one
           | dimension in it's output for defined/not defined.
        
         | volta83 wrote:
         | > Neural networks are universal function approximators,
         | 
         | Do we have tight error bounds proofs for neural networks as
         | approximators ?
        
           | ska wrote:
           | https://en.wikipedia.org/wiki/Universal_approximation_theore.
           | ..
        
       | swiley wrote:
       | Anything can be expressed by a neural network to some arbitrary
       | degree of precision. That's the whole point. The hard part isn't
       | expressing the thing but getting the machine to figure out how to
       | express it correctly on it's own. There's also no way to prove
       | your learning algorithm actually works outside of toy problems
       | like this which is why ML is really only good for data mining
       | rather than steering a car.
        
         | eigenket wrote:
         | *anything continuous
        
       | mabbo wrote:
       | I love this kind of post.
       | 
       | While it's all very posh to scoff while saying "well,
       | obviously!", the author has explained it, demonstrated it,
       | documented it. It's not about proving something new, it's about
       | sharing something neat with the world. It's also well written and
       | emotes a bit of a fun in the telling.
       | 
       | I'm reminded of xkcd's "Ten Thousand" principle[0]: (Paraphrased)
       | "For each thing that 'Everybody knows', every day there are, on
       | average, 10,000 people in the US learning it for the first time".
       | Good writing makes the experience better for those learning it
       | for the first time today.
       | 
       | [0]https://xkcd.com/1053/
        
         | 6gvONxR4sf7o wrote:
         | The problem isn't an "everybody knows" kind of problem. People
         | (myself included) are rolling our eyes because it's the most
         | rube goldbergian way to say it.
         | 
         | It's analogous to saying that subtraction is a neural network.
         | Addition and negation are core elements of modern state of the
         | art neural networks, and you can express subtraction as a
         | combination of those two modern neural network techniques.
         | Therefore subtraction is a neural network.
         | 
         | The line in the article about the formula y = A x is
         | illustrative:
         | 
         | > This [y = A x] should look familiar, because it is a neural
         | network layer with no activation function and no bias.
         | 
         | Imagine that as:
         | 
         | > This [y = a + b] should look familiar, because it is a neural
         | network layer with no activation function and no bias and no
         | multiplication. Meanwhile, this [y = -x] should also look
         | familiar, because it is a neural network layer with no
         | activation function, no bias, no addition, and only negation.
         | 
         | You'd expect that kind of explanation of subtraction if someone
         | had never learned about addition and negation outside of neural
         | networks, but you'd roll your eyes and say it's just such a
         | bizarre way to talk about addition, negation, subtraction, and
         | neural networks.
        
         | Certhas wrote:
         | But it explains the wrong thing. It's a fundamentally
         | ridiculous framing. A basic neural network layer is a linear
         | transformation + constant offset + activation function. Linear
         | transformations are the elementary building block. And Fourier
         | Transform is a quintessential example of a linear transform.
         | But the article doesn't actually discuss that at all. If you
         | don't understand this how will you ever deal with more
         | complicated NN architectures?
        
       | albertzeyer wrote:
       | Also interesting/related:
       | 
       | In speech recognition, we usually use log Mel features, or MFCC
       | features, which do a Fourier transformation on the raw audio
       | frames.
       | 
       | You can also train neural networks directly on the raw audio
       | features. When you do so, you can inspect the weights of the
       | first layer (convolutional layer), and you see that it pretty
       | much learned the short Fourier transformation.
       | 
       | https://www-i6.informatik.rwth-aachen.de/publications/downlo...
        
       | bigbillheck wrote:
       | You can't just point at things and call them 'a neural network'.
       | 
       | Linear regression's just learning an affine transform but I think
       | it's misleading and unhelpful to call that a neural network
       | either.
        
       ___________________________________________________________________
       (page generated 2021-04-29 23:01 UTC)