[HN Gopher] Deep neural networks as computational graphs (2018)
___________________________________________________________________
Deep neural networks as computational graphs (2018)
Author : softwaredoug
Score : 79 points
Date : 2023-05-21 13:09 UTC (1 day ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| dynamite-ready wrote:
| Wasn't that the paradigm behind TensorFlow?
| dkislyuk wrote:
| As another commenter said, viewing a neural network as a
| computation graph is how all automatic differentiation engines
| work (particularly reverse-mode, where one needs to traverse
| all the previous computations to correctly apply the
| gradient), and several libraries predating TensorFlow
| followed this idea. The initial contribution of TensorFlow
| and PyTorch was more about making the developer interface
| much cleaner and enabling training on a wider range of
| hardware by developing a bunch of useful kernels as part of
| the library.
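|
| A minimal sketch of the idea in plain Python (no particular
| library's API, just the recorded-graph pattern):
|
|     class Var:
|         """A node in the computation graph."""
|         def __init__(self, value, parents=()):
|             self.value = value      # forward result
|             self.grad = 0.0         # filled in by backward()
|             self.parents = parents  # (parent, local_grad) pairs
|
|         def __add__(self, other):
|             return Var(self.value + other.value,
|                        [(self, 1.0), (other, 1.0)])
|
|         def __mul__(self, other):
|             return Var(self.value * other.value,
|                        [(self, other.value), (other, self.value)])
|
|         def backward(self):
|             # Reverse-mode: topologically order the recorded
|             # graph, then walk it backwards, applying the chain
|             # rule at every edge.
|             order, seen = [], set()
|             def visit(node):
|                 if node not in seen:
|                     seen.add(node)
|                     for parent, _ in node.parents:
|                         visit(parent)
|                     order.append(node)
|             visit(self)
|             self.grad = 1.0
|             for node in reversed(order):
|                 for parent, local_grad in node.parents:
|                     parent.grad += local_grad * node.grad
|
|     x, y = Var(2.0), Var(3.0)
|     z = x * y + x    # forward pass records the graph
|     z.backward()     # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0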
| dekhn wrote:
| I've always thought of most computations of functions as
| computational graphs - at least, when I used MathLink to
| connect Mathematica to Python, it basically gives you a
| protocol to break any Mathematica function into its
| recursively expanded definition. Konrad Hinsen suggested
| using Python's built-in operator overloading, so if you said
| "1 + Symbol('x')" it would get converted to "Plus(1,
| Symbol('x'))" and then sent over MathLink to Mathematica,
| which would evaluate Plus(1, x), and return an expression
| which I'd then convert back to a Python object
| representation.
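|
| Something like this sketch (the Symbol/Plus names follow the
| example above; the real MathLink wrapper was more involved):
|
|     class Expr:
|         def __add__(self, other):
|             return Plus(self, other)
|         def __radd__(self, other):   # handles 1 + Symbol('x')
|             return Plus(other, self)
|
|     class Symbol(Expr):
|         def __init__(self, name):
|             self.name = name
|         def __repr__(self):
|             return self.name
|
|     class Plus(Expr):
|         def __init__(self, left, right):
|             self.left, self.right = left, right
|         def __repr__(self):
|             return f"Plus({self.left!r}, {self.right!r})"
|
|     print(1 + Symbol('x'))   # -> Plus(1, x)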
|
| I don't think we talked about doing any sort of automated
| diff (in my day we figured out our own derivatives!) but
| after I made a simple eigendecomp of a matrix of floats, the
| Mathematica folks contributed an example that did eigendecomp
| of a matrix with symbols (i.e., some of the terms weren't 5.7
| but "1-x"). Still kind of blows my mind today how much
| Mathematica can do with computation graphs.
|
| IIUC this is the basis of LISP as well.
| ggr2342 wrote:
| AFAIK it was first brought forward by PyTorch. I may be wrong
| though.
| mhh__ wrote:
| PyTorch's insight was to be very dynamic and simple. The
| underlying model is basically the same, with the caveat that
| TensorFlow uses/used something called a "tape".
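|
| For reference, the tape is user-visible in TF2's eager mode
| (a minimal sketch):
|
|     import tensorflow as tf
|
|     x = tf.Variable(3.0)
|     with tf.GradientTape() as tape:
|         y = x * x                 # ops are recorded on the tape
|     print(tape.gradient(y, x))   # dy/dx = 2x = 6.0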
| HarHarVeryFunny wrote:
| Before PyTorch there was Torch (which was Lua-based, hence
| the Py- prefix of the follow-on PyTorch). With Torch, like
| the later TensorFlow 1.0, you first explicitly constructed
| the computational graph then executed it. PyTorch's later
| innovation was define-by-run where you just ran the code
| corresponding to your computational graph and the framework
| traced your code and built the graph behind the scenes...
| this was so much more convenient that PyTorch quickly became
| more popular than TensorFlow, and by the time TF 2.0 "eager
| mode" copied this approach it was too late (although TF shot
| itself in the foot in many ways, which also accelerated its
| demise).
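|
| Roughly the contrast (the TF 1.x API is shown from memory;
| it's long deprecated):
|
|     # Define-then-run (Torch / TF 1.x): build the graph first,
|     # then execute it.
|     #   x = tf.placeholder(tf.float32)
|     #   y = x * x
|     #   with tf.Session() as sess:
|     #       print(sess.run(y, feed_dict={x: 3.0}))
|
|     # Define-by-run (PyTorch): just run the code; the framework
|     # traces it and builds the graph behind the scenes.
|     import torch
|     x = torch.tensor(3.0, requires_grad=True)
|     y = x * x        # the graph is recorded as this executes
|     y.backward()     # dy/dx = 2x
|     print(x.grad)    # tensor(6.)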
| cgearhart wrote:
| It certainly pre-dates PyTorch. Applying computational graphs
| to neural networks was the central purpose of the Theano
| framework in the early 2010s. TensorFlow closely followed
| those ideas, and Theano shut down a year or two later: the two
| were so close conceptually, and a grad research lab can't
| compete with the resources of Google.
| 6gvONxR4sf7o wrote:
| It's been a core part of automatic differentiation for
| decades.
| ggr2342 wrote:
| Oh! Can you give some pointers where to read about it more?
| 6gvONxR4sf7o wrote:
| The wikipedia page for automatic differentiation and its
| references would be a good start.
| https://en.wikipedia.org/wiki/Automatic_differentiation
|
| In particular, the section "Beyond forward and reverse
| accumulation" tells you how hard it is to deal with the
| graph in optimal ways, hence the popularity of simpler
| forward and backward traversals of the graph.
| PartiallyTyped wrote:
| The original 2017 PyTorch paper is probably a good one.
| gavi wrote:
| I highly recommend the video from karpathy
| https://www.youtube.com/watch?v=VMj-3S1tku0 - This explains the
| same idea spelled out in code. It really helped me understand the
| mathematical expression graph of neural networks. He uses
| graphviz to show the expressions, initially calculates
| gradients manually, and then implements backpropagation.
| ttul wrote:
| Indeed, the self-attention mechanism within the transformer
| architecture is often described as a graph, where tokens are the
| nodes and the key and query vectors are combined to establish the
| affinity that each token has for each other token.
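|
| In code, that affinity graph is just the softmaxed query-key
| score matrix (a minimal sketch; shapes are illustrative):
|
|     import torch
|
|     n, d = 4, 8                    # 4 tokens, 8-dim head
|     Q, K = torch.randn(n, d), torch.randn(n, d)
|     affinity = torch.softmax(Q @ K.T / d**0.5, dim=-1)
|     # affinity[i, j]: weight of the edge from token i to token j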
| blackbear_ wrote:
| These are two different things, actually. The graph you are
| talking about is constructed using the inputs of the attention
| layer (i.e., nodes=tokens, edges=affinity), while the article
| talks about graphs that represent the actual operations
| performed in the attention layer (i.e., nodes=weights and
| intermediate results, edges=mathematical operations that
| combine them).
| garganzol wrote:
| The article is a thought-provoking piece. A few personal
| observations: a) an acyclic computational graph is equivalent
| to a program without loops and is thus not Turing-complete;
| b) a cyclic computational graph is equivalent to a program
| with recursion. The closest analogy from the programming world
| is a program without loops but with recursive calls. This
| means that it is Turing-complete, as loops are fully
| interchangeable with recursion (just different facets of the
| same thing).
|
| As easy as 2+2=4. This means that a neural network is
| essentially a program. The only difference is how the program
| gets written: neural networks write it themselves by
| "learning".
|
| Those simple observations explain everything. The brain is
| essentially a biological Turing-complete processing unit with
| full plasticity, as you can build any network (program) by
| changing the "weights" of the neurons. This also means that
| full AI is possible; it is just a matter of available
| computational power. Why is that? Because Turing completeness
| guarantees the absence of boundaries for achieving
| progressively sophisticated system abilities.
| gcr wrote:
| Most neural networks these days aren't recursive, however.
|
| GPT-3 and friends certainly aren't, unless you count the
| inference loop as recursion, which is bounded both by
| vocabulary size and maximum token length (hardcoded for now).
| garganzol wrote:
| This only means that there is so much more space for AI to
| grow and improve in the future.
|
| As far as I know, having recursion in neural networks poses a
| dilemma: on one hand, recursive networks are more capable; on
| the other, they are harder to train and may have stability
| issues caused by occasional positive feedback loops.
| data_maan wrote:
| This says nothing new. The idea has been around since the 80s,
| if not earlier, and is used extensively for automatic
| differentiation; it's textbook material.
|
| But for an audience that seems to think it rides the frontier
| of knowledge and that we have only just discovered these
| things (when, in fact, it's our ignorance of earlier research
| that is the reason we keep re-discovering them), such a Medium
| post might be like a drug :D
| garganzol wrote:
| Sometimes simple (and old) ideas start to click and ignite a
| tornado when the timing is right.
|
| While many may argue that the notion of a computational graph
| has been associated with neural networks (NN) from the very
| beginning, this post is quite novel. It sheds light on how
| NNs and general computation theory are actually
| interconnected: they are basically the same thing.
|
| For me, it was an instant blast: the computational graph
| reminds me of a typical intermediate representation (IR) tree
| of a compiler. And because it is a formal graph, all the
| mathematical benefits of graph theory suddenly start to click.
| For instance, graph theory formalizes the distinction between
| cyclic directed graphs and acyclic directed graphs. If we
| imagine for a moment that a graph vertex represents a unit of
| computation, it immediately starts to resonate with lambda
| calculus. It quickly becomes evident that when the graph is
| cyclic, it represents a recursion in terms of lambda calculus
| and thus becomes Turing-complete.
|
| If the computational graph is acyclic, Turing completeness is
| not achievable.
|
| If you are going to say that this is an obvious and well-known
| observation, I will be surprised, because it is not. And all
| that enlightenment was possible thanks to this article,
| combined with a bit of knowledge about graphs, compilers, and
| lambda calculus. (It just so happened that I'm relatively
| well-versed in those topics due to professional involvement.)
| blackbear_ wrote:
| I cannot stress enough how foundational an idea it is to store
| mathematical expressions as graphs! Analyzing and manipulating
| such graphs enables, among other things, automatically
| obtaining derivatives of a function with respect to its
| parameters, which is one of the cornerstones of most AI
| methods.
|
| Before this was invented, people had to manually derive
| expressions for these derivatives and implement the code to
| compute them. Needless to say, this took far too much time and
| was likely to result in sub-optimal code, while nowadays
| nobody has to deal with it anymore (except on their homework)!
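|
| As a toy illustration (nested tuples standing in for a real
| graph structure), derivatives fall out of a recursive rewrite:
|
|     def diff(expr, var):
|         if isinstance(expr, (int, float)):     # constant
|             return 0
|         if isinstance(expr, str):              # variable
|             return 1 if expr == var else 0
|         op, a, b = expr
|         if op == '+':                          # sum rule
|             return ('+', diff(a, var), diff(b, var))
|         if op == '*':                          # product rule
|             return ('+', ('*', diff(a, var), b),
|                          ('*', a, diff(b, var)))
|
|     # d/dx (x*x + 3): an unsimplified graph equivalent to 2x
|     print(diff(('+', ('*', 'x', 'x'), 3), 'x'))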
|
| For those curious about how this works, I present a simple
| implementation on my blog [1].
|
| [1]
| https://e-dorigatti.github.io/math/deep%20learning/2020/04/0...
| BaculumMeumEst wrote:
| Excellent post, thank you for writing it.
| joe_the_user wrote:
| _I cannot stress enough how foundational an idea it is to
| store mathematical expressions as graphs!.._
|
| Understanding a function's computational graph is certainly
| useful, but storing a function as a computation graph is, in
| fact, quite expensive. Deep learning systems don't store
| their computational graphs; the graphs are _implicit_ in
| their computation process. Deep learning systems instead
| store their functions as tensors: generalized
| arrays/matrices. This allows both efficient storage and
| efficient computation. It's quite possible to do automatic
| differentiation on these structures as well - that is the
| basis of "backpropagation".
|
| It's important to distinguish useful conceptual structures
| like computation graphs (or symbolic representations) from
| the structures that are necessary for efficient computation.
| Automatic differentiation itself is important because the
| traditional symbolic differentiation one learns in calculus
| "blows up" - its cost can grow exponentially with expression
| length - when attempted on a large expression, whereas
| automatic differentiation costs not much more than the base
| cost of computing the function itself (though if that base
| cost is high, you lose some of the advantage).
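|
| To illustrate with PyTorch (a minimal sketch): the persistent
| state is just tensors, and the autograd graph exists only
| transiently, recorded by the forward pass and consumed by the
| backward pass.
|
|     import torch
|
|     w = torch.randn(3, requires_grad=True)  # stored: a tensor
|     loss = (w * w).sum()                    # graph recorded here
|     loss.backward()                         # ...and consumed here
|     print(w.grad)                           # d(loss)/dw = 2w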
|
| Just sayin'
| [deleted]
| necroforest wrote:
| You can't store a function as a tensor. The tensors are the
| inputs/outputs that flow along the edges of the graph. TF
| stores things directly as a graph:
| https://github.com/tensorflow/tensorflow/blob/master/tensorf...
| while JAX represents computations in an intermediate
| representation: https://jax.readthedocs.io/en/latest/jaxpr.html
| curiousgal wrote:
| Another major application is in finance! Computing risk metrics
| for hedging purposes used to be based on a "bump and
| reprice" method, i.e. you bump the input by a small amount and
| see how that affects the price (the definition of a
| derivative). But now in modern quant libraries, implementing
| the pricing as a graph allows you to get all of the
| sensitivities (derivatives) for free!
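|
| A toy illustration with a made-up one-line "pricer" (a real
| quant library's graph is far bigger, but the principle holds):
|
|     import torch
|
|     spot = torch.tensor(100.0, requires_grad=True)
|     vol = torch.tensor(0.2, requires_grad=True)
|     price = spot * torch.exp(vol)   # stand-in for a real model
|     price.backward()
|     # every sensitivity from one backward pass:
|     print(spot.grad, vol.grad)
|
|     # vs. bump-and-reprice: one full repricing per input
|     # delta ~= (reprice(spot + h) - price) / h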
| mhh__ wrote:
| Not quite. Some people want bump Greeks. For example, a
| company may have a standard bump.
|
| Make of that what you will.
| KMag wrote:
| Not quite free. Last I checked, adding automatic
| differentiation to GPU computation for pricing exotic
| derivatives roughly doubled the computation cost. (Though,
| traditional bump-and-reprice has exponential cost in the
| number of dimensions.)
| blackbear_ wrote:
| That's cool! But it relies on the model being an accurate
| representation of reality, right? What kind of models are we
| talking about anyways, and what inputs do they use? (Asking
| out of curiosity, not skepticism)
| mhh__ wrote:
| A model is a computation based on its initial assumptions.
| Sometimes that's fine.
|
| The key thing is that (good) practitioners know the finance
| and know the models. If a simple model obviously doesn't fit
| the market, that's a sign in itself: you might be about to
| lose some money, or you could take on some risk from other
| people and get rewarded for it (people are panicking).
|
| Weirder derivatives and so on can get more dangerous, of
| course. However, even the really famous example from the '07
| crash (pricing CDOs and the CDS against them) was arguably due
| more to deliberate ignorance of widespread fraud and
| fictitious numbers than to the models per se. Garbage in,
| garbage out (the models weren't great either, but still).
___________________________________________________________________
(page generated 2023-05-22 23:00 UTC)