[HN Gopher] Automatic Differentiation Does Incur Truncation Erro...
       ___________________________________________________________________
        
       Automatic Differentiation Does Incur Truncation Errors (Kinda)
        
       Author : oxinabox
       Score  : 58 points
       Date   : 2021-02-08 18:45 UTC (4 hours ago)
        
 (HTM) web link (www.oxinabox.net)
 (TXT) w3m dump (www.oxinabox.net)
        
       | not2b wrote:
       | You could use a lazy representation of the Taylor series that
       | will give you as many terms as you ask for and symbolically
       | differentiate that. Then you'll get an accurate automatic
       | differentiation. When you go to evaluate your approximation
       | you'll get errors at that point, but you'll correctly get that
       | the second derivative of sin(x) is exactly -sin(x).
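       | 
       | For instance, a rough sketch of that in plain Python (the names
       | here are made up, just to illustrate the idea):
       | 
       |     import math
       | 
       |     # A "lazy Taylor series": a function k -> k-th Taylor
       |     # coefficient about 0, so you can ask for as many terms
       |     # as you like.
       |     def sin_series(k):
       |         return [0.0, 1.0, 0.0, -1.0][k % 4] / math.factorial(k)
       | 
       |     def derivative(series):
       |         # term-wise, exact differentiation of the series
       |         return lambda k: (k + 1) * series(k + 1)
       | 
       |     def evaluate(series, x, nterms=20):
       |         # truncation only happens here, at evaluation time
       |         return sum(series(k) * x**k for k in range(nterms))
       | 
       |     d2_sin = derivative(derivative(sin_series))
       |     x = 0.7
       |     print(evaluate(d2_sin, x), -math.sin(x))  # agree to ~16 digits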
        
         | oxinabox wrote:
         | Correct, which is mentioned at the end of the post. I am not
         | sure how well it generalizes, e.g. to multivariate functions
         | that are defined as solutions of something like a differential
         | equation.
         | 
         | Maybe it does? I don't know.
        
         | joe_the_user wrote:
         | You could do this but I'm pretty sure that real AD just deals
         | with all the common primitives directly. The derivative of
         | sin(x) is cos(x), etc. AD just postpones the chain rule
         | calculations that would be involved in something like
         | sin(x + x^2)^2, so the chain rule happens at run time instead
         | of as a symbolic expansion at compile time.
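         | 
         | A minimal sketch of that, as a hand-rolled forward-mode dual
         | number in Python (not any particular library's API):
         | 
         |     import math
         | 
         |     class Dual:
         |         # value plus derivative, combined by the chain
         |         # rule at run time
         |         def __init__(self, val, der):
         |             self.val, self.der = val, der
         |         def __add__(self, other):
         |             return Dual(self.val + other.val,
         |                         self.der + other.der)
         |         def __mul__(self, other):
         |             return Dual(self.val * other.val,
         |                         self.der * other.val
         |                         + self.val * other.der)
         | 
         |     def sin(d):
         |         # sin is a primitive: its rule is hard-coded as
         |         # cos, not re-derived from an approximation
         |         return Dual(math.sin(d.val),
         |                     math.cos(d.val) * d.der)
         | 
         |     def f(x):
         |         inner = sin(x + x * x)
         |         return inner * inner      # sin(x + x^2)^2
         | 
         |     x = Dual(0.3, 1.0)            # seed dx/dx = 1
         |     print(f(x).der)
         |     # check: 2*sin(x+x^2)*cos(x+x^2)*(1+2*x) at x=0.3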
        
       | londons_explore wrote:
       | Interesting...
       | 
       | But do these truncation errors cause any real world heartaches?
       | Does anyone take n-th derivatives of an approximate function
       | for large n?
        
         | oxinabox wrote:
         | Occasionally; there are a few places where you want arbitrarily
         | high derivatives. But in general there are specialized
         | solutions for those (though very few things implement them.
         | E.g. Jax's jets and TaylorSeries.jl in Julia implement Taylor-
         | mode AD. But those also need custom primitives.)
         | 
         | More of a problem is first derivatives for which you need very
         | high accuracy. That doesn't occur in ML but does occur
         | sometimes in scientific computing. (Doesn't occur in ML because
         | we just basically shake the thing about a bit anyway.)
        
       | stabbles wrote:
       | An obvious example here is when you expand e^x with its Taylor
       | series using `n` terms. The derivative of that n-term Taylor
       | expansion is the same series but only `n-1` terms long, so you
       | lose precision.
       | 
       | More generally, if you approximate a smooth function f with a
       | Taylor series truncated at degree n around c, the error behaves
       | as O((x-c)^(n+1)), and the error of the k-th derivative of that
       | function auto-diffed will be of order O((x-c)^(n+1-k)).
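       | 
       | Concretely, something like this (plain Python, purely to
       | illustrate):
       | 
       |     import math
       | 
       |     def exp_taylor(x, n=8):
       |         # e^x approximated by the first n Taylor terms
       |         return sum(x**k / math.factorial(k)
       |                    for k in range(n))
       | 
       |     def d_exp_taylor(x, n=8):
       |         # term-wise derivative: only n-1 terms survive
       |         return sum(k * x**(k - 1) / math.factorial(k)
       |                    for k in range(1, n))
       | 
       |     x = 1.5
       |     print(math.exp(x) - exp_taylor(x))    # ~8e-4
       |     print(math.exp(x) - d_exp_taylor(x))  # ~4e-3, worse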
        
         | joe_the_user wrote:
         | That's true but this sort of problem would normally not happen
         | with a standard AD library or subsystem, whatever language
         | you're using. Such systems don't approximate functions then
         | differentiate their approximations.
        
           | taeric wrote:
           | I think the example is supposed to be indicative of larger
           | functions that users define in the system. That is, showing
           | it with sin/cos was simply an example. Right?
        
             | oxinabox wrote:
             | Correct. sin and cos were merely an example.
             | 
             | Every real AD system has a primitive for sin and cos. But
             | they do not have every single operation you might ever
             | implement (else what would be the point), and it is
             | unlikely they will have every operation that involves an
             | approximation. (Though there is a good chance they might
             | have every named scalar operation involving an
             | approximation that you use; depends how fancy you are.)
             | 
             | I mean, I maintain a library of hundreds of custom
             | primitives. (One of the uses of this post is so I can
             | point to it in response to people saying "why do you need
             | all those primitives if AD can just decompose everything
             | into + and *?")
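             | 
             | For a flavour of what such a custom primitive buys you
             | (hypothetical names, plain Python, not the real API):
             | 
             |     import math
             | 
             |     def my_sin_approx(x):
             |         # some lossy approximation of sin
             |         return x - x**3/6 + x**5/120 - x**7/5040
             | 
             |     # Without a rule, an AD would differentiate the
             |     # polynomial above. A custom primitive instead
             |     # pairs the approximate value with the accurate
             |     # derivative rule:
             |     def my_sin_rule(x):
             |         return my_sin_approx(x), math.cos(x)
             | 
             |     val, dval = my_sin_rule(1.0)
             |     print(dval - math.cos(1.0))   # exactly 0.0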
        
           | stabbles wrote:
           | Well, all functions that use finite precision arithmetic
           | deal with approximations. Errors of 1/2 ulp, like the trig
           | functions achieve, are rather the exception. And machine
           | learning is keen on using low precision floating point
           | types, so errors are relatively large.
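           | 
           | For instance (plain Python plus numpy, just to show the
           | effect of precision on the same computation):
           | 
           |     import math
           |     import numpy as np
           | 
           |     def cos_series(x, dtype):
           |         # 10-term Taylor series for cos, summed in
           |         # the given floating point precision
           |         acc, term = dtype(0), dtype(1)
           |         for k in range(1, 11):
           |             acc = acc + term
           |             f = -x * x / ((2*k - 1) * (2*k))
           |             term = term * dtype(f)
           |         return acc
           | 
           |     x = 1.0
           |     err64 = float(cos_series(x, np.float64)) - math.cos(x)
           |     err32 = float(cos_series(x, np.float32)) - math.cos(x)
           |     print(abs(err64))  # essentially full double precision
           |     print(abs(err32))  # ~1e-7: only about 7 digits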
        
       | jpollock wrote:
       | I'm seeing a lot of discussions around Automatic Differentiation,
       | but I don't understand the purpose.
       | 
       | Why would you take the derivative of a piece of code?
        
         | sesuximo wrote:
         | It's a tool to help solve inverse problems.
        
         | freemint wrote:
         | I have used automatic differentiation extensively to
         | "optimize" simulations. This problem is known under many
         | names, such as parameter estimation, local sensitivity
         | analysis, optimal control ...
         | 
         | Having access to something that resembles the derivative of
         | your code allows you to use optimization algorithms that
         | converge faster (Newton's method or gradient descent vs
         | bisection methods) and means you don't have to specify the
         | derivatives by hand (which gets really bothersome when you
         | have to specify a Hessian matrix, i.e. 2nd derivative
         | information including all the mixed terms).
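         | 
         | A toy version of the parameter estimation case, with a
         | hand-rolled forward-mode derivative (plain Python, made-up
         | names, just to show the shape of it):
         | 
         |     import math
         | 
         |     def dual_exp(v, dv):
         |         # forward-mode rule for exp: (value, derivative)
         |         e = math.exp(v)
         |         return e, e * dv
         | 
         |     # "simulation": y(t) = exp(theta*t), data from theta=0.8
         |     data = [(t, math.exp(0.8 * t))
         |             for t in (0.0, 0.5, 1.0, 1.5)]
         | 
         |     def loss_and_grad(theta):
         |         # squared error and d(loss)/d(theta)
         |         L, dL = 0.0, 0.0
         |         for t, y in data:
         |             pred, dpred = dual_exp(theta * t, t)
         |             r = pred - y
         |             L += r * r
         |             dL += 2 * r * dpred
         |         return L, dL
         | 
         |     theta = 0.0
         |     for _ in range(200):          # plain gradient descent
         |         L, g = loss_and_grad(theta)
         |         theta -= 0.01 * g
         |     print(theta)                  # recovers ~0.8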
        
           | taeric wrote:
           | Would be cool to see what your code looked like. I have the
           | same question as the start of the thread.
           | 
           | That is, I can see how the derivative can help optimise a
           | simulation, but I'm not clear on how using an auto derivative
           | really helps.
           | 
           | Edit: note that I also see a difference between being able
           | to get the derivative of a function, versus writing code in
           | a style where any function can be mapped to its derivative.
        
             | oxinabox wrote:
             | > That is, I can see how the derivative can help optimise a
             | simulation, but I'm not clear on how using an auto
             | derivative really helps.
             | 
             | Finding a derivative by hand gets tiring fast, and is
             | error prone. It used to be that a neural net paper doing
             | anything novel spent 1-2 pages deriving the derivative of
             | its cool new thing. Now the derivative isn't even
             | mentioned.
        
               | taeric wrote:
               | But does that require auto derivative stuff? I'm assuming
               | most could have simply mentioned the main equation, said
               | "use Mathematica" and proceeded with the derivative?
        
         | eigenspace wrote:
         | You want to take the derivative of a piece of code so that you
         | don't have to either hand-write the derivative yourself, or
         | rely on finite differencing which can be slower in many
         | circumstances and is almost universally less accurate.
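         | 
         | The accuracy point is easy to see numerically (plain Python,
         | differentiating sin at x = 1):
         | 
         |     import math
         | 
         |     x = 1.0
         |     exact = math.cos(x)       # true derivative of sin
         | 
         |     for h in (1e-4, 1e-8, 1e-12):
         |         fd = (math.sin(x + h) - math.sin(x)) / h
         |         print(h, abs(fd - exact))
         |     # the error shrinks with h at first (~4e-5 here), but
         |     # below some point rounding takes over and it grows
         |     # again (~1e-4 at h=1e-12); an AD rule just evaluates
         |     # cos(x) instead.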
        
         | srl wrote:
         | The most frequent application is machine learning. If my piece
         | of code is implementing some function that I want to minimize,
         | then taking the derivative (w.r.t. some vector of parameters)
         | tells me in which direction I need to change the parameters in
         | order to reduce the function: "gradient descent".
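         | 
         | In the smallest possible terms (plain Python, hand-written
         | gradient for a two-parameter function):
         | 
         |     def f(a, b):
         |         # something to minimize
         |         return (a - 3.0)**2 + (b + 1.0)**2
         | 
         |     def grad_f(a, b):
         |         return 2*(a - 3.0), 2*(b + 1.0)
         | 
         |     a, b = 0.0, 0.0
         |     ga, gb = grad_f(a, b)
         |     step = 0.1
         |     a2, b2 = a - step*ga, b - step*gb
         |     print(f(a, b), f(a2, b2))   # 10.0 then 6.4: smaller
         | 
         | AD's job is to produce grad_f for you from the code for f.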
        
           | jpollock wrote:
           | So, we've got "reality", as represented by data.
           | 
           | We've got a model, implemented in code. Since it's code, it
           | can be differentiated - not sure how that works with
           | branches, I guess that's the math. :) The model is generated
           | from a set of input parameters.
           | 
           | We've got an error function, representing the difference
           | between the model and reality.
           | 
           | If we differentiate the error function, we can choose which
           | set of parameter mutations are heading in the right direction
           | to then generate a new model? We check each close point and
           | find the max benefit?
           | 
           | However, if everything is taking the parameters as input, is
           | the derivative of the error function only generated once?
           | 
           | Is it saying that the derivative of the error function is
           | independent of the parameters, so it doesn't matter what the
           | model is, they all have the same error function, and that
           | error function can be found by generating a single model?
        
             | eigenspace wrote:
             | Dealing with branches is indeed an interesting problem.
             | Many AD systems can't accommodate branches. I think most
             | of the Julia ones do.
             | 
             | The gist of it is that branches don't actually require
             | _that_ much fanciness, but they can introduce
             | discontinuities, and AD systems will often do things like
             | happily give you a finite derivative right _at_ a
             | discontinuity, where a calculus student would normally
             | complain.
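             | 
             | For example, with a toy forward-mode dual number (plain
             | Python, nothing library-specific), the branch is taken
             | on the value and the derivative just follows along:
             | 
             |     class D:
             |         # value v and derivative d
             |         def __init__(self, v, d):
             |             self.v, self.d = v, d
             |         def __mul__(self, o):
             |             return D(self.v * o.v,
             |                      self.d * o.v + self.v * o.d)
             | 
             |     def f(x):
             |         # kink at x = 0
             |         if x.v < 0.0:
             |             return x * x
             |         return x
             | 
             |     print(f(D(-0.5, 1.0)).d)  # -1.0 (slope of x^2)
             |     print(f(D( 0.5, 1.0)).d)  #  1.0 (slope of x)
             |     print(f(D( 0.0, 1.0)).d)  #  1.0, right at the kink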
        
           | dnautics wrote:
           | Note that the example here is forward differentiation and
           | most machine learning these days uses backpropagation
           | (reverse mode).
        
             | oxinabox wrote:
             | Note that this applies to reverse mode also. Of the 7 ADs
             | demoed, only 1 was forward mode. The other 6 were reverse
             | mode. All gave the same result.
        
         | jmeyer2k wrote:
         | To expand, it tells you how you can change certain variables of
         | a function to change the output of a function.
         | 
         | In machine learning, you create a loss function that takes in
         | model parameters and returns how accurate the model is. Then,
         | you can optimize the function inputs for accuracy.
         | 
         | Automatic differentiation makes it much easier to code
         | different types of models.
        
       | freemint wrote:
       | The Julia ecosystem has a library that includes the
       | differentiation rules hinted at at the end.
       | 
       | https://github.com/JuliaDiff/ChainRules.jl is used by (almost
       | all) automatic differentiation engines and provides an extensive
       | list of such rules.
       | 
       | If the example had used sin/cos, the autodiff implementations in
       | Julia would have called the native cos/-sin rules and not
       | incurred such a "truncation error". However, the post
       | illustrates the idea in a good way.
       | 
       | Good post oxinabox
        
         | oxinabox wrote:
         | > https://github.com/JuliaDiff/ChainRules.jl is used by (almost
         | all) automatic differentiation engines and provides an
         | extensive list of such rules.
         | 
         | ChainRules is _going_ to be used by everything. Right now it
         | is used by 3 ADs, and a PR is open for a 4th, plus one thing
         | that hasn't been released yet.
         | 
         | For context, for anyone who doesn't know: I am the lead
         | maintainer of ChainRules.jl (and the author of this blog
         | post).
        
       | oxinabox wrote:
       | Anyone know why, of the 7 ADs tried (8 including the one
       | implemented at the start), there are three different answers?
       | 0.4999999963909431 vs 0.4999999963909432 vs 0.4999999963909433
       | 
       | I assume it is some kind of IEEE math thing, given that IEEE
       | allows `(a+b)+c != a + (b + c)`, but where is it occurring
       | exactly?
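       | 
       | (For anyone following along, the classic demonstration of that
       | non-associativity in plain Python:
       | 
       |     print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
       |     print(0.1 + (0.2 + 0.3))   # 0.6
       | 
       | so a last-digit difference like the one above would be
       | consistent with the ADs accumulating the same terms in
       | different orders.)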
        
       | joe_the_user wrote:
       | Article: "Bonus: will symbolic differentiation save me? Probably,
       | but probably not in an interesting way."
       | 
       | This article seems to misunderstand AD.
       | 
       | Automatic Differentiation doesn't incur _more_ truncation error
       | than symbolically differentiating the function and then
       | calculating the symbolic derivative's value. Automatic
       | differentiation basically follows the steps you'd follow for
       | symbolic differentiation but substitutes a value for the
       | symbolic expansion, and so avoids the explosion of symbols that
       | symbolic differentiation involves. But it's literally the same
       | sequence of calculations. The one way symbolic differentiation
       | might help is if you symbolically differentiated and then
       | rearranged terms to avoid truncation error, but that's a bit
       | different.
       | 
       | The article seems to calculate sin(x) in a lossy fashion and
       | then attribute the error to AD. That's not how it works.
       | 
       | [I can go through the steps if anyone's doubtful]
        
         | oxinabox wrote:
         | > Automatic Differentiation doesn't incur more truncation error
         | than symbolically differentiating the function and then
         | calculating the symbolic derivative's value.
         | 
         | Yes. The article even says that.
         | 
         | _"The AD system is (as you might have surmised) not incurring
         | truncation errors. It is giving us exactly what we asked for,
         | which is the derivative of my_sin. my_sin is a polynomial. The
         | derivative of the polynomial is: [article lists the hand-
         | derived derivative]"_
         | 
         | The reason symbolic might help is that symbolic AD is often
         | used in languages that don't represent things with numbers,
         | but with lazy expressions. I will clarify that. (Edit: I have
         | updated that bit and I think it is clearer. Thanks.)
         | 
         | > The article seems to calculate sin(x) in a lossy fashion and
         | then attribute the error to AD. That's not how it works.
         | 
         | The important bit is that the accuracy lost from an accurate
         | derivative of a lossy approximation is greater than the
         | accuracy lost from a lossy approximation to an accurate
         | derivative. Is there a bit I should clarify more about that? I
         | tried to emphasize that at the end, before the bit mentioning
         | symbolic.
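         | 
         | Spelled out with a stand-in my_sin (a truncated Taylor
         | polynomial; just an illustration, not the post's exact code):
         | 
         |     import math
         | 
         |     def my_sin(x):
         |         # lossy approximation of sin
         |         return (x - x**3/6 + x**5/120
         |                 - x**7/5040 + x**9/362880)
         | 
         |     def ad_of_my_sin(x):
         |         # what AD gives: the exact derivative of my_sin
         |         return (1 - x**2/2 + x**4/24
         |                 - x**6/720 + x**8/40320)
         | 
         |     x = 1.0
         |     print(abs(my_sin(x) - math.sin(x)))        # ~2.5e-8
         |     print(abs(ad_of_my_sin(x) - math.cos(x)))  # ~2.7e-7
         | 
         | The exact derivative of the lossy my_sin is about ten times
         | less accurate (as an approximation of cos) than my_sin itself
         | was (as an approximation of sin).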
        
           | joe_the_user wrote:
           | I would suggest saying front and center, more prominently,
           | that AD incurs the truncation errors that the equivalent
           | symbolic derivative would incur. Since you're talking about
           | AD, you should be giving people a good picture of what it is
           | (since it's not that commonly understood). 'Cause you're kind
           | of suggesting otherwise even if somewhere you're eventually
           | saying this.
           | 
           | I mean, Griewank saying _"Algorithmic differentiation does
           | not incur truncation error"_ does deserve the caveat "unless
           | the underlying system has truncation errors, which it often
           | does". But you can give that caveat without bending the stick
           | the other way.
           | 
           |  _The important bit is that the accuracy lost from an
           | accurate derivative of a lossy approximation is greater than
           | the accuracy lost from a lossy approximation to an accurate
           | derivative._
           | 
           | I understand you get inaccuracy from approximating the sin
           | but I don't know what you're contrasting this to. I think
           | real AD libraries deal with primitives and with combinations
           | of primitives, so for such a library the derivative of
           | sin(x) would be "symbolically" calculated as cos(x), since
           | sin is a primitive (essentially, all the "standard
           | functions" have to be primitives, and things are built up
           | from those. I doubt any library would apply AD to its
           | approximation of a given function).
           | 
           | I haven't used such libraries but I am in the process of
           | writing an AD subsystem for my own little language.
        
       | yudlejoza wrote:
       | AD is a symbolic technique.
       | 
       | If you use a symbolic technique over numeric data without knowing
       | what you're doing, I feel sorry for you.
       | 
       | (numeric: specifically the inclusion of floating-point.)
        
       ___________________________________________________________________
       (page generated 2021-02-08 23:00 UTC)