[HN Gopher] Automatic Differentiation Does Incur Truncation Erro...
___________________________________________________________________
Automatic Differentiation Does Incur Truncation Errors (Kinda)
Author : oxinabox
Score : 58 points
Date : 2021-02-08 18:45 UTC (4 hours ago)
(HTM) web link (www.oxinabox.net)
(TXT) w3m dump (www.oxinabox.net)
| not2b wrote:
| You could use a lazy representation of the Taylor series that
| will give you as many terms as you ask for and symbolically
| differentiate that. Then you'll get an accurate automatic
| differentiation. When you go to evaluate your approximation
| you'll get errors at that point, but you'll correctly get that
| the second derivative of sin(x) is exactly -sin(x).
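|
| A minimal sketch of that idea in plain Julia (the names sin_coeff,
| derivative, and evaluate are made up for illustration): represent a
| function by its Taylor-coefficient sequence, differentiate the
| sequence exactly, and only truncate when you finally evaluate.
|
|   # Represent f by the function n -> n-th Taylor coefficient of f
|   # about 0. For sin: 0, 1, 0, -1/3!, 0, 1/5!, ...
|   sin_coeff(n) =
|       isodd(n) ? (-1.0)^((n - 1) ÷ 2) / factorial(big(n)) : 0.0
|
|   # Differentiation acts exactly on the coefficients:
|   # if f(x) = sum a_n x^n then f'(x) = sum (n+1) a_(n+1) x^n.
|   derivative(coeff) = n -> (n + 1) * coeff(n + 1)
|
|   # Truncation only happens here, at evaluation time.
|   evaluate(coeff, x; nterms = 20) =
|       sum(coeff(n) * x^n for n in 0:nterms-1)
|
|   d2sin = derivative(derivative(sin_coeff))  # second derivative, exact
|   evaluate(d2sin, 1.0) ≈ -sin(1.0)           # true: d^2/dx^2 sin = -sin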
| oxinabox wrote:
| Correct, which is mentioned at the end of the post. I am not
| sure how well it generalizes, e.g. to multivariate functions
| that are defined as solutions of something like a differential
| equation.
|
| Maybe it does? I don't know.
| joe_the_user wrote:
| You could do this, but I'm pretty sure that real AD just deals
| with all the common primitives directly. The derivative of
| sin(x) is cos(x), etc. AD just postpones the chain-rule
| calculations that would be involved in something like
| sin(x + x^2)^2, so the chain rule happens at run time instead
| of a symbolic expansion at compile time.
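|
| For concreteness, a toy forward-mode AD in plain Julia (Dual and f
| are hypothetical names, not any particular library): each primitive
| carries its own derivative rule, and the chain rule is applied at
| run time as the dual numbers flow through the program.
|
|   # Dual number: a value and its derivative carried together.
|   struct Dual
|       val::Float64
|       der::Float64
|   end
|
|   # Primitives encode their derivative rules; the chain rule
|   # happens implicitly as Duals propagate at run time.
|   Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
|   Base.:*(a::Dual, b::Dual) =
|       Dual(a.val * b.val, a.der * b.val + a.val * b.der)
|   Base.:^(a::Dual, n::Integer) =
|       Dual(a.val^n, n * a.val^(n - 1) * a.der)
|   Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)  # sin' = cos
|
|   # d/dx sin(x + x^2)^2 at x = 0.3, computed by running the program:
|   f(x) = sin(x + x^2)^2
|   f(Dual(0.3, 1.0)).der   # matches 2*sin(x+x^2)*cos(x+x^2)*(1+2x)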
| londons_explore wrote:
| Interesting...
|
| But do these truncation errors cause any real world heartaches?
| Does anyone take the n-th derivative of an approximate function
| for large n?
| oxinabox wrote:
| Occasionally. There are a few places where you want arbitrarily
| high derivatives, but in general there are specialized
| solutions for those (though very few things implement them;
| e.g. JAX's jets and TaylorSeries.jl in Julia implement Taylor-
| mode AD, but those also need custom primitives).
|
| More of a problem is first derivatives for which you need very
| high accuracy. That doesn't occur in ML but does occur
| sometimes in scientific computing. (It doesn't occur in ML
| because we basically just shake the thing about a bit anyway.)
| stabbles wrote:
| An obvious example here is when you expand e^x with its Taylor
| series using `n` terms. The derivative of that length-n Taylor
| expansion has the same form but is only `n-1` terms long, so you
| lose precision.
|
| More generally, if you approximate a smooth function f with a
| truncated Taylor series of degree n around c, the error behaves
| as O((x-c)^(n+1)), and the error of the k-th derivative of that
| function, auto-diffed, will be of order O((x-c)^(n+1-k)).
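|
| A quick plain-Julia illustration of the e^x case (my_exp and
| my_exp_prime are made-up names for this sketch):
|
|   # n-term Taylor expansion of exp about 0: 1 + x + ... + x^(n-1)/(n-1)!
|   my_exp(x; n = 6) = sum(x^k / factorial(k) for k in 0:n-1)
|
|   # Differentiating it term by term (which is what AD effectively
|   # does) gives exactly the (n-1)-term expansion: one term is lost.
|   my_exp_prime(x; n = 6) = sum(x^k / factorial(k) for k in 0:n-2)
|
|   x = 0.5
|   abs(my_exp(x) - exp(x))        # ≈ 2e-5  value error
|   abs(my_exp_prime(x) - exp(x))  # ≈ 3e-4  derivative: one order worse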
| joe_the_user wrote:
| That's true but this sort of problem would normally not happen
| with a standard AD library or subsystem, whatever language
| you're using. Such systems don't approximate functions then
| differentiate their approximations.
| taeric wrote:
| I think the example is supposed to be indicative of larger
| functions that users define in the system. That is, the fact
| that this was shown with sin/cos was simply an example. Right?
| oxinabox wrote:
| Correct. sin and cos were merely an example.
|
| Every real AD system has a primitive for sin and cos, but
| they do not have every single operation you might ever
| implement (else what would be the point), and it is
| unlikely they will have every operation that involves an
| approximation. (Though there is a good chance they might
| have every named scalar operation involving an
| approximation that you use; it depends how fancy you are.)
|
| I mean, I maintain a library of hundreds of custom
| primitives. (One of the uses of this post was so I can
| point to it in response to people saying "why do you need
| all those primitives if AD can just decompose everything
| into + and *?".)
| stabbles wrote:
| Well, all functions that use finite-precision arithmetic deal
| with approximations. Errors of 1/2 ulp, as for the trig
| functions, are rather the exception. And machine learning is
| keen on using low-precision floating-point types, so errors
| are relatively large.
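|
| For a sense of scale, in Julia:
|
|   eps(Float16)   # ≈ 9.77e-4   (about 3 decimal digits)
|   eps(Float32)   # ≈ 1.19e-7   (about 7 decimal digits)
|   eps(Float64)   # ≈ 2.22e-16  (about 16 decimal digits)
|
|   # Even a correctly rounded sin in Float16 sits ~1e-4 away
|   # from the true value:
|   Float64(sin(Float16(1.0))) - sin(1.0)   # on the order of 1e-4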
| jpollock wrote:
| I'm seeing a lot of discussions around Automatic Differentiation,
| but I don't understand the purpose.
|
| Why would you take the derivative of a piece of code?
| sesuximo wrote:
| It's a tool to help solve inverse problems.
| freemint wrote:
| I have used Automatic Differentiation extensively to
| "optimize" simulations. This problem is known under many
| names, such as parameter estimation, local sensitivity
| analysis, optimal control...
|
| Having access to something that resembles the derivative of
| your code allows you to use optimization algorithms that
| converge faster (Newton's method or gradient descent vs.
| bisection methods), and you don't have to specify the
| derivatives by hand (which gets really bothersome when you
| have to specify a Hessian matrix: 2nd-derivative information
| including all the mixed terms).
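|
| A small sketch of what that buys you, using ForwardDiff.jl (the
| one-parameter "simulation" and the names simulate, residual,
| target and newton are made up for illustration): Newton's method
| on a calibration problem, with the derivative supplied by AD
| rather than written by hand.
|
|   using ForwardDiff   # forward-mode AD
|
|   # Toy "simulation": predicted measurement for one parameter p.
|   simulate(p) = p * exp(-0.5p)
|   target = 0.30                        # observed value to match
|   residual(p) = simulate(p) - target   # want residual(p) == 0
|
|   # Newton's method, with the derivative coming from AD.
|   function newton(f, p; iters = 10)
|       for _ in 1:iters
|           p -= f(p) / ForwardDiff.derivative(f, p)
|       end
|       return p
|   end
|
|   p_opt = newton(residual, 0.1)
|   simulate(p_opt) ≈ target   # true: converges in a few iterations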
| taeric wrote:
| Would be cool to see what your code looked like. I have the
| same question as the start of the thread.
|
| That is, I can see how the derivative can help optimise a
| simulation, but I'm not clear on how using an auto derivative
| really helps.
|
| Edit: note that I also see a difference between being able to
| get the derivative of a function, versus writing in a style
| where any function can be mapped to its derivative.
| oxinabox wrote:
| > That is, I can see how the derivative can help optimise a
| simulation, but I'm not clear on how using an auto
| derivative really helps.
|
| Finding a derivative by hand gets tiring fast, and is error
| prone. It used to be that a neural net paper doing anything
| novel spent 1-2 pages deriving the derivative of its cool
| new thing. Now its derivative isn't even mentioned.
| taeric wrote:
| But does that require auto derivative stuff? I'm assuming
| most could have simply mentioned the main equation, said
| "use Mathematica" and proceeded with the derivative?
| eigenspace wrote:
| You want to take the derivative of a piece of code so that you
| don't have to either hand-write the derivative yourself, or
| rely on finite differencing which can be slower in many
| circumstances and is almost universally less accurate.
| srl wrote:
| The most frequent application is machine learning. If my piece
| of code is implementing some function that I want to minimize,
| then taking the derivative (w.r.t. some vector of parameters)
| tells me in which direction I need to change the parameters in
| order to reduce the function: "gradient descent".
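|
| A minimal version of that loop in Julia, using ForwardDiff.jl for
| the gradient (the least-squares loss, the data, and the name
| descend are just stand-ins for illustration):
|
|   using ForwardDiff
|
|   # Loss over a parameter vector θ = [slope, intercept].
|   data_x = [0.0, 1.0, 2.0, 3.0]
|   data_y = [1.1, 2.9, 5.2, 6.8]    # roughly y = 2x + 1
|   loss(θ) = sum(abs2, θ[1] .* data_x .+ θ[2] .- data_y)
|
|   # Gradient descent, with the gradient supplied by AD.
|   function descend(loss, θ; lr = 0.01, steps = 500)
|       for _ in 1:steps
|           θ = θ .- lr .* ForwardDiff.gradient(loss, θ)  # step downhill
|       end
|       return θ
|   end
|
|   descend(loss, [0.0, 0.0])   # ≈ [1.94, 1.09]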
| jpollock wrote:
| So, we've got "reality", as represented by data.
|
| We've got a model, implemented in code. Since it's code, it can
| be differentiated - not sure how that works with branches, I
| guess that's the math. :) This is generated from a set of
| input parameters.
|
| We've got an error function, representing the difference
| between the model and reality.
|
| If we differentiate the error function, we can choose which
| set of parameter mutations are heading in the right direction
| to then generate a new model? We check each close point and
| find the max benefit?
|
| However, if everything is taking the parameters as input, is
| the derivative of the error function only generated once?
|
| Is it saying that the derivative of the error function is
| independent of the parameters, so it doesn't matter what the
| model is, they all have the same error function, and that
| error function can be found by generating a single model?
| eigenspace wrote:
| Dealing with branches is indeed an interesting problem.
| Many AD systems can't accommodate branches. I think most of
| the Julia ones do.
|
| The gist of it is that branches don't actually require
| _that_ much fanciness, but they can introduce
| discontinuities, and AD systems will often do things like
| happily give you a finite derivative right _at_ a
| discontinuity where a calculus student would normally
| complain.
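|
| A concrete case with ForwardDiff.jl (relu here is just a toy
| function chosen to show the behaviour):
|
|   using ForwardDiff
|
|   relu(x) = x < 0 ? zero(x) : x       # a branch, and a kink at x = 0
|
|   ForwardDiff.derivative(relu, -1.0)  # 0.0 - the x < 0 branch ran
|   ForwardDiff.derivative(relu,  1.0)  # 1.0 - the other branch ran
|   ForwardDiff.derivative(relu,  0.0)  # 1.0 - at the kink, where no
|                                       # classical derivative exists,
|                                       # AD reports the branch taken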
| dnautics wrote:
| Note that the example here is forward-mode differentiation,
| and most machine learning these days uses backpropagation
| (reverse mode)
| oxinabox wrote:
| Note that this applies to reverse mode also. Of the 7 ADs
| demoed, only 1 was forward mode. The other 6 were reverse
| mode. All gave the same result.
| jmeyer2k wrote:
| To expand, it tells you how you can change certain variables of
| a function to change the output of a function.
|
| In machine learning, you create a loss function that takes in
| model parameters and returns how accurate the model is. Then
| you can optimize the function inputs for accuracy.
|
| Automatic differentiation makes it much easier to code
| different types of models.
| freemint wrote:
| The Julia ecosystem has a library that includes the
| differentiation rules hinted at at the end of the post.
|
| https://github.com/JuliaDiff/ChainRules.jl is used by (almost
| all) automatic differentiation engines and provides an extensive
| list of such rules.
|
| If the example had used sin|cos, the auto-diff implementations
| in Julia would have called the native cos|-sin and not incurred
| such a "truncation error". However, the post illustrates the
| idea in a good way.
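|
| Roughly what such a rule looks like (a simplified sketch; the
| exact signatures and tangent types in ChainRulesCore have varied
| across versions, so treat this as illustrative):
|
|   using ChainRulesCore
|
|   # Forward mode: push the input tangent dx through using the
|   # true derivative, cos.
|   function ChainRulesCore.frule((_, dx), ::typeof(sin), x)
|       return sin(x), cos(x) * dx
|   end
|
|   # Reverse mode: return the value plus a pullback that maps
|   # dy to dy * cos(x).
|   function ChainRulesCore.rrule(::typeof(sin), x)
|       sin_pullback(dy) = (NoTangent(), dy * cos(x))
|       return sin(x), sin_pullback
|   end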
|
| Good post oxinabox
| oxinabox wrote:
| > https://github.com/JuliaDiff/ChainRules.jl is used by (almost
| all) automatic differentiation engines and provides an
| extensive list of such rules.
|
| ChainRules is _going_ to be used by everything. Right now it is
| used by 3 ADs and a PR is open for a 4th, plus one thing that
| hasn't been released yet.
|
| For context, for anyone who doesn't know: I am the lead
| maintainer of ChainRules.jl (and the author of this blog post).
| oxinabox wrote:
| Anyone know why, of the 7 ADs tried (8 including the one
| implemented at the start), there are three different answers?
| 0.4999999963909431 vs 0.4999999963909432 vs 0.4999999963909433
|
| I assume it is some kind of IEEE math thing, given that IEEE
| allows `(a+b)+c != a + (b + c)`, but where is it occurring
| exactly?
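|
| For reference, the non-associativity itself is easy to see in
| plain Julia; two ADs that accumulate the same chain-rule products
| in a different order can disagree in the last bit or two:
|
|   (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false
|   (0.1 + 0.2) + 0.3                        # 0.6000000000000001
|   0.1 + (0.2 + 0.3)                        # 0.6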
| joe_the_user wrote:
| Article: "Bonus: will symbolic differentiation save me? Probably,
| but probably not in an interesting way."
|
| This article seems to misunderstand AD.
|
| Automatic Differentiation doesn't incur _more_ truncation error
| than symbolically differentiating the function and then
| calculating the symbolic derivative's value. Automatic
| differentiation is basically following the steps you'd follow for
| symbolic differentiation but substituting a value for the
| symbolic expansion and so avoiding the explosion of symbols that
| symbolic differentiation involves. But it's literally the same
| sequence of calculations. The one way symbolic differentiation
| might help is if you symbolically differentiated and then
| rearranged terms to avoid truncation error but that's a bit
| different.
|
| The article seems to calculate sin(x) in a lossy fashion and then
| attribute the error to AD. That's not how it works.
|
| [I can go through the steps if anyone's doubtful]
| oxinabox wrote:
| > Automatic Differentiation doesn't incur more truncation error
| than symbolically differentiating the function and then
| calculating the symbolic derivative's value.
|
| Yes. The article even says that.
|
| _"The AD system is (as you might have surmised) not incurring
| truncation errors. It is giving us exactly what we asked for,
| which is the derivative of my_sin. my_sin is a polynomial. The
| derivative of the polynomial is: [article lists the hand-
| derived derivative]"_
|
| The reason symbolic might help is that symbolic AD is often used
| in languages that don't represent things with numbers, but with
| lazy expressions. I will clarify that. (Edit: I have updated
| that bit and I think it is clearer. Thanks.)
|
| > The article seems to calculate sin(x) in a lossy fashion and
| then attribute the error to AD. That's not how it works.
|
| The important bit is that the accuracy lost from an accurate
| derivative of a lossy approximation is greater than the accuracy
| lost from a lossy approximation of an accurate derivative. Is
| there a bit I should clarify more about that? I tried to
| emphasize that at the end, before the bit mentioning symbolic.
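|
| To put numbers on that (a plain-Julia sketch; my_sin, d_my_sin and
| my_cos are illustrative stand-ins, with a truncation length that
| may differ from the post's):
|
|   # A lossy approximation of sin: its Taylor series truncated at x^7.
|   my_sin(x) = x - x^3/6 + x^5/120 - x^7/5040
|
|   # (a) What AD gives: the exact derivative of the lossy approximation.
|   d_my_sin(x) = 1 - x^2/2 + x^4/24 - x^6/720
|
|   # (b) A lossy approximation of the exact derivative cos, built to
|   #     roughly the same accuracy as my_sin (what a sin -> cos
|   #     primitive effectively gets you).
|   my_cos(x) = 1 - x^2/2 + x^4/24 - x^6/720 + x^8/40320
|
|   x = 1.0
|   abs(my_sin(x)   - sin(x))   # ≈ 2.7e-6  error of the approximation
|   abs(d_my_sin(x) - cos(x))   # ≈ 2.5e-5  (a): an order of magnitude worse
|   abs(my_cos(x)   - cos(x))   # ≈ 2.7e-7  (b): as good as the original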
| joe_the_user wrote:
| I would suggest saying front and center, more prominently,
| that AD incurs the truncation errors that the equivalent
| symbolic derivative would incur. Since you're talking about
| AD, you should be giving people a good picture of what it is
| (since it's not that commonly understood). 'Cause you're kind
| of suggesting otherwise even if somewhere you're eventually
| saying this.
|
| I mean, Griewank saying _"Algorithmic differentiation does
| not incur truncation error"_ does deserve the caveat "unless
| the underlying system has truncation errors, which it often
| does". But you can give that caveat without bending the stick
| the other way.
|
| _The important bit is that the accuracy lost from an accurate
| derivative of a lossy approximation is greater than the accuracy
| lost from a lossy approximation of an accurate derivative._
|
| I understand you get inaccuracy from approximating the sin,
| but I don't know what you're contrasting this to. I think
| real AD libraries deal with primitives and with combinations
| of primitives, so for such a library the derivative of sin(x)
| would be "symbolically" calculated as cos(x), since sin is a
| primitive (essentially, all the "standard functions" have to
| be primitives, and you create things from those. I doubt any
| library would apply AD to its approximation of a given
| function).
|
| I haven't used such libraries but I am in the process of
| writing an AD subsystem for my own little language.
| yudlejoza wrote:
| AD is a symbolic technique.
|
| If you use a symbolic technique over numeric data without knowing
| what you're doing, I feel sorry for you.
|
| (numeric: specifically the inclusion of floating-point.)
___________________________________________________________________
(page generated 2021-02-08 23:00 UTC)